Mark a Little Spider lufy December 06, 2017 <p>A note on my first little spider, written with Scrapy.</p> <p>Motivation:</p> <p>Recently I have been busy testing VMServer 4.3 (a virtualization product of <a title="Puhua software" href="http://www.i-soft.com.cn" target="_blank">PUHUA</a>). The beta2 build was released just yesterday morning. VMServer 4.3 has several parts: clients for different OSes and several server components, published in nested directories on an Apache web site. As a tester, downloading them one by one is a pain, so this spider downloads the files from the Apache site recursively.</p> <p>Reference:</p> <p>Mostly learnt from the Scrapy tutorial: https://doc.scrapy.org/en/latest/intro/tutorial.html</p> <p>And other basic Python articles on the Internet.</p> <p>Create the Scrapy project and generate a basic spider (note that scrapy genspider takes a spider name and a domain, not a full URL):</p> <p><span style="font-family: 'courier new', courier;"># scrapy startproject test01</span></p> <p><span style="font-family: 'courier new', courier;"># scrapy genspider beta2 192.168.32.38</span></p> <p>Fill in the spider:</p> <p><span style="font-family: 'courier new', courier;">vim test01/test01/spiders/beta2.py</span></p> <p><span style="font-family: 'courier new', courier;"># -*- coding: utf-8 -*-</span><br /><span style="font-family: 'courier new', courier;"># Python 2 code: urllib.urlretrieve moved to urllib.request in Python 3.</span><br /><span style="font-family: 'courier new', courier;">import scrapy</span><br /><span style="font-family: 'courier new', courier;">import os</span><br /><span style="font-family: 'courier new', courier;">import urllib</span></p> <p><span style="font-family: 'courier new', courier;">down_path = '/tmp/'</span></p> <p><span style="font-family: 'courier new', courier;">class Beta2Spider(scrapy.Spider):</span><br /><span style="font-family: 'courier new', courier;">    name = 'beta2'</span><br /><span style="font-family: 'courier new', courier;">    allowed_domains = ['192.168.32.38']</span><br /><span style="font-family: 'courier new', courier;">    start_urls = [</span><br /><span style="font-family: 'courier new', courier;">        'http://192.168.32.38/test/',</span><br /><span style="font-family: 'courier new', courier;">    ]</span></p> <p><span style="font-family: 'courier new', courier;">    def parse(self, response):</span><br /><span style="font-family: 'courier new', courier;">        # Mirror the remote directory locally, e.g.</span><br /><span style="font-family: 'courier new', courier;">        # http://192.168.32.38/test/sub/ -&gt; /tmp/test/sub</span><br /><span style="font-family: 'courier new', courier;">        dir_name = '/'.join(response.url.split("/")[3:-1])</span><br /><span style="font-family: 'courier new', courier;">        dir_full_name = down_path + dir_name</span><br /><span style="font-family: 'courier new', courier;">        if not os.path.isdir(dir_full_name):</span><br /><span style="font-family: 'courier new', courier;">            # makedirs also creates the intermediate directories.</span><br /><span style="font-family: 'courier new', courier;">            os.makedirs(dir_full_name)</span></p> <p><span style="font-family: 'courier new', courier;">        # [1:] skips the first link of the Apache index page,</span><br /><span style="font-family: 'courier new', courier;">        # which is "Parent Directory".</span><br /><span style="font-family: 'courier new', courier;">        urls = response.xpath("//td/a/@href").extract()[1:]</span></p> <p><span style="font-family: 'courier new', courier;">        for url in urls:</span><br /><span style="font-family: 'courier new', courier;">            file_full_name = dir_full_name + "/" + url</span><br /><span style="font-family: 'courier new', courier;">            file_url = response.url + url</span><br /><span style="font-family: 'courier new', courier;">            if url[-1] == "/":</span><br /><span style="font-family: 'courier new', courier;">                # this is a directory, follow the url.</span><br /><span style="font-family: 'courier new', courier;">                yield response.follow(url, callback=self.parse)</span><br /><span style="font-family: 'courier new', courier;">            else:</span><br /><span style="font-family: 'courier new', courier;">                # this is a file, download it.</span><br /><span style="font-family: 'courier new', courier;">                urllib.urlretrieve(file_url, file_full_name)</span></p> <p><span style="font-family: 'courier new', courier;">############### codes end ########################</span></p> <p>Then run the spider, and all the files will be saved under the <span style="font-family: 'courier new', courier;">down_path</span> defined in the code, with the remote directory structure preserved.</p> <p><span style="font-family: 'courier new', courier;"># scrapy runspider beta2.py</span></p>
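<p>The path juggling inside <span style="font-family: 'courier new', courier;">parse()</span> can be checked on its own: splitting the URL on <span style="font-family: 'courier new', courier;">"/"</span> drops the scheme and host (indices 0-2) plus the empty string left by the trailing slash, leaving the directory path relative to the site root. A minimal sketch, independent of Scrapy (the sample URLs are just the ones from this post):</p>
<p><span style="font-family: 'courier new', courier;">down_path = '/tmp/'  # same constant as in the spider</span></p>

```python
down_path = '/tmp/'  # same constant as in the spider

def local_dir(url):
    # 'http://192.168.32.38/test/sub/'.split('/') gives
    # ['http:', '', '192.168.32.38', 'test', 'sub', ''];
    # [3:-1] keeps only the directory components.
    dir_name = '/'.join(url.split("/")[3:-1])
    return down_path + dir_name

print(local_dir('http://192.168.32.38/test/'))      # /tmp/test
print(local_dir('http://192.168.32.38/test/sub/'))  # /tmp/test/sub
```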
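<p>Why the <span style="font-family: 'courier new', courier;">[1:]</span> slice works can also be sketched without a live server. Below is a toy fragment of an Apache directory listing (trimmed to the parts the spider's XPath touches; a real listing also has header rows and icon columns), parsed here with the standard library instead of Scrapy's <span style="font-family: 'courier new', courier;">response.xpath</span> just to keep the sketch self-contained:</p>

```python
import xml.etree.ElementTree as ET

# Toy Apache index table; the real listing is richer, but the
# spider only ever looks at the <a href> values inside <td> cells.
listing = """<table>
<tr><td><a href="/">Parent Directory</a></td></tr>
<tr><td><a href="clients/">clients/</a></td></tr>
<tr><td><a href="server.rpm">server.rpm</a></td></tr>
</table>"""

hrefs = [a.get('href') for a in ET.fromstring(listing).iter('a')]
# Mirrors: response.xpath("//td/a/@href").extract()[1:]
# Dropping the first link ("Parent Directory") keeps the crawl
# from walking back up the tree.
urls = hrefs[1:]
print(urls)  # ['clients/', 'server.rpm']
```

<p>Note how the trailing slash then drives the directory-vs-file branch in <span style="font-family: 'courier new', courier;">parse()</span>: <span style="font-family: 'courier new', courier;">'clients/'</span> is followed, <span style="font-family: 'courier new', courier;">'server.rpm'</span> is downloaded.</p>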