Mark a Little Spider lufy December 06, 2017 <p>A note on my first little spider, written with Scrapy.</p> <p>Motivation:</p> <p>Recently I have been busy testing VMServer 4.3 (a virtualization product of <a title="Puhua software" href="http://www.i-soft.com.cn" target="_blank">PUHUA</a>). The beta2 build was released just yesterday morning. VMServer 4.3 has several parts: clients for different OSes and several server components, published in nested directories on an Apache web site. As a tester, downloading them one by one is a pain, so this spider downloads the files from the Apache site recursively.</p> <p>Reference:</p> <p>Mostly learnt from the Scrapy tutorial: https://doc.scrapy.org/en/latest/intro/tutorial.html</p> <p>And other basic Python articles on the Internet.</p> <p>Create the Scrapy project and generate a basic spider (note that scrapy genspider takes a spider name and a domain, not a full URL):</p> <p><span style="font-family: 'courier new', courier;"># scrapy startproject test01</span></p> <p><span style="font-family: 'courier new', courier;"># scrapy genspider beta2 192.168.32.38</span></p> <p>Fill in the spider:</p> <p><span style="font-family: 'courier new', courier;">vim test01/test01/spiders/beta2.py</span></p> <p><span style="font-family: 'courier new', courier;"># -*- coding: utf-8 -*-</span><br /><span style="font-family: 'courier new', courier;"># Python 2 code: urllib.urlretrieve moved to urllib.request in Python 3.</span><br /><span style="font-family: 'courier new', courier;">import scrapy</span><br /><span style="font-family: 'courier new', courier;">import os</span><br /><span style="font-family: 'courier new', courier;">import urllib</span></p> <p><span style="font-family: 'courier new', courier;">down_path = '/tmp/'</span></p> <p><span style="font-family: 'courier new', courier;">class Beta2Spider(scrapy.Spider):</span><br /><span style="font-family: 'courier new', courier;">    name = 'beta2'</span><br /><span style="font-family: 'courier new', courier;">    allowed_domains = ['192.168.32.38']</span><br /><span style="font-family: 'courier new', courier;">    start_urls = [</span><br /><span style="font-family: 'courier new', courier;">        'http://192.168.32.38/test/',</span><br /><span style="font-family: 'courier new', courier;">    ]</span></p> <p><span style="font-family: 'courier new', courier;">    def parse(self, response):</span><br /><span style="font-family: 'courier new', courier;">        # Mirror the remote directory locally, e.g.</span><br /><span style="font-family: 'courier new', courier;">        # http://192.168.32.38/test/sub/ -&gt; /tmp/test/sub</span><br /><span style="font-family: 'courier new', courier;">        dir_name = '/'.join(response.url.split("/")[3:-1])</span><br /><span style="font-family: 'courier new', courier;">        dir_full_name = down_path + dir_name</span><br /><span style="font-family: 'courier new', courier;">        if not os.path.isdir(dir_full_name):</span><br /><span style="font-family: 'courier new', courier;">            # makedirs also creates the intermediate directories.</span><br /><span style="font-family: 'courier new', courier;">            os.makedirs(dir_full_name)</span></p> <p><span style="font-family: 'courier new', courier;">        # [1:] skips the first link of the Apache index page,</span><br /><span style="font-family: 'courier new', courier;">        # which is "Parent Directory".</span><br /><span style="font-family: 'courier new', courier;">        urls = response.xpath("//td/a/@href").extract()[1:]</span></p> <p><span style="font-family: 'courier new', courier;">        for url in urls:</span><br /><span style="font-family: 'courier new', courier;">            file_full_name = dir_full_name + "/" + url</span><br /><span style="font-family: 'courier new', courier;">            file_url = response.url + url</span><br /><span style="font-family: 'courier new', courier;">            if url[-1] == "/":</span><br /><span style="font-family: 'courier new', courier;">                # this is a directory, follow the url.</span><br /><span style="font-family: 'courier new', courier;">                yield response.follow(url, callback=self.parse)</span><br /><span style="font-family: 'courier new', courier;">            else:</span><br /><span style="font-family: 'courier new', courier;">                # this is a file, download it.</span><br /><span style="font-family: 'courier new', courier;">                urllib.urlretrieve(file_url, file_full_name)</span></p> <p><span style="font-family: 'courier new', courier;">############### codes end ########################</span></p> <p>Then run the spider, and all the files will be saved under the <span style="font-family: 'courier new', courier;">down_path</span> defined in the code, with the remote directory structure preserved.</p> <p><span style="font-family: 'courier new', courier;"># scrapy runspider beta2.py</span></p>
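<p>The path juggling inside <span style="font-family: 'courier new', courier;">parse()</span> can be checked on its own: splitting the URL on <span style="font-family: 'courier new', courier;">"/"</span> drops the scheme and host (indices 0-2) plus the empty string left by the trailing slash, leaving the directory path relative to the site root. A minimal sketch, independent of Scrapy (the sample URLs are just the ones from this post):</p>
<p><span style="font-family: 'courier new', courier;">down_path = '/tmp/'  # same constant as in the spider</span></p>

```python
down_path = '/tmp/'  # same constant as in the spider

def local_dir(url):
    # 'http://192.168.32.38/test/sub/'.split('/') gives
    # ['http:', '', '192.168.32.38', 'test', 'sub', ''];
    # [3:-1] keeps only the directory components.
    dir_name = '/'.join(url.split("/")[3:-1])
    return down_path + dir_name

print(local_dir('http://192.168.32.38/test/'))      # /tmp/test
print(local_dir('http://192.168.32.38/test/sub/'))  # /tmp/test/sub
```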
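<p>Why the <span style="font-family: 'courier new', courier;">[1:]</span> slice works can also be sketched without a live server. Below is a toy fragment of an Apache directory listing (trimmed to the parts the spider's XPath touches; a real listing also has header rows and icon columns), parsed here with the standard library instead of Scrapy's <span style="font-family: 'courier new', courier;">response.xpath</span> just to keep the sketch self-contained:</p>

```python
import xml.etree.ElementTree as ET

# Toy Apache index table; the real listing is richer, but the
# spider only ever looks at the <a href> values inside <td> cells.
listing = """<table>
<tr><td><a href="/">Parent Directory</a></td></tr>
<tr><td><a href="clients/">clients/</a></td></tr>
<tr><td><a href="server.rpm">server.rpm</a></td></tr>
</table>"""

hrefs = [a.get('href') for a in ET.fromstring(listing).iter('a')]
# Mirrors: response.xpath("//td/a/@href").extract()[1:]
# Dropping the first link ("Parent Directory") keeps the crawl
# from walking back up the tree.
urls = hrefs[1:]
print(urls)  # ['clients/', 'server.rpm']
```

<p>Note how the trailing slash then drives the directory-vs-file branch in <span style="font-family: 'courier new', courier;">parse()</span>: <span style="font-family: 'courier new', courier;">'clients/'</span> is followed, <span style="font-family: 'courier new', courier;">'server.rpm'</span> is downloaded.</p>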