Mark a Little Spider
A note on my first little spider, written with Scrapy.
Motivation:
Recently I have been busy testing VMServer 4.3 (a virtualization product from PUHUA); the beta2 build was released just yesterday morning. VMServer 4.3 has several parts: clients for different OSes and several server packages, and they are published in a nested directory tree on an Apache web site. As a tester, downloading them one by one is painful, so this spider downloads the files from the Apache site recursively.
Reference:
Mostly learnt from the Scrapy tutorial: https://doc.scrapy.org/en/latest/intro/tutorial.html
and other basic Python articles on the Internet.
Create the Scrapy project and generate a basic spider:
# scrapy startproject test01
# scrapy genspider beta2 http://192.168.32.38/test/
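For reference, genspider produces a skeleton roughly like this (the exact template varies with the Scrapy version); the parse method is the part we fill in next:

# -*- coding: utf-8 -*-
import scrapy

class Beta2Spider(scrapy.Spider):
    name = 'beta2'
    allowed_domains = ['192.168.32.38']
    start_urls = ['http://192.168.32.38/test/']

    def parse(self, response):
        pass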
Filling in the spider:
vim test01/spiders/beta2.py
# -*- coding: utf-8 -*-
import scrapy
import os
import urllib.request

down_path = '/tmp/'

class Beta2Spider(scrapy.Spider):
    name = 'beta2'
    allowed_domains = ['192.168.32.38']
    start_urls = [
        'http://192.168.32.38/test/',
    ]

    def parse(self, response):
        # Map the URL path to a local directory under down_path,
        # e.g. http://192.168.32.38/test/ -> /tmp/test
        dir_name = '/'.join(response.url.split("/")[3:-1])
        dir_full_name = down_path + dir_name
        os.makedirs(dir_full_name, exist_ok=True)
        # [1:] skips the first link, Apache's "Parent Directory" entry.
        urls = response.xpath("//td/a/@href").extract()[1:]
        for url in urls:
            file_full_name = dir_full_name + "/" + url
            file_url = response.url + url
            if url[-1] == "/":
                # This is a directory: follow the link and parse it too.
                yield response.follow(url, callback=self.parse)
            else:
                # This is a file: download it.
                urllib.request.urlretrieve(file_url, file_full_name)
############### code ends ########################
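To make the directory mapping concrete, here is how the slicing in parse works for a hypothetical subdirectory URL (a quick interpreter session, nothing project-specific):

>>> url = 'http://192.168.32.38/test/sub/'
>>> url.split("/")
['http:', '', '192.168.32.38', 'test', 'sub', '']
>>> '/'.join(url.split("/")[3:-1])
'test/sub'

So with down_path = '/tmp/', the files from that page land in /tmp/test/sub, mirroring the server layout.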
Then run the spider; it will fetch all the files into down_path defined in the code, keeping the directory structure intact.
# scrapy runspider beta2.py
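As a side note, the urlretrieve call blocks the whole spider while each file downloads. A variant I did not use here (just a sketch, assuming Scrapy >= 1.7 for cb_kwargs; Beta2AltSpider and save_file are names I made up) would let Scrapy's own downloader fetch the file bodies, so transfers obey its concurrency settings:

import os
import scrapy

down_path = '/tmp/'

class Beta2AltSpider(scrapy.Spider):
    name = 'beta2_alt'
    allowed_domains = ['192.168.32.38']
    start_urls = ['http://192.168.32.38/test/']

    def parse(self, response):
        # Same URL-to-directory mapping as the spider above.
        dir_full_name = down_path + '/'.join(response.url.split("/")[3:-1])
        os.makedirs(dir_full_name, exist_ok=True)
        # Skip Apache's "Parent Directory" link, then walk the listing.
        for url in response.xpath("//td/a/@href").extract()[1:]:
            if url.endswith("/"):
                yield response.follow(url, callback=self.parse)
            else:
                # Let Scrapy fetch the file; write it out in the callback.
                yield response.follow(url, callback=self.save_file,
                                      cb_kwargs={'path': dir_full_name + "/" + url})

    def save_file(self, response, path):
        with open(path, 'wb') as f:
            f.write(response.body)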