Mark a Little Spider
A note on my first little spider, written with Scrapy.
Motivation:
Recently I have been busy testing VMServer 4.3 (a virtualization product from PUHUA); the beta2 build was released just yesterday morning. VMServer 4.3 has several parts: clients for different OSes and several server packages, and they are published in a nested directory tree on an Apache web site. As a tester, downloading them one by one is painful, so this spider downloads the files from the Apache site recursively.
Reference:
Mostly learnt from the Scrapy tutorial: https://doc.scrapy.org/en/latest/intro/tutorial.html
and other basic Python articles on the Internet.
Create the Scrapy project and generate a basic spider:
# scrapy startproject test01
# scrapy genspider beta2 http://192.168.32.38/test/
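For reference, genspider produces a skeleton roughly like this (the exact template varies with the Scrapy version); the parse method is the part we fill in next:

# -*- coding: utf-8 -*-
import scrapy

class Beta2Spider(scrapy.Spider):
    name = 'beta2'
    allowed_domains = ['192.168.32.38']
    start_urls = ['http://192.168.32.38/test/']

    def parse(self, response):
        pass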
Filling in the spider:
vim test01/spiders/beta2.py
# -*- coding: utf-8 -*-
import scrapy
import os
import urllib.request

down_path = '/tmp/'

class Beta2Spider(scrapy.Spider):
    name = 'beta2'
    allowed_domains = ['192.168.32.38']
    start_urls = [
        'http://192.168.32.38/test/',
    ]

    def parse(self, response):
        # Map the URL path to a local directory under down_path,
        # e.g. http://192.168.32.38/test/ -> /tmp/test
        dir_name = '/'.join(response.url.split("/")[3:-1])
        dir_full_name = down_path + dir_name
        os.makedirs(dir_full_name, exist_ok=True)
        # [1:] skips the first link, Apache's "Parent Directory" entry.
        urls = response.xpath("//td/a/@href").extract()[1:]
        for url in urls:
            file_full_name = dir_full_name + "/" + url
            file_url = response.url + url
            if url[-1] == "/":
                # This is a directory: follow the link and parse it too.
                yield response.follow(url, callback=self.parse)
            else:
                # This is a file: download it.
                urllib.request.urlretrieve(file_url, file_full_name)
############### code ends ########################
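To make the directory mapping concrete, here is how the slicing in parse works for a hypothetical subdirectory URL (a quick interpreter session, nothing project-specific):

>>> url = 'http://192.168.32.38/test/sub/'
>>> url.split("/")
['http:', '', '192.168.32.38', 'test', 'sub', '']
>>> '/'.join(url.split("/")[3:-1])
'test/sub'

So with down_path = '/tmp/', the files from that page land in /tmp/test/sub, mirroring the server layout.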
Then run the spider; it will fetch all the files into down_path defined in the code, keeping the directory structure intact.
# scrapy runspider beta2.py
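As a side note, the urlretrieve call blocks the whole spider while each file downloads. A variant I did not use here (just a sketch, assuming Scrapy >= 1.7 for cb_kwargs; Beta2AltSpider and save_file are names I made up) would let Scrapy's own downloader fetch the file bodies, so transfers obey its concurrency settings:

import os
import scrapy

down_path = '/tmp/'

class Beta2AltSpider(scrapy.Spider):
    name = 'beta2_alt'
    allowed_domains = ['192.168.32.38']
    start_urls = ['http://192.168.32.38/test/']

    def parse(self, response):
        # Same URL-to-directory mapping as the spider above.
        dir_full_name = down_path + '/'.join(response.url.split("/")[3:-1])
        os.makedirs(dir_full_name, exist_ok=True)
        # Skip Apache's "Parent Directory" link, then walk the listing.
        for url in response.xpath("//td/a/@href").extract()[1:]:
            if url.endswith("/"):
                yield response.follow(url, callback=self.parse)
            else:
                # Let Scrapy fetch the file; write it out in the callback.
                yield response.follow(url, callback=self.save_file,
                                      cb_kwargs={'path': dir_full_name + "/" + url})

    def save_file(self, response, path):
        with open(path, 'wb') as f:
            f.write(response.body)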