Mark a Little Spider

lufy
December 06, 2017

Marking down my first little spider, written with Scrapy.


Motivation:


Recently I've been busy testing VMServer 4.3 (a virtualization product from PUHUA). The beta2 build was released just yesterday morning. VMServer 4.3 has several parts: clients for different OSes plus several server components, and they are published in nested directories on an Apache web site. As a tester, downloading them one by one is painful. So this spider downloads the files from the Apache site recursively.


Reference:


Mostly learnt from the Scrapy documentation: https://doc.scrapy.org/en/latest/intro/tutorial.html


And various basic Python articles on the Internet.


Create the Scrapy project and generate a basic spider:


# scrapy startproject test01


# scrapy genspider beta2 192.168.32.38
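

For reference, genspider's default "basic" template produces a skeleton roughly like this (the exact output varies with the Scrapy version); the start_urls entry then gets edited by hand to point at the /test/ directory:

# -*- coding: utf-8 -*-
import scrapy


class Beta2Spider(scrapy.Spider):
    name = 'beta2'
    allowed_domains = ['192.168.32.38']
    start_urls = ['http://192.168.32.38/']

    def parse(self, response):
        pass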


Filling in the spider:


vim test01/spiders/beta2.py


# -*- coding: utf-8 -*-
import scrapy
import os
import urllib  # Python 2; on Python 3 use urllib.request.urlretrieve instead


down_path = '/tmp/'


class Beta2Spider(scrapy.Spider):
    name = 'beta2'
    allowed_domains = ['192.168.32.38']
    start_urls = [
        'http://192.168.32.38/test/',
    ]

    def parse(self, response):
        # Strip the "http://host" prefix from the URL and mirror the
        # remaining path under down_path.
        dir_name = '/'.join(response.url.split("/")[3:-1])
        dir_full_name = down_path + dir_name
        if not os.path.isdir(dir_full_name):
            os.makedirs(dir_full_name)

        # On an Apache index page every link sits in a <td>; the first
        # one points back to the parent directory, so skip it with [1:].
        urls = response.xpath("//td/a/@href").extract()[1:]

        for url in urls:
            file_full_name = dir_full_name + "/" + url
            file_url = response.url + url
            if url[-1] == "/":
                # this is a directory, follow it and parse recursively.
                yield response.follow(url, callback=self.parse)
            else:
                # this is a file, download it.
                urllib.urlretrieve(file_url, file_full_name)


############### code ends ########################
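

To see what the xpath in parse() picks up, here is a small standalone sketch. The HTML below is an assumed, simplified Apache directory listing (real autoindex pages carry extra columns and vary with server settings):

from scrapy import Selector

# Assumed, simplified Apache directory-listing HTML; a real index
# page has more columns (icon, date, size) in each row.
body = """
<table>
  <tr><td><a href="/test/">Parent Directory</a></td></tr>
  <tr><td><a href="client/">client/</a></td></tr>
  <tr><td><a href="server.iso">server.iso</a></td></tr>
</table>
"""

links = Selector(text=body).xpath("//td/a/@href").extract()
print(links[1:])  # only 'client/' and 'server.iso' remain

The first link in an Apache listing points back to the parent directory, which is why the spider slices the list with [1:] before following anything.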


Then run the spider, and you'll get all the files under down_path as defined in the code, with the directory structure preserved. Since the spider lives inside a project, running "scrapy crawl beta2" from the project root works as well.


# scrapy runspider beta2.py
