This post walks through implementing file downloads with the Python crawler framework Scrapy. It is shared for your reference; the details follow.
When we write an ordinary script, we take a file's download URL from a site, fetch it, and write the data to disk ourselves. That code has to be written out by hand every time and is not very reusable. To avoid reinventing the wheel, Scrapy ships with a smooth file-download mechanism; a few short pieces of code are enough to use it.
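For contrast, the hand-rolled version mentioned above usually looks something like the sketch below (a minimal standard-library example; the URL and file name are placeholders, not from the original post):

# download_one.py -- the do-it-yourself approach: fetch one URL and write the bytes to disk
from os.path import basename
from urllib.request import urlopen

url = "https://example.com/path/to/some_script.py"  # placeholder URL

with urlopen(url) as resp, open(basename(url), "wb") as f:
    f.write(resp.read())  # dump the response body straight into a local file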
mat.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor  # note: the module name is plural ("linkextractors")
from weidashang.items import matplotlib

class MatSpider(scrapy.Spider):
    name = "mat"
    allowed_domains = ["matplotlib.org"]
    start_urls = ['https://matplotlib.org/examples']

    def parse(self, response):
        # Grab the link to each script's page so it can be visited and the file downloaded
        link_extractor = LinkExtractor(restrict_css='div.toctree-wrapper.compound li.toctree-l2')
        for link in link_extractor.extract_links(response):
            yield scrapy.Request(url=link.url, callback=self.example)

    def example(self, response):
        # On each script's page, grab the "source code" button's href and join it with the base URL to form the full file URL
        href = response.css('a.reference.external::attr(href)').extract_first()
        url = response.urljoin(href)
        example = matplotlib()
        example['file_urls'] = [url]
        return example
pipelines.py

from os.path import basename, dirname, join
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline

class MyFilePlipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        # Keep only the last directory of the URL path plus the file name,
        # e.g. "/a/b/c.py" -> "b/c.py"
        path = urlparse(request.url).path
        return join(basename(dirname(path)), basename(path))
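By default FilesPipeline stores each download under a full/ directory with a checksum-based file name; overriding file_path as above keeps the script's own directory and file name instead. A standalone sketch of the same path arithmetic (the URL here is made up purely for illustration):

from os.path import basename, dirname, join
from urllib.parse import urlparse

path = urlparse("https://example.com/examples/animation/animate_decay.py").path
print(join(basename(dirname(path)), basename(path)))  # -> animation/animate_decay.py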
settings.py

ITEM_PIPELINES = {
    'weidashang.pipelines.MyFilePlipeline': 1,
}
FILES_STORE = 'examples_src'
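In ITEM_PIPELINES the number after the class path is the pipeline's order (lower numbers run earlier), and FILES_STORE is the directory the downloads are saved into; with the file_path override above, each script lands under examples_src/ in a subfolder derived from the last directory of its URL.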
items.py

from scrapy import Item, Field

class matplotlib(Item):
    file_urls = Field()
    files = Field()
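file_urls and files are the two field names FilesPipeline expects on an item: the spider fills file_urls with the URLs to fetch, and after the download the pipeline writes the results (stored path, checksum, original URL) into files.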
run.py

from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'mat', '-o', 'example.json'])
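Running python run.py from the project root is equivalent to typing scrapy crawl mat -o example.json on the command line: it starts the mat spider, exports the scraped items to example.json, and the downloaded source files appear under the FILES_STORE directory.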
I hope this article is helpful to readers working on Python programming.
Original post: https://www.cnblogs.com/lei0213/p/8098180.html