Scraping Images from 唯一图库 (mmonly.cc)

Project 1: 唯一图库

Project overview: scrape images from 唯一图库 (mmonly.cc) as needed.

Technologies used: Scrapy, urllib, string handling, and percent (%) string formatting.
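As a minimal sketch of the string handling and percent formatting listed above, the snippet below builds a local file path from an image title. The helper names `safe_file_name` and `local_path`, the sample title, and the set of characters being replaced are assumptions made for illustration; they are not part of the original project.

```python
import re

# Characters that Windows does not allow in file names (assumed rule for this sketch).
INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_file_name(title, extension='jpg'):
    """Return a file name built from an image title with percent formatting."""
    cleaned = INVALID_CHARS.sub('_', title.strip())
    return '%s.%s' % (cleaned, extension)


def local_path(directory, title):
    """Join a Windows-style directory and the generated file name."""
    return '%s\\%s' % (directory.rstrip('\\'), safe_file_name(title))


print(local_path('C:\\img', ' sample: title '))
```

Replacing invalid characters before formatting the path keeps a title containing `:` or `/` from producing a path that cannot be written.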

The spider fills in the item defined in items.py, and Scrapy hands each item to the pipeline enabled in settings.py for processing.
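The flow described above can be sketched in plain Python. This is a simplified model, not Scrapy's actual engine: the classes `RecordPipeline` and `CountPipeline` and the function `run_pipelines` are hypothetical. It relies only on two documented behaviours: pipelines run in ascending order of their `ITEM_PIPELINES` number, and each `process_item` returns the item for the next stage.

```python
# Simplified model of the spider -> item -> pipeline flow (not Scrapy's real engine).


class RecordPipeline(object):
    """Hypothetical pipeline that records the titles it sees."""

    def __init__(self):
        self.seen = []

    def process_item(self, item, spider):
        self.seen.extend(item['names'])
        return item  # returning the item passes it on to the next pipeline


class CountPipeline(object):
    """Hypothetical pipeline that adds a derived field."""

    def process_item(self, item, spider):
        item['count'] = len(item['imgs'])
        return item


def run_pipelines(item, pipelines, spider=None):
    """Call each pipeline in ascending priority order.

    `pipelines` maps a pipeline object to its priority number,
    mirroring the shape of the ITEM_PIPELINES setting.
    """
    ordered = sorted(pipelines.items(), key=lambda pair: pair[1])
    for pipeline, _priority in ordered:
        item = pipeline.process_item(item, spider)
    return item


recorder = RecordPipeline()
result = run_pipelines(
    {'names': ['a'], 'imgs': ['/uploads/a.jpg']},
    {CountPipeline(): 200, recorder: 100},
)
print(result['count'], recorder.seen)
```

In the real project there is a single pipeline, so the priority value only matters once more pipelines are added.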

The relevant files are recorded below:

Spider

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# s1.py
import scrapy
from spider1 import items


class LL(scrapy.Spider):
    name = 'xx'
    start_urls = ['http://www.mmonly.cc/sgtp/']

    def parse(self, response):
        # response.xpath() is used in place of the deprecated HtmlXPathSelector
        item = items.Spider1Item()
        # Image titles (alt) and image addresses (src) inside the thumbnail blocks
        item['names'] = response.xpath('//div[@class="item_t"]//img/@alt').extract()
        item['imgs'] = response.xpath('//div[@class="item_t"]//img/@src').extract()
        yield item

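To make clear what the two XPath queries in the spider select, here is a small sketch that runs an equivalent query on made-up markup. It uses the standard library's `xml.etree.ElementTree` rather than Scrapy's selectors, so it supports only a subset of XPath and needs well-formed input; the sample HTML and the function name `extract_pairs` are assumptions, and the real page structure may differ.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup; the real page may look different.
SAMPLE = """
<html><body>
  <div class="item_t"><a href="/a.html"><img alt="first" src="/uploads/1.jpg"/></a></div>
  <div class="item_t"><a href="/b.html"><img alt="second" src="http://example.com/2.jpg"/></a></div>
  <div class="other"><img alt="ignored" src="/uploads/3.jpg"/></div>
</body></html>
"""


def extract_pairs(markup):
    """Return (alt, src) pairs for every <img> inside a div with class "item_t"."""
    root = ET.fromstring(markup.strip())
    pairs = []
    for img in root.findall(".//div[@class='item_t']//img"):
        # Reading both attributes from the same node keeps title and address aligned.
        pairs.append((img.get('alt'), img.get('src')))
    return pairs


print(extract_pairs(SAMPLE))
```

Reading `alt` and `src` from the same node is one way to avoid the two lists drifting out of step when an image lacks one of the attributes.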

settings.py

# -*- coding: utf-8 -*-
# settings.py
BOT_NAME = 'spider1'

SPIDER_MODULES = ['spider1.spiders']
NEWSPIDER_MODULE = 'spider1.spiders'
ROBOTSTXT_OBEY = True
ITEM_PIPELINES = {
    'spider1.pipelines.Spider1Pipeline': 100,
}


items.py

# -*- coding: utf-8 -*-
# items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class Spider1Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    names = scrapy.Field()
    imgs = scrapy.Field()


pipelines.py

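# -*- coding: utf-8 -*-
# pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

try:
    from urllib.request import urlretrieve  # Python 3
except ImportError:
    from urllib import urlretrieve  # Python 2


class Spider1Pipeline(object):
    def process_item(self, item, spider):
        names = item['names']
        imgs = item['imgs']

        # Print the first pair as a quick check, but only if something was scraped
        if names and imgs:
            print('%s %s' % (names[0], imgs[0]))

        # zip() pairs each title with its address and stops at the shorter list
        for name, img in zip(names, imgs):
            if name and img:
                img_name = name + '.jpg'
                if img.startswith('h'):
                    # Already an absolute URL (http:// or https://)
                    net_url = img
                else:
                    # Relative path: prefix the crawled site's host
                    # (assumed to be the host that serves the image)
                    net_url = 'http://www.mmonly.cc' + img
                # The target directory must already exist
                local_file = 'C:\\Users\\wenxianfeng\\Desktop\\img\\%s' % img_name
                urlretrieve(net_url, local_file)

        return item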

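The pipeline decides between absolute and relative image addresses by checking the first character. As a hedged alternative, the standard library's `urljoin` can resolve any `src` value against the page it came from. The function name `resolve_image_url` and the sample paths are assumptions for illustration; only the start URL comes from the spider.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2

PAGE_URL = 'http://www.mmonly.cc/sgtp/'  # the spider's start URL


def resolve_image_url(src, page_url=PAGE_URL):
    """Turn an img src (absolute, root-relative, relative, or protocol-relative)
    into a full URL, resolved against the page it was found on."""
    return urljoin(page_url, src.strip())


for src in ('/uploads/a.jpg', 'b.jpg', '//t1.example.com/c.jpg', 'http://t1.example.com/d.jpg'):
    print(resolve_image_url(src))
```

This also covers protocol-relative addresses such as `//host/path`, which do not start with `h` and would otherwise be prefixed incorrectly.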

If you repost this article, please keep the source: https://bianchenghao.cn/55619.html
