Python crawlers -- (4. The Scrapy framework: official docs and examples)
Official docs: http://doc.scrapy.org/en/latest/
GitHub examples: https://github.com/search?utf8=%E2%9C%93&q=scrapy
I'll organize the rest later... off to get food... -- 2014-08-20 19:29:20
Ugh... the Sogou input method just broke, so I logged out and back in, and everything I'd written was gone. Apparently the draft box isn't all that reliable!!!
Let me write it up again.
-- 2014-08-21 04:02:37
(I) Basics -- scrapy.spider.Spider
(1) Using the interactive shell
dizzy@dizzy-pc:~$ scrapy shell "http://www.baidu.com/"
2014-08-21 04:09:11+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2014-08-21 04:09:11+0800 [scrapy] INFO: Optional features available: ssl, http11, django
2014-08-21 04:09:11+0800 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled item pipelines:
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6081
2014-08-21 04:09:11+0800 [default] INFO: Spider opened
2014-08-21 04:09:12+0800 [default] DEBUG: Crawled (200) <GET http://www.baidu.com/> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0xa483cec>
[s] item {}
[s] request <GET http://www.baidu.com/>
[s] response <200 http://www.baidu.com/>
[s] settings <scrapy.settings.Settings object at 0xa0de78c>
[s] spider <Spider 'default' at 0xa78086c>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
>>>
# response.body -- the entire body of the response
# response.xpath('//ul/li') -- test any XPath expression against the page
More importantly, if you type response.selector you will access a selector object you can use to query the response, and convenient shortcuts like response.xpath() and response.css() mapping to response.selector.xpath() and response.selector.css().
In other words, you can interactively check whether an XPath expression selects the right content. I used to pick selectors with Firefox's F12 tools, but that didn't guarantee a correct selection every time.
You can also run:
scrapy shell 'http://scrapy.org' --nolog
# the --nolog flag suppresses the log output
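For instance, against the Baidu page fetched above you can check that the XPath and CSS shortcuts pick out the same nodes (a sketch; the actual output depends on the live page):
>>> response.xpath('//title/text()').extract()
>>> response.css('title::text').extract()  # same nodes via the CSS shortcut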
(2) Example
from scrapy import Spider
from scrapy_test.items import DmozItem

class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/',
    ]

    def parse(self, response):
        # each <li> under a <ul> holds one directory entry
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
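The spider imports DmozItem from scrapy_test.items; a minimal definition declaring the three fields used in parse() would look like this (scrapy_test is just this project's module name):
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()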
(3) Saving output to a file
Scraped items can be exported to a file; the format can be json, xml, or csv:
scrapy crawl dmoz -o a.json -t json
(4) Creating a spider from a template
scrapy genspider baidu baidu.com
# -*- coding: utf-8 -*-
import scrapy

class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = (
        'http://www.baidu.com/',
    )

    def parse(self, response):
        pass
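The generated spider does nothing yet, but it can already be run from the project directory:
scrapy crawl baidu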
That's it for this part; I remember there were five points earlier, but now I can only recall four. :-(
Always remember to hit the save button as you go, or it will really spoil your mood (⊙o⊙)!
(II) Advanced -- scrapy.contrib.spiders.CrawlSpider
(1) CrawlSpider
class scrapy.contrib.spiders.CrawlSpider
This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best suited for your particular web sites or project, but it's generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.
Apart from the attributes inherited from Spider (that you must specify), this class supports a new attribute:
rules
A list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rule objects are described below. If multiple rules match the same link, the first one will be used, according to the order they're defined in this attribute.
This spider also exposes an overridable method:
parse_start_url(response)
This method is called for the start_urls responses. It lets you parse the initial responses and must return either an Item object, a Request object, or an iterable containing any of them.
(2) Example
# coding=utf-8
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class TestItem(scrapy.Item):
    # fields must be declared before parse_item() can assign them
    id = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()

class TestSpider(CrawlSpider):
    name = 'test'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    rules = (
        # a tuple of Rule objects:
        # follow 'category.php' links (but not 'subsection.php' ones)
        Rule(LinkExtractor(allow=('category.php', ), deny=('subsection.php', ))),
        # hand 'item.php' links to parse_item()
        Rule(LinkExtractor(allow=('item.php', )), callback='parse_item'),
    )

    # note: CrawlSpider uses parse() internally, so the callback
    # must have a different name, e.g. parse_item
    def parse_item(self, response):
        self.log('item page : %s' % response.url)
        item = TestItem()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID:(\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
(3) Others
There are also a few more spider base classes; I'll look into these when I have time:
class scrapy.contrib.spiders.XMLFeedSpider
class scrapy.contrib.spiders.CSVFeedSpider
class scrapy.contrib.spiders.SitemapSpider
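For reference, the XMLFeedSpider pattern looks roughly like this (a sketch only; the feed URL and the itertag element name are made-up assumptions):
from scrapy.contrib.spiders import XMLFeedSpider

class XmlTestSpider(XMLFeedSpider):
    name = 'xmltest'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']  # assumed feed location
    iterator = 'iternodes'  # the default, streaming node iterator
    itertag = 'item'        # assumed name of the repeating feed element

    def parse_node(self, response, node):
        # node is a Selector positioned on one <item> element
        self.log('node title: %s' % node.xpath('title/text()').extract())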
(III) Selectors
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
You can use .css() and .xpath() flexibly to pick out the target data quickly.
!!! Selectors deserve a closer look: xpath() and css(), plus more practice with regular expressions.
When selecting by class, prefer css() to match the class, then use xpath() to pick out the element's attributes.
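For example, chaining a css() class match into an xpath() attribute lookup (a sketch; the HTML snippet is invented for illustration):
from scrapy.selector import Selector

body = '<div class="item-list"><a href="/item/1">First</a></div>'
sel = Selector(text=body)
# match the class with css(), then drill into attributes with xpath()
links = sel.css('div.item-list').xpath('.//a/@href').extract()
# links == [u'/item/1']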
(IV) Item Pipeline
After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.
Typical uses for item pipelines are:
• cleansing HTML data
• validating scraped data (checking that the items contain certain fields)
• checking for duplicates (and dropping them)
• storing the scraped item in a database
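A pipeline component only runs once it is enabled via the ITEM_PIPELINES setting; the integer value sets the order in which components run (lower runs first). A sketch, assuming the project module is scrapy_test and the classes below live in pipelines.py:
# settings.py
ITEM_PIPELINES = {
    'scrapy_test.pipelines.PricePipeline': 300,
    'scrapy_test.pipelines.JsonWriterPipeline': 800,
}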
(1) Validating data
from scrapy.exceptions import DropItem

class PricePipeline(object):
    vat_factor = 1.5

    def process_item(self, item, spider):
        if item['price']:
            if item['price_excludes_vat']:
                item['price'] *= self.vat_factor
            return item  # a pipeline must return the item to pass it on
        else:
            raise DropItem('Missing price in %s' % item)
(2) Writing a JSON file
import json

class JsonWriterPipeline(object):
    def __init__(self):
        self.file = open('json.jl', 'wb')

    def process_item(self, item, spider):
        # one JSON object per line (JSON Lines format)
        line = json.dumps(dict(item)) + '\n'
        self.file.write(line)
        return item
(3) Checking for duplicates
from scrapy.exceptions import DropItem

class Duplicates(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem('Duplicate item found : %s' % item)
        else:
            self.ids_seen.add(item['id'])
            return item
As for writing the data to a database, that should be just as simple: store the item from within process_item().
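A minimal sketch of that idea, using sqlite3 from the standard library (the items.db filename and the id/name schema are invented for illustration):
import sqlite3

class SQLitePipeline(object):
    def open_spider(self, spider):
        # hypothetical database file and table, for illustration only
        self.conn = sqlite3.connect('items.db')
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, name TEXT)')

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        self.conn.execute(
            'INSERT OR REPLACE INTO items (id, name) VALUES (?, ?)',
            (item['id'], item['name']))
        return item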
Read all night and got to page 85, so the basics are more or less covered.
-- 2014-08-21 06:39:41