Scrapy源码(2)——爬虫开始的地方
Scrapy运行命令
一般来说,运行Scrapy项目的写法有,(这里不考虑从脚本运行Scrapy)
Usage examples:
$ scrapy crawl myspider
[ ... myspider starts crawling ... ]
$ scrapy runspider myspider.py
[ ... spider starts crawling ... ]
但是更好的写法是,新建一个Python文件,如下,(便于调试)
from scrapy import cmdline
cmdline.execute("scrapy crawl myspider".split())
很容易就发现,Scrapy运行文件是cmdline.py
文件里面的execute()
函数,下面学习下这个函数在做什么。
分析源码
def execute(argv=None, settings=None):
if argv is None:
argv = sys.argv
# --- backwards compatibility for scrapy.conf.settings singleton ---
if settings is None and 'scrapy.conf' in sys.modules:
from scrapy import conf
if hasattr(conf, 'settings'):
settings = conf.settings
# ------------------------------------------------------------------
寻找 scrapy.conf
配置文件,argv直接取sys.argv
if settings is None:
settings = get_project_settings()
# set EDITOR from environment if available
try:
editor = os.environ['EDITOR']
except KeyError: pass
else:
settings['EDITOR'] = editor
check_deprecated_settings(settings)
# --- backwards compatibility for scrapy.conf.settings singleton ---
import warnings
from scrapy.exceptions import ScrapyDeprecationWarning
with warnings.catch_warnings():
warnings.simplefilter("ignore", ScrapyDeprecationWarning)
from scrapy import conf
conf.settings = settings
# ------------------------------------------------------------------
set EDITOR from environment if available
读取settings
设置文件,导入项目,调用get_project_settings()
函数,此处为utils
文件夹下的project.py
文件:
def get_project_settings():
if ENVVAR not in os.environ:
project = os.environ.get('SCRAPY_PROJECT', 'default')
init_env(project)
project.py
init_env()
函数如下:
def init_env(project='default', set_syspath=True):
"""Initialize environment to use command-line tool from inside a project
dir. This sets the Scrapy settings module and modifies the Python path to
be able to locate the project module.
"""
cfg = get_config()
if cfg.has_option('settings', project):
os.environ['SCRAPY_SETTINGS_MODULE'] = cfg.get('settings', project)
closest = closest_scrapy_cfg()
if closest:
projdir = os.path.dirname(closest)
if set_syspath and projdir not in sys.path:
sys.path.append(projdir)
conf.py
如注释所说,初始化环境,循环递归找到用户项目中的配置文件settings.py
,并且将其设置到环境变量Scrapy settings module
中。然后修改Python路径,确保能找到项目模块。
settings = Settings()
settings_module_path = os.environ.get(ENVVAR)
if settings_module_path:
settings.setmodule(settings_module_path, priority='project')
# XXX: remove this hack
pickled_settings = os.environ.get("SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE")
if pickled_settings:
settings.setdict(pickle.loads(pickled_settings), priority='project')
# XXX: deprecate and remove this functionality
env_overrides = {k[7:]: v for k, v in os.environ.items() if
k.startswith('SCRAPY_')}
if env_overrides:
settings.setdict(env_overrides, priority='project')
return settings
project.py
至此,get_project_settings()
该函数结束,如函数名字一样,最后返回项目配置,到此为止,接着往下看
inproject = inside_project()
cmds = _get_commands_dict(settings, inproject)
cmdname = _pop_command_name(argv)
parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(),
conflict_handler='resolve')
if not cmdname:
_print_commands(settings, inproject)
sys.exit(0)
elif cmdname not in cmds:
_print_unknown_command(settings, cmdname, inproject)
sys.exit(2)
导入相应的module爬虫模块(inside_project)
执行环境是否在项目中,主要检查scrapy.cfg配置文件是否存在,读取commands文件夹,把所有的命令类转换为{cmd_name: cmd_instance}
的字典
cmd = cmds[cmdname]
parser.usage = "scrapy %s %s" % (cmdname, cmd.syntax())
parser.description = cmd.long_desc()
settings.setdict(cmd.default_settings, priority='command')
cmd.settings = settings
cmd.add_options(parser)
opts, args = parser.parse_args(args=argv[1:])
_run_print_help(parser, cmd.process_options, args, opts)
根据命令名称找到对应的命令实例,设置项目配置和级别为command,添加解析规则,解析命令参数,并交由Scrapy命令实例处理。
最后,看看下面这段代码。
cmd.crawler_process = CrawlerProcess(settings)
_run_print_help(parser, _run_command, cmd, args, opts)
sys.exit(cmd.exitcode)
初始化CrawlerProcess
实例,将对应的命令执行,这里是crawl
def _run_command(cmd, args, opts):
if opts.profile:
_run_command_profiled(cmd, args, opts)
else:
cmd.run(args, opts)
看到这,想起了文档中的介绍 Run Scrapy from a script
# Here’s an example showing how to run a single spider with it.
import scrapy
from scrapy.crawler import CrawlerProcess
class MySpider(scrapy.Spider):
# Your spider definition
...
process = CrawlerProcess({
'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished
所以Scrapy爬虫运行都有用使用到CrawlerProcess
,想要深入了解可以去看看源码 scrapy/scrapy/crawler.py
"""
A class to run multiple scrapy crawlers in a process simultaneously.
This class extends :class:`~scrapy.crawler.CrawlerRunner` by adding support
for starting a Twisted `reactor`_ and handling shutdown signals, like the
keyboard interrupt command Ctrl-C. It also configures top-level logging.
This utility should be a better fit than
:class:`~scrapy.crawler.CrawlerRunner` if you aren't running another
Twisted `reactor`_ within your application.
The CrawlerProcess object must be instantiated with a
:class:`~scrapy.settings.Settings` object.
:param install_root_handler: whether to install root logging handler
(default: True)
This class shouldn't be needed (since Scrapy is responsible of using it
accordingly) unless writing scripts that manually handle the crawling
process. See :ref:`run-from-script` for an example.
"""
最后,附上Scrapy的路径图
总结
简单来说,有这么几步:
- 读取配置文件,应用到爬虫中
- 把所有的命令类转换名称与实例字典
- 初始化
CrawlerProcess
实例,运行爬虫
(看的头疼,好多函数名记不住)
- Sublime快速入门
- SQL学习之汇总数据之聚集函数
- Sedo榜单中,域名“加密世界”CryptoWorld.com七位数夺冠
- ExtJs学习笔记(20)-利用ExtJs的Ajax与服务端WCF交互
- 2018年热点分享:比特币到底是什么?
- JVM快速入门
- SQL学习之使用常用函数处理数据
- Javascript快速入门(上篇)
- SQL练习之不反复执行相同的计算
- SQL练习之求解填字游戏
- 快速入门系列--WCF--08扩展与新特性
- SQL练习之两个列值的交换
- Parcel,零配置开发 React 应用!
- 像 React Native 开发 APP 一样,用wn-cli 开发 weapp (微信小程序)
- JavaScript 教程
- JavaScript 编辑工具
- JavaScript 与HTML
- JavaScript 与Java
- JavaScript 数据结构
- JavaScript 基本数据类型
- JavaScript 特殊数据类型
- JavaScript 运算符
- JavaScript typeof 运算符
- JavaScript 表达式
- JavaScript 类型转换
- JavaScript 基本语法
- JavaScript 注释
- Javascript 基本处理流程
- Javascript 选择结构
- Javascript if 语句
- Javascript if 语句的嵌套
- Javascript switch 语句
- Javascript 循环结构
- Javascript 循环结构实例
- Javascript 跳转语句
- Javascript 控制语句总结
- Javascript 函数介绍
- Javascript 函数的定义
- Javascript 函数调用
- Javascript 几种特殊的函数
- JavaScript 内置函数简介
- Javascript eval() 函数
- Javascript isFinite() 函数
- Javascript isNaN() 函数
- parseInt() 与 parseFloat()
- escape() 与 unescape()
- Javascript 字符串介绍
- Javascript length属性
- javascript 字符串函数
- Javascript 日期对象简介
- Javascript 日期对象用途
- Date 对象属性和方法
- Javascript 数组是什么
- Javascript 创建数组
- Javascript 数组赋值与取值
- Javascript 数组属性和方法
- 如何用R语言绘制生成正态分布图表
- hdu 5143 NPY and arithmetic progression(暴力+思维)
- 正则表达式之简易markdown文件解析器
- 2015.CCF计算机软件能力认证试题第二题
- Codeforces Round #547 (Div. 3)A. Game 23
- Codeforces Round #547 (Div. 3)C. Polycarp Restores Permutation
- 动态规划入门_数塔问题
- Rust所有者被修改了会发生什么?
- 如何编写高质量代码
- 动态规划入门_钱币兑换问题
- Codeforces Round #547 (Div. 3)D. Colored Boots
- JavaScript 性能优化
- 优化循环的方法-循环展开
- 程序性能优化-局部性原理
- Codeforces Round #547 (Div. 3)E. Superhero Battle