Published 2016-10-04 00:54:49 | Source: reader submission
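Scrapy: a web crawling framework for Python
Scrapy is a fast, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It serves a wide range of purposes, including data mining, monitoring, and automated testing.
Scrapy 1.2.0 has been released.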
New FEED_EXPORT_ENCODING setting to customize the encoding used when writing items to a file. This can be used to turn off \uXXXX escapes in JSON output. This is also useful for those wanting something other than UTF-8 for XML or CSV output (#2034).
The startproject command now supports an optional destination directory to override the default one based on the project name (#2005).
New SCHEDULER_DEBUG setting to log request serialization failures (#1610).
JSON encoder now supports serialization of set instances (#2058).
Interpret application/json-amazonui-streaming as TextResponse (#1503).
scrapy is imported by default when using shell tools (shell, inspect_response) (#2248).
DefaultRequestHeaders middleware now runs before UserAgent middleware (#2088). Warning: this is technically backwards incompatible, though we consider this a bug fix.
HTTP cache extension and plugins that use the .scrapy data directory now work outside projects (#1581). Warning: this is technically backwards incompatible, though we consider this a bug fix.
Selector does not allow passing both response and text anymore (#2153).
Fixed logging of wrong callback name with scrapy parse (#2169).
Fix for an odd gzip decompression bug (#1606).
Fix for selected callbacks when using CrawlSpider with scrapy parse (#2225).
Fix for invalid JSON and XML files when spider yields no items (#872).
Implement flush() for StreamLogger avoiding a warning in logs (#2125).