Published 2016-05-12
Scrapy: a Python crawler framework
Scrapy is a fast, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of applications, including data mining, monitoring, and automated testing.
Scrapy 1.1.0 has been released.
The changes are as follows:
Scrapy 1.1 has beta Python 3 support (requires Twisted >= 15.5). See :ref:`news_betapy3` for more details and some limitations.
Hot new features:
Item loaders now support nested loaders (:issue:`1467`).
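A minimal sketch of what a nested loader looks like; the Product item, the footer XPath, and the field names are all hypothetical::

    import scrapy
    from scrapy.loader import ItemLoader

    class Product(scrapy.Item):
        name = scrapy.Field()
        social = scrapy.Field()
        email = scrapy.Field()

    # Inside a spider, `response` is the page being parsed
    def parse(self, response):
        loader = ItemLoader(item=Product(), response=response)
        loader.add_xpath('name', '//h1/text()')

        # The relative XPaths below are resolved against //footer,
        # and the extracted values are collected into the same item
        footer = loader.nested_xpath('//footer')
        footer.add_xpath('social', 'a[@class="social"]/@href')
        footer.add_xpath('email', 'a[@class="email"]/@href')

        return loader.load_item()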
FormRequest.from_response improvements (:issue:`1382`, :issue:`1137`).
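For context, a typical from_response call looks like this; the URL, form fields, and credentials are made up::

    import scrapy

    class LoginSpider(scrapy.Spider):
        name = 'login'
        start_urls = ['https://example.com/login']

        def parse(self, response):
            # Pre-fills the request from the page's <form>, merging in formdata
            yield scrapy.FormRequest.from_response(
                response,
                formdata={'username': 'user', 'password': 'secret'},
                callback=self.after_login,
            )

        def after_login(self, response):
            self.logger.info('logged in, landed on %s', response.url)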
Added setting :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` and improved AutoThrottle docs (:issue:`1324`).
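In settings.py this looks roughly as follows; the 2.0 is an illustrative value (the default is 1.0)::

    # settings.py
    AUTOTHROTTLE_ENABLED = True
    # Average number of requests to send in parallel to each remote site
    AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0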
Added response.text to get body as unicode (:issue:`1730`).
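A quick illustration of the difference, inside a spider callback::

    def parse(self, response):
        raw = response.body    # bytes, exactly as received over the wire
        text = response.text   # str (unicode), decoded with the response's encoding
        self.logger.info('mentions Scrapy: %s', 'Scrapy' in text)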
Anonymous S3 connections (:issue:`1358`).
Deferreds in downloader middlewares (:issue:`1473`). This enables better robots.txt handling (:issue:`1471`).
HTTP caching now follows RFC 2616 more closely; added settings :setting:`HTTPCACHE_ALWAYS_STORE` and :setting:`HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS` (:issue:`1151`).
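A settings.py sketch; the directive list is a hypothetical choice::

    # settings.py
    HTTPCACHE_ENABLED = True
    # Store responses even when headers say they should not be cached
    HTTPCACHE_ALWAYS_STORE = True
    # Cache-Control directives in responses to ignore
    HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS = ['no-cache', 'no-store']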
Selectors were extracted to the parsel library (:issue:`1409`). This means you can use Scrapy Selectors without Scrapy and also upgrade the selectors engine without needing to upgrade Scrapy.
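For example, parsel can be used on its own, outside any Scrapy project (the HTML string is made up)::

    from parsel import Selector

    sel = Selector(text='<html><body><h1>Hello, parsel</h1></body></html>')
    print(sel.xpath('//h1/text()').extract_first())  # Hello, parsel
    print(sel.css('h1::text').extract())             # ['Hello, parsel']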
HTTPS downloader now does TLS protocol negotiation by default, instead of forcing TLS 1.0. You can also set the SSL/TLS method using the new :setting:`DOWNLOADER_CLIENT_TLS_METHOD`.
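If a site misbehaves with negotiation, you can pin a version; a settings.py sketch::

    # settings.py -- the default 'TLS' negotiates the best protocol available
    DOWNLOADER_CLIENT_TLS_METHOD = 'TLSv1.2'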
These bug fixes may require your attention:
Don't retry bad requests (HTTP 400) by default (:issue:`1289`). If you need the old behavior, add 400 to :setting:`RETRY_HTTP_CODES`.
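A settings.py sketch of restoring the old behavior, assuming the default code list is [500, 502, 503, 504, 408]::

    # settings.py -- re-add 400 to the retryable status codes
    RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 400]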
Fix shell files argument handling (:issue:`1710`, :issue:`1550`). If you try scrapy shell index.html, it will try to load the URL http://index.html; use scrapy shell ./index.html to load a local file.
Robots.txt compliance is now enabled by default for newly created projects (:issue:`1724`). Scrapy will also wait for robots.txt to be downloaded before proceeding with the crawl (:issue:`1735`). If you want to disable this behavior, update :setting:`ROBOTSTXT_OBEY` in the settings.py file after creating a new project.
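Disabling it is a one-line change::

    # settings.py -- opt this project out of robots.txt compliance
    ROBOTSTXT_OBEY = False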
Exporters now work on unicode, instead of bytes, by default (:issue:`1080`). If you use PythonItemExporter, you may want to update your code to disable binary mode, which is now deprecated.
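A minimal sketch, assuming the exporter's binary keyword argument::

    from scrapy.exporters import PythonItemExporter

    # binary=False makes the exporter emit unicode strings;
    # binary=True is the old, now-deprecated mode
    exporter = PythonItemExporter(binary=False)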
Accept XML node names containing dots as valid (:issue:`1533`).
When uploading files or images to S3 (with FilesPipeline or ImagesPipeline), the default ACL policy is now "private" instead of "public". Warning: backwards incompatible! You can use :setting:`FILES_STORE_S3_ACL` to change it.
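To get something like the old behavior back (assuming S3's 'public-read' canned ACL is what you want)::

    # settings.py
    FILES_STORE_S3_ACL = 'public-read'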
We've reimplemented canonicalize_url() for more correct output, especially for URLs with non-ASCII characters (:issue:`1947`). This could change link extractors' output compared to previous Scrapy versions. It may also invalidate cache entries left over from pre-1.1 runs. Warning: backwards incompatible!
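As an illustration (in 1.1 the function lives in scrapy.utils.url; the sample URL and output are assumptions)::

    from scrapy.utils.url import canonicalize_url

    # Sorts query arguments and percent-encodes non-ASCII characters
    print(canonicalize_url('http://example.com/do?b=2&a=1'))
    # http://example.com/do?a=1&b=2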
Download: