Published 2016-05-12
Scrapy: a Python crawler framework
Scrapy is a fast, high-level screen scraping and web crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of applications, including data mining, monitoring, and automated testing.
Scrapy 1.1.0 has been released.
The changes are as follows:
Scrapy 1.1 has beta Python 3 support (requires Twisted >= 15.5). See :ref:`news_betapy3` for more details and some limitations.
Hot new features:
Item loaders now support nested loaders (:issue:`1467`).
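A minimal sketch of what a nested loader looks like; the Product item, the footer XPath, and the field names are all hypothetical::

    import scrapy
    from scrapy.loader import ItemLoader

    class Product(scrapy.Item):
        name = scrapy.Field()
        social = scrapy.Field()
        email = scrapy.Field()

    # Inside a spider, `response` is the page being parsed
    def parse(self, response):
        loader = ItemLoader(item=Product(), response=response)
        loader.add_xpath('name', '//h1/text()')

        # The relative XPaths below are resolved against //footer,
        # and the extracted values are collected into the same item
        footer = loader.nested_xpath('//footer')
        footer.add_xpath('social', 'a[@class="social"]/@href')
        footer.add_xpath('email', 'a[@class="email"]/@href')

        return loader.load_item()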
FormRequest.from_response improvements (:issue:`1382`, :issue:`1137`).
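For context, a typical from_response call looks like this; the URL, form fields, and credentials are made up::

    import scrapy

    class LoginSpider(scrapy.Spider):
        name = 'login'
        start_urls = ['https://example.com/login']

        def parse(self, response):
            # Pre-fills the request from the page's <form>, merging in formdata
            yield scrapy.FormRequest.from_response(
                response,
                formdata={'username': 'user', 'password': 'secret'},
                callback=self.after_login,
            )

        def after_login(self, response):
            self.logger.info('logged in, landed on %s', response.url)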
Added setting :setting:`AUTOTHROTTLE_TARGET_CONCURRENCY` and improved AutoThrottle docs (:issue:`1324`).
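In settings.py this looks roughly as follows; the 2.0 is an illustrative value (the default is 1.0)::

    # settings.py
    AUTOTHROTTLE_ENABLED = True
    # Average number of requests to send in parallel to each remote site
    AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0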
Added response.text to get body as unicode (:issue:`1730`).
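A quick illustration of the difference, inside a spider callback::

    def parse(self, response):
        raw = response.body    # bytes, exactly as received over the wire
        text = response.text   # str (unicode), decoded with the response's encoding
        self.logger.info('mentions Scrapy: %s', 'Scrapy' in text)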
Anonymous S3 connections (:issue:`1358`).
Deferreds in downloader middlewares (:issue:`1473`). This enables better robots.txt handling (:issue:`1471`).
HTTP caching now follows RFC 2616 more closely; added settings :setting:`HTTPCACHE_ALWAYS_STORE` and :setting:`HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS` (:issue:`1151`).
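A settings.py sketch; the directive list is a hypothetical choice::

    # settings.py
    HTTPCACHE_ENABLED = True
    # Store responses even when headers say they should not be cached
    HTTPCACHE_ALWAYS_STORE = True
    # Cache-Control directives in responses to ignore
    HTTPCACHE_IGNORE_RESPONSE_CACHE_CONTROLS = ['no-cache', 'no-store']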
Selectors were extracted to the parsel library (:issue:`1409`). This means you can use Scrapy Selectors without Scrapy and also upgrade the selectors engine without needing to upgrade Scrapy.
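For example, parsel can be used on its own, outside any Scrapy project (the HTML string is made up)::

    from parsel import Selector

    sel = Selector(text='<html><body><h1>Hello, parsel</h1></body></html>')
    print(sel.xpath('//h1/text()').extract_first())  # Hello, parsel
    print(sel.css('h1::text').extract())             # ['Hello, parsel']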
HTTPS downloader now does TLS protocol negotiation by default, instead of forcing TLS 1.0. You can also set the SSL/TLS method using the new :setting:`DOWNLOADER_CLIENT_TLS_METHOD`.
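If a site misbehaves with negotiation, you can pin a version; a settings.py sketch::

    # settings.py -- the default 'TLS' negotiates the best protocol available
    DOWNLOADER_CLIENT_TLS_METHOD = 'TLSv1.2'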
These bug fixes may require your attention:
Don't retry bad requests (HTTP 400) by default (:issue:`1289`). If you need the old behavior, add 400 to :setting:`RETRY_HTTP_CODES`.
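A settings.py sketch of restoring the old behavior, assuming the default code list is [500, 502, 503, 504, 408]::

    # settings.py -- re-add 400 to the retryable status codes
    RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 400]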
Fix shell files argument handling (:issue:`1710`, :issue:`1550`). If you try scrapy shell index.html, it will try to load the URL http://index.html; use scrapy shell ./index.html to load a local file.
Robots.txt compliance is now enabled by default for newly created projects (:issue:`1724`). Scrapy will also wait for robots.txt to be downloaded before proceeding with the crawl (:issue:`1735`). If you want to disable this behavior, update :setting:`ROBOTSTXT_OBEY` in the settings.py file after creating a new project.
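Disabling it is a one-line change::

    # settings.py -- opt this project out of robots.txt compliance
    ROBOTSTXT_OBEY = False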
Exporters now work on unicode, instead of bytes, by default (:issue:`1080`). If you use PythonItemExporter, you may want to update your code to disable binary mode, which is now deprecated.
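A minimal sketch, assuming the exporter's binary keyword argument::

    from scrapy.exporters import PythonItemExporter

    # binary=False makes the exporter emit unicode strings;
    # binary=True is the old, now-deprecated mode
    exporter = PythonItemExporter(binary=False)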
Accept XML node names containing dots as valid (:issue:`1533`).
When uploading files or images to S3 (with FilesPipeline or ImagesPipeline), the default ACL policy is now "private" instead of "public". Warning: backwards incompatible! You can use :setting:`FILES_STORE_S3_ACL` to change it.
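To get something like the old behavior back (assuming S3's 'public-read' canned ACL is what you want)::

    # settings.py
    FILES_STORE_S3_ACL = 'public-read'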
We've reimplemented canonicalize_url() for more correct output, especially for URLs with non-ASCII characters (:issue:`1947`). This could change link extractors' output compared to previous Scrapy versions. It may also invalidate cache entries left over from pre-1.1 runs. Warning: backwards incompatible!
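As an illustration (in 1.1 the function lives in scrapy.utils.url; the sample URL and output are assumptions)::

    from scrapy.utils.url import canonicalize_url

    # Sorts query arguments and percent-encodes non-ASCII characters
    print(canonicalize_url('http://example.com/do?b=2&a=1'))
    # http://example.com/do?a=1&b=2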
Download: