Published 2015-09-04 06:52:30 | 279 views | Comments: 0 | Source: compiled from the web
ITEM_PIPELINES is now defined as a dict (instead of a list)
Thanks to everyone who contributed to this release!
List of contributors, sorted by number of commits:
69 Daniel Graña <dangra@...> 37 Pablo Hoffman <pablo@...> 13 Mikhail Korobov <kmike84@...> 9 Alex Cepoi <alex.cepoi@...> 9 alexanderlukanin13 <alexander.lukanin.13@...> 8 Rolando Espinoza La fuente <darkrho@...> 8 Lukasz Biedrycki <lukasz.biedrycki@...> 6 Nicolas Ramirez <nramirez.uy@...> 3 Paul Tremberth <paul.tremberth@...> 2 Martin Olveyra <molveyra@...> 2 Stefan <misc@...> 2 Rolando Espinoza <darkrho@...> 2 Loren Davie <loren@...> 2 irgmedeiros <irgmedeiros@...> 1 Stefan Koch <taikano@...> 1 Stefan <cct@...> 1 scraperdragon <dragon@...> 1 Kumara Tharmalingam <ktharmal@...> 1 Francesco Piccinno <stack.box@...> 1 Marcos Campal <duendex@...> 1 Dragon Dave <dragon@...> 1 Capi Etheriel <barraponto@...> 1 cacovsky <amarquesferraz@...> 1 Berend Iwema <berend@...>
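The dict-based ITEM_PIPELINES format noted above can be sketched as follows. This is only an illustration: the pipeline paths are hypothetical, and the snippet shows how the integer values fix the execution order (lower runs first), rather than anything Scrapy-specific.

```python
# Hypothetical pipeline paths -- illustrative only, not from this release note.
# Old style (list): order was implicit in list position.
OLD_ITEM_PIPELINES = [
    "myproject.pipelines.ValidationPipeline",
    "myproject.pipelines.StoragePipeline",
]

# New style (dict): values are integers that determine execution order
# (lower runs first), conventionally chosen in the 0-1000 range.
ITEM_PIPELINES = {
    "myproject.pipelines.StoragePipeline": 800,
    "myproject.pipelines.ValidationPipeline": 300,
}

# Pipelines run sorted by their order value:
execution_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(execution_order)
```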
Added --pdb option to the scrapy command-line tool
Added XPathSelector.remove_namespaces(), which removes all namespaces from an XML document for convenience (so you can work with namespace-less XPaths). Documented in Selectors.
Thanks to everyone who contributed to this release. Here is a list of contributors sorted by number of commits:
130 Pablo Hoffman <pablo@...> 97 Daniel Graña <dangra@...> 20 Nicolás Ramírez <nramirez.uy@...> 13 Mikhail Korobov <kmike84@...> 12 Pedro Faustino <pedrobandim@...> 11 Steven Almeroth <sroth77@...> 5 Rolando Espinoza La fuente <darkrho@...> 4 Michal Danilak <mimino.coder@...> 4 Alex Cepoi <alex.cepoi@...> 4 Alexandr N Zamaraev (aka tonal) <tonal@...> 3 paul <paul.tremberth@...> 3 Martin Olveyra <molveyra@...> 3 Jordi Llonch <llonchj@...> 3 arijitchakraborty <myself.arijit@...> 2 Shane Evans <shane.evans@...> 2 joehillen <joehillen@...> 2 Hart <HartSimha@...> 2 Dan <ellisd23@...> 1 Zuhao Wan <wanzuhao@...> 1 whodatninja <blake@...> 1 vkrest <v.krestiannykov@...> 1 tpeng <pengtaoo@...> 1 Tom Mortimer-Jones <tom@...> 1 Rocio Aramberri <roschegel@...> 1 Pedro <pedro@...> 1 notsobad <wangxiaohugg@...> 1 Natan L <kuyanatan.nlao@...> 1 Mark Grey <mark.grey@...> 1 Luan <luanpab@...> 1 Libor Nenadál <libor.nenadal@...> 1 Juan M Uys <opyate@...> 1 Jonas Brunsgaard <jonas.brunsgaard@...> 1 Ilya Baryshev <baryshev@...> 1 Hasnain Lakhani <m.hasnain.lakhani@...> 1 Emanuel Schorsch <emschorsch@...> 1 Chris Tilden <chris.tilden@...> 1 Capi Etheriel <barraponto@...> 1 cacovsky <amarquesferraz@...> 1 Berend Iwema <berend@...>
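The convenience that remove_namespaces() (mentioned above) provides can be sketched with only the standard library; the snippet below is not Scrapy's implementation, just a demonstration of why stripping namespaces makes plain paths work:

```python
# A minimal sketch of what namespace removal buys you, using only the
# standard library (XPathSelector.remove_namespaces() itself is Scrapy API).
import xml.etree.ElementTree as ET

doc = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry><title>first</title></entry>
</feed>"""

root = ET.fromstring(doc)

# With namespaces in place, a plain path finds nothing:
assert root.find("entry/title") is None

# Stripping the namespace prefix from every tag ...
for el in root.iter():
    if "}" in el.tag:
        el.tag = el.tag.split("}", 1)[1]

# ... lets namespace-less paths work:
print(root.find("entry/title").text)
```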
Scrapy changes:
Added -o and -t options to the runspider command
Added AUTOTHROTTLE_ENABLED setting (enables the AutoThrottle extension)
Removed stats-related signals (stats_spider_opened, etc.). Stats are much simpler now; backwards compatibility is kept on the Stats Collector API and signals.
Added process_start_requests() method to spider middlewares
Dropped scrapy.xlib.BeautifulSoup and scrapy.xlib.ClientForm
Added cookiejar Request meta key to support multiple cookie sessions per spider
Added REFERER_ENABLED setting, to control referer middleware
Changed default user agent to: Scrapy/VERSION (+http://scrapy.org)
Removed HTMLImageLinkExtractor class from scrapy.contrib.linkextractors.image
The USER_AGENT spider attribute will no longer work; use the user_agent attribute instead
The DOWNLOAD_TIMEOUT spider attribute will no longer work; use the download_timeout attribute instead
Removed ENCODING_ALIASES setting, as encoding auto-detection has been moved to the w3lib library
Download handlers (DOWNLOAD_HANDLERS setting) now receive settings as the first argument of the constructor
Removed scrapy.utils.memory module
Removed signal: scrapy.mail.mail_sent
Removed TRACK_REFS setting; trackrefs is now always enabled
Log messages (per level) are now tracked through Scrapy stats (stat name: log_count/LEVEL)
Received responses are now tracked through Scrapy stats (stat name: response_received_count)
Removed scrapy.log.started attribute
Support for AJAX crawlable URLs
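The AJAX-crawlable-URL support mentioned above relates to the "#!" crawling convention, where a fragment URL is fetched via an _escaped_fragment_ query parameter. A stdlib-only sketch of that URL rewrite (the helper name is ours, not a Scrapy API):

```python
# Sketch of the "#!" AJAX-crawling convention: a fragment URL is rewritten
# to an _escaped_fragment_ query parameter the server can respond to.
from urllib.parse import quote

def escaped_fragment_url(url: str) -> str:
    """Rewrite a #! URL to its _escaped_fragment_ equivalent (illustrative)."""
    if "#!" not in url:
        return url
    base, fragment = url.split("#!", 1)
    sep = "&" if "?" in base else "?"
    return base + sep + "_escaped_fragment_=" + quote(fragment, safe="")

print(escaped_fragment_url("http://example.com/page#!/section/1"))
```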
New persistent scheduler that stores requests on disk, allowing crawls to be suspended and resumed (r2737)
Added -o option to scrapy crawl, a shortcut for dumping scraped items into a file (or standard output using -)
Added support for passing custom settings to the Scrapyd schedule.json API (r2779, r2783)
New ChunkedTransferMiddleware (enabled by default) to support chunked transfer encoding (r2769)
Added boto 2.0 support for the S3 downloader handler (r2763)
In request errbacks, offending requests are now received in failure.request attribute (r2738)
The CONCURRENT_REQUESTS_PER_SPIDER setting has been deprecated and replaced by new concurrency settings; check the documentation for more details
Added builtin caching DNS resolver (r2728)
Moved Amazon AWS-related components/extensions (SQS spider queue, SimpleDB stats collector) to a separate project: [scaws](https://github.com/scrapinghub/scaws) (r2706, r2714)
Moved spider queues to scrapyd: scrapy.spiderqueue -> scrapyd.spiderqueue (r2708)
Moved sqlite utils to scrapyd: scrapy.utils.sqlite -> scrapyd.sqlite (r2781)
Real support for returning iterators on start_requests() method. The iterator is now consumed during the crawl when the spider is getting idle (r2704)
Added REDIRECT_ENABLED setting to quickly enable/disable the redirect middleware (r2697)
Added RETRY_ENABLED setting to quickly enable/disable the retry middleware (r2694)
Added CloseSpider exception to manually close spiders (r2691)
Improved encoding detection by adding support for HTML5 meta charset declaration (r2690)
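The HTML5 meta charset declaration mentioned above (`<meta charset="...">`) differs from the older http-equiv form. A rough, stdlib-only illustration of sniffing it (this is not Scrapy's detection code):

```python
# Rough illustration of meta charset detection covering both the HTML5
# <meta charset="..."> form and the older http-equiv form; not Scrapy's code.
import re

CHARSET_RE = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)

def sniff_charset(body: bytes):
    """Return the declared charset (lowercased) or None if absent."""
    m = CHARSET_RE.search(body)
    return m.group(1).decode("ascii").lower() if m else None

print(sniff_charset(b'<html><head><meta charset="UTF-8"></head></html>'))
print(sniff_charset(b'<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'))
```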
Refactored close spider behavior to wait for all downloads to finish and be processed by spiders, before closing the spider (r2688)
Added SitemapSpider (see documentation in Spiders page) (r2658)
Added LogStats extension for periodically logging basic stats (like crawled pages and scraped items) (r2657)
Made handling of gzipped responses more robust (#319, r2643). Now Scrapy will try to decompress as much as possible from a gzipped response, instead of failing with an IOError.
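The "decompress as much as possible" behavior described above can be demonstrated with the standard library: a strict gzip decode fails on a truncated body, while a raw zlib decompressor recovers the readable prefix. This mimics the described behavior; it is not Scrapy's implementation.

```python
# Strict gzip.decompress() fails on a truncated body; a zlib decompressobj
# recovers the readable prefix instead (illustrative, not Scrapy's code).
import gzip
import random
import zlib

random.seed(0)
payload = bytes(random.randrange(256) for _ in range(20000))  # incompressible data
compressed = gzip.compress(payload)
truncated = compressed[: len(compressed) // 2]

# The strict decoder gives up on a truncated body:
try:
    gzip.decompress(truncated)
    truncation_error = False
except (EOFError, zlib.error):
    truncation_error = True

# A raw decompressor returns whatever it can decode.
# 16 + MAX_WBITS tells zlib to expect a gzip wrapper.
partial = zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(truncated)
print("truncation error:", truncation_error, "- recovered", len(partial), "bytes")
```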
Simplified MemoryDebugger extension to use stats for dumping memory debugging info (r2639)
Added new command to edit spiders: scrapy edit (r2636), and -e flag to the genspider command that uses it (r2653)
Changed default representation of items to pretty-printed dicts (r2631). This improves default logging by making the log more readable in the default case, for both Scraped and Dropped lines.
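The readability gain from pretty-printed dicts can be seen with the stdlib pprint module standing in for Scrapy's item formatting (the item fields here are made up):

```python
# pprint stands in for Scrapy's item formatting; field names are invented.
from pprint import pformat

item = {
    "name": "Example product",
    "price": "19.99",
    "urls": ["http://example.com/a", "http://example.com/b"],
}

# A log line built from the pretty-printed form wraps long items
# over multiple indented lines instead of one unreadable blob.
print("Scraped %s" % pformat(item))
```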
Added spider_error signal (r2628)
Added COOKIES_ENABLED setting (r2625)
Stats are now dumped to the Scrapy log (the default value of the STATS_DUMP setting has been changed to True). This is to make Scrapy users more aware of Scrapy stats and the data collected there.
Added support for dynamically adjusting download delay and maximum concurrent requests (r2599)
Added new DBM HTTP cache storage backend (r2576)
Added listjobs.json API to Scrapyd (r2571)
CsvItemExporter: added join_multivalued parameter (r2578)
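The idea behind a join_multivalued-style option can be sketched with the stdlib csv module: list-valued fields are joined into a single cell with a chosen separator. The parameter and helper names below are illustrative, not the exporter's actual code.

```python
# Stdlib sketch of joining multivalued fields before CSV export;
# names are illustrative, not CsvItemExporter's implementation.
import csv
import io

def export_row(writer, row, joiner=","):
    """Write one row, joining any list-valued cell with `joiner`."""
    writer.writerow(
        [joiner.join(v) if isinstance(v, list) else v for v in row]
    )

buf = io.StringIO()
writer = csv.writer(buf)
export_row(writer, ["Example", ["tag1", "tag2", "tag3"]])
print(buf.getvalue().strip())
```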
Added namespace support to xmliter_lxml (r2552)
Improved cookies middleware by making COOKIES_DEBUG nicer and documenting it (r2579)
Several improvements to Scrapyd and Link extractors
Old log lines Scraped Item... were removed
Old log lines Passed Item... were renamed to Scraped Item... and downgraded to DEBUG level
Removed unused function: scrapy.utils.request.request_info() (r2577)
Removed googledir project from examples/googledir. There’s now a new example project called dirbot available on github: https://github.com/scrapy/dirbot
Removed support for default field values in Scrapy items (r2616)
Removed experimental crawlspider v2 (r2632)
Removed scheduler middleware to simplify architecture. Duplicates filtering is now done in the scheduler itself, using the same dupe filtering class as before (DUPEFILTER_CLASS setting) (r2640)
Removed support for passing urls to the scrapy crawl command (use scrapy parse instead) (r2704)
Removed deprecated Execution Queue (r2704)
Removed (undocumented) spider context extension (from scrapy.contrib.spidercontext) (r2780)
Removed CONCURRENT_SPIDERS setting (use scrapyd maxproc instead) (r2789)
Renamed attributes of core components: downloader.sites -> downloader.slots, scraper.sites -> scraper.slots (r2717, r2718)
Renamed setting CLOSESPIDER_ITEMPASSED to CLOSESPIDER_ITEMCOUNT (r2655). Backwards compatibility kept.
The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.
Passed item is now sent in the item argument of the item_passed signal (#273)
Added verbose option to the scrapy version command, useful for bug reports (#298)
Added -c argument to the scrapy shell command
Made libxml2 optional (#260)
New deploy command (#261)
Added CLOSESPIDER_PAGECOUNT setting (#253)
Added CLOSESPIDER_ERRORCOUNT setting (#254)
Deprecated runserver command in favor of the server command, which starts a Scrapyd server. See also: Scrapyd changes
Deprecated queue command in favor of using the Scrapyd schedule.json API. See also: Scrapyd changes
The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.
New scrapyd service for deploying Scrapy crawlers in production (#218) (documentation available)
url and body attributes of Request objects are now read-only (#230)
Request.copy() and Request.replace() now also copy their callback and errback attributes (#231)
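The copy()/replace() semantics described above can be sketched with a stand-in Request class (not Scrapy's): the callback and errback references survive the copy unless explicitly overridden.

```python
# Minimal stand-in Request class (not Scrapy's) showing that replace()/copy()
# carry the callback and errback attributes over to the new object.
class Request:
    def __init__(self, url, callback=None, errback=None):
        self.url = url
        self.callback = callback
        self.errback = errback

    def replace(self, **kwargs):
        """Return a copy with the given attributes overridden."""
        attrs = {"url": self.url, "callback": self.callback,
                 "errback": self.errback}
        attrs.update(kwargs)
        return Request(**attrs)

    def copy(self):
        return self.replace()

def parse(response): ...

req = Request("http://example.com", callback=parse)
clone = req.replace(url="http://example.com/2")
print(clone.url, clone.callback is parse)
```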
Removed UrlFilterMiddleware from scrapy.contrib (already disabled by default)
Offsite middleware doesn't filter out requests coming from a spider that doesn't have an allowed_domains attribute (#225)
Removed Spider Manager load() method. Now spiders are loaded in the constructor itself.
scrapy.core.manager.ScrapyManager class renamed to scrapy.crawler.Crawler
scrapy.core.manager.scrapymanager singleton moved to scrapy.project.crawler
Moved module: scrapy.contrib.spidermanager to scrapy.spidermanager
Spider Manager singleton moved from scrapy.spider.spiders to the spiders attribute of the scrapy.project.crawler singleton
Moved scrapy.stats.collector.StatsCollector to scrapy.statscol.StatsCollector, and scrapy.stats.collector.SimpledbStatsCollector to scrapy.contrib.statscol.SimpledbStatsCollector
Default per-command settings are now specified in the default_settings attribute of the command object class (#201)
Changed the signature of the process_item() item pipeline method from (spider, item) to (item, spider)
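A pipeline written against the new (item, spider) argument order looks like the sketch below. The class and field names are made up for illustration, and ValueError stands in for the exception a real pipeline would raise to drop an item.

```python
# Pipeline sketch using the new (item, spider) argument order; class and
# field names are hypothetical, ValueError stands in for a drop exception.
class DropMissingPricePipeline:
    def process_item(self, item, spider):
        # item comes first under the new signature; spider second.
        if not item.get("price"):
            raise ValueError("missing price in item from %s" % spider)
        return item

item = {"name": "widget", "price": "9.99"}
print(DropMissingPricePipeline().process_item(item, spider="demo"))
```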
Moved scrapy.core.signals module to scrapy.signals
Moved scrapy.core.exceptions module to scrapy.exceptions
added handles_request() class method to BaseSpider
dropped scrapy.log.exc() function (use scrapy.log.err() instead)
dropped component argument of the scrapy.log.msg() function
dropped scrapy.log.log_level attribute
Added from_settings() class methods to Spider Manager and Item Pipeline Manager
Added HTTPCACHE_IGNORE_SCHEMES setting to ignore certain schemes on HttpCacheMiddleware (#225)
Added SPIDER_QUEUE_CLASS setting which defines the spider queue to use (#220)
Added KEEP_ALIVE setting (#220)
Removed SERVICE_QUEUE setting (#220)
Removed COMMANDS_SETTINGS_MODULE setting (#201)
Renamed REQUEST_HANDLERS setting to DOWNLOAD_HANDLERS and made download handlers classes (instead of functions)
The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.
New settings added: MAIL_USER, MAIL_PASS (r2065 | #149)
Added LOG_ENCODING setting (r1956, documentation available)
Added RANDOMIZE_DOWNLOAD_DELAY setting (enabled by default) (r1923, doc available)
MailSender is no longer IO-blocking (r1955 | #146)
Changed Spider.domain_name to Spider.name (SEP-012, r1975)
Response.encoding is now the detected encoding (r1961)
HttpErrorMiddleware now returns None or raises an exception (r2006 | #157)
scrapy.command modules relocation (r2035, r2036, r2037)
Added ExecutionQueue for feeding spiders to scrape (r2034)
Removed ExecutionEngine singleton (r2039)
Ported S3ImagesStore (images pipeline) to use boto and threads (r2033)
Moved module: scrapy.management.telnet to scrapy.telnet (r2047)
The numbers like #NNN reference tickets in the old issue tracker (Trac) which is no longer available.
Added dont_click argument to FormRequest.from_response() method (r1813, r1816)
Added clickdata argument to FormRequest.from_response() method (r1802, r1803)
Added support for HTTP proxies (HttpProxyMiddleware) (r1781, r1785)
Changed scrapy.utils.response.get_meta_refresh() signature (r1804)
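What the dont_click argument to FormRequest.from_response() controls, conceptually, is whether the submit control's name/value pair is included in the submitted form data. A stdlib-only stand-in (this is not FormRequest's implementation; the class name is ours):

```python
# Stdlib stand-in showing the effect of a dont_click-style flag: with it set,
# the submit button's name/value pair is left out of the form data.
from html.parser import HTMLParser

class FormDataParser(HTMLParser):
    def __init__(self, dont_click=False):
        super().__init__()
        self.dont_click = dont_click
        self.data = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        a = dict(attrs)
        if a.get("type") == "submit" and self.dont_click:
            return  # simulate dont_click=True: do not "press" the button
        if a.get("name"):
            self.data[a["name"]] = a.get("value", "")

form = '''<form>
  <input type="text" name="q" value="books">
  <input type="submit" name="go" value="Search">
</form>'''

clicked = FormDataParser(dont_click=False)
clicked.feed(form)
not_clicked = FormDataParser(dont_click=True)
not_clicked.feed(form)
print(clicked.data)
print(not_clicked.data)
```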
Removed deprecated scrapy.item.ScrapedItem class - use scrapy.item.Item instead (r1838)
Removed deprecated scrapy.xpath module - use scrapy.selector instead (r1836)
Removed deprecated core.signals.domain_open signal - use core.signals.domain_opened instead (r1822)
log.msg() now receives a spider argument (r1822). The old domain argument has been deprecated; for spiders, always use the spider argument and pass spider references. If you really want to pass a string, use the component argument instead.
Changed core signals domain_opened, domain_closed and domain_idle
The domain argument of the process_item() item pipeline method was changed to spider; the new signature is: process_item(spider, item) (r1827 | #105). To quickly port your code, just use spider.domain_name where you previously used domain.
StatsCollector was changed to receive spider references (instead of domains) in its methods (set_value, inc_value, etc.)
Added StatsCollector.iter_spider_stats() method
Removed StatsCollector.list_domains() method
To quickly port your code, just use spider.domain_name where you previously used domain. spider_stats contains exactly the same data as domain_stats.
CloseDomain extension moved to scrapy.contrib.closespider.CloseSpider (r1833); its settings were renamed:
CLOSEDOMAIN_TIMEOUT to CLOSESPIDER_TIMEOUT
CLOSEDOMAIN_ITEMCOUNT to CLOSESPIDER_ITEMCOUNT
Removed deprecated SCRAPYSETTINGS_MODULE environment variable - use SCRAPY_SETTINGS_MODULE instead (r1840)
Renamed setting: REQUESTS_PER_DOMAIN to CONCURRENT_REQUESTS_PER_SPIDER (r1830, r1844)
Renamed setting: CONCURRENT_DOMAINS to CONCURRENT_SPIDERS (r1830)
Refactored HTTP Cache middleware
The HTTP Cache middleware has been heavily refactored, retaining the same functionality except for the domain sectorization, which was removed. (r1843)
Renamed exception: DontCloseDomain to DontCloseSpider (r1859 | #120)
Renamed extension: DelayedCloseDomain to SpiderCloseDelay (r1861 | #121)
Removed obsolete scrapy.utils.markup.remove_escape_chars function - use scrapy.utils.markup.replace_escape_chars instead (r1865)
First release of Scrapy.