Item objects work like regular Python dictionaries. We can access an item's fields with the usual dictionary syntax:
>>> item = DemoItem()
>>> item['title'] = 'sample title'
>>> item['title']
'sample title'
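Here DemoItem is the Item subclass declared in the project's items.py. A minimal sketch of what first_scrapy/items.py likely looks like, assuming only the three fields used in this chapter (title, link, desc):

# first_scrapy/items.py -- minimal sketch; field names inferred from the spider below
import scrapy

class DemoItem(scrapy.Item):
    # declare one Field per piece of data the spider will fill in
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()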
Now use the item inside the spider, as in the following example:
# -*- coding: utf-8 -*-
import scrapy
from first_scrapy.items import DemoItem


class firstSpider(scrapy.Spider):
    name = "first"
    allowed_domains = ["demo.com"]
    start_urls = [
        "http://www.demo.com/scrapy/scrapy_create_project.html",
        "http://www.demo.com/scrapy/scrapy_environment.html"
    ]

    def parse(self, response):
        # all tutorial names and links ...
        for sel in response.xpath('//ul/li'):
            item = DemoItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
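The spider is run from the project directory with the standard crawl command; a minimal example, where "first" is the name attribute defined above:

D:\first_scrapy>scrapy crawl first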
Part of the output produced by the above spider is:
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/python3/'],
 'title': [u'Python3\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/php7/'],
 'title': [u'PHP7\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/excel/'],
 'title': [u'Excel\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/html/uml/'],
 'title': [u'UML']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/html/socket/'],
 'title': [u'Socket\u7f16\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/html/radius/'],
 'title': [u'Radius\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/nodejs/'],
 'title': [u'Node.js\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/svn/'],
 'title': [u'SVN\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/git/'],
 'title': [u'Git\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/makefile/'],
 'title': [u'Makefile']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/unix/'],
 'title': [u'Unix']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/unix_commands/'],
 'title': [u'Linux/Unix\u547d\u4ee4']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/unix_system_calls/'],
 'title': [u'Unix/Linux\u7cfb\u7edf\u8c03\u7528']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/shell/'],
 'title': [u'Shell']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/drools/'],
 'title': [u'Drools\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/linq/'],
 'title': [u'LinQ\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/wcf/'],
 'title': [u'WCF\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/mysql/'],
 'title': [u'MySQL\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/plsql/'],
 'title': [u'PL/SQL\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/postgresql/'],
 'title': [u'PostgreSQL\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/mongodb/'],
 'title': [u'MongoDB\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/sqlite/'],
 'title': [u'SQLite\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/db2/'],
 'title': [u'DB2\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/redis/'],
 'title': [u'Redis\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/memcached/'],
 'title': [u'Memcached\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/access/'],
 'title': [u'Access\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/sql/'],
 'title': [u'SQL\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/sql_server/'],
 'title': [u'SQL Server\u6559\u7a0b']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/java/'],
 'title': [u'Java']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/python/'],
 'title': [u'Python']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/mysql/'],
 'title': [u'MySQL']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/articles'],
 'title': [u'\u6700\u65b0\u6587\u7ae0']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'],
 'link': [u'http://www.demo.com/login/byqq'],
 'title': []}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n', u'\r\n', u'\r\n', u'\r\n', u'\r\n'],
 'link': [],
 'title': []}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n', u'\r\n'], 'link': [], 'title': []}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n'], 'link': [], 'title': []}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n\u5b89\u88c5\xa0', u'&amd64\r\n'],
 'link': [u'http://sourceforge.net/projects/pywin32/'],
 'title': [u'pywin32']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n\u5b89\u88c5 Python2.7.9 \u4ee5\u4e0b\u7684\xa0',
          u'\xa0\u6216\u8005\u4e0b\u8f7d\u5730\u5740\uff1a\xa0',
          u'\r\n'],
 'link': [u'https://pip.pypa.io/en/latest/installing/',
          u'https://pypi.python.org/pypi/setuptools#files',
          u'https://pypi.python.org/pypi/setuptools#files'],
 'title': [u'pip', u'https://pypi.python.org/pypi/setuptools#files', u'\xa0']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n\u60a8\u53ef\u4ee5\u901a\u8fc7\u4f7f\u7528\u4ee5\u4e0b\u547d\u4ee4\u6765\u68c0\u67e5 pip \u7248\u672c\uff1a\r\n',
          u'\r\n'],
 'link': [],
 'title': []}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n\u5b89\u88c5twisted\uff0c\u4e0b\u8f7d\u5730\u5740 -',
          u'\r\n'],
 'link': [u'https://pypi.python.org/packages/2.7/T/Twisted/Twisted-13.0.0.win32-py2.7.msi#md5=c2d453a344f56cf6f77204c5769288c0'],
 'title': [u'https://pypi.python.org/packages/2.7/T/Twisted/Twisted-13.0.0.win32-py2.7.msi#md5=c2d453a344f56cf6f77204c5769288c0']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n\u5b89\u88c5\xa0zope \u63a5\u53e3\uff1a',
          u'\xa0\u9009\u62e9\u5012\u6570\u7b2c\u4e8c\u4e2a\xa0',
          u'\xa0',
          u'\r\n'],
 'link': [u'https://pypi.python.org/pypi/zope.interface/4.1.0',
          u'https://pypi.python.org/packages/2.7/z/zope.interface/zope.interface-4.1.0.win32-py2.7.exe#md5=c0100a3cd6de6ecc3cd3b4d678ec7931'],
 'title': [u'https://pypi.python.org/pypi/zope.interface/4.1.0',
           u'zope.interface-4.1.0.win32-py2.7.exe']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n\u5b89\u88c5 lxml \uff0c\u7248\u672c\u8981\u9009\u5bf9\u5e94\u7cfb\u7edf\uff0c\u9519\u8bef\u7684\u662f\u7528\u4e0d\u4e86\u7684\u3002\u4e0b\u8f7d\u5730\u5740\uff1a\xa0',
          u'\r\n'],
 'link': [u'https://pypi.python.org/pypi/lxml/3.2.3'],
 'title': [u'https://pypi.python.org/pypi/lxml/3.2.3']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n\u8981\u5b89\u88c5scrapy\uff0c\u8fd0\u884c\u4ee5\u4e0b\u547d\u4ee4\uff1a\r\n',
          u'\r\n'],
 'link': [],
 'title': []}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n', u'\r\n'], 'link': [], 'title': []}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n', u'\r\n'], 'link': [], 'title': []}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n', u'\r\n', u'\r\n'], 'link': [], 'title': []}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n\u5b89\u88c5', u'\r\n'],
 'link': [u'http://brew.sh/'],
 'title': [u'homebrew']}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n\u8bbe\u7f6e\u73af\u5883\u53d8\u91cf PATH \u6307\u5b9a\xa0homebrew\xa0\u5305\u5728\u7cfb\u7edf\u8f6f\u4ef6\u5305\u524d\u4f7f\u7528\uff1a\r\n',
          u'\r\n'],
 'link': [],
 'title': []}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n\u53d8\u66f4\u5b8c\u6210\u540e\uff0c\u91cd\u65b0\u52a0\u8f7d .bashrc \u4f7f\u7528\u4e0b\u9762\u7684\u547d\u4ee4\uff1a\r\n',
          u'\r\n'],
 'link': [],
 'title': []}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n\u63a5\u4e0b\u6765\uff0c\u4f7f\u7528\u4e0b\u9762\u7684\u547d\u4ee4\u5b89\u88c5\xa0Python\uff1a\r\n',
          u'\r\n'],
 'link': [],
 'title': []}
2016-10-03 13:11:06 [scrapy] DEBUG: Scraped from <200 http://www.demo.com/scrapy/scrapy_environment.html>
{'desc': [u'\r\n\u63a5\u4e0b\u6765\uff0c\u5b89\u88c5scrapy\uff1a\r\n',
          u'\r\n'],
 'link': [],
 'title': []}
2016-10-03 13:11:06 [scrapy] INFO: Closing spider (finished)
2016-10-03 13:11:06 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 709,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 15401,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 10, 3, 5, 11, 6, 478000),
'item_scraped_count': 210,
'log_count/DEBUG': 214,
'log_count/INFO': 7,
'response_received_count': 3,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2016, 10, 3, 5, 11, 5, 197000)}
2016-10-03 13:11:06 [scrapy] INFO: Spider closed (finished)
D:\first_scrapy>
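The yielded items do not have to be read back out of the log. Scrapy's built-in feed exports can write them to a file instead; a minimal sketch using the same spider (the output filename items.json is arbitrary):

D:\first_scrapy>scrapy crawl first -o items.json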