Published 2016-01-15 03:34:02 | Source: PHPERZ
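Scrapy: a Python crawling framework
Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python. It crawls websites and extracts structured data from their pages, and is widely used for data mining, monitoring, and automated testing.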
import re

import scrapy


class TestSmtSpider(scrapy.Spider):
    name = "test_smt"
    # allowed_domains = ["http://www.maiziedu.com/"]
    start_urls = ('http://www.maiziedu.com/',)

    def parse(self, response):
        # Submit the login form directly as a POST request.
        return scrapy.FormRequest(
            url='http://www.maiziedu.com/user/login/',
            formdata={'account_l': '******', 'password_l': '****'},
            callback=self.after_l,
        )

    def after_l(self, response):
        # Scrapy's cookie middleware (enabled by default) carries the
        # login session over to the next request.
        print(response.body)
        return scrapy.Request(
            url='http://www.maiziedu.com/user/center/?source=login',
            callback=self.after_lo,
        )

    def after_lo(self, response):
        rel = r'class="dt-username"([\s\S]*?)v5-icon v5-icon-rd'
        # The original called an undefined helper, re_fuc(response.body, rel);
        # re.findall is assumed to be equivalent here.
        my_name = re.findall(rel, response.text)[0]
        print('*******************', my_name)
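The last callback above indexes the regex result with [0], which raises IndexError when the login did not succeed and the username markup is missing. A minimal sketch of a safer lookup follows. The helper name first_match and the sample HTML are hypothetical and used only for illustration; the pattern is the one from the spider.

```python
import re

# Pattern taken from the spider above.
USERNAME_PATTERN = r'class="dt-username"([\s\S]*?)v5-icon v5-icon-rd'


def first_match(text, pattern):
    """Return the first captured group, or None when nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Hypothetical markup; the real page structure is not shown in the post.
sample_html = '<a class="dt-username">alice<i class="v5-icon v5-icon-rd"></i></a>'

raw = first_match(sample_html, USERNAME_PATTERN)
missing = first_match('<html></html>', USERNAME_PATTERN)
```

Note that the lazy group captures everything between the two anchors, surrounding markup included, so the raw value usually needs further cleanup before it is a bare username.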
Another approach:
It is usual for web sites to provide pre-populated form fields through <input type="hidden"> elements, such
as session related data or authentication tokens (for login pages). When scraping, you’ll want these fields to be
automatically pre-populated and only override a couple of them, such as the user name and password. You can use the
FormRequest.from_response() method for this job. Here’s an example spider which uses it:
import scrapy


class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        # check that login succeeded before going on
        # (response.text is a str; response.body is bytes on Python 3)
        if "authentication failed" in response.text:
            self.logger.error("Login failed")
            return
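
To make the pre-population idea concrete, here is a simplified, standard-library-only sketch of what "keep the form's existing fields and override a couple of them" means. It is not Scrapy's implementation; the names FormFieldCollector and build_formdata, the csrf_token field, and the sample page are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs from <input> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            # Inputs without a value attribute default to an empty string.
            self.fields[name] = attributes.get("value") or ""


def build_formdata(html, overrides):
    """Merge the fields found in the page with caller-supplied overrides."""
    collector = FormFieldCollector()
    collector.feed(html)
    merged = dict(collector.fields)
    merged.update(overrides)
    return merged


# Hypothetical login page with one hidden token field.
login_page = """
<form action="/users/login.php" method="post">
  <input type="hidden" name="csrf_token" value="abc123">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""

formdata = build_formdata(login_page, {"username": "john", "password": "secret"})
```

The hidden csrf_token value survives the merge while username and password are replaced, which is the behaviour the quoted passage describes for FormRequest.from_response().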