1. Simple configuration: scraping the content of a single page
(1) Create a Scrapy project
scrapy startproject getblog
(2) Edit items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class BlogItem(Item):
    title = Field()
    desc = Field()
(3) Create blog_spider.py under the spiders folder.
You need to get familiar with XPath selection first. It feels a lot like jQuery selectors, though not quite as comfortable to use (w3school tutorial: http://www.w3school.com.cn/xpath/).
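To get a feel for the correspondence, here are a few XPath expressions next to their (roughly) equivalent CSS selectors, written the way you would try them in the Scrapy shell; the class and tag names are just taken from the spider below, for illustration:

response.xpath('//div[@class="post_item"]')    # CSS: response.css('div.post_item')
response.xpath('//h3/a/text()')                # CSS: response.css('h3 a::text')
response.xpath('//a/@href')                    # CSS: response.css('a::attr(href)')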
# coding=utf-8

from scrapy.spider import Spider
from scrapy.selector import Selector

from getblog.items import BlogItem


class BlogSpider(Spider):
    # the spider's unique name
    name = 'blog'
    # starting URLs
    start_urls = ['http://www.cnblogs.com/']

    def parse(self, response):
        sel = Selector(response)    # XPath selector
        # select every div whose class attribute equals 'post_item',
        # then take everything inside its second child div
        sites = sel.xpath('//div[@class="post_item"]/div[2]')
        items = []
        for site in sites:
            item = BlogItem()
            # the text of the a tag under the h3 tag: 'text()'
            item['title'] = site.xpath('h3/a/text()').extract()
            # likewise, the text inside the p tag: 'text()'
            item['desc'] = site.xpath('p[@class="post_item_summary"]/text()').extract()
            items.append(item)
        return items
(4) Run it:

scrapy crawl blog    # that's all it takes
(5) Output to a file.
Configure the output in settings.py:

# output file location
FEED_URI = 'blog.xml'
# output format; can be json, xml or csv
FEED_FORMAT = 'xml'

The output file is written to the project root folder.
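The same export can also be requested from the command line instead of settings.py; a minimal sketch (the output file name here is arbitrary):

scrapy crawl blog -o blog.xml -t xml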
2. The basics: scrapy.spider.Spider
(1) Using the interactive shell
dizzy@dizzy-pc:~$ scrapy shell "http://www.baidu.com/"
2014-08-21 04:09:11+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2014-08-21 04:09:11+0800 [scrapy] INFO: Optional features available: ssl, http11, django
2014-08-21 04:09:11+0800 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled item pipelines:
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6081
2014-08-21 04:09:11+0800 [default] INFO: Spider opened
2014-08-21 04:09:12+0800 [default] DEBUG: Crawled (200) <GET http://www.baidu.com/> (referer: None)
[s] Available Scrapy objects:
[s] crawler
[s] item {}
[s] request <GET http://www.baidu.com/>
[s] response <200 http://www.baidu.com/>
[s] settings
[s] spider
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
>>>
# response.body              returns the full page content
# response.xpath('//ul/li')  lets you test any XPath expression
As the Scrapy docs put it: "More important, if you type response.selector you will access a selector object you can use to query the response, and convenient shortcuts like response.xpath() and response.css() mapping to response.selector.xpath() and response.selector.css()."
In other words, you can conveniently check, interactively, whether an XPath selection is correct. I used to pick selectors with Firefox's F12 tools, but that couldn't guarantee the right content was selected every time.
You can also use:

scrapy shell 'http://scrapy.org' --nolog    # the --nolog flag suppresses log output
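A typical interactive check might then look like the following; the page and XPath are the ones from the blog example above, and the printed results are elided here:

$ scrapy shell 'http://www.cnblogs.com/' --nolog
>>> # verify the XPath actually matches before hard-coding it in the spider
>>> response.xpath('//div[@class="post_item"]/div[2]/h3/a/text()').extract()
[...]
>>> # a (looser) CSS equivalent via the response.css() shortcut
>>> response.css('div.post_item h3 a::text').extract()
[...]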
(2) Example
from scrapy import Spider

from scrapy_test.items import DmozItem


class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/',
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
(3) Save to a file
You can save the scraped items to a file while crawling; the format can be json, xml or csv:

scrapy crawl dmoz -o a.json -t json
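The other formats work the same way; just swap the extension and the format name (dmoz is the spider from the example above):

scrapy crawl dmoz -o a.xml -t xml
scrapy crawl dmoz -o a.csv -t csv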
(4) Generate a spider from a template
scrapy genspider baidu baidu.com
# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = (
        'http://www.baidu.com/',
    )

    def parse(self, response):
        pass
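The generated parse() is only a stub. As a minimal sketch of filling it in (purely a smoke test, not tied to any Item class), you could replace the pass with:

    def parse(self, response):
        # quick check: extract the page title and write it to the log
        title = response.xpath('//title/text()').extract()
        self.log('page title: %s' % title)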
That's it for this part. I remember there were five points originally, but right now I can only recall four.