1. Simple configuration: scraping the content of a single page.
(1) Create a Scrapy project
scrapy startproject getblog
(2) Edit items.py
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class BlogItem(Item):
    title = Field()
    desc = Field()
(3) Create blog_spider.py under the spiders folder.
You'll need to get familiar with XPath selection. It feels a lot like jQuery selectors, though not quite as comfortable to use (w3school tutorial: http://www.w3school.com.cn/xpath/).
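Before writing the spider, it can help to try the expressions on a small snippet. A minimal standalone sketch follows; the HTML here is made up, and cnblogs' real markup may differ:

# A standalone sketch of the XPath expressions used in the spider below.
# The HTML snippet is hypothetical; the real cnblogs markup may differ.
from scrapy.selector import Selector

html = '''
<div class="post_item">
  <div>digg</div>
  <div>
    <h3><a href="/p/1">Post title</a></h3>
    <p class="post_item_summary">Post summary...</p>
  </div>
</div>
'''

sel = Selector(text=html)
# select the second child div of every div whose class is 'post_item'
for site in sel.xpath('//div[@class="post_item"]/div[2]'):
    print(site.xpath('h3/a/text()').extract())    # roughly: [u'Post title']
    print(site.xpath('p[@class="post_item_summary"]/text()').extract())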
# coding=utf-8
from scrapy.spider import Spider
from scrapy.selector import Selector
from getblog.items import BlogItem


class BlogSpider(Spider):
    # spider name
    name = 'blog'
    # start URL
    start_urls = ['http://www.cnblogs.com/']

    def parse(self, response):
        sel = Selector(response)
        # XPath selector:
        # select the second child div (with all its contents) of every
        # div whose class attribute is 'post_item'
        sites = sel.xpath('//div[@class="post_item"]/div[2]')
        items = []
        for site in sites:
            item = BlogItem()
            # the text content, text(), of the a tag under the h3 tag
            item['title'] = site.xpath('h3/a/text()').extract()
            # likewise, the text content of the p tag
            item['desc'] = site.xpath('p[@class="post_item_summary"]/text()').extract()
            items.append(item)
        return items
(4) Run:
scrapy crawl blog    # that's all it takes
(5) Output to a file.
Configure the feed export in settings.py:
# output file location
FEED_URI = 'blog.xml'
# output format: can be json, xml, or csv
FEED_FORMAT = 'xml'
The output file is written to the project's root folder.
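Incidentally, the same export can be requested per run on the command line instead of via settings.py, using the -o and -t flags that appear again later in this post:

scrapy crawl blog -o blog.xml -t xml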
2. The basics: scrapy.spider.Spider
(1) Using the interactive shell
dizzy@dizzy-pc:~$ scrapy shell "http://www.baidu.com/"
2014-08-21 04:09:11+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2014-08-21 04:09:11+0800 [scrapy] INFO: Optional features available: ssl, http11, django
2014-08-21 04:09:11+0800 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled item pipelines:
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6081
2014-08-21 04:09:11+0800 [default] INFO: Spider opened
2014-08-21 04:09:12+0800 [default] DEBUG: Crawled (200) <GET http://www.baidu.com/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler
[s]   item       {}
[s]   request
[s]   response   <200 http://www.baidu.com/>
[s]   settings
[s]   spider
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>

# response.body              the full content of the response
# response.xpath('//ul/li')  lets you test any XPath expression

As the Scrapy docs put it: more important, if you type response.selector you will access a selector object you can use to query the response, and convenient shortcuts like response.xpath() and response.css() mapping to response.selector.xpath() and response.selector.css().
In other words, you can conveniently check, interactively, whether an XPath selection is correct. I used to pick elements with Firefox's F12 developer tools, but that didn't guarantee the content would actually be selected correctly every time.
You can also use:
scrapy shell 'http://scrapy.org' --nolog    # the --nolog flag suppresses the log output
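Once the shell has a page loaded, you can try selectors directly. A few things worth typing (a sketch; the actual results depend on the page's markup):

>>> response.xpath('//title/text()').extract()   # test an XPath expression
>>> response.css('title::text').extract()        # the same query via the CSS shortcut
>>> fetch('http://www.cnblogs.com/')             # fetch another URL without leaving the shell
>>> view(response)                               # open the current response in a browser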
(2) An example
from scrapy import Spider
from scrapy_test.items import DmozItem


class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/',
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
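The DmozItem imported above is not shown in this post; judging from the fields the spider fills in, scrapy_test/items.py would look roughly like this (a sketch, not the original file):

# scrapy_test/items.py (a sketch inferred from the fields used by DmozSpider)
from scrapy.item import Item, Field


class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()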
(3) Saving to a file
You can save the scraped items to a file from the command line; the format can be json, xml, or csv:
scrapy crawl dmoz -o a.json -t json
(4) Creating a spider from a template
scrapy genspider baidu baidu.com

# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = (
        'http://www.baidu.com/',
    )

    def parse(self, response):
        pass
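The generated parse() is only a stub. As an illustration of filling it in (the XPath here is generic, not tied to Baidu's actual markup):

# -*- coding: utf-8 -*-
# A sketch of completing the generated stub; illustrative only.
import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = (
        'http://www.baidu.com/',
    )

    def parse(self, response):
        # extract and log the page <title> as a sanity check
        title = response.xpath('//title/text()').extract()
        self.log('page title: %s' % title)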
I'll leave this part here for now. I remember there were five points originally, but at the moment I can only recall four.