scrapy startproject project_name
The project is now created.
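Its structure follows the standard skeleton that startproject generates (exact files vary slightly by Scrapy version; project_name here matches the command above):

project_name/
    scrapy.cfg                # deploy configuration
    project_name/
        __init__.py
        items.py              # item definitions go here
        pipelines.py          # item pipelines
        settings.py           # project settings
        spiders/              # spider modules go here
            __init__.py

Next, in items.py, define two items: one for the book itself and one for each chapter.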
from scrapy import Item, Field

class BookName(Item):
    # The book's title and the URL of its index page
    name = Field()
    url = Field()

class BookContent(Item):
    # One chapter: its numeric id, title, and body text
    id = Field()
    title = Field()
    text = Field()
Then add a file book.py inside the spiders folder and write the following code:
import re

from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

from project_name.items import BookName, BookContent  # adjust to your project's package name

num = 0  # module-level counter, incremented in bookContent_item to show progress

class BookSpider(CrawlSpider):
    name = "book"  # spider name
    allowed_domains = ["shicimingju.com"]  # allowed domain
    start_urls = [  # where crawling starts
        "http://www.shicimingju.com/book/sanguoyanyi.html"
    ]
I went straight for CrawlSpider, a spider class that makes it convenient to declare rules for which links to follow. In each Rule, allow is a regular expression for the URLs to crawl, and callback names the method used to parse pages matching that rule.
    rules = (
        # The book's index page, parsed by bookMenu_item
        Rule(LinkExtractor(allow=('http://www.shicimingju.com/book/sanguoyanyi.html',)),
             callback='bookMenu_item'),
        # Chapter pages, parsed by bookContent_item
        Rule(LinkExtractor(allow=r'http://www.shicimingju.com/book/.*?\d*?\.html'),
             callback='bookContent_item'),
    )
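As a quick sanity check on the second rule's pattern, the snippet below tests it against two URLs (the chapter URL format is my assumption about the site's layout, not something stated above):

import re

pattern = r'http://www.shicimingju.com/book/.*?\d*?\.html'
# An assumed chapter URL of the form /book/<name>/<n>.html matches:
print(bool(re.search(pattern, 'http://www.shicimingju.com/book/sanguoyanyi/1.html')))  # True
# The index page matches too, but CrawlSpider applies rules in the order
# they are defined, so the index page is still handled by the first rule:
print(bool(re.search(pattern, 'http://www.shicimingju.com/book/sanguoyanyi.html')))  # True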
    def bookMenu_item(self, response):
        sel = Selector(response)
        bookName_items = []
        bookName_item = BookName()
        # The book title lives in the <h1> inside the #bookinfo block
        bookName_item['name'] = sel.xpath('//*[@id="bookinfo"]/h1/text()').extract()
        print(bookName_item)
        bookName_items.append(bookName_item)
        return bookName_items
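One thing to note: .extract() returns a list, so name above ends up as a one-element list rather than a string. If a plain string is wanted, extract_first() does that (a minimal variant of the line above):

        # First match as a string, or None if nothing matched
        bookName_item['name'] = sel.xpath('//*[@id="bookinfo"]/h1/text()').extract_first()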
    def bookContent_item(self, response):
        global num  # counter defined at module level, just to watch progress
        print(num)
        num += 1
        sel = Selector(response)
        bookContent_items = []
        bookContent_item = BookContent()
        # findall returns a list of the digit runs found in the URL
        bookContent_item['id'] = re.findall(r'.*?(\d+).*?', response.url)
        bookContent_item['title'] = sel.xpath('//*[@id="con"]/h2/text()').extract()
        # Join all text nodes under #con2's paragraphs into one string
        bookContent_item['text'] = "".join(sel.xpath('//*[@id="con2"]/p/descendant::text()').extract())
        # Replace non-breaking spaces with ordinary spaces
        bookContent_item['text'] = re.sub('\xa0', ' ', bookContent_item.get('text'))
        print(bookContent_item)
        bookContent_items.append(bookContent_item)
        return bookContent_items
Finally, run the spider from the project root:

scrapy crawl book
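As written, the spider only prints the items it returns. If you just want them saved to disk, Scrapy's built-in feed export can do it straight from the command line, no pipeline required (the output file name is arbitrary):

scrapy crawl book -o book.json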