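I started learning Scrapy yesterday, so today I'm trying it out. Most of the examples online test against Douban pages, so I picked something different: 天天美剧 (ttmeiju.com).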
# coding: utf-8
import json
import scrapy
from my_scrapy_project.items import DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["ttmeiju.com"]
    start_urls = [
        "http://www.ttmeiju.com/"
    ]

    def parse(self, response):
        for sel in response.xpath("//table[contains(@class,'seedtable')]/tr[contains(@class,'Scontent')]"):
            item = DmozItem()
            title = sel.xpath('td[2]/a/text()').extract()[0]
            link = sel.xpath('td[2]/a/@href').extract()
            download = sel.xpath('td[3]/a/@href').extract()
            item['title'] = title
            item['link'] = link
            item['download'] = download
            yield item
Output:
{"download": ["http://pan.baidu.com/s/1i3CcdQd"], "link": ["http://www.ttmeiju.com/seed/38897.html"], "title": "\n\u840c\u5ba0\u4e5f\u75af\u72c2 Pets Wild at Heart S01E02 HR-HDTV \u5927\u5bb6\u5b57\u5e55\u7ec4 \n \n "},
{"download": ["http://pan.baidu.com/s/1c0rT2pi"], "link": ["http://www.ttmeiju.com/seed/38896.html"], "title": "\n\u840c\u5ba0\u4e5f\u75af\u72c2 Pet Wild at Heart S01E01 HR-HDTV \u5927\u5bb6\u5b57\u5e55\u7ec4 \n \n "},
{"download": ["http://pan.baidu.com/s/1qWp3jgo"], "link": ["http://www.ttmeiju.com/seed/38895.html"], "title": "\n\u840c\u5ba0\u4e5f\u75af\u72c2 Pet Wild At Heart S01E01 \u5927\u5bb6\u5b57\u5e55\u7ec4 \n \n "},
{"download": ["http://pan.baidu.com/s/1bnxvmFl"], "link": ["http://www.ttmeiju.com/seed/38894.html"], "title": "\n\u7231\u306e\u65c5\u9986 The Love Hotel \u5927\u5bb6\u5b57\u5e55\u7ec4 \n \n "},
And the Chinese character encoding problem shows up again... I'll sort it out tomorrow.
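
For reference, here is a minimal sketch of one possible fix, not necessarily the one I'll end up using. The `\uXXXX` sequences in the output are ordinary JSON escapes rather than corrupted data: JSON encoders commonly escape non-ASCII characters by default, and Python's `json.dumps` does so unless `ensure_ascii=False` is passed. The sketch assumes Python 3 and takes one record copied from the output above; the names `raw_line` and `readable` are made up for illustration. It also strips the stray newlines and spaces around the title.

```python
# -*- coding: utf-8 -*-
import json

# One record copied from the output above. The raw string keeps the
# backslashes literal, so json.loads sees the same escapes as in the file.
raw_line = (
    r'{"download": ["http://pan.baidu.com/s/1i3CcdQd"], '
    r'"link": ["http://www.ttmeiju.com/seed/38897.html"], '
    r'"title": "\n\u840c\u5ba0\u4e5f\u75af\u72c2 Pets Wild at Heart S01E02 '
    r'HR-HDTV \u5927\u5bb6\u5b57\u5e55\u7ec4 \n \n "}'
)

# Parsing turns the \uXXXX escapes back into real Chinese characters.
item = json.loads(raw_line)

# Remove the leading/trailing newlines and spaces picked up from the HTML.
item["title"] = item["title"].strip()

# ensure_ascii=False writes non-ASCII characters as-is instead of escaping them.
readable = json.dumps(item, ensure_ascii=False, sort_keys=True)
print(readable)
```

When exporting directly from Scrapy, newer releases also provide a `FEED_EXPORT_ENCODING` setting; putting `FEED_EXPORT_ENCODING = 'utf-8'` in `settings.py` should have the same effect on the JSON feed, assuming the installed version supports it.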