Scrapy Notes (2): Scraping the Tiantian Meiju Home Page

admin

2023-07-30 20:41:45

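I started learning Scrapy yesterday, and today I am trying it out. Most examples online use Douban pages for testing, so I picked something different: Tiantian Meiju (ttmeiju.com).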

# coding: utf-8
import json
import scrapy
from my_scrapy_project.items import DmozItem


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["ttmeiju.com"]
    start_urls = [
        "http://www.ttmeiju.com/",
    ]

    # parse must be a method of the spider class, so it is indented under it
    def parse(self, response):
        # One row per resource in the "latest resources" table
        for sel in response.xpath("//table[contains(@class,'seedtable')]/tr[contains(@class,'Scontent')]"):
            item = DmozItem()
            # Title text of the release (first text node of the link)
            title = sel.xpath('td[2]/a/text()').extract()[0]
            # Detail page link and download links, both kept as lists
            link = sel.xpath('td[2]/a/@href').extract()
            download = sel.xpath('td[3]/a/@href').extract()
            item['title'] = title
            item['link'] = link
            item['download'] = download
            yield item

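response.xpath("//table[contains(@class,'seedtable')]/tr[contains(@class,'Scontent')]") selects the latest-resources section of the Tiantian Meiju home page. It matches every tr whose class contains Scontent, inside the table whose class contains seedtable.
sel.xpath('td[2]/a/text()').extract()[0] selects the name of the release. You can find it by inspecting the element or viewing the page source.
sel.xpath('td[3]/a/@href').extract() returns the download links for the resource.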
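
To make the path logic above easier to follow without running a crawl, here is a minimal sketch using only the standard library's xml.etree.ElementTree. It is not the spider's code: the SAMPLE markup is a hypothetical, simplified stand-in for the real page, extract_rows is a made-up helper name, and ElementTree has no contains(), so the sketch matches the class attribute exactly instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; the real ttmeiju.com page is more complex.
SAMPLE = """
<html><body>
<table class="seedtable">
  <tr class="Scontent">
    <td>1</td>
    <td><a href="http://www.ttmeiju.com/seed/38897.html">
Pets Wild at Heart S01E02
    </a></td>
    <td><a href="http://pan.baidu.com/s/1i3CcdQd">pan</a></td>
  </tr>
  <tr class="header"><td>ignored</td><td>ignored</td><td>ignored</td></tr>
</table>
</body></html>
"""


def extract_rows(markup):
    """Collect title, link and download links from each matching row."""
    root = ET.fromstring(markup.strip())
    rows = []
    # Same idea as the spider's XPath, but with exact class matches
    for tr in root.findall(".//table[@class='seedtable']/tr[@class='Scontent']"):
        title_link = tr.find("td[2]/a")  # second cell holds the title link
        rows.append({
            "title": title_link.text.strip(),
            "link": [title_link.get("href")],
            # third cell may hold several download links
            "download": [a.get("href") for a in tr.findall("td[3]/a")],
        })
    return rows


rows = extract_rows(SAMPLE)
print(rows)
```

Only the row with class Scontent is picked up; the other row is skipped, in the same way the spider skips rows outside the resource list. Unlike the spider, this sketch strips the whitespace around the title.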

Output:

{"download": ["http://pan.baidu.com/s/1i3CcdQd"], "link": ["http://www.ttmeiju.com/seed/38897.html"], "title": "\n\u840c\u5ba0\u4e5f\u75af\u72c2 Pets Wild at Heart S01E02 HR-HDTV \u5927\u5bb6\u5b57\u5e55\u7ec4                    \n                    \n                    "},
{"download": ["http://pan.baidu.com/s/1c0rT2pi"], "link": ["http://www.ttmeiju.com/seed/38896.html"], "title": "\n\u840c\u5ba0\u4e5f\u75af\u72c2 Pet Wild at Heart S01E01 HR-HDTV \u5927\u5bb6\u5b57\u5e55\u7ec4                    \n                    \n                    "},
{"download": ["http://pan.baidu.com/s/1qWp3jgo"], "link": ["http://www.ttmeiju.com/seed/38895.html"], "title": "\n\u840c\u5ba0\u4e5f\u75af\u72c2 Pet Wild At Heart S01E01 \u5927\u5bb6\u5b57\u5e55\u7ec4                    \n                    \n                    "},
{"download": ["http://pan.baidu.com/s/1bnxvmFl"], "link": ["http://www.ttmeiju.com/seed/38894.html"], "title": "\n\u7231\u306e\u65c5\u9986 The Love Hotel \u5927\u5bb6\u5b57\u5e55\u7ec4                    \n                    \n                    "},

The Chinese character encoding problem has shown up again... I will deal with it tomorrow.
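
One possible reading of the output above is that the \uXXXX sequences are ordinary JSON escapes for non-ASCII characters, not corrupted text. The sketch below uses only the standard json module and a shortened copy of the first title. It shows the escapes decoding back to Chinese, and json.dumps writing the characters directly when ensure_ascii=False is passed. How this maps onto the Scrapy export is an assumption. Newer Scrapy versions have a FEED_EXPORT_ENCODING setting that may be the more direct fix.

```python
import json

# A shortened title from the output, still in its escaped JSON form.
raw = '{"title": "\\n\\u840c\\u5ba0\\u4e5f\\u75af\\u72c2 Pets Wild at Heart S01E02   \\n   "}'

item = json.loads(raw)          # the JSON parser decodes the \uXXXX escapes
title = item["title"].strip()   # also drop the surrounding newlines and spaces

escaped = json.dumps({"title": title})                        # default: ASCII-only output
readable = json.dumps({"title": title}, ensure_ascii=False)   # keeps the Chinese characters

print(escaped)
print(readable)
```

The data is the same in both forms. Only the way it is written to the file differs.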
