Feature Analysis
There are plenty of Douban movie-ranking scrapers around, but I'm building an H5 gallery that needs a large amount of movie artwork, and the images on the list and detail pages aren't sharp enough, so I decided to scrape Douban's original-size images. Viewing an original appears to require logging in, yet after logging in the request for the original image carried no cookie at all, so I tried constructing the URLs directly and got 302 redirects. Re-analyzing the request headers, I noticed a Referer field and simply set it to the Douban homepage; annoyingly, that didn't work either. It turned out each original image has to be requested with the URL of that photo's own medium-size thumbnail page as the Referer, so Douban clearly does fairly strict filtering on the server side. I've set up image hotlink protection on nginx myself, but never anything this sneaky; well, if they play it that way, so will I: construct the request headers one by one and fetch the originals with Python.
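The core of the trick in isolation: build the raw-image URL from the photo id embedded in the thumbnail's file name, and send that photo's own page as the Referer. A minimal sketch of the idea (the id 2561716440 below is just a placeholder; the full script further down parses the id out of each thumbnail src):

import urllib2

photo_id = '2561716440'  # placeholder id; in the real script this is parsed from the thumbnail URL
img_url = 'https://img3.doubanio.com/view/photo/raw/public/p' + photo_id + '.jpg'
req = urllib2.Request(img_url, headers={
    'User-Agent': 'Mozilla/5.0',
    # the Referer must be this photo's own page, not just the Douban homepage
    'Referer': 'https://movie.douban.com/photos/photo/' + photo_id + '/',
})
open('poster.jpg', 'wb').write(urllib2.urlopen(req).read())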
Required Modules
Page fetching: urllib2
HTML parsing: pyquery (a module that lets you parse HTML with jQuery-style selectors)
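To give a feel for why pyquery is convenient here, the selectors used later map one-to-one onto jQuery-style calls. A tiny sketch against a made-up HTML fragment:

from pyquery import PyQuery as pq

doc = pq('<ol class="grid_view"><li><div class="hd"><a>Some Movie</a></div></li></ol>')
for li in doc('.grid_view>li'):       # CSS selector, jQuery style
    print pq(li)('.hd>a').text()      # prints the movie title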
Below is the code. (It's been a while since I wrote any Python, so I'm a bit rusty ฅʕ•̫͡•ʔฅ. The images come to roughly 200 MB in total; scraping the list pages is quick, and the last step is plain I/O downloading, so overall speed mostly depends on your network.)
# coding:utf-8
# Scrape the Douban Top 250 list and download each movie's original poster.
# Python 2: urllib2 for HTTP, pyquery for jQuery-style HTML parsing.
import urllib2
import re
import sys
from pyquery import PyQuery as pq

# List page URL pattern: http://movie.douban.com/top250?start=0&filter=&type=
class Douban:
    def __init__(self):
        reload(sys)
        sys.setdefaultencoding('utf-8')
        self.start = 0
        self.param = '&filter=&type='
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.113 Safari/537.36'}
        self.movieList = []
        self.filePath = './img/'
        # Raw-image URL prefix and the photo-page prefix used as Referer.
        self.imgpath = 'https://img3.doubanio.com/view/photo/raw/public/'
        self.refer = 'https://movie.douban.com/photos/photo/'

    def getPage(self):
        # Fetch one page (25 movies) of the Top 250 list.
        print '---getpagestart---'
        try:
            URL = 'http://movie.douban.com/top250?start=' + str(self.start)
            request = urllib2.Request(url=URL, headers=self.headers)
            response = urllib2.urlopen(request)
            page = response.read().decode('utf-8')
            pageNum = (self.start + 25) / 25
            print 'scraping page ' + str(pageNum) + '...'
            self.start += 25
            return page
        except urllib2.URLError, e:
            if hasattr(e, 'reason'):
                print 'failed reason', e.reason

    def htmlparse(self):
        # Walk all ten list pages and collect name, description,
        # original-image URL and the Referer needed to fetch it.
        print '---getMoviestart---'
        while self.start < 250:
            page = self.getPage()
            html = pq(page)
            items = html(".grid_view>li")
            for item in items:
                item = pq(item)
                info = {}
                info['name'] = item(".hd>a").text()
                info['des'] = item(".bd p:first").text()
                info['img'] = item(".pic img").attr('src')
                # The thumbnail src ends in pXXXXXXX.jpg; capture the file name
                # and the numeric photo id to build the raw-image URL and Referer.
                group = re.findall(r'/(p(\d+)\.jpg)', info['img'])
                info['img'] = self.imgpath + group[0][0]
                info['refer'] = self.refer + group[0][1] + '/'
                self.movieList.append([info['name'], info['des'], info['img'], info['refer']])

    def hook(self):
        # Write the movie list to a text file and download each original image.
        mfile = open(self.filePath + 'movielist.txt', 'w')
        try:
            for index, movie in enumerate(self.movieList):
                print movie[0].encode('gbk', 'ignore')
                self.downImg(movie[2], movie[3], self.filePath + 'movie' + str(index + 1) + '.jpg')
                mfile.write(str(index + 1) + '、' + movie[0] + '\n' + movie[1] + '\n')
            print 'write done'
        finally:
            mfile.close()

    def downImg(self, URL, refer, imgpath):
        # Download one original image; the per-photo Referer is what gets past
        # Douban's hotlink protection.
        head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.113 Safari/537.36'}
        head['Referer'] = refer
        request = urllib2.Request(url=URL, headers=head)
        try:
            f = open(imgpath, 'wb')
            res = urllib2.urlopen(request).read()
            f.write(res)
            f.close()
        except urllib2.URLError, e:
            if hasattr(e, 'reason'):
                print 'failed reason', e.reason

    def main(self):
        print '---mainstart---'
        self.htmlparse()
        print len(self.movieList)
        self.hook()

DB = Douban()
DB.main()
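One practical note: the script writes into ./img/ but never creates that directory, so create it first (or add something along these lines near the top of main), otherwise the open() calls will fail with an IOError:

import os
if not os.path.exists('./img/'):
    os.makedirs('./img/')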
Results
(Screenshots of the run, posted via the Jianshu app.)