Tips for Scraping Sites with a Python Crawler: Advanced Part

admin

2023-07-30 22:42:48

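I once wrote an article, "Tips for Scraping Sites with a Python Crawler", which summarized many crawler techniques. That piece still looks quite useful today, but I was a beginner back then (I still am, though I have improved a lot), and much of it was far from optimal: it sat at the "it merely works" level. This advanced part aims to move from "it works" to "it works with little effort and little worry".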

I. gzip/deflate Support

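Most sites today serve pages with gzip compression, which often saves a great deal of transfer time. Take the VeryCD home page as an example: 247K uncompressed and 45K compressed, roughly 1/5 of the original size. As far as transfer time is concerned, that means a fetch can be about 5 times faster.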
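As a rough illustration of where such a ratio comes from, here is a minimal sketch that gzips a made-up, repetitive HTML page and compares sizes. The sample markup is hypothetical and the sketch is written for Python 3; only the 247K and 45K figures come from the text above.

```python
import gzip

# Hypothetical sample page: repetitive markup, as real HTML tends to be.
rows = ["<li class='topic'><a href='/topics/%d/'>Topic %d</a></li>" % (i, i)
        for i in range(2000)]
page = ("<html><body><ul>\n" + "\n".join(rows) + "\n</ul></body></html>").encode("utf-8")

packed = gzip.compress(page)
ratio = len(packed) / len(page)
print("raw: %d bytes, gzip: %d bytes, ratio: %.2f" % (len(page), len(packed), ratio))

# The figures quoted above: 45K compressed out of 247K uncompressed.
article_ratio = 45 / 247
print("ratio from the quoted figures: %.2f" % article_ratio)
```

The exact ratio depends on the page content; highly repetitive markup compresses better than text that is already compact.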

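However, Python's urllib/urllib2 do not support compression by default. To receive a compressed response you must state 'Accept-Encoding' in the request header, and after reading the response you must also check the headers for a 'Content-Encoding' entry to decide whether the body needs decoding. That is tedious and fiddly. So how can urllib2 be made to support gzip and deflate automatically?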
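To make the manual procedure concrete, here is a minimal sketch of what the caller has to do by hand. It assumes Python 3, where urllib2 became urllib.request; the function names build_request, decode_body and fetch are hypothetical, and the deflate branch assumes a zlib-wrapped stream.

```python
import gzip
import urllib.request
import zlib


def build_request(url):
    """Ask the server for a compressed body."""
    return urllib.request.Request(url, headers={"Accept-Encoding": "gzip, deflate"})


def decode_body(raw, content_encoding):
    """Undo the compression named in the Content-Encoding response header."""
    encoding = (content_encoding or "").lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        return zlib.decompress(raw)  # assumption: zlib-wrapped stream
    return raw  # no Content-Encoding: the body is already plain


def fetch(url):
    """Both steps together; every call site would have to repeat this."""
    with urllib.request.urlopen(build_request(url)) as resp:
        return decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```

Having to route every request through a helper like fetch is the inconvenience that a handler-based approach avoids.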

In fact, you can subclass BaseHandler and plug the subclass in through build_opener:

    import urllib2
    import zlib
    from gzip import GzipFile
    from StringIO import StringIO

    class ContentEncodingProcessor(urllib2.BaseHandler):
        """A handler to add gzip capabilities to urllib2 requests"""

        # add headers to requests
        def http_request(self, req):
            req.add_header("Accept-Encoding", "gzip, deflate")
            return req

        # decode
        def http_response(self, req, resp):
            old_resp = resp
            # gzip
            if resp.headers.get("content-encoding") == "gzip":
                gz = GzipFile(fileobj=StringIO(resp.read()), mode="r")
                resp = urllib2.addinfourl(gz, old_resp.headers, old_resp.url, old_resp.code)
                resp.msg = old_resp.msg
            # deflate
            if resp.headers.get("content-encoding") == "deflate":
                gz = StringIO(deflate(resp.read()))
                # addinfourl: 'class to add info() and geturl() methods to an open file.'
                resp = urllib2.addinfourl(gz, old_resp.headers, old_resp.url, old_resp.code)
                resp.msg = old_resp.msg
            return resp

    # deflate support
    def deflate(data):
        # zlib only provides the zlib compress format, not the deflate format,
        # so on top of all there's this workaround:
        try:
            return zlib.decompress(data, -zlib.MAX_WBITS)
        except zlib.error:
            return zlib.decompress(data)

After that it is simple:

    encoding_support = ContentEncodingProcessor
    opener = urllib2.build_opener(encoding_support, urllib2.HTTPHandler)

    # Open the page directly with the opener; if the server supports
    # gzip/deflate, the response is decompressed automatically.
    content = opener.open(url).read()
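The listings in this section target Python 2. For readers on Python 3, the same idea might be sketched as follows. This is an adaptation under stated assumptions, not the code from this article: urllib.request replaces urllib2, the name inflate and the fake_response helper are hypothetical, and the check at the bottom runs entirely in memory so that no network access is needed.

```python
import gzip
import io
import urllib.request
import urllib.response
import zlib
from email.message import Message


def inflate(data):
    """Decode a "deflate" body, whether it is raw or zlib-wrapped."""
    try:
        return zlib.decompress(data, -zlib.MAX_WBITS)  # raw deflate stream
    except zlib.error:
        return zlib.decompress(data)                   # zlib-wrapped stream


class ContentEncodingProcessor(urllib.request.BaseHandler):
    """Request gzip/deflate and transparently decode the response body."""

    def http_request(self, req):
        req.add_header("Accept-Encoding", "gzip, deflate")
        return req

    def http_response(self, req, resp):
        encoding = (resp.headers.get("Content-Encoding") or "").lower()
        if encoding == "gzip":
            body = gzip.decompress(resp.read())
        elif encoding == "deflate":
            body = inflate(resp.read())
        else:
            return resp  # not compressed: pass through untouched
        new_resp = urllib.response.addinfourl(
            io.BytesIO(body), resp.headers, resp.url, resp.code)
        new_resp.msg = getattr(resp, "msg", "")
        return new_resp

    # Reuse the same logic for HTTPS URLs.
    https_request = http_request
    https_response = http_response


def fake_response(body, encoding):
    """Hypothetical helper: build an in-memory response for an offline check."""
    headers = Message()
    headers["Content-Encoding"] = encoding
    return urllib.response.addinfourl(
        io.BytesIO(body), headers, "http://example.invalid/", 200)


handler = ContentEncodingProcessor()
page = b"<html>hello</html>"

# A raw deflate stream (no zlib header), which some servers send as "deflate".
raw_deflate = zlib.compressobj(wbits=-zlib.MAX_WBITS)
raw_body = raw_deflate.compress(page) + raw_deflate.flush()

decoded_gzip = handler.http_response(None, fake_response(gzip.compress(page), "gzip")).read()
decoded_raw = handler.http_response(None, fake_response(raw_body, "deflate")).read()
decoded_zlib = handler.http_response(None, fake_response(zlib.compress(page), "deflate")).read()
```

In real use the handler would be passed to urllib.request.build_opener(ContentEncodingProcessor), and pages would be read through the returned opener's open() method.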

II. Easier Multithreading

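The earlier summary article did mention a simple multithreading template, but actually applying it in a program only leaves the program fragmented and hard to read. So I put some thought into how to make multithreading more convenient. First, what would be the most convenient way to make multithreaded calls?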

1. Asynchronous I/O fetching with twisted

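In fact, more efficient fetching does not necessarily require multithreading; asynchronous I/O works too. Call twisted's getPage method directly, then attach a callback and an errback to run when the asynchronous I/O finishes. For example: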

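    from twisted.web.client import getPage
    from twisted.internet import reactor

    links = ['http://www.verycd.com/topics/%d/' % i for i in range(5420, 5430)]

    def parse_page(data, url):
        print len(data), url

    def fetch_error(error, url):
        print error.getErrorMessage(), url

    # The listing is cut off at this point in the source. The lines below are a
    # reconstruction from the description above (getPage plus a callback and an
    # errback); the 5-second shutdown delay is an assumption.
    for url in links:
        d = getPage(url)
        d.addCallback(parse_page, url)   # called with the page data on success
        d.addErrback(fetch_error, url)   # called with the failure on error

    reactor.callLater(5, reactor.stop)   # stop the reactor after 5 seconds
    reactor.run()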
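For readers who want to try the success/failure callback shape without installing twisted, here is a rough analogue using asyncio from the Python 3 standard library. This is a swapped-in technique, not the twisted API used in this article. fake_fetch is a hypothetical stand-in for a real network call, with one URL made to fail on purpose, so the sketch runs offline.

```python
import asyncio


async def fake_fetch(url):
    """Hypothetical stand-in for a network fetch; one URL fails on purpose."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    if url.endswith("/5425/"):
        raise IOError("simulated timeout")
    return b"<html>" + url.encode("ascii") + b"</html>"


async def crawl(urls):
    pages, errors = {}, {}

    async def fetch_one(url):
        try:
            data = await fake_fetch(url)
        except Exception as exc:
            errors[url] = str(exc)   # plays the role of the errback
        else:
            pages[url] = len(data)   # plays the role of the callback

    # Start all fetches together and wait until every one has finished.
    await asyncio.gather(*(fetch_one(u) for u in urls))
    return pages, errors


links = ["http://www.verycd.com/topics/%d/" % i for i in range(5420, 5430)]
pages, errors = asyncio.run(crawl(links))
print(len(pages), "pages fetched,", len(errors), "failed")
```

As in the twisted version, all requests are in flight at once on a single thread, and each result is routed to either the success path or the failure path.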
