This article is a summary of Web Scraping with Python (2015).
Book download: https://bitbucket.org/xurongzhong/python-chinese-library/downloads
Source code: https://bitbucket.org/wswp/code
Demo site: http://example.webscraping.com/
Demo site source code: http://bitbucket.org/wswp/places
Recommended Python tutorial: http://www.diveintopython.net
HTML and JavaScript basics:
http://www.w3schools.com
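Introduction to web scraping
When shopping online you may want to compare prices across several sites, which is what the Huihui Shopping Assistant (惠惠购物助手) does. An API makes this easy, but usually there is none, and that is when web scraping is needed.
Scraping data for personal use is generally not illegal, but commercial use or republication requires you to consider authorization, and in either case you should scrape politely. Judging by cases already decided abroad, locations and phone numbers can generally be republished, but original content cannot.
Further reading:
http://www.bvhd.dk/uploads/tx_mocarticles/S_-_og_Handelsrettens_afg_relse_i_Ofir-sagen.pdf
http://www.austlii.edu.au/au/cases/cth/FCA/2010/44.html
http://caselaw.findlaw.com/us-supreme-court/499/340.html
robots.txt and the Sitemap help you understand the size and structure of a site; tools such as Google Search and WHOIS are also useful.
For example: http://example.webscraping.com/robots.txt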
    # section 1
    User-agent: BadCrawler
    Disallow: /

    # section 2
    User-agent: *
    Crawl-delay: 5
    Disallow: /trap

    # section 3
    Sitemap: http://example.webscraping.com/sitemap.xml
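As a minimal sketch of how a crawler might honor these rules, the robots.txt above can be fed to the standard library's `urllib.robotparser`. This assumes Python 3.6 or later (for `crawl_delay`); the user agent name `GoodCrawler` is only a placeholder, and the file is parsed from a string so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Copy of the robots.txt shown above, parsed offline (no network access).
ROBOTS_TXT = """\
# section 1
User-agent: BadCrawler
Disallow: /

# section 2
User-agent: *
Crawl-delay: 5
Disallow: /trap

# section 3
Sitemap: http://example.webscraping.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

BASE = "http://example.webscraping.com"

# "BadCrawler" is banned from the whole site (section 1).
bad_allowed = rp.can_fetch("BadCrawler", BASE + "/")
# "GoodCrawler" is a placeholder name; it falls under the "*" rules (section 2).
good_allowed = rp.can_fetch("GoodCrawler", BASE + "/")
trap_allowed = rp.can_fetch("GoodCrawler", BASE + "/trap")
# Seconds to wait between requests, taken from Crawl-delay.
delay = rp.crawl_delay("GoodCrawler")

print(bad_allowed, good_allowed, trap_allowed, delay)
```

Against a live site, `rp.set_url(...)` followed by `rp.read()` would download the file instead of parsing a string.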
For more about web robots, see http://www.robotstxt.org.
The Sitemap protocol is described at http://www.sitemaps.org/protocol.html. For example:
    http://example.webscraping.com/view/Afghanistan-1
    http://example.webscraping.com/view/Aland-Islands-2
    http://example.webscraping.com/view/Albania-3
    ...
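Sitemaps are often incomplete.
Estimating the size of a site:
Use Google's site: search operator, for example: site:automationtesting.sinaapp.com
Identifying the technology behind a site: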
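The page URLs listed in a sitemap can be collected with a few lines of Python. The sketch below assumes Python 3 and a sitemap that follows the sitemaps.org XML schema; the XML excerpt is an illustrative reconstruction built from the URLs above, not the demo site's actual file, and `sitemap_links` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Illustrative sitemap excerpt (assumed layout, per the sitemaps.org schema).
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.webscraping.com/view/Afghanistan-1</loc></url>
  <url><loc>http://example.webscraping.com/view/Aland-Islands-2</loc></url>
  <url><loc>http://example.webscraping.com/view/Albania-3</loc></url>
</urlset>
"""

# Namespace used by the sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_links(xml_text):
    """Return the URL inside every <loc> element of a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


links = sitemap_links(SITEMAP_XML)
for link in links:
    print(link)
```

Because sitemaps are often incomplete, a list like this is best treated as a starting point for a crawl rather than a full inventory of the site.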
    # pip install builtwith
    # ipython
    In [1]: import builtwith

    In [2]: builtwith.parse('http://automationtesting.sinaapp.com/')
    Out[2]:
    {u'issue-trackers': [u'Trac'],
     u'javascript-frameworks': [u'jQuery'],
     u'programming-languages': [u'Python'],
     u'web-servers': [u'Nginx']}
Finding the owner of a site:
    # pip install python-whois
    # ipython
    In [1]: import whois

    In [2]: print whois.whois('http://automationtesting.sinaapp.com')
    {
      "updated_date": "2016-01-07 00:00:00",
      "status": [
        "serverDeleteProhibited https://www.icann.org/epp#serverDeleteProhibited",
        "serverTransferProhibited https://www.icann.org/epp#serverTransferProhibited",
        "serverUpdateProhibited https://www.icann.org/epp#serverUpdateProhibited"
      ],
      "name": null,
      "dnssec": null,
      "city": null,
      "expiration_date": "2021-06-29 00:00:00",
      "zipcode": null,
      "domain_name": "SINAAPP.COM",
      "country": null,
      "whois_server": "whois.paycenter.com.cn",
      "state": null,
      "registrar": "XIN NET TECHNOLOGY CORPORATION",
      "referral_url": "http://www.xinnet.com",
      "address": null,
      "name_servers": [
        "NS1.SINAAPP.COM",
        "NS2.SINAAPP.COM",
        "NS3.SINAAPP.COM",
        "NS4.SINAAPP.COM"
      ],
      "org": null,
      "creation_date": "2009-06-29 00:00:00",
      "emails": null
    }
A simple crawler looks like this: