I had long heard how great Node.js's asynchronous strategy is and how fast its I/O is... all kinds of praise. So today I decided to put Node.js and Python side by side and compare them myself. Nothing shows off an async strategy and I/O performance better than a crawler, so let's settle it with a crawler project.

The crawler project

The target is the "currently funding" listing on zhongchou.com: http://www.zhongchou.com/brow… Using this site as the example, we crawl every project that is currently crowdfunding, collect the URL of each project's detail page, and write the URLs to a txt file.

Head-to-head comparison

Python baseline version

# -*- coding:utf-8 -*-
'''
Created on 20160827
@author: qiukang
'''
import requests, time
from BeautifulSoup import BeautifulSoup    # HTML parser (BeautifulSoup 3)

# Request headers
headers = {
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
   'Accept-Encoding': 'gzip, deflate, sdch',
   'Accept-Language': 'zh-CN,zh;q=0.8',
   'Connection': 'keep-alive',
   'Host': 'www.zhongchou.com',
   'Upgrade-Insecure-Requests': '1',
   'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36'
}

# Collect the detail-page URL of every project; returns how many were found
def getItems(allpage):
    no = 0
    items = open('pystandard.txt', 'a')
    for page in range(allpage):
        if page == 0:
            url = 'http://www.zhongchou.com/browse/di'
        else:
            url = 'http://www.zhongchou.com/browse/di-p' + str(page + 1)
        # print url  # ①
        r1 = requests.get(url, headers=headers)
        html = r1.text.encode('utf8')
        soup = BeautifulSoup(html)
        lists = soup.findAll(attrs={"class": "ssCardItem"})
        for i in range(len(lists)):
            href = lists[i].a['href']
            items.write(href + "\n")
            no += 1
    items.close()
    return no

if __name__ == '__main__':
    start = time.clock()
    allpage = 30
    no = getItems(allpage)
    end = time.clock()
    print('it takes %s Seconds to get %s items ' % (end - start, no))

Results of 5 runs:

it takes 48.1727159614 Seconds to get 720 items
it takes 45.3397999415 Seconds to get 720 items
it takes 44.4811429862 Seconds to get 720 items
it takes 44.4619293082 Seconds to get 720 items
it takes 46.669706593 Seconds to get 720 items
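
That averages out to roughly 45.8 seconds for 720 project URLs, with every one of the 30 listing pages fetched one after another in a single thread.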

Python multithreaded version
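
Since the baseline version spends most of its time waiting on the network, the obvious next step is to overlap those waits by splitting the 30 listing pages across worker threads. Below is a minimal sketch of one way to do that with the standard threading module; the thread count, the pythread.txt output file and the trimmed-down header set are illustrative assumptions, not the original multithreaded code.

# -*- coding:utf-8 -*-
# Minimal multithreaded sketch: the listing pages are split across a fixed
# number of worker threads, and a lock protects the shared output file.
# Thread count, 'pythread.txt' and the reduced headers are assumptions.
import requests, time, threading
from BeautifulSoup import BeautifulSoup

headers = {
    'Host': 'www.zhongchou.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2593.0 Safari/537.36'
}

lock = threading.Lock()   # serializes writes to the shared file
counter = [0]             # mutable holder so all threads can update one count

def crawlPages(pages, items):
    # fetch every listing page assigned to this thread and record project URLs
    for page in pages:
        if page == 0:
            url = 'http://www.zhongchou.com/browse/di'
        else:
            url = 'http://www.zhongchou.com/browse/di-p' + str(page + 1)
        r1 = requests.get(url, headers=headers)
        soup = BeautifulSoup(r1.text.encode('utf8'))
        lists = soup.findAll(attrs={"class": "ssCardItem"})
        with lock:
            for card in lists:
                items.write(card.a['href'] + "\n")
                counter[0] += 1

if __name__ == '__main__':
    start = time.clock()
    allpage = 30
    nthreads = 6                      # assumed number of worker threads
    items = open('pythread.txt', 'a')
    threads = []
    for t in range(nthreads):
        # thread t handles pages t, t+nthreads, t+2*nthreads, ...
        pages = range(t, allpage, nthreads)
        th = threading.Thread(target=crawlPages, args=(pages, items))
        th.start()
        threads.append(th)
    for th in threads:
        th.join()
    items.close()
    print('it takes %s Seconds to get %s items ' % (time.clock() - start, counter[0]))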
