Counting the male/female ratio of a school BBS with a Python crawler: data processing (part 3)

admin

2023-07-31 02:33:03

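This article mainly covers the data-processing part; please read it carefully.

1. Data analysis

We obtained text data whose file names begin with the following strings, and it now needs to be processed.

2. Rollback

The httperror data needs to be processed again.

Because of how the crawler was written (see part two of this series), the same id can appear in several consecutive httperror records in a text file: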

//httperror265001_266001.txt
265002 httperror
265002 httperror
265002 httperror
265002 httperror
265003 httperror
265003 httperror
265003 httperror
265003 httperror

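So the code has to take this into account: instead of processing the id on every line, it checks whether the id repeats the one before it.

Java has buffering facilities that avoid reading a file from disk too often; Python has them as well, see this article.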

import os,re,sys

def main():
  reload(sys)
  sys.setdefaultencoding('utf-8')
  global sexRe,timeRe,notexistRe,url1,url2,file1,file2,file3,file4,startNum,endNum,file5
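  # The HTML tags inside these patterns were lost on this page; the </em> and </li
  # parts are reconstructed and should be checked against part two of the series
  sexRe = re.compile(u'em>\u6027\u522b</em>(.*?)</li')
  timeRe = re.compile(u'em>\u4e0a\u6b21\u6d3b\u52a8\u65f6\u95f4</em>(.*?)</li')
  notexistRe = re.compile(u'(p>)\u62b1\u6b49\uff0c\u60a8\u6307\u5b9a\u7684\u7528\u6237\u7a7a\u95f4\u4e0d\u5b58\u5728<')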
  url1 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s'
  url2 = 'http://rs.xidian.edu.cn/home.php?mod=space&uid=%s&do=profile'
  file1 = 'ruisi\\correct_re.txt'
  file2 = 'ruisi\\errTime_re.txt'
  file3 = 'ruisi\\notexist_re.txt'
  file4 = 'ruisi\\unkownsex_re.txt'
  file5 = 'ruisi\\httperror_re.txt'

  # Walk the folder and pick up the text files whose names start with httperror
  for filename in os.listdir(r'E:\pythonProject\ruisi'):
    if filename.startswith('httperror'):
      count = 0
      newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
      readFile = open(newName,'r')
      oldLine = '0'
      for line in readFile:
        # newLine is compared with the previous line to detect a repeated id
        newLine = line
        if (newLine != oldLine):
          nu = newLine.split()[0]
          oldLine = newLine
          count += 1
          searchWeb((int(nu),))
      readFile.close()
      print "%s deal %s lines" %(filename, count)

To keep things simple, this code does not sort the re-crawled httperror ids back into per-range files; the results go straight into the following 5 files:

  file1 = 'ruisi\\correct_re.txt'
  file2 = 'ruisi\\errTime_re.txt'
  file3 = 'ruisi\\notexist_re.txt'
  file4 = 'ruisi\\unkownsex_re.txt'
  file5 = 'ruisi\\httperror_re.txt'

The output log shows how many httperror records were processed in total.

"D:\Program Files\Python27\python.exe" E:/pythonProject/webCrawler/reload.py
httperror132001-133001.txt deal 21 lines
httperror2001-3001.txt deal 4 lines
httperror251001-252001.txt deal 5 lines
httperror254001-255001.txt deal 1 lines
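
The duplicate check described in this section can also be expressed with `itertools.groupby`, which groups adjacent equal items. The following is only a minimal sketch in Python 3 (the article's own code is Python 2); the helper name `unique_consecutive_ids` and the sample lines are assumptions made for illustration, and the crawl call itself is left out.

```python
from itertools import groupby


def unique_consecutive_ids(lines):
    """Yield each id once per run of consecutive identical ids."""
    # Skip blank lines and keep the first whitespace-separated field (the id).
    ids = (line.split()[0] for line in lines if line.strip())
    # groupby only merges *adjacent* equal items, which matches the
    # repeated httperror records shown above.
    for id_text, _run in groupby(ids):
        yield int(id_text)


# Hypothetical sample in the same format as the httperror files.
sample = [
    "265002 httperror\n",
    "265002 httperror\n",
    "265002 httperror\n",
    "265003 httperror\n",
    "265003 httperror\n",
]
unique_ids = list(unique_consecutive_ids(sample))
print(unique_ids)
```

Comparing ids rather than whole lines also keeps the check working if the text after the id ever differs between two records of the same id.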

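3. Counting the unkownsex data with a single thread

The code is simple: a single thread counts the unkownsex users (those whose sex could not be retrieved because of permissions, or who never filled it in). We also checked and found that users without a sex have no activity time either.

The data format is as follows: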

253042 unkownsex
253087 unkownsex
253102 unkownsex
253118 unkownsex
253125 unkownsex
253136 unkownsex
253161 unkownsex

import os,time
sumCount = 0

startTime = time.clock()

for filename in os.listdir(r'E:\pythonProject\ruisi'):
  if filename.startswith('unkownsex'):
    count = 0
    newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
    # Open the file once and count its lines
    readFile = open(newName,'r')
    for line in readFile:
      count += 1
      sumCount +=1
    readFile.close()
    print "%s deal %s lines" %(filename, count)
print '%s unkowns sex' %(sumCount)

endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"

Processing is fast. The output is:

unkownsex1-1001.txt deal 204 lines
unkownsex100001-101001.txt deal 50 lines
unkownsex10001-11001.txt deal 206 lines
#... intermediate output omitted
unkownsex99001-100001.txt deal 56 lines
unkownsex_re.txt deal 1085 lines
14223 unkowns sex
cost time 0.0813142301261 s
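
For readers on Python 3, the same count can be sketched as below. Note that `time.clock()` was removed in Python 3.8, so this sketch uses `time.perf_counter()` instead. The function names and the temporary demo files are assumptions for illustration; they stand in for the real `ruisi` folder.

```python
import os
import tempfile
import time


def count_lines(path):
    """Count the lines of one text file without loading it into memory."""
    with open(path, "r", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def count_by_prefix(directory, prefix):
    """Sum the line counts of all files whose names start with prefix."""
    total = 0
    for name in sorted(os.listdir(directory)):
        if name.startswith(prefix):
            total += count_lines(os.path.join(directory, name))
    return total


# Demo on a throwaway folder with hypothetical files.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_files = {
        "unkownsex1-1001.txt": "253042 unkownsex\n253087 unkownsex\n",
        "unkownsex_re.txt": "253102 unkownsex\n",
        "correct1-1001.txt": "31024 M 2014-11-11 13:20\n",
    }
    for name, text in demo_files.items():
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as out:
            out.write(text)

    start = time.perf_counter()
    demo_total = count_by_prefix(demo_dir, "unkownsex")
    elapsed = time.perf_counter() - start

print("%s unkownsex lines, cost time %.6f s" % (demo_total, elapsed))
```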

4. Counting the correct data with a single thread

The data format is as follows:

31024 男 2014-11-11 13:20
31283 男 2013-3-25 19:41
31340 保密 2015-2-2 15:17
31427 保密 2014-8-10 09:17
31475 保密 2013-7-2 08:59
31554 保密 2014-10-17 17:02
31621 男 2015-5-16 19:27
31872 保密 2015-1-11 16:49
31915 保密 2014-5-4 11:01
31997 保密 2015-5-16 20:14

The code is below. The idea is to read line by line and take the sex field with line.split(). sumCount counts the total number of people; boycount, girlcount and secretcount count the male, female and undisclosed users respectively. The sex field is again compared against unicode strings.

import os,sys,time
reload(sys)
sys.setdefaultencoding('utf-8')
startTime = time.clock()
sumCount = 0
boycount = 0
girlcount = 0
secretcount = 0
for filename in os.listdir(r'E:\pythonProject\ruisi'):
  if filename.startswith('correct'):
    newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
    readFile = open(newName,'r')
    for line in readFile:
      sexInfo = line.split()[1]
      sumCount +=1
      # u'\u7537' is "male"
      if sexInfo == u'\u7537' :
        boycount += 1
      # u'\u5973' is "female"
      elif sexInfo == u'\u5973':
        girlcount +=1
      # u'\u4fdd\u5bc6' is "undisclosed"
      elif sexInfo == u'\u4fdd\u5bc6':
        secretcount +=1
    readFile.close()
    print "until %s, sum is %s boys; %s girls; %s secret;" %(filename, boycount,girlcount,secretcount)
print "total is %s; %s boys; %s girls; %s secret;" %(sumCount, boycount,girlcount,secretcount)
endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"

Note that each line of output is the running total up to that file, not the count for the single file. The output is:

until correct1-1001.txt, sum is 110 boys; 7 girls; 414 secret;
until correct100001-101001.txt, sum is 125 boys; 13 girls; 542 secret;
#... omitted
until correct99001-100001.txt, sum is 11070 boys; 3113 girls; 26636 secret;
until correct_re.txt, sum is 13937 boys; 4007 girls; 28941 secret;
total is 46885; 13937 boys; 4007 girls; 28941 secret;
cost time 3.60047888495 s
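
The three counters can also be kept in a single `collections.Counter`, and the totals printed above are enough to work out the actual ratio. This is a rough Python 3 sketch: `count_sexes` and the sample lines are illustrative assumptions, while the three totals are copied from the output above. The ratio only covers users who disclosed their sex.

```python
from collections import Counter


def count_sexes(lines):
    """Tally the second whitespace-separated field (the sex) of each line."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # ignore blank or malformed lines
            counts[fields[1]] += 1
    return counts


# A few lines in the format of the correct files.
sample = [
    "31024 男 2014-11-11 13:20",
    "31283 男 2013-3-25 19:41",
    "31340 保密 2015-2-2 15:17",
]
sample_counts = count_sexes(sample)

# Totals taken from the output above.
boys, girls, secret = 13937, 4007, 28941
total = boys + girls + secret
ratio = boys / girls
shares = {
    "boys": 100.0 * boys / total,
    "girls": 100.0 * girls / total,
    "secret": 100.0 * secret / total,
}

print("boys per girl: %.2f" % ratio)
for label, share in shares.items():
    print("%s: %.1f%%" % (label, share))
```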

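5. Counting the data with multiple threads

To count faster, we can try multiple threads.
The single-threaded run from section 4 serves as the baseline for comparison.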

# encoding: UTF-8
import threading
import time,os,sys

# Global counters
SUM = 0
BOY = 0
GIRL = 0
SECRET = 0
NUM =0

# This class inherits from threading.Thread and overrides run(); start() launches the thread
# This is much like Java
class StaFileList(threading.Thread):
  # List of text file names
  fileList = []

  def __init__(self, fileList):
    threading.Thread.__init__(self)
    self.fileList = fileList

  def run(self):
    global SUM, BOY, GIRL, SECRET
    # A sleep can be added here to make the threading more visible,
    # instead of the sequential thread-1,2,3
    #time.sleep(1)
    # acquire takes the lock
    if mutex.acquire(1):
      self.staFiles(self.fileList)
      # release gives the lock back
      mutex.release()

  # Process the given list of files and count males and females
  # Mind the data synchronization here; global refers to the module-level counters
  def staFiles(self, files):
    global SUM, BOY, GIRL, SECRET
    for name in files:
      newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
      readFile = open(newName,'r')
      for line in readFile:
        sexInfo = line.split()[1]
        SUM +=1
        if sexInfo == u'\u7537' :
          BOY += 1
        elif sexInfo == u'\u5973':
          GIRL +=1
        elif sexInfo == u'\u4fdd\u5bc6':
          SECRET +=1
      readFile.close()
      # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
      #    " %s secret;" %(self.name, name, SUM, BOY,GIRL,SECRET)

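def test():
  # files collects file names; its length decides how many files one thread handles
  files = []

  # Keep every thread so the main thread can wait for all of them at the end
  staThreads = []
  i = 0
  for filename in os.listdir(r'E:\pythonProject\ruisi'):
    # Create one thread for every 20 files collected
    if filename.startswith('correct'):
      files.append(filename)
      i+=1
      # One thread handles 20 files
      if i == 20 :
        staThreads.append(StaFileList(files))
        files = []
        i = 0
  # The remaining files, most likely fewer than 20
  if files:
    staThreads.append(StaFileList(files))

  for t in staThreads:
    t.start()
  # The main thread waits for all child threads to exit; without join()
  # the totals could be printed before the threads have finished
  for t in staThreads:
    t.join()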



if __name__ == '__main__':
  reload(sys)
  sys.setdefaultencoding('utf-8')
  startTime = time.clock()
  mutex = threading.Lock()
  test()
  print "Multi Thread, total is %s; %s boys; %s girls; %s secret;" %(SUM, BOY,GIRL,SECRET)
  endTime = time.clock()
  print "cost time " + str(endTime - startTime) + " s"

Output:

Multi Thread, total is 46885; 13937 boys; 4007 girls; 28941 secret;
cost time 0.132137192794 s

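The multi-threaded version is not fundamentally faster than the single-threaded one. (The 3.60 s measured in section 4 includes printing a line for every file, which this version skips, so the two figures are not directly comparable.) Thread synchronization is involved here: each thread holds the lock for its entire run, so the files are still processed one after another, and acquiring and releasing the lock, as well as saving and restoring state when switching between threads, all cost time.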
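
One common way to shorten the time spent under the lock is to let every thread count into a private `Counter` and take the lock only once, when merging. The sketch below is in Python 3 and is not the article's code: `count_chunk`, `totals_lock` and the in-memory sample chunks are assumptions for illustration. In CPython the global interpreter lock still keeps pure-Python counting from running in parallel, so this mainly reduces lock contention rather than promising a speedup.

```python
import threading
from collections import Counter

totals = Counter()              # shared result, guarded by totals_lock
totals_lock = threading.Lock()


def count_chunk(lines):
    """Count sexes for one chunk privately, then merge under the lock."""
    local = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            local[fields[1]] += 1
    # The lock is held only for the short merge, not for the whole scan.
    with totals_lock:
        totals.update(local)


# Hypothetical chunks standing in for groups of correct files.
chunks = [
    ["31024 男 2014-11-11 13:20", "31340 保密 2015-2-2 15:17"],
    ["31283 男 2013-3-25 19:41"],
    ["31427 保密 2014-8-10 09:17", "31621 男 2015-5-16 19:27"],
]

threads = [threading.Thread(target=count_chunk, args=(chunk,)) for chunk in chunks]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()  # wait for every worker before reading totals

print(dict(totals))
```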

6. Single thread versus multiple threads on more data

We can process the correct, errTime and unkownsex text files together.
Single-threaded code

# coding=utf-8
import os,sys,time
reload(sys)
sys.setdefaultencoding('utf-8')
startTime = time.clock()
sumCount = 0
boycount = 0
girlcount = 0
secretcount = 0
unkowncount = 0
for filename in os.listdir(r'E:\pythonProject\ruisi'):
  # correct files have sex and activity time; errTime files have sex but no activity time
  if filename.startswith('correct') or filename.startswith("errTime"):
    newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
    readFile = open(newName,'r')
    for line in readFile:
      sexInfo =line.split()[1]
      sumCount +=1
      if sexInfo == u'\u7537' :
        boycount += 1
      elif sexInfo == u'\u5973':
        girlcount +=1
      elif sexInfo == u'\u4fdd\u5bc6':
        secretcount +=1
    readFile.close()
    # print "until %s, sum is %s boys; %s girls; %s secret;" %(filename, boycount,girlcount,secretcount)
  # No sex and no time: just count the lines
  elif filename.startswith("unkownsex"):
    newName = 'E:\\pythonProject\\ruisi\\%s' % (filename)
    # count = len(open(newName,'rU').readlines())
    # For large files, loop instead; count starts at -1 so that an empty file ends up as 0 lines after the +1
    count = -1
    for count, line in enumerate(open(newName, 'rU')):
      pass
    count += 1
    unkowncount += count
    sumCount += count
    # print "until %s, sum is %s unkownsex" %(filename, unkowncount)

print "Single Thread, total is %s; %s boys; %s girls; %s secret; %s unkownsex;" %(sumCount, boycount,girlcount,secretcount,unkowncount)
endTime = time.clock()
print "cost time " + str(endTime - startTime) + " s"

The output is:

Single Thread, total is 61111; 13937 boys; 4009 girls; 28942 secret; 14223 unkownsex;
cost time 1.37444645628 s
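
The `enumerate` idiom used for the unkownsex files can be written without the `-1` and `+1` adjustment by starting the enumeration at 1. A small Python 3 sketch follows; the function name `count_lines_enumerate` and the temporary files are assumptions for illustration.

```python
import os
import tempfile


def count_lines_enumerate(path):
    """Count lines with enumerate, without reading the whole file at once."""
    count = 0  # stays 0 when the file is empty and the loop never runs
    with open(path, "r", encoding="utf-8") as handle:
        for count, _line in enumerate(handle, start=1):
            pass
    return count


with tempfile.TemporaryDirectory() as demo_dir:
    empty_path = os.path.join(demo_dir, "empty.txt")
    three_path = os.path.join(demo_dir, "three.txt")
    # An empty file and a three-line file in the unkownsex format.
    open(empty_path, "w", encoding="utf-8").close()
    with open(three_path, "w", encoding="utf-8") as out:
        out.write("253042 unkownsex\n253087 unkownsex\n253102 unkownsex\n")

    empty_count = count_lines_enumerate(empty_path)
    three_count = count_lines_enumerate(three_path)

print(empty_count, three_count)
```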

Multi-threaded code

__author__ = 'admin'
# encoding: UTF-8
# Multi-threaded processing program
import threading
import time,os,sys

# Global counters
SUM = 0
BOY = 0
GIRL = 0
SECRET = 0
UNKOWN = 0

class StaFileList(threading.Thread):
  # List of text file names
  fileList = []

  def __init__(self, fileList):
    threading.Thread.__init__(self)
    self.fileList = fileList

  def run(self):
    global SUM, BOY, GIRL, SECRET
    if mutex.acquire(1):
      self.staManyFiles(self.fileList)
      mutex.release()

  # Process the given list of files and count males and females
  # Mind the data synchronization here
  def staCorrectFiles(self, files):
    global SUM, BOY, GIRL, SECRET
    for name in files:
      newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
      readFile = open(newName,'r')
      for line in readFile:
        sexInfo = line.split()[1]
        SUM +=1
        if sexInfo == u'\u7537' :
          BOY += 1
        elif sexInfo == u'\u5973':
          GIRL +=1
        elif sexInfo == u'\u4fdd\u5bc6':
          SECRET +=1
      readFile.close()
      # print "thread %s, until %s, total is %s; %s boys; %s girls;" \
      #    " %s secret;" %(self.name, name, SUM, BOY,GIRL,SECRET)

  def staManyFiles(self, files):
    global SUM, BOY, GIRL, SECRET,UNKOWN
    for name in files:
      # correct files have sex and activity time; errTime files have sex but
      # no activity time. Both are counted the same way.
      if name.startswith('correct') or name.startswith("errTime"):
        self.staCorrectFiles([name])
      # No sex and no time: just count the lines
      elif name.startswith("unkownsex"):
        newName = 'E:\\pythonProject\\ruisi\\%s' % (name)
        # count = len(open(newName,'rU').readlines())
        # For large files, loop instead; count starts at -1 so that an empty file ends up as 0 lines after the +1
        count = -1
        for count, line in enumerate(open(newName, 'rU')):
          pass
        count += 1
        UNKOWN += count
        SUM += count
        # print "thread %s, until %s, total is %s; %s unkownsex" %(self.name, name, SUM, UNKOWN)


def test():
  files = []
  # Keep every thread so the main thread can wait for all of them at the end
  staThreads = []
  i = 0
  for filename in os.listdir(r'E:\pythonProject\ruisi'):
    # Create one thread for every 20 files collected
    if filename.startswith("correct") or filename.startswith("errTime") or filename.startswith("unkownsex"):
      files.append(filename)
      i+=1
      if i == 20 :
        staThreads.append(StaFileList(files))
        files = []
        i = 0
  # The remaining files, most likely fewer than 20
  if files:
    staThreads.append(StaFileList(files))

  for t in staThreads:
    t.start()
  # The main thread waits for all child threads to exit
  for t in staThreads:
    t.join()



if __name__ == '__main__':
  reload(sys)
  sys.setdefaultencoding('utf-8')
  startTime = time.clock()
  mutex = threading.Lock()
  test()
  print "Multi Thread, total is %s; %s boys; %s girls; %s secret; %s unkownsex" %(SUM, BOY,GIRL,SECRET,UNKOWN)
  endTime = time.clock()
  print "cost time " + str(endTime - startTime) + " s"

The output is:

Multi Thread, total is 61111; 13937 boys; 4009 girls; 28942 secret;
cost time 1.23049112201 s
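Here the multi-threaded run comes out slightly ahead of the single-threaded one (1.23 s versus 1.37 s), and because synchronization is used, the totals are consistent.

Note that inside a Python class you frequently have to write self, which is quite different from Java.

  def __init__(self, fileList):
    threading.Thread.__init__(self)
    self.fileList = fileList

  def run(self):
    global SUM, BOY, GIRL, SECRET
    if mutex.acquire(1):
      # Calling a method of the same class requires self
      self.staFiles(self.fileList)
      mutex.release()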

total is 61111; 13937 boys; 4009 girls; 28942 secret; 14223 unkownsex;
cost time 1.25413238673 s

That is all for this article; I hope it helps with your studies.
