Scraping QQ Group Members with Python

Published: 2023/12/10 python 25 豆豆

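Yesterday I noticed a group that has a lot of members, about a thousand, yet hardly anyone ever speaks. A friend in the group said that many of them are dead accounts. Out of curiosity I took a look, and it was true: many members' Qzone spaces show 0 moods (说说), 0 photos, and 0 blog posts, with only a few hundred or even a few dozen visitors. I wanted to put the Python I had learned to some use.

The idea: use Python + selenium to drive a browser and log in to Qzone; use selenium's find_elements_by_class_name to collect each member's URL from the dynamically loaded page and store them in a list; once loading has finished, iterate over the list and load each member's space; then use find_element_by_id to read each member's mood, photo, and blog counts. Some spaces block visitors who are not friends, so nothing can be scraped from them; these are treated as live accounts by default. Other spaces use a custom layout that differs from the rest, so nothing can be scraped there either; these are also treated as live accounts. Only accounts whose mood, photo, and blog counts are all 0 are considered dead.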
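The classification rule described above can be sketched as a small pure function. This is only an illustration of the rule, not the code used in this post: the function name `classify` is hypothetical, and the labels `success`, `fail`, and `false` are borrowed from the result summary the script prints.

```python
from typing import Optional


def classify(photos: Optional[str], moods: Optional[str], blogs: Optional[str]) -> str:
    """Classify one member from the three counter texts scraped from the page.

    Pass None for a counter that could not be read (restricted or custom space).
    """
    counts = (photos, moods, blogs)
    if any(c is None for c in counts):
        # Nothing could be scraped: counted separately, treated as live by default
        return "fail"
    if all(c.strip() == "0" for c in counts):
        # Space is accessible and completely empty: dead account
        return "false"
    # Space is accessible and at least one counter is non-zero: live account
    return "success"


print(classify("0", "0", "0"))
print(classify("12", "0", "3"))
print(classify(None, None, None))
```

Keeping the rule separate from the browser code makes it easy to check without opening a single page.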

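So I wrote the Python code and ran it. The results:

success:70 means the member's space was accessible and the mood, photo, and blog counts were not all 0; there were 70 such members.

fail:71 means access was restricted; there were 71 such members. false:853 means the space was accessible and the mood, photo, and blog counts were all 0; there were 853 such members. In other words, this group has 853 dead accounts. Most of their QQ numbers are 10 digits long and start with 3, so they were probably bought.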
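As a quick check of these figures, the share of dead accounts can be worked out directly from the three counters. The variable names below are illustrative only.

```python
# Counters reported by the run described above
live = 70        # "success": accessible, counters not all 0
restricted = 71  # "fail": access restricted
dead = 853       # "false": accessible, all counters 0

total = live + restricted + dead
dead_ratio = dead / total

print(total)
print(f"{dead_ratio:.1%}")
```

The three counters add up to 994 members, of which roughly 86% are dead accounts.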

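First:

Install Python 3 from the official website (I started with the 32-bit build), then pick an integrated development environment (IDE); I installed PyCharm. At this point you can already scrape static pages. Remember that urllib2 from Python 2 became urllib in Python 3, mainly urllib.request (look up the details yourself). I started by reading the page source in the browser's developer tools (shortcut F12); the original post shows a screenshot here.

Look for distinguishing features in the source. (Inspect Element is a great tool and saves a lot of time when hunting for information.) I found a few such features; the original post shows a second screenshot here.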

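Each member's URL sits in an href attribute, and all of these links share a common class, which we can exploit. So are we done? Unfortunately, far from it. When I tried to fetch the page, that content simply was not there. It is loaded dynamically through AJAX. (Many pages are loaded dynamically nowadays, so urllib is less useful than it used to be; selenium can handle them.) What is selenium? Selenium is a tool for testing web applications. Its tests run directly in the browser, just as a real user would operate it. In short, selenium is an automated testing tool. There are three ways to get the members' hrefs:

1. Copy the markup straight out of the developer tools into a text file, read the file in code, and match it with a regular expression [reg = r'href="http://user.qzone.qq.com/\d{8,11}" class='], then write the result to a file (realaddr = ''.join(realaddrlist) converts a list into a string; there are other ways too). After that, write Python code that loads each member's space and decides whether the account is dead.

2. Use a packet-capture tool (Fiddler and the like). I have not tried this, because I do not know which request to capture.

3. Log in to Qzone from Python code, open the group space, and read the member information with selenium (recommended). This is the best approach, because you cannot copy, paste, and regex-match by hand every time you want to check a site.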
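The first approach can be sketched as follows. This is a minimal illustration under stated assumptions: `sample_html` is made-up markup in the shape described in the text, `extract_member_urls` is a hypothetical helper name, and the pattern is the one from the text with the dots escaped and a capture group added around the URL.

```python
import re

# Made-up markup for illustration; real markup would be copied from the developer tools
sample_html = (
    '<a href="http://user.qzone.qq.com/12345678" class="avatar_50"></a>'
    '<a href="http://user.qzone.qq.com/3000000001" class="avatar_50"></a>'
    '<a href="http://example.com/other" class="nav"></a>'
)

# The capture group returns only the URL; escaped dots match literal dots
MEMBER_RE = re.compile(r'href="(http://user\.qzone\.qq\.com/\d{8,11})" class=')


def extract_member_urls(html):
    """Return every member space URL found in the given markup, in page order."""
    return MEMBER_RE.findall(html)


urls = extract_member_urls(sample_html)
# One URL per line is easier to read back later than ''.join(...) with no separator
print("\n".join(urls))
```

Joining with a newline rather than an empty string keeps the URLs separable when the file is read again.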

Here is the code that logs in to Qzone:

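import time
import urllib.request  # getHtml() below uses urllib.request.urlopen

import win32com.client
from selenium import webdriver

# Note: this uses the older selenium API (switch_to_frame, find_element_by_id, ...);
# selenium 4 replaced these with switch_to.frame and find_element(By.ID, ...)

option = webdriver.ChromeOptions()
option.add_argument("--test-type")
# Set this to your own user data directory (open chrome://version/ in Chrome to find the profile path)
option.add_argument(r'user-data-dir=******')

# chromedriver 2.9 is not compatible with Chrome; use chromedriver 2.4
# Raw strings keep the backslashes in the Windows path from being read as escapes
driver = webdriver.Chrome(executable_path=r'c:\chromedriver\chromedriver.exe', chrome_options=option)

# Login also works without the options above, using only the line below;
# loading my own data directory works better
# driver = webdriver.Chrome(executable_path=r'c:\chromedriver\chromedriver.exe')

# Open the group member page in the web client and copy its URL
driver.get('http://qun.qzone.qq.com/group#!/******/member')
driver.switch_to_frame('login_frame')
driver.find_element_by_id('switcher_plogin').click()
driver.find_element_by_id('u').clear()
# Your QQ number
driver.find_element_by_id('u').send_keys('******')
driver.find_element_by_id('p').clear()
# Your password
driver.find_element_by_id('p').send_keys('*****')
driver.find_element_by_id('login_button').click()
# Give the login and the member list time to load
time.sleep(10)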
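With a current selenium 4 installation the helper names used above are no longer available, so the same login steps would look roughly like the sketch below. It is an assumption-laden illustration, not the code this post was run with: `login` is a hypothetical function, the element ids are taken unchanged from the code above and may no longer match the live page, and the locator strategy is passed as the plain string "id", which is the value of selenium's `By.ID` constant, so the function itself needs no selenium import.

```python
BY_ID = "id"  # same string value as selenium.webdriver.common.by.By.ID


def login(driver, qq_number, password):
    """Fill in and submit the Qzone login form with selenium 4 style calls.

    `driver` is expected to be a selenium WebDriver already showing the login page.
    """
    # switch_to.frame(...) replaces the old switch_to_frame(...)
    driver.switch_to.frame("login_frame")
    # find_element(by, value) replaces the old find_element_by_id(...)
    driver.find_element(BY_ID, "switcher_plogin").click()

    user_box = driver.find_element(BY_ID, "u")
    user_box.clear()
    user_box.send_keys(qq_number)

    pass_box = driver.find_element(BY_ID, "p")
    pass_box.clear()
    pass_box.send_keys(password)

    driver.find_element(BY_ID, "login_button").click()
```

In real use the driver would be created with `webdriver.Chrome(options=option)` and passed in, for example `login(driver, '******', '*****')`.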

Remember that Chrome needs the chromedriver driver; finding and downloading it is left to you. After logging in:

# Approach 2: this always ends on the most recently opened window, so it may need to be repeated
for handle in driver.window_handles:
    # Focus the current page; otherwise the wrong page gets scraped
    driver.switch_to_window(handle)

# Collect all member link elements from the group space (a list)
qqslist = driver.find_elements_by_class_name("avatar_50")
print(driver.title)
numqq = 0
path = r'C:\Users\Desktop\qqmsg.txt'

# Walk the elements and collect each member's space URL
# (writeTxt is defined with the other helper functions further down)
qqhreflist = list()
for qq in qqslist:
    qqhref = qq.get_attribute("href")
    numqq = numqq + 1
    print(qqhref)
    qqhreflist.append(qqhref)
    writeTxt(path, qqhref)
time.sleep(1)
print("getqqs success")
print("start process")
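Since the dead accounts mostly had 10-digit numbers starting with 3, it can be handy to pull the QQ number out of each collected href. The sketch below shows one way to do that; both function names are hypothetical, and the "10 digits starting with 3" test is only the rough observation from the results above, not a reliable detector.

```python
import re


def qq_number(href):
    """Return the digits at the end of a member space URL, or None if there are none."""
    # get_attribute("href") can return None, so guard against that
    match = re.search(r'/(\d{5,11})/?$', href or "")
    return match.group(1) if match else None


def looks_bulk_registered(number):
    """Rough heuristic from the results above: 10 digits, starting with 3."""
    return number is not None and len(number) == 10 and number.startswith("3")


for href in ["http://user.qzone.qq.com/3000000001", "http://user.qzone.qq.com/12345678", None]:
    number = qq_number(href)
    print(number, looks_bulk_registered(number))
```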

At this point the hrefs of all members are in the list qqhreflist. Next:

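# Start WPS Spreadsheets through COM and create a new workbook
wpsApp = win32com.client.Dispatch("KET.Application")
wpsApp.Visible = 1
xlBook = wpsApp.Workbooks.Add()

# Visit every member space and fill in the sheet (getSele is shown below)
getSele(qqhreflist, xlBook)

try:
    xlBook.SaveAs(r"C:\Users\Desktop\qqmsg.xls")
finally:
    # Close the workbook and WPS even if saving fails
    xlBook.Close()
    wpsApp.Quit()
    del wpsApp

driver.close()
driver.quit()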

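win32com is used here and has to be installed separately (it ships with the pywin32 package). A capicom.dll may also be missing; if so, download it, run cmd as administrator, and enter: C:\Windows\SysWOW64\regsvr32.exe C:\Windows\SysWOW64\capicom.dll. Here I use WPS to receive the data and store it in a spreadsheet.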
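If WPS and win32com are not available, the same table could be written as a CSV file with the standard library instead. This is a swapped-in alternative, not what this post uses; `write_rows`, the column names, and the sample rows are all made up for illustration.

```python
import csv
import io

HEADER = ["nickname", "photos", "moods", "blogs"]


def write_rows(stream, rows):
    """Write the header and one row per member to an open text stream."""
    writer = csv.writer(stream)
    writer.writerow(HEADER)
    writer.writerows(rows)


# An in-memory buffer keeps the example self-contained; sample rows are invented
buffer = io.StringIO()
write_rows(buffer, [("Alice", "0", "0", "0"), ("Bob", "-1", "-1", "-1")])
print(buffer.getvalue())
```

For a real file, open it with `open(path, "w", newline="", encoding="utf-8")` and pass the file object in place of the buffer; the resulting file opens in WPS or Excel.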

Next comes the main part: deciding whether an account is dead. This is the function getSele():

def getSele(hrefs, xlBook):
    # `driver` is the module-level browser created in the login code
    sheet = xlBook.ActiveSheet
    sheet.Cells(1, 1).Value = "Nickname"
    sheet.Cells(1, 2).Value = "Photos"
    sheet.Cells(1, 3).Value = "Moods"
    sheet.Cells(1, 4).Value = "Blogs"
    path = r'C:\Users\煙魂\Desktop\data1.txt'
    numfalse = 0  # accessible, all three counters are 0 (dead account)
    numtrue = 0   # accessible, counters not all 0
    numfail = 0   # counters not found (restricted or custom space)
    count = 0
    for href in hrefs:
        count = count + 1
        print("Processing member #" + str(count))
        driver.get(href)
        time.sleep(0.1)
        try:
            # Probe for the three counters; a missing one raises an exception
            driver.find_element_by_id("QM_Profile_Photo_Cnt")
            driver.find_element_by_id("QM_Profile_Mood_Cnt")
            driver.find_element_by_id("QM_Profile_Blog_Cnt")
        except Exception:
            print("fail")
            numfail = numfail + 1
            try:
                user_name = driver.find_elements_by_class_name("user_name")[0].text
                if user_name == "":
                    selfname = "No name, unknown error"
                    print(selfname)
                    sheet.Cells(count + 1, 1).Value = selfname
                else:
                    print(user_name)
                    sheet.Cells(count + 1, 1).Value = user_name
                # -1 marks counters that could not be read
                sheet.Cells(count + 1, 2).Value = "-1"
                sheet.Cells(count + 1, 3).Value = "-1"
                sheet.Cells(count + 1, 4).Value = "-1"
                print("-1")
                writeTxt(path, user_name)
            except Exception:
                print("none")
                selfname = "Custom panel or own account"
                print(selfname)
                sheet.Cells(count + 1, 1).Value = selfname
                continue
            writeTxt(path, "fail")
            continue

        numphoto = driver.find_element_by_id("QM_Profile_Photo_Cnt").text
        nummood = driver.find_element_by_id("QM_Profile_Mood_Cnt").text
        numblog = driver.find_element_by_id("QM_Profile_Blog_Cnt").text
        if numblog == "0" and numphoto == "0" and nummood == "0":
            numfalse = numfalse + 1
        else:
            numtrue = numtrue + 1
        print(driver.title)
        print(numphoto)
        print(nummood)
        print(numblog)
        sheet.Cells(count + 1, 1).Value = driver.title
        sheet.Cells(count + 1, 2).Value = numphoto
        sheet.Cells(count + 1, 3).Value = nummood
        sheet.Cells(count + 1, 4).Value = numblog
        print("success:" + str(numtrue) + " fail:" + str(numfail) + " false:" + str(numfalse))
        data = " drivertitle:" + driver.title + " numphoto:" + numphoto + " nummood:" + nummood + " numblog:" + numblog
        writeTxt(path, data)

    datanum = "numtrue: " + str(numtrue) + " numfalse:" + str(numfalse) + " numfail:" + str(numfail)
    print(datanum)
    writeTxt(path, datanum)
    # Share of dead accounts; guard against an empty member list
    total = numtrue + numfail + numfalse
    num = numfalse / total if total else 0
    datanums = "num: " + str(num)
    print(datanums)
    writeTxt(path, datanums)
    writeTxt(path, "success")

Then a few other helper functions:

import urllib.request


def getHtml(url):
    # Fetch a static page and decode it as UTF-8
    with urllib.request.urlopen(url) as page:
        return page.read().decode('UTF-8')


def readTxt(path):
    # Read a whole UTF-8 text file
    with open(path, 'r', encoding='utf-8') as file_object:
        return file_object.read()


def writeTxt(path, data):
    # Append one record per line so successive writes do not run together
    with open(path, 'a', encoding='utf-8') as file_object:
        file_object.write(f"{data}\n")

That's all. Just let Python scrape on its own for a few tens of minutes.
