Python Crawlers and Information Extraction (8): Importing the Sina Hot Search Ranking into a Database
Scraping the Sina Weibo hot search ranking with Python and importing it into a database
The previous article gave a brief introduction to scraping the Sina Weibo hot search ranking with Python:

Crawler example: scraping the Sina Weibo hot search ranking

If you understand the underlying mechanics, the code is quite easy to follow, but simply printing the query results on screen is obviously of limited use.

After learning about databases, I tried the following improvement:
數(shù)據(jù)庫我用的是mysql
目前只設(shè)計(jì)了一個(gè)名為hotsou-db的table來簡單地存放內(nèi)容:
On top of the code from the previous article, add the following:
```python
import pymysql
import time

def CommitDB(ranklist, num, tt):
    try:
        # Connect to MySQL
        conn = pymysql.connect(host=HOST, user=USER, password=PASSWORD,
                               port=PORT, database=DATABASE, charset=CHARSET)
    except:
        Logs(tt, 1, 'DB Connection Failed')
        return  # without this, the code below would hit an undefined conn
    try:
        cursor = conn.cursor()
        for i in range(1, num):
            rank = int(ranklist[i][0])
            name = ranklist[i][1]
            number = int(ranklist[i][2])
            # eval(tt) turns the 14-digit timestamp string into an int;
            # int(tt) would be the safer equivalent
            sql = "INSERT INTO {0} VALUES('{1}', {2}, {3}, {4})".format(
                DB_NAME, name, rank, number, eval(tt))
            cursor.execute(sql)
        conn.commit()
        cursor.close()
        conn.close()
        Logs(tt, 0, 'Commit Done')
    except:
        Logs(tt, 1, 'Commit Failed')

def Logs(tt, number, text):
    # Append a log entry - 0 marks a success entry, 1 an error entry
    f = open("./weibo.log", "a")
    if number == 0:
        f.write('SUCCESS: [' + tt + '] -- ' + text + '\n')
    else:
        f.write('ERROR: [' + tt + '] -- ' + text + '\n')
    f.close()
```

Then make the following change in main:
```python
tt = time.strftime("%Y%m%d%H%M%S", time.localtime())
soup = HTMLTextconvert(text)
HTMLSearch(soup, rank, TOP_COUNT+1)
Rankprint(rank, TOP_COUNT)
CommitDB(rank, TOP_COUNT+1, tt)
# Record a snapshot every 3 minutes
time.sleep(180)
```

As you can see, the interaction with the database is handled mainly through the pymysql library.
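One caveat: building the INSERT with str.format breaks as soon as a title contains a quote character, and it leaves the script open to SQL injection. pymysql's cursor.execute also accepts a tuple of parameters and handles the quoting itself, so a safer variant of the loop body in CommitDB could look like the sketch below (the table name cannot be passed as a parameter, so it is still formatted in):

```python
# Parameterized INSERT inside CommitDB; the %s placeholders are filled in
# by pymysql, which escapes any quotes in the title for us
sql = "INSERT INTO {0} VALUES(%s, %s, %s, %s)".format(DB_NAME)
cursor.execute(sql, (name, rank, number, int(tt)))
```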
The final result:
Once the rankings are in the database, you can start doing some interesting things with them.
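For instance, once a few snapshots have accumulated, you can follow how the search volume of a single topic evolves over time. A minimal sketch, reusing the connection constants from the full code and the hypothetical column names from the schema sketch above:

```python
import pymysql

conn = pymysql.connect(host=HOST, user=USER, password=PASSWORD,
                       port=PORT, database=DATABASE, charset=CHARSET)
with conn.cursor() as cursor:
    # All recorded (timestamp, search volume) pairs for one topic, oldest first
    sql = "SELECT `time`, `number` FROM {0} WHERE `name` = %s ORDER BY `time`".format(DB_NAME)
    cursor.execute(sql, ("some hot search title",))
    for t, n in cursor.fetchall():
        print(t, n)
conn.close()
```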
Full code:
```python
import requests
import re
import bs4
import pymysql
import time

HOST = ""
USER = ""
PASSWORD = ""
PORT = 0
DB_NAME = ""
DATABASE = ""
CHARSET = "utf8"
TOP_COUNT = 20    # how many top entries to fetch
TOP_SHOW = False  # whether to show the pinned entry

# Helpers for computing the display width of text that mixes Chinese and English
widths = [(126, 1), (159, 0), (687, 1), (710, 0), (711, 1),
          (727, 0), (733, 1), (879, 0), (1154, 1), (1161, 0),
          (4347, 1), (4447, 2), (7467, 1), (7521, 0), (8369, 1),
          (8426, 0), (9000, 1), (9002, 2), (11021, 1), (12350, 2),
          (12351, 1), (12438, 2), (12442, 0), (19893, 2), (19967, 1),
          (55203, 2), (63743, 1), (64106, 2), (65039, 1), (65059, 0),
          (65131, 2), (65279, 1), (65376, 2), (65500, 1), (65510, 2),
          (120831, 1), (262141, 2), (1114109, 1), ]

def get_width(a):
    global widths
    if a == 0xe or a == 0xf:
        return 0
    for num, wid in widths:
        if a <= num:
            return wid
    return 1

def length(str):
    sum = 0
    for ch in str:
        sum += get_width(ord(ch))
    return sum

# Fetch the HTML text
def getHTMLText(url):
    try:
        # Pretend to be a browser
        kv = {'user-agent': 'Mozilla/5.0'}
        r = requests.get(url, headers=kv, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        print("InternetError!")
        return " "

# Parse the HTML text and return a soup object
def HTMLTextconvert(html):
    try:
        soup = bs4.BeautifulSoup(html, "html.parser")
        return soup
    except:
        print("HTMLConvertError!")
        return " "

# Search the HTML for the ranking information.
# A pinned entry may be present and needs special handling.
def HTMLSearch(html, ranklist, cnt):
    try:
        flag = 0
        # counter
        mm = 0
        # Find the tbody tag and walk through all of its children
        for tr in html.find("tbody").children:
            # Only process children that are Tag nodes
            if isinstance(tr, bs4.element.Tag):
                # Use flag to special-case the pinned entry
                if flag == 0:
                    rank = "置頂"  # "pinned"
                    # "class" clashes with the Python keyword, hence class_
                    td02 = tr.find_all(class_=re.compile('td-02'))
                    for i in td02:
                        if isinstance(i, bs4.element.Tag):
                            # find_all returns a list
                            trans = i.find_all("a")
                            number = " "
                            ranklist.append([rank, trans[0].string, number])
                    flag = 1
                else:
                    # The rank sits in the td tag with class=td-01
                    td01 = tr.find_all(class_=re.compile("td-01"))
                    for i in td01:
                        if isinstance(i, bs4.element.Tag):
                            rank = i.string
                    # The title and search volume sit in the td tag with class=td-02:
                    # the title in an a tag, the volume in a span tag
                    td02 = tr.find_all(class_=re.compile("td-02"))
                    for i in td02:
                        name = i.find_all("a")
                        column = i.find_all("span")
                        # .string is unreliable here because some titles contain
                        # emoji, so use .text instead
                        ranklist.append([rank, name[0].text, column[0].text])
                mm += 1
                if mm == cnt:
                    break
    except:
        print("HTMLSearchError!")

# Print the ranking
def Rankprint(ranklist, num):
    try:
        # Print the header first; the total width is 70 characters, where {1} and
        # {3} are computed space counts. Default layout:
        # rank*4, spaces*3, title*50, spaces*5, volume*8
        a = " "
        print("——————————————————————————————————————")
        print("{0}{1}{2}{3}{4}\n".format("排名", a * 5, "熱搜內容", a * 45, "搜索量" + a * 2))
        if TOP_SHOW:
            print("{0}{1}{2}\n".format(ranklist[0][0], a * 3, ranklist[0][1]))
        for i in range(1, num+1):
            # c corrects the padding for one- versus two-digit ranks
            c = 7 - len(ranklist[i][0])
            # Compute the padding from the display width of the title
            b = 62 - length(ranklist[i][1]) - len(ranklist[i][0]) - c
            print("{0}{1}{2}{3}{4:<8}".format(ranklist[i][0], a * c, ranklist[i][1], a * b, ranklist[i][2]))
        print("\n")
    except:
        print("RankPrintError!")

def CommitDB(ranklist, num, tt):
    try:
        # Connect to MySQL
        conn = pymysql.connect(host=HOST, user=USER, password=PASSWORD,
                               port=PORT, database=DATABASE, charset=CHARSET)
    except:
        Logs(tt, 1, 'DB Connection Failed')
        return
    try:
        cursor = conn.cursor()
        for i in range(1, num):
            rank = int(ranklist[i][0])
            name = ranklist[i][1]
            number = int(ranklist[i][2])
            sql = "INSERT INTO {0} VALUES('{1}', {2}, {3}, {4})".format(
                DB_NAME, name, rank, number, eval(tt))
            cursor.execute(sql)
        conn.commit()
        cursor.close()
        conn.close()
        Logs(tt, 0, 'Commit Done')
    except:
        Logs(tt, 1, 'Commit Failed')

def Logs(tt, number, text):
    # Append a log entry - 0 marks a success entry, 1 an error entry
    f = open("./weibo.log", "a")
    if number == 0:
        f.write('SUCCESS: [' + tt + '] -- ' + text + '\n')
    else:
        f.write('ERROR: [' + tt + '] -- ' + text + '\n')
    f.close()

# Main function
def main():
    # while(True):
    try:
        # URL of the Weibo hot search page
        url = "https://s.weibo.com/top/summary?Refer=top_hot&topnav=1&wvr=6"
        # A 2D list holding the rank and content of each hot search entry
        rank = []
        text = getHTMLText(url)
        tt = time.strftime("%Y%m%d%H%M%S", time.localtime())
        # print("current time: " + tt)
        soup = HTMLTextconvert(text)
        HTMLSearch(soup, rank, TOP_COUNT+1)
        Rankprint(rank, TOP_COUNT)
        # CommitDB(rank, TOP_COUNT+1, tt)
        # Record a snapshot every 3 minutes
        # time.sleep(180)
    except:
        Logs(tt, 1, 'SystemError!')
    return 0

if __name__ == '__main__':
    print("獲取前" + str(TOP_COUNT) + "條")  # fetching the top TOP_COUNT entries
    print("顯示置頂" + str(TOP_SHOW))        # whether the pinned entry is shown
    main()
```
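For reference, each run appends one line to weibo.log in the format defined by Logs; an excerpt might look like this (timestamps invented):

```
SUCCESS: [20210101120000] -- Commit Done
ERROR: [20210101120300] -- DB Connection Failed
```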