
“Six Degrees of Baidu Baike” (Simple Version)

Published: 2023/12/20


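You have probably heard of “Six Degrees of Wikipedia”. This article covers only the first stage of that idea: building a crawler that moves from one page to another. The Baidu Baike entry for finance (金融) is used for testing.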

Preparation

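Fixing garbled URLs: Baidu Baike URLs look garbled when displayed because the Chinese characters in them are percent-encoded. The following function decodes them.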

# https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860
from urllib.parse import unquote

url = 'https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860'

def new_url(url):
    # Decode percent-encoded UTF-8 bytes back into readable characters
    new_url = unquote(url, 'utf8')
    return new_url
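As a quick check of what unquote does here, the sketch below decodes the two entry URLs that appear in this article and then re-encodes one of them with quote. The helper name decode_entry_url is made up for illustration; only unquote and quote come from urllib.parse.

```python
from urllib.parse import quote, unquote

def decode_entry_url(url):
    """Return the URL with percent-encoded UTF-8 bytes decoded (hypothetical helper)."""
    return unquote(url, encoding='utf-8')

finance = decode_entry_url('https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860')
accounting = decode_entry_url('/item/%E4%BC%9A%E8%AE%A1/88436')
print(finance)     # https://baike.baidu.com/item/金融/860
print(accounting)  # /item/会计/88436

# quote() goes the other way; '/' is left untouched by default.
encoded = quote('/item/金融/860')
print(encoded)     # /item/%E9%87%91%E8%9E%8D/860
```

The decoded form is only for display. Requests should still use the percent-encoded path.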

Implementation

First, find every link on the page. The links are in a tags.

from urllib.request import urlopen
from bs4 import BeautifulSoup
from urllib.parse import unquote

url = 'https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860'

def new_url(url):
    new_url = unquote(url, 'utf8')
    return new_url

html = urlopen(url)
bs = BeautifulSoup(html, 'html.parser')
for link in bs.find_all('a'):
    # Some a tags have no href attribute, so check before reading it
    if 'href' in link.attrs:
        print(link.attrs['href'])
# Both wanted and unwanted links are printed, so a further filtering step is needed
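The same idea can be tried offline. The sketch below is not the article's BeautifulSoup code: it swaps in the standard library's html.parser.HTMLParser so it runs without a network connection or third-party packages. The class name HrefCollector and the sample HTML are made up for illustration. It shows why the href check matters, since not every a tag carries one.

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag that has one (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)

# Made-up page fragment: one entry link, two unwanted links, one <a> without href
SAMPLE_HTML = '''
<div>
  <a href="/item/%E4%BC%9A%E8%AE%A1/88436">entry</a>
  <a href="/help">help</a>
  <a name="top">anchor without href</a>
  <a href="javascript:;">widget</a>
</div>
'''

collector = HrefCollector()
collector.feed(SAMPLE_HTML)
for href in collector.hrefs:
    print(href)
```

Three of the four tags yield a link, and only one of those three is an entry link, which is the filtering problem addressed next.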

Next, narrow the results down to entry links. The entry links share a common pattern:

Entry links all have a form like /item/%E4%BC%9A%E8%AE%A1/88436

Filter the links with a regular expression:

# ^(/item/).*?/[0-9]*$
# https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860
from urllib.request import urlopen
from bs4 import BeautifulSoup
from urllib.parse import unquote
import re

url = 'https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860'

def new_url(url):
    new_url = unquote(url, 'utf8')
    return new_url

html = urlopen(url)
bs = BeautifulSoup(html, 'html.parser')
# Keep only a tags whose href matches the entry-link pattern
for link in bs.find_all('a', href=re.compile('^(/item/).*?/[0-9]*$')):
    if 'href' in link.attrs:
        print(link.attrs['href'])
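To see what this pattern accepts and rejects without fetching a page, the sketch below runs it against a handful of hrefs. The first two come from this article, and the rest are hypothetical examples chosen to probe the edges. It uses search because, as far as I know, that is how BeautifulSoup applies a compiled pattern to an attribute value.

```python
import re

ENTRY_LINK = re.compile('^(/item/).*?/[0-9]*$')

samples = [
    '/item/%E4%BC%9A%E8%AE%A1/88436',  # entry link quoted in the article
    '/item/%E9%87%91%E8%9E%8D/860',    # the starting entry
    '/item/%E9%87%91%E8%9E%8D',        # hypothetical: no numeric id segment
    '/item/%E9%87%91%E8%9E%8D/',       # hypothetical: trailing slash, empty id
    'https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860',  # absolute URL
    '/help',                           # hypothetical non-entry link
]

matches = [href for href in samples if ENTRY_LINK.search(href)]
for href in samples:
    print(href, href in matches)
```

Because [0-9]* also matches zero digits, the trailing-slash form is accepted. Writing [0-9]+ instead would require at least one digit, if that stricter behaviour is wanted.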

Wrap the logic in a function to improve the structure

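import random
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

def getLinks(articleUrl):
    # Fetch an entry page and return the a tags that point to other entries
    html = urlopen('https://baike.baidu.com{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find_all('a', href=re.compile('^(/item/).*?/[0-9]*$'))

links = getLinks('/item/%E9%87%91%E8%9E%8D/860')
while len(links) > 0:
    # Pick a random entry link on the current page and follow it
    newArticle = links[random.randint(0, len(links) - 1)].attrs['href']
    print(newArticle)
    links = getLinks(newArticle)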

Complete code:

# https://baike.baidu.com/item/%E9%87%91%E8%9E%8D/860
from urllib.request import urlopen
from bs4 import BeautifulSoup
from urllib.parse import unquote
import random
import re

# Seed from system entropy; passing a datetime object to random.seed()
# raises TypeError on Python 3.11 and later
random.seed()

def new_url(url):
    # Decode a percent-encoded URL for display
    new_url = unquote(url, 'utf8')
    return new_url

def getLinks(articleUrl):
    # Fetch an entry page and return the a tags that point to other entries
    html = urlopen('https://baike.baidu.com{}'.format(articleUrl))
    bs = BeautifulSoup(html, 'html.parser')
    return bs.find_all('a', href=re.compile('^(/item/).*?/[0-9]*$'))

links = getLinks('/item/%E9%87%91%E8%9E%8D/860')
while len(links) > 0:
    # Pick a random entry link on the current page and follow it
    newArticle = links[random.randint(0, len(links) - 1)].attrs['href']
    print(newArticle)
    links = getLinks(newArticle)
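The loop above stops only when a page has no entry links, so in practice it may run for a very long time and revisit pages. The sketch below is one possible variation, not the article's method: it adds a step limit and a visited set, and takes the link-fetching function as a parameter so the walk can be tried on a made-up graph without any network access. The names random_walk, FAKE_GRAPH and fake_get_links are all hypothetical.

```python
import random

def random_walk(start, get_links, max_steps=10, rng=None):
    """Follow random unvisited links from start, taking at most max_steps steps.

    get_links is any callable that maps an entry path to a list of entry paths.
    """
    rng = rng or random.Random()
    path = [start]
    visited = {start}
    current = start
    for _ in range(max_steps):
        candidates = [link for link in get_links(current) if link not in visited]
        if not candidates:
            break  # dead end: every linked entry was already visited
        current = rng.choice(candidates)
        visited.add(current)
        path.append(current)
    return path

# Made-up link graph standing in for real pages
FAKE_GRAPH = {
    '/item/A/1': ['/item/B/2', '/item/C/3'],
    '/item/B/2': ['/item/A/1', '/item/C/3'],
    '/item/C/3': ['/item/A/1'],
}

def fake_get_links(path):
    return FAKE_GRAPH.get(path, [])

walk = random_walk('/item/A/1', fake_get_links, max_steps=5, rng=random.Random(0))
print(walk)
```

To run it against the real site, get_links could be a small wrapper around getLinks that returns the href strings, for example lambda p: [a.attrs['href'] for a in getLinks(p)]. Pausing between requests with time.sleep would also be a considerate addition.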
