Web Scraping in Practice (3): Fetching News and Poems from List Pages

Published: 2023/12/31


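The first two parts of this post have one goal: given a link, fetch the titles and content of the news articles across a paginated list. Part of the code is adapted from Web Scraping in Practice (2): Sina News Content Details. The site being scraped is Sina's international news channel.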

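1. A single news article

Fetch the content of a single item under "latest international news".
Starting from the code linked above, only a few parameters needed changing, mainly the ones used for the comment count.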

    import json
    import re
    from datetime import datetime

    import requests
    from bs4 import BeautifulSoup

    # Comment-count endpoint: the URLs differ only in the news id.
    commentURL = (
        "https://comment.sina.com.cn/page/info?version=1&format=json"
        "&channel=gj&newsid=comos-i{}&group=0&compress=0&ie=utf-8&oe=utf-8&page=1"
        "&page_size=3&t_size=3&h_size=3&thread=1&uid=unlogin_user"
        "&callback=jsonp_1601956837238&_=1601956837238"
    )


    def getCommentCounts(newsurl):
        """Given a news URL, return the total comment count of that article."""
        m = re.search(r'doc-ii(.+)\.shtml', newsurl)
        newsid = m.group(1)  # news id embedded in the URL
        comments = requests.get(commentURL.format(newsid))
        # The reply is JSONP: keep only what sits between the outer parentheses.
        text = comments.text
        jd = json.loads(text[text.index('(') + 1:text.rindex(')')])
        return jd["result"]["count"]["total"]  # total number of comments


    def getNewsDetail(newsurl):
        """Input: URL. Output: article body, title, comment count, source."""
        result = {}
        res = requests.get(newsurl)
        res.encoding = 'utf-8'
        soup = BeautifulSoup(res.text, 'html.parser')
        result['title'] = soup.select(".main-title")[0].text
        result['newssource'] = soup.select(".source")[0].text
        timesource = soup.select(".date")[0].text
        result['dt'] = datetime.strptime(timesource, "%Y年%m月%d日 %H:%M")
        # Every paragraph except the last one belongs to the article body.
        result['article'] = '\n'.join([p.text.strip() for p in soup.select("#article p")[:-1]])
        # The last paragraph holds the editor line; drop its label.
        result['editor'] = soup.select("#article p")[-1].text.replace('责任编辑：', '').strip()
        result['comments'] = getCommentCounts(newsurl)
        return result


    news = "https://news.sina.com.cn/w/2020-10-06/doc-iivhvpwz0572161.shtml"
    getNewsDetail(news)
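Both Sina endpoints used in this post return JSONP, that is, JSON wrapped in a callback call. Note that `str.strip(chars)` removes any characters from the given set rather than a fixed prefix, so using it to peel off the callback name only works when the payload does not begin or end with one of those characters. The following is a minimal sketch of a more explicit unwrapping step; the helper name `unwrap_jsonp` and the sample response are assumptions made for illustration, not part of the original code.

```python
import json
import re


def unwrap_jsonp(text):
    """Return the JSON payload inside a JSONP response such as cb({...});"""
    # Match: optional callback name, "(", payload, ")", optional ";".
    m = re.match(r'^\s*[\w.$]*\s*\((.*)\)\s*;?\s*$', text, re.S)
    if m is None:
        raise ValueError("not a JSONP response")
    return json.loads(m.group(1))


# Hypothetical response body, shaped like the comment endpoint's reply.
sample = 'jsonp_1601956837238({"result": {"count": {"total": 12}}})'
data = unwrap_jsonp(sample)
print(data["result"]["count"]["total"])
```

The same helper would also handle the `newsloadercallback(...);` wrapper of the list endpoint, since it does not depend on the callback name.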

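2. A list of news articles

Approach:
First find the URL that controls pagination of the list (the original post shows this in a figure).
Then collect the links of all news articles on each page.
Next fetch the content behind each link.
Finally change the page number in the pagination URL and repeat.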

    import json

    import pandas as pd
    import requests


    def parselistlink(url):
        """Fetch one list page and return the details of every article on it."""
        newsdetails = []
        res = requests.get(url)
        # Remove the JSONP wrapper "newsloadercallback(...);" so json can parse it.
        text = res.text
        jd = json.loads(text[text.index('(') + 1:text.rindex(')')])
        for ent in jd['result']['data']:
            # Pass each article link to getNewsDetail (section 1) to get its content.
            newsdetails.append(getNewsDetail(ent['url']))
        return newsdetails


    url = ('https://interface.sina.cn/news/get_news_by_channel_new_v2018.d.html'
           '?cat_1=51923&show_num=27&level=1,2&page={}'
           '&callback=newsloadercallback&_=1601968313565')

    # Fetch the first 5 pages; range() excludes its end value, hence 6.
    result = pd.DataFrame()
    for i in range(1, 6):
        newsurl = url.format(i)
        newsary = parselistlink(newsurl)
        result = pd.concat([result, pd.DataFrame(newsary)], axis=0)
    print(result)

    # The same article can show up on more than one page: keep the first copy.
    result1 = result.drop_duplicates(keep='first')
    result1 = result1.reset_index(drop=True)
    print(result1)
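The page loop and the duplicate removal can also be tried without touching the network. The sketch below is only an illustration under stated assumptions: `collect_pages` and `fake_fetch` are hypothetical names, the example.com URL is a placeholder, and duplicates are detected by article URL, whereas `drop_duplicates` in the code above compares whole rows.

```python
def collect_pages(url_template, first_page, last_page, fetch_page):
    """Fetch pages first_page..last_page inclusive and drop repeated items."""
    seen = set()
    items = []
    # range() excludes its end value, hence last_page + 1.
    for page in range(first_page, last_page + 1):
        for item in fetch_page(url_template.format(page)):
            if item["url"] not in seen:  # keep the first occurrence only
                seen.add(item["url"])
                items.append(item)
    return items


# Hypothetical stand-in for a network call: page N lists articles N and N+1,
# so neighbouring pages overlap by one article.
def fake_fetch(url):
    n = int(url.rsplit("=", 1)[1])
    return [{"url": "doc-%d" % k} for k in (n, n + 1)]


result = collect_pages("https://example.com/list?page={}", 1, 5, fake_fetch)
print([item["url"] for item in result])
```

Passing the fetch function as a parameter keeps the loop testable; in a real run it would be replaced by something like `parselistlink`.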

3. A list of poems

Poem list link: https://www.shicimingju.com/chaxun/zuozhe/9_2.html

1. First collect the poem links on a single page

    import requests
    from bs4 import BeautifulSoup

    url = 'http://www.shicimingju.com/chaxun/zuozhe/9.html'
    base = 'https://www.shicimingju.com'
    # Send browser-like client information so the server does not
    # trivially identify the request as coming from a crawler.
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                             '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
    r = requests.get(url, headers=headers)
    # Undo a wrong decoding: back to bytes with the guessed codec, then decode as UTF-8.
    html = r.text.encode(r.encoding).decode()
    soup = BeautifulSoup(html, 'lxml')
    div = soup.find('div', attrs={'class': 'card shici_card'})
    # Each poem title is an <h3> that contains a relative link.
    hrefs = [h3.find('a')['href'] for h3 in div.find_all('h3')]
    hrefs = [base + i for i in hrefs]
    hrefs
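The expression `r.text.encode(r.encoding).decode()` looks odd at first. It presumably compensates for requests falling back to ISO-8859-1 when a text response declares no charset, while the page itself is UTF-8. The round trip can be reproduced offline; the snippet below is a small sketch of the idea with a made-up sample string, not a statement about what this particular server sends.

```python
# Bytes as a server would send them: UTF-8 encoded Chinese text.
raw = "水调歌头".encode("utf-8")

# What a client produces if it wrongly assumes ISO-8859-1.
garbled = raw.decode("iso-8859-1")

# Re-encoding with the wrong codec restores the original bytes,
# which can then be decoded with the right one (UTF-8 is the default).
repaired = garbled.encode("iso-8859-1").decode()

print(repaired == "水调歌头")
```

An alternative that avoids the round trip is to set `r.encoding = 'utf-8'` before reading `r.text`, as the news code in section 1 does.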

2. Then collect the poem links from every page

    def gethrefs(url):
        """Follow the "next page" links from url; return one list of poem links per page."""
        # Send browser-like client information so the server does not
        # trivially identify the request as coming from a crawler.
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                                 '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
        base = 'https://www.shicimingju.com'
        nexturl = url
        ans = []
        while nexturl is not None:
            r = requests.get(nexturl, headers=headers)
            html = r.text.encode(r.encoding).decode()
            soup = BeautifulSoup(html, 'lxml')
            div = soup.find('div', attrs={'class': 'card shici_card'})
            hrefs = [h3.find('a')['href'] for h3 in div.find_all('h3')]
            hrefs = [base + i for i in hrefs]
            ans.append(hrefs)
            # The pager link is labelled "下一页" (next page); it is absent on the last page.
            nextlink = soup.find('a', string='下一页')
            if nextlink is not None:
                nexturl = base + nextlink['href']
                print('Reading the next page')
            else:
                print('Reached the last page')
                nexturl = None
        return ans
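The "follow the next link until there is none" loop can be sketched without the network. In the example below the pages live in a dictionary, and relative links are resolved with `urllib.parse.urljoin` instead of string concatenation, which also copes with hrefs that are already absolute. The names `PAGES`, `collect_links`, and `load_page`, as well as the example.com URLs, are assumptions made for illustration.

```python
from urllib.parse import urljoin

# Hypothetical in-memory site: each page lists poem hrefs and an optional
# relative link to the next page.
PAGES = {
    "https://example.com/zuozhe/9.html": (["/shi/1.html", "/shi/2.html"], "/zuozhe/9_2.html"),
    "https://example.com/zuozhe/9_2.html": (["/shi/3.html"], None),
}


def collect_links(start_url, load_page):
    """Follow next-page links from start_url and return absolute item URLs."""
    links = []
    url = start_url
    visited = set()
    while url is not None and url not in visited:
        visited.add(url)  # guard against a page that links back to itself
        hrefs, next_href = load_page(url)
        links.extend(urljoin(url, h) for h in hrefs)
        url = urljoin(url, next_href) if next_href else None
    return links


print(collect_links("https://example.com/zuozhe/9.html", PAGES.__getitem__))
```

Unlike `gethrefs`, this sketch returns one flat list, which is convenient when every link is handed to the same download function afterwards.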

3. Fetch the poem behind each link

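    import os


    def writeotxt(url):
        """Download one poem page and save it as <title>.txt."""
        # Send browser-like client information so the server does not
        # trivially identify the request as coming from a crawler.
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                                 '(KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36'}
        r = requests.get(url, headers=headers)
        soup = BeautifulSoup(r.text.encode(r.encoding), 'lxml')
        # Data cleaning: extract the title and the poem text.
        title = soup.find('h1', id='zs_title').text
        content = soup.find('div', class_='item_content').text.strip()
        # Create the output folder first if it does not exist yet.
        firedir = os.path.join(os.getcwd(), '蘇軾的詞')
        if not os.path.exists(firedir):
            os.mkdir(firedir)
        with open(os.path.join(firedir, '%s.txt' % title), mode='w+', encoding='utf-8') as f:
            f.write(title + '\n')
            f.write(content + '\n')
        # The original printed a poem counter i that was never defined in this function.
        print('Saved poem: %s' % title)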
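Using a scraped title directly as a file name can fail when the title contains characters such as `/` or `?`. The following is a minimal sketch of the save step with a sanitised name, written with `pathlib`; the function name `save_poem`, the replacement rule, and the sample title are assumptions for illustration rather than part of the original code.

```python
import re
import tempfile
from pathlib import Path


def save_poem(folder, title, content):
    """Write one poem to <folder>/<title>.txt and return the file path."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    # Replace characters that are not allowed in file names on common systems.
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", title).strip() or "untitled"
    path = folder / (safe_title + ".txt")
    path.write_text(title + "\n" + content + "\n", encoding="utf-8")
    return path


# Demo in a temporary directory; the title and text are made-up placeholders.
with tempfile.TemporaryDirectory() as tmp:
    p = save_poem(Path(tmp) / "poems", "Title: one/two", "first line\nsecond line")
    print(p.name)
    print(p.read_text(encoding="utf-8"))
```

The file keeps the original title on its first line, so nothing is lost when the file name has to be altered.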
