Python Web Crawling and Information Extraction (3): Web Crawlers in Practice
This series of notes is based on the Python course series taught by Professor Song Tian (Beijing Institute of Technology) on China University MOOC.
Reposted from: http://www.jianshu.com/p/98d0139dacac
7. Re(正則表達(dá)式)庫(kù)入門
regular expression = regex = RE
A regular expression is a general framework for describing strings: a concise expression that denotes a whole set of strings, and can also be used to test whether a given string belongs to that set.
- 正則表達(dá)式的語(yǔ)法
常用操作符1
常用操作符2
實(shí)例
經(jīng)典實(shí)例
Re庫(kù)的基本使用
- 正則表達(dá)式的表示類型為raw string類型(原生字符串類型),表示為r’text’
Re庫(kù)主要功能函數(shù)
功能函數(shù)
re.search(pattern,string,flags=0)
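re.search scans the whole string and returns the first match as a match object (or None). The postcode pattern is the classic example from the course:

```python
import re

# search anywhere in the string for a 6-digit Chinese postcode
match = re.search(r'[1-9]\d{5}', 'BIT 100081')
if match:
    print(match.group(0))  # '100081'
```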
re.match(pattern,string,flags=0)
因?yàn)閙atch為從開(kāi)始位置開(kāi)始匹配,使用時(shí)要加if進(jìn)行判別返回結(jié)果是否為空,否則會(huì)報(bào)錯(cuò)
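A sketch of the pitfall described above: re.match anchors at the start of the string, so the result must be tested before use:

```python
import re

# the string does not start with a postcode, so match fails
m = re.match(r'[1-9]\d{5}', 'BIT 100081')
print(m)  # None

# guard with if -- calling .group(0) on None raises AttributeError
m = re.match(r'[1-9]\d{5}', '100081 BIT')
if m:
    print(m.group(0))  # '100081'
```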
re.findall(pattern,string,flags=0)
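re.findall returns every non-overlapping match as a list of strings; for example:

```python
import re

ls = re.findall(r'[1-9]\d{5}', 'BIT100081 TSU100084')
print(ls)  # ['100081', '100084']
```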
re.split(pattern,string,maxsplit=0,flags=0)
maxsplit is the maximum number of splits; the remaining part of the string is returned as the last element
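A short demonstration of re.split and its maxsplit parameter:

```python
import re

# each postcode match becomes a split point
print(re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084'))
# ['BIT', ' TSU', '']

# with maxsplit=1, the remainder is kept as the last element
print(re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084', maxsplit=1))
# ['BIT', ' TSU100084']
```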
re.finditer(pattern,string,flags=0)
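re.finditer yields one match object per match, which is useful when each match's position is needed; for example:

```python
import re

for m in re.finditer(r'[1-9]\d{5}', 'BIT100081 TSU100084'):
    print(m.group(0))
# 100081
# 100084
```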
re.sub(pattern,repl,string,count=0,flags=0)
repl是用來(lái)替換的字符串,count為替換次數(shù)
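A short demonstration of re.sub and its count parameter (the replacement text ':zipcode' is illustrative):

```python
import re

# replace every postcode
print(re.sub(r'[1-9]\d{5}', ':zipcode', 'BIT100081 TSU100084'))
# BIT:zipcode TSU:zipcode

# count=1 replaces only the first match
print(re.sub(r'[1-9]\d{5}', ':zipcode', 'BIT100081 TSU100084', count=1))
# BIT:zipcode TSU100084
```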
Re庫(kù)的另一種等價(jià)用法
Re庫(kù)的函數(shù)式用法為一次性操作,還有一種為面向?qū)ο笥梅?可在編譯后多次操作
regex = re.compile(pattern,flags=0)
通過(guò)compile生成的regex對(duì)象才能被叫做正則表達(dá)式
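A sketch of the object-oriented usage: the compiled object exposes the same methods, minus the pattern argument:

```python
import re

# compile once, reuse many times
regex = re.compile(r'[1-9]\d{5}')
print(regex.search('BIT 100081').group(0))   # '100081'
print(regex.findall('BIT100081 TSU100084'))  # ['100081', '100084']
```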
Re庫(kù)的match對(duì)象
Match對(duì)象的屬性 -
Match對(duì)象的方法
實(shí)例- Re庫(kù)的貪婪匹配和最小匹配
Re庫(kù)默認(rèn)采取貪婪匹配,即輸出匹配最長(zhǎng)的子串
最小匹配操作符
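The difference on the course's classic example string:

```python
import re

# greedy (default): .* grabs the longest substring ending in N
print(re.search(r'PY.*N', 'PYANBNCNDN').group(0))   # 'PYANBNCNDN'

# minimal: .*? stops at the first N
print(re.search(r'PY.*?N', 'PYANBNCNDN').group(0))  # 'PYAN'
```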
8.實(shí)例二:淘寶商品比價(jià)定向爬蟲(chóng)(requests-re)步驟1:提交商品搜索請(qǐng)求,循環(huán)獲取頁(yè)面
```python
import requests
import re

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def parsePage(ilt, html):
    # the page embeds data as "view_price":"..." and "raw_title":"..."
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            price = eval(plt[i].split(':')[1])  # eval strips the surrounding quotes
            title = eval(tlt[i].split(':')[1])
            ilt.append([price, title])
    except:
        print("")

def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("序號", "價格", "商品名稱"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))

def main():
    goods = '書包'
    depth = 3                   # number of result pages to crawl
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44*i)  # each page holds 44 items
            html = getHTMLText(url)
            parsePage(infoList, html)
        except:
            continue
    printGoodsList(infoList)

main()
```
Step 2: For each page, extract the product names and prices
Step 3: Print the collected information
9. Example 3: a focused crawler for stock data (requests + bs4 + re)
```python
# CrawBaiduStocksB.py
import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url, code="utf-8"):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except:
        return ""

def getStockList(lst, stockURL):
    # the listing page is GB2312-encoded; collect codes like sh600000 / sz000001
    html = getHTMLText(stockURL, "GB2312")
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue

def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'股票名稱': name.text.split()[0]})
            # the page lists fields as <dt>key</dt><dd>value</dd> pairs
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
            count = count + 1
            print("\r當前進度: {:.2f}%".format(count*100/len(lst)), end="")
        except:
            count = count + 1
            print("\r當前進度: {:.2f}%".format(count*100/len(lst)), end="")
            continue

def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()
```
Step 1: Get the stock list from Eastmoney (東方財富網)
Step 2: For each stock in the list, fetch its details from Baidu Stocks (百度股票)
Step 3: Store the results in a file
Summary