Python Web Crawling and Information Extraction (3): Web Crawlers in Practice
This series of notes is based on the Python course series taught by Professor Song Tian (Beijing Institute of Technology) on China University MOOC.
Reposted from: http://www.jianshu.com/p/98d0139dacac
7. Re(正則表達(dá)式)庫(kù)入門
regular expression = regex = RE
A regular expression is a general framework for describing strings: a concise expression that denotes a whole set of strings, and can also be used to test whether a given string belongs to that set.
- 正則表達(dá)式的語(yǔ)法
常用操作符1
常用操作符2
實(shí)例
經(jīng)典實(shí)例
Re庫(kù)的基本使用
- 正則表達(dá)式的表示類型為raw string類型(原生字符串類型),表示為r’text’
Re庫(kù)主要功能函數(shù)
功能函數(shù)
re.search(pattern,string,flags=0)
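re.search scans the whole string and returns the first match as a match object (or None). The postcode pattern is the classic example from the course:

```python
import re

# search anywhere in the string for a 6-digit Chinese postcode
match = re.search(r'[1-9]\d{5}', 'BIT 100081')
if match:
    print(match.group(0))  # '100081'
```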
re.match(pattern,string,flags=0)
因?yàn)閙atch為從開(kāi)始位置開(kāi)始匹配,使用時(shí)要加if進(jìn)行判別返回結(jié)果是否為空,否則會(huì)報(bào)錯(cuò)
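A sketch of the pitfall described above: re.match anchors at the start of the string, so the result must be tested before use:

```python
import re

# the string does not start with a postcode, so match fails
m = re.match(r'[1-9]\d{5}', 'BIT 100081')
print(m)  # None

# guard with if -- calling .group(0) on None raises AttributeError
m = re.match(r'[1-9]\d{5}', '100081 BIT')
if m:
    print(m.group(0))  # '100081'
```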
re.findall(pattern,string,flags=0)
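re.findall returns every non-overlapping match as a list of strings; for example:

```python
import re

ls = re.findall(r'[1-9]\d{5}', 'BIT100081 TSU100084')
print(ls)  # ['100081', '100084']
```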
re.split(pattern,string,maxsplit=0,flags=0)
maxsplit is the maximum number of splits; the remaining part of the string is returned as the last element
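A short demonstration of re.split and its maxsplit parameter:

```python
import re

# each postcode match becomes a split point
print(re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084'))
# ['BIT', ' TSU', '']

# with maxsplit=1, the remainder is kept as the last element
print(re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084', maxsplit=1))
# ['BIT', ' TSU100084']
```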
re.finditer(pattern,string,flags=0)
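re.finditer yields one match object per match, which is useful when each match's position is needed; for example:

```python
import re

for m in re.finditer(r'[1-9]\d{5}', 'BIT100081 TSU100084'):
    print(m.group(0))
# 100081
# 100084
```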
re.sub(pattern,repl,string,count=0,flags=0)
repl是用來(lái)替換的字符串,count為替換次數(shù)
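A short demonstration of re.sub and its count parameter (the replacement text ':zipcode' is illustrative):

```python
import re

# replace every postcode
print(re.sub(r'[1-9]\d{5}', ':zipcode', 'BIT100081 TSU100084'))
# BIT:zipcode TSU:zipcode

# count=1 replaces only the first match
print(re.sub(r'[1-9]\d{5}', ':zipcode', 'BIT100081 TSU100084', count=1))
# BIT:zipcode TSU100084
```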
Re庫(kù)的另一種等價(jià)用法
Re庫(kù)的函數(shù)式用法為一次性操作,還有一種為面向?qū)ο笥梅?可在編譯后多次操作
regex = re.compile(pattern,flags=0)
通過(guò)compile生成的regex對(duì)象才能被叫做正則表達(dá)式
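A sketch of the object-oriented usage: the compiled object exposes the same methods, minus the pattern argument:

```python
import re

# compile once, reuse many times
regex = re.compile(r'[1-9]\d{5}')
print(regex.search('BIT 100081').group(0))   # '100081'
print(regex.findall('BIT100081 TSU100084'))  # ['100081', '100084']
```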
Re庫(kù)的match對(duì)象
Match對(duì)象的屬性 -
Match對(duì)象的方法
實(shí)例- Re庫(kù)的貪婪匹配和最小匹配
Re庫(kù)默認(rèn)采取貪婪匹配,即輸出匹配最長(zhǎng)的子串
最小匹配操作符
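The difference on the course's classic example string:

```python
import re

# greedy (default): .* grabs the longest substring ending in N
print(re.search(r'PY.*N', 'PYANBNCNDN').group(0))   # 'PYANBNCNDN'

# minimal: .*? stops at the first N
print(re.search(r'PY.*?N', 'PYANBNCNDN').group(0))  # 'PYAN'
```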
8.實(shí)例二:淘寶商品比價(jià)定向爬蟲(chóng)(requests-re)步驟1:提交商品搜索請(qǐng)求,循環(huán)獲取頁(yè)面
```python
import requests
import re

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def parsePage(ilt, html):
    # the page embeds data as "view_price":"..." and "raw_title":"..."
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            price = eval(plt[i].split(':')[1])  # eval strips the surrounding quotes
            title = eval(tlt[i].split(':')[1])
            ilt.append([price, title])
    except:
        print("")

def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("序號", "價格", "商品名稱"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))

def main():
    goods = '書包'
    depth = 3                   # number of result pages to crawl
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44*i)  # each page holds 44 items
            html = getHTMLText(url)
            parsePage(infoList, html)
        except:
            continue
    printGoodsList(infoList)

main()
```
Step 2: For each page, extract the product names and prices
Step 3: Print the collected information
9. Example 3: a focused crawler for stock data (requests + bs4 + re)
```python
# CrawBaiduStocksB.py
import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url, code="utf-8"):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except:
        return ""

def getStockList(lst, stockURL):
    # the listing page is GB2312-encoded; collect codes like sh600000 / sz000001
    html = getHTMLText(stockURL, "GB2312")
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue

def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'股票名稱': name.text.split()[0]})
            # the page lists fields as <dt>key</dt><dd>value</dd> pairs
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
            count = count + 1
            print("\r當前進度: {:.2f}%".format(count*100/len(lst)), end="")
        except:
            count = count + 1
            print("\r當前進度: {:.2f}%".format(count*100/len(lst)), end="")
            continue

def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()
```
Step 1: Get the stock list from Eastmoney (東方財富網)
Step 2: For each stock in the list, fetch its details from Baidu Stocks (百度股票)
Step 3: Store the results in a file
Summary