
Python Web Crawling and Information Extraction

Published: 2023/12/20


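1. Getting Started with the Requests Library

Installing Requests

Open a command prompt as administrator and run:

pip install requests

To test the installation, open IDLE:

>>> import requests
>>> r = requests.get("http://www.baidu.com")
>>> r.status_code
200
>>> r.encoding = 'utf-8'  # override the guessed encoding
>>> r.text                # the page content as a string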

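The HTTP Protocol

HTTP stands for Hypertext Transfer Protocol.

HTTP is a stateless application-layer protocol based on a request/response model.

HTTP uses URLs to identify and locate network resources.

URL format

http://host[:port][path]

host: a valid Internet host name or IP address

port: the port number; the default is 80

path: the path of the requested resource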
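The three parts of the URL format can be pulled apart with the standard library. The following is a minimal sketch using urllib.parse.urlsplit; the example URLs and the helper name split_http_url are made up for illustration, and the fallback to port 80 mirrors the default described in these notes.

```python
from urllib.parse import urlsplit

def split_http_url(url):
    """Return (host, port, path) for an http:// URL (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.hostname
    # urlsplit reports port as None when the URL does not name one
    port = parts.port if parts.port is not None else 80
    path = parts.path or "/"
    return host, port, path

print(split_http_url("http://example.com:8080/docs/index.html"))
print(split_http_url("http://example.com"))
```

The second call shows the defaults: no port in the URL means port 80, and an empty path is treated as the root path.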

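Operations

Method	Description

GET	Request the resource at the URL
HEAD	Request only the response headers for the resource at the URL
POST	Submit new data to be appended to the resource at the URL
PUT	Store a resource at the URL, replacing whatever was there
PATCH	Partially update the resource at the URL, changing only some of its content
DELETE	Delete the resource stored at the URL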

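Main methods of Requests

Method	Description

requests.request()	Constructs a request; the base method underlying all the others
requests.get()	The main method for fetching an HTML page; corresponds to HTTP GET
requests.head()	Fetches only the headers of an HTML page; corresponds to HTTP HEAD
requests.post()	Submits a POST request to a page; corresponds to HTTP POST
requests.put()	Submits a PUT request to a page; corresponds to HTTP PUT
requests.patch()	Submits a partial-update request to a page; corresponds to HTTP PATCH
requests.delete()	Submits a delete request to a page; corresponds to HTTP DELETE

request() is the core method; the others are convenience wrappers built on top of it.

The request() method

requests.request(method, url, **kwargs)
# method: the request method, one of the 7 HTTP methods such as GET, PUT, or POST
# url: the URL of the page to fetch
# **kwargs: 13 parameters that control the request, all optional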

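The get() method

r = requests.get(url)

Full signature:

requests.get(url, params=None, **kwargs)
# url: the URL of the page to fetch
# params: extra parameters appended to the URL, as a dict or bytes; optional
# **kwargs: 12 parameters that control the request; optional

The get() method constructs a Request object that asks the server for a resource, and returns a Response object containing the server's reply.

The Response object

Attribute	Description

r.status_code	HTTP status code of the response; 200 means success, while 404 is one example of failure
r.text	The response body as a string, i.e. the content of the page at the URL
r.encoding	The response encoding guessed from the HTTP headers
r.apparent_encoding	The response encoding inferred from the content itself (a fallback)
r.content	The response body as raw bytes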
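The relationship between r.content and r.text is a decoding step: r.text is r.content decoded with r.encoding. The sketch below imitates that step with plain bytes, without touching the network; the sample bytes are made up, and ISO-8859-1 stands in for a wrongly guessed encoding.

```python
# Bytes such as a server might send for a UTF-8 page (made-up sample data).
content = "<p>中文</p>".encode("utf-8")

# Decoding with the wrong encoding garbles the non-ASCII text ...
wrong = content.decode("iso-8859-1")
# ... while decoding with the right one recovers it.
right = content.decode("utf-8")

print(repr(wrong))
print(repr(right))
```

This is why setting r.encoding = r.apparent_encoding before reading r.text is a common habit when a page's headers do not declare its character set.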

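The head() method

Fetches summary information about a network resource without downloading its body.

r = requests.head('http://httpbin.org/get')
r.headers

The post() method

Submits new data to the server.

payload = {'key1': 'value1', 'key2': 'value2'}  # build a dict
# POSTing a dict: it is automatically encoded as form data
r = requests.post('http://httpbin.org/post', data=payload)
# POSTing a string: it is sent as the raw request body (data)
r = requests.post('http://httpbin.org/post', data='ABC')
print(r.text)

The put() method

Used like post(), except that it overwrites the existing content.

The patch() method

The delete() method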

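Exceptions in Requests

Exception	Description

requests.ConnectionError	Network connection error, e.g. DNS lookup failure or connection refused
requests.HTTPError	HTTP error
requests.URLRequired	The URL is missing
requests.TooManyRedirects	The maximum number of redirects was exceeded
requests.ConnectTimeout	Timed out while connecting to the remote server
requests.Timeout	The request to the URL timed out

Exception-related method	Description

r.raise_for_status()	Raises requests.HTTPError if the status code indicates an error (4xx or 5xx)

A general-purpose framework for fetching a page

import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raises HTTPError on an error status
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return "An exception occurred"

if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))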

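Examples

Submitting a keyword to Baidu

import requests

# Submit a search keyword to the search engine
url = "http://www.baidu.com/s"  # Baidu's search path; the keyword goes in the wd parameter
try:
    kv = {'wd': 'python'}
    r = requests.get(url, params=kv)
    print(r.request.url)  # the URL that was actually requested
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print("An exception occurred")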
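To see what the params argument does to the URL without sending a request, the query string can be built by hand with the standard library. This is only a sketch of the encoding step, not the code requests itself runs; the /s search path is an assumption about Baidu's search endpoint.

```python
from urllib.parse import urlencode

base_url = "http://www.baidu.com/s"  # assumed search path
kv = {"wd": "python"}

# Build the query string by hand; requests performs an equivalent
# encoding when it is given params=kv.
full_url = base_url + "?" + urlencode(kv)
print(full_url)
```

Comparing this value with r.request.url from a real request is a quick way to check that the keyword was sent as intended.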

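Downloading and saving an image

import os
import requests

url = "http://image.ngchina.com.cn/2019/0423/20190423024928618.jpg"
root = "D://2345//Temp//"
path = root + url.split('/')[-1]  # name the file after the last URL segment
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, 'wb') as f:  # the with block closes the file
            f.write(r.content)       # r.content is the raw bytes of the response
        print("File saved")
    else:
        print("File already exists")
except (OSError, requests.RequestException):
    print("Download failed")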

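2. Information Extraction: Getting Started with Beautiful Soup

Installing Beautiful Soup

pip install beautifulsoup4

Test:

import requests
from bs4 import BeautifulSoup  # import the BeautifulSoup class from bs4

r = requests.get("http://python123.io/ws/demo.html")
demo = r.text
soup = BeautifulSoup(demo, "html.parser")

Beautiful Soup is a library for parsing, traversing, and maintaining a "tag tree".

Basic elements of Beautiful Soup

Importing Beautiful Soup

The Beautiful Soup library is also known as beautifulsoup4 or bs4.

from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, "html.parser")

Basic elements of the BeautifulSoup class

Element	Description

Tag	A tag, the most basic unit of information; opened with <> and closed with </>
Name	The name of a tag; the name of <p>…</p> is 'p'; accessed as <tag>.name
Attributes	The attributes of a tag, organized as a dict; accessed as <tag>.attrs
NavigableString	The non-attribute string inside a tag, i.e. the text between <> and </>; accessed as <tag>.string
Comment	A comment inside a tag's string; a special kind of NavigableString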

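Traversing HTML content with bs4

Downward traversal

Attribute	Description

.contents	A list of the node's children
.children	An iterator over the node's children; like .contents, used to loop over them
.descendants	An iterator over all of the node's descendants, used in loops

# iterate over children
for child in soup.body.children:
    print(child)

# iterate over descendants
for child in soup.body.descendants:
    print(child)

Upward traversal

Attribute	Description

.parent	The node's parent tag
.parents	An iterator over the node's ancestor tags, used in loops

soup = BeautifulSoup(demo, "html.parser")
for parent in soup.a.parents:
    if parent is None:
        print(parent)
    else:
        print(parent.name)
# output:
# p
# body
# html
# [document]

Sibling traversal

Sibling traversal happens among nodes that share the same parent.

The next sibling may be a string rather than a tag.

Attribute	Description

.next_sibling	The next sibling node in HTML document order
.previous_sibling	The previous sibling node in HTML document order
.next_siblings	An iterator over all following sibling nodes in HTML document order
.previous_siblings	An iterator over all preceding sibling nodes in HTML document order

# iterate over following siblings
for sibling in soup.a.next_siblings:
    print(sibling)

# iterate over preceding siblings
for sibling in soup.a.previous_siblings:
    print(sibling)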

Formatting and encoding HTML with bs4

Formatting: .prettify()

soup = BeautifulSoup(demo, "html.parser")
print(soup.a.prettify())

Encoding: utf-8 by default

soup = BeautifulSoup("<p>中文</p>", "html.parser")
soup.p.string  # '中文'
print(soup.p.prettify())
# <p>
#  中文
# </p>

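3. Organizing and Extracting Information

Three forms of information markup

Marked-up information forms an organizational structure, which adds a dimension to the information.

Marked-up information can be used for communication, storage, and display.

The markup structure is as valuable as the information itself.

Marked-up information is easier for programs to understand and use.

XML: eXtensible Markup Language

The earliest general-purpose markup language; highly extensible but verbose.

Used for exchanging and transmitting information on the Internet.

JSON: JavaScript Object Notation

Values are typed, which suits processing by programs (such as JavaScript); more concise than XML.

Used for communication between mobile apps, the cloud, and nodes; has no comments.

# information is marked up as typed key-value pairs
"key": "value"
"key": ["value1", "value2"]
"key": {"subkey": "subvalue"}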
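The point about typed values is easy to see by parsing a small JSON document with the standard json module. The sample document below is made up for illustration.

```python
import json

# A made-up JSON document using the three value shapes listed above.
text = '{"name": "BIT", "zipcode": 100081, "tags": ["a", "b"], "addr": {"city": "Beijing"}}'
data = json.loads(text)

# Each value arrives with a Python type: str, int, list, dict.
for key, value in data.items():
    print(key, type(value).__name__)
```

No extra conversion is needed after parsing: the number is already an int and the nested object is already a dict.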

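YAML: YAML Ain't Markup Language

Values are untyped; it has the highest proportion of actual text and is very readable.

Used for configuration files in all kinds of systems; supports comments and is easy to read.

# information is marked up as untyped key-value pairs
key: "value"
key:  # comment
  - value1
  - value2
key:
  subkey: subvalue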

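General approaches to information extraction

Approach 1: fully parse the markup, then extract the key information.

XML, JSON, YAML

Requires a markup parser, for example tag-tree traversal with bs4.

Advantage: the parsed information is accurate.

Disadvantage: extraction is tedious and slow.

Approach 2: ignore the markup and search for the key information directly.

Search

A text-search function over the information is all that is needed.

Advantage: extraction is simple and fast.

Disadvantage: accuracy depends on the content being searched.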
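Approach 2 can be sketched with nothing more than a regular expression: the tag structure is ignored and the text is searched for href="..." directly. The HTML snippet and the pattern are illustrative assumptions, not taken from the course material.

```python
import re

# A made-up HTML fragment with two links.
html = ('<p><a href="http://example.com/a" id="link1">A</a> and '
        '<a id="link2" href="http://example.com/b">B</a></p>')

# Ignore the tag structure and search for href="..." directly.
links = re.findall(r'href="(.*?)"', html)
print(links)
```

The trade-off named above shows up quickly: the same pattern would also pick up an href that appears inside a comment or a script, so the result is only as reliable as the text being searched.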

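Combined approach: combine markup parsing with searching to extract the key information.

XML, JSON, YAML, plus search

Requires both a markup parser and a text-search function.

Example: extracting all URL links from an HTML page

Approach:
1. Find all <a> tags.
2. Parse each <a> tag and extract the link that follows href.

from bs4 import BeautifulSoup

soup = BeautifulSoup(demo, "html.parser")
for link in soup.find_all('a'):
    print(link.get('href'))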

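Searching HTML content with bs4

Method	Description

<>.find_all(name, attrs, recursive, string, **kwargs)	Returns a list holding the search results

Shorthand: <tag>(…) is equivalent to <tag>.find_all(…)

import re  # regular expression library

# name: a search string matched against tag names
soup.find_all('a')
soup.find_all(['a', 'b'])
soup.find_all(True)  # returns every tag in soup
for tag in soup.find_all(True):
    print(tag.name)  # html head title body p b p a a

# print every tag whose name contains "b", here body and b
for tag in soup.find_all(re.compile('b')):
    print(tag.name)  # body b

# attrs: a search string matched against attribute values; a specific attribute can be named
soup.find_all('p', 'course')
soup.find_all(id='link1')
soup.find_all(id=re.compile('link'))

# recursive: whether to search all descendants; True by default
soup.find_all('p', recursive=False)

# string: a search string matched against the text between <> and </>
soup.find_all(string="Basic Python")
soup.find_all(string=re.compile('Python'))

# shorthand: soup(...) is the same as soup.find_all(...)

Related methods (same parameters as .find_all())

Method	Description

<>.find()	Searches and returns only the first result (a single element, or None if nothing matches)
<>.find_parents()	Searches the ancestors; returns a list
<>.find_parent()	Returns the first result among the ancestors (a single element)
<>.find_next_siblings()	Searches the following siblings; returns a list
<>.find_next_sibling()	Returns the first result among the following siblings (a single element)
<>.find_previous_siblings()	Searches the preceding siblings; returns a list
<>.find_previous_sibling()	Returns the first result among the preceding siblings (a single element)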

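4. Information Extraction Example

A focused crawler for Chinese university rankings

Functional description:

Input: the URL of the university ranking page

Output: the ranking information printed to the screen (rank, university name, total score)

Technical approach: requests + bs4

Focused crawler: only the given URL is crawled; no links are followed

Program structure:

Step 1: fetch the content of the ranking page from the network

getHTMLText()

Step 2: extract the information from the page into a suitable data structure

fillUnivList()

Step 3: use the data structure to display and output the results

printUnivList()

First version of the code

import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):  # skip the strings between rows
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(ulist, num):
    # column headers: rank, university name, score
    print("{:^10}\t{:^6}\t{:^10}".format("排名", "學校名稱", "分數"))
    for i in range(num):
        u = ulist[i]
        print("{:^10}\t{:^6}\t{:^10}".format(u[0], u[1], u[2]))

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)  # 20 universities

main()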

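Aligning Chinese output

When a Chinese string is narrower than the field width, format() pads it with Western (half-width) spaces, which breaks the alignment.

Padding with the full-width Chinese space chr(12288) solves the problem.

The fields of a format specification, in order:

<fill>: a single character used for padding

<align>: < left-aligned, > right-aligned, ^ centered

<width>: the output width of the slot

,: thousands separator, for integers and floats

<.precision>: the precision of a float's fractional part, or the maximum output length of a string

<type>: integer types b, c, d, o, x, X; float types e, E, f, %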

Improved code

import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):  # skip the strings between rows
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(ulist, num):
    # {3} supplies the fill character for the name column: the full-width space chr(12288)
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    # column headers: rank, university name, score
    print(tplt.format("排名", "學校名稱", "分數", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)  # 20 universities

main()
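The effect of the fill character can be seen in isolation, without fetching any page. The sketch below centers the same Chinese string in a 10-character slot twice, once with the default half-width space and once with chr(12288); the sample name and the variable names are illustrative.

```python
FULLWIDTH_SPACE = chr(12288)  # the ideographic (full-width) space

name = "清华大学"
# Default fill: half-width spaces, narrower than the Chinese characters.
ascii_padded = "{:^10}".format(name)
# Fill passed as a second argument and referenced as {1} in the spec.
cjk_padded = "{0:{1}^10}".format(name, FULLWIDTH_SPACE)

print("[" + ascii_padded + "]")
print("[" + cjk_padded + "]")
```

Both strings are 10 characters long, but only the second keeps columns visually aligned in a terminal where every character in the column is full-width.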

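5. Hands-On: Getting Started with the re Library

Regular expressions

A general framework for expressing strings
A concise expression that describes a set of strings
A tool for capturing the "concise" and "characteristic" aspects of strings
Used to decide whether a string has a given characteristic

Regular expression syntax

Operator	Description	Example

.	Any single character
[ ]	Character set: a range of values for a single character	[abc] matches a, b, or c; [a-z] matches a single character from a to z
[^ ]	Negated character set: excluded values for a single character	[^abc] matches any single character other than a, b, or c
*	Zero or more repetitions of the preceding character	abc* matches ab, abc, abcc, abccc, and so on
+	One or more repetitions of the preceding character	abc+ matches abc, abcc, abccc, and so on
?	Zero or one repetition of the preceding character	abc? matches ab and abc
|	Either the left or the right expression	abc|def matches abc or def
{m}	Exactly m repetitions of the preceding character	ab{2}c matches abbc
{m,n}	Between m and n repetitions of the preceding character (n inclusive)	ab{1,2}c matches abc and abbc
^	Matches the start of the string	^abc matches abc at the start of a string
$	Matches the end of the string	abc$ matches abc at the end of a string
( )	Grouping; a | inside the parentheses applies only within the group	(abc) matches abc; (abc|def) matches abc or def
\d	A digit; equivalent to [0-9] for ASCII text
\w	A word character; equivalent to [A-Za-z0-9_] for ASCII text

Classic regular expressions

Regular expression	Description

^[A-Za-z]+$	A string made up of the 26 letters
^[A-Za-z0-9]+$	A string made up of the 26 letters and digits
^-?\d+$	A string in integer form
^[0-9]*[1-9][0-9]*$	A string in positive-integer form
[1-9]\d{5}	A postal code in mainland China, 6 digits
[\u4e00-\u9fa5]	Matches Chinese characters
\d{3}-\d{8}|\d{4}-\d{7}	A phone number in mainland China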
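A few of the classic patterns can be checked quickly with re.fullmatch, which succeeds only when the whole string matches. The dictionary keys and the helper name matches are made up for this sketch; the positive-integer pattern is written with its * quantifiers.

```python
import re

# A few of the classic patterns, keyed by illustrative names.
patterns = {
    "letters": r"^[A-Za-z]+$",
    "integer": r"^-?\d+$",
    "positive_integer": r"^[0-9]*[1-9][0-9]*$",
    "zip_code": r"[1-9]\d{5}",
}

def matches(name, text):
    """Return True if the whole text matches the named pattern."""
    return re.fullmatch(patterns[name], text) is not None

for name, text in [("letters", "Python"), ("integer", "-42"),
                   ("positive_integer", "000"), ("zip_code", "100081")]:
    print(name, repr(text), matches(name, text))
```

The positive-integer pattern accepts leading zeros ("007") but rejects strings made up only of zeros, since at least one digit from 1 to 9 is required.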

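Basic use of the re library

re is part of the Python standard library and is mainly used for string matching.

How regular expressions are written

A raw string is a string in which the backslash \ is not treated as an escape character.

The re library uses raw strings to represent regular expressions, written as r'text'.

For example:

r'[1-9]\d{5}'
r'\d{3}-\d{8}|\d{4}-\d{7}'

Main functions of the re library

Function	Description

re.search()	Searches a string for the first position that matches the regular expression; returns a match object
re.match()	Matches the regular expression starting at the beginning of a string; returns a match object
re.findall()	Searches a string and returns all matching substrings as a list
re.split()	Splits a string at the matches of the regular expression; returns a list
re.finditer()	Searches a string and returns an iterator over the matches; each element is a match object
re.sub()	Replaces every substring that matches the regular expression; returns the resulting string

re.search(pattern, string, flags=0)

Searches a string for the first position that matches the regular expression and returns a match object
- pattern: the regular expression, as a string or raw string;
- string: the string to search;
- flags: control flags for the regular expression;

Common flag	Description

re.I, re.IGNORECASE	Ignore case, so that [A-Z] also matches lowercase characters
re.M, re.MULTILINE	The ^ operator matches at the start of every line of the given string
re.S, re.DOTALL	The . operator matches every character; by default it matches every character except a newline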
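The three flags are easiest to tell apart side by side. The sample strings and variable names below are illustrative.

```python
import re

text = "first line\nSECOND line"

# re.I: [A-Z] also matches lowercase letters.
ignore_case = re.findall(r"[A-Z]+", "abc DEF", flags=re.I)

# re.M: ^ matches at the start of every line, not just the string.
line_starts = re.findall(r"^\w+", text, flags=re.M)

# Without re.S the dot stops at the newline; with it, the dot crosses it.
dot_default = re.search(r"first.*line", text).group(0)
dot_all = re.search(r"first.*line", text, flags=re.S).group(0)

print(ignore_case, line_starts)
print(repr(dot_default), repr(dot_all))
```

Flags can be combined with |, for example flags=re.I | re.M.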

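Example:

import re

match = re.search(r'[1-9]\d{5}', 'BIT 100081')
if match:
    print(match.group(0))  # 100081

re.match(pattern, string, flags=0)

Matches the regular expression starting at the beginning of a string and returns a match object
- pattern: the regular expression, as a string or raw string;
- string: the string to match;
- flags: control flags for the regular expression;

Example:

import re

match = re.match(r'[1-9]\d{5}', 'BIT 100081')
if match:
    print(match.group(0))  # prints nothing: match is None because the string does not start with a zip code
match = re.match(r'[1-9]\d{5}', '100081 BIT')
if match:
    print(match.group(0))  # 100081

re.findall(pattern, string, flags=0)

Searches a string and returns all matching substrings as a list
- pattern: the regular expression, as a string or raw string;
- string: the string to search;
- flags: control flags for the regular expression;

Example:

import re

ls = re.findall(r'[1-9]\d{5}', 'BIT100081 TSU100084')
print(ls)  # ['100081', '100084']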

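re.split(pattern, string, maxsplit=0, flags=0)

Splits a string at the matches of the regular expression and returns a list
- pattern: the regular expression, as a string or raw string;
- string: the string to split;
- maxsplit: the maximum number of splits; the remainder is returned as the last element;
- flags: control flags for the regular expression;

Example:

import re

ls = re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084')
print(ls)   # ['BIT', ' TSU', '']
ls2 = re.split(r'[1-9]\d{5}', 'BIT100081 TSU100084', maxsplit=1)
print(ls2)  # ['BIT', ' TSU100084']

re.finditer(pattern, string, flags=0)

Searches a string and returns an iterator over the matches; each element is a match object
- pattern: the regular expression, as a string or raw string;
- string: the string to search;
- flags: control flags for the regular expression;

Example:

import re

for m in re.finditer(r'[1-9]\d{5}', 'BIT100081 TSU100084'):
    if m:
        print(m.group(0))
# 100081
# 100084

re.sub(pattern, repl, string, count=0, flags=0)

Replaces every substring that matches the regular expression and returns the resulting string
- pattern: the regular expression, as a string or raw string;
- repl: the string that replaces each match;
- string: the string to search;
- count: the maximum number of replacements;
- flags: control flags for the regular expression;

Example:

import re

rst = re.sub(r'[1-9]\d{5}', ':zipcode', 'BIT 100081,TSU 100084')
print(rst)  # BIT :zipcode,TSU :zipcode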

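Another way to use the re library

A compiled pattern object has the same methods as the main functions of the re library.

# functional style: a one-off operation
rst = re.search(r'[1-9]\d{5}', 'BIT 100081')

# object-oriented style: compile once, then use many times
pat = re.compile(r'[1-9]\d{5}')
rst = pat.search('BIT 100081')

re.compile(pattern, flags=0)

Compiles the string form of a regular expression into a regular expression object
- pattern: the regular expression, as a string or raw string;
- flags: control flags for the regular expression;

regex = re.compile(r'[1-9]\d{5}')

The match object of the re library

import re

match = re.search(r'[1-9]\d{5}', 'BIT 100081')
if match:
    print(match.group(0))  # 100081
    print(type(match))     # <class 're.Match'>

Attributes of the match object

Attribute	Description

.string	The text that was searched
.re	The pattern object (regular expression) used for the match
.pos	The position in the text where the search started
.endpos	The position in the text where the search ended

Methods of the match object

Method	Description

.group(0)	The matched string
.start()	Start position of the match in the original string
.end()	End position of the match in the original string
.span()	Returns (.start(), .end())

import re

m = re.search(r'[1-9]\d{5}', 'BIT100081 TSU100084')
print(m.string)    # BIT100081 TSU100084
print(m.re)        # re.compile('[1-9]\\d{5}')
print(m.pos)       # 0
print(m.endpos)    # 19
print(m.group(0))  # 100081; only the first match is returned, use re.finditer() to get them all
print(m.start())   # 3
print(m.end())     # 9
print(m.span())    # (3, 9)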

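Greedy and minimal matching in the re library

By default the re library matches greedily, that is, it returns the longest matching substring.

import re

match = re.search(r'PY.*N', 'PYANBNCNDN')
print(match.group(0))  # PYANBNCNDN

Minimal matching:

import re

match = re.search(r'PY.*?N', 'PYANBNCNDN')
print(match.group(0))  # PYAN

Minimal-match operators

Operator	Description

*?	Zero or more repetitions of the preceding character, minimal match
+?	One or more repetitions of the preceding character, minimal match
??	Zero or one repetition of the preceding character, minimal match
{m,n}?	Between m and n repetitions of the preceding character (n inclusive), minimal match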
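Each minimal-match operator can be compared with its greedy counterpart on a short string. The inputs and variable names in this sketch are illustrative.

```python
import re

text = "PYANBNCNDN"

# * versus *?
greedy = re.search(r"PY.*N", text).group(0)
lazy = re.search(r"PY.*?N", text).group(0)

# {m,n} versus {m,n}?
bounded_greedy = re.search(r"N{1,3}", "NNNN").group(0)
bounded_lazy = re.search(r"N{1,3}?", "NNNN").group(0)

# ? versus ??
optional_greedy = re.search(r"ab?", "ab").group(0)
optional_lazy = re.search(r"ab??", "ab").group(0)

print(greedy, lazy)
print(bounded_greedy, bounded_lazy)
print(optional_greedy, optional_lazy)
```

In every pair the minimal form stops as soon as the rest of the pattern can succeed, while the greedy form takes as many characters as the pattern allows.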

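re example: a focused price-comparison crawler for Taobao

Functional description:

Goal: fetch Taobao search result pages and extract the product names and prices
Things to understand:
- Taobao's search interface
- How pagination is handled
Technical approach: requests + re

Program structure:

Step 1: submit the product search request and fetch the pages in a loop
Step 2: for each page, extract the product names and prices
Step 3: print the information to the screen

import requests
import re

def getHTMLText(url):
    # User-Agent from the browser's request headers; identifies the client (see below for how to get it)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
    try:
        # cookie from the browser's request headers; carries your own login session (see below)
        coo = ''
        cookies = {}
        for line in coo.split(';'):  # mimic the browser by sending its cookies
            name, value = line.strip().split('=', 1)
            cookies[name] = value
        r = requests.get(url, cookies=cookies, headers=headers, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except (requests.RequestException, ValueError):  # ValueError: coo is empty or malformed
        return ""

# Parse a fetched page and extract the price and name of each product
def parsePage(ilt, html):
    try:
        plt = re.findall(r'\"view_price\"\:\"[\d\.]*\"', html)
        tlt = re.findall(r'\"raw_title\"\:\".*?\"', html)
        for i in range(len(plt)):
            price = eval(plt[i].split(':')[1])  # eval strips the surrounding quotes
            title = eval(tlt[i].split(':')[1])
            ilt.append([price, title])
    except Exception:
        print("")

def printGoodsList(ilt):
    tplt = "{:4}\t{:8}\t{:16}"
    print(tplt.format("No.", "Price", "Product name"))
    count = 0
    for g in ilt:
        count = count + 1
        print(tplt.format(count, g[0], g[1]))

def main():
    goods = '書包'  # search keyword ("school bag")
    depth = 2       # crawl depth; 2 means two pages of results
    start_url = 'https://s.taobao.com/search?q=' + goods
    infoList = []
    for i in range(depth):
        try:
            url = start_url + '&s=' + str(44 * i)  # each page holds 44 items
            html = getHTMLText(url)
            parsePage(infoList, html)
        except Exception:
            continue
    printGoodsList(infoList)

main()

Note that Taobao has anti-crawler measures, so when fetching pages with the get() method of requests you need to send your local cookie; otherwise Taobao returns an error page and no data can be extracted.

You have to fill in the coo variable with the cookie from your own browser. To do so, press F12 in the browser, open the Network panel, search for "書包", and find the request for the search URL (usually the first one). Click it, find Request Headers under Headers on the right, locate the User-Agent and cookie fields, and paste their values into the corresponding places in the code.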
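The parsing step in the crawler relies on eval() to strip quotes, which means text taken from a web page is executed as Python. A capture group gets the same values without that risk. The sketch below is an alternative under stated assumptions: the html fragment is made up in the shape the patterns expect, and parse_cookie_string is a hypothetical helper that also tolerates an empty cookie string.

```python
import re

# A made-up fragment in the shape the patterns expect.
html = ('{"raw_title":"canvas backpack","view_price":"129.00"},'
        '{"raw_title":"school bag","view_price":"59.90"}')

# Capture groups return only the text between the quotes, so no eval() is needed.
prices = re.findall(r'"view_price":"([\d.]*)"', html)
titles = re.findall(r'"raw_title":"(.*?)"', html)
goods = [[float(p), t] for p, t in zip(prices, titles)]
print(goods)

def parse_cookie_string(coo):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}, skipping empty pieces."""
    cookies = {}
    for piece in coo.split(";"):
        if "=" in piece:
            name, value = piece.strip().split("=", 1)
            cookies[name] = value
    return cookies

print(parse_cookie_string("a=1; b=x=y"))
```

Using zip() also avoids an IndexError when the two lists happen to differ in length.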

re example: a focused crawler for stock data

Functional description:

Goal: get the names and trading information of all stocks on the Shanghai and Shenzhen stock exchanges
Output: saved to a file
Technical approach: requests + bs4 + re

Choosing a candidate data site:

Sina Stocks: https://finance.sina.com.cn/stock/
Baidu Stocks: https://gupiao.baidu.com/stock/
Selection criteria: the stock information is present statically in the HTML page, is not generated by JavaScript, and is not restricted by the Robots protocol.

Program structure

Step 1: get the list of stocks from Eastmoney (東方財富網)
Step 2: for each stock in the list, get its details from Baidu Stocks
Step 3: store the results in a file

First version of the code (fails with an error)

import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def getStockList(lst, stockURL):
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except (KeyError, IndexError):  # no href, or no stock code in it
            continue

def getStockInfo(lst, stockURL, fpath):
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})

            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'股票名稱': name.text.split()[0]})  # key means "stock name"

            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val

            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except Exception:
            traceback.print_exc()
            continue

def main():
    stock_list_url = 'https://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()

Improved code (still fails with an error)

Speed-up: the encoding is passed in instead of being detected from the content

import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url, code="utf-8"):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = code  # skip the slower apparent_encoding detection
        return r.text
    except requests.RequestException:
        return ""

def getStockList(lst, stockURL):
    html = getHTMLText(stockURL, "GB2312")
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except (KeyError, IndexError):  # no href, or no stock code in it
            continue

def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})

            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'股票名稱': name.text.split()[0]})  # key means "stock name"

            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val

            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
                count = count + 1
                # \r returns to the start of the line so the progress overwrites itself
                print("\rProgress: {:.2f}%".format(count * 100 / len(lst)), end="")
        except Exception:
            count = count + 1
            print("\rProgress: {:.2f}%".format(count * 100 / len(lst)), end="")
            continue

def main():
    stock_list_url = 'https://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()

Code that ran successfully

Because the Eastmoney link returned an error, a different site was used to get the stock list. The code is as follows:

import requests
import re
import traceback
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def getStockList(stockListURL):
    html = getHTMLText(stockListURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    lst = []
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[S][HZ]\d{6}", href)[0])
        except (KeyError, IndexError):  # no href, or no stock code in it
            continue
    lst = [item.lower() for item in lst]  # convert the scraped codes to lowercase
    return lst

def getStockInfo(lst, stockInfoURL, fpath):
    count = 0
    for stock in lst:
        url = stockInfoURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            if isinstance(stockInfo, bs4.element.Tag):  # skip pages without the expected block
                name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
                # key means "stock name"
                infoDict.update({'股票名稱': name.text.split('\n')[1].replace(' ', '')})
                keylist = stockInfo.find_all('dt')
                valuelist = stockInfo.find_all('dd')
                for i in range(len(keylist)):
                    key = keylist[i].text
                    val = valuelist[i].text
                    infoDict[key] = val
                with open(fpath, 'a', encoding='utf-8') as f:
                    f.write(str(infoDict) + '\n')
                    count = count + 1
                    print("\rProgress: {:.2f}%".format(count * 100 / len(lst)), end="")
        except Exception:
            count = count + 1
            print("\rProgress: {:.2f}%".format(count * 100 / len(lst)), end="")
            traceback.print_exc()
            continue

def main():
    fpath = 'D://gupiao.txt'
    stock_list_url = 'https://hq.gucheng.com/gpdmylb.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    stock_list = getStockList(stock_list_url)  # avoid shadowing the built-in name list
    getStockInfo(stock_list, stock_info_url, fpath)

main()

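6. The Scrapy Crawler Framework

A crawler framework is a software structure and a collection of components that implement crawler functionality.

A crawler framework is a half-finished product that helps users build a professional web crawler.

Installing Scrapy

pip install scrapy
# verify the installation
scrapy -h

Error encountered

building 'twisted.test.raiser' extension
error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": https://visualstudio.microsoft.com/downloads/

Solution

Check the Python version and whether it is 32-bit or 64-bit

C:\Users\ASUS>python
Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.

This shows Python 3.7.2, 64-bit.

Download Twisted

Go to http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted and download the Twisted whl file that matches your version;
for this version the file to download is Twisted-17.9.0-cp37-cp37m-win_amd64.whl

Note: the number after cp is the Python version and amd64 means 64-bit; on 32-bit Python, download the corresponding 32-bit file.

Install Twisted

python -m pip install D:\download\Twisted-19.2.0-cp37-cp37m-win_amd64.whl

Install Scrapy

python -m pip install scrapy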

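Anatomy of the Scrapy framework

Engine: no user changes needed

Controls the data flow between all modules
Triggers events based on conditions

Downloader: no user changes needed

Downloads pages according to requests

Scheduler: no user changes needed

Schedules and manages all crawl requests

Downloader Middleware: users may write configuration code

Purpose: user-configurable control between the Engine, Scheduler, and Downloader
Function: modify, drop, or add requests or responses

Spiders: users must write configuration code

Parse the responses (Response) returned by the Downloader
Produce scraped items
Produce additional crawl requests (Request)

Item Pipelines: users must write configuration code

Process the items produced by Spiders in pipeline fashion
Consist of a sequence of operations, like an assembly line; each operation is an Item Pipeline type
Possible operations: cleaning, validating, and de-duplicating the HTML data in scraped items, and storing the data in a database

Spider Middleware: users may write configuration code

Purpose: reprocess requests and scraped items
Function: modify, drop, or add requests or scraped items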

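requests vs. Scrapy

Similarities
- Both can request and crawl pages; they are the two main technical routes for Python crawlers
- Both are easy to use, well documented, and simple to get started with
- Neither handles JavaScript, form submission, or CAPTCHAs out of the box (both can be extended)

Differences

requests	Scrapy

Page-level crawler	Site-level crawler
Library	Framework
Little support for concurrency; lower performance	Good concurrency; higher performance
Focus on downloading pages	Focus on crawler structure
Flexible to customize	Flexible for ordinary customization; deep customization is hard
Very easy to pick up	Slightly harder to get started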

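Common Scrapy commands

The Scrapy command line

Scrapy is a professional crawler framework designed to run continuously, and it is operated through the Scrapy command line.

Command	Description	Format

startproject	Create a new project	scrapy startproject <name> [dir]
genspider	Create a spider	scrapy genspider [options] <name> <domain>
settings	Show the crawler's settings	scrapy settings [options]
crawl	Run a spider	scrapy crawl <spider>
list	List all spiders in the project	scrapy list
shell	Start an interactive shell for debugging a URL	scrapy shell [url]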

Basic use of the Scrapy framework

Step 1: create a Scrapy project

# Open a command prompt: press Win+R and enter cmd
# Change to the directory that will hold the project
D:\>cd demo
D:\demo>
# Create a project named python123demo
D:\demo>scrapy startproject python123demo
New Scrapy project 'python123demo', using template directory 'd:\program files\python\lib\site-packages\scrapy\templates\project', created in:
    D:\demo\python123demo

You can start your first spider with:
    cd python123demo
    scrapy genspider example example.com
D:\demo>

Layout of the generated project:

python123demo/ ----------------> outer directory

    scrapy.cfg ---------> deployment configuration for the Scrapy crawler

    python123demo/ ---------> user-defined Python code for the Scrapy framework

        __init__.py ----> initialization script

        items.py ----> Items code template (subclass it)

        middlewares.py ----> Middlewares code template (subclass it)

        pipelines.py ----> Pipelines code template (subclass it)

        settings.py ----> configuration file for the Scrapy crawler

        spiders/ ----> directory of Spiders code templates (subclass them)

spiders/ ----------------> directory of Spiders code templates (subclass them)

    __init__.py --------> initial file; no need to modify

    __pycache__/ --------> cache directory; no need to modify

Step 2: generate a Scrapy spider in the project

# Change to the project directory
D:\demo>cd python123demo
# Generate a Scrapy spider
D:\demo\python123demo>scrapy genspider demo python123.io
Created spider 'demo' using template 'basic' in module:
  python123demo.spiders.demo
D:\demo\python123demo>

Step 3: configure the generated spider

Edit D:\demo\python123demo\python123demo\spiders\demo.py

# -*- coding: utf-8 -*-
import scrapy

class DemoSpider(scrapy.Spider):
    name = 'demo'
    # allowed_domains = ['python123.io']
    start_urls = ['http://python123.io/ws/demo.html']

    def parse(self, response):
        fname = response.url.split('/')[-1]  # name the file after the last URL segment
        with open(fname, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s.' % fname)

The full, explicit way to write the same spider

import scrapy

class DemoSpider(scrapy.Spider):
    name = "demo"

    def start_requests(self):
        urls = ['http://python123.io/ws/demo.html']
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        fname = response.url.split('/')[-1]
        with open(fname, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s.' % fname)

Step 4: run the spider and fetch the page

# Run the spider with scrapy crawl
D:\demo\python123demo>scrapy crawl demo

An error you may see

ModuleNotFoundError: No module named 'win32api'

Solution

Go to https://pypi.org/project/pypiwin32/#files and download the Python 3 file pypiwin32-223-py3-none-any.whl;

Install pypiwin32-223-py3-none-any.whl

pip install D:\download\pypiwin32-223-py3-none-any.whl

Run the spider again from the project directory

scrapy crawl demo

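Using the yield keyword

yield <-----------------------> generator
- A generator is a function that keeps producing values;
- A function that contains a yield statement is a generator;
- Each time a generator produces a value (at the yield statement), the function is frozen; when it is resumed, it produces the next value;

Example:

def gen(n):
    for i in range(n):
        yield i**2  # yields the square of every integer below n

for i in gen(5):
    print(i, " ", end="")
# 0 1 4 9 16

# the ordinary way
def square(n):
    ls = [i**2 for i in range(n)]
    return ls

for i in square(5):
    print(i, " ", end="")
# 0 1 4 9 16

Why have generators?

Advantages of a generator over listing everything at once
- Uses less memory
- Responds faster
- More flexible to use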
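The memory and responsiveness points can be observed directly. The sketch below builds the same squares as a list and as a generator and compares the size of the two objects; the function name gen_squares and the value of n are illustrative.

```python
import sys
from itertools import islice

def gen_squares(n):
    """Yield the square of every integer below n, one at a time."""
    for i in range(n):
        yield i ** 2

n = 100_000
as_list = [i ** 2 for i in range(n)]  # all values are computed and stored now
as_gen = gen_squares(n)               # nothing is computed yet

# The list object grows with n; the generator object stays small.
print(sys.getsizeof(as_list) > sys.getsizeof(as_gen))

# Only the first five values are ever computed here.
first_five = list(islice(gen_squares(n), 5))
print(first_five)
```

This is the same property Scrapy relies on when a spider yields requests one by one instead of returning a full list.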

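Steps for using a Scrapy crawler

Step 1: create a project and a Spider template;
Step 2: write the Spider;
Step 3: write the Item Pipeline;
Step 4: tune the configuration

Data types in a Scrapy crawler

The Request class

class scrapy.http.Request()

A Request object represents an HTTP request
It is generated by Spiders and executed by the Downloader

Attribute or method	Description

.url	The URL of the request
.method	The request method, e.g. 'GET' or 'POST'
.headers	The request headers, in a dict-like style
.body	The body of the request, as bytes
.meta	Extra information added by the user, used to pass data between Scrapy's internal modules
.copy()	Copies the request

The Response class

class scrapy.http.Response()

A Response object represents an HTTP response
It is generated by the Downloader and processed by Spiders

Attribute or method	Description

.url	The URL of the response
.status	The HTTP status code; 200 by default
.headers	The headers of the response
.body	The content of the response, as bytes
.flags	A set of flags
.request	The Request object that produced this Response
.copy()	Copies the response

The Item class

class scrapy.item.Item()

An Item object represents a piece of information extracted from an HTML page
It is generated by Spiders and processed by Item Pipelines
An Item is dict-like and can be manipulated like a dict

Basic use of CSS selectors

<response>.css('a::attr(href)').extract()

CSS selectors are maintained and standardized by the W3C.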

Example: a Scrapy crawler for stock data

Functional description:

Technical approach: scrapy
Goal: get the names and trading information of all stocks on the Shanghai and Shenzhen stock exchanges
Output: saved to a file

Writing the example

Step 1: open a command prompt and create the project and the Spider template

scrapy startproject BaiduStocks
cd BaiduStocks
scrapy genspider stocks baidu.com
# then edit spiders/stocks.py

Step 2: write the Spider
- Configure stocks.py
- Change how returned pages are handled
- Change how requests for newly found URLs are generated

Open spiders/stocks.py

# -*- coding: utf-8 -*-
import scrapy
import re

class StocksSpider(scrapy.Spider):
    name = "stocks"
    start_urls = ['https://quote.eastmoney.com/stocklist.html']

    def parse(self, response):
        for href in response.css('a::attr(href)').extract():
            try:
                stock = re.findall(r"[s][hz]\d{6}", href)[0]
                url = 'https://gupiao.baidu.com/stock/' + stock + '.html'
                yield scrapy.Request(url, callback=self.parse_stock)
            except IndexError:  # no stock code in this link
                continue

    def parse_stock(self, response):
        infoDict = {}
        stockInfo = response.css('.stock-bets')
        name = stockInfo.css('.bets-name').extract()[0]
        keyList = stockInfo.css('dt').extract()
        valueList = stockInfo.css('dd').extract()
        for i in range(len(keyList)):
            key = re.findall(r'>.*</dt>', keyList[i])[0][1:-5]
            try:
                val = re.findall(r'\d+\.?.*</dd>', valueList[i])[0][0:-5]
            except IndexError:
                val = '--'
            infoDict[key] = val

        # key means "stock name"
        infoDict.update(
            {'股票名稱': re.findall(r'\s.*\(', name)[0].split()[0] +
                     re.findall(r'\>.*\<', name)[0][1:-1]})
        yield infoDict

Step 3: write the Pipelines
- Configure pipelines.py
- Define a class that processes the scraped items (Scrapy Item)
- Configure the ITEM_PIPELINES option

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

class BaidustocksPipeline(object):
    def process_item(self, item, spider):
        return item

class BaidustocksInfoPipeline(object):
    def open_spider(self, spider):
        # opened once when the spider starts
        self.f = open('BaiduStockInfo.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        try:
            line = str(dict(item)) + '\n'
            self.f.write(line)
        except Exception:
            pass
        return item

settings.py

# Configure item pipelines
# See https://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'BaiduStocks.pipelines.BaidustocksInfoPipeline': 300,
}

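Configuring concurrency options

settings.py

Option	Description

CONCURRENT_REQUESTS	Maximum number of concurrent requests in the Downloader; 16 by default
CONCURRENT_ITEMS	Maximum number of items processed concurrently in the Item Pipeline; 100 by default
CONCURRENT_REQUESTS_PER_DOMAIN	Maximum number of concurrent requests per target domain; 8 by default
CONCURRENT_REQUESTS_PER_IP	Maximum number of concurrent requests per target IP; 0 by default, takes effect only when non-zero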
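In settings.py these options are plain module-level assignments. The excerpt below is a sketch only: the values are illustrative choices for a polite crawl, not recommendations from the course, and the setting names use the PER spelling that Scrapy expects.

```python
# settings.py (excerpt); the values are illustrative, not recommendations
CONCURRENT_REQUESTS = 16            # total concurrent requests in the Downloader
CONCURRENT_ITEMS = 100              # items processed in parallel per response
CONCURRENT_REQUESTS_PER_DOMAIN = 4  # stay gentle with any single site
CONCURRENT_REQUESTS_PER_IP = 0      # 0 leaves the per-IP limit switched off
```

Lowering the per-domain value is usually the first adjustment to try when a target site starts rejecting requests.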

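These notes were taken while following the course "Python Web Crawling and Information Extraction" (Python網絡爬蟲與信息提取), taught by Song Tian (嵩天) of Beijing Institute of Technology on the China University MOOC platform. Thanks to the platform and to Professor Song Tian for the course.
Source: China University MOOC, Beijing Institute of Technology, Song Tian, "Python Web Crawling and Information Extraction"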
