Processing Local HTML Files with Python 3 and BeautifulSoup4

Published: 2023/12/14 · python · 豆豆


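Contents

- The problem
- The initial text to process
- Some commonly used regular expressions for search and replace
- Using beautifulsoup4 in Python 3
  - What is beautifulsoup4?
  - Installing beautifulsoup4
  - Getting started with beautifulsoup4
- A few other small details
  - Joining a list into a string in Python 3
- The final code (Python 3)
- References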

My blog: https://hxd.red
Original article: https://hxd.red/2019/08/06/python3-beautifulsoup4-html-190805/
My WeChat official account: 不淡定的實驗室 (hxdred)

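The problem

While building my third WeChat mini program, "法語背單詞記憶小助手" (a helper for memorizing French vocabulary), I needed to process a large amount of word data. To settle definitions, example sentences and related issues once and for all, I decided to extract the data from an mdx dictionary, turn every word in the dictionary into rows of a data table, and upload the table to WeChat cloud development. That way my other mini program, "法語動詞變位記憶小助手" (a helper for memorizing French verb conjugations), can share the result.

Being lazy, I was never going to process this much data by hand: extracting the mdx produced 600,000 lines, and even after removing the verb conjugation data that I have no use for, 150,000 lines remained, covering a little over 12,000 words. So I decided to process the data with Python and Beautiful Soup (abbreviated BS below). To quote the official documentation: Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.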

The initial text to process

The initial text looks like this. Only the detail pages of two words are shown as a sample:

<zidingyi> abandonner <h1 class="Adresse" >abandonner</h1><br /><span class="CategorieGrammaticale" >verbe transitif </span><br /> <span class="Indicateur">(déserter) </span><br /> <div class="Traductionchinois" >擅離</div> <span class="Locution2" id="48" >abandonner son poste</span> <div class="Traduction2chinois" >擅離職守</div Traduction2> </td></tr></table> <span class="Indicateur">(laisser) </span><br /> <div class="Traductionchinois" >拋棄</div> <span class="Locution2" id="49" >abandonner un animal</span> <div class="Traduction2chinois" >丟棄一只動物</div Traduction2><span class="Locution2" id="50" >partir en abandonnant femme et enfants</span> <div class="Traduction2chinois" >拋棄妻子和孩子出走</div Traduction2> </td></tr></table> <span class="Indicateur">(renoncer à) </span><br /> <div class="Traductionchinois" >放棄</div> <span class="Locution2" id="51" >abandonner ses études</span> <div class="Traduction2chinois" >放棄自己的學業</div Traduction2> </td></tr></table> <span class="Indicateur">(se retirer de) </span><br /> <div class="Traductionchinois" >棄權</div> <span class="Locution2" id="52" >il a abandonné la course</span> <span class="Traduction2chinois" >他在這次賽跑中棄權</span></td></tr></table> <br /><br /> <h1 class="Adresse" >abandonner</h1><br /><span class="CategorieGrammaticale" >verbe intransitif </span><br /> <div class="Traductionchinois" >退出比賽</div> <span class="Locution2" id="53" >après sa chute, le cycliste a abandonné</span> <span class="Traduction2chinois" >這個自行車運動員摔倒后就退出了比賽</span> </zidingyi> <zidingyi> abat-jour <h1 class="Adresse" >abat-jour</h1> <br /><span class="CategorieGrammaticale" >nom masculin invariable</span><br /> <div class="Traductionchinois" >燈罩</div> </zidingyi>

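Some commonly used regular expressions for search and replace

The raw document contains a great many useless tags that need to be deleted. If a tag is a fixed string, an ordinary search and replace can handle it in bulk. But if a tag contains an id that changes in a regular way, or if the text between the tags varies, a regular expression is needed to find it. In practice, the expression I used most boils down to this: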

<a[^>]*>(.*?)</a>

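Two examples. The text between <span class="Traduction_py"> tags is irregular, but I need to replace every <span class="Traduction_py"></span Traduction_py> pair together with the text between the tags. Likewise, the <span class="Locution2" id="12"> tag carries an id number, but I need to replace all tags of this kind, whatever their id.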
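A minimal sketch of those two replacements, using Python's standard `re` module. The patterns are my own illustration, built on the template `<a[^>]*>(.*?)</a>` above; they are not the author's original expressions, and the sample string is made up.

```python
import re

# Hypothetical sample line: the tag names come from the article,
# the text inside the tags is invented for the demo.
sample = (
    '<span class="Traduction_py">shan li</span Traduction_py>'
    '<span class="Locution2" id="48">abandonner son poste</span>'
)

# Case 1: drop the Traduction_py span together with the text inside it.
# .*? is non-greedy, so each match stops at the first closing tag.
no_pinyin = re.sub(
    r'<span class="Traduction_py">.*?</span Traduction_py>', '', sample
)

# Case 2: match the Locution2 opening tag whatever its id is,
# and rewrite it without the id attribute.
no_ids = re.sub(
    r'<span class="Locution2" id="[^"]*">', '<span class="Locution2">', no_pinyin
)

print(no_ids)
```

The `[^"]*` part plays the same role as `[^>]*` in the template: it matches any run of characters up to the next delimiter, so the pattern does not care which id the tag carries.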

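Using beautifulsoup4 in Python 3

What is beautifulsoup4?

To quote the official documentation: Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.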

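Installing beautifulsoup4

From this part on we need Python. How do you get up and running with Python quickly from zero? I may write a separate article on that later; consider this a promise. In short, you need the following:

First, download Anaconda (just search for it; the installer is foolproof).
Once it is installed, look for Anaconda Prompt among the installed programs and open it.
Enter the command below to install beautifulsoup4: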

$ pip install beautifulsoup4

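Then look for Anaconda Navigator among the installed programs, choose Spyder, and paste in the code from this article, adjusted to your setup. It is then ready to run.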

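Getting started with beautifulsoup4

First we need to open the HTML file and tell the program where it is stored. Change path to your own file path. Where does the HTML file come from? Take the markup from "The initial text to process", paste it into Notepad++ and save it as an .html file, and you can start experimenting. The next two lines open the HTML file and read its contents.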

path = 'D:/WORKS/larousse_original_test1.html'
htmlfile = open(path, 'r', encoding='utf-8')
htmlhandle = htmlfile.read()
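As a variation on the lines above, here is a small sketch that reads the file inside a `with` block, so the file is closed automatically. The helper name `read_html` and the throwaway demo file are assumptions for illustration; they are not part of the original script.

```python
import os
import tempfile


def read_html(path):
    """Return the full text of a UTF-8 encoded file."""
    # The with block closes the file even if read() raises.
    with open(path, 'r', encoding='utf-8') as htmlfile:
        return htmlfile.read()


# Demo with a temporary file so the sketch runs on any machine.
with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, 'demo.html')
    with open(demo_path, 'w', encoding='utf-8') as f:
        f.write('<zidingyi> abat-jour <h1 class="Adresse" >abat-jour</h1> </zidingyi>')
    htmlhandle = read_html(demo_path)

print(htmlhandle)
```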

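The next step is to call Beautiful Soup's parser, using lxml as the parser, and to use Python's pandas package to store the target data. Watch the capitalization of BeautifulSoup here; getting it wrong raises an error.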

from bs4 import BeautifulSoup
soup = BeautifulSoup(htmlhandle, 'lxml')
import pandas as pd
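To check that parsing works before touching the full file, a quick sketch like the following can help. It parses the second sample entry from a string. As an assumption for portability it uses `'html.parser'`, which ships with Python, instead of the `'lxml'` parser used in the article.

```python
from bs4 import BeautifulSoup  # the class name is case-sensitive

# The abat-jour entry from the sample text above.
html = (
    '<zidingyi> abat-jour <h1 class="Adresse" >abat-jour</h1> <br />'
    '<span class="CategorieGrammaticale" >nom masculin invariable</span><br /> '
    '<div class="Traductionchinois" >燈罩</div> </zidingyi>'
)

# 'html.parser' needs no extra install; 'lxml' must be installed separately.
soup = BeautifulSoup(html, 'html.parser')

entry = soup.find('zidingyi')
headword = entry.h1.get_text()
part_of_speech = entry.find(class_='CategorieGrammaticale').get_text()

print(headword, '|', part_of_speech)
```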

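Create a counter, then create result. All the data that follows is stored in it. When you open the Excel sheet later you will see columns such as 'word' and 'word_cixing', and the data grows row by row under those columns.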

count = 0
columns = ['word', 'word_cixing', 'word_jieshi_fr', 'word_jieshi_cn', 'word_liju_fr', 'word_liju_cn']
# result starts empty and grows by one row per word
result = pd.DataFrame(columns=columns)
# new is a separate one-row frame that is refilled for each word
new = pd.DataFrame({c: [''] for c in columns})

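Now we set up a loop. In the initial HTML, I replaced the </> separators of the original mdx with <zidingyi></zidingyi>. In other words, every word is wrapped in an outer <zidingyi></zidingyi>, and each <zidingyi></zidingyi> contains everything that belongs to that word.

First, find_all() returns every <zidingyi> tag, and the loop iterates over them. Each one is stored in item, and BS's CSS selector then picks out the h1 tag, which is the word itself. Next, the list that comes back has to be converted to a string, which is covered in a later section.

A BeautifulSoup object represents the whole content of a document. Most of the time it can be treated as a Tag object; it supports most of the methods described under navigating the tree and searching the tree. The joined string is therefore wrapped in a new BeautifulSoup object, and get_text() reads out all the text inside the tags. The text is stored in the 'word' field of new and appended to result, ready for the final output.

Only 'word' is shown here as an example. The other fields correspond to different classes or tags; details can be found in the official Chinese BS documentation.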

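for item in soup.find_all('zidingyi'):
    # the h1 directly under <zidingyi> holds the word itself
    word = item.select("zidingyi > h1")
    # select() returns a list of tags; join them into one string
    word = ';'.join(str(e) for e in word)
    # parse the joined markup again and keep only the text
    word = BeautifulSoup(word, 'lxml').get_text()
    new['word'] = word
    count += 1
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    result = pd.concat([result, new], ignore_index=True)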
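The same idea can be extended to all six fields. The sketch below is an alternative arrangement, not the author's script: it calls get_text() on each selected tag directly instead of joining the markup and parsing it again, and it collects plain dictionaries rather than DataFrame rows. The names `SELECTORS` and `extract_entries` are hypothetical, and `'html.parser'` again stands in for `'lxml'` so that nothing beyond beautifulsoup4 is needed.

```python
from bs4 import BeautifulSoup

# Column name -> CSS selector, taken from the class names in the sample text.
SELECTORS = {
    'word': 'h1',
    'word_cixing': '.CategorieGrammaticale',
    'word_jieshi_fr': '.Indicateur',
    'word_jieshi_cn': '.Traductionchinois',
    'word_liju_fr': '.Locution2',
    'word_liju_cn': '.Traduction2chinois',
}


def extract_entries(html):
    """Return one dict per <zidingyi> entry, with ';'-joined field texts."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for item in soup.find_all('zidingyi'):
        row = {}
        for column, selector in SELECTORS.items():
            # select() returns a list of tags; keep only their stripped text.
            texts = [tag.get_text().strip() for tag in item.select(selector)]
            row[column] = ';'.join(texts)
        rows.append(row)
    return rows


# Shortened version of the two sample entries.
sample = (
    '<zidingyi> abandonner <h1 class="Adresse" >abandonner</h1><br />'
    '<span class="CategorieGrammaticale" >verbe transitif </span><br />'
    '<span class="Indicateur">(laisser) </span><br />'
    '<div class="Traductionchinois" >拋棄</div>'
    '<span class="Locution2" id="49" >abandonner un animal</span>'
    '<div class="Traduction2chinois" >丟棄一只動物</div> </zidingyi>'
    '<zidingyi> abat-jour <h1 class="Adresse" >abat-jour</h1> <br />'
    '<span class="CategorieGrammaticale" >nom masculin invariable</span><br />'
    '<div class="Traductionchinois" >燈罩</div> </zidingyi>'
)

rows = extract_entries(sample)
for row in rows:
    print(row)
```

If pandas is installed, `pd.DataFrame(rows)` turns the list of dictionaries into a table in one step, which avoids growing a DataFrame inside the loop.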

And with that, the job is done: save all the data to an Excel spreadsheet. (Adjust the path and the Excel file name to your own needs.)

result.to_excel('D:/result.xlsx')

A few other small details

Joining a list into a string in Python 3

Use ''.join; a separator can be placed between the quotes.

list1 = ['1', '2', '3']
str1 = ''.join(list1)

If the list holds numbers, or anything else that is not a string, the items have to be converted before the join.

list1 = [1, 2, 3]
str1 = ''.join(str(e) for e in list1)
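A short sketch of the separator option mentioned above, with made-up sample values. The ';' separator is the one the script in this article uses when joining selected tags.

```python
words = ['abandonner', 'abat-jour']
ids = [48, 49, 50]

# No separator: the items are glued together.
joined_plain = ''.join(words)

# The separator goes between the quotes.
joined_semicolon = ';'.join(words)

# Non-string items are converted with str() before joining.
joined_ids = ', '.join(str(e) for e in ids)

print(joined_plain)
print(joined_semicolon)
print(joined_ids)
```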

The final code (Python 3)

# -*- coding: utf-8 -*-
"""
Created on Sun Aug 4 14:13:54 2019

@author: https://hxd.red
"""
from bs4 import BeautifulSoup
import pandas as pd

# read the whole HTML file into a string
path = 'D:/WORKS/larousse_original_test1.html'
htmlfile = open(path, 'r', encoding='utf-8')
htmlhandle = htmlfile.read()
htmlfile.close()

soup = BeautifulSoup(htmlhandle, 'lxml')

count = 0
columns = ['word', 'word_cixing', 'word_jieshi_fr', 'word_jieshi_cn', 'word_liju_fr', 'word_liju_cn']
# result starts empty and grows by one row per word
result = pd.DataFrame(columns=columns)
# new is a separate one-row frame that is refilled for each word
new = pd.DataFrame({c: [''] for c in columns})

for item in soup.find_all('zidingyi'):
    print(item)

    # select each field and join the matching tags into one string
    word = item.select("zidingyi > h1")
    word = ';'.join(str(e) for e in word)
    print(word)

    word_cixing = item.select(".CategorieGrammaticale")
    word_cixing = ';'.join(str(e) for e in word_cixing)
    print(word_cixing)

    word_jieshi_fr = item.select(".Indicateur")
    word_jieshi_fr = ';'.join(str(e) for e in word_jieshi_fr)
    print(word_jieshi_fr)

    word_jieshi_cn = item.select(".Traductionchinois")
    word_jieshi_cn = ';'.join(str(e) for e in word_jieshi_cn)
    print(word_jieshi_cn)

    word_liju_fr = item.select(".Locution2")
    word_liju_fr = ';'.join(str(e) for e in word_liju_fr)
    print(word_liju_fr)

    word_liju_cn = item.select(".Traduction2chinois")
    word_liju_cn = ';'.join(str(e) for e in word_liju_cn)
    print(word_liju_cn)

    # parse the joined markup again and keep only the text
    word = BeautifulSoup(word, 'lxml').get_text()
    word_cixing = BeautifulSoup(word_cixing, 'lxml').get_text()
    word_jieshi_fr = BeautifulSoup(word_jieshi_fr, 'lxml').get_text()
    word_jieshi_cn = BeautifulSoup(word_jieshi_cn, 'lxml').get_text()
    word_liju_fr = BeautifulSoup(word_liju_fr, 'lxml').get_text()
    word_liju_cn = BeautifulSoup(word_liju_cn, 'lxml').get_text()

    new['word'] = word
    new['word_cixing'] = word_cixing
    new['word_jieshi_fr'] = word_jieshi_fr
    new['word_jieshi_cn'] = word_jieshi_cn
    new['word_liju_fr'] = word_liju_fr
    new['word_liju_cn'] = word_liju_cn

    count += 1
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    result = pd.concat([result, new], ignore_index=True)

result.to_excel('D:/result.xlsx')

References

https://stackoverflow.com/questions/5618878/how-to-convert-list-to-string

https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#contents-children

https://blog.csdn.net/fwj_ntu/article/details/78843872
