Scraping Links from a Webpage in Python
Prerequisites:
urllib3: A powerful, sanity-friendly HTTP client for Python, with features such as thread safety, client-side SSL/TLS verification, connection pooling, and file uploads with multipart encoding.
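As a brief illustration of urllib3 (the program later in this article uses the standard-library urllib.request instead), here is a minimal sketch of issuing a request through a PoolManager; example.com stands in for any URL:

```python
import urllib3

# A minimal sketch, not part of the original article's program.
# PoolManager provides the connection pooling and thread safety
# mentioned above; one instance can be reused across requests.
http = urllib3.PoolManager()
resp = http.request("GET", "https://www.example.com/")

print(resp.status)     # HTTP status code of the response
print(len(resp.data))  # size of the response body in bytes
```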
Installing urllib3:
$ pip install urllib3

BeautifulSoup: A Python library used to scrape information from web pages and XML files, i.e., for pulling data out of HTML and XML documents.
Installing BeautifulSoup:
$ pip install beautifulsoup4

Commands Used:
html = urllib.request.urlopen(url).read(): Opens the URL and reads the entire page, newlines included, into one big string (a bytes object).
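A small, self-contained sketch of this step (not in the original article): a data: URL is used below so the example runs without network access, but any http(s) URL behaves the same way.

```python
import urllib.request

# A minimal sketch of urlopen(url).read(). The data: URL is an assumption
# made so the example needs no network; an http(s) URL works identically.
url = "data:text/html,<html><body><a href='https://example.com'>home</a></body></html>"
html = urllib.request.urlopen(url).read()

print(type(html))              # the whole page comes back as one bytes blob
print(b"example.com" in html)  # True
```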
soup = BeautifulSoup(html, 'html.parser'): Passes the whole string to BeautifulSoup, which parses it with the HTML parser and returns a BeautifulSoup object.
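To see what the returned object gives you, here is a minimal sketch in which an inline HTML string stands in for a fetched page:

```python
from bs4 import BeautifulSoup

# A minimal sketch (inline HTML stands in for a real fetched page):
# BeautifulSoup parses the string with the html.parser backend and
# returns an object that can be queried for tags.
html = "<html><body><a href='/a'>A</a><a href='/b'>B</a></body></html>"
soup = BeautifulSoup(html, 'html.parser')

print(soup.a)          # the first anchor tag in the document
print(soup.a['href'])  # /a
```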
tags = soup('a'): Gets a list of all the anchor (<a>) tags.
tag.get('href', None): Extracts the data from the href attribute, returning None if the attribute is missing.
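The last two commands can be sketched together on a small inline HTML snippet (a stand-in for a real page):

```python
from bs4 import BeautifulSoup

# A minimal sketch of soup('a') and tag.get('href', None).
# soup('a') is shorthand for soup.find_all('a'), and .get returns the
# supplied default (None) when the attribute is absent from the tag.
html = "<a href='/home'>Home</a><a name='top'>Anchor with no href</a>"
soup = BeautifulSoup(html, 'html.parser')

tags = soup('a')
print(tags[0].get('href', None))  # /home
print(tags[1].get('href', None))  # None -- this tag has no href attribute
```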
Python Program to Scrape Links from a Webpage
# import statements
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup

# URL of a webpage
url = input("Enter URL: ")

# Open the URL and read the whole page
html = urllib.request.urlopen(url).read()

# Parse the string
soup = BeautifulSoup(html, 'html.parser')

# Retrieve all of the anchor tags
# (returns a list of all the links)
tags = soup('a')

# Print all the links in the list tags
for tag in tags:
    # Get the data from the href key
    print(tag.get('href', None), end="\n")

Output:
Enter URL: https://www.google.com/
https://www.google.com/imghp?hl=en&tab=wi
https://maps.google.com/maps?hl=en&tab=wl
https://play.google.com/?hl=en&tab=w8
https://www.youtube.com/?gl=US&tab=w1
https://news.google.com/nwshp?hl=en&tab=wn
https://mail.google.com/mail/?tab=wm
https://drive.google.com/?tab=wo
https://www.google.com/intl/en/about/products?tab=wh
http://www.google.com/history/optout?hl=en
/preferences?hl=en
https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.com/
/advanced_search?hl=en&authuser=0
/intl/en/ads/
/services/
/intl/en/about.html
/intl/en/policies/privacy/
/intl/en/policies/terms/

Source: https://www.includehelp.com/python/scraping-links-from-a-webpage.aspx
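Note that the output above mixes absolute links with relative ones such as /preferences?hl=en. As an optional extension, not part of the original program, urllib.parse.urljoin can resolve those relative links against the base URL that was scraped:

```python
from urllib.parse import urljoin

# An optional extension (not in the original program): resolve relative
# links, as seen in the sample output, against the scraped base URL.
base = "https://www.google.com/"
for link in ["/preferences?hl=en", "https://maps.google.com/maps?hl=en&tab=wl"]:
    # urljoin resolves relative paths and leaves absolute URLs unchanged
    print(urljoin(base, link))
```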