Common Crawler Libraries (1): Basic Usage of the urllib Library in Python 3
Original source: https://www.cnblogs.com/0bug/p/8893677.html
What is urllib?
urllib is Python's built-in HTTP request library. It consists of four modules (a combined example follows the list):

urllib.request: the request module
urllib.error: the exception-handling module
urllib.parse: the URL-parsing module
urllib.robotparser: the robots.txt-parsing module
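To show how the four modules divide the work, here is a minimal sketch; the target URLs are illustrative only:

```python
import urllib.error
import urllib.parse
import urllib.request
import urllib.robotparser

# urllib.parse builds the query string, urllib.request sends the request,
# and urllib.error reports the failure if anything goes wrong.
query = urllib.parse.urlencode({'q': 'python'})
try:
    response = urllib.request.urlopen('http://httpbin.org/get?' + query)
    print(response.status)
except urllib.error.URLError as e:
    print(e.reason)

# urllib.robotparser answers "may this path be crawled?" from robots.txt.
rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://www.cnblogs.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://www.cnblogs.com/0bug'))
```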
Changes from Python 2
Python 2's urllib2 was merged into urllib.request in Python 3.
Python 2:

```python
import urllib2
response = urllib2.urlopen('http://www.cnblogs.com/0bug')
```
Python 3:

```python
import urllib.request
response = urllib.request.urlopen('http://www.cnblogs.com/0bug/')
```
urlopen()
Without the data argument the request is sent as a GET; passing data sends it as a POST.
```python
import urllib.request

response = urllib.request.urlopen('http://www.cnblogs.com/0bug')
html = response.read().decode('utf-8')
print(html)
```
This prints the HTML of the page.

Sending a POST request with data
```python
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'hello': '0bug'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
```
httpbin.org echoes the request back as JSON, with the posted fields under its form key.

The timeout parameter
```python
import urllib.request

response = urllib.request.urlopen('http://www.cnblogs.com/0bug', timeout=0.01)
print(response.read())
```
A timeout this short makes the request fail with urllib.error.URLError, which can be caught:

```python
import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://www.cnblogs.com/0bug', timeout=0.01)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('Request timed out')
```
This prints "Request timed out".

The response object
1. Response type
```python
import urllib.request

response = urllib.request.urlopen('http://www.cnblogs.com/0bug')
print(type(response))
```
The printed type is <class 'http.client.HTTPResponse'>.

2. Status code and response headers
```python
import urllib.request

response = urllib.request.urlopen('http://www.cnblogs.com/0bug')
print(response.status)
print(response.getheaders())
print(response.getheader('Content-Type'))
```
This prints the status code (200 on success), the list of (name, value) header tuples, and the value of the Content-Type header.

3. Response body
The response body is a byte stream, so it has to be decoded, e.g. with decode('utf-8'):
```python
import urllib.request

response = urllib.request.urlopen('http://www.cnblogs.com/0bug')
html = response.read().decode('utf-8')
print(html)
```
Request
```python
import urllib.request

request = urllib.request.Request('http://www.cnblogs.com/0bug')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
```
The output is the same as passing the URL to urlopen directly.

Adding request headers
```python
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
    'Host': 'httpbin.org'
}
dic = {'name': '0bug'}
data = bytes(parse.urlencode(dic), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
```
httpbin.org echoes the request back, so the custom User-Agent and the form data are visible in the response.

add_header

Headers can also be added one at a time after the Request is created:
```python
from urllib import request, parse

url = 'http://httpbin.org/post'
dic = {'name': '0bug'}
data = bytes(parse.urlencode(dic), encoding='utf-8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent',
               'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
```
Handler
Proxies:
```python
import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http proxy',    # replace with your HTTP proxy address
    'https': 'https proxy'   # replace with your HTTPS proxy address
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.cnblogs.com/0bug')
print(response.read())
```
Cookie
```python
import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)
```
Each cookie the server set is printed as name=value.

Saving cookies to a file
```python
import http.cookiejar
import urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
```
The saved cookie.txt uses the Netscape/Mozilla cookie file format.

Another way to save
```python
import http.cookiejar
import urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
```
This time cookie.txt is written in the LWP format. Whichever format was used to save, the same format must be used to load:
```python
import http.cookiejar
import urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
```
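Handlers can also be combined: build_opener accepts several at once, so a proxy and a cookie jar can share one opener. A minimal sketch, with a placeholder proxy address:

```python
import http.cookiejar
import urllib.request

# Requests made through this opener go via the proxy, and any cookies
# the server sets are captured in the jar.
cookie = http.cookiejar.CookieJar()
proxy_handler = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:8888'})  # placeholder address
cookie_handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(proxy_handler, cookie_handler)
response = opener.open('http://www.baidu.com')
print([item.name for item in cookie])
```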
Exception handling
```python
from urllib import request, error

try:
    response = request.urlopen('http://www.cnblogs.com/0bug/xxxx')
except error.URLError as e:
    print(e.reason)
```
For a nonexistent page this prints the failure reason (e.g. Not Found). HTTPError is a subclass of URLError, so the more specific exception should be caught first:

```python
from urllib import request, error

try:
    response = request.urlopen('http://www.cnblogs.com/0bug/xxxx')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
```
e.reason is not always a string; it can also be an exception instance such as socket.timeout:

```python
import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://www.cnblogs.com/0bug/xxxx', timeout=0.001)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('Request timed out')
```
When the request times out, this prints <class 'socket.timeout'> followed by "Request timed out".

URL parsing
```python
from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment')
print(type(result))
print(result)
```
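Without a scheme there is no netloc, so the host ends up in path; this should print:

```
<class 'urllib.parse.ParseResult'>
ParseResult(scheme='', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')
```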
The scheme argument supplies a default for URLs that lack one:

```python
from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
```
If the URL already carries a scheme, the scheme argument is ignored:

```python
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
```
allow_fragments=False stops the fragment from being split out:

```python
from urllib.parse import urlparse

result = urlparse('http://www.badiu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)
```
Without a query string, the fragment attaches to the path instead:

```python
from urllib.parse import urlparse

result = urlparse('http://www.badiu.com/index.html#comment', allow_fragments=False)
print(result)
```
urlunparse

urlunparse does the reverse, assembling a URL from its six components:
```python
from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'id=6', 'comment']
print(urlunparse(data))
```
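This should print:

```
http://www.baidu.com/index.html;user?id=6#comment
```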
urljoin

urljoin resolves a URL against a base URL; components present in the second argument take precedence over those of the base:
```python
from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'ABC.html'))
print(urljoin('http://www.baidu.com', 'https://www.cnblogs.com/0bug'))
print(urljoin('http://www.baidu.com/0bug', 'https://www.cnblogs.com/0bug'))
print(urljoin('http://www.baidu.com/0bug', 'https://www.cnblogs.com/0bug?q=2'))
print(urljoin('http://www.baidu.com/0bug?q=2', 'https://www.cnblogs.com/0bug'))
print(urljoin('http://www.baidu.com', '?q=2#comment'))
print(urljoin('www.baidu.com', '?q=2#comment'))
print(urljoin('www.baidu.com#comment', '?q=2'))
```
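This should print:

```
http://www.baidu.com/ABC.html
https://www.cnblogs.com/0bug
https://www.cnblogs.com/0bug
https://www.cnblogs.com/0bug?q=2
https://www.cnblogs.com/0bug
http://www.baidu.com?q=2#comment
www.baidu.com?q=2#comment
www.baidu.com?q=2
```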
urlencode

urlencode serializes a dict into a URL query string:
```python
from urllib.parse import urlencode

params = {
    'name': '0bug',
    'age': 25
}
base_url = 'http://www.badiu.com?'
url = base_url + urlencode(params)
print(url)
```
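This should print:

```
http://www.badiu.com?name=0bug&age=25
```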
Reposted from: https://www.cnblogs.com/yunlongaimeng/p/9802052.html