Common Crawler Libraries (1): Basic Usage of the urllib Library in Python 3
Original source: https://www.cnblogs.com/0bug/p/8893677.html
What is urllib?
urllib is Python's built-in HTTP request library. It consists of four modules (a combined example follows the list):

urllib.request: the request module
urllib.error: the exception-handling module
urllib.parse: the URL-parsing module
urllib.robotparser: the robots.txt-parsing module
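To show how the four modules divide the work, here is a minimal sketch; the target URLs are illustrative only:

```python
import urllib.error
import urllib.parse
import urllib.request
import urllib.robotparser

# urllib.parse builds the query string, urllib.request sends the request,
# and urllib.error reports the failure if anything goes wrong.
query = urllib.parse.urlencode({'q': 'python'})
try:
    response = urllib.request.urlopen('http://httpbin.org/get?' + query)
    print(response.status)
except urllib.error.URLError as e:
    print(e.reason)

# urllib.robotparser answers "may this path be crawled?" from robots.txt.
rp = urllib.robotparser.RobotFileParser()
rp.set_url('http://www.cnblogs.com/robots.txt')
rp.read()
print(rp.can_fetch('*', 'http://www.cnblogs.com/0bug'))
```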
Changes from Python 2
Python 2's urllib2 was merged into urllib.request in Python 3.
Python 2:

```python
import urllib2
response = urllib2.urlopen('http://www.cnblogs.com/0bug')
```
Python 3:

```python
import urllib.request
response = urllib.request.urlopen('http://www.cnblogs.com/0bug/')
```
urlopen()
Without the data argument the request is sent as a GET; passing data sends it as a POST.
```python
import urllib.request

response = urllib.request.urlopen('http://www.cnblogs.com/0bug')
html = response.read().decode('utf-8')
print(html)
```
This prints the HTML of the page.

Sending a POST request with data
```python
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'hello': '0bug'}), encoding='utf-8')
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
```
httpbin.org echoes the request back as JSON, with the posted fields under its form key.

The timeout parameter
```python
import urllib.request

response = urllib.request.urlopen('http://www.cnblogs.com/0bug', timeout=0.01)
print(response.read())
```
A timeout this short makes the request fail with urllib.error.URLError, which can be caught:

```python
import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://www.cnblogs.com/0bug', timeout=0.01)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('Request timed out')
```
This prints "Request timed out".

The response object
1. Response type
```python
import urllib.request

response = urllib.request.urlopen('http://www.cnblogs.com/0bug')
print(type(response))
```
The printed type is <class 'http.client.HTTPResponse'>.

2. Status code and response headers
```python
import urllib.request

response = urllib.request.urlopen('http://www.cnblogs.com/0bug')
print(response.status)
print(response.getheaders())
print(response.getheader('Content-Type'))
```
This prints the status code (200 on success), the list of (name, value) header tuples, and the value of the Content-Type header.

3. Response body
The response body is a byte stream, so it has to be decoded, e.g. with decode('utf-8'):
```python
import urllib.request

response = urllib.request.urlopen('http://www.cnblogs.com/0bug')
html = response.read().decode('utf-8')
print(html)
```
Request
```python
import urllib.request

request = urllib.request.Request('http://www.cnblogs.com/0bug')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
```
The output is the same as passing the URL to urlopen directly.

Adding request headers
```python
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
    'Host': 'httpbin.org'
}
dic = {'name': '0bug'}
data = bytes(parse.urlencode(dic), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
```
httpbin.org echoes the request back, so the custom User-Agent and the form data are visible in the response.

add_header

Headers can also be added one at a time after the Request is created:
```python
from urllib import request, parse

url = 'http://httpbin.org/post'
dic = {'name': '0bug'}
data = bytes(parse.urlencode(dic), encoding='utf-8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent',
               'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
```
Handler
Proxies:
```python
import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http proxy',    # replace with your HTTP proxy address
    'https': 'https proxy'   # replace with your HTTPS proxy address
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://www.cnblogs.com/0bug')
print(response.read())
```
Cookie
```python
import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)
```
Each cookie the server set is printed as name=value.

Saving cookies to a file
```python
import http.cookiejar
import urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
```
The saved cookie.txt uses the Netscape/Mozilla cookie file format.

Another way to save
```python
import http.cookiejar
import urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
```
This time cookie.txt is written in the LWP format. Whichever format was used to save, the same format must be used to load:
```python
import http.cookiejar
import urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
```
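Handlers can also be combined: build_opener accepts several at once, so a proxy and a cookie jar can share one opener. A minimal sketch, with a placeholder proxy address:

```python
import http.cookiejar
import urllib.request

# Requests made through this opener go via the proxy, and any cookies
# the server sets are captured in the jar.
cookie = http.cookiejar.CookieJar()
proxy_handler = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:8888'})  # placeholder address
cookie_handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(proxy_handler, cookie_handler)
response = opener.open('http://www.baidu.com')
print([item.name for item in cookie])
```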
Exception handling
```python
from urllib import request, error

try:
    response = request.urlopen('http://www.cnblogs.com/0bug/xxxx')
except error.URLError as e:
    print(e.reason)
```
For a nonexistent page this prints the failure reason (e.g. Not Found). HTTPError is a subclass of URLError, so the more specific exception should be caught first:

```python
from urllib import request, error

try:
    response = request.urlopen('http://www.cnblogs.com/0bug/xxxx')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')
```
e.reason is not always a string; it can also be an exception instance such as socket.timeout:

```python
import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://www.cnblogs.com/0bug/xxxx', timeout=0.001)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('Request timed out')
```
When the request times out, this prints <class 'socket.timeout'> followed by "Request timed out".

URL parsing
```python
from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment')
print(type(result))
print(result)
```
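Without a scheme there is no netloc, so the host ends up in path; this should print:

```
<class 'urllib.parse.ParseResult'>
ParseResult(scheme='', netloc='', path='www.baidu.com/index.html', params='user', query='id=5', fragment='comment')
```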
The scheme argument supplies a default for URLs that lack one:

```python
from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
```
If the URL already carries a scheme, the scheme argument is ignored:

```python
from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
```
allow_fragments=False stops the fragment from being split out:

```python
from urllib.parse import urlparse

result = urlparse('http://www.badiu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)
```
Without a query string, the fragment attaches to the path instead:

```python
from urllib.parse import urlparse

result = urlparse('http://www.badiu.com/index.html#comment', allow_fragments=False)
print(result)
```
urlunparse

urlunparse does the reverse, assembling a URL from its six components:
```python
from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'id=6', 'comment']
print(urlunparse(data))
```
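This should print:

```
http://www.baidu.com/index.html;user?id=6#comment
```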
urljoin

urljoin resolves a URL against a base URL; components present in the second argument take precedence over those of the base:
```python
from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'ABC.html'))
print(urljoin('http://www.baidu.com', 'https://www.cnblogs.com/0bug'))
print(urljoin('http://www.baidu.com/0bug', 'https://www.cnblogs.com/0bug'))
print(urljoin('http://www.baidu.com/0bug', 'https://www.cnblogs.com/0bug?q=2'))
print(urljoin('http://www.baidu.com/0bug?q=2', 'https://www.cnblogs.com/0bug'))
print(urljoin('http://www.baidu.com', '?q=2#comment'))
print(urljoin('www.baidu.com', '?q=2#comment'))
print(urljoin('www.baidu.com#comment', '?q=2'))
```
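This should print:

```
http://www.baidu.com/ABC.html
https://www.cnblogs.com/0bug
https://www.cnblogs.com/0bug
https://www.cnblogs.com/0bug?q=2
https://www.cnblogs.com/0bug
http://www.baidu.com?q=2#comment
www.baidu.com?q=2#comment
www.baidu.com?q=2
```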
urlencode

urlencode serializes a dict into a URL query string:
```python
from urllib.parse import urlencode

params = {
    'name': '0bug',
    'age': 25
}
base_url = 'http://www.badiu.com?'
url = base_url + urlencode(params)
print(url)
```
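This should print:

```
http://www.badiu.com?name=0bug&age=25
```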
Reposted from: https://www.cnblogs.com/yunlongaimeng/p/9802052.html