Python | Web Crawler
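1) Crawler rule of thumb: behave like a normal visitor
Example: connect directly, without adding any headers
Error message:
urllib.error.HTTPError: HTTP Error 403: Forbidden
The server rejects the request outright. Press F12 in the browser to see what happens during a normal visit to the server.
The browser sends a long list of headers. The most important is User-Agent, which identifies the OS and browser you are using.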
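To see why a bare request stands out, it helps to look at what urllib announces itself as when no header is set. The sketch below makes no network call; the URL and the shortened User-Agent string are placeholders chosen for illustration, not values from the article.

```python
import urllib.request

# urlopen() builds an opener like this one, which carries urllib's default headers.
opener = urllib.request.build_opener()
default_ua = dict(opener.addheaders)["User-agent"]
print(default_ua)  # something like "Python-urllib/3.x", clearly not a browser

# A header set on the Request is used instead of that default.
# Placeholder URL and User-Agent value, for illustration only.
req = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": "Mozilla/5.0 (placeholder)"},
)

# urllib stores header names in capitalized form, so the key is "User-agent".
custom_ua = req.get_header("User-agent")
print(custom_ua)
```

A server that filters on User-Agent can recognize the default value as a script, which is a likely reason for the 403 above.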
2) Improved version (headers added to the request)
# Fetch the page source of the movie board
import ssl
import urllib.request as request

# Unverified SSL context: certificate checks are skipped
context = ssl._create_unverified_context()
src = 'https://www.ptt.cc/bbs/movie/index.html'

# Build the Request object and attach the header information
req = request.Request(src, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"
})
with request.urlopen(req, context=context) as response:
    data = response.read().decode("utf-8")
print(data)

Response:
PS C:\Users\85380\Desktop\LearnPy> python .\test2.py <!DOCTYPE html> <html><head><meta charset="utf-8"><meta name="viewport" content="width=device-width, initial-scale=1"><title>看板 movie 文章列表 - 批踢踢實業坊</title><link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.26/bbs-common.css"> <link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.26/bbs-base.css" media="screen"> <link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.26/bbs-custom.css"> <link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.26/pushstream.css" media="screen"> <link rel="stylesheet" type="text/css" href="//images.ptt.cc/bbs/v2.26/bbs-print.css" media="print"></head><body><div id="topbar-container"><div id="topbar" class="bbs-content"><a id="logo" href="/bbs/">批踢踢實業坊</a><span>›</span><a class="board" href="/bbs/movie/index.html"><span class="board-label">看板 </span>movie</a><a class="right small" href="/about.html">關於我們</a><a class="right small" href="/contact.html">聯絡資訊</a></div> </div><div id="main-container"><div id="action-bar-container"><div class="action-bar"><div class="btn-group btn-group-dir"><a class="btn selected" href="/bbs/movie/index.html">看板</a><a class="btn" href="/man/movie/index.html">精華區</a></div><div class="btn-group btn-group-paging"><a class="btn wide" href="/bbs/movie/index1.html">最 舊</a><a class="btn wide" href="/bbs/movie/index8210.html">‹ 上頁</a><a class="btn wide disabled">下頁 ›</a><a class="btn wide" href="/bbs/movie/index.html">最新 </a></div></div></div><div class="r-list-container action-bar-margin bbs-screen"><div class="search-bar"><form type="get" action="search" id="search-bar"><input class="query" type="text" name="q" value="" placeholder="搜尋文章⋯"></form></div><div class="r-ent"><div class="nrec"><span class="hl f2">8</span></div><div class="title"><a href="/bbs/movie/M.1565026014.A.B3C.html">[新聞] 「終局之戰」、「亂世佳人」、「阿凡達」誰真正票房冠軍?</a></div><div class="meta"><div class="author">orz44444</div><div class="article-menu"><div 
class="trigger">⋯</div><div class="dropdown"><div class="item"><a href="/bbs/movie/search?q=thread%3A%5B%E6%96%B0%E8%81%9E%5D+%E3%80%8C%E7%B5%82%E5%B1%80%E4%B9%8B%E6%88%B0%E3%80%8D%E3%80%81%E3%80%8C%E4%BA%82%E4%B8%96%E4%BD%B3%E4%BA%BA%E3%80%8D%E3%80%81%E3%80%8C%E9%98%BF%E5%87%A1%E9%81%94%E3%80%8D%E8%AA%B0%E7%9C%9F%E6%AD%A3%E7%A5%A8%E6%88%BF%E5%86%A0%E8%BB%8D%EF%BC%9F">搜尋同標題文章</a></div><div class="item"><a href="/bbs/movie/search?q=author%3Aorz44444">搜尋看板內 orz44444 的文章</a></div></div></div><div class="date"> 8/06</div><div class="mark"></div></div></div><div class="r-ent"><div class="nrec"><span class="hl f2">6</span></div><div class="title"><a href="/bbs/movie/M.1565027230.A.041.html">Re: [新 聞] 凱文費奇透露《雷神索爾4》為何要拍女雷神</a></div><div class="meta"><div class="author">godshibainu</div><div class="article-menu"><div class="trigger">⋯</div><div class="dropdown"><div class="item"><a href="/bbs/movie/search?q=thread%3A%5B%E6%96%B0%E8%81%9E%5D+%E5%87%B1%E6%96%87%E8%B2%BB%E5%A5%87%E9%80%8F%E9%9C%B2%E3%80%8A%E9%9B%B7%E7%A5%9E%E7%B4%A2%E7%88%BE4%E3%80%8B%E7%82%BA%E4%BD%95%E8%A6%81%E6%8B%8D%E5%A5%B3%E9%9B%B7%E7%A5%9E">搜尋同標題文章</a></div><div class="item"><a href="/bbs/movie/search?q=author%3Agodshibainu">搜尋看板內 godshibainu 的文章</a></div></div></div><div class="date"> 8/06</div><div class="mark"></div></div></div><div class="r-ent"><div class="nrec"><span class="hl f3">10</span></div><div class="title"><a href="/bbs/movie/M.1565027740.A.927.html">[新聞] 《復仇者4》驚見關史黛西!「就在蜘蛛人</a></div><div class="meta"><div class="author">chufenyang</div><div class="article-menu"><div class="trigger">⋯</div><div class="dropdown"><div class="item"><a href="/bbs/movie/search?q=thread%3A%5B%E6%96%B0%E8%81%9E%5D+%E3%80%8A%E5%BE%A9%E4%BB%87%E8%80%854%E3%80%8B%E9%A9%9A%E8%A6%8B%E9%97%9C%E5%8F%B2%E9%BB%9B%E8%A5%BF%EF%BC%81%E3%80%8C%E5%B0%B1%E5%9C%A8%E8%9C%98%E8%9B%9B%E4%BA%BA">搜尋同標題文章</a></div><div class="item"><a href="/bbs/movie/search?q=author%3Achufenyang">搜尋看板內 chufenyang 的文章</a></div></div></div><div 
class="date"> 8/06</div><div class="mark"></div></div></div><div class="r-ent"><div class="nrec"><span class="hl f2">4</span></div><div class="title"><a href="/bbs/movie/M.1565031671.A.280.html">Re: [新 聞] 必備片單!帝國雜誌評選30年來30部經典代</a></div><div class="meta"><div class="author">Payne22</div><div class="article-menu"><div class="trigger">⋯</div><div class="dropdown"><div class="item"><a href="/bbs/movie/search?q=thread%3A%5B%E6%96%B0%E8%81%9E%5D+%E5%BF%85%E5%82%99%E7%89%87%E5%96%AE%EF%BC%81%E5%B8%9D%E5%9C%8B%E9%9B%9C%E8%AA%8C%E8%A9%95%E9%81%B830%E5%B9%B4%E4%BE%8630%E9%83%A8%E7%B6%93%E5%85%B8%E4%BB%A3">搜尋同標題文章</a></div><div class="item"><a href="/bbs/movie/search?q=author%3APayne22">搜尋看板內 Payne22 的文章</a></div></div></div><div class="date"> 8/06</div><div class="mark"></div></div></div><div class="r-list-sep"></div><div class="r-ent"><div class="nrec"><span class="hl f3">22</span></div><div class="title"><a href="/bbs/movie/M.1559611458.A.DCA.html">[公告] 板規 2019/07/05</a></div><div class="meta"><div class="author">ckshchen</div><div class="article-menu"><div class="trigger">⋯</div><div class="dropdown"><div class="item"><a href="/bbs/movie/search?q=thread%3A%5B%E5%85%AC%E5%91%8A%5D+%E6%9D%BF%E8%A6%8F+2019%2F07%2F05">搜尋同標題文章</a></div><div class="item"><a href="/bbs/movie/search?q=author%3Ackshchen">搜尋看板內 ckshchen 的文章</a></div></div></div><div class="date"> 6/04</div><div class="mark">M</div></div></div></div></div><script>(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)})(window,document,'script','https://www.google-analytics.com/analytics.js','ga');ga('create', 'UA-32365737-1', {cookieDomain: 'ptt.cc',legacyCookieDomain: 'ptt.cc'});ga('send', 'pageview'); </script><script src="//ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script> <script 
src="//images.ptt.cc/bbs/v2.26/bbs.js"></script></body> </html>

3) Parse the HTML with the third-party package BeautifulSoup
# Fetch the page source of the movie board
import ssl
import urllib.request as request

import bs4  # third-party: pip install beautifulsoup4

# Unverified SSL context: certificate checks are skipped
context = ssl._create_unverified_context()
src = 'https://www.ptt.cc/bbs/movie/index.html'

# Build the Request object and attach the header information
req = request.Request(src, headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.87 Safari/537.36"
})
with request.urlopen(req, context=context) as response:
    data = response.read().decode("utf-8")

# Parse the source and get the title of every post
root = bs4.BeautifulSoup(data, "html.parser")
# print(root.title.string)  # root.title is the <title> tag; root.title.string is the text inside it

# Find what distinguishes the data you want in the HTML,
# e.g. a title such as 霸王別姬 sits inside <div><a></a></div>
# titles = root.find("div", class_="title")  # find a div tag with class="title"
# print(titles.a.string)  # find() returns only the first matching div; this prints the text of its <a> tag
titles = root.find_all("div", class_="title")
for title in titles:
    if title.a is not None:  # skip title divs that contain no <a> tag
        print(title.a.string)

Result:
PS C:\Users\85380\Desktop\LearnPy> python .\test2.py
[新聞] 「終局之戰」、「亂世佳人」、「阿凡達」誰真正票房冠軍?
Re: [新聞] 凱文費奇透露《雷神索爾4》為何要拍女雷神
[新聞] 《復仇者4》驚見關史黛西!「就在蜘蛛人
Re: [新聞] 必備片單!帝國雜誌評選30年來30部經典代
Re: [新聞] 必備片單!帝國雜誌評選30年來30部經典代
[公告] 板規 2019/07/05
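The title extraction relies on the third-party bs4 package. As a rough standard-library sketch of the same idea, the parser below swaps BeautifulSoup for html.parser.HTMLParser (not used in the article) and collects the text of each <a> found inside a <div class="title">. The class name TitleParser and the second sample entry are made up for illustration; the first sample entry is taken from the page output shown earlier.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect link text found inside <div class="title"> elements."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._div_depth = 0    # greater than 0 while inside a title div
        self._in_link = False  # True while inside an <a> within a title div
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self._div_depth > 0:
                self._div_depth += 1  # nested div inside a title div
            elif "title" in (attrs.get("class") or "").split():
                self._div_depth = 1
        elif tag == "a" and self._div_depth > 0:
            self._in_link = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "div" and self._div_depth > 0:
            self._div_depth -= 1
        elif tag == "a" and self._in_link:
            self.titles.append("".join(self._buffer).strip())
            self._in_link = False

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)


# Hypothetical sample: one entry with a link, one without.
sample = (
    '<div class="r-ent"><div class="title">'
    '<a href="/bbs/movie/M.1559611458.A.DCA.html">[公告] 板規 2019/07/05</a>'
    '</div></div>'
    '<div class="r-ent"><div class="title">(no link here)</div></div>'
)

parser = TitleParser()
parser.feed(sample)
print(parser.titles)
```

Entries whose title div has no <a> are skipped, which mirrors the check on title.a in the BeautifulSoup version.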