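Web Scraping: Fetching Static Pages
Static page scraping
In web scraping, data on static pages is relatively easy to obtain, because all of it is present in the page's HTML source.
For static pages, Python's Requests library makes this straightforward.
Sending an HTTP request with requests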
    import requests

    # Fetch the home page with a plain GET request
    url = "http://www.santostang.com/"
    r = requests.get(url)

    # Inspect the encoding, the status code, and the decoded body
    print("Text encoding:", r.encoding)
    print("Response status code:", r.status_code)
    print("Response text:", r.text)

Output (the page HTML is abridged here; "..." marks omitted markup):

    Text encoding: UTF-8
    Response status code: 200
    Response text: <!DOCTYPE html> <html lang="zh-CN"> <head> <meta charset="UTF-8"> ... <title>Santos Tang</title> ... </head><body> ... </body> </html>

Usage notes
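- r.text: the server's response body as text, decoded according to the character encoding given in the response headers
- r.encoding: the text encoding used for the response content
- r.status_code: the response status code
  - 200: request succeeded
  - 4xx: client error
  - 5xx: server error
- r.content: the response body as bytes
- r.json(): Requests' built-in JSON decoder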
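Using Requests
1. URL parameters
To request specific data, we sometimes need to put parameters directly in the URL; they come after a question mark.
For example, a news item opened from Baidu has the link https://baijiahao.baidu.com/s?id=1719021671421019613&wfr=spider&for=pc, where the part after the ? carries the parameters id, wfr, and for.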
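A minimal sketch of how such a URL can be built with Requests, assuming the parameters are passed as a dictionary through the params argument. The request is only prepared here, not sent, so the resulting URL can be inspected without network access; the variable names are illustrative.

```python
import requests

# Base URL and query parameters taken from the Baidu news link above.
base_url = "https://baijiahao.baidu.com/s"
params = {"id": "1719021671421019613", "wfr": "spider", "for": "pc"}

# Prepare the request without sending it, to see the URL Requests would use.
prepared = requests.Request("GET", base_url, params=params).prepare()
print(prepared.url)
```

To actually send the request, the same dictionary can be passed directly: requests.get(base_url, params=params).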
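2. Custom request headers
Request headers carry information about the request, the response, or other entities being sent.
To see the headers your browser sends:
- Open Inspect in the browser
- Go to the Network tab
- Look at Headers
Usually it is enough to add the browser information
- that is, the User-Agent header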
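A short sketch of attaching a User-Agent header, assuming a dictionary passed through the headers argument. The User-Agent string below is a made-up placeholder; in practice it would be copied from the browser's Network tab. The request is prepared rather than sent so the header can be checked offline, and fetch is a hypothetical helper for the real call.

```python
import requests

url = "http://www.santostang.com/"
# Placeholder User-Agent value (an assumption); copy the real one from your browser.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# Prepare the request without sending it and check the header is attached.
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.headers["User-Agent"])


def fetch(url, headers):
    """Send the GET request with the custom headers and return the response."""
    return requests.get(url, headers=headers)
```

Calling fetch(url, headers) sends the request with the browser-like header in place of the default python-requests one.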
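3. Sending POST requests
Some sites require form data to be submitted; in that case we need to send a POST request.
To send one, simply pass a dictionary to the data parameter in Requests.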
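A minimal sketch of the form submission described above. The endpoint http://httpbin.org/post and the field names key1 and key2 are assumptions chosen for illustration, not taken from the original post. The request is prepared rather than sent so the encoded body can be inspected without network access.

```python
import requests

# Assumed test endpoint and made-up form fields, for illustration only.
url = "http://httpbin.org/post"
form_data = {"key1": "value1", "key2": "value2"}

# Requests form-encodes a dictionary passed as data.
prepared = requests.Request("POST", url, data=form_data).prepare()
print(prepared.body)
print(prepared.headers["Content-Type"])
```

To actually submit the form, the call would be requests.post(url, data=form_data).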
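4. Handling timeouts
Sometimes the server does not respond for a long time, which leaves the crawler waiting indefinitely.
In that case, the timeout parameter in Requests sets a limit on the wait: if no response arrives within that many seconds, an exception is raised.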
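One way to handle this, sketched under the assumption that a timed-out request should simply be skipped: catch requests.exceptions.Timeout and return None. The helper name fetch_with_timeout is hypothetical.

```python
import requests


def fetch_with_timeout(url, seconds):
    """Return the response, or None if the server does not answer in time."""
    try:
        return requests.get(url, timeout=seconds)
    except requests.exceptions.Timeout:
        # Raised when connecting or reading takes longer than `seconds`.
        print(f"No response within {seconds} s: {url}")
        return None
```

For example, fetch_with_timeout("http://www.santostang.com/", 5) waits at most five seconds. Whether to retry, skip, or abort after a timeout is a design choice that depends on the crawler.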
Practice: scraping the Douban Top 250 pages
The site shows at most 25 movies per page, and the URL of each page has the form https://movie.douban.com/top250?start=value
where value is 0, 25, 50, …
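The URL pattern above can be turned into a list of page URLs. This is a sketch under the assumption that the list holds exactly 250 movies, which gives ten pages with start running from 0 to 225. The names page_urls and fetch_pages are hypothetical, and fetch_pages only downloads the raw HTML without parsing it.

```python
import requests

BASE_URL = "https://movie.douban.com/top250"
PAGE_SIZE = 25   # movies per page, as observed above
TOTAL = 250      # assumed size of the Top 250 list


def page_urls():
    """Build the URL of every page: start=0, 25, 50, ..."""
    return [f"{BASE_URL}?start={start}" for start in range(0, TOTAL, PAGE_SIZE)]


def fetch_pages(headers, timeout=10):
    """Download every page and return the list of HTML texts."""
    pages = []
    for url in page_urls():
        r = requests.get(url, headers=headers, timeout=timeout)
        r.raise_for_status()  # stop on 4xx/5xx responses
        pages.append(r.text)
    return pages


print(page_urls()[:3])
```

fetch_pages expects a headers dictionary such as one carrying a User-Agent, as in the section on custom request headers.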
Page content:
Summary