Using urllib2 to scrape and save the description text from a given page of Neihan8's Neihan Duanzi (jokes)

Published: 2025/6/17

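This script scrapes the description text shown on each listing page of the Neihan Duanzi section of Neihan8. After a page has been saved, press Enter to scrape the next page, or type quit to stop.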
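The interaction described above can be sketched as a small loop. This is only an illustration: the names `page_url` and `crawl` are hypothetical, the URL pattern is the one used in the source code of this post, and the download and prompt steps are passed in as functions so the loop can be tried without network access or a keyboard.

```python
def page_url(page):
    """Build the listing URL; page 1 has no numeric suffix."""
    base = 'https://www.neihan8.com/article/index'
    if page == 1:
        return base + '.html'
    return base + '_%d.html' % page


def crawl(start_page, fetch, ask):
    """Fetch pages one at a time until the user answers 'quit'.

    fetch(url) stands in for the download-and-save step and
    ask(prompt) stands in for reading the user's reply.
    """
    visited = []
    page = start_page
    while True:
        # The URL is rebuilt on every pass so each round fetches a new page.
        fetch(page_url(page))
        visited.append(page)
        if ask('Press Enter to continue, or type quit to exit: ') == 'quit':
            break
        page += 1
    return visited


if __name__ == '__main__':
    answers = iter(['', '', 'quit'])  # two Enter presses, then quit
    urls = []
    pages = crawl(1, urls.append, lambda prompt: next(answers))
    print(pages)
    print(urls)
```

Passing `fetch` and `ask` as arguments is a design choice made for this sketch only; a real script would call the download function and the input prompt directly.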
Approach:

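1. Download the HTML source of each page.
2. Process the source with a regular expression to extract the required information.
3. Save the extracted information to a file.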
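Step 2 is the easiest part to try without touching the network. The following is a minimal sketch, assuming each description sits inside a `<div class="desc">` element, which is what the source code in this post looks for. The sample HTML string and the function name `extract_descriptions` are made up for illustration, and the real page layout may differ.

```python
import re

# Hypothetical sample markup, not copied from the real site.
SAMPLE_HTML = '''
<div class="item"><div class="desc">First joke text</div></div>
<div class="item"><div class="desc">Second joke,
spanning two lines</div></div>
'''

# Non-greedy match; re.S lets "." cross line breaks inside a description.
DESC_PATTERN = re.compile(r'<div class="desc">(.*?)</div>', re.S)


def extract_descriptions(html):
    """Return the text inside every <div class="desc"> block."""
    return [text.strip() for text in DESC_PATTERN.findall(html)]


if __name__ == '__main__':
    for text in extract_descriptions(SAMPLE_HTML):
        print(text)
```

Using a capture group returns the inner text directly, so the opening and closing tags do not have to be stripped afterwards. A regular expression is adequate for a fixed, simple layout like this one; nested `<div>` elements inside a description would need an HTML parser instead.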

The source code is as follows:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
import urllib2
import re


def writepage(content, page):
    '''Save the scraped text of one page.'''
    print('Saving page ' + page)
    filename = 'page_' + page + '.txt'
    with open(filename, 'w') as f:
        f.write(content)


def loadpage(url, page):
    '''Download the given page and return its description text.'''
    print('Downloading page ' + page)
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36"}
    request = urllib2.Request(url, headers=headers)
    response = urllib2.urlopen(request)
    html = response.read()
    # Use a regex to pick out the description blocks;
    # re.S lets a description span several lines
    pattern = re.compile('<div class="desc">.*?</div>', re.S)
    m = pattern.findall(html)
    content = ''
    for n in m:
        # Strip the surrounding tags and separate entries with a blank line
        n = n.replace('<div class="desc">', '').replace('</div>', '') + '\n\n'
        content += n
    return content


def pageurl(page):
    '''Build the URL of a page; the first page's URL differs from the rest.'''
    if page == '1':
        return 'https://www.neihan8.com/article/index.html'
    return 'https://www.neihan8.com/article/index_' + page + '.html'


def neihan8spider(page):
    '''Neihan8 joke scheduler: scrape each page and save the processed result.'''
    print('Starting to scrape')
    # Loop switch
    switch = True
    while switch:
        # Rebuild the URL on every pass so each round fetches the current page
        content = loadpage(pageurl(page), page)
        writepage(content, page)
        s = raw_input('Press Enter to continue, or type quit to exit: ')
        if s == 'quit':
            switch = False
        else:
            page = str(int(page) + 1)
    print('Scraping finished')


if __name__ == '__main__':
    page = raw_input('Enter the page number to start from: ')
    # Scrape, process and save
    neihan8spider(page)
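`urllib2` exists only in Python 2; in Python 3 the same functionality lives in `urllib.request`. As a hedged sketch of how the download step might look there: the function names `build_request` and `fetch_html` are illustrative, the User-Agent string is shortened, and the page encoding (UTF-8 here) is an assumption that should be checked against the real site.

```python
import urllib.request

# Shortened browser-like User-Agent, used only for this sketch.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'


def build_request(url):
    """Create a Request carrying a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': USER_AGENT})


def fetch_html(url, encoding='utf-8', timeout=10):
    """Download url and decode it; the encoding is an assumption."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read().decode(encoding, errors='replace')


if __name__ == '__main__':
    # Building the request does not touch the network; fetch_html would.
    request = build_request('https://www.neihan8.com/article/index.html')
    print(request.full_url)
    print(request.get_header('User-agent'))
```

Unlike Python 2, where `response.read()` returns a `str`, Python 3 returns `bytes`, so the result has to be decoded before a text regular expression can be applied to it.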

Reposted from: https://www.cnblogs.com/silence-cc/p/9213999.html
