A General-Purpose Multithreaded Scraper for Kanunu (努努) Novels
Scraping results
Complete code:
# -*- coding: utf-8 -*-
"""
@email: bluechai@qq.com
@author: NiceBlueChai
"""
import os
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup as sp
from bs4 import NavigableString


# Fetch a page and return its HTML text, or None on failure.
# timeout is in seconds (the original default of 3000 was far too long).
def getHTMLText(url, timeout=30):
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception as e:
        print("Request Error: ", format(e))
        return None


# Get (chapter name, chapter URL) pairs from the novel's table-of-contents page
def getUrls(baseUrl='https://www.kanunu8.com/book3/6633/', timeout=30) -> list:
    urls = []
    text = getHTMLText(baseUrl, timeout)
    if text is not None:
        soup = sp(text, 'lxml')
        table = soup.find('table', {'cellspacing': '1', 'bgcolor': '#d4d0c8'})
        if table is not None:
            # Only keep links that actually carry an href attribute
            alist = table.find_all('a', href=True)
            urls = [(a.get_text().strip(), urljoin(baseUrl, a.attrs['href']))
                    for a in alist]
    return urls


# Parse the chapter title and body text from a chapter page
def parse(text: str) -> tuple:
    soup = sp(text, 'lxml')
    title = soup.find('font').get_text().strip()
    # Keep only text nodes longer than 5 characters; this skips <br> tags
    # and short whitespace fragments
    context = ''.join(str(x) for x in soup.p.contents
                      if isinstance(x, NavigableString) and len(x) > 5)
    return (title, context)


# Thread entry point: download one chapter and write it to absPath
def MyPare(url, absPath: str, timeout=30):
    title = ''
    text = getHTMLText(url, timeout)
    if text is None:
        return url + ' Failed '
    try:
        title, content = parse(text)
        try:
            with open(absPath, 'w', encoding='utf-8') as f:
                f.write(content)
        except Exception as e:
            print(title + ' failed to write file ', format(e))
    except Exception as e:
        print('Error : ' + url, format(e))
    return title


def main(basePath='E:/Test/小說/', baseUrl='https://www.kanunu8.com/book3/6633/', timeout=30):
    # Create the output directory (and any missing parents) if needed
    os.makedirs(basePath, exist_ok=True)
    futures = []
    begin = time.time()
    # Use a single pool; the original created two executors by mistake
    with ThreadPoolExecutor(max_workers=8) as executor:
        for title, url in getUrls(baseUrl, timeout):
            absPath = os.path.join(basePath, title + '.txt')
            if os.path.exists(absPath):
                print(absPath, "already downloaded...")
            else:
                print(absPath, " start downloading: ", url)
                futures.append(executor.submit(MyPare, url, absPath, timeout))
        times = time.time() - begin
        print("All tasks submitted: ", times)
        for f in as_completed(futures):
            times = time.time() - begin
            print(f.result(), " Done: ", times)
    times = time.time() - begin
    print("ALL Done: ", times)


if __name__ == '__main__':
    # Directory where the novel is saved
    savePath = r'E:/Test/小說/超新星紀元'
    # Table-of-contents page of the novel
    baseUrl = r'https://www.kanunu8.com/book3/6634/'
    # requests timeout in seconds
    timeout = 30
    # The original passed the time module here instead of timeout
    main(savePath, baseUrl, timeout)
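The core pattern in the code above, submitting one task per chapter to a ThreadPoolExecutor and collecting results with as_completed, can be tried without any network access. The following is only a minimal sketch under stated assumptions: fake_download and download_all are hypothetical names that do not appear in the original script, and the sleep stands in for a real HTTP request. Unlike the original, it maps each future back to its chapter so that a failed task can be reported by name.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_download(chapter: str, delay: float) -> str:
    """Hypothetical stand-in for a network request: sleep, then return text."""
    time.sleep(delay)
    if not chapter:
        raise ValueError("empty chapter name")
    return f"content of {chapter}"


def download_all(jobs: dict, max_workers: int = 4):
    """Run every job in a thread pool; return (results, errors) keyed by chapter."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its chapter so failures can be reported by name.
        future_to_chapter = {
            pool.submit(fake_download, chapter, delay): chapter
            for chapter, delay in jobs.items()
        }
        # as_completed yields futures in the order they finish, not the order submitted.
        for future in as_completed(future_to_chapter):
            chapter = future_to_chapter[future]
            try:
                results[chapter] = future.result()
            except Exception as exc:
                errors[chapter] = exc
    return results, errors


if __name__ == "__main__":
    results, errors = download_all({"chapter-1": 0.2, "chapter-2": 0.1, "": 0.0})
    print(sorted(results), list(errors))
```

Calling future.result() inside a try block matters here: an exception raised in a worker thread is re-raised at that call, so one bad chapter does not abort the loop over the remaining futures.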