Crawling Huya streamer data with the Python scrapy framework
Preface
This post uses Python's Scrapy framework to scrape basic data from Huya's web pages, such as the streamers, their subscriber counts, and their current viewer counts. The scraped data is exported as CSV and also stored in MongoDB.
Approach
Inspecting the Huya site shows that all channel URLs can be found on www.huya.com/g, while the streamer room data is loaded asynchronously via ajax. The data URL is:
http://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&gameId={頻道id}&tagAll=0&page={頁碼}
By varying gameId and page, this link returns one page of data for a given channel. Based on these observations, the crawl is designed as follows:
Step 1: request www.huya.com/g and, from the li elements with class game-list-item, collect every channel's link, title and channel id.
Step 2: follow each channel link from step 1, read the total page count from the channel page, then build the ajax data URLs from the channel id and page numbers.
Step 3: take the ajax responses from step 2, convert the returned JSON into a dict, and pull out the target fields.
Step 4: request each streamer room URL obtained in step 3 and scrape the subscriber count from the room page.
Step 5: write the data out as CSV and store it in MongoDB.
(Figure: the channel category page)
(Figure: the URL behind the ajax request)
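Before writing the spider, it helps to hit the endpoint once by hand and look at the shape of the JSON. Below is a minimal sketch using requests; gameId=1 is only an example channel id, and the data/datas, nick, totalCount and roomName keys are the ones the spider later relies on:
import json
import requests

url = ('http://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage'
       '&gameId=1&tagAll=0&page=1')    # gameId=1 is only an example channel id
resp = requests.get(url, headers={'Referer': 'http://www.huya.com/'})
data = json.loads(resp.text)

for room in data['data']['datas'][:3]:   # look at the first three rooms on the page
    print('%s | %s viewers | %s' % (room['nick'], room['totalCount'], room['roomName']))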
Code
items
The fields to scrape are defined in items; the items code is as follows:
import scrapy


class HuyaspiderItem(scrapy.Item):
    channel = scrapy.Field()           # channel the streamer belongs to
    anchor_category = scrapy.Field()   # streamer category
    anchor_name = scrapy.Field()       # streamer name
    anchor_url = scrapy.Field()        # live room URL
    anchor_tag = scrapy.Field()        # streamer tag
    anchor_roomname = scrapy.Field()   # room name
    position = scrapy.Field()          # streamer's rank within the channel
    watch_num = scrapy.Field()         # viewer count
    fan_num = scrapy.Field()           # subscriber count
    crawl_time = scrapy.Field()        # crawl timestamp
pipelines
The pipeline exports the items to a CSV file and saves them to MongoDB; the pipelines code is as follows:
# -*- coding: utf-8 -*-
import json, codecs
import pymongo


class HuyaspiderPipeline(object):
    def __init__(self):
        self.file = codecs.open('huyaanchor.csv', 'wb+', encoding='utf-8')  # create the utf-8 encoded csv file
        client = pymongo.MongoClient('localhost', 27017)  # connect to mongodb
        db = client['huya']                               # use (create) the huya database
        self.collection = db['huyaanchor']                # use (create) the huyaanchor collection

    def process_item(self, item, spider):
        item = dict(item)                                 # convert the scraped item to a dict
        line = json.dumps(item) + '\n'                    # serialize the item as JSON and append a newline
        self.file.write(line.decode('unicode_escape'))    # write the line to the csv file (Python 2 str -> unicode)
        self.collection.insert(item)                      # insert the item into mongodb
        return item
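After a crawl, the stored documents can be checked directly in MongoDB. A minimal sketch using the database and collection names created above (collection.count() is deprecated in newer pymongo in favour of count_documents({})):
import pymongo

client = pymongo.MongoClient('localhost', 27017)
collection = client['huya']['huyaanchor']

print('stored records: %s' % collection.count())    # total number of streamer records
for doc in collection.find().limit(5):               # peek at the first few documents
    print('%s | %s | %s' % (doc['anchor_name'], doc['watch_num'], doc['channel']))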
middlewares
In middlewares, the HuyaUserAgentMiddleware class is created by subclassing UserAgentMiddleware; it sets the user agent (plus Host and Referer headers) on every request Scrapy issues. The middlewares code is as follows:
from scrapy import signals
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random
class HuyaUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=""):
        '''Candidate user-agent strings.'''
        self.user_agent = user_agent
        self.ua_list = [
            "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
            "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
            "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
            "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
            "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
            "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
            "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
            "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        ]
        self.count = 0

    def process_request(self, request, spider):
        ua = 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:53.0) Gecko/20100101 Firefox/53.0'
        request.headers.setdefault('User-Agent', ua)                    # set the request's User-Agent to ua
        request.headers.setdefault('Host', 'www.huya.com')              # set the request's Host to www.huya.com
        request.headers.setdefault('Referer', 'http://www.huya.com/')   # set the request's Referer to http://www.huya.com/
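Note that process_request above always sends the same fixed ua even though ua_list is defined. If true rotation is wanted, one possible variant is the sketch below; the class name RandomUserAgentMiddleware and the shortened list are illustrative only, and the DOWNLOADER_MIDDLEWARES entry would need to point at it instead:
import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RandomUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=""):
        self.user_agent = user_agent
        self.ua_list = [
            'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:53.0) Gecko/20100101 Firefox/53.0',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',
        ]   # shortened list; reuse the full ua_list from above in practice

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.ua_list)     # pick a different UA per request
        request.headers.setdefault('Host', 'www.huya.com')
        request.headers.setdefault('Referer', 'http://www.huya.com/')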
settings
The settings are configured as follows; DOWNLOADER_MIDDLEWARES and ITEM_PIPELINES register the middleware and pipeline defined above.
DOWNLOADER_MIDDLEWARES = {
    'huyaspider.middlewares.HuyaUserAgentMiddleware': 400,                # enable the user-agent middleware defined above
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,   # disable Scrapy's built-in user-agent middleware
}
ITEM_PIPELINES = {
    'huyaspider.pipelines.HuyaspiderPipeline': 300,                       # enable the pipeline
}
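Two more settings are often worth adding for a crawl like this; they are suggestions rather than part of the original configuration:
ROBOTSTXT_OBEY = False   # newer Scrapy projects check robots.txt by default; decide this per your own policy
DOWNLOAD_DELAY = 0.5     # small delay between requests to reduce load on the site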
spider
The spider defines four callbacks: parse, channel_get, channel_parse and room_parse. Their roles are:
parse: collects the URL, id and name of every Huya channel
channel_get: callback of parse; builds the streamer-data URL from the channel id and issues the request
channel_parse: callback of channel_get; extracts the target fields from the returned JSON, then requests each streamer's room URL
room_parse: callback of channel_parse; scrapes the streamer's subscriber count
The code is as follows:
# -*- coding: utf-8 -*-
import scrapy,re,json,time
from scrapy.http import Request
from huyaspider.items import HuyaspiderItem
class HuyaSpider(scrapy.Spider):
    name = "huya"
    allowed_domains = ["www.huya.com"]      # domains the spider is allowed to crawl
    start_urls = ['http://www.huya.com/g']  # first url to crawl
    allow_pagenum = 5                       # maximum number of channels to crawl
    total_pagenum = 0                       # number of channels crawled so far
    url_dict = {}                           # dict holding the channel urls

    def parse(self, response):
        parse_content = response.xpath('/html/body/div[3]/div/div/div[2]/ul/li')  # channel list items
        for i in parse_content:
            channel_title = i.xpath('a/p/text()').extract()                 # channel name
            channel_url = i.xpath('a/@href').extract_first()                # channel url
            channel_id = i.xpath('a/@report').re(r'game_id\D*(.*)\D\}')     # channel id, stripped out with a regex
            channel_data = {"url": channel_url, "channel_id": channel_id[0]}  # map the channel url to its id
            self.url_dict[channel_title[0]] = channel_data                  # store channel_data under the channel name
            if self.total_pagenum <= self.allow_pagenum:                    # limit how many channels get crawled
                self.total_pagenum += 1
                yield Request(url=channel_url,
                              meta={'channel_data': channel_data, 'channel': channel_title},
                              callback=self.channel_get)                    # meta carries the channel url/id and name

    def channel_get(self, response):
        page_num = int(response.xpath('/html/body/div[3]/div/div/div["js-list-page"]/div[1]/@data-pages').extract_first())  # total pages in this channel
        channel_id = response.meta['channel_data']['channel_id']   # channel id from meta, used to build the paged urls
        channel = response.meta['channel']                         # channel name from meta
        for i in range(1, page_num + 1):                           # build a request for every page
            url = 'http://www.huya.com/cache.php?m=LiveList&do=getLiveListByPage&gameId={gameid}&tagAll=0&page={page}'.format(gameid=channel_id, page=i)  # json data for the next page
            yield Request(url=url, meta={'page': i, 'channel': channel}, callback=self.channel_parse)  # meta carries the page number and channel name

    def channel_parse(self, response):
        print 'channel_parse start'
        count = 0                                   # position of the room within the current page
        response_json = json.loads(response.text)   # parse the returned json into a dict
        channel = response.meta['channel']
        for i in response_json['data']['datas']:
            count += 1
            items = HuyaspiderItem()                                         # instantiate the item
            items['channel'] = channel                                       # channel name
            items['anchor_category'] = i['gameFullName'].replace('\n', '')   # streamer category, newlines removed
            items['watch_num'] = i['totalCount']                             # viewer count
            items['anchor_roomname'] = i['roomName']                         # room name
            items['anchor_url'] = 'http://www.huya.com/' + i['privateHost']  # room url
            items['anchor_name'] = i['nick']                                 # streamer name
            items['anchor_tag'] = i['recommendTagName']                      # recommended tag
            items['position'] = str(response.meta['page']) + "-" + str(count)  # position within the channel (page-index)
            yield Request(url=items['anchor_url'], meta={'items': items}, callback=self.room_parse)  # visit the room page for the subscriber count; meta carries the partially filled item

    def room_parse(self, response):
        print "room_parse start"
        items = response.meta['items']
        items['fan_num'] = response.xpath('/html/body/div[2]/div/div/div[1]/div[1]/div[2]/div/div[1]/div[2]/text()').extract_first()  # subscriber count
        if not items['fan_num']:
            items['fan_num'] = 'none'               # fall back to 'none' when the subscriber count is missing
        items['crawl_time'] = time.strftime('%Y-%m-%d %X', time.localtime())  # record the crawl time
        yield items
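To run the spider, execute scrapy crawl huya from the project root (huya is the name defined above); with the settings in place, the pipeline writes huyaanchor.csv and fills the huyaanchor collection in the huya MongoDB database.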
Summary
That is the whole pipeline: discover the channels on www.huya.com/g, page through each channel's ajax endpoint, visit every room page for the subscriber count, and export the results to CSV and MongoDB.