How to Scrape Hundreds of Thousands of Hotel Reviews from TripAdvisor with Python
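While scraping some hotel reviews from TripAdvisor, I found a good way to collect hundreds of thousands of them.
Suppose, for example, that we want to scrape hotel reviews for Gran Canaria. If we go to TripAdvisor, we will see that the URL is: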
https://www.tripadvisor.com/Hotels-g187471-Gran_Canaria_Canary_Islands-Hotels.html
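First, we need to retrieve the complete list of hotels for that location. To do so, we download the full HTML with requests.get(url) and then try to read the total number of properties from it.
If we look closely at the page HTML, we will see that this value sits inside a <span> tag.
Since that span has no identifier and its class appears to be auto-generated, we select the span inside the div next to .MOBILE_SORT_FILTER_BUTTONS. Something like: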
.MOBILE_SORT_FILTER_BUTTONS + div span
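First, we need to install the requests and bs4 packages with pip. We will also install pandas, so that we can quickly generate an Excel file and work with a DataFrame later.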
$ pip install requests bs4 pandas
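Getting the number of pages
After installing the libraries, we can write the following code to get the number of pages:

import math
import random
import re
import time

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.tripadvisor.com/Hotels-g187471-Gran_Canaria_Canary_Islands-Hotels.html'
PER_PAGE = 30

# Download the first listing page and parse it
response = requests.get(BASE_URL).text
soup = BeautifulSoup(response, 'html.parser')

# The property counter is the span inside the div next to the filter buttons
span = soup.select('.MOBILE_SORT_FILTER_BUTTONS + div span')[0]

# Keep only the digits, so thousands separators do not break int()
N_PROPERTIES = int(re.sub(r'[^0-9]', '', span.text))
print(f'There are {N_PROPERTIES} properties')

N_PAGES = math.ceil(N_PROPERTIES / PER_PAGE)
print(f'There are {N_PAGES} different pages')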
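The counter text may contain thousands separators or surrounding words, so it can help to isolate the parsing step and check it without hitting the site. A minimal sketch, assuming the span text looks something like "1,024 properties" (the exact wording on the live page is an assumption, and parse_property_count and count_pages are hypothetical helper names):

```python
import math
import re


def parse_property_count(text: str) -> int:
    """Keep only the digits of a counter such as '1,024 properties'."""
    digits = re.sub(r'\D', '', text)
    if not digits:
        raise ValueError(f'no digits found in {text!r}')
    return int(digits)


def count_pages(n_properties: int, per_page: int = 30) -> int:
    """Number of listing pages needed to show every property."""
    return math.ceil(n_properties / per_page)


n_properties = parse_property_count('1,024 properties')
print(n_properties, count_pages(n_properties))
```

With 30 hotels per page, 1,024 properties need 35 listing pages, the last one only partially filled.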
If we navigate to page 2, we will see that the URL changes to:
https://www.tripadvisor.com/Hotels-g187471-oa30-Gran_Canaria_Canary_Islands-Hotels.html
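Getting the hotel list
As we can see, the only change in the URL is the -oa30 added after the geo ID (-g187471). If we navigate to the third page, -oa60 is used instead of -oa30, and the same happens on every page: the offset grows by 30 each time. This lets us create a function, get_listing_url, that builds the URL for any page.
With that function written, we can loop from 0 to N_PAGES - 1 and generate the URL for each page.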
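The body of get_listing_url is not shown here, so the following is only a minimal sketch consistent with the URL pattern described above. It assumes that the -oa offset always goes right after the -g geo ID segment and that page indices are zero-based:

```python
import re

BASE_URL = 'https://www.tripadvisor.com/Hotels-g187471-Gran_Canaria_Canary_Islands-Hotels.html'
PER_PAGE = 30


def get_listing_url(page: int) -> str:
    """Return the listing URL for a zero-based page index."""
    if page == 0:
        # The first page is the base URL, without any offset
        return BASE_URL
    offset = page * PER_PAGE
    # Insert '-oa<offset>' right after the '-g<geo id>' segment
    return re.sub(r'(-g\d+)', rf'\g<1>-oa{offset}', BASE_URL, count=1)


for page in range(3):
    print(get_listing_url(page))
```

The loop prints the base URL followed by the -oa30 and -oa60 variants, matching what the browser shows for the first three pages.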
Now, let's download each hotel listing page and build an array with every hotel URL:

import random  # needed for the random delay

listings = []
for i in range(N_PAGES):
    url = get_listing_url(i)

    # Random delay to avoid TripAdvisor blocking us
    time.sleep(random.randint(2, 8))

    # Download current page
    listing_html = requests.get(url)
    listing_soup = BeautifulSoup(listing_html.text, 'html.parser')

    # Add hotels to listings
    raw_listings = listing_soup.select('.listing')
    for raw_listing in raw_listings:
        listings.append('https://www.tripadvisor.com' + raw_listing.a['href'])

After a few minutes, the listings variable should be filled with every hotel URL 🤩
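Analyzing the data
Now let's look at how to scrape the reviews for each hotel, starting from the reviews section that TripAdvisor shows on a hotel page.
If we scroll down, we will see that each URL only gives us 5 reviews, which is not great (a hotel may have thousands of reviews!). So let's open Chrome DevTools and check what happens when we interact with this section.
If we now change the review language (for example, to German), we will see a request to /data/graphql/batched that looks interesting.
TripAdvisor sends requests with a certain structure to its GraphQL endpoint, and one of the properties it sends is named locationId.
Again, this locationId is exactly the same as the one in the URL (in this case, Hotel_Review-g562819-d296922-Reviews-Bohemia_Suites_Spa...). What if we could use this endpoint to get the reviews of every hotel? 🤔
First, let's try to extract the location ID and the geo ID from the hotel URL. Every hotel URL looks similar to this one: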
https://www.tripadvisor.com/Hotel_Review-g562819-d296922-Reviews-Bohemia_Suites_Spa-Playa_del_Ingles_Maspalomas_Gran_Canaria_Canary_Islands.html
We need the number after -g and the number after -d:
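def get_ids_from_hotel_url(url):
    # 'Hotel_Review-g562819-d296922-Reviews-...' -> ['...Hotel_Review', 'g562819', 'd296922', ...]
    url = url.split('-')
    geo = url[1]
    loc = url[2]
    # Drop the leading 'g' / 'd' and return (geo ID, location ID)
    return (int(geo[1:]), int(loc[1:]))

Getting data from GraphQL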
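Because the GraphQL payload expects both IDs as integers, it can be useful to fail loudly on URLs that do not have the expected shape. A minimal sketch of a regex-based variant, checked against the example hotel URL (parse_hotel_ids is a hypothetical name, not part of the original script):

```python
import re

HOTEL_URL = ('https://www.tripadvisor.com/Hotel_Review-g562819-d296922-Reviews-'
             'Bohemia_Suites_Spa-Playa_del_Ingles_Maspalomas_Gran_Canaria_Canary_Islands.html')


def parse_hotel_ids(url: str) -> tuple:
    """Return (geo_id, location_id) from a Hotel_Review URL."""
    match = re.search(r'-g(\d+)-d(\d+)-', url)
    if match is None:
        raise ValueError(f'not a hotel review URL: {url}')
    return int(match.group(1)), int(match.group(2))


geo_id, location_id = parse_hotel_ids(HOTEL_URL)
print(geo_id, location_id)  # 562819 296922
```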
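Now, let's try to mimic the request that TripAdvisor sends to its GraphQL endpoint. If we copy the original request from the Network tab, we will see something similar to the following JSON: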
[
  {
    "query": "mutation LogBBMLInteraction($interaction: ClientInteractionOpaqueInput!) {\n  logProductInteraction(interaction: $interaction)\n}\n",
    "variables": {
      "interaction": {
        "productInteraction": {
          "interaction_type": "CLICK",
          "site": {"site_name": "ta", "site_business_unit": "Hotels", "site_domain": "www.tripadvisor.com"},
          "pageview": {
            "pageview_request_uid": "X@2fPQokGCIABGTeHYoAAAES",
            "pageview_attributes": {"location_id": 296922, "geo_id": 562819, "servlet_name": "Hotel_Review"}
          },
          "user": {
            "user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 11_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36",
            "site_persistent_user_uid": "web373a.83.56.0.34.17609EB3BAC",
            "unique_user_identifiers": {"session_id": "F5A494D1D5DB4DD491B72FB55E860886"}
          },
          "search": {},
          "item_group": {"item_group_collection_key": "X@2fPQokGCIABGTeHYoAAAES"},
          "item": {
            "product_type": "Hotels",
            "item_id_type": "ta-location-id",
            "item_id": 296922,
            "item_attributes": {"element_type": "de", "action_name": "REVIEW_FILTER_LANGUAGE"}
          }
        }
      }
    }
  },
  {
    "query": "query ReviewListQuery($locationId: Int!, $offset: Int, $limit: Int, $filters: [FilterConditionInput!], $prefs: ReviewListPrefsInput, $initialPrefs: ReviewListPrefsInput, $filterCacheKey: String, $prefsCacheKey: String, $keywordVariant: String!, $needKeywords: Boolean = true) {\n  cachedFilters: personalCache(key: $filterCacheKey)\n  cachedPrefs: personalCache(key: $prefsCacheKey)\n  locations(locationIds: [$locationId]) {\n    locationId\n    parentGeoId\n    name\n    placeType\n    reviewSummary {\n      rating\n      count\n    }\n    keywords(variant: $keywordVariant) @include(if: $needKeywords) {\n      keywords {\n        keyword\n      }\n    }\n    ... on LocationInformation {\n      parentGeoId\n    }\n    ... on LocationInformation {\n      parentGeoId\n    }\n    ... on LocationInformation {\n      name\n      currentUserOwnerStatus {\n        isValid\n      }\n    }\n    ... on LocationInformation {\n      locationId\n      currentUserOwnerStatus {\n        isValid\n      }\n    }\n    ... on LocationInformation {\n      locationId\n      parentGeoId\n      accommodationCategory\n      currentUserOwnerStatus {\n        isValid\n      }\n      url\n    }\n    reviewListPage(page: {offset: $offset, limit: $limit}, filters: $filters, prefs: $prefs, initialPrefs: $initialPrefs, filterCacheKey: $filterCacheKey, prefsCacheKey: $prefsCacheKey) {\n      totalCount\n      preferredReviewIds\n      reviews {\n        ... on Review {\n          id\n          url\n          location {\n            locationId\n            name\n          }\n          createdDate\n          publishedDate\n          provider {\n            isLocalProvider\n          }\n          userProfile {\n            id\n            userId: id\n            isMe\n            isVerified\n            displayName\n            username\n            avatar {\n              id\n              photoSizes {\n                url\n                width\n                height\n              }\n            }\n            hometown {\n              locationId\n              fallbackString\n              location {\n                locationId\n                additionalNames {\n                  long\n                }\n                name\n              }\n            }\n            contributionCounts {\n              sumAllUgc\n              helpfulVote\n            }\n            route {\n              url\n            }\n          }\n        }\n        ... on Review {\n          title\n          language\n          url\n        }\n        ... on Review {\n          language\n          translationType\n        }\n        ... on Review {\n          roomTip\n        }\n        ... on Review {\n          tripInfo {\n            stayDate\n          }\n          location {\n            placeType\n          }\n        }\n        ... on Review {\n          additionalRatings {\n            rating\n            ratingLabel\n          }\n        }\n        ... on Review {\n          tripInfo {\n            tripType\n          }\n        }\n        ... on Review {\n          language\n          translationType\n          mgmtResponse {\n            id\n            language\n            translationType\n          }\n        }\n        ... on Review {\n          text\n          publishedDate\n          username\n          connectionToSubject\n          language\n          mgmtResponse {\n            id\n            text\n            language\n            publishedDate\n            username\n            connectionToSubject\n          }\n        }\n        ... on Review {\n          id\n          locationId\n          title\n          text\n          rating\n          absoluteUrl\n          mcid\n          translationType\n          mtProviderId\n          photos {\n            id\n            statuses\n            photoSizes {\n              url\n              width\n              height\n            }\n          }\n          userProfile {\n            id\n            displayName\n            username\n          }\n        }\n        ... on Review {\n          mgmtResponse {\n            id\n          }\n          provider {\n            isLocalProvider\n          }\n        }\n        ... on Review {\n          translationType\n          location {\n            locationId\n            parentGeoId\n          }\n          provider {\n            isLocalProvider\n            isToolsProvider\n          }\n          original {\n            id\n            url\n            locationId\n            userId\n            language\n            submissionDomain\n          }\n        }\n        ... on Review {\n          locationId\n          mcid\n          attribution\n        }\n        ... on Review {\n          __typename\n          locationId\n          helpfulVotes\n          photoIds\n          route {\n            url\n          }\n          socialStatistics {\n            followCount\n            isFollowing\n            isLiked\n            isReposted\n            isSaved\n            likeCount\n            repostCount\n            tripCount\n          }\n          status\n          userId\n          userProfile {\n            id\n            displayName\n            isFollowing\n          }\n          location {\n            __typename\n            locationId\n            additionalNames {\n              normal\n              long\n              longOnlyParent\n              longParentAbbreviated\n              longOnlyParentAbbreviated\n              longParentStateAbbreviated\n              longOnlyParentStateAbbreviated\n              geo\n              abbreviated\n              abbreviatedRaw\n              abbreviatedStateTerritory\n              abbreviatedStateTerritoryRaw\n            }\n            parent {\n              locationId\n              additionalNames {\n                normal\n                long\n                longOnlyParent\n                longParentAbbreviated\n                longOnlyParentAbbreviated\n                longParentStateAbbreviated\n                longOnlyParentStateAbbreviated\n                geo\n                abbreviated\n                abbreviatedRaw\n                abbreviatedStateTerritory\n                abbreviatedStateTerritoryRaw\n              }\n            }\n          }\n        }\n        ... on Review {\n          text\n          language\n        }\n        ... on Review {\n          locationId\n          absoluteUrl\n          mcid\n          translationType\n          mtProviderId\n          originalLanguage\n          rating\n        }\n        ... on Review {\n          id\n          locationId\n          title\n          labels\n          rating\n          absoluteUrl\n          mcid\n          translationType\n          mtProviderId\n          alertStatus\n        }\n      }\n    }\n    reviewAggregations {\n      ratingCounts\n      languageCounts\n      alertStatusCount\n    }\n  }\n}\n",
    "variables": {
      "locationId": 296922,
      "offset": 0,
      "filters": [{"axis": "LANGUAGE", "selections": ["de"]}],
      "prefs": null,
      "initialPrefs": {},
      "limit": 5,
      "filterCacheKey": null,
      "prefsCacheKey": "locationReviewPrefs",
      "needKeywords": false,
      "keywordVariant": "location_keywords_v2_llr_order_30_en"
    }
  },
  {
    "query": "mutation UpdateReviewSettings($key: String!, $val: String!) {\n  writePersonalCache(key: $key, value: $val)\n}\n",
    "variables": {
      "key": "locationReviewFilters_296922",
      "val": "[{\"axis\":\"LANGUAGE\",\"selections\":[\"de\"]}]"
    }
  }
]

Analyzing this JSON, we can see that it is an array of three elements. The first element appears to log the interaction. The second element has some interesting properties, such as variables.locationId, variables.filters[].selections (which appears to be an array of ISO language codes), variables.offset (the number of reviews to skip) and variables.limit (the maximum number of reviews to return). The third element appears to write the user's preferences to their database.
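The four properties that matter for pagination can be gathered in one place. The sketch below builds the variables object of the second (ReviewListQuery) element, copying the remaining keys verbatim from the captured request. build_review_variables is a hypothetical helper, and whether the server honors every limit value is not something the captured request can confirm:

```python
def build_review_variables(location_id, page=0, limit=20, languages=('en',)):
    """Build the `variables` object for the ReviewListQuery batch element."""
    return {
        'locationId': location_id,
        'offset': page * limit,  # number of reviews to skip
        'limit': limit,          # maximum number of reviews to return
        'filters': [{'axis': 'LANGUAGE', 'selections': list(languages)}],
        # The keys below are copied unchanged from the captured request
        'prefs': None,
        'initialPrefs': {},
        'filterCacheKey': None,
        'prefsCacheKey': 'locationReviewPrefs',
        'needKeywords': False,
        'keywordVariant': 'location_keywords_v2_llr_order_30_en',
    }


variables = build_review_variables(296922, page=2, languages=('de', 'en'))
print(variables['offset'], variables['filters'])
```

Page 2 with a limit of 20 skips the first 40 reviews, so consecutive pages never overlap.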
Knowing this, we can create a new function that fetches the GraphQL data for a given hotel:
GRAPHQL_URL = 'https://www.tripadvisor.com/data/graphql/batched'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.101 Safari/537.36'

# First batch element: logs the interaction
LOG_INTERACTION_QUERY = """mutation LogBBMLInteraction($interaction: ClientInteractionOpaqueInput!) {
  logProductInteraction(interaction: $interaction)
}
"""

# Second batch element: the query that actually returns the reviews
REVIEW_LIST_QUERY = """query ReviewListQuery($locationId: Int!, $offset: Int, $limit: Int, $filters: [FilterConditionInput!], $prefs: ReviewListPrefsInput, $initialPrefs: ReviewListPrefsInput, $filterCacheKey: String, $prefsCacheKey: String, $keywordVariant: String!, $needKeywords: Boolean = true) {
  cachedFilters: personalCache(key: $filterCacheKey)
  cachedPrefs: personalCache(key: $prefsCacheKey)
  locations(locationIds: [$locationId]) {
    locationId
    parentGeoId
    name
    placeType
    reviewSummary {
      rating
      count
    }
    keywords(variant: $keywordVariant) @include(if: $needKeywords) {
      keywords {
        keyword
      }
    }
    ... on LocationInformation {
      parentGeoId
    }
    ... on LocationInformation {
      parentGeoId
    }
    ... on LocationInformation {
      name
      currentUserOwnerStatus {
        isValid
      }
    }
    ... on LocationInformation {
      locationId
      currentUserOwnerStatus {
        isValid
      }
    }
    ... on LocationInformation {
      locationId
      parentGeoId
      accommodationCategory
      currentUserOwnerStatus {
        isValid
      }
      url
    }
    reviewListPage(page: {offset: $offset, limit: $limit}, filters: $filters, prefs: $prefs, initialPrefs: $initialPrefs, filterCacheKey: $filterCacheKey, prefsCacheKey: $prefsCacheKey) {
      totalCount
      preferredReviewIds
      reviews {
        ... on Review {
          id
          url
          location {
            locationId
            name
          }
          createdDate
          publishedDate
          provider {
            isLocalProvider
          }
          userProfile {
            id
            userId: id
            isMe
            isVerified
            displayName
            username
            avatar {
              id
              photoSizes {
                url
                width
                height
              }
            }
            hometown {
              locationId
              fallbackString
              location {
                locationId
                additionalNames {
                  long
                }
                name
              }
            }
            contributionCounts {
              sumAllUgc
              helpfulVote
            }
            route {
              url
            }
          }
        }
        ... on Review {
          title
          language
          url
        }
        ... on Review {
          language
          translationType
        }
        ... on Review {
          roomTip
        }
        ... on Review {
          tripInfo {
            stayDate
          }
          location {
            placeType
          }
        }
        ... on Review {
          additionalRatings {
            rating
            ratingLabel
          }
        }
        ... on Review {
          tripInfo {
            tripType
          }
        }
        ... on Review {
          language
          translationType
          mgmtResponse {
            id
            language
            translationType
          }
        }
        ... on Review {
          text
          publishedDate
          username
          connectionToSubject
          language
          mgmtResponse {
            id
            text
            language
            publishedDate
            username
            connectionToSubject
          }
        }
        ... on Review {
          id
          locationId
          title
          text
          rating
          absoluteUrl
          mcid
          translationType
          mtProviderId
          photos {
            id
            statuses
            photoSizes {
              url
              width
              height
            }
          }
          userProfile {
            id
            displayName
            username
          }
        }
        ... on Review {
          mgmtResponse {
            id
          }
          provider {
            isLocalProvider
          }
        }
        ... on Review {
          translationType
          location {
            locationId
            parentGeoId
          }
          provider {
            isLocalProvider
            isToolsProvider
          }
          original {
            id
            url
            locationId
            userId
            language
            submissionDomain
          }
        }
        ... on Review {
          locationId
          mcid
          attribution
        }
        ... on Review {
          __typename
          locationId
          helpfulVotes
          photoIds
          route {
            url
          }
          socialStatistics {
            followCount
            isFollowing
            isLiked
            isReposted
            isSaved
            likeCount
            repostCount
            tripCount
          }
          status
          userId
          userProfile {
            id
            displayName
            isFollowing
          }
          location {
            __typename
            locationId
            additionalNames {
              normal
              long
              longOnlyParent
              longParentAbbreviated
              longOnlyParentAbbreviated
              longParentStateAbbreviated
              longOnlyParentStateAbbreviated
              geo
              abbreviated
              abbreviatedRaw
              abbreviatedStateTerritory
              abbreviatedStateTerritoryRaw
            }
            parent {
              locationId
              additionalNames {
                normal
                long
                longOnlyParent
                longParentAbbreviated
                longOnlyParentAbbreviated
                longParentStateAbbreviated
                longOnlyParentStateAbbreviated
                geo
                abbreviated
                abbreviatedRaw
                abbreviatedStateTerritory
                abbreviatedStateTerritoryRaw
              }
            }
          }
        }
        ... on Review {
          text
          language
        }
        ... on Review {
          locationId
          absoluteUrl
          mcid
          translationType
          mtProviderId
          originalLanguage
          rating
        }
        ... on Review {
          id
          locationId
          title
          labels
          rating
          absoluteUrl
          mcid
          translationType
          mtProviderId
          alertStatus
        }
      }
    }
    reviewAggregations {
      ratingCounts
      languageCounts
      alertStatusCount
    }
  }
}
"""

# Third batch element: stores the user's filter preferences
UPDATE_SETTINGS_QUERY = """mutation UpdateReviewSettings($key: String!, $val: String!) {
  writePersonalCache(key: $key, value: $val)
}
"""


def request_graphql(url, page=0):
    geo, loc = get_ids_from_hotel_url(url)

    request = [
        {
            'query': LOG_INTERACTION_QUERY,
            'variables': {'interaction': {'productInteraction': {
                'interaction_type': 'CLICK',
                'site': {'site_name': 'ta', 'site_business_unit': 'Hotels', 'site_domain': 'www.tripadvisor.com'},
                'pageview': {
                    'pageview_request_uid': 'X@2fPQokGCIABGTeHYoAAAES',
                    'pageview_attributes': {'location_id': loc, 'geo_id': geo, 'servlet_name': 'Hotel_Review'},
                },
                'user': {
                    'user_agent': USER_AGENT,
                    'site_persistent_user_uid': 'web373a.83.56.0.34.17609EB3BAC',
                    'unique_user_identifiers': {'session_id': '{YOUR_SESSION_ID}'},
                },
                'search': {},
                'item_group': {'item_group_collection_key': 'X@2fPQokGCIABGTeHYoAAAES'},
                'item': {
                    'product_type': 'Hotels',
                    'item_id_type': 'ta-location-id',
                    'item_id': loc,
                    'item_attributes': {'element_type': 'es', 'action_name': 'REVIEW_FILTER_LANGUAGE'},
                },
            }}},
        },
        {
            'query': REVIEW_LIST_QUERY,
            'variables': {
                'locationId': loc,
                'offset': page * 20,  # 20 reviews per page
                'filters': [{'axis': 'LANGUAGE', 'selections': ['es', 'en', 'de', 'fr', 'it']}],
                'prefs': None,
                'initialPrefs': {},
                'limit': 20,
                'filterCacheKey': None,
                'prefsCacheKey': 'locationReviewPrefs',
                'needKeywords': False,
                'keywordVariant': 'location_keywords_v2_llr_order_30_en',
            },
        },
        {
            'query': UPDATE_SETTINGS_QUERY,
            'variables': {
                # Keyed by location ID, following the pattern of the captured request
                'key': f'locationReviewFilters_{loc}',
                'val': '[{"axis":"LANGUAGE","selections":["es"]}]',
            },
        },
    ]

    response = requests.post(GRAPHQL_URL, json=request, headers={
        'origin': 'https://www.tripadvisor.com',
        'pragma': 'no-cache',
        'referer': url,
        'user-agent': USER_AGENT,
        'x-requested-by': '{YOUR_REQUESTED_BY}',
        'Cookie': '{YOUR_COOKIE_STRING}',
    })
    return response.json()

Before continuing, note that you need to change three things inside this function:
- {YOUR_SESSION_ID}: search for the "TASID" cookie in DevTools and put its value here.
- {YOUR_COOKIE_STRING}: go to "Request Headers" and paste the complete Cookie string here.
- {YOUR_REQUESTED_BY}: go to "Request Headers" and paste the complete X-Requested-By header here.
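Since all three values are tied to a browser session, one option is to keep them out of the source file and read them from the environment. This is only a sketch: the variable names TA_SESSION_ID, TA_COOKIE, TA_REQUESTED_BY and TA_USER_AGENT, and the helpers build_headers and get_session_id, are hypothetical and not part of the original script.

```python
import os


def get_session_id() -> str:
    """Value of the TASID cookie, read from a (hypothetical) TA_SESSION_ID variable."""
    return os.environ['TA_SESSION_ID']


def build_headers(referer: str) -> dict:
    """Assemble the request headers from (hypothetical) environment variables."""
    return {
        'origin': 'https://www.tripadvisor.com',
        'pragma': 'no-cache',
        'referer': referer,
        'user-agent': os.environ.get('TA_USER_AGENT', 'Mozilla/5.0'),
        # A KeyError here means the variable has not been exported yet
        'x-requested-by': os.environ['TA_REQUESTED_BY'],
        'Cookie': os.environ['TA_COOKIE'],
    }
```

After exporting the variables in the shell, build_headers(url) can be passed as the headers argument of requests.post, and get_session_id() can replace the {YOUR_SESSION_ID} placeholder.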
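Now that this function is ready, we are very close to iterating over every hotel URL and calling it to get all the reviews. Only one thing is missing: we need to find out how many reviews each hotel has.
Let's try our super function and see what happens :-)
And it works! As we can see in the response structure, we just got the first 20 reviews of this hotel in data.locations[0].reviewListPage.reviews!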
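The path to the reviews is fairly deep, and any level may come back empty for some hotels. A minimal sketch of a defensive accessor follows, assuming the reviews always live in the second element of the batched response. extract_reviews is a hypothetical helper, and the sample response is made up and heavily trimmed:

```python
def extract_reviews(batched_response):
    """Return the review list from a batched GraphQL response, or [] if absent."""
    # The ReviewListQuery result is the second element of the batch
    data = batched_response[1].get('data') or {}
    locations = data.get('locations') or []
    if not locations:
        return []
    first_location = locations[0] or {}
    page = first_location.get('reviewListPage') or {}
    return page.get('reviews') or []


# Made-up, trimmed response used only to exercise the function
sample = [
    {'data': {}},
    {'data': {'locations': [{
        'name': 'Example Hotel',
        'reviewListPage': {'totalCount': 1, 'reviews': [{'id': 1, 'title': 'Nice'}]},
    }]}},
    {'data': {}},
]
print(extract_reviews(sample))
```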
Each review has the following structure:
{
  "id": 780641546,
  "url": "/ShowUserReviews-g664857-d559667-r780641546-Hotel_Cordial_Mogan_Playa-Puerto_de_Mogan_Mogan_Gran_Canaria_Canary_Islands.html",
  "location": {
    "locationId": 559667,
    "name": "Hotel Cordial Mogan Playa",
    "placeType": "ACCOMMODATION",
    "parentGeoId": 187471,
    "__typename": "LocationInformation",
    "additionalNames": {
      "normal": "Hotel Cordial Mogan Playa",
      "long": "Hotel Cordial Mogan Playa, Spain",
      "longOnlyParent": "Spain",
      "longParentAbbreviated": "Hotel Cordial Mogan Playa, Spain",
      "longOnlyParentAbbreviated": "Spain",
      "longParentStateAbbreviated": "Hotel Cordial Mogan Playa, Spain",
      "longOnlyParentStateAbbreviated": "Spain",
      "geo": "Puerto de Mogan",
      "abbreviated": "Hotel Cordial Mogan Playa",
      "abbreviatedRaw": "Hotel Cordial Mogan Playa",
      "abbreviatedStateTerritory": "Hotel Cordial Mogan Playa",
      "abbreviatedStateTerritoryRaw": "Hotel Cordial Mogan Playa"
    },
    "parent": {
      "locationId": 187471,
      "additionalNames": {
        "normal": "Gran Canaria",
        "long": "Gran Canaria, Spain",
        "longOnlyParent": "Spain",
        "longParentAbbreviated": "Gran Canaria, Spain",
        "longOnlyParentAbbreviated": "Spain",
        "longParentStateAbbreviated": "Gran Canaria, Spain",
        "longOnlyParentStateAbbreviated": "Spain",
        "geo": "Gran Canaria",
        "abbreviated": "Gran Canaria",
        "abbreviatedRaw": "Gran Canaria",
        "abbreviatedStateTerritory": "Gran Canaria",
        "abbreviatedStateTerritoryRaw": "Gran Canaria"
      }
    }
  },
  "createdDate": "2021-01-06",
  "publishedDate": "2021-01-06",
  "provider": {"isLocalProvider": true, "isToolsProvider": true},
  "userProfile": {
    "id": "63A449F68F3328E582979E7BC8F5D5E3",
    "userId": "63A449F68F3328E582979E7BC8F5D5E3",
    "isMe": false,
    "isVerified": false,
    "displayName": "kattullus",
    "username": "kattullus",
    "avatar": {"id": 120146276, "photoSizes": []},
    "hometown": {
      "locationId": 189852,
      "fallbackString": "189852",
      "location": {"locationId": 189852, "additionalNames": {"long": "Stockholm, Sweden"}, "name": "Stockholm"}
    },
    "contributionCounts": {"sumAllUgc": 890, "helpfulVote": 150},
    "route": {"url": "/Profile/kattullus"},
    "isFollowing": false
  },
  "title": "Did not stay but want to applaud the fantastic New Year's buffet",
  "language": "en",
  "translationType": null,
  "roomTip": null,
  "tripInfo": {"stayDate": "2020-12-31", "tripType": "NONE"},
  "additionalRatings": [
    {"rating": 4, "ratingLabel": "Location"},
    {"rating": 5, "ratingLabel": "Service"},
    {"rating": 5, "ratingLabel": "Sleep Quality"}
  ],
  "text": "The New Year's buffet was arguably the finest buffet we have enjoyed. Hundreds of dishes, all beautifully presented and delicious. Wines/beer included and the cost very reasonable. The venue is fantastic and in itself worth a detour!",
  "username": "kattullus",
  "connectionToSubject": null,
  "locationId": 559667,
  "rating": 5,
  "absoluteUrl": "https://www.tripadvisor.com/ShowUserReviews-g664857-d559667-r780641546-Hotel_Cordial_Mogan_Playa-Puerto_de_Mogan_Mogan_Gran_Canaria_Canary_Islands.html",
  "mcid": 53922,
  "mtProviderId": 0,
  "photos": [],
  "original": null,
  "attribution": null,
  "__typename": "Review",
  "helpfulVotes": 0,
  "photoIds": [],
  "route": {"url": "/ShowUserReviews-g664857-d559667-r780641546-Hotel_Cordial_Mogan_Playa-Puerto_de_Mogan_Mogan_Gran_Canaria_Canary_Islands.html"},
  "originalLanguage": "en",
  "labels": [],
  "alertStatus": false
}

Let's parse just this document to build our dataset:
# Assumed cap on the pages fetched per hotel (MAX_PAGES is not defined elsewhere); adjust as needed
MAX_PAGES = 50

data = []
for hotel_url in listings:
    response = request_graphql(hotel_url)[1]['data']['locations'][0]
    hotel_name = response['name']
    print(f'Scraping {hotel_name}')

    # Get total review count
    total_reviews = response['reviewListPage']['totalCount']

    # Get number of pages to get all the reviews
    pages = math.ceil(total_reviews / 20)
    pages = min(MAX_PAGES, pages)

    # Iterate through every possible page to get all the reviews
    for i in range(pages):
        # Sleep random seconds to avoid blocking
        time.sleep(random.randint(1, 3))

        # Get the GraphQL response for each page
        response = request_graphql(hotel_url, page=i)[1]['data']['locations'][0]

        # Get the reviews from each response
        reviews = response['reviewListPage']['reviews'] if response['reviewListPage'] is not None else []

        # Add each review to the array
        for review in reviews:
            review_title = review['title']
            review_description = review['text']
            location = review['location']['parent']['additionalNames']['normal']

            # userProfile, hometown and its location may each be missing
            user = review['userProfile'] or {}
            hometown = (user.get('hometown') or {}).get('location')

            review_data = {
                'Hotel Name': hotel_name,
                'Review Date': review['createdDate'],
                'Stay Date': review['tripInfo']['stayDate'] if review['tripInfo'] is not None else None,
                'Location': location,
                'Lang': review['language'],
                'Room Tip': review['roomTip'] if 'roomTip' in review else None,
                'Review Title': review_title,
                'Review Stars': review['rating'],
                'Review': review_description,
                'User Name': user.get('displayName'),
                'Hometown': hometown['additionalNames']['long'] if hometown is not None else None,
            }

            # Iterate through additionalRatings (Cleanliness, Room Service...)
            for rating in review['additionalRatings']:
                review_data[f'{rating["ratingLabel"]} Stars'] = rating['rating']

            data.append(review_data)

    print(f'Reviews: {len(data)}')

Now that the array is filled (which can take a long time, depending on how many hotels you need), let's create a pandas DataFrame and store the results as CSV:
import pandas as pd

df = pd.DataFrame(data)
df.to_csv('./reviews.csv', index=False, encoding='utf-8-sig', sep=';')
df.head()

Summary
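We have learned how to use the TripAdvisor GraphQL endpoint to request all the reviews for a list of hotels, ending up with a structured CSV file that can be used for further ML analysis.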