WeRateDogs: Analyzing Twitter Data
Data Gathering
Import the required libraries
In [60]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import requests
import json
import os
Load and assess twitter-archive-enhanced
In [61]:
twitter_archive_enhanced = pd.read_csv('twitter-archive-enhanced.csv')
In [62]:
twitter_archive_enhanced.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2356 entries, 0 to 2355
Data columns (total 17 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   tweet_id                    2356 non-null   int64
 1   in_reply_to_status_id       78 non-null     float64
 2   in_reply_to_user_id         78 non-null     float64
 3   timestamp                   2356 non-null   object
 4   source                      2356 non-null   object
 5   text                        2356 non-null   object
 6   retweeted_status_id         181 non-null    float64
 7   retweeted_status_user_id    181 non-null    float64
 8   retweeted_status_timestamp  181 non-null    object
 9   expanded_urls               2297 non-null   object
 10  rating_numerator            2356 non-null   int64
 11  rating_denominator          2356 non-null   int64
 12  name                        2356 non-null   object
 13  doggo                       2356 non-null   object
 14  floofer                     2356 non-null   object
 15  pupper                      2356 non-null   object
 16  puppo                       2356 non-null   object
dtypes: float64(4), int64(3), object(10)
memory usage: 313.0+ KB
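The info output above shows several problems. tweet_id and timestamp have the wrong type. in_reply_to_status_id and in_reply_to_user_id have only 78 non-null values. expanded_urls contains nulls, which correspond to tweets without photos; per the project requirements, these rows must be removed later.
In [63]:
twitter_archive_enhanced.retweeted_status_id.notnull().value_counts()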
Out[63]:
False 2175
True 181
Name: retweeted_status_id, dtype: int64
Rows with a non-null retweeted_status_id are retweets. There are 181 of them, and per the project requirements they must be removed later.
In [64]:
twitter_archive_enhanced.name.value_counts()
Out[64]:
None 745
a 55
Charlie 12
Oliver 11
Lucy 11
...
Karll 1
Tiger 1
old 1
Meatball 1
Stormy 1
Name: name, Length: 957, dtype: int64
In [65]:
twitter_archive_enhanced.text[twitter_archive_enhanced.name=='a'].iloc[1]
Out[65]:
'Here is a perfect example of someone who has their priorities in order. 13/10 for both owner and Forrest https://t.co/LRyMrU7Wfq'
* 55 dogs are named "a". Inspecting one of these tweets shows that "a" is clearly not the dog's name, so this is a quality issue.
* The text column contains links.
In [66]:
twitter_archive_enhanced.rating_denominator.value_counts()
Out[66]:
10 2333
11 3
50 3
80 2
20 2
2 1
16 1
40 1
70 1
15 1
90 1
110 1
120 1
130 1
150 1
170 1
7 1
0 1
Name: rating_denominator, dtype: int64
As the counts show, rating_denominator is not always 10, and one row even has a denominator of 0.
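One way to look into these rows is to filter on the denominator and read the tweet text next to the parsed values. The sketch below is only illustrative: it uses a small made-up frame (`toy`, three invented rows) with the same column names as the archive, not the real data.

```python
import pandas as pd

# Toy stand-in for twitter_archive_enhanced; the rows are invented for illustration.
toy = pd.DataFrame({
    "text": ["Good boy 13/10", "Born 9/11 but still 14/10", "Whole litter 84/70"],
    "rating_numerator": [13, 9, 84],
    "rating_denominator": [10, 11, 70],
})

# Keep only rows whose denominator differs from the usual 10.
odd = toy.loc[toy["rating_denominator"] != 10,
              ["text", "rating_numerator", "rating_denominator"]]
print(odd)
```

Reading the text column side by side with the parsed numbers helps tell a group rating (several dogs in one photo) from a date or fraction that was mistaken for a rating.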
In [67]:
twitter_archive_enhanced.source.iloc[0]
Out[67]:
'<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
The source column contains HTML markup.
The archive also has a tidiness issue: dog stage is a single variable, so doggo, floofer, pupper, and puppo should be one column.
Gather and assess image-predictions
In [68]:
folder_name = 'pred-image'
if not os.path.exists(folder_name):
    os.makedirs(folder_name)
url = 'https://raw.githubusercontent.com/udacity/new-dand-advanced-china/master/%E6%95%B0%E6%8D%AE%E6%B8%85%E6%B4%97/WeRateDogs%E9%A1%B9%E7%9B%AE/image-predictions.tsv'
response = requests.get(url)
response
Out[68]:
<Response [200]>
In [69]:
with open(os.path.join(folder_name, url.split('/')[-1]), mode='wb') as file:
    file.write(response.content)
In [70]:
os.listdir(folder_name)
Out[70]:
['image-predictions.tsv']
In [71]:
# The file was saved into folder_name above, so read it from there.
image_predictions = pd.read_csv(os.path.join(folder_name, 'image-predictions.tsv'), sep='\t')
In [72]:
image_predictions.head()
Out[72]:
|  | tweet_id | jpg_url | img_num | p1 | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 666020888022790149 | https://pbs.twimg.com/media/CT4udn0WwAA0aMy.jpg | 1 | Welsh_springer_spaniel | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True |
| 1 | 666029285002620928 | https://pbs.twimg.com/media/CT42GRgUYAA5iDo.jpg | 1 | redbone | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.072010 | True |
| 2 | 666033412701032449 | https://pbs.twimg.com/media/CT4521TWwAEvMyu.jpg | 1 | German_shepherd | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True |
| 3 | 666044226329800704 | https://pbs.twimg.com/media/CT5Dr8HUEAA-lEu.jpg | 1 | Rhodesian_ridgeback | 0.408143 | True | redbone | 0.360687 | True | miniature_pinscher | 0.222752 | True |
| 4 | 666049248165822465 | https://pbs.twimg.com/media/CT5IQmsXIAAKY4A.jpg | 1 | miniature_pinscher | 0.560311 | True | Rottweiler | 0.243682 | True | Doberman | 0.154629 | True |
In [73]:
image_predictions.jpg_url.duplicated().value_counts()
Out[73]:
False 2009
True 66
Name: jpg_url, dtype: int64
There are 66 duplicated image URLs.
tweet_id has the wrong type.
Load and assess tweet_json
In [74]:
tweet_json = pd.DataFrame()
In [75]:
# Read one JSON object per line and keep only the fields we need.
with open('tweet_json.txt', 'r') as file:
    for line in file:
        dic = json.loads(line)
        tem_df = pd.DataFrame({'tweet_id': dic['id'],
                               'retweet_count': dic['retweet_count'],
                               'favorite_count': dic['favorite_count']}, index=[0])
        # Each one-row frame keeps index 0, which is why every row is labeled 0.
        tweet_json = pd.concat([tweet_json, tem_df])
In [76]:
tweet_json
Out[76]:
|  | tweet_id | retweet_count | favorite_count |
|---|---|---|---|
| 0 | 892420643555336193 | 8842 | 39492 |
| 0 | 892177421306343426 | 6480 | 33786 |
| 0 | 891815181378084864 | 4301 | 25445 |
| 0 | 891689557279858688 | 8925 | 42863 |
| 0 | 891327558926688256 | 9721 | 41016 |
| ... | ... | ... | ... |
| 0 | 666049248165822465 | 41 | 111 |
| 0 | 666044226329800704 | 147 | 309 |
| 0 | 666033412701032449 | 47 | 128 |
| 0 | 666029285002620928 | 48 | 132 |
| 0 | 666020888022790149 | 530 | 2528 |
2352 rows × 3 columns
tweet_id has the wrong type.
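The loop that builds tweet_json concatenates a one-row frame per line, which is why every row carries index 0. A common alternative is to collect plain dicts and construct the frame once. The following is a minimal sketch under stated assumptions: `fake_file` and its two JSON lines are made up so the block runs on its own, and `tweet_json_alt` is a hypothetical name.

```python
import io
import json

import pandas as pd

# Two made-up JSON lines standing in for tweet_json.txt.
fake_file = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 20}\n'
    '{"id": 2, "retweet_count": 30, "favorite_count": 40}\n'
)

rows = []
for line in fake_file:
    tweet = json.loads(line)
    rows.append({
        "tweet_id": tweet["id"],
        "retweet_count": tweet["retweet_count"],
        "favorite_count": tweet["favorite_count"],
    })

# Build the frame once; the index runs 0..n-1 instead of repeating 0.
tweet_json_alt = pd.DataFrame(rows)
print(tweet_json_alt)
```

Building the frame once avoids copying the growing frame on every iteration, and the resulting index is unique.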
In summary:
Quality issues:
- tweet_id and timestamp have the wrong type
- jpg_url has 66 duplicated links
- source contains HTML markup
- rating_denominator is not always 10, and one denominator is 0
- 55 dogs are named "a", which is clearly not a real name
- text contains links
- Rows with a non-null retweeted_status_id are retweets (181 rows); per the project requirements they must be removed
- in_reply_to_status_id and in_reply_to_user_id have only 78 non-null values
- Tweets without photos must be removed per the project requirements
Tidiness issues:
- Dog stage is one variable, so doggo, floofer, pupper, and puppo should be a single column
- The three datasets share one observational unit, keyed by tweet_id, and can be merged into one dataset
Data Cleaning
In [77]:
twitter_archive_enhanced_clean = twitter_archive_enhanced.copy()
image_predictions_clean = image_predictions.copy()
tweet_json_clean = tweet_json.copy()
Issue: tweet_id has the wrong type
Define: Convert tweet_id to str
Code:
In [78]:
twitter_archive_enhanced_clean['tweet_id'] = twitter_archive_enhanced_clean['tweet_id'].astype('str')
In [79]:
image_predictions_clean['tweet_id'] = image_predictions_clean['tweet_id'].astype('str')
In [80]:
tweet_json_clean['tweet_id'] = tweet_json_clean['tweet_id'].astype('str')
Test:
In [81]:
twitter_archive_enhanced_clean['tweet_id']
Out[81]:
0 892420643555336193
1 892177421306343426
2 891815181378084864
3 891689557279858688
4 891327558926688256
...
2351 666049248165822465
2352 666044226329800704
2353 666033412701032449
2354 666029285002620928
2355 666020888022790149
Name: tweet_id, Length: 2356, dtype: object
In [82]:
image_predictions_clean['tweet_id']
Out[82]:
0 666020888022790149
1 666029285002620928
2 666033412701032449
3 666044226329800704
4 666049248165822465
...
2070 891327558926688256
2071 891689557279858688
2072 891815181378084864
2073 892177421306343426
2074 892420643555336193
Name: tweet_id, Length: 2075, dtype: object
In [83]:
tweet_json_clean['tweet_id']
Out[83]:
0 892420643555336193
0 892177421306343426
0 891815181378084864
0 891689557279858688
0 891327558926688256
...
0 666049248165822465
0 666044226329800704
0 666033412701032449
0 666029285002620928
0 666020888022790149
Name: tweet_id, Length: 2352, dtype: object
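Strings are a safe choice for identifiers because nobody does arithmetic on them, and because an integer column that picks up a missing value is typically stored as float64 (as in_reply_to_status_id is in the info output), which cannot hold an 18-digit ID exactly. The short check below illustrates the precision loss with the first tweet_id from the output; it is only a demonstration, not part of the cleaning pipeline.

```python
tweet_id = 892420643555336193  # first ID shown in the archive output

# A float64 has a 53-bit significand, so an ID this large cannot round-trip through it.
as_float = float(tweet_id)
print(int(as_float) == tweet_id)

# A string keeps every digit.
print(str(tweet_id))
```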
Issue: timestamp has the wrong type
Define: Convert timestamp to datetime
Code:
In [84]:
twitter_archive_enhanced_clean['timestamp'] = pd.to_datetime(twitter_archive_enhanced_clean['timestamp'])
Test:
In [85]:
twitter_archive_enhanced_clean['timestamp']
Out[85]:
0 2017-08-01 16:23:56+00:00
1 2017-08-01 00:17:27+00:00
2 2017-07-31 00:18:03+00:00
3 2017-07-30 15:58:51+00:00
4 2017-07-29 16:00:24+00:00
...
2351 2015-11-16 00:24:50+00:00
2352 2015-11-16 00:04:52+00:00
2353 2015-11-15 23:21:54+00:00
2354 2015-11-15 23:05:30+00:00
2355 2015-11-15 22:32:08+00:00
Name: timestamp, Length: 2356, dtype: datetime64[ns, UTC]
Issue: 55 dogs are named "a"; inspecting one of these tweets shows that "a" is clearly not a dog's name
Define: Replace "a" with NaN
Code:
In [86]:
twitter_archive_enhanced_clean['name'] = twitter_archive_enhanced_clean['name'].replace('a', np.nan)
Test:
In [88]:
twitter_archive_enhanced_clean['name'].value_counts()
Out[88]:
None 745
Charlie 12
Lucy 11
Oliver 11
Cooper 11
...
Karll 1
Tiger 1
old 1
Meatball 1
Stormy 1
Name: name, Length: 956, dtype: int64
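The output still contains other lowercase entries such as "old", and the literal string "None" is counted as a name. A possible extension, sketched here on a made-up Series, is to treat every all-lowercase value as a mis-extracted word. Two assumptions are labeled in the code: that real dog names start with an uppercase letter, and that "None" should count as missing; neither is stated in the original notebook.

```python
import numpy as np
import pandas as pd

# Toy names; 'a', 'old' and 'None' mirror values seen in the value_counts output.
names = pd.Series(["Charlie", "a", "old", "None", "Lucy"])

# Assumption: a real dog name starts with an uppercase letter,
# so all-lowercase values are words picked up by mistake.
is_lowercase = names.str.islower()

# Assumption: the literal string 'None' stands for a missing name.
cleaned = names.mask(is_lowercase, np.nan).replace("None", np.nan)
print(cleaned.tolist())
```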
Issue: The denominator is not always 10
Define: Drop the row whose denominator is 0, create a new column rating = rating_numerator / rating_denominator, then drop rating_numerator and rating_denominator.
Code:
In [90]:
# .copy() avoids a SettingWithCopyWarning when the rating column is added next.
twitter_archive_enhanced_clean = twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.rating_denominator != 0].copy()
In [91]:
twitter_archive_enhanced_clean['rating'] = twitter_archive_enhanced_clean.rating_numerator / twitter_archive_enhanced_clean.rating_denominator
In [92]:
twitter_archive_enhanced_clean = twitter_archive_enhanced_clean.drop(['rating_numerator', 'rating_denominator'], axis=1)
Test:
In [93]:
twitter_archive_enhanced_clean
Out[93]:
|  | tweet_id | in_reply_to_status_id | in_reply_to_user_id | timestamp | source | text | retweeted_status_id | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | name | doggo | floofer | pupper | puppo | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | NaN | NaN | 2017-08-01 16:23:56+00:00 | <a href="http://twitter.com/download/iphone" r... | This is Phineas. He's a mystical boy. Only eve... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892420643... | Phineas | None | None | None | None | 1.3 |
| 1 | 892177421306343426 | NaN | NaN | 2017-08-01 00:17:27+00:00 | <a href="http://twitter.com/download/iphone" r... | This is Tilly. She's just checking pup on you.... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/892177421... | Tilly | None | None | None | None | 1.3 |
| 2 | 891815181378084864 | NaN | NaN | 2017-07-31 00:18:03+00:00 | <a href="http://twitter.com/download/iphone" r... | This is Archie. He is a rare Norwegian Pouncin... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891815181... | Archie | None | None | None | None | 1.2 |
| 3 | 891689557279858688 | NaN | NaN | 2017-07-30 15:58:51+00:00 | <a href="http://twitter.com/download/iphone" r... | This is Darla. She commenced a snooze mid meal... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891689557... | Darla | None | None | None | None | 1.3 |
| 4 | 891327558926688256 | NaN | NaN | 2017-07-29 16:00:24+00:00 | <a href="http://twitter.com/download/iphone" r... | This is Franklin. He would like you to stop ca... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/891327558... | Franklin | None | None | None | None | 1.2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2351 | 666049248165822465 | NaN | NaN | 2015-11-16 00:24:50+00:00 | <a href="http://twitter.com/download/iphone" r... | Here we have a 1949 1st generation vulpix. Enj... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666049248... | None | None | None | None | None | 0.5 |
| 2352 | 666044226329800704 | NaN | NaN | 2015-11-16 00:04:52+00:00 | <a href="http://twitter.com/download/iphone" r... | This is a purebred Piers Morgan. Loves to Netf... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666044226... | NaN | None | None | None | None | 0.6 |
| 2353 | 666033412701032449 | NaN | NaN | 2015-11-15 23:21:54+00:00 | <a href="http://twitter.com/download/iphone" r... | Here is a very happy pup. Big fan of well-main... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666033412... | NaN | None | None | None | None | 0.9 |
| 2354 | 666029285002620928 | NaN | NaN | 2015-11-15 23:05:30+00:00 | <a href="http://twitter.com/download/iphone" r... | This is a western brown Mitsubishi terrier. Up... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666029285... | NaN | None | None | None | None | 0.7 |
| 2355 | 666020888022790149 | NaN | NaN | 2015-11-15 22:32:08+00:00 | <a href="http://twitter.com/download/iphone" r... | Here we have a Japanese Irish Setter. Lost eye... | NaN | NaN | NaN | https://twitter.com/dog_rates/status/666020888... | None | None | None | None | None | 0.8 |
2355 rows × 16 columns
Issue: Duplicated jpg_url values
Define: Remove the duplicated rows
Code:
In [94]:
image_predictions_clean = image_predictions_clean[~image_predictions_clean.jpg_url.duplicated()]
Test:
In [95]:
sum(image_predictions_clean.jpg_url.duplicated())
Out[95]:
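The boolean mask used in this cleaning step keeps the first occurrence of each URL. `DataFrame.drop_duplicates` with `subset` expresses the same thing more directly. The sketch below checks the equivalence on a three-row made-up frame (`toy` is hypothetical data, not the real predictions).

```python
import pandas as pd

toy = pd.DataFrame({
    "tweet_id": ["1", "2", "3"],
    "jpg_url": ["u1.jpg", "u2.jpg", "u1.jpg"],  # third row repeats the first URL
})

# Approach used in the notebook: keep rows that are not flagged as duplicates.
by_mask = toy[~toy["jpg_url"].duplicated()]

# Equivalent built-in: keep the first row for each jpg_url.
by_method = toy.drop_duplicates(subset="jpg_url", keep="first")

print(by_mask.equals(by_method))
print(by_method["jpg_url"].duplicated().sum())
```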
Issue: in_reply_to_status_id and in_reply_to_user_id are almost entirely null (only 78 non-null values in the raw archive)
Define: Drop both columns
Code:
In [96]:
twitter_archive_enhanced_clean.drop(columns=['in_reply_to_status_id', 'in_reply_to_user_id'], inplace=True)
Test:
In [97]:
twitter_archive_enhanced_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2355 entries, 0 to 2355
Data columns (total 14 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   tweet_id                    2355 non-null   object
 1   timestamp                   2355 non-null   datetime64[ns, UTC]
 2   source                      2355 non-null   object
 3   text                        2355 non-null   object
 4   retweeted_status_id         181 non-null    float64
 5   retweeted_status_user_id    181 non-null    float64
 6   retweeted_status_timestamp  181 non-null    object
 7   expanded_urls               2297 non-null   object
 8   name                        2300 non-null   object
 9   doggo                       2355 non-null   object
 10  floofer                     2355 non-null   object
 11  pupper                      2355 non-null   object
 12  puppo                       2355 non-null   object
 13  rating                      2355 non-null   float64
dtypes: datetime64[ns, UTC](1), float64(3), object(10)
memory usage: 276.0+ KB
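Issue: source contains HTML markup
Define: Keep only the text inside the anchor tag
Code:
In [98]:
# expand=False returns a Series, which can be assigned straight back to the column.
twitter_archive_enhanced_clean['source'] = twitter_archive_enhanced_clean['source'].str.extract('>(.+)<', expand=False)
Test:
In [99]:
twitter_archive_enhanced_clean['source'].value_counts()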
Out[99]:
Twitter for iPhone 2220
Vine - Make a Scene 91
Twitter Web Client 33
TweetDeck 11
Name: source, dtype: int64
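To see what the extraction does to a single value, the sketch below applies a slightly stricter pattern to the source string shown during assessment. The pattern `>([^<]+)<` is a swapped-in variant, not the notebook's `>(.+)<`; it stops at the first `<`, which would matter only if a value contained nested tags.

```python
import pandas as pd

source = pd.Series([
    '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
])

# [^<]+ stops at the first '<', so only the anchor text is captured.
label = source.str.extract(r">([^<]+)<", expand=False)
print(label.iloc[0])
```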
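Issue: The text column contains URLs
Define: Remove the URLs
Code:
In [100]:
# Assign the result back instead of calling replace(..., inplace=True) on an attribute.
twitter_archive_enhanced_clean['text'] = twitter_archive_enhanced_clean['text'].str.replace(r'https.*', '', regex=True)
Test:
In [101]:
twitter_archive_enhanced_clean.text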
Out[101]:
0 This is Phineas. He's a mystical boy. Only eve...
1 This is Tilly. She's just checking pup on you....
2 This is Archie. He is a rare Norwegian Pouncin...
3 This is Darla. She commenced a snooze mid meal...
4 This is Franklin. He would like you to stop ca...
...
2351 Here we have a 1949 1st generation vulpix. Enj...
2352 This is a purebred Piers Morgan. Loves to Netf...
2353 Here is a very happy pup. Big fan of well-main...
2354 This is a western brown Mitsubishi terrier. Up...
2355 Here we have a Japanese Irish Setter. Lost eye...
Name: text, Length: 2355, dtype: object
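The pattern `https.*` removes everything from the first link to the end of the tweet, which is fine when links come last. If a tweet has words after a link, a per-URL pattern keeps them. The comparison below uses an invented tweet; the pattern `https?://\S+` is an alternative suggested here, not the one used in the notebook.

```python
import pandas as pd

# Invented tweet text with words after the link.
text = pd.Series(["13/10 good dog https://t.co/abc123 would pet again"])

# Notebook pattern: drops the link and everything after it.
greedy = text.str.replace(r"https.*", "", regex=True).str.strip()

# Alternative: drop only the link itself, then normalize whitespace.
per_url = text.str.replace(r"https?://\S+", "", regex=True).str.split().str.join(" ")

print(greedy.iloc[0])
print(per_url.iloc[0])
```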
Issue: The data contains retweets
Define: Remove the retweet rows
Code:
In [102]:
twitter_archive_enhanced_clean = twitter_archive_enhanced_clean[twitter_archive_enhanced_clean.retweeted_status_id.isnull()]
twitter_archive_enhanced_clean = twitter_archive_enhanced_clean.drop(['retweeted_status_id'], axis=1)
Test:
In [103]:
twitter_archive_enhanced_clean.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2174 entries, 0 to 2355
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype
---  ------                      --------------  -----
 0   tweet_id                    2174 non-null   object
 1   timestamp                   2174 non-null   datetime64[ns, UTC]
 2   source                      2174 non-null   object
 3   text                        2174 non-null   object
 4   retweeted_status_user_id    0 non-null      float64
 5   retweeted_status_timestamp  0 non-null      object
 6   expanded_urls               2117 non-null   object
 7   name                        2119 non-null   object
 8   doggo                       2174 non-null   object
 9   floofer                     2174 non-null   object
 10  pupper                      2174 non-null   object
 11  puppo                       2174 non-null   object
 12  rating                      2174 non-null   float64
dtypes: datetime64[ns, UTC](1), float64(2), object(10)
memory usage: 237.8+ KB
Issue: Dog stage is one variable and should be a single column
Define: Extract the stage from the text into one column and drop the four original columns
Code:
In [104]:
twitter_archive_enhanced_clean['stage'] = twitter_archive_enhanced_clean.text.str.findall('(doggo|pupper|puppo|floofer)')
twitter_archive_enhanced_clean['stage'] = twitter_archive_enhanced_clean['stage'].apply(lambda x: ','.join(set(x)))
In [105]:
twitter_archive_enhanced_clean['stage'] = twitter_archive_enhanced_clean['stage'].replace('', np.nan)
In [106]:
twitter_archive_enhanced_clean.drop(columns=['doggo', 'puppo', 'pupper', 'floofer'], inplace=True)
Test:
In [107]:
twitter_archive_enhanced_clean.stage.value_counts()
Out[107]:
pupper 242
doggo 78
puppo 30
pupper,doggo 8
floofer 4
puppo,doggo 2
Name: stage, dtype: int64
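Joining a `set` gives no guaranteed order, so the same pair of stages could appear as "pupper,doggo" in one run and "doggo,pupper" in another. A small variation, sketched here on made-up tweets, sorts the matches first and adds word boundaries so that partial words do not match. This is an illustrative alternative, not the notebook's code.

```python
import numpy as np
import pandas as pd

# Invented tweet texts.
text = pd.Series([
    "This pupper met a doggo today",
    "Just a dog",
    "Such a floofer",
])

# Non-capturing group so findall returns whole matches; \b avoids partial words.
found = text.str.findall(r"\b(?:doggo|floofer|pupper|puppo)\b")

# sorted() makes multi-stage labels come out in a stable order.
stage = found.apply(lambda s: ",".join(sorted(set(s)))).replace("", np.nan)
print(stage.tolist())
```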
Issue: The three datasets share one observational unit and can be merged into a single dataset. Rows without photos can be removed at the same time.
Define: Merge the three datasets and drop rows without photos
Code:
In [108]:
df1_clean = twitter_archive_enhanced_clean.merge(image_predictions_clean, how='inner', on='tweet_id')
In [109]:
df_clean = df1_clean.merge(tweet_json_clean, how='left', on='tweet_id')
Test:
In [110]:
df_clean
Out[110]:
|  | tweet_id | timestamp | source | text | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | name | rating | stage | ... | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | retweet_count | favorite_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | 2017-08-01 16:23:56+00:00 | Twitter for iPhone | This is Phineas. He's a mystical boy. Only eve... | NaN | NaN | https://twitter.com/dog_rates/status/892420643... | Phineas | 1.3 | NaN | ... | 0.097049 | False | bagel | 0.085851 | False | banana | 0.076110 | False | 8842 | 39492 |
| 1 | 892177421306343426 | 2017-08-01 00:17:27+00:00 | Twitter for iPhone | This is Tilly. She's just checking pup on you.... | NaN | NaN | https://twitter.com/dog_rates/status/892177421... | Tilly | 1.3 | NaN | ... | 0.323581 | True | Pekinese | 0.090647 | True | papillon | 0.068957 | True | 6480 | 33786 |
| 2 | 891815181378084864 | 2017-07-31 00:18:03+00:00 | Twitter for iPhone | This is Archie. He is a rare Norwegian Pouncin... | NaN | NaN | https://twitter.com/dog_rates/status/891815181... | Archie | 1.2 | NaN | ... | 0.716012 | True | malamute | 0.078253 | True | kelpie | 0.031379 | True | 4301 | 25445 |
| 3 | 891689557279858688 | 2017-07-30 15:58:51+00:00 | Twitter for iPhone | This is Darla. She commenced a snooze mid meal... | NaN | NaN | https://twitter.com/dog_rates/status/891689557... | Darla | 1.3 | NaN | ... | 0.170278 | False | Labrador_retriever | 0.168086 | True | spatula | 0.040836 | False | 8925 | 42863 |
| 4 | 891327558926688256 | 2017-07-29 16:00:24+00:00 | Twitter for iPhone | This is Franklin. He would like you to stop ca... | NaN | NaN | https://twitter.com/dog_rates/status/891327558... | Franklin | 1.2 | NaN | ... | 0.555712 | True | English_springer | 0.225770 | True | German_short-haired_pointer | 0.175219 | True | 9721 | 41016 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1989 | 666049248165822465 | 2015-11-16 00:24:50+00:00 | Twitter for iPhone | Here we have a 1949 1st generation vulpix. Enj... | NaN | NaN | https://twitter.com/dog_rates/status/666049248... | None | 0.5 | NaN | ... | 0.560311 | True | Rottweiler | 0.243682 | True | Doberman | 0.154629 | True | 41 | 111 |
| 1990 | 666044226329800704 | 2015-11-16 00:04:52+00:00 | Twitter for iPhone | This is a purebred Piers Morgan. Loves to Netf... | NaN | NaN | https://twitter.com/dog_rates/status/666044226... | NaN | 0.6 | NaN | ... | 0.408143 | True | redbone | 0.360687 | True | miniature_pinscher | 0.222752 | True | 147 | 309 |
| 1991 | 666033412701032449 | 2015-11-15 23:21:54+00:00 | Twitter for iPhone | Here is a very happy pup. Big fan of well-main... | NaN | NaN | https://twitter.com/dog_rates/status/666033412... | NaN | 0.9 | NaN | ... | 0.596461 | True | malinois | 0.138584 | True | bloodhound | 0.116197 | True | 47 | 128 |
| 1992 | 666029285002620928 | 2015-11-15 23:05:30+00:00 | Twitter for iPhone | This is a western brown Mitsubishi terrier. Up... | NaN | NaN | https://twitter.com/dog_rates/status/666029285... | NaN | 0.7 | NaN | ... | 0.506826 | True | miniature_pinscher | 0.074192 | True | Rhodesian_ridgeback | 0.072010 | True | 48 | 132 |
| 1993 | 666020888022790149 | 2015-11-15 22:32:08+00:00 | Twitter for iPhone | Here we have a Japanese Irish Setter. Lost eye... | NaN | NaN | https://twitter.com/dog_rates/status/666020888... | None | 0.8 | NaN | ... | 0.465074 | True | collie | 0.156665 | True | Shetland_sheepdog | 0.061428 | True | 530 | 2528 |
1994 rows × 23 columns
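The inner join against the image predictions is what removes tweets without photos, and the left join keeps every remaining tweet even when its counts are missing. The sketch below reproduces that two-step merge on tiny made-up frames and adds two optional pandas `merge` arguments, `validate` and `indicator`, as sanity checks. The frames and their values are hypothetical.

```python
import pandas as pd

archive = pd.DataFrame({"tweet_id": ["1", "2", "3"], "rating": [1.3, 1.2, 1.0]})
images = pd.DataFrame({"tweet_id": ["1", "2"], "p1": ["pug", "collie"]})
counts = pd.DataFrame({"tweet_id": ["1", "3"], "retweet_count": [5, 7]})

# Inner join drops tweets without an image prediction (tweet "3" here).
# validate raises if tweet_id is not unique on either side.
with_images = archive.merge(images, how="inner", on="tweet_id", validate="one_to_one")

# Left join keeps every remaining tweet; indicator marks rows with no counts.
merged = with_images.merge(counts, how="left", on="tweet_id",
                           validate="one_to_one", indicator=True)
print(merged)
```

Rows marked `left_only` in the `_merge` column are tweets for which no retweet or favorite counts were found.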
Save the dataset
In [112]:
# save the file
save_file_name = 'twitter_archive_master.csv'
df_clean.to_csv(save_file_name, encoding='utf-8',index=False)
Analysis and Visualization
In [114]:
# data analysis
data = pd.read_csv('twitter_archive_master.csv', encoding='utf-8')
In [115]:
data.head(10)
Out[115]:
|  | tweet_id | timestamp | source | text | retweeted_status_user_id | retweeted_status_timestamp | expanded_urls | name | rating | stage | ... | p1_conf | p1_dog | p2 | p2_conf | p2_dog | p3 | p3_conf | p3_dog | retweet_count | favorite_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892420643555336193 | 2017-08-01 16:23:56+00:00 | Twitter for iPhone | This is Phineas. He's a mystical boy. Only eve... | NaN | NaN | https://twitter.com/dog_rates/status/892420643... | Phineas | 1.3 | NaN | ... | 0.097049 | False | bagel | 0.085851 | False | banana | 0.076110 | False | 8842 | 39492 |
| 1 | 892177421306343426 | 2017-08-01 00:17:27+00:00 | Twitter for iPhone | This is Tilly. She's just checking pup on you.... | NaN | NaN | https://twitter.com/dog_rates/status/892177421... | Tilly | 1.3 | NaN | ... | 0.323581 | True | Pekinese | 0.090647 | True | papillon | 0.068957 | True | 6480 | 33786 |
| 2 | 891815181378084864 | 2017-07-31 00:18:03+00:00 | Twitter for iPhone | This is Archie. He is a rare Norwegian Pouncin... | NaN | NaN | https://twitter.com/dog_rates/status/891815181... | Archie | 1.2 | NaN | ... | 0.716012 | True | malamute | 0.078253 | True | kelpie | 0.031379 | True | 4301 | 25445 |
| 3 | 891689557279858688 | 2017-07-30 15:58:51+00:00 | Twitter for iPhone | This is Darla. She commenced a snooze mid meal... | NaN | NaN | https://twitter.com/dog_rates/status/891689557... | Darla | 1.3 | NaN | ... | 0.170278 | False | Labrador_retriever | 0.168086 | True | spatula | 0.040836 | False | 8925 | 42863 |
| 4 | 891327558926688256 | 2017-07-29 16:00:24+00:00 | Twitter for iPhone | This is Franklin. He would like you to stop ca... | NaN | NaN | https://twitter.com/dog_rates/status/891327558... | Franklin | 1.2 | NaN | ... | 0.555712 | True | English_springer | 0.225770 | True | German_short-haired_pointer | 0.175219 | True | 9721 | 41016 |
| 5 | 891087950875897856 | 2017-07-29 00:08:17+00:00 | Twitter for iPhone | Here we have a majestic great white breaching ... | NaN | NaN | https://twitter.com/dog_rates/status/891087950... | None | 1.3 | NaN | ... | 0.425595 | True | Irish_terrier | 0.116317 | True | Indian_elephant | 0.076902 | False | 3240 | 20548 |
| 6 | 890971913173991426 | 2017-07-28 16:27:12+00:00 | Twitter for iPhone | Meet Jax. He enjoys ice cream so much he gets ... | NaN | NaN | https://gofundme.com/ydvmve-surgery-for-jax,ht... | Jax | 1.3 | NaN | ... | 0.341703 | True | Border_collie | 0.199287 | True | ice_lolly | 0.193548 | False | 2142 | 12053 |
| 7 | 890729181411237888 | 2017-07-28 00:22:40+00:00 | Twitter for iPhone | When you watch your owner call another dog a g... | NaN | NaN | https://twitter.com/dog_rates/status/890729181... | None | 1.3 | NaN | ... | 0.566142 | True | Eskimo_dog | 0.178406 | True | Pembroke | 0.076507 | True | 19548 | 66596 |
| 8 | 890609185150312448 | 2017-07-27 16:25:51+00:00 | Twitter for iPhone | This is Zoey. She doesn't want to be one of th... | NaN | NaN | https://twitter.com/dog_rates/status/890609185... | Zoey | 1.3 | NaN | ... | 0.487574 | True | Irish_setter | 0.193054 | True | Chesapeake_Bay_retriever | 0.118184 | True | 4403 | 28187 |
| 9 | 890240255349198849 | 2017-07-26 15:59:51+00:00 | Twitter for iPhone | This is Cassie. She is a college pup. Studying... | NaN | NaN | https://twitter.com/dog_rates/status/890240255... | Cassie | 1.4 | doggo | ... | 0.511319 | True | Cardigan | 0.451038 | True | Chihuahua | 0.029248 | True | 7684 | 32467 |
10 rows × 23 columns
In [116]:
data.favorite_count.describe()
Out[116]:
count 1994.000000
mean 8923.133400
std 12400.238808
min 81.000000
25% 1972.250000
50% 4117.000000
75% 11275.500000
max 132318.000000
Name: favorite_count, dtype: float64
In [117]:
data.retweet_count.describe()
Out[117]:
count 1994.000000
mean 2770.021063
std 4715.961325
min 15.000000
25% 622.250000
50% 1348.500000
75% 3202.750000
max 79116.000000
Name: retweet_count, dtype: float64
In [118]:
import matplotlib.pyplot as plt
%matplotlib inline
In [119]:
plt.bar(x=['favorite_count', 'retweet_count'], height=[data.favorite_count.sum(), data.retweet_count.sum()])
plt.title('Number of Favorite count VS Retweet Count')
Out[119]:
Text(0.5, 1.0, 'Number of Favorite count VS Retweet Count')
* First conclusion: the total favorite count is higher than the total retweet count.
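The size of the gap can be read off the two describe() outputs. Both columns have 1994 rows, so the ratio of the means equals the ratio of the totals. The arithmetic below uses only the printed means; the result is roughly 3.2 favorites per retweet.

```python
# Means copied from the describe() outputs; both columns have 1994 rows,
# so the ratio of means equals the ratio of totals.
mean_favorites = 8923.133400
mean_retweets = 2770.021063

ratio = mean_favorites / mean_retweets
print(f"favorites per retweet: {ratio:.2f}")
```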
In [120]:
data[data.p1_conf > 0.5].p1.value_counts()
Out[120]:
golden_retriever 116
Pembroke 70
Labrador_retriever 65
Chihuahua 47
pug 43
...
scorpion 1
Appenzeller 1
flamingo 1
axolotl 1
Irish_water_spaniel 1
Name: p1, Length: 245, dtype: int64
Second conclusion: among predictions with confidence above 0.5, the most common breed is golden_retriever.
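The counts also include non-dog predictions such as scorpion, flamingo, and axolotl, because the filter only checks p1_conf. Adding the p1_dog flag restricts the tally to predictions the classifier labeled as dogs. The sketch below shows the combined filter on a made-up frame with the same column names; the five rows are invented.

```python
import pandas as pd

toy = pd.DataFrame({
    "p1": ["golden_retriever", "scorpion", "golden_retriever", "pug", "flamingo"],
    "p1_conf": [0.9, 0.8, 0.7, 0.6, 0.95],
    "p1_dog": [True, False, True, True, False],
})

# Keep confident predictions that are also flagged as a dog breed.
confident_dogs = toy[(toy["p1_conf"] > 0.5) & toy["p1_dog"]]
breed_counts = confident_dogs["p1"].value_counts()
print(breed_counts)
```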
In [121]:
data['rating'].value_counts()
Out[121]:
1.200000 454
1.000000 421
1.100000 402
1.300000 261
0.900000 151
0.800000 95
0.700000 51
1.400000 35
0.500000 34
0.600000 32
0.300000 19
0.400000 15
0.200000 10
0.100000 4
0.000000 2
177.600000 1
2.600000 1
3.428571 1
0.636364 1
0.818182 1
42.000000 1
7.500000 1
2.700000 1
Name: rating, dtype: int64
Third conclusion: most ratings are above 1.0, meaning the numerator is usually larger than the denominator.
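The share can be checked directly from the value_counts() output. The block below re-enters those printed counts by hand (so it runs without the CSV) and sums the ratings above 1.0; `counts` is a manual copy of the output, not a new computation on the data.

```python
import pandas as pd

# rating -> tweet count, copied from the value_counts() output.
counts = pd.Series({
    1.2: 454, 1.0: 421, 1.1: 402, 1.3: 261, 0.9: 151, 0.8: 95, 0.7: 51,
    1.4: 35, 0.5: 34, 0.6: 32, 0.3: 19, 0.4: 15, 0.2: 10, 0.1: 4, 0.0: 2,
    177.6: 1, 2.6: 1, 3.428571: 1, 0.636364: 1, 0.818182: 1, 42.0: 1,
    7.5: 1, 2.7: 1,
})

total = counts.sum()
above_one = counts[counts.index > 1.0].sum()
print(f"{above_one} of {total} ratings exceed 1.0 ({above_one / total:.1%})")
```

With these counts, 1158 of 1994 ratings are above 1.0, a little under three in five.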