Filtering microblogging messages for Social TV
Paper notes on "Filtering microblogging messages for Social TV" and "A Bootstrapping Approach to Identifying Relevant Tweets for Social TV".
Social TV was named one of the ten most important emerging technologies in 2010 by the MIT Technology Review.
Social Television is a general term for technology that supports communication and social interaction in either the context of watching television, or related to TV content.
Some of these systems allow users to read microblogging messages related to the TV program they are currently watching.
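So the problem discussed here is how to filter out the messages that are actually related to a TV show. The simplest approach, and the one we have been using all along, is the following: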
Current Social TV applications search for these messages by issuing queries to social networks with the full title of the TV program. This naive approach can lead to low precision and recall.
A simple example shows why both precision and recall are low with this approach.
The popular TV show House is an example that results in low precision.
House is an ambiguous word: besides naming the TV show, it is used in many other contexts, such as White House, House of Representatives, building, home, etc. Searching directly for House therefore inevitably gives low precision.
Continuing with our example for the show House, there are many messages which do not mention the title of the show but make references to users, hashtags, or even actors and characters related to the show. The problem of low recall is more severe for shows with long titles.
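As noted above, the recall problem is most pronounced for shows with long titles: few people are willing to write out the full title in a tweet, and abbreviations are common.
To summarize, the challenges in solving this problem are as follows: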
Our task is to retrieve microblogging messages relevant to a given TV show with high precision. Filtering messages from microblogging websites poses several challenges, including:
1. Microblogging messages are short and often lack context. For instance, Twitter messages (tweets) are limited to 140 characters and often contain abbreviated expressions such as hashtags and short URLs.
2. Many social media messages lack proper grammatical structure. Also, users of social networks pay little attention to capitalization and punctuation. This makes it difficult to apply natural language processing technologies to parse the text.
3. Many social media websites offer access to their content through search APIs, but most have rate limits. In order to filter messages we first need to collect them by issuing queries to these services. For each show we require a set of queries which provides the best tradeoff between the need to cover as many messages about the show as possible, and the need to respect the API rate limits imposed by the social network. Such queries could include the title of the show and other related strings such as hashtags and usernames related to the show. Determining which keywords best describe a TV show can be a challenge.
4. In the last decade alone, television networks have aired more than a thousand new TV shows. Obtaining training data for every show would be prohibitively expensive. Furthermore, new shows are aired every six months.
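I had thought about how to solve this problem for a long time. I had considered building a classifier to decide whether a tweet is about a TV show, but never worked out exactly how to do it. This paper proposes a way to build such a classifier.
Classification is a mature technique; the key lies in feature selection and in collecting the training set.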
We propose a bootstrapping method which is built upon 1) a small set of labeled data, 2) a large unlabeled dataset, and 3) some domain knowledge, to form a classifier that can generalize to an arbitrary number of TV shows.
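Since labeling a training set is time-consuming, only a small set of labeled data is needed here. Domain knowledge is used to choose the initial classification features, which is enough to train an initial classifier. A large unlabeled dataset is then used as a test set for the initial classifier; new features are discovered during this process and refined step by step, yielding a usable improved classifier.
That is the general idea of the method, and the tests show that the improved classifier achieves much higher recall.
Personally, I think the value of this paper lies in its choice of features, so let's look at which features are selected.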
Terms related to TV watching
General terms commonly associated with watching TV. These features are collected manually and consist of the following three:
tv_terms, general terms such as watching, episode, hdtv, netflix, etc.
network_terms, contains names of television networks such as cnn, bbc, pbs, etc.
season_episode,
Some users post messages which contain the season and episode number of the TV show they are currently watching.
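“S06E07”, “06x07” and even “6.7” are common ways of referring to the sixth season and the seventh episode of a particular TV show. We therefore use regular expressions to detect whether a message contains a season_episode reference.
Each of the features above is set to 1 when the tweet contains a matching term, and 0 otherwise.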
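The three binary features above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the word lists contain only the examples quoted above, and the regular expression, the tokenizer, and the function names are assumptions.

```python
import re

# Illustrative word lists holding only the examples quoted above;
# the lists used in the paper are longer and are not reproduced here.
TV_TERMS = {"watching", "episode", "hdtv", "netflix"}
NETWORK_TERMS = {"cnn", "bbc", "pbs"}

# Assumed pattern covering the "S06E07", "06x07" and "6.7" styles.
# The "6.7" style also matches ordinary decimals, so it can yield false positives.
SEASON_EPISODE_RE = re.compile(
    r"\b(?:s\d{1,2}e\d{1,2}|\d{1,2}x\d{1,2}|\d{1,2}\.\d{1,2})\b",
    re.IGNORECASE,
)


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tv_watching_features(message):
    """Return the three binary features (1 if present, 0 otherwise)."""
    tokens = set(tokenize(message))
    return {
        "tv_terms": int(bool(tokens & TV_TERMS)),
        "network_terms": int(bool(tokens & NETWORK_TERMS)),
        "season_episode": int(bool(SEASON_EPISODE_RE.search(message))),
    }
```

For example, tv_watching_features("Watching House S06E07 tonight") sets tv_terms and season_episode to 1 and leaves network_terms at 0.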
General Positive Rules
rules_score,
The motivation behind the rules_score feature is the fact that many messages which discuss TV shows follow certain patterns.
For example,
<start> watching <show_name>
episode of <show_name>
<show_name> was awesome
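Given a list of such rules, the feature is set to 1 when the tweet matches one of them, and 0 otherwise.
The question is how to find these rules. They could of course be discovered by hand one at a time, which would be fairly accurate but far too inefficient.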
We developed an automated way to extract such general rules and compute their probability of occurrence.
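We start from a manually compiled list of ten unambiguous TV show titles, such as “Mythbusters”, “The Simpsons”, “Grey’s Anatomy”, etc. Unambiguous means that the title has no other meaning and always refers to one particular TV show, in contrast to an ambiguous title such as House.
We want to extract general rules from TV-related tweets, so we must make sure that the tweets we collect really are about TV. A good way to do that is to collect them using unambiguous show titles, an approach we have also used before.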
For each message which contained one of these titles, the algorithm replaced the title of TV shows, hashtags, references to episodes, etc. with general placeholders, then computed the occurrence of trigrams around the keywords.
This is the key step: because we want general rules, everything specific to a particular show is masked out first, and only then are the trigram occurrences counted.
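A rough sketch of this masking-and-counting step is shown below. The placeholder names, the tokenizer, the three seed titles, and the way the probability is estimated (the fraction of seed messages containing a trigram) are all assumptions; the paper does not spell these details out here.

```python
import re
from collections import Counter

# Three of the unambiguous seed titles quoted above (the paper used ten).
SEED_TITLES = ["mythbusters", "the simpsons", "grey's anatomy"]


def generalize(message, titles=SEED_TITLES):
    """Mask show-specific strings with placeholders and tokenize the message."""
    text = message.lower()
    text = re.sub(r"#\w+", "<hashtag>", text)
    text = re.sub(r"\bs\d{1,2}e\d{1,2}\b", "<episode>", text)
    for title in titles:
        # Naive substring replacement; a real system would respect word boundaries.
        text = text.replace(title, "<show_name>")
    return ["<start>"] + re.findall(r"<\w+>|[a-z0-9']+", text) + ["<end>"]


def trigrams_around_show(tokens):
    """Yield every trigram that contains the <show_name> placeholder."""
    for i in range(len(tokens) - 2):
        trigram = tuple(tokens[i:i + 3])
        if "<show_name>" in trigram:
            yield trigram


def extract_rules(messages, titles=SEED_TITLES):
    """Estimate each rule's probability as the fraction of messages containing it."""
    counts = Counter()
    for message in messages:
        counts.update(set(trigrams_around_show(generalize(message, titles))))
    return {rule: n / len(messages) for rule, n in counts.items()}


def rules_score(message, rules, titles):
    """Binary variant of the feature: 1 if the message matches any known rule."""
    tokens = generalize(message, titles)
    return int(any(t in rules for t in trigrams_around_show(tokens)))
```

Rules learned from the unambiguous seed shows can then be applied to an ambiguous title: with a rule such as ("<start>", "watching", "<show_name>"), the message "watching house alone" scores 1 for the title "house", while "in the house" scores 0.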
Features related to show titles
Although many social media messages lack proper capitalization, when users do capitalize the titles of the shows this can be used as a feature.
title_case, which is set to 1 if the title of the show is capitalized, otherwise it has the value 0.
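titles_match, which is set to 1 if any of the titles mentioned in the message is unambiguous, otherwise it has the value 0.
What is valuable here is that the paper proposes a way to decide whether a title is unambiguous. We previously tried doing this with our own stop-word statistics, but the results were not very good, especially for multi-word titles. The paper suggests using WordNet instead, which is a good idea.
We define unambiguous title to be a title which has zero or one hits when searching for it in WordNet.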
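The two title features might look like the sketch below. To keep it self-contained, the WordNet lookup is passed in as a function, and the sense counts in FAKE_SENSE_COUNTS are made-up stand-ins rather than real WordNet results. With NLTK installed, one could presumably pass something like lambda t: len(wordnet.synsets(t.replace(" ", "_"))) instead; that wiring is an assumption, not something taken from the paper.

```python
import re

# Made-up sense counts standing in for a real WordNet lookup (illustration only).
FAKE_SENSE_COUNTS = {"house": 12, "mythbusters": 0, "the simpsons": 0}


def is_unambiguous(title, sense_count):
    """A title is unambiguous if the lookup returns zero or one hits."""
    return sense_count(title.lower()) <= 1


def title_case(message, title):
    """1 if the title appears with its canonical capitalization, else 0.

    Case-sensitive match, so a sentence-initial "House" also counts.
    """
    pattern = r"\b" + re.escape(title) + r"\b"
    return int(re.search(pattern, message) is not None)


def titles_match(message, titles, sense_count):
    """1 if any title mentioned in the message is unambiguous, else 0."""
    text = message.lower()
    for title in titles:
        pattern = r"\b" + re.escape(title.lower()) + r"\b"
        if re.search(pattern, text) and is_unambiguous(title, sense_count):
            return 1
    return 0
```

Passing the lookup as a parameter keeps the feature logic independent of where the sense counts come from, which also makes it easy to test without WordNet data.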
Features based on domain knowledge crawled from online sources
One of our assumptions is that messages relevant to a show often contain names of actors, characters, or other keywords strongly related to the show.
cosine_characters, cosine_actors, and cosine_wiki, we compute the cosine similarity between a new message and the information we crawled (from TV.com and Wikipedia) about the show for each of the three features.
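This approach can greatly improve recall, but it is fairly troublesome to implement, and because of Twitter's access limits we cannot set many terms per show, so we never adopted it.
That covers the nine initial features. After the initial classifier was run on the test set, the following additional features were found: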
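As a concrete illustration of the three cosine features, here is a minimal bag-of-words sketch. It uses raw term counts because the term weighting is not stated here; the show_info dictionary stands in for text crawled from TV.com and Wikipedia, and its layout and the function names are assumptions.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count simple word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cosine_features(message, show_info):
    """Compute cosine_characters, cosine_actors and cosine_wiki for one message.

    show_info maps "characters", "actors" and "wiki" to crawled text.
    """
    msg = bag_of_words(message)
    return {
        "cosine_" + name: cosine_similarity(msg, bag_of_words(text))
        for name, text in show_info.items()
    }
```

A message that mentions an actor but not the title, such as "hugh laurie is brilliant", then gets a non-zero cosine_actors value for House, which is how these features help recall.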
pos_rules_score and neg_rules_score are natural extensions of the feature rules_score.
For instance, for the show House we can now learn positive rules such as episode of house, as well as negative rules such as in the house or the white house.
users_score and hashtags_score
Using messages labeled by Classifier #1, we can determine commonly occurring hashtags and users which often talk about a particular show. Furthermore, these features can also help us expand the set of queries for each show, thus improving the recall by searching for hashtags and users related to the show, in addition to the title.
We had thought of this before as well but never implemented it; it can improve recall.
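One way to derive such scores is sketched below. It assumes that a hashtag's or user's score is the fraction of messages containing it that Classifier #1 labeled as relevant, and that a message takes the highest score among its tokens; the scoring formula is not given here, so both choices are assumptions, as are the function names.

```python
import re
from collections import Counter

HASHTAG = r"#\w+"
USER = r"@\w+"


def token_scores(labeled_messages, pattern):
    """Score each hashtag or user by the share of its messages labeled relevant.

    labeled_messages is an iterable of (text, is_relevant) pairs, where the
    labels are assumed to come from Classifier #1.
    """
    relevant = Counter()
    total = Counter()
    for text, is_relevant in labeled_messages:
        for token in set(re.findall(pattern, text.lower())):
            total[token] += 1
            if is_relevant:
                relevant[token] += 1
    return {token: relevant[token] / total[token] for token in total}


def feature_score(message, scores, pattern):
    """Return the highest known score among the message's hashtags or users."""
    tokens = re.findall(pattern, message.lower())
    return max((scores.get(t, 0.0) for t in tokens), default=0.0)
```

High-scoring hashtags and users from token_scores are also natural candidates for extra search queries, which is the query-expansion use mentioned above.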
rush_period, this feature is based on the observation that users of social media websites often discuss a show while it is on air. When classifying a new message we check how many mentions of the show there were in the previous window of 10 minutes. If the count exceeds a certain threshold the feature is set to 1, otherwise 0.
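The windowed count can be sketched with a deque of timestamps. The class name and the default threshold are assumptions (no threshold value is given here), and the sketch assumes timestamps in seconds that arrive in non-decreasing order.

```python
from collections import deque


class RushPeriodDetector:
    """Tracks mentions of one show and flags bursts within a sliding window."""

    def __init__(self, window_seconds=600, threshold=5):
        # 600 seconds is the 10-minute window; the threshold is a placeholder value.
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.mentions = deque()

    def add_mention(self, timestamp):
        """Record a mention of the show at the given time (seconds)."""
        self.mentions.append(timestamp)

    def rush_period(self, timestamp):
        """Return 1 if mentions in the preceding window exceed the threshold."""
        cutoff = timestamp - self.window_seconds
        # Drop mentions that have fallen out of the window.
        while self.mentions and self.mentions[0] < cutoff:
            self.mentions.popleft()
        return int(len(self.mentions) > self.threshold)
```

In practice the threshold would probably need to depend on the show, since a popular show gets more background mentions than a niche one.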
Reposted from: https://www.cnblogs.com/fxjwind/archive/2011/08/02/2125283.html