Key Challenges with Quasi Experiments at Netflix
Kamer Toker-Yildiz, Colin McFarland, Julia Glick
At Netflix, when we can’t run A/B experiments, we run quasi experiments! We run them with various objectives: non-member experiments focusing on acquisition, member experiments focusing on member engagement, and video streaming experiments focusing on content delivery. Consolidating on one methodology can be a challenge, as we may face different design constraints, data constraints, or optimization goals. Here we discuss some key challenges, and the approaches Netflix has been using to handle small sample sizes and limited pre-intervention data in quasi experiments.
Design and Randomization
We face various business problems where we cannot run individual-level A/B tests but can benefit from quasi experiments. For instance, consider the case where we want to measure the impact of TV or billboard advertising on member engagement. We cannot have identical treatment and control groups at the member level, because we cannot hold back individuals from such forms of advertising. Our solution is to randomize our member base at the smallest possible level. In most countries, for example, TV advertising can only be bought at the TV media market level, which usually means a group of cities in close geographic proximity.
One of the major problems we face in quasi experiments is small sample size, where asymptotic properties may not hold in practice. We typically have a small number of geographic units due to test limitations, and we also use broader or more distant groups of units to minimize geographic spillovers. We are also more likely to face high variation and uneven distributions in treatment and control groups due to heterogeneity across units. For example, let’s say we are interested in measuring the impact of marketing the Lost in Space series on sci-fi viewing in the UK. Suppose London, with its high population, is randomly assigned to the treatment cell, and people in London love sci-fi much more than people in other cities. If we ignore the latter fact, we will overestimate the true impact of marketing, which is now confounded with London’s baseline preference for sci-fi. In summary, the simple randomization and mean comparison we typically use in A/B tests with millions of members may not work well for quasi experiments.
Completely tackling these problems during the design phase may not be possible. We use some statistical approaches during design and analysis to minimize bias and maximize the precision of our estimates. During design, one approach we use is running repeated randomizations, i.e. ‘re-randomization’. In particular, we keep randomizing until we find a randomization that gives us the desired level of balance on key variables across test cells. This approach generally lets us define more similar test groups (i.e. it gets us closer to an apples-to-apples comparison). However, we may still face two issues: 1) we can only balance on a limited number of observed variables simultaneously, and it is very difficult to find geographic units that are identical on all dimensions, and 2) we can still get noisy results with large confidence intervals due to small sample size. We next discuss some of our analysis approaches to further tackle these problems.
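The re-randomization loop described above can be sketched as follows. This is a minimal illustration and not Netflix's implementation: the unit names, covariate values, balance metric (largest standardized gap in group means), threshold, and draw limit are all assumptions chosen for the example.

```python
import random
import statistics

# Hypothetical geographic units with two pre-period covariates (made-up values).
UNITS = [
    {"name": "A", "population": 8.9, "scifi_share": 0.30},
    {"name": "B", "population": 2.7, "scifi_share": 0.18},
    {"name": "C", "population": 1.1, "scifi_share": 0.22},
    {"name": "D", "population": 0.8, "scifi_share": 0.15},
    {"name": "E", "population": 0.6, "scifi_share": 0.25},
    {"name": "F", "population": 0.5, "scifi_share": 0.12},
    {"name": "G", "population": 0.5, "scifi_share": 0.20},
    {"name": "H", "population": 0.4, "scifi_share": 0.17},
]

def standardized_gap(units, treated, key):
    """Absolute treatment-control gap in means for one covariate, in std units."""
    spread = statistics.pstdev([u[key] for u in units])
    t = [u[key] for u in units if u["name"] in treated]
    c = [u[key] for u in units if u["name"] not in treated]
    return abs(statistics.mean(t) - statistics.mean(c)) / spread

def rerandomize(units, keys, n_treated, threshold, max_draws=10_000, seed=0):
    """Draw random assignments until the worst covariate gap is within threshold.

    Returns the best assignment seen and its worst gap, so the caller still
    gets a usable result if the threshold is never reached.
    """
    rng = random.Random(seed)
    names = [u["name"] for u in units]
    best_treated, best_gap = None, float("inf")
    for _ in range(max_draws):
        treated = set(rng.sample(names, n_treated))
        worst_gap = max(standardized_gap(units, treated, k) for k in keys)
        if worst_gap < best_gap:
            best_treated, best_gap = treated, worst_gap
        if worst_gap <= threshold:
            break
    return best_treated, best_gap

treated, gap = rerandomize(UNITS, ["population", "scifi_share"],
                           n_treated=4, threshold=0.25)
print(sorted(treated), round(gap, 3))
```

With one very large unit such as "A", no assignment may reach a tight threshold, which mirrors the first issue noted above: balance is only achievable on a few observed variables, and only approximately.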
Analysis
Going Beyond Simple Comparisons
Difference in differences (diff-in-diff or DID) comparison is a very common approach in quasi experiments. In diff-in-diff, we usually consider two time periods: pre-intervention and post-intervention. We use the pre-intervention period to generate baselines for our metrics, and normalize post-intervention values by the baseline. This normalization is a simple but very powerful way of controlling for inherent differences between treatment and control groups. For example, let’s say our success metric is signups and we are running a quasi experiment in France, with Paris and Lyon in two test cells. We cannot directly compare signups in the two cities, as their populations are very different. Normalizing with respect to pre-intervention signups reduces variation and helps us make comparisons on the same scale. Although the diff-in-diff approach generally works reasonably well, we have observed some cases where it may not be as applicable, as we discuss next.
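As a small worked example of this normalization, the sketch below computes a diff-in-diff estimate in two ways: as a difference of absolute changes, and as a difference of ratios to the baseline. The signup counts are invented for illustration, and the function names are hypothetical, not part of any Netflix tool.

```python
def diff_in_diff(pre_t, post_t, pre_c, post_c):
    """Change in the treated unit minus change in the control unit."""
    return (post_t - pre_t) - (post_c - pre_c)

def relative_diff_in_diff(pre_t, post_t, pre_c, post_c):
    """Same comparison after normalizing each unit by its own baseline."""
    return post_t / pre_t - post_c / pre_c

# Made-up weekly signups: Paris is treated, Lyon is the control.
paris_pre, paris_post = 10_000, 11_500
lyon_pre, lyon_post = 2_000, 2_100

absolute = diff_in_diff(paris_pre, paris_post, lyon_pre, lyon_post)
relative = relative_diff_in_diff(paris_pre, paris_post, lyon_pre, lyon_post)

print(f"absolute lift: {absolute} signups")   # 1500 - 100 = 1400
print(f"relative lift: {relative:.1%}")       # roughly 15% - 5%
```

Because Paris is far larger than Lyon in this example, the relative version is the one that compares the two cities on the same scale; which form is appropriate depends on whether the units would have moved in parallel in levels or in proportions.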
Success Metrics With Historical Observations But Small Sample Size
In our non-member focused tests, we can observe historical acquisition metrics, e.g. signup counts. However, we don’t typically observe any other information about non-members. High variation in outcome metrics combined with small sample size can make it hard to design a well-powered experiment using traditional diff-in-diff-like approaches. To tackle this problem, we try to implement designs involving multiple interventions in each unit over an extended period of time whenever possible (i.e. instead of a typical experiment with a single intervention period). This can help us gather enough evidence to run a well-powered experiment even with a very small sample size (i.e. few geographic units).
In particular, we turn the intervention (e.g. advertising) “on” and “off” repeatedly over time, in different patterns and geographic units, to capture short-term effects. Every time we “toggle” the intervention, it gives us another chance to read the effect of the test. So even if we only have a few geographic units, we can eventually read a reasonably precise estimate of the effect size (although, of course, results may not generalize to other units if we have very few). As our analysis approach, we can use observations from steady-state units to estimate what would otherwise have happened in units that are changing. To estimate the treatment effect, we fit a dynamic linear model (DLM), a type of state space model where the observations are conditionally Gaussian. DLMs are a very flexible category of models, but we only use a narrow subset of possible DLM structures to keep things simple. We currently have a robust internal package embedded in our internal tool, Quasimodo, to cover experiments that have a similar structure. Our model is comparable to Google’s CausalImpact package, but uses a multivariate structure that lets us analyze more than a single point-in-time intervention in a single region.
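For readers unfamiliar with DLMs, the general textbook form is written out below, followed by one simple specialization that fits the toggle design. The specialization is an illustrative assumption and not the structure used inside Quasimodo, which the article does not specify. Here y_t is the vector of observed metrics across units at time t, θ_t is the latent state, F_t and G_t are known matrices, x_{i,t} is 1 when the intervention is “on” in unit i at time t and 0 otherwise, μ_{i,t} is a slowly drifting baseline for unit i, and β is the treatment effect.

```latex
% General dynamic linear model (observation and state equations)
\begin{aligned}
y_t      &= F_t^{\top}\theta_t + v_t,   & v_t &\sim \mathcal{N}(0, V_t) \\
\theta_t &= G_t\,\theta_{t-1} + w_t,    & w_t &\sim \mathcal{N}(0, W_t)
\end{aligned}

% Illustrative specialization (assumed): local level per unit plus a toggle effect
\begin{aligned}
y_{i,t}   &= \mu_{i,t} + \beta\,x_{i,t} + v_{i,t}, & v_{i,t} &\sim \mathcal{N}(0, \sigma_v^2) \\
\mu_{i,t} &= \mu_{i,t-1} + w_{i,t},                & w_{i,t} &\sim \mathcal{N}(0, \sigma_w^2)
\end{aligned}
```

In a multivariate version, correlation between the w_{i,t} terms across units is what would let steady-state units inform the counterfactual baseline of a unit that is being toggled.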
Success Metrics Without Historical Observations
In our member-focused tests, we sometimes face cases where our success metrics have no historical observations. For example, Netflix promotes new shows that have yet to launch on the service, in order to increase member engagement once the show is available. For a new show, we start observing metrics only when the show launches. As a result, our success metrics inherently have no historical observations, which makes it impossible to use similar time-series-based approaches.
In these cases, we use richer member data to measure and control for members’ inherent engagement with, or interest in, the show. We do this by using relevant pre-treatment proxies, e.g. viewing of similar shows, or interest in Netflix originals or similar genres. We have observed that controlling for both geographic and individual-level differences works best in minimizing confounding effects and improving precision. For example, if members in Toronto watch more Netflix originals than members in other cities in Canada, we should control for pre-treatment Netflix originals viewing at both the individual and the city level, to capture within-unit and between-unit variation separately.
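The idea of controlling for a pre-treatment proxy at both levels can be sketched with an ordinary regression on simulated data. Everything below is an assumption made for illustration: the six cities, their "taste" values, the coefficients, and the plain least-squares fit are not Netflix's model. The proxy is split into a city mean (between-unit variation) and each member's deviation from that mean (within-unit variation).

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

rng = random.Random(7)
TRUE_EFFECT = 0.5
# (city-level taste for originals, treated flag); treated cities skew higher.
CITIES = [(5.0, 1), (3.0, 1), (4.0, 1), (2.0, 0), (3.0, 0), (2.5, 0)]

rows, y = [], []
for taste, treated in CITIES:
    prior = [taste + rng.gauss(0, 1) for _ in range(400)]  # individual prior viewing
    city_mean = sum(prior) / len(prior)
    for x in prior:
        # Columns: intercept, treatment, between-city term, within-city term.
        rows.append([1.0, treated, city_mean, x - city_mean])
        y.append(1.0 + TRUE_EFFECT * treated + 0.8 * x + 0.7 * taste
                 + rng.gauss(0, 0.5))

treated_y = [yi for r, yi in zip(rows, y) if r[1] == 1]
control_y = [yi for r, yi in zip(rows, y) if r[1] == 0]
naive = sum(treated_y) / len(treated_y) - sum(control_y) / len(control_y)
adjusted = ols(rows, y)[1]

print(f"naive difference in means: {naive:.2f}")
print(f"adjusted estimate:         {adjusted:.2f} (simulated truth {TRUE_EFFECT})")
```

In this simulation the naive comparison absorbs the treated cities' higher baseline taste, while the adjusted coefficient should land much closer to the simulated effect. With only six cities the between-city adjustment rests on very few data points, which is the small-sample caveat raised earlier.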
This is very similar in nature to covariate adjustment. However, we do more than just run a simple regression with a large set of control variables. At Netflix, we have worked on developing approaches at the intersection of regression covariate adjustment and machine-learning-based propensity score matching, using a wide set of relevant member features. Such combined approaches help us explicitly control for members’ inherent interest in the new show using hundreds of features, while minimizing the linearity assumptions and degrees-of-freedom challenges we may otherwise face. We thus gain significantly both in reducing potential confounding effects and in maximizing precision, so that we more accurately capture the treatment effect we are interested in.
Next Steps
We have excelled in the quasi experimentation space, with many measurement strategies now in play across Netflix for various use cases. However, we are not done yet! We can expand our methodologies to more use cases and continue to improve the measurement. As an example, another exciting area we have yet to explore is combining these approaches for metrics where we can use both time series approaches and a rich set of internal features (e.g. general member engagement metrics). If you’re interested in working on these and other causal inference problems, join our dream team!
Translated from: https://netflixtechblog.com/key-challenges-with-quasi-experiments-at-netflix-89b4f234b852