Statistics and Ice Cream
Summary
In this article, you will learn a little bit about probability calculations in RStudio. Because R is a statistical language, it comes with many distributions and tests already built in, with functions that can save you a lot of work if you know how to use them.
We will talk about three of them here:
- Probability for Binomial Distributions
- Probability for Poisson Distributions
- Probability for Normal Distributions
Before I start…
Alright, you saw the summary and you are interested in this topic, but you still don't get what ice cream has to do with any of it, right?
Well, I just wanted to work on this article with an ice cream dataset. That's all. Here is how you can create a small sample in RStudio:
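    # The same seed is set before each draw, so both columns come from the
    # same random sequence and customers always equals sales - 50.
    set.seed(12)
    sales <- sample(100:500, size = 12, replace = TRUE)
    set.seed(12)
    customers <- sample(50:450, size = 12, replace = TRUE)
    ice_cream <- data.frame(month = 1:12, sales = sales, customers = customers)

| month| sales| customers|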
|-----:|-----:|---------:|
| 1| 127| 77|
| 2| 427| 377|
| 3| 477| 427|
| 4| 208| 158|
| 5| 167| 117|
| 6| 113| 63|
| 7| 171| 121|
| 8| 357| 307|
| 9| 109| 59|
| 10| 103| 53|
| 11| 257| 207|
| 12| 426| 376|
All set. Let’s go!
Binomial Distributions
Binomial distributions, as the name suggests, describe events with only two possible results: Yes/No, Correct/Wrong, True/False, Success/Failure.
This distribution becomes helpful when you need to know the probability of an event occurring a certain number of times if you try it 'n' times.
Using our ice cream example, imagine our store has 15 preset cups on the menu, but we want to focus on selling the Sundae. If we wanted to know the probability of a person coming in and choosing the Sundae over the other 14 options, that would be a classic statistics problem: the chance is 1/15 (6.67%), right?
But knowing we have more than one customer each day, what would be the probability that 5 customers choose a Sundae out of every 30 sales transactions? Now the problem looks a little more complicated to calculate, but it is not, as long as we use the binomial functions in RStudio.
The binomial calculation is really simple to perform. You can use the function as follows: the first parameter, x, is the number of successes you are measuring; size is the number of times the event happens, that is, the number of tries (not to be confused with the sample size); and prob is the probability of success in a single try.
    dbinom(x = <number of successes>,
           size = <number of events / tries>,
           prob = <probability of success>)
So, summarizing:
Problem 1: What is the probability that 5 out of 30 customers choose the Sundae from the menu?
Method: Binomial test = choose Sundae or NOT Sundae.
Probability of success: choosing 1 out of 15 menu options.
Number of events: 30 customers = 30 sales transactions.
Success test: 5 people choose the Sundae from the menu.
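Problem 1 can be sketched as a single dbinom call. This is only a minimal illustration: the variable names (n_customers, p_sundae, p_exactly_5, p_by_hand) are made up for this example, and the 1/15 probability assumes every customer picks independently and every menu option is equally likely, as in the text.

```r
# Problem 1: probability that exactly 5 of 30 customers choose a Sundae.
# Assumption: each customer picks independently with a 1/15 chance of Sundae.
n_customers <- 30
p_sundae <- 1 / 15

p_exactly_5 <- dbinom(x = 5, size = n_customers, prob = p_sundae)

# The same value from the binomial formula, as a sanity check:
# choose(n, k) * p^k * (1 - p)^(n - k)
p_by_hand <- choose(n_customers, 5) * p_sundae^5 * (1 - p_sundae)^(n_customers - 5)

print(round(c(dbinom = p_exactly_5, by_hand = p_by_hand) * 100, 2))  # in percent
```

Under these assumptions the chance of exactly 5 Sundae buyers in 30 sales comes out at roughly 3.3%.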
And there is more. If we wanted the accumulated probability of 5 or more people choosing a Sundae (5, 6, 7, …, 30), there is a function for that too. We can use pbinom, which is quite similar to dbinom but adds the parameter lower.tail: set it to TRUE when you want the probability of a given number of successes or fewer (q inclusive) and to FALSE when you want more than a given number of successes (q exclusive). That is why "5 or more" is written with q = 4.
    # Information: 5 people or more, 30 sales, prob 1/15
    pbinom(q = 4, size = 30, prob = 1/15, lower.tail = FALSE)
    [1] 0.0464  # about a 4.6% chance
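To see why q = 4 stands for "5 or more", here is a small sketch that reaches the same number in three ways. The variable names are illustrative only; the figures (30 sales, probability 1/15) come from the example above.

```r
n_customers <- 30
p_sundae <- 1 / 15

# 1) Upper tail: P(X > 4), which is the same as P(X >= 5).
p_upper <- pbinom(q = 4, size = n_customers, prob = p_sundae, lower.tail = FALSE)

# 2) Complement of the lower tail: 1 - P(X <= 4).
p_complement <- 1 - pbinom(q = 4, size = n_customers, prob = p_sundae)

# 3) Adding up the individual probabilities of 5, 6, ..., 30 successes.
p_summed <- sum(dbinom(x = 5:n_customers, size = n_customers, prob = p_sundae))

print(round(c(upper_tail = p_upper, complement = p_complement, summed = p_summed), 4))
```

All three approaches agree, which is a handy check whenever you are unsure whether lower.tail includes or excludes q.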
Side note: I know you're probably thinking now, "But a customer's choice of product is much more complex than a simple statistical test". And indeed, it is. It involves pricing, promotion, value, the store and many other things. But the idea in this article is just to show you how to perform the tests and keep them as a new tool for your analysis.
Poisson Distributions
The Poisson Distribution (named after Siméon Denis Poisson) describes the number of events that happen in a fixed period of time.
You use the Poisson distribution when you want to know the chance of something happening 'n' times during a period of time.
To use it, you can call dpois in RStudio. However, you will need the following information to proceed:
    dpois(x = <number to test>,
          lambda = <average rate at which the event occurs>)
Once again, bringing it to our sweet ice cream example: in the dataset presented at the beginning of this article, we see the columns month, sales and customers. So we know our time period is one month. And if we run a summary on the sales column, we get the average sales per month, correct?
    summary(ice_cream$sales)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      103.0   123.5   189.5   245.2   374.2   477.0
Our lambda is, therefore, 245.2 sales per month. Now we just need to know what we want to test.
I want to increase my average sales by 5%. How likely is that to happen just by chance?
Problem 2: Increase the average sales per month by 5%, to approx. 257?
Method: Poisson test = 12 more sales per month.
Current average: 245.2 (lambda).
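Problem 2 can be sketched with dpois. This is a minimal illustration, not the only way to frame the question: the variable names are made up, lambda is the monthly mean reported by summary() above, and the calculation answers "how likely is a month with exactly 257 sales?".

```r
lambda <- 245.2                   # average sales per month, from summary()
target <- round(lambda * 1.05)    # a 5% increase, rounded to whole sales

# Probability of a month with exactly `target` sales if sales follow a
# Poisson distribution with mean lambda (an assumption of this example).
p_exact <- dpois(x = target, lambda = lambda)

print(target)
print(round(p_exact * 100, 2))    # in percent
```

The result is a probability of just under 2%, which is why leaving the increase to chance is not much of a plan.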
Yeah. I'd better start working more on marketing actions, right? Because if I leave it to chance, I will have to rely on a tiny probability, just under 2%, that customers will start to appear in my store and buy exactly that much more.
Like the other distributions, Poisson also has a ppois function that calculates the accumulated probability. The only differences are the function name, which starts with the letter p, and the addition of the lower.tail parameter.
Now I will calculate the cumulative chance of increasing my sales by anywhere between 1% and 5%.
    # Accumulated probability of a 5% increase or less, minus the accumulated
    # probability of a 1% increase or less. This leaves only the interval
    # between 1% and 5%, nothing above or below it.
    ppois(257, lambda = 245.2) - ppois(247, lambda = 245.2)
    [1] 0.2226  # about a 22% chance!
Remember, this is a sum of chances. The accumulated probability at 257 adds up the chance of every sales count up to and including 257, so subtracting the accumulated probability at 247 leaves only the counts from 248 to 257. For that reason, you must be really careful when plotting and reading a graphic like the one below. It shows 56% at the point where sales are 1% higher. Come on! What does that mean?
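The subtraction above can be checked by adding up the individual probabilities instead. This is a small sketch with illustrative variable names; it assumes the same lambda of 245.2 and the same 247 and 257 cut-offs used in the text.

```r
lambda <- 245.2

# Difference of two accumulated probabilities: P(X <= 257) - P(X <= 247).
p_by_difference <- ppois(257, lambda = lambda) - ppois(247, lambda = lambda)

# The same interval written out: P(X = 248) + P(X = 249) + ... + P(X = 257).
p_by_sum <- sum(dpois(248:257, lambda = lambda))

print(round(c(difference = p_by_difference, summed = p_by_sum), 4))
```

Both routes give the same value, which makes it clear that the interval covers the whole-number sales counts 248 through 257.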
Where the picture shows 247 (an increase of approx. 1%), we are actually calculating the accumulated probability of every sales count from 0 up to 247, including all the months that end below the average of 245.2. Thus, looking at the first bar in the graphic, it is not correct to say that you can sit back all day and there is a 56% chance that your sales will go up by 1%.
What is true is that there is a 56% chance that your sales will be 247 or fewer in a given month if you keep doing what you do. Similarly, there is a 58% chance that they will be 248 or fewer, and so on.
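The numbers behind a chart like this one can be sketched with ppois over a range of sales counts. The data frame and column names below are hypothetical, chosen for this illustration; lambda is the monthly average from the article.

```r
lambda <- 245.2
sales <- 247:257   # roughly a 1% to 5% increase over the average

cumulative <- data.frame(
  sales = sales,
  at_most = ppois(sales, lambda = lambda),                        # P(X <= sales)
  more_than = ppois(sales, lambda = lambda, lower.tail = FALSE)   # P(X > sales)
)

print(round(cumulative, 3))
```

Reading at_most and more_than side by side helps when presenting the chart: the first column is the chance of ending the month at or below each count, while the second is the chance of beating it.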
Therefore, be careful when interpreting this graphic!
Accumulated Poisson distribution for increasing the ice cream sales by chance.
Normal Distribution
Finally, the Normal Distribution is the most common kind out there. Many statistical concepts and theories are based on this distribution.
The Normal Distribution is the famous 'bell-shaped curve', where the data is distributed symmetrically around the average. If you plot the values on a graphic, the mean will be the center of the curve.
Knowing those qualities of the curve enables us to make many assumptions about data that are normally distributed and to calculate probabilities for a lot of things in our daily life. Extracting a sample from a population and using the statistics from that sample to understand the whole is one of the amazing advantages of the normal distribution.
I believe the best analogy I know for that is food. When you are exploring a new food or flavor, you usually don't go for a large bite. First you take a small piece and try it to learn the flavor. That is because you assume the whole will taste the same as that little piece. The same is true for Normal Distributions, and you can learn more about it by reading about the Central Limit Theorem.
Furthermore, the area under the bell-shaped curve contains 100% of the values. As a normal distribution is centered on its mean, it is correct to say that 50% of the values will be higher and 50% lower than the average, and that most of the values are concentrated around the average. The typical distance of the values from the center is called the standard deviation. If you split the curve into 6 bands, each one standard deviation wide (3 below the average and 3 above it), every band you add, moving out from the center, brings more of the values into your range.
Now it becomes easier to divide the sample into ranges of probability. This is known as the 68–95–99.7 rule. Bear with me: looking at the normal curve below, it is easy to see that my average is the center, that the values stretch about 3 standard deviations away from the center on each side, and that the area under the curve covers 100% of the values from my sample. So if I take one standard deviation above and one below the average, I will have around 68% of the values of a given attribute. If I take two standard deviations, I will have about 95% of the values. And three gets me about 99.7% of the values. This is the idea behind the confidence intervals you must have heard about many times, especially during election season. Learn more in this great video from Simple Learning Pro.
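The 68–95–99.7 figures can be reproduced with pnorm, the accumulated-probability function for the normal distribution. This is a minimal sketch on the standard normal curve (mean 0, standard deviation 1); the variable names are illustrative.

```r
k <- 1:3   # number of standard deviations on each side of the mean

# Area between -k and +k standard deviations: P(-k <= Z <= k).
coverage <- pnorm(k) - pnorm(-k)
names(coverage) <- paste0(k, " sd")

print(round(coverage * 100, 1))   # in percent
```

Because every normal curve has the same shape once it is measured in standard deviations, the same percentages apply to any mean and standard deviation.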
How much you can explain using 1, 2 or 3 standard deviations: the 68–95–99.7 rule.
Moving on to the last ice cream example, let's take the month of March for our test. Here is the distribution of the 427 sales.
Normal distribution of the sales. Mean = 14, standard deviation = 1.
427 sales in 30 days gives us approximately 14 sales per day. The standard deviation is 1 (e.g. a typical day could have had 13 or 15 sales instead).
We want to know how probable it is that we have 16 sales in a day. I am multiplying by 100 so we see the result as a percentage right away. We see about a 5.4% chance of 16 sales in a day. That drops to only 0.44% for 17 sales and to 0.01% for 18. Strictly speaking, dnorm returns the height of the curve at that point (the density), which we read here as an approximate probability for that whole number of sales.
    # 16 sales in a day
    dnorm(16, mean = 14, sd = 1) * 100
    [1] 5.399097
    # 17 sales in a day
    dnorm(17, mean = 14, sd = 1) * 100
    [1] 0.4431848
    # 18 sales in a day
    dnorm(18, mean = 14, sd = 1) * 100
    [1] 0.01338302
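As with the other distributions, there is also an accumulated version, pnorm. The sketch below is one possible way to turn the curve into probabilities for whole sales counts; treating "16 sales" as the band from 15.5 to 16.5 is an assumption of this example, not something taken from the article, and the variable names are illustrative.

```r
mean_sales <- 14
sd_sales <- 1

# Height of the curve at 16, as used above.
density_16 <- dnorm(16, mean = mean_sales, sd = sd_sales)

# Area under the curve between 15.5 and 16.5, read as "about 16 sales".
area_16 <- pnorm(16.5, mean = mean_sales, sd = sd_sales) -
  pnorm(15.5, mean = mean_sales, sd = sd_sales)

# Area to the right of 15.5, read as "16 sales or more".
p_16_or_more <- pnorm(15.5, mean = mean_sales, sd = sd_sales, lower.tail = FALSE)

print(round(c(density = density_16, area = area_16, or_more = p_16_or_more) * 100, 2))
```

The density and the area land close to each other here (both between 5% and 6.1%), while the chance of 16 sales or more is a little higher, since it also includes the rarer days with 17, 18 or more sales.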
Conclusion
Statistical tests are very useful for business and data science if we know how to apply them.
We must be cautious when showing numbers and probabilities to decision makers, since they can easily be misinterpreted. Be sure to always include a detailed explanation of each probability and graphic.
It is easy to make a mistake and put the blame on 'bad' statistics. But the problem is not with the numbers; the problem is with the people interpreting those numbers.
if data:data.science()
Gus
Translated from: https://medium.com/gustavorsantos/statistics-and-ice-cream-4004cd86d57b