Calculating Confidence Interval with Bootstrapping
Hi everyone,
In this article, I will attempt to explain how we can find a confidence interval by using the Bootstrap Method. Some knowledge of statistics and Python is needed to follow along.
Before diving into the method, let's review some statistical concepts.
Variance: It measures how far the data points lie from the mean. It is obtained by summing the squared distance between each data point and the mean, then dividing by the number of data points (for the sample variance, by the number of data points minus one).
(Image: Sample variance)
Standard Deviation: It is a measurement that shows how spread out our data points are from the mean. It is obtained by taking the square root of the variance.
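The two definitions above can be checked with a few lines of NumPy. This is only a small sketch: the five heights are made-up values for illustration, and `ddof=1` is NumPy's way of asking for the sample (n - 1) version.

```python
import numpy as np

# Hypothetical heights in centimeters, used only for illustration.
heights = np.array([160.0, 165.0, 170.0, 175.0, 180.0])

mean = heights.mean()
squared_distances = (heights - mean) ** 2

# Population variance divides by n; sample variance divides by n - 1.
population_variance = squared_distances.sum() / len(heights)
sample_variance = squared_distances.sum() / (len(heights) - 1)

# The standard deviation is the square root of the variance.
sample_std = np.sqrt(sample_variance)

# NumPy computes the same quantities; ddof=1 selects the n - 1 divisor.
numpy_population_variance = np.var(heights)
numpy_sample_variance = np.var(heights, ddof=1)
numpy_sample_std = np.std(heights, ddof=1)

print(population_variance, sample_variance, sample_std)
```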
(Image: Sample standard deviation)
Cumulative Distribution Function: It can be used for any kind of variable X (discrete, continuous, etc.). For each value x, it gives the probability that X takes a value less than or equal to x under a given probability distribution.
Empirical Cumulative Distribution Function: Also known as the Empirical Distribution Function. The only difference between the CDF and the ECDF is that the former describes the theoretical distribution of a population, while the latter is built from our observed data.
For example, how can we interpret the ECDF of the data shown in the chart above? We can say that 40% of the heights are less than or equal to 160 cm. Likewise, 99.3% of the people have a height of less than or equal to 180 cm.
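To make this reading of an ECDF concrete, here is a minimal sketch of how one could be computed. The function name `ecdf` and the five heights are assumptions made for illustration; they are not taken from the original notebook.

```python
import numpy as np

def ecdf(data):
    """Return the sorted values x and the cumulative proportions y of a 1-D sample."""
    x = np.sort(np.asarray(data))
    # The i-th smallest value has a share i / n of the data at or below it.
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Hypothetical heights in centimeters.
heights = np.array([171.0, 150.0, 183.0, 158.0, 160.0])
x, y = ecdf(heights)

# Share of heights that are less than or equal to 160 cm.
share_at_most_160 = np.mean(heights <= 160)
print(x, y, share_at_most_160)
```

Plotting `x` against `y` as points gives the ECDF chart; the value of `y` at a given height is the share of observations at or below that height.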
Probability Density Function: It shows the distribution of a continuous variable. The area under the curve over an interval gives the probability of falling in that interval, so the total area under the curve must always equal 1.
Normal Distribution: Also known as the Gaussian Distribution. It is the most important probability distribution in statistics; it is bell-shaped and symmetric.
(Image: Normal (Gaussian) Distribution)
Confidence Interval: It is a range of values that is likely to contain a population parameter, such as the mean. It is estimated from the original sample and is usually computed at the 95% confidence level, although other levels can be used. Consider the figure below, which indicates a 95% confidence interval. The lower and upper limits of the confidence interval are the values at the 2.5th and 97.5th percentiles.
(Image by author)
What is the Bootstrap Method?
The Bootstrap Method is a resampling method that is commonly used in Data Science. It was introduced by Bradley Efron in 1979. It consists of resampling our original sample with replacement (a bootstrap sample) and computing a summary statistic on each bootstrap sample to generate bootstrap replicates.
Confidence Interval of People's Heights
In this article, we are going to work with one of the datasets on Kaggle, the Weight-Height data set. It contains the height (in inches) and weight (in pounds) of 10,000 people, separated by gender.
If you would like to see the whole code, you can find the IPython notebook via this link.
We are going to use the heights of only 500 randomly selected people and compute a 95% confidence interval by using the Bootstrap Method.
Let’s start with importing the libraries that we will need.
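The original import cell is not reproduced here, so the following is only a sketch of what this step might look like. The helper name `load_heights` and the file name `weight-height.csv` are assumptions; the file is expected to have the `Gender`, `Height` and `Weight` columns of the Kaggle data set.

```python
import numpy as np   # used in later steps for resampling and percentiles
import pandas as pd

def load_heights(csv_source):
    """Read the Weight-Height CSV (a path or file-like object) into a DataFrame."""
    return pd.read_csv(csv_source)

# Example call, assuming the CSV was downloaded next to the notebook
# (the file name is an assumption):
# df = load_heights("weight-height.csv")
# print(df.head())
```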
The first five rows of the DataFrame look like the following.
The heights are in inches, so let's convert them to centimeters and store the result in a new column, Height(cm).
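A minimal sketch of the conversion, assuming the height column is named `Height` as in the Kaggle file. The three rows below are made-up stand-ins for the real DataFrame.

```python
import pandas as pd

INCH_TO_CM = 2.54  # one inch is exactly 2.54 centimeters

# Stand-in rows with made-up values; the real DataFrame comes from the CSV.
df = pd.DataFrame({"Height": [60.0, 65.0, 70.0]})

# Multiply the whole column at once and store it under the new name.
df["Height(cm)"] = df["Height"] * INCH_TO_CM
print(df)
```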
As we can see above, the minimum and maximum heights in the data set are 137.8 cm and 200.6 cm respectively.
We can use pandas.DataFrame's sample method to select 500 heights at random. After that, we will print the summary statistics.
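The sampling step could look like the sketch below. Because the real CSV is not available here, the DataFrame is filled with synthetic, normally distributed heights (mean 168 cm and standard deviation 10 cm are arbitrary assumptions), and `random_state=42` is only there to make the draw repeatable.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the 10,000 heights of the real data set.
rng = np.random.default_rng(42)
df = pd.DataFrame({"Height(cm)": rng.normal(168.0, 10.0, size=10_000)})

# Draw 500 rows at random; DataFrame.sample draws without replacement by default.
sample = df.sample(n=500, random_state=42)["Height(cm)"]

# count, mean, std, min, quartiles and max of the sampled heights.
summary = sample.describe()
print(summary)
```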
According to the output, our sample has a minimum height of 145 cm and a maximum height of 198 cm.
Let's see what the ECDF and PDF look like.
(ECDF, Image by author)
The empirical CDF shows that 50% of the people in our sample have a height of 162 cm or less.
What about PDF?
(PDF, Image by author)
The PDF shows that the distribution of heights is very close to the normal distribution. Do not forget that the area under the curve gives us the probability.
Now, take a moment to think. We have only 500 observations in our sample, but there are billions of people in the world whose heights we cannot measure. Therefore, our single sample cannot tell us the population mean exactly. If we took the same measurements on different samples again and again, what would the mean height be?
For instance, assume that we took the same measurements with the same number of people (500) 1,000 times and plotted the ECDF of each repetition on top of the ECDF of the first sample. It would look like the following.
(ECDF, Image by author)
As we can see above, we got different heights each time, but the points clearly spread over a specific range. That range is the confidence interval we want to find.
You may say that it is impossible to repeat the experiment so many times, and you are not wrong. That is exactly why we use the Bootstrap Method. It helps us simulate the same experiment thousands or even billions of times.
How?
In fact, the Bootstrap Method is quite straightforward and easy to understand. First, it generates bootstrap samples by drawing observations at random from our original sample. After that, it applies a summary statistic such as the variance, standard deviation, or mean to each bootstrap sample to get replicates. We will use the mean to generate our bootstrap replicates.
To understand the method, let's apply it to a small sample that contains only 5 heights. We can generate our bootstrap samples as shown below. Do not forget that we can choose any observation more than once (resampling with replacement).
(Resampling, Image by author)
As we can see above, we create 4 bootstrap samples and then calculate their means. We call these means our bootstrap replicates. Instead of the mean, we could choose the variance, standard deviation, median, or any other statistic.
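The small example above can be reproduced with NumPy roughly as follows. The five heights are made up (the values in the original image are not available), and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up sample of 5 heights in centimeters.
heights = np.array([155.0, 162.0, 168.0, 174.0, 181.0])

# Four bootstrap samples: each draws 5 values from the original sample
# with replacement, so a height can appear more than once.
bootstrap_samples = [
    rng.choice(heights, size=len(heights), replace=True) for _ in range(4)
]

# One bootstrap replicate (here the mean) per bootstrap sample.
bootstrap_replicates = np.array([bs.mean() for bs in bootstrap_samples])

for bs, replicate in zip(bootstrap_samples, bootstrap_replicates):
    print(bs, replicate)
```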
Let's come back to our project. As the next step, we generate a bootstrap sample from our original sample and apply the mean to it to get a bootstrap replicate. We repeat this process 15,000 times in a for loop and store the replicates in an array. To do this, we can define a function like the following.
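The original function is not reproduced here, so this is one possible sketch of it. The name `draw_bs_replicates`, its parameters, and the synthetic 500-height sample are all assumptions made for illustration.

```python
import numpy as np

def draw_bs_replicates(data, func, size, rng=None):
    """Return `size` bootstrap replicates of the statistic `func` for `data`."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data)
    replicates = np.empty(size)
    for i in range(size):
        # Bootstrap sample: same length as the data, drawn with replacement.
        bs_sample = rng.choice(data, size=len(data), replace=True)
        # Bootstrap replicate: the summary statistic of that sample.
        replicates[i] = func(bs_sample)
    return replicates

# Synthetic stand-in for the 500 sampled heights (parameters are arbitrary).
rng = np.random.default_rng(1)
sample = rng.normal(168.0, 10.0, size=500)

replicates = draw_bs_replicates(sample, np.mean, 15_000, rng)
print(sample.mean(), replicates.mean())
```

Passing the statistic as `func` keeps the function reusable: `np.median` or `np.std` can be swapped in without changing the loop.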
After we get 15,000 replicates by calling the function, we can compare the mean of the original sample with the mean of the bootstrap replicates.
Their means are very close.
So, what are we going to do to calculate a 95% confidence interval?
After obtaining the bootstrap replicates, the rest is simple. As we know, our lower and upper limits are the values that correspond to the 2.5th and 97.5th percentiles.
(Image by author)
We can find the boundaries with the following simple Python code.
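A sketch of this last step, using `np.percentile` on the bootstrap replicates. The sample is again synthetic (the real heights are not available here), so the printed boundaries will not match the 167.7 and 169.5 reported for the real data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for the 500 sampled heights (parameters are arbitrary).
sample = rng.normal(168.0, 10.0, size=500)

# 15,000 bootstrap replicates of the mean.
replicates = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(15_000)
])

# The 2.5th and 97.5th percentiles bound the middle 95% of the replicates.
lower, upper = np.percentile(replicates, [2.5, 97.5])
print(f"95% confidence interval: [{lower:.1f}, {upper:.1f}]")
```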
Our boundaries are found at 167.7 and 169.5. Therefore, we can say with 95% confidence that, if we ran the same experiment on the whole population, the mean height would be between 167.7 cm and 169.5 cm.
Summary
Let’s summarize what we did. We have randomly selected 500 heights and generated bootstrap samples. We calculated the ‘mean’ from those samples and got bootstrap replicates of means. Ultimately we calculated a 95% confidence interval.
I wish you good luck in your data journey :)
Translated from: https://towardsdatascience.com/calculating-confidence-interval-with-bootstrapping-872c657c058d