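Python outlier detection: how to detect peak points (outliers) in a pandas DataFrame
I have a pandas DataFrame with a number of speed values. The values change continuously, but because this is sensor data we frequently get erroneous readings in between, and a moving average does not seem to help either. What method can I use to remove these outliers or peak points from the data?
Example: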
data points={0.5,0.5,0.7,0.6,0.5,0.7,0.5,0.4,0.6,4,0.5,0.5,4,5,6,0.4,0.7,0.8,0.9}
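In this data, the values 4, 4, 5 and 6 are clearly outliers. I previously used a rolling mean with a 5-minute window to smooth the values, but I still get spikes like these and want to remove them. Can anyone suggest a technique to get rid of them?
I have a picture that shows the data more clearly:
As you can see here, the data shows some outlier points that have to be removed. Any ideas on possible ways to get rid of them?
Solution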
I really think a z-score via scipy.stats.zscore() is the way to go here. Have a look at the related issue in this post. There they focus on which method to use before removing potential outliers. As I see it, your challenge is a bit simpler: judging by the data provided, it is pretty straightforward to identify potential outliers without having to transform the data. Below is a code snippet that does just that. Just remember that what does and does not look like an outlier depends entirely on your dataset. And after removing some outliers, a point that did not look like an outlier before may suddenly look like one, because the mean and standard deviation change once the extreme values are gone. Have a look:
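    import matplotlib.pyplot as plt
    import pandas as pd
    import numpy as np
    from scipy import stats

    # your data (as a list)
    data = [0.5, 0.5, 0.7, 0.6, 0.5, 0.7, 0.5, 0.4, 0.6, 4,
            0.5, 0.5, 4, 5, 6, 0.4, 0.7, 0.8, 0.9]

    # initial plot
    df1 = pd.DataFrame(data=data)
    df1.columns = ['data']
    df1.plot(style='o')

    # Function to identify and remove outliers
    def outliers(df, level):
        # 1. temporary dataframe (copies the argument rather than the global df1)
        df = df.copy(deep=True)

        # 2. Select a level for a Z-score to identify and remove outliers
        # NOTE: the source listing is cut off right after `stats.zscore(df))`.
        # Everything from `< level` onward is a reconstruction inferred from
        # the comments and the test runs, and should be read as an assumption.
        df_Z = df[(np.abs(stats.zscore(df)) < level).all(axis=1)]
        return df_Z

    # Assumed usage: remove outliers at a chosen level and plot the result
    df_clean = outliers(df1, level=2)
    df_clean.plot(style='o')
    plt.show()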
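For reference, the z-score of a value is its distance from the mean, measured in standard deviations. The sketch below computes it by hand with pandas instead of scipy. It assumes the population standard deviation (ddof=0), which is the default of scipy.stats.zscore; the names `s`, `z` and `spikes` are only for illustration.

```python
import pandas as pd

# Sample data from the question
data = [0.5, 0.5, 0.7, 0.6, 0.5, 0.7, 0.5, 0.4, 0.6, 4,
        0.5, 0.5, 4, 5, 6, 0.4, 0.7, 0.8, 0.9]
s = pd.Series(data, name="data")

# z = (x - mean) / std; ddof=0 matches the default of scipy.stats.zscore
z = (s - s.mean()) / s.std(ddof=0)

# Show the spike values next to their z-scores
spikes = pd.DataFrame({"value": s, "z": z.round(2)})
print(spikes[s >= 4])
```

On this data the value 6 sits roughly 2.6 standard deviations above the mean, while the two 4s are only about 1.45 away, so the chosen level decides which of the spikes get dropped.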
Original data:
Test run 1 : Z-score = 4:
As you can see, no data has been removed because the level was set too high.
Test run 2 : Z-score = 2:
Now we're getting somewhere. Two outliers have been removed, but there is still some dubious data left.
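The first two runs can also be checked numerically rather than by eye. This is a minimal sketch, assuming the population standard deviation (ddof=0, the default in scipy.stats.zscore); `keep_below` is a hypothetical helper and not part of the original answer.

```python
import pandas as pd

# Sample data from the question
data = [0.5, 0.5, 0.7, 0.6, 0.5, 0.7, 0.5, 0.4, 0.6, 4,
        0.5, 0.5, 4, 5, 6, 0.4, 0.7, 0.8, 0.9]
df = pd.DataFrame({"data": data})

def keep_below(frame, level):
    """Return the rows whose absolute z-score is below `level` (hypothetical helper)."""
    z = (frame - frame.mean()) / frame.std(ddof=0)
    return frame[(z.abs() < level).all(axis=1)]

# Report how many rows each level removes and what the largest survivor is
for level in (4, 2):
    kept = keep_below(df, level)
    removed = len(df) - len(kept)
    print(f"level={level}: removed {removed}, "
          f"largest remaining value {kept['data'].max()}")
```

At level 2 only the values 5 and 6 exceed the threshold. The two 4s have a z-score of about 1.45, so they survive, which is the dubious data still visible in the second run.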
Test run 3 : Z-score = 1.2:
This is looking really good. The remaining data now seems to be a bit more evenly distributed than before. But now the highlighted data point is starting to look a bit like a potential outlier itself. So where to stop? That's going to be entirely up to you!
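The effect described here, where cleaning the data makes a previously ordinary point stand out, can be reproduced by recomputing the z-scores on the cleaned data. This is a sketch under the same assumption as scipy.stats.zscore's default (ddof=0); the helper `zscore` and the variable names are illustrative only.

```python
import pandas as pd

# Sample data from the question
data = [0.5, 0.5, 0.7, 0.6, 0.5, 0.7, 0.5, 0.4, 0.6, 4,
        0.5, 0.5, 4, 5, 6, 0.4, 0.7, 0.8, 0.9]
s = pd.Series(data)

def zscore(x):
    """Population z-score (ddof=0) of a Series."""
    return (x - x.mean()) / x.std(ddof=0)

# First pass: keep only points with |z| < 1.2, as in test run 3
z_before = zscore(s)
cleaned = s[z_before.abs() < 1.2]

# Second pass: mean and standard deviation are now much smaller,
# so the z-scores of the surviving points change
z_after = zscore(cleaned)

last = s.index[-1]  # label of the final value, 0.9
print(f"z-score of {s[last]} before cleaning: {z_before[last]:.2f}")
print(f"z-score of {s[last]} after cleaning:  {z_after[last]:.2f}")
```

In this sketch the final value, 0.9, moves from a z-score of about -0.32 to about 2.23 once the four spikes are gone. Whether that is the exact point highlighted in the plot is an assumption, but it shows why repeated filtering keeps finding new candidates.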