What to Look for in Data
Some activities are instinctive. A baby doesn’t need to be taught how to suckle. Most people can use an escalator, operate an elevator, and open a door instinctively. The same isn’t true of playing a guitar, driving a car, or analyzing data. Once you get comfortable with what to look for in a data set, you’ll find data analysis can be as much fun as playing a guitar or driving a car.
Objective
When faced with new data, the first thing to consider is the objective you, your boss, or your client have in analyzing the dataset. Consider these four possibilities: three are comparatively easy and one is a relative challenge.
Conduct a Specific Analysis — Your client only wants you to conduct a specific analysis, perhaps like descriptive statistics or a statistical test between two groups. No problem, just conduct the analysis. There’s no need to go further. That’s easy.
Answer a Specific Question — Some clients only want one thing — answer a specific question. Maybe it’s something like “is my water safe to drink” or “is traffic on my street worse on Wednesdays.” This will require more thought and perhaps some experience, but again, you have a specific direction to go in. That makes it easier.
Address a General Need — Projects with general goals often involve model building. You’ll have to establish whether they need a single forecast, map or model, or a tool that can be used again in the future. This will require quite a bit of thought and experience but at least you know what you need to do and where you need to end up. Not easy but straightforward.
Explore the Unknown — Every once in a while, a client will have nothing specific in mind, but will want to know whatever can be determined from the dataset. This is a challenge because there’s no guidance for where to start or where to finish. This blog will help you address this objective.
If your client is not clear about their objective, start at the very end. Ask what decisions will need to be made based on the results of your analysis. Ask what kind of outputs would be appropriate — a report, an infographic, a spreadsheet file, a presentation, or an application. If they have no expectations, it’s time to explore.
Got data?
Scrubbing your data will make you familiar with what you have. That’s why it’s a good idea to know your objective first. There are many things you can do to scrub your data but the first thing is to put it into a matrix. Statistical analyses all begin with matrices. The form of the matrix isn’t always the same, but most commonly, the matrix has columns that represent variables (e.g., metrics, measurements) and rows that represent observations (e.g., individuals, students, patients, sample units, or dates). Data on the variables for each observation go into the cells. Usually, this is done with spreadsheet software.
Data scrubbing can be cursory or exhaustive. Assuming the data are already available in electronic form, you’ll still have to achieve two goals — getting the numbers right and getting the right numbers.
Getting the numbers right requires correcting at least three types of data errors:
Alphanumeric substitution (mixing letters and numbers, e.g., 0 and o or O, 1 and l, 5 and S, 6 and b), along with dropped or added digits, spelling mistakes in text fields that will be sorted or filtered, and random errors.
Specification errors involve bad data generation, perhaps attributable to recording mistakes, uncalibrated equipment, lab mistakes, or incorrect sample IDs and aliases.
Inappropriate Data Formats, such as extra columns and rows, inconsistent use of ND, NA, or NR flags, and the inappropriate presence of 0s versus blanks.
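To make the first and third error types concrete, here is a minimal sketch of a cell-level scrubber. It is only an illustration under stated assumptions: the flag spellings in FLAG_ALIASES, the look-alike letter table, and the function name scrub_cell are all hypothetical and would need to match whatever your own dataset actually contains.

```python
import re

# Hypothetical flag spellings, mapped to one canonical form each.
FLAG_ALIASES = {"na": "NA", "n/a": "NA", "nr": "NR", "nd": "ND", "n.d.": "ND"}
# Letters commonly typed in place of digits (the pairs named in the text).
LOOKALIKES = str.maketrans({"o": "0", "O": "0", "l": "1", "S": "5", "b": "6"})
NUMBER = re.compile(r"[+-]?\d+(\.\d+)?")


def scrub_cell(raw):
    """Return a float, a canonical flag string, or None for a blank cell."""
    text = raw.strip()
    if not text:
        return None                      # a blank stays blank; it is not a 0
    if text.lower() in FLAG_ALIASES:
        return FLAG_ALIASES[text.lower()]
    # Only repair look-alike letters when the cell already contains a digit,
    # so an ordinary word is never silently turned into a number.
    if any(ch.isdigit() for ch in text):
        fixed = text.translate(LOOKALIKES)
        if NUMBER.fullmatch(fixed):
            return float(fixed)
    raise ValueError(f"cannot interpret {raw!r}; check it by hand")


column = ["12.5", "1O.2", " NA ", "", "l4", "0", "nd"]
cleaned = [scrub_cell(cell) for cell in column]
print(cleaned)
```

Anything the function cannot interpret raises an error instead of being guessed at, because specification errors such as a wrong sample ID cannot be repaired by pattern matching.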
Getting the right numbers requires addressing a variety of data issues:
Variables and phenomena. Are the variables sufficient to explore the phenomena in question?
Variable scales. Review the measurement scales of the variables so you know what analyses might be applicable to the data. Also, look for nominal and ordinal scale variables to consider how you might segment the data.
Representative sample. Considering the population being explored, does the sample appear to be representative?
Replicates. If there are replicate or other quality control samples, they should be removed from the analysis appropriately.
Censored data. If you have censored data (i.e., unquantified data above or below some limit), you can recode the data as some fraction of the limit, but not zero.
Missing data. If you have missing data, recode them as blanks or use another accepted procedure for treating missing data.
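The censored-data and missing-data items can be sketched in a few lines. This is an assumption-laden illustration: the one-half fraction, the sample values, and the function name recode are hypothetical choices, and only below-limit ("<") values are handled because a fraction of the limit makes little sense for above-limit values.

```python
import statistics


def recode(raw, fraction=0.5):
    """Turn one text cell into a number, or None if the cell is blank.

    '<limit' is recoded as fraction * limit (never zero); the default of
    one half is an illustrative assumption, not a recommendation.
    """
    text = raw.strip()
    if not text:
        return None                     # missing data stay blank
    if text.startswith("<"):
        return fraction * float(text[1:])
    return float(text)                  # '>limit' cells raise ValueError here


lab_results = ["0.8", "<0.5", "", "1.2"]      # made-up example values
recoded = [recode(cell) for cell in lab_results]
observed = [v for v in recoded if v is not None]
print(recoded, statistics.mean(observed))
```

Keeping blanks as None rather than 0 means they drop out of the mean instead of pulling it toward zero.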
Data scrubbing can consume a substantial amount of time, even more than the statistical calculations.
What To Look For
If you’re new to applied statistics, you might wonder where to start looking at a dataset. Here are five places to consider looking.
- Snapshot
- Population or Sample Characteristics
- Change
- Trends and Patterns
- Anomalies
Start with the entire dataset. Don’t divide the data into groups based on categorical variables. Divide and aggregate groupings later, after you have a feel for the global situation. The reason is that the number of possible combinations of variables and levels of grouping variables can be overwhelmingly large, each one being an analysis in itself. Like peeling an onion, explore one layer of data at a time until you get to the core.
Snapshot
What does the data look like at one point? Usually it’s at the same point in time, but it could also be some common condition, like after a specific business activity, or at a certain temperature and pressure.
Snapshots aren’t difficult to analyze. You just decide where you want a snapshot and record all the variable values at that point. There are no descriptive statistics, graphs, or tests unless you decide to subdivide the data later. The only challenge is deciding whether taking a snapshot makes any sense for exploring the data.
The only thing you look for in a snapshot is something unexpected or unusual that might direct further analysis. It can also be used as a baseline to evaluate change.
Population Characteristics
It’s always a good idea to know everything you can about the populations you are exploring. The approach is straightforward: calculate descriptive statistics. Here’s a summary of what you might look at. It’s based on the measurement scale of the variable you are assessing.
For grouping (nominal scale) variables, look at the frequencies of the groups. You’ll want to know if there are enough observations in each group to break them out for further analysis. For progression (continuous) scales, look at the median and the mean. If they’re close, the frequency distribution is probably symmetrical. You can confirm this by looking at a histogram or the skewness. If the standard-deviation-divided-by-the-mean (called the coefficient of variation) is over 1, the distribution may be lognormal, or at least, asymmetrical. Quartiles and deciles will support this finding. Look at the measures of central tendency and dispersion. If the dispersion is relatively large, statistical testing may be problematical.
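The checks described for continuous variables can be sketched with Python's standard library. The function name describe and the sample values are hypothetical, and the skewness used here is the simple moment-based version, which is one of several definitions in use.

```python
import statistics


def describe(values):
    """Summarize one continuous variable (assumes positive, non-constant data)."""
    n = len(values)
    mean = statistics.mean(values)
    median = statistics.median(values)
    sd = statistics.stdev(values)       # sample standard deviation
    # Moment-based skewness: near 0 when symmetric, > 0 with a long right tail.
    skew = sum((x - mean) ** 3 for x in values) / n / statistics.pstdev(values) ** 3
    return {
        "n": n,
        "mean": mean,
        "median": median,
        "sd": sd,
        "cv": sd / mean,                # coefficient of variation
        "skewness": skew,
        "quartiles": statistics.quantiles(values, n=4),
    }


sample = [1, 2, 2, 3, 3, 4, 5, 8, 13, 40]   # made-up, right-skewed values
summary = describe(sample)
for name, value in summary.items():
    print(name, value)
```

For this sample the mean sits well above the median and the coefficient of variation exceeds 1, which is the asymmetry signal the paragraph describes. A histogram of the same values would be the natural visual check.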
Graphs are also a good way, and in my mind the best way, to explore population characteristics. Never calculate a statistic without looking at its visual representation in a graph. There are many types of graphs that will let you do that.
What you look for in a graph depends on what the graph is supposed to show — distribution, mixtures, properties, or relationships. There are other things you might look for but here are a few things to start with.
For distribution graphs (box plots, histograms, dot plots, stem-leaf diagrams, Q-Q plots, rose diagrams, and probability plots), look for symmetry. That will separate many theoretical distributions, say a normal distribution (symmetrical) from a lognormal distribution (asymmetrical). This will be useful information if you do any statistical testing later.
For mixture graphs (pie charts, rose diagrams, and ternary plots), look for imbalance. If you have some segments that are very large and others very small, there may be common and unique themes to the mix to explore. Maybe the unique segments can be combined. This will be useful information if you break out subgroups later.
For properties graphs (bar charts, area charts, line charts, candlestick charts, control charts, means plots, deviation plots, spread plots, matrix plots, maps, block diagrams, and rose diagrams), look for the unexpected. Are the central tendency and dispersion what you might expect? Where are big deviations?
For relationship graphs (icon plots, 2D scatter plots, contour plots, bubble plots, 3D scatter plots, surface plots, and multivariable plots), look for trends and patterns. You might find linear or curvilinear trends, repeating cycles, one-time shifts, continuing steps, periodic shocks, or just random points. This is the prelude for looking for more detailed patterns.
Change
Change usually refers to differences between time periods but, like snapshots, it could also refer to some common conditions. Change can be difficult, or at least complicated, to analyze because you must first calculate the changes you want to explore. When calculating changes, be sure the intervals of the change are consistent. But after that, here’s what you might do.
First, look for very large negative or positive changes. Are the percentages of change consistent for all variables? What might be some reasons for the changes?
Calculate the mean and median changes. If the indicators of central tendency for the changes are not near zero, you might have a trend. Verify the possibility by plotting the change data. You might even consider conducting a statistical test to confirm that the change is different from zero. If you do think you have a pattern, trend, or anomaly, graphs are always the best place to look.
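As a rough sketch of the steps above, the following computes period-to-period changes, their mean and median, and a one-sample t statistic for the hypothesis that the mean change is zero. The series is made up, the observations are assumed to be equally spaced, and the helper names are hypothetical. The t statistic still has to be compared with a t table using n - 1 degrees of freedom.

```python
import math
import statistics


def changes(series):
    """Differences between consecutive, equally spaced observations."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def t_statistic(diffs):
    """One-sample t statistic for H0: the mean change is zero."""
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / standard_error


monthly = [100, 103, 105, 109, 110, 114, 117]   # made-up monthly values
diffs = changes(monthly)
print("changes:", diffs)
print("mean change:", statistics.mean(diffs))
print("median change:", statistics.median(diffs))
print("t statistic:", t_statistic(diffs), "with", len(diffs) - 1, "degrees of freedom")
```

Here both the mean and the median change are clearly positive, which is the hint of a trend that a plot of the changes should then confirm.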
Trends and Patterns
There are at least ten types of data relationships — direct, feedback, common, mediated, stimulated, suppressed, inverse, threshold, and complex — and of course spurious relationships. They can all produce different patterns and trends, or no recognizable arrangement at all.
There are four patterns to look for:
- Shocks
- Steps
- Shifts
- Cycles
Shocks are seemingly random excursions far from the main body of data. They are outliers but they often reoccur, sometimes in a similar way suggesting a common, though sporadic cause. Some shocks may be attributed to an intermittent malfunction in the measurement instrument. Sometimes they occur in pairs, one in the positive direction and another of similar size in the negative direction. This is often seen when reporting dates for business data are missed.
Steps are periodic increases or decreases in the body of the data. Steps progress in the same direction because they reflect a progressive change in conditions. If the steps are small enough, they can appear to be, and be analyzed as, a linear trend.
Shifts are increases and/or decreases in the body of the data like steps, but shifts tend to be longer than steps and don’t necessarily progress in the same direction. Shifts reflect occasional changes in conditions. The changes may remain or revert to the previous conditions, making them more difficult to analyze with linear models.
Cycles are increases and decreases in the body of the data that usually appear as a waveform having fairly consistent amplitudes and frequencies. Cycles reflect periodic changes in conditions, often associated with time, such as daily or seasonal cycles. Cycles cannot be analyzed effectively with linear models. Sometimes different cycles add together making them more difficult to recognize and analyze.
Trends are often easy to identify because they are more familiar to most data analysts. Again, graphs are the best place to look for trends.
Linear trends are easy to see; the data form a line. Curvilinear trends can be more difficult to recognize because they don’t necessarily follow a set path. With some experience and intuition, however, they can be identified. Nonlinear trends look similar to curvilinear trends but they require more complicated nonlinear models to analyze. Curvilinear trends can be analyzed with linear models with the use of transformations.
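The last point, analyzing a curvilinear trend with a linear model after a transformation, can be illustrated with a small sketch. The data are synthetic and exactly exponential, and the choice of a log transformation is an assumption that suits this particular curve; other curves call for other transformations.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


x = [0, 1, 2, 3, 4, 5]
y = [2.0 * math.exp(0.5 * xi) for xi in x]      # synthetic curve: y = 2 * e^(0.5x)

# Taking logs turns the curve into a straight line: ln(y) = ln(2) + 0.5 * x.
slope, intercept = fit_line(x, [math.log(yi) for yi in y])
print("growth rate:", slope)
print("starting value:", math.exp(intercept))
```

The fitted slope and back-transformed intercept recover the growth rate and starting value used to generate the data. Real data would scatter around the line, and the residuals should be checked on the transformed scale.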
There are also more complex trends involving different dimensions, including:
- Temporal
- Spatial
- Categorical
- Hidden
- Multivariate
Temporal Trends can be more difficult to identify because time-series data can be combinations of shocks, steps, shifts, cycles, and linear and curvilinear trends. The effects may be seasonal, superimposed on each other within a given time period, or spread over many different time periods. Confounded effects are often impossible to separate, especially if the data record is short or the sampled intervals are irregular or too large.
Spatial Trends present a different twist. Time is one-dimensional (at least as we now know it); distance can be one-, two-, or three-dimensional. Distance can be in a straight line (“as the crow flies”) or along a path (such as driving distance). Defining the location of a unique point on a two-dimensional surface (i.e., a plane) requires at least two variables. The variables can represent coordinates (northing/easting, latitude/longitude) or distance and direction from a fixed starting point. At least three variables are needed to define a unique point location in a three-dimensional volume, so a variable for depth (or height) must be added to the location coordinates. Looking for spatial patterns involves interpolation of geographic data using one of several available algorithms, like moving averages, inverse distances, or geostatistics.
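Of the interpolation algorithms mentioned, inverse distance weighting is the simplest to sketch. The version below works on a two-dimensional plane with straight-line distances; the sample locations, the power of 2, and the function name idw are illustrative assumptions.

```python
import math


def idw(points, target, power=2):
    """Inverse-distance-weighted estimate at target = (x, y).

    points is a list of (x, y, value) tuples; closer points get larger weights.
    """
    weighted_sum = weight_total = 0.0
    for x, y, value in points:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0:
            return value                # exactly on a sampled location
        weight = 1 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Made-up measurements at the four corners of a square.
samples = [(0, 0, 10.0), (2, 0, 20.0), (0, 2, 30.0), (2, 2, 40.0)]
print("center:", idw(samples, (1, 1)))
print("near the first corner:", idw(samples, (0.1, 0.1)))
```

At the center of the square all four samples are equally distant, so the estimate is their plain average. Closer to a corner, that corner's value dominates.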
Categorical Trends are no more difficult to identify than any other trend, except that you have to break out categories to do it, which can be a lot of work. One thing you might see when analyzing categories is Simpson’s paradox. The paradox occurs when the trends that appear within categories are different from the trend in the overall group.
Hidden Trends are trends that appear only in categories and not in the overall group. You may be able to detect linear trends in categories without graphs if you have enough data in the categories to calculate correlation coefficients within each.
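A tiny made-up dataset shows how correlation coefficients within categories can disagree with the overall coefficient. The group labels and values are invented for illustration, and the Pearson coefficient is computed by hand to keep the sketch self-contained.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Two hypothetical categories, each with a falling trend.
groups = {
    "A": ([1, 2, 3], [5, 4, 3]),
    "B": ([6, 7, 8], [10, 9, 8]),
}

for name, (x, y) in groups.items():
    print(name, pearson(x, y))

all_x = [v for x, _ in groups.values() for v in x]
all_y = [v for _, y in groups.values() for v in y]
r_all = pearson(all_x, all_y)
print("overall", r_all)
```

Each category trends downward, yet the pooled data trend upward because category B sits higher on both axes. This is why the overall analysis and the category breakouts both need to be examined.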
Multivariate Trends add a layer of complexity to most trends, which are bivariate. Still, you look for the same things, patterns and trends, only you have to examine at least one additional dimension. The extra dimension may be an additional axis or some other way of representing data, like icon type, size, or color.
Anomalies
Sometimes the most interesting revelations you can garner from a dataset are the ways that it doesn’t fit expectations. Three things to look for are:
- Censoring
- Heteroskedasticity
- Outliers
Censoring is when a measurement is recorded as a <value or as a >value, indicating that the measurement instrument was unable to quantify the real value. For example, the real value may be outside the range of a meter, or counts can’t be approximated because there are too many or too few, or a time can only be estimated as before or after. Censoring is easy to detect in a dataset because censored values should be qualified with < or >.
Heteroskedasticity is when the variability in a variable is not uniform across its range. This is important because homoskedasticity (the opposite of heteroskedasticity) is assumed by parametric statistics. Look for differing thicknesses in plotted data. This is often seen in automated measurements when a measurement instrument is upgraded to one with greater precision.
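A plot is the recommended way to spot differing thicknesses, but a crude numeric companion is to compare the spread before and after a suspected change point. The readings, the split position, and the function name spread_ratio below are all hypothetical, and this is a quick screen rather than a formal test of equal variances.

```python
import statistics


def spread_ratio(values, split):
    """Ratio of standard deviations before and after index `split`."""
    before, after = values[:split], values[split:]
    return statistics.stdev(before) / statistics.stdev(after)


# Made-up readings; the instrument is assumed to be upgraded after the sixth one.
readings = [10.4, 9.5, 10.6, 9.3, 10.5, 9.6,
            10.1, 9.9, 10.0, 10.1, 9.9, 10.0]
ratio = spread_ratio(readings, split=6)
print("spread before / spread after:", ratio)
```

A ratio far from 1 suggests that the two stretches of data do not share a common variance, so methods that assume uniform variability deserve a second look.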
Influential observations and outliers are the data points that don’t fit the overall trends and patterns. Finding anomalies isn’t that difficult; deciding why they are anomalous and what to do with them are the really tough parts. Here are some examples of the types of outliers to look for.
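One common screening rule for finding candidate outliers, not necessarily the one the author has in mind, is Tukey's fences: flag anything more than 1.5 interquartile ranges beyond the quartiles. The readings and the function name below are made up for illustration.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Values lying more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


readings = [12, 13, 12, 14, 13, 15, 12, 13, 48, 14]   # made-up values
print(tukey_outliers(readings))   # → [48]
```

The rule only flags points for attention. Deciding why a point is anomalous, and whether to keep, correct, or exclude it, still takes knowledge of how the data were generated.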
How and Where to Look
That’s a lot of information to take in and remember, so here’s a summary you can refer to in the future if you ever need it.
And when you’re done, be sure to document your results so others can follow what you did.
Originally published at http://statswithcats.net on January 21, 2019.
Translated from: https://medium.com/@charliekufs/what-to-look-for-in-data-e63209bb9c30