Determine If Your Data Can Be Modeled
Some datasets are just not meant to have a spatial representation that can be clustered. You may have great variance in your features, and theoretically great features as well, but that does not mean the data is statistically separable.
So, WHEN DO I STOP?
1. Always visualize your data based on the class label you are trying to predict
import seaborn as sns
import matplotlib.pyplot as plt
sns.pairplot(columns_pairplot, hue='readmitted')  # columns_pairplot: a feature subset including the label column
plt.show()
The distributions of the different classes overlap almost exactly. Of course, it is an imbalanced dataset, but notice how the spread of the classes overlaps as well.
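As a quick sanity check on the imbalance, something like the following works (a minimal sketch; df and the 'readmitted' column name are assumptions, since the article does not show its data-loading code):

# df is assumed to be the full dataframe; 'readmitted' is the class label.
print(df['readmitted'].value_counts(normalize=True))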
2. Apply the t-SNE visualization
t-SNE is "t-distributed stochastic neighbor embedding". It maps higher-dimensional data to 2-D space in a way that approximately preserves the nearness of the samples.
You might need to try different learning rates to find the best one for your dataset; usually, values between 50 and 200 work well.
The perplexity hyper-parameter balances the importance t-SNE gives to local and global variability in the data. It is, roughly, a guess at the number of close neighbors each point has. Use values between 5 and 50, higher if there are more data points; the perplexity should never exceed the number of data points.
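To make the tuning advice concrete, here is a minimal sweep sketch (assuming x_train is the preprocessed feature matrix used later in the article; the grid values are illustrative, not the author's):

from sklearn.manifold import TSNE

# Try a few perplexity / learning-rate combinations and keep each embedding for plotting.
embeddings = {}
for perplexity in (5, 30, 50):
    for learning_rate in (50, 200):
        tsne = TSNE(n_components=2, perplexity=perplexity, learning_rate=learning_rate)
        embeddings[(perplexity, learning_rate)] = tsne.fit_transform(x_train)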
NOTE: The axes of a t-SNE plot are not interpretable, and they will be different every time t-SNE is applied.
Image by Author: t-SNE visualization
Hmm, let's look a bit more and tweak some hyperparameters.
# Reduce dimensionality with t-SNE.
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, verbose=1, perplexity=50, n_iter=1000, learning_rate=50)
tsne_results = tsne.fit_transform(x_train)

Image by Author: Overlapping data, unclassifiable!
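A plot like the one above can be produced along these lines (a sketch only; y_train holding numeric class labels is an assumption, and the article's exact plotting code is not shown):

import matplotlib.pyplot as plt

# Color each embedded point by its class label to see whether the classes separate.
plt.scatter(tsne_results[:, 0], tsne_results[:, 1], c=y_train, s=5, cmap='viridis')
plt.colorbar(label='readmitted')
plt.show()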
Do you see how the clusters cannot be separated? I should have stopped here! But I could not get myself out of the rabbit hole. [YES, WE ALL GO DOWN ONE SOMETIMES.]
3. Multi-Class Classification
We already know from the above that the decision boundaries are non-linear, so we can use an SVC (Support Vector Classifier) with an RBF kernel.
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

svc_model = SVC()  ## default kernel - RBF
parameters = {'C':[0.1, 1, 10], 'gamma':[0.00001, 0.0001, 0.001, 0.01, 0.1]}
searcher = GridSearchCV(svc_model, param_grid = parameters, n_jobs= 4, verbose = 2, return_train_score= True)
searcher.fit(x_train, y_train)
# Report the best parameters and the corresponding score.
print("Best params:", searcher.best_params_)
print("Best CV score:", searcher.best_score_)
Train Score: 0.59
Test Score: 0.53
F1 Score: 0.23
Precision Score: 0.24
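For completeness, scores like these can be produced along the following lines (a sketch under assumptions: x_test/y_test exist, and the averaging mode for the multi-class F1/precision is my guess, since the article does not state it):

from sklearn.metrics import f1_score, precision_score

y_pred = searcher.predict(x_test)
print("Train Score:", searcher.score(x_train, y_train))
print("Test Score:", searcher.score(x_test, y_test))
# 'macro' averaging is an assumption for this multi-class problem.
print("F1 Score:", f1_score(y_test, y_pred, average='macro'))
print("Precision Score:", precision_score(y_test, y_pred, average='macro'))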
So, I should have stopped earlier… It is always good to understand your data before you try to over-tune and complicate the model in the hope of better results. Good luck!
Translated from: https://towardsdatascience.com/determine-if-your-data-can-be-modeled-e619d65c13c5