ML, ML, Who Is the Champion of Them All?
The past few months have not been easy for millions of soccer fans, me included. As a die-hard Real Madrid fan, I have watched the highlights of Ronaldo and Raúl over and over again, trying to recover the excitement that my favorite side used to bring and that COVID-19 took away. But the good news is that soccer is back: three weeks ago, UEFA announced that the UEFA Champions League (UCL) will return in early August and that every fixture from the quarter-finals onward will be a single-elimination match, and it released the full draw. That is good news for me, because I have been thinking for months about how Real will make a comeback in Manchester, and now is the time to witness it.
However, the question is: how likely is that comeback?
So last week, with that question in mind, I started a machine learning project to study how the UCL matches will go when they return. In this project, I looked at who the eventual champion might be, who could be called a "winning underdog", and, most importantly, how far Real Madrid might go. In the rest of the article, I show how I carried out this project on my own from start to finish, and I share the interesting discoveries I made along the way.
Data Collection
The data I needed was everything generated during the games played before the long suspension. Hence, I collected all the data for the 2019/2020 UCL from fbref.com using Scrapy. This covered the qualification rounds, the group stage, and the knockout matches that took place in February and March.
Screenshot of fixtures/results (source: the author)
Generally speaking, the data falls into two kinds: the names of the two sides and their final scores in a match (shown in the screenshot above), and detailed squad statistics grouped by aspect, such as goalkeeping, shooting, and possession. To be honest, I had no idea at that time whether all the data I collected would be used, but I kept all of it to avoid having to scrape the site again. I explain how I decided which features to use in the next section.
Passing data (source: the author)
Shooting data (source: the author)
Data Cleaning & Pre-processing
My next step was to build a pandas DataFrame from the collected data. As mentioned above, it wasn't certain whether all the features of the dataset should be used. Some of them describe a single property in both aggregate form and average form. For example, Gls and Gls/90 stand for total goals and goals per 90 minutes, respectively. Therefore, I extracted only the features that reflect the average performance of the squad, and the following image illustrates what I kept:
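As a rough illustration of this selection step, the sketch below drops a total column whenever its per-90 counterpart is present. The table, the squad names, and the helper `drop_aggregate_duplicates` are made up for the example; only the `Gls` and `Gls/90` naming comes from the text, and the author's actual column list may differ.

```python
import pandas as pd

# Hypothetical squad table with made-up numbers. Column names follow the
# Gls / Gls/90 convention mentioned in the text.
squads = pd.DataFrame(
    {
        "Squad": ["Squad A", "Squad B"],
        "Gls": [14, 16],
        "Gls/90": [2.00, 2.29],
        "Ast": [10, 12],
        "Ast/90": [1.43, 1.71],
        "Poss": [55.1, 59.8],
    }
)


def drop_aggregate_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Drop a total column when its per-90 counterpart is also present."""
    per90 = [c for c in df.columns if c.endswith("/90")]
    totals = [c[: -len("/90")] for c in per90]
    return df.drop(columns=[c for c in totals if c in df.columns])


averaged = drop_aggregate_duplicates(squads)
print(list(averaged.columns))
```

Columns without a per-90 twin, such as `Poss` here, are left untouched.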
Remaining columns (source: the author)
This kind of feature selection cannot fully eliminate multicollinearity, though. Some of the remaining features may still be highly correlated with each other even though they have different names. Plotting a correlation heatmap of the features reveals this:
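The same check can also be done numerically. The sketch below computes a correlation matrix with NumPy on synthetic data and lists the feature pairs whose absolute correlation exceeds a threshold. The feature names, the synthetic data, and the 0.8 threshold are assumptions for illustration, not values from the project.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
shots = rng.normal(size=n)
# Built from `shots`, so the two columns are strongly correlated by design.
shots_on_target = 0.9 * shots + 0.1 * rng.normal(size=n)
possession = rng.normal(size=n)

names = ["shots", "shots_on_target", "possession"]
X = np.column_stack([shots, shots_on_target, possession])
corr = np.corrcoef(X, rowvar=False)  # columns are the variables


def correlated_pairs(corr, names, threshold=0.8):
    """Return (name_i, name_j, r) for every pair with |r| >= threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs


pairs = correlated_pairs(corr, names)
print(pairs)
```

A list like this is a convenient companion to the heatmap when there are dozens of columns and the plot becomes hard to read.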
Heatmap of selected features (source: the author)
Clearly, some strong correlations exist. My solution is dimensionality reduction by Principal Component Analysis (PCA). Before carrying out PCA, I needed to merge the fixture table with the feature set, because I assumed that the data of both sides of a match would be the input of my model.
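One way to picture the merge is to join the squad features onto each fixture twice, once for the home side and once for the away side. The sketch below does this with `pandas.DataFrame.merge`; the team names, scores, column names, and the helper `build_match_table` are all hypothetical, and the author's actual tables may be laid out differently.

```python
import pandas as pd

# Hypothetical per-squad features and fixtures with made-up values.
features = pd.DataFrame(
    {
        "Squad": ["Team A", "Team B", "Team C"],
        "Gls/90": [2.0, 1.5, 1.0],
        "Poss": [60.0, 52.0, 45.0],
    }
)
fixtures = pd.DataFrame(
    {
        "Home": ["Team A", "Team C"],
        "Away": ["Team B", "Team A"],
        "HomeGoals": [2, 0],
        "AwayGoals": [1, 3],
    }
)


def build_match_table(fixtures: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
    """Attach the features of both sides to every fixture row."""
    home = features.add_prefix("Home_")
    away = features.add_prefix("Away_")
    merged = fixtures.merge(home, left_on="Home", right_on="Home_Squad", how="left")
    merged = merged.merge(away, left_on="Away", right_on="Away_Squad", how="left")
    return merged.drop(columns=["Home_Squad", "Away_Squad"])


matches = build_match_table(fixtures, features)
print(matches)
```

Each row then carries the features of both sides, which matches the assumption that both teams' data feed the model.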
As for the number of principal components, I ran a small test, and it turned out that the first 14 principal components already explain more than 95% of the variance. That is impressive, given that we previously had more than 80 input variables in total.
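The test for the number of components can be sketched as follows: compute the explained-variance ratio of each principal component, accumulate it, and take the first count that reaches the target. This version uses a plain NumPy SVD on synthetic data with 80 columns; the data, the helper name `components_needed`, and the choice to centre without rescaling are assumptions for illustration.

```python
import numpy as np


def components_needed(X, target=0.95):
    """Return (k, cumulative), where k is the smallest number of principal
    components whose cumulative explained-variance ratio reaches `target`."""
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained_ratio = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained_ratio)
    # First index whose cumulative ratio is >= target, converted to a count.
    k = int(np.searchsorted(cumulative, target)) + 1
    return k, cumulative


# Synthetic stand-in: 80 correlated columns driven by 5 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 5))
mixing = rng.normal(size=(5, 80))
X = latent @ mixing + 0.01 * rng.normal(size=(100, 80))

k, cumulative = components_needed(X)
print(k, round(float(cumulative[k - 1]), 4))
```

Because the synthetic columns are driven by only five factors, a handful of components is enough here; on the real data the text reports 14.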
The cumulative variance of the first principal components (source: the author)
Modeling
I chose four models for comparison: Logistic Regression, XGBoost, Decision Tree, and Gaussian Mixture Model. I cross-validated each of them with 5 folds, using Root Mean Square Error (RMSE) as the measure, and found that XGBoost outperformed all the others.
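A comparison of this kind can be sketched with scikit-learn's `cross_val_score`. The example below is only an outline under several assumptions: the data is synthetic, only two of the four models are included so that the block needs nothing beyond scikit-learn, and the target is treated as an integer goal count so that RMSE is meaningful. The author's exact setup is not shown in the article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the PCA-reduced inputs and the goals scored (0 to 3).
rng = np.random.default_rng(42)
X = rng.normal(size=(120, 6))
y = np.clip(np.round(1.0 + X[:, 0] + 0.3 * rng.normal(size=120)), 0, 3).astype(int)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

rmse = {}
for name, model in models.items():
    # scikit-learn maximizes scores, so RMSE is reported with a negative sign.
    scores = cross_val_score(
        model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
    )
    rmse[name] = float(-scores.mean())

for name, value in sorted(rmse.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.3f}")
```

Any estimator with the scikit-learn interface, for example `xgboost.XGBClassifier` if the package is installed, could be added to the `models` dictionary in the same way.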
Comparison of the four models in terms of RMSE (source: the author)
Now I need to obtain the best possible prediction. To do this, I tuned the parameters of the best performer, XGBoost, using Bayesian Optimization.
Bayesian Optimization (source: the author)
The optimization improved the RMSE from about 2.84 to about 2.74, and what is left now is to predict the results of the future games.
Fixture bracket (source: the author)
In the above figure, the four teams in the right half have already reached the quarter-finals, while the teams on the left still have their second leg to play this month. The numbers in brackets are the aggregate scores of the sides in the round of 16, while the numbers outside the brackets are their scores in the second leg. Notably, it is extremely odd that every predicted result for the remaining matches is Home 2:1 Away, which would hardly happen in reality. This cannot be an accurate prediction, so there must be something that I missed and that needs fixing.
Improvement
It took me a while to think through the entire process again and notice that the problem comes from imbalanced data. Soccer is a sport in which a match is unlikely to have many goals. In other words, games with many goals happen less often. From what I observed, regardless of the home/away difference, 1:1, 2:1, and 2:0 are the most common scores in this season's UCL results so far. That is to say, a team scores at most 2 goals in most of its games.
Match result distribution (source: the author)
Score distribution of one side in one game (source: the author)
However, there are also some results such as 4:4 and 7:2, each of which happened only once. That biases my prediction, because the model was misled and tended to label a game with one of the most frequent scorelines.
There are two solutions to this issue: under-sampling (downsizing the dataset) or over-sampling. Cutting down the dataset would also throw away part of its information, and given how small my dataset is, this is clearly not an ideal option. That left me with only one choice.
I tried the Python package imbalanced-learn (imported as imblearn). Because imblearn does not support over-sampling for multi-label classification, my alternative was to apply RandomOverSampler to the score of a game as a single column (for example, "2:1") instead of to the two columns holding the goals each side scored. In this way, I treated the result of each game as one label and populated every label equally.
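To make the idea concrete, here is a dependency-free sketch of random over-sampling on a single scoreline label. It is a hand-rolled stand-in for what imbalanced-learn's `RandomOverSampler` provides, not the author's code; the function name, the toy feature rows, and the example scorelines are assumptions for illustration.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of each minority label until every
    label appears as often as the most frequent one."""
    rng = random.Random(seed)
    by_label = {}
    for row, label in zip(rows, labels):
        by_label.setdefault(label, []).append(row)
    target = max(len(group) for group in by_label.values())

    new_rows, new_labels = [], []
    for label, group in by_label.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            new_rows.append(row)
            new_labels.append(label)
    return new_rows, new_labels


# Two goal columns collapsed into one scoreline label per match.
home_goals = [1, 2, 2, 2, 4, 7]
away_goals = [1, 1, 1, 0, 4, 2]
labels = [f"{h}:{a}" for h, a in zip(home_goals, away_goals)]
rows = [[0.1 * i] for i in range(len(labels))]  # toy feature rows

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(labels))
print(Counter(new_labels))
```

After resampling, every scoreline appears as often as the most frequent one, which is the "equally populated" state described above. The duplicated rows add no new information, so the validation folds should ideally be split off before over-sampling.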
Match result distribution after over-sampling (source: the author)
And now all I need to do is repeat the steps above.
The results look good now! (source: the author)
One last factor I need to control is the parameters of the classifier. From what I observed, the parameters tuned by Bayesian Optimization are not stable: they may differ from one optimization run to the next, and so do the predictions generated by the model. To obtain a relatively reliable estimate, I simulated the tournament by running the entire prediction over 300 times and calculated each side's probability of winning the trophy.
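The repeated simulation can be pictured as a small Monte Carlo loop over a single-elimination bracket. In the sketch below, the team names, the `STRENGTH` table, and `win_probability` are hypothetical stand-ins for the tuned model, and randomness comes from a seeded generator rather than from re-running the optimization as the author did; 300 runs is used only as an illustrative count.

```python
import random
from collections import Counter

# Hypothetical strengths standing in for the model's view of each side.
STRENGTH = {
    "Team A": 3.0, "Team B": 2.5, "Team C": 2.0, "Team D": 1.8,
    "Team E": 1.5, "Team F": 1.2, "Team G": 1.0, "Team H": 0.8,
}


def win_probability(a, b):
    """Hypothetical stand-in for the model: chance that `a` beats `b`."""
    return STRENGTH[a] / (STRENGTH[a] + STRENGTH[b])


def play_bracket(teams, rng):
    """Play one single-elimination bracket and return the champion."""
    while len(teams) > 1:
        next_round = []
        for i in range(0, len(teams), 2):
            a, b = teams[i], teams[i + 1]
            next_round.append(a if rng.random() < win_probability(a, b) else b)
        teams = next_round
    return teams[0]


def title_odds(teams, runs=300, seed=0):
    """Estimate each side's title probability as its share of simulated wins."""
    rng = random.Random(seed)
    wins = Counter(play_bracket(list(teams), rng) for _ in range(runs))
    return {team: wins[team] / runs for team in teams}


bracket = list(STRENGTH)
odds = title_odds(bracket)
for team, p in sorted(odds.items(), key=lambda item: -item[1]):
    print(f"{team}: {p:.1%}")
```

With a few hundred runs the estimates are still noisy, so small differences between sides should not be over-interpreted.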
The result of the simulation is shown in the following figure:
Round-by-round probability based on XGBoost
Conclusion
To sum up, I can now answer the questions I had in mind when I started the project:
Though I am not willing to accept it, I have to admit that Real's comeback in Manchester looks like a Mission Impossible to the ML model: we have just a 0.33% chance of surviving this round and literally no chance of going further.
Thoughts
This is a project that I carried out entirely on my own when I had nothing else to fill my time, so there must be things that I failed to consider properly. As far as I can tell, there are three aspects that I should have handled better:
Timeliness of the data. If you read this article carefully, you will remember that what I used for prediction is solely this season's UCL data. The problem is that the last UCL match was played about 5 months ago, and things change over 5 months. In other words, I didn't take the most recent performance into account. Take Real as an example. It is widely acknowledged that they underperformed in the group stage and in the first leg against Manchester City. But since La Liga restarted, they have not lost a single game (10 wins, 1 draw) and have won the domestic title. Given such a fierce run of form, are you still confident that Pep Guardiola's side is going to eliminate Zizou's without mercy?
Tie-breaker. When a tie occurs, that is, when two sides score the same number of goals in a single-elimination match, or are level on both aggregate goals and away goals in a two-legged tie, I have no way to judge which side should go through. All I did was pick one of them at random. In my view, collecting data on the games that went to extra time or even penalty shootouts in previous seasons is the key to solving this problem.
Individual players' impact on the match. Throughout the project, I considered only the macro performance of the teams and never their players. Ignoring them might undermine the accuracy of the prediction model: some players have been transferred and can no longer play for their previous sides. Take Timo Werner, a dangerous striker who contributed 6 goals (including assists) to the German side RB Leipzig, while the entire team has scored just 14 goals in total so far. So it is safe to say that his transfer to Chelsea takes those impressive stats with him, and possibly much of RB Leipzig's firepower.
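The tie-breaking rule described in the second point can be written down explicitly. The sketch below decides a two-legged tie by aggregate goals, then by away goals, and only then falls back to a random pick, which mirrors the stop-gap the author describes. The function name and argument layout are assumptions, and extra time and penalties are deliberately not modeled.

```python
import random


def two_leg_winner(leg1_home, leg1_away, leg2_home, leg2_away,
                   team_a, team_b, rng):
    """Decide a two-legged tie. team_a hosts leg 1 (leg1_home:leg1_away),
    team_b hosts leg 2 (leg2_home:leg2_away)."""
    aggregate_a = leg1_home + leg2_away
    aggregate_b = leg1_away + leg2_home
    if aggregate_a != aggregate_b:
        return team_a if aggregate_a > aggregate_b else team_b

    # Away goals: team_a scored away in leg 2, team_b scored away in leg 1.
    if leg2_away != leg1_away:
        return team_a if leg2_away > leg1_away else team_b

    # Still level: random pick as a placeholder for extra time and penalties.
    return rng.choice([team_a, team_b])


rng = random.Random(0)
on_aggregate = two_leg_winner(1, 2, 2, 1, "Team A", "Team B", rng)
on_away_goals = two_leg_winner(2, 1, 1, 0, "Team A", "Team B", rng)
coin_flip = two_leg_winner(1, 0, 1, 0, "Team A", "Team B", rng)
print(on_aggregate, on_away_goals, coin_flip)
```

Replacing the final random pick with a model trained on past extra-time and shootout results would be the improvement suggested above.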
That is all for my first ever data science project on UCL champion prediction. What a fantastic experience! If you enjoyed reading this story, please give me a clap and click the follow button; this is the biggest encouragement you can give me. And if you are interested in my implementation, feel free to visit my Kaggle kernel at https://www.kaggle.com/anzhemeng/19-20-ucl-champion-prediction.
And finally, ¡Hala Madrid y nada más!
Photo by Manuel Nöbauer on Unsplash
Source: https://towardsdatascience.com/ml-ml-who-is-the-champion-of-them-all-1a4d253e86ad