How Python Novices Revolutionised Fave’s Data Warehouse with On-the-Fly Transformations
A user buying a Fave Deal or an eCard or using FavePay at Fave’s merchant partners has the option of paying using credit/debit cards or using online banking. They can get further discounts by using credits they’ve accumulated and by applying a promo code. Malaysian users can redeem their AirAsia Big Points and pay from their Boost e-Wallet. Singaporean users can link up their GrabPay Wallet. More recently, we partnered with DBS and Singtel to also give Singaporeans the option of paying at our merchants via their apps while still being able to earn the cashback they would enjoy via the Fave app.
We need to consider 12 degrees of where, 3 degrees of who, 4 degrees of when, 16 degrees of how and 8 degrees of which
With all this, it isn’t enough for us to know how much, where and when a user has transacted. We need to consider 12 degrees of where it happens, 3 degrees of who (anonymised, of course) is purchasing, 4 degrees of when purchasing/redeeming occurs, 16 degrees of how the payment is covered, 8 degrees of which Fave offerings apply and 6 degrees of the rewards a user can receive. For Fave’s earliest product, Fave Deals, this meant combining at least 15 tables, all with their own periodicity for data population, to end up with a comprehensive repository of information.
The Dilemma
Our existing reporting data warehouse was ageing, the strategy behind it unscalable, the problems ensuing from it untenable. A revitalisation was overdue. We wanted to be in a position where real-time reporting was available to support the business across 3 countries and potentially more countries in the future. But mainly we wanted a solid backbone on which to build our own data science projects. Fave’s Data Science team* has been teeing up for this for a long time and our delay to a full-on immersion has been in no small part due to an inherited extract, transform and load (ETL or data refresh) strategy that did its job well in its time but proved severely lacking for a ceaselessly growing company.
Delayed transformation means there’s a hold up to data availability which in turn impedes discussions and tempers the ability to serve customers and our merchant partners.
That previous ETL depended on a full data refresh every day. In the predawn hours, a dump of ALL data from inception up to that point would be loaded into our reporting database. A cron schedule would then trigger SQL-scripted transformation** jobs, which themselves took a healthy number of hours to wrap up. The next day and indeed every day, rinse and repeat. The daily dump naturally grew with each passing day, so the delay to its completion snowballed. What was particularly painful about this strategy was that older, unchanging data was also being refreshed for no reason whatsoever. I have myself watched the transformation schedule slip, and reconfigured it, gradually from 6 am when I started at Fave just under 2 years ago to its current 10:20 am.
Inherited ETL Setup
Delayed transformation means there’s a hold up to data availability which in turn impedes discussions such as the weekly top management meeting. It tempers the ability of our Customer Happiness and Partner Manager teams to serve customers and our merchant partners. It hampers Finance from completing their monthly account closings in a timely manner.
The Solution
One solution would have been to replicate data into our reporting database as it is generated. In other words, the dump would be spread out throughout the day and at no point recurrent for the same data point. That still meant hours transforming the data daily though, not to mention a minefield of read/write conflicts. Our lead, Jatin Solanki, proposed to spread out the transformations as well i.e. transform the data on its way to being dumped. On top of that, we coupled this project with migration to BigQuery to take advantage of their BI engine and table partitioning and clustering features for faster report load times.
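To make the contrast with the full daily dump concrete, here is a minimal sketch of replicating only what has changed since the last run, using SQLite from Python's standard library. The `orders` table, its `updated_at` column and the `copy_new_rows` helper are all hypothetical stand-ins; the article does not describe the replication mechanism at this level of detail.

```python
import sqlite3

SCHEMA = "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"


def copy_new_rows(source, target, watermark):
    """Copy only rows changed after `watermark` and return the new watermark."""
    rows = source.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (watermark,),
    ).fetchall()
    # INSERT OR REPLACE acts as an upsert on the primary key, so a row that
    # changed again overwrites its earlier copy instead of being duplicated.
    target.executemany(
        "INSERT OR REPLACE INTO orders (id, amount, updated_at) VALUES (?, ?, ?)",
        rows,
    )
    target.commit()
    return rows[-1][2] if rows else watermark


# Hypothetical production and reporting databases, both in memory.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
source.execute(SCHEMA)
target.execute(SCHEMA)

source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 39.99, "2020-06-01T00:00:00"), (2, 15.00, "2020-06-01T00:05:00")],
)
watermark = copy_new_rows(source, target, "")  # first run copies everything

# Later: one new row arrives and one existing row is updated.
source.execute("INSERT INTO orders VALUES (3, 8.50, '2020-06-01T00:10:00')")
source.execute(
    "UPDATE orders SET amount = 12.00, updated_at = '2020-06-01T00:11:00' WHERE id = 2"
)
watermark = copy_new_rows(source, target, watermark)  # copies rows 3 and 2 only
```

Row 1 is never copied a second time, which is the point: unchanged history stays put, and the cost of each run tracks the volume of new activity instead of the age of the company.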
Our on-the-fly transformations solution emphasises accuracy, foresight, modularisation and collaboration.
To be clear, there are ready-built ETL tools and SaaS products in the market that do exactly this. We had been using one such tool for a few use-cases. It could flatten JSONs and calculate new fields as streaming occurred, but joining to other streamed tables had to be done through a separate feature with its own schedule and pricing. Ultimately our issues with it were its cost and its reliability. Even while using it for only a subset of our data, we had been plagued with failing pipelines that confounded us, and efforts to seek assistance from their support team more often than not fell short of our expectations. Because its core was not under our jurisdiction, we didn’t have the fullest flexibility to investigate, experiment and tinker. The decision was made to close that door for now, go the open-source route and chart our own way for all our pipelines.
Our on-the-fly transformations solution emphasised the following tenets:
Accuracy cannot be compromised. Period.
Get ahead: We didn’t want just a faster database or a more up-to-date one. We wanted to improve anywhere we saw it was needed. That meant more efficient logic, deprecating some tables while introducing more comprehensive columns. Our 8 members of the Data team collectively have 13 years of dealing with every single other team in Fave. Our design must facilitate answering their needs but also our own data science projects.
Modularisation: As much as possible, we should apply predefined libraries, functions and variables (whether our own or from elsewhere) to reduce complexity and increase code reuse. While there could still be improvements, what we’ve accomplished so far has already proven advantageous. In fact, our accuracy check module is already being applied in other projects, with turnaround being considerably shorter given we were not building from scratch.
Engineering, as the capturers of the data, need to always be on the same page (or at least only one page away): A great many companies suffer the inefficiency of not having their engineers and various data people talk to each other. We have that relationship at Fave and this undertaking shouldn’t dislodge that.
In brief terms, the system we conceived has Kafka tapping into our production database. Python scripts then listen to a combination of Kafka topics in real time, where each topic corresponds to a raw table’s events***, and utilise Dask for streamlined parallel computing. Everything from joins to flattening to decoding to enrichment to practically any beneficial computation is applied at this point. This improves on our SQL-scripted transformations by being continuous: as new data is created, it is immediately streamed and transformed. It improves on the SaaS product we were using by being a more integrated and customisable answer to our woes. It also improves on both by making use of distributed computing, so the load involved in the transformations is shared among parallel-running workers, decreasing the operation time further. That same script then inserts or updates into the final tables that plug into the data reports used across Fave as well as the reports we send out to merchants and partners.
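The flow described above can be sketched in miniature with nothing but the standard library. This is an illustration under loud assumptions, not Fave's implementation: plain Python lists stand in for Kafka topics, a `ThreadPoolExecutor` stands in for Dask workers, a dict stands in for the final reporting table, and every field name (`user_id`, `country`, `version`) is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def latest_by_key(events):
    """Collapse row snapshots to the newest snapshot per primary key."""
    latest = {}
    for event in events:
        current = latest.get(event["id"])
        if current is None or event["version"] >= current["version"]:
            latest[event["id"]] = event
    return latest


def transform(purchase, users):
    """Join a purchase snapshot to its user and keep the reporting columns."""
    user = users.get(purchase["user_id"], {})
    return {
        "purchase_id": purchase["id"],
        "country": user.get("country"),
        "amount": purchase["amount"],
    }


def process_batch(purchase_batch, users, final_table):
    """Transform one micro-batch in parallel, then insert or update."""
    purchases = latest_by_key(purchase_batch).values()
    with ThreadPoolExecutor(max_workers=4) as pool:  # stand-in for Dask workers
        for row in pool.map(lambda p: transform(p, users), purchases):
            final_table[row["purchase_id"]] = row  # upsert keyed on purchase id


# Stand-in for a "users" topic: one event per row snapshot.
users = latest_by_key([
    {"id": 10, "country": "MY", "version": 1},
    {"id": 11, "country": "SG", "version": 1},
])

final_table = {}
# First micro-batch from the stand-in "purchases" topic.
process_batch(
    [
        {"id": 1, "user_id": 10, "amount": 39.99, "version": 1},
        {"id": 2, "user_id": 11, "amount": 15.00, "version": 1},
    ],
    users,
    final_table,
)
# A later event updates purchase 1; only that row is reprocessed.
process_batch(
    [{"id": 1, "user_id": 10, "amount": 35.99, "version": 2}],
    users,
    final_table,
)
```

The sketch sidesteps the hard part mentioned later in the article: with real distributed workers, linked events can be separated and arrive out of order, so a production version needs a deliberate strategy for late or missing join partners.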
High-level View of our Kafka-Powered Real-Time Transformations System. Image credit: Jatin Solanki
Pursuing any single milestone in this project was not easy for individuals proficient in Fave’s existing data set-up. Nor was it easy for those who already had numerous Python projects and libraries under their belt. One challenge certainly was that those of us who knew Fave’s data inside-out were not the same people within the team who had an established repertoire around Kafka and Dask. Coding transformation logic on top of events-based input and using distributed computing for efficiency, which by the way also inevitably separated linked events, was one thing. Having to contend with the magnitude of inherent complexity when trying to get numerous raw tables with distinct population timelines into a coherent unified table was its own beast. All of us took on numerous new skills while leveraging the skills, knowledge and business sense the data analysts had already accumulated. By the end, we had clocked in excess of 25,000 lines of Python.
Why transform the data at all?
The priorities of an engineering team and a data team are innately different
As much as the transformations can be coded as part of a data science project, or run when somebody clicks submit in one of our reports to get results, that is not ideal as it takes up time. Take Fave Deals again. To generate the full range of transactional information, a SQL-based report would need half of its script just for the joins of at least 15 different raw tables. A run of such a report would have a load time best described as the timer I could use when making roast potatoes.
Data availability lag is down from 36 hours to <8 minutes
Transformations save time and effort for the really potent work to be done on top of the data. Data science work can be focussed on modelling. Analysis need not include code sections simply to translate the European currency notation that Indonesia uses into the more typical format that allows for arithmetic.
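As a small illustration of the kind of clean-up that is better done once upstream than in every analysis, here is a sketch that converts amounts written with dots as thousands separators and a comma as the decimal mark into numbers. The helper name `parse_european_amount` and the sample values are hypothetical.

```python
from decimal import Decimal


def parse_european_amount(text):
    """Convert '1.250.000,50' (dot thousands, comma decimal) to a Decimal."""
    cleaned = text.strip().replace(".", "").replace(",", ".")
    return Decimal(cleaned)


amounts = ["1.250.000,50", "75.000", "999,99"]
total = sum(parse_european_amount(a) for a in amounts)
print(total)  # 1326000.49
```

`Decimal` is used here instead of `float` so that currency sums stay exact. Note that the helper assumes every input follows the dot-thousands convention; a value such as `75.000` is read as seventy-five thousand, not seventy-five.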
So does it work?
Reading this article now takes longer than reporting on transactions
We brought the data availability lag from 36 hours down to <8 minutes. With this, our marketing team no longer needs a separate pipeline and set of reports to be established and maintained in order to watch costs as they happen. That same dataset could advise our Operations team on their turnaround times to create, quality check and deploy new offerings. The Partnerships and various Sales teams can communicate with their stakeholders at a heightened level of informativeness. Product and Engineering benefit from our API serving our team’s AI-generated content, built upon that same dataset and created just hours prior, for both our merchant app (FaveBiz) and consumer app. Similarly, CRM communications stay relevant and personalised to each Fave user. Everybody wins.
But...
While we agree it was necessary for the company, with its unending need for customisation and its propensity for prudent spending, the barrier to entry for our team has undoubtedly been raised. Custom-built Python libraries including our own version of Dask (pending approval of our pull request) need to be maintained. Bugs ranging from a missing comma to leaking Kafka offsets to the more intertwined issues stemming from how Fave’s raw tables are populated in production are all ours to own and ours to resolve.
As mentioned, it is elaborate even for Python experts and it is labyrinthine even for those that are able to navigate the maze that is Fave’s relational database structure. That said, we welcome all attempts to scale this challenge, both in the literal and figurative senses. At the end of the day, we’ve built real-time Python processing for broad-scale utilisation, something other companies choose to pay for. More than anything, we’re really quite excited at the opportunities this has given us.
25,000 lines of Python + 3 custom-built Python libraries, 250x faster data availability, a world-class low-cost event-streaming system and development and delivery under global pandemic conditions.
Up until the last few weeks of this project, we were consistently in a 1-step-forward, 2-steps-back dance. Hypothesising and experimenting constantly with last week’s breakthrough becoming this week’s redundancy at times left us dejected. A 3-month planned timeline stretched to a 4, then 5, then 6, then 7-month execution as each key component had to be planned, coded then stress-tested against a permutation and combination of scenarios. All this till we reached the point of a very slow and very cautious realisation of “I think... we’ve maybe kinda sorta did it”.
25,000 lines of Python, not including 3 custom-built Python libraries; 250x faster data availability; a world-class low-cost event-streaming system; and development and delivery under global pandemic conditions. Not bad at all from a team that, while not ignorant of Python, was until a year ago working primarily in SQL on a day-to-day basis.
*Fave’s Data Science Team, at time of writing, consists of Husein Zolkepli (Creator of Malaya — THE Malay language toolkit library and our chief tamer of Kafka), Lin Cheun Hong (budding data engineer who always has a fresh thought to contribute), Evonne Soon, Zuhairi “Harry” Akshah and myself, Sarhan Abd. Samat (data analysts/analytics engineers who successfully detoured into Dask scripting with just the Python basics), Cheok Huei Keat (who made sure the company knew our team still existed but somehow managed to serve up some time-saving data science deliverables) and Faris Hassan (our only data scientist from the get-go and our generous teacher in the months to come). We are helmed by Jatin Solanki who, and this cannot be emphasised enough, always showed the right amount of guidance, patience, shielding and confidence in us throughout. A lesser leader would have binned this project long before it came to fruition. Also an unqualified special mention to Aiyas Aboobakar who helped get this going before abandoning us for greener pastures.
**Essentially, what happens during transformations: 1) Different data sources are combined into a unitary data source. 2) The data is cleaned, so your RM39.99 that was stored in a JSON simply becomes 39.99 to facilitate arithmetic operations. 3) Timestamps are localised from UTC. 4) Ids are decoded to their names and some categorisation, among other things, is done to make the data more human-decipherable. 5) Some additional calculated fields like gross profit are created and stored so that reports are lighter to produce.
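The five steps in this footnote can be sketched on a single record as follows. Everything specific here is an assumption made for illustration: the field names, the `OUTLET_NAMES` lookup, the fixed UTC+8 offset for Malaysia, and the simplification of gross profit to amount minus cost.

```python
import json
from datetime import datetime, timedelta, timezone
from decimal import Decimal

MYT = timezone(timedelta(hours=8), "MYT")  # assumption: Malaysia, UTC+8, no DST
OUTLET_NAMES = {7: "Hypothetical Cafe"}    # hypothetical id-to-name lookup


def transform(raw_purchase, raw_cost):
    """Apply steps 1 to 5 to one purchase row and its matching cost row."""
    # 1) Combine sources: both raw rows are passed in and merged below.
    # 2) Clean: 'RM39.99' stored inside a JSON string becomes the number 39.99.
    payment = json.loads(raw_purchase["payment_json"])
    amount = Decimal(payment["amount"].replace("RM", ""))
    # 3) Localise: the stored timestamp is naive UTC, so tag it and convert.
    paid_at_utc = datetime.fromisoformat(raw_purchase["paid_at"]).replace(
        tzinfo=timezone.utc
    )
    return {
        "purchase_id": raw_purchase["id"],
        "amount": amount,
        "paid_at_local": paid_at_utc.astimezone(MYT).isoformat(),
        # 4) Decode: replace the outlet id with a human-readable name.
        "outlet_name": OUTLET_NAMES.get(raw_purchase["outlet_id"], "unknown"),
        # 5) Pre-compute: store gross profit so reports need not derive it.
        "gross_profit": amount - Decimal(raw_cost["cost"]),
    }


row = transform(
    {
        "id": 1,
        "outlet_id": 7,
        "paid_at": "2020-06-01T18:30:00",
        "payment_json": '{"amount": "RM39.99"}',
    },
    {"purchase_id": 1, "cost": "25.00"},
)
```

With this input, `row["paid_at_local"]` is `2020-06-02T02:30:00+08:00`: an evening purchase in UTC lands on the following calendar day locally, which is exactly why localisation matters for daily reporting.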
***Each version/snapshot of a table row at any given time, even one that lasts only a few milliseconds, is an event.
Translated from: https://medium.com/fave-engineering/how-python-novices-revolutionised-faves-data-warehouse-with-on-the-fly-transformations-1520c24c160c