Building a monorepo for Data Science with Pantsbuild
At HousingAnywhere, one of the first major obstacles we faced when scaling the Data team was building a centralised repository for our ever-growing set of machine learning applications. Many of these projects share dependencies with each other, which makes refactoring painful and time-consuming. In addition, as we're strongly opposed to Data Scientists' tendency to copy and paste code, we needed a single location for reusable functions that are easy to access.
The perfect solution to our use case was building a monorepo. In this article, I’ll go through how a simple monorepo can be built using the build automation system Pantsbuild.
What is a monorepo?
A monorepo is a repository where the code for many projects is stored together. Having a centralised repository for your team comes with a number of benefits:
- Reusability: Projects can share functions. In Data Science, code for preprocessing data, calculating metrics and even plotting graphs can be shared across projects.
- Atomic changes: It only takes one operation to make changes across multiple projects.
- Large-scale refactoring: Can be done easily and quickly while making sure the projects still work afterwards.
A monorepo, however, is not a one-size-fits-all solution, as it has a number of disadvantages:
- Security issues: There is no way to expose only parts of the repository.
- Big codebase: As the repo grows in size, it can cause problems because developers have to check out the entire repository.
At HousingAnywhere, our team of Data Scientists finds the monorepo to be the perfect solution for the Data team's use cases. Many of our machine learning applications have smaller projects that spin off from them. The monorepo enables us to quickly integrate these new projects into the CI/CD pipeline, reducing the time spent setting up a pipeline for each new project individually.
We tried out a number of build automation systems and the one that we stuck with is Pantsbuild. Pants is one of the few systems that support Python natively, and is an open-source project widely used by Twitter, Toolchain, Foursquare, Square, and Medium.
Recently, Pants was updated to v2, which only supports Python at the moment, but that isn't much of a limitation for Data Science projects.
Some basic concepts
There are a couple of concepts in Pants that you should understand beforehand:
- Goals tell Pants what action to take, e.g. test.
- Tasks are the Pants modules that run actions.
- Targets describe which files to take those actions upon. Targets are defined in a BUILD file.
- Target types define the types of operations that can be performed on a target, e.g. you can run tests on test targets.
- Addresses describe the location of a target in the repo.
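To make addresses more concrete, here is a minimal Python sketch that splits the address forms used later in this article into a directory and a target name. The parse_address function and its current_dir parameter are hypothetical names for illustration; this is not how Pants parses addresses internally, only a rough model of the notation.

```python
from typing import Tuple


def parse_address(spec: str, current_dir: str = "") -> Tuple[str, str]:
    """Split a target address into (directory, target name).

    Illustrative only: this models the address forms used in this article,
    not Pants's real parser.
    """
    path, sep, name = spec.partition(":")
    if path.startswith("//"):
        # `//` anchors the path at the repository root.
        path = path[2:]
    elif not path and sep:
        # `:name` refers to a target in the current BUILD file.
        path = current_dir
    if not name:
        # A bare directory such as `utils` is shorthand for `utils:utils`.
        name = path.rsplit("/", 1)[-1]
    return path, name


print(parse_address("utils:utils_test"))              # ('utils', 'utils_test')
print(parse_address("//:numpy"))                      # ('', 'numpy')
print(parse_address(":utils", current_dir="utils"))   # ('utils', 'utils')
print(parse_address("utils"))                         # ('utils', 'utils')
```

An empty directory in the result stands for the repository root, which is where the third-party requirement targets live in the example below.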
For more information, I highly recommend reading the Pants documentation, where the developers have done an excellent job of explaining these concepts in detail.
An example repository
In this section, I'll go through how you can easily set up a monorepo using Pants. First, make sure these requirements for installing Pants are met:
- Linux or macOS.
- Python 3.6+ discoverable on your PATH.
- Internet access (so that Pants can fully bootstrap itself).
Now, let’s set up a new repository:
mkdir monorepo-example
cd monorepo-example
git init
Alternatively, you can clone the example repo via:
git clone https://github.com/uiucanh/monorepo-example.git

Next, run these commands to download the setup file:
printf '[GLOBAL]\npants_version = "1.30.0"\nbackend_packages = []\n' > pants.toml
curl -L -o ./pants https://pantsbuild.github.io/setup/pants && \
  chmod +x ./pants
Then, bootstrap Pants by running ./pants --version. You should receive 1.30.0 as output.
Let’s add a couple of simple apps to the repo. First, we’ll create a utils/data_gen.py and a utils/metrics.py that contain a couple of util functions:
# utils/data_gen.py
import numpy as np


def generate_linear_data(n_samples: int = 100, n_features: int = 1,
                         x_min: int = -5, x_max: int = 5,
                         m_min: int = -10, m_max: int = 10,
                         noise_strength: int = 1, seed: int = None,
                         bias: int = 10):
    # Set the random seed
    if seed is not None:
        np.random.seed(seed)

    X = np.random.uniform(x_min, x_max, size=(n_samples, n_features))
    m = np.random.uniform(m_min, m_max, size=n_features)
    y = np.dot(X, m).reshape((n_samples, 1))

    if bias != 0:
        y += bias

    # Add Gaussian noise
    y += np.random.normal(size=y.shape) * noise_strength

    return X, y


def split_dataset(X: np.ndarray, y: np.ndarray,
                  test_size: float = 0.2, seed: int = 0):
    # Set the random seed
    np.random.seed(seed)

    # Shuffle dataset
    indices = np.random.permutation(len(X))
    X = X[indices]
    y = y[indices]

    # Splitting
    X_split_point = int(len(X) * (1 - test_size))
    y_split_point = int(len(y) * (1 - test_size))

    X_train, X_test = X[:X_split_point], X[X_split_point:]
    y_train, y_test = y[:y_split_point], y[y_split_point:]

    return X_train, X_test, y_train, y_test

# utils/metrics.py
import numpy as np


def mean_absolute_percentage_error(y_true: np.ndarray, y_pred: np.ndarray):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


def r2(y_test: np.ndarray, y_pred: np.ndarray):
    y_mean = np.mean(y_test)
    ss_tot = np.square(y_test - y_mean).sum()
    ss_res = np.square(y_test - y_pred).sum()
    result = 1 - ss_res / ss_tot
    return result

Now, we'll add an application first_app/app.py that imports this code. The app takes data from generate_linear_data, passes it to a Linear Regression model and outputs the Mean Absolute Percentage Error.
# first_app/app.py
import os
import sys

# Enable import from outer directory
file_path = os.path.dirname(os.path.realpath(__file__))
sys.path.insert(0, file_path + "/..")

from utils.data_gen import generate_linear_data, split_dataset  # noqa
from utils.metrics import mean_absolute_percentage_error, r2  # noqa
from sklearn.linear_model import LinearRegression  # noqa


class Model:
    def __init__(self, X, y):
        self.X = X
        self.y = y
        self.m = LinearRegression()
        self.y_pred = None

    def split(self, test_size=0.33, seed=0):
        self.X_train, self.X_test, self.y_train, self.y_test = split_dataset(
            self.X, self.y, test_size=test_size, seed=seed)

    def fit(self):
        self.m.fit(self.X_train, self.y_train)

    def predict(self):
        self.y_pred = self.m.predict(self.X_test)


def main():
    X, y = generate_linear_data()
    m = Model(X, y)
    m.split()
    m.fit()
    m.predict()
    print("MAPE:", mean_absolute_percentage_error(m.y_test, m.y_pred))


if __name__ == '__main__':
    main()

And another app, second_app/app.py, that uses the code of the first app:
# second_app/app.py
import sys
import os

# Enable import from outer directory
file_path = os.path.dirname(os.path.realpath(__file__))
sys.path.insert(0, file_path + "/..")

from utils.metrics import r2  # noqa
from utils.data_gen import generate_linear_data, split_dataset  # noqa
from first_app.app import Model  # noqa


def main():
    X, y = generate_linear_data()
    m = Model(X, y)
    m.split()
    m.fit()
    m.predict()
    result = r2(m.y_test, m.y_pred)
    print("R2:", result)
    return result


if __name__ == '__main__':
    _ = main()

Then we add a couple of simple tests for these apps, for example:
# first_app/app_test.py
import numpy as np
from first_app.app import Model


def test_model_working():
    X, y = np.array([[1, 2, 3], [4, 5, 6]]), np.array([[1], [2]])
    m = Model(X, y)
    m.split()
    m.fit()
    m.predict()
    assert m.y_pred is not None

In each of these directories, we'll need a BUILD file. These files contain information about your targets and their dependencies. In them, we'll declare the requirements of each project as well as the test targets.
Let’s start from the root of the repository:
python_requirements()

This BUILD file contains a macro, python_requirements(), that creates multiple targets to pull third-party dependencies from a requirements.txt in the same directory. It saves us from having to declare each requirement manually like this:
python_requirement_library(
    name="numpy",
    requirements=[
        python_requirement("numpy==1.19.1"),
    ],
)
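As a rough mental model of what the macro generates, the sketch below renders one python_requirement_library target per line of a requirements file. requirement_targets is a hypothetical helper, not part of Pants, and the requirements.txt content is assumed: only the numpy pin appears in this article, while scikit-learn and pytest are inferred from the addresses used in the first_app BUILD file.

```python
import re
from typing import List


def requirement_targets(requirements_txt: str) -> List[str]:
    """Render one python_requirement_library target per requirement line.

    Illustration only; the real python_requirements() macro lives inside Pants.
    """
    targets = []
    for line in requirements_txt.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # The project name is everything before the first version specifier.
        name = re.split(r"[<>=!~\[; ]", line, maxsplit=1)[0]
        targets.append(
            "python_requirement_library(\n"
            f'    name="{name}",\n'
            "    requirements=[\n"
            f'        python_requirement("{line}"),\n'
            "    ],\n"
            ")"
        )
    return targets


# Assumed file content: only the numpy pin is taken from the article.
example = "numpy==1.19.1\nscikit-learn\npytest\n"
for target in requirement_targets(example):
    print(target)
```

Each generated target is named after its project, which is why other BUILD files can refer to the dependencies as //:numpy or //:scikit-learn.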
The BUILD file in utils looks like this:
python_library(
    name = "utils",
    sources = [
        "data_gen.py",
        "metrics.py",
    ],
    dependencies = [
        # The `//` signals that the target is at the root of your project.
        "//:numpy"
    ]
)

python_tests(
    name = 'utils_test',
    sources = [
        "data_gen_test.py",
        "metrics_test.py",
    ],
    dependencies = [
        ":utils",
    ]
)

Here we have two targets. The first one is a Python library containing the Python code listed in sources, i.e. our two utility files. It also specifies the requirement needed to run this code, numpy, which is one of the third-party dependencies we defined in the root BUILD file.
The second target is the collection of tests we defined earlier; its dependency is the previous Python library. Running these tests is as simple as running ./pants test utils:utils_test or ./pants test utils:: from the root. The second : tells Pants to run every test target in that directory and its subdirectories rather than a single named target. The output should look like this:
============== test session starts ===============
platform darwin -- Python 3.7.5, pytest-5.3.5, py-1.9.0, pluggy-0.13.1
cachedir: .pants.d/test/pytest/.pytest_cache
rootdir: /Users/ducbui/Desktop/Projects/monorepo-example, inifile: /dev/null
plugins: cov-2.8.1, timeout-1.3.4
collected 3 items
utils/data_gen_test.py . [ 33%]
utils/metrics_test.py .. [100%]
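The test files under utils aren't shown in this article. As an assumption of what utils/metrics_test.py might look like, here is a minimal sketch with two hypothetical tests. The metric functions are copied inline so that the snippet runs on its own; in the repo they would be imported from utils.metrics.

```python
import numpy as np


# Copied from utils/metrics.py so the sketch is self-contained.
def mean_absolute_percentage_error(y_true, y_pred):
    y_true, y_pred = np.array(y_true), np.array(y_pred)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100


def r2(y_test, y_pred):
    y_mean = np.mean(y_test)
    ss_tot = np.square(y_test - y_mean).sum()
    ss_res = np.square(y_test - y_pred).sum()
    return 1 - ss_res / ss_tot


def test_mape():
    # |100 - 110| / 100 = 0.1 and |200 - 180| / 200 = 0.1, so the mean is 10%.
    result = mean_absolute_percentage_error([100, 200], [110, 180])
    assert np.isclose(result, 10.0)


def test_r2_perfect_prediction():
    # A perfect prediction leaves no residual, so R2 is exactly 1.
    y = np.array([1.0, 2.0, 3.0])
    assert np.isclose(r2(y, y), 1.0)


if __name__ == "__main__":
    test_mape()
    test_r2_perfect_prediction()
```

pytest collects any function whose name starts with test_, so no extra registration is needed beyond listing the file in the sources of the python_tests target.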
Similarly, we'll create two BUILD files for first_app and second_app:
python_library(
    name = "first_app",
    sources = ["app.py"],
    dependencies = [
        "//:numpy",
        "//:scikit-learn",
        "//:pytest",
        "utils",
    ],
)

python_tests(
    name = 'app_test',
    sources = ["app_test.py"],
    dependencies = [
        ":first_app",
    ]
)

In the second_app BUILD file, we declare the first_app library above as a dependency of the second_app library. This means that all the dependencies of first_app, together with its sources, become dependencies of second_app.
python_library(
    name = "second_app",
    sources = ["app.py"],
    dependencies = [
        "first_app",
    ],
)

python_tests(
    name = 'app_test',
    sources = ["app_test.py"],
    dependencies = [
        ":second_app",
    ]
)

Similarly, we also add some test targets to these BUILD files; they can be run with ./pants test first_app:: or ./pants test second_app::.
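The way dependencies propagate from first_app to second_app can be illustrated with a small graph walk. This is a sketch of the idea, not of Pants's implementation: the DEPENDENCIES dictionary is transcribed by hand from the BUILD files above, with the shorthand addresses utils and first_app written out in full, and transitive_dependencies is a hypothetical helper.

```python
from typing import Dict, List, Set

# Direct dependencies as declared in the BUILD files above.
DEPENDENCIES: Dict[str, List[str]] = {
    "//:numpy": [],
    "//:scikit-learn": [],
    "//:pytest": [],
    "utils:utils": ["//:numpy"],
    "first_app:first_app": [
        "//:numpy", "//:scikit-learn", "//:pytest", "utils:utils",
    ],
    "second_app:second_app": ["first_app:first_app"],
}


def transitive_dependencies(target: str) -> Set[str]:
    """Collect everything reachable from `target` by following dependencies."""
    seen: Set[str] = set()
    stack = list(DEPENDENCIES[target])
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(DEPENDENCIES[dep])
    return seen


print(sorted(transitive_dependencies("second_app:second_app")))
# ['//:numpy', '//:pytest', '//:scikit-learn', 'first_app:first_app', 'utils:utils']
```

Although second_app only declares first_app, it ends up with numpy, scikit-learn, pytest and utils as well, which is why its BUILD file can stay so short.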
The final directory tree should look like this:
.
├── BUILD
├── first_app
│   ├── BUILD
│   ├── app.py
│   └── app_test.py
├── pants
├── pants.toml
├── requirements.txt
├── second_app
│   ├── BUILD
│   ├── app.py
│   └── app_test.py
└── utils
    ├── BUILD
    ├── data_gen.py
    ├── data_gen_test.py
    ├── metrics.py
    └── metrics_test.py
The power of Pants comes from its ability to trace transitive dependencies between projects and to find the test targets affected by a change. The developers of Pants provide this nifty bash script that can be used to track down the affected test targets:
#!/bin/bash

set -x
set -o
set -e

# Disable Zinc incremental compilation to ensure no historical cruft pollutes the build used for CI testing.
export PANTS_COMPILE_ZINC_INCREMENTAL=False

changed=("$(./pants --changed-parent=origin/master list)")
dependees=("$(./pants dependees --dependees-transitive --dependees-closed ${changed[@]})")
minimized=("$(./pants minimize ${dependees[@]})")
./pants filter --filter-type=-jvm_binary ${minimized[@]} | sort > minimized.txt

# In other contexts we can use --spec-file to read the list of targets to operate on all at
# once, but that would merge all the classpaths of all the test targets together, which may cause
# errors. See https://www.pantsbuild.org/3rdparty_jvm.html#managing-transitive-dependencies.
# TODO(#7480): Background cache activity when running in a loop can sometimes lead to race conditions which
# cause pants to error. This can probably be worked around with --no-cache-compile-rsc-write. See
# https://github.com/pantsbuild/pants/issues/7480.
for target in $(cat minimized.txt); do
  ./pants test $target
done

To showcase its power, let's run an example. We'll create a new branch, make a modification to data_gen.py (e.g. changing a default parameter of generate_linear_data) and commit:
git checkout -b "example_1"
git add utils/data_gen.py
git commit -m "support/change-params"
Now, running the bash script we’ll see a minimized.txt that contains all the projects that are impacted and the test targets that will be executed:
first_app:app_test
second_app:app_test
utils:utils_test

[Figure: Transitive dependencies]
Looking at the graph above, we can clearly see that changing utils affects every node above it, including first_app and second_app.
Let's do another example. This time we'll only modify second_app/app.py. Switch branch, commit and run the script again. Inside minimized.txt, we'll only get second_app:app_test, as it's the topmost node.
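Both examples can be reproduced with a toy model of what the script computes. The sketch below inverts a hand-written dependency graph of the first-party targets and walks it upwards from the changed target. affected_tests is a hypothetical helper, and picking test targets by the _test suffix is an assumption that only holds for the names used here. The real script starts from changed files rather than from a target, and relies on Pants to find the targets that own them.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Direct dependencies between the first-party targets in this article.
DEPENDENCIES: Dict[str, List[str]] = {
    "utils:utils": [],
    "utils:utils_test": ["utils:utils"],
    "first_app:first_app": ["utils:utils"],
    "first_app:app_test": ["first_app:first_app"],
    "second_app:second_app": ["first_app:first_app"],
    "second_app:app_test": ["second_app:second_app"],
}


def affected_tests(changed: str) -> List[str]:
    """Return test targets that depend, directly or transitively, on `changed`."""
    # Invert the graph: for each target, which targets depend on it?
    dependees: Dict[str, List[str]] = defaultdict(list)
    for target, deps in DEPENDENCIES.items():
        for dep in deps:
            dependees[dep].append(target)

    # Walk upwards from the changed target through its dependees.
    seen: Set[str] = {changed}
    stack = [changed]
    while stack:
        for parent in dependees[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)

    # Assumption for this sketch: test targets are the ones named *_test.
    return sorted(t for t in seen if t.endswith("_test"))


print(affected_tests("utils:utils"))
# ['first_app:app_test', 'second_app:app_test', 'utils:utils_test']
print(affected_tests("second_app:second_app"))
# ['second_app:app_test']
```

The two printed lists match the contents of minimized.txt in the two examples: a change at the bottom of the graph triggers every test, while a change at the top triggers only one.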
And that's it. Hopefully, I've managed to demonstrate how useful Pantsbuild can be for Data Science monorepos. Together with a properly implemented CI/CD pipeline, it can vastly improve the speed and reliability of development.
Translated from: https://towardsdatascience.com/building-a-monorepo-for-data-science-with-pantsbuild-2f77b9ee14bd