Data Scientists, Start Using Profilers
Data scientists often need to write a lot of complex, slow, CPU- and I/O-heavy code — whether you’re working with large matrices, millions of rows of data, reading in data files, or web-scraping.
Wouldn’t you hate to waste your time refactoring one section of your code, trying to wring out every last ounce of performance, when a few simple changes to another section could speed up your code tenfold?
If you’re looking for a way to speed up your code, a profiler can show you exactly which parts are taking the most time, allowing you to see which sections would benefit most from optimization.
A profiler measures the time or space complexity of a program. There’s certainly value in theorizing about the big O complexity of an algorithm but it can be equally valuable to examine the real complexity of an algorithm.
Where is the biggest slowdown in your code? Is your code I/O bound or CPU bound? Which specific lines are causing the slowdowns?
Once you’ve answered those questions you’ll A) have a better understanding of your code and B) know where to target your optimization efforts in order to get the biggest boon with the least effort.
Let’s dive into some quick examples using Python.
The Basics
You might already be familiar with a few methods of timing your code. You could check the time before and after a line executes like this:
In [1]: import time
   ...: start_time = time.time()
   ...: a_function()  # Function you want to measure
   ...: end_time = time.time()
   ...: time_to_complete = end_time - start_time
   ...: time_to_complete
Out[1]: 1.0110783576965332
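If you find yourself sprinkling those start/end checks everywhere, a small context manager can tidy them up. This `timer` helper is our own convenience sketch, not part of the standard library:

```python
import time
from contextlib import contextmanager

@contextmanager
def timer(label="block"):
    # Record the start time, yield a dict the caller can inspect,
    # and report elapsed wall-clock time on exit.
    start = time.perf_counter()
    result = {}
    try:
        yield result
    finally:
        result["elapsed"] = time.perf_counter() - start
        print(f"{label}: {result['elapsed']:.4f} s")

# Usage:
with timer("sleep"):
    time.sleep(0.1)
```

Using `time.perf_counter()` rather than `time.time()` gives a monotonic, higher-resolution clock better suited to interval measurement.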
Or, if you’re in a Jupyter Notebook, you could use the magic %time command to time the execution of a statement, like this:
In [2]: %time a_function()
CPU times: user 14.2 ms, sys: 41 μs, total: 14.2 ms
Wall time: 1.01 s
Or, you could use the other magic command %timeit which gets a more accurate measurement by running the command multiple times, like this:
In [3]: %timeit a_function()
1.01 s ± 1.45 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Alternatively, if you want to time your whole script, you can use the bash command time, like so…
$ time python my_script.py

real    0m1.041s
user    0m0.040s
sys     0m0.000s
These techniques are great if you want to get a quick sense of how long a script or a section of code takes to run, but they’re less useful when you want a more comprehensive picture. It would be a nightmare if you had to wrap each line in time.time() checks. In the next section, we’ll look at how to use Python’s built-in profiler.
Diving Deeper with cProfile
When you’re trying to get a better understanding of how your code is running, the first place to start is cProfile, Python’s built-in profiler. cProfile will keep track of how often and for how long parts of your program were executed.
Just keep in mind that cProfile shouldn’t be used to benchmark your code. It’s written in C which makes it fast but it still introduces some overhead that could throw off your times.
There are multiple ways to use cProfile but one simple way is from the command line.
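For reference, cProfile can also be driven from inside a program via the standard cProfile.Profile API. A minimal sketch with a toy workload (the `work` function is just a placeholder):

```python
import cProfile
import io
import pstats

def work():
    # Toy CPU-bound workload to give the profiler something to measure
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
work()
profiler.disable()

# Render the collected stats into a string instead of stdout
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)  # top 5 rows only
report = stream.getvalue()
print(report)
```

This is handy when you only want to profile one section of a long-running program rather than the whole script.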
Before we demo cProfile, let’s start by looking at a basic sample program that will download some text files, count the words in each one, and then save the top 10 words from each to a file. That being said, what the code does isn’t too important; we’ll just be using it to show how the profiler works.
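The demo script itself was embedded as a gist in the original post and isn’t reproduced here. Below is a sketch consistent with the function names that appear in the profile output later on; the URLs, the regex, and the exact logic are assumptions:

```python
import re
import urllib.request
from collections import Counter

def get_book(url):
    # Download a plain-text book over HTTP (the I/O-heavy step)
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

def split_words(text):
    # Split the text into lowercase words (the exact regex is an assumption)
    return re.split(r"[^a-z]+", text.lower())

def count_words(words):
    # Tally word frequencies, skipping empty strings produced by the split
    return Counter(w for w in words if w)

def read_books(urls):
    # Download and tally each book in turn
    return [count_words(split_words(get_book(url))) for url in urls]

def save_results(counts, path="results.txt"):
    # Write each book's 10 most common words to a file
    with open(path, "w") as f:
        for counter in counts:
            f.write(", ".join(word for word, _ in counter.most_common(10)) + "\n")
```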
(Demo code to test our profiler.)

Now, with the following command, we’ll profile our script.
$ python -m cProfile -o profile.stat script.py

The -o flag specifies an output file for cProfile to save the profiling statistics.
Next, we can fire up python to examine the results using the pstats module (also part of the standard library).
In [1]: import pstats
   ...: p = pstats.Stats("profile.stat")
   ...: p.sort_stats(
   ...:     "cumulative"  # sort by cumulative time spent
   ...: ).print_stats(
   ...:     "script.py"  # only show fn calls in script.py
   ...: )

Fri Aug 07 08:12:06 2020    profile.stat

         46338 function calls (45576 primitive calls) in 6.548 seconds

   Ordered by: cumulative time
   List reduced from 793 to 6 due to restriction <'script.py'>

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
        1    0.008    0.008    5.521    5.521  script.py:1(<module>)
        1    0.012    0.012    5.468    5.468  script.py:19(read_books)
        5    0.000    0.000    4.848    0.970  script.py:5(get_book)
        5    0.000    0.000    0.460    0.092  script.py:11(split_words)
        5    0.000    0.000    0.112    0.022  script.py:15(count_words)
        1    0.000    0.000    0.000    0.000  script.py:32(save_results)
Wow! Look at all that useful info!
For each function called, we’re seeing the following information:
ncalls: number of times the function was called
tottime: total time spent in the given function (excluding calls to sub-functions)
percall: tottime divided by ncalls
cumtime: total time spent in this function and all sub-functions
percall: (again) cumtime divided by ncalls
filename:lineno(function): the file name, line number, and function name
When reading this output, note that we’re hiding a lot of data: in fact, we’re only seeing 6 of 793 rows. The hidden rows are all the sub-functions called from within functions like urllib.request.urlopen or re.split. Also, note that the <module> row corresponds to the code in script.py that isn’t inside a function.
Now let’s look back at the results, sorted by cumulative duration.
ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     1    0.008    0.008    5.521    5.521  script.py:1(<module>)
     1    0.012    0.012    5.468    5.468  script.py:19(read_books)
     5    0.000    0.000    4.848    0.970  script.py:5(get_book)
     5    0.000    0.000    0.460    0.092  script.py:11(split_words)
     5    0.000    0.000    0.112    0.022  script.py:15(count_words)
     1    0.000    0.000    0.000    0.000  script.py:32(save_results)
Keep in mind the hierarchy of function calls. The top-level, <module>, calls read_books and save_results. read_books calls get_book, split_words, and count_words. By comparing cumulative times, we see that most of <module>’s time is spent in read_books and most of read_books’s time is spent in get_book, where we make our HTTP request, making this script (unsurprisingly) I/O bound.
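To make that call hierarchy explicit, pstats can also print caller/callee relationships directly. A self-contained sketch with stand-in functions (`main` and `helper` are ours, not the article’s script):

```python
import cProfile
import pstats

def helper():
    # A small function that main calls repeatedly
    return sum(i for i in range(10_000))

def main():
    return [helper() for _ in range(5)]

profiler = cProfile.Profile()
profiler.runcall(main)  # profile just this one call

p = pstats.Stats(profiler)
p.print_callees("main")    # what main spends its time calling
p.print_callers("helper")  # who calls helper, and how often
```

print_callees and print_callers take the same restriction arguments as print_stats, so you can drill into one function at a time.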
Next, let’s take a look at how we can be even more granular by profiling our code line-by-line.
Profiling Line-by-Line
Once we’ve used cProfile to get a sense of what function calls are taking the most time, we can examine those functions line-by-line to get an even clearer picture of where our time is being spent.
For this, we’ll need to install the line-profiler library with the following command:
$ pip install line-profiler

Once installed, we just need to add the @profile decorator to the function we want to profile. Here’s the updated snippet from our script:
Note that we don’t need to import the profile decorator function; it will be injected by line-profiler.
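The updated snippet itself isn’t reproduced here; a minimal standalone sketch of the decorator usage follows. `slow_sum` is a stand-in for whichever function you want to profile, and the try/except fallback is our addition so the same file still runs without kernprof:

```python
# kernprof injects `profile` into builtins at runtime; this fallback
# lets the script also run on its own (e.g. in tests).
try:
    profile
except NameError:
    def profile(func):
        return func  # no-op decorator when kernprof isn't driving us

@profile
def slow_sum(n):
    # A stand-in for the function whose lines you want timed
    total = 0
    for i in range(n):
        total += i * i
    return total
```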
Now, to profile our function, we can run the following:
$ kernprof -l -v script-prof.py

kernprof is installed along with line-profiler. The -l flag tells line-profiler to go line-by-line and the -v flag tells it to print the result to the terminal rather than save it to a file.
The result for our script would look something like this:
The key column to focus on here is % Time. As you can see, 89.5% of our time parsing each book is spent in the get_book function (making the HTTP request), further validating that our program is I/O bound rather than CPU bound.
Now, with this new info in mind, if we wanted to speed up our code we wouldn’t want to waste our time trying to make our word counter more efficient. It only takes a fraction of the time compared to the HTTP request. Instead, we’d focus on speeding up our requests, possibly by issuing them asynchronously.
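As one sketch of that idea, the standard library’s ThreadPoolExecutor can overlap the downloads so total wall time approaches the slowest single request instead of the sum of all of them (asyncio with an async HTTP client is another option). `get_book` and `fetch_all` here are our own helpers, not from the article’s script:

```python
from concurrent.futures import ThreadPoolExecutor
import urllib.request

def get_book(url):
    # One blocking HTTP download
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

def fetch_all(urls, fetch=get_book, max_workers=5):
    # Run the downloads on a thread pool; results come back in input order
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```

Threads work well here precisely because the profiler showed us the bottleneck is I/O: the GIL is released while each thread waits on the network.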
Here, the results are hardly surprising, but on a larger and more complicated program, line-profiler is an invaluable tool in our programming tool belt, allowing us to peer under the hood of our program and find the computational bottlenecks.
Profiling Memory
In addition to profiling the time-complexity of our program, we can also profile its memory-complexity.
In order to do line-by-line memory profiling, we’ll need to install the memory-profiler library which also uses the same @profile decorator to determine which function to profile.
$ pip install memory-profiler
$ python -m memory_profiler script.py

The result of running memory-profiler on our same script should look something like the following:
There are currently some issues with the accuracy of the “Increment” column, so just focus on the “Mem usage” column for now.
Our script had peak memory usage on line 28 when we split the books up into words.
Conclusion
Hopefully, now you’ll have a few new tools in your programming tool belt to help you write more efficient code and quickly determine how to best spend your optimization-time.
You can read more about cProfile here, line-profiler here, and memory-profiler here. I also highly recommend the book High Performance Python, by Micha Gorelick and Ian Ozsvald [1].
Thanks for reading! I’d love to hear your thoughts on profilers or data science or anything else. Comment below or reach out on LinkedIn or Twitter!
Source: https://towardsdatascience.com/data-scientists-start-using-profilers-4d2e08e7aec0