Real-Time Rendering — 18.5 Multiprocessing
Traditional APIs have evolved toward issuing fewer calls that each do more [443, 451]. The new generation of APIs—DirectX 12, Vulkan, Metal—takes a different strategy. For these APIs the drivers are streamlined and minimal, with much of the complexity and responsibility shifted to the calling application: validating state, allocating memory, and other functions [249, 1438, 1826]. This redesign was done in good part to minimize draw call and state change overhead, which comes from having to map older APIs to modern GPUs. The other element these new APIs encourage is using multiple CPU processors to call the API.
Around 2003 the trend of ever-rising clock speeds for CPUs flattened out at around 3.4 GHz, due to several physical issues such as heat dissipation and power consumption [1725]. These limits gave rise to multiprocessing CPUs, where instead of higher clock rates, more processor cores were put on a single chip. In fact, many small cores provide the best performance per unit area [75], which is the major reason why GPUs themselves are so effective. Creating efficient and reliable programs that exploit concurrency has been the challenge ever since. In this section we will cover the basic concepts of efficient multiprocessing on CPU cores, at the end discussing how graphics APIs have evolved to enable more concurrency within the driver itself.
Multiprocessor computers can be broadly classified into message-passing architectures and shared memory multiprocessors. In message-passing designs, each processor has its own memory area, and messages are sent between the processors to communicate results. These are not common in real-time rendering. Shared memory multiprocessors are just as they sound; all processors share a logical address space of memory among themselves. Most popular multiprocessor systems use shared memory, and most of these have a symmetric multiprocessing (SMP) design. SMP means that all the processors are identical. A multicore PC system is an example of a symmetric multiprocessing architecture.
多處理器計算機可以大致分為消息傳遞架構(gòu)和共享內(nèi)存多處理器。在消息傳遞設(shè)計中,每個處理器都有自己的存儲區(qū)域,并且在處理器之間發(fā)送消息以傳達結(jié)果。這些在實時渲染中并不常見。共享內(nèi)存多處理器就像它們聽起來一樣;所有處理器在它們之間共享存儲器的邏輯地址空間。大多數(shù)流行的多處理器系統(tǒng)使用共享內(nèi)存,其中大多數(shù)都有對稱多處理器(SMP)設(shè)計。SMP意味著所有處理器都是相同的。多核PC系統(tǒng)是對稱多處理架構(gòu)的一個示例。
Here, we will present two general methods for using multiple processors for real-time graphics. The first method—multiprocessor pipelining, also called temporal parallelism—will be covered in more detail than the second—parallel processing, also called spatial parallelism. These two methods are illustrated in Figure 18.8. These two types of parallelism are then brought together with task-based multiprocessing, where the application creates jobs that can each be picked up and processed by an individual core.
在這里,我們將介紹兩種使用多個處理器進行實時圖形處理的通用方法。第一種方法多處理器流水線(也稱為時間并行)將比第二種并行處理(也稱為空間并行)更詳細地介紹。這兩種方法如圖18.8所示。然后,這兩種類型的并行性與基于任務(wù)的多處理結(jié)合在一起,應(yīng)用程序創(chuàng)建作業(yè),每個作業(yè)都可以由單個核心拾取和處理。
Figure 18.8. Two different ways of using multiple processors. At the top we show how three processors (CPUs) are used in a multiprocessor pipeline, and at the bottom we show parallel execution on three CPUs. One of the differences between these two implementations is that lower latency can be achieved if the configuration at the bottom is used. On the other hand, it may be easier to use a multiprocessor pipeline. The ideal speedup for both of these configurations is linear, i.e., using n CPUs would give a speedup of n times.
圖18.8.使用多個處理器的兩種不同方式。在頂部,我們展示了如何在多處理器流水線中使用三個處理器(CPU),在底部,我們展示三個CPU上的并行執(zhí)行。這兩種實現(xiàn)之間的區(qū)別之一是,如果使用底部的配置,可以實現(xiàn)更低的延遲。另一方面,使用多處理器流水線可能更容易。這兩種配置的理想加速都是線性的,即使用n個CPU可以實現(xiàn)n倍的加速。
18.5.1 Multiprocessor Pipelining
As we have seen, pipelining is a method for speeding up execution by dividing a job into certain pipeline stages that are executed in parallel. The result from one pipeline stage is passed on to the next. The ideal speedup is n times for n pipeline stages, and the slowest stage (the bottleneck) determines the actual speedup. Up to this point, we have seen pipelining used with a single CPU core and a GPU to run the application, geometry processing, rasterization, and pixel processing in parallel. Pipelining can also be used when multiple processors are available on the host, and in these cases, it is called multiprocessor pipelining or software pipelining.
正如我們所看到的,流水線是一種通過將作業(yè)劃分為并行執(zhí)行的特定流水線階段來加快執(zhí)行速度的方法。來自一個流水線階段的結(jié)果被傳遞到下一個。理想的加速是n個流水線級的n倍,最慢的級(瓶頸)決定了實際的加速。到目前為止,我們已經(jīng)看到流水線與單個CPU內(nèi)核和GPU一起使用,以并行運行應(yīng)用程序、幾何處理、光柵化和像素處理。當(dāng)主機上有多個處理器可用時,也可以使用流水線,在這些情況下,稱為多處理器流水線或軟件流水線。
Here we describe one type of software pipelining. Endless variations are possible and the method should be adapted to the particular application. In this example, the application stage is divided into three stages [1508]: APP, CULL, and DRAW. This is coarse-grained pipelining, which means that each stage is relatively long. The APP stage is the first stage in the pipeline and therefore controls the others. It is in this stage that the application programmer can put in additional code that does, for example, collision detection. This stage also updates the viewpoint. The CULL stage can perform:
這里我們描述一種類型的軟件流水線。無限變化是可能的,并且該方法應(yīng)適合特定應(yīng)用。在本例中,應(yīng)用階段分為三個階段[1508]:APP、CULL和DRAW。這是粗粒度流水線,這意味著每個階段都相對較長。APP階段是管道中的第一階段,因此控制其他階段。正是在這一階段,應(yīng)用程序程序員可以添加額外的代碼,例如進行碰撞檢測。此階段還更新視點。CULL階段可以執(zhí)行:
• Traversal and hierarchical view frustum culling on a scene graph (Section 19.4).
• Level of detail selection (Section 19.9).
• State sorting, as discussed in Section 18.4.5.
• Finally (and always performed), generation of a simple list of all objects that should be rendered.
?場景圖上的遍歷和層次視圖截頭體剔除(第19.4節(jié))。
?詳細程度選擇(第19.9節(jié))。
?狀態(tài)分類,如第18.4.5節(jié)所述。
?最后(并始終執(zhí)行),生成應(yīng)渲染的所有對象的簡單列表。
The DRAW stage takes the list from the CULL stage and issues all graphics calls in this list. This means that it simply walks through the list and feeds the GPU. Figure 18.9 shows some examples of how this pipeline can be used.
DRAW階段從CULL階段獲取列表,并發(fā)出該列表中的所有圖形調(diào)用。這意味著它只需遍歷列表并向GPU提供信息。圖18.9顯示了如何使用該管道的一些示例。
Figure 18.9. Different configurations for a multiprocessor pipeline. The thick lines represent synchronization between the stages, and the subscripts represent the frame number. At the top, a single CPU pipeline is shown. In the middle and at the bottom are shown two different pipeline subdivisions using two CPUs. The middle has one pipeline stage for APP and CULL and one pipeline stage for DRAW. This is a suitable subdivision if DRAW has much more work to do than the others. At the bottom, APP has one pipeline stage and the other two have another. This is suitable if APP has much more work than the others. Note that the bottom two configurations have more time for the APP, CULL, and DRAW stages.
圖18.9.多處理器流水線的不同配置。粗線表示階段之間的同步,下標(biāo)表示幀號。在頂部,顯示了單個CPU管道。中間和底部顯示了使用兩個CPU的兩個不同的管道細分。中間有一個管道級用于APP和CULL,一個管道階段用于DRAW。如果DRAW要做的工作比其他工作多得多,這是一個合適的細分。在底部,APP有一個流水線階段,其他兩個有另一個。如果應(yīng)用程序比其他應(yīng)用程序有更多的工作,這是合適的。請注意,底部兩個配置有更多的時間用于APP、CULL和DRAW階段。
If one processor core is available, then all three stages are run on that core. If two CPU cores are available, then APP and CULL can be executed on one core and DRAW on the other. Another configuration would be to execute APP on one core and CULL and DRAW on the other. Which is the best depends on the workloads for the different stages. Finally, if the host has three cores available, then each stage can be executed on a separate core. This possibility is shown in Figure 18.10.
如果一個處理器內(nèi)核可用,那么所有三個階段都在該內(nèi)核上運行。如果有兩個CPU內(nèi)核可用,則可以在一個內(nèi)核上執(zhí)行APP和CULL,在另一個內(nèi)核中執(zhí)行DRAW。另一種配置是在一個核心上執(zhí)行APP,在另一個核心執(zhí)行CULL和DRAW。哪個最佳取決于不同階段的工作負載。最后,如果主機有三個可用的核心,那么每個階段都可以在單獨的核心上執(zhí)行。這種可能性如圖18.10所示。
Figure 18.10. At the top, a three-stage pipeline is shown. In comparison to the configurations in Figure 18.9, this configuration has more time for each pipeline stage. The bottom illustration shows a way to reduce the latency: The CULL and the DRAW are overlapped with FIFO buffering in between.
圖18.10.頂部顯示了三級管道。與圖18.9中的配置相比,該配置在每個管道階段有更多的時間。下圖顯示了一種減少延遲的方法:CULL和DRAW重疊,中間有FIFO緩沖。
The advantage of this technique is that the throughput, i.e., the rendering speed, increases. The downside is that, compared to parallel processing, the latency is greater. Latency, or temporal delay, is the time it takes from the polling of the user's actions to the final image [1849]. This should not be confused with frame rate, which is the number of frames displayed per second. For example, say the user is using an untethered head-mounted display. The determination of the head's position may take 10 milliseconds to reach the CPU, then it takes 15 milliseconds to render the frame. The latency is then 25 milliseconds from initial input to display. Even though the frame rate is 66.7 Hz (1/0.015 seconds), if no location prediction or other compensation is performed, interactivity can feel sluggish because of the delay in sending the position changes to the CPU. Ignoring any delay due to user interaction (which is a constant under both systems), multiprocessor pipelining has more latency than parallel processing because it uses a pipeline. As is discussed in detail in the next section, parallel processing breaks up the frame's work into pieces that are run concurrently.
這種技術(shù)的優(yōu)點是吞吐量(即渲染速度)增加。缺點是,與并行處理相比,延遲更大。延遲或時間延遲是指從輪詢用戶動作到最終圖像所需的時間[1849]。這不應(yīng)與幀速率混淆,幀速率是每秒顯示的幀數(shù)。例如,假設(shè)用戶使用的是無限制的頭戴式顯示器。頭部位置的確定可能需要10毫秒才能到達CPU,然后渲染幀需要15毫秒。從初始輸入到顯示,等待時間為25毫秒。即使幀速率為66.7 Hz(1/0.015秒),如果不執(zhí)行位置預(yù)測或其他補償,由于向CPU發(fā)送位置變化的延遲,交互性可能會感覺遲緩。忽略由于用戶交互引起的任何延遲(這在兩種系統(tǒng)下都是常數(shù)),多處理比并行處理具有更大的延遲,因為它使用流水線。正如在下一節(jié)中詳細討論的,并行處理將框架的工作分解為并行運行的部分。
In comparison to using a single CPU on the host, multiprocessor pipelining gives a higher frame rate and the latency is about the same or a little greater due to the cost of synchronization. The latency increases with the number of stages in the pipeline. For a well-balanced application the speedup is n times for n CPUs.
與在主機上使用單個CPU相比,多處理器流水線提供了更高的幀速率,由于同步成本的原因,延遲大約相同或稍大。延遲隨著管道中階段的數(shù)量而增加。對于一個平衡良好的應(yīng)用程序,n個CPU的加速是n倍。
One technique for reducing the latency is to update the viewpoint and other latency-critical parameters at the end of the APP stage [1508]. This reduces the latency by (approximately) one frame. Another way to reduce latency is to execute CULL and DRAW overlapped. This means that the result from CULL is sent over to DRAW as soon as anything is ready for rendering. For this to work, there has to be some buffering, typically a FIFO, between those stages. The stages are stalled on empty and full conditions; i.e., when the buffer is full, then CULL has to stall, and when the buffer is empty, DRAW has to starve. The disadvantage is that techniques such as state sorting cannot be used to the same extent, since primitives have to be rendered as soon as they have been processed by CULL. This latency reduction technique is visualized in Figure 18.10.
減少延遲的一種技術(shù)是在APP階段結(jié)束時更新視點和其他延遲關(guān)鍵參數(shù)[1508]。這將延遲減少(大約)一幀。另一種減少延遲的方法是重疊執(zhí)行CULL和DRAW。這意味著一旦準(zhǔn)備好進行渲染,就將CULL的結(jié)果發(fā)送到DRAW。為了實現(xiàn)這一點,這些階段之間必須有一些緩沖,通常是FIFO。在空載和滿載條件下,各階段均處于停滯狀態(tài);i、 例如,當(dāng)緩沖區(qū)已滿時,CULL必須暫停,而當(dāng)緩沖區(qū)為空時,DRAW必須饑餓。缺點是,諸如狀態(tài)排序之類的技術(shù)不能在相同程度上使用,因為基本體必須在CULL處理后立即呈現(xiàn)。這種延遲減少技術(shù)如圖18.10所示。
The pipeline in this figure uses a maximum of three CPUs, and the stages have certain tasks. However, this technique is in no way limited to this configuration—rather, you can use any number of CPUs and divide the work in any way you want. The key is to make a smart division of the entire job to be done so that the pipeline tends to be balanced. The multiprocessor pipelining technique requires a minimum of synchronization in that it needs to synchronize only when switching frames. Additional processors can also be used for parallel processing, which needs more frequent synchronization.
此圖中的管道最多使用三個CPU,各個階段都有特定的任務(wù)。然而,這種技術(shù)絕不局限于這種配置,相反,您可以使用任意數(shù)量的CPU,并以您想要的任何方式分配工作。關(guān)鍵是要對要完成的整個工作進行明智的劃分,以便管道趨于平衡。多處理器流水線技術(shù)需要最少的同步,因為它只需要在切換幀時進行同步。額外的處理器也可以用于并行處理,這需要更頻繁的同步。
18.5.2 Parallel Processing
A major disadvantage of using a multiprocessor pipeline technique is that the latency tends to increase. For some applications, such as flight simulators, first-person shooters, and virtual reality rendering, this is not acceptable. When moving the viewpoint, you usually want instant (next-frame) response, but when the latency is long this will not happen. That said, it all depends. If multiprocessing raised the frame rate from 30 FPS with 1 frame latency to 60 FPS with 2 frames latency, the extra frame delay would have no perceptible difference.
使用多處理器流水線技術(shù)的一個主要缺點是延遲往往會增加。對于一些應(yīng)用程序,如飛行模擬器、第一人稱射擊游戲和虛擬現(xiàn)實渲染,這是不可接受的。移動視點時,通常需要即時(下一幀)響應(yīng),但當(dāng)延遲很長時,這不會發(fā)生。也就是說,這一切都取決于。如果多處理將幀速率從1幀延遲的30 FPS提高到2幀延遲的60 FPS,則額外的幀延遲將沒有明顯的差異。
If multiple processors are available, one can also try to run sections of the code concurrently, which may result in shorter latency. To do this, the program's tasks must possess the characteristics of parallelism. There are several different methods for parallelizing an algorithm. Assume that n processors are available. Using static assignment [313], the total work package, such as the traversal of an acceleration structure, is divided into n work packages. Each processor then takes care of a work package, and all processors execute their work packages in parallel. When all processors have completed their work packages, it may be necessary to merge the results from the processors. For this to work, the workload must be highly predictable.
如果有多個處理器可用,也可以嘗試并發(fā)運行代碼部分,這可能會縮短延遲。要做到這一點,程序的任務(wù)必須具有并行性的特點。有幾種不同的算法并行化方法。假設(shè)有n個處理器可用。使用靜態(tài)賦值[313],將整個工作包(如加速結(jié)構(gòu)的遍歷)劃分為n個工作包。然后,每個處理器處理一個工作包,所有處理器并行執(zhí)行其工作包。當(dāng)所有處理者都完成其工作包后,可能需要合并處理者的結(jié)果。為了實現(xiàn)這一點,工作量必須是高度可預(yù)測的。
When this is not the case, dynamic assignment algorithms that adapt to different workloads may be used [313]. These use one or more work pools. When jobs are generated, they are put into the work pools. CPUs can then fetch one or more jobs from the queue when they have finished their current job. Care must be taken so that only one CPU can fetch a particular job, and so that the overhead in maintaining the queue does not damage performance. Larger jobs mean that the overhead for maintaining the queue becomes less of a problem, but, on the other hand, if the jobs are too large, then performance may degrade due to imbalance in the system—i.e., one or more CPUs may starve.
當(dāng)情況并非如此時,可以使用適應(yīng)不同工作負載的動態(tài)分配算法[313]。它們使用一個或多個工作池。生成作業(yè)后,它們將被放入工作池。CPU完成當(dāng)前作業(yè)后,可以從隊列中提取一個或多個作業(yè)。必須注意,只有一個CPU可以獲取特定的作業(yè),并且維護隊列的開銷不會損害性能。更大的作業(yè)意味著維護隊列的開銷問題更小,但另一方面,如果作業(yè)太大,則性能可能會由于系統(tǒng)中的不平衡而降低,即一個或多個CPU可能會不足。
As for the multiprocessor pipeline, the ideal speedup for a parallel program running on n processors would be n times. This is called linear speedup. Even though linear speedup rarely happens, actual results can sometimes be close to it.
對于多處理器流水線,在n個處理器上運行的并行程序的理想速度是n倍。這稱為線性加速。盡管線性加速很少發(fā)生,但實際結(jié)果有時可能接近于此。
In Figure 18.8 on page 807, both a multiprocessor pipeline and a parallel processing system with three CPUs are shown. Temporarily assume that these should do the same amount of work for each frame and that both configurations achieve linear speedup. This means that the execution will run three times faster in comparison to serial execution (i.e., on a single CPU). Furthermore, we assume that the total amount of work per frame takes 30 ms, which would mean that the maximum frame rate on a single CPU would be 1/0.03 ≈ 33 frames per second.
在第807頁的圖18.8中,顯示了多處理器流水線和具有三個CPU的并行處理系統(tǒng)。暫時假設(shè)這些應(yīng)該為每個幀做相同的工作量,并且兩種配置都實現(xiàn)了線性加速。這意味著與串行執(zhí)行(即,在單個CPU上)相比,執(zhí)行速度將快三倍。此外,我們假設(shè)每幀的總工作量需要30毫秒,這意味著單個CPU上的最大幀速率為1/0.03≈ 每秒33幀。
The multiprocessor pipeline would (ideally) divide the work into three equal-sized work packages and let each of the CPUs be responsible for one work package. Each work package should then take 10 ms to complete. If we follow the work flow through the pipeline, we will see that the first CPU in the pipeline does work for 10 ms (i.e., one third of the job) and then sends it on to the next CPU. The first CPU then starts working on the first part of the next frame. When a frame is finally finished, it has taken 30 ms for it to complete, but since the work has been done in parallel in the pipeline, one frame will be finished every 10 ms. So, the latency is 30 ms, and the speedup is a factor of three (30/10), resulting in 100 frames per second.
多處理器流水線(理想情況下)將工作分成三個大小相等的工作包,并讓每個CPU負責(zé)一個工作包。然后,每個工作包需要10毫秒才能完成。如果我們遵循流水線中的工作流程,我們將看到流水線中第一個CPU工作了10毫秒(即作業(yè)的三分之一),然后將其發(fā)送到下一個CPU。然后,第一CPU開始處理下一幀的第一部分。當(dāng)一幀最終完成時,它需要30毫秒才能完成,但由于這項工作是在流水線中并行完成的,因此每10毫秒就會完成一幀。因此,延遲是30毫秒,加速是三倍(30/10),導(dǎo)致每秒100幀。
A parallel version of the same program would also divide the jobs into three work packages, but these three packages will execute at the same time on the three CPUs. This means that the latency will be 10 ms, and the work for one frame will also take 10 ms. The conclusion is that the latency is much shorter when using parallel processing than when using a multiprocessor pipeline.
18.5.3 Task-Based Multiprocessing
Knowing about pipelining and parallel processing techniques, it is natural to combine both in a single system. If there are only a few processors available, it might make sense to have a simple system of explicitly assigning systems to a particular core. However, given the large number of cores on many CPUs, the trend has been to use task-based multiprocessing. Just as one can create several tasks (also called jobs) for a process that can be parallelized, this idea can be broadened to include pipelining. Any task generated by any core is put into the work pool as it is generated. Any free processor gets a task to work on.
了解流水線和并行處理技術(shù),很自然地將兩者結(jié)合在一個系統(tǒng)中。如果只有幾個處理器可用,那么使用一個簡單的系統(tǒng)將系統(tǒng)顯式地分配給特定的核心可能是有意義的。然而,考慮到許多CPU上有大量內(nèi)核,趨勢是使用基于任務(wù)的多處理。正如可以為一個可以并行化的流程創(chuàng)建多個任務(wù)(也稱為作業(yè))一樣,這個想法可以擴展到包括流水線。任何核心生成的任何任務(wù)都會在生成時放入工作池。任何空閑的處理器都有一個任務(wù)要處理。
One way to convert to multiprocessing is to take an application’s workflow and determine which systems are dependent on others. See Figure 18.11.
轉(zhuǎn)換為多處理的一種方法是采用應(yīng)用程序工作流,并確定哪些系統(tǒng)依賴于其他系統(tǒng)。如圖18.11所示。
Figure 18.11. Frostbite CPU job graph, with one small zoomed-in part inset [45]. (Figure courtesy of Johan Andersson—Electronic Arts.)
圖18.11.Frostbite CPU作業(yè)圖,一個小的放大部分插圖[45]。(圖片由Johan Andersson Electronic Arts提供)
Having a processor stall while waiting for synchronization means a task-based version of the application could even become slower due to this cost and the overhead for task management [1854]. However, many programs and algorithms do have a large number of tasks that can be performed at the same time and can therefore benefit.
在等待同步時處理器暫停意味著應(yīng)用程序的基于任務(wù)的版本甚至可能會因為此成本和任務(wù)管理開銷而變得更慢[1854]。然而,許多程序和算法確實有大量可以同時執(zhí)行的任務(wù),因此可以受益。
The next step is to determine what parts of each system can be decomposed into tasks. Characteristics of a piece of code that is a good candidate to become a task include [45, 1060, 1854]:
下一步是確定每個系統(tǒng)的哪些部分可以分解為任務(wù)。適合成為任務(wù)候選的一段代碼的特征包括[4510601854]:
• The task has a well-defined input and output.
• The task is independent and stateless when run, and always completes.
• It is not so large a task that it often becomes the only process running.
?任務(wù)具有明確定義的輸入和輸出。
?任務(wù)在運行時是獨立的、無狀態(tài)的,并且始終完成。
?任務(wù)不是那么大,它往往成為唯一運行的流程。
Languages such as C++11 have facilities built into them for multithreading [1445]. On Intel-compatible systems, Intel's open-source Threading Building Blocks (TBB) is an efficient library that simplifies task generation, pooling, and synchronization [92].
C++11等語言內(nèi)置了多線程功能[1445]。在與英特爾兼容的系統(tǒng)上,英特爾的開源線程構(gòu)建塊(TBB)是一個高效的庫,可以簡化任務(wù)生成、池化和同步[92]。
Having the application create its own sets of tasks that are multiprocessed, such as simulation, collision detection, occlusion testing, and path planning, is a given when performance is critical [45, 92, 1445, 1477, 1854]. We note here again that there are also times when the GPU cores tend to be idle. For example, these are usually underused during shadow map generation or a depth prepass. During such idle times, compute shaders can be applied to other tasks [1313, 1884]. Depending on the architecture, API, and content, it is sometimes the case that the rendering pipeline cannot keep all the shaders busy, meaning that there is always some pool available for compute shading. We will not tackle the topic of optimizing these, as Lauritzen makes a convincing argument that writing fast and portable compute shaders is not possible, due to hardware differences and language limitations [993]. How to optimize the core rendering pipeline itself is the subject of the next section.
當(dāng)性能至關(guān)重要時,讓應(yīng)用程序創(chuàng)建自己的多處理任務(wù)集,如模擬、碰撞檢測、遮擋測試和路徑規(guī)劃,這是一個給定的條件[45,92,1445,1477,1854]。我們在此再次注意到,有時GPU內(nèi)核趨于空閑。例如,這些通常在陰影貼圖生成或深度預(yù)處理期間使用不足。在這種空閑時間期間,計算著色器可以應(yīng)用于其他任務(wù)[13131884]。根據(jù)體系結(jié)構(gòu)、API和內(nèi)容,有時渲染管道無法使所有著色器保持忙碌,這意味著總有一些池可用于計算著色。我們不會討論優(yōu)化這些著色器的問題,因為Lauritzen提出了一個令人信服的論點,即由于硬件差異和語言限制,編寫快速和可移植的計算著色器是不可能的[993]。如何優(yōu)化核心渲染管道本身是下一節(jié)的主題。
18.5.4 Graphics API Multiprocessing Support
Parallel processing often does not map to hardware constraints. For example, DirectX 10 and earlier allow only one thread to access the graphics driver at a time, so parallel processing for the actual draw stage is more difficult [1477].
并行處理通常不會映射到硬件約束。例如,DirectX 10和更早版本一次只允許一個線程訪問圖形驅(qū)動程序,因此實際繪制階段的并行處理更加困難[1477]。
There are two operations in a graphics driver that can potentially use multiple processors: resource creation and render-related calls. Creating resources such as textures and buffers can be purely CPU-side operations and so are naturally parallelizable. That said, creation and deletion can also be blocking tasks, as they might trigger operations on the GPU or need a particular device context. In any case, older APIs were created before consumer-level multiprocessing CPUs existed, so needed to be rewritten to support such concurrency.
圖形驅(qū)動程序中有兩種操作可能使用多個處理器:資源創(chuàng)建和渲染相關(guān)調(diào)用。創(chuàng)建諸如紋理和緩沖區(qū)之類的資源可以是純粹的CPU端操作,因此自然是可并行的。也就是說,創(chuàng)建和刪除也可能是阻塞任務(wù),因為它們可能會觸發(fā)GPU上的操作或需要特定的設(shè)備上下文。無論如何,舊的API是在消費者級多處理CPU存在之前創(chuàng)建的,因此需要重寫以支持這種并發(fā)性。
A key construct used is the command buffer or command list, which harks back to an old OpenGL concept called the display list. A command buffer (CB) is a list of API state change and draw calls. Such lists can be created, stored, and replayed as desired. They may also be combined to form longer command buffers. Only a single CPU processor communicates with the GPU via the driver and so can send it a CB for execution. However, every processor (including this single processor) can create or concatenate stored command buffers in parallel.
使用的一個關(guān)鍵結(jié)構(gòu)是命令緩沖區(qū)或命令列表,它可以追溯到舊的OpenGL概念,稱為顯示列表。命令緩沖區(qū)(CB)是API狀態(tài)更改和繪制調(diào)用的列表。可以根據(jù)需要創(chuàng)建、存儲和重放這些列表。它們也可以組合起來形成更長的命令緩沖區(qū)。只有一個CPU處理器通過驅(qū)動程序與GPU通信,因此可以向其發(fā)送CB以供執(zhí)行。但是,每個處理器(包括這個單個處理器)都可以并行創(chuàng)建或連接存儲的命令緩沖區(qū)。
In DirectX 11, for example, the processor that communicates with the driver sends its render calls to what is called the immediate context. The other processors each use a deferred context to generate command buffers. As the name implies, these are not directly sent to the driver. Instead, these are sent to the immediate context for rendering. See Figure 18.12. Alternately, a command buffer can be sent to another deferred context, which inserts it into its own CB. Beyond sending a command buffer to the driver for execution, the main operations that the immediate context can perform that the deferred cannot are GPU queries and readbacks. Otherwise, command buffer management looks the same from either type of context.
Figure 18.12. Command buffers. Each processor uses its deferred context, shown in orange, to create and populate one or more command buffers, shown in blue. Each command buffer is sent to Process #1, which executes these as desired, using its immediate context, shown in green. Process #1 can do other operations while waiting for command buffer N from Process #3. (After Zink et al. [1971].)
圖18.12.命令緩沖區(qū)。每個處理器都使用其延遲上下文(以橙色顯示)來創(chuàng)建和填充一個或多個命令緩沖區(qū)(以藍色顯示)。每個命令緩沖區(qū)都被發(fā)送到進程#1,進程#1根據(jù)需要使用其直接上下文執(zhí)行這些命令,如綠色所示。進程#1可以在等待來自進程#3的命令緩沖區(qū)N時執(zhí)行其他操作
An advantage of command buffers, and their predecessor, display lists, is that they can be stored and replayed. Command buffers are not fully bound when created, which aids in their reuse. For example, say a CB contains a view matrix. The camera moves, so the view matrix changes. However, the view matrix is stored in a constant buffer. The constant buffer's contents are not stored in the CB, only the reference to them. The contents of the constant buffer can be changed without having to rebuild the CB. Determining how best to maximize parallelism involves choosing a suitable granularity—per view, per object, per material—to create, store, and combine command buffers [1971].
命令緩沖區(qū)及其前身顯示列表的一個優(yōu)點是可以存儲和重放它們。命令緩沖區(qū)在創(chuàng)建時沒有完全綁定,這有助于它們的重用。例如,假設(shè)CB包含視圖矩陣。相機移動,因此視圖矩陣發(fā)生變化。然而,視圖矩陣存儲在恒定緩沖區(qū)中。常量緩沖區(qū)的內(nèi)容不存儲在CB中,只有對它們的引用。可以更改常量緩沖區(qū)的內(nèi)容,而無需重建CB。確定如何最好地最大化并行性涉及到為每個視圖、每個對象、每個材質(zhì)選擇合適的粒度來創(chuàng)建、存儲和組合命令緩沖區(qū)[1971]。
Such multithreading draw systems existed for years before command buffers were made a part of modern APIs [1152, 1349, 1552, 1554]. API support makes the process simpler and lets more tools work with the system created. However, command lists do have creation and memory costs associated with them. Also, the expense of mapping an API’s state settings to the underlying GPU is still a costly operation with DirectX 11 and OpenGL, as discussed in Section 18.4.2. Within these systems command buffers can help when the application is the bottleneck, but can be detrimental when the driver is.
在命令緩沖區(qū)成為現(xiàn)代API的一部分之前,這種多線程繪制系統(tǒng)已經(jīng)存在多年[1152134915521554]。API支持使流程更簡單,并允許更多工具與創(chuàng)建的系統(tǒng)一起工作。然而,命令列表確實有與之相關(guān)的創(chuàng)建和內(nèi)存成本。此外,使用DirectX 11和OpenGL,將API的狀態(tài)設(shè)置映射到底層GPU的開銷仍然是一項昂貴的操作,如第18.4.2節(jié)所述。在這些系統(tǒng)中,命令緩沖區(qū)可以在應(yīng)用程序遇到瓶頸時有所幫助,但在驅(qū)動程序遇到瓶頸的情況下可能會有害。
Certain semantics in these earlier APIs did not allow the driver to parallelize various operations, which helped motivate the development of Vulkan, DirectX 12, and Metal. A thin draw submission interface that maps well to modern GPUs minimizes the driver costs of these newer APIs. Command buffer management, memory allocation, and synchronization decisions become the responsibility of the application instead of the driver. In addition, command buffers with these newer APIs are validated once when formed, so repeated playback has less overhead than those used with earlier APIs such as DirectX 11. All these elements combine to improve API efficiency, allow multiprocessing, and lessen the chances that the driver is the bottleneck.
這些早期API中的某些語義不允許驅(qū)動程序并行化各種操作,這有助于推動Vulkan、DirectX 12和Metal的開發(fā)。一個可以很好地映射到現(xiàn)代GPU的精簡繪圖提交界面將這些新API的驅(qū)動程序成本降至最低。命令緩沖區(qū)管理、內(nèi)存分配和同步?jīng)Q策由應(yīng)用程序而不是驅(qū)動程序負責(zé)。此外,這些更新的API的命令緩沖區(qū)在形成時會被驗證一次,因此重復(fù)播放的開銷比以前的API(如DirectX 11)要小。所有這些元素結(jié)合起來可以提高API效率,允許多處理,并減少驅(qū)動程序成為瓶頸的機會。
Further Reading and Resources
Mobile devices can have a different balance of where time is spent, especially if they use a tile-based architecture. Merry [1200] discusses these costs and how to use this type of GPU effectively. Pranckevičius and Zioma [1433] provide an in-depth presentation on many aspects of optimizing for mobile devices. McCaffrey [1156] compares mobile versus desktop architectures and performance characteristics. Pixel shading is often the largest cost on mobile GPUs. Sathe [1545] and Etuaho [443] discuss shader precision issues and optimization on mobile devices.
移動設(shè)備可以有不同的時間平衡,特別是如果它們使用基于瓦片的架構(gòu)。Merry[1200]討論了這些成本以及如何有效地使用這類GPU。Pranckeviís和Zioma[1433]對移動設(shè)備優(yōu)化的許多方面進行了深入介紹。McCaffrey[1156]比較了移動與桌面架構(gòu)和性能特征。像素著色通常是移動GPU上的最大成本。Sathe[1545]和Etuaho[443]討論了移動設(shè)備上的著色器精度問題和優(yōu)化。
For the desktop, Wiesendanger [1882] gives a thorough walkthrough of a modern game engine's architecture. O'Donnell [1313] presents the benefits of a graph-based rendering system. Zink et al. [1971] discuss DirectX 11 in depth. De Smedt [331] provides guidance as to the common hotspots found in video games, including optimizations for DirectX 11 and 12, for multiple-GPU configurations, and for virtual reality. Coombes [291] gives a rundown of DirectX 12 best practices, and Kubisch [946] provides a guide for when to use Vulkan. There are numerous presentations about porting from older APIs to DirectX 12 and Vulkan [249, 536, 699, 1438]. By the time you read this, there will undoubtedly be more. Check IHV developer sites, such as NVIDIA, AMD, and Intel; the Khronos Group; and the web at large, as well as this book's website.
對于桌面,Wiesendanger[1882]對現(xiàn)代游戲引擎的架構(gòu)進行了全面的介紹。O'Donnell[1313]介紹了基于圖形的渲染系統(tǒng)的優(yōu)點。Zink等人[1971]深入討論了DirectX 11。De Smedt[331]為視頻游戲中常見的熱點提供了指導(dǎo),包括DirectX 11和12、多GPU配置和虛擬現(xiàn)實的優(yōu)化。Coombes[291]簡要介紹了DirectX 12的最佳實踐,Kubisch[946]提供了何時使用Vulkan的指南。有許多關(guān)于從舊API移植到DirectX 12和Vulkan的演示[249、536、699、1438]。當(dāng)你讀到這篇文章時,毫無疑問會有更多。查看IHV開發(fā)者網(wǎng)站,如NVIDIA、AMD和Intel;Khronos集團;以及本書的網(wǎng)站。
Though a little dated, Cebenoyan's article [240] is still relevant. It gives an overview of how to find the bottleneck and techniques to improve efficiency. Some popular optimization guides for C++ are Fog's [476] and Isensee's [801], free on the web. Hughes et al. [783] provide a modern, in-depth discussion of how to use trace tools and GPUView to analyze where bottlenecks occur. Though focused on virtual reality systems, the techniques discussed are applicable to any Windows-based machine. Sutter [1725] discusses how CPU clock rates leveled out and multiprocessor chipsets arose. For more on why this change occurred and for information on how chips are designed, see the in-depth report by Asanovic et al. [75]. Foley [478] discusses various forms of parallelism in the context of graphics application development. Game Engine Gems 2 [1024] has several articles on programming multithreaded elements for game engines. Preshing [1445] explains how Ubisoft uses multithreading and gives specifics on using C++11's threading support. Tatarchuk [1749, 1750] gives two detailed presentations on the multithreaded architecture and shading pipeline used for the game Destiny.
盡管有點過時,Cebenoyan的文章[240]仍然相關(guān)。它概述了如何找到瓶頸和提高效率的技術(shù)。一些流行的C++優(yōu)化指南是Fog的[476]和Isensee的[801],在網(wǎng)上免費提供。Hughes等人[783]對如何使用跟蹤工具和GPUView分析瓶頸發(fā)生的位置進行了現(xiàn)代深入的討論。盡管關(guān)注于虛擬現(xiàn)實系統(tǒng),但所討論的技術(shù)適用于任何基于Windows的機器。Sutter[1725]討論了CPU時鐘速率如何均衡以及多處理器芯片組的產(chǎn)生。有關(guān)這種變化發(fā)生的原因以及芯片設(shè)計方式的更多信息,請參見Asanovic等人的深入報告。[75]。Foley[478]討論了圖形應(yīng)用程序開發(fā)背景下的各種并行形式。Game Engine Gems 2[1024]有幾篇關(guān)于為游戲引擎編程多線程元素的文章。Preshing[1445]解釋了育碧如何使用多線程,并給出了使用C++11的線程支持的細節(jié)。Tatarchuk[17491750]對游戲《命運》中使用的多線程架構(gòu)和著色管道進行了兩次詳細介紹。
總結(jié)
以上是生活随笔為你收集整理的Real-Time Rendering——18.5 Multiprocessing多处理的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: JavaScript特效——开关灯泡
- 下一篇: 自动关机win10_win10系统U盘使