DSP Code Optimization
1 Make use of the optimization tools bundled with CCS
1.1 Choose appropriate compiler options
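- Options that must always be used: -o[2|3].
- -mt may be used (make sure the data being written and the data being read do not overlap in memory).
- -mh<num>: specify the speculative load byte count threshold.
- If the source contains code that is never executed, use -mo (place each function in a separate subsection).
- If the size of the executable matters, add -ms[0-3].
- Do not add -g, -gp, -ss, -ml3, or -mu.
- Can -s [-k -al], -mw, (-on2 -o3), and --consultant produce analysis information without affecting the performance of the generated code?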
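1.2 Make sure the loop counter (the i in a typical for (i = 0; i < N; i++)) is a signed integer.
1.3 Give the compiler as much information as possible, for example the restrict qualifier, the MUST_ITERATE pragma, and _nassert().
1.4 Align data structures in memory, using DATA_ALIGN, STRUCT_ALIGN, and so on.
1.5 Consult the advice given by Compiler Consultant and the .nfo file.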
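Items 1.2 to 1.4 can be combined in a single routine. The following is a minimal sketch, not code from the original article: the function name scale_q15, the buffers in_buf and out_buf, and the host fallback for _nassert are all assumptions made for illustration. The TI-specific pragmas are guarded by __TI_COMPILER_VERSION__ so that the same file also builds with a host compiler.

```c
#include <assert.h>

#ifdef __TI_COMPILER_VERSION__
/* TI compiler only: place both example buffers on 8-byte boundaries (item 1.4). */
#pragma DATA_ALIGN(in_buf, 8);
#pragma DATA_ALIGN(out_buf, 8);
#else
/* Host fallback (assumption): _nassert is a TI intrinsic, so make it a no-op. */
#define _nassert(e) ((void)0)
#endif

/* Hypothetical example buffers. */
short in_buf[8] = { 16384, -16384, 8192, -8192, 4096, -4096, 2048, -2048 };
short out_buf[8];

/* Hypothetical routine: out[i] = (in[i] * gain) >> 15 for n Q15 samples. */
void scale_q15(const short *restrict in, short *restrict out, short gain, int n)
{
    int i;  /* signed loop counter (item 1.2) */

    /* Host-side check of the promise made by MUST_ITERATE below. */
    assert(n >= 4 && n % 4 == 0);

    /* Hints for the optimizer (item 1.3); they generate no code. */
    _nassert(((int)in & 0x7) == 0);
    _nassert(((int)out & 0x7) == 0);
#ifdef __TI_COMPILER_VERSION__
    #pragma MUST_ITERATE(4, , 4);
#endif
    for (i = 0; i < n; i++)
        out[i] = (short)((in[i] * gain) >> 15);
}
```

The restrict qualifiers state that in and out never overlap, which is the same promise -mt makes for the whole file, but limited to this one function.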
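2 Use the optimizer's comments
When -s is among the compiler options, the generated *.asm file contains the optimizer's comments. For example, with the command line "C:/CCStudio_v3.3/C6000/cgtools/bin/cl6x" -g -k -s -on2 -o3 -mt -mw -mv6400+ --mem_model:data=near --consultant -@"Debug.lkf" "lesson_c.c", where the source is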
void lesson_c(short *xptr, short *yptr, short *zptr, short *w_sum, int N)
{
    int i, w_vec1, w_vec2;
    short w1, w2;
    w1 = zptr[0];
    w2 = zptr[1];
    for (i = 0; i < N; i++) {
        w_vec1 = xptr[i] * w1;
        w_vec2 = yptr[i] * w2;
        w_sum[i] = (w_vec1 + w_vec2) >> 15;
    }
}
the optimizer comments generated are:
;** --------------------------------------------------------------------------*
;** 27    -----------------------    w1 = *zptr;
;** 28    -----------------------    w2 = zptr[1];
;** 29    -----------------------    if ( N <= 0 ) goto g4;
;** --------------------------------------------------------------------------*
;**       -----------------------    U$17 = xptr;
;**       -----------------------    U$20 = yptr;
;**       -----------------------    U$26 = w_sum;
;** 31    -----------------------    L$1 = N;
;**       -----------------------    #pragma MUST_ITERATE(1, 1099511627775, 1)
;**       -----------------------    #pragma LOOP_FLAGS(4096u)
;**    -----------------------g3:
;** 31    -----------------------    *U$26++ = _mpy(*U$17++, w1)+_mpy(*U$20++, w2)>>15;
;** 29    -----------------------    if ( --L$1 ) goto g3;
;**    -----------------------g4:
;**       -----------------------    return;
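This shows that the compiler automatically inserted a test for whether N is 0 (the if ( N <= 0 ) guard). If the code is changed to: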
void lesson_c(short *xptr, short *yptr, short *zptr, short *w_sum, int N)
{
    int i, w_vec1, w_vec2;
    short w1, w2;
    w1 = zptr[0];
    w2 = zptr[1];
    #pragma MUST_ITERATE(1)  // tell the compiler the loop runs at least once (so it no longer tests whether N is 0)
    for (i = 0; i < N; i++)
    {
        w_vec1 = xptr[i] * w1;
        w_vec2 = yptr[i] * w2;
        w_sum[i] = (w_vec1 + w_vec2) >> 15;
    }
}
the comments below no longer contain the test for whether N is 0:
;** --------------------------------------------------------------------------*
;** 27    -----------------------    w1 = *zptr;
;** 28    -----------------------    w2 = zptr[1];
;**       -----------------------    U$15 = xptr;
;**       -----------------------    U$18 = yptr;
;**       -----------------------    U$24 = w_sum;
;** 32    -----------------------    L$1 = N;
;**       -----------------------    #pragma MUST_ITERATE(1, 4294967295, 1)
;**       -----------------------    #pragma LOOP_FLAGS(4096u)
;**    -----------------------g2:
;** 32    -----------------------    *U$24++ = _mpy(*U$15++, w1)+_mpy(*U$18++, w2)>>15;
;** 30    -----------------------    if ( --L$1 ) goto g2;
;**       -----------------------    return;
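The difference between the two listings can be written out in plain C. This is an illustrative sketch only: the names weighted_sum_guarded and weighted_sum_unguarded are invented here, and the code mirrors the structure of the optimizer comments rather than reproducing compiler output. The first form keeps the N <= 0 guard; the second is the shape the compiler is free to emit once MUST_ITERATE(1) promises at least one iteration.

```c
#include <assert.h>

/* Without MUST_ITERATE: the count has to be tested before the loop is entered. */
void weighted_sum_guarded(const short *x, const short *y, short w1, short w2,
                          short *out, int n)
{
    int count = n;

    if (n <= 0)
        return;                 /* corresponds to "if ( N <= 0 ) goto g4" */
    do {
        *out++ = (short)((*x++ * w1 + *y++ * w2) >> 15);
    } while (--count);          /* corresponds to "if ( --L$1 ) goto g3" */
}

/* With MUST_ITERATE(1): the caller promises n >= 1, so the guard disappears. */
void weighted_sum_unguarded(const short *x, const short *y, short w1, short w2,
                            short *out, int n)
{
    int count = n;

    assert(n >= 1);             /* host-side check of the promise (assumption) */
    do {
        *out++ = (short)((*x++ * w1 + *y++ * w2) >> 15);
    } while (--count);
}
```

Calling the unguarded form with n equal to 0 would decrement the counter past zero and run far beyond the buffers, which is why the pragma should only be added when the minimum trip count is really known.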
3 Use the software pipelining information to optimize loops
void lesson_c(short *xptr, short *yptr, short *zptr, short *w_sum, int N)
{
    int i, w_vec1, w_vec2;
    short w1, w2;
    w1 = zptr[0];
    w2 = zptr[1];
    for (i = 0; i < N; i++)
    {
        w_vec1 = xptr[i] * w1;
        w_vec2 = yptr[i] * w2;
        w_sum[i] = (w_vec1 + w_vec2) >> 15;
    }
}
The software pipelining information produced when compiling the code above is:
;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop source line                 : 30        ; line on which the loop starts
;*      Loop opening brace source line   : 31
;*      Loop closing brace source line   : 35
;*      Known Minimum Trip Count         : 1         ; minimum number of iterations
;*      Known Max Trip Count Factor      : 1         ; trip count factor: the trip count is a multiple of this value; knowing it helps the compiler unroll the loop automatically
;*      Loop Carried Dependency Bound(^) : 0         ; memory read/write bottleneck; if there is one, the affected instructions in the assembly comments below are marked with ^
;*      Unpartitioned Resource Bound     : 2         ; resource bottleneck
;*      Partitioned Resource Bound(*)    : 2
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0
;*      .S units                     1        0
;*      .D units                     2*       1
;*      .M units                     1        1
;*      .X cross paths               1        0
;*      .T address paths             2*       1
;*      Long read paths              0        0
;*      Long write paths             0        0
;*      Logical  ops (.LS)           0        0     (.L or .S unit)
;*      Addition ops (.LSD)          1        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             1        0
;*      Bound(.L .S .D .LS .LSD)     2*       1     ; resource usage is unbalanced (the available compute capacity is not fully used)
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 2  Schedule found with 6 iterations in parallel
;*
;*      Register Usage Table:
;*          +-----------------------------------------------------------------+
;*          |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
;*          |00000000001111111111222222222233|00000000001111111111222222222233|
;*          |01234567890123456789012345678901|01234567890123456789012345678901|
;*          |--------------------------------+--------------------------------|
;*       0: |   ******                       |    * **                        |
;*       1: |   *  ***                       |    ****                        |
;*          +-----------------------------------------------------------------+
;*
;*      Done
;*
;*      Loop will be splooped
;*      Collapsed epilog stages     : 0
;*      Collapsed prolog stages     : 0
;*      Minimum required memory pad : 0 bytes
;*
;*      Minimum safe trip count     : 1
;*----------------------------------------------------------------------------*
;*        SINGLE SCHEDULED ITERATION                    ; requires the -mw option
;*
;*        $C$C23:
;*   0              LDH     .D2T2   *B6++,B5          ; |32|   ; loads 16 bits at a time, wasting the remaining 16 bits of bandwidth (assuming a 32-bit data path)
;*   1              LDH     .D1T1   *A6++,A5          ; |32|
;*   2              NOP             3
;*   5              MPY     .M2     B5,B7,B4          ; |32|
;*   6              MPY     .M1     A5,A8,A4          ; |32|
;*   7              NOP             1
;*   8              ADD     .L1X    B4,A4,A3          ; |32|
;*   9              SHR     .S1     A3,15,A3          ; |32|
;*  10              STH     .D1T1   A3,*A7++          ; |32|
;*     ||           SPBR            $C$C23
;*  11              NOP             1
;*  12              ; BRANCHCC OCCURS {$C$C23}        ; |30|   ; one iteration takes 12 clock cycles (no unrolling)
;*----------------------------------------------------------------------------*
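The remark about LDH in the listing above can be illustrated in portable C. The sketch below is a model under stated assumptions, not compiler output: sample_pair, load_pair and scale_by_pairs are hypothetical names, and the 4-byte memcpy stands in for a single 32-bit access, which is what a word-aligned LDW provides on the DSP.

```c
#include <assert.h>
#include <string.h>

/* Two adjacent 16-bit samples, laid out as they sit in memory. */
typedef struct {
    short first;
    short second;
} sample_pair;

/* Hypothetical helper: fetch two samples with one 4-byte copy. */
static sample_pair load_pair(const short *p)
{
    sample_pair pair;
    memcpy(&pair, p, sizeof pair);
    return pair;
}

/* Hypothetical routine: out[i] = (x[i] * w) >> 15, two samples per fetch. */
void scale_by_pairs(const short *x, short w, short *out, int n)
{
    int i;

    /* The pair layout must be exactly two shorts, and n must be even. */
    assert(sizeof(sample_pair) == 2 * sizeof(short));
    assert(n >= 2 && n % 2 == 0);

    for (i = 0; i < n; i += 2) {
        sample_pair s = load_pair(&x[i]);
        out[i]     = (short)((s.first  * w) >> 15);
        out[i + 1] = (short)((s.second * w) >> 15);
    }
}
```

Half as many loads are issued for the same number of samples, so the load units stop being the limiting resource as quickly. On the DSP the compiler can only make this transformation itself when it knows the pointer is word aligned and the trip count is even.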
When the code is changed to the following:
#define WORD_ALIGNED(x)  (_nassert(((int)(x) & 0x3) == 0))
#define DWORD_ALIGNED(x) (_nassert(((int)(x) & 0x7) == 0))
void lesson3_c(short * restrict xptr, short * restrict yptr, short *zptr, short *w_sum, int N)
{
    int i, w_vec1, w_vec2;
    short w1, w2;
    WORD_ALIGNED(xptr);  // guarantees the memory load bandwidth
    WORD_ALIGNED(yptr);
    w1 = zptr[0];
    w2 = zptr[1];
    #pragma MUST_ITERATE(48, , 2);  // minimum trip count known to be 48, trip count factor 2, so the loop can be unrolled
    for (i = 0; i < N; i++)
    {
        w_vec1 = xptr[i] * w1;
        w_vec2 = yptr[i] * w2;
        w_sum[i] = (w_vec1 + w_vec2) >> 15;
    }
}
The software pipelining information produced when compiling the code above is:
;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop source line                 : 59
;*      Loop opening brace source line   : 60
;*      Loop closing brace source line   : 64
;*      Loop Unroll Multiple             : 4x        ; number of times the loop was unrolled
;*      Known Minimum Trip Count         : 12
;*      Known Max Trip Count Factor      : 1
;*      Loop Carried Dependency Bound(^) : 0
;*      Unpartitioned Resource Bound     : 4
;*      Partitioned Resource Bound(*)    : 4
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0
;*      .S units                     2        2
;*      .D units                     4*       4*     ; unrolling the loop balances resource usage
;*      .M units                     2        2
;*      .X cross paths               2        2
;*      .T address paths             4*       4*
;*      Long read paths              0        0
;*      Long write paths             0        0
;*      Logical  ops (.LS)           0        0     (.L or .S unit)
;*      Addition ops (.LSD)          2        2     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             1        1
;*      Bound(.L .S .D .LS .LSD)     3        3
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 4  Schedule found with 4 iterations in parallel
;*
;*      Register Usage Table:
;*          +-----------------------------------------------------------------+
;*          |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
;*          |00000000001111111111222222222233|00000000001111111111222222222233|
;*          |01234567890123456789012345678901|01234567890123456789012345678901|
;*          |--------------------------------+--------------------------------|
;*       0: |    *****        ***            |    ******      ****            |
;*       1: |   *******         *            |    ******      ** *            |
;*       2: |   *******      ****            |    * ****      **              |
;*       3: |   *******      ****            |    ******      **              |
;*          +-----------------------------------------------------------------+
;*
;*      Done
;*
;*      Loop will be splooped
;*      Collapsed epilog stages     : 0
;*      Collapsed prolog stages     : 0
;*      Minimum required memory pad : 0 bytes
;*
;*      Minimum safe trip count     : 1 (after unrolling)
;*----------------------------------------------------------------------------*
;*        SINGLE SCHEDULED ITERATION
;*
;*        $C$C24:
;*   0              LDW     .D1T1   *A7++(8),A9       ; |61|   ; loads one word at a time
;*     ||           LDW     .D2T2   *B6++(8),B17      ; |61|
;*   1              LDW     .D1T1   *A4++(8),A3       ; |61|
;*   2              NOP             1
;*   3              LDW     .D2T2   *B16++(8),B4      ; |61|
;*   4              NOP             2
;*   6              MPY2    .M1     A9,A8,A17:A16     ; |61|
;*   7              MPY2    .M1     A3,A8,A19:A18     ; |61|
;*     ||           MPY2    .M2     B17,B7,B5:B4      ; |61|
;*   8              MPY2    .M2     B4,B7,B19:B18     ; |61|
;*   9              NOP             2
;*  11              ADD     .L2X    B4,A16,B17        ; |61|
;*  12              ADD     .L2X    B18,A18,B5        ; |61|
;*     ||           SHR     .S2     B17,15,B4         ; |61|
;*     ||           ADD     .L1X    B5,A17,A3         ; |61|
;*  13              SHR     .S2     B5,15,B4          ; |61|
;*     ||           ADD     .L1X    B19,A19,A19       ; |61|
;*     ||           STH     .D2T2   B4,*B9++(8)       ; |61|
;*     ||           SHR     .S1     A3,15,A18         ; |61|
;*  14              STH     .D2T2   B4,*B8++(8)       ; |61|
;*     ||           SHR     .S1     A19,15,A9         ; |61|
;*     ||           STH     .D1T1   A18,*A5++(8)      ; |61|
;*  15              STH     .D1T1   A9,*A6++(8)       ; |61|
;*     ||           SPBR            $C$C24
;*  16              ; BRANCHCC OCCURS {$C$C24}        ; |59|   ; 4 iterations take 16 clock cycles (the loop is unrolled: one pass covers 4 original iterations)
;*----------------------------------------------------------------------------*
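The two schedules can be compared with a little arithmetic. In a software-pipelined loop a new iteration starts every ii cycles, so in steady state the cost per output sample is ii divided by the unroll factor. The helpers below are a rough sketch with invented names (cycles_per_sample, steady_state_cycles); they ignore prolog, epilog and call overhead, so the figures are estimates rather than measurements.

```c
#include <assert.h>

/* Steady-state cycles per output sample: one (possibly unrolled) iteration
   starts every `ii` cycles and produces `unroll` samples. */
double cycles_per_sample(int ii, int unroll)
{
    assert(ii > 0 && unroll > 0);
    return (double)ii / (double)unroll;
}

/* Rough steady-state cycle count for n samples; pipeline fill and drain
   are deliberately left out of this estimate. */
long steady_state_cycles(int ii, int unroll, long n)
{
    assert(ii > 0 && unroll > 0 && n % unroll == 0);
    return (n / unroll) * ii;
}
```

With the values reported in the listings, the first version (ii = 2, no unrolling) costs about 2 cycles per sample, while the second (ii = 4, unrolled 4x) costs about 1 cycle per sample. For the minimum trip count of 48 declared by MUST_ITERATE, that is roughly 96 versus 48 steady-state cycles.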
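4 Consultant advice and the *.nfo file
When the code is compiled with --consultant and -on2 -o3, the corresponding consultant advice and *.nfo file become available. Open the profiler and run the program; the consultant content can then be inspected in the viewer. The *.nfo file is generated at compile time, but its content is less complete than the consultant's. Below is the content of lesson3_c.nfo: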
TMS320C6x C/C++ Optimizer               v6.0.8
Build Number 1GKUL-JA0KH827-RSAQQ-TAV-ZAZG_W_Q_Y
        ======File-level Analysis Summary======
extern void _lesson3_c() is called from 0 sites in this file.
    It appears to be inlineable (size = 58 units)
    It calls these functions:
    <NONE>
        ======= End file-level Analysis =======
extern void _lesson3_c() is called from 0 sites in this file.
    It appears to be inlineable (size = 58 units)
    It calls these functions:
    <NONE>
ADVICE: In function lesson3_c()
    in the 'for' loop with loop variable 'i' at lines 39-44
    for the statement w_sum[i] = _mpy(xptr[i], w1)+_mpy(yptr[i], w2)>>15; at line 41
    The address of w_sum[i] for the first iteration of the loop is &w_sum[0].
    This pointer is aligned to a 16 bit boundary.
    Consider adding an assertion just before the loop:
        _nassert( ((int)w_sum % 4) == 0 );  /* 32-bit aligned */
     or _nassert( ((int)w_sum % 8) == 0 );  /* 64-bit aligned */
    to specify that multiple elements of w_sum[i]
    may be accessed in parallel.
ADVICE: In function lesson3_c()
    in the 'for' loop with loop variable 'i' at lines 39-44
    for the statement w_sum[i] = _mpy(xptr[i], w1)+_mpy(yptr[i], w2)>>15; at line 41
    The address of yptr[i] for the first iteration of the loop is &yptr[0].
    This pointer is aligned to a 32 bit boundary.
    Consider adding an assertion just before the loop:
        _nassert( ((int)yptr % 8) == 0 );  /* 64-bit aligned */
    to specify that multiple elements of yptr[i]
    may be accessed in parallel.
ADVICE: In function lesson3_c()
    in the 'for' loop with loop variable 'i' at lines 39-44
    for the statement w_sum[i] = _mpy(xptr[i], w1)+_mpy(yptr[i], w2)>>15; at line 41
    The address of xptr[i] for the first iteration of the loop is &xptr[0].
    This pointer is aligned to a 32 bit boundary.
    Consider adding an assertion just before the loop:
        _nassert( ((int)xptr % 8) == 0 );  /* 64-bit aligned */
    to specify that multiple elements of xptr[i]
    may be accessed in parallel.
<<NULL MIX DOMAIN>>
== END OF INFO OUTPUT==
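One way to act on this advice is sketched below. It is an assumption-based illustration rather than the original author's follow-up: ASSUME_ALIGNED8, is_aligned8, lesson3_aligned and lesson3_checked are invented names. On the TI compiler the macro expands to the _nassert form that the .nfo file suggests; on a host compiler it becomes a checked assert, because _nassert itself generates no code and is never verified at run time.

```c
#include <assert.h>
#include <stddef.h>

#ifdef __TI_COMPILER_VERSION__
/* TI compiler: pass the alignment promise to the optimizer, as advised. */
#define ASSUME_ALIGNED8(p) _nassert(((int)(p) % 8) == 0)
#else
/* Host fallback (assumption): check the promise instead of assuming it. */
#define ASSUME_ALIGNED8(p) assert(((size_t)(p) % 8) == 0)
#endif

/* Hypothetical helper: nonzero when p lies on a 64-bit boundary. */
int is_aligned8(const void *p)
{
    return ((size_t)p % 8) == 0;
}

/* Same loop as lesson3_c, with all three buffers declared 64-bit aligned. */
void lesson3_aligned(short *restrict xptr, short *restrict yptr, short *zptr,
                     short *w_sum, int N)
{
    int i, w_vec1, w_vec2;
    short w1, w2;

    ASSUME_ALIGNED8(xptr);
    ASSUME_ALIGNED8(yptr);
    ASSUME_ALIGNED8(w_sum);
    w1 = zptr[0];
    w2 = zptr[1];
    for (i = 0; i < N; i++) {
        w_vec1 = xptr[i] * w1;
        w_vec2 = yptr[i] * w2;
        w_sum[i] = (short)((w_vec1 + w_vec2) >> 15);
    }
}

/* Runs the loop only when the promise holds: returns 1 on success,
   0 (leaving w_sum untouched) if any buffer is misaligned. */
int lesson3_checked(short *xptr, short *yptr, short *zptr, short *w_sum, int N)
{
    if (!is_aligned8(xptr) || !is_aligned8(yptr) || !is_aligned8(w_sum))
        return 0;
    lesson3_aligned(xptr, yptr, zptr, w_sum, N);
    return 1;
}
```

The checked wrapper is useful during bring-up: if a buffer that violates the promise reaches the asserted loop on the DSP, the wide loads the compiler generates may read from the wrong addresses without any diagnostic.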
Reposted from "https://blog.csdn.net/tercel_zhang/article/details/52944286"