Summary: A configurable integer transform operation unit is proposed and used in the adaptive transform module of an H.264/AVC High Profile video encoder. The unit performs the transform selected by a transform-type configuration signal. The design is implemented and verified on an Altera Cyclone II series FPGA, where the maximum operating frequency after placement and routing is 63 MHz. A transform module built from 4 configurable transform units meets the real-time encoding requirements of 1920×1080 video at 50 frames/s.
Key words: Adaptive Transform; Hardware Multiplexing; Discrete Cosine Transform; Hadamard Transform; H.264 High Profile Video Encoder
In July 2004, the JVT released the Fidelity Range Extensions (FRExt) of the H.264/AVC standard, which introduced the adaptive block-size transform (ABT): instead of always using the 4×4 transform, the encoder selects adaptively between the 4×4 and 8×8 transforms.
A 4×4 DCT is applied to the luma prediction residuals obtained in the intra 4×4 and intra 16×16 prediction modes, and an 8×8 DCT to those obtained in the intra 8×8 prediction mode. For luma prediction residuals obtained in inter prediction modes, the 4×4 DCT is used if the prediction block is smaller than 8×8; otherwise the encoder must choose between the 4×4 DCT and the 8×8 DCT. Chroma prediction residuals always use the 4×4 DCT. The H.264 reference software JM selects the transform type as follows: it computes the 4×4 SATD and the 8×8 SATD of the luma prediction residual of a macroblock and uses the transform size with the smaller SATD value for the discrete cosine transform.
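The SATD comparison described above can be sketched as follows. This is a minimal illustration, not the JM code: the function names are hypothetical, and dividing each total by the block size so that the two sizes are comparable is an assumption, since the text does not state the normalization used.

```python
def hadamard_matrix(n):
    """Build an n x n Hadamard matrix (n a power of two), Sylvester construction."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(r) for r in zip(*m)]

def satd(block):
    """Sum of absolute values of the 2-D Hadamard transform of a square block."""
    h = hadamard_matrix(len(block))
    coeffs = matmul(matmul(h, block), transpose(h))
    return sum(abs(c) for row in coeffs for c in row)

def macroblock_satd(residual, n):
    """Total SATD of a 16x16 luma residual split into n x n tiles.

    Assumption: the total is divided by n to equalize the transform gain.
    """
    total = 0
    for r in range(0, 16, n):
        for c in range(0, 16, n):
            tile = [row[c:c + n] for row in residual[r:r + n]]
            total += satd(tile)
    return total / n

def select_transform_size(residual):
    """Return 8 if the 8x8 SATD is strictly smaller, otherwise 4."""
    if macroblock_satd(residual, 8) < macroblock_satd(residual, 4):
        return 8
    return 4
```

Under this normalization, a flat residual favours the 8×8 transform, while a residual consisting of a single isolated sample favours the 4×4 transform.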
1 Design of Configurable Transform Operation Unit
According to the configuration signal, the configurable transform operation unit performs either two-dimensional 4×4 forward and inverse transforms or one-dimensional 8×8 forward and inverse transforms. Multiple instances of this operation unit make up the adaptive transform module used in the overall encoder.
1.1 Forward transformation
(1) 4×4 integer cosine transform: Using the Kronecker product from matrix theory and the symmetry of the transform matrix, the two-dimensional 4×4 DCT can be written as:
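The equation itself is not reproduced in this text. As an illustration of the idea, the sketch below assumes the commonly quoted H.264 4×4 core matrix (called CF here, a name chosen for this example) and checks that the matrix form CF·X·CFᵀ equals the Kronecker-product form applied to the row-major vectorized block. Scaling and quantization are omitted.

```python
# Assumed H.264 4x4 forward core matrix (scaling is folded into quantization).
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(r) for r in zip(*m)]

def forward_4x4(x):
    """Two-dimensional transform in matrix form: Y = CF * X * CF^T."""
    return matmul(matmul(CF, x), transpose(CF))

def kron(a, b):
    """Kronecker product: row index (i, k), column index (j, l)."""
    return [[a[i][j] * b[k][l]
             for j in range(len(a[0])) for l in range(len(b[0]))]
            for i in range(len(a)) for k in range(len(b))]

def forward_4x4_kron(x):
    """Same transform as one 16x16 matrix acting on the row-major vector of X."""
    vec = [v for row in x for v in row]
    k = kron(CF, CF)
    out = [sum(k[r][c] * vec[c] for c in range(16)) for r in range(16)]
    return [out[4 * i:4 * i + 4] for i in range(4)]
```

The Kronecker form makes the direct two-dimensional structure explicit: every output coefficient is a signed, weighted sum of all 16 inputs, which is what allows the transform to be computed without an intermediate transposition.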
(3) 8×8 integer cosine transform: Through row-column decomposition, the two-dimensional 8×8 DCT can be written as two successive one-dimensional 8×8 transforms:
The references give a fast algorithm for the one-dimensional 8×8 integer cosine transform.
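Row-column decomposition can be sketched as below: a one-dimensional transform is applied to every row, the intermediate block is transposed, and the same one-dimensional transform is applied again. The matrix T8 is the integer matrix commonly quoted for the H.264 8×8 transform with its 1/8 scaling omitted; it is an assumption here, since the paper's equation is not reproduced in this text, and the function names are hypothetical. The sketch uses plain matrix-vector products rather than the fast algorithm from the references.

```python
# Assumed integer 8x8 transform matrix (1/8 scaling omitted).
T8 = [[8, 8, 8, 8, 8, 8, 8, 8],
      [12, 10, 6, 3, -3, -6, -10, -12],
      [8, 4, -4, -8, -8, -4, 4, 8],
      [10, -3, -12, -6, 6, 12, 3, -10],
      [8, -8, -8, 8, 8, -8, -8, 8],
      [6, -12, 3, 10, -10, -3, 12, -6],
      [4, -8, 8, -4, -4, 8, -8, 4],
      [3, -6, 10, -12, 12, -10, 6, -3]]

def transpose(m):
    return [list(r) for r in zip(*m)]

def transform_1d(vec):
    """One-dimensional 8-point transform of a single row or column."""
    return [sum(T8[k][n] * vec[n] for n in range(8)) for k in range(8)]

def forward_8x8_row_column(x):
    """2-D transform as two 1-D passes with a transposition in between."""
    stage1 = [transform_1d(row) for row in x]          # first dimension
    stage1_t = transpose(stage1)                       # transposition step
    stage2 = [transform_1d(row) for row in stage1_t]   # second dimension
    return transpose(stage2)                           # restore orientation

def forward_8x8_direct(x):
    """Reference: Y = T8 * X * T8^T computed as two matrix products."""
    tx = [[sum(T8[i][k] * x[k][j] for k in range(8)) for j in range(8)]
          for i in range(8)]
    return [[sum(tx[i][k] * T8[j][k] for k in range(8)) for j in range(8)]
            for i in range(8)]
```

Because only a one-dimensional 8-point datapath is needed, the same hardware can serve both passes, at the cost of storing the intermediate block for transposition.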
(4) Integrated forward transform unit: By sharing computing resources, the operations in Equation (1), Equation (2) and Equation (3) can be implemented with the same structure, as shown in Fig. 1. For the 4×4 transform, when PHASE=1 the first-level operation is an addition, and the result is rows 0 and 2 of the 4×4 two-dimensional transform coefficient block; when PHASE=0 the first-level operation is a subtraction, and the result is rows 1 and 3 of the coefficient block. In the last level of addition, the solid line is the path of the 4×4 transform and the dashed line is the path of the 8×8 transform.
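The role of PHASE in the 4×4 case can be modelled behaviourally as follows. This is only a sketch: it assumes the standard H.264 4×4 core matrix (CF below) and does not reproduce the adder structure of Fig. 1. All function names are hypothetical.

```python
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]

def reference_4x4(x):
    """Direct matrix form Y = CF * X * CF^T, used only for checking."""
    cx = [[sum(CF[i][k] * x[k][j] for k in range(4)) for j in range(4)]
          for i in range(4)]
    return [[sum(cx[i][k] * CF[j][k] for k in range(4)) for j in range(4)]
            for i in range(4)]

def transform_row(v):
    """4-point transform of one row vector using butterflies."""
    s0, s1 = v[0] + v[3], v[1] + v[2]
    d0, d1 = v[0] - v[3], v[1] - v[2]
    return [s0 + s1, 2 * d0 + d1, s0 - s1, d0 - 2 * d1]

def half_transform(x, phase):
    """Return two coefficient rows of the 4x4 block, selected by phase.

    phase=1: first level adds input rows      -> coefficient rows 0 and 2
    phase=0: first level subtracts input rows -> coefficient rows 1 and 3
    """
    sign = 1 if phase == 1 else -1
    a = [x[0][j] + sign * x[3][j] for j in range(4)]
    b = [x[1][j] + sign * x[2][j] for j in range(4)]
    if phase == 1:
        first = [a[j] + b[j] for j in range(4)]
        second = [a[j] - b[j] for j in range(4)]
        rows = (0, 2)
    else:
        first = [2 * a[j] + b[j] for j in range(4)]
        second = [a[j] - 2 * b[j] for j in range(4)]
        rows = (1, 3)
    return {rows[0]: transform_row(first), rows[1]: transform_row(second)}

def full_transform(x):
    """Combine both phases into the complete 4x4 coefficient block."""
    rows = {}
    rows.update(half_transform(x, 1))
    rows.update(half_transform(x, 0))
    return [rows[i] for i in range(4)]
```

In this model the only difference between the two phases at the first level is the sign of the second operand, which is why a single adder/subtractor stage controlled by PHASE is sufficient.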
1.2 Inverse transformation
(1) 4×4 inverse integer cosine transform: The 4×4 iDCT can be further written as:
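The equation is not reproduced in this text. For orientation, the sketch below shows the widely used butterfly form of the H.264 4×4 inverse core transform, in which the factors of 1/2 are realized as right shifts and the result is rounded with (x + 32) >> 6. It is an illustration under that assumption, not the paper's Equation (4), and the function names are hypothetical.

```python
def inverse_1d(w):
    """4-point inverse butterfly; the 1/2 factors are arithmetic right shifts."""
    e0 = w[0] + w[2]
    e1 = w[0] - w[2]
    e2 = (w[1] >> 1) - w[3]
    e3 = w[1] + (w[3] >> 1)
    return [e0 + e3, e1 + e2, e1 - e2, e0 - e3]

def inverse_4x4(w):
    """Row pass, column pass, then rounding by (x + 32) >> 6."""
    rows = [inverse_1d(r) for r in w]
    cols = [inverse_1d(list(c)) for c in zip(*rows)]  # cols[j] is column j
    return [[(cols[j][i] + 32) >> 6 for j in range(4)] for i in range(4)]
```

For example, a block whose only nonzero coefficient is a DC value of 64 reconstructs to a block of ones.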
(2) 8×8 Hadamard transform: Following the references, an 8×8 prediction residual block is divided into two 4×8 blocks for processing. Its one-dimensional transform can be written as:
(3) 8×8 inverse integer cosine transform: The references give a fast algorithm for the one-dimensional 8×8 inverse integer cosine transform.
(4) Integrated inverse transform unit: The operations in Equation (4), Equation (5) and Equation (6) can also be implemented with the same structure; the integrated operation unit is shown in Fig. 2.
1.3 Configurable transform operation unit
The forward transform unit (Fig. 1) and the inverse transform unit (Fig. 2) have similar hardware structures, so they are integrated into a single operation unit, as shown in Fig. 3. The configurable transform operation unit needs a total of 36 adders, 4 of which are used only for the 8×8 transform and are not shown in the figure. The input, intermediate and output interconnections are wiring structures: they select the inputs of each operator stage according to the transform type and reorder the output data, so that each final output corresponds one-to-one to the position of its input data.
The operation unit has 16 data inputs, which carry a 4×4 or an 8×1 block of prediction residuals or inverse-quantized coefficients, and 8 data outputs, which carry a 4×2 or an 8×1 block of transform (or inverse transform) coefficients. The PHASE signal is valid only during 4×4 transforms. Operand isolation is applied at the operand inputs of each operation stage to suppress useless computation during encoding and thereby reduce power consumption. To shorten the critical path and raise the operating frequency, the data path is implemented as a 2-stage pipeline.
2 Adaptive transform module design
The overall structure of the adaptive transform module is shown in Fig. 4. All 4×4 transforms in this module use the direct two-dimensional method, and all 8×8 transforms use row-column decomposition. To balance the throughput of the 4×4 and 8×8 transforms and to present a simpler interface to the quantization module, the adaptive transform module uses 4 transform operation units to process 32 data in parallel, that is, two 4×4 blocks or 4 rows/columns of one 8×8 block. All operations are 16-bit operations. The trans_type signal indicates the transform type, and the cnt signal counts the clock cycles of the transform in progress.
For the 4×4 transform, transform operation units 0 and 1 (and likewise units 2 and 3) receive the same 4×4 block as input and produce the transform coefficients of rows 0, 3 and rows 1, 2 of that block, respectively. The structure can therefore process two 4×4 blocks at the same time.
For the 8×8 transform, transform operation units 0-3 all perform the same operation. They first process the first 4 rows and store the results of the one-dimensional transform in the transposition register, then perform the one-dimensional transform of the last 4 rows. The coefficients in the transposition register are then transposed and passed through the second-dimension transform.
The processing of one macroblock is shown in Fig. 5. The 4×4 blocks and 8×8 blocks in a macroblock are numbered according to the standard. For the 4×4 transform, data is fed to the operation units in the order of two 4×4 blocks at a time; for the 8×8 transform, data is fed in the order of the 8×8 blocks. The 4×4 and 8×8 transforms are performed in the order indicated by the numbers in the ellipses in Fig. 5.
3 Synthesis results and performance analysis
The design is written in Verilog HDL, synthesized with Quartus II 10.0, and analysed for timing with the TimeQuest Timing Analyzer. The operating frequency reaches 63 MHz after placement and routing.
When the transform type is selected, the 4×4 and 8×8 Hadamard transforms are applied to the luma component in turn, requiring 8+1=9 and 6+3×5=21 clock cycles respectively. If the luma component uses the 4×4 transform, processing one macroblock takes at most 30+12+1=43 clock cycles; if it uses the 8×8 transform, it takes at most 30+21+4=55 clock cycles. Taking the worst case of 55 clock cycles per macroblock, the throughput of the transform module is 63 000 000/55=1 145 454 macroblocks per second. The data rate of 1920×1080 video at 50 frames/s is 1 920×1 080×50×1.5/(16×16)=607 500 macroblocks per second. Therefore, the adaptive transform structure meets the real-time encoding requirements of 1920×1080 video at 50 frames/s.
This paper proposes an adaptive transform module for an H.264 High Profile video encoder that performs all transform operations required in the encoding process. The throughput of the design meets the real-time encoding requirements of 1920×1080 video at 50 frames/s.
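The throughput argument above can be re-checked with a few lines of arithmetic. The variable names are chosen for this example; the factor 1.5 is taken from the text and is read here as the luma-plus-chroma sample count per pixel, which is an interpretation rather than something the text states.

```python
F_MAX_HZ = 63_000_000            # operating frequency after placement and routing

CYCLES_4X4 = 30 + 12 + 1         # worst case when luma uses the 4x4 transform
CYCLES_8X8 = 30 + 21 + 4         # worst case when luma uses the 8x8 transform
WORST_CYCLES = max(CYCLES_4X4, CYCLES_8X8)

# Macroblocks the transform module can process per second (rounded down).
throughput = F_MAX_HZ // WORST_CYCLES

# Required rate from the text: 1920 x 1080 x 50 x 1.5 / (16 x 16),
# written with integers (1.5 = 3/2) so the result is exact.
required = 1920 * 1080 * 50 * 3 // (2 * 16 * 16)

# Ratio of available to required throughput, as a rough margin indicator.
margin = throughput / required
```

With these figures the module provides close to twice the required macroblock rate, so the real-time claim holds with a clear margin even in the worst case.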
 SULLIVAN G, TOPIWALA P, LUTHRA A. The H.264/AVC advanced video coding standard: overview and introduction to the fidelity range extensions[C]. in: SPIE Conference on Applications of Digital Image Processing XXVII, 2004.
 WIEN M. Variable block-size transforms for H.264/AVC[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2003, 13(7): 604-613.
 Cheng Zhanyuan, Chen Chehong, Liu Binda, et al. High throughput 2-D transform architectures for H.264 advanced video coders[C]. in: IEEE Asia-Pacific Conference on Circuits and Systems, 2004, 2: 1141-1144.
 JVT-J029: Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG (ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6), 10th Meeting: Waikoloa, Hawaii, USA, 8-12 December, 2003.
 Liu Zhenyu, Zhou Junwei, Wang Dongsheng, et al. Register length analysis and VLSI optimization of VBS Hadamard transform in H.264/AVC[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2011, 21(5): 601-610.