Programmable Vertex Processing Unit for Mobile Game Development

Tae-Young Kim¹, Kyoung-Su Oh², Byeong-Seok Shin³, CheolSu Lim¹

¹Dept. of Computer Engineering, Seokyeong University 136-704 Seoul, Korea
²Dept. of Media, Soongsil University 156-743 Seoul, Korea
³Dept. of Computer Engineering, Inha University 402-751 Incheon, Korea
tykim@skuniv.ac.kr, oks@ssu.ac.kr, bsshin@inha.ac.kr, cslim@skuniv.ac.kr

Abstract. A programmable vertex processing unit increases flexibility and enables customization of transformation and lighting in the graphics pipeline. Because most embedded systems, such as mobile phones and PDAs, have only a fixed-function pipeline, many special effects that are essential for developing realistic 3D games are unavailable on them. We designed and implemented a programmable vertex processing unit for mobile devices based on the OpenGL ES 2.0 specification. It can be used as a development platform for 3D mobile games. In addition, its assembly instruction set and encoding scheme can serve as an example of a standard interface to high-level shading languages.

1 Introduction

In the last few decades, much research has been done to enhance the functionality and efficiency of graphics hardware [1]. One result is the programmable graphics pipeline, which gives the programmer full control of vertex and fragment processing. Various special effects that were impossible with the fixed pipeline can now be implemented [2][3]. Vertex processing in the programmable pipeline does not use fixed-function T&L (Transformation & Lighting) but a vertex program written by the programmer. As a result, it enables more realistic 3D games.

Unfortunately, most embedded systems, such as mobile phones and PDAs, have only a fixed-function pipeline. Although some mobile 3D game consoles are equipped with specially designed programmable units, these units require a lot of computing resources. Because they are subsets of desktop PC GPUs, they cannot be applied to generic mobile phones or PDAs. Therefore, we have designed and implemented a programmable vertex processing unit for mobile devices.

Our vertex processing unit is designed based on OpenGL ES 2.0 [4] and GL_ARB_vertex_program [5]. GL_ARB_vertex_program is the specification of an assembly shading language for programmable graphics processors in general computing systems. OpenGL ES is a graphics API standard for embedded systems, which specifies graphics APIs and a high-level shading language for programmable vertex and fragment programs [6], but it does not include a low-level specification of the shading language [7]. We modified the GL_ARB_vertex_program assembly language to fully support OpenGL ES 2.0. We defined some new instructions and replaced some instructions with sequences of other primitive instructions so that instructions can be encoded and decoded efficiently. Since our unit provides a high degree of flexibility to simple mobile devices, such devices can be used as mobile 3D game consoles. Also, our instruction design and operand encoding scheme can be used as an interface standard between low-level and high-level shader languages.

In Sect. 2 we present the structure of our vertex processing unit. The instruction set design and encoding scheme are explained in Sect. 3. Implementation and results are presented in Sect. 4. Lastly, we summarize and conclude our work in Sect. 5.

2 Architecture of Vertex Processing Unit

A vertex program is a sequence of vector operations that determines how a set of program parameters and per-vertex input parameters are transformed to a set of per-vertex result parameters. Fig. 1 shows the architecture of our vertex processing unit.

![Diagram](image)

**Fig. 1.** Architecture of our vertex processing unit. It consists of seven components.

- **Machine code**: Up to 128 binary codes to be executed in the vertex processing unit.
- **Vertex Processing Unit**: A processing engine that fetches, decodes, and executes each machine code.
- **Vertex data**: A set of 16 read-only registers, each containing a 4-component floating-point vector. The registers hold per-vertex attributes such as position, colors, and normal.
- **Constant Registers**: A set of 96 read-only registers that store parameters such as matrices, lighting parameters, and constants required by vertex programs.
- **Temporary Registers**: A set of 16 readable and writable registers that hold intermediate results during the execution of a vertex program.
- **Address Register**: A register containing an integer used as an index to perform indirect accesses to constant data during the execution of a vertex program.
- **Output Registers**: A set of 16 write-only registers that hold the final results of a vertex program. They are passed to the remaining stages of the graphics pipeline.
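
The storage listed above can be summarized in a small data structure. The following Python sketch is illustrative only: the class name `VertexUnitState`, the helper `make_bank`, and the field names are hypothetical, and only the register counts, the 4-component layout, and the 128-code limit are taken from the description above.

```python
from dataclasses import dataclass, field


def make_bank(count):
    """Return `count` registers, each a 4-component float vector [x, y, z, w]."""
    return [[0.0, 0.0, 0.0, 0.0] for _ in range(count)]


@dataclass
class VertexUnitState:
    """Hypothetical container for the storage shown in Fig. 1."""
    machine_code: list = field(default_factory=list)                  # up to 128 codes
    vertex_data: list = field(default_factory=lambda: make_bank(16))  # read-only inputs
    constants: list = field(default_factory=lambda: make_bank(96))    # read-only parameters
    temporaries: list = field(default_factory=lambda: make_bank(16))  # read/write scratch
    address: int = 0                                                  # integer index register
    outputs: list = field(default_factory=lambda: make_bank(16))      # write-only results

    def load_program(self, codes):
        """Store a program, rejecting anything beyond the 128-code limit."""
        if len(codes) > 128:
            raise ValueError("a vertex program holds at most 128 machine codes")
        self.machine_code = list(codes)
```

Read-only and write-only access is not enforced in this sketch; a real emulator would check it when decoding operands.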

3 Instruction Set and Encoding Scheme

We define 28 primitive instructions and 3 macro instructions based on the operation processing method, as shown in Table 1. A macro instruction is an instruction that can be replaced by a series of primitive instructions, so each one is translated into multiple primitive instructions at assembly time. In Table 1, the shaded entries are additional instructions that are not included in the GL_ARB_vertex_program instruction set (clamp, MuZ, rEX2, and rLG2). We added them in order to implement the macro instructions, whose expansions follow the primitive instructions in Table 1.

Table 1. Primitive instructions and macro instructions used in our implementation

<table>
<thead>
<tr>
<th>Instruction</th>
<th>Description</th>
<th>Instruction</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>ARL</td>
<td>Address register load</td>
<td>DPH</td>
<td>Homogeneous dot product</td>
</tr>
<tr>
<td>MOV</td>
<td>Move</td>
<td>DP3</td>
<td>3-component dot product</td>
</tr>
<tr>
<td>ABS</td>
<td>Absolute</td>
<td>DP4</td>
<td>4-component dot product</td>
</tr>
<tr>
<td>FLR</td>
<td>Floor</td>
<td>clamp</td>
<td>Clamp</td>
</tr>
<tr>
<td>FRC</td>
<td>Fraction</td>
<td>MuZ</td>
<td>Multiply on z</td>
</tr>
<tr>
<td>SWZ</td>
<td>Extended swizzle</td>
<td>MAD</td>
<td>Multiply and add</td>
</tr>
<tr>
<td>ADD</td>
<td>Addition</td>
<td>EXP</td>
<td>Exponential base 2 (approximate)</td>
</tr>
<tr>
<td>MUL</td>
<td>Multiply</td>
<td>LOG</td>
<td>Logarithm base 2 (approximate)</td>
</tr>
<tr>
<td>DST</td>
<td>Distance vector</td>
<td>EX2</td>
<td>Exponential base 2</td>
</tr>
<tr>
<td>XPD</td>
<td>Cross product</td>
<td>LG2</td>
<td>Logarithm base 2</td>
</tr>
<tr>
<td>MAX</td>
<td>Maximum</td>
<td>RCP</td>
<td>Reciprocal</td>
</tr>
<tr>
<td>MIN</td>
<td>Minimum</td>
<td>RSQ</td>
<td>Reciprocal square root</td>
</tr>
<tr>
<td>SGE</td>
<td>Set on greater than or equal</td>
<td>rEX2</td>
<td>Exponential base 2 (rough)</td>
</tr>
<tr>
<td>SLT</td>
<td>Set on less than</td>
<td>rLG2</td>
<td>Logarithm base 2 (rough)</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>Macro</th>
<th>Description</th>
<th>Primitive instruction sequence</th>
</tr>
</thead>
<tbody>
<tr>
<td>LIT f, a, b</td>
<td>Light coefficients</td>
<td>clamp tmp, a, b<br>rLG2 tmp.w, tmp.w<br>rEX2 tmp.w, tmp.w<br>MuZ f, tmp, warp, warp</td>
</tr>
<tr>
<td>POW f, a, b</td>
<td>Power (f = a^b)</td>
<td>LG2 tmp, a<br>MUL tmp, tmp, b<br>EX2 f, tmp</td>
</tr>
<tr>
<td>SUB f, a, b</td>
<td>Subtraction (f = a - b)</td>
<td></td>
</tr>
</tbody>
</table>
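
Macro translation at assembly time can be sketched as a simple template substitution. The Python fragment below is a minimal illustration under stated assumptions, not the assembler used in this work: the names `MACROS` and `expand` are hypothetical, the scratch register is simply called `tmp`, and only the POW sequence (LG2, MUL, EX2) from the macro listing above is included.

```python
# Expansion templates keyed by macro name. Only POW is shown; its primitive
# sequence follows the macro listing in Table 1 (f = 2^(b * log2(a)) = a^b).
MACROS = {
    "POW": ["LG2 {tmp}, {a}", "MUL {tmp}, {tmp}, {b}", "EX2 {f}, {tmp}"],
}


def expand(line, tmp="tmp"):
    """Return the primitive instructions that replace one assembly line."""
    text = line.strip().rstrip(";")
    opcode, _, rest = text.partition(" ")
    template = MACROS.get(opcode)
    if template is None:
        return [text]  # already a primitive instruction
    f, a, b = [operand.strip() for operand in rest.split(",")]
    return [step.format(f=f, a=a, b=b, tmp=tmp) for step in template]
```

For example, `expand("POW R0, R1, R2;")` yields the three lines `LG2 tmp, R1`, `MUL tmp, tmp, R2`, and `EX2 R0, tmp`, while a primitive such as `ADD R0, R1, R2;` passes through unchanged.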

Fig. 2 shows the 64-bit machine code structure, which is composed of an opcode, a destination operand, and up to three source operands. The low bit field (bits 4 to 18) can be used either as a source operand (Src2) field or as an extended swizzle field. It is interpreted as a source operand field in the MAD instruction and as an extended swizzle field in all other cases, because MAD is the only instruction with three source operands.

The opcode has 6 bits, so up to 64 instructions can be encoded. The destination operand field (register type, index, and mask) has 9 bits. The bits are interpreted as follows:

- **T (1 bit):** type / 0 (Temporary register) / 1 (Output register)
- **index (4 bits):** register index (0~15)
- **mask (4 bits):** mask flag for each component

**Fig. 2.** Machine instruction format: an opcode field, a destination operand field, source operand fields, extended swizzle field, and extended constant index field.

The source field (register type, index, and swizzle information) has 15 bits. The bits are interpreted as follows:

- **neg (1 bit):** negation flag
- **type (2 bits):** type / 00 (Temporary register) / 01 (Vertex data) / 10 (Constant register, absolute addressing) / 11 (Constant register, relative addressing)
- **index (4 bits):** register index (0~15)
- **swizzle (2 bits × 4 components):** component swizzle / 00 (x component) / 01 (y component) / 10 (z component) / 11 (w component)
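
The operand fields can be packed and unpacked with plain shifts and masks. The Python sketch below is illustrative, not the encoder used in this work. It assumes that the 15-bit source field holds one 2-bit selector for each of the four components (1 + 2 + 4 + 8 = 15 bits) and that subfields are laid out most-significant first in the order listed above; the text does not state the bit order, and the function names are hypothetical.

```python
SRC_TYPES = {"temp": 0b00, "vertex": 0b01, "const_abs": 0b10, "const_rel": 0b11}
COMPONENTS = "xyzw"  # 00 = x, 01 = y, 10 = z, 11 = w


def encode_dst(is_output, index, mask):
    """Pack T (1 bit), index (4 bits), and write mask (4 bits) into 9 bits."""
    if not (0 <= index < 16 and 0 <= mask < 16):
        raise ValueError("index and mask must each fit in 4 bits")
    return (int(is_output) << 8) | (index << 4) | mask


def encode_src(neg, reg_type, index, swizzle="xyzw"):
    """Pack neg (1), type (2), index (4), and four 2-bit selectors into 15 bits."""
    if not 0 <= index < 16 or len(swizzle) != 4:
        raise ValueError("index must fit in 4 bits and swizzle needs 4 selectors")
    selectors = 0
    for component in swizzle:
        selectors = (selectors << 2) | COMPONENTS.index(component)
    return (int(neg) << 14) | (SRC_TYPES[reg_type] << 12) | (index << 8) | selectors


def decode_src(bits):
    """Inverse of encode_src: return (neg, reg_type, index, swizzle)."""
    names = {code: name for name, code in SRC_TYPES.items()}
    swizzle = "".join(COMPONENTS[(bits >> shift) & 0b11] for shift in (6, 4, 2, 0))
    return bool((bits >> 14) & 1), names[(bits >> 12) & 0b11], (bits >> 8) & 0xF, swizzle
```

A round trip such as `decode_src(encode_src(True, "const_abs", 5, "yzzx"))` returns the original tuple, which is a convenient check when changing the assumed layout.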

The extended swizzle field holds additional swizzle information for source operand 0. With this information, each of the four components of source operand 0 can be negated or replaced with another component's value, zero, or one. For example, if the swizzle suffix is "yzzx" and the specified source register contains {2,8,9,0}, the swizzled operand used by the instruction is {8,9,9,2}.

SWZ dst, src, [-]{0|1|x|y|z|w}, [-]{0|1|x|y|z|w}, [-]{0|1|x|y|z|w}, [-]{0|1|x|y|z|w};

PARAM Colr = {5, 6, 7, 8};
TEMP Tmp1, Tmp2;
SWZ Tmp1, Colr, x, y, 0, 1; // Tmp1 = {5, 6, 0, 1};
SWZ Tmp2, Colr, -x, -y, z, 1; // Tmp2 = {-5, -6, 7, 1};
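
The effect of both swizzle forms can be checked with a few lines of Python. This is a minimal sketch with hypothetical function names. The selector lists in the usage note are chosen to reproduce the results quoted in the comments of the SWZ example ({5, 6, 0, 1} and {-5, -6, 7, 1}); they are an assumption about the intended selectors rather than a transcription.

```python
def swizzle(src, suffix):
    """Ordinary swizzle: pick one source component for each output slot."""
    return [src["xyzw".index(component)] for component in suffix]


def extended_swizzle(src, selectors):
    """Extended swizzle: each selector is '0', '1', or a component name,
    optionally prefixed with '-' to negate the selected value."""
    result = []
    for selector in selectors:
        negate = selector.startswith("-")
        key = selector.lstrip("-")
        if key in ("0", "1"):
            value = float(key)            # constant zero or one
        else:
            value = src["xyzw".index(key)]  # value of another component
        result.append(-value if negate else value)
    return result
```

With a source register {2, 8, 9, 0}, `swizzle(src, "yzzx")` gives `[8, 9, 9, 2]`. With `Colr = [5, 6, 7, 8]`, the selectors `["x", "y", "0", "1"]` give {5, 6, 0, 1} and `["-x", "-y", "z", "1"]` give {-5, -6, 7, 1}.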

In this field, N_1 and S_1 are the negation flag and the zero-or-one value flag for each component. The extended index field has 3 bits and is used for indexing constant registers. In total, 7-bit indexing is possible by combining the 4 bits in the source operand field with the 3 bits in the extended constant index field.
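
The 7-bit constant index can be illustrated as follows. This Python sketch assumes that the 3 extended bits form the high-order part of the index and the 4 bits of the source operand field form the low-order part; the text does not specify the order, and the function names are hypothetical.

```python
def constant_index(field_index, ext_index):
    """Combine the 4-bit source-field index with the 3-bit extended index.

    Assumption: the extended bits are the high-order bits of the result.
    """
    if not (0 <= field_index < 16 and 0 <= ext_index < 8):
        raise ValueError("field_index needs 4 bits and ext_index needs 3 bits")
    return (ext_index << 4) | field_index


def split_constant_index(index):
    """Split a constant register number (0..95) into (field_index, ext_index)."""
    if not 0 <= index < 96:
        raise ValueError("only 96 constant registers exist")
    return index & 0xF, index >> 4
```

Seven bits address up to 128 entries, which is enough to cover the 96 constant registers; under the assumed layout the last register, number 95, splits into `(15, 5)`.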

4 Implementation and Results

We implemented our programmable vertex processing unit as a software emulator. Our implementation can be used to emulate mobile game applications that include vertex programs. We tested its performance on a desktop PC with a 4.3 GHz Pentium processor and an ATI Radeon 9800 XT graphics card.

To test our vertex processing unit, we implemented the OpenGL ES 2.0 APIs related to vertex processing. Using these APIs, vertex data are stored and passed to our vertex processing unit. A vertex program is assembled into machine codes, which are passed to the vertex processing unit through our APIs. The vertex processing unit calculates the position and the color of each vertex by fetching, decoding, and executing the machine codes. The outputs of our vertex processing unit are sent to the OpenGL graphics pipeline installed on our computer via the original OpenGL APIs.
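
The per-vertex execution loop described above can be sketched in Python. This is a simplified illustration, not the emulator itself: instructions are given as already-decoded tuples rather than 64-bit machine codes, only five primitive instructions are modeled, and the names `OPS`, `run`, `regs`, and `program` are hypothetical.

```python
def dp4(a, b):
    """4-component dot product."""
    return sum(x * y for x, y in zip(a, b))


# A few primitive operations; each returns a full 4-component result.
OPS = {
    "MOV": lambda a: list(a),
    "ADD": lambda a, b: [x + y for x, y in zip(a, b)],
    "MUL": lambda a, b: [x * y for x, y in zip(a, b)],
    "MAD": lambda a, b, c: [x * y + z for x, y, z in zip(a, b, c)],
    "DP4": lambda a, b: [dp4(a, b)] * 4,  # scalar replicated to all components
}


def run(program, regs):
    """Execute decoded instructions of the form (opcode, dst, mask, sources)."""
    for opcode, (dst_bank, dst_index), mask, sources in program:   # fetch
        operands = [regs[bank][index] for bank, index in sources]  # read operands
        result = OPS[opcode](*operands)                            # execute
        target = regs[dst_bank][dst_index]
        for slot, name in enumerate("xyzw"):                       # masked write-back
            if name in mask:
                target[slot] = result[slot]
    return regs


# Usage: transform one vertex position by a 4x4 matrix stored row by row
# in four constant registers (here a translation by (10, 20, 30)).
regs = {
    "vertex": [[1.0, 2.0, 3.0, 1.0]],
    "const": [[1.0, 0.0, 0.0, 10.0], [0.0, 1.0, 0.0, 20.0],
              [0.0, 0.0, 1.0, 30.0], [0.0, 0.0, 0.0, 1.0]],
    "output": [[0.0, 0.0, 0.0, 0.0]],
}
program = [("DP4", ("output", 0), "xyzw"[row], [("vertex", 0), ("const", row)])
           for row in range(4)]
run(program, regs)
```

After the call, `regs["output"][0]` holds the translated position (11, 22, 33, 1). Each DP4 writes a single component because its write mask selects only one slot.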

The arithmetic unit in our vertex processing unit supports a 24-bit floating-point format, which satisfies the requirements of OpenGL ES 2.0. We tested three vertex programs, as shown in Fig. 3.

![](image)

**Fig. 3.** Test vertex programs. Left: normal value; middle: Cook-Torrance illumination; right: environment map. All programs use the same model, which has 6,984 vertices.

We compared images rendered by our system with images rendered by pure OpenGL on the PC and found only small differences, which cannot be recognized with the naked eye. The frame rates of the test programs are compared in Table 3. We can see that the frame rate is inversely proportional to the number of assembly commands.

<table>
<thead>
<tr>
<th></th>
<th>sample 1</th>
<th>sample 2</th>
<th>sample 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>Number of assembly commands</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>fps</td>
<td>61.79</td>
<td>17.54</td>
<td>26.2</td>
</tr>
</tbody>
</table>

5 Conclusion

We designed and implemented a programmable vertex processing unit for mobile environments based on the OpenGL ES 2.0 specification. Since the final draft of OpenGL ES 2.0 came out around September 2005, it is hard to find software or hardware implementations based on the specification. We presented the architecture and instruction format of the vertex processing unit, and we defined 28 primitive instructions and 3 macro instructions based on the operation processing method. Our implementation and test results show that the error is negligible and that the performance is inversely proportional to the number of vertices and the number of instructions in the vertex program, as we expected. At present we have implemented only the vertex processing unit. However, the fragment processing unit is also under development, and both units will be implemented as a hardware chip.

**Fig. 4.** A screenshot of a mobile game implemented with our vertex processing unit (left) and a hardware prototype of the target system using an FPGA (right).

Acknowledgement

This work was supported by the Ministry of Culture & Tourism and KOCCA under the Culture and Content Technology Research Center (CTRC) Support Program.

References