The Intel CPU Architecture Family is a series of compatible processors that includes the Intel386™, Intel486™, and Pentium® processors. Each successive member of the family can execute any binary created for members of previous generations. For example, any existing 8086/8088, 80286, Intel386 CPU (DX or SX), or Intel486 CPU application will run on the Pentium processor without modification or recompilation. However, certain code optimization techniques make applications execute faster on a specific member of the family with little or no impact on the performance of other members. Most of these optimizations deal with instruction sequence selection and instruction reordering to complement the processor microarchitecture.
Pentium® Processor Architecture
Integer Pipelines
The Pentium® processor has two parallel integer pipelines, the main pipe (U), and the secondary pipe (V). The two pipes are similar except the V-pipe has a limited set of instructions it can execute. The following list contains the instructions each pipe can execute:
U-Pipe | V-Pipe |
Executes all instructions | Executes simple instructions only |

Simple instructions:
mov reg, reg/mem/imm
mov mem, reg/imm
alu reg, reg/mem/imm
alu mem, reg/imm
inc reg/mem
dec reg/mem
push reg/mem
pop reg
lea reg, mem
jmp/call/jcc near
nop
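One way to read the table is as a lookup: an instruction form is a candidate for the V-pipe only if it appears in the simple list, while the U-pipe accepts everything. The C sketch below encodes that reading. The enum names and both functions are hypothetical, introduced here for illustration only, and the check covers the instruction form alone, not the pairing limitations.

```c
#include <stdbool.h>

/* Hypothetical instruction classes, one per entry in the simple list. */
typedef enum {
    INSN_MOV_REG,      /* mov reg, reg/mem/imm */
    INSN_MOV_MEM,      /* mov mem, reg/imm     */
    INSN_ALU_REG,      /* alu reg, reg/mem/imm */
    INSN_ALU_MEM,      /* alu mem, reg/imm     */
    INSN_INC,          /* inc reg/mem          */
    INSN_DEC,          /* dec reg/mem          */
    INSN_PUSH,         /* push reg/mem         */
    INSN_POP_REG,      /* pop reg              */
    INSN_LEA,          /* lea reg, mem         */
    INSN_BRANCH_NEAR,  /* jmp/call/jcc near    */
    INSN_NOP,          /* nop                  */
    INSN_OTHER         /* anything else, e.g. enter, leave, loop */
} insn_class;

/* The V-pipe executes only the simple instruction forms. */
bool can_issue_in_v_pipe(insn_class c)
{
    return c != INSN_OTHER;
}

/* The U-pipe executes all instructions. */
bool can_issue_in_u_pipe(insn_class c)
{
    (void)c;
    return true;
}
```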
The Pentium® processor has the ability to execute two instructions simultaneously. This is called pairing. There are certain limitations on how two instructions can be paired. The limitations are as follows:
Optimization Strategies
Two models for instruction set selection exist: complex and simple. The complex model uses multi-operation instructions (e.g., inc [mem]). This model provides advanced three- or four-component addressing and minimizes the size of the code. The simple instruction model uses single-operation instructions. The following code is an example of single-operation instructions:
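As a hedged illustration of the two models (not the original listing), the C sketch below writes the same increment twice: once as a single read-modify-write statement and once as separate load, operate, and store steps. The function names are hypothetical, and the assembly shown in the comments is only an indication of what a compiler might emit, since an optimizing compiler is free to generate either form from either function.

```c
/* Complex model: one read-modify-write statement, which a compiler may
 * emit as a single multi-operation instruction such as  inc [mem]. */
void increment_complex(int *mem)
{
    ++*mem;
}

/* Simple model: the same work split into single-operation steps, roughly
 *   mov eax, [mem]
 *   inc eax
 *   mov [mem], eax
 * (illustrative only; actual code generation is up to the compiler). */
void increment_simple(int *mem)
{
    int tmp = *mem;   /* load    */
    tmp = tmp + 1;    /* operate */
    *mem = tmp;       /* store   */
}
```

Both functions leave the same value in memory; the difference is only in how many separate operations the sequence exposes to the two pipes.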
Loop Oriented Code
Loop-oriented code provides a good instruction cache hit rate if Pentium optimization strategies are used. One way to optimize loop-oriented code is to break up complex instructions (e.g., enter, leave, loop) into a sequence of simple instructions. A simple instruction sequence can take advantage of more issue slots. Another way to optimize loop-oriented code is to maximize the use of the V-pipe. Depending on the code being executed and its sequence, V-pipe utilization can range anywhere from zero to 100 percent. Loop-oriented code suits the SPEC suite, graphics, CAD, and many similar programs.
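A minimal sketch of the second idea, assuming the compiler keeps the two additions adjacent: splitting a sum across two independent accumulators places simple operations with no dependency on each other side by side, which gives the processor candidates for the U-pipe and V-pipe in the same clock. The function name is hypothetical, and whether a given compiler actually emits a pairable sequence is not guaranteed.

```c
#include <stddef.h>

/* Sum an array using two independent accumulators. The two additions in
 * the loop body do not depend on each other, so they are candidates for
 * issue in the same clock. */
long sum_two_accumulators(const int *data, size_t n)
{
    long even = 0;
    long odd = 0;
    size_t i;

    for (i = 0; i + 1 < n; i += 2) {
        even += data[i];       /* independent of the next line */
        odd  += data[i + 1];
    }
    if (i < n) {
        even += data[i];       /* leftover element when n is odd */
    }
    return even + odd;
}
```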
Branch Oriented Code
Branch-oriented code should use very few loops. Where possible, avoid branch instructions in inner loops, because their cost depends on a new feature of the Pentium processor called branch prediction. At each branch, the processor predicts where the code is going based on previous operations. If the processor has guessed correctly, the branch takes only one clock cycle. If it has guessed wrong, a conditional branch incurs a three-cycle penalty when executed in the U-pipe or a four-cycle penalty when executed in the V-pipe. Mispredicted calls and unconditional jump instructions incur a three-cycle penalty in either pipe. By comparison, the Intel486 processor has a two-clock penalty for taken branches. Using a large number of loops in branch-oriented code results in a low instruction cache hit rate. Also, generate code so that the default branch direction (jump not taken) is the most common case, and keep the code as small as possible.
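The cycle counts above can be turned into a rough average cost per conditional branch. The sketch below is a simple model, not a statement of exact hardware timing: it assumes the misprediction penalty is added on top of the one-clock cost of a correctly predicted branch, and the names `pipe_id` and `expected_branch_clocks` are hypothetical.

```c
typedef enum { PIPE_U, PIPE_V } pipe_id;

/* Average clocks per conditional branch for a given misprediction rate
 * (0.0 to 1.0). Assumption: the penalty (3 clocks in the U-pipe, 4 in the
 * V-pipe) is added to the 1-clock cost of a correctly predicted branch. */
double expected_branch_clocks(double mispredict_rate, pipe_id pipe)
{
    const double predicted_clocks = 1.0;
    const double penalty = (pipe == PIPE_U) ? 3.0 : 4.0;

    return predicted_clocks + mispredict_rate * penalty;
}
```

Under this model, a V-pipe conditional branch that is mispredicted one time in four averages 1 + 0.25 × 4 = 2 clocks, which is why arranging code so the common case is predicted correctly matters.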
Floating Point Optimizations on the Pentium® Processor
The Pentium processor is the first generation of the Intel386 family that uses a pipelined floating-point unit. Achieving maximum throughput from that unit, however, requires certain optimizations.
Floating Point Pairing
Floating point (FP) instructions will not pair with integer instructions. Two FP instructions will only pair under the following conditions:
fld single/double | fdiv |
all fadds | fcom |
fsub | fucom |
fmul | ftst |
fabs | fchs |
Memory Operands
A floating-point operation performed on a memory operand instead of a stack register costs no additional clock cycles. Avoid memory operands in the integer part of the Pentium processor, but use them freely in the floating-point part.
Floating Point Stalls
A delay (stall) often occurs between two floating-point operations, typically when the second must wait for the first. The stall can be avoided by inserting other instructions between the pair that causes it. These should be integer instructions, or floating-point instructions that will not themselves cause a stall. The number of instructions to insert depends on the length of the delay.
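The idea can be sketched at the source level, with the caveat that a compiler is free to reorder these statements and that real scheduling is done on the generated instructions. In the hypothetical function below, an independent multiply is placed between a multiply and the add that consumes its result, so the consumer is not issued immediately behind its producer.

```c
/* Computes a*b + c and, as independent filler work, d*e.
 * The statement order mirrors the scheduling idea: producer, unrelated
 * work, then consumer. The function and parameter names are illustrative. */
double scale_then_offset(double a, double b, double c,
                         double d, double e, double *other)
{
    double product = a * b;   /* producer                              */
    *other = d * e;           /* independent work placed in the gap    */
    return product + c;       /* consumer of the earlier product       */
}
```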
Structuring C Code to Help Compilers Optimize
The compiler will need to know the dependencies of the program in order to optimize the code as much as possible. The following rules will assist in this operation:
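One widely used technique for making a dependency visible, offered here as an illustration rather than as one of the application note's rules, is to copy a value read through a pointer into a local variable. In the first function below, the compiler must allow for the possibility that a store to `dst[i]` changes `*scale`, so it may reload the value on every iteration. In the second, the local copy tells it the value is fixed for the whole loop. The function names are hypothetical.

```c
#include <stddef.h>

/* The compiler cannot rule out that dst and scale overlap, so *scale may
 * be reloaded from memory on each iteration. */
void scale_reload(int *dst, size_t n, const int *scale)
{
    size_t i;

    for (i = 0; i < n; ++i) {
        dst[i] *= *scale;
    }
}

/* Copying *scale into a local makes the dependency explicit: the factor
 * cannot change inside the loop, so it can stay in a register. */
void scale_local(int *dst, size_t n, const int *scale)
{
    const int s = *scale;
    size_t i;

    for (i = 0; i < n; ++i) {
        dst[i] *= s;
    }
}
```

The two versions give the same result only when `scale` does not point into `dst`; if it does, the local copy changes the behavior, so the rewrite is valid only where that overlap cannot occur.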
Summary
This document summarizes the contents of AP-500, Optimizations for Intel's 32-Bit Processors. For further information, you can obtain this application note by calling Intel Literature. Refer to document number 241799.