Optimization Strategies for the Pentium® Processor

Introduction

The Intel CPU architecture family is a series of compatible processors that includes the Intel386(tm), Intel486(tm), and Pentium® processors. Each successive member of the family can execute any binary created for earlier generations. For example, existing 8086/8088, 80286, Intel386 CPU (DX or SX), and Intel486 CPU applications execute on the Pentium processor without modification or recompilation. However, certain code optimization techniques make applications run faster on a specific member of the family with little or no impact on the performance of the other members. Most of these optimizations involve instruction sequence selection and instruction reordering to complement the processor microarchitecture.

Pentium® Processor Architecture

Integer Pipelines

The Pentium® processor has two parallel integer pipelines: the main pipe (U) and the secondary pipe (V). The two pipes are similar, except that the V-pipe can execute only a limited set of instructions. The following list contains the instructions each pipe can execute:

U-Pipe: Executes all instructions
V-Pipe: Executes simple instructions:

    mov reg, reg/mem/imm
    mov mem, reg/imm
    alu reg, reg/mem/imm
    alu mem, reg/imm
    inc reg/mem
    dec reg/mem
    push reg/mem
    pop reg
    lea reg, mem
    jmp/call/jcc near
    nop

Integer Pairing

The Pentium® processor has the ability to execute two instructions simultaneously. This is called pairing. There are certain limitations on how two instructions can be paired. The limitations are as follows:

Optimization Strategies

Two models for instruction set selection exist: complex and simple. The complex model uses multi-operation instructions (e.g., inc [mem]). This model provides advanced three- or four-component addressing and minimizes code size. The simple instruction model uses single-operation instructions. The following code is an example of single-operation instructions:

Single-operation instructions are the preferred strategy for superscalar execution.
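
The difference between the two models can be sketched in C. This is an illustration only, not the example from the original note: the function names are hypothetical, and the instruction sequences in the comments show what a compiler might emit, not what it is guaranteed to emit.

```c
/* Complex model: one read-modify-write statement. A compiler may
   emit it as a single multi-operation instruction, for example:
       inc dword ptr [mem]                                        */
void bump_complex(int *counter)
{
    ++*counter;
}

/* Simple model: the same work written as three single-operation
   steps, roughly:
       mov reg, [mem]
       inc reg
       mov [mem], reg                                             */
void bump_simple(int *counter)
{
    int tmp = *counter;   /* load   */
    tmp = tmp + 1;        /* modify */
    *counter = tmp;       /* store  */
}
```

Both functions leave the same value in memory. The simple form only exposes the individual steps, so that each one can be scheduled separately.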

Loop Oriented Code

Loop-oriented code provides a good instruction cache hit rate if Pentium optimization strategies are used. One way to optimize loop-oriented code is to break up complex instructions (e.g., enter, leave, loop) into a sequence of simple instructions, which can take advantage of more issue slots. Another way is to maximize the use of the V-pipe. Depending on the code being executed and its ordering, V-pipe utilization can range anywhere from 0 to 100 percent. Loop-oriented code is typical of the SPEC suite, graphics, CAD, and many similar programs.
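
One possible way to give the V-pipe more work in a loop is to keep two independent chains of simple operations side by side. The sketch below assumes that adjacent instructions with no data dependency between them are candidates for pairing (the pairing limitations themselves are not listed in this text). The function name sum_two_chains is hypothetical.

```c
#include <stddef.h>

/* Sums an array using two independent accumulators. The two adds in
   the loop body do not depend on each other, so a compiler is free to
   place them next to each other as simple instructions. */
long sum_two_chains(const int *v, size_t n)
{
    long even = 0;
    long odd = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        even += v[i];       /* chain 1 */
        odd  += v[i + 1];   /* chain 2: independent of chain 1 */
    }
    if (i < n) {
        even += v[i];       /* leftover element when n is odd */
    }
    return even + odd;
}
```

Whether the two adds actually issue together depends on the compiler's instruction selection and ordering, so the generated code should be inspected before relying on this.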

Branch Oriented Code

When writing branch-oriented code, the programmer should use very few loops. If at all possible, avoid branch instructions in inner loops, because of how a new feature of the Pentium processor, branch prediction, handles them. The processor predicts where the code is going at each branch based on previous operations. If the processor guesses correctly, the branch takes only one clock cycle. If it guesses incorrectly, a conditional branch incurs a three-cycle penalty when executed in the U-pipe, or a four-cycle penalty when executed in the V-pipe. Mispredicted calls and unconditional jump instructions incur a three-cycle penalty in either pipe. By comparison, the Intel486 processor has a two-clock penalty for taken branches. Using a large number of loops in branch-oriented code will result in a low instruction cache hit rate. Also, generate code so that the default branch direction (jump not taken) is the most common case, and minimize code size as much as possible.
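
The cycle counts above can be turned into a rough average cost per conditional branch. The sketch below assumes the misprediction penalty is paid on top of the one-cycle cost of a correctly predicted branch, and that the misprediction rate is known from measurement. Both are modeling assumptions, and the names are hypothetical.

```c
#include <assert.h>

enum {
    PREDICTED_CYCLES = 1,   /* correctly predicted branch          */
    MISS_PENALTY_U   = 3,   /* mispredicted conditional, U-pipe    */
    MISS_PENALTY_V   = 4    /* mispredicted conditional, V-pipe    */
};

/* Average cycles per conditional branch for a given misprediction
   rate (0.0 to 1.0), assuming the penalty adds to the base cycle. */
double expected_branch_cycles(double miss_rate, int miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return PREDICTED_CYCLES + miss_rate * miss_penalty;
}
```

Under these assumptions, a branch in the U-pipe that mispredicts one time in ten averages about 1.3 cycles, while one that mispredicts half the time averages 2.5 cycles.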

Floating Point Optimizations on the Pentium® Processor

The Pentium processor is the first generation of the Intel386 family to use a pipelined floating-point unit. However, certain optimizations must be performed in order to achieve maximum throughput from the Pentium processor floating-point unit.

Floating Point Pairing

Floating point (FP) instructions will not pair with integer instructions. Two FP instructions will only pair under the following conditions:

  • The first instruction is one of the following commands:
    fld single/double
    all fadds
    fsub
    fmul
    fabs
    fdiv
    fcom
    fucom
    ftst
    fchs
  • The second instruction is fxch
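
The two conditions above can be written as a small check. This is a simplified sketch that matches on mnemonic only: it does not model the single/double operand restriction on fld, and it treats "fadd" as standing for all forms of fadd. The function and table names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Mnemonics allowed as the first instruction of an FP pair,
   taken from the list above. */
static const char *const fp_first_ok[] = {
    "fld", "fadd", "fsub", "fmul", "fabs",
    "fdiv", "fcom", "fucom", "ftst", "fchs"
};

/* Returns true when `first` followed by `second` satisfies both
   conditions: `first` is in the list and `second` is fxch. */
bool fp_can_pair(const char *first, const char *second)
{
    size_t count = sizeof fp_first_ok / sizeof fp_first_ok[0];

    if (strcmp(second, "fxch") != 0) {
        return false;
    }
    for (size_t i = 0; i < count; i++) {
        if (strcmp(first, fp_first_ok[i]) == 0) {
            return true;
        }
    }
    return false;
}
```

For example, fp_can_pair("fmul", "fxch") holds, while fp_can_pair("fmul", "fadd") does not, because the second instruction is not fxch.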

Memory Operands

A floating-point operation performed on a memory operand instead of a stack register does not cost any extra clock cycles. Memory operands are best avoided in the integer part of the Pentium processor, but should be used in the floating-point part.

Floating Point Stalls

A delay often occurs between two floating-point operations. The stall can be avoided by inserting instructions between the pair that causes it. The inserted instructions should be integer instructions, or floating-point instructions that do not themselves cause a stall. The number of instructions to insert depends on the length of the delay.
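
At the source level, one way to leave room between a floating-point operation and the use of its result is to start the next independent operation first. The sketch below applies this to a dot product. It assumes the stall in question comes from using a result immediately after it is produced, which this text does not state explicitly, and the function name is hypothetical.

```c
#include <stddef.h>

/* Dot product in which the multiply for element i is issued before
   the add that consumes the product of element i - 1. The additions
   happen in the same order as in the straightforward loop. */
double dot_pipelined(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    double prod;

    if (n == 0) {
        return sum;
    }
    prod = a[0] * b[0];
    for (size_t i = 1; i < n; i++) {
        double next = a[i] * b[i];  /* independent of the add below */
        sum += prod;                /* uses the previous product    */
        prod = next;
    }
    return sum + prod;
}
```

A compiler may reorder these operations on its own, so this shows the idea and does not guarantee a particular instruction schedule.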

Structuring C Code to Help Compilers Optimize

The compiler needs to know the dependencies in the program in order to optimize the code as much as possible. The following rules will assist in this operation:
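
The rules themselves are not reproduced in this text. As one illustration of making dependencies visible to a compiler, the sketch below copies a value read through a pointer into a local variable before a loop. This example is an assumption about the kind of restructuring meant, not a rule quoted from the application note, and the function names are hypothetical.

```c
#include <stddef.h>

/* The compiler must assume that dst[i] might be the same object as
   *factor, so it may have to reload *factor on every iteration. */
void scale_through_pointer(int *dst, const int *factor, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] *= *factor;
    }
}

/* Copying the value into a local first tells the compiler that the
   multiplier cannot change inside the loop. */
void scale_with_local(int *dst, const int *factor, size_t n)
{
    const int f = *factor;

    for (size_t i = 0; i < n; i++) {
        dst[i] *= f;
    }
}
```

The two functions give the same result whenever factor does not point into dst. If it does, they differ, which is exactly the dependency the compiler cannot rule out in the first version.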

Summary

This document summarizes the contents of AP-500, Optimizations for Intel's 32-Bit Processors. For further information, the application note can be obtained by calling Intel Literature. Refer to document number 241799.

