Pentium® Optimization Cross-Reference by Instruction

The following is a list of optimizations that may come in handy. Each one is listed alphabetically (more or less) in the first column.

The second column lists the CPU or CPUs that the optimization applies to; alternatively, it may be marked as applying to 16-bit code or 32-bit code.

The third column contains one or more replacement code sequences that are either faster or smaller (sometimes both) than the instruction in the first column. For some obscure optimizations, the action of the first-column instruction is explained instead.

The fourth column contains a description and/or examples.

                           replacement
instruction     CPU's      or action            description/notes

aad (imm8)      all        AL = AL+(AH*imm8)    If imm8 is omitted, 10 is used.
                           AH = 0               AAD is almost always slower,
                                                but only 2 bytes long.

aam (imm8)      all        AH = AL/imm8         Same as AAD.
                           AL = AL MOD imm8

add             16-bit     lea reg, [reg+reg+disp]

                                                Use LEA to add
                                                base + index + displacement
                                                Also preserves flags;
                                                for example:

                                                  add bx, 4

                                                can be replaced by:

                                                  lea  bx, [bx+4]

                                                when the flags must not
                                                be changed.

add             32-bit     lea reg, [reg+reg*scale+disp]

                                                Use LEA to add
                                                base + scaled index + disp
                                                Also preserves flags.
                                                (See previous example).
                                                The 32-bit form of LEA
                                                is much more powerful
                                                than the 16-bit version
                                                because of the scaling
                                                and the fact that almost
                                                all of the 8 General purpose
                                                registers can be used
                                                as base and index registers.

and reg, reg    Pent       test reg, reg        Use TEST instead of AND
                                                on the Pentium because
                                                TEST does not write the
                                                register; fewer register
                                                conflicts mean better pairing.

bswap           Pent       ror eax, 16          ROR pairs in the U pipe; BSWAP
                                                doesn't pair.
                                                disadvantage: modifies flags
                                                (Not a direct replacement:
                                                ROR swaps the 16-bit halves,
                                                not all four bytes.)
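
The C sketch below is not part of the original table. It models only the
register values (no flags, no timing) to show why ROR EAX, 16 is not a
drop-in BSWAP. All function names are made up for illustration, and the
three-rotate sequence in bswap_via_rotates is an assumed idiom, not one
the table lists.

```c
#include <assert.h>
#include <stdint.h>

/* bswap eax: reverse all four bytes. */
uint32_t model_bswap(uint32_t eax) {
    return (uint32_t)((eax >> 24) |
                      ((eax >> 8) & 0x0000FF00u) |
                      ((eax << 8) & 0x00FF0000u) |
                      (eax << 24));
}

/* ror eax, 16: swap the upper and lower 16-bit halves. */
uint32_t model_ror16(uint32_t eax) {
    return (uint32_t)((eax >> 16) | (eax << 16));
}

/* ror ax, 8: swap AL and AH, leave the upper half of EAX alone. */
uint32_t model_ror_ax8(uint32_t eax) {
    uint32_t low = eax & 0xFFFFu;
    uint32_t swapped = ((low >> 8) | (low << 8)) & 0xFFFFu;
    return (eax & 0xFFFF0000u) | swapped;
}

/* Assumed idiom: ror ax,8 / ror eax,16 / ror ax,8 gives a full byte swap. */
uint32_t bswap_via_rotates(uint32_t eax) {
    eax = model_ror_ax8(eax);
    eax = model_ror16(eax);
    return model_ror_ax8(eax);
}

/* Small self-check of the model. */
void check_bswap_model(void) {
    uint32_t x = 0x11223344u;
    assert(model_ror16(x) != model_bswap(x));       /* ROR alone differs  */
    assert(bswap_via_rotates(x) == model_bswap(x)); /* three rotates match */
}
```

ROR EAX, 16 by itself matches BSWAP only for values whose byte pattern
happens to make the two results coincide, so it fits code that needs the
halves exchanged rather than a true endian conversion.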

call dest1      286+       push offset dest2    When CALL is followed by
jmp  dest2                 jmp  dest1           a JMP, change the return
                                                address to the JMP destination.

call dest1      all        jmp  dest1           When a CALL is followed by a
ret                                             RET, the CALL can be replaced
                                                by a JMP.

cbw             386+       mov ah, 0            When you know AL < 128
                                                use MOV AH, 0 for speed.
                                                But use CBW for smaller
                                                code size.

cdq             486+       xor edx, edx         When you know EAX is not
                                                negative. Faster, better
                                                pairing.

                                                disadvantage: modifies flags

                Pent       mov edx, eax         When EAX could be positive
                           sar edx, 31          or negative. Better pairing
                                                than CDQ.
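
As a hedged illustration (not from the original table), the C sketch
below models the value left in EDX by CDQ and by the two replacements.
It covers register contents only, not flags or cycle counts, and the
function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* EDX after CDQ: all ones if EAX is negative, otherwise zero. */
uint32_t edx_after_cdq(uint32_t eax) {
    return (eax & 0x80000000u) ? 0xFFFFFFFFu : 0u;
}

/* EDX after MOV EDX,EAX / SAR EDX,31. The arithmetic shift is spelled
   out by hand so the model does not depend on how a C compiler shifts
   negative values. */
uint32_t edx_after_mov_sar(uint32_t eax) {
    uint32_t edx = eax;          /* mov edx, eax                      */
    uint32_t sign = edx >> 31;   /* top bit of EDX: 0 or 1            */
    edx = 0u - sign;             /* sar edx, 31 copies it to all bits */
    return edx;
}

/* EDX after XOR EDX,EDX: always zero. */
uint32_t edx_after_xor(void) {
    return 0u;
}

/* Small self-check of the model. */
void check_cdq_model(void) {
    assert(edx_after_mov_sar(5u) == edx_after_cdq(5u));
    assert(edx_after_mov_sar(0xFFFFFFFBu) == edx_after_cdq(0xFFFFFFFBu));
    assert(edx_after_xor() == edx_after_cdq(5u));           /* EAX >= 0: same */
    assert(edx_after_xor() != edx_after_cdq(0xFFFFFFFBu));  /* EAX < 0: wrong */
}
```

The model shows why XOR EDX, EDX is safe only when EAX is known to be
non-negative, while MOV/SAR matches CDQ for every value of EAX.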

cmp mem, reg    286        cmp reg, mem         reg, mem is 1 cycle faster

cmp reg, mem    386        cmp mem, reg         mem, reg is 1 cycle faster

dec reg16                  lea reg16, [reg16 - 1]   Use to preserve flags
                                                    for BX, BP, DI, SI

dec reg32                  lea reg32, [reg32 - 1]   Use to preserve flags
                                                    for EAX, EBX, ECX, EDX
                                                        EDI, ESI, EBP

div <op>        8088       shr accum, 1         When <op> resolves to 2, use
                                                shift for division.
                                                (use CL for 4, 8, etc.)

div <op>        186+       shr accum, n         When <op> resolves to a power
                                                of 2 use shifts for division.

enter imm16, 0  286+       push bp              ENTER is always slower
                           mov  bp, sp          and 4 bytes in length
                           sub  sp, imm16       if imm16 = 0 then push/mov
                                                is smaller

                386+       push ebp
                32-bit     mov  ebp, esp
                           sub  esp, imm16

inc reg16                  lea reg16, [reg16 + 1]   Use to preserve flags
                                                    for BX, BP, DI, SI

inc reg32                  lea reg32, [reg32 + 1]   Use to preserve flags
                                                    for EAX, EBX, ECX, EDX
                                                        EDI, ESI, EBP

jcxz <dest>:    486+       test cx, cx          JCXZ is faster and
                           je   <dest>:         smaller on 8088-286.
                                                On the 386 it is
                                                about the same speed.

                486+       test ecx, ecx        Never use JCXZ on 486
                           je   <dest>:         or Pentium except to
                                                save code size.

lea reg, mem    8088-286   mov reg, OFFSET mem  MOV reg, imm is faster
                                                on 8088 - 286. On the 386+
                                                they are the same.

        Note: There are many uses for LEA, see: add, inc, dec, mov, mul

leave           486+       mov sp, bp           LEAVE is only 1 byte
                           pop bp               long and is faster
                                                on the 186-386. The
                           mov esp, ebp         MOV/POP is much faster
                           pop ebp              on 486 and Pentium

lodsb           486+       mov al, [si]         LODS is only 1 byte long
                           inc si               and is faster on 8088-386,
                                                much slower on the 486.
                                                On the Pentium the MOV/INC
                                                or MOV/ADD instructions
                                                pair, taking only 1 cycle.

lodsw           486+       mov ax, [si]         see lodsb
                           add si, 2

lodsd           486+       mov eax, [esi]       see lodsb
                           add esi, 4

loop <dest>:     386+       dec cx               LOOP is faster and
                           jnz <dest>:           smaller on 8088-286.
                                                On the 386+ DEC/JNZ is
loopd <dest>:               dec ecx              much faster. On the Pentium
                           jnz <dest>:           the DEC/JNZ instructions
                                                pair taking only 1 cycle.

loopXX <dest>:  486+       je  $+5              The 3 replacement instructions
( XX = e,ne,z or nz)       dec cx               are much faster on the 486+.
                           jnz <dest>:          LOOPxx is smaller and
                                                faster on 8088-286.
loopdXX <dest>: 486+       je  $+5              The speed is about the
                           dec ecx              same on the 386.
                           jnz <dest>:          (JE as shown matches LOOPNE/
                                                LOOPNZ; use JNE for LOOPE/
                                                LOOPZ.)

mov reg2, reg1  286+       lea reg2, [reg1+n]   LEA is faster, smaller and
 followed by:                                   preserves flags. This is a
 inc/dec/add/sub reg2                           way to do a MOV and ADD/SUB
                                                of a constant, n.

mov acc, reg    all        xchg acc, reg        Use XCHG for smaller code
                                                when the final value of one
                                                of the registers can be
                                                ignored.
                                                Note that acc = AX or EAX
                                                (XCHG AL, reg8 is 2 bytes,
                                                the same size as MOV).

mov mem, 1      Pent       lea bx, mem          Displacement/immediate does
                           mov [bx], 1          not pair. LEA/MOV can be used
                                                if other code can be placed
                                                in between to prevent AGI's.
                           mov ax, 1            MOV/MOV may be easier to pair.
                           mov mem, ax

mov [bx+2], 1   Pent       mov ax, 1            Better pairing because
                           mov [bx+2], ax       displacement/immediate
                                                instructions do not pair.

                           lea bx, [bx+2]
                           mov [bx], 1

movsb           486+       mov al, [si]         MOVS is faster and
                           inc si               smaller to move a single
                           mov [di], al         byte, word or dword
                           inc di               on the 8088-386.
                                                On the 486+ the MOV/INC
                                                method is faster.

                                                NOTE: REP MOVS is always
                                                faster to move a large block.

movsw           486+       mov ax, [si]         see MOVSB
                           add si, 2
                           mov [di], ax
                           add di, 2

movsd           486+       mov eax, [esi]       see MOVSB
                           add esi, 4
                           mov [edi], eax
                           add edi, 4

movzx r16, rm8  486+       xor bx, bx           MOVZX is faster and
                           mov bl, al           smaller on the 386.
                                                On the 486+ XOR/MOV
movzx r32, rm8  486+       xor ebx, ebx         is faster. Possible
                           mov bl, al           pairing on the Pentium.
                                                (source can be reg or mem)
movzx r32, rm16 486+       xor ebx, ebx         disadvantage: modifies flags
                           mov bx, ax

mul n           8088+      shl ax, cl           Use shifts or ADDs instead of
                                                multiply when n is a power of 2

mul n           Pent       add ax, ax           ADD is better than single
                                                shift because it pairs better.

mul             32-bit     lea                  Use LEA to multiply by
                                                2, 3, 4, 5, 8 or 9
                                                (7 needs LEA plus SUB)

                           lea eax, [eax+eax*4] (ex: multiply EAX * 5)

                                                LEA is better than SHL on the
                                                Pentium because it pairs in
                                                both pipes, SHL pairs only in
                                                the U pipe.
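
The following C sketch is an illustration added here, not part of the
original table. It models the 32-bit LEA address calculation
(base + index*scale + disp, scale limited to 1, 2, 4 or 8) to show which
multipliers a single LEA can reach. The helper names are hypothetical,
and the LEA + SUB sequence for 7 is an assumption rather than something
the table gives.

```c
#include <assert.h>
#include <stdint.h>

/* Value produced by: lea dst, [base + index*scale + disp] */
uint32_t lea32(uint32_t base, uint32_t index, unsigned scale, uint32_t disp) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return (uint32_t)(base + index * scale + disp);
}

/* lea eax, [eax+eax*2] */
uint32_t times3(uint32_t eax) { return lea32(eax, eax, 2, 0); }

/* lea eax, [eax+eax*4] */
uint32_t times5(uint32_t eax) { return lea32(eax, eax, 4, 0); }

/* lea eax, [eax+eax*8] */
uint32_t times9(uint32_t eax) { return lea32(eax, eax, 8, 0); }

/* Assumed two-instruction sequence for 7:
     lea edx, [eax*8]   (no base register, modeled as base = 0)
     sub edx, eax                                              */
uint32_t times7(uint32_t eax) {
    uint32_t edx = lea32(0, eax, 8, 0);
    return (uint32_t)(edx - eax);
}
```

With one register as both base and index, the reachable factors are
1 + scale (2, 3, 5, 9) plus the bare scales (2, 4, 8); any other factor
takes a second instruction.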

or reg, reg     Pent       test reg, reg        Better pairing because
                                                OR writes to register.
                                                (This is for src = dest.)

pop mem         486+       pop reg              Faster on 486+
                           mov mem, reg         Better pairing on Pentium

push mem        486+       mov  reg, mem        Faster on 486
                           push reg             Better pairing on Pentium

pushf           486+       rcr reg, 1           To save only the carry flag
                                                use a rotate (RCR or RCL)
                              or                into a register. RCR and RCL
                                                are pairable (U pipe only)
                           rcl reg, 1           and take 1 cycle. PUSHF is
                                                slow and not pairable.

popf            486+       rcl reg, 1           To restore only the carry flag.
                                                See PUSHF.

                           rcr reg, 1
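
As a rough sketch (added for illustration, not from the original table),
the C model below shows how a rotate through carry can park the carry
flag in a register and bring it back later. It tracks one 32-bit
register and CF only; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model state: one 32-bit register plus the carry flag. */
typedef struct {
    uint32_t reg;
    unsigned cf;   /* 0 or 1 */
} carry_model;

/* rcr reg, 1: CF moves into bit 31, old bit 0 becomes CF. */
void model_rcr1(carry_model *s) {
    unsigned out = (unsigned)(s->reg & 1u);
    s->reg = (s->reg >> 1) | ((uint32_t)s->cf << 31);
    s->cf = out;
}

/* rcl reg, 1: CF moves into bit 0, old bit 31 becomes CF. */
void model_rcl1(carry_model *s) {
    unsigned out = (unsigned)((s->reg >> 31) & 1u);
    s->reg = (uint32_t)(s->reg << 1) | s->cf;
    s->cf = out;
}

/* Save CF with RCR, let other code clobber CF, restore with RCL.
   Returns the carry flag seen after the restore. */
unsigned save_clobber_restore(unsigned cf_in, unsigned cf_clobbered) {
    carry_model s = { 0x12345678u, cf_in };
    model_rcr1(&s);        /* PUSHF replacement: CF now sits in bit 31 */
    s.cf = cf_clobbered;   /* stands in for flag-changing code         */
    model_rcl1(&s);        /* POPF replacement: bit 31 returns to CF   */
    return s.cf;
}

/* Small self-check of the model. */
void check_carry_model(void) {
    assert(save_clobber_restore(1u, 0u) == 1u);
    assert(save_clobber_restore(0u, 1u) == 0u);
}
```

In this model the register must stay untouched between the save and the
restore, and its low bit ends up holding whatever CF was at restore time,
so the trick suits a scratch register rather than one with live data.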

rep scasb       Pent       loop1:               REP SCAS is faster and
                             mov al, [di]       smaller on 8088-486.
                             inc di             Expanded code is faster
                             cmp al, reg2       on Pentium due to pairing.
                             je  exit
                             dec cx
                             jnz loop1

shl reg, 1      Pent       add reg, reg         ADD pairs better. SHL
                                                only pairs in the U pipe.

stosb           486+       mov [di], al         STOS is faster and smaller
                           inc di               on the 8088-286, and the same
                                                speed on the 386. On the 486+
stosw           486+       mov [di], ax         the MOV/INC is slightly
                           add di, 2            faster.

stosd           486+       mov [edi], eax       REP STOS is faster on 8088-386.
                           add edi, 4           MOV/INC or MOV/ADD is faster
                                                on the 486+

                                                Note: use LEA SI, [SI+n]
                                                to advance SI without
                                                changing the flags.

xchg            all                             Use xchg acc, reg to do a
                                                1 byte MOV when one register
                                                can be ignored.

xchg reg1, reg2 Pent       push reg1            pushes and pops are 1 cycle
                           push reg2            faster on Pentium due to
                           pop  reg1            pairing.
                           pop  reg2

                                                disadvantage: uses stack

                Pent       mov  reg3, reg1      Faster and better pairing
                           mov  reg1, reg2      if reg3 is available.
                           mov  reg2, reg3

xlatb           486+       mov bh, 0            XLAT is faster and smaller
                           mov bl, al           on 8088-386. MOV's are faster
                           mov al, [bx]         on 486+. Best to rearrange
                                                instructions to prevent AGI's
xlatb           486+       xor ebx, ebx         and get pairing on Pentium.
                           mov bl, al           Force high part of BX/EBX
                           mov al, [ebx]        to zero outside of loop.

                                                disadvantage: modifies flags