Cyrix 6x86 Optimization Notes
by Jorn Nystad

[Ed note: The current state of this write-up is very preliminary. Send
questions and comments to me and I will forward them to Jorn -- PH]

Here are all the pairing rules for the Cyrix 6x86 processor that I have
found. Most of them have been obtained through extensive testing of my own.
This information should cover most aspects of 6x86 performance, although it
may not be complete. I hope it helps.

On-line references to 6x86 pairing rules:

None that I know of.

Pipeline:

    F   -- Fetch
    ID1 -- Instruction Decode 1 : Instruction length determination
    ID2 -- Instruction Decode 2 : Actual decode
    AD1 -- Address Decode 1     : Address calculation
    AD2 -- Address Decode 2     : TLB & cache reads, register reads
    EX  -- Execute
    WB  -- Write-back           : Register/memory writes, flags maintenance,
                                  conditional jump evaluation

Having register reads/writes outside the EX unit does NOT impair
performance, because the processor does Register Renaming and Data
Bypassing.

Instruction Timings:

Refer to the Cyrix docs. I haven't found any erroneous timings there yet.

Non-pairable instructions:

These instructions are NOT pairable:

    PUSHA/PUSHAD, POPA/POPAD, IN/OUT, MUL/IMUL, DIV/IDIV,
    LODS/STOS/MOVS/CMPS/SCAS/INS/OUTS, CALL, intersegment JMP,
    BOUND, SMSW, XCHG, BSWAP (?)
    Protected-mode segment loads + all other privileged instructions.

All other instructions are pairable.

X and Y pipelines:

These do not fully correspond to the Pentium U and V pipelines. The 6x86 is
able to swap instructions between the pipelines in the ID2 step.

Normally, it works like this:
If the previous instruction in the stream was passed in the Y pipeline,
then the next instruction is passed in the X pipeline.
If the previous instruction in the stream was passed in the X pipeline,
then the next instruction is passed in the Y pipeline.

The exceptions are as follows:
    - Jump instructions are passed in the X pipeline
    - FPU instructions are passed in the X pipeline
    - Non-pairable instructions are generally passed in the X pipeline

Prefixes:

If the first byte of an instruction's opcode is not enough to determine the
type of instruction (e.g. whether it's an ADD or SUB), then the 6x86
considers the byte to be a prefix.

The 6x86 can decode up to 2 prefixed instructions per clock cycle as long
as none of the instructions has more than one prefix and none of them has
any immediates (except: Rotate/Shift instructions, Near conditional jumps
( 0F 8x xx xx ) ).

If the instruction contains 2 or more prefixes, then it will stall the ID1
unit of the pipeline for (number of prefixes minus 1) cycles.

If the instruction contains immediates (except those mentioned above) AND
prefixes, then the ID1 unit cannot determine the length of any other
instruction within the same clock cycle.

Prefixes affect the speed with which the instructions' lengths are
determined; they do NOT affect pairability in the EX unit.

Instruction Length:

If an instruction is 7 bytes or longer, then the ID1 unit cannot determine
the length of any other instruction within the same clock cycle.
If the instructions are 6 bytes or shorter (and not riddled with prefixes),
then the ID1 unit can determine the length of 2 instructions per clock
cycle.

The 6x86 stores NO predecode information in any of its caches.

Read-after-write (RAW) dependencies:

These appear if the first of two instructions (which we want to pair)
writes to a register and the second instruction reads the same register.
For example:

    ADD AX,BX       ;; modifies AX
    MOV CX,AX       ;; reads AX

The 6x86 can pair the two instructions if:
    - only one of the instructions performs any arithmetic
    - the operands (register written and register read) have the same size.
Otherwise, the 6x86 executes only the first instruction and tries to pair
the other one with the next instruction in the stream.

Some consequences of the RAW rule:

    PUSH AX         ;; modifies SP
    PUSH BX         ;; also modifies SP

2 stack instructions can thus never pair on a 6x86.

    MOV CX,555      ;; modifies CX
    LOOP flag1      ;; also modifies CX

These instructions will not pair because both of them modify CX.

Write-after-read (WAR), write-after-write (WAW) dependencies:

    WAR:    MOV AX,BX       ;; reads BX
            ADD BX,5        ;; writes to BX

    WAW:    ADD BX,5        ;; writes to BX
            MOV BX,AX       ;; writes to BX

These do not affect pairability. The 6x86's Register Renaming capabilities
avoid potential collisions.

Memory accessing:

Address Generation Interlock - occurs when one instruction modifies
registers which another (later) instruction uses for memory accessing.
It can stall the 6x86 processor for a maximum of 2 cycles.

Unaligned accesses - unaligned reads need 2 cycles in the AD2 unit.
Unaligned writes need 2 cycles in the WB unit. Unaligned memory accesses
are defined as all memory accesses that cross an 8-byte boundary.
A memory read that crosses a cache line boundary will NOT cause any memory
to be cached if it is not already cached.

Address generation - If the address expression is composed of 3 elements,
something like MOV EAX,[EBX + 8*ESI + 5555], then the AD1 unit is stalled
for 1 cycle. Otherwise, it is not (even with 2 registers, as in
MOV AX,[BX+SI]; the Cyrix documentation is in error here).

There are NO restrictions as to which pipeline can have access to which
cache line, i.e. 2 instructions can read from the same cache line and still
pair.

The cache can respond to a maximum of 2 accesses per clock cycle. The
priority seems to be:
    1: Code Fetch (if necessary)
    2: Data Read
    3: Data Write

The 6x86 can fetch code either from the Instruction Line Cache (ILC) or, if
the code is not in the ILC, from the unified cache.

If the cache is so overloaded that data writes cannot take place
immediately, then the 6x86 writes to one of its 4 write buffers instead and
waits until the cache becomes accessible again.

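
The 8-byte rule given under "Unaligned accesses" above can be checked
mechanically when laying out data. The following is only a sketch in
Python: the function names are made up for illustration, and the 1-cycle
figure for the aligned case is an assumption (the notes only state the
2-cycle unaligned case).

```python
def crosses_8_byte_boundary(address: int, size: int) -> bool:
    """Return True if an access of `size` bytes starting at `address`
    touches two different 8-byte blocks (the notes' definition of
    an unaligned access)."""
    if size < 1:
        raise ValueError("size must be at least 1 byte")
    return address // 8 != (address + size - 1) // 8


def stage_cycles(address: int, size: int) -> int:
    """Cycles spent in AD2 (for a read) or WB (for a write).

    The notes give 2 cycles for the unaligned case; 1 cycle for the
    aligned case is an assumption of this sketch.
    """
    return 2 if crosses_8_byte_boundary(address, size) else 1


if __name__ == "__main__":
    # A dword at offset 6 covers bytes 6..9, so it crosses the boundary at 8.
    for addr, size in [(0, 4), (2, 4), (4, 4), (6, 4), (7, 2), (8, 8)]:
        print(hex(addr), size, stage_cycles(addr, size))
```

Note that under this definition a dword at offset 2 is not penalized even
though it is not dword-aligned: only accesses that straddle an 8-byte
boundary count.
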
In case of a conditional jump, the 6x86 will do a code fetch from the
not-predicted target during evaluation.

Memory writes that hit the ILC will invalidate the appropriate ILC line.
The 6x86 also checks that an instruction that writes over instructions
already within the pipeline flushes the pipeline and updates the modified
instructions properly. This mechanism works as long as the instruction
responsible for the write is not actually paired with its target. Avoid
using this mechanism, though; it causes a stall of something like 30
cycles.

The 6x86 does code fetches aligned to 16-byte (128-bit) boundaries. Try to
avoid having jump targets within the last 4 or 5 bytes of any 16-byte
block. Otherwise the 6x86 will have to fetch twice, which effectively
wastes a cycle.

The fastest way to initialize memory is REP STOSD (134 MB/s on my system,
with a 133MHz 6x86 and the "Force Cache Line fill on Write Miss" feature
enabled - see "Configuration registers" below).

The fastest way to move a memory block is REP MOVSD.
Do not try to use FILD/FISTP for 64-bit moves -- the 6x86 does
write-combining, which has the same effect and is faster.

A small anomaly: CMP [memory],value will be cached/not cached as if it were
actually doing a memory write. Avoid it if you are not absolutely sure that
the memory location is cached.

Flags:

It seems to me that the WB unit is responsible for maintaining the flags -
it collects the results of the instructions that have been executed and
determines the appropriate flag values.
It also seems to me that each of the pipelines maintains its own copy of
the flags in the EX unit.
If an instruction needs a flag to operate properly, this can give some
strange results.

For example, the sequence

    ADD AX,BX
    ADC CX,DX

will need 3 cycles to execute. These cycles go something like this:
    1: Executes ADD AX,BX in the X pipeline.
       ADC CX,DX cannot be executed yet.
    2: The WB unit updates the general flags register.
       ADC CX,DX still cannot be executed.
    3: Flags are ready.
       ADC CX,DX is now executed in the Y pipeline.

The sequence

    ADD AX,BX
    PUSH BX
    ADC CX,DX

on the other hand, will need only 2 cycles:
    1: Executes ADD AX,BX in the X pipeline.
       Executes PUSH BX in the Y pipeline.
    2: Local flags are maintained in the X pipeline.
       ADC CX,DX can be executed in the X pipeline.

So: Put exactly one instruction between the instruction that generates
flags and the instruction that requires them.
And also: Do not pair the flag-generating instruction with another
flag-generating instruction.

Conditional jump instructions are not affected as severely by this odd flag
handling; the 6x86 allows conditional jumps to be evaluated in the WB unit.

Speculative Execution:

Whenever the 6x86 executes a conditional jump or an FPU instruction, it
checkpoints its registers and increments its speculation level (Level
4=Max, Level 1=No Speculative Execution available). The level is
decremented as soon as the 6x86 has evaluated the jump or FPU instruction
in question.

If the 6x86 speculation level was last incremented by a conditional jump,
then the 6x86 doesn't allow any instructions to change the speculation
level until the conditional jump has been properly resolved.
(Information known to be incomplete/guesswork)
This has at least 2 effects:
    - The 6x86 cannot do a conditional jump more often than once per 2
      clock cycles.
    - The 6x86 cannot process any FPU instructions for the first 2 cycles
      after a conditional jump.

ALL FPU instructions can be buffered using Speculative Execution. So, if
you wish, you can do the sequence

    FYL2XP1
    FSINCOS
    FPATAN
    F2XM1

within 4 clock cycles and then spend the next 400 cycles doing something
else while the FPU actually does the calculations.

All memory writes during Speculative Execution are done to internal write
buffers and not committed to cache or main memory until Speculative
Execution ends. The write buffers can hold up to 4 writes.

If the 6x86 is in Speculative Execution, these actions will cause a stall
until the Speculative Execution ends:
    - Write buffers are full and a write is attempted
    - FPU instruction or conditional jump when the Speculation Level is 4
    - Attempt to modify a segment register (this includes intersegment
      jumps, which modify CS)
    - Attempt to modify flags other than: Carry, Aux Carry, Overflow,
      Sign, Parity, Zero
    - Attempt to execute privileged instructions

FPU:

FPU calculations with very large or very small numbers tend to be slower on
a 6x86. The penalty for using large/small numbers can be calculated using
this pseudo-code:

    Penalty_1 = ceil(abs(log(abs(number1)) / log(65536))) - 1
    Penalty_2 = ceil(abs(log(abs(number2)) / log(65536))) - 1
    Penalty   = max(Penalty_1, Penalty_2)
    If Penalty > 5 Penalty = 5
    If abs(Number) = +INF or Number = NAN Penalty = 0

Add the penalty to the minimum number of cycles to obtain the cycle count.
(This holds for FADDs and FSUBs -- I am not quite as sure about FMUL/FDIV.)

To have the FPU perform optimally, interleave integer and FPU instructions,
so that each FPU instruction gets time to finish before the next
instruction is dispatched to the FPU. Speculative Execution can, to some
extent, compensate if you do not do this, but it has severe limitations.

FPU instructions are generally pairable.

Configuration registers:

The 6x86 has numerous registers accessible through IO ports 22h and 23h.
Write the index of the register you want to access to port 22h, then read
or write port 23h.

Several of these registers affect performance; many of them are
undocumented. Refer to http://grafi.ii.pw.edu.pl/gbm/x86/6x86reg.html or
http://www.sandpile.org/ for more info on these registers.

I have also come across some registers not even mentioned at those places:

    10h                -- seems to control some sort of an event counter
    18h, 19h, 1Ah, 1Bh -- the event counter

I have no idea what kind of events these counters count.

Jorn Nystad