Journal:   Dr. Dobb's Journal  March 1991 v16 n3 p16(8)
-----------------------------------------------------------------------------
Title:     80x86 optimization: aim down the middle and pray. (80x86 family of
           microprocessors) (tutorial)
Author:    Abrash, Michael.
AttFile:    Program:  80X86.ASC  Source code listing.

Summary:   Optimizing code for 8088, 80286, 80386 and 80486 microprocessors
           is difficult because the chips use significantly different memory
           architectures and instruction execution times.  Code cannot be
           optimized for the 80x86 family; rather, code must be designed to
           produce good performance on a range of systems or optimized for
           particular combinations of processors and memory.  Programmers
           must avoid the unusual instructions supported by the 8088, which
           have lost their performance edge in subsequent chips.  String
           instructions should be used but not relied upon.  Registers should
           be used rather than memory operations.  Branching is also slow for
           all four processors.  Memory accesses should be aligned to improve
           performance.  Generally, optimizing an 80486 requires exactly the
           opposite steps as optimizing an 8088.
-----------------------------------------------------------------------------
Descriptors..
Company:   Intel Corp. (Products).
Ticker:    INTC.
Product:   Intel 80286 (Microprocessor) (Programming)
           Intel 80386 (Microprocessor) (Programming)
           Intel 80486 (Microprocessor) (Programming)
           Intel 8088 (Microprocessor) (Programming).
Topic:     Microprocessors
           Optimization
           Programming
           Tutorial
           Assembly Language
           Guidelines
           Type-In Programs
           Microcode
           Processor Architecture.
Feature:   illustration
           graph.
Caption:   Official and actual cycles per binary-to-hex ASCII conversion.
           (graph)
           Actual performance in microseconds of two solutions to a problem.
           (graph)
           Actual performance of three clearing approaches across the 80x86
           family. (graph)

-----------------------------------------------------------------------------
Full Text:

Optimization

Picture this: You're an archer aiming at a target 100 feet away.  A strong
wind comes up and pushes each arrow to the left as it flies.  Naturally, you
compensate by aiming farther to the right.  That's what it's like optimizing
for the 8088; once you learn to compensate for the strong but steady effects
of the prefetch queue and the 8-bit bus, you can continue merrily on your
programming way.

Now the wind starts gusting unpredictably.  There's no way to compensate, so
you just aim for the bull's-eye and hope for the best.  That's what it's like
writing code for good performance across the entire 80x86 family, or even for
the 286/386SX/386 heart of today's market.  You just aim down the middle and
pray.

The New World of the 80x86

In the beginning, the 8088 was king, and that was good.  The optimization
rules weren't obvious, but once you learned them, you could count on them
serving you well on every computer out there.

Not so these days.  There are four major processor types--8088, 80286, 80386,
and 80486--with a bewildering array of memory architectures: cached (in
several forms), page mode, static-column RAM, interleaved, and, of course,
the 386SX, with its half-pint memory interface.  The processors offer wildly
differing instruction execution times, and memory architectures warp those
times further by affecting the speed of instruction fetching and access to
memory operands.  Because actual performance is a complex interaction of
instruction characteristics, instruction execution times, and memory access
speed, the myriad processor-memory combinations out there make "exact
performance" a meaningless term.  A specific instruction sequence may run at
a certain speed on a certain processor in a certain system, but that often
says little about the performance of the same instructions on a different
processor, or even on the same processor with a different memory system.  The
result: Precise optimization for the general PC market is a thing of the
past.  (We're talking about optimizing for speed here; optimizing for size is
the same for all processors so long as you stick to 8088-compatible code.)

So there is no way to optimize performance ideally across the 80x86 family.
An optimization that suits one processor beautifully is often a dog on
another.  Any 8088 programmer would instinctively replace:

    DEC  CX
    JNZ  LOOPTOP

with:

    LOOP  LOOPTOP

because LOOP is significantly faster on the 8088.  LOOP is also faster on the
286.  On the 386, however, LOOP is actually two cycles slower than DEC/JNZ.
The pendulum swings still further on the 486, where LOOP is about twice as
slow as DEC/JNZ--and, mind you, we're talking about what was originally
perhaps the most obvious optimization in the entire 80x86 instruction set.

In short, there is no such thing as code that's truly optimized for the
80x86.  Instead, code is either optimized for specific processor-memory
combinations, or aimed down the middle, designed to produce good performance
across a range of systems.  Optimizing for the 80x86 family by aiming down
the middle is quite different from optimizing for the 8088, but many PC
programmers are inappropriately still applying the optimization lore they've
learned over the years on the PC (or AT).  The world has changed, and many of
those old assumptions and tricks don't hold true anymore.

You will not love the new world of 80x86 optimization, which is less precise
and offers fewer clever tricks than optimizing for the 8088 alone.  Still,
isn't it better to understand the forces affecting your code's performance
out in the real world than to optimize for a single processor and hope for
the best?

Better, yes.  As much fun, no.  Optimizing for the 8088 was just about as
good as it gets.  So it goes.

Optimization Rules for a New World

So, how do you go about writing fast code nowadays?  One way is to write
different versions of critical code for various processors and memory access
speeds, selecting the best version at runtime.  That's a great solution, but
it requires an awful lot of knowledge and work.
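
One way to arrange the runtime selection is to classify the processor once at
startup and remember which version of each critical routine to call.  The
sketch below uses the widely documented FLAGS test (bits 12-15 always read as
ones on an 8088 or 8086, and as zeros on a 286 in real mode).  The name
DetectCPU and its return codes are assumptions made for illustration, not
something taken from this article's listings, and the routine makes no
attempt to tell a 386 from a 486 or to measure memory speed.

```asm
; Hypothetical helper: returns AX = 0 (8088/8086), 1 (80286),
; or 2 (80386 or later).  Real mode, 8088-compatible instructions only.
DetectCPU PROC NEAR
        PUSHF                   ;preserve the caller's flags
        SUB     AX,AX
        PUSH    AX
        POPF                    ;try to clear FLAGS bits 12-15
        PUSHF
        POP     AX
        AND     AX,0F000H
        CMP     AX,0F000H       ;8088/8086: bits 12-15 always read as 1
        JE      Is8088
        MOV     AX,0F000H
        PUSH    AX
        POPF                    ;try to set FLAGS bits 12-15
        PUSHF
        POP     AX
        AND     AX,0F000H       ;real-mode 286: bits 12-15 always read as 0
        JZ      Is286
        MOV     AX,2            ;some bits stuck: 386 or later
        JMP     SHORT DetectDone
Is286:
        MOV     AX,1
        JMP     SHORT DetectDone
Is8088:
        SUB     AX,AX
DetectDone:
        POPF                    ;restore the caller's flags
        RET
DetectCPU ENDP
```

The result could index a table of routine addresses that is filled in once at
startup, so the critical code itself pays nothing for the test.  Note that
interrupts are briefly disabled while the test values sit in FLAGS.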

An alternative is to optimize for one particular processor and settle for
whatever performance you get on the others.  This might make sense when the
8088 is the target processor because it certainly needs the optimization more
than any other processor.  However, 8088 optimization works poorly at the
upper end of the 80x86 family.

Nowadays, though, most of us want to optimize for the 286 and 386 systems
that dominate the market, or across all 80x86 processors, and that's a tough
nut to crack.  The 286 and 386 come in many configurations, and you can be
sure, for example, that a 386SX, an interleaved 386, and a cached 386 have
markedly different performance characteristics.  There are, alas, no hard and
fast optimization rules that apply across all these environments.

My own approach to 80x86 optimization has been to develop a set of general
rules that serve reasonably well throughout the 80x86 line, especially the
286 and 386, and to select a specific processor (in my case a cached 386, for
which cycle times tend to be accurate) to serve as the tiebreaker when
optimization details vary from one processor to another.  (Naturally, it's
only worth bothering with these optimizations in critical code.)  The rules
I've developed are:

* Avoid accessing memory operands; use the registers to the max.

* Don't branch.

* Use string instructions, but don't go much out of your way to do so.

* Keep memory accesses to a minimum by avoiding memory operands and keeping
instructions short.

* Align memory accesses.

* Forget about many of those clever 8088 optimizations, using oddball
instructions such as DAA and XLAT, that you spent years learning.

Next I'll discuss each of these rules in turn in the context of
8088-compatible real mode, which is still the focus of the 80x86 world.
Later, I'll touch on protected mode.

Let's start by looking at the last--and most surprising--rule.

Kiss Those Tricks Goodbye

To skilled assembly language programmers, the 8088 is perhaps the most
wonderful processor ever created, largely because the instruction set is
packed with odd instructions that are worthless to compilers but can work
miracles in the hands of clever assembly programmers.  Unfortunately, each
new generation of the 80x86 has rendered those odd instructions and marvelous
tricks less desirable.  As the execution time for the commonly used
instruction ADD BX, 4 has gone down from four cycles (8088) to three cycles
(286) to two cycles (386) to one cycle (486), the time for the less
frequently used instruction CBW has gone from two cycles (8088 and 286) up to
three cycles (386 and 486)!

Consider this ancient optimization for converting a binary digit to hex
ASCII:

    ADD  AL,90H
    DAA
    ADC  AL,40H
    DAA

Now consider the standard alternative:

    ADD  AL,'0'
    CMP  AL,'9'
    JBE  HaveAscii
    ADD  AL,'A'-('9'+1)
HaveAscii:

As Figure 1 indicates, the standard code should be slower on an 8088 or 286,
but faster on a 386 or a 486--and real-world tests confirm those results, as
shown in Figure 2.  (All "actual performance" timings in this article were
performed with the Zen timer from Zen of Assembly Language, see "References"
for details.  The systems used for the tests were: 8088, standard 4.77 MHz PC
XT; 80286, standard one-wait-state, 8 MHz PC AT; 386SX, 16 MHz noncached;
80386, 20 MHz externally cached with all instructions and data in external
cache for all tests except Listings One and Two; 80486, 25 MHz internally
cached, with all instructions and data in internal cache for all tests except
Listings One and Two.)

In other words, this nifty, time-tested optimization is an anti-optimization
on the 386 and 486.

Why is this?  On the 386, DAA--a rarely used instruction--takes four cycles,
and on the 486 it takes two cycles, in both cases twice as long as the more
common instructions CMP and ADD; in contrast, on the 8088 all three
instructions are equally fast at four cycles.  Also, the instruction-fetching
advantage that the 1-byte DAA provides on the 8088 means nothing on a cached
386.
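
To see why the four-instruction DAA sequence works at all, it helps to trace
it for one letter digit and one decimal digit.  The trace below follows the
documented behavior of DAA (add 06H if the low nibble exceeds 9 or AF is set,
then add 60H and set CF if AL was above 99H or CF was set).  The two input
values are arbitrary picks, and the register and flag values in the comments
were worked out by hand rather than taken from the article.

```asm
; Binary digit 0AH, expected result 'A' (41H)
        MOV     AL,0AH
        ADD     AL,90H          ;AL = 9AH, CF = 0, AF = 0
        DAA                     ;low nibble > 9: add 06H -> 0A0H
                                ;AL was > 99H: add 60H -> AL = 00H, CF = 1
        ADC     AL,40H          ;AL = 00H + 40H + CF = 41H, CF = 0
        DAA                     ;nothing to adjust: AL = 41H = 'A'

; Binary digit 05H, expected result '5' (35H)
        MOV     AL,05H
        ADD     AL,90H          ;AL = 95H, CF = 0, AF = 0
        DAA                     ;nothing to adjust: AL = 95H, CF = 0
        ADC     AL,40H          ;AL = 95H + 40H + 0 = 0D5H, CF = 0
        DAA                     ;AL > 99H: add 60H -> AL = 35H = '5', CF = 1
```

The carry produced by the first DAA is what separates the two cases: for
digits above 9 it supplies the extra 1 that lands the result on 'A' through
'F', while for 0 through 9 the second DAA wraps the high nibble around to 3.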

Nor is this an isolated example.  Most oddball instructions, from AAA to
XCHG, have failed to keep pace with the core instructions--ADC, ADD, AND,
CALL, CMP, DEC, INC, Jcc, JMP, LEA, MOV, OR, POP, PUSH, RET, SBB, SUB, TEST,
and XOR--during the evolution from 8088 to 486.  As we saw earlier, even LOOP
lags behind on the 386 and 486.  Check your favorite tricks for yourself;
they might or might not hold up on the 386, but will most likely be
liabilities on the 486.  Sorry, but I just report the news, and the news is:
Kiss most of those tricks goodbye as the 386 and 486 come to dominate the
market.  (This means that hand-optimization in assembly language yields less
of a performance boost nowadays than it did when the 8088 was king; the
improvement is certainly significant, but rarely in the 200-500 percent range
anymore.  Sic transit gloria mundi.)  Most startling of all, string
instructions lose much of their allure as we move away from the 8088, hitting
bottom on the 486.

The 486: All the Rules Change

The 486 represents a fundamental break with 8088-style optimization.
Virtually all the old rules fail on the 486, where, incredibly, a move to or
from memory often takes just one cycle, but exchanging two registers takes
three cycles.  The nonbranching core instructions mentioned earlier take only
one cycle on the 486 when operating on registers; MOV can, under most
conditions, access memory in one cycle; and CALL and JMP take only three
cycles, given a cache hit.  However, noncore instructions take considerably
longer.  XLAT takes four cycles; even STC and CLC take two cycles each.  The
486's highly asymmetric execution times heavily favor core instructions and
defeat most pre-486 optimizations.

Core instructions do have a weakness on the 486.  While 486 MOVs involving
memory are remarkably fast, accessing memory for an operand to OR, ADD, or
the like costs cycles.  Even with the 8K internal cache, memory is not as
fast as registers, except when MOV is used (and sometimes not even then), so
registers are still preferred operands.  (AND [BX],1 is fast, at only three
cycles, but AND BX,1 takes only one cycle--three times as fast.)

OUT should be avoided whenever possible on the 486, and likewise for IN.  OUT
takes anywhere from 10 to 31 cycles, depending on processor mode and
privileges, more than an order of magnitude slower than MOV.  The lousy
performance of OUT--true on the 386 as well--has important implications
for graphics applications.

String instructions are so slow on the 486 that you should check cycle times
before using any string instruction other than the always superior REP MOVS.
For example, LODSB takes five cycles on the 486, but MOV AL,[SI]/INC SI takes
only two cycles; likewise for STOSB and MOV [DI],AL/INC DI.  Listing One
(page 73) uses LODSB/STOSB to copy a string, converting lowercase to
uppercase while copying; Listing Two (page 73) uses MOV/INC instead.  Figure
3 summarizes the performance of the two routines on a variety of processors;
note the diminishing effectiveness of string instructions on the newer
processors.  Think long and hard before using string instructions other than
REP MOVS on the 486.
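
Listings One and Two are not reproduced in this text (they are in the
attached 80X86.ASC), so here is a rough sketch of what the two competing
inner loops might look like.  The procedure names, labels, and register
assignments are assumptions made for illustration, and the sketch assumes a
zero-terminated source string, ES equal to DS, and non-overlapping buffers.

```asm
; Both routines copy a zero-terminated string from DS:SI to the buffer
; at DI, converting 'a'-'z' to uppercase on the way.  Assumes ES = DS.

CopyUpperString PROC NEAR       ;string-instruction version
        CLD                     ;make LODSB/STOSB count upward
StrLoop:
        LODSB                   ;AL = [SI], SI = SI+1
        CMP     AL,'a'
        JB      StrSave
        CMP     AL,'z'
        JA      StrSave
        AND     AL,0DFH         ;lowercase letter: clear bit 5
StrSave:
        STOSB                   ;ES:[DI] = AL, DI = DI+1
        AND     AL,AL           ;was that the terminating zero?
        JNZ     StrLoop
        RET
CopyUpperString ENDP

CopyUpperMov PROC NEAR          ;MOV/INC version
MovLoop:
        MOV     AL,[SI]
        INC     SI
        CMP     AL,'a'
        JB      MovSave
        CMP     AL,'z'
        JA      MovSave
        AND     AL,0DFH         ;lowercase letter: clear bit 5
MovSave:
        MOV     [DI],AL
        INC     DI
        AND     AL,AL           ;was that the terminating zero?
        JNZ     MovLoop
        RET
CopyUpperMov ENDP
```

The only difference between the two is how the byte is loaded and stored,
which makes the pair a convenient test case for timing LODSB/STOSB against
MOV/INC on whatever processors you care about.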

Optimization for the 486 is really a whole new ball game.  When optimizing
across the 80x86 family, the 486 will generally be the least of your worries
because it is so much faster than the rest of the family; anything that runs
adequately on any other processor will look terrific on the 486.  Still, the
future surely holds millions of 486s, so it wouldn't hurt to keep one eye on
the 486 as you optimize.

String Instructions: Fading Stars

On the 8088, string instructions are so far superior to other instructions
that it's worth going to great lengths to use them, but they lose much of
that status on newer processors.  One of the best things about string
instructions on the 8088 is that they require little instruction fetching,
because they're 1-byte instructions and because of the REP prefix; however,
instruction fetching is less of a bottleneck on newer processors.  String
instructions also have superior cycle times on the 8088, but that advantage
fades on the 286 and 386 as well.

On the 286, string instructions (when they do exactly what you need) are
still clearly better than the alternatives.  On the 386, however, some string
instructions are, even under ideal circumstances, the best choice only by a
whisker, if at all.  For example, since Day One, clearing a buffer has been
done with REP STOS.  That's certainly faster than the looping MOV/ADD
approach shown in Listing Three (page 73), but on the 386 and 486 it's no
faster than the unrolled loop MOV/ADD approach of Listing Four (page 73), as
shown in Figure 4.  (Actually, in my tests REP STOS was a fraction of a cycle
slower on the 386, and fractionally faster on the 486.)  REP STOS is much
easier to code and more compact, so it's still the approach of choice for
buffer clearing--but it's not necessarily fastest on a 486 or fast-memory
386.  This again demonstrates just how unreliable the old optimization rules
are on the newer processors.
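
Listings Three and Four are likewise not reproduced here, so the following is
only a sketch of the three buffer-clearing approaches being compared.  The
procedure names, the calling conventions, and the unrolling factor of four
are assumptions for illustration; the article's own listings may well differ.

```asm
; (a) REP STOS.  In: ES:DI -> buffer, CX = number of words.
ClearRepStos PROC NEAR
        SUB     AX,AX           ;value to store
        CLD                     ;store upward
        REP     STOSW
        RET
ClearRepStos ENDP

; (b) Looping MOV/ADD.  In: DS:BX -> buffer, CX = number of words
; (must be nonzero).
ClearLooped PROC NEAR
        SUB     AX,AX
LoopedTop:
        MOV     [BX],AX
        ADD     BX,2
        DEC     CX
        JNZ     LoopedTop       ;one branch per word cleared
        RET
ClearLooped ENDP

; (c) Unrolled MOV/ADD, four words per pass.  In: DS:BX -> buffer,
; CX = number of words (must be a nonzero multiple of 4).
ClearUnrolled PROC NEAR
        SUB     AX,AX
        SHR     CX,1
        SHR     CX,1            ;CX = number of 4-word passes
UnrolledTop:
        MOV     [BX],AX
        MOV     [BX+2],AX
        MOV     [BX+4],AX
        MOV     [BX+6],AX
        ADD     BX,8
        DEC     CX
        JNZ     UnrolledTop     ;one branch per four words cleared
        RET
ClearUnrolled ENDP
```

Unrolling further trades code size for fewer branches per word; where the
payoff levels out depends on the processor and memory system, so it is worth
timing rather than guessing.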

The point is not that you shouldn't use string instructions on the 386.  REP
MOVS is the best way to move data, and the other string instructions are
compact and usually faster, especially on uncached systems.  However, on the
386 it's no longer worth going to the trouble of juggling registers and
reorganizing data structures to use string instructions.  Furthermore, when
you truly need maximum performance on the 386, check out nonstring
instructions in unrolled loops.  It goes against every lesson learned in a
decade of 8088 programming, but avoiding string instructions sometimes pays
on the 386.

The Siren Song of Memory Accesses

Finally, here's a rule that's constant from the 8088 to the 486: Use the
registers.  Avoid memory.

Don't be fooled by the much faster memory access times of the 286 and 386.
The effective address calculation time of the 8088 is mostly gone, so MOV
AX,[BX] takes only five cycles on the 286, and ADD [SI],DX takes only seven
on the 386.  That's so much faster than the 17 and 29 cycles, respectively,
that they take on the 8088 that you might start thinking that memory is
pretty much interchangeable with registers.

Think again.  MOV AX,BX is still more than twice as fast as MOV AX,[BX] on
the 286, and ADD SI,DX is more than three times as fast as ADD [SI],DX on the
386.  Memory operands can also reduce performance by slowing instruction
fetching.  Memory is fast on the 286 and 386.  Registers are faster.  Use
them as heavily as possible.
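
As a small illustration of the rule, here are two ways to total an array of
words.  Both routines, the variable Total, and the register choices are
hypothetical and not from the article; the point is only that the second
version touches memory once per element instead of three times.

```asm
Total   DW      ?               ;hypothetical variable (belongs in the data segment)

; In for both: DS:SI -> array of words, CX = number of words (nonzero).

SumViaMemory PROC NEAR          ;running total kept in memory
        MOV     WORD PTR [Total],0
SumMemTop:
        MOV     AX,[SI]
        ADD     [Total],AX      ;reads and writes Total on every pass
        ADD     SI,2
        DEC     CX
        JNZ     SumMemTop
        RET
SumViaMemory ENDP

SumViaRegister PROC NEAR        ;running total kept in DX
        SUB     DX,DX
SumRegTop:
        ADD     DX,[SI]         ;the array read is the only memory access
        ADD     SI,2
        DEC     CX
        JNZ     SumRegTop
        MOV     [Total],DX      ;one store after the loop ends
        RET
SumViaRegister ENDP
```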

Don't Branch

Here's another rule that stays the same across the 80x86 family: Don't
branch.  Branching suffers on the 8088 from lengthy cycle counts and emptying
the prefetch queue.  Emptying the prefetch queue is a lesser but nonetheless
real problem in the post-8088 world, and the cycle counts of branches are
still killers.  As Figure 4 indicates, it pays to eliminate branches by
unrolling loops or using repeated string instructions.

Modern-Day Instruction Fetching

Instruction fetching is the bugbear of 8088 performance; the 8088 simply
can't fetch instruction bytes as quickly as it can execute them, thanks to
its undersized bus.  Minimizing all memory accesses, including instruction
fetches, is paramount on the 8088.

Instruction fetching is less of a problem nowadays.  Figure 5 shows the
maximum rates at which various processors can fetch instruction bytes;
clearly, matters have improved considerably since the 8088, although
instructions also execute in fewer cycles on the newer processors.  Fetching
problems can occur on any 80x86 processor, even the 486, but the only
processors other than the 8088 that face major instruction fetching problems
are the one-wait-state 286 and the 386SX, although uncached 386s may also
outrun memory.  However, the problems here are different from and less
serious than with the 8088.

Consider: An 8088 executes a register ADD in three cycles, but requires eight
cycles to fetch that instruction, a fetch/execute ratio of 2.67.  A
one-wait-state 286 requires three cycles to fetch a register ADD and executes
it in two cycles, a ratio of 1.5.  A 386SX can fetch a register ADD in two
cycles, matching the execution time nicely, and a cached 386 can fetch two
register ADDs in the two cycles it takes to execute just one.  For
register-only code--the sort of code critical loops should contain--the 386
generally runs flat out, and the 286 and 386SX usually (not always, but
usually) outrun memory by only a little at worst.  Greater fetching problems
can arise when working with large instructions or instruction sequences that
access memory nonstop, but those are uncommon in critical code.  This is a
welcome change from the 8088, where small, register-only instructions tend to
suffer most from inadequate instruction fetching.

Also, uncached 386 systems often use memory architectures that provide
zero-wait-state performance when memory is accessed sequentially.  In
register-only code, instruction fetches are the only memory accesses, so
fetching proceeds at full speed when the registers are used heavily.

So, is instruction fetching a problem in the post-8088 world?  Should
instructions be kept short?

Yes.  Smaller instructions can help considerably on the one-wait-state 286
and on the 386SX.  Not as much as on the 8088, but it's still worth the
trouble.  Even a cached 386 can suffer from fetching problems, although
that's fairly uncommon.  For example, when several MOV WORD PTR [MemVar],0
instructions are executed in a row, as might happen when initializing memory
variables, performance tends to fall far below rated speed, as shown in
Figure 6.  The particular problem with MOV WORD PTR [MemVar],0 is that it
executes in just two (386) or three (286) cycles, yet has both an addressing
displacement field and a constant field.  This eats up memory bandwidth by
requiring more instruction fetching.  It also accesses memory, eating up
still more bandwidth.  We'll see this again, and worse, when we discuss
protected mode.
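
One way to trim the instruction bytes in a run of initializations like that
is to keep the constant in a register.  The sketch below is an illustration
rather than anything from the article: VarA, VarB, and VarC are hypothetical
word variables, and the byte counts in the comments are for 16-bit real-mode
encodings.  The stores still access memory, so this addresses only the
instruction-fetching half of the problem.

```asm
VarA    DW      ?               ;hypothetical variables (belong in the data segment)
VarB    DW      ?
VarC    DW      ?

; Constant in every instruction: 6 bytes each (opcode, mod-r/m,
; 2-byte displacement, 2-byte constant), 18 bytes in all.
        MOV     WORD PTR [VarA],0
        MOV     WORD PTR [VarB],0
        MOV     WORD PTR [VarC],0

; Constant held in AX: 2 bytes for SUB, then 3 bytes per store (the
; accumulator form of MOV to a direct address is opcode plus 2-byte
; displacement), 11 bytes in all.
        SUB     AX,AX
        MOV     [VarA],AX
        MOV     [VarB],AX
        MOV     [VarC],AX
```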

Generally, though, post-8088 processors with fast memory systems and
full-width buses run most instructions at pretty near their official cycle
times; for these systems, optimization consists mostly of counting cycles.
Slower memory or constricted buses (as in the 386SX) require that memory
accesses (both instruction fetches and operand accesses) be minimized as
well.  Fortunately, the same sort of code--register only--meets both
requirements.

Use the registers.  Avoid constants.  Avoid displacements.  Don't branch.
That's the big picture.  Don't sweat the details.

Alignment: The Easy Optimization

The 286, 386SX, and 386 take twice as long to access memory words at odd
addresses as at even addresses.  The 386 takes twice as long to access memory
dwords at addresses that aren't multiples of four as those that are.  You
should use ALIGN 2 to word align all word-sized data, and ALIGN 4 to dword
align all data that's accessed as a dword operand, as in:

        ALIGN  4
MemVar  dd  ?
        :
        MOV EAX,[MemVar]

Alignment also applies to code; you may want to word or dword align the
starts of procedures, labels that can only be reached by branching, and the
tops of loops.  (Code alignment matters only at branch targets, because only
the first instruction fetch after a branch can suffer from nonalignment.)
Dword alignment of code is optimal, and will help on the 386 even in real
mode, but word alignment will produce nearly as much improvement as dword
alignment without wasting nearly as many bytes.

Alignment improves performance on many 80x86 systems without hindering it on
any.  Recommended.

Protected Mode

There are two sorts of protected mode, 16-bit and 32-bit.  The primary
optimization characteristic of 16-bit protected mode (OS/2 1.X, Rational DOS
Extender) is that it takes an ungodly long time to load a segment register
(for example, MOV ES,AX takes 17 cycles on a 286), so load segment registers
as infrequently as possible in 16-bit protected mode.

Optimizing for 32-bit protected mode (OS/2 2.0, SCO Unix, Phar Lap DOS
Extender) is another matter entirely.  Typically, no segment loads are needed
because of the flat address space.  However, 32-bit protected mode code can
be bulky, and that can slow instruction fetching.  Constants and addressing
displacements can be as large as 4 bytes each, and an extra byte, the SIB
byte, is required whenever two 32-bit registers are used to address an
operand or scaled addressing is used.  So, for example, MOV DWORD PTR
[MemVar],0 is a 10-byte instruction in 32-bit protected mode.  The
instruction is supposed to execute in two cycles, but even a 386 needs four
to six cycles to fetch it, plus another two cycles to access memory; a few
such instructions in a row can empty the prefetch queue and slow performance
considerably.  The slowdown occurs more quickly and is more acute on a 386SX,
which needs 14 cycles to perform the memory accesses for this nominally
2-cycle instruction.

Code can get even larger when 32-bit instructions are executed in 16-bit
segments, adding prefix bytes.  (Avoid prefix bytes if you can; they increase
instruction size and can cost cycles.)  Figure 7 shows actual versus nominal
cycle times of multiple MOV DWORD PTR [EBX*4+MemVar],0 instructions running
in a 16-bit segment.  Although cache type (write-back, write-through) and
main-memory write time also affect the performance of stores to memory, there
is clearly a significant penalty for using several large (in this case,
13-byte) instructions in a row.
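
For the curious, here is where the 10 and 13 bytes come from.  The byte
breakdown in the comments follows the standard 386 instruction encoding; it
is included as an aid to the reader, not taken from the article, and MemVar
is assumed to be a dword variable defined elsewhere.

```asm
        .386

; Assembled in a 32-bit segment: 10 bytes.
;   C7 05 <4-byte displacement> <4-byte constant>
;   (opcode, mod-r/m, displacement, constant: 1 + 1 + 4 + 4)
        MOV     DWORD PTR [MemVar],0

; Assembled in a 16-bit segment: 13 bytes.
;   66 67 C7 04 9D <4-byte displacement> <4-byte constant>
;   (operand-size prefix, address-size prefix, opcode, mod-r/m, SIB,
;    displacement, constant: 1 + 1 + 1 + 1 + 1 + 4 + 4)
        MOV     DWORD PTR [EBX*4+MemVar],0
```

Eight of the bytes in each case are displacement and constant, which is why
moving those two fields out of the loop pays off so handsomely.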

Fortunately, this is a worst case, easily avoided by keeping constants and
displacements out of critical loops.  For example, you should replace:

ADDLOOP:
    MOV  DWORD PTR BaseTable[EDX+EBX],0
    ADD  EBX,4
    DEC  ECX
    JNZ  ADDLOOP

with:

    LEA  EBX,BaseTable[EDX+EBX]
    SUB  EAX,EAX
ADDLOOP:
    MOV  [EBX],EAX
    ADD  EBX,4
    DEC  ECX
    JNZ  ADDLOOP

Better yet, use REP STOSD or unroll the loop!
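
A REP STOSD version of the same clearing operation might look like the sketch
below.  It assumes a flat 32-bit segment in which ES and DS address the same
memory, that ECX holds the number of dwords to clear, and that EDI and EAX
are free to use; none of those assumptions come from the article.

```asm
        LEA     EDI,BaseTable[EDX+EBX]  ;address of the first dword
        SUB     EAX,EAX                 ;value to store
        CLD                             ;store upward
        REP     STOSD                   ;ES:[EDI] = EAX, ECX times
```

As in the second loop above, the displacement is paid for once in the LEA and
the constant once in the SUB, and here the DEC/JNZ pair disappears as well.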

Happily, register-only instructions are no larger in 32-bit protected mode
than otherwise and run at or near their rated speed in 32-bit protected mode
on all processors.  All in all, in protected mode it's more important than
ever to avoid large constants and displacements and to use the registers as
much as possible.

Conclusion

Optimization across the 80x86 family isn't as precise as 8088 optimization,
and it's a lot less fun, with fewer nifty tricks and less spectacular
speed-ups.  Still, familiarity with the basic 80x86 optimization rules can
give you a decided advantage over programmers still laboring under the
delusion that the 286, 386, and 486 are merely faster 8088s.

References

Abrash, Michael.  Zen of Assembly Language.  Glenview, Ill.: Scott, Foresman,
1990.

Barrenechea, Mark.  "Peak Performance: On to the 486."  Programmer's Journal
(November-December 1990).

Paterson, Tim.  "Assembly Language Tricks of the Trade."  Dr. Dobb's Journal
(March 1990).

Turbo Assembler Quick Reference Guide.  Borland International, 1990.

i486 Microprocessor Programmer's Reference Manual.  Intel Corporation, 1989.

80386 Programmer's Reference Manual.  Intel Corporation, 1986.

Microsystems Components Handbook: Microprocessors Volume I.  Intel
Corporation, 1985.
 493 Microsystems Components Handbook: Microprocessors Volume I.  Intel\r
 494 Corporation, 1985.\r