AMD Typewriter x86 사용자 설명서

다운로드
페이지 256
92
Efficient Implementation of Population Count Function
AMD Athlon™ Processor x86 Code Optimization 
22007E/0—November 1999
Step 3 
For the first time, the value in each k-bit field is small enough
that adding two k-bit fields results in a value that still fits in the
k-bit field. Thus the following computation is performed:
y = (x + (x >> 4)) & 0x0F0F0F0F
The result is four 8-bit fields whose lower half has the desired
sum and whose upper half contains "junk" that has to be
masked out. In a symbolic form:
x      = 0aaa0bbb0ccc0ddd0eee0fff0ggg0hhh
x >> 4 = 00000aaa0bbb0ccc0ddd0eee0fff0ggg
sum    = 0aaaWWWWiiiiXXXXjjjjYYYYkkkkZZZZ
Th e   W W W W,   X X X X ,   Y Y Y Y,   a n d   Z Z Z Z   va l u e s   a re   t h e
interesting sums with each at most 1000b, or 8 decimal.
Step 4 
The four 4-bit sums can now be rapidly accumulated by means
of a multiply with a "magic" multiplier. This can be derived
from looking at the following chart of partial products:
0p0q0r0s * 01010101 =
        :0p0q0r0s
      0p:0q0r0s
    0p0q:0r0s
  0p0q0r:0s
000pxxww:vvuutt0s
Here p, q, r, and s are the 4-bit sums from the previous step, and
vv is the final result in which we are interested. Thus, the final
result:
z = (y * 0x01010101) >> 24
Example:  
unsigned int popcount(unsigned int v)
{
   unsigned int retVal;
   __asm {
MOV   EAX, [v] 
;v
MOV   EDX, EAX 
;v
SHR   EAX, 1 
;v >> 1
AND   EAX, 055555555h ;(v >> 1) & 0x55555555
SUB   EDX, EAX 
;w = v - ((v >> 1) & 0x55555555)
MOV   EAX, EDX 
;w
SHR   EDX, 2 
;w >> 2
AND   EAX, 033333333h ;w & 0x33333333
AND   EDX, 033333333h ;(w >> 2)  & 0x33333333