Справочник Пользователя для AMD 250

Скачать
Страница из 384
Chapter 8
Integer Optimizations
179
Software Optimization Guide for AMD64 Processors
25112
Rev. 3.06
September 2005
8.6
Efficient Implementation of Population-Count 
Function in 32-Bit Mode
Population count is an operation that determines the number of set bits in a bit string. For example, 
this can be used to determine the cardinality of a set. The example code in this section shows how to 
efficiently implement a population count operation for 32-bit operands. The example is written for the 
inline assembler of Microsoft
®
 Visual C.
Function 
popcount
 implements a branchless computation of the population count. It is based on a 
O(log(n)) algorithm that successively groups the bits into groups of 2, 4, 8, 16, and 32, while 
maintaining a count of the set bits in each group. The algorithm consists of the following steps:
1. Partition the integer into groups of two bits. Compute the population count for each 2-bit group 
and store the result in the 2-bit group. This calls for the following transformation to be performed 
for each 2-bit group: 
00b -> 00b
01b -> 01b
10b -> 01b
11b -> 10b
If the original value of a 2-bit group is v, then the new value will be – (>> 1). In order to handle 
all 2-bit groups simultaneously, it is necessary to mask appropriately to prevent spilling from one 
bit group to the next lower bit group. Thus: 
w = v - ((v >> 1) & 0x55555555)
2. Add the population count of adjacent 2-bit group and store the sum to the 4-bit group resulting 
from merging these adjacent 2-bit groups. To do this simultaneously to all groups, mask out the 
odd numbered groups, mask out the even numbered groups, and then add the odd numbered 
groups to the even numbered groups:
x = (w & 0x33333333) + ((w >> 2) & 0x33333333)
Each 4-bit field now has one of the following values: 0000b, 0001b, 0010b, 0011b, or 0100b.
3. For the first time, the value in each k-bit field is small enough that adding two k-bit fields results 
in a value that still fits in the k-bit field. Thus the following computation is performed:
y = (x + (x >> 4)) & 0x0F0F0F0F
The result is four 8-bit fields whose lower half has the desired sum and whose upper half contains 
“junk” that has to be masked out. A symbolic form is as follows:
x      = 0aaa0bbb0ccc0ddd0eee0fff0ggg0hhh
x >> 4 = 00000aaa0bbb0ccc0ddd0eee0fff0ggg
sum    = 0aaaWWWWiiiiXXXXjjjjYYYYkkkkZZZZ
The WWWW, XXXX, YYYY, and ZZZZ values are the interesting sums with each at most 
1000b, or 8 decimal.