Справочник Пользователя для AMD 250

Chapter 8

Integer Optimizations

179

Software Optimization Guide for AMD64 Processors

25112

Rev. 3.06

September 2005

8.6

Efficient Implementation of Population-Count
Function in 32-Bit Mode

Population count is an operation that determines the number of set bits in a bit string. For example,
this can be used to determine the cardinality of a set. The example code in this section shows how to
efficiently implement a population count operation for 32-bit operands. The example is written for the
inline assembler of Microsoft

Visual C.

Function

popcount

implements a branchless computation of the population count. It is based on a

O(log(n)) algorithm that successively groups the bits into groups of 2, 4, 8, 16, and 32, while
maintaining a count of the set bits in each group. The algorithm consists of the following steps:

1. Partition the integer into groups of two bits. Compute the population count for each 2-bit group

and store the result in the 2-bit group. This calls for the following transformation to be performed
for each 2-bit group:

00b -> 00b

01b -> 01b

10b -> 01b

11b -> 10b

If the original value of a 2-bit group is v, then the new value will be v – (v >> 1). In order to handle
all 2-bit groups simultaneously, it is necessary to mask appropriately to prevent spilling from one
bit group to the next lower bit group. Thus:

w = v - ((v >> 1) & 0x55555555)

2. Add the population count of adjacent 2-bit group and store the sum to the 4-bit group resulting

from merging these adjacent 2-bit groups. To do this simultaneously to all groups, mask out the
odd numbered groups, mask out the even numbered groups, and then add the odd numbered
groups to the even numbered groups:

x = (w & 0x33333333) + ((w >> 2) & 0x33333333)

Each 4-bit field now has one of the following values: 0000b, 0001b, 0010b, 0011b, or 0100b.

3. For the first time, the value in each k-bit field is small enough that adding two k-bit fields results

in a value that still fits in the k-bit field. Thus the following computation is performed:

y = (x + (x >> 4)) & 0x0F0F0F0F

The result is four 8-bit fields whose lower half has the desired sum and whose upper half contains
“junk” that has to be masked out. A symbolic form is as follows:

x = 0aaa0bbb0ccc0ddd0eee0fff0ggg0hhh

x >> 4 = 00000aaa0bbb0ccc0ddd0eee0fff0ggg

sum = 0aaaWWWWiiiiXXXXjjjjYYYYkkkkZZZZ

The WWWW, XXXX, YYYY, and ZZZZ values are the interesting sums with each at most
1000b, or 8 decimal.