an overview of distress
DISTRIBUTION-ESTIMATING SEARCH

The conceit of the algorithm is returning a probability distribution rather than a value. Supposing that we've an evaluation function that does so, which we will consult at the leaves of the search tree, we are left with two procedures to determine:

q1. how to combine the distributions of a position's successors into a distribution for the position itself; and
q2. how to decide where in the tree to direct our search effort.

The probability distributions may be over centipawn or wdl scores,* but we will generally assume the latter so that the support is bounded, and the distributions may be either continuous or discrete with domain of any cardinality, such as distributions over {loss, draw, win} or the interval from −1 to +1 subdivided into any number of parts.

* By wdl score I mean expected outcome.

This first procedure might be riven in twain:

q1a. deriving from the successors' distributions a policy, describing our belief about which of them is optimal;†
q1b. combining the successors' distributions, guided by that policy, into a distribution for the position.

Note that the policy may also be of use in answering q2.

† For example, if two moves are equally believed to be optimal, we would assign them both a value of 1 rather than ½.

In two thousand twenty one, before having considered the problem as a whole and decomposed it as above, I considered q1 only and bluntly began down the following path.

The maximum of two random variables $X_0$ and $X_1$ with probability density functions $p_0$ and $p_1$ and cumulative distribution functions $q_0$ and $q_1$,

$$q_i(x) = P(X_i \le x) = \int_{-\infty}^{x} p_i(t)\,dt,$$

if $X_0$ and $X_1$ are independent, has the cumulative distribution function

$$P(\max(X_0, X_1) \le x) = q_0(x) \times q_1(x).$$
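As an illustration (mine, not from the text), here is this identity for discrete distributions on a common grid; the pmfs are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two arbitrary discrete distributions over S subintervals of [-1, +1].
S = 16
p0 = rng.random(S); p0 /= p0.sum()  # pmf of X0
p1 = rng.random(S); p1 /= p1.sum()  # pmf of X1

q0 = np.cumsum(p0)  # cdf of X0
q1 = np.cumsum(p1)  # cdf of X1

# CDF of max(X0, X1) under independence: pointwise product of the CDFs.
q_max = q0 * q1

# Recover the pmf of the maximum by differencing its cdf.
p_max = np.diff(q_max, prepend=0.0)
assert np.isclose(p_max.sum(), 1.0)
```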

I used distributions over centipawn score, and instead of working with the distributions directly, I represented them with three summary statistics: the peak (the median or mean), the right-hand dispersion (the difference between the peak and the ninety-fifth percentile, if I recall correctly), and the left-hand dispersion (… the fifth percentile). My hope was to find an acceptable approximation of the effect of max(·, ·) on the summary statistics. I did not succeed.

Four years later, I returned to the subject and began to contemplate it more completely, writing the following:

Let N be the number of successors, and then consider four quantities:

s, the score, our central estimate of the value of the position;
v, the variance, the dispersion of our belief around that estimate;
f, the probability that the position is fortuitous, unexpectedly better than believed; and
c, the probability that the position is catastrophic, unexpectedly worse than believed.

Informally, the quantities s and v are meant to describe the bulk of the distribution and the quantities f and c are meant to describe the tails ("the distribution" here being the distribution of belief about the true score of the position).

When switching perspectives, s negates, v remains the same, and f and c swap.
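As a minimal sketch (my own illustration) of these four quantities and the perspective flip:

```python
from dataclasses import dataclass

@dataclass
class Belief:
    s: float  # score: central estimate of the position's value
    v: float  # variance: dispersion of the belief around s
    f: float  # probability the position is fortuitous (unexpectedly better)
    c: float  # probability the position is catastrophic (unexpectedly worse)

    def flipped(self) -> "Belief":
        # Switching perspectives: s negates, v is unchanged, f and c swap.
        return Belief(s=-self.s, v=self.v, f=self.c, c=self.f)
```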

The variables s, v, f, and c can also be indexed, e.g. s(i) denotes the s of the i-th successor.

Then also consider:

b(i), our belief that the i-th successor is optimal; and
p(i), the probability that we would play the i-th successor were this position the root.

The move with the highest p (which is also the move with the highest b) is the one we'd pick if this were the root. Here I'm treating b(i) as lying in the range 0 to 1 and initialized to ½ or 1/N, or initialized depending on s(i) and v(i); but should it be log odds, initialized to 0, or something else?

When we write $\sum$ or $\prod$ with no bounds given, this is shorthand for "from $i = 1$ to $i = N$".

The p(i) obey the property that $\sum p(i) = 1$; in other words, it's a distribution.

Perhaps the b(i) are also a distribution, and perhaps further b(i) and p(i) should be the same. As a theoretical matter, they are distinct; if all moves are precisely equally good, b(i) should be 1 for all i, but p(i) should be 1/N. And p(i) should depend on v(i), f(i), and c(i) in addition to s(i), but b(i) may depend only on s(i).

In some sense, the c of a position is the f of the successor you will pick.

$$c_{\text{pick}} = \sum c(i) \times p(i)$$

In another sense, the c of a position is the probability that all successors are catastrophic, since it only takes one of them being okay for the position to not be catastrophic.

$$c_{\text{opt}} = \prod c(i)$$

The relevance of that observation depends on our belief that we'd be able to find the right move if this position became the root, so we blend the two with some factor β:

$$c = c_{\text{opt}} \times \beta + c_{\text{pick}} \times (1 - \beta).$$

This can be done asymmetrically: we may pick a higher β for our opponent than for ourselves if we think they are less likely to make a mistake.

Similarly, $f_{\text{pick}} = \sum f(i) \times p(i)$, and the probability that any successor is fortuitous is

$$f_{\text{opt}} = 1 - \prod (1 - f(i)).$$

We blend the two: $f = f_{\text{opt}} \times \beta + f_{\text{pick}} \times (1 - \beta)$.
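A small numpy sketch of this bookkeeping (mine; the successor arrays and β are stand-ins):

```python
import numpy as np

def blend_tails(c, f, p, beta):
    """Blend the 'pick' and 'opt' estimates of the tail probabilities.

    c, f : arrays of the successors' catastrophe/fortuity probabilities
    p    : pick distribution over successors (sums to 1)
    beta : weight on the optimal-play estimate
    """
    c_pick = np.sum(c * p)          # expected catastrophe of the move we pick
    c_opt = np.prod(c)              # all successors catastrophic
    f_pick = np.sum(f * p)          # expected fortuity of the move we pick
    f_opt = 1.0 - np.prod(1.0 - f)  # at least one successor fortuitous
    return (c_opt * beta + c_pick * (1 - beta),
            f_opt * beta + f_pick * (1 - beta))
```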

There is a defect: it oughtn't make a difference whether a bad move has a high chance of being unexpectedly worse or not, yet through the opt formulas the tails of moves we would never play still influence c and f.

Because of this defect, and a dissatisfaction with the arbitrariness and lack of theoretical simplicity, I abandoned this line of thinking as well.

This brings us to my ongoing endeavour. I have resolved to use discrete distributions over the interval from −1 to +1 subdivided into S parts. For q1a we will use a distribution of the belief that the successor of a position would be played were the position the root. Then q1b becomes rather simple: let $\pi(i)$ for $i = 1, \ldots, N$ be the policy for the $N$ successors and $p_i(x)$ for $x = 1, \ldots, S$ be the probability mass function returned by the search of a successor. The variable $x$ indexes the subintervals, so $x = 1$ refers to the interval from $-1$ to $-1 + 2/S$ and $x = S$ refers to the interval from $+1 - 2/S$ to $+1$; more generally, $x$ refers to the interval from $-1 + 2/S \times (x-1)$ to $-1 + 2/S \times x$. Then the distribution $r(x)$ we will return is

$$r(x) = \sum_{i=1}^{N} \pi(i) \times p_i(x).$$
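In code, q1b is a single mixture; a minimal sketch (mine), with pi and pmfs as stand-in inputs:

```python
import numpy as np

def combine(pi, pmfs):
    """q1b: mix the successors' pmfs, weighted by the policy.

    pi   : shape (N,), the policy over successors (sums to 1)
    pmfs : shape (N, S), p_i(x) for each successor i over S subintervals
    """
    return pi @ pmfs  # r(x) = sum_i pi(i) * p_i(x), shape (S,)
```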

Now we define q1a. Consider again two random variables $X_0$ and $X_1$ with probability density functions $p_0$ and $p_1$ and cumulative distribution functions $q_0$ and $q_1$. If $X_0$ and $X_1$ are independent, then

$$P(X_0 \le X_1) = \sum_{x=1}^{S} \Big( \sum_{y=1}^{x} p_0(y) \Big) \times p_1(x)$$

or more simply

$$P(X_0 \le X_1) = \sum_{x=1}^{S} q_0(x) \times p_1(x).$$
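A quick numerical check (my illustration, with random pmfs) that the two forms agree:

```python
import numpy as np

rng = np.random.default_rng(1)
S = 16
p0 = rng.random(S); p0 /= p0.sum()
p1 = rng.random(S); p1 /= p1.sum()

# Nested-sum form and cdf form of P(X0 <= X1).
nested = sum(p0[:x + 1].sum() * p1[x] for x in range(S))
via_cdf = np.sum(np.cumsum(p0) * p1)
assert np.isclose(nested, via_cdf)
```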

We can extend this to three random variables:

$$P(X_0 \le X_2 \wedge X_1 \le X_2) = \sum_x q_0(x) \times q_1(x) \times p_2(x).$$

And more generally,

$$P\Big(\bigwedge_k X_k \le X_i\Big) = \sum_x \Big(\prod_{k \ne i} q_k(x)\Big) \times p_i(x).$$

This is our belief that the i-th successor is optimal (at least as good as any other successor).

After normalization, these terms become our policy:

$$s = \sum_{i=1}^{N} P\Big(\bigwedge_k X_k \le X_i\Big), \qquad \pi(i) = P\Big(\bigwedge_k X_k \le X_i\Big) \Big/ s.$$
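Putting q1a together, a direct sketch (mine); the numerical care the next paragraph warns about is deliberately left out:

```python
import numpy as np

def policy(pmfs):
    """q1a: belief that each successor is optimal, normalized to a policy.

    pmfs : shape (N, S), each row a pmf over the S subintervals
    """
    N, S = pmfs.shape
    cdfs = np.cumsum(pmfs, axis=1)  # q_i(x) for each successor
    raw = np.empty(N)
    for i in range(N):
        # Leave-one-out product over k != i of q_k(x), per subinterval.
        others = np.prod(np.delete(cdfs, i, axis=0), axis=0)
        raw[i] = np.sum(others * pmfs[i])  # P(all X_k <= X_i)
    return raw / raw.sum()
```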

There may be numerical difficulties when N is large because of the product $\prod_{k \ne i} q_k(x)$, which might vary wildly in magnitude depending on the value of x, but this completes q1 in concept.

I suspect that q2 admits no similarly simple answer. Much like how α/β search* is effectively made a best-first search through the heuristic application of reductions and extensions,† the most successful procedure is likely to be found empirically, perhaps as a synthesis of various methods. As in α/β search, online learning is apt to be indispensable.

* Depth-first minimax search with iterative deepening, α/β pruning, and zero-window pruning.

† Whether by reducing/extending some nodes directly according to some criteria or by uniformly applying late move reduction and ordering successors according to some criteria.

At nodes for which we happen to have saved information about the successors' distributions, it may be reasonable to start by dividing time in proportion to $\pi(i)$, or if visitation counts are also available, by dividing time in proportion to

$$\pi(i) + c \sqrt{\frac{\log \sum_k v(k)}{v(i)}}$$

for some constant c, analogous to ucb1, or perhaps

$$\pi(i) + c \times A(i) \times \frac{\sqrt{\sum_k v(k)}}{1 + v(i)}$$

where c is scaled by roughly the square root of the variance and $A(i)$ is an adjusting heuristic, analogous to puct. Note the absence of a logarithm in this second formula.
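A sketch of these two allocation rules (my rendering; the square-root forms follow standard ucb1 and puct, and the constants and inputs are invented):

```python
import numpy as np

def ucb1_like(pi, v, c=1.0):
    # pi: policy over successors; v: visitation counts (assumed >= 1 each).
    total = np.sum(v)
    return pi + c * np.sqrt(np.log(total) / v)

def puct_like(pi, v, A, c=1.0):
    # A: an adjusting heuristic per successor; c would be scaled by roughly
    # the square root of the variance. Note the absence of a logarithm.
    total = np.sum(v)
    return pi + c * A * np.sqrt(total) / (1 + v)

# Time at the node could then be divided in proportion to these scores.
```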

Then search can proceed by iterative deepening, giving a node count to divvy out approximately and recursively, and progressively searching with higher counts.

This cannot work as stated. Repeatedly taking the mixture of distributions, as in our prescription of $r(x)$ for q1b, can only increase the variance, but we would expect the variance to generally decrease as depth increases. On the other hand, repeatedly taking the maximum of two bounded random variables will reduce variance. This suggests that the idea of blending $r_{\text{opt}}$ and $r_{\text{pick}}$ was a decent one, as it addresses a fundamental problem.

By $r_{\text{opt}}$ we mean the distribution whose cumulative distribution function is $\prod_i q_i(x)$, and by $r_{\text{pick}}$ we mean $r(x)$ as it was defined above.
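A sketch (mine) of $r_{\text{opt}}$ and the blend, reusing β from earlier:

```python
import numpy as np

def r_opt(pmfs):
    # Distribution of the maximum: its cdf is the product of the cdfs.
    cdf = np.prod(np.cumsum(pmfs, axis=1), axis=0)
    return np.diff(cdf, prepend=0.0)

def r_blend(pi, pmfs, beta):
    r_pick = pi @ pmfs  # the mixture from q1b
    return beta * r_opt(pmfs) + (1 - beta) * r_pick
```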

It is a tricky matter, avoiding degeneracy under repeated application into either a flat or a narrow distribution, diffusing or collapsing quickly. What process would push the variance toward a stable value, but could be overcome given provocation?

We might consider two possible principles at play:

First, the score of a position is the maximum of its successors' scores, so when many successors are plausibly optimal, taking that maximum across them* should narrow our belief, drawing the distribution toward $r_{\text{opt}}$. Second, when the policy concentrates on one successor, the position's distribution is essentially that successor's, drawing it toward $r_{\text{pick}}$.†

To put it more succinctly, low dispersion in the policy drives interpolation toward $r_{\text{pick}}$ and high dispersion drives interpolation toward $r_{\text{opt}}$. This establishes some degree of negative feedback.

* We have been blithely neglecting the fact that the scores of successors are almost certainly not independent. If the most promising successor turns out to be unexpectedly bad, it is somewhat likely the other eminent successors will be worse than hitherto believed.

† Separately, it is also justification to begin focusing on a subset of successors, as it is relatively safe to do so; a desire to differentiate the successors would provide an impetus to do so.
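One way to instantiate this negative feedback (entirely my own guess at a mechanism, not something the essay prescribes): let the blending weight β grow with the policy's dispersion, measured here by normalized entropy.

```python
import numpy as np

def beta_from_policy(pi, eps=1e-12):
    # Normalized entropy in [0, 1]: 0 when one successor dominates,
    # 1 when the policy is uniform. High dispersion -> beta near 1 -> r_opt;
    # low dispersion -> beta near 0 -> r_pick.
    H = -np.sum(pi * np.log(pi + eps))
    return H / np.log(len(pi))
```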

At the present time, it remains an open question how this ought to be carried out exactly, or what heuristics ought to modify it. This essay was last updated on the fifteenth day of February in the two thousand twenty sixth year of the common era.