Updated README: Equal sign not required with `--mode` flag.
[sgk-go] / doc / dfa.texi
CommitLineData
7eeb782e
AT
1In this chapter, we describe the principles of the GNU Go DFA
2pattern matcher. The aim of this system is to permit a fast
3pattern matching when it becomes time critical like in owl
4module (@ref{The Owl Code}). Since GNU Go 3.2, this is enabled
5by default. You can still get back the traditional pattern matcher
6by running @command{configure --disable-dfa} and then recompiling
7GNU Go.
8
9Otherwise, a finite state machine called a Deterministic Finite
10State Automaton (@ref{What is a DFA}) will be built off line from the
11pattern database. This is used at runtime to speedup pattern matching
12(@ref{Pattern matching with DFA} and @ref{Incremental Algorithm}). The
13runtime speedup is at the cost of an increase in memory use and
14compile time.
15
16
17@menu
18* Introduction to the DFA:: Scanning the board along a path
19* What is a DFA:: A recall of language theory.
20* Pattern matching with DFA:: How to retrieve go patterns with a DFA?
21* Building the DFA:: Playing with explosives.
22* Incremental Algorithm:: The joy of determinism.
23* DFA Optimizations:: Some possible optimizations.
24@end menu
25
26
27@cindex pattern matching
28@cindex fast pattern matching
29@cindex pattern database
30@cindex finite state automaton
31@cindex automaton
32@cindex dfa
33@cindex dfa.h
34@cindex dfa.c
35@cindex matchpat.c
36@cindex product
37
38
39@node Introduction to the DFA
40@section Introduction to the DFA
41
42The general idea is as follows:
43
44For each intersection of the board, its neighbourhood is scanned following
45a predefined path. The actual path used does not matter very much; GNU Go
46uses a spiral as shown below.
47
48@ifinfo
49@example
50 +---B--------------+
51 | C 4 A . . . . . .|
52 D 5 1 3 9 . . . . .|
53 E 6 2 8 . . X . . .|
54 | F 7 . . . . . . .|
55 | . +-> . . . . . .|
56 | . . . . . . . . .|
57 | . O . . . X . . .|
58 | . . . . . . . . .|
59 | . . . . . . . . .|
60 +------------------+
61@end example
62@end ifinfo
63@ifnotinfo
64@image{path}
65@end ifnotinfo
66
67In each step of the path, the pattern matcher jumps into a state
68determined by what it has found on the board so far. If we have
69successfully matched one or several patterns in this step, this state
70immediately tells us so (in its @dfn{attribute}).
71But the state also implicitly encodes which further patterns can still
72get matched: The information stored in the state contains in which state
73to jump next, depending on whether we find a black, white or empty
74intersection (or an intersection out of board) in the next step of the
75path. The state will also immediately tell us if we cannot find any
76further pattern (by telling us to jump into the @dfn{error} state).
77
78These sloppy explanations may become clearer with the definitions in
79the next section (@ref{What is a DFA}).
80
81Reading the board following a predefined path reduces
82the two dimentional pattern matching to a linear text searching problem.
83For example, this pattern
84
85@example
86?X?
87.O?
88?OO
89@end example
90
91@noindent
92scanned following the path
93
94@example
95 B
96C4A
975139
98628
99 7
100@end example
101
102@noindent
103gives the string @b{"OO?X.?*O*?*?"}
104where @b{"?"} means @b{'don't care'} and @b{"*"} means @b{'don't care, can
105even be out of board'}.
106
107So we can forget that we are dealing with two dimensional patterns and
108consider linear patterns.
109
110
111@node What is a DFA
112@section What is a DFA
113
114The acronym DFA means Deterministic Finite state Automaton
115(See @uref{http://www.eti.pg.gda.pl/~jandac/thesis/node12.html}
116or @cite{Hopcroft & Ullman "Introduction to Language Theory"}
117for more details).
118DFA are common tools in compilers design
119(Read @cite{Aho, Ravi Sethi, Ullman "COMPILERS: Principles, Techniques and Tools"}
120for a complete introduction), a lot of
121powerfull text searching algorithm like @cite{Knuth-Morris-Pratt}
122or @cite{Boyer-Moore} algorithms are based on DFA's
123(See @uref{http://www-igm.univ-mlv.fr/~lecroq/string/} for a bibliography
124of pattern matching algorithms).
125
126Basically,
127a DFA is a set of @dfn{states} connected by labeled @dfn{transitions}.
128The labels are the values read on the board,
129in GNU Go these values are EMPTY, WHITE, BLACK or OUT_BOARD, denoted
130respectively by '.','O','X' and '#'.
131
132
133The best way to represent a DFA is to draw its transition graph:
134the pattern @b{"????..X"} is recognized by the following DFA:
135
136@ifinfo
137@example
138@group
139 .,X,O .,X,O .,X,O .,X,O . . X
140[1]------>[2]----->[3]----->[4]----->[5]--->[6]--->[7]--->[8 OK!]
141Start
142@end group
143@end example
144@end ifinfo
145
146@ifnotinfo
147@image{dfa}
148@end ifnotinfo
149
150
151This means that
152starting from state [1], if you read '.','X' or 'O' on the board,
153go to state [2] and so on until you reach state [5].
154From state [5], if you read '.', go to state [6]
155otherwise go to error state [0].
156And so on until you reach state [8].
157As soon as you reach state [8],
158you recognize Pattern @b{"????..X"}
159
160Adding a pattern like @b{"XXo"} ('o' is a wildcard for not 'X')
161will transform directly the automaton
162by synchronization product (@ref{Building the DFA}).
163Consider the following DFA:
164
165@ifinfo
166@example
167@group
168Start .,O .,X,O .,O,X .,X,O . . X
169[1]---->[2]----->[3]----->[4]------>[5]--->[6]---->[7]--->[8 OK!]
170 | ^ ^ ^
171 | .,O | | |
172 | ---- | |
173 | | X | |
174 | | --- .,X,O |
175 | | | |
176 | X | X | O,. |
177 --------->[9]------>[A]--->[B OK!]-
178@end group
179@end example
180@end ifinfo
181
182@ifnotinfo
183@image{dfa2}
184@end ifnotinfo
185
186By adding a special @dfn{error} state and completing each state
187by a transition to error state when there is none, we transform
188easily a DFA in a @dfn{Complete Deterministic Finite state
189Automaton} (CDFA). The synchronization product
190(@ref{Building the DFA}) is only possible on CDFA's.
191
192
193@ifinfo
194@example
195@group
196Start .,O .,X,O .,O,X .,X,O . . X
197[1]---->[2]----->[3]----->[4]------>[5]--->[6]---->[7]--->[8 OK!]
198 | ^ ^ ^ | | |
199 | .,O | | | | | |
200 | ---- | | | | |
201 | | X | | |X,O | .,O |X,.,O
202 | | --- .,X,O | | | |
203 | | | | | | |
204 | X | X | O,. | \ / \ / \ /
205 --------->[9]------>[A]--->[B OK!]- [0 Error state !]
206@end group
207@end example
208@end ifinfo
209@ifnotinfo
210@image{cdfa}
211@end ifnotinfo
212
213The graph of a CDFA is coded by an array of states:
214The 0 state is the "error" state and
215the start state is 1.
216
217@example
218@group
219----------------------------------------------------
220 state | . | O | X | # | att
221----------------------------------------------------
222 1 | 2 | 2 | 9 | 0 |
223 2 | 3 | 3 | 3 | 0 |
224 3 | 4 | 4 | 4 | 0 |
225 5 | 6 | 0 | 0 | 0 |
226 6 | 7 | 0 | 0 | 0 |
227 7 | 0 | 0 | 8 | 0 |
228 8 | 0 | 0 | 0 | 0 | Found pattern "????..X"
229 9 | 3 | 3 | A | 0 |
230 A | B | B | 4 | 0 |
231 B | 5 | 5 | 5 | 0 | Found pattern "XXo"
232----------------------------------------------------
233@end group
234@end example
235
236To each state we associate an often empty
237list of attributes which is the
238list of pattern indexes recognized when this state is reached.
239In '@file{dfa.h}' this is basically represented by two stuctures:
240
241@example
242@group
243@code{
244/* dfa state */
245typedef struct state
246@{
247 int next[4]; /* transitions for EMPTY, BLACK, WHITE and OUT_BOARD */
248 attrib_t *att;
249@}
250state_t;
251
252/* dfa */
253typedef struct dfa
254@{
255 attrib_t *indexes; /* Array of pattern indexes */
256 int maxIndexes;
257
258 state_t *states; /* Array of states */
259 int maxStates;
260@}
261dfa_t;}
262@end group
263@end example
264
265@node Pattern matching with DFA
266@section Pattern matching with DFA
267
268Recognizing with a DFA is very simple
269and thus very fast
270(See '@code{scan_for_pattern()}' in the '@file{engine/matchpat.c}' file).
271
272
273Starting from the start state, we only need to read the board
274following the spiral path, jump from states to states following
275the transitions labelled by the values read on the board and
276collect the patterns indexes on the way. If we reach the error
277state (zero), it means that no more patterns will be matched.
278The worst case complexity of this algorithm is o(m) where m is
279the size of the biggest pattern.
280
281Here is an example of scan:
282
283First we build a minimal DFA recognizing these patterns:
284@b{"X..X"}, @b{"X???"}, @b{"X.OX"} and @b{"X?oX"}.
285Note that wildcards like '?','o', or 'x' give multiple out-transitions.
286
287@example
288@group
289----------------------------------------------------
290 state | . | O | X | # | att
291----------------------------------------------------
292 1 | 0 | 0 | 2 | 0 |
293 2 | 3 | 10 | 10 | 0 |
294 3 | 4 | 7 | 9 | 0 |
295 4 | 5 | 5 | 6 | 0 |
296 5 | 0 | 0 | 0 | 0 | 2
297 6 | 0 | 0 | 0 | 0 | 4 2 1
298 7 | 5 | 5 | 8 | 0 |
299 8 | 0 | 0 | 0 | 0 | 4 2 3
300 9 | 5 | 5 | 5 | 0 |
301 10 | 11 | 11 | 9 | 0 |
302 11 | 5 | 5 | 12 | 0 |
303 12 | 0 | 0 | 0 | 0 | 4 2
304----------------------------------------------------
305@end group
306@end example
307
308We perform the scan of the string
309@b{"X..XXO...."} starting from state 1:
310
311Current state: 1, substring to scan :@b{ X..XXO....}
312
313We read an 'X' value, so from state 1 we must go to
314state 2.
315
316Current state: 2, substring to scan : @b{..XXO....}
317
318We read a '.' value, so from state 2 we must go to
319state 3 and so on ...
320
321@example
322Current state: 3, substring to scan : .XXO....
323Current state: 4, substring to scan : XXO....
324Current state: 6, substring to scan : XO....
325Found pattern 4
326Found pattern 2
327Found pattern 1
328@end example
329
330After reaching state 6 where we match patterns
3311,2 and 4, there is no out-transitions so we stop the matching.
332To keep the same match order as in the standard algorithm,
333the patterns indexes are collected in an array and sorted by indexes.
334
335@node Building the DFA
336@section Building the DFA
337
338The most flavouring point is the building of the minimal DFA
339recognizing a given set of patterns.
340To perform the insertion of a new
341pattern into an already existing DFA one must completly rebuild
342the DFA: the principle is to build the minimal CDFA recognizing
343the new pattern to replace the original CDFA with its
344@dfn{synchronised product} by the new one.
345
346We first give a formal definition:
347Let @b{L} be the left CDFA and @b{R} be the right one.
348Let @b{B} be the @dfn{synchronised product} of @b{L} by @b{R}.
349Its states are the couples @b{(l,r)} where @b{l} is a state of @b{L} and
350@b{r} is a state of @b{R}.
351The state @b{(0,0)} is the error state of @b{B} and the state
352@b{(1,1)} is its initial state.
353To each couple @b{(l,r)} we associate the union of patterns
354recognized in both @b{l} and @b{r}.
355The transitions set of @b{B} is the set of transitions
356@b{(l1,r1)---a--->(l2,r2)} for
357each symbol @b{'a'} such that both @b{l1---a--->l2} in @b{L}
358and @b{r1---a--->r2} in @b{R}.
359
360The maximal number of states of @b{B} is the product of the
361number of states of @b{L} and @b{R} but almost all this states
362are non reachable from the initial state @b{(1,1)}.
363
364The algorithm used in function '@code{sync_product()}' builds
365the minimal product DFA only by keeping the reachable states.
366It recursively scans the product CDFA by following simultaneously
367the transitions of @b{L} and @b{R}. A hast table
368(@code{gtest}) is used to check if a state @b{(l,r)} has
369already been reached, the reachable states are remapped on
370a new DFA. The CDFA thus obtained is minimal and recognizes the
371union of the two patterns sets.
372
373@ifnotinfo
374For example these two CDFA's:
375
376
377@image{sync-prod1}
378
379Give by synchronization product the following one:
380
381
382@image{sync-prod2}
383@end ifnotinfo
384
385It is possible to construct a special pattern database that
386generates an "explosive" automaton: the size of the DFA is in
387the worst case exponential in the number of patterns it
388recognizes. But it doesn't occur in pratical situations:
389the DFA size tends to be @dfn{stable}. By @dfn{stable} we mean that if we
390add a pattern which greatly increases the size of the DFA it
391also increases the chance that the next added pattern does not
392increase its size at all. Nevertheless there are many ways to
393reduce the size of the DFA. Good compression methods are
394explained in @cite{Aho, Ravi Sethi, Ullman "COMPILERS:
395Principles, Techniques and Tools" chapter Optimization of
396DFA-based pattern matchers}.
397
398@node Incremental Algorithm
399@section Incremental Algorithm
400
401The incremental version of the DFA pattern matcher is not yet implemented
402in GNU Go but we explain here how it will work.
403By definition of a deterministic automaton, scanning the same
404string will reach the same states every time.
405
406Each reached state during pattern matching is stored in a stack
407@code{top_stack[i][j]} and @code{state_stack[i][j][stack_idx]}
408We use one stack by intersection @code{(i,j)}. A precomputed reverse
409path list allows to know for each couple of board intersections
410@code{(x,y)} its position @code{reverse(x,y)} in the spiral scan path
411starting from @code{(0,0)}.
412
413When a new stone is put on the board at @code{(lx,ly)}, the only work
414of the pattern matcher is:
415
416@example
417@group
418@code{
419 for(each stone on the board at (i,j))
420 if(reverse(lx-i,ly-j) < top_stack[i][j])
421 @{
422 begin the dfa scan from the state
423 state_stack[i][j][reverse(lx-i,ly-j)];
424 @}
425}
426@end group
427@end example
428
429In most situations reverse(lx-i,ly-j) will be inferior to
430top_stack[i][j]. This should speedup a lot pattern matching.
431
432@node DFA Optimizations
433@section Some DFA Optimizations
434
435The DFA is constructed to minimize jumps in memory making some
436assumptions about the frequencies of the values: the EMPTY
437value is supposed to appear often on the board, so the the '.'
438transition are almost always successors in memory. The
439OUT_BOARD are supposed to be rare, so '#' transitions will
440almost always imply a big jump.
441