git.subgeniuskitty.com - sgk-go/.git/blame_incremental

... / ...

Commit	Line	Data
	1	In this chapter, we describe the principles of the GNU Go DFA
	2	pattern matcher. The aim of this system is to permit a fast
	3	pattern matching when it becomes time critical like in owl
	4	module (@ref{The Owl Code}). Since GNU Go 3.2, this is enabled
	5	by default. You can still get back the traditional pattern matcher
	6	by running @command{configure --disable-dfa} and then recompiling
	7	GNU Go.
	8
	9	Otherwise, a finite state machine called a Deterministic Finite
	10	State Automaton (@ref{What is a DFA}) will be built off line from the
	11	pattern database. This is used at runtime to speedup pattern matching
	12	(@ref{Pattern matching with DFA} and @ref{Incremental Algorithm}). The
	13	runtime speedup is at the cost of an increase in memory use and
	14	compile time.
	15
	16
	17	@menu
	18	* Introduction to the DFA:: Scanning the board along a path
	19	* What is a DFA:: A recall of language theory.
	20	* Pattern matching with DFA:: How to retrieve go patterns with a DFA?
	21	* Building the DFA:: Playing with explosives.
	22	* Incremental Algorithm:: The joy of determinism.
	23	* DFA Optimizations:: Some possible optimizations.
	24	@end menu
	25
	26
	27	@cindex pattern matching
	28	@cindex fast pattern matching
	29	@cindex pattern database
	30	@cindex finite state automaton
	31	@cindex automaton
	32	@cindex dfa
	33	@cindex dfa.h
	34	@cindex dfa.c
	35	@cindex matchpat.c
	36	@cindex product
	37
	38
	39	@node Introduction to the DFA
	40	@section Introduction to the DFA
	41
	42	The general idea is as follows:
	43
	44	For each intersection of the board, its neighbourhood is scanned following
	45	a predefined path. The actual path used does not matter very much; GNU Go
	46	uses a spiral as shown below.
	47
	48	@ifinfo
	49	@example
	50	+---B--------------+
	51	\| C 4 A . . . . . .\|
	52	D 5 1 3 9 . . . . .\|
	53	E 6 2 8 . . X . . .\|
	54	\| F 7 . . . . . . .\|
	55	\| . +-> . . . . . .\|
	56	\| . . . . . . . . .\|
	57	\| . O . . . X . . .\|
	58	\| . . . . . . . . .\|
	59	\| . . . . . . . . .\|
	60	+------------------+
	61	@end example
	62	@end ifinfo
	63	@ifnotinfo
	64	@image{path}
	65	@end ifnotinfo
	66
	67	In each step of the path, the pattern matcher jumps into a state
	68	determined by what it has found on the board so far. If we have
	69	successfully matched one or several patterns in this step, this state
	70	immediately tells us so (in its @dfn{attribute}).
	71	But the state also implicitly encodes which further patterns can still
	72	get matched: The information stored in the state contains in which state
	73	to jump next, depending on whether we find a black, white or empty
	74	intersection (or an intersection out of board) in the next step of the
	75	path. The state will also immediately tell us if we cannot find any
	76	further pattern (by telling us to jump into the @dfn{error} state).
	77
	78	These sloppy explanations may become clearer with the definitions in
	79	the next section (@ref{What is a DFA}).
	80
	81	Reading the board following a predefined path reduces
	82	the two dimentional pattern matching to a linear text searching problem.
	83	For example, this pattern
	84
	85	@example
	86	?X?
	87	.O?
	88	?OO
	89	@end example
	90
	91	@noindent
	92	scanned following the path
	93
	94	@example
	95	B
	96	C4A
	97	5139
	98	628
	99	7
	100	@end example
	101
	102	@noindent
	103	gives the string @b{"OO?X.?O?*?"}
	104	where @b{"?"} means @b{'don't care'} and @b{"*"} means @b{'don't care, can
	105	even be out of board'}.
	106
	107	So we can forget that we are dealing with two dimensional patterns and
	108	consider linear patterns.
	109
	110
	111	@node What is a DFA
	112	@section What is a DFA
	113
	114	The acronym DFA means Deterministic Finite state Automaton
	115	(See @uref{http://www.eti.pg.gda.pl/~jandac/thesis/node12.html}
	116	or @cite{Hopcroft & Ullman "Introduction to Language Theory"}
	117	for more details).
	118	DFA are common tools in compilers design
	119	(Read @cite{Aho, Ravi Sethi, Ullman "COMPILERS: Principles, Techniques and Tools"}
	120	for a complete introduction), a lot of
	121	powerfull text searching algorithm like @cite{Knuth-Morris-Pratt}
	122	or @cite{Boyer-Moore} algorithms are based on DFA's
	123	(See @uref{http://www-igm.univ-mlv.fr/~lecroq/string/} for a bibliography
	124	of pattern matching algorithms).
	125
	126	Basically,
	127	a DFA is a set of @dfn{states} connected by labeled @dfn{transitions}.
	128	The labels are the values read on the board,
	129	in GNU Go these values are EMPTY, WHITE, BLACK or OUT_BOARD, denoted
	130	respectively by '.','O','X' and '#'.
	131
	132
	133	The best way to represent a DFA is to draw its transition graph:
	134	the pattern @b{"????..X"} is recognized by the following DFA:
	135
	136	@ifinfo
	137	@example
	138	@group
	139	.,X,O .,X,O .,X,O .,X,O . . X
	140	[1]------>[2]----->[3]----->[4]----->[5]--->[6]--->[7]--->[8 OK!]
	141	Start
	142	@end group
	143	@end example
	144	@end ifinfo
	145
	146	@ifnotinfo
	147	@image{dfa}
	148	@end ifnotinfo
	149
	150
	151	This means that
	152	starting from state [1], if you read '.','X' or 'O' on the board,
	153	go to state [2] and so on until you reach state [5].
	154	From state [5], if you read '.', go to state [6]
	155	otherwise go to error state [0].
	156	And so on until you reach state [8].
	157	As soon as you reach state [8],
	158	you recognize Pattern @b{"????..X"}
	159
	160	Adding a pattern like @b{"XXo"} ('o' is a wildcard for not 'X')
	161	will transform directly the automaton
	162	by synchronization product (@ref{Building the DFA}).
	163	Consider the following DFA:
	164
	165	@ifinfo
	166	@example
	167	@group
	168	Start .,O .,X,O .,O,X .,X,O . . X
	169	[1]---->[2]----->[3]----->[4]------>[5]--->[6]---->[7]--->[8 OK!]
	170	\| ^ ^ ^
	171	\| .,O \| \| \|
	172	\| ---- \| \|
	173	\| \| X \| \|
	174	\| \| --- .,X,O \|
	175	\| \| \| \|
	176	\| X \| X \| O,. \|
	177	--------->[9]------>[A]--->[B OK!]-
	178	@end group
	179	@end example
	180	@end ifinfo
	181
	182	@ifnotinfo
	183	@image{dfa2}
	184	@end ifnotinfo
	185
	186	By adding a special @dfn{error} state and completing each state
	187	by a transition to error state when there is none, we transform
	188	easily a DFA in a @dfn{Complete Deterministic Finite state
	189	Automaton} (CDFA). The synchronization product
	190	(@ref{Building the DFA}) is only possible on CDFA's.
	191
	192
	193	@ifinfo
	194	@example
	195	@group
	196	Start .,O .,X,O .,O,X .,X,O . . X
	197	[1]---->[2]----->[3]----->[4]------>[5]--->[6]---->[7]--->[8 OK!]
	198	\| ^ ^ ^ \| \| \|
	199	\| .,O \| \| \| \| \| \|
	200	\| ---- \| \| \| \| \|
	201	\| \| X \| \| \|X,O \| .,O \|X,.,O
	202	\| \| --- .,X,O \| \| \| \|
	203	\| \| \| \| \| \| \|
	204	\| X \| X \| O,. \| \ / \ / \ /
	205	--------->[9]------>[A]--->[B OK!]- [0 Error state !]
	206	@end group
	207	@end example
	208	@end ifinfo
	209	@ifnotinfo
	210	@image{cdfa}
	211	@end ifnotinfo
	212
	213	The graph of a CDFA is coded by an array of states:
	214	The 0 state is the "error" state and
	215	the start state is 1.
	216
	217	@example
	218	@group
	219	----------------------------------------------------
	220	state \| . \| O \| X \| # \| att
	221	----------------------------------------------------
	222	1 \| 2 \| 2 \| 9 \| 0 \|
	223	2 \| 3 \| 3 \| 3 \| 0 \|
	224	3 \| 4 \| 4 \| 4 \| 0 \|
	225	5 \| 6 \| 0 \| 0 \| 0 \|
	226	6 \| 7 \| 0 \| 0 \| 0 \|
	227	7 \| 0 \| 0 \| 8 \| 0 \|
	228	8 \| 0 \| 0 \| 0 \| 0 \| Found pattern "????..X"
	229	9 \| 3 \| 3 \| A \| 0 \|
	230	A \| B \| B \| 4 \| 0 \|
	231	B \| 5 \| 5 \| 5 \| 0 \| Found pattern "XXo"
	232	----------------------------------------------------
	233	@end group
	234	@end example
	235
	236	To each state we associate an often empty
	237	list of attributes which is the
	238	list of pattern indexes recognized when this state is reached.
	239	In '@file{dfa.h}' this is basically represented by two stuctures:
	240
	241	@example
	242	@group
	243	@code{
	244	/* dfa state */
	245	typedef struct state
	246	@{
	247	int next[4]; /* transitions for EMPTY, BLACK, WHITE and OUT_BOARD */
	248	attrib_t *att;
	249	@}
	250	state_t;
	251
	252	/* dfa */
	253	typedef struct dfa
	254	@{
	255	attrib_t indexes; / Array of pattern indexes */
	256	int maxIndexes;
	257
	258	state_t states; / Array of states */
	259	int maxStates;
	260	@}
	261	dfa_t;}
	262	@end group
	263	@end example
	264
	265	@node Pattern matching with DFA
	266	@section Pattern matching with DFA
	267
	268	Recognizing with a DFA is very simple
	269	and thus very fast
	270	(See '@code{scan_for_pattern()}' in the '@file{engine/matchpat.c}' file).
	271
	272
	273	Starting from the start state, we only need to read the board
	274	following the spiral path, jump from states to states following
	275	the transitions labelled by the values read on the board and
	276	collect the patterns indexes on the way. If we reach the error
	277	state (zero), it means that no more patterns will be matched.
	278	The worst case complexity of this algorithm is o(m) where m is
	279	the size of the biggest pattern.
	280
	281	Here is an example of scan:
	282
	283	First we build a minimal DFA recognizing these patterns:
	284	@b{"X..X"}, @b{"X???"}, @b{"X.OX"} and @b{"X?oX"}.
	285	Note that wildcards like '?','o', or 'x' give multiple out-transitions.
	286
	287	@example
	288	@group
	289	----------------------------------------------------
	290	state \| . \| O \| X \| # \| att
	291	----------------------------------------------------
	292	1 \| 0 \| 0 \| 2 \| 0 \|
	293	2 \| 3 \| 10 \| 10 \| 0 \|
	294	3 \| 4 \| 7 \| 9 \| 0 \|
	295	4 \| 5 \| 5 \| 6 \| 0 \|
	296	5 \| 0 \| 0 \| 0 \| 0 \| 2
	297	6 \| 0 \| 0 \| 0 \| 0 \| 4 2 1
	298	7 \| 5 \| 5 \| 8 \| 0 \|
	299	8 \| 0 \| 0 \| 0 \| 0 \| 4 2 3
	300	9 \| 5 \| 5 \| 5 \| 0 \|
	301	10 \| 11 \| 11 \| 9 \| 0 \|
	302	11 \| 5 \| 5 \| 12 \| 0 \|
	303	12 \| 0 \| 0 \| 0 \| 0 \| 4 2
	304	----------------------------------------------------
	305	@end group
	306	@end example
	307
	308	We perform the scan of the string
	309	@b{"X..XXO...."} starting from state 1:
	310
	311	Current state: 1, substring to scan :@b{ X..XXO....}
	312
	313	We read an 'X' value, so from state 1 we must go to
	314	state 2.
	315
	316	Current state: 2, substring to scan : @b{..XXO....}
	317
	318	We read a '.' value, so from state 2 we must go to
	319	state 3 and so on ...
	320
	321	@example
	322	Current state: 3, substring to scan : .XXO....
	323	Current state: 4, substring to scan : XXO....
	324	Current state: 6, substring to scan : XO....
	325	Found pattern 4
	326	Found pattern 2
	327	Found pattern 1
	328	@end example
	329
	330	After reaching state 6 where we match patterns
	331	1,2 and 4, there is no out-transitions so we stop the matching.
	332	To keep the same match order as in the standard algorithm,
	333	the patterns indexes are collected in an array and sorted by indexes.
	334
	335	@node Building the DFA
	336	@section Building the DFA
	337
	338	The most flavouring point is the building of the minimal DFA
	339	recognizing a given set of patterns.
	340	To perform the insertion of a new
	341	pattern into an already existing DFA one must completly rebuild
	342	the DFA: the principle is to build the minimal CDFA recognizing
	343	the new pattern to replace the original CDFA with its
	344	@dfn{synchronised product} by the new one.
	345
	346	We first give a formal definition:
	347	Let @b{L} be the left CDFA and @b{R} be the right one.
	348	Let @b{B} be the @dfn{synchronised product} of @b{L} by @b{R}.
	349	Its states are the couples @b{(l,r)} where @b{l} is a state of @b{L} and
	350	@b{r} is a state of @b{R}.
	351	The state @b{(0,0)} is the error state of @b{B} and the state
	352	@b{(1,1)} is its initial state.
	353	To each couple @b{(l,r)} we associate the union of patterns
	354	recognized in both @b{l} and @b{r}.
	355	The transitions set of @b{B} is the set of transitions
	356	@b{(l1,r1)---a--->(l2,r2)} for
	357	each symbol @b{'a'} such that both @b{l1---a--->l2} in @b{L}
	358	and @b{r1---a--->r2} in @b{R}.
	359
	360	The maximal number of states of @b{B} is the product of the
	361	number of states of @b{L} and @b{R} but almost all this states
	362	are non reachable from the initial state @b{(1,1)}.
	363
	364	The algorithm used in function '@code{sync_product()}' builds
	365	the minimal product DFA only by keeping the reachable states.
	366	It recursively scans the product CDFA by following simultaneously
	367	the transitions of @b{L} and @b{R}. A hast table
	368	(@code{gtest}) is used to check if a state @b{(l,r)} has
	369	already been reached, the reachable states are remapped on
	370	a new DFA. The CDFA thus obtained is minimal and recognizes the
	371	union of the two patterns sets.
	372
	373	@ifnotinfo
	374	For example these two CDFA's:
	375
	376
	377	@image{sync-prod1}
	378
	379	Give by synchronization product the following one:
	380
	381
	382	@image{sync-prod2}
	383	@end ifnotinfo
	384
	385	It is possible to construct a special pattern database that
	386	generates an "explosive" automaton: the size of the DFA is in
	387	the worst case exponential in the number of patterns it
	388	recognizes. But it doesn't occur in pratical situations:
	389	the DFA size tends to be @dfn{stable}. By @dfn{stable} we mean that if we
	390	add a pattern which greatly increases the size of the DFA it
	391	also increases the chance that the next added pattern does not
	392	increase its size at all. Nevertheless there are many ways to
	393	reduce the size of the DFA. Good compression methods are
	394	explained in @cite{Aho, Ravi Sethi, Ullman "COMPILERS:
	395	Principles, Techniques and Tools" chapter Optimization of
	396	DFA-based pattern matchers}.
	397
	398	@node Incremental Algorithm
	399	@section Incremental Algorithm
	400
	401	The incremental version of the DFA pattern matcher is not yet implemented
	402	in GNU Go but we explain here how it will work.
	403	By definition of a deterministic automaton, scanning the same
	404	string will reach the same states every time.
	405
	406	Each reached state during pattern matching is stored in a stack
	407	@code{top_stack[i][j]} and @code{state_stack[i][j][stack_idx]}
	408	We use one stack by intersection @code{(i,j)}. A precomputed reverse
	409	path list allows to know for each couple of board intersections
	410	@code{(x,y)} its position @code{reverse(x,y)} in the spiral scan path
	411	starting from @code{(0,0)}.
	412
	413	When a new stone is put on the board at @code{(lx,ly)}, the only work
	414	of the pattern matcher is:
	415
	416	@example
	417	@group
	418	@code{
	419	for(each stone on the board at (i,j))
	420	if(reverse(lx-i,ly-j) < top_stack[i][j])
	421	@{
	422	begin the dfa scan from the state
	423	state_stack[i][j][reverse(lx-i,ly-j)];
	424	@}
	425	}
	426	@end group
	427	@end example
	428
	429	In most situations reverse(lx-i,ly-j) will be inferior to
	430	top_stack[i][j]. This should speedup a lot pattern matching.
	431
	432	@node DFA Optimizations
	433	@section Some DFA Optimizations
	434
	435	The DFA is constructed to minimize jumps in memory making some
	436	assumptions about the frequencies of the values: the EMPTY
	437	value is supposed to appear often on the board, so the the '.'
	438	transition are almost always successors in memory. The
	439	OUT_BOARD are supposed to be rare, so '#' transitions will
	440	almost always imply a big jump.
	441