# Overview #

A few thoughts for use in future obfuscated programs.


# Digraphs, Trigraphs and Syntax Highlighting #

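Modern compilers disable trigraphs by default (GCC, for example, needs
`-trigraphs` or a strict ISO `-std=` mode), so a program that relies on them
betrays the fact through its required compiler flags. That makes any direct
benefit for obfuscation dubious.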

However, many syntax highlighting packages mishandle trigraphs. For example,
highlighters which miss the trigraph `??/` converting to `\` frequently display
the `exit(0);` line in the following snippet as code rather than comment. The
`\` escapes the newline, so the compiler sees a single two-line comment.

    // Should I exit early?????/
    exit(0);
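To check how a particular compiler treats the snippet above, a small probe along these lines may help. This is only a sketch: the function names `trigraphs_active` and `comment_swallowed_line` are made up for illustration, and the flag assignment stands in for `exit(0);` so that the result can be observed.

```c
/* Reports whether the compiler translated trigraphs in this file.
 * With trigraphs on, the literal below is the one-character string "#"
 * (two bytes with the terminator); otherwise it is four bytes. */
int trigraphs_active(void)
{
    return sizeof("??=") == 2;
}

/* Returns 1 if the line comment swallowed the assignment after it,
 * 0 if the assignment was compiled as ordinary code. */
int comment_swallowed_line(void)
{
    int swallowed = 1;
    // Should I clear the flag?????/
    swallowed = 0;
    return swallowed;
}
```

Both functions should agree: with GCC, a build using `-trigraphs` is expected to report 1 from each, while a default build reports 0 and may warn about the ignored trigraph.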

As long as syntax highlighting is kept sane elsewhere in an obfuscated program,
the reader may gradually come to trust it, perhaps allowing an instance or two
of trigraph-induced syntax highlighting failure to slip past unnoticed.

Of course, readers may run the equivalent of a search-and-replace, condensing
trigraphs to their single-character equivalents. Since the C preprocessor (CPP)
performs the same replacement before any other processing, this is safe.
Digraphs, on the other hand, are dealt with during tokenization, so they are
not recognized inside string literals, character constants or comments. A
simple search-and-replace by the reader is therefore not necessarily a safe
transformation of the source code. Is it possible to include two important
digraphs hidden amongst frivolous usage, such that

  - one digraph breaks syntax highlighting in a useful way, like the example
    demonstrated above, and

  - the other digraph isn't a real digraph, rather being something which breaks
    the program if digraphs are converted with a simple search-and-replace?

One possible 'false' digraph would be to embed the characters inside another
token, perhaps a multi-part string split across multiple lines. If a naive
search-and-replace would convert the string into something syntax-breaking,
then the reader may avoid doing a digraph conversion before reading the source,
despite knowing such digraphs are there, and thus may be tricked into believing
lies from their syntax highlighter.
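A rough sketch of the difference between real and 'false' digraphs follows. The names `sum_pair`, `marker` and `marker_intact` are hypothetical, and the length check is just one assumed way a program could depend on the string staying untouched; it breaks behaviour rather than syntax.

```c
#include <string.h>

/* Real digraphs: the tokenizer reads <% %> <: :> as { } [ ]. */
int sum_pair(void)
<%
    int pair<:2:> = <% 20, 22 %>;
    return pair<:0:> + pair<:1:>;
%>

/* 'False' digraphs: inside a string literal the same characters are
 * plain data. The literal is split across two lines and joined by the
 * compiler into the four characters < : : > */
static const char marker[] = "<:"
                             ":>";

/* Hypothetical check: the program relies on the marker's exact length.
 * A blind rewrite of "<:" and ":>" to "[" and "]" shrinks it from
 * four characters to two, so this would start returning 0. */
int marker_intact(void)
{
    return strlen(marker) == 4;
}
```

As written, `sum_pair()` returns 42 and `marker_intact()` returns 1. After a textual search-and-replace over the whole file, `sum_pair` still compiles to the same code, but `marker_intact()` fails.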

I suppose that leads to the natural question: Do people typically do a
search-and-replace for digraphs when reading obfuscated code, or do they use a
more language-aware method?
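For comparison, a slightly more language-aware conversion might look like the sketch below. It is an assumption about what such a tool could do, not a description of any existing one: `digraph_to_char` and `convert_digraphs` are invented names, and comments, header names and other corner cases are deliberately ignored.

```c
#include <string.h>

/* Map a two-character digraph to its single-character spelling, or 0.
 * The four-character form %:%: comes out as ## via two replacements. */
static char digraph_to_char(char a, char b)
{
    if (a == '<' && b == ':') return '[';
    if (a == ':' && b == '>') return ']';
    if (a == '<' && b == '%') return '{';
    if (a == '%' && b == '>') return '}';
    if (a == '%' && b == ':') return '#';
    return 0;
}

/* Copy src to dst, replacing digraphs only outside string and character
 * literals. dst must have room for strlen(src) + 1 bytes.
 * Simplification: comments and header names are not tracked. */
void convert_digraphs(const char *src, char *dst)
{
    char quote = 0;                 /* active quote character, or 0 */

    while (*src) {
        if (quote) {
            if (*src == '\\' && src[1]) {
                *dst++ = *src++;    /* copy the backslash of an escape */
            } else if (*src == quote) {
                quote = 0;          /* closing quote ends the literal */
            }
            *dst++ = *src++;
        } else if (*src == '"' || *src == '\'') {
            quote = *src;           /* opening quote starts a literal */
            *dst++ = *src++;
        } else {
            char c = digraph_to_char(src[0], src[1]);
            if (c) {
                *dst++ = c;
                src += 2;
            } else {
                *dst++ = *src++;
            }
        }
    }
    *dst = '\0';
}
```

Given the input `a<:0:> = "<::>";`, this rewrites the subscript to `a[0]` and leaves the string contents alone, which is exactly the case where a blind search-and-replace goes wrong.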