." $Header: ch7.n 1.3 83/07/01 11:22:58 layer Exp $
.Lc The\ Lisp\ Reader 7
.sh 2 Introduction \n(ch 1
.pp
The 
.i read
function is responsible for converting
a stream of 
characters into a Lisp expression.
.i Read 
is table driven and the table it uses is called a 
.i readtable.
The 
.i print
function does the
inverse of 
.i read ;
it converts a Lisp expression into a stream of 
characters.
Typically the conversion is done in such
a way that if that stream of characters were read by 
.i read ,
the 
result would be an expression equal to the one
.i print
was given.
.i Print
must also refer to the readtable in order to determine
how to format its output.
The 
.i explode
function, which returns a list of characters rather than
printing them,  must also refer to the readtable.
.pp
A readtable is created
with the
.i makereadtable 
function, modified with the
.i setsyntax
function and interrogated with the
.i getsyntax 
function.
The structure of a readtable is hidden from the user  - a
readtable should
only be manipulated with the three functions mentioned above.
.pp
There is one distinguished readtable called the 
.i current
.i readtable 
whose value determines what
.i read ,
.i print 
and 
.i explode 
do.
The current readtable is the value of the symbol 
.i readtable .
Thus it is  possible to rapidly change 
the current syntax by lambda binding 
a different readtable to the symbol 
.i readtable.
When the binding is undone, the syntax reverts to its old form.
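.pp
As a minimal sketch of this technique, the function below reads one
expression with $ treated as a single character symbol, without
disturbing the syntax seen by the rest of the system.
The name
.i read-with-dollar
is hypothetical, and the sketch assumes that
.i (makereadtable\ nil)
returns a copy of the current readtable.
.Eb
\-> \fI(defun read-with-dollar nil
      ; the copy is made before readtable is rebound
      ((lambda (readtable)
          (setsyntax '\e$ 'vsingle-character-symbol)  ; changes only the copy
          (read))
       (makereadtable nil)))\fP
read-with-dollar
.Ee
When the lambda expression returns, the old value of
.i readtable
is restored and $ is an ordinary character again.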
.sh +0 Syntax\ Classes
.pp
The readtable describes how each of the 128 ASCII characters should
be treated by the reader and printer.
Each character belongs to a 
.i syntax 
.i class 
which has three properties:
.ip character\ class\ - 
Tells what the reader should do when it sees this character.
There are a large number of character classes.  
They are described below.
.ip separator\ -
Most types of tokens the reader constructs are one character
long.
Four token types have an arbitrary length: number (1234), 
symbol print name (franz),
escaped symbol print name (|franz|), and string ("franz").
The reader can easily determine when it has
come to the
end of one of the last two types: it just looks for the
matching delimiter (| or ").
When the reader is reading a number or symbol print name, it 
stops reading when it comes to a character with the 
.i separator
property.
The separator character is pushed back into the input stream and will
be the first character read when the reader is called again.
.ip escape\ -
Tells the printer when to put escapes in front of, or around, a symbol
whose print name contains this character.
There are three possibilities: always escape a symbol with this character
in it, only escape a symbol if this is the only character in the symbol,
and only escape a symbol if this is the first character in the symbol.
[note: The printer will always escape a symbol which, if printed out, would
look like a valid number.]
.pp
When the Lisp system is built, Lisp code is added to a C-coded kernel
and the result becomes the standard lisp system.
The readtable present in the C-coded kernel, called the
.i raw
.i readtable ,
contains the bare necessities for reading in Lisp code.
During the 
construction of the complete Lisp system,
a copy is made of the raw readtable and 
then the copy  is modified by adding macro characters.
The result is what is called the
.i standard
.i readtable .
When a new readtable is created with 
.i makereadtable,
a copy is made of either the
raw readtable
or the current readtable (which is likely to be the standard readtable).
.sh +0 Reader\ operations
.pp
The reader has a very simple algorithm.
It is either 
.i scanning 
for a token, 
.i collecting 
a token,
or 
.i processing 
a token.
Scanning involves reading characters and throwing
away those which don't start tokens (such as blanks and tabs).
Collecting means gathering the characters which make up a
token into a buffer.
Processing may involve creating symbols, strings, lists, 
fixnums, bignums or flonums or calling a user written function called 
a character macro.
.pp
The components of the syntax class determine when the reader
switches between the scanning, collecting and processing states.
The reader will continue scanning as long as the character class
of the characters it reads is 
.i cseparator.
When it reads a character whose character class is not 
.i cseparator
it stores that character in its buffer and begins the collecting phase.
.pp
If the character class of that first character is 
.i ccharacter ,
.i cnumber ,
.i cperiod ,
or 
.i csign ,
then it will continue collecting until it runs into a character whose
syntax class has the 
.i separator
property.
(That last character will be pushed back into the input buffer and will
be the first character read next time.)
Now the reader goes into the processing phase, checking to see if the
token it read is a number or symbol.
It is important to note that  after
the first character is collected the component of the syntax class which 
tells the reader to  stop 
collecting is the 
.i separator
property, not the character class.
.pp
If the character class of the character which stopped the scanning is not 
.i ccharacter ,
.i cnumber ,
.i cperiod ,
or
.i csign ,
then the reader processes that character immediately.
The character classes
.i csingle-macro ,
.i csingle-splicing-macro ,
and 
.i csingle-infix-macro
will act like 
.i ccharacter
if the following character is not a 
.i separator.
The processing which is done for a given character class 
is described in detail in the next section.
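.pp
To illustrate the role of the
.i separator
property, consider input read under the standard readtable, in which a
left parenthesis has that property and a period does not.
The results shown are what the rules above predict; they are a sketch
rather than a captured session.
.Eb
\-> \fI'(abc(def))\fP
(abc (def))
\-> \fI'(a.b a . b)\fP
(a.b a . b)
.Ee
In the first form the parenthesis ends the symbol abc even though no
blank intervenes.
In the second, a.b is collected as one symbol because nothing with the
.i separator
property follows the a, while the isolated period is read as the dot
of a dotted pair.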
.sh +0 Character\ classes
.de Cc
.sp 2v
.tl '\fI\\$1\fP''raw readtable:\\$2'
.tl '''standard readtable:\\$3'
..
.pc
.Cc ccharacter A-Z\ a-z\ ^H\ !#$%&*,/:;<=>?@^_`{}~ A-Z\ a-z\ ^H\ !$%&*/:;<=>?@^_{}~
.pc %
A normal character.
.Cc cnumber 0-9 0-9
This type is a digit.  
The syntax for an integer (fixnum or bignum) is a string of 
.i cnumber
characters optionally followed by a 
.i cperiod.
If the digits are not followed by a
.i cperiod ,
then they are interpreted in base
.i ibase
which must be eight or ten.
The syntax for a floating point number is
zero or more
.i cnumber 's
followed by a
.i cperiod
and then followed by one or more 
.i cnumber 's.
A floating point number
may also be  an integer or floating point number followed
by 'e' or 'd', an optional '+' or '\-'
and then zero or more 
.i cnumber 's.
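The effect of a trailing
.i cperiod
on an integer can be sketched as follows, assuming that
.i ibase
is set with an ordinary
.i setq
and that output is printed in decimal:
.Eb
\-> \fI(setq ibase 8)\fP
8
\-> \fI10\fP
8
\-> \fI10.\fP
10
\-> \fI(setq ibase 10.)\fP
10
.Ee
The trailing period in the last form is needed because a bare 10
would still be read in base eight.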
.Cc csign +\- +\-
A leading sign for a number.  
No other characters should be given this class.
.Cc cleft-paren ( (
A left parenthesis.
Tells the reader to begin forming a list.
.Cc cright-paren ) )
A right parenthesis.
Tells the reader that it has reached the end of a list.
.Cc cleft-bracket [ [
A left bracket.
Tells the reader that it should begin forming a list.
See the description of 
.i cright-bracket
for the difference between cleft-bracket and cleft-paren.
.Cc cright-bracket ] ]
A right bracket.
A 
.i cright-bracket 
finishes the formation of the current
list and all enclosing lists until it finds one which
begins with a 
.i cleft-bracket 
or until it reaches the 
top level list.
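For example, the rules above imply the following (a sketch, not a
captured session):
.Eb
\-> \fI'(a (b (c]\fP
(a (b (c)))
\-> \fI'(a [b (c (d] e)\fP
(a (b (c (d))) e)
.Ee
In the first form the bracket closes all three lists.
In the second it closes only back to the list opened by the
.i cleft-bracket .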
.Cc cperiod . .
The period is used to separate the elements of a cons cell 
[e.g. (a\ .\ (b\ .\ nil)) is the same as (a\ b)].
.i cperiod
is also used in numbers as described above.
.Cc cseparator ^I-^M\ esc\ space ^I-^M\ esc\ space
Separates tokens.  When the reader is scanning, these characters
are passed over.
Note: there is a difference between the
.i cseparator 
character class and the 
.i separator 
property of a syntax class.
.Cc csingle-quote \\' \\'
This causes 
.i read
to be called recursively and the list
(quote <value read>) to be returned.
.Cc csymbol-delimiter | |
This causes the reader to begin collecting characters and to stop only
when another identical
.i csymbol-delimiter
is seen.  
The only way to escape a 
.i csymbol-delimiter 
within a symbol name is with a
.i  cescape 
character.
The collected characters are converted into a string which becomes
the print name of a symbol.
If a symbol with an identical print name already exists, then the
allocation is not done, rather the existing symbol is used.
.Cc cescape \e \e
This causes the next character read in to be treated as a 
.b vcharacter .
A character whose syntax class is
.b vcharacter 
has a character class
.i ccharacter
and does not have
the 
.i separator
property so it will not separate symbols.
.Cc cstring-delimiter """" """"
This is the same as 
.i csymbol-delimiter
except the result is returned as a string instead of a symbol.
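As an illustration of the delimiter and escape classes, the following
is what the descriptions above predict under the standard readtable
(a sketch, not a captured session):
.Eb
\-> \fI'|hello world|\fP
|hello world|
\-> \fI'hello\e world\fP
|hello world|
\-> \fI(eq '|hello world| 'hello\e world)\fP
t
\-> \fI"hello world"\fP
"hello world"
.Ee
Both spellings of the symbol yield the same object, since an existing
symbol with an identical print name is reused.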
.Cc csingle-character-symbol none none
This returns a symbol whose print name is the single character
which has been collected.
.Cc cmacro none `,
The reader calls the macro function associated with this character and 
the current readtable, passing it no arguments.
The result of the macro is added to the structure the reader is building,
just as if that form were directly read by the reader.
More details on macros are provided below.
.Cc csplicing-macro none #;
A 
.i csplicing-macro 
differs from a 
.i cmacro
in the way the result is incorporated in the structure the reader is 
building.
A 
.i csplicing-macro
must return a list of forms (possibly empty).
The reader acts as
if it read each element of
the list itself without
the surrounding parentheses.
.Cc csingle-macro none none
This causes the reader to check the next character.
If it is a 
.i cseparator
then this acts like a 
.i cmacro.
Otherwise, it acts like a 
.i ccharacter.
.Cc csingle-splicing-macro none none
This is triggered like a 
.i csingle-macro
however the result is spliced in like a
.i csplicing-macro.
.Cc cinfix-macro none none
This differs from a 
.i cmacro
in that the macro function is passed a form representing what the reader
has read so far. 
The result of the macro replaces what the reader had read so far.
.Cc csingle-infix-macro none none
This differs from the
.i cinfix-macro
in that the macro will only be triggered if the character following the
.i csingle-infix-macro
character is a 
.i cseparator .
.Cc cillegal ^@-^G^N-^Z^\e-^_rubout ^@-^G^N-^Z^\e-^_rubout
These characters cause the reader to signal an error if read.
.sh +0 Syntax\ classes
.pp
The readtable maps each character into a syntax class.
The syntax class contains three pieces of information: 
the character class, whether this is a separator, and the escape
properties.
The first two properties are used by the reader, the last by 
the printer (and 
.i explode ).
The initial lisp system has the following syntax classes defined.
The user may add syntax classes with
.i add-syntax-class .
For each syntax class, we list the properties of the class and 
which characters have this syntax class by default.
More information about each syntax class can be found under the
description of the syntax class's character class.
.de Sy
.sp 1v
.tl '\fB\\$1\fP''raw readtable:\\$2'
.tl '\fI\\$4\fP''standard readtable:\\$3'
.tl '\fI\\$5\fP'''
.tl '\fI\\$6\fP'''
..
.pc
.Sy vcharacter A-Z\ a-z\ ^H\ !#$%&*,/:;<=>?@^_`{}~ A-Z\ a-z\ ^H\ !$%&*/:;<=>?@^_{}~  ccharacter
.pc %
.Sy vnumber 0-9 0-9 cnumber
.Sy vsign +- +- csign
.Sy vleft-paren ( ( cleft-paren escape-always separator
.Sy vright-paren ) ) cright-paren escape-always separator
.Sy vleft-bracket [ [ cleft-bracket escape-always separator 
.Sy vright-bracket ] ] cright-bracket escape-always separator 
.Sy vperiod . . cperiod escape-when-unique
.Sy vseparator ^I-^M\ esc\ space ^I-^M\ esc\ space cseparator escape-always separator 
.Sy vsingle-quote \\' \\' csingle-quote escape-always separator 
.Sy vsymbol-delimiter | | csymbol-delimiter escape-always
.Sy vescape \e \e cescape escape-always
.Sy vstring-delimiter """" """" cstring-delimiter escape-always
.Sy vsingle-character-symbol none none csingle-character-symbol separator
.Sy vmacro none `, cmacro escape-always separator 
.Sy vsplicing-macro none #; csplicing-macro escape-always separator 
.Sy vsingle-macro none none csingle-macro escape-when-unique
.Sy vsingle-splicing-macro none none csingle-splicing-macro escape-when-unique
.Sy vinfix-macro none none cinfix-macro escape-always separator 
.Sy vsingle-infix-macro none none csingle-infix-macro escape-when-unique
.Sy villegal ^@-^G^N-^Z^\e-^_rubout ^@-^G^N-^Z^\e-^_rubout cillegal escape-always separator 
.sh +0 Character\ Macros
.pp
Character macros are 
user written functions which are executed during the reading process.
The value returned by a character macro may or may not be used by 
the reader, depending on the type of macro and the value returned.
Character macros are always attached to a single character with
the 
.i setsyntax 
function.
.sh +1 Types
.pp
There are three types of character macros: normal, splicing and infix.
These types differ in the arguments they are given or in what is done
with the result they return.
.sh +1 Normal
.pp
A normal macro
is passed no arguments.
The value returned by a normal macro is simply used by
the reader as if it had read the value itself.
Here is an example of a macro which returns the abbreviation 
for a given state.
.Eb
\->\fI(de\kAfun stateabbrev nil
 \h'|\nAu'(cdr (assq (read) '((california . ca) (pennsylvania . pa)))))\fP
stateabbrev
\-> \fI(setsyntax '\e! 'vmacro 'stateabbrev)\fP
t
\-> \fI'( ! california ! wyoming ! pennsylvania)\fP
(ca nil pa)
.Ee
Notice what happened to 
\fI ! wyoming\fP.
Since it wasn't in the table, the macro probably didn't
want to return anything at all, but it had to return
something, and whatever it returned was put in the list.
The splicing macro, described next, allows a character macro function
to return a value that is ignored.
.sh +0 Splicing
.pp
The value returned from a splicing macro must be a list or nil.
If the value is nil, then the value is ignored, otherwise the reader
acts as if it read each object in the list.
Usually the list only contains one element. 
If the reader is reading at the top level (i.e. not collecting elements
of a list),
then it is illegal for a splicing macro to return more than one
element in the list.
The major advantage of a splicing macro over a normal macro is the
ability of the splicing macro to return nothing. 
The comment character (usually ;) is a splicing macro bound to a
function which reads to the end of the line and always returns nil.
Here is the previous example written as a splicing macro.
.Eb
\-> \fI(de\kAfun stateabbrev nil
\h'|\nAu'(\kC(lam\kBbda (value)
 \h'|\nBu'(cond \kA(value (list value))
 \h'|\nAu'(t nil)))
 \h'|\nCu'(cdr (assq (read) '((california . ca) (pennsylvania . pa))))))\fP
\-> \fI(setsyntax '! 'vsplicing-macro 'stateabbrev)\fP
\-> \fI'(!pennsylvania ! foo !california)\fP
(pa ca)
\-> \fI'!foo !bar !pennsylvania\fP
pa
\-> 
.Ee
.sh +0 Infix
.pp
Infix macros are passed a 
.i tconc
structure representing what has been read so far.
Briefly, a 
tconc
structure is a single list cell whose car points to 
a list and whose cdr points to the last list cell in that list.
The interpretation by the reader of the value 
returned by  an infix macro depends on
whether the macro is called while the reader is constructing a 
list or whether it is called at the top level of the reader.
If the macro is called while a list is
being constructed, then the value returned should be  a tconc
structure.
The car of that structure replaces the list of elements that the
reader has been collecting.
If the macro is called at top level, then it will be passed the
value nil, and the value it returns should either be nil
or a tconc structure.
If the macro returns nil, then the value is ignored and the reader
continues to read.
If the macro returns a tconc structure of one element (i.e. whose car
is a list of one element), then that single element is returned
as the value of 
.i read.
If the macro returns a tconc structure of more than one element,
then that list of elements is returned as the value of read.
.Eb
\-> \fI(de\kAfun plusop (x)
   \h'|\nAu'(cond \kB((null x) (tconc nil '\e+))
	 \h'|\nBu'(t (lconc nil (list 'plus (caar x) (read))))))\fP

plusop
\-> \fI(setsyntax '\e+ 'vinfix-macro 'plusop)\fP
t
\-> \fI'(a + b)\fP
(plus a b)
\-> \fI'+\fP
|+|
\-> 
.Ee
.sh -1 Invocations
.pp
There are three different circumstances in which you would like
a macro function to be triggered.
.ip \fIAlways\ -\fP
Whenever the macro character is seen, the macro should be invoked.
This is accomplished by using the character classes 
.i cmacro ,
.i csplicing-macro ,
or 
.i cinfix-macro ,
and by using the
.i separator 
property.
The syntax classes 
.b vmacro ,
.b vsplicing-macro ,
and 
.b vinfix-macro
are defined this way.
.ip \fIWhen\ first\ -\fP
The macro should only be triggered when the macro character is the first
character found after the scanning process.
A syntax class for a 
.i when
.i first
macro would
be defined
using
.i cmacro , 
.i csplicing-macro ,
or 
.i cinfix-macro
and not including the 
.i separator 
property.
.ip \fIWhen\ unique\ -\fP
The macro should only be triggered when the macro character is the only
character collected in the token collection
phase of the reader, 
i.e. the macro character is preceded by zero or more 
.i cseparator s
and followed by a 
.i separator.
A syntax class for a 
.i when
.i unique
macro would
be defined using
.i csingle-macro ,
.i csingle-splicing-macro ,
or 
.i csingle-infix-macro
and not including the
.i separator 
property.
The syntax classes so defined are
.b vsingle-macro ,
.b vsingle-splicing-macro ,
and
.b vsingle-infix-macro .
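.pp
The
.i when
.i unique
case can be sketched as follows.
The function name
.i bang
is hypothetical, and the results shown are those predicted by the
rules above rather than a captured session.
.Eb
\-> \fI(defun bang nil (list 'not (read)))\fP
bang
\-> \fI(setsyntax '\e! 'vsingle-macro 'bang)\fP
t
\-> \fI'(! a !a b!c)\fP
((not a) !a b!c)
.Ee
Only the isolated ! triggers the macro; in !a and b!c it is collected as
an ordinary character.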
.sh -1 Functions
.Lf setsyntax 's_symbol\ 's_synclass\ ['ls_func]
.Wh
ls_func is the name of a function or a lambda body.
.Re
t
.Se
S_symbol should be a symbol whose print name is only one character.
The syntax class for 
that character is
set to s_synclass in the current readtable.
If s_synclass is a class that requires a character macro, then
ls_func must be supplied. 
.No
The symbolic syntax codes are new to Opus 38.
For compatibility, s_synclass can be one of the fixnum syntax codes
which appeared in older versions of the 
.Fr
Manual.
This compatibility is only temporary: existing code which uses the
fixnum syntax codes should be converted.
.Lf getsyntax 's_symbol
.Re
the syntax class of the first character 
of s_symbol's print name.
s_symbol's print name must be exactly one character long.
.No
This function is new to Opus 38.
It supersedes \fI(status\ syntax)\fP which no longer exists.
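For example, given the default assignments listed in this chapter,
the following results would be expected under the standard readtable
(a sketch, not a captured session):
.Eb
\-> \fI(getsyntax 'a)\fP
vcharacter
\-> \fI(getsyntax '\e()\fP
vleft-paren
\-> \fI(getsyntax '\e;)\fP
vsplicing-macro
.Ee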
.Lf add-syntax-class 's_synclass\ 'l_properties
.Re
s_synclass
.Se
Defines the syntax class s_synclass to have properties l_properties.
The list l_properties should contain one of the character classes
described above.
l_properties may contain one of the escape properties:
.i escape-always ,
.i escape-when-unique ,
or 
.i escape-when-first .
l_properties may contain the 
.i separator
property.
After a syntax class has been defined with 
.i add-syntax-class ,
the 
.i setsyntax
function can be used to give characters that syntax class.
.Eb
; Define a non-separating macro character.  
; This type of macro character is used in UCI-Lisp, and
; it corresponds to a  FIRST MACRO in Interlisp
\-> \fI(add-syntax-class 'vuci-macro '(cmacro escape-when-first))\fP
vuci-macro
\->
.Ee