usr/src/usr.bin/sed/POSIX

#       @(#)POSIX       5.2 (Berkeley) %G%

                Comments on the IEEE P1003.2 Draft 12

                     Part 2: Shell and Utilities
                  Section 4.55: sed - Stream editor

                 Diomidis Spinellis <dds@doc.ic.ac.uk>

In the following paragraphs, `wrong' means `inconsistent with historic
practice'.  Many of the comments refer to undocumented inconsistencies
between the historical versions of sed and the POSIX standard.  All the
comments are notes taken while implementing a POSIX-compatible version
of sed, and should not be interpreted as official opinions or criticism
towards the POSIX committee.  Some are insignificant, pedantic and even
wrong.

 1.     For the text argument of the a command it is not specified if
        lines are stripped of their initial blanks or not.  Historical
        practice, followed in this implementation, is to strip the
        blanks, i.e.:

        #!/bin/sed -f
        a\
                foo\
                bar

        produces:

        foo
        bar

 2.     Historical versions of sed required that the w flag be the
        last flag to an s command, as it takes an additional argument.
        This is not specified in the standard.

 3.     Historical versions of sed required that whitespace follow a w
        flag to an s command.  This is not specified in the standard.
        This implementation permits whitespace but does not require
        it.

 4.     Historical versions of sed permitted any number of whitespace
        characters to follow the w command.  This is not specified in
        the standard.  This implementation permits whitespace but does
        not require it.

 5.     The specification of the a command is wrong.  With the current
        specification both of these scripts should produce the same
        output:

        #!/bin/sed -f
        d
        a\
        hello

        #!/bin/sed -f
        a\
        hello
        d

TK -- Diomidis, the current implementation looks wrong on this case.

 6.     The specification of the c command in conjunction with the
        specification of the default operation (D2 11293-11299) is
        wrong.  The default operation specifies that a newline is
        printed after the pattern space.  This is not the case when
        the pattern space has been deleted by a c command.

TK Diomidis, the spec seems right to me -- the language in 11293
TK talks about copying the pattern space to stdout -- if the pattern space
TK is deleted, it can't be copied.

 7.     The rule for the l command differs from historic practice.
        Table 2-15 includes the various ANSI C escape sequences,
        including \\ for backslash.  Some historical versions of
        sed displayed two-digit octal numbers.  The POSIX
        specification is a cleanup, and this implementation follows
        it.

 8.     The specification for ! does not specify that for a single
        command the command must not contain an address specification
        whereas the command list can contain address specifications.

TK I think this is wrong: the script:
TK
TK      3!p
TK
TK works fine.  Am I misunderstanding your point?

 9.     The standard does not specify what happens with consecutive
        ! commands (e.g. /foo/!!!p).  Historic implementations
        allow any number of !'s without changing behaviour.  (It
        seems logical that each one should reverse the default
        behaviour.)  This implementation follows historic practice.

10.     Historic versions of sed permitted commands to be separated
        by semi-colons, e.g. sed -ne '1p;2p;3q' prints the first two
        lines of a file and quits at the third.  This is not specified
        by POSIX.  Note, the ; command separator is not allowed for the
        commands a, c, i, w, r, :, b, t, # and at the end of a w flag
        in the s command.  This implementation follows historic
        practice.
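
        As a minimal sketch of the semicolon separator, the one-line
        form can be compared with the same script written one command
        per line.  The four-line input and the variable names are made
        up for illustration.

```shell
# Hypothetical four-line input; any text would do.
input='one
two
three
four'

# Commands separated by semicolons (historic practice).
semi=$(printf '%s\n' "$input" | sed -ne '1p;2p;3q')

# The same script with one command per line.
lines=$(printf '%s\n' "$input" | sed -n -e '1p
2p
3q')

printf '%s\n' "$semi"
```

        With -n in effect, q at line 3 exits without printing that
        line, so both forms are expected to print "one" and "two".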

11.     The standard does not specify that if EOF is reached during
        the execution of the n command the program terminates, e.g.

        sed -e '
        n
        i\
        hello
        ' </dev/null

        will not produce any output.  This implementation follows
        historic practice.

12.     The standard does not specify that the q command causes all
        lines that have been appended to be output and that the pattern
        space is printed before exiting.  This implementation follows
        historic practice.
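
        The q behavior described above can be sketched as follows.
        The two-line input is an assumption chosen for illustration;
        the expectation is that both the pattern space and the queued
        a text appear before sed exits.

```shell
# Line 1 queues "appended" with a, then q exits.  Under historic
# practice the pattern space and the queued text are both written
# before sed exits, and line 2 is never read.
out=$(printf 'one\ntwo\n' | sed -e '1a\
appended
1q')

printf '%s\n' "$out"
```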

13.     Historic implementations ignore comments in the text of the i
        and a commands.  This implementation follows historic practice.

14.     Historic implementations do not consider the last line of a
        file to match $ if an empty file follows, e.g.

        sed -n -e '$p' /usr/dict/words /dev/null

        will not print anything.  This is not mentioned in the POSIX
        specification and is almost certainly a bug.  This implementation
        follows the POSIX specification.

TK      Diomidis, I think we need to fix this, can you do it?
DDS     We follow POSIX.  You don't mean to do it buggy?
TK      I see... (I didn't understand that problem until now.)  I think
TK      that we *should* print out the last line of the dictionary, in
TK      the above example, but I can see how it would be hard.  What do
TK      you think?
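
        A small reproduction of the $ case, assuming a temporary
        two-line file in place of /usr/dict/words (the file contents
        and variable names are hypothetical):

```shell
# Hypothetical two-line file standing in for /usr/dict/words.
words=$(mktemp)
printf 'alpha\nomega\n' > "$words"

# Under the POSIX reading, $ matches the last line of input even
# though an empty file (/dev/null) follows it on the command line.
last=$(sed -n -e '$p' "$words" /dev/null)

rm -f "$words"
printf '%s\n' "$last"
```

        An implementation with the historic behavior would print
        nothing here; one that follows POSIX is expected to print
        "omega".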

15.     Historical implementations do not output the change text
        of a c command in the case of an address range whose first
        line number is greater than the second (e.g. 3,1).  The POSIX
        standard requires that the text be output.  Since the historic
        behavior doesn't seem to have any particular purpose, this
        implementation follows the POSIX behavior.
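
        A sketch of the 3,1 case under the POSIX behavior, using a
        made-up four-line input:

```shell
# The range 3,1 starts at line 3 and its second address has already
# been passed, so it selects only line 3.  POSIX requires the c text
# to be output there.
out=$(printf 'a\nb\nc\nd\n' | sed -e '3,1c\
changed')

printf '%s\n' "$out"
```

        Line 3 is expected to be replaced by "changed" while the
        other lines pass through untouched.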

16.     Historical implementations output the c text on EVERY line not
        included in the two-address range in the case of a negation '!'.

TK      Diomidis, this seems reasonable, I don't see where the standard
TK      conflicts with this.

17.     The standard does not specify that the p flag of the s command
        writes the pattern space plus a newline to the standard output.

TK      I think this is covered by the general language around 11293
TK      that says that the pattern space is always followed by a newline
TK      when output.
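
        The p flag behavior can be checked by counting the bytes
        written.  This is only a sketch; the input text and variable
        names are assumptions.

```shell
# -n suppresses the default output, so the only line written comes
# from the p flag after a successful substitution.
out=$(printf 'foo\nbaz\n' | sed -n -e 's/foo/bar/p')

# "bar" plus the trailing newline is four bytes.  tr strips the
# padding that some wc implementations add.
bytes=$(printf 'foo\n' | sed -n -e 's/foo/bar/p' | wc -c | tr -d ' ')

printf '%s (%s bytes)\n' "$out" "$bytes"
```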

18.     The standard does not specify whether address ranges are
        checked and reset if a command is not executed due to a
        jump.  The following program can behave in two different
        ways depending on whether the range operator is reset at
        line 6 or not.  This is important in the case of pattern
        matches.

        sed -n -e '
        4,8b
        s/^/XXX/p
        1,6 {
                p
        }'

TK      I don't understand this -- can you explain further?
DDS     The 1,6 operator will not be executed on line 6 (due to the 4,8b
DDS     line) and thus it will not clear.  In this case you can check for
DDS     line > 6 in apply, but what if the 1,6 was /BEGIN/,/END/
TK      OK, I understand, now.  Well, I think I do, anyhow.  It seems to
TK      me that applies() will never see the 1,6 line under any circumstances
TK      (even if it was /BEGIN/,/END/) for lines 4 through 8.
TK      A nastier example, as you point out, is:
TK              2,4b
TK              /one/,/three/c\
TK                      append some text
TK
TK      The BSD sed appends the text after the "branch" no longer applies,
TK      i.e. with the input: one\ntwo\nthree\nfour\nfive\nsix it displays
TK      two\nthree\nfour\nappend some text BUT THEN IT STOPS!
TK      Our sed, of course, simply never outputs "append some text".  It
TK      seems to me that our current approach is "right", because it would
TK      be possible to have:
TK              1,4b
TK              /one/,/five/c\
TK                      message
TK
TK      where you only want to see "message" if the patterns "one" ... "five"
TK      occur, but not in lines 1 to 4.  What do you think?

18.     Historical implementations allow an output suppressing #n at the
        beginning of -e arguments as well.  This implementation follows
        historical practice.

19.     POSIX does not specify whether more than one numeric flag is
        allowed on the s command.

TK      What's historic practice?  Currently we don't report an error or
TK      do all of the flags.

20.     The standard does not specify whether a script is mandatory.
        Historic sed implementations behave differently with ls | sed
        (no output) and ls | sed - e'' (behaves like cat).

TK      I don't understand what 'sed - e' does (it should be illegal,
TK      right?)  It seems to me that a script should be mandatory,
TK      and sed should fail with an error if not given one.

21.     The requirement to open all wfiles from the beginning makes sed
        behave nonintuitively when the w commands are preceded by addresses
        or are within conditional blocks.  This implementation follows
        historic practice, by default, and provides a flag for more
        reasonable behavior.

TK      I'll put it on my TODO list... ;-}

22.     The rule specified in lines 11412-11413 of the standard does
        not seem consistent with existing practice.  Historic sed
        implementations I tested copied the rfile to standard output
        every time the r command was executed and not before reading
        a line of input.  The wording should be changed to be
        consistent with the 'a' command i.e.

TK      Something got dropped, here... Can you explain further what
TK      historic versions did, what they should do, what we do?

23.     The standard does not specify how escape sequences other
        than \n and \D (where D is the delimiter character) are to
        be treated.  A strict interpretation would be that they
        should be treated literally.  In the sed implementations I
        have tried the \ is simply ignored.

TK      I don't understand what you're saying, here.  Can you explain?

24.     The standard specifies that an address can be "empty".  This
        implies that constructs like ,d or 1,d and ,5d are allowed.
        This is not true for historic implementations of sed.  This
        implementation follows historic practice.

25.     The b, t and : commands ignore leading white space, but not
        trailing white space.  This is not specified in the standard.

TK      I think that line 11347 points out that the synopsis shows
TK      which are valid.

        Although the standard specifies that reading from files that
        do not exist from within the script must not terminate the
        script, it does not specify what happens if a write command
        fails.  Historic practice is to fail immediately if the file
        cannot be opened or written.  This implementation follows that
        practice.

26.     Historic practice is that the \n construct can be used for
        either string1 or string2 of the y command.  This is not
        specified by the standard.  This implementation follows
        historic practice.
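
        A minimal sketch of \n in string2 of the y command, with a
        made-up input line:

```shell
# \n in string2 of y stands for a newline, so each space in the
# pattern space becomes a line break.
out=$(printf 'a b c\n' | sed -e 'y/ /\n/')

printf '%s\n' "$out"
```

        Under historic practice this is expected to print a, b and c
        on separate lines.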

29.     The standard does not specify if the "nth occurrence" of a
        regular expression in a substitute command is an overlapping
        or a non-overlapping one, e.g. what is the result of s/a*/A/2
        on the pattern "aaaaa aaaaa".  Historical practice is to drop
        core or do non-overlapping expressions.  This implementation
        follows historic practice.

30.     Historic implementations of sed ignore the regular expression
        delimiter characters within character classes.  This is not
        specified in the standard.  This implementation follows historic
        practice.