Programming Language Syntax

Introduction

Most of the technical issues around programming language syntax - such as formal language theory and parser generation - were solved early on in computing science. However, the syntax of a language remains its most visible characteristic.

Here we take a somewhat unstructured tour through some of the varieties of syntactic forms that have been tried over the years.

Compound statements

The formal notion of a compound statement probably originated with Algol 60, which used begin and end to delimit a compound 'block', with semicolons famously used as statement separators within it.

CPL (circa 1963) borrows the classical typographical section sign §, which is used both to begin and to end a block. It also allows this sign to be followed by a section number, potentially in hierarchical dotted decimal form. For example:

§1.1 i := i + 1
     j := j + 1 §1.1

Section numbers can resolve ambiguity in cases where it is not clear whether a § is terminating a block or starting a new one. With section numbers it is also possible for a single § to terminate multiple blocks.

CPL's descendant, BCPL (circa 1968), replaces the section character with the digraphs $( and $). The optional section number of CPL was also retained. The BCPL implementation at Xerox PARC (circa 1972) used [ and ] and, as is well known, BCPL's descendant C (circa 1973) changed these to the braces { and } that we are stuck with today. More recent BCPL implementations usually allow C-style braces.

The RLisp language (circa 1970), used by the Reduce algebra system, used Algol-style begin and end but allowed the alternative digraphs << and >>.

The original ML language (circa 1973) uses the semi-colon as a compound forming operator - so a compound statement may be written without any starting or ending markers, although parentheses may be used. So a compound statement formed from A, B and C would be written as

A;B;C

Since the semi-colon is used in this way, an alternative method is needed to indicate the end of a top-level expression or declaration. This is the double semi-colon digraph ;; a feature so distinctive that the successor language Caml used it as its console icon on Windows. Standard ML, the later official version of the language, uses implicit blocks so the ;; is unnecessary.

Implicit compounds

There are a few issues with the 'explicit' compound statements above. One is verbosity; a lot of effort seems to be spent reading and typing begin and end in Algol-like languages. Another is a piece of ambiguity, the 'dangling-else' problem. This arises out of a naive attempt to express the syntax of an if statement with optional else in a BNF-like rule:

  <statement> := if <expr> then <statement> [else <statement>]

Since an if clause is itself a statement another if can directly follow the then, for example:

 if a=1 then if b=1 then print('alpha')

When an else is added at the end of this statement, the rule does not allow us to decide which of two possible readings is intended:

 if a=1 then
     if b=1 then print('alpha')
     else print('beta')

 if a=1 then
     if b=1 then print('alpha')
 else print('beta')

In practice, all languages in which this issue could arise do specify which alternative must be taken, but their definitions have to do so either by using an extra-syntactic note, or with a more complex set of rules.
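
The ambiguity can be made concrete with a short Python sketch. It is only an illustration under simplifying assumptions: the function names parse_all and complete_parses are hypothetical, conditions and simple statements are reduced to single tokens, and the parser tries every way of applying the naive rule rather than choosing one.

```python
def parse_all(tokens, i=0):
    """Return every (tree, next_index) parse of one statement at tokens[i]."""
    if i >= len(tokens) or tokens[i] in ("then", "else"):
        return []
    if tokens[i] != "if":
        # Any other token stands for a simple statement.
        return [(tokens[i], i + 1)]
    # An if statement needs a one-token condition followed by "then".
    if i + 2 >= len(tokens) or tokens[i + 2] != "then":
        return []
    cond = tokens[i + 1]
    results = []
    for then_branch, j in parse_all(tokens, i + 3):
        # Reading in which this if has no else part.
        results.append((("if", cond, then_branch, None), j))
        # Reading in which the following else belongs to this if.
        if j < len(tokens) and tokens[j] == "else":
            for else_branch, k in parse_all(tokens, j + 1):
                results.append((("if", cond, then_branch, else_branch), k))
    return results


def complete_parses(tokens):
    """Keep only the parses that consume the whole input."""
    return [tree for tree, end in parse_all(tokens) if end == len(tokens)]


tokens = "if a then if b then alpha else beta".split()
trees = complete_parses(tokens)
for tree in trees:
    print(tree)
```

The sketch finds two complete parse trees for the same token sequence, one with the else attached to the outer if and one with it attached to the inner if. C and Pascal, for example, select the second of these by attaching an else to the nearest unmatched if.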

A final problem can occur when editing program text and forgetting to add the compound block symbols when moving from a single statement to a compound. This can result in legal code that compiles and runs, but does not give the expected result. (Some coding standards have mandated use of compound blocks even for a single statement to avoid this.)

All these problems are solved when the language syntax is changed to require a closing symbol to mark the end of the block, forming an 'implicit' compound statement. In the example above, an endif symbol could be added to the rule:

 <statement> := if <expr> then <statement> [else <statement>] endif

Now the different readings would have to have different forms:

 if a=1 then
     if b=1 then print('alpha')
     else print('beta')
     endif
 endif

 if a=1 then
     if b=1 then print('alpha')
     endif
 else print('beta')
 endif
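
As a rough illustration of why the closing symbol removes the ambiguity, the following Python sketch parses the endif form of the rule with a simple recursive descent. The function name parse and the single-token conditions and statements are assumptions made for the example, not part of any particular language definition.

```python
def parse(tokens, i=0):
    """Parse one statement at tokens[i]; return (tree, next_index)."""
    if i >= len(tokens):
        raise SyntaxError("unexpected end of input")
    if tokens[i] in ("then", "else", "endif"):
        raise SyntaxError("unexpected keyword " + tokens[i])
    if tokens[i] != "if":
        # Any other token stands for a simple statement.
        return tokens[i], i + 1
    if i + 2 >= len(tokens) or tokens[i + 2] != "then":
        raise SyntaxError("expected a condition followed by then")
    cond = tokens[i + 1]
    then_branch, j = parse(tokens, i + 3)
    else_branch = None
    if j < len(tokens) and tokens[j] == "else":
        else_branch, j = parse(tokens, j + 1)
    # The closing symbol is mandatory, so each if ends at exactly one place.
    if j >= len(tokens) or tokens[j] != "endif":
        raise SyntaxError("missing endif")
    return ("if", cond, then_branch, else_branch), j + 1


inner_else = "if a then if b then alpha else beta endif endif".split()
outer_else = "if a then if b then alpha endif else beta endif".split()
inner_tree, _ = parse(inner_else)
outer_tree, _ = parse(outer_else)
print(inner_tree)
print(outer_tree)
```

Each input has a single parse, and the position of the first endif decides whether the else belongs to the inner or the outer if, so no extra disambiguating note is needed.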

Closing symbols

There are a number of approaches to choosing the closing symbols to use. The Wirth family of Pascal descendants (the Modulas and Oberons) use the same symbol, end, in all situations, as does Matlab. This saves adding lots of new keywords to the language, but misses the chance of making programs more readable. Ada and Fortran require two symbols, end and the opening keyword, such as end if and end while. Others such as Pop-2 just add multiple keywords as in endif and endwhile. Mupad includes an underscore as in end_if and end_while.

The most dramatic convention for closing symbols originated with Algol-68. Here the letters of the opening symbol are reversed to give the closing symbol, as in if ... fi, do ... od and case ... esac. C.H. Lindsey, in his history of Algol-68, describes this as whimsical, and it could serve as a point of humor in an otherwise dry textbook.

Separators and terminators

The semi-colon of Algol has become near universal for separating and/or terminating statements. Languages differ in whether the semi-colon separates statements or terminates them. Algol, for example, just uses a separator, so we might write:

begin statement1;statement2 end

In C, on the other hand, statements are terminated, so we write:

{ statement1;statement2; }

One subtlety in C is that a compound statement is not itself terminated by a semi-colon.
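
The difference between the two conventions can be sketched in a few lines of Python. This is a toy model rather than a real parser: the function names split_separated and split_terminated are hypothetical, and statements are assumed to be plain text containing no semi-colons of their own.

```python
def split_separated(body):
    """Separator style: semi-colons sit between statements."""
    return [part.strip() for part in body.split(";")]


def split_terminated(body):
    """Terminator style: every statement must end with a semi-colon."""
    parts = [part.strip() for part in body.split(";")]
    if parts[-1] != "":
        raise SyntaxError("last statement lacks its terminating semi-colon")
    return parts[:-1]


print(split_separated("statement1;statement2"))
print(split_terminated("statement1;statement2;"))
# Under the separator convention a trailing semi-colon is followed by an
# extra, empty statement, which shows up here as an empty string.
print(split_separated("statement1;statement2;"))
```

In this model the terminator version rejects input whose last statement has no semi-colon, while the separator version treats a trailing semi-colon as introducing one more, empty, statement.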

Some Lisp based languages such as Balm and Musimp have preferred to use a comma instead of the semi-colon as a statement separator.

The statistical language Glim uses $ as a command prefix instead of a separator and, for some commands, as a terminator. For example, here there are three commands:

$ACCURACY 2$ FACTOR AGE 2$PLOT WEIGHT AGE $

Because PLOT takes a variable number of arguments - that might be split across a line - the final $ terminator is necessary.

Basic uses the colon to separate multiple statements on a single line.

Function call, grouping and implicit multiplication

Writing function and procedure calls in the mathematical style with parentheses, such as f(x), is by far the most common but there are some exceptions. Many functional languages - such as Haskell, Hope, ML - omit the parentheses and write just f x. Parentheses are still available for grouping as in (sin x)+(cos x) instead of the more conventional sin(x)+cos(x).

The use of parentheses for both function call and grouping causes ambiguity if we wish to allow the mathematical convention of writing multiplication without an operator, as in sin(x) cos(x). CPL was an early language to try this, and as a result, function calls are written using square brackets, as in f[x]. This syntax is also used by Mathematica. Since CPL also used upper case letters for single character variables, X*X could be written as XX, without any spaces at all.
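
A minimal Python sketch of the juxtaposition idea follows. It assumes, as a simplification of the description above, that every upper case letter is a separate one-character variable; the function name expand_juxtaposition is hypothetical and the code is not a reconstruction of CPL's actual lexical rules.

```python
def expand_juxtaposition(expr):
    """Insert an explicit '*' between adjacent one-letter variables."""
    out = []
    for ch in expr:
        # Two upper case letters in a row are two variables multiplied together.
        if out and out[-1].isupper() and ch.isupper():
            out.append("*")
        out.append(ch)
    return "".join(out)


print(expand_juxtaposition("XX"))    # X*X
print(expand_juxtaposition("XY+Z"))  # X*Y+Z
```

The sketch only handles variables written directly next to each other. Extending it to a variable followed by a parenthesised expression shows where the ambiguity comes from: the text alone cannot say whether that is a call or a product, which is why square brackets for calls are helpful.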

BCPL (a CPL descendant) drops the implicit multiplication but still allows the use of square brackets as well as parentheses for function calls.