From 3c88151852425b1b87d0b7018b28e77f8a96eec8 Mon Sep 17 00:00:00 2001 From: davidcok <davidcok@github.com> Date: Wed, 31 Jul 2019 14:59:42 +0200 Subject: [PATCH] More editing --- doc/userman/description.tex | 185 +++++++++++++++++++++++----------- doc/userman/grammar.tex | 63 +++++++++--- doc/userman/limitations.tex | 4 +- doc/userman/main.tex | 36 ++++++- doc/userman/preprocessing.tex | 36 ++++--- 5 files changed, 231 insertions(+), 93 deletions(-) diff --git a/doc/userman/description.tex b/doc/userman/description.tex index 52f15957..683f7787 100644 --- a/doc/userman/description.tex +++ b/doc/userman/description.tex @@ -8,50 +8,74 @@ The instructions for doing so are provided at \fclang depends on two software packages: \begin{itemize} -\item A current version of \framac itself +\item A current version of \framac itself. It is highly recommended to install \framac using \lstinline|opam|, as described in the installation procedures for \framac (\url{https://frama-c.com/download.html}) \item An installation of Clang, which is available as part of LLVM, which itself is available from \url{http://releases.llvm.org}. \end{itemize} Building and installing \fclang has two effects: \begin{itemize} -\item The \fclang executable files are installed within the \framac installation +\item The \fclang executable files are installed within the \framac installation. In particular, if \framac has been installed using \lstinline|opam|, then \irg will be installed in the +\lstinline|opam| \lstinline|bin| directory. You must be sure that this directory is on your system \verb|$PATH| (try \lstinline|which framaCIRGen| after installation to be sure). \item Include files containing \acslpp specifications of C++ library functions are copied to \verb|$FRAMAC_SHARE/libc| and -\verb|$FRAMAC_SHARE/libc++|, where \verb|$FRAMAC_SHARE| is the path +\verb|$FRAMAC_SHARE/frama-clang/libc++|, where \verb|$FRAMAC_SHARE| is the path given by the command \lstinline|frama-c-config -print-share-path|. \end{itemize} These include files are replacements for the standard system include files. They should have the same definitions of C and C++ functions and classes, but with \acslpp annotations giving their specifications. +The plugin can be built by hand from source using the following commands. Create a new directory to which you download and unpack the source distribution. Then \lstinline|cd| into the source directory itself (one level down) and execute: +\begin{listing-nonumber} +make distclean +autoconf -f +./configure --prefix=<opamroot> +make +make install +\end{listing-nonumber} +\lstinline|<opamroot>| should be the absolute path of the root of the opam installation, e.g. something like \lstinline|~/.opam/4.05.0| or \lstinline|~/.opam/system|. If \framac has been installed independently of \lstinline|opam|, then set \lstinline|<opamroot>| above so that the location of the \framac executable (that is, the result of executing \lstinline|which frama-c|) is +\lstinline|<opamroot>/bin/frama-c|. + \chapter{Running the plug-in} \section{C++ files} Once installed the plugin is run automatically by Frama-C on any C++ files listed on the command-line. C++ files are identified by their filename suffixes. The default suffixes recognized as C++ are these: - .cpp, .C, .cxx, .ii, .ixx, .ipp, .i++, .inl, .h, .hh +\lstinline| .cpp, .C, .cxx, .ii, .ixx, .ipp, .i++, .inl, .h, .hh| Currently this set of suffixes is compiled in the plugin (in file \texttt{frama\_Clang\_register.ml}) and can only be changed by recompiling and reinstalling the plugin. \section{Frama-clang executable} -The plug-in operates by invoking the executable \irg +The plug-in operates by invoking the executable \irg (which must be on the system \verb|\$PATH|) on each file identified as C++, in turn. For each file it produces a temporary output file containing the C AST, which is then given as input to \framac. This executable is a single-file-at-a-time command-line executable only. Various options control its behavior. The file-system path identifying the executable is provided by the \textbf{-cxx-clang-command <cmd>} -option and is \irg by default. The path may be absolute; if it is a relative path, it is found relative to the TODO directory. [ TODO - check this] +option and is \irg by default. The path may be absolute; if it is a relative path, it is found by searching the system \verb|\$PATH|. +The launching of \irg by \framac includes the following options by default: +\begin{itemize} +\item \verb|-target <target>| with the target being set according to the \lstinline|-machdep| and \lstinline|-target| options given to \framac (cf. \S\ref{sec:bit}) +\item \verb|-D__FC_MACHDEP_86_32| -- also set according to the chosen architecture +\item \verb|-std=c++11| -- target C++11 features +\item \verb|-nostdinc| -- use \fcl and \framac system header files, and not the compiler's own header files +\item \verb|-I$FRAMAC_SHARE/frama-clangs/libc++ -I$FRAMAC_SHARE/libc| -- include the \fclang and \framac header files, which contain system library definitions with \acslpp annotations +\item \verb|-stop-annot-error| if the corresponding option (\lstinline|-fclang-warn-key=annot-error=abort|) is given to \framac +\item options to set the level of info messages and warning messages +\end{itemize} + +The PARSING section of the output of \lstinline|frama-c -kernel-h| lists some options for controlling the behavior described above. Also see the options listed by \lstinline|frama-c -fclang-h| such as \lstinline|-cxx-stdlib-path|, \lstinline|-cxx-cstdlib-path|, \lstinline|-cxx-nostdinc|, \lstinline|-cxx-stdinc|. \section{Frama-clang options} -The options controlling \fclang are of two sorts: +The options controlling \fclang are of three sorts: \begin{itemize} -\item options known to \framac are interpreted and passed on to \fclang when -\irg is invoked, perhaps with a different option name -\item options known to \fclang directly (and not to \framac) must be +\item options known to the \framac kernel +\item options \fcl plug-in has registered with the \fc kernel. These also are recognized by the \fc command-line. +\item options known to \irg directly (and not to \fc) must be included in the internal command that invokes \irg using the \option{-cpp-extra-args} option. These options are described in \S\ref{sec:standalone}. \end{itemize} @@ -67,37 +91,81 @@ shows all \fcl-specific options. The most important of the options are these: \begin{itemize} - \item \option{-cpp-extra-args <string>} -- the single string argument to this option is appended to the command-line when + \item \option{-cpp-extra-args <string>} -- the single string argument to this option is \textit{prepended} to the command-line when \lstinline|frama-clang| is invoked internally. It is particularly important for adding include directories (\lstinline|-I|) and other options to be passed on to the clang compiler or the \irg executable. - Multiple instances of this option have a cumulative effect (rather - than later instances replacing earlier ones). [TODO - check this] - \item \option{-machdep <arg>} -- TODO + Multiple instances of this option have a cumulative effect, in order (rather + than later instances replacing earlier ones). + \item \option{-machdep <arg>} -- sets the target machine architecture, cf. \S\ref{sec:bit} \item \option{-arg <arg>} -- TODO - \item \option{-fclang-msg-key <categories>} -- TODO - \item \option{-fclang-warn-key <categories>} -- TODO - \item \option{-fclang-verbose <n>} -- TODO - \item \option{-fclang-debug <n>} -- TODO + \item \option{-fclang-msg-key <categories>} -- sets the amount of informational messages + \item \option{-fclang-warn-key <categories>} -- sets the amount and behavior of warnings + \item \option{-fclang-verbose <n>} -- sets the amount of information from the \fclang plug-in + \item \option{-fclang-debug <n>} -- sets the amount of debug information from the \fclang plug-in \item TODO \end{itemize} \section{Include directories} -TODO +By default \irg is given the paths to the two directories containing the \fcl and \fc header files, which include \acslpp specifications for the C++ library functions. The default paths to these directories are set by the \lstinline|-cxx-c++stdlib-path| and \lstinline|-cxx-cstdlib-path| options. + +Users typically have additional header files for their own projects. These are supplied to the \fcl preprocessor using the option \lstinline|-cpp-extra-args|. + +You can use \lstinline|-fclang-cpp-extra-args| instead of \lstinline|cpp-extra-args|; multiple such options also have a cumulative effect. the \fcl option only affects the \fcl plugin, whereas +\lstinline|cpp-extra-args| may be seen by other plugins as well, if such plugins do their own preprocessing. Also note that the presence of any instance of \lstinline|-fclang-cpp-extra-args| will cause uses of \lstinline|cpp-extra-args| to be ignored. + +The system header files supplied by \fcl may not include all C++ system files. Omissions should be reported to the \fc team. + + \section{32 and 64-bit targets} +\label{sec:bit} + +\acslpp is for the most part machine-independent. There are some features of C++ that can be environment-dependent, such as the sizes of fundamental datatypes. Consequently, \framac has some options that allow the user to state what machine target is intended. The \fcl options are discussed in \S\ref{sec:fcloptions}. + +\begin{itemize} +\item The \textbf{-machdep} option to \framac. See the allowed values using teh command \lstinline|frama-c -machdep help|. For example, with a value of x86\_32, \lstinline|sizeof(long)| has a value of 4, whereas with the option \textbf{-machdep x86\_64}, \lstinline|sizeof(long)| has a value of 8. +\item Alternnately, the value of \textbf{-machdep} can be set instead using an environment variable: \lstinline|MACHDEP|. The variable can be set either in the shell environment or on the command line with \lstinline|-DMACHDEP=...| + +\end{itemize} + + +\section{Warnings, errors, and informational output} + +Output messages arise from multiple places: from the \fcl plugin, from the \irg lexer and parser, from the clang parser, and from the frama-c kernel (as well as from any other plugins that may be invoked, such as wp). They are controlled by a number of options within the \fc kernel and each plugin. Remember that clang and \irg options must be put in the \lstinline|-cpp-extra-args| option. + +\subsection{Errors} + +Error messages are always output. The key question is whether processing stops or continues with each error. Continuing can result in a cascade of unhelpful error messages, but stopping immediately can hide errors that occur later in source files. +\begin{itemize} +\item \lstinline|-Werror| is a clang and \irg option that causes any parser warnings to be treated as errors +\item \lstinline|--stop-annot-error| is a \irg option that causes prompt termination on errors (default is to continue) +\item \lstinline|-fclang-warn-key=annot-error=abort| is a \fcl plug-in option that will invoke \irg with \lstinline|--stop-annot-error|. \lstinline|error| and \lstinline|error_once| (instead of \lstinline|abort|) have the same effect; other values for the key will allow continuing after errors. +\end{itemize} + +\subsection{Warnings} + +The various categories of warnings from \fcl can be seen with the command \lstinline|frama-c -fclang-warn-key help|. +Warning messages are emitted by default. + +Warning messages from \irg can be controlled with the \lstinline|-warn| option of ]\irg, or, equivalently, the \lstinline|-fclang-warn-key=parse| option of \fc. -TODO +\begin{itemize} +\item the \irg option \lstinline|--no-warn| or \lstinline|-warn=0|turns off \irg warning messages +\item the \irg option \lstinline|--warn=<n>|, with $n > 0$ turns on \irg warning messages; the higher the value $n$ the more messages +\item the \irg option \lstinline|--warn| is the same as \lstinline|--warn=1| +\item the \fc option \lstinline|-fclang-warn-key=parse=inactive| is the same as the \irg option \lstinline|--warn=0| +\item the \fc option \lstinline|-fclang-warn-key=parse=active| is the same as the \irg option \lstinline|--warn=1| +\end{itemize} +The clang options are not currently integrated with the \fc warning and error key system. -\section{Controlling output} +\subsection{Informational output} -TODO +\textit{This section is not yet written} -- make warnings errors -- stop om first error -- supplying an exisintg IRG file to frama-c +The clang informational output is not currently integrated with the \fc warning and error key system. \chapter{Running the \fclang front-end standalone} \label{sec:standalone} @@ -113,7 +181,7 @@ of the Cil (C Intermediate language). Frama-Clang is built on the Clang C++ parser. As such, the \irg executable accepts the clang compiler options and passes them on to clang. There are many of these. -Many of these are irrelevant to \fcl as they relate to lstinline|framaCIRGen| +Many of these are irrelevant to \fcl as they relate to code generation, whereas \fcl only uses clang for parsing, name and type resolution. You can see a list of options by running @@ -121,41 +189,42 @@ You can see a list of options by running The most significant options for \irg are these: \begin{itemize} - \item \option{-I <dir>} -- adds a directory to the include file search path. Using absolute paths is recommended; relative paths are relative to TODO. - \item \option{-o <file>} -- specifies the location of the output file (that is, the file to contain the generated AST). The output path may be absolute or relative to the current working directory. + \item \option{-I <dir>} -- adds a directory to the include file search path. Using absolute paths is recommended; relative paths are relative to the current working directory. + \item \option{-o <file>} -- specifies the name and location of the output file (that is, the file to contain the generated AST). The output path may be absolute or relative to the current working directory. \textit{This option is required.} \item \option{-std=...} -- the input programming language (default is \lstinline|-std=c++11|) - \item \option{-w} -- suppress warnings \item \option{-{-}version} -- print version information \end{itemize} -\section{Custom ASTs} - -In standard mode, \framac invokes \irg on a file, producing an AST in intermediate format, and the reads that intermediate file into \framac to complete the processing. -If some manipulation of the AST intermediate is needed, those two steps can be performed separately as follows: -\begin{itemize} -\item Produce an intermediate AST \lstinline|f.ast| for a given input file \lstinline|f.cpp| using the command \\ -\centerline{\normalsize{framaCIRGen <options> -o f.ast f.cpp}} -\item Manipulate \lstinline|f.ast| as desired. -\item Run \framac on the AST using the command \\ -\centerline{\normalsize{frama-c <options> -cpp-command "cat f.ast" f.cpp}} -\end{itemize} - -If you have multiple files, do the following: -\begin{itemize} -\item Create the ast files for a group of files in \$files:\\ -\centerline{\bf \texttt{ for f in \$files; do framaCIRGen <options> -o \${f\%.*}.ast \$f ; done }} -\item Manipulate the resulting .ast files as needed -\item Execute a command like \\ -\centerline{\bf \texttt{frama-c -cpp-command "`pwd`/ct" \$files}} -\end{itemize} -where \lstinline|ct| is an executable similar to\\ -\centerline{\bf \texttt{do f in \$@ ; cat \$\{f.\%*\}.ast ; done}} - -\section{Frama-Clang specific options} - -There are also options that are specific to \fcl. +%\section{Custom ASTs} +%\lstset{keepspaces=true} +%In standard mode, \framac invokes \irg on a file, producing an AST in intermediate format, and the reads that intermediate file into \framac to complete the processing. +%If some manipulation of the AST intermediate is needed, those two steps can be performed separately as follows: +%\begin{itemize} +%\item Produce an intermediate AST \lstinline|f.ast| for a given input file \lstinline|f.cpp| using the command \\ +%\centerline{\lstinline|framaCIRGen <options> -o f.ast f.cpp|} +%\item Manipulate \lstinline|f.ast| as desired. +%\item Run \framac on the AST using the command \\ +%\centerline{\lstinline|frama-c <options> -cpp-command "cat f.ast" f.cpp|} +%\end{itemize} +% +%If you have multiple files, do the following: +%\begin{itemize} +%\item Create the ast files for a group of files in \$files:\\ +%\centerline{\bf \texttt{ for f in \$files; do framaCIRGen <options> -o \${f\%.*}.ast \$f ; done }} +%\item Manipulate the resulting .ast files as needed +%\item Execute a command like \\ +%\centerline{\bf \texttt{frama-c -cpp-command "`pwd`/ct" \$files}} +%\end{itemize} +%where \lstinline|ct| is an executable similar to\\ +%\centerline{\bf \texttt{for f in \$@ ; do cat \$\{f\%.*\}.ast ; done}} + +\section{\irg specific options} +\label{sec:fcloptions} + +There are also options that are specific to \irg. These are listed below: \begin{itemize} + \item \option{-o <file>} -- specifies the given file as the output file for the translated AST (file is overwritten) \item \option{-{-}info=<n>} -- sets the level of informational messages to \textbf{n}; 0 is completely quiet; increasing values are more verbose. \\ \option{-{-}info} sets the level to 1 \\ @@ -175,15 +244,11 @@ The \lstinline|frama-c| option \option{-fclang-debug=<n>} is equivalent to setti In particular, a debug value of 1 shows the command-line that invokes \irg. \item \option{-{-}stop-annot-error} -- if set, then parsing stops on the first error; default is off - \item \option{-{-}gen-impl-method} -- TODO \item \option{-{-}exit-code} -- if set, then the exit code of \irg is 1 if errors occur; this is not the default because then \lstinline|frama-c| would terminate upon any error in \lstinline|framaCIRGen| \item \option{-{-}no-exit-code} -- disables \option{-{-}exit-code}, so that the exit code is always 0. - -\item TODO: -b -x -o -v + \item \option{-{-}gen-impl-method} -- TODO \end{itemize} -TBD - including command-line options - diff --git a/doc/userman/grammar.tex b/doc/userman/grammar.tex index afa166d5..c58e3dcf 100644 --- a/doc/userman/grammar.tex +++ b/doc/userman/grammar.tex @@ -1,23 +1,62 @@ -\newcommand{\lang}{C++} +\newcommand{\lang}{C++\xspace} \chapter{Grammar and parser for \acslpp} \label{sec:grammar} -This section summarizes some of the technical implementation considerations in writing a parser for \acslpp within \fclang. This material may be useful for developers and maintainers of \fclang; it is not needed by users of \fclang. +This section summarizes some of the technical implementation considerations in writing a parser for \acslpp within \fclang. +This material may be useful for developers and maintainers of \fclang; it is not needed by users of \fclang. -Recall that \fclang uses clang to parse \lang and a custom parser to parse \acslpp annotations, jointly producing a representation of the \lang+\acslpp source input in the Frama-C intermediate language. The first version of the \acslpp custom parser, swritten during the STANCE project, used a hand-written bison-like parser, but with function pointers to record state and actions rather than using a tool-generated table to drive the parsing. This design proved to be too brittle and difficult to efficiently change as new features were added to \acslpp. Consequently during the VESSEDIA project, the scanner and parser were completely rewritten, largely retaining the previous design's connections to clang, token definitions, and name lookup and type resolution. +Recall that \fclang uses clang to parse \lang and a custom parser to parse \acslpp annotations, jointly producing a representation of the \lang and \acslpp source input in the Frama-C intermediate language. +The first version of the \acslpp custom parser, written during the STANCE project, used a hand-written bison-like parser, but with function pointers to record state and actions rather than using a tool-generated table to drive the parsing. +This design proved to be too brittle and difficult to efficiently evolve as new features were added to \acslpp. +Consequently during the VESSEDIA project, the scanner and parser were completely rewritten, largely retaining the previous design's connections to clang, token definitions, and name lookup and type resolution. -The new parser uses a recursive descent design inn which the names of functions doing the parsing can match the names of non-terminals in the grammar. Consequently the implementation of the parser is much more readable, human checkable, and modifiable as the \acslpp language evolves. The drawback of this design is that \acslpp is not LL(1); it is not even LL(k) for fixed k. Thus some amount of backtracking is required. The offending features are mostly inherited from \lang. The next few subsections describe the problematic aspects of \acslpp and how they are addressed. +The new parser uses a recursive descent design in which the names of functions doing the parsing can match the names of non-terminals in the grammar. +Consequently the implementation of the parser is much more readable, human checkable, and modifiable as the \acslpp language evolves. +The drawback of this design is that \acslpp is not LL(1); it is not even LL(k) for fixed k. +Thus some amount of backtracking is required. The offending features are mostly inherited from \lang. +The next few subsections describe the problematic aspects of \acslpp and how they are addressed. -The principal goal of an LL(k) forumulation of a grammar is to be able to predict which grammar production is being seen in the input stream from a small amount of look-ahead. -Most \acslpp constructs start with a unique keyword (e.g., clauses begin with \lstinline|requires|, \lstinline|ensures| etc.) which serves this purpopse. But the constructs inherited from \lang pose some challenges. +The principal goal of an LL(k) formulation of a grammar is to be able to predict which grammar production is being seen in the input stream from a small amount of look-ahead. +Most \acslpp constructs start with a unique keyword (e.g., clauses begin with \lstinline|requires|, \lstinline|ensures| etc.) which serves this purpopse. +But the constructs inherited from \lang pose some challenges. \begin{itemize} -\item \textbf{Left recursion}. Expression grammars are typically left recursive. However, it is well-known how to factor out left recursion. The precedence order of operators is largely hard-coded into the grammar implementation; the usual left recursion poses no particular challenge. -\item {terms vs. predicates}. \acslpp distinguishes terms and predicates, following the distinction between propositional and predicate logic. However, terms and predicates have very similar grammars. Furthermore, \acslpp allows boolean-value terms to be implicitly converted to predicates and allows predicates to be used within terms (such as for the conditional sub-expression in a ternary expression). This makes it not possible to distinguish terms and predicates inn top-down parsing. However, Frama-C has different AST nodes for the two, so it would require a very significant refactoring of Frama-C and all its plugins to unify terms and predicates (as other specification languages have done). Note that this problem is a challenge for any parser design. The previous and current parser designs adopted the same solution; maintain two parallel parses of expressions --- one as a term and one as a predicate. Error messages are emitted only when both parses fail or when a particular grammar production call for a particular type (term or predicate) and that one is not available. -\item{terms vs. tsets}. TODO -\item \textbf{cast vs. parenthesized expression} To determine whether an input like \lstinline|(T)+x| is (a) the sum of the parenthesized expression \lstinline|(T)| and \lstinline|x| or (b) a cast of \lstinline|x| to the type \lstinline|T|, one must know whether \lstinline|T| is a variable or type. This is a classic problem in parsing \lang; it requires that identifiers be known to be either type names or variable names in the scanner. In addition, \lstinline|T| here can be an arbitrary type expression. For example, in \lang, a type expression can contain pointer operators that can look, at least initially like multiplications and they can contain template instantiations that look initially like comparisons. \fclang handles this situation by allowing a backtrackable parse. When a left parenthesis is parsed in an expression context, the parser proceeds by attempting a parse of a cast expression. If the contents of the parenthesis pair is successfully parsed as a type expression, it is assumed to be a cast expression. If such a parse fails, no error messages are emitted; rather the parse is rewound and proceeds again assuming the token sequence to be a parenthesized expression. -\item \textbf{set comprehension}. The set comprehension expression follows traditional mathematics: \lstinline: { $expr$ | $binders$ ; $predicate$ }:. This structure poses two difficulties for parsers. First, the expression may contain bitwise-or operators, so it is not known to the parser whether an occurence of \verb:|: is the beginning of the binders or is just a bitwise-or operator. Second, the expression will use the variables declared in the binders section. However, the binders are not seen until after the expression is scanned. Note that these problems are not unique to a recursive descent design; they would challenge a LALR parser just as much. \textit{This particular feature is not yet implemented in \fclang, nor in Frama-C and so is not yet resolved in the parser implementation.} -\item \textbf{labeled expressions}. \acslpp allows expressions to have labels, designated by a \lstinline|id : | prefix. Thus the parser cannot know whether an initial \lstinline|id| is a variable or just a label until the colon is parsed. Thus this situation requires a lookahead of 2 tokens. +\item \textbf{Left recursion}. Expression grammars are typically left recursive, which is problematic for recursive descent parsers. +However, it is well-known how to factor out left recursion. +The precedence order of operators is largely hard-coded into the grammar implementation; the usual left recursion poses no particular challenge. + +\item \textbf{terms vs. predicates}. \acslpp distinguishes terms and predicates, following the distinction between propositional and predicate logic. +However, terms and predicates have very similar grammars. +Furthermore, \acslpp allows boolean-value terms to be implicitly converted to predicates and allows predicates to be used within terms (such as for the conditional sub-expression in a ternary expression). +This makes it not possible to distinguish terms and predicates in top-down parsing. +However, Frama-C has different AST nodes for the two, so it would require a very significant refactoring of Frama-C and all its plugins to unify terms and predicates (as other specification languages have done). +Note that this problem is a challenge for any parser design. +The previous and current parser designs adopted the same solution: maintain two parallel parses of expressions --- one as a term and one as a predicate. +Error messages are emitted only when both parses fail or when a particular grammar production calls for a particular type (term or predicate) and that one is not available. + +\item \textbf{terms vs. tsets}. Similarly, the ACSL++ language definition defines tsets (sets of locations) with grammar productions separately from terms. +However, the grammars for the two are very similar. +It is much easier to parse and to implement if tsets are seen as terms with a specific type, namely sets. +Many operations on a data type are also simply element-by-element operations on sets of such data types. +Also, errors found during type-checking can be associated with more readable error messages than those found during parsing. + +\item \textbf{cast vs. parenthesized expression} To determine whether an input like \lstinline|(T)-x| is (a) the difference of the parenthesized expression \lstinline|(T)| and \lstinline|x| or (b) a cast of \lstinline|-x| to the type \lstinline|T|, one must know whether \lstinline|T| is a variable or type. +This is a classic problem in parsing \lang; it requires that identifiers be known to be either type names or variable names in the scanner. +In addition, \lstinline|T| here can be an arbitrary type expression. +For example, in \lang, a type expression can contain pointer operators that can look, at least initially like multiplications and they can contain template instantiations that look initially like comparisons. +\fclang handles this situation by allowing a backtrackable parse. +When a left parenthesis is parsed in an expression context, the parser proceeds by attempting a parse of a cast expression. If the contents of the parenthesis pair is successfully parsed as a type expression, it is assumed to be a cast expression. + +If such a parse fails, no error messages are emitted; rather the parse is rewound and proceeds again assuming the token sequence to be a parenthesized expression. + +\item \textbf{set comprehension}. The syntax of the set comprehension expression follows traditional mathematics: \lstinline: { $expr$ | $binders$ ; $predicate$ }:. This structure poses two difficulties for parsers. First, the expression may contain bitwise-or operators, so it is not known to the parser whether an occurence of \verb:|: is the beginning of the binders or is just a bitwise-or operator. Second, the expression will use the variables declared in the binders section. However, the binders are not seen until after the expression is scanned. Note that these problems are not unique to a recursive descent design; they would challenge a LR parser just as much. \textit{This particular feature is not yet implemented in \fclang, nor in Frama-C and so is not yet resolved in the parser implementation.} + +\item \textbf{labeled expressions}. \acslpp allows expressions to have labels, designated by a \lstinline|id : | prefix. +So the parser cannot know whether an initial \lstinline|id| is a variable or just a label until the colon is parsed. +Thus this situation requires a lookahead of 2 tokens. + +Ambiguity arises with the use of a colon for the else part of a conditional expression. So in an expression such as +\lstinline| a ? b ? c : d : e : f|, it is not know whether \lstinline|c| or \lstinline|d| or \lstinline|e| is a label. Parenthesizing must be used to solve this problem. \fcl presumes that if the then part of a conditional expression is being parsed, a following colon is always the colon introducing the else part. That is, the binding to a conditional expression has tighter precedence than to a naming expression. \end{itemize} diff --git a/doc/userman/limitations.tex b/doc/userman/limitations.tex index 2612fb3c..e0ca857c 100644 --- a/doc/userman/limitations.tex +++ b/doc/userman/limitations.tex @@ -7,7 +7,9 @@ The mosty important such limitations are described in this section. \section{Implementation of C++} -The following are not implemented in \acslpp. +\textit{These lists are not (nearly) complete} + +The following C++ features are not implemented in \acslpp. \begin{itemize} \item \acslpp definitions within template declarations \item digraphs within \acslpp annotations diff --git a/doc/userman/main.tex b/doc/userman/main.tex index 336cb0c0..15029170 100644 --- a/doc/userman/main.tex +++ b/doc/userman/main.tex @@ -75,6 +75,40 @@ which received funding from the European Union's 2020 Research and Innovation Program under grant agreement No. 731453. +\section*{TODO} + +These items are not yet addressed +\begin{itemize} +\item (Introduction) Verify the version of Clang supported +\item The custom build instructions with a prefix to a temp directory fail +\item Sort out the interactions among the options -verbose -debug -quiet -fclang-debug -fclang-verbose -fclang-msg-key -fclang-warn-key +\item Sort out and describe interactions among -annot -pp-annot and frama-clang +\item frama-clang info/warn/error messages are not yet properly categorized and integrated with -fclang-log, -fclang-msg-key, fclang-warn-key. In particular, clang messages are completely independent +\item Add comments about using system header files +\item Remove \$FRAMAC\_SHARE from the default command-line, and builtin.h from \$FRAMAC\_SHARE +\item Comment on the demangling options +\item Comment on printing out the preprocessed file +\item \irg options are not included in -fclang-h +\item -print option to \irg +\item Review the clang and \irg options +\item Fix how to do custom ASTs +\item decribe the framaCIRGen options: -x -b -v --gen-impl-meth +\item Comment on handling of unicode in the preprocessor +\item Review the preprocessor tokens +\end{itemize} + +Bugs +\begin{itemize} +\item framaCIRGen -help expects an argument +\item non-clang options not included in framaCIRGen -help +\item The -o option is required +\item What is the -w option +\item framaCIRGen -o prints help summary twice +\item it appears that trigraph processing cannot be turned off, is on by default for framaCIRGen but off by default for clang++; not sure why +\item it does not seem that clang options are passed through, e.g. -fdollars-in-identifiers; framaCIRGen defaults are different than clang +\item preprocessor does not concatenate string literals +\end{itemize} + %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% @@ -82,7 +116,7 @@ No. 731453. \include{description} \include{limitations} -\include{preprocessing}http://web.ift.uib.no/Teori/KURS/WRK/TeX/symALL.html +\include{preprocessing} \include{grammar} diff --git a/doc/userman/preprocessing.tex b/doc/userman/preprocessing.tex index d414fe56..0b53ee57 100644 --- a/doc/userman/preprocessing.tex +++ b/doc/userman/preprocessing.tex @@ -9,13 +9,13 @@ preprocessing within \acslpp annotations itself. As a refresher, the C/C++ preprocessor (CPP) (cf. \url{https://gcc.gnu.org/onlinedocs/cpp/}) conceptually implements the following operations on a C++ source file: \begin{itemize} - \item The input is read and broken it into a sequence of physical lines according to the line terminators (ASCII character sequences \verb|\\r|, \verb|\\n|, or \verb|\\r\\n|). - \item Each C trigraph is replaced by its corresponding character. + \item The input is read and broken into a sequence of physical lines according to the line terminators (ASCII character sequences \verb|\r|, \verb|\n|, or \verb|\r\n|). + \item Each C trigraph is replaced by its corresponding character (if trigraph processing is enabled). \item Any backslash-whitespace-line-terminator sequence is removed and the line that it ends is combined with the following line. \item Comments are replaced by single spaces. This requires tokenizing the input to avoid recognizing comment markers within strings as indicating a comment. Note that this allows block comments to hide line terminations. \item The input text is divided into preprocessing tokens grouped in logical lines. Each preprocessor token becomes a compiler token (except where \#\# concatenation occurs). However, \acslb tokens are slightly different, as described below. - \item The source text is transformed according to any preprocessing directives contained within it. Each preprocessing directive must be contained within one logical line - \item adjacent string literals (separated only by white-space or line-endings) are concatenated into a single string literal. + \item The source text is transformed according to any preprocessing directives contained within it. Each preprocessing directive must be contained within one logical line. The result has no preprocessing directives remaining. + \item Adjacent string literals (separated only by white-space or line-endings) are concatenated into a single string literal. \end{itemize} The result is a sequence of preprocessing tokens that is passed on to the @@ -23,22 +23,22 @@ remaining compiler phases. \subsection{Trigraphs} -TBD - are these supported by clang? +Trigraphs are defined for C++ but will currently be removed in C++17. Since trigraph processing by clang occurs before any recognition of comments, trigraphs in \acslpp annotations are translated. As they will be removed from C++, they are not recommended for use in \acslpp annotations. \subsection{Digraphs} -Digraphs are alternate spellings of preprocessort tokens, in particular, of -punctuation character sequences. No digraphs are defined in \acslpp for use -in \acslpp annotations. +Digraphs are alternate spellings of preprocessor tokens, in particular, of +punctuation character sequences. Digraphs in \acslpp annotations are translated just as they are in C++ (by clang). \subsection{Preprocessor tokens} Preprocessor tokens (per CPP) belong to one of five categories. White space (space, tab) serves only to separate tokens; it is not needed between tokens whose concatenation is not a single token. Line terminators also separate tokens and also delineate certain features: preprocessing directives and string literals do not extend over more than one (logical) line. \begin{itemize} \item identifiers: character sequences that match the regular expression \\ -\centerline{\texttt{[a-zA-Z\_][a-zA-Z\_0-9]*}} .\\ Dollar signs are allowed as non-digit identifier characters if the clang option \texttt{-fdollars-in-identifiers} is used. (TBD - on by default?) +\centerline{\texttt{[a-zA-Z\_][a-zA-Z\_0-9]*}} \\ Dollar signs are allowed as non-digit identifier characters if the clang option \texttt{-fdollars-in-identifiers} is enabled. +Enable with \texttt{-fdollars-in-identifiers} ; disable with \texttt{-fno-dollars-in-identifiers} ; enabled by default. \item number: character sequences that match the regular expression \\ -\centerline{\texttt{[.]?[0-9]([0-9a-zA-Z.]|[eEpP][+-]))*}} . \\ -(TBD - dollar signs also allowed?) - \item string literals: character sequence enclosed in " " or ' ' or < >, with \textbackslash " for " in a double-quoted literal (that is not a header file name) and \' for ' in a single-quoted literal. +\centerline{\texttt{[.]?[0-9]([0-9a-zA-Z.]|[eEpP][+-]))*}} \\ + + \item string literals: character sequence enclosed in " " or ' ' or < >, with \textbackslash " for " in a double-quoted literal (that is not a header file name) and \' for ' in a single-quoted literal. After all other preprocessor phases are complete, the preprocessort concatenates adjacent string literals. \item punctuator: all single non-alphanumeric printable characters except \verb|@|, \verb|#| and \verb|`|, and all multi-punctuation sequences that meaningful to C/C++ (e.g. > and >{>}= ), but not an arbitrary multi-punctuation character sequence. \item other tokens: \verb|@|, \verb|#|, \verb|`| and all non-ASCII single characters. @@ -46,18 +46,16 @@ Preprocessor tokens (per CPP) belong to one of five categories. White space (spa The above token definitions imply that arbitrary text can always be broken into legitimate preprocessor tokens, with the exception of a few characters and of badly formed unicode sequences. -TODO - unicode characters Note that not all preprocessor tokens are valid C/C++ parser tokens. Tokens in the other category have no meaning to C/C++ and the \texttt{number} category allows many sequences that are not legal C/C++ numeric tokens. These tokens will generally provoke compiler errors. For example in C/C++, 0..2 is one token and is not interpreted as two consecutive numeric tokens. \acsl and \acslpp have slightly different tokens than the above, so the preprocessor tokens need to be re-tokenized in some cases: \begin{itemize} - \item The \verb|@| token is a whitespace character in \acslb. + \item The \verb|@| token is a whitespace character in the interior of a \acslb annotation. \item There are some \acslb multi-character punctuator tokens that are not single preprocessor tokens: \begin{itemize} - \item[] all \acslb keywords that begin with a backslash, such as - \verb|\result|. + \item[] all \acslb keywords that begin with a backslash, such as \verb|\result|. \item \verb|==>| (logical implies) \item \verb|-->| (bit-wise implies) \item \verb|<==>| (logical equality) @@ -89,15 +87,15 @@ The preprocessing language contains a fixed set of preprocessing directive ident \end{itemize} In addition, identifiers that have been defined (by a \#define directive) as macros are expanded according to the macro expansion rules (not described here). -Now \acslb annotations are contained in C/C++ comments. -Consequently, any directives contained in the annotation are not seen when the source file is processed simply as a C/C++ source file. However, the effect of some directives lasts until the end of the source file. +Because \acslb annotations are contained in C/C++ comments, +any directives contained in the annotation are not seen when the source file is processed simply as a C/C++ source file. However, the effect of some directives lasts until the end of the source file. Accordingly, \acslpp imposes constraints on the directives that may be present within annotations: \begin{itemize} \item \texttt{define} and \texttt{undef} are not permitted in an annotation \item \texttt{if}, \texttt{ifdef}, \texttt{ifndef}, \texttt{elif}, \texttt{else}, \texttt{endif} are permitted but must be completely nested within the annotation in which they appear (an \texttt{endif} and its corresponding \texttt{if}, \texttt{ifdef}, or \texttt{ifndef} directive must be in the same annotation comment.) \item \texttt{warning} and \texttt{error} are permitted \item \texttt{include} is permitted, but will cause errors if it contains, as is almost always the case, either \texttt{define} or \texttt{pragma} directives - \item \texttt{line} - TBD + \item \texttt{line} is not permitted \item \texttt{pragma} is not permitted \end{itemize} -- GitLab