      1 .\"	$OpenBSD: flex.1,v 1.37 2014/03/23 16:28:29 jmc Exp $
      2 .\"
      3 .\" Copyright (c) 1990 The Regents of the University of California.
      4 .\" All rights reserved.
      5 .\"
      6 .\" This code is derived from software contributed to Berkeley by
      7 .\" Vern Paxson.
      8 .\"
      9 .\" The United States Government has rights in this work pursuant
     10 .\" to contract no. DE-AC03-76SF00098 between the United States
     11 .\" Department of Energy and the University of California.
     12 .\"
     13 .\" Redistribution and use in source and binary forms, with or without
     14 .\" modification, are permitted provided that the following conditions
     15 .\" are met:
     16 .\"
     17 .\" 1. Redistributions of source code must retain the above copyright
     18 .\"    notice, this list of conditions and the following disclaimer.
     19 .\" 2. Redistributions in binary form must reproduce the above copyright
     20 .\"    notice, this list of conditions and the following disclaimer in the
     21 .\"    documentation and/or other materials provided with the distribution.
     22 .\"
     23 .\" Neither the name of the University nor the names of its contributors
     24 .\" may be used to endorse or promote products derived from this software
     25 .\" without specific prior written permission.
     26 .\"
     27 .\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
     28 .\" IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
     29 .\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
     30 .\" PURPOSE.
     31 .\"
     32 .Dd $Mdocdate: March 23 2014 $
     33 .Dt FLEX 1
     34 .Os
     35 .Sh NAME
     36 .Nm flex
     37 .Nd fast lexical analyzer generator
     38 .Sh SYNOPSIS
     39 .Nm
     40 .Bk -words
     41 .Op Fl 78BbdFfhIiLlnpsTtVvw+?
     42 .Op Fl C Ns Op Cm aeFfmr
     43 .Op Fl Fl help
     44 .Op Fl Fl version
     45 .Op Fl o Ns Ar output
     46 .Op Fl P Ns Ar prefix
     47 .Op Fl S Ns Ar skeleton
     48 .Op Ar
     49 .Ek
     50 .Sh DESCRIPTION
     51 .Nm
     52 is a tool for generating
     53 .Em scanners :
     54 programs which recognize lexical patterns in text.
     55 .Nm
     56 reads the given input files, or its standard input if no file names are given,
     57 for a description of a scanner to generate.
     58 The description is in the form of pairs of regular expressions and C code,
     59 called
     60 .Em rules .
     61 .Nm
     62 generates as output a C source file,
     63 .Pa lex.yy.c ,
     64 which defines a routine
     65 .Fn yylex .
     66 This file is compiled and linked with the
     67 .Fl lfl
     68 library to produce an executable.
     69 When the executable is run, it analyzes its input for occurrences
     70 of the regular expressions.
     71 Whenever it finds one, it executes the corresponding C code.
     72 .Pp
     73 The manual includes both tutorial and reference sections:
     74 .Bl -ohang
     75 .It Sy Some Simple Examples
     76 .It Sy Format of the Input File
     77 .It Sy Patterns
     78 The extended regular expressions used by
     79 .Nm .
     80 .It Sy How the Input is Matched
     81 The rules for determining what has been matched.
     82 .It Sy Actions
     83 How to specify what to do when a pattern is matched.
     84 .It Sy The Generated Scanner
     85 Details regarding the scanner that
     86 .Nm
     87 produces;
     88 how to control the input source.
     89 .It Sy Start Conditions
     90 Introducing context into scanners, and managing
     91 .Qq mini-scanners .
     92 .It Sy Multiple Input Buffers
     93 How to manipulate multiple input sources;
     94 how to scan from strings instead of files.
     95 .It Sy End-of-File Rules
     96 Special rules for matching the end of the input.
     97 .It Sy Miscellaneous Macros
     98 A summary of macros available to the actions.
     99 .It Sy Values Available to the User
    100 A summary of values available to the actions.
    101 .It Sy Interfacing with Yacc
    102 Connecting flex scanners together with
    103 .Xr yacc 1
    104 parsers.
    105 .It Sy Options
    106 .Nm
    107 command-line options, and the
    108 .Dq %option
    109 directive.
    110 .It Sy Performance Considerations
    111 How to make scanners go as fast as possible.
    112 .It Sy Generating C++ Scanners
    113 The
    114 .Pq experimental
    115 facility for generating C++ scanner classes.
    116 .It Sy Incompatibilities with Lex and POSIX
    117 How
    118 .Nm
    119 differs from
    120 .At
    121 .Nm lex
    122 and the
    123 .Tn POSIX
    124 .Nm lex
    125 standard.
    126 .It Sy Files
    127 Files used by
    128 .Nm .
    129 .It Sy Diagnostics
    130 Those error messages produced by
    131 .Nm
    132 .Pq or scanners it generates
    133 whose meanings might not be apparent.
    134 .It Sy See Also
    135 Other documentation, related tools.
    136 .It Sy Authors
    137 Includes contact information.
    138 .It Sy Bugs
    139 Known problems with
    140 .Nm .
    141 .El
    142 .Sh SOME SIMPLE EXAMPLES
    143 Here are some simple examples to give the flavor of how one uses
    144 .Nm .
    145 The following
    146 .Nm
    147 input specifies a scanner which, whenever it encounters the string
    148 .Qq username ,
    149 replaces it with the user's login name:
    150 .Bd -literal -offset indent
    151 %%
    152 username    printf("%s", getlogin());
    153 .Ed
    154 .Pp
    155 By default, any text not matched by a
    156 .Nm
    157 scanner is copied to the output, so the net effect of this scanner is
    158 to copy its input file to its output with each occurrence of
    159 .Qq username
    160 expanded.
    161 In this input, there is just one rule.
    162 .Qq username
    163 is the
    164 .Em pattern
    165 and the
    166 .Qq printf
    167 is the
    168 .Em action .
    169 The
    170 .Qq %%
    171 marks the beginning of the rules.
    172 .Pp
    173 Here's another simple example:
    174 .Bd -literal -offset indent
    175 %{
    176 int num_lines = 0, num_chars = 0;
    177 %}
    178 
    179 %%
    180 \en      ++num_lines; ++num_chars;
    181 \&.       ++num_chars;
    182 
    183 %%
    184 int main(void)
    185 {
    186 	yylex();
    187 	printf("# of lines = %d, # of chars = %d\en",
    188             num_lines, num_chars);
    189 }
    190 .Ed
    191 .Pp
    192 This scanner counts the number of characters and the number
    193 of lines in its input
    194 (it produces no output other than the final report on the counts).
    195 The first line declares two globals,
    196 .Qq num_lines
    197 and
    198 .Qq num_chars ,
    199 which are accessible both inside
    200 .Fn yylex
    201 and in the
    202 .Fn main
    203 routine declared after the second
    204 .Qq %% .
    205 There are two rules, one which matches a newline
    206 .Pq \&"\en\&"
    207 and increments both the line count and the character count,
    208 and one which matches any character other than a newline
    209 (indicated by the
    210 .Qq \&.
    211 regular expression).
    212 .Pp
    213 A somewhat more complicated example:
    214 .Bd -literal -offset indent
    215 /* scanner for a toy Pascal-like language */
    216 
    217 %{
    218 /* need this for the call to atof() below */
    219 #include <stdlib.h>
    220 %}
    221 
    222 DIGIT    [0-9]
    223 ID       [a-z][a-z0-9]*
    224 
    225 %%
    226 
    227 {DIGIT}+ {
    228         printf("An integer: %s (%d)\en", yytext,
    229             atoi(yytext));
    230 }
    231 
    232 {DIGIT}+"."{DIGIT}* {
    233         printf("A float: %s (%g)\en", yytext,
    234             atof(yytext));
    235 }
    236 
    237 if|then|begin|end|procedure|function {
    238         printf("A keyword: %s\en", yytext);
    239 }
    240 
    241 {ID}    printf("An identifier: %s\en", yytext);
    242 
    243 "+"|"-"|"*"|"/"   printf("An operator: %s\en", yytext);
    244 
    245 "{"[^}\en]*"}"     /* eat up one-line comments */
    246 
    247 [ \et\en]+          /* eat up whitespace */
    248 
    249 \&.       printf("Unrecognized character: %s\en", yytext);
    250 
    251 %%
    252 
    253 int main(int argc, char *argv[])
    254 {
    255         ++argv; --argc;  /* skip over program name */
    256         if (argc > 0)
    257                 yyin = fopen(argv[0], "r");
    258         else
    259                 yyin = stdin;
    260 
    261         yylex();
    262 }
    263 .Ed
    264 .Pp
    265 This is the beginning of a simple scanner for a language like Pascal.
    266 It identifies different types of
    267 .Em tokens
    268 and reports on what it has seen.
    269 .Pp
    270 The details of this example will be explained in the following sections.
    271 .Sh FORMAT OF THE INPUT FILE
    272 The
    273 .Nm
    274 input file consists of three sections, separated by a line with just
    275 .Qq %%
    276 in it:
    277 .Bd -unfilled -offset indent
    278 definitions
    279 %%
    280 rules
    281 %%
    282 user code
    283 .Ed
    284 .Pp
    285 The
    286 .Em definitions
    287 section contains declarations of simple
    288 .Em name
    289 definitions to simplify the scanner specification, and declarations of
    290 .Em start conditions ,
    291 which are explained in a later section.
    292 .Pp
    293 Name definitions have the form:
    294 .Pp
    295 .D1 name definition
    296 .Pp
    297 The
    298 .Qq name
    299 is a word beginning with a letter or an underscore
    300 .Pq Sq _
    301 followed by zero or more letters, digits,
    302 .Sq _ ,
    303 or
    304 .Sq -
    305 .Pq dash .
    306 The definition is taken to begin at the first non-whitespace character
    307 following the name and continuing to the end of the line.
    308 The definition can subsequently be referred to using
    309 .Qq {name} ,
    310 which will expand to
    311 .Qq (definition) .
    312 For example:
    313 .Bd -literal -offset indent
    314 DIGIT    [0-9]
    315 ID       [a-z][a-z0-9]*
    316 .Ed
    317 .Pp
    318 This defines
    319 .Qq DIGIT
    320 to be a regular expression which matches a single digit, and
    321 .Qq ID
    322 to be a regular expression which matches a letter
    323 followed by zero-or-more letters-or-digits.
    324 A subsequent reference to
    325 .Pp
    326 .Dl {DIGIT}+"."{DIGIT}*
    327 .Pp
    328 is identical to
    329 .Pp
    330 .Dl ([0-9])+"."([0-9])*
    331 .Pp
    332 and matches one-or-more digits followed by a
    333 .Sq .\&
    334 followed by zero-or-more digits.
    335 .Pp
    336 The
    337 .Em rules
    338 section of the
    339 .Nm
    340 input contains a series of rules of the form:
    341 .Pp
    342 .Dl pattern	action
    343 .Pp
    344 The pattern must be unindented and the action must begin
    345 on the same line.
    346 .Pp
    347 See below for a further description of patterns and actions.
    348 .Pp
    349 Finally, the user code section is simply copied to
    350 .Pa lex.yy.c
    351 verbatim.
    352 It is used for companion routines which call or are called by the scanner.
    353 The presence of this section is optional;
    354 if it is missing, the second
    355 .Qq %%
    356 in the input file may be skipped too.
    357 .Pp
    358 In the definitions and rules sections, any indented text or text enclosed in
    359 .Sq %{
    360 and
    361 .Sq %}
    362 is copied verbatim to the output
    363 .Pq with the %{}'s removed .
    364 The %{}'s must appear unindented on lines by themselves.
    365 .Pp
    366 In the rules section,
    367 any indented or %{} text appearing before the first rule may be used to
    368 declare variables which are local to the scanning routine and
    369 .Pq after the declarations
    370 code which is to be executed whenever the scanning routine is entered.
    371 Other indented or %{} text in the rule section is still copied to the output,
    372 but its meaning is not well-defined and it may well cause compile-time
    373 errors (this feature is present for
    374 .Tn POSIX
    375 compliance; see below for other such features).
    376 .Pp
    377 In the definitions section
    378 .Pq but not in the rules section ,
    379 an unindented comment
    380 (i.e., a line beginning with
    381 .Qq /* )
    382 is also copied verbatim to the output up to the next
    383 .Qq */ .
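.Pp
As a rough sketch of how these pieces fit together,
the following input uses a definitions-section comment, a
.Sq %{
block, a name definition, and an indented declaration placed
before the first rule.
The names
.Qq seen
and
.Qq WORD
are chosen here purely for illustration:
.Bd -literal -offset indent
/* copied verbatim: an unindented comment in the definitions */
%{
#include <stdio.h>
%}
WORD    [a-z]+

%%
        int seen = 0;   /* local to yylex(), set on each entry */

{WORD}  printf("word %d: %s\en", ++seen, yytext);
\&.|\en    /* discard everything else */
.Ed
.Pp
Because
.Qq seen
is local to
.Fn yylex
and no action returns, a single call to
.Fn yylex
numbers every word in the input.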
    384 .Sh PATTERNS
    385 The patterns in the input are written using an extended set of regular
    386 expressions.
    387 These are:
    388 .Bl -tag -width "XXXXXXXX"
    389 .It x
    390 Match the character
    391 .Sq x .
    392 .It .\&
    393 Any character
    394 .Pq byte
    395 except newline.
    396 .It [xyz]
    397 A
    398 .Qq character class ;
    399 in this case, the pattern matches either an
    400 .Sq x ,
    401 a
    402 .Sq y ,
    403 or a
    404 .Sq z .
    405 .It [abj-oZ]
    406 A
    407 .Qq character class
    408 with a range in it; matches an
    409 .Sq a ,
    410 a
    411 .Sq b ,
    412 any letter from
    413 .Sq j
    414 through
    415 .Sq o ,
    416 or a
    417 .Sq Z .
    418 .It [^A-Z]
    419 A
    420 .Qq negated character class ,
    421 i.e., any character but those in the class.
    422 In this case, any character EXCEPT an uppercase letter.
    423 .It [^A-Z\en]
    424 Any character EXCEPT an uppercase letter or a newline.
    425 .It r*
    426 Zero or more r's, where
    427 .Sq r
    428 is any regular expression.
    429 .It r+
    430 One or more r's.
    431 .It r?
    432 Zero or one r's (that is,
    433 .Qq an optional r ) .
    434 .It r{2,5}
    435 Anywhere from two to five r's.
    436 .It r{2,}
    437 Two or more r's.
    438 .It r{4}
    439 Exactly 4 r's.
    440 .It {name}
    441 The expansion of the
    442 .Qq name
    443 definition
    444 .Pq see above .
    445 .It \&"[xyz]\e\&"foo\&"
    446 The literal string: [xyz]"foo.
    447 .It \eX
    448 If
    449 .Sq X
    450 is an
    451 .Sq a ,
    452 .Sq b ,
    453 .Sq f ,
    454 .Sq n ,
    455 .Sq r ,
    456 .Sq t ,
    457 or
    458 .Sq v ,
    459 then the ANSI-C interpretation of
    460 .Sq \eX .
    461 Otherwise, a literal
    462 .Sq X
    463 (used to escape operators such as
    464 .Sq * ) .
    465 .It \e0
    466 A NUL character
    467 .Pq ASCII code 0 .
    468 .It \e123
    469 The character with octal value 123.
    470 .It \ex2a
    471 The character with hexadecimal value 2a.
    472 .It (r)
    473 Match an
    474 .Sq r ;
    475 parentheses are used to override precedence
    476 .Pq see below .
    477 .It rs
    478 The regular expression
    479 .Sq r
    480 followed by the regular expression
    481 .Sq s ;
    482 called
    483 .Qq concatenation .
    484 .It r|s
    485 Either an
    486 .Sq r
    487 or an
    488 .Sq s .
    489 .It r/s
    490 An
    491 .Sq r ,
    492 but only if it is followed by an
    493 .Sq s .
    494 The text matched by
    495 .Sq s
    496 is included when determining whether this rule is the
    497 .Qq longest match ,
    498 but is then returned to the input before the action is executed.
    499 So the action only sees the text matched by
    500 .Sq r .
    501 This type of pattern is called
    502 .Qq trailing context .
    503 (There are some combinations of r/s that
    504 .Nm
    505 cannot match correctly; see notes in the
    506 .Sx BUGS
    507 section below regarding
    508 .Qq dangerous trailing context . )
    509 .It ^r
    510 An
    511 .Sq r ,
    512 but only at the beginning of a line
    513 (i.e., just starting to scan, or right after a newline has been scanned).
    514 .It r$
    515 An
    516 .Sq r ,
    517 but only at the end of a line
    518 .Pq i.e., just before a newline .
    519 Equivalent to
    520 .Qq r/\en .
    521 .Pp
    522 Note that
    523 .Nm flex Ns 's
    524 notion of
    525 .Qq newline
    526 is exactly whatever the C compiler used to compile
    527 .Nm
    528 interprets
    529 .Sq \en
    530 as.
    531 .\" In particular, on some DOS systems you must either filter out \er's in the
    532 .\" input yourself, or explicitly use r/\er\en for
    533 .\" .Qq r$ .
    534 .It <s>r
    535 An
    536 .Sq r ,
    537 but only in start condition
    538 .Sq s
    539 .Pq see below for discussion of start conditions .
    540 .It <s1,s2,s3>r
    541 The same, but in any of start conditions s1, s2, or s3.
    542 .It <*>r
    543 An
    544 .Sq r
    545 in any start condition, even an exclusive one.
    546 .It <<EOF>>
    547 An end-of-file.
    548 .It <s1,s2><<EOF>>
    549 An end-of-file when in start condition s1 or s2.
    550 .El
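.Pp
As a small illustration of trailing context from the list above,
the following sketch brackets a run of digits only when it is immediately
followed by
.Qq px ;
the unit itself is left in the input and handled by the default rule:
.Bd -literal -offset indent
%%
[0-9]+/px    printf("<%s>", yytext);
.Ed
.Pp
Given the input
.Qq 12px 34 ,
this scanner should write
.Qq <12>px 34 ,
since
.Fa yytext
holds only the text matched by the part of the pattern before the
.Sq / .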
    551 .Pp
    552 Note that inside of a character class, all regular expression operators
    553 lose their special meaning except escape
    554 .Pq Sq \e
    555 and the character class operators,
    556 .Sq - ,
    557 .Sq ]\& ,
    558 and, at the beginning of the class,
    559 .Sq ^ .
    560 .Pp
    561 The regular expressions listed above are grouped according to
    562 precedence, from highest precedence at the top to lowest at the bottom.
    563 Those grouped together have equal precedence.
    564 For example,
    565 .Pp
    566 .D1 foo|bar*
    567 .Pp
    568 is the same as
    569 .Pp
    570 .D1 (foo)|(ba(r*))
    571 .Pp
    572 since the
    573 .Sq *
    574 operator has higher precedence than concatenation,
    575 and concatenation higher than alternation
    576 .Pq Sq |\& .
    577 This pattern therefore matches
    578 .Em either
    579 the string
    580 .Qq foo
    581 .Em or
    582 the string
    583 .Qq ba
    584 followed by zero-or-more r's.
    585 To match
    586 .Qq foo
    587 or zero-or-more "bar"'s,
    588 use:
    589 .Pp
    590 .D1 foo|(bar)*
    591 .Pp
    592 and to match zero-or-more "foo"'s-or-"bar"'s:
    593 .Pp
    594 .D1 (foo|bar)*
    595 .Pp
    596 In addition to characters and ranges of characters, character classes
    597 can also contain character class
    598 .Em expressions .
    599 These are expressions enclosed inside
    600 .Sq [:
    601 and
    602 .Sq :]
    603 delimiters (which themselves must appear between the
    604 .Sq \&[
    605 and
    606 .Sq ]\&
    607 of the
    608 character class; other elements may occur inside the character class, too).
    609 The valid expressions are:
    610 .Bd -unfilled -offset indent
    611 [:alnum:] [:alpha:] [:blank:]
    612 [:cntrl:] [:digit:] [:graph:]
    613 [:lower:] [:print:] [:punct:]
    614 [:space:] [:upper:] [:xdigit:]
    615 .Ed
    616 .Pp
    617 These expressions all designate a set of characters equivalent to
    618 the corresponding standard C
    619 .Fn isXXX
    620 function.
    621 For example, [:alnum:] designates those characters for which
    622 .Xr isalnum 3
    623 returns true \- i.e., any alphabetic or numeric character.
    624 Some systems don't provide
    625 .Xr isblank 3 ,
    626 so
    627 .Nm
    628 defines [:blank:] as a blank or a tab.
    629 .Pp
    630 For example, the following character classes are all equivalent:
    631 .Bd -unfilled -offset indent
    632 [[:alnum:]]
    633 [[:alpha:][:digit:]]
    634 [[:alpha:]0-9]
    635 [a-zA-Z0-9]
    636 .Ed
    637 .Pp
    638 If the scanner is case-insensitive (the
    639 .Fl i
    640 flag), then [:upper:] and [:lower:] are equivalent to [:alpha:].
    641 .Pp
    642 Some notes on patterns:
    643 .Bl -dash
    644 .It
    645 A negated character class such as the example
    646 .Qq [^A-Z]
    647 above will match a newline unless "\en"
    648 .Pq or an equivalent escape sequence
    649 is one of the characters explicitly present in the negated character class
    650 (e.g.,
    651 .Qq [^A-Z\en] ) .
    652 This is unlike how many other regular expression tools treat negated character
    653 classes, but unfortunately the inconsistency is historically entrenched.
    654 Matching newlines means that a pattern like
    655 .Qq [^"]*
    656 can match the entire input unless there's another quote in the input.
    657 .It
    658 A rule can have at most one instance of trailing context
    659 (the
    660 .Sq /
    661 operator or the
    662 .Sq $
    663 operator).
    664 The start condition,
    665 .Sq ^ ,
    666 and
    667 .Qq <<EOF>>
    668 patterns can only occur at the beginning of a pattern and, like
    669 .Sq /
    670 and
    671 .Sq $ ,
    672 cannot be grouped inside parentheses.
    673 A
    674 .Sq ^
    675 which does not occur at the beginning of a rule or a
    676 .Sq $
    677 which does not occur at the end of a rule loses its special properties
    678 and is treated as a normal character.
    679 .It
    680 The following are illegal:
    681 .Bd -unfilled -offset indent
    682 foo/bar$
    683 <sc1>foo<sc2>bar
    684 .Ed
    685 .Pp
    686 Note that the first of these can be written
    687 .Qq foo/bar\en .
    688 .It
    689 The following will result in
    690 .Sq $
    691 or
    692 .Sq ^
    693 being treated as a normal character:
    694 .Bd -unfilled -offset indent
    695 foo|(bar$)
    696 foo|^bar
    697 .Ed
    698 .Pp
    699 If what's wanted is a
    700 .Qq foo
    701 or a bar-followed-by-a-newline, the following could be used
    702 (the special
    703 .Sq |\&
    704 action is explained below):
    705 .Bd -unfilled -offset indent
    706 foo      |
    707 bar$     /* action goes here */
    708 .Ed
    709 .Pp
    710 A similar trick will work for matching a foo or a
    711 bar-at-the-beginning-of-a-line.
    712 .El
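.Pp
The analogous arrangement for a foo or a bar at the beginning of a line,
mentioned in the last note above, might look like this:
.Bd -unfilled -offset indent
foo      |
^bar     /* action goes here */
.Ed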
    713 .Sh HOW THE INPUT IS MATCHED
    714 When the generated scanner is run,
    715 it analyzes its input looking for strings which match any of its patterns.
    716 If it finds more than one match,
    717 it takes the one matching the most text
    718 (for trailing context rules, this includes the length of the trailing part,
    719 even though it will then be returned to the input).
    720 If it finds two or more matches of the same length,
    721 the rule listed first in the
    722 .Nm
    723 input file is chosen.
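.Pp
A minimal sketch of these two tie-breaking rules, using an arbitrary keyword
for illustration:
.Bd -literal -offset indent
%%
begin       printf("keyword\en");
[a-z]+      printf("identifier\en");
.Ed
.Pp
On the input
.Qq begin
both patterns match five characters, so the first rule is chosen and
the scanner reports a keyword.
On the input
.Qq beginning
the second pattern matches more text, so the scanner reports an identifier.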
    724 .Pp
    725 Once the match is determined, the text corresponding to the match
    726 (called the
    727 .Em token )
    728 is made available in the global character pointer
    729 .Fa yytext ,
    730 and its length in the global integer
    731 .Fa yyleng .
    732 The
    733 .Em action
    734 corresponding to the matched pattern is then executed
    735 .Pq a more detailed description of actions follows ,
    736 and then the remaining input is scanned for another match.
    737 .Pp
    738 If no match is found, then the default rule is executed:
    739 the next character in the input is considered matched and
    740 copied to the standard output.
    741 Thus, the simplest legal
    742 .Nm
    743 input is:
    744 .Pp
    745 .D1 %%
    746 .Pp
    747 which generates a scanner that simply copies its input
    748 .Pq one character at a time
    749 to its output.
    750 .Pp
    751 Note that
    752 .Fa yytext
    753 can be defined in two different ways:
    754 either as a character pointer or as a character array.
    755 Which definition
    756 .Nm
    757 uses can be controlled by including one of the special directives
    758 .Dq %pointer
    759 or
    760 .Dq %array
    761 in the first
    762 .Pq definitions
    763 section of flex input.
    764 The default is
    765 .Dq %pointer ,
    766 unless the
    767 .Fl l
    768 .Nm lex
    769 compatibility option is used, in which case
    770 .Fa yytext
    771 will be an array.
    772 The advantage of using
    773 .Dq %pointer
    774 is substantially faster scanning and no buffer overflow when matching
    775 very large tokens
    776 .Pq unless dynamic memory runs out .
    777 The disadvantage is that actions are restricted in how they can modify
    778 .Fa yytext
    779 .Pq see the next section ,
    780 and calls to the
    781 .Fn unput
    782 function destroy the present contents of
    783 .Fa yytext ,
    784 which can be a considerable porting headache when moving between different
    785 .Nm lex
    786 versions.
    787 .Pp
    788 The advantage of
    789 .Dq %array
    790 is that
    791 .Fa yytext
    792 can be modified as much as wanted, and calls to
    793 .Fn unput
    794 do not destroy
    795 .Fa yytext
    796 .Pq see below .
    797 Furthermore, existing
    798 .Nm lex
    799 programs sometimes access
    800 .Fa yytext
    801 externally using declarations of the form:
    802 .Pp
    803 .D1 extern char yytext[];
    804 .Pp
    805 This definition is erroneous when used with
    806 .Dq %pointer ,
    807 but correct for
    808 .Dq %array .
    809 .Pp
    810 .Dq %array
    811 defines
    812 .Fa yytext
    813 to be an array of
    814 .Dv YYLMAX
    815 characters, which defaults to a fairly large value.
    816 The size can be changed by simply #define'ing
    817 .Dv YYLMAX
    818 to a different value in the first section of
    819 .Nm
    820 input.
    821 As mentioned above, with
    822 .Dq %pointer
    823 yytext grows dynamically to accommodate large tokens.
    824 While this means a
    825 .Dq %pointer
    826 scanner can accommodate very large tokens
    827 .Pq such as matching entire blocks of comments ,
    828 bear in mind that each time the scanner must resize
    829 .Fa yytext
    830 it also must rescan the entire token from the beginning, so matching such
    831 tokens can prove slow.
    832 .Fa yytext
    833 presently does not dynamically grow if a call to
    834 .Fn unput
    835 results in too much text being pushed back; instead, a run-time error results.
    836 .Pp
    837 Also note that
    838 .Dq %array
    839 cannot be used with C++ scanner classes
    840 .Pq the c++ option; see below .
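.Pp
A minimal sketch of a definitions section that selects
.Dq %array
and enlarges the buffer;
the value 16384 is an arbitrary choice for illustration:
.Bd -literal -offset indent
%{
#define YYLMAX 16384    /* room for tokens up to about 16K */
%}
%array
%%
.Ed
.Pp
Another source file would then refer to the token text with
.Pp
.D1 extern char yytext[];
.Pp
whereas with the default
.Dq %pointer
the matching declaration would be
.Pp
.D1 extern char *yytext;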
    841 .Sh ACTIONS
    842 Each pattern in a rule has a corresponding action,
    843 which can be any arbitrary C statement.
    844 The pattern ends at the first non-escaped whitespace character;
    845 the remainder of the line is its action.
    846 If the action is empty,
    847 then when the pattern is matched the input token is simply discarded.
    848 For example, here is the specification for a program
    849 which deletes all occurrences of
    850 .Qq zap me
    851 from its input:
    852 .Bd -literal -offset indent
    853 %%
    854 "zap me"
    855 .Ed
    856 .Pp
    857 (It will copy all other characters in the input to the output since
    858 they will be matched by the default rule.)
    859 .Pp
    860 Here is a program which compresses multiple blanks and tabs down to
    861 a single blank, and throws away whitespace found at the end of a line:
    862 .Bd -literal -offset indent
    863 %%
    864 [ \et]+        putchar(' ');
    865 [ \et]+$       /* ignore this token */
    866 .Ed
    867 .Pp
    868 If the action contains a
    869 .Sq { ,
    870 then the action spans till the balancing
    871 .Sq }
    872 is found, and the action may cross multiple lines.
    873 .Nm
    874 knows about C strings and comments and won't be fooled by braces found
    875 within them, but also allows actions to begin with
    876 .Sq %{
    877 and will consider the action to be all the text up to the next
    878 .Sq %}
    879 .Pq regardless of ordinary braces inside the action .
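.Pp
For instance, a multi-line action might be sketched as follows;
the closing brace inside the string does not end the action:
.Bd -literal -offset indent
%%
[0-9]+  {
        /* the "}" printed below is part of a C string */
        printf("{%s}\en", yytext);
        }
.Ed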
    880 .Pp
    881 An action consisting solely of a vertical bar
    882 .Pq Sq |\&
    883 means
    884 .Qq same as the action for the next rule .
    885 See below for an illustration.
    886 .Pp
    887 Actions can include arbitrary C code,
    888 including return statements to return a value to whatever routine called
    889 .Fn yylex .
    890 Each time
    891 .Fn yylex
    892 is called, it continues processing tokens from where it last left off
    893 until it either reaches the end of the file or executes a return.
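.Pp
The following sketch shows one way a caller might drive such a scanner.
The token codes
.Dv NUMBER
and
.Dv WORD
are hypothetical values invented for this example;
the loop relies on
.Fn yylex
returning 0 once the end of the file is reached:
.Bd -literal -offset indent
%{
#include <stdio.h>
enum { NUMBER = 258, WORD };    /* hypothetical token codes */
%}

%%
[0-9]+      return NUMBER;
[a-z]+      return WORD;
\&.|\en      /* skip everything else */

%%
int
main(void)
{
        int tok;

        /* each call resumes scanning where the last one stopped */
        while ((tok = yylex()) != 0)
                printf("token %d: %s\en", tok, yytext);
        return 0;
}
.Ed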
    894 .Pp
    895 Actions are free to modify
    896 .Fa yytext
    897 except for lengthening it
    898 (adding characters to its end \- these will overwrite later characters in the
    899 input stream).
    900 This, however, does not apply when using
    901 .Dq %array
    902 .Pq see above ;
    903 in that case,
    904 .Fa yytext
    905 may be freely modified in any way.
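.Pp
As an illustration of a modification that is safe under either directive,
the following sketch rewrites
.Fa yytext
in place without lengthening it, converting each lower-case word
to upper case before echoing it:
.Bd -literal -offset indent
%{
#include <ctype.h>
%}

%%
[a-z]+  {
        int i;

        /* overwrite characters in place; the length is unchanged */
        for (i = 0; i < yyleng; i++)
                yytext[i] = toupper((unsigned char)yytext[i]);
        ECHO;
        }
.Ed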
    906 .Pp
    907 Actions are free to modify
    908 .Fa yyleng
    909 except they should not do so if the action also includes use of
    910 .Fn yymore
    911 .Pq see below .
    912 .Pp
    913 There are a number of special directives which can be included within
    914 an action:
    915 .Bl -tag -width Ds
    916 .It ECHO
    917 Copies
    918 .Fa yytext
    919 to the scanner's output.
    920 .It BEGIN
    921 Followed by the name of a start condition, places the scanner in the
    922 corresponding start condition
    923 .Pq see below .
    924 .It REJECT
    925 Directs the scanner to proceed on to the
    926 .Qq second best
    927 rule which matched the input
    928 .Pq or a prefix of the input .
    929 The rule is chosen as described above in
    930 .Sx HOW THE INPUT IS MATCHED ,
    931 and
    932 .Fa yytext
    933 and
    934 .Fa yyleng
    935 set up appropriately.
    936 It may either be one which matched as much text
    937 as the originally chosen rule but came later in the
    938 .Nm
    939 input file, or one which matched less text.
    940 For example, the following will both count the
    941 words in the input and call the routine
    942 .Fn special
    943 whenever
    944 .Qq frob
    945 is seen:
    946 .Bd -literal -offset indent
    947         int word_count = 0;
    948 %%
    949 
    950 frob        special(); REJECT;
    951 [^ \et\en]+   ++word_count;
    952 .Ed
    953 .Pp
    954 Without the
    955 .Em REJECT ,
    956 any "frob"'s in the input would not be counted as words,
    957 since the scanner normally executes only one action per token.
    958 Multiple
    959 .Em REJECT Ns 's
    960 are allowed,
    961 each one finding the next best choice to the currently active rule.
    962 For example, when the following scanner scans the token
    963 .Qq abcd ,
    964 it will write
    965 .Qq abcdabcaba
    966 to the output:
    967 .Bd -literal -offset indent
    968 %%
    969 a        |
    970 ab       |
    971 abc      |
    972 abcd     ECHO; REJECT;
    973 \&.|\en     /* eat up any unmatched character */
    974 .Ed
    975 .Pp
    976 (The first three rules share the fourth's action since they use
    977 the special
    978 .Sq |\&
    979 action.)
    980 .Em REJECT
    981 is a particularly expensive feature in terms of scanner performance;
    982 if it is used in any of the scanner's actions it will slow down
    983 all of the scanner's matching.
    984 Furthermore,
    985 .Em REJECT
    986 cannot be used with the
    987 .Fl Cf
    988 or
    989 .Fl CF
    990 options
    991 .Pq see below .
    992 .Pp
    993 Note also that unlike the other special actions,
    994 .Em REJECT
    995 is a
    996 .Em branch ;
    997 code immediately following it in the action will not be executed.
    998 .It yymore()
    999 Tells the scanner that the next time it matches a rule, the corresponding
   1000 token should be appended onto the current value of
   1001 .Fa yytext
   1002 rather than replacing it.
   1003 For example, given the input
   1004 .Qq mega-kludge
   1005 the following will write
   1006 .Qq mega-mega-kludge
   1007 to the output:
   1008 .Bd -literal -offset indent
   1009 %%
   1010 mega-    ECHO; yymore();
   1011 kludge   ECHO;
   1012 .Ed
   1013 .Pp
   1014 First
   1015 .Qq mega-
   1016 is matched and echoed to the output.
   1017 Then
   1018 .Qq kludge
   1019 is matched, but the previous
   1020 .Qq mega-
   1021 is still hanging around at the beginning of
   1022 .Fa yytext
   1023 so the
   1024 .Em ECHO
   1025 for the
   1026 .Qq kludge
   1027 rule will actually write
   1028 .Qq mega-kludge .
   1029 .Pp
   1030 Two notes regarding use of
   1031 .Fn yymore :
   1032 First,
   1033 .Fn yymore
   1034 depends on the value of
   1035 .Fa yyleng
   1036 correctly reflecting the size of the current token, so
   1037 .Fa yyleng
   1038 must not be modified when using
   1039 .Fn yymore .
   1040 Second, the presence of
   1041 .Fn yymore
   1042 in the scanner's action entails a minor performance penalty in the
   1043 scanner's matching speed.
   1044 .It yyless(n)
   1045 Returns all but the first
   1046 .Ar n
   1047 characters of the current token back to the input stream, where they
   1048 will be rescanned when the scanner looks for the next match.
   1049 .Fa yytext
   1050 and
   1051 .Fa yyleng
   1052 are adjusted appropriately (e.g.,
   1053 .Fa yyleng
   1054 will now be equal to
   1055 .Ar n ) .
   1056 For example, on the input
   1057 .Qq foobar
   1058 the following will write out
   1059 .Qq foobarbar :
   1060 .Bd -literal -offset indent
   1061 %%
   1062 foobar    ECHO; yyless(3);
   1063 [a-z]+    ECHO;
   1064 .Ed
   1065 .Pp
   1066 An argument of 0 to
   1067 .Fa yyless
   1068 will cause the entire current input string to be scanned again.
   1069 Unless the way the scanner subsequently processes its input has been changed
   1070 (using
   1071 .Em BEGIN ,
   1072 for example),
   1073 this will result in an endless loop.
   1074 .Pp
   1075 Note that
   1076 .Fa yyless
   1077 is a macro and can only be used in the
   1078 .Nm
   1079 input file, not from other source files.
   1080 .It unput(c)
   1081 Puts the character
   1082 .Ar c
   1083 back into the input stream.
   1084 It will be the next character scanned.
   1085 The following action will take the current token and cause it
   1086 to be rescanned enclosed in parentheses.
   1087 .Bd -literal -offset indent
   1088 {
   1089         int i;
   1090         char *yycopy;
   1091 
   1092         /* Copy yytext because unput() trashes yytext */
   1093         if ((yycopy = strdup(yytext)) == NULL)
   1094                 err(1, NULL);
   1095         unput(')');
   1096         for (i = yyleng - 1; i >= 0; --i)
   1097                 unput(yycopy[i]);
   1098         unput('(');
   1099         free(yycopy);
   1100 }
   1101 .Ed
   1102 .Pp
   1103 Note that since each
   1104 .Fn unput
   1105 puts the given character back at the beginning of the input stream,
   1106 pushing back strings must be done back-to-front.
   1107 .Pp
   1108 An important potential problem when using
   1109 .Fn unput
   1110 is that if using
   1111 .Dq %pointer
   1112 .Pq the default ,
   1113 a call to
   1114 .Fn unput
   1115 destroys the contents of
   1116 .Fa yytext ,
   1117 starting with its rightmost character and devouring one character to
   1118 the left with each call.
   1119 If the value of
   1120 .Fa yytext
   1121 should be preserved after a call to
   1122 .Fn unput
   1123 .Pq as in the above example ,
   1124 it must either first be copied elsewhere, or the scanner must be built using
   1125 .Dq %array
   1126 instead (see
   1127 .Sx HOW THE INPUT IS MATCHED ) .
   1128 .Pp
   1129 Finally, note that EOF cannot be put back
   1130 to attempt to mark the input stream with an end-of-file.
   1131 .It input()
   1132 Reads the next character from the input stream.
   1133 For example, the following is one way to eat up C comments:
   1134 .Bd -literal -offset indent
   1135 %%
   1136 "/*" {
   1137         int c;
   1138 
   1139         for (;;) {
   1140                 while ((c = input()) != '*' && c != EOF)
   1141                         ; /* eat up text of comment */
   1142 
   1143                 if (c == '*') {
   1144                         while ((c = input()) == '*')
   1145                                 ;
   1146                         if (c == '/')
   1147                                 break; /* found the end */
   1148                 }
   1149 
   1150                 if (c == EOF) {
   1151                         errx(1, "EOF in comment");
   1152                         break;
   1153                 }
   1154         }
   1155 }
   1156 .Ed
   1157 .Pp
   1158 (Note that if the scanner is compiled using C++, then
   1159 .Fn input
   1160 is instead referred to as
   1161 .Fn yyinput ,
   1162 in order to avoid a name clash with the C++ stream by the name of input.)
   1163 .It YY_FLUSH_BUFFER
   1164 Flushes the scanner's internal buffer
   1165 so that the next time the scanner attempts to match a token,
   1166 it will first refill the buffer using
   1167 .Dv YY_INPUT
   1168 (see
   1169 .Sx THE GENERATED SCANNER ,
   1170 below).
   1171 This action is a special case of the more general
   1172 .Fn yy_flush_buffer
   1173 function, described below in the section
   1174 .Sx MULTIPLE INPUT BUFFERS .
   1175 .It yyterminate()
   1176 Can be used in lieu of a return statement in an action.
   1177 It terminates the scanner and returns a 0 to the scanner's caller, indicating
   1178 .Qq all done .
   1179 By default,
   1180 .Fn yyterminate
   1181 is also called when an end-of-file is encountered.
   1182 It is a macro and may be redefined.
   1183 .El
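.Pp
The back-to-front rule for unput() can be tried outside a generated scanner.
The following is a minimal sketch in plain C, not the code flex generates:
toy_unput(), toy_input() and push_parenthesized() are hypothetical names,
and a fixed-size LIFO array stands in for the scanner's input stream.

```c
#include <stddef.h>
#include <string.h>

#define PUSHBACK_MAX 256

/* Hypothetical stand-in for the scanner's input: a LIFO pushback stack. */
static char pushback[PUSHBACK_MAX];
static size_t pushback_len;

/* Like unput(c): c becomes the next character to be read. */
int
toy_unput(char c)
{
        if (pushback_len == PUSHBACK_MAX)
                return -1;
        pushback[pushback_len++] = c;
        return 0;
}

/* Like input(): returns the next character, or -1 when nothing is left. */
int
toy_input(void)
{
        if (pushback_len == 0)
                return -1;
        return (unsigned char)pushback[--pushback_len];
}

/* Push "(token)" back so that it is re-read in its original order. */
int
push_parenthesized(const char *token)
{
        size_t i = strlen(token);

        if (toy_unput(')') == -1)
                return -1;
        while (i > 0)
                if (toy_unput(token[--i]) == -1)
                        return -1;
        return toy_unput('(');
}
```

Because the closing parenthesis is pushed first and the opening one last,
successive calls to toy_input() return the token in its original order,
enclosed in parentheses.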
   1184 .Sh THE GENERATED SCANNER
   1185 The output of
   1186 .Nm
   1187 is the file
   1188 .Pa lex.yy.c ,
   1189 which contains the scanning routine
   1190 .Fn yylex ,
   1191 a number of tables used by it for matching tokens,
   1192 and a number of auxiliary routines and macros.
   1193 By default,
   1194 .Fn yylex
   1195 is declared as follows:
   1196 .Bd -unfilled -offset indent
   1197 int yylex()
   1198 {
   1199     ... various definitions and the actions in here ...
   1200 }
   1201 .Ed
   1202 .Pp
   1203 (If the environment supports function prototypes, then it will
   1204 be "int yylex(void)".)
   1205 This definition may be changed by defining the
   1206 .Dv YY_DECL
   1207 macro.
   1208 For example:
   1209 .Bd -literal -offset indent
   1210 #define YY_DECL float lexscan(a, b) float a, b;
   1211 .Ed
   1212 .Pp
   1213 would give the scanning routine the name
   1214 .Em lexscan ,
   1215 returning a float, and taking two floats as arguments.
   1216 Note that if arguments are given to the scanning routine using a
   1217 K&R-style/non-prototyped function declaration,
   1218 the definition must be terminated with a semi-colon
   1219 .Pq Sq ;\& .
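.Pp
With a prototype-style definition no trailing semicolon is needed.
The sketch below only shows how such a macro expands in plain C;
the function body is a hand-written placeholder, since in a real scanner
the body is generated by flex.

```c
/* Prototype-style counterpart of the K&R definition; no trailing ';'. */
#define YY_DECL float lexscan(float a, float b)

/* Forward declaration for callers, reusing the same macro. */
YY_DECL;

/*
 * Placeholder body: a generated scanner would supply the matching
 * code here.  It exists only to make the expansion visible.
 */
YY_DECL
{
        return a + b;
}
```

Reusing the macro for the forward declaration keeps the declaration
and the definition in step if the signature is changed later.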
   1220 .Pp
   1221 Whenever
   1222 .Fn yylex
   1223 is called, it scans tokens from the global input file
   1224 .Pa yyin
   1225 .Pq which defaults to stdin .
   1226 It continues until it either reaches an end-of-file
   1227 .Pq at which point it returns the value 0
   1228 or one of its actions executes a
   1229 .Em return
   1230 statement.
   1231 .Pp
   1232 If the scanner reaches an end-of-file, subsequent calls are undefined
   1233 unless either
   1234 .Em yyin
   1235 is pointed at a new input file
   1236 .Pq in which case scanning continues from that file ,
   1237 or
   1238 .Fn yyrestart
   1239 is called.
   1240 .Fn yyrestart
   1241 takes one argument, a
   1242 .Fa FILE *
   1243 pointer (which can be nil, if
   1244 .Dv YY_INPUT
   1245 has been set up to scan from a source other than
   1246 .Em yyin ) ,
   1247 and initializes
   1248 .Em yyin
   1249 for scanning from that file.
   1250 Essentially there is no difference between just assigning
   1251 .Em yyin
   1252 to a new input file or using
   1253 .Fn yyrestart
   1254 to do so; the latter is available for compatibility with previous versions of
   1255 .Nm ,
   1256 and because it can be used to switch input files in the middle of scanning.
   1257 It can also be used to throw away the current input buffer,
   1258 by calling it with an argument of
   1259 .Em yyin ;
   1260 but it is better to use
   1261 .Dv YY_FLUSH_BUFFER
   1262 .Pq see above .
   1263 Note that
   1264 .Fn yyrestart
   1265 does not reset the start condition to
   1266 .Em INITIAL
   1267 (see
   1268 .Sx START CONDITIONS ,
   1269 below).
   1270 .Pp
   1271 If
   1272 .Fn yylex
   1273 stops scanning due to executing a
   1274 .Em return
   1275 statement in one of the actions, the scanner may then be called again and it
   1276 will resume scanning where it left off.
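.Pp
The calling convention described above can be sketched with a toy,
hand-written stand-in for yylex().
The names toy_yylex(), toy_set_input() and count_tokens() and the token
codes are assumptions made for this illustration only;
the stand-in splits a string into blank-separated words.

```c
#include <ctype.h>

enum { TOK_EOF = 0, TOK_WORD = 1 };

static const char *cursor = "";  /* where the toy scanner resumes */

void
toy_set_input(const char *s)
{
        cursor = s;
}

/* Stand-in for yylex(): returns one token per call and 0 at the end. */
int
toy_yylex(void)
{
        while (*cursor != '\0' && isspace((unsigned char)*cursor))
                cursor++;
        if (*cursor == '\0')
                return TOK_EOF;
        while (*cursor != '\0' && !isspace((unsigned char)*cursor))
                cursor++;
        return TOK_WORD;
}

/* Typical caller: keep calling until the scanner reports end of input. */
int
count_tokens(const char *s)
{
        int n = 0;

        toy_set_input(s);
        while (toy_yylex() != TOK_EOF)
                n++;
        return n;
}
```

Each call picks up at the position where the previous call returned,
which is the behaviour a caller of a real scanner relies on.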
   1277 .Pp
   1278 By default
   1279 .Pq and for purposes of efficiency ,
   1280 the scanner uses block-reads rather than simple
   1281 .Xr getc 3
   1282 calls to read characters from
   1283 .Em yyin .
   1284 The nature of how it gets its input can be controlled by defining the
   1285 .Dv YY_INPUT
   1286 macro.
   1287 .Dv YY_INPUT Ns 's
   1288 calling sequence is
   1289 .Qq YY_INPUT(buf,result,max_size) .
   1290 Its action is to place up to
   1291 .Dv max_size
   1292 characters in the character array
   1293 .Em buf
   1294 and return in the integer variable
   1295 .Em result
   1296 either the number of characters read or the constant
   1297 .Dv YY_NULL
   1298 (0 on
   1299 .Ux
   1300 systems)
   1301 to indicate
   1302 .Dv EOF .
   1303 The default
   1304 .Dv YY_INPUT
   1305 reads from the global file-pointer
   1306 .Qq yyin .
   1307 .Pp
   1308 A sample definition of
   1309 .Dv YY_INPUT
   1310 .Pq in the definitions section of the input file :
   1311 .Bd -unfilled -offset indent
   1312 %{
   1313 #define YY_INPUT(buf,result,max_size) \e
   1314 { \e
   1315         int c = getchar(); \e
   1316         result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \e
   1317 }
   1318 %}
   1319 .Ed
   1320 .Pp
   1321 This definition will change the input processing to occur
   1322 one character at a time.
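.Pp
A block-oriented variant can be sketched in the same way.
The example below is an illustration under stated assumptions, not part of flex:
it reads from an in-memory string through the hypothetical helper set_source(),
defines YY_NULL as 0 as the text gives for UNIX systems,
and wraps the macro in a hypothetical refill() function so that it can be
exercised without a generated scanner.

```c
#include <stddef.h>
#include <string.h>

#define YY_NULL 0       /* assumption: the value given for UNIX systems */

/* Hypothetical in-memory source; not part of flex. */
static const char *src = "";
static size_t src_left;

void
set_source(const char *s)
{
        src = s;
        src_left = strlen(s);
}

/* Same calling sequence as YY_INPUT(buf,result,max_size). */
#define YY_INPUT(buf, result, max_size) \
do { \
        size_t n_ = (size_t)(max_size); \
        if (n_ > src_left) \
                n_ = src_left; \
        memcpy((buf), src, n_); \
        src += n_; \
        src_left -= n_; \
        (result) = n_ > 0 ? (int)n_ : YY_NULL; \
} while (0)

/* Performs one refill and returns the value stored in result. */
int
refill(char *buf, int max_size)
{
        int result;

        YY_INPUT(buf, result, max_size);
        return result;
}
```

In a scanner specification the macro definition would go in the
definitions section, as in the sample above.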
   1323 .Pp
   1324 When the scanner receives an end-of-file indication from
   1325 .Dv YY_INPUT ,
   1326 it then checks the
   1327 .Fn yywrap
   1328 function.
   1329 If
   1330 .Fn yywrap
   1331 returns false
   1332 .Pq zero ,
   1333 then it is assumed that the function has gone ahead and set up
   1334 .Em yyin
   1335 to point to another input file, and scanning continues.
   1336 If it returns true
   1337 .Pq non-zero ,
   1338 then the scanner terminates, returning 0 to its caller.
   1339 Note that in either case, the start condition remains unchanged;
   1340 it does not revert to
   1341 .Em INITIAL .
   1342 .Pp
   1343 If you do not supply your own version of
   1344 .Fn yywrap ,
   1345 then you must either use
   1346 .Dq %option noyywrap
   1347 (in which case the scanner behaves as though
   1348 .Fn yywrap
   1349 returned 1), or you must link with
   1350 .Fl lfl
   1351 to obtain the default version of the routine, which always returns 1.
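.Pp
A user-supplied yywrap() that chains several inputs might look like the
following sketch.
The queue, its size limit and queue_input() are hypothetical;
yyin is defined here only so that the fragment stands alone,
whereas in a real scanner it is provided by the generated code.

```c
#include <stdio.h>

#define MAX_PENDING 8

/* Provided by the generated scanner; defined here for illustration. */
FILE *yyin;

/* Hypothetical queue of inputs still to be scanned. */
static FILE *pending[MAX_PENDING];
static int npending, next_pending;

int
queue_input(FILE *fp)
{
        if (fp == NULL || npending == MAX_PENDING)
                return -1;
        pending[npending++] = fp;
        return 0;
}

int
yywrap(void)
{
        if (next_pending < npending) {
                yyin = pending[next_pending++];
                return 0;       /* more input: scanning continues */
        }
        return 1;               /* nothing left: the scanner terminates */
}
```

The return values follow the convention described above:
zero after yyin has been pointed at the next input, non-zero otherwise.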
   1352 .Pp
   1353 Three routines are available for scanning from in-memory buffers rather
   1354 than files:
   1355 .Fn yy_scan_string ,
   1356 .Fn yy_scan_bytes ,
   1357 and
   1358 .Fn yy_scan_buffer .
   1359 See the discussion of them below in the section
   1360 .Sx MULTIPLE INPUT BUFFERS .
   1361 .Pp
   1362 The scanner writes its
   1363 .Em ECHO
   1364 output to the
   1365 .Em yyout
   1366 global
   1367 .Pq default, stdout ,
   1368 which may be redefined by the user simply by assigning it to some other
   1369 .Va FILE
   1370 pointer.
   1371 .Sh START CONDITIONS
   1372 .Nm
   1373 provides a mechanism for conditionally activating rules.
   1374 Any rule whose pattern is prefixed with
   1375 .Qq Aq sc
   1376 will only be active when the scanner is in the start condition named
   1377 .Qq sc .
   1378 For example,
   1379 .Bd -literal -offset indent
   1380 <STRING>[^"]* { /* eat up the string body ... */
   1381         ...
   1382 }
   1383 .Ed
   1384 .Pp
   1385 will be active only when the scanner is in the
   1386 .Qq STRING
   1387 start condition, and
   1388 .Bd -literal -offset indent
   1389 <INITIAL,STRING,QUOTE>\e. { /* handle an escape ... */
   1390         ...
   1391 }
   1392 .Ed
   1393 .Pp
   1394 will be active only when the current start condition is either
   1395 .Qq INITIAL ,
   1396 .Qq STRING ,
   1397 or
   1398 .Qq QUOTE .
   1399 .Pp
   1400 Start conditions are declared in the definitions
   1401 .Pq first
   1402 section of the input using unindented lines beginning with either
   1403 .Sq %s
   1404 or
   1405 .Sq %x
   1406 followed by a list of names.
   1407 The former declares
   1408 .Em inclusive
   1409 start conditions, the latter
   1410 .Em exclusive
   1411 start conditions.
   1412 A start condition is activated using the
   1413 .Em BEGIN
   1414 action.
   1415 Until the next
   1416 .Em BEGIN
   1417 action is executed, rules with the given start condition will be active and
   1418 rules with other start conditions will be inactive.
   1419 If the start condition is inclusive,
   1420 then rules with no start conditions at all will also be active.
   1421 If it is exclusive,
   1422 then only rules qualified with the start condition will be active.
   1423 A set of rules contingent on the same exclusive start condition
   1424 describes a scanner which is independent of any of the other rules in the
   1425 .Nm
   1426 input.
   1427 Because of this, exclusive start conditions make it easy to specify
   1428 .Qq mini-scanners
   1429 which scan portions of the input that are syntactically different
   1430 from the rest
   1431 .Pq e.g., comments .
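.Pp
The activation rule can be summarised as a small predicate.
This is a model written for illustration, not code taken from flex:
rule_active() and the bit-mask representation of a rule's start-condition
prefix are assumptions, and start conditions are assumed to be numbered
below 32 with INITIAL as 0.

```c
#include <stdbool.h>

enum { SC_INITIAL = 0 };

/*
 * rule_mask has bit n set if the rule is prefixed with start
 * condition n; a mask of 0 means the rule has no prefix at all.
 * current is the active start condition and current_exclusive
 * tells whether it was declared with %x rather than %s.
 */
bool
rule_active(unsigned int rule_mask, int current, bool current_exclusive)
{
        if (rule_mask != 0)
                return (rule_mask & (1u << current)) != 0;
        /* Unprefixed rule: active in INITIAL and in inclusive conditions. */
        return current == SC_INITIAL || !current_exclusive;
}
```

A rule prefixed with several start conditions simply has several bits set
in its mask.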
   1432 .Pp
   1433 If the distinction between inclusive and exclusive start conditions
   1434 is still a little vague, here's a simple example illustrating the
   1435 connection between the two.
   1436 The set of rules:
   1437 .Bd -literal -offset indent
   1438 %s example
   1439 %%
   1440 
   1441 <example>foo   do_something();
   1442 
   1443 bar            something_else();
   1444 .Ed
   1445 .Pp
   1446 is equivalent to
   1447 .Bd -literal -offset indent
   1448 %x example
   1449 %%
   1450 
   1451 <example>foo   do_something();
   1452 
   1453 <INITIAL,example>bar    something_else();
   1454 .Ed
   1455 .Pp
   1456 Without the
   1457 .Aq INITIAL,example
   1458 qualifier, the
   1459 .Dq bar
   1460 pattern in the second example wouldn't be active
   1461 .Pq i.e., couldn't match
   1462 when in start condition
   1463 .Dq example .
   1464 If we just used
   1465 .Aq example
   1466 to qualify
   1467 .Dq bar ,
   1468 though, then it would only be active in
   1469 .Dq example
   1470 and not in
   1471 .Em INITIAL ,
   1472 while in the first example it's active in both,
   1473 because in the first example the
   1474 .Dq example
   1475 start condition is an inclusive
   1476 .Pq Sq %s
   1477 start condition.
   1478 .Pp
   1479 Also note that the special start-condition specifier
   1480 .Sq Aq *
   1481 matches every start condition.
   1482 Thus, the above example could also have been written:
   1483 .Bd -literal -offset indent
   1484 %x example
   1485 %%
   1486 
   1487 <example>foo   do_something();
   1488 
   1489 <*>bar         something_else();
   1490 .Ed
   1491 .Pp
   1492 The default rule (to
   1493 .Em ECHO
   1494 any unmatched character) remains active in start conditions.
   1495 It is equivalent to:
   1496 .Bd -literal -offset indent
   1497 <*>.|\en     ECHO;
   1498 .Ed
   1499 .Pp
   1500 .Dq BEGIN(0)
   1501 returns to the original state where only the rules with
   1502 no start conditions are active.
   1503 This state can also be referred to as the start-condition
   1504 .Em INITIAL ,
   1505 so
   1506 .Dq BEGIN(INITIAL)
   1507 is equivalent to
   1508 .Dq BEGIN(0) .
   1509 (The parentheses around the start condition name are not required but
   1510 are considered good style.)
   1511 .Pp
   1512 .Em BEGIN
   1513 actions can also be given as indented code at the beginning
   1514 of the rules section.
   1515 For example, the following will cause the scanner to enter the
   1516 .Qq SPECIAL
   1517 start condition whenever
   1518 .Fn yylex
   1519 is called and the global variable
   1520 .Fa enter_special
   1521 is true:
   1522 .Bd -literal -offset indent
   1523         int enter_special;
   1524 
   1525 %x SPECIAL
   1526 %%
   1527         if (enter_special)
   1528                 BEGIN(SPECIAL);
   1529 
   1530 <SPECIAL>blahblahblah
   1531 \&...more rules follow...
   1532 .Ed
   1533 .Pp
   1534 To illustrate the uses of start conditions,
   1535 here is a scanner which provides two different interpretations
   1536 of a string like
   1537 .Qq 123.456 .
   1538 By default it will treat it as three tokens: the integer
   1539 .Qq 123 ,
   1540 a dot
   1541 .Pq Sq .\& ,
   1542 and the integer
   1543 .Qq 456 .
   1544 But if the string is preceded earlier in the line by the string
   1545 .Qq expect-floats
   1546 it will treat it as a single token, the floating-point number 123.456:
   1547 .Bd -literal -offset indent
   1548 %{
   1549 #include <stdlib.h>
   1550 %}
   1551 %s expect
   1552 
   1553 %%
   1554 expect-floats        BEGIN(expect);
   1555 
   1556 <expect>[0-9]+"."[0-9]+ {
   1557         printf("found a float, = %f\en",
   1558             atof(yytext));
   1559 }
   1560 <expect>\en {
   1561         /*
   1562          * That's the end of the line, so
   1563          * we need another "expect-number"
   1564          * we need another "expect-floats"
   1565          * numbers.
   1566          */
   1567         BEGIN(INITIAL);
   1568 }
   1569 
   1570 [0-9]+ {
   1571         printf("found an integer, = %d\en",
   1572             atoi(yytext));
   1573 }
   1574 
   1575 "."     printf("found a dot\en");
   1576 .Ed
   1577 .Pp
   1578 Here is a scanner which recognizes
   1579 .Pq and discards
   1580 C comments while maintaining a count of the current input line:
   1581 .Bd -literal -offset indent
   1582 %x comment
   1583 %%
   1584         int line_num = 1;
   1585 
   1586 "/*"                    BEGIN(comment);
   1587 
   1588 <comment>[^*\en]*        /* eat anything that's not a '*' */
   1589 <comment>"*"+[^*/\en]*   /* eat up '*'s not followed by '/'s */
   1590 <comment>\en             ++line_num;
   1591 <comment>"*"+"/"        BEGIN(INITIAL);
   1592 .Ed
   1593 .Pp
   1594 This scanner goes to a bit of trouble to match as much
   1595 text as possible with each rule.
   1596 In general, when attempting to write a high-speed scanner
   1597 try to match as much as possible in each rule, as it's a big win.
   1598 .Pp
   1599 Note that start-condition names are really integer values and
   1600 can be stored as such.
   1601 Thus, the above could be extended in the following fashion:
   1602 .Bd -literal -offset indent
   1603 %x comment foo
   1604 %%
   1605         int line_num = 1;
   1606         int comment_caller;
   1607 
   1608 "/*" {
   1609         comment_caller = INITIAL;
   1610         BEGIN(comment);
   1611 }
   1612 
   1613 \&...
   1614 
   1615 <foo>"/*" {
   1616         comment_caller = foo;
   1617         BEGIN(comment);
   1618 }
   1619 
   1620 <comment>[^*\en]*        /* eat anything that's not a '*' */
   1621 <comment>"*"+[^*/\en]*   /* eat up '*'s not followed by '/'s */
   1622 <comment>\en             ++line_num;
   1623 <comment>"*"+"/"        BEGIN(comment_caller);
   1624 .Ed
   1625 .Pp
   1626 Furthermore, the current start condition can be accessed by using
   1627 the integer-valued
   1628 .Dv YY_START
   1629 macro.
   1630 For example, the above assignments to
   1631 .Em comment_caller
   1632 could instead be written
   1633 .Pp
   1634 .Dl comment_caller = YY_START;
   1635 .Pp
   1636 Flex provides
   1637 .Dv YYSTATE
   1638 as an alias for
   1639 .Dv YY_START
   1640 (since that is what's used by
   1641 .At
   1642 .Nm lex ) .
   1643 .Pp
   1644 Note that start conditions do not have their own name-space;
   1645 %s's and %x's declare names in the same fashion as #define's.
   1646 .Pp
   1647 Finally, here's an example of how to match C-style quoted strings using
   1648 exclusive start conditions, including expanded escape sequences
   1649 (but not including checking for a string that's too long):
   1650 .Bd -literal -offset indent
   1651 %x str
   1652 
   1653 %%
   1654         #define MAX_STR_CONST 1024
   1655         char string_buf[MAX_STR_CONST];
   1656         char *string_buf_ptr;
   1657 
   1658 \e"      string_buf_ptr = string_buf; BEGIN(str);
   1659 
   1660 <str>\e" { /* saw closing quote - all done */
   1661         BEGIN(INITIAL);
   1662         *string_buf_ptr = '\e0';
   1663         /*
   1664          * return string constant token type and
   1665          * value to parser
   1666          */
   1667 }
   1668 
   1669 <str>\en {
   1670         /* error - unterminated string constant */
   1671         /* generate error message */
   1672 }
   1673 
   1674 <str>\e\e[0-7]{1,3} {
   1675         /* octal escape sequence */
   1676         unsigned int result;
   1677 
   1678         (void) sscanf(yytext + 1, "%o", &result);
   1679 
   1680         if (result > 0xff) {
   1681                 /* error, constant is out-of-bounds */
   1682         } else
   1683                 *string_buf_ptr++ = result;
   1684 }
   1685 
   1686 <str>\e\e[0-9]+ {
   1687         /*
   1688          * generate error - bad escape sequence; something
   1689          * like '\e48' or '\e0777777'
   1690          */
   1691 }
   1692 
   1693 <str>\e\en  *string_buf_ptr++ = '\en';
   1694 <str>\e\et  *string_buf_ptr++ = '\et';
   1695 <str>\e\er  *string_buf_ptr++ = '\er';
   1696 <str>\e\eb  *string_buf_ptr++ = '\eb';
   1697 <str>\e\ef  *string_buf_ptr++ = '\ef';
   1698 
   1699 <str>\e\e(.|\en)  *string_buf_ptr++ = yytext[1];
   1700 
   1701 <str>[^\e\e\en\e"]+ {
   1702         char *yptr = yytext;
   1703 
   1704         while (*yptr)
   1705                 *string_buf_ptr++ = *yptr++;
   1706 }
   1707 .Ed
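.Pp
The octal escape handling in the example above can be isolated into a
helper and tested on its own.
The function name decode_octal_escape() and its return convention are
hypothetical; the sketch assumes that its argument is the matched text,
that is, a backslash followed by one to three octal digits.

```c
#include <stdio.h>

/*
 * text points at a backslash followed by one to three octal digits.
 * Returns 0 and stores the byte in *out, or returns -1 if no octal
 * number is present or the value does not fit in one byte.
 */
int
decode_octal_escape(const char *text, unsigned char *out)
{
        unsigned int value;     /* %o requires an unsigned int */

        if (sscanf(text + 1, "%3o", &value) != 1)
                return -1;
        if (value > 0xff)
                return -1;
        *out = (unsigned char)value;
        return 0;
}
```

Returning an error code instead of storing directly into the string buffer
leaves the choice of error message to the calling action.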
   1708 .Pp
   1709 Often, such as in some of the examples above,
   1710 a whole bunch of rules are all preceded by the same start condition(s).
   1711 .Nm
   1712 makes this a little easier and cleaner by introducing a notion of
   1713 start condition
   1714 .Em scope .
   1715 A start condition scope is begun with:
   1716 .Pp
   1717 .Dl <SCs>{
   1718 .Pp
   1719 where
   1720 .Dq SCs
   1721 is a list of one or more start conditions.
   1722 Inside the start condition scope, every rule automatically has the prefix
   1723 .Aq SCs
   1724 applied to it, until a
   1725 .Sq }
   1726 which matches the initial
   1727 .Sq { .
   1728 So, for example,
   1729 .Bd -literal -offset indent
   1730 <ESC>{
   1731     "\e\en"   return '\en';
   1732     "\e\er"   return '\er';
   1733     "\e\ef"   return '\ef';
   1734     "\e\e0"   return '\e0';
   1735 }
   1736 .Ed
   1737 .Pp
   1738 is equivalent to:
   1739 .Bd -literal -offset indent
   1740 <ESC>"\e\en"  return '\en';
   1741 <ESC>"\e\er"  return '\er';
   1742 <ESC>"\e\ef"  return '\ef';
   1743 <ESC>"\e\e0"  return '\e0';
   1744 .Ed
   1745 .Pp
   1746 Start condition scopes may be nested.
   1747 .Pp
   1748 Three routines are available for manipulating stacks of start conditions:
   1749 .Bl -tag -width Ds
   1750 .It void yy_push_state(int new_state)
   1751 Pushes the current start condition onto the top of the start condition
   1752 stack and switches to
   1753 .Fa new_state
   1754 as though
   1755 .Dq BEGIN new_state
   1756 had been used
   1757 .Pq recall that start condition names are also integers .
   1758 .It void yy_pop_state()
   1759 Pops the top of the stack and switches to it via
   1760 .Em BEGIN .
   1761 .It int yy_top_state()
   1762 Returns the top of the stack without altering the stack's contents.
   1763 .El
   1764 .Pp
   1765 The start condition stack grows dynamically and so has no built-in
   1766 size limitation.
   1767 If memory is exhausted, program execution aborts.
   1768 .Pp
   1769 To use start condition stacks, scanners must include a
   1770 .Dq %option stack
   1771 directive (see
   1772 .Sx OPTIONS
   1773 below).
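.Pp
The behaviour of the three stack routines can be modelled in a few lines
of C.
This is a hypothetical model, not the code flex generates:
the names sc_push(), sc_pop(), sc_top() and sc_state() are invented,
and unlike the generated routines the model reports memory exhaustion
and an empty stack through a return value of -1 instead of aborting.

```c
#include <stdlib.h>

/* Hypothetical model of a start condition stack. */
static int *sc_stack;
static size_t sc_depth, sc_cap;
static int sc_current;          /* plays the role of YY_START; 0 is INITIAL */

/* Save the current start condition and switch to new_state. */
int
sc_push(int new_state)
{
        if (sc_depth == sc_cap) {
                size_t ncap = sc_cap ? sc_cap * 2 : 8;
                int *p = realloc(sc_stack, ncap * sizeof(*p));

                if (p == NULL)
                        return -1;
                sc_stack = p;
                sc_cap = ncap;
        }
        sc_stack[sc_depth++] = sc_current;
        sc_current = new_state;
        return 0;
}

/* Switch back to the most recently saved start condition. */
int
sc_pop(void)
{
        if (sc_depth == 0)
                return -1;
        sc_current = sc_stack[--sc_depth];
        return 0;
}

/* The start condition a pop would return to; the stack is unchanged. */
int
sc_top(void)
{
        return sc_depth ? sc_stack[sc_depth - 1] : -1;
}

/* The start condition currently in effect. */
int
sc_state(void)
{
        return sc_current;
}
```

Doubling the capacity on demand mirrors the statement that the stack has
no built-in size limitation.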
   1774 .Sh MULTIPLE INPUT BUFFERS
   1775 Some scanners
   1776 (such as those which support
   1777 .Qq include
   1778 files)
   1779 require reading from several input streams.
   1780 As
   1781 .Nm
   1782 scanners do a large amount of buffering, one cannot control
   1783 where the next input will be read from by simply writing a
   1784 .Dv YY_INPUT
   1785 which is sensitive to the scanning context.
   1786 .Dv YY_INPUT
   1787 is only called when the scanner reaches the end of its buffer, which
   1788 may be a long time after scanning a statement such as an
   1789 .Qq include
   1790 which requires switching the input source.
   1791 .Pp
   1792 To negotiate these sorts of problems,
   1793 .Nm
   1794 provides a mechanism for creating and switching between multiple
   1795 input buffers.
   1796 An input buffer is created by using:
   1797 .Pp
   1798 .D1 YY_BUFFER_STATE yy_create_buffer(FILE *file, int size)
   1799 .Pp
   1800 which takes a
   1801 .Fa FILE
   1802 pointer and a
   1803 .Fa size
   1804 and creates a buffer associated with the given file and large enough to hold
   1805 .Fa size
   1806 characters (when in doubt, use
   1807 .Dv YY_BUF_SIZE
   1808 for the size).
   1809 It returns a
   1810 .Dv YY_BUFFER_STATE
   1811 handle, which may then be passed to other routines
   1812 .Pq see below .
   1813 The
   1814 .Dv YY_BUFFER_STATE
   1815 type is a pointer to an opaque
   1816 .Dq struct yy_buffer_state
   1817 structure, so
   1818 .Dv YY_BUFFER_STATE
   1819 variables may be safely initialized to
   1820 .Dq ((YY_BUFFER_STATE) 0)
   1821 if desired, and the opaque structure can also be referred to in order to
   1822 correctly declare input buffers in source files other than that of scanners.
   1823 Note that the
   1824 .Fa FILE
   1825 pointer in the call to
   1826 .Fn yy_create_buffer
   1827 is only used as the value of
   1828 .Fa yyin
   1829 seen by
   1830 .Dv YY_INPUT ;
   1831 if
   1832 .Dv YY_INPUT
   1833 is redefined so that it no longer uses
   1834 .Fa yyin ,
   1835 then a nil
   1836 .Fa FILE
   1837 pointer can safely be passed to
   1838 .Fn yy_create_buffer .
   1839 To select a particular buffer to scan:
   1840 .Pp
   1841 .D1 void yy_switch_to_buffer(YY_BUFFER_STATE new_buffer)
   1842 .Pp
   1843 It switches the scanner's input buffer so subsequent tokens will
   1844 come from
   1845 .Fa new_buffer .
   1846 Note that
   1847 .Fn yy_switch_to_buffer
   1848 may be used by
   1849 .Fn yywrap
   1850 to set things up for continued scanning,
   1851 instead of opening a new file and pointing
   1852 .Fa yyin
   1853 at it.
   1854 Note also that switching input sources via either
   1855 .Fn yy_switch_to_buffer
   1856 or
   1857 .Fn yywrap
   1858 does not change the start condition.
   1859 .Pp
   1860 .D1 void yy_delete_buffer(YY_BUFFER_STATE buffer)
   1861 .Pp
   1862 is used to reclaim the storage associated with a buffer.
   1863 .Pf ( Fa buffer
   1864 can be nil, in which case the routine does nothing.)
   1865 To clear the current contents of a buffer:
   1866 .Pp
   1867 .D1 void yy_flush_buffer(YY_BUFFER_STATE buffer)
   1868 .Pp
   1869 This function discards the buffer's contents,
   1870 so the next time the scanner attempts to match a token from the buffer,
   1871 it will first fill the buffer anew using
   1872 .Dv YY_INPUT .
   1873 .Pp
   1874 .Fn yy_new_buffer
   1875 is an alias for
   1876 .Fn yy_create_buffer ,
   1877 provided for compatibility with the C++ use of
   1878 .Em new
   1879 and
   1880 .Em delete
   1881 for creating and destroying dynamic objects.
   1882 .Pp
   1883 Finally, the
   1884 .Dv YY_CURRENT_BUFFER
   1885 macro returns a
   1886 .Dv YY_BUFFER_STATE
   1887 handle to the current buffer.
   1888 .Pp
   1889 Here is an example of using these features for writing a scanner
   1890 which expands include files (the
   1891 .Aq Aq EOF
   1892 feature is discussed below):
   1893 .Bd -literal -offset indent
   1894 /*
   1895  * the "incl" state is used for picking up the name
   1896  * of an include file
   1897  */
   1898 %x incl
   1899 
   1900 %{
   1901 #define MAX_INCLUDE_DEPTH 10
   1902 YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
   1903 int include_stack_ptr = 0;
   1904 %}
   1905 
   1906 %%
   1907 include             BEGIN(incl);
   1908 
   1909 [a-z]+              ECHO;
   1910 [^a-z\en]*\en?        ECHO;
   1911 
   1912 <incl>[ \et]*        /* eat the whitespace */
   1913 <incl>[^ \et\en]+ {   /* got the include file name */
   1914         if (include_stack_ptr >= MAX_INCLUDE_DEPTH)
   1915                 errx(1, "Includes nested too deeply");
   1916 
   1917         include_stack[include_stack_ptr++] =
   1918             YY_CURRENT_BUFFER;
   1919 
   1920         yyin = fopen(yytext, "r");
   1921 
   1922         if (yyin == NULL)
   1923                 err(1, NULL);
   1924 
   1925         yy_switch_to_buffer(
   1926             yy_create_buffer(yyin, YY_BUF_SIZE));
   1927 
   1928         BEGIN(INITIAL);
   1929 }
   1930 
   1931 <<EOF>> {
   1932         if (--include_stack_ptr < 0)
   1933                 yyterminate();
   1934         else {
   1935                 yy_delete_buffer(YY_CURRENT_BUFFER);
   1936                 yy_switch_to_buffer(
   1937                     include_stack[include_stack_ptr]);
   1938         }
   1939 }
   1940 .Ed
   1941 .Pp
   1942 Three routines are available for setting up input buffers for
   1943 scanning in-memory strings instead of files.
   1944 All of them create a new input buffer for scanning the string,
   1945 and return a corresponding
   1946 .Dv YY_BUFFER_STATE
   1947 handle (which should be deleted afterwards using
   1948 .Fn yy_delete_buffer ) .
   1949 They also switch to the new buffer using
   1950 .Fn yy_switch_to_buffer ,
   1951 so the next call to
   1952 .Fn yylex
   1953 will start scanning the string.
   1954 .Bl -tag -width Ds
   1955 .It yy_scan_string(const char *str)
   1956 Scans a NUL-terminated string.
   1957 .It yy_scan_bytes(const char *bytes, int len)
   1958 Scans
   1959 .Fa len
   1960 bytes
   1961 .Pq including possibly NUL's
   1962 starting at location
   1963 .Fa bytes .
   1964 .El
   1965 .Pp
   1966 Note that both of these functions create and scan a copy
   1967 of the string or bytes.
   1968 (This may be desirable, since
   1969 .Fn yylex
   1970 modifies the contents of the buffer it is scanning.)
   1971 The copy can be avoided by using:
   1972 .Bl -tag -width Ds
   1973 .It yy_scan_buffer(char *base, yy_size_t size)
   1974 Which scans the buffer starting at
   1975 .Fa base ,
   1976 consisting of
   1977 .Fa size
   1978 bytes, the last two bytes of which must be
   1979 .Dv YY_END_OF_BUFFER_CHAR
   1980 .Pq ASCII NUL .
   1981 These last two bytes are not scanned; thus, scanning consists of
   1982 base[0] through base[size-3], inclusive.
   1983 .Pp
   1984 If
   1985 .Fa base
   1986 is not set up in this manner
   1987 (i.e., without the final two
   1988 .Dv YY_END_OF_BUFFER_CHAR
   1989 bytes), then
   1990 .Fn yy_scan_buffer
   1991 returns a nil pointer instead of creating a new input buffer.
   1992 .Pp
   1993 The type
   1994 .Fa yy_size_t
   1995 is an integral type to which an integer expression
   1996 reflecting the size of the buffer can be cast.
   1997 .El
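.Pp
Preparing a buffer in the layout that yy_scan_buffer() expects can be
sketched as follows.
The helper make_scan_buffer() is hypothetical and relies on the statement
above that YY_END_OF_BUFFER_CHAR is ASCII NUL;
it does not call into the scanner itself.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Returns a malloc'd copy of s followed by two NUL bytes and stores
 * the total size, strlen(s) + 2, in *size.  Returns NULL on failure.
 */
char *
make_scan_buffer(const char *s, size_t *size)
{
        size_t len = strlen(s);
        char *base = malloc(len + 2);

        if (base == NULL)
                return NULL;
        memcpy(base, s, len);
        base[len] = '\0';       /* first end-of-buffer byte */
        base[len + 1] = '\0';   /* second end-of-buffer byte */
        *size = len + 2;
        return base;
}
```

The returned pointer and size would then be handed to yy_scan_buffer(),
and the memory released by the caller once the scanner is done with it.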
   1998 .Sh END-OF-FILE RULES
   1999 The special rule
   2000 .Qq Aq Aq EOF
   2001 indicates actions which are to be taken when an end-of-file is encountered and
   2002 .Fn yywrap
   2003 returns non-zero
   2004 .Pq i.e., indicates no further files to process .
   2005 The action must finish by doing one of four things:
   2006 .Bl -dash
   2007 .It
   2008 Assigning
   2009 .Em yyin
   2010 to a new input file
   2011 (in previous versions of
   2012 .Nm ,
   2013 after doing the assignment, it was necessary to call the special action
   2014 .Dv YY_NEW_FILE ;
   2015 this is no longer necessary).
   2016 .It
   2017 Executing a
   2018 .Em return
   2019 statement.
   2020 .It
   2021 Executing the special
   2022 .Fn yyterminate
   2023 action.
   2024 .It
   2025 Switching to a new buffer using
   2026 .Fn yy_switch_to_buffer
   2027 as shown in the example above.
   2028 .El
   2029 .Pp
   2030 .Aq Aq EOF
   2031 rules may not be used with other patterns;
   2032 they may only be qualified with a list of start conditions.
   2033 If an unqualified
   2034 .Aq Aq EOF
   2035 rule is given, it applies to all start conditions which do not already have
   2036 .Aq Aq EOF
   2037 actions.
   2038 To specify an
   2039 .Aq Aq EOF
   2040 rule for only the initial start condition, use
   2041 .Pp
   2042 .Dl <INITIAL><<EOF>>
   2043 .Pp
   2044 These rules are useful for catching things like unclosed comments.
   2045 An example:
   2046 .Bd -literal -offset indent
   2047 %x quote
   2048 %%
   2049 
   2050 \&...other rules for dealing with quotes...
   2051 
   2052 <quote><<EOF>> {
   2053          error("unterminated quote");
   2054          yyterminate();
   2055 }
   2056 <<EOF>> {
   2057          if (*++filelist)
   2058                  yyin = fopen(*filelist, "r");
   2059          else
   2060                  yyterminate();
   2061 }
   2062 .Ed
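.Pp
The example above does not check whether
.Xr fopen 3
succeeded.
A slightly more defensive variant could move the file-list handling into
a helper such as the one sketched below.
The name
.Fn open_next_input
is hypothetical, and the sketch assumes
.Va filelist
is a NULL-terminated array of file names whose current entry has already
been scanned.

```c
#include <stdio.h>

/*
 * Hypothetical helper: advance a NULL-terminated list of file names and
 * return the next file that can be opened for reading, or NULL when the
 * list is exhausted.  *listp is left pointing at the name that was
 * opened, or at the terminating NULL.
 */
FILE *
open_next_input(char ***listp)
{
	FILE *fp;

	/* Never step past the terminating NULL entry. */
	while (**listp != NULL && *++*listp != NULL) {
		fp = fopen(**listp, "r");
		if (fp != NULL)
			return fp;
		perror(**listp);	/* report the failure, try the next name */
	}
	return NULL;
}

/*
 * A matching rule in the scanner might then read:
 *
 *	<<EOF>> {
 *		FILE *next = open_next_input(&filelist);
 *		if (next == NULL)
 *			yyterminate();
 *		yyin = next;
 *	}
 */
```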
   2063 .Sh MISCELLANEOUS MACROS
   2064 The macro
   2065 .Dv YY_USER_ACTION
   2066 can be defined to provide an action
   2067 which is always executed prior to the matched rule's action.
   2068 For example,
   2069 it could be #define'd to call a routine to convert yytext to lower-case.
   2070 When
   2071 .Dv YY_USER_ACTION
   2072 is invoked, the variable
   2073 .Fa yy_act
   2074 gives the number of the matched rule
   2075 .Pq rules are numbered starting with 1 .
   2076 For example, to profile how often each rule is matched,
   2077 the following would do the trick:
   2078 .Pp
   2079 .Dl #define YY_USER_ACTION ++ctr[yy_act]
   2080 .Pp
   2081 where
   2082 .Fa ctr
   2083 is an array to hold the counts for the different rules.
   2084 Note that the macro
   2085 .Dv YY_NUM_RULES
   2086 gives the total number of rules
   2087 (including the default rule, even if
   2088 .Fl s
   2089 is used),
   2090 so, because rule numbers start at 1, a correct declaration for
   2091 .Fa ctr
   2092 is:
   2093 .Pp
   2094 .Dl int ctr[YY_NUM_RULES + 1];
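.Pp
The profiling idea can be fleshed out roughly as follows.
This is only a sketch: the value given to
.Dv YY_NUM_RULES
is a stand-in so the code compiles by itself, and
.Fn count_rule
and
.Fn report_rule_counts
are hypothetical names.
The array has one extra element on the assumption that
.Fa yy_act
ranges from 1 through
.Dv YY_NUM_RULES ,
as the rule numbering described above implies.

```c
#include <stdio.h>

/*
 * Stand-in for the value flex defines in a generated scanner; the
 * number 4 is hypothetical and only lets this sketch compile alone.
 */
#ifndef YY_NUM_RULES
#define YY_NUM_RULES 4
#endif

/* One extra slot so that 1-based rule numbers index the array directly. */
static unsigned long ctr[YY_NUM_RULES + 1];

/*
 * Intended for use as:
 *	#define YY_USER_ACTION count_rule(yy_act);
 */
void
count_rule(int rule)
{
	if (rule >= 1 && rule <= YY_NUM_RULES)
		ctr[rule]++;
}

/*
 * Print one "rule N: count" line for each rule that matched at least
 * once and return the number of such rules.
 */
int
report_rule_counts(FILE *out)
{
	int rule, used = 0;

	for (rule = 1; rule <= YY_NUM_RULES; rule++) {
		if (ctr[rule] == 0)
			continue;
		fprintf(out, "rule %d: %lu\n", rule, ctr[rule]);
		used++;
	}
	return used;
}
```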
   2095 .Pp
   2096 The macro
   2097 .Dv YY_USER_INIT
   2098 may be defined to provide an action which is always executed before
   2099 the first scan
   2100 .Pq and before the scanner's internal initializations are done .
   2101 For example, it could be used to call a routine to read
   2102 in a data table or open a logging file.
   2103 .Pp
   2104 The macro
   2105 .Dv yy_set_interactive(is_interactive)
   2106 can be used to control whether the current buffer is considered
   2107 .Em interactive .
   2108 An interactive buffer is processed more slowly,
   2109 but must be used when the scanner's input source is indeed
   2110 interactive to avoid problems due to waiting to fill buffers
   2111 (see the discussion of the
   2112 .Fl I
   2113 flag below).
   2114 A non-zero value in the macro invocation marks the buffer as interactive,
   2115 a zero value as non-interactive.
   2116 Note that use of this macro overrides
   2117 .Dq %option always-interactive
   2118 or
   2119 .Dq %option never-interactive
   2120 (see
   2121 .Sx OPTIONS
   2122 below).
   2123 .Fn yy_set_interactive
   2124 must be invoked prior to beginning to scan the buffer that is
   2125 .Pq or is not
   2126 to be considered interactive.
   2127 .Pp
   2128 The macro
   2129 .Dv yy_set_bol(at_bol)
   2130 can be used to control whether the next token match in the
   2131 current buffer is performed as though at the
   2132 beginning of a line.
   2133 A non-zero macro argument makes rules anchored with
   2134 .Sq ^
   2135 active, while a zero argument makes
   2136 .Sq ^
   2137 rules inactive.
   2138 .Pp
   2139 The macro
   2140 .Dv YY_AT_BOL
   2141 returns true if the next token scanned from the current buffer will have
   2142 .Sq ^
   2143 rules active, false otherwise.
   2144 .Pp
   2145 In the generated scanner, the actions are all gathered in one large
   2146 switch statement and separated using
   2147 .Dv YY_BREAK ,
   2148 which may be redefined.
   2149 By default, it is simply a
   2150 .Qq break ,
   2151 to separate each rule's action from the following rules.
   2152 Redefining
   2153 .Dv YY_BREAK
   2154 allows, for example, C++ users to
   2155 .Dq #define YY_BREAK
   2156 to do nothing
   2157 (while being very careful that every rule ends with a
   2158 .Qq break
   2159 or a
   2160 .Qq return ! )
   2161 to avoid unreachable statement warnings where, because a rule's
   2162 action ends with
   2163 .Dq return ,
   2164 the
   2165 .Dv YY_BREAK
   2166 is inaccessible.
   2167 .Sh VALUES AVAILABLE TO THE USER
   2168 This section summarizes the various values available to the user
   2169 in the rule actions.
   2170 .Bl -tag -width Ds
   2171 .It char *yytext
   2172 Holds the text of the current token.
   2173 It may be modified but not lengthened
   2174 .Pq characters cannot be appended to the end .
   2175 .Pp
   2176 If the special directive
   2177 .Dq %array
   2178 appears in the first section of the scanner description, then
   2179 .Fa yytext
   2180 is instead declared
   2181 .Dq char yytext[YYLMAX] ,
   2182 where
   2183 .Dv YYLMAX
   2184 is a macro definition that can be redefined in the first section
   2185 to change the default value
   2186 .Pq generally 8KB .
   2187 Using
   2188 .Dq %array
   2189 results in somewhat slower scanners, but the value of
   2190 .Fa yytext
   2191 becomes immune to calls to
   2192 .Fn input
   2193 and
   2194 .Fn unput ,
   2195 which potentially destroy its value when
   2196 .Fa yytext
   2197 is a character pointer.
   2198 The opposite of
   2199 .Dq %array
   2200 is
   2201 .Dq %pointer ,
   2202 which is the default.
   2203 .Pp
   2204 .Dq %array
   2205 cannot be used when generating C++ scanner classes
   2206 (the
   2207 .Fl +
   2208 flag).
   2209 .It int yyleng
   2210 Holds the length of the current token.
   2211 .It FILE *yyin
   2212 Is the file which by default
   2213 .Nm
   2214 reads from.
   2215 It may be redefined, but doing so only makes sense before
   2216 scanning begins or after an
   2217 .Dv EOF
   2218 has been encountered.
   2219 Changing it in the midst of scanning will have unexpected results since
   2220 .Nm
   2221 buffers its input; use
   2222 .Fn yyrestart
   2223 instead.
   2224 Once scanning terminates because an end-of-file
   2225 has been seen,
   2226 .Fa yyin
   2227 can be assigned as the new input file
   2228 and the scanner can be called again to continue scanning.
   2229 .It void yyrestart(FILE *new_file)
   2230 May be called to point
   2231 .Fa yyin
   2232 at the new input file.
   2233 The switch-over to the new file is immediate
   2234 .Pq any previously buffered-up input is lost .
   2235 Note that calling
   2236 .Fn yyrestart
   2237 with
   2238 .Fa yyin
   2239 as an argument thus throws away the current input buffer and continues
   2240 scanning the same input file.
   2241 .It FILE *yyout
   2242 Is the file to which
   2243 .Em ECHO
   2244 actions are done.
   2245 It can be reassigned by the user.
   2246 .It YY_CURRENT_BUFFER
   2247 Returns a
   2248 .Dv YY_BUFFER_STATE
   2249 handle to the current buffer.
   2250 .It YY_START
   2251 Returns an integer value corresponding to the current start condition.
   2252 This value can subsequently be used with
   2253 .Em BEGIN
   2254 to return to that start condition.
   2255 .El
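.Pp
To illustrate the rule that
.Fa yytext
may be modified but not lengthened, the sketch below removes a pair of
surrounding double quotes in place, so the token can only shrink.
The helper name
.Fn strip_quotes
is hypothetical.

```c
#include <string.h>

/*
 * Hypothetical helper: remove one leading and one trailing double quote
 * from a token in place and return the new length.  Nothing is written
 * past the original end of the token.
 */
int
strip_quotes(char *text, int leng)
{
	if (leng >= 2 && text[0] == '"' && text[leng - 1] == '"') {
		memmove(text, text + 1, (size_t)leng - 2);
		leng -= 2;
		text[leng] = '\0';
	}
	return leng;
}

/*
 * In a rule action one might write:
 *
 *	int n = strip_quotes(yytext, yyleng);
 *
 * and keep the shortened length in the local variable n.
 */
```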
   2256 .Sh INTERFACING WITH YACC
   2257 One of the main uses of
   2258 .Nm
   2259 is as a companion to the
   2260 .Xr yacc 1
   2261 parser-generator.
   2262 yacc parsers expect to call a routine named
   2263 .Fn yylex
   2264 to find the next input token.
   2265 The routine is supposed to return the type of the next token
   2266 and to put any associated value in the global
   2267 .Fa yylval ,
   2268 which is defined externally,
   2269 and can be a union or any other complex data structure.
   2270 To use
   2271 .Nm
   2272 with yacc, one specifies the
   2273 .Fl d
   2274 option to yacc to instruct it to generate the file
   2275 .Pa y.tab.h
   2276 containing definitions of all the
   2277 .Dq %tokens
   2278 appearing in the yacc input.
   2279 This file is then included in the
   2280 .Nm
   2281 scanner.
   2282 For example, if one of the tokens is
   2283 .Qq TOK_NUMBER ,
   2284 part of the scanner might look like:
   2285 .Bd -literal -offset indent
   2286 %{
   2287 #include "y.tab.h"
   2288 %}
   2289 
   2290 %%
   2291 
   2292 [0-9]+        yylval = atoi(yytext); return TOK_NUMBER;
   2293 .Ed
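.Pp
The rule above relies on
.Xr atoi 3 ,
whose behavior is undefined when the number does not fit in an
.Vt int .
A more cautious conversion, assuming as the example does that
.Fa yylval
is an
.Vt int ,
might look like the following sketch.
The helper name
.Fn token_to_int
is hypothetical.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical helper: convert the decimal number in text to an int.
 * Returns 0 and stores the value in *out on success.  Returns -1 if no
 * digits were found, if characters remain after the number, or if the
 * value does not fit in an int.
 */
int
token_to_int(const char *text, int *out)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(text, &end, 10);
	if (end == text || *end != '\0')
		return -1;
	if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
		return -1;
	*out = (int)v;
	return 0;
}

/*
 * A rule using it might read:
 *
 *	[0-9]+	{
 *		if (token_to_int(yytext, &yylval) == -1)
 *			...report a number that is out of range...
 *		return TOK_NUMBER;
 *	}
 */
```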
   2294 .Sh OPTIONS
   2295 .Nm
   2296 has the following options:
   2297 .Bl -tag -width Ds
   2298 .It Fl 7
   2299 Instructs
   2300 .Nm
   2301 to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
   2302 characters in its input.
   2303 The advantage of using
   2304 .Fl 7
   2305 is that the scanner's tables can be up to half the size of those generated
   2306 using the
   2307 .Fl 8
   2308 option
   2309 .Pq see below .
   2310 The disadvantage is that such scanners often hang
   2311 or crash if their input contains an 8-bit character.
   2312 .Pp
   2313 Note, however, that unless generating a scanner using the
   2314 .Fl Cf
   2315 or
   2316 .Fl CF
   2317 table compression options, use of
   2318 .Fl 7
   2319 will save only a small amount of table space,
   2320 and make the scanner considerably less portable.
   2321 .Nm flex Ns 's
   2322 default behavior is to generate an 8-bit scanner unless
   2323 .Fl Cf
   2324 or
   2325 .Fl CF
   2326 is specified, in which case
   2327 .Nm
   2328 defaults to generating 7-bit scanners unless it was
   2329 configured to generate 8-bit scanners
   2330 (as will often be the case with non-USA sites).
   2331 It is possible to tell whether
   2332 .Nm
   2333 generated a 7-bit or an 8-bit scanner by inspecting the flag summary in the
   2334 .Fl v
   2335 output as described below.
   2336 .Pp
   2337 Note that if
   2338 .Fl Cfe
   2339 or
   2340 .Fl CFe
   2341 is used
   2342 (the table compression options, but also using equivalence classes as
   2343 discussed below),
   2344 .Nm
   2345 still defaults to generating an 8-bit scanner,
   2346 since usually with these compression options full 8-bit tables
   2347 are not much more expensive than 7-bit tables.
   2348 .It Fl 8
   2349 Instructs
   2350 .Nm
   2351 to generate an 8-bit scanner, i.e., one which can recognize 8-bit
   2352 characters.
   2353 This flag is only needed for scanners generated using
   2354 .Fl Cf
   2355 or
   2356 .Fl CF ,
   2357 as otherwise
   2358 .Nm
   2359 defaults to generating an 8-bit scanner anyway.
   2360 .Pp
   2361 See the discussion of
   2362 .Fl 7
   2363 above for
   2364 .Nm flex Ns 's
   2365 default behavior and the tradeoffs between 7-bit and 8-bit scanners.
   2366 .It Fl B
   2367 Instructs
   2368 .Nm
   2369 to generate a
   2370 .Em batch
   2371 scanner, the opposite of
   2372 .Em interactive
   2373 scanners generated by
   2374 .Fl I
   2375 .Pq see below .
   2376 In general,
   2377 .Fl B
   2378 is used when the scanner will never be used interactively,
   2379 and you want to squeeze a little more performance out of it.
   2380 If the aim is instead to squeeze out a lot more performance,
   2381 use the
   2382 .Fl Cf
   2383 or
   2384 .Fl CF
   2385 options
   2386 .Pq discussed below ,
   2387 which turn on
   2388 .Fl B
   2389 automatically anyway.
   2390 .It Fl b
   2391 Generate backing-up information to
   2392 .Pa lex.backup .
   2393 This is a list of scanner states which require backing up
   2394 and the input characters on which they do so.
   2395 By adding rules one can remove backing-up states.
   2396 If all backing-up states are eliminated and
   2397 .Fl Cf
   2398 or
   2399 .Fl CF
   2400 is used, the generated scanner will run faster (see the
   2401 .Fl p
   2402 flag).
   2403 Only users who wish to squeeze every last cycle out of their
   2404 scanners need worry about this option.
   2405 (See the section on
   2406 .Sx PERFORMANCE CONSIDERATIONS
   2407 below.)
   2408 .It Fl C Ns Op Cm aeFfmr
   2409 Controls the degree of table compression and, more generally, trade-offs
   2410 between small scanners and fast scanners.
   2411 .Bl -tag -width Ds
   2412 .It Fl Ca
   2413 Instructs
   2414 .Nm
   2415 to trade off larger tables in the generated scanner for faster performance
   2416 because the elements of the tables are better aligned for memory access
   2417 and computation.
   2418 On some
   2419 .Tn RISC
   2420 architectures, fetching and manipulating longwords is more efficient
   2421 than with smaller-sized units such as shortwords.
   2422 This option can double the size of the tables used by the scanner.
   2423 .It Fl Ce
   2424 Directs
   2425 .Nm
   2426 to construct
   2427 .Em equivalence classes ,
   2428 i.e., sets of characters which have identical lexical properties
   2429 (for example, if the only appearance of digits in the
   2430 .Nm
   2431 input is in the character class
   2432 .Qq [0-9]
   2433 then the digits
   2434 .Sq 0 ,
   2435 .Sq 1 ,
   2436 .Sq ... ,
   2437 .Sq 9
   2438 will all be put in the same equivalence class).
   2439 Equivalence classes usually give dramatic reductions in the final
   2440 table/object file sizes
   2441 .Pq typically a factor of 2\-5
   2442 and are pretty cheap performance-wise
   2443 .Pq one array look-up per character scanned .
   2444 .It Fl CF
   2445 Specifies that the alternate fast scanner representation
   2446 (described below under the
   2447 .Fl F
   2448 option)
   2449 should be used.
   2450 This option cannot be used with
   2451 .Fl + .
   2452 .It Fl Cf
   2453 Specifies that the
   2454 .Em full
   2455 scanner tables should be generated \-
   2456 .Nm
   2457 should not compress the tables by taking advantage of
   2458 similar transition functions for different states.
   2459 .It Fl \&Cm
   2460 Directs
   2461 .Nm
   2462 to construct
   2463 .Em meta-equivalence classes ,
   2464 which are sets of equivalence classes
   2465 (or characters, if equivalence classes are not being used)
   2466 that are commonly used together.
   2467 Meta-equivalence classes are often a big win when using compressed tables,
   2468 but they have a moderate performance impact
   2469 (one or two
   2470 .Qq if
   2471 tests and one array look-up per character scanned).
   2472 .It Fl Cr
   2473 Causes the generated scanner to
   2474 .Em bypass
   2475 use of the standard I/O library
   2476 .Pq stdio
   2477 for input.
   2478 Instead of calling
   2479 .Xr fread 3
   2480 or
   2481 .Xr getc 3 ,
   2482 the scanner will use the
   2483 .Xr read 2
   2484 system call,
   2485 resulting in a performance gain which varies from system to system,
   2486 but in general is probably negligible unless
   2487 .Fl Cf
   2488 or
   2489 .Fl CF
   2490 is being used.
   2491 Using
   2492 .Fl Cr
   2493 can cause strange behavior if, for example, reading from
   2494 .Fa yyin
   2495 using stdio prior to calling the scanner
   2496 (because the scanner will miss whatever text previous reads left
   2497 in the stdio input buffer).
   2498 .Pp
   2499 .Fl Cr
   2500 has no effect if
   2501 .Dv YY_INPUT
   2502 is defined
   2503 (see
   2504 .Sx THE GENERATED SCANNER
   2505 above).
   2506 .El
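.Pp
The idea behind the equivalence classes built by
.Fl Ce
can be sketched as follows.
This is an illustration only and makes no claim about the algorithm or
data layout
.Nm
actually uses: characters are grouped so that two of them share a class
exactly when they belong to the same character sets.
The name
.Fn build_equiv_classes
and the limit of 32 sets are assumptions made for the sketch.

```c
#include <stdint.h>
#include <string.h>

#define NCHARS 256

/*
 * Illustrative only; not flex's actual algorithm.  sets[i] lists the
 * members of the i-th character class as a NUL-terminated string, and
 * nsets must be at most 32.  On return, ec[c] holds the 1-based
 * equivalence class of character c; the number of classes is returned.
 */
int
build_equiv_classes(const char *const sets[], int nsets, int ec[NCHARS])
{
	uint32_t sig[NCHARS];	/* bit i set: character is in sets[i] */
	uint32_t seen[NCHARS];	/* signature of each class found so far */
	const char *p;
	int nclasses = 0;
	int c, i, k;

	memset(sig, 0, sizeof(sig));
	for (i = 0; i < nsets && i < 32; i++)
		for (p = sets[i]; *p != '\0'; p++)
			sig[(unsigned char)*p] |= (uint32_t)1 << i;

	for (c = 0; c < NCHARS; c++) {
		/* Reuse an existing class with the same signature, if any. */
		for (k = 0; k < nclasses; k++)
			if (seen[k] == sig[c])
				break;
		if (k == nclasses)
			seen[nclasses++] = sig[c];
		ec[c] = k + 1;
	}
	return nclasses;
}
```

With a single set containing the ten digits, the sketch yields two
classes: one for the digits and one for every other character.
The scanner tables can then be indexed by class number instead of by
character, which is where the size reduction comes from.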
   2507 .Pp
   2508 A lone
   2509 .Fl C
   2510 specifies that the scanner tables should be compressed but neither
   2511 equivalence classes nor meta-equivalence classes should be used.
   2512 .Pp
   2513 The options
   2514 .Fl Cf
   2515 or
   2516 .Fl CF
   2517 and
   2518 .Fl \&Cm
   2519 do not make sense together \- there is no opportunity for meta-equivalence
   2520 classes if the table is not being compressed.
   2521 Otherwise the options may be freely mixed, and are cumulative.
   2522 .Pp
   2523 The default setting is
   2524 .Fl Cem
   2525 which specifies that
   2526 .Nm
   2527 should generate equivalence classes and meta-equivalence classes.
   2528 This setting provides the highest degree of table compression.
   2529 It is possible to obtain faster-executing scanners at the cost of
   2530 larger tables, with the following generally being true:
   2531 .Bd -unfilled -offset indent
   2532 slowest & smallest
   2533       -Cem
   2534       -Cm
   2535       -Ce
   2536       -C
   2537       -C{f,F}e
   2538       -C{f,F}
   2539       -C{f,F}a
   2540 fastest & largest
   2541 .Ed
   2542 .Pp
   2543 Note that scanners with the smallest tables are usually generated and
   2544 compiled the quickest,
   2545 so during development the default,
   2546 maximal compression, is usually best.
   2547 .Pp
   2548 .Fl Cfe
   2549 is often a good compromise between speed and size for production scanners.
   2550 .It Fl d
   2551 Makes the generated scanner run in debug mode.
   2552 Whenever a pattern is recognized and the global
   2553 .Fa yy_flex_debug
   2554 is non-zero
   2555 .Pq which is the default ,
   2556 the scanner will write to stderr a line of the form:
   2557 .Pp
   2558 .D1 --accepting rule at line 53 ("the matched text")
   2559 .Pp
   2560 The line number refers to the location of the rule in the file
   2561 defining the scanner
   2562 (i.e., the file that was fed to
   2563 .Nm ) .
   2564 Messages are also generated when the scanner backs up,
   2565 accepts the default rule,
   2566 reaches the end of its input buffer
   2567 (or encounters a NUL;
   2568 at this point, the two look the same as far as the scanner's concerned),
   2569 or reaches an end-of-file.
   2570 .It Fl F
   2571 Specifies that the fast scanner table representation should be used
   2572 .Pq and stdio bypassed .
   2573 This representation is about as fast as the full table representation
   2574 .Pq Fl f ,
   2575 and for some sets of patterns will be considerably smaller
   2576 .Pq and for others, larger .
   2577 In general, if the pattern set contains both
   2578 .Qq keywords
   2579 and a catch-all,
   2580 .Qq identifier
   2581 rule, such as in the set:
   2582 .Bd -unfilled -offset indent
   2583 "case"    return TOK_CASE;
   2584 "switch"  return TOK_SWITCH;
   2585 \&...
   2586 "default" return TOK_DEFAULT;
   2587 [a-z]+    return TOK_ID;
   2588 .Ed
   2589 .Pp
   2590 then it's better to use the full table representation.
   2591 If only the
   2592 .Qq identifier
   2593 rule is present and a hash table or some such is used to detect the keywords,
   2594 it's better to use
   2595 .Fl F .
   2596 .Pp
   2597 This option is equivalent to
   2598 .Fl CFr
   2599 .Pq see above .
   2600 It cannot be used with
   2601 .Fl + .
   2602 .It Fl f
   2603 Specifies
   2604 .Em fast scanner .
   2605 No table compression is done and stdio is bypassed.
   2606 The result is large but fast.
   2607 This option is equivalent to
   2608 .Fl Cfr
   2609 .Pq see above .
   2610 .It Fl h
   2611 Generates a help summary of
   2612 .Nm flex Ns 's
   2613 options to stdout and then exits.
   2614 .Fl ?\&
   2615 and
   2616 .Fl Fl help
   2617 are synonyms for
   2618 .Fl h .
   2619 .It Fl I
   2620 Instructs
   2621 .Nm
   2622 to generate an
   2623 .Em interactive
   2624 scanner.
   2625 An interactive scanner is one that only looks ahead to decide
   2626 what token has been matched if it absolutely must.
   2627 It turns out that always looking one extra character ahead,
   2628 even if the scanner has already seen enough text
   2629 to disambiguate the current token, is a bit faster than
   2630 only looking ahead when necessary.
   2631 But scanners that always look ahead give dreadful interactive performance;
   2632 for example, when a user types a newline,
   2633 it is not recognized as a newline token until they enter
   2634 .Em another
   2635 token, which often means typing in another whole line.
   2636 .Pp
   2637 .Nm
   2638 scanners default to
   2639 .Em interactive
   2640 unless
   2641 .Fl Cf
   2642 or
   2643 .Fl CF
   2644 table-compression options are specified
   2645 .Pq see above .
   2646 That's because if high performance is most important,
   2647 one of these options should be used,
   2648 so if they weren't,
   2649 .Nm
   2650 assumes it is preferable to trade off a bit of run-time performance for
   2651 intuitive interactive behavior.
   2652 Note also that
   2653 .Fl I
   2654 cannot be used in conjunction with
   2655 .Fl Cf
   2656 or
   2657 .Fl CF .
   2658 Thus, this option is not really needed; it is on by default for all those
   2659 cases in which it is allowed.
   2660 .Pp
   2661 A scanner can be forced to not be interactive by using
   2662 .Fl B
   2663 .Pq see above .
   2664 .It Fl i
   2665 Instructs
   2666 .Nm
   2667 to generate a case-insensitive scanner.
   2668 The case of letters given in the
   2669 .Nm
   2670 input patterns will be ignored,
   2671 and tokens in the input will be matched regardless of case.
   2672 The matched text given in
   2673 .Fa yytext
   2674 will preserve the case of the input
   2675 .Pq i.e., it will not be folded .
   2676 .It Fl L
   2677 Instructs
   2678 .Nm
   2679 not to generate
   2680 .Dq #line
   2681 directives.
   2682 Without this option,
   2683 .Nm
   2684 peppers the generated scanner with #line directives so error messages
   2685 in the actions will be correctly located with respect to either the original
   2686 .Nm
   2687 input file
   2688 (if the errors are due to code in the input file),
   2689 or
   2690 .Pa lex.yy.c
   2691 (if the errors are
   2692 .Nm flex Ns 's
   2693 fault \- these sorts of errors should be reported to the email address
   2694 given below).
   2695 .It Fl l
   2696 Turns on maximum compatibility with the original
   2697 .At
   2698 .Nm lex
   2699 implementation.
   2700 Note that this does not mean full compatibility.
   2701 Use of this option costs a considerable amount of performance,
   2702 and it cannot be used with the
   2703 .Fl + , f , F , Cf ,
   2704 or
   2705 .Fl CF
   2706 options.
   2707 For details on the compatibilities it provides, see the section
   2708 .Sx INCOMPATIBILITIES WITH LEX AND POSIX
   2709 below.
   2710 This option also results in the name
   2711 .Dv YY_FLEX_LEX_COMPAT
   2712 being #define'd in the generated scanner.
   2713 .It Fl n
   2714 Another do-nothing, deprecated option included only for
   2715 .Tn POSIX
   2716 compliance.
   2717 .It Fl o Ns Ar output
   2718 Directs
   2719 .Nm
   2720 to write the scanner to the file
   2721 .Ar output
   2722 instead of
   2723 .Pa lex.yy.c .
   2724 If
   2725 .Fl o
   2726 is combined with the
   2727 .Fl t
   2728 option, then the scanner is written to stdout but its
   2729 .Dq #line
   2730 directives
   2731 (see the
   2732 .Fl L
   2733 option above)
   2734 refer to the file
   2735 .Ar output .
   2736 .It Fl P Ns Ar prefix
   2737 Changes the default
   2738 .Qq yy
   2739 prefix used by
   2740 .Nm
   2741 for all globally visible variable and function names to instead be
   2742 .Ar prefix .
   2743 For example,
   2744 .Fl P Ns Ar foo
   2745 changes the name of
   2746 .Fa yytext
   2747 to
   2748 .Fa footext .
   2749 It also changes the name of the default output file from
   2750 .Pa lex.yy.c
   2751 to
   2752 .Pa lex.foo.c .
   2753 Here are all of the names affected:
   2754 .Bd -unfilled -offset indent
   2755 yy_create_buffer
   2756 yy_delete_buffer
   2757 yy_flex_debug
   2758 yy_init_buffer
   2759 yy_flush_buffer
   2760 yy_load_buffer_state
   2761 yy_switch_to_buffer
   2762 yyin
   2763 yyleng
   2764 yylex
   2765 yylineno
   2766 yyout
   2767 yyrestart
   2768 yytext
   2769 yywrap
   2770 .Ed
   2771 .Pp
   2772 (If using a C++ scanner, then only
   2773 .Fa yywrap
   2774 and
   2775 .Fa yyFlexLexer
   2776 are affected.)
   2777 Within the scanner itself, it is still possible to refer to the global variables
   2778 and functions using either version of their name; but externally, they
   2779 have the modified name.
   2780 .Pp
   2781 This option allows multiple
   2782 .Nm
   2783 programs to be easily linked together into the same executable.
   2784 Note, though, that using this option also renames
   2785 .Fn yywrap ,
   2786 so now either an
   2787 .Pq appropriately named
   2788 version of the routine for the scanner must be supplied, or
   2789 .Dq %option noyywrap
   2790 must be used, as linking with
   2791 .Fl lfl
   2792 no longer provides one by default.
   2793 .It Fl p
   2794 Generates a performance report to stderr.
   2795 The report consists of comments regarding features of the
   2796 .Nm
   2797 input file which will cause a serious loss of performance in the resulting
   2798 scanner.
   2799 If the flag is specified twice,
   2800 comments regarding features that lead to minor performance losses
   2801 will also be reported.
   2802 .Pp
   2803 Note that the use of
   2804 .Em REJECT ,
   2805 .Dq %option yylineno ,
   2806 and variable trailing context
   2807 (see the
   2808 .Sx BUGS
   2809 section below)
   2810 entails a substantial performance penalty; use of
   2811 .Fn yymore ,
   2812 the
   2813 .Sq ^
   2814 operator, and the
   2815 .Fl I
   2816 flag entail minor performance penalties.
   2817 .It Fl S Ns Ar skeleton
   2818 Overrides the default skeleton file from which
   2819 .Nm
   2820 constructs its scanners.
   2821 This option is needed only for
   2822 .Nm
   2823 maintenance or development.
   2824 .It Fl s
   2825 Causes the default rule
   2826 .Pq that unmatched scanner input is echoed to stdout
   2827 to be suppressed.
   2828 If the scanner encounters input that does not
   2829 match any of its rules, it aborts with an error.
   2830 This option is useful for finding holes in a scanner's rule set.
   2831 .It Fl T
   2832 Makes
   2833 .Nm
   2834 run in
   2835 .Em trace
   2836 mode.
   2837 It will generate a lot of messages to stderr concerning
   2838 the form of the input and the resultant non-deterministic and deterministic
   2839 finite automata.
   2840 This option is mostly for use in maintaining
   2841 .Nm .
   2842 .It Fl t
   2843 Instructs
   2844 .Nm
   2845 to write the scanner it generates to standard output instead of
   2846 .Pa lex.yy.c .
   2847 .It Fl V
   2848 Prints the version number to stdout and exits.
   2849 .Fl Fl version
   2850 is a synonym for
   2851 .Fl V .
   2852 .It Fl v
   2853 Specifies that
   2854 .Nm
   2855 should write to stderr
   2856 a summary of statistics regarding the scanner it generates.
   2857 Most of the statistics are meaningless to the casual
   2858 .Nm
   2859 user, but the first line identifies the version of
   2860 .Nm
   2861 (same as reported by
   2862 .Fl V ) ,
   2863 and the next line the flags used when generating the scanner,
   2864 including those that are on by default.
   2865 .It Fl w
   2866 Suppresses warning messages.
   2867 .It Fl +
   2868 Specifies that
   2869 .Nm
   2870 should generate a C++ scanner class.
   2871 See the section on
   2872 .Sx GENERATING C++ SCANNERS
   2873 below for details.
   2874 .El
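.Pp
The discussion of
.Fl F
above mentions matching only an
.Qq identifier
rule and detecting keywords with a hash table
.Qq or some such .
The sketch below uses a sorted table and
.Xr bsearch 3
in place of a hash table.
The token codes and the name
.Fn classify_identifier
are hypothetical; a real scanner would take its token codes from
.Pa y.tab.h .

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical token codes standing in for those from y.tab.h. */
enum { TOK_ID = 256, TOK_CASE, TOK_DEFAULT, TOK_SWITCH };

struct keyword {
	const char *name;
	int token;
};

/* Must stay sorted by name for bsearch(). */
static const struct keyword keywords[] = {
	{ "case", TOK_CASE },
	{ "default", TOK_DEFAULT },
	{ "switch", TOK_SWITCH },
};

/* Compare the search key (a string) with a table entry's name. */
static int
keyword_cmp(const void *key, const void *elem)
{
	return strcmp(key, ((const struct keyword *)elem)->name);
}

/* Return the keyword's token code, or TOK_ID if text is not a keyword. */
int
classify_identifier(const char *text)
{
	const struct keyword *kw;

	kw = bsearch(text, keywords, sizeof(keywords) / sizeof(keywords[0]),
	    sizeof(keywords[0]), keyword_cmp);
	return kw != NULL ? kw->token : TOK_ID;
}

/*
 * The scanner then needs only the catch-all rule:
 *
 *	[a-z]+	return classify_identifier(yytext);
 */
```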
   2875 .Pp
   2876 .Nm
   2877 also provides a mechanism for controlling options within the
   2878 scanner specification itself, rather than from the
   2879 .Nm
   2880 command line.
   2881 This is done by including
   2882 .Dq %option
   2883 directives in the first section of the scanner specification.
   2884 Multiple options can be specified with a single
   2885 .Dq %option
   2886 directive, and multiple directives can appear in the first section of the
   2887 .Nm
   2888 input file.
   2889 .Pp
   2890 Most options are given simply as names, optionally preceded by the word
   2891 .Qq no
   2892 .Pq with no intervening whitespace
   2893 to negate their meaning.
   2894 A number are equivalent to
   2895 .Nm
   2896 flags or their negation:
   2897 .Bd -unfilled -offset indent
   2898 7bit            -7 option
   2899 8bit            -8 option
   2900 align           -Ca option
   2901 backup          -b option
   2902 batch           -B option
   2903 c++             -+ option
   2904 
   2905 caseful or
   2906 case-sensitive  opposite of -i (default)
   2907 
   2908 case-insensitive or
   2909 caseless        -i option
   2910 
   2911 debug           -d option
   2912 default         opposite of -s option
   2913 ecs             -Ce option
   2914 fast            -F option
   2915 full            -f option
   2916 interactive     -I option
   2917 lex-compat      -l option
   2918 meta-ecs        -Cm option
   2919 perf-report     -p option
   2920 read            -Cr option
   2921 stdout          -t option
   2922 verbose         -v option
   2923 warn            opposite of -w option
   2924                 (use "%option nowarn" for -w)
   2925 
   2926 array           equivalent to "%array"
   2927 pointer         equivalent to "%pointer" (default)
   2928 .Ed
   2929 .Pp
   2930 Some %option's provide features otherwise not available:
   2931 .Bl -tag -width Ds
   2932 .It always-interactive
   2933 Instructs
   2934 .Nm
   2935 to generate a scanner which always considers its input
   2936 .Qq interactive .
   2937 Normally, on each new input file the scanner calls
   2938 .Fn isatty
   2939 in an attempt to determine whether the scanner's input source is interactive
   2940 and thus should be read a character at a time.
   2941 When this option is used, however, no such call is made.
   2942 .It main
   2943 Directs
   2944 .Nm
   2945 to provide a default
   2946 .Fn main
   2947 program for the scanner, which simply calls
   2948 .Fn yylex .
   2949 This option implies
   2950 .Dq noyywrap
   2951 .Pq see below .
   2952 .It never-interactive
   2953 Instructs
   2954 .Nm
   2955 to generate a scanner which never considers its input
   2956 .Qq interactive
   2957 (again, no call made to
   2958 .Fn isatty ) .
   2959 This is the opposite of
   2960 .Dq always-interactive .
   2961 .It stack
   2962 Enables the use of start condition stacks
   2963 (see
   2964 .Sx START CONDITIONS
   2965 above).
   2966 .It stdinit
   2967 If set (i.e.,
   2968 .Dq %option stdinit ) ,
   2969 initializes
   2970 .Fa yyin
   2971 and
   2972 .Fa yyout
   2973 to stdin and stdout, instead of the default of
   2974 .Dq nil .
   2975 Some existing
   2976 .Nm lex
   2977 programs depend on this behavior, even though it is not compliant with ANSI C,
   2978 which does not require stdin and stdout to be compile-time constant.
   2979 .It yylineno
   2980 Directs
   2981 .Nm
   2982 to generate a scanner that maintains the number of the current line
   2983 read from its input in the global variable
   2984 .Fa yylineno .
   2985 This option is implied by
   2986 .Dq %option lex-compat .
   2987 .It yywrap
   2988 If unset (i.e.,
   2989 .Dq %option noyywrap ) ,
   2990 makes the scanner not call
   2991 .Fn yywrap
   2992 upon an end-of-file, but simply assume that there are no more files to scan
   2993 (until the user points
   2994 .Fa yyin
   2995 at a new file and calls
   2996 .Fn yylex
   2997 again).
   2998 .El
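.Pp
Where
.Dq %option yylineno
is not wanted, a scanner author can keep a line count by hand.
The sketch below shows one possible approach and is not a description
of how
.Nm
implements
.Fa yylineno :
it counts the newlines in each matched token.
The names
.Fn count_newlines
and
.Va my_lineno
are hypothetical, and text consumed through
.Fn input
or pushed back with
.Fn unput
is not accounted for.

```c
/*
 * Hypothetical helper: return how many newline characters occur in the
 * first leng bytes of text.
 */
int
count_newlines(const char *text, int leng)
{
	int i, n = 0;

	for (i = 0; i < leng; i++)
		if (text[i] == '\n')
			n++;
	return n;
}

/*
 * In the first section of the scanner one might then write:
 *
 *	static int my_lineno = 1;
 *	#define YY_USER_ACTION my_lineno += count_newlines(yytext, yyleng);
 */
```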
   2999 .Pp
   3000 .Nm
   3001 scans rule actions to determine whether the
   3002 .Em REJECT
   3003 or
   3004 .Fn yymore
   3005 features are being used.
   3006 The
   3007 .Dq reject
   3008 and
   3009 .Dq yymore
   3010 options are available to override its decision as to whether to use the
   3011 options, either by setting them (e.g.,
   3012 .Dq %option reject )
   3013 to indicate the feature is indeed used,
   3014 or unsetting them to indicate it actually is not used
   3015 (e.g.,
   3016 .Dq %option noyymore ) .
   3017 .Pp
   3018 Three options take string-delimited values, offset with
   3019 .Sq = :
   3020 .Pp
   3021 .D1 %option outfile="ABC"
   3022 .Pp
   3023 is equivalent to
   3024 .Fl o Ns Ar ABC ,
   3025 and
   3026 .Pp
   3027 .D1 %option prefix="XYZ"
   3028 .Pp
   3029 is equivalent to
   3030 .Fl P Ns Ar XYZ .
   3031 Finally,
   3032 .Pp
   3033 .D1 %option yyclass="foo"
   3034 .Pp
   3035 only applies when generating a C++ scanner
   3036 .Pf ( Fl +
   3037 option).
   3038 It informs
   3039 .Nm
   3040 that
   3041 .Dq foo
   3042 has been derived as a subclass of yyFlexLexer, so
   3043 .Nm
   3044 will place actions in the member function
   3045 .Dq foo::yylex()
   3046 instead of
   3047 .Dq yyFlexLexer::yylex() .
   3048 It also generates a
   3049 .Dq yyFlexLexer::yylex()
   3050 member function that emits a run-time error (by invoking
   3051 .Dq yyFlexLexer::LexerError() )
   3052 if called.
   3053 See
   3054 .Sx GENERATING C++ SCANNERS ,
   3055 below, for additional information.
   3056 .Pp
   3057 A number of options are available for
   3058 lint
   3059 purists who want to suppress the appearance of unneeded routines
   3060 in the generated scanner.
   3061 Each of the following, if unset
   3062 (e.g.,
   3063 .Dq %option nounput ) ,
   3064 results in the corresponding routine not appearing in the generated scanner:
   3065 .Bd -unfilled -offset indent
   3066 input, unput
   3067 yy_push_state, yy_pop_state, yy_top_state
   3068 yy_scan_buffer, yy_scan_bytes, yy_scan_string
   3069 .Ed
   3070 .Pp
   3071 (though
   3072 .Fn yy_push_state
   3073 and friends won't appear anyway unless
   3074 .Dq %option stack
   3075 is being used).
   3076 .Sh PERFORMANCE CONSIDERATIONS
   3077 The main design goal of
   3078 .Nm
   3079 is that it generate high-performance scanners.
   3080 It has been optimized for dealing well with large sets of rules.
   3081 Aside from the effects on scanner speed of the table compression
   3082 .Fl C
   3083 options outlined above,
   3084 there are a number of options/actions which degrade performance.
   3085 These are, from most expensive to least:
   3086 .Bd -unfilled -offset indent
   3087 REJECT
   3088 %option yylineno
   3089 arbitrary trailing context
   3090 
   3091 pattern sets that require backing up
   3092 %array
   3093 %option interactive
   3094 %option always-interactive
   3095 
   3096 \&'^' beginning-of-line operator
   3097 yymore()
   3098 .Ed
   3099 .Pp
   3100 with the first three all being quite expensive
   3101 and the last two being quite cheap.
   3102 Note also that
   3103 .Fn unput
   3104 is implemented as a routine call that potentially does quite a bit of work,
   3105 while
   3106 .Fn yyless
   3107 is a quite-cheap macro; so if just putting back some excess text,
   3108 use
   3109 .Fn yyless .
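.Pp
As a minimal sketch of that advice, the rule below deliberately matches
one character too many and hands it back with
.Fn yyless ;
the pattern and the token name
.Dv TOK_INT
are illustrative assumptions, not something
.Nm
defines:
.Bd -literal -offset indent
%%
[0-9]+"."  {
        /* keep the digits, return the trailing "." to the input */
        yyless(yyleng - 1);
        return TOK_INT;
}
.Ed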
   3110 .Pp
   3111 .Em REJECT
   3112 should be avoided at all costs when performance is important.
   3113 It is a particularly expensive option.
   3114 .Pp
   3115 Getting rid of backing up is messy and often may be an enormous
   3116 amount of work for a complicated scanner.
   3117 In principle, one begins by using the
   3118 .Fl b
   3119 flag to generate a
   3120 .Pa lex.backup
   3121 file.
   3122 For example, on the input
   3123 .Bd -literal -offset indent
   3124 %%
   3125 foo        return TOK_KEYWORD;
   3126 foobar     return TOK_KEYWORD;
   3127 .Ed
   3128 .Pp
   3129 the file looks like:
   3130 .Bd -literal -offset indent
   3131 State #6 is non-accepting -
   3132  associated rule line numbers:
   3133        2       3
   3134  out-transitions: [ o ]
   3135  jam-transitions: EOF [ \e001-n  p-\e177 ]
   3136 
   3137 State #8 is non-accepting -
   3138  associated rule line numbers:
   3139        3
   3140  out-transitions: [ a ]
   3141  jam-transitions: EOF [ \e001-`  b-\e177 ]
   3142 
   3143 State #9 is non-accepting -
   3144  associated rule line numbers:
   3145        3
   3146  out-transitions: [ r ]
   3147  jam-transitions: EOF [ \e001-q  s-\e177 ]
   3148 
   3149 Compressed tables always back up.
   3150 .Ed
   3151 .Pp
   3152 The first few lines tell us that there's a scanner state in
   3153 which it can make a transition on an
   3154 .Sq o
   3155 but not on any other character,
   3156 and that in that state the currently scanned text does not match any rule.
   3157 The state occurs when trying to match the rules found
   3158 at lines 2 and 3 in the input file.
   3159 If the scanner is in that state and then reads something other than an
   3160 .Sq o ,
   3161 it will have to back up to find a rule which is matched.
   3162 With a bit of headscratching one can see that this must be the
   3163 state it's in when it has seen
   3164 .Sq fo .
   3165 When this has happened, if anything other than another
   3166 .Sq o
   3167 is seen, the scanner will have to back up to simply match the
   3168 .Sq f
   3169 .Pq by the default rule .
   3170 .Pp
   3171 The comment regarding State #8 indicates there's a problem when
   3172 .Qq foob
   3173 has been scanned.
   3174 Indeed, on any character other than an
   3175 .Sq a ,
   3176 the scanner will have to back up to accept
   3177 .Qq foo .
   3178 Similarly, the comment for State #9 concerns when
   3179 .Qq fooba
   3180 has been scanned and an
   3181 .Sq r
   3182 does not follow.
   3183 .Pp
   3184 The final comment reminds us that there's no point going to
   3185 all the trouble of removing backing up from the rules unless we're using
   3186 .Fl Cf
   3187 or
   3188 .Fl CF ,
   3189 since there's no performance gain doing so with compressed scanners.
   3190 .Pp
   3191 The way to remove the backing up is to add
   3192 .Qq error
   3193 rules:
   3194 .Bd -literal -offset indent
   3195 %%
   3196 foo    return TOK_KEYWORD;
   3197 foobar return TOK_KEYWORD;
   3198 
   3199 fooba  |
   3200 foob   |
   3201 fo {
   3202         /* false alarm, not really a keyword */
   3203         return TOK_ID;
   3204 }
   3205 .Ed
   3206 .Pp
   3207 Eliminating backing up among a list of keywords can also be done using a
   3208 .Qq catch-all
   3209 rule:
   3210 .Bd -literal -offset indent
   3211 %%
   3212 foo    return TOK_KEYWORD;
   3213 foobar return TOK_KEYWORD;
   3214 
   3215 [a-z]+ return TOK_ID;
   3216 .Ed
   3217 .Pp
   3218 This is usually the best solution when appropriate.
   3219 .Pp
   3220 Backing up messages tend to cascade.
   3221 With a complicated set of rules it's not uncommon to get hundreds of messages.
   3222 If one can decipher them, though,
   3223 it often only takes a dozen or so rules to eliminate the backing up
   3224 (though it's easy to make a mistake and have an error rule accidentally match
   3225 a valid token; a possible future
   3226 .Nm
   3227 feature will be to automatically add rules to eliminate backing up).
   3228 .Pp
   3229 It's important to keep in mind that the benefits of eliminating
   3230 backing up are gained only if
   3231 .Em every
   3232 instance of backing up is eliminated.
   3233 Leaving just one gains nothing.
   3234 .Pp
   3235 .Em Variable
   3236 trailing context
   3237 (where neither the leading nor the trailing part has a fixed length)
   3238 entails almost the same performance loss as
   3239 .Em REJECT
   3240 .Pq i.e., substantial .
   3241 So when possible a rule like:
   3242 .Bd -literal -offset indent
   3243 %%
   3244 mouse|rat/(cat|dog)   run();
   3245 .Ed
   3246 .Pp
   3247 is better written:
   3248 .Bd -literal -offset indent
   3249 %%
   3250 mouse/cat|dog         run();
   3251 rat/cat|dog           run();
   3252 .Ed
   3253 .Pp
   3254 or as
   3255 .Bd -literal -offset indent
   3256 %%
   3257 mouse|rat/cat         run();
   3258 mouse|rat/dog         run();
   3259 .Ed
   3260 .Pp
   3261 Note that here the special
   3262 .Sq |\&
   3263 action does not provide any savings, and can even make things worse (see
   3264 .Sx BUGS
   3265 below).
   3266 .Pp
   3267 Another area where the user can increase a scanner's performance
   3268 .Pq and one that's easier to implement
   3269 arises from the fact that the longer the tokens matched,
   3270 the faster the scanner will run.
   3271 This is because with long tokens the processing of most input
   3272 characters takes place in the
   3273 .Pq short
   3274 inner scanning loop, and does not often have to go through the additional work
   3275 of setting up the scanning environment (e.g.,
   3276 .Fa yytext )
   3277 for the action.
   3278 Recall the scanner for C comments:
   3279 .Bd -literal -offset indent
   3280 %x comment
   3281 %%
   3282 int line_num = 1;
   3283 
   3284 "/*"                    BEGIN(comment);
   3285 
   3286 <comment>[^*\en]*
   3287 <comment>"*"+[^*/\en]*
   3288 <comment>\en             ++line_num;
   3289 <comment>"*"+"/"        BEGIN(INITIAL);
   3290 .Ed
   3291 .Pp
   3292 This could be sped up by writing it as:
   3293 .Bd -literal -offset indent
   3294 %x comment
   3295 %%
   3296 int line_num = 1;
   3297 
   3298 "/*"                    BEGIN(comment);
   3299 
   3300 <comment>[^*\en]*
   3301 <comment>[^*\en]*\en      ++line_num;
   3302 <comment>"*"+[^*/\en]*
   3303 <comment>"*"+[^*/\en]*\en ++line_num;
   3304 <comment>"*"+"/"        BEGIN(INITIAL);
   3305 .Ed
   3306 .Pp
   3307 Now instead of each newline requiring the processing of another action,
   3308 recognizing the newlines is
   3309 .Qq distributed
   3310 over the other rules to keep the matched text as long as possible.
   3311 Note that adding rules does
   3312 .Em not
   3313 slow down the scanner!
   3314 The speed of the scanner is independent of the number of rules or
   3315 (modulo the considerations given at the beginning of this section)
   3316 how complicated the rules are with regard to operators such as
   3317 .Sq *
   3318 and
   3319 .Sq |\& .
   3320 .Pp
   3321 A final example in speeding up a scanner:
   3322 scan through a file containing identifiers and keywords, one per line
   3323 and with no other extraneous characters, and recognize all the keywords.
   3324 A natural first approach is:
   3325 .Bd -literal -offset indent
   3326 %%
   3327 asm      |
   3328 auto     |
   3329 break    |
   3330 \&... etc ...
   3331 volatile |
   3332 while    /* it's a keyword */
   3333 
   3334 \&.|\en     /* it's not a keyword */
   3335 .Ed
   3336 .Pp
   3337 To eliminate the backing up, introduce a catch-all rule:
   3338 .Bd -literal -offset indent
   3339 %%
   3340 asm      |
   3341 auto     |
   3342 break    |
   3343 \&... etc ...
   3344 volatile |
   3345 while    /* it's a keyword */
   3346 
   3347 [a-z]+   |
   3348 \&.|\en     /* it's not a keyword */
   3349 .Ed
   3350 .Pp
   3351 Now, if it's guaranteed that there's exactly one word per line,
   3352 then we can reduce the total number of matches by a half by
   3353 merging in the recognition of newlines with that of the other tokens:
   3354 .Bd -literal -offset indent
   3355 %%
   3356 asm\en      |
   3357 auto\en     |
   3358 break\en    |
   3359 \&... etc ...
   3360 volatile\en |
   3361 while\en    /* it's a keyword */
   3362 
   3363 [a-z]+\en   |
   3364 \&.|\en       /* it's not a keyword */
   3365 .Ed
   3366 .Pp
   3367 One has to be careful here,
   3368 as we have now reintroduced backing up into the scanner.
   3369 In particular, while we know that there will never be any characters
   3370 in the input stream other than letters or newlines,
   3371 .Nm
   3372 can't figure this out, and it will plan for possibly needing to back up
   3373 when it has scanned a token like
   3374 .Qq auto
   3375 and then the next character is something other than a newline or a letter.
   3376 Previously it would then just match the
   3377 .Qq auto
   3378 rule and be done, but now it has no
   3379 .Qq auto
   3380 rule, only an
   3381 .Qq auto\en
   3382 rule.
   3383 To eliminate the possibility of backing up,
   3384 we could either duplicate all rules but without final newlines, or,
   3385 since we never expect to encounter such an input and therefore don't
   3386 care how it's classified, we can introduce one more catch-all rule,
   3387 one which doesn't include a newline:
   3388 .Bd -literal -offset indent
   3389 %%
   3390 asm\en      |
   3391 auto\en     |
   3392 break\en    |
   3393 \&... etc ...
   3394 volatile\en |
   3395 while\en    /* it's a keyword */
   3396 
   3397 [a-z]+\en   |
   3398 [a-z]+     |
   3399 \&.|\en       /* it's not a keyword */
   3400 .Ed
   3401 .Pp
   3402 Compiled with
   3403 .Fl Cf ,
   3404 this is about as fast as one can get a
   3405 .Nm
   3406 scanner to go for this particular problem.
   3407 .Pp
   3408 A final note:
   3409 .Nm
   3410 is slow when matching NUL's,
   3411 particularly when a token contains multiple NUL's.
   3412 It's best to write rules which match short
   3413 amounts of text if it's anticipated that the text will often include NUL's.
   3414 .Pp
   3415 Another final note regarding performance: as mentioned above in the section
   3416 .Sx HOW THE INPUT IS MATCHED ,
   3417 dynamically resizing
   3418 .Fa yytext
   3419 to accommodate huge tokens is a slow process because it presently requires that
   3420 the
   3421 .Pq huge
   3422 token be rescanned from the beginning.
   3423 Thus if performance is vital, it is better to attempt to match
   3424 .Qq large
   3425 quantities of text but not
   3426 .Qq huge
   3427 quantities, where the cutoff between the two is at about 8K characters/token.
   3428 .Sh GENERATING C++ SCANNERS
   3429 .Nm
   3430 provides two different ways to generate scanners for use with C++.
   3431 The first way is to simply compile a scanner generated by
   3432 .Nm
   3433 using a C++ compiler instead of a C compiler.
   3434 This should not generate any compilation errors
   3435 (please report any found to the email address given in the
   3436 .Sx AUTHORS
   3437 section below).
   3438 C++ code can then be used in rule actions instead of C code.
   3439 Note that the default input source for scanners remains
   3440 .Fa yyin ,
   3441 and default echoing is still done to
   3442 .Fa yyout .
   3443 Both of these remain
   3444 .Fa FILE *
   3445 variables and not C++ streams.
   3446 .Pp
   3447 .Nm
   3448 can also be used to generate a C++ scanner class, using the
   3449 .Fl +
   3450 option (or, equivalently,
   3451 .Dq %option c++ ) ,
   3452 which is automatically specified if the name of the flex executable ends in a
   3453 .Sq + ,
   3454 such as
   3455 .Nm flex++ .
   3456 When using this option,
   3457 .Nm
   3458 defaults to generating the scanner to the file
   3459 .Pa lex.yy.cc
   3460 instead of
   3461 .Pa lex.yy.c .
   3462 The generated scanner includes the header file
   3463 .Aq Pa g++/FlexLexer.h ,
   3464 which defines the interface to two C++ classes.
   3465 .Pp
   3466 The first class,
   3467 .Em FlexLexer ,
   3468 provides an abstract base class defining the general scanner class interface.
   3469 It provides the following member functions:
   3470 .Bl -tag -width Ds
   3471 .It const char* YYText()
   3472 Returns the text of the most recently matched token, the equivalent of
   3473 .Fa yytext .
   3474 .It int YYLeng()
   3475 Returns the length of the most recently matched token, the equivalent of
   3476 .Fa yyleng .
   3477 .It int lineno() const
   3478 Returns the current input line number
   3479 (see
   3480 .Dq %option yylineno ) ,
   3481 or 1 if
   3482 .Dq %option yylineno
   3483 was not used.
   3484 .It void set_debug(int flag)
   3485 Sets the debugging flag for the scanner, equivalent to assigning to
   3486 .Fa yy_flex_debug
   3487 (see the
   3488 .Sx OPTIONS
   3489 section above).
   3490 Note that the scanner must be built using
   3491 .Dq %option debug
   3492 to include debugging information in it.
   3493 .It int debug() const
   3494 Returns the current setting of the debugging flag.
   3495 .El
   3496 .Pp
   3497 Also provided are member functions equivalent to
   3498 .Fn yy_switch_to_buffer ,
   3499 .Fn yy_create_buffer
   3500 (though the first argument is an
   3501 .Fa std::istream*
   3502 object pointer and not a
   3503 .Fa FILE* ) ,
   3504 .Fn yy_flush_buffer ,
   3505 .Fn yy_delete_buffer ,
   3506 and
   3507 .Fn yyrestart
   3508 (again, the first argument is an
   3509 .Fa std::istream*
   3510 object pointer).
   3511 .Pp
   3512 The second class defined in
   3513 .Aq Pa g++/FlexLexer.h
   3514 is
   3515 .Fa yyFlexLexer ,
   3516 which is derived from
   3517 .Fa FlexLexer .
   3518 It defines the following additional member functions:
   3519 .Bl -tag -width Ds
   3520 .It "yyFlexLexer(std::istream* arg_yyin = 0, std::ostream* arg_yyout = 0)"
   3521 Constructs a
   3522 .Fa yyFlexLexer
   3523 object using the given streams for input and output.
   3524 If not specified, the streams default to
   3525 .Fa cin
   3526 and
   3527 .Fa cout ,
   3528 respectively.
   3529 .It virtual int yylex()
   3530 Performs the same role as
   3531 .Fn yylex
   3532 does for ordinary flex scanners: it scans the input stream, consuming
   3533 tokens, until a rule's action returns a value.
   3534 If subclass
   3535 .Sq S
   3536 is derived from
   3537 .Fa yyFlexLexer ,
   3538 in order to access the member functions and variables of
   3539 .Sq S
   3540 inside
   3541 .Fn yylex ,
   3542 use
   3543 .Dq %option yyclass="S"
   3544 to inform
   3545 .Nm
   3546 that the
   3547 .Sq S
   3548 subclass will be used instead of
   3549 .Fa yyFlexLexer .
   3550 In this case, rather than generating
   3551 .Dq yyFlexLexer::yylex() ,
   3552 .Nm
   3553 generates
   3554 .Dq S::yylex()
   3555 (and also generates a dummy
   3556 .Dq yyFlexLexer::yylex()
   3557 that calls
   3558 .Dq yyFlexLexer::LexerError()
   3559 if called).
   3560 .It "virtual void switch_streams(std::istream* new_in = 0, std::ostream* new_out = 0)"
   3561 Reassigns
   3562 .Fa yyin
   3563 to
   3564 .Fa new_in
   3565 .Pq if non-nil
   3566 and
   3567 .Fa yyout
   3568 to
   3569 .Fa new_out
   3570 .Pq ditto ,
   3571 deleting the previous input buffer if
   3572 .Fa yyin
   3573 is reassigned.
   3574 .It int yylex(std::istream* new_in, std::ostream* new_out = 0)
   3575 First switches the input streams via
   3576 .Dq switch_streams(new_in, new_out)
   3577 and then returns the value of
   3578 .Fn yylex .
   3579 .El
   3580 .Pp
   3581 In addition,
   3582 .Fa yyFlexLexer
   3583 defines the following protected virtual functions which can be redefined
   3584 in derived classes to tailor the scanner:
   3585 .Bl -tag -width Ds
   3586 .It virtual int LexerInput(char* buf, int max_size)
   3587 Reads up to
   3588 .Fa max_size
   3589 characters into
   3590 .Fa buf
   3591 and returns the number of characters read.
   3592 To indicate end-of-input, return 0 characters.
   3593 Note that
   3594 .Qq interactive
   3595 scanners (see the
   3596 .Fl B
   3597 and
   3598 .Fl I
   3599 flags) define the macro
   3600 .Dv YY_INTERACTIVE .
   3601 If
   3602 .Fn LexerInput
   3603 has been redefined, and it's necessary to take different actions depending on
   3604 whether or not the scanner might be scanning an interactive input source,
   3605 it's possible to test for the presence of this name via
   3606 .Dq #ifdef .
   3607 .It virtual void LexerOutput(const char* buf, int size)
   3608 Writes out
   3609 .Fa size
   3610 characters from the buffer
   3611 .Fa buf ,
   3612 which, while NUL-terminated, may also contain
   3613 .Qq internal
   3614 NUL's if the scanner's rules can match text with NUL's in them.
   3615 .It virtual void LexerError(const char* msg)
   3616 Reports a fatal error message.
   3617 The default version of this function writes the message to the stream
   3618 .Fa cerr
   3619 and exits.
   3620 .El
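.Pp
The following is a sketch of how
.Fn LexerInput
might be redefined, assuming the signature listed above
(other
.Nm
versions may declare it differently).
The class name
.Fa StringLexer
and its members are hypothetical; the class feeds the scanner from an
in-memory string rather than from a stream:
.Bd -literal -offset indent
#include <algorithm>
#include <cstring>
#include <string>
#include <g++/FlexLexer.h>

// Hypothetical subclass: scans an in-memory string.
class StringLexer : public yyFlexLexer {
public:
	explicit StringLexer(const std::string& text)
	    : text_(text), pos_(0) {}

protected:
	// Copy at most max_size characters into buf;
	// returning 0 signals end-of-input.
	virtual int LexerInput(char* buf, int max_size)
	{
		if (max_size <= 0)
			return 0;
		std::string::size_type left = text_.size() - pos_;
		std::string::size_type n = std::min(left,
		    static_cast<std::string::size_type>(max_size));
		std::memcpy(buf, text_.data() + pos_, n);
		pos_ += n;
		return static_cast<int>(n);
	}

private:
	std::string text_;
	std::string::size_type pos_;
};
.Ed
.Pp
Because
.Fn LexerInput
is virtual, calling
.Fn yylex
on such an object reads from the string without any change to the rules.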
   3621 .Pp
   3622 Note that a
   3623 .Fa yyFlexLexer
   3624 object contains its entire scanning state.
   3625 Thus such objects can be used to create reentrant scanners.
   3626 Multiple instances of the same
   3627 .Fa yyFlexLexer
   3628 class can be instantiated, and multiple C++ scanner classes can be combined
   3629 in the same program using the
   3630 .Fl P
   3631 option discussed above.
   3632 .Pp
   3633 Finally, note that the
   3634 .Dq %array
   3635 feature is not available to C++ scanner classes;
   3636 .Dq %pointer
   3637 must be used
   3638 .Pq the default .
   3639 .Pp
   3640 Here is an example of a simple C++ scanner:
   3641 .Bd -literal -offset indent
   3642 // An example of using the flex C++ scanner class.
   3643 
   3644 %{
   3645 #include <errno.h>
   3646 int mylineno = 0;
   3647 %}
   3648 
   3649 string  \e"[^\en"]+\e"
   3650 
   3651 ws      [ \et]+
   3652 
   3653 alpha   [A-Za-z]
   3654 dig     [0-9]
   3655 name    ({alpha}|{dig}|\e$)({alpha}|{dig}|[_.\e-/$])*
   3656 num1    [-+]?{dig}+\e.?([eE][-+]?{dig}+)?
   3657 num2    [-+]?{dig}*\e.{dig}+([eE][-+]?{dig}+)?
   3658 number  {num1}|{num2}
   3659 
   3660 %%
   3661 
   3662 {ws}    /* skip blanks and tabs */
   3663 
   3664 "/*" {
   3665         int c;
   3666 
   3667         while ((c = yyinput()) != 0) {
   3668                 if(c == '\en')
   3669                     ++mylineno;
   3670                 else if(c == '*') {
   3671                     if ((c = yyinput()) == '/')
   3672                         break;
   3673                     else
   3674                         unput(c);
   3675                 }
   3676         }
   3677 }
   3678 
   3679 {number}  std::cout << "number " << YYText() << '\en';
   3680 
   3681 \en        mylineno++;
   3682 
   3683 {name}    std::cout << "name " << YYText() << '\en';
   3684 
   3685 {string}  std::cout << "string " << YYText() << '\en';
   3686 
   3687 %%
   3688 
   3689 int main(int /* argc */, char** /* argv */)
   3690 {
   3691 	FlexLexer* lexer = new yyFlexLexer;
   3692 	while(lexer->yylex() != 0)
   3693 	    ;
   3694 	return 0;
   3695 }
   3696 .Ed
   3697 .Pp
   3698 To create multiple
   3699 .Pq different
   3700 lexer classes, use the
   3701 .Fl P
   3702 flag
   3703 (or the
   3704 .Dq prefix=
   3705 option)
   3706 to rename each
   3707 .Fa yyFlexLexer
   3708 to some other
   3709 .Fa xxFlexLexer .
   3710 .Aq Pa g++/FlexLexer.h
   3711 can then be included in other sources once per lexer class, first renaming
   3712 .Fa yyFlexLexer
   3713 as follows:
   3714 .Bd -literal -offset indent
   3715 #undef yyFlexLexer
   3716 #define yyFlexLexer xxFlexLexer
   3717 #include <g++/FlexLexer.h>
   3718 
   3719 #undef yyFlexLexer
   3720 #define yyFlexLexer zzFlexLexer
   3721 #include <g++/FlexLexer.h>
   3722 .Ed
   3723 .Pp
   3724 This renaming assumes that
   3725 .Dq %option prefix="xx"
   3726 is used for one scanner and
   3727 .Dq %option prefix="zz"
   3728 is used for the other.
   3729 .Pp
   3730 .Sy IMPORTANT :
   3731 the present form of the scanning class is experimental
   3732 and may change considerably between major releases.
   3733 .Sh INCOMPATIBILITIES WITH LEX AND POSIX
   3734 .Nm
   3735 is a rewrite of the
   3736 .At
   3737 .Nm lex
   3738 tool
   3739 (the two implementations do not share any code, though),
   3740 with some extensions and incompatibilities, both of which are of concern
   3741 to those who wish to write scanners acceptable to either implementation.
   3742 .Nm
   3743 is fully compliant with the
   3744 .Tn POSIX
   3745 .Nm lex
   3746 specification, except that when using
   3747 .Dq %pointer
   3748 .Pq the default ,
   3749 a call to
   3750 .Fn unput
   3751 destroys the contents of
   3752 .Fa yytext ,
   3753 which is counter to the
   3754 .Tn POSIX
   3755 specification.
   3756 .Pp
   3757 In this section we discuss all of the known areas of incompatibility between
   3758 .Nm ,
   3759 .At
   3760 .Nm lex ,
   3761 and the
   3762 .Tn POSIX
   3763 specification.
   3764 .Pp
   3765 .Nm flex Ns 's
   3766 .Fl l
   3767 option turns on maximum compatibility with the original
   3768 .At
   3769 .Nm lex
   3770 implementation, at the cost of a major loss in the generated scanner's
   3771 performance.
   3772 We note below which incompatibilities can be overcome using the
   3773 .Fl l
   3774 option.
   3775 .Pp
   3776 .Nm
   3777 is fully compatible with
   3778 .Nm lex
   3779 with the following exceptions:
   3780 .Bl -dash
   3781 .It
   3782 The undocumented
   3783 .Nm lex
   3784 scanner internal variable
   3785 .Fa yylineno
   3786 is not supported unless
   3787 .Fl l
   3788 or
   3789 .Dq %option yylineno
   3790 is used.
   3791 .Pp
   3792 .Fa yylineno
   3793 should be maintained on a per-buffer basis, rather than a per-scanner
   3794 .Pq single global variable
   3795 basis.
   3796 .Pp
   3797 .Fa yylineno
   3798 is not part of the
   3799 .Tn POSIX
   3800 specification.
   3801 .It
   3802 The
   3803 .Fn input
   3804 routine is not redefinable, though it may be called to read characters
   3805 following whatever has been matched by a rule.
   3806 If
   3807 .Fn input
   3808 encounters an end-of-file, the normal
   3809 .Fn yywrap
   3810 processing is done.
   3811 A
   3812 .Dq real
   3813 end-of-file is returned by
   3814 .Fn input
   3815 as
   3816 .Dv EOF .
   3817 .Pp
   3818 Input is instead controlled by defining the
   3819 .Dv YY_INPUT
   3820 macro.
   3821 .Pp
   3822 The
   3823 .Nm
   3824 restriction that
   3825 .Fn input
   3826 cannot be redefined is in accordance with the
   3827 .Tn POSIX
   3828 specification, which simply does not specify any way of controlling the
   3829 scanner's input other than by making an initial assignment to
   3830 .Fa yyin .
   3831 .It
   3832 The
   3833 .Fn unput
   3834 routine is not redefinable.
   3835 This restriction is in accordance with
   3836 .Tn POSIX .
   3837 .It
   3838 .Nm
   3839 scanners are not as reentrant as
   3840 .Nm lex
   3841 scanners.
   3842 In particular, if a scanner is interactive and
   3843 an interrupt handler long-jumps out of the scanner,
   3844 and the scanner is subsequently called again,
   3845 the following error message may be displayed:
   3846 .Pp
   3847 .D1 fatal flex scanner internal error--end of buffer missed
   3848 .Pp
   3849 To reenter the scanner, first use
   3850 .Pp
   3851 .Dl yyrestart(yyin);
   3852 .Pp
   3853 Note that this call will throw away any buffered input;
   3854 usually this isn't a problem with an interactive scanner.
   3855 .Pp
   3856 Also note that flex C++ scanner classes are reentrant,
   3857 so if using C++ is an option, they should be used instead.
   3858 See
   3859 .Sx GENERATING C++ SCANNERS
   3860 above for details.
   3861 .It
   3862 .Fn output
   3863 is not supported.
   3864 Output from the
   3865 .Em ECHO
   3866 macro is done to the file-pointer
   3867 .Fa yyout
   3868 .Pq default stdout .
   3869 .Pp
   3870 .Fn output
   3871 is not part of the
   3872 .Tn POSIX
   3873 specification.
   3874 .It
   3875 .Nm lex
   3876 does not support exclusive start conditions
   3877 .Pq %x ,
   3878 though they are in the
   3879 .Tn POSIX
   3880 specification.
   3881 .It
   3882 When definitions are expanded,
   3883 .Nm
   3884 encloses them in parentheses.
   3885 With
   3886 .Nm lex ,
   3887 the following:
   3888 .Bd -literal -offset indent
   3889 NAME    [A-Z][A-Z0-9]*
   3890 %%
   3891 foo{NAME}?      printf("Found it\en");
   3892 %%
   3893 .Ed
   3894 .Pp
   3895 will not match the string
   3896 .Qq foo
   3897 because when the macro is expanded the rule is equivalent to
   3898 .Qq foo[A-Z][A-Z0-9]*?
   3899 and the precedence is such that the
   3900 .Sq ?\&
   3901 is associated with
   3902 .Qq [A-Z0-9]* .
   3903 With
   3904 .Nm ,
   3905 the rule will be expanded to
   3906 .Qq foo([A-Z][A-Z0-9]*)?
   3907 and so the string
   3908 .Qq foo
   3909 will match.
   3910 .Pp
   3911 Note that if the definition begins with
   3912 .Sq ^
   3913 or ends with
   3914 .Sq $
   3915 then it is not expanded with parentheses, to allow these operators to appear in
   3916 definitions without losing their special meanings.
   3917 But the
   3918 .Sq Aq s ,
   3919 .Sq / ,
   3920 and
   3921 .Aq Aq EOF
   3922 operators cannot be used in a
   3923 .Nm
   3924 definition.
   3925 .Pp
   3926 Using
   3927 .Fl l
   3928 results in the
   3929 .Nm lex
   3930 behavior of no parentheses around the definition.
   3931 .Pp
   3932 The
   3933 .Tn POSIX
   3934 specification is that the definition be enclosed in parentheses.
   3935 .It
   3936 Some implementations of
   3937 .Nm lex
   3938 allow a rule's action to begin on a separate line,
   3939 if the rule's pattern has trailing whitespace:
   3940 .Bd -literal -offset indent
   3941 %%
   3942 foo|bar<space here>
   3943   { foobar_action(); }
   3944 .Ed
   3945 .Pp
   3946 .Nm
   3947 does not support this feature.
   3948 .It
   3949 The
   3950 .Nm lex
   3951 .Sq %r
   3952 .Pq generate a Ratfor scanner
   3953 option is not supported.
   3954 It is not part of the
   3955 .Tn POSIX
   3956 specification.
   3957 .It
   3958 After a call to
   3959 .Fn unput ,
   3960 .Fa yytext
   3961 is undefined until the next token is matched,
   3962 unless the scanner was built using
   3963 .Dq %array .
   3964 This is not the case with
   3965 .Nm lex
   3966 or the
   3967 .Tn POSIX
   3968 specification.
   3969 The
   3970 .Fl l
   3971 option does away with this incompatibility.
   3972 .It
   3973 The precedence of the
   3974 .Sq {}
   3975 .Pq numeric range
   3976 operator is different.
   3977 .Nm lex
   3978 interprets
   3979 .Qq abc{1,3}
   3980 as match one, two, or three occurrences of
   3981 .Sq abc ,
   3982 whereas
   3983 .Nm
   3984 interprets it as match
   3985 .Sq ab
   3986 followed by one, two, or three occurrences of
   3987 .Sq c .
   3988 The latter is in agreement with the
   3989 .Tn POSIX
   3990 specification.
   3991 .It
   3992 The precedence of the
   3993 .Sq ^
   3994 operator is different.
   3995 .Nm lex
   3996 interprets
   3997 .Qq ^foo|bar
   3998 as match either
   3999 .Sq foo
   4000 at the beginning of a line, or
   4001 .Sq bar
   4002 anywhere, whereas
   4003 .Nm
   4004 interprets it as match either
   4005 .Sq foo
   4006 or
   4007 .Sq bar
   4008 if they come at the beginning of a line.
   4009 The latter is in agreement with the
   4010 .Tn POSIX
   4011 specification.
   4012 .It
   4013 The special table-size declarations such as
   4014 .Sq %a
   4015 supported by
   4016 .Nm lex
   4017 are not required by
   4018 .Nm
   4019 scanners;
   4020 .Nm
   4021 ignores them.
   4022 .It
   4023 The name
   4024 .Dv FLEX_SCANNER
   4025 is #define'd so scanners may be written for use with either
   4026 .Nm
   4027 or
   4028 .Nm lex .
   4029 Scanners also include
   4030 .Dv YY_FLEX_MAJOR_VERSION
   4031 and
   4032 .Dv YY_FLEX_MINOR_VERSION
   4033 indicating which version of
   4034 .Nm
   4035 generated the scanner
   4036 (for example, for the 2.5 release, these defines would be 2 and 5,
   4037 respectively).
   4038 .El
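.Pp
The
.Dv FLEX_SCANNER
name from the last item can guard calls that exist only in
.Nm
when the same source is also built with
.Nm lex .
The fragment below is a sketch of the interrupt recovery described in the
reentrancy item above.
It is meant for the user code section of a scanner; the
.Va env
buffer, the handler that jumps to it, and the function name
.Fn scan_again
are assumptions made for illustration:
.Bd -literal -offset indent
#include <setjmp.h>

/* hypothetical: an interrupt handler longjmp()s to env */
static jmp_buf env;

int
scan_again(void)
{
	if (setjmp(env) != 0) {
#ifdef FLEX_SCANNER
		/* flex only: discard buffered input before rescanning */
		yyrestart(yyin);
#endif
	}
	return yylex();
}
.Ed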
   4039 .Pp
   4040 The following
   4041 .Nm
   4042 features are not included in
   4043 .Nm lex
   4044 or the
   4045 .Tn POSIX
   4046 specification:
   4047 .Bd -unfilled -offset indent
   4048 C++ scanners
   4049 %option
   4050 start condition scopes
   4051 start condition stacks
   4052 interactive/non-interactive scanners
   4053 yy_scan_string() and friends
   4054 yyterminate()
   4055 yy_set_interactive()
   4056 yy_set_bol()
   4057 YY_AT_BOL()
   4058 <<EOF>>
   4059 <*>
   4060 YY_DECL
   4061 YY_START
   4062 YY_USER_ACTION
   4063 YY_USER_INIT
   4064 #line directives
   4065 %{}'s around actions
   4066 multiple actions on a line
   4067 .Ed
   4068 .Pp
   4069 plus almost all of the
   4070 .Nm
   4071 flags.
   4072 The last feature in the list refers to the fact that with
   4073 .Nm
   4074 multiple actions can be placed on the same line,
   4075 separated with semi-colons, while with
   4076 .Nm lex ,
   4077 the following
   4078 .Pp
   4079 .Dl foo    handle_foo(); ++num_foos_seen;
   4080 .Pp
   4081 is
   4082 .Pq rather surprisingly
   4083 truncated to
   4084 .Pp
   4085 .Dl foo    handle_foo();
   4086 .Pp
   4087 .Nm
   4088 does not truncate the action.
   4089 Actions that are not enclosed in braces
   4090 are simply terminated at the end of the line.
   4091 .Sh FILES
   4092 .Bl -tag -width "<g++/FlexLexer.h>"
   4093 .It flex.skl
   4094 Skeleton scanner.
   4095 This file is only used when building flex, not when
   4096 .Nm
   4097 executes.
   4098 .It lex.backup
   4099 Backing-up information for the
   4100 .Fl b
   4101 flag (called
   4102 .Pa lex.bck
   4103 on some systems).
   4104 .It lex.yy.c
   4105 Generated scanner
   4106 (called
   4107 .Pa lexyy.c
   4108 on some systems).
   4109 .It lex.yy.cc
   4110 Generated C++ scanner class, when using
   4111 .Fl + .
   4112 .It Aq g++/FlexLexer.h
   4113 Header file defining the C++ scanner base class,
   4114 .Fa FlexLexer ,
   4115 and its derived class,
   4116 .Fa yyFlexLexer .
   4117 .It /usr/lib/libl.*
   4118 .Nm
   4119 libraries.
   4120 The
   4121 .Pa /usr/lib/libfl.*\&
   4122 libraries are links to these.
   4123 Scanners must be linked using either
   4124 .Fl \&ll
   4125 or
   4126 .Fl lfl .
   4127 .El
   4128 .Sh EXIT STATUS
   4129 .Ex -std flex
   4130 .Sh DIAGNOSTICS
   4131 .Bl -diag
   4132 .It warning, rule cannot be matched
   4133 Indicates that the given rule cannot be matched because it follows other rules
   4134 that will always match the same text as it.
   4135 For example, in the following
   4136 .Dq foo
   4137 cannot be matched because it comes after an identifier
   4138 .Qq catch-all
   4139 rule:
   4140 .Bd -literal -offset indent
   4141 [a-z]+    got_identifier();
   4142 foo       got_foo();
   4143 .Ed
   4144 .Pp
   4145 Using
   4146 .Em REJECT
   4147 in a scanner suppresses this warning.
   4148 .It "warning, \-s option given but default rule can be matched"
   4149 Means that it is possible
   4150 .Pq perhaps only in a particular start condition
   4151 that the default rule
   4152 .Pq match any single character
   4153 is the only one that will match a particular input.
   4154 Since
   4155 .Fl s
   4156 was given, presumably this is not intended.
   4157 .It reject_used_but_not_detected undefined
   4158 .It yymore_used_but_not_detected undefined
   4159 These errors can occur at compile time.
   4160 They indicate that the scanner uses
   4161 .Em REJECT
   4162 or
   4163 .Fn yymore
   4164 but that
   4165 .Nm
   4166 failed to notice the fact, meaning that
   4167 .Nm
   4168 scanned the first two sections looking for occurrences of these actions
   4169 and failed to find any, but somehow they snuck in
   4170 .Pq via an #include file, for example .
   4171 Use
   4172 .Dq %option reject
   4173 or
   4174 .Dq %option yymore
   4175 to indicate to
   4176 .Nm
   4177 that these features are really needed.
   4178 .It flex scanner jammed
   4179 A scanner compiled with
   4180 .Fl s
   4181 has encountered an input string which wasn't matched by any of its rules.
   4182 This error can also occur due to internal problems.
   4183 .It token too large, exceeds YYLMAX
   4184 The scanner uses
   4185 .Dq %array
   4186 and one of its rules matched a string longer than the
   4187 .Dv YYLMAX
   4188 constant
   4189 .Pq 8K bytes by default .
   4190 The value can be increased by #define'ing
   4191 .Dv YYLMAX
   4192 in the definitions section of
   4193 .Nm
   4194 input.
   4195 .It "scanner requires \-8 flag to use the character 'x'"
   4196 The scanner specification includes recognizing the 8-bit character
   4197 .Sq x
   4198 and the
   4199 .Fl 8
   4200 flag was not specified, so the scanner defaulted to 7-bit because the
   4201 .Fl Cf
   4202 or
   4203 .Fl CF
   4204 table compression options were used.
   4205 See the discussion of the
   4206 .Fl 7
   4207 flag for details.
   4208 .It flex scanner push-back overflow
   4209 unput() was used to push back so much text that the scanner's buffer
   4210 could not hold both the pushed-back text and the current token in
   4211 .Fa yytext .
   4212 Ideally the scanner should dynamically resize the buffer in this case,
   4213 but at present it does not.
   4214 .It "input buffer overflow, can't enlarge buffer because scanner uses REJECT"
   4215 The scanner was working on matching an extremely large token and needed
   4216 to expand the input buffer.
   4217 This doesn't work with scanners that use
   4218 .Em REJECT .
   4219 .It "fatal flex scanner internal error--end of buffer missed"
   4220 This can occur in a scanner which is reentered after a long-jump
   4221 has jumped out
   4222 .Pq or over
   4223 the scanner's activation frame.
   4224 Before reentering the scanner, use:
   4225 .Pp
   4226 .Dl yyrestart(yyin);
   4227 .Pp
   4228 or, as noted above, switch to using the C++ scanner class.
   4229 .It "too many start conditions in <> construct!"
   4230 More start conditions than exist were listed in a <> construct
   4231 (so at least one of them must have been listed twice).
   4232 .El
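.Pp
As a hedged illustration of the
.Dv YYLMAX
diagnostic above, a scanner using
.Dq %array
might raise the limit in its definitions section as follows.
The value 16384 and the sample rule are arbitrary assumptions
chosen for the example:
.Bd -literal -offset indent
%{
/* Assumed limit: twice the 8K default. */
#define YYLMAX 16384
%}
%array
%%
[a-z]+    ;    /* long matches are now bounded by the larger YYLMAX */
.Ed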
   4233 .Sh SEE ALSO
   4234 .Xr awk 1 ,
   4235 .Xr sed 1 ,
   4236 .Xr yacc 1
   4237 .Rs
   4238 .%A John Levine
   4239 .%A Tony Mason
   4240 .%A Doug Brown
   4241 .%B Lex & Yacc
   4242 .%I O'Reilly and Associates
   4243 .%N 2nd edition
   4244 .Re
   4245 .Rs
   4246 .%A Alfred Aho
   4247 .%A Ravi Sethi
   4248 .%A Jeffrey Ullman
   4249 .%B Compilers: Principles, Techniques and Tools
   4250 .%I Addison-Wesley
   4251 .%D 1986
   4252 .%O "Describes the pattern-matching techniques used by flex (deterministic finite automata)"
   4253 .Re
   4254 .Sh STANDARDS
   4255 The
   4256 .Nm lex
   4257 utility is compliant with the
   4258 .St -p1003.1-2008
   4259 specification,
   4260 though its presence is optional.
   4261 .Pp
   4262 The flags
   4263 .Op Fl 78BbCdFfhIiLloPpSsTVw+? ,
   4264 .Op Fl -help ,
   4265 and
   4266 .Op Fl -version
   4267 are extensions to that specification.
   4268 .Pp
   4269 See also the
   4270 .Sx INCOMPATIBILITIES WITH LEX AND POSIX
   4271 section, above.
   4272 .Sh AUTHORS
   4273 Vern Paxson, with the help of many ideas and much inspiration from
   4274 Van Jacobson.
   4275 Original version by Jef Poskanzer.
   4276 The fast table representation is a partial implementation of a design done by
   4277 Van Jacobson.
   4278 The implementation was done by Kevin Gong and Vern Paxson.
   4279 .Pp
   4280 Thanks to the many
   4281 .Nm
   4282 beta-testers, feedbackers, and contributors, especially Francois Pinard,
   4283 Casey Leedom,
   4284 Robert Abramovitz,
   4285 Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai,
   4286 Neal Becker, Nelson H.F. Beebe, benson@odi.com,
   4287 Karl Berry, Peter A. Bigot, Simon Blanchard,
   4288 Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher,
   4289 Brian Clapper, J.T. Conklin,
   4290 Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David
   4291 Daniels, Chris G. Demetriou, Theo de Raadt,
   4292 Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin,
   4293 Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl,
   4294 Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz,
   4295 Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel,
   4296 Jan Hajic, Charles Hemphill, NORO Hideo,
   4297 Jarkko Hietaniemi, Scott Hofmann,
   4298 Jeff Honig, Dana Hudes, Eric Hughes, John Interrante,
   4299 Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones,
   4300 Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane,
   4301 Amir Katz, ken@ken.hilco.com, Kevin B. Kenny,
   4302 Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht,
   4303 Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle,
   4304 David Loffredo, Mike Long,
   4305 Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall,
   4306 Bengt Martensson, Chris Metcalf,
   4307 Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum,
   4308 G.T. Nicol, Landon Noll, James Nordby, Marc Nozell,
   4309 Richard Ohnemus, Karsten Pahnke,
   4310 Sven Panne, Roland Pesch, Walter Pelissero, Gaumond Pierre,
   4311 Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha,
   4312 Frederic Raimbault, Pat Rankin, Rick Richardson,
   4313 Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini,
   4314 Andreas Scherer, Darrell Schiebel, Raf Schietekat,
   4315 Doug Schmidt, Philippe Schnoebelen, Andreas Schwab,
   4316 Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Str\(:omquist,
   4317 Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor,
   4318 Chris Thewalt, Richard M. Timoney, Jodi Tsai,
   4319 Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams,
   4320 Ken Yap, Ron Zellar, Nathan Zelle, David Zuhn,
   4321 and those whose names have slipped my marginal mail-archiving skills
   4322 but whose contributions are appreciated all the
   4323 same.
   4324 .Pp
   4325 Thanks to Keith Bostic, Jon Forrest, Noah Friedman,
   4326 John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T.
   4327 Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various
   4328 distribution headaches.
   4329 .Pp
   4330 Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
   4331 to Benson Margulies and Fred Burke for C++ support;
   4332 to Kent Williams and Tom Epperly for C++ class support;
   4333 to Ove Ewerlid for support of NUL's;
   4334 and to Eric Hughes for support of multiple buffers.
   4335 .Pp
   4336 This work was primarily done when I was with the Real Time Systems Group
   4337 at the Lawrence Berkeley Laboratory in Berkeley, CA.
   4338 Many thanks to all there for the support I received.
   4339 .Pp
   4340 Send comments to
   4341 .Aq Mt vern@ee.lbl.gov .
   4342 .Sh BUGS
   4343 Some trailing context patterns cannot be properly matched and generate
   4344 warning messages
   4345 .Pq "dangerous trailing context" .
   4346 These are patterns where the ending of the first part of the rule
   4347 matches the beginning of the second part, such as
   4348 .Qq zx*/xy* ,
   4349 where the
   4350 .Sq x*
   4351 matches the
   4352 .Sq x
   4353 at the beginning of the trailing context.
   4354 (Note that the POSIX draft states that the text matched by such patterns
   4355 is undefined.)
   4356 .Pp
   4357 For some trailing context rules, parts which are actually fixed-length are
   4358 not recognized as such, leading to the above-mentioned performance loss.
   4359 In particular, parts using
   4360 .Sq |\&
   4361 or
   4362 .Sq {n}
   4363 (such as
   4364 .Qq foo{3} )
   4365 are always considered variable-length.
   4366 .Pp
   4367 Combining trailing context with the special
   4368 .Sq |\&
   4369 action can result in fixed trailing context being turned into
   4370 the more expensive variable trailing context.
   4371 This happens, for example, in the following:
   4372 .Bd -literal -offset indent
   4373 %%
   4374 abc      |
   4375 xyz/def
   4376 .Ed
   4377 .Pp
   4378 Use of
   4379 .Fn unput
   4380 invalidates yytext and yyleng, unless the
   4381 .Dq %array
   4382 directive
   4383 or the
   4384 .Fl l
   4385 option has been used.
   4386 .Pp
   4387 Pattern-matching of NUL's is substantially slower than matching other
   4388 characters.
   4389 .Pp
   4390 Dynamic resizing of the input buffer is slow, as it entails rescanning
   4391 all the text matched so far by the current
   4392 .Pq generally huge
   4393 token.
   4394 .Pp
   4395 Due to both buffering of input and read-ahead,
   4396 it is not possible to intermix calls to
   4397 .Aq Pa stdio.h
   4398 routines, such as
   4399 .Fn getchar ,
   4400 with
   4401 .Nm
   4402 rules and expect it to work.
   4403 Call
   4404 .Fn input
   4405 instead.
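.Pp
A minimal sketch of reading ahead from within an action by way of
.Fn input
rather than
.Fn getchar
follows.
The rule and its purpose
.Pq discarding the remainder of a line that begins with a hash mark
are assumptions made for illustration:
.Bd -literal -offset indent
%%
^"#"    {
            int c;
            /* Read through the scanner's own buffer, not stdio. */
            while ((c = input()) != '\en' && c != EOF)
                ;
        }
.Ed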
   4406 .Pp
   4407 The total number of table entries listed by the
   4408 .Fl v
   4409 flag excludes the number of table entries needed to determine
   4410 which rule has been matched.
   4411 The number of entries is equal to the number of DFA states
   4412 if the scanner does not use
   4413 .Em REJECT ,
   4414 and somewhat greater than the number of states if it does.
   4415 .Pp
   4416 .Em REJECT
   4417 cannot be used with the
   4418 .Fl f
   4419 or
   4420 .Fl F
   4421 options.
   4422 .Pp
   4423 The
   4424 .Nm
   4425 internal algorithms need documentation.