.\" $OpenBSD: flex.1,v 1.37 2014/03/23 16:28:29 jmc Exp $
.\"
.\" Copyright (c) 1990 The Regents of the University of California.
.\" All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Vern Paxson.
.\"
.\" The United States Government has rights in this work pursuant
.\" to contract no. DE-AC03-76SF00098 between the United States
.\" Department of Energy and the University of California.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\"
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\"
.\" Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR
.\" IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED
.\" WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
.\" PURPOSE.
.\"
.Dd $Mdocdate: March 23 2014 $
.Dt FLEX 1
.Os
.Sh NAME
.Nm flex
.Nd fast lexical analyzer generator
.Sh SYNOPSIS
.Nm
.Bk -words
.Op Fl 78BbdFfhIiLlnpsTtVvw+?
.Op Fl C Ns Op Cm aeFfmr
.Op Fl Fl help
.Op Fl Fl version
.Op Fl o Ns Ar output
.Op Fl P Ns Ar prefix
.Op Fl S Ns Ar skeleton
.Op Ar
.Ek
.Sh DESCRIPTION
.Nm
is a tool for generating
.Em scanners :
programs which recognize lexical patterns in text.
.Nm
reads the given input files, or its standard input if no file names are given,
for a description of a scanner to generate.
The description is in the form of pairs of regular expressions and C code,
called
.Em rules .
.Nm
generates as output a C source file,
.Pa lex.yy.c ,
which defines a routine
.Fn yylex .
This file is compiled and linked with the
.Fl lfl
library to produce an executable.
When the executable is run, it analyzes its input for occurrences
of the regular expressions.
Whenever it finds one, it executes the corresponding C code.
.Pp
The manual includes both tutorial and reference sections:
.Bl -ohang
.It Sy Some Simple Examples
.It Sy Format of the Input File
.It Sy Patterns
The extended regular expressions used by
.Nm .
.It Sy How the Input is Matched
The rules for determining what has been matched.
.It Sy Actions
How to specify what to do when a pattern is matched.
.It Sy The Generated Scanner
Details regarding the scanner that
.Nm
produces;
how to control the input source.
.It Sy Start Conditions
Introducing context into scanners, and managing
.Qq mini-scanners .
.It Sy Multiple Input Buffers
How to manipulate multiple input sources;
how to scan from strings instead of files.
.It Sy End-of-File Rules
Special rules for matching the end of the input.
.It Sy Miscellaneous Macros
A summary of macros available to the actions.
.It Sy Values Available to the User
A summary of values available to the actions.
.It Sy Interfacing with Yacc
Connecting flex scanners together with
.Xr yacc 1
parsers.
.It Sy Options
.Nm
command-line options, and the
.Dq %option
directive.
.It Sy Performance Considerations
How to make scanners go as fast as possible.
.It Sy Generating C++ Scanners
The
.Pq experimental
facility for generating C++ scanner classes.
.It Sy Incompatibilities with Lex and POSIX
How
.Nm
differs from
.At
.Nm lex
and the
.Tn POSIX
.Nm lex
standard.
.It Sy Files
Files used by
.Nm .
.It Sy Diagnostics
Those error messages produced by
.Nm
.Pq or scanners it generates
whose meanings might not be apparent.
.It Sy See Also
Other documentation, related tools.
.It Sy Authors
Includes contact information.
.It Sy Bugs
Known problems with
.Nm .
.El
.Sh SOME SIMPLE EXAMPLES
First some simple examples to get the flavor of how one uses
.Nm .
The following
.Nm
input specifies a scanner which, whenever it encounters the string
.Qq username ,
replaces it with the user's login name:
.Bd -literal -offset indent
%%
username    printf("%s", getlogin());
.Ed
.Pp
By default, any text not matched by a
.Nm
scanner is copied to the output, so the net effect of this scanner is
to copy its input file to its output with each occurrence of
.Qq username
expanded.
In this input, there is just one rule.
.Qq username
is the
.Em pattern
and the
.Qq printf
is the
.Em action .
The
.Qq %%
marks the beginning of the rules.
.Pp
Here's another simple example:
.Bd -literal -offset indent
%{
int num_lines = 0, num_chars = 0;
%}

%%
\en      ++num_lines; ++num_chars;
\&.       ++num_chars;

%%
int
main(void)
{
	yylex();
	printf("# of lines = %d, # of chars = %d\en",
	    num_lines, num_chars);
	return 0;
}
.Ed
.Pp
This scanner counts the number of characters and the number
of lines in its input
(it produces no output other than the final report on the counts).
The first line declares two globals,
.Qq num_lines
and
.Qq num_chars ,
which are accessible both inside
.Fn yylex
and in the
.Fn main
routine declared after the second
.Qq %% .
There are two rules, one which matches a newline
.Pq \&"\en\&"
and increments both the line count and the character count,
and one which matches any character other than a newline
(indicated by the
.Qq \&.
regular expression).
.Pp
A somewhat more complicated example:
.Bd -literal -offset indent
/* scanner for a toy Pascal-like language */

%{
/* need this for the calls to atoi() and atof() below */
#include <stdlib.h>
%}

DIGIT    [0-9]
ID       [a-z][a-z0-9]*

%%

{DIGIT}+ {
	printf("An integer: %s (%d)\en", yytext,
	    atoi(yytext));
}

{DIGIT}+"."{DIGIT}* {
	printf("A float: %s (%g)\en", yytext,
	    atof(yytext));
}

if|then|begin|end|procedure|function {
	printf("A keyword: %s\en", yytext);
}

{ID}    printf("An identifier: %s\en", yytext);

"+"|"-"|"*"|"/"    printf("An operator: %s\en", yytext);

"{"[^}\en]*"}"    /* eat up one-line comments */

[ \et\en]+    /* eat up whitespace */

\&.    printf("Unrecognized character: %s\en", yytext);

%%

int
main(int argc, char *argv[])
{
	++argv; --argc;    /* skip over program name */
	if (argc > 0)
		yyin = fopen(argv[0], "r");
	else
		yyin = stdin;

	yylex();
	return 0;
}
.Ed
.Pp
This is the beginning of a simple scanner for a language like Pascal.
It identifies different types of
.Em tokens
and reports on what it has seen.
.Pp
The details of this example will be explained in the following sections.
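.Pp
For comparison, the bookkeeping done by the line and character counting
scanner shown earlier in this section can be written by hand in plain C.
This is only a sketch to show what the two rules amount to; it is not the
code that
.Nm
generates, and the names
.Fn count_text
and
.Vt struct counts
are hypothetical.
It works on an in-memory buffer rather than on
.Fa yyin .

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type; not part of flex. */
struct counts {
	long lines;	/* bumped by the newline rule */
	long chars;	/* bumped by both rules */
};

/*
 * Apply the same bookkeeping as the scanner's two rules to a buffer:
 * a newline increments both counters, any other byte increments only
 * the character counter.
 */
struct counts
count_text(const char *buf, size_t len)
{
	struct counts c = { 0, 0 };
	size_t i;

	assert(buf != NULL);
	for (i = 0; i < len; i++) {
		if (buf[i] == '\n')
			++c.lines;
		++c.chars;
	}
	return c;
}
```

The scanner version is shorter because the loop, the input handling and the
dispatch on each character are all supplied by the generated
.Fn yylex .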
.Sh FORMAT OF THE INPUT FILE
The
.Nm
input file consists of three sections, separated by a line with just
.Qq %%
in it:
.Bd -unfilled -offset indent
definitions
%%
rules
%%
user code
.Ed
.Pp
The
.Em definitions
section contains declarations of simple
.Em name
definitions to simplify the scanner specification, and declarations of
.Em start conditions ,
which are explained in a later section.
.Pp
Name definitions have the form:
.Pp
.D1 name definition
.Pp
The
.Qq name
is a word beginning with a letter or an underscore
.Pq Sq _
followed by zero or more letters, digits,
.Sq _ ,
or
.Sq -
.Pq dash .
The definition is taken to begin at the first non-whitespace character
following the name and continuing to the end of the line.
The definition can subsequently be referred to using
.Qq {name} ,
which will expand to
.Qq (definition) .
For example:
.Bd -literal -offset indent
DIGIT    [0-9]
ID       [a-z][a-z0-9]*
.Ed
.Pp
This defines
.Qq DIGIT
to be a regular expression which matches a single digit, and
.Qq ID
to be a regular expression which matches a letter
followed by zero-or-more letters-or-digits.
A subsequent reference to
.Pp
.Dl {DIGIT}+"."{DIGIT}*
.Pp
is identical to
.Pp
.Dl ([0-9])+"."([0-9])*
.Pp
and matches one-or-more digits followed by a
.Sq .\&
followed by zero-or-more digits.
.Pp
The
.Em rules
section of the
.Nm
input contains a series of rules of the form:
.Pp
.Dl pattern action
.Pp
The pattern must be unindented and the action must begin
on the same line.
.Pp
See below for a further description of patterns and actions.
.Pp
Finally, the user code section is simply copied to
.Pa lex.yy.c
verbatim.
It is used for companion routines which call or are called by the scanner.
The presence of this section is optional;
if it is missing, the second
.Qq %%
in the input file may be skipped too.
.Pp
In the definitions and rules sections, any indented text or text enclosed in
.Sq %{
and
.Sq %}
is copied verbatim to the output
.Pq with the %{}'s removed .
The %{}'s must appear unindented on lines by themselves.
.Pp
In the rules section,
any indented or %{} text appearing before the first rule may be used to
declare variables which are local to the scanning routine and
.Pq after the declarations
code which is to be executed whenever the scanning routine is entered.
Other indented or %{} text in the rule section is still copied to the output,
but its meaning is not well-defined and it may well cause compile-time
errors (this feature is present for
.Tn POSIX
compliance; see below for other such features).
.Pp
In the definitions section
.Pq but not in the rules section ,
an unindented comment
(i.e., a line beginning with
.Qq /* )
is also copied verbatim to the output up to the next
.Qq */ .
.Sh PATTERNS
The patterns in the input are written using an extended set of regular
expressions.
These are:
.Bl -tag -width "XXXXXXXX"
.It x
Match the character
.Sq x .
.It .\&
Any character
.Pq byte
except newline.
.It [xyz]
A
.Qq character class ;
in this case, the pattern matches either an
.Sq x ,
a
.Sq y ,
or a
.Sq z .
.It [abj-oZ]
A
.Qq character class
with a range in it; matches an
.Sq a ,
a
.Sq b ,
any letter from
.Sq j
through
.Sq o ,
or a
.Sq Z .
.It [^A-Z]
A
.Qq negated character class ,
i.e., any character but those in the class.
In this case, any character EXCEPT an uppercase letter.
.It [^A-Z\en]
Any character EXCEPT an uppercase letter or a newline.
.It r*
Zero or more r's, where
.Sq r
is any regular expression.
.It r+
One or more r's.
.It r?
Zero or one r's (that is,
.Qq an optional r ) .
.It r{2,5}
Anywhere from two to five r's.
.It r{2,}
Two or more r's.
.It r{4}
Exactly 4 r's.
.It {name}
The expansion of the
.Qq name
definition
.Pq see above .
.It \&"[xyz]\e\&"foo\&"
The literal string: [xyz]"foo.
.It \eX
If
.Sq X
is an
.Sq a ,
.Sq b ,
.Sq f ,
.Sq n ,
.Sq r ,
.Sq t ,
or
.Sq v ,
then the ANSI-C interpretation of
.Sq \eX .
Otherwise, a literal
.Sq X
(used to escape operators such as
.Sq * ) .
.It \e0
A NUL character
.Pq ASCII code 0 .
.It \e123
The character with octal value 123.
.It \ex2a
The character with hexadecimal value 2a.
.It (r)
Match an
.Sq r ;
parentheses are used to override precedence
.Pq see below .
.It rs
The regular expression
.Sq r
followed by the regular expression
.Sq s ;
called
.Qq concatenation .
.It r|s
Either an
.Sq r
or an
.Sq s .
.It r/s
An
.Sq r ,
but only if it is followed by an
.Sq s .
The text matched by
.Sq s
is included when determining whether this rule is the
.Qq longest match ,
but is then returned to the input before the action is executed.
So the action only sees the text matched by
.Sq r .
This type of pattern is called
.Qq trailing context .
(There are some combinations of r/s that
.Nm
cannot match correctly; see notes in the
.Sx BUGS
section below regarding
.Qq dangerous trailing context . )
.It ^r
An
.Sq r ,
but only at the beginning of a line
(i.e., just starting to scan, or right after a newline has been scanned).
.It r$
An
.Sq r ,
but only at the end of a line
.Pq i.e., just before a newline .
Equivalent to
.Qq r/\en .
.Pp
Note that
.Nm flex Ns 's
notion of
.Qq newline
is exactly whatever the C compiler used to compile
.Nm
interprets
.Sq \en
as.
.\" In particular, on some DOS systems you must either filter out \er's in the
.\" input yourself, or explicitly use r/\er\en for
.\" .Qq r$ .
.It <s>r
An
.Sq r ,
but only in start condition
.Sq s
.Pq see below for discussion of start conditions .
.It <s1,s2,s3>r
The same, but in any of start conditions s1, s2, or s3.
.It <*>r
An
.Sq r
in any start condition, even an exclusive one.
.It <<EOF>>
An end-of-file.
.It <s1,s2><<EOF>>
An end-of-file when in start condition s1 or s2.
.El
.Pp
Note that inside of a character class, all regular expression operators
lose their special meaning except escape
.Pq Sq \e
and the character class operators,
.Sq - ,
.Sq ]\& ,
and, at the beginning of the class,
.Sq ^ .
.Pp
The regular expressions listed above are grouped according to
precedence, from highest precedence at the top to lowest at the bottom.
Those grouped together have equal precedence.
For example,
.Pp
.D1 foo|bar*
.Pp
is the same as
.Pp
.D1 (foo)|(ba(r*))
.Pp
since the
.Sq *
operator has higher precedence than concatenation,
and concatenation higher than alternation
.Pq Sq |\& .
This pattern therefore matches
.Em either
the string
.Qq foo
.Em or
the string
.Qq ba
followed by zero-or-more r's.
To match
.Qq foo
or zero-or-more "bar"'s,
use:
.Pp
.D1 foo|(bar)*
.Pp
and to match zero-or-more "foo"'s-or-"bar"'s:
.Pp
.D1 (foo|bar)*
.Pp
In addition to characters and ranges of characters, character classes
can also contain character class
.Em expressions .
These are expressions enclosed inside
.Sq [:
and
.Sq :]
delimiters (which themselves must appear between the
.Sq \&[
and
.Sq ]\&
of the
character class; other elements may occur inside the character class, too).
The valid expressions are:
.Bd -unfilled -offset indent
[:alnum:] [:alpha:] [:blank:]
[:cntrl:] [:digit:] [:graph:]
[:lower:] [:print:] [:punct:]
[:space:] [:upper:] [:xdigit:]
.Ed
.Pp
These expressions all designate a set of characters equivalent to
the corresponding standard C
.Fn isXXX
function.
For example, [:alnum:] designates those characters for which
.Xr isalnum 3
returns true \- i.e., any alphabetic or numeric character.
Some systems don't provide
.Xr isblank 3 ,
so
.Nm
defines [:blank:] as a blank or a tab.
.Pp
For example, the following character classes are all equivalent:
.Bd -unfilled -offset indent
[[:alnum:]]
[[:alpha:][:digit:]]
[[:alpha:]0-9]
[a-zA-Z0-9]
.Ed
.Pp
If the scanner is case-insensitive (the
.Fl i
flag), then [:upper:] and [:lower:] are equivalent to [:alpha:].
.Pp
Some notes on patterns:
.Bl -dash
.It
A negated character class such as the example
.Qq [^A-Z]
above will match a newline unless "\en"
.Pq or an equivalent escape sequence
is one of the characters explicitly present in the negated character class
(e.g.,
.Qq [^A-Z\en] ) .
This is unlike how many other regular expression tools treat negated character
classes, but unfortunately the inconsistency is historically entrenched.
Matching newlines means that a pattern like
.Qq [^"]*
can match the entire input unless there's another quote in the input.
.It
A rule can have at most one instance of trailing context
(the
.Sq /
operator or the
.Sq $
operator).
The start condition,
.Sq ^ ,
and
.Qq <<EOF>>
patterns can only occur at the beginning of a pattern, and, as well as with
.Sq /
and
.Sq $ ,
cannot be grouped inside parentheses.
A
.Sq ^
which does not occur at the beginning of a rule or a
.Sq $
which does not occur at the end of a rule loses its special properties
and is treated as a normal character.
.It
The following are illegal:
.Bd -unfilled -offset indent
foo/bar$
<sc1>foo<sc2>bar
.Ed
.Pp
Note that the first of these can be written
.Qq foo/bar\en .
.It
The following will result in
.Sq $
or
.Sq ^
being treated as a normal character:
.Bd -unfilled -offset indent
foo|(bar$)
foo|^bar
.Ed
.Pp
If what's wanted is a
.Qq foo
or a bar-followed-by-a-newline, the following could be used
(the special
.Sq |\&
action is explained below):
.Bd -unfilled -offset indent
foo      |
bar$     /* action goes here */
.Ed
.Pp
A similar trick will work for matching a foo or a
bar-at-the-beginning-of-a-line.
.El
.Sh HOW THE INPUT IS MATCHED
When the generated scanner is run,
it analyzes its input looking for strings which match any of its patterns.
If it finds more than one match,
it takes the one matching the most text
(for trailing context rules, this includes the length of the trailing part,
even though it will then be returned to the input).
If it finds two or more matches of the same length,
the rule listed first in the
.Nm
input file is chosen.
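.Pp
The two selection rules above (longest match first, then earliest rule)
can be sketched in plain C.
This is an illustration under simplifying assumptions, not the algorithm
used by the generated scanner, which works from precomputed tables:
here each rule is a hypothetical function that reports how many leading
bytes of the input it matches.
The names
.Fn pick_rule ,
.Fn match_if
and
.Fn match_word
are invented for this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* A rule is modelled as a function returning its match length (0 = none). */
typedef size_t (*matcher)(const char *);

/* Stands in for the pattern "if". */
size_t
match_if(const char *s)
{
	return strncmp(s, "if", 2) == 0 ? 2 : 0;
}

/* Stands in for the pattern [a-z]+ (in the C locale). */
size_t
match_word(const char *s)
{
	size_t n = 0;

	while (islower((unsigned char)s[n]))
		n++;
	return n;
}

/*
 * Return the index of the winning rule at the start of input, or -1 if
 * no rule matches (a real scanner would then apply the default rule).
 * The length of the winning match is stored in *len.
 */
int
pick_rule(const matcher rules[], size_t nrules, const char *input,
    size_t *len)
{
	int best = -1;
	size_t best_len = 0, i, n;

	assert(rules != NULL && input != NULL && len != NULL);
	for (i = 0; i < nrules; i++) {
		n = rules[i](input);
		/*
		 * Only a strictly longer match replaces the current
		 * choice, so among equal-length matches the rule
		 * listed first is kept.
		 */
		if (n > best_len) {
			best = (int)i;
			best_len = n;
		}
	}
	*len = best_len;
	return best;
}
```

With the rules listed as keyword first and word second, the input
.Qq if(
selects the keyword rule because both match two characters and the keyword
rule comes first, while
.Qq iffy
selects the word rule because it matches more text.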
.Pp
Once the match is determined, the text corresponding to the match
(called the
.Em token )
is made available in the global character pointer
.Fa yytext ,
and its length in the global integer
.Fa yyleng .
The
.Em action
corresponding to the matched pattern is then executed
.Pq a more detailed description of actions follows ,
and then the remaining input is scanned for another match.
.Pp
If no match is found, then the default rule is executed:
the next character in the input is considered matched and
copied to the standard output.
Thus, the simplest legal
.Nm
input is:
.Pp
.D1 %%
.Pp
which generates a scanner that simply copies its input
.Pq one character at a time
to its output.
.Pp
Note that
.Fa yytext
can be defined in two different ways:
either as a character pointer or as a character array.
Which definition
.Nm
uses can be controlled by including one of the special directives
.Dq %pointer
or
.Dq %array
in the first
.Pq definitions
section of flex input.
The default is
.Dq %pointer ,
unless the
.Fl l
.Nm lex
compatibility option is used, in which case
.Fa yytext
will be an array.
The advantage of using
.Dq %pointer
is substantially faster scanning and no buffer overflow when matching
very large tokens
.Pq unless not enough dynamic memory is available .
The disadvantage is that actions are restricted in how they can modify
.Fa yytext
.Pq see the next section ,
and calls to the
.Fn unput
function destroy the present contents of
.Fa yytext ,
which can be a considerable porting headache when moving between different
.Nm lex
versions.
.Pp
The advantage of
.Dq %array
is that
.Fa yytext
can be modified as much as wanted, and calls to
.Fn unput
do not destroy
.Fa yytext
.Pq see below .
Furthermore, existing
.Nm lex
programs sometimes access
.Fa yytext
externally using declarations of the form:
.Pp
.D1 extern char yytext[];
.Pp
This definition is erroneous when used with
.Dq %pointer ,
but correct for
.Dq %array .
.Pp
.Dq %array
defines
.Fa yytext
to be an array of
.Dv YYLMAX
characters, which defaults to a fairly large value.
The size can be changed by simply #define'ing
.Dv YYLMAX
to a different value in the first section of
.Nm
input.
As mentioned above, with
.Dq %pointer
.Fa yytext
grows dynamically to accommodate large tokens.
While this means a
.Dq %pointer
scanner can accommodate very large tokens
.Pq such as matching entire blocks of comments ,
bear in mind that each time the scanner must resize
.Fa yytext
it also must rescan the entire token from the beginning, so matching such
tokens can prove slow.
.Fa yytext
presently does not dynamically grow if a call to
.Fn unput
results in too much text being pushed back; instead, a run-time error results.
.Pp
Also note that
.Dq %array
cannot be used with C++ scanner classes
.Pq the c++ option; see below .
.Sh ACTIONS
Each pattern in a rule has a corresponding action,
which can be any arbitrary C statement.
The pattern ends at the first non-escaped whitespace character;
the remainder of the line is its action.
If the action is empty,
then when the pattern is matched the input token is simply discarded.
For example, here is the specification for a program
which deletes all occurrences of
.Qq zap me
from its input:
.Bd -literal -offset indent
%%
"zap me"
.Ed
.Pp
(It will copy all other characters in the input to the output since
they will be matched by the default rule.)
.Pp
Here is a program which compresses multiple blanks and tabs down to
a single blank, and throws away whitespace found at the end of a line:
.Bd -literal -offset indent
%%
[ \et]+        putchar(' ');
[ \et]+$       /* ignore this token */
.Ed
.Pp
If the action contains a
.Sq { ,
then the action spans till the balancing
.Sq }
is found, and the action may cross multiple lines.
.Nm
knows about C strings and comments and won't be fooled by braces found
within them, but also allows actions to begin with
.Sq %{
and will consider the action to be all the text up to the next
.Sq %}
.Pq regardless of ordinary braces inside the action .
.Pp
An action consisting solely of a vertical bar
.Pq Sq |\&
means
.Qq same as the action for the next rule .
See below for an illustration.
.Pp
Actions can include arbitrary C code,
including return statements to return a value to whatever routine called
.Fn yylex .
Each time
.Fn yylex
is called, it continues processing tokens from where it last left off
until it either reaches the end of the file or executes a return.
.Pp
Actions are free to modify
.Fa yytext
except for lengthening it
(adding characters to its end \- these will overwrite later characters in the
input stream).
This, however, does not apply when using
.Dq %array
.Pq see above ;
in that case,
.Fa yytext
may be freely modified in any way.
.Pp
Actions are free to modify
.Fa yyleng
except they should not do so if the action also includes use of
.Fn yymore
.Pq see below .
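.Pp
As an illustration of a modification that stays within these limits,
the following sketch removes the surrounding double quotes from a token
in place.
The token only becomes shorter, so nothing past its original last
character is written.
The helper name
.Fn strip_quotes
is hypothetical and not part of
.Nm ;
it takes the token and its length as ordinary arguments so that it can be
read and tried outside a scanner.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: strip one pair of surrounding double quotes
 * from text[0..len-1] in place and return the new length.  A token
 * that is not quoted is left untouched.
 */
int
strip_quotes(char *text, int len)
{
	assert(text != NULL && len >= 0);
	if (len < 2 || text[0] != '"' || text[len - 1] != '"')
		return len;
	/* Shift the contents left over the opening quote. */
	memmove(text, text + 1, (size_t)(len - 2));
	/* Terminate inside the original token; never beyond it. */
	text[len - 2] = '\0';
	return len - 2;
}
```

Inside an action that does not use
.Fn yymore ,
one plausible use would be:
.Pp
.Dl yyleng = strip_quotes(yytext, yyleng);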
.Pp
There are a number of special directives which can be included within
an action:
.Bl -tag -width Ds
.It ECHO
Copies
.Fa yytext
to the scanner's output.
.It BEGIN
Followed by the name of a start condition, places the scanner in the
corresponding start condition
.Pq see below .
.It REJECT
Directs the scanner to proceed on to the
.Qq second best
rule which matched the input
.Pq or a prefix of the input .
The rule is chosen as described above in
.Sx HOW THE INPUT IS MATCHED ,
and
.Fa yytext
and
.Fa yyleng
set up appropriately.
It may either be one which matched as much text
as the originally chosen rule but came later in the
.Nm
input file, or one which matched less text.
For example, the following will both count the
words in the input and call the routine
.Fn special
whenever
.Qq frob
is seen:
.Bd -literal -offset indent
	int word_count = 0;
%%

frob        special(); REJECT;
[^ \et\en]+   ++word_count;
.Ed
.Pp
Without the
.Em REJECT ,
any "frob"'s in the input would not be counted as words,
since the scanner normally executes only one action per token.
Multiple
.Em REJECT Ns 's
are allowed,
each one finding the next best choice to the currently active rule.
For example, when the following scanner scans the token
.Qq abcd ,
it will write
.Qq abcdabcaba
to the output:
.Bd -literal -offset indent
%%
a        |
ab       |
abc      |
abcd     ECHO; REJECT;
\&.|\en     /* eat up any unmatched character */
.Ed
.Pp
(The first three rules share the fourth's action since they use
the special
.Sq |\&
action.)
.Em REJECT
is a particularly expensive feature in terms of scanner performance;
if it is used in any of the scanner's actions it will slow down
all of the scanner's matching.
Furthermore,
.Em REJECT
cannot be used with the
.Fl Cf
or
.Fl CF
options
.Pq see below .
.Pp
Note also that unlike the other special actions,
.Em REJECT
is a
.Em branch ;
code immediately following it in the action will not be executed.
.It yymore()
Tells the scanner that the next time it matches a rule, the corresponding
token should be appended onto the current value of
.Fa yytext
rather than replacing it.
For example, given the input
.Qq mega-kludge
the following will write
.Qq mega-mega-kludge
to the output:
.Bd -literal -offset indent
%%
mega-    ECHO; yymore();
kludge   ECHO;
.Ed
.Pp
First
.Qq mega-
is matched and echoed to the output.
Then
.Qq kludge
is matched, but the previous
.Qq mega-
is still hanging around at the beginning of
.Fa yytext
so the
.Em ECHO
for the
.Qq kludge
rule will actually write
.Qq mega-kludge .
.Pp
Two notes regarding use of
.Fn yymore :
First,
.Fn yymore
depends on the value of
.Fa yyleng
correctly reflecting the size of the current token, so
.Fa yyleng
must not be modified when using
.Fn yymore .
Second, the presence of
.Fn yymore
in the scanner's action entails a minor performance penalty in the
scanner's matching speed.
.It yyless(n)
Returns all but the first
.Ar n
characters of the current token back to the input stream, where they
will be rescanned when the scanner looks for the next match.
.Fa yytext
and
.Fa yyleng
are adjusted appropriately (e.g.,
.Fa yyleng
will now be equal to
.Ar n ) .
For example, on the input
.Qq foobar
the following will write out
.Qq foobarbar :
.Bd -literal -offset indent
%%
foobar    ECHO; yyless(3);
[a-z]+    ECHO;
.Ed
.Pp
An argument of 0 to
.Fa yyless
will cause the entire current input string to be scanned again.
Unless how the scanner will subsequently process its input has been changed
(using
.Em BEGIN ,
for example),
this will result in an endless loop.
.Pp
Note that
.Fa yyless
is a macro and can only be used in the
.Nm
input file, not from other source files.
.It unput(c)
Puts the character
.Ar c
back into the input stream.
It will be the next character scanned.
The following action will take the current token and cause it
to be rescanned enclosed in parentheses.
.Bd -literal -offset indent
{
	int i;
	char *yycopy;

	/* Copy yytext because unput() trashes yytext */
	if ((yycopy = strdup(yytext)) == NULL)
		err(1, NULL);
	unput(')');
	for (i = yyleng - 1; i >= 0; --i)
		unput(yycopy[i]);
	unput('(');
	free(yycopy);
}
.Ed
.Pp
Note that since each
.Fn unput
puts the given character back at the beginning of the input stream,
pushing back strings must be done back-to-front.
.Pp
An important potential problem when using
.Fn unput
is that if using
.Dq %pointer
.Pq the default ,
a call to
.Fn unput
destroys the contents of
.Fa yytext ,
starting with its rightmost character and devouring one character to
the left with each call.
If the value of
.Fa yytext
should be preserved after a call to
.Fn unput
.Pq as in the above example ,
it must either first be copied elsewhere, or the scanner must be built using
.Dq %array
instead (see
.Sx HOW THE INPUT IS MATCHED ) .
1128 .Pp 1129 Finally, note that EOF cannot be put back 1130 to attempt to mark the input stream with an end-of-file. 1131 .It input() 1132 Reads the next character from the input stream. 1133 For example, the following is one way to eat up C comments: 1134 .Bd -literal -offset indent 1135 %% 1136 "/*" { 1137 int c; 1138 1139 for (;;) { 1140 while ((c = input()) != '*' && c != EOF) 1141 ; /* eat up text of comment */ 1142 1143 if (c == '*') { 1144 while ((c = input()) == '*') 1145 ; 1146 if (c == '/') 1147 break; /* found the end */ 1148 } 1149 1150 if (c == EOF) { 1151 errx(1, "EOF in comment"); 1152 break; 1153 } 1154 } 1155 } 1156 .Ed 1157 .Pp 1158 (Note that if the scanner is compiled using C++, then 1159 .Fn input 1160 is instead referred to as 1161 .Fn yyinput , 1162 in order to avoid a name clash with the C++ stream by the name of input.) 1163 .It YY_FLUSH_BUFFER 1164 Flushes the scanner's internal buffer 1165 so that the next time the scanner attempts to match a token, 1166 it will first refill the buffer using 1167 .Dv YY_INPUT 1168 (see 1169 .Sx THE GENERATED SCANNER , 1170 below). 1171 This action is a special case of the more general 1172 .Fn yy_flush_buffer 1173 function, described below in the section 1174 .Sx MULTIPLE INPUT BUFFERS . 1175 .It yyterminate() 1176 Can be used in lieu of a return statement in an action. 1177 It terminates the scanner and returns a 0 to the scanner's caller, indicating 1178 .Qq all done . 1179 By default, 1180 .Fn yyterminate 1181 is also called when an end-of-file is encountered. 1182 It is a macro and may be redefined. 1183 .El 1184 .Sh THE GENERATED SCANNER 1185 The output of 1186 .Nm 1187 is the file 1188 .Pa lex.yy.c , 1189 which contains the scanning routine 1190 .Fn yylex , 1191 a number of tables used by it for matching tokens, 1192 and a number of auxiliary routines and macros. 1193 By default, 1194 .Fn yylex 1195 is declared as follows: 1196 .Bd -unfilled -offset indent 1197 int yylex() 1198 { 1199 ... 
    various definitions and the actions in here ...
}
.Ed
.Pp
(If the environment supports function prototypes, then it will
be "int yylex(void)".)
This definition may be changed by defining the
.Dv YY_DECL
macro.
For example:
.Bd -literal -offset indent
#define YY_DECL float lexscan(a, b) float a, b;
.Ed
.Pp
would give the scanning routine the name
.Em lexscan ,
returning a float, and taking two floats as arguments.
Note that if arguments are given to the scanning routine using a
K&R-style/non-prototyped function declaration,
the definition must be terminated with a semi-colon
.Pq Sq ;\& .
.Pp
Whenever
.Fn yylex
is called, it scans tokens from the global input file
.Pa yyin
.Pq which defaults to stdin .
It continues until it either reaches an end-of-file
.Pq at which point it returns the value 0
or one of its actions executes a
.Em return
statement.
.Pp
If the scanner reaches an end-of-file,
the behavior of subsequent calls is undefined unless either
.Em yyin
is pointed at a new input file
.Pq in which case scanning continues from that file ,
or
.Fn yyrestart
is called.
.Fn yyrestart
takes one argument, a
.Fa FILE *
pointer (which can be nil, if
.Dv YY_INPUT
has been set up to scan from a source other than
.Em yyin ) ,
and initializes
.Em yyin
for scanning from that file.
Essentially there is no difference between just assigning
.Em yyin
to a new input file and using
.Fn yyrestart
to do so; the latter is available for compatibility with previous versions of
.Nm ,
and because it can be used to switch input files in the middle of scanning.
It can also be used to throw away the current input buffer,
by calling it with an argument of
.Em yyin ;
but it is better to use
.Dv YY_FLUSH_BUFFER
.Pq see above .
Note that
.Fn yyrestart
does not reset the start condition to
.Em INITIAL
(see
.Sx START CONDITIONS ,
below).
.Pp
If
.Fn yylex
stops scanning due to executing a
.Em return
statement in one of the actions, the scanner may then be called again and it
will resume scanning where it left off.
.Pp
By default
.Pq and for purposes of efficiency ,
the scanner uses block-reads rather than simple
.Xr getc 3
calls to read characters from
.Em yyin .
The nature of how it gets its input can be controlled by defining the
.Dv YY_INPUT
macro.
.Dv YY_INPUT Ns 's
calling sequence is
.Qq YY_INPUT(buf,result,max_size) .
Its action is to place up to
.Dv max_size
characters in the character array
.Em buf
and return in the integer variable
.Em result
either the number of characters read or the constant
.Dv YY_NULL
(0 on
.Ux
systems)
to indicate
.Dv EOF .
The default
.Dv YY_INPUT
reads from the global file-pointer
.Qq yyin .
.Pp
A sample definition of
.Dv YY_INPUT
.Pq in the definitions section of the input file :
.Bd -unfilled -offset indent
%{
#define YY_INPUT(buf,result,max_size) \e
{ \e
        int c = getchar(); \e
        result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \e
}
%}
.Ed
.Pp
This definition will change the input processing to occur
one character at a time.
.Pp
When the scanner receives an end-of-file indication from
.Dv YY_INPUT ,
it then checks the
.Fn yywrap
function.
If
.Fn yywrap
returns false
.Pq zero ,
then it is assumed that the function has gone ahead and set up
.Em yyin
to point to another input file, and scanning continues.
If it returns true
.Pq non-zero ,
then the scanner terminates, returning 0 to its caller.
Note that in either case, the start condition remains unchanged;
it does not revert to
.Em INITIAL .
.Pp
If you do not supply your own version of
.Fn yywrap ,
then you must either use
.Dq %option noyywrap
(in which case the scanner behaves as though
.Fn yywrap
returned 1), or you must link with
.Fl lfl
to obtain the default version of the routine, which always returns 1.
.Pp
Three routines are available for scanning from in-memory buffers rather
than files:
.Fn yy_scan_string ,
.Fn yy_scan_bytes ,
and
.Fn yy_scan_buffer .
See the discussion of them below in the section
.Sx MULTIPLE INPUT BUFFERS .
.Pp
The scanner writes its
.Em ECHO
output to the
.Em yyout
global
.Pq default, stdout ,
which may be redefined by the user simply by assigning it to some other
.Va FILE
pointer.
.Sh START CONDITIONS
.Nm
provides a mechanism for conditionally activating rules.
Any rule whose pattern is prefixed with
.Qq Aq sc
will only be active when the scanner is in the start condition named
.Qq sc .
For example,
.Bd -literal -offset indent
<STRING>[^"]* { /* eat up the string body ... */
        ...
}
.Ed
.Pp
will be active only when the scanner is in the
.Qq STRING
start condition, and
.Bd -literal -offset indent
<INITIAL,STRING,QUOTE>\e. { /* handle an escape ... */
        ...
}
.Ed
.Pp
will be active only when the current start condition is either
.Qq INITIAL ,
.Qq STRING ,
or
.Qq QUOTE .
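.Pp
As a rough illustration of the idea only, the effect of a start condition
can be mimicked by hand in plain C with a state variable that decides
which branch may act on the next character.
The sketch below is not what flex generates; the names start_cond,
SC_INITIAL, SC_STRING, and count_string_chars are hypothetical and
exist only for this example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical names: these mimic start conditions by hand and are
 * not part of flex or of the scanner it generates.
 */
enum start_cond { SC_INITIAL, SC_STRING };

/*
 * Count the characters that appear between double quotes in s.
 * The current start condition decides which "rules" may fire.
 */
size_t
count_string_chars(const char *s)
{
        enum start_cond sc = SC_INITIAL;
        size_t n = 0;

        assert(s != NULL);
        for (; *s != '\0'; s++) {
                switch (sc) {
                case SC_INITIAL:
                        if (*s == '"')
                                sc = SC_STRING;  /* enter STRING */
                        break;                   /* anything else is skipped */
                case SC_STRING:
                        if (*s == '"')
                                sc = SC_INITIAL; /* closing quote: leave STRING */
                        else
                                n++;             /* plays the role of <STRING>[^"]* */
                        break;
                }
        }
        return n;
}
```

Only the SC_STRING branch ever counts a character, in the same way that a
rule prefixed with a start condition is considered only while the scanner
is in that condition.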
.Pp
Start conditions are declared in the definitions
.Pq first
section of the input using unindented lines beginning with either
.Sq %s
or
.Sq %x
followed by a list of names.
The former declares
.Em inclusive
start conditions, the latter
.Em exclusive
start conditions.
A start condition is activated using the
.Em BEGIN
action.
Until the next
.Em BEGIN
action is executed, rules with the given start condition will be active and
rules with other start conditions will be inactive.
If the start condition is inclusive,
then rules with no start conditions at all will also be active.
If it is exclusive,
then only rules qualified with the start condition will be active.
A set of rules contingent on the same exclusive start condition
describes a scanner which is independent of any of the other rules in the
.Nm
input.
Because of this, exclusive start conditions make it easy to specify
.Qq mini-scanners
which scan portions of the input that are syntactically different
from the rest
.Pq e.g., comments .
.Pp
If the distinction between inclusive and exclusive start conditions
is still a little vague, here's a simple example illustrating the
connection between the two.
The set of rules:
.Bd -literal -offset indent
%s example
%%

<example>foo   do_something();

bar            something_else();
.Ed
.Pp
is equivalent to
.Bd -literal -offset indent
%x example
%%

<example>foo   do_something();

<INITIAL,example>bar    something_else();
.Ed
.Pp
Without the
.Aq INITIAL,example
qualifier, the
.Dq bar
pattern in the second example wouldn't be active
.Pq i.e., couldn't match
when in start condition
.Dq example .
If we just used
.Aq example
to qualify
.Dq bar ,
though, then it would only be active in
.Dq example
and not in
.Em INITIAL ,
while in the first example it's active in both,
because in the first example the
.Dq example
start condition is an inclusive
.Pq Sq %s
start condition.
.Pp
Also note that the special start-condition specifier
.Sq Aq *
matches every start condition.
Thus, the above example could also have been written:
.Bd -literal -offset indent
%x example
%%

<example>foo   do_something();

<*>bar         something_else();
.Ed
.Pp
The default rule (to
.Em ECHO
any unmatched character) remains active in start conditions.
It is equivalent to:
.Bd -literal -offset indent
<*>.|\en    ECHO;
.Ed
.Pp
.Dq BEGIN(0)
returns to the original state where only the rules with
no start conditions are active.
This state can also be referred to as the start-condition
.Em INITIAL ,
so
.Dq BEGIN(INITIAL)
is equivalent to
.Dq BEGIN(0) .
(The parentheses around the start condition name are not required but
are considered good style.)
.Pp
.Em BEGIN
actions can also be given as indented code at the beginning
of the rules section.
For example, the following will cause the scanner to enter the
.Qq SPECIAL
start condition whenever
.Fn yylex
is called and the global variable
.Fa enter_special
is true:
.Bd -literal -offset indent
        int enter_special;

%x SPECIAL
%%
        if (enter_special)
                BEGIN(SPECIAL);

<SPECIAL>blahblahblah
\&...more rules follow...
.Ed
.Pp
To illustrate the uses of start conditions,
here is a scanner which provides two different interpretations
of a string like
.Qq 123.456 .
By default it will treat it as three tokens: the integer
.Qq 123 ,
a dot
.Pq Sq .\& ,
and the integer
.Qq 456 .
But if the string is preceded earlier in the line by the string
.Qq expect-floats
it will treat it as a single token, the floating-point number 123.456:
.Bd -literal -offset indent
%{
#include <stdlib.h>
%}
%s expect

%%
expect-floats        BEGIN(expect);

<expect>[0-9]+"."[0-9]+ {
        printf("found a float, = %f\en",
            atof(yytext));
}
<expect>\en {
        /*
         * That's the end of the line, so
         * we need another "expect-floats"
         * before we'll recognize any more
         * numbers.
         */
        BEGIN(INITIAL);
}

[0-9]+ {
        printf("found an integer, = %d\en",
            atoi(yytext));
}

"."     printf("found a dot\en");
.Ed
.Pp
Here is a scanner which recognizes
.Pq and discards
C comments while maintaining a count of the current input line:
.Bd -literal -offset indent
%x comment
%%
        int line_num = 1;

"/*"                    BEGIN(comment);

<comment>[^*\en]*        /* eat anything that's not a '*' */
<comment>"*"+[^*/\en]*   /* eat up '*'s not followed by '/'s */
<comment>\en             ++line_num;
<comment>"*"+"/"        BEGIN(INITIAL);
.Ed
.Pp
This scanner goes to a bit of trouble to match as much
text as possible with each rule.
In general, when attempting to write a high-speed scanner
try to match as much as possible in each rule, as it's a big win.
.Pp
Note that start-condition names are really integer values and
can be stored as such.
Thus, the above could be extended in the following fashion:
.Bd -literal -offset indent
%x comment foo
%%
        int line_num = 1;
        int comment_caller;

"/*" {
        comment_caller = INITIAL;
        BEGIN(comment);
}

\&...

<foo>"/*" {
        comment_caller = foo;
        BEGIN(comment);
}

<comment>[^*\en]*        /* eat anything that's not a '*' */
<comment>"*"+[^*/\en]*   /* eat up '*'s not followed by '/'s */
<comment>\en             ++line_num;
<comment>"*"+"/"        BEGIN(comment_caller);
.Ed
.Pp
Furthermore, the current start condition can be accessed by using
the integer-valued
.Dv YY_START
macro.
For example, the above assignments to
.Em comment_caller
could instead be written
.Pp
.Dl comment_caller = YY_START;
.Pp
Flex provides
.Dv YYSTATE
as an alias for
.Dv YY_START
(since that is what's used by
.At
.Nm lex ) .
.Pp
Note that start conditions do not have their own name-space;
%s's and %x's declare names in the same fashion as #define's.
.Pp
Finally, here's an example of how to match C-style quoted strings using
exclusive start conditions, including expanded escape sequences
(but not including checking for a string that's too long):
.Bd -literal -offset indent
%x str

%%
        #define MAX_STR_CONST 1024
        char string_buf[MAX_STR_CONST];
        char *string_buf_ptr;

\e"      string_buf_ptr = string_buf; BEGIN(str);

<str>\e" { /* saw closing quote - all done */
        BEGIN(INITIAL);
        *string_buf_ptr = '\e0';
        /*
         * return string constant token type and
         * value to parser
         */
}

<str>\en {
        /* error - unterminated string constant */
        /* generate error message */
}

<str>\e\e[0-7]{1,3} {
        /* octal escape sequence */
        unsigned int result;    /* "%o" stores into an unsigned int */

        (void) sscanf(yytext + 1, "%o", &result);

        if (result > 0xff) {
                /* error, constant is out-of-bounds */
        } else
                *string_buf_ptr++ = result;
}

<str>\e\e[0-9]+ {
        /*
         * generate error - bad escape sequence; something
         * like '\e48' or '\e0777777'
         */
}

<str>\e\en        *string_buf_ptr++ = '\en';
<str>\e\et        *string_buf_ptr++ = '\et';
<str>\e\er        *string_buf_ptr++ = '\er';
<str>\e\eb        *string_buf_ptr++ = '\eb';
<str>\e\ef        *string_buf_ptr++ = '\ef';

<str>\e\e(.|\en)   *string_buf_ptr++ = yytext[1];

<str>[^\e\e\en\e"]+ {
        char *yptr = yytext;

        while (*yptr)
                *string_buf_ptr++ = *yptr++;
}
.Ed
.Pp
Often, such as in some of the examples above,
a whole bunch of rules are all preceded by the same start condition(s).
.Nm
makes this a little easier and cleaner by introducing a notion of
start condition
.Em scope .
A start condition scope is begun with:
.Pp
.Dl <SCs>{
.Pp
where
.Dq SCs
is a list of one or more start conditions.
Inside the start condition scope, every rule automatically has the prefix
.Aq SCs
applied to it, until a
.Sq }
which matches the initial
.Sq { .
So, for example,
.Bd -literal -offset indent
<ESC>{
    "\e\en"        return '\en';
    "\e\er"        return '\er';
    "\e\ef"        return '\ef';
    "\e\e0"        return '\e0';
}
.Ed
.Pp
is equivalent to:
.Bd -literal -offset indent
<ESC>"\e\en"        return '\en';
<ESC>"\e\er"        return '\er';
<ESC>"\e\ef"        return '\ef';
<ESC>"\e\e0"        return '\e0';
.Ed
.Pp
Start condition scopes may be nested.
.Pp
Three routines are available for manipulating stacks of start conditions:
.Bl -tag -width Ds
.It void yy_push_state(int new_state)
Pushes the current start condition onto the top of the start condition
stack and switches to
.Fa new_state
as though
.Dq BEGIN new_state
had been used
.Pq recall that start condition names are also integers .
.It void yy_pop_state()
Pops the top of the stack and switches to it via
.Em BEGIN .
.It int yy_top_state()
Returns the top of the stack without altering the stack's contents.
.El
.Pp
The start condition stack grows dynamically and so has no built-in
size limitation.
If memory is exhausted, program execution aborts.
.Pp
To use start condition stacks, scanners must include a
.Dq %option stack
directive (see
.Sx OPTIONS
below).
.Sh MULTIPLE INPUT BUFFERS
Some scanners
(such as those which support
.Qq include
files)
require reading from several input streams.
As
.Nm
scanners do a large amount of buffering, one cannot control
where the next input will be read from by simply writing a
.Dv YY_INPUT
which is sensitive to the scanning context.
.Dv YY_INPUT
is only called when the scanner reaches the end of its buffer, which
may be a long time after scanning a statement such as an
.Qq include
which requires switching the input source.
.Pp
To negotiate these sorts of problems,
.Nm
provides a mechanism for creating and switching between multiple
input buffers.
An input buffer is created by using:
.Pp
.D1 YY_BUFFER_STATE yy_create_buffer(FILE *file, int size)
.Pp
which takes a
.Fa FILE
pointer and a
.Fa size
and creates a buffer associated with the given file and large enough to hold
.Fa size
characters (when in doubt, use
.Dv YY_BUF_SIZE
for the size).
It returns a
.Dv YY_BUFFER_STATE
handle, which may then be passed to other routines
.Pq see below .
The
.Dv YY_BUFFER_STATE
type is a pointer to an opaque
.Dq struct yy_buffer_state
structure, so
.Dv YY_BUFFER_STATE
variables may be safely initialized to
.Dq ((YY_BUFFER_STATE) 0)
if desired, and the opaque structure can also be referred to in order to
correctly declare input buffers in source files other than that of scanners.
Note that the
.Fa FILE
pointer in the call to
.Fn yy_create_buffer
is only used as the value of
.Fa yyin
seen by
.Dv YY_INPUT ;
if
.Dv YY_INPUT
is redefined so that it no longer uses
.Fa yyin ,
then a nil
.Fa FILE
pointer can safely be passed to
.Fn yy_create_buffer .
To select a particular buffer to scan:
.Pp
.D1 void yy_switch_to_buffer(YY_BUFFER_STATE new_buffer)
.Pp
It switches the scanner's input buffer so subsequent tokens will
come from
.Fa new_buffer .
Note that
.Fn yy_switch_to_buffer
may be used by
.Fn yywrap
to set things up for continued scanning,
instead of opening a new file and pointing
.Fa yyin
at it.
Note also that switching input sources via either
.Fn yy_switch_to_buffer
or
.Fn yywrap
does not change the start condition.
.Pp
.D1 void yy_delete_buffer(YY_BUFFER_STATE buffer)
.Pp
is used to reclaim the storage associated with a buffer.
.Pf ( Fa buffer
can be nil, in which case the routine does nothing.)
To clear the current contents of a buffer:
.Pp
.D1 void yy_flush_buffer(YY_BUFFER_STATE buffer)
.Pp
This function discards the buffer's contents,
so the next time the scanner attempts to match a token from the buffer,
it will first fill the buffer anew using
.Dv YY_INPUT .
.Pp
.Fn yy_new_buffer
is an alias for
.Fn yy_create_buffer ,
provided for compatibility with the C++ use of
.Em new
and
.Em delete
for creating and destroying dynamic objects.
.Pp
Finally, the
.Dv YY_CURRENT_BUFFER
macro returns a
.Dv YY_BUFFER_STATE
handle to the current buffer.
.Pp
Here is an example of using these features for writing a scanner
which expands include files (the
.Aq Aq EOF
feature is discussed below):
.Bd -literal -offset indent
/*
 * the "incl" state is used for picking up the name
 * of an include file
 */
%x incl

%{
#define MAX_INCLUDE_DEPTH 10
YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
int include_stack_ptr = 0;
%}

%%
include             BEGIN(incl);

[a-z]+              ECHO;
[^a-z\en]*\en?        ECHO;

<incl>[ \et]*        /* eat the whitespace */
<incl>[^ \et\en]+ {   /* got the include file name */
        if (include_stack_ptr >= MAX_INCLUDE_DEPTH)
                errx(1, "Includes nested too deeply");

        include_stack[include_stack_ptr++] =
            YY_CURRENT_BUFFER;

        yyin = fopen(yytext, "r");

        if (yyin == NULL)
                err(1, NULL);

        yy_switch_to_buffer(
            yy_create_buffer(yyin, YY_BUF_SIZE));

        BEGIN(INITIAL);
}

<<EOF>> {
        if (--include_stack_ptr < 0)
                yyterminate();
        else {
                yy_delete_buffer(YY_CURRENT_BUFFER);
                yy_switch_to_buffer(
                    include_stack[include_stack_ptr]);
        }
}
.Ed
.Pp
Three routines are available for setting up input buffers for
scanning in-memory strings instead of files.
All of them create a new input buffer for scanning the string,
and return a corresponding
.Dv YY_BUFFER_STATE
handle (which should be deleted afterwards using
.Fn yy_delete_buffer ) .
They also switch to the new buffer using
.Fn yy_switch_to_buffer ,
so the next call to
.Fn yylex
will start scanning the string.
.Bl -tag -width Ds
.It yy_scan_string(const char *str)
Scans a NUL-terminated string.
.It yy_scan_bytes(const char *bytes, int len)
Scans
.Fa len
bytes
.Pq including possibly NUL's
starting at location
.Fa bytes .
.El
.Pp
Note that both of these functions create and scan a copy
of the string or bytes.
(This may be desirable, since
.Fn yylex
modifies the contents of the buffer it is scanning.)
The copy can be avoided by using:
.Bl -tag -width Ds
.It yy_scan_buffer(char *base, yy_size_t size)
Which scans the buffer starting at
.Fa base ,
consisting of
.Fa size
bytes, the last two bytes of which must be
.Dv YY_END_OF_BUFFER_CHAR
.Pq ASCII NUL .
These last two bytes are not scanned; thus, scanning consists of
base[0] through base[size-2], inclusive.
.Pp
If
.Fa base
is not set up in this manner
(i.e., the final two
.Dv YY_END_OF_BUFFER_CHAR
bytes are missing), then
.Fn yy_scan_buffer
returns a nil pointer instead of creating a new input buffer.
.Pp
The type
.Fa yy_size_t
is an integral type to which an integer expression
reflecting the size of the buffer can be cast.
.El
.Sh END-OF-FILE RULES
The special rule
.Qq Aq Aq EOF
indicates actions which are to be taken when an end-of-file is encountered and
.Fn yywrap
returns non-zero
.Pq i.e., indicates no further files to process .
The action must finish by doing one of four things:
.Bl -dash
.It
Assigning
.Em yyin
to a new input file
(in previous versions of
.Nm ,
after doing the assignment, it was necessary to call the special action
.Dv YY_NEW_FILE ;
this is no longer necessary).
.It
Executing a
.Em return
statement.
.It
Executing the special
.Fn yyterminate
action.
.It
Switching to a new buffer using
.Fn yy_switch_to_buffer
as shown in the example above.
.El
.Pp
.Aq Aq EOF
rules may not be used with other patterns;
they may only be qualified with a list of start conditions.
If an unqualified
.Aq Aq EOF
rule is given, it applies to all start conditions which do not already have
.Aq Aq EOF
actions.
To specify an
.Aq Aq EOF
rule for only the initial start condition, use
.Pp
.Dl <INITIAL><<EOF>>
.Pp
These rules are useful for catching things like unclosed comments.
An example:
.Bd -literal -offset indent
%x quote
%%

\&...other rules for dealing with quotes...

<quote><<EOF>> {
        error("unterminated quote");
        yyterminate();
}
<<EOF>> {
        if (*++filelist)
                yyin = fopen(*filelist, "r");
        else
                yyterminate();
}
.Ed
.Sh MISCELLANEOUS MACROS
The macro
.Dv YY_USER_ACTION
can be defined to provide an action
which is always executed prior to the matched rule's action.
For example,
it could be #define'd to call a routine to convert yytext to lower-case.
When
.Dv YY_USER_ACTION
is invoked, the variable
.Fa yy_act
gives the number of the matched rule
.Pq rules are numbered starting with 1 .
For example, to profile how often each rule is matched,
the following would do the trick:
.Pp
.Dl #define YY_USER_ACTION ++ctr[yy_act]
.Pp
where
.Fa ctr
is an array to hold the counts for the different rules.
Note that the macro
.Dv YY_NUM_RULES
gives the total number of rules
(including the default rule, even if
.Fl s
is used),
so a correct declaration for
.Fa ctr
is:
.Pp
.Dl int ctr[YY_NUM_RULES];
.Pp
The macro
.Dv YY_USER_INIT
may be defined to provide an action which is always executed before
the first scan
.Pq and before the scanner's internal initializations are done .
For example, it could be used to call a routine to read
in a data table or open a logging file.
.Pp
The macro
.Dv yy_set_interactive(is_interactive)
can be used to control whether the current buffer is considered
.Em interactive .
An interactive buffer is processed more slowly,
but must be used when the scanner's input source is indeed
interactive to avoid problems due to waiting to fill buffers
(see the discussion of the
.Fl I
flag below).
A non-zero value in the macro invocation marks the buffer as interactive,
a zero value as non-interactive.
Note that use of this macro overrides
.Dq %option always-interactive
or
.Dq %option never-interactive
(see
.Sx OPTIONS
below).
.Fn yy_set_interactive
must be invoked prior to beginning to scan the buffer that is
.Pq or is not
to be considered interactive.
.Pp
The macro
.Dv yy_set_bol(at_bol)
can be used to control whether the current buffer's scanning
context for the next token match is done as though at the
beginning of a line.
A non-zero macro argument makes rules anchored with
.Sq ^
active, while a zero argument makes
.Sq ^
rules inactive.
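.Pp
To make the notion of a beginning-of-line context concrete, here is a
minimal hand-written sketch in plain C.
It is not flex code: the helper count_bol_hashes is hypothetical, and
its at_bol parameter merely plays the role that the argument to
yy_set_bol() plays for a generated scanner.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count '#' characters that sit at the beginning of a line in s.
 * A non-zero at_bol means the first character is treated as though
 * it starts a line.  Hypothetical helper, not part of flex.
 */
size_t
count_bol_hashes(const char *s, int at_bol)
{
        size_t n = 0;

        assert(s != NULL);
        for (; *s != '\0'; s++) {
                if (at_bol && *s == '#')
                        n++;            /* a rule anchored with '^' could match here */
                at_bol = (*s == '\n');  /* only a newline re-arms the anchor */
        }
        return n;
}
```

Calling the function on the same text with at_bol set to zero instead of
non-zero changes only whether a leading '#' is counted, which is the kind
of difference the macro argument makes for the next token match.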
.Pp
The macro
.Dv YY_AT_BOL
returns true if the next token scanned from the current buffer will have
.Sq ^
rules active, false otherwise.
.Pp
In the generated scanner, the actions are all gathered in one large
switch statement and separated using
.Dv YY_BREAK ,
which may be redefined.
By default, it is simply a
.Qq break ,
to separate each rule's action from the following rules.
Redefining
.Dv YY_BREAK
allows, for example, C++ users to
.Dq #define YY_BREAK
to do nothing
(while being very careful that every rule ends with a
.Qq break
or a
.Qq return ! )
to avoid unreachable statement warnings:
when a rule's action ends with
.Dq return ,
the
.Dv YY_BREAK
that follows it can never be reached.
.Sh VALUES AVAILABLE TO THE USER
This section summarizes the various values available to the user
in the rule actions.
.Bl -tag -width Ds
.It char *yytext
Holds the text of the current token.
It may be modified but not lengthened
.Pq characters cannot be appended to the end .
.Pp
If the special directive
.Dq %array
appears in the first section of the scanner description, then
.Fa yytext
is instead declared
.Dq char yytext[YYLMAX] ,
where
.Dv YYLMAX
is a macro definition that can be redefined in the first section
to change the default value
.Pq generally 8KB .
Using
.Dq %array
results in somewhat slower scanners, but the value of
.Fa yytext
becomes immune to calls to
.Fn input
and
.Fn unput ,
which potentially destroy its value when
.Fa yytext
is a character pointer.
The opposite of
.Dq %array
is
.Dq %pointer ,
which is the default.
.Pp
.Dq %array
cannot be used when generating C++ scanner classes
(the
.Fl +
flag).
.It int yyleng
Holds the length of the current token.
.It FILE *yyin
Is the file which by default
.Nm
reads from.
It may be redefined, but doing so only makes sense before
scanning begins or after an
.Dv EOF
has been encountered.
Changing it in the midst of scanning will have unexpected results since
.Nm
buffers its input; use
.Fn yyrestart
instead.
Once scanning terminates because an end-of-file
has been seen,
.Fa yyin
can be assigned as the new input file
and the scanner can be called again to continue scanning.
.It void yyrestart(FILE *new_file)
May be called to point
.Fa yyin
at the new input file.
The switch-over to the new file is immediate
.Pq any previously buffered-up input is lost .
Note that calling
.Fn yyrestart
with
.Fa yyin
as an argument thus throws away the current input buffer and continues
scanning the same input file.
.It FILE *yyout
Is the file to which
.Em ECHO
actions are done.
It can be reassigned by the user.
.It YY_CURRENT_BUFFER
Returns a
.Dv YY_BUFFER_STATE
handle to the current buffer.
.It YY_START
Returns an integer value corresponding to the current start condition.
This value can subsequently be used with
.Em BEGIN
to return to that start condition.
.El
.Sh INTERFACING WITH YACC
One of the main uses of
.Nm
is as a companion to the
.Xr yacc 1
parser-generator.
yacc parsers expect to call a routine named
.Fn yylex
to find the next input token.
The routine is supposed to return the type of the next token
as well as putting any associated value in the global
.Fa yylval ,
which is defined externally,
and can be a union or any other complex data structure.
To use
.Nm
with yacc, one specifies the
.Fl d
option to yacc to instruct it to generate the file
.Pa y.tab.h
containing definitions of all the
.Dq %tokens
appearing in the yacc input.
This file is then included in the
.Nm
scanner.
For example, if one of the tokens is
.Qq TOK_NUMBER ,
part of the scanner might look like:
.Bd -literal -offset indent
%{
#include "y.tab.h"
%}

%%

[0-9]+        yylval = atoi(yytext); return TOK_NUMBER;
.Ed
.Sh OPTIONS
.Nm
has the following options:
.Bl -tag -width Ds
.It Fl 7
Instructs
.Nm
to generate a 7-bit scanner, i.e., one which can only recognize 7-bit
characters in its input.
The advantage of using
.Fl 7
is that the scanner's tables can be up to half the size of those generated
using the
.Fl 8
option
.Pq see below .
The disadvantage is that such scanners often hang
or crash if their input contains an 8-bit character.
.Pp
Note, however, that unless generating a scanner using the
.Fl Cf
or
.Fl CF
table compression options, use of
.Fl 7
will save only a small amount of table space,
and make the scanner considerably less portable.
.Nm flex Ns 's
default behavior is to generate an 8-bit scanner unless
.Fl Cf
or
.Fl CF
is specified, in which case
.Nm
defaults to generating 7-bit scanners unless it was
configured to generate 8-bit scanners
(as will often be the case with non-USA sites).
It is possible to tell whether
.Nm
generated a 7-bit or an 8-bit scanner by inspecting the flag summary in the
.Fl v
output as described below.
.Pp
Note that if
.Fl Cfe
or
.Fl CFe
are used
(the table compression options, but also using equivalence classes as
discussed below),
.Nm
still defaults to generating an 8-bit scanner,
since usually with these compression options full 8-bit tables
are not much more expensive than 7-bit tables.
.It Fl 8
Instructs
.Nm
to generate an 8-bit scanner, i.e., one which can recognize 8-bit
characters.
This flag is only needed for scanners generated using
.Fl Cf
or
.Fl CF ,
as otherwise
.Nm
defaults to generating an 8-bit scanner anyway.
.Pp
See the discussion of
.Fl 7
above for
.Nm flex Ns 's
default behavior and the tradeoffs between 7-bit and 8-bit scanners.
.It Fl B
Instructs
.Nm
to generate a
.Em batch
scanner, the opposite of
.Em interactive
scanners generated by
.Fl I
.Pq see below .
In general,
.Fl B
is used when the scanner will never be used interactively,
and you want to squeeze a little more performance out of it.
If the aim is instead to squeeze out a lot more performance,
use the
.Fl Cf
or
.Fl CF
options
.Pq discussed below ,
which turn on
.Fl B
automatically anyway.
.It Fl b
Generate backing-up information to
.Pa lex.backup .
This is a list of scanner states which require backing up
and the input characters on which they do so.
By adding rules one can remove backing-up states.
If all backing-up states are eliminated and
.Fl Cf
or
.Fl CF
is used, the generated scanner will run faster (see the
.Fl p
flag).
Only users who wish to squeeze every last cycle out of their
scanners need worry about this option.
(See the section on
.Sx PERFORMANCE CONSIDERATIONS
below.)
.It Fl C Ns Op Cm aeFfmr
Controls the degree of table compression and, more generally, trade-offs
between small scanners and fast scanners.
.Bl -tag -width Ds
.It Fl Ca
Instructs
.Nm
to trade off larger tables in the generated scanner for faster performance
because the elements of the tables are better aligned for memory access
and computation.
On some
.Tn RISC
architectures, fetching and manipulating longwords is more efficient
than with smaller-sized units such as shortwords.
This option can double the size of the tables used by the scanner.
.It Fl Ce
Directs
.Nm
to construct
.Em equivalence classes ,
i.e., sets of characters which have identical lexical properties
(for example, if the only appearance of digits in the
.Nm
input is in the character class
.Qq [0-9]
then the digits
.Sq 0 ,
.Sq 1 ,
.Sq ... ,
.Sq 9
will all be put in the same equivalence class).
Equivalence classes usually give dramatic reductions in the final
table/object file sizes
.Pq typically a factor of 2\-5
and are pretty cheap performance-wise
.Pq one array look-up per character scanned .
.It Fl CF
Specifies that the alternate fast scanner representation
(described below under the
.Fl F
option)
should be used.
This option cannot be used with
.Fl + .
.It Fl Cf
Specifies that the
.Em full
scanner tables should be generated \-
.Nm
should not compress the tables by taking advantage of
similar transition functions for different states.
.It Fl \&Cm
Directs
.Nm
to construct
.Em meta-equivalence classes ,
which are sets of equivalence classes
(or characters, if equivalence classes are not being used)
that are commonly used together.
Meta-equivalence classes are often a big win when using compressed tables,
but they have a moderate performance impact
(one or two
.Qq if
tests and one array look-up per character scanned).
.It Fl Cr
Causes the generated scanner to
.Em bypass
use of the standard I/O library
.Pq stdio
for input.
Instead of calling
.Xr fread 3
or
.Xr getc 3 ,
the scanner will use the
.Xr read 2
system call,
resulting in a performance gain which varies from system to system,
but in general is probably negligible unless
.Fl Cf
or
.Fl CF
are being used.
Using
.Fl Cr
can cause strange behavior if, for example, the program reads from
.Fa yyin
using stdio prior to calling the scanner
(because the scanner will miss whatever text previous reads left
in the stdio input buffer).
.Pp
.Fl Cr
has no effect if
.Dv YY_INPUT
is defined
(see
.Sx THE GENERATED SCANNER
above).
.El
.Pp
A lone
.Fl C
specifies that the scanner tables should be compressed but neither
equivalence classes nor meta-equivalence classes should be used.
.Pp
The options
.Fl Cf
or
.Fl CF
and
.Fl \&Cm
do not make sense together \- there is no opportunity for meta-equivalence
classes if the table is not being compressed.
Otherwise the options may be freely mixed, and are cumulative.
.Pp
The default setting is
.Fl Cem
which specifies that
.Nm
should generate equivalence classes and meta-equivalence classes.
This setting provides the highest degree of table compression.
It is possible to trade off faster-executing scanners at the cost of
larger tables with the following generally being true:
.Bd -unfilled -offset indent
slowest & smallest
      -Cem
      -Cm
      -Ce
      -C
      -C{f,F}e
      -C{f,F}
      -C{f,F}a
fastest & largest
.Ed
.Pp
Note that scanners with the smallest tables are usually generated and
compiled the quickest,
so during development the default, maximal compression,
is usually best.
.Pp
.Fl Cfe
is often a good compromise between speed and size for production scanners.
.It Fl d
Makes the generated scanner run in debug mode.
Whenever a pattern is recognized and the global
.Fa yy_flex_debug
is non-zero
.Pq which is the default ,
the scanner will write to stderr a line of the form:
.Pp
.D1 --accepting rule at line 53 ("the matched text")
.Pp
The line number refers to the location of the rule in the file
defining the scanner
(i.e., the file that was fed to
.Nm ) .
Messages are also generated when the scanner backs up,
accepts the default rule,
reaches the end of its input buffer
(or encounters a NUL;
at this point, the two look the same as far as the scanner's concerned),
or reaches an end-of-file.
.It Fl F
Specifies that the fast scanner table representation should be used
.Pq and stdio bypassed .
This representation is about as fast as the full table representation
.Pq Fl f ,
and for some sets of patterns will be considerably smaller
.Pq and for others, larger .
In general, if the pattern set contains both
.Qq keywords
and a catch-all,
.Qq identifier
rule, such as in the set:
.Bd -unfilled -offset indent
"case"    return TOK_CASE;
"switch"  return TOK_SWITCH;
\&...
"default" return TOK_DEFAULT;
[a-z]+    return TOK_ID;
.Ed
.Pp
then it's better to use the full table representation.
If only the
.Qq identifier
rule is present and a hash table or some such is used to detect the keywords,
it's better to use
.Fl F .
.Pp
This option is equivalent to
.Fl CFr
.Pq see above .
It cannot be used with
.Fl + .
.It Fl f
Specifies
.Em fast scanner .
No table compression is done and stdio is bypassed.
The result is large but fast.
This option is equivalent to
.Fl Cfr
.Pq see above .
.It Fl h
Generates a help summary of
.Nm flex Ns 's
options to stdout and then exits.
.Fl ?\&
and
.Fl Fl help
are synonyms for
.Fl h .
.It Fl I
Instructs
.Nm
to generate an
.Em interactive
scanner.
An interactive scanner is one that only looks ahead to decide
what token has been matched if it absolutely must.
It turns out that always looking one extra character ahead,
even if the scanner has already seen enough text
to disambiguate the current token, is a bit faster than
only looking ahead when necessary.
But scanners that always look ahead give dreadful interactive performance;
for example, when a user types a newline,
it is not recognized as a newline token until they enter
.Em another
token, which often means typing in another whole line.
.Pp
.Nm
scanners default to
.Em interactive
unless
.Fl Cf
or
.Fl CF
table-compression options are specified
.Pq see above .
That's because if high-performance is most important,
one of these options should be used,
so if they weren't,
.Nm
assumes it is preferable to trade off a bit of run-time performance for
intuitive interactive behavior.
Note also that
.Fl I
cannot be used in conjunction with
.Fl Cf
or
.Fl CF .
Thus, this option is not really needed; it is on by default for all those
cases in which it is allowed.
.Pp
A scanner can be forced to not be interactive by using
.Fl B
.Pq see above .
.It Fl i
Instructs
.Nm
to generate a case-insensitive scanner.
The case of letters given in the
.Nm
input patterns will be ignored,
and tokens in the input will be matched regardless of case.
The matched text given in
.Fa yytext
will have the preserved case
.Pq i.e., it will not be folded .
.It Fl L
Instructs
.Nm
not to generate
.Dq #line
directives.
Without this option,
.Nm
peppers the generated scanner with #line directives so error messages
in the actions will be correctly located with respect to either the original
.Nm
input file
(if the errors are due to code in the input file),
or
.Pa lex.yy.c
(if the errors are
.Nm flex Ns 's
fault \- these sorts of errors should be reported to the email address
given below).
.It Fl l
Turns on maximum compatibility with the original
.At
.Nm lex
implementation.
Note that this does not mean full compatibility.
Use of this option costs a considerable amount of performance,
and it cannot be used with the
.Fl + , f , F , Cf ,
or
.Fl CF
options.
For details on the compatibilities it provides, see the section
.Sx INCOMPATIBILITIES WITH LEX AND POSIX
below.
This option also results in the name
.Dv YY_FLEX_LEX_COMPAT
being #define'd in the generated scanner.
.It Fl n
A do-nothing, deprecated option included only for
.Tn POSIX
compliance.
.It Fl o Ns Ar output
Directs
.Nm
to write the scanner to the file
.Ar output
instead of
.Pa lex.yy.c .
If
.Fl o
is combined with the
.Fl t
option, then the scanner is written to stdout but its
.Dq #line
directives
(see the
.Fl L
option above)
refer to the file
.Ar output .
.It Fl P Ns Ar prefix
Changes the default
.Qq yy
prefix used by
.Nm
for all globally visible variable and function names to instead be
.Ar prefix .
For example,
.Fl P Ns Ar foo
changes the name of
.Fa yytext
to
.Fa footext .
It also changes the name of the default output file from
.Pa lex.yy.c
to
.Pa lex.foo.c .
Here are all of the names affected:
.Bd -unfilled -offset indent
yy_create_buffer
yy_delete_buffer
yy_flex_debug
yy_init_buffer
yy_flush_buffer
yy_load_buffer_state
yy_switch_to_buffer
yyin
yyleng
yylex
yylineno
yyout
yyrestart
yytext
yywrap
.Ed
.Pp
(If using a C++ scanner, then only
.Fa yywrap
and
.Fa yyFlexLexer
are affected.)
Within the scanner itself, it is still possible to refer to the global variables
and functions using either version of their name; but externally, they
have the modified name.
.Pp
This option allows multiple
.Nm
programs to be easily linked together into the same executable.
Note, though, that using this option also renames
.Fn yywrap ,
so now either an
.Pq appropriately named
version of the routine for the scanner must be supplied, or
.Dq %option noyywrap
must be used, as linking with
.Fl lfl
no longer provides one by default.
.It Fl p
Generates a performance report to stderr.
The report consists of comments regarding features of the
.Nm
input file which will cause a serious loss of performance in the resulting
scanner.
If the flag is specified twice,
comments regarding features that lead to minor performance losses
will also be reported.
.Pp
Note that the use of
.Em REJECT ,
.Dq %option yylineno ,
and variable trailing context
(see the
.Sx BUGS
section below)
entails a substantial performance penalty; use of
.Fn yymore ,
the
.Sq ^
operator, and the
.Fl I
flag entail minor performance penalties.
.It Fl S Ns Ar skeleton
Overrides the default skeleton file from which
.Nm
constructs its scanners.
This option is needed only for
.Nm
maintenance or development.
.It Fl s
Causes the default rule
.Pq that unmatched scanner input is echoed to stdout
to be suppressed.
If the scanner encounters input that does not
match any of its rules, it aborts with an error.
This option is useful for finding holes in a scanner's rule set.
.It Fl T
Makes
.Nm
run in
.Em trace
mode.
It will generate a lot of messages to stderr concerning
the form of the input and the resultant non-deterministic and deterministic
finite automata.
This option is mostly for use in maintaining
.Nm .
.It Fl t
Instructs
.Nm
to write the scanner it generates to standard output instead of
.Pa lex.yy.c .
.It Fl V
Prints the version number to stdout and exits.
.Fl Fl version
is a synonym for
.Fl V .
.It Fl v
Specifies that
.Nm
should write to stderr
a summary of statistics regarding the scanner it generates.
Most of the statistics are meaningless to the casual
.Nm
user, but the first line identifies the version of
.Nm
(same as reported by
.Fl V ) ,
and the next line the flags used when generating the scanner,
including those that are on by default.
.It Fl w
Suppresses warning messages.
.It Fl +
Specifies that
.Nm
should generate a C++ scanner class.
See the section on
.Sx GENERATING C++ SCANNERS
below for details.
.El
.Pp
.Nm
also provides a mechanism for controlling options within the
scanner specification itself, rather than from the
.Nm
command line.
This is done by including
.Dq %option
directives in the first section of the scanner specification.
Multiple options can be specified with a single
.Dq %option
directive, and multiple directives may appear in the first section of the
.Nm
input file.
.Pp
Most options are given simply as names, optionally preceded by the word
.Qq no
.Pq with no intervening whitespace
to negate their meaning.
A number are equivalent to
.Nm
flags or their negation:
.Bd -unfilled -offset indent
7bit            -7 option
8bit            -8 option
align           -Ca option
backup          -b option
batch           -B option
c++             -+ option

caseful or
case-sensitive  opposite of -i (default)

case-insensitive or
caseless        -i option

debug           -d option
default         opposite of -s option
ecs             -Ce option
fast            -F option
full            -f option
interactive     -I option
lex-compat      -l option
meta-ecs        -Cm option
perf-report     -p option
read            -Cr option
stdout          -t option
verbose         -v option
warn            opposite of -w option
                (use "%option nowarn" for -w)

array           equivalent to "%array"
pointer         equivalent to "%pointer" (default)
.Ed
.Pp
Some %option's provide features otherwise not available:
.Bl -tag -width Ds
.It always-interactive
Instructs
.Nm
to generate a scanner which always considers its input
.Qq interactive .
Normally, on each new input file the scanner calls
.Fn isatty
in an attempt to determine whether the scanner's input source is interactive
and thus should be read a character at a time.
When this option is used, however, no such call is made.
.It main
Directs
.Nm
to provide a default
.Fn main
program for the scanner, which simply calls
.Fn yylex .
This option implies
.Dq noyywrap
.Pq see below .
.It never-interactive
Instructs
.Nm
to generate a scanner which never considers its input
.Qq interactive
(again, no call made to
.Fn isatty ) .
This is the opposite of
.Dq always-interactive .
.It stack
Enables the use of start condition stacks
(see
.Sx START CONDITIONS
above).
.It stdinit
If set (i.e.,
.Dq %option stdinit ) ,
initializes
.Fa yyin
and
.Fa yyout
to stdin and stdout, instead of the default of
.Dq nil .
Some existing
.Nm lex
programs depend on this behavior, even though it is not compliant with ANSI C,
which does not require stdin and stdout to be compile-time constant.
.It yylineno
Directs
.Nm
to generate a scanner that maintains the number of the current line
read from its input in the global variable
.Fa yylineno .
This option is implied by
.Dq %option lex-compat .
.It yywrap
If unset (i.e.,
.Dq %option noyywrap ) ,
makes the scanner not call
.Fn yywrap
upon an end-of-file, but simply assume that there are no more files to scan
(until the user points
.Fa yyin
at a new file and calls
.Fn yylex
again).
.El
.Pp
.Nm
scans rule actions to determine whether the
.Em REJECT
or
.Fn yymore
features are being used.
The
.Dq reject
and
.Dq yymore
options are available to override its decision as to whether to use the
options, either by setting them (e.g.,
.Dq %option reject )
to indicate the feature is indeed used,
or unsetting them to indicate it actually is not used
(e.g.,
.Dq %option noyymore ) .
.Pp
Three options take string-delimited values, offset with
.Sq = :
.Pp
.D1 %option outfile="ABC"
.Pp
is equivalent to
.Fl o Ns Ar ABC ,
and
.Pp
.D1 %option prefix="XYZ"
.Pp
is equivalent to
.Fl P Ns Ar XYZ .
Finally,
.Pp
.D1 %option yyclass="foo"
.Pp
only applies when generating a C++ scanner
.Pf ( Fl +
option).
It informs
.Nm
that
.Dq foo
has been derived as a subclass of yyFlexLexer, so
.Nm
will place actions in the member function
.Dq foo::yylex()
instead of
.Dq yyFlexLexer::yylex() .
It also generates a
.Dq yyFlexLexer::yylex()
member function that emits a run-time error (by invoking
.Dq yyFlexLexer::LexerError() )
if called.
See
.Sx GENERATING C++ SCANNERS ,
below, for additional information.
.Pp
A number of options are available for
lint
purists who want to suppress the appearance of unneeded routines
in the generated scanner.
Each of the following, if unset
(e.g.,
.Dq %option nounput ) ,
results in the corresponding routine not appearing in the generated scanner:
.Bd -unfilled -offset indent
input, unput
yy_push_state, yy_pop_state, yy_top_state
yy_scan_buffer, yy_scan_bytes, yy_scan_string
.Ed
.Pp
(though
.Fn yy_push_state
and friends won't appear anyway unless
.Dq %option stack
is being used).
.Sh PERFORMANCE CONSIDERATIONS
The main design goal of
.Nm
is that it generate high-performance scanners.
It has been optimized for dealing well with large sets of rules.
Aside from the effects on scanner speed of the table compression
.Fl C
options outlined above,
there are a number of options/actions which degrade performance.
These are, from most expensive to least:
.Bd -unfilled -offset indent
REJECT
%option yylineno
arbitrary trailing context

pattern sets that require backing up
%array
%option interactive
%option always-interactive

\&'^' beginning-of-line operator
yymore()
.Ed
.Pp
with the first three all being quite expensive
and the last two being quite cheap.
Note also that
.Fn unput
is implemented as a routine call that potentially does quite a bit of work,
while
.Fn yyless
is a quite-cheap macro; so if just putting back some excess text,
use
.Fn yyless .
.Pp
.Em REJECT
should be avoided at all costs when performance is important.
It is a particularly expensive option.
.Pp
Getting rid of backing up is messy and often may be an enormous
amount of work for a complicated scanner.
In principle, one begins by using the
.Fl b
flag to generate a
.Pa lex.backup
file.
For example, on the input
.Bd -literal -offset indent
%%
foo        return TOK_KEYWORD;
foobar     return TOK_KEYWORD;
.Ed
.Pp
the file looks like:
.Bd -literal -offset indent
State #6 is non-accepting -
 associated rule line numbers:
       2       3
 out-transitions: [ o ]
 jam-transitions: EOF [ \e001-n  p-\e177 ]

State #8 is non-accepting -
 associated rule line numbers:
       3
 out-transitions: [ a ]
 jam-transitions: EOF [ \e001-`  b-\e177 ]

State #9 is non-accepting -
 associated rule line numbers:
       3
 out-transitions: [ r ]
 jam-transitions: EOF [ \e001-q  s-\e177 ]

Compressed tables always back up.
.Ed
.Pp
The first few lines tell us that there's a scanner state in
which it can make a transition on an
.Sq o
but not on any other character,
and that in that state the currently scanned text does not match any rule.
The state occurs when trying to match the rules found
at lines 2 and 3 in the input file.
If the scanner is in that state and then reads something other than an
.Sq o ,
it will have to back up to find a rule which is matched.
With a bit of headscratching one can see that this must be the
state it's in when it has seen
.Sq fo .
When this has happened, if anything other than another
.Sq o
is seen, the scanner will have to back up to simply match the
.Sq f
.Pq by the default rule .
.Pp
The comment regarding State #8 indicates there's a problem when
.Qq foob
has been scanned.
Indeed, on any character other than an
.Sq a ,
the scanner will have to back up to accept
.Qq foo .
Similarly, the comment for State #9 concerns when
.Qq fooba
has been scanned and an
.Sq r
does not follow.
.Pp
The final comment reminds us that there's no point going to
all the trouble of removing backing up from the rules unless we're using
.Fl Cf
or
.Fl CF ,
since there's no performance gain doing so with compressed scanners.
.Pp
The way to remove the backing up is to add
.Qq error
rules:
.Bd -literal -offset indent
%%
foo        return TOK_KEYWORD;
foobar     return TOK_KEYWORD;

fooba      |
foob       |
fo         {
           /* false alarm, not really a keyword */
           return TOK_ID;
           }
.Ed
.Pp
Eliminating backing up among a list of keywords can also be done using a
.Qq catch-all
rule:
.Bd -literal -offset indent
%%
foo        return TOK_KEYWORD;
foobar     return TOK_KEYWORD;

[a-z]+     return TOK_ID;
.Ed
.Pp
This is usually the best solution when appropriate.
.Pp
Backing up messages tend to cascade.
With a complicated set of rules it's not uncommon to get hundreds of messages.
If one can decipher them, though,
it often only takes a dozen or so rules to eliminate the backing up
(though it's easy to make a mistake and have an error rule accidentally match
a valid token; a possible future
.Nm
feature will be to automatically add rules to eliminate backing up).
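.Pp
The backing up that the
.Pa lex.backup
entries describe can be made concrete with a small hand-written simulation.
The following C sketch is not code generated by
.Nm ;
.Fn match_keyword
is a hypothetical helper that assumes the only rules are
.Qq foo
and
.Qq foobar
plus the default rule.
It reports how many consumed characters must be given back
before a rule can be accepted:
.Bd -literal -offset indent
```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of flex: hand-simulates the rules
 * "foo" and "foobar" plus the default rule on the start of input.
 * Returns the length of the token that is finally matched and stores
 * in *backed_up how many consumed characters had to be given back.
 */
size_t
match_keyword(const char *input, size_t *backed_up)
{
	static const char longest[] = "foobar";
	size_t consumed = 0;	/* characters the automaton moved over */
	size_t accepted = 0;	/* length at the last accepting state */

	assert(input != NULL && backed_up != NULL);

	/* Follow the single path f-o-o-b-a-r for as long as it matches. */
	while (longest[consumed] != 0 &&
	    input[consumed] == longest[consumed]) {
		consumed++;
		if (consumed == 3 || consumed == 6)
			accepted = consumed;	/* "foo" or "foobar" */
	}

	/* No keyword matched: the default rule takes one character. */
	if (accepted == 0 && input[0] != 0)
		accepted = 1;

	*backed_up = consumed > accepted ? consumed - accepted : 0;
	return accepted;
}
```
.Ed
.Pp
Under these assumptions, the input
.Qq foob!
yields a three-character match with one character backed up, while
.Qq foobar
is accepted without any backing up.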
.Pp
It's important to keep in mind that the benefits of eliminating
backing up are gained only if
.Em every
instance of backing up is eliminated.
Leaving just one gains nothing.
.Pp
.Em Variable
trailing context
(where both the leading and trailing parts do not have a fixed length)
entails almost the same performance loss as
.Em REJECT
.Pq i.e., substantial .
So when possible a rule like:
.Bd -literal -offset indent
%%
mouse|rat/(cat|dog)   run();
.Ed
.Pp
is better written:
.Bd -literal -offset indent
%%
mouse/cat|dog         run();
rat/cat|dog           run();
.Ed
.Pp
or as
.Bd -literal -offset indent
%%
mouse|rat/cat         run();
mouse|rat/dog         run();
.Ed
.Pp
Note that here the special
.Sq |\&
action does not provide any savings, and can even make things worse (see
.Sx BUGS
below).
.Pp
Another area where the user can increase a scanner's performance
.Pq and one that's easier to implement
arises from the fact that the longer the tokens matched,
the faster the scanner will run.
This is because with long tokens the processing of most input
characters takes place in the
.Pq short
inner scanning loop, and does not often have to go through the additional work
of setting up the scanning environment (e.g.,
.Fa yytext )
for the action.
Recall the scanner for C comments:
.Bd -literal -offset indent
%x comment
%%
        int line_num = 1;

"/*"                    BEGIN(comment);

<comment>[^*\en]*
<comment>"*"+[^*/\en]*
<comment>\en             ++line_num;
<comment>"*"+"/"        BEGIN(INITIAL);
.Ed
.Pp
This could be sped up by writing it as:
.Bd -literal -offset indent
%x comment
%%
        int line_num = 1;

"/*"                    BEGIN(comment);

<comment>[^*\en]*
<comment>[^*\en]*\en      ++line_num;
<comment>"*"+[^*/\en]*
<comment>"*"+[^*/\en]*\en ++line_num;
<comment>"*"+"/"        BEGIN(INITIAL);
.Ed
.Pp
Now instead of each newline requiring the processing of another action,
recognizing the newlines is
.Qq distributed
over the other rules to keep the matched text as long as possible.
Note that adding rules does
.Em not
slow down the scanner!
The speed of the scanner is independent of the number of rules or
(modulo the considerations given at the beginning of this section)
how complicated the rules are with regard to operators such as
.Sq *
and
.Sq |\& .
.Pp
A final example in speeding up a scanner:
scan through a file containing identifiers and keywords, one per line
and with no other extraneous characters, and recognize all the keywords.
A natural first approach is:
.Bd -literal -offset indent
%%
asm      |
auto     |
break    |
\&... etc ...
volatile |
while    /* it's a keyword */

\&.|\en     /* it's not a keyword */
.Ed
.Pp
To eliminate the back-tracking, introduce a catch-all rule:
.Bd -literal -offset indent
%%
asm      |
auto     |
break    |
\&... etc ...
volatile |
while    /* it's a keyword */

[a-z]+   |
\&.|\en     /* it's not a keyword */
.Ed
.Pp
Now, if it's guaranteed that there's exactly one word per line,
then we can reduce the total number of matches by half by
merging in the recognition of newlines with that of the other tokens:
.Bd -literal -offset indent
%%
asm\en      |
auto\en     |
break\en    |
\&... etc ...
volatile\en |
while\en    /* it's a keyword */

[a-z]+\en   |
\&.|\en       /* it's not a keyword */
.Ed
.Pp
One has to be careful here,
as we have now reintroduced backing up into the scanner.
In particular, while we know that there will never be any characters
in the input stream other than letters or newlines,
.Nm
can't figure this out, and it will plan for possibly needing to back up
when it has scanned a token like
.Qq auto
and then the next character is something other than a newline or a letter.
Previously it would then just match the
.Qq auto
rule and be done, but now it has no
.Qq auto
rule, only an
.Qq auto\en
rule.
To eliminate the possibility of backing up,
we could either duplicate all rules but without final newlines, or,
since we never expect to encounter such an input and therefore don't
care how it's classified, we can introduce one more catch-all rule,
this one which doesn't include a newline:
.Bd -literal -offset indent
%%
asm\en      |
auto\en     |
break\en    |
\&... etc ...
volatile\en |
while\en    /* it's a keyword */

[a-z]+\en   |
[a-z]+     |
\&.|\en       /* it's not a keyword */
.Ed
.Pp
Compiled with
.Fl Cf ,
this is about as fast as one can get a
.Nm
scanner to go for this particular problem.
.Pp
A final note:
.Nm
is slow when matching NUL's,
particularly when a token contains multiple NUL's.
It's best to write rules which match short
amounts of text if it's anticipated that the text will often include NUL's.
.Pp
Another final note regarding performance: as mentioned above in the section
.Sx HOW THE INPUT IS MATCHED ,
dynamically resizing
.Fa yytext
to accommodate huge tokens is a slow process because it presently requires that
the
.Pq huge
token be rescanned from the beginning.
Thus if performance is vital, it is better to attempt to match
.Qq large
quantities of text but not
.Qq huge
quantities, where the cutoff between the two is at about 8K characters/token.
.Sh GENERATING C++ SCANNERS
.Nm
provides two different ways to generate scanners for use with C++.
The first way is to simply compile a scanner generated by
.Nm
using a C++ compiler instead of a C compiler.
This should not generate any compilation errors
(please report any found to the email address given in the
.Sx AUTHORS
section below).
C++ code can then be used in rule actions instead of C code.
Note that the default input source for scanners remains
.Fa yyin ,
and default echoing is still done to
.Fa yyout .
Both of these remain
.Fa FILE *
variables and not C++ streams.
.Pp
.Nm
can also be used to generate a C++ scanner class, using the
.Fl +
option (or, equivalently,
.Dq %option c++ ) ,
which is automatically specified if the name of the flex executable ends in a
.Sq + ,
such as
.Nm flex++ .
When using this option,
.Nm
defaults to generating the scanner to the file
.Pa lex.yy.cc
instead of
.Pa lex.yy.c .
The generated scanner includes the header file
.Aq Pa g++/FlexLexer.h ,
which defines the interface to two C++ classes.
3465 .Pp 3466 The first class, 3467 .Em FlexLexer , 3468 provides an abstract base class defining the general scanner class interface. 3469 It provides the following member functions: 3470 .Bl -tag -width Ds 3471 .It const char* YYText() 3472 Returns the text of the most recently matched token, the equivalent of 3473 .Fa yytext . 3474 .It int YYLeng() 3475 Returns the length of the most recently matched token, the equivalent of 3476 .Fa yyleng . 3477 .It int lineno() const 3478 Returns the current input line number 3479 (see 3480 .Dq %option yylineno ) , 3481 or 1 if 3482 .Dq %option yylineno 3483 was not used. 3484 .It void set_debug(int flag) 3485 Sets the debugging flag for the scanner, equivalent to assigning to 3486 .Fa yy_flex_debug 3487 (see the 3488 .Sx OPTIONS 3489 section above). 3490 Note that the scanner must be built using 3491 .Dq %option debug 3492 to include debugging information in it. 3493 .It int debug() const 3494 Returns the current setting of the debugging flag. 3495 .El 3496 .Pp 3497 Also provided are member functions equivalent to 3498 .Fn yy_switch_to_buffer , 3499 .Fn yy_create_buffer 3500 (though the first argument is an 3501 .Fa std::istream* 3502 object pointer and not a 3503 .Fa FILE* ) , 3504 .Fn yy_flush_buffer , 3505 .Fn yy_delete_buffer , 3506 and 3507 .Fn yyrestart 3508 (again, the first argument is an 3509 .Fa std::istream* 3510 object pointer). 3511 .Pp 3512 The second class defined in 3513 .Aq Pa g++/FlexLexer.h 3514 is 3515 .Fa yyFlexLexer , 3516 which is derived from 3517 .Fa FlexLexer . 3518 It defines the following additional member functions: 3519 .Bl -tag -width Ds 3520 .It "yyFlexLexer(std::istream* arg_yyin = 0, std::ostream* arg_yyout = 0)" 3521 Constructs a 3522 .Fa yyFlexLexer 3523 object using the given streams for input and output. 3524 If not specified, the streams default to 3525 .Fa cin 3526 and 3527 .Fa cout , 3528 respectively. 
3529 .It virtual int yylex() 3530 Performs the same role as 3531 .Fn yylex 3532 does for ordinary flex scanners: it scans the input stream, consuming 3533 tokens, until a rule's action returns a value. 3534 If subclass 3535 .Sq S 3536 is derived from 3537 .Fa yyFlexLexer , 3538 in order to access the member functions and variables of 3539 .Sq S 3540 inside 3541 .Fn yylex , 3542 use 3543 .Dq %option yyclass="S" 3544 to inform 3545 .Nm 3546 that the 3547 .Sq S 3548 subclass will be used instead of 3549 .Fa yyFlexLexer . 3550 In this case, rather than generating 3551 .Dq yyFlexLexer::yylex() , 3552 .Nm 3553 generates 3554 .Dq S::yylex() 3555 (and also generates a dummy 3556 .Dq yyFlexLexer::yylex() 3557 that calls 3558 .Dq yyFlexLexer::LexerError() 3559 if called). 3560 .It "virtual void switch_streams(std::istream* new_in = 0, std::ostream* new_out = 0)" 3561 Reassigns 3562 .Fa yyin 3563 to 3564 .Fa new_in 3565 .Pq if non-nil 3566 and 3567 .Fa yyout 3568 to 3569 .Fa new_out 3570 .Pq ditto , 3571 deleting the previous input buffer if 3572 .Fa yyin 3573 is reassigned. 3574 .It int yylex(std::istream* new_in, std::ostream* new_out = 0) 3575 First switches the input streams via 3576 .Dq switch_streams(new_in, new_out) 3577 and then returns the value of 3578 .Fn yylex . 3579 .El 3580 .Pp 3581 In addition, 3582 .Fa yyFlexLexer 3583 defines the following protected virtual functions which can be redefined 3584 in derived classes to tailor the scanner: 3585 .Bl -tag -width Ds 3586 .It virtual int LexerInput(char* buf, int max_size) 3587 Reads up to 3588 .Fa max_size 3589 characters into 3590 .Fa buf 3591 and returns the number of characters read. 3592 To indicate end-of-input, return 0 characters. 3593 Note that 3594 .Qq interactive 3595 scanners (see the 3596 .Fl B 3597 and 3598 .Fl I 3599 flags) define the macro 3600 .Dv YY_INTERACTIVE . 
3601 If 3602 .Fn LexerInput 3603 has been redefined, and it's necessary to take different actions depending on 3604 whether or not the scanner might be scanning an interactive input source, 3605 it's possible to test for the presence of this name via 3606 .Dq #ifdef . 3607 .It virtual void LexerOutput(const char* buf, int size) 3608 Writes out 3609 .Fa size 3610 characters from the buffer 3611 .Fa buf , 3612 which, while NUL-terminated, may also contain 3613 .Qq internal 3614 NUL's if the scanner's rules can match text with NUL's in them. 3615 .It virtual void LexerError(const char* msg) 3616 Reports a fatal error message. 3617 The default version of this function writes the message to the stream 3618 .Fa cerr 3619 and exits. 3620 .El 3621 .Pp 3622 Note that a 3623 .Fa yyFlexLexer 3624 object contains its entire scanning state. 3625 Thus such objects can be used to create reentrant scanners. 3626 Multiple instances of the same 3627 .Fa yyFlexLexer 3628 class can be instantiated, and multiple C++ scanner classes can be combined 3629 in the same program using the 3630 .Fl P 3631 option discussed above. 3632 .Pp 3633 Finally, note that the 3634 .Dq %array 3635 feature is not available to C++ scanner classes; 3636 .Dq %pointer 3637 must be used 3638 .Pq the default . 3639 .Pp 3640 Here is an example of a simple C++ scanner: 3641 .Bd -literal -offset indent 3642 // An example of using the flex C++ scanner class. 3643 3644 %{ 3645 #include <errno.h> 3646 int mylineno = 0; 3647 %} 3648 3649 string \e"[^\en"]+\e" 3650 3651 ws [ \et]+ 3652 3653 alpha [A-Za-z] 3654 dig [0-9] 3655 name ({alpha}|{dig}|\e$)({alpha}|{dig}|[_.\e-/$])* 3656 num1 [-+]?{dig}+\e.?([eE][-+]?{dig}+)? 3657 num2 [-+]?{dig}*\e.{dig}+([eE][-+]?{dig}+)? 
3658 number {num1}|{num2} 3659 3660 %% 3661 3662 {ws} /* skip blanks and tabs */ 3663 3664 "/*" { 3665 int c; 3666 3667 while ((c = yyinput()) != 0) { 3668 if(c == '\en') 3669 ++mylineno; 3670 else if(c == '*') { 3671 if ((c = yyinput()) == '/') 3672 break; 3673 else 3674 unput(c); 3675 } 3676 } 3677 } 3678 3679 {number} cout << "number " << YYText() << '\en'; 3680 3681 \en mylineno++; 3682 3683 {name} cout << "name " << YYText() << '\en'; 3684 3685 {string} cout << "string " << YYText() << '\en'; 3686 3687 %% 3688 3689 int main(int /* argc */, char** /* argv */) 3690 { 3691 FlexLexer* lexer = new yyFlexLexer; 3692 while(lexer->yylex() != 0) 3693 ; 3694 return 0; 3695 } 3696 .Ed 3697 .Pp 3698 To create multiple 3699 .Pq different 3700 lexer classes, use the 3701 .Fl P 3702 flag 3703 (or the 3704 .Dq prefix= 3705 option) 3706 to rename each 3707 .Fa yyFlexLexer 3708 to some other 3709 .Fa xxFlexLexer . 3710 .Aq Pa g++/FlexLexer.h 3711 can then be included in other sources once per lexer class, first renaming 3712 .Fa yyFlexLexer 3713 as follows: 3714 .Bd -literal -offset indent 3715 #undef yyFlexLexer 3716 #define yyFlexLexer xxFlexLexer 3717 #include <g++/FlexLexer.h> 3718 3719 #undef yyFlexLexer 3720 #define yyFlexLexer zzFlexLexer 3721 #include <g++/FlexLexer.h> 3722 .Ed 3723 .Pp 3724 If, for example, 3725 .Dq %option prefix="xx" 3726 is used for one scanner and 3727 .Dq %option prefix="zz" 3728 is used for the other. 3729 .Pp 3730 .Sy IMPORTANT : 3731 the present form of the scanning class is experimental 3732 and may change considerably between major releases. 3733 .Sh INCOMPATIBILITIES WITH LEX AND POSIX 3734 .Nm 3735 is a rewrite of the 3736 .At 3737 .Nm lex 3738 tool 3739 (the two implementations do not share any code, though), 3740 with some extensions and incompatibilities, both of which are of concern 3741 to those who wish to write scanners acceptable to either implementation. 
3742 .Nm 3743 is fully compliant with the 3744 .Tn POSIX 3745 .Nm lex 3746 specification, except that when using 3747 .Dq %pointer 3748 .Pq the default , 3749 a call to 3750 .Fn unput 3751 destroys the contents of 3752 .Fa yytext , 3753 which is counter to the 3754 .Tn POSIX 3755 specification. 3756 .Pp 3757 In this section we discuss all of the known areas of incompatibility between 3758 .Nm , 3759 .At 3760 .Nm lex , 3761 and the 3762 .Tn POSIX 3763 specification. 3764 .Pp 3765 .Nm flex Ns 's 3766 .Fl l 3767 option turns on maximum compatibility with the original 3768 .At 3769 .Nm lex 3770 implementation, at the cost of a major loss in the generated scanner's 3771 performance. 3772 We note below which incompatibilities can be overcome using the 3773 .Fl l 3774 option. 3775 .Pp 3776 .Nm 3777 is fully compatible with 3778 .Nm lex 3779 with the following exceptions: 3780 .Bl -dash 3781 .It 3782 The undocumented 3783 .Nm lex 3784 scanner internal variable 3785 .Fa yylineno 3786 is not supported unless 3787 .Fl l 3788 or 3789 .Dq %option yylineno 3790 is used. 3791 .Pp 3792 .Fa yylineno 3793 should be maintained on a per-buffer basis, rather than a per-scanner 3794 .Pq single global variable 3795 basis. 3796 .Pp 3797 .Fa yylineno 3798 is not part of the 3799 .Tn POSIX 3800 specification. 3801 .It 3802 The 3803 .Fn input 3804 routine is not redefinable, though it may be called to read characters 3805 following whatever has been matched by a rule. 3806 If 3807 .Fn input 3808 encounters an end-of-file, the normal 3809 .Fn yywrap 3810 processing is done. 3811 A 3812 .Dq real 3813 end-of-file is returned by 3814 .Fn input 3815 as 3816 .Dv EOF . 3817 .Pp 3818 Input is instead controlled by defining the 3819 .Dv YY_INPUT 3820 macro. 
3821 .Pp 3822 The 3823 .Nm 3824 restriction that 3825 .Fn input 3826 cannot be redefined is in accordance with the 3827 .Tn POSIX 3828 specification, which simply does not specify any way of controlling the 3829 scanner's input other than by making an initial assignment to 3830 .Fa yyin . 3831 .It 3832 The 3833 .Fn unput 3834 routine is not redefinable. 3835 This restriction is in accordance with 3836 .Tn POSIX . 3837 .It 3838 .Nm 3839 scanners are not as reentrant as 3840 .Nm lex 3841 scanners. 3842 In particular, if a scanner is interactive and 3843 an interrupt handler long-jumps out of the scanner, 3844 and the scanner is subsequently called again, 3845 the following error message may be displayed: 3846 .Pp 3847 .D1 fatal flex scanner internal error--end of buffer missed 3848 .Pp 3849 To reenter the scanner, first use 3850 .Pp 3851 .Dl yyrestart(yyin); 3852 .Pp 3853 Note that this call will throw away any buffered input; 3854 usually this isn't a problem with an interactive scanner. 3855 .Pp 3856 Also note that flex C++ scanner classes are reentrant, 3857 so if using C++ is an option, they should be used instead. 3858 See 3859 .Sx GENERATING C++ SCANNERS 3860 above for details. 3861 .It 3862 .Fn output 3863 is not supported. 3864 Output from the 3865 .Em ECHO 3866 macro is done to the file-pointer 3867 .Fa yyout 3868 .Pq default stdout . 3869 .Pp 3870 .Fn output 3871 is not part of the 3872 .Tn POSIX 3873 specification. 3874 .It 3875 .Nm lex 3876 does not support exclusive start conditions 3877 .Pq %x , 3878 though they are in the 3879 .Tn POSIX 3880 specification. 3881 .It 3882 When definitions are expanded, 3883 .Nm 3884 encloses them in parentheses. 3885 With 3886 .Nm lex , 3887 the following: 3888 .Bd -literal -offset indent 3889 NAME [A-Z][A-Z0-9]* 3890 %% 3891 foo{NAME}? printf("Found it\en"); 3892 %% 3893 .Ed 3894 .Pp 3895 will not match the string 3896 .Qq foo 3897 because when the macro is expanded the rule is equivalent to 3898 .Qq foo[A-Z][A-Z0-9]*?
3899 and the precedence is such that the 3900 .Sq ?\& 3901 is associated with 3902 .Qq [A-Z0-9]* . 3903 With 3904 .Nm , 3905 the rule will be expanded to 3906 .Qq foo([A-Z][A-Z0-9]*)? 3907 and so the string 3908 .Qq foo 3909 will match. 3910 .Pp 3911 Note that if the definition begins with 3912 .Sq ^ 3913 or ends with 3914 .Sq $ 3915 then it is not expanded with parentheses, to allow these operators to appear in 3916 definitions without losing their special meanings. 3917 But the 3918 .Sq Aq s , 3919 .Sq / , 3920 and 3921 .Aq Aq EOF 3922 operators cannot be used in a 3923 .Nm 3924 definition. 3925 .Pp 3926 Using 3927 .Fl l 3928 results in the 3929 .Nm lex 3930 behavior of no parentheses around the definition. 3931 .Pp 3932 The 3933 .Tn POSIX 3934 specification is that the definition be enclosed in parentheses. 3935 .It 3936 Some implementations of 3937 .Nm lex 3938 allow a rule's action to begin on a separate line, 3939 if the rule's pattern has trailing whitespace: 3940 .Bd -literal -offset indent 3941 %% 3942 foo|bar<space here> 3943 { foobar_action(); } 3944 .Ed 3945 .Pp 3946 .Nm 3947 does not support this feature. 3948 .It 3949 The 3950 .Nm lex 3951 .Sq %r 3952 .Pq generate a Ratfor scanner 3953 option is not supported. 3954 It is not part of the 3955 .Tn POSIX 3956 specification. 3957 .It 3958 After a call to 3959 .Fn unput , 3960 .Fa yytext 3961 is undefined until the next token is matched, 3962 unless the scanner was built using 3963 .Dq %array . 3964 This is not the case with 3965 .Nm lex 3966 or the 3967 .Tn POSIX 3968 specification. 3969 The 3970 .Fl l 3971 option does away with this incompatibility. 3972 .It 3973 The precedence of the 3974 .Sq {} 3975 .Pq numeric range 3976 operator is different. 3977 .Nm lex 3978 interprets 3979 .Qq abc{1,3} 3980 as match one, two, or three occurrences of 3981 .Sq abc , 3982 whereas 3983 .Nm 3984 interprets it as match 3985 .Sq ab 3986 followed by one, two, or three occurrences of 3987 .Sq c . 
3988 The latter is in agreement with the 3989 .Tn POSIX 3990 specification. 3991 .It 3992 The precedence of the 3993 .Sq ^ 3994 operator is different. 3995 .Nm lex 3996 interprets 3997 .Qq ^foo|bar 3998 as match either 3999 .Sq foo 4000 at the beginning of a line, or 4001 .Sq bar 4002 anywhere, whereas 4003 .Nm 4004 interprets it as match either 4005 .Sq foo 4006 or 4007 .Sq bar 4008 if they come at the beginning of a line. 4009 The latter is in agreement with the 4010 .Tn POSIX 4011 specification. 4012 .It 4013 The special table-size declarations such as 4014 .Sq %a 4015 supported by 4016 .Nm lex 4017 are not required by 4018 .Nm 4019 scanners; 4020 .Nm 4021 ignores them. 4022 .It 4023 The name 4024 .Dv FLEX_SCANNER 4025 is #define'd so scanners may be written for use with either 4026 .Nm 4027 or 4028 .Nm lex . 4029 Scanners also include 4030 .Dv YY_FLEX_MAJOR_VERSION 4031 and 4032 .Dv YY_FLEX_MINOR_VERSION 4033 indicating which version of 4034 .Nm 4035 generated the scanner 4036 (for example, for the 2.5 release, these defines would be 2 and 5, 4037 respectively). 4038 .El 4039 .Pp 4040 The following 4041 .Nm 4042 features are not included in 4043 .Nm lex 4044 or the 4045 .Tn POSIX 4046 specification: 4047 .Bd -unfilled -offset indent 4048 C++ scanners 4049 %option 4050 start condition scopes 4051 start condition stacks 4052 interactive/non-interactive scanners 4053 yy_scan_string() and friends 4054 yyterminate() 4055 yy_set_interactive() 4056 yy_set_bol() 4057 YY_AT_BOL() 4058 <<EOF>> 4059 <*> 4060 YY_DECL 4061 YY_START 4062 YY_USER_ACTION 4063 YY_USER_INIT 4064 #line directives 4065 %{}'s around actions 4066 multiple actions on a line 4067 .Ed 4068 .Pp 4069 plus almost all of the 4070 .Nm 4071 flags. 
4072 The last feature in the list refers to the fact that with 4073 .Nm 4074 multiple actions can be placed on the same line, 4075 separated with semi-colons, while with 4076 .Nm lex , 4077 the following 4078 .Pp 4079 .Dl foo handle_foo(); ++num_foos_seen; 4080 .Pp 4081 is 4082 .Pq rather surprisingly 4083 truncated to 4084 .Pp 4085 .Dl foo handle_foo(); 4086 .Pp 4087 .Nm 4088 does not truncate the action. 4089 Actions that are not enclosed in braces 4090 are simply terminated at the end of the line. 4091 .Sh FILES 4092 .Bl -tag -width "<g++/FlexLexer.h>" 4093 .It flex.skl 4094 Skeleton scanner. 4095 This file is only used when building flex, not when 4096 .Nm 4097 executes. 4098 .It lex.backup 4099 Backing-up information for the 4100 .Fl b 4101 flag (called 4102 .Pa lex.bck 4103 on some systems). 4104 .It lex.yy.c 4105 Generated scanner 4106 (called 4107 .Pa lexyy.c 4108 on some systems). 4109 .It lex.yy.cc 4110 Generated C++ scanner class, when using 4111 .Fl + . 4112 .It Aq g++/FlexLexer.h 4113 Header file defining the C++ scanner base class, 4114 .Fa FlexLexer , 4115 and its derived class, 4116 .Fa yyFlexLexer . 4117 .It /usr/lib/libl.* 4118 .Nm 4119 libraries. 4120 The 4121 .Pa /usr/lib/libfl.*\& 4122 libraries are links to these. 4123 Scanners must be linked using either 4124 .Fl \&ll 4125 or 4126 .Fl lfl . 4127 .El 4128 .Sh EXIT STATUS 4129 .Ex -std flex 4130 .Sh DIAGNOSTICS 4131 .Bl -diag 4132 .It warning, rule cannot be matched 4133 Indicates that the given rule cannot be matched because it follows other rules 4134 that will always match the same text as it. 4135 For example, in the following 4136 .Dq foo 4137 cannot be matched because it comes after an identifier 4138 .Qq catch-all 4139 rule: 4140 .Bd -literal -offset indent 4141 [a-z]+ got_identifier(); 4142 foo got_foo(); 4143 .Ed 4144 .Pp 4145 Using 4146 .Em REJECT 4147 in a scanner suppresses this warning. 
4148 .It "warning, \-s option given but default rule can be matched" 4149 Means that it is possible 4150 .Pq perhaps only in a particular start condition 4151 that the default rule 4152 .Pq match any single character 4153 is the only one that will match a particular input. 4154 Since 4155 .Fl s 4156 was given, presumably this is not intended. 4157 .It reject_used_but_not_detected undefined 4158 .It yymore_used_but_not_detected undefined 4159 These errors can occur at compile time. 4160 They indicate that the scanner uses 4161 .Em REJECT 4162 or 4163 .Fn yymore 4164 but that 4165 .Nm 4166 failed to notice the fact, meaning that 4167 .Nm 4168 scanned the first two sections looking for occurrences of these actions 4169 and failed to find any, but somehow they snuck in 4170 .Pq via an #include file, for example . 4171 Use 4172 .Dq %option reject 4173 or 4174 .Dq %option yymore 4175 to indicate to 4176 .Nm 4177 that these features are really needed. 4178 .It flex scanner jammed 4179 A scanner compiled with 4180 .Fl s 4181 has encountered an input string which wasn't matched by any of its rules. 4182 This error can also occur due to internal problems. 4183 .It token too large, exceeds YYLMAX 4184 The scanner uses 4185 .Dq %array 4186 and one of its rules matched a string longer than the 4187 .Dv YYLMAX 4188 constant 4189 .Pq 8K bytes by default . 4190 The value can be increased by #define'ing 4191 .Dv YYLMAX 4192 in the definitions section of 4193 .Nm 4194 input. 4195 .It "scanner requires \-8 flag to use the character 'x'" 4196 The scanner specification includes recognizing the 8-bit character 4197 .Sq x 4198 and the 4199 .Fl 8 4200 flag was not specified, and defaulted to 7-bit because the 4201 .Fl Cf 4202 or 4203 .Fl CF 4204 table compression options were used. 4205 See the discussion of the 4206 .Fl 7 4207 flag for details. 
4208 .It flex scanner push-back overflow 4209 unput() was used to push back so much text that the scanner's buffer 4210 could not hold both the pushed-back text and the current token in 4211 .Fa yytext . 4212 Ideally the scanner should dynamically resize the buffer in this case, 4213 but at present it does not. 4214 .It "input buffer overflow, can't enlarge buffer because scanner uses REJECT" 4215 The scanner was working on matching an extremely large token and needed 4216 to expand the input buffer. 4217 This doesn't work with scanners that use 4218 .Em REJECT . 4219 .It "fatal flex scanner internal error--end of buffer missed" 4220 This can occur in a scanner which is reentered after a long-jump 4221 has jumped out 4222 .Pq or over 4223 the scanner's activation frame. 4224 Before reentering the scanner, use: 4225 .Pp 4226 .Dl yyrestart(yyin); 4227 .Pp 4228 or, as noted above, switch to using the C++ scanner class. 4229 .It "too many start conditions in <> construct!" 4230 More start conditions than exist were listed in a <> construct 4231 (so at least one of them must have been listed twice). 4232 .El 4233 .Sh SEE ALSO 4234 .Xr awk 1 , 4235 .Xr sed 1 , 4236 .Xr yacc 1 4237 .Rs 4238 .%A John Levine 4239 .%A Tony Mason 4240 .%A Doug Brown 4241 .%B Lex & Yacc 4242 .%I O'Reilly and Associates 4243 .%N 2nd edition 4244 .Re 4245 .Rs 4246 .%A Alfred Aho 4247 .%A Ravi Sethi 4248 .%A Jeffrey Ullman 4249 .%B Compilers: Principles, Techniques and Tools 4250 .%I Addison-Wesley 4251 .%D 1986 4252 .%O "Describes the pattern-matching techniques used by flex (deterministic finite automata)" 4253 .Re 4254 .Sh STANDARDS 4255 The 4256 .Nm lex 4257 utility is compliant with the 4258 .St -p1003.1-2008 4259 specification, 4260 though its presence is optional. 4261 .Pp 4262 The flags 4263 .Op Fl 78BbCdFfhIiLloPpSsTVw+? , 4264 .Op Fl -help , 4265 and 4266 .Op Fl -version 4267 are extensions to that specification.
4268 .Pp 4269 See also the 4270 .Sx INCOMPATIBILITIES WITH LEX AND POSIX 4271 section, above. 4272 .Sh AUTHORS 4273 Vern Paxson, with the help of many ideas and much inspiration from 4274 Van Jacobson. 4275 Original version by Jef Poskanzer. 4276 The fast table representation is a partial implementation of a design done by 4277 Van Jacobson. 4278 The implementation was done by Kevin Gong and Vern Paxson. 4279 .Pp 4280 Thanks to the many 4281 .Nm 4282 beta-testers, feedbackers, and contributors, especially Francois Pinard, 4283 Casey Leedom, 4284 Robert Abramovitz, 4285 Stan Adermann, Terry Allen, David Barker-Plummer, John Basrai, 4286 Neal Becker, Nelson H.F. Beebe, benson@odi.com, 4287 Karl Berry, Peter A. Bigot, Simon Blanchard, 4288 Keith Bostic, Frederic Brehm, Ian Brockbank, Kin Cho, Nick Christopher, 4289 Brian Clapper, J.T. Conklin, 4290 Jason Coughlin, Bill Cox, Nick Cropper, Dave Curtis, Scott David 4291 Daniels, Chris G. Demetriou, Theo de Raadt, 4292 Mike Donahue, Chuck Doucette, Tom Epperly, Leo Eskin, 4293 Chris Faylor, Chris Flatters, Jon Forrest, Jeffrey Friedl, 4294 Joe Gayda, Kaveh R. Ghazi, Wolfgang Glunz, 4295 Eric Goldman, Christopher M. Gould, Ulrich Grepel, Peer Griebel, 4296 Jan Hajic, Charles Hemphill, NORO Hideo, 4297 Jarkko Hietaniemi, Scott Hofmann, 4298 Jeff Honig, Dana Hudes, Eric Hughes, John Interrante, 4299 Ceriel Jacobs, Michal Jaegermann, Sakari Jalovaara, Jeffrey R. Jones, 4300 Henry Juengst, Klaus Kaempf, Jonathan I. Kamens, Terrence O Kane, 4301 Amir Katz, ken@ken.hilco.com, Kevin B. Kenny, 4302 Steve Kirsch, Winfried Koenig, Marq Kole, Ronald Lamprecht, 4303 Greg Lee, Rohan Lenard, Craig Leres, John Levine, Steve Liddle, 4304 David Loffredo, Mike Long, 4305 Mohamed el Lozy, Brian Madsen, Malte, Joe Marshall, 4306 Bengt Martensson, Chris Metcalf, 4307 Luke Mewburn, Jim Meyering, R. Alexander Milowski, Erik Naggum, 4308 G.T. 
Nicol, Landon Noll, James Nordby, Marc Nozell, 4309 Richard Ohnemus, Karsten Pahnke, 4310 Sven Panne, Roland Pesch, Walter Pelissero, Gaumond Pierre, 4311 Esmond Pitt, Jef Poskanzer, Joe Rahmeh, Jarmo Raiha, 4312 Frederic Raimbault, Pat Rankin, Rick Richardson, 4313 Kevin Rodgers, Kai Uwe Rommel, Jim Roskind, Alberto Santini, 4314 Andreas Scherer, Darrell Schiebel, Raf Schietekat, 4315 Doug Schmidt, Philippe Schnoebelen, Andreas Schwab, 4316 Larry Schwimmer, Alex Siegel, Eckehard Stolz, Jan-Erik Strvmquist, 4317 Mike Stump, Paul Stuart, Dave Tallman, Ian Lance Taylor, 4318 Chris Thewalt, Richard M. Timoney, Jodi Tsai, 4319 Paul Tuinenga, Gary Weik, Frank Whaley, Gerhard Wilhelms, Kent Williams, 4320 Ken Yap, Ron Zellar, Nathan Zelle, David Zuhn, 4321 and those whose names have slipped my marginal mail-archiving skills 4322 but whose contributions are appreciated all the 4323 same. 4324 .Pp 4325 Thanks to Keith Bostic, Jon Forrest, Noah Friedman, 4326 John Gilmore, Craig Leres, John Levine, Bob Mulcahy, G.T. 4327 Nicol, Francois Pinard, Rich Salz, and Richard Stallman for help with various 4328 distribution headaches. 4329 .Pp 4330 Thanks to Esmond Pitt and Earle Horton for 8-bit character support; 4331 to Benson Margulies and Fred Burke for C++ support; 4332 to Kent Williams and Tom Epperly for C++ class support; 4333 to Ove Ewerlid for support of NUL's; 4334 and to Eric Hughes for support of multiple buffers. 4335 .Pp 4336 This work was primarily done when I was with the Real Time Systems Group 4337 at the Lawrence Berkeley Laboratory in Berkeley, CA. 4338 Many thanks to all there for the support I received. 4339 .Pp 4340 Send comments to 4341 .Aq Mt vern@ee.lbl.gov . 4342 .Sh BUGS 4343 Some trailing context patterns cannot be properly matched and generate 4344 warning messages 4345 .Pq "dangerous trailing context" . 
4346 These are patterns where the ending of the first part of the rule 4347 matches the beginning of the second part, such as 4348 .Qq zx*/xy* , 4349 where the 4350 .Sq x* 4351 matches the 4352 .Sq x 4353 at the beginning of the trailing context. 4354 (Note that the POSIX draft states that the text matched by such patterns 4355 is undefined.) 4356 .Pp 4357 For some trailing context rules, parts which are actually fixed-length are 4358 not recognized as such, leading to the above-mentioned performance loss. 4359 In particular, parts using 4360 .Sq |\& 4361 or 4362 .Sq {n} 4363 (such as 4364 .Qq foo{3} ) 4365 are always considered variable-length. 4366 .Pp 4367 Combining trailing context with the special 4368 .Sq |\& 4369 action can result in fixed trailing context being turned into 4370 the more expensive variable trailing context. 4371 This happens, for example, in the following: 4372 .Bd -literal -offset indent 4373 %% 4374 abc | 4375 xyz/def 4376 .Ed 4377 .Pp 4378 Use of 4379 .Fn unput 4380 invalidates yytext and yyleng, unless the 4381 .Dq %array 4382 directive 4383 or the 4384 .Fl l 4385 option has been used. 4386 .Pp 4387 Pattern-matching of NUL's is substantially slower than matching other 4388 characters. 4389 .Pp 4390 Dynamic resizing of the input buffer is slow, as it entails rescanning 4391 all the text matched so far by the current 4392 .Pq generally huge 4393 token. 4394 .Pp 4395 Due to both buffering of input and read-ahead, 4396 it is not possible to intermix calls to 4397 .Aq Pa stdio.h 4398 routines, such as 4399 .Fn getchar , 4400 with 4401 .Nm 4402 rules and expect it to work. 4403 Call 4404 .Fn input 4405 instead. 4406 .Pp 4407 The total table entries listed by the 4408 .Fl v 4409 flag exclude the number of table entries needed to determine 4410 what rule has been matched.
4411 The number of entries is equal to the number of DFA states 4412 if the scanner does not use 4413 .Em REJECT , 4414 and somewhat greater than the number of states if it does. 4415 .Pp 4416 .Em REJECT 4417 cannot be used with the 4418 .Fl f 4419 or 4420 .Fl F 4421 options. 4422 .Pp 4423 The 4424 .Nm 4425 internal algorithms need documentation.