JISON: How do I avoid "dog" being parsed as "do"?


I have the following JISON file (lite version of my actual file, but reproduces my problem):

%lex

%%

"do"                        return 'DO';
[a-zA-Z_][a-zA-Z0-9_]*      return 'ID';
"::"                        return 'DOUBLECOLON'
<<EOF>>                     return 'ENDOFFILE';

/lex

%%

start
    : ID DOUBLECOLON ID ENDOFFILE
    {$$ = {type: "enumval", enum: $1, val: $3}}
    ;

It is for parsing something like "AnimalTypes::cat", and it works fine for inputs like that, but when it sees dog instead of cat, it assumes it's a DO instead of an ID. I can see why it does that, but how do I get around it? I've been looking at other JISON documents, but I can't seem to spot the difference that (I assume) makes those work.

This is the error I get:

JisonParserError: Parse error on line 1:
PetTypes::dog
----------^
Expecting "ID", "enumstr", "id", got unexpected "DO"

Repro steps:

  1. Install jison-gho globally from npm (or modify the code to use a local version). I use Node v14.6.0.
  2. Save the JISON above as minimal-repro.jison
  3. Run: jison -m es -o ./minimal.mjs ./minimal-repro.jison to create the parser
  4. Create a file named test.mjs with code like:
import Parser from "./minimal.mjs";
Parser.parser.parse("PetTypes::dog")
  5. Run: node test.mjs

Edit: Updated with a reproducible example. Edit2: Simpler JISON

Accepted answer, by rici:

Unlike (f)lex, the jison lexer accepts the first matching pattern, even if it is not the longest matching pattern. You can get the (f)lex behaviour by using

 %option flex

However, that significantly slows down the scanner.
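The first-match behaviour described above can be sketched in plain JavaScript (a toy model with a hypothetical `nextToken` helper, not jison's actual scanner code):

```javascript
// jison tries lexer rules in order and takes the FIRST one that matches,
// unlike (f)lex, which takes the LONGEST match across all rules.
const rules = [
  { re: /^do/, tok: "DO" },                     // keyword rule, listed first
  { re: /^[a-zA-Z_][a-zA-Z0-9_]*/, tok: "ID" }  // identifier rule
];

function nextToken(input) {
  for (const { re, tok } of rules) {
    const m = re.exec(input);
    if (m) return { tok, text: m[0] };
  }
  return null;
}

console.log(nextToken("dog")); // first match wins: { tok: 'DO', text: 'do' }
console.log(nextToken("cat")); // { tok: 'ID', text: 'cat' }
```

A longest-match scanner would instead compare all matches and pick ID with text "dog", which is the behaviour the question expects.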

The original jison automatically added \b to the end of patterns which ended with a literal string matching an alphabetic character, to make it easier to match keywords without incurring this overhead. In jison-gho, this feature was turned off unless you specify

 %option easy_keyword_rules

See https://github.com/zaach/jison/wiki/Deviations-From-Flex-Bison#user-content-literal-tokens.

So either of those options will achieve the behaviour you expect.
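Putting it together, the lexer section fixed with the second option might look like this (a sketch; the manual alternative is to append \b to the keyword pattern yourself, e.g. "do"\b):

```
%lex

%option easy_keyword_rules

%%

"do"                        return 'DO';
[a-zA-Z_][a-zA-Z0-9_]*      return 'ID';
"::"                        return 'DOUBLECOLON';
<<EOF>>                     return 'ENDOFFILE';

/lex
```

With this option, the "do" rule effectively becomes "do"\b, so it no longer matches the first two characters of dog, and the input falls through to the ID rule.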