Order of Basic keywords

Post by **carlsson** » Thu Jan 22, 2009 9:30 am

Rather than further disturbing Kweepa's thread, I thought we could discuss the order of Basic keywords and how different tokenizers handle them in a separate thread.

As far as I understand, keywords (excluding special symbols) are tokenized in this order:

END, FOR, NEXT, DATA, INPUT#, INPUT, DIM, READ, LET, GOTO, RUN, IF, RESTORE, GOSUB, RETURN, REM, STOP, ON, WAIT, LOAD, SAVE, VERIFY, DEF, POKE, PRINT#, PRINT, CONT, LIST, CLR, CMD, SYS, OPEN, CLOSE, GET, NEW, TAB(, TO, FN, SPC(, THEN, NOT, STEP, AND, OR, SGN, INT, ABS, USR, FRE, POS, SQR, RND, LOG, EXP, COS, SIN, TAN, ATN, PEEK, LEN, STR$, VAL, ASC, CHR$, LEFT$, RIGHT$, MID$, GO

I don't see GET# in the list, but perhaps Basic handles the two variants in one call unlike INPUT# and PRINT#.

From here we can come up with combinations that are impossible if spaces are omitted. "T AND 16" and "S TO P" were already mentioned: without spaces they tokenize as TAN D16 and STOP. Obviously the parser will also choke on something like "LF OR 32" without spaces, since FOR gets matched inside it. I suppose "ON LO GOTO 25" also needs to be spaced, because LOG would be matched before GOTO is reached.

Most of the time I believe you will quickly spot a misplaced keyword if Basic bails out with ?SYNTAX ERROR, but it can be fun to see how various tokenizers handle these peculiar cases.

Post by **Mike** » Fri Jan 23, 2009 2:09 am

The BASIC tokenizer roughly works as follows:

There is a text pointer, an offset, and a keyword pointer. The text pointer points at the first character of the input line that has not yet been tokenized; the offset is added to it to reach a position further into the input line. The keyword pointer starts at the beginning of the keyword table.

A mismatch resets the offset to zero, and advances the keyword pointer to the next keyword.

If a match is found, the letters are replaced by the token, the gap behind it is closed, and then the text pointer is set just behind the token. The tokenizer never looks to the left! But that's what happened in the STOP case.

Only if no keyword matches is the text pointer advanced by one, the character left as is, and the keyword pointer reset to the beginning of the table.
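The steps described above can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not the ROM routine: the names `KEYWORDS`, `tokenize` and `render` are made up for this example, operator tokens are left out of the table, and special handling of quoted strings, REM, DATA and abbreviations is ignored. Matched keywords are shown in square brackets instead of being replaced by a token byte.

```python
# Keyword table in the order listed in the first post. TAB( and SPC( are
# written with their opening parenthesis. Operator tokens are omitted, so
# list positions here do not correspond to real token values.
KEYWORDS = [
    "END", "FOR", "NEXT", "DATA", "INPUT#", "INPUT", "DIM", "READ", "LET",
    "GOTO", "RUN", "IF", "RESTORE", "GOSUB", "RETURN", "REM", "STOP", "ON",
    "WAIT", "LOAD", "SAVE", "VERIFY", "DEF", "POKE", "PRINT#", "PRINT",
    "CONT", "LIST", "CLR", "CMD", "SYS", "OPEN", "CLOSE", "GET", "NEW",
    "TAB(", "TO", "FN", "SPC(", "THEN", "NOT", "STEP", "AND", "OR", "SGN",
    "INT", "ABS", "USR", "FRE", "POS", "SQR", "RND", "LOG", "EXP", "COS",
    "SIN", "TAN", "ATN", "PEEK", "LEN", "STR$", "VAL", "ASC", "CHR$",
    "LEFT$", "RIGHT$", "MID$", "GO",
]


def tokenize(line):
    """Return a list of (is_keyword, text) pairs for one input line."""
    out = []
    text_ptr = 0
    while text_ptr < len(line):
        for keyword in KEYWORDS:  # the keyword pointer walks the table
            offset = 0  # a mismatch on the previous keyword resets the offset
            while (offset < len(keyword)
                   and text_ptr + offset < len(line)
                   and line[text_ptr + offset] == keyword[offset]):
                offset += 1
            if offset == len(keyword):
                # Full match: emit the keyword and continue behind it.
                out.append((True, keyword))
                text_ptr += offset
                break
        else:
            # No keyword matched here: keep the character, advance by one.
            out.append((False, line[text_ptr]))
            text_ptr += 1
    return out


def render(items):
    """Show keywords in brackets and plain characters unchanged."""
    return "".join("[" + text + "]" if is_kw else text for is_kw, text in items)


for sample in ("TAND16", "T AND 16", "STOP", "S TO P", "LFOR32", "ONLOGOTO25"):
    print(sample, "->", render(tokenize(sample)))
```

Under these assumptions the unspaced inputs come out as `[TAN]D16`, `[STOP]`, `L[FOR]32` and `[ON][LOG]O[TO]25`, while the spaced versions keep the intended `[AND]` and `[TO]`.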

Thus, the order of the keywords in the list really doesn't matter (the numbering of the resulting tokens aside), as long as no shorter keyword precedes a longer one that is identical to it up to the last letter of the shorter one. That's the reason INPUT# comes before INPUT. Otherwise INPUT# could never be tokenized as a single-byte token.
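The prefix condition can be checked mechanically. The following Python sketch uses two hypothetical helpers, `shadowed_keywords` and `first_match`, and only an excerpt of the keyword table; it assumes nothing more than the first-match-in-table-order rule described above.

```python
def shadowed_keywords(table):
    """Return (earlier, later) pairs where `earlier` is a proper prefix of
    `later`, so that `later` could never be matched."""
    pairs = []
    for i, earlier in enumerate(table):
        for later in table[i + 1:]:
            if later != earlier and later.startswith(earlier):
                pairs.append((earlier, later))
    return pairs


def first_match(text, table):
    """Return the first keyword in table order that `text` starts with."""
    for keyword in table:
        if text.startswith(keyword):
            return keyword
    return None


# Excerpt in the order from the first post: longer variants come first.
rom_order = ["INPUT#", "INPUT", "GOTO", "GOSUB", "PRINT#", "PRINT", "GO"]
# Hypothetical table with INPUT moved in front of INPUT#.
swapped = ["INPUT", "INPUT#", "GOTO", "GOSUB", "PRINT#", "PRINT", "GO"]

print(shadowed_keywords(rom_order))        # no shadowed pairs
print(shadowed_keywords(swapped))          # INPUT shadows INPUT#
print(first_match("INPUT#1,A$", rom_order))
print(first_match("INPUT#1,A$", swapped))
```

With the swapped table, `INPUT#1,A$` is matched as INPUT followed by a stray `#`, which is exactly the case the ordering avoids. The same reasoning explains why GO sits at the very end, after GOTO and GOSUB.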

carlsson wrote: Most of the time I believe you will quickly spot a misplaced keyword if Basic bails out with ?SYNTAX ERROR but it can be fun to try various tokenizers how they handle these peculiar cases.

Of course, the CBM tokenizer is the reference. Any other tokenizer had better give the same result.

Michael