Order of Basic keywords

carlsson
Class of '6502
Posts: 5516
Joined: Wed Mar 10, 2004 1:41 am

Post by carlsson »

Rather than further disturbing Kweepa's thread, I thought we could discuss the order of Basic keywords and how different tokenizers handle them in a separate thread.

As far as I understand, keywords (excluding special symbols) are tokenized in this order:

END, FOR, NEXT, DATA, INPUT#, INPUT, DIM, READ, LET, GOTO, RUN, IF, RESTORE, GOSUB, RETURN, REM, STOP, ON, WAIT, LOAD, SAVE, VERIFY, DEF, POKE, PRINT#, PRINT, CONT, LIST, CLR, CMD, SYS, OPEN, CLOSE, GET, NEW, TAB, TO, FN, SPC, THEN, NOT, STEP, AND, OR, SGN, INT, ABS, USR, FRE, POS, SQR, RND, LOG, EXP, COS, SIN, TAN, ATN, PEEK, LEN, STR$, VAL, ASC, CHR$, LEFT$, RIGHT$, MID$, GO

I don't see GET# in the list, but perhaps Basic handles the two variants in one call unlike INPUT# and PRINT#.

From here we can come up with impossible combinations if spaces are omitted. "T AND 16" and "S TO P" were already mentioned. Obviously the parser will choke on something like "LF OR 32" without spaces. I suppose "ON LO GOTO 25" also needs to be spaced.
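To hunt for such combinations without typing them into the machine, here is a rough Python sketch. It assumes a plain first-match scan through the keyword list exactly as posted above (operator symbols left out) and ignores quoted strings, REM and DATA. The helper names `keywords_found` and `needs_spaces` are made up for this example.

```python
# Keyword order as listed in the post (operator symbols left out).
KEYWORDS = """END FOR NEXT DATA INPUT# INPUT DIM READ LET GOTO RUN IF RESTORE
GOSUB RETURN REM STOP ON WAIT LOAD SAVE VERIFY DEF POKE PRINT# PRINT CONT LIST
CLR CMD SYS OPEN CLOSE GET NEW TAB TO FN SPC THEN NOT STEP AND OR SGN INT ABS
USR FRE POS SQR RND LOG EXP COS SIN TAN ATN PEEK LEN STR$ VAL ASC CHR$ LEFT$
RIGHT$ MID$ GO""".split()


def keywords_found(text):
    """Return the keywords a first-match scan finds in text, left to right."""
    found, pos = [], 0
    while pos < len(text):
        # First keyword in table order that matches at the current position.
        hit = next((kw for kw in KEYWORDS if text.startswith(kw, pos)), None)
        if hit:
            found.append(hit)
            pos += len(hit)
        else:
            pos += 1  # ordinary character, skip it
    return found


def needs_spaces(statement):
    """True if removing all spaces changes which keywords are recognised."""
    return keywords_found(statement) != keywords_found(statement.replace(" ", ""))


for stmt in ["T AND 16", "S TO P", "LF OR 32", "ON LO GOTO 25", "FOR I = 1 TO 9"]:
    print(stmt, "->", keywords_found(stmt.replace(" ", "")), needs_spaces(stmt))
```

Under these assumptions the first four statements are flagged as needing their spaces, while "FOR I = 1 TO 9" comes out the same either way.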

Most of the time I believe you will quickly spot a misplaced keyword if Basic bails out with ?SYNTAX ERROR but it can be fun to try various tokenizers how they handle these peculiar cases. :)
Anders Carlsson

Mike
Herr VC
Posts: 4841
Joined: Wed Dec 01, 2004 1:57 pm
Location: Munich, Germany
Occupation: electrical engineer

Post by Mike »

The BASIC tokenizer roughly works as follows:

There is a text pointer, an offset, and a keyword pointer. The text pointer points at the first character of the input line that has not yet been tokenized; the offset is added to it to reach a position further into the input line. The keyword pointer starts at the beginning of the keyword table.

A mismatch resets the offset to zero, and advances the keyword pointer to the next keyword.

If a match is found, the letters are replaced by the single-byte token, the gap behind it is closed, and the text pointer is set just past the token. The tokenizer never looks to the left! But that's what happened in the STOP case.

Only if no keyword matches is the character left as is, the text pointer advanced by one, and the keyword pointer reset to the beginning of the table.
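As a rough illustration of the steps just described, here is a Python sketch with the three pointers spelled out. It is not the ROM routine: it uses the keyword list as posted at the top of the thread (no operator symbols), marks a token as `<KEYWORD>` instead of emitting a token byte, and ignores things the real routine also has to deal with, such as quoted strings and REM text. The name `crunch` is just a label for this example.

```python
# Keyword table in the order posted at the top of the thread.
KEYWORDS = """END FOR NEXT DATA INPUT# INPUT DIM READ LET GOTO RUN IF RESTORE
GOSUB RETURN REM STOP ON WAIT LOAD SAVE VERIFY DEF POKE PRINT# PRINT CONT LIST
CLR CMD SYS OPEN CLOSE GET NEW TAB TO FN SPC THEN NOT STEP AND OR SGN INT ABS
USR FRE POS SQR RND LOG EXP COS SIN TAN ATN PEEK LEN STR$ VAL ASC CHR$ LEFT$
RIGHT$ MID$ GO""".split()


def crunch(line, keywords=KEYWORDS):
    """Tokenize line; keywords come out as '<KEYWORD>', the rest unchanged."""
    out = []
    text_ptr = 0                        # first character not yet tokenized
    while text_ptr < len(line):
        kw_ptr = 0                      # keyword pointer: start of the table
        while kw_ptr < len(keywords):
            kw = keywords[kw_ptr]
            offset = 0                  # position relative to text_ptr
            while (offset < len(kw)
                   and text_ptr + offset < len(line)
                   and line[text_ptr + offset] == kw[offset]):
                offset += 1
            if offset == len(kw):       # every letter of the keyword matched
                break
            kw_ptr += 1                 # mismatch: offset restarts, next keyword
        if kw_ptr < len(keywords):
            out.append("<" + keywords[kw_ptr] + ">")
            text_ptr += len(keywords[kw_ptr])   # continue just past the match
        else:
            out.append(line[text_ptr])  # no keyword fits: keep the character
            text_ptr += 1
    return out


print(crunch("STOP"))
print(crunch("TAND16"))
```

With this table, "STOP" typed without spaces becomes the single STOP token, and "TAND16" starts with a TAN token followed by the plain characters D, 1 and 6.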

Thus the order of the keywords in the list really doesn't matter (apart from the resulting token numbers), as long as no shorter keyword precedes a longer one that is identical up to the last letter of the shorter one. That's the reason INPUT is preceded by INPUT#. Otherwise INPUT# could never be tokenized as a single-byte token.
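That ordering condition is easy to check mechanically. A small Python sketch (the function name `shadowed_keywords` is invented for this example) that lists every pair where a shorter keyword sits ahead of a longer one it is a prefix of:

```python
def shadowed_keywords(keywords):
    """Return (shorter, longer) pairs where the shorter keyword comes first
    in the table and is a prefix of the longer one, so that a first-match
    scan would never reach the longer keyword."""
    pairs = []
    for i, short in enumerate(keywords):
        for longer in keywords[i + 1:]:
            if longer != short and longer.startswith(short):
                pairs.append((short, longer))
    return pairs


# Order as in the posted list: longer variants first, GO last.
print(shadowed_keywords(["INPUT#", "INPUT", "PRINT#", "PRINT", "GOTO", "GOSUB", "GO"]))
# Swapped order: INPUT would hide INPUT#, and GO would hide GOTO.
print(shadowed_keywords(["INPUT", "INPUT#", "GO", "GOTO"]))
```

The first call reports no problems, the second reports INPUT/INPUT# and GO/GOTO.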
carlsson wrote:
Most of the time I believe you will quickly spot a misplaced keyword if Basic bails out with ?SYNTAX ERROR but it can be fun to try various tokenizers how they handle these peculiar cases.

Of course, the CBM tokenizer is the reference. Any other tokenizer had better give the same result.

Michael