What are lexemes?
In compiler design, lexemes are the smallest meaningful units of source code that the lexical analyzer (lexer) recognizes.
A lexeme is the actual character sequence in the source that matches the pattern of a token type. For example:
counter = initial_value + 100
In this line, we have several lexemes:
- "counter" (identifier lexeme)
- "=" (assignment operator lexeme)
- "initial_value" (identifier lexeme)
- "+" (addition operator lexeme)
- "100" (numeric literal lexeme)
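To make the split above concrete, here is a minimal sketch in Python. The pattern and the name `split_lexemes` are assumptions made for illustration: the pattern only knows identifiers, integer literals, and single-character symbols, which is enough for the example line but far from a full lexer.

```python
import re

# Hypothetical pattern: an identifier, an integer literal, or any single
# non-whitespace character (treated here as an operator or delimiter).
LEXEME_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def split_lexemes(source: str) -> list[str]:
    """Return the lexemes of `source` in order, skipping whitespace."""
    return LEXEME_PATTERN.findall(source)

print(split_lexemes("counter = initial_value + 100"))
# ['counter', '=', 'initial_value', '+', '100']
```

Whitespace matches none of the alternatives, so it separates lexemes without becoming one itself.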
The lexical analyzer groups the input characters into these lexemes and emits one token per lexeme. A token typically includes:
- The token type (e.g., IDENTIFIER, OPERATOR, NUMBER)
- The actual lexeme value
- Sometimes additional information like line number and position
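The three parts of a token listed above can be sketched as a small record plus a regex-driven tokenizer. This is only an illustration under stated assumptions: the `Token` class, the `TOKEN_SPEC` table, and the `tokenize` function are hypothetical names, and the specification covers just the token types needed for the example line.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    type: str    # token type, e.g. IDENTIFIER, OPERATOR, NUMBER
    lexeme: str  # the exact characters matched in the source
    line: int    # 1-based line number
    column: int  # 1-based column of the lexeme's first character

# Hypothetical token specification; earlier entries win when several match.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR", r"[+\-*/=]"),
    ("SKIP", r"[ \t]+"),   # whitespace produces no token
    ("MISMATCH", r"."),    # anything else is reported as an error
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source: str) -> list[Token]:
    """Turn source text into a list of tokens with position information."""
    tokens = []
    for line_no, text in enumerate(source.splitlines(), start=1):
        for match in MASTER.finditer(text):
            kind = match.lastgroup  # name of the alternative that matched
            if kind == "SKIP":
                continue
            if kind == "MISMATCH":
                raise SyntaxError(
                    f"unexpected {match.group()!r} at line {line_no}, column {match.start() + 1}"
                )
            tokens.append(Token(kind, match.group(), line_no, match.start() + 1))
    return tokens

for token in tokenize("counter = initial_value + 100"):
    print(token)
```

For the example line this yields five tokens, from an IDENTIFIER with lexeme "counter" at column 1 through a NUMBER with lexeme "100" at column 27. Keeping the lexeme inside the token lets later compiler phases recover the exact name or value that was written.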
Here's how different lexemes map to token categories:
- Keywords: lexemes like "if", "while", "for", "class"
- Identifiers: variable names, function names
- Operators: +, -, *, /, =, ==, !=
- Literals:
  - Numbers: 42, 3.14, 0xFF
  - Strings: "hello", 'world'
  - Characters: 'a', '$'
- Delimiters: (, ), {, }, [, ]
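The category mapping above can be sketched as a lookup that tries one pattern per category. Everything here is an assumption for illustration: the `classify` function, the keyword set, and the patterns are hypothetical, and real languages differ on the details. In particular, this sketch treats a single-quoted single character as a character literal and any longer single-quoted text as a string, which is only one possible convention.

```python
import re

# Keywords look like identifiers, so they are checked first.
KEYWORDS = {"if", "while", "for", "class"}

# Hypothetical patterns, tried in order; the first full match wins.
CATEGORY_PATTERNS = [
    ("NUMBER", re.compile(r"0[xX][0-9a-fA-F]+|\d+\.\d+|\d+")),
    ("CHARACTER", re.compile(r"'[^'\\]'")),
    ("STRING", re.compile(r'"[^"\\]*"')),
    ("STRING", re.compile(r"'[^'\\]*'")),
    ("IDENTIFIER", re.compile(r"[A-Za-z_]\w*")),
    ("OPERATOR", re.compile(r"==|!=|[+\-*/=]")),
    ("DELIMITER", re.compile(r"[(){}\[\]]")),
]

def classify(lexeme: str) -> str:
    """Return the token category for a single lexeme, or UNKNOWN."""
    if lexeme in KEYWORDS:
        return "KEYWORD"
    for category, pattern in CATEGORY_PATTERNS:
        if pattern.fullmatch(lexeme):
            return category
    return "UNKNOWN"

for lexeme in ["while", "counter", "==", "0xFF", '"hello"', "'a'", "{"]:
    print(f"{lexeme!r:12} -> {classify(lexeme)}")
```

Checking the keyword set before the identifier pattern matters: "while" also matches the identifier pattern, so without that first step it would be classified as an identifier.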