Recommended outputs of tokenization: - Char(c) -- prepresents e.g. 'a' or 'x' - BracketOpen/BracketClose ('[' and ']') - ParenOpen/ParenClose - Hyphen ( - ) - Star ( * ) - Caret ( ^ ) - Dot ( . ) - Or ( | ) Backslashing should be solved in tokenization (e.g. tokenizing "\." should return Char('.'), NOT Char('\') and Dot). Double backlash counts as Char('\'), as usual. Recommended grammar: Regex -> OrRegex OrRegex -> SeqRegex | SeqRegex Or OrRegex SeqRegex -> StarRegex | StarRegex SeqRegex StarRegex -> SimpleRegex | SimpleRegex Star SimpleRegex -> ParenOpen Regex ParenClose | BracketOpen InBrackets BracketClose | BracketOpen Caret InBrackets BracketClose | Dot | Char InBrackets -> BracketsItem | BracketsItem InBrackets BracketsItem -> Char | Char Hyphen Char Notice the similarity of OrRegex and SeqRegex with AddExpr and MultExpr from the lab, and how it solves possible ambiguities while parsing. Recommended implementations of the interface Regex: - OrRegex - SeqRegex - StarRegex - Char (which can be used to represent any single-character pattern, including a, [a-z], [acxz], [^a-z], [^ ], [^a] or just . (Dot)) - SingleChar (which can optionally be used in place of Char to efficiently represent a single-choice character) Note that the grammar rule for 'BracketItem' does not define an actual AST element -- all BracketItem's and InBrackets are processed to a single 'Char' object. Corresponding AST for expression "a[^bc]|de*f": OrRegex ( SeqRegex ( Char ('a'), Char (NOT 'b','c') ), SeqRegex ( Char ('d'), SeqRegex ( StarRegex ( Char ('e') ), Char ('f') ) ) ) Recommended interface for matching (without parentheses present in the pattern): size_t Regex::match_part(const string& str, size_t begin, size_t max_len); - str is a string where the matching is done - begin is a position of the first character where the regex should match (this is useful for matching successive letters in a long integer -- you don't need to construct a new `str` for matching each letter. - max_len is a maximum length of the matched part of the string. This helps with ambiguous stuff. - return value is the actual length of the string that has been matched, or some special value when the match was not possible (see e.g. std::string::npos for a similar special value) For example, when matching pattern "abc" in string "abc", the match_part methods are successively called like this: r.match_part("abc", 0, 3) Char("a").match_part("abc", 0, 3) matches and returns 1 Char("b").match_part("abc", 1, 3) matches and returns 1 Char("c").match_part("abc", 2, 3) matches and returns 1 r.match_part("abc", 0, 3) returns 3 The max_len is useful for handling ambiguity, e.g. when retrying to match a pattern with Kleene star. Say, with pattern "(ab)*ab" in string "abab" Star("ab").match_part("abab",0,4) returns 4, because it's possible Char("a").match_part("abab",4,0) can't match anything and returns npos! So we retry the Star matching, giving it a bit less of characters: Star("ab").match_part("abab",0,3) returns 2 (matching only the first "ab") Char("a").match_part("abab",2,2) succeeds, returns 1 Char("b").match_part("abab",3,1) succeeds, returns 1 done. When there are parentheses present in the pattern, you need 4 minor upgrades: - a way to express that there is a parenthesis in the regex (e.g. ParenRegex) - a way to detect if the Regex contains ParenRegex (one extra interface method) - a way to assign a number to each ParenRegex so that it knows which index it has in a sequence of all parentheses (this is also doable by the previous interface method) - when matching, pass around some structure where ParenRegexes that match successfully can fill in what they matched (using e.g. an extra parameter to match_part of type vector& )