reglibcpp  1.0.0
(Naïve) C++ implementation of models for regular languages
Classes | Public Types | Public Member Functions | Static Public Member Functions | Public Attributes | Static Public Attributes | List of all members
reg::expression::parser Struct Reference

Parses regular expressions. More...

Classes

struct  tree
 Represents the table entries as binary trees. More...
 

Public Types

enum  token : unsigned char {
  A, B, C, K,
  E, F, Σ, P,
  L, R, S, END
}
 Tokens the grammar deals with. More...
 
typedef bitset< token::END > tokens
 Tokens don't usually come alone. More...
 

Public Member Functions

 parser (u32string const &re, literals const &lits)
 Initializes with a string to parse and literals to parse for. More...
 
exptr operator() (bool optimized, bool aggressive)
 Gives the RE resulting from parsing. More...
 

Static Public Member Functions

static tokens getUClosure (tokens const &m)
 Constructs the reflexive-transitive closure of the inverse unit relation for a given set of symbols. More...
 
static bool canDerive (token symbol, tokens const &left, tokens const &right)
 Checks if a token could derive a pair of tokens from two other entries. More...
 
static void compileTableEntry (size_t row, size_t diag, vector< vector< tokens >> &table)
 Fills a table entry. More...
 

Public Attributes

literals lits
 Stores interpretations for characters encountered in the parsed string. More...
 
vector< char32_t > symbolMapping
 Stores the actual symbols encountered in the RE while parsing. More...
 
size_t symbolMappingIndex
 Index for when symbols have to be extracted from the mapping. More...
 
vector< vector< tokens > > table
 The table of sets of symbols that derive a subsentence. More...
 
bool badRegularExpression = false
 Signifies that the RE used to initialize this object was invalid. More...
 

Static Public Attributes

static array< token, token::END > const inverseUnitGraph
 Maps symbols that may be derived in-place to their predecessing symbols. More...
 
static array< array< tokens, token::END >, token::END > const inverseBinaryRules
 Maps pairs of symbols to the symbols that derive them. More...
 

Detailed Description

Parses regular expressions.

Recognizes REs generated by the following context-free grammar, where P stands for the alternative symbol (literal "+" or "|"), L for a left parenthesis (literal "("), R for a right parenthesis (literal ")") and S for the Kleene star (literal "*"). Σ is a placeholder for any symbol that makes up the regular language described by the RE:

GRE=(
       {A, B, C, K, E, F},
       {Σ, P, L, R, S}, A,
       {
         (A,C), (A,AB),
         (B,PC),
         (C,K), (C,CK),
         (K,E), (K,ES),
         (E,Σ), (E,LF),
         (F,AR)
       }
    )

This parser is based on the article To CNF or not to CNF? An Efficient Yet Presentable Version of the CYK Algorithm by Martin Lange and Hans Leiß, published in informatica didactica No. 8. Conventions will be based on this article.

Definition at line 338 of file expression.cpp.

Member Typedef Documentation

◆ tokens

typedef bitset<token::END> reg::expression::parser::tokens

Tokens don't usually come alone.

Definition at line 356 of file expression.cpp.

Member Enumeration Documentation

◆ token

enum reg::expression::parser::token : unsigned char

Tokens the grammar deals with.

Enumerator

Beginning of an alternation expression.

Second part of an alternation expression.

A concatenation expression.

Kleene expression.

Beginning of a new subexpression.

Second part of a new subexpression.

Σ 

A symbol expression.

An alternation symbol.

A left parenthesis.

A right parenthesis.

A Kleene star symbol.

END 

Number of elements in this enumeration, NOT AN ACTUAL TOKEN!

Definition at line 340 of file expression.cpp.

340  : unsigned char {
341  A,
342  B,
343  C,
344  K,
345  E,
346  F,
347  Σ,
348  P,
349  L,
350  R,
351  S,
352  END
353  };
Number of elements in this enumeration, NOT AN ACTUAL TOKEN!
Definition: expression.cpp:352
An alternation symbol.
Definition: expression.cpp:348
A concatenation expression.
Definition: expression.cpp:343
A left parenthesis.
Definition: expression.cpp:349
Beginning of a new subexpression.
Definition: expression.cpp:345
A symbol expression.
Definition: expression.cpp:347
A Kleene star symbol.
Definition: expression.cpp:351
Second part of an alternation expression.
Definition: expression.cpp:342
A right parenthesis.
Definition: expression.cpp:350
Beginning of an alternation expression.
Definition: expression.cpp:341
Kleene expression.
Definition: expression.cpp:344
Second part of a new subexpression.
Definition: expression.cpp:346

Constructor & Destructor Documentation

◆ parser()

reg::expression::parser::parser ( u32string const &  re,
literals const &  lits 
)
inline

Initializes with a string to parse and literals to parse for.

Definition at line 451 of file expression.cpp.

451  {
452  if (re.empty()) {
453  badRegularExpression = true;
454  return;
455  }
456  symbolMappingIndex = 0;
457  size_t numberOfTokens(re.length());
458  symbolMapping.reserve(numberOfTokens);
459  table.reserve(numberOfTokens);
460  size_t row(0);
461  for (char32_t symbol : re) {
462  table.push_back(vector<tokens>(numberOfTokens-row, tokens()));
463  if (symbol == lits.L) { table[row][0].set(token::L); }
464  else if (symbol == lits.R) { table[row][0].set(token::R); }
465  else if (symbol == lits.P) { table[row][0].set(token::P); }
466  else if (symbol == lits.S) { table[row][0].set(token::S); }
467  else {
468  table[row][0].set(token::Σ);
469  symbolMapping.push_back(symbol);
470  }
471  row++;
472  }
473  symbolMapping.shrink_to_fit();
474  for (size_t diag(1); diag < numberOfTokens; diag++) {
475  for (size_t row(0); row < numberOfTokens - diag; row++) {
476  compileTableEntry(row, diag, table);
477  }
478  }
479  if (!getUClosure(table[0][table[0].size()-1])[token::A]) {
480  badRegularExpression = true;
481  }
482  }
char32_t const S
The Kleene star.
Definition: expression.h:53
static tokens getUClosure(tokens const &m)
Constructs the reflexive-transitive closure of the inverse unit relation for a given set of symbols...
Definition: expression.cpp:374
vector< vector< tokens > > table
The table of sets of symbols that derive a subsentence.
Definition: expression.cpp:447
bool badRegularExpression
Signifies that the RE used to initialize this object was invalid.
Definition: expression.cpp:448
literals lits
Stores interpretations for characters encountered in the parsed string.
Definition: expression.cpp:444
vector< char32_t > symbolMapping
Stores the actual symbols encountered in the RE while parsing.
Definition: expression.cpp:445
size_t symbolMappingIndex
Index for when symbols have to be extracted from the mapping.
Definition: expression.cpp:446
char32_t const L
The left parenthesis.
Definition: expression.h:53
char32_t const P
The alternation symbol.
Definition: expression.h:53
size_t size() const
Reports the size of this RE&#39;s tree representation.
Definition: expression.cpp:181
static void compileTableEntry(size_t row, size_t diag, vector< vector< tokens >> &table)
Fills a table entry.
Definition: expression.cpp:428
bitset< token::END > tokens
Tokens don't usually come alone.
Definition: expression.cpp:356
char32_t const R
The right parenthesis.
Definition: expression.h:53

Member Function Documentation

◆ canDerive()

static bool reg::expression::parser::canDerive ( token  symbol,
tokens const &  left,
tokens const &  right 
)
inlinestatic

Checks if a token could derive a pair of tokens from two other entries.

Parameters
symbolthe original token
leftthe set of tokens allowed for the first half of the result pair
rightthe set of tokens allowed for the second half of the result pair
Returns
true if the original token may lead to any such pair, false else

Definition at line 394 of file expression.cpp.

394  {
395  tokens leftClosure(getUClosure(left));
396  tokens rightClosure(getUClosure(right));
397  for (unsigned char li(0); li < token::END; li++) {
398  if (leftClosure[li]) {
399  for (unsigned char ri(0); ri < token::END; ri++) {
400  if (rightClosure[ri]) {
401  for (unsigned char ci(0); ci < token::END; ci++) {
402  if (inverseBinaryRules[li][ri][ci]) {
403  for (unsigned char successor(ci); successor != token::END; successor = inverseUnitGraph[successor]) {
404  if (symbol == successor) {
405  return true;
406  }
407  }
408  }
409  }
410  }
411  }
412  }
413  }
414  return false;
415  }
static array< array< tokens, token::END >, token::END > const inverseBinaryRules
Maps pairs of symbols to the symbols that derive them.
Definition: expression.cpp:362
static tokens getUClosure(tokens const &m)
Constructs the reflexive-transitive closure of the inverse unit relation for a given set of symbols...
Definition: expression.cpp:374
bitset< token::END > tokens
Tokens don't usually come alone.
Definition: expression.cpp:356
static array< token, token::END > const inverseUnitGraph
Maps symbols that may be derived in-place to their predecessing symbols.
Definition: expression.cpp:358

◆ compileTableEntry()

static void reg::expression::parser::compileTableEntry ( size_t  row,
size_t  diag,
vector< vector< tokens >> &  table 
)
inlinestatic

Fills a table entry.

The entry will be filled with the set of symbols deriving the subsentence beginning at the diagonal entry in the same row and ending at the diagonal entry in the same column.

Parameters
rowthe row of the entry to fill
diagthe diagonal of the entry to fill
tablethe table to operate on
Returns
the set of symbols deriving the subsentence

Definition at line 428 of file expression.cpp.

428  {
429  for (size_t i(0); i < diag; i++) {
430  tokens first(getUClosure(table[row][i]));
431  for (unsigned char fi(0); fi < first.size(); fi++) {
432  if (first[fi]) {
433  tokens second(getUClosure(table[row+1+i][diag-i-1]));
434  for (unsigned char si(0); si < second.size(); si++) {
435  if (second[si]) {
436  table[row][diag] |= inverseBinaryRules[fi][si];
437  }
438  } // Going through the tokens in second.
439  }
440  } // Going through the tokens in first.
441  }
442  }
static array< array< tokens, token::END >, token::END > const inverseBinaryRules
Maps pairs of symbols to the symbols that derive them.
Definition: expression.cpp:362
static tokens getUClosure(tokens const &m)
Constructs the reflexive-transitive closure of the inverse unit relation for a given set of symbols...
Definition: expression.cpp:374
vector< vector< tokens > > table
The table of sets of symbols that derive a subsentence.
Definition: expression.cpp:447
bitset< token::END > tokens
Tokens don't usually come alone.
Definition: expression.cpp:356

◆ getUClosure()

static tokens reg::expression::parser::getUClosure ( tokens const &  m)
inlinestatic

Constructs the reflexive-transitive closure of the inverse unit relation for a given set of symbols.

See Lemma 5 in the article.

Parameters
mthe set of symbols that may be derived in-place
Returns
the set of symbols that may be predecessors to any of the symbols in m

Definition at line 374 of file expression.cpp.

374  {
375  tokens closure;
376  for (unsigned char c(0); c < token::END; c++) {
377  if (m[c]) {
378  for (token s(static_cast<token>(c)); s != token::END; s = inverseUnitGraph[s]) {
379  closure.set(s);
380  }
381  }
382  } // Going through the tokens in m.
383  return closure;
384  }
token
Tokens the grammar deals with.
Definition: expression.cpp:340
bitset< token::END > tokens
Tokens don't usually come alone.
Definition: expression.cpp:356
static array< token, token::END > const inverseUnitGraph
Maps symbols that may be derived in-place to their predecessing symbols.
Definition: expression.cpp:358

◆ operator()()

exptr reg::expression::parser::operator() ( bool  optimized,
bool  aggressive 
)
inline

Gives the RE resulting from parsing.

Returns
exptr to the RE represented by the initialization string

Definition at line 565 of file expression.cpp.

565  {
566  if (badRegularExpression) {
567  #ifdef __EXCEPTIONS
568  throw std::bad_function_call();
569  #else
570  return exptr();
571  #endif
572  }
573  return tree(0, table.size()-1, this)(optimized, aggressive);
574  }
vector< vector< tokens > > table
The table of sets of symbols that derive a subsentence.
Definition: expression.cpp:447
bool badRegularExpression
Signifies that the RE used to initialize this object was invalid.
Definition: expression.cpp:448
std::shared_ptr< expression const > exptr
This is the type used to handle regular expressions.
Definition: expression.h:43

Member Data Documentation

◆ badRegularExpression

bool reg::expression::parser::badRegularExpression = false

Signifies that the RE used to initialize this object was invalid.

Definition at line 448 of file expression.cpp.

◆ inverseBinaryRules

auto const reg::expression::parser::inverseBinaryRules
static
Initial value:
=
[]()->array<array<expression::parser::tokens,expression::parser::token::END>,expression::parser::token::END>{
array<tokens,token::END> noPredecessor;
noPredecessor.fill(tokens());
array<array<tokens,token::END>,token::END> rules;
rules.fill(noPredecessor);
rules[token::A][token::B].set(token::A);
rules[token::P][token::C].set(token::B);
rules[token::C][token::K].set(token::C);
rules[token::E][token::S].set(token::K);
rules[token::L][token::F].set(token::E);
rules[token::A][token::R].set(token::F);
return rules;
}()

Maps pairs of symbols to the symbols that derive them.

Definition at line 362 of file expression.cpp.

◆ inverseUnitGraph

auto const reg::expression::parser::inverseUnitGraph
static
Initial value:
=
[]()->array<expression::parser::token,expression::parser::token::END>{
array<token,token::END> graph;
graph.fill(token::END);
graph[token::Σ] = token::E;
graph[token::E] = token::K;
graph[token::K] = token::C;
graph[token::C] = token::A;
return graph;
}()

Maps symbols that may be derived in-place to their predecessing symbols.

See Definition 2 and Lemma 4 in the article.

Definition at line 358 of file expression.cpp.

◆ lits

literals reg::expression::parser::lits

Stores interpretations for characters encountered in the parsed string.

Definition at line 444 of file expression.cpp.

◆ symbolMapping

vector<char32_t> reg::expression::parser::symbolMapping

Stores the actual symbols encountered in the RE while parsing.

Definition at line 445 of file expression.cpp.

◆ symbolMappingIndex

size_t reg::expression::parser::symbolMappingIndex
mutable

Index for when symbols have to be extracted from the mapping.

Definition at line 446 of file expression.cpp.

◆ table

vector<vector<tokens> > reg::expression::parser::table

The table of sets of symbols that derive a subsentence.

Definition at line 447 of file expression.cpp.


The documentation for this struct was generated from the following file: