Mstdlib-1.24.0
Regular Expression

Typedefs

typedef struct M_re M_re_t
 
typedef struct M_ret_match M_re_match_t
 

Enumerations

enum  M_re_flags_t {
  M_RE_NONE = 0 ,
  M_RE_CASECMP = 1 << 0 ,
  M_RE_MULTILINE = 1 << 1 ,
  M_RE_DOTALL = 1 << 2 ,
  M_RE_UNGREEDY = 1 << 3
}
 

Functions

M_re_t * M_re_compile (const char *pattern, M_uint32 flags)
 
void M_re_destroy (M_re_t *re)
 
M_bool M_re_search (const M_re_t *re, const char *str, M_re_match_t **match)
 
M_bool M_re_eq_start (const M_re_t *re, const char *str)
 
M_bool M_re_eq (const M_re_t *re, const char *str)
 
M_list_t * M_re_matches (const M_re_t *re, const char *str)
 
M_list_str_t * M_re_find_all (const M_re_t *re, const char *str)
 
char * M_re_sub (const M_re_t *re, const char *repl, const char *str)
 
void M_re_match_destroy (M_re_match_t *match)
 
M_list_u64_t * M_re_match_idxs (const M_re_match_t *match)
 
M_bool M_re_match_idx (const M_re_match_t *match, size_t idx, size_t *offset, size_t *len)
 

Detailed Description

The engine targets Perl/Python/PCRE expression syntax. However, this is not a full implementation of the syntax.

The re engine uses DFA processing so that evaluation completes in a reasonable amount of time. It does not use backtracking, which prevents pathological expressions from causing very slow run times. As a consequence, back references in patterns are not supported.
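To illustrate why DFA processing gives predictable run time, the following is a minimal sketch of a hand-built, table-driven DFA for the pattern (a|b)*abb. It is not the library's implementation; the table, the function name dfa_full_match and the choice of pattern are assumptions made for this example only. Each input byte is examined exactly once and the matcher never backs up.

```c
#include <stddef.h>

/* Hypothetical transition table for the pattern (a|b)*abb.
 * Rows are states 0..3, columns are the inputs 'a' and 'b'.
 * State 3 is the only accepting state. */
static const int dfa_table[4][2] = {
    /* 'a' 'b' */
    {  1,  0 }, /* state 0: nothing useful seen yet */
    {  1,  2 }, /* state 1: seen "a"                */
    {  1,  3 }, /* state 2: seen "ab"               */
    {  1,  0 }  /* state 3: seen "abb" (accepting)  */
};

/* Returns 1 if the whole string matches (a|b)*abb, otherwise 0.
 * Every byte causes exactly one table lookup, so the run time is
 * linear in the length of the input. */
int dfa_full_match(const char *s)
{
    int    state = 0;
    size_t i;

    for (i = 0; s[i] != '\0'; i++) {
        int col;

        if (s[i] == 'a') {
            col = 0;
        } else if (s[i] == 'b') {
            col = 1;
        } else {
            return 0; /* byte outside the alphabet: no match possible */
        }
        state = dfa_table[state][col];
    }
    return state == 3;
}
```

Because the state is a single integer, there is nothing to rewind to. This is also why a feature such as a back reference, which needs to remember previously matched text, does not fit this model.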

Patterns are thread safe and re-entrant.

Supported:

Syntax

Expression    Description
.             any character (except newline, see DOTALL)
^             Start of string. Or start of line in MULTILINE
$             End of string. Or end of line in MULTILINE
*             0 or more repetitions
+             1 or more repetitions
?             0 or 1 repetitions
*? +? ??      Ungreedy version of repetition
{#}           Exactly # of repetitions
{#,}          # or more repetitions
{#,#}         Inclusive of # and # repetitions
\             Escape character. E.g. \\ => \
[]            Character range. Can be specific characters or '-' specified range. Multiple ranges can be specified. E.g. [a-z-8XYZ]
[^]           Negative character range. Can be specific characters or '-' specified range. Multiple ranges can be specified. E.g. [^a-z-8XYZ]
|             Composite A or B. E.g. A|B
()            Pattern and capture group. Groups expressions together for evaluation when used with |. Also, defines a capture group.
(?imsU-imsU)  Allows specifying compile flags in the expression. Supports i (ignore case), m (multiline), s (dot all), U (ungreedy). - can be used to disable a flag. E.g. (?im-s). Only allowed to be used once at the start of the pattern.
Note
\ as part of | (pipe) shown in table is for escaping and not part of syntax.
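The relationship between the (?imsU-imsU) letters and the M_re_flags_t bits can be sketched as follows. This is an illustration only and not the library's parser: the EX_RE_* constants are a local mirror of the enumerator values listed for M_re_flags_t, and ex_apply_inline_flags is a hypothetical helper. It also assumes that inline flags are applied on top of the flags passed at compile time.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the M_re_flags_t values, renamed so this sketch
 * does not depend on (or clash with) the real header. */
enum {
    EX_RE_NONE      = 0,
    EX_RE_CASECMP   = 1 << 0,
    EX_RE_MULTILINE = 1 << 1,
    EX_RE_DOTALL    = 1 << 2,
    EX_RE_UNGREEDY  = 1 << 3
};

/* Apply flag letters such as "im-s" (the text between "(?" and ")")
 * to a base flag set. Letters before '-' enable a flag, letters after
 * it disable a flag. Returns 1 on success and stores the result in
 * *out. Returns 0 on an unknown letter and leaves *out untouched. */
int ex_apply_inline_flags(const char *letters, uint32_t base, uint32_t *out)
{
    uint32_t flags   = base;
    int      disable = 0;
    size_t   i;

    for (i = 0; letters[i] != '\0'; i++) {
        uint32_t bit;

        switch (letters[i]) {
            case '-': disable = 1; continue; /* next letters clear bits */
            case 'i': bit = EX_RE_CASECMP;   break;
            case 'm': bit = EX_RE_MULTILINE; break;
            case 's': bit = EX_RE_DOTALL;    break;
            case 'U': bit = EX_RE_UNGREEDY;  break;
            default:  return 0;
        }

        if (disable) {
            flags &= ~bit;
        } else {
            flags |= bit;
        }
    }

    *out = flags;
    return 1;
}
```

Under these assumptions, applying "im-s" to a base of EX_RE_DOTALL yields EX_RE_CASECMP | EX_RE_MULTILINE, which is the same value that would be built by OR-ing the flags together by hand.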

Escapes

Expression          Description
C escape sequences  Any standard escape sequence that is part of C. Such as, \n (newline) and \t (tab)
\xHH \x{HHHH}       Hex values
\<                  Beginning of word
\>                  End of word

Short hand character classes

Cannot be used within brackets.

ASCII only.

Expression  Description
\s          White space. Equivalent to [ \t\n\r\f\v]
\S          Not white space. Equivalent to [^ \t\n\r\f\v]
\d          Digit (number). Equivalent to [0-9]
\D          Not digit. Equivalent to [^0-9]
\w          Word. Equivalent to [a-zA-Z0-9_]
\W          Not word. Equivalent to [^a-zA-Z0-9_]

POSIX character classes for bracket expressions

Character ranges must be used in [] expressions. ^ negation is supported with ranges.

ASCII only.

Range       Description
[:alpha:]   Alpha characters. Contains [a-zA-Z]
[:alnum:]   Alpha numeric characters. Contains [a-zA-Z0-9]
[:word:]    Alpha numeric characters. Contains [a-zA-Z0-9_]. Equivalent to \w
[:space:]   White space characters. Contains [ \t\r\n\v\f]. Equivalent to \s
[:digit:]   Digit (number) characters. Contains [0-9]. Equivalent to \d
[:cntrl:]   Control characters. Contains [\x00-\x1F\x7F]. Note: \x00 is the NULL string terminator so this is really [\x01-\x1F\x7F] because \x00 can never be encountered in a string.
[:print:]   Printable characters range. Contains [\x20-\x7E]
[:xdigit:]  Hexadecimal digit range. Contains [0-9a-fA-F]
[:lower:]   Lower case character range. Contains [a-z]
[:upper:]   Upper case character range. Contains [A-Z]
[:blank:]   Blank character range. Contains [ \t]
[:graph:]   Graph character range. Contains [\x21-\x7E]
[:punct:]   Punctuation character range. Contains [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~]
Note
\ as part of | (pipe) and ` shown in [:punct:] is for escaping and not part of character set.

Features

Not supported:

Match object

Patterns can have capture groups which can be filled in a match object during string evaluation. Only numbered capture indexes are supported. Up to 99 captures can be recorded.

Index 0 is the full match for the regular expression. If the pattern matches the string, this will always be populated. Groups (when present) are numbered 1-99.

If a capture is present, its index will be available. Composite (|) patterns can cause gaps in captures. For example, captures 1 and 5 could be present while captures 3 and 4 are not. Also, captures can be present but have zero length.

Finally, captures are reported as an offset from the start of the string and the length of the captured data. This differs from some other libraries, which return start and end offsets. Length was chosen over an end offset because captures are typically passed to other functions, the majority of which take a start and a length rather than an end offset.
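As a sketch of how an offset and length pair can be consumed, the helpers below copy a capture out of the evaluated string and convert the pair to an exclusive end offset. They use only the C standard library; capture_dup and capture_end are hypothetical names, and the offset and len arguments stand in for values such as those M_re_match_idx reports.

```c
#include <stdlib.h>
#include <string.h>

/* Copy the capture described by (offset, len) out of str.
 * Returns a newly allocated NUL-terminated string the caller must
 * free, or NULL if the range does not fit inside str or allocation
 * fails. A zero-length capture yields an empty string. */
char *capture_dup(const char *str, size_t offset, size_t len)
{
    size_t slen = strlen(str);
    char  *out;

    if (offset > slen || len > slen - offset)
        return NULL;

    out = malloc(len + 1);
    if (out == NULL)
        return NULL;

    memcpy(out, str + offset, len);
    out[len] = '\0';
    return out;
}

/* Convert (offset, len) to the exclusive end offset used by
 * libraries that report start and end positions instead. */
size_t capture_end(size_t offset, size_t len)
{
    return offset + len;
}
```

When only printing is needed, no copy is required: printf("%.*s", (int)len, str + offset) writes the capture directly, which is one case where a start and length are more convenient than an end offset.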

Unicode

Patterns and strings are expected to be UTF-8 encoded and will be interpreted as such.

While Unicode is supported, normalization is not. Every Unicode code point is treated as a unique character. Many characters can be represented by more than one sequence of Unicode code points, but equivalence is not applied and each code point is treated as its own character.
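The effect of missing normalization can be seen without any regular expression at all. The sketch below, which uses only the C standard library and hypothetical names, encodes the character "é" in two canonically equivalent ways and shows that the UTF-8 byte sequences differ. Since equivalence is not applied, a pattern written with one form should not be expected to match text that uses the other.

```c
#include <string.h>

/* "é" as the single precomposed code point U+00E9 (UTF-8: C3 A9). */
const char E_ACUTE_PRECOMPOSED[] = "\xC3\xA9";

/* "é" as 'e' (U+0065) followed by the combining acute accent
 * U+0301 (UTF-8: 65 CC 81). */
const char E_ACUTE_DECOMPOSED[] = "e\xCC\x81";

/* Returns 1 when both strings are the same byte sequence, else 0.
 * No Unicode normalization is performed. */
int utf8_same_sequence(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

Callers that need the two forms to compare equal would have to normalize both the pattern and the input to a common form before evaluation.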

Typedef Documentation

◆ M_re_t

typedef struct M_re M_re_t

◆ M_re_match_t

typedef struct M_ret_match M_re_match_t

Enumeration Type Documentation

◆ M_re_flags_t

Pattern modifier options.

Enumerator
M_RE_NONE 

No modifiers applied.

M_RE_CASECMP 

Matching should be case insensitive.

M_RE_MULTILINE 

^ and $ match start and end of lines instead of start and end of string.

M_RE_DOTALL 

Dot matches all characters including new line.

M_RE_UNGREEDY 

Invert behavior of greedy quantifiers. E.g. ? acts like ?? and ?? acts like ?.

Function Documentation

◆ M_re_compile()

M_re_t * M_re_compile ( const char *  pattern,
M_uint32  flags 
)

Compile a regular expression pattern.

Parameters
[in] pattern  The pattern to compile.
[in] flags  M_re_flags_t flags controlling pattern behavior.
Returns
Re object on success. NULL on compilation error.

◆ M_re_destroy()

void M_re_destroy ( M_re_t *  re)

Destroy a re object.

Parameters
[in] re  Re object.

◆ M_re_search()

M_bool M_re_search ( const M_re_t *  re,
const char *  str,
M_re_match_t **  match 
)

Search for the first match of pattern in string.

Parameters
[in] re  Re object.
[in] str  String to evaluate.
[out] match  Optional match object.
Returns
M_TRUE if match was found. Otherwise, M_FALSE.

◆ M_re_eq_start()

M_bool M_re_eq_start ( const M_re_t *  re,
const char *  str 
)

Check if the pattern matches from the beginning of the string.

Equivalent to the pattern starting with ^ and not multi line.

Parameters
[in] re  Re object.
[in] str  String to evaluate.
Returns
M_TRUE if match was found. Otherwise, M_FALSE.

◆ M_re_eq()

M_bool M_re_eq ( const M_re_t *  re,
const char *  str 
)

Check if the pattern matches the entire string.

Equivalent to the pattern starting with ^, ending with $ and not multi line.

Parameters
[in] re  Re object.
[in] str  String to evaluate.
Returns
M_TRUE if match was found. Otherwise, M_FALSE.

◆ M_re_matches()

M_list_t * M_re_matches ( const M_re_t *  re,
const char *  str 
)

Get all pattern matches within a string.

Parameters
[in] re  Re object.
[in] str  String to evaluate.
Returns
List of M_re_match_t objects for every match found in the string. NULL if no matches found.

◆ M_re_find_all()

M_list_str_t * M_re_find_all ( const M_re_t *  re,
const char *  str 
)

Get all matching text within a string.

If locations of the text or captures are needed use M_re_matches.

Parameters
[in] re  Re object.
[in] str  String to evaluate.
Returns
List of matching strings for every match found in the string. NULL if no matches found.

◆ M_re_sub()

char * M_re_sub ( const M_re_t *  re,
const char *  repl,
const char *  str 
)

Substitute matching pattern in string.

The replacement string can reference capture groups using \#, \##, \g<#>, \g<##>. The capture data applies to the match being evaluated. For example:

pattern: ' ([c-e])'
string: 'a b c d e f g'
repl: '\1'
result: 'a bcde f g'
Parameters
[in] re  Re object.
[in] repl  Replacement string.
[in] str  String to evaluate.
Returns
String with substitutions or original string if no substitutions were made.
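The substitution semantics can be reproduced with a small stand-in built on POSIX regcomp() and regexec() rather than Mstdlib. The sketch below is an assumption-laden illustration, not M_re_sub itself: posix_sub_group1 is a hypothetical helper, it uses POSIX extended syntax, and it hard-codes the replacement '\1' (each match is replaced by its first capture group). With the pattern ' ([c-e])' and the string 'a b c d e f g' it produces 'a bcde f g'.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Replace every match of pattern in str with the text of capture
 * group 1. Uses POSIX regcomp()/regexec(), not Mstdlib.
 * Returns a newly allocated string the caller must free, or NULL on
 * a compile or allocation failure. */
char *posix_sub_group1(const char *pattern, const char *str)
{
    regex_t     re;
    regmatch_t  m[2];
    const char *p = str;
    char       *out;
    size_t      n = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return NULL;

    /* Group 1 always lies inside the match, so the result can never
     * be longer than the input. */
    out = malloc(strlen(str) + 1);
    if (out == NULL) {
        regfree(&re);
        return NULL;
    }

    while (*p != '\0' &&
           regexec(&re, p, 2, m, p == str ? 0 : REG_NOTBOL) == 0) {
        size_t start = (size_t)m[0].rm_so;
        size_t end   = (size_t)m[0].rm_eo;

        /* Text before the match is kept unchanged. */
        memcpy(out + n, p, start);
        n += start;

        /* The match is replaced by capture 1, if it participated. */
        if (m[1].rm_so != -1) {
            size_t glen = (size_t)(m[1].rm_eo - m[1].rm_so);
            memcpy(out + n, p + m[1].rm_so, glen);
            n += glen;
        }

        p += end;
        if (end == start) {
            /* Empty match: copy one byte so the loop always advances. */
            if (*p == '\0')
                break;
            out[n++] = *p++;
        }
    }

    /* Copy whatever follows the last match, including the NUL. */
    strcpy(out + n, p);
    regfree(&re);
    return out;
}
```

Note that the capture applies to the match being evaluated: each of the three matches (' c', ' d', ' e') is replaced by its own group 1 text.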

◆ M_re_match_destroy()

void M_re_match_destroy ( M_re_match_t *  match)

Destroy a match object.

Parameters
[in] match  Match object.

◆ M_re_match_idxs()

M_list_u64_t * M_re_match_idxs ( const M_re_match_t *  match)

Get a list of all the captured indexes.

Parameters
[in] match  Match object.
Returns
List of indexes. Otherwise NULL if no indexes captured.

◆ M_re_match_idx()

M_bool M_re_match_idx ( const M_re_match_t *  match,
size_t  idx,
size_t *  offset,
size_t *  len 
)

Get the offset and length of a match at a given index.

Parameters
[in] match  Match object.
[in] idx  Index.
[out] offset  Start of match from the beginning of evaluated string.
[out] len  Length of matched data.
Returns
M_TRUE if match found for index. Otherwise, M_FALSE.