Mstdlib-1.24.0
Typedefs | |
typedef struct M_re | M_re_t |
typedef struct M_ret_match | M_re_match_t |
Enumerations | |
enum | M_re_flags_t { M_RE_NONE = 0 , M_RE_CASECMP = 1 << 0 , M_RE_MULTILINE = 1 << 1 , M_RE_DOTALL = 1 << 2 , M_RE_UNGREEDY = 1 << 3 } |
Functions | |
M_re_t * | M_re_compile (const char *pattern, M_uint32 flags) |
void | M_re_destroy (M_re_t *re) |
M_bool | M_re_search (const M_re_t *re, const char *str, M_re_match_t **match) |
M_bool | M_re_eq_start (const M_re_t *re, const char *str) |
M_bool | M_re_eq (const M_re_t *re, const char *str) |
M_list_t * | M_re_matches (const M_re_t *re, const char *str) |
M_list_str_t * | M_re_find_all (const M_re_t *re, const char *str) |
char * | M_re_sub (const M_re_t *re, const char *repl, const char *str) |
void | M_re_match_destroy (M_re_match_t *match) |
M_list_u64_t * | M_re_match_idxs (const M_re_match_t *match) |
M_bool | M_re_match_idx (const M_re_match_t *match, size_t idx, size_t *offset, size_t *len) |
The engine targets Perl/Python/PCRE expression syntax. However, this is not a full implementation of the syntax.
The re engine uses DFA processing to ensure evaluation happens in a reasonable amount of time. It does not use backtracking, which prevents pathological expressions from causing very slow run time. Because of this, back references in patterns are not supported.
Patterns are thread safe and re-entrant.
Expression | Description |
---|---|
. | any character (except newline, see DOTALL) |
^ | Start of string, or start of line in MULTILINE |
$ | End of string, or end of line in MULTILINE |
* | 0 or more repetitions |
+ | 1 or more repetitions |
? | 0 or 1 repetitions |
*? +? ?? | Ungreedy version of repetition |
{#} | Exactly # of repetitions |
{#,} | # or more repetitions |
{#,#} | Between # and # repetitions, inclusive |
\ | Escape character. E.g. \\ => \ |
[] | Character range. Can be specific characters or '-' specified range. Multiple ranges can be specified. E.g. [a-z-8XYZ] |
[^] | Negative character range. Can be specific characters or '-' specified range. Multiple ranges can be specified. E.g. [^a-z-8XYZ] |
| | Composite A or B. E.g. A|B |
() | Pattern and capture group. Groups expressions together for evaluation when used with |. Also, defines a capture group. |
(?imsU-imsU) | Allows specifying compile flags in the expression. Supports i (ignore case), m (multiline), s (dot all), U (ungreedy). - can be used to disable a flag. E.g. (?im-s). Only allowed to be used once at the start of the pattern. |
Expression | Description |
---|---|
C escape sequences | Any standard escape sequence that is part of C. Such as, \n (newline) and \t (tab) |
\xHH \x{HHHH} | Hex values |
\< | Beginning of word |
\> | End of word |
Cannot be used within brackets.
ASCII only.
Expression | Description |
---|---|
\s | White space. Equivalent to [ \t\n\r\f\v] |
\S | Not white space. Equivalent to [^ \t\n\r\f\v] |
\d | Digit (number). Equivalent to [0-9]. |
\D | Not digit. Equivalent to [^0-9] |
\w | Word. Equivalent to [a-zA-Z0-9_] |
\W | Not word. Equivalent to [^a-zA-Z0-9_] |
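The equivalences in the table can be read as plain ASCII predicates. The following is a minimal sketch written directly from the listed sets; the `sketch_` function names are hypothetical and are not part of mstdlib, and the code illustrates the documented sets rather than the engine's internals.

```c
#include <stddef.h>
#include <string.h>

/* \s : one of [ \t\n\r\f\v]. The NUL check is needed because strchr()
 * would otherwise match the string terminator. */
static int sketch_is_space(unsigned char c)
{
    return c != '\0' && strchr(" \t\n\r\f\v", c) != NULL;
}

/* \d : [0-9] */
static int sketch_is_digit(unsigned char c)
{
    return c >= '0' && c <= '9';
}

/* \w : [a-zA-Z0-9_] */
static int sketch_is_word(unsigned char c)
{
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           sketch_is_digit(c)     ||
           c == '_';
}
```

Because the sets are ASCII only, any byte at or above 0x80 (for example a byte of a multi-byte UTF-8 sequence) falls into the negated classes \S, \D and \W.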
Character ranges must be used in [] expressions. ^ negation is supported with ranges.
ASCII only.
Range | Description |
---|---|
[:alpha:] | Alpha characters. Contains [a-zA-Z] |
[:alnum:] | Alpha numeric characters. Contains [a-zA-Z0-9] |
[:word:] | Alpha numeric characters. Contains [a-zA-Z0-9_] . Equivalent to \w |
[:space:] | White space characters. Contains [ \t\r\n\v\f] . Equivalent to \s |
[:digit:] | Digit (number) characters. Contains [0-9] . Equivalent to \d |
[:cntrl:] | Control characters. Contains [\x00-\x1F\x7F] . Note: \x00 is the NULL string terminator so this is really [\x01-\x1F\x7F] because \x00 can never be encountered in a string. |
[:print:] | Printable characters range. Contains [\x20-\x7E] |
[:xdigit:] | Hexadecimal digit range. Contains [0-9a-fA-F] |
[:lower:] | Lower case character range. Contains [a-z] |
[:upper:] | Upper case character range. Contains [A-Z] |
[:blank:] | Blank character range. Contains [ \t] |
[:graph:] | Graph character range. Contains [\x21-\x7E] |
[:punct:] | Punctuation character range. Contains [!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~] . The \ is for escaping and not part of the character set. |
Patterns can have capture groups which can be filled in a match object during string evaluation. Only numbered capture indexes are supported. Up to 99 captures can be recorded.
Index 0 is the full match for the regular expression. If the pattern matches the string, this will always be populated. Groups (when present) are numbered 1-99.
If a capture is present, its index will be available. Composite (|) patterns can cause gaps in captures: captures 1 and 5 could be present while captures 3 and 4 are not. Captures can also be present but have zero length.
Finally, captures are reported as an offset from the start of the string and the length of the captured data. This differs from some other libraries, which return start and end offsets. Length was chosen over an end offset because captures are often passed to other functions, most of which take a start and a length rather than an end offset.
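To illustrate the offset and length convention, the following minimal sketch copies a capture out of the evaluated string and converts the pair to an exclusive end offset. It is plain C and does not call mstdlib; `sketch_capture_dup` and `sketch_capture_end` are hypothetical helper names, and the offset and length are assumed to come from a call such as M_re_match_idx.

```c
#include <stdlib.h>
#include <string.h>

/* Copy the bytes described by (offset, len) out of str into a new
 * NUL-terminated buffer. Returns NULL if the range does not fit inside
 * str or if allocation fails. The caller frees the result. */
static char *sketch_capture_dup(const char *str, size_t offset, size_t len)
{
    size_t slen;
    char  *out;

    if (str == NULL)
        return NULL;

    slen = strlen(str);
    if (offset > slen || len > slen - offset)
        return NULL;

    out = malloc(len + 1);
    if (out == NULL)
        return NULL;

    memcpy(out, str + offset, len);
    out[len] = '\0';
    return out;
}

/* Exclusive end offset, for code that expects start/end pairs. */
static size_t sketch_capture_end(size_t offset, size_t len)
{
    return offset + len;
}
```

A zero-length capture is valid input here and yields an empty string, which matches the note that captures can be present with zero length.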
Patterns and strings are expected to be UTF-8 encoded and will be interpreted as such.
While Unicode is supported, normalization is not. Every Unicode code point is treated as a unique character. Many characters can be represented by more than one sequence of code points; equivalence is not applied, so each code point is treated as its own character.
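As a concrete case of the normalization caveat, "é" can be encoded either as the single code point U+00E9 or as "e" followed by the combining acute accent U+0301. The sketch below only compares the two UTF-8 byte sequences in plain C; the constant and function names are hypothetical. Without normalization the two forms are different strings, so a pattern written with one form should not be expected to match text written with the other.

```c
#include <string.h>

/* "é" as the single precomposed code point U+00E9 (UTF-8: C3 A9). */
static const char SKETCH_E_ACUTE_COMPOSED[] = "\xC3\xA9";

/* "é" as 'e' followed by combining acute accent U+0301 (UTF-8: CC 81). */
static const char SKETCH_E_ACUTE_DECOMPOSED[] = "e\xCC\x81";

/* Byte-wise equality, which is all that applies when no normalization
 * is performed. */
static int sketch_same_bytes(const char *a, const char *b)
{
    return strcmp(a, b) == 0;
}
```

If input may arrive in mixed forms, normalizing both the pattern and the text to one form before use is one option; that step would have to happen outside this module.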
typedef struct M_re M_re_t |
typedef struct M_ret_match M_re_match_t |
enum M_re_flags_t |
Pattern modifier options.
M_re_t * M_re_compile (const char *pattern, M_uint32 flags)
Compile a regular expression pattern.
[in] | pattern | The pattern to compile. |
[in] | flags | M_re_flags_t flags controlling pattern behavior. |
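Each M_re_flags_t value is a distinct bit, so several flags can be combined with bitwise OR into the single `flags` argument. The sketch below mirrors the documented enum values in a local stand-in (the `SKETCH_` names and `sketch_flags_has` are hypothetical) so that it compiles without mstdlib; real code would use the library's own enum and pass the combined value to M_re_compile.

```c
#include <stdint.h>

/* Local stand-in mirroring the documented M_re_flags_t values.
 * Real code should use the enum from the mstdlib headers instead. */
enum sketch_re_flags {
    SKETCH_RE_NONE      = 0,
    SKETCH_RE_CASECMP   = 1 << 0,
    SKETCH_RE_MULTILINE = 1 << 1,
    SKETCH_RE_DOTALL    = 1 << 2,
    SKETCH_RE_UNGREEDY  = 1 << 3
};

/* Returns 1 if every bit in want is set in flags, otherwise 0. */
static int sketch_flags_has(uint32_t flags, uint32_t want)
{
    return (flags & want) == want;
}
```

For example, case-insensitive matching where `.` also matches newlines would be requested with the OR of the CASECMP and DOTALL values.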
void M_re_destroy (M_re_t *re)
Destroy a re object.
[in] | re | Re object. |
M_bool M_re_search (const M_re_t *re, const char *str, M_re_match_t **match)
Search for the first match of the pattern in a string.
[in] | re | Re object. |
[in] | str | String to evaluate. |
[out] | match | Optional match object. |
M_bool M_re_eq_start (const M_re_t *re, const char *str)
Check if the pattern matches from the beginning of the string.
Equivalent to the pattern starting with ^ and not multi line.
[in] | re | Re object. |
[in] | str | String to evaluate. |
M_bool M_re_eq (const M_re_t *re, const char *str)
Check if the pattern matches the entire string.
Equivalent to the pattern starting with ^, ending with $ and not multi line.
[in] | re | Re object. |
[in] | str | String to evaluate. |
M_list_t * M_re_matches (const M_re_t *re, const char *str)
Get all pattern matches within a string.
[in] | re | Re object. |
[in] | str | String to evaluate. |
M_list_str_t * M_re_find_all (const M_re_t *re, const char *str)
Get all matching text within a string.
If locations of the text or captures are needed use M_re_matches.
[in] | re | Re object. |
[in] | str | String to evaluate. |
char * M_re_sub (const M_re_t *re, const char *repl, const char *str)
Substitute matching pattern in string.
The replacement string can reference capture groups using \#, \##, \g<#>, or \g<##>. The capture data applies to the match being evaluated.
[in] | re | Re object. |
[in] | repl | Replacement string. |
[in] | str | String to evaluate. |
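The four reference forms accepted in a replacement string can be recognized with a small scanner. The following is a sketch of the syntax only, not the library's implementation: `sketch_parse_capture_ref` is a hypothetical name, and reading up to two digits greedily (so that \12 means group 12) is an assumption about how the \## form is resolved.

```c
#include <ctype.h>
#include <stddef.h>

/* If s begins with a capture reference (\#, \##, \g<#> or \g<##>),
 * return the group number and store the reference length in *consumed.
 * Otherwise return -1 and leave *consumed untouched. */
static int sketch_parse_capture_ref(const char *s, size_t *consumed)
{
    size_t i      = 0;
    int    num    = 0;
    int    digits = 0;
    int    braced = 0;

    if (s == NULL || s[i] != '\\')
        return -1;
    i++;

    /* Optional g< prefix for the \g<#> form. */
    if (s[i] == 'g' && s[i + 1] == '<') {
        braced = 1;
        i += 2;
    }

    /* One or two decimal digits (assumption: read greedily). */
    while (digits < 2 && isdigit((unsigned char)s[i])) {
        num = num * 10 + (s[i] - '0');
        i++;
        digits++;
    }
    if (digits == 0)
        return -1;

    /* The \g<#> form must be closed with '>'. */
    if (braced) {
        if (s[i] != '>')
            return -1;
        i++;
    }

    if (consumed != NULL)
        *consumed = i;
    return num;
}
```

The \g<#> form is useful when the reference is immediately followed by a literal digit, since the closing `>` marks where the group number ends.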
void M_re_match_destroy (M_re_match_t *match)
Destroy a match object.
[in] | match | Match object. |
M_list_u64_t * M_re_match_idxs (const M_re_match_t *match)
Get a list of all the captured indexes.
[in] | match | Match object. |
M_bool M_re_match_idx (const M_re_match_t *match, size_t idx, size_t *offset, size_t *len)
Get the offset and length of a match at a given index.
[in] | match | Match object. |
[in] | idx | Index. |
[out] | offset | Start of match from the beginning of evaluated string. |
[out] | len | Length of matched data. |