Typedefs
typedef struct M_re	M_re_t

typedef struct M_ret_match	M_re_match_t

Enumerations
enum	M_re_flags_t { M_RE_NONE = 0 , M_RE_CASECMP = 1 << 0 , M_RE_MULTILINE = 1 << 1 , M_RE_DOTALL = 1 << 2 , M_RE_UNGREEDY = 1 << 3 }

Functions
M_re_t *	M_re_compile (const char *pattern, M_uint32 flags)

void	M_re_destroy (M_re_t *re)

M_bool	M_re_search (const M_re_t *re, const char *str, M_re_match_t **match)

M_bool	M_re_eq_start (const M_re_t *re, const char *str)

M_bool	M_re_eq (const M_re_t *re, const char *str)

M_list_t *	M_re_matches (const M_re_t *re, const char *str)

M_list_str_t *	M_re_find_all (const M_re_t *re, const char *str)

char *	M_re_sub (const M_re_t *re, const char *repl, const char *str)

void	M_re_match_destroy (M_re_match_t *match)

M_list_u64_t *	M_re_match_idxs (const M_re_match_t *match)

M_bool	M_re_match_idx (const M_re_match_t *match, size_t idx, size_t *offset, size_t *len)

Detailed Description

The engine targets Perl/Python/PCRE expression syntax. However, this is not a full implementation of the syntax.

The re engine uses DFA processing to ensure evaluation happens in a reasonable amount of time. It does not use backtracking, which prevents pathological expressions from causing very slow run times. As a result, back references in patterns are not supported.
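
To illustrate why DFA processing gives predictable run time, here is a minimal sketch of a hand-coded DFA for the pattern `ab*c`, matched against a whole string. It is not the library's implementation: the state names and the functions `dfa_step` and `dfa_match_ab_star_c` are made up for this illustration. Each input character is examined exactly once, so the cost grows linearly with the string length whatever the input looks like.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical states for the pattern ab*c (whole-string match). */
enum { ST_START = 0, ST_AFTER_A = 1, ST_ACCEPT = 2, ST_DEAD = 3 };

/* One transition: the next state depends only on the current state and
 * the current character, so there is never a position to back up to. */
static int dfa_step(int state, char c)
{
    switch (state) {
        case ST_START:
            return c == 'a' ? ST_AFTER_A : ST_DEAD;
        case ST_AFTER_A:
            if (c == 'b')
                return ST_AFTER_A;
            return c == 'c' ? ST_ACCEPT : ST_DEAD;
        default:
            /* Nothing may follow the final 'c'. */
            return ST_DEAD;
    }
}

/* Returns 1 if the whole string matches ab*c, otherwise 0. */
int dfa_match_ab_star_c(const char *s)
{
    int state = ST_START;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        state = dfa_step(state, *s);
        if (state == ST_DEAD)
            return 0;
    }
    return state == ST_ACCEPT;
}
```

A back reference such as `(a+)\1` would require remembering what the group captured, which a fixed set of states like this cannot express.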

Patterns are thread safe and re-entrant.

Supported:

Syntax

Expression	Description
`.`	any character (except newline, see DOTALL)
`^`	Start of string. Or start of line in MULTILINE
`$`	End of string. Or end of line in MULTILINE
`*`	0 or more repetitions
`+`	1 or more repetitions
`?`	0 or 1 repetitions
`*? +? ??`	Ungreedy version of repetition
`{#}`	Exactly # of repetitions
`{#,}`	# or more repetitions
`{#,#}`	Between # and # repetitions (inclusive)
`\`	Escape character. E.g. `\\ => \`
`[]`	Character range. Can be specific characters or '-' specified range. Multiple ranges can be specified. E.g. `[a-z-8XYZ]`
`[^]`	Negative character range. Can be specific characters or '-' specified range. Multiple ranges can be specified. E.g. `[^a-z-8XYZ]`
\|	Composite A or B. E.g. A\|B
`()`	Pattern and capture group. Groups expressions together for evaluation when used with \|. Also, defines a capture group.
`(?imsU-imsU)`	Allows specifying compile flags in the expression. Supports `i` (ignore case), `m` (multiline), `s` (dot all), `U` (ungreedy). - can be used to disable a flag. E.g. (?im-s). Only allowed to be used once at the start of the pattern.

Note: \ as part of | (pipe) shown in table is for escaping and not part of syntax.

Escapes

Expression	Description
C escape sequences	Any standard escape sequence that is part of C. Such as, `\n` (newline) and `\t` (tab)
`\xHH \x{HHHH}`	Hex values
`\<`	Beginning of word
`\>`	End of word

Short hand character classes

Cannot be used within brackets.

ASCII only.

Expression	Description
`\s`	White space. Equivalent to `[ \t\n\r\f\v]`
`\S`	Not white space. Equivalent to `[^ \t\n\r\f\v]`
`\d`	Digit (number). Equivalent to `[0-9]`
`\D`	Not digit. Equivalent to `[^0-9]`
`\w`	Word. Equivalent to `[a-zA-Z0-9_]`
`\W`	Not word. Equivalent to `[^a-zA-Z0-9_]`

POSIX character classes for bracket expressions

Character ranges must be used in [] expressions. ^ negation is supported with ranges.

ASCII only.

Range	Description
`[:alpha:]`	Alpha characters. Contains `[a-zA-Z]`
`[:alnum:]`	Alpha numeric characters. Contains `[a-zA-Z0-9]`
`[:word:]`	Alpha numeric characters. Contains `[a-zA-Z0-9_]`. Equivalent to `\w`
`[:space:]`	White space characters. Contains `[ \t\r\n\v\f]`. Equivalent to `\s`
`[:digit:]`	Digit (number) characters. Contains `[0-9]`. Equivalent to `\d`
`[:cntrl:]`	Control characters. Contains `[\x00-\x1F\x7F]`. Note: `\x00` is the NULL string terminator so this is really `[\x01-\x1F\x7F]` because `\x00` can never be encountered in a string.
`[:print:]`	Printable characters range. Contains `[\x20-\x7E]`
`[:xdigit:]`	Hexadecimal digit range. Contains `[0-9a-fA-F]`
`[:lower:]`	Lower case character range. Contains `[a-z]`
`[:upper:]`	Upper case character range. Contains `[A-Z]`
`[:blank:]`	Blank character range. Contains `[ \t]`
`[:graph:]`	Graph character range. Contains `[\x21-\x7E]`
`[:punct:]`	Punctuation character range. Contains `[!"#$%&'()*+,-./:;<=>?@[\]^_\`{\|}~]`

Note: \ as part of | (pipe) and ` shown in [:punct:] is for escaping and not part of character set.

Features

Numbered captures (up to 99) are supported in M_re_sub's replacement string.

Not supported:

Back references in patterns
Collating symbols (in brackets)
Equivalence classes (in brackets)
100% POSIX conformance
BRE (Basic Regular Expression) syntax
\ escape short hands (\d, \w, ...) inside of a bracket ([]) expression.

Match object

Patterns can have capture groups which can be filled in a match object during string evaluation. Only numbered capture indexes are supported. Up to 99 captures can be recorded.

Index 0 is the full match for the regular expression. If the pattern matches the string, this will always be populated. Groups (when present) are numbered 1-99.

If a capture is present the index will be available. Composite (|) patterns can cause gaps in captures, meaning captures 1 and 5 could be present while captures 3 and 4 are not. Also, captures can be present but have zero length.
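
The gap behaviour is common to regex engines in general, so it can be sketched without this library. The example below uses POSIX `regcomp`/`regexec` from `<regex.h>` as a stand-in, not M_re; the helper `capture_presence` and the `MAX_GROUPS` limit are hypothetical names chosen for the illustration. With the pattern `(a)|(b)` and the string `b`, group 2 takes part in the match while group 1 does not.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

#define MAX_GROUPS 10 /* arbitrary limit for this sketch */

/* Sets present[i] to 1 if capture i took part in the match, else 0.
 * Index 0 is the full match. Returns 1 if the pattern matched. */
int capture_presence(const char *pattern, const char *str,
                     int *present, size_t ngroups)
{
    regex_t    re;
    regmatch_t m[MAX_GROUPS];
    size_t     i;
    int        matched;

    assert(pattern != NULL && str != NULL && present != NULL);
    assert(ngroups <= MAX_GROUPS);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;

    matched = regexec(&re, str, ngroups, m, 0) == 0;
    for (i = 0; i < ngroups; i++) {
        /* POSIX reports an unused group with a start offset of -1. */
        present[i] = matched && m[i].rm_so != -1;
    }

    regfree(&re);
    return matched;
}
```

With M_re the analogous check would be M_re_match_idx returning M_FALSE for an index that has no capture.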

Finally, captures are reported as an offset from the start of the string plus the length of the captured data. This differs from some other libraries, which return start and end offsets. Length was chosen over an end offset because captures are typically passed to other functions, most of which take a start and a length rather than an end offset.
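
As a minimal sketch of working with an offset and length pair, the helpers below copy a capture out of the evaluated string and convert the pair to an exclusive end offset for code that expects one. `capture_dup` and `capture_end` are hypothetical names, not part of this API; the offset and length are assumed to be the values M_re_match_idx writes to its output parameters.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Copy the capture at (offset, len) out of str into a new NUL-terminated
 * buffer. Returns NULL if the range lies outside str or allocation fails.
 * The caller frees the result. */
char *capture_dup(const char *str, size_t offset, size_t len)
{
    size_t slen;
    char  *out;

    assert(str != NULL);
    slen = strlen(str);
    if (offset > slen || len > slen - offset)
        return NULL;

    out = malloc(len + 1);
    if (out == NULL)
        return NULL;

    memcpy(out, str + offset, len);
    out[len] = '\0';
    return out;
}

/* Exclusive end offset, as used by libraries that report start/end pairs. */
size_t capture_end(size_t offset, size_t len)
{
    return offset + len;
}
```

For the string `a b c d e f g`, an offset of 4 with a length of 5 selects `c d e`, and the equivalent end offset is 9.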

Unicode

Patterns and strings are expected to be UTF-8 encoded and will be interpreted as such.

While Unicode is supported, normalization is not. Every Unicode code point is treated as a unique character. Many characters can be written as more than one sequence of code points. Equivalence is not applied to such sequences, and each code point is treated as its own character.
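
The practical effect can be sketched with the letter "é", which has a precomposed form (U+00E9) and a decomposed form (`e` followed by the combining acute accent U+0301). The constants and helpers below are hypothetical names for this illustration and do not use the library; `utf8_code_points` assumes well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "é" as one precomposed code point, U+00E9 (UTF-8 bytes C3 A9). */
const char e_acute_precomposed[] = "\xC3\xA9";
/* "é" as 'e' plus combining acute accent U+0301 (UTF-8 bytes 65 CC 81). */
const char e_acute_decomposed[] = "e\xCC\x81";

/* Returns 1 only when the two strings are byte-for-byte identical,
 * which is what comparison without normalization amounts to. */
int same_without_normalization(const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    return strcmp(a, b) == 0;
}

/* Counts code points by skipping UTF-8 continuation bytes (10xxxxxx). */
size_t utf8_code_points(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

The two forms render identically but compare as different, and the decomposed form counts as two code points, so a pattern written with one form should not be expected to match text that uses the other.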

Typedef Documentation

◆ M_re_t

typedef struct M_re M_re_t

◆ M_re_match_t

typedef struct M_ret_match M_re_match_t

Enumeration Type Documentation

◆ M_re_flags_t

enum M_re_flags_t

Pattern modifier options.

Enumerator
M_RE_NONE	No modifiers applied.
M_RE_CASECMP	Matching should be case insensitive.
M_RE_MULTILINE	^ and $ match start and end of lines instead of start and end of string.
M_RE_DOTALL	Dot matches all characters including new line.
M_RE_UNGREEDY	Invert behavior of greedy quantifiers. E.g. ? acts like ?? and ?? acts like ?.

Function Documentation

◆ M_re_compile()

M_re_t * M_re_compile	(	const char *	pattern,
		M_uint32	flags
	)

Compile a regular expression pattern.

Parameters

[in]	pattern	The pattern to compile.
[in]	flags	M_re_flags_t flags controlling pattern behavior.

Returns: Re object on success. NULL on compilation error.

◆ M_re_destroy()

void M_re_destroy ( M_re_t * re )

Destroy a re object.

Parameters

[in] re Re object.

◆ M_re_search()

M_bool M_re_search	(	const M_re_t *	re,
		const char *	str,
		M_re_match_t **	match
	)

Search for the first match of pattern in string.

Parameters

[in]	re	Re object.
[in]	str	String to evaluate.
[out]	match	Optional match object.

Returns: M_TRUE if match was found. Otherwise, M_FALSE.

◆ M_re_eq_start()

M_bool M_re_eq_start	(	const M_re_t *	re,
		const char *	str
	)

Check if the pattern matches from the beginning of the string.

Equivalent to the pattern starting with ^ and not multi line.

Parameters

[in]	re	Re object.
[in]	str	String to evaluate.

Returns: M_TRUE if match was found. Otherwise, M_FALSE.

◆ M_re_eq()

M_bool M_re_eq	(	const M_re_t *	re,
		const char *	str
	)

Check if the pattern matches the entire string.

Equivalent to the pattern starting with ^, ending with $ and not multi line.

Parameters

[in]	re	Re object.
[in]	str	String to evaluate.

Returns: M_TRUE if match was found. Otherwise, M_FALSE.

◆ M_re_matches()

M_list_t * M_re_matches	(	const M_re_t *	re,
		const char *	str
	)

Get all pattern matches within a string.

Parameters

[in]	re	Re object.
[in]	str	String to evaluate.

Returns: List of M_re_match_t objects for every match found in the string. NULL if no matches found.

◆ M_re_find_all()

M_list_str_t * M_re_find_all	(	const M_re_t *	re,
		const char *	str
	)

Get all matching text within a string.

If locations of the text or captures are needed use M_re_matches.

Parameters

[in]	re	Re object.
[in]	str	String to evaluate.

Returns: List of matching strings for every match found in the string. NULL if no matches found.

◆ M_re_sub()

char * M_re_sub	(	const M_re_t *	re,
		const char *	repl,
		const char *	str
	)

Substitute matching pattern in string.

The replacement string can reference capture groups using \#, \##, \g<#>, \g<##>. The capture data applies to the match being evaluated. For example:

pattern: ' ([c-e])'
string:  'a b c d e f g'
repl:    '\1'
 
result:  'a bcde f g'
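
The per-match behaviour of `\1` can be reproduced with a small stand-alone sketch. It uses POSIX `regcomp`/`regexec` rather than M_re_sub, handles only a replacement equivalent to `\1`, and `sub_with_group1` is a hypothetical name. Run with the character-class pattern ` ([c-e])` on the string above, each of the three matches is replaced by its own captured letter, which produces the result shown.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Replace every match of pattern in str with the text of capture group 1
 * of that same match. Returns a newly allocated string (caller frees),
 * or NULL on a compile or allocation error. */
char *sub_with_group1(const char *pattern, const char *str)
{
    regex_t     re;
    regmatch_t  m[2];
    const char *p   = str;
    char       *out;
    size_t      pos = 0;

    assert(pattern != NULL && str != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return NULL;

    /* Group 1 lies inside the match, so the result is never longer than str. */
    out = malloc(strlen(str) + 1);
    if (out == NULL) {
        regfree(&re);
        return NULL;
    }

    while (regexec(&re, p, 2, m, p == str ? 0 : REG_NOTBOL) == 0) {
        size_t start = (size_t)m[0].rm_so;
        size_t end   = (size_t)m[0].rm_eo;

        /* Text before the match is copied unchanged. */
        memcpy(out + pos, p, start);
        pos += start;

        /* The match is replaced by this match's own group 1, if present. */
        if (m[1].rm_so != -1) {
            size_t glen = (size_t)(m[1].rm_eo - m[1].rm_so);
            memcpy(out + pos, p + m[1].rm_so, glen);
            pos += glen;
        }

        if (end == start) {
            /* Empty match: copy one character so the loop always advances. */
            if (p[end] == '\0') {
                p += end;
                break;
            }
            out[pos++] = p[end];
            end++;
        }
        p += end;
    }

    /* Copy whatever follows the last match, including the terminator. */
    strcpy(out + pos, p);
    regfree(&re);
    return out;
}
```

When nothing matches, the sketch returns a copy of the input unchanged.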

Parameters

[in]	re	Re object.
[in]	repl	Replacement string.
[in]	str	String to evaluate.

Returns: String with substitutions or original string if no substitutions were made.

◆ M_re_match_destroy()

void M_re_match_destroy ( M_re_match_t * match )

Destroy a match object.

Parameters

[in] match Match object.

◆ M_re_match_idxs()

M_list_u64_t * M_re_match_idxs ( const M_re_match_t * match )

Get a list of all the captured indexes.

Parameters

[in] match Match object.

Returns: List of indexes. Otherwise NULL if no indexes captured.

◆ M_re_match_idx()

M_bool M_re_match_idx	(	const M_re_match_t *	match,
		size_t	idx,
		size_t *	offset,
		size_t *	len
	)

Get the offset and length of a match at a given index.

Parameters

[in]	match	Match object.
[in]	idx	Index.
[out]	offset	Start of match from the beginning of evaluated string.
[out]	len	Length of matched data.

Returns: M_TRUE if match found for index. Otherwise, M_FALSE.
