Modules
	Case Folding

	Checking/Validation

Enumerations
enum	M_utf8_error_t { M_UTF8_ERROR_SUCCESS , M_UTF8_ERROR_BAD_START , M_UTF8_ERROR_TRUNCATED , M_UTF8_ERROR_EXPECT_CONTINUE , M_UTF8_ERROR_BAD_CODE_POINT , M_UTF8_ERROR_OVERLONG , M_UTF8_ERROR_INVALID_PARAM }

Functions
M_bool	M_utf8_is_valid (const char *str, const char **endptr)

M_bool	M_utf8_is_valid_cp (M_uint32 cp)

size_t	M_utf8_cnt (const char *str)

M_utf8_error_t	M_utf8_get_cp (const char *str, M_uint32 *cp, const char **next)

M_utf8_error_t	M_utf8_get_chr (const char *str, char *buf, size_t buf_size, size_t *len, const char **next)

M_utf8_error_t	M_utf8_get_chr_buf (const char *str, M_buf_t *buf, const char **next)

char *	M_utf8_next_chr (const char *str)

M_utf8_error_t	M_utf8_from_cp (char *buf, size_t buf_size, size_t *len, M_uint32 cp)

M_utf8_error_t	M_utf8_from_cp_buf (M_buf_t *buf, M_uint32 cp)

M_utf8_error_t	M_utf8_cp_at (const char *str, size_t idx, M_uint32 *cp)

M_utf8_error_t	M_utf8_chr_at (const char *str, char *buf, size_t buf_size, size_t *len, size_t idx)

Detailed Description

Targets Unicode 10.0.

Note: Non-characters are considered an error condition because they do not have a defined meaning.

A utf-8 sequence is defined as the variable number of bytes that represent a single utf-8 display character.

Enumeration Type Documentation

◆ M_utf8_error_t

enum M_utf8_error_t

Error codes.

Enumerator
M_UTF8_ERROR_SUCCESS	Success.
M_UTF8_ERROR_BAD_START	Start of byte sequence is invalid.
M_UTF8_ERROR_TRUNCATED	The utf-8 character length exceeds the data length.
M_UTF8_ERROR_EXPECT_CONTINUE	A continuation marker was expected but not found.
M_UTF8_ERROR_BAD_CODE_POINT	Code point is invalid.
M_UTF8_ERROR_OVERLONG	Overlong encoding encountered.
M_UTF8_ERROR_INVALID_PARAM	Input parameter is invalid.

Function Documentation

◆ M_utf8_is_valid()

M_bool M_utf8_is_valid	(	const char *	str,
		const char **	endptr
	)

Check if a given string is valid utf-8 encoded.

Parameters

[in]	str	utf-8 string.
[out]	endptr	On success, will be set to the NULL terminator. On error, will be set to the character that caused the failure.

Returns: M_TRUE if str is a valid utf-8 sequence. Otherwise, M_FALSE.

◆ M_utf8_is_valid_cp()

M_bool M_utf8_is_valid_cp ( M_uint32 cp )

Check if a given code point is valid for utf-8.

Parameters

[in] cp Code point.

Returns: M_TRUE if code point is valid for utf-8. Otherwise, M_FALSE.
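The check can be pictured with the following minimal sketch. It is not the mstdlib implementation: sketch_is_valid_cp is a hypothetical stand-in, and the exact set of rejected values (out-of-range values, UTF-16 surrogates, and non-characters, per the note in the detailed description) is an assumption about what the library tests.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for illustration only; not the mstdlib source. */
static bool sketch_is_valid_cp(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return false; /* beyond the Unicode code space */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false; /* UTF-16 surrogate range */
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return false; /* contiguous non-character block */
    if ((cp & 0xFFFE) == 0xFFFE)
        return false; /* last two code points of every plane */
    return true;
}
```

Under these assumptions U+0041 and U+20AC pass, while U+D800, U+FFFE, and 0x110000 are rejected.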

◆ M_utf8_cnt()

size_t M_utf8_cnt ( const char * str )

Get the number of utf-8 characters in a string.

This is the number of characters, not the number of bytes, in the string. M_str_len will only return the same value if the string is only ascii.

Parameters

[in] str utf-8 string.

Returns: Number of characters on success, or 0 on failure. Because an empty string also yields 0, use M_str_isempty to tell a failure from an empty string.
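The difference between character count and byte count can be illustrated with a short sketch. sketch_utf8_cnt is a hypothetical helper, not the library routine: it assumes the input is already valid utf-8 and simply counts every byte that is not a continuation byte (10xxxxxx), whereas M_utf8_cnt reports failure for invalid input.

```c
#include <stddef.h>

/* Hypothetical helper: counts characters in a string that is assumed to be
 * valid utf-8. Unlike M_utf8_cnt it performs no validation. */
static size_t sketch_utf8_cnt(const char *str)
{
    size_t cnt = 0;

    if (str == NULL)
        return 0;

    for (; *str != '\0'; str++) {
        /* Continuation bytes look like 10xxxxxx; every other byte starts
         * a new character. */
        if (((unsigned char)*str & 0xC0) != 0x80)
            cnt++;
    }
    return cnt;
}
```

For "h\xC3\xA9llo" (the word "héllo") this counts 5 characters even though the string occupies 6 bytes.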

◆ M_utf8_get_cp()

M_utf8_error_t M_utf8_get_cp	(	const char *	str,
		M_uint32 *	cp,
		const char **	next
	)

Read a utf-8 sequence as a code point.

Parameters

[in]	str	utf-8 string.
[out]	cp	Code point. Can be NULL.
[out]	next	Start of next character. Will point to NULL terminator if last character.

Returns: Result.
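To show how the M_utf8_error_t values relate to the byte structure of a sequence, here is a minimal decoder sketch. It is an illustration under stated assumptions, not the mstdlib implementation: sk_err_t and sketch_get_cp are hypothetical names that mirror the documented enumerators, and treating NULL or empty input as an invalid parameter is a guess.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of M_utf8_error_t, for illustration only. */
typedef enum {
    SK_SUCCESS,
    SK_BAD_START,
    SK_TRUNCATED,
    SK_EXPECT_CONTINUE,
    SK_BAD_CODE_POINT,
    SK_OVERLONG,
    SK_INVALID_PARAM
} sk_err_t;

static sk_err_t sketch_get_cp(const char *str, uint32_t *cp, const char **next)
{
    /* Smallest code point that legitimately needs 1..4 bytes. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    const unsigned char *s = (const unsigned char *)str;
    uint32_t             v;
    size_t               len;
    size_t               i;

    if (s == NULL || *s == '\0')
        return SK_INVALID_PARAM; /* assumption: empty input is rejected */

    /* The lead byte determines the sequence length. */
    if (s[0] < 0x80)                { len = 1; v = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
    else return SK_BAD_START;       /* stray continuation or 0xF8..0xFF */

    /* Each following byte must be a continuation byte (10xxxxxx). */
    for (i = 1; i < len; i++) {
        if (s[i] == '\0')
            return SK_TRUNCATED;
        if ((s[i] & 0xC0) != 0x80)
            return SK_EXPECT_CONTINUE;
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (v < min_cp[len])
        return SK_OVERLONG;         /* value fits in a shorter sequence */
    if (v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF) ||
        (v >= 0xFDD0 && v <= 0xFDEF) || (v & 0xFFFE) == 0xFFFE)
        return SK_BAD_CODE_POINT;   /* out of range, surrogate, non-character */

    if (cp != NULL)
        *cp = v;
    if (next != NULL)
        *next = str + len;
    return SK_SUCCESS;
}
```

In this sketch the three bytes E2 82 AC decode to U+20AC with next left on the terminator, while C0 80 is reported as overlong and a lone 80 as a bad start.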

◆ M_utf8_get_chr()

M_utf8_error_t M_utf8_get_chr	(	const char *	str,
		char *	buf,
		size_t	buf_size,
		size_t *	len,
		const char **	next
	)

Read a utf-8 sequence.

Output is not NULL terminated.

Parameters

[in]	str	utf-8 string.
[in]	buf	Buffer to put utf-8 sequence. Can be NULL.
[in]	buf_size	Size of the buffer.
[out]	len	Length of the sequence that was put into buffer.
[out]	next	Start of next character. Will point to NULL terminator if last character.

Returns: Result.

◆ M_utf8_get_chr_buf()

M_utf8_error_t M_utf8_get_chr_buf	(	const char *	str,
		M_buf_t *	buf,
		const char **	next
	)

Read a utf-8 sequence into an M_buf_t.

Parameters

[in]	str	utf-8 string.
[in]	buf	Buffer to put utf-8 sequence.
[out]	next	Start of next character. Will point to NULL terminator if last character.

Returns: Result.

◆ M_utf8_next_chr()

char * M_utf8_next_chr ( const char * str )

Get the location of the next utf-8 sequence.

Does not validate characters. Useful when parsing an invalid string and wanting to move past to ignore or replace invalid characters.

Parameters

[in] str utf-8 string.

Returns: Pointer to next character in sequence.
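One plausible way to step past a sequence without validating it is sketched below. sketch_next_chr is a hypothetical stand-in, not the library code: it advances one byte and then skips any continuation bytes. Returning the input unchanged for NULL or at the terminator is an assumption, as is the const return type.

```c
#include <stddef.h>

/* Hypothetical stand-in for illustration; performs no validation. */
static const char *sketch_next_chr(const char *str)
{
    if (str == NULL || *str == '\0')
        return str; /* assumption: nothing to skip */

    /* Step over the current byte, then over any continuation bytes
     * (10xxxxxx) that belong to the same sequence. */
    str++;
    while (((unsigned char)*str & 0xC0) == 0x80)
        str++;
    return str;
}
```

Because no validation happens, the same loop also moves past stray continuation bytes in a malformed string, which is what makes this kind of helper useful for skipping or replacing invalid input.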

◆ M_utf8_from_cp()

M_utf8_error_t M_utf8_from_cp	(	char *	buf,
		size_t	buf_size,
		size_t *	len,
		M_uint32	cp
	)

Convert a code point to a utf-8 sequence.

Output is not NULL terminated.

Parameters

[in]	buf	Buffer to put utf-8 sequence.
[in]	buf_size	Size of the buffer.
[out]	len	Length of the sequence that was put into buffer.
[in]	cp	Code point to convert from.

Returns: Result.
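The encoding step can be sketched as follows. This is an illustration, not the mstdlib implementation: sketch_from_cp is a hypothetical function that returns 0 on success and -1 on any failure rather than an M_utf8_error_t, and it only rejects out-of-range values and surrogates. As with M_utf8_from_cp, the output is not NULL terminated.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical encoder for illustration. Returns 0 on success, -1 on error. */
static int sketch_from_cp(char *buf, size_t buf_size, size_t *len, uint32_t cp)
{
    size_t n;

    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return -1; /* not encodable */

    /* Number of bytes needed for this code point. */
    n = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    if (buf == NULL || buf_size < n)
        return -1; /* buffer missing or too small */

    switch (n) {
    case 1:
        buf[0] = (char)cp;
        break;
    case 2:
        buf[0] = (char)(0xC0 | (cp >> 6));
        buf[1] = (char)(0x80 | (cp & 0x3F));
        break;
    case 3:
        buf[0] = (char)(0xE0 | (cp >> 12));
        buf[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (char)(0x80 | (cp & 0x3F));
        break;
    default:
        buf[0] = (char)(0xF0 | (cp >> 18));
        buf[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (char)(0x80 | (cp & 0x3F));
        break;
    }

    if (len != NULL)
        *len = n;
    return 0;
}
```

With a 4-byte buffer, U+20AC produces the bytes E2 82 AC with a length of 3; a 2-byte buffer is too small for the same code point and the call fails.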

◆ M_utf8_from_cp_buf()

M_utf8_error_t M_utf8_from_cp_buf	(	M_buf_t *	buf,
		M_uint32	cp
	)

Convert a code point to a utf-8 sequence writing to an M_buf_t.

Parameters

[in]	buf	Buffer to put utf-8 sequence.
[in]	cp	Code point to convert from.

Returns: Result.

◆ M_utf8_cp_at()

M_utf8_error_t M_utf8_cp_at	(	const char *	str,
		size_t	idx,
		M_uint32 *	cp
	)

Get the code point at a given index.

Index is based on M_utf8_cnt (characters), not the number of bytes. This causes a full scan of the string, so iteration should use M_utf8_get_cp instead.

Parameters

[in]	str	utf-8 string.
[in]	idx	Index.
[out]	cp	Code point.

Returns: Result.

◆ M_utf8_chr_at()

M_utf8_error_t M_utf8_chr_at	(	const char *	str,
		char *	buf,
		size_t	buf_size,
		size_t *	len,
		size_t	idx
	)

Get the utf-8 sequence at a given index.

Index is based on M_utf8_cnt (characters), not the number of bytes. This causes a full scan of the string, so iteration should use M_utf8_get_chr instead.

Parameters

[in]	str	utf-8 string.
[in]	buf	Buffer to put utf-8 sequence.
[in]	buf_size	Size of the buffer.
[out]	len	Length of the sequence that was put into buffer.
[in]	idx	Index.

Returns: Result.
