Mstdlib-1.24.0
utf-8 Handling

Modules

 Case Folding
 
 Checking/Validation
 

Enumerations

enum  M_utf8_error_t {
  M_UTF8_ERROR_SUCCESS ,
  M_UTF8_ERROR_BAD_START ,
  M_UTF8_ERROR_TRUNCATED ,
  M_UTF8_ERROR_EXPECT_CONTINUE ,
  M_UTF8_ERROR_BAD_CODE_POINT ,
  M_UTF8_ERROR_OVERLONG ,
  M_UTF8_ERROR_INVALID_PARAM
}
 

Functions

M_bool M_utf8_is_valid (const char *str, const char **endptr)
 
M_bool M_utf8_is_valid_cp (M_uint32 cp)
 
size_t M_utf8_cnt (const char *str)
 
M_utf8_error_t M_utf8_get_cp (const char *str, M_uint32 *cp, const char **next)
 
M_utf8_error_t M_utf8_get_chr (const char *str, char *buf, size_t buf_size, size_t *len, const char **next)
 
M_utf8_error_t M_utf8_get_chr_buf (const char *str, M_buf_t *buf, const char **next)
 
char * M_utf8_next_chr (const char *str)
 
M_utf8_error_t M_utf8_from_cp (char *buf, size_t buf_size, size_t *len, M_uint32 cp)
 
M_utf8_error_t M_utf8_from_cp_buf (M_buf_t *buf, M_uint32 cp)
 
M_utf8_error_t M_utf8_cp_at (const char *str, size_t idx, M_uint32 *cp)
 
M_utf8_error_t M_utf8_chr_at (const char *str, char *buf, size_t buf_size, size_t *len, size_t idx)
 

Detailed Description

Targets Unicode 10.0.

Note
Non-characters are treated as an error condition because they do not have a defined meaning.

A utf-8 sequence is defined as the variable number of bytes that represent a single utf-8 display character.

Enumeration Type Documentation

◆ M_utf8_error_t

Error codes.

Enumerator
M_UTF8_ERROR_SUCCESS 

Success.

M_UTF8_ERROR_BAD_START 

Start of byte sequence is invalid.

M_UTF8_ERROR_TRUNCATED 

The utf-8 character length exceeds the data length.

M_UTF8_ERROR_EXPECT_CONTINUE 

A continuation marker was expected but not found.

M_UTF8_ERROR_BAD_CODE_POINT 

Code point is invalid.

M_UTF8_ERROR_OVERLONG 

Overlong encoding encountered.

M_UTF8_ERROR_INVALID_PARAM 

Input parameter is invalid.
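
One way these categories can arise while decoding a single sequence is sketched below. This is a standalone illustration, not Mstdlib's implementation: the sk_ names, the enum, and the order of the checks are assumptions, and the non-character check is left out for brevity.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error codes mirroring the categories above (not the library's enum). */
typedef enum {
    SK_OK,
    SK_BAD_START,
    SK_TRUNCATED,
    SK_EXPECT_CONTINUE,
    SK_BAD_CODE_POINT,
    SK_OVERLONG,
    SK_INVALID_PARAM
} sk_err_t;

/* Decode the first sequence of a NUL-terminated string into *cp. */
static sk_err_t sk_decode(const char *str, uint32_t *cp)
{
    /* Smallest code point that legitimately needs a sequence of n bytes. */
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    const unsigned char *s = (const unsigned char *)str;
    uint32_t v;
    size_t   n;
    size_t   i;

    if (s == NULL || cp == NULL)
        return SK_INVALID_PARAM;

    /* The lead byte determines how many bytes the sequence has. */
    if (s[0] < 0x80)                { n = 1; v = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { n = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { n = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { n = 4; v = s[0] & 0x07; }
    else
        return SK_BAD_START; /* stray continuation byte, or 0xF8..0xFF */

    for (i = 1; i < n; i++) {
        if (s[i] == '\0')
            return SK_TRUNCATED;       /* string ended inside the sequence */
        if ((s[i] & 0xC0) != 0x80)
            return SK_EXPECT_CONTINUE; /* byte is not of the form 10xxxxxx */
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (v < min_cp[n])
        return SK_OVERLONG;            /* value fits in a shorter sequence */
    if (v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
        return SK_BAD_CODE_POINT;      /* out of range, or a surrogate */

    *cp = v;
    return SK_OK;
}
```

In this sketch the two bytes C0 80 are reported as overlong because they encode U+0000, which fits in a single byte.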

Function Documentation

◆ M_utf8_is_valid()

M_bool M_utf8_is_valid ( const char *  str,
const char **  endptr 
)

Check if a given string is valid utf-8 encoded.

Parameters
[in]   str     utf-8 string.
[out]  endptr  On success, will be set to the NULL terminator. On error, will be set to the character that caused the failure.
Returns
M_TRUE if str is a valid utf-8 sequence. Otherwise, M_FALSE.
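
The endptr contract described above can be illustrated with a small standalone validator. This is a sketch under stated assumptions, not the library's code: sk_seq_len and sk_is_valid are hypothetical names, and the non-character check is omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sequence length implied by a lead byte, or 0 if the byte cannot start one. */
static size_t sk_seq_len(unsigned char c)
{
    if (c < 0x80)           return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Walk the string one sequence at a time. On return, *endptr (if given)
 * points at the terminator on success or at the start of the first
 * sequence that failed. */
static bool sk_is_valid(const char *str, const char **endptr)
{
    static const uint32_t min_cp[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    const unsigned char *s = (const unsigned char *)str;
    bool ok = (s != NULL);

    while (ok && *s != '\0') {
        size_t   n = sk_seq_len(*s);
        size_t   i;
        uint32_t v = 0;

        ok = (n != 0);
        if (ok)
            v = (n == 1) ? *s : (uint32_t)(*s & (0xFF >> (n + 1)));
        /* A NUL byte fails the continuation test, so this never reads
         * past the terminator. */
        for (i = 1; ok && i < n; i++) {
            ok = ((s[i] & 0xC0) == 0x80);
            v  = (v << 6) | (uint32_t)(s[i] & 0x3F);
        }
        /* Reject overlong forms, out-of-range values, and surrogates. */
        ok = ok && v >= min_cp[n] && v <= 0x10FFFF
                && !(v >= 0xD800 && v <= 0xDFFF);
        if (ok)
            s += n;
    }
    if (endptr != NULL)
        *endptr = (const char *)s;
    return ok;
}
```

A caller can subtract str from the returned endptr to get the byte offset of the first problem.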

◆ M_utf8_is_valid_cp()

M_bool M_utf8_is_valid_cp ( M_uint32  cp)

Check if a given code point is valid for utf-8.

Parameters
[in]  cp  Code point.
Returns
M_TRUE if code point is valid for utf-8. Otherwise, M_FALSE.
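
As a rough illustration of what "valid" can mean here, the sketch below rejects values above U+10FFFF, UTF-16 surrogates, and Unicode non-characters. The exact set of checks the library performs is an assumption based on the note in the detailed description, and sk_cp_is_valid is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

static bool sk_cp_is_valid(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return false; /* beyond the Unicode code space */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false; /* surrogates are reserved for UTF-16 */
    if (cp >= 0xFDD0 && cp <= 0xFDEF)
        return false; /* contiguous block of non-characters */
    if ((cp & 0xFFFE) == 0xFFFE)
        return false; /* U+nFFFE and U+nFFFF in every plane */
    return true;
}
```

Note that U+FFFD, the replacement character, passes all four checks.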

◆ M_utf8_cnt()

size_t M_utf8_cnt ( const char *  str)

Get the number of utf-8 characters in a string.

This is the number of characters, not the number of bytes, in the string. M_str_len returns the same value only if the string contains nothing but ascii.

Parameters
[in]  str  utf-8 string.
Returns
Number of characters on success, or 0 on failure. Use M_str_isempty to determine whether 0 indicates a failure or an empty string.
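
The difference between characters and bytes can be seen in a minimal standalone sketch that counts every byte that is not a continuation byte. Unlike the documented function, this hypothetical sk_count does not validate its input and never reports failure.

```c
#include <stddef.h>

/* Count characters in a string that is assumed to be valid utf-8.
 * Every character contributes exactly one byte that is not of the
 * form 10xxxxxx, so counting those bytes counts characters. */
static size_t sk_count(const char *str)
{
    size_t n = 0;

    if (str == NULL)
        return 0;
    for (; *str != '\0'; str++) {
        if (((unsigned char)*str & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For a string holding "h", U+00E9, "llo", this counts 5 characters even though the string occupies 6 bytes.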

◆ M_utf8_get_cp()

M_utf8_error_t M_utf8_get_cp ( const char *  str,
M_uint32 *  cp,
const char **  next 
)

Read a utf-8 sequence as a code point.

Parameters
[in]   str   utf-8 string.
[out]  cp    Code point. Can be NULL.
[out]  next  Start of next character. Will point to NULL terminator if last character.
Returns
Result.

◆ M_utf8_get_chr()

M_utf8_error_t M_utf8_get_chr ( const char *  str,
char *  buf,
size_t  buf_size,
size_t *  len,
const char **  next 
)

Read a utf-8 sequence.

Output is not NULL terminated.

Parameters
[in]   str       utf-8 string.
[in]   buf       Buffer to put utf-8 sequence. Can be NULL.
[in]   buf_size  Size of the buffer.
[out]  len       Length of the sequence that was put into buffer.
[out]  next      Start of next character. Will point to NULL terminator if last character.
Returns
Result.

◆ M_utf8_get_chr_buf()

M_utf8_error_t M_utf8_get_chr_buf ( const char *  str,
M_buf_t *  buf,
const char **  next 
)

Read a utf-8 sequence into an M_buf_t.

Parameters
[in]   str   utf-8 string.
[in]   buf   Buffer to put utf-8 sequence.
[out]  next  Start of next character. Will point to NULL terminator if last character.
Returns
Result.

◆ M_utf8_next_chr()

char * M_utf8_next_chr ( const char *  str)

Get the location of the next utf-8 sequence.

Does not validate characters. Useful when parsing an invalid string and wanting to move past to ignore or replace invalid characters.

Parameters
[in]  str  utf-8 string.
Returns
Pointer to next character in sequence.
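
Skipping ahead without validating can be sketched as follows: step over the current byte, then over any continuation bytes that follow it. Whether the library resynchronizes in exactly this way is an assumption, and sk_next is a hypothetical name.

```c
#include <stddef.h>

/* Return a pointer to the start of the next sequence, or to the
 * terminator if there is none. No validation is performed. */
static const char *sk_next(const char *str)
{
    if (str == NULL || *str == '\0')
        return str;

    str++; /* step over the current lead (or stray) byte */

    /* Skip bytes of the form 10xxxxxx. The terminator is 0x00, so the
     * loop always stops at the end of the string. */
    while (((unsigned char)*str & 0xC0) == 0x80)
        str++;

    return str;
}
```

Because the sketch only looks at the bit pattern of each byte, it also steps past an invalid lead byte such as 0xFF, which is the situation the description above has in mind.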

◆ M_utf8_from_cp()

M_utf8_error_t M_utf8_from_cp ( char *  buf,
size_t  buf_size,
size_t *  len,
M_uint32  cp 
)

Convert a code point to a utf-8 sequence.

Output is not NULL terminated.

Parameters
[in]   buf       Buffer to put utf-8 sequence.
[in]   buf_size  Size of the buffer.
[out]  len       Length of the sequence that was put into buffer.
[in]   cp        Code point to convert from.
Returns
Result.
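
The conversion itself follows the standard utf-8 bit layout. The standalone sketch below shows that layout with the same buffer, size, and length convention as the signature above. The name sk_encode and the 0 / -1 return values are assumptions made for the example, not the library's behavior.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode cp into buf without a terminator. Returns 0 on success, or -1
 * if an argument is NULL, cp is not encodable, or buf is too small. */
static int sk_encode(char *buf, size_t buf_size, size_t *len, uint32_t cp)
{
    size_t n;

    if (buf == NULL || len == NULL)
        return -1;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return -1;

    /* Pick the shortest form that can hold the value. */
    n = (cp < 0x80) ? 1 : (cp < 0x800) ? 2 : (cp < 0x10000) ? 3 : 4;
    if (buf_size < n)
        return -1;

    switch (n) {
    case 1:
        buf[0] = (char)cp;
        break;
    case 2:
        buf[0] = (char)(0xC0 | (cp >> 6));
        buf[1] = (char)(0x80 | (cp & 0x3F));
        break;
    case 3:
        buf[0] = (char)(0xE0 | (cp >> 12));
        buf[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (char)(0x80 | (cp & 0x3F));
        break;
    default:
        buf[0] = (char)(0xF0 | (cp >> 18));
        buf[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (char)(0x80 | (cp & 0x3F));
        break;
    }
    *len = n;
    return 0;
}
```

Since the output is not terminated, a caller that wants a C string needs a buffer of at least five bytes and must write the terminator at buf[len] itself.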

◆ M_utf8_from_cp_buf()

M_utf8_error_t M_utf8_from_cp_buf ( M_buf_t *  buf,
M_uint32  cp 
)

Convert a code point to a utf-8 sequence writing to an M_buf_t.

Parameters
[in]  buf  Buffer to put utf-8 sequence.
[in]  cp   Code point to convert from.
Returns
Result.

◆ M_utf8_cp_at()

M_utf8_error_t M_utf8_cp_at ( const char *  str,
size_t  idx,
M_uint32 *  cp 
)

Get the code point at a given index.

The index counts characters, as M_utf8_cnt does, not bytes. Each lookup requires scanning the string, so use M_utf8_get_cp when iterating.

Parameters
[in]   str  utf-8 string.
[in]   idx  Index.
[out]  cp   Code point.
Returns
Result.
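
Why a character index forces a scan can be seen in a standalone sketch: because sequences vary in length, the only way to find the character at a given index is to walk the bytes before it. The name sk_seek and the zero-based index are assumptions made for the example.

```c
#include <stddef.h>

/* Return a pointer to the first byte of the character at index idx
 * (zero-based in this sketch), or NULL if idx is past the end.
 * The string is assumed to be valid utf-8. */
static const char *sk_seek(const char *str, size_t idx)
{
    size_t i = 0;

    if (str == NULL)
        return NULL;
    for (; *str != '\0'; str++) {
        if (((unsigned char)*str & 0xC0) == 0x80)
            continue; /* continuation byte, still inside a character */
        if (i == idx)
            return str;
        i++;
    }
    return NULL;
}
```

Calling a lookup like this once per index in a loop repeats the walk each time, which makes the loop quadratic in the length of the string. Advancing with a next pointer visits each byte once.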

◆ M_utf8_chr_at()

M_utf8_error_t M_utf8_chr_at ( const char *  str,
char *  buf,
size_t  buf_size,
size_t *  len,
size_t  idx 
)

Get the utf-8 sequence at a given index.

The index counts characters, as M_utf8_cnt does, not bytes. Each lookup requires scanning the string, so use M_utf8_get_chr when iterating.

Parameters
[in]   str       utf-8 string.
[in]   buf       Buffer to put utf-8 sequence.
[in]   buf_size  Size of the buffer.
[out]  len       Length of the sequence that was put into buffer.
[in]   idx       Index.
Returns
Result.