libu8
u8ctype.h File Reference

These functions and macros interrogate and transform unicode points.

Data Structures

struct  U8_DECOMPOSITION
 struct U8_DECOMPOSITION indicates a mapping between a single Unicode codepoint and an equivalent Unicode sequence.
 
struct  U8_CHARINFO_TABLE
 struct U8_CHARINFO_TABLE is used to store additional character info not provided by the statically defined tables.
 

Macros

#define u8_isalpha(c)   ((c>=0) && ((u8_getcharinfo(c)) < 6))
 Returns 1 if its argument is an alphabetic unicode point.
 
#define u8_islower(c)   ((c>=0) && ((u8_getcharinfo(c)) == U8_LOWER_LETTER))
 Returns 1 if its argument is a lower-case alphabetic unicode point.
 
#define u8_isupper(c)
 Returns 1 if its argument is an upper-case alphabetic unicode point.
 
#define u8_ismodifier(c)   ((c>=0) && ((u8_getcharinfo(c)) == U8_MODIFIER_LETTER))
 Returns 1 if its argument is a modifier unicode point.
 
#define u8_isdigit(c)   ((c>=0) && ((u8_getcharinfo(c)) == U8_NUMBER))
 Returns 1 if its argument is a numeric digit unicode point.
 
#define u8_ispunct(c)
 Returns 1 if its argument is a punctuation character.
 
#define u8_isprint(c)
 Returns 1 if its argument is a printing character (letter, digit, or punctuation).
 
#define u8_isspace(c)   ((c>=0) && ((u8_getcharinfo(c)) == U8_SEPARATOR))
 Returns 1 if its argument is a whitespace unicode point.
 
#define u8_ishspace(c)
 Returns 1 if its argument is a horizontal whitespace unicode point.
 
#define u8_isvspace(c)
 Returns 1 if its argument is a vertical whitespace unicode point.
 
#define u8_isctrl(c)   ((c>=0) && ((c<0x20) || ((c>0x7e) && (c<0x9f))))
 Returns 1 if its argument is a standard control character.
 
#define u8_isalnum(c)   ((c>=0) && (((u8_getcharinfo(c)) < 6) || (u8_isdigit(c))))
 Returns 1 if its argument is an alphanumeric unicode point.
 
#define u8_isxdigit(c)   ((c>=0) && ((c<128) && (isxdigit(c))))
 Returns 1 if its argument is an ASCII hex digit.
 
#define u8_isodigit(c)   ((c>=0) && ((c<128) && (isdigit(c)) && (c<'8')))
 Returns 1 if its argument is an ASCII octal digit.
 
#define u8_toupper(c)
 Returns a non-lowercase version of a unicode code point.
 
#define u8_tolower(c)
 Returns a non-uppercase version of a unicode code point.
 
#define u8_digit_weight(c)   ((u8_isdigit(c)) ? ((c<0x10000) ? (u8_getchardata(c)) : (u8_lookup_chardata(c))) : (0))
 Returns the numeric weight of a numeric unicode code point.
 

Typedefs

typedef struct U8_DECOMPOSITION U8_DECOMPOSITION
 struct U8_DECOMPOSITION indicates a mapping between a single Unicode codepoint and an equivalent Unicode sequence.
 

Functions

U8_EXPORT int u8_entity2code (u8_string name)
 Converts an XML entity name into the corresponding code point.
 
U8_EXPORT u8_string u8_code2entity (int code)
 Converts a code point into an XML entity name.
 
U8_EXPORT int u8_parse_entity (const u8_byte *entity, u8_string *endp)
 Parses a unicode entity name from a string, recording the endpoint.
 
U8_EXPORT int u8_parse_entity_err (const u8_byte *entity, u8_string *endp)
 Parses a unicode entity name from a string, recording the endpoint.
 
U8_EXPORT void u8_set_charinfo (int n, unsigned char *info, short *data)
 Sets the character information for a particular code point.
 

Detailed Description

These functions and macros interrogate and transform unicode points.

They include standard character predicates (u8_isspace, u8_ispunct, etc.) as well as functions and macros for changing case and converting to and from XML character entities.

Macro Definition Documentation

#define u8_digit_weight (   c)    ((u8_isdigit(c)) ? ((c<0x10000) ? (u8_getchardata(c)) : (u8_lookup_chardata(c))) : (0))

Returns the numeric weight of a numeric unicode code point.

#define u8_isalnum (   c)    ((c>=0) && (((u8_getcharinfo(c)) < 6) || (u8_isdigit(c))))

Returns 1 if its argument is an alphanumeric unicode point.

Referenced by u8_guess_encoding().

#define u8_isalpha (   c)    ((c>=0) && ((u8_getcharinfo(c)) < 6))

Returns 1 if its argument is an alphabetic unicode point.

#define u8_isctrl (   c)    ((c>=0) && ((c<0x20) || ((c>0x7e) && (c<0x9f))))

Returns 1 if its argument is a standard control character.
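
As a rough illustration of how this predicate might be used, the sketch below replaces control characters in an array of code points. The macro is copied locally from the definition above so the example compiles without libu8; the helper name `replace_ctrl` and the int-array representation are assumptions made for the example, not part of the library. As the definition is written, 0x7f through 0x9e match while 0x9f does not.

```c
#include <stddef.h>

/* Local copy of the definition documented above, so that this sketch
   builds on its own. With libu8 installed, include its u8ctype.h instead. */
#define u8_isctrl(c) ((c>=0) && ((c<0x20) || ((c>0x7e) && (c<0x9f))))

/* Hypothetical helper (not part of libu8): overwrite every control
   character in points[0..n-1] with replacement and return how many
   code points were overwritten. */
size_t replace_ctrl(int *points, size_t n, int replacement)
{
  size_t replaced = 0;
  for (size_t i = 0; i < n; i++) {
    int c = points[i];  /* read once: the macro evaluates c several times */
    if (u8_isctrl(c)) {
      points[i] = replacement;
      replaced++;
    }
  }
  return replaced;
}
```

Because the macro expands its argument more than once, the argument is copied into a local variable first; an expression with side effects, such as `*p++`, should not be passed directly.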

#define u8_isdigit (   c)    ((c>=0) && ((u8_getcharinfo(c)) == U8_NUMBER))

Returns 1 if its argument is a numeric digit unicode point.

#define u8_ishspace (   c)
Value:
((c>=0) && \
((c==' ')||(c=='\t')||(c==0x1680)||(c==0x180e)|| \
((c>=0x2000)&&(c<0x200b))))

Returns 1 if its argument is a horizontal whitespace unicode point.

#define u8_islower (   c)    ((c>=0) && ((u8_getcharinfo(c)) == U8_LOWER_LETTER))

Returns 1 if its argument is a lower-case alphabetic unicode point.

#define u8_ismodifier (   c)    ((c>=0) && ((u8_getcharinfo(c)) == U8_MODIFIER_LETTER))

Returns 1 if its argument is a modifier unicode point.

#define u8_isodigit (   c)    ((c>=0) && ((c<128) && (isdigit(c)) && (c<'8')))

Returns 1 if its argument is an ASCII octal digit.

#define u8_ispunct (   c)
Value:
((c>=0) && \
(((u8_getcharinfo(c)) == U8_GLUE_PUNCTUATION) || \
((u8_getcharinfo(c)) == U8_BREAK_PUNCTUATION) || \
((u8_getcharinfo(c)) == U8_SYMBOL) || \
((u8_getcharinfo(c)) == U8_MARK)))

Returns 1 if its argument is a punctuation character.

#define u8_isspace (   c)    ((c>=0) && ((u8_getcharinfo(c)) == U8_SEPARATOR))

Returns 1 if its argument is a whitespace unicode point.

Referenced by u8_guess_encoding().

#define u8_isupper (   c)
Value:
((c>=0) && \
(((u8_getcharinfo(c)) == U8_UPPER_LETTER) || \
((u8_getcharinfo(c)) == U8_TITLE_LETTER)))

Returns 1 if its argument is an upper-case alphabetic unicode point.

#define u8_isvspace (   c)
Value:
((c>=0) && \
((c=='\n')||(c=='\r')||(c==0x0c)||(c==0x0b)|| \
(c==0x1C)||(c==0x1D)||(c==0x1E)||(c==0x1F)|| \
(c==0x85)||(c==0x2029)))

Returns 1 if its argument is a vertical whitespace unicode point.
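
The two whitespace predicates can be contrasted with a small sketch that measures leading indentation and counts line-breaking code points. Both macros are copied locally from the definitions on this page so the example stands alone; the helper names `count_leading_hspace` and `count_vspace`, and the use of a plain int array of code points, are assumptions for illustration only.

```c
#include <stddef.h>

/* Local copies of the definitions documented on this page, so that this
   sketch builds without libu8 installed. */
#define u8_ishspace(c) \
  ((c>=0) && \
   ((c==' ')||(c=='\t')||(c==0x1680)||(c==0x180e)|| \
    ((c>=0x2000)&&(c<0x200b))))
#define u8_isvspace(c) \
  ((c>=0) && \
   ((c=='\n')||(c=='\r')||(c==0x0c)||(c==0x0b)|| \
    (c==0x1C)||(c==0x1D)||(c==0x1E)||(c==0x1F)|| \
    (c==0x85)||(c==0x2029)))

/* Hypothetical helper: number of horizontal-whitespace code points at
   the start of points[0..n-1]. */
size_t count_leading_hspace(const int *points, size_t n)
{
  size_t i = 0;
  while (i < n) {
    int c = points[i];  /* read once: the macros evaluate c several times */
    if (!u8_ishspace(c)) break;
    i++;
  }
  return i;
}

/* Hypothetical helper: number of vertical-whitespace code points
   anywhere in points[0..n-1]. */
size_t count_vspace(const int *points, size_t n)
{
  size_t count = 0;
  for (size_t i = 0; i < n; i++) {
    int c = points[i];
    if (u8_isvspace(c)) count++;
  }
  return count;
}
```

Going by the definitions as listed, U+200B falls just outside the horizontal range, and U+2029 is treated as vertical whitespace while U+2028 is not.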

#define u8_isxdigit (   c)    ((c>=0) && ((c<128) && (isxdigit(c))))

Returns 1 if its argument is an ASCII hex digit.
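
A minimal sketch of how the ASCII digit predicates might guard a conversion: the helper below maps a hex digit to its numeric value. The two macros are copied locally from the definitions on this page so the example builds without libu8; `hex_digit_value` is a hypothetical name introduced here, not a libu8 function.

```c
#include <ctype.h>

/* Local copies of the definitions documented on this page. */
#define u8_isxdigit(c) ((c>=0) && ((c<128) && (isxdigit(c))))
#define u8_isodigit(c) ((c>=0) && ((c<128) && (isdigit(c)) && (c<'8')))

/* Hypothetical helper: value (0..15) of an ASCII hex digit, or -1 when
   the code point is not one. Non-ASCII digits such as U+FF11 are
   rejected because the predicate only accepts code points below 128. */
int hex_digit_value(int c)
{
  if (!u8_isxdigit(c)) return -1;
  if (c <= '9') return c - '0';
  return (c | 0x20) - 'a' + 10;  /* fold 'A'..'F' onto 'a'..'f' */
}
```

The `c<128` check in the macros also keeps the argument within the range that the standard `isxdigit` and `isdigit` accept, so larger code points never reach them.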

#define u8_tolower (   c)
Value:
((u8_isupper(c)) ? \
((c<0x10000) ? (c+(u8_getchardata(c))) : (u8_lookup_chardata(c))) : \
(c))

Returns a non-uppercase version of a unicode code point.

Referenced by u8_downcase().

#define u8_toupper (   c)
Value:
((u8_islower(c)) ? \
((c<0x10000) ? (c+(u8_getchardata(c))) : (u8_lookup_chardata(c))) : \
(c))

Returns a non-lowercase version of a unicode code point.

Referenced by u8_upcase().

Function Documentation

U8_EXPORT u8_string u8_code2entity ( int  code)

Converts a code point into an XML entity name.

This falls back to hex codes if necessary.

Parameters
code: a unicode code point
Returns
a malloc'd UTF-8 (ASCII) name.

U8_EXPORT int u8_entity2code ( u8_string  name)

Converts an XML entity name into the corresponding code point.

This returns -1 for unrecognized or invalid entity names.

Parameters
name: a UTF-8 (ASCII) name.
Returns
a unicode code point or -1 on error.

Referenced by u8_parse_entity(), and u8_parse_entity_err().

U8_EXPORT int u8_parse_entity ( const u8_byte *  entity,
u8_string *  endp 
)

Parses a unicode entity name from a string, recording the endpoint.

This takes a pointer into a UTF-8 string (entity), positioned just after the entity escape character, the ampersand ('&'). It parses an entity name, returns the corresponding code point, and stores the end of the entity (the position after the trailing semicolon (';')) in endp. If endp is NULL, the end position is not stored. If the string does not point to a valid entity reference, -1 is returned.

Parameters
entity: a pointer into a UTF-8 string
endp: a pointer to a location to store the end of the entity
Returns
a unicode code point, or -1 if the string is not a valid entity reference

References u8_entity2code().

Referenced by u8_get_entity().
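
The calling convention described above (pointer just past the '&', code point returned, end position written through endp unless it is NULL, -1 on failure) can be sketched without libu8. The function below, `sketch_parse_entity`, is a hypothetical stand-in written for this example: it is not the library's implementation, it handles only numeric character references such as `#65;` and `#x41;`, and it uses plain `unsigned char` pointers in place of the u8_byte and u8_string types.

```c
#include <stddef.h>

/* Hypothetical stand-in that mimics the documented contract of
   u8_parse_entity for numeric references only. Named entities (for
   example "amp;") are NOT handled here and yield -1. */
int sketch_parse_entity(const unsigned char *entity,
                        const unsigned char **endp)
{
  const unsigned char *p = entity;
  int base = 10, code = 0, ndigits = 0;
  if (*p != '#') return -1;           /* only numeric references */
  p++;
  if (*p == 'x' || *p == 'X') { base = 16; p++; }
  for (;; p++) {
    int digit;
    if (*p >= '0' && *p <= '9') digit = *p - '0';
    else if (base == 16 && *p >= 'a' && *p <= 'f') digit = *p - 'a' + 10;
    else if (base == 16 && *p >= 'A' && *p <= 'F') digit = *p - 'A' + 10;
    else break;
    code = code * base + digit;
    if (code > 0x10FFFF) return -1;   /* beyond the Unicode range */
    ndigits++;
  }
  if (ndigits == 0 || *p != ';') return -1;
  if (endp != NULL) *endp = p + 1;    /* just past the trailing ';' */
  return code;
}
```

A caller scanning text would typically advance to the character after each '&', call the parser, and on success continue from the stored end position; on -1 it can treat the '&' as literal text.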

U8_EXPORT int u8_parse_entity_err ( const u8_byte *  entity,
u8_string *  endp 
)

Parses a unicode entity name from a string, recording the endpoint.

This version sets an error when an entity cannot be processed. It takes a pointer into a UTF-8 string (entity), positioned just after the entity escape character, the ampersand ('&'). It parses an entity name, returns the corresponding code point, and stores the end of the entity (the position after the trailing semicolon (';')) in endp. If endp is NULL, the end position is not stored.

Parameters
entity: a pointer into a UTF-8 string
endp: a pointer to a location to store the end of the entity
Returns
a unicode code point

References u8_entity2code(), and u8_seterr().

U8_EXPORT void u8_set_charinfo ( int  n,
unsigned char *  info,
short *  data 
)

Sets the character information for a particular code point.

Parameters
n: a Unicode code point
info: a string describing information about the character
data: a pointer to a short vector of data about the character
Returns
void

Referenced by u8_init_chardata_c().