28 Text processing library [text]

28.4 Text encodings identification [text.encoding]

28.4.2 Class text_encoding [text.encoding.class]

28.4.2.1 Overview [text.encoding.overview]

The class text_encoding describes an interface for accessing the IANA Character Sets registry[bib].
namespace std {
  struct text_encoding {
    static constexpr size_t max_name_length = 63;

    // [text.encoding.id], enumeration text_encoding::id
    enum class id : int_least32_t {
      see below
    };
    using enum id;

    constexpr text_encoding() = default;
    constexpr explicit text_encoding(string_view enc) noexcept;
    constexpr text_encoding(id i) noexcept;

    constexpr id mib() const noexcept;
    constexpr const char* name() const noexcept;

    struct aliases_view;
    constexpr aliases_view aliases() const noexcept;

    friend constexpr bool operator==(const text_encoding& a,
                                     const text_encoding& b) noexcept;
    friend constexpr bool operator==(const text_encoding& encoding, id i) noexcept;

    static consteval text_encoding literal() noexcept;
    static text_encoding environment();
    template<id i> static bool environment_is();

  private:
    id mib_ = id::unknown;                                          // exposition only
    char name_[max_name_length + 1] = {0};                          // exposition only
    static constexpr bool comp-name(string_view a, string_view b);  // exposition only
  };
}
Class text_encoding is a trivially copyable type ([basic.types.general]).

28.4.2.2 General [text.encoding.general]

A registered character encoding is a character encoding scheme in the IANA Character Sets registry.
[Note 1: 
The IANA Character Sets registry uses the term “character sets” to refer to character encodings.
— end note]
The primary name of a registered character encoding is the name of that encoding specified in the IANA Character Sets registry.
The set of known registered character encodings contains every registered character encoding specified in the IANA Character Sets registry except for the following:
  • NATS-DANO (33)
  • NATS-DANO-ADD (34)
Each known registered character encoding is identified by an enumerator in text_encoding​::​id, and has a set of zero or more aliases.
The set of aliases of a known registered character encoding is an implementation-defined superset of the aliases specified in the IANA Character Sets registry.
The set of aliases for US-ASCII includes “ASCII”.
No two aliases or primary names of distinct registered character encodings are equivalent when compared by text_encoding​::​comp-name.
How a text_encoding object is determined to be representative of a character encoding scheme implemented in the translation or execution environment is implementation-defined.
An object e of type text_encoding such that e.mib() == text_encoding​::​id​::​unknown is false and e.mib() == text_encoding​::​id​::​other is false maintains the following invariants:
  • e.name() == nullptr is false, and
  • e.mib() == text_encoding(e.name()).mib() is true.
Recommended practice:
  • Implementations should not consider registered encodings to be interchangeable.
    [Example 1: 
    Shift_JIS and Windows-31J denote different encodings.
    — end example]
  • Implementations should not use the name of a registered encoding to describe another similar yet different non-registered encoding unless there is a precedent on that implementation.
    [Example 2: 
    Big5
    — end example]

28.4.2.3 Members [text.encoding.members]

constexpr explicit text_encoding(string_view enc) noexcept;
Preconditions:
  • enc represents a string in the ordinary literal encoding consisting only of elements of the basic character set ([lex.charset]).
  • enc.size() <= max_name_length is true.
  • enc.contains('\0') is false.
Postconditions:
  • If there exists a primary name or alias a of a known registered character encoding such that comp-name(a, enc) is true, mib_ has the value of the enumerator of id associated with that registered character encoding.
    Otherwise, mib_ == id​::​other is true.
  • enc.compare(name_) == 0 is true.
constexpr text_encoding(id i) noexcept;
Preconditions: i has the value of one of the enumerators of id.
Postconditions:
  • mib_ == i is true.
  • If (mib_ == id​::​unknown || mib_ == id​::​other) is true, strlen(name_) == 0 is true.
    Otherwise, ranges​::​contains(aliases(), string_view(name_)) is true.
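The constructor postconditions above can be exercised with a short check. This is a minimal sketch, assuming a standard library that ships `<text_encoding>` and defines the feature-test macro `__cpp_lib_text_encoding`; the function name `constructor_postconditions_hold` and the name `"x-hypothetical-codec"` are illustrative and not part of the specification. When the library is unavailable, the function simply returns true.

```cpp
#include <cassert>
#include <string_view>

#if __has_include(<text_encoding>)
#include <text_encoding>
#endif

// Hypothetical helper: returns true if the observable constructor
// postconditions hold for a few sample inputs.
bool constructor_postconditions_hold() {
#ifdef __cpp_lib_text_encoding
    using id = std::text_encoding::id;

    const std::text_encoding by_alias("utf8");                   // matches "UTF-8" via comp-name
    const std::text_encoding by_id(id::UTF8);                    // name chosen by the implementation
    const std::text_encoding unregistered("x-hypothetical-codec");
    const std::text_encoding unknown(id::unknown);

    return by_alias.mib() == id::UTF8
        && std::string_view(by_alias.name()) == "utf8"           // the spelling passed in is kept
        && by_id.name() != nullptr                               // some alias of UTF-8
        && unregistered.mib() == id::other
        && std::string_view(unregistered.name()) == "x-hypothetical-codec"
        && unknown.name() == nullptr;                            // empty name_ is reported as nullptr
#else
    return true;  // library support not available; nothing to check
#endif
}

// Small demonstration entry point for the helper above.
void demo_constructors() {
    assert(constructor_postconditions_hold());
}
```

Note that the check does not assume which alias `text_encoding(id::UTF8).name()` returns, since the postcondition only requires it to be an element of `aliases()`.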
constexpr id mib() const noexcept;
Returns: mib_.
constexpr const char* name() const noexcept;
Returns: name_ if (name_[0] != '\0') is true, and nullptr otherwise.
Remarks: If name() == nullptr is false, name() is an ntbs and accessing elements of name_ outside of the range name() + [0, strlen(name()) + 1) is undefined behavior.
constexpr aliases_view aliases() const noexcept;
Let r denote an instance of aliases_view.
If *this represents a known registered character encoding, then:
  • r.front() is the primary name of the registered character encoding,
  • r contains the aliases of the registered character encoding, and
  • r does not contain duplicate values when compared with strcmp.
Otherwise, r is an empty range.
Each element in r is a non-null, non-empty ntbs encoded in the literal character encoding and comprising only characters from the basic character set.
Returns: r.
[Note 1: 
The order of aliases in r is unspecified.
— end note]
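Iterating the range returned by aliases() might look as follows. This is a sketch under the same assumption that `<text_encoding>` is available (guarded by `__cpp_lib_text_encoding`); the helper name `alias_names` is hypothetical, and the caller is assumed to satisfy the preconditions of the string_view constructor (at most max_name_length characters, no embedded null).

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

#if __has_include(<text_encoding>)
#include <text_encoding>
#endif

// Hypothetical helper: copies the aliases of the encoding named by enc_name
// into a vector. The result is empty for an unregistered name, or when the
// library does not provide std::text_encoding.
std::vector<std::string> alias_names(std::string_view enc_name) {
    std::vector<std::string> out;
#ifdef __cpp_lib_text_encoding
    const std::text_encoding enc(enc_name);
    // Each element is a non-null, non-empty const char*.
    for (const char* alias : enc.aliases()) {
        out.emplace_back(alias);
    }
#else
    static_cast<void>(enc_name);  // unused without library support
#endif
    return out;
}

// Small demonstration: an unregistered name has no aliases.
void demo_aliases() {
    assert(alias_names("x-hypothetical-codec").empty());
}
```

Only the first element has a specified position (the primary name); code that depends on the order of the remaining aliases would be relying on unspecified behavior.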
static consteval text_encoding literal() noexcept;
Mandates: CHAR_BIT == 8 is true.
Returns: A text_encoding object representing the ordinary character literal encoding ([lex.charset]).
static text_encoding environment();
Mandates: CHAR_BIT == 8 is true.
Returns: A text_encoding object representing the implementation-defined character encoding scheme of the environment.
On a POSIX implementation, this is the encoding scheme associated with the POSIX locale denoted by the empty string "".
[Note 2: 
This function is not affected by calls to setlocale.
— end note]
Recommended practice: Implementations should return a value that is not affected by calls to the POSIX function setenv and other functions which can modify the environment ([support.runtime]).
template<id i> static bool environment_is();
Mandates: CHAR_BIT == 8 is true.
Returns: environment() == i.
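The three static members literal(), environment(), and environment_is() can be combined as in the sketch below. It assumes library support signalled by `__cpp_lib_text_encoding`; the function name `describe_encodings` and the output format are illustrative only, and the text produced depends on the implementation and the environment.

```cpp
#include <cassert>
#include <string>

#if __has_include(<text_encoding>)
#include <text_encoding>
#endif

// Hypothetical helper: builds a short human-readable description of the
// literal and environment encodings.
std::string describe_encodings() {
#ifdef __cpp_lib_text_encoding
    // literal() is consteval, so the result is a compile-time constant.
    constexpr std::text_encoding lit = std::text_encoding::literal();
    // environment() is evaluated at run time.
    const std::text_encoding env = std::text_encoding::environment();

    // name() may return nullptr (for example when mib() is id::unknown).
    auto name_or_placeholder = [](const std::text_encoding& e) {
        const char* n = e.name();
        return std::string(n ? n : "(no name)");
    };

    std::string out = "literal=" + name_or_placeholder(lit)
                    + " environment=" + name_or_placeholder(env);
    if (std::text_encoding::environment_is<std::text_encoding::id::UTF8>()) {
        out += " (environment is UTF-8)";
    }
    return out;
#else
    return "std::text_encoding not available";
#endif
}

// Small demonstration: the description is never empty.
void demo_describe() {
    assert(!describe_encodings().empty());
}
```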
static constexpr bool comp-name(string_view a, string_view b);
Returns: true if the two strings a and b encoded in the ordinary literal encoding are equal, ignoring, from left-to-right,
  • all elements that are not digits or letters ([character.seq.general]),
  • character case, and
  • any sequence of one or more 0 characters not immediately preceded by a numeric prefix, where a numeric prefix is a sequence consisting of a digit in the range [1, 9] optionally followed by one or more elements which are not digits or letters,
and false otherwise.
[Note 3: 
This comparison is identical to the “Charset Alias Matching” algorithm described in the Unicode Technical Standard 22[bib].
— end note]
[Example 1: 
static_assert(comp-name("UTF-8", "utf8") == true);
static_assert(comp-name("u.t.f-008", "utf8") == true);
static_assert(comp-name("ut8", "utf8") == false);
static_assert(comp-name("utf-80", "utf8") == false);
— end example]
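One way to picture the comparison is to reduce both names to a normalized form and compare the results. The following is a standalone sketch, not the exposition-only function itself: the names `sketch::normalize` and `sketch::comp_name` are hypothetical, the code is not constexpr, and it assumes an ASCII-compatible ordinary literal encoding in which the letters are contiguous.

```cpp
#include <cassert>
#include <string>
#include <string_view>

namespace sketch {

// Character classification for an ASCII-compatible encoding (assumption).
inline bool is_digit(char c) { return c >= '0' && c <= '9'; }
inline bool is_lower(char c) { return c >= 'a' && c <= 'z'; }
inline bool is_upper(char c) { return c >= 'A' && c <= 'Z'; }

// Reduces a name to the characters that take part in the comparison:
// lowercase letters and digits, with zeros dropped unless a numeric prefix
// (a nonzero digit, possibly followed by ignored characters) precedes them.
inline std::string normalize(std::string_view s) {
    std::string out;
    bool after_numeric_prefix = false;
    for (char c : s) {
        if (is_digit(c)) {
            if (c == '0' && !after_numeric_prefix) {
                continue;  // leading zero of a number: ignored
            }
            out.push_back(c);
            after_numeric_prefix = true;
        } else if (is_lower(c)) {
            out.push_back(c);
            after_numeric_prefix = false;
        } else if (is_upper(c)) {
            out.push_back(static_cast<char>(c - 'A' + 'a'));  // ignore case
            after_numeric_prefix = false;
        }
        // Any other character is ignored and leaves the prefix state unchanged.
    }
    return out;
}

inline bool comp_name(std::string_view a, std::string_view b) {
    return normalize(a) == normalize(b);
}

// Re-checks the four cases from Example 1 at run time.
inline void check_examples() {
    assert(comp_name("UTF-8", "utf8"));
    assert(comp_name("u.t.f-008", "utf8"));
    assert(!comp_name("ut8", "utf8"));
    assert(!comp_name("utf-80", "utf8"));
}

}  // namespace sketch
```

A real implementation would more likely walk both strings in lockstep to avoid allocation and to remain usable in constant expressions; building normalized copies is chosen here only because it makes the three ignore rules easy to see.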

28.4.2.4 Comparison functions [text.encoding.cmp]

friend constexpr bool operator==(const text_encoding& a, const text_encoding& b) noexcept;
Returns: If a.mib_ == id::other && b.mib_ == id::other is true, then comp-name(a.name_, b.name_).
Otherwise, a.mib_ == b.mib_.
friend constexpr bool operator==(const text_encoding& encoding, id i) noexcept;
Returns: encoding.mib_ == i.
Remarks: This operator induces an equivalence relation on its arguments if and only if i != id​::​other is true.
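The remark about id::other can be illustrated with a toy model of the two operators. This sketch deliberately does not use std::text_encoding: `toy_id` and `toy_encoding` are hypothetical stand-ins, and plain string equality is used in place of comp-name.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Hypothetical miniature of text_encoding::id.
enum class toy_id { other = 1, unknown = 2, UTF8 = 106 };

// Hypothetical miniature of text_encoding (mib_ and name_ only).
struct toy_encoding {
    toy_id mib;
    std::string name;
};

// Mirrors the specified operator==(a, b): names matter only when both are "other".
inline bool operator==(const toy_encoding& a, const toy_encoding& b) {
    if (a.mib == toy_id::other && b.mib == toy_id::other) {
        return a.name == b.name;  // stand-in for comp-name
    }
    return a.mib == b.mib;
}

// Mirrors the specified operator==(encoding, i): compares the mib only.
inline bool operator==(const toy_encoding& e, toy_id i) {
    return e.mib == i;
}

inline void demo() {
    const toy_encoding foo{toy_id::other, "x-foo"};
    const toy_encoding bar{toy_id::other, "x-bar"};
    assert(foo == toy_id::other);  // both objects compare equal to id other ...
    assert(bar == toy_id::other);
    assert(!(foo == bar));         // ... yet they are not equal to each other
}

}  // namespace sketch
```

Two unregistered encodings both compare equal to the other enumerator while comparing unequal to each other, so comparison against that enumerator cannot be used to partition encodings into equivalence classes.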

28.4.2.5 Class text_encoding​::​aliases_view [text.encoding.aliases]

struct text_encoding::aliases_view : ranges::view_interface<text_encoding::aliases_view> {
  constexpr implementation-defined begin() const;
  constexpr implementation-defined end() const;
};
text_encoding​::​aliases_view models copyable, ranges​::​view, ranges​::​random_access_range, and ranges​::​borrowed_range.
[Note 1: 
text_encoding​::​aliases_view is not required to satisfy ranges​::​common_range, nor default_initializable.
— end note]
Both ranges​::​range_value_t<text_encoding​::​aliases_view> and ranges​::​range_reference_t<text_encoding​::​aliases_view> denote const char*.
ranges​::​iterator_t<text_encoding​::​aliases_view> is a constexpr iterator ([iterator.requirements.general]).

28.4.2.6 Enumeration text_encoding​::​id [text.encoding.id]

namespace std {
  enum class text_encoding::id : int_least32_t {
    other = 1, unknown = 2, ASCII = 3, ISOLatin1 = 4, ISOLatin2 = 5,
    ISOLatin3 = 6, ISOLatin4 = 7, ISOLatinCyrillic = 8, ISOLatinArabic = 9,
    ISOLatinGreek = 10, ISOLatinHebrew = 11, ISOLatin5 = 12, ISOLatin6 = 13,
    ISOTextComm = 14, HalfWidthKatakana = 15, JISEncoding = 16, ShiftJIS = 17,
    EUCPkdFmtJapanese = 18, EUCFixWidJapanese = 19, ISO4UnitedKingdom = 20,
    ISO11SwedishForNames = 21, ISO15Italian = 22, ISO17Spanish = 23,
    ISO21German = 24, ISO60DanishNorwegian = 25, ISO69French = 26,
    ISO10646UTF1 = 27, ISO646basic1983 = 28, INVARIANT = 29,
    ISO2IntlRefVersion = 30, NATSSEFI = 31, NATSSEFIADD = 32,
    ISO10Swedish = 35, KSC56011987 = 36, ISO2022KR = 37, EUCKR = 38,
    ISO2022JP = 39, ISO2022JP2 = 40, ISO13JISC6220jp = 41,
    ISO14JISC6220ro = 42, ISO16Portuguese = 43, ISO18Greek7Old = 44,
    ISO19LatinGreek = 45, ISO25French = 46, ISO27LatinGreek1 = 47,
    ISO5427Cyrillic = 48, ISO42JISC62261978 = 49, ISO47BSViewdata = 50,
    ISO49INIS = 51, ISO50INIS8 = 52, ISO51INISCyrillic = 53,
    ISO54271981 = 54, ISO5428Greek = 55, ISO57GB1988 = 56,
    ISO58GB231280 = 57, ISO61Norwegian2 = 58, ISO70VideotexSupp1 = 59,
    ISO84Portuguese2 = 60, ISO85Spanish2 = 61, ISO86Hungarian = 62,
    ISO87JISX0208 = 63, ISO88Greek7 = 64, ISO89ASMO449 = 65, ISO90 = 66,
    ISO91JISC62291984a = 67, ISO92JISC62991984b = 68,
    ISO93JIS62291984badd = 69, ISO94JIS62291984hand = 70,
    ISO95JIS62291984handadd = 71, ISO96JISC62291984kana = 72, ISO2033 = 73,
    ISO99NAPLPS = 74, ISO102T617bit = 75, ISO103T618bit = 76,
    ISO111ECMACyrillic = 77, ISO121Canadian1 = 78, ISO122Canadian2 = 79,
    ISO123CSAZ24341985gr = 80, ISO88596E = 81, ISO88596I = 82,
    ISO128T101G2 = 83, ISO88598E = 84, ISO88598I = 85,
    ISO139CSN369103 = 86, ISO141JUSIB1002 = 87, ISO143IECP271 = 88,
    ISO146Serbian = 89, ISO147Macedonian = 90, ISO150 = 91, ISO151Cuba = 92,
    ISO6937Add = 93, ISO153GOST1976874 = 94, ISO8859Supp = 95,
    ISO10367Box = 96, ISO158Lap = 97, ISO159JISX02121990 = 98,
    ISO646Danish = 99, USDK = 100, DKUS = 101, KSC5636 = 102,
    Unicode11UTF7 = 103, ISO2022CN = 104, ISO2022CNEXT = 105, UTF8 = 106,
    ISO885913 = 109, ISO885914 = 110, ISO885915 = 111, ISO885916 = 112,
    GBK = 113, GB18030 = 114, OSDEBCDICDF0415 = 115, OSDEBCDICDF03IRV = 116,
    OSDEBCDICDF041 = 117, ISO115481 = 118, KZ1048 = 119,
    UCS2 = 1000, UCS4 = 1001, UnicodeASCII = 1002, UnicodeLatin1 = 1003,
    UnicodeJapanese = 1004, UnicodeIBM1261 = 1005, UnicodeIBM1268 = 1006,
    UnicodeIBM1276 = 1007, UnicodeIBM1264 = 1008, UnicodeIBM1265 = 1009,
    Unicode11 = 1010, SCSU = 1011, UTF7 = 1012, UTF16BE = 1013,
    UTF16LE = 1014, UTF16 = 1015, CESU8 = 1016, UTF32 = 1017,
    UTF32BE = 1018, UTF32LE = 1019, BOCU1 = 1020, UTF7IMAP = 1021,
    Windows30Latin1 = 2000, Windows31Latin1 = 2001, Windows31Latin2 = 2002,
    Windows31Latin5 = 2003, HPRoman8 = 2004, AdobeStandardEncoding = 2005,
    VenturaUS = 2006, VenturaInternational = 2007, DECMCS = 2008,
    PC850Multilingual = 2009, PC8DanishNorwegian = 2012,
    PC862LatinHebrew = 2013, PC8Turkish = 2014, IBMSymbols = 2015,
    IBMThai = 2016, HPLegal = 2017, HPPiFont = 2018, HPMath8 = 2019,
    HPPSMath = 2020, HPDesktop = 2021, VenturaMath = 2022,
    MicrosoftPublishing = 2023, Windows31J = 2024, GB2312 = 2025,
    Big5 = 2026, Macintosh = 2027, IBM037 = 2028, IBM038 = 2029,
    IBM273 = 2030, IBM274 = 2031, IBM275 = 2032, IBM277 = 2033,
    IBM278 = 2034, IBM280 = 2035, IBM281 = 2036, IBM284 = 2037,
    IBM285 = 2038, IBM290 = 2039, IBM297 = 2040, IBM420 = 2041,
    IBM423 = 2042, IBM424 = 2043, PC8CodePage437 = 2011, IBM500 = 2044,
    IBM851 = 2045, PCp852 = 2010, IBM855 = 2046, IBM857 = 2047,
    IBM860 = 2048, IBM861 = 2049, IBM863 = 2050, IBM864 = 2051,
    IBM865 = 2052, IBM868 = 2053, IBM869 = 2054, IBM870 = 2055,
    IBM871 = 2056, IBM880 = 2057, IBM891 = 2058, IBM903 = 2059,
    IBM904 = 2060, IBM905 = 2061, IBM918 = 2062, IBM1026 = 2063,
    IBMEBCDICATDE = 2064, EBCDICATDEA = 2065, EBCDICCAFR = 2066,
    EBCDICDKNO = 2067, EBCDICDKNOA = 2068, EBCDICFISE = 2069,
    EBCDICFISEA = 2070, EBCDICFR = 2071, EBCDICIT = 2072, EBCDICPT = 2073,
    EBCDICES = 2074, EBCDICESA = 2075, EBCDICESS = 2076, EBCDICUK = 2077,
    EBCDICUS = 2078, Unknown8BiT = 2079, Mnemonic = 2080, Mnem = 2081,
    VISCII = 2082, VIQR = 2083, KOI8R = 2084, HZGB2312 = 2085,
    IBM866 = 2086, PC775Baltic = 2087, KOI8U = 2088, IBM00858 = 2089,
    IBM00924 = 2090, IBM01140 = 2091, IBM01141 = 2092, IBM01142 = 2093,
    IBM01143 = 2094, IBM01144 = 2095, IBM01145 = 2096, IBM01146 = 2097,
    IBM01147 = 2098, IBM01148 = 2099, IBM01149 = 2100, Big5HKSCS = 2101,
    IBM1047 = 2102, PTCP154 = 2103, Amiga1251 = 2104, KOI7switched = 2105,
    BRF = 2106, TSCII = 2107, CP51932 = 2108, windows874 = 2109,
    windows1250 = 2250, windows1251 = 2251, windows1252 = 2252,
    windows1253 = 2253, windows1254 = 2254, windows1255 = 2255,
    windows1256 = 2256, windows1257 = 2257, windows1258 = 2258,
    TIS620 = 2259, CP50220 = 2260
  };
}
[Note 1: 
The text_encoding​::​id enumeration contains an enumerator for each known registered character encoding.
For each encoding, the corresponding enumerator is derived from the alias beginning with “cs”, as follows:
  • csUnicode is mapped to text_encoding::id::UCS2,
  • csIBBM904 is mapped to text_encoding::id::IBM904, and
  • the “cs” prefix is removed from other names.
— end note]

28.4.2.7 Hash support [text.encoding.hash]

template<> struct hash<text_encoding>;
The specialization is enabled ([unord.hash]).
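Because the specialization is enabled, text_encoding objects can serve as keys in unordered containers. A minimal sketch, assuming library support signalled by `__cpp_lib_text_encoding`; the function name `distinct_encodings` is hypothetical, and it returns 0 when the library is unavailable.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>

#if __has_include(<text_encoding>)
#include <text_encoding>
#endif

// Hypothetical helper: counts distinct encodings among three sample names.
std::size_t distinct_encodings() {
#ifdef __cpp_lib_text_encoding
    std::unordered_set<std::text_encoding> seen;
    seen.insert(std::text_encoding("UTF-8"));
    seen.insert(std::text_encoding("utf8"));   // same mib as the entry above, so not inserted
    seen.insert(std::text_encoding("ASCII"));  // "ASCII" is required to be an alias of US-ASCII
    return seen.size();
#else
    return 0;  // library support not available
#endif
}

// Small demonstration: either the library is missing or two keys remain.
void demo_hash() {
    const std::size_t n = distinct_encodings();
    assert(n == 0 || n == 2);
}
```

Equality, not spelling, decides key identity here: the two UTF-8 objects keep different name() strings but compare equal because their mib() values match.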