UTF8-CPP: UTF-8 with C++ in a Portable Way

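Introduction

C++ developers still lack an easy and portable way of handling Unicode encoded strings. The original C++ standard (known as C++98 or C++03) is Unicode agnostic. Some progress has been made in later editions of the standard, but it is still hard to work with Unicode using only the standard facilities.

I came up with a small, C++98 compatible generic library for handling UTF-8 encoded strings. For anybody used to working with STL algorithms and iterators, it should be easy and natural to use. The code is freely available for any purpose; check out the license. The library has been used a lot since its first release in 2006, in both commercial and open-source projects, and has proved to be stable and useful.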

UTF8-CPP: UTF-8 with C++ in a Portable Way
- Introduction
- Installation
- Examples of use
  - Introductory Sample
  - Checking if a file contains valid UTF-8 text
  - Ensure that a string contains valid UTF-8 text
- Points of interest
  - Design goals and decisions
  - Alternatives
- Reference
  - Functions From utf8 Namespace
    - utf8::append
      - octet_iterator append(utfchar32_t cp, octet_iterator result)
      - void append(utfchar32_t cp, std::string& s)
    - utf8::append16
      - word_iterator append16(utfchar32_t cp, word_iterator result)
      - void append(utfchar32_t cp, std::u16string& s)
    - utf8::next
    - utf8::next16
    - utf8::peek_next
    - utf8::prior
    - utf8::advance
    - utf8::distance
    - utf8::utf16to8
      - octet_iterator utf16to8(u16bit_iterator start, u16bit_iterator end, octet_iterator result)
      - std::string utf16to8(const std::u16string& s)
      - std::string utf16to8(std::u16string_view s)
    - utf8::utf16tou8
      - std::u8string utf16tou8(const std::u16string& s)
      - std::u8string utf16tou8(const std::u16string_view& s)
    - utf8::utf8to16
      - u16bit_iterator utf8to16(octet_iterator start, octet_iterator end, u16bit_iterator result)
      - std::u16string utf8to16(const std::string& s)
      - std::u16string utf8to16(std::string_view s)
      - std::u16string utf8to16(std::u8string& s)
      - std::u16string utf8to16(std::u8string_view& s)
    - utf8::utf32to8
      - octet_iterator utf32to8(u32bit_iterator start, u32bit_iterator end, octet_iterator result)
      - std::string utf32to8(const std::u32string& s)
      - std::u8string utf32to8(const std::u32string& s)
      - std::u8string utf32to8(const std::u32string_view& s)
      - std::string utf32to8(std::u32string_view s)
    - utf8::utf8to32
      - u32bit_iterator utf8to32
      - std::u32string utf8to32(const std::u8string& s)
      - std::u32string utf8to32(const std::u8string_view& s)
      - std::u32string utf8to32(const std::string& s)
      - std::u32string utf8to32(std::string_view s)
    - utf8::find_invalid
      - octet_iterator find_invalid(octet_iterator start, octet_iterator end)
      - const char* find_invalid(const char* str)
      - std::size_t find_invalid(const std::string& s)
      - std::size_t find_invalid(std::string_view s)
    - utf8::is_valid
      - bool is_valid(octet_iterator start, octet_iterator end)
      - bool is_valid(const char* str)
      - bool is_valid(const std::string& s)
      - bool is_valid(std::string_view s)
    - utf8::replace_invalid
      - output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out, utfchar32_t replacement)
      - std::string replace_invalid(const std::string& s, utfchar32_t replacement)
      - std::string replace_invalid(std::string_view s, char32_t replacement)
    - utf8::starts_with_bom
      - bool starts_with_bom(octet_iterator it, octet_iterator end)
      - bool starts_with_bom(const std::string& s)
      - bool starts_with_bom(std::string_view s)
  - Types From utf8 Namespace
    - utf8::exception
    - utf8::invalid_code_point
    - utf8::invalid_utf8
    - utf8::invalid_utf16
    - utf8::not_enough_room
    - utf8::iterator
      - Member functions
  - Functions From utf8::unchecked Namespace
    - utf8::unchecked::append
    - utf8::unchecked::append16
    - utf8::unchecked::next
    - utf8::next16
    - utf8::unchecked::peek_next
    - utf8::unchecked::prior
    - utf8::unchecked::advance
    - utf8::unchecked::distance
    - utf8::unchecked::utf16to8
    - utf8::unchecked::utf8to16
    - utf8::unchecked::utf32to8
    - utf8::unchecked::utf8to32
    - utf8::unchecked::replace_invalid
  - Types From utf8::unchecked Namespace
    - utf8::iterator
      - Member functions

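Installation

This is a header-only library, and the supported way of deploying it is:

- Download a release from https://github.com/nemtrif/utfcpp/releases into a temporary directory.
- Unzip the release.
- Copy the content of the utfcpp/source directory into the directory where you keep the include files for your project.

The CMakeLists.txt file was originally made for testing purposes only, but unfortunately over time I accepted contributions that added an install target. This is not a supported way of installing the utfcpp library, and I am considering removing CMakeLists.txt in a future release.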

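Examples of use

Introductory Sample

To illustrate the use of the library, let's start with a small but complete program that opens a file containing UTF-8 encoded text, reads it line by line, checks each line for invalid UTF-8 byte sequences, converts it to UTF-16 encoding, and then converts it back to UTF-8: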

#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include "utf8.h"
using namespace std;
int main(int argc, char** argv)
{
    if (argc != 2) {
        cout << "\nUsage: docsample filename\n";
        return 0;
    }
    const char* test_file_path = argv[1];
    // Open the test file (must be UTF-8 encoded)
    ifstream fs8(test_file_path);
    if (!fs8.is_open()) {
        cout << "Could not open " << test_file_path << endl;
        return 0;
    }

    unsigned line_count = 1;
    string line;
    // Play with all the lines in the file
    while (getline(fs8, line)) {
        // check for invalid utf-8 (for a simple yes/no check, there is also utf8::is_valid function)
#if __cplusplus >= 201103L // C++ 11 or later
        auto end_it = utf8::find_invalid(line.begin(), line.end());
#else
        string::iterator end_it = utf8::find_invalid(line.begin(), line.end());
#endif // C++ 11
        if (end_it != line.end()) {
            cout << "Invalid UTF-8 encoding detected at line " << line_count << "\n";
            cout << "This part is fine: " << string(line.begin(), end_it) << "\n";
        }
        // Get the line length (at least for the valid part)
        int length = utf8::distance(line.begin(), end_it);
        cout << "Length of line " << line_count << " is " << length << "\n";

        // Convert it to utf-16
#if __cplusplus >= 201103L // C++ 11 or later
        u16string utf16line = utf8::utf8to16(line);
#else
        vector<unsigned short> utf16line;
        utf8::utf8to16(line.begin(), end_it, back_inserter(utf16line));
#endif // C++ 11
        // And back to utf-8;
#if __cplusplus >= 201103L // C++ 11 or later
        string utf8line = utf8::utf16to8(utf16line);
#else
        string utf8line;
        utf8::utf16to8(utf16line.begin(), utf16line.end(), back_inserter(utf8line));
#endif // C++ 11
        // Confirm that the conversion went OK:
        if (utf8line != string(line.begin(), end_it))
            cout << "Error in UTF-16 conversion at line: " << line_count << "\n";

        line_count++;
    }

    return 0;
}

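In the previous code sample, for each line we detected invalid UTF-8 sequences with find_invalid. The number of characters in each line (more precisely, the number of Unicode code points, including the end of line and even the BOM if there is one) was determined with utf8::distance. Finally, we converted each line to UTF-16 encoding with utf8to16 and back to UTF-8 with utf16to8.

Note the different usage pattern for old compilers. For instance, this is how we convert a UTF-8 encoded string to a UTF-16 encoded one with a pre-C++11 compiler:

    vector<unsigned short> utf16line;
    utf8::utf8to16(line.begin(), end_it, back_inserter(utf16line));

With a more modern compiler, the same operation would look like:

    u16string utf16line = utf8::utf8to16(line);

If the __cplusplus macro points to C++11 or later, the library exposes an API that takes into account C++ standard Unicode strings and move semantics. With an older compiler, it is still possible to use the same functionality, just in a slightly less convenient way.

In case you do not trust the __cplusplus macro or, for instance, do not want to include the C++11 helper functions even with a modern compiler, define the UTF_CPP_CPLUSPLUS macro before including utf8.h and assign it a value for the standard you want to use; the values are the same as for the __cplusplus macro. This can also be useful with compilers that are conservative in setting the __cplusplus macro even if they have good support for a recent standard edition; Microsoft's Visual C++ is one example.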
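The override mechanism described above follows a common preprocessor pattern, which can be sketched in isolation. The macro SKETCH_CPLUSPLUS and the function sketch_has_cpp11_api are hypothetical names made up for this illustration; how utf8.h consumes UTF_CPP_CPLUSPLUS internally is an assumption here, not something this document specifies.

```cpp
#include <cassert>

// Hypothetical names, for illustration only. With utfcpp the macro you would
// define before including utf8.h is UTF_CPP_CPLUSPLUS.

// Step 1: the user pins the standard level before the header is seen.
// 199711L is the __cplusplus value for C++98/C++03.
#define SKETCH_CPLUSPLUS 199711L

// Step 2: a header following this pattern falls back to the value reported
// by the compiler only when the user has not supplied one.
#ifndef SKETCH_CPLUSPLUS
#define SKETCH_CPLUSPLUS __cplusplus
#endif

// Step 3: the header selects its API surface from the resulting value.
#if SKETCH_CPLUSPLUS >= 201103L
inline bool sketch_has_cpp11_api() { return true; }
#else
inline bool sketch_has_cpp11_api() { return false; }
#endif
```

Because the user-supplied value wins, the C++11 branch is disabled here even when the translation unit is compiled as C++11 or later.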

Checking if a file contains valid UTF-8 text

Here is a function that checks whether the content of a file is valid UTF-8 encoded text without reading the content into memory:

bool valid_utf8_file(const char* file_name)
{
    ifstream ifs(file_name);
    if (!ifs)
        return false; // even better, throw here

    istreambuf_iterator<char> it(ifs.rdbuf());
    istreambuf_iterator<char> eos;

    return utf8::is_valid(it, eos);
}

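Because the function utf8::is_valid() works with input iterators, we were able to pass an istreambuf_iterator as it and read the content of the file directly, without loading it into memory first.

Note that other functions that take input iterator arguments can be used in a similar way. For instance, to read the content of a UTF-8 encoded text file and convert the text to UTF-16, just do something like:

    utf8::utf8to16(it, eos, back_inserter(u16string));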

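Ensure that a string contains valid UTF-8 text

If we have some text that "probably" contains UTF-8 encoded text and we want to replace any invalid UTF-8 sequence with a replacement character, something like the following function may be used: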

void fix_utf8_string(std::string& str)
{
    std::string temp;
    utf8::replace_invalid(str.begin(), str.end(), back_inserter(temp));
    str = temp;
}

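The function will replace any invalid UTF-8 sequence with a Unicode replacement character. There is an overloaded function that enables the caller to supply their own replacement character.

Points of interest

Design goals and decisions

The library was designed to be:

- Generic: for better or worse, there are many C++ string classes out there, and the library should work with as many of them as possible.
- Portable: the library should be portable across different platforms and compilers. The only non-portable code is a small section that declares unsigned integers of different sizes: three typedefs. They can be changed by the users of the library if they don't match their platform. The default setting should work for Windows (both 32 and 64 bit) and most 32 bit and 64 bit Unix derivatives. Support for post-C++03 language features is included for modern compilers at the API level only, so the library should work even with pretty old compilers.
- Lightweight: follow the "pay only for what you use" guideline.
- Unintrusive: avoid forcing any particular design or even programming style on the user. This is a library, not a framework.

Alternatives

For alternatives and comparisons, I recommend the following article: The Wonderfully Terrible World of C and C++ Encoding APIs (with Some Rust), by JeanHeyd Meneide. In the article, this library is compared with:

- simdutf
- iconv
- boost.text
- ICU
- encoding_rs
- Windows API functions for converting text between encodings
- ztd.text

The article presents the author's view of the quality of the API design, but also some speed benchmarks.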

Reference

Functions From utf8 Namespace

utf8::append

octet_iterator append(utfchar32_t cp, octet_iterator result)

Available in version 1.0 and later.

Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence to a UTF-8 string.

template <typename octet_iterator>
octet_iterator append(utfchar32_t cp, octet_iterator result);

octet_iterator: an output iterator.
cp: a 32 bit integer representing the code point to append to the sequence.
result: an output iterator to the place in the sequence where the code point is appended.
Return value: an iterator pointing to the place after the newly appended sequence.

Example of use:

unsigned char u[5] = {0, 0, 0, 0, 0};
unsigned char* end = append(0x0448, u);
assert(u[0] == 0xd1 && u[1] == 0x88 && u[2] == 0 && u[3] == 0 && u[4] == 0);

Note that append does not allocate any memory; it is the burden of the caller to make sure there is enough memory allocated for the operation. To make things more interesting, append can add anywhere between 1 and 4 octets to the sequence. In practice, you would most often want to use std::back_inserter to ensure that the necessary memory is allocated.

In case of an invalid code point, a utf8::invalid_code_point exception is thrown.
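To make the "between 1 and 4 octets" remark concrete, here is a minimal standalone sketch of how a code point maps onto UTF-8 octets. It is not the library's implementation: encode_utf8_sketch and encode_to_string_sketch are hypothetical helpers written for this illustration, and they report errors with std::invalid_argument rather than utf8::invalid_code_point.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper, NOT part of utfcpp: writes the UTF-8 encoding of one
// code point through an output iterator and returns the advanced iterator.
template <typename OutputIt>
OutputIt encode_utf8_sketch(std::uint32_t cp, OutputIt out)
{
    // Surrogates and values above U+10FFFF are not valid code points.
    if (cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu))
        throw std::invalid_argument("invalid code point");

    if (cp < 0x80u) {                 // 1 octet:  0xxxxxxx
        *out++ = static_cast<char>(cp);
    } else if (cp < 0x800u) {         // 2 octets: 110xxxxx 10xxxxxx
        *out++ = static_cast<char>(0xC0u | (cp >> 6));
        *out++ = static_cast<char>(0x80u | (cp & 0x3Fu));
    } else if (cp < 0x10000u) {       // 3 octets: 1110xxxx 10xxxxxx 10xxxxxx
        *out++ = static_cast<char>(0xE0u | (cp >> 12));
        *out++ = static_cast<char>(0x80u | ((cp >> 6) & 0x3Fu));
        *out++ = static_cast<char>(0x80u | (cp & 0x3Fu));
    } else {                          // 4 octets: 11110xxx followed by three 10xxxxxx
        *out++ = static_cast<char>(0xF0u | (cp >> 18));
        *out++ = static_cast<char>(0x80u | ((cp >> 12) & 0x3Fu));
        *out++ = static_cast<char>(0x80u | ((cp >> 6) & 0x3Fu));
        *out++ = static_cast<char>(0x80u | (cp & 0x3Fu));
    }
    return out;
}

// Convenience wrapper: std::back_inserter grows the string as needed, so the
// caller does not have to reserve 1 to 4 octets in advance.
inline std::string encode_to_string_sketch(std::uint32_t cp)
{
    std::string s;
    encode_utf8_sketch(cp, std::back_inserter(s));
    return s;
}
```

The wrapper shows why std::back_inserter is the convenient choice: the number of octets written depends on the code point, and the container absorbs that variation.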

void append(utfchar32_t cp, std::string& s)

Available in version 3.0 and later. Prior to 4.0 it required a C++11 compiler; the requirement is lifted with 4.0.

Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence to a UTF-8 string.

void append(utfchar32_t cp, std::string& s);

cp: a code point to append to the string.
s: a UTF-8 encoded string to append the code point to.

Example of use:

std::string u;
append(0x0448, u);
assert(u[0] == char(0xd1) && u[1] == char(0x88) && u.length() == 2);

In case of an invalid code point, a utf8::invalid_code_point exception is thrown.

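utf8::append16

word_iterator append16(utfchar32_t cp, word_iterator result)

Available in version 4.0 and later.

Encodes a 32 bit code point as a UTF-16 sequence of words and appends the sequence to a UTF-16 string.

template <typename word_iterator>
word_iterator append16(utfchar32_t cp, word_iterator result);

word_iterator: an output iterator.
cp: a 32 bit integer representing the code point to append to the sequence.
result: an output iterator to the place in the sequence where the code point is appended.
Return value: an iterator pointing to the place after the newly appended sequence.

Example of use:

unsigned short u[2] = {0, 0};
unsigned short* end = append16(0x0448, u);
assert(u[0] == 0x0448 && u[1] == 0);

Note that append16 does not allocate any memory; it is the burden of the caller to make sure there is enough memory allocated for the operation. To make things more interesting, append16 can add either one or two words to the sequence. In practice, you would most often want to use std::back_inserter to ensure that the necessary memory is allocated.

In case of an invalid code point, a utf8::invalid_code_point exception is thrown.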
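The "one or two words" behaviour comes from UTF-16 surrogate pairs. The following standalone sketch shows the arithmetic; encode_utf16_sketch and encode_utf16_to_vector_sketch are hypothetical helpers for this illustration, not the library's code, and they use std::invalid_argument where the library would throw utf8::invalid_code_point.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <stdexcept>
#include <vector>

// Hypothetical helper, NOT part of utfcpp: writes the UTF-16 encoding of one
// code point through an output iterator and returns the advanced iterator.
template <typename OutputIt>
OutputIt encode_utf16_sketch(std::uint32_t cp, OutputIt out)
{
    if (cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu))
        throw std::invalid_argument("invalid code point");

    if (cp < 0x10000u) {
        // Basic Multilingual Plane: the code point is stored as a single word.
        *out++ = static_cast<std::uint16_t>(cp);
    } else {
        // Supplementary planes: split the 20 bit offset into a lead and a
        // trail surrogate, 10 bits each.
        const std::uint32_t offset = cp - 0x10000u;
        *out++ = static_cast<std::uint16_t>(0xD800u + (offset >> 10));
        *out++ = static_cast<std::uint16_t>(0xDC00u + (offset & 0x3FFu));
    }
    return out;
}

// Convenience wrapper returning the encoded words in a vector.
inline std::vector<std::uint16_t> encode_utf16_to_vector_sketch(std::uint32_t cp)
{
    std::vector<std::uint16_t> words;
    encode_utf16_sketch(cp, std::back_inserter(words));
    return words;
}
```

U+0448 fits in one word, while U+1D11E (the code point behind the 0xd834, 0xdd1e pair used in the conversion examples) needs two.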

void append(utfchar32_t cp, std::u16string& s)

Available in version 4.0 and later. Requires a C++11 compliant compiler.

Encodes a 32 bit code point as a UTF-16 sequence of words and appends the sequence to a UTF-16 string.

void append(utfchar32_t cp, std::u16string& s);

cp: a code point to append to the string.
s: a UTF-16 encoded string to append the code point to.

Example of use:

std::u16string u;
append(0x0448, u);
assert(u[0] == 0x0448 && u.length() == 1);

In case of an invalid code point, a utf8::invalid_code_point exception is thrown.

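utf8::next

Available in version 1.0 and later.

Given the iterator to the beginning of a UTF-8 sequence, it returns the code point and moves the iterator to the next position.

template <typename octet_iterator>
utfchar32_t next(octet_iterator& it, octet_iterator end);

octet_iterator: an input iterator.
it: a reference to an iterator pointing to the beginning of a UTF-8 encoded code point. After the function returns, it is incremented to point to the beginning of the next code point.
end: end of the UTF-8 sequence to be processed. If it becomes equal to end during the extraction of a code point, a utf8::not_enough_room exception is thrown.
Return value: the 32 bit representation of the processed UTF-8 code point.

Example of use:

const char* twochars = "\xe6\x97\xa5\xd1\x88";
const char* w = twochars;
int cp = next(w, twochars + 6);
assert(cp == 0x65e5);
assert(w == twochars + 3);

This function is typically used to iterate through a UTF-8 encoded string.

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown.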

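utf8::next16

Available in version 4.0 and later.

Given the iterator to the beginning of a UTF-16 sequence, it returns the code point and moves the iterator to the next position.

template <typename word_iterator>
utfchar32_t next16(word_iterator& it, word_iterator end);

word_iterator: an input iterator.
it: a reference to an iterator pointing to the beginning of a UTF-16 encoded code point. After the function returns, it is incremented to point to the beginning of the next code point.
end: end of the UTF-16 sequence to be processed. If it becomes equal to end during the extraction of a code point, a utf8::not_enough_room exception is thrown.
Return value: the 32 bit representation of the processed UTF-16 code point.

Example of use:

const unsigned short u[3] = {0x65e5, 0xd800, 0xdf46};
const unsigned short* w = u;
int cp = next16(w, w + 3);
assert(cp == 0x65e5);
assert(w == u + 1);

This function is typically used to iterate through a UTF-16 encoded string.

In case of an invalid UTF-16 sequence, a utf8::invalid_utf8 exception is thrown.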
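As a rough illustration of what decoding one UTF-16 code point involves, here is a standalone sketch. It is written under stated assumptions: decode_utf16_sketch and decode_first_utf16_sketch are hypothetical helpers, they work on a std::vector with an index instead of on iterators, and they throw standard library exceptions instead of the utf8 exception types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper, NOT part of utfcpp: decodes the code point starting at
// words[pos] and advances pos past the one or two words it consumed.
inline std::uint32_t decode_utf16_sketch(const std::vector<std::uint16_t>& words,
                                         std::size_t& pos)
{
    if (pos >= words.size())
        throw std::out_of_range("no UTF-16 word left to decode");

    const std::uint32_t first = words[pos++];
    if (first < 0xD800u || first > 0xDFFFu)
        return first;  // not a surrogate: the word is the code point itself

    if (first >= 0xDC00u)
        throw std::runtime_error("trail surrogate without a lead surrogate");
    if (pos >= words.size())
        throw std::out_of_range("lead surrogate at the end of the sequence");

    const std::uint32_t second = words[pos];
    if (second < 0xDC00u || second > 0xDFFFu)
        throw std::runtime_error("lead surrogate not followed by a trail surrogate");
    ++pos;

    // Recombine the two 10 bit halves and add back the 0x10000 offset.
    return 0x10000u + ((first - 0xD800u) << 10) + (second - 0xDC00u);
}

// Convenience wrapper: decodes only the first code point of the sequence.
inline std::uint32_t decode_first_utf16_sketch(const std::vector<std::uint16_t>& words)
{
    std::size_t pos = 0;
    return decode_utf16_sketch(words, pos);
}
```

Applied to the array from the example above, the first call would yield 0x65e5 and consume one word; a second call would consume the 0xd800, 0xdf46 pair.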

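utf8::peek_next

Available in version 2.1 and later.

Given the iterator to the beginning of a UTF-8 sequence, it returns the code point for the following sequence without changing the value of the iterator.

template <typename octet_iterator>
utfchar32_t peek_next(octet_iterator it, octet_iterator end);

octet_iterator: an input iterator.
it: an iterator pointing to the beginning of a UTF-8 encoded code point.
end: end of the UTF-8 sequence to be processed. If it becomes equal to end during the extraction of a code point, a utf8::not_enough_room exception is thrown.
Return value: the 32 bit representation of the processed UTF-8 code point.

Example of use:

const char* twochars = "\xe6\x97\xa5\xd1\x88";
const char* w = twochars;
int cp = peek_next(w, twochars + 6);
assert(cp == 0x65e5);
assert(w == twochars);

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown.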

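utf8::prior

Available in version 1.02 and later.

Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it decreases the iterator until it hits the beginning of the previous UTF-8 encoded code point and returns the 32 bit representation of the code point.

template <typename octet_iterator>
utfchar32_t prior(octet_iterator& it, octet_iterator start);

octet_iterator: a bidirectional iterator.
it: a reference pointing to an octet within a UTF-8 encoded string. After the function returns, it is decremented to point to the beginning of the previous code point.
start: an iterator to the beginning of the sequence where the search for the beginning of a code point is performed. It is a safety measure to prevent passing the beginning of the string in the search for a UTF-8 lead octet.
Return value: the 32 bit representation of the previous code point.

Example of use:

const char* twochars = "\xe6\x97\xa5\xd1\x88";
const char* w = twochars + 3;
int cp = prior(w, twochars);
assert(cp == 0x65e5);
assert(w == twochars);

This function has two purposes. One is to iterate backwards through a UTF-8 encoded string. Note that it is usually a better idea to iterate forward instead, since utf8::next is faster. The second purpose is to find the beginning of a UTF-8 sequence if we have a random position within a string. Note that in that case utf8::prior may not detect an invalid UTF-8 sequence in some scenarios: for instance, if there are superfluous trail octets, it will just skip them.

it will typically point to the beginning of a code point, and start will point to the beginning of the string to ensure we don't go backwards too far. it is decreased until it points to a lead UTF-8 octet, and then the UTF-8 sequence beginning with that octet is decoded to a 32 bit representation and returned.

In case start is reached before a UTF-8 lead octet is hit, or if an invalid UTF-8 sequence is started by the lead octet, an invalid_utf8 exception is thrown.

In case start equals it, a not_enough_room exception is thrown.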
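The backward search for a lead octet can be sketched on its own. This is only an illustration of the scan, under the assumption that trail octets are recognised by their 10xxxxxx bit pattern: is_trail_sketch and previous_start_sketch are hypothetical helpers, they use string indices instead of iterators, they return std::string::npos where the library would throw, and they do not decode or validate the sequence they find.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 trail (continuation) octets, which look like 10xxxxxx.
inline bool is_trail_sketch(char c)
{
    return (static_cast<unsigned char>(c) & 0xC0u) == 0x80u;
}

// Hypothetical helper, NOT part of utfcpp: steps back from index pos until it
// reaches an octet that is not a trail octet and returns that index.
// Returns std::string::npos if pos is 0 (nothing precedes it), if pos is out
// of range, or if the start of the string is reached while still on trail
// octets.
inline std::size_t previous_start_sketch(const std::string& s, std::size_t pos)
{
    if (pos == 0 || pos > s.size())
        return std::string::npos;
    do {
        --pos;
        if (!is_trail_sketch(s[pos]))
            return pos;  // found the lead octet of the previous code point
    } while (pos > 0);
    return std::string::npos;
}
```

Like the behaviour described above, the scan simply skips over any number of trail octets, which is why superfluous trail octets would go unnoticed at this stage.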

utf8::advance

Available in version 1.0 and later.

Advances an iterator by the specified number of code points within a UTF-8 sequence.

template <typename octet_iterator, typename distance_type>
void advance(octet_iterator& it, distance_type n, octet_iterator end);

octet_iterator: an input iterator.
distance_type: an integral type convertible to octet_iterator's difference type.
it: a reference to an iterator pointing to the beginning of a UTF-8 encoded code point. After the function returns, it is incremented to point to the nth following code point.
n: the number of code points it should be advanced by. A negative value means decrement.
end: limit of the UTF-8 sequence to be processed. If n is positive and it becomes equal to end during the extraction of a code point, a utf8::not_enough_room exception is thrown. If n is negative and it reaches end while it points to a trail byte of a UTF-8 sequence, a utf8::invalid_code_point exception is thrown.

Example of use:

const char* twochars = "\xe6\x97\xa5\xd1\x88";
const char* w = twochars;
advance(w, 2, twochars + 6);
assert(w == twochars + 5);
advance(w, -2, twochars);
assert(w == twochars);

In case of an invalid code point, a utf8::invalid_code_point exception is thrown.

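utf8::distance

Available in version 1.0 and later.

Given the iterators to two UTF-8 encoded code points in a sequence, returns the number of code points between them.

template <typename octet_iterator>
typename std::iterator_traits<octet_iterator>::difference_type distance(octet_iterator first, octet_iterator last);

octet_iterator: an input iterator.
first: an iterator to the beginning of a UTF-8 encoded code point.
last: an iterator to the "post-end" of the last UTF-8 encoded code point in the sequence whose length we are trying to determine. It can be the beginning of a new code point, or not.
Return value: the distance between the iterators, in code points.

Example of use:

const char* twochars = "\xe6\x97\xa5\xd1\x88";
size_t dist = utf8::distance(twochars, twochars + 5);
assert(dist == 2);

This function is used to find the length (in code points) of a UTF-8 encoded string. The reason it is called distance rather than, say, length is mainly that developers are used to length being an O(1) function. Computing the length of a UTF-8 string is a linear operation, and it looked better to model it after the std::distance algorithm.

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown. If last does not point to the past-the-end of a UTF-8 sequence, a utf8::not_enough_room exception is thrown.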
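To see why the length of a UTF-8 string is a linear computation, consider this minimal sketch that counts code points by visiting every octet. It assumes the input is already valid UTF-8 and performs no validation, unlike utf8::distance as described above; count_code_points_sketch and count_in_string_sketch are hypothetical helpers written for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper, NOT part of utfcpp: counts code points in the range
// [first, last) by counting every octet that is not a trail octet
// (trail octets have the bit pattern 10xxxxxx).
// Assumes the range holds valid UTF-8; nothing is validated here.
template <typename InputIt>
std::size_t count_code_points_sketch(InputIt first, InputIt last)
{
    std::size_t count = 0;
    for (; first != last; ++first) {
        if ((static_cast<unsigned char>(*first) & 0xC0u) != 0x80u)
            ++count;  // a lead octet or a single-octet code point
    }
    return count;
}

// Convenience wrapper for std::string.
inline std::size_t count_in_string_sketch(const std::string& s)
{
    return count_code_points_sketch(s.begin(), s.end());
}
```

Every octet has to be inspected once, so the cost grows with the size of the string in octets, whereas std::string::length() only reports the stored octet count.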

utf8::utf16to8

octet_iterator utf16to8(u16bit_iterator start, u16bit_iterator end, octet_iterator result)

Available in version 1.0 and later.

Converts a UTF-16 encoded string to UTF-8.

template <typename u16bit_iterator, typename octet_iterator>
octet_iterator utf16to8(u16bit_iterator start, u16bit_iterator end, octet_iterator result);

u16bit_iterator: an input iterator.
octet_iterator: an output iterator.
start: an iterator pointing to the beginning of the UTF-16 encoded string to convert.
end: an iterator pointing to past-the-end of the UTF-16 encoded string to convert.
result: an output iterator to the place in the UTF-8 string where the result of the conversion is appended.
Return value: an iterator pointing to the place after the appended UTF-8 string.

Example of use:

unsigned short utf16string[] = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
vector<unsigned char> utf8result;
utf16to8(utf16string, utf16string + 5, back_inserter(utf8result));
assert(utf8result.size() == 10);

In case of an invalid UTF-16 sequence, a utf8::invalid_utf16 exception is thrown.

std::string utf16to8(const std::u16string& s)

Available in version 3.0 and later. Requires a C++11 compliant compiler.

Converts a UTF-16 encoded string to UTF-8.

std::string utf16to8(const std::u16string& s);

s: a UTF-16 encoded string.
Return value: a UTF-8 encoded string.

Example of use:

    u16string utf16string = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
    string u = utf16to8(utf16string);
    assert(u.size() == 10);

In case of an invalid UTF-16 sequence, a utf8::invalid_utf16 exception is thrown.

std::string utf16to8(std::u16string_view s)

Available in version 3.2 and later. Requires a C++17 compliant compiler.

Converts a UTF-16 encoded string to UTF-8.

std::string utf16to8(std::u16string_view s);

s: a UTF-16 encoded string.
Return value: a UTF-8 encoded string.

Example of use:

    u16string utf16string = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
    u16string_view utf16stringview(utf16string);
    string u = utf16to8(utf16stringview);
    assert(u.size() == 10);

In case of an invalid UTF-16 sequence, a utf8::invalid_utf16 exception is thrown.

utf8::utf16tou8

std::u8string utf16tou8(const std::u16string& s)

Available in version 4.0 and later. Requires a C++20 compliant compiler.

Converts a UTF-16 encoded string to UTF-8.

std::u8string utf16tou8(const std::u16string& s);

s: a UTF-16 encoded string.
Return value: a UTF-8 encoded string.

Example of use:

    u16string utf16string = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
    u8string u = utf16tou8(utf16string);
    assert(u.size() == 10);

In case of an invalid UTF-16 sequence, a utf8::invalid_utf16 exception is thrown.

std::u8string utf16tou8(const std::u16string_view& s)

Available in version 4.0 and later. Requires a C++20 compliant compiler.

Converts a UTF-16 encoded string to UTF-8.

std::u8string utf16tou8(const std::u16string_view& s);

s: a UTF-16 encoded string.
Return value: a UTF-8 encoded string.

Example of use:

    u16string utf16string = {0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e};
    u16string_view utf16stringview(utf16string);
    u8string u = utf16tou8(utf16stringview);
    assert(u.size() == 10);

In case of an invalid UTF-16 sequence, a utf8::invalid_utf16 exception is thrown.

utf8::utf8to16

u16bit_iterator utf8to16(octet_iterator start, octet_iterator end, u16bit_iterator result)

Available in version 1.0 and later.

Converts a UTF-8 encoded string to UTF-16.

template <typename u16bit_iterator, typename octet_iterator>
u16bit_iterator utf8to16(octet_iterator start, octet_iterator end, u16bit_iterator result);

octet_iterator: an input iterator.
u16bit_iterator: an output iterator.
start: an iterator pointing to the beginning of the UTF-8 encoded string to convert.
end: an iterator pointing to past-the-end of the UTF-8 encoded string to convert.
result: an output iterator to the place in the UTF-16 string where the result of the conversion is appended.
Return value: an iterator pointing to the place after the appended UTF-16 string.

Example of use:

char utf8_with_surrogates[] = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e";
vector<unsigned short> utf16result;
utf8to16(utf8_with_surrogates, utf8_with_surrogates + 9, back_inserter(utf16result));
assert(utf16result.size() == 4);
assert(utf16result[2] == 0xd834);
assert(utf16result[3] == 0xdd1e);

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown. If end does not point to the past-the-end of a UTF-8 sequence, a utf8::not_enough_room exception is thrown.

std::u16string utf8to16(const std::string& s)

Available in version 3.0 and later. Requires a C++11 compliant compiler.

Converts a UTF-8 encoded string to UTF-16.

std::u16string utf8to16(const std::string& s);

s: a UTF-8 encoded string to convert.
Return value: a UTF-16 encoded string.

Example of use:

string utf8_with_surrogates = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e";
u16string utf16result = utf8to16(utf8_with_surrogates);
assert(utf16result.length() == 4);
assert(utf16result[2] == 0xd834);
assert(utf16result[3] == 0xdd1e);

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown.

std::u16string utf8to16(std::string_view s)

Available in version 3.2 and later. Requires a C++17 compliant compiler.

Converts a UTF-8 encoded string to UTF-16.

std::u16string utf8to16(std::string_view s);

s: a UTF-8 encoded string to convert.
Return value: a UTF-16 encoded string.

Example of use:

string_view utf8_with_surrogates = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e";
u16string utf16result = utf8to16(utf8_with_surrogates);
assert(utf16result.length() == 4);
assert(utf16result[2] == 0xd834);
assert(utf16result[3] == 0xdd1e);

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown.

std::u16string utf8to16(std::u8string& s)

Available in version 4.0 and later. Requires a C++20 compliant compiler.

Converts a UTF-8 encoded string to UTF-16.

std::u16string utf8to16(std::u8string& s);

s: a UTF-8 encoded string to convert.
Return value: a UTF-16 encoded string.

Example of use:

std::u8string utf8_with_surrogates = u8"\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e";
std::u16string utf16result = utf8to16(utf8_with_surrogates);
assert(utf16result.length() == 4);
assert(utf16result[2] == 0xd834);
assert(utf16result[3] == 0xdd1e);

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown.

std::u16string utf8to16(std::u8string_view& s)

Available in version 4.0 and later. Requires a C++20 compliant compiler.

Converts a UTF-8 encoded string to UTF-16.

std::u16string utf8to16(std::u8string_view& s);

s: a UTF-8 encoded string to convert.
Return value: a UTF-16 encoded string.

Example of use:

std::u8string utf8_with_surrogates = u8"\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e";
std::u8string_view utf8stringview{utf8_with_surrogates};
std::u16string utf16result = utf8to16(utf8stringview);
assert(utf16result.length() == 4);
assert(utf16result[2] == 0xd834);
assert(utf16result[3] == 0xdd1e);

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown.

utf8::utf32to8

octet_iterator utf32to8(u32bit_iterator start, u32bit_iterator end, octet_iterator result)

Available in version 1.0 and later.

Converts a UTF-32 encoded string to UTF-8.

template <typename octet_iterator, typename u32bit_iterator>
octet_iterator utf32to8(u32bit_iterator start, u32bit_iterator end, octet_iterator result);

octet_iterator: an output iterator.
u32bit_iterator: an input iterator.
start: an iterator pointing to the beginning of the UTF-32 encoded string to convert.
end: an iterator pointing to past-the-end of the UTF-32 encoded string to convert.
result: an output iterator to the place in the UTF-8 string where the result of the conversion is appended.
Return value: an iterator pointing to the place after the appended UTF-8 string.

Example of use:

int utf32string[] = {0x448, 0x65E5, 0x10346, 0};
vector<unsigned char> utf8result;
utf32to8(utf32string, utf32string + 3, back_inserter(utf8result));
assert(utf8result.size() == 9);

In case of an invalid UTF-32 string, a utf8::invalid_code_point exception is thrown.

std::string utf32to8(const std::u32string& s)

Available in version 3.0 and later. Requires a C++11 compliant compiler.

Converts a UTF-32 encoded string to UTF-8.

std::string utf32to8(const std::u32string& s);

s: a UTF-32 encoded string.
Return value: a UTF-8 encoded string.

Example of use:

u32string utf32string = {0x448, 0x65E5, 0x10346};
string utf8result = utf32to8(utf32string);
assert(utf8result.size() == 9);

In case of an invalid UTF-32 string, a utf8::invalid_code_point exception is thrown.

std::u8string utf32to8(const std::u32string& s)

Available in version 4.0 and later. Requires a C++20 compliant compiler.

Converts a UTF-32 encoded string to UTF-8.

std::u8string utf32to8(const std::u32string& s);

s: a UTF-32 encoded string.
Return value: a UTF-8 encoded string.

Example of use:

u32string utf32string = {0x448, 0x65E5, 0x10346};
u8string utf8result = utf32to8(utf32string);
assert(utf8result.size() == 9);

In case of an invalid UTF-32 string, a utf8::invalid_code_point exception is thrown.

std::u8string utf32to8(const std::u32string_view& s)

Available in version 4.0 and later. Requires a C++20 compliant compiler.

Converts a UTF-32 encoded string to UTF-8.

std::u8string utf32to8(const std::u32string_view& s);

s: a UTF-32 encoded string.
Return value: a UTF-8 encoded string.

Example of use:

u32string utf32string = {0x448, 0x65E5, 0x10346};
u32string_view utf32stringview(utf32string);
u8string utf8result = utf32to8(utf32stringview);
assert(utf8result.size() == 9);

In case of an invalid UTF-32 string, a utf8::invalid_code_point exception is thrown.

std::string utf32to8(std::u32string_view s)

Available in version 3.2 and later. Requires a C++17 compliant compiler.

Converts a UTF-32 encoded string to UTF-8.

std::string utf32to8(std::u32string_view s);

s: a UTF-32 encoded string.
Return value: a UTF-8 encoded string.

Example of use:

u32string utf32string = {0x448, 0x65E5, 0x10346};
u32string_view utf32stringview(utf32string);
string utf8result = utf32to8(utf32stringview);
assert(utf8result.size() == 9);

In case of an invalid UTF-32 string, a utf8::invalid_code_point exception is thrown.

utf8::utf8to32

u32bit_iterator utf8to32

Available in version 1.0 and later.

Converts a UTF-8 encoded string to UTF-32.

template <typename octet_iterator, typename u32bit_iterator>
u32bit_iterator utf8to32(octet_iterator start, octet_iterator end, u32bit_iterator result);

octet_iterator: an input iterator.
u32bit_iterator: an output iterator.
start: an iterator pointing to the beginning of the UTF-8 encoded string to convert.
end: an iterator pointing to past-the-end of the UTF-8 encoded string to convert.
result: an output iterator to the place in the UTF-32 string where the result of the conversion is appended.
Return value: an iterator pointing to the place after the appended UTF-32 string.

Example of use:

const char* twochars = "\xe6\x97\xa5\xd1\x88";
vector<int> utf32result;
utf8to32(twochars, twochars + 5, back_inserter(utf32result));
assert(utf32result.size() == 2);

std::u32string utf8to32(const std::u8string& s)

Available in version 4.0 and later. Requires a C++20 compliant compiler.

Converts a UTF-8 encoded string to UTF-32.

std::u32string utf8to32(const std::u8string& s);

s: a UTF-8 encoded string.
Return value: a UTF-32 encoded string.

Example of use:

const std::u8string twochars = u8"\xe6\x97\xa5\xd1\x88";
u32string utf32result = utf8to32(twochars);
assert(utf32result.size() == 2);

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown.

std::u32string utf8to32(const std::u8string_view& s)

Available in version 4.0 and later. Requires a C++20 compliant compiler.

Converts a UTF-8 encoded string to UTF-32.

std::u32string utf8to32(const std::u8string_view& s);

s: a UTF-8 encoded string.
Return value: a UTF-32 encoded string.

Example of use:

const u8string twochars = u8"\xe6\x97\xa5\xd1\x88";
const u8string_view stringview{twochars};
u32string utf32result = utf8to32(stringview);
assert(utf32result.size() == 2);

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown.

std::u32string utf8to32(const std::string& s)

Available in version 3.0 and later. Requires a C++11 compliant compiler.

Converts a UTF-8 encoded string to UTF-32.

std::u32string utf8to32(const std::string& s);

s: a UTF-8 encoded string.
Return value: a UTF-32 encoded string.

Example of use:

const char* twochars = "\xe6\x97\xa5\xd1\x88";
u32string utf32result = utf8to32(twochars);
assert(utf32result.size() == 2);

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown.

std::u32string utf8to32(std::string_view s)

Available in version 3.2 and later. Requires a C++17 compliant compiler.

Converts a UTF-8 encoded string to UTF-32.

std::u32string utf8to32(std::string_view s);

s: a UTF-8 encoded string.
Return value: a UTF-32 encoded string.

Example of use:

string_view twochars = "\xe6\x97\xa5\xd1\x88";
u32string utf32result = utf8to32(twochars);
assert(utf32result.size() == 2);

In case of an invalid UTF-8 sequence, a utf8::invalid_utf8 exception is thrown.

utf8::find_invalid

octet_iterator find_invalid(octet_iterator start, octet_iterator end)

Available in version 1.0 and later.

Detects an invalid sequence within a UTF-8 string.

template <typename octet_iterator>
octet_iterator find_invalid(octet_iterator start, octet_iterator end);

octet_iterator: an input iterator.
start: an iterator pointing to the beginning of the UTF-8 string to test for validity.
end: an iterator pointing to past-the-end of the UTF-8 string to test for validity.
Return value: an iterator pointing to the first invalid octet in the UTF-8 string. In case none were found, equals end.

Example of use:

char utf_invalid[] = "\xe6\x97\xa5\xd1\x88\xfa";
char* invalid = find_invalid(utf_invalid, utf_invalid + 6);
assert(invalid == utf_invalid + 5);

This function is typically used to make sure a UTF-8 string is valid before processing it with other functions. It is especially important to call it before doing any of the unchecked operations on the string.

const char* find_invalid(const char* str)

Available in version 4.0 and later.

Detects an invalid sequence within a C-style UTF-8 string.

const char* find_invalid(const char* str);

str: a UTF-8 encoded string.
Return value: a pointer to the first invalid octet in the UTF-8 string. In case none were found, points to the trailing zero byte.

Example of use:

const char* utf_invalid = "\xe6\x97\xa5\xd1\x88\xfa";
const char* invalid = find_invalid(utf_invalid);
assert((invalid - utf_invalid) == 5);

std::size_t find_invalid(const std::string& s)

Available in version 3.0 and later. Prior to 4.0 it required a C++11 compiler; the requirement is lifted with 4.0.

Detects an invalid sequence within a UTF-8 string.

std::size_t find_invalid(const std::string& s);

s: a UTF-8 encoded string.
Return value: the index of the first invalid octet in the UTF-8 string. In case none were found, equals std::string::npos.

Example of use:

string utf_invalid = "\xe6\x97\xa5\xd1\x88\xfa";
auto invalid = find_invalid(utf_invalid);
assert(invalid == 5);

std::size_t find_invalid(std::string_view s)

Available in version 3.2 and later. Requires a C++17 compliant compiler.

Detects an invalid sequence within a UTF-8 string.

std::size_t find_invalid(std::string_view s);

s: a UTF-8 encoded string.
Return value: the index of the first invalid octet in the UTF-8 string. In case none were found, equals std::string_view::npos.

Example of use:

string_view utf_invalid = "\xe6\x97\xa5\xd1\x88\xfa";
auto invalid = find_invalid(utf_invalid);
assert(invalid == 5);

utf8::is_valid

bool is_valid(octet_iterator start, octet_iterator end)

Available in version 1.0 and later.

Checks whether a sequence of octets is a valid UTF-8 string.

template <typename octet_iterator>
bool is_valid(octet_iterator start, octet_iterator end);

octet_iterator: an input iterator.
start: an iterator pointing to the beginning of the UTF-8 string to test for validity.
end: an iterator pointing to past-the-end of the UTF-8 string to test for validity.
Return value: true if the sequence is a valid UTF-8 string; false if not.

Example of use:

char utf_invalid[] = "\xe6\x97\xa5\xd1\x88\xfa";
bool bvalid = is_valid(utf_invalid, utf_invalid + 6);
assert(bvalid == false);

is_valid is a shorthand for find_invalid(start, end) == end. You might use it to make sure that a byte sequence is a valid UTF-8 string without needing to know where it fails if it is not valid.

bool is_valid(const char* str)

Available in version 4.0 and later.

Checks whether a C-style string contains valid UTF-8 encoded text.

bool is_valid(const char* str);

str: a UTF-8 encoded string.
Return value: true if the string contains valid UTF-8 encoded text; false if not.

Example of use:

char utf_invalid[] = "\xe6\x97\xa5\xd1\x88\xfa";
bool bvalid = is_valid(utf_invalid);
assert(bvalid == false);

You might use is_valid to make sure that a string contains valid UTF-8 text without needing to know where it fails if it is not valid.

bool is_valid(const std::string& s)

Available in version 3.0 and later. Prior to 4.0 it required a C++11 compiler; the requirement is lifted with 4.0.

Checks whether a string object contains valid UTF-8 encoded text.

bool is_valid(const std::string& s);

s: a UTF-8 encoded string.
Return value: true if the string contains valid UTF-8 encoded text; false if not.

Example of use:

std::string utf_invalid = "\xe6\x97\xa5\xd1\x88\xfa";
bool bvalid = is_valid(utf_invalid);
assert(bvalid == false);

You might use is_valid to make sure that a string contains valid UTF-8 text without needing to know where it fails if it is not valid.

bool is_valid(std::string_view s)

Available in version 3.2 and later. Requires a C++17 compliant compiler.

Checks whether a string object contains valid UTF-8 encoded text.

bool is_valid(std::string_view s);

s: a UTF-8 encoded string.
Return value: true if the string contains valid UTF-8 encoded text; false if not.

Example of use:

string_view utf_invalid = "\xe6\x97\xa5\xd1\x88\xfa";
bool bvalid = is_valid(utf_invalid);
assert(bvalid == false);

You might use is_valid to make sure that a string contains valid UTF-8 text without needing to know where it fails if it is not valid.

utf8::replace_invalid

output_iterator replace_invalid(octet_iterator start, octet_iterator end, output_iterator out, utfchar32_t replacement)

Available in version 2.0 and later.

Replaces all invalid UTF-8 sequences within a string with a replacement marker.

 template < typename octet_iterator, typename output_iterator>
output_iterator replace_invalid (octet_iterator start, octet_iterator end, output_iterator out, utfchar32_t replacement);
template < typename octet_iterator, typename output_iterator>
output_iterator replace_invalid (octet_iterator start, octet_iterator end, output_iterator out);

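octet_iterator : an input iterator.
output_iterator : an output iterator.
start : an iterator pointing to the beginning of the UTF-8 string to look for invalid UTF-8 sequences.
end : an iterator pointing to past-the-end of the UTF-8 string to look for invalid UTF-8 sequences.
out : an output iterator to the range where the result of replacement is stored.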
replacement : A Unicode code point for the replacement marker. The version without this parameter assumes the value 0xfffd
Return value: An iterator pointing to the place after the UTF-8 string with replaced invalid sequences.

Example of use:

char invalid_sequence[] = "a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z";
vector<char> replace_invalid_result;
replace_invalid (invalid_sequence, invalid_sequence + sizeof (invalid_sequence), back_inserter(replace_invalid_result), '?');
bool bvalid = is_valid(replace_invalid_result.begin(), replace_invalid_result.end());
assert (bvalid);
const char* fixed_invalid_sequence = "a????z";
assert (std::equal(replace_invalid_result.begin(), replace_invalid_result.end(), fixed_invalid_sequence));

replace_invalid does not perform in-place replacement of invalid sequences. Rather, it produces a copy of the original string with the invalid sequences replaced with a replacement marker. Therefore, out must not be in the [start, end] range.
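
The copy-instead-of-in-place behaviour can be sketched without the library. Everything in the sketch namespace below is hypothetical, the marker is restricted to a single ASCII character for brevity, and the replacement policy is an assumption that may differ from the library in edge cases: a stray trail octet gets one marker, and a broken multi-octet sequence gets one marker for the lead plus any trail octets that follow it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

namespace sketch {

inline bool is_trail(char c) {
    return (static_cast<unsigned char>(c) & 0xc0) == 0x80;
}

// Length of the valid UTF-8 sequence at s[i], or 0 if it is not valid.
inline std::size_t valid_length(const std::string& s, std::size_t i) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t len = 0;
    std::uint32_t cp = 0;
    if (lead < 0x80)              { len = 1; cp = lead; }
    else if ((lead >> 5) == 0x06) { len = 2; cp = lead & 0x1f; }
    else if ((lead >> 4) == 0x0e) { len = 3; cp = lead & 0x0f; }
    else if ((lead >> 3) == 0x1e) { len = 4; cp = lead & 0x07; }
    else return 0;
    if (i + len > s.size()) return 0;
    for (std::size_t k = 1; k < len; ++k) {
        if (!is_trail(s[i + k])) return 0;
        cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3f);
    }
    static const std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const bool bad = cp < min_cp[len] || (cp >= 0xd800 && cp <= 0xdfff) || cp > 0x10ffff;
    return bad ? 0 : len;
}

// Builds a repaired copy; the input string is never modified.
inline std::string replace_invalid(const std::string& s, char marker = '?') {
    std::string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const std::size_t len = valid_length(s, i);
        if (len != 0) {                 // valid sequence: copy it through
            out.append(s, i, len);
            i += len;
            continue;
        }
        out += marker;                  // one marker per broken sequence
        const bool stray_trail = is_trail(s[i]);
        ++i;
        if (!stray_trail)               // swallow the trail octets of a broken lead
            while (i < s.size() && is_trail(s[i])) ++i;
    }
    return out;
}

inline void example() {
    const std::string broken = "a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z";
    assert(replace_invalid(broken) == "a????z");
}

} // namespace sketch
```

Because the output is a fresh string, the repaired text may be shorter or longer than the input, which is one reason an in-place variant would be awkward.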

std::string replace_invalid(const std::string& s, utfchar32_t replacement)

Available in version 3.0 and later. Prior to 4.0 it required a C++ 11 compiler; the requirement is lifted with 4.0

Replaces all invalid UTF-8 sequences within a string with a replacement marker.

std::string replace_invalid ( const std::string& s, utfchar32_t replacement);
std::string replace_invalid ( const std::string& s);

s : a UTF-8 encoded string.
replacement : A Unicode code point for the replacement marker. The version without this parameter assumes the value 0xfffd
Return value: A UTF-8 encoded string with replaced invalid sequences.

Example of use:

string invalid_sequence = "a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z";
string replace_invalid_result = replace_invalid(invalid_sequence, '?');
bool bvalid = is_valid(replace_invalid_result);
assert (bvalid);
const string fixed_invalid_sequence = "a????z";
assert (fixed_invalid_sequence == replace_invalid_result);

std::string replace_invalid(std::string_view s, char32_t replacement)

Available in version 3.2 and later. Requires a C++ 17 compliant compiler.

Replaces all invalid UTF-8 sequences within a string with a replacement marker.

std::string replace_invalid (std::string_view s, char32_t replacement);
std::string replace_invalid (std::string_view s);

Example of use:

string_view invalid_sequence = "a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z";
string replace_invalid_result = replace_invalid(invalid_sequence, '?');
bool bvalid = is_valid(replace_invalid_result);
assert (bvalid);
const string fixed_invalid_sequence = "a????z";
assert (fixed_invalid_sequence == replace_invalid_result);

utf8::starts_with_bom

bool starts_with_bom (octet_iterator it, octet_iterator end)

Available in version 2.3 and later.

Checks whether an octet sequence starts with a UTF-8 byte order mark (BOM)

 template < typename octet_iterator> 
bool starts_with_bom (octet_iterator it, octet_iterator end);

octet_iterator : an input iterator.
it : beginning of the octet sequence to check
end : past-the-end of the sequence to check
Return value: true if the sequence starts with a UTF-8 byte order mark; false if not.

Example of use:

 unsigned char byte_order_mark[] = { 0xef , 0xbb , 0xbf };
bool bbom = starts_with_bom(byte_order_mark, byte_order_mark + sizeof (byte_order_mark));
assert (bbom == true );

The typical use of this function is to check the first three bytes of a file. If they form the UTF-8 BOM, we want to skip them before processing the actual UTF-8 encoded text.
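
The skip-the-BOM pattern described above might look like the following sketch. It is self-contained and does not call the library; sketch::starts_with_bom and sketch::skip_bom are hypothetical helpers that only compare the first three octets against EF BB BF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch {

// True if the first three octets are EF BB BF, the UTF-8 encoding of U+FEFF.
inline bool starts_with_bom(const std::string& s) {
    return s.size() >= 3 &&
           static_cast<unsigned char>(s[0]) == 0xef &&
           static_cast<unsigned char>(s[1]) == 0xbb &&
           static_cast<unsigned char>(s[2]) == 0xbf;
}

// Returns the text with a leading BOM removed, if there is one.
inline std::string skip_bom(const std::string& s) {
    return starts_with_bom(s) ? s.substr(3) : s;
}

inline void example() {
    // The literal is split so that "\xbf" is not followed by a hex digit.
    const std::string file_contents = std::string("\xef\xbb\xbf") + "hello";
    assert(skip_bom(file_contents) == "hello");
    assert(skip_bom("hello") == "hello");
}

} // namespace sketch
```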

bool starts_with_bom(const std::string& s)

Available in version 3.0 and later. Prior to 4.0 it required a C++ 11 compiler; the requirement is lifted with 4.0

Checks whether a string starts with a UTF-8 byte order mark (BOM)

 bool starts_with_bom ( const std::string& s);

s : a UTF-8 encoded string.
Return value: true if the string starts with a UTF-8 byte order mark; false if not.

Example of use:

string byte_order_mark = { char ( 0xef ), char ( 0xbb ), char ( 0xbf )};
bool bbom = starts_with_bom(byte_order_mark);
assert (bbom == true );
string threechars = "\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88";
bool no_bbom = starts_with_bom(threechars);
assert (no_bbom == false );

The typical use of this function is to check the first three bytes of a file. If they form the UTF-8 BOM, we want to skip them before processing the actual UTF-8 encoded text.

bool starts_with_bom(std::string_view s)

Available in version 3.2 and later. Requires a C++ 17 compliant compiler.

Checks whether a string starts with a UTF-8 byte order mark (BOM)

 bool starts_with_bom (std::string_view s);

s : a UTF-8 encoded string.
Return value: true if the string starts with a UTF-8 byte order mark; false if not.

Example of use:

string byte_order_mark = { char ( 0xef ), char ( 0xbb ), char ( 0xbf )};
string_view byte_order_mark_view (byte_order_mark);
bool bbom = starts_with_bom(byte_order_mark_view);
assert (bbom);
string_view threechars = "\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88";
bool no_bbom = starts_with_bom(threechars);
assert (!no_bbom);

The typical use of this function is to check the first three bytes of a file. If they form the UTF-8 BOM, we want to skip them before processing the actual UTF-8 encoded text.

Types From utf8 Namespace

utf8::exception

Available in version 2.3 and later.

Base class for the exceptions thrown by UTF CPP library functions.

 class exception : public std :: exception {};

Example of use:

 try {
  code_that_uses_utf_cpp_library ();
}
catch ( const utf8:: exception & utfcpp_ex) {
  cerr << utfcpp_ex. what ();
}

utf8::invalid_code_point

Available in version 1.0 and later.

Thrown by UTF8 CPP functions such as advance and next if a UTF-8 sequence represents an invalid code point.

 class invalid_code_point : public exception {
public: 
    utfchar32_t code_point () const ;
};

Member function code_point() can be used to determine the invalid code point that caused the exception to be thrown.
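
A sketch of how such a hierarchy might be used follows. The classes below are local stand-ins that mirror the shape described above, not the library's own definitions, and require_valid is a hypothetical helper. C++11 is assumed for noexcept and override.

```cpp
#include <cassert>
#include <cstdint>
#include <exception>

namespace sketch {

// Stand-in for the library's base exception class.
class exception : public std::exception {};

// Stand-in for invalid_code_point: remembers the offending value.
class invalid_code_point : public exception {
    std::uint32_t cp_;
public:
    explicit invalid_code_point(std::uint32_t cp) : cp_(cp) {}
    const char* what() const noexcept override { return "Invalid code point"; }
    std::uint32_t code_point() const { return cp_; }
};

// Hypothetical check: rejects surrogates and values above U+10FFFF.
inline void require_valid(std::uint32_t cp) {
    if (cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
        throw invalid_code_point(cp);
}

// Catching the derived type gives access to code_point().
// Returns the rejected value, or 0 if cp passed the check.
inline std::uint32_t offending_code_point(std::uint32_t cp) {
    try {
        require_valid(cp);
    } catch (const invalid_code_point& e) {
        return e.code_point();
    }
    return 0;
}

// Catching the base class is enough when the details are not needed.
inline bool accepted(std::uint32_t cp) {
    try {
        require_valid(cp);
        return true;
    } catch (const exception&) {
        return false;
    }
}

inline void example() {
    assert(accepted(0x65e5));
    assert(offending_code_point(0xd800) == 0xd800);
}

} // namespace sketch
```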

utf8::invalid_utf8

Available in version 1.0 and later.

Thrown by UTF8 CPP functions such as next and prior if an invalid UTF-8 sequence is detected during decoding.

 class invalid_utf8 : public exception {
public: 
    utfchar8_t utf8_octet () const ;
};

Member function utf8_octet() can be used to determine the beginning of the byte sequence that caused the exception to be thrown.

utf8::invalid_utf16

Available in version 1.0 and later.

Thrown by UTF8 CPP function utf16to8 if an invalid UTF-16 sequence is detected during decoding.

 class invalid_utf16 : public exception {
public: 
    utfchar16_t utf16_word () const ;
};

Member function utf16_word() can be used to determine the UTF-16 code unit that caused the exception to be thrown.

utf8::not_enough_room

Available in version 1.0 and later.

Thrown by UTF8 CPP functions such as next if the end of the decoded UTF-8 sequence was reached before the code point was decoded.

 class not_enough_room : public exception {};

utf8::iterator

Available in version 2.0 and later.

Adapts the underlying octet iterator to iterate over the sequence of code points, rather than raw octets.

 template < typename octet_iterator>
class iterator ;

Member functions

iterator(); the default constructor; the underlying octet_iterator is constructed with its default constructor.

explicit iterator (const octet_iterator& octet_it, const octet_iterator& range_start, const octet_iterator& range_end); a constructor that initializes the underlying octet_iterator with octet_it and sets the range in which the iterator is considered valid.

octet_iterator base () const; returns the underlying octet_iterator.

utfchar32_t operator * () const; decodes the utf-8 sequence the underlying octet_iterator is pointing to and returns the code point.

bool operator == (const iterator& rhs) const; returns true if the two underlying iterators are equal.

bool operator != (const iterator& rhs) const; returns true if the two underlying iterators are not equal.

iterator& operator ++ (); the prefix increment - moves the iterator to the next UTF-8 encoded code point.

iterator operator ++ (int); the postfix increment - moves the iterator to the next UTF-8 encoded code point and returns the current one.

iterator& operator -- (); the prefix decrement - moves the iterator to the previous UTF-8 encoded code point.

iterator operator -- (int); the postfix decrement - moves the iterator to the previous UTF-8 encoded code point and returns the current one.

Example of use:

char threechars[] = "\xf0\x90\x8d\x86\xe6\x97\xa5\xd1\x88";
utf8::iterator< char *> it (threechars, threechars, threechars + 9 );
utf8::iterator< char *> it2 = it;
assert (it2 == it);
assert (*it == 0x10346 );
assert (*(++it) == 0x65e5);
assert ((*it++) == 0x65e5);
assert (*it == 0x0448 );
assert (it != it2);
utf8::iterator< char *> endit (threechars + 9 , threechars, threechars + 9 );  
assert (++it == endit);
assert (*(--it) == 0x0448);
assert ((*it--) == 0x0448);
assert (*it == 0x65e5 );
assert (--it == utf8::iterator< char *>(threechars, threechars, threechars + 9 ));
assert (*it == 0x10346 );

The purpose of utf8::iterator adapter is to enable easy iteration as well as the use of STL algorithms with UTF-8 encoded strings. Increment and decrement operators are implemented in terms of utf8::next() and utf8::prior() functions.

Note that the utf8::iterator adapter is a checked iterator. It operates on the range specified in the constructor; any attempt to go out of that range will result in an exception. Even the comparison operators require both iterator objects to be constructed against the same range; otherwise an exception is thrown. Typically, the range will be determined by the sequence container functions begin and end, i.e.:

std::string s = "example";
utf8::iterator<std::string::iterator> i (s.begin(), s.begin(), s.end());
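
The idea of a range-checked adapter can be shown with a small self-contained sketch. The checked_cursor class is hypothetical and far simpler than utf8::iterator: it is forward-only, it models only the range check (trail octets are not validated), and it throws std::out_of_range rather than a library exception.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

namespace sketch {

// Forward-only cursor over the code points of one string.
// The string must outlive the cursor.
class checked_cursor {
    const std::string& s_;
    std::size_t pos_;
public:
    explicit checked_cursor(const std::string& s) : s_(s), pos_(0) {}

    bool at_end() const { return pos_ == s_.size(); }

    // Decodes the current code point and moves past it; throws instead of
    // reading outside the string.
    std::uint32_t next() {
        if (at_end()) throw std::out_of_range("no code point left in range");
        const unsigned char lead = static_cast<unsigned char>(s_[pos_]);
        const std::size_t len = lead < 0x80 ? 1 : lead < 0xe0 ? 2 : lead < 0xf0 ? 3 : 4;
        if (pos_ + len > s_.size()) throw std::out_of_range("sequence runs past the range");
        // Keep the payload bits of the lead octet: 5, 4 or 3 bits for 2, 3 or 4 octets.
        std::uint32_t cp = (len == 1) ? lead : (lead & (0xffu >> (len + 1)));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s_[pos_ + k]) & 0x3f);
        pos_ += len;
        return cp;
    }
};

// Collects every code point in s.
inline std::vector<std::uint32_t> code_points(const std::string& s) {
    std::vector<std::uint32_t> out;
    for (checked_cursor c(s); !c.at_end();) out.push_back(c.next());
    return out;
}

// Walks to the end of s and reports whether one more step throws.
inline bool stepping_past_end_throws(const std::string& s) {
    checked_cursor c(s);
    while (!c.at_end()) c.next();
    try { c.next(); } catch (const std::out_of_range&) { return true; }
    return false;
}

} // namespace sketch
```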

Functions From utf8::unchecked Namespace

utf8::unchecked::append

Available in version 1.0 and later.

Encodes a 32 bit code point as a UTF-8 sequence of octets and appends the sequence to a UTF-8 string.

 template < typename octet_iterator>
octet_iterator append ( utfchar32_t cp, octet_iterator result);

cp : A 32 bit integer representing a code point to append to the sequence.
result : An output iterator to the place in the sequence where to append the code point.
Return value: An iterator pointing to the place after the newly appended sequence.

Example of use:

 unsigned char u[ 5 ] = { 0 , 0 , 0 , 0 , 0 };
unsigned char * end = unchecked::append( 0x0448 , u);
assert (u[ 0 ] == 0xd1 && u[ 1 ] == 0x88 && u[ 2 ] == 0 && u[ 3 ] == 0 && u[ 4 ] == 0 );

This is a faster but less safe version of utf8::append . It does not check for validity of the supplied code point, and may produce an invalid UTF-8 sequence.
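
What "no validity check" means can be seen in a self-contained sketch of a UTF-8 encoder. The functions sketch::append_unchecked and sketch::encode are hypothetical and write to a std::string instead of an output iterator; the bit layout is the standard UTF-8 one.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

namespace sketch {

// Appends the UTF-8 encoding of cp to out. Nothing is validated, so a
// surrogate or a value above U+10FFFF produces an invalid sequence.
inline void append_unchecked(std::uint32_t cp, std::string& out) {
    if (cp < 0x80) {                        // one octet
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // two octets
        out += static_cast<char>((cp >> 6) | 0xc0);
        out += static_cast<char>((cp & 0x3f) | 0x80);
    } else if (cp < 0x10000) {              // three octets
        out += static_cast<char>((cp >> 12) | 0xe0);
        out += static_cast<char>(((cp >> 6) & 0x3f) | 0x80);
        out += static_cast<char>((cp & 0x3f) | 0x80);
    } else {                                // four octets
        out += static_cast<char>((cp >> 18) | 0xf0);
        out += static_cast<char>(((cp >> 12) & 0x3f) | 0x80);
        out += static_cast<char>(((cp >> 6) & 0x3f) | 0x80);
        out += static_cast<char>((cp & 0x3f) | 0x80);
    }
}

inline std::string encode(std::uint32_t cp) {
    std::string out;
    append_unchecked(cp, out);
    return out;
}

inline void example() {
    assert(encode(0x0448) == "\xd1\x88");
    // A surrogate is encoded without complaint, yielding invalid UTF-8.
    assert(encode(0xd800) == "\xed\xa0\x80");
}

} // namespace sketch
```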

utf8::unchecked::append16

Available in version 4.0 and later.

Encodes a 32 bit code point as a UTF-16 sequence of words and appends the sequence to a UTF-16 string.

 template < typename word_iterator>
word_iterator append16 ( utfchar32_t cp, word_iterator result)

Example of use:

unsigned short u[5] = {0, 0};
utf8::unchecked::append16 (0x0448, u);
assert (u[0] == 0x0448);
assert (u[1] == 0x0000);

This is a faster but less safe version of utf8::append16. It does not check for validity of the supplied code point, and may produce an invalid UTF-16 sequence.
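
For illustration, the UTF-16 encoding step can be sketched without the library. The helpers sketch::append16_unchecked and sketch::encode16 are hypothetical and write to a std::vector rather than through a word iterator; code points above U+FFFF are split into a surrogate pair using the standard UTF-16 arithmetic.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

namespace sketch {

// Appends the UTF-16 encoding of cp to out without validating cp.
inline void append16_unchecked(std::uint32_t cp, std::vector<std::uint16_t>& out) {
    if (cp < 0x10000) {
        out.push_back(static_cast<std::uint16_t>(cp));                    // single code unit
    } else {
        const std::uint32_t v = cp - 0x10000;                             // 20 payload bits
        out.push_back(static_cast<std::uint16_t>(0xd800 + (v >> 10)));    // lead surrogate
        out.push_back(static_cast<std::uint16_t>(0xdc00 + (v & 0x3ff)));  // trail surrogate
    }
}

inline std::vector<std::uint16_t> encode16(std::uint32_t cp) {
    std::vector<std::uint16_t> out;
    append16_unchecked(cp, out);
    return out;
}

inline void example() {
    const std::vector<std::uint16_t> pair = encode16(0x1d11e);
    assert(pair.size() == 2);
    assert(pair[0] == 0xd834 && pair[1] == 0xdd1e);
}

} // namespace sketch
```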

utf8::unchecked::next

Available in version 1.0 and later.

Given the iterator to the beginning of a UTF-8 sequence, it returns the code point and moves the iterator to the next position.

 template < typename octet_iterator>
utfchar32_t next (octet_iterator& it);

it : a reference to an iterator pointing to the beginning of an UTF-8 encoded code point. After the function returns, it is incremented to point to the beginning of the next code point.
Return value: the 32 bit representation of the processed UTF-8 code point.

Example of use:

char twochars[] = "\xe6\x97\xa5\xd1\x88";
char * w = twochars;
int cp = unchecked::next(w);
assert (cp == 0x65e5 );
assert (w == twochars + 3 );

This is a faster but less safe version of utf8::next . It does not check for validity of the supplied UTF-8 sequence.
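
A rough idea of what an unchecked decode step involves is given by the sketch below. It is not the library's code: sketch::next_unchecked is a hypothetical index-based function that trusts the lead octet to announce the sequence length and never looks at whether the following octets are valid trail octets or even inside the string.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

namespace sketch {

// Decodes the code point starting at s[pos] and advances pos past it.
// Assumes s holds valid UTF-8 and pos is at a lead octet.
inline std::uint32_t next_unchecked(const std::string& s, std::size_t& pos) {
    const unsigned char lead = static_cast<unsigned char>(s[pos]);
    std::size_t len = 1;
    std::uint32_t cp = lead;
    if (lead >= 0xf0)      { len = 4; cp = lead & 0x07; }
    else if (lead >= 0xe0) { len = 3; cp = lead & 0x0f; }
    else if (lead >= 0xc0) { len = 2; cp = lead & 0x1f; }
    for (std::size_t k = 1; k < len; ++k)   // fold in six bits per trail octet
        cp = (cp << 6) | (static_cast<unsigned char>(s[pos + k]) & 0x3f);
    pos += len;
    return cp;
}

// Decodes a whole string by calling next_unchecked repeatedly.
inline std::vector<std::uint32_t> decode_all(const std::string& s) {
    std::vector<std::uint32_t> out;
    for (std::size_t pos = 0; pos < s.size();) out.push_back(next_unchecked(s, pos));
    return out;
}

inline void example() {
    const std::string twochars = "\xe6\x97\xa5\xd1\x88";
    std::size_t pos = 0;
    assert(next_unchecked(twochars, pos) == 0x65e5);
    assert(pos == 3);
}

} // namespace sketch
```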

utf8::unchecked::next16

Available in version 4.0 and later.

Given the iterator to the beginning of the UTF-16 sequence, it returns the code point and moves the iterator to the next position.

 template < typename word_iterator>
utfchar32_t next16 (word_iterator& it);

word_iterator : an input iterator.
it : a reference to an iterator pointing to the beginning of an UTF-16 encoded code point. After the function returns, it is incremented to point to the beginning of the next code point.

Return value: the 32 bit representation of the processed UTF-16 code point.

Example of use:

const unsigned short u[3] = {0x65e5, 0xd800, 0xdf46};
const unsigned short* w = u;
int cp = unchecked::next16(w);
assert (cp == 0x65e5);
assert (w == u + 1);

This function is typically used to iterate through a UTF-16 encoded string.

This is a faster but less safe version of utf8::next16. It does not check for validity of the supplied UTF-16 sequence.
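
As a sketch under stated assumptions, an unchecked UTF-16 decode step could look like this. sketch::next16_unchecked is hypothetical and index-based; it assumes that every lead surrogate is followed by a trail surrogate and does not verify it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

namespace sketch {

// Decodes the code point starting at s[pos] and advances pos past it.
inline std::uint32_t next16_unchecked(const std::vector<std::uint16_t>& s, std::size_t& pos) {
    std::uint32_t cp = s[pos++];
    if (cp >= 0xd800 && cp <= 0xdbff) {          // lead surrogate: combine with the trail
        const std::uint32_t trail = s[pos++];
        cp = 0x10000 + ((cp - 0xd800) << 10) + (trail - 0xdc00);
    }
    return cp;
}

// Decodes a whole UTF-16 sequence by calling next16_unchecked repeatedly.
inline std::vector<std::uint32_t> decode16_all(const std::vector<std::uint16_t>& s) {
    std::vector<std::uint32_t> out;
    for (std::size_t pos = 0; pos < s.size();) out.push_back(next16_unchecked(s, pos));
    return out;
}

inline void example() {
    const std::vector<std::uint16_t> u = {0x65e5, 0xd800, 0xdf46};
    std::size_t pos = 0;
    assert(next16_unchecked(u, pos) == 0x65e5);
    assert(pos == 1);
    assert(next16_unchecked(u, pos) == 0x10346);
    assert(pos == 3);
}

} // namespace sketch
```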

utf8::unchecked::peek_next

Available in version 2.1 and later.

Given the iterator to the beginning of a UTF-8 sequence, it returns the code point.

 template < typename octet_iterator>
utfchar32_t peek_next (octet_iterator it);

it : an iterator pointing to the beginning of an UTF-8 encoded code point.
Return value: the 32 bit representation of the processed UTF-8 code point.

Example of use:

char twochars[] = "\xe6\x97\xa5\xd1\x88";
char * w = twochars;
int cp = unchecked::peek_next(w);
assert (cp == 0x65e5 );
assert (w == twochars);

This is a faster but less safe version of utf8::peek_next . It does not check for validity of the supplied UTF-8 sequence.

utf8::unchecked::prior

Available in version 1.02 and later.

Given a reference to an iterator pointing to an octet in a UTF-8 sequence, it decreases the iterator until it hits the beginning of the previous UTF-8 encoded code point and returns the 32 bits representation of the code point.

 template < typename octet_iterator>
utfchar32_t prior (octet_iterator& it);

it : a reference pointing to an octet within a UTF-8 encoded string. After the function returns, it is decremented to point to the beginning of the previous code point.
Return value: the 32 bit representation of the previous code point.

Example of use:

char twochars[] = "\xe6\x97\xa5\xd1\x88";
char * w = twochars + 3 ;
int cp = unchecked::prior (w);
assert (cp == 0x65e5 );
assert (w == twochars);

This is a faster but less safe version of utf8::prior . It does not check for validity of the supplied UTF-8 sequence and offers no boundary checking.
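
The backward step can be sketched as follows. sketch::step_back is a hypothetical helper that only finds the start of the previous code point; decoding from that position would then proceed as for a forward step. Like the unchecked function described above, it performs no boundary checking, so the caller must ensure pos is greater than zero and the text is valid UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch {

// Returns the index of the lead octet of the code point that precedes pos.
// Trail octets have the bit pattern 10xxxxxx, so the loop walks left until
// it meets an octet that is not a trail octet.
inline std::size_t step_back(const std::string& s, std::size_t pos) {
    do {
        --pos;
    } while ((static_cast<unsigned char>(s[pos]) & 0xc0) == 0x80);
    return pos;
}

inline void example() {
    const std::string twochars = "\xe6\x97\xa5\xd1\x88";
    assert(step_back(twochars, 5) == 3);   // from the end to the start of U+0448
    assert(step_back(twochars, 3) == 0);   // then to the start of U+65E5
}

} // namespace sketch
```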

utf8::unchecked::advance

Available in version 1.0 and later.

Advances an iterator by the specified number of code points within an UTF-8 sequence.

 template < typename octet_iterator, typename distance_type>
void advance (octet_iterator& it, distance_type n);

it : a reference to an iterator pointing to the beginning of a UTF-8 encoded code point. After the function returns, it is incremented to point to the nth following code point.
n : number of code points it should be advanced. A negative value means decrement.

Example of use:

char twochars[] = "\xe6\x97\xa5\xd1\x88";
char * w = twochars;
unchecked::advance (w, 2 );
assert (w == twochars + 5 );

This is a faster but less safe version of utf8::advance . It does not check for validity of the supplied UTF-8 sequence and offers no boundary checking.

utf8::unchecked::distance

Available in version 1.0 and later.

Given the iterators to two UTF-8 encoded code points in a sequence, returns the number of code points between them.

 template < typename octet_iterator>
typename std::iterator_traits<octet_iterator>::difference_type distance (octet_iterator first, octet_iterator last);

first : an iterator to a beginning of a UTF-8 encoded code point.
last : an iterator to a "post-end" of the last UTF-8 encoded code point in the sequence we are trying to determine the length. It can be the beginning of a new code point, or not.
Return value: the distance between the iterators, in code points.

Example of use:

char twochars[] = "\xe6\x97\xa5\xd1\x88";
size_t dist = utf8::unchecked::distance(twochars, twochars + 5 );
assert (dist == 2 );

This is a faster but less safe version of utf8::distance . It does not check for validity of the supplied UTF-8 sequence.
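
When the input is known to be valid, the number of code points equals the number of octets that are not trail octets. The sketch below uses that counting technique, which is a swapped-in approach chosen for brevity and not necessarily how the library computes the distance; sketch::count_code_points is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

namespace sketch {

// Counts code points in s, assuming s is valid UTF-8: every code point
// contributes exactly one octet that is not of the form 10xxxxxx.
inline std::ptrdiff_t count_code_points(const std::string& s) {
    return std::count_if(s.begin(), s.end(), [](char c) {
        return (static_cast<unsigned char>(c) & 0xc0) != 0x80;
    });
}

inline void example() {
    assert(count_code_points("\xe6\x97\xa5\xd1\x88") == 2);
}

} // namespace sketch
```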

utf8::unchecked::utf16to8

Available in version 1.0 and later.

Converts a UTF-16 encoded string to UTF-8.

 template < typename u16bit_iterator, typename octet_iterator>
octet_iterator utf16to8 (u16bit_iterator start, u16bit_iterator end, octet_iterator result);

start : an iterator pointing to the beginning of the UTF-16 encoded string to convert.
end : an iterator pointing to past-the-end of the UTF-16 encoded string to convert.
result : an output iterator to the place in the UTF-8 string where to append the result of conversion.
Return value: An iterator pointing to the place after the appended UTF-8 string.

Example of use:

 unsigned short utf16string[] = { 0x41 , 0x0448 , 0x65e5 , 0xd834 , 0xdd1e };
vector< unsigned char > utf8result;
unchecked::utf16to8 (utf16string, utf16string + 5 , back_inserter(utf8result));
assert (utf8result.size() == 10);

This is a faster but less safe version of utf8::utf16to8 . It does not check for validity of the supplied UTF-16 sequence.
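
Conceptually, the conversion decodes each UTF-16 unit or surrogate pair into a code point and re-encodes it as UTF-8. The following self-contained sketch shows that composition; sketch::append_utf8 and sketch::utf16to8_unchecked are hypothetical helpers working on standard containers, and an unpaired lead surrogate at the end of the input is assumed not to occur.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

namespace sketch {

// Appends the UTF-8 encoding of cp to out (no validation).
inline void append_utf8(std::uint32_t cp, std::string& out) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>((cp >> 6) | 0xc0);
        out += static_cast<char>((cp & 0x3f) | 0x80);
    } else if (cp < 0x10000) {
        out += static_cast<char>((cp >> 12) | 0xe0);
        out += static_cast<char>(((cp >> 6) & 0x3f) | 0x80);
        out += static_cast<char>((cp & 0x3f) | 0x80);
    } else {
        out += static_cast<char>((cp >> 18) | 0xf0);
        out += static_cast<char>(((cp >> 12) & 0x3f) | 0x80);
        out += static_cast<char>(((cp >> 6) & 0x3f) | 0x80);
        out += static_cast<char>((cp & 0x3f) | 0x80);
    }
}

// Converts UTF-16 code units to a UTF-8 string without validating the input.
inline std::string utf16to8_unchecked(const std::vector<std::uint16_t>& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xd800 && cp <= 0xdbff) {      // lead surrogate: take the trail too
            const std::uint32_t trail = in[++i];
            cp = 0x10000 + ((cp - 0xd800) << 10) + (trail - 0xdc00);
        }
        append_utf8(cp, out);
    }
    return out;
}

inline void example() {
    const std::string utf8 = utf16to8_unchecked({0x41, 0x0448, 0x65e5, 0xd834, 0xdd1e});
    assert(utf8.size() == 10);   // 1 + 2 + 3 + 4 octets
}

} // namespace sketch
```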

utf8::unchecked::utf8to16

Available in version 1.0 and later.

Converts an UTF-8 encoded string to UTF-16

 template < typename u16bit_iterator, typename octet_iterator>
u16bit_iterator utf8to16 (octet_iterator start, octet_iterator end, u16bit_iterator result);

start : an iterator pointing to the beginning of the UTF-8 encoded string to convert.
end : an iterator pointing to past-the-end of the UTF-8 encoded string to convert.
result : an output iterator to the place in the UTF-16 string where to append the result of conversion.
Return value: An iterator pointing to the place after the appended UTF-16 string.

Example of use:

char utf8_with_surrogates[] = "\xe6\x97\xa5\xd1\x88\xf0\x9d\x84\x9e";
vector < unsigned short > utf16result;
unchecked::utf8to16 (utf8_with_surrogates, utf8_with_surrogates + 9 , back_inserter(utf16result));
assert (utf16result.size() == 4);
assert (utf16result[ 2 ] == 0xd834 );
assert (utf16result[ 3 ] == 0xdd1e );

This is a faster but less safe version of utf8::utf8to16 . It does not check for validity of the supplied UTF-8 sequence.

utf8::unchecked::utf32to8

Available in version 1.0 and later.

Converts a UTF-32 encoded string to UTF-8.

 template < typename octet_iterator, typename u32bit_iterator>
octet_iterator utf32to8 (u32bit_iterator start, u32bit_iterator end, octet_iterator result);

start : an iterator pointing to the beginning of the UTF-32 encoded string to convert.
end : an iterator pointing to past-the-end of the UTF-32 encoded string to convert.
result : an output iterator to the place in the UTF-8 string where to append the result of conversion.
Return value: An iterator pointing to the place after the appended UTF-8 string.

Example of use:

 int utf32string[] = { 0x448 , 0x65e5 , 0x10346 , 0 };
vector< unsigned char > utf8result;
unchecked::utf32to8 (utf32string, utf32string + 3, back_inserter(utf8result));
assert (utf8result.size() == 9);

This is a faster but less safe version of utf8::utf32to8 . It does not check for validity of the supplied UTF-32 sequence.

utf8::unchecked::utf8to32

Available in version 1.0 and later.

Converts a UTF-8 encoded string to UTF-32.

 template < typename octet_iterator, typename u32bit_iterator>
u32bit_iterator utf8to32 (octet_iterator start, octet_iterator end, u32bit_iterator result);

start : an iterator pointing to the beginning of the UTF-8 encoded string to convert.
end : an iterator pointing to past-the-end of the UTF-8 encoded string to convert.
result : an output iterator to the place in the UTF-32 string where to append the result of conversion.
Return value: An iterator pointing to the place after the appended UTF-32 string.

Example of use:

char twochars[] = "\xe6\x97\xa5\xd1\x88";
vector< int > utf32result;
unchecked::utf8to32 (twochars, twochars + 5 , back_inserter(utf32result));
assert (utf32result.size() == 2);

This is a faster but less safe version of utf8::utf8to32 . It does not check for validity of the supplied UTF-8 sequence.

utf8::unchecked::replace_invalid

Available in version 3.1 and later.

Replaces all invalid UTF-8 sequences within a string with a replacement marker.

 template < typename octet_iterator, typename output_iterator>
output_iterator replace_invalid (octet_iterator start, octet_iterator end, output_iterator out, utfchar32_t replacement);
template < typename octet_iterator, typename output_iterator>
output_iterator replace_invalid (octet_iterator start, octet_iterator end, output_iterator out);

octet_iterator : an input iterator.
output_iterator : an output iterator.
start : an iterator pointing to the beginning of the UTF-8 string to look for invalid UTF-8 sequences.
end : an iterator pointing to past-the-end of the UTF-8 string to look for invalid UTF-8 sequences.
out : An output iterator to the range where the result of replacement is stored.
replacement : A Unicode code point for the replacement marker. The version without this parameter assumes the value 0xfffd
Return value: An iterator pointing to the place after the UTF-8 string with replaced invalid sequences.

Example of use:

char invalid_sequence[] = "a\x80\xe0\xa0\xc0\xaf\xed\xa0\x80z";
vector<char> replace_invalid_result;
unchecked::replace_invalid (invalid_sequence, invalid_sequence + sizeof (invalid_sequence), back_inserter(replace_invalid_result), '?');
bool bvalid = utf8::is_valid(replace_invalid_result.begin(), replace_invalid_result.end());
assert (bvalid);
const char* fixed_invalid_sequence = "a????z";
assert (std::equal(replace_invalid_result.begin(), replace_invalid_result.end(), fixed_invalid_sequence));

Unlike utf8::replace_invalid , this function does not verify validity of the replacement marker.

Types From utf8::unchecked Namespace

utf8::unchecked::iterator

Available in version 2.0 and later.

Adapts the underlying octet iterator to iterate over the sequence of code points, rather than raw octets.

 template < typename octet_iterator>
class iterator ;

Member functions

iterator(); the default constructor; the underlying octet_iterator is constructed with its default constructor.

explicit iterator (const octet_iterator& octet_it); a constructor that initializes the underlying octet_iterator with octet_it .