C++: converting an ASCII string with Unicode escapes to a UTF-8 string

Question

I need to read a plain ASCII string containing Unicode escapes and convert it to a std::string holding the UTF-8 encoded equivalent. For example, "\u03a0" (a std::string with 6 characters) should be converted to a std::string with two characters, 0xce and 0xa0 respectively, in raw binary.

I would be very happy if there were a simple answer using ICU or Boost, but I have not been able to find one.

(This is similar to the question "Convert a Unicode string to an escaped ASCII string", but note that I ultimately need to end up with UTF-8. If we can use Unicode as an intermediate step, that is fine.)

c++ utf-8 unicode

3 answers:

Bulletmagnet · Accepted Answer · 2015-02-17 10:19:43

(\u03a0 is the Unicode code point for the Greek capital letter Pi, whose UTF-8 encoding is 0xCE 0xA0.)
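
The arithmetic behind those two bytes can be checked with a few lines of C++. This is only an illustrative sketch of the two-byte case (code points U+0080 to U+07FF); the name encode_two_byte is made up for this example and does not come from any library.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes a code point in the range U+0080..U+07FF
// as two UTF-8 bytes laid out as 110xxxxx 10xxxxxx.
std::string encode_two_byte(std::uint32_t cp)
{
    std::string out;
    out += static_cast<char>(0xC0 | (cp >> 6));    // lead byte: top 5 bits
    out += static_cast<char>(0x80 | (cp & 0x3F));  // continuation byte: low 6 bits
    return out;
}
```

For U+03A0, 0x3A0 >> 6 is 0x0E and 0x3A0 & 0x3F is 0x20, so the bytes are 0xC0 | 0x0E = 0xCE and 0x80 | 0x20 = 0xA0.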

You need to:

1. Get the number 0x03a0 from the string "\u03a0": drop the backslash and the u, and parse 03a0 as hexadecimal into a wchar_t. Repeat until you have a (wide) string.

2. Convert 0x3a0 to UTF-8. C++11 has the codecvt_utf8 facet, which can help.
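
The two steps might look roughly like this for a single escape. It is a minimal sketch, assuming the input is exactly the six characters "\uXXXX"; the function name escape_to_utf8 is hypothetical. Note that std::wstring_convert and std::codecvt_utf8 are deprecated since C++17, so a compiler may warn about them, and std::stoul throws if the text after "\u" is not hexadecimal.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: expects exactly the six characters "\uXXXX".
std::string escape_to_utf8(const std::string& escape)
{
    // Step 1: skip the backslash and the 'u', then parse the rest as hex.
    const unsigned long value = std::stoul(escape.substr(2), nullptr, 16);

    // Step 2: let the codecvt_utf8 facet produce the UTF-8 bytes.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes(static_cast<char32_t>(value));
}
```

Calling escape_to_utf8("\\u03a0") should give the two bytes 0xCE 0xA0. This sketch uses char32_t rather than wchar_t so that the result does not depend on the platform's wchar_t width.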

Remy Lebeau · Accepted Answer · 2015-02-17 04:35:28

Try something like this:

#include <cstdint>
#include <sstream>
#include <string>

std::string to_utf8(uint32_t cp)
{
    /*
    If you are using C++11 or later, you can do this:

    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv;
    return conv.to_bytes( (char32_t)cp );

    Otherwise...
    */

    std::string result;

    // Number of bytes in the UTF-8 encoding of cp.
    int count;
    if (cp < 0x0080)
        count = 1;
    else if (cp < 0x0800)
        count = 2;
    else if (cp < 0x10000)
        count = 3;
    else if (cp <= 0x10FFFF)
        count = 4;
    else
        return result; // or throw an exception

    result.resize(count);

    if (count > 1)
    {
        // Continuation bytes, last to first: 10xxxxxx, six bits each.
        for (int i = count-1; i > 0; --i)
        {
            result[i] = (char) (0x80 | (cp & 0x3F));
            cp >>= 6;
        }

        // Lead byte marker: set the top `count` bits.
        for (int i = 0; i < count; ++i)
            cp |= (1 << (7-i));
    }

    // For count == 1 this is the plain ASCII value.
    result[0] = (char) cp;

    return result;
}

std::string str = ...; // "\\u03a0"
std::string::size_type startIdx = 0;
do
{
    startIdx = str.find("\\u", startIdx);
    if (startIdx == std::string::npos) break;

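    // Note: this takes every hex digit that follows "\u", not just four.
    std::string::size_type endIdx = str.find_first_not_of("0123456789abcdefABCDEF", startIdx+2);
    if (endIdx == std::string::npos) endIdx = str.length(); // the escape runs to the end of the string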

    std::string tmpStr = str.substr(startIdx+2, endIdx-(startIdx+2));
    std::istringstream iss(tmpStr);

    uint32_t cp;
    if (iss >> std::hex >> cp)
    {
        std::string utf8 = to_utf8(cp);
        str.replace(startIdx, 2+tmpStr.length(), utf8);
        startIdx += utf8.length();
    }
    else
        startIdx += 2;
}
while (true);
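
If the escapes are always exactly four hex digits, a single-pass variant that builds a new string is another option. The following is a sketch under that assumption, not the answer's own method: the names append_utf8 and unescape_unicode are hypothetical, the encoder only covers code points up to U+FFFF (all that four hex digits can express), and surrogate pairs are not combined.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: appends the UTF-8 encoding of a code point <= U+FFFF.
static void append_utf8(std::string& out, std::uint32_t cp)
{
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical single-pass variant: copies the input and replaces every
// "\u" that is followed by exactly four hex digits.
std::string unescape_unicode(const std::string& in)
{
    std::string out;
    out.reserve(in.size());

    std::size_t i = 0;
    while (i < in.size()) {
        // An escape needs six characters: '\', 'u', and four hex digits.
        bool is_escape = i + 6 <= in.size() && in[i] == '\\' && in[i + 1] == 'u';
        for (std::size_t k = 2; is_escape && k < 6; ++k)
            is_escape = std::isxdigit(static_cast<unsigned char>(in[i + k])) != 0;

        if (is_escape) {
            append_utf8(out, static_cast<std::uint32_t>(
                std::stoul(in.substr(i + 2, 4), nullptr, 16)));
            i += 6;
        } else {
            out += in[i++];  // ordinary character, or an incomplete escape
        }
    }
    return out;
}
```

With this variant, trailing hex-looking text such as the "bc" in "\u03a0bc" is left alone, and an incomplete escape such as "\u12" is copied through unchanged.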

dontsov · Accepted Answer · 2017-12-10 02:16:47

My solution:

convert_unicode_escape_sequences(str)

input: "\u043f\u0440\u0438\u0432\u0435\u0442"
output: "привет"

Boost is used for the wchar/char conversion:

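#include <cstdint>
#include <iterator>
#include <sstream>
#include <string>
#include <boost/locale/encoding_utf.hpp>
#include <boost/numeric/conversion/cast.hpp>

using boost::locale::conv::utf_to_utf;

// Value of one hex digit; anything that is not a hex digit maps to 0.
inline uint8_t hex_value(uint8_t c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    if (c >= 'A' && c <= 'F')
        return c - 'A' + 0x0A;
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 0x0A;
    return 0;
}

// Combine two hex digits (high, low) into one byte.
inline uint8_t get_uint8(uint8_t h, uint8_t l)
{
    return static_cast<uint8_t>((hex_value(h) << 4) | hex_value(l));
}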

std::string wstring_to_utf8(const std::wstring& str)
{
    return utf_to_utf<char>(str.c_str(), str.c_str() + str.size());
}

std::string convert_unicode_escape_sequences(const std::string& source)
{
    std::wstring ws; ws.reserve(source.size());
    std::wstringstream wis(ws);

    auto s = source.begin();
    while (s != source.end())
    {
        if (*s == '\\')
        {
            if (std::distance(s, source.end()) > 5)
            {
                if (*(s + 1) == 'u')
                {
                    unsigned int v = get_uint8(*(s + 2), *(s + 3)) << 8;
                    v |= get_uint8(*(s + 4), *(s + 5));

                    s += 6;
                    wis << boost::numeric_cast<wchar_t>(v);
                    continue;
                }
            }
        }
        wis << wchar_t(*s);
        s++;
    }

    return wstring_to_utf8(wis.str());
}