Go string encoding? UTF-8? Unicode? It all makes sense after this!

Go byte rune string

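In Go, a string is conventionally stored as UTF-8 encoded text. Its underlying storage can be viewed at two granularities: broken down to bytes, the unit is byte; broken down to characters, the unit is rune. This article introduces some basic concepts of character encoding, explains the relationship between the three types in detail, and walks through some practical string operations.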

I. Basic concepts

This part covers the relationship between Unicode and UTF-8 and the UTF-8 encoding rules.

1. Unicode

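Unicode is a character encoding standard used on computers. It assigns every character of every language a unified and unique numeric code (a code point), so that text can be converted and processed across languages and platforms. In essence, Unicode defines a one-to-one mapping between characters and numbers, so it is an encoding of individual characters.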

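For strings, Unicode alone does not say how a sequence of code points is laid out in bytes: code points differ in length, and a raw run of code-point bytes cannot be split back into characters unambiguously. For example, the Chinese character "南" has the code point 0x5357. Those two bytes can be read as the single character "南", or as 0x53 (S) followed by 0x57 (W). Raw code points therefore cannot serve as a string encoding on their own, because the computer cannot tell where to split the bytes or which bytes form one character. What is needed is an encoding built on top of Unicode that spends some redundant bits to mark the extent of each character, so that concatenated characters can always be split correctly. UTF-8 is such an encoding.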

2. UTF-8

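UTF-8 is a variable-length character encoding for Unicode that can represent every character in the Unicode standard. UTF-8 is thus one implementation of Unicode: Unicode defines the one-to-one mapping for individual characters, while UTF-8 defines how those characters are serialized into bytes and combined. UTF-16 and UTF-32 are similar encodings, though somewhat less universally used than UTF-8.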

Encoding rules

• 0000 0000-0000 007F (code points of up to 7 bits; ASCII, excluding the extended range 128+)
• 0xxxxxxx
• 0000 0080-0000 07FF (code points of 8 to 11 bits)
• 110xxxxx 10xxxxxx
• 0000 0800-0000 FFFF (code points of 12 to 16 bits)
• 1110xxxx 10xxxxxx 10xxxxxx
• 0001 0000-0010 FFFF (code points of 17 to 21 bits)
• 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Summary

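1. For ASCII characters (excluding the extended range 128+), the UTF-8 encoding, the Unicode code point, and the ASCII code are all identical: a single byte starting with 0.
2. For non-ASCII characters, if the encoded character takes n bytes, the first byte starts with n ones followed by a single 0, and every remaining byte starts with 10. The bits left after removing these fixed prefixes, concatenated, form the Unicode code point.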
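The byte-length rules above can be checked against Go's standard library. The following is a minimal sketch; the helper name encodedLen and the sample characters are chosen here for illustration and are not from the original article.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen reports how many bytes UTF-8 needs for r.
// Hypothetical helper; it simply wraps utf8.RuneLen.
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	// One sample character from each of the four ranges.
	for _, r := range []rune{'A', 'é', '南', '😀'} {
		buf := make([]byte, utf8.UTFMax)
		n := utf8.EncodeRune(buf, r)
		// Print the code point, the byte count, and each byte in binary
		// so the 0 / 110 / 1110 / 11110 and 10 prefixes are visible.
		fmt.Printf("U+%04X %d byte(s): %08b\n", r, encodedLen(r), buf[:n])
	}
}
```

Reading the binary output, the first byte of each multi-byte character starts with as many ones as the character has bytes, and every following byte starts with 10.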

Conversion (UTF-8 sequences of two or more bytes)

UTF-8 to Unicode

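Split the UTF-8 data into bytes, strip the marker bits from the head of each byte according to the encoding rules, and concatenate the remaining bits to obtain the Unicode code point.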

Unicode to UTF-8

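Starting from the low-order end, take 6 bits at a time and prefix them with 10 to form a trailing byte. When fewer than 6 bits remain, prefix them with the matching n ones and a single 0 to form the first byte, padding its unused high-order positions with 0. If the remaining bits do not fit into the first byte after its prefix, add one more byte and apply the same rule.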

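(In practice: estimate the byte count from the rules, write down the fixed prefix bits of each byte first, then fill in the code point bits from the low end.)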

Worked example

UTF-8 to Unicode

The character "南" has the UTF-8 encoding 0xe58d97 in hexadecimal, or 11100101 10001101 10010111 in binary.

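Removing the 1110 from the head of the first byte and the 10 from the heads of the second and third bytes leaves 0101 001101 010111, which regroups as 0101 0011 0101 0111, i.e. the Unicode code point 0x5357.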
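The same decoding step can be written with bit masks. This is a minimal sketch for the three-byte case only; decode3 is a hypothetical helper written for this example, and it performs no validation of its input.

```go
package main

import "fmt"

// decode3 strips the 1110 / 10 / 10 prefixes from a three-byte UTF-8
// sequence and joins the remaining 4 + 6 + 6 bits into a code point.
// Hypothetical helper for illustration; it does not validate the bytes.
func decode3(b0, b1, b2 byte) rune {
	return rune(b0&0x0F)<<12 | rune(b1&0x3F)<<6 | rune(b2&0x3F)
}

func main() {
	r := decode3(0xE5, 0x8D, 0x97)
	fmt.Printf("U+%04X %c\n", r, r) // prints "U+5357 南"
}
```

In real code, utf8.DecodeRune from the standard library handles all sequence lengths and invalid input.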

Unicode to UTF-8

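The character "南" has the Unicode code point 0x5357 in hexadecimal, or 0101 0011 0101 0111 in binary (15 significant bits). That falls in the 0800-FFFF range, so its UTF-8 encoding takes 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx.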

Filling in from the low end: 11100101 10001101 10010111
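The reverse direction is the same arithmetic with shifts. The sketch below assumes the code point lies in the three-byte range (0x0800 to 0xFFFF); encode3 is a hypothetical helper and does not check that assumption.

```go
package main

import "fmt"

// encode3 builds the three-byte UTF-8 form of r by writing the fixed
// prefixes (1110, 10, 10) and filling in 4 + 6 + 6 bits of the code point.
// Hypothetical helper; it assumes 0x0800 <= r <= 0xFFFF.
func encode3(r rune) [3]byte {
	return [3]byte{
		0xE0 | byte(r>>12),     // 1110xxxx: top 4 bits
		0x80 | byte(r>>6)&0x3F, // 10xxxxxx: middle 6 bits
		0x80 | byte(r)&0x3F,    // 10xxxxxx: low 6 bits
	}
}

func main() {
	b := encode3(0x5357)
	fmt.Printf("% x\n", b[:]) // prints "e5 8d 97"
}
```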

3. UCA (Unicode Collation Algorithm)

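UCA is the collation algorithm for Unicode characters. At the time of writing, the latest version is 15.0.0 (2022-05-03 12:36). Taking 14.0.0 as the reference, the data files consist mainly of two parts, allkeys and decomps, which describe the ordering, case, and decomposition relationships of the character set; see the official Unicode documentation for details. UCA differs between versions: two characters may have a case relationship defined in 14.0.0 but not in 5.0.0. An application that only supports 5.0.0 may still handle characters added in 14.0.0 through hard-coding, depending on its implementation. For a cross-platform, multi-language business, the services involved may therefore well be using different UCA versions, and for some characters the differing collation and case-conversion rules can produce inconsistent results.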

II. byte, rune, string

1. Type definitions

All three are built-in Go types, with type definitions in the builtin package:

// byte is an alias for uint8 and is equivalent to uint8 in all ways. It is
// used, by convention, to distinguish byte values from 8-bit unsigned
// integer values.
type byte = uint8

// rune is an alias for int32 and is equivalent to int32 in all ways. It is
// used, by convention, to distinguish character values from integer values.
type rune = int32

// string is the set of all strings of 8-bit bytes, conventionally but not
// necessarily representing UTF-8-encoded text. A string may be empty, but
// not nil. Values of string type are immutable.
type string string

byte is an alias for uint8 and is normally used to represent one byte (8 bits).

rune is an alias for int32 and is normally used to represent one character (32 bits).

string is a sequence of 8-bit bytes that usually holds UTF-8 encoded text.

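Going by the official definitions, a string is a sequence of bytes, each 8 bits wide, usually but not necessarily UTF-8 encoded. A rune is a four-byte value representing one character, and its value is the Unicode code point of that character.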

str := "南"

The string "南" is three bytes, 0xe58d97, in UTF-8, so as a byte slice it is:

byteList := []byte{0xe5,0x8d,0x97}

The three bytes together represent a single character, so the rune is the corresponding Unicode code point, 0x5357:

runeList := []rune{0x5357}

Although str, byteList, and runeList above have different types (a string, a byte slice, and a rune slice), all three represent the Chinese character "南".
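A short program can confirm that the three values describe the same text. This is an illustrative sketch; sameText is a hypothetical helper name, and the variables reuse the names from the snippets above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sameText reports whether the byte slice and the rune slice from the
// article both convert to the string "南". Hypothetical helper.
func sameText() bool {
	str := "南"
	byteList := []byte{0xe5, 0x8d, 0x97}
	runeList := []rune{0x5357}
	return string(byteList) == str && string(runeList) == str
}

func main() {
	str := "南"
	fmt.Println(sameText())                  // true
	fmt.Println(len(str))                    // 3: len counts bytes
	fmt.Println(utf8.RuneCountInString(str)) // 1: one character
	fmt.Println([]byte(str), []rune(str))    // [229 141 151] [21335]
}
```

Note that len on a string counts bytes, not characters, which is why the single character reports a length of 3.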

2. Type conversion

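The conversion syntax itself does not lead you to the implementation. To find the source code a conversion actually calls, look at the Plan 9 assembly output.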

package main

import "fmt"

func main() {
    byteList := []byte{0xe5, 0x8d, 0x97} // UTF-8 bytes of "南"
    str := string(byteList)              // []byte -> string conversion
    fmt.Println(str)
}

The example above defines a byte slice (representing the character "南"), converts it to a string, and prints it.

go tool compile -S -N -l main.go

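Compiling the code from the command line with optimizations disabled (-N) and inlining disabled (-l) prints the following assembly (only the type conversion is shown):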

0x0074 00116 (main.go:7)        MOVD    ZR, 8(RSP)
0x0078 00120 (main.go:7)        MOVD    R0, 16(RSP)
0x007c 00124 (main.go:7)        MOVD    R1, 24(RSP)
0x0080 00128 (main.go:7)        PCDATA  $1, ZR
0x0080 00128 (main.go:7)        CALL    runtime.slicebytetostring(SB)
0x0084 00132 (main.go:7)        MOVD    32(RSP), R0
0x0088 00136 (main.go:7)        MOVD    40(RSP), R1
0x008c 00140 (main.go:7)        MOVD    R0, "".str-80(SP)
0x0090 00144 (main.go:7)        MOVD    R1, "".str-72(SP)

As shown, the conversion is actually a call to the slicebytetostring function in the runtime package.

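The source for every conversion among the three types can be located from the assembly in the same way; []byte -> string is just the example used here.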

rune to []byte (string)

The encoderune function takes a rune, converts it to bytes following the UTF-8 encoding rules, writes them into p, and returns the number of bytes written.

// encoderune writes into p (which must be large enough) the UTF-8 encoding of the rune.
// It returns the number of bytes written.
func encoderune(p []byte, r rune) int {
    // Negative values are erroneous. Making it unsigned addresses the problem.
    switch i := uint32(r); {
    case i <= rune1Max:
        p[0] = byte(r)
        return 1
    case i <= rune2Max:
        _ = p[1] // eliminate bounds checks
        p[0] = t2 | byte(r>>6)
        p[1] = tx | byte(r)&maskx
        return 2
    case i > maxRune, surrogateMin <= i && i <= surrogateMax:
        r = runeError
        fallthrough
    case i <= rune3Max:
        _ = p[2] // eliminate bounds checks
        p[0] = t3 | byte(r>>12)
        p[1] = tx | byte(r>>6)&maskx
        p[2] = tx | byte(r)&maskx
        return 3
    default:
        _ = p[3] // eliminate bounds checks
        p[0] = t4 | byte(r>>18)
        p[1] = tx | byte(r>>12)&maskx
        p[2] = tx | byte(r>>6)&maskx
        p[3] = tx | byte(r)&maskx
        return 4
    }
}

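Conversions from rune to []byte and to string are both built on encoderune, which uses hard-coded constants and bit operations to turn a Unicode code point into its UTF-8 encoding ([]byte). From here on we therefore set rune aside and look only at the conversion logic between []byte and string.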
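encoderune is internal to the runtime, but the exported utf8.EncodeRune in unicode/utf8 follows the same rules and can be used to observe the behavior, including the replacement of invalid values with U+FFFD. The sketch below is illustrative; the helper name encode is an assumption of this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode returns the UTF-8 bytes of r as space-separated hex.
// Hypothetical helper built on the exported utf8.EncodeRune.
func encode(r rune) string {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	return fmt.Sprintf("% x", buf[:n])
}

func main() {
	fmt.Println(encode('A'))     // 41
	fmt.Println(encode(0x5357))  // e5 8d 97
	fmt.Println(encode(0x1F600)) // f0 9f 98 80
	// Surrogates and out-of-range values are encoded as U+FFFD,
	// matching the runeError branch in the listing above.
	fmt.Println(encode(0xD800)) // ef bf bd
	fmt.Println(encode(-1))     // ef bf bd
}
```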

[]byte to string

// slicebytetostring converts a byte slice to a string.
// It is inserted by the compiler into generated code.
// ptr is a pointer to the first element of the slice;
// n is the length of the slice.
// Buf is a fixed-size buffer for the result,
// it is not nil if the result does not escape.
func slicebytetostring(buf *tmpBuf, ptr *byte, n int) (str string) {
    /*
        Some cases (race, msan, n = 0 or 1, etc.) are omitted here.
    */

    var p unsafe.Pointer
    if buf != nil && n <= len(buf) {
        p = unsafe.Pointer(buf)
    } else {
        p = mallocgc(uintptr(n), nil, false)
    }
    stringStructOf(&str).str = p
    stringStructOf(&str).len = n
    memmove(p, unsafe.Pointer(ptr), uintptr(n))
    return
}

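tmpBuf is a fixed-size array of 32 bytes, and buf is a pointer to one. When the data is longer than 32 bytes it cannot be held in tmpBuf, so a new block of memory is allocated to store the string.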

const tmpStringBufSize = 32

type tmpBuf [tmpStringBufSize]byte

stringStructOf reinterprets a string pointer as a pointer to the runtime's internal stringStruct type, so that the data pointer and len of the string can be set.

type stringStruct struct {
   str unsafe.Pointer
   len int
}

func stringStructOf(sp *string) *stringStruct {
    return (*stringStruct)(unsafe.Pointer(sp))
}

Whether tmpBuf is used or new memory is allocated on the heap, the underlying data is always copied with memmove.

string to []byte

func stringtoslicebyte(buf *tmpBuf, s string) []byte {
   var b []byte
   if buf != nil && len(s) <= len(buf) {
      *buf = tmpBuf{}
      b = buf[:len(s)]
   } else {
      b = rawbyteslice(len(s))
   }
   copy(b, s)
   return b
}

This follows the same pattern: depending on the len of the string, it either uses tmpBuf or allocates new memory, then copies the underlying data with copy.
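The copy is observable from ordinary Go code: after a conversion, changing one side never affects the other. A minimal sketch, with hypothetical helper names:

```go
package main

import "fmt"

// convertThenMutate converts a byte slice to a string, then changes the
// slice. It returns the string and the slice's new contents.
// Hypothetical helper for illustration.
func convertThenMutate() (string, string) {
	b := []byte("abc")
	s := string(b) // the bytes are copied into the new string
	b[0] = 'X'     // only the slice's own storage changes
	return s, string(b)
}

func main() {
	s, b := convertThenMutate()
	fmt.Println(s, b) // abc Xbc

	// The same holds in the other direction.
	str := "abc"
	bs := []byte(str) // copy of the string's bytes
	bs[0] = 'Y'
	fmt.Println(str, string(bs)) // abc Ybc
}
```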

III. Hands-on practice

1. Optimizing type-conversion performance

Go's built-in conversions between []byte and string always copy memory, so in scenarios that convert frequently, the volume of copying can degrade performance.

type stringStruct struct {
   str unsafe.Pointer
   len int
}

type slice struct {
   array unsafe.Pointer
   len   int
   cap   int
}

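Both headers are in essence a few word-sized fields: a data pointer plus lengths. The only difference is that []byte carries an extra cap for the capacity of the slice. So a string can be viewed as a [2]uintptr and a []byte as a [3]uintptr, and a conversion only needs to reinterpret one header as the other instead of copying the underlying data.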
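The two-word and three-word layouts can be checked with unsafe.Sizeof. This sketch assumes the standard gc toolchain, where the headers match the structs shown above; headerWords is a hypothetical helper.

```go
package main

import (
	"fmt"
	"unsafe"
)

// headerWords returns the size of a string header and of a slice header,
// measured in machine words (the size of a uintptr).
// Hypothetical helper for illustration.
func headerWords() (strWords, sliceWords int) {
	word := unsafe.Sizeof(uintptr(0))
	var s string
	var b []byte
	return int(unsafe.Sizeof(s) / word), int(unsafe.Sizeof(b) / word)
}

func main() {
	s, b := headerWords()
	fmt.Println(s, b) // 2 3
}
```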

The following is the solution fasthttp provides based on this idea for high-performance conversion between string and []byte.

// b2s converts byte slice to a string without memory allocation.
// See https://groups.google.com/forum/#!msg/Golang-Nuts/ENgbUzYvCuU/90yGx7GUAgAJ .
//
// Note it may break if string and/or slice header will change
// in the future go versions.
func b2s(b []byte) string {
    /* #nosec G103 */
    return *(*string)(unsafe.Pointer(&b))
}

// s2b converts string to a byte slice without memory allocation.
//
// Note it may break if string and/or slice header will change
// in the future go versions.
func s2b(s string) (b []byte) {
    /* #nosec G103 */
    bh := (*reflect.SliceHeader)(unsafe.Pointer(&b))
    /* #nosec G103 */
    sh := (*reflect.StringHeader)(unsafe.Pointer(&s))
    bh.Data = sh.Data
    bh.Cap = sh.Len
    bh.Len = sh.Len
    return b
}

Converting []byte to string only has to drop cap, so it can be done directly through unsafe.Pointer.

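Converting string to []byte requires copying the pointer and setting Cap to Len. This is a subtle detail of the approach: a string has a fixed length, and it is not known whether the memory following its data is writable. If Cap were larger than Len, append would not trigger slice growth and would write into that following memory, and if that memory is not writable the program would crash at run time.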
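Since Go 1.20 the unsafe package offers String, StringData, Slice, and SliceData, which express the same zero-copy idea without depending on the reflect header types. The sketch below is an alternative under that version assumption, not fasthttp's code; the function names are hypothetical. The usual caveats still apply: the bytes behind a string must never be modified, and the source slice must not be modified while the string is in use.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString returns a string that shares b's storage (no copy).
// Hypothetical helper; requires Go 1.20 or later.
func bytesToString(b []byte) string {
	return unsafe.String(unsafe.SliceData(b), len(b))
}

// stringToBytes returns a slice that shares s's storage (no copy).
// The result has len == cap == len(s) and must be treated as read-only.
func stringToBytes(s string) []byte {
	return unsafe.Slice(unsafe.StringData(s), len(s))
}

func main() {
	b := []byte("hello")
	s := bytesToString(b)
	fmt.Println(s) // hello

	bs := stringToBytes("go")
	fmt.Println(len(bs), cap(bs)) // 2 2
}
```

Because the capacity equals the length, any append to the returned slice allocates fresh memory instead of writing past the end of the string data.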

2. UCA inconsistency

In Go, the corresponding tables are defined in unicode/tables.go, whose header states the Unicode version they are derived from.

// Version is the Unicode edition from which the tables are derived.
const Version = "13.0.0"

Tracing back through the history, tables.go has used version 6.0.0 since Go 1, at a slightly different location from today.

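According to the UCA-related material in the official MySQL documentation, MySQL uses different UCA versions depending on the encoding in use. It is therefore quite likely that the UCA used by the underlying database differs from the one used by the business layer. In case-insensitive scenarios this can lead to characters being recognized differently: the business layer may consider two characters an upper/lower case pair, while MySQL, using an older UCA version, fails to find the uppercase data when a case-insensitive query is made with the lowercase form.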

Because commonly used character sets rarely change, UCA inconsistencies have essentially no impact on ordinary business applications.
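On the Go side, one way to see which tables a binary was built with, and whether it treats two characters as a case pair, is to consult the unicode package directly. This is a minimal sketch; isCasePair is a hypothetical helper, and the printed version depends on the Go release used to build the program.

```go
package main

import (
	"fmt"
	"unicode"
)

// isCasePair reports whether the Unicode tables compiled into this
// binary map lower to upper and upper back to lower.
// Hypothetical helper for illustration.
func isCasePair(lower, upper rune) bool {
	return lower != upper &&
		unicode.ToUpper(lower) == upper &&
		unicode.ToLower(upper) == lower
}

func main() {
	// The version string varies with the Go toolchain.
	fmt.Println("Unicode tables version:", unicode.Version)
	fmt.Println(isCasePair('a', 'A')) // true
	fmt.Println(isCasePair('a', 'B')) // false
}
```

Running the same check for a suspect character in each service, and comparing against the result of a case-insensitive query in the database, is one way to spot the kind of mismatch described above.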

Source: https://mp.weixin.qq.com/s/9Pd7W9OZqrteNMKvvswVZQ