Mason Clayton 75172cd11b
fix: utf8 -> utf16 decoding bug on surrogate pairs (#1486)
* fix utf8 -> utf16 decoding bug on surrogate pairs

This fixes https://github.com/protobufjs/protobuf.js/issues/1473

The custom utf8 -> utf16 decoder appears to be subtly flawed. From my reading, the chunking mechanism doesn't account for a surrogate pair landing at the end of a chunk, which makes chunk sizes variable. A larger chunk followed by a smaller chunk leaves behind garbage that gets included in the latter chunk.
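
To make the failure mode concrete, here is a minimal sketch of the pattern described above. It is not the library's code: `joinChunked`, `LIMIT`, and the `truncate` flag are hypothetical names, and the chunk size is tiny so the effect is easy to see. The reused `chunk` array grows by one slot when a surrogate pair straddles the limit, and a later flush that passes the whole array picks up the stale low surrogate.

```javascript
// Hypothetical illustration only; names and chunk size are assumptions.
const LIMIT = 4; // tiny chunk size so the overflow is easy to trigger

// Converts code points to a string through a reused chunk array.
// With truncate === false, full flushes pass the whole array,
// including any stale slots left over from a longer earlier chunk.
function joinChunked(codePoints, truncate) {
    const parts = [];
    const chunk = []; // reused across flushes, never shrunk
    let i = 0;        // number of valid code units in chunk
    for (const cp of codePoints) {
        if (cp > 0xFFFF) {
            // Supplementary code point: emits TWO UTF-16 code units.
            const t = cp - 0x10000;
            chunk[i++] = 0xD800 + (t >> 10);
            chunk[i++] = 0xDC00 + (t & 0x3FF);
        } else {
            chunk[i++] = cp;
        }
        if (i >= LIMIT) {
            const units = truncate ? chunk.slice(0, i) : chunk;
            parts.push(String.fromCharCode.apply(String, units));
            i = 0;
        }
    }
    if (i) parts.push(String.fromCharCode.apply(String, chunk.slice(0, i)));
    return parts.join("");
}

// "abc", U+1F600, "wxyz", "q": the pair makes the first chunk 5 units long.
const input = [0x61, 0x62, 0x63, 0x1F600, 0x77, 0x78, 0x79, 0x7A, 0x71];
const expected = String.fromCodePoint(...input);
const buggy = joinChunked(input, false);
const fixed = joinChunked(input, true);
```

In this sketch the second full chunk holds only four fresh code units, but the array still has a fifth slot from the first flush, so the untruncated variant emits an extra lone surrogate after "wxyz".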

It looks like the chunking mechanism was added to prevent stack overflows when calling `fromCharCode` with too many args. From some benchmarking, it appears that putting utf16 code units in an array and spreading that into `fromCharCode` wasn't helping performance much anyway, so I simplified it significantly.
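
A chunk-free decoder along these lines might look like the sketch below. This is an illustration rather than the code in this commit: `decodeUtf8` is a hypothetical name, and the sketch assumes well-formed UTF-8 input (no validation of continuation bytes or truncated sequences). It appends code units to the result directly, so there is no shared array that can carry stale data between flushes.

```javascript
// Hypothetical sketch; assumes well-formed UTF-8 and does no validation.
function decodeUtf8(buffer, start, end) {
    let str = "";
    let i = start;
    while (i < end) {
        const t = buffer[i++];
        if (t < 0x80) {
            // 1-byte sequence (ASCII)
            str += String.fromCharCode(t);
        } else if (t < 0xE0) {
            // 2-byte sequence
            str += String.fromCharCode(((t & 0x1F) << 6) | (buffer[i++] & 0x3F));
        } else if (t < 0xF0) {
            // 3-byte sequence
            const b1 = buffer[i++] & 0x3F;
            const b2 = buffer[i++] & 0x3F;
            str += String.fromCharCode(((t & 0x0F) << 12) | (b1 << 6) | b2);
        } else {
            // 4-byte sequence: becomes a UTF-16 surrogate pair
            const b1 = buffer[i++] & 0x3F;
            const b2 = buffer[i++] & 0x3F;
            const b3 = buffer[i++] & 0x3F;
            const cp = (((t & 0x07) << 18) | (b1 << 12) | (b2 << 6) | b3) - 0x10000;
            str += String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
        }
    }
    return str;
}

// Round-trip check against the platform encoder.
const sample = "abc \u00E9 \u20AC \u{1F600} xyz";
const bytes = new TextEncoder().encode(sample);
const decoded = decodeUtf8(bytes, 0, bytes.length);
```

Each call to `String.fromCharCode` here takes at most two arguments, so the argument-count limit that motivated chunking never comes into play.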

Here's a repro of the existing decoding bug in a fuzzing suite:
https://repl.it/@turbio/oh-no-our-strings#decoder.js

* fix lint

* add test case for surrogate pair bug

Co-authored-by: Alexander Fenster <fenster@google.com>
2020-10-09 15:54:17 -07:00

@protobufjs/utf8


A minimal UTF8 implementation for number arrays.

API

  • utf8.length(string: string): number
    Calculates the UTF8 byte length of a string.

  • utf8.read(buffer: Uint8Array, start: number, end: number): string
    Reads UTF8 bytes as a string.

  • utf8.write(string: string, buffer: Uint8Array, offset: number): number
    Writes a string as UTF8 bytes.
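
As a rough illustration of what the `length` and `write` signatures above describe, the following self-contained sketch computes a UTF8 byte length and writes the bytes into a `Uint8Array`. These are hypothetical stand-ins (`utf8ByteLength`, `utf8Write`), not the package's source, and the sketch assumes that the number returned by `write` is the count of bytes written.

```javascript
// Hypothetical stand-ins for illustration; not the package's implementation.

// True when the code units at i and i + 1 form a surrogate pair.
function isPairAt(string, i) {
    return (string.charCodeAt(i) & 0xFC00) === 0xD800 &&
           (string.charCodeAt(i + 1) & 0xFC00) === 0xDC00;
}

// Counts the bytes needed to encode the string as UTF8.
function utf8ByteLength(string) {
    let len = 0;
    for (let i = 0; i < string.length; ++i) {
        const c = string.charCodeAt(i);
        if (c < 0x80) len += 1;
        else if (c < 0x800) len += 2;
        else if (isPairAt(string, i)) { ++i; len += 4; } // pair -> 4 bytes
        else len += 3;
    }
    return len;
}

// Writes the string as UTF8 bytes at offset.
// Assumption: the return value is the number of bytes written.
function utf8Write(string, buffer, offset) {
    const start = offset;
    for (let i = 0; i < string.length; ++i) {
        let c = string.charCodeAt(i);
        if (c < 0x80) {
            buffer[offset++] = c;
        } else if (c < 0x800) {
            buffer[offset++] = (c >> 6) | 0xC0;
            buffer[offset++] = (c & 0x3F) | 0x80;
        } else if (isPairAt(string, i)) {
            // Combine the pair into one code point, then emit 4 bytes.
            c = 0x10000 + ((c & 0x3FF) << 10) + (string.charCodeAt(++i) & 0x3FF);
            buffer[offset++] = (c >> 18) | 0xF0;
            buffer[offset++] = ((c >> 12) & 0x3F) | 0x80;
            buffer[offset++] = ((c >> 6) & 0x3F) | 0x80;
            buffer[offset++] = (c & 0x3F) | 0x80;
        } else {
            buffer[offset++] = (c >> 12) | 0xE0;
            buffer[offset++] = ((c >> 6) & 0x3F) | 0x80;
            buffer[offset++] = (c & 0x3F) | 0x80;
        }
    }
    return offset - start;
}

// Size the buffer with the length function, then write into it.
const text = "h\u00E9llo \u20AC \u{1F600}";
const out = new Uint8Array(utf8ByteLength(text));
const written = utf8Write(text, out, 0);
```

Sizing the buffer with the length function before writing avoids a second pass or a resize, which is presumably why the API exposes the two operations separately.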

License: BSD 3-Clause License