
The following proposals extend the JSON specification, with the idea
of using JSON as an information interchange format rather than just a
way of writing certain ECMAscript values. They do not add anything;
they only restrict valid JSON content and encoders, with some rationale.
First off, I'd like to remind everyone, including JSON's author, that
JSON is case-sensitive, except in the four hex digits after a
backslash-u sequence in a String.
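As a small illustration (the helper name is mine, not from any
particular parser), this is the one spot where a parser has to accept
both letter cases:

    /* value of a hex digit in a \uXXXX escape; JSON is otherwise
     * case-sensitive, but these four digits come in either case */
    static int
    hexval(int c)
    {
        if (c >= '0' && c <= '9')
            return (c - '0');
        if (c >= 'A' && c <= 'F')
            return (c - 'A' + 10);
        if (c >= 'a' && c <= 'f')
            return (c - 'a' + 10);
        return (-1);    /* not a hex digit */
    }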
Second, I'd like to remind everyone that JSON is not binary-safe. No
way around that: it implements Unicode text (actually, 16-bit UCS-2,
and it doesn't guarantee that UTF-16 surrogates are correctly paired).
I also consider only UTF-8, UTF-16 (BE/LE) and UTF-32 (BE/LE) valid
encodings for JSON. (No PDP endian, either. Sorry, guys.)
For my first proposal, I'd like to point out CVE-2011-4815, which was
about overflowing hashtables. The obvious fix is to randomise the hash
per hashtable; to ensure this doesn't leak, we sort the keys of an
Object ASCIIbetically in the encoder. (Using Unicode is good here: we
can just sort the keys as UTF-8 strings by their uint8_t values, or as
Unicode (UCS-2, or even UCS-4 or UTF-16) strings by their codepoints.)
JSON never guaranteed the order of elements in an Object anyway, so we
make it standardised (we still accept any order, and, when parsing,
in collision cases, the later value wins). This also helps diffs.
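A minimal sketch of that encoder-side sort, assuming the keys are
NUL-free UTF-8 C strings (the names are mine): strcmp compares as
unsigned char by definition, and UTF-8 preserves codepoint order under
bytewise comparison, so this gives the same order as sorting by
codepoint.

    #include <stdlib.h>
    #include <string.h>

    /* compare two Object keys bytewise (ASCIIbetically) */
    static int
    keycmp(const void *a, const void *b)
    {
        return (strcmp(*(const char * const *)a,
            *(const char * const *)b));
    }

    /* sort an Object's keys before emitting its members */
    static void
    sort_keys(const char **keys, size_t nkeys)
    {
        qsort(keys, nkeys, sizeof(*keys), keycmp);
    }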
For my second proposal, I'd like to forbid \u0000, \uFFFE and \uFFFF
in strings. The first because many implementations use C strings, and
for an information interchange format this is better; allowing NUL in
a string also has security implications. The other two (but not
unpaired UTF-16 surrogates, as ECMAscript uses UCS-2 and only got
UTF-16 later) because they're not valid Unicode; JSON was not
binary-safe already, so why bother. Among other benefits, this also
helps implementations.
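A hedged sketch of the corresponding check, run over decoded
codepoints in a string parser or encoder (the function name is mine):

    /* reject the codepoints this proposal forbids inside strings */
    static int
    cp_allowed(unsigned int cp)
    {
        if (cp == 0x0000U)      /* NUL breaks C strings */
            return (0);
        if (cp == 0xFFFEU || cp == 0xFFFFU)     /* not valid Unicode */
            return (0);
        /* unpaired UTF-16 surrogates deliberately still pass */
        return (1);
    }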
For my third proposal, I'd like to agree that implementations should
impose a nesting depth limit, which may be user-defined, and in the
face of which cyclic checking may be skipped by an encoder. I emit
nesting depth overflows as a literal null; one might also throw an
error. Since I was asked: the common standard value to restrict
nesting depth to is 32, unless the user specified one. (I also saw 8,
but 32 WFM pretty well.) Most seem to use it even if it may seem low
at first. Only specialised applications probably need more, and they
can always pass a value.
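A minimal sketch of that overflow behaviour, using a toy value type of
my own invention (nested one-element arrays) just to show the cutoff;
a cyclic value simply recurses until the limit and then unwinds,
which is why the encoder can skip cyclic checking:

    #include <stdio.h>

    #define JSON_MAXDEPTH 32    /* common default; user-overridable */

    /* toy value: NULL child = leaf (encoded as 0), else a
     * one-element array */
    struct value {
        struct value *child;
    };

    static void
    enc_value(FILE *o, const struct value *v, unsigned int depth)
    {
        if (depth > JSON_MAXDEPTH) {
            /* overflow: emit literal null instead of recursing */
            fputs("null", o);
            return;
        }
        if (v->child == NULL) {
            fputs("0", o);
            return;
        }
        fputc('[', o);
        enc_value(o, v->child, depth + 1);
        fputc(']', o);
    }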
For my fourth proposal, backslash-escape U+007F through U+009F always.
Leaving them raw may upset humans, editors, databases, etc. (This
paragraph was newly added, after some IRC discussion.)
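The encoder-side predicate for this is tiny; a sketch (the name is
mine), with quote and backslash included for completeness:

    /* must this codepoint be backslash-escaped on output?
     * C0 controls are required by JSON; DEL and the C1 controls
     * (U+007F..U+009F) are added by this proposal */
    static int
    must_escape(unsigned int cp)
    {
        return (cp == '"' || cp == '\\' || cp < 0x20U ||
            (cp >= 0x7FU && cp <= 0x9FU));
    }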
None of these permit anything that wasn't accepted before to be
accepted afterwards. I've got a fifth proposal which changes
acceptance rules, but only for a subset of parsers: formally, JSON is
defined in ECMA-262 as an industry standard that, in contrast to
RFC 4627, always allowed any Value as the top-level element of a JSON
text. I'd like to make it so, and ignore the RFC's requirement for it
to be an Object or Array. Even so, the first two characters (after the
BOM, if any) of a JSON text are always in the non-NUL 7-bit ASCII
range, allowing for encoding detection. (This is done by the NUL octet
pattern in the first four octets; but see the update below.)
JSON has only taken off because it's a tightly defined, simple format
that can be used everywhere and isn't too awful for humans (escaping
is not needed for U+0020 through U+D7FF and U+E000 through U+FFFD,
after all, although I'd also take the C1 control characters out; see
my fourth proposal above).
I've started to use a trailing comma in indexed and associative arrays
in code I write at work, when the array values are one per line, to
help version control systems do their diffs, but I refrain from asking
for a JSON extension to permit that, in order to not endanger
compatibility at all (no comment needed, it's just not worth it);
still, I'd like my above proposals to be followed by implementors
(and I'm one of them).
Some more discussion with Jonathan pointed out that JSON5 allows for
trailing commata in Object and Array; IMHO the only feature of it that
is not bad or outright harmful. I'll probably keep from accepting them
because, on their own, they're not that useful, and I'd usually run
JSON texts, even configs, through a parser/encoder roundtrip to
pretty-print them, which would lose them anyway.
As for binary-safeness: probably best to just use base64 and let the
outer layers worry about compression. The data is usually unrelated to
the JSON-encoded structure, and even if it's related to other data,
the base64 representation is usually similar (unless misaligned, since
base64 encodes each 3-octet group independently).
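To illustrate, a self-contained sketch of such an encoder (my own, not
from any particular library); the independent 3-octet groups are what
make two blobs agreeing on an aligned prefix encode to the same base64
prefix:

    #include <stddef.h>

    static const char b64[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
        "0123456789+/";

    /* encode len octets from src into dst; caller provides
     * 4 * ((len + 2) / 3) + 1 bytes of room */
    static void
    b64enc(char *dst, const unsigned char *src, size_t len)
    {
        size_t i;

        for (i = 0; i + 2 < len; i += 3) {
            *dst++ = b64[src[i] >> 2];
            *dst++ = b64[((src[i] & 0x03) << 4) | (src[i + 1] >> 4)];
            *dst++ = b64[((src[i + 1] & 0x0F) << 2) | (src[i + 2] >> 6)];
            *dst++ = b64[src[i + 2] & 0x3F];
        }
        if (i < len) {
            /* 1 or 2 trailing octets, padded with '=' */
            *dst++ = b64[src[i] >> 2];
            if (i + 1 < len) {
                *dst++ = b64[((src[i] & 0x03) << 4) | (src[i + 1] >> 4)];
                *dst++ = b64[(src[i + 1] & 0x0F) << 2];
            } else {
                *dst++ = b64[(src[i] & 0x03) << 4];
                *dst++ = '=';
            }
            *dst++ = '=';
        }
        *dst = '\0';
    }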
Update 02.12.2012: Wrong I was about the first two characters: " "
(a bare String whose content character lies outside ASCII) is a valid
JSON text. It is still possible to peek at four octets and determine
the encoding by ordering the tests; I have updated my notes.
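A sketch of that ordered detection, loosely following the NUL-pattern
idea from RFC 4627 (names are mine; it assumes any BOM was consumed
beforehand, and relies only on the first character being non-NUL
ASCII). The 32-bit patterns must be tested before the 16-bit ones,
since e.g. the octets 31 00 00 00 ("1" in UTF-32LE) also begin with
the UTF-16LE pattern xx 00:

    #include <stddef.h>

    enum jenc { J_UTF8, J_UTF16BE, J_UTF16LE, J_UTF32BE, J_UTF32LE };

    /* sniff the encoding from the first (up to) four octets */
    static enum jenc
    sniff(const unsigned char *p, size_t n)
    {
        if (n >= 4 && p[0] == 0 && p[1] == 0 && p[2] == 0)
            return (J_UTF32BE);
        if (n >= 4 && p[1] == 0 && p[2] == 0 && p[3] == 0)
            return (J_UTF32LE);
        if (n >= 2 && p[0] == 0)
            return (J_UTF16BE);
        if (n >= 2 && p[1] == 0)
            return (J_UTF16LE);
        return (J_UTF8);
    }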