Commit Graph

30 Commits

Author SHA1 Message Date
Markus Armbruster
1e960b4602 json: Eliminate lexer state IN_WHITESPACE, pseudo-token JSON_SKIP
The lexer ignores whitespace like this:

         on whitespace      on non-ws   spontaneously
    IN_START --> IN_WHITESPACE --> JSON_SKIP --> IN_START
                    ^    |
                     \__/  on whitespace

This accumulates a whitespace token in state IN_WHITESPACE, only to
throw it away on the transition via JSON_SKIP to the start state.
Wasteful.  Go from IN_START to IN_START on whitespace directly,
dropping the whitespace character.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180831075841.13363-7-armbru@redhat.com>
2018-09-24 18:08:07 +02:00
Markus Armbruster
2ce4ee64c4 json: Eliminate lexer state IN_ERROR
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180831075841.13363-6-armbru@redhat.com>
2018-09-24 18:08:07 +02:00
Markus Armbruster
0f07a5d5f1 json: Nicer recovery from lexical errors
When the lexer chokes on an input character, it consumes the
character, emits a JSON error token, and enters its start state.  This
can lead to suboptimal error recovery.  For instance, input

    0123 ,

produces the tokens

    JSON_ERROR    01
    JSON_INTEGER  23
    JSON_COMMA    ,

Make the lexer skip characters after a lexical error until a
structural character ('[', ']', '{', '}', ':', ','), an ASCII control
character, or '\xFE', or '\xFF'.

Note that we must not skip ASCII control characters, '\xFE', '\xFF',
because those are documented to force the JSON parser into known-good
state, by docs/interop/qmp-spec.txt.

The lexer now produces

    JSON_ERROR    01
    JSON_COMMA    ,

Update qmp-test for the nicer error recovery: QMP now reports just one
error for input %p instead of two.  Also drop the newline after %p; it
was needed to tease out the second error.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180831075841.13363-5-armbru@redhat.com>
[Conflict with commit ebb4d82d88 resolved]
2018-09-24 18:08:07 +02:00
Markus Armbruster
c0ee3afa7f json: Make lexer's "character consumed" logic less confusing
The lexer uses macro TERMINAL_NEEDED_LOOKAHEAD() to decide whether a
state transition consumes the input character.  It returns true when
the state transition is defined with the TERMINAL() macro.  To detect
that, it checks whether input '\0' would have resulted in the same
state transition, and the new state is not IN_ERROR.

Why does that even work?  For all states, the new state on input '\0'
is either IN_ERROR or defined with TERMINAL().  If the state
transition equals the one we'd get for input '\0', it goes to IN_ERROR
or to the argument of TERMINAL().  We never use TERMINAL(IN_ERROR),
because it makes no sense.  Thus, if it doesn't go to IN_ERROR, it
must be defined with TERMINAL().

Since this isn't quite confusing enough, we negate the result to get
@char_consumed, and ignore it when @flush is true.

Instead of deriving the lookahead bit from the state transition, make
it explicit.  This is easier to understand, and a bit more flexible,
too.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180831075841.13363-4-armbru@redhat.com>
2018-09-24 18:06:09 +02:00
Markus Armbruster
852dfa76b8 json: Clean up how lexer consumes "end of input"
When the lexer isn't in its start state at the end of input, it's
working on a token.  To flush it out, it needs to transit to its start
state on "end of input" lookahead.

There are two ways to the start state, depending on the current state:

* If the lexer is in a TERMINAL(JSON_FOO) state, it can emit a
  JSON_FOO token.

* Else, it can go to IN_ERROR state, and emit a JSON_ERROR token.

There are complications, however:

* The transition to IN_ERROR state consumes the input character and
  adds it to the JSON_ERROR token.  The latter is inappropriate for
  the "end of input" character, so we suppress that.  See also recent
  commit a2ec6be72b "json: Fix lexer to include the bad character in
  JSON_ERROR token".

* The transition to a TERMINAL(JSON_FOO) state doesn't consume the
  input character.  In that case, the lexer normally loops until it is
  consumed.  We have to suppress that for the "end of input" input
  character.  If we didn't, the lexer would consume it by entering
  IN_ERROR state, emitting a bogus JSON_ERROR token.  We fixed that in
  commit bd3924a33a.

However, simply breaking the loop this way assumes that the lexer
needs exactly one state transition to reach its start state.  That
assumption is correct now, but it's unclean, and I'll soon break it.
Clean up: instead of breaking the loop after one iteration, break it
after it reached the start state.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180831075841.13363-3-armbru@redhat.com>
2018-09-24 18:06:09 +02:00
Markus Armbruster
2a96042a8d json: Fix lexer for lookahead character beyond '\x7F'
The lexer fails to end a valid token when the lookahead character is
beyond '\x7F'.  For instance, input

    true\xC2\xA2

produces the tokens

    JSON_ERROR     true\xC2
    JSON_ERROR     \xA2

This should be

    JSON_KEYWORD   true
    JSON_ERROR     \xC2
    JSON_ERROR     \xA2

instead.

The culprit is

    #define TERMINAL(state) [0 ... 0x7F] = (state)

It leaves [0x80..0xFF] zero, i.e. IN_ERROR.  Has always been broken.
Fix it to initialize the complete array.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180831075841.13363-2-armbru@redhat.com>
2018-09-24 18:06:09 +02:00
Markus Armbruster
86cdf9ec8d json: Clean up headers
The JSON parser has three public headers, json-lexer.h, json-parser.h,
json-streamer.h.  They all contain stuff that is of no interest
outside qobject/json-*.c.

Collect the public interface in include/qapi/qmp/json-parser.h, and
everything else in qobject/json-parser-int.h.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180823164025.12553-54-armbru@redhat.com>
2018-08-24 20:26:37 +02:00
Markus Armbruster
812ce33ead qobject: Drop superfluous includes of qemu-common.h
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180823164025.12553-53-armbru@redhat.com>
2018-08-24 20:26:37 +02:00
Markus Armbruster
f9277915ee json: Fix streamer not to ignore trailing unterminated structures
json_message_process_token() accumulates tokens until it got the
sequence of tokens that comprise a single JSON value (it counts curly
braces and square brackets to decide).  It feeds those token sequences
to json_parser_parse().  If a non-empty sequence of tokens remains at
the end of the parse, it's silently ignored.  check-qjson.c cases
unterminated_array(), unterminated_array_comma(), unterminated_dict(),
unterminated_dict_comma() demonstrate this bug.

Fix as follows.  Introduce a JSON_END_OF_INPUT token.  When the
streamer receives it, it feeds the accumulated tokens to
json_parser_parse().

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180823164025.12553-46-armbru@redhat.com>
2018-08-24 20:26:37 +02:00
Markus Armbruster
4d40066142 json: Improve names of lexer states related to numbers
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180823164025.12553-43-armbru@redhat.com>
2018-08-24 20:26:37 +02:00
Markus Armbruster
f7617d45d4 json: Leave rejecting invalid interpolation to parser
Both lexer and parser reject invalid interpolation specifications.
The parser's check is useless.

The lexer ends the token right after the first bad character.  This
tends to lead to suboptimal error reporting.  For instance, input

    [ %04d ]

produces the tokens

    JSON_LSQUARE  [
    JSON_ERROR    %0
    JSON_INTEGER  4
    JSON_KEYWORD  d
    JSON_RSQUARE  ]

The parser then yields an error, an object and two more errors:

    error: Invalid JSON syntax
    object: 4
    error: JSON parse error, invalid keyword
    error: JSON parse error, expecting value

Dumb down the lexer to accept [A-Za-z0-9]*.  The parser's check is now
used.  Emit a proper error there.

The lexer now produces

    JSON_LSQUARE  [
    JSON_INTERP   %04d
    JSON_RSQUARE  ]

and the parser reports just

    JSON parse error, invalid interpolation '%04d'

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180823164025.12553-41-armbru@redhat.com>
2018-08-24 20:26:37 +02:00
Markus Armbruster
84a56f38b2 json: Pass lexical errors and limit violations to callback
The callback to consume JSON values takes QObject *json, Error *err.
If both are null, the callback is supposed to make up an error by
itself.  This sucks.

qjson.c's consume_json() neglects to do so, which makes
qobject_from_json() null instead of failing.  I consider that a bug.

The culprit is json_message_process_token(): it passes two null
pointers when it runs into a lexical error or a limit violation.  Fix
it to pass a proper Error object then.  Update the callbacks:

* monitor.c's handle_qmp_command(): the code to make up an error is
  now dead, drop it.

* qga/main.c's process_event(): lumps the "both null" case together
  with the "not a JSON object" case.  The former is now gone.  The
  error message "Invalid JSON syntax" is misleading for the latter.
  Improve it to "Input must be a JSON object".

* qobject/qjson.c's consume_json(): no update; check-qjson
  demonstrates qobject_from_json() now sets an error on lexical
  errors, but still doesn't on some other errors.

* tests/libqtest.c's qmp_response(): the Error object is now reliable,
  so use it to improve the error message.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180823164025.12553-40-armbru@redhat.com>
2018-08-24 20:26:37 +02:00
Markus Armbruster
2cbd15aa6f json: Treat unwanted interpolation as lexical error
The JSON parser optionally supports interpolation.  The lexer
recognizes interpolation tokens unconditionally.  The parser rejects
them when interpolation is disabled, in parse_interpolation().
However, it neglects to set an error then, which can make
json_parser_parse() fail without setting an error.

Move the check for unwanted interpolation from the parser's
parse_interpolation() into the lexer's finite state machine.  When
interpolation is disabled, '%' is now handled like any other
unexpected character.

The next commit will improve how such lexical errors are handled.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180823164025.12553-39-armbru@redhat.com>
2018-08-24 20:26:37 +02:00
Markus Armbruster
61030280ca json: Rename token JSON_ESCAPE & friends to JSON_INTERP
The JSON parser optionally supports interpolation.  The code calls it
"escape".  Awkward, because it uses the same term for escape sequences
within strings.  The latter usage is consistent with RFC 8259 "The
JavaScript Object Notation (JSON) Data Interchange Format" and ISO C.
Call the former "interpolation" instead.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180823164025.12553-38-armbru@redhat.com>
2018-08-24 20:26:37 +02:00
Markus Armbruster
037f244088 json: Have lexer call streamer directly
json_lexer_init() takes the function to process a token as an
argument.  It's always json_message_process_token().  Makes the code
harder to understand for no actual gain.  Drop the indirection.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180823164025.12553-34-armbru@redhat.com>
2018-08-24 20:26:37 +02:00
Marc-André Lureau
7c1e1d5481 json: remove useless return value from lexer/parser
The lexer always returns 0 when char feeding. Furthermore, none of the
caller care about the return value.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20180326150916.9602-10-marcandre.lureau@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <20180823164025.12553-32-armbru@redhat.com>
2018-08-24 20:26:37 +02:00
Markus Armbruster
b2da4a4d75 json: Leave rejecting invalid escape sequences to parser
Both lexer and parser reject invalid escape sequences in strings.  The
parser's check is useless.

The lexer ends the token right after the first non-well-formed byte.
This tends to lead to suboptimal error reporting.  For instance, input

    {"abc\@ijk": 1}

produces the tokens

    JSON_LCURLY   {
    JSON_ERROR    "abc\@
    JSON_KEYWORD  ijk
    JSON_ERROR   ": 1}\n

The parser then reports three errors

    Invalid JSON syntax
    JSON parse error, invalid keyword 'ijk'
    Invalid JSON syntax

before it recovers at the newline.

Drop the lexer's escape sequence checking, and make it accept the same
characters after backslash it accepts elsewhere in strings.  It now
produces

    JSON_LCURLY   {
    JSON_STRING   "abc\@ijk"
    JSON_COLON    :
    JSON_INTEGER  1
    JSON_RCURLY

and the parser reports just

    JSON parse error, invalid escape sequence in string

While there, fix parse_string()'s inaccurate function comment.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180823164025.12553-27-armbru@redhat.com>
2018-08-24 20:26:37 +02:00
Markus Armbruster
4b1c0cd7c7 json: Accept overlong \xC0\x80 as U+0000 ("modified UTF-8")
Since the JSON grammer doesn't accept U+0000 anywhere, this merely
exchanges one kind of parse error for another.  It's purely for
consistency with qobject_to_json(), which accepts \xC0\x80 (see commit
e2ec3f9768).

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180823164025.12553-26-armbru@redhat.com>
2018-08-24 20:26:37 +02:00
Markus Armbruster
de930f45cb json: Leave rejecting invalid UTF-8 to parser
Both the lexer and the parser (attempt to) validate UTF-8 in JSON
strings.

The lexer rejects bytes that can't occur in valid UTF-8: \xC0..\xC1,
\xF5..\xFF.  This rejects some, but not all invalid UTF-8.  It also
rejects ASCII control characters \x00..\x1F, in accordance with RFC
8259 (see recent commit "json: Reject unescaped control characters").

When the lexer rejects, it ends the token right after the first bad
byte.  Good when the bad byte is a newline.  Not so good when it's
something like an overlong sequence in the middle of a string.  For
instance, input

    {"abc\xC0\xAFijk": 1}\n

produces the tokens

    JSON_LCURLY   {
    JSON_ERROR    "abc\xC0
    JSON_ERROR    \xAF
    JSON_KEYWORD  ijk
    JSON_ERROR   ": 1}\n

The parser then reports four errors

    Invalid JSON syntax
    Invalid JSON syntax
    JSON parse error, invalid keyword 'ijk'
    Invalid JSON syntax

before it recovers at the newline.

The commit before previous made the parser reject invalid UTF-8
sequences.  Since then, anything the lexer rejects, the parser would
reject as well.  Thus, the lexer's rejecting is unnecessary for
correctness, and harmful for error reporting.

However, we want to keep rejecting ASCII control characters in the
lexer, because that produces the behavior we want for unclosed
strings.

We also need to keep rejecting \xFF in the lexer, because we
documented that as a way to reset the JSON parser
(docs/interop/qmp-spec.txt section 2.6 QGA Synchronization), which
means we can't change how we recover from this error now.  I wish we
hadn't done that.

I think we should treat \xFE the same as \xFF.

Change the lexer to accept \xC0..\xC1 and \xF5..\xFD.  It now rejects
only \x00..\x1F and \xFE..\xFF.  Error reporting for invalid UTF-8 in
strings is much improved, except for \xFE and \xFF.  For the example
above, the lexer now produces

    JSON_LCURLY   {
    JSON_STRING   "abc\xC0\xAFijk"
    JSON_COLON    :
    JSON_INTEGER  1
    JSON_RCURLY

and the parser reports just

    JSON parse error, invalid UTF-8 sequence in string

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180823164025.12553-25-armbru@redhat.com>
2018-08-24 20:26:37 +02:00
Markus Armbruster
eddc0a7f0a json: Revamp lexer documentation
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180823164025.12553-20-armbru@redhat.com>
2018-08-24 20:26:37 +02:00
Markus Armbruster
340db1ed82 json: Reject unescaped control characters
Fix the lexer to reject unescaped control characters in JSON strings,
in accordance with RFC 8259 "The JavaScript Object Notation (JSON)
Data Interchange Format".

Bonus: we now recover more nicely from unclosed strings.  E.g.

    {"one: 1}\n{"two": 2}

now recovers cleanly after the newline, where before the lexer
remained confused until the next unpaired double quote or lexical
error.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180823164025.12553-19-armbru@redhat.com>
2018-08-24 20:26:37 +02:00
Markus Armbruster
a2ec6be72b json: Fix lexer to include the bad character in JSON_ERROR token
json_lexer[] maps (lexer state, input character) to the new lexer
state.  The input character is consumed unless the new state is
terminal and the input character doesn't belong to this token,
i.e. the state transition uses look-ahead.  When this is the case,
input character '\0' would result in the same state transition.
TERMINAL_NEEDED_LOOKAHEAD() exploits this.

Except this is wrong for transitions to IN_ERROR.  There, the
offending input character is in fact consumed: case IN_ERROR returns.
It isn't added to the JSON_ERROR token, though.

Fix that by making TERMINAL_NEEDED_LOOKAHEAD() return false for
transitions to IN_ERROR.

There's a slight complication.  json_lexer_flush() passes input
character '\0' to flush an incomplete token.  If this results in
JSON_ERROR, we'd now add the '\0' to the token.  Suppress that.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-Id: <20180823164025.12553-18-armbru@redhat.com>
2018-08-24 20:26:37 +02:00
Marc-André Lureau
2bc7cfea09 json: learn to parse uint64 numbers
Switch strtoll() usage to qemu_strtoi64() helper while at it.

Add a few tests for large numbers.

Signed-off-by: Marc-André Lureau <marcandre.lureau@redhat.com>
Message-Id: <20170607163635.17635-11-marcandre.lureau@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
Signed-off-by: Markus Armbruster <armbru@redhat.com>
2017-06-20 14:31:31 +02:00
Eric Blake
ff5394ad5b qobject: Correct JSON lexer grammar comments
Fix the regex comments describing what we parse as JSON.  No change
to the lexer itself, just to the comments:
- The "" and '' string construction was missing alternation between
different escape sequences
- The construction for numbers forgot to handle optional leading '-'
- The construction for numbers was grouped incorrectly so that it
didn't permit '0.1'
- The construction for numbers forgot to mark the exponent as optional
- No mention that our '' string and "\'" are JSON extensions
- No mention of our %d and related extensions when constructing JSON

Signed-off-by: Eric Blake <eblake@redhat.com>
Message-Id: <1465526889-8339-2-git-send-email-eblake@redhat.com>
Reviewed-by: Markus Armbruster <armbru@redhat.com>
[Eric's regexp simplification squashed in]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
2016-06-30 15:24:36 +02:00
Peter Maydell
f2ad72b30e qobject: Clean up includes
Clean up includes so that osdep.h is included first and headers
which it implies are not included manually.

This commit was created with scripts/clean-includes.

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
Reviewed-by: Eric Blake <eblake@redhat.com>
Message-id: 1454089805-5470-12-git-send-email-peter.maydell@linaro.org
2016-02-04 17:41:30 +00:00
Paolo Bonzini
d2ca7c0b0d qjson: replace QString in JSONLexer with GString
JSONLexer only needs a simple resizable buffer.  json-streamer.c
can allocate memory for each token instead of relying on reference
counting of QStrings.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Message-Id: <1448300659-23559-2-git-send-email-pbonzini@redhat.com>
[Straightforwardly rebased on my patches, checkpatch made happy]
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2015-11-26 09:31:22 +01:00
Markus Armbruster
c54616608a qjson: Give each of the six structural chars its own token type
Simplifies things, because we always check for a specific one.

Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <1448486613-17634-6-git-send-email-armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2015-11-26 09:22:54 +01:00
Markus Armbruster
b8d3b1da3c qjson: Spell out some silent assumptions
Signed-off-by: Markus Armbruster <armbru@redhat.com>
Message-Id: <1448486613-17634-5-git-send-email-armbru@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
2015-11-26 09:18:45 +01:00
Paolo Bonzini
d593233438 json-lexer: fix escaped backslash in single-quoted string
This made the lexer wait for a closing *double* quote.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
Reviewed-by: Amos Kong <akong@redhat.com>
Signed-off-by: Luiz Capitulino <lcapitulino@redhat.com>
2014-06-23 11:01:24 -04:00
Paolo Bonzini
a372823a14 build: move qobject files to qobject/ and libqemuutil.a
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
2013-01-12 18:42:50 +01:00