Validating Data With JSON-Schema, Part 2

In the first part of this tutorial, you learned how to create quite advanced schemas using all available validation keywords. Many real-world examples of JSON data are more complex than our user example. An attempt to put all the requirements for such data in one file may lead to a very large schema that may also contain a lot of duplication.

Structuring Your Schemas

The JSON-schema standard allows you to break schemas into multiple parts. Let’s look at the example of the data for the news site navigation:

The navigation structure above is somewhat similar to the one you can see on the website http://dailymail.co.uk. You can see a more complete example in the GitHub repository.

The data structure is complex and recursive, but the schemas describing this data are quite simple:

navigation.json:

page.json:

defs.json:

Have a look at the schemas above and the navigation data they describe (which is valid according to the schema navigation.json). The main thing to notice is that the schema navigation.json references the schema page.json, which in turn references navigation.json.

JavaScript code to validate the navigation data against the schema could be:

All the code samples are available in the GitHub Repository.

Ajv, the validator used in the example, is the fastest JSON-Schema validator for JavaScript. I created it, so I am going to use it in this tutorial. We will look at how it compares with other validators at the end, so you can choose the right one for you.

Tasks

See part 1 of the tutorial for instructions on how to install the repository with the tasks and to test your answers.

References Between Schemas With the “$ref” Keyword

The JSON-Schema standard allows you to reuse the repeated parts of schemas using references with the “$ref” keyword. As you can see from the navigation example, you can reference the schema that is located:

  • in another file: use the schema URI that is defined in its “id” property
  • in any part of another file: append the JSON pointer to the schema reference
  • in any part of the current schema: append the JSON pointer to “#”

You can also refer to the whole current schema using “$ref” equal to “#”—it allows you to create recursive schemas referring to themselves.

So in our example, the schema in navigation.json refers to:

  • the schema page.json
  • definitions in the schema defs.json
  • the definition positiveIntOrNull in the same schema

The schema in page.json refers:

  • back to the schema navigation.json
  • also to definitions in the file defs.json

The standard requires that the “$ref” should be the only property in the object, so if you want to apply a referenced schema in addition to another schema, you have to use the “allOf” keyword.

Task 1

Refactor the user schema from part 1 of the tutorial using references. Split the schema into two files: user.json and connection.json.

Put your schemas in the files part2/task1/user.json and part2/task1/connection.json and run node part2/task1/validate to check if your schemas are correct.

JSON-Pointer

JSON-pointer is a standard for defining paths to parts of JSON files. The standard is described in RFC 6901.

A path consists of segments (each of which can be any string), each prefixed with the “/” character. If a segment contains the characters “~” or “/”, they should be replaced with “~0” and “~1” respectively. Each segment names a property or an array index in the JSON data.
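The lookup rules above can be sketched in a few lines of plain JavaScript. This is only an illustration: getByPointer is a hypothetical helper name and the sample object is made up for the example.

```javascript
// Minimal JSON pointer lookup following the escaping rules of RFC 6901.
// `getByPointer` is a hypothetical helper, not part of any library.
function getByPointer(data, pointer) {
  if (pointer === "") return data; // the empty pointer refers to the whole document
  if (pointer[0] !== "/") throw new Error("Invalid JSON pointer: " + pointer);
  return pointer
    .slice(1)
    .split("/")
    // "~1" is unescaped before "~0", so that "~01" becomes "~1" and not "/"
    .map((segment) => segment.replace(/~1/g, "/").replace(/~0/g, "~"))
    .reduce((value, key) => (value == null ? undefined : value[key]), data);
}

// Made-up sample data with keys that need escaping
const pointerSample = {
  definitions: { color: { type: "string" } },
  "a/b": { "c~d": [10, 20] },
};

console.log(getByPointer(pointerSample, "/definitions/color")); // { type: 'string' }
console.log(getByPointer(pointerSample, "/a~1b/c~0d/1")); // 20
```

A pointer to a path that does not exist simply yields undefined in this sketch; a stricter implementation might throw instead.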

If you look at the navigation example, the “$ref” that defines the color property is “defs.json#/definitions/color”, where “defs.json#” is the schema URI and “/definitions/color” is the JSON pointer. It points to the property color inside property definitions.

The convention is to put all the parts of the schema that are used in refs inside the definitions property of the schema (as you can see in the example). Although the JSON-schema standard reserves the definitions keyword for this purpose, it is not required to put your subschemas there. JSON-pointer allows you to refer to any part of the JSON file.

When JSON pointers are used in URIs, all the characters that are invalid in URIs should be escaped (in JavaScript the global function encodeURIComponent can be used).
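As a rough sketch of that escaping order, the hypothetical helper below (toUriFragment is not a library function) first applies the JSON pointer escapes and then encodeURIComponent to each segment.

```javascript
// Hypothetical helper: builds a URI fragment containing a JSON pointer
// from an array of raw (unescaped) property names and indices.
function toUriFragment(segments) {
  return (
    "#" +
    segments
      .map((segment) =>
        // JSON pointer escaping first ("~" before "/"), then URI escaping
        encodeURIComponent(
          String(segment).replace(/~/g, "~0").replace(/\//g, "~1")
        )
      )
      .map((escaped) => "/" + escaped)
      .join("")
  );
}

console.log(toUriFragment(["definitions", "color"])); // #/definitions/color
console.log(toUriFragment(["my files", "50%"])); // #/my%20files/50%25
```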

JSON-pointers can be used not only in JSON-schemas. They can be used to represent the path in JSON data to any property or item. You can use the library json-pointer to access objects with JSON-pointers.

Task 2

The JSON file below describes the folders and files structure (folder names start with “/”):

What are the JSON pointers that point to:

  • the size of the “Word” application,
  • the size of the “my_story~.rtf” document,
  • the name of the second application that can open the “my_story~.rtf” document?

Put your answers in part2/task2/json_pointers.json and run node part2/task2/validate to check them.

Schema IDs

Schemas usually have a top-level “id” property that holds the schema URI. When “$ref” is used in a schema, its value is treated as a URI that is resolved relative to the schema “id”.

The resolution works in the same way as the browser resolves URIs that are not absolute: they are resolved relative to the schema URI in its “id” property. If “$ref” is a filename, it replaces the filename in the “id”. In the navigation example, the navigation schema id is "http://mynet.com/schemas/navigation.json#", so when the reference "page.json#" is resolved, the full URI of the page schema becomes "http://mynet.com/schemas/page.json#" (that is the “id” of the page.json schema).

If the “$ref” of the page schema were a path, e.g. "/page.json", then it would have been resolved as "http://mynet.com/page.json#". And "/folder/page.json" would have been resolved as "http://mynet.com/folder/page.json#".

If “$ref” starts with the “#” character, it is treated as a hash fragment and is appended to the path in the “id” (replacing the hash fragment in it). A reference can also combine a filename with a hash fragment: in the navigation example, the reference "defs.json#/definitions/color" is resolved as "http://mynet.com/schemas/defs.json#/definitions/color", where "http://mynet.com/schemas/defs.json#" is the ID of the definitions schema and "/definitions/color" is treated as a JSON pointer inside it.

If “$ref” were a full URI with a different domain name, it would be resolved to that same full URI, just as absolute links work in the browser.
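Because the rules are the same as for links in a browser, you can experiment with them using the URL class built into Node.js and browsers. The sketch below is an illustration of the resolution rules only, not the code a validator uses; resolveRef is a hypothetical helper, and the ids are the ones from the navigation example.

```javascript
// Resolving "$ref" values against a schema "id" with the built-in URL class.
// `resolveRef` is a hypothetical helper used only for illustration.
const navigationId = "http://mynet.com/schemas/navigation.json#";
const pageId = "http://mynet.com/schemas/page.json#";

function resolveRef(ref, baseId) {
  return new URL(ref, baseId).href;
}

// A filename replaces the filename in the "id"
console.log(resolveRef("page.json#", navigationId));
// http://mynet.com/schemas/page.json#

// A filename with a hash fragment (JSON pointer)
console.log(resolveRef("defs.json#/definitions/color", navigationId));
// http://mynet.com/schemas/defs.json#/definitions/color

// Only a hash fragment: it replaces the fragment in the "id"
console.log(resolveRef("#/definitions/positiveIntOrNull", navigationId));
// http://mynet.com/schemas/navigation.json#/definitions/positiveIntOrNull

// A path starting with "/" is resolved from the domain root
console.log(resolveRef("/folder/page.json#", pageId));
// http://mynet.com/folder/page.json#
```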

Internal Schema IDs

The JSON-schema standard allows you to use “id” inside the schema to identify these subschemas and also to change the base URI relative to which inner references will be resolved—it’s called “changing resolution scope”. That is probably one of the most confusing parts of the standard, and that’s why it is not very commonly used.

I would not recommend over-using internal IDs, with one exception below, for two reasons:

  • Very few validators consistently follow the standard and correctly resolve references when internal IDs are used (Ajv fully follows the standard here).
  • Schemas become more difficult to understand.

We will still look into how it works because you may encounter schemas that use internal IDs and there are cases when using them helps with structuring your schemas.

Firstly, let’s look at our navigation example. Most of the references are in the definitions object and that makes references quite long. There is a way to shorten them by adding IDs to the definitions. This is the updated defs.json schema:

Now instead of references "defs.json#/definitions/positiveInteger" and "defs.json#/definitions/color" that are used in navigation and page schemas, you can use shorter references: "defs.json#positiveInteger" and "defs.json#color". That’s a very common usage of internal IDs as it allows you to make your references shorter and more readable. Please note that while this simple case will be handled correctly by most JSON-schema validators, some of them may not support it.

Let’s look at a more complex example with IDs. Here’s the sample JSON schema:

In very few lines, it became very confusing. Have a look at the example and try to figure out which property should be a string and which one an integer.

The schema defines an object with properties bar, baz and bax. Property bar should be an object that is valid according to the subschema, which requires that its property foo is valid according to the "bar" reference. Because the subschema has its own “id”, the full URI for the reference will be "http://somewhere.else/completely.json#bar", so it should be an integer.

Now look at the properties baz and bax. The references for them are written in a different way, but they point to the same reference "http://somewhere.else/completely.json#bar" and they both should be integers. Although the property baz points directly to the schema { "$ref": "#bar" }, it should still be resolved relative to the ID of the subschema because it is inside it. So the object below is valid according to this schema:

Many JSON schema validators will not handle it correctly, and that’s why IDs that change the resolution scope should be used with caution.

Task 3

Solving this puzzle will help you better understand how references and changing resolution scope work. Your schema is:

Create an object that is valid according to this schema.

Put your answer in part2/task3/valid_data.json and run node part2/task3/validate to check it.

Loading Referenced Schemas

Until now we have been looking at different schemas referring to each other without paying attention to how they are loaded into the validator.

One approach is to have all connected schemas preloaded like we had in the navigation example above. But there are situations when it is either not practical or impossible—for example, if the schema you need to use is supplied by another application, or if you don’t know in advance all the possible schemas that may be needed.

In such cases, the validator could load referenced schemas at the time when the data is validated. But that would make the validation process slow. Ajv allows you to compile a schema into a validating function asynchronously loading the missing referenced schemas in the process. The validation itself would still be synchronous and fast.

For example, if navigation schemas were available to download from the URIs in their IDs, the code to validate the data against the navigation schema could be this:

The code defines the validateNavigation function, which loads the schema and compiles the validation function when it is called for the first time, and always returns the validation result via the callback. There are various ways to improve it, from preloading and compiling the schema separately, before it is first used, to accounting for the fact that the function can be called multiple times before it has managed to cache the schema (ajv.compileAsync already ensures that the schema is always requested only once).

Now we will look at the new keywords that are proposed for version 5 of the JSON-schema standard.

JSON-Schema Version 5 Proposals

Although these proposals haven’t been finalised as a standard draft, they can be used today—the Ajv validator implements them. They substantially expand what you can validate using JSON-schema, so it’s worth using them.

To use all these keywords with Ajv, you need to use the option v5: true.

Keywords “constant” and “contains”

These keywords are added for convenience.

The “constant” keyword requires that the data is equal to the value of the keyword. Without this keyword, it could have been achieved with the “enum” keyword with one item in the array of elements.

This schema requires that the data is equal to 1:

The “contains” keyword requires that some array element matches the schema in this keyword. This keyword applies to arrays only; any other data type will be valid according to it. It is a bit more difficult to express this requirement using only keywords from version 4, but it is possible.

This schema requires that if the data is an array then at least one of its items is an integer:

It is equivalent to this one:

For data to be valid against this schema, either it should not be an array or it should not have all its items non-integers (i.e. some item should be an integer).

Please note that both the “contains” keyword and the equivalent schema above would fail if the data were an empty array.
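The logic of “contains” and of its version 4 style equivalent can be sketched with plain predicates standing in for subschemas. Both functions below are hypothetical illustrations, not validator code; they only show that the two formulations agree, including on empty arrays.

```javascript
// Hypothetical stand-in for the "contains" keyword: `matches` is a plain
// predicate used here in place of a subschema.
function contains(data, matches) {
  if (!Array.isArray(data)) return true; // the keyword only applies to arrays
  return data.some(matches);
}

// The version 4 style formulation described above: the data is either not an
// array, or it is NOT the case that every item fails to match.
function containsV4Style(data, matches) {
  const isArray = Array.isArray(data);
  const allItemsFail = isArray && data.every((item) => !matches(item));
  return !isArray || !allItemsFail;
}

const containsSamples = [[1.5, 2], ["a", 2.5], [], "not an array", { a: 1 }];
for (const sample of containsSamples) {
  console.log(
    JSON.stringify(sample),
    contains(sample, Number.isInteger),
    containsV4Style(sample, Number.isInteger)
  );
}
```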

Keyword “patternGroups”

This keyword is proposed as a replacement for “patternProperties”. It allows you to limit the number of properties matching the pattern that should exist in the object. Ajv supports both “patternGroups” and “patternProperties” in v5 mode because the first one is much more verbose, and if you don’t want to limit the number of properties you may prefer using the second one.

For example the schema:

is equivalent to this schema:

They both require that the object has only properties with keys consisting only of lowercase letters with values of type string and with keys consisting only of numbers with values of type number. They don’t require any number of such properties, nor do they limit the maximum number. That’s what you can do with “patternGroups”:

The schema above has additional requirements: there should be at least one property matching each pattern and no more than three properties whose keys contain only letters.

You can’t achieve the same with “patternProperties”.

Keywords for Limiting Formatted Values “formatMaximum” / “formatMinimum”

These keywords together with “exclusiveFormatMaximum” / “exclusiveFormatMinimum” allow you to set limits for time, date and potentially other string values that have format required with the “format” keyword.

This schema requires that data is a date and it’s greater than or equal to January 1, 2016:

Ajv supports comparing formatted data for the formats “date”, “time” and “date-time”, and you can define custom formats that would support limits with the “formatMaximum” / “formatMinimum” keywords.
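To get a feel for what a lower limit on a “date” value means, here is a rough sketch in plain JavaScript. It relies on the fact that dates written as YYYY-MM-DD sort correctly as strings. The helper dateNotBefore is hypothetical, it does not check that the day and month are real calendar values, and it is not how any validator necessarily implements the keyword.

```javascript
// Hypothetical illustration of a minimum limit on "date" strings.
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

function dateNotBefore(data, minimum, exclusive = false) {
  // Assumption: like other string keywords, the limit ignores non-strings
  if (typeof data !== "string") return true;
  if (!ISO_DATE.test(data)) return false; // not in the YYYY-MM-DD form at all
  // YYYY-MM-DD strings compare in chronological order
  return exclusive ? data > minimum : data >= minimum;
}

console.log(dateNotBefore("2016-02-14", "2016-01-01")); // true
console.log(dateNotBefore("2015-12-31", "2016-01-01")); // false
console.log(dateNotBefore("2016-01-01", "2016-01-01", true)); // false
```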

Keyword “switch”

While all the previous keywords were either allowing you to better express what was possible without them or slightly extending the possibilities, they didn’t change the declarative and static nature of the schema. This keyword allows you to make the validation dynamic and data-dependent. It contains multiple if-then cases.

It is easier to explain with an example:

The schema above sequentially validates the data against the subschemas in “if” keywords until one of them passes validation. When that happens, it validates the schema in the “then” keyword in the same object—that will be the result of the validation of the whole schema. If the value of “then” is false, the validation immediately fails.

In this way, the schema above requires that the value is:

  • either greater than or equal to 50 and is a multiple of 5
  • or between 10 and 49 and a multiple of 2
  • or between 5 and 9

This particular set of requirements can be expressed without a switch keyword, but there are more complex cases when it is not possible.

Task 4

Create the schema equivalent to the last example above without using a switch keyword.

Put your answer in part2/task4/no_switch_schema.json and run node part2/task4/validate to check it.

The “switch” keyword cases can also contain the “continue” keyword with a boolean value. If this value is true, the validation will continue after a successful “if” schema match with successful “then” schema validation. That is similar to a fall-through to the next case in a JavaScript switch statement, although in JavaScript fall-through is a default behaviour and the “switch” keyword requires an explicit “continue” instruction. This is another simple example with a “continue” instruction:

If the first “if” condition is satisfied and the “then” requirement is met, the validation will continue to check the second condition.
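The sequential evaluation of “switch” cases, including “continue”, can be sketched with plain predicates in place of subschemas. Everything here is a hypothetical illustration: evaluateSwitch is not validator code, the first list of cases is my reconstruction of the requirements listed earlier (50 and above in multiples of 5, 10 to 49 in multiples of 2, 5 to 9), and treating data that matches no case as valid is an assumption of this sketch.

```javascript
// Each case is { if, then, continue }; `if` and `then` are predicates that
// stand in for subschemas, and `then` may also be a boolean.
function evaluateSwitch(cases, data) {
  let result = true; // assumption: data that matches no case is valid
  for (const c of cases) {
    if (c.if && !c.if(data)) continue; // condition not met: try the next case
    result = typeof c.then === "boolean" ? c.then : c.then(data);
    if (!result || !c.continue) return result; // stop unless told to continue
  }
  return result;
}

const atLeast = (min) => (n) => n >= min;
const multipleOf = (k) => (n) => n % k === 0;

// Hypothetical reconstruction of the requirements listed earlier
const rangeCases = [
  { if: atLeast(50), then: multipleOf(5) },
  { if: atLeast(10), then: multipleOf(2) },
  { if: atLeast(5), then: true },
  { then: false }, // no "if": always matches, so anything below 5 fails
];
console.log([50, 52, 12, 13, 7, 3].map((n) => evaluateSwitch(rangeCases, n)));
// [ true, false, true, false, true, false ]

// With "continue", a passing first case falls through to the second one
const continueCases = [
  { if: atLeast(10), then: multipleOf(2), continue: true },
  { if: atLeast(20), then: multipleOf(5) },
];
console.log([12, 20, 22].map((n) => evaluateSwitch(continueCases, n)));
// [ true, true, false ]
```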

“$data” Reference

The “$data” keyword even further extends what is possible with JSON-schema and makes validation more dynamic and data-dependent. It allows you to put values from some data properties, items or keys into certain schema keywords.

For example, this schema defines an object with two properties where if both are defined, “larger” should be larger than or equal to “smaller”—the value in “smaller” is used as a minimum for “larger”:

Ajv implements the “$data” reference for most keywords whose values are not schemas. It fails validation if the “$data” reference points to an incorrect type and succeeds if it points to the undefined value (or if the path doesn’t exist in the object).

So what is the string value in the “$data” reference? It looks similar to JSON-pointer but it is not exactly it. It is a relative JSON-pointer that is defined by this standard draft.

It consists of an integer that defines how many levels up the lookup should traverse (1 in the example above means the direct parent), followed by “#” or a JSON pointer.

If the number is followed by “#”, then the value the pointer resolves to is the name of the property, or the index of the item, under which the referenced value is stored. In this way, “0#” in place of “1/smaller” would resolve to the string “larger”, and “1#” would be invalid, as the whole data is not a member of any object or array. This schema:

is equivalent to this one:

because { "$data": "0#" } is replaced with the property name.

If the number in the pointer is followed by JSON-pointer, then this JSON-pointer is resolved starting from the parent object this number refers to. You can see how it works in the first “smaller” / “larger” example.
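The two forms of a relative JSON pointer can be illustrated with a small resolver. This is a sketch under assumptions: resolveRelative is a hypothetical helper, and the position of the value being validated is passed in explicitly as a list of keys from the root.

```javascript
// Hypothetical helper. `path` is the list of keys leading from the root to
// the value currently being validated, e.g. ["larger"].
function resolveRelative(root, path, relativePointer) {
  const match = /^(\d+)(#|(?:\/.*)?)$/.exec(relativePointer);
  if (!match) throw new Error("Invalid relative JSON pointer: " + relativePointer);

  const up = Number(match[1]); // how many levels to go up
  if (up > path.length) throw new Error("Pointer goes above the root");
  const basePath = path.slice(0, path.length - up);

  if (match[2] === "#") {
    // the key (or index) under which the base value is stored in its parent
    if (basePath.length === 0) throw new Error("The root has no key");
    return basePath[basePath.length - 1];
  }

  const segments =
    match[2] === ""
      ? []
      : match[2]
          .slice(1)
          .split("/")
          .map((s) => s.replace(/~1/g, "/").replace(/~0/g, "~"));
  return basePath
    .concat(segments)
    .reduce((value, key) => (value == null ? undefined : value[key]), root);
}

const rangeData = { smaller: 5, larger: 7 };
console.log(resolveRelative(rangeData, ["larger"], "1/smaller")); // 5
console.log(resolveRelative(rangeData, ["larger"], "0#")); // larger
console.log(resolveRelative(rangeData, ["larger"], "0")); // 7
```

Calling it with "1#" from the same position throws, because the root object is not a member of any object or array.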

Let’s look again at our navigation example. One of the requirements you can see in the data is that the page_id property in the page object is always equal to the parent_id property in the contained navigation object. We can express this requirement in the page.json schema using the “$data” reference:

The “switch” keyword added to the page schema requires that if the page object has the navigation property then the value of the page_id property should be the same as the value of the parent_id property in the navigation object. The same can be achieved without the “switch” keyword, but it is less expressive and contains duplication:

Task 5

Examples of relative JSON-pointers can be helpful.

Using v5 keywords, define the schema for the object with two required properties, list and order. The list property should be an array of up to five items. All items should be numbers, and they should be sorted in ascending or descending order, as determined by the property order, which can be "asc" or "desc".

For example, this is a valid object:

and this is invalid:

Put your answer in part2/task5/schema.json and run node part2/task5/validate to check it.

How would you create a schema with the same conditions but for a list of unlimited size?

Defining New Validation Keywords

We’ve looked at the new keywords that are proposed for version 5 of the JSON-schema standard. You can use them today, but sometimes you may want more. If you’ve done task 5, you probably have noticed that some requirements are difficult to express with JSON-schema.

Some validators, including Ajv, allow you to define custom keywords. Custom keywords:

  • allow you to create validation scenarios that cannot be expressed using JSON-Schema
  • simplify your schemas
  • help you to bring a bigger part of the validation logic to your schemas
  • make your schemas more expressive, less verbose and closer to your application domain

One of the developers who uses Ajv wrote on GitHub:

“ajv with custom keywords has helped us a lot with business logic validation in our backend. We consolidated a whole bunch of controller-level validations into JSON-Schema with custom keywords. The net effect is far far better than writing individual validation code.”

The concerns you have to be aware of when extending the JSON-schema standard with custom keywords are the portability and understanding of your schemas. You will have to support these custom keywords on other platforms and to properly document these keywords so that everybody can understand them in your schemas.

The best approach here is to define a new meta-schema that will be the extension of draft 4 meta-schema or “v5 proposals” meta-schema that will include both the validation of your additional keywords and their description. Then your schemas that use these custom keywords will have to set the $schema property to the URI of the new meta-schema.

Now that you’ve been warned, we’ll dive in and define a couple of custom keywords using Ajv.

Ajv provides four ways to define custom keywords that you can see in the documentation. We will look at two of them:

  • using a function that compiles your schema to a validation function
  • using a macro-function that takes your schema and returns another schema (with or without custom keywords)

Let’s start with the simple example of a range keyword. A range is simply a combination of minimum and maximum keywords, but if you have to define many ranges in your schema, especially if they have exclusive boundaries, it may easily become boring.

That’s how the schema should look:

where exclusiveRange is optional, of course. The code to define this keyword is below:

And that’s it! After this code you can use the range keyword in your schemas:

The object passed to addKeyword is a keyword definition. It optionally contains the type (or types as an array) the keyword applies to. The compile function is called with parameters schema and parentSchema and should return another function that validates the data. That makes it almost as efficient as native keywords, because the schema is analysed during its compilation, but there is the cost of an extra function call during validation.
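As a minimal sketch of such a keyword definition, the object below has a type and a compile function with the parameters described above. It assumes that the value of range is a two-item array [min, max] and that exclusiveRange is a boolean in the parent schema; these shapes are assumptions of the example. The compile function is called directly here so that the snippet runs without Ajv.

```javascript
// Sketch of a keyword definition with a compile function.
// Assumed shapes: range is [min, max], exclusiveRange is a boolean.
const rangeDefinition = {
  type: "number",
  compile(schema, parentSchema) {
    const [min, max] = schema;
    // The choice between the two functions is made once, at compile time
    return parentSchema.exclusiveRange === true
      ? (data) => data > min && data < max
      : (data) => data >= min && data <= max;
  },
};

// With Ajv, the object would be registered as described in the text:
// ajv.addKeyword("range", rangeDefinition);

// Calling compile directly, the way a validator would for each schema
const inclusiveRange = rangeDefinition.compile([2, 4], { range: [2, 4] });
const exclusiveRange = rangeDefinition.compile([2, 4], {
  range: [2, 4],
  exclusiveRange: true,
});

console.log(inclusiveRange(2), inclusiveRange(5)); // true false
console.log(exclusiveRange(2), exclusiveRange(3)); // false true
```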

Ajv allows you to avoid this overhead with keywords that return the code (as a string) that will be made part of the validation function, but it is quite complex so we won’t look at it here. The simpler way is to use macro keywords—you will have to define a function that takes the schema and returns another schema.

Below is the implementation of the range keyword with a macro function:

You can see that the function simply returns the new schema that is equivalent to the range keyword that uses keywords maximum and minimum.
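A macro function for the same keyword could look like the sketch below. As before, it assumes that range is a two-item array and exclusiveRange a boolean, and it produces draft 4 style keywords, where exclusiveMinimum and exclusiveMaximum are booleans. The function is called directly so that the snippet runs without Ajv.

```javascript
// Sketch of a macro keyword definition: it returns another schema instead
// of a validation function. Assumed shapes are the same as in the text.
const rangeMacroDefinition = {
  type: "number",
  macro(schema, parentSchema) {
    const exclusive = parentSchema.exclusiveRange === true;
    return {
      minimum: schema[0],
      maximum: schema[1],
      exclusiveMinimum: exclusive,
      exclusiveMaximum: exclusive,
    };
  },
};

// With Ajv: ajv.addKeyword("range", rangeMacroDefinition);

const expanded = rangeMacroDefinition.macro([2, 4], {
  range: [2, 4],
  exclusiveRange: true,
});
console.log(expanded);
// { minimum: 2, maximum: 4, exclusiveMinimum: true, exclusiveMaximum: true }
```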

Let’s also see how we can create a meta-schema that will include the range keyword. We’ll use draft 4 meta-schema as our starting point:

If you want to use “$data” references with the range keyword, you will have to extend “v5 proposals” meta-schema that is included in Ajv (see the link above) so that these references can be the values of range and exclusiveRange. And while our first implementation will not support “$data” references, the second one with a macro-function will support them.

Now that you have a meta-schema, you need to add it to Ajv and use it in schemas using the range keyword:

The code above would have thrown an exception if invalid values had been passed to range or exclusiveRange.

Task 6

Assume that you have defined the keyword jsonPointers that applies the schemas to the deep properties defined by the JSON pointers that point to data starting from the current one. This keyword is useful with the switch keyword as it allows you to define requirements for deep properties and items. For example, this schema using the jsonPointers keyword:

is equivalent to:

Assume that you have also defined the keyword requiredJsonPointers that works similarly to required but with JSON-pointers instead of properties.

If you like, you can define these keywords yourself too, or you can see how they are defined in the file part2/task6/json_pointers.js.

Your task is: using keywords jsonPointers and requiredJsonPointers, define the keyword select that is similar to the JavaScript switch statement and has the syntax below (otherwise and fallthrough are optional):

This syntax allows values of any type. Please note that fallthrough is different from continue in the switch keyword. fallthrough applies the schema from the next case to the data without checking that the selector is equal to the value from the next case (as it is most likely not equal).

Put your answers in part2/task6/select_keyword.js and part2/task6/v5-meta-with-select.json and run node part2/task6/validate to check them.

Bonus 1: Improve your implementation to also support this syntax:

It can be used if all values are different strings and there is no fallthrough.

Bonus 2: extend the “v5 proposals” meta-schema to include this keyword.

Other Usages of JSON-Schemas

In addition to validating data, JSON-schemas can be used to:

  • generate UI
  • generate data
  • modify data

You can look at the libraries that generate UI and data if you are interested. We won’t explore them, as they are outside the scope of this tutorial.

We will look at using JSON-schema to modify the data while it is being validated.

Filtering Data

One of the common tasks while validating the data is removing the additional properties from the data. This allows you to sanitize the data before passing it to the processing logic without failing the schema validation:

Without the option removeAdditional, the validation would have failed, as there is an additional property bar that is not allowed by the schema. With this option, validation passes and the property is removed from the object.

When the removeAdditional option value is true, additional properties are removed only if the additionalProperties keyword is false. Ajv also allows you to remove all additional properties, regardless of the additionalProperties keyword, or additional properties that fail validation (if the additionalProperties keyword is the schema). Please look at the Ajv documentation for more information.
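The effect of filtering with removeAdditional set to true can be approximated in plain JavaScript. The helper below is a hypothetical illustration and not Ajv's implementation: it handles only a flat object schema, ignores “patternProperties”, and uses made-up sample data with the foo and bar properties.

```javascript
// Hypothetical helper that mimics the effect described above for a flat
// object schema: extra properties are removed only when the schema has
// additionalProperties: false.
function removeAdditionalProps(schema, data) {
  if (schema.additionalProperties !== false) return data; // nothing to remove
  const allowed = Object.keys(schema.properties || {});
  for (const key of Object.keys(data)) {
    if (!allowed.includes(key)) delete data[key]; // mutates the validated data
  }
  return data;
}

const filterSchema = {
  type: "object",
  properties: { foo: { type: "string" } },
  additionalProperties: false,
};
const filterData = { foo: "abc", bar: 1 };

removeAdditionalProps(filterSchema, filterData);
console.log(filterData); // { foo: 'abc' }
```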

Assigning Defaults to Properties and Items

The JSON-schema standard defines the keyword “default” that contains a value that the data should have if it is not defined in the validated data. Ajv allows you to assign such defaults in the process of validation:

Without the option useDefaults, the validation would have failed, as there is no required property bar in the validated object. With this option, the validation passes and the property with the default value is added to the object.
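What assigning defaults does to the data can be sketched without a validator. The helper applyDefaults is hypothetical and handles only top-level properties; the schema and data are made up to match the description above, with a required property bar that has a default.

```javascript
// Hypothetical helper: copies "default" values from property schemas into
// the data when the property is missing. Only top-level properties are handled.
function applyDefaults(schema, data) {
  for (const [key, propSchema] of Object.entries(schema.properties || {})) {
    if (data[key] === undefined && propSchema.default !== undefined) {
      // copy the default so that validated objects never share a reference
      data[key] = JSON.parse(JSON.stringify(propSchema.default));
    }
  }
  return data;
}

const defaultsSchema = {
  type: "object",
  properties: {
    foo: { type: "string" },
    bar: { type: "number", default: 1 },
  },
  required: ["foo", "bar"],
};
const defaultsData = { foo: "abc" };

applyDefaults(defaultsSchema, defaultsData);
console.log(defaultsData); // { foo: 'abc', bar: 1 }
```

After the defaults are applied, the required check on bar would pass, which is the behaviour the paragraph above describes.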

Coercing Data Types

“type” is one of the most commonly used keywords in JSON-schemas. When you are validating user inputs, all your data properties you get from forms are usually strings. Ajv allows you to coerce the data to the types specified in the schema, both to pass the validation and to use the correctly typed data afterwards:
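The idea of coercion can be sketched in plain JavaScript for the common case of form values that arrive as strings. The helpers below are hypothetical and cover only string to number, integer and boolean conversions; Ajv's actual coercion rules are more extensive, and unlike this sketch Ajv modifies the validated data itself.

```javascript
// Hypothetical helper: converts a string to the type named in the schema
// when the conversion is unambiguous, and leaves it untouched otherwise.
function coerceString(value, type) {
  if (typeof value !== "string") return value;
  if (type === "number" || type === "integer") {
    const n = Number(value);
    const ok =
      value.trim() !== "" &&
      Number.isFinite(n) &&
      (type === "number" || Number.isInteger(n));
    return ok ? n : value;
  }
  if (type === "boolean") {
    if (value === "true") return true;
    if (value === "false") return false;
  }
  return value; // anything else is left for validation to report
}

// Applies coerceString to each top-level property of a form submission
function coerceForm(schema, formData) {
  const result = {};
  for (const [key, value] of Object.entries(formData)) {
    const propSchema = (schema.properties || {})[key];
    result[key] = propSchema ? coerceString(value, propSchema.type) : value;
  }
  return result;
}

const formSchema = {
  type: "object",
  properties: {
    age: { type: "integer" },
    subscribed: { type: "boolean" },
    name: { type: "string" },
  },
};

console.log(coerceForm(formSchema, { age: "42", subscribed: "true", name: "Ann" }));
// { age: 42, subscribed: true, name: 'Ann' }
```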

Comparing JavaScript JSON-Schema Validators

There are more than ten actively supported JavaScript validators available. Which one should you use?

You can see benchmarks of performance, and of how well different validators pass the test suite from the JSON-schema standard, in the project json-schema-benchmark.

There are also distinctive features that some validators have that can make them the most suitable for your project. I will compare some of them below.

is-my-json-valid and jsen

These two validators are very fast and have very simple interfaces. They both compile schemas to JavaScript functions, as Ajv does.

Their disadvantage is that they both have limited support of remote references.

schemasaurus

This is a one-of-a-kind library where JSON-schema validation is almost a side effect.

It is built as a generic and easily extensible JSON-schema processor/iterator that you can use to build all sorts of tools that use JSON-schema: UI generators, templates, etc.

It already includes a relatively fast JSON-schema validator.

It doesn’t support remote references at all, though.

themis

The slowest in the group of fast validators, it has a comprehensive set of features, with limited support of remote references.

Where it really shines is its implementation of the default keyword. While the majority of validators have limited support of this keyword (Ajv is not an exception), Themis has very complex logic for applying defaults, with rollbacks inside compound keywords such as anyOf.

z-schema

Performance-wise, this very mature validator is on the border between fast and slow validators. It was probably one of the fastest before the new breed of compiled validators appeared (all of the above and Ajv).

It passes almost all the tests from the JSON-schema test suite for validators, and it has a quite thorough implementation of remote references.

It has a large set of options allowing you to tweak the default behaviour of many JSON-schema keywords (e.g., not accept empty arrays as arrays or empty sets of strings) and to impose additional requirements on JSON schemas (e.g., require the minLength keyword for strings).

I think that in most cases, both modifying schema behaviours and including requests to other services in JSON-schema are the wrong things to do. But there are cases when the ability to do so simplifies a lot.

tv4

This is one of the oldest (and slowest) validators supporting version 4 of the standard. As such it is often a default choice for many projects.

If you are using it, it is very important to understand how it reports errors and missing references and to configure it correctly, otherwise you will be getting many false positives (i.e., the validation passes with invalid data or unresolved remote references).

Formats are not included by default, but they are available as a separate library.

Ajv

I wrote Ajv because all the existing validators were either fast or compliant with the standard (especially with regard to supporting remote references) but not both. Ajv filled that gap.

At the moment it is the only validator that:

  • passes all the tests and fully supports remote references
  • supports validation keywords proposed for version 5 of the standard and $data references
  • supports asynchronous validation of custom formats and keywords

It has options to modify the validation process and to modify validated data (filtering, assigning defaults and coercing types—see the examples above).

Which Validator to Use?

I think the best approach is to try several and to choose the one that works best for you.

I wrote json-schema-consolidate, which supplies a collection of adapters that unify the interfaces of 12 JSON-schema validators. Using this tool you can spend less time switching between validators. I recommend removing it once you have decided which validator to use, as keeping it would negatively affect performance.

This is it! I hope this tutorial was useful. You have learned about:

  • structuring your schemas
  • using references and IDs
  • using validation keywords and $data reference from version 5 proposals
  • loading remote schemas asynchronously
  • defining custom keywords
  • modifying data in the process of validation
  • advantages and disadvantages of different JSON-schema validators

Thanks for reading!
