Data Sanitization and Validation With WordPress

Proper security is critical to keeping your site or that of your theme or plug-in users safe. Part of that means appropriate data validation and sanitization. In this article we are going to look at why this is important, what needs to be done, and what functions WordPress provides to help.

Since there seem to be various interpretations of what the terms 'validation', 'escaping' and 'sanitization' mean, I'll first clarify what I mean by them in this article:

  • Validation – These are the checks that are run to ensure the data you have is what it should be. For instance, that an e-mail looks like an e-mail address, that a date is a date and that a number is (or is cast as) an integer
  • Sanitization / Escaping – These are the filters that are applied to data to make it 'safe' in a specific context. For instance, to display HTML code in a text area it would be necessary to replace all the HTML tags by their entity equivalents

Why Is Sanitization Important?

When data is included in some context (say, in an HTML document), that data could be misinterpreted as code for that environment (for example, HTML code). If the data contains malicious code, then using it without sanitizing it means that code will be executed. The code doesn't even have to be malicious to cause undesired effects. The job of sanitization is to make sure that any code in the data isn't interpreted as code – otherwise you may end up like Bobby Tables' school...

A seemingly innocuous example might be pre-filling a search field with the currently queried term, using the unescaped $_GET['s']:
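The listing itself isn't reproduced here, so what follows is only a minimal sketch of what such a form might look like in a theme template (the markup is an assumption):

```php
<!-- Vulnerable: the raw search term is echoed straight into the value attribute. -->
<form method="get" action="<?php echo esc_url( home_url( '/' ) ); ?>">
	<input type="text" name="s" value="<?php echo $_GET['s']; ?>" />
	<input type="submit" value="Search" />
</form>
```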

This opens up a vulnerability that could allow JavaScript to be injected by, for instance, tricking someone into visiting http://yoursite.com?s="/><script>alert('Injected javascript')</script>. The search term 'jumps' out of the value attribute, and the rest of the data is interpreted as code and executed. To prevent this, WordPress provides get_search_query, which returns the sanitized search query. This particular example is 'harmless', but the injected script could be far more malicious. Even in the best case, an unescaped form simply 'breaks' whenever a search term contains double quotes.
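A safer version of the search field, assuming a simple theme search form (the markup is my assumption), lets WordPress escape the term for the attribute context:

```php
<form method="get" action="<?php echo esc_url( home_url( '/' ) ); ?>">
	<!-- get_search_query() returns the search term escaped for use in an attribute. -->
	<input type="text" name="s" value="<?php echo get_search_query(); ?>" />
	<input type="submit" value="Search" />
</form>
```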

How this malicious (or otherwise) code may have found its way onto your site is not the concern here – but rather it is to prevent it from executing. Nor do we make assumptions about the nature of this unwanted code, or its intent – it could have simply been an error on the user's part. This brings me to rule No.1...


Rule No. 1: Trust Nobody

It's a common maxim that is used with regard to data sanitization, and it's a good one. The idea is that you should not assume that any data entered by the user is safe. Nor should you assume that the data you've retrieved from the database is safe – even if you had made it 'safe' prior to inserting it there. In fact, whether data can be considered 'safe' makes no sense without context. Sometimes the same data may be used in multiple contexts on the same page. Titles, for instance, can safely contain single or double quotes when inside header tags – but will cause problems if used (unescaped) inside a title attribute of a link tag. So it is rather pointless to make data 'safe' when adding it to the database, since it is often impossible to make data safe for all contexts simultaneously. (Of course it needs to be made safe to add to the database – but we'll come to that later).

Even if you only intend to use that data in one specific context, say a form, it is still pointless to sanitize the data when writing to the database because, as per Rule No. 1, you cannot trust that it is still safe when you take it out again.


Rule No. 2: Validate on Input, Escape on Output

This is the procedural maxim that sets out when you should validate data, and when you sanitize it. Simply put – validate your data (check it's what it should be – and that it's 'valid') as soon as you receive it from the user. When you come to use this data, for example when you output it, you need to escape (or sanitize) it. What form this sanitization takes, depends entirely on the context you are using it in.

The best advice is to perform this 'late': escape your data immediately before you use or display it. This way you can be confident that your data has been properly sanitized and you don't need to remember if the data has been previously checked.


Rule No. 3: Trust WordPress

You might be thinking "Ok, validate before writing to database and sanitize when using it. But don't I need to make sure the data is safe to write to the database?". In general, yes. When adding data to a database, or simply using an input to interact with a database, you would need to escape the data in case it contained any SQL commands. But this brings me to Rule No. 3, one which flies in the face of Rule No. 1: Trust WordPress.

In a previous article, I took user input (sent from a search form via AJAX) and used it directly with get_posts() to return posts that matched that search query:
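That code isn't reproduced here. A rough sketch of the idea, with a hypothetical action name (myprefix_search), request parameter (term) and callback, might look like this:

```php
<?php
// Hypothetical AJAX handler: the action, parameter and function names are assumptions.
add_action( 'wp_ajax_myprefix_search', 'myprefix_ajax_search' );
add_action( 'wp_ajax_nopriv_myprefix_search', 'myprefix_ajax_search' );

function myprefix_ajax_search() {
	// The raw user input is passed straight to get_posts().
	$term = isset( $_GET['term'] ) ? $_GET['term'] : '';

	$posts = get_posts( array(
		's'              => $term,
		'posts_per_page' => 5,
	) );

	// Send back the matching titles as JSON and stop execution.
	wp_send_json( wp_list_pluck( $posts, 'post_title' ) );
}
```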

An observant reader noticed that I hadn't performed any sanitization – and they were right. But I didn't need to. When you use high-level functions such as get_posts(), you don't need to worry about sanitizing the data – because the database queries are all properly escaped by WordPress' internals. It's a different matter entirely if you are using a direct SQL query – but we'll look at this in a later section. Similarly, functions like the_title(), the_permalink(), the_content() etc. perform their own sanitization (for the appropriate context).


Data Validation

When you receive data entered by a user it's important to validate it. (The settings API, covered in this series, allows you to specify a callback function to do exactly this). Invalid data is either auto-corrected, or the process is aborted and the user is returned to the form to try again (hopefully with an appropriate error message). The concern here is not safety but rather validity – if you're doing it right, WordPress will take care of safely adding the data to the database. What 'valid' means is up to you – it could mean a valid email address, a positive integer, text of a limited length, or one of an array of specified options. However you aim to determine validity, WordPress offers a lot of functions that can help.

Numbers

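When expecting numeric data, it's possible to check if the data 'is some form of number', for instance with is_int or is_float. Bear in mind that these test the variable's type, and form data arrives as strings, so is_numeric is the usual check for raw input. Usually, it's sufficient to simply cast the data as numeric with intval or floatval.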

If you need to ensure the number is padded with leading zeros, WordPress provides the function zeroise(), which takes the following parameters:

  • Number – the number to pad
  • Threshold – the number of digits the number will be padded to

For example:
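A short sketch of how the two parameters behave (the values are arbitrary):

```php
<?php
echo zeroise( 7, 3 );    // 007
echo zeroise( 70, 3 );   // 070
echo zeroise( 1234, 3 ); // 1234 (already longer than the threshold, so unchanged)
```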

E-mails

To check the validity of e-mails, WordPress has the is_email() function. This function uses simple checks to validate the address. For instance, it checks that it contains the '@' symbol, that it's longer than 3 characters, the domain contains only alpha-numerics and hyphens, and so forth. Obviously, it doesn't check that the e-mail address actually exists. Assuming the e-mail address passed the checks, it is returned, otherwise 'false' is returned.
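As a minimal sketch, assuming the address arrives in a hypothetical form field named email:

```php
<?php
// $_POST['email'] is a hypothetical form field name.
$submitted = isset( $_POST['email'] ) ? trim( $_POST['email'] ) : '';
$email     = is_email( $submitted );

if ( false === $email ) {
	// Invalid: return the user to the form with an error message.
} else {
	// Valid: $email holds the address and can be stored.
}
```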

HTML

Often you may wish to allow only some HTML tags in your data – for instance in comments posted on your site. WordPress provides a family of functions of the form wp_kses_* (KSES Strips Evil Scripts). These functions remove (some subset of) HTML tags, and can be used to ensure that links in the data are of specified protocols. For example the wp_kses() function accepts three arguments:

  • content – (string) Content to filter through kses
  • allowed_html – An array where each key is an allowed HTML element and the value is an array of allowed attributes for that element
  • allowed_protocols – Optional. Allowed protocol in links (for example http, mailto, feed etc.)

wp_kses() is a very flexible function, allowing you to remove unwanted tags, or just unwanted attributes from tags. For example, to only allow <strong> or <a> tags (but only allow the href attribute):
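A sketch of that call, with a made-up $content string for illustration:

```php
<?php
// Hypothetical input, containing a disallowed attribute and a disallowed tag.
$content = '<a href="http://example.com" onclick="evil()">A link</a> with <strong>bold</strong> and <em>emphasis</em>.';

// Allow <strong> with no attributes, and <a> with only the href attribute.
$allowed_html = array(
	'strong' => array(),
	'a'      => array(
		'href' => array(),
	),
);

// The onclick attribute and the <em> tags are stripped; the text inside <em> is kept.
echo wp_kses( $content, $allowed_html );
```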

Of course, specifying every allowed tag and every allowed attribute can be a laborious task. So WordPress provides other functions that allow you to use wp_kses with pre-set allowed tags and protocols – namely the ones used for validating posts and comments:
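Two such helpers are wp_kses_post() and wp_kses_data(). A brief sketch, where $untrusted_html is a placeholder for whatever you received from the user:

```php
<?php
// Placeholder input for illustration.
$untrusted_html = '<p>Hello <script>alert("x")</script><a href="http://example.com">link</a></p>';

// Keeps only the tags and attributes WordPress allows in post content.
$for_posts = wp_kses_post( $untrusted_html );

// Keeps only the (smaller) set of tags WordPress allows in comments.
$for_comments = wp_kses_data( $untrusted_html );
```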

The above functions are helpful in ensuring that HTML received from the user only contains whitelisted elements. Once we've done that we would also like to ensure that each tag is balanced, that is every opening tag has its corresponding closing tag. For this we can use balanceTags(). This function accepts two arguments:

  • content – Content to filter and balance tags of
  • force balance – True or false, whether to force the balancing of tags
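For instance, a sketch with a deliberately unbalanced string:

```php
<?php
// The <strong> tag is never closed.
$html = '<strong>Unclosed bold text';

// Passing true forces balancing regardless of the site's settings,
// so the missing closing tag is appended.
echo balanceTags( $html, true );
```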

Filenames

If you want to create a file in one of your website's directories, you will want to ensure the filename is both valid and legal. You would also want to ensure that the filename is unique for that directory. For this WordPress provides:

  • sanitize_file_name( $filename ) – sanitizes (or validates) the file-name by removing characters that are illegal in filenames on certain operating systems or that would require escaping at the command line. Replaces spaces with dashes and consecutive dashes with a single dash and removes periods, dashes and underscores from the beginning and end of the filename.
  • wp_unique_filename( $dir, $filename ) – returns a unique (for directory $dir), sanitized filename (it uses sanitize_file_name).
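A sketch of both functions, assuming you are writing to the current uploads directory and using an invented filename:

```php
<?php
// Invented filename containing spaces and an illegal character.
$requested = 'My Holiday Photo?.jpg';

// Sanitize only.
$clean = sanitize_file_name( $requested );

// Sanitize and make unique within the current uploads directory.
$upload_dir = wp_upload_dir();
$unique     = wp_unique_filename( $upload_dir['path'], $requested );
```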

Data From Text Fields

When receiving data entered into a text field, you'll probably want to strip out extra white space, tabs and line breaks, as well as any tags. For this WordPress provides sanitize_text_field().
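A quick sketch, using a hypothetical form field named nickname:

```php
<?php
// $_POST['nickname'] is a hypothetical field name.
$raw = isset( $_POST['nickname'] ) ? $_POST['nickname'] : '';

// Tags, line breaks, tabs and surplus white space are removed.
$nickname = sanitize_text_field( $raw );
```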

Keys

WordPress also provides sanitize_key. This is a very generic (and occasionally useful) function. It simply ensures the returned variable contains only lower-case alpha-numerics, dashes, and underscores.
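For example (the input string is arbitrary):

```php
<?php
// Upper-case letters are lowered, and the '!' is dropped.
echo sanitize_key( 'My_Option-Name!' ); // my_option-name
```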


Data Sanitization

Whereas validation is concerned with making sure data is valid – data sanitization is about making it safe. While some of the validation functions referred to above might be useful in making sure data is safe – in general, it is not sufficient. Even 'valid' data might be unsafe in certain contexts.


Rule No. 4: Making Data Safe Is About Context

Simply put you cannot ask "How do I make this data safe?". Instead you should ask, "How do I make this data safe for using it in X".

To illustrate this point, suppose you have a widget with a textarea where you intend to allow the user to enter some HTML. Suppose they then enter:

This is perfectly valid, and safe, HTML – however, when the widget is saved, the text jumps out of the textarea. The HTML code is not safe as a value for the textarea:
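To make this concrete, here is a sketch in which the entered text and the field name are my own assumptions:

```php
<?php
// Hypothetical text entered by the user: valid HTML in its own right.
$text = '<textarea>Hello World</textarea>';
?>

<!-- Unsafe: the </textarea> inside $text closes the form field early. -->
<textarea name="widget-text"><?php echo $text; ?></textarea>

<!-- Safe for this context: esc_textarea() encodes the tags as entities. -->
<textarea name="widget-text"><?php echo esc_textarea( $text ); ?></textarea>
```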

What is safe to use in one context, is not necessarily safe in another. Whenever you use or display data you must keep in mind what forms of sanitization need to be done in order to make using that data safe. This is why WordPress often provides several functions for the same content, for instance:

  • the_title – for using the title in standard HTML (inside header tags, for example)
  • the_title_attribute – for using the title as an attribute value (usually the title attribute in <a> tags)
  • the_title_rss – for using the title in RSS feeds

These all perform the necessary sanitization for a particular context – and if you're using them you should be sure to use the correct one. Sometimes though, we'll need to perform our own sanitization – often because we have custom input beyond the standard post title, permalink, content etc. that WordPress handles for us.

Escaping HTML

When printing variables to the page we need to be mindful of how the browser will interpret them. Let's consider the following example:

Suppose $title is set to <script>alert('Injected javascript')</script>. Rather than being displayed as text, the <script> tags will be interpreted as HTML and the enclosed JavaScript will be injected into the page.

This form of injection (as also demonstrated in the search form example) is called cross-site scripting (XSS), and this benign example belies its severity. Injected script can essentially control the browser and 'act on behalf' of the user or steal the user's cookies. This becomes an even more serious issue if the user is logged in. To prevent variables printed inside HTML being interpreted as HTML, WordPress provides the well-known esc_html function. In this example:
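A sketch of the unsafe and safe versions side by side (the heading markup is an assumption):

```php
<?php $title = "<script>alert('Injected javascript')</script>"; ?>

<!-- Unsafe: the script tags are output as-is and the browser runs them. -->
<h1><?php echo $title; ?></h1>

<!-- Safe: the angle brackets and quotes are converted to entities, so the text is displayed. -->
<h1><?php echo esc_html( $title ); ?></h1>
```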

Escaping Attributes

Now consider the following example:

Because $value contains double quotes, unescaped it can jump out of the value attribute and inject script, for example, by using the onfocus attribute. To escape unsafe characters (such as quotes, and double-quotes in this case), WordPress provides the function esc_attr. Like esc_html it replaces 'unsafe' characters by their entity equivalents. In fact, at the time of writing, these functions are identical – but you should still use the one that is appropriate for the context.

For this example we should have:
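Something along these lines, where the contents of $value and the field name are assumptions for illustration:

```php
<?php
// Hypothetical value that closes the attribute and adds an onfocus handler.
$value = 'my-value" onfocus="alert(\'Injected javascript\')';
?>

<!-- Unsafe: the double quote in $value ends the value attribute early. -->
<input type="text" name="my-input" value="<?php echo $value; ?>" />

<!-- Safe: the quotes are converted to entities and stay inside the attribute. -->
<input type="text" name="my-input" value="<?php echo esc_attr( $value ); ?>" />
```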

Both esc_html and esc_attr also come with __, _e, and _x variants:

  • esc_html__('Text to translate', 'plugin-domain') / esc_attr__ – returns the escaped translated text
  • esc_html_e('Text to translate', 'plugin-domain') / esc_attr_e – displays the escaped translated text
  • esc_html_x('Text to translate', $context, 'plugin-domain') / esc_attr_x – translates the text according to the passed context, and then returns the escaped translation
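In a template these might be used as follows (the strings and the 'plugin-domain' text domain are placeholders):

```php
<!-- Displays the escaped translation inside an HTML element. -->
<h2><?php esc_html_e( 'Settings', 'plugin-domain' ); ?></h2>

<!-- Returns the escaped translation for use in an attribute. -->
<input type="submit" value="<?php echo esc_attr__( 'Save changes', 'plugin-domain' ); ?>" />

<!-- The context string helps translators tell identical words apart. -->
<label><?php echo esc_html_x( 'Order', 'sort order of a list', 'plugin-domain' ); ?></label>
```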

HTML Class Names

For class names, WordPress provides sanitize_html_class – this escapes variables for use in class names, simply by restricting the returned value to alpha-numerics, hyphens and underscores. Note: It does not ensure the class name is valid (reference: http://www.w3.org/TR/CSS21/syndata.html#value-def-identifier).

In CSS, identifiers can contain only the characters [a-zA-Z0-9] and ISO 10646 characters U+00A0 and higher, plus the hyphen (-) and the underscore (_); they cannot start with a digit, two hyphens, or a hyphen followed by a digit. Identifiers can also contain escaped characters and any ISO 10646 character as a numeric code.
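A small sketch, where $user_class and the fallback class name are invented for illustration:

```php
<?php
// Hypothetical user-supplied class name containing disallowed characters.
$user_class = 'my class<script>';

// The optional second argument is returned if nothing is left after sanitizing.
$class = sanitize_html_class( $user_class, 'default-class' );
?>
<div class="<?php echo esc_attr( $class ); ?>">...</div>
```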

Escaping URLs

Let's now look at another common practice, printing variables into the href attribute:

Clearly it is vulnerable to the same form of attack as illustrated in escaping HTML and attributes. But what if the $url was set as follows:

On clicking the link, the alert function would be fired. This contains no HTML, or any quotes that allow it to jump out of the href attribute – so esc_attr is not sufficient here. This is why context matters: esc_attr($url) would be safe in the title attribute, but not in the href attribute. The reason is the javascript protocol, which, while perfectly valid, is not to be considered safe in this context. Instead you should use:

  • esc_url – for escaping URLs that will be printed to the page.
  • esc_url_raw – for escaping URLs to save to the database or use in URL redirecting.

esc_url strips out various offending characters, and replaces quotes and ampersands with their entity equivalents. It then checks that the protocol being used is allowed (javascript, by default, isn't).

What esc_url_raw does is almost identical to esc_url, but it does not replace ampersands and single quotes with entities (which you don't want when using the URL as a URL, rather than displaying it).

In this example, we are displaying the URL, so we use esc_url:
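Putting the pieces of this section together in one sketch (the value of $url and the link text are assumptions):

```php
<?php $url = "javascript:alert('Injected javascript')"; ?>

<!-- Unsafe: clicking the link runs the script. -->
<a href="<?php echo $url; ?>">Unescaped</a>

<!-- Still unsafe: the quotes are encoded, but the javascript protocol remains. -->
<a href="<?php echo esc_attr( $url ); ?>">Escaped as an attribute</a>

<!-- Safe: the protocol is not allowed, so esc_url() returns an empty string. -->
<a href="<?php echo esc_url( $url ); ?>">Escaped as a URL</a>

<!-- Optionally restrict the allowed protocols further. -->
<a href="<?php echo esc_url( $url, array( 'http', 'https' ) ); ?>">http(s) only</a>
```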

Although not necessary in most cases, both functions accept an optional array to specify which protocols (such as http, https, ftp, ftps, mailto, etc) you wish to allow.

Escaping JavaScript

Sometimes you'll want to print JavaScript variables to a page (usually in the head):

In fact, if you are doing this, you should almost certainly be using wp_localize_script() – which handles sanitization for you. (If anyone can think of a reason why you might need to use the above method instead, I would like to hear it).

However, to make the above example safe, you can use the esc_js function:
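A sketch of the inline approach, with an invented $message value that tries to break out of the string:

```php
<?php
// Hypothetical value that closes the string literal and adds its own statement.
$message = "'; alert('Injected javascript'); var x = '";
?>

<!-- Unsafe: the quote in $message ends the string early. -->
<script type="text/javascript">
	var message = '<?php echo $message; ?>';
</script>

<!-- Safe: esc_js() escapes quotes and line endings for use inside a JS string. -->
<script type="text/javascript">
	var message = '<?php echo esc_js( $message ); ?>';
</script>
```

For comparison, the wp_localize_script() route might look like this (the script handle, file path, object name and option name are all hypothetical):

```php
<?php
add_action( 'wp_enqueue_scripts', 'myprefix_enqueue_scripts' );

function myprefix_enqueue_scripts() {
	wp_enqueue_script( 'myprefix-script', plugins_url( 'js/script.js', __FILE__ ), array( 'jquery' ), '1.0', true );

	// The array is encoded by WordPress and exposed to the script as myprefixData.
	wp_localize_script( 'myprefix-script', 'myprefixData', array(
		'message' => get_option( 'myprefix_message' ),
	) );
}
```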

Escaping Textarea

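When displaying content in a textarea, esc_html is not sufficient because it does not double encode entities. For example, if $var already contains entities, such as the text &lt;b&gt;bold&lt;/b&gt;, esc_html leaves them as they are. Printed in the textarea, $var will then appear as <b>bold</b>, rather than the & of each entity also being encoded as &amp; so that the entities themselves are shown.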

For this WordPress provides esc_textarea, which is almost identical to esc_html, but does double encode entities. Essentially it is little more than a wrapper for htmlspecialchars. In this example:
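A sketch, with an invented $var that already contains entities:

```php
<?php
// Hypothetical stored text in which the user typed the entities themselves.
$var = 'text &lt;b&gt;bold&lt;/b&gt;';
?>

<!-- Existing entities are encoded again, so the textarea shows them exactly as stored. -->
<textarea name="my-textarea"><?php echo esc_textarea( $var ); ?></textarea>
```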

Antispambot

Displaying e-mail addresses on your website leaves them prone to e-mail harvesters. One simple defence is to disguise the e-mail address. WordPress provides antispambot, which encodes random parts of the e-mail address into their HTML entities (and hexadecimal equivalents if $mailto = 1). On each page load the encoding should be different and while the returned address renders correctly in the browser, it should appear as gobbledygook to the spambots. The function accepts two arguments:

  • e-mail – the address to obfuscate
  • mailto – 1 or 0 (1 if using the mailto protocol in a link tag)
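For example, with a placeholder address:

```php
<?php $email = 'someone@example.com'; ?>

<!-- The href uses the mailto variant; the link text uses the default encoding. -->
<a href="mailto:<?php echo antispambot( $email, 1 ); ?>"><?php echo antispambot( $email ); ?></a>
```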

Query Strings

If you wish to add (or remove) variables from a query string (this is very useful if you wish to allow users to select an order for your posts), the easiest way is to use add_query_arg and remove_query_arg. These functions build the query string for you, but they do not escape it. Values that come from the user should be URL-encoded (for example with rawurlencode) before being passed in, and the returned URL should still go through esc_url before it is printed to the page.

add_query_arg accepts two arguments:

  • query parameters – an associative array of parameters -> values
  • url – the URL to add the parameters and their values to. If omitted, the URL of the current page is used

remove_query_arg also accepts two arguments, the first is an array of parameters to remove, the second is as above.
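A sketch of both functions (the parameter names and link text are just examples):

```php
<?php
// Add orderby and order to the current page's URL (the second argument is omitted).
$url = add_query_arg( array(
	'orderby' => 'title',
	'order'   => 'asc',
) );

// Remove the order parameter again from that URL.
$url_without_order = remove_query_arg( array( 'order' ), $url );
?>

<!-- Escape the URL when printing it. -->
<a href="<?php echo esc_url( $url ); ?>">Sort by title</a>
```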


Validation & Sanitization

As previously mentioned, sanitization doesn't make much sense without a context – so it's pretty pointless to sanitize data when writing to the database. Often, you need to store data in its raw format anyway, and in any case – Rule No. 1 dictates that we should always sanitize on output.

Validation of data, on the other hand, should be done as soon as it's received and before it's written to the database. The idea is that 'invalid' data should either be auto-corrected, or be flagged to the user, and only valid data should be given to the database.

That said – you may want to also perform validation when data is displayed too. In fact sometimes, 'validation' will also ensure the data is safe. But the priority here is on safety and you should avoid excessive validation that would run on every page load (the wp_kses_* functions, for instance, are very expensive to perform).


Database Escaping

When using functions such as get_posts or classes such as WP_Query and WP_User_Query, WordPress takes care of the necessary sanitization in querying the database. However, when retrieving data from a custom table, or otherwise performing a direct SQL query on the database – proper sanitization is then up to you. WordPress, however, provides a helpful class, wpdb (available through the global $wpdb object), that helps with escaping SQL queries.

Let's consider this basic 'SELECT' command, where $age and $firstname are variables storing an age and name that we are querying:

We have not escaped these variables, so potentially further commands could be injected. Borrowing xkcd's example from above:

Will run as the command(s):

And delete our entire Students table.
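The whole sequence can be sketched as follows (the table name, column names and values are assumptions based on the xkcd example):

```php
<?php
// Hypothetical values: an age, and a 'name' carrying extra SQL.
$age       = 14;
$firstname = "Robert'; DROP TABLE Students;--";

// Unsafe: the variables are interpolated directly into the SQL string.
$sql = "SELECT * FROM Students WHERE age = $age AND firstname = '$firstname'";

// $sql now reads:
// SELECT * FROM Students WHERE age = 14 AND firstname = 'Robert'; DROP TABLE Students;--'
```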

To prevent this, we can use the $wpdb->prepare method. This accepts two parameters:

  • The SQL command as a string, where string variables are replaced by the placeholder %s, integers by the placeholder %d, and floats by %f
  • An array of values for the above placeholders, in the order they appear in the query (the values can also be passed as separate arguments)

In this example:
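A sketch using the same hypothetical Students table and values:

```php
<?php
global $wpdb;

$age       = 14;
$firstname = "Robert'; DROP TABLE Students;--";

// The placeholders are left unquoted; prepare() quotes and escapes the values.
$sql = $wpdb->prepare(
	'SELECT * FROM Students WHERE age = %d AND firstname = %s',
	array( $age, $firstname )
);

// The whole of $firstname is now treated as a single string value.
$students = $wpdb->get_results( $sql );
```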

The escaped SQL query ($sql in this example) can then be used with one of the methods:

  • $wpdb->get_row($sql)
  • $wpdb->get_var($sql)
  • $wpdb->get_results($sql)
  • $wpdb->get_col($sql)
  • $wpdb->query($sql)

Inserting and Updating Data

For inserting or updating data, WordPress makes life even easier by providing the $wpdb->insert() and $wpdb->update() methods.

The $wpdb->insert() method accepts three arguments:

  • Table name – the name of the table
  • Data – array of data to insert as column->value pairs
  • Formats – array of formats for the corresponding values ('%s', '%d' or '%f')

The $wpdb->update() method accepts five arguments:

  • Table name – the name of the table
  • Data – array of data to update as column->value pairs
  • Where – array of data to match as column->value pairs
  • Data Format – array of formats for the corresponding data values
  • Where Format – array of formats for the corresponding 'where' values

Both the $wpdb->insert() and the $wpdb->update() methods perform all the necessary sanitization for writing to the database.
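A sketch of both calls against a hypothetical custom table (the table and column names are assumptions):

```php
<?php
global $wpdb;

// Hypothetical custom table.
$table = $wpdb->prefix . 'students';

// Insert a row: firstname is a string (%s), age is an integer (%d).
$wpdb->insert(
	$table,
	array( 'firstname' => 'Robert', 'age' => 14 ),
	array( '%s', '%d' )
);

// Update the age of every row whose firstname is 'Robert'.
$wpdb->update(
	$table,
	array( 'age' => 15 ),            // data
	array( 'firstname' => 'Robert' ), // where
	array( '%d' ),                   // data format
	array( '%s' )                    // where format
);
```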

Like Statements

Because the $wpdb->prepare method uses % to distinguish the place-holders, care needs to be taken when using the % wildcard in SQL LIKE-statements. The Codex suggests escaping them with a second %. Alternatively you can escape the term to be searched for with like_escape (deprecated since WordPress 4.0 in favour of $wpdb->esc_like) and then add the wildcard % where appropriate, before including this in the query using the prepare method. For instance:

Would be made safe with:
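A sketch using $wpdb->esc_like, the replacement for like_escape on WordPress 4.0 and later (the search term and the query are invented for illustration):

```php
<?php
global $wpdb;

// Hypothetical search term containing a literal % character.
$term = '50% off';

// Unsafe version, for comparison:
// "SELECT ID, post_title FROM {$wpdb->posts} WHERE post_title LIKE '%$term%'"

// Escape the LIKE wildcards in the term, then add our own wildcards around it.
$like = '%' . $wpdb->esc_like( $term ) . '%';

// prepare() then handles the quoting and SQL escaping of the whole pattern.
$sql = $wpdb->prepare(
	"SELECT ID, post_title FROM {$wpdb->posts} WHERE post_title LIKE %s",
	$like
);

$posts = $wpdb->get_results( $sql );
```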


Summary

This isn't an exhaustive list of the functions available for validation and sanitization, but it should cover the vast majority of use cases. A lot of these (and other) functions can be found in /wp-includes/formatting.php and I'd strongly recommend digging into the core code and having a look into how WordPress core does validation and sanitization of data.

Did you find this article useful? Do you have any further suggestions on best practices for data validation and sanitization in WordPress? Let us know in the comments below.
