Key Differences Between Validation and Sanitization

VIP Services developer Daniel Bachhuber shares some tips on writing better code for your WordPress site:

Your code works, but is it safe? When writing code for a high-profile environment, you’ll need to be extra cautious of how you handle data coming into WordPress and how it’s presented to the end user. This commonly comes up when building a settings page for your theme, creating and manipulating shortcodes, or saving and rendering extra data associated with a post.

There’s a distinction between how input and output are managed, however.

Validation: Checking User Input

To validate is to ensure the data you’ve requested of the user matches what they’ve submitted. There are several core methods you can use for input validation; usage obviously depends on the type of fields you’d like to validate. Let’s take a look at an example.

Say we have an input area in our form like this:

<input type="text" id="my-zipcode" name="my-zipcode" maxlength="5" />

Just like that, we’ve limited my user to five characters of input, but there’s no limitation on what they can input. They could enter “11221” or “eval(“. If we’re saving to the database, there’s no way we want to give the user unrestricted write access.

This is where validation plays a role. When processing the form, we’ll write code to check each field for its proper data type. If it’s not of the proper data type, we’ll discard it. For instance, to check “my-zipcode” field, we might do something like this:

$safe_zipcode = intval( $_POST['my-zipcode'] );
if ( ! $safe_zipcode )
$safe_zipcode = '';
update_post_meta( $post->ID, 'my_zipcode', $safe_zipcode );

The intval() function casts user input as an integer, and defaults to zero if the input was a non-numeric value. We then check to see if the value ended up as zero. If it did, we’ll save an empty value to the database. Otherwise, we’ll save the properly validated zipcode.

This style of validation most closely follows WordPress’ whitelist philosophy: only allow the user to input what you’re expecting. Luckily, there’s a number of handy helper functions you can use for most every data type.

Sanitization: Escaping Output

For security on the other end of the spectrum, we have sanitization. To sanitize is to take the data you may already have and help secure it prior to rendering it for the end user. WordPress thankfully has a few helper functions we can use for most of what we’ll commonly need to do:

esc_html() we should use anytime our HTML element encloses a section of data we’re outputting.

<h4><?php echo esc_html( $title ); ?></h4>

esc_url() should be used on all URLs, including those in the ‘src’ and ‘href’ attributes of an HTML element.

<img src="<?php echo esc_url( $great_user_picture_url ); ?>" />

esc_js() is intended for inline Javascript.

<a href="#" onclick="<?php echo esc_js( $custom_js ); ?>">Click me</a>

esc_attr() can be used on everything else that’s printed into an HTML element’s attribute.

<ul class="<?php echo esc_attr( $stored_class ); ?>">

It’s important to note that most WordPress functions properly prepare the data for output, and you don’t need to escape again.

<h4><?php the_title(); ?></h4>

Also, as there are always exceptions to the rule, there are a selection of user-submitted data that needs to be validated and sanitized. Freeform text areas would fall into this category. For this, you can run user data through sanitize_text_field() or any of the wp_kses_*() functions.

To recap: follow the whitelist philosophy with data validation, and only allow the user to input data of your expected type. If it’s not the proper type, discard it. Sanitize data as much as possible on output, and a selection needs to be sanitized on input too.

Hit us with your questions or tips in the comments.

9 thoughts on “Key Differences Between Validation and Sanitization

  1. I think there are three concepts actually, or at least in my language that’s not English there is a semantic difference

    1- Validation: you “validate”, ie deem valid or invalid, data at input time. For instance if asked for a zipcode user enters “zzz43″, that’s invalid. At this point, you can reject or… sanitize.

    2- sanitization: you make data “sane” before storing it. For instance if you want a zipcode, you can remove any character that’s not [0-9]

    3- escaping: at output time, you ensure data printed will never corrupt display and/or be used in an evil way (escaping HTML etc…)

    • Thanks Ozh. That’s a really good description of the differences. I’d been attempting to explain this last night and failed, so had been googling around for a more thorough definition. Your explanation seems the most accurate/easy to grok that I’ve found so far.

  2. Pingback: Validation And Sanitization Primer

  3. I realize that this is somewhat orthogonal to the primary point, but it is important to make sure your sanitization steps do not accidentally change good data in bad ways. In the example above:

    $safe_zipcode = intval( $_POST['my-zipcode'] );

    An east coast zip code that starts with a leading zero will be altered. The following example code shows how the zip code ‘04040’ will be changed to ‘4040’ rather than the ‘04040’ that you are looking for. intval() does not strip non-numeric values, it literally returns the integer value of the variable.

    $zipcode_in_maine = '04040';

    $safe_zipcode = intval( $zipcode_in_maine );

    echo $safe_zipcode;

    • It depends on the one you use, which is why it’s important to use them appropriately too.

      esc_html() checks for invalid characters and converts special characters to their HTML entities.

      esc_url() (somewhat improperly named):

      Rejects URLs that do not have one of the provided whitelisted protocols (defaulting to http, https, ftp, ftps, mailto, news, irc, gopher, nntp, feed, and telnet), eliminates invalid characters, and removes dangerous characters. This function encodes characters as HTML entities: use it when generating an (X)HTML or XML document. Encodes ampersands (&) and single quotes (‘) as numeric entity references (&#038, &#039).

      esc_attr():

      Encodes & ” ‘ (less than, greater than, ampersand, double quote, single quote). Will never double encode entities.

      esc_js():

      Escape single quotes, htmlspecialchar ” &, and fix line endings.
      Escapes text strings for echoing in JS. It is intended to be used for inline JS (in a tag attribute, for example onclick=”…”).

  4. Pingback: Status « danielbachhuber

  5. Pingback: Key Differences Between Validation and Sanitization | Nick Hamze

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s