What Is URL (Uniform Resource Locator) Encoding?

Definition

URL encoding is an encoding format used in URLs. The standard allows the use of arbitrary data inside a Uniform Resource Identifier (a URI; typically a URL) while using only a narrow set of US-ASCII characters. The encoding exists because URLs and HTTP request parameters often contain characters (or other data) that fall outside that narrow set (e.g. control characters or non-ASCII text).

Reserved and unreserved characters

In general, a URI can contain characters that are either reserved or unreserved. Unreserved characters are characters that have no special meaning; they can be displayed as-is and require no special handling. These include uppercase and lowercase letters (A-Z, a-z), decimal digits (0-9), hyphen (-), period (.), underscore (_), and tilde (~).
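As a quick illustration of the unreserved set described above, here is a minimal sketch in JavaScript. The helper name isUnreserved is hypothetical and only for illustration; encodeURIComponent is the standard built-in.

```javascript
// The unreserved set listed above: letters, digits, "-", ".", "_" and "~".
const UNRESERVED = /^[A-Za-z0-9._~-]$/;

// Hypothetical helper: returns true when a single character is unreserved.
function isUnreserved(ch) {
  return UNRESERVED.test(ch);
}

console.log(isUnreserved("a")); // true
console.log(isUnreserved("~")); // true
console.log(isUnreserved("#")); // false

// Unreserved characters pass through the built-in encoder unchanged.
console.log(encodeURIComponent("AZaz09-._~")); // "AZaz09-._~"
```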

Reserved characters, on the other hand, are characters that may delimit the URI into sub-components: characters such as / # & and others. The following is the list of all reserved characters: ! # $ & ' ( ) * + , / : ; = ? @ [ ].

We cannot use reserved characters as-is, because this would create ambiguous URIs. For instance, consider the URL http://example.com/foo#bar. Does this URL point to an anchor #bar inside the resource /foo, or does it point to a resource /foo#bar, that is, a resource whose name contains the character #? Without URL encoding it would be impossible to tell.

We resolve such ambiguities by encoding reserved characters when they are used as data; when they are used as delimiters, we leave them as-is.

Percent encoding

To encode reserved characters, we use the percent-encoding scheme. In percent-encoding, each byte is encoded as a character triplet that consists of the percent character % followed by the two hexadecimal digits that represent the byte's numeric value. For instance, %23 is the percent-encoding for the binary octet 00100011, which in US-ASCII corresponds to the character #. Strictly speaking, the percent character (%) isn't reserved, but it serves as the indicator for percent-encoded bytes and therefore requires special handling. Simply put: it must also be percent-encoded (as %25).
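The byte-to-triplet rule can be sketched in a few lines of JavaScript. The helper percentEncodeByte is a hypothetical name introduced here for illustration; it is not part of any standard API.

```javascript
// Hypothetical helper: percent-encode a single byte value (0-255)
// as "%" followed by two uppercase hexadecimal digits.
function percentEncodeByte(byte) {
  return "%" + byte.toString(16).toUpperCase().padStart(2, "0");
}

console.log(percentEncodeByte(0b00100011));        // "%23", the octet for "#"
console.log(percentEncodeByte("%".charCodeAt(0))); // "%25", the percent sign itself

// The built-in encoder produces the same triplets for these characters.
console.log(encodeURIComponent("#%")); // "%23%25"
```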

So with percent-encoding, we know that the URL http://example.com/foo#bar points to an anchor bar inside the resource /foo, while http://example.com/foo%23bar points to the resource /foo#bar, where the character # is encoded as %23.
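One way to see the difference is to parse both URLs with the standard URL class, available in browsers and Node.js. This is only a sketch; the variable names are illustrative.

```javascript
// "#" used as a delimiter: it separates the path from the fragment.
const withAnchor = new URL("http://example.com/foo#bar");
console.log(withAnchor.pathname); // "/foo"
console.log(withAnchor.hash);     // "#bar"

// "#" used as data: it is percent-encoded and stays inside the path.
const withEncodedHash = new URL("http://example.com/foo%23bar");
console.log(withEncodedHash.pathname); // "/foo%23bar"
console.log(withEncodedHash.hash);     // ""

// Decoding the path recovers the resource name that contains "#".
console.log(decodeURIComponent(withEncodedHash.pathname)); // "/foo#bar"
```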

Other characters

Percent encoding is also used to represent other characters: those that are neither reserved nor unreserved. As an example, imagine a GET request containing a non-ASCII string parameter, such as the search query zajec in jež, which is Slovenian for a rabbit and a hedgehog.

In such cases, we first encode the non-ASCII characters as UTF-8 and then percent-encode each byte of the resulting string. So if we send a GET request to the DuckDuckGo search engine containing the search query zajec in jež, we generate the following URL: https://duckduckgo.com/?q=zajec%20in%20je%C5%BE
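The two steps (UTF-8 first, then percent-encoding each byte) can be sketched as follows. The helper percentEncodeUtf8 is a hypothetical name used only for illustration; TextEncoder and encodeURIComponent are standard built-ins.

```javascript
// Hypothetical helper: percent-encode every byte of a string's UTF-8 form.
function percentEncodeUtf8(text) {
  const bytes = new TextEncoder().encode(text); // step 1: string to UTF-8 bytes
  return Array.from(
    bytes,
    (b) => "%" + b.toString(16).toUpperCase().padStart(2, "0") // step 2: byte to triplet
  ).join("");
}

// "\u017E" is the letter ž; its UTF-8 form is the two bytes C5 BE.
console.log(percentEncodeUtf8("\u017E")); // "%C5%BE"

// The built-in encoder applies the same steps, but leaves ASCII letters as-is.
const query = "zajec in je\u017E";
const searchUrl = "https://duckduckgo.com/?q=" + encodeURIComponent(query);
console.log(searchUrl); // "https://duckduckgo.com/?q=zajec%20in%20je%C5%BE"
```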

Encoding the space character

You may have seen cases where the space character was encoded as the character +. Percent-encoding, however, suggests it should be encoded as %20 (in US-ASCII, the space character is 20 hexadecimal or 32 decimal). So what is going on?

Such encodings are typically created by HTML forms. When a user submits an HTML form, the data is URL-encoded using an early version of the URI percent-encoding rules with a number of modifications, such as replacing spaces with +.

Note, however, that using + instead of %20 is valid only in application/x-www-form-urlencoded content, such as the query part of a URL. To make this clearer, consider the following cases.

  1. http://www.example.com/search+script.php?search+query=search+term

    In this URL, the resource being requested is search+script.php (the plus character (+) is part of the filename), while the parameter name is search query and its value is search term. In the name of the query parameter and in its value, the + sign is converted to a space; in the name of the resource, search+script.php, the + sign remains.

  2. http://www.example.com/search+script.php?search%20query=search%20term

    This case is equivalent to the example above. The difference (using %20 instead of the + sign in the parameter name and value) is only superficial. Both URLs point to the same resource, search+script.php, and they contain the same parameters.

  3. http://www.example.com/search%20script.php?search%20query=search%20term

    This example, however, is different. Here the resource name contains an actual space character, so the name of the requested resource is search script.php; the request parameter names and values remain the same as above. Consequently, this URL is different from those above.
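The three cases above can be checked with the standard URL and URLSearchParams classes, which decode the query part using the form-encoding rules. This is a sketch; the array and variable names are illustrative.

```javascript
const cases = [
  "http://www.example.com/search+script.php?search+query=search+term",
  "http://www.example.com/search+script.php?search%20query=search%20term",
  "http://www.example.com/search%20script.php?search%20query=search%20term",
];

// For each URL, collect the decoded resource name and the parameter value.
const results = cases.map((text) => {
  const url = new URL(text);
  return {
    // In the path, "+" is kept literally; only %20 decodes to a space.
    resource: decodeURIComponent(url.pathname.slice(1)),
    // In the query, both "+" and %20 decode to a space.
    value: url.searchParams.get("search query"),
  };
});

for (const { resource, value } of results) {
  console.log(resource, "|", value);
}
// search+script.php | search term
// search+script.php | search term
// search script.php | search term

// The two encoders differ in the same way when producing output.
console.log(encodeURIComponent("search term")); // "search%20term"
console.log(new URLSearchParams({ "search query": "search term" }).toString());
// "search+query=search+term"
```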

A URL encoder

The application below performs URL encoding and decoding on arbitrary strings. Type into the Input field to encode, or into the Output field to decode. Feel free to test it out.

Input <br>
<input type="text" name="input" id="input"><br><br>

Output <br>
<input type="text" name="encoded" id="encoded">

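<script>
let input = null;
let encoded = null;

// Look up both text fields once the page has loaded and wire up the handlers.
// The "input" event also fires on paste and cut, not only on key presses.
document.addEventListener("DOMContentLoaded", () => {
  input = document.querySelector("#input");
  input.addEventListener("input", encode);
  encoded = document.querySelector("#encoded");
  encoded.addEventListener("input", decode);
});

// Percent-encode the Input field into the Output field.
function encode(event) {
  encoded.value = encodeURIComponent(input.value);
}

// Decode the Output field back into the Input field.
// decodeURIComponent throws a URIError on malformed sequences such as "%E0%A4".
function decode(event) {
  try {
    input.value = decodeURIComponent(encoded.value);
  } catch (error) {
    input.value = "Invalid URI string";
  }
}
</script>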

Glossary

HTTP

Hypertext Transfer Protocol. A protocol that connects web browsers to web servers when they request content.

Encoding

The act of transferring or saving information into a usable file format.