url/doc/modules/ROOT/pages/urls/normalization.adoc

//
// Copyright (c) 2023 Alan de Freitas (alandefreitas@gmail.com)
//
// Distributed under the Boost Software License, Version 1.0. (See accompanying
// file LICENSE_1_0.txt or copy at https://www.boost.org/LICENSE_1_0.txt)
//
// Official repository: https://github.com/boostorg/url
//

= Normalization

Normalization allows us to determine if two URLs refer to the same resource.
URL comparisons serve the same purpose: two URL strings are compared as if they had been normalized first.

There is no way to determine whether two URLs refer to the same resource without full knowledge or control of them.
Thus, equivalence is based on string comparisons augmented by additional URL and scheme rules.
This means that comparison alone cannot prove that two URLs identify different resources, because the same resource can always be served from different addresses.

For this reason, comparison methods are designed to minimize false negatives while strictly avoiding false positives.
In other words, if two URLs compare equal, they definitely represent the same resource.
If they are considered different, they might still refer to the same resource depending on the application.

Additional context-dependent rules can be applied to reduce the number of false negatives.
The methods below are listed from cheapest to most expensive, and the cheaper methods are more likely to produce false negatives:

* Simple String Comparison
* Syntax-Based Normalization
* Scheme-Based Normalization
* Protocol-Based Normalization

== Simple String Comparison

Simple String Comparison can be performed by accessing the underlying buffer of URLs:

[source,cpp]
----
include::example$unit/snippets.cpp[tag=snippet_normalizing_1,indent=0]
----

Simple String Comparison fails to identify that the URLs above point to the same resource, even though they are equivalent under the rules of https://tools.ietf.org/html/rfc3986[rfc3986,window=blank_] alone.

== Syntax-Based Normalization

The comparison operators compare URLs as if Syntax-Based Normalization, following the rules defined by https://tools.ietf.org/html/rfc3986[rfc3986,window=blank_], had been applied to both operands.

[source,cpp]
----
include::example$unit/snippets.cpp[tag=snippet_normalizing_2,indent=0]
----

For mutable URLs, the member function cpp:normalize[] can also be used to apply Syntax-Based Normalization in place.
A normalized URL is stored as a canonical string: any two URLs that compare equal end up with the same underlying representation after normalization.
In other words, Simple String Comparison of two normalized URLs is equivalent to comparing the original URLs with Syntax-Based Normalization.

[source,cpp]
----
include::example$unit/snippets.cpp[tag=snippet_normalizing_3,indent=0]
----

=== Syntax-Based Normalization Procedure

Normalization uses the following definitions of https://tools.ietf.org/html/rfc3986[rfc3986,window=blank_] to minimize false negatives:

* Case Normalization: percent-encoding triplets are normalized to use uppercase hexadecimal digits, and the scheme and host are normalized to lowercase
* Percent-Encoding Normalization: decode octets that do not require percent-encoding
* Path Segment Normalization: path segments "." and ".." are resolved
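
The case and percent-encoding rules above can be sketched with the standard library alone.
This is a minimal illustration and not the library's implementation: the helper names `hex_value`, `is_unreserved` and `normalize_pct_encoding` are hypothetical, and the function is assumed to operate on a single URL component rather than a whole URL.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: value of a hexadecimal digit, or -1 if `c` is not one.
inline int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Unreserved characters of rfc3986, section 2.3:
// ALPHA / DIGIT / "-" / "." / "_" / "~"
inline bool is_unreserved(unsigned char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') ||
           c == '-' || c == '.' || c == '_' || c == '~';
}

// Hypothetical helper: normalizes the percent-encoding of one URL component.
inline std::string normalize_pct_encoding(std::string_view in)
{
    static constexpr char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        int hi = -1;
        int lo = -1;
        if (in[i] == '%' && i + 2 < in.size())
        {
            hi = hex_value(in[i + 1]);
            lo = hex_value(in[i + 2]);
        }
        if (hi < 0 || lo < 0)
        {
            // Literal character or malformed escape: copy unchanged.
            out += in[i];
            continue;
        }
        unsigned char octet = static_cast<unsigned char>(hi * 16 + lo);
        if (is_unreserved(octet))
        {
            // Percent-Encoding Normalization: the escape was unnecessary.
            out += static_cast<char>(octet);
        }
        else
        {
            // Case Normalization: keep the escape, with uppercase digits.
            out += '%';
            out += hex[octet >> 4];
            out += hex[octet & 0x0F];
        }
        i += 2; // skip the two hexadecimal digits just consumed
    }
    return out;
}
```

With this sketch, `normalize_pct_encoding("/%69%6e%64%65%78%20file.html")` returns `/index%20file.html`: the escapes for unreserved letters are decoded, while `%20` stays encoded because a space is not an unreserved character.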

The following example normalizes the percent-encoding and path segments of a URL:

[source,cpp]
----
include::example$unit/snippets.cpp[tag=snippet_normalizing_4,indent=0]
----
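
Path Segment Normalization can also be sketched in isolation.
The function below follows the `remove_dot_segments` procedure of rfc3986, section 5.2.4, using only the standard library.
It is an illustrative sketch and not the library's implementation, which may treat some inputs (relative paths in particular) differently; the helper name `starts_with` is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: true if `s` begins with `prefix`.
inline bool starts_with(std::string_view s, std::string_view prefix)
{
    return s.substr(0, prefix.size()) == prefix;
}

// Sketch of remove_dot_segments from rfc3986, section 5.2.4.
inline std::string remove_dot_segments(std::string_view in)
{
    std::string out;
    while (!in.empty())
    {
        if (starts_with(in, "../"))
        {
            in.remove_prefix(3);            // A: drop a leading "../"
        }
        else if (starts_with(in, "./"))
        {
            in.remove_prefix(2);            // A: drop a leading "./"
        }
        else if (starts_with(in, "/./"))
        {
            in.remove_prefix(2);            // B: "/./" becomes "/"
        }
        else if (in == "/.")
        {
            in = "/";                       // B: trailing "/." becomes "/"
        }
        else if (starts_with(in, "/../") || in == "/..")
        {
            // C: replace with "/" and remove the last output segment.
            in = (in == "/..") ? std::string_view("/") : in.substr(3);
            std::size_t pos = out.rfind('/');
            out.erase(pos == std::string::npos ? 0 : pos);
        }
        else if (in == "." || in == "..")
        {
            in = {};                        // D: a lone dot segment vanishes
        }
        else
        {
            // E: move the first segment, including its leading "/"
            // if present, from the input to the output.
            std::size_t end = in.find('/', 1);
            std::string_view seg = in.substr(0, end);
            out.append(seg);
            in.remove_prefix(seg.size());
        }
    }
    return out;
}
```

For example, `remove_dot_segments("/doc/../index.html")` returns `/index.html`, and the rfc3986 example `/a/b/c/./../../g` becomes `/a/g`.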

== Scheme and Protocol-Based Normalization

Syntax-Based Normalization can also be used as a first step for Scheme-Based and Protocol-Based Normalization.
One common scheme-specific rule is to ignore a port that equals the default port of the scheme and to treat an empty path as equivalent to the absolute path `/`:

[source,cpp]
----
include::example$unit/snippets.cpp[tag=snippet_normalizing_5,indent=0]
----
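
The same rule can be illustrated without the library on components that have already been parsed.
Everything in this sketch is an assumption made for illustration: `url_parts` is a hypothetical struct (not a library type), `default_port` covers only a handful of well-known schemes, and the scheme is assumed to be lowercase already, as it would be after Syntax-Based Normalization.

```cpp
#include <string>

// Hypothetical, already-parsed URL components (not a library type).
struct url_parts
{
    std::string scheme; // assumed lowercase
    std::string host;
    std::string port;   // empty when the URL has no port
    std::string path;
};

// Hypothetical helper: default port of a few well-known schemes,
// or an empty string when the scheme is not in this short table.
inline std::string default_port(const std::string& scheme)
{
    if (scheme == "http" || scheme == "ws")   return "80";
    if (scheme == "https" || scheme == "wss") return "443";
    if (scheme == "ftp")                      return "21";
    return {};
}

// Applies two scheme-based rules in place.
inline void normalize_for_scheme(url_parts& u)
{
    // Rule 1: a port equal to the scheme default is redundant.
    std::string def = default_port(u.scheme);
    if (!def.empty() && u.port == def)
        u.port.clear();

    // Rule 2: for HTTP(S) with a host, an empty path means "/".
    bool is_http = (u.scheme == "http" || u.scheme == "https");
    if (is_http && !u.host.empty() && u.path.empty())
        u.path = "/";
}

// Reassembles the components so two URLs can be compared as strings.
inline std::string serialize(const url_parts& u)
{
    std::string s = u.scheme + "://" + u.host;
    if (!u.port.empty())
        s += ":" + u.port;
    return s + u.path;
}
```

After `normalize_for_scheme`, the components of `https://www.boost.org:443` and `https://www.boost.org/` both serialize to `https://www.boost.org/`, so a Simple String Comparison of the results succeeds.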

Other criteria commonly used to minimize false negatives for specific schemes are:

* Treating an empty authority component as an error or as localhost
* Replacing the authority with an empty string when it equals the default authority of the scheme
* Normalizing extra subcomponents with case-insensitive data
* Normalizing paths with case-insensitive data