mirror of
https://github.com/boostorg/url.git
synced 2026-01-31 08:42:22 +00:00
92 lines
3.8 KiB
Plaintext
92 lines
3.8 KiB
Plaintext
//
|
|
// Copyright (c) 2023 Alan de Freitas (alandefreitas@gmail.com)
|
|
//
|
|
// Distributed under the Boost Software License, Version 1.0. (See accompanying
|
|
// file LICENSE_1_0.txt or copy at https://www.boost.org/LICENSE_1_0.txt)
|
|
//
|
|
// Official repository: https://github.com/boostorg/url
|
|
//
|
|
|
|
= Normalization
|
|
|
|
Normalization allows us to determine if two URLs refer to the same resource.
|
|
URLs comparisons serve the same purpose, where two strings are compared as if they were normalized.
|
|
|
|
There is no way to determine whether two URLs refer to the same resource without full knowledge or control of them.
|
|
Thus, equivalence is based on string comparisons augmented by additional URL and scheme rules.
|
|
This means comparison is not sufficient to determine whether two URLs identify different resources as the same resource can always be served from different addresses.
|
|
|
|
For this reason, comparison methods are designed to minimize false negatives while strictly avoiding false positives.
|
|
In other words, if two URLs compare equal, they definitely represent the same resource.
|
|
If they are considered different, they might still refer to the same resource depending on the application.
|
|
|
|
Context-dependent rules can be considered to minimize the number of false negatives, where cheaper methods have a higher chance of producing false negatives:
|
|
|
|
* Simple String Comparison
|
|
* Syntax-Based Normalization
|
|
* Scheme-Based Normalization
|
|
* Protocol-Based Normalization
|
|
|
|
== Simple String Comparison
|
|
|
|
Simple String Comparison can be performed by accessing the underlying buffer of URLs:
|
|
|
|
[source,cpp]
|
|
----
|
|
include::example$unit/snippets.cpp[tag=snippet_normalizing_1,indent=0]
|
|
----
|
|
|
|
By only considering the rules of https://tools.ietf.org/html/rfc3986[rfc3986,window=blank_], Simple String Comparison fails to identify the URLs above point to the same resource.
|
|
|
|
== Syntax-Based Normalization
|
|
|
|
The comparison operators implement Syntax-Based Normalization, which implements the rules defined by https://tools.ietf.org/html/rfc3986[rfc3986,window=blank_].
|
|
|
|
[source,cpp]
|
|
----
|
|
include::example$unit/snippets.cpp[tag=snippet_normalizing_2,indent=0]
|
|
----
|
|
|
|
In mutable URLs, the member function cpp:normalize[] can also be used to apply Syntax-Based Normalization to a URL.
|
|
A normalized URL is represented by a canonical string where any two strings that would compare equal end up with the same underlying representation.
|
|
In other words, Simple String Comparison of two normalized URLs is equivalent to Syntax-Based Normalization.
|
|
|
|
[source,cpp]
|
|
----
|
|
include::example$unit/snippets.cpp[tag=snippet_normalizing_3,indent=0]
|
|
----
|
|
|
|
=== Syntax-Based Normalization Procedure
|
|
|
|
Normalization uses the following definitions of https://tools.ietf.org/html/rfc3986[rfc3986,window=blank_]
|
|
to minimize false negatives:
|
|
|
|
* Case Normalization: percent-encoding triplets are normalized to use uppercase letters
|
|
* Percent-Encoding Normalization: decode octets that do not require percent-encoding
|
|
* Path Segment Normalization: path segments "." and ".." are resolved
|
|
|
|
The following example normalizes the percent-encoding and path segments of a URL:
|
|
|
|
[source,cpp]
|
|
----
|
|
include::example$unit/snippets.cpp[tag=snippet_normalizing_4,indent=0]
|
|
----
|
|
|
|
== Scheme and Protocol-Based Normalization
|
|
|
|
Syntax-Based Normalization can also be used as a first step for Scheme-Based and Protocol-Based Normalization.
|
|
One common scheme-specific rule is ignoring the default port for that scheme and empty absolute paths:
|
|
|
|
[source,cpp]
|
|
----
|
|
include::example$unit/snippets.cpp[tag=snippet_normalizing_5,indent=0]
|
|
----
|
|
|
|
Other criteria commonly used to minimize false negatives for specific schemes are:
|
|
|
|
* Handling empty authority component as an error or localhost
|
|
* Replacing authority with empty string for the default authority
|
|
* Normalizing extra subcomponents with case-insensitive data
|
|
* Normalizing paths with case-insensitive data
|
|
|