////
Copyright (c) 2024 The C++ Alliance, Inc. (https://cppalliance.org)
Distributed under the Boost Software License, Version 1.0. (See accompanying
file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
Official repository: https://github.com/boostorg/website-v2-docs
////
= Text Processing
:navtitle: Text Processing
Developing a word processor, or other text-based app, involves handling text, a GUI (Graphical User Interface), file operations, and possibly networking for cloud features. Boost does not provide a library for creating a GUI, so consider a library such as https://www.qt.io/product/development-tools[Qt] or https://wxwidgets.org/[wxWidgets] for the GUI part of your word processor.
[square]
* <<Libraries>>
* <<Sample of Regular Expression Parsing>>
* <<Add Robust Date and Time Parsing>>
* <<Culturally Aware Date Formatting>>
* <<Local Time>>
* <<Next Steps>>
* <<See Also>>
== Libraries
Here are some Boost libraries that might assist you in processing text:
[circle]
* boost:regex[]: For some simpler parsing tasks, regular expressions can be sufficient and easier to use than full-blown parsing libraries. You could use these features to match specific patterns in your input text, like specific commands or phrases, word boundaries, etc.
* boost:locale[] : This library provides a way of handling and manipulating text in a culturally-aware manner. It provides localization and internationalization facilities, allowing your word processor to be used by people with different languages and locales.
* boost:spirit[] : This library is a parser framework that can parse complex data structures. If you're creating a word processor, it could be useful to interpret different markup and file formats.
* boost:date_time[] : If you need to timestamp changes or edits, or if you're implementing any kind of version history feature, this library can help.
* boost:filesystem[] : This library provides a way of manipulating files and directories. This would be critical in a word processor for opening, saving, and managing documents.
* boost:asio[] : If your word processor has network-related features, such as real-time collaboration or cloud-based storage, boost:asio[] provides a consistent asynchronous model for network programming.
* boost:serialization[] : This library provides a way of serializing and deserializing data, which could be useful for saving and loading documents in a specific format.
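* boost:xpressive[] : An alternative regular expression library in which expressions can be written as strings or as C++ expression templates. It could be useful for implementing features like search and replace, spell-checking, and more.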
* boost:algorithm[] : This library includes a variety of algorithms for string and sequence processing, which can be useful for handling text.
* boost:multi-index[] : This library provides a way of maintaining a set of items sorted according to multiple keys, which could be useful for implementing features like an index or a sorted list of items.
* boost:thread[] : If your application is multithreaded (for example, if you want to save a document while the user continues to work), this library will be useful.
Note:: The code in this tutorial was written and tested using Microsoft Visual Studio (Visual C++ 2022, Console App project) with Boost version 1.88.0.
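As a small illustration of the string algorithms mentioned in the list above, the following is a minimal sketch of splitting a line of text into lowercase words with boost:algorithm[]. The helper name `tokenize` and the chosen set of separator characters are assumptions made for this example only.

[source,cpp]
----
#include <algorithm>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>

// Hypothetical helper: split a line into lowercase words.
// The separator set below is an assumption for this sketch.
std::vector<std::string> tokenize(const std::string& line) {
    // Drop leading and trailing whitespace, then lowercase in place
    std::string cleaned = boost::algorithm::trim_copy(line);
    boost::algorithm::to_lower(cleaned);

    // Split on whitespace and basic punctuation, merging adjacent separators
    std::vector<std::string> words;
    boost::algorithm::split(words, cleaned,
                            boost::algorithm::is_any_of(" \t,.;:!?"),
                            boost::algorithm::token_compress_on);

    // A separator at either end of the text leaves an empty token; remove those
    words.erase(std::remove(words.begin(), words.end(), std::string()),
                words.end());
    return words;
}
----

Under these assumptions, calling `tokenize("  Hello, World!  ")` yields the two words `hello` and `world`. A real word processor would likely need Unicode-aware word boundaries, which is where boost:locale[] becomes relevant.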
== Sample of Regular Expression Parsing
If the text you are parsing is well-formatted, boost:regex[] is usually sufficient, so this sample is based on it rather than on a full-blown parser implementation using boost:spirit[].
We'll write a program that scans a string for dates in the format "YYYY-MM-DD" and validates them. The code:
. Finds dates in text
. Validates correct formats (for example, 2024-02-20 is valid, but 2024-15-45 is not)
. Handles multiple dates in a single input string
[source,cpp]
----
#include <iostream>
#include <string>
#include <boost/regex.hpp>

// Function to check if a given date is valid (basic validation)
bool is_valid_date(int year, int month, int day) {
    if (month < 1 || month > 12 || day < 1 || day > 31) return false;
    if ((month == 4 || month == 6 || month == 9 || month == 11) && day > 30) return false;
    if (month == 2) {
        bool leap = (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
        if (day > (leap ? 29 : 28)) return false;
    }
    return true;
}

// Function to find and validate dates in a text
void find_dates(const std::string& text) {
    // Regex pattern: YYYY-MM-DD format
    boost::regex date_pattern(R"((\d{4})-(\d{2})-(\d{2}))");
    boost::smatch match;
    std::string::const_iterator start = text.begin();
    std::string::const_iterator end = text.end();
    bool found = false;

    while (boost::regex_search(start, end, match, date_pattern)) {
        // Capture groups 1 to 3 hold the year, month, and day digits
        int year = std::stoi(match[1]);
        int month = std::stoi(match[2]);
        int day = std::stoi(match[3]);

        if (is_valid_date(year, month, day)) {
            std::cout << "Valid date found: " << match[0] << "\n";
        } else {
            std::cout << "Invalid date: " << match[0] << " (Incorrect month/day)\n";
        }
        start = match[0].second;  // Move to next match
        found = true;
    }

    // Reached only when the text contains nothing that looks like a date
    if (!found) {
        std::cout << "No valid dates found in the input text.\n";
    }
}

int main() {
    std::string input;
    std::cout << "Enter a sentence containing dates (YYYY-MM-DD format):\n";
    std::getline(std::cin, input);
    find_dates(input);
    return 0;
}
----
The following shows a successful parse:
[source,text]
----
Enter a sentence containing dates (YYYY-MM-DD format):
Today is 2024-02-19, and tomorrow is 2024-02-20.
Valid date found: 2024-02-19
Valid date found: 2024-02-20
----
And the following shows several unsuccessful parses:
[source,text]
----
Enter a sentence containing dates (YYYY-MM-DD format):
The deadline is 2024-02-30.
Invalid date: 2024-02-30 (Incorrect month/day)
Enter a sentence containing dates (YYYY-MM-DD format):
There are no dates in this sentence.
No valid dates found in the input text.
----
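The same pattern can also drive a search and replace. The following is a minimal sketch that uses `boost::regex_replace` to rewrite each `YYYY-MM-DD` match in day-first order. The helper name `to_day_first` is hypothetical, and the sketch does not validate the dates it rewrites.

[source,cpp]
----
#include <string>
#include <boost/regex.hpp>

// Hypothetical helper: rewrite every YYYY-MM-DD match as DD/MM/YYYY.
// No validation is done here; any text matching the pattern is rewritten.
std::string to_day_first(const std::string& text) {
    static const boost::regex date_pattern(R"((\d{4})-(\d{2})-(\d{2}))");
    // $1, $2 and $3 refer to the year, month and day capture groups
    return boost::regex_replace(text, date_pattern, "$3/$2/$1");
}
----

For example, `to_day_first("Due 2024-02-19.")` returns `Due 19/02/2024.`, and text without a matching date is returned unchanged.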
== Add Robust Date and Time Parsing
The hand-written date validation in the sample above can be replaced by boost:date_time[], which checks that a date exists, including leap-year rules, whenever a date object is constructed.
[source,cpp]
----
#include <iostream>
#include <string>
#include <boost/regex.hpp>
#include <boost/date_time/gregorian/gregorian.hpp>

namespace greg = boost::gregorian;

// Function to check if a date is valid using Boost.Date_Time
bool is_valid_date(int year, int month, int day) {
    try {
        greg::date test_date(year, month, day);
        return true;  // If no exception, it's valid
    }
    catch (const std::exception&) {
        return false;  // Invalid date
    }
}

// Function to find and validate dates in a text
void find_dates(const std::string& text) {
    boost::regex date_pattern(R"((\d{4})-(\d{2})-(\d{2}))");
    boost::smatch match;
    std::string::const_iterator start = text.begin();
    std::string::const_iterator end = text.end();
    bool found = false;

    while (boost::regex_search(start, end, match, date_pattern)) {
        int year = std::stoi(match[1]);
        int month = std::stoi(match[2]);
        int day = std::stoi(match[3]);

        if (is_valid_date(year, month, day)) {
            greg::date valid_date(year, month, day);
            std::cout << "Valid date found: " << valid_date << "\n";
        }
        else {
            std::cout << "Invalid date: " << match[0] << " (Does not exist)\n";
        }
        start = match[0].second;  // Move to next match
        found = true;
    }

    // Reached only when the text contains nothing that looks like a date
    if (!found) {
        std::cout << "No valid dates found in the input text.\n";
    }
}

int main() {
    std::string input;
    std::cout << "Enter a sentence containing dates (YYYY-MM-DD format):\n";
    std::getline(std::cin, input);
    find_dates(input);
    return 0;
}
----
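Note:: The `greg::date` constructor handles leap years correctly and throws an exception for a date that does not exist, which `is_valid_date` catches and reports as `false`. The Gregorian calendar in boost:date_time[] covers the years 1400 to 9999, so dates outside that range are also rejected.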
The following shows a successful parse:
[source,text]
----
Enter a sentence containing dates (YYYY-MM-DD format):
Today is 2024-02-29, and tomorrow is 2024-03-01.
Valid date found: 2024-Feb-29
Valid date found: 2024-Mar-01
----
Note:: The "Valid date found" output now includes text for the month name.
And the following shows several unsuccessful parses:
[source,text]
----
Enter a sentence containing dates (YYYY-MM-DD format):
The deadline is 2024-02-30.
Invalid date: 2024-02-30 (Does not exist)
Enter a sentence containing dates (YYYY-MM-DD format):
There are no dates in this sentence.
No valid dates found in the input text.
----
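Once a date has been validated, boost:date_time[] can also do calendar arithmetic on it, which is the kind of operation a timestamp or version history feature relies on. The following is a minimal sketch. The helper names `shift_date` and `is_weekend` are hypothetical and exist only for this example.

[source,cpp]
----
#include <string>
#include <boost/date_time/gregorian/gregorian.hpp>

namespace greg = boost::gregorian;

// Hypothetical helper: return the date `offset_days` after `start`
// as an ISO 8601 extended string (YYYY-MM-DD).
std::string shift_date(const greg::date& start, int offset_days) {
    const greg::date shifted = start + greg::days(offset_days);
    return greg::to_iso_extended_string(shifted);
}

// Hypothetical helper: true when the date falls on a Saturday or Sunday.
bool is_weekend(const greg::date& d) {
    // as_number() returns 0 for Sunday through 6 for Saturday
    const unsigned short weekday = d.day_of_week().as_number();
    return weekday == 0 || weekday == 6;
}
----

For example, `shift_date(greg::date(2024, 2, 28), 2)` returns `2024-03-01` because 2024 is a leap year, and `is_weekend(greg::date(2024, 3, 16))` is `true` because that date is a Saturday.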
== Culturally Aware Date Formatting
Dates are not represented consistently across the globe. Let's use boost:locale[] to format dates according to the user's locale. For example:
* US: March 15, 2024
* UK: 15 March 2024
* France: 15 mars 2024
* Germany: 15. März 2024
[source,cpp]
----
#include <ctime>
#include <iostream>
#include <string>
#include <boost/regex.hpp>
#include <boost/date_time/gregorian/gregorian.hpp>
#include <boost/locale.hpp>

namespace greg = boost::gregorian;
namespace loc = boost::locale;

// Function to check if a date is valid using Boost.Date_Time
bool is_valid_date(int year, int month, int day) {
    try {
        greg::date test_date(year, month, day);
        return true;  // If no exception, it's valid
    }
    catch (const std::exception&) {
        return false;  // Invalid date
    }
}

// Function to format and display dates based on locale
void display_localized_date(const greg::date& date, const std::string& locale_name) {
    // Boost.Locale formats numeric POSIX times rather than greg::date objects,
    // so convert the date to seconds since 1970-01-01 (midnight UTC)
    const std::time_t seconds =
        static_cast<std::time_t>((date - greg::date(1970, 1, 1)).days()) * 86400;

    std::locale locale = loc::generator().generate(locale_name);
    std::cout.imbue(locale);  // Apply locale to std::cout

    // as::gmt stops the machine's time zone from shifting the date;
    // the exact text produced depends on the Boost.Locale backend in use
    std::cout << locale_name << " formatted date: "
              << loc::as::date_long << loc::as::gmt << seconds << "\n";
}

// Function to find and validate dates in a text
void find_dates(const std::string& text, const std::string& locale_name) {
    boost::regex date_pattern(R"((\d{4})-(\d{2})-(\d{2}))");
    boost::smatch match;
    std::string::const_iterator start = text.begin();
    std::string::const_iterator end = text.end();
    bool found = false;

    while (boost::regex_search(start, end, match, date_pattern)) {
        int year = std::stoi(match[1]);
        int month = std::stoi(match[2]);
        int day = std::stoi(match[3]);

        if (is_valid_date(year, month, day)) {
            greg::date valid_date(year, month, day);
            std::cout << "Valid date found: " << valid_date << "\n";
            display_localized_date(valid_date, locale_name);
        }
        else {
            std::cout << "Invalid date: " << match[0] << " (Does not exist)\n";
        }
        start = match[0].second;  // Move to next match
        found = true;
    }

    if (!found) {
        std::cout << "No valid dates found in the input text.\n";
    }
}

int main() {
    std::locale::global(loc::generator().generate("en_US.UTF-8"));  // Default global locale
    std::cout.imbue(std::locale());  // Apply to output stream

    std::string input;
    std::cout << "Enter a sentence containing dates (YYYY-MM-DD format):\n";
    std::getline(std::cin, input);

    std::string user_locale;
    std::cout << "Enter your preferred locale (e.g., en_US.UTF-8, fr_FR.UTF-8, de_DE.UTF-8): ";
    std::cin >> user_locale;

    find_dates(input, user_locale);
    return 0;
}
----
The following shows successful parses:
[source,text]
----
Enter a sentence containing dates (YYYY-MM-DD format):
The meeting is on 2024-03-15.
Enter your preferred locale (e.g., en_US.UTF-8, fr_FR.UTF-8, de_DE.UTF-8): en_US.UTF-8
Valid date found: 2024-Mar-15
en_US.UTF-8 formatted date: March 15, 2024
Enter a sentence containing dates (YYYY-MM-DD format):
Rendez-vous le 2024-07-20.
Enter your preferred locale (e.g., en_US.UTF-8, fr_FR.UTF-8, de_DE.UTF-8): fr_FR.UTF-8
Valid date found: 2024-Jul-20
fr_FR.UTF-8 formatted date: 20 juillet 2024
----
And the following shows an unsuccessful parse:
[source,text]
----
Enter a sentence containing dates (YYYY-MM-DD format):
The deadline is 2024-02-30.
Enter your preferred locale (e.g., en_US.UTF-8, fr_FR.UTF-8, de_DE.UTF-8): en_US.UTF-8
Invalid date: 2024-02-30 (Does not exist)
----
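When only a fixed layout is needed rather than full locale support, the date facets in boost:date_time[] are a lighter option. The following is a minimal sketch, not a replacement for boost:locale[]. It formats a date with a `strftime`-style pattern, and in the classic "C" locale used here month names are always in English. The helper name `format_date` is hypothetical.

[source,cpp]
----
#include <locale>
#include <sstream>
#include <string>
#include <boost/date_time/gregorian/gregorian.hpp>

namespace greg = boost::gregorian;

// Hypothetical helper: format a date with a strftime-style pattern,
// for example "%d %B %Y" (day, full month name, year).
std::string format_date(const greg::date& d, const char* pattern) {
    std::ostringstream out;
    // The locale takes ownership of the facet and deletes it when done
    out.imbue(std::locale(std::locale::classic(), new greg::date_facet(pattern)));
    out << d;
    return out.str();
}
----

For example, `format_date(greg::date(2024, 3, 15), "%d %B %Y")` returns `15 March 2024`, and the pattern `"%Y-%m-%d"` returns `2024-03-15`.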
== Local Time
In a similarly global vein, when you install the boost:date_time[] library (or all the Boost libraries), a file containing definitions of time zones across the world is available for your use at: `boost_<version>\\libs\\date_time\\data\\date_time_zonespec.csv`.
The following short sample shows how to use the contents of the file. Enter a time zone region in the IANA format (such as 'Europe/Berlin' or 'Asia/Tokyo'), and the current date and time in that zone are output alongside your local system time.
[source,cpp]
----
#include <iostream>
#include <string>
#include <vector>
#include <boost/date_time/local_time/local_time.hpp>

namespace pt = boost::posix_time;
namespace lt = boost::local_time;

int main() {
    try {
        //---------------------------------------------
        // Load the Boost tz_database from CSV
        //---------------------------------------------
        lt::tz_database tz_db;
        tz_db.load_from_file("<YOUR PATH>\\date_time_zonespec.csv");  // Adjust the path to your Boost installation

        // Extract all valid timezone names
        const std::vector<std::string> valid_timezones = tz_db.region_list();

        std::string city;
        while (true) {
            std::cout << "\nEnter 'city/timezone' (or 'exit' to quit, or 'zones' for list of options): ";
            std::getline(std::cin, city);
            if (city == "exit") break;

            if (city == "zones") {
                std::cout << "Available timezones:\n";
                for (const auto& tz : valid_timezones) {
                    std::cout << tz << "\n";
                }
            }
            else {
                // Find the timezone (case-sensitive, must match CSV)
                lt::time_zone_ptr tz = tz_db.time_zone_from_region(city);
                if (!tz) {
                    std::cout << "Invalid timezone! Try again.\n";
                    continue;
                }

                // Get current UTC time
                pt::ptime utc_now = pt::second_clock::universal_time();

                // Convert UTC to local time in the chosen timezone
                lt::local_date_time local_now(utc_now, tz);

                // Get user's local machine time
                pt::ptime user_now = pt::second_clock::local_time();

                std::cout << "\nYour local system time: " << user_now << "\n";
                std::cout << "Current local time in " << city << ": " << local_now << "\n";
            }
        }
    }
    catch (const std::exception& e) {
        // Also reached when the CSV file cannot be opened
        std::cerr << "Fatal error: " << e.what() << "\n";
        return 1;
    }
    return 0;
}
----
Run the program and test out a few options:
[source,text]
----
Enter 'city/timezone' (or 'exit' to quit, or 'zones' for list of options): America/New_York
Your local system time: 2025-Sep-03 16:38:02
Current local time in America/New_York: 2025-Sep-03 19:38:02 EDT
Enter 'city/timezone' (or 'exit' to quit, or 'zones' for list of options): Antarctica/South_Pole
Your local system time: 2025-Sep-03 16:38:20
Current local time in Antarctica/South_Pole: 2025-Sep-04 11:38:20 NZST
Enter 'city/timezone' (or 'exit' to quit, or 'zones' for list of options): zones
Available timezones:
Africa/Abidjan
Africa/Accra
Africa/Addis_Ababa
Africa/Algiers
Africa/Asmara
Africa/Asmera
Africa/Bamako
Africa/Bangui
Africa/Banjul
Africa/Bissau
Africa/Blantyre
Africa/Brazzaville
Africa/Bujumbura
Africa/Cairo
Africa/Casablanca
Africa/Ceuta
Africa/Conakry
....
----
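If the CSV file is not available, a single zone can instead be described inline with a POSIX-style time zone string and `boost::local_time::posix_time_zone`. The following is a minimal sketch. The helper name `utc_to_zone` is hypothetical, and the zone string used in the usage note is an assumption that approximates current US Eastern rules rather than data taken from the CSV file.

[source,cpp]
----
#include <string>
#include <boost/date_time/local_time/local_time.hpp>

namespace pt = boost::posix_time;
namespace lt = boost::local_time;

// Hypothetical helper: convert a UTC time to wall-clock time in the zone
// described by a POSIX-style string, and append the zone abbreviation.
std::string utc_to_zone(const pt::ptime& utc, const std::string& posix_spec) {
    // The shared pointer owns the zone object
    lt::time_zone_ptr tz(new lt::posix_time_zone(posix_spec));
    lt::local_date_time local(utc, tz);
    return pt::to_simple_string(local.local_time()) + " " + local.zone_abbrev();
}
----

For example, with the assumed zone string `"EST-5EDT,M3.2.0,M11.1.0"`, a UTC time of 23:38:02 on 2025-Sep-03 converts to `2025-Sep-03 19:38:02 EDT`, which agrees with the `America/New_York` output shown above.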
== Next Steps
If more complex input is required, consider the boost:spirit[] approach to parsing. Refer to xref:task-natural-language-parsing.adoc[].
== See Also
* https://www.boost.org/doc/libs/latest/libs/libraries.htm#Miscellaneous[Category: Miscellaneous]
* https://www.boost.org/doc/libs/latest/libs/libraries.htm#Parsing[Category: Parsing]
* https://www.boost.org/doc/libs/latest/libs/libraries.htm#String[Category: String and text processing]