////
Copyright (c) 2024 The C++ Alliance, Inc. (https://cppalliance.org)

Distributed under the Boost Software License, Version 1.0. (See accompanying
file LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)

Official repository: https://github.com/boostorg/website-v2-docs
////
= Text Processing
:navtitle: Text Processing

Developing a word processor, or other text-based app, involves handling text, a GUI (Graphical User Interface), file operations, and possibly networking for cloud features. Boost does not provide a library for creating a GUI. Consider a library such as https://www.qt.io/product/development-tools[Qt] or https://wxwidgets.org/[wxWidgets] for the GUI part of your word processor.

[square]
* <>
* <>
* <>
* <>
* <>
* <>

== Libraries

Here are some Boost libraries that might assist you in processing text:

[circle]
* boost:regex[] : For simpler parsing tasks, regular expressions can be sufficient and easier to use than a full parsing library. You can use them to match specific patterns in your input text, such as commands, phrases, or word boundaries.

* boost:locale[] : This library handles and manipulates text in a culturally aware manner. It provides localization and internationalization facilities, so that your word processor can be used by people with different languages and locales.

* boost:spirit[] : This library is a parser framework that can parse complex data structures. In a word processor, it could be used to interpret different markup and file formats.

* boost:date_time[] : If you need to timestamp changes or edits, or if you are implementing any kind of version history feature, this library can help.

* boost:filesystem[] : This library manipulates files and directories, which a word processor needs for opening, saving, and managing documents.

* boost:asio[] : If your word processor has network-related features, such as real-time collaboration or cloud-based storage, boost:asio[] provides a consistent asynchronous model for network programming.

* boost:serialization[] : This library serializes and deserializes data, which could be useful for saving and loading documents in a specific format.

* boost:xpressive[] : An alternative regular expression library that could be useful for implementing features like search and replace or spell-checking.

* boost:algorithm[] : This library includes a variety of algorithms for string and sequence processing, which can be useful for handling text.

* boost:multi-index[] : This library maintains a set of items sorted according to multiple keys, which could be useful for implementing features like an index or a sorted list of items.

* boost:thread[] : If your application is multithreaded (for example, if you want to save a document while the user continues to work), this library will be useful.

Note:: The code in this tutorial was written and tested using Microsoft Visual Studio (Visual C++ 2022, Console App project) with Boost version 1.88.0.

== Sample of Regular Expression Parsing

If the text you are parsing is well-formatted, boost:regex[] is usually enough, and a full parser built with boost:spirit[] is not needed. This sample is based on boost:regex[].

We'll write a program that scans a string for dates in the format "YYYY-MM-DD" and validates them. The code:

. Finds dates in text
. Validates correct formats (for example, 2024-02-20 is valid, but 2024-15-45 is not)
. Handles multiple dates in a single input string

[source,cpp]
----
#include <boost/regex.hpp>
#include <iostream>
#include <string>

// Function to check if a given date is valid (basic validation)
bool is_valid_date(int year, int month, int day) {
    if (month < 1 || month > 12 || day < 1 || day > 31) return false;

    // April, June, September, and November have 30 days
    if ((month == 4 || month == 6 || month == 9 || month == 11) && day > 30) return false;

    if (month == 2) {
        bool leap = (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
        if (day > (leap ? 29 : 28)) return false;
    }
    return true;
}

// Function to find and validate dates in a text
void find_dates(const std::string& text) {
    // Regex pattern: YYYY-MM-DD format
    boost::regex date_pattern(R"((\d{4})-(\d{2})-(\d{2}))");
    boost::smatch match;

    std::string::const_iterator start = text.begin();
    std::string::const_iterator end = text.end();
    bool found = false;

    while (boost::regex_search(start, end, match, date_pattern)) {
        int year = std::stoi(match[1]);
        int month = std::stoi(match[2]);
        int day = std::stoi(match[3]);

        if (is_valid_date(year, month, day)) {
            std::cout << "Valid date found: " << match[0] << "\n";
        } else {
            std::cout << "Invalid date: " << match[0] << " (Incorrect month/day)\n";
        }

        start = match[0].second;  // Move to next match
        found = true;
    }

    if (!found) {
        std::cout << "No valid dates found in the input text.\n";
    }
}

int main() {
    std::string input;
    std::cout << "Enter a sentence containing dates (YYYY-MM-DD format):\n";
    std::getline(std::cin, input);

    find_dates(input);
    return 0;
}
----

The following shows a successful parse:

[source,text]
----
Enter a sentence containing dates (YYYY-MM-DD format):
Today is 2024-02-19, and tomorrow is 2024-02-20.
Valid date found: 2024-02-19
Valid date found: 2024-02-20
----

And the following shows several unsuccessful parses:

[source,text]
----
Enter a sentence containing dates (YYYY-MM-DD format):
The deadline is 2024-02-30.
Invalid date: 2024-02-30 (Incorrect month/day)

Enter a sentence containing dates (YYYY-MM-DD format):
There are no dates in this sentence.
No valid dates found in the input text.
----

== Add Robust Date and Time Parsing

The hand-written date validation in the sample above can be replaced by boost:date_time[], which checks that a calendar date exists, including leap years, when a date object is constructed.

[source,cpp]
----
#include <boost/date_time/gregorian/gregorian.hpp>
#include <boost/regex.hpp>
#include <iostream>
#include <string>

namespace greg = boost::gregorian;

// Function to check if a date is valid using Boost.Date_Time
bool is_valid_date(int year, int month, int day) {
    try {
        greg::date test_date(year, month, day);
        return true;  // If no exception, it's valid
    } catch (const std::exception&) {
        return false; // Invalid date
    }
}

// Function to find and validate dates in a text
void find_dates(const std::string& text) {
    boost::regex date_pattern(R"((\d{4})-(\d{2})-(\d{2}))");
    boost::smatch match;

    std::string::const_iterator start = text.begin();
    std::string::const_iterator end = text.end();
    bool found = false;

    while (boost::regex_search(start, end, match, date_pattern)) {
        int year = std::stoi(match[1]);
        int month = std::stoi(match[2]);
        int day = std::stoi(match[3]);

        if (is_valid_date(year, month, day)) {
            greg::date valid_date(year, month, day);
            std::cout << "Valid date found: " << valid_date << "\n";
        } else {
            std::cout << "Invalid date: " << match[0] << " (Does not exist)\n";
        }

        start = match[0].second;  // Move to next match
        found = true;
    }

    if (!found) {
        std::cout << "No valid dates found in the input text.\n";
    }
}

int main() {
    std::string input;
    std::cout << "Enter a sentence containing dates (YYYY-MM-DD format):\n";
    std::getline(std::cin, input);

    find_dates(input);
    return 0;
}
----

Note:: The `greg::date` constructor accounts for leap years, and throws an exception when the date does not exist.

The following shows a successful parse:

[source,text]
----
Enter a sentence containing dates (YYYY-MM-DD format):
Today is 2024-02-29, and tomorrow is 2024-03-01.
Valid date found: 2024-Feb-29
Valid date found: 2024-Mar-01
----

Note:: The "Valid date found" output now includes text for the month name.
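
If the numeric "YYYY-MM-DD" form is preferred in the output, one option is to format the date explicitly instead of streaming it. The following is a minimal sketch, assuming a Boost release in which boost:date_time[] can be used header-only. The helper names `to_iso_text` and `weekday_text` are hypothetical and are not part of the tutorial code.

[source,cpp]
----
#include <boost/date_time/gregorian/gregorian.hpp>
#include <string>

namespace greg = boost::gregorian;

// Hypothetical helper: returns the date in ISO 8601 extended form (YYYY-MM-DD).
std::string to_iso_text(const greg::date& d) {
    return greg::to_iso_extended_string(d);
}

// Hypothetical helper: returns the short English weekday name of the date.
std::string weekday_text(const greg::date& d) {
    return d.day_of_week().as_short_string();
}
----

For example, `to_iso_text(greg::date(2024, 2, 29))` returns `2024-02-29`, and `weekday_text` returns `Thu` for the same date.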

The following shows several unsuccessful parses:

[source,text]
----
Enter a sentence containing dates (YYYY-MM-DD format):
The deadline is 2024-02-30.
Invalid date: 2024-02-30 (Does not exist)

Enter a sentence containing dates (YYYY-MM-DD format):
There are no dates in this sentence.
No valid dates found in the input text.
----

== Culturally Aware Date Formatting

Dates are not represented consistently across the globe. Let's use boost:locale[] to format dates according to the user's locale. For example:

* US: March 15, 2024
* UK: 15 March 2024
* France: 15 mars 2024
* Germany: 15. März 2024

[source,cpp]
----
#include <boost/date_time/gregorian/gregorian.hpp>
#include <boost/locale.hpp>
#include <boost/regex.hpp>
#include <ctime>
#include <iostream>
#include <string>

namespace greg = boost::gregorian;
namespace loc = boost::locale;

// Function to check if a date is valid using Boost.Date_Time
bool is_valid_date(int year, int month, int day) {
    try {
        greg::date test_date(year, month, day);
        return true;  // If no exception, it's valid
    } catch (const std::exception&) {
        return false; // Invalid date
    }
}

// Function to format and display dates based on locale
void display_localized_date(const greg::date& date, const std::string& locale_name) {
    std::locale locale = loc::generator().generate(locale_name);
    std::cout.imbue(locale);  // Apply locale to std::cout

    // Boost.Locale formats numeric POSIX times rather than greg::date objects,
    // so convert the date to seconds since 1970-01-01 (midnight UTC).
    const std::time_t seconds =
        static_cast<std::time_t>((date - greg::date(1970, 1, 1)).days()) * 86400;

    // The exact text depends on the Boost.Locale backend and the locales available.
    std::cout << locale_name << " formatted date: "
              << loc::as::date_long << loc::as::gmtime << seconds << "\n";
}

// Function to find and validate dates in a text
void find_dates(const std::string& text, const std::string& locale_name) {
    boost::regex date_pattern(R"((\d{4})-(\d{2})-(\d{2}))");
    boost::smatch match;

    std::string::const_iterator start = text.begin();
    std::string::const_iterator end = text.end();
    bool found = false;

    while (boost::regex_search(start, end, match, date_pattern)) {
        int year = std::stoi(match[1]);
        int month = std::stoi(match[2]);
        int day = std::stoi(match[3]);

        if (is_valid_date(year, month, day)) {
            greg::date valid_date(year, month, day);
            std::cout << "Valid date found: " << valid_date << "\n";
            display_localized_date(valid_date, locale_name);
        } else {
            std::cout << "Invalid date: " << match[0] << " (Does not exist)\n";
        }

        start = match[0].second;  // Move to next match
        found = true;
    }

    if (!found) {
        std::cout << "No valid dates found in the input text.\n";
    }
}

int main() {
    std::locale::global(loc::generator().generate("en_US.UTF-8"));  // Default global locale
    std::cout.imbue(std::locale());  // Apply to output stream

    std::string input;
    std::cout << "Enter a sentence containing dates (YYYY-MM-DD format):\n";
    std::getline(std::cin, input);

    std::string user_locale;
    std::cout << "Enter your preferred locale (e.g., en_US.UTF-8, fr_FR.UTF-8, de_DE.UTF-8): ";
    std::cin >> user_locale;

    find_dates(input, user_locale);
    return 0;
}
----

The following shows successful parses:

[source,text]
----
Enter a sentence containing dates (YYYY-MM-DD format):
The meeting is on 2024-03-15.
Enter your preferred locale (e.g., en_US.UTF-8, fr_FR.UTF-8, de_DE.UTF-8): en_US.UTF-8
Valid date found: 2024-Mar-15
en_US.UTF-8 formatted date: March 15, 2024

Enter a sentence containing dates (YYYY-MM-DD format):
Rendez-vous le 2024-07-20.
Enter your preferred locale (e.g., en_US.UTF-8, fr_FR.UTF-8, de_DE.UTF-8): fr_FR.UTF-8
Valid date found: 2024-Jul-20
fr_FR.UTF-8 formatted date: 20 juillet 2024
----

And the following shows an unsuccessful parse:

[source,text]
----
Enter a sentence containing dates (YYYY-MM-DD format):
The deadline is 2024-02-30.
Enter your preferred locale (e.g., en_US.UTF-8, fr_FR.UTF-8, de_DE.UTF-8): en_US.UTF-8
Invalid date: 2024-02-30 (Does not exist)
----

== Local Time

In a similar global vein, when you install the boost:date_time[] library (or all the Boost libraries), a file containing definitions of time zones across the world is available for your use at: `boost_\\libs\\date_time\\data\\date_time_zonespec.csv`.

The following short sample shows how to use the contents of the file. Enter a time zone region name in the IANA format (such as 'Europe/Berlin' or 'Asia/Tokyo'), and the current date and time in that zone will be output.

[source,cpp]
----
#include <boost/date_time/local_time/local_time.hpp>
#include <iostream>
#include <string>
#include <vector>

namespace pt = boost::posix_time;
namespace lt = boost::local_time;

int main() {
    try {
        //---------------------------------------------
        // Load the Boost tz_database from CSV
        //---------------------------------------------
        lt::tz_database tz_db;
        tz_db.load_from_file("\\date_time_zonespec.csv"); // Adjust the path to your Boost installation

        // Extract all valid timezone names
        const std::vector<std::string> valid_timezones = tz_db.region_list();

        std::string city;
        while (true) {
            std::cout << "\nEnter 'city/timezone' (or 'exit' to quit, or 'zones' for list of options): ";
            std::getline(std::cin, city);

            if (city == "exit") break;

            if (city == "zones") {
                std::cout << "Available timezones:\n";
                for (const auto& tz : valid_timezones) {
                    std::cout << tz << "\n";
                }
            } else {
                // Find the timezone (case-sensitive, must match CSV)
                lt::time_zone_ptr tz = tz_db.time_zone_from_region(city);
                if (!tz) {
                    std::cout << "Invalid timezone! Try again.\n";
                    continue;
                }

                // Get current UTC time
                pt::ptime utc_now = pt::second_clock::universal_time();

                // Convert UTC to local time in the chosen timezone
                lt::local_date_time local_now(utc_now, tz);

                // Get user's local machine time
                pt::ptime user_now = pt::second_clock::local_time();

                std::cout << "\nYour local system time: " << user_now << "\n";
                std::cout << "Current local time in " << city << ": " << local_now << "\n";
            }
        }
    } catch (const std::exception& e) {
        std::cerr << "Fatal error: " << e.what() << "\n";
        return 1;
    }

    return 0;
}
----

Run the program and test out a few options:

[source,text]
----
Enter 'city/timezone' (or 'exit' to quit, or 'zones' for list of options): America/New_York

Your local system time: 2025-Sep-03 16:38:02
Current local time in America/New_York: 2025-Sep-03 19:38:02 EDT

Enter 'city/timezone' (or 'exit' to quit, or 'zones' for list of options): Antarctica/South_Pole

Your local system time: 2025-Sep-03 16:38:20
Current local time in Antarctica/South_Pole: 2025-Sep-04 11:38:20 NZST

Enter 'city/timezone' (or 'exit' to quit, or 'zones' for list of options): zones
Available timezones:
Africa/Abidjan
Africa/Accra
Africa/Addis_Ababa
Africa/Algiers
Africa/Asmara
Africa/Asmera
Africa/Bamako
Africa/Bangui
Africa/Banjul
Africa/Bissau
Africa/Blantyre
Africa/Brazzaville
Africa/Bujumbura
Africa/Cairo
Africa/Casablanca
Africa/Ceuta
Africa/Conakry
....
----

== Next Steps

If more complex input is required, consider the boost:spirit[] approach to parsing. Refer to xref:task-natural-language-parsing.adoc[].

== See Also

* https://www.boost.org/doc/libs/latest/libs/libraries.htm#Miscellaneous[Category: Miscellaneous]
* https://www.boost.org/doc/libs/latest/libs/libraries.htm#Parsing[Category: Parsing]
* https://www.boost.org/doc/libs/latest/libs/libraries.htm#String[Category: String and text processing]