Project 1: Building a High-Performance C++ CLI Text Processor with std::format & Ranges

Table of Contents
- Project Architecture
- CMakeLists.txt Setup
- Core Types and Interface Design
- File Reading with std::filesystem and Memory-Mapped Views
- Text Analysis: Words, Lines, Characters
- Pattern Search with ANSI Highlighting
- Case Transformation with Ranges
- Word Frequency Analysis
- Command-Line Interface with std::print
- Compiler Flags and Sanitizers
- Extension Challenges
- Phase 1 Reflection
Project Architecture
text-processor/
├── CMakeLists.txt
├── include/
│   ├── processor.hpp       # Core types and function declarations
│   ├── searcher.hpp        # Pattern search engine
│   └── formatter.hpp       # Output formatting utilities
├── src/
│   ├── main.cpp            # CLI entry point
│   ├── processor.cpp       # Text analysis implementation
│   ├── searcher.cpp        # Search + highlight implementation
│   └── formatter.cpp       # Tabular output formatting
└── tests/
    ├── CMakeLists.txt
    ├── test_processor.cpp  # Google Test unit tests
    └── test_data/
        └── sample.txt      # Test fixture
CMakeLists.txt Setup
cmake_minimum_required(VERSION 3.25)
project(TextProcessor VERSION 1.0.0 LANGUAGES CXX)
set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF) # Use -std=c++23, not -std=gnu++23
add_library(processor_lib STATIC
    src/processor.cpp
    src/searcher.cpp
    src/formatter.cpp
)
target_include_directories(processor_lib PUBLIC include)
add_executable(TextProcessor src/main.cpp)
target_link_libraries(TextProcessor PRIVATE processor_lib)
# Compiler warnings (targets must exist before target_compile_options):
target_compile_options(TextProcessor PRIVATE
    $<$<CXX_COMPILER_ID:GNU,Clang>:-Wall -Wextra -Wpedantic -Werror>
    $<$<CXX_COMPILER_ID:MSVC>:/W4 /WX>
)
# Address/UB sanitizers in Debug (GCC/Clang syntax; MSVC uses /fsanitize=address):
if(CMAKE_BUILD_TYPE STREQUAL "Debug" AND NOT MSVC)
    target_compile_options(TextProcessor PRIVATE -fsanitize=address,undefined)
    target_link_options(TextProcessor PRIVATE -fsanitize=address,undefined)
endif()
# Unit tests with Google Test:
find_package(GTest REQUIRED)
add_executable(tests tests/test_processor.cpp)
target_link_libraries(tests PRIVATE processor_lib GTest::gtest_main)
include(GoogleTest)
gtest_discover_tests(tests)
Core Types and Interface Design
// include/processor.hpp
#pragma once
#include <string>
#include <string_view>
#include <vector>
#include <map>
#include <cstddef>
// Aggregate statistics returned from analysis:
struct TextStats {
size_t lines = 0;
size_t words = 0;
size_t characters = 0;
size_t paragraphs = 0; // Double newline-separated sections
size_t unique_words = 0;
double avg_word_length = 0.0;
};
// Search result with match positions:
struct SearchResult {
std::string_view line; // The full line containing the match
size_t line_no; // 1-based line number
size_t col_no; // 1-based column of first match
size_t match_len; // Length of the matched text
};
// Word frequency entry:
struct WordFreq {
std::string word;
size_t count;
};
// Core analysis functions (inputs are passed as string_view, so no copy of the
// input is made; only the returned values allocate):
TextStats analyze_text(std::string_view content);
std::vector<SearchResult> search_text(std::string_view content,
std::string_view pattern,
bool case_sensitive = true);
std::string to_uppercase(std::string_view content);
std::string to_lowercase(std::string_view content);
std::vector<WordFreq> top_words(std::string_view content, size_t n = 10);
Text Analysis: Words, Lines, Characters
// src/processor.cpp
#include "processor.hpp"
#include <ranges>
#include <algorithm>
#include <cctype>
#include <unordered_map>
TextStats analyze_text(std::string_view content) {
    TextStats stats;
    bool in_word = false;
    bool last_newline = false;
    for (char ch : content) {
        stats.characters++;
        if (ch == '\n') {
            stats.lines++;
            if (last_newline) stats.paragraphs++;  // Blank line ends a paragraph
        }
        last_newline = (ch == '\n');
        bool is_whitespace = (ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
        if (!is_whitespace && !in_word) {
            stats.words++;
            in_word = true;
        } else if (is_whitespace) {
            in_word = false;
        }
    }
    if (!content.empty() && content.back() != '\n') stats.lines++;
    if (!content.empty()) stats.paragraphs++;  // Count the final (or only) paragraph
    // Use ranges for the unique-word count. Normalize tabs/newlines to spaces
    // first so split(' ') sees every whitespace boundary, matching the word
    // counting loop above:
    auto word_view = content
        | std::views::transform([](char c) {
              return (c == '\t' || c == '\n' || c == '\r') ? ' ' : c;
          })
        | std::views::split(' ')
        | std::views::transform([](auto r) -> std::string {
              return std::string(r.begin(), r.end());
          })
        | std::views::filter([](const std::string& s) { return !s.empty(); });
    std::unordered_map<std::string, size_t> freq;
    for (const auto& word : word_view) freq[word]++;
    stats.unique_words = freq.size();
    // Average word length:
    size_t total_chars = 0;
    for (const auto& [word, count] : freq) total_chars += word.size() * count;
    stats.avg_word_length = stats.words > 0
        ? static_cast<double>(total_chars) / stats.words : 0.0;
    return stats;
}
std::vector<WordFreq> top_words(std::string_view content, size_t n) {
    std::unordered_map<std::string, size_t> freq;
    // Normalize whitespace so split(' ') also breaks on tabs and newlines:
    auto normalized = content | std::views::transform([](char c) {
        return (c == '\t' || c == '\n' || c == '\r') ? ' ' : c;
    });
    // Tokenize (one short-lived std::string per token, used as the map key):
    for (auto word_sv : normalized | std::views::split(' ')) {
        std::string word(word_sv.begin(), word_sv.end());
        // Remove punctuation:
        std::erase_if(word, [](char c){ return std::ispunct(static_cast<unsigned char>(c)); });
        // Lowercase for frequency counting (the cast avoids UB on negative char):
        std::ranges::transform(word, word.begin(), [](char c) {
            return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        });
        if (word.size() > 2) freq[word]++;  // Skip short filler words
    }
    // Sort only the top N by frequency:
    std::vector<WordFreq> results;
    results.reserve(freq.size());
    for (auto& [word, count] : freq) results.push_back({word, count});
    std::ranges::partial_sort(results,
        results.begin() + std::min(n, results.size()),
        [](const WordFreq& a, const WordFreq& b){ return a.count > b.count; });
    results.resize(std::min(n, results.size()));
    return results;
}
Pattern Search with ANSI Highlighting
// src/searcher.cpp
#include "searcher.hpp"
#include <algorithm>
#include <cctype>
#include <print>
#include <ranges>
#include <string>
std::vector<SearchResult> search_text(std::string_view content,
                                      std::string_view pattern,
                                      bool case_sensitive) {
    std::vector<SearchResult> results;
    auto lower = [](char c) {
        return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    };
    // Case-insensitive mode: lowercase the pattern once, outside the loops:
    std::string lower_pat(pattern);
    std::ranges::transform(lower_pat, lower_pat.begin(), lower);
    size_t line_no = 1;
    for (auto line_sv : content | std::views::split('\n')) {
        std::string_view line(line_sv.begin(), line_sv.end());
        std::string_view haystack = line;
        std::string_view needle = pattern;
        std::string lower_line;
        if (!case_sensitive) {
            // Lowercase each line once, not once per match:
            lower_line.assign(line);
            std::ranges::transform(lower_line, lower_line.begin(), lower);
            haystack = lower_line;
            needle = lower_pat;
        }
        size_t pos = 0;
        while (true) {
            size_t found = haystack.find(needle, pos);
            if (found == std::string_view::npos) break;
            results.push_back({line, line_no, found + 1, pattern.size()});
            pos = found + 1;  // Advance by 1 so overlapping matches are reported
        }
        line_no++;
    }
    return results;
}
void print_highlighted(const SearchResult& r, [[maybe_unused]] std::string_view pattern) {
    std::string_view before = r.line.substr(0, r.col_no - 1);
    std::string_view match  = r.line.substr(r.col_no - 1, r.match_len);
    std::string_view after  = r.line.substr(r.col_no - 1 + r.match_len);
    // ANSI escape codes: bold red for the match
    std::print("L{:4}: {}\033[1;31m{}\033[0m{}\n",
               r.line_no, before, match, after);
}
Command-Line Interface with std::print
// src/main.cpp
#include <print>
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>
#include "processor.hpp"
namespace fs = std::filesystem;
std::string read_file(const fs::path& path) {
std::ifstream file(path, std::ios::in | std::ios::binary);
if (!file) throw std::runtime_error("Cannot open: " + path.string());
// Get file size for reserve:
file.seekg(0, std::ios::end);
std::string content;
content.resize(static_cast<size_t>(file.tellg()));
file.seekg(0, std::ios::beg);
file.read(content.data(), static_cast<std::streamsize>(content.size()));
return content;
}
void print_stats(const TextStats& s, const fs::path& path) {
std::println("╔═════════════════════════════╗");
std::println("║   Text Processor Analysis   ║");
std::println("╠═════════════════════════════╣");
std::println("║ File: {:21} ║", path.filename().string());
std::println("╠═════════════════════════════╣");
std::println("║ Lines:         {:>12} ║", s.lines);
std::println("║ Words:         {:>12} ║", s.words);
std::println("║ Characters:    {:>12} ║", s.characters);
std::println("║ Paragraphs:    {:>12} ║", s.paragraphs);
std::println("║ Unique words:  {:>12} ║", s.unique_words);
std::println("║ Avg word len:  {:>12.1f} ║", s.avg_word_length);
std::println("╚═════════════════════════════╝");
}
int main(int argc, char* argv[]) {
if (argc < 2) {
std::println(stderr, "Usage: text-proc <file> [--search <pattern>] [--top N]");
return 1;
}
fs::path file_path(argv[1]);
if (!fs::exists(file_path)) {
std::println(stderr, "File not found: {}", file_path.string());
return 1;
}
auto content = read_file(file_path);
std::string_view view(content); // Zero-copy view for all analysis
auto stats = analyze_text(view);
print_stats(stats, file_path);
// Optional: search mode
for (int i = 2; i < argc; i++) {
if (std::string_view(argv[i]) == "--search" && i + 1 < argc) {
auto results = search_text(view, argv[++i]);
std::println("\nSearch results for '{}':", argv[i]);
for (const auto& r : results) print_highlighted(r, argv[i]);
}
if (std::string_view(argv[i]) == "--top" && i + 1 < argc) {
size_t n = std::stoul(argv[++i]);
std::println("\nTop {} words:", n);
for (auto& [word, count] : top_words(view, n)) {
std::println(" {:>6}× {}", count, word);
}
}
}
return 0;
}
Extension Challenges
- Multi-file pipeline: Accept multiple filenames and aggregate stats across all files
- Regex search: Replace string_view::find with std::regex or std::basic_regex via a Ranges pipeline
- Output formats: Add --json and --csv flags using std::format output sinks
- Parallel analysis: Use std::execution::par_unseq with std::reduce for character counting
- Memory-mapped files: Use mmap/MapViewOfFile for true zero-copy on large files (>100MB)
Phase 1 Reflection
You have built a genuine, usable tool that applies every concept from Modules 1–9:
| Module Concept | Used In Project |
|---|---|
| CMake + compiler presets (Module 1) | CMakeLists.txt with sanitizers |
| auto, const, string_view (Module 2) | Analysis function parameters |
| Stack allocation, RAII (Module 3) | std::string content lifetime |
| Structured bindings (Module 4) | for (auto& [word, count] : freq) |
| noexcept, error handling (Module 5) | File reading with exceptions |
| References, const& (Module 6) | TextStats&, SearchResult& |
| RAII classes (Module 8) | std::ifstream auto-close |
| std::format, std::print (Module 9) | All output formatting |
Proceed to Phase 2: Memory Layout: Stack vs Heap →
Frequently Asked Questions
Q: What standard library components are most useful for building a C++ CLI text processor?
The key components are: std::ifstream/std::ofstream for file I/O, std::string and std::string_view for efficient text manipulation, std::regex for pattern matching and substitution, std::getopt or a library like CLI11/cxxopts for argument parsing, and std::cout/std::cerr for output and error reporting. For line-by-line processing, std::getline in a loop over an ifstream is the idiomatic approach.
Q: How do you handle large files efficiently in a C++ text processing tool?
Process the file line-by-line rather than reading it all into memory at once — std::getline with a std::ifstream keeps memory usage constant regardless of file size. For binary or structured data, use read() with a fixed buffer. If transformation speed matters, consider memory-mapping the file with mmap (POSIX) or CreateFileMapping (Windows), which lets the OS page in only the portions you access.
Q: What is the recommended way to parse command-line arguments in a modern C++ project?
Avoid hand-parsing argv for anything beyond trivial cases. Use a lightweight header-only library: CLI11 is the most popular modern choice (supports subcommands, validation, and help generation with zero dependencies), cxxopts is simpler for smaller tools. Both are available via vcpkg or as single-header downloads. For standard POSIX tools targeting Linux only, getopt_long from <getopt.h> is an alternative.
Part of the C++ Mastery Course — 30 modules from modern C++ basics to expert systems engineering.
