Project 1: Building a High-Performance C++ CLI Text Processor with std::format & Ranges

TopicTrick Team

Project Architecture

text
text-processor/
├── CMakeLists.txt
├── include/
│   ├── processor.hpp          # Core types and function declarations
│   ├── searcher.hpp           # Pattern search engine
│   └── formatter.hpp          # Output formatting utilities
├── src/
│   ├── main.cpp              # CLI entry point
│   ├── processor.cpp         # Text analysis implementation
│   ├── searcher.cpp          # Search + highlight implementation
│   └── formatter.cpp         # Tabular output formatting
└── tests/
    ├── CMakeLists.txt
    ├── test_processor.cpp    # Google Test unit tests
    └── test_data/
        └── sample.txt        # Test fixture

CMakeLists.txt Setup

cmake
cmake_minimum_required(VERSION 3.25)
project(TextProcessor VERSION 1.0.0 LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)  # Use -std=c++23, not -std=gnu++23

# Address/UB sanitizers in Debug builds (GCC/Clang). Directory-scope, so they
# apply to every target defined below:
if(CMAKE_BUILD_TYPE STREQUAL "Debug" AND NOT MSVC)
    add_compile_options(-fsanitize=address,undefined)
    add_link_options(-fsanitize=address,undefined)
endif()

add_library(processor_lib STATIC
    src/processor.cpp
    src/searcher.cpp
    src/formatter.cpp
)
target_include_directories(processor_lib PUBLIC include)

add_executable(TextProcessor src/main.cpp)
target_link_libraries(TextProcessor PRIVATE processor_lib)

# Compiler warnings. Note: a target must already exist before
# target_compile_options can be applied to it, so this comes after the
# add_library/add_executable calls above:
set(WARN_FLAGS
    $<$<CXX_COMPILER_ID:GNU,Clang>:-Wall -Wextra -Wpedantic -Werror>
    $<$<CXX_COMPILER_ID:MSVC>:/W4 /WX>
)
target_compile_options(processor_lib PRIVATE ${WARN_FLAGS})
target_compile_options(TextProcessor PRIVATE ${WARN_FLAGS})

# Unit tests with Google Test:
find_package(GTest REQUIRED)
add_executable(tests tests/test_processor.cpp)
target_link_libraries(tests PRIVATE processor_lib GTest::gtest_main)
include(GoogleTest)
gtest_discover_tests(tests)
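With the build files in place, a typical out-of-source workflow looks like this (directory names follow the layout above; `ctest --test-dir` requires CMake 3.20 or newer):

```shell
# Configure a Debug build (the sanitizer flags above apply only in Debug)
cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug
# Compile the library, the CLI, and the tests
cmake --build build -j
# Run the Google Test suite through CTest
ctest --test-dir build --output-on-failure
```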

Core Types and Interface Design

cpp
// include/processor.hpp
#pragma once
#include <string>
#include <string_view>
#include <vector>
#include <map>
#include <cstddef>

// Aggregate statistics returned from analysis:
struct TextStats {
    size_t lines       = 0;
    size_t words       = 0;
    size_t characters  = 0;
    size_t paragraphs  = 0;  // Double newline-separated sections
    size_t unique_words = 0;
    double avg_word_length = 0.0;
};

// Search result with match positions:
struct SearchResult {
    std::string_view line;    // The full line containing the match
    size_t           line_no; // 1-based line number
    size_t           col_no;  // 1-based column of first match
    size_t           match_len; // Length of the matched text
};

// Word frequency entry:
struct WordFreq {
    std::string word;
    size_t      count;
};

// Core analysis functions. Inputs are taken by string_view, so the input
// text is never copied (the case-conversion functions allocate their output):
TextStats                analyze_text(std::string_view content);
std::vector<SearchResult> search_text(std::string_view content,
                                       std::string_view pattern,
                                       bool case_sensitive = true);
std::string              to_uppercase(std::string_view content);
std::string              to_lowercase(std::string_view content);
std::vector<WordFreq>    top_words(std::string_view content, size_t n = 10);
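Because these result types are plain aggregates, they work with C++20 designated initializers, which keeps unit-test fixtures readable. A minimal standalone sketch (the struct is redeclared locally so the snippet compiles on its own):

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Local mirror of the project's SearchResult aggregate:
struct SearchResult {
    std::string_view line;
    std::size_t      line_no;
    std::size_t      col_no;
    std::size_t      match_len;
};

// Designated initializers (C++20): members are named at the call site and
// must appear in declaration order; trailing members may be omitted.
constexpr SearchResult make_demo_result() {
    return SearchResult{.line = "hello world", .line_no = 3,
                        .col_no = 7, .match_len = 5};
}
```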

Text Analysis: Words, Lines, Characters

cpp
// src/processor.cpp
#include "processor.hpp"
#include <algorithm>
#include <cctype>
#include <ranges>
#include <string>
#include <unordered_map>

TextStats analyze_text(std::string_view content) {
    TextStats stats;
    bool in_word = false;
    bool last_newline = false;
    bool in_blank_run = false;  // Have we already counted this run of blank lines?
    
    for (char ch : content) {
        stats.characters++;
        
        if (ch == '\n') {
            stats.lines++;
            // A blank line (two '\n' in a row) separates paragraphs; count
            // each run of consecutive blank lines as a single separator:
            if (last_newline && !in_blank_run) {
                stats.paragraphs++;
                in_blank_run = true;
            }
        } else {
            in_blank_run = false;
        }
        last_newline = (ch == '\n');
        
        bool is_whitespace = (ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
        if (!is_whitespace && !in_word) {
            stats.words++;
            in_word = true;
        } else if (is_whitespace) {
            in_word = false;
        }
    }
    if (!content.empty() && content.back() != '\n') stats.lines++;
    if (stats.words > 0) stats.paragraphs++;  // Final paragraph has no separator
    
    // Use ranges for the unique word count, splitting on runs of any
    // whitespace (views::chunk_by groups adjacent chars of the same class):
    auto is_space = [](char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    };
    auto word_view = content
        | std::views::chunk_by([&](char a, char b) { return is_space(a) == is_space(b); })
        | std::views::filter([&](auto r) { return !is_space(r.front()); })
        | std::views::transform([](auto r) { return std::string(r.begin(), r.end()); });
    
    std::unordered_map<std::string, size_t> freq;
    for (const auto& word : word_view) freq[word]++;
    stats.unique_words = freq.size();
    
    // Average word length:
    size_t total_chars = 0;
    for (const auto& [word, count] : freq) total_chars += word.size() * count;
    stats.avg_word_length = stats.words > 0
        ? static_cast<double>(total_chars) / stats.words : 0.0;
    
    return stats;
}

std::vector<WordFreq> top_words(std::string_view content, size_t n) {
    std::unordered_map<std::string, size_t> freq;
    auto is_space = [](char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    };
    
    // Tokenize on whitespace runs, then normalize each token:
    for (auto token : content
             | std::views::chunk_by([&](char a, char b) { return is_space(a) == is_space(b); })
             | std::views::filter([&](auto r) { return !is_space(r.front()); })) {
        std::string word(token.begin(), token.end());
        // Remove punctuation (unsigned char avoids UB for negative char values):
        std::erase_if(word, [](unsigned char c) { return std::ispunct(c); });
        // Lowercase for case-insensitive counting:
        std::ranges::transform(word, word.begin(),
            [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        if (word.size() > 2) freq[word]++;  // Skip very short words
    }
    
    // Get top N by frequency:
    std::vector<WordFreq> results;
    results.reserve(freq.size());
    for (auto& [word, count] : freq) results.push_back({word, count});
    
    auto middle = results.begin()
        + static_cast<std::ptrdiff_t>(std::min(n, results.size()));
    std::ranges::partial_sort(results, middle,
        [](const WordFreq& a, const WordFreq& b) { return a.count > b.count; });
    results.resize(std::min(n, results.size()));
    return results;
}
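The boundary-detection loop inside analyze_text is easy to verify in isolation. A standalone sketch of just the word-counting state machine (same transitions as above, stripped down to a single function):

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Counts words the same way analyze_text does: a new word starts whenever a
// non-whitespace character follows whitespace (or begins the input).
std::size_t count_words(std::string_view text) {
    std::size_t words = 0;
    bool in_word = false;
    for (char ch : text) {
        bool is_ws = (ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
        if (!is_ws && !in_word) { ++words; in_word = true; }
        else if (is_ws)         { in_word = false; }
    }
    return words;
}
```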

Pattern Search with ANSI Highlighting

cpp
// src/searcher.cpp
#include "searcher.hpp"
#include "processor.hpp"  // SearchResult, search_text declaration
#include <algorithm>
#include <cctype>
#include <print>
#include <ranges>
#include <string>

namespace {
std::string to_lower_copy(std::string_view sv) {
    std::string out(sv);
    std::ranges::transform(out, out.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return out;
}
}  // namespace

std::vector<SearchResult> search_text(std::string_view content,
                                       std::string_view pattern,
                                       bool case_sensitive) {
    std::vector<SearchResult> results;
    size_t line_no = 1;
    
    // For case-insensitive search, lowercase the pattern once up front,
    // not once per match attempt:
    const std::string lower_pat =
        case_sensitive ? std::string{} : to_lower_copy(pattern);
    
    for (auto line_rng : content | std::views::split('\n')) {
        std::string_view line(line_rng.begin(), line_rng.end());
        
        // Likewise lowercase each line once. Match positions carry over
        // because ASCII lowercasing maps one char to one char:
        const std::string lower_line =
            case_sensitive ? std::string{} : to_lower_copy(line);
        std::string_view haystack = case_sensitive ? line : lower_line;
        std::string_view needle   = case_sensitive ? pattern : lower_pat;
        
        size_t pos = 0;
        while (true) {
            size_t found = haystack.find(needle, pos);
            if (found == std::string_view::npos) break;
            results.push_back({line, line_no, found + 1, pattern.size()});
            pos = found + 1;  // Advance by one so overlapping matches are found
        }
        line_no++;
    }
    return results;
}

void print_highlighted(const SearchResult& r, std::string_view /*pattern*/) {
    // The pattern parameter is kept for the declared interface but unused:
    // col_no (1-based) and match_len already locate the match in the line.
    std::string_view before = r.line.substr(0, r.col_no - 1);
    std::string_view match  = r.line.substr(r.col_no - 1, r.match_len);
    std::string_view after  = r.line.substr(r.col_no - 1 + r.match_len);
    
    // ANSI escapes: \033[1;31m = bold red, \033[0m = reset
    std::print("L{:4}: {}\033[1;31m{}\033[0m{}\n",
               r.line_no, before, match, after);
}
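The lowercase-both-sides approach to case-insensitive matching can be exercised on its own. A minimal sketch (ASCII-only, like the search code above; std::tolower is fed an unsigned char to avoid undefined behavior on negative char values):

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Returns the position of needle in haystack, ignoring ASCII case,
// or std::string_view::npos if absent.
std::size_t ifind(std::string_view haystack, std::string_view needle) {
    auto lower = [](std::string_view sv) {
        std::string out(sv);
        for (char& c : out)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        return out;
    };
    // Lowercase both sides, then do an ordinary find; positions carry over
    // because ASCII lowercasing is one-to-one.
    return lower(haystack).find(lower(needle));
}
```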

Command-Line Interface with std::print

cpp
// src/main.cpp
#include <filesystem>
#include <fstream>
#include <print>
#include <stdexcept>
#include <string>
#include <string_view>
#include "processor.hpp"
#include "searcher.hpp"  // print_highlighted

namespace fs = std::filesystem;

std::string read_file(const fs::path& path) {
    std::ifstream file(path, std::ios::in | std::ios::binary);
    if (!file) throw std::runtime_error("Cannot open: " + path.string());
    
    // Size the string to the file length, then read it in one call:
    file.seekg(0, std::ios::end);
    std::string content;
    content.resize(static_cast<size_t>(file.tellg()));
    file.seekg(0, std::ios::beg);
    file.read(content.data(), static_cast<std::streamsize>(content.size()));
    return content;
}

void print_stats(const TextStats& s, const fs::path& path) {
    // Box assumes filenames of at most 20 characters:
    std::println("╔═════════════════════════════╗");
    std::println("║  Text Processor Analysis    ║");
    std::println("╠═════════════════════════════╣");
    std::println("║  File: {:<20} ║", path.filename().string());
    std::println("╠═════════════════════════════╣");
    std::println("║  Lines:        {:>12} ║", s.lines);
    std::println("║  Words:        {:>12} ║", s.words);
    std::println("║  Characters:   {:>12} ║", s.characters);
    std::println("║  Paragraphs:   {:>12} ║", s.paragraphs);
    std::println("║  Unique words: {:>12} ║", s.unique_words);
    std::println("║  Avg word len: {:>12.1f} ║", s.avg_word_length);
    std::println("╚═════════════════════════════╝");
}

int main(int argc, char* argv[]) {
    if (argc < 2) {
        std::println(stderr, "Usage: text-proc <file> [--search <pattern>] [--top N]");
        return 1;
    }
    
    fs::path file_path(argv[1]);
    if (!fs::exists(file_path)) {
        std::println(stderr, "File not found: {}", file_path.string());
        return 1;
    }
    
    auto content = read_file(file_path);
    std::string_view view(content); // Zero-copy view for all analysis
    
    auto stats = analyze_text(view);
    print_stats(stats, file_path);
    
    // Optional flags: --search <pattern>, --top <N>
    for (int i = 2; i < argc; i++) {
        std::string_view arg(argv[i]);
        if (arg == "--search" && i + 1 < argc) {
            std::string_view pattern = argv[++i];
            auto results = search_text(view, pattern);
            std::println("\nSearch results for '{}':", pattern);
            for (const auto& r : results) print_highlighted(r, pattern);
        } else if (arg == "--top" && i + 1 < argc) {
            size_t n = std::stoul(argv[++i]);
            std::println("\nTop {} words:", n);
            for (auto& [word, count] : top_words(view, n)) {
                std::println("  {:>6}× {}", count, word);
            }
        }
    }
    return 0;
}

Extension Challenges

  1. Multi-file pipeline: Accept multiple filenames and aggregate stats across all files
  2. Regex search: Replace string_view::find with std::regex (for example, std::sregex_iterator over each line)
  3. Output formats: Add --json and --csv flags, writing via std::format_to to a string or output iterator
  4. Parallel analysis: Use std::execution::par_unseq with std::reduce for character counting
  5. Memory-mapped files: Use mmap/MapViewOfFile for true zero-copy on large files (>100MB)
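For challenge 1, the key design question is how per-file statistics combine. Plain counts add directly; the average word length needs word-count weighting; and unique_words cannot be derived from per-file counts at all, since the per-file word sets must be unioned. A sketch of the additive part (TextStats is redeclared locally so the snippet stands alone; exact unique-word merging is assumed to happen elsewhere):

```cpp
#include <cassert>
#include <cstddef>

// Local mirror of the project's TextStats aggregate:
struct TextStats {
    std::size_t lines = 0, words = 0, characters = 0, paragraphs = 0;
    std::size_t unique_words = 0;
    double avg_word_length = 0.0;
};

// Merge stats from two files. unique_words is deliberately NOT summed here:
// summing would double-count words shared between files; a real
// implementation must union the per-file word sets instead.
TextStats merge_stats(const TextStats& a, const TextStats& b) {
    TextStats m;
    m.lines      = a.lines + b.lines;
    m.words      = a.words + b.words;
    m.characters = a.characters + b.characters;
    m.paragraphs = a.paragraphs + b.paragraphs;
    // Word-count-weighted average of the two averages:
    m.avg_word_length = m.words
        ? (a.avg_word_length * a.words + b.avg_word_length * b.words)
              / static_cast<double>(m.words)
        : 0.0;
    return m;
}
```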

Phase 1 Reflection

You have built a genuine, usable tool that applies every concept from Modules 1–9:

Module Concept                           Used In Project
---------------------------------------  --------------------------------
CMake + Compiler Presets (Module 1)      CMakeLists.txt with sanitizers
auto, const, string_view (Module 2)      Analysis function parameters
Stack allocation, RAII (Module 3)        std::string content lifetime
Structured bindings (Module 4)           for (auto& [word, count] : freq)
noexcept, error handling (Module 5)      File reading with exceptions
References, const& (Module 6)            TextStats&, SearchResult&
RAII classes (Module 8)                  std::ifstream auto-close
std::format, std::print (Module 9)       All output formatting

Proceed to Phase 2: Memory Layout (Stack vs Heap) →

Frequently Asked Questions

Q: What standard library components are most useful for building a C++ CLI text processor? The key components are: std::ifstream/std::ofstream for file I/O, std::string and std::string_view for efficient text manipulation, std::regex for pattern matching and substitution, a small argument-parsing library such as CLI11 or cxxopts (or POSIX getopt) for option handling, and std::cout/std::cerr, or C++23's std::print, for output and error reporting. For line-by-line processing, std::getline in a loop over an ifstream is the idiomatic approach.
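The idiomatic std::getline loop looks like this in practice (a minimal sketch; std::istringstream stands in for a file stream so it runs anywhere):

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Count lines the way a streaming CLI tool would: one std::getline per
// iteration, so memory use is constant regardless of input size.
std::size_t count_lines(std::istream& in) {
    std::size_t n = 0;
    std::string line;
    while (std::getline(in, line)) ++n;
    return n;
}
```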

Q: How do you handle large files efficiently in a C++ text processing tool? Process the file line-by-line rather than reading it all into memory at once — std::getline with a std::ifstream keeps memory usage constant regardless of file size. For binary or structured data, use read() with a fixed buffer. If transformation speed matters, consider memory-mapping the file with mmap (POSIX) or CreateFileMapping (Windows), which lets the OS page in only the portions you access.
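The fixed-buffer read() pattern mentioned above can be sketched the same way: read a chunk, process it, repeat. Here the "processing" is just counting newlines, and the buffer is deliberately tiny to exercise the chunking path (production code would typically use 64 KiB or more):

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>

// Stream through the input in fixed-size chunks, counting '\n' bytes.
// Memory use is bounded by the buffer, not by the input size.
std::size_t count_newlines_chunked(std::istream& in) {
    char buf[16];  // Illustratively small chunk size
    std::size_t n = 0;
    // The second condition handles the final partial chunk, where read()
    // fails at EOF but gcount() still reports the bytes it extracted:
    while (in.read(buf, sizeof buf) || in.gcount() > 0) {
        for (std::streamsize i = 0; i < in.gcount(); ++i)
            if (buf[i] == '\n') ++n;
    }
    return n;
}
```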

Q: What is the recommended way to parse command-line arguments in a modern C++ project? Avoid hand-parsing argv for anything beyond trivial cases. Use a lightweight header-only library: CLI11 is the most popular modern choice (supports subcommands, validation, and help generation with zero dependencies), cxxopts is simpler for smaller tools. Both are available via vcpkg or as single-header downloads. For standard POSIX tools targeting Linux only, getopt_long from <getopt.h> is an alternative.
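For the POSIX route mentioned above, a minimal getopt_long sketch matching this project's flags (Linux/glibc; the Options struct and parse_args helper are names invented for this example, not part of the project):

```cpp
#include <cassert>
#include <getopt.h>   // POSIX/GNU extension, not part of ISO C++
#include <string>

struct Options {
    std::string file;    // Positional argument
    std::string search;  // --search <pattern>
    int top = 0;         // --top <N>
};

Options parse_args(int argc, char* argv[]) {
    static const option long_opts[] = {
        {"search", required_argument, nullptr, 's'},
        {"top",    required_argument, nullptr, 't'},
        {nullptr, 0, nullptr, 0},
    };
    Options opts;
    optind = 1;  // Reset getopt's global cursor (matters if called twice)
    int c;
    while ((c = getopt_long(argc, argv, "s:t:", long_opts, nullptr)) != -1) {
        switch (c) {
            case 's': opts.search = optarg; break;
            case 't': opts.top = std::stoi(optarg); break;
        }
    }
    if (optind < argc) opts.file = argv[optind];  // First non-option argument
    return opts;
}
```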


Part of the C++ Mastery Course — 30 modules from modern C++ basics to expert systems engineering.