Project 1: Building a High-Performance C++ CLI Text Processor with std::format & Ranges

TopicTrick Team

Project Architecture

text
text-processor/
├── CMakeLists.txt
├── include/
│   ├── processor.hpp          # Core types and function declarations
│   ├── searcher.hpp           # Pattern search engine
│   └── formatter.hpp          # Output formatting utilities
├── src/
│   ├── main.cpp              # CLI entry point
│   ├── processor.cpp         # Text analysis implementation
│   ├── searcher.cpp          # Search + highlight implementation
│   └── formatter.cpp         # Tabular output formatting
└── tests/
    ├── CMakeLists.txt
    ├── test_processor.cpp    # Google Test unit tests
    └── test_data/
        └── sample.txt        # Test fixture

CMakeLists.txt Setup

cmake
cmake_minimum_required(VERSION 3.25)
project(TextProcessor VERSION 1.0.0 LANGUAGES CXX)

set(CMAKE_CXX_STANDARD 23)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)  # Use -std=c++23, not -std=gnu++23

# Address/UB sanitizers in Debug builds (GCC/Clang). Directory-scope, so they
# apply to every target defined below:
if(CMAKE_BUILD_TYPE STREQUAL "Debug" AND NOT MSVC)
    add_compile_options(-fsanitize=address,undefined)
    add_link_options(-fsanitize=address,undefined)
endif()

add_library(processor_lib STATIC
    src/processor.cpp
    src/searcher.cpp
    src/formatter.cpp
)
target_include_directories(processor_lib PUBLIC include)

add_executable(TextProcessor src/main.cpp)
target_link_libraries(TextProcessor PRIVATE processor_lib)

# Compiler warnings. Note: a target must already exist before
# target_compile_options can be applied to it, so this comes after the
# add_library/add_executable calls above:
set(WARN_FLAGS
    $<$<CXX_COMPILER_ID:GNU,Clang>:-Wall -Wextra -Wpedantic -Werror>
    $<$<CXX_COMPILER_ID:MSVC>:/W4 /WX>
)
target_compile_options(processor_lib PRIVATE ${WARN_FLAGS})
target_compile_options(TextProcessor PRIVATE ${WARN_FLAGS})

# Unit tests with Google Test:
find_package(GTest REQUIRED)
add_executable(tests tests/test_processor.cpp)
target_link_libraries(tests PRIVATE processor_lib GTest::gtest_main)
include(GoogleTest)
gtest_discover_tests(tests)
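With the build files in place, a typical out-of-source workflow looks like this (directory names follow the layout above; `ctest --test-dir` requires CMake 3.20 or newer):

```shell
# Configure a Debug build (the sanitizer flags above apply only in Debug)
cmake -S . -B build -DCMAKE_BUILD_TYPE=Debug
# Compile the library, the CLI, and the tests
cmake --build build -j
# Run the Google Test suite through CTest
ctest --test-dir build --output-on-failure
```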

Core Types and Interface Design

cpp
// include/processor.hpp
#pragma once
#include <string>
#include <string_view>
#include <vector>
#include <map>
#include <cstddef>

// Aggregate statistics returned from analysis:
struct TextStats {
    size_t lines       = 0;
    size_t words       = 0;
    size_t characters  = 0;
    size_t paragraphs  = 0;  // Double newline-separated sections
    size_t unique_words = 0;
    double avg_word_length = 0.0;
};

// Search result with match positions:
struct SearchResult {
    std::string_view line;    // The full line containing the match
    size_t           line_no; // 1-based line number
    size_t           col_no;  // 1-based column of first match
    size_t           match_len; // Length of the matched text
};

// Word frequency entry:
struct WordFreq {
    std::string word;
    size_t      count;
};

// Core analysis functions. Inputs are taken by string_view, so the input
// text is never copied (the case-conversion functions allocate their output):
TextStats                analyze_text(std::string_view content);
std::vector<SearchResult> search_text(std::string_view content,
                                       std::string_view pattern,
                                       bool case_sensitive = true);
std::string              to_uppercase(std::string_view content);
std::string              to_lowercase(std::string_view content);
std::vector<WordFreq>    top_words(std::string_view content, size_t n = 10);
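Because these result types are plain aggregates, they work with C++20 designated initializers, which keeps unit-test fixtures readable. A minimal standalone sketch (the struct is redeclared locally so the snippet compiles on its own):

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Local mirror of the project's SearchResult aggregate:
struct SearchResult {
    std::string_view line;
    std::size_t      line_no;
    std::size_t      col_no;
    std::size_t      match_len;
};

// Designated initializers (C++20): members are named at the call site and
// must appear in declaration order; trailing members may be omitted.
constexpr SearchResult make_demo_result() {
    return SearchResult{.line = "hello world", .line_no = 3,
                        .col_no = 7, .match_len = 5};
}
```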

Text Analysis: Words, Lines, Characters

cpp
// src/processor.cpp
#include "processor.hpp"
#include <algorithm>
#include <cctype>
#include <ranges>
#include <string>
#include <unordered_map>

TextStats analyze_text(std::string_view content) {
    TextStats stats;
    bool in_word = false;
    bool last_newline = false;
    bool in_blank_run = false;  // Have we already counted this run of blank lines?
    
    for (char ch : content) {
        stats.characters++;
        
        if (ch == '\n') {
            stats.lines++;
            // A blank line (two '\n' in a row) separates paragraphs; count
            // each run of consecutive blank lines as a single separator:
            if (last_newline && !in_blank_run) {
                stats.paragraphs++;
                in_blank_run = true;
            }
        } else {
            in_blank_run = false;
        }
        last_newline = (ch == '\n');
        
        bool is_whitespace = (ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
        if (!is_whitespace && !in_word) {
            stats.words++;
            in_word = true;
        } else if (is_whitespace) {
            in_word = false;
        }
    }
    if (!content.empty() && content.back() != '\n') stats.lines++;
    if (stats.words > 0) stats.paragraphs++;  // Final paragraph has no separator
    
    // Use ranges for the unique word count, splitting on runs of any
    // whitespace (views::chunk_by groups adjacent chars of the same class):
    auto is_space = [](char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    };
    auto word_view = content
        | std::views::chunk_by([&](char a, char b) { return is_space(a) == is_space(b); })
        | std::views::filter([&](auto r) { return !is_space(r.front()); })
        | std::views::transform([](auto r) { return std::string(r.begin(), r.end()); });
    
    std::unordered_map<std::string, size_t> freq;
    for (const auto& word : word_view) freq[word]++;
    stats.unique_words = freq.size();
    
    // Average word length:
    size_t total_chars = 0;
    for (const auto& [word, count] : freq) total_chars += word.size() * count;
    stats.avg_word_length = stats.words > 0
        ? static_cast<double>(total_chars) / stats.words : 0.0;
    
    return stats;
}

std::vector<WordFreq> top_words(std::string_view content, size_t n) {
    std::unordered_map<std::string, size_t> freq;
    auto is_space = [](char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    };
    
    // Tokenize on whitespace runs, then normalize each token:
    for (auto token : content
             | std::views::chunk_by([&](char a, char b) { return is_space(a) == is_space(b); })
             | std::views::filter([&](auto r) { return !is_space(r.front()); })) {
        std::string word(token.begin(), token.end());
        // Remove punctuation (unsigned char avoids UB for negative char values):
        std::erase_if(word, [](unsigned char c) { return std::ispunct(c); });
        // Lowercase for case-insensitive counting:
        std::ranges::transform(word, word.begin(),
            [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
        if (word.size() > 2) freq[word]++;  // Skip very short words
    }
    
    // Get top N by frequency:
    std::vector<WordFreq> results;
    results.reserve(freq.size());
    for (auto& [word, count] : freq) results.push_back({word, count});
    
    auto middle = results.begin()
        + static_cast<std::ptrdiff_t>(std::min(n, results.size()));
    std::ranges::partial_sort(results, middle,
        [](const WordFreq& a, const WordFreq& b) { return a.count > b.count; });
    results.resize(std::min(n, results.size()));
    return results;
}
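The boundary-detection loop inside analyze_text is easy to verify in isolation. A standalone sketch of just the word-counting state machine (same transitions as above, stripped down to a single function):

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Counts words the same way analyze_text does: a new word starts whenever a
// non-whitespace character follows whitespace (or begins the input).
std::size_t count_words(std::string_view text) {
    std::size_t words = 0;
    bool in_word = false;
    for (char ch : text) {
        bool is_ws = (ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r');
        if (!is_ws && !in_word) { ++words; in_word = true; }
        else if (is_ws)         { in_word = false; }
    }
    return words;
}
```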

Pattern Search with ANSI Highlighting

cpp
// src/searcher.cpp
#include "searcher.hpp"
#include "processor.hpp"  // SearchResult, search_text declaration
#include <algorithm>
#include <cctype>
#include <print>
#include <ranges>
#include <string>

namespace {
std::string to_lower_copy(std::string_view sv) {
    std::string out(sv);
    std::ranges::transform(out, out.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return out;
}
}  // namespace

std::vector<SearchResult> search_text(std::string_view content,
                                       std::string_view pattern,
                                       bool case_sensitive) {
    std::vector<SearchResult> results;
    size_t line_no = 1;
    
    // For case-insensitive search, lowercase the pattern once up front,
    // not once per match attempt:
    const std::string lower_pat =
        case_sensitive ? std::string{} : to_lower_copy(pattern);
    
    for (auto line_rng : content | std::views::split('\n')) {
        std::string_view line(line_rng.begin(), line_rng.end());
        
        // Likewise lowercase each line once. Match positions carry over
        // because ASCII lowercasing maps one char to one char:
        const std::string lower_line =
            case_sensitive ? std::string{} : to_lower_copy(line);
        std::string_view haystack = case_sensitive ? line : lower_line;
        std::string_view needle   = case_sensitive ? pattern : lower_pat;
        
        size_t pos = 0;
        while (true) {
            size_t found = haystack.find(needle, pos);
            if (found == std::string_view::npos) break;
            results.push_back({line, line_no, found + 1, pattern.size()});
            pos = found + 1;  // Advance by one so overlapping matches are found
        }
        line_no++;
    }
    return results;
}

void print_highlighted(const SearchResult& r, std::string_view /*pattern*/) {
    // The pattern parameter is kept for the declared interface but unused:
    // col_no (1-based) and match_len already locate the match in the line.
    std::string_view before = r.line.substr(0, r.col_no - 1);
    std::string_view match  = r.line.substr(r.col_no - 1, r.match_len);
    std::string_view after  = r.line.substr(r.col_no - 1 + r.match_len);
    
    // ANSI escapes: \033[1;31m = bold red, \033[0m = reset
    std::print("L{:4}: {}\033[1;31m{}\033[0m{}\n",
               r.line_no, before, match, after);
}
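The lowercase-both-sides approach to case-insensitive matching can be exercised on its own. A minimal sketch (ASCII-only, like the search code above; std::tolower is fed an unsigned char to avoid undefined behavior on negative char values):

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <string_view>

// Returns the position of needle in haystack, ignoring ASCII case,
// or std::string_view::npos if absent.
std::size_t ifind(std::string_view haystack, std::string_view needle) {
    auto lower = [](std::string_view sv) {
        std::string out(sv);
        for (char& c : out)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        return out;
    };
    // Lowercase both sides, then do an ordinary find; positions carry over
    // because ASCII lowercasing is one-to-one.
    return lower(haystack).find(lower(needle));
}
```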

Command-Line Interface with std::print

cpp
// src/main.cpp
#include <filesystem>
#include <fstream>
#include <print>
#include <stdexcept>
#include <string>
#include <string_view>
#include "processor.hpp"
#include "searcher.hpp"  // print_highlighted

namespace fs = std::filesystem;

std::string read_file(const fs::path& path) {
    std::ifstream file(path, std::ios::in | std::ios::binary);
    if (!file) throw std::runtime_error("Cannot open: " + path.string());
    
    // Size the string to the file length, then read it in one call:
    file.seekg(0, std::ios::end);
    std::string content;
    content.resize(static_cast<size_t>(file.tellg()));
    file.seekg(0, std::ios::beg);
    file.read(content.data(), static_cast<std::streamsize>(content.size()));
    return content;
}

void print_stats(const TextStats& s, const fs::path& path) {
    // Box assumes filenames of at most 20 characters:
    std::println("╔═════════════════════════════╗");
    std::println("║  Text Processor Analysis    ║");
    std::println("╠═════════════════════════════╣");
    std::println("║  File: {:<20} ║", path.filename().string());
    std::println("╠═════════════════════════════╣");
    std::println("║  Lines:        {:>12} ║", s.lines);
    std::println("║  Words:        {:>12} ║", s.words);
    std::println("║  Characters:   {:>12} ║", s.characters);
    std::println("║  Paragraphs:   {:>12} ║", s.paragraphs);
    std::println("║  Unique words: {:>12} ║", s.unique_words);
    std::println("║  Avg word len: {:>12.1f} ║", s.avg_word_length);
    std::println("╚═════════════════════════════╝");
}

int main(int argc, char* argv[]) {
    if (argc < 2) {
        std::println(stderr, "Usage: text-proc <file> [--search <pattern>] [--top N]");
        return 1;
    }
    
    fs::path file_path(argv[1]);
    if (!fs::exists(file_path)) {
        std::println(stderr, "File not found: {}", file_path.string());
        return 1;
    }
    
    auto content = read_file(file_path);
    std::string_view view(content); // Zero-copy view for all analysis
    
    auto stats = analyze_text(view);
    print_stats(stats, file_path);
    
    // Optional flags: --search <pattern>, --top <N>
    for (int i = 2; i < argc; i++) {
        std::string_view arg(argv[i]);
        if (arg == "--search" && i + 1 < argc) {
            std::string_view pattern = argv[++i];
            auto results = search_text(view, pattern);
            std::println("\nSearch results for '{}':", pattern);
            for (const auto& r : results) print_highlighted(r, pattern);
        } else if (arg == "--top" && i + 1 < argc) {
            size_t n = std::stoul(argv[++i]);
            std::println("\nTop {} words:", n);
            for (auto& [word, count] : top_words(view, n)) {
                std::println("  {:>6}× {}", count, word);
            }
        }
    }
    return 0;
}

Extension Challenges

  1. Multi-file pipeline: Accept multiple filenames and aggregate stats across all files
  2. Regex search: Replace string_view::find with std::regex (for example, std::sregex_iterator over each line)
  3. Output formats: Add --json and --csv flags, writing via std::format_to to a string or output iterator
  4. Parallel analysis: Use std::execution::par_unseq with std::reduce for character counting
  5. Memory-mapped files: Use mmap/MapViewOfFile for true zero-copy on large files (>100MB)
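For challenge 1, the key design question is how per-file statistics combine. Plain counts add directly; the average word length needs word-count weighting; and unique_words cannot be derived from per-file counts at all, since the per-file word sets must be unioned. A sketch of the additive part (TextStats is redeclared locally so the snippet stands alone; exact unique-word merging is assumed to happen elsewhere):

```cpp
#include <cassert>
#include <cstddef>

// Local mirror of the project's TextStats aggregate:
struct TextStats {
    std::size_t lines = 0, words = 0, characters = 0, paragraphs = 0;
    std::size_t unique_words = 0;
    double avg_word_length = 0.0;
};

// Merge stats from two files. unique_words is deliberately NOT summed here:
// summing would double-count words shared between files; a real
// implementation must union the per-file word sets instead.
TextStats merge_stats(const TextStats& a, const TextStats& b) {
    TextStats m;
    m.lines      = a.lines + b.lines;
    m.words      = a.words + b.words;
    m.characters = a.characters + b.characters;
    m.paragraphs = a.paragraphs + b.paragraphs;
    // Word-count-weighted average of the two averages:
    m.avg_word_length = m.words
        ? (a.avg_word_length * a.words + b.avg_word_length * b.words)
              / static_cast<double>(m.words)
        : 0.0;
    return m;
}
```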

Phase 1 Reflection

You have built a genuine, usable tool that applies every concept from Modules 1–9:

Module Concept                           Used In Project
---------------------------------------  --------------------------------
CMake + Compiler Presets (Module 1)      CMakeLists.txt with sanitizers
auto, const, string_view (Module 2)      Analysis function parameters
Stack allocation, RAII (Module 3)        std::string content lifetime
Structured bindings (Module 4)           for (auto& [word, count] : freq)
noexcept, error handling (Module 5)      File reading with exceptions
References, const& (Module 6)            TextStats&, SearchResult&
RAII classes (Module 8)                  std::ifstream auto-close
std::format, std::print (Module 9)       All output formatting

Proceed to Phase 2: Memory Layout (Stack vs Heap) →

Frequently Asked Questions

Q: What standard library components are most useful for building a C++ CLI text processor? The key components are: std::ifstream/std::ofstream for file I/O, std::string and std::string_view for efficient text manipulation, std::regex for pattern matching and substitution, a small argument-parsing library such as CLI11 or cxxopts (or POSIX getopt) for option handling, and std::cout/std::cerr, or C++23's std::print, for output and error reporting. For line-by-line processing, std::getline in a loop over an ifstream is the idiomatic approach.
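The idiomatic std::getline loop looks like this in practice (a minimal sketch; std::istringstream stands in for a file stream so it runs anywhere):

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Count lines the way a streaming CLI tool would: one std::getline per
// iteration, so memory use is constant regardless of input size.
std::size_t count_lines(std::istream& in) {
    std::size_t n = 0;
    std::string line;
    while (std::getline(in, line)) ++n;
    return n;
}
```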

Q: How do you handle large files efficiently in a C++ text processing tool? Process the file line-by-line rather than reading it all into memory at once — std::getline with a std::ifstream keeps memory usage constant regardless of file size. For binary or structured data, use read() with a fixed buffer. If transformation speed matters, consider memory-mapping the file with mmap (POSIX) or CreateFileMapping (Windows), which lets the OS page in only the portions you access.
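The fixed-buffer read() pattern mentioned above can be sketched the same way: read a chunk, process it, repeat. Here the "processing" is just counting newlines, and the buffer is deliberately tiny to exercise the chunking path (production code would typically use 64 KiB or more):

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>

// Stream through the input in fixed-size chunks, counting '\n' bytes.
// Memory use is bounded by the buffer, not by the input size.
std::size_t count_newlines_chunked(std::istream& in) {
    char buf[16];  // Illustratively small chunk size
    std::size_t n = 0;
    // The second condition handles the final partial chunk, where read()
    // fails at EOF but gcount() still reports the bytes it extracted:
    while (in.read(buf, sizeof buf) || in.gcount() > 0) {
        for (std::streamsize i = 0; i < in.gcount(); ++i)
            if (buf[i] == '\n') ++n;
    }
    return n;
}
```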

Q: What is the recommended way to parse command-line arguments in a modern C++ project? Avoid hand-parsing argv for anything beyond trivial cases. Use a lightweight header-only library: CLI11 is the most popular modern choice (supports subcommands, validation, and help generation with zero dependencies), cxxopts is simpler for smaller tools. Both are available via vcpkg or as single-header downloads. For standard POSIX tools targeting Linux only, getopt_long from <getopt.h> is an alternative.
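For the POSIX route mentioned above, a minimal getopt_long sketch matching this project's flags (Linux/glibc; the Options struct and parse_args helper are names invented for this example, not part of the project):

```cpp
#include <cassert>
#include <getopt.h>   // POSIX/GNU extension, not part of ISO C++
#include <string>

struct Options {
    std::string file;    // Positional argument
    std::string search;  // --search <pattern>
    int top = 0;         // --top <N>
};

Options parse_args(int argc, char* argv[]) {
    static const option long_opts[] = {
        {"search", required_argument, nullptr, 's'},
        {"top",    required_argument, nullptr, 't'},
        {nullptr, 0, nullptr, 0},
    };
    Options opts;
    optind = 1;  // Reset getopt's global cursor (matters if called twice)
    int c;
    while ((c = getopt_long(argc, argv, "s:t:", long_opts, nullptr)) != -1) {
        switch (c) {
            case 's': opts.search = optarg; break;
            case 't': opts.top = std::stoi(optarg); break;
        }
    }
    if (optind < argc) opts.file = argv[optind];  // First non-option argument
    return opts;
}
```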


Part of the C++ Mastery Course — 30 modules from modern C++ basics to expert systems engineering.