
Project: Building a Fast CLI Data Processor in C (Phase 1 Capstone)

TopicTrick Team

Phase 1 Capstone. You've learned types, variables, control flow, functions, and arrays. Now build something real: a command-line tool that reads numeric data from stdin or a file, computes descriptive statistics, sorts the data, and prints a formatted report. The project exercises every concept from Phase 1.



Project Scope and Goals

Your CLI data processor will:

  • Read up to 1,000 integers or floating-point numbers from stdin or a file.
  • Compute: count, min, max, sum, mean (average), median, variance, standard deviation.
  • Sort the dataset and display it.
  • Print a formatted ASCII report table.
  • Handle invalid input gracefully (skip non-numeric lines, report errors).
  • Process a CSV file in a single pass using fgets.

Architecture: The Pipeline Pattern

The program runs as a linear pipeline: read lines (fgets), split and parse tokens (strtod), sort (qsort), compute statistics, then print the report. Each stage is a separate function, covered in Steps 1 to 6 below.


Step 1: Safe Input Handling with fgets

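Avoid scanf("%f", &x) and never use gets() for user input. gets() cannot limit how much it writes and was removed from the language in C11; scanf leaves unconsumed characters in the input stream after a failed match, which makes recovering from bad input awkward. Read whole lines with fgets and parse them afterwards: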

c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>  // ssize_t (POSIX, not ISO C)

#define MAX_LINE 256

// Read one line from fp — strip trailing newline
// Returns: number of chars read, or -1 on EOF/error
ssize_t read_line(FILE *fp, char *buf, size_t bufsize) {
    if (!fgets(buf, (int)bufsize, fp)) return -1;
    
    size_t len = strlen(buf);
    
    // Strip trailing newline and/or carriage return
    while (len > 0 && (buf[len-1] == '\n' || buf[len-1] == '\r')) {
        buf[--len] = '\0';
    }
    
    return (ssize_t)len;
}
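
One thing read_line does not report is whether a line was longer than the buffer, in which case fgets returns it in pieces and a number could be split in two. The sketch below shows one possible way to detect that. The name read_line_checked and its truncated flag are hypothetical, not part of the original program, and it assumes that discarding the rest of an over-long line is acceptable.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

// Hypothetical helper (illustration only): reads one line into buf and sets
// *truncated when the line did not fit and its remainder was discarded.
// Returns false on EOF/error with nothing read, true otherwise.
bool read_line_checked(FILE *fp, char *buf, size_t bufsize, bool *truncated) {
    *truncated = false;
    if (bufsize < 2 || !fgets(buf, (int)bufsize, fp)) return false;

    size_t len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n') {
        buf[--len] = '\0';              // complete line: drop the newline
    } else {
        // No newline stored: either the last line of the file lacks one,
        // or the line was too long. Skip to the end of the line; if any
        // character had to be skipped, the line was truncated.
        int c;
        while ((c = getc(fp)) != EOF && c != '\n') {
            *truncated = true;
        }
    }
    if (len > 0 && buf[len - 1] == '\r') buf[--len] = '\0';  // CRLF input
    return true;
}
```

A caller could print a warning with the line number when truncated is set and skip that line instead of parsing a partial value.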

Step 2: Parsing and Validating Numbers

atof() returns 0.0 when it cannot convert the input, so a bad line is indistinguishable from a genuine zero. Use strtod() instead, which reports where parsing stopped:

c
#include <stdbool.h>
#include <stdlib.h>
#include <errno.h>

// Parse a string as a double — returns true and sets *out on success
bool parse_double(const char *str, double *out) {
    if (!str || *str == '\0') return false;
    
    char *endptr;
    errno = 0;
    double val = strtod(str, &endptr);
    
    if (endptr == str) return false; // Nothing converted (e.g. whitespace-only input)

    // Must have consumed all characters (or just trailing whitespace)
    while (*endptr == ' ' || *endptr == '\t') endptr++;
    
    if (*endptr != '\0') return false; // Trailing non-numeric chars
    if (errno == ERANGE) return false; // Overflow/underflow
    
    *out = val;
    return true;
}
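
To make the difference concrete, here is a minimal sketch comparing the two functions. The helper name is_number is hypothetical and exists only for this illustration; it is a stripped-down version of the check, without the errno or trailing-whitespace handling.

```c
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical helper for illustration: true only if strtod converted at
// least one character and consumed the entire string.
bool is_number(const char *s) {
    char *end;
    (void)strtod(s, &end);          // only the end pointer matters here
    return end != s && *end == '\0';
}
```

With this helper, atof("abc") and atof("0") both evaluate to 0.0, while is_number("abc") is false and is_number("0") is true. The end pointer is what carries the information that atof throws away.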

Step 3: Statistical Computations

c
#include <math.h>

typedef struct {
    size_t count;
    double min;
    double max;
    double sum;
    double mean;
    double median;    // Requires sorted data
    double variance;
    double std_dev;
} Statistics;

Statistics compute_stats(const double *data, size_t count) {
    Statistics s = { .count = count };
    if (count == 0) return s;
    
    // Single pass: min, max, sum
    s.min = s.max = data[0];
    s.sum = 0.0;
    for (size_t i = 0; i < count; i++) {
        if (data[i] < s.min) s.min = data[i];
        if (data[i] > s.max) s.max = data[i];
        s.sum += data[i];
    }
    s.mean = s.sum / (double)count;
    
    // Variance: E[(X - mean)^2], computed in a second pass using the mean from above
    double m2 = 0.0;
    for (size_t i = 0; i < count; i++) {
        double diff = data[i] - s.mean;
        m2 += diff * diff;
    }
    s.variance = m2 / (double)count;        // Population variance
    s.std_dev  = sqrt(s.variance);
    
    // Median (requires sorted data — must sort first)
    if (count % 2 == 1) {
        s.median = data[count / 2];         // Middle element
    } else {
        s.median = (data[count/2 - 1] + data[count/2]) / 2.0; // Average of middle two
    }
    
    return s;
}
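
compute_stats finds the variance with a second loop over the data once the mean is known. For comparison, Welford's online algorithm reaches the same population variance in a single pass, which matters when the data cannot be stored or revisited. The following is a sketch only; the function name welford_variance is hypothetical and not part of the original program.

```c
#include <stddef.h>

// Illustrative one-pass alternative (hypothetical name): Welford's online
// algorithm for the population variance. Returns 0.0 for an empty array.
double welford_variance(const double *data, size_t count) {
    double mean = 0.0;  // running mean of the values seen so far
    double m2   = 0.0;  // running sum of squared deviations from the mean
    for (size_t i = 0; i < count; i++) {
        double delta = data[i] - mean;       // deviation from the old mean
        mean += delta / (double)(i + 1);     // update the running mean
        m2   += delta * (data[i] - mean);    // uses the updated mean
    }
    return count ? m2 / (double)count : 0.0;
}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the squared deviations sum to 32, so the population variance is 4. Dividing by count - 1 instead would give the sample variance.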

Step 4: Sorting with qsort

c
#include <stdlib.h>

// Comparator for qsort (ascending order)
int compare_doubles(const void *a, const void *b) {
    double da = *(const double*)a;
    double db = *(const double*)b;
    return (da > db) - (da < db); // Branchless: returns -1, 0, or +1
}

void sort_data(double *data, size_t count) {
    qsort(data, count, sizeof(double), compare_doubles);
}

Step 5: Formatted Report Output

c
#include <stdio.h>

void print_report(const Statistics *s, const double *sorted_data) {
    printf("\n");
    printf("╔══════════════════════════════════════╗\n");
    printf("║     Data Analysis Report              ║\n");
    printf("╠══════════════════════════════════════╣\n");
    printf("║  Count:          %20zu  ║\n", s->count);
    printf("║  Minimum:        %20.4f  ║\n", s->min);
    printf("║  Maximum:        %20.4f  ║\n", s->max);
    printf("║  Sum:            %20.4f  ║\n", s->sum);
    printf("║  Mean:           %20.4f  ║\n", s->mean);
    printf("║  Median:         %20.4f  ║\n", s->median);
    printf("║  Variance:       %20.4f  ║\n", s->variance);
    printf("║  Std Deviation:  %20.4f  ║\n", s->std_dev);
    printf("╚══════════════════════════════════════╝\n");
    
    if (s->count <= 20 && sorted_data) {
        printf("\nSorted Data: ");
        for (size_t i = 0; i < s->count; i++) {
            printf("%.2f", sorted_data[i]);
            if (i < s->count - 1) printf(", ");
        }
        printf("\n");
    }
}

Step 6: File Input Mode

c
// Read all numbers from a file (CSV or one-per-line)
size_t load_from_file(const char *filename, double *data, size_t max_count) {
    FILE *fp = fopen(filename, "r");
    if (!fp) { perror(filename); return 0; }
    
    size_t count = 0;
    char line[MAX_LINE];
    int  line_num = 0;
    
    while (count < max_count && read_line(fp, line, sizeof(line)) >= 0) {
        line_num++;
        if (line[0] == '\0' || line[0] == '#') continue; // Skip empty/comment
        
        // Handle comma-separated values on one line
        char *token = strtok(line, ",; \t");
        while (token && count < max_count) {
            double val;
            if (parse_double(token, &val)) {
                data[count++] = val;
            } else {
                fprintf(stderr, "Warning: line %d: skipping non-numeric '%s'\n",
                        line_num, token);
            }
            token = strtok(NULL, ",; \t");
        }
    }
    
    fclose(fp);
    printf("Loaded %zu values from '%s'\n", count, filename);
    return count;
}

Complete Program: Full Integration

c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#include <errno.h>
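#include <math.h>
#include <sys/types.h>  // ssize_t, used by read_line (POSIX)

#define MAX_VALUES 1000

// The definitions from Steps 1-6 (MAX_LINE, read_line, parse_double,
// Statistics, compute_stats, sort_data, print_report, load_from_file)
// go here, above main.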

int main(int argc, char *argv[]) {
    double data[MAX_VALUES];
    size_t count = 0;
    
    if (argc > 1) {
        // File mode: read from file passed as argument
        count = load_from_file(argv[1], data, MAX_VALUES);
    } else {
        // Interactive mode: read from stdin
        printf("Enter numbers (one per line), Ctrl-D to finish:\n");
        char line[MAX_LINE];
        while (count < MAX_VALUES && read_line(stdin, line, sizeof(line)) >= 0) {
            if (line[0] == '\0') continue;
            double val;
            if (parse_double(line, &val)) {
                data[count++] = val;
            } else {
                fprintf(stderr, "Invalid: '%s' — skipping\n", line);
            }
        }
    }
    
    if (count == 0) {
        fprintf(stderr, "No valid data read. Nothing to analyze.\n");
        return 1;
    }
    
    // Sort data (required for median)
    sort_data(data, count);
    
    // Compute statistics
    Statistics stats = compute_stats(data, count);
    
    // Display report
    print_report(&stats, data);
    
    return 0;
}

Compile and test:

bash
gcc -O2 -Wall processor.c -o processor -lm

# Stdin mode (piped input; printf is more portable than echo -e)
printf '10\n20\n15\n5\n25\n30\n' | ./processor

# File mode
printf '1.5, 2.3, 0.7\n4.1, 3.9, 5.2\n' > data.csv
./processor data.csv

Extension Challenges

  1. Histogram: Print an ASCII bar chart of value distribution by dividing the range into 10 bins.
  2. Percentiles: Compute P25, P75, P90, P95, P99 using the sorted array for performance benchmarking analysis.
  3. Moving average: Read a time series and output a sliding window average using a circular buffer.
  4. Multiple files: Accept multiple filenames, process each independently, then combine statistics.
  5. Output formats: Support --csv, --json, --markdown flags for different output formats.
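
As a starting point for challenge 2, here is one possible sketch of a percentile function over the sorted array. Several percentile conventions exist; this one assumes linear interpolation between the two closest ranks, and the name percentile is hypothetical.

```c
#include <stddef.h>

// Hypothetical helper: p-th percentile (0..100) of an ascending-sorted
// array, using linear interpolation between the two closest ranks.
// The caller must ensure count > 0.
double percentile(const double *sorted, size_t count, double p) {
    if (count == 1 || p <= 0.0) return sorted[0];
    if (p >= 100.0) return sorted[count - 1];

    double rank = (p / 100.0) * (double)(count - 1);  // fractional index
    size_t lo = (size_t)rank;                         // floor, since rank >= 0
    if (lo >= count - 1) return sorted[count - 1];    // guard against rounding
    double frac = rank - (double)lo;
    return sorted[lo] + frac * (sorted[lo + 1] - sorted[lo]);
}
```

With this convention, percentile(data, count, 50.0) agrees with the median computed in compute_stats for both odd and even counts.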

Phase 1 Reflection

You've successfully moved from "code" to "machine logic." Every tool you used in this project — fgets for safe input, strtod for parsing, qsort for sorting, printf for formatted output — follows the same pattern: explicit bounds, explicit types, explicit error checking.

This is the C discipline. In Phase 2, we'll leave the safety of the stack and explore the heap — where professional-scale applications are built with malloc, free, and pointer-based data structures.


Frequently Asked Questions

Q: What sorting algorithms are most practical to implement from scratch in C? Quicksort is the workhorse: average O(n log n), in-place, and fast in practice due to cache locality. Implement it with a median-of-three pivot to avoid the O(n²) worst case on already-sorted input. Merge sort is preferred when stability is required (equal elements keep their original order). For small arrays (roughly under 16 elements), insertion sort is often faster than both. The C standard does not specify which algorithm qsort() uses, and implementations differ, so do not rely on it being stable. Study its comparator function pointer pattern to understand how C achieves generic sorting.
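
Of the algorithms mentioned in this answer, insertion sort is the shortest to write by hand. A minimal sketch for an array of doubles follows; the function name is illustrative.

```c
#include <stddef.h>

// Illustrative insertion sort for doubles: ascending, stable, in place.
void insertion_sort(double *a, size_t n) {
    for (size_t i = 1; i < n; i++) {
        double key = a[i];
        size_t j = i;
        // Shift larger elements one slot right to open a gap for key.
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}
```

Because elements are shifted only when strictly greater than key, equal values keep their original order, which is what makes the sort stable.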

Q: How do you read and parse structured data from a file for sorting in C? Open the file with fopen() in mode "r", then read it line by line with fgets(). For CSV-like data, strtok() or manual pointer arithmetic splits fields by delimiter; note that strtok() modifies the buffer and treats consecutive delimiters as one, so it cannot represent empty fields. Store records in a dynamically allocated array: malloc(capacity * sizeof(Record)), doubling capacity with realloc() when full. Always check return values: fopen returns NULL on failure, and malloc/realloc return NULL on allocation failure. Close the file with fclose() when done.
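
The doubling strategy from this answer can be sketched as follows. To stay close to the project, the sketch stores doubles instead of a Record type, and the names DoubleVec and vec_push are hypothetical.

```c
#include <stdbool.h>
#include <stdlib.h>

// Hypothetical growable array of doubles (names are illustrative).
typedef struct {
    double *items;
    size_t  count;
    size_t  capacity;
} DoubleVec;

// Append one value, doubling capacity when full. Returns false if
// allocation fails; the existing contents stay valid in that case.
bool vec_push(DoubleVec *v, double value) {
    if (v->count == v->capacity) {
        size_t new_cap = v->capacity ? v->capacity * 2 : 16;
        // Assign to a temporary so the old block is not lost on failure.
        double *tmp = realloc(v->items, new_cap * sizeof *tmp);
        if (!tmp) return false;
        v->items = tmp;
        v->capacity = new_cap;
    }
    v->items[v->count++] = value;
    return true;
}
```

Start from a zero-initialized struct (DoubleVec v = {0};), since realloc on a NULL pointer behaves like malloc, and call free(v.items) when finished.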

Q: How do you pass a custom comparator to qsort() in C? Define a function with signature int cmp(const void *a, const void *b) that returns negative if a < b, zero if equal, positive if a > b. Cast the void pointers to your actual type inside: const Record *ra = (const Record *)a. Pass the function pointer as the fourth argument: qsort(array, count, sizeof(Record), cmp). For descending order, swap the return signs. For multi-key sorting, chain comparisons: compare primary key first, return secondary key comparison only if primary keys are equal.
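
A multi-key comparator along those lines might look like the sketch below. The Record layout (a name and a score) is a hypothetical example; the ordering chosen here is score descending, then name ascending to break ties.

```c
#include <stdlib.h>
#include <string.h>

// Hypothetical record type for illustration only.
typedef struct {
    char   name[32];
    double score;
} Record;

// Sort by score descending, then by name ascending to break ties.
int compare_records(const void *a, const void *b) {
    const Record *ra = (const Record *)a;
    const Record *rb = (const Record *)b;
    if (ra->score != rb->score) {
        // Operands swapped relative to ascending order: higher score first.
        return (ra->score < rb->score) - (ra->score > rb->score);
    }
    return strcmp(ra->name, rb->name);
}
```

It is passed to qsort the same way as compare_doubles: qsort(records, count, sizeof(Record), compare_records).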


Part of the C Mastery Course — 30 modules from C basics to expert systems engineering.