Project: Building a Fast CLI Data Processor in C (Phase 1 Capstone)

Phase 1 Capstone. You've learned types, variables, control flow, functions, and arrays. Now build something real — a command-line tool that reads numeric data from stdin or a file, computes descriptive statistics, sorts the data, and produces a formatted report. This tool uses every concept from Phase 1 and produces a useful, real-world program.
Table of Contents
- Project Scope and Goals
- Architecture: The Pipeline Pattern
- Step 1: Safe Input Handling with fgets
- Step 2: Parsing and Validating Numbers
- Step 3: Statistical Computations
- Step 4: Sorting with qsort
- Step 5: Formatted Report Output
- Step 6: File Input Mode
- Complete Program: Full Integration
- Extension Challenges
- Phase 1 Reflection
Project Scope and Goals
Your CLI data processor will:
- Read up to 1,000 integers or floating-point numbers from stdin or a file.
- Compute: count, min, max, sum, mean (average), median, variance, standard deviation.
- Sort the dataset and display it.
- Print a formatted ASCII report table.
- Handle invalid input gracefully (skip non-numeric lines, report errors).
- Process a CSV file in a single pass using fgets.
Architecture: The Pipeline Pattern
Step 1: Safe Input Handling with fgets
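Never use scanf("%f", &x) or gets() for user input. gets() cannot limit how much it writes into the buffer and was removed from the language in C11; scanf leaves the offending characters in the stream when a conversion fails, which makes recovery awkward. Use fgets for all line-based input, as the read_line helper in this step does.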
#include <stdio.h>
#include <string.h>
#include <sys/types.h>  // ssize_t (POSIX)

#define MAX_LINE 256

// Read one line from fp and strip the trailing newline.
// Returns: number of chars kept in buf, or -1 on EOF/error.
// Lines longer than bufsize - 1 chars are returned in pieces.
ssize_t read_line(FILE *fp, char *buf, size_t bufsize) {
    if (!fgets(buf, (int)bufsize, fp)) return -1;
    size_t len = strlen(buf);
    // Strip trailing newline and/or carriage return
    while (len > 0 && (buf[len-1] == '\n' || buf[len-1] == '\r')) {
        buf[--len] = '\0';
    }
    return (ssize_t)len;
}
Step 2: Parsing and Validating Numbers
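atof() returns 0.0 for any string it cannot parse, so a failed conversion is indistinguishable from a genuine zero. Use strtod() instead; its end pointer shows how much of the string was consumed, which makes invalid input detectable.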
#include <stdbool.h>
#include <stdlib.h>
#include <errno.h>

// Parse a string as a double. Returns true and sets *out on success.
bool parse_double(const char *str, double *out) {
    if (!str || *str == '\0') return false;
    char *endptr;
    errno = 0;
    double val = strtod(str, &endptr);
    if (endptr == str) return false;    // Nothing converted (e.g. "abc" or "   ")
    // Must have consumed all characters (or just trailing whitespace)
    while (*endptr == ' ' || *endptr == '\t') endptr++;
    if (*endptr != '\0') return false;  // Trailing non-numeric chars
    if (errno == ERANGE) return false;  // Overflow/underflow
    *out = val;
    return true;
}
Step 3: Statistical Computations
#include <math.h>
#include <stddef.h>

typedef struct {
    size_t count;
    double min;
    double max;
    double sum;
    double mean;
    double median;    // Requires sorted data
    double variance;
    double std_dev;
} Statistics;

// data must already be sorted in ascending order (needed for the median).
Statistics compute_stats(const double *data, size_t count) {
    Statistics s = { .count = count };
    if (count == 0) return s;
    // Single pass: min, max, sum
    s.min = s.max = data[0];
    s.sum = 0.0;
    for (size_t i = 0; i < count; i++) {
        if (data[i] < s.min) s.min = data[i];
        if (data[i] > s.max) s.max = data[i];
        s.sum += data[i];
    }
    s.mean = s.sum / (double)count;
    // Variance, E[(X - mean)^2], by the two-pass method: the mean is
    // computed first, then the squared deviations from it are summed.
    double m2 = 0.0;
    for (size_t i = 0; i < count; i++) {
        double diff = data[i] - s.mean;
        m2 += diff * diff;
    }
    s.variance = m2 / (double)count;  // Population variance
    s.std_dev = sqrt(s.variance);
    // Median (valid only because the caller sorted the data first)
    if (count % 2 == 1) {
        s.median = data[count / 2];   // Middle element
    } else {
        s.median = (data[count/2 - 1] + data[count/2]) / 2.0;  // Average of middle two
    }
    return s;
}
Step 4: Sorting with qsort
#include <stdlib.h>

// Comparator for qsort (ascending order)
int compare_doubles(const void *a, const void *b) {
    double da = *(const double*)a;
    double db = *(const double*)b;
    return (da > db) - (da < db);  // Branchless: returns -1, 0, or +1
}

void sort_data(double *data, size_t count) {
    qsort(data, count, sizeof(double), compare_doubles);
}
Step 5: Formatted Report Output
#include <stdio.h>

void print_report(const Statistics *s, const double *sorted_data) {
    printf("\n");
    printf("╔══════════════════════════════════════╗\n");
    printf("║         Data Analysis Report         ║\n");
    printf("╠══════════════════════════════════════╣\n");
    printf("║  Count:         %20zu ║\n", s->count);
    printf("║  Minimum:       %20.4f ║\n", s->min);
    printf("║  Maximum:       %20.4f ║\n", s->max);
    printf("║  Sum:           %20.4f ║\n", s->sum);
    printf("║  Mean:          %20.4f ║\n", s->mean);
    printf("║  Median:        %20.4f ║\n", s->median);
    printf("║  Variance:      %20.4f ║\n", s->variance);
    printf("║  Std Deviation: %20.4f ║\n", s->std_dev);
    printf("╚══════════════════════════════════════╝\n");
    if (s->count <= 20 && sorted_data) {
        printf("\nSorted Data: ");
        for (size_t i = 0; i < s->count; i++) {
            printf("%.2f", sorted_data[i]);
            if (i + 1 < s->count) printf(", ");
        }
        printf("\n");
    }
}
Step 6: File Input Mode
// Read all numbers from a file (CSV or one-per-line)
size_t load_from_file(const char *filename, double *data, size_t max_count) {
    FILE *fp = fopen(filename, "r");
    if (!fp) { perror(filename); return 0; }
    size_t count = 0;
    char line[MAX_LINE];
    int line_num = 0;
    while (count < max_count && read_line(fp, line, sizeof(line)) >= 0) {
        line_num++;
        if (line[0] == '\0' || line[0] == '#') continue;  // Skip empty/comment
        // Handle comma-separated values on one line
        char *token = strtok(line, ",; \t");
        while (token && count < max_count) {
            double val;
            if (parse_double(token, &val)) {
                data[count++] = val;
            } else {
                fprintf(stderr, "Warning: line %d: skipping non-numeric '%s'\n",
                        line_num, token);
            }
            token = strtok(NULL, ",; \t");
        }
    }
    fclose(fp);
    printf("Loaded %zu values from '%s'\n", count, filename);
    return count;
}
Complete Program: Full Integration
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
#include <errno.h>
#include <math.h>
#include <sys/types.h>  // ssize_t (POSIX)

#define MAX_VALUES 1000

// The definitions from Steps 1-6 (MAX_LINE, read_line, parse_double,
// Statistics, compute_stats, compare_doubles, sort_data, print_report,
// load_from_file) go here, above main.

int main(int argc, char *argv[]) {
    double data[MAX_VALUES];
    size_t count = 0;
    if (argc > 1) {
        // File mode: read from file passed as argument
        count = load_from_file(argv[1], data, MAX_VALUES);
    } else {
        // Stdin mode: each line is parsed as a single number
        printf("Enter numbers (one per line), Ctrl-D to finish:\n");
        char line[MAX_LINE];
        while (count < MAX_VALUES && read_line(stdin, line, sizeof(line)) >= 0) {
            if (line[0] == '\0') continue;
            double val;
            if (parse_double(line, &val)) {
                data[count++] = val;
            } else {
                fprintf(stderr, "Invalid: '%s' (skipping)\n", line);
            }
        }
    }
    if (count == 0) {
        fprintf(stderr, "No valid data read. Nothing to analyze.\n");
        return 1;
    }
    // Sort data (required for median)
    sort_data(data, count);
    // Compute statistics
    Statistics stats = compute_stats(data, count);
    // Display report
    print_report(&stats, data);
    return 0;
}
Compile and test:
gcc -O2 -Wall processor.c -o processor -lm
# Stdin mode (piped input)
printf '10\n20\n15\n5\n25\n30\n' | ./processor
# File mode
printf '1.5, 2.3, 0.7\n4.1, 3.9, 5.2\n' > data.csv
./processor data.csv
Extension Challenges
- Histogram: Print an ASCII bar chart of value distribution by dividing the range into 10 bins.
- Percentiles: Compute P25, P75, P90, P95, P99 using the sorted array for performance benchmarking analysis.
- Moving average: Read a time series and output a sliding window average using a circular buffer.
- Multiple files: Accept multiple filenames, process each independently, then combine statistics.
- Output formats: Support --csv, --json, and --markdown flags for different output formats.
Phase 1 Reflection
You've successfully moved from "code" to "machine logic." Every tool you used in this project — fgets for safe input, strtod for parsing, qsort for sorting, printf for formatted output — follows the same pattern: explicit bounds, explicit types, explicit error checking.
This is the C discipline. In Phase 2, we'll leave the safety of the stack and explore the heap — where professional-scale applications are built with malloc, free, and pointer-based data structures.
Frequently Asked Questions
Q: What sorting algorithms are most practical to implement from scratch in C?
Quicksort is the workhorse: average O(n log n), in-place, and fast in practice due to cache locality. Implement it with a median-of-three pivot to avoid the worst-case O(n²) behaviour on already sorted input. Merge sort is preferred when stability is required (equal elements maintain their original order). For small arrays (under roughly 16 elements), insertion sort typically outperforms both. The C standard does not specify which algorithm qsort() uses, and implementations differ; study its compar function pointer pattern to understand how C achieves generic sorting.
Q: How do you read and parse structured data from a file for sorting in C?
Use fopen() with mode "r", then fgets() or fscanf() to read line-by-line. For CSV-like data, strtok() or manual pointer arithmetic splits fields by delimiter. Store records in a dynamically allocated array: malloc(capacity * sizeof(Record)), doubling capacity with realloc() when full. Always check return values — fopen returns NULL on failure, malloc/realloc return NULL on allocation failure. Close the file with fclose() when done.
Q: How do you pass a custom comparator to qsort() in C?
Define a function with signature int cmp(const void *a, const void *b) that returns negative if a < b, zero if equal, positive if a > b. Cast the void pointers to your actual type inside: const Record *ra = (const Record *)a. Pass the function pointer as the fourth argument: qsort(array, count, sizeof(Record), cmp). For descending order, swap the return signs. For multi-key sorting, chain comparisons: compare primary key first, return secondary key comparison only if primary keys are equal.
Part of the C Mastery Course — 30 modules from C basics to expert systems engineering.
