
Project: Build a Data Analyst Agent — CSV Insights in Plain English with Claude


Most data insight questions are simple. "What was our best sales month?" "Which product category has the highest return rate?" "Show me the top ten customers by revenue." But answering them means opening a spreadsheet, writing a formula or a query, and waiting.

A data analyst agent backed by Claude changes that equation. Upload a CSV, ask in plain English, and the agent writes and executes the analysis code, then explains the result in plain language — no SQL, no pandas knowledge required.

This project builds that agent: Claude as the reasoning brain, Python execution as the tool, and an agentic loop that iterates until the question is fully answered.


Architecture

The agent has three components:

  1. Data loader: Read CSV into a pandas DataFrame and generate a schema summary Claude can reference
  2. Code executor tool: A sandboxed Python execution environment Claude calls to run analysis code
  3. Agentic loop: Claude interprets the question, writes code, observes results, and may run additional code until satisfied

The security-critical element is the code executor. Claude generates Python code; you must never eval() it in your production environment without restrictions. This implementation uses a subprocess sandbox with a strict allowlist.
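String matching (as used in the implementation below) can be bypassed by aliased or dynamically constructed imports. A more robust first line of defence is to parse the code and walk its AST, rejecting any import whose root module is outside the allowlist. The following is a minimal sketch of that idea; `check_imports` is an illustrative helper, not part of the implementation below:

```python
import ast

ALLOWED_MODULES = {"pandas", "numpy", "json", "math", "statistics",
                   "collections", "itertools", "datetime", "re"}

def check_imports(code: str) -> tuple[bool, str]:
    """Parse the code and reject any import outside the allowlist."""
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return False, f"Syntax error: {e}"

    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                root = alias.name.split(".")[0]
                if root not in ALLOWED_MODULES:
                    return False, f"Import not allowed: {alias.name}"
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports ("from . import x");
            # treat those as disallowed too
            root = (node.module or "").split(".")[0]
            if root not in ALLOWED_MODULES:
                return False, f"Import not allowed: {node.module}"

    return True, "OK"
```

Unlike substring checks, this catches `import socket as s` and `from urllib import request` regardless of spacing or casing, though it still cannot stop every escape route on its own, which is why subprocess isolation remains essential.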


Complete Implementation

```python
import anthropic
import pandas as pd
import io
import sys
import subprocess
import json
import tempfile
import os
from pathlib import Path

client = anthropic.Anthropic()


# ─── Data Loading ──────────────────────────────────────────────────────────────

def load_csv(file_path: str) -> tuple[pd.DataFrame, str]:
    """
    Load a CSV and return the DataFrame and a schema description for Claude.
    """
    df = pd.read_csv(file_path)

    # Build schema summary
    schema_lines = [
        f"File: {Path(file_path).name}",
        f"Rows: {len(df):,}",
        f"Columns: {len(df.columns)}",
        "",
        "Column schema:"
    ]

    for col in df.columns:
        dtype = str(df[col].dtype)
        nulls = df[col].isna().sum()

        if pd.api.types.is_numeric_dtype(df[col]):
            summary = f"min={df[col].min():.2f}, max={df[col].max():.2f}, mean={df[col].mean():.2f}"
        elif pd.api.types.is_datetime64_any_dtype(df[col]):
            summary = f"from {df[col].min()} to {df[col].max()}"
        else:
            n_unique = df[col].nunique()
            top_values = df[col].value_counts().head(3).to_dict()
            summary = f"{n_unique} unique values, top: {top_values}"

        null_note = f", {nulls} nulls" if nulls > 0 else ""
        schema_lines.append(f"  - {col} ({dtype}{null_note}): {summary}")

    # Sample rows
    schema_lines.append("")
    schema_lines.append("First 3 rows:")
    schema_lines.append(df.head(3).to_string())

    return df, "\n".join(schema_lines)


def save_csv_for_sandbox(df: pd.DataFrame, temp_dir: str) -> str:
    """Save DataFrame as CSV for the sandbox to read."""
    path = os.path.join(temp_dir, "data.csv")
    df.to_csv(path, index=False)
    return path


# ─── Sandboxed Code Executor ───────────────────────────────────────────────────

ALLOWED_IMPORTS = {
    "pandas", "numpy", "json", "math", "statistics",
    "collections", "itertools", "datetime", "re"
}


def validate_code(code: str) -> tuple[bool, str]:
    """
    Basic static validation of generated code.
    Blocks dangerous operations before execution.
    """
    forbidden_patterns = [
        "import os", "import sys", "import subprocess", "import socket",
        "import shutil", "__import__", "open(", "eval(", "exec(",
        "compile(", "globals()", "locals()", "getattr", "setattr",
        "delattr", "vars(", "dir(", "__class__", "__bases__",
        "import requests", "import urllib", "import http"
    ]

    code_lower = code.lower()
    for pattern in forbidden_patterns:
        if pattern.lower() in code_lower:
            return False, f"Forbidden pattern detected: '{pattern}'"

    return True, "OK"


def execute_code_sandbox(code: str, data_csv_path: str) -> dict:
    """
    Execute analysis code in a subprocess sandbox.

    The code receives a 'df' variable (the loaded DataFrame) and should
    print results to stdout in JSON format.
    """
    # Validate before execution
    is_safe, reason = validate_code(code)
    if not is_safe:
        return {
            "success": False,
            "error": f"Code validation failed: {reason}",
            "output": None
        }

    # Wrap the code with DataFrame loading preamble and output capture
    wrapped_code = f"""
import pandas as pd
import numpy as np
import json
import math
import statistics
from collections import Counter

# Load the data
df = pd.read_csv("{data_csv_path}")

# User code starts here
{code}
"""

    # Write to temp file and execute in subprocess
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False,
                                     encoding="utf-8") as f:
        f.write(wrapped_code)
        script_path = f.name

    try:
        result = subprocess.run(
            [sys.executable, script_path],
            capture_output=True,
            text=True,
            timeout=30,  # 30 second limit
            env={
                "PATH": os.environ.get("PATH", ""),
                "PYTHONPATH": os.environ.get("PYTHONPATH", ""),
                "HOME": os.environ.get("HOME", "")
            }
        )

        if result.returncode == 0:
            return {
                "success": True,
                "output": result.stdout.strip(),
                "error": None
            }
        else:
            return {
                "success": False,
                "output": result.stdout.strip() or None,
                "error": result.stderr.strip()
            }

    except subprocess.TimeoutExpired:
        return {
            "success": False,
            "output": None,
            "error": "Code execution timed out after 30 seconds."
        }

    finally:
        # Always clean up the temp file
        if os.path.exists(script_path):
            os.unlink(script_path)


# ─── Tool Definitions ──────────────────────────────────────────────────────────

ANALYST_TOOLS = [
    {
        "name": "run_analysis_code",
        "description": """Execute Python code to analyse the dataset.

The code has access to a pandas DataFrame called 'df' containing the full dataset.
Use print() to output results. Output will be returned to you for interpretation.

IMPORTANT:
- Always print your results, not just compute them
- For numbers, round to 2 decimal places
- For DataFrames, use .to_string() or .to_dict()
- Do not import any modules not in the standard data science stack
""",
        "input_schema": {
            "type": "object",
            "properties": {
                "code": {
                    "type": "string",
                    "description": "Valid Python code to execute against the DataFrame df"
                },
                "description": {
                    "type": "string",
                    "description": "One-line description of what this code does"
                }
            },
            "required": ["code", "description"]
        }
    }
]


# ─── Data Analyst Agent ────────────────────────────────────────────────────────

class DataAnalystAgent:
    """Conversational data analyst backed by Claude."""

    def __init__(self, csv_path: str, model: str = "claude-sonnet-4-6"):
        self.model = model
        self.df, self.schema = load_csv(csv_path)
        self.temp_dir = tempfile.mkdtemp()
        self.data_path = save_csv_for_sandbox(self.df, self.temp_dir)
        self.conversation_history = []

        print("Loaded dataset:")
        print(self.schema)
        print("\nData analyst agent ready. Ask me anything about this data.\n")

    def _build_system_prompt(self) -> str:
        return f"""You are an expert data analyst. You have access to a Python execution tool
that lets you run pandas and numpy code against a dataset.

DATASET SCHEMA:
{self.schema}

Approach:
1. Understand the user's question
2. Write precise Python code to answer it
3. Observe the output and interpret it clearly
4. If the first code run is incomplete, run additional code to refine
5. Give a clear, jargon-free answer citing the numbers you found

Always explain what the numbers mean, not just what they are."""

    def ask(self, question: str) -> str:
        """
        Ask a question about the dataset.
        Returns Claude's final natural language answer.
        """
        self.conversation_history.append({
            "role": "user",
            "content": question
        })

        max_iterations = 8
        iteration = 0

        while iteration < max_iterations:
            iteration += 1

            response = client.messages.create(
                model=self.model,
                max_tokens=4096,
                system=self._build_system_prompt(),
                tools=ANALYST_TOOLS,
                messages=self.conversation_history
            )

            # Collect tool calls and text
            tool_calls = []
            text_blocks = []

            for block in response.content:
                if block.type == "tool_use":
                    tool_calls.append(block)
                elif block.type == "text":
                    text_blocks.append(block.text)

            # Add assistant response to history
            self.conversation_history.append({
                "role": "assistant",
                "content": response.content
            })

            # If no tool calls, Claude is done
            if response.stop_reason == "end_turn" or not tool_calls:
                answer = "\n".join(text_blocks)
                return answer

            # Execute all tool calls and collect results
            tool_results = []

            for tool_call in tool_calls:
                code = tool_call.input.get("code", "")
                description = tool_call.input.get("description", "")

                print(f"  Executing: {description}")

                result = execute_code_sandbox(code, self.data_path)

                if result["success"]:
                    content = result["output"] or "(No output produced)"
                else:
                    content = f"Error: {result['error']}"
                    if result["output"]:
                        content += f"\nPartial output: {result['output']}"

                tool_results.append({
                    "type": "tool_result",
                    "tool_use_id": tool_call.id,
                    "content": content
                })

            # Add tool results to history
            self.conversation_history.append({
                "role": "user",
                "content": tool_results
            })

        return "Analysis reached maximum iterations. Please try a more specific question."

    def run_interactive(self):
        """Run an interactive analysis session."""
        print("Type your question or 'quit' to exit.\n")

        while True:
            question = input("You: ").strip()
            if question.lower() in ("quit", "exit"):
                break
            if question:
                answer = self.ask(question)
                print(f"\nAnalyst: {answer}\n")
```

Code Execution Security

Never run Claude-generated code directly in your main process. This implementation uses subprocess isolation with a 30-second timeout, a static blocklist check, and a restricted environment that strips most environment variables. For production deployments, consider container-based sandboxing (such as Docker or gVisor) or a dedicated code execution service for stronger isolation.


Example Session

```python
if __name__ == "__main__":
    # Point to your CSV file
    agent = DataAnalystAgent("sales_data.csv")

    # Single question mode
    print(agent.ask("Which product category generated the most revenue last quarter?"))
    print(agent.ask("What is the average order value and how does it vary by region?"))

    # Or run an interactive session
    # agent.run_interactive()
```

Sample output for a sales dataset:

```
You: Which month had the highest total sales?
  Executing: Calculate monthly sales totals and find the maximum
  Executing: Format monthly breakdown for top 3 months

Analyst: March 2025 was your strongest month with £847,230 in total sales — about
23% above average monthly revenue of £689,500. The top three months were:
  - March 2025: £847,230
  - October 2024: £812,450
  - November 2024: £798,100
There's a clear seasonal pattern: autumn/early spring tends to outperform summer
months in this dataset.
```

Extending to Charts

Claude can generate matplotlib chart code too. Add a matplotlib chart generation tool that saves a PNG to a temp path and returns the path, then display or serve it. For a web interface, pair this agent with a FastAPI backend and a React frontend that renders the chart images alongside Claude's analysis text.
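A minimal sketch of what that chart tool's handler could look like, assuming matplotlib is installed (`render_chart` and its parameters are hypothetical names for illustration, not part of the implementation above):

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend: render straight to file, no display needed
import matplotlib.pyplot as plt
import pandas as pd

def render_chart(df: pd.DataFrame, x: str, y: str, kind: str = "bar") -> str:
    """Render a simple chart of df[y] against df[x] and return the PNG path."""
    fig, ax = plt.subplots(figsize=(8, 4.5))
    df.plot(x=x, y=y, kind=kind, ax=ax, legend=False)
    ax.set_title(f"{y} by {x}")
    path = os.path.join(tempfile.mkdtemp(), "chart.png")
    fig.savefig(path, dpi=120, bbox_inches="tight")
    plt.close(fig)  # release the figure; long-running agents leak memory otherwise
    return path
```

The tool's result back to Claude would be the returned path (or a URL), so Claude can reference the chart in its answer while the frontend serves the image.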


Summary

This data analyst agent turns any CSV into an interactive analytics tool — no SQL, no pandas expertise required from the end user.

  • The schema summary given to Claude at startup prevents hallucinated column names and ensures accurate queries
  • Subprocess sandboxing is non-negotiable: Claude-generated code must never execute with elevated privileges
  • The agentic loop with max_iterations prevents infinite loops while allowing multi-step analyses
  • Conversation history enables follow-up questions that reference previous results naturally

Next project: Deploy Claude on AWS Bedrock — A Production Setup Guide.


This post is part of the Anthropic AI Tutorial Series. Previous post: Project: Build an AI-Powered IT Incident Report Generator.