Code Execution Sandbox Design

Phase 4.7 - gVisor-Based Code Execution

This document describes the architecture for secure code execution using gVisor sandboxing, providing stronger isolation than Docker containers alone while maintaining practical usability for AI agent workflows.

Overview

The code execution sandbox allows AI agents to run arbitrary code (Python, JavaScript, shell scripts) in a highly isolated environment with:

gVisor kernel isolation - Application kernel in userspace, limiting host kernel exposure
Air-gapped by default - No network access unless explicitly enabled
Resource constraints - CPU, memory, disk, and time limits
Filesystem isolation - Temporary workspace, no host filesystem access
Optional package installation - Allowlisted registries (PyPI, npm) when needed
Multi-language support - Python, Node.js, shell scripts
HITL integration - Risk-based approval for sensitive operations

Architecture

System Diagram

┌─────────────────────────────────────────────────────┐
│  Agent (via MCP Gateway)                            │
│  Requests code execution via code_execute tool      │
└─────────────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────────┐
│  HITL Gate (Risk Classifier)                        │
│  Classifies operation risk (LOW/MED/HIGH/CRITICAL)  │
│  Requires approval for HIGH+ operations             │
└─────────────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────────┐
│  SandboxManager                                     │
│  - Creates gVisor sandbox container                 │
│  - Injects code into isolated workspace             │
│  - Configures resource limits                       │
│  - Optionally enables network (allowlisted domains) │
│  - Captures stdout/stderr/exit_code                 │
└─────────────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────────┐
│  gVisor Sandbox (runsc runtime)                     │
│                                                     │
│  ┌───────────────────────────────────────────────┐ │
│  │ Application Kernel (Go, userspace)            │ │
│  │ - Intercepts syscalls                         │ │
│  │ - Virtualizes devices                         │ │
│  │ - Isolates from host kernel                   │ │
│  └───────────────────────────────────────────────┘ │
│                                                     │
│  ┌───────────────────────────────────────────────┐ │
│  │ Code Execution Environment                    │ │
│  │ - Python 3.11+ / Node.js 20+ / Bash           │ │
│  │ - Temporary workspace: /workspace             │ │
│  │ - No host filesystem access                   │ │
│  │ - Network: disabled (or allowlisted domains)  │ │
│  └───────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
                    │
                    ▼
┌─────────────────────────────────────────────────────┐
│  Audit Logger                                       │
│  - Logs all code execution requests                 │
│  - Records approval decisions                       │
│  - Tracks execution results and errors              │
└─────────────────────────────────────────────────────┘

Components

1. SandboxManager

Purpose: Manages gVisor sandbox lifecycle and code execution.

Responsibilities:

Create and configure gVisor sandbox containers
Inject code into isolated workspace
Apply resource limits (CPU, memory, disk, time)
Configure network isolation (disabled by default)
Execute code and capture results
Cleanup sandbox after execution

Key Methods:

from __future__ import annotations  # Result types are defined elsewhere


class SandboxManager:
    """Interface sketch; method bodies are omitted."""

    async def create_sandbox(
        self,
        language: str,
        sandbox_id: str | None = None,
        network_enabled: bool = False,
        allowed_domains: list[str] | None = None,
    ) -> str: ...  # Returns the sandbox ID

    async def execute_code(
        self,
        sandbox_id: str,
        code: str,
        timeout: int = 30,
        max_memory_mb: int = 512,
        max_output_bytes: int = 1_048_576,
    ) -> ExecutionResult: ...  # stdout, stderr, exit_code

    async def install_package(
        self,
        sandbox_id: str,
        package: str,
        registry: str = "pypi",
    ) -> InstallResult: ...

    async def write_file(
        self,
        sandbox_id: str,
        file_path: str,
        content: str,
    ) -> WriteResult: ...

    async def read_file(
        self,
        sandbox_id: str,
        file_path: str,
    ) -> ReadResult: ...

    async def list_files(
        self,
        sandbox_id: str,
        path: str = "/workspace",
    ) -> ListResult: ...

    async def destroy_sandbox(
        self,
        sandbox_id: str,
    ) -> None: ...
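
The cleanup responsibility is easy to miss when execution fails midway. One possible way to guarantee it is an async context manager around create_sandbox and destroy_sandbox. This is only a sketch: sandbox_session and the in-memory FakeSandboxManager are hypothetical names, introduced so the example runs without Docker.

```python
import asyncio
import uuid
from contextlib import asynccontextmanager


class FakeSandboxManager:
    """In-memory stand-in (hypothetical) exposing two SandboxManager methods."""

    def __init__(self) -> None:
        self.active: set[str] = set()

    async def create_sandbox(self, language: str) -> str:
        sandbox_id = f"{language}-{uuid.uuid4().hex[:8]}"
        self.active.add(sandbox_id)
        return sandbox_id

    async def destroy_sandbox(self, sandbox_id: str) -> None:
        self.active.discard(sandbox_id)


@asynccontextmanager
async def sandbox_session(manager, language: str):
    """Yield a sandbox ID and always destroy the sandbox afterwards."""
    sandbox_id = await manager.create_sandbox(language=language)
    try:
        yield sandbox_id
    finally:
        # Runs on success, on exceptions, and on cancellation.
        await manager.destroy_sandbox(sandbox_id)


async def demo() -> bool:
    manager = FakeSandboxManager()
    try:
        async with sandbox_session(manager, "python") as sandbox_id:
            assert sandbox_id in manager.active
            raise RuntimeError("simulated execution failure")
    except RuntimeError:
        pass
    # The sandbox was destroyed even though the body raised.
    return len(manager.active) == 0
```

Calling asyncio.run(demo()) exercises the failure path and reports whether the sandbox was cleaned up.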

2. Code Execution Tools

Purpose: MCP-compatible tools for agent code execution.

Tools:

code_execute - Execute code in sandbox
code_install_package - Install packages from registries
code_write_file - Write files to workspace
code_read_file - Read files from workspace
code_list_files - List files in workspace
code_destroy_sandbox - Cleanup sandbox

Example Tool Schema:

{
    "name": "code_execute",
    "description": "Execute code in isolated gVisor sandbox",
    "inputSchema": {
        "type": "object",
        "properties": {
            "language": {
                "type": "string",
                "enum": ["python", "javascript", "shell"],
                "description": "Programming language"
            },
            "code": {
                "type": "string",
                "description": "Code to execute"
            },
            "timeout": {
                "type": "integer",
                "default": 30,
                "description": "Execution timeout in seconds"
            },
            "network_enabled": {
                "type": "boolean",
                "default": false,
                "description": "Enable network access (requires approval)"
            },
            "allowed_domains": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Allowlisted domains (when network enabled)"
            }
        },
        "required": ["language", "code"]
    }
}
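
Before a request reaches the HITL gate, the gateway would presumably check the arguments against this schema. The following is a minimal hand-written sketch of that check using only the standard library. The function name validate_code_execute_args is an assumption, and a real implementation might use a JSON Schema library instead.

```python
ALLOWED_LANGUAGES = {"python", "javascript", "shell"}


def validate_code_execute_args(args: dict) -> dict:
    """Check arguments against the code_execute schema and fill in defaults."""
    for field in ("language", "code"):
        if field not in args:
            raise ValueError(f"missing required field: {field}")
    if args["language"] not in ALLOWED_LANGUAGES:
        raise ValueError(f"unsupported language: {args['language']!r}")
    if not isinstance(args["code"], str):
        raise ValueError("code must be a string")

    timeout = args.get("timeout", 30)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(timeout, bool) or not isinstance(timeout, int):
        raise ValueError("timeout must be an integer")

    network_enabled = args.get("network_enabled", False)
    if not isinstance(network_enabled, bool):
        raise ValueError("network_enabled must be a boolean")

    allowed_domains = args.get("allowed_domains", [])
    if not isinstance(allowed_domains, list) or not all(
        isinstance(d, str) for d in allowed_domains
    ):
        raise ValueError("allowed_domains must be a list of strings")

    return {
        "language": args["language"],
        "code": args["code"],
        "timeout": timeout,
        "network_enabled": network_enabled,
        "allowed_domains": allowed_domains,
    }
```

The returned dictionary always carries the schema defaults, so later stages do not need to repeat them.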

3. Risk Classification Rules

Purpose: HITL integration for code execution operations.

Risk Levels:

CRITICAL (30s timeout):

Code execution with network enabled
Installing packages from non-standard registries
Code containing dangerous patterns (rm -rf, curl | sh, eval, exec)

HIGH (60s timeout):

Any code execution (default)
Installing packages from standard registries
Writing executable files

MEDIUM (120s timeout):

Reading/writing non-executable files
Listing files in workspace

LOW (auto-approved):

Destroying sandbox (cleanup)
Reading sandbox metadata

Example Rules:

HITLRule(
    tools=["code_execute"],
    risk=RiskLevel.CRITICAL,
    conditions=[
        {"param": "network_enabled", "equals": True}
    ],
    timeout=30,
    description="Code execution with network access"
),
HITLRule(
    tools=["code_execute"],
    risk=RiskLevel.CRITICAL,
    conditions=[
        {"param": "code", "matches": r"(?i)(rm\s+-rf|curl.*\|\s*sh|eval\(|exec\()"}
    ],
    timeout=30,
    description="Dangerous code patterns detected"
),
HITLRule(
    tools=["code_execute"],
    risk=RiskLevel.HIGH,
    require_approval=True,
    timeout=60,
    description="Code execution in sandbox"
),
HITLRule(
    tools=["code_install_package"],
    risk=RiskLevel.HIGH,
    timeout=60,
    description="Package installation"
),
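
To make the rule semantics concrete, here is a sketch of how such rules might be evaluated. The HITLRule and RiskLevel definitions below are simplified stand-ins, not the project's actual classes. Two behaviours are assumptions: the highest matching risk wins, and a tool with no matching rule fails closed as CRITICAL.

```python
import re
from dataclasses import dataclass, field
from enum import IntEnum


class RiskLevel(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass
class HITLRule:
    tools: list[str]
    risk: RiskLevel
    conditions: list[dict] = field(default_factory=list)
    timeout: int = 60
    description: str = ""


RULES = [
    HITLRule(["code_execute"], RiskLevel.CRITICAL,
             [{"param": "network_enabled", "equals": True}], timeout=30),
    HITLRule(["code_execute"], RiskLevel.CRITICAL,
             [{"param": "code",
               "matches": r"(?i)(rm\s+-rf|curl.*\|\s*sh|eval\(|exec\()"}], timeout=30),
    HITLRule(["code_execute"], RiskLevel.HIGH),
    HITLRule(["code_install_package"], RiskLevel.HIGH),
]


def condition_holds(condition: dict, params: dict) -> bool:
    """Evaluate one 'equals' or 'matches' condition against tool parameters."""
    value = params.get(condition["param"])
    if "equals" in condition:
        return value == condition["equals"]
    if "matches" in condition:
        return isinstance(value, str) and re.search(condition["matches"], value) is not None
    return False


def classify(rules: list[HITLRule], tool: str, params: dict) -> RiskLevel:
    """Return the highest risk among matching rules (assumed semantics)."""
    matched = [
        rule.risk
        for rule in rules
        if tool in rule.tools
        and all(condition_holds(c, params) for c in rule.conditions)
    ]
    # Assumption: a tool with no matching rule is treated as CRITICAL.
    return max(matched, default=RiskLevel.CRITICAL)
```

With these rules, plain code is classified HIGH, while the same call with network_enabled=True or a matching dangerous pattern is classified CRITICAL.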

gVisor Integration

Why gVisor?

Traditional Docker containers share the host kernel, which exposes a large attack surface (300+ syscalls). gVisor provides:

Application kernel in userspace - Written in memory-safe Go
Syscall interception - Limits host kernel exposure to ~70 syscalls
No VM overhead - Faster startup than VMs, lighter than Kata Containers
Production-ready - Used by Google GKE, maintained by Google

Security Comparison:

Feature	Docker	Docker + gVisor	VM (Firecracker)
Kernel isolation	❌ Shared	✅ Isolated	✅ Isolated
Syscall filtering	⚠️ seccomp	✅ Application kernel	✅ Full VM
Startup time	~100ms	~500ms	~1s
Memory overhead	~10MB	~50MB	~150MB
I/O performance	Excellent	Good	Fair

Installation

Installing gVisor:

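# Download runsc
wget https://storage.googleapis.com/gvisor/releases/release/latest/$(uname -m)/runsc
chmod +x runsc
sudo mv runsc /usr/local/bin/

# Configure Docker to use runsc (registers the runtime in /etc/docker/daemon.json)
sudo runsc install

# Reload Docker so it picks up the new runtime (assumes systemd)
sudo systemctl reload docker

# Verify installation
docker run --runtime=runsc --rm hello-world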

Docker Configuration (/etc/docker/daemon.json):

{
  "runtimes": {
    "runsc": {
      "path": "/usr/local/bin/runsc",
      "runtimeArgs": ["--network=none"]
    }
  }
}

Reference: gVisor Docker Quick Start

Docker Integration

Creating gVisor Sandbox:

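# Create container with runsc runtime.
# Keyword arguments follow docker-py's containers.create(); docker_client is
# assumed to be an async wrapper, since docker-py itself is synchronous.
container = await docker_client.containers.create(
    image="harombe/sandbox-python:3.11",  # Image name as in the YAML configuration
    runtime="runsc",  # Use gVisor
    command=["python", "/workspace/script.py"],
    network_mode="none",  # Air-gapped
    mem_limit="512m",
    cpu_period=100000,
    cpu_quota=50000,  # 50% of 1 CPU
    volumes={
        temp_workspace: {
            "bind": "/workspace",
            "mode": "rw"
        }
    },
    working_dir="/workspace",
    auto_remove=True,  # Auto-cleanup on exit (remove= is only accepted by run())
)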

With Network (Optional):

# Create custom network with egress filtering
network = await docker_client.networks.create(
    name=f"sandbox-{sandbox_id}",
    driver="bridge",
    options={
        "com.docker.network.bridge.enable_icc": "false",
        "com.docker.network.bridge.enable_ip_masquerade": "true"
    }
)

# Apply iptables rules for domain allowlist
# (Similar to Phase 4.4 network isolation)
await apply_egress_filter(
    container_id=container.id,
    allowed_domains=["pypi.org", "files.pythonhosted.org"]
)

Supported Languages

Python

Runtime: Python 3.11+

Default Packages:

Standard library only
No pre-installed third-party packages

Package Installation:

await sandbox.install_package(
    sandbox_id=sandbox_id,
    package="requests==2.31.0",
    registry="pypi"
)

Execution:

result = await sandbox.execute_code(
    sandbox_id=sandbox_id,
    code="""
import sys
print(f"Python {sys.version}")
print("Hello from gVisor sandbox!")
""",
    timeout=30
)

JavaScript (Node.js)

Runtime: Node.js 20+

Default Packages:

Node.js core modules only
No pre-installed npm packages

Package Installation:

await sandbox.install_package(
    sandbox_id=sandbox_id,
    package="axios@1.6.0",
    registry="npm"
)

Execution:

result = await sandbox.execute_code(
    sandbox_id=sandbox_id,
    code="""
console.log(`Node.js ${process.version}`);
console.log("Hello from gVisor sandbox!");
""",
    timeout=30
)

Shell

Runtime: Bash 5.2+

Available Commands:

Basic POSIX utilities (ls, cat, grep, etc.)
No network utilities (curl, wget) unless network enabled

Execution:

result = await sandbox.execute_code(
    sandbox_id=sandbox_id,
    code="""
echo "Shell: $BASH_VERSION"
ls -la /workspace
""",
    timeout=30
)
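
The three runtimes could share one execution path if each language maps to an image, a script file name, and a command. The sketch below shows one possible shape for that mapping. The image names come from the YAML configuration in this document and script.py from the workspace layout. The names script.js and script.sh, the node and bash commands, and LanguageSpec itself are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LanguageSpec:
    image: str                # Docker image for the runtime
    script_name: str          # File name the code is injected as
    command: tuple[str, ...]  # Command that runs the injected file


LANGUAGES = {
    "python": LanguageSpec(
        "harombe/sandbox-python:3.11", "script.py",
        ("python", "/workspace/script.py"),
    ),
    "javascript": LanguageSpec(
        "harombe/sandbox-node:20", "script.js",
        ("node", "/workspace/script.js"),
    ),
    "shell": LanguageSpec(
        "harombe/sandbox-shell:latest", "script.sh",
        ("bash", "/workspace/script.sh"),
    ),
}


def resolve_language(language: str) -> LanguageSpec:
    """Look up the runtime settings for a language or fail loudly."""
    try:
        return LANGUAGES[language]
    except KeyError:
        raise ValueError(f"unsupported language: {language!r}") from None
```

Keeping the mapping in one table means adding a language does not touch the execution code.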

Resource Constraints

Default Limits

DEFAULT_LIMITS = {
    "max_memory_mb": 512,      # 512MB RAM
    "max_cpu_cores": 0.5,      # 50% of 1 CPU core
    "max_disk_mb": 1024,       # 1GB disk
    "max_execution_time": 30,  # 30 seconds
    "max_output_bytes": 1_048_576,  # 1MB stdout/stderr
}
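
These limits have to be translated into container settings. The Docker Integration example expresses 50% of one CPU as cpu_quota=50000 over cpu_period=100000, and the sketch below generalizes that arithmetic. The helper name limits_to_docker_kwargs is hypothetical.

```python
CPU_PERIOD_US = 100_000  # Same period as the container example in this document


def limits_to_docker_kwargs(max_memory_mb: int, max_cpu_cores: float) -> dict:
    """Convert sandbox limits into mem_limit / cpu_period / cpu_quota values."""
    if max_memory_mb <= 0 or max_cpu_cores <= 0:
        raise ValueError("limits must be positive")
    return {
        "mem_limit": f"{max_memory_mb}m",
        "cpu_period": CPU_PERIOD_US,
        # quota / period is the fraction of one core the container may use.
        "cpu_quota": round(max_cpu_cores * CPU_PERIOD_US),
    }
```

For the defaults above, limits_to_docker_kwargs(512, 0.5) yields a 512m memory limit and a quota of 50000 microseconds per 100000 microsecond period.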

Configurable Per Execution

result = await sandbox.execute_code(
    sandbox_id=sandbox_id,
    code=code,
    timeout=60,               # Override default timeout
    max_memory_mb=1024,       # Override default memory
    max_output_bytes=5_242_880  # 5MB output
)

Enforcement

Time Limits:

Enforced by stopping the Docker container once the timeout elapses
SIGTERM is sent first, then SIGKILL after a grace period
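
The SIGTERM-then-SIGKILL escalation can be illustrated with a local process standing in for the container. This is only a sketch of the pattern. The function run_with_timeout and the 2-second default grace period are assumptions, and the real sandbox would stop a container rather than a child process.

```python
import asyncio
import sys


async def run_with_timeout(
    argv: list[str], timeout: float, grace: float = 2.0
) -> tuple[int, bool]:
    """Run a process; on timeout send SIGTERM, then SIGKILL after the grace period.

    Returns (exit_code, timed_out).
    """
    proc = await asyncio.create_subprocess_exec(*argv)
    try:
        await asyncio.wait_for(proc.wait(), timeout=timeout)
        return proc.returncode, False
    except asyncio.TimeoutError:
        proc.terminate()  # SIGTERM on POSIX
        try:
            await asyncio.wait_for(proc.wait(), timeout=grace)
        except asyncio.TimeoutError:
            proc.kill()  # SIGKILL on POSIX
            await proc.wait()
        return proc.returncode, True


async def demo() -> tuple[bool, bool]:
    # A process that sleeps far longer than the timeout gets terminated.
    _, slow_timed_out = await run_with_timeout(
        [sys.executable, "-c", "import time; time.sleep(30)"], timeout=0.5
    )
    # A process that exits immediately is not affected.
    _, fast_timed_out = await run_with_timeout(
        [sys.executable, "-c", "pass"], timeout=10
    )
    return slow_timed_out, fast_timed_out
```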

Memory Limits:

Enforced by Docker cgroup limits
OOM killer terminates process if exceeded

Disk Limits:

Enforced by tmpfs mount with size limit
Write fails when limit reached

Output Limits:

Enforced by capture buffer size
Truncated with warning if exceeded
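
Output truncation might look like the following sketch. The function name and the wording of the warning are assumptions. Decoding uses errors="replace" because a cut at a byte boundary can split a multi-byte UTF-8 character.

```python
def truncate_output(data: bytes, max_bytes: int = 1_048_576) -> tuple[str, bool]:
    """Return (text, truncated), keeping at most max_bytes of captured output."""
    truncated = len(data) > max_bytes
    text = data[:max_bytes].decode("utf-8", errors="replace")
    if truncated:
        text += f"\n[output truncated at {max_bytes} bytes]"
    return text, truncated
```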

Network Isolation

Default: Air-Gapped

Network disabled by default:

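# urllib is used because no third-party packages (such as requests) are pre-installed
result = await sandbox.execute_code(
    sandbox_id=sandbox_id,
    code="import urllib.request; urllib.request.urlopen('https://example.com')",
    # The sandbox was created with network_enabled=False (default)
)
# Result: URLError (no network access)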

Optional: Allowlisted Domains

Enable network with domain allowlist:

# Network access is configured when the sandbox is created
sandbox_id = await sandbox.create_sandbox(
    language="python",
    network_enabled=True,
    allowed_domains=["pypi.org", "files.pythonhosted.org"],
)

result = await sandbox.execute_code(
    sandbox_id=sandbox_id,
    code="""
import urllib.request
with urllib.request.urlopen('https://pypi.org') as response:
    print(response.status)
""",
)
# Result: 200 (pypi.org accessible)

result2 = await sandbox.execute_code(
    sandbox_id=sandbox_id,
    code="import urllib.request; urllib.request.urlopen('https://evil.com')",
)
# Result: URLError (evil.com is not on the allowlist)

Implementation:

Reuses Phase 4.4 network isolation (iptables egress filtering)
DNS resolution controlled by custom DNS server
All traffic outside allowlist dropped
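
The decision the custom DNS server has to make for each lookup can be sketched as a simple allowlist check. The matching policy here is an assumption, since the design does not say whether subdomains of an allowlisted domain are permitted. This sketch uses exact, case-insensitive matching, which is the stricter choice.

```python
def is_domain_allowed(hostname: str, allowed_domains: list[str]) -> bool:
    """Exact, case-insensitive match against the allowlist (assumed policy)."""
    # Strip whitespace and a trailing root dot so "pypi.org." equals "pypi.org".
    host = hostname.strip().rstrip(".").lower()
    allowed = {d.strip().rstrip(".").lower() for d in allowed_domains}
    return host in allowed
```

Exact matching also rejects look-alike names such as pypi.org.evil.com, which a naive prefix or substring check would let through.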

Filesystem Isolation

Workspace Structure

/workspace/          # Temporary directory (tmpfs, size-limited)
├── script.py        # Injected code file
├── output.txt       # Agent-created files
└── data/            # Agent-created directories
    └── results.json

No Host Access

Blocked:

No access to host filesystem
No access to the host's /proc or /sys (gVisor serves its own virtualized versions)
No access to other containers

Temporary Workspace:

Created per sandbox
Destroyed after execution
Max size: 1GB (configurable)

File Operations

Write File:

await sandbox.write_file(
    sandbox_id=sandbox_id,
    file_path="/workspace/config.json",
    content='{"key": "value"}'
)

Read File:

result = await sandbox.read_file(
    sandbox_id=sandbox_id,
    file_path="/workspace/output.txt"
)
print(result.content)

List Files:

result = await sandbox.list_files(
    sandbox_id=sandbox_id,
    path="/workspace"
)
print(result.files)  # ["script.py", "output.txt", "data/"]
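
Because file_path values come from the agent, the file tools would likely validate them before touching the workspace. The following is a minimal sketch of a lexical check. The helper resolve_workspace_path is hypothetical. It normalizes the path only and does not resolve symlinks, so it complements the sandbox isolation and does not replace it.

```python
import posixpath

WORKSPACE_ROOT = "/workspace"


def resolve_workspace_path(file_path: str) -> str:
    """Normalize a path and reject anything that falls outside /workspace."""
    if "\x00" in file_path:
        raise ValueError("path contains a NUL byte")
    # Relative paths are interpreted relative to the workspace root;
    # absolute paths replace the root and are checked below.
    joined = posixpath.join(WORKSPACE_ROOT, file_path)
    normalized = posixpath.normpath(joined)
    inside = normalized == WORKSPACE_ROOT or normalized.startswith(WORKSPACE_ROOT + "/")
    if not inside:
        raise ValueError(f"path escapes the workspace: {file_path!r}")
    return normalized
```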

Security Model

Threat Model

Threats Mitigated:

Kernel exploits - gVisor keeps untrusted code from calling the host kernel directly
Container escape - The userspace application kernel makes breakout substantially harder
Resource exhaustion - CPU/memory/disk limits prevent DoS
Data exfiltration - Network disabled by default, allowlisted when enabled
Malicious code execution - Sandbox isolation limits blast radius

Threats NOT Mitigated:

Logic bombs - Malicious code that appears benign
Side-channel attacks - Timing attacks, speculative execution
Social engineering - Convincing user to approve dangerous operations

Defense in Depth

Layer 1: HITL Gates

User approval required for HIGH+ risk operations
Dangerous code patterns detected and flagged

Layer 2: gVisor Isolation

Application kernel limits host kernel exposure
Syscall interception reduces exposure to host kernel exploits

Layer 3: Resource Limits

Time limits prevent infinite loops
Memory limits prevent DoS
Disk limits prevent storage exhaustion

Layer 4: Network Isolation

Air-gapped by default
Allowlist-based egress filtering when network needed

Layer 5: Audit Logging

All code execution logged
Approval decisions recorded
Results and errors tracked

Security Best Practices

Always require HITL approval for code execution
Keep network disabled unless absolutely necessary
Use minimal allowlists when network is required
Review code before approving - check for dangerous patterns
Monitor audit logs for suspicious activity
Keep gVisor updated - security patches and improvements

Execution Flow

Standard Code Execution

sequenceDiagram
    participant Agent
    participant Gateway
    participant HITL
    participant SandboxMgr
    participant gVisor
    participant AuditLog

    Agent->>Gateway: code_execute(code, language)
    Gateway->>HITL: Check risk level
    HITL->>HITL: Classify as HIGH
    HITL->>User: Request approval
    User->>HITL: Approve
    HITL->>Gateway: Approved
    Gateway->>SandboxMgr: create_sandbox(language)
    SandboxMgr->>gVisor: docker run --runtime=runsc
    gVisor->>SandboxMgr: sandbox_id
    SandboxMgr->>gVisor: Write code to /workspace
    SandboxMgr->>gVisor: Execute code (timeout)
    gVisor->>gVisor: Run in isolated kernel
    gVisor->>SandboxMgr: stdout, stderr, exit_code
    SandboxMgr->>gVisor: Destroy sandbox
    SandboxMgr->>Gateway: ExecutionResult
    Gateway->>Agent: Return result
    Gateway->>AuditLog: Log execution

With Package Installation

sequenceDiagram
    participant Agent
    participant Gateway
    participant HITL
    participant SandboxMgr
    participant gVisor

    Agent->>Gateway: code_execute(code, network_enabled=True)
    Gateway->>HITL: Check risk (CRITICAL)
    HITL->>User: Request approval (network enabled)
    User->>HITL: Approve
    HITL->>Gateway: Approved
    Gateway->>SandboxMgr: create_sandbox(network_enabled=True)
    SandboxMgr->>gVisor: Create with network
    SandboxMgr->>gVisor: Apply egress filter (pypi.org)
    Agent->>Gateway: code_install_package("requests")
    Gateway->>HITL: Check risk (HIGH)
    HITL->>User: Request approval
    User->>HITL: Approve
    HITL->>Gateway: Approved
    Gateway->>SandboxMgr: install_package("requests")
    SandboxMgr->>gVisor: pip install requests
    gVisor->>SandboxMgr: Success
    Agent->>Gateway: code_execute(code using requests)
    Gateway->>HITL: Check risk
    HITL->>User: Request approval
    User->>HITL: Approve
    HITL->>Gateway: Approved
    Gateway->>SandboxMgr: execute_code
    SandboxMgr->>gVisor: Execute
    gVisor->>SandboxMgr: Result
    SandboxMgr->>Gateway: Return result
    Gateway->>Agent: Return result

Configuration

YAML Configuration

security:
  sandbox:
    enabled: true
    runtime: runsc # gVisor runtime

    # Default resource limits
    limits:
      max_memory_mb: 512
      max_cpu_cores: 0.5
      max_disk_mb: 1024
      max_execution_time: 30
      max_output_bytes: 1048576

    # Network configuration
    network:
      enabled_by_default: false
      allowed_registries:
        pypi:
          - pypi.org
          - files.pythonhosted.org
        npm:
          - registry.npmjs.org
          - registry.npmjs.com

    # Supported languages
    languages:
      python:
        image: harombe/sandbox-python:3.11
        default_packages: []
      javascript:
        image: harombe/sandbox-node:20
        default_packages: []
      shell:
        image: harombe/sandbox-shell:latest
        default_packages: []

    # HITL integration
    hitl:
      enabled: true
      auto_approve_low_risk: true
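
The allowed_registries block is what ties a package installation to its egress allowlist. As a sketch of that lookup, the dictionary below mirrors the network section of the YAML above as it might look after parsing. The helper domains_for_registry is a hypothetical name.

```python
# Mirrors security.sandbox.network from the YAML configuration.
SANDBOX_CONFIG = {
    "network": {
        "enabled_by_default": False,
        "allowed_registries": {
            "pypi": ["pypi.org", "files.pythonhosted.org"],
            "npm": ["registry.npmjs.org", "registry.npmjs.com"],
        },
    },
}


def domains_for_registry(config: dict, registry: str) -> list[str]:
    """Return the egress allowlist for a registry, rejecting unknown registries."""
    registries = config.get("network", {}).get("allowed_registries", {})
    if registry not in registries:
        raise ValueError(f"registry is not allowlisted: {registry!r}")
    # Return a copy so callers cannot mutate the loaded configuration.
    return list(registries[registry])
```

A call such as domains_for_registry(SANDBOX_CONFIG, "pypi") gives the domains to pass as allowed_domains when the sandbox is created.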

Docker Images

Python Sandbox Image

Dockerfile (docker/sandbox-python.Dockerfile):

FROM python:3.11-slim

# Install minimal dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Non-root user
RUN useradd -m -u 1000 sandbox

# Create workspace owned by the sandbox user so it stays writable after USER
RUN mkdir /workspace && chown sandbox:sandbox /workspace
WORKDIR /workspace

USER sandbox

# No default packages (install on demand)
CMD ["python", "--version"]

Node.js Sandbox Image

Dockerfile (docker/sandbox-node.Dockerfile):

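FROM node:20-slim

# Install minimal dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Non-root user: the official node image already ships a "node" user with
# UID 1000, so "useradd -u 1000" would fail. Reuse that account instead.
# Create workspace owned by it so it stays writable after USER.
RUN mkdir /workspace && chown node:node /workspace
WORKDIR /workspace

USER node

# No default packages
CMD ["node", "--version"]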

Testing Strategy

Unit Tests

Test sandbox manager:

Sandbox creation and destruction
Code execution with various languages
Resource limit enforcement
Network isolation verification
File operations

Test code execution tools:

Tool invocation with valid inputs
Error handling for invalid inputs
HITL integration
Result serialization

Integration Tests

Test gVisor isolation:

Verify syscall filtering
Attempt kernel exploits (should fail)
Verify network isolation
Verify filesystem isolation

Test resource limits:

Execution timeout enforcement
Memory limit enforcement (OOM killer)
Disk limit enforcement
Output truncation

Security Tests

Test dangerous code patterns:

Shell command injection attempts
Path traversal attempts
Network exfiltration attempts (when disabled)
Resource exhaustion attempts

Implementation Phases

Phase 1: Core Sandbox Manager

Install and configure gVisor
Implement SandboxManager class
Support Python execution
Basic resource limits
Unit tests

Phase 2: Multi-Language Support

Add Node.js support
Add shell script support
Unified execution interface
Language-specific handling

Phase 3: Network & Packages

Optional network enablement
Egress filtering integration (Phase 4.4)
Package installation (pip, npm)
Registry allowlists

Phase 4: MCP Tools & HITL

Implement code execution tools
Define risk classification rules
HITL integration
Audit logging integration

Phase 5: Testing & Documentation

Comprehensive test suite
Usage documentation
Security guide
Architecture updates

Future Enhancements

Persistent Sandboxes:

Reuse sandbox across multiple executions
Faster iteration for development workflows

More Languages:

Go, Rust, Java support
Custom language runtime plugins

Advanced Security:

Seccomp profile customization
SELinux/AppArmor policies
Rootless containers

Performance Optimization:

Sandbox pool (pre-created sandboxes)
Faster cold starts
Streaming output

Next Steps

Review and approve this design
Set up gVisor development environment
Implement SandboxManager (Phase 1)
Add multi-language support (Phase 2)
Add network and package support, then integrate with MCP Gateway and HITL (Phases 3-4)
Complete testing and documentation (Phase 5)