Code Execution Sandbox Design¶
Phase 4.7 - gVisor-Based Code Execution
This document describes the architecture for secure code execution using gVisor sandboxing, providing stronger isolation than Docker containers alone while maintaining practical usability for AI agent workflows.
Overview¶
The code execution sandbox allows AI agents to run arbitrary code (Python, JavaScript, shell scripts) in a highly isolated environment with:
- gVisor kernel isolation - Application kernel in userspace, limiting host kernel exposure
- Air-gapped by default - No network access unless explicitly enabled
- Resource constraints - CPU, memory, disk, and time limits
- Filesystem isolation - Temporary workspace, no host filesystem access
- Optional package installation - Allowlisted registries (PyPI, npm) when needed
- Multi-language support - Python, Node.js, shell scripts
- HITL integration - Risk-based approval for sensitive operations
Architecture¶
System Diagram¶
┌─────────────────────────────────────────────────────┐
│ Agent (via MCP Gateway) │
│ Requests code execution via code_execute tool │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ HITL Gate (Risk Classifier) │
│ Classifies operation risk (LOW/MED/HIGH/CRITICAL) │
│ Requires approval for HIGH+ operations │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ SandboxManager │
│ - Creates gVisor sandbox container │
│ - Injects code into isolated workspace │
│ - Configures resource limits │
│ - Optionally enables network (allowlisted domains) │
│ - Captures stdout/stderr/exit_code │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ gVisor Sandbox (runsc runtime) │
│ │
│ ┌───────────────────────────────────────────────┐ │
│ │ Application Kernel (Go, userspace) │ │
│ │ - Intercepts syscalls │ │
│ │ - Virtualizes devices │ │
│ │ - Isolates from host kernel │ │
│ └───────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────┐ │
│ │ Code Execution Environment │ │
│ │ - Python 3.11+ / Node.js 20+ / Bash │ │
│ │ - Temporary workspace: /workspace │ │
│ │ - No host filesystem access │ │
│ │ - Network: disabled (or allowlisted domains) │ │
│ └───────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Audit Logger │
│ - Logs all code execution requests │
│ - Records approval decisions │
│ - Tracks execution results and errors │
└─────────────────────────────────────────────────────┘
Components¶
1. SandboxManager¶
Purpose: Manages gVisor sandbox lifecycle and code execution.
Responsibilities:
- Create and configure gVisor sandbox containers
- Inject code into isolated workspace
- Apply resource limits (CPU, memory, disk, time)
- Configure network isolation (disabled by default)
- Execute code and capture results
- Cleanup sandbox after execution
Key Methods:
class SandboxManager:
async def create_sandbox(
self,
language: str,
sandbox_id: str | None = None,
network_enabled: bool = False,
allowed_domains: list[str] | None = None,
) -> str
async def execute_code(
self,
sandbox_id: str,
code: str,
timeout: int = 30,
max_memory_mb: int = 512,
) -> ExecutionResult
async def install_package(
self,
sandbox_id: str,
package: str,
registry: str = "pypi",
) -> InstallResult
async def write_file(
self,
sandbox_id: str,
file_path: str,
content: str,
) -> WriteResult
async def read_file(
self,
sandbox_id: str,
file_path: str,
) -> ReadResult
async def list_files(
self,
sandbox_id: str,
path: str = "/workspace",
) -> ListResult
async def destroy_sandbox(
self,
sandbox_id: str,
) -> None
2. Code Execution Tools¶
Purpose: MCP-compatible tools for agent code execution.
Tools:
- code_execute - Execute code in sandbox
- code_install_package - Install packages from registries
- code_write_file - Write files to workspace
- code_read_file - Read files from workspace
- code_list_files - List files in workspace
- code_destroy_sandbox - Cleanup sandbox
Example Tool Schema:
{
"name": "code_execute",
"description": "Execute code in isolated gVisor sandbox",
"inputSchema": {
"type": "object",
"properties": {
"language": {
"type": "string",
"enum": ["python", "javascript", "shell"],
"description": "Programming language"
},
"code": {
"type": "string",
"description": "Code to execute"
},
"timeout": {
"type": "integer",
"default": 30,
"description": "Execution timeout in seconds"
},
"network_enabled": {
"type": "boolean",
"default": false,
"description": "Enable network access (requires approval)"
},
"allowed_domains": {
"type": "array",
"items": {"type": "string"},
"description": "Allowlisted domains (when network enabled)"
}
},
"required": ["language", "code"]
}
}
3. Risk Classification Rules¶
Purpose: HITL integration for code execution operations.
Risk Levels:
CRITICAL (30s timeout):
- Code execution with network enabled
- Installing packages from non-standard registries
- Code containing dangerous patterns (rm -rf, curl | sh, eval)
HIGH (60s timeout):
- Any code execution (default)
- Installing packages from standard registries
- Writing executable files
MEDIUM (120s timeout):
- Reading/writing non-executable files
- Listing files in workspace
LOW (auto-approved):
- Destroying sandbox (cleanup)
- Reading sandbox metadata
Example Rules:
HITLRule(
tools=["code_execute"],
risk=RiskLevel.CRITICAL,
conditions=[
{"param": "network_enabled", "equals": True}
],
timeout=30,
description="Code execution with network access"
),
HITLRule(
tools=["code_execute"],
risk=RiskLevel.CRITICAL,
conditions=[
{"param": "code", "matches": r"(?i)(rm\s+-rf|curl.*\|\s*sh|eval\(|exec\()"}
],
timeout=30,
description="Dangerous code patterns detected"
),
HITLRule(
tools=["code_execute"],
risk=RiskLevel.HIGH,
require_approval=True,
timeout=60,
description="Code execution in sandbox"
),
HITLRule(
tools=["code_install_package"],
risk=RiskLevel.HIGH,
timeout=60,
description="Package installation"
),
gVisor Integration¶
Why gVisor?¶
Traditional Docker containers share the host kernel, which exposes a large attack surface (~300+ syscalls). gVisor provides:
- Application kernel in userspace - Written in memory-safe Go
- Syscall interception - Limits host kernel exposure to ~70 syscalls
- No VM overhead - Faster startup than VMs, lighter than Kata Containers
- Production-ready - Used by Google GKE, maintained by Google
Security Comparison:
| Feature | Docker | Docker + gVisor | VM (Firecracker) |
|---|---|---|---|
| Kernel isolation | ❌ Shared | ✅ Isolated | ✅ Isolated |
| Syscall filtering | ⚠️ seccomp | ✅ Application kernel | ✅ Full VM |
| Startup time | ~100ms | ~500ms | ~1s |
| Memory overhead | ~10MB | ~50MB | ~150MB |
| I/O performance | Excellent | Good | Fair |
References:
- gVisor Official Documentation
- How to sandbox AI agents in 2026
- 4 ways to sandbox untrusted code in 2026
Installation¶
Installing gVisor:
# Download runsc
wget https://storage.googleapis.com/gvisor/releases/release/latest/$(uname -m)/runsc
chmod +x runsc
sudo mv runsc /usr/local/bin/
# Configure Docker to use runsc
sudo runsc install
# Verify installation
docker run --runtime=runsc --rm hello-world
Docker Configuration (/etc/docker/daemon.json):
Reference: gVisor Docker Quick Start
Docker Integration¶
Creating gVisor Sandbox:
# Create container with runsc runtime
container = await docker_client.containers.create(
image="harombe/code-sandbox:python3.11",
runtime="runsc", # Use gVisor
command=["python", "/workspace/script.py"],
network_mode="none", # Air-gapped
mem_limit="512m",
cpu_period=100000,
cpu_quota=50000, # 50% of 1 CPU
volumes={
temp_workspace: {
"bind": "/workspace",
"mode": "rw"
}
},
working_dir="/workspace",
remove=True, # Auto-cleanup
)
With Network (Optional):
# Create custom network with egress filtering
network = await docker_client.networks.create(
name=f"sandbox-{sandbox_id}",
driver="bridge",
options={
"com.docker.network.bridge.enable_icc": "false",
"com.docker.network.bridge.enable_ip_masquerade": "true"
}
)
# Apply iptables rules for domain allowlist
# (Similar to Phase 4.4 network isolation)
await apply_egress_filter(
container_id=container.id,
allowed_domains=["pypi.org", "files.pythonhosted.org"]
)
Supported Languages¶
Python¶
Runtime: Python 3.11+
Default Packages:
- Standard library only
- No pre-installed third-party packages
Package Installation:
Execution:
result = await sandbox.execute_code(
sandbox_id=sandbox_id,
code="""
import sys
print(f"Python {sys.version}")
print("Hello from gVisor sandbox!")
""",
timeout=30
)
JavaScript (Node.js)¶
Runtime: Node.js 20+
Default Packages:
- Node.js core modules only
- No pre-installed npm packages
Package Installation:
Execution:
result = await sandbox.execute_code(
sandbox_id=sandbox_id,
code="""
console.log(`Node.js ${process.version}`);
console.log("Hello from gVisor sandbox!");
""",
timeout=30
)
Shell¶
Runtime: Bash 5.2+
Available Commands:
- Basic POSIX utilities (ls, cat, grep, etc.)
- No network utilities (curl, wget) unless network enabled
Execution:
result = await sandbox.execute_code(
sandbox_id=sandbox_id,
code="""
echo "Shell: $BASH_VERSION"
ls -la /workspace
""",
timeout=30
)
Resource Constraints¶
Default Limits¶
DEFAULT_LIMITS = {
"max_memory_mb": 512, # 512MB RAM
"max_cpu_cores": 0.5, # 50% of 1 CPU core
"max_disk_mb": 1024, # 1GB disk
"max_execution_time": 30, # 30 seconds
"max_output_bytes": 1_048_576, # 1MB stdout/stderr
}
Configurable Per Execution¶
result = await sandbox.execute_code(
sandbox_id=sandbox_id,
code=code,
timeout=60, # Override default timeout
max_memory_mb=1024, # Override default memory
max_output_bytes=5_242_880 # 5MB output
)
Enforcement¶
Time Limits:
- Enforced by Docker timeout
- SIGTERM after timeout, then SIGKILL after grace period
Memory Limits:
- Enforced by Docker cgroup limits
- OOM killer terminates process if exceeded
Disk Limits:
- Enforced by tmpfs mount with size limit
- Write fails when limit reached
Output Limits:
- Enforced by capture buffer size
- Truncated with warning if exceeded
Network Isolation¶
Default: Air-Gapped¶
Network disabled by default:
result = await sandbox.execute_code(
sandbox_id=sandbox_id,
code="import requests; requests.get('https://example.com')",
# network_enabled=False (default)
)
# Result: ConnectionError (no network access)
Optional: Allowlisted Domains¶
Enable network with domain allowlist:
result = await sandbox.execute_code(
sandbox_id=sandbox_id,
code="""
import requests
response = requests.get('https://pypi.org')
print(response.status_code)
""",
network_enabled=True,
allowed_domains=["pypi.org", "files.pythonhosted.org"]
)
# Result: 200 (pypi.org accessible)
result2 = await sandbox.execute_code(
sandbox_id=sandbox_id,
code="import requests; requests.get('https://evil.com')",
network_enabled=True,
allowed_domains=["pypi.org"]
)
# Result: ConnectionError (evil.com blocked)
Implementation:
- Reuses Phase 4.4 network isolation (iptables egress filtering)
- DNS resolution controlled by custom DNS server
- All traffic outside allowlist dropped
Filesystem Isolation¶
Workspace Structure¶
/workspace/ # Temporary directory (tmpfs, size-limited)
├── script.py # Injected code file
├── output.txt # Agent-created files
└── data/ # Agent-created directories
└── results.json
No Host Access¶
Blocked:
- No access to host filesystem
- No access to /proc, /sys (filtered by gVisor)
- No access to other containers
Temporary Workspace:
- Created per sandbox
- Destroyed after execution
- Max size: 1GB (configurable)
File Operations¶
Write File:
await sandbox.write_file(
sandbox_id=sandbox_id,
file_path="/workspace/config.json",
content='{"key": "value"}'
)
Read File:
result = await sandbox.read_file(
sandbox_id=sandbox_id,
file_path="/workspace/output.txt"
)
print(result.content)
List Files:
result = await sandbox.list_files(
sandbox_id=sandbox_id,
path="/workspace"
)
print(result.files) # ["script.py", "output.txt", "data/"]
Security Model¶
Threat Model¶
Threats Mitigated:
- Kernel exploits - gVisor isolates from host kernel
- Container escape - Application kernel in userspace prevents breakout
- Resource exhaustion - CPU/memory/disk limits prevent DoS
- Data exfiltration - Network disabled by default, allowlisted when enabled
- Malicious code execution - Sandbox isolation limits blast radius
Threats NOT Mitigated:
- Logic bombs - Malicious code that appears benign
- Side-channel attacks - Timing attacks, speculative execution
- Social engineering - Convincing user to approve dangerous operations
Defense in Depth¶
Layer 1: HITL Gates
- User approval required for HIGH+ risk operations
- Dangerous code patterns detected and flagged
Layer 2: gVisor Isolation
- Application kernel limits host kernel exposure
- Syscall interception prevents kernel exploits
Layer 3: Resource Limits
- Time limits prevent infinite loops
- Memory limits prevent DoS
- Disk limits prevent storage exhaustion
Layer 4: Network Isolation
- Air-gapped by default
- Allowlist-based egress filtering when network needed
Layer 5: Audit Logging
- All code execution logged
- Approval decisions recorded
- Results and errors tracked
Security Best Practices¶
- Always require HITL approval for code execution
- Keep network disabled unless absolutely necessary
- Use minimal allowlists when network is required
- Review code before approving - check for dangerous patterns
- Monitor audit logs for suspicious activity
- Keep gVisor updated - security patches and improvements
Execution Flow¶
Standard Code Execution¶
sequenceDiagram
participant Agent
participant Gateway
participant HITL
participant SandboxMgr
participant gVisor
participant AuditLog
Agent->>Gateway: code_execute(code, language)
Gateway->>HITL: Check risk level
HITL->>HITL: Classify as HIGH
HITL->>User: Request approval
User->>HITL: Approve
HITL->>Gateway: Approved
Gateway->>SandboxMgr: create_sandbox(language)
SandboxMgr->>gVisor: docker run --runtime=runsc
gVisor->>SandboxMgr: sandbox_id
SandboxMgr->>gVisor: Write code to /workspace
SandboxMgr->>gVisor: Execute code (timeout)
gVisor->>gVisor: Run in isolated kernel
gVisor->>SandboxMgr: stdout, stderr, exit_code
SandboxMgr->>gVisor: Destroy sandbox
SandboxMgr->>Gateway: ExecutionResult
Gateway->>Agent: Return result
Gateway->>AuditLog: Log execution
With Package Installation¶
sequenceDiagram
participant Agent
participant Gateway
participant HITL
participant SandboxMgr
participant gVisor
Agent->>Gateway: code_execute(code, network_enabled=True)
Gateway->>HITL: Check risk (CRITICAL)
HITL->>User: Request approval (network enabled)
User->>HITL: Approve
HITL->>Gateway: Approved
Gateway->>SandboxMgr: create_sandbox(network_enabled=True)
SandboxMgr->>gVisor: Create with network
SandboxMgr->>gVisor: Apply egress filter (pypi.org)
Agent->>Gateway: code_install_package("requests")
Gateway->>HITL: Check risk (HIGH)
HITL->>User: Request approval
User->>HITL: Approve
Gateway->>SandboxMgr: install_package("requests")
SandboxMgr->>gVisor: pip install requests
gVisor->>SandboxMgr: Success
Agent->>Gateway: code_execute(code using requests)
Gateway->>SandboxMgr: execute_code
SandboxMgr->>gVisor: Execute
gVisor->>SandboxMgr: Result
SandboxMgr->>Gateway: Return result
Configuration¶
YAML Configuration¶
security:
sandbox:
enabled: true
runtime: runsc # gVisor runtime
# Default resource limits
limits:
max_memory_mb: 512
max_cpu_cores: 0.5
max_disk_mb: 1024
max_execution_time: 30
max_output_bytes: 1048576
# Network configuration
network:
enabled_by_default: false
allowed_registries:
pypi:
- pypi.org
- files.pythonhosted.org
npm:
- registry.npmjs.org
- registry.npmjs.com
# Supported languages
languages:
python:
image: harombe/sandbox-python:3.11
default_packages: []
javascript:
image: harombe/sandbox-node:20
default_packages: []
shell:
image: harombe/sandbox-shell:latest
default_packages: []
# HITL integration
hitl:
enabled: true
auto_approve_low_risk: true
Docker Images¶
Python Sandbox Image¶
Dockerfile (docker/sandbox-python.Dockerfile):
FROM python:3.11-slim
# Install minimal dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
# Create workspace
RUN mkdir /workspace
WORKDIR /workspace
# Non-root user
RUN useradd -m -u 1000 sandbox
USER sandbox
# No default packages (install on demand)
CMD ["python", "--version"]
Node.js Sandbox Image¶
Dockerfile (docker/sandbox-node.Dockerfile):
FROM node:20-slim
# Install minimal dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
ca-certificates \
&& rm -rf /var/lib/apt/lists/*
# Create workspace
RUN mkdir /workspace
WORKDIR /workspace
# Non-root user
RUN useradd -m -u 1000 sandbox
USER sandbox
# No default packages
CMD ["node", "--version"]
Testing Strategy¶
Unit Tests¶
Test sandbox manager:
- Sandbox creation and destruction
- Code execution with various languages
- Resource limit enforcement
- Network isolation verification
- File operations
Test code execution tools:
- Tool invocation with valid inputs
- Error handling for invalid inputs
- HITL integration
- Result serialization
Integration Tests¶
Test gVisor isolation:
- Verify syscall filtering
- Attempt kernel exploits (should fail)
- Verify network isolation
- Verify filesystem isolation
Test resource limits:
- Execution timeout enforcement
- Memory limit enforcement (OOM killer)
- Disk limit enforcement
- Output truncation
Security Tests¶
Test dangerous code patterns:
- Shell command injection attempts
- Path traversal attempts
- Network exfiltration attempts (when disabled)
- Resource exhaustion attempts
Implementation Phases¶
Phase 1: Core Sandbox Manager¶
- Install and configure gVisor
- Implement
SandboxManagerclass - Support Python execution
- Basic resource limits
- Unit tests
Phase 2: Multi-Language Support¶
- Add Node.js support
- Add shell script support
- Unified execution interface
- Language-specific handling
Phase 3: Network & Packages¶
- Optional network enablement
- Egress filtering integration (Phase 4.4)
- Package installation (pip, npm)
- Registry allowlists
Phase 4: MCP Tools & HITL¶
- Implement code execution tools
- Define risk classification rules
- HITL integration
- Audit logging integration
Phase 5: Testing & Documentation¶
- Comprehensive test suite
- Usage documentation
- Security guide
- Architecture updates
Future Enhancements¶
Persistent Sandboxes:
- Reuse sandbox across multiple executions
- Faster iteration for development workflows
More Languages:
- Go, Rust, Java support
- Custom language runtime plugins
Advanced Security:
- Seccomp profile customization
- SELinux/AppArmor policies
- Rootless containers
Performance Optimization:
- Sandbox pool (pre-created sandboxes)
- Faster cold starts
- Streaming output
References¶
- gVisor Official Documentation
- gVisor Docker Quick Start
- How to sandbox AI agents in 2026
- 4 ways to sandbox untrusted code in 2026
- gVisor Security Model
- Container Sandboxing with gVisor
Next Steps¶
- Review and approve this design
- Set up gVisor development environment
- Implement
SandboxManager(Phase 1) - Add multi-language support (Phase 2)
- Integrate with MCP Gateway (Phase 3-4)
- Complete testing and documentation (Phase 5)