Security Best Practices
This guide gives operational security recommendations for building, deploying, and running AgentWeave agents in production.
Identity Management
Use SPIFFE/SPIRE in Production
Never use mock identity in production.
# ❌ BAD - Mock identity for production
identity:
  provider: "mock"
  spiffe_id: "spiffe://example.com/agent/my-agent"

# ✅ GOOD - SPIRE for production
identity:
  provider: "spire"
  socket_path: "/run/spire/sockets/agent.sock"
  allowed_trust_domains:
    - "example.com"
Mock identity should only be used for:
- Local development
- Unit testing
- Integration testing in CI
Register Unique SPIFFE IDs per Agent
Each agent instance should have a unique, descriptive SPIFFE ID:
# ✅ GOOD - Specific IDs
spiffe://example.com/agent/data-processor/production
spiffe://example.com/agent/api-gateway/staging
spiffe://example.com/agent/orchestrator/dev
# ❌ BAD - Generic IDs
spiffe://example.com/agent
spiffe://example.com/service
SPIFFE ID Best Practices:
- Include environment (production, staging, dev)
- Include functional role (data-processor, orchestrator)
- Use hierarchical structure: /agent/<role>/<env>
- Document your SPIFFE ID naming convention
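A naming convention is easier to keep when it is checked automatically, for example in CI before registration entries are created. The following is a minimal sketch, assuming the /agent/<role>/<env> layout above; the function name `check_spiffe_id`, the regular expression, and the `ALLOWED_ENVS` set are illustrative assumptions, not part of AgentWeave or SPIRE.

```python
import re

# Hypothetical convention: spiffe://<trust-domain>/agent/<role>/<env>
SPIFFE_ID_PATTERN = re.compile(
    r"spiffe://(?P<trust_domain>[a-z0-9.-]+)"
    r"/agent/(?P<role>[a-z0-9-]+)/(?P<env>[a-z0-9-]+)"
)
ALLOWED_ENVS = {"production", "staging", "dev"}  # assumed environment names


def check_spiffe_id(spiffe_id: str) -> dict:
    """Return the parsed parts of an ID, or raise ValueError if it breaks the convention."""
    match = SPIFFE_ID_PATTERN.fullmatch(spiffe_id)
    if match is None:
        raise ValueError(f"{spiffe_id!r} does not match /agent/<role>/<env>")
    parts = match.groupdict()
    if parts["env"] not in ALLOWED_ENVS:
        raise ValueError(f"unknown environment {parts['env']!r}")
    return parts
```

Generic IDs such as spiffe://example.com/agent are rejected because they carry neither a role nor an environment.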
Rotate SVIDs Regularly
Configure short TTL for SVIDs:
# ✅ GOOD - 1 hour TTL
# Note: recent SPIRE releases deprecate -ttl in favor of -x509SVIDTTL / -jwtSVIDTTL;
# check `spire-server entry create -h` for your version.
spire-server entry create \
    -spiffeID spiffe://example.com/agent/data-processor/prod \
    -parentID spiffe://example.com/k8s-node \
    -selector k8s:ns:agentweave \
    -selector k8s:sa:data-processor \
    -ttl 3600  # 1 hour

# ⚠️ ACCEPTABLE - 4 hours for testing
    -ttl 14400

# ❌ BAD - 1 day
    -ttl 86400
SVID TTL Recommendations:
- Production: 1 hour (3600 seconds)
- Staging: 2-4 hours
- Development: 4-8 hours
- Testing: Can be longer for convenience
SPIRE automatically rotates an SVID once about 50% of its TTL has elapsed (roughly 30 minutes into a 1-hour TTL).
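The half-TTL rule gives a simple way to reason about when a rotation should have happened. The sketch below only illustrates the arithmetic; the function names and the 5-minute grace period are assumptions, not SPIRE or AgentWeave APIs.

```python
from datetime import datetime, timedelta, timezone


def expected_rotation_time(issued_at: datetime, ttl_seconds: int) -> datetime:
    """Rotation is expected once half of the TTL has elapsed."""
    return issued_at + timedelta(seconds=ttl_seconds / 2)


def rotation_overdue(issued_at: datetime, ttl_seconds: int, now: datetime,
                     grace_seconds: int = 300) -> bool:
    """True if the SVID should already have rotated, allowing an assumed grace period."""
    deadline = expected_rotation_time(issued_at, ttl_seconds) + timedelta(seconds=grace_seconds)
    return now > deadline


issued = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
rotation_at = expected_rotation_time(issued, 3600)  # 12:30 UTC for a 1-hour TTL
```

A check like `rotation_overdue` could feed an alert that fires well before the SVID actually expires.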
Validate Trust Domains
Explicitly configure allowed trust domains:
# ✅ GOOD - Explicit allowlist
identity:
  allowed_trust_domains:
    - "example.com"                 # Own domain
    - "partner.trusted-org.com"     # Federated partner
    - "vendor.external-service.io"  # Trusted vendor

# ❌ BAD - Wildcard trust
identity:
  allowed_trust_domains:
    - "*"  # Never do this!
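To show what an explicit allowlist means in practice, here is a minimal sketch of an exact-match trust domain check. The helper names are hypothetical, and AgentWeave's own validation may work differently; the point is that matching is exact, so look-alike domains and wildcards are refused.

```python
from urllib.parse import urlparse


def trust_domain_of(spiffe_id: str) -> str:
    """Extract the trust domain (the authority part) from a SPIFFE ID."""
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe" or not parsed.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    return parsed.netloc


def is_trusted(spiffe_id: str, allowed_trust_domains: list) -> bool:
    """Exact-match allowlist check; a wildcard entry is treated as a configuration error."""
    if "*" in allowed_trust_domains:
        raise ValueError("wildcard trust domains are not allowed")
    return trust_domain_of(spiffe_id) in set(allowed_trust_domains)
```

Because the comparison is on the whole trust domain, an ID under example.com.evil.io does not pass an allowlist that contains only example.com.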
Monitor SVID Health
Set up monitoring for SVID rotation:
class MyAgent(SecureAgent):
    async def on_svid_update(self, new_svid):
        """Called when SVID rotates."""
        self.logger.info(
            "SVID rotated successfully",
            extra={
                "spiffe_id": new_svid.spiffe_id,
                "expiry": new_svid.expiry,
                "ttl_seconds": new_svid.ttl,
            }
        )

        # Alert if TTL is shorter than expected
        if new_svid.ttl < 3600:
            self.logger.warning(
                "SVID TTL is shorter than expected",
                extra={"ttl": new_svid.ttl}
            )
Prometheus Metrics:
# Alert on rotation failures
- alert: SVIDRotationFailed
  expr: agentweave_svid_rotation_errors_total > 0
  annotations:
    summary: "SVID rotation failed"

# Alert on expiring SVIDs
- alert: SVIDNearExpiry
  expr: agentweave_svid_ttl_seconds < 300
  annotations:
    summary: "SVID expires in less than 5 minutes"
Authorization Policies
Default Deny
Always use default deny authorization:
# ✅ GOOD - Default deny
package agentweave.authz

import rego.v1

default allow := false

# Explicit rules required for access
allow if {
    # ... specific allow conditions
}
# ❌ BAD - Default allow
default allow := true
# This is dangerous!
In agent configuration:
# ✅ GOOD
authorization:
  default_action: "deny"

# ❌ BAD
authorization:
  default_action: "allow"
Principle of Least Privilege
Grant minimal permissions necessary:
# ✅ GOOD - Specific capabilities
allow if {
    input.caller_spiffe_id == "spiffe://example.com/agent/api-gateway"
    input.callee_spiffe_id == "spiffe://example.com/agent/data-processor"
    input.action in ["query", "health_check"]  # Only what's needed
}

# ❌ BAD - Overly permissive
allow if {
    input.caller_spiffe_id == "spiffe://example.com/agent/api-gateway"
    # Grants access to ALL capabilities
}
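The combination of default deny and narrowly scoped rules can also be modelled in plain Python, which some teams find useful for reasoning about expected decisions before writing Rego. This is only an illustrative model; `Rule`, `RULES`, and `is_allowed` are hypothetical names, and OPA remains the component that actually evaluates policy.

```python
from typing import NamedTuple


class Rule(NamedTuple):
    caller: str
    callee: str
    actions: frozenset  # the only actions this caller may invoke on this callee


RULES = [
    Rule(
        caller="spiffe://example.com/agent/api-gateway",
        callee="spiffe://example.com/agent/data-processor",
        actions=frozenset({"query", "health_check"}),
    ),
]


def is_allowed(caller: str, callee: str, action: str, rules=RULES) -> bool:
    """Default deny: a request is allowed only if some rule matches all three fields."""
    return any(
        rule.caller == caller and rule.callee == callee and action in rule.actions
        for rule in rules
    )
```

With no matching rule the function returns False, which mirrors `default allow := false`.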
Regular Policy Audits
Schedule policy reviews:
# Export current policies
curl -X GET http://localhost:8181/v1/policies > policies-$(date +%Y%m%d).json

# Review who can call what
curl -X POST http://localhost:8181/v1/data/agentweave/authz/allow \
    -d @test-cases.json

# Use OPA test framework
opa test policies/ -v
Quarterly Review Checklist:
- Review all allowlist entries
- Check for overly broad permissions
- Remove stale/unused entries
- Verify federated trust is still needed
- Test deny cases still work
- Update documentation
Test Policies Thoroughly
Use OPA's test framework:
# policies/authz_test.rego
package agentweave.authz

import rego.v1

# Test allowed access
test_api_gateway_can_query if {
    allow with input as {
        "caller_spiffe_id": "spiffe://example.com/agent/api-gateway",
        "callee_spiffe_id": "spiffe://example.com/agent/data-processor",
        "action": "query"
    }
}

# Test denied access
test_unknown_caller_denied if {
    not allow with input as {
        "caller_spiffe_id": "spiffe://example.com/agent/unknown",
        "callee_spiffe_id": "spiffe://example.com/agent/data-processor",
        "action": "query"
    }
}

# Test default deny
test_no_policy_denies if {
    not allow with input as {
        "caller_spiffe_id": "spiffe://example.com/agent/new-agent",
        "callee_spiffe_id": "spiffe://example.com/agent/new-service",
        "action": "new_action"
    }
}
Run tests in CI/CD:
# Run all policy tests
opa test policies/ -v
# Check coverage (aim for 100%)
opa test policies/ --coverage
# Fail CI if tests fail
opa test policies/ || exit 1
Separate Policy from Code
Store policies in version control separately:
project/
├── agents/
│   ├── data-processor/
│   └── orchestrator/
└── policies/
    ├── authz.rego
    ├── authz_test.rego
    └── data.json
Use OPA bundles for distribution:
# OPA config
services:
  bundle-server:
    url: https://policy-server.example.com

bundles:
  authz:
    service: bundle-server
    resource: bundles/agentweave-authz.tar.gz
    polling:
      min_delay_seconds: 30
      max_delay_seconds: 60
Transport Security
Enforce TLS 1.3 Minimum
# ✅ GOOD - TLS 1.3 only
transport:
  tls_min_version: "1.3"

# ⚠️ ACCEPTABLE for legacy compatibility
transport:
  tls_min_version: "1.2"

# ❌ BAD - TLS 1.1 is deprecated
transport:
  tls_min_version: "1.1"
Enable Strict Peer Verification
# ✅ GOOD - Verify peer identity
transport:
  verify_peer: true
  require_client_cert: true

# ❌ BAD - Skip verification
transport:
  verify_peer: false  # Never do this!
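For readers who want to see what these transport settings correspond to at the TLS library level, here is a sketch using Python's standard `ssl` module. It is not AgentWeave's implementation; the function name is hypothetical, and a real server would also need to load its certificate chain and the trust bundle used to verify clients.

```python
import ssl


def build_mtls_server_context() -> ssl.SSLContext:
    """Server-side context that refuses anything below TLS 1.3 and requires a client cert."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_3  # analogous to tls_min_version: "1.3"
    ctx.verify_mode = ssl.CERT_REQUIRED           # analogous to require_client_cert: true
    # A real deployment would also call ctx.load_cert_chain(...) and
    # ctx.load_verify_locations(...) with material from the SPIFFE Workload API.
    return ctx
```

Setting `verify_mode` to `ssl.CERT_NONE` would be the library-level equivalent of `verify_peer: false`, which is why that option should never be used.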
Configure Proper Cipher Suites
For TLS 1.3 (recommended):
transport:
  tls_min_version: "1.3"
  # TLS 1.3 cipher suites are secure by default
  cipher_suites:
    - "TLS_AES_256_GCM_SHA384"
    - "TLS_AES_128_GCM_SHA256"
    - "TLS_CHACHA20_POLY1305_SHA256"
For TLS 1.2 (if required):
transport:
  tls_min_version: "1.2"
  cipher_suites:
    - "TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384"
    - "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256"
    - "TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384"
    - "TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256"
Certificate Pinning Considerations
For high-security environments, consider trust bundle pinning:
class MyAgent(SecureAgent):
    async def validate_peer_svid(self, peer_svid):
        """Additional validation beyond standard checks."""
        # Validate trust bundle fingerprint
        expected_bundle_hash = "sha256:abcd1234..."
        if peer_svid.bundle_hash != expected_bundle_hash:
            raise SecurityError("Trust bundle mismatch")

        # Validate SPIFFE ID matches expected pattern
        if not peer_svid.spiffe_id.startswith("spiffe://example.com/"):
            raise SecurityError("Unexpected trust domain")
Configuration Security
Never Disable Security Checks
AgentWeave does not allow disabling security:
# ❌ These options don't exist (by design)
security:
  enabled: false     # Not possible
  skip_authz: true   # Not possible
  disable_tls: true  # Not possible
If you need to test without full security, use mock mode:
# ✅ GOOD - Explicit mock mode for testing
identity:
  provider: "mock"  # Clearly mock, not "disabled security"
Environment-Specific Configurations
Separate configs by environment:
configs/
├── base.yaml # Shared settings
├── development.yaml # Dev overrides
├── staging.yaml # Staging overrides
└── production.yaml # Production settings
# production.yaml
identity:
  provider: "spire"
  socket_path: "/run/spire/sockets/agent.sock"

authorization:
  default_action: "deny"
  opa_url: "http://localhost:8181"

observability:
  audit_log:
    enabled: true
    destination: "syslog"
    syslog_address: "logs.example.com:514"
Secrets Management
Never commit secrets to configuration files:
# ❌ BAD - Secrets in config
database:
  password: "super-secret-password"  # Never!

# ✅ GOOD - Reference to secret
database:
  password_env: "DB_PASSWORD"  # From environment variable
Use secret management systems:
# Kubernetes Secret
apiVersion: v1
kind: Secret
metadata:
  name: agent-secrets
type: Opaque
data:
  db-password: <base64-encoded>
# Agent deployment references secret
env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: agent-secrets
        key: db-password
For sensitive API keys:
import os

from agentweave import SecureAgent


class MyAgent(SecureAgent):
    def __init__(self, config_path: str):
        super().__init__(config_path)

        # Load secrets from environment, not config
        self.api_key = os.environ.get("EXTERNAL_API_KEY")
        if not self.api_key:
            raise ValueError("EXTERNAL_API_KEY environment variable required")
Validate Configurations
Always validate before deploying:
# Validate configuration
agentweave validate configs/production.yaml

# Run validation in CI/CD
agentweave validate configs/*.yaml || exit 1
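Beyond schema validation, a team may want a small policy check of its own that rejects production configs which contradict this guide. The sketch below works on an already-parsed config dictionary and uses only keys shown in this document; the function name and the messages are assumptions, and it is meant as a complement to `agentweave validate`, not a replacement.

```python
def production_config_problems(config: dict) -> list:
    """Return human-readable problems found in a parsed production config."""
    problems = []

    identity = config.get("identity", {})
    if identity.get("provider") != "spire":
        problems.append("identity.provider must be 'spire' in production")
    if "*" in identity.get("allowed_trust_domains", []):
        problems.append("identity.allowed_trust_domains must not contain '*'")

    if config.get("authorization", {}).get("default_action") != "deny":
        problems.append("authorization.default_action must be 'deny'")

    if config.get("transport", {}).get("verify_peer") is False:
        problems.append("transport.verify_peer must not be false")

    audit_log = config.get("observability", {}).get("audit_log", {})
    if not audit_log.get("enabled", False):
        problems.append("observability.audit_log.enabled must be true")

    return problems
```

A CI job could load each production config file, call this function, and fail the build when the returned list is not empty.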
Operational Security
Enable Audit Logging
Always enable audit logging in production:
# ✅ GOOD - Comprehensive audit logging
observability:
  audit_log:
    enabled: true
    destination: "syslog"
    syslog_address: "logs.example.com:514"
    syslog_protocol: "tcp"
    level: "info"
    include_payloads: false  # Don't log sensitive data
    fields:
      - "timestamp"
      - "caller_spiffe_id"
      - "callee_spiffe_id"
      - "capability"
      - "action"
      - "decision"
      - "trace_id"
See Audit Logging for the complete guide.
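The idea behind an explicit field list with payloads excluded is that an audit record only ever contains allowlisted keys. A minimal sketch of that filtering follows; `to_audit_record` is a hypothetical helper for illustration and not how AgentWeave builds its audit entries.

```python
# Field names taken from the audit_log.fields list above
AUDIT_FIELDS = (
    "timestamp",
    "caller_spiffe_id",
    "callee_spiffe_id",
    "capability",
    "action",
    "decision",
    "trace_id",
)


def to_audit_record(event: dict) -> dict:
    """Keep only allowlisted fields so payloads and secrets never reach the log."""
    return {key: event[key] for key in AUDIT_FIELDS if key in event}
```

Allowlisting fields is generally safer than trying to strip known-sensitive keys, because new payload fields are excluded by default.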
Monitor for Anomalies
Set up security monitoring:
# Prometheus alert rules
groups:
  - name: agentweave-security
    rules:
      # High rate of denials
      - alert: HighAuthzDenialRate
        expr: rate(agentweave_authz_denied_total[5m]) > 10
        for: 5m
        annotations:
          summary: "Unusually high authorization denial rate"

      # Unknown callers
      - alert: UnknownCallerAttempt
        expr: agentweave_authz_denied_total{reason="unknown_caller"} > 0
        annotations:
          summary: "Unknown agent attempted access"

      # SVID issues
      - alert: SVIDRotationFailure
        expr: agentweave_svid_rotation_errors_total > 0
        annotations:
          summary: "SVID rotation failed"

      # Unusual capability usage
      - alert: UnusualCapabilityUsage
        expr: rate(agentweave_capability_calls_total{capability="admin"}[1h]) > 1
        annotations:
          summary: "Unusual admin capability usage pattern"
Incident Response Plan
Have a documented incident response plan:
- Detection: How do you detect security incidents?
  - Alerts from monitoring
  - Audit log analysis
  - User reports
- Containment: How do you limit damage?
  - Revoke compromised SVIDs
  - Update OPA policies to deny access
  - Isolate affected agents
- Investigation: How do you determine what happened?
  - Review audit logs
  - Check distributed traces
  - Analyze authorization decisions
- Recovery: How do you restore normal operations?
  - Issue new SVIDs
  - Update policies
  - Restart agents
- Post-Incident: How do you prevent recurrence?
  - Root cause analysis
  - Update policies
  - Improve monitoring
Example: Revoking Compromised Agent
# Step 1: Look up the SPIRE registration entry, then delete it by entry ID
# (spire-server entry delete takes -entryID, not -spiffeID).
# Already-issued SVIDs stay valid until they expire, so Steps 2-3 block access meanwhile.
spire-server entry show \
    -spiffeID spiffe://example.com/agent/compromised
spire-server entry delete \
    -entryID "$ENTRY_ID"  # set ENTRY_ID to the Entry ID printed by the command above

# Step 2: Ban in OPA
curl -X PUT http://localhost:8181/v1/data/banned_agents \
    -d '["spiffe://example.com/agent/compromised"]'

# Step 3: Update policy to deny
cat <<EOF | curl -X PUT http://localhost:8181/v1/policies/authz --data-binary @-
package agentweave.authz

import rego.v1

default allow := false

# Deny banned agents
allow if {
    not input.caller_spiffe_id in data.banned_agents
    # ... other rules
}
EOF

# Step 4: Verify access denied
agentweave authz check \
    --caller spiffe://example.com/agent/compromised \
    --callee spiffe://example.com/agent/target
Code Security
Input Validation
Validate all inputs at capability boundaries:
import re

# Pydantic v1 API; Pydantic v2 replaces validator with field_validator
from pydantic import BaseModel, Field, validator

# Import path for the capability decorator is assumed
from agentweave import SecureAgent, capability


class ProcessRequest(BaseModel):
    """Validated input for process capability."""

    user_id: str = Field(..., min_length=1, max_length=100)
    data: str = Field(..., max_length=10000)

    @validator('user_id')
    def validate_user_id(cls, v):
        # Only allow alphanumeric and hyphens; fullmatch also rejects a trailing newline
        if not re.fullmatch(r'[a-zA-Z0-9-]+', v):
            raise ValueError("Invalid user_id format")
        return v


class DataProcessor(SecureAgent):
    @capability(
        name="process",
        description="Process user data"
    )
    async def process(self, request: ProcessRequest) -> dict:
        # Input is validated by Pydantic
        return await self._process_internal(request)
Error Handling Without Info Leakage
Don't expose internal details in errors:
# ❌ BAD - Leaks internal info
@capability(name="query")
async def query(self, sql: str):
    try:
        return await self.db.execute(sql)
    except DatabaseError as e:
        # Exposes database schema, credentials, etc.
        raise CapabilityError(f"Database error: {e}")


# ✅ GOOD - Generic error, log details internally
@capability(name="query")
async def query(self, query_id: str):
    try:
        return await self.execute_safe_query(query_id)
    except DatabaseError as e:
        # Log full error internally
        self.logger.error(
            "Database error during query",
            extra={
                "query_id": query_id,
                "error": str(e),
                "trace_id": self.context.trace_id
            }
        )
        # Return generic error to caller
        raise CapabilityError("Query failed. Check logs for details.")
Dependency Scanning
Scan dependencies regularly:
# Install safety
pip install safety
# Scan for known vulnerabilities
safety check --json
# Use in CI/CD
safety check || exit 1
# GitHub Actions
- name: Dependency Security Scan
  run: |
    pip install safety
    # Redirect stdout to a file; the meaning of --output differs between safety versions
    safety check --json > safety-report.json
Secure Coding Checklist
- All inputs validated
- No SQL injection possible (use parameterized queries)
- No command injection (don't use shell=True)
- No path traversal (validate file paths)
- Secrets not in code or logs
- Error messages don't leak info
- Dependencies scanned for vulnerabilities
- Code reviewed for security issues
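Two of the checklist items, path traversal and command injection, are easy to get wrong, so a short sketch may help. It uses only the Python standard library; the helper names are hypothetical, and `Path.is_relative_to` assumes Python 3.9 or newer.

```python
import subprocess
import sys
from pathlib import Path


def resolve_inside(base_dir: Path, user_path: str) -> Path:
    """Resolve user_path under base_dir and refuse anything that escapes it."""
    base = base_dir.resolve()
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError("path escapes the allowed directory")
    return candidate


def echo_untrusted(user_text: str) -> str:
    """Pass untrusted text as a separate argv element; no shell ever parses it."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", user_text],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")
```

Because the command is given as a list and `shell=True` is not used, shell metacharacters in `user_text` are treated as ordinary characters.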
Kubernetes Security
Pod Security Standards
Use restricted Pod Security Standard:
apiVersion: v1
kind: Namespace
metadata:
  name: agentweave
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
Security Context
securityContext:
  # Pod-level
  runAsNonRoot: true
  runAsUser: 10001
  fsGroup: 10001
  seccompProfile:
    type: RuntimeDefault
containers:
  - name: agent
    securityContext:
      # Container-level
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 10001
      capabilities:
        drop:
          - ALL
Network Policies
Restrict network access:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: agent-netpol
spec:
  podSelector:
    matchLabels:
      app: my-agent
  policyTypes:
    - Ingress
    - Egress
  ingress:
    # Only from other agents
    - from:
        - podSelector:
            matchLabels:
              app.kubernetes.io/part-of: agentweave
  egress:
    # To SPIRE
    # kubernetes.io/metadata.name is set automatically on every namespace;
    # a plain "name" label only matches if you added it yourself
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: spire-system
      ports:
        - protocol: TCP
          port: 8081
    # To other agents
    - to:
        - podSelector:
            matchLabels:
              app.kubernetes.io/part-of: agentweave
      ports:
        - protocol: TCP
          port: 8443
    # DNS (UDP, with TCP as fallback for large responses)
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
Resource Limits
Prevent resource exhaustion:
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "512Mi"
    cpu: "500m"
Summary
Critical Security Practices:
- ✅ Use SPIRE identity in production (never mock)
- ✅ Default deny authorization
- ✅ TLS 1.3 minimum
- ✅ Enable audit logging
- ✅ Short SVID TTL (1 hour)
- ✅ Regular policy reviews
- ✅ Monitor for anomalies
- ✅ Validate all inputs
- ✅ Never commit secrets
- ✅ Run as non-root with restricted permissions
Next Steps:
- Review Threat Model to understand what you're protecting against
- Set up Audit Logging for security monitoring
- Check Compliance for regulatory requirements
- See the main Security Guide for deployment details