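System Design Interview Guide - Go Agent Infrastructure

Framework Overview

Every system design question follows the HIRED pattern:

H = Hear & Clarify        (2 min)  - ask questions, clarify requirements
I = Initial Architecture  (3 min)  - draw the diagram, list the components
R = Reason about Options  (2 min)  - trade-off analysis
E = Enumerate Choices     (2 min)  - what else was considered
D = Deep Dive             (11 min) - pick one area and go deep

Total time: 15-20 minutes of a 45-minute interview (the budgets above sum to 20)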

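Question 1: Design a customer support copilot with Go backend

H - Hear & Clarify

Questions:

Daily active users? (100K? 1M?)
Concurrent users? (peak QPS?)
Latency SLA? (<1s? <5s?)
Availability? (99% or 99.9%?)
Knowledge base size? (100 docs? 10K docs?)
Which tools? (look up a ticket, create a ticket, search the FAQ, etc.)
Is streaming required? (yes, for good UX)
Is conversation history required? (multi-turn dialogue?)

Assumptions:

1000 concurrent users
10 QPS peak
<2s latency
99.9% availability
5000 FAQ + policy documents
3 core tools: search_kb, get_ticket, create_ticket_draft
Streaming responses required
No conversation history (stateless)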

I - Initial Architecture

graph TB
    subgraph Client["📱 Client Layer"]
        Web["🌐 Web Browser"]
        Mobile["📱 Mobile App"]
    end
    
    subgraph Gateway["🚪 Gateway Layer"]
        LB["Load Balancer<br/>nginx/ALB"]
        RateLimit["Rate Limiter<br/>100 req/sec/user"]
    end
    
    subgraph API["🔵 API Layer"]
        APIServer["API Server<br/>Go + Chi<br/>3-5 instances"]
        Auth["Auth Service<br/>JWT validation"]
        UserCtx["User Context<br/>tenant, role, permissions"]
    end
    
    subgraph Logic["⚙️ Application Logic"]
        Agent["Agent Runtime<br/>LLM + Tool orchestration"]
        RAG["RAG Engine<br/>Retrieval + Rerank"]
        Approval["Approval Service<br/>Draft + Human approval"]
        Guardrails["Guardrails<br/>4-layer protection"]
    end
    
    subgraph Data["💾 Data Layer"]
        Cache["Redis Cache<br/>embeddings<br/>retrieval results"]
        DB["PostgreSQL<br/>states, logs<br/>approvals"]
        Vector["Qdrant<br/>vector index<br/>semantic search"]
        Storage["S3/Blob<br/>documents<br/>artifacts"]
    end
    
    subgraph External["🌍 External Services"]
        LLM["OpenAI API<br/>GPT-4"]
        Email["Email Service<br/>notifications"]
    end
    
    subgraph Monitoring["📊 Observability"]
        Tracing["Jaeger<br/>trace requests"]
        Metrics["Prometheus<br/>latency, QPS"]
        Logs["Loki<br/>structured logs"]
    end
    
    Client -->|HTTP/WS| Gateway
    Gateway -->|Auth| API
    API -->|Process| Logic
    Logic -->|Query| Data
    Logic -->|Call| External
    API -->|Emit| Monitoring
    
    style Client fill:#f0f0f0
    style Gateway fill:#fff4e1
    style API fill:#e1f5ff
    style Logic fill:#e1ffe1
    style Data fill:#ffe1f5
    style External fill:#f5e1ff
    style Monitoring fill:#e1e1ff

Main components:

Load Balancer - distributes traffic
API Server - handles requests, 3-5 instances
Cache Layer - Redis; caches embeddings and retrieval results
Database - PostgreSQL; stores state, logs, and the audit trail
Vector DB - Qdrant; stores chunk embeddings
RAG Engine - retrieval, rerank, citation
LLM Client - OpenAI API calls
Tool Registry - tool management and execution

R - Reason about Options

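Choice 1: API Gateway vs Load Balancer

Load Balancer (chosen):
+ Simple; distributes traffic directly
+ Sufficient for basic routing

API Gateway (not chosen):
+ Can handle rate limiting, auth, and logging
- Adds complexity and latency
- Over-engineering if the advanced features are not needed

Choice 2: Vector DB

Qdrant (chosen):
+ Good Go client support
+ Scales from a single node to a cluster
+ Adequate performance

PgVector (not chosen):
+ Lives inside PostgreSQL
- Search performance falls short of a dedicated vector DB

Pinecone (not chosen):
+ Managed cloud service, no operations work
- High cost, and not under our own control

Choice 3: Caching strategy

Three-layer cache (chosen):
1. Embedding cache (TTL: 1 week)
2. Retrieval cache (TTL: 1 hour)
3. LLM safe-answer cache (TTL: 1 day)
+ Greatly reduces API calls
+ Lowers cost and latency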
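
Each cache layer boils down to a key-value store whose entries expire after a layer-specific TTL. The sketch below is a hypothetical in-memory stand-in for one Redis layer; `TTLCache` and its methods are illustrative names, not part of the design above, and a real deployment would presumably rely on Redis key expiry instead.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// entry is one cached value together with its absolute expiry time.
type entry struct {
	value     string
	expiresAt time.Time
}

// TTLCache is a hypothetical in-memory stand-in for a single cache layer.
type TTLCache struct {
	mu    sync.Mutex
	ttl   time.Duration
	items map[string]entry
}

// NewTTLCache creates a cache whose entries live for ttl.
func NewTTLCache(ttl time.Duration) *TTLCache {
	return &TTLCache{ttl: ttl, items: make(map[string]entry)}
}

// Set stores value under key for the cache's TTL.
func (c *TTLCache) Set(key, value string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.items[key] = entry{value: value, expiresAt: time.Now().Add(c.ttl)}
}

// Get returns the value and true on a fresh hit; expired entries are evicted.
func (c *TTLCache) Get(key string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.items[key]
	if !ok {
		return "", false
	}
	if !time.Now().Before(e.expiresAt) {
		delete(c.items, key)
		return "", false
	}
	return e.value, true
}

func main() {
	// One instance per layer, each with its own TTL.
	embeddings := NewTTLCache(7 * 24 * time.Hour)
	retrieval := NewTTLCache(time.Hour)
	answers := NewTTLCache(24 * time.Hour)

	retrieval.Set("refund policy", "chunk-17,chunk-42")
	v, ok := retrieval.Get("refund policy")
	fmt.Println(v, ok)

	_, okE := embeddings.Get("refund policy")
	_, okA := answers.Get("refund policy")
	fmt.Println(okE, okA)
}
```

Keeping the layers as separate instances makes it easy to tune or invalidate one TTL without touching the others.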

E - Enumerate Choices

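Options considered:

Streaming vs non-streaming responses
- Decision: streaming
- Rationale: better UX; users see progress in real time
Agent loop timeout
- Decision: 2s total, 500ms per step
- Rationale: balances user experience against cost
Parallel tool calls
- Decision: no (serial execution)
- Rationale: tools may depend on each other; keeps the logic simple
Approval
- Decision: create_ticket requires approval
- Rationale: write operations must be handled with care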
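
The timeout decision (2s total, 500ms per step) maps naturally onto nested `context` deadlines. The following is a minimal sketch under that assumption; `Step`, `runStep`, `runAgentLoop`, `slowTool`, and `timedOut` are hypothetical names introduced for illustration only.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Step is one unit of agent work (an LLM call or a tool call).
type Step func(ctx context.Context) error

// runStep runs fn under a per-step deadline derived from ctx, so a step can
// never outlive the overall request budget.
func runStep(ctx context.Context, perStep time.Duration, fn Step) error {
	stepCtx, cancel := context.WithTimeout(ctx, perStep)
	defer cancel()

	done := make(chan error, 1) // buffered: the goroutine never blocks on send
	go func() { done <- fn(stepCtx) }()

	select {
	case err := <-done:
		return err
	case <-stepCtx.Done():
		return stepCtx.Err()
	}
}

// runAgentLoop runs steps serially under a total budget and reports how many
// steps completed before the first failure.
func runAgentLoop(total, perStep time.Duration, steps []Step) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), total)
	defer cancel()
	for i, step := range steps {
		if err := runStep(ctx, perStep, step); err != nil {
			return i, err
		}
	}
	return len(steps), nil
}

// slowTool is a hypothetical tool call that takes d unless cancelled first.
func slowTool(d time.Duration) Step {
	return func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// timedOut reports whether err was caused by an expired deadline.
func timedOut(err error) bool { return errors.Is(err, context.DeadlineExceeded) }

func main() {
	steps := []Step{slowTool(10 * time.Millisecond), slowTool(time.Second)}
	n, err := runAgentLoop(2*time.Second, 500*time.Millisecond, steps)
	fmt.Println(n, timedOut(err))
}
```

Because each step context is a child of the request context, whichever deadline comes first wins, and a serial loop needs no extra bookkeeping.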

D - Deep Dive (Latency Optimization)

Current state:

Average latency: 1.5s
Target: <1s

Breakdown:

Retrieval: 400ms
LLM:       700ms
Processing: 200ms
Network:   200ms
────────────────
Total:    1500ms

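Optimizations:

Retrieval (400 → 150ms)

// Serve hot queries from the cache before doing any search work (~5ms)
if cached := getCache(query); cached != nil {
    return cached
}

// Run vector and keyword search in parallel. Buffered channels let each
// goroutine finish even if the caller stops waiting.
vectorCh := make(chan []Chunk, 1)
keywordCh := make(chan []Chunk, 1)
go func() { vectorCh <- vectorSearch(query) }()
go func() { keywordCh <- keywordSearch(query) }()

// Wait for both result sets, then merge them
merged := merge(<-vectorCh, <-keywordCh)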
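
The `merge` step for vector and keyword results is not specified here. One common option, swapped in purely as an illustration, is reciprocal rank fusion (RRF), which scores each chunk by the sum of 1/(k + rank) across the result lists. The function name `mergeRRF`, the chunk IDs, and the constant k = 60 are assumptions for this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// mergeRRF fuses ranked ID lists with reciprocal rank fusion:
// score(id) = sum over lists of 1 / (k + rank), where rank starts at 1.
func mergeRRF(k float64, lists ...[]string) []string {
	scores := make(map[string]float64)
	for _, list := range lists {
		for i, id := range list {
			scores[id] += 1.0 / (k + float64(i+1))
		}
	}

	ids := make([]string, 0, len(scores))
	for id := range scores {
		ids = append(ids, id)
	}

	// Highest score first; ties are broken by ID so the output is deterministic.
	sort.Slice(ids, func(a, b int) bool {
		if scores[ids[a]] != scores[ids[b]] {
			return scores[ids[a]] > scores[ids[b]]
		}
		return ids[a] < ids[b]
	})
	return ids
}

func main() {
	vector := []string{"c1", "c2", "c3"}  // hypothetical vector-search ranking
	keyword := []string{"c2", "c4", "c1"} // hypothetical keyword-search ranking
	fmt.Println(mergeRRF(60, vector, keyword))
}
```

RRF only needs ranks, not comparable scores, which is convenient when the two searches return similarity values on different scales.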

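LLM (700 → 400ms)

// Use a small model where it suffices (classification, rerank, routine questions).
// answer is declared outside the branches so it stays in scope afterwards.
var answer string
if isRoutineQuestion(query) {
    // GPT-4o mini: ~200ms
    answer = smallModel.Generate(query)
} else {
    // High-risk questions go to GPT-4 Turbo: ~400ms
    answer = largeModel.Generate(query)
}

// Context compression: send fewer tokens to the model
compressed := compressChunks(topChunks)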

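Parallelization (about 30% overall improvement)

// While the LLM works on the first batch, retrieve and rerank the next
// batch of candidates in the background.
rerankedCh := make(chan []Chunk, 1)
go func() {
    rerankedCh <- rerank(retrieve(query))
}()

llmResult := llm.Generate(firstBatchChunks)
reranked := <-rerankedCh // usually ready by the time the LLM returns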

Network (200 → 50ms)
- Deploy the vector DB in the same data center
- Reduce cross-region calls
- Reuse connections through a connection pool

New latency breakdown:

Retrieval:     150ms (cache hit: 5ms)
LLM:           400ms
Processing:    150ms
Network:        50ms
────────────────
Total avg:     750ms ✓
p99:          1200ms ✓

Metrics:

// Record the latency of every stage
func trackLatency(stage string, duration time.Duration) {
    metrics.recordLatency(stage, duration.Milliseconds())
}

// Key metrics
- retrieval_latency_p50/p95/p99
- llm_latency_p50/p95/p99
- cache_hit_rate
- end_to_end_latency_p99 < 1000ms (alert if > 1200ms)
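
To make the p50/p95/p99 metrics concrete, here is a minimal sketch of a nearest-rank percentile over raw samples. The `percentile` helper and the sample latencies are hypothetical; in practice a metrics backend such as Prometheus would typically derive these from histograms rather than raw values.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile (0 < p <= 100) of samples using the
// nearest-rank method. It returns 0 for an empty input.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...) // copy; keep the caller's order
	sort.Float64s(sorted)

	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

func main() {
	// Hypothetical end-to-end latencies in milliseconds.
	latencies := []float64{620, 700, 710, 740, 750, 760, 790, 820, 900, 1250}

	p99 := percentile(latencies, 99)
	fmt.Printf("p50=%.0f p95=%.0f p99=%.0f alert=%v\n",
		percentile(latencies, 50), percentile(latencies, 95), p99, p99 > 1200)
}
```

A single slow request dominates p99 in a small sample, which is why tail-latency alerts are usually evaluated over a window with many requests.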

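Question 2: Design an internal enterprise agent platform

H - Hear & Clarify

Questions:

How many departments? (5? 50?)
How many agents per department? (1? 10?)
Are any tools shared across departments?
How complex is the permission model? (simple RBAC or attribute-based ACLs?)
Compliance requirements? (SOC2? HIPAA?)
Cost budget?
Deployment model? (single region or multi-region?)
Agent update frequency? (real time or daily?)

Assumptions:

10 departments
3-5 agents per department
A shared tool library
RBAC + department-boundary isolation
Compliance: complete audit logs
Monthly cost budget: $50K
Single region (extensible to multi-region)
Agent updates: push to production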

I - Initial Architecture

graph TB
    subgraph Tenant["🏢 Multi-Tenant"]
        HR["HR Dept<br/>Agents"]
        Eng["Engineering<br/>Agents"]
        Sales["Sales Dept<br/>Agents"]
    end
    
    subgraph Core["🔷 Core Platform"]
        API["API Gateway"]
        AgentMgr["Agent Manager<br/>CRUD agents"]
        Workflow["Workflow Engine<br/>State machine"]
        RAG["RAG Service<br/>Shared KB"]
    end
    
    subgraph Tools["🔧 Tool Ecosystem"]
        Registry["Tool Registry<br/>discover tools"]
        MCP["MCP Server<br/>standardized interface"]
        Executor["Tool Executor<br/>with RBAC"]
    end
    
    subgraph Security["🔐 Security"]
        RBAC["Role-Based<br/>Access Control"]
        TenantIsolation["Tenant Isolation<br/>data + compute"]
        Guardrails["Guardrails<br/>input + output"]
    end
    
    subgraph Data["💾 Shared Data"]
        PG["PostgreSQL<br/>state, audit"]
        Redis["Redis<br/>cache"]
        Vector["Vector DB<br/>embeddings"]
    end
    
    subgraph Observability["📊 Monitoring"]
        Audit["Audit Log<br/>compliance"]
        Metrics["Metrics<br/>cost tracking"]
    end
    
    Tenant -->|Use| Core
    Core -->|Manage| Tools
    Tools -->|Enforce| Security
    Core -->|Access| Data
    Core -->|Emit| Observability
    
    style Tenant fill:#fff4e1
    style Core fill:#e1f5ff
    style Tools fill:#e1ffe1
    style Security fill:#ffe1f5
    style Data fill:#f5e1ff
    style Observability fill:#e1e1ff

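Multi-tenant isolation:

// Each department is a tenant
type Tenant struct {
    ID          string
    Name        string
    Agents      []Agent
    ToolAccess  []string  // tools this tenant may use
    DataFilter  ACLRule   // data-level permissions
}

// Every data query is scoped by tenant. PostgreSQL placeholders are $1, $2, ...
// Assumes s.db is a *sql.DB; row scanning is left to the caller.
func (s *StateStore) GetByTenant(ctx context.Context, tenantID string) (*sql.Rows, error) {
    return s.db.QueryContext(ctx, "SELECT * FROM states WHERE tenant_id = $1", tenantID)
}

// API endpoints must resolve the tenant from the authenticated user
func (h *Handler) GetAgents(w http.ResponseWriter, r *http.Request) {
    user := getCurrentUser(r)
    tenantID := user.TenantID  // taken from the user context

    agents := agentService.ListByTenant(tenantID)
    w.Header().Set("Content-Type", "application/json")
    if err := json.NewEncoder(w).Encode(agents); err != nil {
        slog.Error("encode agents", "err", err)
    }
}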

Access control for shared tools:

// Per-tool permission matrix
type ToolPermission struct {
    ToolID   string
    TenantID string
    Access   string  // "allow" / "deny"
}

// Check permission before executing a tool. Deny by default: a lookup error
// or a missing entry must not fall through to execution.
func (tr *ToolRegistry) Execute(ctx context.Context, toolName string) error {
    tenantID := getTenantFromContext(ctx)

    perm, err := permStore.Get(toolName, tenantID)
    if err != nil || perm.Access != "allow" {
        return fmt.Errorf("tenant %s not allowed to use %s", tenantID, toolName)
    }

    return tr.execute(toolName)
}

R - Reason about Options

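Choice 1: Multi-tenant storage

Option A - separate database per tenant (not chosen):
+ Complete isolation
+ Departments can be backed up / migrated independently
- Operationally complex

Option B - same database, separate schemas (chosen):
+ Simple operations
+ Shared infrastructure
+ Isolation is still possible at the schema level

Decision: Option B with schema-level isolation

Choice 2: Agent deployment model

Option A - one instance per agent (not chosen):
+ Isolation
- Wastes resources

Option B - all agents share instances, with logical isolation (chosen):
+ Resource efficient
+ Easy to scale
- Requires stricter permission checks

Decision: Option B, with namespace/tenant isolation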

E - Enumerate Choices

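Shared vs dedicated RAG
- Decision: shared RAG + permission filtering
- Rationale: cost, and easier cross-department queries
Agent update method
- Decision: blue/green deployment
- Rationale: seamless updates, fast rollback
Cost allocation
- Decision: tag costs by tenant
- Rationale: financial transparency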

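D - Deep Dive (Multi-Tenant Isolation)

Isolation dimensions:

1. Network isolation
   - Requests from different tenants are handled separately
   - Contexts are never mixed

2. Data isolation
   - Every query automatically gets WHERE tenant_id = ?
   - RAG results are filtered by permission

3. Audit isolation
   - Every operation records the tenant_id
   - A tenant can query its own logs

4. Cost isolation
   - Token counts are tracked per tenant
   - Reports are generated per tenant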
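
Cost isolation can be as simple as a per-tenant counter that every LLM call updates. The sketch below is one hypothetical way to do it in memory; `UsageTracker` and its methods are illustrative names, and a real platform would presumably persist these counts alongside the audit data.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// UsageTracker keeps a separate token counter per tenant.
type UsageTracker struct {
	mu     sync.Mutex
	tokens map[string]int64
}

func NewUsageTracker() *UsageTracker {
	return &UsageTracker{tokens: make(map[string]int64)}
}

// Add records n tokens consumed by tenantID.
func (u *UsageTracker) Add(tenantID string, n int64) {
	u.mu.Lock()
	defer u.mu.Unlock()
	u.tokens[tenantID] += n
}

// Total returns the tokens recorded for tenantID (0 if the tenant is unknown).
func (u *UsageTracker) Total(tenantID string) int64 {
	u.mu.Lock()
	defer u.mu.Unlock()
	return u.tokens[tenantID]
}

// Report returns one line per tenant, sorted by tenant ID.
func (u *UsageTracker) Report() []string {
	u.mu.Lock()
	defer u.mu.Unlock()

	ids := make([]string, 0, len(u.tokens))
	for id := range u.tokens {
		ids = append(ids, id)
	}
	sort.Strings(ids)

	lines := make([]string, 0, len(ids))
	for _, id := range ids {
		lines = append(lines, fmt.Sprintf("%s: %d tokens", id, u.tokens[id]))
	}
	return lines
}

func main() {
	u := NewUsageTracker()
	u.Add("hr", 1200)
	u.Add("eng", 800)
	u.Add("hr", 300)
	for _, line := range u.Report() {
		fmt.Println(line)
	}
}
```

Keying the counter by tenant ID means the same value that scopes data queries also drives the per-tenant cost report.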

Implementation details:

// Context propagation
type ContextKey string
const TenantContextKey ContextKey = "tenant_id"

func WithTenant(ctx context.Context, tenantID string) context.Context {
    return context.WithValue(ctx, TenantContextKey, tenantID)
}

// GetTenant returns "" when no tenant is set instead of panicking on the
// type assertion. Callers must treat "" as "no access".
func GetTenant(ctx context.Context) string {
    tenantID, _ := ctx.Value(TenantContextKey).(string)
    return tenantID
}

// Every DB operation carries a tenant filter
func (s *Service) GetChunks(ctx context.Context, query string) []Chunk {
    tenantID := GetTenant(ctx)

    // SELECT * FROM chunks
    // WHERE tenant_id = tenantID AND (...)
    chunks := s.db.Query("SELECT * FROM chunks WHERE tenant_id = $1 AND ...", tenantID)

    return chunks
}

// Check permission before executing a tool
func (tr *ToolRegistry) ExecuteWithTenantCheck(ctx context.Context, toolName string) error {
    tenantID := GetTenant(ctx)

    // Is this tenant allowed to use this tool?
    allowed := tr.checkToolAccess(tenantID, toolName)
    if !allowed {
        slog.Warn("tenant denied access", "tenant", tenantID, "tool", toolName)
        return fmt.Errorf("access denied")
    }

    return tr.execute(ctx, toolName)
}

// Audit log entries always include the tenant
func (a *AuditLogger) Log(ctx context.Context, event string, data any) {
    tenantID := GetTenant(ctx)

    // Serialize the payload so it can be stored in a JSON/text column
    payload, err := json.Marshal(data)
    if err != nil {
        slog.Error("audit: marshal failed", "event", event, "err", err)
        return
    }
    if _, err := a.db.Exec("INSERT INTO audit_logs (tenant_id, event, data, timestamp) VALUES ($1, $2, $3, NOW())",
        tenantID, event, payload); err != nil {
        slog.Error("audit: insert failed", "event", event, "err", err)
    }
}

Testing isolation:

func TestTenantIsolation(t *testing.T) {
    // Agent created by Tenant A
    agentA := createAgent("tenant-a", "support-bot")

    // Query with Tenant B's context
    ctxB := WithTenant(context.Background(), "tenant-b")
    agentsB := service.ListAgents(ctxB)

    // agentA must not be visible
    for _, agent := range agentsB {
        if agent.ID == agentA.ID {
            t.Fatalf("tenant B can see tenant A's agent!")
        }
    }
}

Isolation verification checklist:

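Question 3: Design security model for an AI agent in enterprise

High-level Security Model

Layer 1: Input
      ↓
      Prompt-injection defense, sensitive-term checks
      ↓
Layer 2: Authorization
      ↓
      User authentication, role checks, resource ACLs
      ↓
Layer 3: Execution
      ↓
      Tool risk assessment, approval flow
      ↓
Layer 4: Output
      ↓
      Sensitive-information filtering, citation verification
      ↓
Layer 5: Audit
      ↓
      Complete, traceable operation logs

Key Design Points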

1. Input Validation & Prompt Injection Defense

type InputGuardrail struct {
    injectionPatterns []string // must be stored lowercase; see Check
    maxInputLength    int      // in bytes
}

func (g *InputGuardrail) Check(input string) error {
    // Length limit (len counts bytes, not characters)
    if len(input) > g.maxInputLength {
        return fmt.Errorf("input too long")
    }

    // Pattern matching against the lowercased input
    lowerInput := strings.ToLower(input)
    for _, pattern := range g.injectionPatterns {
        if strings.Contains(lowerInput, pattern) {
            return fmt.Errorf("suspicious pattern detected: %s", pattern)
        }
    }

    // Extra: token check (guards against special-character attacks)
    if containsSuspiciousTokens(input) {
        return fmt.Errorf("invalid tokens")
    }

    return nil
}

2. Authorization - Check Early

func (h *Handler) ProcessRequest(w http.ResponseWriter, r *http.Request) {
    user := authenticateUser(r)
    if user == nil {
        http.Error(w, "unauthorized", http.StatusUnauthorized)
        return
    }

    // Check permissions immediately; do not defer this to later stages
    if !h.rbac.HasPermission(user.Role, PermViewKB) {
        http.Error(w, "forbidden", http.StatusForbidden)
        return
    }

    // Only requests that pass both checks continue. The user is attached to
    // the request context so downstream code can call GetUser(ctx).
    ctx := WithUser(r.Context(), user)
    h.handleChat(w, r.WithContext(ctx))
}
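
The handler above calls `h.rbac.HasPermission` without showing what sits behind it. As a minimal sketch, assuming roles map to a static permission set, it could look like the following; the role names, the `PermCreateTicket` constant, and the permission string values are hypothetical.

```go
package main

import "fmt"

type Role string
type Permission string

const (
	PermViewKB       Permission = "kb:view"       // string value assumed
	PermCreateTicket Permission = "ticket:create" // hypothetical permission
)

// RBAC maps each role to the set of permissions it grants.
type RBAC struct {
	grants map[Role]map[Permission]bool
}

func NewRBAC() *RBAC {
	return &RBAC{grants: make(map[Role]map[Permission]bool)}
}

// Grant gives role the listed permissions.
func (r *RBAC) Grant(role Role, perms ...Permission) {
	if r.grants[role] == nil {
		r.grants[role] = make(map[Permission]bool)
	}
	for _, p := range perms {
		r.grants[role][p] = true
	}
}

// HasPermission is deny-by-default: unknown roles and unlisted permissions
// both yield false, because reading a missing map key returns the zero value.
func (r *RBAC) HasPermission(role Role, perm Permission) bool {
	return r.grants[role][perm]
}

func main() {
	rbac := NewRBAC()
	rbac.Grant("support_agent", PermViewKB)
	rbac.Grant("support_lead", PermViewKB, PermCreateTicket)

	fmt.Println(rbac.HasPermission("support_agent", PermViewKB))
	fmt.Println(rbac.HasPermission("support_agent", PermCreateTicket))
	fmt.Println(rbac.HasPermission("guest", PermViewKB))
}
```

Deny-by-default matters here: a typo in a role name results in a 403 rather than silent access.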

3. Retrieval-Level Permission Filtering

type SecureRetriever struct {
    retriever RAGRetriever
    acl       ACLStore
}

func (sr *SecureRetriever) Search(ctx context.Context, query string) ([]Chunk, error) {
    user := GetUser(ctx)

    // Fetch all relevant chunks
    chunks, err := sr.retriever.Search(ctx, query)
    if err != nil {
        return nil, fmt.Errorf("retrieval failed: %w", err)
    }

    // Filter: return only what the user may access. ACL errors fail closed.
    var allowedChunks []Chunk
    for _, chunk := range chunks {
        hasAccess, err := sr.acl.CanAccess(user.ID, chunk.DocID)
        if err != nil || !hasAccess {
            // The LLM never sees this document
            slog.Info("chunk filtered", "user", user.ID, "doc", chunk.DocID, "err", err)
            continue
        }
        allowedChunks = append(allowedChunks, chunk)
    }

    return allowedChunks, nil
}

4. Risk Assessment for Tool Execution

type ToolExecutor struct {
    tools          ToolRegistry
    rbac           RBAC
    approvalStore  ApprovalStore
}

func (te *ToolExecutor) Execute(ctx context.Context, toolName string, input json.RawMessage) error {
    user := GetUser(ctx)
    tool := te.tools.Get(toolName)

    // Step 1: does the tool exist?
    if tool == nil {
        return fmt.Errorf("tool not found: %s", toolName)
    }

    // Step 2: may this user call the tool?
    if !te.rbac.CanExecuteTool(user.Role, toolName) {
        te.audit(user.ID, toolName, "denied")
        return fmt.Errorf("insufficient permission for tool: %s", toolName)
    }

    // Step 3: is this a high-risk operation?
    if tool.RiskLevel == "high" {
        // Requires approval: queue the action instead of executing it
        action := &PendingAction{
            ToolName:      toolName,
            Input:         input,
            CreatedBy:     user.ID,
            RequiresApprovalFrom: "admin",
        }
        te.approvalStore.Create(action)
        return fmt.Errorf("pending approval")
    }

    // Step 4: execute (the result is discarded because this signature
    // returns only an error)
    _, err := tool.Execute(ctx, input)

    // Step 5: record the actual outcome, not an unconditional "success"
    if err != nil {
        te.audit(user.ID, toolName, "error")
        return fmt.Errorf("tool %s failed: %w", toolName, err)
    }
    te.audit(user.ID, toolName, "success")

    return nil
}

5. Output Filtering & Citation Verification

type OutputValidator struct {
    sensitivePatterns *regexp.Regexp
    blacklistedWords  []string
}

func (ov *OutputValidator) Validate(answer string) error {
    // Check for sensitive information (emails, SSNs, etc.)
    if ov.sensitivePatterns.MatchString(answer) {
        return fmt.Errorf("output contains sensitive information")
    }

    // Check blacklisted words
    for _, word := range ov.blacklistedWords {
        if strings.Contains(answer, word) {
            return fmt.Errorf("blacklisted word in output: %s", word)
        }
    }

    return nil
}

// Citations must be verifiable
func (r *RAGEngine) GenerateWithValidation(ctx context.Context, chunks []Chunk) (*Answer, error) {
    answer, err := r.llm.Generate(ctx, chunks)
    if err != nil {
        return nil, fmt.Errorf("generation failed: %w", err)
    }

    // Parse the citations
    citations := extractCitations(answer.Text)

    // Verify that every citation points to a chunk that was actually supplied
    // (and therefore already passed the access filter)
    for _, cite := range citations {
        found := false
        for _, chunk := range chunks {
            if chunk.ID == cite.ChunkID {
                found = true
                break
            }
        }

        if !found {
            // Hallucinated citation
            return nil, fmt.Errorf("citation references non-existent chunk: %s", cite.ChunkID)
        }
    }

    return answer, nil
}
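
The `sensitivePatterns` regexp is left undefined above. As a hedged illustration, the sketch below uses two deliberately simple patterns (email addresses and US SSN-shaped numbers) and, instead of rejecting the whole answer as `Validate` does, shows redaction as an alternative. The patterns and the `Redact` function are assumptions for this example, not a complete PII detector.

```go
package main

import (
	"fmt"
	"regexp"
)

// Illustrative patterns only; real PII detection needs a broader rule set.
var (
	emailRe = regexp.MustCompile(`[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}`)
	ssnRe   = regexp.MustCompile(`\b\d{3}-\d{2}-\d{4}\b`)
)

// Redact masks emails and SSN-shaped numbers in answer and reports whether
// anything was changed.
func Redact(answer string) (string, bool) {
	out := emailRe.ReplaceAllString(answer, "[EMAIL]")
	out = ssnRe.ReplaceAllString(out, "[SSN]")
	return out, out != answer
}

func main() {
	out, changed := Redact("Contact jane.doe@example.com, SSN 123-45-6789.")
	fmt.Println(out, changed)
}
```

Rejecting is the safer default for high-risk outputs; redaction keeps the answer usable when the sensitive value is incidental.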

6. Comprehensive Audit Logging

type AuditLog struct {
    Timestamp   time.Time
    UserID      string
    Action      string
    Resource    string
    Result      string  // "success" / "denied" / "error"
    Reason      string  // why it failed
    Details     json.RawMessage
}

func (a *AuditLogger) LogAction(ctx context.Context, action string, resource string, result string) {
    user := GetUser(ctx)

    log := AuditLog{
        Timestamp: time.Now(),
        UserID:    user.ID,
        Action:    action,
        Resource:  resource,
        Result:    result,
    }

    a.db.Exec("INSERT INTO audit_logs ... VALUES ...", log)
}

// Auditable events:
// - User login/logout
// - Permission changes
// - Tool execution
// - Approval decisions
// - Data access
// - Configuration changes

7. Monitoring & Detection

type SecurityMonitor struct {
    alertThresholds map[string]int
}

func (sm *SecurityMonitor) DetectAnomalies(ctx context.Context) {
    // Anomalies to detect:

    // 1. High denial rate
    deniedCount := sm.getRecentDeniedCount()
    if deniedCount > sm.alertThresholds["max_denied_per_hour"] {
        sm.alert("HIGH_DENIAL_RATE")
    }

    // 2. Anomalous tool-call patterns
    toolCalls := sm.getToolCallsLastHour()
    if sm.isAnomalous(toolCalls) {
        sm.alert("ANOMALOUS_TOOL_USAGE")
    }

    // 3. Privilege-escalation attempts
    privEscAttempts := sm.countPrivEscAttempts()
    if privEscAttempts > 0 {
        sm.alert("PRIVILEGE_ESCALATION_ATTEMPT")
    }
}

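Interview Tips

Templates for Answering Follow-ups

Common follow-ups:

"What if you need to support more users?"

I would scale along several dimensions:
1. API layer: more instances behind the load balancer
2. Database: read replicas for queries, sharding for writes
3. Cache: more capacity and longer TTLs
4. Vector DB: cluster mode
5. Tool execution: a larger worker pool, or a job queue

"What is the biggest bottleneck?"

Right now it is LLM latency. Even after many optimizations (model selection, context compression), the OpenAI API response time is still the dominant cost. Possible mitigations are a local LLM or multi-model fallback.

"How do you handle failure recovery?"

Four lines of defense:
1. Fail fast; never leave the user hanging
2. Fall back to a degraded mode (small model, cached answers)
3. Circuit breakers to prevent cascading failures
4. Checkpoints and retries

"How do you control cost?"

Several levers:
1. Caching to cut API calls (roughly 50% cost savings)
2. Model selection (small models save money, large models are more accurate)
3. Token compression (smaller context)
4. Avoiding per-call API fees where possible (for example, self-hosted embeddings)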
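
The circuit breaker mentioned in the failure-recovery answer is easy to sketch if an interviewer asks for detail. The version below is a minimal, single-goroutine illustration under simple assumptions (a consecutive-failure threshold and a fixed cooldown); `Breaker`, `ErrOpen`, and `errUpstream` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrOpen is returned while the breaker is rejecting calls.
var ErrOpen = errors.New("circuit breaker open")

// errUpstream simulates a failing dependency such as the LLM API.
var errUpstream = errors.New("upstream failed")

// Breaker opens after maxFailures consecutive failures and rejects calls
// until cooldown has elapsed, then lets a trial call through.
// Not safe for concurrent use; add a mutex before sharing across goroutines.
type Breaker struct {
	maxFailures int
	cooldown    time.Duration
	failures    int
	openedAt    time.Time
}

func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Call runs fn unless the breaker is open.
func (b *Breaker) Call(fn func() error) error {
	if b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown {
		return ErrOpen // fail fast instead of waiting on a sick dependency
	}
	if err := fn(); err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now() // open (or re-open) and restart the cooldown
		}
		return err
	}
	b.failures = 0 // any success closes the breaker
	return nil
}

func main() {
	b := NewBreaker(2, time.Minute)
	failing := func() error { return errUpstream }

	fmt.Println(b.Call(failing)) // first failure
	fmt.Println(b.Call(failing)) // second failure opens the breaker
	fmt.Println(b.Call(failing)) // rejected without calling the dependency
}
```

When `Call` returns `ErrOpen`, the caller can go straight to the degraded mode (small model or cached answer) rather than retrying.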

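Common Mistakes to Avoid

❌ Over-engineering - "I'd use Kubernetes + Istio + ..."
✅ Pragmatism - "Start with docker-compose; consider K8s once QPS exceeds 100/s"

❌ Ignoring cost - "Just use the most expensive model"
✅ Cost awareness - "Use the cheapest option that does not hurt quality"

❌ Ignoring permissions - "All users are trusted"
✅ Security first - "Zero trust; check permissions even for internal users"

❌ Too much detail - "Walking through every column of the database schema"
✅ Focus on architecture - "The relationships and interactions between components"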

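Answer Checklist

After finishing a system design question, ask yourself:

Did I clarify all the non-functional requirements?
Is the architecture diagram clear?
Did I explain why I made each choice, not just how it works?
Is there a fallback/failover plan?
Did I consider cost?
Did I consider observability?
Was the deep dive deep enough?
Can I answer the follow-ups?

If any item cannot be ticked off (✓), keep improving the answer.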
