Quick Reference - Go Agent Infrastructure in 30 Days

🎯 Daily Cheat Sheet

Week 1

Day 1: Go project skeleton

// Graceful shutdown pattern
sigChan := make(chan os.Signal, 1)
signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)
<-sigChan
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
defer cancel()
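if err := httpServer.Shutdown(ctx); err != nil {
    slog.Error("graceful shutdown failed", "error", err)  // in-flight requests did not drain within 30s
}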

// Structured logging
logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))
slog.SetDefault(logger)
slog.Info("event", "key", "value")

Day 2: LLMClient Interface

type LLMClient interface {
	Generate(ctx context.Context, req GenerateRequest) (*GenerateResponse, error)
}

// Why: swappability + testability + error classification

Day 3: ToolRegistry

type ToolDefinition struct {
	Name        string
	Description string
	InputSchema map[string]interface{}
	RiskLevel   string  // low/medium/high
	Timeout     time.Duration
	Execute     func(ctx context.Context, input json.RawMessage) (interface{}, error)
}

Day 4: Agent Loop

User input → LLM decides → ToolCall?
  ├─ Yes: run the tool → feed the result back to the LLM → continue
  └─ No: return the answer

Day 5: Structured Output

type AgentAnswer struct {
	Answer           string
	Confidence       string    // "low" / "medium" / "high"
	Sources          []string
	ToolCalls        []string
	NeedsHumanReview bool
}
// Rules: empty Sources → Confidence must not be "high"
//        ticket creation → NeedsHumanReview must be true
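
The two rules can be enforced in a `Validate` method (the architecture overview lists `AgentAnswer.Validate()`). This is a minimal sketch under one assumption: ticket creation is detected by a tool call named `create_ticket`, a name the notes do not specify.

```go
package main

import (
	"errors"
	"fmt"
)

type AgentAnswer struct {
	Answer           string
	Confidence       string // "low" / "medium" / "high"
	Sources          []string
	ToolCalls        []string
	NeedsHumanReview bool
}

// ticketTool is an assumed tool name used to detect ticket creation.
const ticketTool = "create_ticket"

// Validate checks the two rules from the notes plus the allowed
// Confidence values.
func (a AgentAnswer) Validate() error {
	switch a.Confidence {
	case "low", "medium", "high":
	default:
		return fmt.Errorf("invalid confidence %q", a.Confidence)
	}
	if len(a.Sources) == 0 && a.Confidence == "high" {
		return errors.New("confidence cannot be high without sources")
	}
	for _, call := range a.ToolCalls {
		if call == ticketTool && !a.NeedsHumanReview {
			return errors.New("ticket creation requires human review")
		}
	}
	return nil
}

func main() {
	ok := AgentAnswer{Answer: "Reset it in Settings.", Confidence: "high", Sources: []string{"kb/reset.md"}}
	fmt.Println(ok.Validate())

	noSources := AgentAnswer{Answer: "Probably fine.", Confidence: "high"}
	fmt.Println(noSources.Validate())

	ticket := AgentAnswer{Answer: "Ticket opened.", Confidence: "medium", ToolCalls: []string{ticketTool}}
	fmt.Println(ticket.Validate())
}
```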

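Day 6: Retry Pattern

// Exponential backoff: grow the delay by 1.5x after each failed attempt.
// Assumes backoff is a time.Duration, where "backoff *= 1.5" does not compile.
backoff = time.Duration(float64(backoff) * 1.5)

// Check for timeout or cancellation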
select {
case <-ctx.Done():
    return ctx.Err()
default:
}

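// Concurrent deduplication (Get and Set must run under one lock, or two requests can both miss)
r.cache.Get(idempotencyKey)  // check whether this key was already executed
r.cache.Set(idempotencyKey, result)  // cache the result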

🏗️ Architecture Overview

┌─────────────────────────────────────────┐
│           HTTP Handler (/chat)          │
├─────────────────────────────────────────┤
│  Agent Runtime                          │
│  ├─ Conversation Management             │
│  ├─ LLM Client (with timeout/retry)     │
│  ├─ Tool Registry                       │
│  └─ Loop Control (max iterations)       │
├─────────────────────────────────────────┤
│  Tool Execution                         │
│  ├─ Tool Not Found → Error              │
│  ├─ Tool Timeout → Error + Backoff      │
│  └─ Tool Success → Result to LLM        │
├─────────────────────────────────────────┤
│  Output Validation                      │
│  └─ AgentAnswer.Validate()              │
└─────────────────────────────────────────┘

🔑 Key Go Interview Concepts

The context.Context family

context.Background()              // root context
context.WithTimeout(parent, 5*time.Second)  // with a deadline
context.WithCancel(parent)        // can be cancelled by hand
context.WithValue(parent, key, value)       // carries request-scoped data

// Usage
func foo(ctx context.Context) error {
    select {
    case <-ctx.Done():
        return ctx.Err()  // timed out or cancelled
    case <-time.After(1*time.Second):
        return nil
    }
}

Interface design

// ✅ Good: abstracts the behavior, easy to swap
type LLMClient interface {
    Generate(ctx context.Context, req Request) (*Response, error)
}

// ❌ Bad: depends directly on a concrete implementation, hard to swap
func useOpenAI(client *openai.Client) { }

Error Handling

// ✅ Good: classify errors
type ErrorType string
const (
    ErrorTypeTimeout ErrorType = "timeout"
    ErrorTypePermission ErrorType = "permission"
)

// ❌ Bad: only return the bare error
if err != nil {
    return err  // the caller cannot tell error types apart
}
}

Goroutine safety

// ✅ Good: use sync.Map or guard the map with a lock
var mu sync.Mutex
m := make(map[string]string)
mu.Lock()
m[key] = value
mu.Unlock()

// ❌ Bad: concurrent map access without a lock
m[key] = value  // data race

🛠️ Common Code Snippets

HTTP Middleware

func LoggingMiddleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        start := time.Now()
        next.ServeHTTP(w, r)
        duration := time.Since(start)
        slog.Info("request", "method", r.Method, "path", r.URL.Path, "duration_ms", duration.Milliseconds())
    })
}

Tool Execution with Timeout

ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
defer cancel()

result, err := toolRegistry.Execute(ctx, toolName, input)
if err != nil {
    if errors.Is(err, context.DeadlineExceeded) {
        slog.Error("tool timeout", "tool", toolName)
    }
}

Retry Loop

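var err error
for attempt := 0; attempt < maxRetries; attempt++ {
    // Stop early if the caller has already given up.
    select {
    case <-ctx.Done():
        return ctx.Err()
    default:
    }

    if err = operation(); err == nil {
        return nil
    }

    if attempt < maxRetries-1 {
        // Back off 1s, 2s, 4s, ... and wake up early on cancellation.
        select {
        case <-time.After(time.Duration(math.Pow(2, float64(attempt)))*time.Second):
        case <-ctx.Done():
            return ctx.Err()
        }
    }
}
return err  // every attempt failed: surface the last error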

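📊 Week 1 Checklist

Code quality

No race conditions (run go test -race)
No memory leaks (every ctx is cancelled properly)
No nil panics (nil checks in place)
No deadlocks (channels used correctly)

Functional completeness

/health returns the correct format
/chat runs the complete agent loop
All errors are classified and logged
Graceful shutdown drops no requests

Clear documentation

README explains how to run the project
Code comments explain why, not what
There is a system design diagram
There is a troubleshooting guide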

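🎤 30-Second Pitch Template

I built an enterprise customer-support agent in Go. It has three core parts.

The first is the Agent Runtime: after the user sends input, the LLM decides whether to call a tool; if it does, the tool is executed through the Tool Registry and the result is fed back to the LLM to generate the final answer. The key work here is handling timeouts, retries, and error recovery.

The second is the RAG system (Week 2): it retrieves relevant documents from the knowledge base, using vector retrieval plus reranking, to make sure the material the LLM cites in its answers is accurate.

The third is security and observability (Weeks 3-4): write operations such as ticket creation require human approval; permission checks run at retrieval time rather than as after-the-fact filtering; every request is traced by a trace_id, and every tool call is recorded in metrics.

My positioning is not a chat demo but an agent designed as a production-grade system: stateful, permissioned, approval-gated, evaluated, and observable.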

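🔄 Weekly Review Questions

At the end of Week 1, ask yourself:

What happens at each step of the Agent Loop?
Why use an interface instead of calling the SDK directly?
If a tool times out, how does the system respond?
How do you prevent duplicate ticket creation?
How does the system recover after a failure?
Are tool calls serial or parallel? (Answered in Week 2)

Can't answer one? Then the Day X material needs another read.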

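⚠️ Common Pitfalls

The three main sources of panics

Nil dereference - forgot the nil check
```
if root == nil { return nil }
```
Slice index out of range - forgot the len check
```
if i >= 0 && i < len(s) { x := s[i] }
```

Map zero value - a missing key and a stored zero value look the same (reading never panics, but writing to a nil map does)

if v, ok := m[key]; ok { /* the key really exists */ }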

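Goroutine leaks

// ❌ Bad: nothing ever cancels ctx, so the goroutine waits forever
go func() {
    <-ctx.Done()  // blocks permanently if ctx has no timeout and is never cancelled
}()

// ✅ Good: give the context a timeout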
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()

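Mishandled timeouts

// ❌ Bad: keeps using result after a timeout (and never calls cancel)
ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
// ... operation might time out ...
if err := operation(ctx); err != nil {
    // but result may hold incomplete data
}

// ✅ Good: handle timeout and success separately
ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
defer cancel()
result, err := operation(ctx)
if errors.Is(err, context.DeadlineExceeded) {  // errors.Is also matches wrapped errors
    // handle the timeout; do not use result
}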

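📈 Progress Tracking

Create PROGRESS.md in the project root:

# Learning Progress

- [ ] Day 1: Go skeleton - 2h
- [ ] Day 2: LLM Client - 1.5h
- [ ] Day 3: Tool Registry - 3h
- [ ] ...

## Key milestones
- [ ] v0.1 complete (Day 7)
- [ ] v0.2 RAG (Day 14)
- [ ] v0.3 security (Day 21)
- [ ] v1.0 production (Day 30)

## Algorithm progress
- [ ] Week 1: 12/12 problems
- [ ] Week 2: 12/12 problems
- [ ] Week 3-4: 21/21 problems

## Issue log
- Day 4: could not follow the retry logic inside the agent loop
  → read the Day 6 retry pattern and it made sense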

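📊 Architecture Diagrams at a Glance

All architecture diagrams are drawn in Mermaid and embedded in the relevant documents:

Week 1 diagrams → weekplan/week1_overview.md

Overall Agent Runtime architecture
Detailed Agent Loop flow (the core of Day 4)
Project file structure

Week 2 diagrams → weekplan/week2_overview.md

Full RAG pipeline (5 stages)
Hybrid retrieval weight combination

Week 3 diagrams → weekplan/week3_overview.md

Workflow state machine (Day 15)
Layered security defenses (6 layers, Day 18)
Approval flow (Day 17)

Week 4 diagrams → weekplan/week4_overview.md

Docker Compose deployment architecture
Observability architecture (Tracing + Metrics + Logs)
Eval + CI flow

System design question diagrams → docs/system_design_guide.md

Full Customer Support Copilot architecture
Enterprise Agent Platform multi-tenant architecture

How to view them:

# View the Mermaid diagrams directly in each markdown file
cat weekplan/week1_overview.md

# Or view them on GitHub (Mermaid renders automatically)
# Or paste the diagram code into https://mermaid.live to edit it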

🚀 Quick Start

# First run
cd /Users/zhangruobin/Episodes/Areas/interview-prepare

# Read the overview
cat README.md

# Read the Day 1 material
cat weekplan/week1_day1.md

# Read the complete code example
cat code-examples/day1_complete.md

# Start coding
mkdir agent-runtime && cd agent-runtime
go mod init github.com/yourname/agent-runtime
# ... create the files as described in code-examples/day1_complete.md

# Run it
go run cmd/api/main.go

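📞 Stuck?

Can't follow the Day X material → go back to the previous Day and fill in the basics
Code won't run → check go.mod, read the error message, and don't change things at random
Can't grasp a concept → search for "golang [concept]" and read the official documentation
Progress feels too slow → that is normal; Day 4 and Day 6 are the hardest and take 4-5 hours each

Let's get started! 🚀

Start with the README and week1_day1.md for Day 1.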
