Our team's MLOps platform recently ran into an awkward bottleneck. The model training environment is a pure Python stack, but the vast majority of our online business systems are built on .NET. This creates a persistent friction: when a data scientist delivers a pickle file or a handful of .py scripts, the engineering team has to translate, wrap, or outright rewrite them in C# before they can be integrated into the existing ASP.NET Core services. The conversion is not only time-consuming, it also keeps introducing subtle, hard-to-trace discrepancies, so online inference results fail to match offline evaluation metrics. Worse, the fragility of Python dependency environments turned "it works on my machine" into a curse we could not shake.
We needed an approach that keeps the performance and ecosystem advantages of .NET in production while eliminating the consistency problems of model hand-off for good. The idea gradually took shape: package the model, the runtime, the dependencies, and the inference code as a single atomic, immutable deployment unit. Once built, everything inside the unit is frozen, and it behaves identically in test, staging, and production. This is the immutable-infrastructure philosophy extended to model serving: the "immutable inference unit".
To realize this idea, our final technology choice was the combination of Packer, ASP.NET Core, and Kubeflow. Packer creates the immutable "unit" (concretely, a Docker image); ASP.NET Core serves as the high-performance inference server; and Kubeflow (specifically KServe) acts as our standard platform for model deployment and management. The core challenge is making this atypical, .NET-driven inference unit fit seamlessly into a Kubeflow ecosystem where Python is the norm.
Step 1: Building a Robust ASP.NET Core Inference Core
Our goal is not a toy demo but an inference API that runs reliably in production. Beyond the core prediction logic, that means handling configuration, logging, health checks, and concurrent load.
We chose ONNX (Open Neural Network Exchange) as the standard format for model delivery: it is genuinely cross-platform, and .NET has a high-performance, Microsoft-supported runtime for it, Microsoft.ML.OnnxRuntime.
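The first thing worth doing with a delivered .onnx file is confirming its input/output contract. The short sketch below is illustrative only (the standalone snippet and the relative path model/model.onnx are assumptions matching where we later bake the model into the image); it loads the model with Microsoft.ML.OnnxRuntime and prints each node's name and declared shape.
InspectModel.cs (hypothetical contract-check utility)
using System;
using Microsoft.ML.OnnxRuntime;

// Print every input/output node with its declared shape.
// ONNX Runtime reports dynamic dimensions (such as the batch axis) as -1.
using var session = new InferenceSession("model/model.onnx");

foreach (var (name, meta) in session.InputMetadata)
    Console.WriteLine($"input  {name}: [{string.Join(", ", meta.Dimensions)}]");

foreach (var (name, meta) in session.OutputMetadata)
    Console.WriteLine($"output {name}: [{string.Join(", ", meta.Dimensions)}]");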
Below is the core code structure of the inference service. There is no extraneous business logic; it focuses on one thing: executing ONNX model inference efficiently and reliably.
Program.cs - Application entry point and service configuration
using Microsoft.AspNetCore.Builder;
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;
using Microsoft.ML.OnnxRuntime;
using Serilog;
using System.IO;
using System.Reflection;
// Production-ready logging setup with Serilog
Log.Logger = new LoggerConfiguration()
.MinimumLevel.Information()
.Enrich.FromLogContext()
.WriteTo.Console(outputTemplate: "[{Timestamp:HH:mm:ss} {Level:u3}] {Message:lj}{NewLine}{Exception}")
.CreateLogger();
var builder = WebApplication.CreateBuilder(args);
// Replace default logger with Serilog
builder.Host.UseSerilog();
// Register the ONNX inference session as a singleton.
// This is critical for performance, as model loading and session initialization
// are expensive operations that should only happen once at startup.
builder.Services.AddSingleton<InferenceSession>(sp =>
{
var logger = sp.GetRequiredService<ILogger<Program>>();
// In a real project, this path should come from configuration (IConfiguration).
// For this immutable unit, we bake it directly into the image at a known location.
var modelPath = Path.Combine(AppContext.BaseDirectory, "model", "model.onnx");
if (!File.Exists(modelPath))
{
logger.LogCritical("Model file not found at {ModelPath}. The application cannot start.", modelPath);
// This will cause the application to crash, which is the desired behavior.
// Kubernetes will then restart the pod, preventing a broken service from running.
throw new FileNotFoundException("Required model file is missing.", modelPath);
}
logger.LogInformation("Loading ONNX model from: {ModelPath}", modelPath);
// Fine-tuning session options for production
var sessionOptions = new SessionOptions
{
// For CPU-based inference, InterOpNumThreads and IntraOpNumThreads
// are key performance tuning parameters. The optimal values depend on
// the model architecture and hardware. It's best to benchmark.
// A common starting point is to set them based on available cores.
IntraOpNumThreads = System.Environment.ProcessorCount,
ExecutionMode = ExecutionMode.ORT_SEQUENTIAL,
GraphOptimizationLevel = GraphOptimizationLevel.ORT_ENABLE_ALL,
};
try
{
return new InferenceSession(modelPath, sessionOptions);
}
catch (OnnxRuntimeException ex)
{
logger.LogCritical(ex, "Failed to initialize the ONNX InferenceSession. Check model integrity and runtime compatibility.");
throw; // Fail fast
}
});
builder.Services.AddControllers();
// Add standard health checks for Kubernetes liveness and readiness probes.
builder.Services.AddHealthChecks();
var app = builder.Build();
// Swagger is deliberately not registered: the production image exposes only the
// prediction endpoint and the health check.
app.UseSerilogRequestLogging();
app.UseAuthorization();
app.MapControllers();
// Map health check endpoint
app.MapHealthChecks("/healthz");
app.Run();
InferenceController.cs - The prediction endpoint
To stay compatible with KServe's V1 prediction protocol, we deliberately expose the endpoint at /v1/models/{model_name}:predict. This is not strictly required, but it keeps the interface style consistent across a mixed environment.
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Logging;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
namespace ImmutableInference.Controllers
{
// A simplified request/response structure mimicking KServe's V1 protocol
public class InferenceRequest
{
public List<List<float>> Instances { get; set; }
}
public class InferenceResponse
{
public List<object> Predictions { get; set; }
}
[ApiController]
[Route("/")]
public class InferenceController : ControllerBase
{
private readonly InferenceSession _session;
private readonly ILogger<InferenceController> _logger;
public InferenceController(InferenceSession session, ILogger<InferenceController> logger)
{
_session = session;
_logger = logger;
}
[HttpPost("v1/models/iris:predict")]
public IActionResult Predict([FromBody] InferenceRequest request)
{
if (request?.Instances == null || !request.Instances.Any())
{
return BadRequest("Input 'instances' cannot be null or empty.");
}
try
{
// In a real-world scenario, you must know the exact input node name and shape.
// This information should be part of the contract between ML scientists and engineers.
var inputMeta = _session.InputMetadata;
var inputNodeName = inputMeta.Keys.First();
var inputShape = inputMeta[inputNodeName].Dimensions;
// Simple validation for our example Iris model (batch_size, 4 features)
if (inputShape.Length != 2 || inputShape[1] != 4)
{
_logger.LogWarning("Model input shape mismatch. Expected a rank-2 input with 4 features, but got rank {Rank}.", inputShape.Length);
// This could be a 500 error as it indicates a deployment mismatch.
}
var batchSize = request.Instances.Count;
var inputTensor = new DenseTensor<float>(new[] { batchSize, 4 });
for (int i = 0; i < batchSize; i++)
{
for (int j = 0; j < 4; j++)
{
// Proper bounds checking is omitted for brevity but crucial in production.
inputTensor[i, j] = request.Instances[i][j];
}
}
var inputs = new List<NamedOnnxValue>
{
NamedOnnxValue.CreateFromTensor(inputNodeName, inputTensor)
};
using var results = _session.Run(inputs);
// Assuming the model exposes a "label" tensor (long) and a "probabilities" tensor
// of shape (batch_size, num_classes); the exact names are part of the model contract.
var outputLabel = results.FirstOrDefault(r => r.Name == "label")?.AsTensor<long>().ToArray();
var outputProbs = results.FirstOrDefault(r => r.Name == "probabilities")?.AsTensor<float>();
var numClasses = outputProbs != null ? outputProbs.Dimensions[1] : 0;
// Construct a meaningful response: one entry per input instance.
var predictions = new List<object>();
for (int i = 0; i < (outputLabel?.Length ?? 0); i++)
{
var scores = new float[numClasses];
for (int c = 0; c < numClasses; c++)
{
scores[c] = outputProbs[i, c];
}
predictions.Add(new { label = outputLabel[i], score = scores });
}
return Ok(new InferenceResponse { Predictions = predictions });
}
catch (Exception ex)
{
_logger.LogError(ex, "An error occurred during inference.");
// Avoid leaking internal exception details to the client.
return StatusCode(500, "Internal server error during prediction.");
}
}
}
}
This code has the basic qualities of a production-grade service:
- **Singleton InferenceSession**: loading a model is expensive; registering the session as a singleton through dependency injection ensures it is loaded exactly once for the lifetime of the application.
- **Fail-fast startup check**: if the model file is missing from the expected path, the application crashes immediately, preventing a service in an invalid state from ever going live.
- **Structured logging**: Serilog records the key operations and adds context when errors occur.
- **Health checks**: the built-in /healthz endpoint backs the Kubernetes liveness and readiness probes.
- **Robust error handling**: the controller catches every exception raised during inference and returns a uniform 500 response, so internal stack traces never leak to the client.
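A quick way to exercise all of this end to end is a small client against a locally running instance (dotnet run). The sketch below is illustrative only; the port 8080 and the sample Iris feature values are assumptions, not part of the service contract.
SmokeTest.cs (hypothetical test client)
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;

// Minimal smoke test: send two Iris-style feature vectors to the prediction
// endpoint and print the raw JSON response.
internal static class SmokeTest
{
    public static async Task Main()
    {
        using var client = new HttpClient { BaseAddress = new Uri("http://localhost:8080") };

        var payload = new
        {
            instances = new[]
            {
                new[] { 5.1f, 3.5f, 1.4f, 0.2f },
                new[] { 6.7f, 3.0f, 5.2f, 2.3f }
            }
        };

        using var response = await client.PostAsJsonAsync("/v1/models/iris:predict", payload);
        Console.WriteLine($"HTTP {(int)response.StatusCode}");
        Console.WriteLine(await response.Content.ReadAsStringAsync());
    }
}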
Step 2: Locking Down the Build Process with Packer
With the application code in place, the next step is to package it together with the model file into a Docker image. The usual approach is to hand-write a Dockerfile and run docker build, but in team and automation settings that has problems: build parameters, version tags, and target registries end up scattered across scripts, which makes them hard to manage and audit.
Packer lets us define the entire image build declaratively in HCL (HashiCorp Configuration Language), effectively bringing "infrastructure as code" (IaC) to the image build process.
First, an optimized multi-stage Dockerfile is essential. It ensures the final production image contains only the files needed at runtime, keeping it small and reducing the attack surface.
Dockerfile
# Stage 1: Build the application
# Use the official .NET SDK image which contains all tools needed to build and publish.
FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build-env
WORKDIR /app
# Copy csproj and restore as distinct layers to leverage Docker cache
COPY *.csproj ./
RUN dotnet restore
# Copy everything else and build
COPY . ./
# Publish a framework-dependent release build of the application
RUN dotnet publish -c Release -o out
# Stage 2: Build the final production image
# Use the minimal ASP.NET runtime image
FROM mcr.microsoft.com/dotnet/aspnet:8.0
WORKDIR /app
# Copy the published application from the build stage
COPY --from=build-env /app/out .
# Copy the model file into a specific directory within the image.
# This assumes the model file is in a 'model' subdirectory of the build context.
COPY model/ ./model/
# Set the entrypoint for the container
ENTRYPOINT ["dotnet", "ImmutableInference.dll"]
Next comes the Packer configuration. We use an HCL file, packer-build.pkr.hcl, to orchestrate the Docker build.
packer-build.pkr.hcl
packer {
required_plugins {
docker = {
version = ">= 1.0.0"
source = "github.com/hashicorp/docker"
}
}
}
// Variables allow parameterizing the build. These can be overridden
// from the command line or via environment variables (e.g., PKR_VAR_image_version).
variable "image_repository" {
type = string
default = "my-registry/immutable-inference"
}
variable "image_version" {
type = string
default = "0.1.0-local"
}
// A "source" block defines what kind of machine image to build.
// Here, we are using the "docker" builder.
source "docker" "aspnet-inference" {
image = "mcr.microsoft.com/dotnet/sdk:8.0" // Start with the build image
commit = true // Commit the container to an image upon completion
changes = [
"WORKDIR /app",
"ENTRYPOINT [\"dotnet\", \"ImmutableInference.dll\"]"
]
}
// The "build" block defines the steps to build the image.
build {
name = "aspnet-inference-service"
sources = ["source.docker.aspnet-inference"]
// Provisioners are used to install software, upload files, or configure the machine.
// 1. Copy the entire application source code into the container.
provisioner "file" {
source = "./"
destination = "/app/"
}
// 2. Run the dotnet publish command inside the container.
// This is equivalent to the RUN commands in the first stage of our Dockerfile.
provisioner "shell" {
inline = [
"cd /app",
"dotnet publish -c Release -o out"
]
}
// Post-processors run after the image has been committed.
// "docker-tag" tags the image so it can later be pushed to a container registry
// (a "docker-push" post-processor would do the actual publishing).
post-processor "docker-tag" {
repository = var.image_repository
tags = [var.image_version, "latest"]
}
// Caveat: docker commit/import "changes" only accept a subset of Dockerfile
// instructions (ENTRYPOINT, WORKDIR, ENV, EXPOSE, ...), so this single-container
// approach cannot reproduce a multi-stage slim-down: the resulting image still
// carries the full .NET SDK and the source tree.
}
Note: the packer-build.pkr.hcl above is one way to do it. It reproduces part of the multi-stage build inside Packer and mainly demonstrates how much control Packer gives you over the build process. In practice it is more intuitive and maintainable to keep the multi-stage Dockerfile as the single source of truth for build logic; since the official Packer Docker builder deliberately does not consume Dockerfiles, the pragmatic pattern is to let Packer handle orchestration, parameterization, and publishing while delegating the actual build to docker build.
A leaner approach along those lines, and the one we recommend:
packer-dockerfile.pkr.hcl (recommended)
packer {
required_plugins {
null = {
version = ">= 1.0.0"
source = "github.com/hashicorp/null"
}
}
}
variable "image_repository" {
type = string
default = "my-registry/immutable-inference"
}
variable "image_version" {
type = string
description = "The version tag for the Docker image, e.g., git commit SHA."
}
// The Docker builder does not consume Dockerfiles, so the "null" source is used
// purely as an anchor for the build block; the actual work is delegated to the
// local Docker CLI through the shell-local provisioner.
source "null" "from-dockerfile" {
communicator = "none"
}
build {
sources = ["source.null.from-dockerfile"]
// Build the image from the multi-stage Dockerfile, then publish it to the registry.
provisioner "shell-local" {
inline = [
"docker build -f Dockerfile -t ${var.image_repository}:${var.image_version} .",
"docker push ${var.image_repository}:${var.image_version}"
]
}
}
This version is much cleaner: the build logic stays in the Dockerfile, while Packer takes care of orchestrating, parameterizing, and publishing the build. In a CI/CD pipeline we run:
packer build -var "image_version=$(git rev-parse --short HEAD)" .
This command builds and pushes an image tagged with the current Git commit SHA. With that, we have a repeatable, traceable build process: every change to the code or the model produces a brand-new, complete inference-unit image with a unique version identifier.
Step 3: Deploying the Immutable Unit to Kubeflow
The final step is to deploy the Packer-built image to the Kubeflow cluster using KServe's InferenceService CRD. The key point is that we do not use KServe's standard model loaders (such as storageUri); we point it directly at our own container.
The predictor spec of an InferenceService allows a custom predictor (in recent versions, simply the containers field), which gives us full control.
kserve-deployment.yaml
apiVersion: "serving.kserve.io/v1beta1"
kind: "InferenceService"
metadata:
name: "aspnet-iris-predictor"
namespace: kubeflow-user-example-com
spec:
predictor:
# We are not using standard model servers, so no 'model' spec is needed.
# Instead, we define our custom container directly.
containers:
- name: kserve-container # KServe expects the predictor container to be named kserve-container (older KFServing releases used kfserving-container)
# This is where the magic happens: we specify the exact immutable image
# built by Packer and tagged with the Git commit hash.
image: my-registry/immutable-inference:a1b2c3d
ports:
- containerPort: 8080 # Default HTTP port for KServe
env:
# In .NET 8, the default port is 8080. If your base image uses port 80,
# you'll need to set ASPNETCORE_URLS to listen on the correct port KServe expects.
- name: ASPNETCORE_URLS
value: http://*:8080
resources:
requests:
cpu: "500m"
memory: "1Gi"
limits:
cpu: "1"
memory: "2Gi"
# Liveness and Readiness probes are crucial for production stability.
# KServe will use these to manage the pod's lifecycle.
readinessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 5
periodSeconds: 10
livenessProbe:
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 15
periodSeconds: 20
After we apply this resource with kubectl apply -f kserve-deployment.yaml, the KServe controller creates the underlying Deployment and Service (or a Knative Service, if the cluster runs KServe in serverless mode with autoscaling). It pulls the exact Packer-built image we specified and uses the probes we defined to decide when the service is ready.
The end-to-end flow looks like this:
graph TD
subgraph "CI/CD Pipeline (e.g., Jenkins, GitLab CI)"
A[Git Commit] --> B{Trigger Build};
B --> C[Run Packer Build];
C -- reads --> D[packer.pkr.hcl];
C -- uses --> E[Dockerfile];
C -- packages --> F[ONNX Model];
C -- packages --> G[ASP.NET Core Source];
C --> H[Build & Push Image to Registry];
H -- image: my-registry/app:commit-sha --> I[Update kserve-deployment.yaml];
I --> J[GitOps: Apply to K8s Cluster];
end
subgraph "Kubernetes Cluster (Kubeflow)"
J --> K[KServe Controller];
K -- creates --> L[Deployment];
L -- pulls image --> M[Container Registry];
L -- creates --> N[Pod: Immutable Inference Unit];
N -- runs --> O[ASP.NET Core Service];
O -- serves --> P[Prediction Endpoint];
end
Q[User Request] --> P;
This flow gives us true end-to-end immutability. From commit to live service, the intermediate artifact (the Docker image) is fully self-contained and versioned. Debugging and rollback become trivial: if version a1b2c3d misbehaves in production, we simply change the image tag in kserve-deployment.yaml back to the previous stable version (say f9e8d7c) and re-apply. Because the image is immutable, we can be certain the rolled-back service behaves exactly as it did before.
Limitations and Outlook
This approach is not free. The most significant trade-off is between flexibility and immutability.
- Model update cost: in this architecture, updating a model means rebuilding and redeploying the entire image. For workloads that update models frequently, this adds load and latency to the CI/CD pipeline. By comparison, KServe's native pattern of loading models dynamically from object storage (such as S3) is far lighter for rapid model iteration.
- Image size: bundling the model with the runtime generally produces a larger image than a code-only one, which can slow down pod startup (cold starts).
One possible optimization is a hybrid mode: use Packer to build a "base inference image" that contains the runtime and all inference logic but no model, then use an initContainer in the Kubernetes pod spec to pull the desired model file from a model store (such as S3) before the main container starts and mount it at the path the main container expects (sketched below).
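As an illustration only (the S3 path, the aws-cli helper image, and the base image tag are placeholders, and whether this lands in a plain Deployment or directly in the InferenceService spec depends on what your KServe version exposes), the pod spec of such a hybrid unit could look roughly like this:
# Hypothetical pod spec fragment: a model-free base image plus an initContainer
# that fetches the model before the inference container starts.
spec:
  volumes:
    - name: model-store
      emptyDir: {}
  initContainers:
    - name: model-downloader
      image: amazon/aws-cli            # placeholder; any S3-capable client image works
      args: ["s3", "cp", "s3://my-model-bucket/iris/v3/model.onnx", "/models/model.onnx"]
      volumeMounts:
        - name: model-store
          mountPath: /models
  containers:
    - name: kserve-container
      image: my-registry/immutable-inference-base:a1b2c3d   # runtime and code, no model baked in
      volumeMounts:
        - name: model-store
          mountPath: /app/model        # the path the ASP.NET Core service loads model.onnx from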
This hybrid mode keeps the runtime immutable while restoring the flexibility of dynamic model loading. It also reintroduces some complexity: the initContainer's logic and permissions have to be managed, as does compatibility between model versions and runtime versions. In a real project, the choice comes down to weighing security, deployment frequency, and operational simplicity. For scenarios where models are relatively stable but environment consistency and auditability are paramount, the purely immutable unit we built here remains an extremely solid and reliable choice.