Deploying Orleans on Kubernetes: Configuration and Source-Level Mechanics

This article draws on the Orleans source code and the official documentation to explain how to host Orleans on Kubernetes correctly. It covers:

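  • Configuration points and constraints
  • Key source-code behavior, with excerpts
  • Sample application code and Kubernetes YAML
  • Sequence diagrams for startup and runtime
  • Common problems and troubleshooting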

Reference documentation:


I. Core Concepts and Constraints

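  • Microsoft.Orleans.Hosting.Kubernetes improves the hosting experience on Kubernetes. Calling UseKubernetesHosting():
    • sets SiloOptions.SiloName to the Pod name;
    • sets EndpointOptions.AdvertisedIPAddress to the Pod IP (or to an address resolved from the Pod name);
    • binds EndpointOptions.SiloListeningEndpoint and GatewayListeningEndpoint to the Any address, on ports 11111 / 30000 by default (the gateway endpoint is set only when GatewayPort > 0);
    • sets ClusterOptions.ServiceId and ClusterOptions.ClusterId from the ORLEANS_SERVICE_ID / ORLEANS_CLUSTER_ID environment variables, which the Deployment typically populates from the Pod labels;
    • at startup, compares the Pods that exist in Kubernetes with Orleans cluster membership and marks active silos whose Pod no longer exists as Dead;
    • at runtime, selects only a few silos (2 by default) as "watchers" of Kubernetes Pod events, which reduces load on the API server.
  • Note: Kubernetes hosting is not Orleans cluster membership. A clustering provider (Azure Storage, ADO.NET, Consul, etc.) must still be configured separately.
  • Required labels and environment variables:
    • Pod labels: orleans/serviceId, orleans/clusterId
    • Environment variables: POD_NAME, POD_NAMESPACE, POD_IP, ORLEANS_SERVICE_ID, ORLEANS_CLUSTER_ID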
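The mapping in the list above can be modeled without any Orleans or Kubernetes dependency. The following C# is a simplified sketch, not the library's code: SiloIdentity and K8sIdentitySketch are hypothetical names, the environment is passed in as a dictionary so the logic can be tested, and the service-account fallback for the namespace is left out.

```csharp
#nullable enable
using System.Collections.Generic;

// Hypothetical result type: Orleans spreads these values across SiloOptions, ClusterOptions and KubernetesHostingOptions.
public sealed record SiloIdentity(string SiloName, string ServiceId, string ClusterId, string? Namespace);

public static class K8sIdentitySketch
{
    // Simplified model of ConfigureKubernetesHostingOptions; the service-account namespace fallback is not modeled.
    public static SiloIdentity Resolve(IReadOnlyDictionary<string, string> env, string machineName, SiloIdentity current)
    {
        // POD_NAME falls back to the machine name, as in Configure(KubernetesHostingOptions).
        string podName = env.TryGetValue("POD_NAME", out var name) ? name : machineName;

        // SiloName is only overwritten when the Pod name is not blank.
        string siloName = string.IsNullOrWhiteSpace(podName) ? current.SiloName : podName;

        // ServiceId / ClusterId keep their configured values unless the variable is set and non-blank.
        string serviceId = NonBlankOr(env, "ORLEANS_SERVICE_ID", current.ServiceId);
        string clusterId = NonBlankOr(env, "ORLEANS_CLUSTER_ID", current.ClusterId);

        string? ns = env.TryGetValue("POD_NAMESPACE", out var podNamespace) ? podNamespace : current.Namespace;
        return new SiloIdentity(siloName, serviceId, clusterId, ns);
    }

    private static string NonBlankOr(IReadOnlyDictionary<string, string> env, string key, string fallback) =>
        env.TryGetValue(key, out var value) && !string.IsNullOrWhiteSpace(value) ? value : fallback;
}
```

The design point worth noticing is that ServiceId and ClusterId are only overwritten when the variable is present and non-blank, so values set in code survive when the Deployment does not provide them.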

II. Key Source Locations and Behavior

  • Hosting extension registration and defaults (registers ConfigureKubernetesHostingOptions and KubernetesClusterAgent):
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Options;
using Orleans.Configuration;
using Orleans.Hosting.Kubernetes;
using Orleans.Runtime;
using System;
namespace Orleans.Hosting
{
    /// <summary>
    /// Extensions for hosting a silo in Kubernetes.
    /// </summary>
    public static class KubernetesHostingExtensions
    {
        /// <summary>
        /// Adds Kubernetes hosting support.
        /// </summary>
        public static ISiloBuilder UseKubernetesHosting(this ISiloBuilder siloBuilder)
        {
            return siloBuilder.ConfigureServices(services => services.UseKubernetesHosting(configureOptions: null));
        }
        /// <summary>
        /// Adds Kubernetes hosting support.
        /// </summary>
        public static ISiloBuilder UseKubernetesHosting(this ISiloBuilder siloBuilder, Action<OptionsBuilder<KubernetesHostingOptions>> configureOptions)
        {
            return siloBuilder.ConfigureServices(services => services.UseKubernetesHosting(configureOptions));
        }
        /// <summary>
        /// Adds Kubernetes hosting support.
        /// </summary>
        public static IServiceCollection UseKubernetesHosting(this IServiceCollection services) => services.UseKubernetesHosting(configureOptions: null);
        /// <summary>
        /// Adds Kubernetes hosting support.
        /// </summary>
        public static IServiceCollection UseKubernetesHosting(this IServiceCollection services, Action<OptionsBuilder<KubernetesHostingOptions>> configureOptions)
        {
            configureOptions?.Invoke(services.AddOptions<KubernetesHostingOptions>());
            // Configure defaults based on the current environment.
            services.AddSingleton<IConfigureOptions<ClusterOptions>, ConfigureKubernetesHostingOptions>();
            services.AddSingleton<IConfigureOptions<SiloOptions>, ConfigureKubernetesHostingOptions>();
            services.AddSingleton<IPostConfigureOptions<EndpointOptions>, ConfigureKubernetesHostingOptions>();
            services.AddSingleton<IConfigureOptions<KubernetesHostingOptions>, ConfigureKubernetesHostingOptions>();
            services.AddSingleton<IValidateOptions<KubernetesHostingOptions>, KubernetesHostingOptionsValidator>();
            // The agent participates in the silo lifecycle (see OnStart below).
            services.AddSingleton<ILifecycleParticipant<ISiloLifecycle>, KubernetesClusterAgent>();
            return services;
        }
    }
}
  • Environment-variable mapping and endpoint configuration (POD_* values flow through KubernetesHostingOptions into SiloOptions/EndpointOptions; ORLEANS_* values go to ClusterOptions):
#nullable enable
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Options;
using Orleans.Configuration;
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Net;
using System.Net.Sockets;
namespace Orleans.Hosting.Kubernetes
{
    internal class ConfigureKubernetesHostingOptions :
        IConfigureOptions<ClusterOptions>,
        IConfigureOptions<SiloOptions>,
        IPostConfigureOptions<EndpointOptions>,
        IConfigureOptions<KubernetesHostingOptions>
    {
        private readonly IServiceProvider _serviceProvider;
        public ConfigureKubernetesHostingOptions(IServiceProvider serviceProvider)
        {
            _serviceProvider = serviceProvider;
        }
        public void Configure(KubernetesHostingOptions options)
        {
            options.Namespace ??= Environment.GetEnvironmentVariable(KubernetesHostingOptions.PodNamespaceEnvironmentVariable) ?? ReadNamespaceFromServiceAccount();
            options.PodName ??= Environment.GetEnvironmentVariable(KubernetesHostingOptions.PodNameEnvironmentVariable) ?? Environment.MachineName;
            options.PodIP ??= Environment.GetEnvironmentVariable(KubernetesHostingOptions.PodIPEnvironmentVariable);
        }
        public void Configure(ClusterOptions options)
        {
            var serviceIdEnvVar = Environment.GetEnvironmentVariable(KubernetesHostingOptions.ServiceIdEnvironmentVariable);
            if (!string.IsNullOrWhiteSpace(serviceIdEnvVar))
            {
                options.ServiceId = serviceIdEnvVar;
            }
            var clusterIdEnvVar = Environment.GetEnvironmentVariable(KubernetesHostingOptions.ClusterIdEnvironmentVariable);
            if (!string.IsNullOrWhiteSpace(clusterIdEnvVar))
            {
                options.ClusterId = clusterIdEnvVar;
            }
        }
        public void Configure(SiloOptions options)
        {
            var hostingOptions = _serviceProvider.GetRequiredService<IOptions<KubernetesHostingOptions>>().Value;
            if (!string.IsNullOrWhiteSpace(hostingOptions.PodName))
            {
                options.SiloName = hostingOptions.PodName;
            }
        }
        public void PostConfigure(string? name, EndpointOptions options)
        {
            // Use PostConfigure to give the developer an opportunity to set SiloPort and GatewayPort using regular
            // Configure methods without needing to worry about ordering with respect to the UseKubernetesHosting call.
            if (options.AdvertisedIPAddress is null)
            {
                var hostingOptions = _serviceProvider.GetRequiredService<IOptions<KubernetesHostingOptions>>().Value;
                IPAddress? podIp = null;
                if (hostingOptions.PodIP is not null)
                {
                    podIp = IPAddress.Parse(hostingOptions.PodIP);
                }
                else
                {
                    var hostAddresses = Dns.GetHostAddresses(hostingOptions.PodName);
                    if (hostAddresses != null)
                    {
                        podIp = IPAddressSelector.PickIPAddress(hostAddresses);
                    }
                }
                if (podIp is not null)
                {
                    options.AdvertisedIPAddress = podIp;
                }
            }
            if (options.SiloListeningEndpoint is null)
            {
                options.SiloListeningEndpoint = new IPEndPoint(IPAddress.Any, options.SiloPort);
            }
            if (options.GatewayListeningEndpoint is null && options.GatewayPort > 0)
            {
                options.GatewayListeningEndpoint = new IPEndPoint(IPAddress.Any, options.GatewayPort);
            }
        }
        private string? ReadNamespaceFromServiceAccount()
        {
            // Read the namespace from the pod's service account.
            // ... (the rest of this method and of the class is omitted in this excerpt)
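The PostConfigure logic above can be exercised in isolation. The sketch below is a simplified model under stated assumptions: EndpointSettings stands in for EndpointOptions, a resolver delegate replaces Dns.GetHostAddresses so the code runs without DNS, and the address-picking rule (first non-loopback IPv4 address) is an assumption, not the actual behavior of IPAddressSelector.PickIPAddress.

```csharp
#nullable enable
using System;
using System.Linq;
using System.Net;
using System.Net.Sockets;

// Hypothetical, trimmed-down stand-in for Orleans' EndpointOptions (defaults 11111 / 30000).
public sealed class EndpointSettings
{
    public IPAddress? AdvertisedIPAddress { get; set; }
    public int SiloPort { get; set; } = 11111;
    public int GatewayPort { get; set; } = 30000;
    public IPEndPoint? SiloListeningEndpoint { get; set; }
    public IPEndPoint? GatewayListeningEndpoint { get; set; }
}

public static class EndpointPostConfigureSketch
{
    // Mirrors PostConfigure(EndpointOptions): fills in only what is still unset.
    public static void PostConfigure(EndpointSettings options, string? podIp, string podName, Func<string, IPAddress[]> resolveHost)
    {
        if (options.AdvertisedIPAddress is null)
        {
            // Prefer POD_IP; otherwise resolve the Pod name (resolveHost replaces Dns.GetHostAddresses here).
            IPAddress? address = podIp is not null
                ? IPAddress.Parse(podIp)
                : PickAddress(resolveHost(podName));
            if (address is not null)
            {
                options.AdvertisedIPAddress = address;
            }
        }

        // Listen on all interfaces, using whatever ports were configured before this step ran.
        options.SiloListeningEndpoint ??= new IPEndPoint(IPAddress.Any, options.SiloPort);
        if (options.GatewayListeningEndpoint is null && options.GatewayPort > 0)
        {
            options.GatewayListeningEndpoint = new IPEndPoint(IPAddress.Any, options.GatewayPort);
        }
    }

    // Assumed selection rule: first non-loopback IPv4 address, else the first address, else null.
    private static IPAddress? PickAddress(IPAddress[] candidates) =>
        candidates.FirstOrDefault(a => a.AddressFamily == AddressFamily.InterNetwork && !IPAddress.IsLoopback(a))
            ?? candidates.FirstOrDefault();
}
```

Because the listening endpoints are derived from whatever SiloPort and GatewayPort hold when post-configuration runs, a port set through silo.Configure<EndpointOptions>(...) is honored whether it is registered before or after UseKubernetesHosting().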
  • Constants: environment-variable and label names (keep the YAML and the application consistent with these):
using k8s;
using System;
namespace Orleans.Hosting.Kubernetes
{
    /// <summary>
    /// Options for hosting in Kubernetes.
    /// </summary>
    public sealed class KubernetesHostingOptions
    {
        private readonly Lazy<KubernetesClientConfiguration> _clientConfiguration;
        /// <summary>
        /// The environment variable for specifying the Kubernetes namespace which all silos in this cluster belong to.
        /// </summary>
        public const string PodNamespaceEnvironmentVariable = "POD_NAMESPACE";
        /// <summary>
        /// The environment variable for specifying the name of the Kubernetes pod which this silo is executing in.
        /// </summary>
        public const string PodNameEnvironmentVariable = "POD_NAME";
        /// <summary>
        /// The environment variable for specifying the IP address of this pod.
        /// </summary>
        public const string PodIPEnvironmentVariable = "POD_IP";
        /// <summary>
        /// The environment variable for specifying <see cref="Orleans.Configuration.ClusterOptions.ClusterId"/>.
        /// </summary>
        public const string ClusterIdEnvironmentVariable = "ORLEANS_CLUSTER_ID";
        /// <summary>
        /// The environment variable for specifying <see cref="Orleans.Configuration.ClusterOptions.ServiceId"/>.
        /// </summary>
        public const string ServiceIdEnvironmentVariable = "ORLEANS_SERVICE_ID";
        /// <summary>
        /// The name of the <see cref="Orleans.Configuration.ClusterOptions.ServiceId"/> label on the pod.
        /// </summary>
        public const string ServiceIdLabel = "orleans/serviceId";
        /// <summary>
        /// The name of the <see cref="Orleans.Configuration.ClusterOptions.ClusterId"/> label on the pod.
        /// </summary>
        public const string ClusterIdLabel = "orleans/clusterId";
        public KubernetesHostingOptions()
        {
            _clientConfiguration = new Lazy<KubernetesClientConfiguration>(() => this.GetClientConfiguration());
            // ... (the rest of the class is omitted in this excerpt)
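Since the labels and environment variables named by these constants must agree, a startup check can catch drift early. This is a hypothetical helper (LabelEnvConsistencySketch is not part of Orleans), assuming the Pod labels and the environment are available as dictionaries; in a real Pod the labels would have to be read from the Kubernetes API or exposed through the downward API.

```csharp
using System;
using System.Collections.Generic;

public static class LabelEnvConsistencySketch
{
    // Label / variable pairs, copied from the KubernetesHostingOptions constants above.
    private static readonly (string Label, string EnvVar)[] Pairs =
    {
        ("orleans/serviceId", "ORLEANS_SERVICE_ID"),
        ("orleans/clusterId", "ORLEANS_CLUSTER_ID"),
    };

    // Returns human-readable problems; an empty list means labels and variables agree.
    public static List<string> FindMismatches(
        IReadOnlyDictionary<string, string> podLabels, IReadOnlyDictionary<string, string> env)
    {
        var problems = new List<string>();
        foreach (var (label, envVar) in Pairs)
        {
            podLabels.TryGetValue(label, out var labelValue);
            env.TryGetValue(envVar, out var envValue);
            bool hasLabel = !string.IsNullOrWhiteSpace(labelValue);
            bool hasEnv = !string.IsNullOrWhiteSpace(envValue);
            if (!hasLabel)
            {
                problems.Add($"Pod label '{label}' is missing or empty.");
            }
            if (!hasEnv)
            {
                problems.Add($"Environment variable {envVar} is missing or empty.");
            }
            if (hasLabel && hasEnv && !string.Equals(labelValue, envValue, StringComparison.Ordinal))
            {
                problems.Add($"Label '{label}' = '{labelValue}' does not match {envVar} = '{envValue}'.");
            }
        }
        return problems;
    }
}
```

When ORLEANS_SERVICE_ID and ORLEANS_CLUSTER_ID are populated with fieldRef from the labels, as in the Deployment in section IV, the two always agree; the check matters mainly when the values are hard-coded in the manifest or the image.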
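  • KubernetesClusterAgent: startup reconciliation and runtime watch / mark / delete
    • At startup: writes ServiceId/ClusterId back to this Pod's labels, lists the Pods carrying the same labels, compares them with Orleans membership, and marks active silos with no matching Pod as Dead.
    • At runtime: chooses N active silos (2 by default) as watchers, which watch for Pod deletion events and mark the corresponding silo as Dead; optionally, the Pods of defunct silos are deleted (controlled by the DeleteDefunctSiloPods option).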
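The startup comparison is essentially two set differences. The following sketch reproduces the logic of the OnStart excerpt below on plain collections; Member and MemberStatus are simplified, hypothetical stand-ins for Orleans' ClusterMember and SiloStatus, and nothing is actually killed: the method only returns the silo addresses the agent would pass to TryKill.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-ins for Orleans' SiloStatus and ClusterMember.
public enum MemberStatus { Joining, Active, ShuttingDown, Stopping, Dead }
public sealed record Member(string Name, string SiloAddress, MemberStatus Status);

public static class StartupReconciliationSketch
{
    // Returns the Pods with no live silo (the agent only logs these) and the
    // addresses of Active silos with no Pod (the agent calls TryKill on these).
    public static (List<string> UnknownPods, List<string> SilosToKill) Reconcile(
        string localPodName, IEnumerable<string> podNames, IEnumerable<Member> members)
    {
        // The local Pod is always treated as present in both sets.
        var clusterPods = new HashSet<string>(podNames) { localPodName };
        var known = new HashSet<string> { localPodName };
        var knownMap = new Dictionary<string, Member>();
        foreach (var member in members)
        {
            if (member.Status == MemberStatus.Dead)
            {
                continue; // Dead members are ignored entirely.
            }
            known.Add(member.Name);
            knownMap[member.Name] = member;
        }

        var unknownPods = clusterPods.Except(known).OrderBy(p => p, StringComparer.Ordinal).ToList();
        var silosToKill = known.Except(clusterPods)
            .OrderBy(n => n, StringComparer.Ordinal)
            .Select(n => knownMap[n])
            .Where(m => m.Status == MemberStatus.Active)
            .Select(m => m.SiloAddress)
            .ToList();
        return (unknownPods, silosToKill);
    }
}
```

Pods with no live silo are only reported (the source merely logs a warning), while silos with no Pod are marked Dead only if they are Active; a Joining silo is left alone.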
        private async Task OnStart(CancellationToken cancellation)
        {
            var attempts = 0;
            while (!cancellation.IsCancellationRequested)
            {
                try
                {
                    await AddClusterOptionsToPodLabels(cancellation);
                    // Find the currently known cluster members first, before interrogating Kubernetes
                    await _clusterMembershipService.Refresh();
                    var snapshot = _clusterMembershipService.CurrentSnapshot.Members;
                    // Find the pods which correspond to this cluster
                    var pods = await _client.ListNamespacedPodAsync(
                        namespaceParameter: _podNamespace,
                        labelSelector: _podLabelSelector,
                        cancellationToken: cancellation);
                    var clusterPods = new HashSet<string> { _podName };
                    foreach (var pod in pods.Items)
                    {
                        clusterPods.Add(pod.Metadata.Name);
                    }
                    var known = new HashSet<string>();
                    var knownMap = new Dictionary<string, ClusterMember>();
                    known.Add(_podName);
                    foreach (var member in snapshot.Values)
                    {
                        if (member.Status == SiloStatus.Dead)
                        {
                            continue;
                        }
                        known.Add(member.Name);
                        knownMap[member.Name] = member;
                    }
                    var unknownPods = new List<string>(clusterPods.Except(known));
                    unknownPods.Sort();
                    foreach (var pod in unknownPods)
                    {
                        _logger.LogWarning("Pod {PodName} does not correspond to any known silos", pod);
                        // Delete the pod once it has been active long enough?
                    }
                    var unmatched = new List<string>(known.Except(clusterPods));
                    unmatched.Sort();
                    foreach (var pod in unmatched)
                    {
                        var siloAddress = knownMap[pod];
                        if (siloAddress.Status is not SiloStatus.Active)
                        {
                            continue;
                        }
                        _logger.LogWarning("Silo {SiloAddress} does not correspond to any known pod. Marking it as dead.", siloAddress);
                        await _clusterMembershipService.TryKill(siloAddress.SiloAddress);
                    }
                    break;
                }
                catch (HttpOperationException exception) when (exception.Response.StatusCode is System.Net.HttpStatusCode.Forbidden)
                {
                    _logger.LogError(exception, $"Unable to monitor pods due to insufficient permissions. Ensure that this pod has an appropriate Kubernetes role binding. Here is an example role binding:\n{ExampleRoleBinding}");
                }
                catch (Exception exception)
                {
                    _logger.LogError(exception, "Error while initializing Kubernetes cluster agent");
                    if (++attempts > _options.CurrentValue.MaxKubernetesApiRetryAttempts)
                    {
                        throw;
                    }
                    await Task.Delay(1000, cancellation);
                }
            }
            // Start monitoring loop
            ThreadPool.UnsafeQueueUserWorkItem(_ => _runTask = Task.WhenAll(Task.Run(MonitorOrleansClustering), Task.Run(MonitorKubernetesPods)), null);
        }
        private async Task MonitorOrleansClustering()
        {
            var previous = _clusterMembershipService.CurrentSnapshot;
            while (!_shutdownToken.IsCancellationRequested)
            {
                try
                {
                    await foreach (var update in _clusterMembershipService.MembershipUpdates.WithCancellation(_shutdownToken.Token))
                    {
                        // Determine which silos should be monitoring Kubernetes
                        var chosenSilos = _clusterMembershipService.CurrentSnapshot.Members.Values
                            .Where(s => s.Status == SiloStatus.Active)
                            .OrderBy(s => s.SiloAddress)
                            .Take(_options.CurrentValue.MaxAgents)
                            .ToList();
                        if (!_enableMonitoring && chosenSilos.Any(s => s.SiloAddress.Equals(_localSiloDetails.SiloAddress)))
                        {
                            _enableMonitoring = true;
                            _pauseMonitoringSemaphore.Release(1);
                        }
                        else if (_enableMonitoring)
                        {
                            _enableMonitoring = false;
                        }
                        if (_enableMonitoring && _options.CurrentValue.DeleteDefunctSiloPods)
                        {
                            var delta = update.CreateUpdate(previous);
                            foreach (var change in delta.Changes)
                            {
                                if (change.SiloAddress.Equals(_localSiloDetails.SiloAddress))
                                {
                                    // Ignore all changes for this silo
                                    continue;
                                }
                                if (change.Status == SiloStatus.Dead)
                                {
                                    try
                                    {
                                        if (_logger.IsEnabled(LogLevel.Information))
                                        {
                                            _logger.LogInformation("Silo {SiloAddress} is dead, proceeding to delete the corresponding pod, {PodName}, in namespace {PodNamespace}", change.SiloAddress, change.Name, _podNamespace);
                                        }
                                        await _client.DeleteNamespacedPodAsync(change.Name, _podNamespace);
                                    }
                                    catch (Exception exception)
                                    {
                                        _logger.LogError(exception, "Error deleting pod {PodName} in namespace {PodNamespace} corresponding to defunct silo {SiloAddress}", change.Name, _podNamespace, change.SiloAddress);
                                    }
                                }
                            }
                        }
                        previous = update;
                    }
                }
                catch (Exception exception) when (!(_shutdownToken.IsCancellationRequested && (exception is TaskCanceledException || exception is OperationCanceledException)))
                {
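                    if (_logger.IsEnabled(LogLevel.Debug))
                    {
                        // ... (the rest of this catch block is omitted in this excerpt)
                    }
                }
            }
        }

        // ... (excerpt gap: the beginning of MonitorKubernetesPods, which creates the pods watch request, is omitted)
                    await foreach (var (eventType, pod) in pods.WatchAsync(_shutdownToken.Token))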
                    {
                        if (!_enableMonitoring || _shutdownToken.IsCancellationRequested)
                        {
                            break;
                        }
                        if (string.Equals(pod.Metadata.Name, _podName, StringComparison.Ordinal))
                        {
                            // Never declare ourselves dead this way.
                            continue;
                        }
                        if (eventType == WatchEventType.Modified)
                        {
                            // TODO: Remember silo addresses for pods that are restarting/terminating
                        }
                        if (eventType == WatchEventType.Deleted)
                        {
                            if (this.TryMatchSilo(pod, out var member) && member.Status != SiloStatus.Dead)
                            {
                                if (_logger.IsEnabled(LogLevel.Information))
                                {
                                    _logger.LogInformation("Declaring server {Silo} dead since its corresponding pod, {Pod}, has been deleted", member.SiloAddress, pod.Metadata.Name);
                                }
                                await _clusterMembershipService.TryKill(member.SiloAddress);
                            }
                        }
                    }
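Watcher selection in MonitorOrleansClustering is deterministic: every silo runs the same Where/OrderBy/Take over the same membership snapshot, so the silos agree on who watches Kubernetes without extra coordination. A minimal sketch, assuming silo addresses are modeled as strings compared ordinally (Orleans orders real SiloAddress values with its own comparison); SiloEntry and WatcherSelectionSketch are hypothetical names:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified membership entry: the address is a string and only "active or not" is tracked.
public sealed record SiloEntry(string Address, bool IsActive);

public static class WatcherSelectionSketch
{
    // Mirrors the Where / OrderBy / Take chain in MonitorOrleansClustering.
    public static List<string> ChooseWatchers(IEnumerable<SiloEntry> silos, int maxAgents) =>
        silos.Where(s => s.IsActive)
             .OrderBy(s => s.Address, StringComparer.Ordinal)
             .Take(maxAgents)
             .Select(s => s.Address)
             .ToList();

    // A silo watches Kubernetes only if it is among the chosen silos (MaxAgents defaults to 2).
    public static bool ShouldWatch(string localAddress, IEnumerable<SiloEntry> silos, int maxAgents = 2) =>
        ChooseWatchers(silos, maxAgents).Contains(localAddress);
}
```

Ordinal string order is not numeric IP order ("10.0.0.10" sorts before "10.0.0.2"), but that does not matter here: only consistency across silos is required.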

III. Minimal Application Example (C#)

using System;
using Microsoft.Extensions.Hosting;
using Orleans.Configuration;
using Orleans.Hosting;

var builder = Host.CreateDefaultBuilder(args)
    .UseOrleans(silo =>
    {
        // Enable Kubernetes hosting (the core step)
        silo.UseKubernetesHosting();

        // A clustering provider is still required (example: Azure Storage)
        silo.UseAzureStorageClustering(options =>
            options.ConfigureTableServiceClient(Environment.GetEnvironmentVariable("STORAGE_CONNECTION_STRING")));

        // Ports (optional; the defaults are 11111 / 30000)
        silo.Configure<EndpointOptions>(opt =>
        {
            opt.SiloPort = 11111;
            opt.GatewayPort = 30000;
        });
    });

await builder.RunConsoleAsync();
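Environment.GetEnvironmentVariable returns null when STORAGE_CONNECTION_STRING is missing, so a misconfigured Secret only surfaces later, when clustering fails. A small fail-fast helper can report every missing variable at once; RequiredEnvironment is a hypothetical name, and the lookup is injected so the helper can be tested without touching the process environment.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

public static class RequiredEnvironment
{
    // Reads every requested variable through the lookup and throws one error naming all missing or blank ones.
    public static IReadOnlyDictionary<string, string> Require(Func<string, string?> lookup, params string[] names)
    {
        var values = new Dictionary<string, string>();
        var missing = new List<string>();
        foreach (var name in names)
        {
            var value = lookup(name);
            if (string.IsNullOrWhiteSpace(value))
            {
                missing.Add(name);
            }
            else
            {
                values[name] = value;
            }
        }
        if (missing.Count > 0)
        {
            throw new InvalidOperationException("Missing required environment variables: " + string.Join(", ", missing));
        }
        return values;
    }
}
```

In the program above it could be called as RequiredEnvironment.Require(Environment.GetEnvironmentVariable, "STORAGE_CONNECTION_STRING") before the clustering provider is configured.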

IV. Kubernetes YAML Examples and Notes

  • Deployment (labels, environment variables, ports, probes, graceful termination)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orleans-dictionary-app
  labels:
    app: orleans-dictionary-app
    orleans/serviceId: dictionary-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orleans-dictionary-app
  template:
    metadata:
      labels:
        app: orleans-dictionary-app
        orleans/serviceId: dictionary-app
        orleans/clusterId: dictionary-app
    spec:
      serviceAccountName: default
      automountServiceAccountToken: true
      containers:
        - name: silo
          image: my-registry.azurecr.io/my-orleans-app:latest
          imagePullPolicy: Always
          ports:
            - name: silo
              containerPort: 11111
            - name: gateway
              containerPort: 30000
          env:
            - name: ORLEANS_SERVICE_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['orleans/serviceId']
            - name: ORLEANS_CLUSTER_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['orleans/clusterId']
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
            - name: STORAGE_CONNECTION_STRING
              valueFrom:
                secretKeyRef:
                  name: az-storage-acct
                  key: key
            - name: DOTNET_SHUTDOWNTIMEOUTSECONDS
              value: "120"
          # Probes: lightweight local checks that complement Orleans membership probing
          livenessProbe:
            tcpSocket:
              port: silo
            initialDelaySeconds: 10
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            tcpSocket:
              port: silo
            initialDelaySeconds: 5
            periodSeconds: 5
            failureThreshold: 6
          resources:
            requests:
              cpu: "200m"
              memory: "512Mi"
            limits:
              cpu: "2"
              memory: "2Gi"
      terminationGracePeriodSeconds: 180
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  minReadySeconds: 60
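Two sets of numbers in this Deployment interact: the rolling-update settings bound how many Pods exist during a rollout, and the host shutdown timeout must finish before Kubernetes force-kills the container at the end of terminationGracePeriodSeconds. The sketch below works through that arithmetic for integer (non-percentage) maxSurge and maxUnavailable only; DeploymentMathSketch is a hypothetical helper, and Kubernetes' rounding of percentage values is not modeled.

```csharp
using System;

// Hypothetical result type for the rollout bounds.
public sealed record RolloutBounds(int MaxTotalPods, int MinAvailablePods);

public static class DeploymentMathSketch
{
    // Integer maxSurge / maxUnavailable only.
    public static RolloutBounds Bounds(int replicas, int maxSurge, int maxUnavailable)
    {
        if (maxSurge == 0 && maxUnavailable == 0)
        {
            // Kubernetes rejects this combination because a rollout could never make progress.
            throw new ArgumentException("maxSurge and maxUnavailable cannot both be 0.");
        }
        return new RolloutBounds(replicas + maxSurge, replicas - maxUnavailable);
    }

    // Graceful shutdown must complete before the kubelet escalates to SIGKILL.
    public static bool ShutdownFitsGracePeriod(int shutdownTimeoutSeconds, int terminationGracePeriodSeconds) =>
        shutdownTimeoutSeconds < terminationGracePeriodSeconds;
}
```

With replicas: 3, maxSurge: 1, and maxUnavailable: 0, a rollout runs with 3 to 4 Pods and never drops below 3 available ones; the 120-second shutdown timeout leaves a 60-second margin inside the 180-second grace period.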
  • RBAC (lets the agent get/list/watch/delete/patch Pods)
kind: Role
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: orleans-hosting
rules:
  - apiGroups: [ "" ]
    resources: ["pods"]
    verbs: ["get", "watch", "list", "delete", "patch"]
---
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: orleans-hosting-binding
subjects:
  - kind: ServiceAccount
    name: default
    apiGroup: ''
roleRef:
  kind: Role
  name: orleans-hosting
  apiGroup: rbac.authorization.k8s.io
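The verbs in the Role can be traced back to the agent's calls in the excerpts of section II. The mapping below is an assumption inferred from method names (for example, that AddClusterOptionsToPodLabels issues a patch), and PodRbacSketch is a hypothetical helper that lists which of those verbs a given Role would be missing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PodRbacSketch
{
    // Assumed mapping from Pod verbs to the agent calls that need them (inferred from the excerpts).
    public static readonly IReadOnlyDictionary<string, string> RequiredVerbs = new Dictionary<string, string>
    {
        ["patch"] = "AddClusterOptionsToPodLabels: write serviceId/clusterId labels at startup",
        ["list"] = "ListNamespacedPodAsync: startup reconciliation",
        ["watch"] = "pods.WatchAsync: runtime monitoring by the chosen watchers",
        ["delete"] = "DeleteNamespacedPodAsync: only when DeleteDefunctSiloPods is enabled",
    };

    // Returns the required verbs (sorted) that the granted set lacks; "*" grants everything.
    public static List<string> MissingVerbs(IEnumerable<string> grantedVerbs, bool deleteDefunctSiloPods)
    {
        var granted = new HashSet<string>(grantedVerbs, StringComparer.Ordinal);
        if (granted.Contains("*"))
        {
            return new List<string>();
        }
        return RequiredVerbs.Keys
            .Where(verb => deleteDefunctSiloPods || verb != "delete")
            .Where(verb => !granted.Contains(verb))
            .OrderBy(verb => verb, StringComparer.Ordinal)
            .ToList();
    }
}
```

For instance, a Role granting only get, list, and watch would lack patch, and also delete once DeleteDefunctSiloPods is enabled.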
  • Services (silo port reachable inside the cluster; gateway port exposed to clients)
apiVersion: v1
kind: Service
metadata:
  name: orleans-silo
spec:
  selector:
    app: orleans-dictionary-app
  ports:
    - name: silo
      port: 11111
      targetPort: 11111
  clusterIP: None
---
apiVersion: v1
kind: Service
metadata:
  name: orleans-gateway
spec:
  type: LoadBalancer
  selector:
    app: orleans-dictionary-app
  ports:
    - name: gateway
      port: 30000
      targetPort: 30000

Key points:

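  • The orleans/serviceId and orleans/clusterId labels must match the application's configuration; the Deployment injects them into ClusterOptions through the ORLEANS_* environment variables.
  • The POD_NAME, POD_NAMESPACE, and POD_IP environment variables are used to set SiloName, AdvertisedIPAddress, and related values.
  • Probes should be local TCP checks (no cross-Pod functional checks); they complement Orleans' own membership failure detection.
  • RBAC permissions are required; without them, the agent's calls to the Kubernetes API fail with 403 at startup or at runtime.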

V. Sequence Diagrams

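  • Startup: reconcile labels and membership, and mark unmatched silos as Dead
    Participants: Pod/Silo process, KubernetesClusterAgent, OrleansMembershipService, Kubernetes API
    1. Pod/Silo process: Host.UseOrleans().UseKubernetesHosting() loads the POD_* / ORLEANS_* environment variables and sets SiloName, AdvertisedIPAddress, and the listening endpoints.
    2. Pod/Silo process → KubernetesClusterAgent: subscribe to the silo lifecycle at ISiloLifecycle.AfterRuntimeGrainServices.
    3. KubernetesClusterAgent → Kubernetes API: patch this Pod's labels (serviceId/clusterId).
    4. KubernetesClusterAgent → OrleansMembershipService: Refresh() to fetch the current silo members.
    5. KubernetesClusterAgent → Kubernetes API: list Pods by label (serviceId, clusterId).
    6. KubernetesClusterAgent: compare the Pods with the silo members.
    7. KubernetesClusterAgent → OrleansMembershipService: TryKill() marks active silos with no corresponding Pod as Dead.
    8. KubernetesClusterAgent: start MonitorOrleansClustering and MonitorKubernetesPods.
  • Runtime: watchers observe Kubernetes; a Pod deletion marks its silo Dead; defunct Pods can optionally be deleted
    Participants: OrleansMembershipService, KubernetesClusterAgent, Kubernetes API
    loop [membership updates]
      1. OrleansMembershipService → KubernetesClusterAgent: MembershipUpdates.
      2. KubernetesClusterAgent: choose the first N (default 2) active silos as watchers.
      alt [this silo is chosen]
        3. KubernetesClusterAgent → Kubernetes API: watch Pods by label.
        4. Kubernetes API → KubernetesClusterAgent: Pod Deleted event.
        5. KubernetesClusterAgent → OrleansMembershipService: TryKill (mark the corresponding silo as Dead).
      alt [DeleteDefunctSiloPods enabled]
        6. KubernetesClusterAgent → Kubernetes API: DeleteNamespacedPod (delete the defunct silo's Pod).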
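The Pod-deletion branch of the runtime flow reduces to a small decision function. This sketch mirrors the conditions in the watch-loop excerpt in section II under simplifying assumptions: PodEventType and TrackedSiloStatus are hypothetical enums, and a null matchedStatus stands for TryMatchSilo finding no silo for the Pod.

```csharp
using System;

// Hypothetical enums standing in for the k8s WatchEventType and Orleans SiloStatus.
public enum PodEventType { Added, Modified, Deleted }
public enum TrackedSiloStatus { Joining, Active, ShuttingDown, Dead }

public static class PodEventSketch
{
    // Returns true when the agent would call TryKill for this event.
    // matchedStatus == null models TryMatchSilo finding no silo for the Pod.
    public static bool ShouldDeclareDead(
        bool monitoringEnabled,
        PodEventType eventType,
        string eventPodName,
        string localPodName,
        TrackedSiloStatus? matchedStatus)
    {
        if (!monitoringEnabled)
        {
            return false; // Only the chosen watchers act on Pod events.
        }
        if (string.Equals(eventPodName, localPodName, StringComparison.Ordinal))
        {
            return false; // Never declare ourselves dead this way.
        }
        if (eventType != PodEventType.Deleted)
        {
            return false; // Modified events are only a TODO in the excerpt.
        }
        return matchedStatus is not null && matchedStatus != TrackedSiloStatus.Dead;
    }
}
```

In the source, a silo that stops being a watcher breaks out of the watch loop instead of evaluating each event; the sketch folds that into the first check.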

VI. Common Problems and Troubleshooting

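  • Error: KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined

    • The message comes from the Kubernetes client's in-cluster configuration. Kubernetes normally injects both variables into every container, so the error usually means the process is running outside a Pod (for example, during local development).
    • Check inside the Pod whether the variables exist: kubectl exec -it <pod> -- printenv | findstr KUBERNETES_SERVICE_ (findstr runs in a local Windows shell; on Linux/macOS use grep instead).
    • Make sure automountServiceAccountToken: true is set and the Pod runs under a ServiceAccount with the required permissions (see the RBAC section above).
    • Reference: learn.microsoft.com - Orleans Kubernetes hosting
  • The silo name must match the Pod name (injected via POD_NAME). Ports default to 11111/30000; if you use other ports, configure EndpointOptions in the application.

  • Without a clustering provider, silos cannot join the cluster: configure a provider (Azure/ADO.NET/Consul/…) alongside UseKubernetesHosting().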
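A quick way to tell whether the process can build the in-cluster configuration is to check the two variables named in that error before enabling Kubernetes hosting. The helper below is a hypothetical sketch (InClusterCheckSketch is not an Orleans or Kubernetes client API), with an injected lookup so it can be tested.

```csharp
#nullable enable
using System;

public static class InClusterCheckSketch
{
    // True when both variables the in-cluster client configuration relies on are present and non-empty.
    public static bool LooksInCluster(Func<string, string?> lookup) =>
        !string.IsNullOrEmpty(lookup("KUBERNETES_SERVICE_HOST")) &&
        !string.IsNullOrEmpty(lookup("KUBERNETES_SERVICE_PORT"));
}
```

One possible pattern, assuming a local development setup, is to call UseKubernetesHosting() only when InClusterCheckSketch.LooksInCluster(Environment.GetEnvironmentVariable) returns true and to fall back to UseLocalhostClustering() otherwise.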


VII. Minimal Rollout Steps

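  1. Enable UseKubernetesHosting() in the application and configure a clustering provider.
  2. Build the image and push it to a container registry.
  3. Create a cluster Secret (for example az-storage-acct) holding the clustering connection string.
  4. Apply the Deployment, RBAC, and Service manifests from this article.
  5. Verify that:
    • the Pods carry all the required labels and environment variables;
    • the logs show AdvertisedIPAddress equal to POD_IP;
    • with multiple replicas, the silos discover each other, and deleting a Pod marks the corresponding silo as Dead;
    • the probes pass and rolling upgrades do not interrupt service.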

References: