4. Exploring the Elsa Source: Distributed Runtime (Distributed)

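I. What does this module do?

The Elsa.Workflows.Runtime module covered in the previous article is the "local runtime" for single-node deployments. Once the system scales out to a multi-node cluster, the same workflow instance can be triggered on any node, and two nodes operating on the same instance at the same time produce a data race.

Elsa.Workflows.Runtime.Distributed is the distributed extension layer that addresses this problem: it uses a distributed lock to ensure that, at any given moment, only one node is running a given instance.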

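II. Registration: replacing key implementations through a Feature

DistributedRuntimeFeature (Features/DistributedRuntimeFeature.cs) is the entry point for enabling distributed mode. Using Elsa's Feature system, it swaps in new implementations of key interfaces during the Configure phase:

// Features/DistributedRuntimeFeature.cs, line 19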
public override void Configure()
{
    Module.UseWorkflowRuntime(runtime =>
    {
        // Replace the local runtime with the distributed runtime
        runtime.WorkflowRuntime = sp => sp.GetRequiredService<DistributedWorkflowRuntime>();
        // Replace the bookmark queue worker with the distributed-lock version
        runtime.BookmarkQueueWorker = sp => sp.GetRequiredService<DistributedBookmarkQueueWorker>();
    });
}

public override void Apply()
{
    Services
        .AddScoped<DistributedWorkflowRuntime>()
        .AddScoped<DistributedBookmarkQueueWorker>()
        // Decorator pattern: wrap the refresher and the reloader with distributed-lock protection
        .Decorate<IWorkflowDefinitionsRefresher, DistributedWorkflowDefinitionsRefresher>()
        .Decorate<IWorkflowDefinitionsReloader, DistributedWorkflowDefinitionsReloader>();
}

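Once enabled, the feature overrides a handful of key implementations through DI registrations; every other service (trigger indexing, bookmark matching, and so on) stays as it is and needs no changes. Usage:

// Enable the distributed runtime at application startup
services.AddElsa(elsa => elsa.UseWorkflowRuntime(runtime => runtime.UseDistributedRuntime()));

Note: by default, Elsa's distributed lock is file-system based. Unless every node mounts the same shared volume, such a lock is only visible to the local node, so a multi-node deployment needs another backend such as Redis.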
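To make the note above concrete, here is a minimal configuration sketch that swaps the lock backend to Redis. It rests on several assumptions that are not confirmed by the source walked through here: that the runtime feature exposes a DistributedLockProvider factory property, that UseDistributedRuntime lives in the Elsa.Extensions namespace, that the DistributedLock.Redis and StackExchange.Redis packages are referenced, and that a Redis server listens on localhost:6379. Treat it as a starting point to check against the Elsa version in use, not as the canonical setup.

```csharp
// Configuration sketch only (ASP.NET Core Program.cs with implicit usings).
// Assumed packages: Elsa, DistributedLock.Redis, StackExchange.Redis.
using Elsa.Extensions;               // assumption: home of the Use* extension methods
using Medallion.Threading.Redis;
using StackExchange.Redis;

var builder = WebApplication.CreateBuilder(args);

// Assumption: a Redis server reachable at localhost:6379.
var redis = ConnectionMultiplexer.Connect("localhost:6379");

builder.Services.AddElsa(elsa => elsa.UseWorkflowRuntime(runtime =>
{
    runtime.UseDistributedRuntime();

    // Assumption: the runtime feature exposes a lock-provider factory.
    // Every node must point at the same Redis database for the lock to be shared.
    runtime.DistributedLockProvider = _ =>
        new RedisDistributedSynchronizationProvider(redis.GetDatabase());
}));

var app = builder.Build();
app.Run();
```

The point of the sketch is only that the backend is chosen in one place at startup; nothing in the distributed runtime or client has to know which backend is behind the provider.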

III. The distributed runtime: DistributedWorkflowRuntime

DistributedWorkflowRuntime exposes exactly the same interface as LocalWorkflowRuntime. The only difference is that CreateClientAsync() creates a DistributedWorkflowClient instead of a LocalWorkflowClient:

// Services/DistributedWorkflowRuntime.cs, line 30
public ValueTask<IWorkflowClient> CreateClientAsync(string? workflowInstanceId, CancellationToken cancellationToken = default)
{
    workflowInstanceId ??= _identityGenerator.GenerateId();
    // Key point: create a client protected by a distributed lock
    var client = (IWorkflowClient)ActivatorUtilities.CreateInstance(
        _serviceProvider, typeof(DistributedWorkflowClient), workflowInstanceId);
    return new(client);
}
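The call to ActivatorUtilities.CreateInstance above mixes one caller-supplied argument (the workflow instance ID) with constructor parameters resolved from the container. The following sketch shows that mechanism in isolation. IGreeter, Greeter, DemoClient, and ActivatorDemo are hypothetical names invented for this illustration, and the Microsoft.Extensions.DependencyInjection package is assumed to be referenced.

```csharp
// Requires the Microsoft.Extensions.DependencyInjection package.
using System;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical service that the container knows how to build.
public interface IGreeter { string Greet(string name); }

public sealed class Greeter : IGreeter
{
    public string Greet(string name) => $"hello {name}";
}

// Hypothetical client: 'instanceId' comes from the caller at runtime,
// 'greeter' is resolved from the service provider.
public sealed class DemoClient(string instanceId, IGreeter greeter)
{
    public string Describe() => greeter.Greet(instanceId);
}

public static class ActivatorDemo
{
    public static string Run(string instanceId)
    {
        using var provider = new ServiceCollection()
            .AddSingleton<IGreeter, Greeter>()
            .BuildServiceProvider();

        // The explicit argument fills the string parameter; the remaining
        // constructor parameter is satisfied from the container.
        var client = ActivatorUtilities.CreateInstance<DemoClient>(provider, instanceId);
        return client.Describe();
    }
}
```

Calling ActivatorDemo.Run("abc") returns "hello abc". This is why the client itself does not need to be registered in the container per instance: each call builds a fresh object bound to one instance ID.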

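IV. Distributed-lock protection: DistributedWorkflowClient

DistributedWorkflowClient is the distributed version of the workflow client. It holds a LocalWorkflowClient internally and acquires a distributed lock before performing write operations:

// Services/DistributedWorkflowClient.cs, line 13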
public class DistributedWorkflowClient(
    string workflowInstanceId,
    IDistributedLockProvider distributedLockProvider,
    ITransientExceptionDetector transientExceptionDetector,
    IOptions<DistributedLockingOptions> distributedLockingOptions,
    IServiceProvider serviceProvider,
    ILogger<DistributedWorkflowClient> logger) : IWorkflowClient
{
    // Holds a local client internally; all actual work is delegated to it
    private readonly LocalWorkflowClient _localWorkflowClient =
        ActivatorUtilities.CreateInstance<LocalWorkflowClient>(serviceProvider, workflowInstanceId);
}

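Which operations take the lock and which do not:

// Services/DistributedWorkflowClient.cs

// Create instance: no lock needed (each call writes a new record, so no two nodes write the same record concurrently)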
public async Task<CreateWorkflowInstanceResponse> CreateInstanceAsync(...) =>
    await _localWorkflowClient.CreateInstanceAsync(request, cancellationToken);

// Run instance: lock required, so that two nodes cannot run the same instance at the same time
public async Task<RunWorkflowInstanceResponse> RunInstanceAsync(...) =>
    await WithLockAsync(async () => await _localWorkflowClient.RunInstanceAsync(request, cancellationToken), cancellationToken);

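// Create and run: create without the lock, then run under the lock
// Even a newly created workflow needs the lock: it may be linked to child workflows, and a child may try to resume the parent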
public async Task<RunWorkflowInstanceResponse> CreateAndRunInstanceAsync(...)
{
    var workflowInstance = await _localWorkflowClient.CreateInstanceInternalAsync(createRequest, ...);
    return await WithLockAsync(async () => await _localWorkflowClient.RunInstanceAsync(workflowInstance, ...), ...);
}

// Delete: lock required (prevents a delete from racing with execution, which would leave the run reading deleted state)
public async Task<bool> DeleteAsync(...) =>
    await WithLockAsync(async () => await _localWorkflowClient.DeleteAsync(cancellationToken), cancellationToken);

// Export/import state, check whether the instance exists: no lock needed (read-only or idempotent writes)
public async Task<WorkflowState> ExportStateAsync(...) => await _localWorkflowClient.ExportStateAsync(cancellationToken);

Lock implementation (Services/DistributedWorkflowClient.cs, line 89):

// Services/DistributedWorkflowClient.cs, line 89
private async Task<TReturn> WithLockAsync<TReturn>(Func<Task<TReturn>> func, CancellationToken cancellationToken = default)
{
    var lockKey = $"workflow-instance:{WorkflowInstanceId}"; // Lock granularity = a single instance
    var lockHandle = await AcquireLockWithRetryAsync(lockKey, cancellationToken);

    try
    {
        return await func();
    }
    finally
    {
        await ReleaseLockAsync(lockHandle);
        // Note: a failed release is only logged, not rethrown; the lock expires on its own once the connection drops
    }
}
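The acquire, run, release shape of WithLockAsync can be tried out without any lock backend. The sketch below is an in-process stand-in, not Elsa's implementation: it replaces the distributed lock with one SemaphoreSlim per key, and KeyedLock, KeyedLockDemo, and Counter are hypothetical names. It only demonstrates that callers sharing a key are serialized.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// In-process stand-in for a distributed lock: one SemaphoreSlim per key.
// Assumption: a real cluster would go through IDistributedLockProvider instead.
public static class KeyedLock
{
    private static readonly ConcurrentDictionary<string, SemaphoreSlim> Locks = new();

    public static async Task<T> WithLockAsync<T>(string instanceId, Func<Task<T>> func)
    {
        // Same key format as in the excerpt: the lock is scoped to one instance.
        var gate = Locks.GetOrAdd($"workflow-instance:{instanceId}", _ => new SemaphoreSlim(1, 1));
        await gate.WaitAsync();
        try { return await func(); }
        finally { gate.Release(); }
    }
}

public static class KeyedLockDemo
{
    private sealed class Counter { public int Inside; }

    // Starts 'runs' concurrent callers for one instance and returns the largest
    // number of callers ever observed inside the critical section at once.
    public static async Task<int> MaxConcurrencyAsync(string instanceId, int runs)
    {
        var counter = new Counter();
        var tasks = new Task<int>[runs];
        for (var i = 0; i < runs; i++)
        {
            tasks[i] = Task.Run(() => KeyedLock.WithLockAsync(instanceId, async () =>
            {
                var seen = Interlocked.Increment(ref counter.Inside);
                await Task.Delay(5);                  // simulate a workflow run
                Interlocked.Decrement(ref counter.Inside);
                return seen;
            }));
        }

        var max = 0;
        foreach (var seen in await Task.WhenAll(tasks))
            max = Math.Max(max, seen);
        return max;
    }
}
```

With the lock in place, MaxConcurrencyAsync("a", 8) returns 1, because each caller sees itself as the only one inside. Callers that use different instance IDs get different semaphores and do not block each other, which mirrors the per-instance granularity of the real lock key.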

Retry strategy: Elsa uses Polly, a resilience library, to implement retries with exponential backoff.

// Services/DistributedWorkflowClient.cs, line 130
private static ResiliencePipeline CreateRetryPipeline(...)
{
    return new ResiliencePipelineBuilder()
        .AddRetry(new()
        {
            MaxRetryAttempts = 3,                        // At most 3 retries
            Delay = TimeSpan.FromMilliseconds(500),      // Initial delay of 500 ms
            BackoffType = DelayBackoffType.Exponential,  // Exponential backoff
            UseJitter = true,                            // Jitter to avoid a thundering herd
            ShouldHandle = new PredicateBuilder().Handle<Exception>(transientExceptionDetector.IsTransient),
            OnRetry = args =>
            {
                logger.LogWarning(args.Outcome.Exception, "Transient error acquiring lock for workflow instance {WorkflowInstanceId}. Attempt {AttemptNumber}.", ...);
                return ValueTask.CompletedTask;
            }
        })
        .Build();
}
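To get a feel for what these settings mean in time, the sketch below hand-rolls the same idea without Polly. It assumes the delay doubles on each retry, so an initial delay of 500 ms gives base delays of 500 ms, 1 s, and 2 s before jitter. The jitter here is a simplified random scaling chosen for the illustration and is not Polly's jitter algorithm; Backoff, BaseDelay, and RetryAsync are hypothetical names.

```csharp
using System;
using System.Threading.Tasks;

public static class Backoff
{
    // Base delay before retry number 'retry' (0-based), without jitter:
    // initial * 2^retry.
    public static TimeSpan BaseDelay(TimeSpan initial, int retry) =>
        TimeSpan.FromMilliseconds(initial.TotalMilliseconds * Math.Pow(2, retry));

    // Hand-rolled retry loop. 'isTransient' plays the role of the transient
    // exception detector: non-transient exceptions propagate immediately.
    public static async Task<T> RetryAsync<T>(
        Func<Task<T>> action,
        Func<Exception, bool> isTransient,
        int maxRetries,
        TimeSpan initial,
        Random? random = null)
    {
        for (var retry = 0; ; retry++)
        {
            try
            {
                return await action();
            }
            catch (Exception e) when (retry < maxRetries && isTransient(e))
            {
                var delay = BaseDelay(initial, retry);
                // Simplified jitter (assumption): scale by a factor in [0.5, 1.5)
                // so that nodes retrying at the same time drift apart.
                if (random != null) delay *= 0.5 + random.NextDouble();
                await Task.Delay(delay);
            }
        }
    }
}
```

With maxRetries set to 3, the action runs at most four times in total: the first attempt plus three retries. Once the retries are used up, the exception filter no longer matches and the last exception reaches the caller.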

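The lock key has the form workflow-instance:{WorkflowInstanceId}, so the lock is scoped to a single instance: different instances can run in parallel on different nodes without affecting each other.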

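V. Distributed protection for the bookmark queue: DistributedBookmarkQueueWorker

The BookmarkQueueWorker covered in the previous article processes all queued bookmarks each time it wakes up, which is fine on a single node. With multiple nodes, each node runs its own worker, so several workers may process the same batch of bookmarks at once and resume the same workflow more than once.

DistributedBookmarkQueueWorker inherits from BookmarkQueueWorker and overrides ProcessAsync() so that it tries to acquire a distributed lock before processing:

// Services/DistributedBookmarkQueueWorker.cs, line 13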
protected override async Task ProcessAsync(CancellationToken cancellationToken)
{
    // Try to acquire the lock (timeout 0: do not wait; if acquisition fails, skip this round)
    await using var handle = await distributedLockProvider.TryAcquireLockAsync(
        nameof(DistributedBookmarkQueueWorker), TimeSpan.Zero, cancellationToken);

    if (handle == null)
    {
        // Failure means another node is already processing, so this node skips
        logger.LogInformation("Could not acquire lock for distributed bookmark queue worker. " +
            "This is usually an indication that another application instance is already processing.");
        return;
    }

    await base.ProcessAsync(cancellationToken); // Lock acquired: process as usual
}

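At any moment only one node can acquire the lock; the others skip this round and compete again when the next signal arrives. TimeSpan.Zero means the call does not wait, so a node that loses the race is not delayed at all.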
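The try-once-then-skip behaviour can be reproduced locally with a SemaphoreSlim, whose WaitAsync(TimeSpan.Zero) also returns immediately with false when the lock is taken. The sketch below is an in-process illustration with hypothetical names (TryLockDemo, ProcessAsync, RaceAsync); in a real cluster the nodes would share the lock through a backend rather than a static field.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TryLockDemo
{
    // Stand-in for the shared lock (assumption: one process plays all nodes).
    private static readonly SemaphoreSlim Gate = new(1, 1);

    // Returns true if this caller processed the queue, false if it skipped.
    public static async Task<bool> ProcessAsync(Func<Task> work)
    {
        // Zero timeout: no waiting, false comes back at once if the lock is held.
        if (!await Gate.WaitAsync(TimeSpan.Zero))
            return false;

        try
        {
            await work();
            return true;
        }
        finally
        {
            Gate.Release();
        }
    }

    // Two "nodes" react to the same signal: the first holds the lock while its
    // work is pending, the second tries in the meantime and skips.
    public static async Task<(bool First, bool Second)> RaceAsync()
    {
        var finishFirst = new TaskCompletionSource();
        var first = ProcessAsync(() => finishFirst.Task);           // takes the lock, work still pending
        var second = await ProcessAsync(() => Task.CompletedTask);  // lock is held, so this skips
        finishFirst.SetResult();                                    // let the first node finish
        return (await first, second);
    }
}
```

RaceAsync() yields (true, false): the first caller did the work and the second returned without waiting. Skipping is safe in this design because the winner processes the whole queue, so the loser has nothing left to do until the next signal.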

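VI. Distributed protection for workflow definition refresh

DistributedWorkflowDefinitionsRefresher is a textbook use of the decorator pattern: it wraps the inner IWorkflowDefinitionsRefresher and acquires a distributed lock before the actual refresh runs:

// Services/DistributedWorkflowDefinitionsRefresher.cs, line 20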
public async Task<RefreshWorkflowDefinitionsResponse> RefreshWorkflowDefinitionsAsync(
    RefreshWorkflowDefinitionsRequest request, ...)
{
    var isRefreshingAll = request.DefinitionIds == null || request.DefinitionIds.Count == 0;

    // The lock key distinguishes two cases: "refresh all" and "refresh a specific list of IDs"
    var lockKey = isRefreshingAll
        ? "WorkflowDefinitionsRefresher:All"
        : $"WorkflowDefinitionsRefresher:{string.Join(",", request.DefinitionIds!.OrderBy(x => x))}";

    await using var distributedLock = await distributedLockProvider.TryAcquireLockAsync(
        lockKey, TimeSpan.Zero, cancellationToken);

    if (distributedLock == null)
    {
        // Another node is already refreshing; return the AlreadyInProgress status
        return new(Array.Empty<string>(), failedDefinitionIds, RefreshWorkflowDefinitionsStatus.AlreadyInProgress);
    }

    return await inner.RefreshWorkflowDefinitionsAsync(request, cancellationToken);
}

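The lock key distinguishes the two refresh scopes precisely, which avoids needless lock contention between, say, node A refreshing everything and node B refreshing a subset. Because the IDs are sorted before they are joined, the same set of IDs produces the same key regardless of the order in which it was passed in.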
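The key construction is easy to check in isolation. The sketch below restates it as a small helper; RefreshLockKey and Build are hypothetical names, and the ordinal comparer is an assumption added here to keep the key independent of the current culture (the excerpt above sorts with the default comparer).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RefreshLockKey
{
    // Builds the lock key for a refresh request:
    // one fixed key for "refresh all", one key per distinct set of IDs otherwise.
    public static string Build(IReadOnlyCollection<string>? definitionIds)
    {
        if (definitionIds == null || definitionIds.Count == 0)
            return "WorkflowDefinitionsRefresher:All";

        // Sorting makes the key depend on the set of IDs, not on their order.
        // Assumption: ordinal comparison, chosen here for culture-independent keys.
        var sorted = definitionIds.OrderBy(id => id, StringComparer.Ordinal);
        return $"WorkflowDefinitionsRefresher:{string.Join(",", sorted)}";
    }
}
```

Build(new[] { "b", "a" }) and Build(new[] { "a", "b" }) both give "WorkflowDefinitionsRefresher:a,b", while Build(null) gives "WorkflowDefinitionsRefresher:All". Two requests only contend when they map to the same key; a request for a subset and a request for everything use different keys and can run side by side.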

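VII. Under the distributed lock: Medallion.Threading

Elsa uses Medallion.Threading as the underlying distributed-lock library (abstracted behind the IDistributedLockProvider interface). The library supports several backends:

Backend	Use case
`SqlServer`	Implemented with SQL Server's `sp_getapplock`; a good fit for projects that already run SQL Server
`Postgres`	Implemented with Postgres advisory locks
`Redis`	Implemented on Redis; suited to high-concurrency scenarios
`Azure`	Implemented on Azure Blob Storage
`FileSystem`	File-system locks; suited to local testing

The application only has to configure which backend to use; the distributed client code does not change.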

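VIII. Single-node vs. distributed: side-by-side comparison

Component	Single-node (Local)	Distributed
`IWorkflowRuntime`	`LocalWorkflowRuntime`	`DistributedWorkflowRuntime`
`IWorkflowClient`	`LocalWorkflowClient`	`DistributedWorkflowClient` (takes the distributed lock, then delegates to Local)
`IBookmarkQueueWorker`	`BookmarkQueueWorker`	`DistributedBookmarkQueueWorker` (takes the distributed lock, then delegates to base)
`IWorkflowDefinitionsRefresher`	Refreshes directly	`DistributedWorkflowDefinitionsRefresher` (decorator; takes the lock, then delegates to inner)
Execution logic	Identical	Identical (the distributed layer leaves execution logic unchanged and only adds coordination)
Retry strategy	N/A	Polly resilience pipeline (exponential backoff + jitter + transient-exception detection)

The core design principle of the distributed version: replace only the outward-facing runtime and client, and leave the actual execution logic alone. The real execution is still done by the local logic in Elsa.Workflows.Core and Elsa.Workflows.Runtime; the distributed layer is just a coordination shell wrapped around it.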

IX. End-to-end flow

In a multi-node deployment, the full path of an external signal from arrival to execution on some node:

External signal (received by any node)
  ↓
StimulusSender.SendAsync() (identical to single-node)
  ├─ TriggerNewWorkflowsAsync()
  │    → DistributedWorkflowRuntime.CreateClientAsync()
  │    → DistributedWorkflowClient.CreateAndRunInstanceAsync()
  │         → create the instance without a lock (LocalWorkflowClient.CreateInstanceInternalAsync())
  │         → WithLockAsync() takes the distributed lock workflow-instance:{id}
  │              → Polly retry strategy (up to 3 retries, exponential backoff starting at 500 ms)
  │         → LocalWorkflowClient.RunInstanceAsync()
  │         → WorkflowRunner.RunAsync() [executed by Elsa.Workflows.Core]
  │
  └─ ResumeExistingWorkflowsAsync()
       → WorkflowResumer.ResumeAsync()
       → DistributedWorkflowRuntime.CreateClientAsync()
       → DistributedWorkflowClient.RunInstanceAsync()
            → WithLockAsync() takes the distributed lock workflow-instance:{id}
            → LocalWorkflowClient.RunInstanceAsync()
            → WorkflowRunner.RunAsync() [executed by Elsa.Workflows.Core]

Bookmark queue processing (nodes compete):
  The BookmarkQueueWorker on every node receives the signal
  → DistributedBookmarkQueueWorker.ProcessAsync()
  → TryAcquireLockAsync(TimeSpan.Zero)
    ├─ lock acquired → base.ProcessAsync() → BookmarkQueueProcessor processes the queue
    └─ lock not acquired → skip (another node is already processing)

The distributed lock guarantees that WorkflowRunner.RunAsync() at the end of the chain always runs serially for any given instance, which removes the concurrency hazard.

Next article: Elsa.Expressions, the expression engine in detail

posted @ 2026-04-25 09:16 叨奈特挖井人

