
Conversation


@nihongye nihongye commented May 22, 2025

Please do not create a Pull Request without creating an issue first.

What is the purpose of the change

fix #12792 Fix io.grpc.netty.shaded.io.netty.util.internal.OutOfDirectMemoryError caused by distro verify task burst

Brief changelog

  1. Root Cause: When the number of connections exceeds the cluster's gRPC thread pool queue size (configured via remote.executor.queue.size), a sudden burst of requests initiated by DistroVerifyTimedTask may exceed the processing thread pool's queue capacity. This triggers task rejection exceptions, after which the memory blocks (typically 64KB or 128KB) that gRPC allocated for the requests and responses are not properly released.

  2. Resolution: Refactored the task scheduling of DistroVerifyExecuteTask to distribute the batched verification requests across the entire verifyIntervalMillis window (see the sketch below). This prevents abrupt queue buildup in the thread pool and lowers the peak demand for direct memory.
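
For illustration, a minimal, self-contained sketch of the spreading idea. This is not the PR's actual code: DistroVerifyExecuteTask's real logic lives in the Nacos core module, and SpreadScheduleSketch, scheduleSpread, and the plain JDK executor below are stand-ins.

import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class SpreadScheduleSketch {

    // Stand-in for the dedicated 4-thread DISTRO_VERIFY_EXECUTOR the PR adds.
    private static final ScheduledExecutorService VERIFY_EXECUTOR =
            Executors.newScheduledThreadPool(4);

    /**
     * Splits the items into fixed-size batches and schedules each batch at a
     * random offset inside the verify interval, so the requests no longer hit
     * the remote gRPC thread pool queue all at once.
     */
    static <T> void scheduleSpread(List<T> items, int batchSize, long intervalMillis,
            Consumer<List<T>> sender) {
        for (int from = 0; from < items.size(); from += batchSize) {
            List<T> batch = items.subList(from, Math.min(from + batchSize, items.size()));
            long delay = intervalMillis > 0
                    ? ThreadLocalRandom.current().nextLong(intervalMillis) : 0L;
            VERIFY_EXECUTOR.schedule(() -> sender.accept(batch), delay, TimeUnit.MILLISECONDS);
        }
    }
}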

Verifying this change

Follow this checklist to help us incorporate your contribution quickly and easily:

  • Make sure there is a Github issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a Github issue. Your pull request should address just this issue, without pulling in other changes - one PR resolves one issue.
  • Format the pull request title like [ISSUE #123] Fix UnknownException when host config not exist. Each commit in the pull request should have a meaningful subject line and body.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Write the necessary unit tests to verify your logic correction; prefer mocks where cross-module dependencies exist. If a new feature or significant change is committed, please remember to add an integration test in the test module.
  • Run mvn -B clean package apache-rat:check findbugs:findbugs -Dmaven.test.skip=true to make sure basic checks pass. Run mvn clean install -DskipITs to make sure unit-test pass. Run mvn clean test-compile failsafe:integration-test to make sure integration-test pass.

…OfDirectMemoryError issue

1. Cause: When the number of connections exceeds the cluster's gRPC thread pool queue size (remote.executor.queue.size), the number of requests DistroVerifyTimedTask fires at once may exceed the gRPC processing thread pool's queue size, triggering task rejection exceptions; gRPC then fails to release the memory blocks (typically 64KB or 128KB) allocated for the requests and responses.
2. Fix: Spread the requests across Distro's entire verifyIntervalMillis window, avoiding an instantaneous buildup in the thread pool queue and lowering the upper bound on direct-memory demand.
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


hongye.nhy seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.


Thanks for this PR. 🙏
Please check again whether your PR contains any usage/API/configuration change, such as adding a new API, adding a new configuration, or changing a configuration's default value.
If so, please add or update the documents (markdown) under docs/next/ in the nacos-group/nacos-group.github.io repository.




lingma-agents bot commented May 22, 2025

Fix the Netty direct-memory OOM caused by bursts from the Distro verify task

Changed files

File path | Change description
core/src/main/java/com/alibaba/nacos/core/distributed/distro/DistroConfig.java | Adds the verifyBatchSize config item with getter/setter, brings the verify batch size into the dynamic configuration system, and updates the config print format
core/src/main/java/com/alibaba/nacos/core/distributed/distro/DistroConstants.java | Adds the DATA_VERIFY_BATCH_SIZE constant with a default value of 100
core/src/main/java/com/alibaba/nacos/core/distributed/distro/task/verify/DistroVerifyExecuteTask.java | Replaces single-threaded sequential processing with batched asynchronous scheduling: per-batch delays spread the requests evenly, and a dedicated thread pool runs the verification
core/src/main/java/com/alibaba/nacos/core/utils/GlobalExecutor.java | Creates the 4-thread DISTRO_VERIFY_EXECUTOR pool that runs the batched verify tasks
core/src/test/java/com/alibaba/nacos/core/distributed/distro/task/verify/DistroVerifyExecuteTaskTest.java | Adds test cases for different batch sizes, verifying correct task dispatch and execution

Sequence diagram

sequenceDiagram
    participant DVEST as DistroVerifyExecuteTask
    participant GE as GlobalExecutor
    participant DTA as DistroTransportAgent
    DVEST->>+GE: schedule batch tasks with random delay
    loop each batch
        GE->>DVEST: execute batch task
        DVEST->>+DTA: syncVerifyData for batch items
        DTA-->>-DVEST: return verification results
    end
    DVEST-->>GE: all batches completed

💡 Tips

How to interact with lingma-agents

📜 Reply to this comment directly
Reply to this comment and lingma-agents will handle your request automatically. For example:

  • Add detailed explanatory comments to the current code.

  • Please elaborate on the LRU refactoring plan you mentioned, with pseudocode.

📜 Comment on a specific code line
Create a comment at a specific location in a file and mention @lingma-agents. For example:

  • @Lingma-Agent analyze this method's performance bottleneck and suggest optimizations.

  • @Lingma-Agent generate optimized code for this method.

📜 Ask in a discussion
Mention @lingma-agents in any discussion to get help. For example:

  • @Lingma-Agent please summarize the discussion above and propose a solution.

  • @Lingma-Agent please generate optimized code based on the discussion.


@lingma-agents lingma-agents bot left a comment


🔍 Code Review Report

🎯 Overview of findings

Severity | Count | Description
🔴 Blocker | 0 | Blocking issues that must be fixed immediately, e.g., system crashes, unavailable critical functionality, or severe security vulnerabilities.
🟠 Critical | 1 | Serious issues to fix with high priority, e.g., broken core functionality or performance bottlenecks that hurt the user experience.
🟡 Major | 2 | Significant issues; fixes recommended, e.g., non-core functional defects or poor code maintainability.
🟢 Minor | 0 | Minor issues; optimize as appropriate, e.g., formatting issues or missing comments.

Total: 3 issues


📋 Review details

💡 Per-file suggestions

The following are file-level code suggestions, focused on readability, maintainability, and potential issues.
☕ core/src/main/java/com/alibaba/nacos/core/distributed/distro/DistroConfig.java (1 💬)
☕ core/src/main/java/com/alibaba/nacos/core/utils/GlobalExecutor.java (1 💬)
☕ core/src/test/java/com/alibaba/nacos/core/distributed/distro/task/verify/DistroVerifyExecuteTaskTest.java (1 💬)

🚀 Cross-file suggestions

The following is a combined analysis of the architecture and design, focused on cross-file interactions, system consistency, and potential optimizations.
🔍 1. A fixed thread pool configuration may become a performance bottleneck

The new DISTRO_VERIFY_EXECUTOR pool is fixed at 4 threads and does not adapt its size to system load or data volume. Under high concurrency, a fixed pool can become a bottleneck, causing task queue buildup or timeouts, especially when batch-processing large amounts of distro verify data. Consider a dynamic sizing strategy or a configurable thread count; a sketch follows.

⚠️ Potential risk: Under high concurrency or large-scale data verification, a fixed pool may fail to keep up, and the resulting backlog can create new memory pressure or timeout exceptions.
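
If a configurable size were wanted, a minimal sketch could read the thread count from a system property. The property name nacos.distro.verify.threads is an assumption for illustration, not part of the PR:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;

// Hypothetical: read the pool size from a system property instead of hard-coding 4.
int threads = Integer.getInteger("nacos.distro.verify.threads", 4);
ScheduledExecutorService pool = Executors.newScheduledThreadPool(threads,
        r -> new Thread(r, "com.alibaba.nacos.core.distro.verify"));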

🔍 2. Batch processing has no failure-isolation mechanism

In the requestBatch method, a failure while processing one item can interrupt verification for the rest of the batch. When a single DistroData fails verification, verification of the remaining data in the batch is cut short, with no isolation or retry mechanism. This can leave part of the data unverified and affect distributed data consistency.

📌 Key code:

for (DistroData each : batch) {
    try {
        ...
    } catch (Throwable e) {
        Loggers.DISTRO.error(...);
    }
}

⚠️ Potential risk: A single failure can leave large amounts of data unverified, affecting the reliability of cluster data sync.

🔍 3. The configuration parameter has no validation

The new verifyBatchSize parameter is not validated for legality (e.g., values <= 0), which can put the program into an unpredictable state. Beyond the missing comment noted in the file-level suggestion, the system should validate the parameter when the configuration is loaded, so that illegal values cannot trigger runtime exceptions.

📌 Key code:

public static final int DEFAULT_DATA_VERIFY_BATCH_SIZE = 100;

⚠️ Potential risk: An illegal value can cause an infinite loop or memory overflow; for example, a configured value of 0 breaks the paging logic.
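
A minimal guard of the kind suggested here, assuming the value is read once at config-refresh time (illustrative only, not from the PR):

// Clamp an externally configured batch size so that a value of 0 or below
// can never break the paging loop; fall back to the default instead.
static int sanitizeBatchSize(int configured, int defaultSize) {
    if (configured <= 0) {
        return defaultSize;
    }
    return configured;
}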

🔍 4. The random-delay strategy can cause verification delays

RandomUtils.nextLong(0, interval) is used to generate the random delay. When verifyIntervalMillis is configured small (e.g., < 100ms), the scheduling intervals can become too short and actually worsen the concurrency pressure. Consider a more sensible random-range calculation (e.g., a configurable minimum interval).

📌 Key code:

RandomUtils.nextLong(0, DistroConfig.getInstance().getVerifyIntervalMillis())

⚠️ Potential risk: An ill-chosen random delay can make scheduling too dense and create new resource contention.
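
A floor-bounded variant along the lines this comment suggests; MIN_DELAY_MILLIS is a hypothetical constant, not part of the PR:

import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch: never schedule below a minimum delay, even when
// verifyIntervalMillis is configured very small.
static final long MIN_DELAY_MILLIS = 100L;

static long boundedRandomDelay(long intervalMillis) {
    long upperExclusive = Math.max(intervalMillis, MIN_DELAY_MILLIS + 1);
    return ThreadLocalRandom.current().nextLong(MIN_DELAY_MILLIS, upperExclusive);
}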

🔍 5. Configuration changes are not hot-reloaded

The verifyBatchSize setting in DistroConfig cannot be reloaded dynamically; adjusting the batch size in production requires a service restart. This reduces the system's ability to react to traffic bursts in real time.

📌 Key code:

verifyBatchSize = EnvUtil.getProperty(...)

⚠️ Potential risk: The batch size cannot be adjusted without interrupting service, reducing the system's elasticity.

🔍 6. The batch logic has a potential tracking gap (N+1 style)

Each batch is currently submitted as its own ScheduledFuture, but completion status is never tracked. If the system fails, some batches may never run to completion and their execution state cannot be traced. Consider a task-status tracking mechanism; a sketch follows.

📌 Key code:

GlobalExecutor.DISTRO_VERIFY_EXECUTOR.schedule((() -> { ... }))

⚠️ Potential risk: There is no guarantee that every batch task reaches a final completed state, which can leave data verification incomplete.
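
One way to track completion, sketched with CompletableFuture. The send callback, the batches parameter, and the delay are placeholders; the PR itself does not track futures:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Illustrative sketch: complete one future per batch so stragglers are visible. */
static <T> void sendTracked(ScheduledExecutorService executor, List<List<T>> batches,
        Consumer<List<T>> send, long delayMillis, long intervalMillis) {
    List<CompletableFuture<Void>> pending = new ArrayList<>();
    for (List<T> batch : batches) {
        CompletableFuture<Void> f = new CompletableFuture<>();
        executor.schedule(() -> {
            try {
                send.accept(batch); // placeholder for the actual verify call
                f.complete(null);
            } catch (Throwable t) {
                f.completeExceptionally(t);
            }
        }, delayMillis, TimeUnit.MILLISECONDS);
        pending.add(f);
    }
    // Surface batches that have not finished within the verify interval.
    CompletableFuture.allOf(pending.toArray(new CompletableFuture[0]))
            .orTimeout(intervalMillis, TimeUnit.MILLISECONDS);
}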



@@ -43,7 +43,9 @@ public class DistroConfig extends AbstractDynamicConfig {
     private long loadDataRetryDelayMillis = DistroConstants.DEFAULT_DATA_LOAD_RETRY_DELAY_MILLISECONDS;
 
     private long loadDataTimeoutMillis = DistroConstants.DEFAULT_DATA_LOAD_TIMEOUT_MILLISECONDS;
 
+    private int verifyBatchSize = DistroConstants.DEFAULT_DATA_VERIFY_BATCH_SIZE;

The new verifyBatchSize field lacks a documentation comment

🟡 Major | 🧹 Code Smells

📋 Details

The verifyBatchSize field added to DistroConfig has no documentation comment, so other developers cannot tell what the configuration item is for or what a reasonable value range is. Per the code standardization guidelines, every configuration item should carry a clear comment describing its business meaning and default value.

💡 Solution

Add a documentation comment before the field:

+    /**
+     * Batch size for processing verify task data in batches; default 100.
+     * Controls the amount of data per verify network request, avoiding
+     * memory overflow caused by oversized single requests.
+     */
🔧 Suggested code

‼️ AI-generated code: check the logic and conventions and test before applying

Suggested change
private int verifyBatchSize = DistroConstants.DEFAULT_DATA_VERIFY_BATCH_SIZE;
/**
 * Batch size for processing verify task data in batches; default 100.
 * Controls the amount of data per verify network request, avoiding
 * memory overflow caused by oversized single requests.
 */
private int verifyBatchSize = DistroConstants.DEFAULT_DATA_VERIFY_BATCH_SIZE;


Comment on lines +57 to +59
public static final ScheduledExecutorService DISTRO_VERIFY_EXECUTOR = ExecutorFactory.Managed
        .newScheduledExecutorService(ClassUtils.getCanonicalName(GlobalExecutor.class), 4,
                new NameThreadFactory("com.alibaba.nacos.core.distro.verify"));

The new DISTRO_VERIFY_EXECUTOR thread pool has no core-thread keep-alive policy

🟡 Major | 📝 Performance optimization

📋 Details

The newly created DISTRO_VERIFY_EXECUTOR pool sets no core-thread keep-alive time, which can lead to threads being created and destroyed frequently. Per the performance optimization guidelines, a ScheduledExecutorService should be configured with a reasonable keepAliveTime to improve thread reuse.

💡 Solution

Add the keepAliveTime parameter:

-            .newScheduledExecutorService(ClassUtils.getCanonicalName(GlobalExecutor.class), 4,
+            .newScheduledExecutorService(ClassUtils.getCanonicalName(GlobalExecutor.class), 4, 60L, TimeUnit.SECONDS,
🔧 Suggested code

‼️ AI-generated code: check the logic and conventions and test before applying

Suggested change
public static final ScheduledExecutorService DISTRO_VERIFY_EXECUTOR = ExecutorFactory.Managed
        .newScheduledExecutorService(ClassUtils.getCanonicalName(GlobalExecutor.class), 4,
                new NameThreadFactory("com.alibaba.nacos.core.distro.verify"));
public static final ScheduledExecutorService DISTRO_VERIFY_EXECUTOR = ExecutorFactory.Managed
        .newScheduledExecutorService(ClassUtils.getCanonicalName(GlobalExecutor.class), 4, 60L, TimeUnit.SECONDS,
                new NameThreadFactory("com.alibaba.nacos.core.distro.verify"));


Comment on lines +54 to +69
@Test
void testRunWithSmallBatchSize() throws InterruptedException {
    // prepare the test data
    when(distroConfig.getVerifyBatchSize()).thenReturn(2);
    when(distroConfig.getVerifyIntervalMillis()).thenReturn(50L);
    try (MockedStatic<DistroConfig> mockedStatic = mockStatic(DistroConfig.class)) {
        mockedStatic.when(DistroConfig::getInstance).thenReturn(distroConfig);

        // run the task
        distroVerifyExecuteTask.run();

        // verify every item was dispatched
        Thread.sleep(100L);
        verify(transportAgent, times(10)).syncVerifyData(any(), any());
    }
}

Unit tests do not cover configuration boundary conditions

🟠 Critical | 📝 Unit test coverage

📋 Details

The test cases do not cover the invalid configuration verifyBatchSize=0. Per the unit-test guidelines, add test cases that check configuration validity (e.g., a minimum-value check) to ensure the system handles illegal configurations correctly.

💡 Solution

Add a boundary-value test case:

+    @Test(expected = IllegalArgumentException.class)
+    void testInvalidBatchSize() {
+        when(distroConfig.getVerifyBatchSize()).thenReturn(0);
+        mockedStatic.when(DistroConfig::getInstance).thenReturn(distroConfig);
+        distroVerifyExecuteTask.run();
+    }


@KomachiSion
Collaborator

Root Cause: When the number of connections exceeds the cluster's gRPC thread pool queue size (configured via remote.executor.queue.size), a sudden burst of requests initiated by DistroVerifyTimedTask might surpass the processing thread pool queue capacity. This triggers task rejection exceptions, resulting in improper release of memory blocks (typically 64KB or 128KB) allocated by gRPC for requests and responses.

If the cause is task rejection exceptions, you should fix it by releasing the memory blocks when the exception occurs, not by batching the verify.

The verify is designed to be quick and to only renew the sync data, so it needs a simple and fast response. Batch verify does not match that design.

Alternatively, synced data expires after 3 minutes by default; you can use nacos.core.protocol.distro.data.verify.intervalMs (default 5s) to change the interval between verify rounds and reduce the queue pressure.
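
For example, the mitigation the maintainer describes would be a one-line change in the server's application.properties; 30000 here is an arbitrary illustrative value, and the stated default is 5000:

# Lengthen the Distro verify interval to reduce gRPC queue pressure (default: 5000 ms).
nacos.core.protocol.distro.data.verify.intervalMs=30000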

@codecov-commenter

Codecov Report

Attention: Patch coverage is 69.56522% with 7 lines in your changes missing coverage. Please review.

Project coverage is 70.30%. Comparing base (ee59f98) to head (17ae27f).

Files with missing lines Patch % Lines
...ed/distro/task/verify/DistroVerifyExecuteTask.java 73.33% 3 Missing and 1 partial ⚠️
...ba/nacos/core/distributed/distro/DistroConfig.java 50.00% 3 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@              Coverage Diff              @@
##             develop   #13419      +/-   ##
=============================================
+ Coverage      70.27%   70.30%   +0.03%     
- Complexity     11728    11735       +7     
=============================================
  Files           1601     1601              
  Lines          51098    51114      +16     
  Branches        5152     5153       +1     
=============================================
+ Hits           35910    35938      +28     
+ Misses         12756    12743      -13     
- Partials        2432     2433       +1     
Files with missing lines Coverage Δ
...nacos/core/distributed/distro/DistroConstants.java 0.00% <ø> (ø)
...a/com/alibaba/nacos/core/utils/GlobalExecutor.java 50.00% <100.00%> (+2.77%) ⬆️
...ba/nacos/core/distributed/distro/DistroConfig.java 81.81% <50.00%> (-3.90%) ⬇️
...ed/distro/task/verify/DistroVerifyExecuteTask.java 57.89% <73.33%> (+57.89%) ⬆️

... and 1 file with indirect coverage changes


Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ee59f98...17ae27f. Read the comment docs.


@nihongye
Author

nihongye commented May 23, 2025

If the cause is task rejection exceptions, you should fix it by releasing the memory blocks when the exception occurs, not by batching the verify.

The verify is designed to be quick and to only renew the sync data, so it needs a simple and fast response. Batch verify does not match that design.

Alternatively, synced data expires after 3 minutes by default; you can use nacos.core.protocol.distro.data.verify.intervalMs (default 5s) to change the interval between verify rounds and reduce the queue pressure.

Indeed, immediately releasing memory upon task rejection is an effective mechanism to prevent the leak. However, given that the server allocates 64KB or 128KB of memory per request, the total unconsumed memory scales linearly as n × X KB (where X = 64 or 128 and n is the number of pending tasks); 10,000 pending tasks at 64KB each, for example, would pin roughly 640MB of direct memory at peak. By distributing schedule execution across the entire verifyIntervalMillis window, we can reduce this peak memory demand through temporal load dispersion.

Furthermore, although the server-side verify is designed to be quick, the client-side request rate substantially outpaces the server's verify throughput. This imbalance causes instantaneous request pileup on the server side, particularly under bursty traffic patterns.

Development

Successfully merging this pull request may close these issues.

Nacos 2.1.0 cluster: after the number of registered Dubbo instances exceeds 41K, individual nodes frequently report inconsistent instance counts; suspected off-heap memory leak
4 participants