-
Notifications
You must be signed in to change notification settings - Fork 13.1k
[ISSUE #12792] Fix Netty direct memory OOM caused by distro verify task burst #13419
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
…OfDirectMemoryError 的一个问题 1. 原因:当连接数超过集群grpc线程池队列大小remote.executor.queue.size时,DistroVerifyTimedTask瞬间发起的请求数量可能超过 grpc处理线程池队列大小,进而触发任务拒绝异常,再进而grpc未正常释放为请求、响应分配的内存块(通常为64KB或128KB) 2. 修复方法:将请求打散在整个Distro的verifyIntervalMillis范围内,避免瞬间堆积连接池队列,也可降低directMemory的需求上限。
hongye.nhy seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
Thanks for your this PR. 🙏 感谢您提交的PR。 🙏 |
修复因Distro验证任务突发导致的Netty直接内存OOM问题变更文件
时序图sequenceDiagram
participant DVEST as DistroVerifyExecuteTask
participant GE as GlobalExecutor
participant DTA as DistroTransportAgent
DVEST->>+GE: schedule batch tasks with random delay
loop 每个批次
GE->>DVEST: execute batch task
DVEST->>+DTA: syncVerifyData for batch items
DTA-->>-DVEST: 返回验证结果
end
DVEST-->>GE: 所有批次执行完成
💡 小贴士与 lingma-agents 交流的方式📜 直接回复评论
📜 在代码行处标记
📜 在讨论中提问
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔍 代码评审报告
🎯 评审意见概览
严重度 | 数量 | 说明 |
---|---|---|
🔴 Blocker | 0 | 阻断性问题,需立即修复。例如:系统崩溃、关键功能不可用或严重安全漏洞。 |
🟠 Critical | 1 | 严重问题,高优先级修复。例如:核心功能异常或性能瓶颈影响用户体验。 |
🟡 Major | 2 | 主要问题,建议修复。例如:非核心功能缺陷或代码维护性较差。 |
🟢 Minor | 0 | 次要问题,酌情优化。例如:代码格式不规范或注释缺失。 |
总计: 3 个问题
📋 评审意见详情
💡 单文件建议
以下是文件级别的代码建议,聚焦于代码的可读性、可维护性和潜在问题。
☕ core/src/main/java/com/alibaba/nacos/core/distributed/distro/DistroConfig.java (1 💬)
☕ core/src/main/java/com/alibaba/nacos/core/utils/GlobalExecutor.java (1 💬)
- 新增的DISTRO_VERIFY_EXECUTOR线程池未配置核心线程保活策略 (L57-L59)
☕ core/src/test/java/com/alibaba/nacos/core/distributed/distro/task/verify/DistroVerifyExecuteTaskTest.java (1 💬)
- 单元测试未验证配置边界条件 (L54-L69)
🚀 跨文件建议
以下是对代码架构和设计的综合分析,聚焦于跨文件交互、系统一致性和潜在优化空间。
🔍 1. 线程池配置未动态调整可能导致性能瓶颈
新增的DISTRO_VERIFY_EXECUTOR线程池固定配置了4个线程,但未根据系统负载或数据量动态调整线程数。在高并发场景下,固定线程池可能成为性能瓶颈,导致任务队列堆积或超时,特别是在批量处理大量distro验证数据时。建议引入动态线程调整策略或增加线程数配置参数
🔍 2. 批量处理未实现异常隔离机制
在requestBatch方法中,单个批次的处理失败会中断整个批次的验证流程。当某个DistroData验证失败时,会导致整个批次剩余数据的验证被终止,未实现异常隔离和重试机制。这可能造成部分数据验证未完成,影响分布式数据一致性
📌 关键代码:
for (DistroData each : batch) {
try {
...
} catch (Throwable e) {
Loggers.DISTRO.error(...);
}
}
🔍 3. 配置参数缺乏校验机制
新增的verifyBatchSize配置参数未实现数值合法性校验(如<=0的无效值),可能导致程序进入不可预期状态。虽然文件级建议提到注释缺失,但系统层面需要在配置加载时进行参数有效性验证,避免非法配置引发运行时异常
📌 关键代码:
public static final int DEFAULT_DATA_VERIFY_BATCH_SIZE = 100;
🔍 4. 随机延迟策略可能引发验证延迟
使用RandomUtils.nextLong(0, interval)生成随机延迟,当verifyIntervalMillis配置较小时(如<100ms),可能导致任务调度间隔过短,反而加剧并发压力。建议采用更合理的随机范围计算方式(如配置最小间隔)
📌 关键代码:
RandomUtils.nextLong(0, DistroConfig.getInstance().getVerifyIntervalMillis())
🔍 5. 未实现配置变更热加载
DistroConfig中的verifyBatchSize配置未实现动态热加载能力,当生产环境需要调整批处理大小时,必须重启服务才能生效。这降低了系统对突发流量的实时应对能力
📌 关键代码:
verifyBatchSize = EnvUtil.getProperty(...)
🔍 6. 批次处理逻辑存在潜在N+1问题
当前按批次提交任务时,每个批次都创建独立的ScheduledFuture,但未对完成状态进行跟踪。当系统异常时可能导致部分批次任务未完成,且无法追溯执行状态,建议引入任务状态跟踪机制
📌 关键代码:
GlobalExecutor.DISTRO_VERIFY_EXECUTOR.schedule((() -> { ... }))
💡 小贴士
与 lingma-agents 交流的方式
📜 直接回复评论
直接回复本条评论,lingma-agents 将自动处理您的请求。例如:
-
在当前代码中添加详细的注释说明。
-
请详细介绍一下你说的 LRU 改造方案,并使用伪代码加以说明。
📜 在代码行处标记
在文件的特定位置创建评论并 @lingma-agents。例如:
-
@Lingma-Agent 分析这个方法的性能瓶颈并提供优化建议。
-
@Lingma-Agent 对这个方法生成优化代码。
📜 在讨论中提问
在任何讨论中 @lingma-agents 来获取帮助。例如:
-
@Lingma-Agent 请总结上述讨论并提出解决方案。
-
@Lingma-Agent 请根据讨论内容生成优化代码。
@@ -43,7 +43,9 @@ public class DistroConfig extends AbstractDynamicConfig { | |||
private long loadDataRetryDelayMillis = DistroConstants.DEFAULT_DATA_LOAD_RETRY_DELAY_MILLISECONDS; | |||
|
|||
private long loadDataTimeoutMillis = DistroConstants.DEFAULT_DATA_LOAD_TIMEOUT_MILLISECONDS; | |||
|
|||
|
|||
private int verifyBatchSize = DistroConstants.DEFAULT_DATA_VERIFY_BATCH_SIZE; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
新增的verifyBatchSize字段缺少配置说明注释
🟡 Major | 🧹 Code Smells
📋 问题详情
在DistroConfig类中新增的verifyBatchSize字段未添加配置说明注释,导致其他开发者无法明确该配置项的用途和合理取值范围。根据代码标准化规范,所有配置项应包含清晰的注释说明其业务含义和默认值。
💡 解决方案
在字段前添加配置说明注释:
+ /**
+ * 验证任务数据分批次处理的批次大小,默认值为100
+ * 该参数用于控制每次网络验证请求的数据量,避免单次请求过大导致内存溢出
+ */
🔧 建议代码
‼️ AI 生成代码 - 请在应用前检查逻辑、规范并测试
private int verifyBatchSize = DistroConstants.DEFAULT_DATA_VERIFY_BATCH_SIZE; | |
/** | |
* 验证任务数据分批次处理的批次大小,默认值为100 | |
* 该参数用于控制每次网络验证请求的数据量,避免单次请求过大导致内存溢出 | |
*/ | |
private int verifyBatchSize = DistroConstants.DEFAULT_DATA_VERIFY_BATCH_SIZE; | |
您的反馈对我们很重要!(建议右键在新标签页中打开以下链接)
public static final ScheduledExecutorService DISTRO_VERIFY_EXECUTOR = ExecutorFactory.Managed | ||
.newScheduledExecutorService(ClassUtils.getCanonicalName(GlobalExecutor.class), 4, | ||
new NameThreadFactory("com.alibaba.nacos.core.distro.verify")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
新增的DISTRO_VERIFY_EXECUTOR线程池未配置核心线程保活策略
🟡 Major | 📝 Performance optimization
📋 问题详情
新创建的DISTRO_VERIFY_EXECUTOR线程池未设置核心线程保活时间,可能导致线程频繁创建/销毁。根据性能优化规范,应为ScheduledExecutorService配置合理的keepAliveTime参数以提升线程复用效率。
💡 解决方案
添加keepAliveTime参数配置:
- .newScheduledExecutorService(ClassUtils.getCanonicalName(GlobalExecutor.class), 4,
+ .newScheduledExecutorService(ClassUtils.getCanonicalName(GlobalExecutor.class), 4, 60L, TimeUnit.SECONDS,
🔧 建议代码
‼️ AI 生成代码 - 请在应用前检查逻辑、规范并测试
public static final ScheduledExecutorService DISTRO_VERIFY_EXECUTOR = ExecutorFactory.Managed | |
.newScheduledExecutorService(ClassUtils.getCanonicalName(GlobalExecutor.class), 4, | |
new NameThreadFactory("com.alibaba.nacos.core.distro.verify")); | |
public static final ScheduledExecutorService DISTRO_VERIFY_EXECUTOR = ExecutorFactory.Managed | |
.newScheduledExecutorService(ClassUtils.getCanonicalName(GlobalExecutor.class), 4, 60L, TimeUnit.SECONDS, | |
new NameThreadFactory("com.alibaba.nacos.core.distro.verify")); | |
您的反馈对我们很重要!(建议右键在新标签页中打开以下链接)
@Test | ||
void testRunWithSmallBatchSize() throws InterruptedException { | ||
// 准备测试数据 | ||
when(distroConfig.getVerifyBatchSize()).thenReturn(2); | ||
when(distroConfig.getVerifyIntervalMillis()).thenReturn(50L); | ||
try (MockedStatic<DistroConfig> mockedStatic = mockStatic(DistroConfig.class)) { | ||
mockedStatic.when(DistroConfig::getInstance).thenReturn(distroConfig); | ||
|
||
// 执行测试 | ||
distroVerifyExecuteTask.run(); | ||
|
||
// 验证全部被调用 | ||
Thread.sleep(100L); | ||
verify(transportAgent, times(10)).syncVerifyData(any(), any()); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
单元测试未验证配置边界条件
🟠 Critical | 📝 Unit test coverage
📋 问题详情
测试用例未覆盖verifyBatchSize=0的异常配置场景。根据单元测试规范,应添加验证配置有效性(如最小值检查)的测试用例,确保系统在非法配置下能正确处理。
💡 解决方案
添加边界值测试用例:
+ @Test(expected = IllegalArgumentException.class)
+ void testInvalidBatchSize() {
+ when(distroConfig.getVerifyBatchSize()).thenReturn(0);
+ mockedStatic.when(DistroConfig::getInstance).thenReturn(distroConfig);
+ distroVerifyExecuteTask.run();
+ }
您的反馈对我们很重要!(建议右键在新标签页中打开以下链接)
If the reason is task rejection exceptions, you should fix it by release of memory blocks when exceptions. But not batch verify. The Verify design to quick and only renew the sync data, which need simple and fast response. So batch verify is not match the verify design. In other way, the data syned expired is 3 min default, you can use |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #13419 +/- ##
=============================================
+ Coverage 70.27% 70.30% +0.03%
- Complexity 11728 11735 +7
=============================================
Files 1601 1601
Lines 51098 51114 +16
Branches 5152 5153 +1
=============================================
+ Hits 35910 35938 +28
+ Misses 12756 12743 -13
- Partials 2432 2433 +1
... and 1 file with indirect coverage changes Continue to review full report in Codecov by Sentry.
🚀 New features to boost your workflow:
|
Indeed, immediately releasing memory upon task rejection serves as an effective mechanism to prevent memory leaks. Furthermore, given that the server allocates 64KB or 128KB of memory per request, the total unconsumed memory scales linearly as n × XKB (where X=64/128 and n represents the number of pending tasks). By distributing schedule execution across the entire verifyIntervalMillis period window, we can effectively reduce this peak memory demand through temporal load dispersion. Furthermore, although server-side verify design to quick, but the client request rate substantially outpaces the server's verify throughput. This imbalance results in instantaneous request pileup on the server-side, particularly when handling burst traffic patterns. |
Please do not create a Pull Request without creating an issue first.
What is the purpose of the change
fix #12792 Fix io.grpc.netty.shaded.io.netty.util.internal.OutOfDirectMemoryErrorcaused by distro verify task burst
Brief changelog
Root Cause: When the number of connections exceeds the cluster's gRPC thread pool queue size (configured via remote.executor.queue.size), a sudden burst of requests initiated by DistroVerifyTimedTask might surpass the processing thread pool queue capacity. This triggers task rejection exceptions, resulting in improper release of memory blocks (typically 64KB or 128KB) allocated by gRPC for requests and responses.
Resolution: Refactored the task scheduling mechanism of DistroVerifyExecuteTask to distribute batch verification requests across the entire verifyIntervalMillis period. This prevents abrupt queue buildup in the thread pool and reduces the peak demand for direct memory.
Verifying this change
Follow this checklist to help us incorporate your contribution quickly and easily:
[ISSUE #123] Fix UnknownException when host config not exist
. Each commit in the pull request should have a meaningful subject line and body.mvn -B clean package apache-rat:check findbugs:findbugs -Dmaven.test.skip=true
to make sure basic checks pass. Runmvn clean install -DskipITs
to make sure unit-test pass. Runmvn clean test-compile failsafe:integration-test
to make sure integration-test pass.