Add support for GDRCOPY #1339
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add support for GDRCOPY #1339
Conversation
Force-pushed from 8136d62 to 217aa20
Force-pushed from b348981 to 79606de
Pull Request Overview
This PR adds support for GDRCOPY functionality to the NVIDIA Kubernetes device plugin, following the same pattern as the existing GDS and MOFED support. GDRCopy is a low-latency GPU memory copy library built on NVIDIA GPUDirect RDMA technology, used for high-performance data movement to and from GPU memory.
- Adds GDRCOPY configuration option with environment variable support
- Integrates GDRCOPY into the CDI (Container Device Interface) specification generation
- Updates dependencies to support the new GDRCOPY functionality in nvidia-container-toolkit
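The environment-variable injection described in the first bullet can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `envVar` type and `gdrcopyEnvs` helper are hypothetical stand-ins for the logic in internal/plugin/server.go, which operates on the Kubernetes device-plugin allocate response.

```go
package main

import "fmt"

// envVar mirrors the Name/Value shape of a container environment
// variable entry (hypothetical stand-in for the Kubernetes type).
type envVar struct {
	Name  string
	Value string
}

// gdrcopyEnvs returns the env vars to inject when GDRCOPY is enabled,
// following the same pattern the plugin uses for GDS and MOFED.
// The helper name is illustrative; the upstream change may differ.
func gdrcopyEnvs(gdrcopyEnabled bool) []envVar {
	var envs []envVar
	if gdrcopyEnabled {
		envs = append(envs, envVar{Name: "NVIDIA_GDRCOPY", Value: "enabled"})
	}
	return envs
}

func main() {
	for _, e := range gdrcopyEnvs(true) {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```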
Reviewed Changes
Copilot reviewed 12 out of 138 changed files in this pull request and generated 2 comments.
File | Description |
---|---|
internal/plugin/server.go | Sets NVIDIA_GDRCOPY environment variable when GDRCOPY is enabled |
internal/cdi/options.go | Adds WithGdrcopyEnabled option for CDI handler configuration |
internal/cdi/imex.go | Updates interface signatures for compatibility with nvidia-container-toolkit |
internal/cdi/cdi.go | Integrates GDRCOPY into CDI spec generation alongside GDS/MOFED |
internal/cdi/api.go | Removes local cdiSpecGenerator interface, uses upstream interface |
go.mod | Updates nvidia-container-toolkit and CDI dependencies to support GDRCOPY |
deployments/helm/nvidia-device-plugin/values.yaml | Adds gdrcopyEnabled helm configuration option |
deployments/helm/nvidia-device-plugin/templates/daemonset-device-plugin.yml | Adds GDRCOPY_ENABLED environment variable to daemonset |
cmd/nvidia-device-plugin/plugin-manager.go | Passes GDRCOPY configuration to CDI handler |
cmd/nvidia-device-plugin/main.go | Adds gdrcopy-enabled CLI flag |
api/config/v1/flags_test.go | Updates test expectations for new GDRCOPY flag |
api/config/v1/flags.go | Adds GDRCopyEnabled field to configuration structure |
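Based on the file summary, `WithGdrcopyEnabled` likely follows the functional-option pattern already used for the GDS and MOFED options. A minimal sketch, with hypothetical names (`cdiHandler`, `Option`, `newHandler`) since the real definitions in internal/cdi/options.go are not reproduced in this thread:

```go
package main

import "fmt"

// cdiHandler is a hypothetical stand-in for the plugin's CDI handler.
type cdiHandler struct {
	gdsEnabled     bool
	mofedEnabled   bool
	gdrcopyEnabled bool
}

// Option is a functional option applied to the handler at construction.
type Option func(*cdiHandler)

// WithGdrcopyEnabled provides an option to set whether a GDRCOPY CDI
// spec should be generated, mirroring the existing GDS/MOFED options.
func WithGdrcopyEnabled(enabled bool) Option {
	return func(c *cdiHandler) {
		c.gdrcopyEnabled = enabled
	}
}

func newHandler(opts ...Option) *cdiHandler {
	c := &cdiHandler{}
	for _, opt := range opts {
		opt(c)
	}
	return c
}

func main() {
	h := newHandler(WithGdrcopyEnabled(true))
	fmt.Println(h.gdrcopyEnabled) // prints "true"
}
```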
internal/cdi/options.go
Outdated
}
}

// WithGdrcopyEnabled provides and option to set whether a GDS CDI spec should be generated
The comment incorrectly mentions 'GDS' instead of 'GDRCOPY'. It should read 'provides an option to set whether a GDRCOPY CDI spec should be generated'.
- // WithGdrcopyEnabled provides and option to set whether a GDS CDI spec should be generated
+ // WithGdrcopyEnabled provides an option to set whether a GDRCOPY CDI spec should be generated
+1
Some minor comments. LGTM otherwise.
cmd/nvidia-device-plugin/main.go
Outdated
},
&cli.BoolFlag{
	Name:  "gdrcopy-enabled",
	Usage: "ensure that containers are started with NVIDIA_GDRCOPY=enabled",
Question -- since the `NVIDIA_GDRCOPY=enabled` envvar only gets set when a non-CDI device list strategy is configured, would it make sense to omit this implementation detail from the usage string? What about a more generic usage string, like "ensure that containers get access to the GDRCopy device nodes"? This comment also applies to the `gds-enabled` and `mofed-enabled` flags.
internal/cdi/options.go
Outdated
+1
….0-rc.4 Signed-off-by: Evan Lezar <[email protected]>
Signed-off-by: Evan Lezar <[email protected]>
Force-pushed from 79606de to 1e85d06
Signed-off-by: Evan Lezar <[email protected]>
Force-pushed from 3037ba8 to af51db9
Signed-off-by: Evan Lezar <[email protected]>
Force-pushed from af51db9 to 640469e
This change adds support for GDRCOPY through the same mechanism as for GDS and MOFED. This requires new functionality in the nvidia-container-toolkit to generate CDI specs for gdrcopy devices when using CDI mode.
This depends on NVIDIA/nvidia-container-toolkit#1230