Skip to content

Conversation

elezar
Copy link
Member

@elezar elezar commented Aug 1, 2025

This change adds support for GDRCOPY through the same mechansim as for GDS and MOFED. This requires new functionality in the nvidia-container-toolkit to generated CDI specs for gdrcopy devices when using CDI mode.

This depends on NVIDIA/nvidia-container-toolkit#1230

@elezar elezar added this to the v0.18.0 milestone Aug 1, 2025
@elezar elezar force-pushed the bump-nvidia-container-toolkit-v1.18.0-rc.2 branch 2 times, most recently from 8136d62 to 217aa20 Compare August 8, 2025 11:39
@elezar elezar changed the title Bump nvidia container toolkit v1.18.0 rc.2 Add support for GDRCOPY Aug 8, 2025
@elezar elezar marked this pull request as draft August 8, 2025 11:42
@elezar elezar force-pushed the bump-nvidia-container-toolkit-v1.18.0-rc.2 branch 2 times, most recently from b348981 to 79606de Compare August 13, 2025 11:23
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds support for GDRCOPY functionality to the NVIDIA Kubernetes device plugin, following the same pattern as existing GDS and MOFED support. GDRCOPY enables GPU Direct Remote Direct Memory Access (RDMA) Copy functionality for high-performance GPU-to-GPU communication.

  • Adds GDRCOPY configuration option with environment variable support
  • Integrates GDRCOPY into the CDI (Container Device Interface) specification generation
  • Updates dependencies to support the new GDRCOPY functionality in nvidia-container-toolkit

Reviewed Changes

Copilot reviewed 12 out of 138 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
internal/plugin/server.go Sets NVIDIA_GDRCOPY environment variable when GDRCOPY is enabled
internal/cdi/options.go Adds WithGdrcopyEnabled option for CDI handler configuration
internal/cdi/imex.go Updates interface signatures for compatibility with nvidia-container-toolkit
internal/cdi/cdi.go Integrates GDRCOPY into CDI spec generation alongside GDS/MOFED
internal/cdi/api.go Removes local cdiSpecGenerator interface, uses upstream interface
go.mod Updates nvidia-container-toolkit and CDI dependencies to support GDRCOPY
deployments/helm/nvidia-device-plugin/values.yaml Adds gdrcopyEnabled helm configuration option
deployments/helm/nvidia-device-plugin/templates/daemonset-device-plugin.yml Adds GDRCOPY_ENABLED environment variable to daemonset
cmd/nvidia-device-plugin/plugin-manager.go Passes GDRCOPY configuration to CDI handler
cmd/nvidia-device-plugin/main.go Adds gdrcopy-enabled CLI flag
api/config/v1/flags_test.go Updates test expectations for new GDRCOPY flag
api/config/v1/flags.go Adds GDRCopyEnabled field to configuration structure

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.

}
}

// WithGdrcopyEnabled provides and option to set whether a GDS CDI spec should be generated
Copy link
Preview

Copilot AI Aug 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment incorrectly mentions 'GDS' instead of 'GDRCOPY'. It should read 'provides an option to set whether a GDRCOPY CDI spec should be generated'.

Suggested change
// WithGdrcopyEnabled provides and option to set whether a GDS CDI spec should be generated
// WithGdrcopyEnabled provides an option to set whether a GDRCOPY CDI spec should be generated

Copilot uses AI. Check for mistakes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Contributor

@cdesiniotis cdesiniotis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor comments. LGTM otherwise.

},
&cli.BoolFlag{
Name: "gdrcopy-enabled",
Usage: "ensure that containers are started with NVIDIA_GDRCOPY=enabled",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question -- since the NVIDIA_GDRCOPY=enabled envvar only gets set when a non-CDI device list strategy is configured, would it make sense to omit this implementation detail from the usage string? What about a more generic usage string, like "ensure that containers get access to the GDRCopy device nodes". This comment also applies to the gds-enabled and mofed-enabled flags.

}
}

// WithGdrcopyEnabled provides and option to set whether a GDS CDI spec should be generated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@elezar elezar force-pushed the bump-nvidia-container-toolkit-v1.18.0-rc.2 branch from 79606de to 1e85d06 Compare September 25, 2025 19:13
@elezar elezar force-pushed the bump-nvidia-container-toolkit-v1.18.0-rc.2 branch from 3037ba8 to af51db9 Compare September 25, 2025 20:10
@elezar elezar force-pushed the bump-nvidia-container-toolkit-v1.18.0-rc.2 branch from af51db9 to 640469e Compare September 25, 2025 20:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants