
Coalesce upstream requests where possible #22184

@Strainy

Description


Is your feature request related to a problem? Please describe.
I'm currently using Harbor to serve a large number of K8S nodes. I'm facing a problem where occasionally a large number of those nodes concurrently attempt to pull the same artifact from the registry. If Harbor receives N requests at the same time, I typically see it make a commensurate number of requests to the remote repository for the corresponding manifests/blobs. For relatively large artifacts/blobs, this completely saturates our bandwidth and pulls never complete, requiring manual intervention.

Describe the solution you'd like
Harbor should coalesce remote requests for the same manifests/blobs. This could work similarly to how pushes to the local registry are de-duplicated here. The end result should be a single remote fetch per artifact, with all subsequent requests served from the local registry cache.

Describe the main design/architecture of your solution
I think the ideal solution here is to use the golang.org/x/sync/singleflight package to de-duplicate remote requests within a single core instance. The specific handling differs depending on whether we're dealing with a blob request or a manifest request.
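For reference, here is a minimal sketch of the shared state the snippets below assume (the group names manifestFetchGroup and blobGroup are placeholders of my own, not existing Harbor identifiers): two package-level singleflight groups in the proxy controller package.

import "golang.org/x/sync/singleflight"

// Package-level groups in the proxy controller package. Concurrent calls to
// Do() with the same key share a single execution of the supplied function
// and all receive its result.
var (
	manifestFetchGroup singleflight.Group
	blobGroup          singleflight.Group
)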

For manifest requests the solution is pretty simple. Because we can buffer the manifest response, we can just update the proxy controller to wrap the existing manifest-fetching implementation. Only one fetch for a given manifest will occur at a time, with all concurrent requests waiting on the initial fetch and receiving the same response on completion.

// This singleflight group de-duplicates concurrent manifest requests for the same artifact.
result, err, _ := manifestFetchGroup.Do(artifactKey, func() (any, error) {
	log.Debugf("Fetching manifest from remote registry, url:%v", remoteRepo)

	man, dig, err := remote.Manifest(remoteRepo, ref)
	if err != nil {
		return nil, err
	}
	ct, _, err := man.Payload()
	if err != nil {
		return nil, err
	}

	// Push the manifest into the local registry in the background
	// (elided here; the background push consumes dig and ct).
	_ = dig
	_ = ct

	return man, nil
})
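On the caller side, the shared result only needs a type assertion before it can be returned. A sketch, assuming remote.Manifest returns a distribution.Manifest as in the current proxy controller:

	// result is an `any`; assert it back to the manifest type before handing
	// it to the caller. All coalesced callers receive the same value.
	man, ok := result.(distribution.Manifest)
	if !ok {
		return nil, fmt.Errorf("unexpected type %T from manifest singleflight", result)
	}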

For blob requests, this is going to be a bit trickier, because the proxy controller implementation currently propagates an io.ReadCloser back to the caller. The io.ReadCloser can obviously only be read once, which necessitates a different approach.

What I believe should be done here is to introduce an opt-in configuration that makes requests wait for the first remote blob fetch to be cached in the local registry. Once the first remote request for a blob has been written to the registry, all other requests that arrived at the same time skip their own remote request and read directly from the local registry. This would look something like:

func (c *controller) ProxyBlob(ctx context.Context, _ *proModels.Project, art lib.ArtifactInfo, remote RemoteInterface) (int64, io.ReadCloser, error) {
	remoteRepo := getRemoteRepo(art)
	asyncEnabled := config.EnableAsyncLocalCaching()
	
	if asyncEnabled {
		return c.proxyBlobAsync(ctx, art, remote, remoteRepo)
	}
	return c.proxyBlobSync(ctx, art, remote, remoteRepo)
}

func (c *controller) proxyBlobAsync(_ context.Context, art lib.ArtifactInfo, remote RemoteInterface, remoteRepo string) (int64, io.ReadCloser, error) {
	// existing behaviour of fetching from the remote directly and 
	// pushing it into the local registry in a background goroutine
}

func (c *controller) proxyBlobSync(ctx context.Context, art lib.ArtifactInfo, remote RemoteInterface, remoteRepo string) (int64, io.ReadCloser, error) {
	artifactKey := remoteRepo + ":" + art.Digest

	// all requests wait for an initial synchronous push into the local registry
	_, err, _ := blobGroup.Do(artifactKey, func() (any, error) {
		return nil, c.ensureBlobCached(ctx, art, remote, remoteRepo)
	})
	
	if err != nil {
		return 0, nil, err
	}

	// now all concurrent requests can pull the blob from the local registry, rather than going out to the remote
	size, reader, err := c.local.PullBlob(art.Repository, digest.Digest(art.Digest))
	if err != nil {
		return 0, nil, err
	}
	
	return size, reader, nil
}
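For completeness, ensureBlobCached above is a new helper. A minimal sketch of what it could look like is below; the BlobExist, BlobReader, and PushBlob calls and the descriptor fields are assumptions based on the existing proxy controller interfaces, not an exact API.

func (c *controller) ensureBlobCached(ctx context.Context, art lib.ArtifactInfo, remote RemoteInterface, remoteRepo string) error {
	// Skip the remote fetch entirely if the blob already exists locally.
	exist, err := c.local.BlobExist(ctx, art)
	if err == nil && exist {
		return nil
	}

	// Fetch the blob from the remote registry once...
	size, reader, err := remote.BlobReader(remoteRepo, art.Digest)
	if err != nil {
		return err
	}
	defer reader.Close()

	// ...and write it synchronously into the local registry so the
	// coalesced callers can read it from there afterwards.
	desc := distribution.Descriptor{Size: size, Digest: digest.Digest(art.Digest)}
	return c.local.PushBlob(art.Repository, desc, reader)
}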

Describe the development plan you've considered
I've got a first draft of a PR ready for review here: #22185

☝️ This change is intended to preserve existing behavior and be opt-in.

Additional context
I'm currently running this in production across a number of data centers and have observed this behavior working as expected.

~800 concurrent requests coalescing into 5 requests for remote blobs:

