| ### Non-Goals
|
| - Supporting custom sort orders in streaming mode. Requests with non-default
|   sort order fall back to a single buffered Range call.
What do you mean by fall back? Could we just say that users that need non-default sort order can continue to use Range?
Updated. I was originally thinking of having the server auto-switch to the unary Range, since the contract between the two is the same, but returning an error and having the client use unary Range makes more sense.
| Add a server-streaming `RangeStream` RPC to the etcd KV service that accepts
| the existing `RangeRequest` and returns a stream of `RangeStreamResponse`
| messages. The server handles pagination internally, pins to a single MVCC
| revision for snapshot consistency, and uses adaptive chunk sizing to
What happens if the stream took so long that revision is no longer available? Please describe the error returned by server and how client should detect and handle it.
One more question regarding long-running streaming: currently, when BoltDB remaps, it requires mmaplock, which could potentially cause stalls in the background commit goroutine. This needs to be handled carefully. While we don’t need to dive into too many details here, it's important that we test this aspect. One possible approach could be modifying compaction to preserve key-value pairs, in order to avoid locking the read transaction for too long.
Added. The server returns `ErrCompacted` and clients should retry.
edit: Just saw fuweid's comment that still needs to be addressed.
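A client-side retry for that error path could be sketched as below. `errCompacted`, `rangeStream`, and `listWithRetry` are illustrative names, not part of the etcd client API; a real client would compare against `rpctypes.ErrCompacted` and issue an actual RangeStream call:

```go
package main

import (
	"errors"
	"fmt"
)

// errCompacted stands in for etcd's rpctypes.ErrCompacted sentinel; the real
// client would compare against that error instead of this local one.
var errCompacted = errors.New("etcdserver: mvcc: required revision has been compacted")

// rangeStream stands in for one full RangeStream call that either returns the
// reassembled keys or fails partway through the stream.
type rangeStream func() ([]string, error)

// listWithRetry restarts the whole stream from scratch when the pinned
// revision was compacted away; any other error is returned to the caller.
func listWithRetry(do rangeStream, maxAttempts int) ([]string, error) {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		kvs, err := do()
		if err == nil {
			return kvs, nil
		}
		if !errors.Is(err, errCompacted) {
			return nil, err // not retryable
		}
		lastErr = err // retry: a fresh call pins a newer revision
	}
	return nil, lastErr
}

func main() {
	calls := 0
	flaky := func() ([]string, error) {
		calls++
		if calls == 1 {
			return nil, errCompacted // first attempt loses its revision mid-stream
		}
		return []string{"a", "b"}, nil
	}
	kvs, err := listWithRetry(flaky, 3)
	fmt.Println(kvs, err)
}
```

The key point is that a compacted pinned revision cannot be resumed, so the retry restarts the whole list rather than continuing from the last chunk.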
What happens if the stream took so long that revision is no longer available?
This shouldn't happen, because each TXN guarantees repeatable read.
long-running streaming: currently, when BoltDB remaps, it requires mmaplock, which could potentially cause stalls in the background commit goroutine.
Right, this is a drawback. It's recommended to avoid long-running read transactions; see https://github.com/etcd-io/bbolt?tab=readme-ov-file#caveats--limitations.
So let's clearly call out the trade-off of using rangeStream here.
- Pros: reduce the memory usage so that it can avoid OOM
- Cons: long-running read transaction may block write transaction.
- Note: normally a read TXN doesn't block a write TXN. A write TXN will only be blocked by a read TXN when it needs to allocate more space/pages.
This gets into how we expect the kube-apiserver to use RangeStream.
Do we expect the kube-apiserver to move entirely away from batching and fetch a list in a single request? Or are we still expecting the kube-apiserver to batch, but possibly request fewer, larger batched chunks?
If we are making a single range request to etcd, we need to understand the impact of the long-running read txn and at what range size it becomes a problem.
Chatted with @Jefftree out-of-band and I understand this better now. The proposal is to make the etcd server responsible for chunking the client's requests into multiple txns on the server side to avoid the long running transaction problem.
Per discussion with @jpbetz, added a note to the notes/caveats section that this RangeStream implementation does not use a long-lived read transaction; instead it pins the MVCC revision after the first chunk and reuses it for subsequent chunks under separate txns.
client's requests into multiple txns
Sounds good.
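The scheme agreed on in this thread (pin the MVCC revision once, then serve each chunk under its own short transaction) can be sketched against a toy multi-version store. All names and types here are hypothetical stand-ins, not etcd's actual `mvcc` package:

```go
package main

import (
	"fmt"
	"sort"
)

// version is one historical value of a key; mvccStore is a toy stand-in for
// etcd's MVCC store, not the real implementation.
type version struct {
	rev int64
	val string
}

type mvccStore struct {
	rev  int64
	hist map[string][]version
}

func newStore() *mvccStore { return &mvccStore{hist: map[string][]version{}} }

func (s *mvccStore) put(k, v string) {
	s.rev++
	s.hist[k] = append(s.hist[k], version{s.rev, v})
}

// rangeAt reads values as of revision rev, like one short read transaction.
// It stops after limit keys and returns the key to resume from.
func (s *mvccStore) rangeAt(rev int64, startKey string, limit int) (map[string]string, string) {
	keys := make([]string, 0, len(s.hist))
	for k := range s.hist {
		if k >= startKey {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	kvs := map[string]string{}
	for i, k := range keys {
		if i == limit {
			return kvs, k // chunk full; the next chunk resumes here
		}
		for _, v := range s.hist[k] {
			if v.rev <= rev {
				kvs[k] = v.val // latest version visible at rev wins
			}
		}
	}
	return kvs, "" // range exhausted
}

// streamRange serves chunks under separate short transactions, all pinned to
// the revision observed before the first chunk, so no single read txn spans
// the whole stream.
func streamRange(s *mvccStore, chunkLimit int, emit func(map[string]string)) {
	pinned := s.rev
	next := ""
	for {
		kvs, cont := s.rangeAt(pinned, next, chunkLimit)
		emit(kvs)
		if cont == "" {
			return
		}
		next = cont
	}
}

func main() {
	s := newStore()
	for _, k := range []string{"a", "b", "c", "d"} {
		s.put(k, "v1")
	}
	var chunks []map[string]string
	streamRange(s, 2, func(kvs map[string]string) {
		s.put("a", "v2") // a write lands between chunks; pinned reads ignore it
		chunks = append(chunks, kvs)
	})
	fmt.Println(chunks)
}
```

Writes that land between chunks bump the store's revision but never leak into the stream, which is the snapshot-consistency property the pinned revision buys without holding one long read transaction.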
| treated as internal design. The defined contract is that the merged
| `RangeResponse` produces identical results as `proto.Merge`.
|
| ### CountTotal Optimization
That sounds like a separate feature; do we even need a limit in K8s? For watch cache lists we don't need pagination, and for client requests we can just have the client close the connection after it hits the limit.
This is about the internal pagination for stream chunks, and is different than the client limit.
re client limit: The streaming API is designed to wrap RangeRequest which has the client limit field. I don't think we technically need to set the limit for watch cache, but removing it would be an API change that changes the request structure compared to unary.
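The merge contract quoted a few lines up (the reassembled `RangeResponse` must match what `proto.Merge` over the chunks would produce) can be illustrated with a hand-rolled merge over simplified types; the structs below are stand-ins, not the real generated protobuf messages:

```go
package main

import "fmt"

// Simplified stand-ins for the generated protobuf types; field names follow
// the design doc but this is not the real RangeStreamResponse.
type KeyValue struct {
	Key, Value string
}

type RangeStreamResponse struct {
	Revision int64      // set only on the first data chunk
	Count    int64      // set only on the final chunk
	Kvs      []KeyValue // this chunk's slice of the result
}

// mergeChunks mirrors proto.Merge semantics for this subset: repeated fields
// are concatenated in order, and scalar fields only overwrite when the
// incoming value is non-zero (proto3 treats zero values as unset).
func mergeChunks(chunks []RangeStreamResponse) RangeStreamResponse {
	var out RangeStreamResponse
	for _, c := range chunks {
		out.Kvs = append(out.Kvs, c.Kvs...)
		if c.Revision != 0 {
			out.Revision = c.Revision
		}
		if c.Count != 0 {
			out.Count = c.Count
		}
	}
	return out
}

func main() {
	merged := mergeChunks([]RangeStreamResponse{
		{Revision: 42, Kvs: []KeyValue{{"a", "1"}, {"b", "2"}}},
		{Kvs: []KeyValue{{"c", "3"}}, Count: 3},
	})
	fmt.Printf("rev=%d count=%d kvs=%d\n", merged.Revision, merged.Count, len(merged.Kvs))
}
```

A real client could call `proto.Merge` directly on the generated messages and get the same result; the manual version just makes the semantics visible.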
| ### Adaptive Chunk Sizing
|
| The chunk limit starts at 10 keys and adjusts based on response size relative
Instead of using adaptive chunking, could we have MVCC to return up to X bytes? Instead of having caller driven pagination on MVCC, have MVCC decide page size based on expected result size.
I think this is doable. Updated.
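The byte-budget alternative adopted here (MVCC returns up to X bytes per chunk rather than a caller-chosen key count, with the budget derived from something like the server's `max-request-bytes` setting) might look like this sketch; the names and sizing rule are assumptions, not etcd's implementation:

```go
package main

import "fmt"

// kv is a toy key-value pair; the real code would size actual mvccpb.KeyValue
// messages.
type kv struct {
	key, val string
}

// rangeUpToBytes walks keys in order and stops once adding the next pair
// would exceed the byte budget, returning the index to resume from. The
// budget would be derived from the server's max-request-bytes setting.
// The first pair is always included so a single oversized value cannot
// stall the stream.
func rangeUpToBytes(sorted []kv, start int, budgetBytes int) (chunk []kv, next int) {
	used := 0
	for i := start; i < len(sorted); i++ {
		size := len(sorted[i].key) + len(sorted[i].val)
		if len(chunk) > 0 && used+size > budgetBytes {
			return chunk, i // budget exhausted; resume at i
		}
		chunk = append(chunk, sorted[i])
		used += size
	}
	return chunk, len(sorted) // range exhausted
}

func main() {
	data := []kv{{"a", "xxxx"}, {"b", "xxxx"}, {"c", "xxxx"}, {"d", "xxxx"}}
	for i := 0; i < len(data); {
		var chunk []kv
		chunk, i = rangeUpToBytes(data, i, 10)
		fmt.Println(len(chunk))
	}
}
```

Letting MVCC decide the cut point keeps every chunk near the wire-size target without the feedback loop an adaptive key-count scheme needs.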
| | Final chunk | Kvs, Count, More |
|
| Count is deferred to the final message. Revision is only in the first data
| chunk. Clients reassemble by merging all messages.
From goals
Eliminate redundant count computation across paginated requests by computing
the total count once on the first chunk.
Why do we want the count done on the first chunk
We can do it on any chunk, realistically just want to count once so we can merge the count result in RangeResponse. I've aligned the inconsistent description to now both count and return the count on the first chunk.
After discussion with @serathius, returning the count on the first chunk will require a full traversal to obtain the chunk. If we return on the last chunk, we can keep a running count and avoid the duplicate traversal. Going to keep this on the last chunk (and other chunks omit count) unless there is strong objection.
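The running-count scheme settled on here can be sketched as below: the server accumulates the count as it emits chunks and attaches the total only to the final message, so no second index traversal is needed. This assumes no client `Limit`, since a limited response would still need a separate full count; names are illustrative:

```go
package main

import "fmt"

// chunk is a toy stand-in for RangeStreamResponse: kvs for this slice of the
// range, count set only on the final chunk.
type chunk struct {
	kvs   []string
	count int64
}

// streamWithRunningCount emits fixed-size chunks while carrying a running
// count, so the total is known by the time the final chunk is sent.
func streamWithRunningCount(keys []string, size int, emit func(chunk)) {
	var total int64
	for start := 0; start < len(keys); start += size {
		end := start + size
		if end > len(keys) {
			end = len(keys)
		}
		c := chunk{kvs: keys[start:end]}
		total += int64(len(c.kvs))
		if end == len(keys) {
			c.count = total // deferred to the last message
		}
		emit(c)
	}
}

func main() {
	streamWithRunningCount([]string{"a", "b", "c", "d", "e"}, 2, func(c chunk) {
		fmt.Println(len(c.kvs), c.count)
	})
}
```

Returning the count on the first chunk instead would force a full traversal up front, which is exactly the duplicate work this design avoids.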
| ```protobuf
| service KV {
|   rpc RangeStream(RangeRequest) returns (stream RangeStreamResponse) {}
Do we support all the fields in RangeRequest, like limit?
Yes. One caveat is that we reject requests with non default sort order because they defeat the purpose of streaming.
|   results in chunks instead of buffering the entire response.
| - Eliminate redundant count computation across paginated requests by computing
|   the total count once on the first chunk.
| - Provide a streaming API that produces results identical to the unary Range
Maybe I missed something—do we have a section that describes how to integrate with the kube-apiserver? Should the kube-apiserver be aware of the RangeStream API? If not, we can keep the details hidden - https://github.com/etcd-io/etcd/blob/6a4e69bb85c485115540ff0384dde195c0bbdb1b/client/v3/kubernetes/interface.go#L42
do we have a section that describes how to integrate with the kube-apiserver?
+1
This will also help us evaluate whether the API is well-designed.
As a K8s apiserver storage approver I have reviewed the API and it makes sense to me. I also invited other API machinery members to review it.
I'm supportive of this.
The batched range requests were a band-aid added quickly many years ago to avoid sending large range requests with long-running read txns to etcd. Adding streaming and server-side chunking will be a huge improvement.
The watch cache goes through clientv3.KV directly so the detail is hidden in the Kubernetes interface. Given that we have consistent list from cache, almost nothing should be hitting the etcd interface List.
Given that we have consistent list from cache, almost nothing should be hitting the etcd interface List
Please note all watch cache features have a fallback mechanism to etcd. Consistent reads from cache fall back if the cache is older than 3 seconds; pagination falls back when snapshots are not available after a restart. We see <1% of requests falling back, yet their high cost has a huge impact on cluster stability.
See SIG API machinery meeting notes for Mar 4th.
[serathius] Consistent read from cache fallback, works great until defrag.
* Defrag stalls member, watch gets delayed more than 3s, watch cache falls back reads to etcd.
* Due to lack of APF protection (not aware of fallback), memory jumps (2GB pods)
* Etcd 17GB->180GB
* API server 50GB->160GB
* Possible fixes:
* Fallback down to APF and recalculate request cost. Should increase APF cost from 10 to 100, 10x less requests passing to etcd. Still memory jumping 2x. (1.36)
* Etcd range streaming (proposal [RangeStream Design](https://docs.google.com/document/d/1nSO2CvjvFjPkI5tRJxSQjpDhe87a8FQiHJv0Ri24t1E/edit?usp=sharing), etcd 3.7, maybe K8s 1.37-1.38?).
* Stabilizes etcd memory, but still API server manages whole LIST object.
* Graceful degrade and return 429 status code instead of making things worse (https://tinyurl.com/k8s-graceful-shutdown )
To support watch cache fallback to regular List, we'll need to add a new interface ListStream or something similar for the streaming API. Updated this section. I don't think we can fully abstract out the interface because clients need to know whether RangeStream is supported to determine whether to paginate client side. We can abstract the details and say something like the interface handles chunking internally and the returned response is identical to a RangeResponse.
It would be better to add a section to the design clarifying the Kubernetes/api-server side changes.
| work when clients paginate (repeated Range calls with increasing keys
| recompute the total count on every page by walking the full B-tree index).
This question is not specific to this KEP, but would it be useful for etcd to provide an option to avoid sending the total count? I suspect clients most often only care whether a request that asks for a range with a limit reached the end of the range or not.
This is an interesting angle. We wouldn't need to go through the CountTotal optimization section, and the API would be simpler. If it's just an option though, I'd imagine we still want to keep both options optimized and perform the counttotal optimization anyway.
RangeStream is intended to replace Range for large Range requests anyway (and solves the count problem); would any consumers care about the option to skip the count?
The ability to opt out for the existing Range op feels like an easy-ish win. Not sure if it will matter for k8s given the plan to switch to RangeStream.
For RangeStream, since the plan is to implement the limit option, I can imagine clients wanting to know the total. But for k8s, I think we'd opt out of the total if we could?
Jefftree left a comment:
I think all comments have been addressed.
- Add bbolt per-chunk transaction note
- Kubernetes API Server Integration section (watch cache path, client-side simplification)
| | Message     | Contents                                                          |
| |-------------|-------------------------------------------------------------------|
| | Header      | ClusterId, MemberId, RaftTerm (sent immediately from v3rpc layer) |
| | First chunk | Revision, Count, Kvs                                              |
Do we need Count in the first chunk? It would require another tree scan, while not giving any benefits.
Moved count to last chunk so we don't need an extra tree scan.
| returns `InvalidArgument` for these requests and clients should use
| the unary `Range` RPC instead.
|
| ### Chunk Sizing
Is it a new field in RangeStream or a new option for the etcd process?
cc @linxiulei :) - etcd-io/etcd#16300
Current thinking is that it's derived from etcd's existing max-request-bytes server config option so no new fields. This is referring to an implementation detail between server and the underlying MVCC.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: Jefftree. It still needs approval from an approver in each of the affected files.
LGTM overall. The etcd release cadence is slow, so my concern is when we can get a new v3.7 release that includes this.
| ```go
| ListStream(ctx context.Context, prefix string, opts ListStreamOptions, cb func(ListStreamResponse) error) error
| ```
Can we reuse the existing List interface method? If users want sorted responses they can use Range; otherwise use RangeStream. We encapsulate the details inside etcd's client SDK.
I wanted to but I think it'll be difficult. Currently pagination is done on the client side so they call List with a limit and manage continuation tokens themselves. The logic lives outside of List.
RangeStream will allow callers to issue List without a limit and have chunking handled internally. However, when falling back to an etcd server that doesn't support RangeStream we need to fall back to the existing client-side pagination pattern that is controlled by the caller rather than encapsulated within the List. This makes it hard to transparently switch between the two behind a single List interface.
Eventually I think we will need to consolidate them into one List. Can we add an option into ListOptions for now to differentiate the List and streamList instead of adding a new interface method ListStream?
Currently pagination is done on the client side so they call `List` with a limit and manage continuation tokens themselves. The logic lives outside of `List`.
Not sure whether it's feasible to move that code into etcd client SDK eventually. But it can be discussed separately.
Yeah that seems reasonable. Updated to include an additional ListOptions parameter Stream.
Not sure whether it's feasible to move that code into etcd client SDK eventually. But it can be discussed separately.
Agree it may be feasible but we can keep it as a separate discussion.
Overall looks good, with one comment: #5967 (comment)
This KEP proposes adding a `RangeStream` RPC to etcd's KV service. Instead of buffering the entire Range response in memory before sending, the server streams results back in chunks. The main problems this addresses:

- Large Range responses are buffered in full on the server before sending, driving memory spikes; streaming returns results in chunks instead.
- Client-side pagination recomputes the total count on every page by walking the full B-tree index.

The new RPC reuses the existing `RangeRequest` and wraps responses in a `RangeStreamResponse`. Clients reassemble the stream with `proto.Merge()` and get identical results to a regular `Range()` call. The server handles chunking internally with adaptive sizing, pins to a single MVCC revision for consistency, and computes the total count only once. Requests with non-default sort orders are rejected since sorting defeats the purpose of streaming; those clients should use the unary `Range` RPC. Older clients are completely unaffected, and downgrading to a version without `RangeStream` just returns `Unimplemented`.