| ### Non-Goals
|
| - Supporting custom sort orders in streaming mode. Requests with non-default
|   sort order fall back to a single buffered Range call.
What do you mean by fall back? Could we just say that users that need non-default sort order can continue to use Range?
Updated. I was originally thinking of having the server auto-switch to the unary Range, since the contract between the two is the same, but returning an error and having the client use unary Range makes more sense.
| Add a server-streaming `RangeStream` RPC to the etcd KV service that accepts
| the existing `RangeRequest` and returns a stream of `RangeStreamResponse`
| messages. The server handles pagination internally, pins to a single MVCC
| revision for snapshot consistency, and uses adaptive chunk sizing to
What happens if the stream took so long that revision is no longer available? Please describe the error returned by server and how client should detect and handle it.
One more question regarding long-running streaming: currently, when BoltDB remaps, it requires mmaplock, which could potentially cause stalls in the background commit goroutine. This needs to be handled carefully. While we don’t need to dive into too many details here, it's important that we test this aspect. One possible approach could be modifying compaction to preserve key-value pairs, in order to avoid locking the read transaction for too long.
Added. The server returns `ErrCompacted` and clients should retry.
edit: Just saw fuweid's comment that still needs to be addressed.
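A client-side retry for that error path could be sketched as below. `errCompacted`, `rangeStream`, and `listWithRetry` are illustrative names, not part of the etcd client API; a real client would compare against `rpctypes.ErrCompacted` and issue an actual RangeStream call:

```go
package main

import (
	"errors"
	"fmt"
)

// errCompacted stands in for etcd's rpctypes.ErrCompacted sentinel; the real
// client would compare against that error instead of this local one.
var errCompacted = errors.New("etcdserver: mvcc: required revision has been compacted")

// rangeStream stands in for one full RangeStream call that either returns the
// reassembled keys or fails partway through the stream.
type rangeStream func() ([]string, error)

// listWithRetry restarts the whole stream from scratch when the pinned
// revision was compacted away; any other error is returned to the caller.
func listWithRetry(do rangeStream, maxAttempts int) ([]string, error) {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		kvs, err := do()
		if err == nil {
			return kvs, nil
		}
		if !errors.Is(err, errCompacted) {
			return nil, err // not retryable
		}
		lastErr = err // retry: a fresh call pins a newer revision
	}
	return nil, lastErr
}

func main() {
	calls := 0
	flaky := func() ([]string, error) {
		calls++
		if calls == 1 {
			return nil, errCompacted // first attempt loses its revision mid-stream
		}
		return []string{"a", "b"}, nil
	}
	kvs, err := listWithRetry(flaky, 3)
	fmt.Println(kvs, err)
}
```

The key point is that a compacted pinned revision cannot be resumed, so the retry restarts the whole list rather than continuing from the last chunk.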
What happens if the stream took so long that revision is no longer available?
This shouldn't happen, because each TXN guarantees repeatable read.
long-running streaming: currently, when BoltDB remaps, it requires mmaplock, which could potentially cause stalls in the background commit goroutine.
Right, this is a drawback. It's recommended to avoid long-running read transactions; see https://github.com/etcd-io/bbolt?tab=readme-ov-file#caveats--limitations.
So let's clearly call out the trade-off of using rangeStream here.
- Pros: reduce the memory usage so that it can avoid OOM
- Cons: long-running read transaction may block write transaction.
- Note: normally a read TXN doesn't block a write TXN. A write TXN will only be blocked by a read TXN when it needs to allocate more space/pages.
This gets into how we expect the kube-apiserver to use RangeStream.
Do we expect the kube-apiserver to move entirely away from batching and fetch a list in a single request? Or are we still expecting the kube-apiserver to batch, but possibly request fewer, larger batched chunks?
If we are making a single range request to etcd, we need to understand the impact of the long-running read txn and at what range size it becomes a problem.
Chatted with @Jefftree out-of-band and I understand this better now. The proposal is to make the etcd server responsible for chunking the client's requests into multiple txns on the server side to avoid the long running transaction problem.
Per discussion with @jpbetz, added a note to the notes/caveats section that this RangeStream implementation does not use a long-lived read transaction; instead it pins the MVCC revision after the first chunk and reuses it for subsequent chunks under separate txns.
client's requests into multiple txns
Sounds good.
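The scheme agreed on in this thread (pin the MVCC revision once, then serve each chunk under its own short transaction) can be sketched against a toy multi-version store. All names and types here are hypothetical stand-ins, not etcd's actual `mvcc` package:

```go
package main

import (
	"fmt"
	"sort"
)

// version is one historical value of a key; mvccStore is a toy stand-in for
// etcd's MVCC store, not the real implementation.
type version struct {
	rev int64
	val string
}

type mvccStore struct {
	rev  int64
	hist map[string][]version
}

func newStore() *mvccStore { return &mvccStore{hist: map[string][]version{}} }

func (s *mvccStore) put(k, v string) {
	s.rev++
	s.hist[k] = append(s.hist[k], version{s.rev, v})
}

// rangeAt reads values as of revision rev, like one short read transaction.
// It stops after limit keys and returns the key to resume from.
func (s *mvccStore) rangeAt(rev int64, startKey string, limit int) (map[string]string, string) {
	keys := make([]string, 0, len(s.hist))
	for k := range s.hist {
		if k >= startKey {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	kvs := map[string]string{}
	for i, k := range keys {
		if i == limit {
			return kvs, k // chunk full; the next chunk resumes here
		}
		for _, v := range s.hist[k] {
			if v.rev <= rev {
				kvs[k] = v.val // latest version visible at rev wins
			}
		}
	}
	return kvs, "" // range exhausted
}

// streamRange serves chunks under separate short transactions, all pinned to
// the revision observed before the first chunk, so no single read txn spans
// the whole stream.
func streamRange(s *mvccStore, chunkLimit int, emit func(map[string]string)) {
	pinned := s.rev
	next := ""
	for {
		kvs, cont := s.rangeAt(pinned, next, chunkLimit)
		emit(kvs)
		if cont == "" {
			return
		}
		next = cont
	}
}

func main() {
	s := newStore()
	for _, k := range []string{"a", "b", "c", "d"} {
		s.put(k, "v1")
	}
	var chunks []map[string]string
	streamRange(s, 2, func(kvs map[string]string) {
		s.put("a", "v2") // a write lands between chunks; pinned reads ignore it
		chunks = append(chunks, kvs)
	})
	fmt.Println(chunks)
}
```

Writes that land between chunks bump the store's revision but never leak into the stream, which is the snapshot-consistency property the pinned revision buys without holding one long read transaction.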
| treated as internal design. The defined contract is that the merged
| `RangeResponse` produces identical results as `proto.Merge`.
|
| ### CountTotal Optimization
That sounds like a separate feature; do we even need a limit in K8s? For watch cache lists we don't need pagination, and for client requests we can just have the client close the connection after it hits the limit.
This is about the internal pagination for stream chunks, and is different than the client limit.
re client limit: The streaming API is designed to wrap RangeRequest which has the client limit field. I don't think we technically need to set the limit for watch cache, but removing it would be an API change that changes the request structure compared to unary.
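The merge contract quoted a few lines up (the reassembled `RangeResponse` must match what `proto.Merge` over the chunks would produce) can be illustrated with a hand-rolled merge over simplified types; the structs below are stand-ins, not the real generated protobuf messages:

```go
package main

import "fmt"

// Simplified stand-ins for the generated protobuf types; field names follow
// the design doc but this is not the real RangeStreamResponse.
type KeyValue struct {
	Key, Value string
}

type RangeStreamResponse struct {
	Revision int64      // set only on the first data chunk
	Count    int64      // set only on the final chunk
	Kvs      []KeyValue // this chunk's slice of the result
}

// mergeChunks mirrors proto.Merge semantics for this subset: repeated fields
// are concatenated in order, and scalar fields only overwrite when the
// incoming value is non-zero (proto3 treats zero values as unset).
func mergeChunks(chunks []RangeStreamResponse) RangeStreamResponse {
	var out RangeStreamResponse
	for _, c := range chunks {
		out.Kvs = append(out.Kvs, c.Kvs...)
		if c.Revision != 0 {
			out.Revision = c.Revision
		}
		if c.Count != 0 {
			out.Count = c.Count
		}
	}
	return out
}

func main() {
	merged := mergeChunks([]RangeStreamResponse{
		{Revision: 42, Kvs: []KeyValue{{"a", "1"}, {"b", "2"}}},
		{Kvs: []KeyValue{{"c", "3"}}, Count: 3},
	})
	fmt.Printf("rev=%d count=%d kvs=%d\n", merged.Revision, merged.Count, len(merged.Kvs))
}
```

A real client could call `proto.Merge` directly on the generated messages and get the same result; the manual version just makes the semantics visible.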
| ### Adaptive Chunk Sizing
|
| The chunk limit starts at 10 keys and adjusts based on response size relative
Instead of using adaptive chunking, could we have MVCC to return up to X bytes? Instead of having caller driven pagination on MVCC, have MVCC decide page size based on expected result size.
I think this is doable. Updated.
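The byte-budget alternative adopted here (MVCC returns up to X bytes per chunk rather than a caller-chosen key count, with the budget derived from something like the server's `max-request-bytes` setting) might look like this sketch; the names and sizing rule are assumptions, not etcd's implementation:

```go
package main

import "fmt"

// kv is a toy key-value pair; the real code would size actual mvccpb.KeyValue
// messages.
type kv struct {
	key, val string
}

// rangeUpToBytes walks keys in order and stops once adding the next pair
// would exceed the byte budget, returning the index to resume from. The
// budget would be derived from the server's max-request-bytes setting.
// The first pair is always included so a single oversized value cannot
// stall the stream.
func rangeUpToBytes(sorted []kv, start int, budgetBytes int) (chunk []kv, next int) {
	used := 0
	for i := start; i < len(sorted); i++ {
		size := len(sorted[i].key) + len(sorted[i].val)
		if len(chunk) > 0 && used+size > budgetBytes {
			return chunk, i // budget exhausted; resume at i
		}
		chunk = append(chunk, sorted[i])
		used += size
	}
	return chunk, len(sorted) // range exhausted
}

func main() {
	data := []kv{{"a", "xxxx"}, {"b", "xxxx"}, {"c", "xxxx"}, {"d", "xxxx"}}
	for i := 0; i < len(data); {
		var chunk []kv
		chunk, i = rangeUpToBytes(data, i, 10)
		fmt.Println(len(chunk))
	}
}
```

Letting MVCC decide the cut point keeps every chunk near the wire-size target without the feedback loop an adaptive key-count scheme needs.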
| | Final chunk | Kvs, Count, More |
|
| Count is deferred to the final message. Revision is only in the first data
| chunk. Clients reassemble by merging all messages.
From goals
Eliminate redundant count computation across paginated requests by computing
the total count once on the first chunk.
Why do we want the count done on the first chunk
We can do it on any chunk, realistically just want to count once so we can merge the count result in RangeResponse. I've aligned the inconsistent description to now both count and return the count on the first chunk.
After discussion with @serathius, returning the count on the first chunk will require a full traversal to obtain the chunk. If we return on the last chunk, we can keep a running count and avoid the duplicate traversal. Going to keep this on the last chunk (and other chunks omit count) unless there is strong objection.
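The running-count scheme settled on here can be sketched as below: the server accumulates the count as it emits chunks and attaches the total only to the final message, so no second index traversal is needed. This assumes no client `Limit`, since a limited response would still need a separate full count; names are illustrative:

```go
package main

import "fmt"

// chunk is a toy stand-in for RangeStreamResponse: kvs for this slice of the
// range, count set only on the final chunk.
type chunk struct {
	kvs   []string
	count int64
}

// streamWithRunningCount emits fixed-size chunks while carrying a running
// count, so the total is known by the time the final chunk is sent.
func streamWithRunningCount(keys []string, size int, emit func(chunk)) {
	var total int64
	for start := 0; start < len(keys); start += size {
		end := start + size
		if end > len(keys) {
			end = len(keys)
		}
		c := chunk{kvs: keys[start:end]}
		total += int64(len(c.kvs))
		if end == len(keys) {
			c.count = total // deferred to the last message
		}
		emit(c)
	}
}

func main() {
	streamWithRunningCount([]string{"a", "b", "c", "d", "e"}, 2, func(c chunk) {
		fmt.Println(len(c.kvs), c.count)
	})
}
```

Returning the count on the first chunk instead would force a full traversal up front, which is exactly the duplicate work this design avoids.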
| ```protobuf
| service KV {
|   rpc RangeStream(RangeRequest) returns (stream RangeStreamResponse) {}
Do we support all the fields in RangeRequest, like limit?
Yes. One caveat is that we reject requests with non default sort order because they defeat the purpose of streaming.
|   results in chunks instead of buffering the entire response.
| - Eliminate redundant count computation across paginated requests by computing
|   the total count once on the first chunk.
| - Provide a streaming API that produces results identical to the unary Range
Maybe I missed something—do we have a section that describes how to integrate with the kube-apiserver? Should the kube-apiserver be aware of the RangeStream API? If not, we can keep the details hidden - https://github.com/etcd-io/etcd/blob/6a4e69bb85c485115540ff0384dde195c0bbdb1b/client/v3/kubernetes/interface.go#L42
do we have a section that describes how to integrate with the kube-apiserver?
+1
This will also help us evaluate whether the API is well-designed.
As a K8s apiserver storage approver I have reviewed the API and it makes sense to me. I also invited other API machinery members to review it.
I'm supportive of this.
The batched range requests were a band-aid added quickly many years ago to avoid sending large range requests with long-running read txns to etcd. Adding streaming and server-side chunking will be a huge improvement.
The watch cache goes through clientv3.KV directly so the detail is hidden in the Kubernetes interface. Given that we have consistent list from cache, almost nothing should be hitting the etcd interface List.
Given that we have consistent list from cache, almost nothing should be hitting the etcd interface List
Please note all watch cache features have a fallback mechanism to etcd. Consistent reads from cache fall back if the cache is older than 3 seconds; pagination falls back when snapshots are not available after a restart. We see <1% of requests falling back, yet their high cost has a huge impact on cluster stability.
See SIG API machinery meeting notes for Mar 4th.
[serathius] Consistent read from cache fallback, works great until defrag.
* Defrag stalls member, watch gets delayed more than 3s, watch cache falls back reads to etcd.
* Due to lack of APF protection (not aware of fallback), memory jumps (2GB pods)
* Etcd 17GB->180GB
* API server 50GB->160GB
* Possible fixes:
* Fallback down to APF and recalculate request cost. Should increase APF cost from 10 to 100, 10x less requests passing to etcd. Still memory jumping 2x. (1.36)
* Etcd range streaming (proposal [RangeStream Design](https://docs.google.com/document/d/1nSO2CvjvFjPkI5tRJxSQjpDhe87a8FQiHJv0Ri24t1E/edit?usp=sharing), etcd 3.7, maybe K8s 1.37-1.38?).
* Stabilizes etcd memory, but still API server manages whole LIST object.
* Graceful degrade and return 429 status code instead of making things worse (https://tinyurl.com/k8s-graceful-shutdown )
To support watch cache fallback to regular List, we'll need to add a new interface ListStream or something similar for the streaming API. Updated this section. I don't think we can fully abstract out the interface because clients need to know whether RangeStream is supported to determine whether to paginate client side. We can abstract the details and say something like the interface handles chunking internally and the returned response is identical to a RangeResponse.
It would be better to add a section to the design clarifying the Kubernetes/api-server side changes.
| work when clients paginate (repeated Range calls with increasing keys
| recompute the total count on every page by walking the full B-tree index).
This question is not specific to this KEP, but would it be useful for etcd to provide an option to avoid sending the total count? I suspect clients most often only care whether a request that asks for a range with a limit reached the end of the range or not.
This is an interesting angle. We wouldn't need to go through the CountTotal optimization section, and the API would be simpler. If it's just an option though, I'd imagine we still want to keep both options optimized and perform the counttotal optimization anyway.
RangeStream is intended to replace Range for large Range requests anyway (and solves the count problem); would any consumers care about the option to skip the count?
The ability to opt out for the existing Range op feels like an easy-ish win. Not sure if it will matter for k8s given the plan to switch to RangeStream.
For RangeStream, since the plan is to implement the limit option, I can imagine clients wanting to know the total. But for k8s, I think we'd opt out of the total if we could?
Jefftree left a comment:
I think all comments have been addressed.
- Add bbolt per-chunk transaction note
- Kubernetes API Server Integration section (watch cache path, client-side simplification)
| | Message     | Contents                                                          |
| |-------------|-------------------------------------------------------------------|
| | Header      | ClusterId, MemberId, RaftTerm (sent immediately from v3rpc layer) |
| | First chunk | Revision, Count, Kvs                                              |
Do we need Count in the first chunk? It would require another tree scan, while not giving any benefits.
Moved count to last chunk so we don't need an extra tree scan.
| returns `InvalidArgument` for these requests and clients should use
| the unary `Range` RPC instead.
|
| ### Chunk Sizing
Is it a new field in RangeStream or a new option for the etcd process?
cc @linxiulei :) - etcd-io/etcd#16300
Current thinking is that it's derived from etcd's existing max-request-bytes server config option so no new fields. This is referring to an implementation detail between server and the underlying MVCC.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: Jefftree. It still needs approval from an approver in each of the affected files.
LGTM overall. The etcd release cadence is slow, so my concern is when we can get a new v3.7 release that includes this.
| ```go
| ListStream(ctx context.Context, prefix string, opts ListStreamOptions, cb func(ListStreamResponse) error) error
| ```
Can we reuse the existing List interface method? If users want sorted responses they can use Range; otherwise use RangeStream. We encapsulate the details inside etcd's client SDK.
I wanted to but I think it'll be difficult. Currently pagination is done on the client side so they call List with a limit and manage continuation tokens themselves. The logic lives outside of List.
RangeStream will allow callers to issue List without a limit and have chunking handled internally. However, when falling back to an etcd server that doesn't support RangeStream we need to fall back to the existing client-side pagination pattern that is controlled by the caller rather than encapsulated within the List. This makes it hard to transparently switch between the two behind a single List interface.
Eventually I think we will need to consolidate them into one List. Can we add an option into ListOptions for now to differentiate the List and streamList instead of adding a new interface method ListStream?
Currently pagination is done on the client side so they call `List` with a limit and manage continuation tokens themselves. The logic lives outside of `List`.
Not sure whether it's feasible to move that code into etcd client SDK eventually. But it can be discussed separately.
Yeah that seems reasonable. Updated to include an additional ListOptions parameter Stream.
Not sure whether it's feasible to move that code into etcd client SDK eventually. But it can be discussed separately.
Agree it may be feasible but we can keep it as a separate discussion.
Overall looks good, with one comment: #5967 (comment)
This KEP proposes adding a `RangeStream` RPC to etcd's KV service. Instead of buffering the entire Range response in memory before sending, the server streams results back in chunks. The main problems this addresses:

- Large Range responses are buffered in full on the server before sending, driving memory spikes; streaming returns results in chunks instead.
- Client-side pagination recomputes the total count on every page by walking the full B-tree index.

The new RPC reuses the existing `RangeRequest` and wraps responses in a `RangeStreamResponse`. Clients reassemble the stream with `proto.Merge()` and get identical results to a regular `Range()` call. The server handles chunking internally with adaptive sizing, pins to a single MVCC revision for consistency, and computes the total count only once. Requests with non-default sort orders are rejected since sorting defeats the purpose of streaming; those clients should use the unary `Range` RPC. Older clients are completely unaffected, and downgrading to a version without `RangeStream` just returns `Unimplemented`.