Skip to content

operator: fix ingestor reconciliation and CRD handling#1051

Open
jessejlt wants to merge 16 commits intomainfrom
jesse/bug-op-ingestor-squash
Open

operator: fix ingestor reconciliation and CRD handling#1051
jessejlt wants to merge 16 commits intomainfrom
jesse/bug-op-ingestor-squash

Conversation

@jessejlt
Copy link
Copy Markdown
Member

@jessejlt jessejlt commented Jan 6, 2026

Summary

  • add TLS and applied provisioning state fixes for the Ingestor CRD
  • refactor ingestor resource rendering to programmatic builders, including pull secrets/node selector/tolerations propagation
  • recover from criteria-expression error states across operator controllers
  • update ingestor CRD docs

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the Ingestor reconciliation logic to use programmatic resource builders instead of YAML templates, adds TLS configuration support to the Ingestor CRD, and implements error recovery for criteria expression failures across all operator controllers.

Key changes:

  • Replaced template-based resource generation with type-safe Go builders (BuildServiceAccount, BuildService, BuildStatefulSet)
  • Added automatic discovery and propagation of ImagePullSecrets, NodeSelector, and Tolerations from the operator pod to managed workloads
  • Implemented criteria expression error recovery across Ingestor, Collector, Alerter, and ADXCluster reconcilers

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
tools/crdgen/Dockerfile Updated to install specific Go version (1.25.3) and added tar dependency
pkg/logger/logger.go Enhanced controllerRuntimeLogSink to properly handle log levels, names, and values
operator/pullsecrets.go New discovery functions for ImagePullSecrets, NodeSelector, and Tolerations from operator pod
operator/pullsecrets_test.go Comprehensive test coverage for discovery functions
operator/manifests/ingestor_rbac.yaml Extracted RBAC resources from main ingestor manifest
operator/manifests/ingestor.yaml Removed (replaced by programmatic builders)
operator/manifests/crds/ingestors_crd.yaml Added TLS configuration fields (secretRef and hostPath)
operator/manifests/collector.yaml Removed namespace resource declaration
operator/manifests/alerter.yaml Removed namespace resource declaration
operator/ingestor_test.go Added comprehensive tests for reconciliation state machine, drift detection, and ownership handling
operator/ingestor_resources_test.go Tests for programmatic resource builders
operator/ingestor_resources.go New programmatic builders for Ingestor resources with type safety
operator/ingestor.go Major refactor: state machine fixes, ensureManifests for drift correction, deterministic ADXCluster ordering
operator/collector.go Added criteria expression error recovery
operator/builder_test.go Tests for common builder utilities
operator/builder.go Common label, toleration, and builder helper functions
operator/alerter.go Added criteria expression error recovery
operator/adx.go Added criteria expression error recovery
kustomize/bases/ingestors_crd.yaml Synced TLS configuration fields with operator manifests
docs/ingestor.md Added operator-managed deployment documentation
docs/designs/operator.md Updated Ingestor CRD documentation with status conditions and criteria expression
docs/crds.md Enhanced Ingestor documentation with status conditions table
cmd/operator/main.go Integrated discovery of ImagePullSecrets, NodeSelector, and Tolerations
api/v1/zz_generated.deepcopy.go Generated DeepCopy methods for TLSConfig and SecretReference
api/v1/ingestor_types.go Added TLSConfig struct and fixed recursive escaping in StoreAppliedProvisioningState
api/v1/alerter_types.go Fixed recursive escaping in StoreAppliedProvisioningState

@jessejlt jessejlt force-pushed the jesse/bug-op-ingestor-squash branch from 47c68ca to 2c5b0da Compare January 6, 2026 19:53
@jessejlt jessejlt marked this pull request as ready for review January 6, 2026 21:08
When updating status conditions for criteriaExpression errors,
return the error instead of just logging it. This ensures the
controller requeues and retries on transient API server failures,
so users will eventually see the correct status condition.
Replace argument sorting with order-independent set comparison in
statefulSetNeedsUpdate. This addresses the reviewer concern that
argument ordering may be semantically significant for process args.

- Add argsEqual() helper for multiset comparison (handles duplicates)
- Remove endpoint sorting in buildIngestorConfig
- Remove cluster label sorting in buildIngestorArgs
- Add comprehensive test coverage for args comparison

The set comparison prevents spurious StatefulSet rollouts caused by
non-deterministic map iteration or Kubernetes List API ordering,
while preserving the natural argument order produced by the builder.
Extract mapADXClusterToIngestors from inline closure to a named receiver
method on IngestorReconciler, enabling direct unit testing of the
ADXCluster-to-Ingestor watch mapping logic.

Add tests targeting state machine fragility and edge cases:

- CriteriaExpression tests: Cover CEL expression evaluation including
  syntax errors (CriteriaExpressionError), false evaluations
  (CriteriaExpressionFalse), and recovery transitions when expressions
  are fixed or removed. These catch issues where terminal vs transient
  error handling is confused.

- ADXCluster edge cases: Test mixed ready/not-ready clusters (verifies
  we wait for all clusters), federated cluster lexicographic selection
  (documents the deterministic but arbitrary selection behavior), and
  nil selector error handling.

- ensureManifests drift correction: Verify the function handles
  repeated calls without error and properly requeues when ADXCluster
  becomes not-ready after Ingestor is Ready.

- applyDefaults: Unit test the defaulting logic for Replicas, Image,
  and LogDestination to catch regressions in default values.

- Concurrent CRD installation: Test the mutex-protected ensureCRDsInstalled
  to verify concurrent goroutines don't cause duplicate installations or
  race conditions.

- mapADXClusterToIngestors: Test selector matching, deletion skipping,
  namespace scoping, and multiple matching ingestors to ensure the watch
  correctly enqueues reconcile requests.

- Terminal vs transient errors: Document and verify the error handling
  pattern where CEL errors return nil (terminal) vs API errors return
  error (transient) to prevent infinite retry loops.
@jessejlt
Copy link
Copy Markdown
Member Author

Addressed PR feedback

@jessejlt jessejlt enabled auto-merge (squash) January 12, 2026 16:21
@jessejlt jessejlt requested a review from mkeesey January 12, 2026 16:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants