Welcome to the Arcane Kubernetes-native data streaming platform! This guide will help you understand how to use Arcane to create and manage data streams in your Kubernetes cluster.
- Core Concepts
- Getting Started
- Creating Your First Stream
- Managing Streams
- Backfilling Data
- Advanced Configuration
- Monitoring and Troubleshooting
- Best Practices
## Core Concepts

### StreamClass

A StreamClass is a cluster-scoped resource that tells the Arcane Operator which Custom Resource Definitions (CRDs) to watch for stream definitions. When you install a streaming plugin, it typically creates a StreamClass that registers the plugin with the operator.
If you need multiple instances of the same plugin with different configurations, create an additional StreamClass resource for each configuration.
Example StreamClass:

```yaml
apiVersion: streaming.sneaksanddata.com/v1
kind: StreamClass
metadata:
  name: sqlserver-change-tracking-stream
spec:
  # The API group of the stream CRD to watch
  apiGroupRef: streaming.sneaksanddata.com
  # The API version
  apiVersion: v1
  # The kind of the stream CRD
  kindRef: SqlServerChangeTrackingStream
  # The plural name used in API paths
  pluralName: sqlserverchangetrackingstreams
  # Secret fields to extract and pass to jobs (optional)
  # These fields will be looked up in the referenced secret in stream definitions
  # Typically used for connection strings, API keys, etc.
  # This field is usually defined by the plugin.
  secretRefs:
    - connectionString
    - storageAccountKey
```

Key Fields:
- `apiGroupRef`, `apiVersion`, `kindRef`, `pluralName`: define which CRD to watch
- `secretRefs`: list of secret fields that should be extracted and passed to streaming jobs. A plugin's secretRefs must be defined in a format that can be deserialized into the `EnvFromSource` object (see the shape below).
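For reference, a single Kubernetes `EnvFromSource` entry has the following shape; the secret name and prefix here are illustrative, not taken from any particular plugin:

```yaml
# One EnvFromSource entry (core/v1): injects every key of the Secret as an environment variable
- secretRef:
    name: database-credentials  # illustrative Secret name
  prefix: ARCANE_               # optional; prepended to each injected variable name (illustrative)
```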
### StreamingJobTemplate

A StreamingJobTemplate is a namespaced resource that defines the Kubernetes Job template used to run a stream. It contains the container image, resource limits, environment variables, and other job specifications.
Example StreamingJobTemplate:

```yaml
apiVersion: streaming.sneaksanddata.com/v1
kind: StreamingJobTemplate
metadata:
  name: sqlserver-stream-template
  namespace: data-streaming
spec:
  # Standard Kubernetes Job spec
  template:
    spec:
      containers:
        - name: stream-processor
          image: ghcr.io/sneaksanddata/arcane-stream-sqlserver-change-tracking:v1.0.12
          resources:
            limits:
              memory: "2Gi"
              cpu: "1000m"
            requests:
              memory: "1Gi"
              cpu: "500m"
```

### Stream Definitions

Stream Definitions are custom resources defined by streaming plugins. Each plugin creates its own CRD with specific fields for that data source or sink. The operator watches these resources and creates/manages Kubernetes Jobs based on their specifications.
Example Stream Definition (conceptual):

```yaml
apiVersion: streaming.sneaksanddata.com/v1
kind: SqlServerChangeTrackingStream
metadata:
  name: my-database-stream
  namespace: data-streaming
spec:
  # Reference to the job template (ObjectReference)
  jobTemplateRef:
    name: sqlserver-stream-template
    namespace: data-streaming  # optional, defaults to stream's namespace
  # Plugin-specific configuration
  sourceDatabase: MyProductionDB
  targetStorage: abfs://data@mystorageaccount.dfs.core.windows.net/tables
  # Secrets reference
  secretRef:
    name: database-credentials
  # Optional: suspend the stream
  suspended: false
  # Backfill job template reference (optional, ObjectReference)
  backfillJobTemplateRef:
    name: sqlserver-backfill-template
    namespace: data-streaming  # optional, defaults to stream's namespace
```

**Important:** Each streaming plugin defines its own stream CRD with specific fields. Refer to the plugin documentation for details on available stream definitions and their configurations.
### BackfillRequest

A BackfillRequest is used to trigger a one-time backfill job for a stream. This is useful when you need to reprocess historical data or recover from a failure.
Example BackfillRequest:

```yaml
apiVersion: streaming.sneaksanddata.com/v1
kind: BackfillRequest
metadata:
  name: backfill-2024-01
  namespace: data-streaming
spec:
  # Name of the StreamClass
  streamClass: sqlserver-change-tracking-stream
  # ID/name of the specific stream to backfill
  streamId: my-database-stream
  # Whether the backfill is completed
  completed: false
```

## Getting Started

### Prerequisites

- A Kubernetes cluster (v1.20 or later recommended)
- `kubectl` configured to access your cluster
- `helm` (v3.0 or later)
- Sufficient permissions to create cluster-scoped resources (CRDs, ClusterRoles, etc.)
### Installing the Operator

1. Create a namespace for Arcane:

```bash
kubectl create namespace arcane
```

2. Install the Arcane Operator using Helm:

```bash
helm install arcane oci://ghcr.io/sneaksanddata/helm/arcane-operator \
  --version v0.0.14 \
  --namespace arcane
```

3. Verify the installation:

```bash
kubectl get pods -n arcane
```

You should see the operator pod running:

```
NAME                               READY   STATUS    RESTARTS   AGE
arcane-operator-55988bbfcb-ql7qr   1/1     Running   0          1m
```
4. Check that the CRDs are installed:

```bash
kubectl get crd | grep streaming
```

You should see:

```
streamclasses.streaming.sneaksanddata.com
streamingjobtemplates.streaming.sneaksanddata.com
backfillrequests.streaming.sneaksanddata.com
```
### Installing Streaming Plugins

After installing the operator, you need to install streaming plugins for your data sources. Each plugin is installed via Helm and includes:

- Plugin-specific CRDs
- A StreamClass registration
- Default StreamingJobTemplate(s)
Example: installing the SQL Server Change Tracking plugin:

```bash
helm install sqlserver-plugin oci://ghcr.io/sneaksanddata/helm/arcane-stream-sqlserver-change-tracking \
  --version v1.0.0 \
  --namespace data-streaming \
  --create-namespace
```

Available Plugins:

- SQL Server Change Tracking: `arcane-stream-sqlserver-change-tracking`
- Microsoft Synapse Link: `arcane-stream-microsoft-synapse-link`
- Parquet Files: `arcane-stream-parquet`
- JSON Files: `arcane-stream-json`
- REST API: `arcane-stream-rest-api` (Akka-based, legacy)
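To see what a plugin chart exposes before installing it, you can query the OCI registry with Helm (3.8 or later); the chart and version below are illustrative:

```bash
# Print the default values of a plugin chart (version is illustrative)
helm show values oci://ghcr.io/sneaksanddata/helm/arcane-stream-parquet --version v1.0.0
```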
## Creating Your First Stream

Let's create a stream step by step.
1. Create a Kubernetes Secret with connection credentials:

```bash
kubectl create secret generic my-database-creds \
  --namespace data-streaming \
  --from-literal=connectionString="Server=myserver;Database=mydb;..." \
  --from-literal=storageAccountKey="your-storage-key"
```

2. Create a YAML file for your stream (e.g., my-stream.yaml):

```yaml
apiVersion: streaming.sneaksanddata.com/v1
kind: SqlServerChangeTrackingStream
metadata:
  name: orders-stream
  namespace: data-streaming
spec:
  jobTemplateRef:
    name: sqlserver-stream-template
    namespace: data-streaming  # optional, defaults to stream's namespace
  sourceDatabase: OrdersDB
  sourceTable: Orders
  targetStorage: abfs://data@mystorageaccount.dfs.core.windows.net/orders
  secretRef:
    name: my-database-creds
  pollInterval: 60s
```

3. Apply it:

```bash
kubectl apply -f my-stream.yaml
```

**Important:** When you create the stream, the operator will automatically create a BackfillRequest to backfill historical data, if the plugin supports it. Once the backfill is complete, the operator will start the regular streaming job.
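You can watch for the auto-created BackfillRequest as it appears (its exact name is assigned by the operator):

```bash
# Watch backfill requests in the stream's namespace
kubectl get backfillrequests -n data-streaming -w
```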
4. Verify that the stream is running:

```bash
# Check the stream status
kubectl get sqlserverchangetrackingstreams -n data-streaming

# Watch for jobs being created
kubectl get jobs -n data-streaming -w
```

You should see:

- A backfill job created initially (if configured)
- A regular streaming job that runs continuously or on schedule
## Managing Streams

List all streams:

```bash
kubectl get <stream-kind> -n data-streaming
```

Get detailed information:

```bash
kubectl describe <stream-kind> <stream-name> -n data-streaming
```

Check the stream phase:

Stream resources typically have a `status.phase` field with values like:

- `Pending`: initial state, waiting for the first job
- `Running`: stream job is actively running
- `Backfilling`: backfill job is in progress
- `Failed`: stream encountered an error
- `Suspended`: stream is paused
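For scripting, the phase can be read directly with a JSONPath query (assuming the plugin populates `status.phase`):

```bash
# Print only the stream's current phase
kubectl get <stream-kind> <stream-name> -n data-streaming -o jsonpath='{.status.phase}'
```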
To temporarily stop a stream without deleting it:

```bash
kubectl patch <stream-kind> <stream-name> -n data-streaming \
  --type merge \
  -p '{"spec":{"suspended":true}}'
```

Or edit the YAML:

```yaml
spec:
  suspended: true
```

The operator will delete the running job but keep the stream definition.
To resume a suspended stream:

```bash
kubectl patch <stream-kind> <stream-name> -n data-streaming \
  --type merge \
  -p '{"spec":{"suspended":false}}'
```

To permanently delete a stream:

```bash
kubectl delete <stream-kind> <stream-name> -n data-streaming
```

This will:

- Delete the stream definition
- Clean up associated jobs
- Remove any related resources
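To confirm the cleanup, you can list jobs by the `streamId` label used elsewhere in this guide (assuming your plugin labels its jobs this way):

```bash
# Should return nothing once cleanup has finished
kubectl get jobs -n data-streaming -l streamId=<stream-name>
```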
## Backfilling Data

Backfilling allows you to reprocess historical data for a stream.
Create a YAML file (e.g., backfill.yaml):

```yaml
apiVersion: streaming.sneaksanddata.com/v1
kind: BackfillRequest
metadata:
  name: orders-backfill-jan-2026
  namespace: data-streaming
spec:
  streamClass: sqlserver-change-tracking-stream
  streamId: orders-stream
  completed: false
```

Apply it:

```bash
kubectl apply -f backfill.yaml
```

Monitor the backfill:

```bash
# Check backfill status
kubectl get backfillrequest -n data-streaming

# View backfill job
kubectl get jobs -l streamId=orders-stream -n data-streaming

# Check backfill logs
kubectl logs job/orders-stream-backfill -n data-streaming
```

Once the backfill job completes successfully, the operator will automatically mark the BackfillRequest as completed.
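If a script needs to block until the backfill finishes, `kubectl wait` (v1.23+) can poll the `completed` field:

```bash
# Wait up to one hour for the operator to mark the request as completed
kubectl wait backfillrequest/orders-backfill-jan-2026 -n data-streaming \
  --for=jsonpath='{.spec.completed}'=true \
  --timeout=1h
```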
## Advanced Configuration

Control resource allocation by customizing the StreamingJobTemplate:

```yaml
spec:
  template:
    spec:
      containers:
        - name: stream-processor
          resources:
            limits:
              memory: "4Gi"
              cpu: "2000m"
            requests:
              memory: "2Gi"
              cpu: "1000m"
```

Best Practices:

- Start with conservative limits and increase based on monitoring
- Set requests lower than limits to allow efficient bin-packing
- Consider using the Vertical Pod Autoscaler for automatic tuning (see the sketch below)
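A minimal recommendation-only sketch, assuming the Vertical Pod Autoscaler is installed in your cluster; the target name is illustrative:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: orders-stream-vpa  # illustrative name
  namespace: data-streaming
spec:
  targetRef:
    apiVersion: batch/v1
    kind: Job
    name: orders-stream  # illustrative: the Job created for your stream
  updatePolicy:
    updateMode: "Off"  # recommendation-only; read suggested requests from the VPA status
```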
Inject environment variables:

```yaml
spec:
  template:
    spec:
      containers:
        - name: stream-processor
          env:
            - name: LOG_LEVEL
              value: "DEBUG"
            - name: MAX_BATCH_SIZE
              value: "1000"
```

Use secrets:

```yaml
env:
  - name: DB_PASSWORD
    valueFrom:
      secretKeyRef:
        name: database-credentials
        key: password
```

Mount secret files:

```yaml
volumeMounts:
  - name: credentials
    mountPath: /etc/credentials
    readOnly: true
volumes:
  - name: credentials
    secret:
      secretName: database-credentials
```

You can create multiple job templates for different scenarios:
Production template (high resources):

```yaml
apiVersion: streaming.sneaksanddata.com/v1
kind: StreamingJobTemplate
metadata:
  name: production-template
  namespace: data-streaming
spec:
  template:
    spec:
      containers:
        - name: stream-processor
          resources:
            limits:
              memory: "8Gi"
              cpu: "4000m"
```

Development template (low resources):

```yaml
apiVersion: streaming.sneaksanddata.com/v1
kind: StreamingJobTemplate
metadata:
  name: dev-template
  namespace: data-streaming
spec:
  template:
    spec:
      containers:
        - name: stream-processor
          resources:
            limits:
              memory: "512Mi"
              cpu: "250m"
```

Reference the appropriate template in your stream definition:

```yaml
spec:
  jobTemplateRef:
    name: production-template  # or dev-template
```

## Monitoring and Troubleshooting

View operator logs:
```bash
kubectl logs -n arcane -l app.kubernetes.io/name=arcane-operator -f
```

Check the operator health endpoint:

The operator exposes health probes on port 8888:

```bash
kubectl port-forward -n arcane deploy/arcane-operator 8888:8888
curl http://localhost:8888/healthz
curl http://localhost:8888/readyz
```

Get logs from the current job:
```bash
# List jobs
kubectl get jobs -n data-streaming

# View job logs
kubectl logs job/<stream-name> -n data-streaming -f
```

View logs from previous job runs:

```bash
# Get pods for a job, including completed ones (shown by default in current kubectl versions)
kubectl get pods -n data-streaming

# View logs from a completed pod
kubectl logs <pod-name> -n data-streaming
```

## Best Practices

- Create dedicated namespaces for different environments:
```bash
kubectl create namespace data-streaming-sql-server
kubectl create namespace data-streaming-json
kubectl create namespace data-streaming-parquet
```
- Always set resource requests and limits
- Monitor actual usage and adjust accordingly
- Use node selectors or affinities for workload placement (a node-affinity sketch follows the example):

```yaml
spec:
  template:
    spec:
      nodeSelector:
        workload-type: data-streaming
```
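The equivalent placement expressed as node affinity, as a sketch (the node label is illustrative):

```yaml
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: workload-type
                    operator: In
                    values:
                      - data-streaming
```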
Use consistent, descriptive names:

```yaml
# Good
name: orders-sqlserver-to-datalake-prod

# Avoid
name: stream1
```

- Delete unused streams and job templates
- Set up retention policies for completed jobs:

```yaml
spec:
  ttlSecondsAfterFinished: 86400  # 24 hours
```
**Note:** The operator manages the lifecycle of jobs: it recreates jobs that are deleted manually, and it deletes jobs that are no longer needed (e.g., when a stream is suspended).

- The operator does not require high availability and can run as a single replica, even on a spot instance.
- Each streaming job is independent; ensure your cluster has sufficient resources to handle peak loads (see the quota sketch below).
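One way to cap aggregate usage per namespace is a ResourceQuota; the numbers below are illustrative, not recommendations:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: streaming-quota
  namespace: data-streaming
spec:
  hard:
    requests.cpu: "8"        # illustrative caps; size these to your peak streaming load
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
```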