Version: Next

S3 Data Connector Deployment Guide

Production operating guide for the S3 data connector covering IAM authentication, credential chains, file-format tuning, metrics, and observability.

Authentication & Secrets

S3 authentication is selected via s3_auth:

| Value | Behavior |
| --- | --- |
| (unset) | Default AWS credential chain (IAM-based). Equivalent to iam_role with iam_role_source: auto. |
| iam_role | Load credentials from the AWS credential chain; the source is further narrowed by iam_role_source. |
| key | Use the explicit s3_key / s3_secret pair. Required for S3-compatible stores that do not speak IAM (MinIO, Cloudflare R2 with keys, Backblaze B2, etc.). |
| public | Unauthenticated access for public buckets. |

IAM Role Source

When s3_auth is unset or iam_role, the credential source is controlled by iam_role_source:

| Value | Behavior |
| --- | --- |
| auto | Default AWS credential chain (env vars → shared credentials file → IMDS/ECS/IRSA). |
| metadata | Restrict to instance/container metadata only: IMDS (EC2), ECS task role, EKS IRSA (pod role). |
| env | Restrict to environment variables only (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN). |

For production on EKS or ECS, prefer iam_role_source: metadata to guarantee the runtime only draws credentials from the workload identity, never from ambient environment variables.
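
As a minimal sketch of that recommendation, the dataset below pins credentials to the workload identity. It assumes the usual spicepod dataset layout (from, name, params); the bucket path, dataset name, region, and file_format value are placeholders, not values taken from this guide.

```yaml
datasets:
  # Hypothetical bucket and dataset name, for illustration only.
  - from: s3://example-bucket/events/
    name: events
    params:
      file_format: parquet        # assumed file format for this example
      s3_auth: iam_role
      iam_role_source: metadata   # only IMDS / ECS task role / IRSA are consulted
      s3_region: us-east-1
```

With this setting, an AWS_ACCESS_KEY_ID that happens to be present in the container environment should not be picked up, which is the point of choosing metadata over auto.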

Key Auth for S3-Compatible Stores

For MinIO, R2, B2, or on-prem S3 gateways:

params:
  s3_auth: key
  s3_key: ${secrets:s3_key}
  s3_secret: ${secrets:s3_secret}
  s3_endpoint: https://minio.internal:9000
  s3_region: us-east-1

Keys must be sourced from a secret store in production. See Secret Stores.

Region Validation

s3_region is validated against AWS's known region set and must be lowercase. Invalid regions are rejected at startup. Custom S3-compatible endpoints still require a valid-looking AWS region code.

Resilience Controls

Retry Behavior

S3 I/O uses the AWS SDK's default retry strategy: standard adaptive backoff with retries on throttling (SlowDown, 503) and transient network errors. Per-operation retry parameters are not currently exposed at the Spice layer.

Permanent Failures

Authentication failures (401, 403) and missing buckets (404) surface immediately as query errors. Unlike the Databricks connector, the S3 connector does not permanently disable itself after such a failure. Each subsequent query re-attempts authentication, so transient IAM or network issues self-heal.

Capacity & Sizing

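  • Object store throughput: S3 request throughput scales per prefix (AWS documents at least 5,500 GET/HEAD requests per second per prefix). For large Parquet workloads, partition data by date or tenant so reads spread across prefixes and can run in parallel.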
  • Hive partitioning: Enable hive_partitioning_enabled: true when listing partitioned datasets so DataFusion can prune irrelevant partitions at plan time instead of listing and filtering at execution time.
  • Schema inference cost: On first registration, Spice samples files to infer schema. Provide an explicit schema in the dataset definition for large datasets to avoid repeated list/head operations.
  • DataFusion batch size: Object-store reads yield 8192-row record batches by default. Increase via runtime tuning for CPU-bound scans over compressed formats.
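
The Hive partitioning point above can be sketched as follows. The object layout, bucket, dataset name, and partition columns (date, tenant) are hypothetical; only hive_partitioning_enabled comes from this guide, and file_format is an assumed parameter value.

```yaml
# Assumed object layout (hypothetical bucket and partition columns):
#   s3://example-bucket/events/date=2024-01-01/tenant=acme/part-0000.parquet
#   s3://example-bucket/events/date=2024-01-02/tenant=acme/part-0000.parquet
datasets:
  - from: s3://example-bucket/events/
    name: events
    params:
      file_format: parquet
      hive_partitioning_enabled: true   # treat key=value path segments as partition columns
```

Under these assumptions, a query that filters on date = '2024-01-02' only needs the files under that one directory, because the other partitions can be pruned at plan time.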

Metrics

S3 I/O metrics are collected via the shared runtime-object-store layer (request counts, retries, bytes read) and are exposed through Spice's runtime metrics. See Component Metrics for configuration.

The connector does not currently register S3-specific dataset-level instruments. Monitor S3 health via:

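  • Amazon CloudWatch request metrics on the bucket (AllRequests, 4xxErrors, 5xxErrors, TotalRequestLatency). These are opt-in: S3 publishes request metrics only after a metrics configuration is enabled on the bucket.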
  • Spice's query-execution metrics (query_duration_ms, query_processed_rows) from runtime.metrics.

Task History

S3 object reads participate in Spice task history through DataFusion's object-store plan nodes. Individual object GETs are attributed to their enclosing sql_query or accelerated_table_refresh task via the DataFusion execution plan.

Known Limitations

  • Writes are not supported; the S3 connector is read-only.
  • S3 Express One Zone directory buckets are supported transparently via s3:// URIs when the region and endpoint match.
  • Server-side encryption with customer-provided keys (SSE-C) is not exposed; SSE-S3 and SSE-KMS work transparently when the role/user has KMS decrypt permission.
  • Requester-pays buckets are not currently supported.
  • Cross-region access incurs AWS data-transfer charges; place Spice in the same region as the bucket for best cost and latency.

Troubleshooting

| Symptom | Likely cause | Resolution |
| --- | --- | --- |
| The request signature we calculated does not match the signature you provided | Clock skew or wrong s3_key/s3_secret. | Verify secret values; check the system clock (AWS tolerates only ~15 min drift). |
| Access Denied | IAM policy lacks s3:GetObject or s3:ListBucket. | Attach a policy granting read on the bucket and prefix. Cross-account buckets also need a bucket policy. |
| NoSuchBucket | Bucket does not exist in the configured region. | Confirm bucket name and s3_region. |
| EnvCredentialsNotSet on EKS | iam_role_source: env while running under IRSA. | Set iam_role_source: metadata or auto. |
| InvalidSignatureException against MinIO/R2 | s3_endpoint not set, or the AWS SDK is trying to sign for AWS S3. | Set s3_endpoint and s3_region to match the S3-compatible provider. |
| Slow queries on large partitioned datasets | Hive partitioning not enabled; every scan lists all files. | Set hive_partitioning_enabled: true and encode partitions as key=value/ in the path. |
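
For the Access Denied case, a read-only policy might look like the sketch below, written as a CloudFormation fragment so it stays in YAML. The resource name SpiceS3ReadPolicy, the bucket example-bucket, and the prefix events/ are hypothetical; adapt them to the dataset path in use. Note that s3:ListBucket is granted on the bucket ARN, while s3:GetObject is granted on the object ARNs.

```yaml
Resources:
  SpiceS3ReadPolicy:                # hypothetical logical name
    Type: AWS::IAM::ManagedPolicy
    Properties:
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Sid: ListDatasetPrefix
            Effect: Allow
            Action: s3:ListBucket
            Resource: arn:aws:s3:::example-bucket        # bucket-level ARN
            Condition:
              StringLike:
                s3:prefix:
                  - "events/*"                           # limit listing to the dataset prefix
          - Sid: ReadDatasetObjects
            Effect: Allow
            Action: s3:GetObject
            Resource: arn:aws:s3:::example-bucket/events/*   # object-level ARN
```

If the bucket uses SSE-KMS, the same role would also need decrypt permission on the key, as noted under Known Limitations.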