Version: Next

Arrow Data Accelerator Deployment Guide

Production operating guide for the Arrow in-memory data accelerator covering memory sizing, optional hash indexes, and observability.

Authentication & Secrets

The Arrow accelerator is an in-process, in-memory engine. It uses no external storage, so no authentication or secret management is required.

Resilience & Durability

The Arrow accelerator is not durable. Data is held in RAM and is lost on process restart; every restart re-materializes the dataset from the source connector.

  • Crash recovery: None — on restart, the dataset is refreshed from scratch.
  • File modes: File-mode acceleration is rejected at startup; Arrow is memory-only. Use DuckDB, SQLite, PostgreSQL, or Cayenne when durability or spill is required.
  • Concurrency: Arrow reads are lock-free. Refresh cadence is controlled by the runtime refresh semaphore, not by the accelerator itself.

Capacity & Sizing

  • Memory: Plan for 1.0–1.5× the raw row-oriented size of the source data, plus overhead for string dictionaries. Use the source connector's schema and row count to estimate.
  • Hash index: Optional, disabled by default. When enabled via hash_index: enabled, a hash map is built over the primary-key columns. Build time scales linearly with rows; memory overhead is approximately 24–48 bytes per row plus the key size.
  • Startup cost: Full-dataset materialization happens on startup. For tables larger than ~1 GB, consider a durable accelerator to avoid repeated full refresh on every restart.
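The sizing rules above reduce to simple arithmetic. The following is a minimal sketch, not part of the runtime: the function name, its parameters, and the example row count and row width are all hypothetical. It applies the 1.0–1.5× multiplier to the raw row-oriented size and, when the hash index is enabled, adds 24–48 bytes per row plus the key size. String-dictionary overhead is not modeled, so treat the result as a lower bound for string-heavy tables.

```python
from dataclasses import dataclass


@dataclass
class SizingEstimate:
    """Low/high memory estimates in bytes (illustrative only)."""
    data_low_bytes: int
    data_high_bytes: int
    index_low_bytes: int
    index_high_bytes: int

    @property
    def total_low_bytes(self) -> int:
        return self.data_low_bytes + self.index_low_bytes

    @property
    def total_high_bytes(self) -> int:
        return self.data_high_bytes + self.index_high_bytes


def estimate_arrow_memory(row_count: int, avg_row_bytes: int,
                          key_bytes: int = 0,
                          hash_index: bool = False) -> SizingEstimate:
    """Apply the guide's rules of thumb to a row count and average row width.

    avg_row_bytes is the raw row-oriented width derived from the source
    schema; key_bytes is the combined width of the primary-key columns.
    String-dictionary overhead is NOT included.
    """
    if row_count < 0 or avg_row_bytes < 0 or key_bytes < 0:
        raise ValueError("inputs must be non-negative")

    raw = row_count * avg_row_bytes
    data_low = raw              # 1.0x the raw size
    data_high = raw * 3 // 2    # 1.5x the raw size

    if hash_index:
        # 24-48 bytes of per-row overhead, plus the key itself
        index_low = row_count * (24 + key_bytes)
        index_high = row_count * (48 + key_bytes)
    else:
        index_low = index_high = 0

    return SizingEstimate(data_low, data_high, index_low, index_high)


if __name__ == "__main__":
    # Hypothetical table: 10 million rows, 200 bytes per row, 8-byte key.
    est = estimate_arrow_memory(10_000_000, 200, key_bytes=8, hash_index=True)
    print(f"plan for {est.total_low_bytes / 1e9:.2f} to "
          f"{est.total_high_bytes / 1e9:.2f} GB")
```

Comparing the upper estimate against the RAM available to the runtime process gives a quick indication of whether a durable accelerator is the safer choice.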

Metrics

Generic acceleration metrics are available with the dataset_acceleration_ prefix. Hash-index operations emit dedicated metrics when the index is enabled:

| Metric | Type | Description |
| --- | --- | --- |
| hash_index_builds | Counter | Total hash-index builds (one per refresh). |
| hash_index_build_duration_ms | Histogram | Time to build the hash index. |
| hash_index_entries | Gauge | Number of entries in the index. |
| hash_index_memory_bytes | Gauge | Approximate memory footprint of the index. |
| hash_index_lookups | Counter | Total hash-index lookups performed by queries. |
| hash_index_lookup_rows | Counter | Total rows returned via hash-index lookups. |

See Component Metrics for enabling and exporting metrics. Refresh metrics are described in Acceleration.
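Two ratios derived from these metrics can be useful when reviewing an index. The sketch below is illustrative and assumes the metric values have already been collected into a plain dictionary keyed by the names listed above; how they are scraped, and any prefix the exporter adds to the names, is outside its scope. The function name is hypothetical.

```python
from typing import Dict, Optional


def hash_index_summary(metrics: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Derive per-entry memory and rows-per-lookup from a metrics snapshot.

    A ratio is None when either metric is missing or the denominator is 0.
    """
    def ratio(numerator: str, denominator: str) -> Optional[float]:
        num = metrics.get(numerator)
        den = metrics.get(denominator)
        if num is None or not den:
            return None
        return num / den

    return {
        # Comparable to the per-row overhead figure in Capacity & Sizing.
        "bytes_per_entry": ratio("hash_index_memory_bytes", "hash_index_entries"),
        # Average number of rows returned per hash-index lookup.
        "rows_per_lookup": ratio("hash_index_lookup_rows", "hash_index_lookups"),
    }
```

A bytes_per_entry value far outside the 24–48 bytes plus key size range may indicate unusually wide keys and is worth checking against the sizing estimate.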

Task History

Arrow acceleration operations (refresh, query) participate in task history through the shared acceleration spans (accelerated_table_refresh, sql_query). No Arrow-specific spans are emitted — the accelerator is a thin wrapper over Arrow memory.

Known Limitations

  • No persistence: Every restart refreshes from the source.
  • No traditional indexes: Arrow does not support B-tree indexes. The hash index accelerates point lookups but provides no range or sort-order optimization.
  • Only primary-key hash index: The hash index requires a primary_key constraint; unique constraints alone do not enable the index.
  • Memory pressure: If the dataset exceeds available RAM, the runtime process runs out of memory (OOM); the Arrow accelerator itself has no spill-to-disk mechanism.
  • partition_by: Not applicable — Arrow accelerator holds a single in-memory representation.

Troubleshooting

| Symptom | Likely cause | Resolution |
| --- | --- | --- |
| OOM on refresh | Source dataset larger than RAM. | Switch to a durable accelerator (DuckDB / SQLite / Cayenne) that supports spill to disk. |
| Long startup time | Full-dataset refresh runs on boot. | Switch to a durable accelerator so refresh is incremental, not full, on restart. |
| hash_index ignored | No primary-key constraint on the dataset. | Add primary_key: to the dataset definition; hash index activates automatically. |
| Query slow for point lookups | Hash index disabled or wrong key column. | Enable hash_index: enabled; ensure the query filter matches the primary-key columns. |
| Accelerator refuses to start with file mode | Arrow rejects file-mode acceleration. | Switch engine: to duckdb, sqlite, postgres, or cayenne. |
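Several of these symptoms can be caught before deployment with a simple pre-flight check. The sketch below is an assumption-laden illustration, not a runtime feature: it takes acceleration settings as a plain dictionary whose keys (engine, mode, hash_index, primary_key) are hypothetical stand-ins for the corresponding dataset configuration fields, and it flags only the two configuration problems described in this guide.

```python
from typing import Any, Dict, List


def check_arrow_acceleration(settings: Dict[str, Any]) -> List[str]:
    """Return human-readable problems for an Arrow acceleration config.

    The dictionary keys are hypothetical; map them to the real
    configuration fields before use.
    """
    problems: List[str] = []

    # The checks below apply only to the Arrow engine.
    if settings.get("engine") != "arrow":
        return problems

    # Arrow is memory-only: file-mode acceleration is rejected at startup.
    if settings.get("mode") == "file":
        problems.append(
            "file mode is not supported by arrow; "
            "use duckdb, sqlite, postgres, or cayenne"
        )

    # The hash index needs a primary key; without one it is ignored.
    if settings.get("hash_index") == "enabled" and not settings.get("primary_key"):
        problems.append("hash_index is enabled but no primary_key is defined")

    return problems


if __name__ == "__main__":
    for problem in check_arrow_acceleration(
        {"engine": "arrow", "mode": "file", "hash_index": "enabled"}
    ):
        print("config problem:", problem)
```

An empty list means neither of the two checked problems was found; it does not guarantee that the dataset fits in memory.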