Version: Next

Arrow Data Accelerator Deployment Guide

Production operating guide for the Arrow in-memory data accelerator covering memory sizing, optional hash indexes, and observability.

Authentication & Secrets

The Arrow accelerator is an in-process, in-memory engine. It uses no external storage, so no authentication or secret management is required.

Resilience & Durability

The Arrow accelerator is not durable. Data is held in RAM and is lost on process restart; every restart re-materializes the dataset from the source connector.

Crash recovery: None — on restart, the dataset is refreshed from scratch.
File modes: File-mode acceleration is rejected at startup; Arrow is memory-only. Use DuckDB, SQLite, PostgreSQL, or Cayenne when durability or spill is required.
Concurrency: Arrow reads are lock-free. Refresh cadence is controlled by the runtime refresh semaphore, not by the accelerator itself.

Capacity & Sizing

Memory: Plan for 1.0–1.5× the raw row-oriented size of the source data, plus overhead for string dictionaries. Use the source connector's schema and row count to estimate.
Hash index: Optional, disabled by default. When enabled via `hash_index: enabled`, a hash map is built over the primary-key columns. Build time scales linearly with row count; memory overhead is approximately 24–48 bytes per row plus the key size.
Startup cost: Full-dataset materialization happens on startup. For tables larger than ~1 GB, consider a durable accelerator to avoid repeated full refresh on every restart.
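
The sizing guidance above can be turned into a quick back-of-the-envelope estimate. The sketch below is illustrative only: the function name, its parameters, and the example figures (10 million rows, 200 raw bytes per row, an 8-byte key) are assumptions, not values taken from the runtime. String dictionary overhead is not modeled.

```python
def estimate_arrow_memory(rows, raw_row_bytes, key_bytes=0, hash_index=False):
    """Return (low, high) estimated bytes for an Arrow-accelerated dataset.

    Uses the planning figures from this guide: 1.0-1.5x the raw
    row-oriented size, plus 24-48 bytes per row plus the key size
    when the hash index is enabled. String dictionary overhead is
    not included.
    """
    raw = rows * raw_row_bytes
    low, high = raw * 1.0, raw * 1.5
    if hash_index:
        low += rows * (24 + key_bytes)
        high += rows * (48 + key_bytes)
    return int(low), int(high)


# Hypothetical dataset: 10 million rows, ~200 raw bytes per row, 8-byte key.
low, high = estimate_arrow_memory(10_000_000, 200, key_bytes=8, hash_index=True)
print(f"estimated memory: {low / 1e9:.2f}-{high / 1e9:.2f} GB")
```

If the upper bound approaches the memory available to the runtime, or the raw size is well above the ~1 GB startup guideline, that is a signal to evaluate one of the durable accelerators instead.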

Metrics

Generic acceleration metrics are available with the `dataset_acceleration_` prefix. Hash-index operations emit dedicated metrics when the index is enabled:

Metric	Type	Description
`hash_index_builds`	Counter	Total hash-index builds (one per refresh).
`hash_index_build_duration_ms`	Histogram	Time to build the hash index.
`hash_index_entries`	Gauge	Number of entries in the index.
`hash_index_memory_bytes`	Gauge	Approximate memory footprint of the index.
`hash_index_lookups`	Counter	Total hash-index lookups performed by queries.
`hash_index_lookup_rows`	Counter	Total rows returned via hash-index lookups.

See Component Metrics for enabling and exporting metrics. Refresh metrics are described in Acceleration.
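
Two ratios derived from the hash-index metrics can be handy when reviewing a deployment: average rows returned per lookup, and index bytes per entry, which can be compared with the 24–48 bytes per row plus key size planning figure. A minimal sketch, assuming the metric values have already been scraped into a plain dict keyed by the metric names in the table; the function name and sample values are hypothetical, and exported names may carry a prefix in a given setup.

```python
def hash_index_ratios(metrics):
    """Derive per-lookup and per-entry ratios from hash-index metric values.

    `metrics` maps metric names from the table above to numeric values.
    Ratios whose denominator is zero or missing are reported as None.
    """
    def ratio(numerator, denominator):
        den = metrics.get(denominator, 0)
        return metrics.get(numerator, 0) / den if den else None

    return {
        "rows_per_lookup": ratio("hash_index_lookup_rows", "hash_index_lookups"),
        "bytes_per_entry": ratio("hash_index_memory_bytes", "hash_index_entries"),
    }


# Hypothetical sample values, for illustration only.
sample = {
    "hash_index_lookups": 5_000,
    "hash_index_lookup_rows": 5_000,
    "hash_index_entries": 1_000_000,
    "hash_index_memory_bytes": 40_000_000,
}
print(hash_index_ratios(sample))
```

A rows-per-lookup value close to 1 is what primary-key point lookups would be expected to produce; a bytes-per-entry value far above the planning range may point to wide keys.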

Task History

Arrow acceleration operations (refresh, query) participate in task history through the shared acceleration spans (`accelerated_table_refresh`, `sql_query`). No Arrow-specific spans are emitted, because the accelerator is a thin wrapper over Arrow memory.

Known Limitations

No persistence: Every restart refreshes from the source.
No traditional indexes: Arrow does not support B-tree indexes. The hash index accelerates point lookups but not range scans or sort-order optimization.
Only primary-key hash index: The hash index requires a `primary_key` constraint; unique constraints alone do not enable the index.
Memory pressure: If the dataset exceeds available RAM, the runtime will run out of memory (OOM); no spill-to-disk mechanism exists in the Arrow accelerator itself.
`partition_by`: Not applicable; the Arrow accelerator holds a single in-memory representation.
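
The point-lookup-only limitation follows from how hash maps work in general. The sketch below is a conceptual illustration in plain Python, not the accelerator's implementation: a dict keyed on the primary key answers an equality filter with a single probe, while a range filter gets no help from the index and falls back to checking every row. All names and data are made up.

```python
# Toy table: a list of rows, with `id` acting as the primary key.
rows = [{"id": i, "value": i * 10} for i in range(1, 1001)]

# Hash index: primary-key value -> position of the row in `rows`.
hash_index = {row["id"]: pos for pos, row in enumerate(rows)}


def point_lookup(key):
    """Equality filter (id = key): a single hash probe, no scan."""
    pos = hash_index.get(key)
    return rows[pos] if pos is not None else None


def range_scan(low, high):
    """Range filter (low <= id <= high): the hash index cannot help,
    because hashing discards key order, so every row is checked."""
    return [row for row in rows if low <= row["id"] <= high]


print(point_lookup(42))
print(len(range_scan(100, 199)))
```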

Troubleshooting

Symptom	Likely cause	Resolution
OOM on refresh	Source dataset larger than RAM.	Switch to a durable accelerator (DuckDB / SQLite / Cayenne) that supports spill to disk.
Long startup time	Full-dataset refresh runs on boot.	Switch to a durable accelerator so refresh is incremental, not full, on restart.
`hash_index` ignored	No primary-key constraint on the dataset.	Add `primary_key:` to the dataset definition; with `hash_index: enabled` set, the index then activates automatically.
Query slow for point lookups	Hash index disabled or wrong key column.	Set `hash_index: enabled`; ensure the query filter matches the primary-key columns.
Accelerator refuses to start with file mode	Arrow rejects file-mode acceleration.	Switch `engine:` to `duckdb`, `sqlite`, `postgres`, or `cayenne`.
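
Several rows in the table come down to configuration combinations that can be checked before deployment. A minimal sketch, assuming the acceleration settings have been flattened into a dict with the keys `engine`, `mode`, `hash_index`, and `primary_key`; that flat layout and the function name are hypothetical, and the real dataset definition may nest these settings differently.

```python
def check_arrow_acceleration(config):
    """Return a list of problems found in an Arrow acceleration config dict.

    The flat key layout is an assumption made for this sketch; the checks
    only encode rules stated in the troubleshooting table.
    """
    problems = []
    if config.get("engine") != "arrow":
        return problems  # The rules below apply to the Arrow engine only.
    if config.get("mode") == "file":
        problems.append(
            "Arrow rejects file mode; use duckdb, sqlite, postgres, or cayenne."
        )
    if config.get("hash_index") == "enabled" and not config.get("primary_key"):
        problems.append("hash_index is ignored without a primary_key.")
    return problems


# Hypothetical config that trips both checks.
bad = {"engine": "arrow", "mode": "file", "hash_index": "enabled"}
for problem in check_arrow_acceleration(bad):
    print(problem)
```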
