SLAF (Sparse Lazy Array Format)

SLAF is a high-performance format for single-cell transcriptomics data built on top of the Lance table format and Polars. For users of scanpy or anndata, it should feel like you never left. SLAF provides an advanced dataloader that looks and feels like PyTorch, but runs its own multi-threaded async prefetcher under the hood. Bleeding-edge internals, familiar interfaces.

pip install slafdb[ml]

GitHub | Documentation

Why SLAF?

Single-cell transcriptomics datasets have scaled 2,000-fold in less than a decade. A typical study used to have 50k cells that could be copied to SSD and processed in memory. At the 100M-cell scale, network, storage, and memory become bottlenecks.
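A back-of-envelope calculation shows why. The cell and gene counts below are illustrative assumptions (roughly 20,000 genes per cell, float32 values, ~5% non-zeros), not figures from the SLAF documentation:

```python
# Why a 100M-cell matrix no longer fits in memory (rough estimates).
n_cells = 100_000_000
n_genes = 20_000          # assumption: typical human gene panel
bytes_per_value = 4       # float32

# Dense storage: every (cell, gene) entry materialized.
dense_tb = n_cells * n_genes * bytes_per_value / 1e12
print(f"dense: {dense_tb:.0f} TB")        # ~8 TB

# Sparse (CSR-like) storage: only non-zeros, ~8 bytes each
# (4-byte value + 4-byte column index), assuming ~5% density.
nnz = int(n_cells * n_genes * 0.05)
sparse_tb = nnz * 8 / 1e12
print(f"sparse: {sparse_tb:.1f} TB")      # ~0.8 TB
```

Even the sparse representation is hundreds of gigabytes, far beyond what a single node can comfortably hold in RAM, which is why query-in-place storage matters.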

  1. The traditional analytic workload remains stuck in single-node, in-memory operations. Today we need to do cell/gene filtering, normalization, PCA/UMAP, and differential expression at 2,000x that scale.

  2. New AI-native workloads have arrived, such as streaming tokenized expression batches to train single-cell foundation models.

For these, we need cloud-native, zero-copy, query-in-place storage, without maintaining multiple copies per user, workload, application, or node, while retaining the numpy-like sparse matrix slicing and scanpy pipelines we already use.

Who is SLAF for?

Quick examples

Query with SQL (no full download):

from slaf import SLAFArray

slaf_array = SLAFArray("hf://datasets/slaf-project/Parse-10M")
results = slaf_array.query("""
    SELECT
        cytokine,
        cell_type,
        AVG(gene_count) as avg_gene_count
    FROM cells
    WHERE donor = 'Donor10'
      AND cytokine IN ('C5a', 'CD40L')
    GROUP BY cytokine, cell_type
    ORDER BY cytokine, avg_gene_count DESC
""")
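For intuition, the SQL above filters cells to one donor and two cytokines, then averages gene_count per (cytokine, cell_type) group. The same aggregation in pure Python, over toy rows (the values are made up for illustration, not taken from the Parse-10M dataset):

```python
from collections import defaultdict

# Toy stand-in for the `cells` table (hypothetical values).
cells = [
    {"donor": "Donor10", "cytokine": "C5a",   "cell_type": "CD8 Naive", "gene_count": 1200},
    {"donor": "Donor10", "cytokine": "C5a",   "cell_type": "CD8 Naive", "gene_count": 1400},
    {"donor": "Donor10", "cytokine": "CD40L", "cell_type": "B cell",    "gene_count": 900},
    {"donor": "Donor3",  "cytokine": "C5a",   "cell_type": "CD8 Naive", "gene_count": 100},
]

# WHERE donor = 'Donor10' AND cytokine IN ('C5a', 'CD40L')
# GROUP BY cytokine, cell_type
groups = defaultdict(list)
for row in cells:
    if row["donor"] == "Donor10" and row["cytokine"] in ("C5a", "CD40L"):
        groups[(row["cytokine"], row["cell_type"])].append(row["gene_count"])

# AVG(gene_count) per group
avg_gene_count = {k: sum(v) / len(v) for k, v in groups.items()}
print(avg_gene_count)
# {('C5a', 'CD8 Naive'): 1300.0, ('CD40L', 'B cell'): 900.0}
```

The difference is that SLAF pushes this computation down to the storage layer, so only the small aggregated result crosses the network.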

Lazy Scanpy-style slicing:

from slaf.integrations import read_slaf

adata = read_slaf("hf://datasets/slaf-project/Parse-10M")
subset = adata[
    (
        (adata.obs.cell_type == "CD8 Naive") &
        (adata.obs.cytokine == "C5a") &
        (adata.obs.donor == "Donor10")
    ), :
]
expression = subset[:10, :].X.compute()  # Only now is data loaded
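The key idea is laziness: slicing only records what was asked for, and no data moves until `.compute()`. A minimal pure-Python sketch of that pattern (a toy stand-in to show the mechanism, not SLAF's actual implementation):

```python
class LazyMatrix:
    """Records slice operations; loads data only when compute() is called."""

    def __init__(self, loader, pending=None):
        self._loader = loader          # callable that performs the expensive load
        self._pending = pending or []  # slice operations queued so far

    def __getitem__(self, key):
        # Cheap: remember the slice and return a new lazy view.
        return LazyMatrix(self._loader, self._pending + [key])

    def compute(self):
        # Expensive: load once, then replay the recorded slices.
        data = self._loader()
        for key in self._pending:
            data = data[key]
        return data


loads = []
def fake_loader():
    loads.append(1)            # count how often real I/O happens
    return list(range(100))

X = LazyMatrix(fake_loader)
view = X[10:20][:3]            # no I/O yet
assert loads == []
assert view.compute() == [10, 11, 12]
assert loads == [1]            # data touched exactly once, at compute()
```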

Stream tokenized batches for training:

from slaf import SLAFArray
from slaf.ml.dataloaders import SLAFDataLoader

slaf_array = SLAFArray("hf://datasets/slaf-project/Parse-10M")
dataloader = SLAFDataLoader(
    slaf_array=slaf_array,
    tokenizer_type="geneformer",
    batch_size=32,
    max_genes=2048,
    vocab_size=50000,
    prefetch_batch_size=1_000_000
)
for batch in dataloader:
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    # Your training code here
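For intuition about what the tokenizer produces: a Geneformer-style tokenizer encodes each cell as a sequence of gene IDs ranked by descending expression. A simplified sketch of that rank-value encoding follows; it is illustrative only (the real tokenizer also handles normalization, special tokens, and vocabulary mapping), and the use of 0 as the padding ID is an assumption:

```python
def rank_tokenize(expression, max_genes=2048):
    """Geneformer-style rank encoding: gene IDs sorted by descending expression.

    expression: dict mapping gene_id -> expression value for one cell.
    Returns (input_ids, attention_mask), zero-padded to max_genes.
    """
    expressed = [(g, v) for g, v in expression.items() if v > 0]
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))   # highest expression first
    input_ids = [g for g, _ in expressed][:max_genes]
    attention_mask = [1] * len(input_ids)
    pad = max_genes - len(input_ids)
    return input_ids + [0] * pad, attention_mask + [0] * pad


# One toy cell: gene 2 is most expressed, gene 5 is silent.
cell = {7: 3.0, 2: 9.0, 5: 0.0, 11: 1.5}
ids, mask = rank_tokenize(cell, max_genes=6)
assert ids == [2, 7, 11, 0, 0, 0]
assert mask == [1, 1, 1, 0, 0, 0]
```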