SLAF (Sparse Lazy Array Format)

SLAF is a high-performance format for single-cell transcriptomics data built on top of the Lance table format and Polars. For users of scanpy or anndata, it should feel like you never left. SLAF provides an advanced dataloader that looks and feels like PyTorch, but runs its own multi-threaded async prefetcher under the hood. Bleeding-edge internals, familiar interfaces.

pip install slafdb[ml]

GitHub | Documentation

Why SLAF?

Single-cell transcriptomics datasets have scaled 2,000-fold in less than a decade. A typical study used to have 50k cells that could be copied to SSD and processed in memory. At the 100M-cell scale, network, storage, and memory become bottlenecks.
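A back-of-envelope calculation shows why. The cell and gene counts below are illustrative assumptions (roughly 20,000 genes per cell, float32 values, ~5% non-zeros), not figures from the SLAF documentation:

```python
# Why a 100M-cell matrix no longer fits in memory (rough estimates).
n_cells = 100_000_000
n_genes = 20_000          # assumption: typical human gene panel
bytes_per_value = 4       # float32

# Dense storage: every (cell, gene) entry materialized.
dense_tb = n_cells * n_genes * bytes_per_value / 1e12
print(f"dense: {dense_tb:.0f} TB")        # ~8 TB

# Sparse (CSR-like) storage: only non-zeros, ~8 bytes each
# (4-byte value + 4-byte column index), assuming ~5% density.
nnz = int(n_cells * n_genes * 0.05)
sparse_tb = nnz * 8 / 1e12
print(f"sparse: {sparse_tb:.1f} TB")      # ~0.8 TB
```

Even the sparse representation is hundreds of gigabytes, far beyond what a single node can comfortably hold in RAM, which is why query-in-place storage matters.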

  1. The traditional analytic workload remains stuck in single-node, in-memory operations. Today we need to do cell/gene filtering, normalization, PCA/UMAP, and differential expression at 2,000x that scale.

  2. New AI-native workloads have arrived, such as streaming tokenized expression batches to train single-cell foundation models.

For these, we need cloud-native, zero-copy, query-in-place storage, without maintaining multiple copies per user, workload, application, or node, while retaining the numpy-like sparse matrix slicing and scanpy pipelines we already use.

Who is SLAF for?

Quick examples

Query with SQL (no full download):

from slaf import SLAFArray

slaf_array = SLAFArray("hf://datasets/slaf-project/Parse-10M")
results = slaf_array.query("""
    SELECT
        cytokine,
        cell_type,
        AVG(gene_count) as avg_gene_count
    FROM cells
    WHERE donor = 'Donor10'
      AND cytokine IN ('C5a', 'CD40L')
    GROUP BY cytokine, cell_type
    ORDER BY cytokine, avg_gene_count DESC
""")
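For intuition, the SQL above filters cells to one donor and two cytokines, then averages gene_count per (cytokine, cell_type) group. The same aggregation in pure Python, over toy rows (the values are made up for illustration, not taken from the Parse-10M dataset):

```python
from collections import defaultdict

# Toy stand-in for the `cells` table (hypothetical values).
cells = [
    {"donor": "Donor10", "cytokine": "C5a",   "cell_type": "CD8 Naive", "gene_count": 1200},
    {"donor": "Donor10", "cytokine": "C5a",   "cell_type": "CD8 Naive", "gene_count": 1400},
    {"donor": "Donor10", "cytokine": "CD40L", "cell_type": "B cell",    "gene_count": 900},
    {"donor": "Donor3",  "cytokine": "C5a",   "cell_type": "CD8 Naive", "gene_count": 100},
]

# WHERE donor = 'Donor10' AND cytokine IN ('C5a', 'CD40L')
# GROUP BY cytokine, cell_type
groups = defaultdict(list)
for row in cells:
    if row["donor"] == "Donor10" and row["cytokine"] in ("C5a", "CD40L"):
        groups[(row["cytokine"], row["cell_type"])].append(row["gene_count"])

# AVG(gene_count) per group
avg_gene_count = {k: sum(v) / len(v) for k, v in groups.items()}
print(avg_gene_count)
# {('C5a', 'CD8 Naive'): 1300.0, ('CD40L', 'B cell'): 900.0}
```

The difference is that SLAF pushes this computation down to the storage layer, so only the small aggregated result crosses the network.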

Lazy Scanpy-style slicing:

from slaf.integrations import read_slaf

adata = read_slaf("hf://datasets/slaf-project/Parse-10M")
subset = adata[
    (
        (adata.obs.cell_type == "CD8 Naive") &
        (adata.obs.cytokine == "C5a") &
        (adata.obs.donor == "Donor10")
    ), :
]
expression = subset[:10, :].X.compute()  # Only now is data loaded
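The key idea is laziness: slicing only records what was asked for, and no data moves until `.compute()`. A minimal pure-Python sketch of that pattern (a toy stand-in to show the mechanism, not SLAF's actual implementation):

```python
class LazyMatrix:
    """Records slice operations; loads data only when compute() is called."""

    def __init__(self, loader, pending=None):
        self._loader = loader          # callable that performs the expensive load
        self._pending = pending or []  # slice operations queued so far

    def __getitem__(self, key):
        # Cheap: remember the slice and return a new lazy view.
        return LazyMatrix(self._loader, self._pending + [key])

    def compute(self):
        # Expensive: load once, then replay the recorded slices.
        data = self._loader()
        for key in self._pending:
            data = data[key]
        return data


loads = []
def fake_loader():
    loads.append(1)            # count how often real I/O happens
    return list(range(100))

X = LazyMatrix(fake_loader)
view = X[10:20][:3]            # no I/O yet
assert loads == []
assert view.compute() == [10, 11, 12]
assert loads == [1]            # data touched exactly once, at compute()
```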

Stream tokenized batches for training:

from slaf import SLAFArray
from slaf.ml.dataloaders import SLAFDataLoader

slaf_array = SLAFArray("hf://datasets/slaf-project/Parse-10M")
dataloader = SLAFDataLoader(
    slaf_array=slaf_array,
    tokenizer_type="geneformer",
    batch_size=32,
    max_genes=2048,
    vocab_size=50000,
    prefetch_batch_size=1_000_000
)
for batch in dataloader:
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    # Your training code here
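For intuition about what the tokenizer produces: a Geneformer-style tokenizer encodes each cell as a sequence of gene IDs ranked by descending expression. A simplified sketch of that rank-value encoding follows; it is illustrative only (the real tokenizer also handles normalization, special tokens, and vocabulary mapping), and the use of 0 as the padding ID is an assumption:

```python
def rank_tokenize(expression, max_genes=2048):
    """Geneformer-style rank encoding: gene IDs sorted by descending expression.

    expression: dict mapping gene_id -> expression value for one cell.
    Returns (input_ids, attention_mask), zero-padded to max_genes.
    """
    expressed = [(g, v) for g, v in expression.items() if v > 0]
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))   # highest expression first
    input_ids = [g for g, _ in expressed][:max_genes]
    attention_mask = [1] * len(input_ids)
    pad = max_genes - len(input_ids)
    return input_ids + [0] * pad, attention_mask + [0] * pad


# One toy cell: gene 2 is most expressed, gene 5 is silent.
cell = {7: 3.0, 2: 9.0, 5: 0.0, 11: 1.5}
ids, mask = rank_tokenize(cell, max_genes=6)
assert ids == [2, 7, 11, 0, 0, 0]
assert mask == [1, 1, 1, 0, 0, 0]
```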