SLAF is a high-performance format for single-cell transcriptomics data, built on top of the Lance table format and Polars. For users of scanpy or anndata, it should feel like you never left. SLAF provides an advanced dataloader that looks and feels like PyTorch but runs its own multi-threaded async prefetcher under the hood. Bleeding-edge internals, familiar interfaces.
pip install slafdb[ml]
Single-cell transcriptomics datasets have scaled 2,000-fold in less than a decade. A typical study used to have 50k cells that could be copied to SSD and processed in memory. At the 100M-cell scale, network, storage, and memory become bottlenecks.
Traditional analytic workloads are stuck in in-memory, single-node operations. Today we need to run cell/gene filtering, normalization, PCA/UMAP, and differential expression at 2,000x that scale.
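For reference, here is that traditional pipeline as it is usually written: a minimal scanpy sketch (real scanpy calls; the file name is a placeholder) that assumes the whole matrix fits in memory, which is exactly the assumption that breaks at 100M cells.

import scanpy as sc

# Load the entire matrix into RAM: fine at 50k cells, impossible at 100M
adata = sc.read_h5ad("study.h5ad")  # placeholder path

# The standard single-node, in-memory pipeline
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
sc.tl.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.rank_genes_groups(adata, groupby="cell_type")  # differential expression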
New AI-native workloads have arrived: training transformer foundation models like Geneformer means tokenizing cells and streaming batches from datasets far too large to hold in memory.
For these, we need cloud-native, zero-copy, query-in-place storage (without maintaining multiple copies per user, workload, application, or node) while retaining the interfaces for numpy-like sparse matrix slicing and the scanpy pipelines we already use.
Query with SQL (no full download):
from slaf import SLAFArray

# Open the remote dataset without downloading it
slaf_array = SLAFArray("hf://datasets/slaf-project/Parse-10M")

# Aggregate cell metadata in place with SQL
results = slaf_array.query("""
    SELECT
        cytokine,
        cell_type,
        AVG(gene_count) AS avg_gene_count
    FROM cells
    WHERE donor = 'Donor10'
      AND cytokine IN ('C5a', 'CD40L')
    GROUP BY cytokine, cell_type
    ORDER BY cytokine, avg_gene_count DESC
""")
Lazy Scanpy-style slicing:
from slaf.integrations import read_slaf

# Returns a lazy AnnData-like object; no expression data is read yet
adata = read_slaf("hf://datasets/slaf-project/Parse-10M")

# Boolean-mask slicing, just like anndata; still lazy
subset = adata[
    (
        (adata.obs.cell_type == "CD8 Naive")
        & (adata.obs.cytokine == "C5a")
        & (adata.obs.donor == "Donor10")
    ),
    :,
]

expression = subset[:10, :].X.compute()  # Only now is data loaded
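What .compute() hands back should behave like a familiar sparse matrix; a minimal check, assuming a scipy.sparse return type (an assumption, since the section above only promises numpy-like sparse slicing):

import scipy.sparse as sp

# Assumption: .compute() materializes a scipy.sparse matrix
print(sp.issparse(expression), expression.shape)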
Stream tokenized batches for training:
from slaf import SLAFArray
from slaf.ml.dataloaders import SLAFDataLoader

slaf_array = SLAFArray("hf://datasets/slaf-project/Parse-10M")

# PyTorch-style dataloader backed by an async prefetcher
dataloader = SLAFDataLoader(
    slaf_array=slaf_array,
    tokenizer_type="geneformer",
    batch_size=32,
    max_genes=2048,
    vocab_size=50000,
    prefetch_batch_size=1_000_000,
)

for batch in dataloader:
    input_ids = batch["input_ids"]
    attention_mask = batch["attention_mask"]
    # Your training code here
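The prefetcher mentioned at the top is what keeps the GPU fed while batches stream over the network. SLAF's actual implementation is not shown here; as a rough mental model only, here is a minimal producer/consumer sketch of the general pattern, where a pool of background threads fills a bounded queue while the training loop iterates (fetch_batch is a hypothetical callable):

import queue
import threading

def prefetching_iter(fetch_batch, num_batches, num_workers=4, depth=8):
    """Generic multi-threaded prefetcher: NOT SLAF's code, just the pattern.

    `fetch_batch(i)` is a hypothetical callable that loads batch i.
    """
    q = queue.Queue(maxsize=depth)  # bounded: backpressure if training lags
    indices = iter(range(num_batches))
    lock = threading.Lock()

    def worker():
        while True:
            with lock:
                i = next(indices, None)  # claim the next batch index
            if i is None:
                q.put(None)  # signal this worker is done
                return
            q.put(fetch_batch(i))  # I/O happens off the training thread

    # daemon threads so an abandoned iterator doesn't hang the process
    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()

    done = 0
    while done < num_workers:
        item = q.get()
        if item is None:
            done += 1
        else:
            yield item

Batches can complete out of order in this sketch; a real implementation presumably adds ordering guarantees and error propagation. The bounded queue is the essential piece: it overlaps network and storage I/O with compute instead of serializing them.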