How BioVault Works
BioVault is an open-source platform for privacy-first biomedical collaboration through a peer-to-peer data visitation network. Instead of moving sensitive data to centralized systems, BioVault sends analysis code to the data. Raw data never leaves institutional boundaries, and only approved results are returned.
Available as both a desktop application and a command-line interface, BioVault works out of the box for participants across diverse resource settings. It supports clinical and research workflows across data modalities, with built-in permissioning, audit trails, and local governance controls.
For the full technical details, see our preprint on bioRxiv.
The Challenge
Biomedical datasets representing diverse populations are essential for advancing precision medicine, yet remain siloed due to regulatory, sovereignty, and privacy constraints. Existing data-sharing solutions each carry significant limitations:
Traditional data sharing
Raw data is copied across institutional and jurisdictional boundaries. Once transferred, data owners lose practical control over how data are accessed, reused, and governed.
Trusted Research Environments
Data is uploaded to centralized hosted platforms controlled by well-resourced institutions or commercial vendors. Data owners have limited oversight, and the cost of these platforms creates barriers for under-resourced settings.
Secure computation frameworks
Secure multiparty computation (MPC) and federated learning approaches preserve local control, but they demand deep expertise in cryptography and distributed algorithms, plus substantial engineering effort, which limits accessibility.
These barriers disproportionately impact under-resourced institutions in low- and middle-income countries (LMICs) and Indigenous communities, limiting equitable participation in global collaborations.
Data Visitation
BioVault's core principle is data visitation: analysis code travels to data rather than data being transferred to centralized systems. The data owner retains full governance and control throughout the process.
Data scientist: writes & submits code → code travels to the datasite → only insights return. Datasite: retains full governance.
How it works step by step
- Discover: Data owners publish synthetic mock datasets that mirror the structure of their private data, enabling researchers to develop and validate pipelines without accessing real data.
- Develop: Researchers write analysis code (Jupyter notebooks or Nextflow workflows) and validate against mock data locally.
- Submit: Validated pipelines are submitted as execution requests over the BioVault network.
- Review: Data owners review requests with AI-assisted code summarization, inspect the code, or run it against mock data first.
- Execute: Approved code runs locally within the data owner's environment on the private dataset.
- Return: Only explicitly authorized results are returned to the researcher. All operations are logged with transparent audit trails.
Platform
BioVault is built on SyftBox, an open-source protocol for secure, decentralized coordination of remote computation across independently governed environments.
Open-source peer-to-peer network
No centralized infrastructure required. BioVault operates across a decentralized network of independently governed datasites spanning institutions and jurisdictions.
Desktop application & CLI
Compatible with Linux, macOS, and Windows. Supports Nextflow workflows and Jupyter Notebooks. Both interfaces expose agent-accessible endpoints for AI-driven research workflows.
End-to-end encrypted
All messages and results are end-to-end encrypted before transmission. Authenticated communication via the SyftBox protocol.
Data stays local
BioVault treats linked data as generic files or databases under local control, independent of data modality or analytical task. No uploads, no copies, no movement.
Governance & audit trails
Every execution request is logged, auditable, and subject to human approval before results are released. Data owners retain granular control over permitted analyses and shared outputs.
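One common way to make such a log tamper-evident is to hash-chain its entries. The sketch below shows the general technique in plain Python under assumed field names (`event`, `prev`, `hash`); it is not a description of BioVault's internal log format.

```python
import hashlib
import json


def append_entry(log: list, event: dict) -> None:
    """Append an event, chaining it to the hash of the previous entry
    so any later modification breaks every subsequent hash."""
    prev = log[-1]["hash"] if log else "0" * 64
    payload = json.dumps(event, sort_keys=True)
    digest = hashlib.sha256((prev + payload).encode()).hexdigest()
    log.append({"event": event, "prev": prev, "hash": digest})


def verify(log: list) -> bool:
    """Recompute the chain; return False if any entry was altered."""
    prev = "0" * 64
    for entry in log:
        payload = json.dumps(entry["event"], sort_keys=True)
        if entry["prev"] != prev:
            return False
        if hashlib.sha256((prev + payload).encode()).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


log: list = []
append_entry(log, {"action": "request_submitted", "pipeline": "qc.nf"})
append_entry(log, {"action": "approved", "by": "data_owner"})
append_entry(log, {"action": "results_released"})
assert verify(log)
```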
Mock data paradigm
Data owners publish synthetic datasets that mirror the structure of private data. Researchers develop and validate pipelines prior to execution, without exposure to real data.
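The idea can be sketched as schema-driven mock generation: only column names and types are shared, and synthetic rows are filled in with random values. The `SCHEMA` and generator below are hypothetical, assumed for illustration only.

```python
import random

# Hypothetical schema of a private clinical table; only the column
# names and types are published, never the real values.
SCHEMA = {"patient_id": "str", "age": "int", "variant_count": "int"}


def make_mock_rows(schema: dict, n: int, seed: int = 0) -> list:
    """Generate synthetic rows mirroring the schema's structure so a
    researcher can develop a pipeline without seeing real data."""
    rng = random.Random(seed)  # fixed seed: reproducible mock data
    rows = []
    for i in range(n):
        row = {}
        for col, kind in schema.items():
            if kind == "str":
                row[col] = f"MOCK-{i:04d}"   # obviously synthetic IDs
            else:
                row[col] = rng.randint(0, 100)
        rows.append(row)
    return rows


mock = make_mock_rows(SCHEMA, n=5)
assert set(mock[0]) == set(SCHEMA)  # same columns as the private table
```

A pipeline validated against `mock` will see the same column names and types at execution time, so structural bugs surface before the request ever reaches real data.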
Supported Workflows
BioVault supports diverse biomedical analyses and machine learning workflows. Across all workflows, data scientists submit requests that are executed remotely within data owners' environments. Raw data is never transferred.
Single-cell RNA-seq
Remote preprocessing and exploratory analysis of patient-derived scRNA-seq data. Supports iterative, interactive workflows including standard preprocessing, quality controls, and downstream outputs like UMAP embeddings, without exposing raw gene expression profiles.
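To illustrate the "only summaries leave" principle with a toy example: a minimal cell-level QC filter in plain Python (a real pipeline would use a single-cell toolkit; the threshold and matrix here are invented).

```python
# Toy count matrix: rows are cells, columns are genes (UMI counts).
counts = [
    [3, 0, 5, 2],   # cell A
    [0, 0, 1, 0],   # cell B: too few genes detected
    [4, 2, 0, 7],   # cell C
]

MIN_GENES = 2  # hypothetical QC threshold


def passes_qc(cell: list) -> bool:
    """Keep cells with at least MIN_GENES genes detected (count > 0)."""
    return sum(1 for c in cell if c > 0) >= MIN_GENES


filtered = [cell for cell in counts if passes_qc(cell)]

# Only summaries like this leave the datasite, never raw expression values.
summary = {"cells_in": len(counts), "cells_kept": len(filtered)}
```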
Machine learning model training
Remote training of diagnostic and predictive models on private datasets. Model architectures are developed on synthetic data, then executed within the data-owning environment. Only trained parameters and performance metrics are returned.
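The boundary between private inputs and released outputs can be made explicit in code. The toy closed-form least-squares fit below stands in for real model training; the function name and return shape are assumptions, but the pattern (data stays inside, only parameters and metrics cross out) is the one described above.

```python
def train_on_private_data(xs: list, ys: list) -> dict:
    """Fit y = w*x + b by closed-form least squares inside the datasite.
    Only fitted parameters and a metric are returned, never (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - w * mx
    mse = sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n
    return {"weights": {"w": w, "b": b}, "metrics": {"mse": mse}}


# Inside the datasite: private data never crosses this boundary.
private_x = [0.0, 1.0, 2.0, 3.0]
private_y = [1.0, 3.0, 5.0, 7.0]   # exactly y = 2x + 1
released = train_on_private_data(private_x, private_y)
```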
Clinical medical imaging
Model-to-data inference on large-scale MRI and other imaging datasets. Pretrained models are shared with the data-owning environment for local inference, consistent with clinical data governance practices.
Rare disease genomics
Privacy-preserving analysis of whole-genome sequencing data. Only narrowly scoped, interpretation-ready results (e.g., region-restricted pileups) are returned. The full genome is never exposed.
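The region-restriction idea can be sketched with a toy filter over variant calls (the coordinates, tuple layout, and `APPROVED_REGION` below are invented for illustration; real pileups operate on aligned reads):

```python
# Hypothetical variant calls held privately: (chromosome, position, genotype).
private_variants = [
    ("chr7", 117559590, "0/1"),
    ("chr7", 118000000, "1/1"),
    ("chr11", 5227002, "0/1"),
]

APPROVED_REGION = ("chr7", 117559000, 117560000)  # e.g. a single gene locus


def region_restricted(variants: list, region: tuple) -> list:
    """Release only calls inside the approved region; the rest of the
    genome never leaves the datasite."""
    chrom, start, end = region
    return [v for v in variants if v[0] == chrom and start <= v[1] <= end]


released = region_restricted(private_variants, APPROVED_REGION)
```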
Secure Computation
Federated computation across multiple datasites requires exchanging intermediate results. Sharing this data in plaintext raises privacy concerns, as even aggregate statistics can enable reconstruction or membership inference attacks.
BioVault integrates Syqure, a custom Rust-based wrapper around Sequre/Shechi, a high-performance framework that translates Python-syntax pipelines into optimized secure multiparty computation (MPC) and homomorphic encryption (HE) protocols. Syqure provides a zero-configuration WebRTC proxy that establishes peer-to-peer connections between datasites behind firewalls, removing the requirement for publicly addressable endpoints.
How it works
- Each datasite locally computes values on private data (e.g., per-site allele counts)
- Values are split into encrypted shares
- Shares are securely aggregated across datasites via MPC
- Only final pooled results are revealed. No party accesses another site's private data
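The sharing steps above can be illustrated with a minimal additive secret-sharing sketch in plain Python. This shows the general MPC technique only; Syqure's actual protocols (via Sequre/Shechi) are far more sophisticated, and the prime, counts, and function names here are assumptions.

```python
import random

PRIME = 2**61 - 1  # all share arithmetic is done modulo a large prime


def share(value: int, n_parties: int, rng: random.Random) -> list:
    """Split a private value into additive shares summing to it mod PRIME.
    Any subset of fewer than n_parties shares reveals nothing about it."""
    shares = [rng.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


rng = random.Random(42)
site_counts = [12, 30, 7]          # per-site allele counts, kept private
n = len(site_counts)

# Each site splits its count and sends one share to every other site.
all_shares = [share(c, n, rng) for c in site_counts]

# Each site sums the shares it received (one column); only these partial
# sums are revealed and combined into the pooled result.
partials = [sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)]
pooled = sum(partials) % PRIME

assert pooled == sum(site_counts)  # matches the plaintext total exactly
```

Exact concordance with plaintext computation is expected here by construction: additive sharing over a prime field introduces no approximation, which is consistent with the Pearson r = 1.00 result reported below.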
Proof of concept: We implemented a secure federated protocol for calculating joint allele frequencies across Caribbean populations without sharing site-specific allele counts in plaintext. Results were exactly concordant with plaintext computation (Pearson r = 1.00). See our paper for full benchmarks.
Get Started
Whether you hold biomedical data or need access to it, we're looking for partners to shape what we build. BioVault is free and open-source.