Spotlight Dataset

OMol25: A Large-Scale Electronic Structure Dataset for Accelerating Molecular Machine Learning

An open dataset of 500 TB comprising electronic densities, wavefunctions, and molecular orbital information from over 4 million high-accuracy density functional theory calculations

Daniel S. Levine [1,*], Muhammed Shuaibi [1,*], Evan Walter Clark Spotte-Smith [2], Michael G. Taylor [3], Muhammad R. Hasyim [4], Kyle Michel [1], Ilyes Batatia [5], Gábor Csányi [5], Misko Dzamba [1], Peter Eastman [6], Nathan C. Frey [7], Xiang Fu [1], Vahe Gharakhanyan [1], Aditi S. Krishnapriyan [8,9], Joshua A. Rackers [7], Sanjeev Raja [8], Ammar Rizvi [1], Andrew S. Rosen [10], Zachary Ulissi [1], Santiago Vargas [9], C. Lawrence Zitnick [1,*], Samuel M. Blau [9,*], Brandon M. Wood [1,*], Rachana Ananthakrishnan [12,13], Rick Stevens [11], Mike Papka [11], Kyle Chard [11,12], Ian Foster [11,12,13], Ben Blaiszik [11,12]
[1] FAIR at Meta, [2] Carnegie Mellon University, [3] Los Alamos National Laboratory, [4] New York University, [5] University of Cambridge, [6] Stanford University, [7] Prescient Design, Genentech, [8] University of California-Berkeley, [9] Lawrence Berkeley National Laboratory, [10] Princeton University, [11] Argonne National Laboratory, [12] University of Chicago, [13] Globus
October 7, 2025 | CC-BY-4.0
Dataset Volume: ~500 TB
DFT Calculations: >4M
Molecular Systems: Small to Large Complexes

Abstract

The development of accurate machine learning models for molecular property prediction and materials design requires extensive high-quality training data. While databases of molecular structures and basic properties exist, comprehensive electronic structure data—including electronic densities, wavefunctions, and molecular orbitals—remain scarce at scale.

We present the OMol25 Electronic Structures dataset, an unprecedented open dataset of quantum chemical calculations designed to enable the development of next-generation physics-informed machine learning models for molecular chemistry and materials science.

The dataset comprises raw density functional theory (DFT) outputs, electronic densities, wavefunctions, and molecular orbital information from over 4 million high-accuracy quantum chemical calculations performed on diverse molecular systems ranging from small organic molecules to large biomolecular complexes.

This dataset will enable researchers to develop improved partial charges, partial spins, and advanced electronic features for machine learning models, potentially accelerating discoveries in drug design, catalyst development, and energy materials.

OMol25 Electronic Structure Dataset key use cases and partners

The Vision

We envision a future where researchers can rapidly design molecules and peptides to treat diseases, discover catalysts to revolutionize synthesis and manufacturing, identify the next electrolyte to store and transport energy to protect the grid, and more. But these breakthrough discoveries require data.

Data to train next-generation AI models and interatomic potentials. Data to push the boundaries of what's computationally possible in molecular chemistry and lead the world in AI for science. Data that captures the full complexity of chemical systems, from small organic molecules to massive biomolecular complexes.

About the Dataset

With our partners at Meta and the Argonne Leadership Computing Facility (ALCF), we announce the OMol25 Electronic Structures dataset, which includes ~500 TB of open molecular data. These data include the raw DFT outputs, electronic densities, wavefunctions, and molecular orbital information for over 4M high-accuracy quantum chemical calculations. We see this as a transformative opportunity to develop higher-quality partial charges, partial spins, and advanced electronic features to unlock the next generation of physics-informed ML models.

The Materials Data Facility is proud to make these data available via the Eagle cluster at ALCF through a high-performance Globus endpoint. Given the dataset's unprecedented scale, we are first releasing all output data for a random 4M-calculation split of OMol25; the full multi-petabyte dataset will follow, depending on community engagement.

What You Can Build

Explore the possibilities enabled by this comprehensive dataset

Train Advanced ML Models

Develop next-generation interatomic potentials and physics-informed models with unprecedented electronic structure data.

Molecular Property Prediction

Build databases of calculated properties, partial charges, and descriptors for molecular screening.

Catalyst Discovery

Accelerate the discovery of novel catalysts for synthesis, manufacturing, and energy applications.

Drug Design

Leverage quantum chemical calculations to design molecules and peptides for treating diseases.

Join the Community

For this first release, the data are quite raw, exactly as created by the Meta team. There's a significant opportunity for the community to build tools that simplify access to these data, allow data query and browsing, create databases of calculated properties and descriptors, and much more. We intend to work on these topics with all of you.

We can't wait to see what you can do with these data!

Access the Data

Download the dataset and explore additional resources

Download Dataset

Direct access to dataset files and transfer instructions

Access to this dataset requires a free Globus account and membership in a permission group (link below). Due to the dataset's size (~500 TB), we recommend using Globus Connect Personal or Globus Connect Server for high-speed transfer.
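
To get a feel for what a dataset of this size means in practice, here is a rough planning sketch in Python. It uses only the two headline figures quoted on this page (~500 TB and over 4 million calculations); the decimal definition of a terabyte, the example link speeds, and the function names are illustrative assumptions, and real transfers rarely sustain the full line rate.

```python
# Back-of-envelope planning numbers; everything here is an approximation.
TB = 10**12  # assumption: decimal terabyte, in bytes

DATASET_BYTES = 500 * TB       # "~500 TB" quoted on this page
NUM_CALCULATIONS = 4_000_000   # "over 4 million" quoted on this page (a lower bound)


def average_bytes_per_calculation(total_bytes=DATASET_BYTES, n=NUM_CALCULATIONS):
    """Mean storage per calculation, assuming data is spread evenly (it may not be)."""
    return total_bytes / n


def transfer_days(num_bytes, gigabits_per_second):
    """Days needed to move num_bytes at a sustained rate given in Gbit/s."""
    bytes_per_second = gigabits_per_second * 1e9 / 8
    return num_bytes / bytes_per_second / 86_400  # 86,400 seconds per day


if __name__ == "__main__":
    avg_mb = average_bytes_per_calculation() / 1e6
    print(f"average per calculation: roughly {avg_mb:.0f} MB")
    # Hypothetical sustained link speeds, chosen only for illustration.
    for rate in (1, 10, 100):
        days = transfer_days(DATASET_BYTES, rate)
        print(f"{rate:>3} Gbit/s sustained: roughly {days:.1f} days")
```

Under these assumptions the average comes to about 125 MB per calculation, and a full copy takes on the order of weeks at 1 Gbit/s, which is why selecting a subset before transferring may be worth considering.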

New to Globus?

Create a free account to access high-performance data transfer for large datasets

Partners & Contributors

Made possible through collaboration

Meta AI

Data Provider & Research Partner

Argonne Leadership Computing Facility (ALCF)

Infrastructure & Storage

Globus

Data Transfer Platform

University of Chicago

Research Collaboration

NIST

Funding & Support