# P2DFlow

P2DFlow is a protein ensemble generative model that uses SE(3) flow matching and builds on ESMFold. The ensembles it generates can aid in understanding protein functions across various scenarios.

Technical details and evaluation results are provided in our paper:
* [P2DFlow: A Protein Ensemble Generative Model with SE(3) Flow Matching](https://pubs.acs.org/doi/abs/10.1021/acs.jctc.4c01620) (JCTC)
* [P2DFlow: A Protein Ensemble Generative Model with SE(3) Flow Matching](https://arxiv.org/abs/2411.17196) (arXiv)


## Table of Contents
1. [Installation](#installation)
2. [Prepare Dataset](#prepare-dataset)
3. [Model weights](#model-weights)
4. [Training](#training)
5. [Inference](#inference)
6. [Evaluation](#evaluation)
7. [License](#license)
8. [Citation](#citation)


## Installation
In an environment with CUDA 11.7, run:
```
conda env create -f environment.yml
```
To activate the environment, run:
```
conda activate P2DFlow
```

## Prepare Dataset
#### (Tip: to use the data we have preprocessed, go directly to `3. Process selected dataset`. To process the data from scratch or to work with your own data, start from step 1.)

#### 1. Download raw ATLAS dataset
(i) Download the `Analysis & MDs` dataset from [ATLAS](https://www.dsimb.inserm.fr/ATLAS/), either manually or by running `./dataset/download.py`:
```
python ./dataset/download.py
```
We will use `.pdb` and `.xtc` files for the following calculation.

#### 2. Calculate the 'approximate energy' and select representative structures
(i) Use `gaussian_kde` to calculate the 'approximate energy'. All files from the previous step need to be placed in `./dataset`, following the layout of `ATLAS_init_example` in [Google Drive](https://drive.google.com/drive/folders/11mdVfMi2rpVn7nNG2mQAGA5sNXCKePZj?usp=sharing):
```
python ./dataset/traj_analyse_select.py
```
This produces the selected representative structures in the `select` directory and `traj_info_select.csv`, which holds the 'approximate energy'.
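
The idea behind this step can be sketched in a few lines. The example below is a standard-library stand-in for `scipy.stats.gaussian_kde`, not the code in `traj_analyse_select.py`: it estimates a 1D density over a made-up per-frame feature and treats `-ln(density)` as the 'approximate energy'. The feature values, the bandwidth rule (Scott's rule), the `-ln` convention, and both function names are assumptions made for illustration.

```python
import math
import statistics


def gaussian_kde_1d(samples):
    """Return a density function built from a 1D Gaussian kernel density estimate.

    Bandwidth follows Scott's rule for one dimension: h = std * n ** (-1/5).
    """
    n = len(samples)
    bandwidth = statistics.stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density


def approximate_energy(samples):
    """Assumed convention: energy = -ln(density), shifted so the minimum is 0."""
    density = gaussian_kde_1d(samples)
    energies = [-math.log(density(s)) for s in samples]
    lowest = min(energies)
    return [e - lowest for e in energies]


# Hypothetical per-frame feature, e.g. a projection of each MD frame onto one axis.
feature = [0.0, 0.1, -0.1, 0.05, -0.05, 0.2, 3.0]
energies = approximate_energy(feature)

# Frames in densely sampled regions get low energy; the isolated frame gets the highest.
most_populated = min(range(len(feature)), key=energies.__getitem__)
rarest = max(range(len(feature)), key=energies.__getitem__)
print(feature[most_populated], feature[rarest])
```

Under this convention, representative structures could be picked by sorting frames by energy, although the actual selection rule is whatever `traj_analyse_select.py` implements.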


#### 3. Process selected dataset

(i) Download `selected_dataset_v1.tar` or `selected_dataset_v2.tar` from [Google Drive](https://drive.google.com/drive/folders/11mdVfMi2rpVn7nNG2mQAGA5sNXCKePZj?usp=sharing), or generate the dataset with the two steps above ('v1' selects ~10 structures from MD, 'v2' selects ~100 structures from MD). Then decompress the archive:
```
tar -xvf selected_dataset_v1.tar
```
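
The same step can also be done from Python with the standard `tarfile` module, whose `r:*` mode detects the compression format automatically. This is a minimal sketch, assuming the archive sits in the current directory; the helper name `extract_archive` is hypothetical and not part of the repository.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Extract a tar archive, letting tarfile detect gzip/bz2/xz or no compression."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:*" opens the archive with transparent compression detection.
    with tarfile.open(archive_path, "r:*") as archive:
        if hasattr(tarfile, "data_filter"):
            # Safer extraction filter, available on recent Python versions.
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
        return archive.getnames()


if __name__ == "__main__":
    archive_name = "selected_dataset_v1.tar"  # filename as given in the step above
    if Path(archive_name).exists():
        members = extract_archive(archive_name)
        print(f"extracted {len(members)} entries")
    else:
        print(f"{archive_name} not found; download it first")
```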

(ii) Preprocess `.pdb` files to get `.pkl` files, compute node representation and pair representation using ESM-2, predict static structure using ESMFold, and get merged `.csv` file:
```
python ./data/process_pdb_files.py --pdb_dir ${pdb_dir} --write_dir ${write_dir} --traj_info_file ${traj_info_file} --valid_seq_file ${valid_seq_file} --merged_output_file ${merged_output_file}
```
This produces `.pkl` files (large file size) and `metadata_merged.csv`. If you are using your own data, first split the dataset to obtain a validation set and pass it as `${valid_seq_file}`; see `./inference/valid_seq.csv` for an example.
The processed data will be similar to `ATLAS_processed_example.tar.gz` in [Google Drive](https://drive.google.com/drive/folders/11mdVfMi2rpVn7nNG2mQAGA5sNXCKePZj?usp=sharing).
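
Because the preprocessing command takes five path arguments, it can help to assemble it programmatically. The sketch below only builds and prints the command using the flags shown above. The helper name `build_process_command` is hypothetical, and every path in `example_paths` is a placeholder to be replaced with your own locations.

```python
import shlex
import sys


def build_process_command(paths):
    """Build the argument list for data/process_pdb_files.py from a dict of paths."""
    required = [
        "pdb_dir",
        "write_dir",
        "traj_info_file",
        "valid_seq_file",
        "merged_output_file",
    ]
    missing = [name for name in required if not paths.get(name)]
    if missing:
        raise ValueError(f"missing path settings: {', '.join(missing)}")
    command = [sys.executable, "./data/process_pdb_files.py"]
    for name in required:
        command += [f"--{name}", str(paths[name])]
    return command


# All paths below are placeholders; replace them with your own locations.
example_paths = {
    "pdb_dir": "./dataset/select",
    "write_dir": "./dataset/processed",
    "traj_info_file": "./dataset/traj_info_select.csv",
    "valid_seq_file": "./inference/valid_seq.csv",
    "merged_output_file": "./dataset/processed/metadata_merged.csv",
}
command = build_process_command(example_paths)
print(shlex.join(command))
# To actually run it from the repository root: subprocess.run(command, check=True)
```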


## Model weights
Download the pretrained checkpoint `pretrained.ckpt` from [Google Drive](https://drive.google.com/drive/folders/11mdVfMi2rpVn7nNG2mQAGA5sNXCKePZj?usp=sharing) and put it into the `./weights` folder. You can use the pretrained weights for inference.


## Training
To train P2DFlow, first make sure you have prepared the dataset as described in `Prepare Dataset` and placed it in the right folder, then modify `./configs/base.yaml` (especially `csv_path`). After this, run:
```
python experiments/train_se3_flows.py
```
The checkpoints will be saved in `./ckpt`.


## Inference
To run inference on specified protein sequences, first modify `./configs/inference.yaml` (especially `ckpt_path` and `validset_path`), then run:
```
python experiments/inference_se3_flows.py
```
The results will be written to `./inference_outputs/weights/`.
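
A quick pre-flight check can catch a mistyped path before the model is loaded. The sketch below assumes that `ckpt_path` and `validset_path` point to files on disk. The values shown are placeholders rather than the repository defaults, and the helper name `missing_paths` is hypothetical.

```python
from pathlib import Path


def missing_paths(settings):
    """Return the names of settings whose paths do not exist on disk."""
    return [name for name, path in settings.items() if not Path(path).exists()]


# Placeholder values: copy the ones you set in ./configs/inference.yaml.
settings = {
    "ckpt_path": "./weights/pretrained.ckpt",
    "validset_path": "./inference/valid_seq.csv",
}
problems = missing_paths(settings)
if problems:
    print("fix these entries before running inference:", ", ".join(problems))
else:
    print("all configured paths exist")
```

The same check could be applied to `csv_path` in `./configs/base.yaml` before training.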


## Evaluation
To evaluate metrics related to validity, fidelity and dynamics, run:
```
python ./analysis/eval_result.py --pred_org_dir ${pred_org_dir} --valid_csv_file ${valid_csv_file} --pred_merge_dir ${pred_merge_dir} --target_dir ${target_dir} --crystal_dir ${crystal_dir}
```
To evaluate PCA, run:
```
python ./analysis/pca_analyse.py --pred_pdb_dir ${pred_pdb_dir} --target_dir ${target_dir} --crystal_dir ${crystal_dir}
```
The evaluation results will be similar to `evaluation_example` in [Google Drive](https://drive.google.com/drive/folders/11mdVfMi2rpVn7nNG2mQAGA5sNXCKePZj?usp=sharing).

## License
This project is licensed under the terms of the GPL-3.0 license.


## Citation
```
@article{jin2025p2dflow,
  title={P2DFlow: A Protein Ensemble Generative Model with SE(3) Flow Matching},
  author={Yaowei Jin and Qi Huang and Ziyang Song and Mingyue Zheng and Dan Teng and Qian Shi},
  journal={Journal of Chemical Theory and Computation},
  year={2025}
}
```