Hello,
I want to create a Hugging Face dataset to host on the Hub, but I have a somewhat complex scenario and am looking for advice on what's the best approach to do this.
I have two different auto-regressive training tasks, let's call them "A" and "B". For each of these tasks I have 3 different datasets, and each of those datasets has 3 splits (train, val, test). Each dataset is described by two files: a FASTA file (for which I have a custom reading function), and a splits.json that assigns rows of the FASTA file to a split. Visually, all the data looks like this:
A
├── dataset 1
│   ├── data.fasta
│   └── splits.json
├── dataset 2
│   ├── data.fasta
│   └── splits.json
B
├── dataset 1
│   ├── data.fasta
│   └── splits.json
├── dataset 2
│   ├── data.fasta
│   └── splits.json
Currently, I'm reading the data with a simple PyTorch Dataset class. Is the best approach to 1) upload this directory structure to the Hub, 2) create a dataset loading script whose _generate_examples uses the PyTorch Dataset class, and 3) leverage the loading script's configurations and splits? I could set the configuration to A or B, and the splits to train, val, test, but I'm not sure how to select the dataset within a task (i.e., dataset 1 vs. dataset 2).
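One option I've been considering is encoding the (task, dataset) pair in the config name (e.g. "A-dataset1"), so a single config axis covers both dimensions, and letting splits.json drive which rows go to which split inside _generate_examples. Here is a minimal, self-contained sketch of that core logic; the file names and the assumption that splits.json maps split names to lists of row indices are hypothetical stand-ins for my actual setup, and read_fasta stands in for my custom reading function:

```python
import json
from pathlib import Path

# Hypothetical naming scheme: one builder config per (task, dataset) pair,
# so load_dataset(repo_id, "A-dataset1", split="train") would select both.
TASKS = ["A", "B"]
DATASETS = ["dataset1", "dataset2"]
CONFIG_NAMES = [f"{t}-{d}" for t in TASKS for d in DATASETS]

def read_fasta(path):
    """Minimal FASTA reader (stand-in for the custom reading function)."""
    records, header, seq = [], None, []
    for line in Path(path).read_text().splitlines():
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(seq)))
            header, seq = line[1:].strip(), []
        elif line.strip():
            seq.append(line.strip())
    if header is not None:
        records.append((header, "".join(seq)))
    return records

def generate_examples(fasta_path, splits_path, split):
    """Core of a loading script's _generate_examples: yield only the
    FASTA rows that splits.json assigns to the requested split.
    Assumes splits.json looks like {"train": [0, 2], "val": [1], ...}."""
    with open(splits_path) as f:
        split_indices = set(json.load(f)[split])
    for idx, (header, sequence) in enumerate(read_fasta(fasta_path)):
        if idx in split_indices:
            yield idx, {"id": header, "sequence": sequence}
```

In a full loading script this would sit inside a datasets.GeneratorBasedBuilder, with one BuilderConfig per name in CONFIG_NAMES and a SplitGenerator per split passing the right (fasta_path, splits_path, split) to _generate_examples.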
Any advice would be greatly appreciated.
Thanks!!