To host and share your dataset, create a dataset repository on the Hugging Face Hub and upload your data files.
This guide will show you how to structure your dataset repository when you upload it.
A dataset with a supported structure and file format (
.zip) are loaded automatically with load_dataset(), and it’ll have a preview on its dataset page on the Hub.
For more flexibility over how to load and generate a dataset, you can also write a dataset loading script.
The simplest dataset structure has two files:
Your repository will also contain a
README.md file, the dataset card displayed on your dataset page.
my_dataset_repository/ ├── README.md ├── train.csv └── test.csv
🤗 Datasets searches for certain patterns in the dataset repository to automatically infer the dataset splits. There is an order to the patterns, beginning with the custom filename split format to treating all files as a single split if no pattern is found.
If your dataset splits have custom names that aren’t
validation, then you should name your data files like
Here is an example with three splits,
my_dataset_repository/ ├── README.md └── data/ ├── train-00000-of-00003.csv ├── train-00001-of-00003.csv ├── train-00002-of-00003.csv ├── test-00000-of-00001.csv ├── random-00000-of-00003.csv ├── random-00001-of-00003.csv └── random-00002-of-00003.csv
Your data files may also be placed into different directories named
validation where each directory contains the data files for that split:
my_dataset_repository/ ├── README.md └── data/ ├── train/ │ └── bees.csv ├── test/ │ └── more_bees.csv └── validation/ └── even_more_bees.csv
If you don’t have any non-traditional splits, then you can place the split name anywhere in the data file and it is automatically inferred. The only rule is that the split name must be delimited by non-word characters, like
test-file.csv for example instead of
testfile.csv. Supported delimiters include underscores, dashes, spaces, dots, and numbers.
For example, the following file names are all acceptable:
- train split:
- validation split:
- test split:
Here is an example where all the files are placed into a directory named
my_dataset_repository/ ├── README.md └── data/ ├── train.csv ├── test.csv └── validation.csv
When 🤗 Datasets can’t find any of the above patterns, then it’ll treat all the files as a single train split. If your dataset splits aren’t loading as expected, it may be due to an incorrect pattern.
There are several ways to name splits. Validation splits are sometimes called “dev”, and test splits may be referred to as “eval”. These other split names are also supported, and the following keywords are equivalent:
- train, training
- validation, valid, val, dev
- test, testing, eval, evaluation
The structure below is a valid repository:
my_dataset_repository/ ├── README.md └── data/ ├── training.csv ├── eval.csv └── valid.csv
If one of your splits comprises several files, 🤗 Datasets can still infer whether it is the train, validation, and test split from the file name. For example, if your train and test splits span several files:
my_dataset_repository/ ├── README.md ├── train_0.csv ├── train_1.csv ├── train_2.csv ├── train_3.csv ├── test_0.csv └── test_1.csv
Make sure all the files of your
train set have train in their names (same for test and validation).
Even if you add a prefix or suffix to
train in the file name (like
my_train_file_00001.csv for example),
🤗 Datasets can still infer the appropriate split.
For convenience, you can also place your data files into different directories. In this case, the split name is inferred from the directory name.
my_dataset_repository/ ├── README.md └── data/ ├── train/ │ ├── shard_0.csv │ ├── shard_1.csv │ ├── shard_2.csv │ └── shard_3.csv └── test/ ├── shard_0.csv └── shard_1.csv
Eventually, you’ll also be able to structure your repository to specify different dataset configurations. Stay tuned on this issue for the latest updates!