Short for ‘Data to Feed OLMo’s Appetite’, the Dolma dataset contains 3 trillion tokens derived from web content, academic publications, code, books, and encyclopedic materials.
Heaptalk, Jakarta — The Allen Institute for AI (AI2), a Seattle-based non-profit research institute, has introduced Dolma, a massive open dataset for training language models. The dataset is part of OLMo, the institute’s open language model project, which began in March 2023.
Short for ‘Data to Feed OLMo’s Appetite’, Dolma contains 3 trillion tokens derived from web content, academic publications, code, books, and encyclopedic materials. The dataset is openly available to the AI research community on Hugging Face. AI2 claims that Dolma is the largest open dataset to date.
“Dolma differentiates itself from other datasets on two key aspects. First, it is significantly larger than other open datasets. Second, it is released under AI2’s ImpACT license, which was designed to balance ease of access with mitigation of potential risk in distributing large datasets,” AI2 explained on its official blog.
Dolma is built by converting raw data from multiple sources into clean, plain-text documents. The process involves two categories of data processing steps: source-specific and source-agnostic. These steps are meant to ensure that the collected data follows a consistent structure and meets ethical and privacy standards.
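The split between source-specific and source-agnostic steps can be pictured with a small Python sketch. This is an illustration only, not AI2's pipeline: the `Document` class, the source labels "web" and "books", and every cleaning rule below are assumptions made up for the example.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Document:
    source: str  # illustrative label such as "web" or "books"
    text: str


# --- Source-specific steps (hypothetical rules, one per source) ---
def strip_html_tags(text: str) -> str:
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def join_hyphenated_breaks(text: str) -> str:
    """Rejoin words split by a hyphen at a line break, as in scanned books."""
    return re.sub(r"-\n(\w)", r"\1", text)


SOURCE_SPECIFIC: Dict[str, Callable[[str], str]] = {
    "web": strip_html_tags,
    "books": join_hyphenated_breaks,
}


# --- Source-agnostic steps (applied to every document) ---
def mask_emails(text: str) -> str:
    """Very rough privacy step: replace email-like strings with a placeholder."""
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[EMAIL]", text)


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


SOURCE_AGNOSTIC: List[Callable[[str], str]] = [mask_emails, normalize_whitespace]


def process(docs: List[Document]) -> List[Document]:
    """Run the source-specific step first, then every source-agnostic step."""
    cleaned = []
    for doc in docs:
        specific = SOURCE_SPECIFIC.get(doc.source, lambda t: t)
        text = specific(doc.text)
        for step in SOURCE_AGNOSTIC:
            text = step(text)
        if text:  # drop documents that end up empty
            cleaned.append(Document(doc.source, text))
    return cleaned
```

Keeping the per-source rules in a lookup table and the shared rules in a single list means every document, whatever its origin, leaves the pipeline in the same plain-text shape.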
Prohibited for military surveillance or generating disinformation
To produce the dataset, AI2 set several criteria: openness, representativeness, size, reproducibility, and risk mitigation. The institute believes its approach to Dolma is the most appropriate for its first experiments in large-scale language modeling, but acknowledges that it is not necessarily the best or the only way.
Under AI2’s ImpACT license, researchers must meet several conditions: providing their contact information and stating their intended use case for accessing Dolma, disclosing the creation of any derivative based on Dolma, and distributing derivatives under the same restrictions as the ImpACT license. Researchers also agree not to use Dolma for a range of prohibited purposes, such as military surveillance or generating disinformation.
“In fact, we are excited for future research into curating language modeling corpora, and we hope Dolma dataset and tools to be valuable starting points for future research,” added AI2.
Founded in 2014, AI2 carries a mission to contribute to humanity through high-impact AI research and engineering. The research institute has undertaken several projects to drive fundamental advances in science, medicine, and conservation through AI.