Stay Ahead with Heaptalk: Your Go-To Source for Business News

AI2 introduces Dolma, a 3-trillion-token open dataset for training language models

By Sinta
October 9, 2023
in News, Technology

AI2 created the Dolma dataset for its OLMo language model. Image: AI2


Short for ‘Data to Feed OLMo’s Appetite’, the Dolma dataset contains 3 trillion tokens derived from web content, academic publications, code, books, and encyclopedic materials.

Heaptalk, Jakarta - The Allen Institute for AI (AI2), a Seattle-based non-profit research institute, has introduced Dolma, a massive open dataset for training language models. The dataset is part of its open language model project, OLMo, which began in March 2023.

Short for ‘Data to Feed OLMo’s Appetite’, Dolma contains 3 trillion tokens derived from web content, academic publications, code, books, and encyclopedic materials. As an open dataset, it is available to the AI research community on Hugging Face. AI2 claims that Dolma is the largest open dataset to date.


“Dolma differentiates itself from other datasets on two key aspects. First, it is significantly larger than other open datasets. Second, it is released under AI2’s ImpACT license, which was designed to balance ease of access with mitigation of potential risk in distributing large datasets,” AI2 explained on its official blog.

Dolma is constructed by converting raw data from multiple sources into clean, plain-text documents. The process involves two categories of data processing steps: source-specific and source-agnostic. These steps ensure that the collected data follows a consistent structure and meets ethical and privacy standards.
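As a rough illustration of that two-stage idea, the Python sketch below runs each raw document through a cleaner chosen by its source and then through steps applied to every document alike. It is not AI2's pipeline: the article does not say which steps AI2 uses, so the cleaners (`strip_html_tags`, `strip_leading_comments`), the whitespace normalization, the exact-duplicate filter, and every name here are hypothetical.

```python
import hashlib
import re
from typing import Callable, Dict, Iterable, List

Document = Dict[str, str]  # hypothetical record shape: {"source": ..., "text": ...}


def strip_html_tags(text: str) -> str:
    """Hypothetical source-specific step for web pages: drop HTML tags."""
    return re.sub(r"<[^>]+>", " ", text)


def strip_leading_comments(text: str) -> str:
    """Hypothetical source-specific step for code: drop leading '#' comment lines."""
    lines = text.splitlines()
    while lines and lines[0].lstrip().startswith("#"):
        lines.pop(0)
    return "\n".join(lines)


# Source-specific cleaners, keyed by an assumed "source" label.
SOURCE_SPECIFIC: Dict[str, Callable[[str], str]] = {
    "web": strip_html_tags,
    "code": strip_leading_comments,
}


def normalize_whitespace(text: str) -> str:
    """Source-agnostic step: collapse runs of spaces and tabs, trim the ends."""
    return re.sub(r"[ \t]+", " ", text).strip()


def build_corpus(raw_docs: Iterable[Document]) -> List[Document]:
    """Apply source-specific cleaning, then source-agnostic cleaning and dedup."""
    seen_hashes = set()
    corpus: List[Document] = []
    for doc in raw_docs:
        # Stage 1: cleaner chosen by source (identity if the source is unknown).
        cleaner = SOURCE_SPECIFIC.get(doc["source"], lambda t: t)
        text = cleaner(doc["text"])
        # Stage 2: steps applied to every document regardless of source.
        text = normalize_whitespace(text)
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if not text or digest in seen_hashes:
            continue  # skip empty documents and exact duplicates
        seen_hashes.add(digest)
        corpus.append({"source": doc["source"], "text": text})
    return corpus
```

Keeping the two stages separate means a new source only needs its own cleaner, while the shared steps stay the same for all sources.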

Prohibited for military surveillance or generating disinformation

To produce this dataset, AI2 has established several criteria, spanning openness, representativeness, size, reproducibility, and risk mitigation. The research institute is optimistic that its approach to Dolma is the most appropriate for its first experiments in large-scale language modeling. However, that does not mean this approach is the best or the only way.

Under AI2’s ImpACT license, researchers must meet several conditions: providing their contact information and stating their intended use case for accessing Dolma, disclosing the creation of any derivative based on Dolma, and distributing derivatives under the same restrictions as the ImpACT license. Researchers also agree not to use Dolma for a range of prohibited purposes, such as military surveillance or generating disinformation.

“In fact, we are excited for future research into curating language modeling corpora, and we hope Dolma dataset and tools to be valuable starting points for future research,” added AI2.

Founded in 2014, AI2 carries a mission to contribute to humanity through high-impact AI research and engineering. The research institute has undertaken several projects to drive fundamental advances in science, medicine, and conservation through AI.

Tags: AI2, Allen Institute for AI, Dolma, OLMo
