Databricks revamps its open-source code with a new 15k dataset to train AI models for commercial use

Databricks collected a dataset of 15,000 instruction-response pairs from more than 5,000 employees during March and April 2023 to replace the previous training data.

Heaptalk, Jakarta — Databricks, a startup providing an open and unified platform for data and AI, released Dolly 2.0, an open-source instruction-following large language model (LLM) licensed for commercial use (04/12).

The latest version of Dolly was trained on 15,000 human-generated instruction and response pairs, intended to give the model interactivity similar to ChatGPT. According to the company’s official statement, the dataset contains natural and expressive instruction and response pairs, designed to represent a wide range of behaviors.

According to the company, these instruction and response pairs cover brainstorming, content generation, information extraction, and summarization. Databricks collected the dataset from more than 5,000 employees in 40 countries, who filled out questionnaires during March and April 2023.

This new dataset was created to address the constraints of Dolly 1.0. Released in late March 2023, that initial version was trained on a dataset that the Stanford Alpaca team had generated using the OpenAI API.

That dataset is subject to OpenAI's terms of service, which prohibit using its output to build models that compete with OpenAI's own, such as ChatGPT. As a result, Dolly 1.0 could not be used in commercial products. Databricks therefore decided to create its own dataset for commercial use.

Users can verify the training data themselves

“We are open-sourcing the entirety of Dolly 2.0, including the training code, the dataset, and the model weights, all suitable for commercial use. This means that any organization can create, own, and customize powerful LLMs that can talk to people, without paying for API access or sharing data with third parties,” stated Databricks on its official blog.

Databricks CEO Ali Ghodsi said the company is releasing the training data for free to help other companies build their own AI systems, possibly by using Databricks, as quoted by Reuters.

Ali admitted that the dataset is not perfect, since it comes only from Databricks employees, who are mostly male. However, users can verify the training data themselves, which they cannot do with other models such as OpenAI’s ChatGPT and Google’s Bard.

“We are not claiming that this is an unbiased dataset. We are just trying to push the community to go in this direction of more transparency, and more of everyone owning their own models instead of just a few that we have to trust,” concluded Ali.