
Article Summary: Open-Source Language AI Challenges Big Tech’s Models by Elizabeth Gibney

BLOOM aims to address the biases that machine-learning systems inherit from the texts they train on.

Recommendation

Currently in its final weeks of training, the BLOOM model for natural language processing is almost ready for full launch. With parameter sets rivaling those used by Google and OpenAI, the system’s originators seek to correct biases inherent in many systems that make them seem all too human – in the worst ways. Anyone who designs or uses AI should read this eye-opening report, and perhaps consider signing up for a test drive.

Take-Aways

  • Scientists designed the BLOOM natural language processing model to correct AI text biases.
  • BLOOM can tackle a variety of AI-based research projects.
  • Natural language processing programs only work well when based on quality data sets.
  • Researchers can freely download the latest BLOOM model.

Summary

Scientists designed the BLOOM natural language processing model to correct AI text biases.

Because machine learning systems tend to inherit errors from training material, researchers warn of possible harm caused by AI models that process and generate text.

A multinational team of about a thousand mostly academic volunteers tried to reduce such problems by breaking “big tech’s” grip on natural language processing models. Fueled by $7 million in computing-time allocations, BLOOM [BigScience Large Open-science Open-access Multilingual Language Model] rivals the models conceived by OpenAI and Google – but offers multilingual, open-source access. BigScience collaborators released a preliminary version of BLOOM in June 2022.

Such systems can display humanlike qualities, including societal and ethical flaws inherent in people.

“Big tech firms increasingly use models that recognize and generate language in applications ranging from chatbots to translators, and the models can sound so eerily human that a Google engineer this month claimed that the firm’s AI model was sentient (Google strongly denies that the AI possesses sentience).”

Until now, researchers have had difficulty gaining access to privately held models.

BLOOM can tackle a variety of AI-based research projects.

Biological classification and historical document studies are just two possible uses for BLOOM. The company Hugging Face, which hosts an open-source platform for AI models, helped to organize the initiative.

“We think that access to the model is an essential step to do responsible machine learning.” (Thomas Wolf, co-founder of Hugging Face)

AI language models employ algorithms that draw on statistical relationships between billions of words and phrases to generate summaries, answer questions, classify text and translate between languages. Their “brains” consist of architectures called “neural networks.” The machines learn by continually adjusting individual values, or parameters, so that their statistical predictions better match reality. BLOOM’s 176 billion parameters put it on par with the heralded GPT-3 model created by OpenAI and licensed by Microsoft.
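
For intuition only, the toy Python sketch below (a made-up miniature, not BLOOM’s actual code) illustrates that training loop in its simplest form: a tiny model’s parameters are repeatedly nudged so that its prediction of the next word better matches what actually follows in a toy corpus.

    # Toy illustration: a bigram "language model" whose parameters are nudged
    # toward the statistics of a tiny corpus - a miniature of how large models
    # adjust billions of parameters during training.
    import numpy as np

    corpus = "the cat sat on the mat the cat ran".split()
    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)

    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.1, size=(V, V))  # parameters: scores for "next word given current word"

    def softmax(z):
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    learning_rate = 0.5
    for epoch in range(200):
        for cur, nxt in zip(corpus[:-1], corpus[1:]):
            predicted = softmax(W[idx[cur]])   # the model's statistical prediction
            target = np.zeros(V)
            target[idx[nxt]] = 1.0             # what actually came next ("reality")
            W[idx[cur]] -= learning_rate * (predicted - target)  # adjust parameters toward the data

    # After training, the model assigns high probability to words that followed "the" in the corpus.
    probs = softmax(W[idx["the"]])
    print({w: round(float(probs[idx[w]]), 2) for w in vocab})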

These natural language processing systems can answer trivia questions or generate poetry, but they possess no real grasp of language itself – so they also churn out gibberish. They sometimes reproduce misogynistic, antireligious or racist attitudes buried in their data sets, and they can be misused. Such models cost millions of dollars to train and create a huge carbon footprint.

Hundreds of researchers worked to construct BLOOM, including philosophers, legal scholars and ethicists. Some Google and Facebook employees contributed as volunteers. BigScience enjoyed free use of the Jean Zay supercomputer facility near Paris, France.

Natural language processing programs only work well when based on quality data sets.

Selecting training texts for BLOOM presented a daunting challenge to project developers. BigScience researchers avoided the usual practice of using language scraped directly from websites such as Reddit. Instead, they hand-selected roughly two-thirds of the model’s 341-billion-word data set from some 500 sources spanning many languages. One source was Semantic Scholar, an AI-backed academic search engine that includes Nature articles. Sources were finalized through workshops with community groups such as Masakhane, an African natural language processing community, Machine Learning Tokyo and LatinX in AI.

“We wanted to make sure people with proximity to the data, their country, the language they speak, had a hand in choosing what language came into the model’s training.” (Yacine Jernite, Hugging Face machine learning researcher)

Researchers redacted the retrieved data for privacy and screened it for quality. They toned down the usual overrepresentation of pornographic sites (which can push sexist associations) without excluding keywords tied to frank discussions of sexuality. Researchers know that BLOOM will never be bias-free, but they aspire to improve on existing systems by releasing its data sets and code openly.

“As well as comparing BLOOM with other models in its abilities to, for example, answer questions, researchers want to look at more diverse metrics, such as how strongly it makes certain stereotyped associations or how biased its abilities are towards a specific language.” (Ellie Pavlick, natural language learning researcher, Brown University)

Ellie Pavlick hopes that BLOOM’s multilingual training will imbue it with a deeper awareness of language, enabling it to handle more complex and diverse tasks.

Researchers can freely download the latest BLOOM model.

Researchers who wish to use BLOOM will need sophisticated hardware. To avoid excluding smaller teams, BigScience plans to publish smaller versions of the model and to let laboratories share BLOOM across a distributed network. A Hugging Face application lets individuals query the system without downloading it.
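
As a rough illustration of that kind of lightweight access, the sketch below queries one of the smaller checkpoints locally through the Hugging Face transformers library; the checkpoint name bigscience/bloom-560m and the prompt are assumptions for this example, so check the BigScience organization on the Hugging Face Hub for the versions actually published.

    # Sketch: generating text with a smaller BLOOM checkpoint via Hugging Face transformers.
    # The checkpoint "bigscience/bloom-560m" is an assumed example, not a recommendation.
    from transformers import pipeline

    generator = pipeline("text-generation", model="bigscience/bloom-560m")
    output = generator("BLOOM is an open-access multilingual language model that",
                       max_new_tokens=40)
    print(output[0]["generated_text"])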

BLOOM may not be limited to AI research. Linguist Francesco de Toni leads a BigScience working group at the University of Western Australia that aims to use the model to comb through large volumes of historical texts. For example, BLOOM might extract terms for the goods mentioned in Renaissance merchants’ letters.
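
Purely as an illustrative sketch (not the work group’s actual method), prompt-based extraction of that kind might look roughly as follows; the checkpoint name, the prompt wording and the sample letter are all invented for this example.

    # Illustrative only: prompting a BLOOM checkpoint to list goods named in a passage.
    # Checkpoint, prompt and sample text are assumptions, not the BigScience group's pipeline.
    from transformers import pipeline

    generator = pipeline("text-generation", model="bigscience/bloom-560m")

    letter = ("We shipped four bales of wool, two casks of olive oil "
              "and a chest of silk to Venice.")
    prompt = f"Letter: {letter}\nGoods mentioned in the letter:"
    result = generator(prompt, max_new_tokens=30)
    print(result[0]["generated_text"])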

About the Author

Elizabeth Gibney is a senior physics reporter at Nature. She has written for Scientific American, the BBC and CERN.