Join us on November 9 to learn how to successfully innovate and achieve excellence by improving and scaling citizen developers at the Low-Code/No-Code Summit. Register here.
A little more than a year ago, using large language models (LLMs) to generate software code was an innovative scientific experiment that had yet to prove its worth. Today, code generation has become one of the most successful applications of LLMs, and BigCode, a project recently launched by Hugging Face and ServiceNow, aims to address some of its biggest pain points.
Today, many developers are using LLM-powered tools like GitHub Copilot to improve productivity, stay in flow and make their work more enjoyable. However, as LLM-powered coding grows, we are also beginning to discover the challenges it must overcome, including licensing, transparency, security and control.
The Stack, a source code dataset recently released by the BigCode project, addresses some of these pain points. It also highlights some of the notable hurdles that remain as artificial intelligence (AI)-powered code development moves into the mainstream.
LLMs and code licenses
“The recent introduction of code LLMs has shown that they can make developers more productive and make software engineering accessible to people with a less technical background,” Leandro von Werra, machine learning engineer at Hugging Face, told VentureBeat.
These language models can serve different tasks. Programmers use tools like Copilot and Codex to write entire classes and functions from text descriptions. Such tools can be very useful for automating mundane parts of programming, such as setting up web servers, retrieving information from databases or even writing Python code for a neural network and its training loop. According to von Werra, software engineers will eventually be able to use LLMs to maintain legacy code written in unfamiliar programming languages.
However, the growing use of LLMs in coding has raised some concerns, including licensing issues. Models like Copilot generate code based on patterns they learn from their training examples, some of which may be subject to restrictive licenses.
“Questions are raised as to whether these AI models respect existing open-source licenses, both for training and for the code they generate, and what social impact this technology has on the open-source software community,” said von Werra.
Although open-source licenses legally allow broad use of a code repository, these licenses were written before modern deep learning and the large-scale collection of training datasets. Developers, therefore, may never have intended for their code to be used to train language models.
“The issues of consent and purpose around using people’s code to train deep neural networks are not addressed in current open-source licenses; the community still needs to develop standards on how to develop this technology responsibly, respecting developers’ wishes about how their content is used,” said von Werra.
A collaborative project from Hugging Face and ServiceNow
The BigCode project, announced in September, is a collaboration between Hugging Face and ServiceNow. The Stack, released on October 27, is a 3 TB dataset of “permissively licensed source code” gathered from GitHub, developed for training large language models for code generation.
Permissive licenses are those with the fewest restrictions on copying, modifying and redistributing the code; they include the MIT and Apache 2.0 licenses. They do not include “copyleft” licenses such as the GPL, which require that derivative works preserve the same rights as the original repository. There is ongoing controversy and disagreement over whether models trained on copyleft-licensed code count as derivative works.
Limiting the dataset to permissively licensed code helps ensure that it can be used for a wide variety of applications.
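To make the license-based filtering concrete, here is a minimal sketch of what such a filter could look like. The allowlist and the `keep_repo` helper are illustrative assumptions for this article, not BigCode's actual tooling or its actual list of accepted licenses.

```python
# Hypothetical permissive-license filter. The identifiers below follow the
# SPDX naming convention, but this allowlist is an illustration only, not
# BigCode's real list.
PERMISSIVE = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc"}

def keep_repo(license_id: str) -> bool:
    """Return True only for repositories under a permissive license."""
    return license_id.lower() in PERMISSIVE

# Copyleft-licensed repos such as GPL-3.0 are excluded from the dataset.
repos = [("repo-a", "MIT"), ("repo-b", "GPL-3.0"), ("repo-c", "Apache-2.0")]
kept = [name for name, lic in repos if keep_repo(lic)]
print(kept)  # ['repo-a', 'repo-c']
```

In practice the hard part is determining the license reliably in the first place, since repositories declare licenses in inconsistent ways.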
“The goal of The Stack is to enable researchers from academia and industry to collaborate on the research and development of large-scale language models for code applications by releasing a dataset that can be shared, investigated and used to pre-train new systems,” von Werra said.
BigCode is also taking steps to give developers more control over their code. Developers can explicitly opt out of having their repository included in The Stack and used to train LLMs, regardless of the license they initially chose.
“Developers who wish to opt out can submit a request and, once it is validated, have their code removed from future versions of The Stack,” said von Werra.
One of the challenges facing researchers working with code LLMs is the lack of openness and transparency around how these systems are developed. The teams behind models such as AlphaCode, CodeParrot and CodeGen described their data collection only at a high level and did not release the training data itself.
“It is difficult for other researchers to fully replicate these models and understand what kind of pretraining data leads to high-performance code LLMs,” von Werra said. “By releasing an open large-scale code dataset, we hope to make the practice of code LLMs more replicable.”
In addition to providing an unprecedented 3 TB of curated source code, the BigCode team published a detailed breakdown of how the code was sourced and filtered. The dataset was gathered over several months: the team downloaded 137.36 million publicly available GitHub repositories, filtered out repositories without permissive licenses, and finally ran a deduplication process to remove files that are exact or near duplicates of others.
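The deduplication step above can be sketched in a few lines. This version hashes whitespace-normalized file contents so that files differing only in formatting collapse to one copy; it is a deliberate simplification, since production near-deduplication pipelines typically use more sophisticated similarity techniques than this.

```python
import hashlib

def content_key(source: str) -> str:
    """Hash whitespace-normalized content, so files that differ only in
    formatting count as duplicates (a toy stand-in for real near-dedup)."""
    normalized = " ".join(source.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(files: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct (normalized) file."""
    seen, unique = set(), []
    for f in files:
        key = content_key(f)
        if key not in seen:
            seen.add(key)
            unique.append(f)
    return unique

# The first two files differ only in whitespace, so one is dropped.
files = ["def f():\n    return 1", "def f():  return 1", "def g(): return 2"]
print(len(deduplicate(files)))  # 2
```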
“An open dataset benefits from external review, and BigCode provides a way for other researchers and developers to report issues directly to the team managing the dataset,” von Werra said.
Hugging Face and ServiceNow tackle the remaining challenges
Licensing is not the only challenge facing code LLMs. Model engineers and dataset maintainers must also address other problems, such as removing sensitive information, including usernames, passwords and security tokens, from the training data.
Another concern is insecure code. Because LLMs are trained on source code gathered from public sources, the training set may include insecure code. Malicious actors can even poison the training data by deliberately spreading insecure code in open repositories. An LLM will then learn insecure coding patterns and reproduce them in response to developer prompts.
The open nature of The Stack will allow security researchers to analyze the dataset for insecure code. In addition, the BigCode team has implemented update mechanisms that take advantage of new information, such as vulnerability disclosures and evolving best practices, to limit the spread of malicious code in The Stack. The team is also working on ways to filter out personally identifiable information (PII).
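As a rough illustration of what PII filtering involves, the sketch below redacts strings that look like email addresses or long hexadecimal secrets. Both regexes are hypothetical examples for this article; real PII detection, of the kind the BigCode team is developing, is a much harder problem than two patterns can capture.

```python
import re

# Illustrative-only patterns; production PII filtering handles many more
# categories (names, keys in varied formats, IP addresses, etc.).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
HEX_TOKEN = re.compile(r"\b[0-9a-f]{32,}\b")  # e.g., a leaked API key

def redact(source: str) -> str:
    """Replace email-like and token-like substrings with placeholders."""
    source = EMAIL.sub("<EMAIL>", source)
    return HEX_TOKEN.sub("<TOKEN>", source)

code = 'AUTHOR = "jane@example.com"\nKEY = "0123456789abcdef0123456789abcdef"'
print(redact(code))
# AUTHOR = "<EMAIL>"
# KEY = "<TOKEN>"
```

The trade-off in such filters is between missing real secrets and mangling legitimate code, which is one reason external review of an open dataset is valuable.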
Moreover, the team is working on a special license dedicated to code LLMs, named OpenRAIL (Responsible AI License).
“The OpenRAIL license is an open-source license similar to Apache 2.0, but it also includes provisions that prohibit certain use cases, such as the development of malware,” said von Werra. “In addition, we are also developing a tool to search The Stack for generated code so that it can be properly attributed under its license.”
The future of code LLMs
LLMs can expand the skills of professional software engineers and enable non-technical people to develop new software. But that will only happen if the community can establish a new set of sustainable rules and best practices around licensing and attribution, von Werra warns. He also believes that automation does not mean that human skills will become less relevant to coding.
“There needs to be more internal governance in place in organizations that use the technology,” von Werra said. “The role of the human-in-the-loop in the AI value chain will become more important to ensure that generated code is fit for purpose and complies with corporate policy and broader AI regulation.”
VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative enterprise technology and transact. Discover our Briefings.