Project CodeNet was announced at IBM’s Think conference this week and claims to be the largest open-source dataset for code (approximately 10 times the size of the next largest).
CodeNet features 500 million lines of code, 14 million examples, and spans 55 programming languages including Python, C++, Java, Go, COBOL, Pascal, and more.
Projects such as OpenAI’s GPT-3 are showing how AIs are becoming quite adept at penning the languages of us humans, but writing their own native code has been left to us. CodeNet aims to change that.
For at least the foreseeable future, projects like GPT-3 will remain tools that increase human productivity: they provide a basic draft that still requires editing to iron out errors and to compensate in areas where humans retain an edge, such as creativity, emotion, and compassion.
CodeNet will be similar, at least initially: by improving an AI’s own understanding of such tasks, it will lead to enhanced tools that help humans write and check code faster.
“Given its wealth of programs written in a multitude of languages, we believe Project CodeNet can serve as a benchmark dataset for source-to-source translation and do for AI and code what the ImageNet dataset did years ago for computer vision,” says IBM.
US entrepreneur Marc Andreessen famously, and correctly, wrote in 2011 that “Software is eating the world”. Fast-forward to today and even cars now feature over 100 million lines of code (a figure that is growing rapidly with the advent of autonomous vehicles).
IBM says one of its large automotive clients recently approached the company to help update a $200 million asset consisting of 3,500 multi-generation Java files. These files contained over one million lines of code.
By applying its AI for Code stack, IBM cut the client’s ongoing, year-long code migration process to just four weeks.
That example is likely to be the first of many projects in the years to come that are greatly sped up, and improved, thanks to Project CodeNet.
You can find the full Project CodeNet dataset on GitHub.