You want to start an AI project – but what processes do you need to bear in mind, how do you manage the data, and what do you need to look at when it comes to team composition and testing? In this extract from Embracing the Power of AI, Javier Minhondo, Juan José López Murphy, Haldo Spontón, Martín Migoya, and Guibert Englebienne outline how to get through these crucial initial stages.
Getting a grasp of the business needs is crucial, as is understanding the data sources. Does the solution center on user needs, or are we putting technology first? Going back to the key aspects discussed in Chapter 6 (see page 73), successful AI products must be driven and designed to help people access and process information, and to facilitate decision-making. The very first issue for the team is problem formulation, during which they will define the business needs to be addressed in a way that is usable, identify the expected outcome and, even more importantly, decide how they will use that outcome so as to ensure it’s not spurious. Next comes understanding the data sources: how healthy they are, how to integrate them, and how the team can use them to fulfill the business goal. Given that, the first thing to do is check that we can actually process the client’s data, without regard for speed or complexity, and validate that we can get a suitable business answer. It does not matter at this stage whether that answer is the best possible one; what matters is confirming that we are doing something worth doing. If all these checks go through, we can plan to scale up.
To summarize, the questions that have to be addressed are:
- What data sources are available?
- What is the meaning of the data?
- Do we have enough data?
- How is the data generated and collected?
And there are some other questions that might not bear directly on feasibility, but that should be up for discussion since they will strengthen or undermine the quality of the whole endeavor:
- Is the data that we are using to train the model representative enough?
- Are there any biases we are not factoring out?
- Can the data be used with disclosure, without alienating the end user?
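Several of the questions above (how much data there is, how healthy it is, how balanced it is) can be given a first rough answer with a quick profile of the dataset. The sketch below is only an illustration: the `profile` helper, the field names, and the four rows are invented for the example, and a real audit would go much further.

```python
from collections import Counter

def profile(rows, label_key):
    """Summarize row count, per-field missing rate, and label balance."""
    n = len(rows)
    fields = sorted({key for row in rows for key in row})
    # Share of rows where each field is absent or None.
    missing_rate = {f: sum(row.get(f) is None for row in rows) / n for f in fields}
    # How many rows carry each value of the outcome we want to predict.
    label_counts = dict(Counter(row.get(label_key) for row in rows))
    return {"rows": n, "missing_rate": missing_rate, "label_counts": label_counts}

# Hypothetical records, made up for this example.
rows = [
    {"age": 34, "region": "north", "churned": 0},
    {"age": None, "region": "south", "churned": 0},
    {"age": 51, "region": None, "churned": 1},
    {"age": 29, "region": "north", "churned": 0},
]

report = profile(rows, "churned")
print(report)
```

A report like this does not answer whether the data is representative or free of bias, but it makes gaps and imbalances visible early enough to discuss them.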
There is another step in handling the data before the algorithm itself, called feature engineering, which consists of combining and modifying the data to make it more meaningful to the algorithms that are going to consume it. This includes, for instance, squaring variables or multiplying them so that linear models can capture nonlinearities (or, more generally, applying kernels), performing a Fourier transform to work in the frequency domain when processing audio, or passing an image through the SIFT algorithm to locate features in it.
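To make the idea of squaring a variable concrete, here is a minimal sketch using only the Python standard library. The toy data and the helper names `fit_line` and `mse` are assumptions made for the example. The same one-feature linear model is fitted twice: once on the raw variable and once on its square.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares with a single feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    variance = sum((x - x_bar) ** 2 for x in xs)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = covariance / variance
    return slope, y_bar - slope * x_bar

def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on the given data."""
    return mean((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys))

# Toy data, made up for illustration: y depends on x quadratically.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x + 1 for x in xs]

# A linear model on the raw variable cannot follow the curve.
raw_fit = fit_line(xs, ys)
raw_error = mse(xs, ys, *raw_fit)

# The same linear model on the engineered feature x^2 fits it exactly.
squared = [x * x for x in xs]
eng_fit = fit_line(squared, ys)
eng_error = mse(squared, ys, *eng_fit)

print(raw_error, eng_error)
```

The model stays linear in both cases; only the input changes, which is the point of feature engineering.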
While this step is critical in traditional machine learning, it has been shifting into architecture engineering when talking about deep learning, since one of the capabilities of these deep neural networks is learning features by themselves. In that situation, the focus becomes defining an architecture that enables the algorithm to learn the best possible features. Once we have a clear business goal and a clear understanding of the data, we can then start with the modeling of algorithms. This will include training the algorithms, testing the algorithms, visualizing the outcome, and evaluating it.
Training an algorithm can be likened to training a dog, but millions of times. As trivial as that sounds, the process is as simple as this: You have input data of which you know the expected outcome; you feed the data to the algorithm, which will give an answer back for that data; and you give feedback to the AI, pointing out which answer is right and which one is wrong (more likely: this answer is about “this much” wrong). The process may seem simple, but executing it effectively and correctly is far from trivial, requiring a lot of focus and effort. We need to define a strategy, otherwise the artificial intelligence can learn unwanted and unforeseen negative behaviors.
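The feed, answer, feedback cycle described above can be sketched in a few lines. This is a deliberately tiny illustration, not how production systems are trained: the model is a straight line with two parameters, the data is invented, and the learning rate and number of passes are arbitrary assumptions.

```python
# Toy labelled data, made up for illustration: the expected outcome is y = 2x + 1.
data = [(x, 2 * x + 1) for x in range(-5, 6)]

def train(data, epochs=200, learning_rate=0.01):
    """Repeat the feed / answer / feedback cycle over the whole dataset."""
    w, b = 0.0, 0.0  # the model starts out knowing nothing
    for _ in range(epochs):
        for x, expected in data:
            answer = w * x + b         # the model gives an answer for this input
            error = answer - expected  # feedback: the answer is "this much" wrong
            # Nudge each parameter a little in the direction that reduces the error.
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b

w, b = train(data)
print(round(w, 3), round(b, 3))
```

Each individual correction is small; it is the repetition, thousands or millions of times, that moves the parameters toward values that reproduce the expected outcomes.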
One widely cited example happened at Google, when its image recognition software was fed a photo of an African-American family and tagged it as a shrewdness of apes (a group of apes). Did that outcome have anything to do with racism? We will discuss that a bit more in Chapter 13, but the behavior of the AI emerges from the dataset and process used during the training and testing phase. This underlines how important the training stage is: the success of the behavior you design depends on it. This is where data science and service and experience design work together to curate and educate the AI, to make the experience of interacting with it as effective as possible.
One of the design decisions at this stage has to do with carefully selecting, from all the available data, how much and which parts are going to be used to train the algorithm, and how much is going to be kept secret (untouched and unseen, a critical point) from the algorithm to test it later. The testing data needs to be as representative as the training data. The amount of test data also needs to be carefully selected. In the past, the rule of thumb has been to leave 20 percent of the whole dataset unseen to the algorithm, while using 80 percent as training data. Some modern deep learning approaches can use up to 99 percent of the data for training.
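As a minimal sketch of the 80/20 rule of thumb, the helper below shuffles the records and sets a fraction aside as unseen test data. The function name `split_dataset`, the fixed seed, and the stand-in records are assumptions for the example; a plain random shuffle also does not by itself guarantee that the test set is representative.

```python
import random

def split_dataset(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and hold out a fraction as test data."""
    shuffled = list(records)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    # First n_test shuffled records are held out; the rest are for training.
    return shuffled[n_test:], shuffled[:n_test]

records = list(range(100))  # stand-in for 100 real examples
train_set, test_set = split_dataset(records)
print(len(train_set), len(test_set))
```

Calling `split_dataset(records, test_fraction=0.01)` would correspond to the 99 percent training share mentioned above.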
Now we can elaborate on the team composition and the role of testing. Training the AI and ML components is part of the data-science role that the team must take on. They need to check the results of the algorithm when it is fed testing data that it has never seen before. One of the common pitfalls a team wants to avoid is using the same dataset to train and then test the algorithm. This will lead to the algorithm responding with great levels of accuracy, because it is being tested on data it has already seen during the training phase. A team that has fallen into this pitfall will be tempted to say that the ML component has reached a level of confidence that makes it deployable. The problem comes when the component is faced with unknown data, for which its output is totally uncertain. There are many ways in which the test data can leak into the training set, some as simple as reckless repeated testing.
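The pitfall is easiest to see with an extreme case: a "model" that simply memorizes its training examples. The sketch below is a contrived illustration with invented data and hypothetical helper names, not a realistic learner, but it shows why accuracy measured on already-seen data says little about deployability.

```python
def fit_memorizer(examples):
    """A 'model' that only stores every (input, label) pair it was shown."""
    return dict(examples)

def predict(model, x, default=0):
    """Return the stored label, or a fixed default for inputs never seen."""
    return model.get(x, default)

def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the true label."""
    hits = sum(predict(model, x) == label for x, label in examples)
    return hits / len(examples)

# Toy data, made up for illustration: the label is the parity of the input.
records = [(i, i % 2) for i in range(100)]
train_set, test_set = records[:80], records[80:]

model = fit_memorizer(train_set)
seen_accuracy = accuracy(model, train_set)    # evaluated on data it was trained on
unseen_accuracy = accuracy(model, test_set)   # evaluated on held-out data

print(seen_accuracy, unseen_accuracy)
```

Evaluated on its own training data the memorizer looks perfect; on the held-out records it does no better than a coin flip.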
Finally, the team has to perform an evaluation of the algorithm in terms of sensitivity and cost. An error-free model does not exist, and not all errors are born alike. A false-positive (saying something is what it is not) is not the same as a false-negative (failing to say what something is). Think of it as being diagnosed with an illness you do not have (false-positive) against not being told that you have the flu (false-negative). Some cases are critical (serious illnesses), so a higher rate of false-positives might be more acceptable. This can be taken to the extreme, however, such as flagging everyone as a terrorist at an airport just to avoid missing one. In another, more retail-oriented example, over-forecasting might incur some working capital costs, while under-forecasting might cause lost sales to a competitor.
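One way to make "not all errors are born alike" operational is to count each kind of error and weigh it by what it costs the business. The sketch below assumes invented labels, two hypothetical models, and arbitrary cost figures; only the mechanics are the point.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            counts["tp"] += 1
        elif p == 1 and a == 0:
            counts["fp"] += 1  # said it was something it was not
        elif p == 0 and a == 1:
            counts["fn"] += 1  # failed to say what it was
        else:
            counts["tn"] += 1
    return counts

def total_cost(counts, cost_fp, cost_fn):
    """Weigh each kind of error by its (assumed) business cost."""
    return counts["fp"] * cost_fp + counts["fn"] * cost_fn

# Made-up ground truth and the outputs of two hypothetical models.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
cautious = [1, 1, 1, 1, 1, 0, 0, 0]  # flags more cases: 2 false positives
lenient = [1, 0, 0, 0, 0, 0, 0, 0]   # flags fewer cases: 2 false negatives

# Serious-illness setting: missing a case is assumed 10x worse than a false alarm.
cautious_cost = total_cost(confusion_counts(actual, cautious), cost_fp=1, cost_fn=10)
lenient_cost = total_cost(confusion_counts(actual, lenient), cost_fp=1, cost_fn=10)
print(cautious_cost, lenient_cost)
```

With the costs reversed (a false alarm assumed ten times worse than a miss), the ranking of the two models flips, which is why the cost structure has to come from the business problem rather than from the algorithm.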
This is known as the precision-recall trade-off, and it needs to be tuned very specifically for each problem. But how much training is enough? Is the AI just memorizing the data or extracting generalities from it? Is it just general enough to handle new data, or is it so general that it does not tell me anything nontrivial? Finding the balance between overfitting (memorization) and generalization is another difficult alchemy to perform. A basic schematic of this concept for a simple classifier can be seen in figure 7.1 (above). Throughout this process, a core concern remains lurking below the surface. How am I measuring success, cost, preference? Is it aligned with business requirements? Is it technically helpful? How can I combine them? All metrics have biases and blind spots, and not all of them are mathematically useful for training! Once the team is confident with the solution, it can work on converting the outcomes of the algorithms into insights, action items, predictions, or simply put: Data Products. The overall process can be seen, simplified, in figure 7.2 (right).
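The precision-recall trade-off discussed above can be seen by sweeping the decision threshold of a classifier that outputs scores. In this sketch the labels, scores, and thresholds are invented for illustration, and `precision_recall` is a hypothetical helper rather than a library function.

```python
def precision_recall(actual, scores, threshold):
    """Return (precision, recall) when scores >= threshold are flagged positive."""
    flagged = [score >= threshold for score in scores]
    tp = sum(1 for a, f in zip(actual, flagged) if f and a == 1)
    fp = sum(1 for a, f in zip(actual, flagged) if f and a == 0)
    fn = sum(1 for a, f in zip(actual, flagged) if not f and a == 1)
    # Convention assumed here: if nothing is flagged, precision is reported as 1.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up true labels and model scores (higher = more likely positive).
actual = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]

results = {t: precision_recall(actual, scores, t) for t in (0.25, 0.5, 0.75)}
for threshold, (precision, recall) in results.items():
    print(threshold, round(precision, 2), round(recall, 2))
```

On this toy data, raising the threshold makes the flagged cases more trustworthy (higher precision) while missing more of the true positives (lower recall); where to set it depends on the error costs of the specific problem.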
The process is not quite finished at this point. Making sure that the trained algorithm, app, or system can actually work in production, scale, and be kept under control requires some heavy engineering skills. By monitoring its behavior and use, feedback can be incorporated back into the problem definition and all the other stages. In fact, we can see in figure 7.3 (left) a more truthful representation of the actual workflow, considering all the ways we can loop back to an earlier stage before moving forward.
About the contributors:
Javier Minhondo is a devoted and passionate software engineer with years of experience. As the vice president of technology of the Artificial Intelligence Studio at Globant, he engages with companies to enhance digital journeys by leveraging cognitive and emotional aspects with the ever-increasing capacity of machines to learn and understand complex patterns. He utilizes state-of-the-art techniques, including deep learning, neural networks, or traditional machine learning approaches, coupled with hacking and engineering abilities.
Juan José López Murphy’s primary area of interest is working on the intersection of innovation, strategy, technology, and data. He is passionate about data, whether from a data science, data engineering, or BI data visualization mind-set, always looking for the ways in which technology enables and is the driver of business model innovation; true disruption is the result of both.
Haldo Spontón is part of the team that leads the practice of Data Science and the study of artificial intelligence within Globant. He focuses on machine learning algorithms and deep learning, with a particular focus on and taste for the analysis of unstructured data for later decision-making (audio processing, images, video, etc.).
In 2003, Martín Migoya, with only a small start-up capital, cofounded Globant with three friends. Twelve years after its launch, Martín had driven the company as CEO from a small start-up to a publicly listed organization with more than 6,700 professionals, pushing the edge of innovative digital journeys in offices across the globe. Martín, whose passion is to inspire future entrepreneurs, frequently gives lectures to foster the entrepreneurship gene, and has won numerous prestigious industry awards, including the E&Y Entrepreneur of the Year award.
Guibert Englebienne has had a lifelong passion for cutting-edge technology and exploring how it fits into business and culture. He is one of Globant’s cofounders and now serves as CTO, leading the execution of thousands of consumer-facing technology projects. He is a frequent speaker and thinker on how to drive a culture of innovation at scale in organizations. Guibert is widely recognized as one of the industry’s most influential leaders.