

What is AI Engineering (and how does it differ from traditional ML for building apps with foundation models)?

AI Engineering by Chip Huyen explains how to build reliable apps using foundation models, covering prompt engineering, retrieval-augmented generation (RAG), AI agents, evaluation metrics, and dataset engineering for non-ML experts. Continue reading to start your AI engineering roadmap: define your app’s success metrics, test a simple RAG pipeline against a fine-tuned model, and use the evaluation checklist to measure whether your prompt engineering is actually working.

Recommendation

Various readily available AI foundation models can be applied to myriad use cases, allowing even those with relatively little technical background or AI experience to build AI products. AI Engineering explains how AI engineering differs from classical machine learning, how to develop an AI application, and how to navigate the challenges that can arise along the way. Learn how to adapt a model to your specific needs and, ultimately, how to choose a model that works for you. Datasets matter, and so do ways of evaluating where you are on your journey.

Take-Aways

  • AI engineering differs from traditional machine learning engineering.
  • Foundation AI models are trained on enormous amounts of data.
  • The more intelligent an AI model is, the more difficult it is to evaluate.
  • An application is only useful if it fulfills its stated purposes.
  • “Prompt engineering” is the art of crafting instructions in order to get the output you seek from your AI model.
  • AI models require both instructions and information.
  • Models can be adapted for specific tasks.
  • A model is only as good as its training data.

Summary

AI engineering differs from traditional machine learning engineering.

What distinguishes AI today from earlier incarnations is its sheer scale. Applications like ChatGPT, Google’s Gemini, and Midjourney consume significant amounts of electricity and are trained on massive amounts of data; the publicly available data to train them on may actually run out. These powerful AI models can run myriad applications, increasing their economic value and, ultimately, improving people’s lives.

“The demand for AI applications has increased while the barrier to entry for building AI applications has decreased. This has turned AI engineering — the process of building applications on top of readily available models — into one of the fastest growing engineering disciplines.”

Training large language models (LLMs) requires huge amounts of data and computational power, a demand few companies can meet. The difference between today’s LLMs and earlier language models is “self-supervision”: Older language models needed specifically labeled data, which can take a great deal of time and resources to gather. Supervision involves tagging data with the behaviors and other features you want a model to learn and then use to inform its output; once training is complete, the model can apply what it learned from the tagged datasets to analyze data in general. Labeling becomes more difficult as the complexity of the data grows: Almost anyone can label common objects, but labeling CT scans based on whether they show signs of cancer requires special expertise. Self-supervision speeds up the process by letting the model infer labels from the input data itself. In the case of an LLM, for example, the model predicts the word that ought to come next in a sequence, based on the text sequences that appear in its training data, such as blog posts, news articles, and e-books.
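To make self-supervision concrete, here is a minimal sketch (not from the book) showing how next-word training pairs fall directly out of raw text, with no human labeling:

```python
# A minimal sketch of self-supervision for language modeling: every
# (context, next-word) training pair comes from the raw text itself,
# so no human labeling is required.
def next_word_pairs(text: str):
    """Turn raw text into (context, target) training pairs."""
    words = text.split()
    return [(" ".join(words[:i]), words[i]) for i in range(1, len(words))]

for context, target in next_word_pairs("the cat sat on the mat"):
    print(f"{context!r} -> {target!r}")
# 'the' -> 'cat'
# 'the cat' -> 'sat'
# ...
```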

Foundation AI models are trained on enormous amounts of data.

Large language models are, by nature, limited. Specifically, they are limited to text. But if AI is going to function in the actual world and mimic human experience, it will have to process information from all the senses: sight, hearing, taste, smell and touch. Thus, more advanced AI models are beginning to incorporate other data modalities, such as video and images. While many people still refer to these newer “multimodal” models as LLMs, a better term is “foundation models.”

“Foundation models, thanks to their scale and the way they are trained, are capable of a wide range of tasks.”

Foundation models are general purpose, meaning people can apply, adapt, and build upon them to handle a broad range of tasks. You might, for instance, want to use AI to generate product descriptions for your website, or to create detailed descriptions that link to a database of customer reviews. You can even task the model with refining its descriptions based on those reviews.
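As an illustration, here is a hedged sketch of building on a hosted foundation model to draft a product description, using the OpenAI Python client as one example provider; the model name and product details are placeholders:

```python
# A hedged sketch of building on a hosted foundation model to draft a
# product description. Model name and product details are placeholders.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

name, features = "TrailLite 40L Backpack", "waterproof, 1.2 kg, 40-liter capacity"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": "You write concise e-commerce product descriptions."},
        {"role": "user", "content": f"Write a two-sentence description of the {name} ({features})."},
    ],
)
print(response.choices[0].message.content)
```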

AI engineering involves developing applications on top of existing foundation models. Foundation models are powerful in part because they are so versatile, which saves money. They also attract billions in investment, with some $200 billion expected in 2025. Over time, building sophisticated AI applications has grown easier and easier.

The more intelligent an AI model is, the more difficult it is to evaluate.

Training foundation models is a complicated and pricey endeavor. The data used to train a model, and the way that data is organized, shape the way the model works. A model’s training divides into “pre-training,” in which the model learns general patterns from massive raw datasets, and “post-training,” which uses data designed to articulate human interests and preferences.

“An AI model is only as good as the data it was trained on.”

If certain kinds of data, say, images of plants or animals, aren’t included in training data, a model’s responses to queries on those subjects won’t be accurate. Problematic data quality, such as data that includes misinformation and conspiracy theories, will lead to questionable and untrustworthy outputs. Training data also skews heavily toward certain languages: English-language data accounts for over 45% of the available training data, while the next most common language, Russian, accounts for nearly 6%. Many languages aren’t represented at all. Thus, some models are more likely to have performance problems when operating in non-English languages. For instance, in 2023, NewsGuard prompted ChatGPT to produce misinformation-infused articles; the model declined to write the articles in English but produced the asked-for outputs in Chinese.

An application is only useful if it fulfills its stated purposes.

Countless foundation models are available. To choose the right model, you need to evaluate the various options in light of how well each can support a given application. But, first, you must evaluate your applications and determine how to measure their success. Say, for instance, you build an application to detect fraud. You can measure such an application’s success by determining how much money it has saved by preventing fraudulent activity.
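As a back-of-the-envelope illustration, such a savings metric might be computed like this (all numbers are made up):

```python
# Illustrative only: estimating an app's savings as the value of fraud
# it blocked, minus legitimate transactions it blocked by mistake.
blocked = [
    {"amount": 1200.00, "was_fraud": True},
    {"amount": 85.50, "was_fraud": True},
    {"amount": 40.00, "was_fraud": False},  # false positive: a lost sale
]

saved = sum(t["amount"] for t in blocked if t["was_fraud"])
lost = sum(t["amount"] for t in blocked if not t["was_fraud"])
print(f"Fraud prevented: ${saved:,.2f}")
print(f"Legitimate sales blocked: ${lost:,.2f}")
print(f"Net benefit: ${saved - lost:,.2f}")
```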

“AI applications with questionable returns on investment are, unfortunately, quite common. This happens not only because the application is hard to evaluate but also because application developers don’t have visibility into how their applications are being used.”

In general, it pays to consider an application’s effectiveness in four primary areas: “domain-specific capability, generation capability, instruction-following capability, and cost and latency.” If you make an app for summarizing legal contracts, for instance, you’ll want to measure its ability to understand the contract, the accuracy level of the summary it produces, and how well the summary matches whatever instructions are provided — such as how to format the summary. You will also want to look at the cost of producing each summary and how long you have to wait for your finished product.
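Two of those criteria, cost and latency, are straightforward to instrument. Here is a minimal sketch, assuming a hypothetical summarize() function that returns its token counts; the per-token prices are placeholders, not real rates:

```python
# A minimal sketch of instrumenting cost and latency for a hypothetical
# summarize() that returns (summary, input_tokens, output_tokens).
import time

PRICE_PER_1K_INPUT = 0.005   # assumed $ per 1K input tokens
PRICE_PER_1K_OUTPUT = 0.015  # assumed $ per 1K output tokens

def measure(summarize, contract_text: str) -> dict:
    start = time.perf_counter()
    summary, n_in, n_out = summarize(contract_text)
    latency = time.perf_counter() - start  # seconds until the finished product
    cost = n_in / 1000 * PRICE_PER_1K_INPUT + n_out / 1000 * PRICE_PER_1K_OUTPUT
    return {"summary": summary, "latency_s": round(latency, 2), "cost_usd": round(cost, 4)}
```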

Evaluating the moral or ethical status of an application is also important. For instance, outputs should be assessed for obscene language, illegal or otherwise problematic suggestions, hate speech, and the promotion of violent behavior. Finally, when assessing models, it’s important to distinguish between what you want and what you actually need.

“Prompt engineering” is the art of crafting instructions in order to get the output you seek from your AI model.

Prompt engineering entails giving instructions to an AI to elicit a desired output. Well-refined prompts are the easiest and most efficient way to optimize an AI’s capabilities. However important and straightforward prompt engineering may be, it’s not sufficient in and of itself; it should be backed by statistics and by practices such as dataset curation.

“You can think of prompt engineering as human-to-AI communication: You communicate with AI models to get them to do what you want. Anyone can communicate, but not everyone can communicate effectively.”

A good prompt has at least one of three features — and most likely, all three. First, a prompt should contain a broad description of what you want the AI to do and the kind of output you seek. You might, for instance, want to find all the instances of a certain kind of language in a text. Next, you’ll want to provide relevant examples of the kinds of language you want the AI to find. Then you’ll need to tell the model its task: to examine a specific text and extract all the instances of that kind of language in that text. Needless to say, for a prompt to work properly, the model needs to be able to follow the prompt’s instructions. With this in mind, you might also want to review and evaluate your model’s instruction-following capabilities and craft your prompt accordingly. The amount of prompt engineering you will need will depend upon the quality and robustness of your AI model. You may have to experiment with your prompt’s phrasing just to see the outputs it elicits.
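Putting the three features together, a prompt might look like the following sketch; the “weasel word” extraction task and its examples are invented for illustration:

```python
# A sketch of the three-part prompt structure described above: a task
# description, examples, then the concrete task.
prompt_template = """You extract weasel words (vague, non-committal language) from text.

Examples of weasel wording:
- "Some experts say..."
- "It is widely believed that..."
- "arguably"

Task: List every instance of weasel wording in the text below.

Text:
{document}
"""

print(prompt_template.format(
    document="Critics argue the plan is arguably flawed. Some say it may fail."
))
```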

Context and context length are both important. Early GPT models had relatively little space for context, which circumscribed their output; they lacked the context space to generate, say, a university-level or scholarly essay. Context length has since gone up dramatically across AI model providers: over the course of just five years, the context available to models grew as much as 2,000-fold from GPT-2’s original limit. Still, more space for context doesn’t replace good prompt engineering practices. If you’re seeking a complex output, such as comments on an article or report, you’ll need to provide the model with the appropriate context. Otherwise, the AI will fall back on its own resources, which can be unreliable.
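When context space is limited, it helps to check whether a prompt fits before sending it. A small sketch using the tiktoken tokenizer, with an illustrative 8,192-token limit:

```python
# Sketch: checking that a prompt fits in the model's context window.
# The 8,192-token limit is illustrative; real limits vary by model.
import tiktoken

CONTEXT_LIMIT = 8192
enc = tiktoken.get_encoding("cl100k_base")

def fits_in_context(prompt: str, reserved_for_output: int = 1024) -> bool:
    """Leave room for the model's response when budgeting tokens."""
    return len(enc.encode(prompt)) + reserved_for_output <= CONTEXT_LIMIT
```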

AI models require both instructions and information.

If you want an AI model to complete a task, you need to provide it with instructions and adequate, appropriate contextual information. If a model lacks these elements, it’s more likely to make errors, or even “hallucinate,” meaning invent things out of thin air just to fill in the gaps. The two principal ways to build context are “retrieval-augmented generation” (RAG) and “agents.” RAG retrieves information from external data sources; agents can use tools, such as internet search, to gather relevant information. RAG is primarily useful for constructing contexts; autonomous agents are more versatile.

“Both RAG and agentic patterns are exciting because of the capabilities they bring to already powerful models. In a short amount of time, they’ve managed to capture the collective imagination, leading to incredible demos and products that convince many people that they are the future.”

The RAG approach accesses relevant information from a variety of sources: an independent database, your email exchanges over the past few years, or, more generally, the internet. RAG is useful for tasks that require more knowledge than can be fed into the model directly. Rather than passing everything along, RAG finds and retrieves only the information directly relevant to the query and supplies it to the model as input. This allows for detailed and informed query responses and reduces hallucinations.
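A minimal RAG sketch, with embed() and generate() standing in for a real embedding model and a real foundation model, might look like this:

```python
# A minimal RAG sketch: retrieve the documents most relevant to a query,
# then prepend them to the prompt before generating.
import numpy as np

def retrieve(query_vec, doc_vecs, docs, k=2):
    """Rank documents by cosine similarity to the query; return the top k."""
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    )
    top = np.argsort(sims)[::-1][:k]
    return [docs[i] for i in top]

def answer(query, docs, doc_vecs, embed, generate):
    """Augment the prompt with retrieved context before generating."""
    context = "\n\n".join(retrieve(embed(query), doc_vecs, docs))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```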

Agents or “intelligent agents,” on the other hand, are, in a way, AI’s ultimate aim, and new developments in foundation models have made them viable. An agent is anything that can perceive and interact with its environment. An agent’s environment will be shaped by the way it’s used. An agent in a game will engage with its specific game environment; other agents will interact with datasets in specific domains. The range of actions an agent can perform will depend on the tools it’s provided with — such as the ability to search the internet. Tools and environments are dependent on one another. Some tools can only be used in specific environments: If a robot’s only tool is its ability to swim, you can’t use it on dry land.
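In code, an agent is essentially a loop in which the model chooses a tool, observes the result, and continues until it can answer. This sketch assumes a hypothetical model() that returns a dictionary describing either a tool call or a final answer:

```python
# A sketch of a simple agent loop: at each step the model either calls
# a tool or returns a final answer. model() and the tools dict are
# placeholders for a real foundation model and real tool integrations.
def run_agent(model, tools: dict, task: str, max_steps: int = 5):
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        action = model("\n".join(history))  # e.g. {"tool": "search", "input": "..."}
        if "final_answer" in action:
            return action["final_answer"]
        observation = tools[action["tool"]](action["input"])  # act on the environment
        history.append(f"Used {action['tool']}: {observation}")  # perceive the result
    return "Stopped after reaching max_steps."
```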

Models can be adapted for specific tasks.

RAG and agent systems require prompts, and they also demand vast amounts of information, which can overwhelm a system’s memory capacity and may require introducing a memory system to manage it. Nonetheless, RAG and agents don’t actually change the model itself. But models can be modified or fine-tuned for specific tasks or industries, like medicine, via additional training. Adapting a model can be thought of as a way to access knowledge your foundation model already has but that is difficult to reach. Either model developers or application developers can undertake this process.

“Fine-tuning can enhance various aspects of a model. It can improve the model’s domain-specific capabilities, such as coding or medical question answering, and can also strengthen its safety.”

Compared with prompt-based approaches like RAG, customized foundation models often demand more up-front investment, in part because adaptation often requires more memory than a single computer can handle. Nonetheless, memory use can be optimized. The most popular way to make fine-tuning memory-efficient is “parameter-efficient fine-tuning,” or PEFT: inserting a small number of extra parameters into the model at key places and training only those, which reduces both the number of trainable parameters and the number of samples needed for good performance.
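For instance, LoRA, a widely used PEFT method, adds small low-rank adapter matrices to chosen layers and trains only those. A hedged sketch using the Hugging Face peft library, with GPT-2 as a small stand-in model and illustrative hyperparameters:

```python
# A sketch of LoRA-style PEFT with the Hugging Face peft library.
# Model choice and hyperparameters are illustrative, not prescriptive.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")  # small stand-in model

config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,                        # rank of the low-rank adapter matrices
    lora_alpha=16,              # scaling factor for the adapter output
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only a fraction of 1% is trainable
```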

An important concept in adapting foundation models memory-efficiently is “transfer learning,” which ideally allows you to avoid purchasing additional training data. If a foundation model has been trained on many different datasets, that knowledge transfers to the model’s customized adaptation, so the model can learn and be customized with fewer examples. Training a model from scratch can demand countless examples; with transfer learning, you can leverage a good base model instead.
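In its simplest form, transfer learning means freezing a pretrained body and training only a small task-specific head. A PyTorch sketch, assuming the body outputs a feature vector of size hidden_dim:

```python
# Transfer learning in miniature: freeze the pretrained body and train
# only a new task-specific head. hidden_dim and n_classes are illustrative.
import torch.nn as nn

def adapt(pretrained_body: nn.Module, hidden_dim: int, n_classes: int) -> nn.Module:
    for param in pretrained_body.parameters():
        param.requires_grad = False           # reuse what the base already learned
    head = nn.Linear(hidden_dim, n_classes)   # only this layer gets trained
    return nn.Sequential(pretrained_body, head)
```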

A model is only as good as its training data.

A model performs brilliantly when it has good data. But no matter the quality of your staff, and no matter how astronomical your computing capacity, you’re not going to get very far without quality data. The purpose of “dataset engineering” is to establish a dataset that can train the best — and most customized — model possible, within budget constraints. Models are becoming more customized for specific and complex tasks, which means investment in data and the skilled people required to work on the model is increasing. Indeed, AI is becoming less focused on models and more intensively focused on the data used.

“Data-centric AI tries to improve AI performance by enhancing the data. This involves developing new data processing techniques and creating high-quality datasets that allow better models to be trained with fewer resources.”

Data isn’t everything, but quality data can make your model work better, faster, and in more contexts. Low-quality data increases errors and biases. In order to properly select your data, you need to know how your model works. The people who develop your dataset should work closely with the people responsible for the model and applications. Those training the model should also be the people responsible for selecting the dataset. The dataset will, of course, be specific to the tasks you want your model to perform. The data you train your model on should reflect the behaviors and characteristics of what you want your model to learn. Keep in mind that modest amounts of high-quality data will serve you better than massive amounts of low-quality data.
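Two common dataset-engineering steps, exact deduplication and a crude quality filter, can be sketched as follows (the minimum-length threshold is arbitrary):

```python
# Sketch of two common dataset-engineering steps: exact deduplication
# and a simple quality filter that drops very short fragments.
def clean(examples: list[str], min_words: int = 5) -> list[str]:
    seen = set()
    kept = []
    for text in examples:
        key = " ".join(text.lower().split())       # normalize case and whitespace
        if key in seen or len(text.split()) < min_words:
            continue                               # drop duplicates and fragments
        seen.add(key)
        kept.append(text)
    return kept
```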

About the Author

Chip Huyen is a writer and computer scientist who works at the intersection of AI, data, and storytelling. She has worked with Snorkel AI and NVIDIA, founded an AI infrastructure start-up, and taught machine learning systems design at Stanford University.