LLMs and Tool Use

Last March, just two weeks after GPT-4 was released, researchers at Microsoft quietly announced a plan to compile millions of APIs—tools that can do everything from ordering a pizza to solving physics equations to controlling the TV in your living room—into a compendium that would be made accessible to large language models (LLMs). This was just one milestone in the race across industry and academia to find the best ways to teach LLMs how to manipulate tools, which would supercharge the potential of AI more than any of the impressive advancements we’ve seen to date.

The Microsoft project aims to teach AI how to use any and all digital tools in one fell swoop, a clever and efficient approach. Today, LLMs can do a pretty good job of recommending pizza toppings to you if you describe your dietary preferences and can draft dialog that you could use when you call the restaurant. But most AI tools can’t place the order, not even online. In contrast, Google’s seven-year-old Assistant tool can synthesize a voice on the telephone and fill out an online order form, but it can’t pick a restaurant or guess your order. By combining these capabilities, though, a tool-using AI could do it all. An LLM with access to your past conversations and tools like calorie calculators, a restaurant menu database, and your digital payment wallet could feasibly judge that you are trying to lose weight and want a low-calorie option, find the nearest restaurant with toppings you like, and place the delivery order. If it has access to your payment history, it could even guess at how generously you usually tip. If it has access to the sensors on your smartwatch or fitness tracker, it might be able to sense when your blood sugar is low and order the pie before you even realize you’re hungry.

Perhaps the most compelling potential applications of tool use are those that give AIs the ability to improve themselves. Suppose, for example, you asked a chatbot for help interpreting some facet of ancient Roman law that no one had thought to include examples of in the model’s original training. An LLM empowered to search academic databases and trigger its own training process could fine-tune its understanding of Roman law before answering. Access to specialized tools could even help a model like this better explain itself. While LLMs like GPT-4 already do a fairly good job of explaining their reasoning when asked, these explanations emerge from a “black box” and are vulnerable to errors and hallucinations. But a tool-using LLM could dissect its own internals, offering empirical assessments of its own reasoning and deterministic explanations of why it produced the answer it did.

If given access to tools for soliciting human feedback, a tool-using LLM could even generate specialized knowledge that isn’t yet captured on the web. It could post a question to Reddit or Quora or delegate a task to a human on Amazon’s Mechanical Turk. It could even seek out data about human preferences by doing survey research, either to provide an answer directly to you or to fine-tune its own training to be able to better answer questions in the future. Over time, tool-using AIs might start to look a lot like tool-using humans. An LLM can generate code much faster than any human programmer, so it can manipulate the systems and services of your computer with ease. It could also use your computer’s keyboard and cursor the way a person would, allowing it to use any program you do. And it could improve its own capabilities, using tools to ask questions, conduct research, and write code to incorporate into itself.

It’s easy to see how this kind of tool use comes with tremendous risks. Imagine an LLM being able to find someone’s phone number, call them and surreptitiously record their voice, guess what bank they use based on the largest providers in their area, impersonate them on a phone call with customer service to reset their password, and liquidate their account to make a donation to a political party. Each of these tasks invokes a simple tool—an Internet search, a voice synthesizer, a bank app—and the LLM scripts the sequence of actions using the tools.

We don’t yet know how successful any of these attempts will be. As remarkably fluent as LLMs are, they weren’t built specifically for the purpose of operating tools, and it remains to be seen how their early successes in tool use will translate to future use cases like the ones described here. As such, giving the current generative AI sudden access to millions of APIs—as Microsoft plans to—could be a little like letting a toddler loose in a weapons depot.

Companies like Microsoft should be particularly careful about granting AIs access to certain combinations of tools. Access to tools to look up information, make specialized calculations, and examine real-world sensors all carry a modicum of risk. The ability to transmit messages beyond the immediate user of the tool or to use APIs that manipulate physical objects like locks or machines carries much larger risks. Combining these categories of tools amplifies the risks of each.

The operators of the most advanced LLMs, such as OpenAI, should continue to proceed cautiously as they begin enabling tool use and should restrict uses of their products in sensitive domains such as politics, health care, banking, and defense. But it seems clear that these industry leaders have already largely lost their moat around LLM technology—open source is catching up. Recognizing this trend, Meta has taken an “If you can’t beat ’em, join ’em” approach and partially embraced the role of providing open source LLM platforms.

On the policy front, national—and regional—AI prescriptions seem futile. Europe is the only significant jurisdiction that has made meaningful progress on regulating the responsible use of AI, but it’s not entirely clear how regulators will enforce it. And the US is playing catch-up and seems destined to be much more permissive in allowing even risks deemed “unacceptable” by the EU. Meanwhile, no government has invested in a “public option” AI model that would offer an alternative to Big Tech that is more responsive and accountable to its citizens.

Regulators should consider what AIs are allowed to do autonomously, like whether they can be assigned property ownership or register a business. Perhaps more sensitive transactions should require a verified human in the loop, even at the cost of some added friction. Our legal system may be imperfect, but we largely know how to hold humans accountable for misdeeds; the trick is not to let them shunt their responsibilities to artificial third parties. We should continue pursuing AI-specific regulatory solutions while also recognizing that they are not sufficient on their own.

We must also prepare for the benign ways that tool-using AI might impact society. In the best-case scenario, such an LLM may rapidly accelerate a field like drug discovery, and the patent office and FDA should prepare for a dramatic increase in the number of legitimate drug candidates. We should reshape how we interact with our governments to take advantage of AI tools that give us all dramatically more potential to have our voices heard. And we should make sure that the economic benefits of superintelligent, labor-saving AI are equitably distributed.

We can debate whether LLMs are truly intelligent or conscious, or have agency, but AIs will become increasingly capable tool users either way. Some things are greater than the sum of their parts. An AI with the ability to manipulate and interact with even simple tools will become vastly more powerful than the tools themselves. Let’s be sure we’re ready for them.

This essay was written with Nathan Sanders, and previously appeared on Wired.com.

Tags: artificial intelligence, essays, LLM