Foundational Large Language Models (LLMs) are very much “jack-of-all-trades, master-of-none” entities; they can do a lot decently but often stumble on tasks requiring specialized or obscure knowledge.

Fine-tuning foundational LLMs can, of course, help improve their ability to handle esoteric tasks, but fine-tuning a separate model for every narrow task under the sun would be costly, and we’d still need some sort of “brain” logic to decide which fine-tuned model to use in which circumstance. To become more useful to us (and more efficient), LLMs need to use tools. Often taking the form of APIs, tools allow LLMs to do things like call other models, retrieve information, interpret code, act on the physical world, and more.

Shortly after ChatGPT brought LLMs’ potential (and shortcomings) into the mainstream, researchers and the open-source community began investigating how to extend the capabilities of LLMs with external tools. A few such efforts are:

  1. OpenAI’s addition of plugins to ChatGPT

  2. The LangChain ecosystem’s many options for augmenting LLMs with external tools

  3. LLM agent frameworks designed for LLMs to semi-autonomously employ tools to accomplish human-defined tasks

  4. Toolformer, which showed that LLMs can teach themselves to use external APIs

  5. Berkeley and Microsoft Research’s Gorilla, an LLM trained specifically to call APIs

  6. ToolkenGPT, which turns tools themselves into tokens (tool + token == “toolken”), embedding tools much as LLMs embed word and subword tokens

Needless to say, excitement about improving LLMs’ abilities via tool augmentation is running high. With all this experimentation, though, it’d be nice to test how well (or poorly) LLMs actually employ tools, which is why, in April 2023, Li et al. created API-Bank, the first benchmark for testing tool-augmented LLMs. Specifically, API-Bank tests LLMs’ abilities to find APIs relevant to a user-defined goal and then plan and execute API calls to accomplish that goal.

LLM Tool-Augmentation Approaches

There are two main approaches toward augmenting LLMs with tools:

  1. In-context learning (i.e., showing a pre-trained model examples of how to use tools)

  2. Fine-tuning (i.e., further training a pre-trained LLM on annotated tool-related data, such as API documentation)

A shortcoming of in-context learning approaches is that examples must remain within an LLM’s context window, which may be too short to provide sufficient examples. Aware of this weakness, Li et al. designed API-Bank as an in-context learning tool-augmentation approach that overcomes context length limitations via the following three key components (sketched in code below the list):

  1. An “API Pool” of APIs for various tasks

  2. A keyword-based search engine, “ToolSearch”, that retrieves relevant APIs from API Pool

  3. Prompts explaining to an LLM a user-defined task and how to employ ToolSearch
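
To make these components concrete, here’s a minimal Python sketch. Everything in it (the API entries, the keyword sets, the tool_search helper, the prompt text) is invented for illustration and is far simpler than what API-Bank actually provides.

```python
# Toy stand-ins for API-Bank's three components. All names and data here are
# invented; the real API Pool, ToolSearch, and prompts are much richer.

# 1. The "API Pool": a collection of APIs, each with its documentation.
API_POOL = [
    {
        "name": "FlightSearch",
        "description": "Find flights between two cities on a given date.",
        "keywords": {"flight", "travel", "trip", "book"},
        "input_parameters": {"origin": "str", "destination": "str", "date": "str"},
        "output_parameters": {"flights": "list"},
    },
    {
        "name": "WeatherForecast",
        "description": "Return the weather forecast for a city.",
        "keywords": {"weather", "sunny", "forecast", "temperature"},
        "input_parameters": {"city": "str", "date": "str"},
        "output_parameters": {"forecast": "str"},
    },
]


# 2. "ToolSearch": a keyword-based search engine over the API Pool.
def tool_search(query_keywords: set) -> dict:
    """Return the API whose keywords overlap most with the query keywords."""
    return max(API_POOL, key=lambda api: len(api["keywords"] & query_keywords))


# 3. A prompt telling the LLM about the user's task and how to use ToolSearch
#    (heavily abridged relative to API-Bank's real prompts).
PROMPT_TEMPLATE = (
    "You may call ToolSearch(keywords) to find an API relevant to the user's "
    "request, then call that API with its documented parameters.\n"
    "User request: {request}"
)

print(tool_search({"book", "trip", "sunny"})["name"])  # -> FlightSearch
```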

Is a Tool Even Necessary?

Here’s how these components interact: Given a user request (e.g., book me a trip somewhere sunny), an LLM’s first step is determining if it can fulfill that request with its own internal knowledge or if calling APIs would be a better approach, meaning the LLM needs some sense of its own “known unknowns.”

At this stage, the LLM has three options:

  1. It can respond using its internal knowledge

  2. It can ask the user for additional clarification

  3. It can go ahead with an API call (other forms of tool use exist, like writing a function from scratch, but API-Bank, as its name suggests, only tests LLMs’ ability to use APIs); a minimal sketch of this decision step follows the list.
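
Here’s a rough sketch of that first decision step, assuming a hypothetical ask_llm helper that wraps whatever chat-completion endpoint you’re using; the decision labels are invented, not API-Bank’s exact format.

```python
# Sketch of the LLM's first decision: answer directly, ask for clarification,
# or move on to an API call. ask_llm is a hypothetical wrapper around any
# chat-completion endpoint; here it returns a canned answer so the code runs.

def ask_llm(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    return "CALL_API"  # pretend the model decided it needs a tool


def handle_request(user_request: str) -> str:
    decision = ask_llm(
        "Decide how to handle this request. Reply with exactly one of "
        "ANSWER_DIRECTLY, ASK_CLARIFICATION, or CALL_API.\n"
        f"Request: {user_request}"
    )
    if decision == "ANSWER_DIRECTLY":    # option 1: internal knowledge
        return ask_llm(f"Answer from your own knowledge: {user_request}")
    if decision == "ASK_CLARIFICATION":  # option 2: ask the user
        return ask_llm(f"Ask one clarifying question about: {user_request}")
    return "proceed to ToolSearch"       # option 3: the API-calling branch


print(handle_request("Book me a trip somewhere sunny."))  # -> proceed to ToolSearch
```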

Digging into the Toolbox

An LLM’s next step is finding the right tool for the right job (i.e., the most suitable API for the user’s request). To do this, before making any API calls, the LLM summarizes a user request into a handful of keywords and then inputs these keywords into ToolSearch (the API search engine), which queries the API Pool (the collection of available APIs) to find the most relevant API. After that, the LLM receives the candidate API’s documentation (the API’s function description and its input and output parameters).
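
Continuing the sketch (and reusing API_POOL, tool_search, and ask_llm from the snippets above), the search step might look like the following. The keyword extraction is faked with a trivial word split purely so the example runs; in API-Bank, the LLM itself produces the keywords.

```python
# Sketch of the search step: summarize the request into keywords, query
# ToolSearch, and hand the winning API's documentation back to the LLM.
# Reuses API_POOL, tool_search, and ask_llm from the earlier snippets.

def summarize_to_keywords(user_request: str) -> set:
    """Stand-in for the LLM's keyword summary: a trivial word split."""
    return {word.strip(".,!?").lower() for word in user_request.split()}


def find_candidate_api(user_request: str) -> dict:
    keywords = summarize_to_keywords(user_request)  # request -> keywords
    candidate = tool_search(keywords)               # ToolSearch -> best match
    # The LLM then receives the candidate's documentation: its description
    # plus its input and output parameters.
    print(candidate["name"], candidate["description"], candidate["input_parameters"])
    return candidate


find_candidate_api("Book me a trip somewhere sunny.")  # -> FlightSearch docs
```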

Evaluating Tools

With an API’s documentation in hand, the LLM then decides whether that API looks worth a try or whether it’s back to the drawing board (i.e., tweak the keywords a bit, search for a new API, check its documentation, and repeat the cycle). The LLM has the option of throwing in the towel here, giving the user an LLM version of the blue screen of death (e.g., “As a large language model, I cannot…”), but if the LLM finds an API suited to the user’s task, it calls that API, and the decision tree branches out a few more times.
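
A rough version of that evaluate-and-retry loop, again building on the earlier snippets. The USE/keyword-suggestion protocol is invented, and a real implementation would parse the LLM’s reply far more carefully.

```python
# Sketch of the evaluate-and-retry loop: show the LLM the candidate API's
# documentation, then let it accept the API, tweak the keywords and search
# again, or give up after a few attempts.

def choose_api(user_request: str, max_attempts: int = 3):
    keywords = summarize_to_keywords(user_request)
    for _ in range(max_attempts):
        candidate = tool_search(keywords)
        verdict = ask_llm(
            f"Request: {user_request}\nCandidate API docs: {candidate}\n"
            "Reply USE if this API fits, otherwise suggest better search keywords."
        )
        if verdict == "USE":
            return candidate                       # found a suitable API
        keywords = summarize_to_keywords(verdict)  # tweak keywords and retry
    return None  # throw in the towel: "As a large language model, I cannot…"
```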

Calibrating Tools

An API might return results relevant to the user’s request, in which case the LLM passes those on to the user. But an API might also return an exception (i.e., an error message). When it encounters an exception, the LLM ideally uses the exception message to modify the API call and try again. Should that fail, the LLM can inform the user that it can’t solve the task with the available APIs.
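
Finally, a sketch of that call-and-retry step, still building on the snippets above. The call_api placeholder and the correction prompt are invented, and a real implementation would parse the LLM’s corrected arguments rather than stuffing them into a dict.

```python
# Sketch of the call-and-retry step: call the chosen API, and if it raises an
# exception, feed the error message back to the LLM so it can fix the call.

def call_api(api: dict, arguments: dict) -> dict:
    """Placeholder for actually invoking the chosen API."""
    raise ValueError("date must be in YYYY-MM-DD format")  # simulated exception


def execute_call(api: dict, arguments: dict, retries: int = 2) -> str:
    for _ in range(retries + 1):
        try:
            result = call_api(api, arguments)
            return ask_llm(f"Summarize this result for the user: {result}")
        except Exception as exc:
            # Hand the exception message to the LLM and ask for a corrected call.
            corrected = ask_llm(
                f"The call to {api['name']} failed with: {exc}\n"
                f"Arguments were: {arguments}\nReturn corrected arguments."
            )
            arguments = {"raw": corrected}  # real code would parse this properly
    return "Sorry, I can't solve this task with the available APIs."
```

Below is a diagram of all of API-Bank’s components and decision flow: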