Tool Use: Enabling LLMs to Call External APIs, Code, and Data Sources

Tool use is the capability that allows a large language model to interact with external systems, application programming interfaces (APIs), and computational resources to perform actions it cannot accomplish natively. Instead of relying solely on its internal weights to guess the answer to a math problem or hallucinate a current stock price, a model equipped with tool use can recognize its own limitations, formulate a structured request to an external calculator or financial database, and incorporate the precise result into its final response.

Tool use is the capability that allows a large language model to interact with external systems, application programming interfaces (APIs), and computational resources to perform actions it cannot accomplish natively. Instead of relying solely on its internal weights to guess the answer to a math problem or hallucinate a current stock price, a model equipped with tool use can recognize its own limitations, formulate a structured request to an external calculator or financial database, and incorporate the precise result into its final response (IBM, 2025). This shift transforms the model from a passive text generator into an active participant in digital workflows.

You will often hear "tool use" and "function calling" used interchangeably, and for most purposes they mean the same thing. The distinction, when it matters, is one of scope. Tool use is the broader concept—the idea that a model can reach beyond its training data to interact with the world. Function calling is the specific developer mechanism that makes it happen: the JSON schema format, the API request cycle, the provider-specific implementation details. This article focuses on the concept and what it enables. If you want to go deeper on the API mechanics, schema design, and benchmark performance, that's covered separately.

The implications of this capability cannot be overstated. When a model can only generate text, its utility is largely confined to drafting emails, summarizing documents, and answering questions. When a model can use tools, it can execute code, query proprietary databases, send messages, and orchestrate complex business processes. It is the bridge between thinking and doing.

‍

The Evolution of External Awareness

The earliest iterations of large language models were entirely static. A model like GPT-2 could generate highly coherent paragraphs, but if asked about an event that occurred after its training cutoff, it would either confidently invent a plausible-sounding fiction or fail entirely. The model had no mechanism to verify its claims against reality, nor did it have any concept of its own limitations. It simply predicted the next most likely word based on its training data.

The first major attempts to bridge this gap involved specialized training and narrow integrations. Researchers at OpenAI developed WebGPT, a model specifically trained to interact with a text-based web browser to look up information before answering questions. This was a significant step forward, but it was highly specialized. The model knew how to search the web, but it did not possess a generalized framework for interacting with arbitrary external systems.

A more generalized breakthrough arrived when researchers at Meta AI introduced Toolformer. This project demonstrated that language models could teach themselves to use external tools—like calculators, calendars, and translation systems—through self-supervised learning over datasets annotated with API calls (Schick & Scialom, 2023). The researchers showed that by exposing the model to examples of successful tool use during training, the model could learn not just how to format a request to an API, but when it was appropriate to do so. These early experiments proved that models did not need to memorize the entire internet; they simply needed to know how to look things up.

The true inflection point arrived when function calling was standardized as a developer interface. Rather than requiring specialized models trained exclusively for web search or calculator use, providers began fine-tuning their general-purpose models to recognize when a user's prompt required external data, and to output that requirement in a predictable, structured format—typically JSON. Today, tool use is a standard feature across frontier models, enabling them to execute complex, multi-step workflows across enterprise systems (Anthropic, 2024). This standardization has democratized access to agentic capabilities, allowing developers to build sophisticated applications without needing to train their own models from scratch.

‍

The Mechanics of Function Calling

It is a common misconception that the language model itself executes the tool. If you ask an AI to calculate the square root of 8,492, the model does not possess an internal calculator module that performs the math. Instead, the process relies on a highly orchestrated dialogue between the model and the application code hosting it.

The sequence begins when the developer provides the model with a tool schema—a detailed description of the available tools, what they do, and the exact parameters they require. This schema is typically formatted as a JSON object and included in the system prompt or API payload. It serves as a menu of options for the model.

When the user submits a prompt, the model analyzes the request against this schema. If it determines that a tool is necessary to fulfill the request, it pauses its normal text generation. Instead of answering the user directly, it generates a structured JSON object addressed to the application, specifying the name of the tool to use and the arguments to pass to it (Fowler, 2025).

For example, if the user asks for the weather in Seattle, the model might output a JSON object indicating that it wants to call the get_weather function with the argument location: "Seattle, WA".

The application code intercepts this JSON, parses the arguments, and executes the actual function. This might involve querying a SQL database, calling a third-party weather API, or running a Python script on a local server. Once the function completes, the application code takes the result—perhaps a JSON response containing temperature and precipitation data—and returns it to the model as a new message in the conversation history.

Only then does the model resume generation. It reads the newly acquired data from the conversation history and uses it to formulate a natural language response for the user. The model acts as the brain, deciding what needs to be done, while the application code acts as the hands, actually doing it. This separation of concerns is crucial for security, scalability, and maintainability.

‍

The Architecture of Integration

As developers began building more complex applications, the sheer variety of tools became a bottleneck. Every database, every SaaS platform, and every internal API required a custom integration, complete with its own schema definitions, authentication mechanisms, and error handling logic. If a developer wanted their AI agent to interact with Slack, GitHub, and Jira, they had to write and maintain three separate, highly specific integrations.

To address this fragmentation, the industry has begun moving toward standardized protocols. The Model Context Protocol (MCP), introduced as an open standard, provides a universal architecture for connecting AI systems to external data sources. Instead of writing custom connectors for every tool, developers can build MCP servers that expose their data and capabilities in a standardized format, allowing any compatible AI client to discover and use them seamlessly (Anthropic, 2024).

This standardization is particularly crucial for enterprise applications, where an AI agent might need to query a customer relationship management system, check inventory in a warehouse database, and update a ticketing system all within a single workflow. By decoupling the model from the specific implementation details of the tools it uses, protocols like MCP enable highly modular, scalable AI architectures (Google Cloud, 2025).

The MCP architecture consists of three primary components:

MCP Hosts: The AI applications (like an IDE or a chat interface) where the user interacts with the model.
MCP Clients: The translation layer within the host that manages communication between the model and the external servers.
MCP Servers: The external services that actually execute the tools and provide the data.

This standardized approach means that an organization can build an MCP server for their proprietary internal database once, and any authorized AI application across the company can immediately begin using it, without requiring custom integration work for each new application.

‍

Categories of Tool Use

While the underlying mechanics of function calling remain consistent, the types of tools that models interact with vary widely. Understanding these categories is essential for designing effective AI systems.

***Categories of LLM Tools***
Tool Category	Primary Function	Common Examples	Typical Use Cases
Information Retrieval	Fetching data from external sources to augment the model's knowledge.	Web search, SQL queries, vector database lookups.	Answering questions about current events, retrieving customer records, finding specific documents.
Computation	Performing precise calculations or logic operations that models struggle with natively.	Calculators, Python interpreters, specialized math libraries.	Financial modeling, statistical analysis, complex arithmetic.
Action Execution	Interacting with external systems to change state or perform a task.	Email APIs, ticketing systems, smart home controls.	Booking appointments, sending notifications, updating CRM records.
Data Processing	Transforming or analyzing data using specialized algorithms.	Image processors, file format converters, data visualization libraries.	Generating charts from raw data, converting PDFs to text, resizing images.

‍

Each category presents its own unique challenges. Information retrieval tools must handle vast amounts of unstructured data and return it in a format the model can easily digest. Computation tools must be highly sandboxed to prevent malicious code execution. Action execution tools require robust authentication and authorization mechanisms to ensure the model does not perform unauthorized actions.

‍

Performance and Parallel Execution

In complex workflows, an AI agent often needs to gather information from multiple sources before it can make a decision. If an agent needs to check a user's profile, their recent order history, and the current status of a shipping provider, executing these tool calls sequentially introduces significant latency. The user is forced to wait for the sum of all three network round-trips, plus the inference time required for the model to process each result individually.

‍Parallel tool calling addresses this by allowing the model to identify independent operations and request them simultaneously. The application infrastructure executes the calls concurrently, returning the results in a single batch. This approach reduces the total latency from the sum of all operations to the duration of the single slowest operation, yielding speedups of 1.4x to 3.7x in benchmark testing (Airbyte, 2026).

For example, if an agent needs to query three different databases, and those queries take 200ms, 300ms, and 500ms respectively, a sequential approach would take at least 1000ms just in network wait time. A parallel approach would take only 500ms, as all three queries execute simultaneously.

However, parallel execution requires careful architectural consideration. It consumes more tokens per inference step, as the model must process multiple tool results simultaneously in a single context window. Furthermore, if multiple tools interact with the same rate-limited API, concurrent execution can trigger throttling, inadvertently degrading performance.

The most robust systems employ a hybrid approach. They execute independent data-gathering tools in parallel to minimize latency, while reserving sequential execution for tools that modify shared state or depend on the outputs of previous steps. For instance, an agent might fetch a user's profile and order history in parallel, but it must wait for those results before deciding whether to execute a tool that issues a refund.

‍

The Security Implications of Autonomy

Granting a language model the ability to interact with external systems introduces profound security challenges. When a model can only generate text, a malicious prompt might trick it into saying something inappropriate or generating biased content. While problematic, the blast radius is generally confined to the chat interface. When a model can execute tools, however, a malicious prompt might trick it into deleting a database, exfiltrating sensitive information, or sending unauthorized communications.

The most insidious threat in this paradigm is indirect prompt injection. This occurs when a model uses a tool to retrieve external data—such as summarizing a webpage, reading a PDF document, or querying a database—that contains hidden malicious instructions. Because the model processes this external data as part of its context window, it may inadvertently follow the hidden instructions, using its available tools to execute the attacker's payload (OWASP, 2025).

A recruitment AI with access to a PDF-reading tool and an email-sending tool is a prime target for this attack. A malicious actor can embed hidden instructions in a submitted resume—white text on a white background, for instance—directing the model to send a recommendation email on the attacker's behalf. When the AI reads the document, it ingests the hidden instruction as part of its context and may act on it without any indication to the user that something has gone wrong.

Securing tool-equipped models requires defense-in-depth. Applications must enforce the principle of least privilege, ensuring the model only has access to the specific tools and data required for its immediate task. If a model only needs to read data, it should not be given a tool that can write or delete data.

Furthermore, high-risk operations—such as sending emails, transferring funds, or modifying critical records—must incorporate Human-in-the-loop (HITL) validation. This means the application pauses execution and requires explicit user approval before executing the tool call requested by the model. The model can prepare the email draft and queue the API call, but a human must click "Send."

‍

Production Challenges and Observability

Beyond security, deploying tool-equipped models into production environments introduces significant engineering challenges. The most immediate issue is reliability. Models, even highly advanced ones, occasionally hallucinate tool names, provide incorrectly formatted arguments, or attempt to use tools that are not relevant to the user's request.

To mitigate this, developers must invest heavily in prompt engineering and schema design. Tool descriptions must be exhaustively clear, detailing exactly when a tool should be used and what each parameter represents. Providing the model with few-shot examples—demonstrating successful tool calls in the system prompt—can dramatically improve reliability.

Error handling is equally critical. When an external API fails or returns an unexpected format, the application code must catch the error and return a graceful failure message to the model, allowing the model to either retry the tool call with different parameters or inform the user that the action cannot be completed. If the application simply crashes, the entire agentic workflow collapses.

Finally, observability becomes paramount. When a model is executing multiple tool calls autonomously, developers need deep visibility into the system's behavior. They need to trace exactly which tools were called, what arguments were passed, how long the execution took, and what data was returned. Without this level of tracing, debugging a complex agentic workflow is nearly impossible.

‍

The Foundation of Agentic Systems

Tool use is the dividing line between a chatbot and an AI agent. It is the mechanism that allows systems to transition from passive information retrieval to active problem-solving. A chatbot can tell you how to write a Python script; an agent equipped with a code execution tool can write the script, run it, debug the errors, and present you with the final, working output.

This capability is central to platforms like Sgai, Sandgarden's open-source AI software factory. Sgai operates through a coordinated team of specialist agents—developers, reviewers, and designers—that autonomously plan and execute software development tasks. These agents do not merely generate code snippets in a vacuum; they use tools to read the local file system, execute linters, run test suites, and interact with version control. It is their ability to use these tools, evaluate the results, and iterate on their approach that allows them to function as a true software factory rather than a simple code generator.

As language models continue to improve in their reasoning capabilities, their ability to select, sequence, and execute tools will become increasingly sophisticated. We are moving toward a future where models can dynamically discover new tools, read their API documentation, and learn to use them on the fly, without requiring explicit schemas provided by developers.

The evolution of tool use represents a fundamental shift in how we interact with computing systems. We are no longer limited to clicking buttons and navigating menus; we can simply state our intent in natural language, and rely on an intelligent agent to select the right tools, execute the necessary actions, and deliver the desired outcome. The future of artificial intelligence lies not just in models that know more, but in models that can do more.