How to call tools with multimodal data
This guide assumes familiarity with the following concepts:
Here we demonstrate how to call tools with multimodal data, such as images.
Some multimodal models, such as those that can reason over images or audio, support tool calling features as well.
To call tools using such models, simply bind tools to them in the usual way, and invoke the model using content blocks of the desired type (e.g., containing image data).
Below, we demonstrate examples using OpenAI and Anthropic. We will use the same image and tool in all cases. Let’s first select an image, and build a placeholder tool that expects as input the string “sunny”, “cloudy”, or “rainy”. We will ask the models to describe the weather in the image.
The tool
function available in @langchain/core
version 0.2.7 and
above.
DynamicStructuredTool
should be used in older versions.
import { tool } from "@langchain/core/tools";
import { z } from "zod";
const imageUrl =
"https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg";
const weatherTool = tool(
async ({ weather }) => {
console.log(weather);
return weather;
},
{
name: "multiply",
description: "Describe the weather",
schema: z.object({
weather: z.enum(["sunny", "cloudy", "rainy"]),
}),
}
);
OpenAI
For OpenAI, we can feed the image URL directly in a content block of type “image_url”:
import { HumanMessage } from "@langchain/core/messages";
import { ChatOpenAI } from "@langchain/openai";
const model = new ChatOpenAI({
model: "gpt-4o",
}).bindTools([weatherTool]);
const message = new HumanMessage({
content: [
{
type: "text",
text: "describe the weather in this image",
},
{
type: "image_url",
image_url: {
url: imageUrl,
},
},
],
});
const response = await model.invoke([message]);
console.log(response.tool_calls);
[
{
name: "multiply",
args: { weather: "sunny" },
id: "call_ZaBYUggmrTSuDjcuZpMVKpMR"
}
]
Note that we recover tool calls with parsed arguments in LangChain’s standard format in the model response.
Anthropic
For Anthropic, we can format a base64-encoded image into a content block of type “image”, as below:
import * as fs from "node:fs/promises";
import { ChatAnthropic } from "@langchain/anthropic";
import { HumanMessage } from "@langchain/core/messages";
const imageData = await fs.readFile("../../data/sunny_day.jpeg");
const model = new ChatAnthropic({
model: "claude-3-sonnet-20240229",
}).bindTools([weatherTool]);
const message = new HumanMessage({
content: [
{
type: "text",
text: "describe the weather in this image",
},
{
type: "image_url",
image_url: {
url: `data:image/jpeg;base64,${imageData.toString("base64")}`,
},
},
],
});
const response = await model.invoke([message]);
console.log(response.tool_calls);
[
{
name: "multiply",
args: { weather: "sunny" },
id: "toolu_01HLY1KmXZkKMn7Ar4ZtFuAM"
}
]
Google Generative AI
For Google GenAI, we can format a base64-encoded image into a content block of type “image”, as below:
import { ChatGoogleGenerativeAI } from "@langchain/google-genai";
import axios from "axios";
import {
ChatPromptTemplate,
MessagesPlaceholder,
} from "@langchain/core/prompts";
import { HumanMessage } from "@langchain/core/messages";
const axiosRes = await axios.get(imageUrl, { responseType: "arraybuffer" });
const base64 = btoa(
new Uint8Array(axiosRes.data).reduce(
(data, byte) => data + String.fromCharCode(byte),
""
)
);
const model = new ChatGoogleGenerativeAI({
model: "gemini-1.5-pro-latest",
}).bindTools([weatherTool]);
const prompt = ChatPromptTemplate.fromMessages([
["system", "describe the weather in this image"],
new MessagesPlaceholder("message"),
]);
const response = await prompt.pipe(model).invoke({
message: new HumanMessage({
content: [
{
type: "media",
mimeType: "image/jpeg",
data: base64,
},
],
}),
});
console.log(response.tool_calls);
[ { name: 'multiply', args: { weather: 'sunny' } } ]
Audio input
Google’s Gemini also supports audio inputs. In this next example we’ll see how we can pass an audio file to the model, and get back a summary in structured format.
import { SystemMessage } from "@langchain/core/messages";
import { StructuredTool } from "@langchain/core/tools";
class SummaryTool extends StructuredTool {
schema = z.object({
summary: z.string().describe("The summary of the content to log"),
});
description = "Log the summary of the content";
name = "summary_tool";
async _call(input: z.infer<typeof this.schema>) {
return input.summary;
}
}
const summaryTool = new SummaryTool();
const audioUrl =
"https://www.pacdv.com/sounds/people_sound_effects/applause-1.wav";
const axiosRes = await axios.get(audioUrl, { responseType: "arraybuffer" });
const base64 = btoa(
new Uint8Array(axiosRes.data).reduce(
(data, byte) => data + String.fromCharCode(byte),
""
)
);
const model = new ChatGoogleGenerativeAI({
model: "gemini-1.5-pro-latest",
}).bindTools([summaryTool]);
const response = await model.invoke([
new SystemMessage(
"Summarize this content. always use the summary_tool in your response"
),
new HumanMessage({
content: [
{
type: "media",
mimeType: "audio/wav",
data: base64,
},
],
}),
]);
console.log(response.tool_calls);
[
{
name: 'summary_tool',
args: { summary: 'The video shows a person clapping their hands.' }
}
]