Tool calling
📚 Overview
Tool calling is the task of selecting the right functions from a given list and calling them with the right arguments, based on an understanding of the user request.
We use a uniform custom prompt and structured JSON output generation to obtain predictions from a given model.
When evaluating generative models, we allow the model to generate up to 500 tokens on this task.
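As a rough illustration of what consuming such structured output can involve, the sketch below parses a raw model response and checks that it has the shape used by the prediction example later on this page (a top-level "tool_calls" list whose entries have a "function" name and an "arguments" object). The helper name `parse_tool_calls` is hypothetical and this is not EuroEval's own parsing code.

```python
import json
from typing import Any, Optional


def parse_tool_calls(raw_output: str) -> Optional[list[dict[str, Any]]]:
    """Return the list of tool calls, or None if the output is malformed.

    Hypothetical helper: the expected shape is taken from the prediction
    example on this page, not from EuroEval's source code.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    # The top level must be an object holding a "tool_calls" list.
    if not isinstance(data, dict) or not isinstance(data.get("tool_calls"), list):
        return None
    for call in data["tool_calls"]:
        # Each call needs a string function name and an object of arguments.
        if not isinstance(call, dict):
            return None
        if not isinstance(call.get("function"), str):
            return None
        if not isinstance(call.get("arguments"), dict):
            return None
    return data["tool_calls"]


raw = '{"tool_calls": [{"function": "latest_exchange_rate", "arguments": {"source_currency": "USD", "target_currency": "EUR"}}]}'
print(parse_tool_calls(raw))
```

Returning None for malformed output keeps the parsing step separate from scoring, so a response that cannot be parsed can simply be counted as incorrect.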
📊 Metrics
We use simple accuracy, where a prediction counts as correct only if its list of function calls and their required arguments exactly matches one of the options given in the ground truth. For example, given the ground truth:
[
  {
    "latest_exchange_rate": {
      "source_currency": ["USD", "US Dollars", "US Dollar"],
      "target_currency": ["EUR", "Euro"],
      "amount": [1000]
    }
  },
  {
    "safeway.order": {
      "location": ["Palo Alto, CA", "Palo Alto", "CA"],
      "items": [["water", "apples", "bread"]],
      "quantity": [[2, 3, 1]]
    }
  }
]
and function descriptions:
[
  {
    "name": "safeway.order",
    "description": "Order specified items from a Safeway location.",
    "parameters": {
      "type": "dict",
      "properties": {
        "location": {
          "type": "string",
          "description": "The location of the Safeway store, e.g. Palo Alto, CA."
        },
        "items": {
          "type": "array",
          "items": {"type": "string"},
          "description": "List of items to order."
        },
        "quantity": {
          "type": "array",
          "items": {"type": "integer"},
          "description": "Quantity of each item in the order list."
        }
      },
      "required": ["location", "items", "quantity"]
    }
  },
  {
    "name": "latest_exchange_rate",
    "description": "Retrieve the latest exchange rate between two specified currencies.",
    "parameters": {
      "type": "dict",
      "properties": {
        "source_currency": {
          "type": "string",
          "description": "The currency you are converting from."
        },
        "target_currency": {
          "type": "string",
          "description": "The currency you are converting to."
        },
        "amount": {
          "type": "integer",
          "description": "The amount to be converted. If omitted, default to exchange rate of 1 unit source currency."
        }
      },
      "required": ["source_currency", "target_currency"]
    }
  }
]
One valid prediction is:
{
  "tool_calls": [
    {
      "function": "latest_exchange_rate",
      "arguments": {
        "source_currency": "US Dollar",  # or 'USD' or 'US Dollars'
        "target_currency": "EUR",  # or 'Euro'
        "amount": 1000  # this can even be left out
      }
    },
    {
      "function": "safeway.order",
      "arguments": {
        "location": "Palo Alto",
        "items": ["water", "apples", "bread"],
        "quantity": [2, 3, 1]
      }
    }
  ]
}
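To make the matching rule concrete, here is a minimal Python sketch of how such an exact-match check could work. It is not EuroEval's actual scoring code: the function name `is_correct` is hypothetical, and the sketch assumes that calls are compared in order, that arguments not listed under "required" may be omitted, and that any argument that is supplied must take one of the accepted values.

```python
from typing import Any


def is_correct(
    prediction: dict[str, Any],
    ground_truth: list[dict[str, dict[str, list[Any]]]],
    functions: list[dict[str, Any]],
) -> bool:
    """Return True if every predicted call matches the ground truth.

    Hypothetical sketch of the metric described above, not EuroEval's code.
    """
    # Map each function name to the set of its required argument names.
    required = {
        f["name"]: set(f["parameters"].get("required", [])) for f in functions
    }
    calls = prediction.get("tool_calls", [])
    if len(calls) != len(ground_truth):
        return False
    # Assumption: predicted calls are compared to the ground truth in order.
    for call, expected in zip(calls, ground_truth):
        # Each ground-truth entry has a single key: the function name.
        ((name, allowed),) = expected.items()
        if call.get("function") != name:
            return False
        args = call.get("arguments", {})
        # Every required argument must be present.
        if not required.get(name, set()).issubset(args.keys()):
            return False
        # Every supplied argument must be known and take an accepted value.
        for key, value in args.items():
            if key not in allowed or value not in allowed[key]:
                return False
    return True


functions = [
    {
        "name": "latest_exchange_rate",
        "parameters": {"required": ["source_currency", "target_currency"]},
    }
]
ground_truth = [
    {
        "latest_exchange_rate": {
            "source_currency": ["USD", "US Dollars", "US Dollar"],
            "target_currency": ["EUR", "Euro"],
            "amount": [1000],
        }
    }
]
prediction = {
    "tool_calls": [
        {
            "function": "latest_exchange_rate",
            # "amount" is optional, so it is left out here.
            "arguments": {"source_currency": "USD", "target_currency": "Euro"},
        }
    ]
}
print(is_correct(prediction, ground_truth, functions))  # True
```

Under these assumptions, a prediction that misspells the function name, omits a required argument, or supplies a value outside the accepted list is scored as incorrect.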
🛠️ How to run
In the command line interface of the EuroEval Python package, you can benchmark your favorite model on the tool calling task like so:
euroeval --model <model-id> --task tool-calling