Spider Web Scraper

Hi,

I use SpiderTool for some agents with gpt-4o, and I get these errors nearly every time an agent searches the web:


```
I encountered an error while trying to use the tool. This was the error: 1 validation error for SpiderToolSchema
params
  Field required [type=missing, input_value={'url': 'https://www.theg…itar', 'mode': 'scrape'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.9/v/missing

Tool Spider scrape & crawl tool accepts these inputs: Tool Name: Spider scrape & crawl tool
Tool Arguments: {'url': {'description': 'Website URL', 'type': 'str'}, 'params': {'description': 'Set additional params. Options include:\n- limit: Optional[int] - The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.\n- depth: Optional[int] - The crawl limit for maximum depth. If 0, no limit will be applied.\n- metadata: Optional[bool] - Boolean to include metadata or not. Defaults to False unless set to True. If the user wants metadata, include params.metadata = True.\n- query_selector: Optional[str] - The CSS query selector to use when extracting content from the markup.\n', 'type': 'Union[dict[str, Any], NoneType]'}, 'mode': {'description': 'Mode, the only two allowed modes are scrape or crawl. Use scrape to scrape a single page and crawl to crawl the entire website following subpages. These modes are the only allowed values even when ANY params is set.', 'type': 'Literal[scrape, crawl]'}}
Tool Description: Scrape & Crawl any url and return LLM-ready data.
```

```
I encountered an error while trying to use the tool. This was the error: 2 validation errors for SpiderToolSchema
params
  Input should be a valid dictionary [type=dict_type, input_value='{"metadata": false', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/dict_type
mode
  Input should be 'scrape' or 'crawl' [type=literal_error, input_value='"scrape"', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/literal_error
```

(This is again followed by the same "Tool Spider scrape & crawl tool accepts these inputs" block as above.)
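To make the two dumps concrete: in the first call the LLM omitted `params` entirely, and in the second it sent `params` as a (truncated) JSON string instead of a dict and `mode` with literal quote characters embedded. Roughly, the schema implied by the error output looks like this (a hand-written approximation; the real `SpiderToolSchema` lives inside crewai_tools), with one valid and one failing call:

```python
from typing import Any, Literal, Optional
from pydantic import BaseModel, ValidationError

# Hand-written approximation of the schema implied by the error messages;
# the real SpiderToolSchema is defined inside crewai_tools.
class SpiderToolSchema(BaseModel):
    url: str
    params: Optional[dict[str, Any]]  # required (no default), but may be None
    mode: Literal["scrape", "crawl"]

# Valid: params is a real dict, mode is a bare literal.
SpiderToolSchema(url="https://example.com", params={"metadata": False}, mode="scrape")

# Reproduces the second dump: params arrives as a JSON *string* and
# mode carries embedded quote characters ('"scrape"' != 'scrape').
try:
    SpiderToolSchema(url="https://example.com", params='{"metadata": false', mode='"scrape"')
except ValidationError as e:
    print(e)  # dict_type error for params, literal_error for mode
```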

I have already tried spelling out in the task description exactly how the agent should use the tool. However, that did not work.

Does somebody have an idea how to solve this?

Best regards and thanks for any help!
Milan


Can you please show the code? How do you set the LLM?

Yes for sure.

I use a YAML file to define the agent. The definition is in German, but I have translated it into English here:

```yaml
topic_researcher:
  role: >

  goal: |
    Search the Internet:
    Use the tool "SerperDevTool" for searching on the Internet. Then select
    the appropriate search results and crawl their URLs with the tool "SpiderTool".

  backstory: |

  llm: openai/gpt-4o
```


I use a YAML file to define the task (again translated from German):

```yaml
research_task:
  description: |
    1. **Initial information search:**
       - **Tool:** Use the tool **SerperDevTool** to search the internet for relevant information on the topic "{topic}".
       - **Keywords:** Use short and easily understandable keywords that correspond to the typical search behavior of people on the internet.
       - **Selection of results:** Choose the most suitable and relevant search results and note their URLs.

    2. **Content analysis:**
       - **Tool:** Use the tool **Spider scrape & crawl tool** to scrape the selected URLs and extract their content completely. IMPORTANT: Follow the instructions of the tool and the formatting guidelines EXACTLY!
       - **Data extraction:** Collect all relevant information from the scraped websites.

  expected_output: |

  agent: topic_researcher
```
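(Side note: the `{topic}` placeholder in the description is interpolated from the inputs passed at kickoff. A minimal sketch, where `ResearchCrew` is a hypothetical name standing in for the project's `@CrewBase` class:)

```python
# "{topic}" in the task description is filled from the inputs dict passed
# to kickoff(). ResearchCrew is a hypothetical placeholder name.
result = ResearchCrew().crew().kickoff(inputs={"topic": "your topic here"})
print(result.raw)
```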


In the crew file I give the tools to the agent:

```python
@agent
def topic_researcher(self) -> Agent:
    return Agent(
        config=self.agents_config['topic_researcher'],
        # tools=[MyCustomTool()],  # example of a custom tool, loaded at the beginning of the file
        verbose=True,
        tools=[SerperDevTool(), SpiderTool()],
    )
```


I hope this is what you were asking for.

Thank you very much for your help.

Best
Milan

@Milan Make sure to set gpt-4o for all agents! By default, gpt-4o-mini is used, which is a less capable LLM and may cause errors. Try this and let me know if it fixes the issue.
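For example, you can pin the model directly on the Agent as well as in the YAML. A minimal sketch with placeholder role/goal/backstory; in your project those come in via `config=`:

```python
from crewai import Agent

researcher = Agent(
    role="Topic Researcher",           # placeholder values; in the real project
    goal="Research {topic} online.",   # these come from agents.yaml via config=
    backstory="...",
    llm="openai/gpt-4o",  # pin the model explicitly instead of relying on the default
    verbose=True,
)
```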

Hi and thank you very much for taking the time to help me!

I had already set all research agents to gpt-4o. However, the error still occurred. I just did a test with the research agents on gpt-4o-mini, and there were no errors in one run. However, they responded to me in English instead of German :wink:

Best
Milan

Wait, what? It worked with the mini version?

Yes. It was only one run, but it worked without the errors.

I tried using the SpiderTool and am getting the same error with Llama 3.1, gpt-4o, and gpt-4o-mini. It worked one time out of five (with gpt-4o-mini), but it fails the rest of the time.

Yes, I guess it is the same for me: it works sometimes and sometimes not.
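Since the failure mode is the LLM intermittently emitting `params` as a JSON string and `mode` with embedded quotes, one defensive option is to repair those shapes before validation. This is a sketch only, built on the approximated schema from earlier in the thread, not the real crewai_tools class:

```python
import json
from typing import Any, Literal, Optional
from pydantic import BaseModel, model_validator

class ForgivingSpiderSchema(BaseModel):
    """Sketch: same fields as the approximation above, plus repair of
    the two malformed shapes the LLM keeps producing."""
    url: str
    params: Optional[dict[str, Any]] = None  # default added so a missing value no longer fails
    mode: Literal["scrape", "crawl"]

    @model_validator(mode="before")
    @classmethod
    def _repair_llm_output(cls, data: Any) -> Any:
        if isinstance(data, dict):
            # If params arrived as a JSON string, parse it back into a dict.
            if isinstance(data.get("params"), str):
                try:
                    data["params"] = json.loads(data["params"])
                except json.JSONDecodeError:
                    data["params"] = None  # give up on unparseable fragments
            # If mode arrived with embedded quotes ('"scrape"'), strip them.
            if isinstance(data.get("mode"), str):
                data["mode"] = data["mode"].strip("\"' ")
        return data

# Both repairs in action: a JSON-string params and a double-quoted mode now validate.
print(ForgivingSpiderSchema(url="https://example.com",
                            params='{"metadata": false}',
                            mode='"scrape"'))
```

You would still need to get the tool to use such a schema (e.g., by subclassing SpiderTool and overriding its `args_schema`), and whether that hook exists depends on your crewai_tools version.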