Spider Web Scraper

Hi,

I use SpiderTool for some agents with gpt-4o, and I get these errors nearly every time an agent searches the web:


```
I encountered an error while trying to use the tool. This was the error: 1 validation error for SpiderToolSchema
params
  Field required [type=missing, input_value={'url': 'https://www.theg…itar', 'mode': 'scrape'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.9/v/missing

Tool Spider scrape & crawl tool accepts these inputs: Tool Name: Spider scrape & crawl tool
Tool Arguments: {'url': {'description': 'Website URL', 'type': 'str'}, 'params': {'description': 'Set additional params. Options include:\n- limit: Optional[int] - The maximum number of pages allowed to crawl per website. Remove the value or set it to 0 to crawl all pages.\n- depth: Optional[int] - The crawl limit for maximum depth. If 0, no limit will be applied.\n- metadata: Optional[bool] - Boolean to include metadata or not. Defaults to False unless set to True. If the user wants metadata, include params.metadata = True.\n- query_selector: Optional[str] - The CSS query selector to use when extracting content from the markup.\n', 'type': 'Union[dict[str, Any], NoneType]'}, 'mode': {'description': 'Mode, the only two allowed modes are scrape or crawl. Use scrape to scrape a single page and crawl to crawl the entire website following subpages. These modes are the only allowed values even when ANY params is set.', 'type': 'Literal[scrape, crawl]'}}
Tool Description: Scrape & Crawl any url and return LLM-ready data.
```

```
I encountered an error while trying to use the tool. This was the error: 2 validation errors for SpiderToolSchema
params
  Input should be a valid dictionary [type=dict_type, input_value='{"metadata": false', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/dict_type
mode
  Input should be 'scrape' or 'crawl' [type=literal_error, input_value='"scrape"', input_type=str]
    For further information visit https://errors.pydantic.dev/2.9/v/literal_error
```

(This is again followed by the same "Tool Spider scrape & crawl tool accepts these inputs" block as above.)
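To make the two dumps concrete: in the first call the LLM omitted `params` entirely, and in the second it sent `params` as a (truncated) JSON string instead of a dict and `mode` with literal quote characters embedded. Roughly, the schema implied by the error output looks like this (a hand-written approximation; the real `SpiderToolSchema` lives inside crewai_tools), with one valid and one failing call:

```python
from typing import Any, Literal, Optional
from pydantic import BaseModel, ValidationError

# Hand-written approximation of the schema implied by the error messages;
# the real SpiderToolSchema is defined inside crewai_tools.
class SpiderToolSchema(BaseModel):
    url: str
    params: Optional[dict[str, Any]]  # required (no default), but may be None
    mode: Literal["scrape", "crawl"]

# Valid: params is a real dict, mode is a bare literal.
SpiderToolSchema(url="https://example.com", params={"metadata": False}, mode="scrape")

# Reproduces the second dump: params arrives as a JSON *string* and
# mode carries embedded quote characters ('"scrape"' != 'scrape').
try:
    SpiderToolSchema(url="https://example.com", params='{"metadata": false', mode='"scrape"')
except ValidationError as e:
    print(e)  # dict_type error for params, literal_error for mode
```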

I have already tried spelling out in the task description exactly how the agent should use the tool. However, that did not work.

Does somebody have an idea how to solve this?

Best regards and thanks for any help!
Milan


Can you please show the code? How do you set the LLM?

Yes for sure.

I use a YAML file to define the agent. The definition is in German, but I have translated it into English here:

```yaml
topic_researcher:
  role: >

  goal: |
    Search the Internet:
    Use the tool "SerperDevTool" for searching on the Internet. Then select
    the appropriate search results and crawl their URLs with the tool "SpiderTool".

  backstory: |

  llm: openai/gpt-4o
```


I use a YAML file to define the task (again translated from German):

```yaml
research_task:
  description: |
    1. **Initial information search:**
       - **Tool:** Use the tool **SerperDevTool** to search the internet for relevant information on the topic "{topic}".
       - **Keywords:** Use short and easily understandable keywords that correspond to the typical search behavior of people on the internet.
       - **Selection of results:** Choose the most suitable and relevant search results and note their URLs.

    2. **Content analysis:**
       - **Tool:** Use the tool **Spider scrape & crawl tool** to scrape the selected URLs and extract their content completely. IMPORTANT: Follow the instructions of the tool and the formatting guidelines EXACTLY!
       - **Data extraction:** Collect all relevant information from the scraped websites.

  expected_output: |

  agent: topic_researcher
```
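(Side note: the `{topic}` placeholder in the description is interpolated from the inputs passed at kickoff. A minimal sketch, where `ResearchCrew` is a hypothetical name standing in for the project's `@CrewBase` class:)

```python
# "{topic}" in the task description is filled from the inputs dict passed
# to kickoff(). ResearchCrew is a hypothetical placeholder name.
result = ResearchCrew().crew().kickoff(inputs={"topic": "your topic here"})
print(result.raw)
```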


In the crew file I give the tools to the agent:

```python
@agent
def topic_researcher(self) -> Agent:
    return Agent(
        config=self.agents_config['topic_researcher'],
        # tools=[MyCustomTool()],  # example of a custom tool, loaded at the beginning of the file
        verbose=True,
        tools=[SerperDevTool(), SpiderTool()],
    )
```


I hope this is what you were asking for.

Thank you very much for your help.

Best
Milan

@Milan Make sure to set gpt-4o for all agents! By default, gpt-4o-mini is used, which is a less capable LLM and may cause errors. Try this and let me know if it fixes the issue.
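For example, you can pin the model directly on the Agent as well as in the YAML. A minimal sketch with placeholder role/goal/backstory; in your project those come in via `config=`:

```python
from crewai import Agent

researcher = Agent(
    role="Topic Researcher",           # placeholder values; in the real project
    goal="Research {topic} online.",   # these come from agents.yaml via config=
    backstory="...",
    llm="openai/gpt-4o",  # pin the model explicitly instead of relying on the default
    verbose=True,
)
```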

Hi and thank you very much for taking the time to help me!

I had already set all research agents to gpt-4o. However, the error still occurred. I just did a test with the research agents on gpt-4o-mini, and there were no errors in one run. However, they responded to me in English instead of German :wink:

Best
Milan

Wait, what? It worked with the mini version?

Yes. It was only one run, but it worked without the errors.

I tried using the SpiderTool and am getting the same error with Llama 3.1, gpt-4o, and gpt-4o-mini. It worked one time out of five (with gpt-4o-mini), but it fails the rest of the time.

Yes, I guess it is the same for me: it works sometimes and sometimes not.
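Since the failure mode is the LLM intermittently emitting `params` as a JSON string and `mode` with embedded quotes, one defensive option is to repair those shapes before validation. This is a sketch only, built on the approximated schema from earlier in the thread, not the real crewai_tools class:

```python
import json
from typing import Any, Literal, Optional
from pydantic import BaseModel, model_validator

class ForgivingSpiderSchema(BaseModel):
    """Sketch: same fields as the approximation above, plus repair of
    the two malformed shapes the LLM keeps producing."""
    url: str
    params: Optional[dict[str, Any]] = None  # default added so a missing value no longer fails
    mode: Literal["scrape", "crawl"]

    @model_validator(mode="before")
    @classmethod
    def _repair_llm_output(cls, data: Any) -> Any:
        if isinstance(data, dict):
            # If params arrived as a JSON string, parse it back into a dict.
            if isinstance(data.get("params"), str):
                try:
                    data["params"] = json.loads(data["params"])
                except json.JSONDecodeError:
                    data["params"] = None  # give up on unparseable fragments
            # If mode arrived with embedded quotes ('"scrape"'), strip them.
            if isinstance(data.get("mode"), str):
                data["mode"] = data["mode"].strip("\"' ")
        return data

# Both repairs in action: a JSON-string params and a double-quoted mode now validate.
print(ForgivingSpiderSchema(url="https://example.com",
                            params='{"metadata": false}',
                            mode='"scrape"'))
```

You would still need to get the tool to use such a schema (e.g., by subclassing SpiderTool and overriding its `args_schema`), and whether that hook exists depends on your crewai_tools version.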