How to search and scrape websites efficiently

I want a crew that can run multiple Google searches, pick the most relevant articles, open them, summarise each one, and then produce a summary of all the results. What is the most efficient way to do this?

Currently, I have an agent that first decides what to search for and then delegates the searching and summarising to another agent. The issue is that the agent doing the searching and summarising sometimes gets lost, which I suspect is caused by the verbose Google search results. After the search results, I sometimes see the agent prompt printed out as well. It also occasionally skips the summary and returns the full article, which drastically increases the context length.

I think what makes sense is the following process:
Agent A: overall planner and then report generator
Agent B: Conducts Google Search and chooses most relevant article URL
Agent C: Takes in URL and summarises article.

Agent A first comes up with some possible search queries. It delegates the first query to Agent B, who picks the most relevant URL. Agent C then takes that URL and summarises the article. The summary is passed back to Agent A, which assesses it and decides what to search for next. This loop continues until Agent A is satisfied, at which point it summarises all the information and outputs a report.

The goal of this setup is to minimise the context used in Agent B and Agent C so that they do not lose sight of their goal. How do I implement this in crewAI?
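One way to get this loop without Agents B and C accumulating context is to keep the search-and-summarise pair in their own small crew and kick it off fresh for every query, while the planning and reporting happen in plain Python around the loop. Below is a minimal sketch of that idea, not a definitive implementation; the tool choices, roles, and task wording are assumptions:

from crewai import Agent, Task, Crew, Process
from crewai_tools import SerperDevTool, ScrapeWebsiteTool

searcher = Agent(
    role="Searcher",
    goal="Run a Google search for {query} and return the single most relevant URL.",
    backstory="You pick the one source that best answers the query.",
    tools=[SerperDevTool()],
)
summariser = Agent(
    role="Summariser",
    goal="Scrape the given URL and return a short summary.",
    backstory="You condense articles into a few bullet points.",
    tools=[ScrapeWebsiteTool()],
)

search_task = Task(
    description="Search for: {query}. Return only the best URL.",
    expected_output="A single URL.",
    agent=searcher,
)
summary_task = Task(
    description="Scrape the URL from the previous task and summarise it in at most 150 words.",
    expected_output="A short summary plus the source URL.",
    agent=summariser,
    context=[search_task],
)

search_crew = Crew(
    agents=[searcher, summariser],
    tasks=[search_task, summary_task],
    process=Process.sequential,
)

summaries = []
for query in ["first planned query", "second planned query"]:  # Agent A's planned queries
    result = search_crew.kickoff(inputs={"query": query})
    summaries.append(str(result))  # keep only the summary, never the raw page

# Agent A (a separate crew or a single LLM call) then reviews `summaries`,
# decides whether to plan more queries, and finally writes the report.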

Agent A: crawls the web using the SerperDev tool and gets the most relevant websites
Agent B: scrapes the websites collected by Agent A and extracts the insights needed for summarisation, using FirecrawlScrapeWebsiteTool (a CrewAI tool) or a custom tool built on crawl4ai (a sketch of such a tool follows this list)
Agent C: takes those insights and delivers the report in the format you need
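For the crawl4ai option, a custom tool might look roughly like the following. This is only an assumption of how one could wire it up, not the implementation used in this thread, and the import paths differ between CrewAI versions:

import asyncio
from crewai.tools import BaseTool
from crawl4ai import AsyncWebCrawler

class Crawl4AIScrapeTool(BaseTool):
    name: str = "crawl4ai_scraper"
    description: str = "Scrape a single URL and return its content as markdown."

    def _run(self, url: str) -> str:
        async def _scrape() -> str:
            async with AsyncWebCrawler() as crawler:
                result = await crawler.arun(url=url)
                return str(result.markdown or "")
        return asyncio.run(_scrape())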

How do I do this while minimising the context used by Agent B? Let's say Agent A outputs 5 websites. Agent B is then tasked with going into each website, scraping it, and summarising it. Doesn't that mean Agent B's context keeps growing with every website it scrapes? For example, when working on the 3rd website, its context would already contain (scrape of website 1 + summary of website 1 + scrape of website 2 + summary of website 2).

Or is it possible to call Agent B for each website individually?
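Yes: one option is to put Agent B and its scrape-and-summarise task into their own crew and run that crew once per URL with kickoff_for_each, so each run only ever holds a single page in context. A rough, self-contained sketch, with placeholder tool and variable names:

from crewai import Agent, Task, Crew, Process
from crewai_tools import ScrapeWebsiteTool

scraper_agent = Agent(
    role="Webscraper",
    goal="Scrape {url} and summarise the key points about the topic.",
    backstory="You condense one web page at a time into a short summary.",
    tools=[ScrapeWebsiteTool()],
)

scrape_task = Task(
    description="Scrape {url} and summarise it in at most 150 words.",
    expected_output="A short summary plus the source URL.",
    agent=scraper_agent,
)

scrape_crew = Crew(agents=[scraper_agent], tasks=[scrape_task],
                   process=Process.sequential)

urls = ["https://example.com/a", "https://example.com/b"]  # from Agent A
results = scrape_crew.kickoff_for_each(inputs=[{"url": u} for u in urls])
summaries = [str(r) for r in results]  # pass only these to the reporter agent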

As per the tool implementation, each URL is scraped individually. When I traced the run with MLflow, I saw:

agent 1: gets the 10 website links
agent 2: takes the 10 website links as input and makes a tool call for each of the 10 websites (seemingly irrespective of the context window size, I think, though I'm not sure; this is the part I did not understand)
agent 3: does not receive the whole output (the scraped content) from agent 2

How did you set this up? Can you send me the code? Thank you.

Agent A:

Web_crawler:
  role: >
    {topic} Real-Time Info Agent
  goal: >
    Find the latest and most useful Indian and international news from {week} - {date} related to {topic}.

    If the {topic} pertains to prices, consider:

      1. Wholesale Price Index (WPI) and Consumer Price Index (CPI) trends.
      2. Latest trade figures and merchandise trade.
      3. Stock market indices.
      4. Commodity prices on exchanges, including:
         - National Commodity & Derivatives Exchange (NCDEX): https://www.ncdex.com/
         - Multi Commodity Exchange of India (MCX): https://www.mcxindia.com/
         - Indian Commodity Exchange (ICEX): https://www.icexindia.com/
      5. Factors affecting fluctuations in commodity prices.

    Ensure that:
      - Websites with repetitive information are excluded.
      - Only freely accessible websites are considered for scraping.

    If the {topic} is not about prices, conduct a comprehensive search on the {topic}.
  backstory: >
    You’re a news article agent renowned for uncovering the latest news and articles in {topic}.
    Your expertise lies in identifying the most relevant source URLs containing information about {topic}.

Agent B:

Webscraper:
  role: >
    Specialist in scraping the important information from different websites about the {topic}
  goal: >
    Find the latest news and commodity prices from the different URLs you get from Web_crawler, and extract the important information about prices and the news causing the prices to fluctuate.
  backstory: >
    You’re an expert in understanding {topic} commodity prices and how they affect the global market.
    Known for your ability to extract the important news and information in a detailed and concise manner.

Agent C:

ReportAnalyst:
  role: >
    Specialist in creating a report from the provided context related to the {topic}
  goal: >
    Create detailed reports based on {topic} with the help of data from Webscraper.
  backstory: >
    You’re a meticulous analyst with a keen eye for detail. You’re known for
    your ability to turn complex data into clear and concise reports, making
    it easy for others to understand and act on the information you provide.

    Format the output so that it is easy to read and contains the most important information under different sections.
    List all the URLs in a References section at the end, and do not describe URLs anywhere except in the References section.
    Put the separator *** at the end of each URL so that the URLs are easy to extract for other purposes.
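Since each URL in the report is supposed to end with the *** separator, a small helper along these lines (an illustrative assumption, not part of the original crew) can pull the references back out of the finished report:

import re

def extract_reference_urls(report: str) -> list[str]:
    # Grab anything that looks like a URL immediately followed by the *** separator.
    return re.findall(r"(https?://\S+?)\s*\*\*\*", report)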

TASKS-------------------

TASK A:

Crawler:
  description: >
    Crawl websites for the latest Indian and international news and articles related to {topic} from {week} to {date}.

    If the {topic} pertains to prices, consider:

      1. Wholesale Price Index (WPI) and Consumer Price Index (CPI) trends.
      2. Latest trade figures and merchandise trade.
      3. Stock market indices.
      4. Commodity prices on exchanges, including:
         - National Commodity & Derivatives Exchange (NCDEX): https://www.ncdex.com/
         - Multi Commodity Exchange of India (MCX): https://www.mcxindia.com/
         - Indian Commodity Exchange (ICEX): https://www.icexindia.com/

    Ensure that:
      - Websites with repetitive information are excluded.
      - Only freely accessible websites are considered for scraping.

    If the {topic} is not about prices, conduct a comprehensive search on the {topic}.

    Make sure to find any interesting and relevant information up to the current date, covering at most the past week.
  expected_output: >
    Return only the URLs containing the most relevant information about {topic}.
  agent: Web_crawler

TASK B:

Scraper:
  description: >
    Scrape the websites for the latest news and articles related to the {topic}.
    Make sure to find detailed, interesting, and relevant information up to the current date.

    If any website returns a NOT FOUND error, replace it with another website and do not include the NOT FOUND website in the output URLs.
  expected_output: >
    Return the extracted content about {topic}
  agent: Webscraper

TASK C:

Reporter:
  description: >
    Review the context you got about {topic} and expand it into a full section for a report.
    Make sure the report is detailed and contains any and all relevant information.
  expected_output: >
    A fully fledged report with the prices and the events and news affecting the prices, each with a full section of information.
    Formatted as plain text without special characters in the output, with headings in bold.
    Provide the URLs from which the information was retrieved under References in the output.
  agent: ReportAnalyst

AGENTS IN CREW:

scrape_tool = ScrapeWebsiteTool()
# fire_search_tool = FirecrawlCrawlWebsiteTool()  # (not good)
fire_scrape_tool = FirecrawlScrapeWebsiteTool(api_key=os.getenv("FIRECRAWL_API_KEY"))

@agent
def Web_crawler(self) -> Agent:
	return Agent(
		config=self.agents_config['Web_crawler'],
		verbose=True,
		llm=self.llm,
		tools=[self.search_tool],
		memory=False,
		max_retry_limit=2,
	)

@agent
def Webscraper(self) -> Agent:
	return Agent(
		config=self.agents_config['Webscraper'],
		verbose=True,
		llm=self.llm,
		tools=[self.fire_scrape_tool],
		memory=False,
		max_retry_limit=2,
	)

@agent
def ReportAnalyst(self) -> Agent:
	return Agent(
		config=self.agents_config["ReportAnalyst"],
		llm=self.llm,
		verbose=True,
		memory=False,
		max_retry_limit=2,
	)

@task
def Crawler(self) -> Task:
	return Task(
		config=self.tasks_config['Crawler']
	)

@task
def Scraper(self) -> Task:
	return Task(
		config=self.tasks_config['Scraper']
	)

@task
def Reporter(self) -> Task:
	return Task(
		config=self.tasks_config['Reporter']
	)
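The post does not show the crew assembly itself; assuming the standard @CrewBase pattern, the missing piece would look roughly like this (a sketch, with the process and verbose settings as assumptions):

@crew
def crew(self) -> Crew:
	return Crew(
		agents=self.agents,  # collected from the @agent methods above
		tasks=self.tasks,    # collected from the @task methods above
		process=Process.sequential,
		verbose=True,
	)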

import mlflow

# Turn on auto tracing by calling mlflow.crewai.autolog()
mlflow.crewai.autolog()

# Optional: set a tracking URI and an experiment
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("CrewAI")

Use this in crew.py to monitor the crew.

Follow the docs → MLflow Integration - CrewAI