`ScrapegraphScrapeTool`

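Description

ScrapegraphScrapeTool uses Scrapegraph AI's SmartScraper API to extract content from websites. It combines web scraping with AI-driven content extraction, which makes it well suited to targeted data collection and content analysis. Unlike traditional scrapers, it interprets the context and structure of a page and extracts the information most relevant to a natural language prompt.

Installation

To use this tool, install the Scrapegraph Python client: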

uv add scrapegraph-py

You also need to set your Scrapegraph API key as an environment variable:

export SCRAPEGRAPH_API_KEY="your_api_key"

You can obtain an API key from Scrapegraph AI.
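
The key can come from the environment or be passed at initialization. Before building a crew, it can help to fail early if neither is available. The following is a minimal sketch of that check; the helper name `get_scrapegraph_api_key` is hypothetical and is not part of crewai_tools or scrapegraph-py.

```python
import os


def get_scrapegraph_api_key(explicit_key=None, env=os.environ):
    """Return the explicit key if given, else the SCRAPEGRAPH_API_KEY variable.

    Hypothetical helper for illustration; raises ValueError when no key is found.
    """
    key = explicit_key or env.get("SCRAPEGRAPH_API_KEY")
    if not key:
        raise ValueError("Set SCRAPEGRAPH_API_KEY or pass api_key explicitly")
    return key
```

The returned value could then be passed as `api_key` when creating the tool, so a missing key surfaces before any agent runs.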

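Steps to Get Started

To use ScrapegraphScrapeTool effectively, follow these steps:

Install dependencies: install the required package using the command above.
Set the API key: set your Scrapegraph API key as an environment variable, or provide it at initialization.
Initialize the tool: create an instance of the tool with the necessary parameters.
Define the extraction prompt: write a natural language prompt that guides the extraction of specific content.

Example

The following example shows how to use ScrapegraphScrapeTool to extract content from a website:

Code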

from crewai import Agent, Task, Crew
from crewai_tools import ScrapegraphScrapeTool

# Initialize the tool
scrape_tool = ScrapegraphScrapeTool(api_key="your_api_key")

# Define an agent that uses the tool
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract specific information from websites",
    backstory="An expert in web scraping who can extract targeted content from web pages.",
    tools=[scrape_tool],
    verbose=True,
)

# Example task to extract product information from an e-commerce site
scrape_task = Task(
    description="Extract product names, prices, and descriptions from the featured products section of example.com.",
    expected_output="A structured list of product information including names, prices, and descriptions.",
    agent=web_scraper_agent,
)

# Create and run the crew
crew = Crew(agents=[web_scraper_agent], tasks=[scrape_task])
result = crew.kickoff()

You can also initialize the tool with predefined parameters:

Code

# Initialize the tool with predefined parameters
scrape_tool = ScrapegraphScrapeTool(
    website_url="https://www.example.com",
    user_prompt="Extract all product prices and descriptions",
    api_key="your_api_key"
)

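Parameters

ScrapegraphScrapeTool accepts the following parameters at initialization:

api_key: Optional. Your Scrapegraph API key. If not provided, the tool looks for the SCRAPEGRAPH_API_KEY environment variable.
website_url: Optional. The URL of the website to scrape. If provided at initialization, the agent does not need to specify it when using the tool.
user_prompt: Optional. Custom instructions for content extraction. If provided at initialization, the agent does not need to specify it when using the tool.
enable_logging: Optional. Whether to enable logging for the Scrapegraph client. Defaults to False.

Usage

When an agent uses ScrapegraphScrapeTool, it must provide the following parameters (unless they were specified at initialization):

website_url: The URL of the website to scrape.
user_prompt: Optional. Custom instructions for content extraction. Defaults to "Extract the main content of the webpage".

The tool returns the content extracted according to the prompt.

Code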

# Example of using the tool with an agent
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract specific information from websites",
    backstory="An expert in web scraping who can extract targeted content from web pages.",
    tools=[scrape_tool],
    verbose=True,
)

# Create a task for the agent to extract specific content
extract_task = Task(
    description="Extract the main heading and summary from example.com",
    expected_output="The main heading and summary from the website",
    agent=web_scraper_agent,
)

# Run the task
crew = Crew(agents=[web_scraper_agent], tasks=[extract_task])
result = crew.kickoff()

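Error Handling

ScrapegraphScrapeTool may raise the following exceptions:

ValueError: raised when the API key is missing or the URL format is invalid.
RateLimitError: raised when the API rate limit is exceeded.
RuntimeError: raised when the scraping operation fails (network issues, API errors).

It is recommended to instruct the agent to handle potential errors gracefully:

Code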

# Create a task that includes error handling instructions
robust_extract_task = Task(
    description="""
    Extract the main heading from example.com.
    Be aware that you might encounter errors such as:
    - Invalid URL format
    - Missing API key
    - Rate limit exceeded
    - Network or API errors
    
    If you encounter any errors, provide a clear explanation of what went wrong
    and suggest possible solutions.
    """,
    expected_output="Either the extracted heading or a clear error explanation",
    agent=web_scraper_agent,
)
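
Outside of agent instructions, the same three exception types can also be handled in application code. The sketch below is illustrative only: `RateLimitError` is a local stand-in for the client's exception, and `safe_scrape`, `describe_scrape_error`, and the `scrape` callable are hypothetical names, not part of the tool's API.

```python
class RateLimitError(Exception):
    """Local stand-in for the client's rate-limit exception (assumption)."""


def describe_scrape_error(exc):
    """Turn one of the documented exception types into a readable message."""
    if isinstance(exc, RateLimitError):
        return f"Rate limit exceeded: {exc}. Wait before retrying or check your plan limits."
    if isinstance(exc, ValueError):
        return f"Invalid input: {exc}. Check the API key and the URL format."
    if isinstance(exc, RuntimeError):
        return f"Scraping failed: {exc}. Check network connectivity and API status."
    return f"Unexpected error: {exc}"


def safe_scrape(scrape, website_url, user_prompt):
    """Call a scrape function and return either its result or an explanation."""
    try:
        result = scrape(website_url=website_url, user_prompt=user_prompt)
        return {"ok": True, "result": result}
    except (ValueError, RateLimitError, RuntimeError) as exc:
        return {"ok": False, "error": describe_scrape_error(exc)}
```

Returning a dictionary instead of re-raising keeps the caller's control flow simple; whether that fits depends on how the surrounding application reports failures.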

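Rate Limiting

Scrapegraph API rate limits vary by subscription plan. Consider the following best practices:

When processing multiple URLs, add an appropriate delay between requests.
Handle rate limit errors gracefully in your application.
Check your API plan limits on the Scrapegraph dashboard.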
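
The first two practices can be combined in a small loop that pauses between URLs and backs off when a rate limit error occurs. This is a sketch under stated assumptions: `RateLimitError` is a local stand-in for the client's exception, `scrape_many` and the `scrape` callable are hypothetical, and the delay values are placeholders rather than recommended settings.

```python
import time


class RateLimitError(Exception):
    """Local stand-in for the client's rate-limit exception (assumption)."""


def scrape_many(scrape, urls, delay_seconds=1.0, max_retries=3, sleep=time.sleep):
    """Scrape URLs one at a time with a fixed pause and exponential backoff."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # fixed pause between consecutive URLs
        for attempt in range(max_retries + 1):
            try:
                results[url] = scrape(url)
                break
            except RateLimitError:
                if attempt == max_retries:
                    raise  # retries exhausted, let the caller decide
                # Wait 2x, 4x, 8x ... the base delay before retrying
                sleep(delay_seconds * 2 ** (attempt + 1))
    return results
```

The `sleep` parameter is injectable so the loop can be exercised in tests without actually waiting.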

Implementation Details

ScrapegraphScrapeTool uses the Scrapegraph Python client to interact with the SmartScraper API:

Code

class ScrapegraphScrapeTool(BaseTool):
    """
    A tool that uses Scrapegraph AI to intelligently scrape website content.
    """
    
    # Implementation details...
    
    def _run(self, **kwargs: Any) -> Any:
        website_url = kwargs.get("website_url", self.website_url)
        user_prompt = (
            kwargs.get("user_prompt", self.user_prompt)
            or "Extract the main content of the webpage"
        )

        if not website_url:
            raise ValueError("website_url is required")

        # Validate URL format
        self._validate_url(website_url)

        try:
            # Make the SmartScraper request
            response = self._client.smartscraper(
                website_url=website_url,
                user_prompt=user_prompt,
            )

            return response
        # Error handling...
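
The fallback order in `_run` (call-time value, then init-time value, then the default prompt) can be tried in isolation. The sketch below mirrors that order in a standalone function; `is_valid_url` is only a hypothetical stand-in for `_validate_url`, whose actual checks are not shown in the excerpt above.

```python
from urllib.parse import urlparse

DEFAULT_PROMPT = "Extract the main content of the webpage"


def resolve_arguments(init_url, init_prompt, **kwargs):
    """Pick the call-time value first, then the init-time value, then the default."""
    website_url = kwargs.get("website_url", init_url)
    user_prompt = kwargs.get("user_prompt", init_prompt) or DEFAULT_PROMPT
    if not website_url:
        raise ValueError("website_url is required")
    return website_url, user_prompt


def is_valid_url(url):
    """Assumed validation rule: require an http(s) scheme and a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)
```

Note that an empty or `None` prompt falls through to the default because of the `or`, while a missing URL raises `ValueError`.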

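Conclusion

ScrapegraphScrapeTool extracts website content using an AI-driven understanding of page structure. Because agents can target specific information with natural language prompts, scraping tasks become more efficient and focused. The tool is particularly useful for data extraction, content monitoring, and research tasks that require specific information from web pages.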
