`ScrapegraphScrapeTool`

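Description

The ScrapegraphScrapeTool uses Scrapegraph AI's SmartScraper API to extract content from websites intelligently. It provides AI-powered web scraping and content extraction, which makes it well suited to targeted data collection and content analysis tasks. Unlike traditional web scrapers, it can understand the context and structure of a web page and extract the most relevant information based on a natural language prompt.

Installation

To use this tool, install the Scrapegraph Python client: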

uv add scrapegraph-py

You also need to set your Scrapegraph API key as an environment variable:

export SCRAPEGRAPH_API_KEY="your_api_key"

You can obtain an API key from Scrapegraph AI.
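
Before initializing the tool, it can help to confirm that both prerequisites are in place. The following is a minimal sketch, not part of the tool itself: the `preflight` helper is hypothetical, and it assumes that the `scrapegraph-py` package is imported under the module name `scrapegraph_py`.

```python
import importlib.util
import os


def preflight(module_name: str = "scrapegraph_py",
              env_var: str = "SCRAPEGRAPH_API_KEY") -> list[str]:
    """Return a list of setup problems; an empty list means the setup looks ready."""
    problems = []
    # find_spec returns None when a top-level module cannot be located.
    # The default module name is an assumption about the client's import name.
    if importlib.util.find_spec(module_name) is None:
        problems.append(f"Python module '{module_name}' is not installed")
    # Treat an unset or blank variable as missing.
    if not os.environ.get(env_var, "").strip():
        problems.append(f"environment variable {env_var} is not set")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("Setup issue:", problem)
```

If you pass `api_key` directly when creating the tool, the environment variable check is not needed.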

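Steps to Get Started

To use the ScrapegraphScrapeTool effectively, follow these steps:

Install dependencies: Install the required package using the command above.
Set up the API key: Set your Scrapegraph API key as an environment variable or provide it during initialization.
Initialize the tool: Create an instance of the tool with the necessary parameters.
Define extraction prompts: Write natural language prompts to guide the extraction of specific content.

Example

The following example demonstrates how to use the ScrapegraphScrapeTool to extract content from a website:

Code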
from crewai import Agent, Task, Crew
from crewai_tools import ScrapegraphScrapeTool

# Initialize the tool
scrape_tool = ScrapegraphScrapeTool(api_key="your_api_key")

# Define an agent that uses the tool
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract specific information from websites",
    backstory="An expert in web scraping who can extract targeted content from web pages.",
    tools=[scrape_tool],
    verbose=True,
)

# Example task to extract product information from an e-commerce site
scrape_task = Task(
    description="Extract product names, prices, and descriptions from the featured products section of example.com.",
    expected_output="A structured list of product information including names, prices, and descriptions.",
    agent=web_scraper_agent,
)

# Create and run the crew
crew = Crew(agents=[web_scraper_agent], tasks=[scrape_task])
result = crew.kickoff()

You can also initialize the tool with predefined parameters:

Code
# Initialize the tool with predefined parameters
scrape_tool = ScrapegraphScrapeTool(
    website_url="https://www.example.com",
    user_prompt="Extract all product prices and descriptions",
    api_key="your_api_key"
)

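Parameters

The ScrapegraphScrapeTool accepts the following parameters during initialization:

api_key: Optional. Your Scrapegraph API key. If not provided, the tool looks for the SCRAPEGRAPH_API_KEY environment variable.
website_url: Optional. The URL of the website to scrape. If provided during initialization, the agent does not need to specify it when using the tool.
user_prompt: Optional. Custom instructions for content extraction. If provided during initialization, the agent does not need to specify it when using the tool.
enable_logging: Optional. Whether to enable logging for the Scrapegraph client. Defaults to False.

Usage

When an agent uses the ScrapegraphScrapeTool, it needs to provide the following parameters (unless they were specified during initialization):

website_url: The URL of the website to scrape.
user_prompt: Optional. Custom instructions for content extraction. Defaults to "Extract the main content of the webpage".

The tool returns the extracted content based on the provided prompt.

Code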
# Example of using the tool with an agent
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract specific information from websites",
    backstory="An expert in web scraping who can extract targeted content from web pages.",
    tools=[scrape_tool],
    verbose=True,
)

# Create a task for the agent to extract specific content
extract_task = Task(
    description="Extract the main heading and summary from example.com",
    expected_output="The main heading and summary from the website",
    agent=web_scraper_agent,
)

# Run the task
crew = Crew(agents=[web_scraper_agent], tasks=[extract_task])
result = crew.kickoff()
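
How initialization values and call-time values combine can be illustrated with plain Python. This is a sketch that mirrors the argument handling shown in the implementation excerpt later on this page; the `resolve_arguments` function is a hypothetical stand-in and does not call the Scrapegraph API.

```python
DEFAULT_PROMPT = "Extract the main content of the webpage"


def resolve_arguments(init_url=None, init_prompt=None, **kwargs):
    """Combine initialization values with call-time values, using plain data only."""
    # A value supplied at call time wins over the one given at initialization.
    website_url = kwargs.get("website_url", init_url)
    # A missing or empty prompt falls back to the documented default.
    user_prompt = kwargs.get("user_prompt", init_prompt) or DEFAULT_PROMPT
    if not website_url:
        raise ValueError("website_url is required")
    return website_url, user_prompt


if __name__ == "__main__":
    # URL fixed at initialization, prompt left to the default.
    print(resolve_arguments(init_url="https://www.example.com"))
    # The agent overrides both values at call time.
    print(resolve_arguments(
        init_url="https://www.example.com",
        website_url="https://www.example.com/products",
        user_prompt="Extract all product prices",
    ))
```

In this sketch a missing URL is the only hard failure; a missing prompt is always recoverable through the default.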

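Error Handling

The ScrapegraphScrapeTool may raise the following exceptions:

ValueError: When the API key is missing or the URL format is invalid.
RateLimitError: When the API rate limit is exceeded.
RuntimeError: When the scraping operation fails (network issues, API errors).

It is recommended to instruct agents to handle potential errors gracefully:

Code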
# Create a task that includes error handling instructions
robust_extract_task = Task(
    description="""
    Extract the main heading from example.com.
    Be aware that you might encounter errors such as:
    - Invalid URL format
    - Missing API key
    - Rate limit exceeded
    - Network or API errors
    
    If you encounter any errors, provide a clear explanation of what went wrong
    and suggest possible solutions.
    """,
    expected_output="Either the extracted heading or a clear error explanation",
    agent=web_scraper_agent,
)
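
If you call the tool from your own application code rather than through an agent, the same three failure modes can be caught explicitly. The sketch below is illustrative only: `RateLimitError` is a local stand-in for the exception named above (its real import path is not shown on this page), and `describe_failure` and `safe_scrape` are hypothetical helpers.

```python
class RateLimitError(Exception):
    """Local stand-in for the rate-limit exception; not the library's class."""


def describe_failure(exc: Exception) -> str:
    """Map an exception to a short, actionable message."""
    if isinstance(exc, RateLimitError):
        return "Rate limit exceeded: wait before retrying or check your plan limits."
    if isinstance(exc, ValueError):
        return "Invalid input: check the API key and the URL format."
    if isinstance(exc, RuntimeError):
        return "Scraping failed: possible network problem or API error."
    return f"Unexpected error: {exc}"


def safe_scrape(scrape, **kwargs):
    """Call `scrape` and return (result, None) on success or (None, message) on failure."""
    try:
        return scrape(**kwargs), None
    except (RateLimitError, ValueError, RuntimeError) as exc:
        return None, describe_failure(exc)
```

Here `scrape` is any callable that invokes the tool, for example a small wrapper around your tool instance. Returning a message instead of re-raising keeps a batch job running when a single page fails.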

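Rate Limiting

The Scrapegraph API's rate limits vary with your subscription plan. Consider the following best practices:

Implement appropriate delays between requests when processing multiple URLs.
Handle rate limit errors gracefully in your application.
Check your API plan limits on the Scrapegraph dashboard.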
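
The first two practices can be combined in a small loop. This is a sketch under stated assumptions: `scrape` is any callable you supply that takes a URL, `RateLimitError` is a local stand-in rather than the library's class, and the delay and retry values are placeholders to tune against your own plan limits.

```python
import time


class RateLimitError(Exception):
    """Local stand-in for the rate-limit exception; not the library's class."""


def scrape_many(urls, scrape, delay=1.0, max_retries=3, sleep=time.sleep):
    """Scrape each URL in turn, pausing between requests and backing off on rate limits."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # fixed pause between consecutive URLs
        for attempt in range(max_retries + 1):
            try:
                results[url] = scrape(url)
                break
            except RateLimitError:
                if attempt == max_retries:
                    results[url] = None  # give up on this URL after the last retry
                else:
                    # Exponential backoff: 2x, 4x, 8x ... the base delay.
                    sleep(delay * 2 ** (attempt + 1))
    return results
```

The `sleep` parameter is injectable so the loop can be exercised in tests without actually waiting.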

Implementation Details

The ScrapegraphScrapeTool uses the Scrapegraph Python client to interact with the SmartScraper API:

Code
class ScrapegraphScrapeTool(BaseTool):
    """
    A tool that uses Scrapegraph AI to intelligently scrape website content.
    """
    
    # Implementation details...
    
    def _run(self, **kwargs: Any) -> Any:
        website_url = kwargs.get("website_url", self.website_url)
        user_prompt = (
            kwargs.get("user_prompt", self.user_prompt)
            or "Extract the main content of the webpage"
        )

        if not website_url:
            raise ValueError("website_url is required")

        # Validate URL format
        self._validate_url(website_url)

        try:
            # Make the SmartScraper request
            response = self._client.smartscraper(
                website_url=website_url,
                user_prompt=user_prompt,
            )

            return response
        # Error handling...
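
The excerpt above calls `_validate_url` without showing its body. As an illustration of what such a check could look like, here is a minimal sketch using the standard library; it is an assumption about the kind of validation involved, not the tool's actual implementation, and the standalone `validate_url` function is hypothetical.

```python
from urllib.parse import urlparse


def validate_url(url: str) -> None:
    """Raise ValueError unless `url` is an absolute http(s) URL with a host."""
    parsed = urlparse(url)
    # Require both a web scheme and a network location (host).
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Invalid URL format: {url!r}")


if __name__ == "__main__":
    validate_url("https://www.example.com")  # passes silently
    try:
        validate_url("example.com")  # no scheme, so it is rejected
    except ValueError as exc:
        print(exc)
```

Rejecting a malformed URL before the request is sent surfaces the problem as a ValueError locally instead of as a failed API call.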

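Conclusion

The ScrapegraphScrapeTool offers a powerful way to extract content from websites by using AI to understand web page structure. By letting agents target specific information with natural language prompts, it makes web scraping tasks more efficient and focused. The tool is particularly useful for data extraction, content monitoring, and research tasks that require specific information from web pages.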
