ScrapflyScrapeWebsiteTool

Description

The ScrapflyScrapeWebsiteTool is designed to leverage Scrapfly's web scraping API to extract content from websites. The tool provides advanced scraping capabilities, including headless browser support, proxies, and anti-bot bypass. It can extract web page data in several formats, including raw HTML, markdown, and plain text, which makes it well suited to a wide range of web scraping tasks.

Installation

To use this tool, you need to install the Scrapfly SDK:
uv add scrapfly-sdk
You will also need to register at scrapfly.io/register to obtain a Scrapfly API key.
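Once the SDK is installed and you have a key, you can sanity-check the setup with a standalone call. This is a minimal sketch that mirrors what the tool does internally (see Implementation Details below); the URL is just an example target:
Code
from scrapfly import ScrapflyClient, ScrapeConfig

# Verify the SDK and API key with a single scrape
client = ScrapflyClient(key="your_scrapfly_api_key")
response = client.scrape(ScrapeConfig(url="https://web-scraping.dev/products"))
print(response.scrape_result["content"][:200])  # preview the extracted content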

Steps to Get Started

To effectively use the ScrapflyScrapeWebsiteTool, follow these steps:
  1. Install Dependencies: Install the Scrapfly SDK using the command above.
  2. Obtain API Key: Register with Scrapfly to get your API key.
  3. Initialize the Tool: Create an instance of the tool with your API key (see the sketch after this list).
  4. Configure Scraping Parameters: Customize the scraping parameters based on your needs.
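Step 3 amounts to a single constructor call with your key (placeholder shown):
Code
from crewai_tools import ScrapflyScrapeWebsiteTool

# Create the tool instance with your Scrapfly API key
scrape_tool = ScrapflyScrapeWebsiteTool(api_key="your_scrapfly_api_key")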

Example

The following example demonstrates how to use the ScrapflyScrapeWebsiteTool to extract content from a website:
Code
from crewai import Agent, Task, Crew
from crewai_tools import ScrapflyScrapeWebsiteTool

# Initialize the tool
scrape_tool = ScrapflyScrapeWebsiteTool(api_key="your_scrapfly_api_key")

# Define an agent that uses the tool
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract information from websites",
    backstory="An expert in web scraping who can extract content from any website.",
    tools=[scrape_tool],
    verbose=True,
)

# Example task to extract content from a website
scrape_task = Task(
    description="Extract the main content from the product page at https://web-scraping.dev/products and summarize the available products.",
    expected_output="A summary of the products available on the website.",
    agent=web_scraper_agent,
)

# Create and run the crew
crew = Crew(agents=[web_scraper_agent], tasks=[scrape_task])
result = crew.kickoff()
You can also customize the scraping parameters:
Code
# Example with custom scraping parameters
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract information from websites with custom parameters",
    backstory="An expert in web scraping who can extract content from any website.",
    tools=[scrape_tool],
    verbose=True,
)

# The agent will use the tool with parameters like:
# url="https://web-scraping.dev/products"
# scrape_format="markdown"
# ignore_scrape_failures=True
# scrape_config={
#     "asp": True,  # Bypass scraping blocking solutions, like Cloudflare
#     "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
#     "proxy_pool": "public_residential_pool",  # Select a proxy pool
#     "country": "us",  # Select a proxy location
#     "auto_scroll": True,  # Auto scroll the page
# }

scrape_task = Task(
    description="Extract the main content from the product page at https://web-scraping.dev/products using advanced scraping options including JavaScript rendering and proxy settings.",
    expected_output="A detailed summary of the products with all available information.",
    agent=web_scraper_agent,
)
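As in the first example, the task is then executed by putting the agent and task into a crew:
Code
# Run it the same way as the first example
crew = Crew(agents=[web_scraper_agent], tasks=[scrape_task])
result = crew.kickoff()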

Parameters

The ScrapflyScrapeWebsiteTool accepts the following parameters:

Initialization Parameters

  • api_key: Required. Your Scrapfly API key.

Run Parameters

  • url: Required. The URL of the website to scrape.
  • scrape_format: Optional. The format in which to extract the web page content. Options are "raw" (HTML), "markdown", or "text". Defaults to "markdown".
  • scrape_config: Optional. A dictionary with additional Scrapfly scraping configuration options.
  • ignore_scrape_failures: Optional. Whether to ignore failures during scraping. If set to True, the tool returns None instead of raising an exception when scraping fails, as illustrated in the sketch below.
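For reference, a direct invocation with these run parameters might look like the following sketch (assuming the tool's standard run() entry point):
Code
# Sketch: calling the tool directly with explicit run parameters
content = scrape_tool.run(
    url="https://web-scraping.dev/products",
    scrape_format="text",          # "raw", "markdown", or "text"
    ignore_scrape_failures=True,   # return None on failure instead of raising
)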

Scrapfly Configuration Options

The scrape_config parameter lets you customize the scraping behavior with the following options (combined in the sketch after this list):
  • asp: Enable bypassing of anti-scraping protection.
  • render_js: Enable JavaScript rendering with a cloud headless browser.
  • proxy_pool: Select a proxy pool (e.g., "public_residential_pool", "datacenter").
  • country: Select a proxy location (e.g., "us", "uk").
  • auto_scroll: Automatically scroll the page to load lazy-loaded content.
  • js: Execute custom JavaScript code via the headless browser.
For a complete list of configuration options, refer to the Scrapfly API documentation.
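Putting these options together, a scrape_config dictionary can be passed on a direct call like this (a sketch; the values mirror the commented example above):
Code
# Sketch: passing Scrapfly options through scrape_config
content = scrape_tool.run(
    url="https://web-scraping.dev/products",
    scrape_format="markdown",
    scrape_config={
        "asp": True,                              # bypass anti-scraping protection
        "render_js": True,                        # render with a cloud headless browser
        "proxy_pool": "public_residential_pool",  # select a proxy pool
        "country": "us",                          # proxy location
        "auto_scroll": True,                      # load lazy-loaded content
    },
)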

Usage

When using the ScrapflyScrapeWebsiteTool with an agent, the agent needs to provide the URL of the website to scrape and can optionally specify the format and additional configuration options:
Code
# Example of using the tool with an agent
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract information from websites",
    backstory="An expert in web scraping who can extract content from any website.",
    tools=[scrape_tool],
    verbose=True,
)

# Create a task for the agent
scrape_task = Task(
    description="Extract the main content from example.com in markdown format.",
    expected_output="The main content of example.com in markdown format.",
    agent=web_scraper_agent,
)

# Run the task
crew = Crew(agents=[web_scraper_agent], tasks=[scrape_task])
result = crew.kickoff()
For more advanced usage with a custom configuration:
Code
# Create a task with more specific instructions
advanced_scrape_task = Task(
    description="""
    Extract content from example.com with the following requirements:
    - Convert the content to plain text format
    - Enable JavaScript rendering
    - Use a US-based proxy
    - Handle any scraping failures gracefully
    """,
    expected_output="The extracted content from example.com",
    agent=web_scraper_agent,
)

Error Handling

By default, the ScrapflyScrapeWebsiteTool raises an exception if scraping fails. The agent can be instructed to handle failures gracefully by specifying the ignore_scrape_failures parameter:
Code
# Create a task that instructs the agent to handle errors
error_handling_task = Task(
    description="""
    Extract content from a potentially problematic website and make sure to handle any 
    scraping failures gracefully by setting ignore_scrape_failures to True.
    """,
    expected_output="Either the extracted content or a graceful error message",
    agent=web_scraper_agent,
)
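When calling the tool directly, the same behavior can be exercised in code (a sketch; as the implementation below shows, a failed scrape returns None when ignore_scrape_failures is True):
Code
# Sketch: graceful failure handling on a direct tool call
content = scrape_tool.run(
    url="https://web-scraping.dev/products",
    ignore_scrape_failures=True,
)
if content is None:
    print("Scrape failed; continuing without the page content.")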

Implementation Details

The ScrapflyScrapeWebsiteTool uses the Scrapfly SDK to interact with the Scrapfly API:
Code
from typing import Any, Dict, Optional

class ScrapflyScrapeWebsiteTool(BaseTool):
    name: str = "Scrapfly web scraping API tool"
    description: str = (
        "Scrape a webpage url using Scrapfly and return its content as markdown or text"
    )

    # Implementation details...
    # (self.scrapfly below is the ScrapflyClient created from the api_key
    # passed at initialization)
    def _run(
        self,
        url: str,
        scrape_format: str = "markdown",
        scrape_config: Optional[Dict[str, Any]] = None,
        ignore_scrape_failures: Optional[bool] = None,
    ):
        from scrapfly import ScrapeApiResponse, ScrapeConfig

        scrape_config = scrape_config if scrape_config is not None else {}
        try:
            response: ScrapeApiResponse = self.scrapfly.scrape(
                ScrapeConfig(url, format=scrape_format, **scrape_config)
            )
            return response.scrape_result["content"]
        except Exception as e:
            if ignore_scrape_failures:
                logger.error(f"Error fetching data from {url}, exception: {e}")
                return None
            else:
                raise e

Conclusion

The ScrapflyScrapeWebsiteTool offers a powerful way to extract content from websites by leveraging Scrapfly's advanced scraping capabilities. With features such as headless browser support, proxies, and anti-bot bypass, it can handle complex websites and extract content in multiple formats. The tool is particularly useful for data extraction, content monitoring, and research tasks that require reliable web scraping.