ScrapflyScrapeWebsiteTool

Description

The ScrapflyScrapeWebsiteTool is designed to leverage Scrapfly's web scraping API to extract content from websites. It provides advanced scraping capabilities, including headless browser support, proxies, and anti-bot bypass. The tool can extract web page data in several formats, including raw HTML, Markdown, and plain text, making it well suited to a wide range of web scraping tasks.

Installation

To use this tool, you need to install the Scrapfly SDK:

uv add scrapfly-sdk

You will also need a Scrapfly API key, which you can obtain by registering at scrapfly.io/register.
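
If you prefer not to hard-code the key, it can be read from an environment variable (a minimal sketch; SCRAPFLY_API_KEY is an illustrative variable name, not one required by the SDK):

Code
import os

from crewai_tools import ScrapflyScrapeWebsiteTool

# SCRAPFLY_API_KEY is an illustrative environment variable name.
scrape_tool = ScrapflyScrapeWebsiteTool(api_key=os.environ["SCRAPFLY_API_KEY"])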

Steps to Get Started

To use the ScrapflyScrapeWebsiteTool effectively, follow these steps:

  1. Install Dependencies: Install the Scrapfly SDK using the command above.
  2. Obtain an API Key: Register with Scrapfly to get your API key.
  3. Initialize the Tool: Create an instance of the tool with your API key.
  4. Configure Scraping Parameters: Customize the scraping parameters to suit your needs.

Example

The following example demonstrates how to use the ScrapflyScrapeWebsiteTool to extract content from a website:

Code
from crewai import Agent, Task, Crew
from crewai_tools import ScrapflyScrapeWebsiteTool

# Initialize the tool
scrape_tool = ScrapflyScrapeWebsiteTool(api_key="your_scrapfly_api_key")

# Define an agent that uses the tool
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract information from websites",
    backstory="An expert in web scraping who can extract content from any website.",
    tools=[scrape_tool],
    verbose=True,
)

# Example task to extract content from a website
scrape_task = Task(
    description="Extract the main content from the product page at https://web-scraping.dev/products and summarize the available products.",
    expected_output="A summary of the products available on the website.",
    agent=web_scraper_agent,
)

# Create and run the crew
crew = Crew(agents=[web_scraper_agent], tasks=[scrape_task])
result = crew.kickoff()

You can also customize the scraping parameters:

Code
# Example with custom scraping parameters
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract information from websites with custom parameters",
    backstory="An expert in web scraping who can extract content from any website.",
    tools=[scrape_tool],
    verbose=True,
)

# The agent will use the tool with parameters like:
# url="https://web-scraping.dev/products"
# scrape_format="markdown"
# ignore_scrape_failures=True
# scrape_config={
#     "asp": True,  # Bypass scraping blocking solutions, like Cloudflare
#     "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
#     "proxy_pool": "public_residential_pool",  # Select a proxy pool
#     "country": "us",  # Select a proxy location
#     "auto_scroll": True,  # Auto scroll the page
# }

scrape_task = Task(
    description="Extract the main content from the product page at https://web-scraping.dev/products using advanced scraping options including JavaScript rendering and proxy settings.",
    expected_output="A detailed summary of the products with all available information.",
    agent=web_scraper_agent,
)

Parameters

The ScrapflyScrapeWebsiteTool accepts the following parameters:

Initialization Parameters

  • api_key: Required. Your Scrapfly API key.

Run Parameters

  • url: Required. The URL of the website to scrape.
  • scrape_format: Optional. The format in which to extract the web page content. Options are "raw" (HTML), "markdown", or "text". Defaults to "markdown".
  • scrape_config: Optional. A dictionary of additional Scrapfly scraping configuration options.
  • ignore_scrape_failures: Optional. Whether to ignore failures during scraping. If set to True, the tool returns None instead of raising an exception when scraping fails.
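
For reference, these run parameters can also be passed when invoking the tool directly, outside of an agent (a minimal sketch, assuming the standard run interface that crewAI tools expose; agents normally supply these arguments themselves):

Code
# Direct invocation sketch; agents normally pass these arguments for you.
result = scrape_tool.run(
    url="https://web-scraping.dev/products",
    scrape_format="text",          # "raw", "markdown", or "text"
    ignore_scrape_failures=True,   # return None on failure instead of raising
)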

Scrapfly Configuration Options

The scrape_config parameter lets you customize scraping behavior with the following options (see the sketch after this list for a combined example):

  • asp: Enable anti-scraping protection bypass.
  • render_js: Enable JavaScript rendering with a cloud headless browser.
  • proxy_pool: Select a proxy pool (e.g., "public_residential_pool", "datacenter").
  • country: Select a proxy location (e.g., "us", "uk").
  • auto_scroll: Automatically scroll the page to load lazily loaded content.
  • js: Execute custom JavaScript code in the headless browser.
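
As an illustration, a scrape_config combining several of these options might look like the following sketch; every value is an example, not a requirement:

Code
# Illustrative scrape_config; every value here is an example, not a requirement.
scrape_config = {
    "asp": True,                              # bypass anti-scraping protection
    "render_js": True,                        # render JavaScript in a cloud headless browser
    "proxy_pool": "public_residential_pool",  # residential proxy pool
    "country": "us",                          # US-based proxy location
    "auto_scroll": True,                      # scroll to load lazily loaded content
}

result = scrape_tool.run(
    url="https://web-scraping.dev/products",
    scrape_config=scrape_config,
)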

For the complete list of configuration options, refer to the Scrapfly API documentation.

Usage

When the ScrapflyScrapeWebsiteTool is used with an agent, the agent needs to provide the URL of the website to scrape and can optionally specify the format and additional configuration options:

Code
# Example of using the tool with an agent
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract information from websites",
    backstory="An expert in web scraping who can extract content from any website.",
    tools=[scrape_tool],
    verbose=True,
)

# Create a task for the agent
scrape_task = Task(
    description="Extract the main content from example.com in markdown format.",
    expected_output="The main content of example.com in markdown format.",
    agent=web_scraper_agent,
)

# Run the task
crew = Crew(agents=[web_scraper_agent], tasks=[scrape_task])
result = crew.kickoff()

For more advanced usage with a custom configuration:

Code
# Create a task with more specific instructions
advanced_scrape_task = Task(
    description="""
    Extract content from example.com with the following requirements:
    - Convert the content to plain text format
    - Enable JavaScript rendering
    - Use a US-based proxy
    - Handle any scraping failures gracefully
    """,
    expected_output="The extracted content from example.com",
    agent=web_scraper_agent,
)

Error Handling

By default, the ScrapflyScrapeWebsiteTool raises an exception if scraping fails. The agent can be instructed to handle failures gracefully by specifying the ignore_scrape_failures parameter:

Code
# Create a task that instructs the agent to handle errors
error_handling_task = Task(
    description="""
    Extract content from a potentially problematic website and make sure to handle any 
    scraping failures gracefully by setting ignore_scrape_failures to True.
    """,
    expected_output="Either the extracted content or a graceful error message",
    agent=web_scraper_agent,
)
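
When invoking the tool directly, the same failure mode can be detected by checking for None (a minimal sketch; the URL and fallback message are illustrative):

Code
# With ignore_scrape_failures=True the tool returns None instead of raising.
content = scrape_tool.run(
    url="https://example.com/might-fail",  # illustrative URL
    ignore_scrape_failures=True,
)
if content is None:
    print("Scrape failed; continuing without this page.")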

Implementation Details

The ScrapflyScrapeWebsiteTool uses the Scrapfly SDK to interact with the Scrapfly API:

Code
import logging
from typing import Any, Dict, Optional

from crewai.tools import BaseTool  # import path for BaseTool may vary across crewai versions

logger = logging.getLogger(__name__)


class ScrapflyScrapeWebsiteTool(BaseTool):
    name: str = "Scrapfly web scraping API tool"
    description: str = (
        "Scrape a webpage url using Scrapfly and return its content as markdown or text"
    )
    
    # Implementation details...
    
    def _run(
        self,
        url: str,
        scrape_format: str = "markdown",
        scrape_config: Optional[Dict[str, Any]] = None,
        ignore_scrape_failures: Optional[bool] = None,
    ):
        from scrapfly import ScrapeApiResponse, ScrapeConfig

        scrape_config = scrape_config if scrape_config is not None else {}
        try:
            response: ScrapeApiResponse = self.scrapfly.scrape(
                ScrapeConfig(url, format=scrape_format, **scrape_config)
            )
            return response.scrape_result["content"]
        except Exception as e:
            if ignore_scrape_failures:
                logger.error(f"Error fetching data from {url}, exception: {e}")
                return None
            else:
                raise e

Conclusion

The ScrapflyScrapeWebsiteTool provides a powerful way to extract content from websites through Scrapfly's advanced web scraping capabilities. With features such as headless browser support, proxies, and anti-bot bypass, it can handle complex websites and extract content in a variety of formats. This tool is particularly useful for data extraction, content monitoring, and research tasks that require reliable web scraping.