ScrapflyScrapeWebsiteTool

Description

The ScrapflyScrapeWebsiteTool is designed to leverage Scrapfly's web scraping API to extract content from websites. It provides advanced scraping capabilities, including headless browser support, proxies, and anti-bot bypass. The tool can extract web page data in several formats, including raw HTML, Markdown, and plain text, making it well suited to a wide range of web scraping tasks.

Installation

To use this tool, you need to install the Scrapfly SDK:

uv add scrapfly-sdk

You will also need a Scrapfly API key, which you can obtain by registering at scrapfly.io/register.
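
If you prefer not to hard-code the key, it can be read from an environment variable (a minimal sketch; SCRAPFLY_API_KEY is an illustrative variable name, not one required by the SDK):

Code
import os

from crewai_tools import ScrapflyScrapeWebsiteTool

# SCRAPFLY_API_KEY is an illustrative environment variable name.
scrape_tool = ScrapflyScrapeWebsiteTool(api_key=os.environ["SCRAPFLY_API_KEY"])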

Steps to Get Started

To use the ScrapflyScrapeWebsiteTool effectively, follow these steps:

  1. Install Dependencies: Install the Scrapfly SDK using the command above.
  2. Obtain an API Key: Register with Scrapfly to get your API key.
  3. Initialize the Tool: Create an instance of the tool with your API key.
  4. Configure Scraping Parameters: Customize the scraping parameters to suit your needs.

Example

The following example demonstrates how to use the ScrapflyScrapeWebsiteTool to extract content from a website:

Code
from crewai import Agent, Task, Crew
from crewai_tools import ScrapflyScrapeWebsiteTool

# Initialize the tool
scrape_tool = ScrapflyScrapeWebsiteTool(api_key="your_scrapfly_api_key")

# Define an agent that uses the tool
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract information from websites",
    backstory="An expert in web scraping who can extract content from any website.",
    tools=[scrape_tool],
    verbose=True,
)

# Example task to extract content from a website
scrape_task = Task(
    description="Extract the main content from the product page at https://web-scraping.dev/products and summarize the available products.",
    expected_output="A summary of the products available on the website.",
    agent=web_scraper_agent,
)

# Create and run the crew
crew = Crew(agents=[web_scraper_agent], tasks=[scrape_task])
result = crew.kickoff()

You can also customize the scraping parameters:

Code
# Example with custom scraping parameters
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract information from websites with custom parameters",
    backstory="An expert in web scraping who can extract content from any website.",
    tools=[scrape_tool],
    verbose=True,
)

# The agent will use the tool with parameters like:
# url="https://web-scraping.dev/products"
# scrape_format="markdown"
# ignore_scrape_failures=True
# scrape_config={
#     "asp": True,  # Bypass scraping blocking solutions, like Cloudflare
#     "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
#     "proxy_pool": "public_residential_pool",  # Select a proxy pool
#     "country": "us",  # Select a proxy location
#     "auto_scroll": True,  # Auto scroll the page
# }

scrape_task = Task(
    description="Extract the main content from the product page at https://web-scraping.dev/products using advanced scraping options including JavaScript rendering and proxy settings.",
    expected_output="A detailed summary of the products with all available information.",
    agent=web_scraper_agent,
)

Parameters

The ScrapflyScrapeWebsiteTool accepts the following parameters:

Initialization Parameters

  • api_key: Required. Your Scrapfly API key.

Run Parameters

  • url: Required. The URL of the website to scrape.
  • scrape_format: Optional. The format in which to extract the web page content. Options are "raw" (HTML), "markdown", or "text". Defaults to "markdown".
  • scrape_config: Optional. A dictionary of additional Scrapfly scraping configuration options.
  • ignore_scrape_failures: Optional. Whether to ignore failures during scraping. If set to True, the tool returns None instead of raising an exception when scraping fails.
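
For reference, these run parameters can also be passed when invoking the tool directly, outside of an agent (a minimal sketch, assuming the standard run interface that crewAI tools expose; agents normally supply these arguments themselves):

Code
# Direct invocation sketch; agents normally pass these arguments for you.
result = scrape_tool.run(
    url="https://web-scraping.dev/products",
    scrape_format="text",          # "raw", "markdown", or "text"
    ignore_scrape_failures=True,   # return None on failure instead of raising
)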

Scrapfly Configuration Options

The scrape_config parameter lets you customize scraping behavior with the following options (see the sketch after this list for a combined example):

  • asp: Enable anti-scraping protection bypass.
  • render_js: Enable JavaScript rendering with a cloud headless browser.
  • proxy_pool: Select a proxy pool (e.g., "public_residential_pool", "datacenter").
  • country: Select a proxy location (e.g., "us", "uk").
  • auto_scroll: Automatically scroll the page to load lazily loaded content.
  • js: Execute custom JavaScript code in the headless browser.
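
As an illustration, a scrape_config combining several of these options might look like the following sketch; every value is an example, not a requirement:

Code
# Illustrative scrape_config; every value here is an example, not a requirement.
scrape_config = {
    "asp": True,                              # bypass anti-scraping protection
    "render_js": True,                        # render JavaScript in a cloud headless browser
    "proxy_pool": "public_residential_pool",  # residential proxy pool
    "country": "us",                          # US-based proxy location
    "auto_scroll": True,                      # scroll to load lazily loaded content
}

result = scrape_tool.run(
    url="https://web-scraping.dev/products",
    scrape_config=scrape_config,
)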

For the complete list of configuration options, refer to the Scrapfly API documentation.

Usage

When the ScrapflyScrapeWebsiteTool is used with an agent, the agent needs to provide the URL of the website to scrape and can optionally specify the format and additional configuration options:

Code
# Example of using the tool with an agent
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract information from websites",
    backstory="An expert in web scraping who can extract content from any website.",
    tools=[scrape_tool],
    verbose=True,
)

# Create a task for the agent
scrape_task = Task(
    description="Extract the main content from example.com in markdown format.",
    expected_output="The main content of example.com in markdown format.",
    agent=web_scraper_agent,
)

# Run the task
crew = Crew(agents=[web_scraper_agent], tasks=[scrape_task])
result = crew.kickoff()

For more advanced usage with a custom configuration:

Code
# Create a task with more specific instructions
advanced_scrape_task = Task(
    description="""
    Extract content from example.com with the following requirements:
    - Convert the content to plain text format
    - Enable JavaScript rendering
    - Use a US-based proxy
    - Handle any scraping failures gracefully
    """,
    expected_output="The extracted content from example.com",
    agent=web_scraper_agent,
)

Error Handling

By default, the ScrapflyScrapeWebsiteTool raises an exception if scraping fails. The agent can be instructed to handle failures gracefully by specifying the ignore_scrape_failures parameter:

Code
# Create a task that instructs the agent to handle errors
error_handling_task = Task(
    description="""
    Extract content from a potentially problematic website and make sure to handle any 
    scraping failures gracefully by setting ignore_scrape_failures to True.
    """,
    expected_output="Either the extracted content or a graceful error message",
    agent=web_scraper_agent,
)
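
When invoking the tool directly, the same failure mode can be detected by checking for None (a minimal sketch; the URL and fallback message are illustrative):

Code
# With ignore_scrape_failures=True the tool returns None instead of raising.
content = scrape_tool.run(
    url="https://example.com/might-fail",  # illustrative URL
    ignore_scrape_failures=True,
)
if content is None:
    print("Scrape failed; continuing without this page.")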

Implementation Details

The ScrapflyScrapeWebsiteTool uses the Scrapfly SDK to interact with the Scrapfly API:

Code
import logging
from typing import Any, Dict, Optional

from crewai.tools import BaseTool  # import path for BaseTool may vary across crewai versions

logger = logging.getLogger(__name__)


class ScrapflyScrapeWebsiteTool(BaseTool):
    name: str = "Scrapfly web scraping API tool"
    description: str = (
        "Scrape a webpage url using Scrapfly and return its content as markdown or text"
    )
    
    # Implementation details...
    
    def _run(
        self,
        url: str,
        scrape_format: str = "markdown",
        scrape_config: Optional[Dict[str, Any]] = None,
        ignore_scrape_failures: Optional[bool] = None,
    ):
        from scrapfly import ScrapeApiResponse, ScrapeConfig

        scrape_config = scrape_config if scrape_config is not None else {}
        try:
            response: ScrapeApiResponse = self.scrapfly.scrape(
                ScrapeConfig(url, format=scrape_format, **scrape_config)
            )
            return response.scrape_result["content"]
        except Exception as e:
            if ignore_scrape_failures:
                logger.error(f"Error fetching data from {url}, exception: {e}")
                return None
            else:
                raise e

Conclusion

The ScrapflyScrapeWebsiteTool provides a powerful way to extract content from websites through Scrapfly's advanced web scraping capabilities. With features such as headless browser support, proxies, and anti-bot bypass, it can handle complex websites and extract content in a variety of formats. This tool is particularly useful for data extraction, content monitoring, and research tasks that require reliable web scraping.