ScrapflyScrapeWebsiteTool

Description

The ScrapflyScrapeWebsiteTool is designed to leverage Scrapfly's web scraping API to extract content from websites. The tool provides advanced scraping capabilities, including headless browser support, proxies, and anti-bot bypass. It can extract web page data in several formats, including raw HTML, markdown, and plain text, which makes it well suited to a wide range of web scraping tasks.

Installation

To use this tool, you need to install the Scrapfly SDK:
uv add scrapfly-sdk
You will also need to register at scrapfly.io/register to obtain a Scrapfly API key.
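Once the SDK is installed and you have a key, you can sanity-check the setup with a standalone call. This is a minimal sketch that mirrors what the tool does internally (see Implementation Details below); the URL is just an example target:
Code
from scrapfly import ScrapflyClient, ScrapeConfig

# Verify the SDK and API key with a single scrape
client = ScrapflyClient(key="your_scrapfly_api_key")
response = client.scrape(ScrapeConfig(url="https://web-scraping.dev/products"))
print(response.scrape_result["content"][:200])  # preview the extracted content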

Steps to Get Started

To effectively use the ScrapflyScrapeWebsiteTool, follow these steps:
  1. Install Dependencies: Install the Scrapfly SDK using the command above.
  2. Obtain API Key: Register with Scrapfly to get your API key.
  3. Initialize the Tool: Create an instance of the tool with your API key (see the sketch after this list).
  4. Configure Scraping Parameters: Customize the scraping parameters based on your needs.
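Step 3 amounts to a single constructor call with your key (placeholder shown):
Code
from crewai_tools import ScrapflyScrapeWebsiteTool

# Create the tool instance with your Scrapfly API key
scrape_tool = ScrapflyScrapeWebsiteTool(api_key="your_scrapfly_api_key")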

Example

The following example demonstrates how to use the ScrapflyScrapeWebsiteTool to extract content from a website:
Code
from crewai import Agent, Task, Crew
from crewai_tools import ScrapflyScrapeWebsiteTool

# Initialize the tool
scrape_tool = ScrapflyScrapeWebsiteTool(api_key="your_scrapfly_api_key")

# Define an agent that uses the tool
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract information from websites",
    backstory="An expert in web scraping who can extract content from any website.",
    tools=[scrape_tool],
    verbose=True,
)

# Example task to extract content from a website
scrape_task = Task(
    description="Extract the main content from the product page at https://web-scraping.dev/products and summarize the available products.",
    expected_output="A summary of the products available on the website.",
    agent=web_scraper_agent,
)

# Create and run the crew
crew = Crew(agents=[web_scraper_agent], tasks=[scrape_task])
result = crew.kickoff()
You can also customize the scraping parameters:
Code
# Example with custom scraping parameters
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract information from websites with custom parameters",
    backstory="An expert in web scraping who can extract content from any website.",
    tools=[scrape_tool],
    verbose=True,
)

# The agent will use the tool with parameters like:
# url="https://web-scraping.dev/products"
# scrape_format="markdown"
# ignore_scrape_failures=True
# scrape_config={
#     "asp": True,  # Bypass scraping blocking solutions, like Cloudflare
#     "render_js": True,  # Enable JavaScript rendering with a cloud headless browser
#     "proxy_pool": "public_residential_pool",  # Select a proxy pool
#     "country": "us",  # Select a proxy location
#     "auto_scroll": True,  # Auto scroll the page
# }

scrape_task = Task(
    description="Extract the main content from the product page at https://web-scraping.dev/products using advanced scraping options including JavaScript rendering and proxy settings.",
    expected_output="A detailed summary of the products with all available information.",
    agent=web_scraper_agent,
)
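As in the first example, the task is then executed by putting the agent and task into a crew:
Code
# Run it the same way as the first example
crew = Crew(agents=[web_scraper_agent], tasks=[scrape_task])
result = crew.kickoff()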

Parameters

The ScrapflyScrapeWebsiteTool accepts the following parameters:

Initialization Parameters

  • api_key: Required. Your Scrapfly API key.

Run Parameters

  • url: Required. The URL of the website to scrape.
  • scrape_format: Optional. The format in which to extract the web page content. Options are "raw" (HTML), "markdown", or "text". Defaults to "markdown".
  • scrape_config: Optional. A dictionary with additional Scrapfly scraping configuration options.
  • ignore_scrape_failures: Optional. Whether to ignore failures during scraping. If set to True, the tool returns None instead of raising an exception when scraping fails, as illustrated in the sketch below.
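For reference, a direct invocation with these run parameters might look like the following sketch (assuming the tool's standard run() entry point):
Code
# Sketch: calling the tool directly with explicit run parameters
content = scrape_tool.run(
    url="https://web-scraping.dev/products",
    scrape_format="text",          # "raw", "markdown", or "text"
    ignore_scrape_failures=True,   # return None on failure instead of raising
)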

Scrapfly Configuration Options

The scrape_config parameter lets you customize the scraping behavior with the following options (combined in the sketch after this list):
  • asp: Enable bypassing of anti-scraping protection.
  • render_js: Enable JavaScript rendering with a cloud headless browser.
  • proxy_pool: Select a proxy pool (e.g., "public_residential_pool", "datacenter").
  • country: Select a proxy location (e.g., "us", "uk").
  • auto_scroll: Automatically scroll the page to load lazy-loaded content.
  • js: Execute custom JavaScript code via the headless browser.
For a complete list of configuration options, refer to the Scrapfly API documentation.
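Putting these options together, a scrape_config dictionary can be passed on a direct call like this (a sketch; the values mirror the commented example above):
Code
# Sketch: passing Scrapfly options through scrape_config
content = scrape_tool.run(
    url="https://web-scraping.dev/products",
    scrape_format="markdown",
    scrape_config={
        "asp": True,                              # bypass anti-scraping protection
        "render_js": True,                        # render with a cloud headless browser
        "proxy_pool": "public_residential_pool",  # select a proxy pool
        "country": "us",                          # proxy location
        "auto_scroll": True,                      # load lazy-loaded content
    },
)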

Usage

When using the ScrapflyScrapeWebsiteTool with an agent, the agent needs to provide the URL of the website to scrape and can optionally specify the format and additional configuration options:
Code
# Example of using the tool with an agent
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract information from websites",
    backstory="An expert in web scraping who can extract content from any website.",
    tools=[scrape_tool],
    verbose=True,
)

# Create a task for the agent
scrape_task = Task(
    description="Extract the main content from example.com in markdown format.",
    expected_output="The main content of example.com in markdown format.",
    agent=web_scraper_agent,
)

# Run the task
crew = Crew(agents=[web_scraper_agent], tasks=[scrape_task])
result = crew.kickoff()
For more advanced usage with a custom configuration:
Code
# Create a task with more specific instructions
advanced_scrape_task = Task(
    description="""
    Extract content from example.com with the following requirements:
    - Convert the content to plain text format
    - Enable JavaScript rendering
    - Use a US-based proxy
    - Handle any scraping failures gracefully
    """,
    expected_output="The extracted content from example.com",
    agent=web_scraper_agent,
)

Error Handling

By default, the ScrapflyScrapeWebsiteTool raises an exception if scraping fails. The agent can be instructed to handle failures gracefully by specifying the ignore_scrape_failures parameter:
Code
# Create a task that instructs the agent to handle errors
error_handling_task = Task(
    description="""
    Extract content from a potentially problematic website and make sure to handle any 
    scraping failures gracefully by setting ignore_scrape_failures to True.
    """,
    expected_output="Either the extracted content or a graceful error message",
    agent=web_scraper_agent,
)
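When calling the tool directly, the same behavior can be exercised in code (a sketch; as the implementation below shows, a failed scrape returns None when ignore_scrape_failures is True):
Code
# Sketch: graceful failure handling on a direct tool call
content = scrape_tool.run(
    url="https://web-scraping.dev/products",
    ignore_scrape_failures=True,
)
if content is None:
    print("Scrape failed; continuing without the page content.")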

Implementation Details

The ScrapflyScrapeWebsiteTool uses the Scrapfly SDK to interact with the Scrapfly API:
Code
from typing import Any, Dict, Optional

class ScrapflyScrapeWebsiteTool(BaseTool):
    name: str = "Scrapfly web scraping API tool"
    description: str = (
        "Scrape a webpage url using Scrapfly and return its content as markdown or text"
    )

    # Implementation details...
    # (self.scrapfly below is the ScrapflyClient created from the api_key
    # passed at initialization)
    def _run(
        self,
        url: str,
        scrape_format: str = "markdown",
        scrape_config: Optional[Dict[str, Any]] = None,
        ignore_scrape_failures: Optional[bool] = None,
    ):
        from scrapfly import ScrapeApiResponse, ScrapeConfig

        scrape_config = scrape_config if scrape_config is not None else {}
        try:
            response: ScrapeApiResponse = self.scrapfly.scrape(
                ScrapeConfig(url, format=scrape_format, **scrape_config)
            )
            return response.scrape_result["content"]
        except Exception as e:
            if ignore_scrape_failures:
                logger.error(f"Error fetching data from {url}, exception: {e}")
                return None
            else:
                raise e

Conclusion

The ScrapflyScrapeWebsiteTool offers a powerful way to extract content from websites by leveraging Scrapfly's advanced scraping capabilities. With features such as headless browser support, proxies, and anti-bot bypass, it can handle complex websites and extract content in multiple formats. The tool is particularly useful for data extraction, content monitoring, and research tasks that require reliable web scraping.