RagTool

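Description

RagTool answers questions using Retrieval-Augmented Generation (RAG) through CrewAI's native RAG system. It provides a dynamic knowledge base that can be queried to retrieve relevant information from a variety of data sources. The tool is particularly useful for applications that need access to a large body of information and must return contextually relevant answers.

Example

The following example shows how to initialize the tool and use it with different data sources:
Code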
from crewai import Agent
from crewai.project import agent
from crewai_tools import RagTool

# Create a RAG tool with default settings
rag_tool = RagTool()

# Add content from a file
rag_tool.add(data_type="file", path="path/to/your/document.pdf")

# Add content from a web page
rag_tool.add(data_type="web_page", url="https://example.com")

# Define an agent with the RagTool
@agent
def knowledge_expert(self) -> Agent:
    '''
    This agent uses the RagTool to answer questions about the knowledge base.
    '''
    return Agent(
        config=self.agents_config["knowledge_expert"],
        allow_delegation=False,
        tools=[rag_tool]
    )

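Supported Data Sources

RagTool can be used with a wide variety of data sources, including:
  • 📰 PDF files
  • 📊 CSV files
  • 📃 JSON files
  • 📝 Text
  • 📁 Directories/folders
  • 🌐 HTML web pages
  • 📽️ YouTube channels
  • 📺 YouTube videos
  • 📚 Documentation websites
  • 📝 MDX files
  • 📄 DOCX files
  • 🧾 XML files
  • 📬 Gmail
  • 📝 GitHub repositories
  • 🐘 PostgreSQL databases
  • 🐬 MySQL databases
  • 🤖 Slack conversations
  • 💬 Discord messages
  • 🗨️ Discourse forums
  • 📝 Substack newsletters
  • 🐝 Beehiiv content
  • 💾 Dropbox files
  • 🖼️ Images
  • ⚙️ Custom data sources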

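Parameters

RagTool accepts the following parameters:
  • summarize: Optional. Whether to summarize the retrieved content. Default: False.
  • adapter: Optional. A custom adapter for the knowledge base. If not provided, CrewAIRagAdapter is used.
  • config: Optional. Configuration for the underlying CrewAI RAG system. Accepts a RagToolConfig TypedDict with optional embedding_model (ProviderSpec) and vectordb (VectorDbConfig) keys. Any configuration value provided programmatically takes precedence over environment variables.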
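
The precedence rule for config can be pictured with a small sketch. The helper below is hypothetical (resolve_setting is not part of CrewAI) and only illustrates the order described above: a value passed in code wins, and the environment variable is consulted only when no such value was given.

```python
import os


def resolve_setting(name, programmatic, env_var, default=None, environ=None):
    """Return the programmatic value if present, else the env var, else default."""
    environ = os.environ if environ is None else environ
    if name in programmatic:
        return programmatic[name]
    return environ.get(env_var, default)


# Hypothetical environment and programmatic config, for illustration only.
fake_env = {"EMBEDDINGS_OPENAI_MODEL_NAME": "text-embedding-ada-002"}
explicit = {"model_name": "text-embedding-3-small"}

chosen = resolve_setting(
    "model_name", explicit, "EMBEDDINGS_OPENAI_MODEL_NAME", environ=fake_env
)
fallback = resolve_setting(
    "model_name", {}, "EMBEDDINGS_OPENAI_MODEL_NAME", environ=fake_env
)
print(chosen)    # text-embedding-3-small
print(fallback)  # text-embedding-ada-002
```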

Adding Content

You can add content to the knowledge base with the add method:
Code
# Add a PDF file
rag_tool.add(data_type="file", path="path/to/your/document.pdf")

# Add a web page
rag_tool.add(data_type="web_page", url="https://example.com")

# Add a YouTube video
rag_tool.add(data_type="youtube_video", url="https://www.youtube.com/watch?v=VIDEO_ID")

# Add a directory of files
rag_tool.add(data_type="directory", path="path/to/your/directory")
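
When sources arrive as plain strings, it can help to derive the add arguments from the string itself. The sketch below assumes a hypothetical helper, build_add_kwargs, that covers only the four data types shown above; the rule for recognizing a YouTube video URL is an assumption made for this example, not something RagTool does for you.

```python
import os
from urllib.parse import urlparse


def build_add_kwargs(source: str) -> dict:
    """Map a source string to keyword arguments for rag_tool.add().

    Hypothetical helper: handles only file, directory, web_page and
    youtube_video, the data types used in the example above.
    """
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        host = parsed.netloc.lower()
        # Assumption: a /watch URL on youtube.com points at a single video.
        if host in ("youtube.com", "www.youtube.com") and parsed.path == "/watch":
            return {"data_type": "youtube_video", "url": source}
        return {"data_type": "web_page", "url": source}
    if os.path.isdir(source):
        return {"data_type": "directory", "path": source}
    # Anything else is treated as a file path.
    return {"data_type": "file", "path": source}


print(build_add_kwargs("https://example.com"))
print(build_add_kwargs("path/to/your/document.pdf"))
```

With a tool instance in hand, the result can be unpacked directly: rag_tool.add(**build_add_kwargs("https://example.com")).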

Agent Integration Example

Here is how to integrate RagTool with a CrewAI agent:
Code
from crewai import Agent
from crewai.project import agent
from crewai_tools import RagTool

# Initialize the tool and add content
rag_tool = RagTool()
rag_tool.add(data_type="web_page", url="https://docs.crewai.org.cn")
rag_tool.add(data_type="file", path="company_data.pdf")

# Define an agent with the RagTool
@agent
def knowledge_expert(self) -> Agent:
    return Agent(
        config=self.agents_config["knowledge_expert"],
        allow_delegation=False,
        tools=[rag_tool]
    )

Advanced Configuration

You can customize the behavior of RagTool by providing a configuration dictionary:
Code
from crewai_tools import RagTool
from crewai_tools.tools.rag import RagToolConfig, VectorDbConfig, ProviderSpec

# Create a RAG tool with custom configuration

vectordb: VectorDbConfig = {
    "provider": "qdrant",
    "config": {
        "collection_name": "my-collection"
    }
}

embedding_model: ProviderSpec = {
    "provider": "openai",
    "config": {
        "model_name": "text-embedding-3-small"
    }
}

config: RagToolConfig = {
    "vectordb": vectordb,
    "embedding_model": embedding_model
}

rag_tool = RagTool(config=config, summarize=True)

Embedding Model Configuration

The embedding_model parameter accepts a dictionary structured as a crewai.rag.embeddings.types.ProviderSpec:
{
    "provider": "provider-name",  # Required
    "config": {                    # Optional
        # Provider-specific configuration
    }
}
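
The shape above (a required provider string plus an optional config dictionary) can be checked before it is handed to the tool. The following is a sketch only: ProviderSpecSketch and check_provider_spec are illustrative names, not CrewAI types, and the real ProviderSpec may enforce more than this.

```python
from typing import Any, TypedDict


class _RequiredPart(TypedDict):
    provider: str  # Required


class ProviderSpecSketch(_RequiredPart, total=False):
    config: dict[str, Any]  # Optional, provider-specific


def check_provider_spec(spec: dict) -> list[str]:
    """Return a list of problems found in a provider spec (empty if none)."""
    problems = []
    provider = spec.get("provider")
    if not isinstance(provider, str) or not provider:
        problems.append("'provider' is required and must be a non-empty string")
    if "config" in spec and not isinstance(spec["config"], dict):
        problems.append("'config' must be a dictionary when present")
    return problems


good: ProviderSpecSketch = {"provider": "openai"}
print(check_provider_spec(good))            # []
print(check_provider_spec({"config": {}}))  # one problem: missing provider
```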

Supported Providers

main.py
from crewai.rag.embeddings.providers.openai.types import OpenAIProviderSpec

embedding_model: OpenAIProviderSpec = {
    "provider": "openai",
    "config": {
        "api_key": "your-api-key",
        "model_name": "text-embedding-ada-002",
        "dimensions": 1536,
        "organization_id": "your-org-id",
        "api_base": "https://api.openai.com/v1",
        "api_version": "v1",
        "default_headers": {"Custom-Header": "value"}
    }
}
Configuration options
  • api_key (str): OpenAI API key
  • model_name (str): Model to use. Default: text-embedding-ada-002. Options: text-embedding-3-small, text-embedding-3-large, text-embedding-ada-002
  • dimensions (int): Number of dimensions for the embedding
  • organization_id (str): OpenAI organization ID
  • api_base (str): Custom API base URL
  • api_version (str): API version
  • default_headers (dict): Custom headers for API requests
Environment variables
  • OPENAI_API_KEY or EMBEDDINGS_OPENAI_API_KEY: api_key
  • OPENAI_ORGANIZATION_ID or EMBEDDINGS_OPENAI_ORGANIZATION_ID: organization_id
  • OPENAI_MODEL_NAME or EMBEDDINGS_OPENAI_MODEL_NAME: model_name
  • OPENAI_API_BASE or EMBEDDINGS_OPENAI_API_BASE: api_base
  • OPENAI_API_VERSION or EMBEDDINGS_OPENAI_API_VERSION: api_version
  • OPENAI_DIMENSIONS or EMBEDDINGS_OPENAI_DIMENSIONS: dimensions
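
The mapping from environment variables to config keys can be written out as a table. The sketch below builds a config dictionary from the OpenAI variables listed above, converting dimensions to an integer. It is illustrative only: config_from_env is a hypothetical helper, and which of the two variable names wins when both are set is an assumption here, not documented behavior.

```python
import os

# (config key, accepted environment variable names, converter)
OPENAI_ENV_MAP = [
    ("api_key", ("OPENAI_API_KEY", "EMBEDDINGS_OPENAI_API_KEY"), str),
    ("organization_id", ("OPENAI_ORGANIZATION_ID", "EMBEDDINGS_OPENAI_ORGANIZATION_ID"), str),
    ("model_name", ("OPENAI_MODEL_NAME", "EMBEDDINGS_OPENAI_MODEL_NAME"), str),
    ("api_base", ("OPENAI_API_BASE", "EMBEDDINGS_OPENAI_API_BASE"), str),
    ("api_version", ("OPENAI_API_VERSION", "EMBEDDINGS_OPENAI_API_VERSION"), str),
    ("dimensions", ("OPENAI_DIMENSIONS", "EMBEDDINGS_OPENAI_DIMENSIONS"), int),
]


def config_from_env(environ=None) -> dict:
    """Collect the config keys whose environment variables are set."""
    environ = os.environ if environ is None else environ
    config = {}
    for key, names, convert in OPENAI_ENV_MAP:
        for name in names:
            if name in environ:
                config[key] = convert(environ[name])
                break  # Assumption: the first listed name wins.
    return config


# A stand-in environment, so the example does not depend on real variables.
fake_env = {"EMBEDDINGS_OPENAI_API_KEY": "your-api-key", "OPENAI_DIMENSIONS": "1536"}
print(config_from_env(fake_env))  # {'api_key': 'your-api-key', 'dimensions': 1536}
```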
main.py
from crewai.rag.embeddings.providers.cohere.types import CohereProviderSpec

embedding_model: CohereProviderSpec = {
    "provider": "cohere",
    "config": {
        "api_key": "your-api-key",
        "model_name": "embed-english-v3.0"
    }
}
Configuration options
  • api_key (str): Cohere API key
  • model_name (str): Model to use. Default: large. Options: embed-english-v3.0, embed-multilingual-v3.0, large, small
Environment variables
  • COHERE_API_KEY or EMBEDDINGS_COHERE_API_KEY: api_key
  • EMBEDDINGS_COHERE_MODEL_NAME: model_name
main.py
from crewai.rag.embeddings.providers.voyageai.types import VoyageAIProviderSpec

embedding_model: VoyageAIProviderSpec = {
    "provider": "voyageai",
    "config": {
        "api_key": "your-api-key",
        "model": "voyage-3",
        "input_type": "document",
        "truncation": True,
        "output_dtype": "float32",
        "output_dimension": 1024,
        "max_retries": 3,
        "timeout": 60.0
    }
}
Configuration options
  • api_key (str): VoyageAI API key
  • model (str): Model to use. Default: voyage-2. Options: voyage-3, voyage-3-lite, voyage-code-3, voyage-large-2
  • input_type (str): Type of input. Options: document (for storage), query (for search)
  • truncation (bool): Whether to truncate inputs that exceed the maximum length. Default: True
  • output_dtype (str): Output data type
  • output_dimension (int): Dimension of the output embeddings
  • max_retries (int): Maximum number of retries. Default: 0
  • timeout (float): Request timeout in seconds
Environment variables
  • VOYAGEAI_API_KEY or EMBEDDINGS_VOYAGEAI_API_KEY: api_key
  • VOYAGEAI_MODEL or EMBEDDINGS_VOYAGEAI_MODEL: model
  • VOYAGEAI_INPUT_TYPE or EMBEDDINGS_VOYAGEAI_INPUT_TYPE: input_type
  • VOYAGEAI_TRUNCATION or EMBEDDINGS_VOYAGEAI_TRUNCATION: truncation
  • VOYAGEAI_OUTPUT_DTYPE or EMBEDDINGS_VOYAGEAI_OUTPUT_DTYPE: output_dtype
  • VOYAGEAI_OUTPUT_DIMENSION or EMBEDDINGS_VOYAGEAI_OUTPUT_DIMENSION: output_dimension
  • VOYAGEAI_MAX_RETRIES or EMBEDDINGS_VOYAGEAI_MAX_RETRIES: max_retries
  • VOYAGEAI_TIMEOUT or EMBEDDINGS_VOYAGEAI_TIMEOUT: timeout
main.py
from crewai.rag.embeddings.providers.ollama.types import OllamaProviderSpec

embedding_model: OllamaProviderSpec = {
    "provider": "ollama",
    "config": {
        "model_name": "llama2",
        "url": "http://localhost:11434/api/embeddings"
    }
}
Configuration options
  • model_name (str): Ollama model name (e.g., llama2, mistral, nomic-embed-text)
  • url (str): Ollama API endpoint URL. Default: http://localhost:11434/api/embeddings
Environment variables
  • OLLAMA_MODEL or EMBEDDINGS_OLLAMA_MODEL: model_name
  • OLLAMA_URL or EMBEDDINGS_OLLAMA_URL: url
main.py
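import boto3
from crewai.rag.embeddings.providers.aws.types import BedrockProviderSpec

# boto3 resolves credentials and region itself, e.g. from the AWS_* environment variables
boto3_session = boto3.Session()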

embedding_model: BedrockProviderSpec = {
    "provider": "amazon-bedrock",
    "config": {
        "model_name": "amazon.titan-embed-text-v2:0",
        "session": boto3_session
    }
}
Configuration options
  • model_name (str): Bedrock model ID. Default: amazon.titan-embed-text-v1. Options: amazon.titan-embed-text-v1, amazon.titan-embed-text-v2:0, cohere.embed-english-v3, cohere.embed-multilingual-v3
  • session (Any): Boto3 session object used for AWS authentication
Environment variables
  • AWS_ACCESS_KEY_ID: AWS access key
  • AWS_SECRET_ACCESS_KEY: AWS secret key
  • AWS_REGION: AWS region (e.g., us-east-1)
main.py
from crewai.rag.embeddings.providers.microsoft.types import AzureProviderSpec

embedding_model: AzureProviderSpec = {
    "provider": "azure",
    "config": {
        "deployment_id": "your-deployment-id",
        "api_key": "your-api-key",
        "api_base": "https://your-resource.openai.azure.com",
        "api_version": "2024-02-01",
        "model_name": "text-embedding-ada-002",
        "api_type": "azure"
    }
}
Configuration options
  • deployment_id (str): Required. Azure OpenAI deployment ID
  • api_key (str): Azure OpenAI API key
  • api_base (str): Azure OpenAI resource endpoint
  • api_version (str): API version. Example: 2024-02-01
  • model_name (str): Model name. Default: text-embedding-ada-002
  • api_type (str): API type. Default: azure
  • dimensions (int): Output dimensions
  • default_headers (dict): Custom headers
Environment variables
  • AZURE_OPENAI_API_KEY or EMBEDDINGS_AZURE_API_KEY: api_key
  • AZURE_OPENAI_ENDPOINT or EMBEDDINGS_AZURE_API_BASE: api_base
  • EMBEDDINGS_AZURE_DEPLOYMENT_ID: deployment_id
  • EMBEDDINGS_AZURE_API_VERSION: api_version
  • EMBEDDINGS_AZURE_MODEL_NAME: model_name
  • EMBEDDINGS_AZURE_API_TYPE: api_type
  • EMBEDDINGS_AZURE_DIMENSIONS: dimensions
main.py
from crewai.rag.embeddings.providers.google.types import GenerativeAiProviderSpec

embedding_model: GenerativeAiProviderSpec = {
    "provider": "google-generativeai",
    "config": {
        "api_key": "your-api-key",
        "model_name": "gemini-embedding-001",
        "task_type": "RETRIEVAL_DOCUMENT"
    }
}
Configuration options
  • api_key (str): Google AI API key
  • model_name (str): Model name. Default: gemini-embedding-001. Options: gemini-embedding-001, text-embedding-005, text-multilingual-embedding-002
  • task_type (str): Task type for the embeddings. Default: RETRIEVAL_DOCUMENT. Options: RETRIEVAL_DOCUMENT, RETRIEVAL_QUERY
Environment variables
  • GOOGLE_API_KEY, GEMINI_API_KEY, or EMBEDDINGS_GOOGLE_API_KEY: api_key
  • EMBEDDINGS_GOOGLE_GENERATIVE_AI_MODEL_NAME: model_name
  • EMBEDDINGS_GOOGLE_GENERATIVE_AI_TASK_TYPE: task_type
main.py
from crewai.rag.embeddings.providers.google.types import VertexAIProviderSpec

embedding_model: VertexAIProviderSpec = {
    "provider": "google-vertex",
    "config": {
        "model_name": "text-embedding-004",
        "project_id": "your-project-id",
        "region": "us-central1",
        "api_key": "your-api-key"
    }
}
Configuration options
  • model_name (str): Model name. Default: textembedding-gecko. Options: text-embedding-004, textembedding-gecko, textembedding-gecko-multilingual
  • project_id (str): Google Cloud project ID. Default: cloud-large-language-models
  • region (str): Google Cloud region. Default: us-central1
  • api_key (str): API key for authentication
Environment variables
  • GOOGLE_APPLICATION_CREDENTIALS: Path to the service account JSON file
  • GOOGLE_CLOUD_PROJECT or EMBEDDINGS_GOOGLE_VERTEX_PROJECT_ID: project_id
  • EMBEDDINGS_GOOGLE_VERTEX_MODEL_NAME: model_name
  • EMBEDDINGS_GOOGLE_VERTEX_REGION: region
  • EMBEDDINGS_GOOGLE_VERTEX_API_KEY: api_key
main.py
from crewai.rag.embeddings.providers.jina.types import JinaProviderSpec

embedding_model: JinaProviderSpec = {
    "provider": "jina",
    "config": {
        "api_key": "your-api-key",
        "model_name": "jina-embeddings-v3"
    }
}
Configuration options
  • api_key (str): Jina AI API key
  • model_name (str): Model name. Default: jina-embeddings-v2-base-en. Options: jina-embeddings-v3, jina-embeddings-v2-base-en, jina-embeddings-v2-small-en
Environment variables
  • JINA_API_KEY or EMBEDDINGS_JINA_API_KEY: api_key
  • EMBEDDINGS_JINA_MODEL_NAME: model_name
main.py
from crewai.rag.embeddings.providers.huggingface.types import HuggingFaceProviderSpec

embedding_model: HuggingFaceProviderSpec = {
    "provider": "huggingface",
    "config": {
        "url": "https://api-inference.huggingface.co/models/sentence-transformers/all-MiniLM-L6-v2"
    }
}
Configuration options
  • url (str): Full URL of the HuggingFace Inference API endpoint
Environment variables
  • HUGGINGFACE_URL or EMBEDDINGS_HUGGINGFACE_URL: url
main.py
from crewai.rag.embeddings.providers.instructor.types import InstructorProviderSpec

embedding_model: InstructorProviderSpec = {
    "provider": "instructor",
    "config": {
        "model_name": "hkunlp/instructor-xl",
        "device": "cuda",
        "instruction": "Represent the document"
    }
}
Configuration options
  • model_name (str): HuggingFace model ID. Default: hkunlp/instructor-base. Options: hkunlp/instructor-xl, hkunlp/instructor-large, hkunlp/instructor-base
  • device (str): Device to run on. Default: cpu. Options: cpu, cuda, mps
  • instruction (str): Instruction prefix for the embeddings
Environment variables
  • EMBEDDINGS_INSTRUCTOR_MODEL_NAME: model_name
  • EMBEDDINGS_INSTRUCTOR_DEVICE: device
  • EMBEDDINGS_INSTRUCTOR_INSTRUCTION: instruction
main.py
from crewai.rag.embeddings.providers.sentence_transformer.types import SentenceTransformerProviderSpec

embedding_model: SentenceTransformerProviderSpec = {
    "provider": "sentence-transformer",
    "config": {
        "model_name": "all-mpnet-base-v2",
        "device": "cuda",
        "normalize_embeddings": True
    }
}
Configuration options
  • model_name (str): Sentence Transformers model name. Default: all-MiniLM-L6-v2. Options: all-mpnet-base-v2, all-MiniLM-L6-v2, paraphrase-multilingual-MiniLM-L12-v2
  • device (str): Device to run on. Default: cpu. Options: cpu, cuda, mps
  • normalize_embeddings (bool): Whether to normalize the embeddings. Default: False
Environment variables
  • EMBEDDINGS_SENTENCE_TRANSFORMER_MODEL_NAME: model_name
  • EMBEDDINGS_SENTENCE_TRANSFORMER_DEVICE: device
  • EMBEDDINGS_SENTENCE_TRANSFORMER_NORMALIZE_EMBEDDINGS: normalize_embeddings
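
For context on normalize_embeddings: in the Sentence Transformers library, this option scales each vector to unit length (L2 normalization), so that a dot product between two vectors equals their cosine similarity. A minimal pure-Python sketch of that operation follows; l2_normalize is an illustrative name and the code is independent of any embedding library.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


unit = l2_normalize([3.0, 4.0])
print(unit)  # [0.6, 0.8]
```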
main.py
from crewai.rag.embeddings.providers.onnx.types import ONNXProviderSpec

embedding_model: ONNXProviderSpec = {
    "provider": "onnx",
    "config": {
        "preferred_providers": ["CUDAExecutionProvider", "CPUExecutionProvider"]
    }
}
Configuration options
  • preferred_providers (list[str]): List of ONNX execution providers, in order of preference
Environment variables
  • EMBEDDINGS_ONNX_PREFERRED_PROVIDERS: preferred_providers (comma-separated list)
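
Because the environment variable carries a list as a single comma-separated string, it has to be split back into items. The sketch below shows one way to do that; parse_preferred_providers is a hypothetical helper, and stripping whitespace and skipping empty items are assumptions about reasonable handling, not a description of what CrewAI does internally.

```python
def parse_preferred_providers(raw: str) -> list[str]:
    """Split a comma-separated string into a list, preserving order."""
    return [item.strip() for item in raw.split(",") if item.strip()]


providers = parse_preferred_providers("CUDAExecutionProvider, CPUExecutionProvider")
print(providers)  # ['CUDAExecutionProvider', 'CPUExecutionProvider']
```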
main.py
from crewai.rag.embeddings.providers.openclip.types import OpenCLIPProviderSpec

embedding_model: OpenCLIPProviderSpec = {
    "provider": "openclip",
    "config": {
        "model_name": "ViT-B-32",
        "checkpoint": "laion2b_s34b_b79k",
        "device": "cuda"
    }
}
Configuration options
  • model_name (str): OpenCLIP model architecture. Default: ViT-B-32. Options: ViT-B-32, ViT-B-16, ViT-L-14
  • checkpoint (str): Pretrained checkpoint name. Default: laion2b_s34b_b79k. Options: laion2b_s34b_b79k, laion400m_e32, openai
  • device (str): Device to run on. Default: cpu. Options: cpu, cuda
Environment variables
  • EMBEDDINGS_OPENCLIP_MODEL_NAME: model_name
  • EMBEDDINGS_OPENCLIP_CHECKPOINT: checkpoint
  • EMBEDDINGS_OPENCLIP_DEVICE: device
main.py
from crewai.rag.embeddings.providers.text2vec.types import Text2VecProviderSpec

embedding_model: Text2VecProviderSpec = {
    "provider": "text2vec",
    "config": {
        "model_name": "shibing624/text2vec-base-multilingual"
    }
}
Configuration options
  • model_name (str): Text2Vec model name on HuggingFace. Default: shibing624/text2vec-base-chinese. Options: shibing624/text2vec-base-multilingual, shibing624/text2vec-base-chinese
Environment variables
  • EMBEDDINGS_TEXT2VEC_MODEL_NAME: model_name
main.py
from crewai.rag.embeddings.providers.roboflow.types import RoboflowProviderSpec

embedding_model: RoboflowProviderSpec = {
    "provider": "roboflow",
    "config": {
        "api_key": "your-api-key",
        "api_url": "https://infer.roboflow.com"
    }
}
Configuration options
  • api_key (str): Roboflow API key. Default: "" (empty string)
  • api_url (str): Roboflow inference API URL. Default: https://infer.roboflow.com
Environment variables
  • ROBOFLOW_API_KEY or EMBEDDINGS_ROBOFLOW_API_KEY: api_key
  • ROBOFLOW_API_URL or EMBEDDINGS_ROBOFLOW_API_URL: api_url
main.py
from crewai.rag.embeddings.providers.ibm.types import WatsonXProviderSpec

embedding_model: WatsonXProviderSpec = {
    "provider": "watsonx",
    "config": {
        "model_id": "ibm/slate-125m-english-rtrvr",
        "url": "https://us-south.ml.cloud.ibm.com",
        "api_key": "your-api-key",
        "project_id": "your-project-id",
        "batch_size": 100,
        "concurrency_limit": 10,
        "persistent_connection": True
    }
}
Configuration options
  • model_id (str): WatsonX model identifier
  • url (str): WatsonX API endpoint
  • api_key (str): IBM Cloud API key
  • project_id (str): WatsonX project ID
  • space_id (str): WatsonX space ID (alternative to project_id)
  • batch_size (int): Batch size for embeddings. Default: 100
  • concurrency_limit (int): Maximum number of concurrent requests. Default: 10
  • persistent_connection (bool): Use persistent connections. Default: True
  • Plus 20+ additional authentication and configuration options
Environment variables
  • WATSONX_API_KEY or EMBEDDINGS_WATSONX_API_KEY: api_key
  • WATSONX_URL or EMBEDDINGS_WATSONX_URL: url
  • WATSONX_PROJECT_ID or EMBEDDINGS_WATSONX_PROJECT_ID: project_id
  • EMBEDDINGS_WATSONX_MODEL_ID: model_id
  • EMBEDDINGS_WATSONX_SPACE_ID: space_id
  • EMBEDDINGS_WATSONX_BATCH_SIZE: batch_size
  • EMBEDDINGS_WATSONX_CONCURRENCY_LIMIT: concurrency_limit
  • EMBEDDINGS_WATSONX_PERSISTENT_CONNECTION: persistent_connection
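
To give a feel for what a batch_size of 100 means in practice, the sketch below splits a list of texts into consecutive batches of at most that size. It illustrates the general idea of batching only: batched is a hypothetical helper, and how the WatsonX client actually groups and dispatches requests is not described here.

```python
from typing import Iterator, Sequence


def batched(texts: Sequence[str], batch_size: int = 100) -> Iterator[list[str]]:
    """Yield consecutive slices of texts, each with at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(texts), batch_size):
        yield list(texts[start:start + batch_size])


docs = [f"doc-{i}" for i in range(250)]
sizes = [len(batch) for batch in batched(docs)]
print(sizes)  # [100, 100, 50]
```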
main.py
from crewai.rag.core.base_embeddings_callable import EmbeddingFunction
from crewai.rag.embeddings.providers.custom.types import CustomProviderSpec

class MyEmbeddingFunction(EmbeddingFunction):
    def __call__(self, input):
        # Your custom embedding logic
        return embeddings

embedding_model: CustomProviderSpec = {
    "provider": "custom",
    "config": {
        "embedding_callable": MyEmbeddingFunction
    }
}
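Configuration options
  • embedding_callable (type[EmbeddingFunction]): Custom embedding function class
Note: Custom embedding functions must implement the EmbeddingFunction protocol defined in crewai.rag.core.base_embeddings_callable. The __call__ method should accept the input data and return the embeddings as a list of numpy arrays (or a compatible format that will be normalized). Returned embeddings are automatically normalized and validated.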
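
To make the __call__ contract concrete, here is a toy callable that turns each input string into a fixed-length list of floats. It is a sketch under stated assumptions: ToyHashEmbedding is an invented name, it does not subclass EmbeddingFunction (so it runs without CrewAI installed), it returns plain lists rather than numpy arrays, and hash bytes carry no semantic meaning. A real implementation would subclass EmbeddingFunction as in the example above and call an actual model.

```python
import hashlib


class ToyHashEmbedding:
    """Deterministic stand-in for real embedding logic (not semantically useful)."""

    def __init__(self, dimensions: int = 8):
        # SHA-256 yields 32 bytes, so at most 32 dimensions are available.
        if not 1 <= dimensions <= 32:
            raise ValueError("dimensions must be between 1 and 32")
        self.dimensions = dimensions

    def __call__(self, input):
        embeddings = []
        for text in input:
            digest = hashlib.sha256(text.encode("utf-8")).digest()
            # Map the first `dimensions` bytes to floats in [0, 1].
            embeddings.append([byte / 255.0 for byte in digest[: self.dimensions]])
        return embeddings


embed = ToyHashEmbedding()
vectors = embed(["hello", "world"])
print(len(vectors), len(vectors[0]))  # 2 8
```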

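Notes

  • All configuration fields are optional unless marked as required
  • API keys can usually be supplied through environment variables instead of the config
  • Default values are shown where applicable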

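Conclusion

RagTool provides a way to create and query knowledge bases built from a wide range of data sources. By using Retrieval-Augmented Generation, it lets agents retrieve relevant information efficiently, which improves their ability to give accurate and contextually appropriate responses.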