ragoon.chunks

Classes

`ChunkMetadata`(uuid, chunk_uuid, chunk_number)	Metadata for a text chunk within a dataset.
`DatasetChunker`(dataset, max_tokens, ...[, ...])	A class to chunk text data within a dataset for processing with embeddings models.
`SemanticTextSplitter`([model, token, ...])	A class for splitting text into semantically coherent sections using a language model.

class ragoon.chunks.SemanticTextSplitter(model: str | None = 'meta-llama/Meta-Llama-3.1-70B-Instruct', token: str | None = None, split_token: str = '<|split|>', system_prompt: str | None = None, max_tokens: int = 4096, stream: bool = True)

Bases: object

A class for splitting text into semantically coherent sections using a language model.

This class leverages the Hugging Face Inference API to generate splits in the input text, and then processes the result to return a list of split text sections. It is designed to work with various language models available through the Hugging Face platform.

Parameters:

model (str, optional) – The name or path of the Hugging Face model to use for text splitting. This should be a model capable of text generation tasks, such as GPT-based models. Default is ‘meta-llama/Meta-Llama-3.1-70B-Instruct’.
token (str, optional) – The Hugging Face API token for authentication. If not provided, the class will attempt to use the token stored in the Hugging Face CLI configuration.
split_token (str, optional) – The token used to split the text (default is ‘<|split|>’). This token will be inserted by the model to indicate where the text should be split.
system_prompt (str, optional) – The system prompt to use for the model. If not provided, a default prompt will be used, which instructs the model on how to split the text.
max_tokens (int, optional) – The maximum number of tokens to generate in the model’s response (default is 4096). This limit applies to the entire response, including the input prompt.
stream (bool, optional) – Whether to stream the model’s output (default is True). When True, the output will be printed as it’s generated. When False, the output will be returned all at once.

Attributes:

client (InferenceClient) – The Hugging Face Inference API client used to communicate with the model.
split_token (str) – The token used to split the text.
system_prompt (str) – The system prompt used to instruct the model on how to split the text.
max_tokens (int) – The maximum number of tokens to generate in the model’s response.
stream (bool) – Whether to stream the model’s output.

Methods:

completion(text: str) -> str – Calls the language model to process the input text.
split(text: str) -> List[str] – Splits the input text into semantically coherent sections.

Raises:

ValueError – If the model name is not provided during initialization.
RuntimeError – If there’s an error calling the Hugging Face Inference API.

Examples

>>> from ragoon.chunks import SemanticTextSplitter
>>> # Ensure you have set up your Hugging Face token using `huggingface-cli login`
>>> splitter = SemanticTextSplitter(
...     model="meta-llama/Llama-2-70b-chat-hf",
...     token=None  # Falls back to the token stored by `huggingface-cli login`
... )
>>> text = '''
... The Python programming language, created by Guido van Rossum,
... has become one of the most popular languages in the world.
... Its simplicity and readability make it an excellent choice for beginners.
... Meanwhile, data science has emerged as a crucial field in the modern world.
... Python's extensive libraries, such as NumPy and Pandas, have made it
... a favorite among data scientists and analysts.
... '''
>>> result = splitter.split(text)
>>> for section in result:
...     print(f"Section: {section}\n")

Section: The Python programming language, created by Guido van Rossum, has become one of the most popular languages in the world. Its simplicity and readability make it an excellent choice for beginners.

Section: Meanwhile, data science has emerged as a crucial field in the modern world. Python’s extensive libraries, such as NumPy and Pandas, have made it a favorite among data scientists and analysts.

Notes

The quality of the text splitting depends on the capabilities of the chosen language model.
The system prompt plays a crucial role in guiding the model’s behavior. Customizing it can lead to different splitting results.
When using streamed output, the results are printed to the console in real time, which can be useful for monitoring long-running splits.
The split token (‘<|split|>’ by default) should be chosen carefully to avoid conflicts with the content of the text being split.
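
The advice about choosing the split token can be made concrete with a small guard that runs before the text is sent to the splitter. This is only a sketch: `ensure_token_absent` is a hypothetical helper, not part of ragoon, and it assumes that a literal occurrence of the split token in the input is what would cause a conflict.

```python
def ensure_token_absent(text: str, split_token: str = "<|split|>") -> str:
    """Return text unchanged, or raise if it already contains the split token.

    Hypothetical helper for illustration; ragoon is not known to perform this check.
    """
    if not split_token:
        raise ValueError("split_token must be a non-empty string")
    if split_token in text:
        raise ValueError(
            f"Input already contains {split_token!r}; choose a different split token."
        )
    return text


# Text without the marker passes through untouched.
checked = ensure_token_absent("Plain prose without the marker.")
```

If the check fails, passing a different `split_token` to the constructor (together with a system prompt that mentions the same token) is one way to avoid the clash.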

See also

huggingface_hub.InferenceClient : The client used to interact with Hugging Face models.

Initialize the SemanticTextSplitter.

Parameters:

model (str, optional) – The name or path of the Hugging Face model to use for text splitting. Default is ‘meta-llama/Meta-Llama-3.1-70B-Instruct’.
token (str, optional) – The Hugging Face API token for authentication.
split_token (str, optional) – The token used to split the text (default is ‘<|split|>’).
system_prompt (str, optional) – The system prompt to use for the model. If None, a default prompt is used.
max_tokens (int, optional) – The maximum number of tokens to generate (default is 4096).
stream (bool, optional) – Whether to stream the model’s output (default is True).

Raises:

ValueError – If the model name is not provided.

completion(text: str) → str

Call the language model to process the input text.

This method sends the input text to the language model via the Hugging Face Inference API and returns the model’s output.

Parameters: text (str) – The input text to be processed by the model.
Returns: The processed text returned by the model, potentially including split tokens.
Return type: str
Raises: RuntimeError – If there’s an error calling the Hugging Face Inference API.

Notes

If streaming is enabled, the method will print the output in real time and return the complete output as a string.
If streaming is disabled, the method will return the complete output after the model finishes processing.
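
The two modes described in these notes differ only in when the text becomes visible; the returned string is the same in both cases. The sketch below illustrates that pattern with a plain iterable of strings standing in for the streamed response. `collect_stream` is a hypothetical helper, not ragoon's code, and the real client may yield structured chunk objects rather than bare strings.

```python
from typing import Iterable


def collect_stream(pieces: Iterable[str], echo: bool = True) -> str:
    """Accumulate streamed text pieces, optionally echoing them as they arrive."""
    collected = []
    for piece in pieces:
        if echo:
            # Print each piece immediately, without adding a newline between pieces.
            print(piece, end="", flush=True)
        collected.append(piece)
    if echo:
        print()  # Finish the echoed line once the stream is exhausted.
    return "".join(collected)


# A stand-in for a streamed model response (hypothetical content).
fake_stream = ["First section. ", "<|split|>", " Second section."]
full_output = collect_stream(fake_stream, echo=False)
```

Either way, the caller receives one complete string that can then be split on the split token.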

split(text: str) → List[str]

Split the input text into semantically coherent sections.

This method sends the input text to the language model for processing, then splits the returned text based on the specified split token.

Parameters: text (str) – The input text to be split.
Returns: A list of strings, each representing a semantically coherent section of the input text.
Return type: List[str]

Examples

>>> splitter = SemanticTextSplitter(
...     model="meta-llama/Llama-2-70b-chat-hf",
...     token="your_hf_token_here"
... )
>>> text = '''
... Machine learning is a subset of artificial intelligence
... that focuses on the development of algorithms and statistical models.
... It enables computer systems to improve their performance on a specific task
... through experience, without being explicitly programmed.
... On the other hand, deep learning is a subset of machine learning
... that uses artificial neural networks with multiple layers
... to progressively extract higher-level features from raw input.
... '''
>>> result = splitter.split(text)
>>> for idx, section in enumerate(result, 1):
...     print(f"Section {idx}:\n{section}\n")

Section 1:
Machine learning is a subset of artificial intelligence that focuses on the development of algorithms and statistical models. It enables computer systems to improve their performance on a specific task through experience, without being explicitly programmed.

Section 2:
On the other hand, deep learning is a subset of machine learning that uses artificial neural networks with multiple layers to progressively extract higher-level features from raw input.

Notes

The quality of the splitting depends on the language model’s understanding of the text and its ability to identify semantic boundaries.

The method uses the completion method internally to process the text, so any streaming behavior will occur during this step.

Empty sections (after stripping whitespace) are automatically removed from the final output.
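
The post-processing step described for `split` (split on the split token, strip whitespace, drop empty sections) can be sketched in a few lines. This is an illustration of the documented behavior under stated assumptions, not ragoon's actual implementation; `split_on_token` is a hypothetical name, and the model output is a made-up string.

```python
from typing import List


def split_on_token(model_output: str, split_token: str = "<|split|>") -> List[str]:
    """Split model output on the split token and discard empty sections."""
    # Strip surrounding whitespace from every candidate section.
    stripped = (part.strip() for part in model_output.split(split_token))
    # Keep only the sections that still contain text after stripping.
    return [section for section in stripped if section]


# Hypothetical model output containing two split tokens and trailing whitespace.
raw = "Python is popular. <|split|> Data science is growing. <|split|>   \n"
sections = split_on_token(raw)
```

Here the trailing whitespace after the last split token produces an empty section, which is removed, so two sections remain.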

class ragoon.chunks.ChunkMetadata(uuid: str, chunk_uuid: str, chunk_number: str)

Bases: object

Metadata for a text chunk within a dataset.

Attributes:

uuid (str) – The UUID of the original text.
chunk_uuid (str) – The UUID of the chunked text.
chunk_number (str) – The identifier of the chunk indicating its order and total number of chunks.

__init__(uuid: str, chunk_uuid: str, chunk_number: str) → None
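
To show how the three fields relate, the sketch below builds metadata records for the chunks of one source text. It uses a local stand-in dataclass rather than importing ragoon, and the `"1/3"` style for `chunk_number` is an assumption made for illustration; the documentation only says the field encodes the chunk's order and the total number of chunks.

```python
from dataclasses import dataclass
from typing import List
from uuid import uuid4


@dataclass
class ChunkMetadataSketch:
    """Local stand-in mirroring the three documented fields."""

    uuid: str
    chunk_uuid: str
    chunk_number: str


def describe_chunks(n_chunks: int) -> List[ChunkMetadataSketch]:
    """Create one metadata record per chunk of a single source text."""
    source_id = str(uuid4())  # Shared by every chunk of the same text.
    return [
        ChunkMetadataSketch(
            uuid=source_id,
            chunk_uuid=str(uuid4()),  # Unique per chunk.
            chunk_number=f"{index}/{n_chunks}",  # Assumed "order/total" format.
        )
        for index in range(1, n_chunks + 1)
    ]


records = describe_chunks(3)
```

All records share the same `uuid`, so chunks can be grouped back to their source text, while `chunk_uuid` identifies each chunk individually.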

class ragoon.chunks.DatasetChunker(dataset: Dataset | DatasetDict, max_tokens: int, overlap_percentage: float, column: str, model_name: str = 'bert-base-uncased', uuid_column: str | None = None, separators: List[str] = ['.', '\n'], space_after_splitters: List[str] | None = None)

Bases: object

A class to chunk text data within a dataset for processing with embeddings models.

This class splits large texts into smaller chunks based on a specified maximum token limit, while maintaining an overlap between chunks to preserve context.

Parameters:

dataset (Union[datasets.Dataset, datasets.DatasetDict]) – The dataset to be chunked. It can be either a Dataset or a DatasetDict.
max_tokens (int) – The maximum number of tokens allowed in each chunk.
overlap_percentage (float) – The share of tokens to overlap between consecutive chunks. The examples on this page pass it as a fraction (0.1 or 0.5) rather than as a value out of 100.
column (str) – The name of the column containing the text to be chunked.
model_name (str, optional) – The name of the tokenizer model to use (default is “bert-base-uncased”).
uuid_column (Optional[str], optional) – The name of the column containing UUIDs for the texts. If not provided, new UUIDs will be generated.
separators (List[str], optional) – List of separators used to split the text (default is ['.', '\n']).
space_after_splitters (Optional[List[str]], optional) – List of separators that require a space after splitting (default is None).

Examples

>>> from datasets import load_dataset
>>> dataset = load_dataset("louisbrulenaudet/dac6-instruct")
>>> chunker = DatasetChunker(
...     dataset['train'],
...     max_tokens=512,
...     overlap_percentage=0.5,
...     column="document",
...     model_name="intfloat/multilingual-e5-large",
...     separators=["\n", ".", "!", "?"]
... )
>>> dataset_chunked = chunker.chunk_dataset()
>>> dataset_chunked.to_list()[:3]
[{'text': 'This is a chunked text.'}, {'text': 'This is another chunked text.'}, ...]

__init__(dataset: Dataset | DatasetDict, max_tokens: int, overlap_percentage: float, column: str, model_name: str = 'bert-base-uncased', uuid_column: str | None = None, separators: List[str] = ['.', '\n'], space_after_splitters: List[str] | None = None) → None

split_text(text: str) → List[str]

Splits a text into segments based on the specified separators.

Parameters: text (str) – The text to be split.
Returns: A list of text segments.
Return type: List[str]

Examples

>>> chunker = DatasetChunker(dataset, 512, 0.1, 'text')
>>> chunker.split_text("This is a sentence. This is another one.")
['This is a sentence', '.', ' This is another one', '.']
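
The example output keeps each separator as its own list item. One way to get that shape, shown here as a sketch rather than as ragoon's implementation, is `re.split` with a capturing group, which returns the matched separators alongside the text between them. `split_keep_separators` is a hypothetical name.

```python
import re
from typing import List, Sequence


def split_keep_separators(text: str, separators: Sequence[str] = (".", "\n")) -> List[str]:
    """Split text on any separator, keeping each separator as its own segment."""
    # A capturing group makes re.split include the separators in its result.
    pattern = "(" + "|".join(re.escape(sep) for sep in separators) + ")"
    # re.split can emit empty strings (e.g. after a trailing separator); drop them.
    return [part for part in re.split(pattern, text) if part]


segments = split_keep_separators("This is a sentence. This is another one.")
```

With the default separators this yields the same list as the example above, and joining the segments reproduces the original text.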

create_chunks(text: str) → List[str]

Creates text chunks from a given text based on the maximum tokens limit.

Parameters: text (str) – The text to be chunked.
Returns: A list of text chunks.
Return type: List[str]
Raises: ValueError – If the text cannot be chunked properly.

Examples

>>> chunker = DatasetChunker(dataset, 512, 0.1, 'text')
>>> text = "This is a very long text that needs to be chunked."
>>> chunks = chunker.create_chunks(text)
>>> len(chunks)
2
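
The interaction between `max_tokens` and `overlap_percentage` is easier to see on a toy input. The sketch below is a generic sliding-window chunker, not ragoon's algorithm: it works on a pre-tokenized list instead of calling a model tokenizer, ignores separators, and assumes the overlap is given as a fraction between 0 and 1. `sliding_chunks` is a hypothetical name.

```python
from typing import List


def sliding_chunks(tokens: List[str], max_tokens: int, overlap: float) -> List[List[str]]:
    """Cut tokens into windows of at most max_tokens, overlapping by a fraction."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in the range [0, 1)")
    # Each new window starts this many tokens after the previous one.
    step = max(1, max_tokens - int(max_tokens * overlap))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # The current window already reaches the end of the input.
    return chunks


tokens = "a b c d e f g h i j".split()
chunks = sliding_chunks(tokens, max_tokens=4, overlap=0.5)
```

With ten tokens, a window of four, and an overlap of 0.5, each window starts two tokens after the previous one, so consecutive chunks share their last and first two tokens.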

finalize_chunk(chunk_text: str, is_last: bool) → str

Finalizes the chunk text by adjusting leading/trailing separators.

Parameters:

chunk_text (str) – The chunk text to be finalized.
is_last (bool) – Indicates whether this is the last chunk.

Returns:

The finalized chunk text.

Return type:

str

Examples

>>> chunker = DatasetChunker(dataset, 512, 0.1, 'text')
>>> chunk = " This is a chunk."
>>> chunker.finalize_chunk(chunk, is_last=True)
'This is a chunk.'

chunk_dataset() → Dataset | DatasetDict

Chunks the entire dataset into smaller segments.

Returns: The chunked dataset, with each entry split into smaller chunks.
Return type: Union[Dataset, DatasetDict]

Examples

>>> chunker = DatasetChunker(dataset, 512, 0.1, 'text')
>>> chunked_dataset = chunker.chunk_dataset()
>>> len(chunked_dataset)
1000
