lancedb_haystack
Classes
LanceDBDocumentStore | Stores data in LanceDB and leverages its inbuilt search features.
LanceDBEmbeddingRetriever | A component for retrieving documents from a LanceDBDocumentStore using embeddings and vector similarity.
LanceDBFTSRetriever | A component for retrieving documents from a LanceDBDocumentStore using full-text search (FTS).
Package Contents
- class lancedb_haystack.LanceDBDocumentStore(database: str, table_name: str, metadata_schema: pyarrow.StructType | None = None, embedding_dims: int | None = None)
Bases: haystack.document_stores.types.DocumentStore
Stores data in LanceDB and leverages its inbuilt search features.
- _database
- _table_name
- _metadata_schema
- _embedding_dims
- db
- table_exists() bool
Check if the table this DocumentStore relies on already exists.
- Returns:
True if the table already exists in the LanceDB backing this DocumentStore
- count_documents() int
Returns how many documents are present in the document store.
- Returns:
the number of documents in the document store, or 0 if the table hasn’t been created yet.
- filter_documents(filters: Dict[str, Any] | None = None) List[haystack.Document]
Returns the documents that match the filters provided.
Filters are defined as nested dictionaries that can be of two types:
Comparison
Logic
Comparison dictionaries must contain the keys:
field
operator
value
Logic dictionaries must contain the keys:
operator
conditions
The conditions key must be a list of dictionaries, either of type Comparison or Logic.
The operator value in Comparison dictionaries must be one of:
==
!=
>
>=
<
<=
in
not in
The operator values in Logic dictionaries must be one of:
NOT
OR
AND
A simple filter:
```python
filters = {"field": "meta.type", "operator": "==", "value": "article"}
```
A more complex filter:
```python
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.type", "operator": "==", "value": "article"},
        {"field": "meta.date", "operator": ">=", "value": 1420066800},
        {"field": "meta.date", "operator": "<", "value": 1609455600},
        {"field": "meta.rating", "operator": ">=", "value": 3},
        {
            "operator": "OR",
            "conditions": [
                {"field": "meta.genre", "operator": "in", "value": ["economy", "politics"]},
                {"field": "meta.publisher", "operator": "==", "value": "nytimes"},
            ],
        },
    ],
}
```
- Parameters:
filters – the filters to apply to the document list.
- Returns:
a list of Documents that match the given filters.
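To make the nesting rules concrete, here is a minimal sketch of how such a filter could be evaluated against a plain dictionary. It is illustrative only and is not how LanceDBDocumentStore applies filters. The helpers `lookup` and `matches` and the sample record are hypothetical, and reading NOT as the negation of all its conditions combined is an assumption.

```python
import operator
from typing import Any, Dict

# Comparison operators listed above, mapped to plain Python checks.
COMPARATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def lookup(record: Dict[str, Any], field: str) -> Any:
    """Resolve a dotted field name such as "meta.type" in a nested dict."""
    value: Any = record
    for part in field.split("."):
        value = value[part]
    return value


def matches(record: Dict[str, Any], flt: Dict[str, Any]) -> bool:
    """Return True if the record satisfies a Comparison or Logic dictionary."""
    op = flt["operator"]
    if "conditions" in flt:  # Logic dictionary
        results = [matches(record, cond) for cond in flt["conditions"]]
        if op == "AND":
            return all(results)
        if op == "OR":
            return any(results)
        if op == "NOT":
            # Assumption: NOT negates the conjunction of its conditions.
            return not all(results)
        raise ValueError(f"Unknown logic operator: {op}")
    if op not in COMPARATORS:
        raise ValueError(f"Unknown comparison operator: {op}")
    return COMPARATORS[op](lookup(record, flt["field"]), flt["value"])


# Hypothetical record standing in for a document and its metadata.
record = {"meta": {"type": "article", "rating": 4, "genre": "economy"}}
flt = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.type", "operator": "==", "value": "article"},
        {
            "operator": "OR",
            "conditions": [
                {"field": "meta.genre", "operator": "in", "value": ["economy", "politics"]},
                {"field": "meta.rating", "operator": "<", "value": 2},
            ],
        },
    ],
}
print(matches(record, flt))  # True
```

The sketch raises KeyError when a field is missing from the record; a real store may treat missing fields differently.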
- perform_query(query: str | List[float] | None = None, filters: Dict[str, Any] | None = None, top_k: int | None = None) List[haystack.Document]
Performs a query against the LanceDB backing this DocumentStore.
- Parameters:
query – either a query string for full-text search (FTS), a vector for vector search, or None to apply only the filters.
filters – Filters to apply to the search. See: https://docs.haystack.deepset.ai/docs/metadata-filtering
top_k – limit the results to the top_k most relevant documents. Default: no limit
- Returns:
a list of Haystack Documents which match the search and filters.
- Raises:
ValueError – if an invalid top_k is given (i.e., a negative value)
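The three query modes described above can be summarized with a small dispatcher. This is a sketch of the documented behaviour rather than the store's own code: `describe_query` is a hypothetical helper, and treating only negative values of top_k as invalid is an assumption based on the Raises entry.

```python
from typing import List, Optional, Union


def describe_query(
    query: Union[str, List[float], None] = None,
    top_k: Optional[int] = None,
) -> str:
    """Name the search mode that a given query argument selects."""
    if top_k is not None and top_k < 0:
        # Assumption: negative limits are rejected, as the Raises entry suggests.
        raise ValueError(f"top_k must not be negative, got {top_k}")
    if query is None:
        return "filters only"
    if isinstance(query, str):
        return "full-text search"
    return "vector search"


print(describe_query("climate policy"))           # full-text search
print(describe_query([0.1, 0.2, 0.3], top_k=5))   # vector search
print(describe_query())                           # filters only
```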
- write_documents(documents: List[haystack.Document], policy: haystack.document_stores.types.DuplicatePolicy = DuplicatePolicy.NONE) int
Writes (or overwrites) documents into the store.
- Parameters:
documents – a list of documents.
policy – how to handle duplicates, meaning documents with the same ID. When a duplicate is encountered, the store can skip (keep the existing document and ignore the new one), overwrite (remove the old document and write the new one), or fail (raise an error).
- Raises:
DuplicateDocumentError – raised when a duplicate document is found and policy=DuplicatePolicy.FAIL
ValueError – if no documents are provided.
- Returns:
the number of documents created or updated.
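The skip, overwrite, and fail behaviours can be illustrated with a plain dictionary standing in for the table. This is a sketch under stated assumptions, not the store's implementation: `Policy` and `write` are hypothetical names, and KeyError stands in for DuplicateDocumentError so that the example needs no third-party imports.

```python
from enum import Enum
from typing import Dict, List


class Policy(Enum):
    """Hypothetical stand-in for the skip / overwrite / fail choices."""

    SKIP = "skip"
    OVERWRITE = "overwrite"
    FAIL = "fail"


def write(store: Dict[str, str], docs: List[Dict[str, str]], policy: Policy) -> int:
    """Write docs into a dict keyed by ID and return how many were written."""
    if not docs:
        raise ValueError("No documents provided")
    written = 0
    for doc in docs:
        if doc["id"] in store:
            if policy is Policy.SKIP:
                continue  # keep the existing document
            if policy is Policy.FAIL:
                raise KeyError(f"Duplicate document ID: {doc['id']}")
            # Policy.OVERWRITE falls through and replaces the old document.
        store[doc["id"]] = doc["content"]
        written += 1
    return written


store: Dict[str, str] = {}
first = [{"id": "a", "content": "old"}]
second = [{"id": "a", "content": "new"}, {"id": "b", "content": "other"}]

print(write(store, first, Policy.FAIL))        # 1: nothing to collide with yet
print(write(store, second, Policy.SKIP))       # 1: "a" is skipped, "b" is added
print(write(store, second, Policy.OVERWRITE))  # 2: both IDs are rewritten
```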
- delete_documents(object_ids: List[str]) None
Deletes all documents whose IDs match the given object_ids from the document store. Fails with MissingDocumentError if no document with a given ID is present in the store.
- Parameters:
object_ids – the object_ids to delete
- to_dict() Dict[str, Any]
Serializes this store to a dictionary.
- classmethod from_dict(data: Dict[str, Any]) LanceDBDocumentStore
Deserializes the store from a dictionary.
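Putting the store methods together, a usage sketch might look like the following. It assumes that lancedb_haystack, haystack, and pyarrow are importable. The function name `demo_store`, the database path, the table name, the metadata fields, and the tiny embedding size are illustrative choices rather than values from this reference, and whether every document must carry an embedding is not stated here.

```python
from typing import Any, List


def demo_store(database: str = "my_database") -> List[Any]:
    """Create a store, write two documents, and filter them by metadata."""
    # Imported inside the function so the sketch can be loaded even where
    # the optional packages are not installed.
    import pyarrow as pa
    from haystack import Document
    from haystack.document_stores.types import DuplicatePolicy
    from lancedb_haystack import LanceDBDocumentStore

    # Illustrative metadata layout; adapt the fields to your own data.
    metadata_schema = pa.struct([("type", pa.string()), ("rating", pa.int32())])

    store = LanceDBDocumentStore(
        database=database,
        table_name="documents",
        metadata_schema=metadata_schema,
        embedding_dims=4,
    )

    docs = [
        Document(
            content="Budget talks resume",
            meta={"type": "article", "rating": 4},
            embedding=[0.1, 0.2, 0.3, 0.4],
        ),
        Document(
            content="Weekend recipes",
            meta={"type": "blog", "rating": 2},
            embedding=[0.4, 0.3, 0.2, 0.1],
        ),
    ]
    written = store.write_documents(docs, policy=DuplicatePolicy.OVERWRITE)
    print(f"wrote {written} documents, store now holds {store.count_documents()}")

    # Return only the documents whose metadata marks them as articles.
    return store.filter_documents(
        filters={"field": "meta.type", "operator": "==", "value": "article"}
    )
```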
- class lancedb_haystack.LanceDBEmbeddingRetriever(document_store: lancedb_haystack.document_store.LanceDBDocumentStore, filters: Dict[str, Any] | None = None, top_k: int | None = 10)
A component for retrieving documents from a LanceDBDocumentStore using embeddings and vector similarity.
- NAME = 'lancedb_haystack.embedding_retriever.LanceDBEmbeddingRetriever'
- _document_store
- _filters
- _top_k
- run(query_embedding: List[float], filters: Dict[str, Any] | None = None, top_k: int | None = None)
Run the LanceDBEmbeddingRetriever on the given input data.
- Parameters:
query_embedding – Embedding of the query.
filters – A dictionary with filters to narrow down the search space.
top_k – The maximum number of documents to return.
- Returns:
The retrieved documents.
- to_dict() Dict[str, Any]
Serialize this component to a dictionary.
- classmethod from_dict(data: Dict[str, Any]) LanceDBEmbeddingRetriever
Deserialize this component from a dictionary.
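A short sketch of calling the embedding retriever directly, assuming lancedb_haystack is importable and that `store` is an existing LanceDBDocumentStore. The helper name `retrieve_by_embedding` and the filter are hypothetical, and the shape of the value returned by run is passed through unchanged because this reference does not spell it out.

```python
from typing import Any, List


def retrieve_by_embedding(store: Any, query_embedding: List[float]) -> Any:
    """Run a vector-similarity lookup against an existing LanceDBDocumentStore."""
    # Imported lazily so the sketch loads without the package installed.
    from lancedb_haystack import LanceDBEmbeddingRetriever

    retriever = LanceDBEmbeddingRetriever(document_store=store, top_k=5)
    # Assumption: the vector length should match the store's embedding_dims.
    return retriever.run(
        query_embedding=query_embedding,
        filters={"field": "meta.type", "operator": "==", "value": "article"},
    )
```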
- class lancedb_haystack.LanceDBFTSRetriever(document_store: lancedb_haystack.document_store.LanceDBDocumentStore, filters: Dict[str, Any] | None = None, top_k: int | None = 10)
A component for retrieving documents from a LanceDBDocumentStore using full-text search (FTS).
- NAME = 'lancedb_haystack.fts_retriever.LanceDBFTSRetriever'
- _document_store
- _filters
- _top_k
- run(query: str, filters: Dict[str, Any] | None = None, top_k: int | None = None)
Run the LanceDBFTSRetriever on the given input data.
- Parameters:
query – The query string for the Retriever.
filters – A dictionary with filters to narrow down the search space.
top_k – The maximum number of documents to return.
- Returns:
The retrieved documents.
- Raises:
ValueError – If the specified DocumentStore is not found or is not a LanceDBDocumentStore instance.
- to_dict() Dict[str, Any]
Serialize this component to a dictionary.
- classmethod from_dict(data: Dict[str, Any]) LanceDBFTSRetriever
Deserialize this component from a dictionary.
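Because the retriever is a Haystack component, it can also be placed in a pipeline. The sketch below assumes haystack and lancedb_haystack are importable and that `store` is an existing LanceDBDocumentStore. The function name `fts_pipeline_search` and the component name "retriever" are arbitrary choices for illustration.

```python
from typing import Any


def fts_pipeline_search(store: Any, query: str) -> Any:
    """Run a full-text query through a one-component Haystack pipeline."""
    # Imported lazily so the sketch loads without the packages installed.
    from haystack import Pipeline
    from lancedb_haystack import LanceDBFTSRetriever

    pipeline = Pipeline()
    pipeline.add_component(
        "retriever", LanceDBFTSRetriever(document_store=store, top_k=3)
    )
    # Inputs are keyed by component name, then by the run() parameter name.
    return pipeline.run({"retriever": {"query": query}})
```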