lancedb_haystack

Classes

LanceDBDocumentStore

Stores data in LanceDB, and leverages its inbuilt search features.

LanceDBEmbeddingRetriever

A component for retrieving documents from a LanceDBDocumentStore using embeddings and vector similarity.

LanceDBFTSRetriever

A component for retrieving documents from a LanceDBDocumentStore using full-text search (FTS).

Package Contents

class lancedb_haystack.LanceDBDocumentStore(database: str, table_name: str, metadata_schema: pyarrow.StructType | None = None, embedding_dims: int | None = None)

Bases: haystack.document_stores.types.DocumentStore

Stores data in LanceDB, and leverages its inbuilt search features.

_database
_table_name
_metadata_schema
_embedding_dims
db
table_exists() → bool

Check if the table this DocumentStore relies on already exists.

Returns:

True if the table already exists in the LanceDB database backing this DocumentStore.

count_documents() → int

Returns how many documents are present in the document store.

Returns:

the number of documents in the document store, or 0 if the table hasn’t been created yet.
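The constructor and the two inspection methods above can be combined as in the following minimal sketch. The database path, table name, metadata field names, and the embedding size of 384 are illustrative assumptions, not values taken from this reference; `build_store` and `describe_store` are hypothetical helper names.

```python
def build_store(db_path: str = "my_lancedb", table_name: str = "documents"):
    """Create a LanceDBDocumentStore. All argument values here are illustrative."""
    # Imported lazily so the sketch can be loaded without the packages installed.
    import pyarrow as pa
    from lancedb_haystack import LanceDBDocumentStore

    # Hypothetical metadata schema: one string field and one integer field.
    metadata_schema = pa.struct([
        ("type", pa.string()),
        ("rating", pa.int64()),
    ])
    return LanceDBDocumentStore(
        database=db_path,
        table_name=table_name,
        metadata_schema=metadata_schema,
        embedding_dims=384,  # assumed size; match it to your embedding model
    )


def describe_store(store) -> str:
    """Summarise a store using only table_exists() and count_documents()."""
    if not store.table_exists():
        return "table not created yet"
    return f"{store.count_documents()} document(s) stored"
```

Since `count_documents()` is documented to return 0 when the table is missing, the `table_exists()` check is only needed when the two cases must be told apart.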

filter_documents(filters: Dict[str, Any] | None = None) → List[haystack.Document]

Returns the documents that match the filters provided.

Filters are defined as nested dictionaries that can be of two types:

  • Comparison

  • Logic

Comparison dictionaries must contain the keys:

  • field

  • operator

  • value

Logic dictionaries must contain the keys:

  • operator

  • conditions

The conditions key must be a list of dictionaries, either of type Comparison or Logic.

The operator value in Comparison dictionaries must be one of:

  • ==

  • !=

  • >

  • >=

  • <

  • <=

  • in

  • not in

The operator values in Logic dictionaries must be one of:

  • NOT

  • OR

  • AND

A simple filter:

```python
filters = {"field": "meta.type", "operator": "==", "value": "article"}
```

A more complex filter:

```python
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.type", "operator": "==", "value": "article"},
        {"field": "meta.date", "operator": ">=", "value": 1420066800},
        {"field": "meta.date", "operator": "<", "value": 1609455600},
        {"field": "meta.rating", "operator": ">=", "value": 3},
        {
            "operator": "OR",
            "conditions": [
                {"field": "meta.genre", "operator": "in", "value": ["economy", "politics"]},
                {"field": "meta.publisher", "operator": "==", "value": "nytimes"},
            ],
        },
    ],
}
```

Parameters:

filters – the filters to apply to the document list.

Returns:

a list of Documents that match the given filters.
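To make the filter semantics concrete, the following is a minimal pure-Python sketch that evaluates a filter dictionary against a plain nested dictionary. It is not how the store applies filters (LanceDB evaluates them inside the database); the helper names `lookup` and `matches` are hypothetical, and treating `NOT` as the negation of `AND` over its conditions is an assumption.

```python
from typing import Any, Dict


def lookup(document: Dict[str, Any], field: str) -> Any:
    """Resolve a dotted field such as 'meta.type' in nested dictionaries."""
    value: Any = document
    for part in field.split("."):
        if not isinstance(value, dict) or part not in value:
            return None  # a missing field is treated as None in this sketch
        value = value[part]
    return value


# One small function per comparison operator listed above.
COMPARISONS = {
    "==": lambda a, b: a == b,
    "!=": lambda a, b: a != b,
    ">": lambda a, b: a is not None and a > b,
    ">=": lambda a, b: a is not None and a >= b,
    "<": lambda a, b: a is not None and a < b,
    "<=": lambda a, b: a is not None and a <= b,
    "in": lambda a, b: a in b,
    "not in": lambda a, b: a not in b,
}


def matches(document: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """Return True if the document satisfies a Comparison or Logic filter."""
    if "field" in filters:  # Comparison dictionary
        compare = COMPARISONS[filters["operator"]]
        return compare(lookup(document, filters["field"]), filters["value"])

    # Logic dictionary: evaluate every nested condition recursively.
    results = [matches(document, cond) for cond in filters["conditions"]]
    operator = filters["operator"]
    if operator == "AND":
        return all(results)
    if operator == "OR":
        return any(results)
    if operator == "NOT":
        return not all(results)  # assumption: NOT negates the AND of its conditions
    raise ValueError(f"unknown logic operator: {operator}")


doc = {"meta": {"type": "article", "rating": 4}}
simple = {"field": "meta.type", "operator": "==", "value": "article"}
print(matches(doc, simple))  # True
```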

perform_query(query: str | List[float] | None = None, filters: Dict[str, Any] | None = None, top_k: int | None = None) → List[haystack.Document]

Performs a query against the LanceDB database backing this DocumentStore.

Parameters:
  • query – Either a query string for FTS, a vector for vector search, or None to apply only the filters.

  • filters – Filters to apply to the search. See: https://docs.haystack.deepset.ai/docs/metadata-filtering

  • top_k – limit the results to the top_k most relevant documents. Default: no limit

Returns:

a list of Haystack Documents which match the search and filters.

Raises:

ValueError – if an invalid top_k is given (i.e. a negative value).
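The three query modes described above can be summarised with a small dispatch sketch. The function `plan_query` is hypothetical and only mirrors the documented behaviour: it does not talk to LanceDB, and whether the real method also rejects `top_k=0` is not stated here, so only negative values are rejected.

```python
from typing import List, Optional, Union


def plan_query(
    query: Union[str, List[float], None] = None,
    top_k: Optional[int] = None,
) -> str:
    """Name the kind of search that the given arguments would select."""
    # Documented behaviour: a negative top_k raises ValueError.
    if top_k is not None and top_k < 0:
        raise ValueError("top_k must not be negative")
    if query is None:
        return "filter-only"  # no query: only the filters are applied
    if isinstance(query, str):
        return "full-text"    # a string is treated as an FTS query
    return "vector"           # a list of floats is treated as an embedding


print(plan_query("climate policy"))       # full-text
print(plan_query([0.1, 0.2, 0.3], 5))     # vector
print(plan_query())                       # filter-only
```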

write_documents(documents: List[haystack.Document], policy: haystack.document_stores.types.DuplicatePolicy = DuplicatePolicy.NONE) → int

Writes (or overwrites) documents into the store.

Parameters:
  • documents – a list of documents.

  • policy – documents with the same ID count as duplicates. When duplicates are met, the store can:

      • skip: keep the existing document and ignore the new one.

      • overwrite: remove the old document and write the new one.

      • fail: raise an error.

Raises:
  • DuplicateDocumentError – raised when a duplicate document is found and policy=DuplicatePolicy.FAIL.

  • ValueError – if no documents are provided.

Returns:

the number of documents created or updated.
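The duplicate policies can be illustrated with an in-memory sketch that uses a plain dictionary in place of the LanceDB table. `Policy`, `DuplicateError`, and `write` are hypothetical stand-ins for `DuplicatePolicy`, `DuplicateDocumentError`, and `write_documents`; the real store may differ in details such as atomicity.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Policy(Enum):
    """Hypothetical stand-in for haystack's DuplicatePolicy."""
    SKIP = "skip"
    OVERWRITE = "overwrite"
    FAIL = "fail"


class DuplicateError(Exception):
    """Hypothetical stand-in for DuplicateDocumentError."""


def write(store: Dict[str, str], docs: List[Tuple[str, str]], policy: Policy) -> int:
    """Write (id, content) pairs into store and return how many were written."""
    if not docs:
        raise ValueError("no documents provided")
    written = 0
    for doc_id, content in docs:
        if doc_id in store:
            if policy is Policy.SKIP:
                continue  # keep the existing document, ignore the new one
            if policy is Policy.FAIL:
                # Note: documents written earlier in this call are not rolled back.
                raise DuplicateError(f"document {doc_id!r} already exists")
        store[doc_id] = content  # new document, or OVERWRITE of an old one
        written += 1
    return written


store = {"a": "old text"}
print(write(store, [("a", "new text"), ("b", "other")], Policy.SKIP))  # 1
print(store["a"])  # old text
```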

delete_documents(object_ids: List[str]) → None

Deletes all documents with matching IDs from the document store. Fails with MissingDocumentError if no document with a given ID is present in the store.

Parameters:

object_ids – the IDs of the documents to delete.

to_dict() → Dict[str, Any]

Serializes this store to a dictionary.

classmethod from_dict(data: Dict[str, Any]) → LanceDBDocumentStore

Deserializes the store from a dictionary.

class lancedb_haystack.LanceDBEmbeddingRetriever(document_store: lancedb_haystack.document_store.LanceDBDocumentStore, filters: Dict[str, Any] | None = None, top_k: int | None = 10)

A component for retrieving documents from a LanceDBDocumentStore using embeddings and vector similarity.

NAME = 'lancedb_haystack.embedding_retriever.LanceDBEmbeddingRetriever'
_document_store
_filters
_top_k
run(query_embedding: List[float], filters: Dict[str, Any] | None = None, top_k: int | None = None)

Run the LanceDBEmbeddingRetriever on the given input data.

Parameters:
  • query_embedding – Embedding of the query.

  • filters – A dictionary with filters to narrow down the search space.

  • top_k – The maximum number of documents to return.

Returns:

The retrieved documents.
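As a conceptual illustration of retrieval by vector similarity, the sketch below ranks stored embeddings against a query embedding by brute force. It is not the retriever's implementation: LanceDB performs the search itself, and the use of cosine similarity here is an assumption, since this reference does not say which distance metric the store uses. The function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_by_similarity(
    query_embedding: List[float],
    embeddings: Dict[str, List[float]],
    top_k: int = 10,
) -> List[Tuple[str, float]]:
    """Return up to top_k (document id, score) pairs, most similar first."""
    scored = [
        (doc_id, cosine_similarity(query_embedding, vector))
        for doc_id, vector in embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranked = top_k_by_similarity([1.0, 0.0], embeddings, top_k=2)
print([doc_id for doc_id, _ in ranked])  # ['a', 'c']
```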

to_dict() → Dict[str, Any]

Serialize this component to a dictionary.

classmethod from_dict(data: Dict[str, Any]) → LanceDBEmbeddingRetriever

Deserialize this component from a dictionary.

class lancedb_haystack.LanceDBFTSRetriever(document_store: lancedb_haystack.document_store.LanceDBDocumentStore, filters: Dict[str, Any] | None = None, top_k: int | None = 10)

A component for retrieving documents from a LanceDBDocumentStore using full-text search (FTS).

NAME = 'lancedb_haystack.fts_retriever.LanceDBFTSRetriever'
_document_store
_filters
_top_k
run(query: str, filters: Dict[str, Any] | None = None, top_k: int | None = None)

Run the LanceDBFTSRetriever on the given input data.

Parameters:
  • query – The query string for the Retriever.

  • filters – A dictionary with filters to narrow down the search space.

  • top_k – The maximum number of documents to return.

Returns:

The retrieved documents.

Raises:

ValueError – If the specified DocumentStore is not found or is not a LanceDBDocumentStore instance.
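A call to `run` might look like the following minimal sketch. The helper `search_articles` is hypothetical and accepts any object exposing the `run(query, filters, top_k)` signature documented above; the `meta.type` field and its value are illustrative assumptions. The exact shape of the return value is not specified in this reference, so the result is passed through unchanged.

```python
from typing import Any


def search_articles(retriever: Any, query: str, top_k: int = 5) -> Any:
    """Run a full-text query restricted to documents whose meta.type is 'article'."""
    # Run-time filter in the Comparison format used by filter_documents.
    filters = {"field": "meta.type", "operator": "==", "value": "article"}
    return retriever.run(query=query, filters=filters, top_k=top_k)


# Typical wiring, assuming `store` is an existing LanceDBDocumentStore:
#
#     from lancedb_haystack import LanceDBFTSRetriever
#     retriever = LanceDBFTSRetriever(document_store=store)
#     result = search_articles(retriever, "interest rates")
```

Passing `filters` and `top_k` at run time keeps one retriever instance reusable across queries with different constraints.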

to_dict() → Dict[str, Any]

Serialize this component to a dictionary.

classmethod from_dict(data: Dict[str, Any]) → LanceDBFTSRetriever

Deserialize this component from a dictionary.