lancedb_haystack.document_store

Attributes

logger

Classes

LanceDBDocumentStore

Stores data in LanceDB, and leverages its inbuilt search features.

Functions

`_create_schema`(→ pyarrow.Schema)	Creates the LanceDB schema for the DocumentStore using the given metadata field schema and number of embedding dimensions.
`_create_isempty_section`(→ pyarrow.StructType)	Creates the _isempty struct for the given list of fields.
`_prepare_metadata_schema`(→ pyarrow.StructType)	Take a pyarrow.StructType describing the metadata section and prepare it for use with LanceDB.

Module Contents

lancedb_haystack.document_store.logger

class lancedb_haystack.document_store.LanceDBDocumentStore(database: str, table_name: str, metadata_schema: pyarrow.StructType | None = None, embedding_dims: int | None = None)

Bases: haystack.document_stores.types.DocumentStore

Stores data in LanceDB, and leverages its inbuilt search features.

_database

_table_name

_metadata_schema

_embedding_dims

db

table_exists() → bool

Check if the table this DocumentStore relies on already exists.

Returns: True if the table already exists in the LanceDB database backing this DocumentStore.

count_documents() → int

Returns how many documents are present in the document store.

Returns: the number of documents in the document store, or 0 if the table hasn’t been created yet.

filter_documents(filters: Dict[str, Any] | None = None) → List[haystack.Document]

Returns the documents that match the filters provided.

Filters are defined as nested dictionaries that can be of two types:

Comparison
Logic

Comparison dictionaries must contain the keys:

field
operator
value

Logic dictionaries must contain the keys:

operator
conditions

The conditions key must be a list of dictionaries, either of type Comparison or Logic.

The operator value in Comparison dictionaries must be one of:

==
!=
>
>=
<
<=
in
not in

The operator values in Logic dictionaries must be one of:

NOT
OR
AND

A simple filter:

`filters = {"field": "meta.type", "operator": "==", "value": "article"}`

A more complex filter:

```python
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.type", "operator": "==", "value": "article"},
        {"field": "meta.date", "operator": ">=", "value": 1420066800},
        {"field": "meta.date", "operator": "<", "value": 1609455600},
        {"field": "meta.rating", "operator": ">=", "value": 3},
        {
            "operator": "OR",
            "conditions": [
                {"field": "meta.genre", "operator": "in", "value": ["economy", "politics"]},
                {"field": "meta.publisher", "operator": "==", "value": "nytimes"},
            ],
        },
    ],
}
```

Parameters: filters – the filters to apply to the document list.
Returns: a list of Documents that match the given filters.
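
To make the Comparison and Logic rules above concrete, here is a minimal pure-Python sketch that evaluates a filter dictionary against a plain nested dict. It is not the store's implementation (the store applies filters inside LanceDB). The helper names `get_field` and `matches` are hypothetical, and reading NOT as "not all conditions hold" is an assumption.

```python
import operator
from typing import Any, Dict


def get_field(record: Dict[str, Any], dotted_path: str) -> Any:
    """Resolve a dotted path such as "meta.type"; return None if any part is missing."""
    current: Any = record
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current


def _ordered(check):
    # Assumption: a missing value never satisfies an ordering comparison.
    return lambda a, b: a is not None and check(a, b)


COMPARISONS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": _ordered(operator.gt),
    ">=": _ordered(operator.ge),
    "<": _ordered(operator.lt),
    "<=": _ordered(operator.le),
    "in": lambda a, b: a in b,
    "not in": lambda a, b: a not in b,
}


def matches(record: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """Return True if the record satisfies a Comparison or Logic dictionary."""
    if "field" in filters:  # Comparison dictionary
        compare = COMPARISONS[filters["operator"]]
        return compare(get_field(record, filters["field"]), filters["value"])
    # Logic dictionary: evaluate every nested condition recursively.
    results = [matches(record, condition) for condition in filters["conditions"]]
    if filters["operator"] == "AND":
        return all(results)
    if filters["operator"] == "OR":
        return any(results)
    if filters["operator"] == "NOT":
        return not all(results)  # assumed semantics for NOT
    raise ValueError(f"Unknown logic operator: {filters['operator']}")
```

With this sketch, a record whose `meta` holds `type="article"` and `rating=4` satisfies an AND of `meta.type == "article"` and `meta.rating >= 3`.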

perform_query(query: str | List[float] | None = None, filters: Dict[str, Any] | None = None, top_k: int | None = None) → List[haystack.Document]

Performs a query against the LanceDB database backing this DocumentStore.

Parameters:

query – either a query string for full-text search (FTS), a vector for vector search, or None to apply only the filters.
filters – Filters to apply to the search. See: https://docs.haystack.deepset.ai/docs/metadata-filtering
top_k – limit the results to the top_k most relevant documents. Default: no limit

Returns:

a list of Haystack Documents which match the search and filters.

Raises:

ValueError – if an invalid top_k is given (e.g. a negative value).
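
The way the type of `query` selects the search mode can be sketched as below. This is an illustration of the documented behaviour only: `plan_query` and its return labels are hypothetical names rather than part of the library, and treating only negative values of `top_k` as invalid is an assumption.

```python
from typing import List, Optional, Union


def plan_query(query: Union[str, List[float], None], top_k: Optional[int] = None) -> str:
    """Return a label for the kind of search the documented parameters imply."""
    # Assumption: only negative values count as an invalid top_k.
    if top_k is not None and top_k < 0:
        raise ValueError(f"top_k must not be negative, got {top_k}")
    if query is None:
        return "filter_only"  # no query: only the filters are applied
    if isinstance(query, str):
        return "full_text"    # a string is treated as a full-text search query
    return "vector"           # a list of floats is treated as an embedding
```

For example, `plan_query("lancedb", top_k=5)` selects the full-text branch, while `plan_query([0.1, 0.2])` selects the vector branch.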

write_documents(documents: List[haystack.Document], policy: haystack.document_stores.types.DuplicatePolicy = DuplicatePolicy.NONE) → int

Writes (or overwrites) documents into the store.

Parameters:

documents – a list of documents.
policy – how to handle duplicates. Documents with the same ID count as duplicates. When a duplicate is found, the store can:
- skip: keep the existing document and ignore the new one.
- overwrite: remove the old document and write the new one.
- fail: raise an error.

Raises:

DuplicateDocumentError – raised when a duplicate document is found and policy=DuplicatePolicy.FAIL.
ValueError – if no documents are provided.

Returns:

the number of documents created or updated.
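
The three duplicate policies can be illustrated with an in-memory dictionary standing in for the table. This is a sketch of the documented semantics and not the store's code: `Policy`, `DuplicateError`, and `write` are local stand-ins for `DuplicatePolicy`, `DuplicateDocumentError`, and `write_documents`, and the behaviour of `DuplicatePolicy.NONE` is left out because it is not described here.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Policy(Enum):
    """Local stand-in for the skip, overwrite and fail policies."""
    SKIP = "skip"
    OVERWRITE = "overwrite"
    FAIL = "fail"


class DuplicateError(Exception):
    """Local stand-in for DuplicateDocumentError."""


def write(store: Dict[str, str], documents: List[Tuple[str, str]], policy: Policy) -> int:
    """Write (doc_id, content) pairs into store; return how many were created or updated."""
    if not documents:
        raise ValueError("No documents were provided")
    written = 0
    for doc_id, content in documents:
        if doc_id in store:
            if policy is Policy.FAIL:
                raise DuplicateError(f"Document {doc_id!r} already exists")
            if policy is Policy.SKIP:
                continue  # keep the existing document, ignore the new one
        store[doc_id] = content  # new document, or overwrite of an existing one
        written += 1
    return written
```

Note that the return value counts only documents that were actually written, so skipped duplicates do not contribute to it.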

delete_documents(object_ids: List[str]) → None

Deletes all documents whose IDs match the given object_ids from the document store. Fails with MissingDocumentError if no document with a given ID is present in the store.

Parameters: object_ids – the IDs of the documents to delete.

to_dict() → Dict[str, Any]: Serializes this store to a dictionary.

classmethod from_dict(data: Dict[str, Any]) → LanceDBDocumentStore: Deserializes the store from a dictionary.

lancedb_haystack.document_store._create_schema(metadata_schema: pyarrow.StructType, embedding_dims: int | None) → pyarrow.Schema

Creates the LanceDB schema for the DocumentStore using the given metadata field schema and number of embedding dimensions.

Parameters:

metadata_schema – a pyarrow StructType defining the schema for the metadata field.
embedding_dims – the number of dimensions used in the embedding.

Returns:

a pyarrow Schema used to initialise the table in LanceDB.

lancedb_haystack.document_store._create_isempty_section(field_names: List[str]) → pyarrow.StructType

Creates the _isempty struct for the given list of fields.

Haystack expects its DocumentStores to return Documents which have only the fields they had when written. Unfortunately, LanceDB expects all fields to exist in all records, and not all types have easy ‘None’ analogues. To solve this, we keep a struct of boolean flags that indicate whether a given field should be considered empty.

Parameters: field_names – a list of field names to create entries for in the _isempty struct.
Returns: a pyarrow StructType.
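
The idea behind the `_isempty` flags can be sketched with plain dictionaries: pad every record to the full field set on write, record which fields were padded, and strip them again on read. The helpers `pad_record` and `strip_record` and the `defaults` mapping are hypothetical and only illustrate the concept; the store's real logic works on pyarrow types.

```python
from typing import Any, Dict


def pad_record(meta: Dict[str, Any], defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Fill in every field from defaults and flag the ones that were missing."""
    padded: Dict[str, Any] = {}
    flags: Dict[str, bool] = {}
    for name, default in defaults.items():
        missing = name not in meta
        padded[name] = default if missing else meta[name]
        flags[name] = missing  # True means "treat this field as empty"
    padded["_isempty"] = flags
    return padded


def strip_record(padded: Dict[str, Any]) -> Dict[str, Any]:
    """Drop the flags and every field that was flagged as empty."""
    flags = padded["_isempty"]
    return {
        name: value
        for name, value in padded.items()
        if name != "_isempty" and not flags[name]
    }
```

The flags are what let a genuine value such as `rating=0` survive a round trip while a padded `rating=0` is removed again.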

lancedb_haystack.document_store._prepare_metadata_schema(struct: pyarrow.StructType) → pyarrow.StructType

Take a pyarrow.StructType describing the metadata section and prepare it for use with LanceDB.

This covers a couple of steps to address limitations:

1. Sort the fields into alphabetical order. If we don’t do this, then LanceDB tends to complain when we give it a python dict, as those fields tend to be iterated in alphabetical order.
2. Add the _isempty section to each StructType in the specification. This lets us know if the field is meant to be empty in a given instance.

Parameters: struct – a pyarrow StructType.
Returns: a copy of the struct with suitable _isempty sections added.
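
The two preparation steps can be sketched on a simplified schema in which a struct is a plain dict mapping field names to either a type name or a nested dict. This is an illustration under that assumption and not the function's code: `prepare` is a hypothetical name, and placing `_isempty` last in each struct is a choice made for the sketch.

```python
from typing import Any, Dict


def prepare(schema: Dict[str, Any]) -> Dict[str, Any]:
    """Return a sorted copy of the schema with an _isempty section in every struct."""
    prepared: Dict[str, Any] = {}
    for name in sorted(schema):  # step 1: alphabetical field order
        value = schema[name]
        # Nested structs are prepared recursively so that each gets its own flags.
        prepared[name] = prepare(value) if isinstance(value, dict) else value
    # Step 2: one boolean flag per field of this struct.
    prepared["_isempty"] = {name: "bool" for name in sorted(schema)}
    return prepared
```

The input schema is left unmodified, which mirrors the documented behaviour of returning a copy.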