lancedb_haystack.document_store

Attributes

logger

Classes

LanceDBDocumentStore

Stores data in LanceDB, and leverages its inbuilt search features.

Functions

_create_schema(→ pyarrow.Schema)

Creates the LanceDB schema for the DocumentStore using the given metadata field schema and number of embedding dimensions.

_create_isempty_section(→ pyarrow.StructType)

Creates the _isempty struct for the given list of fields.

_prepare_metadata_schema(→ pyarrow.StructType)

Take a pyarrow.StructType describing the metadata section and prepare it for use with LanceDB.

Module Contents

lancedb_haystack.document_store.logger
class lancedb_haystack.document_store.LanceDBDocumentStore(database: str, table_name: str, metadata_schema: pyarrow.StructType | None = None, embedding_dims: int | None = None)

Bases: haystack.document_stores.types.DocumentStore

Stores data in LanceDB, and leverages its inbuilt search features.

_database
_table_name
_metadata_schema
_embedding_dims
db
table_exists() → bool

Check if the table this DocumentStore relies on already exists.

Returns:

True if the table already exists in the LanceDB backing this DocumentStore

count_documents() → int

Returns how many documents are present in the document store.

Returns:

the number of documents in the document store, or 0 if the table hasn’t been created yet.

filter_documents(filters: Dict[str, Any] | None = None) → List[haystack.Document]

Returns the documents that match the filters provided.

Filters are defined as nested dictionaries that can be of two types:

  • Comparison

  • Logic

Comparison dictionaries must contain the keys:

  • field

  • operator

  • value

Logic dictionaries must contain the keys:

  • operator

  • conditions

The conditions key must be a list of dictionaries, either of type Comparison or Logic.

The operator value in Comparison dictionaries must be one of:

  • ==

  • !=

  • >

  • >=

  • <

  • <=

  • in

  • not in

The operator values in Logic dictionaries must be one of:

  • NOT

  • OR

  • AND

A simple filter:

```python
filters = {"field": "meta.type", "operator": "==", "value": "article"}
```

A more complex filter:

```python
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.type", "operator": "==", "value": "article"},
        {"field": "meta.date", "operator": ">=", "value": 1420066800},
        {"field": "meta.date", "operator": "<", "value": 1609455600},
        {"field": "meta.rating", "operator": ">=", "value": 3},
        {
            "operator": "OR",
            "conditions": [
                {"field": "meta.genre", "operator": "in", "value": ["economy", "politics"]},
                {"field": "meta.publisher", "operator": "==", "value": "nytimes"},
            ],
        },
    ],
}
```

Parameters:

filters – the filters to apply to the document list.

Returns:

a list of Documents that match the given filters.
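To make the semantics of Comparison and Logic dictionaries concrete, here is a minimal sketch of an evaluator that checks a plain nested dict against a filter. It is not how this DocumentStore applies filters (the filtering is handled by the store itself); the names `COMPARISONS`, `lookup` and `matches` are hypothetical, and treating NOT as the negation of all of its conditions together is an assumption.

```python
import operator
from typing import Any, Dict

# Comparison operators from the filter syntax, mapped to Python callables.
COMPARISONS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "in": lambda value, options: value in options,
    "not in": lambda value, options: value not in options,
}


def lookup(record: Dict[str, Any], dotted_field: str) -> Any:
    """Resolve a dotted path such as 'meta.type' inside a nested dict."""
    current: Any = record
    for part in dotted_field.split("."):
        current = current[part]
    return current


def matches(record: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """Return True if the record satisfies a Comparison or Logic dictionary."""
    op = filters["operator"]
    if "conditions" in filters:  # Logic dictionary
        results = [matches(record, cond) for cond in filters["conditions"]]
        if op == "AND":
            return all(results)
        if op == "OR":
            return any(results)
        if op == "NOT":
            # Assumption: NOT negates the conjunction of its conditions.
            return not all(results)
        raise ValueError(f"Unknown logic operator: {op}")
    # Comparison dictionary
    return COMPARISONS[op](lookup(record, filters["field"]), filters["value"])


record = {"meta": {"type": "article", "rating": 4, "genre": "economy"}}
simple = {"field": "meta.type", "operator": "==", "value": "article"}
nested = {
    "operator": "AND",
    "conditions": [
        simple,
        {"field": "meta.genre", "operator": "in", "value": ["economy", "politics"]},
    ],
}
print(matches(record, simple), matches(record, nested))
```

The sketch raises KeyError when a record lacks a referenced field; a real store has to decide how missing fields compare.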

perform_query(query: str | List[float] | None = None, filters: Dict[str, Any] | None = None, top_k: int | None = None) → List[haystack.Document]

Performs a query against the LanceDB backing this DocumentStore.

Parameters:
  • query – Either a query string for FTS, a vector for vector search, or empty to just use filters.

  • filters – Filters to apply to the search. See: https://docs.haystack.deepset.ai/docs/metadata-filtering

  • top_k – limit the results to the top_k most relevant documents. Default: no limit

Returns:

a list of Haystack Documents which match the search and filters.

Raises:

ValueError – if an invalid top_k is given (i.e. negative).
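The way the type of `query` selects the search mode can be sketched as follows. This is an illustration of the documented contract only, not the store's code: `classify_query` and the mode labels it returns are hypothetical.

```python
from typing import List, Optional, Union


def classify_query(
    query: Union[str, List[float], None] = None,
    top_k: Optional[int] = None,
) -> str:
    """Hypothetical helper: name the kind of search a given query implies."""
    # A negative top_k is rejected, mirroring the documented ValueError.
    if top_k is not None and top_k < 0:
        raise ValueError(f"top_k must not be negative, got {top_k}")
    if query is None:
        return "filter-only"  # no query: only the filters narrow the results
    if isinstance(query, str):
        return "full-text"    # a string is treated as an FTS query
    return "vector"           # a list of floats is treated as an embedding


print(classify_query("lance"))
print(classify_query([0.1, 0.2, 0.3], top_k=5))
print(classify_query(None))
```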

write_documents(documents: List[haystack.Document], policy: haystack.document_stores.types.DuplicatePolicy = DuplicatePolicy.NONE) → int

Writes (or overwrites) documents into the store.

Parameters:
  • documents – a list of documents.

  • policy – documents with the same ID count as duplicates. When duplicates are met, the store can:

      - skip: keep the existing document and ignore the new one.
      - overwrite: remove the old document and write the new one.
      - fail: raise an error.

Raises:
  • DuplicateDocumentError – raised on a duplicate document if policy=DuplicatePolicy.FAIL.

  • ValueError – if no documents are provided.

Returns:

the number of documents created or updated.
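The three duplicate-handling behaviours can be illustrated with a plain dict standing in for the table. This is a sketch of the documented semantics, not the store's implementation: `Policy`, `DuplicateError` and `write` are stand-ins invented for the example, and documents are reduced to `(id, content)` pairs.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Policy(Enum):
    """Stand-in for the skip / overwrite / fail behaviours."""
    SKIP = "skip"
    OVERWRITE = "overwrite"
    FAIL = "fail"


class DuplicateError(Exception):
    """Stand-in for DuplicateDocumentError."""


def write(store: Dict[str, str], docs: List[Tuple[str, str]], policy: Policy) -> int:
    """Write (id, content) pairs into store; return how many were created or updated."""
    if not docs:
        raise ValueError("no documents provided")
    written = 0
    for doc_id, content in docs:
        if doc_id in store:
            if policy is Policy.SKIP:
                continue  # keep the existing document, ignore the new one
            if policy is Policy.FAIL:
                raise DuplicateError(f"document {doc_id!r} already exists")
            # Policy.OVERWRITE falls through and replaces the old document
        store[doc_id] = content
        written += 1
    return written


store = {"a": "old"}
print(write(store, [("a", "new"), ("b", "fresh")], Policy.SKIP), store)
```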

delete_documents(object_ids: List[str]) → None

Deletes all documents with matching document IDs from the document store. Fails with MissingDocumentError if no document with a given ID is present in the store.

Parameters:

object_ids – the IDs of the documents to delete

to_dict() → Dict[str, Any]

Serializes this store to a dictionary.

classmethod from_dict(data: Dict[str, Any]) → LanceDBDocumentStore

Deserializes the store from a dictionary.

lancedb_haystack.document_store._create_schema(metadata_schema: pyarrow.StructType, embedding_dims: int | None) → pyarrow.Schema

Creates the LanceDB schema for the DocumentStore using the given metadata field schema and number of embedding dimensions.

Parameters:
  • metadata_schema – a pyarrow StructType defining the schema for the metadata field.

  • embedding_dims – the number of dimensions used in the embedding.

Returns:

a pyarrow schema used to initialise the table in LanceDB

lancedb_haystack.document_store._create_isempty_section(field_names: List[str]) → pyarrow.StructType

Creates the _isempty struct for the given list of fields.

Haystack expects its DocumentStores to return Documents which have only the fields they had when written. Unfortunately, LanceDB expects all fields to exist in all records, and not all types have easy ‘None’ analogues. To solve this we have a struct of boolean flags to indicate whether a given field should be considered empty.

Parameters:

field_names – a list of field names to create entries for in the _isempty struct

Returns:

a pyarrow StructType
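The role of the boolean flags can be shown with plain dicts in place of pyarrow structs. This is a sketch under stated assumptions, not the module's code: `pad_record`, `unpad_record` and the `defaults` mapping of placeholder values are hypothetical.

```python
from typing import Any, Dict


def pad_record(meta: Dict[str, Any], defaults: Dict[str, Any]) -> Dict[str, Any]:
    """Fill absent fields with a placeholder and flag them in '_isempty'."""
    padded = {name: meta.get(name, default) for name, default in defaults.items()}
    padded["_isempty"] = {name: name not in meta for name in defaults}
    return padded


def unpad_record(padded: Dict[str, Any]) -> Dict[str, Any]:
    """Drop every field whose '_isempty' flag is set, restoring the original."""
    flags = padded["_isempty"]
    return {
        name: value
        for name, value in padded.items()
        if name != "_isempty" and not flags[name]
    }


defaults = {"title": "", "rating": 0}  # assumed placeholder values per field
original = {"rating": 0}               # 'title' was never set on this document
stored = pad_record(original, defaults)
print(stored)
print(unpad_record(stored) == original)
```

Without the flags, a stored rating of 0 would be indistinguishable from a rating that was never set.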

lancedb_haystack.document_store._prepare_metadata_schema(struct: pyarrow.StructType) → pyarrow.StructType

Takes a pyarrow.StructType describing the metadata section and prepares it for use with LanceDB.

This covers two steps that address limitations:

  1. Sort the fields into alphabetical order. If we don’t do this, LanceDB tends to complain when we give it a Python dict, as those fields tend to be iterated in alphabetical order.

  2. Add the _isempty section to each StructType in the specification. This lets us know if a field is meant to be empty in a given instance.

Parameters:

struct – a pyarrow Struct

Returns:

a copy of the struct with suitable _isempty sections added.
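The two preparation steps can be sketched on a nested dict that stands in for the pyarrow.StructType, with type names as plain strings. The function `prepare` is hypothetical, and placing `_isempty` after the sorted fields is an assumption; the real function operates on pyarrow types.

```python
from typing import Any, Dict


def prepare(schema: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with sorted keys and an '_isempty' entry at every struct level."""
    prepared: Dict[str, Any] = {}
    for name in sorted(schema):  # step 1: alphabetical field order
        value = schema[name]
        # Nested dicts model nested structs and are prepared recursively.
        prepared[name] = prepare(value) if isinstance(value, dict) else value
    # Step 2: one boolean flag per field at this level.
    prepared["_isempty"] = {name: "bool" for name in sorted(schema)}
    return prepared


schema = {"title": "string", "author": {"name": "string", "age": "int64"}}
result = prepare(schema)
print(list(result))
print(list(result["author"]))
```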