lancedb_haystack.document_store
Attributes
- logger
Classes
- LanceDBDocumentStore: Stores data in LanceDB, and leverages its inbuilt search features.
Functions
- _create_schema: Creates the LanceDB schema for the DocumentStore using the given metadata field schema and number of embedding dimensions.
- _create_isempty_section: Creates the _isempty struct for the given list of fields.
- _prepare_metadata_schema: Takes a pyarrow.StructType describing the metadata section and prepares it for use with LanceDB.
Module Contents
- lancedb_haystack.document_store.logger
- class lancedb_haystack.document_store.LanceDBDocumentStore(database: str, table_name: str, metadata_schema: pyarrow.StructType | None = None, embedding_dims: int | None = None)
Bases: haystack.document_stores.types.DocumentStore
Stores data in LanceDB, and leverages its inbuilt search features.
- _database
- _table_name
- _metadata_schema
- _embedding_dims
- db
- table_exists() bool
Check if the table this DocumentStore relies on already exists.
- Returns:
True if the table already exists in the LanceDB backing this DocumentStore
- count_documents() int
Returns how many documents are present in the document store.
- Returns:
the number of documents in the document store, or 0 if the table hasn’t been created yet.
- filter_documents(filters: Dict[str, Any] | None = None) List[haystack.Document]
Returns the documents that match the filters provided.
Filters are defined as nested dictionaries that can be of two types:
Comparison
Logic
Comparison dictionaries must contain the keys:
field
operator
value
Logic dictionaries must contain the keys:
operator
conditions
The conditions key must be a list of dictionaries, either of type Comparison or Logic.
The operator value in Comparison dictionaries must be one of:
==
!=
>
>=
<
<=
in
not in
The operator values in Logic dictionaries must be one of:
NOT
OR
AND
A simple filter:
```python
filters = {"field": "meta.type", "operator": "==", "value": "article"}
```
A more complex filter:
```python
filters = {
    "operator": "AND",
    "conditions": [
        {"field": "meta.type", "operator": "==", "value": "article"},
        {"field": "meta.date", "operator": ">=", "value": 1420066800},
        {"field": "meta.date", "operator": "<", "value": 1609455600},
        {"field": "meta.rating", "operator": ">=", "value": 3},
        {
            "operator": "OR",
            "conditions": [
                {"field": "meta.genre", "operator": "in", "value": ["economy", "politics"]},
                {"field": "meta.publisher", "operator": "==", "value": "nytimes"},
            ],
        },
    ],
}
```
- Parameters:
filters – the filters to apply to the document list.
- Returns:
a list of Documents that match the given filters.
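To make the filter semantics concrete, here is a minimal pure-Python sketch that evaluates such a filter against a plain nested dict. It is an illustration only, not how LanceDBDocumentStore applies filters. The names `COMPARISONS`, `lookup` and `matches` are hypothetical, and treating NOT as "not all conditions hold" is an assumption.

```python
from typing import Any, Dict

# Comparison operators from the list above, as plain Python callables.
# Ordering comparisons return False when the field is missing (None).
COMPARISONS = {
    "==": lambda a, b: a == b,
    "!=": lambda a, b: a != b,
    ">": lambda a, b: a is not None and a > b,
    ">=": lambda a, b: a is not None and a >= b,
    "<": lambda a, b: a is not None and a < b,
    "<=": lambda a, b: a is not None and a <= b,
    "in": lambda a, b: a in b,
    "not in": lambda a, b: a not in b,
}


def lookup(record: Dict[str, Any], field: str) -> Any:
    """Resolve a dotted field such as 'meta.type'; returns None if absent."""
    value: Any = record
    for part in field.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def matches(record: Dict[str, Any], filters: Dict[str, Any]) -> bool:
    """Return True if the record satisfies a Comparison or Logic dictionary."""
    operator = filters["operator"]
    if "conditions" in filters:  # Logic dictionary
        results = [matches(record, cond) for cond in filters["conditions"]]
        if operator == "AND":
            return all(results)
        if operator == "OR":
            return any(results)
        if operator == "NOT":  # assumption: NOT negates the AND of its conditions
            return not all(results)
        raise ValueError(f"Unknown logic operator: {operator}")
    # Comparison dictionary
    return COMPARISONS[operator](lookup(record, filters["field"]), filters["value"])


record = {"meta": {"type": "article", "date": 1500000000, "rating": 4, "genre": "economy"}}
simple = {"field": "meta.type", "operator": "==", "value": "article"}
print(matches(record, simple))
```

Because Logic dictionaries may contain further Logic dictionaries, the evaluation is naturally recursive.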
- perform_query(query: str | List[float] | None = None, filters: Dict[str, Any] | None = None, top_k: int | None = None) List[haystack.Document]
Performs a query against the LanceDB backing this DocumentStore.
- Parameters:
query – either a query string for full-text search (FTS), a vector for vector search, or None to use only the filters.
filters – Filters to apply to the search. See: https://docs.haystack.deepset.ai/docs/metadata-filtering
top_k – limit the results to the top_k most relevant documents. Default: no limit
- Returns:
a list of Haystack Documents which match the search and filters.
- Raises:
ValueError – if an invalid top_k is given (i.e. a negative value).
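The three query modes and the top_k check described above can be sketched as a small dispatch function. This is a hedged illustration of the documented behaviour, not the library's code; `classify_query` and the mode labels it returns are hypothetical.

```python
from typing import List, Optional, Union


def classify_query(
    query: Union[str, List[float], None] = None,
    top_k: Optional[int] = None,
) -> str:
    """Return which kind of search a perform_query-style call would imply."""
    # A negative top_k is the documented invalid case.
    if top_k is not None and top_k < 0:
        raise ValueError(f"top_k must not be negative, got {top_k}")
    if query is None:
        return "filters-only"  # no query: only the filters apply
    if isinstance(query, str):
        return "full-text"  # a string is treated as a full-text search query
    return "vector"  # otherwise assume a list of floats, i.e. an embedding


print(classify_query("renewable energy", top_k=5))
```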
- write_documents(documents: List[haystack.Document], policy: haystack.document_stores.types.DuplicatePolicy = DuplicatePolicy.NONE) int
Writes (or overwrites) documents into the store.
- Parameters:
documents – a list of documents.
policy – how to handle duplicates. Documents with the same ID count as duplicates. When duplicates are met, the store can:
- skip: keep the existing document and ignore the new one.
- overwrite: remove the old document and write the new one.
- fail: raise an error.
- Raises:
DuplicateDocumentError – raised when a duplicate document is found and policy=DuplicatePolicy.FAIL.
ValueError – if no documents are provided.
- Returns:
the number of documents created or updated.
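The three duplicate policies can be illustrated with an in-memory dict standing in for the table. This is a sketch under stated assumptions rather than the store's implementation: `Policy`, `DuplicateError` and `write` are hypothetical stand-ins for DuplicatePolicy, DuplicateDocumentError and write_documents.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Policy(Enum):
    """Hypothetical stand-in for haystack's DuplicatePolicy."""
    SKIP = "skip"
    OVERWRITE = "overwrite"
    FAIL = "fail"


class DuplicateError(Exception):
    """Hypothetical stand-in for DuplicateDocumentError."""


def write(store: Dict[str, str], docs: List[Tuple[str, str]], policy: Policy) -> int:
    """Write (id, content) pairs into store; return how many were created or updated."""
    if not docs:
        raise ValueError("no documents provided")
    written = 0
    for doc_id, content in docs:
        if doc_id in store:
            if policy is Policy.SKIP:
                continue  # keep the existing document, ignore the new one
            if policy is Policy.FAIL:
                raise DuplicateError(f"document {doc_id!r} already exists")
            # Policy.OVERWRITE falls through and replaces the old document
        store[doc_id] = content
        written += 1
    return written


store = {"a": "old"}
print(write(store, [("a", "new"), ("b", "fresh")], Policy.SKIP))
```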
- delete_documents(object_ids: List[str]) None
Deletes all documents with matching document IDs from the document store. Fails with MissingDocumentError if no document with a given ID is present in the store.
- Parameters:
object_ids – the IDs of the documents to delete.
- to_dict() Dict[str, Any]
Serializes this store to a dictionary.
- classmethod from_dict(data: Dict[str, Any]) LanceDBDocumentStore
Deserializes the store from a dictionary.
- lancedb_haystack.document_store._create_schema(metadata_schema: pyarrow.StructType, embedding_dims: int | None) pyarrow.Schema
Creates the LanceDB schema for the DocumentStore using the given metadata field schema and number of embedding dimensions.
- Parameters:
metadata_schema – a pyarrow StructType defining the schema for the metadata field.
embedding_dims – the number of dimensions used in the embedding.
- Returns:
a pyarrow schema used to initialise the table in LanceDB
- lancedb_haystack.document_store._create_isempty_section(field_names: List[str]) pyarrow.StructType
Creates the _isempty struct for the given list of fields.
Haystack expects its DocumentStores to return Documents which have only the fields they had when written. Unfortunately, LanceDB expects all fields to exist in all records, and not all types have easy ‘None’ analogues. To solve this, we have a struct of boolean flags to indicate whether a given field should be considered empty.
- Parameters:
field_names – a list of fieldnames to create entries for in the _isempty struct
- Returns:
a pyarrow StructType
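The reasoning behind the _isempty flags can be shown at the record level with plain dicts. This is only a sketch of the idea, assuming a hypothetical three-field metadata schema (`DEFAULTS`) and hypothetical helpers `pad` and `unpad`; the actual function builds a pyarrow.StructType rather than Python dicts.

```python
from typing import Any, Dict

# Hypothetical metadata schema: field name -> placeholder used when the field is absent.
DEFAULTS: Dict[str, Any] = {"title": "", "pages": 0, "rating": 0.0}


def pad(meta: Dict[str, Any]) -> Dict[str, Any]:
    """Fill every schema field and record which ones were really absent."""
    padded: Dict[str, Any] = {}
    isempty: Dict[str, bool] = {}
    for name, default in DEFAULTS.items():
        isempty[name] = name not in meta
        padded[name] = meta.get(name, default)
    padded["_isempty"] = isempty
    return padded


def unpad(stored: Dict[str, Any]) -> Dict[str, Any]:
    """Drop the placeholder values so only the originally written fields remain."""
    isempty = stored["_isempty"]
    return {k: v for k, v in stored.items() if k != "_isempty" and not isempty[k]}


meta = {"title": "Annual report", "pages": 0}
stored = pad(meta)
print(unpad(stored) == meta)
```

Note that a genuine value of 0 for `pages` survives the round trip, while the absent `rating` does not reappear. A placeholder value alone could not make that distinction.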
- lancedb_haystack.document_store._prepare_metadata_schema(struct: pyarrow.StructType) pyarrow.StructType
Take a pyarrow.StructType describing the metadata section and prepare it for use with LanceDB.
This covers a couple of steps to address limitations:
1. Sort the fields into alphabetical order. If we don’t do this, LanceDB tends to complain when we give it a Python dict, as those fields tend to be iterated in alphabetical order.
2. Add the _isempty section to each StructType in the specification. This lets us know if the field is meant to be empty in a given instance.
- Parameters:
struct – a pyarrow Struct
- Returns:
a copy of the struct with suitable _isempty sections added.
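The two preparation steps can be sketched on a plain-dict model of a struct, where each key is a field name and each value is either a type name or a nested dict. This is an illustration under assumptions, not the library's code: `prepare` is hypothetical, the dict model stands in for pyarrow.StructType, and placing `_isempty` last is a choice made here for readability.

```python
from typing import Any, Dict


def prepare(struct: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with fields sorted by name and an _isempty section added."""
    prepared: Dict[str, Any] = {}
    # Step 1: visit the fields in alphabetical order.
    for name in sorted(struct):
        kind = struct[name]
        # Nested structs are prepared recursively so each gets its own _isempty.
        prepared[name] = prepare(kind) if isinstance(kind, dict) else kind
    # Step 2: one boolean flag per field of this struct.
    prepared["_isempty"] = {name: "bool" for name in sorted(struct)}
    return prepared


schema = {"title": "string", "details": {"pages": "int64", "draft": "bool"}}
prepared = prepare(schema)
print(list(prepared))
```

The input dict is left unchanged, matching the documented behaviour of returning a copy.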