lancedb_haystack.document_store
===============================

.. py:module:: lancedb_haystack.document_store


Attributes
----------

.. autoapisummary::

   lancedb_haystack.document_store.logger


Classes
-------

.. autoapisummary::

   lancedb_haystack.document_store.LanceDBDocumentStore


Functions
---------

.. autoapisummary::

   lancedb_haystack.document_store._create_schema
   lancedb_haystack.document_store._create_isempty_section
   lancedb_haystack.document_store._prepare_metadata_schema


Module Contents
---------------

.. py:data:: logger

.. py:class:: LanceDBDocumentStore(database: str, table_name: str, metadata_schema: Optional[pyarrow.StructType] = None, embedding_dims: Optional[int] = None)

   Bases: :py:obj:`haystack.document_stores.types.DocumentStore`


   Stores data in LanceDB, and leverages its inbuilt search features.


   .. py:attribute:: _database


   .. py:attribute:: _table_name


   .. py:attribute:: _metadata_schema


   .. py:attribute:: _embedding_dims


   .. py:attribute:: db


   .. py:method:: table_exists() -> bool

      Check if the table this DocumentStore relies on already exists.

      :return: True if the table already exists in the LanceDB backing this DocumentStore


   .. py:method:: count_documents() -> int

      Returns how many documents are present in the document store.

      :return: the number of documents in the document store, or 0 if the table hasn't been created yet.


   .. py:method:: filter_documents(filters: Optional[Dict[str, Any]] = None) -> List[haystack.Document]

      Returns the documents that match the filters provided.

      Filters are defined as nested dictionaries that can be of two types:

      - Comparison
      - Logic

      Comparison dictionaries must contain the keys:

      - `field`
      - `operator`
      - `value`

      Logic dictionaries must contain the keys:

      - `operator`
      - `conditions`

      The `conditions` key must be a list of dictionaries, either of type Comparison or Logic.

      The `operator` value in Comparison dictionaries must be one of:

      - `==`
      - `!=`
      - `>`
      - `>=`
      - `<`
      - `<=`
      - `in`
      - `not in`

      The `operator` value in Logic dictionaries must be one of:

      - `NOT`
      - `OR`
      - `AND`

      A simple filter:

      ```python
      filters = {"field": "meta.type", "operator": "==", "value": "article"}
      ```

      A more complex filter:

      ```python
      filters = {
          "operator": "AND",
          "conditions": [
              {"field": "meta.type", "operator": "==", "value": "article"},
              {"field": "meta.date", "operator": ">=", "value": 1420066800},
              {"field": "meta.date", "operator": "<", "value": 1609455600},
              {"field": "meta.rating", "operator": ">=", "value": 3},
              {
                  "operator": "OR",
                  "conditions": [
                      {"field": "meta.genre", "operator": "in", "value": ["economy", "politics"]},
                      {"field": "meta.publisher", "operator": "==", "value": "nytimes"},
                  ],
              },
          ],
      }
      ```

      :param filters: the filters to apply to the document list.
      :return: a list of Documents that match the given filters.


   .. py:method:: perform_query(query: Optional[Union[str, List[float]]] = None, filters: Optional[Dict[str, Any]] = None, top_k: Optional[int] = None) -> List[haystack.Document]

      Performs a query against the LanceDB backing this DocumentStore.

      :param query: either a query string for full-text search (FTS), a vector for vector search, or None to use only the filters.
      :param filters: filters to apply to the search. See: https://docs.haystack.deepset.ai/docs/metadata-filtering
      :param top_k: limit the results to the top_k most relevant documents. Default: no limit
      :return: a list of Haystack Documents which match the search and filters.
      :raises ValueError: if an invalid top_k is given (e.g. a negative value)


   .. py:method:: write_documents(documents: List[haystack.Document], policy: haystack.document_stores.types.DuplicatePolicy = DuplicatePolicy.NONE) -> int

      Writes (or overwrites) documents into the store.

      :param documents: a list of documents.
      :param policy: documents with the same ID count as duplicates.
         When duplicates are met, the store can:

         - skip: keep the existing document and ignore the new one.
         - overwrite: remove the old document and write the new one.
         - fail: an error is raised.

      :raises DuplicateDocumentError: raised on a duplicate document if `policy=DuplicatePolicy.FAIL`
      :return: the number of documents created or updated.
      :raises ValueError: if no documents are provided.


   .. py:method:: delete_documents(object_ids: List[str]) -> None

      Deletes all documents with matching document IDs from the document store.
      Fails with `MissingDocumentError` if no document with a given ID is present in the store.

      :param object_ids: the object_ids to delete


   .. py:method:: to_dict() -> Dict[str, Any]

      Serializes this store to a dictionary.


   .. py:method:: from_dict(data: Dict[str, Any]) -> LanceDBDocumentStore
      :classmethod:

      Deserializes the store from a dictionary.


.. py:function:: _create_schema(metadata_schema: pyarrow.StructType, embedding_dims: Optional[int]) -> pyarrow.Schema

   Creates the LanceDB schema for the DocumentStore using the given metadata field schema and number of embedding dimensions.

   :param metadata_schema: a pyarrow StructType defining the schema for the metadata field.
   :param embedding_dims: the number of dimensions used in the embedding.
   :return: a pyarrow schema used to initialise the table in LanceDB


.. py:function:: _create_isempty_section(field_names: List[str]) -> pyarrow.StructType

   Creates the _isempty struct for the given list of fields.

   Haystack expects its DocumentStores to return Documents which have only the fields they had when written.
   Unfortunately, LanceDB expects all fields to exist in all records, and not all types have easy 'None' analogues.
   To solve this, we keep a struct of boolean flags that indicate whether a given field should be considered empty.

   :param field_names: a list of field names to create entries for in the _isempty struct
   :return: a pyarrow StructType


.. py:function:: _prepare_metadata_schema(struct: pyarrow.StructType) -> pyarrow.StructType

   Take a pyarrow.StructType describing the metadata section and prepare it for use with LanceDB.

   This covers a couple of steps to address limitations:

   1. Sort the fields into alphabetical order. If we don't do this, then LanceDB tends to complain when we give it
      a Python dict, as those fields tend to be iterated in alphabetical order.
   2. Add the _isempty section to each StructType in the specification. This lets us know if the field is meant
      to be empty in a given instance.

   :param struct: a pyarrow StructType
   :return: a copy of the struct with suitable _isempty sections added.
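
The two preparation steps described for `_prepare_metadata_schema` can be sketched roughly as follows. This is only an illustration of the idea, not the module's implementation: it uses plain nested dicts as a hypothetical stand-in for `pyarrow.StructType`, the helper names `create_isempty_section` and `prepare_metadata_schema` and the type-name strings are made up for the example, and placing `_isempty` after the sorted fields is an assumption.

```python
from typing import Any, Dict, List

# Hypothetical stand-in: a nested dict of {field_name: type_name or nested dict}
# plays the role of a pyarrow.StructType. This is NOT the library's own code.
Schema = Dict[str, Any]

ISEMPTY = "_isempty"


def create_isempty_section(field_names: List[str]) -> Dict[str, str]:
    """Build one boolean flag per field, marking whether it is empty in a record."""
    return {name: "bool" for name in sorted(field_names)}


def prepare_metadata_schema(struct: Schema) -> Schema:
    """Return a copy with fields sorted by name and an _isempty section per level."""
    prepared: Schema = {}
    for name in sorted(struct):
        value = struct[name]
        # Recurse into nested structs so every level gets its own flags.
        prepared[name] = prepare_metadata_schema(value) if isinstance(value, dict) else value
    # Assumption: the flags struct is appended after the sorted fields.
    prepared[ISEMPTY] = create_isempty_section(list(struct))
    return prepared


schema = {"title": "string", "author": {"name": "string", "age": "int64"}}
prepared = prepare_metadata_schema(schema)
print(list(prepared))            # top-level field order after preparation
print(list(prepared["author"]))  # nested struct is sorted and flagged too
```

When reading a record back, a flag set to true in `_isempty` would signal that the corresponding field should be dropped from the returned Document rather than reported with a placeholder value.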