I have created an Azure Cognitive Search index. All the fields I want to search, retrieve, filter, sort, and facet on are included in the single table that the index is built from. Some of the data fields in that table are coded, but I have a separate table that serves as a dictionary, defining those codes in plain English. I would like to be able to search on the defined/literal values in my search results without having to add the contents of the large dictionary table to the search index.
Is it possible to configure the index to use a referential table in this way? How?
Or is my only option to denormalize the entire contents of the dictionary into the index table?
Thanks!
Azure Cognitive Search will only return data that it has stored in its search indexes. If you want to "join" that data with some external table, that's something you may need to do client-side.
If you want to search against the terms in your external table, that data needs to be included in the search index.
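To illustrate the client-side "join" idea: after retrieving hits from the index, you can map the coded field to its plain-English label from an in-memory copy of the dictionary table. This is a minimal sketch in Java; the field names (status, statusLabel) and code values are made up, and note this only decorates results for display — it does not make the literal values searchable.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ClientSideJoin {
    // Hypothetical dictionary table loaded into memory: code -> plain-English label.
    static final Map<String, String> DICTIONARY = new HashMap<>();
    static {
        DICTIONARY.put("A01", "Active customer");
        DICTIONARY.put("C02", "Closed account");
    }

    // Adds a literal "statusLabel" field to each search hit based on its coded "status" field.
    static void decorate(List<Map<String, String>> hits) {
        for (Map<String, String> hit : hits) {
            String code = hit.get("status");
            hit.put("statusLabel", DICTIONARY.getOrDefault(code, code));
        }
    }
}
```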
I wonder how I can find a specific value in a DB without scanning the entire table.
For example:
There is a DB of students and we are looking for all the students with a certain name. How do you do that without going through the whole table?
Use INDEXES
Indexes are used to quickly locate data without having to search every row in a database table every time a database table is accessed. ... Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.
SQL Server has four options for improving performance for this type of query:
A regular index (either clustered or non-clustered).
A full text index.
Partitioning.
Hash index (for memory optimized tables).
A regular index, created using CREATE INDEX, is the "canonical" answer to this question. It is like an alphabetical list of all names with a pointer to each record. The implementation uses something called B-trees, so the analogy is not perfect. These indexes can be used for equality comparisons (e.g. =, IS NULL, IN) and inequality comparisons (e.g. <, >).
A full text index indexes all words in a text column (for some definition of "word"). This can be used for a range of full-text search options, available through the CONTAINS predicate.
Partitioning is used when you have lots and lots of data and only a handful of categories. That is highly unlikely with a name in a student database. But it physically splits the data into separate files for each name or range of names.
Hash-based indexing is only available on memory-optimized tables. These indexes are only useful for equality comparisons using = and IN (and their negations, <> and NOT IN).
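To make the first option concrete, here is a rough analogy in plain Java: a TreeMap (a red-black tree, a cousin of the B-tree used by real databases) mapping names to row ids gives you both fast equality lookups and ordered range scans, without touching rows that don't match. This is an illustration of the principle, not how SQL Server is implemented.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

public class NameIndex {
    // A sorted tree plays the role of a (simplified) B-tree index:
    // name -> list of row ids, kept in sorted order by name.
    private final NavigableMap<String, List<Integer>> index = new TreeMap<>();

    void add(String name, int rowId) {
        index.computeIfAbsent(name, k -> new ArrayList<>()).add(rowId);
    }

    // Equality lookup: O(log n) tree descent instead of a full table scan.
    List<Integer> lookup(String name) {
        return index.getOrDefault(name, List.of());
    }

    // Range lookup, e.g. all names from "A" (inclusive) to "B" (exclusive).
    NavigableMap<String, List<Integer>> range(String from, String to) {
        return index.subMap(from, true, to, false);
    }
}
```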
I would like to query the Schema definition of a Index in MarkLogic.
How can I query that?
What would be the query to do that?
I am talking about the Schema such as Elasticsearch Schema, with Field Types, Analyses, etc.
Please think of my question as if I were asking how to see the column names and types in Oracle. How do I do the same in MarkLogic? Any examples?
MarkLogic has a universal index, so there is no requirement to define a schema up front to search on specific elements or properties.
To do datatyped queries on element or properties, you can use TDE in MarkLogic 9 to define how to project datatyped values from documents in a collection into the indexes as a view over the documents. To find out the list of columns with data types for a view, you can either query the system columns view or retrieve the TDE template from the schemas database.
In MarkLogic 8 and before, you would define range indexes on elements, properties, fields, or paths. On the E-node, the Admin API can get the list of range indexes for any database. On the middle tier, the Management REST API provides the equivalent REST request.
Hoping that clarifies,
I want to populate an index but make it searchable only after I'm done. Is there a standard way of doing that with Elasticsearch? I think I can set "index.blocks.read": true, but I'd like a way to ask Elasticsearch for a list of the searchable indices, and I don't know how to do that with that setting. Also, closing/opening an index feels a bit cumbersome.
A solution I found is to add a document to each index defining that index's status. But querying for the list of indices is a bit annoying, specifically because querying and paginating a long list of 2,000 index-status documents is problematic. Scroll-scan works because it gives me all the results in one go (every shard has at most one index-status document), but that feels like using the wrong tool for the job (i.e. a scroll-scan op that always does exactly one scroll).
I don't want one document that references all the indices because then I'd have to garbage collect it manually alongside garbage collecting indices. But maybe that's the best tradeoff...
Is there a standard practice that I'm not aware of?
How about using aliases? Instead of querying an index directly, your application could query an alias (e.g. live) instead. As long as your index is not ready (i.e. still being populated), you don't assign the live alias to it and hence the index won't be searchable.
Basically, the process goes like this:
Create the index with its settings and mappings
Populate it
When done, assign the live alias to it and send your queries against it
Later when you need to index new data, you create another index
You populate that new index
When done, you switch the aliases, i.e. remove the live alias from the previous searchable index and assign the live alias to the new searchable index
Here is a simple example that demonstrates this.
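For concreteness, the alias switch in the last step is a single atomic call to the _aliases endpoint, so searches never see a moment with no (or two) live indices. The index names products_v1/products_v2 and the alias name live are made up for the example:

```json
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products_v1", "alias": "live" } },
    { "add":    { "index": "products_v2", "alias": "live" } }
  ]
}
```

Your application keeps querying /live/_search throughout and never needs to know which physical index is current.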
I have a database with two primary tables:
The components table (50M rows),
The assemblies table (100K rows),
...and a many-to-many relationship between them (about 100K components per assembly), for a total of about 10G relationships.
What's the best way to index the components such that I could query the index for a given assembly? Given the amount of relationships, I don't want to import them into the Lucene index, but am looking instead for a way to "join" with my external table on-the-fly.
Solr supports multi-valued fields. I'm not positive whether Lucene supports them natively or not; it's been a while for me. If only one of the entities is searchable, which you mentioned is components, I would index all components with a field called "assemblies" or "assemblyIds" or something similar, and include whatever metadata you need to identify the assemblies.
Then you can search components with
assemblyIds:(1 OR 2 OR 3)
to find components in assembly 1, 2, or 3.
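A toy sketch of that multi-valued-field search in plain Java (in real Lucene this would be a StringField added once per assembly id, queried with a BooleanQuery of TermQuery SHOULD clauses; the names below are made up):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ComponentSearch {
    // componentId -> set of assembly ids it belongs to (the multi-valued field).
    private final Map<String, Set<Integer>> components = new HashMap<>();

    void index(String componentId, Set<Integer> assemblyIds) {
        components.put(componentId, assemblyIds);
    }

    // Equivalent of the query assemblyIds:(1 OR 2 OR 3): a component matches
    // if it belongs to at least one of the requested assemblies.
    Set<String> search(Set<Integer> wanted) {
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, Set<Integer>> e : components.entrySet()) {
            for (Integer id : wanted) {
                if (e.getValue().contains(id)) {
                    hits.add(e.getKey());
                    break;
                }
            }
        }
        return hits;
    }
}
```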
To be brief: you have to process the data and index it before you can search. There is no way to just "plug in" Lucene to some data or database; instead, you have to feed the data itself into Lucene (process, parse, analyze, index, and then query).
rustyx: "My data is mostly static. I can even live with a read-only index."
In that case, you might use Lucene itself. You can iterate over the data source to add all the many-to-many relations to the Lucene index. How did you come up with that "100GB" size? People index millions and millions of documents using Lucene; I don't think indexing would be a problem for you.
You can add multiple field instances to the index with different values ("components") in a document having an "assembly" field as well.
rustyx: "I'm looking instead into a way to "join" the Lucene search with my external data source on the fly"
If you need something seamless, you might try out the following framework which acts like a bridge between relational database and Lucene Index.
Hibernate Search: in that tutorial, you might search for the "@ManyToMany" keyword to find the exact section and get some idea.
Is this possible? I cannot access the database directly--only through data objects.
Would I be able to search the index if the items are returned in something like ArrayList?
If this is not possible, is there some way I can use Lucene (or some other tool) to do fuzzy matching against an object in Java?
For example, I have a Person object that has a FirstName and LastName. I want to do a fuzzy match on the name.
So, say I have an array of some number of Person objects: would there be an efficient way of looping through each Person object and comparing the names?
Take those data objects and build a separate Lucene index over them, storing the fields you need. Using your Person example, every Lucene document would be [Id, FirstName, LastName]. A search on this index would return the Id required to query your database for the complete data object.
The actual indexing is easy: you just retrieve a list of data objects, iterate over them, generate Lucene documents, and store them using an IndexWriter. You can work against either a filesystem directory for persistent storage or an in-memory store.
Those are the possible solutions I came up with. However, I cannot store my index with FSDirectory (the project specs do not allow this), and with RAMDirectory there are going to be thousands of Person objects to search through, so I don't know if in-memory storage is ideal for this situation.
Is there any other sort of fuzzy match algorithm I can use that will be efficient for large sets of data?
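For a few thousand in-memory objects, one library-free possibility is a plain edit-distance comparison over the list: O(n) per query, which may be acceptable at this scale. This is a sketch, not a tuned implementation (Lucene's FuzzyQuery does something much more efficient with automata):

```java
public class Fuzzy {
    // Classic dynamic-programming Levenshtein distance, using two rolling rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // A name "fuzzily matches" if it is within maxEdits of the query, case-insensitively.
    static boolean matches(String query, String name, int maxEdits) {
        return distance(query.toLowerCase(), name.toLowerCase()) <= maxEdits;
    }
}
```

You would loop over the Person array calling matches(query, person.getLastName(), 1) and collect the hits; an early exit once the running row minimum exceeds maxEdits can cut the cost further.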