What index would speed up my XQuery in X-Hive / Documentum xDB? - indexing

I have approx 2500 documents in my test database and searching the xpath /path/to/#attribute takes approximately 2.4 seconds. Doing distinct-values(/path/to/#attribute) takes 3.0 seconds.
I've been able to speed up queries on /path/to[#attribute='value'] to hundreds or tens of milliseconds by adding a Path value index on /path/to[#attribute<STRING>] but no index I can think of gets picked up for the more general query.
Anybody know what indexes I should be using?

The index you propose is the correct one (/path/to[#attribute]), but unfortunately the xDB optimizer currently doesn't recognize this specific case since the 'target node' stored in the index is always an element and not an attribute. If /path/to/#attribute has few results then you can optimize this by slightly modifying your query to this: distinct-values(/path/to[#attribute]/#attribute). With this query the optimizer recognizes that there is an index it can use to get to the 'to' element, but then it still has the access the target document to retrieve the attribute for the #attribute step. This is precisely why it will only benefit cases where there are few hits: each hit will likely access a different data page.
What you also can do is access the keys in the index directly through the API: XhiveIndexIf.getKeys(). This will be very fast, but clearly this is not very user friendly (and should be done by the optimizer instead).
Clearly the optimizer could handle this. I will add it to the bug tracker.

Related

Element Range Index & availability for search

MarkLogic 9.0.8.2
We have around 20M records in our database in XML format.
To work with facets, we have created element-rage-index on the given element.
It is working fine, so no issue there.
Real problem is that, we now want to deploy same code on different environments like System Test(ST), UAT, Production.
Before deploying code, we have to make sure that given index exist. So we execute it in 1/2 days in advance.
We noticed that until full indexing is completed, we can't deploy our code else it will start showing up errors like this.
<error:code>XDMP-ELEMRIDXNOTFOUND</error:code>
<error:name/>
<error:xquery-version>1.0-ml</error:xquery-version>
<error:message>No element range index</error:message>
<error:format-string>XDMP-ELEMRIDXNOTFOUND: cts:element-reference(fn:QName("","tc"), ("type=string", "collation=http://marklogic.com/collation/")) -- No string element range index for tc collation=http://marklogic.com/collation/ </error:format-string>
And once index is finished, same code will run as expected.
Specially in ST/UAT, we are fine if we get partial data with unfinished indexing.
Is there any way we can achieve this? else we are loosing too much time just to wait for index to finish.
This happens every time when we come up with new feature which has dependency on new index
You can only use a range index if it exists and is available. It is not available until all matching records have been indexed.
You should create your indexes earlier and allow enough time to finish reindexing before deploying code that uses them. Maybe make your code deployment depend upon the reindexing status and not allow for it to be deployed until it has completed.
If the new versions of your applications can function without the indexes (value query instead of range-query), or you are fine with queries returning inaccurate results, then you could enable/disable the section of code utilizing them with feature flags, or wrap with try/catch, but you really should just create the indexes earlier in your deployment cycles.
Otherwise, if you are performing tests without a complete and functioning environment, what are you really testing?

Solr Query, Syntax trouble with drupal apache solr module

I'm working with apache solr, and the module for drupal 7 apachesolr.
Some of our queries are very custom.
I have been looking in the solr documentation and at explanations on stackoverflow.
I have come up with the query:
/select?q=&start=0&rows=20&fq=bundle:(message)&fq=sm_hashtags:(hashtags)&fq=(is_uid:(1 OR 2 OR 37 OR 38 OR 50 OR 166 OR 174 OR 198 OR 431 OR 499 OR 640 OR 642) AND is_privacy:(0)) AND -is_uid:(177 OR 189) AND is_status:(1)&fq=entity_id:{* TO 2666}&fl=tus_message_object,sm_hashtags,content,ts_search,is_privacy,is_status,is_uid&sort=entity_id+desc&wt=json&wt=json
but this is returning NULL, i have tried a few different things like:
/select?q=&start=0&rows=20&fq=bundle:(message)&fq=sm_hashtags:(hashtags)&fq=((is_uid:(1+OR+2+OR+37+OR+38+OR+50+OR+166+OR+174+OR+198+OR+431+OR+499+OR+640+OR+642)+is_privacy:(0))-is_uid:(177+OR+189)+is_status:(1))&fq=entity_id:{*+TO+2666}&fl=tus_message_object,sm_hashtags,content,ts_search,is_privacy,is_status,is_uid&sort=entity_id+desc&wt=json&wt=json
But I am not sure this is correct.
I need a filter that allows the users with id (is_uid) and all the ones with privacy is 0 but not the users in blocked id list -is_uid and where the status is 1.
You're using a lot of clauses so it is very hard to determine here what could be the cause. I can give you the following hints:
a) Debug mode
Copy the whole query string, go to the Solr Admin console and execute that query with and additional debug=true or debugQuery=true. In the outcoming response, Solr will append an additional section where it explains how it "sees" the query you entered
b) Investigate each fq one by one
If there's a problem, it is for sure in the filter queries, so I suggest you to gradually try them one by one. But before of that, see the c) point.
c) fq design
Filter queries are great for their caching capabilities. Here, although there are some fq that I think won't be reused so much (the is_uid filter) I suggest you to split those queries in order to have bettere reusable (i.e. cached) results; something like:
fq=bundle:message // You already have this
fq=sm_hashtags:hashtags // You already have this
fq=is_privacy:0 // this will cache (for further reuse) all documents with no privacy
fq=is_status:1 // this will cache (for further reuse) all documents with status 1
fq=is_uid(1 OR 921 OR 9...) // Most probably these query results won't benefit so much of caching. If so, I think this query could be also part of the main query
fq=-is_uid(8 OR 99) // Most probably these query results won't benefit so much of caching. If so, I think this query could be also part of the main query
You are querying nothing on no fields q=. Try q=*:*, or anything on any field. Best way to debug solr is by constructing your query manually in the solr query builder, usually http://localhost:8983/solr and seeing what it comes back with.

Elasticsearch much slower when returning all results

I have about 20,000 documents stored in elastic search, at about 200kb each.
I have a search which has 733 hits total, I'm running that takes about 50ms to complete when returning 10 results.
If I set the size to 1000 so that it returns all results, the search takes 3-5 seconds to return.
Normally I would see that this is because it has to continue searching until it finds all of them, which takes extra time. However when returning 10 results only, the search still says 733 hits in total, so it already knows which documents are to be returned!
Note that I am not returning the _source field here, all I want it the list of _ids back, so I can't imagine that it would have to read any more data from the disk, as all the _ids are surely stored in the indices anyway.
Am I missing something in the way this works?
(My _ids are guids that we use internally).
EDIT: Since posting I've re-indexed with two changes to the mapping:
Set _source to false, so now the actual documents aren't stored.
Changed the index for the field that I was searching on to be not_analyzed.
This solves the problem, now I'm getting all 733 _ids back in ~50ms. Not sure which change solved it though. I'll take one of them back out and re-index.
It will take that Time. Because it need to fetch all data from ES and calculate score for your query.
Try
1)set fields to not analyzed which you Don search in.
2)change the store type of ES from simplfs to mmaps.. ( mention "index.store.type:mmaps" in elasticsearch.yml..)
3)configure less shard as much possible.. Shard more must be equal to move on nodes you gonna use..

What is a best way to organise the complex couchdb view (sql-like query)?

In my application I need a SQL-like query of the documents. The big picture is that there is a page with a paginated table showing the couchdb documents of a certain "type". I have about 15 searchable columns like timestamp, customer name, the us state, different numeric fields, etc. All of these columns are orderable, also there is a filter form allowing the user to filter by each of the fields.
For a more concrete below is a typical query which is a result by a customer setting some of the filter options and following to the second page. Its written in a pseodo-sql code, just to explain the problem:
timestamp > last_weeks_monday_epoch AND timestamp < this_weeks_monday_epoch AND marked_as_test = False AND dataspace="production" AND fico > 650
SORT BY timestamp DESC
LIMIT 15
SKIP 15
This would be a trivial problem if I were using any sql-like database, but couchdb is way more fun ;) To solve this I've created a view with the following structure of the emitted rows:
key: [field, value], id: doc._id, value: null
Now, to resolve the example query above I need to perform a bunch of queries:
{startkey: ["timestamp", last_weeks_monday_epoch], endkey: ["timestamp", this_weeks_monday_epoch]}, the *_epoch here are integers epoch timestamps,
{key: ["marked_as_test", False]},
{key: ["dataspace", "production"]},
{startkey: ["fico", 650], endkey: ["fico", {}]}
Once I have the results of the queries above I calculate intersection of the sets of document IDs and apply the sorting using the result of timestamp query. Than finally I can apply the slice resolving the document IDs of the rows 15-30 and download their content using bulk get operation.
Needless to say, its not the fastest operation. Currently the dataset I'm working with is roughly 10K documents big. I can already see that the part when I'm calculating the intersection of the sets can take like 4 seconds, obviously I need to optimize it further. I'm afraid to think, how slow its going to get in a few months when my dataset doubles, triples, etc.
Ok, so having explained the situation I'm at, let me ask the actual questions.
Is there a better, more natural way to reach my goal without loosing the flexibility of the tool?
Is the view structure I've used optimal ? At some point I was considering using a separate map() function generating the value of each field. This would result in a smaller b-trees but more work of the view server to generate the index. Can I benefit this way ?
The part of algorithm where I have to calculate intersections of the big sets just to later get the slice of the result bothers me. Its not a scalable approach. Does anyone know a better algorithm for this ?
Having map function:
function(doc){
if(doc.marked_as_test) return;
emit([doc.dataspace, doc.timestamp, doc.fico], null):
}
You can made similar request:
http://localhost:5984/db/_design/ddoc/_view/view?startkey=["production", :this_weeks_monday_epoch]&endkey=["production", :last_weeks_monday_epoch, 650]&descending=true&limit=15&skip=15
However, you should pass :this_weeks_monday_epoch and :last_weeks_monday_epoch values from the client side (I believe they are some calculable variables on database side, right?)
If you don't care about dataspace field (e.g. it's always constant), you may move it into the map function code instead of having it in query parameters.
I don't think CouchDB is a good fit for the general solution to your problem. However, there are two basic ways you can mitigate the ways CouchDB fits the problem.
Write/generate a bunch of map() functions that use each separate column as the key (for even better read/query performance, you can even do combinatoric approaches). That way you can do smart filtering and sorting, making use of a bunch of different indices over the data. On the other hand, this will cost extra disk space and index caching performance.
Try to find out which of the filters/sort orders your users actually use, and optimize for those. It seems unlikely that each combination of filters/sort orders is used equally, so you should be able to find some of the most-used patterns and write view functions that are optimal for those patterns.
I like the second option better, but it really depends on your use case. This is one of those things SQL engines have been pretty good at traditionally.

Need for long and dynamic select query/view sqlite

I have a need to generate a long select query of potentially thousands of where conditions like (table1.a = ? OR table1.a = ? OR ...) AND (table2.b = ? OR table2.b = ? ...) AND....
I initially started building a class to make this more bearable, but have since stopped to wonder if this will work well. This query is going to be hammering a table of potentially 10s of millions of rows joined with 2 more tables with thousands of rows.
A number of concerns are stemming from this:
1.) I wanted to use these statements to generate a temp view so I could easily transfer over existing code base, the point here is I want to filter data that I have down for analysis based on selected parameters in a GUI, so how poorly will a view do in this scenario?
2.) Can sqlite even parse a query with thousands of binds?
3.) Isn't there a framework that can make generating this query easier other than with string concatenation?
4.) Is the better solution to dump all of the WHERE variables into hash sets in memory and then just write a wrapper for my DB query object that gets next() until a query is encountered this satisfies all my conditions? My concern here is, the application generates graphs procedurally on scrolls, so waiting to draw while calling query.next() x 100,000 might cause an annoying delay? Ideally I don't want to have to wait on the next row that satisfies everything for more than 30ms at a time.
edit:
New issue, it came to my attention that sqlite3 is limited to 999 bind values(host parameters) at compile time.
So it seems as if the only way to accomplish what I had originally intended is to
1.) Generate the entire query via string concatenations(my biggest concern being, I don't know how slow parsing all the data inside sqlite3 will be)
or
2.) Do the blanket query method(select * from * where index > ? limit ?) and call next() until I hit what valid data in my compiled code(including updating index variable and re-querying repeatedly)
I did end up writing a wrapper around the QSqlQuery object that will walk a table using index > variable and limit to allow "walking" the table.
Consider dumping the joined results without filters (denormalized) into a flat file and index it with Fastbit, a bitmap index engine.