Element Range Index & availability for search - indexing

MarkLogic 9.0.8.2
We have around 20M records in our database in XML format.
To work with facets, we have created element-rage-index on the given element.
It is working fine, so no issue there.
Real problem is that, we now want to deploy same code on different environments like System Test(ST), UAT, Production.
Before deploying code, we have to make sure that given index exist. So we execute it in 1/2 days in advance.
We noticed that until full indexing is completed, we can't deploy our code else it will start showing up errors like this.
<error:code>XDMP-ELEMRIDXNOTFOUND</error:code>
<error:name/>
<error:xquery-version>1.0-ml</error:xquery-version>
<error:message>No element range index</error:message>
<error:format-string>XDMP-ELEMRIDXNOTFOUND: cts:element-reference(fn:QName("","tc"), ("type=string", "collation=http://marklogic.com/collation/")) -- No string element range index for tc collation=http://marklogic.com/collation/ </error:format-string>
And once index is finished, same code will run as expected.
Specially in ST/UAT, we are fine if we get partial data with unfinished indexing.
Is there any way we can achieve this? else we are loosing too much time just to wait for index to finish.
This happens every time when we come up with new feature which has dependency on new index

You can only use a range index if it exists and is available. It is not available until all matching records have been indexed.
You should create your indexes earlier and allow enough time to finish reindexing before deploying code that uses them. Maybe make your code deployment depend upon the reindexing status and not allow for it to be deployed until it has completed.
If the new versions of your applications can function without the indexes (value query instead of range-query), or you are fine with queries returning inaccurate results, then you could enable/disable the section of code utilizing them with feature flags, or wrap with try/catch, but you really should just create the indexes earlier in your deployment cycles.
Otherwise, if you are performing tests without a complete and functioning environment, what are you really testing?

Related

Azure Data Factory Limits

I have created a simple pipeline that operates as such:
Generates an access token via an Azure Function. No problem.
Uses a Lookup activity to create a table to iterate through the rows (4 columns by 0.5M rows). No problem.
For Each activity (sequential off, batch-size = 10):
(within For Each): Set some variables for checking important values.
(within For Each): Pass values through web activity to return a json.
(within For Each): Copy Data activity mapping parts of the json to the sink-dataset (postgres).
Problem: The pipeline slows to a crawl after approximately 1000 entries/inserts.
I was looking at this documentation regarding the limits of ADF.
ForEach items: 100,000
ForEach parallelism: 20
I would expect that this falls within in those limits unless I'm misunderstanding it.
I also cloned the pipeline and tried it by offsetting the query in one, and it tops out at 2018 entries.
Anyone with more experience be able to give me some idea of what is going on here?
As a suggestion, whenever I have to fiddle with variables inside a foreach, I made a new pipeline for the foreach process, and call it from within the foreach. That way I make sure that the variables get their own context for each iteration of the foreach.
Have you already checked that the bottleneck is not at the source or sink? If the database or web service is under some stress, then going sequential may help if your scenario allows that.
Hope this helped!

Is there a way to check if a query hit a cached result or not in BigQuery?

We are performance tuning, both in terms of cost and speed, some of our queries and the results we get are a little bit weird. First, we had one query that did an overwrite on an existing table, we stopped that one after 4 hours. Running the same query to an entirely new table and it only took 5 minutes. I wonder if the 5 minute query maybe used a cached result from the first run, is that possible to check? Is it possible to force BigQuery to not use cache?
If you run query in UI - expand Options and make sure Use Cached Result properly set
Also, in UI, you can check Job Details to see if cached result was used
If you run your query programmatically - you should use respective attributes - configuration.query.useQueryCache and statistics.query.cacheHit

What state is saved between rerunning queries in Linqpad?

What state is saved between rerunning queries in Linqpad? I presumed none, so if you run a script twice it will have the same results both time.
However run the C# Program below twice in the same Linqpad tab. You'll find the first it prints an empty list, the second time a list with the message 'hey'. What's going on?
System.ComponentModel.TypeDescriptor.GetAttributes(typeof(String)).OfType<ObsoleteAttribute>().Dump();
System.ComponentModel.TypeDescriptor.AddAttributes(typeof(String),new ObsoleteAttribute("hey"));
LINQPad caches the application domain between queries, unless you request otherwise in Edit | Preferences (or press Ctrl+Shift+F5 to clear the app domain). This means that anything stored in static variables will be preserved between queries, assuming the types are numerically identical. This is why you're seeing the additional type description attribute in your code, and also explains why you often see a performance advantage on subsequent query runs (since many things are cached one way or another in static variables).
You can take advantage of this explicitly with LINQPad's Cache extension method:
var query = <someLongRunningQuery>.Cache();
query.Select (x => x.Name).Dump();
Cache() is a transparent extension method that returns exactly what it was fed if the input was not already seen in a previous query. Otherwise, it returns the enumerated result from the previous query.
Hence if you change the second line and re-execute the query, the query will execute quickly since will be supplied from a cache instead of having to re-execute.

django objects...values() select only some fields

I'm optimizing the memory load (~2GB, offline accounting and analysis routine) of this line:
l2 = Photograph.objects.filter(**(movie.get_selectors())).values()
Is there a way to convince django to skip certain columns when fetching values()?
Specifically, the routine obtains all rows of the table matching certain criteria (db is optimized and performs it very quickly), but it is a bit too much for python to handle - there is a long string referenced in each row, storing the urls for thumbnails.
I only really need three fields from each row, but, if all the fields are included, it suddenly consumes about 5kB/row which sadly pushes the RAM to the limit.
The values(*fields) function allows you to specify which fields you want.
Check out the QuerySet method, only. When you declare that you only want certain fields to be loaded immediately, the QuerySet manager will not pull in the other fields in your object, till you try to access them.
If you have to deal with ForeignKeys, that must also be pre-fetched, then also check out select_related
The two links above to the Django documentation have good examples, that should clarify their use.
Take a look at Django Debug Toolbar it comes with a debugsqlshell management command that allows you to see the SQL queries being generated, along with the time taken, as you play around with your models on a django/python shell.

What index would speed up my XQuery in X-Hive / Documentum xDB?

I have approx 2500 documents in my test database and searching the xpath /path/to/#attribute takes approximately 2.4 seconds. Doing distinct-values(/path/to/#attribute) takes 3.0 seconds.
I've been able to speed up queries on /path/to[#attribute='value'] to hundreds or tens of milliseconds by adding a Path value index on /path/to[#attribute<STRING>] but no index I can think of gets picked up for the more general query.
Anybody know what indexes I should be using?
The index you propose is the correct one (/path/to[#attribute]), but unfortunately the xDB optimizer currently doesn't recognize this specific case since the 'target node' stored in the index is always an element and not an attribute. If /path/to/#attribute has few results then you can optimize this by slightly modifying your query to this: distinct-values(/path/to[#attribute]/#attribute). With this query the optimizer recognizes that there is an index it can use to get to the 'to' element, but then it still has the access the target document to retrieve the attribute for the #attribute step. This is precisely why it will only benefit cases where there are few hits: each hit will likely access a different data page.
What you also can do is access the keys in the index directly through the API: XhiveIndexIf.getKeys(). This will be very fast, but clearly this is not very user friendly (and should be done by the optimizer instead).
Clearly the optimizer could handle this. I will add it to the bug tracker.