Apache Solr - Wrong facet count

Are there any known issues in Solr 3.6 where, when grouping and a filter query (fq) are applied together with faceting, the facet counts come back wrong?
If I have a query:
..indent=on&wt=json&start=0&rows=500&q=*:*&group=true&group.truncate=true&group.ngroups=true&group.field=Id&facet=true&facet.mincount=1&facet.field={!ex=filed1}field1&facet.field={!ex=filed2}field2
If the user filters on field1, then I have the following query:
...indent=on&wt=json&start=0&rows=500&q=*:*&fq={!tag=dt}field1:value&group=true&group.truncate=true&group.ngroups=true&group.field=Id&facet=true&facet.mincount=1&facet.field={!ex=dt}field1&facet.field={!ex=dt}field2
I'm noticing that the facet counts are different in the results coming back from each query.
thanks,

There seem to be two problems here:
You are misspelling field1 as filed1 in your !ex local params.
You are using the !ex local parameter without a corresponding !tag parameter on the filter query.
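For illustration, a corrected pair of requests might look like this (a sketch only: with no fq in the first query, no exclusion is needed at all, while the filtered query tags the fq and excludes that same tag, as the second query in the question already does):
indent=on&wt=json&start=0&rows=500&q=*:*&group=true&group.truncate=true&group.ngroups=true&group.field=Id&facet=true&facet.mincount=1&facet.field=field1&facet.field=field2
indent=on&wt=json&start=0&rows=500&q=*:*&fq={!tag=dt}field1:value&group=true&group.truncate=true&group.ngroups=true&group.field=Id&facet=true&facet.mincount=1&facet.field={!ex=dt}field1&facet.field={!ex=dt}field2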

Related

Why doesn't the number of returned samples where name='keyword' match the number of observed samples with 'keyword' in the table?

I have a Postgres table whose header is [id(uuid), name(str), arg_name(str), measurements(list), run_id(uuid), parent_id(uuid)] with a total of 237K entries.
When I want to filter for specific measurements I can use 'name', but for the majority of entries in the table 'name' == 'arg_name', so both map to the same sample.
In my particular case I am interested in retrieving samples whose 'name'='TimeM12nS' and whose 'arg_name'='Time'. These two attributes point to the same samples when visually inspecting the table through PgAdmin; that is to say, all entries which have arg_name='Time' also have name='TimeM12nS' and vice versa.
It's obvious there's a problem, because the quantity of returned samples is not the same. I first noticed the problem using the Django ORM, but it is also present when I query the DB using PgAdmin.
SELECT *
FROM TableA
WHERE name='TimeM12nS'
returns 301 entries (name='TimeM12nS' and arg_name='Time' in all cases)
BUT the query:
SELECT *
FROM TableA
WHERE arg_name='Time'
returns 3945 (name='TimeM12nS' and arg_name='Time' in all cases)
I am completely stumped. Can anyone shed some light on what's happening here?
EDIT:
I should add that the query by 'arg_name' also returns the 301 entries that are returned when querying by 'name'.
First let me say thank you to everyone who pitched in ideas to solve this conundrum and especially to JGH for the solution (found in the comments of the original post).
Indeed the problem was an indexing issue. After re-indexing, the queries return the same number of entries (3945) as expected.
In Postgres, re-indexing a table can be achieved through pgAdmin by navigating to Databases > 'database_name' > Schemas > Tables, then right-clicking on the table name, selecting Maintenance, and pressing the REINDEX button,
or more simply by running the following command:
REINDEX TABLE table_name
Postgres REINDEX docs
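If only a single index is suspected, Postgres can also rebuild just that index; a minimal sketch, assuming a hypothetical index name:
REINDEX INDEX tablea_name_idx;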
Without access to the database, it's not possible to give a definitive answer. All I can provide is the next query I would use in this case.
SELECT COUNT(*), LENGTH(name), name, arg_name
FROM TableA
WHERE arg_name='Time'
GROUP BY name, arg_name;
This should show you any differences in the name column that you aren't able to see. The length of that string could also be informative.

How can I use data from more than one measurement in a single Grafana panel?

I am attempting to create a gauge panel in Grafana (version 6.6.2; presume that upgrading is a last resort, but possible if necessary for the purposes of this problem) that can represent the percentage of total available memory used by the Java Virtual Machine running a process of mine. The problem that I am running into is the following:
I have used Spring Boot Actuator's metrics and imported them into an InfluxDB database with Micrometer, but in the process it has stored the two values that I would like to use in my calculation in two different measurements: jvm_memory_used and jvm_memory_max.
My initial idea was to simply call a SELECT on both of the measurements to get the values I want, then divide "used" by "max" and multiply by 100 to get the percentage to display. Unfortunately I run into syntax errors when I try to do this manually, and I am unsure whether I can do it using Grafana's query builder.
I know that the syntax is incorrect, but I am not familiar enough with InfluxQL to know how to properly structure this query. Here is what I had tried:
(SELECT last("value")
FROM "jvm_memory_used"
WHERE ("area" = 'heap')
AND $timeFilter
GROUP BY time($__interval) fill(null)
) /
(SELECT last("value")
FROM "jvm_memory_max"
WHERE ("area" = 'heap')
AND $timeFilter
GROUP BY time($__interval) fill(null)
)
(The AND and GROUP BY are present as a result of the default values from Grafana's query builder, I am not sure whether they are necessary or not)
I'm assuming that my parentheses and division process is illegal, but I am not sure how to resolve it.
How can I divide these two values from separate tables?
EDIT: I have gotten slightly further but it seems that there is a new issue. I now have the following query that I am sending in:
SELECT 100 * (last("used") / sum("max")) AS "percentUsed"
FROM(
SELECT last("value") AS "used"
FROM "jvm_memory_used"
WHERE ("area" = 'heap')
AND $timeFilter
),(
SELECT last("value") AS "max"
FROM "jvm_memory_max"
WHERE ("area" = 'heap')
AND $timeFilter
)
GROUP BY time($__interval) fill(null)
and the result I get is this:
How can I now get this query to return only one gauge with data, instead of two with nulls?
I've accepted an answer that works for versions of Grafana after 7. If there are any other answers that arise that do not involve updating the version of Grafana, please provide them as well!
I am not particularly experienced with Influx, but since your question is how to use/combine two measurements (query results) in a Grafana panel, I can tell you about one approach:
You can use a transformation. That way, you can keep two separate queries. With the transformation mode binary operation you can simply divide one of your values by the other.
In your specific case, to display the result as a percentage, you can then use Percent (0.0-1.0) as the unit, and you should have accomplished your goal.
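For reference, a sketch of the two separate queries you would keep (taken from the question's builder output, one per measurement); the binary-operation transformation then divides the first result by the second:
SELECT last("value") FROM "jvm_memory_used" WHERE ("area" = 'heap') AND $timeFilter GROUP BY time($__interval) fill(null)
SELECT last("value") FROM "jvm_memory_max" WHERE ("area" = 'heap') AND $timeFilter GROUP BY time($__interval) fill(null)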

Passing reduced ES query results to SQL

This is a follow-up question to How to pass ElasticSearch query to hadoop.
Basically, I want to do a full-text-search in ElasticSearch and then pass the result set to SQL to run an aggregation query. Here's an example:
Let's say we search "Terminator" in a financials database that has 10B records. It has the following matches:
"Terminator" (1M results)
"Terminator 2" (10M results)
"XJ4-227" (1 result ==> Here "Terminator" is in the synopsis of the title)
Instead of passing back the 10+M ids, we'd pass back the following 'reduced query' --
...WHERE name in ('Terminator', 'Terminator 2', 'XJ4-227')
How could we write such an algorithm to reduce the ES result set to a smallest possible filter query that we could send back to SQL? Does ES have any sort of match-metadata that would help us in this?
If you know which "not analyzed" (keyword in 5.x) field is suitable for your use case, you can get its distinct values and the number of matches per value with a terms aggregation. sum_other_doc_count even tells you whether your search resulted in too many distinct values, since only the top N are returned.
Naturally you could run a terms aggregation on multiple fields and use the one with the fewest distinct values in SQL. And it could actually be more efficient to first run a cardinality aggregation to know which field to run the terms aggregation on.
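A minimal sketch of such a request, assuming hypothetical index and field names (financials, synopsis, and a not-analyzed name.keyword):
POST /financials/_search
{
  "size": 0,
  "query": { "match": { "synopsis": "Terminator" } },
  "aggs": {
    "names": { "terms": { "field": "name.keyword", "size": 100 } },
    "name_count": { "cardinality": { "field": "name.keyword" } }
  }
}
The bucket keys under names become the IN (...) list for SQL, sum_other_doc_count in the response shows how many matches fell outside the top 100 buckets, and name_count gives the distinct-value count up front.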
If your search is a pure filter, then its result should be cached, but please benchmark both solutions, as your ES cluster has quite a lot of data.

Using OrientDB 2.0.2, a LUCENE query does not seem to respect the "LIMIT n" operator

Using LUCENE inside of OrientDB seems to work fine, but there are many LUCENE-specific query parameters that I would ordinarily pass directly to LUCENE (normally through Solr). The first one I need to pass is the result limiter, such as SELECT * FROM V WHERE field LUCENE "Value" LIMIT 10.
If I use a value that only returns a few rows, I get the performance I expect, but if it matches a lot of rows, I need the limiter to get the result to return quickly. Otherwise I get a message in the console stating that The query would return more than 50000 records. Please consider using an index.
How do I pass additional LUCENE query parameters?
There's a known issue with the query parser which is in the process of being fixed; until then, the following workaround should help:
SELECT FROM (
SELECT * FROM V WHERE Field LUCENE 'Value'
) LIMIT 10
Alternatively, depending on which client libraries you're using you may be able to set a limit using the out-of-band query settings.

Alfresco: Lucene query by ID returns 2 rows

I'm using Alfresco 3.4d and imported some nodes as well as created a few with NodeService. Today I noticed that a Lucene query by ID does sometimes return two rows instead of just one. Not all nodes show this kind of behavior.
For example, when I execute the following Lucene query in the Alfresco Node Browser, I get the result shown below: ID:"workspace://SpacesStore/96c0cc27-cb8c-49cf-977d-a966e5c5e9ca"
How is it even possible that a query by ID can return more than one row? I tried rebuilding the Lucene index, but it didn't help. When I delete the node, the query returns 0 rows. What can I do to remove those "ghost" nodes from the query result?
I also ran across this problem and asked Alfresco support for advice. They told me that it is perfectly normal to have duplicate entries in the Lucene ID field, and that this is related to whether there is an ANCESTOR present or not. They recommended using the sys:node-uuid field when doing a Lucene search for the node's ID, e.g.:
#sys\:node-uuid:f13a21dd-b020-4c70-aa21-1a0e5c89d42b
I've seen this problem since Alfresco 3.2r, but maybe it is even older! I used the Lucene index viewer "Luke" (http://www.getopt.org/luke/) to check the index directly, and I saw that the corrupt index entry contains almost no information. As a workaround we combined our search with some basic information like node type or aspect. I will ask a colleague if he has more information about this.
I don't know exactly how this is possible, but in the code where you retrieve the nodes you could always check node.isDocument or node.isContainer to get a true result, or check whether the type is cm:content or cm:folder.
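A minimal sketch of that check in Alfresco server-side JavaScript, assuming the nodes come from search.luceneSearch (the query string is taken from the question):
var results = search.luceneSearch('ID:"workspace://SpacesStore/96c0cc27-cb8c-49cf-977d-a966e5c5e9ca"');
var nodes = [];
for (var i = 0; i < results.length; i++) {
    // keep only real documents and folders; a "ghost" entry fails both checks
    if (results[i].isDocument || results[i].isContainer) {
        nodes.push(results[i]);
    }
}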
You could also try to re-index, but I doubt that will be of any help.