Filter on Count Aggregation - data-visualization

I have been looking for a solution on the internet for quite a while and I'm still not sure whether it is possible in Kibana or not.
Suppose I apply a filter on a term and it gives me the count of the respective terms, but I want the results to show only those terms where the count equals a specific value.
Being more specific:
I want to find out which tills are the busiest (most transactions). Currently, when I apply a filter on term and count, it shows me all the tills with their respective transaction counts. What I want is to show only those tills where the count is equal to, say, 10.
In other words, functionality similar to the HAVING clause in a relational DBMS.
I have found a lot of workarounds for the same use case, but I'm looking for a proper solution.

I hope I understand what you're asking. I think you can search the field in question with the proper parameters. For example, for the field 'field_name' with more than 10 hits, try the following Lucene query:
field_name:(*) AND count:[10 TO *]
For the exact result of field_name with count=10, query:
field_name:(*) AND count:[10 TO 10]
Let me know if this was what you were looking for!
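Note that the queries above assume the documents carry an actual count field. If the goal is instead to filter the aggregation buckets themselves (a true HAVING), Elasticsearch 5.x and later offer the bucket_selector pipeline aggregation, which (at least in older Kibana versions) is not exposed in the visualization UI but can be run directly against the index. A minimal sketch, where the index name transactions and the keyword field till_id are placeholders for your own names:
POST /transactions/_search
{
  "size": 0,
  "aggs": {
    "per_till": {
      "terms": { "field": "till_id" },
      "aggs": {
        "only_count_10": {
          "bucket_selector": {
            "buckets_path": { "txCount": "_count" },
            "script": "params.txCount == 10"
          }
        }
      }
    }
  }
}
Only buckets whose document count equals 10 survive; change the script to params.txCount >= 10 for a threshold instead of an exact match.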

Related

Passing reduced ES query results to SQL

This is a follow-up question to How to pass ElasticSearch query to hadoop.
Basically, I want to do a full-text-search in ElasticSearch and then pass the result set to SQL to run an aggregation query. Here's an example:
Let's say we search "Terminator" in a financials database that has 10B records. It has the following matches:
"Terminator" (1M results)
"Terminator 2" (10M results)
"XJ4-227" (1 result ==> Here "Terminator" is in the synopsis of the title)
Instead of passing back the 10+M ids, we'd pass back the following 'reduced query' --
...WHERE name in ('Terminator', 'Terminator 2', 'XJ4-227')
How could we write such an algorithm to reduce the ES result set to a smallest possible filter query that we could send back to SQL? Does ES have any sort of match-metadata that would help us in this?
If you know which "not analyzed" (keyword as of 5.x) field would be suitable for your use case, you could get its distinct values and the number of matches per value with a terms aggregation. sum_other_doc_count even tells you if your search resulted in too many distinct values, as only the top N are returned.
Naturally you could run the terms aggregation on multiple fields and use the one with the fewest distinct values in SQL. It could actually be more efficient to first run a cardinality aggregation to decide which field to run the terms aggregation on.
If your search is a pure filter, then its result should be cached, but please benchmark both solutions, as your ES cluster has quite a lot of data.
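For the "Terminator" example, the request might look like this (a sketch; the index name financials and the field names synopsis and name are assumptions, and name would need to be a not-analyzed/keyword field):
POST /financials/_search
{
  "size": 0,
  "query": { "match": { "synopsis": "Terminator" } },
  "aggs": {
    "distinct_names": {
      "terms": { "field": "name", "size": 100 }
    }
  }
}
The bucket keys become the IN-list for SQL; a non-zero sum_other_doc_count in the response means more than 100 distinct names matched, so the reduced query would be incomplete.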

Query to Find Adjacent Date Records

There exists in my database a page_history table; the idea is that whenever a record in the page table is changed, that record's old values are stored in the history table.
My job now is to find occasions in which a record was changed, and retrieve the pre- and post-conditions of that change. Specifically, I want to know when a page changed groups, and what groups were involved in the change. The query I have below can find these instances, but with the use of the min function, I can only get back the values that match between the two records:
select page_id,
       original_group,
       min(created2) change_date
  from (select h.page_id,
               h.group_id original_group,
               i.group_id new_group,
               h.created_dttm created1,
               i.created_dttm created2
          from page_history h,
               page_history i
         where h.page_id = i.page_id
           and h.created_dttm < i.created_dttm
           and h.group_id != i.group_id)
 group by page_id, original_group, created1
 order by page_id
When I try to get, say, any details of the second record, like new_group, I'm hit with an ORA-00979: not a GROUP BY expression error. I don't want to group by new_group, though, because that would destroy the logic (I think it would then report every pair of groups a page ever moved between, regardless of any changes to other groups in between).
My question, then, is how can I modify this query, or go about writing a new one, that achieves a similar end, but with the added availability of columns that do not match between the two records? In essence, how can I find that min record without sacrificing all the other columns I'm not trying to compare? I don't exactly need a complete answer; any suggestions that point me in the right direction would be appreciated.
I use PL/SQL Developer, and it looks like version 11.2.0.2.0 of Oracle.
EDIT: I have found a solution. It's not pretty, and I'd still like to see some alternatives, but if helping me out would threaten to explode your brain, I would advise relocating to an easier question.
Without seeing your table structure it's hard to rewrite the query, but when you have a MIN function used like that, it invariably seems better to put it into a separate subselect and then compare against its result.
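A sketch of that shape, using the table and column names from the question: the correlated subquery pins i to the row immediately following h for the same page, so the outer select is free to return columns from either record without any GROUP BY.
select h.page_id,
       h.group_id     original_group,
       i.group_id     new_group,
       i.created_dttm change_date
  from page_history h,
       page_history i
 where h.page_id = i.page_id
   and i.created_dttm = (select min(i2.created_dttm)
                           from page_history i2
                          where i2.page_id = h.page_id
                            and i2.created_dttm > h.created_dttm)
   and h.group_id != i.group_id
 order by h.page_id, change_date;
Since you are on Oracle 11.2, the LAG/LEAD analytic functions could express the "next row for the same page" idea even more directly, if you want an alternative to compare against.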

First time MapReduce: I need to combine a distinct and count, please help

I have a collection and need to get a distinct count from the data set in MongoDB
db['2011-05-29'].distinct("plugins.HTTPServer.string");
returns all the distinct names for the key
How would I go about getting a count for every occurrence of a particular string?
Example:
Apache 29172
IIS 3932
I've looked at some MapReduce examples but can't seem to get them to work right, as my counts add up to more than the actual number of items in the collection.
db['2011-04-13-1pm-scan'].distinct("plugins.HTTPServer.string").length;
returns the number of distinct items in that key.
However, I want the key value and count for each, as above.
Your question is exactly what the wordcount demo application does.
It's part of the standard set of examples shipped with Hadoop, and it's also explained in great detail on these pages:
http://wiki.apache.org/hadoop/WordCount
http://developer.yahoo.com/hadoop/tutorial/module4.html#wordcount
HTH
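For completeness, the same word-count pattern can also be run inside MongoDB itself with mapReduce (a sketch, not from the answer above; the collection and field path are taken from the question). The reduce function must be safe to re-run on partial sums, which is the usual cause of counts adding up to more than the collection size:
var mapFn = function () {
    // Emit one count per document that actually has the field.
    if (this.plugins && this.plugins.HTTPServer && this.plugins.HTTPServer.string) {
        emit(this.plugins.HTTPServer.string, 1);
    }
};

var reduceFn = function (key, values) {
    // values may hold raw 1s or partial sums; a plain sum handles both.
    return Array.sum(values);
};

db['2011-05-29'].mapReduce(mapFn, reduceFn, { out: { inline: 1 } });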

Solr: How can I get all documents ordered by score with a list of keywords?

I have a Solr 3.1 database containing emails with two fields:
datetime
text
For the query I have two parameters:
date of today
keyword array("important thing", "important too", "not so important, but more than average")
Is it possible to create a query to
get ALL documents of this day AND
sort them by relevance, so that the email which contains the most of my keywords (important things) scores best?
The part with the date is not very complicated:
fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
I know that you can boost the keywords this way:
q=text:"first keyword"^5 OR text:"second one"^2 OR text:"minus scoring"^0.5 OR text:"*"
But how do I use the keywords only to sort this list and get ALL entries back, instead of doing a real query and getting only a few entries back?
Thanks for help!
You need to specify your terms in the main query and then change your date query to be a filter query on those results by adding the following:
fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
So you should have something like this:
q=<terms go here>&fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
Edit: A little more about filter queries (as suggested by rfreak).
From Solr Wiki - FilterQuery Guidance - "Now, what is a filter query? It is simply a part of a query that is factored out for special treatment. This is achieved in Solr by specifying it using the fq (filter query) parameter instead of the q (main query) parameter. The same result could be achieved leaving that query part in the main query. The difference will be in query efficiency. That's because the result of a filter query is cached and then used to filter a primary query result using set intersection."
These should be sorted by relevance score already; that is just the default behavior of Solr. You can see the score by adding that field:
fl=*,score
If you use the Full Interface for Make A Query on the Admin Interface of your Solr installation at http://<yourserver:port#>/<instancename>/admin/form.jsp you will see where you can specify the filter query, fields, and other options. You can check out the Solr Wiki for more details on the options and how they are used.
I hope that this helps you.
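Putting the pieces together, the full request for the example keywords might look like this (a sketch; the boosts and the 2011-05-29 date are placeholders, and the trailing *:* match-all clause is what keeps ALL of the day's documents in the result while the boosted clauses drive the ranking):
q=text:"important thing"^5 OR text:"important too"^2 OR text:"not so important, but more than average"^0.5 OR *:*&fq=datetime:[2011-05-29T00:00:00.000Z TO 2011-05-29T23:59:59.999Z]&fl=*,score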
You could do a first query for:
fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
which gives all documents that match the range. Then, use CachingWrapperFilter for the second query to find documents in the DocSet from the first query which have at least one keyword. They will be relevance-ranked per tf-idf. You may want to use ConstantScoreQuery for the first to get the list of matching docids in the fastest possible way.
Sorting by relevance is the default behavior in Solr/Lucene.
If your results are unsatisfactory, try putting the keywords in quotes.
// Edit: Following the answer from Paige Cook, use something like this:
q="important thing"&fq=datetime:[YY-MM-DDT00:00:00.000Z TO YY-MM-DDT23:59:59.999Z]
// 2nd update: thinking more about this answer, quotes are not a good idea, because in that case you will only receive "important thing" mails, but no "important too" ones.
The point is: which keywords you are using. Searching for -- important thing -- results in the highest scores for "important thing" mails, but Lucene does not know how to score "important too" or "not so important, but more than average" in relation to your keywords.
Another idea would be searching only for "important". But then the field values "important thing" and "important too" give nearly the same score, because in each case the single matched keyword ("important") makes up about the same share of the field value.
So you probably have to change your keywords. It could work after changing "important too" into "also an important mail": the ratio of the search word "important" to the length of the field value then scores the shortest mail description highest.

Can scalar functions be applied before filtering when executing a SQL Statement?

I suppose I have always naively assumed that scalar functions in the select part of a SQL query will only get applied to the rows that meet all the criteria of the where clause.
Today I was debugging some code from a vendor and had that assumption challenged. The only reason I can think of for this code failing is that the Substring() function is getting called on data that should have been filtered out by the WHERE clause. It appears that the substring call is being applied before the filtering happens, and so the query is failing.
Here is an example of what I mean. Let's say we have two tables, each with 2 columns and having 2 rows and 1 row respectively. The first column in each is just an id. NAME is just a string, and NAME_LENGTH tells us how many characters are in the name with the same ID. Note that only names with more than one character have a corresponding row in the LONG_NAMES table.
NAMES: ID, NAME
1, "Peter"
2, "X"
LONG_NAMES: ID, NAME_LENGTH
1, 5
If I want a query to print each name with the last 3 letters cut off, I might first try something like this (assuming SQL Server syntax for now):
SELECT substring(NAME,1,len(NAME)-3)
FROM NAMES;
I would soon find out that this gives me an error, because when it reaches "X" it will try using a negative number in the substring call, and it will fail.
The way my vendor decided to solve this was by filtering out rows where the strings were too short for the len - 3 query to work. He did it by joining to another table:
SELECT substring(NAMES.NAME,1,len(NAMES.NAME)-3)
FROM NAMES
INNER JOIN LONG_NAMES
ON NAMES.ID = LONG_NAMES.ID;
At first glance, this query looks like it might work. The join condition will eliminate any rows that have NAME fields short enough for the substring call to fail.
However, from what I can observe, SQL Server will sometimes try to calculate the substring expression for everything in the table, and then apply the join to filter out rows. Is this supposed to happen this way? Is there a documented order of operations where I can find out when certain things will happen? Is it specific to a particular database engine or part of the SQL standard? If I decided to include some predicate on my NAMES table to filter out short names (like len(NAME) > 3), could SQL Server also choose to apply that after trying to apply the substring? If so, then is the only safe way to do a substring to wrap it in a "case when" construct in the select?
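For concreteness, the guarded version alluded to at the end of the question might look like this (a sketch in SQL Server syntax; CASE is documented to evaluate its conditions in order for scalar expressions, with some documented exceptions around aggregates, so the SUBSTRING branch is not reached for short names):
SELECT CASE
           WHEN LEN(NAME) > 3
           THEN SUBSTRING(NAME, 1, LEN(NAME) - 3)
           ELSE NULL
       END AS shortened_name
FROM NAMES;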
Martin gave this link that pretty much explains what is going on - the query optimizer has free rein to reorder things however it likes. I am including this as an answer so I can accept something. Martin, if you create an answer with your link in it I will gladly accept that instead of this one.
I do want to leave my question here because I think it is a tricky one to search for, and my particular phrasing of the issue may be easier for someone else to find in the future.
The link: TSQL divide by zero encountered despite no columns containing 0
EDIT: As more responses have come in, I am again confused. It does not seem clear yet when exactly the optimizer is allowed to evaluate things in the select clause. I guess I'll have to go find the SQL standard myself and see if I can make sense of it.
Joe Celko, who helped write early SQL standards, has posted something similar to this several times in various USENET newsgroups. (I'm skipping over the clauses that don't apply to your SELECT statement.) He usually said something like "This is how statements are supposed to act like they work". In other words, SQL implementations should behave exactly as if they did these steps, without actually being required to perform each of them.
1. Build a working table from all of the table constructors in the FROM clause.
2. Remove from the working table those rows that do not satisfy the WHERE clause.
3. Construct the expressions in the SELECT clause against the working table.
So, following this, no SQL DBMS should act as if it evaluates functions in the SELECT clause before it acts as if it has applied the WHERE clause.
In a recent posting, Joe expands the steps to include CTEs.
CJ Date and Hugh Darwen say essentially the same thing in chapter 11 ("Table Expressions") of their book A Guide to the SQL Standard. They also note that this chapter corresponds to the "Query Specification" section (sections?) in the SQL standards.
You are thinking about something called the query execution plan. It's based on query optimization rules, indexes, temporary buffers and execution-time statistics. If you are using SQL Server Management Studio, there is a toolbar above your query editor where you can look at the estimated execution plan; it shows how your query will be transformed to gain some speed. So if you just used your NAMES table and it is in the buffer, the engine might first try to subquery your data and then join it with the other table.
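If you want to see what the engine actually decided without running the query, SQL Server can also print the plan as text (a sketch; the exact output format varies by version):
-- Ask SQL Server to return the estimated plan instead of executing.
SET SHOWPLAN_TEXT ON;
GO
SELECT SUBSTRING(NAMES.NAME, 1, LEN(NAMES.NAME) - 3)
FROM NAMES
INNER JOIN LONG_NAMES
    ON NAMES.ID = LONG_NAMES.ID;
GO
SET SHOWPLAN_TEXT OFF;
GO
The plan shows whether the Compute Scalar step sits before or after the join, which is exactly the reordering discussed above.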