How to use multiple count distinct in the same query with other columns in Druid SQL? - sql

I'm trying to use three projections in same query like below in a Druid environment:
select
__time,
count(distinct col1),
count(distinct case when (condition1 and condition2 then (concat(col2,TIME_FORMAT(__time))) else 0 end )
from table
where condition3
GROUP BY __time
But instead I get an error saying - Unknown exception / Cannot build plan for query
It seems to work perfectly fine when I put just one count(distinct) in the query.
How can this be resolved?

As a workaround, you can do multiple subqueries and join them. Something like:
SELECT x.__time, x.delete_cnt, y.added_cnt
FROM
(
SELECT FLOOR(__time to HOUR) __time, count(distinct deleted) delete_cnt
FROM wikipedia
GROUP BY 1
)x
JOIN
(
SELECT FLOOR(__time to HOUR) __time, count( distinct added) added_cnt
FROM wikipedia
GROUP BY 1
)y ON x.__time = y.__time

As the Druid documentation points out:
COUNT(DISTINCT expr) Counts distinct values of expr, which can be string, numeric, or hyperUnique. By default this is approximate, using a variant of HyperLogLog. To get exact counts set "useApproximateCountDistinct" to "false". If you do this, expr must be string or numeric, since exact counts are not possible using hyperUnique columns. See also APPROX_COUNT_DISTINCT(expr). In exact mode, only one distinct count per query is permitted.
So this is a Druid limitation: you either need to disable exact mode, or else limit yourself to one distinct count per query.
On a side note, other databases typically do not have this limitation. Apache Druid is designed for high performance real-time analytics, and as a result, its implementation of SQL has some restrictions. Internally, Druid uses a JSON-based query language. The SQL interface is powered by a parser and planner based on Apache Calcitea, which translates SQL into native Druid queries.

Related

Query working in MySQL but trowing error in Oracle can someone please explain. and tell me how to rewrite this same query in oracle to avoid error [duplicate]

I have a query
SELECT COUNT(*) AS "CNT",
imei
FROM devices
which executes just fine. I want to further restrict the query with a WHERE statement. The (humanly) logical next step is to modify the query followingly:
SELECT COUNT(*) AS "CNT",
imei
FROM devices
WHERE CNT > 1
However, this results in a error message ORA-00904: "CNT": invalid identifier. For some reason, wrapping the query in another query produces the desired result:
SELECT *
FROM (SELECT COUNT(*) AS "CNT",
imei
FROM devices
GROUP BY imei)
WHERE CNT > 1
Why does Oracle not recognize the alias "CNT" in the second query?
Because the documentation says it won't:
Specify an alias for the column
expression. Oracle Database will use
this alias in the column heading of
the result set. The AS keyword is
optional. The alias effectively
renames the select list item for the
duration of the query. The alias can
be used in the order_by_clause but not
other clauses in the query.
However, when you have an inner select, that is like creating an inline view where the column aliases take effect, so you are able to use that in the outer level.
The simple answer is that the AS clause defines what the column will be called in the result, which is a different scope than the query itself.
In your example, using the HAVING clause would work best:
SELECT COUNT(*) AS "CNT",
imei
FROM devices
GROUP BY imei
HAVING COUNT(*) > 1
To summarize, this little gem explains:
10 Easy Steps to a Complete Understanding of SQL
A common source of confusion is the simple fact that SQL syntax
elements are not ordered in the way they are executed. The lexical
ordering is:
SELECT [ DISTINCT ]
FROM
WHERE
GROUP BY
HAVING
UNION
ORDER BY
For simplicity, not all SQL clauses are listed. This lexical ordering
differs fundamentally from the logical order, i.e. from the order of
execution:
FROM
WHERE
GROUP BY
HAVING
SELECT
DISTINCT
UNION
ORDER BY
As a consequence, anything that you label using "AS" will only be available once the WHERE, HAVING and GROUP BY have already been performed.
I would imagine because the alias is not assigned to the result column until after the WHERE clause has been processed and the data generated. Is Oracle different from other DBMSs in this behaviour?

HAVING clause in SQL

I'm trying to understand why some DBMS systems allow the below while the most don't. Assume table X has attributes name, id, data
SELECT id, count(*) as count
FROM TABLE X
GROUP BY id
HAVING count > X.data
In most databases, it's illegal to use non-grouping or non-aggregate field in HAVING clause conditional statement. Some systems seem to allow the same. Would you be able to explain why they would have allowed the HAVING condition to use an attribute which may not have a unique value throughout the group?
Referred to database documentation of DB2, PostgreSQL, MySQL
SELECT id, count(*) as count
FROM TABLE X
GROUP BY id
HAVING count > X.data
The first issue with this query:
SELECT name, count(*) as count
FROM TABLE X
GROUP BY id
HAVING count > X.data;
is that you have name in the SELECT but id in GROUP BY. Because you are grouping by id, I assume that there are multiple values in X. Hence, this is incorrect syntax.
There are some cases where this is allowed by the standard -- and even in some databases. However, that requires that id be unique in the table (the technical jargon is that the columns in the SELECT are functionally dependent on the columns in the GROUP BY).
The next issue is the use of count in the HAVING clause. This is fine conceptually. However, not all databases may support it.
Finally, you have x.data in the HAVING clause. If that is functionally dependent on a subset of the GROUP BY keys, then the usage conforms with the standard. However, that is unlikely in this case.
The standard is quite explicit that x.data is out-of-scope after the aggregation. So, this should result in a syntax error -- and it does in almost all databases.
There are a dwindling number of databases that support this construct -- happily MySQL no longer supports it by default. In such databases, they take an arbitrary and indetermine value of data from a row in each group and use that for the comparison.

Refer to aggregate result in Amazon Redshift query?

In other postgresql DBMSes (e.g., Netezza) I can do something like this without errors:
select store_id
,sum(sales) as total_sales
,count(distinct(txn_id)) as d_txns
,total_sales/d_txns as avg_basket
from my_tlog
group by 1
I.e., I can use aggregate values within the same SQL query that defined them.
However, when I go to do the same sort of thing on Amazon Redshift, I get the error "Column total_sales does not exist..." Which it doesn't, that's correct; it's not really a column. But is there a way to preserve this idiom, rather than restructuring the query? I ask because there would be a lot of code to change.
Thanks.
You simply need to repeat the expressions (or use a subquery or CTE):
select store_id,
sum(sales) as total_sales,
count(distinct txn_id) as d_txns,
sum(sales)/count(distinct txn_id) as avg_basket
from my_tlog
group by store_id;
Most databases do not support the re-use of column aliases in the select. The reason is twofold (at least):
The designers of the database engine do not want to specify the order of processing of expressions in the select.
There is ambiguity when a column alias is also a valid column in a table in the from clause.
Personally I loove the construct in netezza. This is compact and the syntax is not ambiguous: any 'dublicate' column names will default to (new) alias in the current query, and if you need to reference the column of the underlying tables, simply put the tablename in front of the column. The above example would become:
select store_id
,sum(sales) as sales ---- dublicate name
,count(distinct(txn_id)) as d_txns
,my_tlog.sales/d_txns as avg_basket --- this illustrates but may not make sense
from my_tlog
group by 1
I recently moved away from sql server, and on that database I used a construct like this to avoid repeating the expressions:
Select *, total_sales/d_txns as avg_basket
From (
select store_id
,sum(sales) as total_sales
,count(distinct(txn_id)) as d_txns
from my_tlog
group by 1
)x
Most (if not all) databases will support this construct, and have done so for 10 years or more

counting rows in select clause with DB2

I would like to query a DB2 table and get all the results of a query in addition to all of the rows returned by the select statement in a separate column.
E.g., if the table contains columns 'id' and 'user_id', assuming 100 rows, the result of the query would appear in this format: (id) | (user_id) | 100.
I do not wish to use a 'group by' clause in the query. (Just in case you are confused about what i am asking) Also, I could not find an example here: http://mysite.verizon.net/Graeme_Birchall/cookbook/DB2V97CK.PDF.
Also, if there is a more efficient way of getting both these results (values + count), I would welcome any ideas. My environment uses zend framework 1.x, which does not have an ODBC adapter for DB2. (See issue http://framework.zend.com/issues/browse/ZF-905.)
If I understand what you are asking for, then the answer should be
select t.*, g.tally
from mytable t,
(select count(*) as tally
from mytable
) as g;
If this is not what you want, then please give an actual example of desired output, supposing there are 3 to 5 records, so that we can see exactly what you want.
You would use window/analytic functions for this:
select t.*, count(*) over() as NumRows
from table t;
This will work for whatever kind of query you have.

HQL count from multiple tables

I would like to query my database using a HQL query to retrieve the total number of rows having a MY_DATE greater than SOME_DATE.
So far, I have come up with a native Oracle query to get that result, but I am stuck when writing in HQL:
SELECT
(
SELECT COUNT(MY_DATE)
FROM Table1
WHERE MY_DATE >= TO_DATE('2011-09-07','yyyy-MM-dd')
)
+
(
SELECT COUNT(MY_DATE)
FROM Table2
WHERE MY_DATE >= TO_DATE('2011-09-07','yyyy-MM-dd')
)
AS total
I actually have more than 2 tables but I keep having an IllegalArgumentException (unexpected end of subtree).
The working native Oracle basically ends with FROM dual.
What HQL query should I use to get the total number of rows I want?
First of, if you have a working SQL query, why not just use that instead of trying to translate it to HQL? Since you're returning a single scalar in the first place, it's not like you need anything HQL provides (e.g. dependent entities, etc...)
Secondly, do you have 'dual' mapped in Hibernate? :-) If not, how exactly are you planning on translating that?
That said, "unexpected end of subtree" error is usually caused by idiosyncrasies of Hibernate's AST parser. A commonly used workaround is to prefix the expression with '0 +':
select 0 + (
... nested select #1 ...
) + (
... nested select #2 ...
) as total
from <from what exactly?>