Druid SQL query: count distinct values of a multi-value field across records

Is there a way in Druid SQL to do a distinct count of a multi-value field across rows, where each value is counted only once per array? E.g. suppose I have the records below:
shippingSpeed
[standard, standard, standard, ground]
[standard,ground]
[ground,ground]
Expected Result:
standard 2
ground 3
I tried the query below, but it counts every occurrence inside each array and then totals those counts across all records:
SELECT
"shippingSpeed", count(*)
FROM orders
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '30' DAY
GROUP BY 1
ORDER BY 2 ASC
Result:
standard 4
ground 4

This is because GROUP BY on a multi-value column UNNESTs the array into multiple rows, so each item in an array is correctly counted as its own instance.
If you want to remove duplicates, define "shippingSpeed" at ingestion time with the property:
"multiValueHandling": "SORTED_SET"
You can find more details here: https://druid.apache.org/docs/latest/querying/multi-value-dimensions.html#overview
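In a native ingestion spec, that property goes on the dimension definition. A minimal sketch (only the shippingSpeed entry comes from this question; the surrounding dimensionsSpec is assumed and should be adapted to your schema):
"dimensionsSpec": {
  "dimensions": [
    {
      "type": "string",
      "name": "shippingSpeed",
      "multiValueHandling": "SORTED_SET"
    }
  ]
}
With SORTED_SET, duplicate values are dropped within each row at ingestion time, so your original GROUP BY query would then count each value at most once per record.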

Okay, there are some undocumented functions that you can use.
SELECT
ARRAY_SET_ADD(MV_TO_ARRAY("shippingSpeed"), NULL), COUNT(*)
FROM orders
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '30' DAY
GROUP BY 1
ORDER BY 2 ASC
which might work.
MV_TO_ARRAY -> converts the multi-value column to an array
ARRAY_SET_ADD -> creates a set out of the arrays; since we do not have two arrays, the second argument is null.
But what @sergio said might be the easiest option.

Related

Trying to UNNEST timestamp array field, but need to GROUP BY

I have a repeated field of type TIMESTAMP in a BigQuery table that I am attempting to UNNEST. However, I must group or aggregate the field. I am not knowledgeable in SQL, so I could use some help. The code snippet is part of a larger query that works when substituting subscription.future_renewal_dates with GENERATE_TIMESTAMP_ARRAY.
subscription.future_renewal_dates is ARRAY<TIMESTAMP>
The TIMESTAMP array is unique per subscription (recurring subscriptions) and cannot be generated using GENERATE_TIMESTAMP_ARRAY, so I have to generate the dates before uploading to BigQuery; a UDF is too much.
SELECT
subscription.amount AS subscription_amount,
subscription.status AS subscription_status,
"1" AS analytic_name,
ARRAY (
SELECT
AS STRUCT FORMAT_TIMESTAMP("%x", days) AS type_value, subscription.amount AS analytic_name
FROM
UNNEST(subscription.future_renewal_dates) as days
WHERE
(
days >= TIMESTAMP("2019-06-05T19:30:02+00:00")
AND days <= TIMESTAMP("2019-08-01T03:59:59+00:00")
)
) AS forecast
FROM
`mydataset.subscription` AS subscription
GROUP BY
subscription_amount,
subscription_status,
analytic_name
I cannot figure out how to successfully unnest subscription.future_renewal_dates without the error 'UNNEST expression references subscription.future_renewal_dates which is neither grouped nor aggregated'.
When you use GROUP BY, all expressions and columns in the SELECT (except those in the GROUP BY list) must be used with some aggregation function, which you clearly do not have. So you need to decide what it is you are actually trying to achieve with that grouping.
Below is the option I think you had in mind. It may not be exactly what you want, but at least it gives you an idea of how to fix the query:
SELECT
subscription.amount AS subscription_amount,
subscription.status AS subscription_status,
"1" AS analytic_name,
ARRAY_CONCAT_AGG( ARRAY (
SELECT
AS STRUCT FORMAT_TIMESTAMP("%x", days) AS type_value, subscription.amount AS analytic_name
FROM
UNNEST(subscription.future_renewal_dates) as days
WHERE
(
days >= TIMESTAMP("2019-06-05T19:30:02+00:00")
AND days <= TIMESTAMP("2019-08-01T03:59:59+00:00")
)
)) AS forecast
FROM
`mydataset.subscription` AS subscription
GROUP BY
subscription_amount,
subscription_status,
analytic_name

SELECT MIN from a subset of data obtained through GROUP BY

There is a database in place with hourly timeseries data, where every row in the DB represents one hour. Example:
TIMESERIES TABLE
id  date_and_time     entry_category
1   2017/01/20 12:00  type_1
2   2017/01/20 13:00  type_1
3   2017/01/20 12:00  type_2
4   2017/01/20 12:00  type_3
First I used the GROUP BY statement to find the latest date and time for each type of entry category:
SELECT MAX(date_and_time), entry_category
FROM timeseries_table
GROUP BY entry_category;
Now, however, I want to find the date and time that is the LEAST RECENT among the datetimes obtained with the query listed above. I will need to use SELECT MIN(date_and_time) somehow, but how do I let SQL know I want to treat the output of my previous query as a "new table" to apply a new SELECT query on? The output of my total query should be a single value: in the case of the sample displayed above, date_and_time = 2017/01/20 12:00.
I've tried using aliases, but they don't seem to do the trick; they only rename existing columns or tables (or I'm misusing them). There are many questions out there about listing the MAX or MIN row per group (e.g. https://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/ or Select max value of each group), which is what I have already achieved, but now I want to work on this list of obtained datetimes. My database structure is very simple, but I lack the knowledge to string these queries together.
Thanks, cheers!
You can use your first query as a subquery; that is similar to what you describe as using the first query's output as the input for the second query. Here you will get the required single-row output with the minimum date.
SELECT MIN(date_and_time)
FROM (SELECT MAX(date_and_time) AS date_and_time, entry_category
      FROM timeseries_table
      GROUP BY entry_category) a;
Is this what you want?
SELECT TOP 1 MAX(date_and_time), entry_category
FROM timeseries_table
GROUP BY entry_category
ORDER BY MAX(date_and_time) ASC;
If there are ties for the earliest date, this returns an arbitrary one of them. If you want the result to be deterministic, include an additional sort key:
SELECT TOP 1 MAX(date_and_time), entry_category
FROM timeseries_table
GROUP BY entry_category
ORDER BY MAX(date_and_time) ASC, entry_category;
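TOP is SQL Server / Access syntax. If your database does not support it (the question does not name one), the same idea can be written with LIMIT; a sketch assuming a PostgreSQL- or MySQL-style dialect:
SELECT MAX(date_and_time) AS date_and_time, entry_category
FROM timeseries_table
GROUP BY entry_category
ORDER BY MAX(date_and_time) ASC, entry_category  -- extra sort key makes the result deterministic
LIMIT 1;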

SQL SELECT that excludes rows with any of a list of values?

I have found many Questions and Answers about a SELECT excluding rows with a value "NOT IN" a sub-query (such as this). But how can I exclude a list of values rather than a sub-query?
I want to search for rows whose timestamp is within a range but exclude some specific date-times. In English, that would be:
Select all the ORDER rows recorded between noon and 2 PM today, except for the ones at these times: Today 12:34, Today 12:55, and Today 13:05.
The SQL might be something like:
SELECT *
FROM order_
WHERE recorded_ >= ?
AND recorded_ < ?
AND recorded_ NOT IN ( list of date-times… )
;
So two parts to this Question:
How to write the SQL to exclude rows having any of a list of values?
How to set an arbitrary number of arguments to a PreparedStatement in JDBC? (The arbitrary number being the count of the list of values to be excluded.)
Pass array
A fast and NULL-safe alternative would be a LEFT JOIN to an unnested array:
SELECT o.*
FROM order_ o
LEFT JOIN unnest(?::timestamp[]) x(recorded_) USING (recorded_)
WHERE o.recorded_ >= ?
AND o.recorded_ < ?
AND x.recorded_ IS NULL;
This way you can prepare a single statement and pass any number of timestamps as array.
The explicit cast ::timestamp[] is only necessary if you cannot type your parameters (like you can in prepared statements). The array is passed as a single text (or timestamp[]) literal:
'{2015-07-09 12:34, 2015-07-09 12:55, 2015-07-09 13:05}', ...
Or put CURRENT_DATE into the query and pass the times to add, as outlined by @Drake. More about adding a time / interval to a date:
How to get the end of a day?
Pass individual values
You could also use a VALUES expression - or any other method to create an ad-hoc table of values.
SELECT o.*
FROM order_ o
LEFT JOIN (VALUES (?::timestamp), (?), (?) ) x(recorded_)
USING (recorded_)
WHERE o.recorded_ >= ?
AND o.recorded_ < ?
AND x.recorded_ IS NULL;
And pass:
'2015-07-09 12:34', '2015-07-09 12:55', '2015-07-09 13:05', ...
This way you can only pass a predetermined number of timestamps.
Asides
For up to 100 parameters (or your setting of max_function_args), you could use a server-side function with a VARIADIC parameter:
Return rows matching elements of input array in plpgsql function
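A minimal sketch of that approach (the function name f_orders_excluding and its exact signature are my own assumptions, not taken from the linked answer):
CREATE FUNCTION f_orders_excluding(_from timestamp, _to timestamp,
                                   VARIADIC _exclude timestamp[])
  RETURNS SETOF order_
  LANGUAGE sql AS
$func$
SELECT o.*
FROM   order_ o
WHERE  o.recorded_ >= _from
AND    o.recorded_ <  _to
AND    o.recorded_ <> ALL (_exclude);  -- exclude every timestamp in the variadic list
$func$;

-- pass as many timestamps as needed (up to max_function_args):
SELECT * FROM f_orders_excluding('2015-07-09 12:00', '2015-07-09 14:00',
                                 '2015-07-09 12:34', '2015-07-09 12:55', '2015-07-09 13:05');
Note that unlike the LEFT JOIN variant above, <> ALL is not NULL-safe, so this assumes the exclusion list contains no NULLs.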
I know that you are aware of timestamp characteristics, but for the general public: equality matches can be tricky for timestamps, since those can have up to 6 fractional digits for seconds and you need to match exactly.
Related
Select rows which are not present in other table
Optimizing a Postgres query with a large IN
SELECT *
FROM order_
WHERE recorded_ BETWEEN CURRENT_DATE + time '12:00' AND CURRENT_DATE + time '14:00'
AND recorded_ NOT IN (CURRENT_DATE + time '12:34',
CURRENT_DATE + time '12:55',
CURRENT_DATE + time '13:05')
;

SQL query to identify 0 AFTER a 1

Let's say I have two columns: Date and Indicator
Usually the indicator goes from 0 to 1 (when the data is sorted by date) and I want to be able to identify if it goes from 1 to 0 instead. Is there an easy way to do this with SQL?
I am already aggregating other fields in the same table. If I can add this as another aggregation (e.g. without using a separate WHERE clause or passing over the data a second time), that would be pretty awesome.
This is the phenomena I want to catch:
Date    Indicator
1/5/01  0
1/4/01  0
1/3/01  1
1/2/01  1
1/1/01  0
This isn't a Teradata-specific answer, but it can be done in normal SQL.
Assuming that the sequence is already 'complete' and x(n+1) can be derived from x(n), such as when the dates are sequential and all present:
SELECT curr.date -- the 0 on the day following the 1
FROM r curr
JOIN r prev
-- join each day with the previous day
ON curr.date = dateadd(d, 1, prev.date)
WHERE curr.indicator = 0
AND prev.indicator = 1
YMMV on the ability of such a query to use indexes efficiently.
If the sequence is not complete, the same can be applied after creating a delegate sequence that is well ordered and similarly 'complete'.
This can also be done using correlated subqueries, each selecting the indicator of the 'previous max', but... ugh.
Joining the table against itself is quite generic, but most SQL dialects now support analytic functions. Ideally you could use LAG(), but Teradata seems to support the absolute minimum of these, so they point you to use SUM() combined with ROWS ... PRECEDING.
In any regard, this method avoids a potentially costly join and effectively deals with gaps in the data, whilst making maximum use of indexes.
SELECT
  *
FROM
  yourTable t
QUALIFY
  t.indicator < SUM(t.indicator) OVER (PARTITION BY t.somecolumn /* optional */
                                       ORDER BY t.Date
                                       ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
QUALIFY is a bit Teradata-specific, but slightly tidier than the alternative...
SELECT
  *
FROM
  (
    SELECT
      *,
      SUM(t.indicator) OVER (PARTITION BY t.somecolumn /* optional */
                             ORDER BY t.Date
                             ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING)
        AS previous_indicator
    FROM
      yourTable t
  ) lagged
WHERE
  lagged.indicator < lagged.previous_indicator
Supposing you mean that you want to determine whether any row having 1 as its indicator value has an earlier Date than a row in its group having 0 as its indicator value, you can identify groups with that characteristic by including the appropriate extreme dates in your aggregate results:
SELECT
...
MAX(CASE indicator WHEN 0 THEN Date END) AS last_ind_0,
MIN(CASE indicator WHEN 1 THEN Date END) AS first_ind_1,
...
You then test whether first_ind_1 is less than last_ind_0, either in code or as another selection item.
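For example, as a group-level filter (a sketch reusing the yourTable / somecolumn placeholders from the previous answer; standard SQL requires repeating the aggregate expressions in HAVING rather than referencing the aliases):
SELECT somecolumn,
       MAX(CASE indicator WHEN 0 THEN Date END) AS last_ind_0,
       MIN(CASE indicator WHEN 1 THEN Date END) AS first_ind_1
FROM   yourTable
GROUP BY somecolumn
-- keep only groups where some 1 is followed by a later 0
HAVING MIN(CASE indicator WHEN 1 THEN Date END)
     < MAX(CASE indicator WHEN 0 THEN Date END);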

SQL: need only 1 row per particular timestamp

I have some SQL code that inserts values from another (non-SQL-based) system. One of the values I get is a timestamp.
I can get multiple inserts that have the same timestamp (albeit different values for other fields).
My problem is that I am trying to get the first insert happening every day (based upon the timestamp) since a particular day (i.e. give me the first insert of each day since January 28, 2007...).
My code to get the first timestamp of every day is as follows:
SELECT MIN(my_timestamp) AS first_timestamp
FROM my_schema.my_table
WHERE my_col1 = 'WHATEVER'
AND my_timestamp > timestamp '2010-Jul-27 07:45:24' - INTERVAL '365 DAY'
GROUP BY DATE(my_timestamp);
This delivers the list of times available. But when I join against these times, I can get several rows, as there are lots of rows that match these times. So for 365 days, I may get 5,000 rows (I could be inserting 100 rows at 00:00:00 every day).
Assuming, in the example above, my_table has columns my_col1 and my_col2, how can I get exactly 365 rows that contain my_col1 and my_col2? It doesn't matter which row I get back if there are multiple rows for a date; any row will suffice.
It's an odd question. The overall problem is: given a timestamp, how can one get one row per timestamp even if there are multiple rows that have said timestamp (assuming there is no other priority)?
Thanks for the help in advance.
EDIT:
So, let's say for example, this table has the following columns: my_col1, my_col2, and my_timestamp.
Here are example values (in order of my_col1 - my_col2 - my_timestamp):
'my_val1' - 10 - '2010-07-01 01:01:01'
'my_val2' - 11 - '2010-07-01 01:01:01'
'my_val3' - 12 - '2010-07-01 01:01:01'
'my_val4' - 13 - '2010-07-01 01:01:02'
'my_val5' - 14 - '2010-07-02 01:01:01'
'my_val6' - 15 - '2010-07-02 01:01:01'
'my_val7' - 16 - '2010-07-03 01:01:01'
In the end, I would want only 3 rows: one with timestamp '2010-07-01 01:01:01', one with '2010-07-02 01:01:01', and one with '2010-07-03 01:01:01'. The third one is easy, since there is only one row with that last timestamp, but the first two are the tricky ones. The SQL I posted above will ignore the row with 'my_val4'.
I need a query that will return all of the columns, not just the dates.
How would I get SQL to give me either the first or the last of the values matching that timestamp? (It doesn't matter which; I just need one row per first timestamp of each day.)
select distinct on (date(my_timestamp)) *
from my_table
order by date(my_timestamp), my_timestamp
This selects all columns, exactly one row per date(my_timestamp). The single row per day is the first row for the group, as determined by order by (so that's the row with minimal my_timestamp).
Of course you can add whatever joins, wheres etc. you need. But this is the stub you're looking for.
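For example, folding in the filters from your original query (all names taken straight from the question; DISTINCT ON is PostgreSQL-specific):
SELECT DISTINCT ON (date(my_timestamp)) *
FROM   my_schema.my_table
WHERE  my_col1 = 'WHATEVER'
AND    my_timestamp > timestamp '2010-Jul-27 07:45:24' - INTERVAL '365 DAY'
ORDER  BY date(my_timestamp), my_timestamp;  -- earliest row of each day wins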
The solution is to use SQL's DISTINCT keyword (http://www.sql-tutorial.com/sql-distinct-sql-tutorial/):
SELECT DISTINCT MIN(my_timestamp) AS first_timestamp
FROM my_schema.my_table
WHERE my_col1 = 'WHATEVER'
  AND my_timestamp > timestamp '2010-Jul-27 07:45:24' - INTERVAL '365 DAY'
GROUP BY DATE(my_timestamp);
I know you already have an answer, but I still don't understand why you have mentioned a join in your question. Why not just include the rest of the columns in your query, like this:
SELECT MIN(my_timestamp) AS first_timestamp, my_col1, my_col2
FROM my_table
GROUP BY DATE(my_timestamp);
This works in MySQL. Does it not return the expected result in PostgreSQL?