How to get the max() function working without grouping? (Databricks SQL)

I have a table which requires filtering based on the dates.
| Group | Account | Values | Date_ingested |
| ----- | ------- | ------ | ------------- |
| X     | 3000    | 0      | 2023-01-07    |
| Y     | 3000    | null   | 2021-02-22    |
The goal is to select the row with the latest ingestion date when there are multiple data points for an account, as in the example above.
Account 3000 occurs under two Groups, but the correct, up-to-date result should only reflect group X because it was ingested into Databricks most recently.
Now, if I use the code below with grouping, the query executes, but the max function is effectively ignored and I still get two rows for account 3000, one with group X and one with group Y:
Select Group, Account, Values, max(Date_ingested) from datatableX
group by Group, Account, Values
If I use the code without grouping, I get the following error:
Error in SQL statement: AnalysisException: grouping expressions sequence is empty, and 'datatableX.Account' is not an aggregate function. Wrap '(max(spark_catalog.datatableX.Date_ingested) AS `max(Date_ingested)`)' in windowing function(s) or wrap 'spark_catalog.datatableX.Account' in first() (or first_value) if you don't care which value you get.
I can't, however, figure out a way to do the above. I tried reading about aggregate functions but I can't grasp the concept.
Select Group, Account, Values, max(Date_ingested) from datatableX
or
Select Group, Account, Values, max(Date_ingested) from datatableX
group by Group, Account, Values

You want the entire latest record per account, which suggests filtering rather than aggregation.
A typical approach uses row_number() to enumerate the records for each account by descending date of ingestion, then filters on the top record per account in the outer query:
select *
from (
    select d.*,
           row_number() over (partition by account order by date_ingested desc) as rn
    from datatableX d
) d
where rn = 1
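If you'd rather express the filter in terms of max() itself, an alternative sketch is to compare each row against a windowed max() per account. Note that ties on Date_ingested would return more than one row per account, and the backquotes around Group and Values are only there because those names can collide with reserved words; adjust to your actual schema.
select `Group`, Account, `Values`, Date_ingested
from (
    -- compute the latest ingestion date per account without collapsing rows
    select d.*,
           max(Date_ingested) over (partition by Account) as latest_date
    from datatableX d
) d
where Date_ingested = latest_date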


Postgres: aggregation function that returns a column

In postgres, when I call a function on some data, like so:
select f(col_nums) from tbl_name
where col_str = '12345'
then function f will be applied on each row where col_str = '12345'.
On the other hand, if I call an aggregation function on some data, like so:
select g_agg(col_nums) from tbl_name
where col_str = '12345'
then the function g_agg will be called on the entire column but will produce a single value.
Q: How can I make a function that will be applied on the entire column and return a column of the same size, while at the same time being aware of all the values in the subset?
For example, can I create a function to calculate cumulative sum?
select *, sum_accum(col_nums) as cs from tbl_name
where col_str = '12345'
such that the result of the above query would look like this:
col_str | more_cols | col_numbers | cs
---------+-----------+-------------+----
12345 | 567 | 1 | 1
12345 | 568 | 2 | 3
12345 | 569 | 3 | 6
12345 | 570 | 4 | 10
Is there no choice but to pass a sub-query result to a function and then join with the original table?
Use window functions
A window function performs a calculation across a set of table rows
that are somehow related to the current row. This is comparable to the
type of calculation that can be done with an aggregate function. But
unlike regular aggregate functions, use of a window function does not
cause rows to become grouped into a single output row — the rows
retain their separate identities. Behind the scenes, the window
function is able to access more than just the current row of the query
result.
e.g.
select *, sum(col_nums) OVER(PARTITION BY T.COLX, T.COLY) as cs
from tbl_name T
where col_str = '12345'
Note that it is the addition of an OVER clause that changes an aggregate from its traditional use into a window function:
the OVER clause causes it to be treated as a window function and
computed across an appropriate set of rows
The OVER clause takes a PARTITION BY (analogous to GROUP BY) which controls the window that the calculations are performed over; it also allows an ORDER BY, which is valid for some functions but not all.
select *
-- running sum using an order by
, sum(col_nums) OVER(PARTITION BY T.COLX ORDER BY T.COLY) as cs
-- but count does not permit ordering
, count(*) OVER(PARTITION BY T.COLX) as cs_count
from tbl_name T
where col_str = '12345'
The function that you want is a cumulative sum. This is handled by window functions:
select t.*, sum(col_nums) over (order by more_cols) as cs
from tbl_name t
where col_str = '12345';
I am guessing that the order by sequence is defined by the second column. It can be any column including col_nums.
You can do this for all values of col_str at the same time, using the partition by clause:
select t.*, sum(col_nums) over (partition by col_str order by more_cols) as cs
from tbl_name t
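For reference, when an ORDER BY is present the default window frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Spelling the frame out with ROWS, as in the sketch below, makes the sum accumulate strictly row by row even if more_cols contains duplicate values (an assumption worth checking against your data):
select t.*,
       -- explicit frame: running total over the rows seen so far
       sum(col_nums) over (
           partition by col_str
           order by more_cols
           rows between unbounded preceding and current row
       ) as cs
from tbl_name t;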

Adding a percent column to MS Access Query

I'm trying to add a column which calculates percentages of different products in MS Access Query. Basically, this is the structure of the query that I'm trying to reach:
Product | Total | Percentage
--------+-------+-----------
Prod1   | 15    | 21.13%
Prod2   | 23    | 32.39%
Prod3   | 33    | 46.48%
Product | 71    | 100%
The formula I use for finding the percent is: ([Total Q of a Product]/[Totals of all Products])*100, but when I try to use the expression builder (since my SQL skills are basic) in MS Access to calculate it...
= [CountOfProducts] / Sum([CountOfProducts])
...I receive the error message "Cannot have aggregate function in GROUP BY clause.. (and the expression goes here)". I also tried the option with two queries: one that calculates only the totals and another that uses the first one to calculate the percentages, but the result was the same.
I'll be grateful if someone can help me with this.
You can get all but the last row of your desired output with this query.
SELECT
y.Product,
y.Total,
Format((y.Total/sub.SumOfTotal),'#.##%') AS Percentage
FROM
YourTable AS y,
(
SELECT Sum(Total) AS SumOfTotal
FROM YourTable
) AS sub;
Since that query does not include a JOIN or WHERE condition, it returns a cross join between the table and the single row of the subquery.
If you need the last row from your question example, you can UNION the query with another which returns the fabricated row you want. In this example, I used a custom Dual table which is designed to always contain one and only one row. But you could substitute another table or query which returns a single row.
SELECT
y.Product,
y.Total,
Format((y.Total/sub.SumOfTotal),'#.##%') AS Percentage
FROM
YourTable AS y,
(
SELECT Sum(Total) AS SumOfTotal
FROM YourTable
) AS sub
UNION ALL
SELECT
'Product',
DSum('Total', 'YourTable'),
'100%'
FROM Dual;
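If you don't already have a one-row Dual table, a minimal sketch for creating one follows. The table and constraint names are placeholders, each statement has to be run as its own Access query, and Access SQL does not accept inline comments, so the explanation stays up here: first create the table, then insert its single row.
CREATE TABLE Dual (id INTEGER CONSTRAINT pk_dual PRIMARY KEY);
INSERT INTO Dual (id) VALUES (1);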

Error in Hive : Underlying error: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: One or more arguments are expected

I am trying to translate some PL/SQL scripts into Hive, and I ran into an error with one HiveQL script.
The error is this one:
FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies.
Underlying error: org.apache.hadoop.hive.ql.exec.UDFArgumentTypeException: One or more arguments are expected.
I think the error is coming from this part of the script:
SELECT
mag.co_magasin,
dem.id_produit as id_produit_orig,
pnvente.dt_debut_commercial as dt_debut_commercial,
COALESCE(pnvente.id_produit,dem.id_produit) as id_produit,
min(
CASE WHEN dem.co_validation IS NULL THEN 0 ELSE 1 END
) as flg_demarque_valide,
sum(CASE WHEN dem.co_validation IS NULL THEN 0 ELSE cast(dem.mt_revient_ope AS INT) END)
as me_dem_con_prx_cs,
0 as me_dem_inc_prx_cs,
0 as me_dem_prov_stk_cs,
sum(CASE WHEN dem.co_validation IS NULL THEN 0 ELSE cast(dem.qt_demarque AS INT) END)
as qt_dem_con,
0 as qt_dem_inc,
0 as qt_dem_prov_stk,
RANK() OVER (PARTITION BY mag.co_magasin, dem.id_produit ORDER BY pnvente.dt_debut_commercial DESC, COALESCE(pnvente.id_produit,dem.id_produit) DESC) as rang
from default.calendrier cal
INNER JOIN default.demarque_mag_jour dem
ON CASE WHEN dem.co_societe = 1 THEN 1 ELSE 2 END = '${hiveconf:in_co_societe}'
AND dem.dt_jour = cal.dt_jour
LEFT OUTER JOIN default.produit_norm pn
ON pn.co_societe = dem.co_societe
AND pn.id_produit = dem.id_produit
LEFT OUTER JOIN default.produit_norm pnvente
ON pnvente.co_societe = pn.co_societe
AND pnvente.co_produit_rfu = pn.co_produit_lip
AND pnvente.co_type_motif='05'
INNER JOIN default.kpi_magasin mag
ON mag.co_societe = '${hiveconf:in_co_societe}'
AND mag.id_magasin = dem.id_magasin
WHERE cal.dt_jour = '${hiveconf:in_dt_jour}'
AND NOT (dem.co_validation IS NULL AND cal.dt_jour > from_unixtime(unix_timestamp()-3*60*60*24, 'ddmmyyyy'))
-- JYP 4.4
AND dem.co_operation_magasin IN ('13','14','32')
GROUP BY
mag.co_magasin,
dem.id_produit,
pnvente.dt_debut_commercial,
COALESCE(pnvente.id_produit,dem.id_produit)
But I can't find any solution on the web.
Thanks for your help :-)
I have run into the same error. rank() is case sensitive in Hive and the error message gives nothing away. Try changing RANK() to rank().
My guess is that it has to do with the coalesce inside your rank. Analytic functions work in HiveQL but are more limited than in other SQL dialects. I would try putting all your joins and sums in an inner query and then doing the rank in an outer query (a sketch of that restructuring, applied to your query, is at the end of this answer). This is often required because HiveQL does not always follow the same order of operations you would expect from a typical SQL dialect. Consider a table based on stock information:
select count(*) as COUNT
from NYSE_STOCKS
where date in ('2001-12-20','2001-12-21','2001-12-24') and exchange = 'NYSE';
Now consider the following query:
select
exchange
, date
, count(*) over (partition by exchange)
from NYSE_STOCKS
where date in ('2001-12-20','2001-12-21','2001-12-24')
group by exchange, date;
You would expect the following results:
EXCHANGE | DATE | COUNT
NYSE | 2001-12-20 | 5199
NYSE | 2001-12-21 | 5199
NYSE | 2001-12-24 | 5199
But you would actually get this in HiveQL:
EXCHANGE | DATE | COUNT
NYSE | 2001-12-20 | 3
NYSE | 2001-12-21 | 3
NYSE | 2001-12-24 | 3
To get the correct results you have to do the group by in an inner query and the analytic function in the outer query:
select
exchange
, date
, count
from (
select
exchange
, date
, count(*) over (partition by exchange) as count
from NYSE_STOCKS
where date in ('2001-12-20','2001-12-21','2001-12-24')
) A
group by exchange, date, count
;
So in summary, it's always good to think about order of operations when using analytic functions, and to get the data you are working with into its simplest form before you apply the analytic function.
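To make that suggestion concrete, here is a hedged sketch of the original query restructured along those lines: all joins and aggregations are done in an inner query, and RANK() is applied only in the outer query. It is simplified (only one SUM is shown, and the hiveconf filters and WHERE conditions are omitted), so treat it as an illustration rather than a drop-in replacement.
SELECT
  q.*,
  -- ranking happens only in the outer query, over already-aggregated rows
  RANK() OVER (
    PARTITION BY q.co_magasin, q.id_produit_orig
    ORDER BY q.dt_debut_commercial DESC, q.id_produit DESC
  ) AS rang
FROM (
  -- inner query: joins and aggregation only, no analytic functions
  SELECT
    mag.co_magasin,
    dem.id_produit AS id_produit_orig,
    pnvente.dt_debut_commercial AS dt_debut_commercial,
    COALESCE(pnvente.id_produit, dem.id_produit) AS id_produit,
    SUM(CASE WHEN dem.co_validation IS NULL THEN 0
             ELSE CAST(dem.qt_demarque AS INT) END) AS qt_dem_con
  FROM default.calendrier cal
  INNER JOIN default.demarque_mag_jour dem
    ON dem.dt_jour = cal.dt_jour
  INNER JOIN default.kpi_magasin mag
    ON mag.id_magasin = dem.id_magasin
  LEFT OUTER JOIN default.produit_norm pn
    ON pn.co_societe = dem.co_societe
   AND pn.id_produit = dem.id_produit
  LEFT OUTER JOIN default.produit_norm pnvente
    ON pnvente.co_societe = pn.co_societe
   AND pnvente.co_produit_rfu = pn.co_produit_lip
   AND pnvente.co_type_motif = '05'
  GROUP BY
    mag.co_magasin,
    dem.id_produit,
    pnvente.dt_debut_commercial,
    COALESCE(pnvente.id_produit, dem.id_produit)
) q;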
Funnily enough, I actually hit this same error today. The problem for me was that one of the columns I was using in my analytic function was not a valid column. Without knowing what columns your tables provide, it's impossible for me to prove this is your problem, but you may want to make sure all the columns in your RANK are valid.
This does not look like a valid Hive query to me. Remember that Hive's query language is pretty limited compared to SQL. For example, "IN" is not supported. Another example is RANK() OVER (...): that's not supported either. In other words, attempting to use RDBMS SQL directly in Hive will mostly not work.

how to select one tuple in rows based on variable field value

I'm quite new to SQL and I'd like to write a SELECT statement that retrieves only the first row of a set based on a column value. I'll try to make it clearer with a table example.
Here is my table data :
chip_id | sample_id
-------------------
1 | 45
1 | 55
1 | 5986
2 | 453
2 | 12
3 | 4567
3 | 9
I'd like a SELECT statement that fetches the first row for each of chip_id = 1, 2, 3.
Like this:
chip_id | sample_id
-------------------
1 | 45 or 55 or whatever
2 | 12 or 453 ...
3 | 9 or ...
How can I do this?
Thanks
I'd probably:
- set a variable = 0
- order your table by chip_id
- read the table in row by row
- if table[row].chip_id > variable, store table[row] in a result array and increment the variable
- loop till done
- return your result array
Though depending on your DB, query and versions, you'll probably get unpredictable/unreliable returns.
You can get one value using row_number():
select chip_id, sample_id
from (select chip_id, sample_id,
             row_number() over (partition by chip_id order by rand()) as seqnum
      from your_table  -- the question doesn't name the table
     ) t
where seqnum = 1
This returns a random value. In SQL, tables are inherently unordered, so there is no concept of "first". You need an auto incrementing id or creation date or some way of defining "first" to get the "first".
If you have such a column, then replace rand() with the column.
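For example, if the smallest sample_id is an acceptable definition of "first" (an assumption, since the question doesn't define one), the same pattern becomes:
select chip_id, sample_id
from (select chip_id, sample_id,
             -- order by the column that defines "first" instead of rand()
             row_number() over (partition by chip_id order by sample_id) as seqnum
      from your_table
     ) t
where seqnum = 1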
Provided I understood your output, if you are using PostgreSQL 9, you can use this:
SELECT chip_id,
       string_agg(sample_id::text, ' or ')  -- cast needed if sample_id is numeric
FROM your_table
GROUP BY chip_id
You need to group your data with a GROUP BY query.
When you group, generally you want the max, the min, or some other values to represent your group. You can do sums, count, all kind of group operations.
For your example, you don't seem to want a specific group operation, so the query could be as simple as this one:
SELECT chip_id, MAX(sample_id)
FROM table
GROUP BY chip_id
This way you are retrieving the maximum sample_id for each of the chip_id.
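Applied to the sample data in the question, that query would return one row per chip_id, with the largest sample_id in each group:
chip_id | sample_id
--------+----------
      1 |      5986
      2 |       453
      3 |      4567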

Select distinct values for a particular column choosing arbitrarily from duplicates

I have health data relating to deaths. An individual should die at most once. In the database they sometimes don't, probably because causes of death were changed but the original entry was not deleted. I don't really understand how this was allowed to happen, but it has. So, as a made-up example, I have:
Row_number | Individual_ID | Cause_of_death | Date_of_death
------------+---------------+-----------------------+---------------
1 | 1 | Stroke | 3 march 2008
2 | 2 | Myocardial infarction | 1 jan 2009
3 | 2 | Pulmonary Embolus | 1 jan 2009
I want each individual to have only one cause of death.
In the example, I want a query that returns row 1 and either row 2 or row 3 (not both). I have to make an arbitrary choice between rows 2 and 3 because there is no timestamp in any of the fields that can be used to determine which is the revision; it's not ideal but is unavoidable.
I can't make the SQL work to do this. I've tried inner joining distinct Individual_ID to the other fields, but this still gives all the rows. I've tried adding a 'having count(Individual_ID) = 1' clause with it. This leaves out people with more than one cause of death completely. Suggestions on the internet seem to be based on using a timestamped field to choose the most recent, but I don't have that.
IBM DB2. Windows XP. Any thoughts gratefully received.
Have you tried using MIN (or MAX) against the cause of death (and the date of death, if they died on two different dates)?
SELECT Individual_ID, MIN(Cause_of_death), MIN(Date_of_death)
FROM deaths
GROUP BY Individual_ID
I don't know DB2 so I'll answer in general. There are two main approaches:
select *
from T
join (
select keys, min(ID) as MinID
from T
group by keys
) on T.ID = MinID
And
select *, row_number() over (partition by keys) as r
from T
where r = 1
Both return all rows, whether duplicated or not, but they return only one of the duplicates per "key".
Note that both statements are pseudo-SQL.
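Filled in with the columns from the question (using Row_number as the arbitrary tiebreaker, since there is nothing better to order by, and T as a stand-in for the table name), the first approach might look like this:
select T.*
from T
join (
    -- pick the lowest Row_number per individual as the surviving row
    select Individual_ID, min(Row_number) as MinRow
    from T
    group by Individual_ID
) m
  on T.Individual_ID = m.Individual_ID
 and T.Row_number = m.MinRow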
The row_number() approach is probably preferable from a performance standpoint. Here is usr's example, in DB2 syntax:
select * from (
select T.*, row_number() over (partition by Individual_ID) as r
from T
)
where r=1;
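If you later want the kept row to be deterministic rather than arbitrary, you can add an ORDER BY inside the window. This variant (again a sketch, ordering on the Row_number column from the question) keeps the lowest-numbered row per individual:
select * from (
    -- order by Row_number so the "kept" row is always the same one
    select T.*, row_number() over (partition by Individual_ID
                                   order by Row_number) as r
    from T
)
where r = 1;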