array join partitioned by column in spark sql

I am using spark sql. Let's say I have a table like this
ID,Grade
1,A
2,B
1,A
2,C
I want to build an array containing all the grades for each ID, but I don't want to collapse the table with a GROUP BY: I want to keep every row of the original table. My desired output is the following:
ID,Grade
1,[A,A]
1,[A,A]
2,[B,C]
2,[B,C]
My query is the following
SELECT array_join(collect_list(GRADE), ",") AS GRADES
OVER (PARTITION BY ID)
FROM table
However, I get an error like this:
AnalysisException: "grouping expressions sequence is empty, and 'ID' is not an aggregate function.
Any idea how to fix my query? Thank you

In your query, collect_list is the aggregate function, so if you want to use a window you need to attach the OVER clause directly to collect_list:
SELECT id,
array_join(collect_list(GRADE) OVER (PARTITION BY ID) , ",") AS GRADES
FROM table
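To try the windowed-aggregate idea without a Spark cluster, here is a minimal sketch using Python's built-in sqlite3 module. It is only an analogy: SQLite has no collect_list or array_join, so group_concat stands in for the pair, the table name grades is made up for the demo, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

# In-memory stand-in for the question's table; "grades" is a hypothetical name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE grades (id INTEGER, grade TEXT)")
conn.executemany(
    "INSERT INTO grades VALUES (?, ?)",
    [(1, "A"), (2, "B"), (1, "A"), (2, "C")],
)

# group_concat plays the role of array_join(collect_list(...), ',').
# The OVER clause is attached to the aggregate itself, so no row is collapsed.
rows = conn.execute(
    """
    SELECT id,
           group_concat(grade, ',') OVER (PARTITION BY id) AS grades
    FROM grades
    ORDER BY id
    """
).fetchall()

for row in rows:
    print(row)
```

All four input rows survive, each carrying the grades of its whole partition. The order of values inside each concatenated string is not guaranteed here, which is also true of collect_list in Spark.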

Related

Filter SQL by Aggregate Not in SELECT Statement

Can you filter a SQL table based on an aggregated value, but still show column values that weren't in the aggregate statement?
My table has only 3 columns: "Composer_Tune", "_Year", and "_Rank".
I want to use SQL to find which "Composer_Tune" values are repeated in each annual list, as well as which ranks the duplicated items had.
Since I am grouping by "Composer_Tune" and "_Year", I can't list "_Rank" with my current code.
The image shows the results of my original "find the duplicates" query vs what I want:
[Image: Current vs Desired Results]
I tried applying the concepts in this Aggregate Subquery StackOverflow post but am still getting "_Rank is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause" from this code:
WITH DUPE_DB AS (SELECT * FROM DB.dbo.[NAME] GROUP BY Composer_Tune, _Year HAVING COUNT(*)>1)
SELECT Composer_Tune, _Year, _Rank
FROM DUPE_DB

You need to explicitly list the columns used in the GROUP BY expression in the SELECT list instead of using SELECT *.
If you are using Transact-SQL, consult its documentation for the proper use of GROUP BY.

Simply join the aggregated result set back to the original row-level table:
WITH DUPE_DB AS (
SELECT Composer_Tune, _Year
FROM DB.dbo.[NAME]
GROUP BY Composer_Tune, _Year
HAVING COUNT(*) > 1
)
SELECT n.Composer_Tune, n._Year, n._Rank
FROM DB.dbo.[NAME] n
INNER JOIN DUPE_DB
ON n.Composer_Tune = DUPE_DB.Composer_Tune
AND n._Year = DUPE_DB._Year
ORDER BY n.Composer_Tune, n._Year
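The join-back pattern can be tried end to end with a small sketch. This one uses Python's built-in sqlite3 module rather than SQL Server; the table name tunes and the sample rows are invented for the demo, and only the three column names come from the question.

```python
import sqlite3

# Hypothetical table and rows; only the column names come from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tunes (Composer_Tune TEXT, _Year INTEGER, _Rank INTEGER)")
conn.executemany(
    "INSERT INTO tunes VALUES (?, ?, ?)",
    [
        ("Bach - Air", 2019, 3),
        ("Bach - Air", 2019, 7),       # duplicate within 2019
        ("Holst - Jupiter", 2019, 5),
        ("Bach - Air", 2020, 4),       # appears only once in 2020
    ],
)

# The CTE finds the duplicated (tune, year) pairs; the join brings back
# the per-row _Rank values that the GROUP BY had to leave out.
rows = conn.execute(
    """
    WITH dupes AS (
        SELECT Composer_Tune, _Year
        FROM tunes
        GROUP BY Composer_Tune, _Year
        HAVING COUNT(*) > 1
    )
    SELECT t.Composer_Tune, t._Year, t._Rank
    FROM tunes t
    INNER JOIN dupes d
        ON t.Composer_Tune = d.Composer_Tune
       AND t._Year = d._Year
    ORDER BY t.Composer_Tune, t._Year, t._Rank
    """
).fetchall()

for row in rows:
    print(row)
```

Only the two 2019 rows for the duplicated tune come back, each with its own rank.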

Transform rows to columns SQL Query

If I have a query like this
SELECT * FROM table1
I get a result something like this:
How can I write a query from the same table that returns me something like this:
value_name has to turn into columns and the value column has to turn into its values.
Also notice that the ids are repeated and its description is always the same one.
I'm working with PostgreSQL.

If you know the values in advance, you can use conditional aggregation:
select id, description,
max(value) filter (where value_name = 'FE') as fe,
max(value) filter (where value_name = 'H2O') as h2o,
max(value) filter (where value_name = 'N') as n
from t
group by id, description;
If you don't know the names, then you cannot accomplish this with a single SQL query. You need to use dynamic SQL or use an alternate data representation such as JSON.
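As a rough illustration of the dynamic SQL route, the sketch below first reads the distinct names and then generates one conditional aggregate per name. It runs against Python's built-in sqlite3 module rather than PostgreSQL, so it uses the portable MAX(CASE WHEN ...) form instead of FILTER; the sample rows and the quote_ident helper are assumptions made up for the demo.

```python
import sqlite3

# Hypothetical sample data in the long (id, description, value_name, value) shape.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, description TEXT, value_name TEXT, value REAL)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (1, "sample one", "FE", 0.5),
        (1, "sample one", "H2O", 12.0),
        (2, "sample two", "FE", 0.7),
        (2, "sample two", "N", 3.1),
    ],
)

# Step 1: discover the future column names at run time.
names = [r[0] for r in conn.execute("SELECT DISTINCT value_name FROM t ORDER BY value_name")]


def quote_ident(name):
    """Quote a value as an SQL identifier (hypothetical helper for this demo)."""
    return '"' + name.replace('"', '""') + '"'


# Step 2: generate one conditional aggregate per name. The names are passed
# as bound parameters for the comparison and quoted for the column alias.
select_parts = [
    f"MAX(CASE WHEN value_name = ? THEN value END) AS {quote_ident(n)}" for n in names
]
sql = (
    "SELECT id, description, " + ", ".join(select_parts)
    + " FROM t GROUP BY id, description ORDER BY id"
)

cursor = conn.execute(sql, names)
columns = [c[0] for c in cursor.description]
rows = cursor.fetchall()

print(columns)
for row in rows:
    print(row)
```

This takes two round trips, one to learn the names and one to run the generated query, which is the reason a single static query cannot do the job.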

Any easier way to group by individual columns in Hive/Impala?

I need to output a report of users by their age, gender, education, income, etc. from our database. However, there are about 40 variables. It seems silly to group by each variable one by one, but I'm not aware of other ways and I don't know how to write a UDF to solve it yet. I'd appreciate your help.
It's not that complicated but it does come up a lot in daily work. My work environment is Hive/Impala.

A UDF, UDAF, or UDTF cannot perform a GROUP BY over the input rows:
A UDF takes a single input row and produces a single output row.
A UDAF aggregates values across rows, but it does not do the grouping itself.
A UDTF transforms a single input row into multiple output rows.
One workable solution is to write one query per column, combine them with UNION ALL, and display the result or insert it into a table.
Sample query:
SELECT *
FROM
(
SELECT COUNT(column1),column1 FROM table GROUP BY column1
UNION ALL
SELECT COUNT(column2),column2 FROM table GROUP BY column2
UNION ALL
SELECT COUNT(column3),column3 FROM table GROUP BY column3
) s
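With around 40 variables, typing every branch by hand is tedious, so the UNION ALL statement can be generated from a list of column names. The sketch below does this in Python and runs the result against the built-in sqlite3 module as a stand-in for Hive/Impala. The users table, its rows, and the extra variable label column are assumptions added for the demo. In Hive the cast would be to STRING rather than TEXT.

```python
import sqlite3

# Hypothetical table with three of the ~40 variables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (age TEXT, gender TEXT, education TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [
        ("18-24", "F", "BSc"),
        ("18-24", "M", "MSc"),
        ("25-34", "F", "BSc"),
    ],
)

# Trusted, hard-coded column names; never build SQL from untrusted input this way.
columns = ["age", "gender", "education"]

# One GROUP BY branch per column. The literal label says which variable each
# output row belongs to, and the cast keeps the value column one type.
branches = [
    f"SELECT '{col}' AS variable, CAST({col} AS TEXT) AS value, COUNT(*) AS n "
    f"FROM users GROUP BY {col}"
    for col in columns
]
sql = "\nUNION ALL\n".join(branches) + "\nORDER BY variable, value"

rows = conn.execute(sql).fetchall()
for row in rows:
    print(row)
```

The generated string has the same shape as the sample query, so it can be pasted into a Hive or Impala session once the cast type is adjusted.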

GROUP BY in Informix (11.5)

Following Example Table Structure:
NR1 | NR2 | PRENAME | LASTNAME
If I query all 4 fields of this table and group by its first 2 fields (NR1, NR2), in MySQL I can do something like this:
SELECT NR1,NR2,PRENAME,LASTNAME FROM tbl GROUP BY NR1,NR2
But this won't work in Informix.
INFORMIX ERROR: the column (PRENAME) must be in the group by list
After reading some topics found via Google, it seems to be an "Informix feature" that all selected columns have to be in the GROUP BY list.
But if I do that, the result is not the one I want.
If I use DISTINCT instead of GROUP BY, the result is similarly wrong, because I cannot apply DISTINCT to columns 1 and 2 only.
So: how can I emulate the MySQL GROUP BY behavior?

Your original syntax is suitable in one database -- MySQL. And, that database says that the results of the non-aggregated columns come from indeterminate rows. So, an equivalent query is just to use MIN() or MAX():
SELECT NR1, NR2, MIN(PRENAME), MIN(LASTNAME)
FROM tbl
GROUP BY NR1, NR2;
My guess is that you want an arbitrary value from just one row. I'd be inclined to concatenate them:
SELECT NR1, NR2, MIN(PRENAME || ' ' || LASTNAME)
FROM tbl
GROUP BY NR1, NR2;
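The reason for concatenating can be seen in a small sketch. It uses Python's built-in sqlite3 module rather than Informix, with made-up rows: two separate MIN() calls may pick the first name and last name from different rows, while MIN() over the concatenation always returns a name that exists in a single row.

```python
import sqlite3

# Hypothetical rows; the column names come from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (NR1 INTEGER, NR2 INTEGER, PRENAME TEXT, LASTNAME TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?, ?, ?)",
    [
        (1, 1, "Anna", "Zimmer"),
        (1, 1, "Bert", "Adler"),
        (1, 2, "Carl", "Meyer"),
    ],
)

# Each MIN() is evaluated independently, so the two values can come
# from different rows of the same group.
separate = conn.execute(
    "SELECT NR1, NR2, MIN(PRENAME), MIN(LASTNAME) "
    "FROM tbl GROUP BY NR1, NR2 ORDER BY NR1, NR2"
).fetchall()

# A single MIN() over the concatenated name keeps both parts together.
combined = conn.execute(
    "SELECT NR1, NR2, MIN(PRENAME || ' ' || LASTNAME) "
    "FROM tbl GROUP BY NR1, NR2 ORDER BY NR1, NR2"
).fetchall()

print(separate)
print(combined)
```

With these rows, the first query returns Anna and Adler for group (1, 1), a pair that appears in no source row, whereas the second returns "Anna Zimmer".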

counting rows in select clause with DB2

I would like to query a DB2 table and get all the results of a query, plus the total number of rows returned by the select statement in a separate column.
E.g., if the table contains columns 'id' and 'user_id', assuming 100 rows, the result of the query would appear in this format: (id) | (user_id) | 100.
I do not wish to use a 'group by' clause in the query. (Just in case you are confused about what I am asking.) Also, I could not find an example here: http://mysite.verizon.net/Graeme_Birchall/cookbook/DB2V97CK.PDF.
Also, if there is a more efficient way of getting both these results (values + count), I would welcome any ideas. My environment uses zend framework 1.x, which does not have an ODBC adapter for DB2. (See issue http://framework.zend.com/issues/browse/ZF-905.)

If I understand what you are asking for, then the answer should be
select t.*, g.tally
from mytable t,
(select count(*) as tally
from mytable
) as g;
If this is not what you want, then please give an actual example of desired output, supposing there are 3 to 5 records, so that we can see exactly what you want.

You would use window/analytic functions for this:
select t.*, count(*) over() as NumRows
from table t;
This will work for whatever kind of query you have.
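Both answers can be compared side by side in a short sketch. It uses Python's built-in sqlite3 module as a stand-in for DB2, with three invented rows; COUNT(*) OVER () needs SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical three-row table with the question's two columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, user_id INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", [(1, 10), (2, 20), (3, 10)])

# Window version: an empty OVER () makes the whole result set one partition.
windowed = conn.execute(
    "SELECT t.*, COUNT(*) OVER () AS NumRows FROM mytable t ORDER BY t.id"
).fetchall()

# Join version: a one-row subquery cross-joined onto every row.
joined = conn.execute(
    """
    SELECT t.*, g.tally
    FROM mytable t
    CROSS JOIN (SELECT COUNT(*) AS tally FROM mytable) AS g
    ORDER BY t.id
    """
).fetchall()

print(windowed)
print(joined)
```

The two forms agree here. They diverge once a WHERE clause is added to the outer query only: the window count reflects the filtered rows, while the subquery still counts the whole table unless the filter is repeated inside it.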
