Spark SQL: Aggregate column values within a Group

I need to aggregate the values of a column articleId into an array. This needs to be done within a group that I create with groupBy beforehand.
My table looks like this:
| customerId | articleId | articleText | ...
| 1 | 1 | ... | ...
| 1 | 2 | ... | ...
| 2 | 1 | ... | ...
| 2 | 2 | ... | ...
| 2 | 3 | ... | ...
And I want to build something like
| customerId | articleIds |
| 1 | [1, 2] |
| 2 | [1, 2, 3] |
My code so far:
DataFrame test = dfFiltered.groupBy("CUSTOMERID").agg(dfFiltered.col("ARTICLEID"));
But here I get an AnalysisException:
Exception in thread "main" org.apache.spark.sql.AnalysisException: expression 'ARTICLEID' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
Can someone help me build a correct statement?

In SQL, when you group by something, every column you select must either appear in the GROUP BY clause or be wrapped in an aggregate function. Your Spark SQL code passes ARTICLEID to agg() without doing either, which is what the exception is telling you.
There is a similar question whose answers may solve your problem: "SPARK SQL replacement for mysql GROUP_CONCAT aggregate function".

This can be achieved with the collect_list function. In Spark 1.x it is only available if you're using HiveContext; from Spark 2.0 on it works without Hive support:
import org.apache.spark.sql.functions._
df.groupBy("customerId").agg(collect_list("articleId"))
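To make clear what the aggregation produces, here is a minimal plain-Python sketch of the same grouping. It is not Spark code; the helper name collect_list_by_key is made up for illustration, and the rows are the sample data from the question.

```python
from collections import defaultdict

# Sample rows from the question: (customerId, articleId)
rows = [(1, 1), (1, 2), (2, 1), (2, 2), (2, 3)]


def collect_list_by_key(pairs):
    """Group values by key and keep duplicates, which is the idea behind
    groupBy(...).agg(collect_list(...)). Hypothetical helper, not a Spark API."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


article_ids = collect_list_by_key(rows)
print(article_ids)  # {1: [1, 2], 2: [1, 2, 3]}
```

One difference to keep in mind: this sketch preserves input order, while Spark, as far as I know, does not promise any particular element order in the collected array after a shuffle.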

Related

Find sequence of choice in a column

There is a table where user_id identifies each test taker and choice holds that person's answer to each of three questions. I would like to get all the different sequences of choices that test takers made and count how often each sequence occurs. Is there a way to write a SQL query to achieve this? Thanks
----------------------------------
| user_id | Choice |
----------------------------------
| 1 | a |
----------------------------------
| 1 | b |
----------------------------------
| 1 | c |
----------------------------------
| 2 | b |
----------------------------------
| 2 | c |
----------------------------------
| 2 | a |
----------------------------------
Desired answer:
----------------------------------
| choice | count |
----------------------------------
| a,b,c | 1 |
----------------------------------
| b,c,a | 1 |
----------------------------------

In BigQuery, you can use aggregation functions:
select choices, count(*)
from (select string_agg(choice order by ?) as choices, user_id
from t
group by user_id
) t
group by choices;
The ? is for the column that specifies the ordering of the table. Remember: tables represent unordered sets, so without such a column the choices can be in any order.
You can do something similar in SQL Server 2017+ using string_agg(). In earlier versions, you have to use an XML method, which is rather unpleasant.
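The two-level aggregation (concatenate per user, then count per sequence) can be sketched in plain Python as follows. The position column is an assumption standing in for the ? above, since the question's table has no ordering column, and count_sequences is a made-up helper name.

```python
from collections import Counter, defaultdict

# (user_id, position, choice); `position` is a hypothetical ordering column
answers = [
    (1, 1, "a"), (1, 2, "b"), (1, 3, "c"),
    (2, 1, "b"), (2, 2, "c"), (2, 3, "a"),
]


def count_sequences(rows):
    """Inner step: build one ordered, comma-joined string per user.
    Outer step: count how many users share each string."""
    per_user = defaultdict(list)
    for user_id, _position, choice in sorted(rows, key=lambda r: (r[0], r[1])):
        per_user[user_id].append(choice)
    return Counter(",".join(choices) for choices in per_user.values())


sequence_counts = count_sequences(answers)
print(dict(sequence_counts))  # {'a,b,c': 1, 'b,c,a': 1}
```

Without the ordering column the sort key would have nothing to sort on, which is the same point the SQL answer makes about unordered tables.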

How to explode map datatype in Hive OR how to give multiple aliases in Hive

Suppose I query :
select explode(map_column_name) as exploded from table_name
I get this error:
The number of aliases in the AS clause does not match the number of
columns output by the UDTF, expected 2 aliases but got 1
I googled the error and learned that the stack function is used to give more than one alias.
How can I use the stack function together with explode so that I explode the map datatype and also give 2 aliases at once?
Kindly bear with me as I am a beginner and learning Hive.

With default column names
select explode(map) from table_name
With aliases
select explode(map) as (mykey,myval) from table_name
Demo
With default column names
select explode (map('A',1,'B',2,'C',3))
;
+-----+-------+
| key | value |
+-----+-------+
| A | 1 |
| B | 2 |
| C | 3 |
+-----+-------+
With aliases
select explode (map('A',1,'B',2,'C',3)) as (mykey,myvalue)
;
+-------+---------+
| mykey | myvalue |
+-------+---------+
| A | 1 |
| B | 2 |
| C | 3 |
+-------+---------+
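If it helps to see why two aliases are expected, here is a rough plain-Python picture of what exploding a map does: each entry becomes one row with two columns. This is only an illustration of the semantics, not Hive code, and the function name explode_map is made up.

```python
def explode_map(mapping):
    """Return one (key, value) row per map entry, which is why
    exploding a map yields two output columns. Hypothetical helper."""
    return [(key, value) for key, value in mapping.items()]


rows = explode_map({"A": 1, "B": 2, "C": 3})
print(rows)  # [('A', 1), ('B', 2), ('C', 3)]

# Naming the two columns, analogous to `as (mykey, myvalue)`
named_rows = [{"mykey": key, "myvalue": value} for key, value in rows]
print(named_rows[0])  # {'mykey': 'A', 'myvalue': 1}
```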

Selecting only distinct rows based on a column in Knex

I'm using Knex, a pretty nice SQL builder.
I've got a table called Foo which has 3 columns
+--------------+-----------------+
| id | PK |
+--------------+-----------------+
| idFoo | FK (not unique) |
+--------------+-----------------+
| serialNumber | Number |
+--------------+-----------------+
I'd like to select all rows with idFoo IN (1, 2, 3).
However I'd like to avoid duplicate records based on the same idFoo.
Since that column is not unique there could be many rows with the same idFoo.
A possible solution
My query below will of course return all rows with idFoo IN (1, 2, 3), even duplicates.
db.select(
    "id",
    "idFoo",
    "serialNumber"
  )
  .from("foo")
  .whereIn("idFoo", [1, 2, 3])
However this will return results with duplicated idFoo's like so:
+----+-------+--------------+
| id | idFoo | serialNumber |
+----+-------+--------------+
| 1 | 2 | 56454 |
+----+-------+--------------+
| 2 | 3 | 75757 |
+----+-------+--------------+
| 3 | 3 | 00909 |
+----+-------+--------------+
| 4 | 1 | 64421 |
+----+-------+--------------+
What I need is this:
+----+-------+--------------+
| id | idFoo | serialNumber |
+----+-------+--------------+
| 1 | 2 | 56454 |
+----+-------+--------------+
| 3 | 3 | 00909 |
+----+-------+--------------+
| 4 | 1 | 64421 |
+----+-------+--------------+
I can take the result and use Javascript to filter out the duplicates. I'd specifically like to avoid that and write this in Knex.
The question is how can I do this with Knex code?
I know it can be done with plain SQL (perhaps something using GROUP BY) but I'd specifically like to achieve this in "pure" knex without using raw SQL.

Knex.js supports groupBy natively. You can write:
knex('foo').whereIn('id',
knex('foo').max('id').groupBy('idFoo')
)
Which is rewritten to the following SQL:
SELECT * FROM foo
WHERE id IN (
SELECT max(id) FROM foo
GROUP BY idFoo
)
Note that you need the subselect to make sure you don't mix values from different rows within the same group.
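To check the generated SQL against the sample data from the question, here is a small sketch using Python's built-in sqlite3 module. SQLite is swapped in purely for convenience; the question does not say which database is behind Knex, and the explicit column list and ORDER BY are additions to keep the output stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE foo (id INTEGER PRIMARY KEY, idFoo INTEGER, serialNumber TEXT)"
)
# Sample rows from the question
conn.executemany(
    "INSERT INTO foo VALUES (?, ?, ?)",
    [(1, 2, "56454"), (2, 3, "75757"), (3, 3, "00909"), (4, 1, "64421")],
)

# Keep only the row with the highest id in each idFoo group
latest_rows = conn.execute(
    """
    SELECT id, idFoo, serialNumber FROM foo
    WHERE id IN (SELECT max(id) FROM foo GROUP BY idFoo)
    ORDER BY id
    """
).fetchall()
print(latest_rows)  # [(1, 2, '56454'), (3, 3, '00909'), (4, 1, '64421')]
```

The result matches the table the question asks for. The original idFoo IN (1, 2, 3) filter is not part of this query; it could be added as a further condition in the outer WHERE.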

In plain SQL you can do it like this.
You perform a self join and look for a row with the same idFoo but a bigger id. If there is none, the joined columns are NULL, and you know the current row has the biggest id in its group.
SELECT t1.id, t1.idFoo, t1.serialNumber
FROM foo as t1
LEFT JOIN foo as t2
ON t1.id < t2.id
AND t1.idFoo = t2.idFoo
WHERE t2.idFoo IS NULL
So check how to do a left join in knex.js.
EDIT:
After checking the documentation I built this (not tested):
knex.select('t1.*')
  .from('foo as t1')
  .leftJoin('foo as t2', function() {
    this.on('t1.id', '<', 't2.id')
      .andOn('t1.idFoo', '=', 't2.idFoo')
  })
  .whereNull("t2.idFoo")
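Since the Knex version is untested, here is a quick way to at least verify the underlying SQL, sketched with Python's sqlite3 module and the sample rows from the question. SQLite is an assumption made for convenience, and the ORDER BY is added only to make the output deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE foo (id INTEGER PRIMARY KEY, idFoo INTEGER, serialNumber TEXT)"
)
conn.executemany(
    "INSERT INTO foo VALUES (?, ?, ?)",
    [(1, 2, "56454"), (2, 3, "75757"), (3, 3, "00909"), (4, 1, "64421")],
)

# A row survives only if no other row has the same idFoo and a bigger id
newest_rows = conn.execute(
    """
    SELECT t1.id, t1.idFoo, t1.serialNumber
    FROM foo AS t1
    LEFT JOIN foo AS t2
      ON t1.id < t2.id
     AND t1.idFoo = t2.idFoo
    WHERE t2.idFoo IS NULL
    ORDER BY t1.id
    """
).fetchall()
print(newest_rows)  # [(1, 2, '56454'), (3, 3, '00909'), (4, 1, '64421')]
```

The row with id 2 is dropped because id 3 has the same idFoo and a bigger id.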

How can I get a pivot table with concatenated values?

I have the following data:
| ID | TYPE | USER_ID |
|----------|----------|----------|
| 1 | A | 7 |
| 1 | A | 8 |
| 1 | B | 6 |
| 2 | A | 9 |
| 2 | B | 5 |
I'm trying to create a query to return
| ID | RESULT |
|----------|----------|
| 1 | 7, 8, 6 |
| 2 | 9, 5 |
The USER_ID values must be ordered by the TYPE attribute.
Since I'm using MS ACCESS, I'm trying to pivot. What I've tried:
TRANSFORM first(user_id)
SELECT id, type
FROM mytable
GROUP BY id, type
ORDER BY type
PIVOT user_id
Error:
Too many crosstab column headers (4547).
I'm probably missing something in the syntax. In any case the approach seems wrong, since the first() aggregate would need to be replaced with something that concatenates the results.
PS: I'm using MS-ACCESS 2007. If you know a solution for SQL-Server or Oracle using only SQL (without vendor functions or stored procedures), I'll probably accept your answer since it will help me to find a solution for this problem.

You don't want to use PIVOT. Pivot will create a column named after each of your user IDs (1 - 7). Your TYPE field doesn't seem to do anything either.
Unfortunately, doing this in SQL Server requires the use of a function (FOR XML Path) that's not available in Access.
Here's a link with a similar Access function to do something similar.

Oracle view grouping elements [duplicate]

This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Oracle: Combine multiple results in a subquery into a single comma-separated value
Hi there,
this is my problem...
I have a table:
+------+------+------+
| CODE | NAME | TYPE |
+------+------+------+
| 1 | AAA | x |
+------+------+------+
| 2 | BBB | x |
+------+------+------+
| 3 | CCC | y |
+------+------+------+
| 4 | DDD | y |
+------+------+------+
I want to create a view in Oracle whose result looks like this:
+---------+------+
| NAME | TYPE |
+---------+------+
| AAA;BBB | x |
+---------+------+
| CCC;DDD | y |
+---------+------+
Can I group AAA and BBB, because they have the same TYPE, so that in the view NAME becomes "AAA;BBB"? In other words, I want to group the names per type and separate them with ;
Can anyone help me?
Regards,
Tommaso

Tim Hall has a page that covers the various string aggregation techniques available in Oracle depending on the Oracle version, what packages are installed in the database, and whether you can create new procedures to support this or whether you want it done in pure SQL.
If you are using 11.2 or later, the simplest option would be to use the built-in LISTAGG function (the alias is needed if the query is used in a view)
SELECT listagg(name, ';') within group (order by code) AS name, type
FROM your_table
GROUP BY type
If you are using an earlier version, my preference would be to use the custom aggregate function (Tim's string_agg).
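For readers who want to see the intended result without an Oracle instance, this is a rough plain-Python equivalent of the LISTAGG query above, using the sample table from the question. It only illustrates the semantics; listagg_by_type is a made-up helper name.

```python
from itertools import groupby

# Sample table from the question: (CODE, NAME, TYPE)
table = [(1, "AAA", "x"), (2, "BBB", "x"), (3, "CCC", "y"), (4, "DDD", "y")]


def listagg_by_type(rows, sep=";"):
    """Join NAME values per TYPE, ordered by CODE within each group,
    mirroring listagg(name, ';') within group (order by code)."""
    # groupby needs its input sorted by the grouping key (TYPE) first
    ordered = sorted(rows, key=lambda r: (r[2], r[0]))
    return [
        (sep.join(name for _, name, _ in group), type_)
        for type_, group in groupby(ordered, key=lambda r: r[2])
    ]


view_rows = listagg_by_type(table)
print(view_rows)  # [('AAA;BBB', 'x'), ('CCC;DDD', 'y')]
```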
