I have a table similar to the sample below:
id  value  group
a   2      0
b   3      0
c   4      0
d   6      0
e   4      0
f   3      1
g   2      1
h   1      1
i   0      1
j   3      1
The group column marks which data group each row belongs to: rows with group = 0 form the first group, and rows with group = 1 form the second. I want to create a new table with basic statistics (mean, median, standard deviation, variance, etc.) for each group.
One thing that I realize is that I need to aggregate the rows by group first in order to compute those statistics.
The desired table would be like this:
group  max  min  mean  median  stddev  var
0      6    2    3.8
1      3    0
2
How should I write my Spark SQL to get the desired result?
Thank you in advance.
import spark.implicits._
import org.apache.spark.sql.functions._

val sourceDF = Seq(
  ("a", 2, 0),
  ("b", 3, 0),
  ("c", 4, 0),
  ("d", 6, 0),
  ("e", 4, 0),
  ("f", 3, 1),
  ("g", 2, 1),
  ("h", 1, 1),
  ("i", 0, 1),
  ("j", 3, 1)
).toDF("id", "value", "group")

val resDF = sourceDF
  .groupBy("group")
  .agg(
    max("value"),
    min("value"),
    mean("value")
  )

resDF.show(false)
// +-----+----------+----------+----------+
// |group|max(value)|min(value)|avg(value)|
// +-----+----------+----------+----------+
// |1 |3 |0 |1.8 |
// |0 |6 |2 |3.8 |
// +-----+----------+----------+----------+
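Since the question also asks for the median, standard deviation, and variance, here is a minimal sketch of the full aggregation written as Spark SQL, assuming the DataFrame is registered as a temp view (the view name source is just an example), and using percentile_approx for the median:

sourceDF.createOrReplaceTempView("source")

spark.sql("""
  SELECT `group`,
         max(value)                    AS max,
         min(value)                    AS min,
         avg(value)                    AS mean,
         percentile_approx(value, 0.5) AS median,  -- approximate median (approx_percentile in some versions)
         stddev(value)                 AS stddev,  -- sample standard deviation
         variance(value)               AS var      -- sample variance
  FROM source
  GROUP BY `group`
""").show(false)

The backticks keep the reserved word group usable as a column name; stddev and variance compute the sample (not population) statistics.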
Hi, I want to write a Presto SQL query for a data table (say user_data) that looks like this:
user | target | result
-----------------------------
1 | b | {A: 1}
2 | a | {C: 2}
1 | c | {A: 2, B: 3}
2 | d | {A: 1}
1 | d | {C: 4}
With this data table, I would like to generate the following two outputs.
Output 1: Count the number of unique targets for each result key for each user. For example, user 1 has 2 targets (b and c) whose results contain key A, and one target each for key B (target c) and key C (target d).
user | result
-------------------
1 | {A: 2, B:1, C:1}
2 | {A: 1, C: 1}
Output 2: Aggregate the last column based on the targets of the user.
user | result
-------------------
1 | {A:[b,c], B:[c], C:[d]}
2 | {A:[d], C:[a]}
Or, even better, can we make one table that has both columns?
user | result 1 | result 2
--------------------------------------------------
1 | {A:[b,c], B:[c], C:[d]} | {A: 2, B:1, C:1}
2 | {A:[d], C:[a]} | {A: 1, C: 1}
Can anyone help me with it? I would really appreciate it.
I'm pretty new to SQL, so I don't even know how to start.
This can be achieved with map aggregate functions. Assuming that result is originally a map, you can flatten it with unnest, then group by user and use the multimap_agg and histogram functions:
-- sample data
WITH dataset(user, target, result) AS (
    VALUES (1, 'b', map(array['A'], array[1])),
           (2, 'a', map(array['C'], array[2])),
           (1, 'c', map(array['A', 'B'], array[2, 3])),
           (2, 'd', map(array['A'], array[1])),
           (1, 'd', map(array['C'], array[4]))
)
-- query
SELECT user, multimap_agg(k, target), histogram(k)
FROM dataset,
     unnest(result) AS t(k, v)
GROUP BY user;
Output:
user | _col1                    | _col2
-----+--------------------------+----------------
2    | {A=[d], C=[a]}           | {A=1, C=1}
1    | {A=[b, c], B=[c], C=[d]} | {A=2, B=1, C=1}
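Note that this single query already covers the "even better" request: both aggregates come back as columns of one result table. To get friendlier names than _col1/_col2, you can alias the columns; a small variation on the query above (the alias names are just examples):

-- same sample data as above
WITH dataset(user, target, result) AS (
    VALUES (1, 'b', map(array['A'], array[1])),
           (2, 'a', map(array['C'], array[2])),
           (1, 'c', map(array['A', 'B'], array[2, 3])),
           (2, 'd', map(array['A'], array[1])),
           (1, 'd', map(array['C'], array[4]))
)
SELECT user,
       multimap_agg(k, target) AS targets_per_key,  -- Output 2
       histogram(k)            AS counts_per_key    -- Output 1
FROM dataset,
     unnest(result) AS t(k, v)
GROUP BY user;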
Hi, I want to write a Presto SQL query for a data table (say user_data) that looks like this:
user | target | result
-----------------------------
1 | b | {A: 1}
2 | a | {C: 2}
1 | c | {A: 2, B: 3}
2 | d | {A: 1}
1 | d | {C: 4}
With this data table, I would like to generate the following two outputs.
Output 1: Sum the values of the {key: value} dictionary per user, regardless of target.
user | result
-------------------
1 | {A:3, B:3, C:4}
2 | {A:1, C:2}
Output 2: Aggregate the last column based on the targets of the user.
user | result
-------------------
1 | {A:[b,c], B:[c], C:[d]}
2 | {A:[d], C:[a]}
Can anyone help me with it? I would really appreciate it.
The second one can easily be achieved with multimap_agg (add transform_values with array_distinct to remove duplicates if needed):
-- sample data
WITH dataset(user, target, result) AS (
    VALUES (1, 'b', map(array['A'], array[1])),
           (2, 'a', map(array['C'], array[2])),
           (1, 'c', map(array['A', 'B'], array[1, 2]))
)
-- query
SELECT user, multimap_agg(k, target)
FROM dataset,
     unnest(result) AS t(k, v)
GROUP BY user;
Output:
user | _col1
-----+-------------------
1    | {A=[b, c], B=[c]}
2    | {C=[a]}
As for the first one, you can look into map_union_sum if it is available in your version of Presto, or use some magic with unnest and transform_values:
-- query
SELECT user,
       transform_values(
           multimap_agg(k, v),
           (k, v) -> reduce(v, 0, (s, x) -> s + x, s -> s)  -- or array_sum if available
       )
FROM dataset,
     unnest(result) AS t(k, v)
GROUP BY user;
Output:
user | _col1
-----+------------
1    | {A=2, B=2}
2    | {C=2}
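An alternative sketch that avoids the lambda entirely: sum per (user, key) in a subquery, then fold the pairs back into a map with map_agg (standard Presto functions; the subquery alias is arbitrary):

-- same sample data as above
WITH dataset(user, target, result) AS (
    VALUES (1, 'b', map(array['A'], array[1])),
           (2, 'a', map(array['C'], array[2])),
           (1, 'c', map(array['A', 'B'], array[1, 2]))
)
-- sum per (user, key) first, then rebuild the map
SELECT user, map_agg(k, s)
FROM (
    SELECT user, k, sum(v) AS s
    FROM dataset,
         unnest(result) AS t(k, v)
    GROUP BY user, k
) per_key
GROUP BY user;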
I have counter data stored in a Hive table. The counter increments over time and is sometimes reset to zero.
I want to calculate the difference between consecutive rows, but in case of a counter reset the difference is negative. Example data and expected output:
data:       1, 3, 6, 7, 1, 4
difference: 2, 3, 1, -6, 3, NA
expected:   2, 3, 1, 1, 3, NA
Usually such an operation is done by calculating a lag and subtracting it from the data. In case of a negative difference, we should just take the value from the lag instead. Here is an example function which does this in R/dplyr:
diff_counter <- function(x) {
  # difference between consecutive measurements
  lag <- lag(x)
  dx <- x - lag
  reset_idx <- dx < 0 & !is.na(dx)
  dx[reset_idx] <- lag[reset_idx]
  return(dx)
}
Can I do something similar in Hive?
Regards
Paweł
Assuming that t is your datetime column and the counter gets incremented in that order, you may use a CASE expression with the LEAD function, like this:
SELECT x,
       CASE
           WHEN LEAD(x) OVER (ORDER BY t) - x > 0
               THEN LEAD(x) OVER (ORDER BY t) - x
           ELSE LEAD(x) OVER (ORDER BY t)
       END AS diff
FROM yourtable;
| X | DIFF |
|---|--------|
| 1 | 2 |
| 3 | 3 |
| 6 | 1 |
| 7 | 1 |
| 1 | 3 |
| 4 | (null) |
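The R function attaches the difference to the later row via lag(); if you want that orientation instead, a sketch of the LAG equivalent (same assumed table and columns as above; on a reset it keeps the new counter value, matching the expected output):

SELECT x,
       CASE
           WHEN LAG(x) OVER (ORDER BY t) IS NULL THEN NULL  -- first row has no previous value
           WHEN x - LAG(x) OVER (ORDER BY t) >= 0
               THEN x - LAG(x) OVER (ORDER BY t)
           ELSE x  -- counter was reset; keep the new value
       END AS diff
FROM yourtable;

With this variant the NULL lands on the first row instead of the last.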
I have a simple table of supplements
create table consumption (
account bigint not null,
date date not null,
supplement text not null check (supplement in ('multiVitamin', 'calMag', 'omega3', 'potassium', 'salt', 'antiOxidant', 'enzymes')),
quantity integer not null default 0
);
And I want to fetch what people have consumed per day. This would be an example of my desired output:
[
{
"date" : "2016-01-01",
"multiVitamin" : 7,
"calMag" : 0,
"omega3" : 3,
"potassium" : 3,
"salt" : 2,
"antiOxidant" : 0,
"enzymes" : 1
},
{
"date" : "2016-01-02",
"multiVitamin" : 2,
"calMag" : 1,
"omega3" : 1,
"potassium" : 2,
"salt" : 2,
"antiOxidant" : 0,
"enzymes" : 1
}
]
I'm confused about how to get those values into a JSON object and coalesce them so that 0 is returned if there aren't any supplements entered for that day. So every day should return all supplements. This is what I have so far. It's very far from complete, but at least it fetches rows for the selected dates.
WITH duration_amount AS (
    SELECT date_trunc('day', date)::date AS date_group,
           json_build_object('quantity', SUM(consumption.quantity))::jsonb -> 'quantity' AS supplement
    FROM consumption
    WHERE account = 1667
    GROUP BY date_group
)
SELECT DISTINCT date_group, supplement
FROM (
    SELECT generate_series(date_trunc('day', '2016-10-20'::date), '2016-10-28'::date, '1 day') AS date_group
) x
LEFT JOIN duration_amount USING (date_group)
ORDER BY date_group DESC
Example data:
insert into consumption values
(1667, '2016-10-21', 'multiVitamin', 1),
(1667, '2016-10-21', 'calMag', 2),
(1667, '2016-10-22', 'multiVitamin', 3),
(1667, '2016-10-22', 'calMag', 4),
(1667, '2016-10-22', 'omega3', 5);
You should prepare a template table containing rows for all possible values. In the example it will contain 14 rows (a cross join of 2 days with 7 supplements). Next, left join your table to it, using coalesce() for missing values:
select
date_group::date as date,
supplement_group as supplement,
coalesce(quantity, 0) quantity
from generate_series('2016-10-21'::date, '2016-10-22', '1 day') as date_group
cross join (
values
('multiVitamin'), ('calMag'), ('omega3'),
('potassium'), ('salt'), ('antiOxidant'), ('enzymes')
) as supplements(supplement_group)
left join consumption
on date_group = date
and supplement = supplement_group
and account = 1667;
date | supplement | quantity
------------+--------------+----------
2016-10-21 | multiVitamin | 1
2016-10-21 | calMag | 2
2016-10-21 | omega3 | 0
2016-10-21 | potassium | 0
2016-10-21 | salt | 0
2016-10-21 | antiOxidant | 0
2016-10-21 | enzymes | 0
2016-10-22 | multiVitamin | 3
2016-10-22 | calMag | 4
2016-10-22 | omega3 | 5
2016-10-22 | potassium | 0
2016-10-22 | salt | 0
2016-10-22 | antiOxidant | 0
2016-10-22 | enzymes | 0
(14 rows)
The result can easily be aggregated to jsonb.
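A minimal sketch of that final step, assuming Postgres 9.5+ for jsonb_build_object and jsonb_object_agg, and wrapping the query above in a CTE (the names per_day, day_obj, and days are just illustrative):

WITH per_day AS (
    -- the query from above
    SELECT date_group::date AS date,
           supplement_group AS supplement,
           coalesce(quantity, 0) AS quantity
    FROM generate_series('2016-10-21'::date, '2016-10-22', '1 day') AS date_group
    CROSS JOIN (
        VALUES ('multiVitamin'), ('calMag'), ('omega3'),
               ('potassium'), ('salt'), ('antiOxidant'), ('enzymes')
    ) AS supplements(supplement_group)
    LEFT JOIN consumption
        ON date_group = date AND supplement = supplement_group AND account = 1667
)
-- build one jsonb object per day, then fold them into a single jsonb array
SELECT jsonb_agg(day_obj ORDER BY date)
FROM (
    SELECT date,
           jsonb_build_object('date', date) || jsonb_object_agg(supplement, quantity) AS day_obj
    FROM per_day
    GROUP BY date
) days;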
I am looking at creating a lookup table to join with one of our existing tables. The structure of the existing table is as follows:
Version| CompanyNumber|EffDate |ExpDate |Indicator
------------------------------------------------------
1 | 2 |xx/xx/xxxx|xx/xx/xxxx| 0
2 | 2 |xx/xx/xxxx|xx/xx/xxxx| 1
The new table has the following structure and should be populated like so:
ID | Version | Form
---------------------
1 | 1 | 1
2 | 1 | 2
3 | 1 | 3
4 | 2 | 3
What I am struggling with is populating the new table from the data in the example above: if the indicator is 0, I want to add forms 1, 2, and 3 for that version, and if the indicator is 1, I only want to add form 3.
Thanks in advance
You can use a query like this to perform the INSERT:
INSERT INTO Table2(Version, Form)
SELECT Version, x.v
FROM Table1
INNER JOIN (VALUES (1, 3), (2, 2), (3, 1)) AS x(i, v)
ON IIF(Table1.Indicator = 0, 3, Table1.Indicator) >= x.i
If Indicator is equal to 0, then 3 rows are inserted; otherwise only 1 row is inserted.
Note: I assume that the ID field of Table2 is an IDENTITY field.
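To see how the join behaves, a sketch tracing the sample data end to end (hypothetical concrete dates stand in for the xx/xx/xxxx placeholders, and Table2.ID is assumed to be IDENTITY):

-- Hypothetical setup mirroring the question's tables
CREATE TABLE Table1 (Version int, CompanyNumber int, EffDate date, ExpDate date, Indicator int);
CREATE TABLE Table2 (ID int IDENTITY(1,1), Version int, Form int);

INSERT INTO Table1 VALUES
    (1, 2, '2024-01-01', '2024-12-31', 0),
    (2, 2, '2024-01-01', '2024-12-31', 1);

INSERT INTO Table2 (Version, Form)
SELECT Version, x.v
FROM Table1
INNER JOIN (VALUES (1, 3), (2, 2), (3, 1)) AS x(i, v)
    ON IIF(Table1.Indicator = 0, 3, Table1.Indicator) >= x.i;

-- Version 1 (Indicator 0) matches all three rows of x, yielding forms 3, 2, 1;
-- version 2 (Indicator 1) matches only (1, 3), yielding form 3.
SELECT * FROM Table2 ORDER BY Version, Form;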