I have a table similar to the sample below:
id  value  group
a   2      0
b   3      0
c   4      0
d   6      0
e   4      0
f   3      1
g   2      1
h   1      1
i   0      1
j   3      1
The group column marks which data group each row belongs to: rows with group = 0 form the first group, and rows with group = 1 form the second. I want to create a new table with basic statistics (mean, median, standard deviation, variance, etc.) for each group.
One thing that I realize is that I need to aggregate the rows by group first in order to compute those statistics.
The desired table would be like this:
group  max  min  mean  median  stddev  var
0      6    2    3.8
1      3    0
2
How should I write my Spark SQL to get the desired result?
Thank you in advance.
import spark.implicits._
import org.apache.spark.sql.functions._

val sourceDF = Seq(
  ("a", 2, 0),
  ("b", 3, 0),
  ("c", 4, 0),
  ("d", 6, 0),
  ("e", 4, 0),
  ("f", 3, 1),
  ("g", 2, 1),
  ("h", 1, 1),
  ("i", 0, 1),
  ("j", 3, 1)
).toDF("id", "value", "group")

val resDF = sourceDF
  .groupBy("group")
  .agg(
    max("value"),
    min("value"),
    mean("value")
  )

resDF.show(false)
// +-----+----------+----------+----------+
// |group|max(value)|min(value)|avg(value)|
// +-----+----------+----------+----------+
// |1 |3 |0 |1.8 |
// |0 |6 |2 |3.8 |
// +-----+----------+----------+----------+
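Since the question also asks for the median, standard deviation, and variance, here is a minimal sketch of the full aggregation written as Spark SQL, assuming the DataFrame is registered as a temp view (the view name source is just an example), and using percentile_approx for the median:

sourceDF.createOrReplaceTempView("source")

spark.sql("""
  SELECT `group`,
         max(value)                    AS max,
         min(value)                    AS min,
         avg(value)                    AS mean,
         percentile_approx(value, 0.5) AS median,  -- approximate median (approx_percentile in some versions)
         stddev(value)                 AS stddev,  -- sample standard deviation
         variance(value)               AS var      -- sample variance
  FROM source
  GROUP BY `group`
""").show(false)

The backticks keep the reserved word group usable as a column name; stddev and variance compute the sample (not population) statistics.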
Hi, I want to write a Presto SQL query for a data table (say user_data) that looks like this:
user | target | result
-----------------------------
1 | b | {A: 1}
2 | a | {C: 2}
1 | c | {A: 2, B: 3}
2 | d | {A: 1}
1 | d | {C: 4}
With this data table, I would like to generate the following two outputs.
Output 1: Count the number of unique targets for each result key for each user. For example, user 1 has 2 targets (b and c) whose results contain key A, and one target each for key B (target c) and key C (target d).
user | result
-------------------
1 | {A: 2, B:1, C:1}
2 | {A: 1, C: 1}
Output 2: Aggregate the last column based on the targets of the user.
user | result
-------------------
1 | {A:[b,c], B:[c], C:[d]}
2 | {A:[d], C:[a]}
Or, even better, can we make one table that has both columns?
user | result 1 | result 2
--------------------------------------------------
1 | {A:[b,c], B:[c], C:[d]} | {A: 2, B:1, C:1}
2 | {A:[d], C:[a]} | {A: 1, C: 1}
Can anyone help me with it? I would really appreciate it.
I'm pretty new to SQL, so I don't even know how to start.
This can be achieved with map aggregate functions. Assuming that result is originally a map, you can flatten it with unnest, then group by user and use the multimap_agg and histogram functions:
-- sample data
WITH dataset(user, target, result) AS (
    VALUES (1, 'b', map(array['A'], array[1])),
           (2, 'a', map(array['C'], array[2])),
           (1, 'c', map(array['A', 'B'], array[2, 3])),
           (2, 'd', map(array['A'], array[1])),
           (1, 'd', map(array['C'], array[4]))
)
-- query
SELECT user, multimap_agg(k, target), histogram(k)
FROM dataset,
     unnest(result) AS t(k, v)
GROUP BY user;
Output:
user | _col1                    | _col2
-----+--------------------------+----------------
2    | {A=[d], C=[a]}           | {A=1, C=1}
1    | {A=[b, c], B=[c], C=[d]} | {A=2, B=1, C=1}
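Note that this single query already covers the "even better" request: both aggregates come back as columns of one result table. To get friendlier names than _col1/_col2, you can alias the columns; a small variation on the query above (the alias names are just examples):

-- same sample data as above
WITH dataset(user, target, result) AS (
    VALUES (1, 'b', map(array['A'], array[1])),
           (2, 'a', map(array['C'], array[2])),
           (1, 'c', map(array['A', 'B'], array[2, 3])),
           (2, 'd', map(array['A'], array[1])),
           (1, 'd', map(array['C'], array[4]))
)
SELECT user,
       multimap_agg(k, target) AS targets_per_key,  -- Output 2
       histogram(k)            AS counts_per_key    -- Output 1
FROM dataset,
     unnest(result) AS t(k, v)
GROUP BY user;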
Hi, I want to write a Presto SQL query for a data table (say user_data) that looks like this:
user | target | result
-----------------------------
1 | b | {A: 1}
2 | a | {C: 2}
1 | c | {A: 2, B: 3}
2 | d | {A: 1}
1 | d | {C: 4}
With this data table, I would like to generate the following two outputs.
Output 1: Sum the values of the {key: value} dictionary per user, regardless of target.
user | result
-------------------
1 | {A:3, B:3, C:4}
2 | {A:1, C:2}
Output 2: Aggregate the last column based on the targets of the user.
user | result
-------------------
1 | {A:[b,c], B:[c], C:[d]}
2 | {A:[d], C:[a]}
Can anyone help me with it? I would really appreciate it.
The second one can easily be achieved with multimap_agg (add transform_values with array_distinct to remove duplicates if needed):
-- sample data
WITH dataset(user, target, result) AS (
    VALUES (1, 'b', map(array['A'], array[1])),
           (2, 'a', map(array['C'], array[2])),
           (1, 'c', map(array['A', 'B'], array[1, 2]))
)
-- query
SELECT user, multimap_agg(k, target)
FROM dataset,
     unnest(result) AS t(k, v)
GROUP BY user;
Output:
user | _col1
-----+-------------------
1    | {A=[b, c], B=[c]}
2    | {C=[a]}
As for the first one, you can look into map_union_sum if it is available in your version of Presto, or use some magic with unnest and transform_values:
-- query
SELECT user,
       transform_values(
           multimap_agg(k, v),
           (k, v) -> reduce(v, 0, (s, x) -> s + x, s -> s)  -- or array_sum if available
       )
FROM dataset,
     unnest(result) AS t(k, v)
GROUP BY user;
Output:
user | _col1
-----+------------
1    | {A=2, B=2}
2    | {C=2}
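An alternative sketch that avoids the lambda entirely: sum per (user, key) in a subquery, then fold the pairs back into a map with map_agg (standard Presto functions; the subquery alias is arbitrary):

-- same sample data as above
WITH dataset(user, target, result) AS (
    VALUES (1, 'b', map(array['A'], array[1])),
           (2, 'a', map(array['C'], array[2])),
           (1, 'c', map(array['A', 'B'], array[1, 2]))
)
-- sum per (user, key) first, then rebuild the map
SELECT user, map_agg(k, s)
FROM (
    SELECT user, k, sum(v) AS s
    FROM dataset,
         unnest(result) AS t(k, v)
    GROUP BY user, k
) per_key
GROUP BY user;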
I have counter data stored in a Hive table. The counter increments over time and is sometimes reset to zero.
I want to calculate the difference between consecutive rows, but in case of a counter reset the difference is negative. Example data and expected output:
data:       1, 3, 6, 7, 1, 4
difference: 2, 3, 1, -6, 3, NA
expected:   2, 3, 1, 1, 3, NA
Usually such an operation is done by calculating a lag and subtracting it from the data. In case of a negative difference, we should just take the value from the lag instead. Here is an example function which does this in R/dplyr:
diff_counter <- function(x) {
  # difference between consecutive measurements
  lag <- lag(x)
  dx <- x - lag
  reset_idx <- dx < 0 & !is.na(dx)
  dx[reset_idx] <- lag[reset_idx]
  return(dx)
}
Can I do something similar in Hive?
Regards
Paweł
Assuming that t is your datetime column and the counter gets incremented in that order, you may use a CASE expression with the LEAD function, like this:
SELECT x,
       CASE
           WHEN LEAD(x) OVER (ORDER BY t) - x > 0
               THEN LEAD(x) OVER (ORDER BY t) - x
           ELSE LEAD(x) OVER (ORDER BY t)
       END AS diff
FROM yourtable;
| X | DIFF |
|---|--------|
| 1 | 2 |
| 3 | 3 |
| 6 | 1 |
| 7 | 1 |
| 1 | 3 |
| 4 | (null) |
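The R function attaches the difference to the later row via lag(); if you want that orientation instead, a sketch of the LAG equivalent (same assumed table and columns as above; on a reset it keeps the new counter value, matching the expected output):

SELECT x,
       CASE
           WHEN LAG(x) OVER (ORDER BY t) IS NULL THEN NULL  -- first row has no previous value
           WHEN x - LAG(x) OVER (ORDER BY t) >= 0
               THEN x - LAG(x) OVER (ORDER BY t)
           ELSE x  -- counter was reset; keep the new value
       END AS diff
FROM yourtable;

With this variant the NULL lands on the first row instead of the last.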
I have a simple table of supplements
create table consumption (
account bigint not null,
date date not null,
supplement text not null check (supplement in ('multiVitamin', 'calMag', 'omega3', 'potassium', 'salt', 'antiOxidant', 'enzymes')),
quantity integer not null default 0
);
And I want to fetch what people have consumed per day. This would be an example of my desired output:
[
{
"date" : "2016-01-01",
"multiVitamin" : 7,
"calMag" : 0,
"omega3" : 3,
"potassium" : 3,
"salt" : 2,
"antiOxidant" : 0,
"enzymes" : 1
},
{
"date" : "2016-01-02",
"multiVitamin" : 2,
"calMag" : 1,
"omega3" : 1,
"potassium" : 2,
"salt" : 2,
"antiOxidant" : 0,
"enzymes" : 1
}
]
I'm confused about how to get those values into a JSON object and coalesce them so that 0 is returned if there aren't any supplements entered for that day. So every day should return all supplements. This is what I have so far. It's very far from complete, but at least it fetches rows for the selected dates.
WITH duration_amount AS (
    SELECT date_trunc('day', date)::date AS date_group,
           json_build_object('quantity', SUM(consumption.quantity))::jsonb -> 'quantity' AS supplement
    FROM consumption
    WHERE account = 1667
    GROUP BY date_group
)
SELECT DISTINCT date_group, supplement
FROM (
    SELECT generate_series(date_trunc('day', '2016-10-20'::date), '2016-10-28'::date, '1 day') AS date_group
) x
LEFT JOIN duration_amount USING (date_group)
ORDER BY date_group DESC
Example data:
insert into consumption values
(1667, '2016-10-21', 'multiVitamin', 1),
(1667, '2016-10-21', 'calMag', 2),
(1667, '2016-10-22', 'multiVitamin', 3),
(1667, '2016-10-22', 'calMag', 4),
(1667, '2016-10-22', 'omega3', 5);
You should prepare a template table containing rows for all possible values. In the example it will contain 14 rows (a cross join of 2 days with 7 supplements). Next, left join your table to it, using coalesce() for missing values:
select
date_group::date as date,
supplement_group as supplement,
coalesce(quantity, 0) quantity
from generate_series('2016-10-21'::date, '2016-10-22', '1 day') as date_group
cross join (
values
('multiVitamin'), ('calMag'), ('omega3'),
('potassium'), ('salt'), ('antiOxidant'), ('enzymes')
) as supplements(supplement_group)
left join consumption
on date_group = date
and supplement = supplement_group
and account = 1667;
date | supplement | quantity
------------+--------------+----------
2016-10-21 | multiVitamin | 1
2016-10-21 | calMag | 2
2016-10-21 | omega3 | 0
2016-10-21 | potassium | 0
2016-10-21 | salt | 0
2016-10-21 | antiOxidant | 0
2016-10-21 | enzymes | 0
2016-10-22 | multiVitamin | 3
2016-10-22 | calMag | 4
2016-10-22 | omega3 | 5
2016-10-22 | potassium | 0
2016-10-22 | salt | 0
2016-10-22 | antiOxidant | 0
2016-10-22 | enzymes | 0
(14 rows)
The result can easily be aggregated to jsonb.
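A minimal sketch of that final step, assuming Postgres 9.5+ for jsonb_build_object and jsonb_object_agg, and wrapping the query above in a CTE (the names per_day, day_obj, and days are just illustrative):

WITH per_day AS (
    -- the query from above
    SELECT date_group::date AS date,
           supplement_group AS supplement,
           coalesce(quantity, 0) AS quantity
    FROM generate_series('2016-10-21'::date, '2016-10-22', '1 day') AS date_group
    CROSS JOIN (
        VALUES ('multiVitamin'), ('calMag'), ('omega3'),
               ('potassium'), ('salt'), ('antiOxidant'), ('enzymes')
    ) AS supplements(supplement_group)
    LEFT JOIN consumption
        ON date_group = date AND supplement = supplement_group AND account = 1667
)
-- build one jsonb object per day, then fold them into a single jsonb array
SELECT jsonb_agg(day_obj ORDER BY date)
FROM (
    SELECT date,
           jsonb_build_object('date', date) || jsonb_object_agg(supplement, quantity) AS day_obj
    FROM per_day
    GROUP BY date
) days;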
I am looking at creating a lookup table to join with one of our existing tables. The structure of the existing table is as follows:
Version| CompanyNumber|EffDate |ExpDate |Indicator
------------------------------------------------------
1 | 2 |xx/xx/xxxx|xx/xx/xxxx| 0
2 | 2 |xx/xx/xxxx|xx/xx/xxxx| 1
The new table has the following structure and should be populated like so:
ID | Version | Form
---------------------
1 | 1 | 1
2 | 1 | 2
3 | 1 | 3
4 | 2 | 3
What I am struggling with is populating the new table from the data in the example above: if the indicator is 0, I want to add forms 1, 2, and 3 for that version, and if the indicator is 1, I only want to add form 3.
Thanks in advance
You can use a query like this to perform the INSERT:
INSERT INTO Table2(Version, Form)
SELECT Version, x.v
FROM Table1
INNER JOIN (VALUES (1, 3), (2, 2), (3, 1)) AS x(i, v)
ON IIF(Table1.Indicator = 0, 3, Table1.Indicator) >= x.i
If Indicator is equal to 0, then 3 rows are inserted; otherwise only 1 row is inserted.
Note: I assume that the ID field of Table2 is an IDENTITY field.
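To see how the join behaves, a sketch tracing the sample data end to end (hypothetical concrete dates stand in for the xx/xx/xxxx placeholders, and Table2.ID is assumed to be IDENTITY):

-- Hypothetical setup mirroring the question's tables
CREATE TABLE Table1 (Version int, CompanyNumber int, EffDate date, ExpDate date, Indicator int);
CREATE TABLE Table2 (ID int IDENTITY(1,1), Version int, Form int);

INSERT INTO Table1 VALUES
    (1, 2, '2024-01-01', '2024-12-31', 0),
    (2, 2, '2024-01-01', '2024-12-31', 1);

INSERT INTO Table2 (Version, Form)
SELECT Version, x.v
FROM Table1
INNER JOIN (VALUES (1, 3), (2, 2), (3, 1)) AS x(i, v)
    ON IIF(Table1.Indicator = 0, 3, Table1.Indicator) >= x.i;

-- Version 1 (Indicator 0) matches all three rows of x, yielding forms 3, 2, 1;
-- version 2 (Indicator 1) matches only (1, 3), yielding form 3.
SELECT * FROM Table2 ORDER BY Version, Form;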