Transposing data in Hive

I have a Hive table with data in the below format
day class start_time count kpi1 kpi2 kpi3 kpi4 ... kpi160
-----------------------------------------------------------------------
20161010 abc 00 12 1 0 null 0 ...
I want to write a Hive query that fetches the data in the format below,
applying aggregations such as max, min, and avg to each KPI.
day class start_time count kpi_name kpi_max kpi_min kpi_avg
-----------------------------------------------------------------------
20161010 abc 00 12 kpi1 max(kpi1) min(kpi1) avg(kpi1)
20161010 abc 00 12 kpi2 max(kpi2) min(kpi2) avg(kpi2)
Please suggest a solution to fetch the data in this format.
Thanks.

You need to put all the kpis in a map, explode the map to create one column, and then aggregate.
Example:
Data:
+---------+------+-----------+-------+-----+-----+------+------+------+------+
|day_ |class |start_time |count_ |kpi0 |kpi1 | kpi2 | kpi3 | kpi4 | kpi5 |
+---------+------+-----------+-------+-----+-----+------+------+------+------+
|20161010 |abc |00 |12 |1 |2 |3 |8 |9 |6 |
+---------+------+-----------+-------+-----+-----+------+------+------+------+
|20161010 |abc |00 |12 |4 |5 |null |6 |10 |null |
+---------+------+-----------+-------+-----+-----+------+------+------+------+
Query:
SELECT day_
, class
, start_time
, count_
, kpi_type
, MAX(vals) AS max_vals
, MIN(vals) AS min_vals
, AVG(vals) AS avg_vals
FROM (
SELECT day_, class, start_time, count_, kpi_type, vals
FROM database.table
LATERAL VIEW EXPLODE(MAP('kpi0', kpi0
, 'kpi1', kpi1
, 'kpi2', kpi2
, 'kpi3', kpi3
, 'kpi4', kpi4
, 'kpi5', kpi5)) et AS kpi_type, vals ) x
GROUP BY day_, class, start_time, count_, kpi_type
Output:
+---------+------+-----------+-------+---------+---------+---------+---------+
|day_ |class |start_time |count_ |kpi_type |max_vals |min_vals |avg_vals |
+---------+------+-----------+-------+---------+---------+---------+---------+
|20161010 |abc |00 |12 |kpi0 |4 |1 |2.5 |
+---------+------+-----------+-------+---------+---------+---------+---------+
|20161010 |abc |00 |12 |kpi1 |5 |2 |3.5 |
+---------+------+-----------+-------+---------+---------+---------+---------+
|20161010 |abc |00 |12 |kpi2 |3 |3 |3.0 |
+---------+------+-----------+-------+---------+---------+---------+---------+
|20161010 |abc |00 |12 |kpi3 |8 |6 |7.0 |
+---------+------+-----------+-------+---------+---------+---------+---------+
|20161010 |abc |00 |12 |kpi4 |10 |9 |9.5 |
+---------+------+-----------+-------+---------+---------+---------+---------+
|20161010 |abc |00 |12 |kpi5 |6 |6 |6.0 |
+---------+------+-----------+-------+---------+---------+---------+---------+
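With 160 KPI columns, typing the MAP(...) argument list by hand is tedious and error-prone. The following is a minimal sketch that generates the query text instead. It assumes the columns are named kpi1 through kpi160 as in the question and reuses the placeholder table name database.table from the query above; the helper name build_unpivot_query is hypothetical.

```python
def build_unpivot_query(table, kpi_columns):
    """Return HiveQL text that unpivots kpi_columns and aggregates per KPI."""
    # MAP() takes alternating key/value arguments: 'kpi1', kpi1, 'kpi2', kpi2, ...
    pairs = ",\n".join(f"        '{c}', {c}" for c in kpi_columns)
    return (
        "SELECT day_, class, start_time, count_, kpi_type,\n"
        "       MAX(vals) AS max_vals,\n"
        "       MIN(vals) AS min_vals,\n"
        "       AVG(vals) AS avg_vals\n"
        "FROM (\n"
        "    SELECT day_, class, start_time, count_, kpi_type, vals\n"
        f"    FROM {table}\n"
        "    LATERAL VIEW EXPLODE(MAP(\n"
        f"{pairs}\n"
        "    )) et AS kpi_type, vals\n"
        ") x\n"
        "GROUP BY day_, class, start_time, count_, kpi_type"
    )


# Assumed column names: kpi1 .. kpi160, as described in the question.
kpi_columns = [f"kpi{i}" for i in range(1, 161)]
query = build_unpivot_query("database.table", kpi_columns)
print(query)
```

The printed text can be pasted into the Hive client; only the list of column names needs to change if the table layout differs.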

If you want the min, max, and avg, you have to specify the GROUP BY column, and every non-aggregated column in the SELECT list must also appear in the GROUP BY. Let's suppose you want to group by day:
SELECT day,
MAX(kpi1) AS max_kpi1,
MIN(kpi1) AS min_kpi1,
AVG(kpi1) AS avg_kpi1
FROM table
GROUP BY day

Related

How to fill a column to differentiate a set of rows from other rows in a group in Impala?

I have the following table in Impala.
|LogTime|ClientId|IsNewSession|
|1 |123 |1 |
|2 |123 | |
|3 |123 | |
|3 |666 |1 |
|4 |666 | |
|10 |123 |1 |
|23 |666 |1 |
|24 |666 | |
|25 |444 |1 |
|26 |444 | |
I want to make a new table as follows:
|LogTime|ClientId|IsNewSession|SessionId|
|1 |123 |1 |1 |
|2 |123 | |1 |
|3 |123 | |1 |
|3 |666 |1 |1 |
|4 |666 | |1 |
|10 |123 |1 |2 |
|23 |666 |1 |2 |
|24 |666 | |2 |
|25 |444 |1 |1 |
|26 |444 | |1 |
Basically, I want to make SessionId column that has a unique session ID per set of rows until there's a value of 1 in IsNewSession column after group by ClientId, to differentiate different sessions per ClientId.
I've made IsNewSession column to do so, but not sure how to iterate on the rows to make SessionId column.
Any help would be greatly appreciated!
You can use a cumulative sum:
select t.*,
sum(isnewsession) over (partition by clientid order by logtime) as sessionid
from t;
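To see why the cumulative sum yields a session ID, here is a small plain-Python sketch of the same logic applied to the sample rows from the question. It is only an illustration of what the window function computes, not Impala code; empty IsNewSession cells are assumed to be NULL and are treated as adding nothing to the sum, and the function name assign_session_ids is hypothetical.

```python
from itertools import groupby

# Sample rows from the question: (LogTime, ClientId, IsNewSession).
# None stands for an empty (NULL) IsNewSession cell.
rows = [
    (1, 123, 1), (2, 123, None), (3, 123, None), (3, 666, 1), (4, 666, None),
    (10, 123, 1), (23, 666, 1), (24, 666, None), (25, 444, 1), (26, 444, None),
]


def assign_session_ids(rows):
    """Running sum of IsNewSession per ClientId, ordered by LogTime."""
    result = []
    # Sort by the partition key (ClientId) first, then by the order key (LogTime).
    ordered = sorted(rows, key=lambda r: (r[1], r[0]))
    for _, group in groupby(ordered, key=lambda r: r[1]):
        running = 0
        for log_time, client_id, is_new in group:
            running += is_new or 0  # NULL contributes nothing to the sum
            result.append((log_time, client_id, is_new, running))
    # Return rows in LogTime order so they line up with the table in the question.
    return sorted(result, key=lambda r: (r[0], r[1]))


sessions = assign_session_ids(rows)
for row in sessions:
    print(row)
```

Each time a row with IsNewSession = 1 is reached, the running total for that client increases by one, so all following rows share the new value until the next session start.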

How to get week number using year and day of year using pyspark?

I am trying to add row numbers to a table. I need to assign 1 to the first 7 rows in the dataframe, 2 to the second 7 rows, and so on. For example, please refer to the last column (weeknumber) in the dataframe below.
I am basically trying to get the week number based on the day of the year and the year.
+-----------+---------------+----------------+------------------+---------+----------+
|datekey |datecalendarday|datecalendaryear|weeknumberofseason|indicator|weeknumber|
+-----------+---------------+----------------+------------------+---------+----------+
|4965 |1 |2018 |2 |1 |1 |
|4966 |2 |2018 |2 |2 |1 |
|4967 |3 |2018 |2 |3 |1 |
|4968 |4 |2018 |2 |4 |1 |
|4969 |5 |2018 |2 |5 |1 |
|4970 |6 |2018 |2 |6 |1 |
|4971 |7 |2018 |3 |7 |1 |
|4972 |8 |2018 |3 |8 |2 |
|4973 |9 |2018 |3 |9 |2 |
|4974 |10 |2018 |3 |10 |2 |
|4975 |11 |2018 |3 |11 |2 |
|4976 |12 |2018 |3 |12 |2 |
|4977 |13 |2018 |3 |13 |2 |
|4978 |14 |2018 |4 |14 |2 |
+-----------+---------------+----------------+------------------+---------+----------+
I stumbled upon a solution where I use the ntile function to get the week number from the days available in that year. Any other efficient solution would also help. Thanks in advance.
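The numbering described in the question (1 for days 1 to 7, 2 for days 8 to 14, and so on) is plain integer arithmetic on the day of the year, so it may not need ntile at all. Below is a minimal sketch in plain Python; the function name week_number is hypothetical, and note that this is not the ISO week number, just consecutive blocks of seven days. In PySpark the same arithmetic could presumably be written as a column expression such as F.ceil(F.col("datecalendarday") / 7), assuming pyspark.sql.functions is imported as F; that variant is not tested here.

```python
def week_number(day_of_year):
    """Return 1 for days 1-7, 2 for days 8-14, and so on."""
    # Shift to zero-based, take whole blocks of 7, shift back to one-based.
    return (day_of_year - 1) // 7 + 1


for day in (1, 7, 8, 14, 15):
    print(day, week_number(day))
```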

Renumbering a summary section field starting from "1" in SQL Server

I have the following input table. I need a smart way to dynamically renumber the parent section indexes so that they start from "01", and show the result in a new column.
I'm using SQL Server 2014 Express SP2
MyTable:
ID Integer
SECTION Varchar
Query:
SELECT * FROM MyTable
Results:
+--+--------+
|ID|SECTION |
+--+--------+
|1 |03 |
|2 |03.01 |
|3 |03.01.01|
|4 |03.02 |
|5 |03.03 |
|6 |04 |
|7 |04.01 |
|8 |04.02 |
|9 |05 |
+--+--------+
Here is what I'm trying to achieve from my select or procedure:
+--+--------+--------+
|ID|SECTION |NEWSECT |
+--+--------+--------+
|1 |03 |01 |
|2 |03.01 |01.01 |
|3 |03.01.01|01.01.01|
|4 |03.02 |01.02 |
|5 |03.03 |01.03 |
|6 |04 |02 |
|7 |04.01 |02.01 |
|8 |04.02 |02.02 |
|9 |05 |03 |
+--+--------+--------+
This is just string operations:
select t.*,
stuff(section, 1, 2,
right(concat('00', dense_rank() over (order by left(section, 2))), 2)
) as newsect
from t;
The dense_rank() does the work of renumbering the main sections. The rest is just splicing the zero-padded value back into the first two characters of your section.
Here is a db<>fiddle.
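As a cross-check of the logic, here is a plain-Python sketch of the same idea applied to the sample data from the question: rank the distinct two-character prefixes, then replace the first two characters of each section with the zero-padded rank. It assumes every section starts with a two-digit parent index, as in the sample; the function name renumber is hypothetical.

```python
# SECTION values from the question, in ID order.
sections = ["03", "03.01", "03.01.01", "03.02", "03.03", "04", "04.01", "04.02", "05"]


def renumber(sections):
    """Replace each two-character parent prefix with its dense rank, zero-padded."""
    # Dense rank: distinct prefixes in sorted order, numbered from 1.
    prefixes = sorted({s[:2] for s in sections})
    rank = {prefix: i for i, prefix in enumerate(prefixes, start=1)}
    # Keep everything after the first two characters unchanged.
    return [f"{rank[s[:2]]:02d}{s[2:]}" for s in sections]


new_sections = renumber(sections)
for old, new in zip(sections, new_sections):
    print(old, "->", new)
```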

SQL: Need to SUM column for each type

How can I find the SUM of all scores for the minimum date of each lesson_id please:
-----------------------------------------------------------
|id |uid |group_id |lesson_id |game_id |score |date |
-----------------------------------------------------------
|1 |145 |1 |1 |0 |40 |1391627323 |
|2 |145 |1 |1 |0 |80 |1391627567 |
|3 |145 |1 |2 |0 |40 |1391627323 |
|4 |145 |1 |3 |0 |30 |1391627323 |
|5 |145 |1 |3 |0 |90 |1391627567 |
|6 |145 |1 |4 |0 |20 |1391628000 |
|7 |145 |1 |5 |0 |35 |1391628000 |
-----------------------------------------------------------
I need output:
-------------------
|sum_first_scores |
-------------------
|165 |
-------------------
I have this so far, which lists the score for each minimum date, per lesson, but I need to sum those results as above:
SELECT lesson_id, MIN(date), score AS first_score FROM cdu_user_progress
WHERE cdu_user_progress.uid = 145
GROUP BY lesson_id
You can identify the first score as the one where no earlier record exists. Then just take the sum:
select sum(score)
from cdu_user_progress eup
where eup.uid = 145 and
not exists (select 1
from cdu_user_progress eup2
where eup2.uid = eup.uid and
eup2.lesson_id = eup.lesson_id and
eup2.date < eup.date
);
This assumes that the minimum date for the lesson id has only one score.
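The NOT EXISTS approach can be checked against the sample data with a short script. This is a sketch that uses Python's built-in sqlite3 module as a stand-in for MySQL (an assumption made only so the example is self-contained), with the table name cdu_user_progress and the rows taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cdu_user_progress ("
    "id INTEGER, uid INTEGER, group_id INTEGER, lesson_id INTEGER, "
    "game_id INTEGER, score INTEGER, date INTEGER)"
)
# Sample rows from the question.
conn.executemany(
    "INSERT INTO cdu_user_progress VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, 145, 1, 1, 0, 40, 1391627323),
        (2, 145, 1, 1, 0, 80, 1391627567),
        (3, 145, 1, 2, 0, 40, 1391627323),
        (4, 145, 1, 3, 0, 30, 1391627323),
        (5, 145, 1, 3, 0, 90, 1391627567),
        (6, 145, 1, 4, 0, 20, 1391628000),
        (7, 145, 1, 5, 0, 35, 1391628000),
    ],
)

# Keep only rows for which no earlier row exists for the same user and lesson,
# then sum their scores.
total = conn.execute(
    """
    SELECT SUM(score)
    FROM cdu_user_progress eup
    WHERE eup.uid = 145
      AND NOT EXISTS (
          SELECT 1
          FROM cdu_user_progress eup2
          WHERE eup2.uid = eup.uid
            AND eup2.lesson_id = eup.lesson_id
            AND eup2.date < eup.date
      )
    """
).fetchone()[0]
print(total)  # 40 + 40 + 30 + 20 + 35 = 165
```

If two rows for the same lesson shared the minimum date, both would pass the NOT EXISTS filter and both scores would be added, which is the caveat stated above.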

SQL - get rows where column is greater than certain amount

I need to get the sum of the scores for the first of each lesson_id, but I also need the overall min and max scores for all lesson_ids as well as some other info:
cdu_groups:
----------------
|id |name |
----------------
|1 |group_1 |
|2 |group_2 |
----------------
cdu_user_progress145:
-----------------------------------------------------------
|id |uid |group_id |lesson_id |game_id |score |date |
-----------------------------------------------------------
|1 |145 |1 |1 |0 |40 |1391627323 |
|2 |145 |1 |1 |0 |80 |1391627567 |
|3 |145 |1 |2 |0 |40 |1391627323 |
|4 |145 |1 |3 |0 |30 |1391627323 |
|5 |145 |1 |3 |0 |90 |1391627567 |
|6 |145 |1 |4 |0 |20 |1391627323 |
|7 |145 |1 |5 |0 |35 |1391627323 |
-----------------------------------------------------------
I need this output:
-----------------------------------------------------------------
|name |group_id |min_score |max_score |... |sum_first_scores |
-----------------------------------------------------------------
|group_1 |1 |20 |90 |... |165 |
-----------------------------------------------------------------
SELECT
cdu_groups.*,
MAX(score) AS max_score,
MIN(score) AS min_score,
COUNT(DISTINCT(lesson_id)) AS scored_lesson_count,
COUNT(DISTINCT CASE WHEN score >= 75 then lesson_Id ELSE NULL END) as passed_lesson_count,
SUM(first_scores.first_score) AS sum_first_scores
FROM cdu_user_progress
JOIN cdu_groups ON cdu_groups.id = cdu_user_progress.group_id
JOIN
(
SELECT lesson_id, MIN(date), score AS first_score FROM cdu_user_progress
WHERE cdu_user_progress.uid = 145
GROUP BY lesson_id
) AS first_scores ON first_scores.lesson_id = cdu_user_progress.lesson_id
WHERE cdu_user_progress.uid = 145
I'm getting this error though:
You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'SUM(first_scores.first_score) AS sum_first_scores FROM cdu_user_progress ' at line 7