Impala: transpose column to row (Hive solution does not work)

How do I transpose column data to row data in Impala?
I have tried some solutions that work in Hive but do not work in Impala.
Table name: test
Data:
day name jobdone
2017-03-25 x_user 5
2017-03-25 y_user 10
2017-03-31 x_user 20
2017-03-31 y_user 1
I want the data to look like this in Impala, not in Hive.
Required Output Data
Day x_user y_user
2017-03-25 5 10
2017-03-31 20 1
I am able to do this in Hive using map and collect_list. How can I do it in Impala?

Using case + min() or max() aggregation:
select day,
max(case when name='x_user' then jobdone end) x_user,
max(case when name='y_user' then jobdone end) y_user
from test
group by day;
Use sum() instead of max() if there can be many records per user and day and you need to sum them.
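For instance, the sum() variant of the same pivot:
select day,
sum(case when name='x_user' then jobdone end) x_user,
sum(case when name='y_user' then jobdone end) y_user
from test
group by day;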

Related

How to segregate monthly data from daily data using SQL or BigQuery?

We are receiving data monthly for some ids and daily for other ids. I need to segregate the data as monthly or daily before applying the logic required for further analysis.
I have tried to use the DATEDIFF SQL function to do that, but it is not really helpful in this case. Is there any way to segregate the daily data from the monthly data using SQL or BigQuery?
Id date
55 11-02-2022 00:00
66 15-05-2022 00:00
77 13-08-2022 00:00
66 15-07-2022 00:00
77 12-08-2022 00:00
55 12-02-2022 00:00
66 15-06-2022 00:00
A count aggregation per id per month is an efficient way to separate the two groups. The open question is whether a single ID consistently updates at the same cadence; if it does not, you must decide whether to treat it differently per time period, to treat it as daily if it is ever daily, etc.
Here's the basic logic to categorize the two groups by month:
select
id
, date_trunc(date, MONTH) as mo
, count(*) > 1 as is_daily
from tbl
group by 1,2;
That gives you a per-id, per-month categorization.
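For the sample data above (dates are DD-MM-YYYY), the result would look roughly like this; note that date_trunc returns the first day of each month:
id  mo          is_daily
55  2022-02-01  true
66  2022-05-01  false
66  2022-06-01  false
66  2022-07-01  false
77  2022-08-01  true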
Here is the same logic taken a step further, to categorize any ID that ever receives daily updates as daily:
with bins as (
select
id
, date_trunc(date, MONTH) as mo
, count(*) > 1 as is_daily
from tbl
group by 1,2
)
select
id
, sum(case when is_daily then 1 else 0 end) > 0 as is_daily
from bins
group by 1;
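In BigQuery specifically, the sum(case ...) wrapper can be written more directly with the logical_or() aggregate; assuming the same bins CTE as above:
select
id
, logical_or(is_daily) as is_daily
from bins
group by 1;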

How to distribute data in a table so that the counts remain close to a given value

I have a table TEST with 14.5 million records (columns id and created_date). The end goal is to split this table into approximately N splits; let's assume 15 in this case, so close to a million records each. I'm using created_date to split this data.
I've come up with the query below.
with cte as (
select created_date,
ntile(15) over (order by created_date) as created_date_range
from TEST
)
select created_date_range ,min(created_date),max(created_date),count(*)
from cte
group by created_date_range
order by created_date_range ;
I get the desired result, with the table being split into 15 equal parts. Here's an example of the data I get:
created_date_range  min(created_date)    max(created_date)    count(*)
1                   2022-04-14 00:00:02  2022-05-02 22:56:40  946455
2                   2022-05-02 22:56:40  2022-05-21 17:10:20  946455
3                   2022-05-21 17:10:21  2022-06-15 20:16:47  946455
...
14                  2022-10-24 18:55:22  2022-11-04 17:12:26  946454
15                  2022-11-04 17:12:26  2022-11-18 06:01:08  946454
How can I avoid the same date's data being distributed into two different ranges?
Am I doing this correctly? Is there another way of achieving the result?
I tried to use the ceil function, but I had issues with the group by statement.
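One way to keep all rows that share a created_date value in a single range, as a sketch: apply ntile() to the distinct created_date values instead of to the raw rows, carrying the per-date row counts along. The bucket row counts then stay close to equal only as long as no single created_date value dominates.
with dates as (
select created_date, count(*) as cnt
from TEST
group by created_date
),
cte as (
select created_date, cnt,
ntile(15) over (order by created_date) as created_date_range
from dates
)
select created_date_range, min(created_date), max(created_date), sum(cnt)
from cte
group by created_date_range
order by created_date_range;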

Impala: values are in wrong columns in result query

In my result query the values end up in the wrong columns.
My SQL query is like:
create table some_database.table_name as
select
extract(year from t.operation_date) operation_year,
extract(month from t.operation_date) operation_month,
extract(day from t.operation_date) operation_day,
d.status_name,
sum(t.operation_amount) operation_amt,
current_timestamp() calculation_moment
from operations t
left join status_dict d on
d.status_id = t.status_id
group by
extract(year from t.operation_date),
extract(month from t.operation_date),
extract(day from t.operation_date),
d.status_name
(In fact it's more complicated, but the main idea is that I'm aggregating the source table and making some joins.)
The result I get is like:
#  operation_year  operation_month      operation_day  status_name  operation_amt
1  2021            1                    1              success      100
2  2021            1                    1              success      150
3  2021            1                    2              success      120
4  null            2021-01-01 21:53:00  success        120          null
The problem is in row 4.
The field t.operation_date is not nullable, but in the result the column operation_year contains null.
In operation_month we get an untruncated timestamp.
In operation_day we get the string value from d.status_name.
In status_name we get the numeric aggregate from t.operation_amount.
In operation_amt we get null.
It looks very similar to the wrong parsing of a CSV file, where values jump into other columns, but obviously that can't be the case here. I can't figure out how on earth this is possible. I'm new to Hadoop, and apparently I'm not aware of some important concept that causes the problem.

Calculating Average over date

I'm trying to calculate an average of a value over time. I created a tablix in SSRS and used this expression in my query:
Avg(case when CounterName = 'Count 1' then calculationUnits else 0 end) as Average
The average shown in the picture is:
=SUM(Fields!Prod.Value)/3
I divided the sum of the values by 3 (there are 3 shifts in a day), but I would like to have an average over the dates.
Can I use something in my query like:
OVER(PARTITION BY [intervaldate])
I would like to divide the sum for each machine by the number of distinct date values.
Try getting just the rows for each date in your dataset and use the SSRS functions to get the average and the sum.
To filter your query in SQL, use the parameters inside the SQL statement.
Your query should look something like this:
select
*
from
yourData
where
date BETWEEN @StartDate AND @EndDate
Example data for this statement:
yourDate M101 M102 M103
2015-12-24 12:00:00 100 34 54
2015-12-25 12:00:00 25 67 87
2015-12-26 12:00:00 30 434 54
2015-12-27 12:00:00 140 42 65
2015-12-28 12:00:00 21 66 77
Now create a tablix containing the data in your SSRS report. Add two new rows outside of your detail group:
Now add expressions in the new rows to get the sum and the average; example for M101:
for sum:
=Sum(Fields!M101.Value)
for average (over date):
=SUM(Fields!Prod.Value) / COUNTDISTINCT(Fields!DateValue.Value)
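Applied to the example data above, the M101 sum would be 100 + 25 + 30 + 140 + 21 = 316, and with 5 distinct date values the average over date comes out to 316 / 5 = 63.2.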
I don't really see the relationship between your snippet of code and the example data. However, the logic that you want is probably:
Avg(case when CounterName = 'Count 1' then calculationUnits end) as Average
Note that there is no else clause. Without the else, the case expression evaluates to NULL, which is ignored by avg().
Your expression has else 0. This is treated as a legitimate value, so non-matching rows affect the final result.
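A minimal sketch of the difference, using made-up values where two rows match ('Count 1' with 10 and 20) and two rows do not:
select
avg(case when CounterName = 'Count 1' then calculationUnits end) as avg_without_else, -- (10 + 20) / 2 = 15
avg(case when CounterName = 'Count 1' then calculationUnits else 0 end) as avg_with_else -- (10 + 20 + 0 + 0) / 4 = 7.5
from (values
('Count 1', 10.0),
('Count 1', 20.0),
('Other', 99.0),
('Other', 1.0)
) as t (CounterName, calculationUnits);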

SQL Server: multiple rows, ignoring some and returning others

Good afternoon folks,
I have a database table which contains data about sickness for employees. It looks like this:
Date ID State
2013-05-03 00:00:00 0002371 Working
2013-05-03 00:00:00 0002622 Working
2013-05-03 00:00:00 0005590 Working
2013-05-03 00:00:00 0005590 Sick
If you'll notice, ID 0005590 has two entries: one for working (they were scheduled to) and another to say they phoned in sick. So if I query the database for that user, I get two rows of results.
What I'd like to do is return only the sick row if the person was sick, ignoring their working row, so I end up with...
Date ID State
2013-05-03 00:00:00 0002371 Working
2013-05-03 00:00:00 0002622 Working
2013-05-03 00:00:00 0005590 Sick
I'm running SQL Server 2005.
Any ideas ladies and gents?
Much appreciated.
D
You can do this with row_number() and case:
select [date], id, state
from (select t.*,
row_number() over (partition by id
order by (case when state = 'Sick' then 1 else 0 end) desc
) as seqnum
from t
) t
where seqnum = 1;
What this does is assign a sequential number to each id (based on the partition by clause). A row that contains "Sick" is assigned the value 1 (if present); otherwise, the "Working" row is assigned 1. The filter keeps only the first row.
Note that this returns only one row per id. If you could have multiple "working" or "sick" rows, then you can use rank() instead of row_number().
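A sketch of that rank() variant; because rank() gives ties the same number, every "Sick" row for an id is returned when any exists, and otherwise every "Working" row is:
select [date], id, state
from (select t.*,
rank() over (partition by id
order by (case when state = 'Sick' then 1 else 0 end) desc
) as seqnum
from t
) t
where seqnum = 1;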