Single column values into multiple columns in hive

I have a table that is updated weekly, and I need to compare the row count for one week with the count for the previous week. This is what I have so far:
Select
case when F.wk_end_d=max(F.wk_end_d) over (partition by F.wk_end_d)then F.the_count end as count
from
(
select wk_end_d, count(*) as the_count
from table A
where wk_end_d between date_sub('2019-03-02',7) and '2019-03-02'
group by wk_end_d
) F
which gives me values like the ones below:
100
200
but I need to get the values 100 and 200 in two separate columns, because I need to build other calculations on top of them.
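One common way to get both counts on a single row is conditional aggregation: filter to the two weeks, then sum a CASE expression per week. The sketch below is only an illustration and runs against an in-memory SQLite database from Python, not Hive. The table name a, the generated rows, and the column aliases prev_week_count and curr_week_count are assumptions. In Hive, the date_sub('2019-03-02', 7) call from the original query would take the place of SQLite's date(..., '-7 days').

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (wk_end_d TEXT)")

# Hypothetical data: 100 rows for the previous week, 200 for the current week.
rows = [("2019-02-23",)] * 100 + [("2019-03-02",)] * 200
conn.executemany("INSERT INTO a VALUES (?)", rows)

# Each SUM(CASE ...) counts only the rows of one week, so both counts
# come back as two columns of a single row.
query = """
SELECT
    SUM(CASE WHEN wk_end_d = date('2019-03-02', '-7 days') THEN 1 ELSE 0 END) AS prev_week_count,
    SUM(CASE WHEN wk_end_d = '2019-03-02' THEN 1 ELSE 0 END) AS curr_week_count
FROM a
WHERE wk_end_d BETWEEN date('2019-03-02', '-7 days') AND '2019-03-02'
"""
prev_week_count, curr_week_count = conn.execute(query).fetchone()
print(prev_week_count, curr_week_count)
```

With both values on one row, the week-over-week difference or ratio can be computed in an outer query.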

Related

How Can I Retrieve The Earliest Date and Status Per Each Distinct ID

I have been trying to write a query for this case but can't get it right, because I am still receiving duplicates. I am hoping someone can help me fix this.
SELECT DISTINCT
s.Client,
s.ID,
s.Thing,
s.Status,
MIN(s.StatusDate) AS statdate
FROM
SAMPLE s
WHERE
[]
GROUP BY
s.Client,
s.ID,
s.Thing,
s.Status
My output is as follows:
Client    Id   Thing   Status    Statdate
CompanyA  123  Thing1  Approved  12/9/2019
CompanyA  123  Thing1  Denied    12/6/2019
So although the query is doing what I asked and showing the minimum status date per status, I want only the first status date for each ID. I have about 30k rows to filter through, so I need something that will not overload the query and keep it from running. Any help would be appreciated.

Use window functions:
SELECT s.*
FROM (SELECT s.*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY StatusDate) as seqnum
FROM SAMPLE s
WHERE []
) s
WHERE seqnum = 1;
This returns the row with the earliest StatusDate for each id.

Use whichever of these you feel more comfortable with/understand:
SELECT
*
FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY statusdate) as rn
FROM sample
WHERE ...
) x
WHERE rn = 1
That one works by numbering all rows sequentially in order of StatusDate, restarting the numbering from 1 every time ID changes. If you then collect all the rows numbered 1, you have your set of "first records".
Or you can join back to a MIN:
SELECT
*
FROM
sample s
INNER JOIN
(SELECT ID, MIN(statusDate) as minDate FROM sample WHERE ... GROUP BY ID) mins
ON s.ID = mins.ID and s.StatusDate = mins.MinDate
WHERE
...
This one prepares a list of every ID with its minimum date, then joins that list back to the main table, which recovers the columns that were lost during the grouping operation. You cannot simultaneously "keep data" and "throw away data" during a group: if you group by more than just ID, you get more groups (as you have found), and if you group only by ID, you lose the other columns. There is no way to say "GROUP BY id, take the MIN date, and also take all the other data from the same row as the min date" in a single grouping step; you have to group by ID, take the min date, and then join that result back to the main dataset to get the other data for that min date. If you try to do it all in a single grouping you will fail, because you either have to group by more columns or wrap the other columns in aggregate functions in the SELECT, which mixes values from different rows. Once the groups are formed, the concept of "other data from the same row" is gone.
Be aware that this can return duplicate rows if two records for the same ID share the minimum date. The ROW_NUMBER form never returns duplicates, but if two records have the same minimum StatusDate, which one you get is arbitrary. To force a specific one, add more columns to the ORDER BY so that you can be sure which row will be numbered 1.
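The difference between the two forms on tied dates can be seen in a small experiment. This is a sketch only, run against an in-memory SQLite database from Python (window functions need SQLite 3.25 or newer). The CompanyB rows are invented to create a tie, the dates are rewritten in ISO format so they sort correctly as text, and Status is added to the ORDER BY purely as an illustrative tie-breaker.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sample (Client TEXT, ID INTEGER, Thing TEXT, Status TEXT, StatusDate TEXT)"
)
conn.executemany(
    "INSERT INTO sample VALUES (?, ?, ?, ?, ?)",
    [
        ("CompanyA", 123, "Thing1", "Approved", "2019-12-09"),
        ("CompanyA", 123, "Thing1", "Denied", "2019-12-06"),
        # Hypothetical rows: two statuses sharing the same earliest date.
        ("CompanyB", 456, "Thing2", "Pending", "2019-12-01"),
        ("CompanyB", 456, "Thing2", "Denied", "2019-12-01"),
    ],
)

# ROW_NUMBER form: exactly one row per ID; Status breaks the tie deterministically.
row_number_sql = """
SELECT Client, ID, Thing, Status, StatusDate
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (PARTITION BY ID ORDER BY StatusDate, Status) AS rn
    FROM sample s
) x
WHERE rn = 1
ORDER BY ID
"""

# MIN-join form: every row that matches the minimum date comes back.
min_join_sql = """
SELECT s.Client, s.ID, s.Thing, s.Status, s.StatusDate
FROM sample s
INNER JOIN (SELECT ID, MIN(StatusDate) AS MinDate FROM sample GROUP BY ID) mins
    ON s.ID = mins.ID AND s.StatusDate = mins.MinDate
ORDER BY s.ID, s.Status
"""

first_rows = conn.execute(row_number_sql).fetchall()
joined_rows = conn.execute(min_join_sql).fetchall()
print(len(first_rows), len(joined_rows))
```

Here the ROW_NUMBER query yields one row per ID, while the join yields both tied rows for ID 456.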

Calculate stdev over a variable range in SQL Server

Table format is as follows:
Date ID subID value
-----------------------------
7/1/1996 100 1 .0543
7/1/1996 100 2 .0023
7/1/1996 200 1 -.0410
8/1/1996 100 1 -.0230
8/1/1996 200 1 .0121
I'd like to apply STDEV to the value column where date falls within a specified range, grouping on the ID column.
Desired output would look something like this:
DateRange, ID, std_v
1 100 .0232
2 100 .0323
1 200 .0423
One idea I've had that works but is clunky, involves creating an additional column (which I've called 'partition') to identify a 'group' of values over which STDEV is taken (by using the OVER function and PARTITION BY applied to 'partition' and 'ID' variables).
Creating the partition variable involves a preceding CASE expression that assigns each record to a partition based on the date range it falls in, e.g.:
...
, partition = CASE
WHEN date BETWEEN '7/1/1996' AND '10/1/1996' THEN 1
WHEN date BETWEEN '10/1/1996' AND '1/1/1997' THEN 2
...
Ideally, I'd be able to apply STDEV and the OVER function partitioning on the variable ID and variable date ranges (eg, say, trailing 3 months for a given reference date). Once this works for the 3 month period described above, I'd like to be able to make the date range variable, creating an additional '#dateRange' variable at the start of the program to be able to run this for 2, 3, 6, etc month ranges.

I ended up finding a solution to my question.
You can join the original table to a second table that holds the unique dates from the first table, using a BETWEEN clause to specify the desired range.
Sample query below.
Initial table (#excessRets), with columns:
Date, ID, subID, value
Second table (#dates), a unique list of the dates in the first, with column:
Date
select d.date, er.id, STDEV(er.value)
from #dates d
inner join #excessRets er
on er.date between DATEADD(m, -36, d.date) and d.date
group by d.date, er.id
order by er.id, d.date
To make the range variable, as described in the question, declare a variable at the outset and use it in place of the literal 36.
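The join-on-a-date-range idea can be tried out on the sample rows from the question. The sketch below is an approximation rather than the SQL Server query itself: it uses an in-memory SQLite database from Python, SQLite has no built-in STDEV, so a sample standard deviation aggregate is registered by hand, and the names excess_rets, dt, Stdev and months are assumptions. Dates are stored in ISO format, and the unique date list is derived inline instead of from a separate #dates table.

```python
import sqlite3
import statistics


class Stdev:
    """Sample standard deviation aggregate; returns NULL for fewer than two values."""

    def __init__(self):
        self.values = []

    def step(self, value):
        if value is not None:
            self.values.append(value)

    def finalize(self):
        return statistics.stdev(self.values) if len(self.values) > 1 else None


conn = sqlite3.connect(":memory:")
conn.create_aggregate("stdev", 1, Stdev)
conn.execute("CREATE TABLE excess_rets (dt TEXT, id INTEGER, sub_id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO excess_rets VALUES (?, ?, ?, ?)",
    [
        ("1996-07-01", 100, 1, 0.0543),
        ("1996-07-01", 100, 2, 0.0023),
        ("1996-07-01", 200, 1, -0.0410),
        ("1996-08-01", 100, 1, -0.0230),
        ("1996-08-01", 200, 1, 0.0121),
    ],
)

months = 3  # trailing window length; change to 2, 6, ... as needed

# For each reference date d.dt, pull in every row from the trailing window
# and aggregate per (reference date, id).
query = """
SELECT d.dt, er.id, stdev(er.value) AS std_v, COUNT(*) AS n
FROM (SELECT DISTINCT dt FROM excess_rets) d
INNER JOIN excess_rets er
    ON er.dt BETWEEN date(d.dt, '-' || ? || ' months') AND d.dt
GROUP BY d.dt, er.id
ORDER BY er.id, d.dt
"""
results = conn.execute(query, (months,)).fetchall()
for row in results:
    print(row)
```

Because the window length is a bound parameter, the same query serves any number of months without editing the SQL.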

Redshift: Find MAX in list disregarding non-incremental numbers

I work for a sports film analysis company. We have teams with unique team IDs and I would like to find the number of consecutive weeks they have uploaded film to our site moving backwards from today. Each upload also has its own row in a separate table that I can join on teamid and has a unique date of when it was uploaded. So far I put together a simple query that pulls each unique DATEDIFF(week) value and groups on teamid.
Select teamid, MAX(weekdiff)
from (Select teamid, DATEDIFF(week, dateuploaded, GETDATE()) as weekdiff
from leroy_events
group by teamid, weekdiff) t
group by teamid
What I am given is a list of teamIDs and unique weekly date differences. I would like to then find the max for each teamID without breaking an increment of 1. For example, if my data set is:
Team datediff
11453 0
11453 1
11453 2
11453 5
11453 7
11453 13
I would like the max value for team: 11453 to be 2.
Any ideas would be awesome.

I have simplified your example by assuming that a table with a weekdiff column already exists. That column is what your DATEDIFF expression calculates.
First, I use the LAG() window function to attach the previous weekdiff value (in sorted order, per team) to the current row.
Then, with a WHERE condition, I keep only the rows whose previous value is exactly one less than the current value, and take max(weekdiff) over those rows.
Data:
create table leroy_events ( teamid int, weekdiff int);
insert into leroy_events values (11453,0),(11453,1),(11453,2),(11453,5),(11453,7),(11453,13);
Code:
WITH initial_data AS (
Select
teamid,
weekdiff,
lag(weekdiff,1) over (partition by teamid order by weekdiff) as lag_weekdiff
from
leroy_events
)
SELECT
teamid,
max(weekdiff) AS max_weekdiff_consecutive
FROM
initial_data
WHERE weekdiff = lag_weekdiff + 1 -- keeps only rows that directly follow the previous week, so max() is taken over consecutive increments
GROUP BY 1
SQLFiddle with your sample data to see how this code works.
Result:
teamid max_weekdiff_consecutive
11453 2
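One caveat is worth checking before relying on the LAG filter: it keeps every row whose predecessor is exactly one less, so a later run of consecutive weeks (for example 4, 5, 6 after a gap) also passes the filter and can win the MAX. The sketch below, run in an in-memory SQLite database from Python (SQLite 3.25 or newer), compares the LAG query with a variant that only accepts the unbroken streak starting at 0. Team 22222 and its values are invented, and the ROW_NUMBER variant is an alternative technique, not the answer's method.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leroy_events (teamid INTEGER, weekdiff INTEGER)")
conn.executemany(
    "INSERT INTO leroy_events VALUES (?, ?)",
    [(11453, w) for w in (0, 1, 2, 5, 7, 13)]
    # Hypothetical second team with a later consecutive run (4, 5, 6).
    + [(22222, w) for w in (0, 1, 4, 5, 6)],
)

# The LAG-based query from the answer above.
lag_sql = """
WITH initial_data AS (
    SELECT teamid, weekdiff,
           LAG(weekdiff, 1) OVER (PARTITION BY teamid ORDER BY weekdiff) AS lag_weekdiff
    FROM leroy_events
)
SELECT teamid, MAX(weekdiff)
FROM initial_data
WHERE weekdiff = lag_weekdiff + 1
GROUP BY teamid
ORDER BY teamid
"""

# Alternative: in an unbroken streak starting at 0, the n-th distinct
# weekdiff equals n - 1, so any row past the first gap fails the test.
streak_sql = """
WITH numbered AS (
    SELECT teamid, weekdiff,
           ROW_NUMBER() OVER (PARTITION BY teamid ORDER BY weekdiff) AS rn
    FROM (SELECT DISTINCT teamid, weekdiff FROM leroy_events) AS weeks
)
SELECT teamid, MAX(weekdiff)
FROM numbered
WHERE weekdiff = rn - 1
GROUP BY teamid
ORDER BY teamid
"""

lag_result = conn.execute(lag_sql).fetchall()
streak_result = conn.execute(streak_sql).fetchall()
print(lag_result)
print(streak_result)
```

Both queries agree on team 11453, but for the invented team the LAG filter reports 6 while the streak variant reports 1. Teams with no upload in week 0 drop out of the streak result entirely.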

You can use SQL window functions to probe relationships between rows of the table. In this case the lag() function can be used to look at the previous row relative to a given order and grouping. That way you can determine whether a given row is part of a group of consecutive rows.
You still need to aggregate or filter so that each group of interest (i.e. each team) is reduced to a single row. Aggregating is convenient in this case. Overall, it might look like this:
select
team,
case min(datediff)
when 0 then max(datediff)
else -1
end as max_weeks
from (
select
team,
datediff,
case
when (lag(datediff) over (partition by team order by datediff) != datediff - 1)
then 0
else 1
end as is_consec
from diffs
) cd
where is_consec = 1
group by team
The inline view just adds an is_consec column to the data, marking whether each row is part of a group of consecutive rows. The outer query filters on that column (you cannot filter directly on a window function), and chooses the maximum datediff from the remaining rows for each team.
There are a few subtleties there:
The case expression in the inline view is written as it is to exploit the fact that the lag() computed for the first row of each partition will be NULL, which does not evaluate unequal (nor equal) to any value. Thus the first row in each partition is always marked consecutive.
The case testing min(datediff) in the outer select clause picks up teams that have no record with datediff = 0, and assigns -1 to column max_weeks for them.
It would also have been possible to mark rows non-consecutive if the first in their group did not have datediff = 0, but then you would lose such teams from the results altogether.

Group by two columns is possible?

I have this table:
ID Price Time
0 20,00 20/10/10
1 20,00 20/10/10
2 20,00 12/12/10
3 14,00 23/01/12
4 87,00 30/07/14
4 20,00 30/07/14
I use this SQL to get the list of prices without repeated values:
SELECT * FROM myTable WHERE id in (select min(id) from myTable group by Price)
This code returns the values (20,14,87,20).
But I would like to add another check, so that rows are distinguished not only by price but also by date. The query above builds the list by price only; if I can find a way to take the date into account as well, it should return the values (20,20,14,87,20).
The value 20 is then repeated for the first two entries, because the table contains 20 three times (two rows with the date 20/10/10 and one with the date 12/12/10), and that is exactly what I want to get!
Could somebody help me?

To group by multiple columns, list them in the GROUP BY clause separated by commas.
SELECT price FROM myTable group by price, time order by time
The group by looks at all distinct combinations of the listed columns values, and discards duplicates. You can also use aggregate functions like sum or max to pull in additional columns to the results.

The following should work as long as all you need is the price/time combination. If you need to include the ID, things get more complicated:
SELECT `Price` FROM items
GROUP BY `Price`, `Time`
ORDER BY `Time`;
Here's a fiddle with the result in action: http://sqlfiddle.com/#!2/40821/1
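The effect of grouping on both columns can be checked against the question's rows. This is a sketch in an in-memory SQLite database driven from Python. It assumes the dates in the question are day/month/year and stores them in ISO format so that ORDER BY sorts them chronologically; the MIN(ID) and COUNT(*) columns are added only to show what each group contains.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (ID INTEGER, Price REAL, Time TEXT)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        (0, 20.0, "2010-10-20"),
        (1, 20.0, "2010-10-20"),
        (2, 20.0, "2010-12-12"),
        (3, 14.0, "2012-01-23"),
        (4, 87.0, "2014-07-30"),
        (4, 20.0, "2014-07-30"),
    ],
)

# One output row per distinct (Price, Time) pair.
rows = conn.execute(
    """
    SELECT Price, Time, MIN(ID) AS first_id, COUNT(*) AS n
    FROM items
    GROUP BY Price, Time
    ORDER BY Time, Price
    """
).fetchall()
prices = [r[0] for r in rows]
print(prices)
```

The six input rows collapse to five groups, because only the two rows with price 20 on the same date share a (Price, Time) pair.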

how to calculate separate averages for multiple columns

I have a table that has "months" as columns and "customer ID" as primary key.
I want to average all the values for each month separately for values not equal to 99999.
My current query for a single month is as follows and is working fine:
SELECT Avg([Table1]![Dec10]) AS Expr1
FROM Table1
WHERE (([Table1]![Dec10])<>99999);
However, when I am trying to add the 2nd month, it is combining the first month's condition with the 2nd month's condition.
SELECT Avg([Table1]![Dec10]) AS Expr1, Avg([Table1]![Dec11]) AS Expr2
FROM Table1
WHERE ((([Table1]![Dec10])<>99999) And ([Table1]![Dec11])<>99999);
I need to have each month separate, i.e. calculate the average of Dec10<>99999, and in the second column, calculate the average of Dec11<>99999.

You need to use a Group By clause in your query, and then you can separate your output by months.

In this case it would be convenient to use GROUP BY.
If you have distinct month values, e.g. "jan10", "feb10", "mar12", you can group on the months and then check that the value is not 99999.
SELECT avg(value), months
FROM tablename
WHERE value <> 99999
GROUP BY months
That works if the months are stored as literals within a single column, but judging from your database design they may be stored in another way?

I need to have each month separate, i.e. calculate the average of Dec10<>99999, and in the second column, calculate the average of Dec11<>99999.
In Access 2010, for [Table1]...
CustomerID Dec10 Dec11
---------- ----- -----
1 1 5
2 2 99999
3 99999 0
4 3 7
...the query...
SELECT
DAvg("Dec10", "Table1", "Dec10<>99999") AS AvgOfDec10,
DAvg("Dec11", "Table1", "Dec11<>99999") AS AvgOfDec11
FROM (SELECT COUNT(*) AS n FROM Table1)
...produces:
AvgOfDec10 AvgOfDec11
---------- ----------
2 4
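An alternative that applies each month's condition independently is conditional aggregation: AVG skips NULLs, so a CASE expression that returns NULL for the sentinel value excludes it from that column only. The sketch below runs in an in-memory SQLite database from Python, not in Access, using the same four rows. Access has no CASE expression, so IIf would have to take its place there, which is an assumption not tested here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (CustomerID INTEGER PRIMARY KEY, Dec10 INTEGER, Dec11 INTEGER)")
conn.executemany(
    "INSERT INTO Table1 VALUES (?, ?, ?)",
    [(1, 1, 5), (2, 2, 99999), (3, 99999, 0), (4, 3, 7)],
)

# CASE without ELSE yields NULL for 99999, and AVG ignores NULLs,
# so each column is filtered on its own condition.
avg_dec10, avg_dec11 = conn.execute(
    """
    SELECT
        AVG(CASE WHEN Dec10 <> 99999 THEN Dec10 END) AS AvgOfDec10,
        AVG(CASE WHEN Dec11 <> 99999 THEN Dec11 END) AS AvgOfDec11
    FROM Table1
    """
).fetchone()
print(avg_dec10, avg_dec11)
```

This scans the table once, whereas a WHERE clause combining both conditions would drop customers 2 and 3 from both averages.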
