Dynamically Pivot/Transpose Rows to Columns in Hive/Spark - sql

I have quarterly data, and it keeps growing as new quarters are added:
qtr dimvalue percentage
FY2019-Q1 XYZ 15
FY2019-Q1 ABC 80
FY2019-Q1 PPP 5
FY2019-Q2 XYZ 10
FY2019-Q2 ABC 70
FY2019-Q2 PPP 20
While the number of quarters is small, I manually edit the query each time and transpose the data like this:
SELECT dim_value,SUM(Quater_1) as Quater_1,SUM(Quater_2) as Quater_2 from
(
SELECT dim_value,
CASE WHEN qtr='FY2019-Q1' THEN percentage END AS Quater_1,
CASE WHEN qtr='FY2019-Q2' THEN percentage END AS Quater_2 FROM
( select * from schema.table where qtr in ('FY2019-Q1','FY2019-Q2'))t2 order by dim_value
)t1 group by dim_value;
dimvalue Quater_1 Quater_2
XYZ 15 10
ABC 80 70
PPP 5 20
But my question is: how can I achieve this in a more dynamic and robust way, transposing rows into columns so that the query keeps up with the growing number of quarters and the columns get proper quarter-wise names?
In short, I am looking for a more dynamic query, whether in Hive or Spark SQL, or any other suggestions for doing this.
Thanks for the Help

You could easily do such a pivot using the Dataset API, if that's an option for you.
spark.table("schema.table").groupBy("dimvalue").pivot("qtr").sum("percentage").show
+--------+---------+---------+
|dimvalue|FY2019-Q1|FY2019-Q2|
+--------+---------+---------+
| PPP| 5| 20|
| XYZ| 15| 10|
| ABC| 80| 70|
+--------+---------+---------+
With SQL, the only way is to build the query dynamically, because the column list has to be written into the query text.
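As a rough illustration of building the query dynamically, here is a minimal Python sketch that generates the conditional-aggregation SQL from a list of quarters. The function name `build_pivot_sql` is hypothetical, the column names follow the answer above (`dimvalue`, `qtr`, `percentage`), and it assumes the quarter labels come from a trusted source, for example a prior `SELECT DISTINCT qtr` query.

```python
def build_pivot_sql(quarters, table="schema.table"):
    """Return a pivot query with one SUM(CASE ...) column per quarter.

    Assumes the quarter labels are trusted values (they are interpolated
    directly into the SQL text).
    """
    # Backticks let a label containing '-' be used as a column alias
    # in Hive and Spark SQL.
    columns = ",\n".join(
        f"  SUM(CASE WHEN qtr = '{q}' THEN percentage END) AS `{q}`"
        for q in sorted(quarters)
    )
    return f"SELECT dimvalue,\n{columns}\nFROM {table}\nGROUP BY dimvalue"


# In practice the list would come from a first query over the table.
sql = build_pivot_sql(["FY2019-Q2", "FY2019-Q1"])
print(sql)
```

The generated text can then be submitted like any other query, so a new quarter only changes the input list, not the code.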

Related

Write a query in MSSQL to get report for last 30 days

I want to create a report based on the following data:
Lead Name | Verified By | Verified on
Lead 1 | ABC | 11-02-2021
Lead 2 | KMJ | 9-02-2021
Lead 3 | ABC | 11-02-2021
Taking today's date as 12-02-2021, we need a report of each employee's work count for the last 30 days. The report should look like this:
user | 12-02-2021 | 11-02-2021 | 10-02-2021 | 9-02-2021 | 8-02-2021 | 7-02-2021 | ... and so on for the last 30 days
ABC | 0 | 2 | 0 | 0 | 0 | 0
XYZ | 0 | 0 | 0 | 0 | 0 | 0
KMJ | 0 | 0 | 0 | 1 | 0 | 0
I have written the following MSSQL filter:
CAST(lead.CREATED_ON as date) between cast(DATEADD(day, -30, getdate()) as date) and CAST(getdate() as date)
But I am not able to get the data in the format below. Also, if there is no entry for a date, that date should show 0 for all users:
user | 12-02-2021 | 11-02-2021 | 10-02-2021 | 9-02-2021 | 8-02-2021 | 7-02-2021 | ... and so on
Please help me complete this query. If possible, please also share a link to any relevant article; it would be a great help. Thank you.
First of all, are the dates really stored as strings like that? If so, that's a big problem that will make this already-difficult situation much worse. It's important enough you should consider the current schema as actively broken.
Moving on, the most correct way to handle this situation is to pivot the data in the client code or reporting tool. That is, return a result set from the SQL database looking more like this:
User | Date | Value
ABC | 2021-02-12 | 0
ABC | 2021-02-11 | 2
ABC | 2021-02-10 | 0
ABC | 2021-02-09 | 0
ABC | 2021-02-08 | 0
ABC | 2021-02-07 | 0
XYZ | 2021-02-12 | 0
XYZ | 2021-02-11 | 0
XYZ | 2021-02-10 | 0
XYZ | 2021-02-09 | 0
XYZ | 2021-02-08 | 0
XYZ | 2021-02-07 | 0
... and so on
And then let the client do the work to reformat this data however you want, rather than asking a server that is often licensed at thousands of dollars per cpu core and isn't suited for the task to do that work.
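A minimal Python sketch of that client-side pivot, assuming the query returns (user, date, value) rows in the long format shown above. The function name `pivot_long_rows` is hypothetical, and the sample rows are taken from the question.

```python
from datetime import date, timedelta


def pivot_long_rows(rows, today, days=30):
    """Pivot (user, day, value) rows into one list of values per user.

    Columns run from today backwards; days with no row are filled with 0.
    """
    columns = [today - timedelta(days=n) for n in range(days)]
    table = {}
    for user, day, value in rows:
        by_day = table.setdefault(user, dict.fromkeys(columns, 0))
        if day in by_day:  # ignore rows outside the reporting window
            by_day[day] += value
    return columns, {u: [by_day[d] for d in columns] for u, by_day in table.items()}


rows = [
    ("ABC", date(2021, 2, 11), 2),
    ("XYZ", date(2021, 2, 12), 0),
    ("KMJ", date(2021, 2, 9), 1),
]
# Six days instead of thirty, to match the excerpt in the question.
columns, report = pivot_long_rows(rows, today=date(2021, 2, 12), days=6)
for user, values in report.items():
    print(user, values)
```

Because the column list is computed from `today`, the report window moves forward each day without any change to the SQL.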
But if you really must do this on the database server, you should know the SQL language has a very strict requirement to know the number, name, and type of columns in the output at query evaluation time, before looking at any data. Even SELECT * respects this, because the * is based on a table definition known ahead of time.
If the output won't know how many columns there are until it looks at the data, you must do this in 3 steps:
1. Run a query to get data about columns.
2. Use the result from step 1 to build a new query dynamically.
3. Run the new query.
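The three steps can be sketched end to end in Python. This example uses the standard-library `sqlite3` module as a stand-in for SQL Server so that it is self-contained, and conditional aggregation rather than `PIVOT`. The `leads` table and its column names are assumptions based on the question's sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leads (lead_name TEXT, verified_by TEXT, verified_on TEXT)")
conn.executemany(
    "INSERT INTO leads VALUES (?, ?, ?)",
    [
        ("Lead 1", "ABC", "2021-02-11"),
        ("Lead 2", "KMJ", "2021-02-09"),
        ("Lead 3", "ABC", "2021-02-11"),
    ],
)

# Step 1: ask the data which columns the output needs.
days = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT verified_on FROM leads ORDER BY verified_on DESC"
    )
]

# Step 2: build the query text, one conditional SUM per day.
# The date values are bound as parameters; only the column alias is
# interpolated, which assumes the dates contain no quote characters.
select_list = ", ".join(
    f'SUM(CASE WHEN verified_on = ? THEN 1 ELSE 0 END) AS "{d}"' for d in days
)
query = (
    f"SELECT verified_by, {select_list} "
    "FROM leads GROUP BY verified_by ORDER BY verified_by"
)

# Step 3: run the generated query.
cursor = conn.execute(query, days)
header = [col[0] for col in cursor.description]
result = cursor.fetchall()
print(header)
print(result)
```

Note that this only produces columns for dates that appear in the data; showing every one of the last 30 days would require generating the date list independently of the table.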
However, since it looks like you always want exactly 30 days worth of data, you may be able to do this in a single query using the PIVOT keyword if you're willing to name the columns something more like OneDayBack, TwoDaysBack, ThreeDaysBack, ... ThirtyDaysBack, such that you can reference them in the pivot code regardless of the current date.

flip a table sql server with dynamic columns and fix rows

After a lot of joins and GROUP BYs I have arrived at the totals I wanted.
To keep things simple, I will reduce the complexity of the table.
Let's say I have this table, which gives me the totals of models per year/month:
YearMonth| Totals|model
------------------------
2015-05 | 70 |AA
2015-05 | 50 |BB
2015-06 | 30 |AA
2015-06 | 10 |BB
------------------
201x-yy | 33 |AA
201x-yy | 90 |BB
I have to create a specific (unconventional) chart in Excel with this data,
but the only way is to transform the table into something
where the columns are dynamic and the rows are fixed, something like this:
Model|2015-05|2015-06|----|201X-yy
------------------------------------
AA | 70 | 30 |--- |33
BB | 50 | 10 |----|90
Is it possible to do this with a query, or do I have to use some complex stored procedure that first creates a temp table and then inserts data into it?
My recommendation in this case is to return that data to Excel and use an Excel pivot table to get your output. You can tell Excel to return a linked query directly to a pivot table.

How to get subtotals with time datatype in SQL?

I'm stuck generating a SQL query. I have a table in a Firebird DB like the following one:
ID | PROCESS | STEP | TIME
654 | 1 | 1 | 09:08:40
655 | 1 | 2 | 09:09:32
656 | 1 | 3 | 09:10:04
...
670 | 2 | 15 | 09:30:05
671 | 2 | 16 | 09:31:00
and so on.
I need the subtotals for each process group (there are about 7 of these). The TIME column has the "time" type. I have been trying it with DATEDIFF, but it doesn't work.
You need to use SUM
This question has been answered here.
How to sum up time field in SQL Server
and here.
SUM total time in SQL Server
For more specific Firebird documentation, read up on the SUM function here.
Sum() - Firebird Official Documentation
I think you should use "GROUP BY" to get the min and max time per process and pass them to the DATEDIFF function, with the earlier time first so that the result is positive. Something like this:
select process, datediff(second, min(time), max(time)) as nb_seconds
from your_table
group by process;

Access Query: get difference of dates with a twist

I'm going to do my best to explain this so I apologize in advance if my explanation is a little awkward. If I am foggy somewhere, please tell me what would help you out.
I have a table filled with circuits and dates. Each circuit gets trimmed on a time cycle of about 36 months or 48 months. I have a column that gives me this info. I have one record for every time a circuit's trim cycle has been completed. I am attempting to link a known circuit outage list, to a table with their outage data, to a table with the circuit's trim history. The twist is the following:
I only want to get back circuits that have exceeded their trim cycles by 6 months. So I would need to take all records for a circuit, look at each individual record, find the most recent previous record relative to the record currently being examined (I will need every record examined individually), calculate the difference between the two records in months, then return only the records that exceeded 6 months of difference between any two entries for a given feeder.
Here is an example of the data:
+----+--------+----------+-------+
| ID | feeder | comp | cycle |
| 1 | 123456 | 1/1/2001 | 36 |
| 2 | 123456 | 1/1/2004 | 36 |
| 3 | 123456 | 7/1/2007 | 36 |
| 4 | 123456 | 3/1/2011 | 36 |
| 5 | 123456 | 1/1/2014 | 36 |
+----+--------+----------+-------+
Here is an example of the result set I would want (please note: cycle can vary by circuit, so the value in the cycle column needs to be in the calculation to determine if I exceeded the cycle by 6 months between trimmings):
+----+--------+----------+-------+
| ID | feeder | comp | cycle |
| 3 | 123456 | 7/1/2007 | 36 |
| 4 | 123456 | 3/1/2011 | 36 |
+----+--------+----------+-------+
This is the query I started but I'm failing really hard at determining how to make the date calculations correctly:
SELECT temp_feederList.Feeder, Temp_outagesInfo.causeType, Temp_outagesInfo.StormNameThunder, Temp_outagesInfo.deviceGroup, Temp_outagesInfo.beginTime, tbl_Trim_History.COMP, tbl_Trim_History.CYCLE
FROM (temp_feederList
LEFT JOIN Temp_outagesInfo ON temp_feederList.Feeder = Temp_outagesInfo.Feeder)
LEFT JOIN tbl_Trim_History ON Temp_outagesInfo.Feeder = tbl_Trim_History.CIRCUIT_ID;
I wasn't really able to figure out where I need to go from here to get that most recent entry and perform the mathematical comparison. I've never been asked to do SQL this complex before, so I want to thank all of you for your patience and any assistance you're willing to lend.
I'm making some assumptions, but this uses a subquery to give you rows in the feeder list where the previous completed date was greater than the number of months ago indicated by the cycle:
SELECT tbl_Trim_History.ID, tbl_Trim_History.feeder,
tbl_Trim_History.comp, tbl_Trim_History.cycle
FROM tbl_Trim_History
WHERE tbl_Trim_History.comp>
(SELECT Max(DateAdd("m", tbl_Trim_History.cycle, comp))
FROM tbl_Trim_History T2
WHERE T2.feeder = tbl_Trim_History.feeder AND
T2.comp < tbl_Trim_History.comp)
If you needed to check for longer than the cycle itself (for example, the extra 6 months mentioned in the question), you could add that value to the number of months passed to the DateAdd function.
Also, I don't know whether the value of cycle specifies the number of months from the prior cycle or the number of months to the next one. If the latter, I would change tbl_Trim_History.cycle in the DateAdd function to just cycle.
SELECT tbl_trim_history.ID, tbl_trim_history.Feeder,
tbl_trim_history.Comp, tbl_trim_history.Cycle,
(select max(comp) from tbl_trim_history T
where T.feeder=tbl_trim_history.feeder and
t.comp<tbl_trim_history.comp) AS PriorComp,
IIf(DateDiff("m",[priorcomp],[comp])>36,"x") AS [Select]
FROM tbl_trim_history;
This query identifies (with an X in the last column) the records from tbl_trim_history that exceed the cycle time - but as noted in the comments I'm not entirely sure if this is what you need or not, or how to incorporate the other 2 tables. Once you see what it is doing you can modify it to only keep the records you need.
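To make the intended comparison concrete, here is a minimal Python sketch of the same previous-record logic, outside Access. It assumes a record is wanted when it was completed at least `cycle + 6` months after the previous record for the same feeder (which reproduces the expected result in the question), and it counts months by calendar month only, ignoring the day. The function names are hypothetical.

```python
from datetime import date
from itertools import groupby


def months_between(earlier, later):
    """Calendar months from earlier to later, ignoring the day of month."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def overdue_trims(records, grace_months=6):
    """Return records completed at least cycle + grace_months after the prior one."""
    ordered = sorted(records, key=lambda r: (r["feeder"], r["comp"]))
    flagged = []
    for _, group in groupby(ordered, key=lambda r: r["feeder"]):
        previous = None
        for rec in group:
            if previous is not None:
                gap = months_between(previous["comp"], rec["comp"])
                # Assumption: the current record's cycle is the one that applies.
                if gap >= rec["cycle"] + grace_months:
                    flagged.append(rec)
            previous = rec
    return flagged


records = [
    {"id": 1, "feeder": 123456, "comp": date(2001, 1, 1), "cycle": 36},
    {"id": 2, "feeder": 123456, "comp": date(2004, 1, 1), "cycle": 36},
    {"id": 3, "feeder": 123456, "comp": date(2007, 7, 1), "cycle": 36},
    {"id": 4, "feeder": 123456, "comp": date(2011, 3, 1), "cycle": 36},
    {"id": 5, "feeder": 123456, "comp": date(2014, 1, 1), "cycle": 36},
]
flagged_ids = [r["id"] for r in overdue_trims(records)]
print(flagged_ids)
```

With the sample data the gaps are 36, 42, 44 and 34 months, so only records 3 and 4 meet the 42-month threshold.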

Quartile for subgroups in SQL Server 2008

I have a table with the times the athletes of a sports club take to run a lap around the field. Each athlete has several entries in that table, one for each time they run, and I need to gather some statistics on the times they take.
I already have the basic statistics like average time, median time, etc.... However I have no idea how to exactly do the bottom and top quartiles.
I have seen some examples of quartiles for a whole table (in this case the whole club), but I have no idea how to compute them for subgroups such as the distinct athletes in a table. Could anyone point me in the right direction or give me an example?
The relevant data is in a very simple structure like this (there are more columns but in this case they don't matter)
LAP_ID | ATHLETE| TIME |
1 | Ath_X | 120 |
2 | Ath_Y | 160 |
3 | Ath_X | 90 |
4 | Ath_X | 80 |
5 | Ath_Z | 113 |
6 | Ath_X | 115 |
EDIT: There seems to be some misunderstanding. By quartile I mean the 1st and 3rd quartiles, that is, the value that splits the lowest 25% of the data from the highest 75%, and the value that splits the highest 25% of the data from the lowest 75%.
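To pin down the expected result, here is a minimal Python sketch that computes the 1st and 3rd quartiles per athlete outside the database, using the sample rows above. Quartile definitions vary; this assumes the linear-interpolation method that `statistics.quantiles` calls "inclusive", and it skips athletes with fewer than two laps.

```python
from collections import defaultdict
from statistics import quantiles

laps = [
    (1, "Ath_X", 120),
    (2, "Ath_Y", 160),
    (3, "Ath_X", 90),
    (4, "Ath_X", 80),
    (5, "Ath_Z", 113),
    (6, "Ath_X", 115),
]

# Group lap times by athlete (the "subgroup" in the question).
times = defaultdict(list)
for _lap_id, athlete, lap_time in laps:
    times[athlete].append(lap_time)

quartiles = {}
for athlete, values in times.items():
    if len(values) < 2:
        continue  # a single lap gives no meaningful cut points
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    quartiles[athlete] = (q1, q3)

print(quartiles)
```

Only Ath_X has enough laps in the sample, so the other athletes are left out of the result.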