DBT - Pivoting a table with multiple columns - sql

Wondering if anyone can help here. I'm trying to use dbt_utils.pivot to get this table:
+---------+---------+---------+-------+
| metric1 | metric2 | metric3 | date  |
+---------+---------+---------+-------+
| 100     | 10400   | 8000    | 01/01 |
| 200     | 39500   | 90700   | 02/01 |
| 200     | 39500   | 90700   | 03/01 |
+---------+---------+---------+-------+
To look like this table:
+-------------+-------+-------+-------+
| metric_name | 01/01 | 02/01 | 03/01 | etc . . .
+-------------+-------+-------+-------+
| metric1     | 100   | 200   | 200   |
| metric2     | 10400 | 39500 | 39500 |
| metric3     | 8000  | 90700 | 90700 |
+-------------+-------+-------+-------+
My initial approach was to take each metric one by one (create a table with just metric1 and date), pivot the dates, then union the resulting tables.
My problem is that dbt_utils.pivot doesn't support CTEs, so I would be required to create a model for each metric (I have more than 3). Is there a way to get around this? Due to the number of dates I also can't use Snowflake's PIVOT function, since it requires you to explicitly name each value you want to pivot on; there would be too many to list, and new dates are constantly being added!

What you're looking for is actually to transpose your table, not pivot it. This can be achieved by an unpivot (or "melt") operation, followed by a pivot on a different column.
dbt-utils has a macro for unpivoting, too:
-- unpivot_model.sql
{{
    dbt_utils.unpivot(
        relation=ref("table_name"),
        cast_to="int",
        exclude=["date"],
        field_name="metric_name",
        value_name="metric_value",
    )
}}
this becomes:
+-------+-------------+--------------+
| date  | metric_name | metric_value |
+-------+-------------+--------------+
| 01/01 | metric1     | 100          |
| 02/01 | metric1     | 200          |
| 03/01 | metric1     | 200          |
| 01/01 | metric2     | 10400        |
| ...   | ...         | ...          |
+-------+-------------+--------------+
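Under the hood, the unpivot macro simply compiles to a union of selects, one per metric column. Roughly (a sketch of the rendered SQL; exact output varies by dbt_utils version):
-- approximate compiled SQL (sketch)
select date, cast('metric1' as varchar) as metric_name, cast(metric1 as int) as metric_value from table_name
union all
select date, cast('metric2' as varchar) as metric_name, cast(metric2 as int) as metric_value from table_name
union all
select date, cast('metric3' as varchar) as metric_name, cast(metric3 as int) as metric_value from table_name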
Then you can pivot that new table:
-- pivot_model.sql
select
    metric_name,
    {{
        dbt_utils.pivot(
            "date",
            dbt_utils.get_column_values(ref("unpivot_model"), "date"),
            then_value="metric_value",
        )
    }}
from {{ ref("unpivot_model") }}
group by metric_name
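For reference, the pivot macro also compiles to plain conditional aggregation. With the three dates above, the rendered SQL looks roughly like this (a sketch; exact quoting and aliases depend on your dbt_utils version and arguments):
-- approximate compiled SQL (sketch)
select
    metric_name,
    sum(case when date = '01/01' then metric_value else 0 end) as "01/01",
    sum(case when date = '02/01' then metric_value else 0 end) as "02/01",
    sum(case when date = '03/01' then metric_value else 0 end) as "03/01"
from unpivot_model
group by metric_name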

Related

Select data from multiple existing tables dynamically

I have tables "T1" in the database that are broken down by month of the form (table_082020, table_092020, table_102020). Each contains several million records.
table_082020:
+----+----------+-------+
| id | date     | value |
+----+----------+-------+
| 1  | 20200816 | abc   |
| 2  | 20200817 | xyz   |
+----+----------+-------+
table_092020:
+----+----------+-------+
| id | date     | value |
+----+----------+-------+
| 1  | 20200901 | cba   |
| 2  | 20200901 | zyx   |
+----+----------+-------+
There is a second table "T2" that stores a reference to the primary key of the first one and actually to the table itself only without the word "table_".
+------------+--------+--------+--------+--------+
| rec_number | period | field1 | field2 | field3 |
+------------+--------+--------+--------+--------+
| 777        | 092020 | aaa    | bbb    | ccc    |
| 987        | 102020 | eee    | fff    | ggg    |
| 123456     | 082020 | xxx    | yyy    | zzz    |
+------------+--------+--------+--------+--------+
There is also a third table "T3", which is the ratio of the period and the table name.
+--------+--------------+
| period | table_name   |
+--------+--------------+
| 082020 | table_082020 |
| 092020 | table_092020 |
| 102020 | table_102020 |
+--------+--------------+
How can I combine the three tables to query data dynamically across several periods? For example, from 15082020 to 04092020, where the data will be located in different tables.
There really is no good reason for storing data in this format. It makes querying a nightmare.
If you cannot change the data format, then add a view each month that combines the data:
create view t as
select '202010' as YYYYMM, t.*
from table_102020 t
union all
select '202009' as YYYYMM, t.*
from table_092020 t
union all
. . .;
For a once-a-month effort, you can spend ten minutes writing the code, prompted by a calendar reminder. Or, better yet, set up a job that uses dynamic SQL to generate the code, and run it after the underlying tables are created.
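For instance, here is a minimal sketch of generating the view body from T3 (string-aggregation syntax varies by RDBMS; this uses Postgres-style string_agg and || concatenation):
-- build the union-all view body from the period/table mapping (sketch)
select string_agg(
           'select ''' || period || ''' as period, t.* from ' || table_name || ' t',
           ' union all '
       ) as view_body
from T3;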
What should you be doing? Well, 5 million rows a month isn't actually that much data. But if you are concerned about it, you can use table partitioning to store the data by month. This can be a little tricky; for instance, the primary key needs to include the partitioning key.
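As a sketch of what that might look like (PostgreSQL declarative partitioning, shown purely for illustration since the question doesn't name an RDBMS):
-- one partitioned table replaces the per-month tables (sketch)
create table t1 (
    id    bigint,
    date  date,
    value text,
    primary key (id, date)  -- the partition key must be part of the primary key
) partition by range (date);
create table t1_202008 partition of t1
    for values from ('2020-08-01') to ('2020-09-01');
create table t1_202009 partition of t1
    for values from ('2020-09-01') to ('2020-10-01');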

Is there a way to create a "pivot group" of columns with T-SQL?

This seems like a common need but, unfortunately, I can't find a solution.
Assume you have a query that outputs the following content:
| TimeFrame  | User       | Metric1 | Metric2 |
+------------+------------+---------+---------+
| TODAY      | John Doe   | 10      | 20      |
| MONTHTODAY | John Doe   | 100     | 200     |
| TODAY      | Jack Frost | 15      | 25      |
| MONTHTODAY | Jack Frost | 150     | 250     |
What I need as output after a pivot is data that looks like this:
| User       | TODAY_Metric1 | TODAY_Metric2 | MONTHTODAY_Metric1 | MONTHTODAY_Metric2 |
+------------+---------------+---------------+--------------------+--------------------+
| John Doe   | 10            | 20            | 100                | 200                |
| Jack Frost | 15            | 25            | 150                | 250                |
Note that I'm pivoting on TimeFrame; Metric1 and Metric2 remain columns, but are grouped by the time frame values.
Can this be done within standard PIVOT syntax or will I need to write a more complex query to pull this data together in a result set specific to my needs?
You can do this with conditional aggregation:
select
    [user],
    sum(case when timeframe = 'TODAY' then Metric1 end) TODAY_Metric1,
    sum(case when timeframe = 'TODAY' then Metric2 end) TODAY_Metric2,
    sum(case when timeframe = 'MONTHTODAY' then Metric1 end) MONTHTODAY_Metric1,
    sum(case when timeframe = 'MONTHTODAY' then Metric2 end) MONTHTODAY_Metric2
from mytable
group by [user]
(Note: [user] is bracketed because USER is a reserved word in T-SQL.)
I tend to prefer the conditional aggregation technique over the vendor-specific implementations, because:
I find it simpler to understand and maintain
it is cross-RDBMS (so you can easily port it to another database if needed)
it usually performs as well as, or better than, the vendor implementations (which often rely on it under the hood)
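For comparison, here is a sketch of the same result with native PIVOT syntax (table and column names taken from the question); because PIVOT handles only one value column at a time, the metrics have to be unpivoted first, which is why conditional aggregation ends up simpler:
select [user], TODAY_Metric1, TODAY_Metric2, MONTHTODAY_Metric1, MONTHTODAY_Metric2
from (
    select [user],
           timeframe + '_' + metric as col,  -- e.g. 'TODAY_Metric1'
           val
    from mytable
    cross apply (values ('Metric1', Metric1), ('Metric2', Metric2)) v(metric, val)
) src
pivot (sum(val) for col in (TODAY_Metric1, TODAY_Metric2,
                            MONTHTODAY_Metric1, MONTHTODAY_Metric2)) p;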

Return only one row of a column for minimum time in Postgresql

This is a bit of a complicated question to ask, but I am sure someone here will know the answer in about 2 minutes and I'll feel stupid.
What I have is a table of routes, delivery names, and delivery times. Let's say it looks like this:
+------------+---------------+-------+
| ROUTE CODE | NAME          | TIME  |
+------------+---------------+-------+
| A          | McDonald's    | 5:30  |
| A          | Arby's        | 5:45  |
| A          | Burger King   | 6:00  |
| A          | Wendy's       | 6:30  |
| B          | Arby's        | 7:45  |
| B          | Arby's        | 7:45  |
| B          | Burger King   | 8:30  |
| B          | McDonald's    | 9:00  |
| C          | Wendy's       | 9:30  |
| C          | Lion's Choice | 8:15  |
| C          | Steak N Shake | 9:50  |
| C          | Hardee's      | 10:30 |
+------------+---------------+-------+
What I want the result to return is something like this:
+------------+---------------+------+
| ROUTE CODE | NAME          | TIME |
+------------+---------------+------+
| A          | McDonald's    | 5:30 |
| B          | Arby's        | 7:45 |
| C          | Lion's Choice | 8:15 |
+------------+---------------+------+
So what I want is the name of the minimum time for each route code.
I have written a query that gets me most of the way there (and feel free to improve upon this query if you think there is a more efficient way to do it):
SELECT main1.route_code, main1.first_stop, main2.name
FROM
(SELECT route_code, min(time) as first_stop FROM table1 WHERE date = yesterday GROUP BY route_code) main1
JOIN
(SELECT route_code, name, time FROM table1 WHERE date = yesterday) main2
ON main1.route_code = main2.route_code and main1.first_stop = main2.time
Here is where I need your help though. If I have identical times, it returns that row twice, and I only want it once. So for instance, the above query would return Arby's for route code "B" twice because it has the same time. I only want to see that once, I never want to see anything from a route more than once.
Can anyone help me? Thanks much!
In Postgres, you can use distinct on:
select distinct on (route_code) t.*
from table1 t
order by route_code, time asc;
This is likely to be the fastest method in Postgres. For performance, an index on (route_code, time) is recommended.
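A sketch of that index (table and column names taken from the question):
create index table1_route_code_time_idx on table1 (route_code, time);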
Here's another way to get your result that you may or may not like better:
SELECT route_code, time, name
FROM (SELECT *,
             ROW_NUMBER() OVER (PARTITION BY route_code ORDER BY time ASC) row_num
      FROM table1) subq
WHERE row_num = 1;

SAP Business Objects Cross Table Data Duplication

I'm using Business Objects to construct a simple report on whether a unit is on or off for a given day. When constructing a vertical table, the data is correct and looks like such:
Unit ID | Status | Date
1       | On     | 2016-09-10
1       | On     | 2016-09-11
1       | Off    | 2016-09-12
2       | Off    | 2016-09-10
2       | Off    | 2016-09-11
2       | On     | 2016-09-12
However, the cross table I've created, with columns of "Date" and rows of "Unit ID", duplicates each Unit ID, giving an entire row of 'On' followed by an entire row of 'Off', like:
____| 2016-09-10 | 2016-09-11 | 2016-09-12
 1  | On         | On         | On
 1  | Off        | Off        | Off
 2  | On         | On         | On
 2  | Off        | Off        | Off
instead of what it should be:
____| 2016-09-10 | 2016-09-11 | 2016-09-12
 1  | On         | On         | Off
 2  | Off        | Off        | On
Any suggestions as to why it's doing this? The table isn't particularly useful if it has these duplicate rows and I can't understand why it's resulting in this odd table.
Turns out the "Status" field was a dimension, but the cross table requires its body field to be a measure. Simply creating a new variable, defined as a measure equal to "Status", solved the issue.

Only Some Dates From SQL SELECT Being Set To "0" or "1969-12-31" -- UNIX_TIMESTAMP

So I have been doing pretty well on my project (Link to previous StackOverflow question), and have managed to learn quite a bit, but there is this one problem that has been really dogging me for days and I just can't seem to solve it.
It has to do with using the UNIX_TIMESTAMP call to convert dates in my SQL database to UNIX time-format, but for some reason only one set of dates in my table is giving me issues!
==============
So these are the values I am getting -
#abridged here, see the results from the SELECT statement below to see the rest
#of the fields outputted
| firstVst   | nextVst    | DOB       |
| 1206936000 | 1396238400 | 0         |
| 1313726400 | 1313726400 | 278395200 |
| 1318910400 | 1413604800 | 0         |
| 1319083200 | 1413777600 | 0         |
when I use this SELECT statement -
SELECT SQL_CALC_FOUND_ROWS *, UNIX_TIMESTAMP(firstVst) AS firstVst,
    UNIX_TIMESTAMP(nextVst) AS nextVst, UNIX_TIMESTAMP(DOB) AS DOB
FROM people
ORDER BY ref DESC;
So my big question is: why in the heck are 3 out of 4 of my DOBs being set to a date of 0 (i.e. 12/31/1969 on my PC)? Why is this not happening in my other fields?
I can see the data quite well using a simpler SELECT statement, and the DOB field looks fine...?
#formatting broken to change some variable names etc.
select * FROM people;
| ref   | lastName | firstName | DOB        | rN      | lN      | firstVst   | disp     | repName       | nextVst    |
| 10001 | BlankA   | NameA     | 1968-04-15 | 1000000 | 4600000 | 2008-03-31 | Positive | Patrick Smith | 2014-03-31 |
| 10002 | BlankB   | NameB     | 1978-10-28 | 1000001 | 4600001 | 2011-08-19 | Positive | Patrick Smith | 2011-08-19 |
| 10003 | BlankC   | NameC     | 1941-06-08 | 1000002 | 4600002 | 2011-10-18 | Positive | Patrick Smith | 2014-10-18 |
| 10004 | BlankD   | NameD     | 1952-08-01 | 1000003 | 4600003 | 2011-10-20 | Positive | Patrick Smith | 2014-10-20 |
It's because those DOBs are from before the UNIX epoch, which starts at 00:00:00 UTC on 1 January 1970 (rendered as 12/31/1969 in your local timezone); MySQL's UNIX_TIMESTAMP() returns 0 for dates outside its supported range rather than a negative number.
From Wikipedia:
Unix time, or POSIX time, is a system for describing instants in time, defined as the number of seconds that have elapsed since 00:00:00 Coordinated Universal Time (UTC), Thursday, 1 January 1970, not counting leap seconds.
A bit more elaboration: basically, what you're trying to do isn't possible with UNIX_TIMESTAMP(). Depending on what it's for, there may be a different way you can do this, but UNIX timestamps probably aren't the best representation for dates that far back.
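For example, if you just need signed seconds relative to the epoch, one workaround (a sketch using MySQL's TIMESTAMPDIFF, which does handle pre-1970 dates; note it ignores the session time zone that UNIX_TIMESTAMP() would apply) is to compute the difference yourself:
-- returns a negative number for DOBs before 1970-01-01 (sketch)
SELECT ref, TIMESTAMPDIFF(SECOND, '1970-01-01 00:00:00', DOB) AS dob_epoch_seconds
FROM people;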