Redshift / Postgres SQL - Split Sequential Data Into Multiple Rows

I'm creating a table from some data with the following query:
CREATE TABLE TableA AS
SELECT
color,
TRIM(Year) as year,
...
FROM
TableB;
In most cases year will be a single year, but in certain cases it could be a range, e.g. 2003-2006.
How can I create the table such that it turns
color | year
blue | 2003-2006
into
color | year
blue | 2003
blue | 2004
blue | 2005
blue | 2006
I tried fiddling with generate_series but it doesn't seem to work given our version:
dev=# CREATE TEMPORARY TABLE series_test AS SELECT generate_series(1,3);
INFO: Function "generate_series(integer,integer)" not supported.
ERROR: Specified types or functions (one per INFO message) not supported on Redshift tables.

You need a tally (numbers) table of some sort for this purpose. Assuming no range spans more than 10 years, you could do:
with n as (
-- any table with at least 10 rows can seed the numbers 1..10
select row_number() over () as n
from tableb
limit 10
)
select t.color, t.from_year + n.n - 1 as year
from (select t.*,
(case when year like '%-%' then split_part(year, '-', 1)::int
else year::int
end) as from_year,
(case when year like '%-%' then split_part(year, '-', 2)::int
else year::int
end) as to_year
from tableb t
) t join
n
on t.from_year + n.n - 1 <= t.to_year;
This assumes that the table has enough rows to generate the numbers. If your given table isn't big enough, then you can use a different table.
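To experiment with the tally-join idea without a Redshift cluster, here is a minimal sketch that runs the same pattern through Python's built-in sqlite3 module. It makes several assumptions: SQLite has no split_part, so substr/instr stand in for it; the numbers 1..10 come from a recursive CTE instead of row_number() over tableb, because the sample table is tiny; and the helper name expand_years and the sample rows are made up for illustration.

```python
import sqlite3

def expand_years(rows):
    """Expand 'YYYY-YYYY' ranges into one row per year (illustrative helper)."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE tableb (color TEXT, year TEXT)")
        con.executemany("INSERT INTO tableb VALUES (?, ?)", rows)
        sql = """
        WITH RECURSIVE nums(n) AS (
            -- tally 1..10: assumes no range spans more than 10 years
            SELECT 1 UNION ALL SELECT n + 1 FROM nums WHERE n < 10
        ),
        t AS (
            SELECT color,
                   CAST(CASE WHEN year LIKE '%-%'
                             THEN substr(year, 1, instr(year, '-') - 1)
                             ELSE year END AS INTEGER) AS from_year,
                   CAST(CASE WHEN year LIKE '%-%'
                             THEN substr(year, instr(year, '-') + 1)
                             ELSE year END AS INTEGER) AS to_year
            FROM tableb
        )
        SELECT t.color, t.from_year + nums.n - 1 AS year
        FROM t JOIN nums ON t.from_year + nums.n - 1 <= t.to_year
        ORDER BY t.color, year
        """
        return con.execute(sql).fetchall()
    finally:
        con.close()

print(expand_years([("blue", "2003-2006"), ("red", "2010")]))
```

A single year passes through unchanged because from_year and to_year are equal, so only n = 1 satisfies the join condition.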

Related

Redshift: Generate a sequential range of numbers

I'm currently migrating PostgreSQL code from our existing DWH to new Redshift DWH and few queries are not compatible.
I have a table which has id, start_week, end_week and orders_each_week in a single row. I'm trying to generate a sequential series between start_week and end_week so that I get a separate row for each week in the given timeline.
E.g.,
This is how it is present in the table
+----+------------+----------+------------------+
| ID | start_week | end_week | orders_each_week |
+----+------------+----------+------------------+
| 1 | 3 | 5 | 10 |
+----+------------+----------+------------------+
This is how I want to have it
+----+------+--------+
| ID | week | orders |
+----+------+--------+
| 1 | 3 | 10 |
+----+------+--------+
| 1 | 4 | 10 |
+----+------+--------+
| 1 | 5 | 10 |
+----+------+--------+
The code below throws an error.
SELECT
id,
generate_series(start_week::BIGINT, end_week::BIGINT) AS demand_weeks
FROM client_demand
WHERE createddate::DATE >= '2021-01-01'
[0A000][500310] Amazon Invalid operation: Specified types or functions (one per INFO message) not supported on Redshift tables.;
[01000] Function "generate_series(bigint,bigint)" not supported.
So basically I am trying to generate a sequential series between two numbers, and I couldn't find any solution. Any help here is really appreciated. Thank you
Gordon Linoff has shown a very common method for doing this, and his approach has the advantage that the process isn't generating "rows" that don't already exist. This can make it faster than approaches that generate data on the fly. However, you need to have a table with about the right number of rows lying around, and this isn't always the case. He also shows that the number series needs to be cross joined with your data to perform the function you need.
If you need to generate a large series of numbers without using an existing table, there are a number of ways to do this. Here's my go-to approach:
WITH twofivesix AS (
SELECT
p0.n
+ p1.n * 2
+ p2.n * POWER(2,2)
+ p3.n * POWER(2,3)
+ p4.n * POWER(2,4)
+ p5.n * POWER(2,5)
+ p6.n * POWER(2,6)
+ p7.n * POWER(2,7)
as n
FROM
(SELECT 0 as n UNION SELECT 1) p0,
(SELECT 0 as n UNION SELECT 1) p1,
(SELECT 0 as n UNION SELECT 1) p2,
(SELECT 0 as n UNION SELECT 1) p3,
(SELECT 0 as n UNION SELECT 1) p4,
(SELECT 0 as n UNION SELECT 1) p5,
(SELECT 0 as n UNION SELECT 1) p6,
(SELECT 0 as n UNION SELECT 1) p7
),
fourbillion AS (
SELECT (a.n * POWER(256, 3) + b.n * POWER(256, 2) + c.n * 256 + d.n) as n
FROM twofivesix a,
twofivesix b,
twofivesix c,
twofivesix d
)
SELECT ...
This example makes a whole bunch of numbers (about 4 billion), but you can extend or reduce the size of the series by changing the number of times the tables are cross joined and by adding where clauses (as Gordon Linoff did). I don't expect you need a list anywhere close to this long, but I wanted to show how this can be used to make series that are very long. (You can also write this in base 10 if that makes more sense to you.)
So if you have a table with more rows than the numbers you need, then that can be the fastest method, but if you don't have such a table, or table lengths vary over time, you may want this pure SQL approach.
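As a quick sanity check of the binary cross-join trick, here is a scaled-down sketch (8 bits, so 0 through 255) run through Python's sqlite3 module. The names bits and first_numbers are hypothetical, and plain integer multipliers replace POWER() only because not every SQLite build ships math functions.

```python
import sqlite3

SQL = """
WITH bits(n) AS (SELECT 0 UNION ALL SELECT 1),
twofivesix(n) AS (
    -- each pN contributes one binary digit, so 8 cross joins give 2^8 rows
    SELECT p0.n + p1.n * 2 + p2.n * 4 + p3.n * 8
         + p4.n * 16 + p5.n * 32 + p6.n * 64 + p7.n * 128
    FROM bits p0, bits p1, bits p2, bits p3,
         bits p4, bits p5, bits p6, bits p7
)
SELECT n FROM twofivesix WHERE n < ? ORDER BY n
"""

def first_numbers(limit):
    """Return the integers 0..limit-1 (for limit <= 256) generated purely in SQL."""
    con = sqlite3.connect(":memory:")
    try:
        return [row[0] for row in con.execute(SQL, (limit,))]
    finally:
        con.close()

print(first_numbers(10))
```

Every combination of the eight digits appears exactly once, which is why the result has no duplicates and no holes; the WHERE clause is what trims the series to the length you actually need.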
Among the many Postgres features that Redshift does not support is generate_series() (except on the leader node). You can generate the series yourself.
If you have a table with enough rows in Redshift, then I find that this approach works:
with n as (
select row_number() over () - 1 as n
from client_demand cd
)
select cd.id, cd.start_week + n.n as week, cd.orders_each_week
from client_demand cd join
n
on n.n <= (end_week - start_week);
This assumes that you have a table with enough rows to generate enough numbers for the on clause. If the table is really big, then add something like limit 100 in the n CTE to limit the size.
If there are only a handful of values, you can use:
select 0 as n union all
select 1 as n union all
select 2 as n
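To see how the hand-written number list plugs into the join, here is a small sketch using Python's sqlite3 module with the client_demand columns from the question. The helper name expand_weeks and the sample rows are assumptions for illustration only.

```python
import sqlite3

def expand_weeks(demand):
    """Return one (id, week, orders) row per week between start_week and end_week."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute(
            "CREATE TABLE client_demand "
            "(id INT, start_week INT, end_week INT, orders_each_week INT)"
        )
        con.executemany("INSERT INTO client_demand VALUES (?, ?, ?, ?)", demand)
        sql = """
        WITH n(n) AS (
            -- hard-coded offsets 0..2: only covers ranges of up to 3 weeks
            SELECT 0 UNION ALL SELECT 1 UNION ALL SELECT 2
        )
        SELECT cd.id, cd.start_week + n.n AS week, cd.orders_each_week AS orders
        FROM client_demand cd
        JOIN n ON n.n <= cd.end_week - cd.start_week
        ORDER BY cd.id, week
        """
        return con.execute(sql).fetchall()
    finally:
        con.close()

print(expand_weeks([(1, 3, 5, 10)]))
```

One thing to watch: if a row spans more weeks than the number list provides, the extra weeks are silently dropped rather than raising an error, so size the list for the widest range you expect.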

SQL query to duplicate each row 12 times

I have a table which has columns site, year and sales. This table is unique on site+year, e.g.
site year sales
-------------------
a 2012 50
b 2013 100
a 2006 35
Now what I want to do is make this table unique on site+year+month. Thus each row gets duplicated 12 times, a month column is added which is labelled from 1-12 and the sales values get divided by 12 thus
site year month sales
-------------------------
a 2012 1 50/12
a 2012 2 50/12
...
a 2012 12 50/12
...
b 2013 1 100/12
...
a 2006 12 35/12
I am doing this in Python currently and it works like a charm, but I need to do this in SQL (ideally PostgreSQL, since I will be using this as a data source for Tableau).
It would be very helpful if someone can provide the explanations with the solution as well, since I am a novice at this
You can use generate_series() for that:
select t.site, t.year, g.month, t.sales / 12
from the_table t
cross join generate_series(1,12) as g (month)
order by t.site, t.year, g.month;
If the column sales is an integer, you should cast it to numeric to avoid integer division: t.sales::numeric / 12
Online example: http://rextester.com/GUWPI39685
Try this approach (for T-SQL, MS SQL Server):
DECLARE @T TABLE
(
[site] VARCHAR(5),
[year] INT,
sales INT
)
INSERT INTO @T
VALUES('A',2012,50),('B',2013,100),('C',2006,35)
-- recursive CTE producing the month numbers 1..12
;WITH CTE
AS
(
SELECT
MonthSeq = 1
UNION ALL
SELECT
MonthSeq = MonthSeq+1
FROM CTE
WHERE MonthSeq <12
)
SELECT
T.[site],
T.[year],
[Month] = CTE.MonthSeq,
sales = T.[sales]/12.0 -- 12.0 avoids integer division
FROM CTE
CROSS JOIN @T T
ORDER BY T.[site],CTE.MonthSeq
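Since the question asks for an explanation, here is a small runnable sketch of the same idea: build the numbers 1 to 12 once, then cross join them with every row. It uses Python's sqlite3 module and a recursive CTE, because SQLite does not ship generate_series by default. The table name the_table is taken from the answer above, while the helper name split_by_month is made up.

```python
import sqlite3

def split_by_month(sales_rows):
    """Duplicate each (site, year, sales) row 12 times, dividing sales evenly."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE the_table (site TEXT, year INT, sales INT)")
        con.executemany("INSERT INTO the_table VALUES (?, ?, ?)", sales_rows)
        sql = """
        WITH RECURSIVE months(month) AS (
            -- anchor row 1, then keep adding 1 until 12 is reached
            SELECT 1 UNION ALL SELECT month + 1 FROM months WHERE month < 12
        )
        SELECT t.site, t.year, m.month,
               t.sales / 12.0 AS sales   -- 12.0 forces non-integer division
        FROM the_table t
        CROSS JOIN months m              -- every row pairs with every month
        ORDER BY t.site, t.year, m.month
        """
        return con.execute(sql).fetchall()
    finally:
        con.close()

for row in split_by_month([("a", 2012, 50)])[:3]:
    print(row)
```

The cross join has no join condition on purpose: pairing each of the N source rows with each of the 12 month rows is exactly what produces the 12 x N output rows.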

Find gaps of a sequence in SQL without creating additional tables

I have a table invoices with a field invoice_number. This is what happens when I execute select invoice_number from invoices:
invoice_number
--------------
1
2
3
5
6
10
11
I want a SQL that gives me the following result:
gap_start | gap_end
4 | 4
7 | 9
How can I write a SQL query to do this?
I am using PostgreSQL.
With modern SQL, this can easily be done using window functions:
select invoice_number + 1 as gap_start,
next_nr - 1 as gap_end
from (
select invoice_number,
lead(invoice_number) over (order by invoice_number) as next_nr
from invoices
) nr
where invoice_number + 1 <> next_nr;
SQLFiddle: http://sqlfiddle.com/#!15/1e807/1
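For anyone who wants to check the window-function query locally, here is a minimal sketch that runs it through Python's sqlite3 module (this assumes a SQLite build of 3.25 or later, where window functions are available). The helper name find_gaps is hypothetical.

```python
import sqlite3

def find_gaps(numbers):
    """Return (gap_start, gap_end) pairs for missing runs in a list of integers."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE invoices (invoice_number INTEGER)")
        con.executemany("INSERT INTO invoices VALUES (?)", [(n,) for n in numbers])
        sql = """
        SELECT invoice_number + 1 AS gap_start,
               next_nr - 1        AS gap_end
        FROM (
            SELECT invoice_number,
                   lead(invoice_number) OVER (ORDER BY invoice_number) AS next_nr
            FROM invoices
        ) nr
        -- the last row has next_nr = NULL, so the comparison drops it
        WHERE invoice_number + 1 <> next_nr
        ORDER BY gap_start
        """
        return con.execute(sql).fetchall()
    finally:
        con.close()

print(find_gaps([1, 2, 3, 5, 6, 10, 11]))
```

Each row is compared only with its immediate successor, so the query needs a single sorted pass and no self-join.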
We can use a simpler technique to get all missing values first, by joining on a generated sequence column like so:
select series
from generate_series(1, 11, 1) series
left join invoices on series = invoices.invoice_number
where invoice_number is null;
This gets us the series of missing numbers, which can be useful on its own in some cases.
To get the gap start/end ranges, we can instead join the source table with itself:
select invoices.invoice_number + 1 as start,
min(fr.invoice_number) - 1 as stop
from invoices
left join invoices r on invoices.invoice_number = r.invoice_number - 1
left join invoices fr on invoices.invoice_number < fr.invoice_number
where r.invoice_number is null
and fr.invoice_number is not null
group by invoices.invoice_number,
r.invoice_number;
dbfiddle: https://dbfiddle.uk/?rdbms=postgres_14&fiddle=32c5f3c021b0f1a876305a2bd3afafc9
This is probably less optimised than the solutions above, but it could be useful on database servers that do not support the lead() function.
Full credit goes to this excellent page in SILOTA docs:
http://www.silota.com/docs/recipes/sql-gap-analysis-missing-values-sequence.html
I highly recommend reading it, as it explains the solution step by step.
I found another query:
select invoice_number + 1 as gap_start,
invoice_number + lead - 1 as gap_end
from (select invoice_number,
lead(invoice_number) over w - invoice_number as lead
from invoices window w as (order by invoice_number)) x
-- a gap follows any row whose successor is more than 1 away
where lead > 1;

Generate SQL rows

Given a number of types and a number of occurrences per type, I would like to generate something like this in T-SQL:
Occurrence | Type
-----------------
0 | A
1 | A
0 | B
1 | B
2 | B
Both the number of types and the number of occurrences per type are presented as values in different tables.
While I can do this with WHILE loops, I'm looking for a better solution.
Thanks!
This works with a numbers table, which is what I would use.
SELECT Occurrence = ROW_NUMBER() OVER (PARTITION BY Type ORDER BY Type) - 1
, Type
FROM Numbers num
INNER JOIN #temp1 t
ON num.n BETWEEN 1 AND t.Occurrence
Tested with this sample data:
create table #temp1(Type varchar(10),Occurrence int)
insert into #temp1 VALUES('A',2)
insert into #temp1 VALUES('B',3)
How to create a number-table? http://sqlperformance.com/2013/01/t-sql-queries/generate-a-set-1
If you have a table with the columns type and num, you have two approaches. One way is to use recursive CTEs:
with CTE as (
select type, 0 as occurrence, num
from table t
union all
select type, 1 + occurrence, num
from cte
where occurrence + 1 < num
)
select cte.*
from cte;
You may have to set the MAXRECURSION option, if the number exceeds 100.
The other way is to join in a numbers table. SQL Server has master..spt_values, which is often used for this purpose:
select s.number as occurrence, t.type
from table t join
master..spt_values s
on s.type = 'P' and s.number < t.num ;
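The recursive CTE variant can be tried outside SQL Server as well. Below is a minimal sketch using Python's sqlite3 module; the table name type_counts and the helper name occurrences are assumptions, and the extra num > 0 filter in the anchor is my addition so that a type with zero occurrences produces no rows.

```python
import sqlite3

def occurrences(type_counts):
    """Return (occurrence, type) rows, numbering from 0 up to num - 1 per type."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE type_counts (type TEXT, num INT)")
        con.executemany("INSERT INTO type_counts VALUES (?, ?)", type_counts)
        sql = """
        WITH RECURSIVE cte(type, occurrence, num) AS (
            -- anchor: occurrence 0 for every type that has at least one
            SELECT type, 0, num FROM type_counts WHERE num > 0
            UNION ALL
            -- step: keep counting until occurrence reaches num - 1
            SELECT type, occurrence + 1, num FROM cte WHERE occurrence + 1 < num
        )
        SELECT occurrence, type FROM cte ORDER BY type, occurrence
        """
        return con.execute(sql).fetchall()
    finally:
        con.close()

print(occurrences([("A", 2), ("B", 3)]))
```

The recursion depth equals the largest num, which is why SQL Server's default limit of 100 recursion levels matters for bigger counts.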

SQL: create sequential list of numbers from various starting points

I'm stuck on this SQL problem.
I have a column that is a list of starting points (prevdoc), and another column that lists how many sequential numbers I need after the starting point (exdiff).
For example, here are the first several rows:
prevdoc | exdiff
----------------
1 | 3
21 | 2
126 | 2
So I need an output to look something like:
2
3
4
22
23
127
128
I'm lost as to where even to start. Can anyone advise me on the SQL code for this solution?
Thanks!
;with a as
(
select prevdoc + 1 col, exdiff
from <table> where exdiff > 0
union all
select col + 1, exdiff - 1
from a
where exdiff > 1
)
select col
from a;
If your exdiff is going to be a small number, you can make up a virtual table of numbers using SELECT..UNION ALL as shown here and join to it:
select prevdoc+number
from doc
join (select 1 number union all
select 2 union all
select 3 union all
select 4 union all
select 5) x on x.number <= doc.exdiff
order by 1;
I have provided for 5 but you can expand as required. You haven't specified your DBMS, but in each one there will be a source of sequential numbers, for example in SQL Server, you could use:
select prevdoc+number
from doc
join master..spt_values v on
v.number <= doc.exdiff and
v.number >= 1 and
v.type = 'p'
order by 1;
The master..spt_values table contains numbers between 0-2047 (when filtered by type='p').
If the numbers are not too large, then you can use the following trick in most databases:
select t.prevdoc + seqnum
from t join
(select row_number() over (order by column_name) as seqnum
from INFORMATION_SCHEMA.columns
) nums
on seqnum <= t.exdiff
The use of INFORMATION_SCHEMA.columns in the subquery is arbitrary. The only purpose is to generate a sequence of numbers at least as long as the maximum exdiff number.
This approach will work in any database that supports the ranking functions. Most databases have a database-specific way of generating a sequence (such as recursive CTEs in SQL Server and CONNECT BY in Oracle).
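As a minimal sketch of the numbers-join idea on the sample data from the question, the snippet below runs through Python's sqlite3 module. The table name doc follows one of the answers above; the helper name expand_docs is made up, and sizing the number list from max(exdiff) with a recursive CTE is my own choice rather than something any of the answers prescribe.

```python
import sqlite3

def expand_docs(doc_rows):
    """Return prevdoc + 1 .. prevdoc + exdiff for every (prevdoc, exdiff) row."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE doc (prevdoc INT, exdiff INT)")
        con.executemany("INSERT INTO doc VALUES (?, ?)", doc_rows)
        sql = """
        WITH RECURSIVE nums(seqnum) AS (
            -- numbers 1..max(exdiff), so the list is never too short
            SELECT 1
            UNION ALL
            SELECT seqnum + 1 FROM nums
            WHERE seqnum < (SELECT max(exdiff) FROM doc)
        )
        SELECT d.prevdoc + nums.seqnum AS docnum
        FROM doc d
        JOIN nums ON nums.seqnum <= d.exdiff
        ORDER BY docnum
        """
        return [row[0] for row in con.execute(sql)]
    finally:
        con.close()

print(expand_docs([(1, 3), (21, 2), (126, 2)]))
```

The join condition seqnum <= exdiff is what limits each starting point to its own count, and adding seqnum to prevdoc (not to exdiff) is what yields the numbers after the starting point.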