PostgreSQL cross tab with three columns with values summed from one column - sql

I am new to SQL and was trying to do a crosstab in Postgres. I would have done it in Excel, but I have a database of around 3.5 million rows, 20,000 different values for code, 7 categories in cat, and values ranging from 1 to 100. A code may have only a few of the 7 categories.
Excel can't handle the number of rows, so SQL it is.
My data is in the form
code | cat | value |
--------------------------------
abc123 | 1 | 4 |
abc234 | 2 | 6 |
abc345 | 1 | 1 |
abc123 | 3 | 2 |
abc123 | 6 | 12 |
with code and cat as text, value as integer stored in a Postgres table.
I would like to perform a crosstab on code and cat, with the sum of value. I would like it to show zero instead of 'null' in the result, but if returning 'null' makes for a simpler query, that would be fine.
So the output I would like is
code | 'cat=0' | 'cat=1' | 'cat=2' | 'cat=3' | 'cat=4' | 'cat=5' | 'cat=6'|
abc123 | 25 | 0 | 3 | 500 | 250 | 42 | 0 |
abc234 | 0 | 100 | 0 | 10 | 5 | 0 | 25 |
abc345 | 1000 | 0 | 0 | 0 | 0 | 0 | 0 |
I have searched on Postgres help files and other forums; the closest thing was the SO question PostgreSQL Crosstab Query but I couldn't figure out how to sum the values from third column.
Any assistance would be greatly appreciated.

I got this working by updating my code to the following:
select * from crosstab(
'select code, cat, sum(value) as value
from my_table
group by code, cat
order by 1,2'
) as ct(code varchar(255),
cat_0 bigint,
cat_1 bigint,
cat_2 bigint,
cat_3 bigint,
cat_4 bigint,
cat_5 bigint,
cat_6 bigint)
I was able to determine the right data types by running the select statement inside the crosstab on its own and matching the data types in my AS ct(...) column list to those returned by that query.

Try:
select * from crosstab(
'select code, cat, sum(value) as value
from my_table
group by code, cat
order by 1,2'
) as ct(code text,
cat_0 int,
cat_1 int,
cat_2 int,
cat_3 int,
cat_4 int,
cat_5 int,
cat_6 int)
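One caveat worth keeping in mind: the one-parameter form of crosstab() fills the output columns from left to right, so a code that lacks some categories can end up with values in the wrong columns, and missing cells come back as NULL. As a minimal sketch of an alternative, the following uses plain conditional aggregation, which pins each category to its own column and yields 0 for missing cells. It runs against an in-memory SQLite database from Python purely for illustration; my_table and the first five rows come from the question, while the sixth row and the column names cat_0 ... cat_6 are assumptions. The generated SELECT should carry over to Postgres largely unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (code TEXT, cat TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [
        ("abc123", "1", 4),
        ("abc234", "2", 6),
        ("abc345", "1", 1),
        ("abc123", "3", 2),
        ("abc123", "6", 12),
        ("abc123", "1", 10),  # extra row (assumption) to show that values are summed
    ],
)

# Fixed, trusted list of categories; each one becomes its own output column.
categories = ["0", "1", "2", "3", "4", "5", "6"]
columns = ",\n       ".join(
    f"SUM(CASE WHEN cat = '{c}' THEN value ELSE 0 END) AS cat_{c}"
    for c in categories
)
query = f"SELECT code,\n       {columns}\nFROM my_table\nGROUP BY code\nORDER BY code"

pivot = conn.execute(query).fetchall()
for row in pivot:
    print(row)
```

Because every column has an explicit ELSE 0, a code with no rows for a category shows 0 rather than NULL, which is what the question asked for.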

Related

Counting columns if certain Id

I have a table tblTitles that I am attempting to run a select query on. I would like to count how many reports there are for each IdState, and also count how many of those titles are on sale, which is the case when the IsOnSaleId column has a value of 1.
Here is an example of the table:
+---------+----------+-------------------+-----------------+
| IdState | RegionId | Title | IsOnSaleId |
+---------+----------+-------------------+-----------------+
| 22 | 1 | Online Shopping | 0 |
| 22 | 1 | Retail Shopping | 1 |
| 22 | 1 | Pick Up | 0 |
| | | | |
+---------+----------+-------------------+-----------------+
My expected outcome should read that IdState of 22 has 3 reports and 1 report is onSale due to the 1 integer in the second row. Which would look similar to this:
+---------+-------------+---------------+
| IdState | ReportCount | IsOnSaleCount |
+---------+-------------+---------------+
| 22 | 3 | 1 |
+---------+-------------+---------------+
I am having issues when doing a select statement with this count. The IsOnSaleCount comes out identical to the ReportCount, which it should not be.
I believe this is due to my line of code: case when count(i.IsOnSaleId) > 0 THEN count(1) Else 0 End as IsOnSaleCount
Is this something that I can do in a SELECT query?
Here is an example of my query :
select
i.IdState,
count(i.RegionId) as ReportCount,
case when count(i.IsOnSaleId) > 0 THEN count(1) Else 0 End as IsOnSaleCount,
0 as EnterpriseReportCount,
i.IdReportCollection_PK_PrimaryCollection
from IBIS_Local.dbo.tblindustry i
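If you want the count of rows where the flag is 1 (count(i.IsOnSaleId) alone counts every non-NULL value, including 0):
count(case when i.IsOnSaleId = 1 then 1 end) as IsOnSaleCount,
If you just want a 0/1 flag, you could do:
sign(count(case when i.IsOnSaleId = 1 then 1 end)) as IsOnSaleCount,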
IF OBJECT_ID('stack.report') IS NOT NULL DROP TABLE stack.report
CREATE TABLE stack.report ( IdState TINYINT, RegionID TINYINT, Title VARCHAR(50), IsOnSaleId INT)
INSERT INTO stack.report
VALUES
(22,1,'Online Shopping', 0)
, (22,1,'Retail Shopping', 1)
, (22,1,'Pick Up', 0)
SELECT *, CONVERT(TINYINT, IsOnSaleId) isonsaleint FROM stack.report
SELECT IdState, COUNT(*) ReportCount, SUM(IsOnSaleId) OnSaleCount
FROM stack.report
GROUP BY IdState
ORDER BY IdState
Result
IdState | ReportCount | OnSaleCount
22 | 3 | 1
The SUM works if IsOnSaleId is an INT, SMALLINT or TINYINT. If the IsOnSaleId datatype is BIT (commonly used for flags), then you will need to convert it to one of the int types, like this: SUM(CONVERT(INT, IsOnSaleId))
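To make the difference between the counting variants concrete, here is a small sketch that runs the three aggregates side by side on the sample rows from the question. It uses an in-memory SQLite database from Python only so that it is self-contained; the table name report and the column alias NonNullCount are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE report (IdState INTEGER, RegionId INTEGER, Title TEXT, IsOnSaleId INTEGER)"
)
conn.executemany(
    "INSERT INTO report VALUES (?, ?, ?, ?)",
    [
        (22, 1, "Online Shopping", 0),
        (22, 1, "Retail Shopping", 1),
        (22, 1, "Pick Up", 0),
    ],
)

result = conn.execute(
    """
    SELECT IdState,
           COUNT(*)          AS ReportCount,    -- every row in the group
           COUNT(IsOnSaleId) AS NonNullCount,   -- every non-NULL value, zeros included
           SUM(CASE WHEN IsOnSaleId = 1 THEN 1 ELSE 0 END) AS IsOnSaleCount
    FROM report
    GROUP BY IdState
    """
).fetchall()
print(result)
```

COUNT(IsOnSaleId) returns the same number as COUNT(*) here because a 0 is still a non-NULL value, which is why the original query reported identical counts; only the conditional SUM counts the rows that are actually on sale.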

Linking Related IDs together through two other ID columns

I have a table of about 100k rows with the following layout:
+----+-----------+------------+-------------------+
| ID | PIN | RAID | Desired Output ID |
+----+-----------+------------+-------------------+
| 1 | 80602627 | 1737852-1 | 1 |
| 2 | 80602627 | 34046655-1 | 1 |
| 3 | 351418172 | 33661 | 2 |
| 4 | 351418172 | 33661 | 2 |
| 5 | 351418172 | 33661 | 2 |
| 6 | 351418172 | 34443321-1 | 2 |
| 7 | 491863017 | 26136 | 3 |
| 8 | 491863017 | 34575 | 3 |
| 9 | 491863017 | 34575 | 3 |
| 10 | 661254727 | 26136 | 3 |
| 11 | 661254727 | 26136 | 3 |
| 12 | NULL | 7517 | 4 |
| 13 | NULL | 7517 | 4 |
| 14 | NULL | 7517 | 4 |
| 15 | NULL | 7517 | 4 |
| 16 | NULL | 7517 | 4 |
| 17 | 554843813 | 33661 | 2 |
| 18 | 554843813 | 33661 | 2 |
+----+-----------+------------+-------------------+
The ID column has unique values, with the PIN and RAID columns being two separate identifying numbers used to group linked IDs together. The Desired Output ID column is what I would like SQL to do, essentially looking at both the PIN and RAID columns to spot where there are any relationships between them.
So for example Where Desired Output ID = 2, IDs 3-6 match on PIN = 351418172, and then IDs 17-18 also match as the RAID of 33661 was in the rows for IDs 3-5.
To add as well, NULLs will be in the PIN Column but not in any others.
I did spot a similar question, however as it is in BigQuery I wasn't sure it would help.
Have been trying to crack this one for a while with no luck, any help massively appreciated.
I suppose DENSE_RANK can solve your problem. I'm not sure what the combination of PIN and RAID should be, but I think you'll be able to figure out how to do it from something like this:
SELECT *,DENSE_RANK( ) over (ORDER BY isnull(pin,id) ),DENSE_RANK( ) over (ORDER BY raid)
FROM accounts
I believe I have found a bit of a bodged solution to this. It runs very slowly as it goes row by row and will only go two links deep on PIN/RAID, but this should be sufficient for 99%+ of cases.
I would appreciate any suggestions for speeding it up if anything is immediately obvious.
The ID column in the post above is DebtorNo in the code:
DECLARE @Counter INT = 1
DECLARE @EndCounter INT = 0
IF OBJECT_ID('Tempdb..#OrigACs') IS NOT NULL
BEGIN
DROP TABLE #OrigACs
END
-- Replace NULL PINs with the account's own DebtorNo so every row has a usable PIN
SELECT DebtorNo,
Name,
PostCode,
DOB,
RAJoin,
COALESCE(PIN,DebtorNo COLLATE DATABASE_DEFAULT) AS PIN,
RelatedAssets,
RAID,
PINRelatedAssets
INTO #OrigACs
FROM MIReporting..HC_RA_Test_Data RA
IF OBJECT_ID('Tempdb..#Accounts') IS NOT NULL
BEGIN
DROP TABLE #Accounts
END
-- Number the accounts so the loop can step through them one at a time
SELECT *,
ROW_NUMBER() OVER (ORDER BY CAST(RA.DebtorNo AS INT)) AS Row
INTO #Accounts
FROM #OrigACs RA
ORDER BY CAST(RA.DebtorNo AS INT)
CREATE INDEX Temp_HC_Index ON #OrigACs (RAID,PIN)
SET @EndCounter = (SELECT MAX(Row) FROM #Accounts)
WHILE @Counter <= @EndCounter
BEGIN
IF OBJECT_ID('Tempdb..#RAID1') IS NOT NULL
BEGIN
DROP TABLE #RAID1
END
-- First hop: accounts sharing a RAID with the current account
SELECT *
INTO #RAID1
FROM #OrigACs A
WHERE A.RAID IN (SELECT RAID FROM #Accounts WHERE [Row] = @Counter)
IF OBJECT_ID('Tempdb..#PIN1') IS NOT NULL
BEGIN
DROP TABLE #PIN1
END
-- Then accounts sharing a PIN with any of those
SELECT *
INTO #PIN1
FROM #OrigACs A
WHERE A.PIN IN (SELECT PIN FROM #RAID1)
IF OBJECT_ID('Tempdb..#RAID2') IS NOT NULL
BEGIN
DROP TABLE #RAID2
END
-- Second hop on RAID
SELECT *
INTO #RAID2
FROM #OrigACs A
WHERE A.RAID IN (SELECT RAID FROM #PIN1)
IF OBJECT_ID('Tempdb..#PIN2') IS NOT NULL
BEGIN
DROP TABLE #PIN2
END
-- Second hop on PIN
SELECT *
INTO #PIN2
FROM #OrigACs A
WHERE A.PIN IN (SELECT PIN FROM #RAID2)
-- Store the linked accounts under the next free group number (FRAID)
INSERT INTO MIReporting..HC_RA_Final_ACs
SELECT DebtorNo,
Name,
PostCode,
DOB,
RAJoin,
CASE
WHEN PIN = DebtorNo COLLATE DATABASE_DEFAULT THEN NULL
ELSE PIN
END AS PIN,
RelatedAssets,
RAID,
PINRelatedAssets,
COALESCE((SELECT MAX(FRAID) FROM MIReporting..HC_RA_Final_ACs),0) + 1 AS FRAID
FROM #PIN2
-- Jump to the first account that has not been assigned to a group yet
SET @Counter = (SELECT MIN([ROW]) FROM #Accounts O WHERE O.DebtorNo NOT IN (SELECT DebtorNo FROM MIReporting..HC_RA_Final_ACs));
END;
SELECT *
FROM MIReporting..HC_RA_Final_ACs
DROP TABLE #OrigACs
DROP TABLE #Accounts
DROP TABLE #RAID1
DROP TABLE #PIN1
DROP TABLE #RAID2
DROP TABLE #PIN2
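The linking described in the question is essentially a connected-components problem: two rows belong together if a chain of shared PIN or RAID values connects them, however long that chain is. As a rough sketch of that idea outside SQL (not the approach used in the answers above), the following uses a union-find structure in Python. The function name link_groups is hypothetical, and the sample rows are the ones from the question's table.

```python
def link_groups(rows):
    """rows: list of (id, pin, raid); returns {id: group_number}."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Treat every ID, PIN and RAID as a node; each row links its ID to its PIN and RAID.
    for row_id, pin, raid in rows:
        node = ("ID", row_id)
        find(node)
        if pin is not None:  # NULL PINs must not link rows together
            union(node, ("PIN", pin))
        union(node, ("RAID", raid))  # the question states RAID is never NULL

    # Number the components in order of first appearance.
    numbers, result = {}, {}
    for row_id, _pin, _raid in rows:
        root = find(("ID", row_id))
        numbers.setdefault(root, len(numbers) + 1)
        result[row_id] = numbers[root]
    return result


rows = [
    (1, 80602627, "1737852-1"), (2, 80602627, "34046655-1"),
    (3, 351418172, "33661"), (4, 351418172, "33661"), (5, 351418172, "33661"),
    (6, 351418172, "34443321-1"),
    (7, 491863017, "26136"), (8, 491863017, "34575"), (9, 491863017, "34575"),
    (10, 661254727, "26136"), (11, 661254727, "26136"),
    (12, None, "7517"), (13, None, "7517"), (14, None, "7517"),
    (15, None, "7517"), (16, None, "7517"),
    (17, 554843813, "33661"), (18, 554843813, "33661"),
]
linked = link_groups(rows)
print(linked)
```

Unlike a fixed two-hop lookup, this follows links to any depth, and it touches each row only a couple of times, so 100k rows should not be a problem.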

Unexpected effect of filtering on result from crosstab() query with multiple values

I have a crosstab() query similar to the one in my previous question:
Unexpected effect of filtering on result from crosstab() query
The common case is to filter the extra1 field with multiple values: extra1 IN(value1, value2...). For each value included in the extra1 filter, I have added an ordering expression like this (extra1 <> valueN), as shown in the above-mentioned post. The resulting query is as follows:
SELECT *
FROM crosstab(
'SELECT row_name, extra1, extra2..., another_table.category, value
FROM table t
JOIN another_table ON t.field_id = another_table.field_id
WHERE t.field = certain_value AND t.extra1 IN (val1, val2, ...) --> more values
ORDER BY row_name ASC, (extra1 <> val1), (extra1 <> val2)', ... --> more ordering expressions
'SELECT category_name FROM category_name WHERE field = certain_value'
) AS ct(extra1, extra2...)
WHERE extra1 = val1; --> condition on the result
The first value of extra1 in the ordering expressions (value1) gets the correct resulting rows. However, the following ones (value2, value3...) get the wrong number of results, with fewer rows for each one. Why is that?
UPDATE:
Given this as our source table (table t):
+----------+--------+--------+------------------------+-------+
| row_name | Extra1 | Extra2 | another_table.category | value |
+----------+--------+--------+------------------------+-------+
| Name1 | 10 | A | 1 | 100 |
| Name2 | 11 | B | 2 | 200 |
| Name3 | 12 | C | 3 | 150 |
| Name2 | 11 | B | 3 | 150 |
| Name3 | 12 | C | 2 | 150 |
| Name1 | 10 | A | 2 | 100 |
| Name3 | 12 | C | 1 | 120 |
+----------+--------+--------+------------------------+-------+
And this as our category table:
+-------------+--------+
| category_id | value |
+-------------+--------+
| 1 | Cat1 |
| 2 | Cat2 |
| 3 | Cat3 |
+-------------+--------+
Using the CROSSTAB, the idea is to get a table like this:
+----------+--------+--------+------+------+------+
| row_name | Extra1 | Extra2 | cat1 | cat2 | cat3 |
+----------+--------+--------+------+------+------+
| Name1 | 10 | A | 100 | 100 | |
| Name2 | 11 | B | | 200 | 150 |
| Name3 | 12 | C | 120 | 150 | 150 |
+----------+--------+--------+------+------+------+
The idea is to be able to filter the resulting table so that I get only the rows whose Extra1 column has the value 10 or 11, as follows:
+----------+--------+--------+------+------+------+
| row_name | Extra1 | Extra2 | cat1 | cat2 | cat3 |
+----------+--------+--------+------+------+------+
| Name1 | 10 | A | 100 | 100 | |
| Name2 | 11 | B | | 200 | 150 |
+----------+--------+--------+------+------+------+
The problem is that with my query I get a different result size for Extra1 = 10 than for Extra1 = 11. With (Extra1 <> 10) I get the correct result size for that value of Extra1, but not for 11.
Here is a fiddle demonstrating the problem in more detail:
https://dbfiddle.uk/?rdbms=postgres_11&fiddle=5c401f7512d52405923374c75cb7ff04
All "extra" columns are copied from the first row of the group (as pointed out in my previous answer)
While you filter with:
.... WHERE extra1 = 'val1';
...it makes no sense to add more ORDER BY expressions on the same column. Only rows that have at least one extra1 = 'val1' in their source group survive.
From your various comments, I guess you might want to see all distinct existing values of extra - within the set filtered in the WHERE clause - for the same unixdatetime. If so, aggregate before pivoting. Like:
SELECT *
FROM crosstab(
$$
SELECT unixdatetime, x.extras, c.name, s.value
FROM (
SELECT unixdatetime, array_agg(extra) AS extras
FROM (
SELECT DISTINCT unixdatetime, extra
FROM source_table s
WHERE extra IN (1, 2) -- condition moves here
ORDER BY unixdatetime, extra
) sub
GROUP BY 1
) x
JOIN source_table s USING (unixdatetime)
JOIN category_table c ON c.id = s.gausesummaryid
ORDER BY 1
$$
, $$SELECT unnest('{trace1,trace2,trace3,trace4}'::text[])$$
) AS final_result (unixdatetime int
, extras int[]
, trace1 numeric
, trace2 numeric
, trace3 numeric
, trace4 numeric);
Aside: advice given in the following related answer about the 2nd function parameter applies to your case as well:
PostgreSQL crosstab doesn't work as desired
I demonstrate a static 2nd parameter query above. While at it, you don't need to join to category_table at all. The same query, a bit shorter and faster:
SELECT *
FROM crosstab(
$$
SELECT unixdatetime, x.extras, s.gausesummaryid, s.value
FROM (
SELECT unixdatetime, array_agg(extra) AS extras
FROM (
SELECT DISTINCT unixdatetime, extra
FROM source_table
WHERE extra IN (1, 2) -- condition moves here
ORDER BY unixdatetime, extra
) sub
GROUP BY 1
) x
JOIN source_table s USING (unixdatetime)
ORDER BY 1
$$
, $$SELECT unnest('{923,924,926,927}'::int[])$$
) AS final_result (unixdatetime int
, extras int[]
, trace1 numeric
, trace2 numeric
, trace3 numeric
, trace4 numeric);
db<>fiddle here - added my queries at the bottom of your fiddle.

Calculating consecutive range of dates with a value in Hive

I want to know if it is possible to calculate the consecutive ranges of a specific value for a group of Id's and return the calculated value(s) of each one.
Given the following data:
+----+----------+--------+
| ID | DATE_KEY | CREDIT |
+----+----------+--------+
| 1 | 8091 | 0.9 |
| 1 | 8092 | 20 |
| 1 | 8095 | 0.22 |
| 1 | 8096 | 0.23 |
| 1 | 8098 | 0.23 |
| 2 | 8095 | 12 |
| 2 | 8096 | 18 |
| 2 | 8097 | 3 |
| 2 | 8098 | 0.25 |
+----+----------+--------+
I want the following output:
+----+-------------------------------+
| ID | RANGE_DAYS_CREDIT_LESS_THAN_1 |
+----+-------------------------------+
| 1 | 1 |
| 1 | 2 |
| 1 | 1 |
| 2 | 1 |
+----+-------------------------------+
In this case, the ranges are the consecutive days with credit less than 1. If there is a gap in the date_key column, the range should not extend to the next value, as for ID 1 between date keys 8096 and 8098.
Is it possible to do this with windowing functions in Hive?
Thanks in advance!
You can do this with a running sum that classifies rows into groups, incrementing by 1 every time a row with credit>=1 is found (in date_key order). After that it is just a group by on id and the group number. Note that this separates runs only where a credit>=1 row intervenes; it does not by itself split a run at a gap in date_key.
select id,count(*) as range_days_credit_lt_1
from (select t.*
,sum(case when credit<1 then 0 else 1 end) over(partition by id order by date_key) as grp
from tbl t
) t
where credit<1
group by id,grp
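Since the question also wants a run to break at a gap in date_key, a common variant is the "gaps and islands" trick: among the rows with credit below 1, date_key minus the row's position within its id stays constant across a run of consecutive days. In SQL that key would be something like date_key - row_number() over (partition by id order by date_key). The sketch below checks the idea in plain Python on the question's data; the function name credit_ranges is hypothetical.

```python
from itertools import groupby


def credit_ranges(rows, threshold=1):
    """rows: iterable of (id, date_key, credit); returns [(id, run_length), ...]."""
    # Keep only the low-credit rows, ordered by id and then date_key.
    low = sorted((rid, day) for rid, day, credit in rows if credit < threshold)
    out = []
    for rid, group in groupby(low, key=lambda r: r[0]):
        # Position within the id plays the role of row_number().
        indexed = enumerate(day for _, day in group)
        # day - position is the same for every day of one consecutive run.
        for _, run in groupby(indexed, key=lambda p: p[1] - p[0]):
            out.append((rid, len(list(run))))
    return out


data = [
    (1, 8091, 0.9), (1, 8092, 20), (1, 8095, 0.22), (1, 8096, 0.23), (1, 8098, 0.23),
    (2, 8095, 12), (2, 8096, 18), (2, 8097, 3), (2, 8098, 0.25),
]
ranges = credit_ranges(data)
print(ranges)
```

Filtering on the credit condition first means that a day with credit of 1 or more also shows up as a gap, so both kinds of break are handled by the same key.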
The key is to collapse each consecutive sequence and compute its length. I managed to achieve this in a relatively clumsy way:
with t_test as
(
select num,row_number()over(order by num) as rn
from
(
select explode(array(1,3,4,5,6,9,10,15)) as num
)
)
select length(sign)+1 from
(
select explode(continue_sign) as sign
from
(
select split(concat_ws('',collect_list(if(d>1,'v',d))), 'v') as continue_sign
from
(
select t0.num-t1.num as d from t_test t0
join t_test t1 on t0.rn=t1.rn+1
)
)
)
Get the previous number b in the sequence for each original a;
Check whether a-b == 1; a difference greater than 1 indicates a "gap", which is marked as 'v';
Merge all a-b values into a string, then split it on 'v' and compute the length of each piece.
To get the ID column out, another string which encodes the id should be considered.

Convert tuple value to column names

Got something like:
+-------+------+-------+
| count | id | grade |
+-------+------+-------+
| 1 | 0 | A |
| 2 | 0 | B |
| 1 | 1 | F |
| 3 | 1 | D |
| 5 | 2 | B |
| 1 | 2 | C |
I need:
+-----+---+----+---+---+---+
| id | A | B | C | D | F |
+-----+---+----+---+---+---+
| 0 | 1 | 2 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 1 |
| 2 | 0 | 5 | 1 | 0 | 0 |
I don't know if I can even do this. I can group by id but how would you read the count value for each grade column?
CREATE TABLE #MyTable(_count INT,id INT , grade VARCHAR(10))
INSERT INTO #MyTable( _count ,id , grade )
SELECT 1,0,'A' UNION ALL
SELECT 2,0,'B' UNION ALL
SELECT 1,1,'F' UNION ALL
SELECT 3,1,'D' UNION ALL
SELECT 5,2,'B' UNION ALL
SELECT 1,2,'C'
SELECT *
FROM
(
SELECT _count ,id ,grade
FROM #MyTable
)A
PIVOT
(
MAX(_count) FOR grade IN ([A],[B],[C],[D],[F])
)P
You need a "pivot" table or "cross-tabulation". You can use a combination of aggregation and CASE statements, or, more elegantly, the crosstab() function provided by the additional module tablefunc. All basics here:
PostgreSQL Crosstab Query
Since not all keys in grade have values, you need the 2-parameter form. Like this:
SELECT * FROM crosstab(
'SELECT id, grade, count FROM table ORDER BY 1,2'
, $$SELECT unnest('{A,B,C,D,F}'::text[])$$
) ct(id text, "A" int, "B" int, "C" int, "D" int, "F" int);