Summarized table in PostgreSQL for better performance - sql

I am using PostgreSQL as my database. I have a table MASTER(A, B, C, D, N1, N2, N3, N4, N5, N6) where the primary key is (A, B, C, D) and N1, N2, N3, N4, N5, N6 are numeric columns.
I have a query as below to get the summarized data of each A selected from each list in MASTERCOMB.
SELECT MASTERCOMB.A
,STATS.sumn1
,STATS.sumn2
,STATS.sumn3
,STATS.sumn4
,STATS.sumn5
,STATS.sumn6
FROM (WITH
sum1 AS (SELECT A, SUM(N1) FROM MASTER WHERE B = $1 GROUP BY A ORDER BY SUM(N1) DESC LIMIT $2),
sum2 AS (SELECT A, SUM(N2) FROM MASTER WHERE B = $1 GROUP BY A ORDER BY SUM(N2) DESC LIMIT $2),
sum3 AS (SELECT A, SUM(N3) FROM MASTER WHERE B = $1 GROUP BY A ORDER BY SUM(N3) DESC LIMIT $2),
sum4 AS (SELECT A, SUM(N4) FROM MASTER WHERE B = $1 GROUP BY A ORDER BY SUM(N4) DESC LIMIT $2),
sum5 AS (SELECT A, SUM(N5) FROM MASTER WHERE B = $1 GROUP BY A ORDER BY SUM(N5) DESC LIMIT $2),
sum6 AS (SELECT A, SUM(N6) FROM MASTER WHERE B = $1 GROUP BY A ORDER BY SUM(N6) DESC LIMIT $2)
SELECT DISTINCT COALESCE(sum1.A, sum2.A, sum3.A, sum4.A, sum5.A, sum6.A) A
FROM sum1
FULL OUTER JOIN sum2 ON sum2.A = sum1.A
FULL OUTER JOIN sum3 ON sum3.A = sum1.A
FULL OUTER JOIN sum4 ON sum4.A = sum1.A
FULL OUTER JOIN sum5 ON sum5.A = sum1.A
FULL OUTER JOIN sum6 ON sum6.A = sum1.A) MASTERCOMB
LEFT JOIN (SELECT A
,SUM(N1) sumn1
,SUM(N2) sumn2
,SUM(N3) sumn3
,SUM(N4) sumn4
,SUM(N5) sumn5
,SUM(N6) sumn6
FROM MASTER WHERE B = $1 GROUP BY A) AS STATS
ON STATS.A = MASTERCOMB.A
This is just one kind of query, with B in the WHERE clause. I may have to query with different conditions, like 'WHERE C = $3' or 'WHERE D = $4'. In rare cases I may have to query with multiple conditions on B, C and D together.
As the table grows, the performance of the queries could drop, so I am thinking of two approaches:
Approach #1:
Create Summary Tables SMRY_A_B, SMRY_A_C, SMRY_A_D
On each insert, update and delete of MASTER table, SUM the values and insert/update/delete respective tables
Approach #2:
Create a Summary table SMRY_A_B_C_D with primary key (A, B, C, D)
On each insert, update and delete of MASTER table, SUM the values and insert/update/delete SMRY_A_B_C_D table
possible values for SMRY_A_B_C_D could be
(valA, valB, 'N/A', 'N/A', sumn1, sumn2, sumn3, sumn4, sumn5, sumn6)
(valA, 'N/A', valC, 'N/A', sumn1, sumn2, sumn3, sumn4, sumn5, sumn6)
(valA, 'N/A', 'N/A', valD, sumn1, sumn2, sumn3, sumn4, sumn5, sumn6)
Questions:
Which approach is better to go with?
Should I not consider both the approaches and query from the master table itself? If so should I optimize the query?
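To make Approach #1 concrete, here is a minimal sketch of a trigger-maintained summary table. It is written against SQLite through Python's sqlite3 module only so that it runs anywhere; the trigger names (master_ai, master_ad), the cut-down column list (n1, n2 only) and the sample rows are assumptions for illustration, and PostgreSQL would need its own trigger functions written in PL/pgSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE master (a TEXT, b TEXT, c TEXT, d TEXT, n1 NUMERIC, n2 NUMERIC,
                     PRIMARY KEY (a, b, c, d));
CREATE TABLE smry_a_b (a TEXT, b TEXT, sumn1 NUMERIC, sumn2 NUMERIC,
                       PRIMARY KEY (a, b));

-- Hypothetical trigger: on insert, make sure the (a, b) row exists,
-- then add the new values to the running sums.
CREATE TRIGGER master_ai AFTER INSERT ON master BEGIN
  INSERT OR IGNORE INTO smry_a_b (a, b, sumn1, sumn2) VALUES (NEW.a, NEW.b, 0, 0);
  UPDATE smry_a_b SET sumn1 = sumn1 + NEW.n1, sumn2 = sumn2 + NEW.n2
  WHERE a = NEW.a AND b = NEW.b;
END;

-- Hypothetical trigger: on delete, subtract the removed row's values.
CREATE TRIGGER master_ad AFTER DELETE ON master BEGIN
  UPDATE smry_a_b SET sumn1 = sumn1 - OLD.n1, sumn2 = sumn2 - OLD.n2
  WHERE a = OLD.a AND b = OLD.b;
END;
""")

conn.executemany("INSERT INTO master VALUES (?, ?, ?, ?, ?, ?)", [
    ("a1", "b1", "c1", "d1", 10, 1),
    ("a1", "b1", "c2", "d1", 5, 2),
    ("a2", "b1", "c1", "d1", 7, 3),
])
conn.execute("DELETE FROM master WHERE a = 'a1' AND c = 'c2'")

summary = conn.execute(
    "SELECT a, b, sumn1, sumn2 FROM smry_a_b ORDER BY a, b").fetchall()
print(summary)
```

An UPDATE on MASTER would need a third trigger that subtracts the old values and adds the new ones, and this sketch leaves a zeroed summary row behind once every matching MASTER row has been deleted.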

Related

SQL Finding duplicate values in two of the three columns of each row

Let's say we have three columns: A, B, and C.
I would like to filter the results as follows:
The values of A and B are the same (duplicated) for > 1 (more than 1) row, and the value of C is always different.
In the attached image, the values that appear selected would meet the conditions mentioned above.
What I've tried:
SELECT
a.notation as A, a.gene as B, b.id as C
FROM
`db-dummy`.sgdata c
join `db-dummy`.g_info a on a.rec_id = c.gen_id
join `db-dummy`.spec_data b on b.rec_id = c.spec_id GROUP BY A, B HAVING COUNT(*) > 1;
I thought that using GROUP BY and HAVING COUNT(*) > 1 I could get the desired result, but I get the following error:
SQL Error [1055] [42000]: (conn=1632) Expression #3 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'db-dummy.b.spec_id' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
If you had a single table, I would suggest just using exists. But because you have a join, use window functions. If you are looking for different values of id:
SELECT A, B, C
FROM (SELECT a.notation as A, a.gene as B, b.id as C,
MIN(b.id) OVER (PARTITION BY a.notation, a.gene) as min_id,
MAX(b.id) OVER (PARTITION BY a.notation, a.gene) as max_id
FROM `db-dummy`.sgdata c JOIN
`db-dummy`.g_info a
ON a.rec_id = c.gen_id JOIN
`db-dummy`.spec_data b
ON b.rec_id = c.spec_id
) x
WHERE min_id <> max_id;
If you are just looking for multiple rows for a given A and B, then you can use:
SELECT A, B, C
FROM (SELECT a.notation as A, a.gene as B, b.id as C,
COUNT(*) OVER (PARTITION BY a.notation, a.gene) as cnt
FROM `db-dummy`.sgdata c JOIN
`db-dummy`.g_info a
ON a.rec_id = c.gen_id JOIN
`db-dummy`.spec_data b
ON b.rec_id = c.spec_id
) x
WHERE cnt > 1;
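The difference between the two window-function variants above is easier to see on a few rows. This is a small sketch using Python's sqlite3 module (it assumes the bundled SQLite is 3.25 or newer, which is when window functions arrived); the single table joined and its sample rows are made up and stand in for the result of the three-way join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "joined" is a hypothetical stand-in for the joined A, B, C rows.
conn.execute("CREATE TABLE joined (a TEXT, b TEXT, c INTEGER)")
conn.executemany("INSERT INTO joined VALUES (?, ?, ?)", [
    ("n1", "g1", 1),
    ("n1", "g1", 2),   # same (a, b), different c
    ("n2", "g2", 3),   # single row for its (a, b)
    ("n3", "g3", 4),
    ("n3", "g3", 4),   # same (a, b), same c
])

# Variant 1: the (a, b) group must contain at least two different c values.
distinct_c = conn.execute("""
    SELECT a, b, c
    FROM (SELECT a, b, c,
                 MIN(c) OVER (PARTITION BY a, b) AS min_c,
                 MAX(c) OVER (PARTITION BY a, b) AS max_c
          FROM joined) x
    WHERE min_c <> max_c
    ORDER BY a, b, c""").fetchall()

# Variant 2: the (a, b) group only needs more than one row.
multi = conn.execute("""
    SELECT a, b, c
    FROM (SELECT a, b, c,
                 COUNT(*) OVER (PARTITION BY a, b) AS cnt
          FROM joined) x
    WHERE cnt > 1
    ORDER BY a, b, c""").fetchall()

print(distinct_c)
print(multi)
```

The MIN/MAX variant drops the n3/g3 group because its C values are identical, while the COUNT(*) variant keeps it.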
SELECT * FROM `db-dummy`.sgdata a
LEFT JOIN
(SELECT COUNT(Id) as count, notation, gene
FROM `db-dummy`.sgdata
GROUP BY notation, gene
HAVING COUNT(id) > 1) b
on a.notation = b.notation AND a.gene = b.gene

Delete duplicate rows from oracle DB with one condition

I have the script working, but it takes about 5 minutes to delete 11320860 records. Is there an alternative way of writing this query so that the execution time is reduced?
The scenario is that the same record combination can have E as well as A records. The code tries to delete both the A and E records if at least one E record exists for the same combination.
Delete from tableA u
WHERE EXISTS
(Select 1 from tableA w
WHERE w.a = u.a
AND w.b = u.b
AND w.c = u.c
AND w.d = u.d
AND w.flag ='E' ); -- Deletes about 11320860 records in 4 minutes
So you need this, I think:
Delete from tableA u
WHERE u.flag in ('A', 'E')
and EXISTS
(Select 1 from tableA w
WHERE w.a = u.a
AND w.b = u.b
AND w.c = u.c
AND w.d = u.d
AND w.flag ='E')
This way also should work:
delete from tableA
where flag in ('A', 'E')
and (a, b, c, d) in
(select a, b, c, d
from tableA
where flag = 'E')
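To check which rows such a delete removes, here is a tiny sketch in Python's sqlite3 module. The four sample rows are invented, and the correlated subquery refers to the outer table by name instead of the alias u so that it also parses on older SQLite builds. It only shows which rows are affected and says nothing about Oracle timings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableA (a, b, c, d, flag)")
conn.executemany("INSERT INTO tableA VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, "A"),  # deleted: an E row exists for the same combination
    (1, 1, 1, 1, "E"),  # deleted: it is the E row itself
    (2, 2, 2, 2, "A"),  # kept: no E row for this combination
    (3, 3, 3, 3, "E"),  # deleted: an E row matches itself
])

cur = conn.execute("""
    DELETE FROM tableA
    WHERE flag IN ('A', 'E')
      AND EXISTS (SELECT 1 FROM tableA w
                  WHERE w.a = tableA.a
                    AND w.b = tableA.b
                    AND w.c = tableA.c
                    AND w.d = tableA.d
                    AND w.flag = 'E')""")
deleted = cur.rowcount
remaining = conn.execute("SELECT a, b, c, d, flag FROM tableA").fetchall()
print(deleted, remaining)
```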

return column name of the maximum value in sql server 2012

My table looks like this (Totally different names)
ID Column1--Column2---Column3--------------Column30
X 0 2 6 0101 31
I want to find the second maximum value of Column1 to Column30 and put the column name in a separate column.
First row would look like :
ID Column1--Column2---Column3--------------Column30------SecondMax
X 0 2 6 0101 31 Column3
Query :
Update Table
Set SecondMax= (select Column_Name from table where ...)
with unpvt as (
select id, c, m
from T
unpivot (c for m in (c1, c2, c3, ..., c30)) as u /* <-- your list of columns */
)
update T
set SecondMax = (
select top 1 m
from unpvt as u1
where
u1.id = T.id
and u1.c < (
select max(c) from unpvt as u2 where u2.id = u1.id
)
order by c desc, m
)
I really don't like relying on top but this isn't a standard sql question anyway. And it doesn't do anything about ties other than returning the first column name by order of alphabetical sort.
You could modify the condition as shown below to get the "third maximum". (Obviously the constant 2 comes from 3 - 1.) Your version of SQL Server lets you use a variable there as well. SQL Server 2012 does not support the limit syntax, but it does support offset ... fetch together with an order by if that's preferable to top. And since it should work for top 0 and top 1 as well, you might just be able to run this query in a loop to populate all of your "maximums" from first to thirtieth.
Once you start having ties you'll eventually get a "thirtieth maximum" that's null. Make sure you cover those cases though.
and u1.c < all (
select distinct top 2 c from unpvt as u2 where u2.id = u1.id order by c desc
)
And after I think about it. If you're going to rank and update so many columns it would probably make even more sense to use a proper ranking function and do the update all at once. You'll also handle the ties a lot better even if the alphabetic sorting is still arbitrary.
with unpvt as (
select id, c, m, row_number() over (partition by id order by c desc, m) as nth_max
from T
unpivot (c for m in (c1, c2, c3, ..., c30)) as u /* <-- your list of columns */
)
update T set
FirstMax = (select m from unpvt as u where u.id = T.id and nth_max = 1),
SecondMax = (select m from unpvt as u where u.id = T.id and nth_max = 2),
...
NthMax = (select m from unpvt as u where u.id = T.id and nth_max = N)
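For anyone who wants to try the ranking idea without a SQL Server instance, here is a rough sketch in Python's sqlite3 module. SQLite has no unpivot, so it is emulated here with union all, which is a swapped-in technique and not the T-SQL above; the three-column table and the second row 'Y' are assumptions added to show how a tie behaves.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE T (id TEXT PRIMARY KEY,
                                c1 INTEGER, c2 INTEGER, c3 INTEGER,
                                SecondMax TEXT)""")
conn.execute("INSERT INTO T (id, c1, c2, c3) VALUES ('X', 0, 2, 6), ('Y', 5, 5, 1)")

conn.execute("""
    WITH unpvt AS (
        -- UNION ALL stands in for UNPIVOT: one row per (id, column name, value)
        SELECT id, 'c1' AS m, c1 AS c FROM T
        UNION ALL SELECT id, 'c2', c2 FROM T
        UNION ALL SELECT id, 'c3', c3 FROM T
    ), ranked AS (
        SELECT id, m,
               row_number() OVER (PARTITION BY id ORDER BY c DESC, m) AS nth_max
        FROM unpvt
    )
    UPDATE T
    SET SecondMax = (SELECT m FROM ranked r WHERE r.id = T.id AND r.nth_max = 2)
""")

result = conn.execute("SELECT id, SecondMax FROM T ORDER BY id").fetchall()
print(result)
```

Note that with row_number() a tie for the maximum (row 'Y') counts the tied column as the second maximum, whereas the earlier "strictly less than max" query would skip to the next lower value.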

Insert into table with multiple joins under a unique condition based off time

I have an insert statement that incorporates multiple joins. However, the last join (table ItemMultipliers) doesn't really have anything "tied" to the other tables. Its rows are just multipliers, with no unique identification or connection to the other tables; the only thing they carry is a timestamp.
I have 5 rows in this table and my script is taking all five rows. I need it to select only one, based on the closest time to the row in the table called ItemsProduced. The rows get written at the same time, but not at the same millisecond. Any help is most appreciated, thank you.
insert into KLNUser.dbo.ItemLookup (ItemNumber, Cases, [Description], [Type], Wic, Elc, totalelc, Shift, [TimeStamp])
select a.ItemNumber, b.CaseCount,b.ItemDescription, b.DivisionCode, b.WorkCenter, b.LaborPerCase, a.CaseCount* b.LaborPerCase* c.IaCoPc, a.shift, a.TimeStamp from ItemsProduced a
inner join MasterItemList b on a.ItemNumber = b.itemnumber
inner join ItemMultipliers c on c.MultiplyTimeStamp <=a.Timestamp Interval 1 seconds
where not exists (select * from ItemLookup where ItemNumber = a.ItemNumber and Cases = b.CaseCount and [TimeStamp] = a.TimeStamp)
I think the easiest way is with cross apply:
select a.ItemNumber, b.CaseCount,b.ItemDescription, b.DivisionCode, b.WorkCenter, b.LaborPerCase, a.CaseCount* b.LaborPerCase* c.IaCoPc, a.shift, a.TimeStamp
from ItemsProduced a inner join
MasterItemList b
on a.ItemNumber = b.itemnumber cross apply
(select top 1 *
from ItemMultipliers c
where c.MultiplyTimeStamp < a.Timestamp
order by c.MultiplyTimeStamp desc
) c
where not exists (select * from ItemLookup where ItemNumber = a.ItemNumber and Cases = b.CaseCount and [TimeStamp] = a.TimeStamp)
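The "latest multiplier before each produced item" lookup can be tried out with the sketch below, written with Python's sqlite3 module. SQLite has no cross apply or top, so a correlated MAX() subquery in the join condition is used as a stand-in; the sample rows, the text timestamps and the trimmed column list are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ItemsProduced (ItemNumber TEXT, CaseCount INTEGER, TimeStamp TEXT);
CREATE TABLE ItemMultipliers (IaCoPc REAL, MultiplyTimeStamp TEXT);

INSERT INTO ItemMultipliers VALUES
    (2.0, '2020-01-01 10:00:00.100'),
    (3.0, '2020-01-01 11:00:00.100');
INSERT INTO ItemsProduced VALUES
    ('I1', 10, '2020-01-01 10:00:00.500'),
    ('I2', 4,  '2020-01-01 11:00:00.900');
""")

# For each produced item, join only the multiplier row with the greatest
# timestamp that is still earlier than the item's own timestamp.
rows = conn.execute("""
    SELECT p.ItemNumber, p.CaseCount * m.IaCoPc AS total
    FROM ItemsProduced p
    JOIN ItemMultipliers m
      ON m.MultiplyTimeStamp = (SELECT MAX(m2.MultiplyTimeStamp)
                                FROM ItemMultipliers m2
                                WHERE m2.MultiplyTimeStamp < p.TimeStamp)
    ORDER BY p.ItemNumber""").fetchall()
print(rows)
```

Each item picks up exactly one multiplier row, as long as the multiplier timestamps are unique.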

Improving a query to find out-of-sync values between two tables

I have the following query:
SELECT
tableOneId,
SUM(a+b+c) AS tableOneData,
MIN(d) AS tableTwoData
FROM
tableTwo JOIN tableOne ON tableOneId = tableTwoId
GROUP BY
tableOneId
All of the mentioned columns are declared as numeric(30,6) NOT NULL.
In tableOne, I have entries whose sum (columns a, b, c) should be equivalent to column d in Table Two.
A simple example of this:
Table One (id here should read tableOneId to match above query)
id=1, a=1, b=0, c=0
id=1, a=0, b=2, c=0
id=2, a=1, b=0, c=0
Table Two (id here should read tableTwoId to match above query)
id=1, d=3
id=2, d=1
My first iteration used SUM(d)/COUNT(*) but division is messy so I'm currently using MIN(d). What would be a better way to write this query?
Try this:
SELECT
tableOneId,
tableOneData,
d AS tableTwoData
FROM tableTwo
JOIN (select tableOneId, sum(a + b + c) AS tableOneData
from tableone
group by 1) x ON tableOneId = tableTwoId
where tableOneData <> d;
This will return all rows that have incorrect data in table 2.
select tableOneId, SUM(a) + SUM(b) + SUM(c) as tableOneData, d as tableTwoData
from tableTwo JOIN tableOne ON tableOneId = tableTwoId
GROUP BY tableOneId, d
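Here is a quick check of the aggregate-then-join version against the sample data from the question, using Python's sqlite3 module. The rows for id 3 are a made-up out-of-sync case, since the question's own sample data is fully in sync, and plain integers are used in place of numeric(30,6).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tableOne (tableOneId INTEGER, a INTEGER, b INTEGER, c INTEGER);
CREATE TABLE tableTwo (tableTwoId INTEGER, d INTEGER);

INSERT INTO tableOne VALUES (1, 1, 0, 0), (1, 0, 2, 0), (2, 1, 0, 0),
                            (3, 1, 1, 1);      -- hypothetical extra id
INSERT INTO tableTwo VALUES (1, 3), (2, 1),
                            (3, 5);            -- hypothetical: 5 <> 1 + 1 + 1
""")

# Sum tableOne per id first, then join each total to its single tableTwo row.
out_of_sync = conn.execute("""
    SELECT tableOneId, tableOneData, d AS tableTwoData
    FROM tableTwo
    JOIN (SELECT tableOneId, SUM(a + b + c) AS tableOneData
          FROM tableOne
          GROUP BY tableOneId) x ON tableOneId = tableTwoId
    WHERE tableOneData <> d""").fetchall()
print(out_of_sync)
```

Because the sums are computed before the join, no MIN(d) or SUM(d)/COUNT(*) workaround is needed.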