Reduce rows in SQL

I have a select query that will return something like the following table:
start | stop | id
------+------+----
    0 |  100 |  1
    1 |  101 |  1
    2 |  102 |  1
    2 |  102 |  2
    5 |  105 |  1
    7 |  107 |  2
  ...
  300 |  400 |  1
  370 |  470 |  1
  450 |  550 |  1
Where stop = start + n; n = 100 in this case.
I would like to merge the overlaps for each id:
start | stop | id
------+------+----
    0 |  105 |  1
    2 |  107 |  2
  ...
  300 |  550 |  1
id 1 does not give 0 - 550 because the range starting at 300 begins after the previous merged range ends at 105.
There will be hundreds of thousands of records returned by the first query and n can go up to tens of thousands, so the faster it can be processed the better.
Using PostgreSQL btw.
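For reference, a minimal reproduction of the sample rows, assuming the output of the first query is materialized as a relation named q (the answer below also assumes this name):
-- Hypothetical setup; "q" stands in for the result of the original select query.
CREATE TEMP TABLE q (start int, stop int, id int);
INSERT INTO q (start, stop, id) VALUES
    (0, 100, 1), (1, 101, 1), (2, 102, 1), (2, 102, 2),
    (5, 105, 1), (7, 107, 2),
    (300, 400, 1), (370, 470, 1), (450, 550, 1);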

WITH bounds AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY start) AS rn
    FROM (
        SELECT id, LAG(stop) OVER (PARTITION BY id ORDER BY start) AS pstop, start
        FROM q
        UNION ALL
        SELECT id, MAX(stop), NULL
        FROM q
        GROUP BY id
    ) q2
    WHERE start > pstop OR pstop IS NULL OR start IS NULL
)
SELECT b2.start, b1.pstop AS stop, b1.id
FROM bounds b1
JOIN bounds b2
    ON b1.id = b2.id
   AND b1.rn = b2.rn + 1;
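An alternative worth sketching (against the same assumed relation q) is the classic gaps-and-islands pattern: flag each row whose start lies beyond the running maximum stop of the preceding rows, then sum the flags to number the islands and aggregate per island. Both approaches use only window functions and a single pass, so they should cope with the row counts mentioned above.
-- Hedged sketch of the gaps-and-islands alternative.
SELECT MIN(start) AS start, MAX(stop) AS stop, id
FROM (
    SELECT id, start, stop,
           SUM(new_island) OVER (PARTITION BY id ORDER BY start, stop) AS island
    FROM (
        SELECT id, start, stop,
               -- 1 when this row starts a new island, i.e. its start lies beyond
               -- the largest stop seen so far within the same id
               CASE WHEN start > MAX(stop) OVER (PARTITION BY id
                                                 ORDER BY start, stop
                                                 ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
                    THEN 1 ELSE 0 END AS new_island
        FROM q
    ) flagged
) grouped
GROUP BY id, island;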

Partition By - Sum all values Excluding Maximum Value

I have data as follows
+----+------+--------+
| ID | Code | Weight |
+----+------+--------+
|  1 | M    |    200 |
|  1 | 2A   |     50 |
|  1 | 2B   |     50 |
|  2 |      |    350 |
|  2 | M    |    350 |
|  2 | 3A   |    120 |
|  2 | 3B   |    120 |
|  3 | 5A   |    100 |
|  4 |      |    200 |
|  4 |      |    100 |
+----+------+--------+
For ID 1 the max weight is 200; I want to subtract the sum of all the other weights for ID 1 from that max value of 200.
There might be a case where two rows contain the max value for the same ID. For example, ID 2 has two rows with the max value of 350. In that scenario I still want to sum all values except the max value, but I would mark the weight as 0 for one of the two max-value rows: the one where Code is NULL/blank.
Where there is only one row for an ID, that row is kept as is.
Another scenario is where there is only one row containing the max weight but its Code is NULL/blank; in that case we simply do what we did for ID 1: sum all the other values and subtract them from the max-value row.
Desired Output
+----+------+--------+---------------+
| ID | Code | Weight | Actual Weight |
+----+------+--------+---------------+
|  1 | M    |    200 |           100 |
|  1 | 2A   |     50 |            50 |
|  1 | 2B   |     50 |            50 |
|  2 |      |    350 |             0 |
|  2 | M    |    350 |           110 |
|  2 | 3A   |    120 |           120 |
|  2 | 3B   |    120 |           120 |
|  3 | 5A   |    100 |           100 |
|  4 |      |    200 |           100 |
|  4 |      |    100 |           100 |
+----+------+--------+---------------+
I want to create the Actual Weight column as shown above, but I can't find a way to apply a partition that excludes the max value in order to compute it.
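For anyone who wants to try the answers below, a minimal setup sketch (assuming SQL Server and a table simply named tbl, as the first answer uses; a blank Code stands in for the NULL/blank case):
-- Hypothetical sample table matching the data above.
create table tbl (ID int, Code varchar(10), [Weight] int);
insert into tbl (ID, Code, [Weight]) values
    (1, 'M', 200), (1, '2A', 50), (1, '2B', 50),
    (2, '', 350), (2, 'M', 350), (2, '3A', 120), (2, '3B', 120),
    (3, '5A', 100),
    (4, '', 200), (4, '', 100);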
dense_rank() identifies the rows with the max weight: dr = 1 marks the max-weight rows.
row_number() differentiates the blank-Code max-weight row from the others.
with cte as
(
    select *,
           dr = dense_rank() over (partition by ID order by [Weight] desc),
           rn = row_number() over (partition by ID order by [Weight] desc, Code desc)
    from   tbl
)
select *,
       ActWeight = case when dr = 1 and rn <> 1
                        then 0
                        when dr = 1 and rn = 1
                        then [Weight]
                             - sum(case when dr <> 1 then [Weight] else 0 end) over (partition by ID)
                        else [Weight]
                   end
from cte
dbfiddle demo
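To see how the two rankings interact, a small inspection sketch against the same assumed tbl: for ID 2 both 350-weight rows get dr = 1, but 'M' sorts before the blank Code under Code desc, so it keeps rn = 1 (and the computed weight) while the blank duplicate gets rn = 2 and is zeroed out.
-- Sketch only: show the intermediate dr/rn values used by the query above.
select ID, Code, [Weight],
       dense_rank() over (partition by ID order by [Weight] desc) as dr,
       row_number() over (partition by ID order by [Weight] desc, Code desc) as rn
from tbl
order by ID, rn;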
Hmmm . . . I think you just want window functions and conditional logic:
select t.*,
       (case when 1 = row_number() over (partition by id order by weight desc, (case when code <> '' then 2 else 1 end))
             then weight - sum(case when weight <> max_weight then weight else 0 end) over (partition by id)
             else weight
        end) as actual_weight
from (select t.*,
             max(weight) over (partition by id) as max_weight
      from t
     ) t

Conditionally finding groups of duplicate column pairs, if one of the pairs contains a specific flag. Then updating all to that txId

I am new to SQL. Bad data got inserted into a table. The table represents items that a user has purchased. txId is an id generated upon purchase. The assumption is each item/user combination all have the same txId.
id | item | user | txId | date (ms)
-----------------------------------
 1 | cup  | bob  |   10 |   1000000
 2 | cup  | bob  |   -1 |   1000000
 3 | cup  | bob  |   10 |   1000000
 4 | cup  | jim  |   -1 |   2000000
 5 | hat  | bob  |   10 |   1000000
 6 | pen  | tom  |   -1 |   3000000
 7 | pen  | tom  |   -1 |   3000000
 8 | pen  | tom  |   13 |   3000000
 9 | shoe | bob  |   10 |   1000000
10 | hat  | dan  |   -1 |   4000000
11 | hat  | dan  |   -1 |   4000000
I am trying to find all item/user groups that have a txId of -1 AND another valid txId (one that is not -1). I don't care about row 4, since cup/jim is only 1 row of item/user. Rows 5 and 9 likewise only have 1 row for their item/user group. Rows 10 and 11 do not have a valid transaction id, so I don't want them either.
I want my results to be:
id | item | user | txId | date (ms)
-----------------------------------
 1 | cup  | bob  |   10 |   1000000
 2 | cup  | bob  |   -1 |   1000000
 3 | cup  | bob  |   10 |   1000000
 6 | pen  | tom  |   -1 |   3000000
 7 | pen  | tom  |   -1 |   3000000
 8 | pen  | tom  |   13 |   3000000
Here is what I have tried
select groups.id, groups.item, groups.user, groups.txId, groups.date
from (
    SELECT *
    FROM (
        SELECT *,
               COUNT(*) OVER (PARTITION BY item, user) AS occurrences
        FROM myTable tbl
    ) t
    WHERE t.occurrences > 1
) groups
INNER JOIN (
    SELECT * FROM myTable WHERE txId = -1
) violators
    ON violators.user = groups.user AND violators.item = groups.item
ORDER BY groups.user, groups.item DESC
I am getting some false positives and can't figure out why.
You can use exists:
select t.*
from t
where (t.txid > 0 and
       exists (select 1
               from t t2
               where t2.item = t.item and t2.user = t.user and
                     t2.txid < 0
              )
      ) or
      (t.txid < 0 and
       exists (select 1
               from t t2
               where t2.item = t.item and t2.user = t.user and
                     t2.txid > 0
              )
      );
You can also use window functions:
select t.*
from (select t.*,
min(txid) over (partition by item, user) as min_txid,
max(txid) over (partition by item, user) as max_txid
from t
) t
where min_txid = -1 and max_txid > 0;
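The title also mentions updating the bad rows to the group's valid txId once they are found. A hedged sketch of that follow-up step, reusing the question's table and column names and assuming (as the question states) that each item/user group carries a single valid txId:
-- Sketch only: overwrite txId = -1 rows with their group's valid txId,
-- touching only groups that actually have one.
update t
set txid = (select max(t2.txid)            -- the group's single valid txId
            from t t2
            where t2.item = t.item and t2.user = t.user and t2.txid > 0)
where t.txid = -1
  and exists (select 1
              from t t2
              where t2.item = t.item and t2.user = t.user and t2.txid > 0);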

SQL Server sum field from previous calculation

In SQL Server, I have a table with 4 columns:
artid   num   A            B
46      1     417636000    0
47      1     15024000     0
102     1     3418105650   0
226     1     1160601286   0
60      668   260000       0
69      668   5500000      0
In the result set I want to create a new column for a calculation.
This column should have values like this:
artid       num         a                      b                      newColumnValue
----------- ----------- ---------------------- ---------------------- ----------------------
46          1           417636000              0                      a - b + previous newColumnValue
I wrote this query, but I can't get the previous newColumnValue:
select *, (a - b + lag(a - b, 1, a - b) over (order by num, artid)) as newColumnValue
FROM MainTbl
ORDER BY num, artid
I get this result:
artid       num         a                      b                      newColumnValue
----------- ----------- ---------------------- ---------------------- ----------------------
46          1           417636000              0                      417636000
47          1           15024000               0                      432660000
102         1           3418105650             0                      3433129650
226         1           1160601286             0                      4578706936
60          668         260000                 0                      1160861286
69          668         5500000                0                      5760000
I want to get this result:
artid       num         a                      b                      newColumnValue
----------- ----------- ---------------------- ---------------------- ----------------------
46          1           417636000              0                      417636000
47          1           15024000               0                      432660000
102         1           3418105650             0                      3850765650
226         1           1160601286             0                      5011366936
60          668         260000                 0                      5011626936
69          668         5500000                0                      5017126936
You want a cumulative sum (of the differences):
select a, b, sum(a - b) over (order by num, artid)
from mytbl;
Note: SQL tables represent unordered sets. You need a column to specify the ordering to define previous. If you really only have two columns, then I might assume the ordering is based on a, and the query would be:
select a, b, sum(a - b) over (order by a)
from mytbl;
Given the following example data,
+----+---+---+
| Id | A | B |
+----+---+---+
| 1 | 2 | 3 |
+----+---+---+
| 2 | 3 | 4 |
+----+---+---+
| 3 | 4 | 5 |
+----+---+---+
| 4 | 5 | 6 |
+----+---+---+
| 5 | 6 | 7 |
+----+---+---+
the following short SQL statement produces the desired output:
select A - B + lag(A - B, 1, 0) over (order by id)
from test
+----+
| -1 |
+----+
| -2 |
+----+
| -2 |
+----+
| -2 |
+----+
| -2 |
+----+
Note that the Lag function takes three arguments: the first is the expression you would like evaluated for the "lagged" record, the second is the amount of the lag (defaults to 1), and the third is the value to return if the expression cannot be computed (e.g. if it is the first record).
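A tiny sketch of those defaults against the same test table: lag(x) is shorthand for lag(x, 1), and supplying the third argument replaces the NULL you would otherwise get on the first row.
select Id,
       lag(A - B) over (order by Id) as prev_diff_nullable,        -- NULL on the first row
       lag(A - B, 1, 0) over (order by Id) as prev_diff_defaulted  -- 0 on the first row
from test;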

Record batching on bases of running total values by specific number (FileSize wise batching)

We are dealing with a large recordset and are currently using NTILE() to get ranges of FileIDs, then using the FileID column in a BETWEEN clause to fetch a specific set of records. Using FileID in the BETWEEN clause is a mandatory requirement from the developers, so we cannot have random FileIDs in one batch; it has to be incremental.
As per a new requirement, we have to build the ranges based on the FileSize column, e.g. 100 GB per batch.
For example:
Batch 1: FileId 1 alone has size 100, so it is a batch of 1 record only.
Batch 2: FileIds 2,3,4,5 total 80, which is < 100 GB, so we also have to take FileId 6 with 120 GB (total 300 GB).
Batch 3: FileId 7 is > 100 on its own, so it is 1 record only.
And so on…
Below is my sample code, but it is not giving the expected result:
CREATE TABLE zFiles
(
    FileId   INT
   ,FileSize INT
)

INSERT INTO dbo.zFiles (
    FileId
   ,FileSize
)
VALUES (1, 100)
      ,(2, 20)
      ,(3, 20)
      ,(4, 30)
      ,(5, 10)
      ,(6, 120)
      ,(7, 400)
      ,(8, 50)
      ,(9, 100)
      ,(10, 60)
      ,(11, 40)
      ,(12, 5)
      ,(13, 20)
      ,(14, 95)
      ,(15, 40)

DECLARE @intBatchSize FLOAT = 100;

SELECT y.FileID ,
       y.FileSize ,
       y.RunningTotal ,
       DENSE_RANK() OVER (ORDER BY CEILING(RunningTotal / @intBatchSize)) Batch
FROM ( SELECT i.FileID ,
              i.FileSize ,
              RunningTotal = SUM(i.FileSize) OVER ( ORDER BY i.FileID ) -- RANGE UNBOUNDED PRECEDING
       FROM dbo.zFiles AS i WITH ( NOLOCK )
     ) y
ORDER BY y.FileID;
Result:
+--------+----------+--------------+-------+
| FileID | FileSize | RunningTotal | Batch |
+--------+----------+--------------+-------+
| 1 | 100 | 100 | 1 |
| 2 | 20 | 120 | 2 |
| 3 | 20 | 140 | 2 |
| 4 | 30 | 170 | 2 |
| 5 | 10 | 180 | 2 |
| 6 | 120 | 300 | 3 |
| 7 | 400 | 700 | 4 |
| 8 | 50 | 750 | 5 |
| 9 | 100 | 850 | 6 |
| 10 | 60 | 910 | 7 |
| 11 | 40 | 950 | 7 |
| 12 | 5 | 955 | 7 |
| 13 | 20 | 975 | 7 |
| 14 | 95 | 1070 | 8 |
| 15 | 40 | 1110 | 9 |
+--------+----------+--------------+-------+
Expected Result:
+--------+---------------+---------+
| FileID | FileSize (GB) | BatchNo |
+--------+---------------+---------+
| 1 | 100 | 1 |
| 2 | 20 | 2 |
| 3 | 20 | 2 |
| 4 | 30 | 2 |
| 5 | 10 | 2 |
| 6 | 120 | 2 |
| 7 | 400 | 3 |
| 8 | 50 | 4 |
| 9 | 100 | 4 |
| 10 | 60 | 5 |
| 11 | 40 | 5 |
| 12 | 5 | 6 |
| 13 | 20 | 6 |
| 14 | 95 | 6 |
| 15 | 40 | 7 |
+--------+---------------+---------+
We can achieve this if somehow we can reset the running total once it goes over 100. We could write a loop to produce this result, but that means processing record by record, which is time consuming.
Can somebody please help us with this?
You need to do this with a recursive CTE:
with cte as (
      select z.fileid, z.filesize, z.filesize as batch_filesize, 1 as batchnum
      from zfiles z
      where z.fileid = 1
      union all
      select z.fileid, z.filesize,
             (case when cte.batch_filesize + z.filesize > @intBatchSize
                   then z.filesize
                   else cte.batch_filesize + z.filesize
              end),
             (case when cte.batch_filesize + z.filesize > @intBatchSize
                   then cte.batchnum + 1
                   else cte.batchnum
              end)
      from cte join
           zfiles z
           on z.fileid = cte.fileid + 1
     )
select *
from cte;
Note: I realize that fileid probably is not a sequence. You can create a sequence using row_number() in a CTE, to make this work.
There is a technical reason why running sums don't work for this. Essentially, any given fileid needs to know the breaks before it.
A small modification to the above answer by Gordon Linoff gives the expected result.
DECLARE @intBatchSize INT = 100

;WITH cte as (
      select z.fileid, z.filesize, z.filesize as batch_filesize, 1 as batchnum
      from zfiles z
      where z.fileid = 1
      union all
      select z.fileid, z.filesize,
             (case when cte.batch_filesize >= @intBatchSize
                   then z.filesize
                   else cte.batch_filesize + z.filesize
              end),
             (case when cte.batch_filesize >= @intBatchSize
                   then cte.batchnum + 1
                   else cte.batchnum
              end)
      from cte join
           zfiles z
           on z.fileid = cte.fileid + 1
     )
select *
from cte;
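If FileId turns out not to be a gap-free sequence (the first answer's note), a sketch combining its row_number() suggestion with the corrected reset above might look like the following; the MAXRECURSION hint is needed because SQL Server's default limit of 100 recursions is far too low for a large table.
-- Sketch only: number the rows first, then recurse over that dense sequence
-- instead of FileId itself.
DECLARE @intBatchSize INT = 100;

WITH numbered AS (
    SELECT FileId, FileSize,
           ROW_NUMBER() OVER (ORDER BY FileId) AS seq
    FROM dbo.zFiles
),
cte AS (
    SELECT n.seq, n.FileId, n.FileSize, n.FileSize AS batch_filesize, 1 AS batchnum
    FROM numbered n
    WHERE n.seq = 1
    UNION ALL
    SELECT n.seq, n.FileId, n.FileSize,
           CASE WHEN cte.batch_filesize >= @intBatchSize
                THEN n.FileSize
                ELSE cte.batch_filesize + n.FileSize
           END,
           CASE WHEN cte.batch_filesize >= @intBatchSize
                THEN cte.batchnum + 1
                ELSE cte.batchnum
           END
    FROM cte
    JOIN numbered n ON n.seq = cte.seq + 1
)
SELECT FileId, FileSize, batchnum AS BatchNo
FROM cte
ORDER BY FileId
OPTION (MAXRECURSION 0);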

Postgres width_bucket() not assigning values to buckets correctly

In PostgreSQL 9.5.3 I can't get width_bucket() to work as expected; it appears to be assigning values to the wrong buckets.
Dataset:
1
2
4
32
43
82
104
143
232
295
422
477
Expected output (bucket ranges and zero-count rows added to help analysis):
 bucket | bucketmin | bucketmax | Expect | Actual
--------+-----------+-----------+--------+--------
      1 |         1 |      48.6 |      5 |      5
      2 |      48.6 |      96.2 |      1 |      2
      3 |      96.2 |     143.8 |      2 |      1
      4 |     143.8 |     191.4 |      0 |      0
      5 |     191.4 |       239 |      1 |      1
      6 |       239 |     286.6 |      0 |      1
      7 |     286.6 |     334.2 |      1 |      0
      8 |     334.2 |     381.8 |      0 |      1
      9 |     381.8 |     429.4 |      1 |      0
     10 |     429.4 |       477 |      1 |      1
Actual output:
 wb | count
----+-------
  1 |     5
  2 |     2
  3 |     1
  5 |     1
  6 |     1
  8 |     1
 10 |     1
Code to generate actual output:
create temp table metrics (val int);
insert into metrics (val) values(1),(2),(4),(32),(43),(82),(104),(143),(232),(295),(422),(477);
with metric_stats as (
select
cast(min(val) as float) as minV,
cast(max(val) as float) as maxV
from metrics m
),
hist as (
select
width_bucket(val, s.minV, s.maxV, 9) wb,
count(*)
from metrics m, metric_stats s
group by 1 order by 1
)
select * from hist;
Your calculations appear to be off. The following query:
with metric_stats as (
select cast(min(val) as float) as minV,
cast(max(val) as float) as maxV
from metrics m
)
select g.n,
s.minV + ((s.maxV - s.minV) / 9) * (g.n - 1) as bucket_start,
s.minV + ((s.maxV - s.minV) / 9) * g.n as bucket_end
from generate_series(1, 9) g(n) cross join
metric_stats s
order by g.n
Yields the following bins:
1 1 53.8888888888889
2 53.8888888888889 106.777777777778
3 106.777777777778 159.666666666667
4 159.666666666667 212.555555555556
5 212.555555555556 265.444444444444
6 265.444444444444 318.333333333333
7 318.333333333333 371.222222222222
8 371.222222222222 424.111111111111
9 424.111111111111 477
I think you intend for the "9" to be a "10", if you want 10 buckets.
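A sketch of the corrected query with 10 buckets. One wrinkle: PostgreSQL's width_bucket() places values equal to the upper bound into an overflow bucket (num_buckets + 1), so the maximum value 477 is clamped back into bucket 10 here. With the sample data this should reproduce the Expected counts above (the zero-count buckets simply do not appear in the output).
with metric_stats as (
    select cast(min(val) as float) as minV,
           cast(max(val) as float) as maxV
    from metrics m
)
select least(width_bucket(val, s.minV, s.maxV, 10), 10) as wb,  -- clamp the max value into bucket 10
       count(*)
from metrics m, metric_stats s
group by 1
order by 1;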