Reduce the table fragmentation in PostgreSQL

I have a table with n elements with ids from 1 to n. Then I delete nearly all the data from the table, leaving only m elements, m << n, including the element with id = n. Before deleting, the difference between the ids of adjacent rows was 1; now it can be much larger. I need to update the table so that the remaining elements have ids from 1 to m and the difference between the ids of adjacent rows is 1. How can I do this? I use PostgreSQL.

You may find it a better option not to update the primary key (especially in a heavy multi-user environment) but to delegate the removal of the gaps to an accessing view.
Simple Example
create table test as
select id from generate_series(1,5) id;
delete from test
where id in (2,3,4);
The view assigns the new continuous sequence using the row_number function.
create view test_gapless as
select
id,
row_number() over (order by id) gapless_id
from test;
select * from test_gapless;
id | gapless_id
---+-----------
 1 |          1
 5 |          2
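If you really do need to renumber the rows in place, a minimal sketch (assuming the example table above, no foreign keys referencing test.id, and no concurrent writers) could be:
-- Renumber the surviving rows 1..m in id order.
-- Caveat: if id carries a PRIMARY KEY/UNIQUE constraint, intermediate collisions are
-- possible; make the constraint DEFERRABLE or renumber via a temporary offset first.
UPDATE test
SET    id = numbered.new_id
FROM  (SELECT id AS old_id,
              row_number() OVER (ORDER BY id) AS new_id
       FROM   test) AS numbered
WHERE  test.id = numbered.old_id;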

Related

Limiting output of rows based on count of values in another table?

As a base example, I have a query that effectively produces a table with a list of values (ID numbers), each of which is attached to a specific category. As a simplified example, it would produce something like this (but at a much larger scale):
IDS      Categories
12345    type 1
12456    type 6
77689    type 3
32456    type 4
67431    type 2
13356    type 2
.....    .....
Using this table, I want to populate another table that gives me a list of ID numbers, with a limit placed on how many of each category are in that list, cross-referenced against a sort of range-based chart. For instance, if there are 5-15 IDS of type 1 in my first table, I want the new table's column of IDS to have 3 type 1 IDS in it; if there are 15-30 type 1 IDS in the first table, I want 6 type 1 IDS in the new table.
This sort of range-based limit would apply to each category, and the IDS would all populate the same column in the new table. The order, or the specific IDS that end up in the final table, doesn't matter, as long as the correct number of IDS end up as part of that final list of ID numbers. This is being used to provide a semi-random sampling of ID numbers, based on categories, for a sort of QA-related process.
If parts of this are unclear I can do my best to explain more. My initial thought was using a variable for a LIMIT clause, but that isn't possible. I have been trying to sort out how to do this with a CASE statement, but I'm really just not making any headway there; I feel like I'm at a sort of paper-thin wall I just can't break through.
You can use two window functions:
COUNT to keep track of the number of ids in each category
ROW_NUMBER to uniquely identify each id within each category
Once you have collected this information, it is sufficient to keep all the rows that satisfy either of the following conditions:
count of rows less than or equal to 30 >> row number less than or equal to 6
count of rows less than or equal to 15 >> row number less than or equal to 3
WITH cte AS (
    SELECT IDS,
           Categories,
           ROW_NUMBER() OVER(PARTITION BY Categories ORDER BY IDS) AS rn,
           COUNT(IDS) OVER(PARTITION BY Categories) AS cnt
    FROM tab
)
SELECT *
FROM cte
WHERE (rn <= 3 AND cnt <= 15)
   OR (rn <= 6 AND cnt <= 30)
Note: If you have requirements regarding a specific ordering, you need to adjust the ORDER BY clause inside the ROW_NUMBER window function.
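Since the question mentions semi-random sampling, one possible tweak (a sketch; random() is PostgreSQL syntax, other databases use e.g. NEWID() or RAND()) is to randomise the ordering inside ROW_NUMBER so that the kept rows vary from run to run:
WITH cte AS (
    SELECT IDS,
           Categories,
           -- random ordering makes the per-category pick semi-random
           ROW_NUMBER() OVER(PARTITION BY Categories ORDER BY random()) AS rn,
           COUNT(IDS) OVER(PARTITION BY Categories) AS cnt
    FROM tab
)
SELECT *
FROM cte
WHERE (rn <= 3 AND cnt <= 15)
   OR (rn <= 6 AND cnt <= 30)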

Select 1 record from each of 2 duplicate records

I have a messaging application which regularly inserts duplicate messages in BigQuery. The table name is 'metrics' and it has the following fields:
The Row column is a BigQuery ROW_NUMBER() which is not part of the metrics table. All the other columns except batch_id form 2 duplicate rows for each message_id. You can see that each message_id is repeated twice, and a different batch_id is created for each insertion.
I want the output like this: only 3 rows should be in the select result, with 3 different message_ids, instead of the 6 rows I get here. It would be better if the row that had been inserted first among the duplicates for each message_id were selected (as the start_time and end_time are the same for the duplicates, I am not sure how to find that). I am new to BigQuery; I have seen some examples in SQL but not in BigQuery, so any help is appreciated.
Thanks for your help.
This deduping process becomes part of your business logic, so pick one method and stay consistent. I would do something like this:
with data as (
select
*,
row_number() over(partition by message_id order by batch_id asc) as rn
from `project.dataset.table`
)
select * from data where rn = 1
This query selects the row that has the "minimum" batch_id for each message_id. Your batch_ids seem random/hashed (and not necessarily in a specific order), so this might or might not work for your purposes, but it should reproduce the same results every time (unless a 3rd record shows up, in which case it could begin to vary).
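If the table did carry an ingestion timestamp, ordering by it would pick the first-inserted duplicate directly; a sketch with a hypothetical insert_time column (the question's table may not have one):
with data as (
  select
    *,
    -- insert_time is hypothetical: any column recording insertion order would do
    row_number() over(partition by message_id order by insert_time asc) as rn
  from `project.dataset.table`
)
select * from data where rn = 1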

Query to find duplicate values for two fields

Sorry for the title, but I didn't know how to explain it.
I have a table that has 2 fields, A and B.
I want to find all rows in the table that have a duplicate A (more than one record with the same A), but A should be considered a duplicate only if B differs between those rows.
Example:
FIELD A   FIELD B
10        10
10        10      // This is not a duplicate
10        10
10        5       // this is a duplicate
How do I do this in a single query?
Let's break this down into how you would go about constructing such a query. You don't make it clear whether you're looking for all values of A or all rows but let's assume all values of A initially.
The first step therefore is to create a list of all values of A. This can be done two ways, DISTINCT or GROUP BY. I'm going to use GROUP BY because of what else you want to do:
select a
from your_table
group by a
This returns a single column that is unique on A. Now, how can you change this to give you the unique values? The most obvious thing to use is the HAVING clause, which allows you to restrict on aggregated values. For instance the following will give you all values of A which only appear once in the table
select a
from your_table
group by a
having count(*) = 1
That is, the count of rows for each value of A inside the group is 1. You don't want this, of course; you want to do this with the column B. You need there to exist more than one value of B in order for the situation you want to identify to be possible (if there's only one value of B then it's impossible). This gets us to:
select a
from your_table
group by a
having count(b) > 1
This still isn't enough, as you want two different values of B. The above just counts the number of records that have a value in column B. Inside an aggregate function you can use the DISTINCT keyword to count only unique values, bringing us to:
select a
from your_table
group by a
having count(distinct b) > 1
To transcribe this into English: select all unique values of A from YOUR_TABLE that have more than one value of B in the group.
You can use this method, or something similar, to build up your own queries as you create them. Determine what you want to achieve and slowly build up to it.
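For example, if you want the full rows rather than just the values of A, a sketch (using the same YOUR_TABLE/A/B names as above) is to join the grouped result back to the table:
select t.*
from your_table t
join (select a
      from your_table
      group by a
      having count(distinct b) > 1) dup   -- values of A with at least two different Bs
  on dup.a = t.a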
select a from your_table group by a having count(b) > 1
Take into consideration that this will return the count of all duplicates.
Example: if you have the values
1
1
2
1
it will return 3 for the value 1, not 2.

How to search on LevelOrder values in SQL?

I have a table in SQL Server that contains the following columns :
Id   Name      ParentId   LevelOrder
8    vehicle   0          0/8/
9    car       8          0/8/9/
10   bike      8          0/8/10/
11   House     0          0/11/
...
This creates a tree.
Say that I have the LevelOrder 0/8/, this should return only the car and bike rows, but how do I handle this in SQL Server?
I have tried :
Select * FROM MyTable WHERE LevelOrder >= '0/8/'
but that does not work.
The underscore character will guarantee at least one character comes after '0/8/', so you don't get a match on the "vehicle" row.
SELECT *
FROM MyTable
WHERE LevelOrder LIKE '0/8/_%'
This code allows you to select values that start with 0/8/
Select * FROM MyTable WHERE LevelOrder like '0/8/%'
Okay -
While #Joe's answer is the simplest and easiest to implement (and possibly better performing than what I'm about to propose...), there are some issues with update anomalies.
Specifically:
You already have a parentId column. You need to synchronize both this and the levelOrder column, or risk inconsistent data. (I believe this also violates 1NF, although my understanding of the exact definition is a little sketchy...)
levelOrder contains the entire hierarchy. If any one parent is moved, all child rows must have levelOrder modified to reflect this (potentially very messy).
In light of this, here's what I recommend:
Drop the levelOrder column, as its existence will (generally) cause problems.
Use a recursive CTE and the parentId column to build the hierarchy dynamically. Either leave the column where it is, or move it to a dedicated relationship table. Moving one parent then requires only one cell to be updated, and cannot result in any (data, not semantic) anomalies. The CTE should look similar to this form (it will need to be adjusted for your purpose):
WITH heir_parent (parentId, id) AS (
    SELECT parentId, id
    FROM table
    WHERE id = @id        -- @id: placeholder for the id whose children you want
    UNION ALL
    SELECT b.parentId, b.id
    FROM heir_parent AS a
    JOIN table AS b
      ON b.parentId = a.id
)
At the moment, the CTE returns a list of all children of the given id, with their id and their immediate parent. It can be adjusted to return a number of other things as well - although I recommend that the CTE be used only to generate the relationship, and join externally to get the remaining data.
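For example, a sketch against the question's MyTable (anchoring on the children of id 8 and joining back externally for the remaining columns) might look like:
WITH heir_parent (parentId, id) AS (
    SELECT ParentId, Id
    FROM MyTable
    WHERE ParentId = 8          -- direct children of "vehicle" (id = 8)
    UNION ALL
    SELECT m.ParentId, m.Id
    FROM heir_parent AS hp
    JOIN MyTable AS m
      ON m.ParentId = hp.id     -- walk down the tree
)
SELECT t.Id, t.Name, t.ParentId, t.LevelOrder
FROM heir_parent AS hp
JOIN MyTable AS t
  ON t.Id = hp.id;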

Approach to a Bin Packing sql problem

I have a problem in sql where I need to generate a packing list from a list of transactions.
Data Model
The transactions are stored in a table that contains:
transaction id
item id
item quantity
Each transaction can have multiple items (and consequently multiple rows with the same transaction id). Each item then has a quantity from 1 to N.
Business Problem
The business requires that we create a packing list, where each line item in the packing list contains the count of each item in the box.
Each box can only contain 160 items (they all happen to be the same size/weight). Based on the total count of the order we need to split items into different boxes (sometimes splitting even the individual item's collection into two boxes)
So the challenge is to take that data schema and come up with the result set that includes how many of each item belong in each box.
I am currently brute forcing this in some not so pretty ways and wondering if anyone has an elegant/simple solution that I've overlooked.
Example In/Out
We really need to isolate how many of each item end up in each box...for example:
Order 1:
100 of item A
100 of item B
140 of item C
This should result in three rows in the result set:
Box 1: A (100), B (60)
Box 2: B (40), C (120)
Box 3: C (20)
Ideally the query would be smart enough to put all of C together, but at this point - we're not too concerned with that.
How about something like
SELECT SUM([Item Quantity]) as totalItems
     , SUM([Item Quantity]) / 160 as totalBoxes
     , SUM([Item Quantity]) % 160 as amountInLastBox   -- % is the T-SQL modulo operator
FROM [Transactions]
GROUP BY [Transaction Id]
Let me know what fields in the resultset you're looking for and I could come up with a better one
I was looking for something similar, and all I could achieve was expanding the rows to the number of item counts in a transaction and grouping them into bins. Not very elegant, though. Moreover, because string aggregation is still very cumbersome in SQL Server (Oracle, I miss you!), I have to leave the last part out, i.e. putting the counts in one single row.
My solution is as follows:
Example transactions table:
CREATE TABLE transactions (trans_id INT, item CHAR(1), cnt INT);  -- column types assumed
GO
INSERT INTO transactions
    (trans_id, item, cnt) VALUES
    (1, 'A', 50),
    (2, 'A', 140),
    (3, 'B', 100),
    (4, 'C', 80);
GO
Create a dummy sequence table, which contains numbers from 1 to 1000 (I assume that maximum number allowed for an item in a single transaction is 1000):
CREATE TABLE numseq (n INT NOT NULL IDENTITY) ;
GO
INSERT numseq DEFAULT VALUES ;
WHILE SCOPE_IDENTITY() < 1000 INSERT numseq DEFAULT VALUES ;
GO
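As a side note, if you would rather avoid the WHILE loop, a set-based sketch (assuming SQL Server 2008+ syntax) generates the same 1..1000 sequence with a recursive CTE, which you could join to directly instead of materialising numseq:
WITH nums (n) AS (
    SELECT 1
    UNION ALL
    SELECT n + 1 FROM nums WHERE n < 1000   -- counts up to 1000
)
SELECT n
FROM nums
OPTION (MAXRECURSION 1000);                 -- default recursion limit of 100 is too low here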
Now we can generate a result table from the transactions table: in a subquery each transaction/item row is repeated "cnt" times, bin numbers are assigned using integer division, and the outer query groups by bin number:
SELECT bin_id, item, count(*) count_in_bin
INTO result
FROM (
    SELECT t.item,
           ((row_number() over (order by t.item, s.n) - 1) / 160) + 1 as bin_id
    FROM transactions t
    INNER JOIN numseq s
        ON t.cnt >= s.n -- join conditionally to repeat each transaction row "cnt" times
) a
GROUP BY bin_id, item
ORDER BY bin_id, item
GO
Result is:
bin_id item count_in_bin
1 A 160
2 A 30
2 B 100
2 C 30
3 C 50
In Oracle, the last step would be as simple as that:
SELECT bin_id, WM_CONCAT(CONCAT(item,'(',count_in_bin,')')) contents
FROM result
GROUP BY bin_id
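If you happen to be on SQL Server 2017 or later (an assumption; older versions need FOR XML PATH workarounds), STRING_AGG makes the same concatenation reasonably painless; a sketch against the result table above:
SELECT bin_id,
       STRING_AGG(CONCAT(item, '(', count_in_bin, ')'), ', ') AS contents  -- e.g. "A(30), B(100), C(30)"
FROM result
GROUP BY bin_id;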
This isn't the prettiest answer, but I am using a similar method to keep track of stock items through an order process; it is easy to understand and may lead to you developing a better method than I have.
I would create a table called "PackedItem" or something similar. The columns would be:
packed_item_id (int) - Primary Key, Identity column
trans_id (int)
item_id (int)
box_number (int)
Each record in this table represents 1 physical unit you will ship.
Let's say someone adds a line to transaction 4 with 20 of item 12: I would add 20 records to the PackedItem table, all with that transaction ID, that item ID, and a NULL box number. If a line is updated, you need to add or remove records from the PackedItem table so that there is always a 1:1 correlation.
When the time comes to ship, you can simply
SELECT TOP 160 * FROM PackedItem WHERE trans_id = 4 AND box_number IS NULL
and set the box_number on those records to the next available box number, until no records remain where the box_number is NULL. This is possible using one fairly complicated UPDATE statement inside a WHILE loop - which I don't have the time to construct fully.
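A rough sketch of such a loop (assuming SQL Server and transaction 4 as in the example; not tuned for large volumes) could be:
DECLARE @box INT = 1;

WHILE EXISTS (SELECT 1 FROM PackedItem
              WHERE trans_id = 4 AND box_number IS NULL)
BEGIN
    -- take the next 160 unboxed units and stamp them with the current box number
    ;WITH next_batch AS (
        SELECT TOP (160) box_number
        FROM   PackedItem
        WHERE  trans_id = 4 AND box_number IS NULL
        ORDER BY item_id                 -- keeps units of the same item together
    )
    UPDATE next_batch
    SET    box_number = @box;

    SET @box = @box + 1;
END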
You can now easily get your desired packing list by querying this table as follows:
SELECT box_number, item_id, COUNT(*) AS Qty
FROM PackedItem
WHERE trans_id = 4
GROUP BY box_number, item_id
Advantages - easy to understand, fairly easy to implement.
Pitfalls - if the table gets out of sync with the lines on the Transaction, the final result can be wrong; this table will accumulate many records and be extra work for the server. Each ID field will need to be indexed to keep performance good.
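On the indexing point, a sketch of one supporting index (names are illustrative) for the per-transaction, per-box lookups used above:
CREATE INDEX IX_PackedItem_TransBox
    ON PackedItem (trans_id, box_number)   -- matches the WHERE/GROUP BY columns
    INCLUDE (item_id);                     -- covers the packing-list query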