Remove redundant SQL price cost records

I have a table costhistory with fields id, invid, vendorid, cost, timestamp, chdeleted. It looks like it was populated by a trigger every time a vendor updated their price list.
It contains redundant records, since a row was inserted whether or not the price had actually changed since the previous record.
Example:
+----+-------+----------+------+-----------+-----------+
| id | invid | vendorid | cost | timestamp | chdeleted |
+----+-------+----------+------+-----------+-----------+
| 1  | 123   | 1        | 100  | 1/1/01    | 0         |
| 2  | 123   | 1        | 100  | 1/2/01    | 0         |
| 3  | 123   | 1        | 100  | 1/3/01    | 0         |
| 4  | 123   | 1        | 500  | 1/4/01    | 0         |
| 5  | 123   | 1        | 500  | 1/5/01    | 0         |
| 6  | 123   | 1        | 100  | 1/6/01    | 0         |
+----+-------+----------+------+-----------+-----------+
I would want to remove records with ID 2,3,5 since they do not reflect any change since the last price update.
I'm sure it can be done, though it might take several steps.
Just to be clear, this table has swelled to 100 GB and contains 600M rows. I am confident that a proper cleanup will take this table's size down by 90-95%.
Thanks!

The approach you take will vary depending on the database you are using. For SQL Server 2005+, the following query should give you the records you want to remove:
select id
from (
    select id,
           rank() over (partition by invid, vendorid, cost order by timestamp) as rnk
    from costhistory
) tmp
where rnk > 1
You can then delete them like this:
delete from costhistory
where id in (
    select id
    from (
        select id,
               rank() over (partition by invid, vendorid, cost order by timestamp) as rnk
        from costhistory
    ) tmp
    where rnk > 1
)
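Note that partitioning by invid, vendorid, cost treats every later row with a given cost as redundant, even when the price changes and later changes back (id 6 in the example would also be deleted). If you only want to drop consecutive repeats, a rough sketch using lag() (this needs SQL Server 2012+ and is not part of the answer above) would be:
with flagged as (
    select id,
           cost,
           lag(cost) over (partition by invid, vendorid order by timestamp) as prev_cost
    from costhistory
)
delete from flagged
where prev_cost = cost;  -- same price as the immediately preceding record for this invid/vendorid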

I would suggest that you recreate the table using a group by query. Also, I assume the "id" column is not used in any other tables; if it is, then you need to fix those tables as well.
Deleting such a large quantity of records is likely to take a long, long time.
The query would look like:
insert into newversionoftable(invid, vendorid, cost, timestamp, chdeleted)
    select invid, vendorid, cost, timestamp, chdeleted
    from costhistory
    group by invid, vendorid, cost, timestamp, chdeleted
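Once the copy has finished, you would swap the new table in; a rough sketch, assuming SQL Server as in the other answer (newversionoftable is the hypothetical staging name used above):
exec sp_rename 'costhistory', 'costhistory_old';
exec sp_rename 'newversionoftable', 'costhistory';
Remember to recreate any indexes and constraints on the new table before swapping it in.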
If you do opt for a delete, I would suggest:
(1) Fix the code first, so no duplicates are going in.
(2) Determine the duplicate ids and place them in a separate table.
(3) Delete in batches (a sketch follows at the end of this answer).
To find the duplicate ids, use something like:
select *
from (select id,
             row_number() over (partition by invid, vendorid, cost, chdeleted order by timestamp) as seqnum
      from costhistory
     ) t
where seqnum > 1
If you want to keep the most recent version instead, then use "timestamp desc" in the order by clause.
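For step (3), a rough sketch of a batched delete (SQL Server syntax; duplicate_ids is a hypothetical staging table holding the ids found by the query above):
while 1 = 1
begin
    -- delete in chunks so the transaction log and locks stay manageable
    delete top (100000)
    from costhistory
    where id in (select id from duplicate_ids);

    if @@rowcount = 0
        break;
end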


Oracle SELECT DISTINCT value only in some columns

+----+------+-------+---------+---------+
| id | order| value | type | account |
+----+------+-------+---------+---------+
| 1 | 1 | a | 2 | 1 |
| 1 | 2 | b | 1 | 1 |
| 1 | 3 | c | 4 | 1 |
| 1 | 4 | d | 2 | 1 |
| 1 | 5 | e | 1 | 1 |
| 1 | 5 | f | 6 | 1 |
| 2 | 6 | g | 1 | 1 |
+----+------+-------+---------+---------+
I need to select all the fields of this table but get only one row for each combination of id and type (I don't care which of the matching rows is returned). I have tried a few approaches without success.
The moment I use DISTINCT I can't include the rest of the fields to make them available to a subquery. If I add ROWNUM in the subquery, every row becomes distinct, so that doesn't work either.
Any ideas?
My best query at the moment is this:
SELECT ID, TYPE, VALUE, ACCOUNT
FROM MYTABLE
WHERE ROWID IN (SELECT DISTINCT MAX(ROWID)
FROM MYTABLE
GROUP BY ID, TYPE);
It seems you need to select one (random) row for each distinct combination of id and type. If so, you could do that efficiently using the row_number analytic function. Something like this:
select id, type, value, account
from (
    select id, type, value, account,
           row_number() over (partition by id, type order by null) as rn
    from your_table
)
where rn = 1;
order by null means no particular ordering of rows within each (id, type) partition; this means that the ordering step, which is usually time-consuming, will be trivial in this case. Also, Oracle optimizes such queries for the filter rn = 1.
Or, in versions 12.1 and higher, you can get the same with the match_recognize clause:
select id, type, value, account
from my_table
match_recognize (
partition by id, type
all rows per match
pattern (^r)
define r as null is null
);
This partitions the rows by id and type, it doesn't order them (so the ordering is arbitrary), and selects just the "first" row from each partition. Note that some analytic functions, including row_number(), require an order by clause (even when we don't care about the ordering) - order by null is customary, but it can't be left out completely. By contrast, in match_recognize you can leave out the order by clause (the default is arbitrary order). On the other hand, you can't leave out the define clause, even if it imposes no conditions whatsoever. Why Oracle doesn't use a default for that clause too, only Oracle knows.

How would you generate fake account numbers for a company project with SQL?

I have a big data set in Redshift which my company will share with university students to analyze. I need to mask the real customer account numbers.
I've looked at the random function but there's one catch: some customers are repeated, so I need to retain that for the analysis to be useful. Also, with a random number there's a small possibility you would repeat account numbers, right?
How would you achieve this? Generate a new_random_id. It must be unique from all others in the table (there are over 4 million in the table), but must be the same for those rows where the actual account ID is the same.
+-------------------+---------------+---------+
| actual_account_id | new_random_id | status  |
+-------------------+---------------+---------+
| 100               | 123           | new     |
| 100               | 123           | upgrade |
| 200               | 249           | new     |
| 300               | 401           | upgrade |
+-------------------+---------------+---------+
I realize I could first generate a mapping table like this below, and then join to the main table, but it still doesn't solve the problem of possibly repeating new random IDs.
select distinct actual_account_id, cast(random()*1000000 as int) as new_random_id
into mapping_table
from t1;
I would create a mapping table using window functions:
select actual_account_id,
row_number() over (order by random()) as fake_account_id
from t1
group by actual_account_id;
This should be a meaningless sequential number.
Redshift might be a bit slow on the ROW_NUMBER() with no PARTITION BY. If performance is an issue, you can first bucket the accounts with a random key and then number within each bucket, offsetting by the bucket key so the results stay unique:
select actual_account_id,
       tmp * 100 + row_number() over (partition by tmp order by random()) as fake_account_id
from (select actual_account_id,
             cast(random()*1000000 as int) as tmp
      from t1
      group by actual_account_id
     ) t;
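Either way, the last step is to join the mapping back to the wide table so the students only ever see the masked ids. A minimal sketch, assuming a mapping table mapping_table(actual_account_id, fake_account_id) built with one of the queries above, and a source table t1 with a status column (names taken from the question):
-- produce a masked copy of the data set for sharing
create table masked_t1 as
select m.fake_account_id as new_random_id,
       t.status
from t1 t
join mapping_table m
  on m.actual_account_id = t.actual_account_id;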

Remove Duplicates Based Off of Two Columns in PostgreSQL

So let's say I have a table named Class with the following fields: userid, time, and score. The table looks like this:
+--------+------------+-------+
| userid | time | score |
+--------+------------+-------+
| 1 | 08-20-2018 | 75 |
| 1 | 10-25-2018 | 50 |
| 1 | 02-01-2019 | 88 |
| 2 | 04-23-2019 | 98 |<remove
| 2 | 04-23-2019 | 86 |
| 3 | 06-05-2019 | 71 |<remove
| 3 | 06-05-2019 | 71 |
+--------+------------+-------+
However, I would like to remove records where the userid and the time are the same (since it doesn't make sense for someone to give another score on the same day). This would also take care of the records where the userid, time, and score are all the same. So in this table, rows 4 and 6 should be removed.
The following query gives me a list of the duplicated records:
select userid, time
FROM class
GROUP BY userid, time
HAVING count(*)>1;
However, how do I remove the duplicates while still keeping the userid, time, and score columns in the output?
You can use the row_number() window function to assign a number to each record, ordered by score within each (userid, time) group, and then select only the rows where this number is equal to one.
SELECT userid,
       time,
       score
FROM (SELECT userid,
             time,
             score,
             row_number() OVER (PARTITION BY userid, time
                                ORDER BY score) rn
      FROM class) x
WHERE rn = 1;
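If you want to physically delete the extra rows rather than just filter them in a query, the same window function can drive a delete. A rough sketch that keeps the lowest score for each userid/time pair:
-- remove every row except the first per (userid, time), ordered by score
DELETE FROM class c
USING (SELECT ctid,
              row_number() OVER (PARTITION BY userid, time ORDER BY score) AS rn
       FROM class) x
WHERE x.ctid = c.ctid
  AND x.rn > 1;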
First, you need some criterion to distinguish between two rows that have different scores (unless you want to choose between them at random). E.g., you could pick the highest score (like the SATs) or the lowest.
Assuming you want the highest score per day, you can do this:
SELECT distinct on (userid, time)
       userid, time, score
from class
order by userid, time, score desc;
A couple of key things: the columns in your distinct on have to be the left-most columns in your order by, but the magic is in the field that comes next in the order by - it picks the first row among dupes of (userid, time) when ordered by score desc.
You have a real problem with your data model. Filtering the duplicates out is easy enough in a select query, as the other answers suggest (I would recommend distinct on for this).
For actually deleting the rows, you can use ctid (as mentioned in a comment). The approach is:
delete from class c
where exists (select 1
              from class c2
              where c2.userid = c.userid and
                    c2.time = c.time and
                    c2.ctid < c.ctid
             );
That is, delete any row where there is a smaller ctid for the same userid/time combination.
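To address the data-model problem itself and keep new duplicates from appearing once the table is clean, you could add a unique constraint; a sketch, assuming one score per user per day is the intended rule:
-- enforce at most one score per user per day going forward
alter table class
    add constraint class_userid_time_key unique (userid, time);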

More efficient way to query shortest string value associated with each value in another column in Hive QL

I have a table in Hive containing store names, order IDs, and User IDs (as well as some other columns including item ID). There is a row in the table for every item purchased (so there can be more than one row per order if the order contains multiple items). Order IDs are unique within a store, but not across stores. A single order can have more than one user ID associated with it.
I'm trying to write a query that will return a list of all stores and order IDs and the shortest user ID associated with each order.
So, for example, if the data looks like this:
STORE | ORDERID | USERID | ITEMID
------+---------+--------+-------
a     | 1       | bill   | abc
a     | 1       | susan  | def
a     | 2       | jane   | abc
b     | 1       | scott  | ghi
b     | 1       | tony   | jkl
Then the output would look like this:
STORE | ORDERID | USERID
------+---------+-------
a     | 1       | bill
a     | 2       | jane
b     | 1       | tony
I've written a query that will do this, but I feel like there must be a more efficient way to go about it. Does anybody know a better way to produce these results?
This is what I have so far:
select users.store, users.orderid, users.userid
from
    (select store, orderid, userid, length(userid) as len
     from sales) users
join
    (select distinct store, orderid,
            min(length(userid)) over (partition by store, orderid) as len
     from sales) len
  on users.store = len.store
 and users.orderid = len.orderid
 and users.len = len.len
This will probably work for you; it achieves your goal in a single SELECT with no extra join overhead:
select distinct
       store, orderid,
       first_value(userid) over (partition by store, orderid order by length(userid) asc) f_val
from sales;
The result will be:
store orderid f_val
a 1 bill
a 2 jane
b 1 tony
Probably rank() is the best way:
select s.*
from (select s.*,
             rank() over (partition by store, orderid order by length(userid)) as seqnum
      from sales s
     ) s
where seqnum = 1;

Counting the total number of rows with SELECT DISTINCT ON without using a subquery

I have been performing some queries using PostgreSQL's SELECT DISTINCT ON syntax. I would like the query to return the total number of rows alongside every result row.
Assume I have a table my_table like the following:
CREATE TABLE my_table(
id int,
my_field text,
id_reference bigint
);
I then have a couple of values:
id | my_field | id_reference
----+----------+--------------
1 | a | 1
1 | b | 2
2 | a | 3
2 | c | 4
3 | x | 5
Basically my_table contains some versioned data. The id_reference is a reference to a global version of the database. Every change to the database will increase the global version number and changes will always add new rows to the tables (instead of updating/deleting values) and they will insert the new version number.
My goal is to perform a query that will only retrieve the latest values in the table, alongside with the total number of rows.
For example, in the above case I would like to retrieve the following output:
| total | id | my_field | id_reference |
+-------+----+----------+--------------+
| 3 | 1 | b | 2 |
+-------+----+----------+--------------+
| 3 | 2 | c | 4 |
+-------+----+----------+--------------+
| 3 | 3 | x | 5 |
+-------+----+----------+--------------+
My attempt is the following:
select distinct on (id)
count(*) over () as total,
*
from my_table
order by id, id_reference desc
This returns almost the correct output, except that total is the number of rows in my_table instead of being the number of rows of the resulting query:
total | id | my_field | id_reference
-------+----+----------+--------------
5 | 1 | b | 2
5 | 2 | c | 4
5 | 3 | x | 5
(3 rows)
As you can see it has 5 instead of the expected 3.
I can fix this by using a subquery and count as an aggregate function:
with my_values as (
select distinct on (id)
*
from my_table
order by id, id_reference desc
)
select count(*) over (), * from my_values
Which produces my expected output.
My question: is there a way to avoid using this subquery and have something similar to count(*) over () return the result I want?
You are looking at my_table 3 ways:
to find the latest id_reference for each id
to find my_field for the latest id_reference for each id
to count the distinct number of ids in the table
I therefore prefer this solution:
select
c.id_count as total,
a.id,
a.my_field,
b.max_id_reference
from
my_table a
join
(
select
id,
max(id_reference) as max_id_reference
from
my_table
group by
id
) b
on
a.id = b.id and
a.id_reference = b.max_id_reference
join
(
select
count(distinct id) as id_count
from
my_table
) c
on true;
This is a bit longer (especially the long thin way I write SQL) but it makes it clear what is happening. If you come back to it in a few months' time (somebody usually does), it will take less time to understand what is going on.
The "on true" at the end is a deliberate cartesian product because there can only ever be exactly one result from the subquery "c" and you do want a cartesian product with that.
There is nothing necessarily wrong with subqueries.