Conditional removing duplicate records - sql

I'm storing some realtime data in SQLite. Now I want to use SQL commands to remove duplicate records, reducing the amount of data while letting the timeframe between consecutive records grow to at most 20 seconds.
Sample data:
id  t        col1  col2
-----------------------
23  9:19:18  15    16
24  9:19:20  10    11
25  9:19:20  10    11
26  9:19:35  10    11
27  9:19:45  10    11
28  9:19:53  10    11
29  9:19:58  14    13
Logic: In the sample above, records 25-28 have the same values in col1 and col2, so they are duplicates. But keeping only one (for example, record 25) and removing the others would make the timeframe (the time difference between consecutive records) exceed 20s, so I don't want to remove all of records 26-28. In the sample above, row 25 will be kept because it is not a duplicate of its previous row. Row 26 will be kept because, although it is a duplicate of its previous row, removing it would stretch the timeframe to more than 20s (19:45 - 19:20). Row 27 will be removed, since it meets both conditions, and row 28 will be kept.
I can load the data into a C# DataTable and apply this logic in a loop over the records, but that is slow compared to running SQL in the database. I'm not sure whether this can be implemented in SQL. Any help would be greatly appreciated.
Edit: I've added another row before row 25 to show rows with the same time. Fiddle is here: Link

OK so here's an alternate answer that handles the duplicate record scenario you've described, uses LAG and LEAD and also ends up considerably simpler as it turns out!
delete from t1 where id in
(
  with cte as (
    select id,
           lag(t, 1) over(partition by col1, col2 order by t) as prev_t,
           lead(t, 1) over(partition by col1, col2 order by t) as next_t
    from t1
  )
  select id
  from cte
  where strftime('%H:%M:%S',next_t,'-20 seconds') < strftime('%H:%M:%S',prev_t)
)
Online demo here
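The delete above can be tried locally with Python's built-in sqlite3 module. This is a minimal sketch, not the answerer's demo: it assumes the times are stored zero-padded ('09:19:18') so that strftime can parse them, it adds id as a tie-breaker in the window ordering (an assumption, so that the two rows sharing 09:19:20 are ordered deterministically), and it needs SQLite 3.25 or later for window functions.

```python
import sqlite3

# Sample rows from the question; zero-padded times are an assumption
# made so that SQLite's date functions can parse them.
ROWS = [
    (23, "09:19:18", 15, 16),
    (24, "09:19:20", 10, 11),
    (25, "09:19:20", 10, 11),
    (26, "09:19:35", 10, 11),
    (27, "09:19:45", 10, 11),
    (28, "09:19:53", 10, 11),
    (29, "09:19:58", 14, 13),
]

# Same statement as in the answer, with "id" added to the ordering
# as a tie-breaker for rows that share a timestamp (assumption).
DELETE_SQL = """
delete from t1 where id in
(
  with cte as (
    select id,
           lag(t, 1)  over (partition by col1, col2 order by t, id) as prev_t,
           lead(t, 1) over (partition by col1, col2 order by t, id) as next_t
    from t1
  )
  select id
  from cte
  where strftime('%H:%M:%S', next_t, '-20 seconds') < strftime('%H:%M:%S', prev_t)
)
"""


def remaining_ids():
    """Load the sample, run the delete, and return the ids that survive."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t1 (id integer primary key, t text, col1 int, col2 int)")
    conn.executemany("insert into t1 values (?, ?, ?, ?)", ROWS)
    conn.execute(DELETE_SQL)
    ids = [row[0] for row in conn.execute("select id from t1 order by id")]
    conn.close()
    return ids


print(remaining_ids())
```

Rows whose previous or next timestamp is NULL (the first and last row of each col1/col2 group) are never selected for deletion, because the comparison then evaluates to NULL.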

I believe this accomplishes what you are after:
delete from t1 where id in
(
  select ta.id
  from t1 as ta
  join t1 as tb
    on tb.t = (select max(t) from t1 where t < ta.t
               and col1 = ta.col1 and col2 = ta.col2)
    and tb.col1 = ta.col1 and tb.col2 = ta.col2
  join t1 as tc
    on tc.t = (select min(t) from t1 where t > ta.t
               and col1 = ta.col1 and col2 = ta.col2)
    and tc.col1 = ta.col1 and tc.col2 = ta.col2
  where strftime('%H:%M:%S',tc.t,'-20 seconds') < strftime('%H:%M:%S',tb.t)
)
Online demo is here where I've gone through a couple of iterations to simplify it to the above. Basically you need to look at both the previous row and the next row to determine whether you can delete the current row, which happens only when there's a difference of less than 20 seconds between the previous and next row times, as I understand your requirement.
Note: You could probably achieve the same using LAG and LEAD but I'll leave that as an exercise to anyone else who's interested!!
EDIT: In case the time values are not unique, I've included additional conditions to the ta/tb and ta/tc joins to include col1 and col2 and updated the fiddle.

I think you can do the following:
1. Create a result set in SQL that adds the previous row, ordered by id, to each row. Use the LAG function for this (https://www.sqlitetutorial.net/sqlite-window-functions/sqlite-lag/).
2. Calculate a new column using the CASE construct (https://www.sqlitetutorial.net/sqlite-case/). This column could be a boolean called "keep", calculated as follows:
   - if the previous row's col1 and col2 values are not the same => true
   - if the previous row's col1 and col2 values are the same but the time difference is > 20 sec => true
   - in all other cases => false
3. Filter this query to select only the rows to keep (keep = true).
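The steps above can be sketched with Python's sqlite3 module as follows. This is a literal reading of the description rather than a tested solution: the table name t1, the zero-padded times and the choice to keep the very first row are assumptions, and window functions need SQLite 3.25 or later. Because the flag compares each row only with its immediate predecessor, on the question's sample it keeps just rows 23, 24 and 29; comparing against the last kept row instead would need a different formulation.

```python
import sqlite3

# Sample rows from the question; zero-padded times are an assumption.
ROWS = [
    (23, "09:19:18", 15, 16),
    (24, "09:19:20", 10, 11),
    (25, "09:19:20", 10, 11),
    (26, "09:19:35", 10, 11),
    (27, "09:19:45", 10, 11),
    (28, "09:19:53", 10, 11),
    (29, "09:19:58", 14, 13),
]

# Step 1: LAG attaches the previous row's values to each row.
# Step 2: CASE turns the three rules into a 0/1 "keep" flag.
KEEP_SQL = """
with prev as (
  select id, t, col1, col2,
         lag(t)    over (order by id) as prev_t,
         lag(col1) over (order by id) as prev_col1,
         lag(col2) over (order by id) as prev_col2
  from t1
)
select id,
       case
         when prev_t is null then 1                           -- first row (assumption)
         when col1 <> prev_col1 or col2 <> prev_col2 then 1   -- values changed
         when cast(strftime('%s', t) as integer)
              - cast(strftime('%s', prev_t) as integer) > 20 then 1
         else 0
       end as keep
from prev
order by id
"""


def kept_ids():
    """Step 3: return only the ids whose keep flag is set."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t1 (id integer primary key, t text, col1 int, col2 int)")
    conn.executemany("insert into t1 values (?, ?, ?, ?)", ROWS)
    ids = [row_id for row_id, keep in conn.execute(KEEP_SQL) if keep]
    conn.close()
    return ids


print(kept_ids())
```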

Related

Max match same numbers from each row

Generating a report of 1 million rows with the script below takes almost 2 days, so I would really appreciate it if somebody could help me with a different script that generates the report within 10-15 minutes.
The requirements of the report are as follows:
Table “cover” contains 5 million rows and 6 columns of data, and table “data” contains 500,000 rows and 6 columns.
The numbers in each row of table cover have to be compared against every row of table data to find the maximum number of matches.
For instance, as shown in the tables below, there could be 3 matches in row #1, 2 matches in row #2 and 5 matches in row #3, so the script has to select the maximum, which is 5 in row #3.
Sample table
UPDATE public.cover_sheet AS fc
SET maxmatch = (SELECT MAX(tmp.mtch)
                FROM (
                    SELECT (SELECT CASE WHEN fc.a=drwo.a THEN 1 ELSE 0 END) +
                           (SELECT CASE WHEN fc.b=drwo.b THEN 1 ELSE 0 END) +
                           (SELECT CASE WHEN fc.c=drwo.c THEN 1 ELSE 0 END) +
                           (SELECT CASE WHEN fc.d=drwo.d THEN 1 ELSE 0 END) +
                           (SELECT CASE WHEN fc.e=drwo.e THEN 1 ELSE 0 END) +
                           (SELECT CASE WHEN fc.f=drwo.f THEN 1 ELSE 0 END) AS mtch
                    FROM public.data AS drwo
                ) AS tmp)
WHERE fc.code>0;

SELECT *
FROM public.cover_sheet AS fc
WHERE fc.maxmatch>0;
As @a_horse_with_no_name mentioned in the comment on the question, your question is not clear...
It seems you want the number of records for which all 6 fields are equal in both tables.
I'd suggest that you:
- reduce the number of select statements, which should speed up query execution,
- split your query into a few smaller ones (good practice) to check your logic,
- use a join to get equal data, see: Visual Representation of SQL Joins,
- use a subquery or CTE to get a result from which you can update the table.
I think you want a result like the following:
SELECT COUNT(*) mtch
FROM public.cover_sheet AS fc INNER JOIN public.data AS drwo ON
fc.a=drwo.a AND fc.b=drwo.b AND fc.c=drwo.c AND fc.d=drwo.d AND fc.e=drwo.e AND fc.f=drwo.f
If I'm not wrong and the above query is correct, its execution time should drop to about 1-2 minutes.
Finally, the update query may look like:
WITH qry AS
(
-- proper select statement here
)
UPDATE public.cover_sheet AS fc
SET maxmatch = qry.<fieldname>
FROM qry
WHERE fc.code>0 AND fc.<key> = qry.<key>;
Note:
I can't see your data and I know nothing about its structure, relationships, etc., so you will have to adapt the above query to your needs.
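If the goal is instead the maximum number of position-wise matching columns per cover row, as the original UPDATE computes, one possible shape for the select is a join with a MAX over the summed CASE expressions. The sketch below runs it through Python's sqlite3 module on a tiny made-up data set; the sample values are hypothetical, and code is assumed to be a unique key of cover_sheet. It still compares every pair of rows, so whether it is fast enough on the real tables would have to be measured.

```python
import sqlite3

# Count matching columns for every (cover row, data row) pair,
# then keep the best count per cover row.
MAXMATCH_SQL = """
select fc.code,
       max(  case when fc.a = d.a then 1 else 0 end
           + case when fc.b = d.b then 1 else 0 end
           + case when fc.c = d.c then 1 else 0 end
           + case when fc.d = d.d then 1 else 0 end
           + case when fc.e = d.e then 1 else 0 end
           + case when fc.f = d.f then 1 else 0 end) as maxmatch
from cover_sheet as fc
cross join data as d
group by fc.code
order by fc.code
"""


def max_matches():
    conn = sqlite3.connect(":memory:")
    # Hypothetical sample data: cover row 1 matches the data rows
    # in 3, 2, 5 and 0 columns, so its maximum is 5.
    conn.executescript("""
        create table cover_sheet (code integer primary key,
                                  a int, b int, c int, d int, e int, f int);
        create table data (a int, b int, c int, d int, e int, f int);
        insert into cover_sheet values (1, 1, 2, 3, 4, 5, 6),
                                       (2, 7, 8, 9, 10, 11, 12);
        insert into data values (1, 2, 3, 0, 0, 0),
                                (1, 2, 0, 0, 0, 0),
                                (1, 2, 3, 4, 5, 0),
                                (0, 0, 0, 0, 0, 12);
    """)
    rows = conn.execute(MAXMATCH_SQL).fetchall()
    conn.close()
    return rows


print(max_matches())
```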

Sum up Adjacent columns in sql

I'm asking for a solution without functions or procedures (Permissions problem).
I have a table like this:
where k = number of columns (in reality, k = 500)
col1  col2  col3  col4  col5  ...  col(k)
10    20    30    -50   60    ...  100
and I need to create a cumulative row like this:
col1  col2  col3  col4  col5  ...  col(k)
10    30    60    10    70    ...  X
In Excel, it's simple to write a formula and drag it across, but in SQL, with a lot of columns, it seems very clumsy to add them manually (col1 as col1, col1+col2 as col2, col1+col2+col3 as col3, and so on up to colk).
Is there a good solution to this problem?
You say that you've changed your data model to rows. So let's say that the new table has three columns:
grp (some group key to identify which rows belong together, i.e. what was one row in your old table)
pos (a position number from 1 to 500 to indicate the order of the values)
value
You get the cumulative sums with SUM OVER:
select grp, pos, value, sum(value) over (partition by grp order by pos) as running_total
from mytable
order by grp, pos;
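The SUM OVER query can be checked against the values from the question with Python's sqlite3 module. This is a small sketch that assumes the row-based table described above (grp, pos, value) with a single group, and it needs SQLite 3.25 or later for window functions.

```python
import sqlite3


def running_totals(values):
    """Store the values as rows of one group and return their running totals."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table mytable (grp int, pos int, value int)")
    # pos runs from 1 upward and records the original column order
    conn.executemany(
        "insert into mytable values (1, ?, ?)",
        list(enumerate(values, start=1)),
    )
    rows = conn.execute("""
        select grp, pos, value,
               sum(value) over (partition by grp order by pos) as running_total
        from mytable
        order by grp, pos
    """).fetchall()
    conn.close()
    return [row[3] for row in rows]


print(running_totals([10, 20, 30, -50, 60]))
```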
If this "colk" is going to be needed in a lot of reports, I suggest you create a computed column or a view that sums all the columns, using k = cola+colb+...
There's no function in SQL to sum up a range of columns (e.g. between colA and colJ).

MonetDB: Enumerate groups of rows based on a given "boundary" condition

Consider the following table:
id  gap  groupID
0   0    1
2   3    1
3   7    2
4   1    2
5   5    2
6   7    3
7   3    3
8   8    4
9   2    4
Here groupID is the desired, computed column, whose value is incremented whenever the gap column is greater than a threshold (in this case 6). The id column defines the sequential order of appearance of the rows (and it's already given).
Can you please help me figure out how to dynamically fill out the appropriate values for groupID?
I have looked at several other entries here on Stack Overflow, and I've seen sum used as an aggregate in a window function. I can't use sum because it's not supported in MonetDB window functions (only rank, dense_rank, and row_number). I can't use triggers (to modify the record insertion before it takes place) either, because I need to keep the data mentioned above within a stored function in a local temporary table, and trigger declarations are not supported in MonetDB function definitions.
I have also tried filling out the groupID column value by reading the previous table (id and gap) into another temporary table (id, gap, groupID), with the hope that this would force a row-by-row operation. But this has failed as well because it gives the groupID 0 to all records:
declare threshold int;
set threshold = 6;
insert into newTable( id, gap, groupID )
select A.id, A.gap,
       case when A.gap > threshold then
         (select case when max(groupID) is null then 0 else max(groupID)+1 end from newTable)
       else
         (select case when max(groupID) is null then 0 else max(groupID) end from newTable)
       end
from A
order by A.id asc;
Any help, tip, or reference is greatly appreciated. It's been a long time already trying to figure this out.
BTW: Cursors are not supported in MonetDB either.
You can assign the group using a correlated subquery. Simply count the number of previous values that exceed 6:
select id, gap,
(select 1 + count(*)
from t as t2
where t2.id <= t.id and t2.gap > 6
) as Groupid
from t;
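The correlated subquery uses no window functions, so it can be checked on any SQL engine. The sketch below uses Python's sqlite3 module as a stand-in for MonetDB (an assumption made purely for testing) and loads the id/gap values from the question, with the threshold passed as a parameter.

```python
import sqlite3

# (id, gap) pairs from the question's table
ROWS = [(0, 0), (2, 3), (3, 7), (4, 1), (5, 5), (6, 7), (7, 3), (8, 8), (9, 2)]


def group_ids(threshold=6):
    """Return the computed group id for each row, ordered by id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t (id integer primary key, gap int)")
    conn.executemany("insert into t values (?, ?)", ROWS)
    rows = conn.execute("""
        select id, gap,
               (select 1 + count(*)
                from t as t2
                where t2.id <= t.id and t2.gap > ?
               ) as groupid
        from t
        order by id
    """, (threshold,)).fetchall()
    conn.close()
    return [row[2] for row in rows]


print(group_ids())
```

Each row's group number is one plus the count of rows up to and including it whose gap exceeds the threshold, which reproduces the groupID column shown in the question.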

SQL: Most efficient way to select sequences of rows from a table

I have a tagged textual corpus stored in an SQL table like the following:
id  tag1  tag2  token  sentence_id
0   a     e     five   1
1   b     f     score  1
2   c     g     years  1
3   d     h     ago    1
My task is to search the table for sequences of tokens that meet certain criteria, sometimes with gaps between each token.
For example:
I want to be able to search for a sequence similar to the following:
- the first token has the value a in the tag1 column, and
- the second token is one to two rows away from the first, and has the value g in tag2 or b in tag1, and
- the third token is at least three rows away, and has ago in the token column.
In SQL, this would be something like the following:
SELECT * FROM my_table t1
JOIN my_table t2 ON t1.sentence_id = t2.sentence_id
JOIN my_table t3 ON t3.sentence_id = t1.sentence_id
WHERE t1.tag1 = 'a' AND (t2.id = t1.id + 1 OR t2.id = t1.id + 2)
  AND (t2.tag2 = 'g' OR t2.tag1 = 'b')
  AND t3.id >= t1.id + 3 AND t3.token = 'ago'
So far I have only been able to achieve this by joining the table with itself each time I specify a new token in the sequence (e.g. JOIN my_table t4), but with millions of rows this gets quite slow. Is there a more efficient way to do this?
You could try this staged approach:
1. Apply each condition (other than the various distance conditions) as a subquery.
2. Calculate the distances between the tokens which meet the conditions.
3. Apply all the distance conditions separately.
This might improve things if you have indexes on the tag1, tag2 and token columns:
SELECT DISTINCT sentence_id FROM
(
  -- 2. Here we calculate the distances
  SELECT cond1.sentence_id,
         (cond2.id - cond1.id) as cond2_distance,
         (cond3.id - cond1.id) as cond3_distance
  FROM
  -- 1. These are all the non-distance conditions
  (
    SELECT * FROM my_table WHERE tag1 = 'a'
  ) cond1
  INNER JOIN
  (
    SELECT * FROM my_table WHERE
    (tag1 = 'b' OR tag2 = 'g')
  ) cond2
  ON cond1.sentence_id = cond2.sentence_id
  INNER JOIN
  (
    SELECT * FROM my_table WHERE token = 'ago'
  ) cond3
  ON cond1.sentence_id = cond3.sentence_id
) conditions
-- 3. Now apply the distance conditions
-- (the second token is one to two rows after the first)
WHERE cond2_distance BETWEEN 1 AND 2
  AND cond3_distance >= 3
ORDER BY sentence_id;
If you apply this query to this SQL fiddle you get:
| sentence_id |
|-------------|
| 1 |
| 4 |
Which is what you want. Now whether it's any faster or not, only you (with your million-row database) can really tell, but from the perspective of having to actually write these queries, you'll find they're much easier to read, understand and maintain.
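For a quick local check without the fiddle, the staged query can be run through Python's sqlite3 module. This is a sketch under assumptions: rows 0-3 are the question's sample, rows 4-7 are hypothetical filler forming a second sentence that should not match, and the second-token distance is taken as 1 to 2 rows, as the question specifies.

```python
import sqlite3

# Rows 0-3 come from the question; rows 4-7 are hypothetical filler
# (sentence 2 has no token with tag1 = 'b' or tag2 = 'g').
ROWS = [
    (0, "a", "e", "five", 1),
    (1, "b", "f", "score", 1),
    (2, "c", "g", "years", 1),
    (3, "d", "h", "ago", 1),
    (4, "a", "e", "a", 2),
    (5, "c", "f", "long", 2),
    (6, "c", "f", "time", 2),
    (7, "d", "h", "ago", 2),
]

STAGED_SQL = """
SELECT DISTINCT sentence_id FROM
(
  SELECT cond1.sentence_id,
         (cond2.id - cond1.id) AS cond2_distance,
         (cond3.id - cond1.id) AS cond3_distance
  FROM (SELECT * FROM my_table WHERE tag1 = 'a') cond1
  INNER JOIN (SELECT * FROM my_table WHERE tag1 = 'b' OR tag2 = 'g') cond2
          ON cond1.sentence_id = cond2.sentence_id
  INNER JOIN (SELECT * FROM my_table WHERE token = 'ago') cond3
          ON cond1.sentence_id = cond3.sentence_id
) conditions
WHERE cond2_distance BETWEEN 1 AND 2
  AND cond3_distance >= 3
ORDER BY sentence_id
"""


def matching_sentences():
    """Return the sentence ids that contain the token sequence."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table my_table (id integer primary key, tag1 text, tag2 text, "
        "token text, sentence_id int)"
    )
    conn.executemany("insert into my_table values (?, ?, ?, ?, ?)", ROWS)
    result = [row[0] for row in conn.execute(STAGED_SQL)]
    conn.close()
    return result


print(matching_sentences())
```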
You need to edit your question and give more details on how these sequences of tokens work (for instance, what does "each time I specify a new token in the sequence" mean in practice?).
In PostgreSQL you can solve this class of queries with a window function. Following your exact specification above:
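SELECT *
FROM (
  -- Compute the window functions over all rows first: a column alias
  -- cannot be referenced in the WHERE clause of the same SELECT, and
  -- filtering on tag1 here would hide the following rows from lead().
  SELECT *,
    CASE
      WHEN lead(tag2, 2) OVER w = 'g' THEN lead(token, 2) OVER w
      WHEN lead(tag1) OVER w = 'b' THEN lead(token) OVER w
      ELSE NULL::text
    END AS next_token
  FROM my_table
  WINDOW w AS (PARTITION BY sentence_id ORDER BY id)
) AS s
WHERE tag1 = 'a'
  AND next_token IS NOT NULL;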
The lead() function looks ahead a number of rows (default is 1, when not specified) from the current row within its partition, in this case all rows with the same sentence_id as specified in the window definition. So, lead(tag2, 2) looks at the value of tag2 two rows ahead to compare against your condition, and lead(token, 2) returns the token from two rows ahead, within the same sentence_id, as column next_token in the current row. If the first CASE condition fails, the second is evaluated; if that fails, NULL is returned. Note that the order of the conditions in the CASE clause is significant: different ordering gives different results.
Obviously, if you keep on adding conditions for subsequent tokens the query becomes very complex and you may have to put individual search conditions in separate stored procedures and then call these depending on your requirements.

Finding the Time Since items interacted in SQL database

I have a table like this:
Item1  Item2  Timestamp
A      B      2012-06-5 06:14:12
B      A      2012-06-6 06:20:12
C      A      2012-06-5 06:23:45
A      B      2012-06-7 08:35:35
C      A      2012-06-8 13:12:42
B      C      2012-06-8 15:14:57
I want to append another column, which we can call time_since, that shows how long ago Item1 and Item2 last interacted.
For example, line 2 would have an extra column with the number 6 in it, since the two items interacted 6 minutes prior.
I think on the second row of your example data you meant to put June fifth (not the 6th) in the timestamp field? Only in that scenario would there be a difference of 6 minutes.
If that is true and you really do want the difference in minutes, you should be able to use the following:
Fiddle Test: http://sqlfiddle.com/#!2/0aa157/4/0
select t1.item1,
       t1.item2,
       t1.timestamp,
       timestampdiff(minute, max(t2.timestamp), t1.timestamp) as time_diff
from tbl t1
left join tbl t2
  on t2.timestamp < t1.timestamp
  and ((t1.item1 = t2.item1 and t1.item2 = t2.item2) or
       (t1.item1 = t2.item2 and t1.item2 = t2.item1))
group by t1.item1, t1.item2, t1.timestamp
order by 3
(I changed the second row of your example data to be June 5th)
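To sanity-check the expected numbers outside the database, the same idea can be sketched in plain Python. This is a different technique from the self-join above: a single pass over the rows in timestamp order, remembering the last time each unordered pair was seen. The function name minutes_since_last is hypothetical, and the second row uses June 5th, as in the corrected data.

```python
from datetime import datetime

ROWS = [
    ("A", "B", datetime(2012, 6, 5, 6, 14, 12)),
    ("B", "A", datetime(2012, 6, 5, 6, 20, 12)),  # June 5th, as corrected
    ("C", "A", datetime(2012, 6, 5, 6, 23, 45)),
    ("A", "B", datetime(2012, 6, 7, 8, 35, 35)),
    ("C", "A", datetime(2012, 6, 8, 13, 12, 42)),
    ("B", "C", datetime(2012, 6, 8, 15, 14, 57)),
]


def minutes_since_last(rows):
    """Return (item1, item2, timestamp, whole minutes since the pair last met)."""
    last_seen = {}  # unordered pair -> timestamp of its latest interaction
    result = []
    for item1, item2, ts in sorted(rows, key=lambda r: r[2]):
        pair = frozenset((item1, item2))  # (A, B) and (B, A) are the same pair
        prev = last_seen.get(pair)
        minutes = None if prev is None else int((ts - prev).total_seconds() // 60)
        result.append((item1, item2, ts, minutes))
        last_seen[pair] = ts
    return result


for row in minutes_since_last(ROWS):
    print(row)
```

Pairs that have not interacted before get None, which corresponds to the NULL produced by the left join when no earlier row exists.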