Remove old duplicate rows in BQ based on timestamp - sql

I have a BQ table with duplicate rows (exactly two) for the same ad_id.
I want to delete the older rows that are more than 120 minutes old and that have a newer row with the same ad_id (the schema contains timestamp, ad_id and value, but there is no rowId).
This is my attempt; is there a nicer way to do it?
DELETE FROM {table_full_name} o
WHERE timestamp < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 120 MINUTE)
  AND timestamp IN (
    SELECT MIN(timestamp)
    FROM {table_full_name} i
    WHERE i.ad_id = o.ad_id
    GROUP BY ad_id)
Data example:
ad_id | ts               | value
1     | Sep-1-2021 12:01 | Scanned
2     | Sep-1-2021 12:02 | Error
1     | Sep-1-2021 12:03 | Removed
I want to clean it up to be:
ad_id | ts               | value
2     | Sep-1-2021 12:02 | Error
1     | Sep-1-2021 12:03 | Removed
I saw this post, but BQ doesn't support auto-increment for a row ID.
I also saw this post, but how can I modify it without the ts interval (as it's unknown)?

You can try this script. It uses COUNT() with HAVING to find ad_ids that have duplicate records, and TIMESTAMP_DIFF to restrict the delete to rows older than 120 minutes from the current time:
DELETE
FROM `table_full_name`
WHERE ad_id IN (SELECT ad_id
                FROM `table_full_name`
                GROUP BY ad_id
                HAVING COUNT(ad_id) > 1)
  AND TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), timestamp, MINUTE) > 120
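One caveat: if both copies of an ad_id are older than 120 minutes, the script above deletes both of them. A hedged alternative sketch that always keeps the newest row per ad_id, using a correlated EXISTS (replace `table_full_name` with the fully qualified table name):

```sql
-- Sketch: delete a row only when a strictly newer row with the same ad_id exists,
-- so the most recent row per ad_id always survives.
DELETE FROM `table_full_name` o
WHERE o.timestamp < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 120 MINUTE)
  AND EXISTS (SELECT 1
              FROM `table_full_name` n
              WHERE n.ad_id = o.ad_id
                AND n.timestamp > o.timestamp);
```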


MariaDB convert created_at timestamp in total hours from now

I have the following query, which returns created_at timestamps. I would like to convert them into total hours from now. Is there an easy way to make that conversion and print the total hours?
MariaDB version 10.5.12-MariaDB-1:10.5.12+maria~focal-log
MariaDB [nova]> select hostname, uuid, instances.created_at, instances.deleted_at, json_extract(flavor, '$.cur.*."name"') AS FLAVOR from instances join instance_extra on instances.uuid = instance_extra.instance_uuid WHERE (vm_state='active' OR vm_state='stopped');
+----------+--------------------------------------+---------------------+------------+--------------+
| hostname | uuid                                 | created_at          | deleted_at | FLAVOR       |
+----------+--------------------------------------+---------------------+------------+--------------+
| vm1      | ef6380b4-5455-48f8-9e4b-3d04199be3f5 | 2023-01-05 14:25:51 | NULL       | ["tempest2"] |
+----------+--------------------------------------+---------------------+------------+--------------+
1 row in set (0.001 sec)
Try it like this:
SELECT hostname, uuid, instances.created_at,
       TIMESTAMPDIFF(HOUR, instances.created_at, NOW()) AS HOURDIFF,
       instances.deleted_at,
       JSON_EXTRACT(flavor, '$.cur.*."name"') AS FLAVOR
FROM instances
JOIN instance_extra ON instances.uuid = instance_extra.instance_uuid
WHERE (vm_state='active' OR vm_state='stopped');
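As a quick sanity check of the unit behavior (a sketch with hard-coded timestamps; TIMESTAMPDIFF only counts complete units and truncates the remainder):

```sql
-- 26 hours, 34 minutes and 9 seconds apart: only complete hours are counted.
SELECT TIMESTAMPDIFF(HOUR, '2023-01-05 14:25:51', '2023-01-06 17:00:00') AS hourdiff;
-- returns 26
```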

Querying the retention rate on multiple days with SQL

Given a simple data model that consists of a user table and a check_in table with a date field, I want to calculate the retention rate of my users. So, for example, for all users with one or more check-ins, I want the percentage of users who did a check-in on their 2nd day, on their 3rd day, and so on.
My SQL skills are pretty basic, as it's not a tool I use that often in my day-to-day work, and I know that this is beyond the types of queries I am used to. I've been looking into pivot tables to achieve this, but I am unsure whether this is the correct path.
Edit:
The user table does not have a registration date. One can assume it only contains the ID for this example.
Here is some sample data for the check_in table:
| user_id | date                |
|---------|---------------------|
| 1       | 2020-09-02 13:00:00 |
| 4       | 2020-09-04 12:00:00 |
| 1       | 2020-09-04 13:00:00 |
| 4       | 2020-09-04 11:00:00 |
| ...     | ...                 |
And the expected output of the query would be something like this:
| day_0 | day_1 | day_2 | day_3 |
|-------|-------|-------|-------|
| 70%   | 67%   | 44%   | 32%   |
Please note that I've used random numbers for this output just to illustrate the format.
Oh, I see. Assuming you mean days since each user's first check-in -- and users might have none -- then just use aggregation and window functions:
select sum( (ci.day = ci.first_day)::int )::numeric / u.num_users as day_0,
       sum( (ci.day = ci.first_day + interval '1 day')::int )::numeric / u.num_users as day_1,
       sum( (ci.day = ci.first_day + interval '2 day')::int )::numeric / u.num_users as day_2
from (select u.*, count(*) over () as num_users
      from users u
     ) u left join
     (select ci.user_id, ci.date::date as day,
             min(min(ci.date::date)) over (partition by ci.user_id) as first_day
      from check_in ci
      group by ci.user_id, ci.date::date
     ) ci on ci.user_id = u.id
group by u.num_users;
Note that this aggregates the check_in table by user id and date, which ensures that there is only one row per user per date.
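If a single day-N figure is all you need, a shorter sketch of the same idea using Postgres's aggregate FILTER clause (assuming tables users(id) and check_in(user_id, date) as in the question):

```sql
-- Share of all users who checked in exactly one day after their first check-in.
with firsts as (
  select user_id, min(date::date) as first_day
  from check_in
  group by user_id
)
select (count(distinct ci.user_id)
          filter (where ci.date::date = f.first_day + 1))::numeric
       / (select count(*) from users) as day_1
from check_in ci
join firsts f on f.user_id = ci.user_id;
```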

Performance on querying only the most recent entries

I made an app that records when a worker arrives at and departs from the premises.
Over 24 hours, multiple checks are made, so the database can quickly fill with hundreds to thousands of records, depending on activity.
| user_id | device_id | station_id | arrived_at          | departed_at         |
|---------|-----------|------------|---------------------|---------------------|
| 67      | 46        | 4          | 2020-01-03 11:32:45 | 2020-01-03 11:59:49 |
| 254     | 256       | 8          | 2020-01-02 16:29:12 | 2020-01-02 16:44:55 |
| 97      | 87        | 7          | 2020-01-01 09:55:01 | 2020-01-01 11:59:18 |
...
This becomes a problem since the daily report software, which later reports who was absent or who worked extra hours, filters by arrival date.
The query becomes a full table scan:
(I just used SQLite for this example, but you get the idea)
EXPLAIN QUERY PLAN
SELECT * FROM activities
WHERE user_id = 67
AND arrived_at > '2020-01-01 00:00:00'
AND departed_at < '2020-01-01 23:59:59'
ORDER BY arrived_at DESC
LIMIT 10
What I want is to make the query snappier for records created (arrived) on the most recent day, since queries for older days are rarely executed. Otherwise, I'll have to deal with timeouts.
I would use the following index, so that rows whose departed_at doesn't match can be eliminated before probing the table (SQLite requires a name for the index; pick any):
CREATE INDEX activities_arrival_idx ON activities (arrived_at, departed_at);
On Postgres, you may use DISTINCT ON:
SELECT DISTINCT ON (user_id) *
FROM activities
ORDER BY user_id, arrived_at::date DESC;
This assumes that you only want to report the latest record, as determined by the arrival date, for each user. If instead you just want to show all records with the latest arrival date across the entire table, then use:
SELECT *
FROM activities
WHERE arrived_at::date = (SELECT MAX(arrived_at::date) FROM activities);
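One caveat with the arrived_at::date predicates above: casting the column prevents a plain index on arrived_at from being used. On Postgres, an expression index makes that predicate sargable (the index name is hypothetical, and this assumes arrived_at is timestamp without time zone, so the cast is immutable):

```sql
-- Index the casted expression so WHERE arrived_at::date = ... can use it.
CREATE INDEX activities_arrival_date_idx ON activities ((arrived_at::date));
```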

query from/to, newest within a specific timerange

I have the following table:
PersNumber | Property | From     | To
XXX        | 34       | 20180101 | 20180630
XXX        | 38       | 20180701 | 20190330
XXX        | 39       | 20180401 | 20201231
I have a period time frame, e.g. from 2018-01-01 to 2019-12-31.
I need to query the last row (actually, only the first 2 columns). The criteria are: from/to within the time range, and the newest if more than one. Meaning:
row 1: out, because it is not in the period scope
row 2: a part is in the period scope, but it is not the newest
row 3: a part is in the period scope, and this is the newest
I don't know whether the problem is understandable; if not, do not hesitate to tell me.
You seem to want:
select t.*
from t
where date_from >= '2018-01-01' and date_to <= '2019-12-31'
order by date_from desc
limit 1;
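If "within the time range" is meant as overlap rather than full containment (so a row like 20180401–20201231, which extends past the period, still qualifies), a hedged variant using the standard interval-overlap test and taking the newest start:

```sql
-- Sketch: rows overlapping the period, newest start date first.
select PersNumber, Property
from t
where date_from <= '2019-12-31'   -- starts before the period ends
  and date_to   >= '2018-01-01'   -- ends after the period starts
order by date_from desc
limit 1;
```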

sql -- issue with repeat values on join

I have two tables in PostgreSQL. I think it may be due to an issue with my PK/FK or my lack of understanding of how to query properly:
CREATE TABLE Minute
(
    Name varchar(20),
    Day date,
    Minute time,
    Weight real,
    Speed real,
    PRIMARY KEY (Name, Day, Minute)
)
--NOTE: This table has every day, for every minute, in a month.
CREATE TABLE DataMan
(
    Name varchar(20),
    Day date, --NOTE: This is by day: 10/31/2013, 11/30/2013
    Size real,
    Volume real,
    NumEv real,
    PRIMARY KEY (Name, Day)
)
The kind of data that I have in DataMan would be like:
GOOG | 10/31/2013 | 123 | 456 | 5
GOOG | 11/30/2013 | 234 | 412 | 5
and a bunch of other names and data for other months.
The kind of data that I have in Minute would be like:
GOOG | 10/31/2013 | 12:00:00 | 251.312 | 1231.12
GOOG | 10/31/2013 | 12:01:00 | 124.51 | 1239
So, I want to create a table that has:
Minute.Name | Minute.Date | Minute.Time | DataMan.Size
GOOG | 10/31/2013 | 12:00:00 | 123
GOOG | 10/31/2013 | 12:01:00 | 123
This is my query:
SELECT minute.name, minute.date, minute.time, dataman.size
FROM minute LEFT JOIN dataman ON (minute.name = dataman.name)
ORDER BY minute.name ASC, minute.date ASC, minute.time ASC
And what happens is that the table output does something like:
GOOG | 10/31/2013 | 12:00:00 | 123
GOOG | 10/31/2013 | 12:00:00 | 234
I want dataman.size to stay the same as the minutes increment, but the query behaves like a cartesian product and repeats every dataman.size value for each minute, which doesn't make sense.
It looks like you just forgot to join on Day in addition to Name.
In the join condition, instead of:
ON (minute.name = dataman.name)
this should be:
ON (minute.name = dataman.name AND minute.Day=dataman.Day)
Since there's a unique constraint on (name, day) in dataman, we know that at most one row of dataman will match each row in minute with the above join condition.
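Putting the corrected condition into the full query (a sketch; note the DDL names the columns Day and Minute, while the question's query refers to date and time, so adjust the identifiers to whatever your table actually uses):

```sql
-- One dataman.size per (name, day), repeated for each minute of that day.
SELECT minute.name, minute.day, minute.minute, dataman.size
FROM minute
LEFT JOIN dataman
       ON minute.name = dataman.name
      AND minute.day  = dataman.day
ORDER BY minute.name, minute.day, minute.minute;
```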
If I understand correctly, you want the most recent size from dataman on or before the date in the minute table. Is this interpretation correct?
Here is one way of getting it with a correlated subquery:
SELECT minute.name, minute.date, minute.time,
       (select dataman.size
        from dataman
        where minute.name = dataman.name and
              minute.date >= dataman.date
        order by dataman.date desc
        limit 1
       ) as size
FROM minute
ORDER BY minute.name ASC, minute.date ASC, minute.time ASC