SQL: need only 1 row per particular timestamp

I have some SQL code that inserts values from another (non-SQL-based) system. One of the values I get is a timestamp.
I can get multiple inserts that have the same timestamp (albeit different values for other fields).
My problem is that I am trying to get the first insert of each day (based upon the timestamp) since a particular day (i.e. give me the first insert of each day since January 28, 2007...).
My code to get the first timestamp of every day is as follows:
SELECT MIN(my_timestamp) AS first_timestamp
FROM my_schema.my_table
WHERE my_col1 = 'WHATEVER'
AND my_timestamp > timestamp '2010-Jul-27 07:45:24' - INTERVAL '365 DAY'
GROUP BY DATE(my_timestamp);
This delivers the list of times available. But when I join against these times, I can get several rows, as there are lots of rows that match these times. So for 365 days, I may get 5,000 rows (I could be inserting 100 rows at 00:00:00 every day).
Assuming, in the example above, my_table has columns my_col1 and my_col2, how can I get exactly 365 rows that contain my_col1 and my_col2? It doesn't matter which row I get back if there are multiple rows for a date; any row will suffice.
It's an odd question. The overall problem is: given a timestamp, how can one get one row per timestamp even if there are multiple rows with said timestamp (assuming there is no other priority)?
Thanks for the help in advance.
EDIT:
So, let's say for example, this table has the following columns: my_col1, my_col2, and my_timestamp.
Here are example values (in order of my_col1 - my_col2 - my_timestamp):
'my_val1' - 10 - '2010-07-01 01:01:01'
'my_val2' - 11 - '2010-07-01 01:01:01'
'my_val3' - 12 - '2010-07-01 01:01:01'
'my_val4' - 13 - '2010-07-01 01:01:02'
'my_val5' - 14 - '2010-07-02 01:01:01'
'my_val6' - 15 - '2010-07-02 01:01:01'
'my_val7' - 16 - '2010-07-03 01:01:01'
In the end, I would want only 3 rows: one with timestamp '2010-07-01 01:01:01', one with '2010-07-02 01:01:01', and one with '2010-07-03 01:01:01'. The third one is easy, since there is only one row with that last timestamp. But the first two are the tricky ones; the SQL I posted above will ignore the row with 'my_val4'.
I need a query that will return all of the columns, not just the dates.
How would I get SQL to give me either the first or the last of the values matching that timestamp? (It doesn't matter which; I just need one row per first-of-day timestamp.)

select distinct on (date(my_timestamp)) *
from my_table
order by date(my_timestamp), my_timestamp
This selects all columns, exactly one row per date(my_timestamp). The single row per day is the first row for the group, as determined by order by (so that's the row with minimal my_timestamp).
Of course you can add whatever joins, wheres etc. you need. But this is the stub you're looking for.
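For reference, here is a runnable sketch of the same one-row-per-day selection using SQLite through Python. SQLite has no DISTINCT ON, so ROW_NUMBER() stands in for it (window functions need SQLite 3.25+); the table and data are the example from the EDIT above:

```python
import sqlite3

# Emulate PostgreSQL's DISTINCT ON (date(my_timestamp)) with ROW_NUMBER().
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE my_table (my_col1 TEXT, my_col2 INTEGER, my_timestamp TEXT);
INSERT INTO my_table VALUES
  ('my_val1', 10, '2010-07-01 01:01:01'),
  ('my_val2', 11, '2010-07-01 01:01:01'),
  ('my_val3', 12, '2010-07-01 01:01:01'),
  ('my_val4', 13, '2010-07-01 01:01:02'),
  ('my_val5', 14, '2010-07-02 01:01:01'),
  ('my_val6', 15, '2010-07-02 01:01:01'),
  ('my_val7', 16, '2010-07-03 01:01:01');
""")
rows = conn.execute("""
SELECT my_col1, my_col2, my_timestamp FROM (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY DATE(my_timestamp)
        ORDER BY my_timestamp
    ) AS rn
    FROM my_table
)
WHERE rn = 1
ORDER BY my_timestamp
""").fetchall()
print(rows)  # one row per day; ties on equal timestamps broken arbitrarily
```

This returns exactly three rows, one per date, matching the expected output in the EDIT.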

The solution is to use SQL's DISTINCT keyword (http://www.sql-tutorial.com/sql-distinct-sql-tutorial/):
SELECT DISTINCT MIN(my_timestamp) AS first_timestamp
FROM my_schema.my_table
WHERE my_col1 = 'WHATEVER'
AND my_timestamp > timestamp '2010-Jul-27 07:45:24' - INTERVAL '365 DAY'
GROUP BY DATE(my_timestamp);

I know you already have an answer, but I still don't understand why you have mentioned a join in your question. Why not just include the rest of the columns in your query, like this:
SELECT MIN(my_timestamp) AS first_timestamp, my_col1, my_col2
FROM my_table
GROUP BY DATE(my_timestamp);
This works in MySQL. Does it not return the expected result in PostgreSQL?

Related

druid sql query - count distinctly for a multi value field across records

Is there a way in Druid SQL to do a distinct count across different rows for a multi-value field, such that a value is only counted once per array? E.g. suppose I have the records below:
shippingSpeed
[standard, standard, standard, ground]
[standard, ground]
[ground, ground]
Expected Result:
standard 2
ground 3
I tried the query below, but it aggregates the field count inside each array and then gives the total count across all records:
SELECT
"shippingSpeed", count(*)
FROM orders
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '30' DAY
GROUP BY 1
ORDER BY 2 ASC
Result:
standard 4
ground 4
This is because the Group By on multi-valued columns will UNNEST the array into multiple rows. It is counting each item as an instance correctly.
If you want to remove duplicates, define "shippingSpeed" at ingestion time with the property:
"multiValueHandling": "SORTED_SET"
You can find more details here: https://druid.apache.org/docs/latest/querying/multi-value-dimensions.html#overview
Okay, there are some undocumented functions that you can use.
SELECT
array_set_add(MV_TO_ARRAY("shippingSpeed"), null), count(*)
FROM orders
WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '30' DAY
GROUP BY 1
ORDER BY 2 ASC
which might work.
MV_TO_ARRAY -> converts the multi value col to an array
array_set_add -> creates a set out of the arrays. Since we do not have two arrays, the second argument is null.
but what #sergio said might be the easiest option.
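The effect of collapsing duplicates before counting can be sketched in plain Python (this mimics what SORTED_SET handling achieves at ingestion; it is not Druid code):

```python
from collections import Counter

# Mimic "multiValueHandling": "SORTED_SET": collapse duplicates within each
# record's multi-value field, then count each value once per record.
records = [
    ["standard", "standard", "standard", "ground"],
    ["standard", "ground"],
    ["ground", "ground"],
]
counts = Counter()
for rec in records:
    counts.update(set(rec))  # a value contributes at most 1 per record
print(counts["standard"], counts["ground"])  # 2 3
```

This reproduces the expected result from the question: standard 2, ground 3.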

Duplicate records (rows) with modified values (Postgresql)

I have a dataset that I am preparing for a pivot view (in Excel). It's customer data, and I want to create a view that makes it possible to summarize how many customers were active every month. Therefore I need to duplicate rows with a modified value for first_status_date. This is my data:
This is where I am trying to get: (The green rows are duplicated, and the cells in bold are modified.)
Because I am adding a month on top of the value of the cell above, I am thinking of working with the lag function. But I am not familiar with duplicating rows. The ids 123 and 356 need 2 duplicated rows each because there is a 2-month difference between the last and first status dates; the id 221 needs only 1 duplicated row, as there is a one-month difference.
You can use generate_series() to generate the new rows - a little tweaking is needed for the end date to ensure that we do get a row for the last month, regardless of the actual month day:
select
t.id,
t.first_status,
d.first_status_date,
t.last_status,
t.last_status_date
from mytable t
cross join lateral generate_series(
t.first_status_date,
date_trunc('month', t.last_status_date) + interval '1 month',
interval '1 month'
) d(first_status_date)
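The month-by-month expansion that generate_series() performs can be sketched in plain Python. The row values here are hypothetical, and the day-of-month is assumed to be <= 28 to keep the month arithmetic simple:

```python
from datetime import date

def add_month(d: date) -> date:
    # step one calendar month forward, keeping the day-of-month (day <= 28 assumed)
    y, m = (d.year + 1, 1) if d.month == 12 else (d.year, d.month + 1)
    return d.replace(year=y, month=m)

def expand(row):
    # row: (id, first_status, first_status_date, last_status, last_status_date)
    rid, first_status, fsd, last_status, lsd = row
    last_month = date(lsd.year, lsd.month, 1)   # month of last_status_date
    out, d = [], fsd
    while date(d.year, d.month, 1) <= last_month:
        out.append((rid, first_status, d, last_status, lsd))
        d = add_month(d)
    return out

# hypothetical row: 2-month gap -> original row plus 2 duplicates = 3 rows
rows = expand((123, 'trial', date(2021, 1, 15), 'churned', date(2021, 3, 20)))
print(rows)
```

As in the question, a 2-month difference yields the original row plus two duplicates with first_status_date stepped forward a month each time.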

How to select data but without similar times?

I have a table with create_dt times, and I need to get the records, but without those whose create_dt times are similar (within 15 minutes).
So I need to get only one record instead of two if the second record's create_dt is within 15 minutes of the first one's.
The format of the date and time is ('29.03.2019 00:00:00', 'DD.MM.YYYY HH24:MI:SS'). Thanks.
It's a bit unclear what exactly you want, but one thing I can think of, is to round all values to the nearest "15 minute" and then only pick one row from those "15 minute" intervals:
with rounded as (
select create_dt,
date '0001-01-01' + (round((cast(create_dt as date) - date '0001-01-01') * 24 * 60 / 15) * 15 / 60 / 24) as rounded,
... other columns ....
from your_table
), numbered as (
select create_dt,
rounded,
row_number() over (partition by rounded order by create_dt) as rn
... other columns ....
from rounded
)
select *
from numbered
where rn = 1;
The expression date '0001-01-01' + (round((cast(create_dt as date) - date '0001-01-01') * 24 * 60 / 15) * 15 / 60 / 24) will return create_dt rounded up or down to the next "15 minutes" interval.
The row_number() then assigns unique numbers for each distinct 15 minutes interval and the final select then always picks the first row for that interval.
Online example: https://dbfiddle.uk/?rdbms=oracle_11.2&fiddle=e6c7ea651c26a6f07ccb961185652de7
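The same round-to-15-minutes-and-keep-one idea can be sketched in Python rather than Oracle SQL (the timestamps are made up; note Python's round() uses banker's rounding at the exact midpoint, which can differ from Oracle's ROUND at 7.5 minutes):

```python
from datetime import datetime, timedelta

# Round each create_dt to the nearest 15-minute boundary (as the Oracle
# expression does), then keep only the first row of each bucket.
def bucket(ts: datetime) -> datetime:
    secs = ts.minute * 60 + ts.second
    rounded = round(secs / 900) * 900            # nearest 15 minutes
    return ts.replace(minute=0, second=0, microsecond=0) + timedelta(seconds=rounded)

rows = [
    datetime(2019, 3, 29, 0, 2, 10),
    datetime(2019, 3, 29, 0, 6, 0),   # same 15-minute bucket as the row above
    datetime(2019, 3, 29, 0, 20, 0),
]
seen, kept = set(), []
for ts in sorted(rows):               # order by create_dt, like row_number()
    b = bucket(ts)
    if b not in seen:                 # rn = 1 for this bucket
        seen.add(b)
        kept.append(ts)
print(kept)
```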
I'm going to walk you through this conceptually. First of all, there's a difficulty in doing this that you might not have noticed.
Let's say you wanted one record from the same hour or day. If there are two records created on the same day, you only want one in your results. But which one?
I mention this because, to the designers of SQL, there is no single answer they could have SQL pick. They cannot show data from both records without both records being in the tabular output.
This is a common problem, and the feature the designers of SQL provided for it can only work if there is no ambiguity about how to produce one result row from two records. That solution is GROUP BY, but it only works for showing the fields other than the timestamp if they are the same for all the records that match the time period. You have to include all the fields in your SELECT clause, and if multiple records in your time period differ, they will create multiple records in your output. So although there is a tool, GROUP BY, for this problem, you might not be able to use it.
So here is the solution you want. If multiple records are close together, don't include the records after the first one. So you want a WHERE clause that excludes a record if another record closely precedes it. The test for each record in the result thus involves other records in the table, so you need to join the table to itself.
Let's say we have a table named error_events. If we get multiples of the same value in the field error_type very close to the time of other similar events, we only want to see the first one. The SQL will look something like this:
SELECT A.*
FROM error_events A
INNER JOIN error_events B ON A.error_type = B.error_type
WHERE ???
You will have to figure out the details of the WHERE clause, and the functions for the timestamp will depend on which RDBMS product you are using (MySQL and PostgreSQL, for instance, may work differently).
You want only the records for which there is no record less than 15 minutes earlier. You do want the original record. That record will match itself in the join, but it will be the only record in the time period between its timestamp and 15 minutes prior.
So an example WHERE clause would be
WHERE B.create_dt BETWEEN [15 minutes before A.create_dt] and A.create_dt
GROUP BY A.*
HAVING 1 = COUNT(B.pkey)
Like we said, you will have to find out how your database product subtracts timestamps, and how 15 minutes is represented in that difference.
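Here is one concrete way to express "no record less than 15 minutes earlier", sketched in SQLite through Python. It uses a NOT EXISTS subquery instead of the join-plus-GROUP-BY shown above, and the error_events table and its data are hypothetical:

```python
import sqlite3

# Keep an event only when no earlier event of the same error_type falls
# within the preceding 15 minutes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE error_events (error_type TEXT, create_dt TEXT);
INSERT INTO error_events VALUES
  ('disk', '2019-03-29 00:00:00'),
  ('disk', '2019-03-29 00:05:00'),
  ('disk', '2019-03-29 00:30:00'),
  ('net',  '2019-03-29 00:02:00');
""")
kept = conn.execute("""
SELECT A.error_type, A.create_dt
FROM error_events A
WHERE NOT EXISTS (
    SELECT 1 FROM error_events B
    WHERE B.error_type = A.error_type
      AND B.create_dt < A.create_dt
      AND B.create_dt >= datetime(A.create_dt, '-15 minutes')
)
ORDER BY A.error_type, A.create_dt
""").fetchall()
print(kept)  # the 00:05:00 disk event is suppressed by the one at 00:00:00
```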

How to compare time stamps from consecutive rows

I have a table that I would like to sort by a timestamp desc and then compare all consecutive rows to determine the difference between each row. From there, I would like to find all the rows whose difference is greater than ~2hours.
I'm stuck on how to actually compare consecutive rows in a table. Any help would be much appreciated.
I'm using Oracle SQL Developer 3.2
You didn't show us your table definition, but something like this:
select *
from (
  select t.*,
         t.timestamp_column - lag(timestamp_column) over (order by timestamp_column) as diff
  from the_table t
) x
where diff > interval '2' hour;
This assumes that timestamp_column is defined as timestamp, not date (otherwise the result of the difference wouldn't be an interval).
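The same LAG() comparison can be transcribed to SQLite (via Python) for anyone without Oracle at hand. SQLite has no interval type, so the gap is computed in epoch seconds with strftime('%s', ...); window functions need SQLite 3.25+, and the table and data here are hypothetical:

```python
import sqlite3

# Compute the gap to the previous row in seconds via LAG(), then return
# the rows arriving more than two hours after their predecessor.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE the_table (ts TEXT);
INSERT INTO the_table VALUES
  ('2014-01-01 06:00:00'),
  ('2014-01-01 07:00:00'),
  ('2014-01-01 10:00:00');
""")
gaps = conn.execute("""
SELECT ts, diff_seconds FROM (
    SELECT ts,
           strftime('%s', ts) - strftime('%s', LAG(ts) OVER (ORDER BY ts))
             AS diff_seconds
    FROM the_table
)
WHERE diff_seconds > 2 * 3600
""").fetchall()
print(gaps)  # only the 10:00 row, three hours after 07:00
```

The first row's LAG() is NULL, so its diff is NULL and the WHERE clause drops it, which is usually what you want here.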

Query to count records within time range SQL or Access Query

I have a table that looks like this:
Row,TimeStamp,ID
1,2014-01-01 06:01:01,5
2,2014-01-01 06:00:03,5
3,2014-01-01 06:02:00,5
4,2014-01-01 06:02:39,5
What I want to do is count the number of records for each ID, however I don't want to count records if a subsequent TimeStamp is within 30 seconds.
So in my above example the total count for ID 5 would be 3, because it wouldn't count Row 2 because it is within 30 seconds of the last timestamp.
I am building a Microsoft Access application, and currently using a Query, so this query can either be an Access query or a SQL query. Thank you for your help.
I think the query below does what you want; however, I don't understand your expected output. It returns a count of 4 (all the rows in your example), which I believe is correct, because all of your records are at least 30 seconds apart. No single timestamp has a subsequent timestamp within 30 seconds of it (in time).
Row 2, with a timestamp of '2014-01-01 06:00:03', is not within 30 seconds of any timestamp coming after it. The closest is row #1, which is 58 seconds later (58 is greater than 30, so I don't know why you think it should be excluded, given what you said you wanted in your explanation).
Rows 1/3/4 of your example data are also not within 30 seconds of each other.
This is a test of the sql below but like I said it returns all 4 rows (change to a count if you want the count, I brought back the rows to illustrate):
http://sqlfiddle.com/#!3/0d727/20/0
Now check this example with some added data: (I added a fifth row)
http://sqlfiddle.com/#!3/aee67/1/0
insert into tbl values ('2014-01-01 06:01:01',5);
insert into tbl values ('2014-01-01 06:00:03',5);
insert into tbl values ('2014-01-01 06:02:00',5);
insert into tbl values ('2014-01-01 06:02:39',5);
insert into tbl values ('2014-01-01 06:02:30',5);
Note how the query result shows only 3 rows. That is because the row I added (#5) is within 30 seconds of row #3, so #3 is excluded. Row #5 also gets excluded because row #4 is 9 seconds (<=30) later than it. Row #4 does come back because no subsequent timestamp is within 30 seconds (there are no subsequent timestamps at all).
Query to get the detail:
select *
from tbl t
where not exists
(select 1
from tbl x
where x.id = t.id
and x.timestamp > t.timestamp
and datediff(second, t.timestamp, x.timestamp) <= 30)
Query to get the count by ID:
select id, count(*)
from tbl t
where not exists
(select 1
from tbl x
where x.id = t.id
and x.timestamp > t.timestamp
and datediff(second, t.timestamp, x.timestamp) <= 30)
group by id
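The count query can be checked against the five-row data in SQLite (via Python); T-SQL's DATEDIFF(second, ...) becomes a strftime('%s', ...) difference:

```python
import sqlite3

# The NOT EXISTS count on the five-row data: a row is dropped when a later
# row with the same id is 30 seconds or less after it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl (ts TEXT, id INTEGER);
INSERT INTO tbl VALUES
  ('2014-01-01 06:01:01', 5),
  ('2014-01-01 06:00:03', 5),
  ('2014-01-01 06:02:00', 5),
  ('2014-01-01 06:02:39', 5),
  ('2014-01-01 06:02:30', 5);
""")
counts = conn.execute("""
SELECT id, COUNT(*)
FROM tbl t
WHERE NOT EXISTS (
    SELECT 1 FROM tbl x
    WHERE x.id = t.id
      AND x.ts > t.ts
      AND strftime('%s', x.ts) - strftime('%s', t.ts) <= 30
)
GROUP BY id
""").fetchall()
print(counts)  # [(5, 3)]: rows 3 and 5 are excluded, as described above
```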
To the best of my knowledge it is impossible to do with just a SQL statement as presented.
I use two approaches:
For small result sets, remove the surplus records inside your time windows in code, then calculate the relevant statistics. The main advantage of this approach is that you do not have to alter the database structure.
Add a field to flag each record relative to the time window, then use code to preprocess your data & fill the indicator. You can now use SQL to aggregate / filter based on the new flag column. If you need to track multiple time windows, you can use multiple flags / multiple columns (e.g. 30 second window, 600 second window, etc)
For this, I'd recommend the second approach: it lets the database (SQL) do more work for you once the preprocessing step is done.
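A sketch of that preprocessing step in Python: compute a keep flag in code, marking the first record of each 30-second window per ID, which could then be written back as the flag column. Note this chains windows from the first kept record, which is one of several possible readings of the requirement; on the question's four-row data every row starts a new window, matching the count of 4 discussed above:

```python
from datetime import datetime, timedelta

# Flag the first record of each 30-second window (per id); later rows that
# fall within 30 seconds of the window's start get keep=False.
rows = [
    {"id": 5, "ts": datetime(2014, 1, 1, 6, 1, 1)},
    {"id": 5, "ts": datetime(2014, 1, 1, 6, 0, 3)},
    {"id": 5, "ts": datetime(2014, 1, 1, 6, 2, 0)},
    {"id": 5, "ts": datetime(2014, 1, 1, 6, 2, 39)},
]
rows.sort(key=lambda r: (r["id"], r["ts"]))
window = {}  # id -> start timestamp of the current 30-second window
for r in rows:
    start = window.get(r["id"])
    if start is None or r["ts"] - start > timedelta(seconds=30):
        r["keep"] = True              # first record of a new window
        window[r["id"]] = r["ts"]
    else:
        r["keep"] = False
kept_count = sum(r["keep"] for r in rows)
print(kept_count)  # 4
```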