HiveQL - Query Number of Entries over fixed unit of time

I have a table that is similar to the following:
LOGIN ID (STRING): TIME_STAMP (STRING HH:MM:SS)
BillyJoel 10:45:00
PianoMan 10:45:30
WeDidnt 10:45:45
StartTheFire 10:46:00
AlwaysBurning 10:46:30
Is there any possible way to get a query that gives me a column of the number of logins over a period of time? Something like this:
3 (number of logins from 10:45:00 - 10:45:59)
2 (number of logins from 10:46:00 - 10:46:59)
Note: If you can only do it with int timestamps, that's alright. My original table is all strings, so I thought I would represent that here. The text in parentheses doesn't need to be printed.

If you want it by minute, you can just lop off the seconds:
select substr(time_stamp, 1, 5) as hhmm, count(*)
from t
group by substr(time_stamp, 1, 5)
order by hhmm;
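The per-minute grouping can be tried out locally. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made so the example runs anywhere; SQLite's substr takes the same string, start, length arguments). The column name login_id is illustrative, and the rows are the sample data from the question:

```python
import sqlite3

# In-memory SQLite database standing in for the Hive table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (login_id TEXT, time_stamp TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [
        ("BillyJoel", "10:45:00"),
        ("PianoMan", "10:45:30"),
        ("WeDidnt", "10:45:45"),
        ("StartTheFire", "10:46:00"),
        ("AlwaysBurning", "10:46:30"),
    ],
)

# Keep only 'HH:MM' and count the rows that share it.
rows = conn.execute(
    """
    SELECT substr(time_stamp, 1, 5) AS hhmm, count(*) AS logins
    FROM t
    GROUP BY substr(time_stamp, 1, 5)
    ORDER BY hhmm
    """
).fetchall()

print(rows)  # [('10:45', 3), ('10:46', 2)]
```

Grouping on the expression itself rather than on the alias avoids depending on whether a given engine accepts column aliases in GROUP BY.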

Related

How to make a group query to select multiple rows?

I have a DateTime column (timestamp 2022-05-22 10:10:12) with a batch of stamps per each day.
I need to filter the rows where stamp is before 9am (here is no problem) and I'm using this code:
SELECT * FROM tickets
WHERE date_part('hour'::text, tickets.date_in) < 9::double precision;
The output is the list of the rows where the time in timestamp is less than 9 am (50 rows from 2000).
date_in
2022-05-22 08:10:12
2022-04-23 07:11:13
2022-06-15 08:45:26
Then I need to find all the days where at least one row has a stamp before 9 am - and here I'm stuck. Any idea how to select all the days where at least one stamp was before 9 am?
The code I'm trying:
SELECT * into temp1 FROM tickets
WHERE date_part('hour'::text, tickets.date_in) < 9::double precision
ORDER BY date_part('day'::text, date_in);
Select * into temp2
from tickets, temp1
where date_part('day'::text, tickets.date_in) = date_part('day'::text, temp1.date_in);
Update temp2 set distorted_route = 1;
But this is giving me nothing.
Expected output is to get all the days where at least one route was done before 9am:
date_in
2022-05-22 08:10:12
2022-05-22 10:11:45
2022-05-22 12:14:59
2022-04-23 07:11:13
2022-04-23 11:42:25
2022-06-15 08:45:26
2022-06-15 15:10:57
Should I make an additional table (temp1) to feed it with the first query result (just the rows before 9am) and then make a cross table query to find in the source table public.tickets all the days which are equal to the public.temp1?
Select * from tickets, temp1
where TO_Char(tickets.date_in, 'YYYY-MM-DD')
= TO_Char(temp1.date_in, 'YYYY-MM-DD');
or like this:
SELECT *
FROM tickets
WHERE EXISTS (
SELECT date_in FROM TO_Char(tickets.date_in, 'YYYY-MM-DD') = TO_Char(temp1.date_in, 'YYYY-MM-DD')
);
Ideally, I'd want to avoid using a temporary table and make a request just for one table.
After that, I need to create a view or update and add some remarks to the source table.

Assuming you mean:
How to select all rows where at least one row exists with a timestamp before 9 am of the same day?
SELECT *
FROM tickets t
WHERE EXISTS (
SELECT FROM tickets t1
WHERE t1.date_in::date = t.date_in::date -- same day
AND t1.date_in::time < time '9:00' -- time before 9:00
AND t1.id <> t.id -- exclude self
)
ORDER BY date_in; -- optional, but typically helpful
id being the PK column of your undisclosed table.
But be aware that ...
... typically you'll want to work with timestamptz instead of timestamp. See:
Ignoring time zones altogether in Rails and PostgreSQL
https://wiki.postgresql.org/wiki/Don%27t_Do_This#Don.27t_use_timestamp_.28without_time_zone.29
... this query is slow for big tables, because it cannot use a plain index on (date_in) (not "sargable"). Related:
How do you do date math that ignores the year?
There are various ways to optimize performance. The best way depends on undisclosed information for performance questions.
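To see the EXISTS pattern in action, here is a small sketch on Python's built-in sqlite3 module instead of PostgreSQL (an assumption for runnability: date() and time() stand in for the ::date and ::time casts). The id values and the two rows for 2022-07-01 are made up for illustration. This variant leaves out the self-exclusion predicate, so an early row also qualifies its own day, which is what the expected output in the question shows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (id INTEGER PRIMARY KEY, date_in TEXT)")
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?)",
    [
        (1, "2022-05-22 08:10:12"),
        (2, "2022-05-22 10:11:45"),
        (3, "2022-05-22 12:14:59"),
        (4, "2022-04-23 07:11:13"),
        (5, "2022-04-23 11:42:25"),
        (6, "2022-06-15 08:45:26"),
        (7, "2022-06-15 15:10:57"),
        (8, "2022-07-01 09:30:00"),  # hypothetical day with no early stamp
        (9, "2022-07-01 14:00:00"),
    ],
)

# Keep every row whose calendar day has at least one stamp before 09:00.
rows = conn.execute(
    """
    SELECT t.id, t.date_in
    FROM tickets t
    WHERE EXISTS (
        SELECT 1
        FROM tickets t1
        WHERE date(t1.date_in) = date(t.date_in)   -- same day
          AND time(t1.date_in) < '09:00:00'        -- before 9 am
    )
    ORDER BY t.id
    """
).fetchall()

print([r[0] for r in rows])  # [1, 2, 3, 4, 5, 6, 7]
```

No temporary table is needed; the single statement could be wrapped in a view if the result has to be reused.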

BQ: Select latest date from multiple columns

Good day, all. I wrote a question relating to this earlier, but now I have encountered another problem.
I have to calculate the timestamp difference between the install_time and contributor_time columns. HOWEVER, I have three contributor_time columns, and I need to select the latest time from those columns first, then subtract it from the install time.
Sample Data
users install_time contributor_time_1 contributor_time_2 contributor_time_3
1 8:00 7:45 7:50 7:55
2 10:00 9:15 9:45 9:30
3 11:00 10:30 null null
For example, in the table above I would need to select contributor_time_3 and subtract it from install_time for user 1. For user 2, I would do the same, but with contributor_time_2.
Sample Results
users install_time time_diff_min
1 8:00 5
2 10:00 15
3 11:00 30
The problem I am facing is that 1) the contributor_time columns are in string format and 2) some of them have 'null' string values (which means that I cannot cast it into a timestamp.)
I created a query, but I am facing an error stating that I cannot subtract a string from a timestamp. So I added safe_cast; however, the time_diff_min results only show up when all three contributor_time columns hold a timestamp. For example, in the sample table above, only the first two rows are returned.
The query I have so far is below:
SELECT
users,
install_time,
TIMESTAMP_DIFF(install_time, greatest(contributor_time_1, contributor_time_2, contributor_time_3), MINUTE) as ctct_min
FROM
(SELECT
users,
install_time,
safe_cast(contributor_time_1 as timestamp) as contributor_time_1,
safe_cast(contributor_time_2 as timestamp) as contributor_time_2,
safe_cast(contributor_time_3 as timestamp) as contributor_time_3,
FROM
(SELECT
users,
install_time,
case when contributor_time_1 = 'null' then '0' else contributor_time_1 end as contributor_time_1,
....
FROM datasource
Any help to point me in the right direction is appreciated! Thank you in advance!

Consider below
select users, install_time,
time_diff(
parse_time('%H:%M',install_time),
greatest(
parse_time('%H:%M',contributor_time_1),
parse_time('%H:%M',contributor_time_2),
parse_time('%H:%M',contributor_time_3)
),
minute) as time_diff_min
from `project.dataset.table`
if applied to sample data in your question - output is
Above can be refactored slightly into below
create temp function latest_time(arr any type) as ((
select parse_time('%H:%M',val) time
from unnest(arr) val
order by time desc
limit 1
));
select users, install_time,
time_diff(
parse_time('%H:%M',install_time),
latest_time([contributor_time_1, contributor_time_2, contributor_time_3]),
minute) as time_diff_min
from `project.dataset.table`
less verbose and no redundant parsing - with same result - so just matter of preferences
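The rows with missing values are the tricky part: in BigQuery, GREATEST returns NULL as soon as any argument is NULL, which matches the symptom in the question where only rows with all three times produce a result. The logic of "take the latest non-missing time, then subtract" can be sketched outside the database as follows. This is plain Python, not BigQuery; the helper names parse_hhmm and time_diff_min are hypothetical, and the rows are the sample data from the question:

```python
from datetime import datetime


def parse_hhmm(value):
    """Parse 'H:MM' text; return None for missing or 'null' values."""
    if value is None or value.strip().lower() == "null":
        return None
    return datetime.strptime(value, "%H:%M")


def time_diff_min(install_time, contributor_times):
    """Minutes between install_time and the latest non-missing contributor time."""
    parsed = [t for t in map(parse_hhmm, contributor_times) if t is not None]
    if not parsed:
        return None  # nothing to compare against
    delta = parse_hhmm(install_time) - max(parsed)
    return int(delta.total_seconds() // 60)


rows = [
    (1, "8:00", "7:45", "7:50", "7:55"),
    (2, "10:00", "9:15", "9:45", "9:30"),
    (3, "11:00", "10:30", "null", "null"),
]

result = [
    (user, install, time_diff_min(install, contribs))
    for user, install, *contribs in rows
]

print(result)  # [(1, '8:00', 5), (2, '10:00', 15), (3, '11:00', 30)]
```

Dropping the missing values before taking the maximum is the step that keeps user 3 in the result.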

You can use greatest():
select t.*,
time_diff(install_time, greatest(contributor_time_1, contributor_time_2, contributor_time_3), minute) as diff_min
from t;
Note: this assumes that the values are never NULL, which seems reasonable based on your sample data.

Get the count of distinct userids for last couple of days

Let's say the last 7 days for this table:
Userid Download time
Rab01 2020-04-29 03:28
Klm01 2020-04-29 04:01
Klm01 2020-04-30 05:10
Rab01 2020-04-29 12:14
Osa_3 2020-04-25 09:01
Following is the required output:
Count Download_time
1 2020-04-25
2 2020-04-29
1 2020-04-30

Tested with PostgreSQL. You also tagged Redshift, which forked at Postgres 8.2, a long time ago. There may be discrepancies ..
Since you seem to be happy with standard ISO format, a simple cast to date would be most efficient:
SELECT count(DISTINCT userid) AS "Count"
, download_time::date AS "Download_Day"
FROM tbl
WHERE download_time >= CURRENT_DATE - 7
AND download_time < CURRENT_DATE
GROUP BY 2;
CURRENT_DATE is standard SQL and works for both Postgres and Redshift. Related:
How do I determine the last day of the previous month using PostgreSQL?
About the "last 7 days": I took the last 7 whole days (excluding today - necessarily incomplete), with syntax that can use a plain index on (download_time). Related:
Get dates of a day of week in a date range
Slow LEFT JOIN on CTE with time intervals
Interval (days) in PostgreSQL with two parameters
Ideally, you have a composite index on (download_time, userid) (and fulfill some preconditions) to get very fast index-only scans. See:
Is a composite index also good for queries on the first field?
count(DISTINCT ...) is typically slow. For big tables with many duplicates, there are faster techniques. Disclose your exact setup and cardinalities if you need to optimize performance.
If the actual data type is timestamptz, not just timestamp, you also need to define the time zone defining day boundaries. See:
Ignoring time zones altogether in Rails and PostgreSQL
About the optional short syntax GROUP BY 2:
Select first row in each GROUP BY group?
About capitalization of identifiers:
Are PostgreSQL column names case-sensitive?
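The half-open date range and the per-day distinct count can be checked with a small sketch. It uses Python's built-in sqlite3 module rather than PostgreSQL or Redshift (an assumption for runnability), and it pins "today" to 2020-05-01 as a bound parameter instead of CURRENT_DATE so the result is reproducible with the sample rows from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (userid TEXT, download_time TEXT)")
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?)",
    [
        ("Rab01", "2020-04-29 03:28"),
        ("Klm01", "2020-04-29 04:01"),
        ("Klm01", "2020-04-30 05:10"),
        ("Rab01", "2020-04-29 12:14"),
        ("Osa_3", "2020-04-25 09:01"),
    ],
)

today = "2020-05-01"  # fixed reference date (assumption, replaces CURRENT_DATE)

# Last 7 whole days: from today - 7 (inclusive) up to today (exclusive).
rows = conn.execute(
    """
    SELECT count(DISTINCT userid) AS cnt, date(download_time) AS download_day
    FROM tbl
    WHERE download_time >= date(:today, '-7 days')
      AND download_time < :today
    GROUP BY download_day
    ORDER BY download_day
    """,
    {"today": today},
).fetchall()

print(rows)  # [(1, '2020-04-25'), (2, '2020-04-29'), (1, '2020-04-30')]
```

The filter compares the raw column against constants, which is the shape that lets a plain index on the time column be used.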

You can use the date_trunc function to get only the day part of a timestamp and group by it.
The query could look like this:
SELECT
count(distinct Userid) as Count, -- get unique users count
to_char(date_trunc('day', Download_time), 'YYYY-MM-DD') AS Download_Day -- convert timestamp to day
FROM tbl
WHERE DATE_PART('day', NOW() - Download_time) < 7 -- last 7 days
GROUP BY Download_Day; -- group by day

Exasol Extract from Timestamp Hours AND Minutes

I have an Exasol database with login values of datatype TIMESTAMP like:
2015-10-01 13:00:34.0
2015-11-02 13:10:10.0
2015-10-06 13:20:03.0
2016-02-01 14:15:34.0
2016-04-03 14:08:10.0
2016-07-01 11:05:07.0
2016-09-03 10:08:12.0
2016-11-15 09:03:30.0
and many more. I want to write a SQL (SQLite) query that returns
logins from 09:00:00 to 09:15:00, logins from 09:15:00 to 09:30:00, and so on, in separate tables (no matter what date it is). I already had success selecting a 1-hour interval with:
...EXTRACT(HOUR FROM entryTime ) BETWEEN 8 and 8
That way I get the entries of my database (no matter what date it is) within 1 hour, but I need smaller intervals, such as 09:00:00 - 09:15:00. Any ideas how to solve this in Exasol (SQLite)?

You can simply convert the time part of your timestamp to a string and do a between, something like:
WHERE to_char(entryTime, 'hhmi') BETWEEN '0900' AND '0915'
If you want to use extract and numeric values, I suggest this:
WHERE (EXTRACT(HOUR FROM entryTime) * 100) + EXTRACT(MINUTE FROM entryTime)
BETWEEN 900 and 915
I'm not in front of my computer now, but this (or something pretty similar) should work.
But I suspect that in both cases EXASOL will create an expression index for the first part of the WHERE clause. Since you are presumably using EXASOL because you have a huge amount of data and want fast performance, my suggestion is to add a column to your table that stores the time part of entryTime as a numeric value. That will create a proper index and give you better performance.
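The numeric hour-and-minute comparison can be tried out with the sketch below. It runs on Python's built-in sqlite3 module, not EXASOL (an assumption for runnability), so strftime('%H') and strftime('%M') stand in for EXTRACT(HOUR ...) and EXTRACT(MINUTE ...). The table name logins is illustrative, the timestamps are the sample values from the question with the fractional seconds dropped, and the 13:00 quarter-hour is used because the sample has two rows in it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (entryTime TEXT)")
conn.executemany(
    "INSERT INTO logins VALUES (?)",
    [
        ("2015-10-01 13:00:34",),
        ("2015-11-02 13:10:10",),
        ("2015-10-06 13:20:03",),
        ("2016-02-01 14:15:34",),
        ("2016-04-03 14:08:10",),
        ("2016-07-01 11:05:07",),
        ("2016-09-03 10:08:12",),
        ("2016-11-15 09:03:30",),
    ],
)

# hour * 100 + minute gives a number such as 1310 for 13:10, whatever the date.
# A half-open range (>= 1300, < 1315) keeps adjacent buckets from overlapping.
rows = conn.execute(
    """
    SELECT entryTime
    FROM logins
    WHERE CAST(strftime('%H', entryTime) AS INTEGER) * 100
        + CAST(strftime('%M', entryTime) AS INTEGER) >= 1300
      AND CAST(strftime('%H', entryTime) AS INTEGER) * 100
        + CAST(strftime('%M', entryTime) AS INTEGER) < 1315
    ORDER BY entryTime
    """
).fetchall()

print(rows)  # [('2015-10-01 13:00:34',), ('2015-11-02 13:10:10',)]
```

Note that BETWEEN 900 AND 915 is inclusive on both ends, so it would also match 09:15:00 through 09:15:59; the half-open form avoids counting those rows in two consecutive intervals.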

I found a workaround. The solution is to nest the SQL statements. In the first step, you select the hours, and around that SELECT statement you wrap another one in which you specify the minutes.
SELECT * FROM
(SELECT * FROM MY_SCHEMA.EXA_LOCAL WHERE EXTRACT(HOUR FROM TIMESTMP) BETWEEN 9 and 9)
where EXTRACT(MINUTE FROM TIMESTMP) BETWEEN 0 and 15;

MySQL - Calculate the net time difference between two date-times while excluding breaks?

In a MySQL query I am using the timediff/time_to_sec functions to calculate the total minutes between two date-times.
For example:
2010-03-23 10:00:00
-
2010-03-23 08:00:00
= 120 minutes
What I would like to do is exclude any breaks that occur during the selected time range.
For example:
2010-03-23 10:00:00
-
2010-03-23 08:00:00
-
(break 08:55:00 to 09:10:00)
= 105 minutes
Is there a good method to do this without resorting to a long list of nested IF statements?
UPDATE1:
To clarify - I am trying to calculate how long a user takes to accomplish a given task. If they take a coffee break, that time period needs to be excluded. The coffee breaks are at fixed times.

Sum all the breaks that occur during the time range, then subtract that total from the result of the timediff/time_to_sec function:
SELECT TIME_TO_SEC(TIMEDIFF('17:00:00', '09:00:00')); -- 28800
SELECT TIME_TO_SEC(TIMEDIFF('12:30:00', '12:00:00')); -- 1800
SELECT TIME_TO_SEC(TIMEDIFF('10:30:00', '10:15:00')); -- 900
-- 28800 - 1800 - 900 = 26100
Assuming this structure :
CREATE TABLE work_unit (
id INT NOT NULL,
initial_time TIME,
final_time TIME
);
CREATE TABLE break (
id INT NOT NULL,
initial_time TIME,
final_time TIME
);
INSERT INTO work_unit VALUES (1, '09:00:00', '17:00:00');
INSERT INTO break VALUES (1, '10:00:00', '10:15:00');
INSERT INTO break VALUES (2, '12:00:00', '12:30:00');
You can calculate it with next query:
SELECT *, TIME_TO_SEC(TIMEDIFF(final_time, initial_time)) total_time
, (SELECT SUM(
TIME_TO_SEC(TIMEDIFF(b.final_time, b.initial_time)))
FROM break b
WHERE (b.initial_time BETWEEN work_unit.initial_time AND work_unit.final_time) OR (b.final_time BETWEEN work_unit.initial_time AND work_unit.final_time)
) breaks
FROM work_unit
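The subquery above counts a break in full whenever either of its endpoints falls inside the work unit. As a hedged variation, the sketch below clamps each break to the work interval so that only the overlapping part is subtracted. It runs on Python's built-in sqlite3 module rather than MySQL (an assumption for runnability), so strftime('%s', ...) stands in for TIME_TO_SEC. The table name breaks and the second work unit are illustrative additions; the first work unit and the break are the 08:00 to 10:00 example from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE work_unit (id INTEGER, initial_time TEXT, final_time TEXT);
    CREATE TABLE breaks (id INTEGER, initial_time TEXT, final_time TEXT);
    INSERT INTO work_unit VALUES (1, '08:00:00', '10:00:00');
    INSERT INTO work_unit VALUES (2, '09:00:00', '09:30:00');  -- starts mid-break
    INSERT INTO breaks VALUES (1, '08:55:00', '09:10:00');
    """
)

# For each work unit: total seconds minus the seconds of every overlapping
# break, where each break is clamped to the work unit's start and end.
rows = conn.execute(
    """
    SELECT w.id,
           (strftime('%s', w.final_time) - strftime('%s', w.initial_time)
            - COALESCE((
                SELECT SUM(
                    strftime('%s', MIN(b.final_time, w.final_time))
                  - strftime('%s', MAX(b.initial_time, w.initial_time)))
                FROM breaks b
                WHERE b.initial_time < w.final_time   -- break starts before work ends
                  AND b.final_time > w.initial_time   -- break ends after work starts
              ), 0)
           ) / 60 AS net_minutes
    FROM work_unit w
    ORDER BY w.id
    """
).fetchall()

print(rows)  # [(1, 105), (2, 20)]
```

The two-condition overlap test also catches a break that completely surrounds a short work unit, a case in which neither break endpoint lies between the work unit's start and end.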
