Top ten in ten minutes - Hive query

I have to calculate the ten most visited sites for each 10-minute interval since 00:00:00.
Given below is my data:
time site
00:00:00 google.com
00:01:06 yahoo.com
00:03:06 youtube.com
00:05:09 google.com
00:11:07 amazon.com
00:14:05 yahoo.com
00:21:00 google.com
00:30:56 amazon.com
I am not sure how to combine multiple queries into a single Hive query.
Please help.

Query:
select interv, col2, count(col2)
from (
  select floor((unix_timestamp(col1) - unix_timestamp(to_date(col1), 'yyyy-MM-dd')) / 600) * 600 as interv, col1, col2
  from default.test_time_group
) t
group by interv, col2;
Sample schema:
{time TIMESTAMP, site STRING}
Example:
2015-10-07 00:00:00 google.com
2015-10-07 00:01:06 yahoo.com
Result:
interval col2 count
0 google.com 2
0 yahoo.com 1
0 youtube.com 1
600 amazon.com 1
600 yahoo.com 1
1200 google.com 1
1800 amazon.com 1
Explanation: I subtracted the unix timestamp of the day's start (the date with no time part) from the unix timestamp of col1 to get the number of seconds into the day. To put all rows from the same 10-minute window into one bucket, I divided by 600, took the floor(), and multiplied by 600. Then I grouped by interval and col2 to get the counts.
Note: You can set sorting as per your needs.
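The bucketing arithmetic can be sanity-checked outside Hive. Here is a minimal Python sketch (the `interval_bucket` helper name is mine, not part of the query) that mirrors floor(seconds-into-day / 600) * 600:

```python
# Mirror of the Hive expression: floor(seconds_into_day / 600) * 600
def interval_bucket(hhmmss: str, width: int = 600) -> int:
    """Return the start (seconds since midnight) of the bucket containing hhmmss."""
    h, m, s = map(int, hhmmss.split(":"))
    seconds_into_day = h * 3600 + m * 60 + s
    return (seconds_into_day // width) * width

# A few sample times from the question's data
for t in ["00:00:00", "00:05:09", "00:11:07", "00:21:00", "00:30:56"]:
    print(t, interval_bucket(t))
```

Rows with times 00:00:00 and 00:05:09 land in bucket 0, 00:11:07 in bucket 600, and so on, matching the result table above.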

Related

Splunk: Split a time period into hourly intervals

index="dummy" url="https://www.dummy.com" status="200 OK"
| stats count by id
| where count > 10
If I run the above query over 1 day, I would get, for example,
id count
ABC 50
XYZ 60
..
This would mean ABC hit https://www.dummy.com 50 times in 1 day, and XYZ called that 60 times.
Now I want to run the same check over 1 day, but split into two-hour intervals.
Suppose ABC called that request 25 times at 12:00 AM and 25 times at 3:00 AM,
and XYZ made all 60 requests between 12 AM and 2 AM.
I want the output to look like this (time format doesn't matter):
id count time
XYZ 60 12:00 AM
ABC 25 12:00 AM
ABC 25 2:00 AM
..
You can use bin to group events into time buckets. You can use any span value, but for the 2 hours you mentioned, the updated query would be:
index="dummy" url="https://www.dummy.com" status="200 OK"
| bin _time span=2h
| stats count by id, _time
| where count > 10
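For intuition, the effect of `bin _time span=2h` followed by `stats count by id, _time` can be sketched in Python (the event data below is fabricated to match the example; `bin_span_2h` is my own helper name):

```python
from collections import Counter
from datetime import datetime

def bin_span_2h(ts: datetime) -> datetime:
    # Floor the timestamp to the start of its 2-hour bucket, like `bin _time span=2h`
    return ts.replace(hour=ts.hour - ts.hour % 2, minute=0, second=0, microsecond=0)

# Fabricated events matching the example: XYZ's 60 hits fall between 12 and 2 AM,
# ABC's 50 hits split between 12:00 AM and 3:00 AM
events = (
    [("XYZ", datetime(2023, 1, 1, 0, 30))] * 60
    + [("ABC", datetime(2023, 1, 1, 0, 0))] * 25
    + [("ABC", datetime(2023, 1, 1, 3, 0))] * 25
)

counts = Counter((uid, bin_span_2h(ts)) for uid, ts in events)
for (uid, bucket), n in sorted(counts.items()):
    print(uid, n, bucket.strftime("%I:%M %p"))
```

The 3:00 AM hits land in the 2:00 AM bucket, just as in the desired output above.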

Get timestamp of minimum value

Could someone help me with this problem?
I have a table with a Value associated with a point_id and a timestamp.
tData
timestamp | point_id | Value
2022-10-28 08:00:00 | carrots | 8
2022-10-28 08:00:00 | cabbage | 6
2022-10-28 08:00:00 | screw | 255
2022-10-28 08:00:04 | carrots | 9
2022-10-28 08:05:11 | cabbage | 15
2022-10-28 08:55:16 | screw | 270
I have another table containing the point_ids whose values I want to extract and transfer to another table
tDefAvgMinMax
point_id
carrots
cabbage
Lastly I have a table that looks like this
tReportAvgMinMax
Date | point_id | average | minimum | maximum
<value> | <value> | <value> | <value> | <value>
I run this query in order to extract the average, minimum and maximum and transfer it to the last table:
INSERT INTO tReportAvgMinMax (Date, point_id, average, minimum, maximum)
SELECT #Date, Def.point_id, AVG(T._VAL), MIN(T._VAL), MAX(T._VAL)
FROM tData T
RIGHT JOIN tDefAvgMinMax Def
  ON T.point_id = Def.point_id
WHERE T.Timestamp BETWEEN (CAST(#Date as datetime) + '00:00:00') AND (CAST(#Date as datetime) + '23:59:59')
GROUP BY Def.point_id
This is working fine so far.
Now I need to add the timestamp of the minimum value and maximum value to the table tReportAvgMinMax. I tried different approaches but I can't seem to get it to work. How can I achieve this?
Point_ID | Average | Timestamp_min | Minimum | Timestamp_max | Maximum
carrots | 8.5 | 2022-10-28 08:00:00 | 8 | 2022-10-28 08:00:04 | 9
cabbage | 10.5 | 2022-10-28 08:00:00 | 6 | 2022-10-28 08:05:11 | 15
Thank you
I tried to use a subquery like
SELECT Def.point_id, (SELECT ...) TimestampMin, MIN(T._VAL)
but the query takes too long and does not give me the correct result.
I also tried a LEFT JOIN with an ORDER BY to get the lowest value and return its timestamp, but that returns only the first point_id and the query takes a long time to run.
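One common approach for this kind of problem is a correlated subquery per timestamp column. As a hedged sketch — using SQLite via Python rather than the asker's actual engine, with only the table and column names taken from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tData (timestamp TEXT, point_id TEXT, Value INTEGER);
    CREATE TABLE tDefAvgMinMax (point_id TEXT);
    INSERT INTO tData VALUES
        ('2022-10-28 08:00:00', 'carrots', 8),
        ('2022-10-28 08:00:00', 'cabbage', 6),
        ('2022-10-28 08:00:00', 'screw', 255),
        ('2022-10-28 08:00:04', 'carrots', 9),
        ('2022-10-28 08:05:11', 'cabbage', 15),
        ('2022-10-28 08:55:16', 'screw', 270);
    INSERT INTO tDefAvgMinMax VALUES ('carrots'), ('cabbage');
""")
rows = conn.execute("""
    SELECT d.point_id,
           AVG(t.Value) AS average,
           -- correlated subqueries fetch the timestamp of the extreme value
           (SELECT t2.timestamp FROM tData t2
            WHERE t2.point_id = d.point_id
            ORDER BY t2.Value ASC LIMIT 1) AS timestamp_min,
           MIN(t.Value) AS minimum,
           (SELECT t2.timestamp FROM tData t2
            WHERE t2.point_id = d.point_id
            ORDER BY t2.Value DESC LIMIT 1) AS timestamp_max,
           MAX(t.Value) AS maximum
    FROM tDefAvgMinMax d
    JOIN tData t ON t.point_id = d.point_id
    GROUP BY d.point_id
    ORDER BY d.point_id
""").fetchall()
for r in rows:
    print(r)
```

This reproduces the desired output table above (8.5 / 8 at 08:00:00 / 9 at 08:00:04 for carrots, and so on). On engines that support window functions, FIRST_VALUE over a partition ordered by Value is an alternative that avoids the per-row subqueries.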

Extract 30 minutes from timestamp and group it by 30 mins time interval -PGSQL

In PostgreSQL I am extracting hour from the timestamp using below query.
select count(*) as logged_users, EXTRACT(hour from login_time::timestamp) as Hour
from loginhistory
where login_time::date = '2021-04-21'
group by Hour order by Hour;
And the output is as follows
logged_users | hour
--------------+------
27 | 7
82 | 8
229 | 9
1620 | 10
1264 | 11
1990 | 12
1027 | 13
1273 | 14
1794 | 15
1733 | 16
878 | 17
126 | 18
21 | 19
5 | 20
3 | 21
1 | 22
I want the same kind of output from the same SQL, but for 30-minute intervals. Please suggest.
SELECT to_timestamp((extract(epoch FROM login_time::timestamp)::bigint / 1800) * 1800)::timestamp AS interval_30_min
, count(*) AS logged_users
FROM loginhistory
WHERE login_time::date = '2021-04-21' -- inefficient!
GROUP BY 1
ORDER BY 1;
Extracting the epoch gets the number of seconds since the epoch. Integer division truncates. Multiplying back effectively rounds down, achieving the same as date_trunc() for arbitrary time intervals.
1800 because 30 minutes contain 1800 seconds.
Detailed explanation:
Truncate timestamp to arbitrary intervals
The cast to timestamp makes me wonder about the actual data type of login_time? If it's timestamptz, the cast depends on your current time zone setting and sets you up for surprises if that setting changes. See:
How do I match an entire day to a datetime field?
Subtract hours from the now() function
Ignoring time zones altogether in Rails and PostgreSQL
Depending on the actual data type, and exact definition of your date boundaries, there is a more efficient way to phrase your WHERE clause.
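The epoch-division trick is engine-agnostic, so the same arithmetic can be checked quickly in Python (the `trunc_to_interval` helper name is mine):

```python
from datetime import datetime, timezone

def trunc_to_interval(ts: datetime, seconds: int = 1800) -> datetime:
    # Integer-divide the epoch by the interval width and multiply back:
    # the same rounding-down as the SQL expression above
    epoch = int(ts.timestamp())
    return datetime.fromtimestamp(epoch // seconds * seconds, tz=timezone.utc)

ts = datetime(2021, 4, 21, 10, 47, 3, tzinfo=timezone.utc)
print(trunc_to_interval(ts))  # 10:47:03 falls in the 10:30:00 half-hour bucket
```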
You can change the column on which you're aggregating to use the minute too:
select
count(*) as logged_users,
CONCAT(EXTRACT(hour from login_time::timestamp), '-', CASE WHEN EXTRACT(minute from login_time::timestamp) < 30 THEN 0 ELSE 30 END) as HalfHour
from loginhistory
where login_time::date = '2021-04-21'
group by HalfHour
order by HalfHour;

How to build in product expiration in SQL?

I have a table that looks like the following and from it I want to get days remaining of total doses:
USER|PURCHASE_DATE|DOSES
1111|2017-07-27|15
2222|2020-07-17|3
3333|2021-02-01|5
If the doses do not have an expiration and each can be used for 90 days then the SQL I use is:
SUM(DOSES)*90-DATEDIFF(DAY,MIN(DATE),GETDATE())
USER|DAYS_REMAINING
1111|0
2222|6
3333|385
But what if I want to impose an expiration of each dose at a year? What can I do to modify my SQL to get the following desired answer:
USER|DAYS_REMAINING
1111|-985
2222|6
3333|300
It probably involves taking the MIN between when doses expire and how long they would last but I don't know how to aggregate in the expiry logic.
MIN is an aggregate function; you want LEAST to pick between the two values:
WITH data(user, purchase_date, doses) AS (
    SELECT * FROM VALUES
        (1111, '2017-07-27', 15),
        (2222, '2020-07-17', 3),
        (3333, '2021-02-01', 5)
)
SELECT
    d.*,
    d.doses * 90 AS doses_duration,
    365::number AS year_duration,
    LEAST(doses_duration, year_duration) AS max_duration,
    DATEADD('day', max_duration, d.purchase_date)::date AS last_dose_day,
    DATEDIFF('day', current_date, last_dose_day) AS day_remaining
FROM data AS d
ORDER BY 1;
gives:
USER PURCHASE_DATE DOSES DOSES_DURATION YEAR_DURATION MAX_DURATION LAST_DOSE_DAY DAY_REMAINING
1111 2017-07-27 15 1350 365 365 2018-07-27 -986
2222 2020-07-17 3 270 365 270 2021-04-13 5
3333 2021-02-01 5 450 365 365 2022-02-01 299
which can all be rolled together with a tiny fix on the date_diff, as:
WITH data(user, purchase_date, doses) AS (
    SELECT * FROM VALUES
        (1111, '2017-07-27', 15),
        (2222, '2020-07-17', 3),
        (3333, '2021-02-01', 5)
)
SELECT
    d.user,
    DATEDIFF('day', current_date,
        DATEADD('day', LEAST(d.doses * 90, 365::number), d.purchase_date)::date) + 1 AS day_remaining
FROM data AS d
ORDER BY 1;
giving:
USER DAY_REMAINING
1111 -985
2222 6
3333 300
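The same LEAST(doses * 90, 365) logic in plain Python, pinned to a fixed "as of" date so the figures are reproducible (the as_of value 2021-04-08 is my inference from the sample output, not stated in the answer):

```python
from datetime import date, timedelta

def days_remaining(purchase_date: date, doses: int, as_of: date) -> int:
    # Each dose lasts 90 days, but everything expires one year after purchase
    duration = min(doses * 90, 365)          # LEAST(doses * 90, 365)
    last_dose_day = purchase_date + timedelta(days=duration)
    return (last_dose_day - as_of).days + 1  # +1 matches the answer's DATEDIFF fix

as_of = date(2021, 4, 8)  # inferred so the results match the sample output
data = [(1111, date(2017, 7, 27), 15),
        (2222, date(2020, 7, 17), 3),
        (3333, date(2021, 2, 1), 5)]
for user, purchased, doses in data:
    print(user, days_remaining(purchased, doses, as_of))
```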

Group records by time

I have a table containing a datetime column and some misc other columns. The datetime column represents an event happening. It can either contains a time (event happened at that time) or NULL (event didn't happen)
I now want to count the number of records happening in specific intervals (15 minutes), but do not know how to do that.
example:
id | time | foreign_key
1 | 2012-01-01 00:00:01 | 2
2 | 2012-01-01 00:02:01 | 4
3 | 2012-01-01 00:16:00 | 1
4 | 2012-01-01 00:17:00 | 9
5 | 2012-01-01 00:31:00 | 6
I now want to create a query that creates a result set similar to:
interval | COUNT(id)
2012-01-01 00:00:00 | 2
2012-01-01 00:15:00 | 2
2012-01-01 00:30:00 | 1
Is this possible in SQL or can anyone advise what other tools I could use? (e.g. exporting the data to a spreadsheet program would not be a problem)
Give this a try:
select datetime((strftime('%s', time) / 900) * 900, 'unixepoch') interval,
count(*) cnt
from t
group by interval
order by interval
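The query above can be run end-to-end with SQLite's Python bindings, using the sample rows from the question:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, time TEXT, foreign_key INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)", [
    (1, '2012-01-01 00:00:01', 2),
    (2, '2012-01-01 00:02:01', 4),
    (3, '2012-01-01 00:16:00', 1),
    (4, '2012-01-01 00:17:00', 9),
    (5, '2012-01-01 00:31:00', 6),
])
rows = conn.execute("""
    SELECT datetime((strftime('%s', time) / 900) * 900, 'unixepoch') AS interval,
           COUNT(*) AS cnt
    FROM t
    GROUP BY interval
    ORDER BY interval
""").fetchall()
for r in rows:
    print(r)
```

This yields the three 15-minute buckets from the desired result set (counts 2, 2, and 1).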
I have limited SQLite background (and no practice instance), but I'd try grabbing the minutes using
strftime( FORMAT, TIMESTRING, MOD, MOD, ...)
with the %M modifier (http://souptonuts.sourceforge.net/readme_sqlite_tutorial.html)
Then divide that by 15 and get the FLOOR of your quotient to figure out which quarter-hour you're in (e.g., 0, 1, 2, or 3)
cast(x as int)
Getting the floor value of a number in SQLite?
Strung together it might look something like:
select cast(strftime('%M', your_time_field) as int) / 15 as quarter_hour from your_table
(strftime's first argument is the format string, and it returns text, so cast the minutes to int before dividing by 15)
Then group by the quarter-hour.
Sorry I don't have exact syntax for you, but that approach should enable you to get the functional groupings, after which you can massage the output to make it look how you want.