Grouping timestamps by day, not by time - sql

I've got a table storing user access/view logs for the webservice I run. It stores the time as a timestamp, but I'm finding that for aggregate reporting I only care about the day, not the time.
I'm currently running the following query:
SELECT
user_logs.timestamp
FROM
user_logs
WHERE
user_logs.timestamp >= %(timestamp_1)s
AND user_logs.timestamp <= %(timestamp_2)s
ORDER BY
user_logs.timestamp
There are often other where conditions but they shouldn't matter to the question. I'm using Postgres but I'd assume whatever feature is used will work in other languages.
I pull the results into a Python script which counts the number of views per date but it'd make much more sense to me if the database could group and count for me.
How do I strip it down so it'll group by the day and ignore the time?

SELECT date_trunc('day', user_logs.timestamp) "day", count(*) views
FROM user_logs
WHERE user_logs.timestamp >= %(timestamp_1)s
AND user_logs.timestamp <= %(timestamp_2)s
GROUP BY 1
ORDER BY 1
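Since the results end up in a Python script anyway, here is a minimal sketch of the same idea that can be run without a Postgres server. It uses an in-memory SQLite database as a stand-in, so `date()` takes the place of `date_trunc('day', ...)`. The sample rows and the exclusive upper bound are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_logs (timestamp TEXT)")
conn.executemany(
    "INSERT INTO user_logs (timestamp) VALUES (?)",
    [
        ("2021-03-01 08:15:00",),
        ("2021-03-01 17:40:00",),
        ("2021-03-02 09:05:00",),
        ("2021-03-03 00:00:00",),
    ],
)

# date() is SQLite's stand-in here for Postgres's date_trunc('day', ...).
# The upper bound is exclusive, so the row at exactly midnight on the
# end date is left out of the report.
views_per_day = conn.execute(
    """
    SELECT date(timestamp) AS day, count(*) AS views
    FROM user_logs
    WHERE timestamp >= :start AND timestamp < :end
    GROUP BY day
    ORDER BY day
    """,
    {"start": "2021-03-01 00:00:00", "end": "2021-03-03 00:00:00"},
).fetchall()

print(views_per_day)
```

The database returns one row per day, so the script no longer has to count anything itself.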

Related

partition big query LIMIT over date range

I'm quite new to SQL and BigQuery, so this might be simple. I'm running some queries on the public GDELT dataset in BigQuery and have a question about LIMIT. GDELT is massive (14.4 TB), and when I query for something, in this case a person, I can get 100k rows of results or more, which in this case is too much. But when I use LIMIT, the results do not seem to be spread evenly over the dates, which gives me very random timelines. How does LIMIT work, and is there a way to get results spread more evenly across days?
SELECT DATE,V2Tone,DocumentIdentifier as URL, Themes, Persons, Locations
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE DATE>=20210610000000 and _PARTITIONTIME >= TIMESTAMP(#start_date)
AND DATE<=20210818999999 and _PARTITIONTIME <= TIMESTAMP(#end_date)
AND LOWER(DocumentIdentifier) like #url_topic
LIMIT #limit
When running this query and doing some preproc, I get the following time series:
It's based on 15k results, but they are distributed very unevenly/randomly across the days (since there are over 500k results in total if I don't use limit). I would like to make a query that limits my output to 15k but partitions the data somewhat equally over the days.
You need an ORDER BY: when you are not sorting your result, the order of the returned rows is not guaranteed, so LIMIT keeps an arbitrary subset.
But if you are looking to get the same number of rows per day, you can use window functions:
select * from (
SELECT
DATE,
V2Tone,
DocumentIdentifier as URL,
Themes,
Persons,
Locations,
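-- assumes DATE is an integer of the form YYYYMMDDHHMMSS, so DIV drops the time digits and rows are numbered per day
row_number() over (partition by DIV(DATE, 1000000)) rn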
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE
DATE >= 20210610000000 AND DATE <= 20210818999999
and _PARTITIONDATE >= #start_date and _PARTITIONDATE <= #end_date
AND LOWER(DocumentIdentifier) like #url_topic
) t where rn <= #numberofrowsperday
If you are passing dates only, you can use _PARTITIONDATE to filter the partitions.
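To make the per-day cap concrete, here is a rough Python sketch of what numbering rows within each day and keeping only the first few amounts to. The function name, the dict-shaped rows, and the assumption that the date is an integer of the form YYYYMMDDHHMMSS are all mine, not something taken from the GDELT schema docs.

```python
from collections import defaultdict


def cap_rows_per_day(rows, max_per_day):
    """Keep at most max_per_day rows for each day, in input order."""
    seen = defaultdict(int)
    kept = []
    for row in rows:
        # Assumed format: 20210610153000 -> 20210610 (drop HHMMSS).
        day = row["date"] // 1_000_000
        seen[day] += 1  # plays the role of row_number() within the day
        if seen[day] <= max_per_day:
            kept.append(row)
    return kept


sample = [
    {"date": 20210610101500, "url": "a"},
    {"date": 20210610103000, "url": "b"},
    {"date": 20210610120000, "url": "c"},
    {"date": 20210611090000, "url": "d"},
]
capped = cap_rows_per_day(sample, max_per_day=2)
print([row["url"] for row in capped])
```

With a total budget of 15k rows over roughly 70 days, the cap would be somewhere around 15000 / 70 rows per day.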

Oracle SQL: How to best go about counting how many values were in time intervals? Database query vs. pandas (or more efficient libraries)?

I currently have to wrap my head around programming the following task.
Situation: suppose we have one column where we have time data (Year-Month-Day Hours-Minutes). Our program shall get the input (weekday, starttime, endtime, timeslot) and we want to return the interval (specified by timeslot) where there are the least values. For further information, the database has several million entries.
So our program would be specified as
def calculate_optimal_window(weekday, starttime, endtime, timeslot):
return optimal_window
Example: suppose we want to input
weekday = Monday, starttime = 10:00, endtime = 12:00, timeslot = 30 minutes.
Here we want to count how many entries there are between 10:00 and 12:00 o'clock, and compute the number of values in every single 30 minute slot (i.e. 10:00 - 10:30, 10:01 - 10:31 etc.) and in the end return the slot with the least values. How would you go about formulating an efficient query?
Since I'm working with an Oracle SQL database, my second question is: would it be more efficient to work with libraries like Dask or Vaex to get the filtering and counting done? Where is the bottleneck in this situation?
Happy to provide more information if the formulation was too blurry.
All the best.
This part:
Since I'm working with an Oracle SQL database, my second question is:
would it be more efficient to work with libraries like Dask or Vaex to
get the filtering and counting done? Where is the bottleneck in this
situation?
Depending on your server's specs and the cluster/machine you have available for Dask, it is rather likely that the bottleneck in your analysis would be the transfer of data between the database and the Dask workers, even in the (likely) case that this can be efficiently parallelised. From the DB's point of view, selecting data and serialising it is likely at least as expensive as counting in a relatively small number of time bins.
I would start by investigating how long the process takes with SQL alone, and whether this is acceptable, before moving the analysis to Dask. Usual rules would apply: having good indexing and sharding on the time index.
You should at least do the basic filtering and counting in the SQL query. With a simple predicate, Oracle can decide whether to use an index or a partition and potentially reduce the database processing time. And sending fewer rows will significantly decrease the network overhead.
For example:
select trunc(the_time, 'MI') the_minute, count(*) the_count
from test1
where the_time between timestamp '2021-01-25 10:00:00' and timestamp '2021-01-25 11:59:59'
group by trunc(the_time, 'MI')
order by the_minute desc;
(The trickiest part of these queries will probably be off-by-one issues. Do you really want "between 10:00 and 12:00", or do you want "between 10:00 and 11:59:59"?)
Optionally, you can perform the entire calculation in SQL. I would wager the SQL version will be slightly faster, again because of the network overhead. But sending one result row versus 120 aggregate rows probably won't make a noticeable difference unless this query is frequently executed.
At this point, the question veers into the more subjective question about where to put the "business logic". I bet most programmers would prefer your Python solution to my query. But one minor advantage of doing all the work in SQL is keeping all of the weird date logic in one place. If you process the results in multiple steps there are more chances for an off-by-one error.
--Time slots with the smallest number of rows.
--(There will be lots of ties because the data is so boring.)
with dates as
(
--Enter literals or bind variables here:
select
cast(timestamp '2021-01-25 10:00:00' as date) begin_date,
cast(timestamp '2021-01-25 11:59:59' as date) end_date,
30 timeslot
from dual
)
--Choose the rows with the smallest counts.
select begin_time, end_time, total_count
from
(
--Rank the time slots per count.
select begin_time, end_time, total_count,
dense_rank() over (order by total_count) smallest_when_1
from
(
--Counts per timeslot.
select begin_time, end_time, sum(the_count) total_count
from
(
--Counts per minute.
select trunc(the_time, 'MI') the_minute, count(*) the_count
from test1
where the_time between (select begin_date from dates) and (select end_date from dates)
group by trunc(the_time, 'MI')
order by the_minute desc
) counts
join
(
--Time ranges.
select
begin_date + ((level-1)/24/60) begin_time,
begin_date + ((level-1)/24/60) + (timeslot/24/60) end_time
from dates
connect by level <=
(
--The number of different time ranges.
select (end_date - begin_date) * 24 * 60 - timeslot + 1
from dates
)
) time_ranges
on the_minute between begin_time and end_time
group by begin_time, end_time
)
)
where smallest_when_1 = 1
order by begin_time;
You can run a db<>fiddle here.
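For comparison, the client-side half of the problem can be sketched in Python. This assumes the timestamps (or per-minute counts) have already been fetched, that slots are half-open ranges starting on every whole minute, and that ties go to the earliest slot. The function name and sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def least_busy_slot(timestamps, start, end, slot_minutes):
    """Return (slot_start, slot_end, count) for the emptiest slot.

    Slots are half-open, [slot_start, slot_end), and begin on every whole
    minute from `start`. `start` and `end` are assumed to fall on whole
    minutes. Ties go to the earliest slot.
    """
    # Count rows per minute, like GROUP BY trunc(the_time, 'MI').
    per_minute = Counter(
        ts.replace(second=0, microsecond=0)
        for ts in timestamps
        if start <= ts < end
    )
    total_minutes = int((end - start).total_seconds() // 60)
    n_slots = total_minutes - slot_minutes + 1
    if n_slots < 1:
        raise ValueError("the slot is longer than the time range")

    counts = [
        per_minute.get(start + timedelta(minutes=i), 0)
        for i in range(total_minutes)
    ]

    # Slide the window one minute at a time instead of re-summing each slot.
    window = sum(counts[:slot_minutes])
    best_index, best_count = 0, window
    for i in range(1, n_slots):
        window += counts[i + slot_minutes - 1] - counts[i - 1]
        if window < best_count:
            best_index, best_count = i, window

    slot_start = start + timedelta(minutes=best_index)
    return slot_start, slot_start + timedelta(minutes=slot_minutes), best_count


day = datetime(2021, 1, 25)
sample = [
    day.replace(hour=10, minute=5),
    day.replace(hour=10, minute=10),
    day.replace(hour=10, minute=50),
    day.replace(hour=11, minute=40),
]
best = least_busy_slot(sample, day.replace(hour=10), day.replace(hour=12), 30)
print(best)
```

Using half-open slots here is a deliberate choice: it keeps a 30-minute slot at exactly 30 one-minute buckets, which is the kind of off-by-one detail mentioned above.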

SELECT if exists where DATE=TODAY, if not where DATE=YESTERDAY

I have a table with some columns and a date column (which I partitioned the table on).
For example
[Amount, Date ]
[4 , 2020-4-1]
[3 , 2020-4-2]
[5 , 2020-4-4]
I want to get the latest Amount based on the Date.
I thought about doing a LIMIT 1 with ORDER BY, but is that optimized by BigQuery, or will it scan my entire table?
I want to avoid costs as much as possible. I thought about querying for today's date and, if nothing is found, searching for yesterday, but I don't know how to do that in a single query.
Below is for BigQuery Standard SQL
#standardSQL
SELECT ARRAY_AGG(amount ORDER BY `date` DESC LIMIT 1)[SAFE_OFFSET(0)]
FROM `project.dataset.table`
WHERE `date` >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
Note: above assumes your date field is of DATE data type.
If your date field is a partition, you can use it in WHERE clause to filter which partitions should be read in your query.
In your case, you could do something like:
SELECT value
FROM <your-table>
WHERE Date >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
ORDER BY Date DESC
LIMIT 1
This query basically will:
Filter only today's and yesterday's partitions
Order the rows by your Date field, from the most recent to the older
Select the first element of the ordered list
If the table has a row with today's date, the query will return the data for today. If it doesn't, the query will return the data for yesterday.
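The three steps can be mimicked in plain Python to check the fallback behaviour. This is only a sketch of the logic, not of how BigQuery prunes partitions; the function name and the (amount, day) tuples are made up for the example.

```python
from datetime import date, timedelta


def latest_amount(rows, today):
    """rows: iterable of (amount, day) pairs.

    Return the amount for the most recent day that is today or yesterday,
    or None when neither day has a row.
    """
    cutoff = today - timedelta(days=1)
    # Step 1: keep only today's and yesterday's rows (the WHERE clause).
    recent = [(day, amount) for amount, day in rows if day >= cutoff]
    if not recent:
        return None
    # Steps 2 and 3: ORDER BY day DESC LIMIT 1.
    return max(recent, key=lambda pair: pair[0])[1]


table = [(4, date(2020, 4, 1)), (3, date(2020, 4, 2)), (5, date(2020, 4, 4))]
print(latest_amount(table, today=date(2020, 4, 4)))  # today has a row
print(latest_amount(table, today=date(2020, 4, 5)))  # falls back to yesterday
print(latest_amount(table, today=date(2020, 4, 10)))  # neither day has a row
```

Note that, like the query, this returns nothing at all when both today and yesterday are missing, so the caller has to decide what to do in that case.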
Finally, I would like to attach here this reference regarding querying partitioned tables.
I hope it helps
The LIMIT clause stops the query when it has the number of results indicated.
I think the query should be something like this, I'm not sure if "today()-1" returns
SELECT Amount
FROM <table> as t
WHERE date(t.Date) = current_date()
OR date(t.Date) = makedate(year(current_date()), dayofyear(current_date())-1);
Edited: Sorry, my answer is for MariaDB. I now see you asked about Google BigQuery, which I didn't even know, but it looks like SQL, so I hope it has functions like the ones I posted.

Why do I get different results when filtering a date period with date_part versus an exact date parameter?

I'm trying to count the distinct values of a column in a table.
I have one piece of logic and tried to write it in two ways,
but I get different results from the two queries.
Can anyone help clarify this for me? I don't know whether the code or my reasoning is wrong.
SQL
select count(distinct membership_id) from members_membership m
where date_part(year,m.membership_expires)>=2019
and date_part(month,m.membership_expires)>=7
and date_part(day,m.membership_expires)>=1
and date_part(year,m.membership_creationdate)<=2019
and date_part(month,m.membership_creationdate)<=7
and date_part(day,m.membership_creationdate)<=1
;
select count(distinct membership_id) from members_membership m
where m.membership_expires>='2019-07-01'
and m.membership_creationdate<='2019-07-01'
;
I actually think that this is the query you intend to run:
SELECT
COUNT(DISTINCT membership_id)
FROM members_membership m
WHERE
m.membership_expires >= '2019-07-01' AND
m.membership_creationdate < '2019-07-01';
It doesn't make sense for a membership to expire at the same moment it gets created, so if it expires on midnight of 1st-July 2019, then it should have been created strictly before that point in time.
That being said, the problem with the first query is that, e.g., the restriction on the month being on or before July would apply to every year, not just 2019. It is difficult to write a date inequality using the year, month, and day terms separately. For this reason, the second version you used is preferable. It is also sargable, meaning that an index on membership_expires or membership_creationdate can be used.
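A quick way to see the problem outside the database is to run the same piece-by-piece comparison in Python. The helper name and the sample expiry date are made up for the illustration.

```python
from datetime import date


def on_or_after_componentwise(d, cutoff):
    """Mimic the date_part() version: compare year, month and day separately."""
    return d.year >= cutoff.year and d.month >= cutoff.month and d.day >= cutoff.day


cutoff = date(2019, 7, 1)
expiry = date(2020, 1, 15)  # clearly later than the cutoff

componentwise = on_or_after_componentwise(expiry, cutoff)  # month 1 >= 7 fails
whole_date = expiry >= cutoff

print(componentwise, whole_date)
```

The membership expiring in January 2020 is dropped by the component-wise test even though it expires after the cutoff, which is why the two counts differ.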
There is an issue with the first query:
select count(distinct membership_id) from members_membership m
where date_part(year,m.membership_expires)>=2019
and date_part(month,m.membership_expires)>=7
and date_part(day,m.membership_expires)>=1
and date_part(year,m.membership_creationdate)<=2019
and date_part(month,m.membership_creationdate)<=7
and date_part(day,m.membership_creationdate)<=1; -- no day of the month is less than 1, so this only matches the 1st
-- this condition is satisfied only by dates on the 1st of a month, but I think you need all the dates before 01-Jul-2019
and date_part(day,m.membership_creationdate)<=1 is the culprit.
Even membership_creationdate = 15-jan-1901 will not satisfy the above condition.
You should always compare date columns as whole dates to avoid this type of issue. (Your second query is perfectly fine.)
Cheers!!
The reason could be due to a time component.
The proper comparison for the first query is:
select count(distinct membership_id)
from members_membership m
where m.membership_expires >= '2019-07-01' and
m.membership_creationdate < '2019-07-02'
--------------------------------^ not <= ---^ next day
This logic should work regardless of whether or not the "date" has a time component.
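The effect of a time component is easy to reproduce in Python, assuming the column behaves like a datetime and the bare date literal is read as midnight. The values below are made up.

```python
from datetime import datetime

created = datetime(2019, 7, 1, 9, 30)  # created on 1 July, mid-morning
midnight = datetime(2019, 7, 1)        # what '2019-07-01' means as a timestamp
next_day = datetime(2019, 7, 2)

matches_inclusive = created <= midnight  # misses everything after 00:00:00
matches_half_open = created < next_day   # covers the whole of 1 July

print(matches_inclusive, matches_half_open)
```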

SQL question: count of occurrence greater than N in any given hour

I'm looking through login logs (in Netezza) and trying to find users who have greater than a certain number of logins in any 1 hour time period (any consecutive 60 minute period, as opposed to strictly a clock hour) since December 1st. I've viewed the following posts, but most seem to address searching within a specific time range, not ANY given time period. Thanks.
https://dba.stackexchange.com/questions/137660/counting-number-of-occurences-in-a-time-period
https://dba.stackexchange.com/questions/67881/calculating-the-maximum-seen-so-far-for-each-point-in-time
Count records per hour within a time span
You could use the analytic function lag to look back in a sorted sequence of time stamps to see whether the record that came 19 entries earlier is within an hour difference:
with cte as (
select user_id,
login_time,
lag(login_time, 19) over (partition by user_id order by login_time) as lag_time
from userlog
order by user_id,
login_time
)
select user_id,
min(login_time) as login_time
from cte
where extract(epoch from (login_time - lag_time)) < 3600
group by user_id
The output will show the matching users with the first occurrence when they logged a twentieth time within an hour.
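The same lag trick can be written in Python if the log has already been exported. This is a sketch under that assumption; the function name, the threshold parameter, and the sample users are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def first_burst_per_user(logins, threshold=20, window=timedelta(hours=1)):
    """logins: iterable of (user_id, login_time) pairs.

    Return {user_id: time of the login that completed the first burst},
    where a burst is `threshold` logins spanning less than `window`.
    """
    by_user = defaultdict(list)
    for user_id, login_time in logins:
        by_user[user_id].append(login_time)

    bursts = {}
    for user_id, times in by_user.items():
        times.sort()
        for i in range(threshold - 1, len(times)):
            # Same as lag(login_time, threshold - 1): look back a fixed
            # number of rows and check how much time that span covers.
            if times[i] - times[i - (threshold - 1)] < window:
                bursts[user_id] = times[i]
                break
    return bursts


day = datetime(2018, 12, 3)
sample = [
    ("a", day.replace(hour=10, minute=0)),
    ("a", day.replace(hour=10, minute=20)),
    ("a", day.replace(hour=10, minute=50)),
    ("b", day.replace(hour=10)),
    ("b", day.replace(hour=11)),
    ("b", day.replace(hour=12)),
]
found = first_burst_per_user(sample, threshold=3)
print(found)
```

Because the rows are sorted, looking back a fixed number of entries is enough: if the entry `threshold - 1` places earlier is within the hour, so is everything in between.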
I think you could do something like this (I'll use a login table with just user and datetime columns for the sake of simplicity):
with connections as (
select ua.user
, ua.datetime
from user_logons ua
where ua.datetime >= timestamp'2018-12-01 00:00:00'
)
select ua.user
, ua.datetime
, (select count(*)
from connections ut
where ut.user = ua.user
and ut.datetime between ua.datetime and (ua.datetime + 1 hour)
) as consecutive_logons
from connections ua
It is up to you to complete with your columns (user, datetime)
It is up to you to find the date-add facility (ua.datetime + 1 hour won't work as written); this depends on the DB implementation. For example, it is DATE_ADD in MySQL (https://www.w3schools.com/SQl/func_mysql_date_add.asp)
Because of the subquery (select count(*) ...), the whole query will not be the fastest: it is a correlated subquery, so it needs to be re-evaluated for each row.
The with clause simply computes a subset of user_logons to minimize its cost. This might not be needed, but it lessens the complexity of the query.
You might get better performance using a stored function or a function written in an application language (e.g. Java, PHP, ...).
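As one possible shape for such a function, here is a Python sketch that computes the same per-login count for a single user with binary search on a sorted list, rather than rescanning the list for every row. The function name and sample times are assumptions for illustration.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta


def logins_in_following_hour(times, window=timedelta(hours=1)):
    """For each login time t of one user, count logins in [t, t + window].

    Both ends are inclusive, matching BETWEEN in the query.
    """
    ordered = sorted(times)
    return [
        (t, bisect_right(ordered, t + window) - bisect_left(ordered, t))
        for t in ordered
    ]


day = datetime(2018, 12, 3)
sample = [
    day.replace(hour=10, minute=0),
    day.replace(hour=10, minute=30),
    day.replace(hour=11, minute=0),
    day.replace(hour=12, minute=30),
]
counts = logins_in_following_hour(sample)
print([count for _, count in counts])
```

Filtering the result for counts above the chosen limit then gives the users (and start times) of interest.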