Count of records that are created the same day - sql

I am trying to create an array that should look something like this:
[ [created_at date * 1000, count of records for that date],
  [created_at date * 1000, count of records for that date],
  [created_at date * 1000, count of records for that date] ]
The created_at dates are not exactly the same, because the minutes, hours and seconds differ.
I was thinking: is it possible to set the created_at time to 00:00:00 on create?
I have tried this:
@kliks = Klik.all.map{ |klik| [(klik.created_at.to_i * 1000), 1] }
But I have not figured out how to sum the records that are created on the same day. Also, this loop creates an array entry for every single record; I don't want duplicates, I want one summed count per day.

Rails has ActiveRecord::Calculations, which is designed to do exactly this sort of thing at the database level. You should use it. In this case, count is the method you want:
@kliks = Klik.count( :group => "DATE( created_at )" )
This is equivalent to the following SQL:
SELECT DATE( created_at ), COUNT(*) FROM kliks
GROUP BY DATE( created_at )
The DATE() function in MySQL truncates a datetime (like created_at, e.g. 2012-02-27 10:08:59) to a plain date (e.g. 2012-02-27). No need to go converting things to integers or multiplying minutes and seconds, and no need to use map or any other method in Ruby.
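If you also want the [epoch milliseconds, count] pairs from the question directly, the same grouping can produce them in SQL; a sketch, assuming MySQL and the kliks table above:
SELECT UNIX_TIMESTAMP(DATE(created_at)) * 1000 AS day_ms, -- midnight of that day, in ms
       COUNT(*) AS klik_count
FROM kliks
GROUP BY DATE(created_at);
-- UNIX_TIMESTAMP() interprets the date in the session time zone.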

According to the query guide, you should try:
items = Klik.select("date(created_at) as creation_date, count(*) as count").group("date(created_at)")
result = items.map { |k| [ k['creation_date'], k['count'] ] }

The following will produce the result you have asked for:
Klik.all.group_by do |k|
  k.created_at.beginning_of_day
end.map do |date, records|
  [date, records.length]
end

Related

Postgres DATE_TRUNC should return 0 for intervals that have no data

I am trying to do time-series-style reporting using the Postgres DATE_TRUNC function. It works and I get the expected output, except that when a specific interval has no records it is skipped entirely; I want such intervals to appear too, with 0 as the count. Below is the query I have right now. What should I change to also get the intervals that have no data? Thanks in advance.
SELECT date_trunc('days', sent_at), count('*')
FROM (select * from invoice
WHERE supplier = 'ABC' and sent_at BETWEEN '2021-12-01' AND '2022-07-31') as inv
GROUP BY date_trunc('days', sent_at)
ORDER BY date_trunc('days', sent_at);
Expected: the current output shows 02/12 and then jumps straight to 07/12, skipping the dates in between, but it should also show 03/12, 04/12 and 05/12 with a count of 0.
[screenshot of current output]
It doesn't seem like you have those dates in your data, in which case you need to generate them. Also, casting your timestamp to date instead of using date_trunc() gets rid of the zero-filled time-of-day (00:00:00) in the output:
SELECT dates::date, count(*) filter (where sent_at is not null)
FROM (
  select *
  from invoice a
  right join generate_series( '2021-12-01'::date,
                              '2021-12-31'::date,
                              '1 day'::interval ) as b(dates)
    on sent_at::date = b.dates
) as inv
GROUP BY 1
ORDER BY 1;
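A sketch of the same idea written with the calendar series on the left; note that the question's supplier = 'ABC' filter has to live in the join condition rather than a WHERE clause, or the empty days would disappear again (table and column names as in the question):
SELECT d.day::date AS day,
       count(i.sent_at) AS cnt -- count(column) skips NULLs, so empty days give 0
FROM generate_series('2021-12-01'::date,
                     '2021-12-31'::date,
                     '1 day'::interval) AS d(day)
LEFT JOIN invoice i
       ON i.sent_at::date = d.day::date
      AND i.supplier = 'ABC'
GROUP BY 1
ORDER BY 1;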

Hive - calculating string type timestamp differences in minutes

I'm a novice at SQL (in Hive) and trying to calculate, for every anonymousid, the time spent between its first event and its last event, in minutes. The source table's timestamp is formatted as a string,
like: "2020-12-24T09:47:17.775Z". I've tried two approaches:
1- Cast the timestamp column to bigint and calculate the difference directly:
select anonymousid, max(from_unixtime(cast('timestamp' as bigint)) - min(from_unixtime(cast('timestamp' as bigint)) from db1.formevent group by anonymousid
This returned NULLs.
2- Create a new table from the source, filter with where, and try to convert the timestamp to a date format, without any min/max calculation:
create table db1.successtime as select anonymousid, pagepath,buttontype, itemname, 'location', cast(to_date(from_unixtime(unix_timestamp('timestamp', "yyyy-MM-dd'T'HH:mm:ss.SSS"),'HH:mm:ss') as date) from db1.formevent where pagepath = "/account/sign-up/" and itemname = "Success" and 'location' = "Standard"
That returned NULLs again, so I set it aside.
Is there any way I can reformat the timestamps, calculate the difference in minutes between each anonymousid's first and last event, and take the average grouped by location?
From your description, this should work:
select anonymousid,
       (max(unix_timestamp(`timestamp`, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")) -
        min(unix_timestamp(`timestamp`, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
       ) / 60 as minutes_spent
from db1.formevent
group by anonymousid;
Note that the column name is not in single quotes: in Hive, 'timestamp' with single quotes is a string literal, not a column reference, which is why both of your attempts returned NULLs. It is backtick-quoted above because timestamp is a reserved word.
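To also get the average per location that the question asks for, one sketch, assuming the column really is named location and that each anonymousid belongs to a single location:
select `location`,
       avg(minutes_spent) as avg_minutes_spent
from (
  select anonymousid, `location`,
         (max(unix_timestamp(`timestamp`, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")) -
          min(unix_timestamp(`timestamp`, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"))
         ) / 60 as minutes_spent -- unix_timestamp() returns seconds, so /60 gives minutes
  from db1.formevent
  group by anonymousid, `location`
) per_user
group by `location`;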

How to select rows until the sum of a column reaches N, where the column is of type TIME

I would like to select enough audio calls to add up to 00:10:00 of audio. I have tried to achieve this with the following SQL (Postgres) statement:
SELECT file_name, audio_duration
FROM (
SELECT distinct file_name, audio_duration, SUM(audio_duration)
OVER (ORDER BY audio_duration) AS total_duration
FROM data
) AS t
WHERE
t.total_duration <='00:10:00'
GROUP BY file_name, audio_duration
My problem is that it doesn't seem to be calculating the total duration correctly.
I suspect this is due to the audio_duration column being of type TIME.
If anyone has any hints or suggestions on how to write this query, it would be greatly appreciated.
You should really define that column as interval. A time column stores a time of day, e.g. "3 in the afternoon".
However, you can cast a single time value to an interval. You also don't need the window function to first calculate the "running total" if what you want is the total duration per file:
SELECT file_name, sum(audio_duration::interval) as total_duration
FROM data
GROUP BY file_name
HAVING sum(audio_duration::interval) <= interval '10 minute';
To permanently change the column type to an interval you can use:
alter table data
alter audio_duration type interval;
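If what you actually want is the cumulative reading of the question (keep taking files until ten minutes of audio is accumulated), the window function comes back, with the interval cast applied; a sketch, where the tie-breaking file_name in the ORDER BY is an assumption:
SELECT file_name, audio_duration
FROM (
  SELECT file_name, audio_duration,
         SUM(audio_duration::interval)
           OVER (ORDER BY audio_duration, file_name) AS running_total
  FROM data
) AS t
WHERE running_total <= interval '10 minutes';
-- file_name in the ORDER BY makes the running total deterministic
-- when several files share the same duration. The ::interval cast is
-- only needed while the column is still of type time.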
I fully agree with @a_horse_with_no_name that interval is the better datatype, but must admit that the time datatype is not incorrect. While you cannot add (+) time values, you can SUM them: summing time values results in an interval, and produces the same result as summing the corresponding intervals. A time, besides being a moment, is also the interval from the beginning of the day to that moment. Demo:
with as_time (dur) as ( values ('10:34:45 AM'::time), ('03:14:50 PM'::time), ('11:15:25 PM'::time))
, as_intv (dur) as ( values ('10:34:45'::interval), ('15:14:50'::interval),('23:15:25'::interval))
select *
from (select sum(dur) sum_time from as_time) st
, (select sum(dur) sum_intv from as_intv) si;
BTW: the answer to the rhetorical question "what is the sum of '8 in the morning' and '3 in the afternoon'?" is 23:00:00.

How do I expire rows based on a lookup table of expiry times?

If I have two tables:
items
  Id        VARCHAR(26)
  CreateAt  bigint(20)
  Type      VARCHAR(26)

expiry
  Id      VARCHAR(26)
  Expiry  bigint(20)
The items table contains when the item was created, and what type it is. Then another table, expiry, is a lookup table to say how long certain types should last for. A query is run every day to make sure that items that have expired are removed.
At the moment this query is written in our app, as programming code:
for item in items {
  expiry = expiry.get(item.Type)
  if (currentDate() - expiry.Expiry > item.CreateAt) {
    item.delete()
  }
}
This was fine when we only had a few thousand items, but now we have tens of millions it takes a significant amount of time to run. Is there a way to put this into just an SQL statement?
Assuming all date values are actually UNIX timestamps, you could write a query such as:
SELECT * -- DELETE
FROM items
WHERE EXISTS (
  SELECT 1
  FROM expiry
  WHERE expiry.id = items.type
    AND items.CreateAt + expiry.Expiry < UNIX_TIMESTAMP()
)
Replace SELECT with DELETE once you're sure that the query selects the correct rows.
If the dates stored are in seconds since the UNIX epoch, you could use this PostgreSQL query:
DELETE FROM items
USING expiry
WHERE items.type = expiry.id
AND items.createat < EXTRACT(epoch FROM current_timestamp) - expiry.expiry;
A standard SQL solution that should work anywhere would be
DELETE FROM items
WHERE items.createat < EXTRACT(epoch FROM current_timestamp)
- (SELECT expiry.expiry FROM expiry
WHERE expiry.id = items.type);
That can be less efficient in PostgreSQL.
Your code is slow because you do the join between the tables outside the database.
The second slowing aspect is that you delete the items one by one.
So using the compact DELETE statements that were provided above is the correct solution.
It seems that you are using something like Python with SQLAlchemy; there the code would be something like:
from sqlalchemy import and_, exists, func

items.delete().where(
    exists().where(and_(
        expiry.c.Id == items.c.Type,
        # currentDate() - Expiry > CreateAt; UNIX_TIMESTAMP() assumes MySQL
        func.unix_timestamp() - expiry.c.Expiry > items.c.CreateAt))
)
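Whichever DELETE variant you use, at tens of millions of rows it will still scan unless the optimizer has an index to drive the join; a sketch, assuming MySQL (the index name is made up):
CREATE INDEX idx_items_type_createat ON items (Type, CreateAt);
-- lets the optimizer probe items by type and creation time
-- instead of scanning the whole table for each expiry row.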

Trying to UNNEST timestamp array field, but need to GROUP BY

I have a repeated field of type TIMESTAMP in a BigQuery table that I am attempting to UNNEST. However, I must group or aggregate the field, and I am not knowledgeable with SQL, so I could use some help. The code snippet is part of a larger query that works when subscription.future_renewal_dates is substituted with GENERATE_TIMESTAMP_ARRAY.
subscription.future_renewal_dates is ARRAY<TIMESTAMP>
The TIMESTAMP array is unique per subscription (recurring renewal dates) and cannot be generated with GENERATE_TIMESTAMP_ARRAY, so I have to generate the dates before uploading to BigQuery. A UDF seems like too much.
SELECT
  subscription.amount AS subscription_amount,
  subscription.status AS subscription_status,
  "1" AS analytic_name,
  ARRAY(
    SELECT AS STRUCT
      FORMAT_TIMESTAMP("%x", days) AS type_value,
      subscription.amount AS analytic_name
    FROM UNNEST(subscription.future_renewal_dates) AS days
    WHERE days >= TIMESTAMP("2019-06-05T19:30:02+00:00")
      AND days <= TIMESTAMP("2019-08-01T03:59:59+00:00")
  ) AS forecast
FROM `mydataset.subscription` AS subscription
GROUP BY
  subscription_amount,
  subscription_status,
  analytic_name
I cannot figure out how to successfully unnest subscription.future_renewal_dates without the error 'UNNEST expression references subscription.future_renewal_dates which is neither grouped nor aggregated'.
When you use GROUP BY, every expression or column in the SELECT (except those in the GROUP BY list) must be used with some aggregation function, which yours clearly is not. So you need to decide what it is you are actually trying to achieve with that grouping.
Below is the option I think you had in mind; even if it is not exactly right, it should give you an idea of how to fix it:
SELECT
  subscription.amount AS subscription_amount,
  subscription.status AS subscription_status,
  "1" AS analytic_name,
  ARRAY_CONCAT_AGG(ARRAY(
    SELECT AS STRUCT
      FORMAT_TIMESTAMP("%x", days) AS type_value,
      subscription.amount AS analytic_name
    FROM UNNEST(subscription.future_renewal_dates) AS days
    WHERE days >= TIMESTAMP("2019-06-05T19:30:02+00:00")
      AND days <= TIMESTAMP("2019-08-01T03:59:59+00:00")
  )) AS forecast
FROM `mydataset.subscription` AS subscription
GROUP BY
  subscription_amount,
  subscription_status,
  analytic_name
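ARRAY_CONCAT_AGG is what makes the ARRAY subquery legal under GROUP BY: it is an aggregate, so it merges the per-row arrays of each group into one array. A minimal standalone illustration, with made-up literal data:
SELECT ARRAY_CONCAT_AGG(v.arr) AS merged
FROM UNNEST([STRUCT([1, 2] AS arr), STRUCT([3, 4] AS arr)]) AS v
-- merged = [1, 2, 3, 4]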