I have an events table I'm querying by month and trying to limit the number of events returned per day to 3.
[39] pry(#<EventsController>)> @events.group("DATE_TRUNC('day', start)").count
CACHE (0.0ms) SELECT COUNT(*) AS count_all, DATE_TRUNC('day', start) AS
date_trunc_day_start FROM "events" WHERE ((start >= '2014-08-31 00:00:00' and
start <= '2014-10-12 00:00:00')) GROUP BY DATE_TRUNC('day', start)
=> {2014-09-24 00:00:00 UTC=>5,
2014-09-18 00:00:00 UTC=>6,
2014-09-25 00:00:00 UTC=>3}
Here we have 5 events on the 24th, 6 on the 18th, and 3 on the 25th.
http://stackoverflow.com/a/12529783/3317093
When I try the query without the .count, I get the error message
PG::GroupingError: ERROR: column "events.id" must appear in the GROUP BY clause or be used in an aggregate function
I looked at using select() to get the grouping to work, but I would need to list all the columns in the table. How should I structure the query/scope to return 3 records from each group of events?
Edit - I'm close!
I've found many similar questions, most of them for MySQL, using select. I think using select could be the way to go, either as events.* or as below:
#events.where("exists (select 1 from events GROUP BY DATE_TRUNC('day', start) limit 3)")
yields the SQL
SELECT "events".* FROM "events" WHERE ((start >= '2014-08-31 00:00:00'
and start <= '2014-10-12 00:00:00')) AND (exists (select 1 from events
GROUP BY DATE_TRUNC('day', start) limit 3))
The query returns all @events sorted by id (it seems :id is implicitly part of the grouping). I've tried switching things up but most often get the same grouping error as before.
For anyone experiencing a similar issue, I would recommend checking out window functions and this blog post covering different ways to solve a similar question. The three approaches covered in the post include using 1) group_by, 2) SQL subselects, 3) window functions.
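As a rough sketch of the subselect approach (my own illustration, not from the post, assuming id is the primary key and start is the timestamp column as in the question):

-- Keep an event only if fewer than 3 events on the same day have a smaller id.
SELECT e.*
FROM events e
WHERE (
  SELECT COUNT(*)
  FROM events e2
  WHERE DATE_TRUNC('day', e2.start) = DATE_TRUNC('day', e.start)
    AND e2.id < e.id
) < 3;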
My solution, using window functions:
#events.where("(events.id)
IN (
SELECT id FROM
( SELECT DISTINCT id,
row_number() OVER (PARTITION BY DATE_TRUNC('day', start) ORDER BY id) AS rank
FROM events) AS result
WHERE (
start >= '#{startt}' and
start <= '#{endt}' and
rank <= 3
)
)
")
If you don't want to use count, you can use Ruby's group_by to list the events by date, as follows:
hash = @events.group_by { |p| p.start.to_date }
And use this code to limit it to 3 events for each date:
hash.inject({}) { |result, (k, v)| result.merge(k => v.take(3)) }
A helper link for mapping over a hash and returning a hash instead of an array.
I have the following SQL that has a syntax error. I'm trying to reference the prior day's close in my query; how do I fix it so it doesn't error out?
Thanks!
SELECT *
FROM "daily_data"
WHERE date >'2018-01-01' and (open-LAG(close))/LAG(close)>=1.4 and volume > 1000000 and open > 1
Error:
Query execution failed
Reason: SQL Error [42809]: ERROR: window function lag requires an OVER clause
Position: 63
You need to use a subquery. You cannot use window functions in the where clause. You also need an ORDER BY and potentially a PARTITION BY clause:
SELECT *
FROM (SELECT dd.*,
LAG(close) OVER (ORDER BY date) as prev_close
FROM "daily_data" dd
) dd
WHERE date > '2018-01-01' AND
(open - prev_close) / prev_close >= 1.4 AND
volume > 1000000 AND
open > 1;
lag(close) means "the value of close from the prior record." So the phrase by itself is missing something fundamental: how do you define "prior record", since there is never any implied order in an RDBMS?
As with functions such as rank and row_number, to properly form the lead and lag commands you need to establish the prior (or next) record by defining the order of output. In other words, "if you were to sort the output by x, the prior record's close" would look like this:
lag (close) over (order by x)
To order by something descending:
lag (close) over (order by x desc)
You can optionally chunk the data by a field using partition by which may or may not be useful in your problem. For example, "for each item, if you were to sort the output by x, the prior record's close:"
lag (close) over (partition by item order by x)
So the question here is: the prior record (lag)... how? By which fields, in which order?
As a final thought, analytic/windowing functions cannot be used in the where clause in PostgreSQL. To accomplish this, wrap them in a subquery:
with daily as (
SELECT
d.*,
LAG (d.close) over (order by d.<something>) as prior_close
FROM "daily_data" d
WHERE
d.date >'2018-01-01' and
d.volume > 1000000 and
d.open > 1
)
select *
from daily
where
(open - prior_close) / prior_close >= 1.4
I'm looking through login logs (in Netezza) and trying to find users who have greater than a certain number of logins in any 1 hour time period (any consecutive 60 minute period, as opposed to strictly a clock hour) since December 1st. I've viewed the following posts, but most seem to address searching within a specific time range, not ANY given time period. Thanks.
https://dba.stackexchange.com/questions/137660/counting-number-of-occurences-in-a-time-period
https://dba.stackexchange.com/questions/67881/calculating-the-maximum-seen-so-far-for-each-point-in-time
Count records per hour within a time span
You could use the analytic function lag to look back in a sorted sequence of timestamps and see whether the record that came 19 entries earlier falls within one hour of the current one:
with cte as (
select user_id,
login_time,
lag(login_time, 19) over (partition by user_id order by login_time) as lag_time
from userlog
order by user_id,
login_time
)
select user_id,
min(login_time) as login_time
from cte
where extract(epoch from (login_time - lag_time)) < 3600
group by user_id
The output will show the matching users with the first occurrence when they logged a twentieth time within an hour.
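If your threshold is something other than 20 logins per hour, the lag offset is simply the threshold minus one; for example, to flag 10 or more logins in any rolling hour, only that one line changes (a sketch):

lag(login_time, 9) over (partition by user_id order by login_time) as lag_time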
I think you could do something like this (for the sake of simplicity, I'll use a login table with just user and datetime columns):
with connections as (
select ua.user
, ua.datetime
from user_logons ua
where ua.datetime >= timestamp'2018-12-01 00:00:00'
)
select ua.user
, ua.datetime
, (select count(*)
from connections ut
where ut.user = ua.user
and ut.datetime between ua.datetime and (ua.datetime + 1 hour)
) as consecutive_logons
from connections ua
It is up to you to fill in your own columns (user, datetime).
It is up to you to find the date-add facility (ua.datetime + 1 hour won't work as written); this is dependent on the DB implementation, for example it is DATE_ADD in MySQL (https://www.w3schools.com/SQl/func_mysql_date_add.asp).
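For example, in PostgreSQL the addition can be written with an interval literal (a sketch of the subquery's range condition; check your own database's equivalent syntax):

and ut.datetime between ua.datetime and (ua.datetime + interval '1 hour')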
Due to the subquery (select count(*) ...), the whole query will not be the fastest, because it is a correlated subquery: it needs to be re-evaluated for each row.
The with clause simply computes a subset of user_logons to minimize that cost. It might not help much, but it does lessen the complexity of the query.
You might get better performance using a stored function or application code (e.g. Java, PHP, ...).
I have an SQL query that uses PostgreSQL's WITH ... AS to act as an XOR or "not" left join. The goal is to return what is unique between the two queries.
In this instance, I want to know what users have transactions within a certain time period AND do not have transactions in another time period. The SQL Query does this by using WITH to select all the transactions for a certain date range in new_transactions, then select all transactions for another date range in older_transactions. From those, we will select from new_transactions what is NOT in older_transactions.
My query in SQL is:
/* New Customers */
WITH new_transactions AS (
select * from transactions
where merchant_id = 1 and inserted_at > date '2017-11-01'
), older_transactions AS (
select * from transactions
where merchant_id = 1 and inserted_at < date '2017-11-01'
)
SELECT * from new_transactions
WHERE user_id NOT IN (select user_id from older_transactions);
I'm trying to replicate this in Ecto via Subquery. I know I can't do a subquery in the where: statement, which leaves me with a left_join. How do I replicate that in Elixir/Ecto?
What I've replicated in Elixir/Ecto throws an (Protocol.UndefinedError) protocol Ecto.Queryable not implemented for [%Transaction....
Elixir/Ecto Code:
def new_merchant_transactions_query(merchant_id, date) do
from t in MyRewards.Transaction,
where: t.merchant_id == ^merchant_id and fragment("?::date", t.inserted_at) >= ^date
end
def older_merchant_transactions_query(merchant_id, date) do
from t in MyRewards.Transaction,
where: t.merchant_id == ^merchant_id and fragment("?::date", t.inserted_at) <= ^date
end
def new_customers(merchant_id, date) do
from t in subquery(new_merchant_transactions_query(merchant_id, date)),
left_join: ot in subquery(older_merchant_transactions_query(merchant_id, date)),
on: t.user_id == ot.user_id,
where: t.user_id != ot.user_id,
select: t.id
end
Update:
I tried changing it to where: is_nil(ot.user_id), but get the same error.
This maybe should be a comment instead of an answer, but it's too long and needs too much formatting so I went ahead and posted this as an answer. With that out of the way, here we go.
What I would do is re-write the query to avoid the Common Table Expression (or CTE; this is what a WITH AS is really called) and the IN() expression, and instead I'd do an actual JOIN, like this:
SELECT n.*
FROM transactions n
LEFT JOIN transactions o ON o.user_id = n.user_id and o.merchant_id = 1 and o.inserted_at < date '2017-11-01'
WHERE n.merchant_id = 1 and n.inserted_at > date '2017-11-01'
AND o.inserted_at IS NULL
You might also choose to do a NOT EXISTS(), which on SQL Server at least will often produce a better execution plan.
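A sketch of the NOT EXISTS() variant, using the same table and filters as above:

SELECT n.*
FROM transactions n
WHERE n.merchant_id = 1
  AND n.inserted_at > date '2017-11-01'
  AND NOT EXISTS (
        SELECT 1
        FROM transactions o
        WHERE o.user_id = n.user_id
          AND o.merchant_id = 1
          AND o.inserted_at < date '2017-11-01'
      );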
Rewriting the query this way is probably better anyway, and once you do, you may also find it solves your problem by making it much easier to translate to Ecto.
I have a query that finds all the events on a user's "last" day (i.e., they do not show up again for 2+ weeks). I want to whittle this down to the last N events they perform before leaving, ordered from most recent to least recent.
I created an unordered table without issue, but when I try to ORDER BY timestamp DESC, it then gives me a "Response too large to return" error.
Why do I get this error when trying to sort (no GROUP BYs or anything), but not on the unordered table?
EDITED TO ADD QUERY BELOW
This query gives me the table with events for users who have not shown up in the last 14 days.
SELECT user.user_key as user_key, user.lastTime as lastTime, evt.actiontime as actiontime, evt.actiontype as actiontype, evt.action_parameters.parameter_name as actionParameterName
FROM (
SELECT user_key , MAX(actiontime) AS lastTime, DATE(MAX(actiontime)) as lastDate
FROM [db.action_log]
WHERE DATEDIFF(CURRENT_TIMESTAMP(), actiontime) >= 14
GROUP EACH BY user_key
HAVING DATEDIFF(CURRENT_TIMESTAMP(), lastTime) >= 14) as user
JOIN EACH(
SELECT user_key, actiontime, actiontype, action_parameters.parameter_name, DATE(actiontime) as actionDate
FROM [db.action_log]
WHERE DATEDIFF(CURRENT_TIMESTAMP(), actiontime) >= 14) as evt
ON (user.user_key = evt.user_key) AND (user.lastDate = evt.actionDate)
WHERE actiontime <= lastTime;
And this runs just fine. I want to use GROUP_CONCAT() to turn the actions into a list, but first I need to sort by actiontime (descending) so that the most recent event is the first in the list. But when I run:
SELECT user_key, lastTime, actiontime, user_level, actiontype, actionParameterName
FROM [db.lastActions]
ORDER BY actiontime DESC;
I get "Response too large to return."
BQ sometimes shouts at you when you have too many combinations of group, order and group_concats in a single query.
However, I believe that a straight group_concat does not create an ordered list, so doing an order by before the concat is essentially meaningless in this case. This question should solve your problem, though.
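If BigQuery standard SQL is an option, STRING_AGG accepts an ORDER BY inside the aggregate, which sidesteps the ordering problem entirely (a sketch assuming standard SQL and the column names from the question):

SELECT user_key,
       STRING_AGG(actiontype ORDER BY actiontime DESC) AS actions
FROM db.lastActions
GROUP BY user_key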
For completeness' sake, here are several useful ways I have found to get around getting shouted at (for whatever reason):
1) Break up your script into smaller segments if possible. This way you can have more control if you are playing around with data.
2) Instead of using order by and then limit, it will be quicker and may be more useful to use more where statements.
3) Use subselects, wisely. For example, grouping first and then ordering should always be quicker than ordering first.
I have a table with sequential timestamps:
2011-03-17 10:31:19
2011-03-17 10:45:49
2011-03-17 10:47:49
...
I need to find the average time difference between each of these (there could be dozens), in seconds or whatever is easiest; I can work with it from there. So for example the inter-arrival time for only the first two times above would be 870 (14m 30s). For all three times it would be: (870 + 120)/2 = 495 (8m 15s).
A note: I am using PostgreSQL 8.1.22.
EDIT: The table I mention above is from a different query that is literally just a one-column list of timestamps
Not sure I understood your question completely, but this might be what you are looking for:
SELECT avg(difference)
FROM (
SELECT timestamp_col - lag(timestamp_col) over (order by timestamp_col) as difference
FROM your_table
) t
The inner query calculates the distance between each row and the preceding row. The result is an interval for each row in the table.
The outer query simply does an average over all differences.
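Since the question asks for the result in seconds, the interval average can be converted with extract(epoch from ...); a sketch built on the same query:

SELECT extract(epoch from avg(difference)) AS avg_seconds
FROM (
  SELECT timestamp_col - lag(timestamp_col) over (order by timestamp_col) as difference
  FROM your_table
) t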
I think you want to find avg(timestamptz).
My solution is avg(current - min value), but since the result is an interval, add it back to the min value again.
SELECT avg(target_col - (select min(target_col) from your_table))
+ (select min(target_col) from your_table)
FROM your_table
If you cannot upgrade to a version of PG that supports window functions, you
may compute your table's sequential steps "the slow way."
Assuming your table is "tbl" and your timestamp column is "ts":
SELECT AVG(t1 - t0)
FROM (
-- All this silliness would be moot if we could use
-- `` lead(ts) over (order by ts) ''
SELECT tbl.ts AS t0,
next.ts AS t1
FROM tbl
CROSS JOIN
tbl next
WHERE next.ts = (
SELECT MIN(ts)
FROM tbl subquery
WHERE subquery.ts > tbl.ts
)
) derived;
But don't do that. Its performance will be terrible. Please do what
a_horse_with_no_name suggests, and use window functions.