ActiveRecord query too difficult for my tiny brain - ruby-on-rails-3

I have an ActiveRecord model Event that I want to order by rank. The rank will be a sum of weighted properties. For instance, I might want to rank an event with logic like:
LOG(number of followers) + (7 - number of days from now)
The following works, but is not satisfactory because it returns a result set instead of a relation object, so I won't be able to treat it as a scope. (FYI: I am using Postgres with the PostGIS extension.)
x = Event.find_by_sql("
SELECT
(
CASE WHEN COUNT(follows.*) = 0
THEN 0
ELSE LOG(COUNT(follows.*)) END
+
SIGN(7 - (EXTRACT(EPOCH FROM start) - EXTRACT(EPOCH FROM NOW())) / 86400) * LOG(ABS(7 - (EXTRACT(EPOCH FROM start) - EXTRACT(EPOCH FROM NOW())) / 86400))
) as score,
events.*
FROM events
LEFT OUTER JOIN follows
ON events.id = follows.followable_id AND follows.followable_type = 'Event' AND follows.blocked = 'f'
WHERE (events.start > NOW()) AND (ST_DWithin(st_setsrid(st_point(#{@location[:lng]}, #{@location[:lat]}), 4326), st_transform(loc, 4326), 48280.2, true))
GROUP BY events.id
ORDER BY 1 DESC
")
I understand that I could add a counter cache to the Events table and avoid the join, but in the future I will want to compute the rank through some other association so it would be very helpful to know how.
Thanks

This is actually pretty easy to split up into ActiveRecord::Relation queries.
x = Event.select("(
CASE WHEN COUNT(follows.*) = 0
THEN 0
ELSE LOG(COUNT(follows.*)) END
+
SIGN(7 - (EXTRACT(EPOCH FROM start) - EXTRACT(EPOCH FROM NOW())) / 86400) * LOG(ABS(7 - (EXTRACT(EPOCH FROM start) - EXTRACT(EPOCH FROM NOW())) / 86400))
) as score,
events.*")
x = x.joins("LEFT OUTER JOIN follows
ON events.id = follows.followable_id AND follows.followable_type = 'Event' AND follows.blocked = 'f'")
x = x.where("(events.start > NOW()) AND (ST_DWithin(st_setsrid(st_point(#{#location[:lng]}, #{#location[:lat]}), 4326), st_transform(loc, 4326), 48280.2, true))")
x = x.group('events.id')
x = x.order('1 desc')
Of course, I would recommend splitting these up into various scopes, but this should at least point you in the right direction.
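For example, the chain could be packaged as scopes along these lines (a sketch; the scope names are illustrative, and SCORE_SQL is just the score expression extracted from the SELECT above):
class Event < ActiveRecord::Base
  # Weighted-score expression from the question, extracted for reuse.
  SCORE_SQL = "(CASE WHEN COUNT(follows.*) = 0 THEN 0 ELSE LOG(COUNT(follows.*)) END " \
              "+ SIGN(7 - (EXTRACT(EPOCH FROM start) - EXTRACT(EPOCH FROM NOW())) / 86400) " \
              "* LOG(ABS(7 - (EXTRACT(EPOCH FROM start) - EXTRACT(EPOCH FROM NOW())) / 86400)))"

  scope :with_score, select("#{SCORE_SQL} AS score, events.*").
    joins("LEFT OUTER JOIN follows ON events.id = follows.followable_id " \
          "AND follows.followable_type = 'Event' AND follows.blocked = 'f'").
    group('events.id').
    order('score DESC')

  scope :upcoming, where('events.start > NOW()')

  scope :near, lambda { |lng, lat|
    where('ST_DWithin(st_setsrid(st_point(?, ?), 4326), st_transform(loc, 4326), 48280.2, true)',
          lng, lat)
  }
end

# Usage:
# Event.with_score.upcoming.near(location[:lng], location[:lat])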

I don't think this answers your question, but I want to use code formatting.
I would do something simpler, as I am more comfortable with Ruby: I would create new methods in your Event class:
def rank
  # your rank computation, based on model columns and associations
  Math.log(followers_count) + (7 - number_of_days_from_now)
end

def self.by_rank(order = 'desc')
  all.sort { |a, b| order == 'desc' ? b.rank <=> a.rank : a.rank <=> b.rank }
end
Then you can scale by caching the computation in a rank column on the events table, which lets you do something like this:
Event.order('rank desc')
Event.order('rank asc')
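If you go the cached-column route, the migration might look like this (a sketch; keeping the value fresh is left to a callback or a periodic job):
class AddRankToEvents < ActiveRecord::Migration
  def change
    # Cached score, recomputed whenever followers or dates change.
    # NB: rename either this column or the rank instance method above,
    # or the method will shadow the column reader.
    add_column :events, :rank, :float
    add_index  :events, :rank
  end
end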

Related

Show different field from table depending on job code

I'm after some ideas on how I can write some Oracle SQL code to show a different field value depending on what the job_type_code is.
If the job_type_code is either EN11 or EN12 the only jobs that need to be returned are those where the target_comp_date is in the past.
If the job_type_code is either EN90 or EN91 all jobs should be displayed with the target_comp_date.
The code I've tried is below.
select
  case
    when job_type.job_type_code in ('EN11','EN12') and job.target_comp_date < SYSDATE
      then job.target_comp_date
    when job_type.job_type_code in ('EN90','EN91')
      then job.target_comp_date
    else 'check'
  end as Test
from job
inner join job_type on job.job_type_key = job_type.job_type_key
I think you need a WHERE clause here, not a CASE expression:
SELECT *
FROM job j
INNER JOIN job_type jt
ON j.job_type_key = jt.job_type_key
WHERE
(jt.job_type_code IN ('EN11', 'EN12') AND j.target_comp_date < SYSDATE) OR
jt.job_type_code IN ('EN90', 'EN91');
A CASE expression in the SELECT list is not going to filter any data. This sounds like a WHERE clause:
where (job_type_code in ('EN11', 'EN12') and target_comp_date < sysdate)
   or job_type_code in ('EN90', 'EN91')
It is not clear whether you also want other job_type_codes. If so, add or job_type_code not in ('EN11', 'EN12', 'EN90', 'EN91').
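Putting it together with that optional extra branch, the full query might read (a sketch):
SELECT *
FROM job j
INNER JOIN job_type jt
  ON j.job_type_key = jt.job_type_key
WHERE (jt.job_type_code IN ('EN11', 'EN12') AND j.target_comp_date < SYSDATE)
   OR jt.job_type_code IN ('EN90', 'EN91')
   OR jt.job_type_code NOT IN ('EN11', 'EN12', 'EN90', 'EN91');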

BETWEEN and standard comparison operators Oracle SQL

As I need a range of data I am using BETWEEN because, as far as I know, the two queries below should be the same:
select * from table1 where my_date1 - my_date2 between (-1) and (-30);
and
select * from table1 where my_date1 - my_date2 <= (-1) and my_date1 - my_date2 >= (-30);
However when I try it in my script:
SELECT
a.account_no AS ACCOUNT_NO,
a.installment_no AS INSTALLMENT_NO,
a.INSTALLMENT_DATE AS INSTALLMENT_DATE
FROM myTable a
INNER JOIN (SELECT
ACCOUNT_NO,
MIN(INSTALLMENT_NO) AS INSTALLMENT_NO
FROM myTable
WHERE
ACCOUNT_NO IS NOT NULL
AND INSTALLMENT_NO IS NOT NULL
AND STATUS = 'A'
GROUP BY ACCOUNT_NO) b
ON A.ACCOUNT_NO = B.ACCOUNT_NO AND A.INSTALLMENT_NO = B.INSTALLMENT_NO
WHERE (TRUNC(INSTALLMENT_DATE) - TRUNC(TO_DATE('12/01/2011','DD/MM/YYYY'))) BETWEEN (-1) AND (-30) -- If I change this
I get 0 rows. But when I change
WHERE (TRUNC(INSTALLMENT_DATE) - TRUNC(TO_DATE('12/01/2011','DD/MM/YYYY'))) BETWEEN (-1) AND (-30)
to
WHERE (TRUNC(INSTALLMENT_DATE) - TRUNC(TO_DATE('12/01/2011','DD/MM/YYYY'))) <= (-1) and (TRUNC(INSTALLMENT_DATE) - TRUNC(TO_DATE('12/01/2011','DD/MM/YYYY'))) >= (-30)
I get more than 0 rows. I would like to use BETWEEN as it is more readable. Am I missing something?
I believe the syntax for the range used with BETWEEN is:
WHERE col BETWEEN <smaller_value> AND <larger_value>
which is equivalent to
WHERE col >= <smaller_value> AND col <= <larger_value>
Your current WHERE clause is looking for a date difference greater than or equal to -1 and less than or equal to -30. No value can satisfy both, so the condition will never be true, and it eliminates all the records you are trying to target. To fix this, correct the range:
WHERE (TRUNC(INSTALLMENT_DATE) - TRUNC(TO_DATE('12/01/2011','DD/MM/YYYY')))
BETWEEN (-30) AND (-1)
https://docs.oracle.com/cd/B28359_01/server.111/b28286/conditions011.htm#SQLRF52147 says:
expr1 [NOT] BETWEEN expr2 AND expr3
If expr3 < expr2, then the interval is empty.
BETWEEN is a syntax shortcut that is evaluated as
WHERE col >= [smaller_value] AND col <= [larger_value]
and it is VITAL that the boundary values are presented in that order (small, then large); otherwise the condition will never be satisfied.
HOWEVER, I never recommend using BETWEEN for date ranges; I suggest that one always uses this syntax instead:
WHERE col >= [smaller_value] AND col < [larger_value] + 1
This syntax allows accurate filtering of date/time information at any level of time precision.
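Applied to the query above, the half-open pattern might look like this (a sketch; it is equivalent to the corrected BETWEEN (-30) AND (-1) because TRUNC reduces the difference to whole days, and leaving INSTALLMENT_DATE bare allows an index on that column to be used):
WHERE INSTALLMENT_DATE >= TRUNC(TO_DATE('12/01/2011','DD/MM/YYYY')) - 30
  AND INSTALLMENT_DATE <  TRUNC(TO_DATE('12/01/2011','DD/MM/YYYY'))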

Fetch rows based on condition

I am using PostgreSQL on Amazon Redshift.
My table is:
drop table APP_Tax;
create temp table APP_Tax(APP_nm varchar(100),start timestamp,end1 timestamp);
insert into APP_Tax values('AFH','2016-01-26 00:39:51','2016-01-26 00:39:55'),
('AFH','2016-01-26 00:39:56','2016-01-26 00:40:01'),
('AFH','2016-01-26 00:40:05','2016-01-26 00:40:11'),
('AFH','2016-01-26 00:40:12','2016-01-26 00:40:15'), --row x
('AFH','2016-01-26 00:40:35','2016-01-26 00:41:34') --row y
Expected output:
'AFH','2016-01-26 00:39:51','2016-01-26 00:40:15'
'AFH','2016-01-26 00:40:35','2016-01-26 00:41:34'
I need to compare the end time of each record with the start time of the next one, and if the time difference is < 10 seconds, take the next record's end time instead, continuing until the last record of the chain.
I.e. datediff(seconds, '2016-01-26 00:39:55', '2016-01-26 00:39:56') is < 10 seconds.
I tried this :
SELECT a.app_nm
,min(a.start)
,max(b.end1)
FROM APP_Tax a
INNER JOIN APP_Tax b
ON a.APP_nm = b.APP_nm
AND b.start > a.start
WHERE datediff(second, a.end1, b.start) < 10
GROUP BY 1
It works, but it doesn't return row y, for which the condition fails.
There are two reasons that row y is not returned:
b.start > a.start means that a row will never join with itself
The GROUP BY will return only one record per APP_nm value, yet all rows have the same value.
However, there are further logic errors that the query will not successfully handle. For example, how does it know when a "new" session begins?
The logic you seek can be achieved in normal PostgreSQL with the help of the DISTINCT ON clause, which returns one row per distinct value of a specific column. However, DISTINCT ON is not supported by Redshift.
Some potential workarounds: DISTINCT ON like functionality for Redshift
The output you seek would be trivial using a programming language (which can loop through results and store variables) but is difficult to apply to an SQL query (which is designed to operate on rows of results). I would recommend extracting the data and running it through a simple script (eg in Python) that could then output the Start & End combinations you seek.
This is an excellent use-case for a Hadoop Streaming function, which I have successfully implemented in the past. It would take the records as input, then 'remember' the start time and would only output a record when the desired end-logic has been met.
Sounds like what you are after is "sessionisation" of the activity events. You can achieve that in Redshift using window functions.
The complete solution might look like this:
SELECT
  start AS session_start,
  session_end
FROM (
  SELECT
    start,
    end1,
    LEAD(end1, 1) OVER (ORDER BY end1) AS session_end,
    session_boundary
  FROM (
    SELECT
      start,
      end1,
      CASE WHEN session_switch = 0 AND reverse_session_switch = 1
           THEN 'start'
           ELSE 'end'
      END AS session_boundary
    FROM (
      SELECT
        start,
        end1,
        CASE WHEN DATEDIFF(seconds, end1,
                           LEAD(start, 1) OVER (ORDER BY end1 ASC)) > 10
             THEN 1 ELSE 0
        END AS session_switch,
        CASE WHEN DATEDIFF(seconds,
                           LEAD(end1, 1) OVER (ORDER BY end1 DESC), start) > 10
             THEN 1 ELSE 0
        END AS reverse_session_switch
      FROM app_tax
    ) AS sessioned
    WHERE session_switch != 0 OR reverse_session_switch != 0
    UNION
    SELECT
      start,
      end1,
      'start'
    FROM (
      SELECT
        start,
        end1,
        ROW_NUMBER() OVER (PARTITION BY APP_nm ORDER BY end1 ASC) AS row_num
      FROM APP_Tax
    ) AS with_row_number
    WHERE row_num = 1
  ) AS with_boundary
) AS with_end
WHERE session_boundary = 'start'
ORDER BY start ASC
;
Here is the breakdown (by subquery name):
sessioned - we first identify the switch rows (out and in), the rows in which the gap between one record's end and the next record's start exceeds the limit.
with_row_number - just a patch to extract the first row because there is no switch into it (there is an implicit switch that we record as 'start')
with_boundary - then we identify the rows where the specific switches occur. If you run the subquery by itself, it is clear that a session starts when session_switch = 0 AND reverse_session_switch = 1, and ends when the opposite occurs. All other rows are in the middle of sessions, so they are ignored.
with_end - finally, we pair each 'start' row with the end time of the following 'end' row (thus defining the session duration), and remove the 'end' rows
The with_boundary subquery answers your initial question, but typically you'd want to combine those rows to get the final result, which is the session duration.
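A more compact route to the same sessionisation is the classic gaps-and-islands pattern, which also runs on Redshift (LAG, DATEDIFF and framed window SUM are all supported). The sketch below flags each row whose gap from the previous row's end is 10 seconds or more, turns the flags into session numbers with a running SUM, and then aggregates one row per session:
SELECT APP_nm,
       MIN(start) AS session_start,
       MAX(end1)  AS session_end
FROM (
  SELECT APP_nm, start, end1,
         -- running total of "new session" flags = session number
         SUM(is_new) OVER (PARTITION BY APP_nm ORDER BY start
                           ROWS UNBOUNDED PRECEDING) AS session_id
  FROM (
    SELECT APP_nm, start, end1,
           -- 1 when the gap since the previous end is >= 10 seconds
           -- (or when there is no previous row)
           CASE WHEN DATEDIFF(seconds,
                              LAG(end1) OVER (PARTITION BY APP_nm ORDER BY start),
                              start) < 10
                THEN 0 ELSE 1 END AS is_new
    FROM APP_Tax
  ) flagged
) numbered
GROUP BY APP_nm, session_id
ORDER BY session_start;
Each output row is one session; for the sample data this yields exactly the two expected rows.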

PostgreSQL range-based join works too slow

Suppose I have 2 tables, Events and Callbacks, with the following structure:
Event:
id
timestamp (BIGINT, btree index)
type (VARCHAR, btree index)
(composite index on (timestamp, type))
Callback:
id
timestamp (BIGINT, btree index)
event_type
The Event table contains about (M=) 300000 rows, Callbacks about (N=) 25000.
I am trying to do something like:
SELECT * FROM Callback
JOIN Event
ON ABS(Callback.timestamp - Event.Timestamp) < 300000 AND
Callback.event_type = Event.type;
As planned, it should run in O(N log(M) + R) (where R is the result size; R is about 1000000, on average 50 events for each callback), but in practice it takes about 40 minutes on a powerful CPU.
UPD: Sorry, I forgot to say that I also tried:
SELECT * FROM Callback
JOIN Event
ON Event.Timestamp < Callback.timestamp + 300000 AND
Event.Timestamp > Callback.timestamp - 300000 AND
Callback.event_type = Event.type;
But nothing changes.
Can anyone tell me what I am doing wrong?
Thank you.
Perhaps the following will work with an index on event(type, timestamp):
SELECT *
FROM Callback c JOIN
Event e
ON c.event_type = e.type AND e.Timestamp > c.timestamp - 300000;
The idea is to leave one of the timestamp columns unmodified; applying a function or arithmetic to a column can prevent the use of an index on it.
I do wonder if you also want a condition on c.timestamp >= e.TimeStamp. Your performance problem might simply be the volume of data you are returning.
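For reference, that composite index might be created like so (using the question's table and column names):
CREATE INDEX event_type_timestamp_idx ON Event (type, timestamp);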
Re-arrange your join conditions so that one column is expressed as a function of the other, something like:
SELECT * FROM Callback
JOIN Event
ON (Event.Timestamp > (Callback.timestamp - 300000) AND
Callback.event_type = Event.type);
... or ...
SELECT * FROM Callback
JOIN Event
ON (Callback.timestamp > (Event.Timestamp - 300000) AND
Callback.event_type = Event.type);
(I think I got the >'s and <'s the right way round).
This allows indexes on the columns to be used, but I wouldn't rule out the possibility that full scans will be needed on both tables. It depends on the data distribution of the values.

SQL multiple SELECT too slow (7 min)

This query works, but it is too slow.
Function:
Select all rows where mlap_type is SC, calendar_id is not like %%5, and 2013.07.11 < mlap_date < 2013.07.18, and where older rows exist for the same address.
Method:
Find the matching rows, then check each one to see whether the same address also occurs within the preceding 28 days.
select efi_name, efi_id, count(*) as dupes, id, mlap_date
from address m
where
mlap_date > "2013.07.11"
and mlap_date < "2013.07.18"
and mlap_type = "SC"
and calendar_id not like "%%5"
and concat(efi_id,irsz,ucase(city), ucase(address)) in (
select concat(k.efi_id,k.irsz,ucase(k.city), ucase(k.address)) as dupe
from address k
where k.mlap_date > adddate(m.`mlap_date`,-28)
and k.mlap_date < m.mlap_date
and k.mlap_type = "SC"
and k.calendar_id not like "%%5"
and k.status = 'Befejezett'
group by concat(k.efi_id,k.irsz,ucase(k.city), ucase(k.address))
having (count(*) > 1)
)
group by concat(efi_id,irsz,ucase(city), ucase(address))
Thanks for helping!
NOT LIKE plus wildcard-prefixed terms are index-usage killers.
You could also try replacing the IN + inline table with an inner join: does the optimizer run the NOT LIKE query twice (see your explain plan)?
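That join rewrite might look something like this (a sketch that approximates the original correlated logic; treat it as a starting point, not a drop-in replacement):
select m.efi_name, m.efi_id, count(*) as dupes, m.id, m.mlap_date
from address m
inner join address k
  on  k.efi_id = m.efi_id
  and k.irsz = m.irsz
  and ucase(k.city) = ucase(m.city)
  and ucase(k.address) = ucase(m.address)
  and k.mlap_date > adddate(m.mlap_date, -28)
  and k.mlap_date < m.mlap_date
  and k.mlap_type = 'SC'
  and k.calendar_id not like '%%5'
  and k.status = 'Befejezett'
where m.mlap_date > '2013.07.11'
  and m.mlap_date < '2013.07.18'
  and m.mlap_type = 'SC'
  and m.calendar_id not like '%%5'
group by m.efi_id, m.irsz, ucase(m.city), ucase(m.address)
having count(*) > 1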
It looks like you might be using MySql, in which case you could build a hash column based on
efi_id
irsz
ucase(city)
ucase(address)
and compare that column directly. This is a way of implementing a hash join in MySql.
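In the MySQL of that era (no generated columns yet), the hash column would be maintained by hand; a sketch, with addr_hash and the index name purely illustrative:
alter table address add column addr_hash char(32);

-- CONCAT_WS skips NULLs, unlike CONCAT; adjust if NULL semantics matter
update address
set addr_hash = md5(concat_ws('|', efi_id, irsz, ucase(city), ucase(address)));

create index idx_address_addr_hash on address (addr_hash);
The inner and outer queries can then compare addr_hash directly instead of recomputing the four-column CONCAT per row, and the index makes each probe cheap.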
I don't think you need a subquery to do this. You should be able to do it with just the outer GROUP BY and conditional aggregation.
select efi_name, efi_id,
       sum(case when mlap_date > '2013.07.11' and mlap_date < '2013.07.18' then 1 else 0 end) as dupes,
       id, mlap_date
from address m
where mlap_type = 'SC' and calendar_id not like '%%5'
group by efi_id, irsz, ucase(city), ucase(address)
having sum(case when m.status = 'Befejezett' and
                     m.mlap_date <= '2013.07.11' and
                     m.mlap_date > adddate(date('2013.07.11'), -28)
                then 1
                else 0
           end) > 1
This produces a slightly different result from your query. Instead of looking at the 28 days before each record, it looks at all records in the week period and then at the four weeks before that period. Despite this subtle difference, it is still identifying dupes in the four-week period before the one-week period.