Trying to understand the Delayed Job polling query (SQL)

I'm trying to port Delayed Job to Haskell and am unable to understand the WHERE clause in the query that DJ fires to poll the next job:
UPDATE "delayed_jobs"
SET locked_at = '2017-07-18 03:33:51.729884',
locked_by = 'delayed_job.0 host:myhostname pid:21995'
WHERE id IN (
SELECT id FROM "delayed_jobs"
WHERE
(
(
run_at <= '2017-07-18 03:33:51.729457'
AND (locked_at IS NULL OR locked_at < '2017-07-17 23:33:51.729488')
OR locked_by = 'delayed_job.0 host:myhostname pid:21995'
)
AND failed_at IS NULL
) ORDER BY priority ASC, run_at ASC LIMIT 1 FOR UPDATE) RETURNING *
The structure of the WHERE clause is the following:
(run_at_condition AND locked_at_condition OR locked_by_condition)
AND failed_at_condition
Are there inner parentheses missing in run_at_condition AND locked_at_condition OR locked_by_condition? With what precedence are the AND/OR conditions evaluated?
What is the purpose of the locked_by_condition? It seems to be picking up jobs that have already been locked by the current DJ process!

The statement is probably fine. The purpose of the whole statement is to take the lock on the highest-priority job by setting its locked_at/locked_by fields.
The WHERE condition reads roughly as: "if run_at is earlier than now (the job is due), AND the job is either not locked or was locked more than four hours ago... alternatively, all of that is overridden if it was me that locked it; and of course, only if it hasn't failed, THEN lock it." So, if I'm reading it correctly, it runs jobs that are ready to run, but with a timeout so that jobs can't stay locked forever.
To your question about precedence: AND binds more tightly than OR:
SELECT 'yes' WHERE false AND false OR true; -- 'yes', 1 row
SELECT 'yes' WHERE (false AND false) OR true; -- 'yes', 1 row
SELECT 'yes' WHERE false AND (false OR true); -- 0 rows
The first two statements mean the same thing; the third one is different.
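Applied to the polling query, the condition therefore parses as if it were written with these explicit parentheses (added here purely for illustration):
WHERE
(
    (
        run_at <= '2017-07-18 03:33:51.729457'
        AND (locked_at IS NULL OR locked_at < '2017-07-17 23:33:51.729488')
    )
    OR locked_by = 'delayed_job.0 host:myhostname pid:21995'
)
AND failed_at IS NULL
So no parentheses are missing; the OR deliberately sits at the outer level of the first group.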
As for the locked_by condition: it may just be a rough sort of ownership system. If the current process is the one that locked a job, it is allowed to pick that job up again and override its own lock.


Insert query failed in Vertica with ERROR code 4534 when triggered from RStudio

I am executing an insert query on a Vertica DB, and it works fine when triggered from a SQL client (SQuirreL). But when I trigger the same query from RStudio, it returns the following error:
Error in .local(conn, statement, ...) : execute JDBC update query
failed in dbSendUpdate ([Vertica]VJDBC ERROR: Receive on
v_default_node0005: Message receipt from v_default_node0008 failed [])
The SQL query looks something like this:
insert into SCHEMA1.TEMP_NEW(
SELECT C.PROGRAM_GROUP_ID,
C.POPULATION_ID,
C.PROGRAM_ID,
C.FULLY_QUALIFIED_NAME,
C.STATE,
C.DATA_POINT_TYPE,
C.SOURCE_TYPE,
B.SOURCE_DATA_PARTITION_ID AS DATA_PARTITION_ID,
C.PRIMARY_CODE_PRIMARY_DISPLAY,
C.PRIMARY_CODE_ID,
C.PRIMARY_CODING_SYSTEM_ID,
C.PRIMARY_CODE_RAW_CODE_DISPLAY,
C.PRIMARY_CODE_RAW_CODE_ID,
C.PRIMARY_CODE_RAW_CODING_SYSTEM_ID,
(C.COMPONENT_QUALIFIED_NAME)||('/2') AS SPLIT_PART,
Count(*) AS RECORD_COUNT
from (SELECT DPL.PROGRAM_GROUP_ID,
DPL.POPULATION_ID,
DPL.PROGRAM_ID,
DPL.FULLY_QUALIFIED_NAME,
'MET' AS STATE,
DPL.DATA_POINT_TYPE,
DPL.IDENTIFIER_SOURCE_TYPE AS SOURCE_TYPE,
DPL.IDENTIFIER_SOURCE_DATA_PARTITION_ID AS DATA_PARTITION_ID,
DPL.PRIMARY_CODE_PRIMARY_DISPLAY,
DPL.PRIMARY_CODE_ID,
DPL.PRIMARY_CODING_SYSTEM_ID,
DPL.PRIMARY_CODE_RAW_CODE_DISPLAY,
DPL.PRIMARY_CODE_RAW_CODE_ID,
DPL.PRIMARY_CODE_RAW_CODING_SYSTEM_ID,
DPL.supporting_data_point_lite_id,
DPL.COMPONENT_QUALIFIED_NAME,
COUNT(*) AS RECORD_COUNT
FROM SCHEMA2.TABLE1 DPL
WHERE DPL.DATA_POINT_TYPE <> 'PREFERRED_DEMOGRAPHICS'
AND DPL.DATA_POINT_TYPE <> 'PERSON_DEMOGRAPHICS'
AND DPL.DATA_POINT_TYPE <> 'CALCULATED_RISK_SCORE'
AND DPL.DATA_POINT_TYPE <> '_NOT_RECOGNIZED'
AND DPL.POPULATION_ID NOT ILIKE '%ARCHIVE%'
AND DPL.POPULATION_ID NOT ILIKE '%SNAPSHOT%'
AND DPL.PROGRAM_GROUP_ID = '<PROGRAM_GROUP_ID>'
AND PROGRAM_GROUP_ID IS NOT NULL
AND DPL.IDENTIFIER_SOURCE_DATA_PARTITION_ID IS NULL
AND DPL.PRIMARY_CODE_RAW_CODE_ID IS NOT NULL
AND DPL.PRIMARY_CODE_ID IS NOT NULL
AND EXISTS (SELECT 1
FROM SCHEMA2.TABLE2 MO
WHERE MO.STATE = 'MET'
AND MO.POPULATION_ID NOT ILIKE '%ARCHIVE%'
AND MO.POPULATION_ID NOT ILIKE '%SNAPSHOT%'
AND DPL.PROGRAM_GROUP_ID = MO.PROGRAM_GROUP_ID
AND DPL.PROGRAM_ID = MO.PROGRAM_ID
AND DPL.FULLY_QUALIFIED_NAME = MO.FULLY_QUALIFIED_NAME
AND DPL.OUTCOME_SEQUENCE = MO.MEASURE_OUTCOME_SEQ
AND MO.PROGRAM_GROUP_ID = '<PROGRAM_GROUP_ID>')
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16) AS C
Left Join
(SELECT DISTINCT SOURCE_DATA_PARTITION_ID,
supporting_data_point_lite_id
FROM SCHEMA2.TABLE3 DPI
where DPI.SOURCE_DATA_PARTITION_ID is not null
AND EXISTS (SELECT 1
FROM (SELECT DPL.supporting_data_point_lite_id
FROM SCHEMA2.TABLE1 DPL
WHERE DPL.DATA_POINT_TYPE <> 'PREFERRED_DEMOGRAPHICS'
AND DPL.DATA_POINT_TYPE <> 'PERSON_DEMOGRAPHICS'
AND DPL.DATA_POINT_TYPE <> 'CALCULATED_RISK_SCORE'
AND DPL.DATA_POINT_TYPE <> '_NOT_RECOGNIZED'
AND DPL.POPULATION_ID NOT ILIKE '%ARCHIVE%'
AND DPL.POPULATION_ID NOT ILIKE '%SNAPSHOT%'
AND DPL.PROGRAM_GROUP_ID = '<PROGRAM_GROUP_ID>'
AND PROGRAM_GROUP_ID IS NOT NULL
AND DPL.IDENTIFIER_SOURCE_DATA_PARTITION_ID IS NULL
AND DPL.PRIMARY_CODE_RAW_CODE_ID IS NOT NULL
AND DPL.PRIMARY_CODE_ID IS NOT NULL
AND EXISTS (SELECT 1
FROM SCHEMA2.TABLE2 MO
WHERE MO.STATE = 'MET'
AND MO.POPULATION_ID NOT ILIKE '%ARCHIVE%'
AND MO.POPULATION_ID NOT ILIKE '%SNAPSHOT%'
AND DPL.PROGRAM_GROUP_ID = MO.PROGRAM_GROUP_ID
AND DPL.PROGRAM_ID = MO.PROGRAM_ID
AND DPL.FULLY_QUALIFIED_NAME = MO.FULLY_QUALIFIED_NAME
AND DPL.OUTCOME_SEQUENCE = MO.MEASURE_OUTCOME_SEQ
AND MO.PROGRAM_GROUP_ID = '<PROGRAM_GROUP_ID>')) SDP
WHERE DPI.supporting_data_point_lite_id = SDP.supporting_data_point_lite_id)) AS B
on C.supporting_data_point_lite_id = B.supporting_data_point_lite_id
group by 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
Only the schema and table names have been replaced; all other details are the same.
Can someone please help me fix this error?
This error means that some node-to-node communication during the processing of your query failed.
There are many possible reasons this could happen. Sometimes a poor network or other environment issues could cause this to occur. If v_default_node0008 was taken down while this query was running for example, you may see this message. Other times it can be the sign of a Vertica bug, in which case you'd have to take it up with support and/or your administrator.
Normally when a query plan is executing, the control flow happens from the bottom up. At the lowest levels of the plan, various scan(s) read from projections, and when there's no data left to feed to operators above the scan(s), they stop, which causes their neighboring operators to stop, until ultimately the root operator stops and the query finishes.
Occasionally, there is a need to end the query in a top-down fashion. When you have many nodes, each passing data between multiple threads in service of your query, it can be tricky for Vertica to tear down everything atomically in a deterministic fashion. If a thread sending data stops before the thread receiving data was expecting it to (because the receiver hasn't realized the plan is being stopped yet), then it may log this error message. Usually when that happens it is innocuous; you'll see it in vertica.log but it doesn't bubble all the way up to the application. If one of those is making its way to the application then it is probably a Vertica bug.
So when can this happen?
One common scenario is when you have a LIMIT clause. The different scans producing rows on different nodes can't coordinate directly, so they have to be told by operators higher up in the plan when the limit has been reached.
It also happens when a query gets canceled. Cancellation can happen for many reasons: at the request of the application, from the DBA running interrupt_statement on your query, or via resource pool policy. If you exceed the RUNTIMECAP for your resource pool, for example, the query is automatically cancelled once it passes that configured execution-time threshold.
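If you suspect the RUNTIMECAP case, the pool settings are easy to inspect. A quick sketch (the RESOURCE_POOLS system table lives in Vertica's V_CATALOG schema, but verify the column names against your version):
-- Show the runtime cap (if any) configured for each resource pool
SELECT name, runtimecap
FROM v_catalog.resource_pools
ORDER BY name;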
There may be others too, but those are the most common cases. It won't always be obvious that either limits or cancels are happening to you: the query may be rewritten to include a limit at various stages, and the application and/or the DBA's policy may be affecting things under the covers.
While this doesn't directly solve your problem, it hopefully gives you some additional context and ideas for further troubleshooting. The problem is likely going to be very specific to your use case, environment and data, and could be a bug. If you can't make progress I'd suggest taking it to Vertica support, since they will be more equipped to help you make sense of this further.

Fetch rows based on condition

I am using PostgreSQL on Amazon Redshift.
My table is :
drop table APP_Tax;
create temp table APP_Tax(APP_nm varchar(100),start timestamp,end1 timestamp);
insert into APP_Tax values('AFH','2016-01-26 00:39:51','2016-01-26 00:39:55'),
('AFH','2016-01-26 00:39:56','2016-01-26 00:40:01'),
('AFH','2016-01-26 00:40:05','2016-01-26 00:40:11'),
('AFH','2016-01-26 00:40:12','2016-01-26 00:40:15'), --row x
('AFH','2016-01-26 00:40:35','2016-01-26 00:41:34') --row y
Expected output:
'AFH','2016-01-26 00:39:51','2016-01-26 00:40:15'
'AFH','2016-01-26 00:40:35','2016-01-26 00:41:34'
I need to compare the end time of each record with the start time of the next record, and if the time difference is < 10 seconds, take the next record's end time, continuing until the last (final) record of the group.
I.e. datediff(seconds, '2016-01-26 00:39:55', '2016-01-26 00:39:56') is < 10 seconds.
I tried this :
SELECT a.app_nm
,min(a.start)
,max(b.end1)
FROM APP_Tax a
INNER JOIN APP_Tax b
ON a.APP_nm = b.APP_nm
AND b.start > a.start
WHERE datediff(second, a.end1, b.start) < 10
GROUP BY 1
It works, but it doesn't return row y when the condition fails.
There are two reasons that row y is not returned:
b.start > a.start means that a row will never join with itself
The GROUP BY will return only one record per APP_nm value, yet all rows have the same value.
However, there are further logic errors that the query will not handle successfully. For example, how does it know when a "new" session begins?
The logic you seek can be achieved in normal PostgreSQL with the help of DISTINCT ON, which returns one row per distinct value of a given column. However, DISTINCT ON is not supported by Redshift.
Some potential workarounds are collected in "DISTINCT ON like functionality for Redshift".
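For reference, this is what the PostgreSQL idiom looks like (a sketch against the APP_Tax table above; it will not run on Redshift):
-- PostgreSQL only: keep one row per APP_nm, choosing the earliest start
SELECT DISTINCT ON (APP_nm) APP_nm, start, end1
FROM APP_Tax
ORDER BY APP_nm, start;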
The output you seek would be trivial using a programming language (which can loop through results and store variables) but is difficult to apply to an SQL query (which is designed to operate on rows of results). I would recommend extracting the data and running it through a simple script (eg in Python) that could then output the Start & End combinations you seek.
This is an excellent use-case for a Hadoop Streaming function, which I have successfully implemented in the past. It would take the records as input, then 'remember' the start time and would only output a record when the desired end-logic has been met.
Sounds like what you are after is "sessionisation" of the activity events. You can achieve that in Redshift using window functions.
The complete solution might look like this:
SELECT
start AS session_start,
session_end
FROM (
SELECT
start,
end1,
lead(end1, 1)
OVER (
ORDER BY end1) AS session_end,
session_boundary
FROM (
SELECT
start,
end1,
CASE WHEN session_switch = 0 AND reverse_session_switch = 1
THEN 'start'
ELSE 'end' END AS session_boundary
FROM (
SELECT
start,
end1,
CASE WHEN datediff(seconds, end1, lead(start, 1)
OVER (
ORDER BY end1 ASC)) > 10
THEN 1
ELSE 0 END AS session_switch,
CASE WHEN datediff(seconds, lead(end1, 1)
OVER (
ORDER BY end1 DESC), start) > 10
THEN 1
ELSE 0 END AS reverse_session_switch
FROM app_tax
)
AS sessioned
WHERE session_switch != 0 OR reverse_session_switch != 0
UNION
SELECT
start,
end1,
'start'
FROM (
SELECT
start,
end1,
row_number()
OVER (PARTITION BY APP_nm
ORDER BY end1 ASC) AS row_num
FROM APP_Tax
) AS with_row_number
WHERE row_num = 1
) AS with_boundary
) AS with_end
WHERE session_boundary = 'start'
ORDER BY start ASC
;
Here is the breakdown (by subquery name):
sessioned - we first identify the switch rows (out and in), i.e. the rows in which the gap between end and start exceeds the limit.
with_row_number - just a patch to extract the first row, because there is no switch into it (there is an implicit switch that we record as 'start').
with_boundary - then we identify the rows where the specific switches occur. If you run the subquery by itself, it is clear that a session starts when session_switch = 0 AND reverse_session_switch = 1, and ends when the opposite occurs. All other rows are in the middle of sessions, so they are ignored.
with_end - finally, we pair each 'start' row with the end time of its matching 'end' row (thus defining the session duration), and remove the 'end' rows.
The with_boundary subquery answers your initial question, but typically you'd want to combine those rows to get the final result, which is the session duration.
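For comparison, the same gaps-and-islands logic can usually be written more compactly with LAG and a running SUM. A minimal sketch against the APP_Tax table above; it assumes only standard Redshift window functions and is untested:
SELECT APP_nm,
       MIN(start) AS session_start,
       MAX(end1)  AS session_end
FROM (
    SELECT APP_nm, start, end1,
           -- running count of session starts = session number
           SUM(new_session) OVER (PARTITION BY APP_nm
                                  ORDER BY start
                                  ROWS UNBOUNDED PRECEDING) AS session_id
    FROM (
        SELECT APP_nm, start, end1,
               -- 1 when the gap to the previous row's end is >= 10s (or there is no previous row)
               CASE WHEN datediff(second,
                                  LAG(end1) OVER (PARTITION BY APP_nm ORDER BY start),
                                  start) < 10
                    THEN 0 ELSE 1 END AS new_session
        FROM APP_Tax
    ) flagged
) numbered
GROUP BY APP_nm, session_id
ORDER BY session_start;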

How to add ORDER BY in an SQL query

update Room set Status = case
when Room_Rev.In_DateTime IS NOT NULL and Room_Rev.Out_DateTime IS NULL
then 'U'
when Room_Rev.In_DateTime IS NOT NULL and Room_Rev.Out_DateTime IS NOT NULL
then 'A'
when Room.Status!='R' and Room.Status!='U' and Room.Status!='A'
then Room.Status
else 'R'
end
FROM Room JOIN Room_Rev
ON Room.Room_ID=Room_Rev.Room_ID
and
((Room_Rev.Start_Date >= '2015-03-22' and Room_Rev.End_Date <= '2015-03-22')
OR
(Room_Rev.Start_Date<= '2015-03-22' and Room_Rev.End_Date> '2015-03-22')
OR
(Room_Rev.Start_Date< '2015-03-22' and Room_Rev.End_Date>= '2015-03-22'))
How to add order by Rev_ID desc in the query?
There are two tables, Room and Room_Rev; they have a one-to-many relationship.
The last two rows for ROM0006 already have In_DateTime and Out_DateTime filled in, so that booking is regarded as checked out, while the last row inserts a new reservation whose In_DateTime is null; therefore I need the query to return 'R' (Reserved status).
As one possible solution, I suggest a nested query instead of a join in the UPDATE statement. The logic of the update is not completely clear to me, so I leave it to the OP to correct the sort order in the final update (note that I used TOP 1 and ORDER BY Room_Id in the nested SELECT). This approach, however, makes all the usual SELECT techniques available, including ORDER BY.
update Room set Status = (select TOP 1 case
when Room_Rev.In_DateTime IS NOT NULL and Room_Rev.Out_DateTime IS NULL
then 'U'
when Room_Rev.In_DateTime IS NOT NULL and Room_Rev.Out_DateTime IS NOT NULL
then 'A'
when Room.Status!='R' and Room.Status!='U' and Room.Status!='A'
then Room.Status
else 'R'
end
FROM Room_Rev
WHERE Room.Room_ID=Room_Rev.Room_ID
and
((Room_Rev.Start_Date >= '2015-03-22' and Room_Rev.End_Date <= '2015-03-22')
OR
(Room_Rev.Start_Date<= '2015-03-22' and Room_Rev.End_Date> '2015-03-22')
OR
(Room_Rev.Start_Date< '2015-03-22' and Room_Rev.End_Date>= '2015-03-22'))
ORDER BY Room_Rev.Room_Id
)
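To get the ordering actually asked about, and assuming Room_Rev has the Rev_ID column the question mentions, the ORDER BY in the subquery simply becomes:
ORDER BY Room_Rev.Rev_ID DESC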
PS. As a piece of advice, I still think this approach is questionable: it works against proper normalization of the data. You'd rather have this information queried dynamically whenever it is required, instead of writing a static value to Room.Status.

WITH-constrained consecutive updates

Please assume I have built a query in MS SQL Server with the following structure:
WITH issues_a AS
(
SELECT a_prop
FROM ds_X x
)
, issues_b AS
(
SELECT key
, z.is_flagged as is_flagged
, some_prop
FROM ds_Z z
JOIN issues_a i_a
ON z.a_diff = i_a.a_prop
)
-- {{ run }}
UPDATE samples
SET error =
CASE
WHEN i_b.some_prop IS NULL THEN '#1 ...'
WHEN UPPER(i_b.is_flagged) != 'Y' THEN '#2 ...'
END
FROM samples s
left join issues_b i_b ON s.key = i_b.key;
Now I want to enhance the whole thing, updating one more table consecutively, by enclosing parts of the query in BEGIN TRANSACTION and COMMIT, but I can't get my head around how to do it. I tried enclosing the whole expression in the transaction brackets, but that didn't get me any further.
Are there other ways to achieve the above task, even without chaining the consecutive updates in a transactional manner, though that would be better?
To abbreviate, the task again: WITH <...>(...), <...>(...) UPDATE <... using data from the latter WITH> UPDATE <... using data from the latter WITH>?
Hope you don't mind my poor grammar...
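For what it's worth, here is a sketch of one way this is commonly handled in T-SQL: since a CTE can feed exactly one statement, materialize its result into a temp table once, then run both UPDATEs inside a single transaction. The names below come from the question, except other_samples, which is a hypothetical second target table:
BEGIN TRANSACTION;

-- Materialize the CTE's result once; a CTE itself cannot span two statements.
SELECT z.[key],              -- bracketed because KEY is a reserved word in T-SQL
       z.is_flagged,
       z.some_prop
INTO #issues_b
FROM ds_Z z
JOIN ds_X x
  ON z.a_diff = x.a_prop;

-- First update, as in the original query
UPDATE s
SET error = CASE
                WHEN i_b.some_prop IS NULL THEN '#1 ...'
                WHEN UPPER(i_b.is_flagged) != 'Y' THEN '#2 ...'
            END
FROM samples s
LEFT JOIN #issues_b i_b ON s.[key] = i_b.[key];

-- The consecutive update, reusing the same materialized data
UPDATE o
SET error = '...'
FROM other_samples o
JOIN #issues_b i_b ON o.[key] = i_b.[key];

COMMIT;

DROP TABLE #issues_b;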

What can cause Hive's where clause not to have an effect?

I have this seemingly straightforward query.
SELECT
table1.one
FROM
(SELECT
user,
1 AS one
FROM
users
WHERE date=${hiveconf:TODAY}
DISTRIBUTE BY user.id
SORT BY user.id
) table1
WHERE table1.one < 0;
Surprisingly, this returns all rows in the users table:
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
1
1
1
As table1.one is clearly 1 and thus table1.one < 0 is false, I'd expect no rows to be returned. How could that happen?
EDIT:
When I add table1.one < 0 to the select clause, I get
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
1 false
1 false
1 false
SECOND EDIT:
Removing WHERE date=${hiveconf:TODAY} (which was unnecessary anyway, because date is a partition attribute) fixed this weird behavior. I'm not sure what the cause was.
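For what it's worth, one plausible explanation (an assumption, since the variable's value isn't shown): hiveconf substitution is purely textual, so an unquoted date value turns the predicate into integer arithmetic rather than a date comparison, and comparisons against the wrong type can behave unexpectedly in Hive:
-- Hypothetical illustration: suppose TODAY was set like this
SET hiveconf:TODAY=2016-01-26;

-- The predicate then expands textually to
--   WHERE date = 2016-01-26
-- which Hive parses as the integer expression 2016 - 1 - 26 = 1989,
-- not as a date literal. Quoting the substitution avoids the problem:
SELECT user, 1 AS one
FROM users
WHERE date = '${hiveconf:TODAY}';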