I have this seemingly straightforward query.
SELECT
    table1.one
FROM
    (SELECT
        user,
        1 AS one
    FROM
        users
    WHERE date=${hiveconf:TODAY}
    DISTRIBUTE BY user.id
    SORT BY user.id
    ) table1
WHERE table1.one < 0;
Surprisingly, this returns all rows in the users table:
Execution completed successfully
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
1
1
1
As table1.one is clearly 1 and thus table1.one < 0 is false, I'd expect no rows to be returned. How could that happen?
EDIT:
When I add table1.one < 0 to the SELECT clause, I get:
Mapred Local Task Succeeded . Convert the Join into MapJoin
OK
1 false
1 false
1 false
SECOND EDIT:
Removing WHERE date=${hiveconf:TODAY} (which was unnecessary because that was a partition attribute anyway) fixed this weird behavior. Not sure what the cause was.
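For reference, the query after dropping that filter is just the original with the WHERE line removed (a sketch of what presumably ran). As a side note, and purely as an assumption since the post never confirms the cause: ${hiveconf:TODAY} is substituted as raw text, so if the filter were kept and date were a string column, it would normally need quoting, e.g. WHERE date='${hiveconf:TODAY}'.

SELECT
    table1.one
FROM
    (SELECT
        user,
        1 AS one
    FROM
        users
    DISTRIBUTE BY user.id
    SORT BY user.id
    ) table1
WHERE table1.one < 0;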
Apologies if something similar has been asked before. Currently I run 3 different UPDATE queries in order to get the desired result. The queries are as follows:
UPDATE users
SET enabled = false
WHERE username IN (SELECT username
                   FROM users
                   WHERE enabled = true
                     AND lastaccess != 0
                     AND lastaccess < (EXTRACT('epoch' FROM CURRENT_TIMESTAMP) - (86400*200))*1000
                     AND username NOT LIKE ('admintest%'));
Once this query runs (disabling all users who haven't accessed the system in a certain period), I run the following 2 queries on the whole table:
update users set weeklypopupuse = 0;
update users set monthlypopupuse = 0;
These 2 queries then reset the weekly and monthly use to 0.
Now this works perfectly fine as per the requirements; however, is there a more elegant way of writing these 3 queries as a single query that gives the same result?
Any help or advice will be highly appreciated. Thanks in advance.
Update: after the OP gave the full structure, refactored to:
update users
set
    weeklypopupuse = 0
    , monthlypopupuse = 0
    , enabled = case
                    when
                        enabled = true
                        AND lastaccess != 0
                        AND lastaccess < (EXTRACT('epoch' FROM CURRENT_TIMESTAMP) - (86400*200))*1000
                        AND username NOT LIKE ('admintest%')
                    then false
                    else enabled
                end;
I'm trying to port Delayed Job to Haskell and am unable to understand the WHERE clause in the query that DJ fires to poll the next job:
UPDATE "delayed_jobs"
SET locked_at = '2017-07-18 03:33:51.729884',
locked_by = 'delayed_job.0 host:myhostname pid:21995'
WHERE id IN (
SELECT id FROM "delayed_jobs"
WHERE
(
(
run_at <= '2017-07-18 03:33:51.729457'
AND (locked_at IS NULL OR locked_at < '2017-07-17 23:33:51.729488')
OR locked_by = 'delayed_job.0 host:myhostname pid:21995'
)
AND failed_at IS NULL
) ORDER BY priority ASC, run_at ASC LIMIT 1 FOR UPDATE) RETURNING *
The structure of the WHERE clause is the following:
(run_at_condition AND locked_at_condition OR locked_by_condition)
AND failed_at_condition
Is there a set of inner parentheses missing in run_at_condition AND locked_at_condition OR locked_by_condition? With what precedence are the AND/OR conditions evaluated?
What is the purpose of the locked_by_condition where it seems to be picking up jobs that have already been locked by the current DJ process?!
The statement is probably fine. The context of the whole statement is to take the lock on the highest-priority job by setting its locked_at/locked_by fields.
The WHERE condition is saying something like: "if run_at is sooner than now (it's due) AND it's either not locked or it was locked more than four hours ago... alternatively that's all overridden if it was me that locked it, and of course only if it hasn't failed, THEN lock it." So if I'm reading it correctly, it looks like it's running things that are ready to run, but with a timeout so that jobs can't stay locked out forever.
On the precedence question: AND binds more tightly than OR:
SELECT 'yes' WHERE false AND false OR true; -- 'yes', 1 row
SELECT 'yes' WHERE (false AND false) OR true; -- 'yes', 1 row
SELECT 'yes' WHERE false AND (false OR true); -- 0 rows
The first two statements mean the same thing, the third one is different.
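Applied to the Delayed Job query, the inner subquery's implicit grouping is therefore (same conditions as above, only with the parentheses made explicit):

SELECT id FROM "delayed_jobs"
WHERE (
        (
          run_at <= '2017-07-18 03:33:51.729457'
          AND (locked_at IS NULL OR locked_at < '2017-07-17 23:33:51.729488')
        )
        OR locked_by = 'delayed_job.0 host:myhostname pid:21995'
      )
  AND failed_at IS NULL
ORDER BY priority ASC, run_at ASC LIMIT 1 FOR UPDATE;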
The second point may just be a rough sort of ownership system? If the current process is the one that locked something, it should be able to override that lock.
I am relatively new to SQL and am trying to apply the case function within a view.
While I understand the fundamentals of it, I am having difficulty applying it in the way that I need.
I have 3 columns ApplicationID, ServerName and ServerShared? (true/false).
Each application can have many servers associated to it, while each server only has 1 server type.
I would like to use CASE to create a further field that can take three values depending on the ServerShared values for an application: all True = Shared, all False = Non-shared, both True and False = Partially shared.
My thought was to use the COUNT function within the CASE expression to set up conditions along these lines:
if count of true > 0 and count of false > 0, then ServerShared? = partially;
if count of true > 0 and count of false = 0, then ServerShared? = true, and vice versa.
I believe the above logic is a way of achieving my result, yet I would appreciate help both with how to structure this within a CASE statement and with any wisdom on whether there is a better way.
Thanks in advance!
If I get your question right, this should do the trick. Maybe you need to add further columns or adapt the logic, but you should get the idea behind it.
SELECT ApplicationID,
       CASE
           WHEN COUNT(DISTINCT ServerShared) = 2
               THEN N'Partially shared'
           WHEN MIN(ServerShared) = 0
               THEN N'Server not shared'
           WHEN MAX(ServerShared) = 1
               THEN N'Server shared'
       END AS ServerShared
FROM myTable
GROUP BY ApplicationID
There are two main ways to do this problem (super generic answer from a non-expert :D).
Less often executed (one-off?), slow execution that can degrade badly as the row count goes up:
This is similar to your suggested solution and involves putting other queries in the SELECT / field list part of the query. These get executed for every row returned by the main part of the query (bad news, generally speaking):
select
    a.applicationID
    -- the correlated subqueries below run once per row of a
    , case (select count(*) from myTable as b where a.applicationID = b.applicationID and b.ServerShared = true)
        when 0 then 'Non-Shared'
        when (select count(*) from myTable as c where a.applicationID = c.applicationID) then 'Shared'
        else 'Partially-Shared'
      end as ShareType
from
    myTable as a
Get all your data once, then perform just the comparison row by row. This is what I would use by default; it's generally better as far as I know, but it can sometimes be harder to think through:
select
    a.applicationID
    , case
        when coalesce(b.SharedServers, 0) = 0 then 'Non-Shared'
        when a.TotalServers = b.SharedServers then 'Shared'
        else 'Partially-Shared'
      end as ShareType
FROM
    (select applicationID, count(*) as TotalServers from myTable group by applicationID) as a
    LEFT OUTER JOIN (select applicationID, count(*) as SharedServers from myTable where ServerShared = true group by applicationID) as b
        ON a.applicationID = b.applicationID
These queries are just written off the top of my head, so let me know if there are bugs :/
Note also the two uses of the CASE expression: one written as CASE *value* WHEN *possible value* THEN ... (the "simple" form), and the other as CASE WHEN *expression that evaluates to a boolean* THEN ... (the "searched" form).
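A tiny standalone illustration of the two forms (generic, not tied to the tables above):

-- simple CASE: compares one operand against a list of candidate values
-- searched CASE: each WHEN carries its own boolean expression
select
    case 0 when 0 then 'Non-Shared' else 'Shared' end as simple_form
    , case when 0 = 0 then 'Non-Shared' else 'Shared' end as searched_form;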
I am trying to sequence data, and there are instances where I have to order the sequence using hours/minutes/seconds. However, when I use RANK with PARTITION BY, it's almost as if it does not recognize this as chronological data at all. An example of the data I am trying to sequence is below:
Mod_Order Last_Activity ACTIVITY_DATE_DTTM hdm_modif_dttm
1 NULL 15/08/2007 00:00:00 59:47.3
2 NULL 27/09/2007 14:30:02 59:22.9
3 NULL 27/11/2007 15:30:02 59:10.5
3 NULL 27/11/2007 15:30:02 58:38.9
As you can see, the last two ACTIVITY_DATE_DTTM datetimes are exactly the same, so I need to go a step further. I removed the date from the hdm_modif_dttm field to see if it made any difference, but it did not (I left it as a time, as I figured that makes no difference anyhow). My code was as follows:
Update q
set q.Mod_Order = b.Mod_Order
from [#Temp_last_act_2] q
Left join
(
select
RANK () over
(partition by pathway_id
order by pathway_id, ACTIVITY_DATE_DTTM,hdm_modif_dttm) as Mod_Order,
PATHWAY_ID,
MODIF_DTTM,
ACTIVITY_DATE_DTTM
from #temp_Last_act_2
) as b on b.PATHWAY_ID = q.PATHWAY_ID
and b.MODIF_DTTM = q.MODIF_DTTM
and b.ACTIVITY_DATE_DTTM = q.ACTIVITY_DATE_DTTM
Is anyone aware of any limitations of this function that I am unaware of, or is there a function that may handle this better (or am I being really daft)?
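For what it's worth, here is a minimal sketch (reusing the column names from the post; the temp table itself is assumed to already exist) contrasting RANK, which gives tied rows the same number, with ROW_NUMBER, which always produces a strict 1, 2, 3... sequence even when the ORDER BY columns tie:

select
    RANK() over (partition by pathway_id
                 order by ACTIVITY_DATE_DTTM, hdm_modif_dttm) as rank_order
    , ROW_NUMBER() over (partition by pathway_id
                         order by ACTIVITY_DATE_DTTM, hdm_modif_dttm) as row_order
    , pathway_id
    , ACTIVITY_DATE_DTTM
    , hdm_modif_dttm
from #temp_Last_act_2;

If two rows are identical on every ORDER BY column, RANK repeats the value (and skips the next one), so a tie-breaking column, or ROW_NUMBER, is needed to force a unique Mod_Order.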
I have a query that is taking 48 seconds to execute as follows:
SELECT count(DISTINCT tmd_logins.userID) as totalLoginsUniqueLast30Days
FROM tmd_logins
join tmd_users on tmd_logins.userID = tmd_users.userID
where tmd_users.isPatient = 1 AND loggedIn > '2011-03-25'
and tmd_logins.userID in
(SELECT userID as accounts30Days FROM tmd_users
where isPatient = 1 AND created > '2012-04-29' AND computerID is null)
When I remove the DISTINCT keyword it takes less than 1 second, so it seems the bottleneck lies there.
The database adds an entry to the tmd_logins table every time a user logs into the system. I am trying to get a total count of all users that are patients that have been created and logged in within a given period of time, such as in the last 30 days.
I have tried removing the DISTINCT keyword and adding group by tmd_logins.userID to the statement but the performance issue remains.
Table tmd_logins has about 300,000 records, tmd_users has about 40,000
Is there a better way of doing this?
The problem that you have is the execution plan. My guess is that the "in" clause might be confusing it. You might try:
SELECT count(DISTINCT tmd_logins.userID) as totalLoginsUniqueLast30Days
FROM tmd_logins join
tmd_users
on tmd_logins.userID = tmd_users.userID join
(SELECT distinct userID as accounts30Days
FROM tmd_users
where isPatient = 1 AND
created > '2012-04-29' AND
computerID is null
) t
on tmd_logins.userID = t.accounts30Days
where tmd_users.isPatient = 1 AND
loggedIn > '2011-03-25'
That might or might not work. However, I'm wondering about the structure of the query itself. It would seem that UserID should be distinct in a table called tmd_users. If so, then you can wrap all your conditions into one:
SELECT count(DISTINCT tmd_logins.userID) as totalLoginsUniqueLast30Days
FROM tmd_logins join
tmd_users
on tmd_logins.userID = tmd_users.userID
where tmd_users.isPatient = 1 AND
loggedIn > '2011-03-25' and
created > '2012-04-29' AND
computerID is null
If my guess is right, then this should definitely run faster.
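If it still isn't fast after that, the next thing worth checking (an assumption on my part; the post doesn't show the schema, engine, or existing indexes) is whether the join and filter columns are indexed, and whether the plan actually uses those indexes. Hypothetical examples:

-- hypothetical index names; covers the join key plus the date filter on logins
CREATE INDEX idx_tmd_logins_user_loggedin ON tmd_logins (userID, loggedIn);
-- covers the patient/created/computerID filters on users
CREATE INDEX idx_tmd_users_patient_created ON tmd_users (isPatient, created, computerID);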