Join Oracle tables on an exact match, and a closest match - sql

I am trying to join two tables of performance metrics, system stats and memory usage. Entries in these tables arrive on differing time schedules. I need to join the tables by finding the exact match for the System_Name in both tables, and the closest match for WRITETIME. WRITETIME uses the system's own idea of time and is NOT a standard Oracle timestamp.
I can select the closest timestamp from one table with something like:
select "Unix_Memory"."WRITETIME", ABS ('1140408134015004' - "Unix_Memory"."WRITETIME")
as Diff from "Unix_Memory"
where "Unix_Memory"."WRITETIME" > '1140408104015004' order by Diff;
The constants there will be parameterised in my script.
However when I try to expand this into my larger query:
select "System"."System_Name", "System"."WRITETIME" as SysStamp
from "System"
join "Unix_Memory" on "System"."System_Name" = "Unix_Memory"."System_Name"
and "Unix_Memory"."WRITETIME" = (
select Stamp from (
select "Unix_Memory"."WRITETIME" as Stamp,
ABS ( "System"."WRITETIME" - "Unix_Memory"."WRITETIME") as Diff
from "Unix_Memory" where "Unix_Memory"."WRITETIME" > '1140408104015004' and rownum = 1 order by Diff
)
)
WHERE "System"."System_Name" in ('this','that', 'more')
and "System"."WRITETIME" > '1140408124015004';
I get:
Error at Command Line:38 Column:72
Error report:
SQL Error: ORA-00904: "System"."WRITETIME": invalid identifier
00904. 00000 - "%s: invalid identifier"
I have tried a few variations, but I am not getting any closer.

You must state the System table in the inner Select as well.
select "System"."System_Name", "System"."WRITETIME" as SysStamp
from "System"
join "Unix_Memory" on "System"."System_Name" = "Unix_Memory"."System_Name"
and "Unix_Memory"."WRITETIME" = (
select Stamp from (
select "Unix_Memory"."WRITETIME" as Stamp,
ABS ( "System"."WRITETIME" - "Unix_Memory"."WRITETIME") as Diff
from "Unix_Memory"
-- THE NEXT LINE IS MISSING IN YOUR CODE
INNER JOIN "System" ON "System"."System_Name" = "Unix_Memory"."System_Name"
and "System"."WRITETIME" > '1140408124015004'
-- end of missing
where "Unix_Memory"."WRITETIME" > '1140408104015004' and rownum = 1 order by Diff
)
)
WHERE "System"."System_Name" in ('this','that', 'more')
and "System"."WRITETIME" > '1140408124015004';

Unfortunately, column names are only visible one nesting level down. So "System"."WRITETIME" would be known in select Stamp from ..., but no longer in select "Unix_Memory"."WRITETIME" as Stamp ...
In any case, you would select a fairly random stamp: the first "Unix_Memory"."WRITETIME" > '1140408104015004' found, to be precise, because rownum = 1 is applied before order by. You will have to rewrite your statement completely.
EDIT: Here is one possibility to re-write the statement using MIN/MAX KEEP:
select
s.system_name,
s.writetime as sysstamp,
min(um.id) keep (dense_rank first order by abs(s.writetime - um.writetime)) as closest_um_id
from system s
join unix_memory um on s.system_name = um.system_name
where s.system_name in ('this','that', 'more')
and s.writetime > '1140408124015004'
and um.writetime > '1140408104015004'
group by s.system_name, s.writetime
order by s.system_name, s.writetime;
If you need more than just the ID of unix_memory then surround this with another select:
select
sy.system_name,
sy.sysstamp,
mem.*
from
(
select
s.system_name,
s.writetime as sysstamp,
min(um.id) keep (dense_rank first order by abs(s.writetime - um.writetime)) as closest_um_id
from system s
join unix_memory um on s.system_name = um.system_name
where s.system_name in ('this','that', 'more')
and s.writetime > '1140408124015004'
and um.writetime > '1140408104015004'
group by s.system_name, s.writetime
) sy
join unix_memory mem on mem.id = sy.closest_um_id
order by sy.system_name, sy.sysstamp;

Related

Snowflake - update with correlated subquery using timediff

I am running this query on Snowflake Database:
UPDATE "click" c
SET "Registration_score" =
(SELECT COUNT(*) FROM "trackingpoint" t
WHERE 1=1
AND c."CookieID" = t."CookieID"
AND t."page" ilike '%Registration complete'
AND TIMEDIFF(minute,c."Timestamp",t."Timestamp") < 4320
AND TIMEDIFF(second,c."Timestamp",t."Timestamp") > 0);
The Database returns Unsupported subquery type cannot be evaluated. However, if I run it without the last two conditions (with TIMEDIFF), it works without problem. I confirmed that the actual TIMEDIFF statements are alright with these queries:
select count(*) from "trackingpoint"
where TIMEDIFF(minute, '2018-01-01', "Timestamp") > 604233;
select count(*) from "click"
where TIMEDIFF(minute, '2018-01-01', "Timestamp") > 604233;
and these work without problem. I don't see a reason why the TIMEDIFF condition should prevent the database from returning the result. Any idea what I should alter to make it work?
So, using the following setup:
create table click (id number,
timestamp timestamp_ntz,
cookieid number,
Registration_score number);
create table trackingpoint(id number,
timestamp timestamp_ntz,
cookieid number,
page text );
insert into click values (1,'2018-03-20', 101, 0),
(2,'2019-03-20', 102, 0);
insert into trackingpoint values (1,'2018-03-20 00:00:10', 101, 'user reg comp'),
(2,'2018-03-20 00:00:11', 102, 'user reg comp'),
(3,'2018-03-20 00:00:13', 102, 'pet reg comp'),
(4,'2018-03-20 00:00:15', 102, 'happy dance');
You can see we get the rows we expect:
select c.*, t.*
from click c
join trackingpoint t
on c.cookieid = t.cookieid ;
Now there are two ways to get your count. The first is as you have it, which is good if you're counting only one thing, since all the rules are applied as join filters:
select c.id,
count(1) as new_score
from click c
join trackingpoint t
on c.cookieid = t.cookieid
and t.page ilike '%reg comp'
and TIMEDIFF(minute, c.timestamp, t.timestamp) < 4320
group by 1;
Or you can (in Snowflake syntax) move the conditions to the aggregate/select side, and thus get more than one count from the same join if that's what you need (this is the situation I find myself in more often, which is why I present it):
select c.id,
sum(iff(t.page ilike '%reg comp' AND TIMEDIFF(minute, c.timestamp, t.timestamp) < 4320, 1, 0)) as new_score
from click c
join trackingpoint t
on c.cookieid = t.cookieid
group by 1;
Plugging this into the UPDATE pattern (see the last example in the docs)
https://docs.snowflake.net/manuals/sql-reference/sql/update.html
you can move to a single sub-select instead of the correlated subquery, which Snowflake doesn't support in this form; that is the error message you are getting.
UPDATE click c
SET Registration_score = s.new_score
from (
select ic.id,
count(*) as new_score
from click ic
join trackingpoint it
on ic.cookieid = it.cookieid
and it.page ilike '%reg comp'
and TIMEDIFF(minute, ic.timestamp, it.timestamp) < 4320
group by 1) as s
WHERE c.id = s.id;
The TIMEDIFF conditions tie each row of the UPDATE to the sub-query through the c."Timestamp" comparison, and that form of correlation is what Snowflake refuses to evaluate. The workaround is to build one "big but simpler" sub-query and join to that.
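The "aggregate once, then apply by id" pattern can be tried locally with a minimal sketch in Python's sqlite3 module. Several things here are assumptions rather than Snowflake syntax: the timestamp column is renamed ts, LIKE stands in for ILIKE, julianday arithmetic stands in for TIMEDIFF, and the update is applied from Python in a second step instead of with UPDATE ... FROM. Unlike the join examples above, it also keeps the question's "tracking point must come after the click" condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE click (id INTEGER, ts TEXT, cookieid INTEGER, registration_score INTEGER);
    CREATE TABLE trackingpoint (id INTEGER, ts TEXT, cookieid INTEGER, page TEXT);
    INSERT INTO click VALUES (1, '2018-03-20 00:00:00', 101, 0),
                             (2, '2019-03-20 00:00:00', 102, 0);
    INSERT INTO trackingpoint VALUES
        (1, '2018-03-20 00:00:10', 101, 'user reg comp'),
        (2, '2018-03-20 00:00:11', 102, 'user reg comp'),
        (3, '2018-03-20 00:00:13', 102, 'pet reg comp'),
        (4, '2018-03-20 00:00:15', 102, 'happy dance');
""")

# Step 1: one uncorrelated aggregate over the join, one row per matching click id.
scores = conn.execute("""
    SELECT c.id, COUNT(*) AS new_score
      FROM click c
      JOIN trackingpoint t
        ON c.cookieid = t.cookieid
       AND t.page LIKE '%reg comp'
       AND t.ts > c.ts                                           -- after the click
       AND (julianday(t.ts) - julianday(c.ts)) * 1440 < 4320     -- within 4320 minutes
     GROUP BY c.id
""").fetchall()

# Step 2: apply the precomputed scores by id.
conn.executemany(
    "UPDATE click SET registration_score = ? WHERE id = ?",
    [(score, click_id) for click_id, score in scores],
)

result = conn.execute(
    "SELECT id, registration_score FROM click ORDER BY id"
).fetchall()
print(result)
```

One design consequence worth noting: a click with no matching tracking point produces no row in the aggregate, so it keeps whatever score it had before, whereas the original correlated COUNT(*) would have set it to 0.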

Hive - select rows within 1 year of earliest date

I am trying to select all rows in a table that are within 1 year of the earliest date in the table. I'm using the following code:
select *
from baskets a
where activitydate < (select date_add((select min(activitydate) mindate_a from baskets), 365) date_b from baskets)
limit 10;
but get the following error message:
Error while compiling statement: FAILED: ParseException line 1:55 cannot recognize input near 'select' 'date_add' '(' in expression specification
Total execution time: 00:00:00.338
Any suggestions?
EDIT:
With this code:
select *
from baskets a
where activitydate < (select date_add(min(activitydate), 365) from baskets)
limit 10;
I'm getting this error:
Error while compiling statement: FAILED: ParseException line 1:55 cannot recognize input near 'select' 'date_add' '(' in expression specification
I'd be tempted to use window functions:
select b.*
from (select b.*, min(activitydate) over () as min_ad
from baskets b
) b
where activitydate < add_months(min_ad, 12);
If you really want your syntax to work, try reducing the number of selects:
where activitydate < (select date_add(min(activitydate), 365) from baskets)
Use a JOIN instead of a SELECT in a subquery. I don't think Hive supports a subquery in the WHERE clause with a < condition; as of Hive 0.13, only IN and EXISTS subqueries can be used there.
See the Hive Language Manual, SubQueries.
SELECT a.*
FROM baskets a
JOIN (SELECT DATE_ADD(MIN(activitydate), 365) maxdate
FROM baskets) b
ON a.activitydate < b.maxdate
LIMIT 10;
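The shape of that join (compute the cutoff date once in a derived table, then compare every row against it) can be checked with a minimal sketch in Python's sqlite3 module. The sample rows are invented, and SQLite's date(..., '+365 days') is used here as a stand-in for Hive's DATE_ADD, so treat it as an illustration of the pattern rather than Hive code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE baskets (id INTEGER, activitydate TEXT);
    INSERT INTO baskets VALUES
        (1, '2015-01-10'),   -- earliest date in the table
        (2, '2015-06-01'),
        (3, '2016-01-09'),   -- last day inside the 365-day window
        (4, '2016-01-10'),   -- exactly earliest + 365 days, excluded by <
        (5, '2017-03-01');
""")

# The derived table b has a single row holding the cutoff date;
# joining on the inequality keeps only rows before that cutoff.
rows = conn.execute("""
    SELECT a.id, a.activitydate
      FROM baskets a
      JOIN (SELECT date(MIN(activitydate), '+365 days') AS maxdate
              FROM baskets) b
        ON a.activitydate < b.maxdate
     ORDER BY a.id
""").fetchall()

print(rows)
```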

ROW_NUMBER() Query Plan SORT Optimization

The query below accesses the Votes table that contains over 30 million rows. The result set is then selected from using WHERE n = 1. In the query plan, the SORT operation in the ROW_NUMBER() windowed function is 95% of the query's cost and it is taking over 6 minutes to complete execution.
I already have an index on same_voter, eid, country include vid, nid, sid, vote, time_stamp, new to cover the where clause.
Is the most efficient way to correct this to add an index on vid, nid, sid, new DESC, time_stamp DESC or is there an alternative to using the ROW_NUMBER() function for this to achieve the same results in a more efficient manner?
SELECT v.vid, v.nid, v.sid, v.vote, v.time_stamp, v.new, v.eid,
ROW_NUMBER() OVER (
PARTITION BY v.vid, v.nid, v.sid ORDER BY v.new DESC, v.time_stamp DESC) AS n
FROM dbo.Votes v
WHERE v.same_voter <> 1
AND v.eid <= @EId
AND v.eid > (@EId - 5)
AND v.country = @Country
One possible alternative to using ROW_NUMBER():
SELECT
V.vid,
V.nid,
V.sid,
V.vote,
V.time_stamp,
V.new,
V.eid
FROM
dbo.Votes V
LEFT OUTER JOIN dbo.Votes V2 ON
V2.vid = V.vid AND
V2.nid = V.nid AND
V2.sid = V.sid AND
V2.same_voter <> 1 AND
V2.eid <= @EId AND
V2.eid > (@EId - 5) AND
V2.country = @Country AND
(V2.new > V.new OR (V2.new = V.new AND V2.time_stamp > V.time_stamp))
WHERE
V.same_voter <> 1 AND
V.eid <= @EId AND
V.eid > (@EId - 5) AND
V.country = @Country AND
V2.vid IS NULL
The query basically says: get all rows matching your criteria, then join to any other rows that match the same criteria but would be ranked higher within the partition based on the new and time_stamp columns. If no such row is found, V2.vid will be NULL, which means the current row is the one you want (it's ranked highest). I'm assuming that vid can otherwise never be NULL. If it's a NULLable column in your table then you'll need to adjust that last line of the query.
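The anti-join described above can be exercised on a toy table with a minimal sketch in Python's sqlite3 module. The data is invented, the new column is renamed is_new, and the same_voter / eid / country filters are left out to keep it short, so this only illustrates the "no higher-ranked row exists" logic, not the performance on 30 million rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE votes (vid INTEGER, nid INTEGER, sid INTEGER,
                        vote INTEGER, time_stamp INTEGER, is_new INTEGER);
    INSERT INTO votes VALUES
        (1, 1, 1, 5, 100, 0),
        (1, 1, 1, 3, 200, 0),
        (1, 1, 1, 4, 150, 1),   -- wins group (1,1,1): highest is_new
        (2, 1, 1, 2, 300, 0),
        (2, 1, 1, 1, 400, 0);   -- wins group (2,1,1): latest time_stamp
""")

# Keep a row only if no other row in the same (vid, nid, sid) group outranks it
# under ORDER BY is_new DESC, time_stamp DESC.
rows = conn.execute("""
    SELECT v.vid, v.nid, v.sid, v.vote
      FROM votes v
      LEFT JOIN votes v2
        ON v2.vid = v.vid
       AND v2.nid = v.nid
       AND v2.sid = v.sid
       AND (v2.is_new > v.is_new
            OR (v2.is_new = v.is_new AND v2.time_stamp > v.time_stamp))
     WHERE v2.vid IS NULL
     ORDER BY v.vid
""").fetchall()

print(rows)
```

Note that if two rows in a group tie on both ordering columns, neither outranks the other and both are returned, whereas ROW_NUMBER() would pick exactly one.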

Group By & Having vs. SubQuery (Where Count is Greater Than 1)

I'm struggling here trying to write a script that finds where an order was returned multiple times by the same associate (count greater than 1). I'm guessing my syntax with the subquery is incorrect. When I run the script, I get a message back that the "SELECT failed.. [3669] More than one value was returned by the subquery."
I'm not tied to the subquery, and have tried using just the group by and having statements, but I get an error regarding a non-aggregate value. What's the best way to proceed here and how do I fix this?
Thank you in advance - code below:
SEL s.saletran
, s.saletran_dt SALE_DATE
, r.saletran_id RET_TRAN
, r.saletran_dt RET_DATE
, ra.user_id RET_ASSOC
FROM salestrans s
JOIN salestrans_refund r
ON r.orig_saletran_id = s.saletran_id
AND r.orig_saletran_dt = s.saletran_dt
AND r.orig_loc_id = s.loc_id
AND r.saletran_dt between s.saletran_dt and s.saletran_dt + 30
JOIN saletran rt
ON rt.saletran_id = r.saletran_id
AND rt.saletran_dt = r.saletran_dt
AND rt.loc_id = r.loc_id
JOIN assoc ra --Return Associate
ON ra.assoc_prty_id = rt.sls_assoc_prty_id
WHERE
(SELECT count(*)
FROM saletran_refund
GROUP BY ORIG_SLTRN_ID
) > 1
AND s.saletran_dt between '2015-01-01' and current_date - 1
Based on what you've got so far, I think you want to use this instead:
where r.ORIG_SLTRN_ID in
(select
ORIG_SLTRN_ID
from
saletran_refund
group by ORIG_SLTRN_ID
having count (*) > 1)
That will give you the ORIG_SLTRN_IDs that have more than one row.
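To see both the original error and the fix on a small scale, here is a minimal sketch in Python's sqlite3 module. The table contents are invented and only two columns are modelled. It shows that the grouped COUNT(*) yields one row per group (which is why it cannot be compared to a single value), and that the IN ... HAVING form keeps only refunds whose original sale was returned more than once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE saletran_refund (saletran_id INTEGER, orig_sltrn_id INTEGER);
    INSERT INTO saletran_refund VALUES
        (10, 1), (11, 1),            -- sale 1 returned twice
        (12, 2),                     -- sale 2 returned once
        (13, 3), (14, 3), (15, 3);   -- sale 3 returned three times
""")

# One count per group: several rows, so "(subquery) > 1" has no single value to test.
counts = conn.execute("""
    SELECT COUNT(*)
      FROM saletran_refund
     GROUP BY orig_sltrn_id
     ORDER BY orig_sltrn_id
""").fetchall()

# Filter with IN + HAVING instead: keep refunds of sales that have more than one refund.
rows = conn.execute("""
    SELECT r.saletran_id, r.orig_sltrn_id
      FROM saletran_refund r
     WHERE r.orig_sltrn_id IN (SELECT orig_sltrn_id
                                 FROM saletran_refund
                                GROUP BY orig_sltrn_id
                               HAVING COUNT(*) > 1)
     ORDER BY r.saletran_id
""").fetchall()

print(counts)
print(rows)
```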
You don't give enough for a full answer, but this is a start:
group by s.saletran
, s.saletran_dt
, r.saletran_id
, r.saletran_dt
, ra.user_id
having count(distinct(ORIG_SLTRN_ID)) > 0
The subquery in your WHERE clause does return more than one row. Run it on its own to see:
SELECT count(*)
FROM saletran_refund
GROUP BY ORIG_SLTRN_ID

Issue with subquery in Oracle corrected the query? [duplicate]

SELECT COUNT(DISTINCT SEC.ERROR_GROUP_ID),
COUNT(DISTINCT SEC_DET.ERROR_GROUP_ID),
COUNT(DISTINCT MB.ERROR_GROUP_ID),
COUNT(DISTINCT OD.ERROR_GROUP_ID),
(SELECT COUNT (DISTINCT SEC_SCH.ERROR_GROUP_ID)
FROM SCHEMA.SECURITY SEC
LEFT OUTER JOIN SCHEMA.SECURITY_SCHEDULE SEC_SCH
ON SEC.MSD_SECURITY_ID =SEC_SCH.MSD_SECURITY_ID
WHERE SEC.MSD_SECURITY_ID IN
( SELECT DISTINCT main.MSD_SECURITY_ID
FROM SCHEMA2.Positions main
WHERE main.QUANTITY != 0
AND systimestamp >= main.eff_from_dt
AND main.eff_to_dt > systimestamp
AND systimestamp >= main.asrt_from_dt
AND main.asrt_to_dt > systimestamp
))
FROM SCHEMA.SECURITY SEC
JOIN SCHEMA.SECURITY_DETAIL SEC_DET
ON SEC.MSD_SECURITY_ID = SEC_DET.MSD_SECURITY_ID
LEFT OUTER JOIN SCHEMA.MUNI_BOND MB
ON SEC.MSD_SECURITY_ID=MB.MSD_SECURITY_ID
LEFT OUTER JOIN SCHEMA.OPTION_DETAIL OD
ON SEC.MSD_SECURITY_ID =OD.MSD_SECURITY_ID
WHERE SEC.MSD_SECURITY_ID IN
( SELECT DISTINCT main.MSD_SECURITY_ID
FROM SCHEMA2.Positions main
WHERE main.QUANTITY != 0
AND systimestamp >= main.eff_from_dt
AND main.eff_to_dt > systimestamp
AND systimestamp >= main.asrt_from_dt
AND main.asrt_to_dt > systimestamp
) ;
Error ORA-00936: missing expression
00936. 00000 - "missing expression"
*Cause:
*Action:
Error at Line: 365 Column: 3
The nested query syntax needs to be corrected for this to work; that's where I am stuck.
This is a partial answer. I do not fully understand some of your code, and I think it has issues.
Note: ** is meant to mark bold text. I messed up the formatting; ** is not part of this SQL.
You have to group by something. In this case:
(SELECT DISTINCT (SEC_SCH.ERROR_GROUP_ID)
FROM SCHEMA.SECURITY SEC
LEFT OUTER JOIN SCHEMA.SECURITY_SCHEDULE SEC_SCH
ON SEC.MSD_SECURITY_ID =SEC_SCH.MSD_SECURITY_ID
WHERE SEC.MSD_SECURITY_ID IN
( SELECT DISTINCT main.MSD_SECURITY_ID
FROM SCHEMA2.Positions main
WHERE main.QUANTITY != 0
AND systimestamp >= main.eff_from_dt
AND main.eff_to_dt > systimestamp
AND systimestamp >= main.asrt_from_dt
AND main.asrt_to_dt > systimestamp
)) **foo**
FROM SCHEMA.SECURITY SEC
JOIN SCHEMA.SECURITY_DETAIL SEC_DET
ON SEC.MSD_SECURITY_ID = SEC_DET.MSD_SECURITY_ID
LEFT OUTER JOIN SCHEMA.MUNI_BOND MB
ON SEC.MSD_SECURITY_ID=MB.MSD_SECURITY_ID
LEFT OUTER JOIN SCHEMA.OPTION_DETAIL OD
ON SEC.MSD_SECURITY_ID =OD.MSD_SECURITY_ID
WHERE SEC.MSD_SECURITY_ID IN
( SELECT DISTINCT main.MSD_SECURITY_ID
FROM SCHEMA2.Positions main
WHERE main.QUANTITY != 0
AND systimestamp >= main.eff_from_dt
AND main.eff_to_dt > systimestamp
AND systimestamp >= main.asrt_from_dt
AND main.asrt_to_dt > systimestamp
)
**group by foo** ;
Your overall query is an aggregation query without a group by, so it would be expected to return one row. You have a subquery in the select that can return multiple rows -- I suspect the problem is related to this structure.
I would suggest that you change the distinct to an aggregation function. But which one? COUNT(DISTINCT SEC_SCH.ERROR_GROUP_ID)? MAX(SEC_SCH.ERROR_GROUP_ID)? LISTAGG(SEC_SCH.ERROR_GROUP_ID, ',') WITHIN GROUP (ORDER BY SEC_SCH.ERROR_GROUP_ID)? I don't know. It is not clear what you want for the third column.
Your entire query looks suspect. So many count(distinct) expressions often mean that you are joining along independent dimensions, creating a cartesian product. It is hard to say whether that is a problem, because without sample data and desired results, your question doesn't actually say what you want to accomplish.