Bigquery data-lineage for CREATE TABLE AS SELECT shows nothing - google-bigquery

SOLVED All it takes is quite a bit of patience (1 hour or so)
I'm working through the on the fly decryption example from strmprivacy.io which works fine, but the lineage tab in BigQuery shows a stand-alone table, instead of something derived. This is my query, and the resulting table is fine. But no lineage at all.
CREATE TABLE on_the_fly_decryption.decrypted_retail_consent_ge_1 AS
SELECT * FROM (
SELECT
e.strmMeta_keyLink,
DETERMINISTIC_DECRYPT_STRING(k.key, FROM_BASE64(e.Customer_ID), '') AS decryptedCustomerID,
e.Customer_ID,
( SELECT MAX(CAST(num AS Int64)) FROM
UNNEST(JSON_VALUE_ARRAY(e.strmMeta_consentLevels)) AS num ) max_consent
FROM
on_the_fly_decryption.encrypted e
INNER JOIN (
SELECT
k.keyLink,
KEYS.KEYSET_FROM_JSON( k.encryptionKey ) key
FROM
on_the_fly_decryption.keys k) k
ON e.strmMeta_keyLink = k.keyLink
)
WHERE max_consent >= 1;
This should work according to the docs.
Data lineage is enabled for my project. Any hints of what (is going|am I doing) wrong?
P.S the query is a slightly modified one, for the UCI online retail II set instead of the earlier version, but it's pretty much equal.

Wow, it showed up after a while:
So patience is all it takes :-)

Related

Query Snowflake Jobs [duplicate]

is there any way within snowflake/sql query to view what tables are being queried the most as well as what columns? I want to know what data is of most value to my users and not sure how to do this programatically. Any thoughts are appreciated - thank you!
2021 update
The new ACCESS_HISTORY view has this information (in preview right now, enterprise edition).
For example, if you want to find the most used columns:
select obj.value:objectName::string objName
, col.value:columnName::string colName
, count(*) uses
, min(query_start_time) since
, max(query_start_time) until
from snowflake.account_usage.access_history
, table(flatten(direct_objects_accessed)) obj
, table(flatten(obj.value:columns)) col
group by 1, 2
order by uses desc
Ref: https://docs.snowflake.com/en/sql-reference/account-usage/access_history.html
2020 answer
The best I found (for now):
For any given query, you can find what tables are scanned through looking at the plan generated for it:
SELECT *, "objects"
FROM TABLE(EXPLAIN_JSON(SYSTEM$EXPLAIN_PLAN_JSON('SELECT * FROM a.b.any_table_or_view')))
WHERE "operation"='TableScan'
You can find all of your previous ran queries too:
select QUERY_TEXT
from table(information_schema.query_history())
So the natural next step would be combine both - but that's not straightforward, as you'll get an error like:
SQL compilation error: argument 1 to function EXPLAIN_JSON needs to be constant, found 'SYSTEM$EXPLAIN_PLAN_JSON('SELECT * FROM a.b.c')'
The solution would be to combine the queries from the query_history() with the SYSTEM$EXPLAIN_PLAN_JSON outside (to make the strings constant), and then you will be able to find out the most queried tables.

Oracle view join to table very weird slow issue

I have a table order, which is very straightforward, it is storing order data.
I have a view, which is storing currency pair and currency rate. The view is created as below:
create or replace view view_currency_rate as (
select c.* from currency_rate c, (
select curr_from, curr_to, max(rate_date) max_rate_date from currency_rate
where system_rate > 0
group by curr_from, curr_to) r
where c.curr_from = r.curr_from
and c.curr_to = r.curr_to
and c.rate_date = r.max_rate_date
and c.system_rate > 0
);
nothing fancy here, this view populate the latest currency rate (curr_from -> curr_to) from the currency_rate table.
When I do as below, it populate 80k row (all data) because I have plenty of records in order table. And the time spent is less than 5 seconds.
First Query:
select * from
VIEW_CURRENCY_RATE c, order a
where
c.curr_from = A.CURRENCY;
I want to add in more filter, so I thought it could be faster, so I added this:
Second Query:
select * from
VIEW_CURRENCY_RATE c, order a
where
a.id = 'xxxx'
and c.curr_from = A.CURRENCY;
And now it run over 1 minute! I totally have no idea what happen to this. I thought it would be some oracle optimizer goes wrong, so I try to find another way, think of just the 80K data can be populated quite fast, so I try to get the data from it, so I nested the SQL as below:
select * from (
select * from
VIEW_CURRENCY_RATE c, order a
where
c.curr_from = A.CURRENCY
)
where id = 'xxxx';
It run damn slow as well! I running out of idea, can anyone explain what happen to my script?
Updated on 6-Sep-2016
After I know how to 'explain plan', I capture the screen:
Fist query (fast one with 80K data):
Second query (slow one):
The slow one totally break the view and form a new SQL! This is super weird that how can Oracle optimize this like that?
It seems problem relates to the plan of second query. because it uses of nest loops inplace of hash joint.
at first check if _hash_join_enable is true if it isn't true change it to true. if it is true there are some problem with oracle optimizer. for test it use of USE_HASH(tab2 tab1) hint.
Regards
mohsen
I am using Mike solution, I re-write the script, and it is running fast now, although the root cause is not determined, probably due to the oracle optimizer algorithm working in different way that I expect.

WHERE clause on calculated field not working

I have an Access SQL query pulling back results from a Latitude & Longitude input (similar to a store locator). It works perfectly fine until I attempt to put in a WHERE clause limiting the results to only resultants within XXX miles (3 in my case).
The following query works fine without the WHERE distCalc < 3 clause being added in:
PARAMETERS
[selNum] Long
, [selCBSA] Long
, [cosRadSelLAT] IEEEDouble
, [radSelLONG] IEEEDouble
, [sinRadSelLAT] IEEEDouble;
SELECT B.* FROM (
SELECT A.* FROM (
SELECT
CERT
, RSSDHCR
, NAMEFULL
, BRNUM
, NAMEBR
, ADDRESBR
, CITYBR
, STALPBR
, ZIPBR
, simsLAT
, simsLONG
, DEPDOM
, DEPSUMBR
, 3959 * ArcCOS(
cosRadSelLAT
* cosRadSimsLAT
* cos(radSimsLONG - radSelLONG)
+ sinRadSelLAT
* sinRadSimsLAT
) AS distCalc
FROM aBRc
WHERE CBSA = selCBSA
AND cosRadSimsLAT IS NOT NULL
AND UNINUMBR <> selNum
) AS A
ORDER BY distCalc
) AS B
WHERE B.distCalc < 3
ORDER BY B.DEPSUMBR DESC;
When I add the WHERE distCalc < 3 clause, I get the dreaded
This expression is typed incorrectly, or it is too complex to be evaluated.
error.
Given that the value is created in the A sub-query I thought that it would be available in the outer B query for comparative calcs. I could recalculate the distCalc in the WHERE, however, I'm trying to avoid that since I'm using a custom function (ArcCOS). I'm already doing one hit on each row and there is significant overhead involved doing additional if I can avoid it.
The way you have it typed you are limiting it by B.distCalc, which is requires calcuation of A.distCalc, on which you are asking for a sort. Even if this worked worked, would require n^2 calculations to compute.
Try putting the filter on distCalc in the inner query (using the formula for distCalc, not distCalc itself).
This is not an answer. It's a formatted comment. What happens when you do this:
select *
from (
select somefield, count(*) records
from sometable
group by somefield) temp
If that runs successfully, try it with
where records > 0
at the end. If that fails, you probably need another approach. If it succeeds, start building your real query using baby steps like this. Test early and test often.
I was able to "solve" the problem by pushing the complicated formula into the function and then returning the value (which was then able to be used in the WHERE clause).
Instead of:
3959 * ArcCOS( cosRadSelLAT * cosRadSimsLAT * cos(radSimsLONG - radSelLONG) + sinRadSelLAT * sinRadSimsLAT) AS distCalc
I went with:
ArcCOS2(cosRadSelLAT,cosRadSimsLAT,radSimsLONG, radSelLONG,sinRadSelLAT,sinRadSimsLAT) AS distCalc
The ArcCOS2 Function contained the full formula. The upside is it works, the downside is that appears to be a slight tad slower. I appreciate everyone's help on this. Thank you.

Retrieve hierarchical groups ... with infinite recursion

I've a table like this which contains links :
key_a key_b
--------------
a b
b c
g h
a g
c a
f g
not really tidy & infinite recursion ...
key_a = parent
key_b = child
Require a query which will recompose and attribute a number for each hierarchical group (parent + direct children + indirect children) :
key_a key_b nb_group
--------------------------
a b 1
a g 1
b c 1
**c a** 1
f g 2
g h 2
**link responsible of infinite loop**
Because we have
A-B-C-A
-> Only want to show simply the link as shown.
Any idea ?
Thanks in advance
The problem is that you aren't really dealing with strict hierarchies; you're dealing with directed graphs, where some graphs have cycles. Notice that your nbgroup #1 doesn't have any canonical root-- it could be a, b, or c due to the cyclic reference from c-a.
The basic way of dealing with this is to think in terms of graph techniques, not recursion. In fact, an iterative approach (not using a CTE) is the only solution I can think of in SQL. The basic approach is explained here.
Here is a SQL Fiddle with a solution that addresses both the cycles and the shared-leaf case. Notice it uses iteration (with a failsafe to prevent runaway processes) and table variables to operate; I don't think there's any getting around this. Note also the changed sample data (a-g changed to a-h; explained below).
If you dig into the SQL you'll notice that I changed some key things from the solution given in the link. That solution was dealing with undirected edges, whereas your edges are directed (if you used undirected edges the entire sample set is a single component because of the a-g connection).
This gets to the heart of why I changed a-g to a-h in my sample data. Your specification of the problem is straightforward if only leaf nodes are shared; that's the specification I coded to. In this case, a-h and g-h can both get bundled off to their proper components with no problem, because we're concerned about reachability from parents (even given cycles).
However, when you have shared branches, it's not clear what you want to show. Consider the a-g link: given this, g-h could exist in either component (a-g-h or f-g-h). You put it in the second, but it could have been in the first instead, right? This ambiguity is why I didn't try to address it in this solution.
Edit: To be clear, in my solution above, if shared braches ARE encountered, it treats the whole set as a single component. Not what you described above, but it will have to be changed after the problem is clarified. Hopefully this gets you close.
You should use a recursive query. In the first part we select all records which are top level nodes (have no parents) and using ROW_NUMBER() assign them group ID numbers. Then in the recursive part we add to them children one by one and use parent's groups Id numbers.
with CTE as
(
select t1.parent,t1.child,
ROW_NUMBER() over (order by t1.parent) rn
from t t1 where
not exists (select 1 from t where child=t1.parent)
union all
select t.parent,t.child, CTE.rn
from t
join CTE on t.parent=CTE.Child
)
select * from CTE
order by RN,parent
SQLFiddle demo
Painful problem of graph walking using recursive CTEs. This is the problem of finding connected subgraphs in a graph. The challenge with using recursive CTEs is to prevent unwarranted recursion -- that is, infinite loops In SQL Server, that typically means storing them in a string.
The idea is to get a list of all pairs of nodes that are connected (and a node is connected with itself). Then, take the minimum from the list of connected nodes and use this as an id for the connected subgraph.
The other idea is to walk the graph in both directions from a node. This ensures that all possible nodes are visited. The following is query that accomplishes this:
with fullt as (
select keyA, keyB
from t
union
select keyB, keyA
from t
),
CTE as (
select t.keyA, t.keyB, t.keyB as last, 1 as level,
','+cast(keyA as varchar(max))+','+cast(keyB as varchar(max))+',' as path
from fullt t
union all
select cte.keyA, cte.keyB,
(case when t.keyA = cte.last then t.keyB else t.keyA
end) as last,
1 + level,
cte.path+t.keyB+','
from fullt t join
CTE
on t.keyA = CTE.last or
t.keyB = cte.keyA
where cte.path not like '%,'+t.keyB+',%'
) -- select * from cte where 'g' in (keyA, keyB)
select t.keyA, t.keyB,
dense_rank() over (order by min(cte.Last)) as grp,
min(cte.Last)
from t join
CTE
on (t.keyA = CTE.keyA and t.keyB = cte.keyB) or
(t.keyA = CTE.keyB and t.keyB = cte.keyA)
where cte.path like '%,'+t.keyA+',%' or
cte.path like '%,'+t.keyB+',%'
group by t.id, t.keyA, t.keyB
order by t.id;
The SQLFiddle is here.
you might want to check with COMMON TABLE EXPRESSIONS
here's the link

Convert row data into columns Access 07 without using PIVOT

I am on a work term from school. I am not very comfortable using SQL, I am trying to get a hold of it....
My supervisor gave me a task for a user in which I need to take row data and make columns. We used the Crosstab Wizard and automagically created the SQL to get what we needed.
Basically, we have a table like this:
ReqNumber Year FilledFlag(is a checkbox) FilledBy
1 2012 (notchecked) ITSchoolBoy
1 2012 (checked) GradStudent
1 2012 (notchecked) HighSchooler
2 etc, etc.
What the user would like is to have a listing of all of the req numbers and what is checked
Our automatic pivot code gives us all of the FilledBy options (there are 9 in total) as column headings, and groups it all by reqnumber.
How can you do this without the pivot? I would like to wrap my head around this. Nearest I can find is something like:
SELECT
SUM(IIF(FilledBy = 'ITSchoolboy',1,0) as ITSchoolboy,
SUM(IIF(FilledBy = 'GradStudent',1,0) as GradStudent, etc.
FROM myTable
Could anyone help explain this to me? Point me in the direction of a guide? I've been searching for the better part of a day now, and even though I am a student, I don't think this will be smiled upon for too long. But I would really like to know!
I think your boss' suggestion could work if you GROUP BY ReqNumber.
SELECT
ReqNumber,
SUM(IIF(FilledBy = 'ITSchoolboy',1,0) as ITSchoolboy,
SUM(IIF(FilledBy = 'GradStudent',1,0) as GradStudent,
etc.
FROM myTable
GROUP BY ReqNumber;
A different approach would be to JOIN multiple subqueries. This example pulls in 2 of your categories. If you need to extend it to 9 categories, you would have a whole lot of joining going on.
SELECT
itsb.ReqNumber,
itsb.ITSchoolboy,
grad.GradStudent
FROM
(
SELECT
ReqNumber,
FilledFlag AS ITSchoolboy
FROM myTable
WHERE FilledBy = "ITSchoolboy"
) AS itsb
INNER JOIN
(
SELECT
ReqNumber,
FilledFlag AS GradStudent
FROM myTable
WHERE FilledBy = "GradStudent"
) AS grad
ON itsb.ReqNumber = grad.ReqNumber
Please notice I'm not suggesting you should use this approach. However, since you asked about alternatives to your pivot approach (which works) ... this is one. Stay tuned in case someone else offers a simpler alternative. :-)