I get a null value when using the CORR() function on a query with a table join. However, on a query without a join the CORR() function returns a value, and I get values for the other fields. I have tried giving the fields aliases and no aliases, but I can't seem to get a value for correlation in Query 2.
Thanks in advance.
Query 1
Returns a value for correlation. Query and result json link below.
select DATE(Time ) as date, ROUND(AVG(Price),2) as price, ROUND(SUM(amount),2) as volume, CORR(price, amount) as correlation
from
ds_5.tb_4981, ds_5.tb_4978, ds_5.tb_4967
where YEAR(Time) = 2014
group by date
order by date ASC
Query 1 result json: https://json.datadives.com/64cbd7a4a5aba3a864b17a719148620f.json
Query 2
Null value for correlation. Query and result json link below.
select bitcoin.date as date, bitcoin.btcprice, blockchain.trans_vol, CORR(bitcoin.btcprice,blockchain.trans_vol) as correlation
from
(select DATE(time) as date, AVG(price) as btcprice
from
ds_5.tb_4981, ds_5.tb_4978, ds_5.tb_4967
where YEAR(Time) = 2014
group by date) as bitcoin
JOIN
(select
DATE(blocktime) as date, SUM(vout.value) as trans_vol
from ds_14.tb_7917, ds_14.tb_7918, ds_14.tb_7919, ds_14.tb_7920, ds_14.tb_7921, ds_14.tb_7922, ds_14.tb_7923, ds_14.tb_7924, ds_14.tb_7925, ds_14.tb_7926, ds_14.tb_7927, ds_14.tb_7928, ds_14.tb_7934, ds_14.tb_7972, ds_14.tb_8016, ds_14.tb_8086, ds_14.tb_9743, ds_14.tb_9888, ds_14.tb_10084, ds_14.tb_10136, ds_14.tb_10500, ds_14.tb_10601
where YEAR(blocktime) = 2014
group by Date) as blockchain
on bitcoin.date = blockchain.date
group each by date, bitcoin.btcprice, blockchain.trans_vol
order by date ASC
Query 2 result json: https://json.datadives.com/9427dc9f51ba36add5f008403def7b6d.json
I took the CSV you linked and left it here: https://bigquery.cloud.google.com/table/fh-bigquery:public_dump.datadivescsv
(I'm not sure why you would prefer to share the CSV as a file instead of creating a public dataset in BigQuery and sharing the link.)
So this works:
SELECT CORR(btc_price, trans_vol)
FROM [fh-bigquery:public_dump.datadivescsv]
-0.004957046970769512
But this doesn't:
SELECT CORR(btc_price, trans_vol)
FROM [fh-bigquery:public_dump.datadivescsv]
GROUP BY date
null
null
...
null
And that's expected!
Why: to compute a correlation we need more than one pair of numbers. Grouping by date in the second query leaves n groups of 1 element each, hence the correlation is not computable.
(Side note: the correlation between 2 pairs is always 1 or -1. We really need at least 3, and far more for the results to be significant.)
SELECT CORR(x, y)
FROM (SELECT 1 x, 2 y)
null
SELECT CORR(x, y)
FROM (SELECT 1 x, 2 y), (SELECT 3 x, 8 y)
1.0
SELECT CORR(x, y)
FROM (SELECT 1 x, 2 y), (SELECT 3 x, 8 y), (SELECT 7 x, 1 y)
-0.3170147297373293
... and so on
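So if the goal is a per-period correlation, each group has to keep many day-level pairs. A minimal sketch, grouping the shared table by month instead of by day (assuming its date column is a TIMESTAMP, so the legacy SQL MONTH() function applies):
SELECT MONTH(date) AS month, CORR(btc_price, trans_vol) AS correlation
FROM [fh-bigquery:public_dump.datadivescsv]
GROUP BY month
ORDER BY month ASC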
Related
I have run into an issue I don't know how to solve. I'm working with an MS Access DB.
I have this data:
I want to write a SELECT statement that gives the following result:
For each combination of Project and Invoice, I want to return the record containing the maximum date, conditional on all records for that combination of Project and Invoice being Signed (i.e. Signed or Date column not empty).
In my head, first I would sort the irrelevant records out, and then return the max date for the remaining records. I'm stuck on the first part.
Could anyone point me in the right direction?
Thanks,
Hulu
Start with an initial query which fetches the combinations of Project, Invoice, Date from the rows you want returned by your final query.
SELECT
y0.Project,
y0.Invoice,
Max(y0.Date) AS MaxOfDate
FROM YourTable AS y0
GROUP BY y0.Project, y0.Invoice
HAVING Sum(IIf(y0.Signed Is Null,1,0))=0;
The HAVING clause discards any Project/Invoice groups which include a row with a Null in the Signed column.
If you save that query as qryTargetRows, you can then join it back to your original table to select the matching rows.
SELECT
y1.Project,
y1.Invoice,
y1.Desc,
y1.Value,
y1.Signed,
y1.Date
FROM
YourTable AS y1
INNER JOIN qryTargetRows AS sub
ON (y1.Project = sub.Project)
AND (y1.Invoice = sub.Invoice)
AND (y1.Date = sub.MaxOfDate);
Or you can do it without the saved query by directly including its SQL as a subquery.
SELECT
y1.Project,
y1.Invoice,
y1.Desc,
y1.Value,
y1.Signed,
y1.Date
FROM
YourTable AS y1
INNER JOIN
(
SELECT y0.Project, y0.Invoice, Max(y0.Date) AS MaxOfDate
FROM YourTable AS y0
GROUP BY y0.Project, y0.Invoice
HAVING Sum(IIf(y0.Signed Is Null,1,0))=0
) AS sub
ON (y1.Project = sub.Project)
AND (y1.Invoice = sub.Invoice)
AND (y1.Date = sub.MaxOfDate);
Write a SQL query, which should work in MS Access too, like this:
SELECT
Project,
Invoice,
MIN([Desc]) AS Descriptions,
SUM([Value]) AS [Value],
MIN(Signed) AS Signed,
MAX([Date]) AS [Date]
FROM data
WHERE Signed<>'' AND [Date]<>''
GROUP BY
Project,
Invoice
output:
Project  Invoice  Descriptions  Value  Signed  Date
A        1        Ball          100    J.D.    2022-09-20
B        1        Sofa          300    J.D.    2022-09-22
B        2        Desk          100    J.D.    2022-09-23
Note: for invoice 1 on project A, you will see a value of 300, which is the total for that invoice (when grouping on Project='A' and Invoice=1).
Maybe I should have used DCONCAT (see: Concatenation in between records in Access Query) for the Description, to include 'TV' in it, but I am unable to test that, so I am only referring to that answer.
Try joining a second query:
Select *
From YourTable As T
Inner Join
(Select Project, Invoice, Max([Date]) As MaxDate
From YourTable
Group By Project, Invoice) As S
On T.Project = S.Project And T.Invoice = S.Invoice And T.Date = S.MaxDate
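Note that this variant returns the max date per Project/Invoice whether or not every record is signed. To also enforce the all-signed condition from the question, the HAVING trick from the first answer can be folded into the subquery; a sketch, assuming unsigned rows have a Null in the Signed column:
Select *
From YourTable As T
Inner Join
(Select Project, Invoice, Max([Date]) As MaxDate
From YourTable
Group By Project, Invoice
Having Sum(IIf(Signed Is Null, 1, 0)) = 0) As S
On T.Project = S.Project And T.Invoice = S.Invoice And T.Date = S.MaxDate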
I need to create a table with the same aggregations at multiple different levels in BigQuery, for example:
SELECT
dimension_a,
dimension_b,
SUM(value) AS value
FROM mydataset.table
GROUP BY dimension_a, dimension_b
UNION ALL
SELECT
dimension_a,
NULL AS dimension_b,
SUM(value) AS value
FROM mydataset.table
GROUP BY dimension_a
UNION ALL
SELECT
NULL AS dimension_a,
dimension_b,
SUM(value) AS value
FROM mydataset.table
GROUP BY dimension_b
I guess this is not the most elegant code... for example, I have 18 different aggregation dimensions, which means I need to stack those similar code blocks 18 times with UNION ALL. I am wondering if there could be a function that takes the aggregation dimension as an input?
For example, something like:
CREATE OR REPLACE TABLE FUNCTION mydataset.aggregation_dimension(X type, Y type)
AS
SELECT
X as dimension_a, Y as dimension_b,
SUM(value) AS value
FROM mydataset.table
GROUP BY X, Y;
SELECT * FROM mydataset.aggregation_dimension(dimension_a, dimension_b)
UNION ALL
SELECT * FROM mydataset.aggregation_dimension(dimension_a, NULL)
UNION ALL
SELECT * FROM mydataset.aggregation_dimension(NULL, dimension_b)
where X and Y should be columns in the table mydataset.table. However, I have no idea how to define the type of such inputs, or whether this kind of setup is even possible...
Thank you in advance for your help!
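For what it's worth, BigQuery's GoogleSQL now supports GROUPING SETS (along with ROLLUP and CUBE), which expresses exactly this stack of aggregations without UNION ALL or a table function. A minimal sketch, assuming the feature is available in your project; the dimension left out of each grouping set comes back as NULL, matching the hand-written UNION ALL output:
SELECT
dimension_a,
dimension_b,
SUM(value) AS value
FROM mydataset.table
GROUP BY GROUPING SETS ((dimension_a, dimension_b), (dimension_a), (dimension_b));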
Suppose I have a column C in a table T, as follows:
sr  c
1   34444444444440
2   34444444444442
3   34444444444444
4   34444444444446
5   34444444444448
6   34444444444450
How can I verify or check whether the values in column C form an arithmetic progression?
An arithmetic progression means that the differences between consecutive values are all the same. Assuming the values are not floating point, you can compare them directly:
select (min(c - prev_c) = max(c - prev_c)) as is_arithmetic_progression
from (select t.*,
lag(c) over (order by sr) as prev_c
from t
) t
If these are floating-point values, you probably want some sort of tolerance instead of an exact comparison, such as:
select (max(c - prev_c) - min(c - prev_c)) < 0.001 as is_arithmetic_progression
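Spelled out against the same derived table (a sketch; 0.001 is an arbitrary tolerance to tune for your data):
select (max(c - prev_c) - min(c - prev_c)) < 0.001 as is_arithmetic_progression
from (select t.*,
             lag(c) over (order by sr) as prev_c
      from t
     ) t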
step-by-step demo:db<>fiddle
SELECT
COUNT(*) = 1 as is_arithmetic_progression -- 4
FROM (
SELECT
difference
FROM (
SELECT
*,
lead(c) OVER (ORDER BY sr) - c as difference -- 1
FROM
mytable
) s
WHERE difference IS NOT NULL -- 2
GROUP BY difference -- 3
) s
Arithmetic progression: the difference between consecutive elements is constant.
The lead() window function shifts the next value into the current row; subtracting the current value yields the difference between consecutive elements.
lead() produces a NULL in the last row, because there is no "next" value, so that row is filtered out.
The difference values are grouped.
If there is only one distinct difference value, this yields exactly one group. A single difference value means the elements have a constant difference, which is exactly what an arithmetic progression is. So if the number of groups is exactly 1, you have an arithmetic progression.
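Equivalently, the grouping and counting steps can be collapsed into a COUNT(DISTINCT ...); a sketch of the same idea:
SELECT
COUNT(DISTINCT difference) = 1 as is_arithmetic_progression
FROM (
SELECT
lead(c) OVER (ORDER BY sr) - c as difference
FROM
mytable
) s
WHERE difference IS NOT NULL -- COUNT(DISTINCT ...) ignores NULLs anyway; kept for clarity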
You can use exists as follows:
Select case when count(*) > 0 then 'no progression' else 'progression' end as res_
From your_table t
Where exists
(select 1 from your_table tt
Where tt.sr > t.sr
And tt.c < t.c)
I know that there are some threads on this subject; however, my query is slightly different from what I've seen, and the solutions presented before don't seem to be working for me.
I have two tables, X and Y, here simplified to one ID (in reality I have multiple IDs). Each period lasts from the Date given to the beginning of the next period.
ID Date Period
A 12/01/2010 1
A 12/03/2010 2
A 15/06/2010 3
A 17/08/2010 4
A 20/10/2010 5
and
ID SampleDate
A 20/01/2010
A 25/01/2010
A 21/11/2010
What I need to get is:
ID SampleDate Period
A 20/01/2010 1
A 25/01/2010 1
A 21/11/2010 5
I've tried this:
with cte as
(
select
Y.ID,
Y.sampleDate,
X.Period,
ROW_NUMBER() over (PARTITION by Y.ID, Y.sampleDate order by DATEDIFF(day,X.Date, Y.sampleDate)) as DaysSince
from X
left join Y
on X.ID=Y.ID
)
select ID,
sampleDate,
Period
from cte
where DaysSince=1
This produces a table of the correct size, but instead of giving the respective periods for the samples, it just repeats the top period number for all of them (for a given ID).
Any idea where I'm making a mistake?
There is nothing in your query that removes entries with a negative datediff, so add that condition to the join:
with cte as
(
select
Y.ID,
Y.sampleDate,
X.Period,
ROW_NUMBER() over (PARTITION by Y.ID, Y.sampleDate order by DATEDIFF(day,X.Date, Y.sampleDate)) as DaysSince
from X
left join Y
on X.ID=Y.ID and X.Date < Y.sampleDate /* skip periods after the one we're interested in */
)
select ID,
sampleDate,
Period
from cte
where DaysSince=1
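An equivalent formulation without the ROW_NUMBER trick uses OUTER APPLY to pick, for each sample, the latest period starting before it (a sketch, assuming SQL Server given the DATEDIFF syntax):
select Y.ID,
       Y.sampleDate,
       P.Period
from Y
outer apply (select top 1 X.Period
             from X
             where X.ID = Y.ID
               and X.Date < Y.sampleDate
             order by X.Date desc) as P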
I have a postgresql query like this:
with r as (
select
1 as reason_type_id,
rarreason as reason_id,
count(*) over() count_all
from
workorderlines
where
rarreason != 0
and finalinsdate >= '2012-12-01'
)
select
r.reason_id,
rt.desc,
count(r.reason_id) as num,
round((count(r.reason_id)::float / (select count(*) as total from r) * 100.0)::numeric, 2) as pct
from r
left outer join
rtreasons as rt
on
r.reason_id = rt.rtreason
and r.reason_type_id = rt.rtreasontype
group by
r.reason_id,
rt.desc
order by r.reason_id asc
This returns a table of results with 4 columns: the reason id, the description associated with that reason id, the number of entries having that reason id, and the percent of the total that number represents.
This table looks like this:
What I would like to do is display only the top 10 results based on the number of entries per reason id, and compile whatever is left over into one extra row with the description "Other". How would I do this?
with r2 as (
...everything before the select list...
dense_rank() over(order by pct desc) as cause_rank
...the rest of your query...
)
select * from r2 where cause_rank < 11
union all
select
NULL as reason_id,
'Other' as "desc",
sum(r2.num) as num,
sum(r2.pct) as pct,
11 as cause_rank
from r2
where cause_rank >= 11
As said above, use LIMIT for the top rows, and OFFSET for skipping them and getting the rest.
Not sure about Postgres, but SELECT TOP 10... should do the trick if you sort correctly.
As for the second part: you might use a RIGHT JOIN for this. Join the top-10 result with the whole table and use only the records not appearing on the left side. If you calculate the sum of those, you should get your "sum of the rest" result.
I assume that vw_my_top_10 is the view showing you the top 10 records. vw_all_records shows all records (including the top 10).
Like this:
SELECT SUM(a_field)
FROM vw_my_top_10
RIGHT JOIN vw_all_records
ON (vw_my_top_10.Key = vw_all_records.Key)
WHERE vw_my_top_10.Key IS NULL
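The same anti-join can also be written as a LEFT JOIN from the full set, keeping only the rows with no top-10 match (a sketch using the same hypothetical views):
SELECT SUM(vw_all_records.a_field)
FROM vw_all_records
LEFT JOIN vw_my_top_10
ON (vw_my_top_10.Key = vw_all_records.Key)
WHERE vw_my_top_10.Key IS NULL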