SQL: Calculating correlations of asset classes

I have a database with 101 simulations for, let's say, 5 different asset classes' returns.
I need to write a query that will calculate the respective correlations between each of the 5 classes. The table will look something like this:
AssetClass_ID | Simulation | AssetClass_Value
Any ideas? I am struggling to get even close.
(Depending on difficulty, I may end up having to tell the end user to just download all the simulations and do the stats using Excel's built-in functions, but I am unlikely to be popular for doing so.)

OK, with some googling and some work I came up with:
SELECT
    AssetID_1,
    AssetID_2,
    ((psum - (sum1 * sum2 / n)) / SQRT((sum1sq - sum1 * sum1 / n) * (sum2sq - sum2 * sum2 / n))) AS [Correlation Coefficient],
    n
FROM
    (SELECT
         n1.AssetClass_ID AS AssetID_1,
         n2.AssetClass_ID AS AssetID_2,
         SUM(n1.RunResults_Value) AS sum1,
         SUM(n2.RunResults_Value) AS sum2,
         SUM(n1.RunResults_Value * n1.RunResults_Value) AS sum1sq,
         SUM(n2.RunResults_Value * n2.RunResults_Value) AS sum2sq,
         SUM(n1.RunResults_Value * n2.RunResults_Value) AS psum,
         COUNT(*) AS n
     FROM
         dbo.tbl_RunResults AS n1
         LEFT JOIN dbo.tbl_RunResults AS n2 ON n1.Simulation_ID = n2.Simulation_ID
     WHERE
         n1.AssetClass_ID < n2.AssetClass_ID AND
         n1.series_ID = 2332 AND
         n2.series_ID = 2332
     GROUP BY
         n1.AssetClass_ID, n2.AssetClass_ID) AS step1
ORDER BY
    AssetID_1
The answers match Excel's built-in functions, so far so good.
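As an aside, SQL Server has no built-in correlation aggregate, which is why the sums-based formula above is needed. On engines that do provide a CORR() aggregate (PostgreSQL, Oracle, and BigQuery, for example), the same pairwise figures can be produced more directly. A rough sketch, reusing the original table and column names (minus the dbo. schema prefix, which would need adapting to the target engine):

SELECT
    n1.AssetClass_ID AS AssetID_1,
    n2.AssetClass_ID AS AssetID_2,
    CORR(n1.RunResults_Value, n2.RunResults_Value) AS correlation_coefficient,
    COUNT(*) AS n
FROM tbl_RunResults AS n1
JOIN tbl_RunResults AS n2
    ON n1.Simulation_ID = n2.Simulation_ID
WHERE n1.AssetClass_ID < n2.AssetClass_ID
    AND n1.series_ID = 2332
    AND n2.series_ID = 2332
GROUP BY n1.AssetClass_ID, n2.AssetClass_ID
ORDER BY AssetID_1;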

Frequency table of continuous variable in SQL?

I have a SQL table with a continuous variable x:
x
1 622.108
2 622.189
3 622.048
4 622.758
5 622.191
6 622.677
7 622.598
8 622.020
9 621.228
10 622.690
...
and I am trying to get a simple frequency table, e.g. with 3 buckets, like this:
bucket n
[621.228-621.738[ 1
[621.738-622.248[ 5
[622.248-622.758] 4
It seems easy, but I cannot manage to do it in SQL (I am running it on a Cloudera Impala engine).
I have looked into dense_rank() and ntile() without success.
Any ideas?
You can use window functions to divide the range into three equal parts and then use arithmetic:
select min_x + bucket_width * (row_number() over (order by min(x)) - 1) as bucket_lo,
       min_x + bucket_width * row_number() over (order by min(x)) as bucket_hi,
       count(*) as n
from (select t.*,
             min(x) over () as min_x,
             max(x) over () as max_x,
             (0.000001 + max(x) over () - min(x) over ()) / 3 as bucket_width
      from t
     ) t
group by floor((x - min_x) / bucket_width), min_x, bucket_width
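Alternatively, if your Impala version has the width_bucket() function (an assumption about your release; it appears in newer Impala versions and also exists in PostgreSQL) and it accepts column expressions for the bounds, the bucket assignment can be written more directly. A minimal sketch against the same table t and column x; it returns the bucket number and count, and building the [lo-hi[ labels from the minimum and the bucket width is a small extra step:

select bucket, count(*) as n
from (select x,
             width_bucket(x, min(x) over (), max(x) over () + 0.000001, 3) as bucket
      from t
     ) t
group by bucket
order by bucket;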
There are at least two problems with your question:
1) You have not provided any code to show us what you have tried. It really is good sometimes to just work the problem out yourself. Nevertheless, I found the problem interesting and decided to play.
2) Your range buckets overlap. If, for example, you were to have the value 621.738 in your list, which bucket would contain it: [621.228-621.738] or [621.738-622.248]?
There are also at least three problems with my answer, so I don't expect you to accept this. However, maybe it will get you started. Hopefully, this disclaimer will keep me from getting downvoted. :-)
1) The answer is in T-SQL. Sorry, it's what I have to work with.
2) The answer is not generic. It always creates three, and only three, buckets.
3) It only works if the data type limits the result to 3 decimal places.
Remember, this is only one possible solution, and in my mind a very weak one at that.
With those disclaimers, here's what I wrote:
SELECT
'[' + STR( RANGES.RANGESTART, 7, 3 )
+ ' - '
+ STR( RANGES.RANGEEND, 7, 3 ) + ']' AS 'BUCKET'
,COUNT(*) AS 'N'
FROM
( SELECT
VALS.MINVAL + (CAST( CNT.INC AS DECIMAL(7,3) ) * VALS.RANGEWIDTH) AS 'RANGESTART'
,CASE WHEN CNT.INC < 2
THEN VALS.MINVAL + (CAST( CNT.INC + 1 AS DECIMAL(7,3) ) * VALS.RANGEWIDTH) - 0.001
ELSE VALS.MINVAL + (CAST( CNT.INC + 1 AS DECIMAL(7,3) ) * VALS.RANGEWIDTH)
END AS 'RANGEEND'
FROM
( SELECT
MIN(CURVAL) AS 'MINVAL'
,MAX(CURVAL) AS 'MAXVAL'
,(MAX(CURVAL) - MIN(CURVAL)) / 3 AS 'RANGEWIDTH'
FROM
MYVALUE ) VALS
CROSS JOIN (VALUES (0), (1), (2) ) CNT(INC)
) RANGES
INNER JOIN MYVALUE V
ON V.CURVAL BETWEEN RANGES.RANGESTART AND RANGES.RANGEEND
GROUP BY
RANGES.RANGESTART
,RANGES.RANGEEND
ORDER BY 1
;
In the above, your values would be in the CURVAL column of the MYVALUE table.
Good luck. I hope this helps you on your way.

Find the maximum in a column, but only when two other columns match

I need help in PostgreSQL.
I have two tables:
Prediction - predicts future disasters and casualties for each city.
Measures - matches damage-control providers to each type of disaster (incl. cost and percent of "averted casualties").
Each disaster and provider combination has an amount of averted casualties (the percent from Measures * the number of predicted casualties for that disaster * 0.01).
For each combination of city and disaster, I need to find two providers that
1) have a combined cost of less than a million, and
2) have the biggest amount of combined averted casualties.
My work and product so far
select o1.cname, o1.etype, o1.provider as provider1, o2.provider as provider2,
       (o1.averted + o2.averted) as averted_casualties
from (select cname, m.etype, provider, mcost, (percent * Casualties * 0.01) as averted
      from measures m, prediction p
      where m.etype = p.etype) as o1,
     (select cname, m.etype, provider, mcost, (percent * Casualties * 0.01) as averted
      from measures m, prediction p
      where m.etype = p.etype) as o2
where o1.cname = o2.cname and o1.etype = o2.etype
  and o1.provider < o2.provider and o1.mcost + o2.mcost < 1000000
How do I change this query so it will show me the best averted_casualties for each city/disaster combo (not just the max over the whole table, but the max for each combo)?
This is the desired outcome:
P.S. I'm not allowed to use ordering, views or functions.
First, construct all pairs of providers and do the casualty and cost calculation:
select p.*, m1.provider as provider_1, m2.provider as provider_2,
p.casualties * (1 - m1.percent / 100.0) * (1 - m2.percent / 100.0) as net_casualties,
(m1.mcost + m2.mcost) as total_cost
from measures m1 join
measures m2
on m1.etype = m2.etype and m1.provider < m2.provider join
prediction p
on m1.etype = p.etype;
Then, apply your conditions. Normally, you would use window functions, but since ordering isn't allowed for this exercise, you want to use a subquery:
with pairs as (
select p.*, m1.provider as provider_1, m2.provider as provider_2,
p.casualties * (1 - m1.percent / 100.0) * (1 - m2.percent / 100.0) as net_casualties,
(m1.mcost + m2.mcost) as total_cost
from measures m1 join
measures m2
on m1.etype = m2.etype and m1.provider < m2.provider join
prediction p
on m1.etype = p.etype
)
select p.*
from pairs p
where p.total_cost < 1000000 and
p.net_casualties = (select min(p2.net_casualties)
from pairs p2
where p2.cname = p.cname and p2.etype = p.etype and
p2.total_cost < 1000000
);
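For comparison, the window-function version mentioned above (not usable in this exercise, since it relies on ordering inside rank(); a sketch only, assuming the same pairs CTE and a cname column in prediction) would look roughly like this:

select p.*
from (select p.*,
             rank() over (partition by cname, etype order by net_casualties) as seqnum
      from pairs p
      where p.total_cost < 1000000
     ) p
where seqnum = 1;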
The biggest number of averted casualties results in the smallest number of net casualties. They are the same thing.
As for your attempted solution: just seeing the comma in the FROM clause tells me that you need to study up on JOIN. Simple rule: never use commas in the FROM clause; always use proper, explicit, standard JOIN syntax.
Your repeated subqueries also suggest that you need to learn about CTEs.

SQL | Match Shop cash with Bank Cash

Note: I'm not looking for paid software that will do this job (it's too expensive).
We have an issue with cash management: matching the values.
I have two SQL tables, let's call them SHOP_CASH and BANK_CASH.
1) The matching should happen based on ShopName-CashAmount-Date.
2) Here I face two issues:
Cash: the amounts should be rounded to the nearest £50 (ideally, both 12,400 and 12,499 would map to 12,450), or alternatively the match could be based on the cash difference being less than 50: if two values differ by less than 50, match them. The question is how to match the values up; these are just rough ideas and I'm stuck.
Dates: the shop can cash up a few days later, so I need to join the cash-up date (for example 2018-10-26) with a bank date RANGE of 2018-10-26 to 2018-11-02 (+7 days).
Currently I do not see a logical way of matching in these circumstances. Any logical path of calculation/joining would be extremely appreciated.
TRY:
Let's say I can join the two tables by shop name - cool.
Then I will try to join by date, which potentially will be:
SELECT * FROM SHOP_CASH AS SC
LEFT JOIN BANK_CASH AS BC
ON SC.SHOP_NAME_SC = BC.SHOP_NAME_BC
AND BC.DATE_BC = (ANY DATE FROM SC.DATE_SC TO SC.DATE_SC + 7 DAYS - not sure how to write this)
AND FLOOR(SC.CASH_SC / 50) * 50 = FLOOR(BC_CASH_BC / 50) * 50
P.S. For this project I will be using Google BigQuery.
This is my (temporary) solution:
WITH MAIN AS(SELECT
CMS.Store_name AS STORE_NAME,
CMS.Date AS SHOP_DATE,
CMB.ENTRY_DATE AS BANK_DATE,
SUM(CMS.Cash) AS STORE_CASH,
SUM(CMB.AMOUNT) AS BANK_CASH
FROM `store_data` CMS
LEFT JOIN `bank_data` AS CMB
ON CMS.store_name = CMB.STRAIGHT_LOOKUP
AND FLOOR(CMS.Cash / 50) * 50 = FLOOR(CMB.AMOUNT / 50) * 50
AND CAST(FORMAT_DATE("%F",CMB.ENTRY_DATE) AS STRING) > CAST(FORMAT_DATE("%F",CMS.Date) AS STRING)
AND CAST(FORMAT_DATE("%F",CMB.ENTRY_DATE) AS STRING) <= CAST(FORMAT_DATE("%F",DATE_ADD(CMS.Date, INTERVAL 4 day)) AS STRING)
GROUP BY STORE_NAME,SHOP_DATE,BANK_DATE)
SELECT
MAIN2.*
FROM (
SELECT
ARRAY_AGG(MAIN ORDER BY MAIN.SHOP_DATE ASC LIMIT 1)[OFFSET(0)] AS MAIN2
FROM
MAIN AS MAIN
GROUP BY MAIN.SHOP_DATE, MAIN.STORE_CASH)
This is quite an interesting case.
You haven't provided any sample data, so I'm not able to test it, but this may work. Some modification may be required since I'm not sure about the date format. Let me know if there is an issue.
SELECT * FROM SHOP_CASH AS SC
LEFT JOIN BANK_CASH AS BC
ON SC.SHOP_NAME_SC = BC.SHOP_NAME_BC
AND BC.DATE_BC BETWEEN SC.DATE_SC AND DATE_ADD(SC.DATE_SC, INTERVAL 7 DAY)
AND TRUNC(SC.CASH_SC, -2) + 50 = TRUNC(BC.CASH_BC, -2) + 50
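Note that TRUNC(..., -2) + 50 implements the first idea from the question (both 12,400 and 12,499 map to 12,450), but amounts that straddle a £100 boundary, e.g. 12,499 and 12,500, will not match even though they differ by only 1. If the second idea ("match when the difference is less than £50") is preferred, a hedged alternative for that cash condition, keeping the same column names, might be:

-- match when the two amounts differ by less than £50
AND ABS(SC.CASH_SC - BC.CASH_BC) < 50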

How to do inheritance / transmission queries in BigQuery Variant Schema

The Variant Schema used by Google Genomics Variant Transform pipelines represents genotypes as nested records in BigQuery - for example:
(from: https://bigquery.cloud.google.com/table/genomics-public-data:1000_genomes.variants?pli=1&tab=preview)
I'm having trouble understanding how to write queries that involve relationships between samples - such as:
select all variants where sampleA.genotype=HET and sampleB.genotype=HET and sampleC.genotype=HOM-ALT
or similar queries where sampleA and sampleB are parents of sampleC and you're looking for variants that follow a particular inheritance pattern.
How are people writing these queries with the nested schema?
I think it would be something like below. I have not tested it, as the table is quite expensive to query, but one run gave zero output, meaning that there are no records that meet that specific criteria; at least you can see the logic of how to write such a query.
SELECT * EXCEPT(cnt)
FROM (
SELECT reference_name, start, `end`,
(SELECT COUNT(1)
FROM UNNEST(call)
WHERE (call_set_name="HG00261" AND genotype[SAFE_OFFSET(0)] = 0 AND genotype[SAFE_OFFSET(1)] = 1)
OR (call_set_name="HG00593" AND genotype[SAFE_OFFSET(0)] = 1 AND genotype[SAFE_OFFSET(1)] = 0)
OR (call_set_name="NA12749 " AND genotype[SAFE_OFFSET(0)] = 1 AND genotype[SAFE_OFFSET(1)] = 1)
) cnt
FROM `genomics-public-data.1000_genomes.variants`
)
WHERE cnt = 3
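One possible refinement (untested, same caveat as above): if HET should match either allele order (0/1 or 1/0) for both parents, the per-sample conditions inside the UNNEST(call) subquery could test for membership of both alleles instead of fixed positions, e.g.:

WHERE (call_set_name = "HG00261" AND 0 IN UNNEST(genotype) AND 1 IN UNNEST(genotype))
   OR (call_set_name = "HG00593" AND 0 IN UNNEST(genotype) AND 1 IN UNNEST(genotype))
   OR (call_set_name = "NA12749" AND genotype[SAFE_OFFSET(0)] = 1 AND genotype[SAFE_OFFSET(1)] = 1)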

Help structure query to list concerts/venues by distance

SQL noob here needing some help. I've got an idea of how to do this in PHP/SQL, but I would really like to condense this into one SELECT statement. OK:
The site I am working on is a list of concerts and venues. Venues have a latitude and longitude, and so do accounts, corresponding to that user's location.
I have three tables: accounts (users), concerts, and venues. I would like to SELECT a list of concerts (joining on venues for that info) that are happening at venues within x miles of the account, using this cheap formula for distance calculation (the site only lists venues in the UK, so the error is acceptable):
x = 69.1 * (accountLatitude - venueLatitude);
y = 69.1 * (accountLongitude - venueLongitude) * cos(venueLatitude / 57.3);
distance = sqrt(x * x + y * y);
How can I achieve this in a single query?
Thanks in advance xD
This is done exactly as your formulas suggest.
Just substitute x and y into the distance formula.
If this is for MySQL, the below should work (just substitute the correct table/column names).
SELECT concert.name, venue.name,
SQRT(POW(69.1 * (account.Latitude - venue.Latitude), 2) + POW(69.1 * (account.Longitude - venue.Longitude) * cos(venue.Latitude / 57.3), 2)) AS distance
FROM account, venue
LEFT JOIN concert ON concert.ID = venue.concertID
WHERE account.id = UserWhoIsLoggedIn
ORDER BY 3;
This should return all concert names, venue names and the distance from the user, ordered by the distance.
If you are not using MySQL you may need to change either the POW or SQRT functions.
Also be aware that the sin and cos functions take their inputs in radians.
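For what it's worth, the division by 57.3 in the formula is an approximate degrees-to-radians conversion (180/π ≈ 57.2958). If you would rather make that explicit, MySQL's RADIANS() function can be used instead, e.g.:

SQRT(POW(69.1 * (account.Latitude - venue.Latitude), 2)
   + POW(69.1 * (account.Longitude - venue.Longitude) * COS(RADIANS(venue.Latitude)), 2)) AS distance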
For everyone else's benefit, here's how I did it in the end:
$q = 'SELECT gigs.date, bands.idbands, venues.latitude, venues.longitude, venues.idvenues, bands.name AS band, venues.name AS venue,
SQRT(POW(69.1 * (' . $lat . ' - venues.latitude), 2) + POW(69.1 * (' . $lon . ' - venues.longitude) * cos(venues.latitude / 57.3), 2)) AS distance
FROM gigs
LEFT JOIN bands ON bands.idbands=gigs.bands_idbands
LEFT JOIN venues ON venues.idvenues=gigs.venues_idvenues
WHERE 1
ORDER BY distance';
where $lat and $lon are the latitude and longitude of the user currently logged in!
This selects every single gig that's happening at every venue and arranges them in order of distance.