How to do inheritance / transmission queries in BigQuery Variant Schema

The Variant Schema used by Google Genomics Variant Transform pipelines represents genotypes as nested records in BigQuery - for example:
(from: https://bigquery.cloud.google.com/table/genomics-public-data:1000_genomes.variants?pli=1&tab=preview)
I'm having trouble understanding how to write queries that involve relationships between samples - such as:
select all variants where sampleA.genotype=HET and sampleB.genotype=HET and sampleC.genotype=HOM-ALT
or similar queries where sampleA and sampleB are parents of sampleC and you're looking for variants that follow a particular inheritance pattern.
How are people writing these queries with the nested schema?

I think it would look something like the query below. I have not tested it thoroughly because the table is quite expensive to query, but one run returned zero rows, meaning no records meet these specific criteria. At least it shows the logic of such a query.
SELECT * EXCEPT(cnt)
FROM (
SELECT reference_name, start, `end`,
(SELECT COUNT(1)
FROM UNNEST(call)
WHERE (call_set_name="HG00261" AND genotype[SAFE_OFFSET(0)] = 0 AND genotype[SAFE_OFFSET(1)] = 1)
OR (call_set_name="HG00593" AND genotype[SAFE_OFFSET(0)] = 1 AND genotype[SAFE_OFFSET(1)] = 0)
OR (call_set_name="NA12749" AND genotype[SAFE_OFFSET(0)] = 1 AND genotype[SAFE_OFFSET(1)] = 1)
) cnt
FROM `genomics-public-data.1000_genomes.variants`
)
WHERE cnt = 3
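The counting logic of the query above can be sketched in plain Python, which may make the pattern easier to follow without running anything against BigQuery. The rows below are made-up toy data (not real 1000 Genomes calls), and `wanted` and `matching_calls` are hypothetical names introduced only for this illustration.

```python
# Toy stand-in for rows of the variants table: each variant carries a
# repeated `call` record with call_set_name and genotype (invented values).
variants = [
    {"reference_name": "1", "start": 100, "end": 101,
     "call": [
         {"call_set_name": "HG00261", "genotype": [0, 1]},
         {"call_set_name": "HG00593", "genotype": [1, 0]},
         {"call_set_name": "NA12749", "genotype": [1, 1]},
     ]},
    {"reference_name": "1", "start": 200, "end": 201,
     "call": [
         {"call_set_name": "HG00261", "genotype": [0, 1]},
         {"call_set_name": "HG00593", "genotype": [0, 0]},
         {"call_set_name": "NA12749", "genotype": [1, 1]},
     ]},
]

# Required genotype per sample, mirroring the OR branches of the subquery.
wanted = {"HG00261": [0, 1], "HG00593": [1, 0], "NA12749": [1, 1]}

def matching_calls(variant):
    """Count calls whose sample and genotype match a criterion (the `cnt` column)."""
    return sum(1 for c in variant["call"]
               if wanted.get(c["call_set_name"]) == c["genotype"])

# Keep a variant only when every sample criterion matched (cnt = 3 in the SQL).
hits = [(v["reference_name"], v["start"], v["end"])
        for v in variants if matching_calls(v) == len(wanted)]
print(hits)
```

Only the first toy variant satisfies all three criteria. As in the SQL, the genotype comparison is order-sensitive, so a heterozygous call stored as 1/0 will not match a criterion written as 0/1.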

Related

Sub-query works but would a join or other alternative be better?

I am trying to select rows from one table where the id referenced in those rows matches the unique id from another table that relates to it like so:
SELECT *
FROM booklet_tickets
WHERE bookletId = (SELECT id
FROM booklets
WHERE bookletNum = 2000
AND seasonId = 9
AND bookletTypeId = 3)
With the bookletNum/seasonId/bookletTypeId being filled in by a user form and inserted into the query.
This works and returns what I want but seems messy. Is a join better to use in this type of scenario?
If there is any possibility that your subquery returns multiple values, you should use IN instead:
SELECT *
FROM booklet_tickets
WHERE bookletId in (SELECT id
FROM booklets
WHERE bookletNum = 2000
AND seasonId = 9
AND bookletTypeId = 3)
But I would prefer EXISTS over IN:
SELECT *
FROM booklet_tickets bt
WHERE EXISTS (SELECT 1
FROM booklets b
WHERE bookletNum = 2000
AND seasonId = 9
AND bookletTypeId = 3
AND b.id = bt.bookletId)
It is not possible to give a "yes, it's better" or "no, it's not" answer for this type of scenario.
My personal rule of thumb: if the number of rows in a table is less than 1 million, I do not bother optimising "SELECT WHERE IN" types of queries, as the SQL Server Query Optimizer is smart enough to pick an appropriate plan for the query.
In reality, however, you often need more values from the joined table in the final result set, so a JOIN with a filtering WHERE clause might make more sense, such as:
SELECT BT.*, B.SeasonId
FROM booklet_tickets BT
INNER JOIN booklets B ON BT.bookletId = B.id
WHERE B.bookletNum = 2000
AND B.seasonId = 9
AND B.bookletTypeId = 3
To me it comes down to a question of style rather than anything else: write your code so that it will be easy for you to understand months later. So pick a certain style and then stick to it :)
The question, however, is as old as time itself :)
SQL JOIN vs IN performance?
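To check that the three forms discussed above select the same rows, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for SQL Server. The table contents are invented, the column types are assumptions, and the form values are bound as parameters rather than pasted into the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE booklets (id INTEGER PRIMARY KEY, bookletNum INT, seasonId INT, bookletTypeId INT);
CREATE TABLE booklet_tickets (id INTEGER PRIMARY KEY, bookletId INT);
INSERT INTO booklets VALUES (1, 2000, 9, 3), (2, 2001, 9, 3);
INSERT INTO booklet_tickets VALUES (10, 1), (11, 1), (12, 2);
""")

params = (2000, 9, 3)  # bookletNum, seasonId, bookletTypeId from the user form

in_sql = """SELECT bt.id FROM booklet_tickets bt
WHERE bt.bookletId IN (SELECT id FROM booklets
  WHERE bookletNum = ? AND seasonId = ? AND bookletTypeId = ?)
ORDER BY bt.id"""

exists_sql = """SELECT bt.id FROM booklet_tickets bt
WHERE EXISTS (SELECT 1 FROM booklets b
  WHERE b.bookletNum = ? AND b.seasonId = ? AND b.bookletTypeId = ?
    AND b.id = bt.bookletId)
ORDER BY bt.id"""

join_sql = """SELECT bt.id FROM booklet_tickets bt
INNER JOIN booklets b ON bt.bookletId = b.id
WHERE b.bookletNum = ? AND b.seasonId = ? AND b.bookletTypeId = ?
ORDER BY bt.id"""

# Run each variant with the same parameters and collect the ticket ids.
results = [[row[0] for row in conn.execute(sql, params)]
           for sql in (in_sql, exists_sql, join_sql)]
print(results)
```

All three return the same ticket ids here because booklets.id is unique. This says nothing about which plan a given optimizer picks; for that, compare the execution plans on the real database.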

Chaining endless sql and performance

I am chaining SQL queries according to user filters that are not known in advance.
For instance, the user might first ask for certain dates:
def filterDates(**kwargs):
    # subject_col is expected to be defined outside this function
    q = ('''
    SELECT date_num, {subject_col}, {in_col} as {out_col}
    FROM {base}
    WHERE date_num BETWEEN {date1} AND {date2}
    ORDER BY date_num
    ''').format(subject_col=subject_col,**kwargs)
    return q
(base is the query string produced by the previous step; see below)
The user then wants to calculate another thing (or many), so we pass the date-filter query string q as base to this query:
q = ('''
WITH BS AS (
SELECT date_num, {subject_col}, {in_col}
FROM {base}
)
SELECT t1.{subject_col},t1.{in_col}, t2.{in_col} - t1.{in_col} as {out_col}
FROM BS t1
JOIN BS t2
ON t1.{subject_col} = t2.{subject_col} AND t2.date_num = {date2}
WHERE t1.date_num = {date1}
''').format(subject_col=subject_col,**kwargs)
Here {base} is going to be:
base='('+q+')'+'AS base'
Now we can chain as many queries as we want, and it works.
How would the engine handle this? Does it mean efficiency suffers because the engine has to make two passes (instead of applying a normal WHERE on the dates)? How would it optimize this?
Is there a common good practice for chaining an unknown number of queries?
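For readers who want to try the chaining described above end to end, here is a minimal runnable sketch against an in-memory SQLite database. The table `prices`, its columns, and the helper names `filter_dates`, `diff_between` and `as_base` are hypothetical, chosen only for this illustration; it reproduces the string-wrapping approach from the question and makes no claim about how any particular engine optimizes the result.

```python
import sqlite3

def filter_dates(base, subject_col, in_col, out_col, date1, date2):
    # Step 1: restrict `base` to a date range.
    return f"""SELECT date_num, {subject_col}, {in_col} AS {out_col}
FROM {base}
WHERE date_num BETWEEN {date1} AND {date2}"""

def diff_between(base, subject_col, in_col, out_col, date1, date2):
    # Step 2: per-subject difference between two dates, reading from `base`.
    return f"""WITH BS AS (SELECT date_num, {subject_col}, {in_col} FROM {base})
SELECT t1.{subject_col}, t2.{in_col} - t1.{in_col} AS {out_col}
FROM BS t1
JOIN BS t2 ON t1.{subject_col} = t2.{subject_col} AND t2.date_num = {date2}
WHERE t1.date_num = {date1}
ORDER BY t1.{subject_col}"""

def as_base(query):
    # Wrap a query string so the next step can select from it.
    return "(" + query + ") AS base"

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prices (date_num INT, ticker TEXT, price INT);
INSERT INTO prices VALUES
  (1, 'X', 10), (2, 'X', 12), (3, 'X', 15), (4, 'X', 99),
  (1, 'Y', 20), (3, 'Y', 18);
""")

q1 = filter_dates("prices", "ticker", "price", "price", 1, 3)
q2 = diff_between(as_base(q1), "ticker", "price", "delta", 1, 3)
result = conn.execute(q2).fetchall()
print(result)
```

Because column and table names are spliced into the SQL text, they should come from a fixed whitelist rather than directly from user input; only values such as the dates can be passed as bound parameters.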

BigQuery - Adwords Data Transfer - AccountStats vs AccountBasicStats

For many tables, there is both a Stats and a BasicStats version, for example AccountStats vs AccountBasicStats.
The same SQL query can return different values from Stats vs BasicStats, for example:
SELECT
cs.Date,
SUM(cs.Impressions) AS Sum_Impressions,
SUM(cs.Clicks) AS Sum_Clicks,
SUM(cs.Interactions) AS Sum_Interactions,
(SUM(cs.Cost) / 1000000) AS Sum_Cost,
SUM(cs.Conversions) AS Sum_Conversions
FROM
`{dataset_id}.Customer_{customer_id}` c
LEFT JOIN
`{dataset_id}.AccountBasicStats_{customer_id}` cs
-- or, alternatively, use this table instead:
-- `{dataset_id}.AccountStats_{customer_id}` cs
ON
c.ExternalCustomerId = cs.ExternalCustomerId
WHERE
c._DATA_DATE = c._LATEST_DATE
AND c.ExternalCustomerId = {customer_id}
GROUP BY
1
ORDER BY
1
It seems the main difference is the ClickType column, which, based on the documentation for ClickType, might cause double counting.
BasicStats seems the most accurate and matches AdWords exactly, while Stats gives around a 2x-3x increase in impressions.
Is there a way to transform the data so that both queries return the same results? I ask because there is no BasicStats table for hourly data, which is what I'm interested in.
According to this thread:
https://groups.google.com/forum/#!topic/adwords-api/QiY_RT9aNlM
it seems that there is no way to de-segment the data after ClickType is brought in.

How to improve query performance in Oracle

The SQL query below takes too long to execute. It might be due to repeated use of the same table in the FROM clause. I cannot figure out how to fix this query so that performance improves.
Can anyone help me out with this?
Thanks in advance!
select --
from t_carrier_location act_end,
t_location end_loc,
t_carrier_location act_start,
t_location start_loc,
t_vm_voyage_activity va,
t_vm_voyage v,
t_location_position lp_start,
t_location_position lp_end
where act_start.carrier_location_id = va.carrier_location_id
and act_start.carrier_id = v.carrier_id
and act_end.carrier_location_id =
decode((select cl.carrier_location_id
from t_carrier_location cl
where cl.carrier_id = act_start.carrier_id
and cl.carrier_location_no =
act_start.carrier_location_no + 1),
null,
(select cl2.carrier_location_id
from t_carrier_location cl2, t_vm_voyage v2
where v2.hire_period_id = v.hire_period_id
and v2.voyage_id =
(select min(v3.voyage_id)
from t_vm_voyage v3
where v3.voyage_id > v.voyage_id
and v3.hire_period_id = v.hire_period_id)
and v2.carrier_id = cl2.carrier_id
and cl2.carrier_location_no = 1),
(select cl.carrier_location_id
from t_carrier_location cl
where cl.carrier_id = act_start.carrier_id
and cl.carrier_location_no =
act_start.carrier_location_no + 1))
and lp_start.location_id = act_start.location_id
and lp_start.from_date <=
nvl(act_start.actual_dep_time, act_start.actual_arr_time)
and (lp_start.to_date is null or
lp_start.to_date >
nvl(act_start.actual_dep_time, act_start.actual_arr_time))
and lp_end.location_position_id = act_end.location_id
and lp_end.from_date <=
nvl(act_end.actual_dep_time, act_end.actual_arr_time)
and (lp_end.to_date is null or
lp_end.to_date >
nvl(act_end.actual_dep_time, act_end.actual_arr_time))
and act_end.location_id = end_loc.location_id
and act_start.location_id = start_loc.location_id;
There is no single straightforward answer to your question and the query you've mentioned.
To get a better response time from any query, you need to keep a few things in mind while writing it. I will mention a few here that appear to be important for your query:
Use joins instead of subqueries.
Use EXPLAIN PLAN to check that the query is executed the way you expect.
Use indexed columns in your WHERE clause, or else create an index on those columns. Use your common sense about which columns to index, e.g. foreign key columns, deleted, orderCreatedAt, startDate, etc.
Keep the order of the selected columns as they appear in the table instead of selecting columns arbitrarily.
The above four points are enough for the query you've provided.
To dig deeper into SQL optimization and tuning, refer to https://docs.oracle.com/database/121/TGSQL/tgsql_intro.htm#TGSQL130

Selecting rows from Parent Table only if multiple rows in Child Table match

I'm building a program that learns tic-tac-toe by saving info in a database.
I have two tables, Games(ID,Winner) and Turns(ID,Turn,GameID,Place,Shape).
I want to find a parent row by conditions on multiple child rows.
For example:
SELECT GameID FROM Turns WHERE
GameID IN (WHEN Turn = 1 THEN Place = 1) AND GameID IN (WHEN Turn = 2 THEN Place = 4);
Is something like this possible?
I'm using MS Access.
Turn - game turn
GameID - game ID
Place - place on the matrix (1=top right, 9=bottom left)
Shape - X or circle
Thanks in advance
This very simple query will do the trick in a single scan, and doesn't require you to violate First Normal Form by storing multiple values in a string (shudder).
SELECT T.GameID
FROM Turns AS T
WHERE
(T.Turn = 1 AND T.Place = 1)
OR (T.Turn = 2 AND T.Place = 4)
GROUP BY T.GameID
HAVING Count(*) = 2;
There is no need to join to determine this information, as is suggested by other answers.
Please use proper database design principles in your database, and don't violate First Normal Form by storing multiple values together in a single string!
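The GROUP BY / HAVING query above can be tried on toy data with Python's built-in sqlite3 module. SQLite is only a stand-in for Access here, and the rows and column types are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Turns (ID INTEGER PRIMARY KEY, Turn INT, GameID INT, Place INT, Shape TEXT);
INSERT INTO Turns (Turn, GameID, Place, Shape) VALUES
  (1, 1, 1, 'X'), (2, 1, 4, 'O'),
  (1, 2, 1, 'X'), (2, 2, 5, 'O'),
  (1, 3, 9, 'X'), (2, 3, 4, 'O');
""")

# A game qualifies only if both of its (Turn, Place) pairs survive the filter.
rows = conn.execute("""
SELECT T.GameID
FROM Turns AS T
WHERE (T.Turn = 1 AND T.Place = 1)
   OR (T.Turn = 2 AND T.Place = 4)
GROUP BY T.GameID
HAVING COUNT(*) = 2
""").fetchall()
games = [r[0] for r in rows]
print(games)
```

Game 1 matches both conditions, while games 2 and 3 match only one each. The count test relies on each game having at most one row per turn number; if duplicates were possible, the count could reach 2 without both conditions being met.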
The general solution to your problem can be accomplished by using a sub-query that contains a self-join between two instances of the Turns table:
SELECT * FROM Games
WHERE ID IN
(
SELECT Turns1.GameID
FROM Turns AS Turns1
INNER JOIN Turns AS Turns2
ON Turns1.GameID = Turns2.GameID
WHERE (
(Turns1.Turn=1 AND Turns1.Place = 1)
AND
(Turns2.Turn=2 AND Turns2.Place = 4))
);
The Self Join between Turns (aliased Turns1 and Turns2) is key, because if you just try to apply both sets of conditions at once like this:
WHERE (
(Turns.Turn=1 AND Turns.Place = 1)
AND
(Turns.Turn=2 AND Turns.Place = 4))
you will never get any rows back. This is because in your table there is no way for an individual row to satisfy both conditions at the same time.
My experience using Access is that to do a complex query like this you have to use the SQL View and type the query in on your own, rather than use the Query Designer. It may be possible to do in the Designer, but it's always been far easier for me to write the code myself.
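The difference between the self-join and the single-row condition described above can be seen on toy data. This is a sketch using Python's sqlite3 module in place of Access; the rows are invented, and the outer query filters on Games.ID, following the Games(ID,Winner) layout given in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Games (ID INTEGER PRIMARY KEY, Winner TEXT);
CREATE TABLE Turns (ID INTEGER PRIMARY KEY, Turn INT, GameID INT, Place INT, Shape TEXT);
INSERT INTO Games (ID, Winner) VALUES (1, 'X'), (2, 'O');
INSERT INTO Turns (Turn, GameID, Place, Shape) VALUES
  (1, 1, 1, 'X'), (2, 1, 4, 'O'),
  (1, 2, 1, 'X'), (2, 2, 5, 'O');
""")

# Both conditions applied to the same row: no single row can have Turn = 1 and Turn = 2.
single_row = conn.execute("""
SELECT GameID FROM Turns
WHERE (Turn = 1 AND Place = 1) AND (Turn = 2 AND Place = 4)
""").fetchall()

# Self-join: T1 carries the first condition, T2 the second, linked by GameID.
self_join = conn.execute("""
SELECT ID FROM Games
WHERE ID IN (
  SELECT T1.GameID
  FROM Turns AS T1
  INNER JOIN Turns AS T2 ON T1.GameID = T2.GameID
  WHERE (T1.Turn = 1 AND T1.Place = 1)
    AND (T2.Turn = 2 AND T2.Place = 4)
)
""").fetchall()
print(single_row, self_join)
```

The single-row version returns nothing, while the self-join finds game 1, the only game whose first turn is at place 1 and whose second turn is at place 4.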
select g.ID from Games g where exists (select * from Turns t where
t.GameID = g.ID and ((Turn = 1 and Place = 1) or (Turn = 2 and Place = 4)))
This will select all the games that have at least one turn matching the corresponding criteria.
More info on EXISTS:
http://www.techonthenet.com/sql/exists.php
I bypassed this problem by adding a column which holds the turns as a string, for example "154728", and I search for that instead. I think this solution is also less demanding on the database.