Understanding self joins and flattening - google-bigquery

I'll start with the fact that I'm a newbie and I managed to hack this original query together. I've looked over many examples but I'm just not wrapping my head around self joins and displaying the data I want to see.
I'm feeding BQ with mobile app data daily and thus am querying multiple tables. I'm trying to query for a count of fatal crashes by IMEI by date. This query does give me most of the output I want as it returns Date, IMEI and Count.
However, I want the output to be Date, IMEI, Branch, Truck and Count. user_dim.user_properties.key is a nested field, and in my query I'm specifically asking for user_dim.user_properties.key = 'imei_id' and getting its value from user_dim.user_properties.value.value.string_value.
I don't understand how I would perform the join to also get back the values where user_dim.user_properties.key = 'truck_id' and user_dim.user_properties.key = 'branch_id' and ultimately getting my output to be: Date, IMEI, Branch, Truck and Count in one row.
Thanks for your help.
SELECT
event_dim.date AS Date,
user_dim.user_properties.value.value.string_value AS IMEI,
COUNT(*) AS Count
FROM
FLATTEN( (
SELECT
*
FROM
TABLE_QUERY([smarttruck-6d137:com_usiinc_android_ANDROID],'table_id CONTAINS "app_events_"')), user_dim.user_properties)
WHERE
user_dim.user_properties.key = 'imei_id'
AND event_dim.name = 'app_exception'
AND event_dim.params.key = 'fatal'
AND event_dim.params.value.int_value = 1
AND event_dim.date = '20170807'
GROUP BY
Date,
IMEI
ORDER BY
Count DESC

Here is a query that should work for you, using standard SQL:
#standardSQL
SELECT
event_dim.date AS Date,
(SELECT value.value.string_value
FROM UNNEST(user_dim.user_properties)
WHERE key = 'imei_id') AS IMEI,
(SELECT value.value.string_value
FROM UNNEST(user_dim.user_properties)
WHERE key = 'branch_id') AS branch_id,
(SELECT value.value.string_value
FROM UNNEST(user_dim.user_properties)
WHERE key = 'truck_id') AS truck_id,
COUNT(*) AS Count
FROM `smarttruck-6d137.com_usiinc_android_ANDROID.app_events_*`
CROSS JOIN UNNEST(event_dim) AS event_dim
WHERE
event_dim.name = 'app_exception' AND
EXISTS (
SELECT 1 FROM UNNEST(event_dim.params)
WHERE key = 'fatal' AND value.int_value = 1
) AND
event_dim.date = '20170807'
GROUP BY
Date,
IMEI,
branch_id,
truck_id
ORDER BY
Count DESC;
A couple of thoughts/suggestions, though:
To restrict how much data you scan, you probably want to filter on _TABLE_SUFFIX = '20170807' instead of event_dim.date = '20170807'. This will be cheaper and (if I understand correctly) will return the same results.
If the combinations of IMEI, branch_id, and truck_id are unique, there probably isn't a benefit to computing the count, so you can remove the COUNT(*) and also the GROUP BY/ORDER BY clauses.
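To try the scalar-subquery pattern from the query above without touching the real tables, here is a minimal sketch in standard SQL. The inline `events` rows, the `event_date` column, and the flattened `key`/`value` pairs are made-up stand-ins for the real `user_dim.user_properties` schema, not the actual export format:

```sql
#standardSQL
-- Hypothetical inline rows standing in for the app_events_* tables.
WITH events AS (
  SELECT '20170807' AS event_date,
         [STRUCT('imei_id' AS key, '111' AS value),
          STRUCT('branch_id' AS key, 'B1' AS value),
          STRUCT('truck_id' AS key, 'T9' AS value)] AS user_properties
  UNION ALL
  SELECT '20170807',
         [STRUCT('imei_id' AS key, '111' AS value),
          STRUCT('branch_id' AS key, 'B1' AS value),
          STRUCT('truck_id' AS key, 'T9' AS value)]
  UNION ALL
  SELECT '20170807',
         [STRUCT('imei_id' AS key, '222' AS value),
          STRUCT('branch_id' AS key, 'B2' AS value),
          STRUCT('truck_id' AS key, 'T3' AS value)]
)
SELECT
  event_date AS Date,
  -- Each scalar subquery picks one key out of the repeated field,
  -- so all three values land on the same output row.
  (SELECT value FROM UNNEST(user_properties) WHERE key = 'imei_id') AS IMEI,
  (SELECT value FROM UNNEST(user_properties) WHERE key = 'branch_id') AS branch_id,
  (SELECT value FROM UNNEST(user_properties) WHERE key = 'truck_id') AS truck_id,
  COUNT(*) AS Count
FROM events
GROUP BY Date, IMEI, branch_id, truck_id
ORDER BY Count DESC;
```

With these sample rows, the combination for IMEI 111 is counted twice and the one for IMEI 222 once. Note that a scalar subquery raises an error if a key appears more than once in the same row's array.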

Related

How to write SQL query without join?

Recently, during an interview, I was asked a question: if I have a table like the one below:
The requirement is: how many orders and how many shipments are there per day (based on the date columns)? The output needs to look like this:
I wrote the following code, but the interviewer asked me to write a SQL query without JOIN and UNION that achieves the same output.
SELECT
COALESCE(a.order_date, b.ship_date), orders, shipments
FROM
(SELECT
order_date, COUNT(1) AS orders
FROM
table
GROUP BY 1) a
FULL JOIN
(SELECT
ship_date, COUNT(1) AS shipments
FROM table
GROUP BY 1) b ON a.order_date = b.ship_date
Is this possible? Could you guys please advise?
You can use UNION ALL and GROUP BY with conditional aggregation as follows:
SELECT DATE_,
COUNT(CASE WHEN FLAG = 'ORDER' THEN 1 END) AS ORDERS,
COUNT(CASE WHEN FLAG = 'SHIP' THEN 1 END) AS SHIPMENTS
FROM (SELECT ORDER_DATE AS DATE_, 'ORDER' AS FLAG FROM YOUR_TABLE
UNION ALL
SELECT SHIP_DATE AS DATE_, 'SHIP' AS FLAG FROM YOUR_TABLE) T
GROUP BY DATE_
In BigQuery, I would express this as:
select date, countif(n = 0) as orders, countif(n = 1) as numships
from t cross join
unnest(array[order_date, ship_date]) date with offset n
group by 1
order by date;
The advantage of this approach (over union all) is two-fold. First, it only scans the table once. More importantly, the unnest() is all on the same node where the data resides -- so data does not need to be moved for the unpivot.
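For anyone who wants to run the unpivot without a real table, here is a small self-contained sketch of the same idea. The three inline rows and the alias `d` are made up for illustration; only the `order_date` and `ship_date` column names come from the question:

```sql
#standardSQL
-- Hypothetical sample rows; each row is one order with its ship date.
WITH t AS (
  SELECT DATE '2021-01-01' AS order_date, DATE '2021-01-02' AS ship_date
  UNION ALL
  SELECT DATE '2021-01-01', DATE '2021-01-03'
  UNION ALL
  SELECT DATE '2021-01-02', DATE '2021-01-03'
)
SELECT
  d AS date,
  COUNTIF(n = 0) AS orders,     -- offset 0 is the order_date element
  COUNTIF(n = 1) AS shipments   -- offset 1 is the ship_date element
FROM t
CROSS JOIN UNNEST([order_date, ship_date]) AS d WITH OFFSET AS n
GROUP BY d
ORDER BY d;
```

On this sample data the result is 2 orders and 0 shipments on 2021-01-01, 1 and 1 on 2021-01-02, and 0 and 2 on 2021-01-03. Each source row is expanded into two rows in place, so the table is read only once.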

PostgreSQL - Joining latest record to group by result

The query below works as intended, but you guys can sometimes do magic when it comes to optimization. Is this all right, or can it be done in a better/faster way?
WITH last_events AS (
SELECT DISTINCT ON (type, adid)
type,
adid,
value,
created_at
FROM public.adid
ORDER BY type, adid, created_at DESC
)
SELECT
adid.type,
adid.adid,
count(*) as count,
sum(adid.value) as summary,
le.created_at
FROM public.adid
JOIN last_events le ON le.type = adid.type AND le.adid = adid.adid
GROUP BY adid.type, adid.adid, le.created_at
ORDER BY summary DESC, le.created_at DESC;
I believe that certain parts of your solution are unnecessary. The CTE returns the max created_at per (type, adid) group. The main query computes the number of rows and the sum of value per (type, adid) group. Therefore, it can be written like this:
SELECT
adid.type,
adid.adid,
count(*) as count,
sum(adid.value) as summary,
max(adid.created_at) max_created_at
FROM public.adid
GROUP BY adid.type, adid.adid
ORDER BY summary DESC, max_created_at DESC;
If you are interested in other columns corresponding to the row with highest created_at then you can use one of the classical greatest-per-group approaches. One that I prefer is to use GROUP BY to find the greatest value (very similar to your approach):
SELECT
adid.type,
adid.adid,
t.count,
t.summary,
t.max_created_at,
adid.value
FROM public.adid
JOIN (
SELECT
adid.type,
adid.adid,
count(*) as count,
sum(adid.value) as summary,
max(adid.created_at) max_created_at
FROM public.adid
GROUP BY adid.type, adid.adid
) t ON t.type = adid.type and
t.adid = adid.adid and
t.max_created_at = adid.created_at
ORDER BY t.summary DESC, t.max_created_at DESC;
I believe it is better like this since my solution has just one aggregation. Your solution uses DISTINCT ON (which is a hidden aggregation) and another GROUP BY in the outer query.
Another option to find the greatest-per-group is to use a window function. However, I think aggregation is a much better solution for your problem since you need several aggregate values. Moreover, GROUP BY seems to perform better than window functions in certain cases.
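For comparison, the window-function variant mentioned above could look roughly like the sketch below. It is not from the original answer; the `events` CTE and its three rows are hypothetical stand-ins for public.adid so that the query can be run on its own in PostgreSQL:

```sql
-- Hypothetical sample data in place of public.adid.
WITH events (type, adid, value, created_at) AS (
  VALUES
    ('click', 1, 10, TIMESTAMP '2020-01-01 10:00'),
    ('click', 1, 20, TIMESTAMP '2020-01-02 10:00'),
    ('view',  2,  5, TIMESTAMP '2020-01-01 09:00')
)
SELECT DISTINCT
  type,
  adid,
  count(*)        OVER w AS count,
  sum(value)      OVER w AS summary,
  max(created_at) OVER w AS max_created_at,
  -- value taken from the most recent row of each (type, adid) group
  first_value(value) OVER (PARTITION BY type, adid
                           ORDER BY created_at DESC) AS latest_value
FROM events
WINDOW w AS (PARTITION BY type, adid)
ORDER BY summary DESC, max_created_at DESC;
```

Every input row carries the same window results within its group, so DISTINCT collapses them to one row per (type, adid). That extra de-duplication step is part of why plain GROUP BY is usually the simpler choice when only aggregates are needed.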

Compare two tables of data in HIVE

I have to find out if data in both the tables is same for a given view_date. If same my SQL should return zero, else non zero.
Table1/Table2 columns:
Source
view_date
count
start_date
end_date
I tried in the below way:
SELECT *
FROM (
SELECT count(*)
FROM table1
) a
JOIN (
SELECT count(*)
FROM TABLE 2
) b
WHERE view_date = '05/08/2016'
AND a.x != b.y;
But I am not getting the expected result. Could someone please help me?
Here is one method that counts the number of rows that are unique in each table:
select count(*)
from (select source, count, start_date, end_date,
min(which) as minwhich, max(which) as maxwhich
from ((select source, count, start_date, end_date, 1 as which
from table1
where view_date = '2016-06-08'
) union all
(select source, count, start_date, end_date, 2 as which
from table2
where view_date = '2016-06-08'
)
) t12
group by source, count, start_date, end_date
having minwhich = maxwhich
) t;
Note: If rows are duplicated across all values in a table, this does not check that the same number of duplicates are in each table.
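If duplicate counts matter as well, one possible variation is to count each row per table and compare the two counts. The sketch below is an assumption-laden illustration, not part of the original answer: `t1` and `t2` are hypothetical inline stand-ins for table1 and table2, and it relies on a Hive version that accepts SELECT without FROM and UNION ALL inside a CTE:

```sql
-- Hypothetical inline data: t1 holds the ('a', 1, ...) row twice, t2 only once.
WITH t1 AS (
  SELECT 'a' AS source, 1 AS `count`, '2016-06-01' AS start_date, '2016-06-30' AS end_date
  UNION ALL
  SELECT 'a', 1, '2016-06-01', '2016-06-30'
  UNION ALL
  SELECT 'b', 2, '2016-06-01', '2016-06-30'
),
t2 AS (
  SELECT 'a' AS source, 1 AS `count`, '2016-06-01' AS start_date, '2016-06-30' AS end_date
  UNION ALL
  SELECT 'b', 2, '2016-06-01', '2016-06-30'
)
SELECT count(*) AS mismatched_rows
FROM (
  SELECT source, `count`, start_date, end_date,
         sum(CASE WHEN which = 1 THEN 1 ELSE 0 END) AS n1,  -- copies in t1
         sum(CASE WHEN which = 2 THEN 1 ELSE 0 END) AS n2   -- copies in t2
  FROM (
    SELECT source, `count`, start_date, end_date, 1 AS which FROM t1
    UNION ALL
    SELECT source, `count`, start_date, end_date, 2 AS which FROM t2
  ) t12
  GROUP BY source, `count`, start_date, end_date
) t
WHERE n1 <> n2;
```

With this sample data the result is 1, because the 'a' row appears twice in t1 but once in t2. A result of 0 would mean both sides hold the same rows with the same multiplicities. Against real tables, the view_date filter from the query above would go inside each branch of the UNION ALL.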
To do a full comparison of 2 tables, you not only need to make sure that the number of rows match, but you must check that all the data in all the columns for all the rows match!
This can be a complicated problem (when I worked at Hortonworks, for 1 project we developed 3 different programs to try to solve this). Lately I had the opportunity to develop a program that solves this in an elegant and efficient way: https://github.com/bolcom/hive_compared_bq
The program shows you the differences in a webpage (which is something you could skip if you don't need it) and also gives you a return value 0/1 which is what you currently want.

Selecting a column that is not an aggregate or in group

The goal of this select is to get the latest score for a system that is in status = 'FD'. I want to get the ID of the row (id), the system ID (sys_id), and the score (score).
The following SQL gives me the id of the system (sys_id) as well as the score (score), but I also would like to get the id column associated with this score and sys_id. Hopefully that makes sense.
select sys_id, score from example
where (sys_id, end_date) in
(
select sys_id, max (end_date)
from example
where status = 'FD'
group by sys_id
);
Here is a SQL Fiddle to give you an idea of what I am talking about http://www.sqlfiddle.com/#!4/169a2/3
Before you ask, yes the combination of sys_id and end_date would give me a unique row and I could find the id that way, but I would rather get the id in my select statement.
You can use a simple CTE to get the max date for each SYS_ID, and join that back to your table to get all the details for that particular record.
with CTE as (
select sys_id, max (end_date) as MaxDate
from example
where status = 'FD'
group by sys_id)
select
EXAMPLE.*
from
EXAMPLE
INNER JOIN CTE
ON EXAMPLE.SYS_ID = CTE.SYS_ID
and EXAMPLE.END_DATE = CTE.MaxDate
Check out the change to your SQL Fiddle
Answer from a comment. Subquery a is taken from your statement (lazy programming on my part).
select a.*, e.score
from
(
select sys_id, max (end_date) as ed
from example
where status = 'FD'
group by sys_id
)a
inner join example e on a.ed = e.end_date and a.sys_id = e.sys_ID
This works on the premise that there is only one row for a given sys_id and end_date. If several rows share the same sys_id and end_date, the join returns one output row for each of them.
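If duplicate end dates are a concern, an analytic function is one way to force exactly one row per sys_id. This is a sketch rather than something from the answers above: the inline `example` rows are made up, and the tie-break on `id` is an arbitrary assumption. It also only looks at rows with status = 'FD', which matches the original query as long as (sys_id, end_date) is unique:

```sql
-- Hypothetical sample rows standing in for the example table (Oracle syntax).
WITH example AS (
  SELECT 1 AS id, 10 AS sys_id, 50 AS score, 'FD' AS status, DATE '2015-01-01' AS end_date FROM dual
  UNION ALL
  SELECT 2, 10, 70, 'FD', DATE '2015-02-01' FROM dual
  UNION ALL
  SELECT 3, 10, 90, 'XX', DATE '2015-03-01' FROM dual
  UNION ALL
  SELECT 4, 20, 30, 'FD', DATE '2015-01-15' FROM dual
)
SELECT id, sys_id, score
FROM (
  SELECT e.*,
         -- rn = 1 marks the latest 'FD' row per sys_id; id breaks ties
         ROW_NUMBER() OVER (PARTITION BY sys_id
                            ORDER BY end_date DESC, id DESC) AS rn
  FROM example e
  WHERE status = 'FD'
)
WHERE rn = 1
ORDER BY sys_id;
```

With the sample rows this returns id 2 for sys_id 10 and id 4 for sys_id 20. The row with id 3 is ignored because its status is not 'FD'.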

Order by date different columns

I'm having a problem with a complex SELECT, so I hope some of you can help me out, because I'm really stuck with it... or maybe you can point me in a direction.
I have a table with the following columns:
score1, gamedate1, score2, gamedate2, score3, gamedate3
Basically I need to determine the ultimate winner of all the games, who got the SUMMED MAX score FIRST, based on the game times in ASCENDING order.
Assuming that the 1,2,3 are different players, something like this should work:
-- construct table as the_lotus suggests
WITH LotusTable AS
(
SELECT 'P1' AS Player, t.Score1 AS Score, t.GameDate1 as GameDate
FROM Tbl t
UNION ALL
SELECT 'P2' AS Player, t.Score2 AS Score, t.GameDate2 as GameDate
FROM Tbl t
UNION ALL
SELECT 'P3' AS Player, t.Score3 AS Score, t.GameDate3 as GameDate
FROM Tbl t
)
-- get running scores up through date for each player
, RunningScores AS
(
SELECT b.Player, b.GameDate, SUM(a.Score) AS Score
FROM LotusTable a
INNER JOIN LotusTable b -- self join
ON a.Player = b.Player
AND a.GameDate <= b.GameDate -- a is earlier dates
GROUP BY b.Player, b.GameDate
)
-- get max score for any player
, MaxScore AS
(
SELECT MAX(r.Score) AS Score
FROM RunningScores r
)
-- get min date for the max score
, MinGameDate AS
(
SELECT MIN(r.GameDate) AS GameDate
FROM RunningScores r
WHERE r.Score = (SELECT m.Score FROM MaxScore m)
)
-- get all players who got the max score on the min date
SELECT *
FROM RunningScores r
WHERE r.Score = (SELECT m.Score FROM MaxScore m)
AND r.GameDate = (SELECT d.GameDate FROM MinGameDate d)
;
There are more efficient ways of doing it; in particular, the self-join could be avoided.
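As a rough sketch of how the self-join could be avoided, a windowed running sum plus RANK() gives the same kind of result. The two sample rows in `Tbl` are hypothetical, and the syntax assumes a database with window-function support (for example PostgreSQL, or SQL Server 2012 and later):

```sql
-- Hypothetical sample data using the column layout from the question.
WITH Tbl AS (
  SELECT * FROM (VALUES
    (5, CAST('2020-01-01' AS date), 3, CAST('2020-01-01' AS date), 4, CAST('2020-01-02' AS date)),
    (2, CAST('2020-01-03' AS date), 4, CAST('2020-01-02' AS date), 1, CAST('2020-01-04' AS date))
  ) AS v (Score1, GameDate1, Score2, GameDate2, Score3, GameDate3)
),
LotusTable AS (
  SELECT 'P1' AS Player, Score1 AS Score, GameDate1 AS GameDate FROM Tbl
  UNION ALL
  SELECT 'P2', Score2, GameDate2 FROM Tbl
  UNION ALL
  SELECT 'P3', Score3, GameDate3 FROM Tbl
),
-- running total per player, replacing the self-join
RunningScores AS (
  SELECT Player, GameDate,
         SUM(Score) OVER (PARTITION BY Player ORDER BY GameDate) AS Score
  FROM LotusTable
),
-- highest running total first, earliest date as the tie-breaker
Ranked AS (
  SELECT r.Player, r.GameDate, r.Score,
         RANK() OVER (ORDER BY r.Score DESC, r.GameDate ASC) AS rnk
  FROM RunningScores r
)
SELECT Player, GameDate, Score
FROM Ranked
WHERE rnk = 1;
```

On this sample data P1 and P2 both reach a running total of 7, but P2 gets there first (2020-01-02), so only P2 is returned. RANK() keeps every player who ties on both score and date, which mirrors the final SELECT in the query above.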
If your tables are set up with three columns (player_id, score1, time), then you would just need a simple query to sum their scores and group them by player_ID as follows:
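SELECT gamedata1.player_ID AS Player_ID,
SUM(gamedata1.score1 + gamedata2.score1 + gamedata3.score1) AS Total_Score
FROM gamedata1
LEFT JOIN gamedata2 ON (gamedata1.player_ID = gamedata2.player_ID)
LEFT JOIN gamedata3 ON (gamedata1.player_ID = gamedata3.player_ID)
-- group by the column itself; a quoted 'player_ID' would be a string constant
GROUP BY gamedata1.player_ID
-- time exists in every joined table, so qualify and aggregate it
ORDER BY MIN(gamedata1.time) ASC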
Explanation:
You are essentially grouping by each player so you get a distinct player in each row, then summing their scores and ordering the data by time. I treated "time" as a date type. This can of course be changed to any date/time type you prefer. The structure of the query would be the same.