Oldest Record For a Distinct ID - SparkSQL [duplicate] - sql

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 4 years ago.
I am relatively new here, so I will try to follow the conventions of SO.
I am working with Spark on Databricks and working with the following data:
Distinct_Id Event Date
*some alphanumerical value* App Access 2018-01-09
*some alphanumerical value* App Opened 2017-23-01
... ... ...
The data means:
Every distinct_id identifies a distinct user. There are 4 main events - App access, app opened, app launched, mediaReady.
The problem:
I am trying to find the first app access date for a particular distinct_id.
App access is defined as: event in ('App access', 'App opened', 'App Launched')
The first app viewed date for a particular distinct_id.
App viewed is defined as: event == 'mediaReady'
My data is stored in Parquet files and the data volume is huge (2 years of data).
I tried the following to find the first app access date:
temp_result = spark.sql("""
    with cte as (
        select gaid,
               event,
               event_date,
               RANK() OVER (PARTITION BY gaid ORDER BY event_date) as rnk
        from df_raw_data
        WHERE upper(event) IN ('APP LAUNCHED', 'APP OPENED', 'APP ACCESS')
        group by gaid, event, event_date
    )
    select DISTINCT gaid, event_date, event from cte where rnk = 1
""")
I am trying to write a robust query that will scale as the data grows and still give the result.
I hope I've described the problem in a decent way.

Feels more like a pivot query:
SELECT
    gaid,
    MIN(CASE WHEN event IN ('App access', 'App opened', 'App Launched') THEN date END) AS first_app_access_date,
    MIN(CASE WHEN event IN ('mediaReady') THEN date END) AS first_app_viewed_date
FROM df_raw_data
GROUP BY gaid
I've no idea about case sensitivity etc. of a Spark database, so you might need to fix some of that up.
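For the case-sensitivity concern, a minimal sketch that reuses the upper() filter from the attempt above and the column names gaid, event and event_date (assumed from the OP's query, not confirmed by the schema excerpt):
SELECT gaid,
       MIN(CASE WHEN upper(event) IN ('APP ACCESS', 'APP OPENED', 'APP LAUNCHED')
                THEN event_date END) AS first_app_access_date,
       MIN(CASE WHEN upper(event) = 'MEDIAREADY'
                THEN event_date END) AS first_app_viewed_date
FROM df_raw_data
GROUP BY gaid
A single conditional-aggregation scan like this avoids the window-function shuffle, so it should hold up reasonably well as the Parquet data grows.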

Related

Find first event occurring after given event

I am working with a table consisting of a number of web sessions with various events and event IDs. To simplify my question, let's say that I have the columns session_id, event_name and event_id, where the event id can be used to order the events in ascending/descending order. Let's also pretend that we have a large number of events and that I am particularly interested in 3 of the events with event_name: open, submit and decline. Assume that these 3 events can occur in any order.
What I would like to do is add a new column that, for each session, says which, if any, of the two events 'submit' and 'decline' first follows the event 'open'. I have tried using the FIRST_VALUE window function but have not made it work successfully yet.
So for a session with event sequence: 'open', ... (a number of different events happening in between), 'submit', 'decline', I would like to return 'submit',
and for a session with event sequence: open, ... (a number of different events happening in between), 'decline', I would like to return 'decline',
and for sessions in which neither 'submit' nor 'decline' happens after 'open', I would like to return null.
You can use the following table with name 'events' for writing example SQL code:
I hope the question and its formulation are clear. Thank you very much in advance!
Sincerely,
Bertan
Use the query below (assuming you have only one accept or decline per session!)
select *, if(event_name != 'open', null, ['decline', 'accept'][ordinal(
    sum(case event_name when 'decline' then 1 when 'accept' then 2 end) over win
)]) as status
from your_table
window win as (
    partition by session_id order by event_id
    rows between 1 following and unbounded following
)
If applied to the sample data in your question, the output is
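Since the question mentions FIRST_VALUE, a sketch of that route as well (assuming BigQuery-style SQL like the answer above, and using the question's 'submit'/'decline' event names; FIRST_VALUE with IGNORE NULLS returns the first non-null value in the frame):
select *,
    if(event_name != 'open', null,
       first_value(if(event_name in ('submit', 'decline'), event_name, null) ignore nulls)
           over (partition by session_id order by event_id
                 rows between 1 following and unbounded following)
    ) as status
from your_table
Unlike the sum-based version, this does not rely on at most one submit or decline following the open; it simply reports whichever of the two comes first.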

How to get firebase console event details such as first_open, app_remove and Registration_Success using big query for last two weeks?

I'm creating a visualization for the app download count, app remove count and user registration count from Firebase console data for the last two weeks. It gives us the total count for the selected period but we need a date-wise count for each. For that, we plan to get the counts using BigQuery. How do we get all the metrics by writing a single query?
You can get all the metrics using a single query as below:
SELECT event_date, count(*), platform, event_name
FROM `apple-XYZ.analytics_XXXXXX.events_*`
WHERE (event_name = "app_remove" or event_name = "first_open" or event_name = "Registration_Success")
  AND (event_date between "20200419" and "20200502")
  AND (stream_id = "XYZ" or stream_id = "ZYX")
  AND (platform = "ANDROID" or platform = "IOS")
GROUP BY platform, event_date, event_name
ORDER BY event_date;
Result: for two weeks (from 19-04-2020 to 02-05-2020)
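If the three metrics are wanted as separate columns per date rather than as rows, a pivot-style variant could look like this (a sketch against the same table and filters as above; COUNTIF is BigQuery's conditional count):
SELECT event_date, platform,
       COUNTIF(event_name = "first_open") AS first_open_count,
       COUNTIF(event_name = "app_remove") AS app_remove_count,
       COUNTIF(event_name = "Registration_Success") AS registration_success_count
FROM `apple-XYZ.analytics_XXXXXX.events_*`
WHERE event_date BETWEEN "20200419" AND "20200502"
  AND stream_id IN ("XYZ", "ZYX")
  AND platform IN ("ANDROID", "IOS")
GROUP BY event_date, platform
ORDER BY event_date;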

BigQuery Firebase Average Coins Per Level In The Game

I developed a words game (using Firebase as my backend) with levels and coins.
Now I'm facing some difficulties trying to query my DB so that it outputs a table with all levels in the game and the average user coins for each level. For example:
Level Avg User Coins
0 50
1 12
2 2
Attached is a picture of my events table:
So as you can see, there is an event of 'level_end', then we can see the 'user coins' and 'level_num'. What is the right way to do that?
This is what I managed to do so far, obviously the wrong way:
SELECT event_name,user_id
FROM `words-game-en.analytics_208527783.events_20191004`,
UNNEST(event_params) as event_param
WHERE event_name = "level_end"
AND event_param.key = "user_coins"
You seem to want something like this:
SELECT event_param.level_num, AVG(event_param.user_coins)
FROM `words-game-en.analytics_208527783.events_20191004` CROSS JOIN
UNNEST(event_params) as event_param
WHERE event_name = 'level_end' AND event_param.key = 'user_coins'
GROUP BY level_num
ORDER BY level_num;
I'm a little confused by what is in event_params and what is directly in events, so you might need to properly qualify the column references.
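Given how the Firebase export stores event_params (an array of key/value structs), one way this often gets written is with a correlated subquery per parameter; a sketch, assuming both level_num and user_coins are recorded in value.int_value:
SELECT
    (SELECT ep.value.int_value FROM UNNEST(event_params) ep WHERE ep.key = 'level_num') AS level_num,
    AVG((SELECT ep.value.int_value FROM UNNEST(event_params) ep WHERE ep.key = 'user_coins')) AS avg_user_coins
FROM `words-game-en.analytics_208527783.events_20191004`
WHERE event_name = 'level_end'
GROUP BY level_num
ORDER BY level_num;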

Displaying single date header about multiple rows (Recycleview)

Evening everyone
I've currently got a simple RecyclerView adapter which is being populated from an SQLite database. The user can add information into the database from the app, which then builds a row inside the RecyclerView. When you run the application it displays each row with its own date directly above it. I'm now looking to make the application look more professional by only displaying a single date above multiple records as a header.
So far I've built 2 custom designs, one which displays the header along with the row and the other which is just a standard row without a header built in. I also understand how to implement two layouts in a single adapter.
I've also incorporated a column into my database which simply stores the date in a way in which I can order the database, e.g. 20190101.
Now my key question is: when populating the adapter using the information from the SQLite database, how can I get it to check whether the previous record has the same date? If the record has the same date then it doesn't need to show the custom row with the header, but if it's a new date then it does.
Thank you
/////////////////////////////////////////////////////////////////////////////
Follow-up question for Krokodilko: I've spent the last hour trying to work your implementation into my SQLite query but still haven't been able to find the right combination.
Below is the original SQLite line I currently use to simply get all the results.
Cursor cursor = sqLiteDatabase.rawQuery("SELECT * FROM " + Primary_Table + " " , null);
First you must define an order which will be used to determine which record is previous and which one is next. As I understand it, you are simply using the date column.
Then the query is simple - use the LAG analytic function to pick a column value from the previous row. Here is a link to a simple demo (click the "Run" button):
https://sqliteonline.com/#fiddle-5c323b7a7184cjmyjql6c9jh
DROP TABLE IF EXISTS d;
CREATE TABLE d(
d date
);
insert into d values ( '2012-01-22'),( '2012-01-22'),( '2015-01-22');
SELECT *,
lag( d ) OVER (order by d ) as prev_date,
CASE WHEN d = lag( d ) OVER (order by d )
THEN 'Previous row has the same date'
ELSE 'Previous row has different date'
END as Compare_status
FROM d
ORDER BY d;
In the above demo the d column is used in the OVER (order by d) clause to determine the order of rows used by the LAG function.
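A sketch of how this could plug into the rawQuery above, assuming the date column is called date_order (a hypothetical name, since the question doesn't give one) and substituting the table name held in Primary_Table; note that window functions such as LAG need SQLite 3.25+, which older Android versions may not ship:
SELECT *,
       CASE WHEN date_order = LAG(date_order) OVER (ORDER BY date_order)
            THEN 0 ELSE 1 END AS show_header
FROM Primary_Table
ORDER BY date_order;
The adapter can then read show_header from the cursor and pick the layout with or without the date header for each row.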

How to report MS Access data, calculated columns with group by

I have an Access 2003 database which records fault call help requests in a medium-sized organisation of around 200 users. Calls are logged (and appended into the database) via a Classic ASP page, and a team of systems administrators use a separate Classic ASP web page to view calls, provide a response, etc.
All calls are recorded in one table called tblFaultCall; its structure is below
tblFaultCall
ID : Autonumber
strName
strPhone
dtmDateOpen : Date/Time (date call logged)
dtmDateClosed : Date/Time (date call closed)
dtmTime : Date/Time (time call logged)
strStatus (always 'Open', 'Pending' or 'Closed')
strCategory (always one of 10 categories, held as a list in tblCatgory, and used in lookup lists in the ASP web page)
strFaultDesc
strResolution
strCallOwner
dtmDatePending : Date/Time (date call set to pending, if it ever was)
For management, I need a way of easily creating a quarterly report which shows as below
Calls received between dd/mm/yyyy and dd/mm/yyyy
----
Category   | Calls received | Of which 'Closed' | Closed within 5 days | Closed within 14 days | Open | Pending
Category X | 1052           | 950               | 700                  | 200                   | 50   | 50
Category Y | 65             | 60                | 50                   | 5                     | 0    | 5
I need an easy way to do this. I need the manager to be able to insert the dates he wants, then click a button and it all comes up. I cannot work out how to create one query which gives all of this. It's easy to give just the categories and the number of Open calls, but I can't work out how to add a further column to show the number of Closed calls, or the number closed within x days, etc. I can create individual queries for the harder columns, but can't get it all together.
So, options are
Classic ASP - I think this would involve a lot of individual SQL queries for the calculated fields
Access Report ?
Some kind of export to Excel?
VBA in Excel to link back to prepared queries in Access?
Any advice would be appreciated.
You should be able to get that data in one query. Try this one:
SELECT AllCalls.strCategory, CallsReceived, CallsClosed, ClosedWithin5Days, ClosedWithin14days, CallsOpen, CallsPending
FROM
((
SELECT strCategory,
Count(ID) AS CallsReceived,
Sum(IIF(strStatus='Closed',1,0)) AS CallsClosed,
Sum(IIF(strStatus='Open',1,0)) AS CallsOpen,
Sum(IIF(strStatus='Pending',1,0)) AS CallsPending
FROM tblFaultCall
WHERE dtmDateOpen BETWEEN #6/1/2014# and #6/30/2014#
GROUP BY strCategory
) AS AllCalls
LEFT JOIN
(
SELECT strCategory,
Count(ID) AS ClosedWithin5Days
FROM tblFaultCall
WHERE DateDiff("d", dtmDateOpen, dtmDateClosed) <=5
AND dtmDateOpen BETWEEN #6/1/2014# and #6/30/2014#
GROUP BY strCategory
) AS FiveDay ON AllCalls.strCategory=FiveDay.strCategory)
LEFT JOIN
(
SELECT strCategory,
Count(ID) AS ClosedWithin14Days
FROM tblFaultCall
WHERE DateDiff("d", dtmDateOpen, dtmDateClosed) between 5 and 14
AND dtmDateOpen BETWEEN #6/1/2014# and #6/30/2014#
GROUP BY strCategory
) AS FourteenDay ON AllCalls.strCategory=FourteenDay.strCategory
The classic ASP part should be very similar to your other pages: query the database, loop through the resulting data, output it to the screen. You would use the same approach if you were generating a spreadsheet too.
Each column can be calculated, mostly with iif statements:
Total calls = count(calls)
Closed calls = sum(iif(<call is closed>,1,0)) (however you define <call is closed>)
Closed in 5 days = sum(iif(<call is closed in 5 days>,1,0))
and so on
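As a single conditional-aggregation query in that style (a sketch; the date literals would come from the manager's input on the ASP page, and it assumes dtmDateClosed is only populated once a call is 'Closed'):
SELECT strCategory,
       Count(ID) AS CallsReceived,
       Sum(IIF(strStatus='Closed',1,0)) AS CallsClosed,
       Sum(IIF(strStatus='Closed' AND DateDiff("d", dtmDateOpen, dtmDateClosed) <= 5,1,0)) AS ClosedWithin5Days,
       Sum(IIF(strStatus='Closed' AND DateDiff("d", dtmDateOpen, dtmDateClosed) >= 6 AND DateDiff("d", dtmDateOpen, dtmDateClosed) <= 14,1,0)) AS ClosedWithin14Days,
       Sum(IIF(strStatus='Open',1,0)) AS CallsOpen,
       Sum(IIF(strStatus='Pending',1,0)) AS CallsPending
FROM tblFaultCall
WHERE dtmDateOpen BETWEEN #6/1/2014# AND #6/30/2014#
GROUP BY strCategory;
Because everything comes from one pass over tblFaultCall, there are no subquery joins to keep in sync when the date range changes.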