Creating daily bins for blockchain transactions - google-bigquery

After some manipulation, I ended up with a table in GBQ that lists all transactions made on the blockchain (around 280 million rows):
+-------+-------------------------+--------+-------+----------+
| Row | timestamp | sender | value | receiver |
+-------+-------------------------+--------+-------+----------+
| 1 | 2018-06-28 01:31:00 UTC | User1 | 1.67 | User2 |
| 2 | 2017-04-06 00:47:29 UTC | User3 | 0.02 | User4 |
| 3 | 2013-11-27 13:22:05 UTC | User5 | 0.25 | User6 |
+-------+-------------------------+--------+-------+----------+
Since this table has all transactions, summing all the values for each user up to a given date gives that user's balance. Since I have close to 22 million users, I want to bin them by the amount of coin they hold. I used this code to go through the whole dataset:
#standardSQL
SELECT
  COUNT(val) AS num,
  bin
FROM (
  SELECT
    val,
    CASE
      WHEN val > 0 AND val <= 1 THEN '0_to_1'
      WHEN val > 1 AND val <= 10 THEN '1_to_10'
      WHEN val > 10 AND val <= 100 THEN '10_to_100'
      WHEN val > 100 AND val <= 1000 THEN '100_to_1000'
      WHEN val > 1000 AND val <= 10000 THEN '1000_to_10000'
      WHEN val > 10000 THEN 'More_10000'
    END AS bin
  FROM (
    SELECT
      MAX(timestamp),
      receiver,
      SUM(value) AS val
    FROM
      `table.transactions`
    WHERE
      timestamp < '2011-02-12 00:00:00'
    GROUP BY
      receiver))
GROUP BY
  bin
Which gives me something like:
+-------+-------+---------------+
| Row | num | bin |
+-------+-------+---------------+
| 1 | 11518 | 1_to_10 |
| 2 | 9503 | 100_to_1000 |
| 3 | 18070 | 10_to_100 |
| 4 | 20275 | 0_to_1 |
| 5 | 1781 | 1000_to_10000 |
| 6 | 158 | More_10000 |
+-------+-------+---------------+
Now I want to iterate through the rows of my transactions table, checking the number of users in each bin at the end of every day. The final table should look something like this:
+-------------------------+---------+-----------+-----------+-------------+---------------+------------+
| timestamp | 0_to_1 | 1_to_10 | 10_to_100 | 100_to_1000 | 1000_to_10000 | More_10000 |
+-------------------------+---------+-----------+-----------+-------------+---------------+------------+
| 2009-01-09 00:00:00 UTC | 1 | 1 | 0 | 0 | 0 | 0 |
| 2009-01-10 00:00:00 UTC | 0 | 2 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 2018-09-10 00:00:00 UTC | 2342823 | 124324325 | 43251315 | 234523555 | 2352355556 | 12124235231|
+-------------------------+---------+-----------+-----------+-------------+---------------+------------+
I can't order by timestamp to make my life easier because the dataset is too large, so I would appreciate some ideas. I wonder if there is some way to improve performance and save resources using pagination, for example. I've heard about it, but don't have a clue how to use it.
Thanks in advance!
UPDATE: after some work, now I do have a transactions table ordered by timestamps.

The query below should give you the number of distinct receivers within each bin by timestamp. Keep in mind that this query evaluates the value of each transaction at the row level, not a running balance.
SELECT
  timestamp,
  COUNT(DISTINCT(CASE
    WHEN value > 0 AND value <= 1 THEN receiver
  END)) AS _0_to_1,
  COUNT(DISTINCT(CASE
    WHEN value > 1 AND value <= 10 THEN receiver
  END)) AS _1_to_10,
  COUNT(DISTINCT(CASE
    WHEN value > 10 AND value <= 100 THEN receiver
  END)) AS _10_to_100,
  COUNT(DISTINCT(CASE
    WHEN value > 100 AND value <= 1000 THEN receiver
  END)) AS _100_to_1000,
  COUNT(DISTINCT(CASE
    WHEN value > 1000 AND value <= 10000 THEN receiver
  END)) AS _1000_to_10000,
  COUNT(DISTINCT(CASE
    WHEN value > 10000 THEN receiver
  END)) AS More_10000
FROM `table.transactions`
-- keep every row whose calendar day matches yesterday's, rather than one exact instant
WHERE DATE(timestamp) = DATE(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY))
GROUP BY 1
Regarding your question about performance, one area you may want to explore (if possible) is creating a partitioned version of this big table. This will help you 1) improve performance and 2) reduce the cost of querying the data for a specific date range. The BigQuery documentation on partitioned tables has more information.
EDIT
I added a WHERE clause to the query to filter for the previous day. I am assuming you will run your query, for example, today to get the data from the previous day. You may need to adjust CURRENT_TIMESTAMP() to your time zone by wrapping it in an additional TIMESTAMP_SUB(..., INTERVAL X HOUR) or TIMESTAMP_ADD(..., INTERVAL X HOUR), where X is the number of hours that need to be subtracted or added to match the time zone of the data you are analyzing.
Also, you may need to CAST(timestamp AS TIMESTAMP) depending on the type of your field.
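For the balance-based table the question asks for (as opposed to the row-level counts above), one option is to compute each receiver's cumulative total as of every day and then bin those totals. What follows is only a minimal sketch under stated assumptions, not a tuned BigQuery query: it runs the idea on an in-memory SQLite database from Python, the function name daily_bins and the column name ts are made up for illustration, balances follow the question's own definition (sum of received values only), and only days that have at least one transaction are listed. The join compares every day with every earlier transaction, so a table of this size would likely need pre-aggregation per day first.

```python
import sqlite3

# Bin boundaries copied from the question: (column label, exclusive low, inclusive high).
# The "b_" prefix is only there because the labels are used as unquoted column names.
BINS = [
    ("b_0_to_1", 0, 1),
    ("b_1_to_10", 1, 10),
    ("b_10_to_100", 10, 100),
    ("b_100_to_1000", 100, 1000),
    ("b_1000_to_10000", 1000, 10000),
]


def daily_bins(rows):
    """rows: iterable of (ts, sender, value, receiver), ts as 'YYYY-MM-DD HH:MM:SS'."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE transactions (ts TEXT, sender TEXT, value REAL, receiver TEXT)"
    )
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)

    # One SUM(CASE ...) column per bin, plus the open-ended top bin.
    bin_columns = [
        f"SUM(CASE WHEN val > {low} AND val <= {high} THEN 1 ELSE 0 END) AS {label}"
        for label, low, high in BINS
    ]
    bin_columns.append("SUM(CASE WHEN val > 10000 THEN 1 ELSE 0 END) AS more_10000")
    columns_sql = ", ".join(bin_columns)

    sql = f"""
        WITH days AS (
            SELECT DISTINCT date(ts) AS day FROM transactions
        ),
        balances AS (
            -- total received by each receiver up to and including each day
            SELECT d.day, t.receiver, SUM(t.value) AS val
            FROM days AS d
            JOIN transactions AS t ON date(t.ts) <= d.day
            GROUP BY d.day, t.receiver
        )
        SELECT day, {columns_sql}
        FROM balances
        GROUP BY day
        ORDER BY day
    """
    result = conn.execute(sql).fetchall()
    conn.close()
    return result


# Made-up sample data, not taken from the real table.
sample = [
    ("2009-01-09 10:00:00", "A", 0.5, "B"),
    ("2009-01-09 11:00:00", "A", 5.0, "C"),
    ("2009-01-10 09:00:00", "A", 2.0, "B"),
]
for row in daily_bins(sample):
    print(row)
```

Each output row is one day followed by the six bin counts, in the same column order as the desired table in the question.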

Related

Running maths over an entire database and ranking all users

I have a database of bets. Each bet has a 'Win', 'Loss', or 'Pending' state. What I want to do is to have an SQL statement that will get the last, say, 20 bets a user has placed, find out their ROI (Total profit / Total staked * 100).
So I'm just wondering if there is a better way to do this. Do I basically have to get the users table, loop over every user, get their last 20 bets, find the ROI and then order it? If my User table gets huge then this process is going to take ages, right?
Is creating a 'View' going to save on this time?
Is there a way to do this in one statement that won't cost my life in processing time?
Here are the tables
Users
| ID | User |
| 1 | Test1 |
| 2 | Test2 |
| 3 | Test3 |
| 4 | Test4 |
Bets
| ID | User | Amount | Odds | Result |
| 1 | 1 | 10 | 1.35 | Win |
| 2 | 1 | 25 | 2.55 | Win |
| 3 | 3 | 15 | 1.65 | Loss |
| 4 | 2 | 11 | 2.12 | Pending |
So essentially I would like a table that ranks them by ROI.
| User | AmountBet | AmountWon | ROI |
| 1 | 35 | 77 | 215 |
| 2 | 11 | 0 | 0 |
| 3 | 15 | 0 | 0 |
| 4 | 0 | 0 | 0 |
Assuming the ID of the bets table represents increasing time such that it can be used to identify "last 20", then
WITH b
AS
(
SELECT id,
user,
CASE WHEN result = 'Pending' THEN 0 ELSE amount END AS amount,
CASE WHEN result = 'Win' THEN amount * odds ELSE 0 END as winnings,
ROW_NUMBER() OVER (PARTITION BY user ORDER BY id DESC) AS rownum
FROM bets
)
SELECT user,
SUM(amount) AS amount_bet,
SUM(winnings) AS amount_won,
CASE
WHEN SUM(amount) > 0
THEN SUM(winnings) * 100 / SUM(amount)
ELSE 0
END AS roi
FROM b
WHERE rownum < 21
GROUP BY user;
dbfiddle.uk
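To try the ROW_NUMBER approach locally, the sketch below runs essentially the same query against an in-memory SQLite database from Python. It is an illustration under assumptions: the function name rank_by_roi and the last_n parameter are invented here, the staked-amount alias is renamed to staked to avoid shadowing the amount column, and an ORDER BY is added to produce the ranking. Users with no bets at all do not appear, because the query never touches the Users table.

```python
import sqlite3

ROI_SQL = """
WITH b AS (
    SELECT id,
           user,
           CASE WHEN result = 'Pending' THEN 0 ELSE amount END AS staked,
           CASE WHEN result = 'Win' THEN amount * odds ELSE 0 END AS winnings,
           -- number each user's bets from newest (1) to oldest
           ROW_NUMBER() OVER (PARTITION BY user ORDER BY id DESC) AS rownum
    FROM bets
)
SELECT user,
       SUM(staked) AS amount_bet,
       SUM(winnings) AS amount_won,
       CASE
           WHEN SUM(staked) > 0 THEN SUM(winnings) * 100.0 / SUM(staked)
           ELSE 0
       END AS roi
FROM b
WHERE rownum <= ?
GROUP BY user
ORDER BY roi DESC, user
"""


def rank_by_roi(bets, last_n=20):
    """bets: iterable of (id, user, amount, odds, result) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bets (id INTEGER, user INTEGER, amount REAL, odds REAL, result TEXT)"
    )
    conn.executemany("INSERT INTO bets VALUES (?, ?, ?, ?, ?)", bets)
    rows = conn.execute(ROI_SQL, (last_n,)).fetchall()
    conn.close()
    return rows


# The four bets from the question.
sample_bets = [
    (1, 1, 10, 1.35, "Win"),
    (2, 1, 25, 2.55, "Win"),
    (3, 3, 15, 1.65, "Loss"),
    (4, 2, 11, 2.12, "Pending"),
]
for row in rank_by_roi(sample_bets):
    print(row)
```

Because the whole ranking is one set-based statement, there is no per-user loop in application code; an index on (user, id) would be the usual thing to try if the window function becomes slow.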

30 day rolling count of distinct IDs

So after looking at what seems to be a common question being asked and not being able to get any solution to work for me, I decided I should ask for myself.
I have a data set with two columns: session_start_time, uid
I am trying to generate a rolling 30 day tally of unique sessions
It is simple enough to query for the number of unique uids per day:
SELECT
COUNT(DISTINCT(uid))
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '30 days'
It is also relatively simple to calculate the daily unique uids over a date range.
SELECT
DATE_TRUNC('day',session_start_time) AS "date"
,COUNT(DISTINCT uid) AS "count"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - INTERVAL '90 days'
GROUP BY date(session_start_time)
I then tried several ways to do a rolling 30 day unique count over a time interval:
SELECT
DATE(session_start_time) AS "running30day"
,COUNT(distinct(
case when date(session_start_time) >= running30day - interval '30 days'
AND date(session_start_time) <= running30day
then uid
end)
) AS "unique_30day"
FROM segment_clean.users_sessions
WHERE session_start_time >= CURRENT_DATE - interval '3 months'
GROUP BY date(session_start_time)
Order BY running30day desc
I really thought this would work but when looking into the results, it appears I'm getting the same results as I was when doing the daily unique rather than the unique over 30days.
I am writing this query from Metabase using the SQL query editor. The underlying tables are in Redshift.
If you read this far, thank you, your time has value and I appreciate the fact that you have spent some of it to read my question.
EDIT:
As rightfully requested, I added an example of the data set I'm working with and the desired outcome.
+-----+-------------------------------+
| UID | SESSION_START_TIME |
+-----+-------------------------------+
| 10 | 2020-01-13T01:46:07.000-05:00 |
| 5 | 2020-01-13T01:46:07.000-05:00 |
| 3 | 2020-01-18T02:49:23.000-05:00 |
| 9 | 2020-03-06T18:18:28.000-05:00 |
| 2 | 2020-03-06T18:18:28.000-05:00 |
| 8 | 2020-03-31T23:13:33.000-04:00 |
| 3 | 2020-08-28T18:23:15.000-04:00 |
| 2 | 2020-08-28T18:23:15.000-04:00 |
| 9 | 2020-08-28T18:23:15.000-04:00 |
| 3 | 2020-08-28T18:23:15.000-04:00 |
| 8 | 2020-09-15T16:40:29.000-04:00 |
| 3 | 2020-09-21T20:49:09.000-04:00 |
| 1 | 2020-11-05T21:31:48.000-05:00 |
| 6 | 2020-11-05T21:31:48.000-05:00 |
| 8 | 2020-12-12T04:42:00.000-05:00 |
| 8 | 2020-12-12T04:42:00.000-05:00 |
| 5 | 2020-12-12T04:42:00.000-05:00 |
+-----+-------------------------------+
Below is what I would like the result to look like:
+------------+---------------------+
| DATE | UNIQUE 30 DAY COUNT |
+------------+---------------------+
| 2020-01-13 | 3 |
| 2020-01-18 | 1 |
| 2020-03-06 | 3 |
| 2020-03-31 | 1 |
| 2020-08-28 | 4 |
| 2020-09-15 | 2 |
| 2020-09-21 | 1 |
| 2020-11-05 | 2 |
| 2020-12-12 | 2 |
+------------+---------------------+
Thank you
You can approach this by keeping a counter of when users are counted and then uncounted, 30 (or perhaps 31) days later. Then determine the "islands" of being counted, and aggregate. This involves:
1. Unpivoting the data to have an "enters count" and a "leaves count" row for each session.
2. Accumulating the count so that on each day, for each user, you know whether they are counted or not.
3. Using that to define "islands" of counting: determine where the islands start and stop, getting rid of all the detritus in between.
4. Doing a cumulative sum on each date to determine the 30 day count.
In SQL, this looks like:
with t as (
select uid, date_trunc('day', session_start_time) as s_day, 1 as inc
from users_sessions
union all
select uid, date_trunc('day', session_start_time) + interval '31 day' as s_day, -1
from users_sessions
),
tt as ( -- increment the ins and outs to determine whether a uid is in or out on a given day
select uid, s_day, sum(inc) as day_inc,
sum(sum(inc)) over (partition by uid order by s_day rows between unbounded preceding and current row) as running_inc
from t
group by uid, s_day
),
ttt as ( -- find the beginning and end of the islands
select tt.uid, tt.s_day,
(case when running_inc > 0 then 1 else -1 end) as in_island
from (select tt.*,
lag(running_inc) over (partition by uid order by s_day) as prev_running_inc,
lead(running_inc) over (partition by uid order by s_day) as next_running_inc
from tt
) tt
where running_inc > 0 and (prev_running_inc = 0 or prev_running_inc is null) or
running_inc = 0 and (next_running_inc > 0 or next_running_inc is null)
)
select s_day,
sum(sum(in_island)) over (order by s_day rows between unbounded preceding and current row) as active_30
from ttt
group by s_day;
Here is a db<>fiddle.
I'm pretty sure the easier way to do this is to use a join. The query below creates a list of the distinct users who had a session on each day and a list of all distinct dates in the data. It then one-to-many joins the user list to the date list and counts the distinct users. The key is the join condition, which matches a range of dates to a single date via a pair of inequalities.
with users as
(select
distinct uid,
date_trunc('day',session_start_time) AS dt
from <table>
where session_start_time >= '2021-05-01'),
dates as
(select
distinct date_trunc('day',session_start_time) AS dt
from <table>
where session_start_time >= '2021-05-01')
select
count(distinct uid),
dates.dt
from users
join
dates
on users.dt >= dates.dt - interval '29 days'
and users.dt <= dates.dt
group by dates.dt
order by dt desc
;
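The range join can be checked on a small data set before running it on the warehouse. The sketch below is an assumption-laden illustration rather than Redshift code: it uses SQLite from Python, so date(x, '-29 days') stands in for the interval arithmetic, and the function name rolling_unique and the sample rows are invented.

```python
import sqlite3

ROLLING_SQL = """
WITH users AS (
    SELECT DISTINCT uid, date(session_start_time) AS dt
    FROM users_sessions
),
dates AS (
    SELECT DISTINCT date(session_start_time) AS dt
    FROM users_sessions
)
SELECT dates.dt, COUNT(DISTINCT users.uid) AS unique_30day
FROM dates
JOIN users
  ON users.dt >= date(dates.dt, '-29 days')  -- window start (30 days inclusive)
 AND users.dt <= dates.dt                    -- window end
GROUP BY dates.dt
ORDER BY dates.dt
"""


def rolling_unique(sessions):
    """sessions: iterable of (uid, 'YYYY-MM-DD HH:MM:SS') tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users_sessions (uid INTEGER, session_start_time TEXT)")
    conn.executemany("INSERT INTO users_sessions VALUES (?, ?)", sessions)
    rows = conn.execute(ROLLING_SQL).fetchall()
    conn.close()
    return rows


# Made-up sessions: uid 1 and 2 start the month, uid 3 appears later.
sample_sessions = [
    (1, "2020-01-01 08:00:00"),
    (2, "2020-01-01 09:00:00"),
    (1, "2020-01-15 10:00:00"),
    (3, "2020-01-30 11:00:00"),
    (3, "2020-02-20 12:00:00"),
]
for row in rolling_unique(sample_sessions):
    print(row)
```

Only dates that have at least one session produce a row; days with no sessions would need a separate calendar table to show up.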

SQL: Get an aggregate (SUM) of a calculation of two fields (DATEDIFF) that has conditional logic (CASE WHEN)

I have a dataset that includes a bunch of stay data (at a hotel). Each row contains a start date and an end date, but no duration field. I need to get a sum of the durations.
Sample Data:
| Stay ID | Client ID | Start Date | End Date |
| 1 | 38 | 01/01/2018 | 01/31/2019 |
| 2 | 16 | 01/03/2019 | 01/07/2019 |
| 3 | 27 | 01/10/2019 | 01/12/2019 |
| 4 | 27 | 05/15/2019 | NULL |
| 5 | 38 | 05/17/2019 | NULL |
There are some added complications:
I am using Crystal Reports and this is a SQL Expression, which obeys slightly different rules. Basically, it returns a single scalar value. Here is some more info: http://www.cogniza.com/wordpress/2005/11/07/crystal-reports-using-sql-expression-fields/
Sometimes, the end date field is blank (they haven't booked out yet). If blank, I would like to replace it with the current timestamp.
I only want to count nights that have occurred in the past year. If the start date of a given stay is more than a year ago, I need to adjust it.
I need to get a sum by Client ID
I'm not actually any good at SQL so all I have is guesswork.
The proper syntax for a Crystal Reports SQL Expression is something like this:
(
SELECT (CASE
WHEN StayDateStart < DATEADD(year,-1,CURRENT_TIMESTAMP) THEN DATEDIFF(day,DATEADD(year,-1,CURRENT_TIMESTAMP),ISNULL(StayDateEnd,CURRENT_TIMESTAMP))
ELSE DATEDIFF(day,StayDateStart,ISNULL(StayDateEnd,CURRENT_TIMESTAMP))
END)
)
And that's giving me the correct value for a single row, if I wanted to do this:
| Stay ID | Client ID | Start Date | End Date | Duration |
| 1 | 38 | 01/01/2018 | 01/31/2019 | 210 | // only days since June 4 2018 are counted
| 2 | 16 | 01/03/2019 | 01/07/2019 | 4 |
| 3 | 27 | 01/10/2019 | 01/12/2019 | 2 |
| 4 | 27 | 05/15/2019 | NULL | 21 |
| 5 | 38 | 05/17/2019 | NULL | 19 |
But I want to get the SUM of Duration per client, so I want this:
| Stay ID | Client ID | Start Date | End Date | Duration |
| 1 | 38 | 01/01/2018 | 01/31/2019 | 229 | // 210+19
| 2 | 16 | 01/03/2019 | 01/07/2019 | 4 |
| 3 | 27 | 01/10/2019 | 01/12/2019 | 23 | // 2+21
| 4 | 27 | 05/15/2019 | NULL | 23 |
| 5 | 38 | 05/17/2019 | NULL | 229 |
I've tried to just wrap a SUM() around my CASE but that doesn't work:
(
SELECT SUM(CASE
WHEN StayDateStart < DATEADD(year,-1,CURRENT_TIMESTAMP) THEN DATEDIFF(day,DATEADD(year,-1,CURRENT_TIMESTAMP),ISNULL(StayDateEnd,CURRENT_TIMESTAMP))
ELSE DATEDIFF(day,StayDateStart,ISNULL(StayDateEnd,CURRENT_TIMESTAMP))
END)
)
It gives me an error that the StayDateEnd is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. But I don't even know what that means, so I'm not sure how to troubleshoot, or where to go from here. And then the next step is to get the SUM by Client ID.
Any help would be greatly appreciated!
Although the explanation and data set are almost impossible to match, I think this is an approximation to what you want.
-- table variable holding the sample data (T-SQL table variables start with @)
declare @your_data table (StayId int, ClientId int, StartDate date, EndDate date)
insert into @your_data values
(1,38,'2018-01-01','2019-01-31'),
(2,16,'2019-01-03','2019-01-07'),
(3,27,'2019-01-10','2019-01-12'),
(4,27,'2019-05-15',NULL),
(5,38,'2019-05-17',NULL)
;with data as (
  select *,
    datediff(day,
      case
        -- clip stays that began more than a year ago to one year back
        when datediff(day,StartDate,getdate())>365 then dateadd(year,-1,getdate())
        else StartDate
      end,
      -- open stays are counted up to today
      isnull(EndDate,getdate())
    ) days
  from @your_data
)
select *,
  sum(days) over (partition by ClientId) as total_days
from data
https://rextester.com/HCKOR53440
You need a subquery that computes the sum grouped by client_id, joined back to your table, e.g.:
select your_table.Stay_id, your_table.client_id, your_table.Start_date, your_table.End_date, t.sum_duration
from your_table
inner join (
select Client_id,
SUM(CASE
WHEN StayDateStart < DATEADD(year,-1,CURRENT_TIMESTAMP) THEN DATEDIFF(day,DATEADD(year,-1,CURRENT_TIMESTAMP),ISNULL(StayDateEnd,CURRENT_TIMESTAMP))
ELSE DATEDIFF(day,StayDateStart,ISNULL(StayDateEnd,CURRENT_TIMESTAMP))
END) sum_duration
from your_table
group by Client_id
) t on t.Client_id = your_table.client_id
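Both answers boil down to "compute a clipped duration per stay, then sum per client". The sketch below reproduces that on the question's sample rows using SQLite from Python, with a fixed "today" so the numbers are repeatable. It is an illustration, not Crystal Reports or T-SQL syntax: julianday differences stand in for DATEDIFF, the two-argument MAX stands in for the CASE that clips the start date, and the names stay_totals and today are invented here.

```python
import sqlite3

STAY_SQL = """
WITH clipped AS (
    SELECT stay_id,
           client_id,
           CAST(
               julianday(COALESCE(end_date, :today))                -- open stays end today
               - julianday(MAX(start_date, date(:today, '-1 year')))  -- clip to the past year
               AS INTEGER
           ) AS days
    FROM stays
)
SELECT stay_id,
       client_id,
       days,
       SUM(days) OVER (PARTITION BY client_id) AS client_days
FROM clipped
ORDER BY stay_id
"""


def stay_totals(stays, today):
    """stays: iterable of (stay_id, client_id, start_date, end_date) with ISO dates."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE stays (stay_id INTEGER, client_id INTEGER, start_date TEXT, end_date TEXT)"
    )
    conn.executemany("INSERT INTO stays VALUES (?, ?, ?, ?)", stays)
    rows = conn.execute(STAY_SQL, {"today": today}).fetchall()
    conn.close()
    return rows


# Sample rows from the question, rewritten as ISO dates.
sample_stays = [
    (1, 38, "2018-01-01", "2019-01-31"),
    (2, 16, "2019-01-03", "2019-01-07"),
    (3, 27, "2019-01-10", "2019-01-12"),
    (4, 27, "2019-05-15", None),
    (5, 38, "2019-05-17", None),
]
# "today" is assumed to be 2019-06-05 purely to make the run deterministic.
for row in stay_totals(sample_stays, "2019-06-05"):
    print(row)
```

A stay that ended before the one-year window would come out negative with this arithmetic, as it would in the queries above, so real data may need an extra filter or a floor at zero.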

PARTITION BY in CASE doesn't work with several AND statements

I have a table with 4 columns: hitId, userId, timestamp and Camp.
I need to classify if a hit is a start of a new session or not (1 or 0) using two parameters: 1. the time difference between hits and 2. if the source of the hit is a new campaign.
I need a standard SQL query in BigQuery.
A hit is considered as a start of a new session if one of the following is true:
it's the first hit from its userId
the time difference between the timestamp of the previous hit from the same userId is more than 30 mins.
the time difference between the timestamp of the previous hit from the same userId is less than 30 mins, but the Camp (ad campaign) value is not NULL and occurs for the first time for the same userId within the previous 30 mins.
So if hit1 from user1 has a Camp equal to Campaign1, and hit2 from user1 has a Camp equal to Campaign1, and time difference between hit1 and hit2 is less than 30 mins, hit1 will be considered as a start of a session, and hit2 won't be considered as a start.
I have trouble with the campaign part. I tried this code:
WITH timeDifference AS (
SELECT *,
TIMESTAMP_DIFF(timestamp, LAG(timestamp, 1) OVER
(PARTITION BY userId ORDER BY timestamp), SECOND) AS difference
FROM hitTable
ORDER BY timestamp)
SELECT *,
CASE
WHEN difference >= 30 * 60 THEN 1
WHEN difference IS NULL THEN 1
WHEN difference <= 30 * 60 AND Camp IS NOT NULL AND RANK()
OVER (PARTITION BY userId ORDER BY Camp) = 1 THEN 1
ELSE 0 END AS sess
FROM timeDifference
ORDER BY timestamp;
The condition RANK() OVER (PARTITION BY userId ORDER BY Camp) does not seem to work, as I get this table:
hitId | userId | timestamp | Camp | difference | sess
_______________________________________________________________________
00150 | 858201 | 00:48:35.315 | NULL | NULL | 1
00151 | 858201 | 00:49:35.315 | NULL | 5 | 0
00152 | 858201 | 00:50:35.315 | Search-Ads-US | 10 | 0
00153 | 858201 | 00:53:35.315 | Search-Ads-US | 15 | 0
00154 | 858202 | 00:54:35.315 | Facebook-Ads | NULL | 1
00155 | 858202 | 00:54:55.315 | Facebook-Ads | 9 | 0
00156 | 858202 | 00:57:20.315 | Facebook-Ads | 12 | 0
I expected the sess column to be 1 for hitId = 00152:
hitId | userId | timestamp | Camp | difference | sess
_______________________________________________________________________
00150 | 858201 | 00:48:35.315 | NULL | NULL | 1
00151 | 858201 | 00:49:35.315 | NULL | 5 | 0
00152 | 858201 | 00:50:35.315 | Search-Ads-US | 10 | 1
00153 | 858201 | 00:53:35.315 | Search-Ads-US | 15 | 0
00154 | 858202 | 00:54:35.315 | Facebook-Ads | NULL | 1
00155 | 858202 | 00:54:55.315 | Facebook-Ads | 9 | 0
00156 | 858202 | 00:57:20.315 | Facebook-Ads | 12 | 0
This RANK() OVER (PARTITION BY userId ORDER BY Camp) returns wrong results in cases where a user had multiple Camps.
Notice your PARTITION BY uses userId while you want to mark sessions within each Camp.
The actual "rank 1" of the RANK() (...) expression for userId 858201 is the row where Camp is NULL (hitId 00150), so the CASE condition is never met at hitId 00152.
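You could try adding Camp to your PARTITION BY and ordering by timestamp, so that only the first hit of each campaign gets rank 1 (ordering by Camp inside a Camp partition would tie every row at rank 1):
RANK() OVER (PARTITION BY userId, Camp ORDER BY timestamp)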
Alternatively, you could replace the RANK() (...) and use LAG(Camp) (... order by timestamp) in addition to the LAG(timestamp) (...) you are calculating.
This will retrieve the Camp value for the row before (call it 'PreviousCampValue'). Then you could add something like WHEN PreviousCampValue != Camp THEN 1
Hope that's helpful
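The LAG(Camp) idea can be sketched as follows. This is a simplified illustration under assumptions, not the asker's exact rule: it flags a hit when its Camp is non-NULL and differs from the previous hit's Camp for the same user, which approximates "first time within the previous 30 min" only when campaigns do not alternate. It runs on SQLite from Python, timestamps are plain seconds for brevity, and the names mark_sessions, hits, ts, and prev_camp are invented. SQLite's IS NOT is used as a null-safe inequality; other dialects need their own equivalent, since a plain != yields NULL when the previous Camp is NULL.

```python
import sqlite3

SESSION_SQL = """
WITH prev AS (
    SELECT hit_id, user_id, ts, camp,
           LAG(ts)   OVER (PARTITION BY user_id ORDER BY ts) AS prev_ts,
           LAG(camp) OVER (PARTITION BY user_id ORDER BY ts) AS prev_camp
    FROM hits
)
SELECT hit_id,
       CASE
           WHEN prev_ts IS NULL THEN 1                              -- first hit of the user
           WHEN ts - prev_ts > 30 * 60 THEN 1                       -- more than 30 minutes idle
           WHEN camp IS NOT NULL AND camp IS NOT prev_camp THEN 1   -- campaign changed
           ELSE 0
       END AS sess
FROM prev
ORDER BY user_id, ts
"""


def mark_sessions(hits):
    """hits: iterable of (hit_id, user_id, ts_seconds, camp) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hits (hit_id TEXT, user_id INTEGER, ts INTEGER, camp TEXT)")
    conn.executemany("INSERT INTO hits VALUES (?, ?, ?, ?)", hits)
    rows = conn.execute(SESSION_SQL).fetchall()
    conn.close()
    return rows


# Hits modelled on the question's table; ts is seconds since the first hit.
sample_hits = [
    ("00150", 858201, 0, None),
    ("00151", 858201, 60, None),
    ("00152", 858201, 120, "Search-Ads-US"),
    ("00153", 858201, 300, "Search-Ads-US"),
    ("00154", 858202, 360, "Facebook-Ads"),
    ("00155", 858202, 380, "Facebook-Ads"),
    ("00156", 858202, 525, "Facebook-Ads"),
]
for row in mark_sessions(sample_hits):
    print(row)
```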

Count function with multiple conditions

I'm trying to do an overall count function on a set of data with multiple conditions but am having trouble with it. I'm a beginner and tried using a simple count function but am having no luck. I looked into using case when but am having trouble with it. Does anyone know how I should go about this code?
Here is an example of my table:
Name | Date | Status | Candy | Soda | Water
Nancy | 10/19/16 | active | 2 | 0 | 1
Lindsy| 10/20/15 | active | 0 | 1 | 0
Erica | 10/20/13 | active | 0 | 2 | 3
Lane | 10/19/14 | active | 0 | 0 | 4
Alexa | 10/19/16 | notactive | 0 | 5 | 1
Jenn | 10/19/16 | active | 0 | 0 | 0
I'm looking to do an overall count of the names under the conditions that: either candy, soda, or water are anything other than zero(doesn't matter what column or how many, just if one of those three are not zero), the account is active and also when the date falls within the last two years, 10/2014 - 10/2016.
I would want the query to tell me that the count total was 3 and also show me:
Name | Date | Status | Candy | Soda | Water
Nancy | 10/19/16 | active | 2 | 0 | 1
Lindsy| 10/20/15 | active | 0 | 1 | 0
Lane | 10/19/14 | active | 0 | 0 | 4
These are two different questions. The basic idea to get the rows is:
select t.*
from t
where greatest(candy, soda, water) > 0 and
status = 'active' and
date >= curdate() - interval 2 year;
(In Oracle, you could use sysdate rather than curdate().)
To get the count, you would use count(*) rather than * in the select. SQL queries only return one result set, so you either get all the rows or a single count.
SELECT *
FROM yourTable
WHERE (Candy > 0 OR Soda > 0 OR Water > 0) AND
Status = 'active' AND
Date BETWEEN '2014-10-01' AND SYSDATE
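Since a single query returns either the rows or the count, one way to get both is to share the WHERE clause between two statements. The sketch below shows that with SQLite from Python on the question's sample table. It is an illustration under assumptions: the date column is renamed dt and stored as ISO text, the date window is passed in explicitly instead of using the current date, and the names active_buyers and customers are invented.

```python
import sqlite3

# Shared filter: at least one non-zero product column, active status, date in range.
FILTER = """
    (candy > 0 OR soda > 0 OR water > 0)
    AND status = 'active'
    AND dt BETWEEN ? AND ?
"""


def active_buyers(rows, start, end):
    """rows: iterable of (name, dt, status, candy, soda, water); returns (count, names)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers "
        "(name TEXT, dt TEXT, status TEXT, candy INTEGER, soda INTEGER, water INTEGER)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?, ?, ?)", rows)
    count = conn.execute(
        f"SELECT COUNT(*) FROM customers WHERE {FILTER}", (start, end)
    ).fetchone()[0]
    names = [
        r[0]
        for r in conn.execute(
            f"SELECT name FROM customers WHERE {FILTER} ORDER BY rowid", (start, end)
        )
    ]
    conn.close()
    return count, names


# The question's sample rows with dates rewritten as ISO strings.
sample_customers = [
    ("Nancy", "2016-10-19", "active", 2, 0, 1),
    ("Lindsy", "2015-10-20", "active", 0, 1, 0),
    ("Erica", "2013-10-20", "active", 0, 2, 3),
    ("Lane", "2014-10-19", "active", 0, 0, 4),
    ("Alexa", "2016-10-19", "notactive", 0, 5, 1),
    ("Jenn", "2016-10-19", "active", 0, 0, 0),
]
print(active_buyers(sample_customers, "2014-10-01", "2016-10-31"))
```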