Google Big Query: New Column of Aggregate Based On Condition of Current Row


Using the Google Big Query database bigquery-public-data.crypto_ethereum_classic.transactions as reference.
For each transaction row, I want to calculate the count of all transactions to the same address that occurred before that transaction, and sum of the gas usage of them. I am sure I can do this with a join as I have tried and Google accepts my old query, but since there is so much data as a result of the (inner) join, there is almost always a "quota limit exceeded" error. At the same time, I think a subquery solution is inefficient, as it is querying almost the same thing in both aggregate functions.
In a perfect world the query would use something like a join to create a temporary table with all columns I need (transaction_hash, receipt_gas_used, to_address, block_timestamp), according to the conditions (where to_address = table_1.to_address and block_timestamp < table_1.block_timestamp), where I can then perform the aggregate functions on the columns of that table.
What I have so far and what I'm looking for is something like...:
SELECT
table_1.*,
COUNT(
DISTINCT IF(block_timestamp < table_1.block_timestamp and to_address = table_1.to_address, `hash`, NULL)
) as txn_count,
SUM(
IF(block_timestamp < table_1.block_timestamp and to_address = table_1.to_address, `receipt_gas_used`, NULL)
) as total_gas_used
from
`bigquery-public-data.crypto_ethereum_classic.transactions` as table_1
where block_number >= 3000000 and block_number <= 3500000 #just to subset the data a bit

I think you want window functions:
select t.*,
row_number() over (partition by to_address order by block_timestamp) as txn_seqnum,
sum(receipt_gas_used) over (partition by to_address order by block_timestamp) as total_gas_used
from `bigquery-public-data.crypto_ethereum_classic.transactions` as t
where block_number >= 3000000 and block_number <= 3500000 #just to subset the data a bit
If you really have ties and need the distinct, then use dense_rank() instead of row_number().
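One caveat worth illustrating: with only an ORDER BY in the window, the default frame runs up to and including the current row, while the question asks for transactions strictly before it. A minimal sketch of an explicit frame that stops one row earlier is below. It runs on SQLite (3.25 or later) through Python's sqlite3 rather than on BigQuery, the four rows are made up, and it assumes timestamps are unique within each address, since a ROWS frame does not treat ties specially.

```python
import sqlite3

# Toy stand-in for the transactions table; every value here is invented.
rows = [
    ("h1", "0xA", 100, 21000),
    ("h2", "0xA", 200, 30000),
    ("h3", "0xB", 150, 50000),
    ("h4", "0xA", 300, 40000),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "hash TEXT, to_address TEXT, block_timestamp INTEGER, receipt_gas_used INTEGER)"
)
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)

# The frame ends at "1 PRECEDING", so the current row is excluded.
query = """
SELECT hash,
       COUNT(*) OVER w_before AS txn_count,
       SUM(receipt_gas_used) OVER w_before AS total_gas_used
FROM transactions
WINDOW w_before AS (
    PARTITION BY to_address
    ORDER BY block_timestamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
ORDER BY hash
"""
prior_stats = conn.execute(query).fetchall()
for row in prior_stats:
    print(row)
```

The first transaction for each address gets a count of 0 and a NULL sum (None in Python), because its frame is empty. Wrap the sum in COALESCE if 0 is preferred.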

Related

AWS Timestream query to get average measure for the first month of samples

In AWS Timestream I am trying to get the average heart rate for the first month since we have received heart rate samples for a specific user and the average for the last week. I'm having trouble with the query to get the first month part. When I try to use MIN(time) in the where clause I get the error: WHERE clause cannot contain aggregations, window functions or grouping operations.
SELECT * FROM "DATABASE"."TABLE"
WHERE measure_name = 'heart_rate' AND time < min(time) + 30
If I add it as a column and try to query on the column, I get the error: Column 'first_sample_time' does not exist
SELECT MIN(time) AS first_sample_time FROM "DATABASE"."TABLE"
WHERE measure_name = 'heart_rate' AND time > first_sample_time
Also if I try to add to MIN(time) I get the error: line 1:18: '+' cannot be applied to timestamp, integer
SELECT MIN(time) + 30 AS first_sample_time FROM "DATABASE"."TABLE"
Here is what I finally came up with but I'm wondering if there is a better way to do it?
WITH first_month AS (
SELECT
Min(time) AS creation_date,
From_milliseconds(
To_milliseconds(
Min(time)
) + 2628000000
) AS end_of_first_month,
USER
FROM
"DATABASE"."TABLE"
WHERE
USER = 'xxx'
AND measure_name = 'heart_rate'
GROUP BY
USER
),
first_month_avg AS (
SELECT
Avg(hm.measure_value :: DOUBLE) AS first_month_average,
fm.USER
FROM
"DATABASE"."TABLE" hm
JOIN first_month fm ON hm.USER = fm.USER
WHERE
measure_name = 'heart_rate'
AND hm.time BETWEEN fm.creation_date
AND fm.end_of_first_month
GROUP BY
fm.USER
),
last_week_avg AS (
SELECT
Avg(measure_value :: DOUBLE) AS last_week_average,
USER
FROM
"DATABASE"."TABLE"
WHERE
measure_name = 'heart_rate'
AND time > ago(7d)
AND USER = 'xxx'
GROUP BY
USER
)
SELECT
lwa.last_week_average,
fma.first_month_average,
lwa.USER
FROM
first_month_avg fma
JOIN last_week_avg lwa ON fma.USER = lwa.USER
Is there a better or more efficient way to do this?

I can see you've run into a few challenges along the way to your solution, and hopefully I can clear these up for you and also propose a cleaner way of reaching your solution.
Filtering on aggregates
As you've experienced first hand, SQL doesn't allow aggregates in the WHERE clause. You also cannot filter there on new columns created in the SELECT list, such as aggregates or CASE expressions, because WHERE is evaluated before those columns exist.
Fortunately there are ways around this, such as:
Making your main query a subquery, and then filtering on the result of that query, like below
Select * from (select *,count(that_good_stuff) as total_good_stuff from tasty_table group by 1,2,3) where total_good_stuff > 69
This works because the aggregate column (count) is no longer an aggregate at the time it's called in the where statement, it's in the result of the subquery.
Having clause
If a subquery isn't your cup of tea, you can use the HAVING clause straight after your GROUP BY, which acts like a WHERE clause except that it filters on aggregates.
This is better than resorting to a subquery in most cases, as it's more readable and I believe more efficient. Repeat the aggregate expression in HAVING rather than its alias, since not every engine accepts select-list aliases there.
select *,count(that_good_stuff) as total_good_stuff from tasty_table group by 1,2,3 having count(that_good_stuff) > 69
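To make the two options concrete, here is a small sketch that runs both forms against the same data and shows they return the same rows. It uses SQLite through Python's sqlite3 purely so it can be executed anywhere. The samples table, its columns, and the threshold of 2 are hypothetical, and Timestream syntax may differ in details.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (user_id TEXT, heart_rate INTEGER)")
conn.executemany(
    "INSERT INTO samples VALUES (?, ?)",
    [("a", 60), ("a", 70), ("a", 80), ("b", 90)],
)

# Option 1: aggregate in a subquery, then filter its result with WHERE.
subquery_sql = """
SELECT user_id, n
FROM (SELECT user_id, COUNT(*) AS n FROM samples GROUP BY user_id) AS counted
WHERE n > 2
"""

# Option 2: filter the groups directly with HAVING.
having_sql = """
SELECT user_id, COUNT(*) AS n
FROM samples
GROUP BY user_id
HAVING COUNT(*) > 2
"""

via_subquery = conn.execute(subquery_sql).fetchall()
via_having = conn.execute(having_sql).fetchall()
print(via_subquery, via_having)
```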
Finally, window functions are fantastic. They've helped me condense many queries by removing the need for subqueries or CTEs. If you could share some example raw data (remove any PII, of course), I'd be happy to share an example for your use case.
Nevertheless, hope this helps!
Tom

How to count Boolean changes in PostgreSQL

My table looks like this
The goal is to count how many times the actuator_state of a specific actuator (from the column actuator_name) changes in a period of time. Keep in mind that a specific actuator has several sub-actuators (for instance, Heater has Heater0, Heater1, etc.), and the goal is to count the total number of changes across Heater0 + Heater1 + Heater2 + Heater3 and so on. (Also, the name of the table is state_actuator.)
I tried this:
SELECT actuator_nome AS NOME,
SUM (DISTINCT CASE WHEN actuator_state.actuator AND DISTINCT actuator_state.actuator_time AND DISTINCT actuator_state.actuator_state THEN 1 ELSE 0) AS TROCAS_ESTADO
FROM actuator_state WHERE actuator_time BETWEEN '2020-05-17 16:58:54' AND '2020-05-17 17:09:58' AND actuator_name='Heater'
The result should be
Heater: 5;
(for instance Heater0 has changed 3 times and Heater1 two times and other Heaters 0 changes)

You can use window functions for this:
select
actuator_name,
count(*) filter(where actuator_state <> lag_actuator_state) no_changes
from (
select
t.*,
lag(actuator_state)
over(partition by actuator_name, actuator order by actuator_time) lag_actuator_state
from mytable t
where actuator_time between '2020-05-17 16:58:54' and '2020-05-17 17:09:58'
) t
group by actuator_name
The subquery uses lag() to retrieve the "previous" state of each actuator. The outer query then aggregates by actuator_name and performs a count that increments by 1 every time two consecutive values differ.
You can add additional filters in the where clause of the subquery as needed.
Note that this query does not count the first value in the period as a change. Only further changes are taken into account.
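The lag-then-compare idea can be tried out with the sketch below. It runs on SQLite (3.25 or later) via Python's sqlite3 rather than PostgreSQL, so it replaces the FILTER clause with SUM over a CASE expression. The readings are invented, and the column layout is an assumption based on the names used in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE state_actuator ("
    "actuator_name TEXT, actuator TEXT, actuator_time TEXT, actuator_state INTEGER)"
)
# Made-up readings: Heater0 flips twice, Heater1 flips once.
readings = [
    ("Heater", "Heater0", "2020-05-17 17:00:00", 0),
    ("Heater", "Heater0", "2020-05-17 17:01:00", 1),
    ("Heater", "Heater0", "2020-05-17 17:02:00", 1),
    ("Heater", "Heater0", "2020-05-17 17:03:00", 0),
    ("Heater", "Heater1", "2020-05-17 17:00:30", 1),
    ("Heater", "Heater1", "2020-05-17 17:01:30", 0),
]
conn.executemany("INSERT INTO state_actuator VALUES (?, ?, ?, ?)", readings)

# For the first reading of each actuator prev_state is NULL, so the <>
# comparison is NULL and the CASE falls through to 0 (not counted).
query = """
SELECT actuator_name,
       SUM(CASE WHEN actuator_state <> prev_state THEN 1 ELSE 0 END) AS changes
FROM (
    SELECT actuator_name,
           actuator_state,
           LAG(actuator_state) OVER (
               PARTITION BY actuator_name, actuator
               ORDER BY actuator_time
           ) AS prev_state
    FROM state_actuator
) AS with_prev
GROUP BY actuator_name
"""
change_counts = conn.execute(query).fetchall()
print(change_counts)
```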

You can use lag():
select actuator_name,
count(*) filter (where prev_as is distinct from actuator_state)
from (select sa.*,
lag(actuator_state) over (partition by actuator order by actuator_time) as prev_as
from state_actuator sa
) sa
where actuator_time between '2020-05-17 16:58:54' and '2020-05-17 17:09:58'
group by actuator_name;
You can filter on a particular name in the where clause as well.
Note that this counts the first appearance as a "change". It is not clear if that matches your intention.

SQL ranking effective dates

There may be a very simple way to do this, but I can't quite think of it. I have a dataset that returns a minimum job title and minimum effective date, then all effdts greater than the min_effdt. In order to use this data in a charting program, I would like to rank each successive effdt if it exists, as in Min Role Effdt, then 2nd, 3rd, Max. Of course there could be anywhere from 2 to 20 jobs per person.
At first I considered trying a case statement, but I don't think that works when analyzing two columns at once. Is there a SQL statement that will allow ranking? Right now my data looks like
Employee Number | Min Base Role | Min Role Effdt | Base Role | Role Effdt
and comes from two tables, with the 2nd table brought in twice to get the Role / Effdt as Min, then All greater than Min.
I am using ORACLE. Code is below:
SELECT DISTINCT AL4.FULL_NAME,
AL4.EMPLOYEE_NUMBER,
AL4.HIRE_DATE,
AL4.DATE_OF_BIRTH,
AL4.AGE,
AL4.TERM_DATE,
AL4.ETHNIC_ORIGIN,
AL2.RECORDVALUE AS MIN_BASE_ROLE,
AL3.RECORDVALUE AS BASE_ROLE,
AL3.EFFECTIVE_START_DATE AS "ROLE EFFECTIVE DATE",
AL2.EFFECTIVE_START_DATE AS "MIN ROLE EFFDT"
FROM T1 AL2,
T2 AL3,
T3 AL4
WHERE AL4.PERSON_ID = AL2.PERSON_ID
AND AL4.PERSON_ID = AL3.PERSON_ID
AND AL4.EMPLOYEE_NUMBER = AL2.HISL_ID
AND AL4.EMPLOYEE_NUMBER = AL3.HISL_ID
AND AL2.RECORDTYPE = 'BASE_ROLE'
AND AL3.RECORDTYPE = 'BASE_ROLE'
AND AL2.EFFECTIVE_START_DATE = (SELECT MIN(A.EFFECTIVE_START_DATE) from T1 A where A.person_id = al2.person_id and a.recordtype = al2.recordtype)
AND AL3.EFFECTIVE_START_DATE > AL2.EFFECTIVE_START_DATE
AND (AL4.TERM_DATE >= '01-JAN-2012' or AL4.TERM_DATE is NULL)
order by AL4.EMPLOYEE_NUMBER

The function that you are looking for is row_number(). I think the expression you want is:
row_number() over (partition by AL4.EMPLOYEE_NUMBER
order by AL3.EFFECTIVE_START_DATE
) as ranking
The function row_number() says "assign a sequential number to a group of rows". The partition by clause defines the group, where the numbering starts over again at 1. The order by clause specifies the ordering within the group.
Similar functions rank() and dense_rank() might also be useful. They differ in how they handle duplicate values.
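To see how the three functions differ on duplicate values, here is a small sketch with one employee who has two roles sharing an effective date. It runs on SQLite (3.25 or later) through Python's sqlite3 rather than Oracle, and the roles table and its data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE roles (employee_number TEXT, effdt TEXT)")
# Invented data: two of the four rows share the same effective date.
conn.executemany(
    "INSERT INTO roles VALUES (?, ?)",
    [
        ("E1", "2012-01-01"),
        ("E1", "2013-01-01"),
        ("E1", "2013-01-01"),
        ("E1", "2014-01-01"),
    ],
)

query = """
SELECT effdt,
       ROW_NUMBER() OVER w AS rn,
       RANK()       OVER w AS rnk,
       DENSE_RANK() OVER w AS dense_rnk
FROM roles
WINDOW w AS (PARTITION BY employee_number ORDER BY effdt)
ORDER BY effdt, rn
"""
rankings = conn.execute(query).fetchall()
for row in rankings:
    print(row)
```

row_number() always produces 1, 2, 3, 4. rank() gives tied rows the same number and then skips ahead (1, 2, 2, 4), while dense_rank() gives tied rows the same number without a gap (1, 2, 2, 3).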

Select finishes where athlete didn't finish first for the past 3 events

Suppose I have a database of athletic meeting results with a schema as follows
DATE,NAME,FINISH_POS
I wish to do a query to select all rows where an athlete has competed in at least three events without winning. For example with the following sample data
2013-06-22,Johnson,2
2013-06-21,Johnson,1
2013-06-20,Johnson,4
2013-06-19,Johnson,2
2013-06-18,Johnson,3
2013-06-17,Johnson,4
2013-06-16,Johnson,3
2013-06-15,Johnson,1
The following rows:
2013-06-20,Johnson,4
2013-06-19,Johnson,2
Would be matched. I have only managed to get started at the following stub:
select date,name FROM table WHERE ...;
I've been trying to wrap my head around the where clause but I can't even get a start

I think this can be even simpler / faster:
SELECT day, place, athlete
FROM (
SELECT *, min(place) OVER (PARTITION BY athlete
ORDER BY day
ROWS 3 PRECEDING) AS best
FROM t
) sub
WHERE best > 1
Uses the aggregate function min() as window function to get the minimum place of the last three rows plus the current one.
The then-trivial check for "no win" (best > 1) has to be done on the next query level, since window functions are applied after the WHERE clause. So you need at least one CTE or sub-select for a condition on the result of a window function.
Details about window function calls in the manual here. In particular:
If frame_end is omitted it defaults to CURRENT ROW.
If place (finishing_pos) can be NULL, use this instead:
WHERE best IS DISTINCT FROM 1
min() ignores NULL values, but if all rows in the frame are NULL, the result is NULL.
Don't use type names and reserved words as identifiers, I substituted day for your date.
This assumes at most 1 competition per day, else you have to define how to deal with peers in the time line or use timestamp instead of date.
@Craig already mentioned the index to make this fast.
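The windowed min() approach can be checked against the sample data from the question with the sketch below. It runs on SQLite (3.25 or later) via Python's sqlite3 instead of PostgreSQL, writes the frame out in full as BETWEEN 3 PRECEDING AND CURRENT ROW, and uses a hypothetical table name, results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (day TEXT, athlete TEXT, place INTEGER)")
# Sample rows copied from the question.
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?)",
    [
        ("2013-06-22", "Johnson", 2),
        ("2013-06-21", "Johnson", 1),
        ("2013-06-20", "Johnson", 4),
        ("2013-06-19", "Johnson", 2),
        ("2013-06-18", "Johnson", 3),
        ("2013-06-17", "Johnson", 4),
        ("2013-06-16", "Johnson", 3),
        ("2013-06-15", "Johnson", 1),
    ],
)

# The window function is computed in the inner query; the filter on its
# result has to sit one level up, in the outer WHERE.
query = """
SELECT day, place
FROM (
    SELECT day, place,
           MIN(place) OVER (
               PARTITION BY athlete
               ORDER BY day
               ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
           ) AS best
    FROM results
) AS sub
WHERE best > 1
ORDER BY day
"""
winless = conn.execute(query).fetchall()
print(winless)
```

Only 2013-06-19 and 2013-06-20 come back, which matches the rows the question expects. Note that an athlete's earliest rows have fewer than three predecessors, so they would also qualify if none of them were a win.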

Here's an alternative formulation that does the work in two scans without subqueries:
SELECT
"date", athlete, place
FROM (
SELECT
"date",
place,
athlete,
1 <> ALL (array_agg(place) OVER w) AS include_row
FROM Table1
WINDOW w AS (PARTITION BY athlete ORDER BY "date" ASC ROWS BETWEEN 3 PRECEDING AND CURRENT ROW)
) AS history
WHERE include_row;
See: http://sqlfiddle.com/#!1/fa3a4/34
The logic here is pretty much a literal translation of the question. Get the last four placements - current and the previous 3 - and return any rows in which the athlete didn't finish first in any of them.
Because the window frame is the only place where the number of rows of history to consider is defined, you can parameterise this variant unlike my previous effort (obsolete, http://sqlfiddle.com/#!1/fa3a4/31), so it works for the last n for any n. It's also a lot more efficient than the last try.
I'd be really interested in the relative efficiency of this vs @Andomar's query when executed on a dataset of non-trivial size. They're pretty much exactly the same on this tiny dataset. An index on Table1(athlete, "date") would be required for this to perform optimally on a large data set.

; with CTE as
(
select row_number() over (partition by athlete order by date) rn
, *
from Table1
)
select *
from CTE cur
where not exists
(
select *
from CTE prev
where prev.place = 1
and prev.athlete = cur.athlete
and prev.rn between cur.rn - 3 and cur.rn
)
Live example at SQL Fiddle.

Total Count in Grouped TSQL Query

I have a performance-heavy query that filters out many unwanted records based on data in other tables, etc.
I am averaging a column, and also returning the count for each average group. This is all working fine.
However, I would also like to include the percentage of the TOTAL count.
Is there any way of getting this total count without rerunning the whole query, or increasing the performance load significantly?
I would also prefer if I didn't need to completely restructure the sub query (e.g. by getting the total count outside of it), but can do if necessary.
SELECT
data.EquipmentId,
AVG(MeasureValue) AS AverageValue,
COUNT(data.*) AS BinCount,
COUNT(data.*)/ ???TotalCount??? AS BinCountPercentage
FROM
(SELECT * FROM MultipleTablesWithJoins) data
GROUP BY data.EquipmentId

See Window functions.
SELECT
data.EquipmentId,
AVG(MeasureValue) AS AverageValue,
COUNT(*) AS BinCount,
COUNT(*)/ cast (cnt as float) AS BinCountPercentage
FROM
(SELECT *,
-- Here is total count of records
count(*) over() cnt
FROM MultipleTablesWithJoins) data
GROUP BY data.EquipmentId, cnt
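Here is a runnable sketch of the same pattern: count(*) over() attaches the grand total to every row of the inner query, and the outer GROUP BY then divides each bin count by it. It uses SQLite (3.25 or later) through Python's sqlite3 rather than SQL Server, and a made-up measurements table stands in for MultipleTablesWithJoins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (equipment_id INTEGER, measure_value REAL)")
# Invented data: three rows for equipment 1, one row for equipment 2.
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [(1, 10.0), (1, 20.0), (1, 30.0), (2, 40.0)],
)

# cnt is identical on every row, so adding it to GROUP BY does not split groups.
# Multiplying by 1.0 forces floating-point division.
query = """
SELECT equipment_id,
       AVG(measure_value)   AS average_value,
       COUNT(*)             AS bin_count,
       COUNT(*) * 1.0 / cnt AS bin_count_percentage
FROM (
    SELECT *, COUNT(*) OVER () AS cnt
    FROM measurements
) AS data
GROUP BY equipment_id, cnt
ORDER BY equipment_id
"""
bins = conn.execute(query).fetchall()
for row in bins:
    print(row)
```

The inner query is scanned only once, which is the point of the approach: the total comes along with the rows instead of requiring a second pass.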

Another approach:
with data as
(
SELECT * FROM MultipleTablesWithJoins
)
,grand as
(
select count(*) as cnt from data
)
SELECT
data.EquipmentId,
AVG(MeasureValue) AS AverageValue,
COUNT(*) AS BinCount,
COUNT(*) / cast(grand.cnt as float) AS BinCountPercentage
FROM data cross join grand
GROUP BY data.EquipmentId, grand.cnt
