AWS Timestream query to get average measure for the first month of samples

In AWS Timestream I am trying to get the average heart rate for the first month in which we received heart rate samples for a specific user, plus the average for the last week. I'm having trouble with the query for the "first month" part. When I try to use MIN(time) in the WHERE clause I get the error: WHERE clause cannot contain aggregations, window functions or grouping operations.
SELECT * FROM "DATABASE"."TABLE"
WHERE measure_name = 'heart_rate' AND time < min(time) + 30
If I add it as a column and try to query on the column, I get the error: Column 'first_sample_time' does not exist
SELECT MIN(time) AS first_sample_time FROM "DATABASE"."TABLE"
WHERE measure_name = 'heart_rate' AND time > first_sample_time
Also if I try to add to MIN(time) I get the error: line 1:18: '+' cannot be applied to timestamp, integer
SELECT MIN(time) + 30 AS first_sample_time FROM "DATABASE"."TABLE"
Here is what I finally came up with:
WITH first_month AS (
    SELECT
        min(time) AS creation_date,
        -- 2628000000 ms is roughly one month (30.4 days)
        from_milliseconds(to_milliseconds(min(time)) + 2628000000) AS end_of_first_month,
        USER
    FROM "DATABASE"."TABLE"
    WHERE USER = 'xxx'
      AND measure_name = 'heart_rate'
    GROUP BY USER
),
first_month_avg AS (
    SELECT
        avg(hm.measure_value::double) AS first_month_average,
        fm.USER
    FROM "DATABASE"."TABLE" hm
    JOIN first_month fm ON hm.USER = fm.USER
    WHERE measure_name = 'heart_rate'
      AND hm.time BETWEEN fm.creation_date AND fm.end_of_first_month
    GROUP BY fm.USER
),
last_week_avg AS (
    SELECT
        avg(measure_value::double) AS last_week_average,
        USER
    FROM "DATABASE"."TABLE"
    WHERE measure_name = 'heart_rate'
      AND time > ago(14d)
      AND USER = 'xxx'
    GROUP BY USER
)
SELECT
    lwa.last_week_average,
    fma.first_month_average,
    lwa.USER
FROM first_month_avg fma
JOIN last_week_avg lwa ON fma.USER = lwa.USER
Is there a better or more efficient way to do this?

I can see you've run into a few challenges along the way, and hopefully I can clear these up for you and also propose a cleaner way of reaching your goal.
Filtering on aggregates
As you've experienced first-hand, SQL doesn't allow aggregates in the WHERE clause, and you also cannot filter on new columns you've created in the SELECT clause, such as aggregates or CASE expressions, because those columns/results are not present in the table you're querying.
Fortunately there are ways around this, such as:
Making your main query a subquery, and then filtering on the result of that query, like below
Select *
from (
    select *, count(that_good_stuff) as total_good_stuff
    from tasty_table
    group by 1, 2, 3
) sub
where total_good_stuff > 69
This works because the aggregate column (the count) is no longer an aggregate by the time it's referenced in the WHERE clause; it's just an ordinary column in the subquery's result.
Having clause
If a subquery isn't your cup of tea, you can use the HAVING clause straight after your GROUP BY, which acts like a WHERE clause except that it is evaluated after aggregation (note that standard SQL requires repeating the aggregate expression in HAVING rather than referencing its alias).
This is better than resorting to a subquery in most cases, as it's more readable and I believe more efficient.
select *, count(that_good_stuff) as total_good_stuff
from tasty_table
group by 1, 2, 3
having count(that_good_stuff) > 69
Finally, window functions are fantastic... they've really helped condense many queries I've made in the past by removing the need for subqueries/CTEs. If you could share some example raw data (with any PII removed, of course), I'd be happy to share an example for your use case.
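In the meantime, here is a rough sketch of that window-function approach for your case, reusing the names and the millisecond arithmetic from your question (untested, so treat it as a starting point rather than a definitive implementation):
-- Sketch: compute each user's first sample time as a window value,
-- then average conditionally, in a single pass over the table.
WITH samples AS (
    SELECT
        USER,
        time,
        measure_value::double AS heart_rate,
        min(time) OVER (PARTITION BY USER) AS first_sample_time
    FROM "DATABASE"."TABLE"
    WHERE measure_name = 'heart_rate'
      AND USER = 'xxx'
)
SELECT
    USER,
    -- 2628000000 ms is ~1 month, as in the question
    avg(CASE WHEN time <= from_milliseconds(to_milliseconds(first_sample_time) + 2628000000)
             THEN heart_rate END) AS first_month_average,
    avg(CASE WHEN time > ago(7d) THEN heart_rate END) AS last_week_average
FROM samples
GROUP BY USER
avg() ignores the NULLs produced by the CASE expressions, so each average covers only its own time window.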
Nevertheless, hope this helps!
Tom

Related

counts' division doesn't work in full code

I have a problem with a task, because my division result is different when I run it alone and when I run it as part of the full query. Let's say I run this code:
SELECT (count(paimta))::numeric / count(distinct paimta) as average
FROM Stud.Egzempliorius;
and the number I finally get is 2.(6)7, but when I use it in the full query, which is:
SELECT Stud.Egzempliorius.Paimta, COUNT(PAIMTA) as PaimtaKnyga
FROM Stud.Skaitytojas, Stud.Egzempliorius
WHERE Stud.Skaitytojas.Nr=Stud.Egzempliorius.Skaitytojas
GROUP BY Stud.Egzempliorius.Paimta
HAVING count(paimta) > (count(paimta))::numeric / count(distinct paimta);
its value changes, because the division no longer takes effect; let's say instead of having
HAVING count(paimta) > (count(paimta))::numeric / count(distinct paimta);
my query effectively turns into
HAVING count(paimta) > (count(paimta))::numeric;
and these values are equal, so I can't get the final answer. This is the database I'm using: https://klevas.mif.vu.lt/~baronas/dbvs/biblio/show-table.php?table=Stud.Egzempliorius
I've been struggling with this for 10 hours now and have finally lost my patience... So my question is: what do I have to do so that this code:
SELECT (count(paimta))::numeric / count(distinct paimta) as average
FROM Stud.Egzempliorius;
keeps its value in the full query?
Your solution fails because the two queries operate on different groups of rows: the first query does its computation over the whole dataset, while the second one groups by paimta.
One option would have been to use window functions but, as far as I know, Postgres does not support count(distinct) as a window function.
I think the simplest approach is a scalar subquery:
select e.paimta, count(paimta) as paimtaknyga
from stud.skaitytojas s
inner join stud.egzempliorius e on s.nr = e.skaitytojas
group by e.paimta
having count(paimta) > (
select (count(paimta))::numeric / count(distinct paimta) from stud.egzempliorius
)
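As noted above, Postgres rejects count(distinct ...) as a window function. For reference, the usual workaround is the dense_rank trick; a minimal sketch on a hypothetical table t with a column x (it assumes x has no NULLs, which would otherwise be counted as one extra group):
-- Sketch: emulate count(distinct x) over () with dense_rank.
select x,
       max(dr) over () as distinct_x_count
from (
    select x,
           dense_rank() over (order by x) as dr
    from t
) s;
The highest dense_rank equals the number of distinct values, which you can then compare against per-group counts without a second scan of the table.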

Grouping a percentage calculation in postgres/redshift

I keep running into the same problem over and over again; hoping someone can help...
I have a large table with a category column that has 28 entries for donkey breed. I'm counting two specific values grouped by each of those categories in CTEs like this:
WITH totaldonkeys AS (
    SELECT donkeybreed,
           COUNT(*) AS total
    FROM donkeytable1
    GROUP BY donkeybreed
),
sickdonkeys AS (
    SELECT donkeybreed,
           COUNT(*) AS totalsick
    FROM donkeytable1
    JOIN donkeyhealth ON donkeytable1.donkeyid = donkeyhealth.donkeyid
    WHERE donkeyhealth.sick IS TRUE
    GROUP BY donkeybreed
),
My goal is to end up with a table that has, primarily, the percentage of sick donkeys for each breed, but I always end up struggling with the problem of not being able to GROUP BY without using an aggregate function, which I can't do here:
SELECT (CAST(sickdonkeys.totalsick AS float) / totaldonkeys.total) * 100 AS percentsick,
totaldonkeys.donkeybreed
FROM totaldonkeys, sickdonkeys
GROUP BY totaldonkeys.donkeybreed
When I run this I end up with 28 results for each breed of donkey; one of them is correct, I believe, but that still leaves hundreds of useless data points.
I know I'm probably being really dumb here, but I keep hitting this same problem again and again with new donkey data. I should obviously be structuring the whole thing a new way, because you just can't do this final query without an aggregate function; I think I must be missing something significant.
You can easily count the proportion of donkeys that are sick using the donkeyhealth table:
SELECT d.donkeybreed,
AVG( (dh.sick)::int ) AS proportion_sick
FROM donkeytable1 d JOIN
donkeyhealth dh
ON d.donkeyid = dh.donkeyid
GROUP BY d.donkeybreed
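Since the stated goal was a percentage rather than a proportion, the same idea scaled by 100 should do it (a minimal, untested variation on the query above):
-- Casting sick to int turns TRUE/FALSE into 1/0, so AVG gives the
-- sick fraction; multiplying by 100.0 yields the percentage.
SELECT d.donkeybreed,
       100.0 * AVG((dh.sick)::int) AS percentsick
FROM donkeytable1 d
JOIN donkeyhealth dh ON d.donkeyid = dh.donkeyid
GROUP BY d.donkeybreed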

Google Big Query: New Column of Aggregate Based On Condition of Current Row

Using the Google Big Query database bigquery-public-data.crypto_ethereum_classic.transactions as reference.
For each transaction row, I want to calculate the count of all earlier transactions to the same address, and the sum of their gas usage. I am sure I can do this with a join, and Google accepts my old query, but the (inner) join produces so much data that I almost always get a "quota limit exceeded" error. At the same time, I think a subquery solution is inefficient, since both aggregate functions would be querying almost the same thing.
In a perfect world the query would use something like a join to create a temporary table with all columns I need (transaction_hash, receipt_gas_used, to_address, block_timestamp), according to the conditions (where to_address = table_1.to_address and block_timestamp < table_1.block_timestamp), where I can then perform the aggregate functions on the columns of that table.
What I have so far and what I'm looking for is something like...:
SELECT
    table_1.*,
    COUNT(
        DISTINCT IF(block_timestamp < table_1.block_timestamp and to_address = table_1.to_address, `hash`, NULL)
    ) as txn_count,
    SUM(
        IF(block_timestamp < table_1.block_timestamp and to_address = table_1.to_address, `receipt_gas_used`, NULL)
    ) as total_gas_used
from
    `bigquery-public-data.crypto_ethereum_classic.transactions` as table_1
where block_number >= 3000000 and block_number <= 3500000  # just to subset the data a bit
I think you want window functions:
select t.*,
       row_number() over (partition by to_address order by block_timestamp) as txn_seqnum,
       sum(receipt_gas_used) over (partition by to_address order by block_timestamp) as total_gas_used
from `bigquery-public-data.crypto_ethereum_classic.transactions` as t
where block_number >= 3000000 and block_number <= 3500000  # just to subset the data a bit
If you really have ties and need the distinct, then use dense_rank() instead of row_number().
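One caveat: the running sum above includes the current row, whereas the question asks for strictly earlier transactions. A hedged refinement (untested) is an explicit window frame that stops one row before the current one:
-- Sketch: count and gas total over strictly earlier rows only.
-- Over an empty frame, COUNT(*) returns 0 and SUM returns NULL.
SELECT
    t.*,
    COUNT(*) OVER w AS txn_count,
    SUM(receipt_gas_used) OVER w AS total_gas_used
FROM `bigquery-public-data.crypto_ethereum_classic.transactions` AS t
WHERE block_number >= 3000000 AND block_number <= 3500000
WINDOW w AS (
    PARTITION BY to_address
    ORDER BY block_timestamp
    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
)
With ties on block_timestamp, a ROWS frame splits them arbitrarily, so if exact tie handling matters the dense_rank() suggestion still applies.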

Subqueries and AVG() on a subtraction

Working on a query to return the average time between when an employee begins his/her shift and when he/she arrives at the first home (this DB assumes they are salesmen).
What I have:
SELECT l.OFFICE_NAME, crew.EMPLOYEE_NAME, -- avg(first arrival time) goes here
FROM LOCAL_OFFICE l, CREW_WORK_SCHEDULE crew
WHERE l.LOCAL_OFFICE_ID = crew.LOCAL_OFFICE_ID
You can see the AVG() is commented out. I know the time they start work and the time they arrive at the first house, and I can find the difference using this:
(SELECT MIN(c.ARRIVE)
FROM ORDER_STATUS c
WHERE c.USER_ID = crew.CREW_ID)
-(SELECT START_TIME
FROM CREW_SHIFT_CODES
WHERE WORK_SHIFT_CODE = crew.WORK_SHIFT_CODE)
Would the best way be to simply put the above into the AVG() parentheses? Just trying to learn the best methods for building queries. If you want more info on any of the tables, etc., just ask, but hopefully they're all named clearly enough that you know what they return.
As per my comment, the example you gave would pass only one record to the AVG function, and so wouldn't do very much.
If the sub-query was returning multiple records, however, your suggestion of placing the sub-query inside the AVG() would work...
SELECT
AVG((SELECT MIN(sub.val) FROM sub WHERE sub.id = main.id GROUP BY sub.group))
FROM
main
GROUP BY
main.group
(Averaging a set of minima, and so requiring two levels of GROUP BY.)
In many cases this gives good performance, and is maintainable. But sometimes the sub-query grows large, and it can be better to reformat it using an inline view...
SELECT
main.group,
AVG(sub_query.val)
FROM
main
INNER JOIN
(
SELECT
sub.id,
sub.group,
MIN(sub.val) AS val
FROM
sub
GROUP BY
sub.id,
sub.group
)
AS sub_query
ON sub_query.id = main.id
GROUP BY
main.group
Note: although this looks as though the inline view will calculate a lot of values that are not needed (and so be inefficient), most RDBMSs optimise this so that only the required records get processed. (The optimiser knows how the inner query is being used by the outer query, and builds the execution plan accordingly.)
Don't think in terms of subqueries: they're often quite slow. In effect, they are row-by-row (RBAR) operations rather than set-based. Instead:
join all the tables together
use a derived table to calculate the 1st arrival time
aggregate
Something like:
SELECT
    l.OFFICE_NAME, crew.EMPLOYEE_NAME,
    AVG(os.minARRIVE - cs.START_TIME)
FROM
    LOCAL_OFFICE l
JOIN
    CREW_WORK_SCHEDULE crew ON l.LOCAL_OFFICE_ID = crew.LOCAL_OFFICE_ID
JOIN
    CREW_SHIFT_CODES cs ON cs.WORK_SHIFT_CODE = crew.WORK_SHIFT_CODE
JOIN
    (SELECT MIN(ARRIVE) AS minARRIVE, USER_ID
     FROM ORDER_STATUS
     GROUP BY USER_ID
    ) os ON os.USER_ID = crew.CREW_ID
GROUP BY
    l.OFFICE_NAME, crew.EMPLOYEE_NAME
This probably won't give correct data because of the minARRIVE grouping: there isn't enough info in ORDER_STATUS to show "which day" or "which shift". It's simply "first arrival for that user for all time".
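If ORDER_STATUS records a full datetime, one hypothetical refinement (SQL Server syntax, untested) is to take the first arrival per user per calendar day, so each day's shift contributes its own minimum:
-- Sketch: one minimum arrival per user per day.
SELECT USER_ID,
       CAST(ARRIVE AS date) AS arrive_day,
       MIN(ARRIVE) AS minARRIVE
FROM ORDER_STATUS
GROUP BY USER_ID, CAST(ARRIVE AS date)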
Edit:
This will give you the average in minutes. You can turn it back into a time with DATEADD, or into hh:mm with some %60 (modulo) and /60 (integer divide) arithmetic.
AVG(
    DATEDIFF(minute, cs.START_TIME, os.minARRIVE)
)
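For illustration, a minimal sketch of that hh:mm conversion (T-SQL, with a hypothetical hard-coded value):
-- Sketch: format an average-minutes value as hh:mm using /60 and %60.
DECLARE @avg_minutes int = 95;
SELECT CAST(@avg_minutes / 60 AS varchar(5)) + ':' +
       RIGHT('0' + CAST(@avg_minutes % 60 AS varchar(2)), 2) AS avg_hhmm;  -- 1:35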

Aggregate SQL Function to grab only the first from each group

I have 2 tables - an Account table and a Users table. Each account can have multiple users. I have a scenario where I want to execute a single query/join against these two tables, but I want all the Account data (Account.*) and only the first set of user data (specifically their name).
Instead of doing a "min" or "max" on my aggregated group, I wanted to do a "first". But, apparently, there is no "First" aggregate function in T-SQL.
Any suggestions on how to go about getting this query? Obviously, it is easy to get the cartesian product of Account x Users:
SELECT User.Name, Account.* FROM Account, User
WHERE Account.ID = User.Account_ID
But how might I go about getting only the first user from that product, based on the order of their User.ID?
Rather than grouping, go about it like this...
select *
from account a
join (
    select
        account_id,
        row_number() over (order by account_id, id)
            - rank() over (order by account_id) as row_num
    from user
) first on first.account_id = a.id and first.row_num = 0
I know my answer is a bit late, but it might help others. There is a way to achieve First() and Last() in SQL Server, and here it is:
Stuff(Min(Convert(Varchar, DATE_FIELD, 126) + Convert(Varchar, DESIRED_FIELD)), 1, 23, '')
Use Min() for First() and Max() for Last(). The DATE_FIELD should be the date that determines whether it is the first or last record. The DESIRED_FIELD is the field whose first or last value you want. What it does is:
Add the date in ISO format at the start of the string (23 characters long)
Append the DESIRED_FIELD to that string
Get the MIN/MAX value for that string (since it starts with the date, you will get the first or last record)
STUFF that concatenated string to remove the first 23 characters (the date part)
Here you go!
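A hypothetical walk-through of what the intermediate strings look like (table and values invented for illustration):
-- Two hypothetical rows (DATE_FIELD, DESIRED_FIELD):
--   2019-03-01T08:00:00.000  'Alice'  ->  '2019-03-01T08:00:00.000Alice'
--   2019-03-02T09:30:00.000  'Bob'    ->  '2019-03-02T09:30:00.000Bob'
-- ISO dates sort chronologically as text, so MIN picks the 'Alice'
-- string, and STUFF(..., 1, 23, '') strips the 23-character prefix.
SELECT Stuff(Min(Convert(Varchar, DATE_FIELD, 126) + Convert(Varchar, DESIRED_FIELD)), 1, 23, '') AS FirstValue
FROM SOME_TABLE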
EDIT: I ran into a problem with the first formula: when the DATE_FIELD has .000 as its milliseconds, SQL Server converts the date to a string with no milliseconds at all, thus removing the first 4 characters from the DESIRED_FIELD. I simply changed the format to "20" (without milliseconds) and it all works great. The only downside is that if two records were created within the same second, the sort can be messy... in which case you can revert to format "126".
Stuff(Max(Convert(Varchar, DATE_FIELD, 20) + Convert(Varchar, DESIRED_FIELD)), 1, 19, '')
EDIT 2: My original intent was to return the last (or first) NON-NULL row. I was asked how to return the last or first row, whether it is null or not. Simply add an IsNull to the DESIRED_FIELD. When you concatenate two strings with the + operator and one of them is NULL, the result is NULL. So use the following:
Stuff(Max(Convert(Varchar, DATE_FIELD, 20) + IsNull(Convert(Varchar, DESIRED_FIELD), '')), 1, 19, '')
Select *
From Accounts a
Left Join (
Select u.*,
row_number() over (Partition By u.AccountKey Order By u.UserKey) as Ranking
From Users u
) as UsersRanked
on UsersRanked.AccountKey = a.AccountKey and UsersRanked.Ranking = 1
This can be simplified by using the Partition By clause. In the above, if an account has three users, the subquery numbers them 1, 2, and 3, and for a different AccountKey it resets the numbering. This means that for each unique AccountKey there will always be a 1, and potentially 2, 3, 4, etc.
Thus you filter on Ranking=1 to grab the first from each group.
This will give you one row per account, and if there is at least one user for that account, it will give you the user with the lowest key (because I use a left join, you will always get an account listing even if no user exists). Replace Order By u.UserKey with another field if you prefer that the first user be chosen alphabetically or by some other criterion.
I've benchmarked all the methods; the simplest and fastest way to achieve this is with OUTER/CROSS APPLY:
SELECT u.Name, Account.*
FROM Account
OUTER APPLY (SELECT TOP 1 *
             FROM [User]
             WHERE Account.ID = Account_ID
             ORDER BY ID) AS u
CROSS APPLY works just like INNER JOIN and fetches only the rows where the two tables are related, while OUTER APPLY works like LEFT OUTER JOIN and fetches all rows from the left table (Account here).
You can use OUTER APPLY, see documentation.
SELECT User1.Name, Account.* FROM Account
OUTER APPLY
(SELECT TOP 1 Name
FROM [User]
WHERE Account.ID = [User].Account_ID
ORDER BY Name ASC) User1
SELECT (SELECT TOP 1 Name
FROM User
WHERE Account_ID = a.AccountID
ORDER BY UserID) [Name],
a.*
FROM Account a
The STUFF response from Dominic Goulet is slick. But, if your DATE_FIELD is SMALLDATETIME (instead of DATETIME), then the ISO 8601 length will be 19 instead of 23 (because SMALLDATETIME has no milliseconds) - so adjust the STUFF parameter accordingly or the return value from the STUFF function will be incorrect (missing the first four characters).
First and Last do not exist in SQL Server 2005 or 2008, but in SQL Server 2012 there are First_Value and Last_Value functions. I tried to implement the aggregates First and Last for SQL Server 2005 and hit the obstacle that SQL Server does not guarantee that the aggregate is calculated in a defined order. (See the SqlUserDefinedAggregateAttribute.IsInvariantToOrder property, which is not implemented.) This might be because the query analyser tries to execute the calculation of the aggregate on multiple threads and combine the results, which speeds up the execution but does not guarantee an order in which elements are aggregated.
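For the SQL Server 2012+ route, here is a hedged sketch of FIRST_VALUE using the question's tables (it is a window function rather than an aggregate, so DISTINCT collapses the per-user duplicates):
-- Sketch: pair each account with the name of its lowest-ID user.
SELECT DISTINCT
    a.*,
    FIRST_VALUE(u.Name) OVER (PARTITION BY a.ID ORDER BY u.ID) AS FirstUserName
FROM Account a
JOIN [User] u ON u.Account_ID = a.ID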
Define "First". What you think of as first is a coincidence that normally has to do with clustered index order but should not be relied on (you can contrive examples that break it).
You are right not to use MAX() or MIN(). While tempting, consider the scenario where the first name and the last name are in separate fields: you might get names from different records.
Since it sounds like all you really care about is getting exactly one arbitrary record for each group, what you can do is just MIN or MAX an ID field for that record, and then join the table back into the query on that ID.
There are a number of ways of doing this; here's a quick and dirty one:
Select (SELECT TOP 1 U.Name FROM Users U WHERE U.Account_ID = A.ID) AS Name,
A.*
FROM Account A
(Slightly off-topic, but) I often run aggregate queries to list exception summaries, and then I want to know WHY a customer is in the results, so I use MIN and MAX to give two semi-random samples that I can look at in detail, e.g.:
SELECT Customer.Id, COUNT(*) AS ProblemCount
, MIN(Invoice.Id) AS MinInv, MAX(Invoice.Id) AS MaxInv
FROM Customer
INNER JOIN Invoice on Invoice.CustomerId = Customer.Id
WHERE Invoice.SomethingHasGoneWrong=1
GROUP BY Customer.Id
Create and join a subselect 'firstUser' that returns the first user for each account:
SELECT User.Name, Account.*
FROM Account, User,
(select min(user.id) id,account_id from User group by user.account_id) as firstUser
WHERE Account.ID = User.Account_ID
and User.id = firstUser.id and Account.ID = firstUser.account_id