SQL AVG() function returning incorrect values - sql

I want to use the AVG function in sql to return a working average for some values (ie based on the last week not an overall average). I have two values I am calculating, weight and restingHR (heart rate). I have the following sql statements for each:
SELECT AVG( weight ) AS average
FROM stats
WHERE userid='$userid'
ORDER BY date DESC LIMIT 7
SELECT AVG( restingHR ) AS average
FROM stats
WHERE userid='$userid'
ORDER BY date DESC LIMIT 7
The value I get for weight is 82.56 but it should be 83.35
This is not a massive error and I'm rounding it when I use it so its not too big a deal.
However for restingHR I get 45.96 when it should be 57.57 which is a massive difference.
I don't understand why this is going so wrong. Any help is much appreciated.
Thanks

Use a subquery to separate selecting the rows from computing the average:
SELECT AVG(weight) average
FROM (SELECT weight
FROM stats
WHERE userid = '$userid'
ORDER BY date DESC
LIMIT 7) subq

It seems you want to filter your data with ORDER BY date DESC LIMIT 7, but you have to consider, that the ORDER BY clause takes effect after everything else is done. So your AVG() function considers all values of restingHR from your $userId, not just the 7 latest.
To overcome this...okay, Barmar just posted a query.

Related

MAX in Select statement not returning the highest value?

I have a question regarding the max-statement in a select -
Without the MAX-statemen i have this select:
SELECT stockID, DATE, close, symbol
FROM ta_stockprice JOIN ta_stock ON ta_stock.id = ta_stockprice.stockID
WHERE stockid = 8648
ORDER BY close
At the end i only want to have the max row for the close-column so i tried:
Why i didnĀ“t get date = "2021-07-02" as output?
(i saw that i allways get "2021-07-01" as output - no matter if i use MAX / MIN / AVG...)
The MAX() turns the query into an aggregation query. With no GROUP BY, it returns one row. But the query is syntactically incorrect, because it mixes aggregated and unaggregated columns.
Once upon a time, MySQL allowed such syntax in violation of the SQL Standard but returned values from arbitrary rows for the unaggreged columns.
Use ORDER BY to do what you want:
SELECT stockID, DATE, close, symbol
FROM ta_stockprice JOIN ta_stock ON ta_stock.id = ta_stockprice.stockID
WHERE stockid = 8648
ORDER BY close DESC
LIMIT 1;

partition big query LIMIT over date range

I'm quite new to SQL & big query so this might be simple. I'm running some queries on the public dataset GDELT in BQ and have a question regarding the LIMIT. GDELT is massive (14.4 TB) and when I query for something, in this case a person, I could get up to 100k rows of results or more which is this case is too much. But when I use LIMIT it seems like it does not really partition the results evenly over the dates, causing me to get very random timelines. How does limit work and is there a way to get the results more evenly based on days?
SELECT DATE,V2Tone,DocumentIdentifier as URL, Themes, Persons, Locations
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE DATE>=20210610000000 and _PARTITIONTIME >= TIMESTAMP(#start_date)
AND DATE<=20210818999999 and _PARTITIONTIME <= TIMESTAMP(#end_date)
AND LOWER(DocumentIdentifier) like #url_topic
LIMIT #limit
When running this query and doing some preproc, I get the following time series:
It's based on 15k results, but they are distributed very unevenly/randomly across the days (since there are over 500k results in total if I don't use limit). I would like to make a query that limits my output to 15k but partitions the data somewhat equally over the days.
you need to order by , when you are not sorting your result , the order of returned result is not guaranteed:
but if you are looking to get the same number of rows per day , you can use window functions:
select * from (
SELECT
DATE,
V2Tone,
DocumentIdentifier as URL,
Themes,
Persons,
Locations,
row_number() over (partition by DATE) rn
FROM `gdelt-bq.gdeltv2.gkg_partitioned`
WHERE
DATE >= 20210610000000 AND DATE <= 20210818999999
and _PARTITIONDATE >= #start_date and _PARTITIONDATE <= #end_date
AND LOWER(DocumentIdentifier) like #url_topic
) t where rn = #numberofrowsperday
if you are passing date only you can use _PARTITIONDATE to filter the partitions.

Aggregate function -Avg is not working in my sql query

In my query I need to display date and average age:
SELECT (SYSDATE-rownum) AS DATE,
avg((SYSDATE - rownum)- create_time) as average_Age
FROM items
group by (SYSDATE-rownum)
But my output for average age is not correct. It's simply calculating/displaying the output of (SYSDATE - rownum)- create_time but not calculating the average of them though I use: avg((SYSDATE - rownum)- create_time).
Can someone tell me why the aggregate function AVG is not working in my query and what might be the possible solution
In the select clause you are using both an non-aggregate expression as wel as an aggregate expression. By dropping the (SYSDATE-rownum) AS DATE statemant you would generate an outcome over the whole data set. In that way the avg is calculated over the whole data set ... and not just per single record retrieve.
Then you might drop the group by too. In the end you just keep the avg statement
SELECT avg((SYSDATE - rownum)- create_time) as average_Age
FROM items
First you need to think on rows or group on which you need avg. this column will come in group by clause. as a simple thing if there is 5 rows with age of 20, 10, 20, 30 then avg will be (80/4=20) i.e. 20. so I think you need to fist calculate age by (sysdate - create_time).
eg.select months_between(sysdate,create_date)/12 cal3 from your_table
and then there will be outer query for avg on group.

How can I make this query run efficiently?

In BigQuery, we're trying to run:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [Datastore.PerformanceDatum]
WHERE type = "MemoryPerf"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
which returns a relatively small amount of data. But we're getting the message:
Error: Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead. For more details, please see https://developers.google.com/bigquery/docs/query-reference#groupby
What is making this query fail, the size of the subquery? Is there some equivalent query we can do which avoids the problem?
Edit in response to comments: If I add GROUP EACH BY (and drop the outer ORDER BY), the query fails, claiming GROUP EACH BY is here not parallelizable.
I wrote an equivalent query that works for me:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, UTC_USEC_TO_DAY(dtimestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
If I run only the inner query, I get 3,660,624 results. Is your dataset bigger than that?
The outer select gives me only 4 results when grouped by day. I'll try a different grouping to see if I can hit a limit there:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, dtimestamp / 1000 as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
Runs too, now with 57,862 different groups.
I tried different combinations to get to the same error. I was able to get the same error as you doubling the amount of initial data. An easy "hack" to double the amount of data is changing:
FROM [io_sensor_data.moscone_io13]
To:
FROM [io_sensor_data.moscone_io13], [io_sensor_data.moscone_io13]
Then I get the same error. How much data do you have? Can you apply an additional filter? As you are already partitioning the percentile_rank by day, can you add an additional query to only analyze a fraction of the days (for example, only last month)?

Query return rows whose sum of column value match given sum

I have tables with:
id desc total
1 baskets 25
2 baskets 15
3 baskets 75
4 noodles 10
I would like to ask the query with output which the sum of total is 40.
The output would be like:
id desc total
1 baskets 25
2 baskets 15
I believe this will get you a list of the results you're looking for, but not with your example dataset because nothing in your example dataset can provide a total sum of 40.
SELECT id, desc, total
FROM mytable
WHERE desc IN (
SELECT desc
FROM mytable
GROUP BY desc
HAVING SUM(total) = 40
)
Select Desc,SUM(Total) as SumTotal
from Table
group by desc
having SUM(Total) > = 40
Not quite sure what you want, but this may get you started
SELECT `desc`, SUM(Total) Total
FROM TableName
GROUP BY `desc`
HAVING SUM(Total) = 40
From reading your question, it sounds like you want a query that returns any subset of of sums that represent a certain target value and have the same description.
There is no simple way to do this. This migrates into algorithmic territory.
Assuming I am correct in what you are after, group bys and aggregate functions will not solve your problem. SQL cannot indicate that a query should be performed on subsets of data until it exhaust all possible permutations and finds the Sums that match your requirements.
You will have to intermix an algorithm into your sql ... i.e a stored procedure.
Or simply get all the data from the database that fits the desc then perform your algorithm on it in code.
I recall there was a CS algorithmic class I took where this was a known Problem:
I believe you could just adapt working versions of this algorithm to solve your problem
http://en.wikipedia.org/wiki/Subset_sum_problem
select desc
from (select desc, sum(total) as ct group by desc)