SQL - calculate ratio and get max ratio with corresponding user and date details

I have a table with user, date, and a column each for messages sent and messages received.
I want to get, for each date, the maximum of messages_sent/messages_received, along with the user who has that ratio. So this is the output I expect:
Andrew Lean 10/2/2020 10
Andrew Harp 10/1/2020 6
This is my query:
SELECT ds.date, ds.user_name, MAX(ds.ratio)
FROM (
    SELECT a.user_name, a.date, a.message_sent / a.message_received AS ratio
    FROM messages a
    GROUP BY a.user_name, a.date
) ds
GROUP BY ds.date
But the output I get is:
Andrew Lean 10/2/2020 10
Jalinn Kim 10/1/2020 6
In the above output, 6 is the correct max ratio for that date, but the user is wrong. What am I doing wrong?

With a recent version of most databases, you could do something like this. The underlying problem with your query is that user_name in the outer SELECT is neither grouped nor aggregated, so the database either rejects the query or (depending on the SQL mode) returns an arbitrary user for each date; the fix is to rank rows within each date and keep the top one.
This assumes, as in your data, that there's one row per user per day. If you have more rows per user per day, you'll need to provide a little more detail about how to combine them or ignore some rows. You might want to SUM them; it's tough to know.
WITH cte AS (
select a.user_name, a.date
, a.message_sent / a.message_received AS ratio
, ROW_NUMBER() OVER (PARTITION BY a.date ORDER BY a.message_sent / a.message_received DESC) as rn
from messages a
)
SELECT t.user_name, t.date, t.ratio
FROM cte AS t
WHERE t.rn = 1
;
Note: there's no attempt to handle ties, where more than one user has the same ratio. We could use RANK() (or other methods) for that, if your database supports it.
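For example, a minimal sketch of the tie-friendly variant (same hypothetical messages table as above): RANK() assigns equal ranks to equal ratios, so a date with two top users returns both rows.
WITH cte AS (
    SELECT a.user_name, a.date
         , a.message_sent / a.message_received AS ratio
         , RANK() OVER (PARTITION BY a.date ORDER BY a.message_sent / a.message_received DESC) AS rnk
    FROM messages a
)
SELECT t.user_name, t.date, t.ratio
FROM cte AS t
WHERE t.rnk = 1
;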

Here, I am just calculating the ratio for each row in the first CTE.
In the second part, I am getting the maximum of the ratios calculated in the first part at the date level. This assumes each user has one row for each date.
The MAX() function at the date level ensures that we always get the highest ratio for each date.
There could be ties between the ratios; for those we can use ROW_NUMBER() or RANK() to assign a rank to each row based on whatever tie-breaking criteria we like, and then filter on the rank generated.
with data as (
    select
        date,
        user_id,
        messages_sent / messages_received as ratio
    from [table name]
)
select
    date,
    max(ratio) as highest_ratio_per_date
from data
group by 1
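If you also need the user who produced that maximum, one hedged sketch (assuming the same data CTE and at most one row per user per date) joins the per-date maximum back to the ratios:
with data as (
    select date, user_id, messages_sent / messages_received as ratio
    from [table name]
),
maxes as (
    select date, max(ratio) as highest_ratio
    from data
    group by 1
)
select d.date, d.user_id, d.ratio
from data d
join maxes m
  on m.date = d.date
 and m.highest_ratio = d.ratio;
Ties still come back as one row per tied user here; the ROW_NUMBER()/RANK() approach above gives you explicit control over that.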

Related

Function to REPLACE last previous known value for NULL

I want to fill the NULL values with the last given value for that column. A small sample of the data (dt, country, cnt):
2021-08-15 Bulgaria 1081636
2021-08-16 Bulgaria 1084693
2021-08-17 Bulgaria 1089066
2021-08-18 Bulgaria NULL
2021-08-19 Bulgaria NULL
In this example, the NULL values should be 1089066 until I reach the next non-NULL value.
I tried the answer given in this response, but to no avail. Any help would be appreciated, thank you!
EDIT: Sorry, I got so sidetracked with trying to return the last value that I forgot my ultimate goal, which is to replace the NULL values with the previous known value.
Therefore the query should be
UPDATE covid_data
SET people_vaccinated = ISNULL(?)
Assuming the number you have is always increasing, you can use MAX aggregate over a window:
SELECT dt
, country
, cnt
, MAX(cnt) OVER (PARTITION BY country ORDER BY dt)
FROM #data
If the number may decrease, the query becomes a little more complex: we first need to mark the rows that have NULLs as belonging to the same group as the last row without a NULL:
SELECT dt
, country
, cnt
, SUM(cnt) OVER (PARTITION BY country, partition)
FROM (
SELECT country
, dt
, cnt
, SUM(CASE WHEN cnt IS NULL THEN 0 ELSE 1 END) OVER (PARTITION BY country ORDER BY dt) AS partition
FROM #data
) AS d
ORDER BY dt
Here's a working demo on dbfiddle; it returns the same data with an ever-increasing amount, but if you change the number for 08-17 to be lower than that of 08-16, you'll see the MAX(...) method producing wrong results.
In many datasets it is incorrect to make assumptions about the behaviour of the underlying data. If your goal is simply to fill the blanks that might appear mid-way in a dataset, then the answer to the post you referenced, A: sql server nulls duplicate last known value in table, is still one of the best solutions. Here is an adaptation:
SELECT dt
, country
, cnt
, ISNULL(source.cnt, excludeNulls.LastCnt)
FROM #data source
OUTER APPLY ( SELECT TOP 1 cnt as LastCnt
FROM #data
WHERE dt < source.dt
AND cnt IS NOT NULL
ORDER BY dt desc) ExcludeNulls
ORDER BY dt
MAX and LAST_VALUE will give you a value with respect to the entire record set, which would not work with the existing solutions if you had a value for 2021-08-19: in that case the last value would be used to fill the gaps, not the previous non-NULL value.
When we need to fill in gaps that occur part-way through the results, we need to apply a filter to the window query. TOP 1 ... ORDER BY gives us the ability to filter and sort on entirely different fields from the one we want to capture, and it also means we can return the last value for fields that are not numeric. See this fiddle for a few other examples: https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=372285d29f97dbb9663e8552af6fb7a2
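As an aside: on databases that support IGNORE NULLS on window functions (Oracle, Snowflake, SQL Server 2022 and later; not the SQL Server 2019 used in the fiddles above), the gap-filling can be written directly with LAST_VALUE. A minimal sketch against the same #data table:
SELECT dt
     , country
     , cnt
     , LAST_VALUE(cnt) IGNORE NULLS OVER (
           PARTITION BY country
           ORDER BY dt
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS cnt_filled
FROM #data
ORDER BY dt;
The explicit ROWS frame matters: the default RANGE frame treats rows with equal dt as peers, while ROWS guarantees we only look backwards, row by row, for the last non-NULL value.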

Get count of records of the top 3 rows and compare the counts

In SQL Server 2016, I have a query as such:
SELECT [Report_date], COUNT(DISTINCT indv_id)
FROM [dbo].[STG_TABLE]
GROUP BY report_date
ORDER BY report_date DESC
I get the results as below:
Report_date (No column name)
2020-08-21 47918
2020-08-12 968065
2020-07-31 977804
Now I want to compare the difference between the counts in each row. If the difference is more than 10%, then I need to send an email out in the SSIS package.
How can I go through each row and calculate the difference? I want to look at the first row and compare it with the second row.
Your question seems to be about calculating the ratios between rows. For that, use LAG(). To get the ratio:
SELECT [Report_date], COUNT(DISTINCT indv_id),
       (COUNT(DISTINCT indv_id) * 1.0 / LAG(COUNT(DISTINCT indv_id)) OVER (ORDER BY report_date))
FROM [dbo].[STG_TABLE]
GROUP BY report_date
ORDER BY report_date DESC;
I'm not sure what results you want, but this is the basic information.
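To get from the ratio to the 10% alert the question describes, a hedged sketch (the threshold and column names are illustrative, and a prev_cnt of 0 would need guarding in real data) flags rows whose count moved more than 10% from the previous report; the SSIS package can then send the email whenever any row is flagged:
WITH counts AS (
    SELECT [Report_date], COUNT(DISTINCT indv_id) AS cnt
    FROM [dbo].[STG_TABLE]
    GROUP BY report_date
)
SELECT [Report_date]
     , cnt
     , prev_cnt
     , CASE WHEN ABS(cnt - prev_cnt) * 1.0 / prev_cnt > 0.10 THEN 1 ELSE 0 END AS exceeds_10_pct
FROM (
    SELECT [Report_date], cnt
         , LAG(cnt) OVER (ORDER BY [Report_date]) AS prev_cnt
    FROM counts
) t
WHERE prev_cnt IS NOT NULL
ORDER BY [Report_date] DESC;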

How Can I Retrieve The Earliest Date and Status Per Each Distinct ID

I have been trying to write a query for this, but can't seem to do the trick because I am still receiving duplicates. Hoping I can get help to fix this issue.
SELECT DISTINCT
    1.Client,
    1.ID,
    1.Thing,
    1.Status,
    MIN(1.StatusDate) AS 'statdate'
FROM
    SAMPLE 1
WHERE
    []
GROUP BY
    1.Client,
    1.ID,
    1.Thing,
    1.Status
My output is as follows
Client Id Thing Status Statdate
CompanyA 123 Thing1 Approved 12/9/2019
CompanyA 123 Thing1 Denied 12/6/2019
So although the query is doing what I asked and showing the minimum status date per status, I want only the first status date per ID. I have about 30k rows to filter through, so whatever the approach, it should not overload the query and stop it from running. Any help would be appreciated.
Use window functions:
SELECT s.*
FROM (SELECT s.*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY statdate) as seqnum
FROM SAMPLE s
WHERE []
) s
WHERE seqnum = 1;
This returns the first row for each id.
Use whichever of these you feel more comfortable with/understand:
SELECT
*
FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY statusdate) as rn
FROM sample
WHERE ...
) x
WHERE rn = 1
The way that one works is to number all rows sequentially in order of StatusDate, restarting the numbering from 1 every time ID changes. If you then collect all the number 1's together, you have your set of "first records".
Or you can use a MIN subquery joined back to the table:
SELECT
*
FROM
sample s
INNER JOIN
(SELECT ID, MIN(statusDate) as minDate FROM sample WHERE ... GROUP BY ID) mins
ON s.ID = mins.ID and s.StatusDate = mins.MinDate
WHERE
...
This one prepares a list of all the IDs with their min date, then joins it back to the main table, so you get back all the data that was lost during the grouping operation. You cannot simultaneously "keep data" and "throw away data" during a group: if you group by more than just ID, you get more groups (as you have found), and if you only group by ID, you lose the other columns. There isn't any way to say "GROUP BY id, take the MIN date, AND also take all the other data from the same row as the min date" without doing "group by id, take min date, then join this result back to the main dataset to get the other data for that min date". If you try to do it all in a single grouping you'll fail, because you either have to group by more columns or use aggregate functions for the other data in the SELECT, which mixes your data up; once grouping is done, the concept of "other data from the same row" is gone.
Be aware that this can return duplicate rows if two records have identical min dates. The ROW_NUMBER form doesn't return duplicate records, but if two records have the same minimum StatusDate, which one you get is arbitrary. To force a specific one, ORDER BY more columns so you can be sure which row ends up with 1.
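For instance, a small sketch of a deterministic tiebreaker; using Status as the second sort key here is just an illustrative choice:
SELECT *
FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY statusdate, status) AS rn
    FROM sample
    WHERE ...
) x
WHERE rn = 1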

How can I make this query run efficiently?

In BigQuery, we're trying to run:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [Datastore.PerformanceDatum]
WHERE type = "MemoryPerf"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
which returns a relatively small amount of data. But we're getting the message:
Error: Resources exceeded during query execution. The query contained a GROUP BY operator, consider using GROUP EACH BY instead. For more details, please see https://developers.google.com/bigquery/docs/query-reference#groupby
What is making this query fail, the size of the subquery? Is there some equivalent query we can do which avoids the problem?
Edit in response to comments: if I add GROUP EACH BY (and drop the outer ORDER BY), the query fails, claiming that GROUP EACH BY is not parallelizable here.
I wrote an equivalent query that works for me:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, UTC_USEC_TO_DAY(dtimestamp) as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
If I run only the inner query, I get 3,660,624 results. Is your dataset bigger than that?
The outer select gives me only 4 results when grouped by day. I'll try a different grouping to see if I can hit a limit there:
SELECT day, AVG(value)/(1024*1024) FROM (
SELECT data value, dtimestamp / 1000 as day,
PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
FROM [io_sensor_data.moscone_io13]
WHERE sensortype = "humidity"
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
Runs too, now with 57,862 different groups.
I tried different combinations to reproduce the same error, and I was able to get it by doubling the amount of initial data. An easy "hack" to double the amount of data is changing:
FROM [io_sensor_data.moscone_io13]
To:
FROM [io_sensor_data.moscone_io13], [io_sensor_data.moscone_io13]
Then I get the same error. How much data do you have? Can you apply an additional filter? As you are already partitioning the percentile_rank by day, can you add an additional query to only analyze a fraction of the days (for example, only last month)?
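For example, a hedged sketch of that extra filter against the original table, assuming timestamp is in microseconds (as UTC_USEC_TO_DAY implies) and using legacy BigQuery SQL's TIMESTAMP_TO_USEC and DATE_ADD to keep only roughly the last 30 days:
SELECT day, AVG(value)/(1024*1024) FROM (
  SELECT value, UTC_USEC_TO_DAY(timestamp) as day,
    PERCENTILE_RANK() OVER (PARTITION BY day ORDER BY value ASC) as rank
  FROM [Datastore.PerformanceDatum]
  WHERE type = "MemoryPerf"
    AND timestamp >= TIMESTAMP_TO_USEC(DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY'))
) WHERE rank >= 0.9 AND rank <= 0.91
GROUP BY day
ORDER BY day desc;
Shrinking the input this way reduces the data each PERCENTILE_RANK partition has to process, which is what the resource limit is complaining about.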

Query return rows whose sum of column value match given sum

I have a table with:
id desc total
1 baskets 25
2 baskets 15
3 baskets 75
4 noodles 10
I would like a query whose output rows sum to a total of 40.
The output would be like:
id desc total
1 baskets 25
2 baskets 15
I believe this will get you a list of the results you're looking for, but not with your example dataset, because no whole description group in your example dataset sums to 40.
SELECT id, `desc`, total
FROM mytable
WHERE `desc` IN (
    SELECT `desc`
    FROM mytable
    GROUP BY `desc`
    HAVING SUM(total) = 40
)
SELECT `desc`, SUM(total) AS SumTotal
FROM Table
GROUP BY `desc`
HAVING SUM(total) >= 40
Not quite sure what you want, but this may get you started
SELECT `desc`, SUM(Total) Total
FROM TableName
GROUP BY `desc`
HAVING SUM(Total) = 40
From reading your question, it sounds like you want a query that returns any subset of rows that sums to a certain target value and shares the same description.
There is no simple way to do this; it migrates into algorithmic territory.
Assuming I am correct in what you are after, GROUP BYs and aggregate functions will not solve your problem. SQL has no direct way to say "run this query over every subset of the data"; you would have to exhaust all possible combinations and find the sums that match your requirement.
You will have to mix an algorithm into your SQL, i.e. a stored procedure.
Or simply get all the data that matches the desc from the database, then run your algorithm on it in code.
I recall from a CS algorithms class that this is a known problem: the subset sum problem. I believe you could adapt a working version of this algorithm to solve yours:
http://en.wikipedia.org/wiki/Subset_sum_problem
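That said, when the row counts per description are small, the brute-force enumeration can be done in SQL itself with a recursive CTE. A hedged sketch, assuming MySQL 8+ syntax (backticks because desc is a reserved word) and the mytable name used above; beware that this explores every subset, so it is exponential in the number of rows per description:
WITH RECURSIVE subsets AS (
    -- start with each single row as a one-element subset
    SELECT id AS last_id, `desc`, total, CAST(id AS CHAR(1000)) AS ids
    FROM mytable
    UNION ALL
    -- extend each subset with a higher-id row of the same description
    SELECT t.id, s.`desc`, s.total + t.total, CONCAT(s.ids, ',', t.id)
    FROM subsets s
    JOIN mytable t
      ON t.`desc` = s.`desc`
     AND t.id > s.last_id
)
SELECT ids, `desc`, total
FROM subsets
WHERE total = 40;
On the sample data this returns ids 1,2 (baskets, 25 + 15 = 40), matching the expected output.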
SELECT `desc`
FROM (SELECT `desc`, SUM(total) AS ct FROM mytable GROUP BY `desc`) AS t
WHERE ct = 40;