I have a table that goes like this:
Now I'm trying to get all the called numbers, together with a sum of the duration for each called value. This is what I'm trying:
select called,SUM(duration) as dursum,time,CONVERT(VARCHAR(8), time, 4) AS Batch
from Calls
where caller='somevalue'
group by called,time
order by called
Now the problem, I guess, is that because I group by time as well, I don't really get distinct values per called number; I get the same number several times, and of course the sum acts the same way, summing only part of the rows.
I can't simply remove time from the group by, as that causes an error about using the SUM function.
Again, I want the result to have every called number once per day, with dursum containing the sum of all the durations, grouped somehow by the batch column...
How can i solve this?
You are right that not grouping by time brings you the correct result set. However, the error that occurs then is easily explained: it is caused by selecting time, CONVERT(VARCHAR(8), time, 4) AS Batch in the select clause. Thus you just need some kind of aggregation for the time field, for instance MAX(time), CONVERT(VARCHAR(8), MAX(time), 4) AS Batch.
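In other words, keeping the grouping by called only, the query would become something like this (a sketch of the suggestion above, not tested against your table):
select called, SUM(duration) as dursum, MAX(time) as time, CONVERT(VARCHAR(8), MAX(time), 4) AS Batch
from Calls
where caller='somevalue'
group by called
order by called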
Well, this is how I solved it in the end:
select called, SUM(duration) as dursum,
       dateadd(DAY, 0, datediff(day, 0, time)) as Batch2,
       CONVERT(VARCHAR(8), time, 4) AS Batch
from Calls
where caller='0502081971'
group by called, dateadd(DAY,0, datediff(day,0, time)),CONVERT(VARCHAR(8), time, 4)
order by called
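For reference, dateadd(DAY, 0, datediff(day, 0, time)) rounds time down to midnight, which is what makes the per-day grouping work. On SQL Server 2008 and later, a cast to date expresses the same intent more directly (a sketch, assuming the same schema):
select called, SUM(duration) as dursum, CAST(time AS date) AS Batch
from Calls
where caller='0502081971'
group by called, CAST(time AS date)
order by called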
I've created a query that I'm hoping to use to fill a table with daily budgets at the end of every day. To do this, I just need some simple maths:
monthly budget / number of days left in the month.
Then, at the end of the month, I can SUM all of the rows to calculate an accurate budget.
The query is as follows:
SELECT *,
ROUND(SAFE_DIVIDE(budget, DATETIME_DIFF(CURRENT_DATE(), LAST_DAY(CURRENT_DATE()), DAY)),2) AS daily_budget
FROM `mm-demo-project.marketing_hub.budget_manager`
When executing the query, my results come out as negative numbers which, according to the documentation for this function, is likely caused by the result overflowing the result type.
I've made a fool's guess at rounding the calculation. Needless to say, it did not work at all.
How can I stop my query from returning negative numbers?
Use the version below. DATETIME_DIFF(datetime_1, datetime_2, DAY) returns datetime_1 minus datetime_2, so with CURRENT_DATE() as the first argument you get a negative count of days; swapping the two arguments fixes it:
SELECT *,
ROUND(SAFE_DIVIDE(budget, DATETIME_DIFF(LAST_DAY(CURRENT_DATE()), CURRENT_DATE(), DAY)),2) AS daily_budget
FROM `mm-demo-project.marketing_hub.budget_manager`
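To see why the original argument order went negative, here is the difference computed both ways (illustrative dates only):
SELECT DATETIME_DIFF(DATETIME '2023-06-10 00:00:00', DATETIME '2023-06-30 00:00:00', DAY) AS wrong_order,  -- -20
       DATETIME_DIFF(DATETIME '2023-06-30 00:00:00', DATETIME '2023-06-10 00:00:00', DAY) AS right_order   -- 20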
I don't know much at all about SQL; I've just toyed with it here and there through the years but never really 'used' it.
I'm trying to get a list of prices / volumes and aggregate them:
CREATE TABLE IF NOT EXISTS test (
ts timestamp without time zone NOT NULL,
price decimal NOT NULL,
volume decimal NOT NULL
);
what I'd like is to extract:
min price
max price
sum volume
sum (price * volume) / sum (volume)
By 1h slices
If I forget about the last line for now, I have:
SELECT MIN(price) min_price, MAX(price) max_price, SUM(volume) sum_vol, date_trunc('hour', ts) ts_group FROM test
GROUP BY ts_group;
My understanding is that GROUP BY ts_group will compute ts_group, build groups of rows, and then apply the MIN / MAX / SUM functions to each group. Since the syntax doesn't make any sense to me (entries on the select line would be treated differently despite being declared together, versus building groups first and then declaring operations on the groups), I could be dramatically wrong here.
But that will not return the min_price, max_price and sum_vol results after the grouping; I get ts, price and volume in the results.
If I remove the GROUP BY line to try to see all the output, I get the error:
column "test.ts" must appear in the GROUP BY clause or be used in an aggregate function
Which I don't really understand either...
I looked at "must appear in the GROUP BY clause or be used in an aggregate function" (an existing question about this error), but I don't really get it, and I looked at a tutorial (https://www.postgresqltutorial.com/postgresql-group-by/), which shows working examples but doesn't really clarify what is wrong with what I'm trying to do here.
While I'd be happy to have a working solution, I'm more looking from an explanation, or pointers toward good resources, that would allow me to understand this.
I have this working solution:
SELECT MIN(price) min_price, MAX(price) max_price, SUM(volume) sum_vol, (SUM(price * volume)/SUM(volume)) vwap FROM test
GROUP BY date_trunc('hour', ts);
but I still don't understand the error message from my question
All of your expressions in SQL must use data elements and functions that are known to PostgreSQL. In your first example, ts_group is neither an element of your table, nor a defined function, so it complained that it did not know how to calculate it.
Your second example works because date_trunc is a known function and ts is defined as a data element of the test table.
It also gets you the correct grouping (by hour intervals) because date_trunc 'blurs' all of those unique timestamps that by themselves would not combine into groups.
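To make the 'blurring' concrete, every timestamp within the same hour is mapped to one identical value, so those rows can fall into the same group (illustrative timestamps):
SELECT date_trunc('hour', TIMESTAMP '2021-03-05 14:37:12');  -- 2021-03-05 14:00:00
SELECT date_trunc('hour', TIMESTAMP '2021-03-05 14:59:59');  -- 2021-03-05 14:00:00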
Without a GROUP BY, then having any aggregates in your select list means it will aggregate everything down to just one row. But how does it aggregate date_trunc('hour', ts) down to one row, since there is no aggregating function specified for it? If you were using MySQL, it would just pick some arbitrary value for the column from all the seen values and report that as the "aggregate". But PostgreSQL is not so cavalier with your data. If your query is vague in this way, it refuses to run it. If you just want to see some value from the set without caring which one it is, you could use min or max to aggregate it.
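For example, with no GROUP BY at all, wrapping the time expression in an aggregate makes the query valid again and collapses the whole table to a single row (a sketch):
SELECT MIN(price) min_price, MAX(price) max_price, SUM(volume) sum_vol, MIN(date_trunc('hour', ts)) first_hour
FROM test;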
"Since the syntax doesn't make any sense to me (entries on the select line would be treated differently despite being declared together, versus building groups first and then declaring operations on the groups)"
You are trying to understand SQL as if it were C. But it is very different. Just learn it for what it is, without trying to force it to be something else. The select list is where you define the columns you want to see in the output. They may be computed in different ways, but what they have in common is that you want each of them to show up in the output, so they are listed together in that spot.
I have a dataset that contains 'UI' (a unique ID), time, and frequency (the frequency of the given value in the UI column), as shown here:
What I would like is to add a new column named 'daily_frequency' which simply counts each unique value in the UI column for a given day sequentially, as I show in the image below.
For example if UI=114737 and it is repeated 2 times in one day, we should have 1, and 2 in the daily_frequency column.
I could do that with Python and the pandas package, using the groupby and cumcount methods, as follows:
df['daily_frequency'] = df.groupby(['UI','day']).cumcount()+1
However, for some reason, I must do this via SQL queries (Amazon Redshift).
I think you want a running count, which could be calculated as:
COUNT(*) OVER (PARTITION BY ui, TRUNC(time) ORDER BY time
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS daily_frequency
Although Salman's answer seems to be correct, I think ROW_NUMBER() is simpler:
ROW_NUMBER() OVER (PARTITION BY ui, time::date
                   ORDER BY time
                  ) AS daily_frequency
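In context, either version would be used along these lines (a sketch; the table name mytable is assumed):
SELECT ui, time,
       ROW_NUMBER() OVER (PARTITION BY ui, time::date ORDER BY time) AS daily_frequency
FROM mytable
ORDER BY ui, time;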
I have a table in BigQuery with the following fields:
time,a,b,c,d
time is a string in ISO 8601 format but with a space; a is an integer from 1 to 16000, and the other columns are strings. The table contains one month's worth of data, and there are a few million records per day.
The following query fails with "response too large":
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day,b,c,d,count(a),count(distinct a, 1000000)
from [myproject.mytable]
group by day,b,c,d
order by day,b,c,d asc
However, this query works (the data starts at 2012-01-01)
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day,
b,c,d,count(a),count(distinct a)
from [myproject.mytable]
where UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) = UTC_USEC_TO_DAY(PARSE_UTC_USEC('2012-01-01 00:00:00'))
group by day,b,c,d
order by day,b,c,d asc
This looks like it might be related to this issue. However, because of the group by clause, the top query is equivalent to repeatedly calling the second query. Is the query planner not able to handle this?
Edit: To clarify my test data:
I am using fake test data that I generated. I originally used several fields and tried to get hourly summaries for a month (group by hour, with hour defined in the select part of the query just as day is above). When that failed I switched to daily. When that failed I reduced the columns involved. That also failed when using count(distinct xxx, 1000000), but it worked when I just did one day's worth. (It also works if I remove the 1000000 parameter, but since that does work with the one-day query, it seems the query planner is not separating things as I would expect.)
The column checked with count(distinct) has cardinality 16,000, and the group by columns have cardinalities 2 and 20, for a total of just 1200 expected rows. Column values are quite short, around ten characters.
How many results do you expect? There is currently a limitation of about 64MB on the total size of results that are allowed. If you're expecting millions of rows as a result, then this may be an expected error.
If the number of results isn't extremely large, it may be that the size problem is not in the final response but in the internal calculation. Specifically, if there are too many results from the GROUP BY, the query can run out of memory. One possible solution is to change GROUP BY to GROUP EACH BY, which alters the way the query is executed. This feature is currently experimental and, as such, not yet documented.
For your query, since the group by references fields that are computed in the select, you might need to do this:
select day, b, c, d, count(a), count(distinct a, 1000000)
FROM (
select UTC_USEC_TO_DAY(PARSE_UTC_USEC(time)) as day, b, c, d
from [myproject.mytable]
)
group EACH by day,b,c,d
order by day,b,c,d asc
I need to come up with an analysis of simultaneous events, having only the start time and duration of each event.
Details
I have a standard CDR (call detail record) table that contains, among others:
calldate (datetime of each call's start)
duration (int, seconds of call duration)
channel (a string)
What I need to come up with is some sort of analysis of the simultaneous calls in each second, for a given datetime period. For example, a graph of the simultaneous calls we had yesterday.
(The problem is the same if we have visitor logs with durations on a website and wish to obtain the number of simultaneous clients for a group of web pages.)
What would your algorithm be?
I can iterate over the records in the given period and fill an array, where each bucket of the array corresponds to 1 second in the overall period. This works and seems to be fast, but if the time period is big (say, 1 year), I would need lots of memory (3600 x 24 x 365 x 4 bytes ~ 120MB approx).
This is for a web-based, interactive app, so my memory footprint should be small enough.
Edit
By simultaneous, I mean all calls active during a given second. A second would be my minimum unit; I cannot use something bigger (an hour, for example) because all the calls during an hour need not be held at the same time.
I would implement this on the database. Using a GROUP BY clause with DATEPART, you could get a list of simultaneous calls for whatever time period you wanted, by second, minute, hour, whatever.
On the web side, you would only have to display the histogram that is returned by the query.
@eric-z-beard: I would really like to be able to implement this on the database. I like your proposal, and while it seems to lead to something, I don't quite fully understand it. Could you elaborate? Please recall that each call will span several seconds, and each second needs to count. If using DATEPART (or something like it on MySQL), which second should be used for the GROUP BY? See the note on what simultaneous means.
Elaborating on this, I found a way to solve it using a temporary table. Assuming temp holds all the seconds from tStart to tEnd, I could do:
SELECT temp.second, count(call.id)
FROM call, temp
WHERE temp.second BETWEEN call.start AND call.start + call.duration
GROUP BY temp.second
Then, as suggested, the web app should use this as a histogram.
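One detail worth checking in that query: BETWEEN is inclusive at both ends, so a call starting at second s with duration d lands in d + 1 buckets (s through s + d). If a one-second call should occupy exactly one bucket, a half-open range may be what you want (a sketch):
WHERE temp.second >= call.start
  AND temp.second < call.start + call.duration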
You can use a static Numbers table for lots of SQL tricks like this. The Numbers table simply contains integers from 0 to n for n like 10000.
Then your temp table never needs to be created, and instead is a subquery like:
SELECT StartTime + Numbers.Number AS Second
FROM Numbers
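If you don't have a Numbers table yet, one can be built once from a small digits table (a sketch; Digits and Numbers are assumed names, and the cross join yields 0 through 99999, comfortably covering the 86,400 seconds in a day):
CREATE TABLE Digits (d INT NOT NULL);
INSERT INTO Digits (d) VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);
CREATE TABLE Numbers (Number INT NOT NULL PRIMARY KEY);
INSERT INTO Numbers (Number)
SELECT d1.d + 10*d2.d + 100*d3.d + 1000*d4.d + 10000*d5.d
FROM Digits d1, Digits d2, Digits d3, Digits d4, Digits d5;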
You can create a table simultaneous_calls with 3 fields:
yyyymmdd Char(8) -- the day
day_second Number -- second of the day
count Number -- count of simultaneous calls
Your web service can take the 'count' values from this table and build statistics from them.
The simultaneous_calls table will be filled by a batch program that runs every day after the day has ended.
Assuming you use Oracle, the batch can run a PL/SQL procedure which does the following:
Appends the table with 24 * 3600 = 86,400 records, one for each second of the day, with a default 'count' value of 0.
Defines the 'day_cdrs' cursor for the query:
Select to_char(calldate, 'yyyymmdd') yyyymmdd,
       (calldate - trunc(calldate)) * 24 * 3600 starting_second,
       duration duration
From cdrs
Where cdrs.calldate >= Trunc(Sysdate - 1)
  And cdrs.calldate < Trunc(Sysdate)
Iterates over the cursor, incrementing the 'count' field for each second of the call:
For cdr in day_cdrs
Loop
Update simultaneous_calls
Set count = count + 1
Where yyyymmdd = cdr.yyyymmdd
And day_second Between cdr.starting_second And cdr.starting_second + cdr.duration;
End Loop;