Beginning SQL group by and AVG - sql

I am trying to pull information from two columns titled clientstate and clientrevenue in my table. I want clientstate to show up as the state, and have only distinct names in it, and under client revenue I want the average revenue per state, and that will only show up if there are at least two clients from that state. I am very new at this, so what I have is pretty iffy:
SELECT clientstate, clientrevenue
FROM client
GROUP BY clientrevenue
HAVING COUNT (*) >=2;
Where am I going wrong here?

SELECT clientstate AS [State]
, AVG(clientrevenue) AS [Average Revenue]
FROM client
GROUP BY clientstate
HAVING COUNT(*) >= 2

Grouping by clientrevenue puts rows with identical revenue values into the same group, which makes no logical sense here.
First, in order to get distinct states, the clientstate column needs to be used in the GROUP BY clause.
Thus, the code would be:
SELECT clientstate, AVG(clientrevenue)
FROM Source_Table
GROUP BY clientstate --this would get you distinct states
Now, considering the requirement of at least 2 clients per state: HAVING filters the groups by a condition on an aggregate, and that aggregate does not have to appear in the SELECT list. So even though the SELECT above only uses AVG(clientrevenue), adding HAVING COUNT(*) >= 2 after the GROUP BY is enough.
If you need to count distinct clients rather than rows, you can also write it as a condition with a correlated subquery (this assumes the table has a client_ID column), like
SELECT clientstate, AVG(clientrevenue)
FROM Source_Table A
WHERE (SELECT count(DISTINCT client_ID) FROM Source_Table B
WHERE A.clientstate = B.clientstate) >= 2 --Condition
GROUP BY clientstate --this would get you distinct states
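
If you want to try the GROUP BY / HAVING combination without touching your real database, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names come from the question; the sample rows and the function name average_revenue_by_state are made up for illustration.

```python
import sqlite3


def average_revenue_by_state(rows, min_clients=2):
    """Return (state, average revenue) for states with at least min_clients rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE client (clientstate TEXT, clientrevenue REAL)")
    conn.executemany("INSERT INTO client VALUES (?, ?)", rows)
    result = conn.execute(
        """
        SELECT clientstate, AVG(clientrevenue) AS avg_revenue
        FROM client
        GROUP BY clientstate          -- one output row per state
        HAVING COUNT(*) >= ?          -- keep only states with enough clients
        ORDER BY clientstate
        """,
        (min_clients,),
    ).fetchall()
    conn.close()
    return result


# Hypothetical sample data: NY has a single client and is filtered out.
sample = [("CA", 100.0), ("CA", 300.0), ("NY", 50.0), ("TX", 10.0), ("TX", 30.0)]
print(average_revenue_by_state(sample))
```

Note that COUNT(*) appears only in HAVING and not in the SELECT list, and the query still runs.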

Related

Confused with the Group By function in SQL

Q1: After using the Group By function, why does it only output one row of each group at most? Does this mean that having is supposed to filter the group rather than filter the records in each group?
Q2: I want to find the records in each group whose ages are greater than the average age of that group. I tried the following, but it returns nothing. How should I fix this?
SELECT *, avg(age) FROM Mytable Group By country Having age > avg(age)
Thanks!!!!
You can calculate the average age for each country in a subquery and join that to your table for filtering:
SELECT mt.*, mtavg.AvgAge
FROM Mytable mt
INNER JOIN
(
SELECT mtavgs.country
, AVG(mtavgs.age) AS AvgAge
FROM Mytable mtavgs
GROUP BY mtavgs.country
) mtavg
ON mtavg.country = mt.country
AND mt.age > mtavg.AvgAge
GROUP BY always returns exactly one row per unique combination of values in the listed GROUP BY columns (unless that group is removed by a HAVING clause). That is why HAVING filters whole groups, not the individual records inside a group. The subquery in our example (alias: mtavg) calculates a single row per country. We use its results to filter the rows of the main table through the condition in the INNER JOIN clause, and we also report the calculated average age in the output.
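
To see the join-to-a-grouped-subquery idea in action, here is a small sketch with Python's built-in sqlite3 module. Mytable, country and age come from the question; the name column, the sample rows and the function name older_than_country_average are assumptions for illustration only.

```python
import sqlite3


def older_than_country_average(people):
    """Return rows whose age is above the average age of their own country."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Mytable (name TEXT, country TEXT, age INTEGER)")
    conn.executemany("INSERT INTO Mytable VALUES (?, ?, ?)", people)
    rows = conn.execute(
        """
        SELECT mt.name, mt.country, mt.age, mtavg.AvgAge
        FROM Mytable mt
        INNER JOIN (
            -- one row per country holding that country's average age
            SELECT country, AVG(age) AS AvgAge
            FROM Mytable
            GROUP BY country
        ) mtavg
          ON mtavg.country = mt.country
         AND mt.age > mtavg.AvgAge
        ORDER BY mt.country, mt.name
        """
    ).fetchall()
    conn.close()
    return rows


# Hypothetical sample data: both countries have an average age of 40.
people = [
    ("Ann", "US", 30), ("Bob", "US", 50),
    ("Cy", "FR", 20), ("Di", "FR", 40), ("Ed", "FR", 60),
]
print(older_than_country_average(people))
```

Di is exactly at the average and is therefore not returned, since the condition is strictly greater than.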
GROUP BY is a clause that is used together with aggregate functions such as AVG or COUNT. For further reading, see a SQL GROUP BY tutorial.
What it does is lump all the rows that belong to the same group together into one row. In your example it would lump all the rows with the same country together.
I'm not quite sure what exactly your query needs to be to solve your exact problem. I would, however, look into what are called window functions in SQL. I believe what you first need to do is write a window function to find the average age in each group. Then you can write a query to return the results you need.
Depending on your DBMS type and version, you may be able to use a "window function" that calculates the average per country; with this approach the calculation is available on every row. Once that data is present as a "derived table", you can simply use a WHERE clause to filter for the ages that are greater than the calculated average per country.
SELECT mt.*
FROM (
SELECT *
, avg(age) OVER(PARTITION BY country) AS AvgAge
FROM Mytable
) mt
WHERE mt.Age > mt.AvgAge
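
The window-function version can be tried the same way. This is only a sketch: it assumes the SQLite library behind Python's sqlite3 module is version 3.25 or newer (older versions have no window functions), and the name column, the sample rows and the function name above_partition_average are invented for the example.

```python
import sqlite3


def above_partition_average(people):
    """Return rows whose age exceeds the per-country average, via AVG() OVER."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Mytable (name TEXT, country TEXT, age INTEGER)")
    conn.executemany("INSERT INTO Mytable VALUES (?, ?, ?)", people)
    rows = conn.execute(
        """
        SELECT mt.name, mt.country, mt.age
        FROM (
            -- every row keeps its own columns plus its country's average
            SELECT name, country, age,
                   AVG(age) OVER (PARTITION BY country) AS AvgAge
            FROM Mytable
        ) mt
        WHERE mt.age > mt.AvgAge
        ORDER BY mt.country, mt.name
        """
    ).fetchall()
    conn.close()
    return rows


# Hypothetical sample data: both countries have an average age of 40.
window_people = [
    ("Ann", "US", 30), ("Bob", "US", 50),
    ("Cy", "FR", 20), ("Di", "FR", 40), ("Ed", "FR", 60),
]
print(above_partition_average(window_people))
```

Unlike GROUP BY, the window function does not collapse the rows, which is why the outer WHERE can compare each row's age with its group average.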

Join query in Access 2013

I currently have a single table with a large amount of data in Access; because of its size, I could no longer work with it easily in Excel.
I'm partially there on a query to pull data from this table.
The table has 7 columns.
One column, GL_GL_NUM, contains a transaction number. About 75% of these numbers occur in pairs. I'm trying to pull the records (all columns) for each unique transaction number in this column.
I have put together some code from googling that hypothetically should work, but I think I'm either missing something in the syntax or simply asking Access to do something it cannot.
See below:
SELECT SOURCE_FUND, GLType, Contract, Status, Debit, Credit, GL_GL_NUM
FROM Suspense
JOIN (
SELECT TC_TXN_NUM TXN_NUM, COUNT(GL_GL_NUM) GL_NUM
FROM Suspense
GROUP BY TC_TXN_NUM HAVING COUNT(GL_GL_NUM) > 1 ) SUB ON GL_GL_NUM = GL_NUM
Hey Beth is this the suggested code? It says there is a syntax error in the FROM clause. Thanks.
SELECT * from SuspenseGL
JOIN (
SELECT TC_TXN_NUM, COUNT(GL_GL_NUM) GL_NUM
FROM Suspense
GROUP BY TC_TXN_NUM
HAVING COUNT(GL_GL_NUM) > 1
Do you want detailed results (all rows and columns) or aggregate results, with one row per tx number?
If you want an aggregate result, like the count of distinct transaction numbers, then you need to apply one or more aggregate functions to any other columns you include.
If you run
SELECT TC_TXN_NUM, COUNT(GL_GL_NUM) GL_NUM
FROM Suspense
GROUP BY TC_TXN_NUM
HAVING COUNT(GL_GL_NUM) > 1
you'll get one row for each distinct txn, but if you then join those results back with your original table, you'll have the same number of rows as if you didn't join them with distinct txns at all.
Is there a column you don't want included in your results? If not, then the only query you need to work with is
select * from suspense
Considering your column names, what you may want is:
SELECT SOURCE_FUND, GLType, Contract, Status, sum(Debit) as sum_debit,
sum(Credit) as sum_credit, count(*) as txCount
FROM Suspense
group by
SOURCE_FUND, GLType, Contract, Status
Based on your comments, if you can't work with aggregate results, you need to work with all the rows:
Select * from suspense
What's not working? It doesn't matter if 75% of the txns are duplicates, you need to send out every column in every row.
OK, let's say
Select * from suspense
returns 8 rows, and
select GL_GL_NUM from suspense group by GL_GL_NUM
returns 5 rows, because 3 of them have duplicate GL_GL_NUMs and 2 of them don't.
How many rows do you want in your result set? If you want fewer than 8 rows back, you need to apply some sort of aggregate function to each column you want returned.
You could do something like the following:
SELECT S.* FROM
SUSPENSE AS S
INNER JOIN (SELECT GL_GL_NUM, MIN(ID) AS ID FROM SUSPENSE
GROUP BY GL_GL_NUM) AS S2
ON S.ID = S2.ID
AND S.GL_GL_NUM = S2.GL_GL_NUM
This returns a single row for each unique gl_gl_num (it assumes the table has a unique ID column). However, if the other rows with the same gl_gl_num hold different data, that data will not be shown. If you need it, you would have to aggregate instead, using SUM(Credit) and SUM(Debit) with GROUP BY on gl_gl_num.
I have attached a SQL Fiddle to demonstrate my results and make this clearer.
http://sqlfiddle.com/#!3/8284f/2
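
In case the fiddle link goes away, here is a rough equivalent using Python's built-in sqlite3 module rather than Access. The ID column, the reduced set of columns, the sample rows and both function names are assumptions for illustration; only Suspense, GL_GL_NUM, Debit and Credit come from the question.

```python
import sqlite3


def load_suspense(rows):
    """Create an in-memory Suspense table (ID is an assumed unique key)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Suspense ("
        "ID INTEGER PRIMARY KEY, GL_GL_NUM TEXT, Debit REAL, Credit REAL)"
    )
    conn.executemany("INSERT INTO Suspense VALUES (?, ?, ?, ?)", rows)
    return conn


def first_row_per_txn(conn):
    """Detailed rows: keep the row with the lowest ID for each GL_GL_NUM."""
    return conn.execute(
        """
        SELECT S.ID, S.GL_GL_NUM, S.Debit, S.Credit
        FROM Suspense AS S
        INNER JOIN (SELECT GL_GL_NUM, MIN(ID) AS ID
                    FROM Suspense
                    GROUP BY GL_GL_NUM) AS S2
          ON S.ID = S2.ID
        ORDER BY S.ID
        """
    ).fetchall()


def totals_per_txn(conn):
    """Aggregate rows: one row per GL_GL_NUM with summed debit and credit."""
    return conn.execute(
        """
        SELECT GL_GL_NUM, SUM(Debit), SUM(Credit)
        FROM Suspense
        GROUP BY GL_GL_NUM
        ORDER BY GL_GL_NUM
        """
    ).fetchall()


# Hypothetical sample data: A and C are paired transactions, B is single.
suspense_rows = [
    (1, "A", 10.0, 0.0), (2, "A", 0.0, 10.0),
    (3, "B", 5.0, 0.0),
    (4, "C", 7.0, 0.0), (5, "C", 0.0, 7.0),
]
suspense_conn = load_suspense(suspense_rows)
print(first_row_per_txn(suspense_conn))
print(totals_per_txn(suspense_conn))
```

The first query silently drops the credit side of each pair, while the second keeps the information but gives up the row-level detail, which is the trade-off described above.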

SQL - making several groupings (performance)

I have an SQL query that finds records based on provided parameters. That query is pretty heavy, so I want to execute it as few times as possible.
After I get the result from that query, I need to produce a breakdown of it.
For example, consider the following query:
SELECT location, department, industry
FROM data
WHERE ...
After that, I need to break down those results, e.g. I need to provide a list of all locations that appear in the results with a count for each, and the same for departments and for industries.
As far as I know, in order to get the breakdown by location, I need to perform GROUP BY (location) and then count.
My question is: is it possible, for performance reasons, to perform several groupings/counts on the query result without recalculating it over and over again for each grouping?
Yes, this is possible. Unless I misunderstood you.
You need to use windowed functions. For instance:
SELECT location
, department
, industry
, COUNT(*) OVER(PARTITION BY location, department)
, COUNT(*) OVER(PARTITION BY location, department, industry)
FROM data
WHERE ...;
Keep in mind that COUNT(DISTINCT column) cannot be used as a windowed function here.
If I understand correctly, you can do what you want with grouping sets (documented here):
SELECT location, department, industry, count(*)
FROM data
WHERE ...
GROUP BY GROUPING SETS ((location), (department), (industry))
This will return rows like:
location1 NULL NULL 10
. . .
NULL dept1 NULL 17
. . .
If you want to get fancy, and you have no NULL values in any of the columns, you can do:
SELECT (case when location is not null then 'location'
when department is not null then 'department'
when industry is not null then 'industry'
end) as which,
coalesce(location, department, industry) as name, count(*)
FROM data
WHERE ...
GROUP BY GROUPING SETS ((location), (department), (industry))
ORDER BY which;
You can actually do the same thing using the GROUPING() function, if you do have NULL values in the columns, but you have to replace the coalesce() as well.
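
SQLite, which is what Python's built-in sqlite3 module uses, has no GROUPING SETS, so the following sketch swaps in a different technique to get the same kind of breakdown: it runs the filter once into a temporary table and then combines three GROUP BY queries with UNION ALL. The sample rows, the stand-in WHERE 1 = 1 filter, the filtered table name and the function name breakdown are all assumptions for illustration.

```python
import sqlite3


def breakdown(rows):
    """Return (which, name, count) for locations, departments and industries."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (location TEXT, department TEXT, industry TEXT)")
    conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)
    # Run the (assumed expensive) query a single time and keep its result.
    conn.execute(
        """
        CREATE TEMP TABLE filtered AS
        SELECT location, department, industry
        FROM data
        WHERE 1 = 1  -- stands in for the real, elided filter
        """
    )
    # Each grouping now reads the small stored result, not the heavy query.
    result = conn.execute(
        """
        SELECT 'location' AS which, location AS name, COUNT(*) AS n
        FROM filtered GROUP BY location
        UNION ALL
        SELECT 'department', department, COUNT(*)
        FROM filtered GROUP BY department
        UNION ALL
        SELECT 'industry', industry, COUNT(*)
        FROM filtered GROUP BY industry
        ORDER BY 1, 2
        """
    ).fetchall()
    conn.close()
    return result


# Hypothetical sample data.
data_rows = [
    ("NY", "Sales", "Tech"),
    ("NY", "HR", "Tech"),
    ("LA", "Sales", "Retail"),
]
for row in breakdown(data_rows):
    print(row)
```

On a DBMS that does support GROUPING SETS, the single-statement form shown above is simpler and lets the engine decide how to share the work.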

Can peewee nest SELECT queries such that the outer query selects on an aggregate of the inner query?

I'm using peewee2.1 with python3.3 and an sqlite3.7 database.
I want to perform certain SELECT queries in which:
1. I first select some aggregate (count, sum), grouping by some id column; then
2. I select from the results of (1), aggregating over its aggregate. Specifically, I want to count the number of rows in (1) that have each aggregated value.
My database has an 'Event' table with 1 record per event, and a 'Ticket' table with 1..N tickets per event. Each ticket record contains the event's id as a foreign key. Each ticket also contains a 'seats' column that specifies the number of seats purchased. (A "ticket" is really best thought of as a purchase transaction for 1 or more seats at the event.)
Below are two examples of working SQLite queries of this sort that give me the desired results:
SELECT ev_tix, count(1) AS ev_tix_n FROM
(SELECT count(1) AS ev_tix FROM ticket GROUP BY event_id)
GROUP BY ev_tix
SELECT seat_tot, count(1) AS seat_tot_n FROM
(SELECT sum(seats) AS seat_tot FROM ticket GROUP BY event_id)
GROUP BY seat_tot
But using Peewee, I don't know how to select on the inner query's aggregate (count or sum) when specifying the outer query. I can of course specify an alias for that aggregate, but it seems I can't use that alias in the outer query.
I know that Peewee has a mechanism for executing "raw" SQL queries, and I've used that workaround successfully. But I'd like to understand if / how these queries can be done using Peewee directly.
I posted the same question on the peewee-orm Google group. Charles Leifer responded promptly with both an answer and new commits to the peewee master. So although I'm answering my own question, obviously all credit goes to him.
You can see that thread here: https://groups.google.com/forum/#!topic/peewee-orm/FSHhd9lZvUE
But here's the essential part, which I've copied from Charles' response to my post:
I've added a couple commits to master which should make your queries possible (https://github.com/coleifer/peewee/commit/22ce07c43cbf3c7cf871326fc22177cc1e5f8345).
Here is the syntax, roughly, for your first example:
SELECT ev_tix, count(1) AS ev_tix_n FROM
(SELECT count(1) AS ev_tix FROM ticket GROUP BY event_id)
GROUP BY ev_tix
from peewee import SQL, fn
ev_tix = SQL('ev_tix') # the name of the alias.
(Ticket
.select(ev_tix, fn.count(ev_tix).alias('ev_tix_n'))
.from_(
Ticket.select(fn.count(Ticket.id).alias('ev_tix')).group_by(Ticket.event))
.group_by(ev_tix))
This yields the following SQL:
SELECT ev_tix, count(ev_tix) AS ev_tix_n FROM (SELECT Count(t2."id")
AS ev_tix FROM "ticket" AS t2 GROUP BY t2."event_id")
GROUP BY ev_tix
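
For anyone who wants to check what the nested aggregate returns before translating it to peewee, here is a minimal sketch that runs both SQL queries from the question through Python's built-in sqlite3 module (no peewee involved). The ticket table layout follows the description in the question; the sample rows, the ORDER BY clauses and the function name ticket_histograms are additions for illustration.

```python
import sqlite3


def ticket_histograms(tickets):
    """Return (events per ticket count, events per seat total)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE ticket ("
        "id INTEGER PRIMARY KEY, event_id INTEGER, seats INTEGER)"
    )
    conn.executemany("INSERT INTO ticket (event_id, seats) VALUES (?, ?)", tickets)
    # Inner query: tickets per event. Outer query: how many events share a count.
    by_ticket_count = conn.execute(
        """
        SELECT ev_tix, count(1) AS ev_tix_n FROM
          (SELECT count(1) AS ev_tix FROM ticket GROUP BY event_id)
        GROUP BY ev_tix
        ORDER BY ev_tix
        """
    ).fetchall()
    # Same shape, but the inner aggregate is the seat total per event.
    by_seat_total = conn.execute(
        """
        SELECT seat_tot, count(1) AS seat_tot_n FROM
          (SELECT sum(seats) AS seat_tot FROM ticket GROUP BY event_id)
        GROUP BY seat_tot
        ORDER BY seat_tot
        """
    ).fetchall()
    conn.close()
    return by_ticket_count, by_seat_total


# Hypothetical sample data as (event_id, seats):
# events 1 and 2 have two tickets each, event 3 has one.
tickets = [(1, 2), (1, 3), (2, 1), (2, 4), (3, 6)]
print(ticket_histograms(tickets))
```

With this data, one event has a single ticket and two events have two tickets; two events sold 5 seats in total and one sold 6.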

I'm not sure what is the purpose of "group by" here

I'm struggling to understand what this query is doing:
SELECT branch_name, count(distinct customer_name)
FROM depositor, account
WHERE depositor.account_number = account.account_number
GROUP BY branch_name
What's the need of GROUP BY?
You must use GROUP BY in order to use an aggregate function like COUNT in this manner (using an aggregate function to aggregate data corresponding to one or more values within the table).
The query essentially selects distinct branch_names using that column as the grouping column, then within the group it counts the distinct customer_names.
You couldn't use COUNT to get the number of distinct customer_names per branch_name without the GROUP BY clause (at least not with a simple query specification - you can use other means, joins, subqueries etc...).
It's giving you the total number of distinct customers for each branch; GROUP BY defines the groups over which the COUNT function is computed.
It could be written also as:
SELECT branch_name, count(distinct customer_name)
FROM depositor INNER JOIN account
ON depositor.account_number = account.account_number
GROUP BY branch_name
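
To see what the GROUP BY contributes, here is a small sketch using Python's built-in sqlite3 module that runs the query with and without it. The two tables and their columns come from the question; the sample rows and the function names are made up for illustration.

```python
import sqlite3


def make_bank(accounts, depositors):
    """Create in-memory account and depositor tables with the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (account_number INTEGER, branch_name TEXT)")
    conn.execute("CREATE TABLE depositor (customer_name TEXT, account_number INTEGER)")
    conn.executemany("INSERT INTO account VALUES (?, ?)", accounts)
    conn.executemany("INSERT INTO depositor VALUES (?, ?)", depositors)
    return conn


def customers_per_branch(conn):
    """With GROUP BY: one row per branch and its distinct customer count."""
    return conn.execute(
        """
        SELECT branch_name, count(DISTINCT customer_name)
        FROM depositor INNER JOIN account
          ON depositor.account_number = account.account_number
        GROUP BY branch_name
        ORDER BY branch_name
        """
    ).fetchall()


def customers_overall(conn):
    """Without GROUP BY: a single count over the whole join result."""
    return conn.execute(
        """
        SELECT count(DISTINCT customer_name)
        FROM depositor INNER JOIN account
          ON depositor.account_number = account.account_number
        """
    ).fetchone()[0]


# Hypothetical sample data: Ann holds two accounts at the same branch.
accounts = [(1, "Downtown"), (2, "Downtown"), (3, "Uptown")]
depositors = [("Ann", 1), ("Ann", 2), ("Bob", 2), ("Cy", 3)]
bank = make_bank(accounts, depositors)
print(customers_per_branch(bank))
print(customers_overall(bank))
```

Ann is counted once for Downtown thanks to DISTINCT, and without the GROUP BY the per-branch breakdown disappears and only the overall total remains.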
Let's take a step away from SQL for a moment and look at the relational training language Tutorial D.
Because the two relations (tables) are joined on the common attribute (column) name account_number, we can use a natural join:
depositor JOIN account
(Because the result is a relation, which by definition has only distinct tuples (rows), we don't need a DISTINCT keyword.)
Now we just need to aggregate using SUMMARIZE..BY:
SUMMARIZE (depositor JOIN account)
BY { branch_name }
ADD ( COUNT ( customer_name ) AS customer_tally )
Back in SQLland, the GROUP BY branch_name is doing the same as SUMMARIZE..BY { branch_name }. Because SQL has a very rigid structure, the branch_name column must be repeated in the SELECT clause.
If you select a plain column such as branch_name alongside an aggregate such as COUNT (see the SELECT part of the statement), you have to use GROUP BY to tell the query which rows to aggregate together. The GROUP BY clause is used in conjunction with aggregate functions to group the result set by one or more columns.
Neglecting it will lead to SQL errors in most RDBMS, or senseless results in others.
Useful link:
http://www.w3schools.com/sql/sql_groupby.asp