Need to find average and number of repetitions of column - sql

I have an SQL sentence :
SELECT application.id,title,url,company.name AS company_name,package_name,ranking,date,platform,country.name AS country_name,collection.name AS collection_name,category.name AS category_name FROM application
JOIN application_history ON application_history.application_id = application.id
JOIN company ON application.company_id = company.id
JOIN country ON application_history.country_id = country.id
JOIN collection ON application_history.collection_id = collection.id
JOIN category ON application_history.category_id = category.id
WHERE application.platform=0
AND country.name ='CZ'
AND collection.name='topfreeapplications'
AND category.name='UTILITIES'
AND application_history.ranking <= 10
AND date::date BETWEEN date (CURRENT_DATE - INTERVAL '1 month') AND CURRENT_DATE
ORDER BY application_history.ranking ASC
It produces this result :
I'd like to add both a column average ranking for a given package, and a column number of appearances, which would count the number a package appears in the list. I'd also like to Group results by package_name, so that I don't have redundancies.
So far, I've tried to add a GROUP BY By clause before the ORDER BY :
GROUP BY package_name
But it returns me an error :
column "application.id" must appear in the GROUP BY clause or be used in an aggregate function
If I add each and every column it asks me for, it doesn't work.
I have also tried to count the number of package names, by adding after the SELECT :
COUNT(package_name) AS count
It produces a similar error.
How could I get the result I'm looking for ? Should I make two queries instead, or is it possible to get everything at once ?
I precise I have looked at other answers on S.O, but none of them tries to make the COUNT on a "produced" column.
Thank you for your help.
Edit :
Here is the result I expected at first :
Although Gordon's advice didn't give me the proper result it put me on the good track, when I read this :
From the docs : "Unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row."
So I came back to using COUNT and AVG alone. My problem was that I wanted to display the ranking column and date to check whether things were right. But putting these column into the Select prevented the GROUP BY to work as expected, as mentioned by Jarlh in the comments.
The working query :
SELECT application.id,title,url,company.name AS company_name,package_name,platform,country.name AS country_name,collection.name AS collection_name,category.name AS category_name,
COUNT(package_name) AS count, AVG(application_history.ranking) AS avg
FROM application
JOIN application_history ON application_history.application_id = application.id
JOIN company ON application.company_id = company.id
JOIN country ON application_history.country_id = country.id
JOIN collection ON application_history.collection_id = collection.id
JOIN category ON application_history.category_id = category.id
WHERE application.platform=0
AND country.name ='CZ'
AND collection.name='topfreeapplications'
AND category.name='UTILITIES'
AND application_history.ranking <= 10
AND date::date BETWEEN date (CURRENT_DATE - INTERVAL '1 month') AND CURRENT_DATE
GROUP BY package_name,application.id,company.name,country.name,collection.name,category.name
ORDER BY count DESC

I think you want window/analytic functions. The following adds two columns, one for the count of rows for each package and the other an average ranking for them:
SELECT application.id, title, url, company.name AS company_name, package_name,
ranking, date, platform, country.name AS country_name,
collection.name AS collection_name, category.name AS category_name,
count(*) over (partition by package_name) as count,
avg(ranking) over (partition by package_name) as avg_package_ranking
FROM application . . .

Related

SQL - Returning fields based on where clause then joining same table to return max value?

I have a table named Ticket Numbers, which (for this example) contain the columns:
Ticket_Number
Assigned_Group
Assigned_Group_Sequence_No
Reported_Date
Each ticket number could contain 4 rows, depending on how many times the ticket changed assigned groups. Some of these rows could contain an assigned group of "Desktop Support," but some may not. Here is an example:
Example of raw data
What I am trying to accomplish is to get the an output that contains any ticket numbers that contain 'Desktop Support', but also the assigned group of the max sequence number. Here is what I am trying to accomplish with SQL:
Queried Data
I'm trying to use SQL with the following query but have no clue what I'm doing wrong:
select ih.incident_number,ih.assigned_group, incident_history2.maxseq, incident_history2.assigned_group
from incident_history_public as ih
left join
(
select max(assigned_group_seq_no) maxseq, incident_number, assigned_group
from incident_history_public
group by incident_number, assigned_group
) incident_history2
on ih.incident_number = incident_history2.incident_number
and ih.assigned_group_seq_no = incident_history2.maxseq
where ih.ASSIGNED_GROUP LIKE '%DS%'
Does anyone know what I am doing wrong?
You might want to create a proper alias for incident_history. e.g.
from incident_history as incident_history1
and
on incident_history1.ticket_number = incident_history2.ticket_number
and incident_history1.assigned_group_seq_no = incident_history2.maxseq
In my humble opinion a first error could be that I don't see any column named "incident_history2.assigned_group".
I would try to use common table expression, to get only ticket number that contains "Desktop_support":
WITH desktop as (
SELECT distinct Ticket_Number
FROM incident_history
WHERE Assigned_Group = "Desktop Support"
),
Than an Inner Join of the result with your inner table to get ticket number and maxSeq, so in a second moment you can get also the "MAXGroup":
WITH tmp AS (
SELECT i2.Ticket_Number, i2.maxseq
FROM desktop D inner join
(SELECT Ticket_number, max(assigned_group_seq_no) as maxseq
FROM incident_history
GROUP BY ticket_number) as i2
ON D.Ticket_Number = i2.Ticket_Number
)
SELECT i.Ticket_Number, i.Assigned_Group as MAX_Group, T.maxseq, i.Reported_Date
FROM tmp T inner join incident_history i
ON T.Ticket_Number = i.Ticket_Number and i.assigned_group_seq_no = T.maxseq
I think there are several different method to resolve this question, but I really hope it's helpful for you!
For more information about Common Table Expression: https://www.essentialsql.com/introduction-common-table-expressions-ctes/

Calculated column syntax when using a group by function Teradata

I'm trying to include a column calculated as a % of OTYPE.
IE
Order type | Status | volume of orders at each status | % of all orders at this status
SELECT
T.OTYPE,
STATUS_CD,
COUNT(STATUS_CD) AS STATVOL,
(STATVOL / COUNT(ROW_ID)) * 100
FROM Database.S_ORDER O
LEFT JOIN /* Finding definitions for status codes & attaching */
(
SELECT
ROW_ID AS TYPEJOIN,
"NAME" AS OTYPE
FROM database.S_ORDER_TYPE
) T
ON T.TYPEJOIN = ORDER_TYPE_ID
GROUP BY (T.OTYPE, STATUS_CD)
/*Excludes pending and pending online orders */
WHERE CAST(CREATED AS DATE) = '2018/09/21' AND STATUS_CD <> 'Pending'
AND STATUS_CD <> 'Pending-Online'
ORDER BY T.OTYPE, STATUS_CD DESC
OTYPE STATUS_CD STATVOL TOTALPERC
Add New Service Provisioning 2,740 100
Add New Service In-transit 13 100
Add New Service Error - Provisioning 568 100
Add New Service Error - Integration 1 100
Add New Service Complete 14,387 100
Current output just puts 100 at every line, need it to be a % of total orders
Could anyone help out a Teradata & SQL student?
The complication making this difficult is my understanding of the group by and count syntax is tenuous. It took some fiddling to get it displayed as I have it, I'm not sure how to introduce a calculated column within this combo.
Thanks in advance
There are a couple of places the total could be done, but this is the way I would do it. I also cleaned up your other sub query which was not required, and changed the date to a non-ambiguous format (change it back if it cases an issue in Teradata)
SELECT
T."NAME" as OTYPE,
STATUS_CD,
COUNT(STATUS_CD) AS STATVOL,
COUNT(STATUS_CD)*100/TotalVol as Pct
FROM database.S_ORDER O
LEFT JOIN EDWPRDR_VW40_SBLCPY.S_ORDER_TYPE T on T.ROW_ID = ORDER_TYPE_ID
cross join (select count(*) as TotalVol from database.S_ORDER) Tot
GROUP BY T."NAME", STATUS_CD, TotalVol
WHERE CAST(CREATED AS DATE) = '2018-09-21' AND STATUS_CD <> 'Pending' AND STATUS_CD <> 'Pending-Online'
ORDER BY T."NAME", STATUS_CD DESC
A where clause comes before a group by clause, so the query
shown in the question isn't valid.
Always prefix every column reference with the relevant table alias, below I have assumed that where you did not use the alias that it belongs to the orders table.
You probably do not need a subquery for this left join. While there are times when a subquery is needed or good for performance, this does not appear to be the case here.
Most modern SQL compliant databases provide "window functions", and Teradata does do this. They are extremely useful, and here when you combine count() with an over clause you can get the total of all rows without needing another subquery or join.
Because there is neither sample data nor expected result provided with the question I do not actually know which numbers you really need for your percentage calculation. Instead I have opted to show you different ways to count so that you can choose the right ones. I suspect you are getting 100 for each row because the count(status_cd) is equal to the count(row_id). You need to count status_cd differently to how you count row_id. nb: The count() function increases by 1 for every non-null value
I changed the way your date filter is applied. It is not efficient to change data on every row to suit constants in a where clause. Leave the data untouched and alter the way you apply the filter to suit the data, this is almost always more efficient (search sargable)
SELECT
t.OTYPE
, o.STATUS_CD
, COUNT(o.STATUS_CD) count_status
, COUNT(t.ROW_ID count_row_id
, count(t.row_id) over() count_row_id_over
FROM dbo.S_ORDER o
LEFT JOIN dbo.S_ORDER_TYPE t ON t.TYPEJOIN = o.ORDER_TYPE_ID
/*Excludes pending and pending online orders */
WHERE o.CREATED >= '2018-09-21' AND o.CREATED < '2018-09-22'
AND o.STATUS_CD <> 'Pending'
AND o.STATUS_CD <> 'Pending-Online'
GROUP BY
t.OTYPE
, o.STATUS_CD
ORDER BY
t.OTYPE
, o.STATUS_CD DESC
As #TomC already noted, there's no need for the join to a Derived Table. The simplest way to get the percentage is based on a Group Sum. I also changed the date to an Standard SQL Date Literal and moved the where before group by.
SELECT
t."NAME",
o.STATUS_CD,
Count(o.STATUS_CD) AS STATVOL,
-- rule of thumb: multiply first then divide, otherwise you will get unexpected results
-- (Teradata rounds after each calculation)
100.00 * STATVOL / Sum(STATVOL) Over ()
FROM database.S_ORDER AS O
/* Finding definitions for status codes & attaching */
LEFT JOIN database.S_ORDER_TYPE AS t
ON t.ROW_ID = o.ORDER_TYPE_ID
/*Excludes pending and pending online orders */
-- if o.CREATED is a Timestamp there's no need to apply the CAST
WHERE Cast(o.CREATED AS DATE) = DATE '2018-09-21'
AND o.STATUS_CD NOT IN ('Pending', 'Pending-Online')
GROUP BY (T.OTYPE, o.STATUS_CD)
ORDER BY T.OTYPE, o.STATUS_CD DESC
Btw, you probably don't need an Outer Join, Inner should return the same result.

SQL - Two columns group by issue with MAX function

SELECT artist.name, recording.name, MAX(recording.length)
FROM recording
INNER JOIN (artist_credit
INNER JOIN (artist_credit_name
INNER JOIN artist
ON artist_credit_name.artist_credit=artist.id)
ON artist_credit_name.artist_credit=artist_credit.id)
ON recording.artist_credit=artist_credit.id
WHERE artist.gender=1
AND recording.length <= (SELECT MAX(recording.length) FROM recording)
GROUP BY artist.name, recording.name
ORDER BY artist.name
We are using the MusicBrainz database for school and we are having troubles with the "GROUP BY" because we have two columns (it works with one column, but not two). We want the result to display just one artist with his second longest recording time, but the code displays all the recording time of every song of the same artist.
Any suggestions? Thanks.
You don't need to do multiple joins looking closely at the join conditions. They can be reduce to just one join as shown below.
SELECT DISTINCT B.name, A.name, A.length
FROM recording A JOIN artist B
ON A.artist_credit=B.id
WHERE B.gender=1
AND A.length=(SELECT C.length FROM recording C
WHERE C.artist_credit=B.artist_credit
ORDER BY C.length LIMIT 1, 1)
ORDER BY B.name;
See Using MySQL LIMIT to get the nth highest value
As others have pointed out, the join statement can be reduced. Also there seems to be a problem with the operator in the AND statement; it should be < and not <= in order to get the second highest length (Se here: What is the simplest SQL Query to find the second largest value?).
I would suggest trying out the following:
SELECT artist.name, recording.name, MAX(recording.length)
FROM recording
JOIN artist ON recording.artist_credit = artist.id
WHERE
artist.gender=1
AND
recording.length < (SELECT MAX(recording.length) FROM recording)
GROUP BY artist.name
ORDER BY artist.name

Include missing years in Group By query

I am fairly new in Access and SQL programming. I am trying to do the following:
Sum(SO_SalesOrderPaymentHistoryLineT.Amount) AS [Sum Of PaymentPerYear]
and group by year even when there is no amount in some of the years. I would like to have these years listed as well for a report with charts. I'm not certain if this is possible, but every bit of help is appreciated.
My code so far is as follows:
SELECT
Base_CustomerT.SalesRep,
SO_SalesOrderT.CustomerId,
Base_CustomerT.Customer,
SO_SalesOrderPaymentHistoryLineT.DatePaid,
Sum(SO_SalesOrderPaymentHistoryLineT.Amount) AS [Sum Of PaymentPerYear]
FROM
Base_CustomerT
INNER JOIN (
SO_SalesOrderPaymentHistoryLineT
INNER JOIN SO_SalesOrderT
ON SO_SalesOrderPaymentHistoryLineT.SalesOrderId = SO_SalesOrderT.SalesOrderId
) ON Base_CustomerT.CustomerId = SO_SalesOrderT.CustomerId
GROUP BY
Base_CustomerT.SalesRep,
SO_SalesOrderT.CustomerId,
Base_CustomerT.Customer,
SO_SalesOrderPaymentHistoryLineT.DatePaid,
SO_SalesOrderPaymentHistoryLineT.PaymentType,
Base_CustomerT.IsActive
HAVING
(((SO_SalesOrderPaymentHistoryLineT.PaymentType)=1)
AND ((Base_CustomerT.IsActive)=Yes))
ORDER BY
Base_CustomerT.SalesRep,
Base_CustomerT.Customer;
You need another table with all years listed -- you can create this on the fly or have one in the db... join from that. So if you had a table called alltheyears with a column called y that just listed the years then you could use code like this:
WITH minmax as
(
select min(year(SO_SalesOrderPaymentHistoryLineT.DatePaid) as minyear,
max(year(SO_SalesOrderPaymentHistoryLineT.DatePaid) as maxyear)
from SalesOrderPaymentHistoryLineT
), yearsused as
(
select y
from alltheyears, minmax
where alltheyears.y >= minyear and alltheyears.y <= maxyear
)
select *
from yearsused
join ( -- your query above goes here! -- ) T
ON year(T.SO_SalesOrderPaymentHistoryLineT.DatePaid) = yearsused.y
You need a data source that will provide the year numbers. You cannot manufacture them out of thin air. Supposing you had a table Interesting_year with a single column year, populated, say, with every distinct integer between 2000 and 2050, you could do something like this:
SELECT
base.SalesRep,
base.CustomerId,
base.Customer,
base.year,
Sum(NZ(data.Amount)) AS [Sum Of PaymentPerYear]
FROM
(SELECT * FROM Base_CustomerT INNER JOIN Year) AS base
LEFT JOIN
(SELECT * FROM
SO_SalesOrderT
INNER JOIN SO_SalesOrderPaymentHistoryLineT
ON (SO_SalesOrderPaymentHistoryLineT.SalesOrderId = SO_SalesOrderT.SalesOrderId)
) AS data
ON ((base.CustomerId = data.CustomerId)
AND (base.year = Year(data.DatePaid))),
WHERE
(data.PaymentType = 1)
AND (base.IsActive = Yes)
AND (base.year BETWEEN
(SELECT Min(year(DatePaid) FROM SO_SalesOrderPaymentHistoryLineT)
AND (SELECT Max(year(DatePaid) FROM SO_SalesOrderPaymentHistoryLineT))
GROUP BY
base.SalesRep,
base.CustomerId,
base.Customer,
base.year,
ORDER BY
base.SalesRep,
base.Customer;
Note the following:
The revised query first forms the Cartesian product of BaseCustomerT with Interesting_year in order to have base customer data associated with each year (this is sometimes called a CROSS JOIN, but it's the same thing as an INNER JOIN with no join predicate, which is what Access requires)
In order to have result rows for years with no payments, you must perform an outer join (in this case a LEFT JOIN). Where a (base customer, year) combination has no associated orders, the rest of the columns of the join result will be NULL.
I'm selecting the CustomerId from Base_CustomerT because you would sometimes get a NULL if you selected from SO_SalesOrderT as in the starting query
I'm using the Access Nz() function to convert NULL payment amounts to 0 (from rows corresponding to years with no payments)
I converted your HAVING clause to a WHERE clause. That's semantically equivalent in this particular case, and it will be more efficient because the WHERE filter is applied before groups are formed, and because it allows some columns to be omitted from the GROUP BY clause.
Following Hogan's example, I filter out data for years outside the overall range covered by your data. Alternatively, you could achieve the same effect without that filter condition and its subqueries by ensuring that table Intersting_year contains only the year numbers for which you want results.
Update: modified the query to a different, but logically equivalent "something like this" that I hope Access will like better. Aside from adding a bunch of parentheses, the main difference is making both the left and the right operand of the LEFT JOIN into a subquery. That's consistent with the consensus recommendation for resolving Access "ambiguous outer join" errors.
Thank you John for your help. I found a solution which works for me. It looks quiet different but I learned a lot out of it. If you are interested here is how it looks now.
SELECT DISTINCTROW
Base_Customer_RevenueYearQ.SalesRep,
Base_Customer_RevenueYearQ.CustomerId,
Base_Customer_RevenueYearQ.Customer,
Base_Customer_RevenueYearQ.RevenueYear,
CustomerPaymentPerYearQ.[Sum Of PaymentPerYear]
FROM
Base_Customer_RevenueYearQ
LEFT JOIN CustomerPaymentPerYearQ
ON (Base_Customer_RevenueYearQ.RevenueYear = CustomerPaymentPerYearQ.[RevenueYear])
AND (Base_Customer_RevenueYearQ.CustomerId = CustomerPaymentPerYearQ.CustomerId)
GROUP BY
Base_Customer_RevenueYearQ.SalesRep,
Base_Customer_RevenueYearQ.CustomerId,
Base_Customer_RevenueYearQ.Customer,
Base_Customer_RevenueYearQ.RevenueYear,
CustomerPaymentPerYearQ.[Sum Of PaymentPerYear]
;

SQL Error - not a group by expression

With this SQL Query, I am attempting to list overdue book by patron. In addition, I want to group and order the books by patron. I understand you have to use some kind of aggregate function when doing a GROUP BY function. Even so, I am getting a "not a group by expression" error. Any help is appreciated.
Edit: updated code
What if I want to get the total fees per person and book? The bottom code is still resulting in the same error.
SELECT PATRON.LAST_NAME, PATRON.FIRST_NAME, PATRON.PHONE, BOOK.BOOK_TITLE, CHECKOUT.DUE_DATE, SUM((SYSDATE - CHECKOUT.DUE_DATE) * 1.00) AS FEE_BALANCE
FROM CHECKOUT
JOIN PATRON
ON CHECKOUT.PATRON_ID=PATRON.PATRON_ID
JOIN COPY
ON CHECKOUT.COPY_ID=COPY.COPY_ID
JOIN BOOK
ON COPY.BOOK_ID=BOOK.BOOK_ID
WHERE CHECKOUT.RETURN_DATE IS NULL
AND CHECKOUT.DUE_DATE > SYSDATE
GROUP BY PATRON.PATRON_ID
HAVING SUM((SYSDATE - CHECKOUT.DUE_DATE) > 0
ORDER BY PATRON.LAST_NAME, PATRON.FIRST_NAME;
You need to add these 3 columns to the GROUP BY clause...
PATRON.PHONE, BOOK.BOOK_TITLE, CHECKOUT.DUE_DATE
So it would be...
SELECT
PATRON.LAST_NAME,
PATRON.FIRST_NAME,
PATRON.PHONE,
BOOK.BOOK_TITLE,
CHECKOUT.DUE_DATE,
SUM((SYSDATE - CHECKOUT.DUE_DATE) * 1.00) AS BALANCE
FROM CHECKOUT
JOIN PATRON
ON CHECKOUT.PATRON_ID=PATRON.PATRON_ID
JOIN COPY
ON CHECKOUT.COPY_ID=COPY.COPY_ID
JOIN BOOK
ON COPY.BOOK_ID=BOOK.BOOK_ID
WHERE CHECKOUT.RETURN_DATE IS NULL
AND CHECKOUT.DUE_DATE > SYSDATE
GROUP BY PATRON.LAST_NAME, PATRON.FIRST_NAME, PATRON.PHONE, BOOK.BOOK_TITLE, CHECKOUT.DUE_DATE
ORDER BY PATRON.LAST_NAME, PATRON.FIRST_NAME
Bottom line is, if you have a GROUP BY clause, you can only select columns that are part of the GROUP BY unless you are using some kind of scalar value function on them, like COUNT, SUM, MAX, MIN, etc.
Otherwise if you aren't grouping or using a scalar-value function, multiple rows could have different values for that column, so what would the query results be?