SQL Percentile function missing?

This textbook example is a mystery to me. We have simply:
Transaction_ID (primary key)  Client_ID  Transaction_Amount  Month
1                             1          500                 1
2                             1          1000                1
3                             1          10                  2
4                             2          11                  2
5                             3          300                 2
6                             3          10                  2
...                           ...        ...                 ...
I want to calculate in SQL the mean(Transaction_Amount), std(Transaction_Amount) and some percentile(Transaction_Amount), grouped by Client_ID. But it seems that, even though a percentile is a very similar calculation to the standard deviation, SQL cannot do it with a simple statement such as:
SELECT
    mean(Transaction_Amount),
    std(Transaction_Amount),
    percentile(Transaction_Amount)
FROM
    myTable
GROUP BY
    Client_ID, Month
Or can it?
It gets worse because I also need to group by Month in addition to Client_ID.
Thanks a lot!
Sven

I'm sure Oracle can do the calculations you want. I just don't know what they are. You specify that you want something grouped by ClientId. Yet, your sample query has two keys in the GROUP BY.
Some functions that you want to look at are:
AVG()
STDDEV()
PERCENT_RANK()
Without sample data and desired results (or a very clear explanation of what you are trying to calculate), I can't put together a query.
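That said, if the grouping really is by Client_ID and Month, the general shape of the query in Oracle would be something like the sketch below. PERCENTILE_CONT(0.5), i.e. the median, is an assumption here, since the question doesn't say which percentile is needed:
SELECT
    Client_ID,
    Month,
    AVG(Transaction_Amount)    AS mean_amount,
    STDDEV(Transaction_Amount) AS std_amount,
    PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY Transaction_Amount) AS median_amount
FROM
    myTable
GROUP BY
    Client_ID, Month
PERCENTILE_CONT interpolates between values; PERCENTILE_DISC (or PERCENT_RANK for the inverse question) may fit better depending on which definition of "percentile" is wanted.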

SQL Query to Count IDs

I'm trying to write a SQL Query in MS Access to count the number of times each ID appears in a data set. The data set is formatted as follows:
ID  Time
1   12345
1   12346
1   12350
2   99999
2   99999
If the Time for one ID is within 3 seconds of another Time for that same ID, I only want it to be counted once. So the results should look like this:
ID  Count
1   2
2   1
The time column is not formatted as a datetime, so I can't use the datediff function. Any help would be appreciated.
This:
SELECT ID, COUNT(newtime)
FROM (SELECT DISTINCT ID, Time\3 AS newtime FROM times)
GROUP BY ID
groups the Time field values into buckets of three using the integer division Time\3 in Access.
The comment provided by @Andy G worked for my purposes:
"You first need a function to round up (or down) to the nearest multiple of 3. See here (allenbrowne) for example."
I rounded the time values to the nearest multiple of 3, and counted based on that criteria.
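A sketch of that approach in Access SQL; the exact rounding expression is an assumption based on the Allen Browne-style technique referenced above, with Int([Time]/3 + 0.5) * 3 rounding to the nearest multiple of 3:
SELECT ID, COUNT(rounded) AS CountPerID
FROM (SELECT DISTINCT ID, Int([Time]/3 + 0.5) * 3 AS rounded FROM times)
GROUP BY ID
For the sample data this yields 2 for ID 1 (12345 and 12346 round to the same multiple, 12350 does not) and 1 for ID 2.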

Count Distinct values in one column based on other columns

I have a table that looks like the following:
app_id  supplier_reached  creation_date  platform
10001   1                 9/11/2018      iOS
10001   2                 9/18/2018      iOS
10002   1                 5/16/2018      android
10003   1                 5/6/2018       android
10004   1                 10/1/2018      android
10004   1                 2/3/2018       android
10004   2                 2/2/2018       web
10005   4                 1/5/2018       web
10005   2                 5/1/2018       android
10006   3                 10/1/2018      iOS
10005   4                 1/1/2018       iOS
The objective is to find the unique number of app_id submitted per month.
If I just do a count(distinct app_id) I will get the following results:
Group by month  count(app number)
Jan             1
Feb             1
May             3
September       1
October         2
However, an application is considered unique based on a combination of other fields as well. For example, for the month of January the app_id is the same, but the combination of app_id, supplier_reached, and platform shows different values, and hence the app_id should be counted twice.
Following the same pattern, the desired result should be:
Group by month  Desired answer
Jan             2
Feb             2
May             3
September       2
October         2
Lastly, there can be many other columns in the table which may or may not contribute to the uniqueness of an application.
Is there a way to do this type of count in SQL?
I am using Redshift.
As pointed out above, in Redshift count(distinct ...) does not work with multiple fields.
You can first group by the columns that you want to be unique and then count the records like this:
select month, count(1) as app_number
from (
    select month, app_id, supplier_reached, platform
    from your_table
    group by 1, 2, 3, 4
) t
group by 1
I don't think Postgres or Redshift supports COUNT(DISTINCT) with multiple arguments. One workaround is to use concatenation:
count(distinct app_id || ':' || supplier_reached || ':' || platform)
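Spelled out as a full query (a sketch only: the month bucket is derived here with DATE_TRUNC, which assumes creation_date is a date or timestamp column, and the explicit casts to varchar are an assumption to keep the concatenation safe for the numeric columns):
select date_trunc('month', creation_date) as month,
       count(distinct app_id::varchar || ':' || supplier_reached::varchar || ':' || platform) as app_number
from your_table
group by 1
order by 1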
Your objective, as worded, is wrong.
You don't want
to find the unique number of app_id submitted per month
you want
to find the unique number of app_id + supplier_reached + platform submitted per month.
And so, you need to use either a) a combination of columns, like count(distinct col1||col2||col3), or b) a query like this:
select t1.month, count(t1.*)
from (select distinct
          app_id,
          supplier_reached,
          platform,
          month
      from sometable) t1
group by month
Actually, you can count distinct ROW values conveniently in Postgres:
SELECT month, count(DISTINCT (app_id, supplier_reached, platform)) AS dist_apps
FROM tbl
GROUP BY 1;
The ROW keyword would be just noise here:
count(DISTINCT ROW(app_id, supplier_reached, platform))
I would discourage concatenating columns for the purpose. This is comparatively expensive, error prone (think of distinct data types and locale-dependent text representation) and introduces corner-case errors if the used separator can be contained in column values.
Alas, not supported by Redshift:
...
Value expressions
Subscripted expressions
Array constructors
Row constructors
...

How can I return rows which match on some columns and fulfil a DateTime comparison between two other columns using SQL?

I have a table which contains rows for jobs, example below, where 01/01/1980 is used rather than null in the ClosedDate column for jobs which are not finished:
JobNumber  JobCategory  CustomerID  CreatedDate  ClosedDate
1          Small        1           01/01/2016   03/01/2016
2          Small        2           03/01/2016   07/01/2016
3          Large        2           06/01/2016   07/01/2016
4          Medium       1           08/01/2016   10/01/2016
5          Small        3           10/01/2016   01/01/1980
6          Medium       3           15/01/2016   01/01/1980
7          Large        2           16/01/2016   17/01/2016
8          Large        2           19/01/2016   20/01/2016
9          Small        1           19/01/2016   01/01/1980
10         Medium       2           19/01/2016   01/01/1980
I need to return a list of any jobs where the same customer has had a job of the same category created within 3 days of the previous job being closed.
So, I would want to return:
7          Large        2           16/01/2016   17/01/2016
8          Large        2           19/01/2016   20/01/2016
because Customer 2 had a Large job closed on 17/01/2016 and another Large job opened on 19/01/2016, which is within 3 days.
In order to do this, I assume I need to compare each record in the table with each subsequent record, looking for a match on JobCategory and comparing CreatedDate with ClosedDate between rows.
Can anyone advise my best option for this using SQL? I'm using SQL Server 2012.
The first thing that you should do is get rid of "magic dates" in your system. If the job hasn't been closed yet then the ClosedDate is not known. SQL has a value for exactly that - NULL. That prevents anyone in the future from having to know the magic date of 1/1/1980 or from that having to be hard-coded throughout your system.
Next, you don't have to compare each row with each subsequent one. Define what you're looking for and find matches that meet those qualifications. You didn't specify which type of SQL database you're using (you should tag your question with Oracle or MySQL or SQL Server), so the query below is written for SQL Server. Your version might have different date functions.
SELECT
    J1.JobNumber,
    J1.JobCategory,
    J1.CustomerID,
    J1.CreatedDate,
    J1.ClosedDate,
    J2.JobNumber,
    J2.CreatedDate,
    J2.ClosedDate
FROM
    Jobs J1
    INNER JOIN Jobs J2 ON
        J2.CustomerID = J1.CustomerID AND
        J2.JobCategory = J1.JobCategory AND
        DATEDIFF(DAY, J1.ClosedDate, J2.CreatedDate) BETWEEN 0 AND 3 AND
        J2.JobNumber <> J1.JobNumber
This will return the jobs in a single row instead of two rows. If that's a problem then the query could be altered slightly to do so. This can also be done a little more easily with windowed functions, but again, since you didn't specify your SQL vendor I didn't want to use those.
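If two rows per match are preferred, one possible alteration (a sketch built on the self-join above, applying the date test in both directions so each job of a matching pair comes back as its own row) is:
SELECT DISTINCT
    J1.JobNumber,
    J1.JobCategory,
    J1.CustomerID,
    J1.CreatedDate,
    J1.ClosedDate
FROM
    Jobs J1
    INNER JOIN Jobs J2 ON
        J2.CustomerID = J1.CustomerID AND
        J2.JobCategory = J1.JobCategory AND
        J2.JobNumber <> J1.JobNumber AND
        (DATEDIFF(DAY, J1.ClosedDate, J2.CreatedDate) BETWEEN 0 AND 3 OR
         DATEDIFF(DAY, J2.ClosedDate, J1.CreatedDate) BETWEEN 0 AND 3)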
Since you're using SQL Server, you should be able to use windowed functions like so:
;WITH CTE_JobsWithDates AS -- Probably a poor name for the CTE
(
    SELECT
        JobNumber,
        JobCategory,
        CustomerID,
        CreatedDate,
        ClosedDate,
        LEAD(CreatedDate, 1) OVER (PARTITION BY JobCategory, CustomerID ORDER BY CreatedDate) AS NextCreatedDate,
        LAG(ClosedDate, 1) OVER (PARTITION BY JobCategory, CustomerID ORDER BY CreatedDate) AS PreviousClosedDate
    FROM
        Jobs
)
SELECT
    JobNumber,
    JobCategory,
    CustomerID,
    CreatedDate,
    ClosedDate
FROM
    CTE_JobsWithDates
WHERE
    DATEDIFF(DAY, ClosedDate, NextCreatedDate) BETWEEN 0 AND 3 OR
    DATEDIFF(DAY, PreviousClosedDate, CreatedDate) BETWEEN 0 AND 3
That was off the cuff, so please test and let me know if anything isn't quite right.
Try:
SELECT a.*
FROM
Job AS a
JOIN
Job AS b ON
a.CustomerID = b.CustomerID AND a.JobCategory = b.JobCategory
WHERE
a.JobNumber != b.JobNumber
AND (
b.CreatedDate - a.ClosedDate BETWEEN 0 AND 3
OR
a.CreatedDate - b.ClosedDate BETWEEN 0 AND 3)
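A caveat on the last query: subtracting one date column from another like that only works when the columns are the older datetime type, where the result and the 0/3 bounds are implicitly converted; SQL Server does not allow the minus operator on the date type. If the columns are date, the same test can be written with DATEDIFF (a sketch, same tables and aliases as above):
SELECT a.*
FROM
    Job AS a
    JOIN
    Job AS b ON
        a.CustomerID = b.CustomerID AND a.JobCategory = b.JobCategory
WHERE
    a.JobNumber != b.JobNumber
    AND (DATEDIFF(DAY, a.ClosedDate, b.CreatedDate) BETWEEN 0 AND 3
         OR DATEDIFF(DAY, b.ClosedDate, a.CreatedDate) BETWEEN 0 AND 3)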

SQL statement to match dates that are the closest?

I have the following table, let's call it Names:
Name    Id  Date
Dirk    1   27-01-2015
Jan     2   31-01-2015
Thomas  3   21-02-2015
Next I have another table called Consumption:
Id  Date        Consumption
1   26-01-2015  30
1   01-01-2015  20
2   01-01-2015  10
2   05-05-2015  20
Now, I think that doing this using SQL is the fastest approach, since the table contains about 1.5 million rows.
So the problem is as follows: I would like to match each Id from the Names table with the Consumption table such that the difference between the dates is the lowest. So we have: Dirk consumes about 30 on 27-01-2015. In case there are two dates that have the same difference, I would like to calculate the average consumption on those two dates.
While I know how to join, I do not know how to code the difference part.
Thanks.
DBMS is Microsoft SQL Server 2012.
I believe that my question differs from the one mentioned in the comments, because it is much more complicated since it involves comparison of dates between two tables rather than having one date and comparing it with the rest of the dates in the table.
This is how you could do it in SQL Server:
SELECT Id, Name, AVG(Consumption)
FROM (
    SELECT n.Id, Name, Consumption,
           RANK() OVER (PARTITION BY n.Id
                        ORDER BY ABS(DATEDIFF(d, n.[Date], c.[Date]))) AS rnk
    FROM Names AS n
    INNER JOIN Consumption AS c ON n.Id = c.Id ) t
WHERE t.rnk = 1
GROUP BY Id, Name
Using RANK with PARTITION BY n.Id and ORDER BY ABS(DATEDIFF(d, n.[Date], c.[Date])) you can locate all matching records per Id: all records with the smallest difference in days are going to have rnk = 1.
Then, using AVG in the outer query, you are calculating the average value of Consumption between all matching records.

SQL DB calculation moving summary

I would like to calculate a moving summary:
Total amount: 100
First receipt: 20
Second receipt: 10
The first row in the calculation column is the difference between the total amount and the first receipt: 100 - 20 = 80.
The second row in the calculation column is the difference between the first calculated row and the second receipt: 80 - 10 = 70.
The presentation is supposed to present receipt_amount, balance:
receipt_amount | balance
20 | 80
10 | 70
I'd be glad for your help.
Thanks :-)
You didn't really give us much information about your tables and how they are structured.
I'm assuming that there is an orders table that contains the total_amount and a receipt_table that contains each receipt (as a positive value):
As you also didn't specify your DBMS, this is ANSI SQL:
select sum(amount) over (order by receipt_nr) as running_sum
from (
    select 0 as receipt_nr, total_amount as amount
    from orders
    where order_no = 1
    union all
    select receipt_nr, -1 * receipt_amount
    from the_receipt_table
    where order_no = 1
) t
First of all, thanks for your response.
I work with a Cache DB, which can be used with both SQL and Oracle syntax.
Basically, the data is located in two different tables, but I have them in one join query.
There are a couple of rows with different receipt amounts, and each row (receipt) has the same total amount.
For example:
Receipt_no  Receipt_amount  Total_amount  Balance
1           20              100           80
1           10              100           70
1           30              100           40
2           20              50            30
2           10              50            20
So, the calculation is supposed to work in such a way that for the first receipt the difference is taken from the total_amount, and all other receipts (within the same receipt_no) are then subtracted from the running balance.
Thanks!
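A hedged sketch of that per-group balance, assuming the joined result is available as a table or view (receipts_joined is a made-up name here) and that some column, called receipt_order below, defines the order of the receipts within each Receipt_no (the sample data does not show one):
select Receipt_no,
       Receipt_amount,
       Total_amount,
       Total_amount - sum(Receipt_amount) over (
           partition by Receipt_no
           order by receipt_order
           rows unbounded preceding
       ) as Balance
from receipts_joined
This uses the same ANSI window-function syntax as the answer above, so whether it runs as-is depends on the Caché version's support for window functions.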