Count Distinct values in one column based on other columns

Count Distinct values in one column based on other columns - sql

I have a table that looks like the following:
app_id supplier_reached creation_date platform
10001 1 9/11/2018 iOS
10001 2 9/18/2018 iOS
10002 1 5/16/2018 android
10003 1 5/6/2018 android
10004 1 10/1/2018 android
10004 1 2/3/2018 android
10004 2 2/2/2018 web
10005 4 1/5/2018 web
10005 2 5/1/2018 android
10006 3 10/1/2018 iOS
10005 4 1/1/2018 iOS
The objective is to find the unique number of app_id submitted per month.
If I just do a count(distinct app_id) I will get the following results:
Group by month count(app number)
Jan 1
Feb 1
may 3
september 1
october 2
However, an application is considered unique based on a combination of other fields as well. For example, for the month of January, the app_id is the same however a combination of app_id, supplier_reached and platform show different values and hence the app_id should be counted twice.
Following the same pattern, the desired result should be:
Group by month Desired answer
Jan 2
Feb 2
may 3
september 2
october 2
Lastly, there can be many other columns in the table which may or may not contribute to the uniqueness of an application.
Is there a way to do this type of count in SQL?
I am using Redshift.

As pointed out above, in Redshift count(distinct ...) does not work with multiple fields.
You can first group by the columns that you want to be unique and then count the records like this:
select month,count(1) as app_number
from (
select month,app_id,supplier_reached,platform
from your_table
group by 1,2,3,4
)
group by 1

I don't think Postgres or Redshift supports COUNT(DISTINCT) with multiple arguments. One workaround is to use concatenation:
count(distinct app_id || ':' || supplier_reached || ':' || platform)

Your objective's mean is wrong.
You don't want
to find the unique number of app_id submitted per month
you want
to find the unique number of app_id + supplier_reached + platform submitted per month.
And so, you need to use a) combination of columns like count(distinct col1||col2||col3) or b)
select t1.month, count(t1.*)
(select distinct
app_id,
supplier_reached,
platform,
month
from sometable) t1
group by month

Actually, you can count distinct ROW values conveniently in Postgres:
SELECT month, count(DISTINCT (app_id, supplier_reached, platform)) AS dist_apps
FROM tbl
GROUP BY 1;
The ROW keyword would be just noise here:
count(DISTINCT ROW(app_id, supplier_reached, platform))
I would discourage concatenating columns for the purpose. This is comparatively expensive, error prone (think of distinct data types and locale-dependent text representation) and introduces corner-case errors if the used separator can be contained in column values.
Alas, not supported by Redshift:
...
Value expressions
Subscripted expressions
Array constructors
Row constructors
...

Related

Using Parameter within timestamp_trunc in SQL Query for DataStudio

I am trying to use a custom parameter within DataStudio. The data is hosted in BigQuery.
SELECT
timestamp_trunc(o.created_at, #groupby) AS dateMain,
count(o.id) AS total_orders
FROM `x.default.orders` o
group by 1
When I try this, it returns an error saying that "A valid date part name is required at [2:35]"
I basically need to group the dates using a parameter (e.g. day, week, month).
I have also included a screenshot of how I have created the parameter in Google DataStudio. There is a default value set which is "day".

A workaround that might do the trick here is to use a rollup in the group by with the different levels of aggregation of the date, since I am not sure you can pass a DS parameter to work like that.
See the following example for clarity:
with default_orders as (
select timestamp'2021-01-01' as created_at, 1 as id
union all
select timestamp'2021-01-01', 2
union all
select timestamp'2021-01-02', 3
union all
select timestamp'2021-01-03', 4
union all
select timestamp'2021-01-03', 5
union all
select timestamp'2021-01-04', 6
),
final as (
select
count(id) as count_orders,
timestamp_trunc(created_at, day) as days,
timestamp_trunc(created_at, week) as weeks,
timestamp_trunc(created_at, month) as months
from
default_orders
group by
rollup(days, weeks, months)
)
select * from final
The output, then, would be similar to the following:
count | days | weeks | months
------+------------+----------+----------
6 | null | null | null <- this, represents the overall (counted 6 ids)
2 | 2021-01-01| null | null <- this, the 1st rollup level (day)
2 | 2021-01-01|2020-12-27| null <- this, the 1st and 2nd (day, week)
2 | 2021-01-01|2020-12-27|2021-01-01 <- this, all of them
And so on.
At the moment of visualizing this on data studio, you have two options: setting the metric as Avg instead of Sum, because as you can see there's kind of a duplication at each stage of the day column; or doing another step in the query and get rid of nulls, like this:
select
*
from
final
where
days is not null and
weeks is not null and
months is not null

Impala: values are in wrong columns in result query

In my result query the values are in wrong columns.
My SQL Query is like:
create table some_database.table name as
select
extract(year from t.operation_date) operation_year,
extract(month from t.operation_date) operation_month,
extract(day from t.operation_date) operation_day,
d.status_name,
sum(t.operation_amount) operation_amt,
current_timestamp() calculation_moment
from operations t
left join status_dict d on
d.status_id = t.status_id
group by
extract(year from t.operation_date) operation_year,
extract(month from t.operation_date) operation_month,
extract(day from t.operation_date) operation_day,
d.status_name
(In fact, it's more complicated, but the main idea is that I'm aggregating source table and making some joins.)
The result I get is like:
#
operation_year
operation_month
operation_day
status_name
operation_amt
1
2021
1
1
success
100
2
2021
1
1
success
150
3
2021
1
2
success
120
4
null
2021-01-01 21:53:00
success
120
null
The problem is in row 4.
The field t.operation_date is not nullable, but in result query in column operation_year we get null
In operation_month we get untruncated timestamp
In operation_day we get string value from d.status_name
In status_name we get numeric aggregate from t.operation_amount
In operation_amt we get null
It looks very similar to a wrong parsing of a csv file when values jump to other columns, but obviously it can't be the case here. I can't figure out how on earth is it possible. I'm new to Hadoop and apparently I'm not aware of some important concept which causes the problem.

How to write the SQL query to select an ID if all 7 day week numbers exists

I have a table similar to the one in the attached picture.
The IDs in the ID-2 column have 1-7 day of week numbers in 'Day of week'column e.g ID-2 = 100.
However, sometimes there won't have all 7 day of week numbers. For example in ID-2= 500 has the Day of Week numbers 3,4 and 6 missing.
In my SQL query I want to select the distinct ID when only if it has all 1-7 day of week numbers.
I wonder if someone could guide me on this? In SQL what concept can be used for it, i.e. Case, partition, self join etc?

You can use group by and having. Assming that day_of_week is always between 1 and 7:
select id_2
from mytable
group by id_2
having count(distinct day_of_week) = 7
If there are no duplicates (id2, day_of_week) tuples, then the having clause can be more efficiently phrased as:
having count(*) = 7

SQL Find latest record only if COMPLETE field is 0

I have a table with multiple records submitted by a user. In each record is a field called COMPLETE to indicate if a record is fully completed or not.
I need a way to get the latest records of the user where COMPLETE is 0, LOCATION, DATE are the same and no additional record exist where COMPLETE is 1. In each record there are additional fields such as Type, AMOUNT, Total, etc. These can be different, even though the USER, LOCATION, and DATE are the same.
There is a SUB_DATE field and ID field that denote the day the submission was made and auto incremented ID number. Here is the table:
ID NAME LOCATION DATE COMPLETE SUB_DATE TYPE1 AMOUNT1 TYPE2 AMOUNT2 TOTAL
1 user1 loc1 2017-09-15 1 2017-09-10 Food 12.25 Hotel 65.54 77.79
2 user1 loc1 2017-09-15 0 2017-09-11 Food 12.25 NULL 0 12.25
3 user1 loc2 2017-08-13 0 2017-09-05 Flight 140 Food 5 145.00
4 user1 loc2 2017-08-13 0 2017-09-10 Flight 140 NULL 0 140
5 user1 loc3 2017-07-14 0 2017-07-15 Taxi 25 NULL 0 25
6 user1 loc3 2017-08-25 1 2017-08-26 Food 45 NULL 0 45
The results I would like is to retrieve are ID 4, because the SUB_DATE is later that ID 3. Which it has the same Name, Location, and Date information and there is no COMPLETE with a 1 value.
I would also like to retrieve ID 5, since it is the latest record for the User, Location, Date, and Complete is 0.
I would also appreciate it if you could explain your answer to help me understand what is happening in the solution.

Not sure if I fully understood but try this
SELECT *
FROM (
SELECT *,
MAX(CONVERT(INT,COMPLETE)) OVER (PARTITION BY NAME,LOCATION,DATE) AS CompleteForNameLocationAndDate,
MAX(SUB_DATE) OVER (PARTITION BY NAME, LOCATION, DATE) AS LastSubDate
FROM your_table t
) a
WHERE CompleteForNameLocationAndDate = 0 AND
SUB_DATE = LastSubDate
So what we have done here:
First, if you run just the inner query in Management Studio, you will see what that does:
The first max function will partition the data in the table by each unique Name,Location,Date set.
In the case of your data, ID 1 & 2 are the first partition, 3&4 are the second partition, 5 is the 3rd partition and 6 is the 4th partition.
So for each of these partitions it will get the max value in the complete column. Therefore any partition with a 1 as it's max value has been completed.
Note also, the convert function. This is because COMPLETE is of datatype BIT (1 or 0) and the max function does not work with that datatype. We therefore convert to INT. If your COMPLETE column is type INT, you can take the convert out.
The second max function partitions by unique Name, Location and Date again but we are getting the max_sub date this time which give us the date of the latest record for the Name,Location,Date
So we take that query and add it to a derived table which for simplicity we call a. We need to do this because SQL Server doesn't allowed windowed functions in the WHERE clause of queries. A windowed function is one that makes use of the OVER keyword as we have done. In an ideal world, SQL would let us do
SELECT *,
MAX(CONVERT(INT,COMPLETE)) OVER (PARTITION BY NAME,LOCATION,DATE) AS CompleteForNameLocationAndDate,
MAX(SUB_DATE) OVER (PARTITION BY NAME, LOCATION, DATE) AS LastSubDate
FROM your)table t
WHERE MAX(CONVERT(INT,COMPLETE)) OVER (PARTITION BY NAME,LOCATION,DATE) = 0 AND
SUB_DATE = MAX(SUB_DATE) OVER (PARTITION BY NAME, LOCATION, DATE)
But it doesn't allow it so we have to use the derived table.
So then we basically SELECT everything from our derived table Where
CompleteForNameLocationAndDate = 0
Which are Name,Location, Date partitions which do not have a record marked as complete.
Then we filter further asking for only the latest record for each partition
SUB_DATE = LastSubDate
Hope that makes sense, not sure what level of detail you need?
As a side, I would look at restructuring your tables (unless of course you have simplified to better explain this problem) as follows:
(Assuming the table in your examples is called Booking)
tblBooking
BookingID
PersonID
LocationID
Date
Complete
SubDate
tblPerson
PersonID
PersonName
tblLocation
LocationID
LocationName
tblType
TypeID
TypeName
tblBookingType
BookingTypeID
BookingID
TypeID
Amount
This way if you ever want to add Type3 or Type4 to your booking information, you don't need to alter your table layout

SQL Percentile function missing?

It is a mystery to me with this text book example. We have simply:
Transaction_ID (primary key), Client_ID, Transaction_Amount, Month
1 1 500 1
2 1 1000 1
3 1 10 2
4 2 11 2
5 3 300 2
6 3 10 2
... ... ... ...
I want to calculate in SQL the mean(Transaction_Amount), std(Transaction_Amount) and the some percentile(Transaction amount) grouped by Client_ID. But is seems, even given that percentile is a very similar calculation than the standard deviation, SQL cannot do it with a simple statement as:
SELECT
mean(Transaction_Amount),
std(Transaction_Amount),
percentile(Transaction_Amount)
FROM
myTable
GROUP BY
Client_ID, Month
Or can it?
It gets worse becuase I also need to Group By Month in addition to Client_ID.
Thanks a lot!
Sven

I'm sure Oracle can do the calculations you want. I just don't know what they are. You specify that you want something grouped by ClientId. Yet, your sample query has two keys in the GROUP BY.
Some functions that you want to look at are:
AVG()
STDDEV()
PERCENT_RANK()
Without sample data and desired results (or a very clear explanation of what you are trying to calculate), I can't put together a query.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Count Distinct values in one column based on other columns - sql

I don't think Postgres or Redshift supports COUNT(DISTINCT) with multiple arguments. One workaround is to use concatenation: count(distinct app_id || ':' || supplier_reached || ':' || platform)

Related

Using Parameter within timestamp_trunc in SQL Query for DataStudio

Impala: values are in wrong columns in result query

How to write the SQL query to select an ID if all 7 day week numbers exists

SQL Find latest record only if COMPLETE field is 0

SQL Percentile function missing?

Categories

Resources