SQL calculate share of grouped variables to total count

It's probably very easy, but somehow I cannot get the desired result.
I have a large table of items sold. Each item has a category assigned (here A-D) and a country. I would like to calculate how many items were sold in Europe in each category, and what share of total European sales each category represents.
My data looks like this:
country | item_id | item_cat
--------+---------+---------
Europe  | 1       | A
Europe  | 2       | A
Europe  | 3       | B
Europe  | 4       | B
Europe  | 5       | C
Europe  | 6       | C
Europe  | 7       | C
USA     | 8       | D
USA     | 9       | D
USA     | 10      | D
My desired output looks like this:
country | item_cat | cat_sales | total_sales | share
--------+----------+-----------+-------------+------
Europe  | A        | 2         | 7           | 0.29
Europe  | B        | 2         | 7           | 0.29
Europe  | C        | 3         | 7           | 0.43
What I tried is:
SELECT
country,
item_cat,
count(*) as cat_sales,
count(*) OVER () as total_sales,
cat_sales / total_sales as share
FROM data
where country='Europe'
group by item_cat
But SQL tells me I cannot group and use windowing in one request.
How could I solve this?
Thanks in advance.

There are a few ways; one would be to pre-count the total sales in a CTE and then select from it for the remaining aggregate.
I don't use Impala, but in standard SQL this should work:
with tot as (
  -- attach each country's total row count to every row
  select *,
    Count(*) over(partition by country) * 1.0 as total_sales  -- * 1.0 avoids integer division
  from t
)
select country, item_cat,
  Count(*) as cat_sales,
  total_sales,
  Round(Count(*) / total_sales, 2) as Share
from tot
where country='Europe'  -- match the capitalisation used in the data
group by country, item_cat, total_sales
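
As a rough check, the CTE approach above can be run against the question's sample rows. This is only a sketch: it uses Python's built-in sqlite3 module instead of Impala, and the table name `data`, the in-memory database, and the extra ORDER BY are assumptions made for illustration. Window functions need SQLite 3.25 or later.

```python
import sqlite3

# In-memory database with the question's sample rows (table name "data" is assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (country TEXT, item_id INTEGER, item_cat TEXT)")
sample = [
    ("Europe", 1, "A"), ("Europe", 2, "A"),
    ("Europe", 3, "B"), ("Europe", 4, "B"),
    ("Europe", 5, "C"), ("Europe", 6, "C"), ("Europe", 7, "C"),
    ("USA", 8, "D"), ("USA", 9, "D"), ("USA", 10, "D"),
]
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", sample)

query = """
WITH tot AS (
    -- every row carries the total number of rows for its country
    SELECT *,
           COUNT(*) OVER (PARTITION BY country) * 1.0 AS total_sales
    FROM data
)
SELECT country, item_cat,
       COUNT(*) AS cat_sales,
       total_sales,
       ROUND(COUNT(*) / total_sales, 2) AS share
FROM tot
WHERE country = 'Europe'
GROUP BY country, item_cat, total_sales
ORDER BY item_cat
"""
share_rows = conn.execute(query).fetchall()
for row in share_rows:
    print(row)
```

Because total_sales is constant within a country, adding it to the GROUP BY does not split any group; it only makes the column legal to select next to the aggregate.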

Related

SQL Window Function over sliding time window

I have the following data:
country objectid objectuse
record_date
2022-07-20 chile 0 4
2022-07-01 chile 1 4
2022-07-02 chile 1 4
2022-07-03 chile 1 4
2022-07-04 chile 1 4
... ... ... ...
2022-07-26 peru 3088 4
2022-07-27 peru 3088 4
2022-07-28 peru 3088 4
2022-07-30 peru 3088 4
2022-07-31 peru 3088 4
The data describes the daily usage of an object within a country for a single month (July 2022), and not all objects are used every day. One of the things I am interested in finding is the sum of the monthly maximums for the month:
WITH month_max AS (
SELECT
country,
objectid,
MAX(objectuse) AS maxuse
FROM mytable
GROUP BY
country,
objectid
)
SELECT
country,
SUM(maxuse)
FROM month_max
GROUP BY country;
Which results in this:
country sum
-------------
chile 1224
peru 17008
But what I actually want is to get the rolling sum of the maxima from the beginning of the month up to each date. So that I get something that looks like:
country sum
record_date
2022-07-01 chile 1
2022-07-01 peru 1
2022-07-02 chile 2
2022-07-02 peru 3
... ... ...
2022-07-31 chile 1224
2022-07-31 peru 17008
I tried using a window function like this to no avail:
SELECT
*,
SUM(objectuse) OVER (
PARTITION BY country
ORDER BY record_date ROWS 30 PRECEDING
) as cumesum
FROM mytable
order BY cumesum DESC;
Is there a way I can achieve the desired result in SQL?
Thanks in advance.
EDIT: For what it's worth, I asked the same question but on Pandas and I received an answer; perhaps it helps to figure out how to do it in SQL.

What ended up working is probably not the most efficient approach to this problem. I essentially created backwards looking blocks from each day in the month back towards the beginning of the month. Within each of these buckets I get the maximum of objectuse for each objectid within that bucket. After taking the max, I sum across all the maxima for that backward looking period. I do this for every day in the data.
Here is the query that does it:
WITH daily_lookback AS (
SELECT
A.record_date,
A.country,
B.objectid,
MAX(B.objectuse) AS maxuse
FROM mytable AS A
LEFT JOIN mytable AS B
ON A.record_date >= B.record_date
AND A.country = B.country
AND DATE_PART('month', A.record_date) = DATE_PART('month', B.record_date)
AND DATE_PART('year', A.record_date) = DATE_PART('year', B.record_date)
GROUP BY
A.record_date,
A.country,
B.objectid
)
SELECT
record_date,
country,
SUM(maxuse) AS usetotal
FROM daily_lookback
GROUP BY
record_date,
country
ORDER BY
record_date;
Which gives me exactly what I was looking for: the cumulative sum of the objectid maximums for the backward looking period, like this:
country sum
record_date
2022-07-01 chile 1
2022-07-01 peru 1
2022-07-02 chile 2
2022-07-02 peru 3
... ... ...
2022-07-31 chile 1224
2022-07-31 peru 17008

You need to change your inner query to use the windowed maximum:
WITH month_max AS (
SELECT record_date, country, objectid,
MAX(objectuse) over (PARTITION BY country, objectid ORDER BY record_date) AS mx
FROM mytable
)
SELECT record_date, country, SUM(mx) as "sum"
FROM month_max
GROUP BY record_date, country;
This does assume one row per object per date.
Here's a re-written version of your query. With indexing it seems possible that it might run faster:
select record_date, country, min(usetotal) as usetotal
from mytable d inner join lateral (
select distinct sum(max(objectuse)) over () as usetotal from mytable a
where a.record_date between date_trunc('month', d.record_date) and d.record_date
and a.country = d.country
group by objectid
) T on 1 = 1
group by record_date, country
order by record_date, country;
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=63760e30aecf4c885ec4967045b6cd03
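
To see what the windowed-maximum query above produces, here is a small sketch using Python's sqlite3 module (an assumption; the thread's fiddle uses PostgreSQL). The six rows are made up for illustration and deliberately contain one row per object per date, which is the condition the answer says it relies on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable "
    "(record_date TEXT, country TEXT, objectid INTEGER, objectuse INTEGER)"
)
# Made-up rows: two objects in one country, one row per object per date.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        ("2022-07-01", "chile", 1, 1), ("2022-07-01", "chile", 2, 2),
        ("2022-07-02", "chile", 1, 3), ("2022-07-02", "chile", 2, 1),
        ("2022-07-03", "chile", 1, 2), ("2022-07-03", "chile", 2, 5),
    ],
)

query = """
WITH month_max AS (
    -- running maximum per object, from its first date up to the current row
    SELECT record_date, country, objectid,
           MAX(objectuse) OVER (
               PARTITION BY country, objectid ORDER BY record_date
           ) AS mx
    FROM mytable
)
SELECT record_date, country, SUM(mx) AS "sum"
FROM month_max
GROUP BY record_date, country
ORDER BY record_date, country
"""
running_rows = conn.execute(query).fetchall()
for row in running_rows:
    print(row)
```

Object 1 has running maxima 1, 3, 3 and object 2 has 2, 2, 5, so the daily sums are 3, 5 and 8. If an object had no row on some date, its running maximum would be missing from that date's sum, which is why the one-row-per-object-per-date assumption matters.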

Calculate the number of products responsible for 50% of my sales

I have a shop that sells products in different countries.
I end up with a sales table like this (with many more months):
Month   | Country | Product | Sales
--------+---------+---------+------
01-2022 | UK      | Tomato  | 10
01-2022 | UK      | Banana  | 4
01-2022 | UK      | Garlic  | 1
01-2022 | FR      | Tomato  | 1
01-2022 | FR      | Banana  | 2
01-2022 | FR      | Garlic  | 1
I would like to know the number of products responsible for 50% of the sales per month and country. Something like this.
Month   | Country | Nb products accountable for 50% sales
--------+---------+--------------------------------------
01-2022 | UK      | 1
02-2022 | UK      | 3
03-2022 | UK      | 2
01-2022 | FR      | 1
02-2022 | FR      | 4
03-2022 | FR      | 3
The objective is to find the percentage of my catalogue responsible for the majority of sales. Example: 10% of my catalogue represents 50% of sales.
I have tried to solve the problem with multiple window functions and I have already searched the open topics, without success.

I finally found a solution by tweaking window functions.
-- t0 is presumably an earlier CTE (not shown) that holds the sales rows
,t1 AS (
SELECT
*
-- running total of sales within each country and month, largest sales first
,SUM(sales) OVER (PARTITION BY country_group, order_date ORDER BY sales DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
-- half of the total sales of the same country and month
,0.5*SUM(sales) OVER(PARTITION BY country_group, order_date) AS total_sales_x_50perc
FROM t0
)
SELECT
order_date
,country_group
,COUNT(DISTINCT CASE WHEN running_total <= total_sales_x_50perc THEN product ELSE NULL END) AS nb_products
,COUNT(DISTINCT product) AS total_nb_products
-- 1.0 * avoids integer division
,1.0*COUNT(DISTINCT CASE WHEN running_total <= total_sales_x_50perc THEN product ELSE NULL END)/COUNT(DISTINCT product) AS perc
FROM t1
GROUP BY 1,2
ORDER BY 1
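
The running-total idea above can be tried on the question's sample rows. The sketch below uses Python's sqlite3 module with a hypothetical `sales` table and the question's column names. It also swaps in a slightly different condition, `running_total - sales < half_total`, which is an assumption rather than the answer's own test: it counts the product that crosses the 50% mark as well, which is what the desired value of 1 for UK in 01-2022 seems to imply (Tomato alone is 10 of 15).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table mirroring the question's sample data.
conn.execute("CREATE TABLE sales (month TEXT, country TEXT, product TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("01-2022", "UK", "Tomato", 10), ("01-2022", "UK", "Banana", 4),
        ("01-2022", "UK", "Garlic", 1),
        ("01-2022", "FR", "Tomato", 1), ("01-2022", "FR", "Banana", 2),
        ("01-2022", "FR", "Garlic", 1),
    ],
)

query = """
WITH t1 AS (
    SELECT *,
           -- running total, best sellers first; product name breaks ties
           SUM(sales) OVER (
               PARTITION BY month, country
               ORDER BY sales DESC, product
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS running_total,
           0.5 * SUM(sales) OVER (PARTITION BY month, country) AS half_total
    FROM sales
)
SELECT month, country,
       -- a product counts if the total *before* it was still under 50%
       SUM(CASE WHEN running_total - sales < half_total THEN 1 ELSE 0 END) AS nb_products,
       COUNT(*) AS total_nb_products
FROM t1
GROUP BY month, country
ORDER BY month, country
"""
half_rows = conn.execute(query).fetchall()
for row in half_rows:
    print(row)
```

With the answer's original `running_total <= total_sales_x_50perc` condition, UK would come out as 0 for this data, because the first product already exceeds half of the total; which of the two readings is wanted depends on how "responsible for 50%" is defined.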

How does GROUP BY work with multiple columns, like GROUP BY column_name(1), column_name(2), column_name(3)?

When I tried it, rows with the same value were no longer merged into one group. Why?
Example: GROUP BY a versus GROUP BY a, b, c.
Is there a difference between GROUP BY a and GROUP BY a, b, c?
I wrote a SQL query like this:
SELECT COUNT(CustomerID), Country
FROM Customers
GROUP BY Country;
Result:
Table: Customers
COUNT(CustomerID) Country
---------------------------------
3 Argentina
2 Austria
2 Belgium
9 Brazil
3 Canada
2 Denmark
2 Finland
Then I changed the query to:
SELECT COUNT(CustomerID), Country
FROM Customers
GROUP BY Country, CustomerID;
Table: Customers
COUNT(CustomerID) Country
---------------------------------
1 Germany
1 Mexico
1 Mexico
1 UK
1 Sweden
1 Germany
1 France
Why doesn't the changed query tie rows with the same Country together?
It displays a separate row for every customer.
I wonder how this works. Thank you.
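
A small experiment may make the difference visible. GROUP BY forms one group per distinct combination of the listed columns, so adding a column that is unique per row (such as CustomerID) leaves every row in its own group and every count at 1. The sketch below uses Python's sqlite3 module with four made-up customers; the data and the ORDER BY clauses are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Country TEXT)")
# Made-up rows: two customers share the country Mexico.
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)",
    [(1, "Germany"), (2, "Mexico"), (3, "Mexico"), (4, "UK")],
)

# One group per distinct Country.
by_country = conn.execute(
    "SELECT COUNT(CustomerID), Country FROM Customers "
    "GROUP BY Country ORDER BY Country"
).fetchall()

# One group per distinct (Country, CustomerID) pair; CustomerID is unique,
# so each group holds exactly one row.
by_country_and_id = conn.execute(
    "SELECT COUNT(CustomerID), Country FROM Customers "
    "GROUP BY Country, CustomerID ORDER BY Country, CustomerID"
).fetchall()

print(by_country)
print(by_country_and_id)
```

In the first result Mexico appears once with a count of 2; in the second it appears twice with a count of 1 each, matching the output shown in the question.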

SQLite percentages with small values

So I have this table of subscribers of users and the country they are in.
UserID | Name | Country
-------+-------------------+------------
1 | Zaphod Beeblebrox | UK
2 | Arthur Dent | UK
3 | Gene Kelly | USA
4 | Nat King Cole | USA
I need to produce a list of all the users by percentage from each of the countries. I also need all the smaller member countries (under 1%) to be collapsed into an "OTHERS" category.
I can accomplish a simple "top x" of members trivially with a
SELECT COUNTRY, COUNT(*) AS POPULATION FROM SUBSCRIBERS GROUP BY COUNTRY ORDER BY POPULATION DESC LIMIT 10
and can generate the percentages by PHP server side code, but I don't quite know how to:
Do all of it in SQL including percentage calculations directly in the result
Club all under 1% members into a single OTHERS category.
So I need something like this:
Country | Population
--------+-----------
USA | 25.4%
Brazil | 12%
UK | 5%
OTHERS | 65%
Appreciate the help!

Here is a query for this. I used a subquery to count the total number of rows and then used that to get the percentage value for each country. The 'Others' category is generated in a separate query. Rows are sorted by descending population, with the Others row last.
SELECT * FROM
(SELECT country , ROUND((100.0*COUNT(*)/count_all),1) ||'%' AS population
FROM (SELECT count(*) count_all FROM subscribers) AS sq,
subscribers s
WHERE (SELECT 100*count(*)/count_all
FROM subscribers s2
WHERE s2.country = s.country) > 1
GROUP BY country
ORDER BY population DESC)
UNION ALL
SELECT 'OTHERS', IFNULL(ROUND(100.0*COUNT(*)/count_all,1),0.0) ||'%' AS population
FROM (SELECT count(*) count_all FROM subscribers) AS sq,
subscribers s
WHERE (SELECT 100*count(*)/count_all
FROM subscribers s2
WHERE s2.country = s.country) <= 1

Ok I think I might have found a way to do this that's a hell of a lot quicker on execution speed:
SELECT territory,
Round(Sum(percentage), 3) AS Population
FROM (SELECT
Round((Count(*)*100.0)/(SELECT Count(*) FROM subscribers),3) AS Percentage,
CASE
WHEN ((Count(*)*100.0)/(SELECT Count(*) FROM subscribers)) > 2 THEN
country
ELSE 'Other'
END AS Territory
FROM subscribers
GROUP BY country
ORDER BY percentage DESC)
GROUP BY territory
ORDER BY population DESC;
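
For anyone who wants to try the CASE-based grouping above, here is a sketch with Python's sqlite3 module. The subscriber counts are invented (120 USA, 77 UK and three single-member countries), the cutoff is the question's 1% rather than the 2 used in the query above, and the extra ORDER BY term that pushes OTHERS to the bottom is an added assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE subscribers (UserID INTEGER PRIMARY KEY, Name TEXT, Country TEXT)"
)
# Invented population: 200 subscribers, three countries with 0.5% each.
countries = ["USA"] * 120 + ["UK"] * 77 + ["Chile", "Peru", "Fiji"]
conn.executemany(
    "INSERT INTO subscribers (Country) VALUES (?)", [(c,) for c in countries]
)

query = """
SELECT territory,
       ROUND(SUM(pct), 1) AS population
FROM (
    SELECT COUNT(*) * 100.0 / (SELECT COUNT(*) FROM subscribers) AS pct,
           CASE
               WHEN COUNT(*) * 100.0 / (SELECT COUNT(*) FROM subscribers) >= 1
                   THEN country
               ELSE 'OTHERS'   -- countries under 1% are folded together
           END AS territory
    FROM subscribers
    GROUP BY country
)
GROUP BY territory
ORDER BY territory = 'OTHERS', population DESC   -- OTHERS sorts last
"""
pct_rows = conn.execute(query).fetchall()
for territory, population in pct_rows:
    print(f"{territory} | {population}%")
```

Keeping the percentage numeric until the final formatting step means the sort is done on numbers rather than on strings such as '5%' and '12%'.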

SQL - select top xx% rows

I have a table, sales, which is ordered by descending TotalSales
user_id | TotalSales
----------------------
4 10
2 1.5
5 0.99
3 0.5
1 0.33
What I would like to do is find the percentage of the sum of all sales that the xx% most important sales represent.
For example if I wanted to do it for top 40% sales, here I would get (10+1.5)/(10+1.5+0.99+0.5+0.33)= 86%
But right now I haven't been able to select "top xx% rows".
Edit: DB management system can be MySQL or Vertica or Hive

select Sum(TotalSales) as s from sales where TotalSales >= x
select Sum(TotalSales) as b from sales
Your result is s/b,
where x is the TotalSales cutoff that corresponds to the percentage you set each time.
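
Another way to pick the "top xx% of rows" directly is to number the rows with a window function and compare the row number with a fraction of the row count. This is a sketch under assumptions, not something taken from the thread: it runs on SQLite (3.25 or later) through Python's sqlite3 module rather than on MySQL, Vertica or Hive, the variable `top_fraction` is a name chosen here, and ties and fractional cutoffs are resolved by simply keeping rows whose number is at most `fraction * row count`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (user_id INTEGER, TotalSales REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(4, 10), (2, 1.5), (5, 0.99), (3, 0.5), (1, 0.33)],
)

query = """
WITH ranked AS (
    SELECT TotalSales,
           ROW_NUMBER() OVER (ORDER BY TotalSales DESC) AS rn,  -- 1 = largest sale
           COUNT(*) OVER () AS n                                -- total row count
    FROM sales
)
SELECT SUM(CASE WHEN rn <= n * ? THEN TotalSales ELSE 0 END) / SUM(TotalSales)
FROM ranked
"""
top_fraction = 0.4  # top 40% of rows, i.e. 2 of the 5 rows here
top_share = conn.execute(query, (top_fraction,)).fetchone()[0]
print(f"{top_share:.0%}")
```

For the sample table this keeps the rows with 10 and 1.5, so the share is (10+1.5)/(10+1.5+0.99+0.5+0.33), the 86% worked out in the question.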
