Sum of Different Rows on Condition - sql

I have a query that looks like this:
with x as (
select *, date_format(SomeDate, 'MMM') as Month from SomeTable
)
select *, count(Package) over (partition by Company, Region order by SomeDate) as BoxCount
from x
Table SomeTable basically looks like this:
Package Company Region SomeDate
1 A East 20220101
2 A East 20220105
3 A East 20220310
4 A East 20220411
5 A East 20220502
6 A West 20220405
7 A West 20220505
8 A West 20220508
9 B East 20220106
10 B East 20220212
11 B East 20220311
12 B West 20220505
13 B North 20220908
The result I want is basically this:
Company Month BoxCount
A Jan 2
A Mar 3
A Apr 4
A May 8
B Jan 1
B Feb 2
B Mar 3
B May 4
B Sept 5
What I want is basically a cumulative sum (CUSUM) by Company and Region. However, when the month is May, I'd like Region West to be calculated together with Region East, and in September I'd like all 3 regions calculated together for each respective company. Is there a way to do this in Spark SQL?
My query gives the cumulative count per Company and Region, but I'm not sure how to proceed from here.
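One possible reading of the expected output can be sketched as follows. This is not Spark SQL: it uses SQLite through Python's sqlite3 module as a stand-in so the logic can be run, and it would need adapting to Spark. The assumptions are that SomeDate is stored as a yyyymmdd integer, that all data falls in one year, that months are shown as numbers rather than names, and that the merge rule is "East only before May, East plus West from May, all regions from September". The CTE names x and months are hypothetical.

```python
import sqlite3

# Sample rows copied from the question (Package, Company, Region, SomeDate).
packages = [
    (1, "A", "East", 20220101), (2, "A", "East", 20220105),
    (3, "A", "East", 20220310), (4, "A", "East", 20220411),
    (5, "A", "East", 20220502), (6, "A", "West", 20220405),
    (7, "A", "West", 20220505), (8, "A", "West", 20220508),
    (9, "B", "East", 20220106), (10, "B", "East", 20220212),
    (11, "B", "East", 20220311), (12, "B", "West", 20220505),
    (13, "B", "North", 20220908),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SomeTable (Package INTEGER, Company TEXT, Region TEXT, SomeDate INTEGER)"
)
conn.executemany("INSERT INTO SomeTable VALUES (?, ?, ?, ?)", packages)

# For every (Company, month) that has at least one package, count all packages
# of that company up to and including that month whose region is "active".
# Assumed rule: East always, West from May (5), every region from September (9).
# Integer division of a yyyymmdd integer is SQLite behaviour.
query = """
WITH x AS (
    SELECT Package, Company, Region,
           SomeDate / 100 AS ym,            -- yyyymm
           (SomeDate / 100) % 100 AS m      -- month number
    FROM SomeTable
),
months AS (SELECT DISTINCT Company, ym, m FROM x)
SELECT mo.Company, mo.m AS Month, COUNT(*) AS BoxCount
FROM months mo
JOIN x ON x.Company = mo.Company
      AND x.ym <= mo.ym
      AND (x.Region = 'East'
           OR (x.Region = 'West' AND mo.m >= 5)
           OR mo.m >= 9)
GROUP BY mo.Company, mo.ym, mo.m
ORDER BY mo.Company, mo.ym
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

With the sample rows this reproduces the counts in the desired result (for example, 8 for company A in May). Because of the inner join, a month in which no active region has any package yet would be dropped rather than shown with 0.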

Related

sql - How To Remove All Rows After 4th Occurence of Column Combination in postgresql

I have a sql query that results in a table similar to the following after grouping by name, quarter, year and ordering by year DESC, quarter DESC:
name count quarter year
orange 22 4 2022
apple 1 4 2022
banana 123 3 2022
pie 93 2 2022
apple 12 2 2022
orange 0 1 2022
apple 900 4 2021
... ... ... ...
I want to remove any rows that come after the 4th unique combination of quarter and year is reached (for the table above this would be any rows after the last combination of quarter 1, year 2022), like so:
name count quarter year
orange 22 4 2022
apple 1 4 2022
banana 123 3 2022
pie 93 2 2022
apple 12 2 2022
orange 0 1 2022
I am using Postgres 6.10.
If the next year were reached, it would still need to work with the quarter at the top being 1 and the year 2023.
select name
,count
,quarter
,year
from
(
select *
,dense_rank() over(order by year desc, quarter desc) as dns_rnk
from t
) t
where dns_rnk <= 4
name count quarter year
orange 22 4 2022
apple 1 4 2022
banana 123 3 2022
pie 93 2 2022
apple 12 2 2022
orange 0 1 2022
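The dense_rank() filter can be tried out with a small runnable sketch. It uses SQLite through Python's sqlite3 module rather than Postgres (window functions need SQLite 3.25 or later), and the count column is renamed to cnt here purely for illustration.

```python
import sqlite3

# Rows from the question, in (name, cnt, quarter, year) form.
# "cnt" stands in for the question's "count" column.
data = [
    ("orange", 22, 4, 2022), ("apple", 1, 4, 2022),
    ("banana", 123, 3, 2022), ("pie", 93, 2, 2022),
    ("apple", 12, 2, 2022), ("orange", 0, 1, 2022),
    ("apple", 900, 4, 2021),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, cnt INTEGER, quarter INTEGER, year INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", data)

# dense_rank() gives every distinct (year, quarter) pair one rank with no gaps,
# so "rank <= 4" keeps exactly the four most recent quarter/year combinations.
query = """
SELECT name, cnt, quarter, year
FROM (
    SELECT t.*,
           DENSE_RANK() OVER (ORDER BY year DESC, quarter DESC) AS dns_rnk
    FROM t
) ranked
WHERE dns_rnk <= 4
ORDER BY year DESC, quarter DESC, name
"""
kept = conn.execute(query).fetchall()
for row in kept:
    print(row)
```

The 2021 row is the fifth distinct combination, so it is the only one filtered out. Since the ranking is over (year, quarter) pairs, a new first quarter of 2023 would simply become rank 1.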

Non-repeated values in BigQuery

I am fairly new to SQL, so this might be an easy solution for most, but I am having an issue with joins in BigQuery. I have two tables:
TABLE A
id name purchases
1 alex 2
2 jane 7
3 peter 8
4 mario 1
5 luigi 6
TABLE B
id name visited
1 alex jan
2 jane jan
2 jane feb
3 peter jan
3 peter feb
3 peter mar
4 mario feb
5 luigi mar
I want my end result to show the number of purchases only once per name/id (on the first row) and 0 on that id's remaining rows, like the following:
TABLE C
id name visited purchases
1 alex jan 2
2 jane jan 7
2 jane feb 0
3 peter jan 8
3 peter feb 0
3 peter mar 0
4 mario feb 1
5 luigi mar 6
However, no matter which join I use, the number of purchases is repeated on every matching row for the user, like the following:
id name visited purchases
1 alex jan 2
2 jane jan 7
2 jane feb 7
3 peter jan 8
3 peter feb 8
3 peter mar 8
4 mario feb 1
5 luigi mar 6
What would be the query to have Table C from Tables A and B?
Thank you.
One method is to use row_number():
select b.*, coalesce(a.purchases, 0) purchases
from (
select *, row_number() over(partition by id order by visited) rn
from b ) b
left join a on a.id = b.id and b.rn=1
Note that ordering by visited directly sorts the month names alphabetically (feb before jan), so you may wish to decode visited to an ordinal depending on ordering requirements, for example:
.. order by case visited when 'jan' then 1 when .. end ..
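The row_number() approach with a decoded month order can be sketched end to end. This runs on SQLite through Python's sqlite3 module as a stand-in for BigQuery (SQLite 3.25 or later is needed for window functions). The alias v and the three-month CASE mapping are assumptions that cover only the sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER, name TEXT, purchases INTEGER)")
conn.execute("CREATE TABLE b (id INTEGER, name TEXT, visited TEXT)")
conn.executemany("INSERT INTO a VALUES (?, ?, ?)", [
    (1, "alex", 2), (2, "jane", 7), (3, "peter", 8), (4, "mario", 1), (5, "luigi", 6),
])
conn.executemany("INSERT INTO b VALUES (?, ?, ?)", [
    (1, "alex", "jan"), (2, "jane", "jan"), (2, "jane", "feb"),
    (3, "peter", "jan"), (3, "peter", "feb"), (3, "peter", "mar"),
    (4, "mario", "feb"), (5, "luigi", "mar"),
])

# Number each id's visits in calendar order, then join purchases only onto
# the first visit (rn = 1); every later visit falls back to 0 via COALESCE.
query = """
SELECT v.id, v.name, v.visited, COALESCE(a.purchases, 0) AS purchases
FROM (
    SELECT b.*,
           ROW_NUMBER() OVER (
               PARTITION BY id
               ORDER BY CASE visited WHEN 'jan' THEN 1
                                     WHEN 'feb' THEN 2
                                     WHEN 'mar' THEN 3 END
           ) AS rn
    FROM b
) v
LEFT JOIN a ON a.id = v.id AND v.rn = 1
ORDER BY v.id, v.rn
"""
table_c = conn.execute(query).fetchall()
for row in table_c:
    print(row)
```

With a plain ORDER BY visited, jane's purchases would land on the feb row instead of the jan row, which is why the CASE expression is used here.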

Adding rows in a table from data that is not in a column

I'm trying to create a table to add all Medals won by the participant countries in the Olympics.
I scraped the data from Wikipedia and have something similar to this:
Year  Country_Name  Host_city    Host_Country   Gold  Silver  Bronze
1986  146           Los Angeles  United States  41    32      30
1986  67            Los Angeles  United States  12    12      12
And so on
I double-checked the data for some years, and it seems very accurate. The Country_Name has an ID because I have a Country_ID table that I created and updated the names with the ID:
Country_ID Country_Name
1986 1
1986 2
So far so good. Now I want to create a new table where I'll have all countries in a specific year and the total medals for that country. I managed to easily do that for countries that participated in an edition, here's an example for the 1896 edition:
INSERT INTO Cumultative_Medals_by_Year(Country_ID, Year, Culmutative_Gold, Culmutative_Silver, Culmutative_Bronze, Total_Medals)
SELECT a.Country_Name, a.Year, SUM(a.Gold) As Cumultative_Gold, SUM(a.Silver) As Cumultative_Silver, SUM(a.Bronze) As Cumultative_Bronze, SUM(a.Gold) + SUM(a.Silver) + SUM(a.Bronze) AS Total_Medals
FROM Country_Medals a
Where a.Year >= 1896 AND Year < 1900
Group By a.Country_Name, a.Year
And I'll have this table:
Country_ID Year Cumultative_Gold Cumultative_Silver Cumultative_Bronze Total_Medals
6 1986 2 0 0 5
7 1986 2 1 2 5
35 1986 1 2 3 6
46 1986 5 4 2 11
49 1986 6 5 2 13
51 1986 2 3 2 7
52 1986 10 18 19 47
58 1986 2 1 3 6
85 1986 1 0 1 2
131 1986 1 2 0 3
146 1986 11 7 2 20
To add the other editions I just have to edit the dates, "Where a.Year >= 1900 AND Year < 1904", for example.
INSERT INTO Cumultative_Medals_by_Year(Country_ID, Year, Culmutative_Gold, Culmutative_Silver, Culmutative_Bronze, Total_Medals)
SELECT a.Country_Name, a.Year, SUM(a.Gold) As Cumultative_Gold, SUM(a.Silver) As Cumultative_Silver, SUM(a.Bronze) As Cumultative_Bronze, SUM(a.Gold) + SUM(a.Silver) + SUM(a.Bronze) AS Total_Medals
FROM Country_Medals a
Where a.Year >= 1900 AND Year < 1904
Group By a.Country_Name, a.Year
And the table will grow.
But I'd like to also add all the other countries for the year 1896. This way I'll have a full record of all countries. So for example, you see that Country 1 has no medals in the 1896 Olympic edition, but I'd like to also add it there, even if the sum becomes NULL (where I'll update with a 0).
Why do I want that? I'd like to do an Animated Bar Chart Race, and with the data I have, some countries go "away" from the race. For example, the US didn't participate in the 1980 Olympics, so for a brief moment the bar for the US in the chart goes away, only to return in 1984 (when it participated again). Another example is the Soviet Union: even though it no longer participates, it is the participant with the second most medals won (only behind the US), but as the country has no participation after 1988, the bar just goes away after that year. Keeping a record of medals for all countries in all editions would prevent that from happening.
I'm pretty sure there are lots of countries that have won medals that were not around in 1896. But if you want a row for every country and every year, then generate the rows you want using cross join. Then join in the available information:
select c.Country_Name, y.Year,
SUM(cm.Gold) As Cumulative_Gold,
SUM(cm.Silver) As Cumulative_Silver,
SUM(cm.Bronze) As Cumulative_Bronze,
COALESCE(SUM(cm.Gold), 0) + COALESCE(SUM(cm.Silver), 0) + COALESCE(SUM(cm.Bronze), 0) AS Total_Medals
from (select distinct year from Country_Medals) y cross join
(select distinct country_name from country_medals) c left join
country_medals cm
on cm.year = y.year and
cm.country_name = c.country_name
group By c.Country_Name, y.Year
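The cross join plus left join pattern can be checked on a tiny data set. The sketch below uses SQLite through Python's sqlite3 module, and the three sample rows are hypothetical values invented for illustration, not real medal counts. Unlike the query above, it wraps every SUM in COALESCE so that missing editions show 0 instead of NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE country_medals "
    "(country_name INTEGER, year INTEGER, gold INTEGER, silver INTEGER, bronze INTEGER)"
)
# Hypothetical sample: country 1 has no row for 1900.
conn.executemany("INSERT INTO country_medals VALUES (?, ?, ?, ?, ?)", [
    (1, 1896, 1, 0, 0),
    (2, 1896, 2, 1, 0),
    (2, 1900, 3, 0, 1),
])

# The cross join builds every (country, year) pair; the left join then
# attaches medal rows where they exist, leaving NULLs that COALESCE turns into 0.
query = """
SELECT c.country_name, y.year,
       COALESCE(SUM(cm.gold), 0)   AS gold,
       COALESCE(SUM(cm.silver), 0) AS silver,
       COALESCE(SUM(cm.bronze), 0) AS bronze,
       COALESCE(SUM(cm.gold), 0) + COALESCE(SUM(cm.silver), 0)
           + COALESCE(SUM(cm.bronze), 0) AS total_medals
FROM (SELECT DISTINCT year FROM country_medals) y
CROSS JOIN (SELECT DISTINCT country_name FROM country_medals) c
LEFT JOIN country_medals cm
       ON cm.year = y.year AND cm.country_name = c.country_name
GROUP BY c.country_name, y.year
ORDER BY c.country_name, y.year
"""
grid = conn.execute(query).fetchall()
for row in grid:
    print(row)
```

Country 1 gets a row for 1900 with all zeros even though it has no source row for that year. These are still per-edition totals; a running total across editions would need an extra step.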

Select all values from last date that is shared between rows

I’ve got a PostgreSQL table with counts for countries over time. Not every country has a count for each date, and some have NULL values. I’d like to get the counts for each country up to the last date every country has data for, excluding NULL values.
I made a DB Fiddle with example data.
Example:
country date count id
Germany 2020-05-25 10 1
Germany 2020-05-26 11 2
Germany 2020-05-27 12 3
Germany 2020-05-28 13 4
Italy 2020-05-25 20 5
Italy 2020-05-26 21 6
Italy 2020-05-27 22 7
Italy 2020-05-28 23 8
France 2020-05-25 30 9
France 2020-05-26 31 10
France 2020-05-27 NULL 11
I’d like to get back the following:
country date count id
Germany 2020-05-25 10 1
Germany 2020-05-26 11 2
Italy 2020-05-25 20 5
Italy 2020-05-26 21 6
France 2020-05-25 30 9
France 2020-05-26 31 10
I’ve searched, but I’m relatively new to SQL and don’t seem to know what keywords to search for.
You can use window functions to count the number of rows with a non-NULL count on each date and then compare that to the number of countries:
SELECT c.*
FROM (SELECT c.*, COUNT(count) over (partition by date) as num_countries_on_date
FROM countries c
) c
WHERE num_countries_on_date = (SELECT COUNT(DISTINCT c2.country) FROM countries c2);
Here is a db<>fiddle.
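The window-count filter can be reproduced with the example data from the question. The sketch below uses SQLite through Python's sqlite3 module instead of PostgreSQL (SQLite 3.25 or later for window functions). The columns date and count are renamed to day and cnt for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countries (country TEXT, day TEXT, cnt INTEGER, id INTEGER)")
conn.executemany("INSERT INTO countries VALUES (?, ?, ?, ?)", [
    ("Germany", "2020-05-25", 10, 1), ("Germany", "2020-05-26", 11, 2),
    ("Germany", "2020-05-27", 12, 3), ("Germany", "2020-05-28", 13, 4),
    ("Italy", "2020-05-25", 20, 5), ("Italy", "2020-05-26", 21, 6),
    ("Italy", "2020-05-27", 22, 7), ("Italy", "2020-05-28", 23, 8),
    ("France", "2020-05-25", 30, 9), ("France", "2020-05-26", 31, 10),
    ("France", "2020-05-27", None, 11),
])

# COUNT(cnt) ignores NULLs, so for each day it yields the number of countries
# that really have a value; a day is kept only when that equals the total
# number of distinct countries.
query = """
SELECT country, day, cnt, id
FROM (
    SELECT c.*, COUNT(cnt) OVER (PARTITION BY day) AS num_countries_on_date
    FROM countries c
) counted
WHERE num_countries_on_date = (SELECT COUNT(DISTINCT country) FROM countries)
ORDER BY id
"""
complete_days = conn.execute(query).fetchall()
for row in complete_days:
    print(row)
```

Only 2020-05-25 and 2020-05-26 survive: France's NULL on 2020-05-27 and its missing row on 2020-05-28 both leave the per-day count at 2 instead of 3.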
If you wanted to generate data for a range of dates -- sort of the opposite problem -- you could use a CROSS JOIN to generate the rows, a LEFT JOIN to bring in the data, and COALESCE() to turn NULL to 0:
SELECT c.country, d.date, coalesce(co.count, 0) as count
FROM (SELECT DISTINCT country FROM countries) c CROSS JOIN
generate_series('2020-05-26'::date, '2020-05-27'::date, interval '1 day') d(date) LEFT JOIN
countries co
ON co.country = c.country AND co.date = d.date;

How to retrieve the population of all parent and child rows in Oracle SQL?

I have a table "TB_Population" with some records about the population from all over the world.
I want to calculate each title's total population (its own plus that of all its children) in its own row,
and show each title's level in the hierarchy.
I have this table with the following data:
ID TITLE PARENT_ID POPULATION
1 WORLD 10
2 AFRICA 1 5
3 ASIA 1 10
4 EUROPE 1 4
5 GERMANY 4 6
6 FRANCE 4 10
7 ITALY 4 4
8 JAPAN 3 6
9 MORROCO 2 1
10 SPAIN 4 9
11 INDIA 3 8
12 PORTUGAL 4 2
13 USA 14 10
14 AMERICA 1 10
15 NEWYORK 13 5
The expected output table should be as below
ID TITLE POPULATION LEVEL
1 WORLD 100 1
2 AFRICA 6 2
3 ASIA 24 2
4 EUROPE 35 2
5 GERMANY 6 3
6 FRANCE 10 3
7 ITALY 4 3
8 JAPAN 6 3
9 MORROCO 1 3
10 SPAIN 9 3
11 INDIA 8 3
12 PORTUGAL 2 3
13 USA 15 3
14 AMERICA 25 2
15 NEWYORK 5 4
Thanks and best regards
The tricky part which I see here is you want the LEVEL of title from "BOTTOM TO TOP" and POPULATION from "TOP TO BOTTOM". For example, AMERICA's level has to be 2 which means the LEVEL has to be measured from AMERICA -> WORLD, but AMERICA's population has to be 25 which is the sum of population measured from AMERICA -> NEWYORK. So, I tried this:
SELECT TOP_TO_BOTTOM.TITLE_ALIAS, TOP_TO_BOTTOM.TOTAL_POPULATION, BOTTOM_TO_TOP.MAX_LEVEL FROM
(SELECT TITLE_ALIAS, SUM(POPULATION) AS "TOTAL_POPULATION" FROM
(SELECT CONNECT_BY_ROOT TITLE AS "TITLE_ALIAS", POPULATION
FROM TB_POPULATION
CONNECT BY PRIOR ID = PARENT_ID)
GROUP BY TITLE_ALIAS) "TOP_TO_BOTTOM"
INNER JOIN
(SELECT TITLE_ALIAS, MAX(LEV) AS "MAX_LEVEL" FROM
(SELECT CONNECT_BY_ROOT TITLE AS "TITLE_ALIAS", LEVEL AS "LEV"
FROM TB_POPULATION
CONNECT BY PRIOR PARENT_ID = ID)
GROUP BY TITLE_ALIAS) "BOTTOM_TO_TOP"
ON
BOTTOM_TO_TOP.TITLE_ALIAS = TOP_TO_BOTTOM.TITLE_ALIAS
ORDER BY BOTTOM_TO_TOP.MAX_LEVEL;
You can have a look at the simulation here: https://rextester.com/HFTIH47397.
Hope this helps you
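For comparison, the same two traversals can be written with recursive common table expressions instead of CONNECT BY. The sketch below is not Oracle syntax: it runs on SQLite through Python's sqlite3 module so the result can be checked against the expected output. The CTE names levels and closure, and the column name lvl, are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tb_population "
    "(id INTEGER, title TEXT, parent_id INTEGER, population INTEGER)"
)
conn.executemany("INSERT INTO tb_population VALUES (?, ?, ?, ?)", [
    (1, "WORLD", None, 10), (2, "AFRICA", 1, 5), (3, "ASIA", 1, 10),
    (4, "EUROPE", 1, 4), (5, "GERMANY", 4, 6), (6, "FRANCE", 4, 10),
    (7, "ITALY", 4, 4), (8, "JAPAN", 3, 6), (9, "MORROCO", 2, 1),
    (10, "SPAIN", 4, 9), (11, "INDIA", 3, 8), (12, "PORTUGAL", 4, 2),
    (13, "USA", 14, 10), (14, "AMERICA", 1, 10), (15, "NEWYORK", 13, 5),
])

# levels:  walk down from the root to give every row its depth.
# closure: pair every row with itself and with each of its descendants,
#          so summing over the pairs gives own population plus all children.
query = """
WITH RECURSIVE
levels(id, lvl) AS (
    SELECT id, 1 FROM tb_population WHERE parent_id IS NULL
    UNION ALL
    SELECT t.id, l.lvl + 1
    FROM tb_population t JOIN levels l ON t.parent_id = l.id
),
closure(ancestor, descendant) AS (
    SELECT id, id FROM tb_population
    UNION ALL
    SELECT c.ancestor, t.id
    FROM closure c JOIN tb_population t ON t.parent_id = c.descendant
)
SELECT t.id, t.title, SUM(d.population) AS population, l.lvl
FROM tb_population t
JOIN closure c ON c.ancestor = t.id
JOIN tb_population d ON d.id = c.descendant
JOIN levels l ON l.id = t.id
GROUP BY t.id, t.title, l.lvl
ORDER BY t.id
"""
tree = conn.execute(query).fetchall()
for row in tree:
    print(row)
```

The closure CTE plays the role of the CONNECT_BY_ROOT query that walks from each title down to its leaves, while levels replaces the second query that measures the distance up to WORLD.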