Adding rows in a table from data that is not in a column - sql

I'm trying to create a table to add all Medals won by the participant countries in the Olympics.
I scraped the data from Wikipedia and have something similar to this:
Year  Country_Name  Host_city    Host_Country   Gold  Silver  Bronze
1986  146           Los Angeles  United States  41    32      30
1986  67            Los Angeles  United States  12    12      12
And so on
I double-checked the data for some years, and it seems very accurate. The Country_Name column holds an ID rather than a name, because I created a Country_ID lookup table and replaced the names with their IDs:
Country_ID  Country_Name
1986        1
1986        2
So far so good. Now I want to create a new table where I'll have all countries in a specific year and the total medals for that country. I managed to easily do that for countries that participated in an edition, here's an example for the 1896 edition:
INSERT INTO Cumultative_Medals_by_Year (Country_ID, Year, Culmutative_Gold, Culmutative_Silver, Culmutative_Bronze, Total_Medals)
SELECT a.Country_Name,
       a.Year,
       SUM(a.Gold) AS Cumultative_Gold,
       SUM(a.Silver) AS Cumultative_Silver,
       SUM(a.Bronze) AS Cumultative_Bronze,
       SUM(a.Gold) + SUM(a.Silver) + SUM(a.Bronze) AS Total_Medals
FROM Country_Medals a
WHERE a.Year >= 1896 AND a.Year < 1900
GROUP BY a.Country_Name, a.Year
And I'll have this table:
Country_ID  Year  Cumultative_Gold  Cumultative_Silver  Cumultative_Bronze  Total_Medals
6           1986  2                 0                   0                   5
7           1986  2                 1                   2                   5
35          1986  1                 2                   3                   6
46          1986  5                 4                   2                   11
49          1986  6                 5                   2                   13
51          1986  2                 3                   2                   7
52          1986  10                18                  19                  47
58          1986  2                 1                   3                   6
85          1986  1                 0                   1                   2
131         1986  1                 2                   0                   3
146         1986  11                7                   2                   20
To add the other editions I just have to edit the dates, "Where a.Year >= 1900 AND Year < 1904", for example.
INSERT INTO Cumultative_Medals_by_Year (Country_ID, Year, Culmutative_Gold, Culmutative_Silver, Culmutative_Bronze, Total_Medals)
SELECT a.Country_Name,
       a.Year,
       SUM(a.Gold) AS Cumultative_Gold,
       SUM(a.Silver) AS Cumultative_Silver,
       SUM(a.Bronze) AS Cumultative_Bronze,
       SUM(a.Gold) + SUM(a.Silver) + SUM(a.Bronze) AS Total_Medals
FROM Country_Medals a
WHERE a.Year >= 1900 AND a.Year < 1904
GROUP BY a.Country_Name, a.Year
And the table will grow.
But I'd like to also add all the other countries for the year 1896. This way I'll have a full record of all countries. So for example, you see that Country 1 has no medals in the 1896 Olympic edition, but I'd like to also add it there, even if the sum becomes NULL (where I'll update with a 0).
Why do I want that? I'd like to make an animated bar chart race, and with the data I have, some countries drop out of the race. For example, the US didn't participate in the 1980 Olympics, so the bar for the US briefly disappears from the chart and returns in 1984, when it participated again. Another example is the Soviet Union: although it no longer participates, it is still the participant with the second-most medals won (behind only the US), but because it has no participation after 1988, its bar disappears after that year. Keeping a record of medals for all countries in all editions would prevent that from happening.

I'm pretty sure there are lots of countries that have won medals that were not around in 1896. But if you want a row for every country and every year, generate the rows you want using a cross join, then left join the available information:
SELECT c.Country_Name,
       y.Year,
       SUM(cm.Gold) AS Cumulative_Gold,
       SUM(cm.Silver) AS Cumulative_Silver,
       SUM(cm.Bronze) AS Cumulative_Bronze,
       COALESCE(SUM(cm.Gold), 0) + COALESCE(SUM(cm.Silver), 0) + COALESCE(SUM(cm.Bronze), 0) AS Total_Medals
FROM (SELECT DISTINCT Year FROM Country_Medals) y
CROSS JOIN (SELECT DISTINCT Country_Name FROM Country_Medals) c
LEFT JOIN Country_Medals cm
       ON cm.Year = y.Year
      AND cm.Country_Name = c.Country_Name
GROUP BY c.Country_Name, y.Year
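For anyone who wants to try the cross join idea without a database server, here is a minimal sketch using Python's built-in sqlite3 module. The medal counts are made-up illustration values, not the scraped data, and the table layout is an assumption based on the columns shown in the question. COALESCE is applied to every sum so that missing country/year pairs come out as 0 instead of NULL.

```python
import sqlite3

# In-memory database with a tiny, made-up sample (not real medal data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Country_Medals ("
    "Year INTEGER, Country_Name INTEGER, Gold INTEGER, Silver INTEGER, Bronze INTEGER)"
)
conn.executemany(
    "INSERT INTO Country_Medals VALUES (?, ?, ?, ?, ?)",
    [
        (1896, 146, 11, 7, 2),
        (1896, 6, 2, 0, 0),
        (1900, 146, 3, 1, 1),
        (1900, 1, 1, 0, 0),  # country 1 has no row for 1896
    ],
)

# Build every (year, country) pair first, then attach whatever medals exist.
query = """
SELECT c.Country_Name,
       y.Year,
       COALESCE(SUM(cm.Gold), 0)   AS Gold,
       COALESCE(SUM(cm.Silver), 0) AS Silver,
       COALESCE(SUM(cm.Bronze), 0) AS Bronze
FROM (SELECT DISTINCT Year FROM Country_Medals) y
CROSS JOIN (SELECT DISTINCT Country_Name FROM Country_Medals) c
LEFT JOIN Country_Medals cm
       ON cm.Year = y.Year
      AND cm.Country_Name = c.Country_Name
GROUP BY c.Country_Name, y.Year
ORDER BY y.Year, c.Country_Name
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Every country now has a row in every year, so a chart built on this result should not lose bars for editions a country skipped.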

Related

Sum of Different Rows on Condition

I have a query that looks like this:
with x as (
select *, date_format(SomeDate, 'MMM') as Month from SomeTable
)
select *, count(Package) over (partition by Company, Region order by SomeDate) as BoxCount
from x
Table SomeTable basically looks like this:
Package Company Region SomeDate
1 A East 20220101
2 A East 20220105
3 A East 20220310
4 A East 20220411
5 A East 20220502
6 A West 20220405
7 A West 20220505
8 A West 20220508
9 B East 20220106
10 B East 20220212
11 B East 20220311
12 B West 20220505
13 B North 20220908
The result I want is basically this:
Company Month BoxCount
A Jan 2
A Mar 3
A Apr 4
A May 8
B Jan 1
B Feb 2
B Mar 3
B May 4
B Sept 5
What I want is basically a cumulative sum (CUSUM) by Company and Region. However, in May I'd like to combine Region West with Region East, and in September I'd like to combine all three regions for each company. Is there a way to do this in Spark SQL?
My query gives the cumulative sum, but I'm not sure how to proceed from here.

How to obtain ratio of column X over column Y in SQLite?

I am supposed to obtain the following result:
Demographics: Ratio of Men and Women
Write a query that shows the ratio of men and women (pop_male divided by pop_female) over the age of 18 by county in 2018. Order the results from the most female to the most male counties.
county ratio
0 MADERA 0.909718
1 YOLO 0.932634
2 CONTRA COSTA 0.936229
3 SACRAMENTO 0.943243
4 SHASTA 0.944123
This preview is limited to five rows.
A select * from population shows base data in this format:
fips county year age pop_female pop_male pop_total
0 6001 ALAMEDA 1970 0 8533 8671 17204
1 6001 ALAMEDA 1970 1 8151 8252 16403
2 6001 ALAMEDA 1970 2 7753 8015 15768
3 6001 ALAMEDA 1970 3 8018 8412 16430
4 6001 ALAMEDA 1970 4 8551 8648 17199
...and so on, for ages 0-100 and all years 1970-2018. The state is CA.
I tried using:
select county, (sum(pop_male) / sum(pop_female)) as ratio
from population group by county, year having age > 18 and year = 2018;
output was instead:
county ratio
ALAMEDA 0
ALPINE 1
AMADOR 1
BUTTE 0
CALAVERAS 0
COLUSA 1
CONTRA COSTA 0
Note: I am aware I haven't done any order by yet as I am not even outputting correct data.
SQLRaptor gave me a suggestion and I tried:
select county, (CAST(sum(pop_male) AS DECIMAL(1,6)) / (CAST(sum(pop_female) AS DECIMAL(1,6)) as ratio
from population group by county, year having age > 18 and year = 2018
this gave me the response:
sqlite:///../Databases/population.sqlite3
(sqlite3.OperationalError) near "as": syntax error
[SQL: select county, (CAST(sum(pop_male) AS DECIMAL(1,6)) / (CAST(sum(pop_female) AS DECIMAL(1,6)) as ratio
from population group by county, year having age > 18 and year = 2018]
I took Esteban P's suggestion and used:
select county, (SUM(CAST(pop_male AS float)) / SUM(CAST(pop_female AS float))) as ratio
from population group by county, year having age > 18 and year = 2018 order by ratio
This worked.
I don't want to override SQLRaptor's helpful response in the comments, but for the sake of completeness :)
When both operands are integers, SQLite performs integer division and truncates the result. To avoid that, cast at least one of the values to a floating-point data type (e.g. REAL or FLOAT in SQLite; see the manual on data types: https://www.sqlite.org/datatype3.html).
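The truncation is easy to reproduce with Python's built-in sqlite3 module. The sketch below uses a tiny made-up population table (the numbers are illustrative only) and runs the same ratio once with integer operands and once with a cast. It also moves the age and year filters into a WHERE clause, which is an assumption about the intent: a WHERE clause filters rows before they are summed, whereas HAVING is evaluated per group after aggregation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE population ("
    "county TEXT, year INTEGER, age INTEGER, pop_female INTEGER, pop_male INTEGER)"
)
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO population VALUES (?, ?, ?, ?, ?)",
    [
        ("A", 2018, 19, 100, 90),
        ("A", 2018, 20, 100, 95),
        ("A", 2018, 5, 50, 60),    # excluded by the age filter
        ("B", 2018, 30, 80, 100),
        ("B", 2017, 30, 80, 10),   # excluded by the year filter
    ],
)

# Integer / integer: the fractional part is dropped.
int_rows = conn.execute(
    "SELECT county, SUM(pop_male) / SUM(pop_female) AS ratio "
    "FROM population WHERE age > 18 AND year = 2018 "
    "GROUP BY county ORDER BY county"
).fetchall()

# Casting one operand to REAL gives floating-point division.
real_rows = conn.execute(
    "SELECT county, CAST(SUM(pop_male) AS REAL) / SUM(pop_female) AS ratio "
    "FROM population WHERE age > 18 AND year = 2018 "
    "GROUP BY county ORDER BY ratio"
).fetchall()

print(int_rows)
print(real_rows)
```

With these sample rows the integer version yields only 0 and 1, which matches the symptom in the question, while the cast version keeps the fractional ratios.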

MS Access selecting by year intervals

I have a table where every row has its own date (year of purchase). I need to select the purchases grouped into year intervals.
Example:
Zetor 1993
Zetor 1993
JOHN DEERE 2001
JOHN DEERE 2001
JOHN DEERE 2001
This means I have 2 Zetor purchases in 1993 and 3 John Deere purchases in 2001. I need to select the count of the purchases grouped into these year intervals:
<=1959
1960-1969
1970-1979
1980-1989
1990-1994
1995-1999
2000-2004
2004-2009
2010-2013
I have no idea how I should do this.
The result should look like this on the example above:
<=1959 0
1960-1969 0
1970-1979 0
1980-1989 0
1990-1994 2
1995-1999 0
2000-2004 3
2004-2009 0
2010-2013 0
Create a table with the intervals:
tblRanges([RangeName],[Begins],[Ends])
Populate it with your intervals, then use GROUP BY with your table tblPurchases([Item],[YearOfDeal]):
SELECT tblRanges.RangeName, Count(tblPurchases.YearOfDeal)
FROM tblRanges INNER JOIN tblPurchases ON (tblRanges.Begins <= tblPurchases.YearOfDeal) AND (tblRanges.Ends >= tblPurchases.YearOfDeal)
GROUP BY tblRanges.RangeName;
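As a rough illustration of the range-table approach, here is a sketch that runs in SQLite through Python's sqlite3 module rather than in Access, so the syntax may need adjusting for Access itself. Only four of the intervals are loaded to keep it short. It uses a LEFT JOIN instead of an INNER JOIN, which is a deliberate change from the query above: with an inner join, ranges that have no purchases drop out of the result, while the desired output in the question lists them with a count of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE tblRanges (RangeName TEXT, Begins INTEGER, Ends INTEGER);
    CREATE TABLE tblPurchases (Item TEXT, YearOfDeal INTEGER);

    -- A subset of the intervals from the question.
    INSERT INTO tblRanges VALUES
        ('1980-1989', 1980, 1989),
        ('1990-1994', 1990, 1994),
        ('1995-1999', 1995, 1999),
        ('2000-2004', 2000, 2004);

    -- The example purchases from the question.
    INSERT INTO tblPurchases VALUES
        ('Zetor', 1993), ('Zetor', 1993),
        ('JOHN DEERE', 2001), ('JOHN DEERE', 2001), ('JOHN DEERE', 2001);
    """
)

# COUNT(p.YearOfDeal) ignores the NULLs produced by unmatched ranges,
# so empty intervals are reported as 0.
rows = conn.execute(
    """
    SELECT r.RangeName, COUNT(p.YearOfDeal) AS Purchases
    FROM tblRanges AS r
    LEFT JOIN tblPurchases AS p
           ON p.YearOfDeal BETWEEN r.Begins AND r.Ends
    GROUP BY r.RangeName, r.Begins
    ORDER BY r.Begins
    """
).fetchall()
for row in rows:
    print(row)
```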
You may wish to consider Partition for future use:
SELECT Partition([Year],1960,2014,10) AS [Group], Count(Stock.Year) AS CountOfYear
FROM Stock
GROUP BY Partition([Year],1960,2014,10)
Input:
Tractor Year
Zetor 1993
Zetor 1993
JOHN DEERE 2001
JOHN DEERE 2001
JOHN DEERE 2001
Pre 59 1945
1960 1960
Result:
Group CountOfYear
:1959 1
1960:1969 1
1990:1999 2
2000:2009 3
Reference: http://office.microsoft.com/en-ie/access-help/partition-function-HA001228892.aspx

Get count per year of data with begin and end dates

I have a set of data that lists each employee ever employed in a certain type of department at many cities, and it lists each employee's begin and end date.
For example:
name city_id start_date end_date
-----------------------------------------
Joe Public 54 3-19-1994 9-1-2002
Suzi Que 54 10-1-1995 9-1-2005
What I want is each city's employee count for each year in a particular period. For example, if this was all the data for city 54, then I'd show this as the query results if I wanted to show city 54's employee count for the years 1990-2005:
city_id year employee_count
-----------------------------
54 1990 0
54 1991 0
54 1992 0
54 1993 0
54 1994 1
54 1995 2
54 1996 2
54 1997 2
54 1998 2
54 1999 2
54 2000 2
54 2001 2
54 2002 2
54 2003 1
54 2004 1
54 2005 1
(Note that I will have many cities, so the primary key here would be city and year unless I want to have a separate id column.)
Is there an efficient SQL query to do this? All I can think of is a series of UNIONed queries, with one query for each year I wanted to get numbers for.
My dataset has a few hundred cities and 178,000 employee records. I need to find a few decades' worth of this yearly data for each city on my dataset.
Replace <city_id> with your parameter (54 in the example):
select
    <city_id>, c.y, count(t.city_id)
from generate_series(1990, 2005) as c(y)
    left outer join Table1 as t on
        c.y between extract(year from t.start_date) and extract(year from t.end_date) and
        t.city_id = <city_id>
group by c.y
order by c.y
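The query above relies on generate_series, which is available in PostgreSQL. As a portable sketch of the same idea, the example below builds the list of years with a recursive CTE in SQLite (via Python's sqlite3 module) and left joins the employees onto it. The table name Table1 follows the answer; storing the dates as ISO text ('YYYY-MM-DD') is an assumption made so that strftime can extract the year.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Table1 (name TEXT, city_id INTEGER, start_date TEXT, end_date TEXT);
    -- The two sample employees from the question, with ISO-formatted dates.
    INSERT INTO Table1 VALUES
        ('Joe Public', 54, '1994-03-19', '2002-09-01'),
        ('Suzi Que',   54, '1995-10-01', '2005-09-01');
    """
)

query = """
WITH RECURSIVE years(y) AS (
    SELECT :first_year
    UNION ALL
    SELECT y + 1 FROM years WHERE y < :last_year
)
SELECT :city_id AS city_id, years.y, COUNT(t.city_id) AS employee_count
FROM years
LEFT JOIN Table1 AS t
       ON t.city_id = :city_id
      AND years.y BETWEEN CAST(strftime('%Y', t.start_date) AS INTEGER)
                      AND CAST(strftime('%Y', t.end_date) AS INTEGER)
GROUP BY years.y
ORDER BY years.y
"""
params = {"first_year": 1990, "last_year": 2005, "city_id": 54}
rows = conn.execute(query, params).fetchall()
for row in rows:
    print(row)
```

Years with no matching employee still appear, because the year list drives the join and COUNT(t.city_id) counts only matched rows.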

Using SQL DENSE_RANK to determine duplicates

Here is an example of the data for which I am trying to find all the orders with same quantities, ignoring the OrderID column
Product Location Customer OrderID Quantity
Eggs Chicago XYZ 2011 10
Eggs Chicago XYZ 2012 10
Eggs Chicago XYZ 2013 15
So, I used DENSE_RANK function in the SQL
Select Product,Location,Customer,OrderID,Quantity,
Ranking = DENSE_RANK() OVER (PARTITION BY Product,Location,Customer,Quantity
ORDER BY OrderID ASC)
FROM MyTable
to get the data below
Product Location Customer OrderID Quantity Ranking
Eggs Chicago XYZ 2011 10 1
Eggs Chicago XYZ 2012 10 2
Eggs Chicago XYZ 2013 15 1
So, based on the ranking I was able to filter out the records that have the same quantity across different orderIDs and treat them as one.
So far so good, and I am happy. But another crazy requirement is that this form of aggregation should be done only for the first change in quantity. For example, suppose the data above looks like this instead:
Product Location Customer OrderID Quantity
Eggs Chicago XYZ 2011 10
Eggs Chicago XYZ 2012 10
Eggs Chicago XYZ 2013 15
Eggs Chicago XYZ 2014 15
Eggs Chicago XYZ 2015 15
The same SQL would produce this result:
Product Location Customer OrderID Quantity Ranking
Eggs Chicago XYZ 2011 10 1
Eggs Chicago XYZ 2012 10 2
Eggs Chicago XYZ 2013 15 1
Eggs Chicago XYZ 2014 15 2
Eggs Chicago XYZ 2015 15 3
But, I would need the result to be
Product Location Customer OrderID Quantity Ranking
Eggs Chicago XYZ 2011 10 1
Eggs Chicago XYZ 2012 10 2
Eggs Chicago XYZ 2013 15 1
Eggs Chicago XYZ 2014 15 1
Eggs Chicago XYZ 2015 15 1
Please, note the ranking remains 1 for all the records after the first change in quantity.
Is it possible to tweak my SQL to get the above behavior?
Thanks for any suggestions.
If I understand you correctly, you want to use DENSE_RANK() to eliminate duplicate rows in your data.
It seems you've already solved your problem. If you want to eliminate the duplicates, use the same SQL code you have above and delete any rows with Ranking > 1. This will leave you with one copy of each row with the same partition key (Product, Location, Customer, Quantity).
This feels a bit dirty but I think it's correct:
SELECT
    Product,
    Location,
    Customer,
    OrderID,
    Quantity,
    DENSE_RANK() OVER (
        PARTITION BY
            Product,
            Location,
            Customer,
            Quantity
        ORDER BY
            CASE WHEN Quantity = (SELECT MIN(Quantity) FROM Orders) THEN OrderID
                 ELSE 0
            END ASC
    ) AS Ranking
FROM
    Orders
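To check the behaviour on the five sample rows, here is a small sketch using Python's sqlite3 module. It assumes a bundled SQLite of version 3.25 or later, since window functions are needed. As a variation on the query above, the minimum quantity is computed once in a CTE (the name first_qty is made up for this example) and cross joined in, rather than being looked up with a scalar subquery inside the window's ORDER BY. Like the original, it only works if the first quantity is also the lowest one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Orders (
        Product TEXT, Location TEXT, Customer TEXT, OrderID INTEGER, Quantity INTEGER
    );
    INSERT INTO Orders VALUES
        ('Eggs', 'Chicago', 'XYZ', 2011, 10),
        ('Eggs', 'Chicago', 'XYZ', 2012, 10),
        ('Eggs', 'Chicago', 'XYZ', 2013, 15),
        ('Eggs', 'Chicago', 'XYZ', 2014, 15),
        ('Eggs', 'Chicago', 'XYZ', 2015, 15);
    """
)

# Rows with the lowest quantity are ranked by OrderID; all other rows share
# the constant sort key 0, so they tie and every one of them gets rank 1.
rows = conn.execute(
    """
    WITH first_qty AS (SELECT MIN(Quantity) AS q FROM Orders)
    SELECT o.OrderID,
           o.Quantity,
           DENSE_RANK() OVER (
               PARTITION BY o.Product, o.Location, o.Customer, o.Quantity
               ORDER BY CASE WHEN o.Quantity = f.q THEN o.OrderID ELSE 0 END
           ) AS Ranking
    FROM Orders AS o
    CROSS JOIN first_qty AS f
    ORDER BY o.OrderID
    """
).fetchall()
for row in rows:
    print(row)
```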