How to obtain the ratio of column X over column Y in SQLite? - sql

I am supposed to obtain the following result:
Demographics: Ratio of Men and Women
Write a query that shows the ratio of men and women (pop_male divided by pop_female) over the age of 18 by county in 2018. Order the results from the most female to the most male counties.
county ratio
0 MADERA 0.909718
1 YOLO 0.932634
2 CONTRA COSTA 0.936229
3 SACRAMENTO 0.943243
4 SHASTA 0.944123
This preview is limited to five rows.
A select * from population shows base data in this format:
fips county year age pop_female pop_male pop_total
0 6001 ALAMEDA 1970 0 8533 8671 17204
1 6001 ALAMEDA 1970 1 8151 8252 16403
2 6001 ALAMEDA 1970 2 7753 8015 15768
3 6001 ALAMEDA 1970 3 8018 8412 16430
4 6001 ALAMEDA 1970 4 8551 8648 17199
... and so on for ages 0-100 and all years 1970-2018. The state is CA.
I tried using:
select county, (sum(pop_male) / sum(pop_female)) as ratio
from population group by county, year having age > 18 and year = 2018;
output was instead:
county ratio
ALAMEDA 0
ALPINE 1
AMADOR 1
BUTTE 0
CALAVERAS 0
COLUSA 1
CONTRA COSTA 0
Note: I am aware I haven't done any order by yet as I am not even outputting correct data.
SQLRaptor gave me a suggestion and I tried:
select county, (CAST(sum(pop_male) AS DECIMAL(1,6)) / (CAST(sum(pop_female) AS DECIMAL(1,6)) as ratio
from population group by county, year having age > 18 and year = 2018
this gave me the response:
sqlite:///../Databases/population.sqlite3
(sqlite3.OperationalError) near "as": syntax error
[SQL: select county, (CAST(sum(pop_male) AS DECIMAL(1,6)) / (CAST(sum(pop_female) AS DECIMAL(1,6)) as ratio
from population group by county, year having age > 18 and year = 2018]
I took Esteban P's suggestion and used:
select county, (SUM(CAST(pop_male AS float)) / SUM(CAST(pop_female AS float))) as ratio
from population group by county, year having age > 18 and year = 2018 order by ratio
This worked.

I don't want to override SQLRaptor's helpful response in the comments, but for the sake of completeness :)
SQLite performs integer division when both operands are integers, so the fractional part of the result is truncated. To avoid that, cast at least one of the operands to a floating-point type (e.g. REAL or FLOAT in SQLite; see the manual on data types: https://www.sqlite.org/datatype3.html).
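A minimal sketch of the difference, using Python's built-in sqlite3 module and a tiny made-up population table (the counties and numbers are invented for illustration, not taken from the real data). It also moves the age and year filters into WHERE, so rows are filtered before grouping; in SQLite a bare column such as age inside HAVING is evaluated against an arbitrarily chosen row of each group, which is probably not what was intended.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE population (county TEXT, year INTEGER, age INTEGER, "
    "pop_female INTEGER, pop_male INTEGER)"
)
# Invented sample rows, not real census figures.
conn.executemany(
    "INSERT INTO population VALUES (?, ?, ?, ?, ?)",
    [
        ("ALPHA", 2018, 10, 100, 900),  # age 10: excluded by WHERE
        ("ALPHA", 2018, 30, 200, 150),
        ("ALPHA", 2018, 40, 200, 150),
        ("BETA", 2018, 30, 100, 120),
        ("BETA", 2017, 30, 100, 500),   # wrong year: excluded by WHERE
    ],
)

# Integer divided by integer is truncated in SQLite.
int_ratio = conn.execute("SELECT 300 / 400").fetchone()[0]

# Casting one operand to REAL keeps the fractional part.
query = """
    SELECT county,
           CAST(SUM(pop_male) AS REAL) / SUM(pop_female) AS ratio
    FROM population
    WHERE age > 18 AND year = 2018
    GROUP BY county
    ORDER BY ratio
"""
ratios = conn.execute(query).fetchall()
print(int_ratio)
print(ratios)
```

Ordering by ratio in ascending order puts the most female counties first, as the assignment asks.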

Related

Sum of Different Rows on Condition

I have a query that looks like this:
with x as (
select *, date_format(SomeDate, 'MMM') as Month from SomeTable
)
select *, count(Package) over (partition by Company, Region order by SomeDate) as BoxCount
from x
Table SomeTable basically looks like this:
Package Company Region SomeDate
1 A East 20220101
2 A East 20220105
3 A East 20220310
4 A East 20220411
5 A East 20220502
6 A West 20220405
7 A West 20220505
8 A West 20220508
9 B East 20220106
10 B East 20220212
11 B East 20220311
12 B West 20220505
13 B North 20220908
The result I want is basically this:
Company Month BoxCount
A Jan 2
A Mar 3
A Apr 4
A May 8
B Jan 1
B Feb 2
B Mar 3
B May 4
B Sept 5
What I want is basically a cumulative sum (CUSUM) by Company and Region. However, when the month is May, I'd like to count Region West together with Region East, and in September I'd like to count all 3 regions together for each company. Is there a way to do this in Spark SQL?
My query gives the cumulative sum, but I'm not sure how to proceed from here.

Adding rows in a table from data that is not in a column

I'm trying to create a table to add all Medals won by the participant countries in the Olympics.
I scraped the data from Wikipedia and have something similar to this:
Year Country_Name Host_city Host_Country Gold Silver Bronze
1986 146 Los Angeles United States 41 32 30
1986 67 Los Angeles United States 12 12 12
And so on
I double-checked the data for some years, and it seems very accurate. The Country_Name has an ID because I have a Country_ID table that I created and updated the names with the ID:
Country_ID Country_Name
1986 1
1986 2
So far so good. Now I want to create a new table where I'll have all countries in a specific year and the total medals for that country. I managed to easily do that for countries that participated in an edition, here's an example for the 1896 edition:
INSERT INTO Cumultative_Medals_by_Year(Country_ID, Year, Culmutative_Gold, Culmutative_Silver, Culmutative_Bronze, Total_Medals)
SELECT a.Country_Name, a.Year, SUM(a.Gold) As Cumultative_Gold, SUM(a.Silver) As Cumultative_Silver, SUM(a.Bronze) As Cumultative_Bronze, SUM(a.Gold) + SUM(a.Silver) + SUM(a.Bronze) AS Total_Medals
FROM Country_Medals a
Where a.Year >= 1896 AND Year < 1900
Group By a.Country_Name, a.Year
And I'll have this table:
Country_ID Year Cumultative_Gold Cumultative_Silver Cumultative_Bronze Total_Medals
6 1986 2 0 0 5
7 1986 2 1 2 5
35 1986 1 2 3 6
46 1986 5 4 2 11
49 1986 6 5 2 13
51 1986 2 3 2 7
52 1986 10 18 19 47
58 1986 2 1 3 6
85 1986 1 0 1 2
131 1986 1 2 0 3
146 1986 11 7 2 20
To add the other editions I just have to edit the dates, "Where a.Year >= 1900 AND Year < 1904", for example.
INSERT INTO Cumultative_Medals_by_Year(Country_ID, Year, Culmutative_Gold, Culmutative_Silver, Culmutative_Bronze, Total_Medals)
SELECT a.Country_Name, a.Year, SUM(a.Gold) As Cumultative_Gold, SUM(a.Silver) As Cumultative_Silver, SUM(a.Bronze) As Cumultative_Bronze, SUM(a.Gold) + SUM(a.Silver) + SUM(a.Bronze) AS Total_Medals
FROM Country_Medals a
Where a.Year >= 1900 AND Year < 1904
Group By a.Country_Name, a.Year
And the table will grow.
But I'd like to also add all the other countries for the year 1896. This way I'll have a full record of all countries. So for example, you see that Country 1 has no medals in the 1896 Olympic edition, but I'd like to also add it there, even if the sum becomes NULL (where I'll update with a 0).
Why do I want that? I'd like to build an animated bar chart race, and with the data I have, some countries drop out of the race. For example, the US didn't participate in the 1980 Olympics, so for a brief moment the bar for the US disappears from the chart, only to return in 1984 (when it participated again). Another example is the Soviet Union: even though it no longer participates, it is the participant with the second-most medals won (behind only the US), but as the country has no participation after 1988, its bar disappears after that year. Keeping a record of medals for all countries in all editions would prevent that from happening.
I'm pretty sure there are lots of countries that have won medals that were not around in 1896. But if you want a row for every country and every year, then generate the rows you want using a cross join, and then join in the available information:
select c.Country_Name, y.Year,
SUM(cm.Gold) As Cumulative_Gold,
SUM(cm.Silver) As Cumulative_Silver,
SUM(cm.Bronze) As Cumulative_Bronze,
COALESCE(SUM(cm.Gold), 0) + COALESCE(SUM(cm.Silver), 0) + COALESCE(SUM(cm.Bronze), 0) AS Total_Medals
from (select distinct year from Country_Medals) y cross join
(select distinct country_name from country_medals) c left join
country_medals cm
on cm.year = y.year and
cm.country_name = c.country_name
group By c.Country_Name, y.Year
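Here is a small sketch of the same cross join plus left join pattern, run against an in-memory SQLite database through Python's sqlite3 module. The three sample rows are invented; country 2 deliberately has no row for 1900, so the query has to produce it with a total of 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Country_Medals (Country_Name INTEGER, Year INTEGER, "
    "Gold INTEGER, Silver INTEGER, Bronze INTEGER)"
)
# Invented rows: country 2 has no entry for 1900.
conn.executemany(
    "INSERT INTO Country_Medals VALUES (?, ?, ?, ?, ?)",
    [(1, 1896, 2, 1, 0), (2, 1896, 1, 0, 3), (1, 1900, 4, 2, 1)],
)

# Build every (country, year) pair first, then attach medals where they exist.
query = """
    SELECT c.Country_Name, y.Year,
           COALESCE(SUM(cm.Gold), 0) + COALESCE(SUM(cm.Silver), 0)
               + COALESCE(SUM(cm.Bronze), 0) AS Total_Medals
    FROM (SELECT DISTINCT Year FROM Country_Medals) y
    CROSS JOIN (SELECT DISTINCT Country_Name FROM Country_Medals) c
    LEFT JOIN Country_Medals cm
           ON cm.Year = y.Year AND cm.Country_Name = c.Country_Name
    GROUP BY c.Country_Name, y.Year
    ORDER BY c.Country_Name, y.Year
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note that this gives per-edition totals for every country and year; a running total across editions would still need an additional cumulative step on top of it.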

Calculate a 3-day rolling average in SQL for data stored in Google BigQuery

My data is stored in a Google BigQuery database. This is what my table looks like.
IP Age Sex Province Epid_ID
19/05/2020 43 Female Bagmati KTM-20-00206
18/05/2020 33 Male Province1 KTM-20-00205
18/05/2020 30 Male Province1 KTM-20-00204
18/05/2020 32 Male Province1 KTM-20-00203
18/05/2020 63 Male Province1 KTM-20-00202
17/05/2020 33 Male Province2 KTM-20-00201
17/05/2020 23 Male Province2 KTM-20-00200
16/05/2020 22 Male Province2 KTM-20-00199
16/05/2020 23 Male Province2 KTM-20-00198
Here, Epid_ID is my unique ID. I want to calculate a 3-day rolling average for each date. The following is my expected output.
Date Count_Epid_ID 3_days_rolling_avg
16/05/2020 2 0
17/05/2020 2 0
18/05/2020 4 2.66
19/05/2020 1 2.33
Explanation: the value is 0 for the first 2 days, and the rolling average starts on the 3rd day. For 18/05/2020, 2.66 = (2+2+4)/3; for 19/05/2020, 2.33 = (2+4+1)/3.
I tried to use the following question. However, I was not successful.
This is the query I wrote, which only gives me the count of Epid_ID per date and not the rolling average.
SELECT
IP,
COUNT(*) AS num
FROM
`interim-data.casedata.Interim Reloaded`
GROUP BY
IP
You can use window functions -- assuming you have data on every day:
SELECT IP, COUNT(*) AS num,
AVG(COUNT(*)) OVER (ORDER BY IP ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS rolling_avg
FROM `interim-data.casedata.Interim Reloaded`
GROUP BY IP;
It seems strange that a column called IP has a date value, but that seems to be how your data is modelled.
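To see the window frame at work, here is a rough sketch that reproduces the daily counts from the question (2, 2, 4, 1) in an in-memory SQLite database via Python's sqlite3 module. It is not BigQuery, and it makes a few assumptions: the table and column names (cases, day, epid_id) are made up, the dates are stored as ISO strings so that they sort chronologically, and the bundled SQLite is version 3.25 or later (needed for window functions). The daily counts are computed in a CTE first, and the first two days are forced to 0 as the question asks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cases (day TEXT, epid_id TEXT)")
# Same daily counts as in the question; ISO dates are an assumption.
days = ["2020-05-16"] * 2 + ["2020-05-17"] * 2 + ["2020-05-18"] * 4 + ["2020-05-19"]
conn.executemany(
    "INSERT INTO cases VALUES (?, ?)",
    [(d, f"ID-{i}") for i, d in enumerate(days)],
)

query = """
    WITH daily AS (
        SELECT day, COUNT(*) AS num FROM cases GROUP BY day
    )
    SELECT day, num,
           CASE WHEN ROW_NUMBER() OVER (ORDER BY day) < 3 THEN 0.0
                ELSE ROUND(AVG(num) OVER (
                         ORDER BY day ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
                     ), 2)
           END AS rolling_avg
    FROM daily
    ORDER BY day
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

ROUND gives 2.67 for 18/05/2020, where the question shows the truncated value 2.66. If some dates have no rows at all, a frame of 3 rows is not the same as 3 calendar days, which is the caveat about having data on every day.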

Finding average value of three weekdays, broken down on hours

I have an origin-destination table like this in Bigquery with weekday, date, UTC time/hour and count of trips:
Origin Destination Day Date Time Count
NY Station Downtown Mon 02.09.2019 15 12
NY Station Downtown Mon 02.09.2019 16 10
City libry Eastside Mon 02.09.2019 17 10
NY Station Downtown Tue 03.09.2019 15 8
NY Station Downtown Tue 03.09.2019 16 5
City libry Eastside Tue 03.09.2019 17 5
NY Station Downtown Wed 04.09.2019 15 8
NY Station Downtown Wed 04.09.2019 16 10
City libry Eastside Wed 04.09.2019 17 11
I wish to get the average Count for
each origin-destination pair (NY Station-Downtown and City libry-Eastside)
the average of Monday-Wednesday at each given time
The output should then be something like
Origin Destination Avg_Day Period Time Avg_Count
NY Station Downtown Mon-Wed Week1 (02.09.19-04.09.19) 15 9,33
NY Station Downtown Mon-Wed Week1 (02.09.19-04.09.19) 16 8,33
City libry Eastside Mon-Wed Week1 (02.09.19-04.09.19) 17 8,67
Ignore the Avg_Day and Period columns; they are only there to show which days and dates I want the average for. In other words, the aim is to get an idea of the average counts for each origin-destination pair on a normal weekday (in this case defined as Mon-Wed) at certain hours of the day. For example, the average count at time 15 for the NY Station-Downtown pair is 9,33, calculated as the average of the counts at 15 o'clock on Monday, Tuesday and Wednesday (that is, the average of 12, 8 and 8).
I have tried variants of CASE and WHERE SQL queries, but I'm not even close to grasping the logic of how to write this query, so there is no point in posting one. I possibly have to create a temporary table as well. Can anyone help me? It is HUGELY appreciated.
Below is for BigQuery Standard SQL
#standardSQL
select
Origin,
Destination,
'Mon-Wed' AS Avg_Day,
FORMAT('Week%i (%s-%s)', week, min_date, max_date) AS Period,
Time,
Avg_Count
from (
SELECT
Origin,
Destination,
'Mon-Wed' AS Avg_Day,
EXTRACT(WEEK FROM PARSE_DATE('%d.%m.%Y', date)) week,
MIN(date) AS min_date,
MAX(date) AS max_date,
Time,
ROUND(AVG(count), 2) AS Avg_Count
FROM `project.dataset.table`
WHERE day IN ('Mon', 'Tue', 'Wed')
GROUP BY Origin, Destination, Time, week
)
Applied to the sample data from your question, the output is:
Row Origin Destination Avg_Day Period Time Avg_Count
1 NY Station Downtown Mon-Wed Week35 (02.09.2019-04.09.2019) 15 9.33
2 NY Station Downtown Mon-Wed Week35 (02.09.2019-04.09.2019) 16 8.33
3 City libry Eastside Mon-Wed Week35 (02.09.2019-04.09.2019) 17 8.67
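The core of the query is just a filter on the weekday plus a GROUP BY on origin, destination and time. A stripped-down sketch of that part, run on the question's sample rows in an in-memory SQLite database through Python's sqlite3 module (the table name trips and the column names hour and trip_count are made up here to avoid reserved words; the week and Period handling is left out because the sample covers a single week):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips (origin TEXT, destination TEXT, day TEXT, "
    "date TEXT, hour INTEGER, trip_count INTEGER)"
)
# Sample rows copied from the question.
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("NY Station", "Downtown", "Mon", "02.09.2019", 15, 12),
        ("NY Station", "Downtown", "Mon", "02.09.2019", 16, 10),
        ("City libry", "Eastside", "Mon", "02.09.2019", 17, 10),
        ("NY Station", "Downtown", "Tue", "03.09.2019", 15, 8),
        ("NY Station", "Downtown", "Tue", "03.09.2019", 16, 5),
        ("City libry", "Eastside", "Tue", "03.09.2019", 17, 5),
        ("NY Station", "Downtown", "Wed", "04.09.2019", 15, 8),
        ("NY Station", "Downtown", "Wed", "04.09.2019", 16, 10),
        ("City libry", "Eastside", "Wed", "04.09.2019", 17, 11),
    ],
)

# Keep only Mon-Wed rows, then average per origin-destination pair and hour.
query = """
    SELECT origin, destination, hour, ROUND(AVG(trip_count), 2) AS avg_count
    FROM trips
    WHERE day IN ('Mon', 'Tue', 'Wed')
    GROUP BY origin, destination, hour
    ORDER BY hour
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

With several weeks of data, adding the week to the GROUP BY (as the BigQuery version does) would give one average per week instead of one overall average.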

Get count per year of data with begin and end dates

I have a set of data that lists each employee ever employed in a certain type of department at many cities, and it lists each employee's begin and end date.
For example:
name city_id start_date end_date
-----------------------------------------
Joe Public 54 3-19-1994 9-1-2002
Suzi Que 54 10-1-1995 9-1-2005
What I want is each city's employee count for each year in a particular period. For example, if this was all the data for city 54, then I'd show this as the query results if I wanted to show city 54's employee count for the years 1990-2005:
city_id year employee_count
-----------------------------
54 1990 0
54 1991 0
54 1992 0
54 1993 0
54 1994 1
54 1995 2
54 1996 2
54 1997 2
54 1998 2
54 1999 2
54 2000 2
54 2001 2
54 2002 2
54 2003 1
54 2004 1
54 2005 1
(Note that I will have many cities, so the primary key here would be city and year unless I want to have a separate id column.)
Is there an efficient SQL query to do this? All I can think of is a series of UNIONed queries, with one query for each year I wanted to get numbers for.
My dataset has a few hundred cities and 178,000 employee records. I need to find a few decades' worth of this yearly data for each city on my dataset.
Replace <city_id> with your parameter (54 in the example):
select
<city_id>, c.y, count(t.city_id)
from generate_series(1990, 2005) as c(y)
left outer join Table1 as t on
c.y between extract(year from t.start_date) and extract(year from t.end_date) and
t.city_id = <city_id>
group by c.y
order by c.y
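generate_series is PostgreSQL-specific. As a sketch of the same idea on an engine without it, the version below builds the list of years with a recursive CTE in SQLite, driven from Python's sqlite3 module. The table name employees and the ISO date format are assumptions made for the example; the two rows are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (name TEXT, city_id INTEGER, "
    "start_date TEXT, end_date TEXT)"
)
# Rows from the question, with dates rewritten as ISO strings (assumption).
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        ("Joe Public", 54, "1994-03-19", "2002-09-01"),
        ("Suzi Que", 54, "1995-10-01", "2005-09-01"),
    ],
)

# The recursive CTE plays the role of generate_series(first_year, last_year).
query = """
    WITH RECURSIVE years(y) AS (
        SELECT :first_year
        UNION ALL
        SELECT y + 1 FROM years WHERE y < :last_year
    )
    SELECT :city_id AS city_id, years.y, COUNT(e.city_id) AS employee_count
    FROM years
    LEFT JOIN employees e
           ON e.city_id = :city_id
          AND years.y BETWEEN CAST(strftime('%Y', e.start_date) AS INTEGER)
                          AND CAST(strftime('%Y', e.end_date) AS INTEGER)
    GROUP BY years.y
    ORDER BY years.y
"""
params = {"first_year": 1990, "last_year": 2005, "city_id": 54}
rows = conn.execute(query, params).fetchall()
for row in rows:
    print(row)
```

The left join keeps years with no matching employees, and COUNT(e.city_id) counts only matched rows, so those years come out as 0. To cover many cities in one query, the year list could be cross joined with the list of distinct city ids before the left join.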