Why percentage is not working properly in SQLite3? - sql

I have the following code for the following question however percentage is happening just to be zero:
SELECT p.state, (p.popestimate2011/sum(p.popestimate2011)) * 100
FROM pop_estimate_state_age_sex_race_origin p
WHERE p.age >= 21
GROUP BY p.state;
Also here's the table schema:
sqlite> .schema pop_estimate_state_age_sex_race_origin
CREATE TABLE pop_estimate_state_age_sex_race_origin (
sumlev NUMBER,
region NUMBER,
division NUMBER,
state NUMBER,
sex NUMBER,
origin NUMBER,
race NUMBER,
age NUMBER,
census2010pop NUMBER,
estimatesbase2010 NUMBER,
popestimate2010 NUMBER,
popestimate2011 NUMBER,
PRIMARY KEY(state, age, sex, race, origin),
FOREIGN KEY(sumlev) REFERENCES SUMLEV(sumlev_cd),
FOREIGN KEY(region) REFERENCES REGION(region_cd),
FOREIGN KEY(division) REFERENCES DIVISION(division_cd),
FOREIGN KEY(sex) REFERENCES SEX(sex_cd),
FOREIGN KEY(race) REFERENCES RACE(race_cd),
FOREIGN KEY(origin) REFERENCES ORIGIN(origin_cd));
So when I run the query it just shows 0 for the percentage:
stat p.popestimate
---- -------------
1 0
2 0
4 0
5 0
6 0
8 0
9 0
10 0
11 0
12 0
13 0
15 0
16 0
17 0
18 0
19 0
20 0
21 0
22 0
23 0
I was trying to write it using nested queries by didn't get anywhere too:
SELECT p.state, 100.0 * sum(p.popestimate2011) / total_pop AS percentage
FROM pop_estimate_state_age_sex_race_origin p
JOIN (SELECT state, sum(p2.popestimate2011) AS total_pop
FROM pop_estimate_state_age_sex_race_origin p2) s ON (s.state = p.state)
WHERE age >= 21
GROUP BY p.state, total_pop
ORDER BY p.state;
The current problem I am having is that it just shows one row as result and just shows the result for the last state number (state ID=56):
56 0.131294163192301

Here's an approach (not tested) that does not require an inner query. It makes a single pass over the table, aggregating by state, and using CASE to calculate the numerator of population aged over 20 and denominator of total state population.
SELECT
state,
(SUM(CASE WHEN age >= 21 THEN popestimate2011 ELSE 0) / SUM(popestimate2011)) * 100
FROM pop_estimate_state_age_sex_race_origin
GROUP BY state

I'm not sure why your SQL statement is executing at all. You are including the non-aggregated column value popestimate2011 in a GROUP BY select and that should generate an error.
A closer reading of the SQLite documentation indicates that it does, in fact, support random value selection for non-aggregate columns in the result expression list (a feature also offered by MySQL). This explains:
Why your SELECT statement is able to execute (a random value is chosen for the non-aggregated popestimate2011 reference).
Why you are seeing a result of 0: the random value chosen is probably the first occurring row and if the rows were added to the database in order that row probably has an age value of 0. Since the numerator in your division would then be 0, the result is also 0.
As to the meat of your calculation it's not clear from your table definition whether the data in your base table is already aggregated or not and, if so, what the age column represents (an average? the grouping factor for that row?)
Finally, SQLite does not have a NUMBER data type. These columns will get the default affinity of NUMERIC which is probably what you want but might not be.

You need something along these lines (not tested):
SELECT state, SUM(popestimate2011) /
(SELECT SUM(popestimate2011)
FROM pop_estimate_state_age_sex_race_origin
WHERE age > 21)))
* 100 as percentage
FROM pop_estimate_state_age_sex_race_origi
WHERE age >= 21
GROUP by state
;

The NUMBER type does not exist in SQLite.
SQLite interprets as INTEGER and
decimals are lost in an integer division
(p.popestimate2011 / sum (p.popestimate2011))
is always 0.
Change the type of the column popestimate2011 REAL
or use CAST (...)
(CAST (p.popestimate2011 AS REAL) / SUM (p.popestimate2011))

Related

Calculating Column value based on row above and previous column [duplicate]

This question already has answers here:
How to calculate Running Multiplication
(4 answers)
Closed 6 months ago.
I have a table I'm trying to create that has a column that needs to be calculated based on the row above it multiplied by the previous column. The first row is defaulted to 100,000 and the rest of the rows would be calculated off of that. Here's an example:
Age
Population
Deaths
DeathRate
DeathPro
DeathProb
SurvivalProb
PersonsAlive
0
1742
0
0
0.1
0
1
100,000
51
2048
1
0.00048
0.5
0.00048
0.99951
99951.18379
52
1921
0
0
0.5
0
1
99951.18379
61
1965
1
0.00051
0.5
0.00051
0.99949
99900.33
I skipped some ages so I didn't have type it all in there, but the ages go from 0 - 85. This was orginally done in excel and the formula for PersonsAlive (which is what I'm trying to recreate) was G3*H2 aka previous value of PersonsAlive * Survival Probability.
I was thinking I could accomplish this with the lag function, but with the example I provided above, I get null values for everything after age 1 because there is no value in the previous row. What I want to happen is that PersonsAlive returns 100,000 until I get a death (in the example at Age 51) and then it does the calculation and returns the value (99951) until another death happens (Age 61). Here's my code, which includes two extra columns, ZipCode (the reason we want to do it in SQL is so we can calculate all zips at once) and PersonsAliveTemp, which I used to set Age 0 to 100,000:
SELECT
ZipCode
,Age
,[Population]
,Deaths
,DeathRate
,Death_Proportion
,DeathProbablity
,SurvivalProbablity
,PersonsAliveTemp
,(LAG(PersonsAliveTemp,1) OVER(PARTITION BY ZipCode ORDER BY Age))*SurvivalProbablity as PersonsAlive
FROM #temp4
I also tried it with defaulting PersonsAliveTemp to 100,000 and 0, which "works" but doesn't do the running calculation.
Is it possible to get the lag function (or some other function) to do a running row by row calc?
This converts a running product into an addition via logarithms.
select *,
100000 * exp(sum(log(SurvivalProb)) over
(partition by ZipCode order by Age
rows between unbounded preceding and current row)
) as PersonsAlive
from data
order by Age;
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=36be4d66260c74196f7d36833018682a

Misleading count of 1 on JOIN in Postgres 11.7

I've run into a subtlety around count(*) and join, and a hoping to get some confirmation that I've figured out what's going on correctly. For background, we commonly convert continuous timeline data into discrete bins, such as hours. And since we don't want gaps for bins with no content, we'll use generate_series to synthesize the buckets we want values for. If there's no entry for, say 10AM, fine, we stil get a result. However, I noticed that I'm sometimes getting 1 instead of 0. Here's what I'm trying to confirm:
The count is 1 if you count the "grid" series, and 0 if you count the data table.
This only has to do with count, and no other aggregate.
The code below sets up some sample data to show what I'm talking about:
DROP TABLE IF EXISTS analytics.measurement_table CASCADE;
CREATE TABLE IF NOT EXISTS analytics.measurement_table (
hour smallint NOT NULL DEFAULT NULL,
measurement smallint NOT NULL DEFAULT NULL
);
INSERT INTO measurement_table (hour, measurement)
VALUES ( 0, 1),
( 1, 1), ( 1, 1),
(10, 2), (10, 3), (10, 5);
Here are the goal results for the query. I'm using 12 hours to keep the example results shorter.
Hour Count sum
0 1 1
1 2 2
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
7 0 0
8 0 0
9 0 0
10 3 10
11 0 0
12 0 0
This works correctly:
WITH hour_series AS (
select * from generate_series (0,12) AS hour
)
SELECT hour_series.hour,
count(measurement_table.hour) AS frequency,
COALESCE(sum(measurement_table.measurement), 0) AS total
FROM hour_series
LEFT JOIN measurement_table ON (measurement_table.hour = hour_series.hour)
GROUP BY 1
ORDER BY 1
This returns misleading 1's on the match:
WITH hour_series AS (
select * from generate_series (0,12) AS hour
)
SELECT hour_series.hour,
count(*) AS frequency,
COALESCE(sum(measurement_table.measurement), 0) AS total
FROM hour_series
LEFT JOIN measurement_table ON (hour_series.hour = measurement_table.hour)
GROUP BY 1
ORDER BY 1
0 1 1
1 2 2
2 1 0
3 1 0
4 1 0
5 1 0
6 1 0
7 1 0
8 1 0
9 1 0
10 3 10
11 1 0
12 1 0
The only difference between these two examples is the count term:
count(*) -- A result of 1 on no match, and a correct count otherwise.
count(joined to table field) -- 0 on no match, correct count otherwise.
That seems to be it, you've got to make it explicit that you're counting the data table. Otherwise, you get a count of 1 since the series data is matching once. Is this a nuance of joinining, or a nuance of count in Postgres?
Does this impact any other aggrgate? It seems like it sholdn't.
P.S. generate_series is just about the best thing ever.
You figured out the problem correctly: count() behaves differently depending on the argument is is given.
count(*) counts how many rows belong to the group. This just cannot be 0 since there is always at least one row in a group (otherwise, there would be no group).
On the other hand, when given a column name or expression as argument, count() takes in account any non-null value, and ignores null values. For your query, this lets you distinguish groups that have no match in the left joined table from groups where there are matches.
Note that this behavior is not Postgres specific, but belongs to the standard
ANSI SQL specification (all databases that I know conform to it).
Bottom line:
in general cases, uses count(*); this is more efficient, since the database does not need to check for nulls (and makes it clear to the reader of the query that you just want to know how many rows belong to the group)
in specific cases such as yours, put the relevant expression in the count()

SQL: Find rows that match closely but not exactly

I have a table inside a PostgreSQL database with columns c1,c2...cn. I want to run a query that compares each row against a tuple of values v1,v2...vn. The query should not return an exact match but should return a list of rows ordered in descending similarity to the value vector v.
Example:
The table contains sports records:
1,USA,basketball,1956
2,Sweden,basketball,1998
3,Sweden,skating,1998
4,Switzerland,golf,2001
Now when I run a query against this table with v=(Sweden,basketball,1998), I want to get all records that have a similarity with this vector, sorted by number of matching columns in descending order:
2,Sweden,basketball,1998 --> 3 columns match
3,Sweden,skating,1998 --> 2 columns match
1,USA,basketball,1956 --> 1 column matches
Row 4 is not returned because it does not match at all.
Edit: All columns are equally important. Although, when I really think of it... it would be a nice add-on if I could give each column a different weight factor as well.
Is there any possible SQL query that would return the rows in a reasonable amount of time, even when I run it against a million rows?
What would such a query look like?
SELECT * FROM countries
WHERE country = 'sweden'
OR sport = 'basketball'
OR year = 1998
ORDER BY
cast(country = 'sweden' AS integer) +
cast(sport = 'basketball' as integer) +
cast(year = 1998 as integer) DESC
It's not beautiful, but well. You can cast the boolean expressions as integers and sum them.
You can easily change the weight, by adding a multiplicator.
cast(sport = 'basketball' as integer) * 5 +
This is how I would do it ... the multiplication factors used in the case stmts will handle the importance(weight) of the match and they will ensure that those records that have matches for columns designated with the highest weight will come up top even if the other columns don't match for those particular records.
/*
-- Initial Setup
-- drop table sport
create table sport (id int, Country varchar(20) , sport varchar(20) , yr int )
insert into sport values
(1,'USA','basketball','1956'),
(2,'Sweden','basketball','1998'),
(3,'Sweden','skating','1998'),
(4,'Switzerland','golf','2001')
select * from sport
*/
select * ,
CASE WHEN Country='sweden' then 1 else 0 end * 100 +
CASE WHEN sport='basketball' then 1 else 0 end * 10 +
CASE WHEN yr=1998 then 1 else 0 end * 1 as Match
from sport
WHERE
country = 'sweden'
OR sport = 'basketball'
OR yr = 1998
ORDER BY Match Desc
It might help if you wrote a stored procedure that calculates a "similarity metric" between two rows. Then your query could refer to the return value of that procedure directly rather than having umpteen conditions in the where-expression and the order-by-expression.

Finding even or odd ID values

I was working on a query today which required me to use the following to find all odd number ID values
(ID % 2) <> 0
Can anyone tell me what this is doing? It worked, which is great, but I'd like to know why.
ID % 2 is checking what the remainder is if you divide ID by 2. If you divide an even number by 2 it will always have a remainder of 0. Any other number (odd) will result in a non-zero value. Which is what is checking for.
For finding the even number we should use
select num from table where ( num % 2 ) = 0
As Below Doc specify
dividend % divisor
Returns the remainder of one number divided by another.
https://learn.microsoft.com/en-us/sql/t-sql/language-elements/modulo-transact-sql#syntax
For Example
13 % 2 return 1
Next part is <> which denotes Not equals.
Therefor what your statement mean is
Remainder of ID when it divided by 2 not equals to 0
Be careful because this is not going to work in Oracle database. Same Expression will be like below.
MOD(ID, 2) <> 0
ID % 2 reduces all integer (monetary and numeric are allowed, too) numbers to 0 and 1 effectively.
Read about the modulo operator in the manual.
In oracle,
select num from table where MOD (num, 2) = 0;
dividend % divisor
Dividend is the numeric expression to divide. Dividend must be any expression of integer data type in sql server.
Divisor is the numeric expression to divide the dividend. Divisor must be expression of integer data type except in sql server.
SELECT 15 % 2
Output
1
Dividend = 15
Divisor = 2
Let's say you wanted to query
Query a list of CITY names from STATION with even ID numbers only.
Schema structure for STATION:
ID Number
CITY varchar
STATE varchar
select CITY from STATION as st where st.id % 2 = 0
Will fetch the even set of records
In order to fetch the odd records with Id as odd number.
select CITY from STATION as st where st.id % 2 <> 0
% function reduces the value to either 0 or 1
It's taking the ID , dividing it by 2 and checking if the remainder is not zero; meaning, it's an odd ID.
<> means not equal. however, in some versions of SQL, you can write !=

My aggregate is not affected by ROLLUP

I have a query similar to the following:
SELECT CASE WHEN (GROUPING(Name) = 1) THEN 'All' ELSE Name END AS Name,
CASE WHEN (GROUPING(Type) = 1) THEN 'All' ELSE Type END AS Type,
sum(quantity) AS [Quantity],
CAST(sum(quantity) * (SELECT QuantityMultiplier FROM QuantityMultipliers WHERE a = t.b) AS DECIMAL(18,2)) AS Multiplied Quantity
FROM #Table t
GROUP BY Name, Type WITH ROLLUP
I'm trying to return a list of Names, Types, a summed Quantity and a summed quantity multiplied by an arbitrary number. All fine so far. I also need to return a sub-total row per Name and per Type, such as the following
Name Type Quantity Multiplied Quantity
------- --------- ----------- -------------------
a 1 2 4
a 2 3 3
a ALL 5 7
b 1 6 12
b 2 1 1
b ALL 7 13
ALL ALL 24 40
The first 3 columns are fine. I'm getting null values in the rollup rows for the multiplied quantity though. The only reason I can think this is happening is because SQL doesn't recognize the last column as an aggregate now that I've multiplied it by something.
Can I somehow work around this without things getting too convoluted?
I will be falling back onto temporary tables if this can't be done.
In your sub-query to acquire the multiplier, you have WHERE a=b. Are either a or b from the tables in your main query?
If these values are static (nothing to do with the main query), it looks like it should be fine...
If the a or b values are the name or type field, they can be NULL for the rollup records. If so, you can change to something similiar to...
CAST(sum(quantity * (<multiplie_query>)) AS DECIMAL(18,2)).
If a or b are other field from your main query, you'd be getting multiple records back, not just a single multiplier. You could change to something like...
CAST(sum(quantity) * (SELECT MAX(multiplier) FROM ...)) AS DECIMAL(18,2))