Assign an age to a person based on known population averages but no date of birth - SQL

I would like to use PostgreSQL to assign an age category to a list of households, where we don't know the date of birth of any of the family members.
Dataset looks like:
household_id  household_size
x1            5
x2            1
x3            8
...           ...
I then have a set of percentages for each age group with that dataset looking like:
age_group  percentage
0-18       30
19-30      40
31-100     30
I want the query to allocate ages so that, overall, the whole dataset comes as close to those percentages as possible, and if possible something similar at the household level (not as important). The dataset will end up looking like:
household_id  household_size  0-18  19-30  31-100
x1            5               2     2      1
x2            1               0     1      0
x3            8               3     3      2
...           ...             ...   ...    ...
I have looked at the ntile function, but any pointers as to how I could handle this with Postgres would be really helpful.

I didn't want to post an answer with just a link, so I figured I'd give it a shot and see if I could simplify depesz's weighted_random into plain SQL. The result is a slower, less readable, worse version of it, but shorter and in plain SQL:
CREATE FUNCTION weighted_random( IN p_choices ANYARRAY, IN p_weights float8[] )
RETURNS ANYELEMENT language sql as $$
select choice
from (
    select case when (sum(weight) over (rows UNBOUNDED PRECEDING)) >= hit
                then choice end as choice
    from (select unnest(p_choices) as choice,
                 unnest(p_weights) as weight) inputs,
         (select sum(weight) * random() as hit
          from unnest(p_weights) a(weight)) as random_hit
) chances
where choice is not null
limit 1
$$;
It's not inlineable because of the aggregate and window function calls. It's faster if you assume the weights will only be probabilities that sum up to 1.
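A standalone call looks like this (just a quick sanity check; being random, the result naturally varies from run to run):

SELECT weighted_random('{0-18,19-30,31-100}'::TEXT[], '{30,40,30}');
-- e.g. '19-30'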
The principle is that you provide an array of choices and an equal-length array of weights (those can be percentages but don't have to be, nor do they have to sum up to any specific number):
update test_area t
set ("0-18",
     "19-30",
     "31-100")
  = (with cte as (
         select weighted_random('{0-18,19-30,31-100}'::TEXT[], '{30,40,30}') as age_group
         from generate_series(1, household_size, 1))
     select count(*) filter (where age_group = '0-18')   as "0-18",
            count(*) filter (where age_group = '19-30')  as "19-30",
            count(*) filter (where age_group = '31-100') as "31-100"
     from cte)
returning *;
Online demo showing that both his version and mine are statistically reliable.
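If the linked demo isn't handy, a quick self-check is easy to run yourself: over a large sample the observed shares should land near 30/40/30 (a sketch):

SELECT age_group,
       round(count(*) * 100.0 / 100000, 1) AS pct  -- observed share in percent
FROM (
    SELECT weighted_random('{0-18,19-30,31-100}'::TEXT[], '{30,40,30}') AS age_group
    FROM generate_series(1, 100000)
) t
GROUP BY age_group;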

A minimum start could be:
SELECT
    household_id,
    MIN(household_size) as size,
    ROUND(SUM(CASE WHEN agegroup_from = 0  THEN g ELSE 0 END), 1) as g1,
    ROUND(SUM(CASE WHEN agegroup_from = 19 THEN g ELSE 0 END), 1) as g2,
    ROUND(SUM(CASE WHEN agegroup_from = 31 THEN g ELSE 0 END), 1) as g3
FROM (
    SELECT
        h.household_id,
        h.household_size,
        p.agegroup_from,
        p.percentage / 100.0 * h.household_size as g
    FROM households h
    CROSS JOIN PercPerAge p) x
GROUP BY household_id
ORDER BY household_id;
output:
household_id  size  g1   g2   g3
x1            5     1.5  2.0  1.5
x2            1     0.3  0.4  0.3
x3            8     2.4  3.2  2.4
see: DBFIDDLE
Notes:
Of course you should round the g columns to whole numbers, taking into account the complete split (g1 + g2 + g3 = total); see the sketch after these notes.
Because g1, g2 and g3 are based on percentages, their values can be shifted around, as long as the total split stays OK (for more info, see: Return all possible combinations of values on columns in SQL).
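One way to implement that rounding note is a largest-remainder pass: floor every g, then hand the units lost to flooring back to the rows with the biggest fractional parts, so each household still sums to its size. A sketch building on the query above (the CTE names are mine, not part of the original answer):

WITH targets AS (
    SELECT h.household_id, p.agegroup_from,
           p.percentage / 100.0 * h.household_size AS g
    FROM households h
    CROSS JOIN PercPerAge p
), ranked AS (
    SELECT household_id, agegroup_from,
           floor(g)::int AS base,
           g - floor(g)  AS frac,
           row_number() OVER (PARTITION BY household_id
                              ORDER BY g - floor(g) DESC) AS rn
    FROM targets
), deficits AS (
    SELECT household_id, round(sum(frac))::int AS short  -- units lost to flooring
    FROM ranked
    GROUP BY household_id
)
SELECT r.household_id, r.agegroup_from,
       r.base + (r.rn <= d.short)::int AS members  -- top fractions get the leftovers
FROM ranked r
JOIN deficits d USING (household_id)
ORDER BY 1, 2;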

Related

Finding percentages between 2 different columns in SQL

I created this query:
select first_price, last_price,
       cast(sum(1 - (first_price / nullif(last_price, 0))) as double) as first_vs_last_percentages
from prices
group by first_price, last_price
having first_vs_last_percentages >= 0.1
Unfortunately, this is the wrong data I get in the first_vs_last_percentages column:
ID  first_price  last_price  first_vs_last_percentages
1   10           11          1-(10/11) = 1.0
2   66           68          1-(66/68) = 1.0
It was supposed to return this output:
ID  first_price  last_price  first_vs_last_percentages
1   10           11          1-(10/11) = 0.0909
2   66           68          1-(66/68) = 0.0294
If someone has a good solution, ideally in Presto syntax, that would be wonderful.
It seems you got struck by another case of integer division (your cast to double comes a bit too late). Update the query so that the type of the divisor or dividend changes, for example by multiplying one of them by 1.0, which is a bit shorter than a cast to double:
select -- ...
, sum(1 - (first_price * 1.0) / nullif(last_price, 0)) first_vs_last_percentages
from ...
P.S.
Your query is a bit strange; I'm not sure why you need the grouping and sum here.
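Putting the fix together, the whole query without the unnecessary grouping might look like this (a sketch in Presto syntax; assumes prices has an id column, as in your sample output):

select id, first_price, last_price,
       1 - first_price * 1.0 / nullif(last_price, 0) as first_vs_last_percentages
from prices
where 1 - first_price * 1.0 / nullif(last_price, 0) >= 0.1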
It depends on which database engine you are working with. Most query confusion comes down to either conceptual or syntactic mistakes. In this case, the goal is a per-row percentage, 100.0 * (last - first) / first, which means you can drop the GROUP BY and HAVING: you should not group by double values, only by the intervals they belong to.
select
    first_price,
    last_price,
    case
        when first_price = 0 then null
        else (last_price - first_price) * 1.0 / first_price  -- * 1.0 avoids integer division
    end as first_vs_last_percentage
from prices

SQL - How to filter out responses with no variation for survey collection to do multi-linear regression?

I'm new to SQL and I am trying to filter out responses with no variation for a survey collection (invalid responses) to do multi-linear regression. Do note that there are actually more than 100 records in this table; I have simplified it for illustration.
Database: MySQL 8.0.30 : TLSv1.2 (TablePlus)
ID is the respondent number.
Variables - x1, x2, x3 are the independent variables.
Values - the survey responses.
For example this is the current table I have:
ID  Variables  Values
1   x1         1
1   x2         1
1   x3         1
2   x1         2
2   x2         3
2   x3         4
3   x1         5
3   x2         5
3   x3         5
Scripts used:
SELECT ID, Variables, Values
FROM TableA
GROUP BY ID
I am trying to achieve the following table, where I only want to keep the records which have a variation in the responses:
ID  Variables  Values
2   x1         2
2   x2         3
2   x3         4
I have tried using WHERE, DISTINCT, WHERE NOT and HAVING, but I can't seem to get the results that I require; most attempts come back blank (like the table below). If anyone is able to help, that would be most appreciated.
ID  Variables  Values
(empty result)
Thank you very much!
Your problem has two parts, so you are going to need a subquery for this.
First, you want to know which responses have variation. For this, group the responses by id, then check whether all responses sharing an id have the same value or not, by selecting only those having more than one distinct value:
select `id`
from results
group by `id`
having count(distinct `values`) > 1
based on that you can just wrap it with a select to get all the fields that you want, ungrouped:
select *
from results
where `id` in (
    select `id`
    from results
    group by `id`
    having count(distinct `values`) > 1
)
This is MySQL syntax, but it shouldn't differ much across the main databases.
SQL Fiddle: http://sqlfiddle.com/#!9/a266f806/4/0
Hope that helps
Try the following:
WITH ids_with_variations as
(
    SELECT ID
          ,COUNT(DISTINCT [Values]) as unique_value_count
    FROM TableA
    GROUP BY ID
    HAVING COUNT(DISTINCT [Values]) = 3 -- this assumes that you expect each ID to have exactly three responses
)
SELECT *
FROM TableA
WHERE ID IN (SELECT ID FROM ids_with_variations)
This is T-SQL dialect. It also assumes that you expect exactly three distinct values per ID in the Values column.
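Since you're on MySQL 8.0, a single-pass variant with window functions is also possible. The idea: an ID shows variation exactly when its smallest and largest response differ. A sketch, not tested against your data:

SELECT ID, Variables, `Values`
FROM (
    SELECT t.*,
           MIN(`Values`) OVER (PARTITION BY ID) AS lo,  -- smallest response per ID
           MAX(`Values`) OVER (PARTITION BY ID) AS hi   -- largest response per ID
    FROM TableA t
) x
WHERE lo <> hi;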

How to filter the max value and write to row?

Postgres 9.3.5, PostGIS 2.1.4.
I have two tables (polygons and points) in a database.
I want to find out how many points are in each polygon. There can be anywhere from 0 to more than 200,000 points per polygon. The little hiccup is the following.
My point table looks the following:
x y lan
10 11 en
10 11 fr
10 11 en
10 11 es
10 11 en
- #just for demonstration/clarification purposes
13 14 fr
13 14 fr
13 14 es
-
15 16 ar
15 16 ar
15 16 ps
I do not simply want to count the number of points per polygon. I want to know the most often occurring lan in each polygon. So, assuming each - indicates that the points fall into a new polygon, my results would look like the following:
Polygon table:
polygon Count lan
1 3 en
2 2 fr
3 2 ar
This is what I got so far.
SELECT count(*), count.language AS language, hexagons.gid AS hexagonsWhere
FROM hexagonslan AS hexagons,
     points_count AS count
WHERE ST_Within(count.geom, hexagons.geom)
GROUP BY language, hexagonsWhere
ORDER BY hexagonsWhere DESC;
It gives me the following:
Polygon Count language
1 3 en
1 1 fr
1 1 es
2 2 fr
2 1 es
3 2 ar
3 1 ps
Two things remain unclear:
1. How do I get only the max value?
2. How are cases treated where the max values happen to be identical?
Answer to 1.
To get the most common language and its count per Polygon, you could use a simple DISTINCT ON query:
SELECT DISTINCT ON (h.gid)
h.gid AS polygon, count(c.geom) AS ct, c.language
FROM hexagonslan h
LEFT JOIN points_count c ON ST_Within(c.geom, h.geom)
GROUP BY h.gid, c.language
ORDER BY h.gid, count(c.geom) DESC, c.language; -- language name is tiebreaker
Select first row in each GROUP BY group?
But for the data distribution you describe (up to 200,000 points per polygon), this should be substantially faster (hoping to make better use of an index on c.geom):
SELECT h.gid AS polygon, c.ct, c.language
FROM hexagonslan h
LEFT JOIN LATERAL (
SELECT c.language, count(*) AS ct
FROM points_count c
WHERE ST_Within(c.geom, h.geom)
GROUP BY 1
ORDER BY 2 DESC, 1 -- again, language name is tiebreaker
LIMIT 1
) c ON true
ORDER BY 1;
Optimize GROUP BY query to retrieve latest record per user
LEFT JOIN LATERAL .. ON true preserves polygons not containing any points.
Call a set-returning function with an array argument multiple times
In cases where the max values happen to be identical, the alphabetically first language is picked in the example, by way of the added ORDER BY item. If you want all languages that happen to share the maximum count, you have to do more:
Answer to 2.
SELECT h.gid AS polygon, c.ct, c.language
FROM hexagonslan h
LEFT JOIN LATERAL (
SELECT c.language, count(*) AS ct
, rank() OVER (ORDER BY count(*) DESC) AS rnk
FROM points_count c
WHERE ST_Within(c.geom, h.geom)
GROUP BY 1
) c ON c.rnk = 1
ORDER BY 1, 3 -- language only as additional sort criteria
We use the window function rank() here (not row_number()!). We can get the count of points and the ranking of that count in a single SELECT. Consider the sequence of events:
Best way to get result count before LIMIT was applied
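To see why rank() (and not row_number()) is the right tool when ties must survive the c.rnk = 1 filter, here is a toy illustration with made-up counts:

SELECT lang, ct,
       rank()       OVER (ORDER BY ct DESC) AS rnk,
       row_number() OVER (ORDER BY ct DESC) AS rn
FROM (VALUES ('en', 3), ('fr', 3), ('es', 1)) v(lang, ct);

-- rnk comes out 1, 1, 3: both tied languages pass rnk = 1.
-- rn would come out 1, 2, 3: row_number() would arbitrarily drop one of them.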

Tricky aggregation Oracle 11G

I am using Oracle 11G.
Here is my data in table ClassGrades:
ID Name APlusCount TotalStudents PctAplus
0 All 44 95 46.31
1 Grade1A 13 24 54.16
2 Grade1B 11 25 44.00
3 Grade1C 8 23 34.78
4 Grade1D 12 23 52.17
The data (APlusCount,TotalStudents) for ID 0 is the sum of data for all classes.
I want to calculate how each class compares to other classes except itself.
Example:
Take Grade1A that has PctAplus = 54.16.
I want to add all values for Grade1B,Grade1C and Grade1D which is;
((Sum of APlusCount for Grade1B, 1C, 1D) / (Sum of TotalStudents for Grade1B, 1C, 1D)) * 100
= (31 / 71) * 100 => 43.66%
So Grade1A (54.16%) is doing much better when compared to its peers (43.66%).
I want to calculate Peers collective percentage for each Grade.
How do I do this?
Another approach might be to leverage the All record for totals (self cross join as mentioned in the comments), i.e.,
WITH g1 AS (
SELECT apluscount, totalstudents
FROM grades
WHERE name = 'All'
)
SELECT g.name, 100*(g1.apluscount - g.apluscount)/(g1.totalstudents - g.totalstudents)
FROM grades g, g1
WHERE g.name != 'All';
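As a sanity check against the hand calculation in the question: for Grade1A this gives 100 * (44 - 13) / (95 - 24) = 100 * 31 / 71 ≈ 43.66%, matching the expected peer percentage.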
However, I think that @Wernfried's solution is better, as it doesn't depend on the existence of an All record.
UPDATE
Alternately, one could compute the totals with aggregates in the WITH clause:
WITH g1 AS (
SELECT SUM(apluscount) AS apluscount, SUM(totalstudents) AS totalstudents
FROM grades
WHERE name != 'All'
)
SELECT g.name, 100*(g1.apluscount - g.apluscount)/(g1.totalstudents - g.totalstudents)
FROM grades g, g1
WHERE g.name != 'All';
Hope this helps. Again, the solution using window functions is probably the best, however.
I don't know how to deal with the "All" record, but for the others this is an approach:
select Name,
100*(sum(APlusCount) over () - APlusCount) /
(sum(TotalStudents) over () - TotalStudents) as result
from grades
where name <> 'All';
NAME RESULT
=================================
Grade1A 43.661971830986
Grade1B 47.142857142857
Grade1C 50
Grade1D 44.444444444444
See example in SQL Fiddle

Count the number of rows that contain a letter/number

What I am trying to achieve is straightforward; however, it is a little difficult to explain and I don't know if it is actually even possible in Postgres. I am at a fairly basic level: SELECT, FROM, WHERE, LEFT JOIN ON, HAVING, etc., the basic stuff.
I am trying to count the number of rows that contain a particular letter/number and display that count against the letter/number.
i.e. how many rows have entries that contain an "a/A" (case insensitive)?
The table I'm querying is a list of film names. All I want to do is group and count 'a-z' and '0-9' and output the totals. I could run 36 queries sequentially:
SELECT filmname FROM films WHERE filmname ilike '%a%'
SELECT filmname FROM films WHERE filmname ilike '%b%'
SELECT filmname FROM films WHERE filmname ilike '%c%'
And then run pg_num_rows on the result to find the number I require, and so on.
I know how intensive LIKE is, and ILIKE even more so, so I would prefer to avoid that. Although the data (below) has upper and lower case in it, I want the result sets to be case insensitive, i.e. for "The Men Who Stare At Goats" the a/A, t/T and s/S wouldn't count twice for the result set. I can duplicate the table into a secondary working table with the data all lower-cased and work on that set of data, if it makes the query simpler or easier to construct.
An alternative could be something like
SELECT sum(length(regexp_replace(filmname, '[^X|^x]', '', 'g'))) FROM films;
for each letter combination, but that is again 36 queries and 36 result sets; I would prefer to get the data in a single query.
Here is a short data set of 14 films from my set (which actually contains 275 rows)
District 9
Surrogates
The Invention Of Lying
Pandorum
UP
The Soloist
Cloudy With A Chance Of Meatballs
The Imaginarium of Doctor Parnassus
Cirque du Freak: The Vampires Assistant
Zombieland
9
The Men Who Stare At Goats
A Christmas Carol
Paranormal Activity
If I manually lay out each letter and number in a column, register whether that letter appears in each film title by giving it an x in that column, and then count the x's up to produce a total, I get something like the table below. Each vertical column of x's lists the letters in that film name, regardless of how many times a letter appears or its case.
The result for the short set above is:
A x x xxxx xxx 9
B x x 2
C x xxx xx 6
D x x xxxx 6
E xx xxxxx x 8
F x xxx 4
G xx x x 4
H x xxxx xx 7
I x x xxxxx xx 9
J 0
K x 1
L x xx x xx 6
M x xxxx xxx 8
N xx xxxx x x 8
O xxx xxx x xxx 10
P xx xx x 5
Q x 1
R xx x xx xxx 7
S xx xxxx xx 8
T xxx xxxx xxx 10
U x xx xxx 6
V x x x 3
W x x 2
X 0
Y x x x 3
Z x 1
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 x x 1
In the example above, each column is a "filmname". As you can see, column 5 marks only a "u" and a "p", and column 11 marks only a "9". The final column is the tally for each letter.
I want to build a query that somehow gives me the result rows A 9, B 2, C 6, D 6, E 8, etc., taking into account every row entry extracted from my films column. If a letter doesn't appear in any row, I would like a zero.
I don't know if this is even possible, or whether doing it systematically in PHP with 36 queries is the only option.
The current dataset has 275 entries and grows by around 8.33 a month (100 a year). I predict it will reach around 1000 rows by 2019, by which time I will no doubt be using a completely different system, so I don't need to worry about trawling through a huge dataset.
The current longest title is "Percy Jackson & the Olympians: The Lightning Thief" at 50 chars (yes, poor film I know ;-) and the shortest is 1, "9".
I am running version 9.0.0 of Postgres.
Apologies if I've said the same thing multiple times in multiple ways; I am trying to get as much information out as possible so you know what I am trying to achieve.
If you need any clarification or larger datasets to test with, please just ask and I'll edit as needed.
Suggestions are VERY welcome.
Edit 1
Erwin Thanks for the edits/tags/suggestions. Agree with them all.
Fixed the missing "9" typo as suggested by Erwin. Manual transcription error on my part.
kgrittn, Thanks for the suggestion but I am not able to update the version from 9.0.0. I have asked my provider if they will try to update.
Response
Thanks for the excellent reply Erwin
Apologies for the delay in responding, but I have been trying to get your query to work and learning the new keywords to understand the query you created.
I adjusted the query to fit my table structure, but the result set was not as expected (all zeros), so I copied your lines directly and had the same result.
In both cases the result set lists all 36 rows with the appropriate letters/numbers; however, every row shows zero as the count (ct).
I have tried to deconstruct the query to see where it may be falling over.
The result of
SELECT DISTINCT id, unnest(string_to_array(lower(film), NULL)) AS letter
FROM films
is "No rows found". Perhaps it ought to when extracted from the wider query, I'm not sure.
When I removed the unnest function the result was 14 rows all with "NULL"
If I adjust the function COALESCE(y.ct, 0) to COALESCE(y.ct, 4),
then my dataset responds with 4's for every letter instead of zeros, as explained previously.
Having briefly read up on COALESCE, with the "4" being the substitute value, I am guessing that y.ct is NULL and is being substituted with this second value (this covers rows where the letter in the sequence is not matched, i.e. if no films contain a 'q' then the 'q' row will have a zero value rather than NULL?).
The database I tried this on was SQL_ASCII, and I wondered if that was somehow the problem, but I had the same result on one running version 8.4.0 with UTF-8.
Apologies if I've made an obvious mistake, but I am unable to return the dataset I require.
Any thoughts?
Again, thanks for the detailed response and your explanations.
This query should do the job:
Test case:
CREATE TEMP TABLE films (id serial, film text);
INSERT INTO films (film) VALUES
('District 9')
,('Surrogates')
,('The Invention Of Lying')
,('Pandorum')
,('UP')
,('The Soloist')
,('Cloudy With A Chance Of Meatballs')
,('The Imaginarium of Doctor Parnassus')
,('Cirque du Freak: The Vampires Assistant')
,('Zombieland')
,('9')
,('The Men Who Stare At Goats')
,('A Christmas Carol')
,('Paranormal Activity');
Query:
SELECT l.letter, COALESCE(y.ct, 0) AS ct
FROM (
SELECT chr(generate_series(97, 122)) AS letter -- a-z in UTF8!
UNION ALL
SELECT generate_series(0, 9)::text -- 0-9
) l
LEFT JOIN (
SELECT letter, count(id) AS ct
FROM (
SELECT DISTINCT -- count film once per letter
id, unnest(string_to_array(lower(film), NULL)) AS letter
FROM films
) x
GROUP BY 1
) y USING (letter)
ORDER BY 1;
This requires PostgreSQL 9.1! Consider the release notes:
Change string_to_array() so a NULL separator splits the string into
characters (Pavel Stehule)
Previously this returned a null value.
You can use regexp_split_to_table(lower(film), '') instead of unnest(string_to_array(lower(film), NULL)) (works in versions pre-9.1!), but it is typically a bit slower and performance degrades with long strings.
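So on a pre-9.1 install like yours, the inner derived table could be rewritten like this (a sketch of the pre-9.1 variant; the rest of the query stays the same):

SELECT DISTINCT -- count film once per letter
       id, regexp_split_to_table(lower(film), '') AS letter
FROM films;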
I use generate_series() to produce the [a-z0-9] as individual rows. And LEFT JOIN to the query, so every letter is represented in the result.
Use DISTINCT to count every film once.
Never worry about 1000 rows. That is peanuts for modern day PostgreSQL on modern day hardware.
A fairly simple solution which only requires a single table scan would be the following.
SELECT
'a', SUM( (title ILIKE '%a%')::integer),
'b', SUM( (title ILIKE '%b%')::integer),
'c', SUM( (title ILIKE '%c%')::integer)
FROM film
I left the other 33 characters as a typing exercise for you :)
BTW, 1000 rows is tiny for a PostgreSQL database. It only begins to get large when the DB is larger than the memory in your server.
edit: had a better idea
SELECT chars.c, COUNT(title)
FROM (VALUES ('a'), ('b'), ('c')) as chars(c)
LEFT JOIN film ON title ILIKE ('%' || chars.c || '%')
GROUP BY chars.c
ORDER BY chars.c
You could also replace the (VALUES ('a'), ('b'), ('c')) as chars(c) part with a reference to a table containing the list of characters you are interested in.
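For instance, the full [a-z0-9] list can be generated inline instead of typing out 36 VALUES rows (a sketch, borrowing the generate_series() trick from the accepted answer):

SELECT chars.c, COUNT(title)
FROM (
    SELECT chr(generate_series(97, 122)) AS c  -- a-z
    UNION ALL
    SELECT generate_series(0, 9)::text         -- 0-9
) chars
LEFT JOIN film ON title ILIKE ('%' || chars.c || '%')
GROUP BY chars.c
ORDER BY chars.c;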
This will give you the result in a single row, with one column for each matching letter and digit.
SELECT
    SUM(CASE WHEN POSITION('a' IN lower(filmname)) > 0 THEN 1 ELSE 0 END) AS "A",
    SUM(CASE WHEN POSITION('b' IN lower(filmname)) > 0 THEN 1 ELSE 0 END) AS "B",
    SUM(CASE WHEN POSITION('c' IN lower(filmname)) > 0 THEN 1 ELSE 0 END) AS "C",
    ...
    SUM(CASE WHEN POSITION('z' IN lower(filmname)) > 0 THEN 1 ELSE 0 END) AS "Z",
    SUM(CASE WHEN POSITION('0' IN filmname) > 0 THEN 1 ELSE 0 END) AS "0",
    SUM(CASE WHEN POSITION('1' IN filmname) > 0 THEN 1 ELSE 0 END) AS "1",
    ...
    SUM(CASE WHEN POSITION('9' IN filmname) > 0 THEN 1 ELSE 0 END) AS "9"
FROM films;
-- lower() added so the letter matches stay case-insensitive (POSITION itself is case-sensitive)
A similar approach to Erwin's, but maybe more comfortable in the long run:
Create a table with each character you're interested in:
CREATE TABLE char (name char (1), id serial);
INSERT INTO char (name) VALUES ('a');
INSERT INTO char (name) VALUES ('b');
INSERT INTO char (name) VALUES ('c');
Then grouping over its values is easy:
SELECT char.name, COUNT(*)
FROM char, film
WHERE film.name ILIKE '%' || char.name || '%'
GROUP BY char.name
ORDER BY char.name;
Don't worry about ILIKE.
I'm not 100% happy about using the keyword 'char' as a table name, but I haven't had bad experiences with it so far. On the other hand, it is the natural name. Maybe if you translate it to another language - like 'zeichen' in German - you avoid ambiguities.
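If you go this route, the char table can be filled for all of [a-z0-9] without writing 36 INSERT statements (a sketch using the same generate_series() trick as in the accepted answer):

INSERT INTO char (name)
SELECT chr(generate_series(97, 122))  -- a-z
UNION ALL
SELECT generate_series(0, 9)::text;   -- 0-9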