SQL 2 Tables, get counts on first, group by second

I'm working in MS Access 2003.
I have a table whose records have this structure:
ID, Origin, Destination, Attr1, Attr2, Attr3, ... AttrX
for example:
1, 1000, 1100, 20, M, 5 ...
2, 1000, 1105, 30, F, 5 ...
3, 1001, 1000, 15, M, 10 ...
...
I also have a table that groups the Origin and Destination codes by country and continent:
Code, Country, Continent
1000, Albania, Europe
1001, Belgium, Europe
...
1100, China, Asia
1105, Japan, Asia
...
What I need is 2 result tables that count records matching criteria on the attributes I specify, grouped by:
1. Origin Continent and Destination Country
2. Origin Continent and Destination Continent
for example:
Case 1.
Origin, Destination, Total, Females, Males, Older than 20, Younger than 20, ...
Europe, China, 300, 100, 200, 120, 180 ...
Europe, Japan, 150, 100, 50, ...
...
Case 2.
Origin, Destination, Total, Females, Males, Older than 20, Younger than 20, ...
Europe, Asia, 1500, 700, 800 ...
Asia, Europe, 1200, ...
...
Can that be done in a way that lets me add more columns/criteria easily?

Case 1 (the code table is joined twice, once for the origin and once for the destination; Access needs INNER JOIN and parentheses around nested joins, and cannot order by a column alias):
select count(1) as total, t2.continent, t3.country, t1.attr1, t1.attr2, t1.attr3 ... t1.attrX
from (table1 t1
inner join table2 t2 on t1.origin = t2.code)
inner join table2 t3 on t1.destination = t3.code
group by t2.continent, t3.country, t1.attr1, t1.attr2, t1.attr3 ... t1.attrX
order by count(1) desc
Case 2:
select count(1) as total, t2.continent, t3.continent, t1.attr1, t1.attr2, t1.attr3 ... t1.attrX
from (table1 t1
inner join table2 t2 on t1.origin = t2.code)
inner join table2 t3 on t1.destination = t3.code
group by t2.continent, t3.continent, t1.attr1, t1.attr2, t1.attr3 ... t1.attrX
order by count(1) desc
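The queries above return one row per combination of attribute values rather than one column per criterion. A different technique, conditional aggregation, gives the layout asked for in the question. Below is a minimal sketch of it, run through Python's sqlite3 so it can be tried anywhere. The table names (records, codes), the sample rows, and the criteria dictionary are assumptions made up for the illustration. In Access the same idea would be written with Sum(IIf(condition, 1, 0)) instead of SUM(CASE ...).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE records (id INTEGER, origin INTEGER, destination INTEGER,
                      attr1 INTEGER, attr2 TEXT);
CREATE TABLE codes (code INTEGER, country TEXT, continent TEXT);
INSERT INTO records VALUES (1, 1000, 1100, 20, 'M'), (2, 1000, 1105, 30, 'F'),
                           (3, 1001, 1000, 15, 'M'), (4, 1001, 1100, 25, 'F');
INSERT INTO codes VALUES (1000, 'Albania', 'Europe'), (1001, 'Belgium', 'Europe'),
                         (1100, 'China', 'Asia'), (1105, 'Japan', 'Asia');
""")

# Hypothetical criteria: output column name -> SQL condition on a record.
# Adding a column to the result means adding one entry here.
criteria = {
    "females": "r.attr2 = 'F'",
    "males": "r.attr2 = 'M'",
    "older_than_20": "r.attr1 > 20",
    "younger_than_20": "r.attr1 < 20",
}

def build_query(dest_column):
    """Count per origin continent and destination 'country' or 'continent'."""
    counts = ",\n       ".join(
        f"SUM(CASE WHEN {cond} THEN 1 ELSE 0 END) AS {name}"
        for name, cond in criteria.items()
    )
    return f"""
SELECT o.continent, d.{dest_column}, COUNT(*) AS total,
       {counts}
FROM records r
JOIN codes o ON r.origin = o.code        -- lookup for the origin code
JOIN codes d ON r.destination = d.code   -- same lookup table, destination code
GROUP BY o.continent, d.{dest_column}
ORDER BY 1, 2
"""

by_country = conn.execute(build_query("country")).fetchall()
by_continent = conn.execute(build_query("continent")).fetchall()
print(by_country)
print(by_continent)
```

Because the column list is generated from the dictionary, the same builder serves both cases and new criteria do not require touching the query text.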

You can join queries with queries. This is a crosstab for Male/Female (Attr2), saved as a query named MF:
TRANSFORM Count(Data.ID) AS CountOfID
SELECT Data.Origin, Data.Destination, Count(Data.ID) AS Total
FROM Data
GROUP BY Data.Origin, Data.Destination
PIVOT Data.Attr2;
This one, saved as a query named Ages, bands the ages (Attr1):
TRANSFORM Count(Data.ID) AS CountOfID
SELECT Data.Origin, Data.Destination, Count(Data.ID) AS Total
FROM Data
GROUP BY Data.Origin, Data.Destination
PIVOT Partition([Attr1],10,100,10);
This combines the two, matching them on Origin and Destination:
SELECT Ages.Origin, Ages.Destination, Ages.Total,
MF.F, MF.M, Ages.[10: 19], Ages.[20: 29], Ages.[30: 39]
FROM Ages INNER JOIN MF
ON (Ages.Origin = MF.Origin) AND (Ages.Destination = MF.Destination);
As you can see, this could be easier to manage in VBA.

Related

For joins in SQL, Is common column not compulsory?

I learned that a join can use something other than JOIN ON a.col = b.col, for example JOIN ON a.col CONDITION b.col2. How does that work?
Example:
2 Tables are:
We can join them as follows (ignoring the output the code is required to produce, how does that join work?):
SELECT Students.Name, Grades.Grade, Students.Marks FROM Students
INNER JOIN Grades ON Students.Marks BETWEEN Grades.Min_Mark AND Max_Mark
WHERE Grades.Grade > 7
ORDER BY Grades.Grade DESC, Students.Name ASC;
Joins don't have to nominate a column at all:
SELECT * FROM People CROSS JOIN Addresses
This combines all people with all addresses. If there were 2 people and 3 addresses, 6 records would result:
People.Name, Addresses.Name
-------------------------
Person1, Address1
Person1, Address2
Person1, Address3
Person2, Address1
Person2, Address2
Person2, Address3
Joins that have a condition don't have to use any columns from the sets of data being joined; the condition just has to evaluate to true for the row to appear in the output. You can consider that when a database is processing any join, it first produces the cross product of every row (like above), then checks the truth of the condition per row to decide which of those rows make it into the output.
First let's do a join on something that makes some sense, using the above data:
SELECT * FROM People INNER JOIN Addresses ON People.Name = Addresses.Name
That would produce nothing, because 'Person1' is never equal to 'Address1' and so on. But suppose we altered it to:
SELECT * FROM People INNER JOIN Addresses ON REPLACE(People.Name, 'Person', 'Address') = Addresses.Name
The 6 rows would be prepared like before:
Person1, Address1
Person1, Address2
Person1, Address3
Person2, Address1
Person2, Address2
Person2, Address3
The DB would replace the word Person with the word Address just while it was evaluating the truth of the join, and the tests would be performed:
Person1, Address1 --'Person1'->'Address1', does 'Address1'='Address1'? YES; OUTPUT the row
Person1, Address2 --'Person1'->'Address1', does 'Address1'='Address2'? no; discard the row
Person1, Address3 --'Person1'->'Address1', does 'Address1'='Address3'? no; discard the row
Person2, Address1 --'Person2'->'Address2', does 'Address2'='Address1'? no; discard the row
Person2, Address2 --'Person2'->'Address2', does 'Address2'='Address2'? YES; OUTPUT the row
Person2, Address3 --'Person2'->'Address2', does 'Address2'='Address3'? no; discard the row
So you get 2 rows:
Person1, Address1
Person2, Address2
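The walkthrough above can be checked with a small script. This is only a sketch using Python's built-in sqlite3 (an assumption; the question does not name a database). It compares the INNER JOIN with a CROSS JOIN filtered by the same condition in a WHERE clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE People (Name TEXT);
CREATE TABLE Addresses (Name TEXT);
INSERT INTO People VALUES ('Person1'), ('Person2');
INSERT INTO Addresses VALUES ('Address1'), ('Address2'), ('Address3');
""")

# The join condition from the text: no shared column, just an expression.
condition = "REPLACE(People.Name, 'Person', 'Address') = Addresses.Name"
columns = "People.Name, Addresses.Name"

# Every person paired with every address.
cross = conn.execute(
    f"SELECT {columns} FROM People CROSS JOIN Addresses ORDER BY 1, 2"
).fetchall()

# The inner join with the condition in ON ...
joined = conn.execute(
    f"SELECT {columns} FROM People INNER JOIN Addresses ON {condition} ORDER BY 1, 2"
).fetchall()

# ... and the cross product filtered by the same condition afterwards.
filtered = conn.execute(
    f"SELECT {columns} FROM People CROSS JOIN Addresses WHERE {condition} ORDER BY 1, 2"
).fetchall()

print(len(cross), joined, filtered)
```

For an inner join the two forms return the same rows. That is the conceptual model only; a real optimizer is free to avoid building the full cross product.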
Now let's make it really wacky. Suppose you had:
SELECT * FROM People INNER JOIN Addresses ON DAY_OF_WEEK(NOW()) = 'Monday'
The query would produce 6 rows, but only on Monday. As soon as it turned to Tuesday the query would produce 0 rows. It doesn't make much sense to do, but you're still allowed to do it. So long as you provide something the DB can evaluate to true or false, the DB will join every row to every other row, then check the truth for every combination, and discard any combination for which it sees a false.
Imagine other scenarios. People and Addresses are supposed to be related by Addresses having a PersonId.
If you did:
SELECT * FROM People LEFT JOIN Addresses ON People.Id = (Addresses.PersonId + 1)
You'd see:
PersonId, Name, Address_PersonId, Street
0, Tim, NULL, NULL
1, John, 0, TheRoad
2, Mary, 1, TheAvenue
It's supposed to be Tim that lives at TheRoad, but we wrote a nonsensical join condition that was evaluated and churned out results anyway.
You could divide the ID by 2, or you could join on a random number being less than 0.5. It doesn't matter; the condition is just a truth test, and most of the time it's at its most useful when it uses column data.
As for how the BETWEEN join in the question works:
This BETWEEN form is actually quite a useful one. It allows you to sort loads of different scores into set bands.
Suppose you have 2 people with scores and 3 bands (I'm keeping it small to make it easier to type out):
Name, Score
Tim, 79
John, 68
ScoreLower, ScoreHigher, Rating
0, 50, Bronze
51, 75, Silver
76, 100, Gold
And you do
SELECT * FROM People JOIN Scores ON Score BETWEEN ScoreLower AND ScoreHigher
Remember that the DB conceptually combines EVERY person with EVERY score band first:
Tim, 79, 0, 50, Bronze
Tim, 79, 51, 75, Silver
Tim, 79, 76, 100, Gold
John, 68, 0, 50, Bronze
John, 68, 51, 75, Silver
John, 68, 76, 100, Gold
And then it goes through knocking out the ones that aren't true
Tim, 79, 0, 50, Bronze --FALSE, 79 is not BETWEEN 0 and 50, discard this one
Tim, 79, 51, 75, Silver --FALSE, 79 is not BETWEEN 51 and 75, discard this one
Tim, 79, 76, 100, Gold --TRUE, keep
John, 68, 0, 50, Bronze --FALSE, discard
John, 68, 51, 75, Silver --TRUE, keep
John, 68, 76, 100, Gold --FALSE, discard
And you get just the keeps:
Tim, 79, 76, 100, Gold
John, 68, 51, 75, Silver
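Here is the same banding example as a runnable sketch, again assuming Python's sqlite3 as the database. The list comprehension at the end mirrors the "combine everything, then knock out the false ones" description in plain Python.

```python
import sqlite3

people = [("Tim", 79), ("John", 68)]
bands = [(0, 50, "Bronze"), (51, 75, "Silver"), (76, 100, "Gold")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (Name TEXT, Score INTEGER)")
conn.execute("CREATE TABLE Scores (ScoreLower INTEGER, ScoreHigher INTEGER, Rating TEXT)")
conn.executemany("INSERT INTO People VALUES (?, ?)", people)
conn.executemany("INSERT INTO Scores VALUES (?, ?, ?)", bands)

# The range join from the text: each score lands in the band that contains it.
rated = conn.execute("""
SELECT Name, Score, ScoreLower, ScoreHigher, Rating
FROM People
JOIN Scores ON Score BETWEEN ScoreLower AND ScoreHigher
ORDER BY Score DESC
""").fetchall()

# Same logic by hand: every person x every band, keep the pairs where the
# condition is true (BETWEEN includes both end points).
by_hand = [
    (name, score, low, high, rating)
    for name, score in people
    for low, high, rating in bands
    if low <= score <= high
]

print(rated)
```

Note that this only gives one row per person because the bands do not overlap and have no gaps; overlapping bands would duplicate people, and a score outside every band would drop out of an inner join.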
You could have a list of all the Chinese astrological years and a list of people with a known birthday, then do their birthday BETWEEN yearstart AND yearend and find out if they're a Horse, Dog etc. You could have a list of all the letters in the alphabet, each with a color, and put people into color groups based on the first letter of their name. The condition doesn't have to be = either; we could use LIKE:
People JOIN AlphabetColors ON People.FirstName LIKE AlphabetColors.Letter + '%'
Or we could:
People JOIN AlphabetColors ON LEFT(People.FirstName, 1) = AlphabetColors.Letter
Either way, data like:
Albert
Bill
Charlie
A, Red
B, Green
C, Blue
Ends up as
Albert, A, Red
Bill, B, Green
Charlie, C, Blue
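Both variants can be tried as below. This is a sketch in Python's sqlite3, so two dialect substitutions are assumed: SQLite concatenates strings with || rather than +, and uses substr(x, 1, 1) where the text uses LEFT(x, 1).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE People (FirstName TEXT);
CREATE TABLE AlphabetColors (Letter TEXT, Color TEXT);
INSERT INTO People VALUES ('Albert'), ('Bill'), ('Charlie');
INSERT INTO AlphabetColors VALUES ('A', 'Red'), ('B', 'Green'), ('C', 'Blue');
""")

# Variant 1: pattern match, the letter followed by anything.
with_like = conn.execute("""
SELECT FirstName, Letter, Color
FROM People
JOIN AlphabetColors ON People.FirstName LIKE AlphabetColors.Letter || '%'
ORDER BY FirstName
""").fetchall()

# Variant 2: compare the first character directly.
with_substr = conn.execute("""
SELECT FirstName, Letter, Color
FROM People
JOIN AlphabetColors ON substr(People.FirstName, 1, 1) = AlphabetColors.Letter
ORDER BY FirstName
""").fetchall()

print(with_like)
```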

Hive query to select only records in certain percentile

I have table with two columns - ID and total duration:
id tot_dur
123 1
124 2
125 5
126 8
I want a Hive query that selects only the records at or above the 75th percentile. Here that should be only the last record:
id tot_dur
126 8
This is what I have so far. It's hard for me to understand the use of the OVER() and PARTITION BY clauses, which, from what I've researched, are what I should use. To get the tot_dur column I first have to sum duration and group by id. I'm also not sure percentile is the correct function, because I found use cases with percentile_approx.
select k1.id as id, percentile(cast(tot_dur as bigint),0.75) OVER () as tot_dur
from (
SELECT id, sum(duration) as tot_dur
FROM data_source
GROUP BY id) k1
group by id
If I've got you right, this is what you want:
with data as (select stack(4,
123, 1,
124, 2,
125, 5,
126, 8) as (id, tot_dur))
-----------------------------------------------------------------------------
select data.id, data.tot_dur
from data
cross join (select percentile(tot_dur, 0.75) as threshold from data) as t
where data.tot_dur >= t.threshold;
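To see why only the last record survives, here is the same filter sketched in plain Python. It assumes the percentile is computed exactly with linear interpolation between the two closest ranks, which is how I understand Hive's percentile to behave for integer columns; treat that as an assumption and check it against your Hive version.

```python
import math

def percentile(values, p):
    """Exact percentile of a list, interpolating linearly between ranks."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)   # fractional index into the sorted list
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        return float(ordered[lower])
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

rows = [(123, 1), (124, 2), (125, 5), (126, 8)]   # (id, tot_dur)

# Equivalent of the one-row subquery t.
threshold = percentile([tot_dur for _, tot_dur in rows], 0.75)

# Equivalent of: where data.tot_dur >= t.threshold
selected = [row for row in rows if row[1] >= threshold]

print(threshold, selected)
```

With four rows the 0.75 position falls a quarter of the way between 5 and 8, so the threshold is 5.75 and only id 126 is kept.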

SQL sum non-distinct weights for distinct rows with category totals

I want to cross-tabulate some weighted survey data in a context where an individual can contribute to more than one cell. The challenge is to make sure that subtotals and grand totals are done without double-counting.
I can get the individual cell values, but not the totals, using methods similar to the solutions at "How do I SUM DISTINCT Rows?" or "Sum Distinct By Other Column". I'm trying to use the Oracle CUBE operator to get the totals in a nice way.
Here's a baby example. Suppose we're counting people according to what pets they own and according to their hobbies. The problem is that a person might have more than one pet, or more than one hobby. We need to turn this set of unit records:
person_id, weight
1, 10
2, 10
3, 12
person_id, pet
1, "cat"
1, "dog"
2, "cat"
3, "cat"
person_id, hobby
1, "chess"
2, "chess"
2, "skydiving"
3, "skydiving"
into this pair of tables:
Unweighted count
      | chess | skydiving | total
------+-------+-----------+------
cat   |     2 |         2 |     3
------+-------+-----------+------
dog   |     1 |         0 |     1
------+-------+-----------+------
total |     2 |         2 |     3
Weighted count
      | chess | skydiving | total
------+-------+-----------+------
cat   |    20 |        22 |    32
------+-------+-----------+------
dog   |    10 |         0 |    10
------+-------+-----------+------
total |    20 |        22 |    32
Notice that the unweighted total for the "cat" row is 3, not 2+2=4, as person number 2 is counted in two different places. Only three distinct people contribute to this row. Similarly for other totals.
Notice that the weighted total for "cat, chess" is 20=10+10, as two different people each contribute weight 10 to this cell.
Notice that the grand total for the weighted table is 32. This comes from people 1 and 2 contributing 10 each, and person 3 contributing 12. The grand total is not just the sum of all the individual cells!
For the unweighted counts, I can get all the cell counts and totals by:
CREATE TABLE weights(person_id INTEGER, weight INTEGER);
INSERT INTO weights(person_id,weight) VALUES (1,10);
INSERT INTO weights(person_id,weight) VALUES (2,10);
INSERT INTO weights(person_id,weight) VALUES (3,12);
CREATE TABLE pets(person_id INTEGER, pet VARCHAR(3));
INSERT INTO pets(person_id,pet) VALUES (1,'cat');
INSERT INTO pets(person_id,pet) VALUES (1,'dog');
INSERT INTO pets(person_id,pet) VALUES (2,'cat');
INSERT INTO pets(person_id,pet) VALUES (3,'cat');
CREATE TABLE hobbies(person_id INTEGER, hobby VARCHAR(9));
INSERT INTO hobbies(person_id,hobby) VALUES (1,'chess');
INSERT INTO hobbies(person_id,hobby) VALUES (2,'chess');
INSERT INTO hobbies(person_id,hobby) VALUES (2,'skydiving');
INSERT INTO hobbies(person_id,hobby) VALUES (3,'skydiving');
SELECT pet, hobby, COUNT(DISTINCT weights.person_id)
FROM weights JOIN pets on weights.person_id=pets.person_ID
JOIN hobbies on weights.person_id=hobbies.person_id
GROUP BY CUBE(pet, hobby);
The combination of COUNT(DISTINCT ...) and CUBE gives the correct totals.
For weighted counts, if I try the same idea:
SELECT pet, hobby, SUM(DISTINCT weight)
FROM weights JOIN pets on weights.person_id=pets.person_ID
JOIN hobbies on weights.person_id=hobbies.person_id
GROUP BY CUBE(pet, hobby);
the "cat, chess" cell comes to 10 not 20, because people 1 and 2 both have the same weight. Removing the "distinct" key word means that the individual cell counts are correct but the totals are wrong (it produces a grand total of 52 where it should be 32, because persons 1 and 2 are double-counted in the total).
Any suggestions?
You can do this using a nested query, where the inner query specifies a mapping from rows to table cells (i.e. which records are in scope for each table cell), and the outer query specifies the summary function(s) to be applied:
SELECT pet, hobby, COUNT(1), SUM(weight) FROM
(SELECT pet, hobby, weights.person_ID, weight
FROM weights JOIN pets on weights.person_id=pets.person_ID
JOIN hobbies on weights.person_id=hobbies.person_id
GROUP BY CUBE(pet, hobby), weights.person_ID, weight)
GROUP BY pet, hobby;
Aside: You can also write the inner query without using the CUBE operator, but it's a lot messier:
WITH
pet_cube_map as (SELECT DISTINCT pet, NULL as pet_cubed FROM pets UNION ALL SELECT DISTINCT pet, pet as pet_cubed FROM pets),
hobby_cube_map as (SELECT DISTINCT hobby, NULL as hobby_cubed FROM hobbies UNION ALL SELECT DISTINCT hobby, hobby as hobby_cubed FROM hobbies)
SELECT DISTINCT pet_cubed as pet, hobby_cubed as hobby, weights.person_ID, weight
FROM weights
JOIN pets on weights.person_ID=pets.person_ID
JOIN pet_cube_map on pets.pet=pet_cube_map.pet
JOIN hobbies on weights.person_ID=hobbies.person_ID
JOIN hobby_cube_map on hobbies.hobby=hobby_cube_map.hobby
;
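The CUBE-free mapping above is portable enough to run in SQLite, which has no CUBE operator, so it makes a convenient way to check the numbers. The sketch below (Python's sqlite3, an assumption made for testing; the question itself is about Oracle) wraps that inner query in a CTE named cells, a name introduced here, and applies the outer COUNT/SUM to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE weights(person_id INTEGER, weight INTEGER);
INSERT INTO weights VALUES (1, 10), (2, 10), (3, 12);
CREATE TABLE pets(person_id INTEGER, pet TEXT);
INSERT INTO pets VALUES (1, 'cat'), (1, 'dog'), (2, 'cat'), (3, 'cat');
CREATE TABLE hobbies(person_id INTEGER, hobby TEXT);
INSERT INTO hobbies VALUES (1, 'chess'), (2, 'chess'), (2, 'skydiving'), (3, 'skydiving');
""")

query = """
WITH
-- each value maps to itself (a cell) and to NULL (the total row/column)
pet_cube_map AS (SELECT DISTINCT pet, NULL AS pet_cubed FROM pets
                 UNION ALL SELECT DISTINCT pet, pet FROM pets),
hobby_cube_map AS (SELECT DISTINCT hobby, NULL AS hobby_cubed FROM hobbies
                   UNION ALL SELECT DISTINCT hobby, hobby FROM hobbies),
-- one row per (cell, person): DISTINCT removes the double counting
cells AS (
    SELECT DISTINCT pet_cubed AS pet, hobby_cubed AS hobby,
                    weights.person_id, weight
    FROM weights
    JOIN pets ON weights.person_id = pets.person_id
    JOIN pet_cube_map ON pets.pet = pet_cube_map.pet
    JOIN hobbies ON weights.person_id = hobbies.person_id
    JOIN hobby_cube_map ON hobbies.hobby = hobby_cube_map.hobby
)
SELECT pet, hobby, COUNT(*) AS unweighted, SUM(weight) AS weighted
FROM cells
GROUP BY pet, hobby
"""

# (pet, hobby) -> (unweighted count, weighted count); None marks a total.
totals = {(pet, hobby): (n, w) for pet, hobby, n, w in conn.execute(query)}
for key in sorted(totals, key=str):
    print(key, totals[key])
```

Empty cells such as dog/skydiving simply do not appear in the result, so a report layer would have to fill them in as 0.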
Try this. It gives the correct result, but it is the most basic version:
SELECT pet, hobby, SUM(weight)
FROM weights JOIN pets on weights.person_id=pets.person_ID
JOIN hobbies on weights.person_id=hobbies.person_id
GROUP BY pet, hobby
UNION
SELECT pet, NULL, SUM(weight)
FROM weights JOIN pets on weights.person_id=pets.person_ID
GROUP BY pet
UNION
SELECT NULL, hobby, SUM(weight)
FROM weights JOIN hobbies on weights.person_id=hobbies.person_id
GROUP BY hobby
UNION
SELECT NULL, NULL, SUM(weight)
FROM weights
I'm still working on a version with a single SELECT.
I think you need to do some math like this:
;WITH t AS (
SELECT
p.pet,
SUM(DISTINCT CASE WHEN h.hobby = 'chess' THEN POWER(2,h.person_id) ELSE 0 END) chess,
SUM(DISTINCT CASE WHEN h.hobby = 'skydiving' THEN POWER(2,h.person_id) ELSE 0 END) skydiving,
SUM(DISTINCT POWER(2,h.person_id)) total
FROM
hobbies h
LEFT JOIN
pets p ON h.person_id = p.person_id
GROUP BY
p.pet
UNION ALL
SELECT
'total',
SUM(DISTINCT CASE WHEN h.hobby = 'chess' THEN POWER(2,h.person_id) ELSE 0 END),
SUM(DISTINCT CASE WHEN h.hobby = 'skydiving' THEN POWER(2,h.person_id) ELSE 0 END),
SUM(DISTINCT POWER(2,h.person_id))
FROM
hobbies h
), w(person_id, weight) as (
SELECT POWER(2,person_id), weight
FROM weights
), cte(person_id, weight) AS (
SELECT *
FROM w
UNION ALL
SELECT w1.person_id + w2.person_id, w1.weight + w2.weight
FROM cte w1 JOIN w w2 ON w2.person_id > w1.person_id
)
SELECT
pet,
COALESCE((SELECT cte.weight FROM cte WHERE cte.person_id = t.chess), 0) AS chess,
COALESCE((SELECT cte.weight FROM cte WHERE cte.person_id = t.skydiving), 0) AS skydiving,
COALESCE((SELECT cte.weight FROM cte WHERE cte.person_id = t.total), 0) AS total
FROM t;
This is not cubed; it is static and a bit dirty, but I did test it in SQL Server ;).
This could be a cubed version (not tested):
;With t as (
SELECT h.hobby, p.pet, POWER(2,h.person_id) weight
FROM hobbies h
JOIN pets p
ON h.person_id = p.person_id
JOIN weights w
ON h.person_id = w.person_id
), w(person_id, weight) as (
SELECT POWER(2,person_id), weight
FROM weights
), cte(person_id, weight) AS (
SELECT *
FROM w
UNION ALL
SELECT w1.person_id + w2.person_id, w1.weight + w2.weight
FROM cte w1 JOIN w w2 ON w2.person_id > w1.person_id
), c as (
SELECT
hobby, pet, SUM(DISTINCT weight) person_id
FROM t
GROUP BY CUBE(hobby, pet)
)
SELECT c.hobby, c.pet, cte.weight
FROM c JOIN
cte ON c.person_id = cte.person_id;

SQL Query - using COUNT

I have a database structure like this:
Table - lodges
LodgeID (PK)
Lodge
etc
Table - scores
ScoreID (PK)
Score
CategoryID
LodgeID (FK)
I'm trying to return results in the form:
LodgeID, Lodge, Category, Number of Scores in that Category, Average Score in that Category
So for example, if I had:
lodges
LodgeID, Lodge
1, Lodge One
2, Lodge Two
scores
ScoreID, Score, CategoryID, LodgeID
1, 3, 101, 1
2, 5, 101, 1
3, 7, 101, 1
4, 10, 102, 2
5, 20, 102, 2
6, 30, 102, 2
7, 40, 102, 2
I'd like to return:
1, Lodge One, 3, 5
2, Lodge Two, 4, 25
I've been trying things like:
SELECT COUNT(ScoreID) as scoreCount, AVG(Score) as AverageScore, Lodge
FROM scores_temp
INNER JOIN lodges_temp ON scores_temp.LodgeID = lodges_temp.LodgeID
and:
SELECT lodges_temp.LodgeID, Lodge, COUNT(ScoreID) as scoreCount, AVG(Score) as AverageScore
FROM lodges_temp
INNER JOIN scores_temp ON lodges_temp.LodgeID = scores_temp.LodgeID
without any success. Any pointers would be much appreciated.
Try this:
SELECT COUNT(ScoreID) as scoreCount, AVG(Score) as AverageScore, Lodge
FROM scores_temp
INNER JOIN lodges_temp ON scores_temp.LodgeID = lodges_temp.LodgeID
GROUP BY Lodge
You are missing a group by clause:
SELECT COUNT(ScoreID) as scoreCount, AVG(Score) as AverageScore, Lodge
FROM scores
INNER JOIN lodges ON scores.LodgeID = lodges.LodgeID
GROUP BY lodge
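The queries above group by Lodge only. To also return LodgeID and the category, as in the format the question asks for, those columns have to go into both the SELECT list and the GROUP BY. A minimal sketch with the question's sample data, run through Python's sqlite3 (an assumption; the question does not say which database is in use):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lodges (LodgeID INTEGER PRIMARY KEY, Lodge TEXT);
CREATE TABLE scores (ScoreID INTEGER PRIMARY KEY, Score INTEGER,
                     CategoryID INTEGER,
                     LodgeID INTEGER REFERENCES lodges(LodgeID));
INSERT INTO lodges VALUES (1, 'Lodge One'), (2, 'Lodge Two');
INSERT INTO scores VALUES (1, 3, 101, 1), (2, 5, 101, 1), (3, 7, 101, 1),
                          (4, 10, 102, 2), (5, 20, 102, 2),
                          (6, 30, 102, 2), (7, 40, 102, 2);
""")

# Every non-aggregated column in the SELECT list also appears in GROUP BY,
# so COUNT and AVG are computed once per lodge and category.
rows = conn.execute("""
SELECT lodges.LodgeID, lodges.Lodge, scores.CategoryID,
       COUNT(scores.ScoreID) AS scoreCount,
       AVG(scores.Score) AS AverageScore
FROM scores
INNER JOIN lodges ON scores.LodgeID = lodges.LodgeID
GROUP BY lodges.LodgeID, lodges.Lodge, scores.CategoryID
ORDER BY lodges.LodgeID, scores.CategoryID
""").fetchall()

print(rows)
```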

How can a group by be converted to a self-join

For a table such as:
employeeID | groupCode
1 red111
2 red111
3 blu123
4 blu456
5 red553
6 blu423
7 blu341
How can I count the number of employeeIDs that are in parent groups (such as red or blu, though there are many more groups in the real table) with more than 2 members, not counting the employee themselves (so all those with blu in this particular example)?
To expand: groupCode consists of a parent group (three letters) followed by some digits for the subgroup.
I'd like to do this using a self-join, or at least without using a GROUP BY statement.
So far I have:
SELECT T1.employeeID
FROM TABLE T1, TABLE T2
WHERE T1.groupCode <> T2.groupCode
AND SUBSTR(T1.groupCode, 1, 3) = SUBSTR(T2.groupCode, 1, 3);
but that doesn't do much for me...
Add an index on the first 3 characters of groupCode in EMPLOYEE.
Then try this one:
SELECT ed.e3
, COUNT(*)
FROM EMPLOYEE e
JOIN
( SELECT DISTINCT
SUBSTR(groupCode, 1, 3) AS e3
FROM EMPLOYEE
) ed
ON e.groupCode LIKE CONCAT(ed.e3, '%')
GROUP BY ed.e3
HAVING COUNT(*) >= 3 --- or whatever is wanted
What about:
SELECT DISTINCT substring(myTable.empshirtno, 1, 3),
(SELECT Count(*) FROM myTable AS myTable2
WHERE substring(myTable.empshirtno, 1, 3) = substring(myTable2.empshirtno, 1, 3))
FROM myTable
Counting from a correlated subquery may be speedier when there is an index.
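For what it's worth, here is a sketch of two GROUP BY-free versions run against the question's sample data in Python's sqlite3 (an assumption; substr is the SQLite spelling). It reads the requirement as "more than 2 other employees share the same three-letter prefix", which is my interpretation of the question. The table name employees and the aliases are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (employeeID INTEGER, groupCode TEXT);
INSERT INTO employees VALUES (1, 'red111'), (2, 'red111'), (3, 'blu123'),
                             (4, 'blu456'), (5, 'red553'), (6, 'blu423'),
                             (7, 'blu341');
""")

# Correlated subquery: for each employee, count the OTHER employees whose
# groupCode starts with the same three letters.
via_subquery = conn.execute("""
SELECT COUNT(*)
FROM employees e
WHERE (SELECT COUNT(*)
       FROM employees other
       WHERE substr(other.groupCode, 1, 3) = substr(e.groupCode, 1, 3)
         AND other.employeeID <> e.employeeID) > 2
""").fetchone()[0]

# Pure self-join: an employee qualifies if three distinct colleagues
# (b < c < d by ID, none equal to a) exist in the same parent group.
via_self_join = conn.execute("""
SELECT COUNT(DISTINCT a.employeeID)
FROM employees a
JOIN employees b ON substr(b.groupCode, 1, 3) = substr(a.groupCode, 1, 3)
                AND b.employeeID <> a.employeeID
JOIN employees c ON substr(c.groupCode, 1, 3) = substr(a.groupCode, 1, 3)
                AND c.employeeID <> a.employeeID
                AND c.employeeID > b.employeeID
JOIN employees d ON substr(d.groupCode, 1, 3) = substr(a.groupCode, 1, 3)
                AND d.employeeID <> a.employeeID
                AND d.employeeID > c.employeeID
""").fetchone()[0]

print(via_subquery, via_self_join)
```

The self-join form hard-codes the threshold in the number of joins, so it only stays manageable for small thresholds; the subquery form just changes a number.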