Access SQL - divide value in each row by total (using self join) - sql

I have a table (tblGoals) which shows how many goals each player has scored, e.g:
| Player | Goals |
--------------------
| John | 6 |
| Chris | 10 |
| Ben | 4 |
I am trying to write a query that will output each player along with the percentage of the teams total goals that they have scored:
| Player | PercentageGoals |
------------------------------
| John | 0.3 |
| Chris | 0.5 |
| Ben | 0.2 |
I have already figured out how to do this with a sub query as shown
SELECT
Player,
Goals/ (SELECT SUM(Goals)FROM tblGoals) AS PercentageGoals
FROM tblGoals
The example table I have shown is just to demonstrate what I am trying to do. The actual table I am using is a much larger dataset and trying to use a subquery to get the percentage in this way is running quite slowly.
I have noticed before in Access that self-joins are usually optimised more efficiently than sub queries, and so I am trying to figure out if the above query can be rewritten using a self join.
I have tried something along the lines of below but obviously this is incorrect as t2 is grouped by Player which means I am not getting the true total, but if I leave the Player name out then I can't join on it?
SELECT
t1.Player,
t1.Goals/t2.sumGoals AS PercentageGoals
FROM
tbl1 t1
INNER JOIN
(SELECT Player, SUM(Goals) as sumGoals
FROM tbl1
GROUP BY Player) t2
ON t1.Player= t2.Player
Is there any way to do this?

I'll try this as an answer - although I'm not sure if it will be faster as only using a three records.
Your second table in the join should just have a total of all the goals to join to each record in the first table - a cartesian product.
You can then divide one by the other:
SELECT Player
, Goals / TotalGoals AS Total
FROM tblGoals, (SELECT SUM(Goals) AS TotalGoals FROM tblGoals)
Is it any faster on a big table? The idea being that in your SQL it was calculating the total for each record, while this creates the total as a table and joins to it so should only be calculated once.

Related

how would you generate fake account numbers for a company project with SQL?

I have a big data set in Redshift which my company will share with university students to analyze. I need to mask the real customer account numbers.
I've looked at the random function but there's one catch: some customers are repeated, so I need to retain that for the analysis to be useful. Also, with a random number there's a small possibility you would repeat account numbers, right?
How would you achieve this? Generate a new_random_id. It must be unique from all others in the table (there are over 4 million in the table), but must be the same for those rows where the actual account ID is the same.
+-------------------+---------------+---------+
| actual_accound_id | new_random_id | status |
+-------------------+---------------+---------+
| 100 | 123 | new |
| 100 | 123 | upgrade |
| 200 | 249 | new |
| 300 | 401 | upgrade |
+-------------------+---------------+---------+
I realize I could first generate a mapping table like this below, and then join to the main table, but it still doesn't solve the problem of possibly repeating new random IDs.
select distinct actual_account_id, cast(random()*1000000 as int) as new_random_id
into mapping_table
from t1;
I would create a mapping table using window functions:
select actual_account_id,
row_number() over (order by random()) as fake_account_id
from t1
group by actual_account_id;
This should be a meaningless sequential number.
Redshift might be a bit slow on the ROW_NUMBER() with no PARTITION BY. If performance is an issue, you can use something like this:
select actual_account_id,
count(*) * 100 + row_number(partition by tmp order by random()) as fake_acocunt_number
from (select actual_account_id,
cast(random()*1000000 as int) as tmp
from t1
group by actual_account_id
) t;

Why is INNER JOIN producing more records than original file?

I have two tables. Table A & Table B. Table A has 40516 rows, and records sales by seller_id. The first column in Table A is the seller_id that repeats every time a sale is made.
Example: Table A (40516 rows)
seller_id | item | cost
------------------------
1 | dog | 5000
1 | cat | 50
4 |lizard| 80
5 |bird | 20
5 |fish | 90
The seller_id is also present in Table B, and also contains the corresponding name of the seller.
Example: Table B (5851 rows)
seller_id | seller_name
-------------------------
1 | Dog and Cat World INC
4 | Reptile Love.com
5 | Ocean Dogs Inc
I want to join these two tables, but only display the seller name from Table B and all other columns from Table A. When I do this with an INNER JOIN I get 40864 rows (348 extra rows). Shouldn't the query produce only the original 40516 rows?
Also not sure if this matters, but the seller_id can contain several zeros before the number (e.g., 0000845, 0000549).
I've looked around on here and haven't really found an answer. I've tried LEFT and RIGHT joins and get the same results for one and way more results for the other.
SQL Code Example:
SELECT public.table_B.seller_name, *
FROM public.table_A
INNER JOIN public.table_B ON public.table_A.seller_id =
public.table_B.seller_id;
Expected Results:
seller_name | seller_id | item | cost
------------------------------------------------
Dog and Cat World INC | 1 | dog | 5000
Dog and Cat World INC | 1 | cat | 50
Reptile Love.com | 4 |lizard| 80
Ocean Dogs Inc | 5 |bird | 20
Ocean Dogs Inc | 5 |fish | 90
I expected the results to contain the same number of rows in Table A. Instead I gut names matching up and an additional 348 rows...
Update:
I changed "unique_id" to "seller_id" in the question.
I guess I should have chosen a better name for unique_id in the original example. I didn't mean it to be unique in the sense of a key. It is just the seller's id that repeats every time there is a sale (in Table A). The seller's ID does repeat in Table A because it is supposed to. I simply want to pair up the seller IDs with the seller names.
Thanks again everyone for their help!
unique_id is already not correctly named in the first table, so there is no reason to assume it is unique in the second table either.
Run this query to find the duplicates:
select unique_id
from table_b
group by unique_id
having count(*) > 1;
You can fix the query using distinct on:
SELECT b.seller_name, a.*
FROM public.table_A a JOIN
(SELECT DISTINCT ON (b.unique_id) b.*
FROM public.table_B b
ORDER BY b.unique_id
) b
ON a.unique_id = b.unique_id;
In this case, you may get fewer records, if there are no matches. To fix that, use a LEFT JOIN.
Because unique id column is not unique.
Gordon Linoff was correct. The seller_id (formerly listed as unique_id) was indeed duplicated throughout the data set. I foolishly assumed otherwise. Also the seller_name had many duplicates too! In the end I had to use the CONCAT() function to join the seller_id with second identifier to create a type of foreign key. After I did this the join worked as expected. Thanks everyone!

Find spectators that have seen the same shows (match multiple rows for each)

For an assignment I have to write several SQL queries for a database stored in a PostgreSQL server running PostgreSQL 9.3.0. However, I find myself blocked with last query. The database models a reservation system for an opera house. The query is about associating the a spectator the other spectators that assist to the same events every time.
The model looks like this:
Reservations table
id_res | create_date | tickets_presented | id_show | id_spectator | price | category
-------+---------------------+---------------------+---------+--------------+-------+----------
1 | 2015-08-05 17:45:03 | | 1 | 1 | 195 | 1
2 | 2014-03-15 14:51:08 | 2014-11-30 14:17:00 | 11 | 1 | 150 | 2
Spectators table
id_spectator | last_name | first_name | email | create_time | age
---------------+------------+------------+----------------------------------------+---------------------+-----
1 | gonzalez | colin | colin.gonzalez#gmail.com | 2014-03-15 14:21:30 | 22
2 | bequet | camille | bequet.camille#gmail.com | 2014-12-10 15:22:31 | 22
Shows table
id_show | name | kind | presentation_date | start_time | end_time | id_season | capacity_cat1 | capacity_cat2 | capacity_cat3 | price_cat1 | price_cat2 | price_cat3
---------+------------------------+--------+-------------------+------------+----------+-----------+---------------+---------------+---------------+------------+------------+------------
1 | madama butterfly | opera | 2015-09-05 | 19:30:00 | 21:30:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
2 | don giovanni | opera | 2015-09-12 | 19:30:00 | 21:45:00 | 2 | 315 | 630 | 945 | 195 | 150 | 100
So far I've started by writing a query to get the id of the spectator and the date of the show he's attending to, the query looks like this.
SELECT Reservations.id_spectator, Shows.presentation_date
FROM Reservations
LEFT JOIN Shows ON Reservations.id_show = Shows.id_show;
Could someone help me understand better the problem and hint me towards finding a solution. Thanks in advance.
So the result I'm expecting should be something like this
id_spectator | other_id_spectators
-------------+--------------------
1| 2,3
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
Note based on comments: Wanted to make clear that this answer may be of limited use as it was answered in the context of SQL-Server (tag was present at the time)
There is probably a better way to do it, but you could do it with the 'stuff 'function. The only drawback here is that, since your ids are ints, placing a comma between values will involve a work around (would need to be a string). Below is the method I can think of using a work around.
SELECT [id_spectator], [id_show]
, STUFF((SELECT ',' + CAST(A.[id_spectator] as NVARCHAR(10))
FROM reservations A
Where A.[id_show]=B.[id_show] AND a.[id_spectator] != b.[id_spectator] FOR XML PATH('')),1,1,'') As [other_id_spectators]
From reservations B
Group By [id_spectator], [id_show]
This will show you all other spectators that attended the same shows.
Meaning that every time spectator with id 1 went to a show, spectators 2 and 3 did too.
In other words, you want a list of ...
all spectators that have seen all the shows that a given spectator has seen (and possibly more than the given one)
This is a special case of relational division. We have assembled an arsenal of basic techniques here:
How to filter SQL results in a has-many-through relation
It is special because the list of shows each spectator has to have attended is dynamically determined by the given prime spectator.
Assuming that (d_spectator, id_show) is unique in reservations, which has not been clarified.
A UNIQUE constraint on those two columns (in that order) also provides the most important index.
For best performance in query 2 and 3 below also create an index with leading id_show.
1. Brute force
The primitive approach would be to form a sorted array of shows the given user has seen and compare the same array of others:
SELECT 1 AS id_spectator, array_agg(sub.id_spectator) AS id_other_spectators
FROM (
SELECT id_spectator
FROM reservations r
WHERE id_spectator <> 1
GROUP BY 1
HAVING array_agg(id_show ORDER BY id_show)
#> (SELECT array_agg(id_show ORDER BY id_show)
FROM reservations
WHERE id_spectator = 1)
) sub;
But this is potentially very expensive for big tables. The whole table hast to be processes, and in a rather expensive way, too.
2. Smarter
Use a CTE to determine relevant shows, then only consider those
WITH shows AS ( -- all shows of id 1; 1 row per show
SELECT id_spectator, id_show
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
)
SELECT sub.id_spectator, array_agg(sub.other) AS id_other_spectators
FROM (
SELECT s.id_spectator, r.id_spectator AS other
FROM shows s
JOIN reservations r USING (id_show)
WHERE r.id_spectator <> s.id_spectator
GROUP BY 1,2
HAVING count(*) = (SELECT count(*) FROM shows)
) sub
GROUP BY 1;
#> is the "contains2 operator for arrays - so we get all spectators that have at least seen the same shows.
Faster than 1. because only relevant shows are considered.
3. Real smart
To also exclude spectators that are not going to qualify early from the query, use a recursive CTE:
WITH RECURSIVE shows AS ( -- produces exactly 1 row
SELECT id_spectator, array_agg(id_show) AS shows, count(*) AS ct
FROM reservations
WHERE id_spectator = 1 -- your prime spectator here
GROUP BY 1
)
, cte AS (
SELECT r.id_spectator, 1 AS idx
FROM shows s
JOIN reservations r ON r.id_show = s.shows[1]
WHERE r.id_spectator <> s.id_spectator
UNION ALL
SELECT r.id_spectator, idx + 1
FROM cte c
JOIN reservations r USING (id_spectator)
JOIN shows s ON s.shows[c.idx + 1] = r.id_show
)
SELECT s.id_spectator, array_agg(c.id_spectator) AS id_other_spectators
FROM shows s
JOIN cte c ON c.idx = s.ct -- has an entry for every show
GROUP BY 1;
Note that the first CTE is non-recursive. Only the second part is recursive (iterative really).
This should be fastest for small selections from big tables. Row that don't qualify are excluded early. the two indices I mentioned are essential.
SQL Fiddle demonstrating all three.
It sounds like you have one half of the total question--determining which id_shows a particular id_spectator attended.
What you want to ask yourself is how you can determine which id_spectators attended an id_show, given an id_show. Once you have that, combine the two answers to get the full result.
So the final answer I got, looks like this :
SELECT id_spectator, id_show,(
SELECT string_agg(to_char(A.id_spectator, '999'), ',')
FROM Reservations A
WHERE A.id_show=B.id_show
) AS other_id_spectators
FROM Reservations B
GROUP By id_spectator, id_show
ORDER BY id_spectator ASC;
Which prints something like this:
id_spectator | id_show | other_id_spectators
-------------+---------+---------------------
1 | 1 | 1, 2, 9
1 | 14 | 1, 2
Which suits my needs, however if you have any improvements to offer, please share :) Thanks again everybody!

Showing data in group by

I am running a query which get data from multiple tables and condition with inner join. I want this query to group by a single column but when i do it i get: ORA-00979: not a GROUP BY expression, error message. Well as per my understanding this is because of other table column which not support this group by.
This query I am writing to generate reports from iReport. for example below column I am getting from three different tables details, food and hobbies, I want to combine this result group by name...
Name | food | hobby
-------------------------
peter | chips | traveling
peter | burger | tennis
peter | burger | writing
Dave | lamb | game
Dave | kebab | reading
fine result that i want will be: here I only want to get name once and respective all values (even when it is duplicate) and other duplicate name rows should not contains any data..please help me with this sql query.. if there's any option in iReport to do this please let me know or any other keyword/inner queries in sql, i tried there group by option while you design table in it.. but it is not working... thanks in advance
Name | food | hobby
--------------------------------------------------------------
peter | chips | traveling
------ | burger | tennis
------ | burger | writing
Dave | lamb | game
-------| kebab | reading
Query for it:
SELECT org.Location AS organisation_location, list.listId as list_listid, org.Centre AS org_Centre,
org.Department AS org_Department, org.Position AS org_Position, q.content AS q_content,
q.dueTime AS q_dueTime, a.submitted_date AS a_submitted_date, list.frequency AS list_frequency,
a.comments AS a_comments, a.userid AS a_userid, a.submitted as a_submitted
FROM org INNER JOIN list ON org.id = list.org_id INNER JOIN q ON klist.id = q.list_id INNER JOIN a ON qid = a.q_id
WHERE a.submitted=0 andlist.listid='xyz'
I want to group the same by list.listid
Your query doesn't contain "Name", "Food" or "Hobby" so I'm little confused, but following query should help you create your own to achieve desired goal.
SELECT
CASE WHEN X.VERIFY_COL = 1 THEN X.YOUR_UNIQUE_COL ELSE NULL END AS YOUR_COL_NAME,
* FROM
(SELECT
ROW_NUMBER() OVER (PARTITION BY YOUR_UNIQUE_COL ORDER BY YOUR_UNIQUE_COL) AS VERIFY_COL,
* FROM YOUR_VIEW
) X
You can partition your data by column you would like to have only once in your query YOUR_UNIQUE_COL. Then easy take advantage of ROW_COUNT() to set NULL for all rows' names with ROW_COUNT() > 1.
Please note it's SQL SERVER solution. What database engine do you use?
I don't think you need to group your data, try deactivating "Print repeated values"

Using multiple left joins to calculate averages and counts

I am trying to figure out how to use multiple left outer joins to calculate average scores and number of cards. I have the following schema and test data. Each deck has 0 or more scores and 0 or more cards. I need to calculate an average score and card count for each deck. I'm using mysql for convenience, I eventually want this to run on sqlite on an Android phone.
mysql> select * from deck;
+----+-------+
| id | name |
+----+-------+
| 1 | one |
| 2 | two |
| 3 | three |
+----+-------+
mysql> select * from score;
+---------+-------+---------------------+--------+
| scoreId | value | date | deckId |
+---------+-------+---------------------+--------+
| 1 | 6.58 | 2009-10-05 20:54:52 | 1 |
| 2 | 7 | 2009-10-05 20:54:58 | 1 |
| 3 | 4.67 | 2009-10-05 20:55:04 | 1 |
| 4 | 7 | 2009-10-05 20:57:38 | 2 |
| 5 | 7 | 2009-10-05 20:57:41 | 2 |
+---------+-------+---------------------+--------+
mysql> select * from card;
+--------+-------+------+--------+
| cardId | front | back | deckId |
+--------+-------+------+--------+
| 1 | fron | back | 2 |
| 2 | fron | back | 1 |
| 3 | f1 | b2 | 1 |
+--------+-------+------+--------+
I run the following query...
mysql> select deck.name, sum(score.value)/count(score.value) "Ave",
-> count(card.front) "Count"
-> from deck
-> left outer join score on deck.id=score.deckId
-> left outer join card on deck.id=card.deckId
-> group by deck.id;
+-------+-----------------+-------+
| name | Ave | Count |
+-------+-----------------+-------+
| one | 6.0833333333333 | 6 |
| two | 7 | 2 |
| three | NULL | 0 |
+-------+-----------------+-------+
... and I get the right answer for the average, but the wrong answer for the number of cards. Can someone tell me what I am doing wrong before I pull my hair out?
Thanks!
John
It's running what you're asking--it's joining card 2 and 3 to scores 1, 2, and 3--creating a count of 6 (2 * 3). In card 1's case, it joins to scores 4 and 5, creating a count of 2 (1 * 2).
If you just want a count of cards, like you're currently doing, COUNT(Distinct Card.CardId).
select deck.name, coalesce(x.ave,0) as ave, count(card.*) as count -- card.* makes the intent more clear, i.e. to counting card itself, not the field. but do not do count(*), will make the result wrong
from deck
left join -- flatten the average result rows first
(
select deckId,sum(value)/count(*) as ave -- count the number of rows, not count the column name value. intent is more clear
from score
group by deckId
) as x on x.deckId = deck.id
left outer join card on card.deckId = deck.id -- then join the flattened results to cards
group by deck.id, x.ave, deck.name
order by deck.id
[EDIT]
sql has built-in average function, just use this:
select deckId, avg(value) as ave
from score
group by deckId
What's going wrong is that you're creating a Cartesian product between score and card.
Here's how it works: when you join deck to score, you may have multiple rows match. Then each of these multiple rows is joined to all of the matching rows in card. There's no condition preventing that from happening, and the default join behavior when no condition restricts it is to join all rows in one table to all rows in another table.
To see it in action, try this query, without the group by:
select *
from deck
left outer join score on deck.id=score.deckId
left outer join card on deck.id=card.deckId;
You'll see a lot of repeated data in the columns that come from score and card. When you calculate the AVG() over data that has repeats in it, the redundant values magically disappear (as long as the values are repeated uniformly). But when you COUNT() or SUM() them, the totals are way off.
There may be remedies for inadvertent Cartesian products. In your case, you can use COUNT(DISTINCT) to compensate:
select deck.name, avg(score.value) "Ave", count(DISTINCT card.front) "Count"
from deck
left outer join score on deck.id=score.deckId
left outer join card on deck.id=card.deckId
group by deck.id;
This solution doesn't solve all cases of inadvertent Cartesian products. The more general-purpose solution is to break it up into two separate queries:
select deck.name, avg(score.value) "Ave"
from deck
left outer join score on deck.id=score.deckId
group by deck.id;
select deck.name, count(card.front) "Count"
from deck
left outer join card on deck.id=card.deckId
group by deck.id;
Not every task in database programming must be done in a single query. It can even be more efficient (as well as simpler, easier to modify, and less error-prone) to use individual queries when you need multiple statistics.
Using left joins isn't a good approach, in my opinion. Here's a standard SQL query for the result you want.
select
name,
(select avg(value) from score where score.deckId = deck.id) as Ave,
(select count(*) from card where card.deckId = deck.id) as "Count"
from deck;