Why is INNER JOIN producing more records than original file? - sql

I have two tables. Table A & Table B. Table A has 40516 rows, and records sales by seller_id. The first column in Table A is the seller_id that repeats every time a sale is made.
Example: Table A (40516 rows)
seller_id | item | cost
------------------------
1 | dog | 5000
1 | cat | 50
4 |lizard| 80
5 |bird | 20
5 |fish | 90
The seller_id is also present in Table B, and also contains the corresponding name of the seller.
Example: Table B (5851 rows)
seller_id | seller_name
-------------------------
1 | Dog and Cat World INC
4 | Reptile Love.com
5 | Ocean Dogs Inc
I want to join these two tables, but only display the seller name from Table B and all other columns from Table A. When I do this with an INNER JOIN I get 40864 rows (348 extra rows). Shouldn't the query produce only the original 40516 rows?
Also not sure if this matters, but the seller_id can contain several zeros before the number (e.g., 0000845, 0000549).
I've looked around on here and haven't really found an answer. I've tried LEFT and RIGHT joins and get the same results for one and way more results for the other.
SQL Code Example:
SELECT public.table_B.seller_name, *
FROM public.table_A
INNER JOIN public.table_B ON public.table_A.seller_id = public.table_B.seller_id;
Expected Results:
seller_name | seller_id | item | cost
------------------------------------------------
Dog and Cat World INC | 1 | dog | 5000
Dog and Cat World INC | 1 | cat | 50
Reptile Love.com | 4 |lizard| 80
Ocean Dogs Inc | 5 |bird | 20
Ocean Dogs Inc | 5 |fish | 90
I expected the results to contain the same number of rows as Table A. Instead I got names matching up and an additional 348 rows...
Update:
I changed "unique_id" to "seller_id" in the question.
I guess I should have chosen a better name for unique_id in the original example. I didn't mean it to be unique in the sense of a key. It is just the seller's id that repeats every time there is a sale (in Table A). The seller's ID does repeat in Table A because it is supposed to. I simply want to pair up the seller IDs with the seller names.
Thanks again everyone for their help!

unique_id is already not correctly named in the first table, so there is no reason to assume it is unique in the second table either.
Run this query to find the duplicates:
select unique_id
from table_b
group by unique_id
having count(*) > 1;
You can fix the query using distinct on:
SELECT b.seller_name, a.*
FROM public.table_A a JOIN
     (SELECT DISTINCT ON (b.unique_id) b.*
      FROM public.table_B b
      ORDER BY b.unique_id
     ) b
     ON a.unique_id = b.unique_id;
In this case, you may get fewer records, if there are no matches. To fix that, use a LEFT JOIN.
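To see the effect on a small scale, here is a minimal sketch using Python's built-in sqlite3 module and the sample rows from the question. The duplicate entry for seller 5 in table_b is a made-up assumption for illustration. SQLite has no DISTINCT ON, so the de-duplication step uses GROUP BY with MIN() as a stand-in for the Postgres query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (seller_id TEXT, item TEXT, cost INTEGER);
    CREATE TABLE table_b (seller_id TEXT, seller_name TEXT);
    INSERT INTO table_a VALUES
        ('1', 'dog', 5000), ('1', 'cat', 50), ('4', 'lizard', 80),
        ('5', 'bird', 20), ('5', 'fish', 90);
    -- Hypothetical duplicate: seller 5 appears twice in table_b.
    INSERT INTO table_b VALUES
        ('1', 'Dog and Cat World INC'), ('4', 'Reptile Love.com'),
        ('5', 'Ocean Dogs Inc'), ('5', 'Ocean Dogs Incorporated');
""")

# Row count of the sales table on its own.
rows_in_a = conn.execute("SELECT COUNT(*) FROM table_a").fetchone()[0]

# Plain inner join: every sale of seller 5 matches two rows in table_b.
joined = conn.execute("""
    SELECT COUNT(*)
    FROM table_a a
    JOIN table_b b ON a.seller_id = b.seller_id
""").fetchone()[0]

# The duplicate check from the answer.
duplicates = conn.execute("""
    SELECT seller_id, COUNT(*)
    FROM table_b
    GROUP BY seller_id
    HAVING COUNT(*) > 1
""").fetchall()

# Join against one row per seller_id (stand-in for DISTINCT ON).
deduped = conn.execute("""
    SELECT COUNT(*)
    FROM table_a a
    JOIN (SELECT seller_id, MIN(seller_name) AS seller_name
          FROM table_b
          GROUP BY seller_id) b
      ON a.seller_id = b.seller_id
""").fetchone()[0]

print(rows_in_a, joined, duplicates, deduped)
```

Each duplicated seller_id in table_b multiplies the matching rows from table_a, which is where the extra rows come from.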

Because unique id column is not unique.

Gordon Linoff was correct. The seller_id (formerly listed as unique_id) was indeed duplicated throughout the data set. I foolishly assumed otherwise. The seller_name had many duplicates too! In the end I had to use the CONCAT() function to combine the seller_id with a second identifier to create a type of foreign key. After I did this, the join worked as expected. Thanks everyone!

Related

Sum of a column value of table B in table A: is there an automated way? Is it good practice? - Oracle SQL

Basically, each user has a team, and each team has 11 players, so whenever a player scores they earn some points. Is there an automated way to do the following?
When there is an update/entry in the USER_TEAM_PLAYERS table, sum the points of all players into the USER_TEAM table for the corresponding user (in this case, into the TEAM_TOTAL column).
I have two tables:
USER_TEAM with columns USER_ID, TEAM_TOTAL
USER_TEAM_PLAYERS with columns PLAYER_NAME, PLAYER_POINTS, USER_ID
Example:
TABLE - USER_TEAM
USER_ID | TEAM_TOTAL
---------------------
1 | 40
2 | 50
TABLE - USER_TEAM_PLAYERS
PLAYER_NAME | PLAYER_POINTS | USER_ID
-------------------------------------
Adam | 10 | 1
Alex | 30 | 1
Botas | 40 | 2
Pepe | 5 | 2
Diogo | 5 | 2
The first table should be only a view of the second one
CREATE VIEW USER_TEAM2 AS
SELECT USER_ID, SUM(PLAYER_POINTS) AS TEAM_TOTAL
FROM USER_TEAM_PLAYERS
GROUP BY USER_ID
ORDER BY USER_ID;
Doing this, you have no duplicated data, and a view can be used in a SELECT just like a table.
Note 1: I used the name USER_TEAM2 because your first table still exists, but you can delete it.
Note 2: If you want to keep some specific data in the USER_TEAM table, keep both names and modify your view as needed by adding fields through a JOIN with this first table.
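As a rough illustration of why the view stays in sync, here is a sketch using Python's built-in sqlite3 module rather than Oracle (an assumption made only so the example is runnable anywhere). The table and view names follow the answer, and the ORDER BY is moved to the query that reads the view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_team_players (
        player_name   TEXT,
        player_points INTEGER,
        user_id       INTEGER
    );
    INSERT INTO user_team_players VALUES
        ('Adam', 10, 1), ('Alex', 30, 1),
        ('Botas', 40, 2), ('Pepe', 5, 2), ('Diogo', 5, 2);

    -- The total is computed on read, so nothing has to be kept in sync.
    CREATE VIEW user_team2 AS
        SELECT user_id, SUM(player_points) AS team_total
        FROM user_team_players
        GROUP BY user_id;
""")

query = "SELECT user_id, team_total FROM user_team2 ORDER BY user_id"
before = conn.execute(query).fetchall()

# A player scores: only the detail table is updated.
conn.execute(
    "UPDATE user_team_players SET player_points = player_points + 5 "
    "WHERE player_name = 'Adam'"
)
after = conn.execute(query).fetchall()

print(before, after)
```

The second read reflects the new points without any trigger or manual update of a stored total.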

Count unique entries for ts_stat count in full text search

I'm struggling with using ts_stat to get the number of unique occurrences of tags in a table and sort them by the highest count.
What I need though is to only count each entry one time so that only unique entries are counted. I tried group by and distinct but nothing is working for me.
e.g. table
user_id | tags | post_date
===================================
2 | dog cat | 1580049400
2 | dog | 1580039400
3 | dog | 1580038400
3 | dog dog cat | 1580058400
4 | dog horse | 1580028400
Here is the current query
SELECT word, ndoc, nentry
FROM ts_stat($$SELECT to_tsvector('simple', tags) FROM tags WHERE post_date > 1580018400$$)
ORDER BY ndoc DESC
LIMIT 10;
Right now this will produce
word | ndoc | nentry
====================
dog | 5 | 6
cat | 2 | 2
horse| 1 | 1
The result I would be looking for is unique counts so no 1 user can count more than once even if they have > 1 entries after a certain date as noted in the post_date condition (Which might be irrelevant). Like below.
word | total_count_per_user
===========================
dog | 3 (because there are 3 unique users with this term)
cat | 2 (because there are 2 unique users with this term)
horse| 1 (because there is 1 unique user with this term)
UPDATE: I changed the column name to reflect the output. The point is that no matter how many times a user enters a word, only the unique count per user is needed. E.g., if a user in that scenario creates 100 entries with dog in the text, dog is counted once for that user, not 100 times.
You can use COUNT on a DISTINCT value, if I understand you correctly. A sample query is below:
SELECT tags,COUNT(DISTINCT user_id)
FROM your_table
GROUP BY tags
I guess this one was tough. Just in case someone happens to have a similar requirement I was able to get this to work. Seems odd to have to get total with ts_stat then filter it again using distinct, cross join etc so that no matter how many times it finds a word each user only counts once per word. I'm not sure how efficient it will be on a large data set but it yields the expected results.
UPDATE: This works without using a CTE. Also, the cross join is the key to filtering on user id.
SELECT DISTINCT (t.word) as tag, count(DISTINCT h.user_id) as posts
FROM ts_stat($$SELECT hashtagsearch FROM tagstable WHERE post_date > 1580018400$$) t
CROSS JOIN tagstable h WHERE hashtagsearch @@ to_tsquery('simple',t.word)
GROUP BY t.word HAVING count(DISTINCT h.user_id) > 1 ORDER BY posts DESC LIMIT 10;
This answer helped quite a bit. https://stackoverflow.com/a/42704207/330987
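To pin down the target semantics independently of ts_stat, here is a small plain-Python sketch over the sample rows from the question. It is not a PostgreSQL solution. It splits tags on whitespace, which is only an approximation of what to_tsvector('simple', ...) does, and the variable names are hypothetical.

```python
from collections import defaultdict

# (user_id, tags, post_date) rows copied from the example table.
rows = [
    (2, "dog cat", 1580049400),
    (2, "dog", 1580039400),
    (3, "dog", 1580038400),
    (3, "dog dog cat", 1580058400),
    (4, "dog horse", 1580028400),
]
cutoff = 1580018400

# Collect the set of users per word, so repeats by one user count once.
users_per_word = defaultdict(set)
for user_id, tags, post_date in rows:
    if post_date > cutoff:
        for word in tags.split():
            users_per_word[word].add(user_id)

# Sort by number of distinct users, highest first, then by word.
counts = sorted(
    ((word, len(users)) for word, users in users_per_word.items()),
    key=lambda pair: (-pair[1], pair[0]),
)
print(counts)
```

Any SQL version should produce the same pairs as this reference computation on the sample data.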

Trying to find non-duplicate entries in mostly identical tables(access)

I have 2 different databases. They track different things about inventory. In essence, they share 3 common fields: location, item number and quantity. I've extracted these into 2 tables with only those fields. Every answer I find doesn't cover all the test cases, just some of the fields.
Items can be in multiple locations, and in turn each location can have multiple items. The primary key would be location and item number.
I need to flag when an entry doesn't match on all three fields.
I've only been able to find queries that match on an ID or so, or answers whose queries are beyond my comprehension. In the example below, I'd need a query that would show that rows 1, 2, and 5 had issues. I'd run it on each table and have to verify the results with a physical inventory.
Please refrain from commenting on it being silly to have information in 2 different databases. All I get in response is to deal with it =P
Table A
Location | ItemNum | QTY
-------------------------
1a1a | as1001 | 5
1a1b | as1003 | 10
1a1b | as1004 | 2
1a1c | as1005 | 15
1a1d | as1005 | 15
Table B
Location | ItemNum | QTY
-------------------------
1a1a | as1001 | 10
1a1d | as1003 | 10
1a1b | as1004 | 2
1a1c | as1005 | 15
1a1e | as1005 | 15
This article seemed to do what I wanted but I couldn't get it to work.
To find entries in Table A that don't have an exactly matching entry in Table B:
select A.*
from A
left join B on A.location = B.location and A.ItemNum = B.ItemNum and A.qty = B.qty
where B.location Is Null
Just swap all the A's and B's to get the list of entries in B with no matching entry in A.
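Here is a quick way to check this against the sample data, sketched with Python's built-in sqlite3 module instead of Access (an assumption for runnability; the join itself is the same). The lower-case table and column names are chosen for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (location TEXT, itemnum TEXT, qty INTEGER);
    CREATE TABLE b (location TEXT, itemnum TEXT, qty INTEGER);
    INSERT INTO a VALUES
        ('1a1a', 'as1001', 5), ('1a1b', 'as1003', 10), ('1a1b', 'as1004', 2),
        ('1a1c', 'as1005', 15), ('1a1d', 'as1005', 15);
    INSERT INTO b VALUES
        ('1a1a', 'as1001', 10), ('1a1d', 'as1003', 10), ('1a1b', 'as1004', 2),
        ('1a1c', 'as1005', 15), ('1a1e', 'as1005', 15);
""")

# Rows of A with no row in B that agrees on all three fields.
only_in_a = conn.execute("""
    SELECT a.location, a.itemnum, a.qty
    FROM a
    LEFT JOIN b
      ON a.location = b.location
     AND a.itemnum = b.itemnum
     AND a.qty = b.qty
    WHERE b.location IS NULL
    ORDER BY a.location, a.itemnum
""").fetchall()

# Same query with the roles of A and B swapped.
only_in_b = conn.execute("""
    SELECT b.location, b.itemnum, b.qty
    FROM b
    LEFT JOIN a
      ON a.location = b.location
     AND a.itemnum = b.itemnum
     AND a.qty = b.qty
    WHERE a.location IS NULL
    ORDER BY b.location, b.itemnum
""").fetchall()

print(only_in_a)
print(only_in_b)
```

On the sample data, each direction flags rows 1, 2, and 5 of its own table, which matches what the question asks for.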

sql insert value from another table with original nulls but not unmatched entries

OK. So this is a hard one to explain, but I am replacing the type of a foreign key in a database. To do this I need to update the values in a table that references it. That is all fine and good, and nice and easy to do.
I'm inserting this stuff into a temporary table which will replace the original table, but the insert query isn't at all difficult, it's the select that I get the values from.
However, I also want to keep any entries where the original reference was NULL. Also not hard, I could use a LEFT OUTER JOIN for that.
But we're not done yet: I don't want the entries for which there is no match in the second table. I've been dinking around with this for 2 hours now, and am no closer to figuring this out than I am to the moon.
Let me give you an example data set:
____________________________
| Inventory || Customer |
|============||============|
| ID Cust || ID Name |
|------------||------------|
| 1 A || 1 A |
| 2 B || 2 B |
| 3 E || 3 C |
| 4 NULL || 4 D |
|____________||____________|
Let's say the database used to use the Customer.Name field as its Primary Key, and I need to change it to a standard int identity(1,1) not null ID. I've added the field with no issues in the Customer table, and kept the Name because I need it for other stuff. I have had no trouble with this in all the tables that do not allow NULLs, but since the "Inventory" table allows something to be associated with No customer, I'm running into troubles.
If I did a left outer join, my results would be:
______________
| Results |
|============|
| ID Cust |
|------------|
| 1 1 |
| 2 2 |
| 3 NULL |
| 4 NULL |
|____________|
However, Inventory #3 was referencing a customer which does not exist. I want that to be filtered out.
This database is my development database, where I hack, slash, and destroy things with wanton disregard for validity. So a lot of links in these tables are no longer valid.
The next step is replicating this process in the beta-testing environment, where bad records shouldn't exist, but I can't guarantee that. So I'd like to keep the filter, if possible.
The query I have right now is using a sub-query to find all rows in Inventory whose CustID either exists in Customers, or is null. It then tries to only grab the value from those rows which the subquery found. Here's the translated query:
insert into results
(
ID,
Cust
)
select
inv.ID, cust.ID
from Inventory inv, Customer cust
where inv.ID in
(
select inv.ID from Inventory inv, Customer cust
where inv.Cust is null
or cust.Name = inv.Cust
)
and cust.Name = inv.Cust
But, as I'm sure you can see, this query isn't right. I've tried using 2, 3 subqueries, inner joins, left joins, bleh. The results of this query, and many others I've tried (that weren't horribly, horribly wrong) are:
______________
| Results |
|============|
| ID Cust |
|------------|
| 1 1 |
| 2 2 |
|____________|
Which is essentially an inner-join. Considering my actual data has around 1100 records which have NULL values in that field, I don't think truncating them is the answer.
The answer I'm looking for is:
______________
| Results |
|============|
| ID Cust |
|------------|
| 1 1 |
| 2 2 |
| 4 NULL |
|____________|
The trickiest part of this insert into select is the fact that I'm looking to insert either a value from another table, or essentially a value from this table or the literal NULL. That just isn't something I know how to do; I'm still getting the hang of SQL.
Since I'm inserting the results of this query into a table, I've considered doing the insert using a select which leaves out the NULL values and un-matched records, then going back through and adding in all the NULL records, but I really want to learn how to do the more advanced queries like this.
So do any of yous folks have any ideas? 'Cause I'm lost.
How about a union?
Select all records where Inventory.Cust matches a Customer.Name, and union that with all Inventory records where Inventory.Cust is null.
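A minimal sketch of the union idea on the example data, using Python's built-in sqlite3 module (an assumption; the question does not name its database). It also shows a single-query variant with a LEFT JOIN and a filter, which is my addition and not part of the suggestion above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER, name TEXT);
    CREATE TABLE inventory (id INTEGER, cust TEXT);
    INSERT INTO customer VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
    INSERT INTO inventory VALUES (1, 'A'), (2, 'B'), (3, 'E'), (4, NULL);
""")

# Matched rows, plus rows whose original reference was NULL.
union_rows = conn.execute("""
    SELECT inv.id, cust.id
    FROM inventory inv
    JOIN customer cust ON cust.name = inv.cust
    UNION ALL
    SELECT inv.id, NULL
    FROM inventory inv
    WHERE inv.cust IS NULL
    ORDER BY 1
""").fetchall()

# Variant: outer join, then drop rows that had a reference but no match.
left_join_rows = conn.execute("""
    SELECT inv.id, cust.id
    FROM inventory inv
    LEFT JOIN customer cust ON cust.name = inv.cust
    WHERE inv.cust IS NULL OR cust.id IS NOT NULL
    ORDER BY inv.id
""").fetchall()

print(union_rows)
print(left_join_rows)
```

Inventory 3 (customer E, which does not exist) is dropped in both versions, while inventory 4 keeps its NULL.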

Select ID given the list of members

I have a table for the link/relationship between two other tables: a table of customers and a table of groups. A group is made up of one or more customers. The link table looks like this:
APP_ID | GROUP_ID | CUSTOMER_ID
1 | 1 | 123
1 | 1 | 124
1 | 1 | 125
1 | 2 | 123
1 | 2 | 125
2 | 3 | 123
3 | 1 | 123
3 | 1 | 124
3 | 1 | 125
I now need, given a list of customer IDs, to get the group ID for that list. A group ID may not be unique: the same group ID will contain the same list of customer IDs, but the group may exist in more than one app_id.
I'm thinking that
SELECT APP_ID, GROUP_ID, COUNT(CUSTOMER_ID) AS COUNT
FROM GROUP_CUST_REL
WHERE CUSTOMER_ID IN ( <list of ids> )
GROUP BY APP_ID, GROUP_ID
HAVING COUNT(CUSTOMER_ID) = <number of ids in list>
will return all of the group IDs that contain all of the customer IDs in the given list, and only those group IDs. So for a list of (123,125), only group ID 2 would be returned from the above example.
I will then have to link with the app table to use its created timestamp to identify the most recent application that the group existed in so that I can then pull the correct/most up to date info from the group table.
Does anyone have any thoughts on whether this is the most efficient way to do this? If there is another quicker/cleaner way I'd appreciate your thoughts.
This smells like a division:
Division sample
Other related stack overflow question
Taking a look at the provided links, you'll see solutions to similar issues from relational algebra's point of view, but they don't seem to be any quicker or, arguably, any cleaner.
I didn't look at your solution at first, and when I solved this I turned out to have solved this the same way you did.
Actually, I thought this:
<number of ids in list>
Could be turned into something like this (so that you don't need the extra parameter):
select count(*) from (<list of ids>) as t
But clearly, I was wrong. I'd stay with your current solution if I were you.
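One thing worth checking against the sample data: because the WHERE ... IN filter runs before the count, the query from the question also seems to match groups that contain extra customers (group 1 holds 123, 124 and 125). The sketch below uses Python's built-in sqlite3 module to run that query and a hypothetical exact-match variant side by side. The variant is my assumption about the intent and relies on (app_id, group_id, customer_id) being unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE group_cust_rel (app_id INTEGER, group_id INTEGER, customer_id INTEGER);
    INSERT INTO group_cust_rel VALUES
        (1, 1, 123), (1, 1, 124), (1, 1, 125),
        (1, 2, 123), (1, 2, 125),
        (2, 3, 123),
        (3, 1, 123), (3, 1, 124), (3, 1, 125);
""")

ids = [123, 125]
marks = ",".join("?" * len(ids))  # one placeholder per id

# Query from the question: groups containing at least all listed ids.
superset = conn.execute(f"""
    SELECT app_id, group_id
    FROM group_cust_rel
    WHERE customer_id IN ({marks})
    GROUP BY app_id, group_id
    HAVING COUNT(customer_id) = ?
    ORDER BY app_id, group_id
""", (*ids, len(ids))).fetchall()

# Exact match: no WHERE filter, so extra members change the group size.
exact = conn.execute(f"""
    SELECT app_id, group_id
    FROM group_cust_rel
    GROUP BY app_id, group_id
    HAVING COUNT(*) = ?
       AND SUM(CASE WHEN customer_id IN ({marks}) THEN 1 ELSE 0 END) = ?
    ORDER BY app_id, group_id
""", (len(ids), *ids, len(ids))).fetchall()

print(superset)
print(exact)
```

If matching supersets is actually what is wanted, the original query is fine as it stands; the exact-match variant only matters when the group must contain the listed customers and nobody else.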