Joining multiple tables in Oracle gives duplicated records

I am a newbie to SQL. I have three tables: mr1, mr2, and mr3. caseid is the primary key in all of these tables. I need to join the columns of all these tables and display the result.
The problem is that I don't know which join to use.
When I joined them all with the query below:
select mr1.col1,mr1.col2,mr2.col1,mr2.col2,mr3.col1,mr3.col2
from mr1,mr2,mr3
where mr1.caseid = mr2.caseid
and mr2.caseid = mr3.caseid;
it displays 4 records, even though the maximum number of records in any one table is two, which is in table mr2.
The records are duplicated. Can anyone help me with this?

DISTINCT will do it, but it's not the correct approach.
You need to add another join condition (mr1.caseid = mr3.caseid) because the mr2 and mr3 rows must be related to the same row in mr1; otherwise you end up with 2 pairs, one for each table joined to your primary table (mr2).
First answer on SO, so forgive me if it wasn't that clear.

Your problem is that your tables are in a one-to-many relationship. When you join them, the number of rows is expected to go up unless you take steps to limit the records returned. How to fix it depends on the meaning of the data.
If all the fields are exactly the same, then adding DISTINCT will fix the problem. However, depending on the size of the tables and the number of records you are returning, it may be faster to use a derived table that limits the join to only one record from the table with multiple records.
If at least one of the fields is different, however, then you need to know the business rule that lets you pick the correct record. That might be done by adding a WHERE clause, by using an aggregate function with GROUP BY, or both. It really depends on the meaning of the result set, which is why you need to ask further questions inside your own organization: the people who will use the results of the query are the only ones who know which of the multiple records is the correct one to pick. The business might even want to see all of the records, in which case you have no problem at all.
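To make the row multiplication and the derived-table fix concrete, here is a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for Oracle, and it assumes that caseid is not actually unique in mr2 and mr3 (two rows each for the same case); the sample values and the use of MAX as the "business rule" are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mr1 (caseid INTEGER, col1 TEXT);
    CREATE TABLE mr2 (caseid INTEGER, col1 TEXT);
    CREATE TABLE mr3 (caseid INTEGER, col1 TEXT);
    -- Assumed sample data: one parent row, two rows in each other table.
    INSERT INTO mr1 VALUES (1, 'a');
    INSERT INTO mr2 VALUES (1, 'b1'), (1, 'b2');
    INSERT INTO mr3 VALUES (1, 'c1'), (1, 'c2');
""")

# Plain join: every mr2 row pairs with every mr3 row of the same case,
# so 1 x 2 x 2 = 4 rows come back.
joined = conn.execute("""
    SELECT mr1.col1, mr2.col1, mr3.col1
    FROM mr1
    JOIN mr2 ON mr1.caseid = mr2.caseid
    JOIN mr3 ON mr2.caseid = mr3.caseid
""").fetchall()

# Derived tables reduce each table to one row per caseid before the join.
# MAX(col1) is only a placeholder for whatever rule picks the right record.
collapsed = conn.execute("""
    SELECT mr1.col1, m2.col1, m3.col1
    FROM mr1
    JOIN (SELECT caseid, MAX(col1) AS col1 FROM mr2 GROUP BY caseid) m2
      ON mr1.caseid = m2.caseid
    JOIN (SELECT caseid, MAX(col1) AS col1 FROM mr3 GROUP BY caseid) m3
      ON mr1.caseid = m3.caseid
""").fetchall()

print(len(joined), collapsed)
```

With this sample data the plain join returns four rows, while the derived-table version returns one row per case.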

Related

Selecting a large number of rows by index using SQL

I am trying to select a number of rows by the value of a column called ID. I know you can do this pretty easily by:
SELECT col1, col2, col3 FROM mytable WHERE id IN (1,2,3,4,5...)
However, what if there are a few million IDs I want to select and the IDs don't follow a pattern (which means I can't use something like BETWEEN x AND y)? Does this SELECT statement still work, or is there a better way of doing it?
The actual application is this. Users specify filters, which are compared to some attributes of the records. From those filters, we create a subset of the data that is of interest to a particular user. There are about 30 million records, each with roughly 3000 attributes (which are stored in roughly 30 tables, but every table has ID as a primary key), so every time someone queries their desired subset of records, we'd have to join many tables, apply those filters, and figure out what their subset looks like. To avoid joining many tables all the time, I thought it might be a better idea to join the tables once and figure out the IDs of the selected subset. That way, each time a new query is made, all we have to do is select the relevant columns of the rows that match the filtered IDs.
This depends on the database and the interface you are using. For a few hundred or thousand values, no problem. But your question specifies millions. And that could start to get into limits on the length of the query -- either specified by the database, the tool you are using, or intermediate libraries.
If you have so many ids, I would strongly recommend that you load them into a table in the database with the id as the primary key. Then use join or exists to identify the rows in your table that match.
Often, such a list would be generated in the database anyway. In that case, you can use a subquery or CTE and just include that code in your final query.
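A minimal sketch of the "load the ids into a table, then join or use exists" approach, again with Python's sqlite3 standing in for the real database. The table name `wanted` and the sample data are hypothetical; only `mytable` and `id` come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, col1 TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(i, f"row{i}") for i in range(1, 1001)],
)

# The ids to fetch; 5000 is deliberately absent from mytable.
wanted_ids = [3, 17, 256, 999, 5000]

# Load the ids into their own table, keyed on id, instead of building
# a giant IN (...) list in the query text.
conn.execute("CREATE TEMP TABLE wanted (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO wanted VALUES (?)", [(i,) for i in wanted_ids])

# Variant 1: join against the id table.
via_join = conn.execute("""
    SELECT t.id, t.col1
    FROM mytable t
    JOIN wanted w ON w.id = t.id
    ORDER BY t.id
""").fetchall()

# Variant 2: EXISTS against the id table.
via_exists = conn.execute("""
    SELECT t.id, t.col1
    FROM mytable t
    WHERE EXISTS (SELECT 1 FROM wanted w WHERE w.id = t.id)
    ORDER BY t.id
""").fetchall()

print(via_join)
```

Both variants return the same rows; which one performs better depends on the database and its optimizer.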

Bigquery - remove duplicates of certain columns, but not all

I have two tables that I am left joining together. The first table has transaction-level detail, so the key I use to join to the second table is duplicated. When I left join the second table, the measure "company_spend" is highly inflated.
I need a way to keep only a single value of the duplicated data. My thought was to run DISTINCT on only those columns, but I am not seeing that BigQuery supports DISTINCT on only some of the columns rather than all of them.
SELECT UPPER(cwnextt.Current_Contract_Number) AS Current_Contract_Number,
       UPPER(cwnextt.Replacement_Contract_Number) AS Replacement_Contract_Number,
       UPPER(cwnextt.Current_Contract_Name) AS Current_Contract_Name,
       UPPER(cwnextt.Supplier_Top_Parent_Entity_Code) AS Supplier_Top_Parent_Entity_Code,
       UPPER(cwnextt.Supplier_Top_Parent_Name) AS Supplier_Top_Parent_Name,
       UPPER(cwnextt.company_Entity_Code) AS company_Entity_Code,
       UPPER(cwnextt.Facility_Name) AS Facility_Name,
       smart.company_Spend AS companySpend
FROM `test_etl_field.contracts_with_member_entity_codes_test_view_2` cwnextt
--this table is what is causing the below table to duplicate,
--but I need all of this data as well in its current format.
LEFT JOIN `test.trans_analysis` tsa
  ON TRIM(UPPER(cwnextt.company_entity_code)) = TRIM(UPPER(tsa.company_entity_code))
  AND TRIM(UPPER(cwnextt.Supplier_Top_Parent_Entity_Code)) = TRIM(UPPER(tsa.manufacturer_top_parent_entity_code))
  AND TRIM(UPPER(cwnextt.Current_Contract_Name)) = TRIM(UPPER(tsa.contract_category))
  AND cwnextt.spend_period_yyyyqmm = tsa.spend_period_yyyyqmm
--this table contains "company_spend" which is now duplicated
LEFT JOIN `test_etl_field.ecr_smart_data` smart
  ON smart.company_entity_code = cwnextt.company_entity_code
  AND (smart.contract_number = cwnextt.current_contract_number
       OR smart.contract_number = cwnextt.replacement_contract_number)
  AND smart.month_key = cwnextt.spend_period_yyyyqmm
If something can be created that will keep company_spend from duplicating on the second left join, that is what I am after.
I'm not sure I understand all the details of your problem, but here's a fact from the BigQuery docs:
SELECT DISTINCT
A SELECT DISTINCT statement discards duplicate rows
and returns only the remaining rows.
You can't apply DISTINCT to specific columns because it doesn't make sense. Say you have 4 columns and call DISTINCT on 3 of them: what is SQL supposed to do with the last one?
You must tell SQL which value to keep for the remaining column, and GROUP BY is the right solution here.
So if you want to:
Remove a column that has been duplicated: just adjust your SELECT to return only the columns you want.
Remove rows that have the same value in specific columns: I would suggest a GROUP BY on the targeted columns, taking the aggregation you want (first, avg, sum, or whatever) for the remaining ones.
Remove the value from a row if another row has the same one: you may not want to do that. A row has to keep its value, and you won't get it back. Besides, it's the same problem: which row do you want to keep?
Hope this helps! Feel free to clarify your problem if you want more specific answers.
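The GROUP BY suggestion can be sketched on a tiny example. This uses Python's sqlite3 rather than BigQuery, and the `contracts` and `spend` tables with their values are hypothetical simplifications of the tables in the question; MAX is just one possible choice of aggregate for the repeated value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, simplified tables: contracts is transaction-level,
    -- spend holds one company_spend value per contract.
    CREATE TABLE contracts (contract_number TEXT, txn_id INTEGER);
    CREATE TABLE spend (contract_number TEXT, company_spend REAL);
    INSERT INTO contracts VALUES ('C1', 1), ('C1', 2), ('C1', 3), ('C2', 4);
    INSERT INTO spend VALUES ('C1', 100.0), ('C2', 50.0);
""")

# Summing after the join counts C1's spend once per transaction row.
inflated_total = conn.execute("""
    SELECT SUM(s.company_spend)
    FROM contracts c
    LEFT JOIN spend s ON s.contract_number = c.contract_number
""").fetchone()[0]

# Grouping on the key and telling SQL which value to keep (MAX here)
# leaves one row, and one spend value, per contract.
per_contract = conn.execute("""
    SELECT c.contract_number,
           COUNT(c.txn_id)      AS txn_count,
           MAX(s.company_spend) AS company_spend
    FROM contracts c
    LEFT JOIN spend s ON s.contract_number = c.contract_number
    GROUP BY c.contract_number
    ORDER BY c.contract_number
""").fetchall()

print(inflated_total, per_contract)
```

The grouped result loses the individual transaction rows, so it only fits if one row per contract is acceptable for the report.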
While I couldn't resolve this issue in SQL, I used Tableau with a FIXED LOD expression to aggregate the data past the duplicates so the end user could visualize the output accurately. Not ideal, but the SQL route wasn't making sense.

How to understand this query?

SELECT DISTINCT
...
...
...
FROM Reviews Rev
INNER JOIN Reviews SubRev ON Subrev.W_ID=Rev.ID
WHERE Rev.Status='Approved'
This is a small part of a long query that I've been trying to understand for a day now. What is happening with the join? The Reviews table appears to be joined with itself under different aliases. Why is this done? What does it achieve? Also, the ID field of the Reviews table is null for the entries that are nevertheless selected and returned. This is correct, but I don't understand how that can happen if the W_ID field is not null.
It allows you to join one row from the table to a different row in the table.
I've both seen this done and used it myself in cases where there is a relationship between those rows.
Real-world examples:
An old version of a record and a newer version
Some sort of hierarchical relationship (e.g. if the table contains records of people, you can record that someone is a parent of someone else)
There are probably plenty of other possible use cases, too.
SQL allows you to create a foreign key that relates two different columns in the same table.
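A small sketch of such a self-join, run through Python's sqlite3. The column names ID, W_ID, and Status come from the question, but their meaning is an assumption here: W_ID is treated as pointing at the ID of a parent review, and the Title column and all sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumption: W_ID references the ID of another row in the same table.
    CREATE TABLE Reviews (
        ID INTEGER PRIMARY KEY,
        W_ID INTEGER REFERENCES Reviews (ID),
        Status TEXT,
        Title TEXT
    );
    INSERT INTO Reviews VALUES
        (1, NULL, 'Approved', 'parent'),
        (2, 1,    'Pending',  'child A'),
        (3, 1,    'Approved', 'child B'),
        (4, NULL, 'Rejected', 'other parent'),
        (5, 4,    'Approved', 'child of rejected');
""")

# Rev is the "parent" role, SubRev the "child" role of the same table.
# The WHERE clause filters on the parent's status only.
pairs = conn.execute("""
    SELECT Rev.ID, Rev.Title, SubRev.ID, SubRev.Title
    FROM Reviews Rev
    INNER JOIN Reviews SubRev ON SubRev.W_ID = Rev.ID
    WHERE Rev.Status = 'Approved'
    ORDER BY SubRev.ID
""").fetchall()

print(pairs)
```

Each output row combines one parent row with one of its child rows, so which alias a selected column comes from determines whose values you see.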

SQL 2 JOINS USING SINGLE REFERENCE TABLE

I'm trying to achieve 2 joins. If I run the 1st join alone, it pulls 4 lots of results, which is correct. However, when I add the 2nd join, which queries the same reference table using the results from the select statement, it pulls in additional results. Please see attached. The squared section should not be returned.
So I removed the 2nd join to try to explain better. See pic2. I'm trying to get another column that looks up InvolvedInternalID against the initial reference table IRIS.Practice.idvClient.
Your database is simply doing as you tell it. When you add in the second join (confusingly aliased as tb1 in a 3-table query), the database finds matching rows that obey the predicate/truth statement in the ON part of the join.
If you don't want those rows in there, then one of two things must be the case:
1) The truth you specified in the ON clause is faulty. For example, SELECT * FROM person INNER JOIN shoes ON person.age = shoes.size is faulty: two people with age 13 and two shoes with size 13 will produce 4 results, and shoe size has nothing to do with age anyway.
2) There were rows in the joined table that didn't apply to the results you were looking for, but you forgot to filter them out with a WHERE clause (or an additional restriction in the ON clause). For example, a table holds all historical data as well as current data, and the current record is the one with a NULL in the DeletedOn column. If you forget to say WHERE deletedon IS NULL, then your data will multiply as all the past rows that don't apply to your query are brought in.
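Both cases can be reproduced on toy data. The sketch below uses Python's sqlite3; the person/shoes join is the one from the answer, while the `client` and `address` tables and all sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (name TEXT, age INTEGER);
    CREATE TABLE shoes (model TEXT, size INTEGER);
    INSERT INTO person VALUES ('Ann', 13), ('Bob', 13);
    INSERT INTO shoes VALUES ('sneaker', 13), ('boot', 13);

    -- Hypothetical history table: the current row has deletedon IS NULL.
    CREATE TABLE client (id INTEGER, name TEXT);
    CREATE TABLE address (client_id INTEGER, city TEXT, deletedon TEXT);
    INSERT INTO client VALUES (1, 'Acme');
    INSERT INTO address VALUES (1, 'Leeds', '2019-01-01'), (1, 'York', NULL);
""")

# Case 1: a faulty ON predicate. Two people aged 13 and two shoes of
# size 13 match in every combination.
faulty = conn.execute("""
    SELECT person.name, shoes.model
    FROM person
    INNER JOIN shoes ON person.age = shoes.size
""").fetchall()

# Case 2: a sound predicate, but historical rows are not filtered out.
unfiltered = conn.execute("""
    SELECT c.name, a.city
    FROM client c
    INNER JOIN address a ON a.client_id = c.id
""").fetchall()

# Same join, restricted to the current record.
current_only = conn.execute("""
    SELECT c.name, a.city
    FROM client c
    INNER JOIN address a ON a.client_id = c.id
    WHERE a.deletedon IS NULL
""").fetchall()

print(len(faulty), len(unfiltered), current_only)
```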
Don't alias tables with tbX, tbY, etc. Make the names meaningful! Aliases like tbX have no relation to the original table name, so when you encounter tbX you have to search the rest of the query to find where it's declared before you can say "ah, it's the addresses table". Worse, in this case you join idvclient in twice but give the two copies unhelpful aliases like tb1 and tb3, when really you should have aliased them with something that describes their relationship to the rest of the query's tables.
For example, ParentClient and SubClient or OriginatingClient/HandlingClient would be better names, if these tables are in some relationship with each other.
Whatever the purpose of joining this table in twice is, alias it in relation to that purpose. It may make what you've done wrong easier to spot, for example "oh, of course, I'm missing a WHERE parentclient.type = 'parent'" (or WHERE handlingclient.handlingdate IS NOT NULL, etc.).
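As an illustration of purpose-based aliases, here is a sketch that joins one reference table twice, run through Python's sqlite3. Only the names idvClient and InvolvedInternalID come from the question; the `job` table, its columns, the alias names, and the data are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-in for the reference table in the question.
    CREATE TABLE idv_client (id INTEGER PRIMARY KEY, name TEXT);
    -- Hypothetical table with two references into idv_client.
    CREATE TABLE job (
        job_id INTEGER,
        client_id INTEGER,
        involved_internal_id INTEGER
    );
    INSERT INTO idv_client VALUES (1, 'Alpha Ltd'), (2, 'Beta LLP'), (3, 'Gamma plc');
    INSERT INTO job VALUES (10, 1, 2), (11, 3, 1);
""")

# Each copy of idv_client is aliased after the role it plays, and each
# join uses the one column that defines that role.
rows = conn.execute("""
    SELECT j.job_id,
           main_client.name     AS client_name,
           involved_client.name AS involved_name
    FROM job j
    INNER JOIN idv_client main_client
        ON main_client.id = j.client_id
    INNER JOIN idv_client involved_client
        ON involved_client.id = j.involved_internal_id
    ORDER BY j.job_id
""").fetchall()

print(rows)
```

Because each join matches on a primary key of the reference table, neither join adds rows; the result has exactly one row per job.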
The first step to wisdom is by calling things their proper names

SQL to Spotfire query filtering issue with multiple tables

I am trying to calculate hours flowing in and out of a cost center. When the cost center lends out an employee for an hour it's +1 and when they borrow an employee for an hour it's -1.
Right now I'm using a query that says
select
columns
from dbo.table
where EmployeeCostCenter <> ProjectCostCenter
So when ProjectCostCenter = ID_CostCenter, it returns +HoursQuantity.
Then I update ID_CostCenter to EmployeeCostCenter, so that where ID_CostCenter = EmployeeCostCenter it takes -HoursQuantity.
That works fine. The problem is when I import it to Spotfire I can't filter on the main table even after I added the table relations. Can anyone explain why?
I can upload the actual code if needed, but I use 4 queries and a couple of them are quite lengthy. The main table, a temp table to calculate incoming hours, and a temp table to calculate outgoing hours are the only ones involved in this problem I think.
(moved to answer to avoid lengthy discussion)
Data relations are used to propagate filtering and marking between different datasets. Just as in an RDBMS, the relation is what Spotfire uses as the link between datasets; it is essentially the same as the column or columns you join on. Thus, any column that you wish to filter in TableA and have the result set limited in TableB (or vice versa) must be a relation.
Column matches aren't related columns, but are associated for aggregations, category axes, etc. within each visualization. So if TableA has "amount" and TableB has "amount debit" and you wanted to use both of these in an expression, say Sum([TableA].[amount],[TableB].[amount debit]), they would need to be matched in order not to produce erroneous results.
Lastly, once you set up your relations, you should check your filter panel to set up how you want the filtering to work. You can have the rows included, excluded, or ignored altogether. Here is a link explaining that.