SQL Query is creating way too many repeated rows - sql

i have an issue with a sql query and how the output is being displayed, you see, i have 3 tables and have at least one field in common, the thing is when i join 2 tables together the information i need is displayed properly, but when i join the third the output goes insane and duplicates the results way too much and i need to figure out why it is happening, down below i'll show you all the tables and relations between each other
this is how the tables are related to each other
This is how the first table (dbo_predios) is made the first three fields are the only relevant in this case
This is how the second table (dbo_permisos_obras_mayores) is made the first three fields are the only relevant in this case as well, the second two can match the first table (dbo_predios)
And here is how the third table (dbo_recepciones_obras_mayores) is made, the fourth field is the only relevant in this case, it could relate to the second table (dbo_permisos_obras_mayores) to the same name field
okay, now that is structurewise, now the query i'm executing is the following:
SELECT
dbo_predios.codigo_unico_predio,
dbo_permisos_obras_mayores.numero_permiso_edificacion,
dbo_permisos_obras_mayores.fecha_permiso_edificacion
FROM dbo_predios
INNER JOIN dbo_permisos_obras_mayores ON dbo_predios.codigo_manzana_predio = dbo_permisos_obras_mayores.codigo_manzana_predio AND dbo_predios.codigo_lote_predio = dbo_permisos_obras_mayores.codigo_lote_predio
INNER JOIN dbo_recepciones_obras_mayores ON dbo_permisos_obras_mayores.numero_recepcion_permiso = dbo_recepciones_obras_mayores.numero_recepcion_permiso
WHERE dbo_permisos_obras_mayores.codigo_manzana_predio = 9402 AND dbo_permisos_obras_mayores.codigo_lote_predio = 30
And the result of executing the query in that way is this:
Later on i did some trial and error and removed the second inner join line, and the result surprised me, here is what happened:
Conclusion: in brief the third table is causing the cartesian product, why? i wish i knew why, what do you think of this particular case? i'd thank any help you could give me, thanks in advance.

Here's the solution - since you are saying that the numero_recepcion_permiso is blank, just add the condition to the inner join, to exclude empty ones:
SELECT
dbo_predios.codigo_unico_predio,
dbo_permisos_obras_mayores.numero_permiso_edificacion,
dbo_permisos_obras_mayores.fecha_permiso_edificacion
FROM dbo_predios
INNER JOIN dbo_permisos_obras_mayores ON dbo_predios.codigo_manzana_predio = dbo_permisos_obras_mayores.codigo_manzana_predio AND dbo_predios.codigo_lote_predio = dbo_permisos_obras_mayores.codigo_lote_predio
INNER JOIN dbo_recepciones_obras_mayores ON dbo_permisos_obras_mayores.numero_recepcion_permiso = dbo_recepciones_obras_mayores.numero_recepcion_permiso
AND dbo_recepciones_obras_mayores.numero_recepcion_permiso <>''
WHERE dbo_permisos_obras_mayores.codigo_manzana_predio = 9402 AND dbo_permisos_obras_mayores.codigo_lote_predio = 30
With that said, should that field allowed to be blank or NULL? Perhaps you need to add a constraint to your table to prevent that scenario. Another suggestion - why did you choose NUMERIC(18,0) as the data type on the primary key for those tables? I would prefer a simple INT or BIGINT and maybe let the database generate the sequence for me.

Okay, i did what Icarus told me and i figured out something that is useful, you see, i made a big mistake and the number combination i was trying out didn't have a numero_recepcion_permiso so the output column is completely blank, however when there is an actual numero_recepcion_permiso it shows correctly, anyway i still need that doesn't output that much amount of repeated rows, how can i fix that? thank y'all for your help so far

First of all, make sure that both values exist in both fields and they actually match or else could generate that amount of repeated rows, however the amount of rows repeated is something i can't tell since i don't know what your actual data is, but that may clear up a Little bit that issue

Related

Bigquery - remove duplicates of certain columns, but not all

I have two tables I am left joining together. The first tables has transnational level detail, causing the key I join to the second table to duplicate. When I left join the second table, the measure "company_spend" is highly inflated.
I need a way to keep only a single value of the duplicated data, and my thought was to run a distinct function on only those columns, but I am not seeing that Bigquery supports distinct functions on only a few columns, but not all.
SELECT UPPER(cwnextt.Current_Contract_Number) AS Current_Contract_Number,
UPPER(cwnextt.Replacement_Contract_Number) AS Replacement_Contract_Number,
UPPER(cwnextt.Current_Contract_Name) AS Current_Contract_Name,
UPPER(cwnextt.Supplier_Top_Parent_Entity_Code) AS Supplier_Top_Parent_Entity_Code,
UPPER(cwnextt.Supplier_Top_Parent_Name) AS Supplier_Top_Parent_Name,
UPPER(cwnextt.company_Entity_Code) AS company_Entity_Code,
UPPER(cwnextt.Facility_Name) AS Facility_Name,
smart.company_Spend AS companySpend
FROM `test_etl_field.contracts_with_member_entity_codes_test_view_2` cwnextt
--this table is what is causing the below table to duplicate,
--but I need all of this data AS well in its current format.
LEFT JOIN `test.trans_analysis` tsa
ON TRIM(UPPER(cwnextt.company_entity_code)) = TRIM(UPPER(tsa.company_entity_code))
AND TRIM(UPPER(cwnextt.Supplier_Top_Parent_Entity_Code)) = TRIM(UPPER(tsa.manufacturer_top_parent_entity_code))
AND TRIM(UPPER(cwnextt.Current_Contract_Name)) = TRIM(UPPER(tsa.contract_category))
AND cwnextt.spend_period_yyyyqmm = tsa.spend_period_yyyyqmm
--this table contains "company_spend" which is now duplicated
LEFT JOIN `test_etl_field.ecr_smart_data` smart
ON smart.company_entity_code = cwnextt.company_entity_code
AND (smart.contract_number = cwnextt.current_contract_number
OR smart.contract_number = cwnextt.replacement_contract_number)
AND smart.month_key = cwnextt.spend_period_yyyyqmm
If something can be created that will keep company_spend from duplicating on the second left join, that is what I am after.
Not sure to understand all the details of your problem but here's a fact from BigQuery doc :
SELECT DISTINCT
A SELECT DISTINCT statement discards duplicate rows
and returns only the remaining rows.
You can't apply DISTINCT on specific columns because it doesn't make sense. Let's say you have 4 columns and call DISTINCT on 3 columns, what is SQL supposed to do with the last one ?
You must tell SQL which value to keep for the remaining column and GROUP BY is the right solution here.
So if you want to:
Remove a column that has been duplicated : Just adjust your SELECT to get only the columns you want
Remove lines that have the same value in specific columns : I would suggest a GROUP BY on the targeted column and taking the aggregation you want (first, avg, sum or whatever) for the remaining ones.
Remove the value from a row if another row has the same : You may not want to do that. A row has to keep its value and you won't get it back. Besides, same problem, which row do you want to keep ?
Hope this helps ! Feel free to give clarification on your problem if you want more specific answers.
While I couldn't resolve this issue in SQL, I used Tableau via a FIXED LOD to aggregate the data passed duplicates so the end user could visualize the output with accuracy. Not ideal, but the SQL route wasn't make sense.

SQL 2 JOINS USING SINGLE REFERENCE TABLE

I'm trying to achieve 2 joins. If I run the 1st join alone it pulls 4 lots of results, which is correct. However when I add the 2nd join which queries the same reference table using the results from the select statement it pulls in additional results. Please see attached. The squared section should not be being returned
So I removed the 2nd join to try and explain better. See pic2. I'm trying to get another column which looks up InvolvedInternalID against the initial reference table IRIS.Practice.idvClient.
Your database is simply doing as you tell it. When you add in the second join (confusingly aliased as tb1 in a 3 table query) the database is finding matching rows that obey the predicate/truth statement in the ON part of the join
If you don't want those rows in there then one of two things must be the case:
1) The truth you specified in the ON clause is faulty; for example saying SELECT * FROM person INNER JOIN shoes ON person.age = shoes.size is faulty - two people with age 13 and two shoes with size 13 will produce 4 results, and shoe size has nothing to do with age anyway
2) There were rows in the table joined in that didn't apply to the results you were looking for, but you forgot to filter them out by putting some WHERE (or additional restriction in the ON) clause. Example, a table holds all historical data as well as current, and the current record is the one with a NULL in the DeletedOn column. If you forget to say WHERE deletedon IS NULL then your data will multiply as all the past rows that don't apply to your query are brought in
Don't alias tables with tbX, tbY etc.. Make the names meaningful! Not only do aliases like tbX have no relation to the original table name (so you encounter tbX, and then have to go searching the rest of the query to find where it's declared so you can say "ah, it's the addresses table") but in this case you join idvclient in twice, but give them unhelpful aliases like tb1, tb3 when really you should have aliased them with something that describes the relationship between them and the rest of the query tables
For example, ParentClient and SubClient or OriginatingClient/HandlingClient would be better names, if these tables are in some relationship with each other.
Whatever the purpose of joining this table in twice is, alias it in relation to the purpose. It may make what you've done wriong easier to spot, for example "oh, of course.. i'm missing a WHERE parentclient.type = 'parent'" (or WHERE handlingclient.handlingdate is not null etc..)
The first step to wisdom is by calling things their proper names

SQL to identify duplicate columns from table having hundreds of column

I've 250+ columns in customer table. As per my process, there should be only one row per customer however I've found few customers who are having more than one entry in the table
After running distinct on entire table for that customer it still returns two rows for me. I suspect one of column may be suffixed with space / junk from source tables resulting two rows of same information.
select distinct * from ( select * from customer_table where custoemr = '123' ) a;
Above query returns two rows. If you see with naked eye to results there is not difference in any of column.
I can identify which column is causing duplicates if I run query every time for each column with distinct but thinking that would be very manual task for 250+ columns.
This sounds like very dumb question but kind of stuck here. Please suggest if you have any better way to identify this, thank you.
Solving this one-time issue with sql is too much effort. Simply copy-paste to excel, transpose data into columns and use some simple function like "if a==b then 1 else 0".

Joining multiple Tables in Oracle gives out duplicated records

I am a newbie to sql. I have three tables mr1,mr2,mr3. Caseid is the primary keys in all these tables. I need to join all these table columns and display result.
Problem is that i dont know which join to use.
when i joined all these just like below query:
select mr1.col1,mr1.col2,mr2.col1,mr2.col2,mr3.col1,mr3.col2
from mr1,mr2,mr3
where mr1.caseid = mr2.caseid
and mr2.caseid = mr3.caseid;
it displays 4 records, eventhough the maximum number of records is two, which is in table mr2.
records are duplicated, can anyone help me in this regard?
Distinct will do it but it's not the correct approch.
You need to add another join (mr1.caseid = mr3.caseid) because mr2 and mr3 rows must be related to the same row in mr1, otherwise you end up with 2 pairs, onde for each tabled joined to your primary table (mr2).
First answer in SO, so forgive me if it wasn't that clear.
Your problem is that your tables are in a one-to many relationship. When you join them, it is expected that the number of rows will go up unless you take steps to limit the records returned. How to fix depends on the meaning of the data.
If all the fields are exactly the same, then adding DISTINCT will fix the problem. However, it may be faster, depending on the size of the tables and the number of records you are returning, to use a derived table to limit the records in the join to only one from the table with multiple records.
If at least one of the fields is different however, then you need to know the business rule that will allow you to pick the correct record. It might be accomplished by adding a where clause or by using an aggregate function and group by or even both. This really depends on the meaning of the result set which is why you need to ask further question in your own organization as they are the only ones who will know which of the multiple records is the correct one to pick from the perspectives of the people who will be using the results of the query. Further, the business might actually want to see all of the records and you have no problem at all.

Linking a table to two columns in a second table

I have an issue where I think my major problem is figuring out how to phrase it to get an acceptable answer from Google.
The situation:
Table A is 'Invoice's it has a column that links to Table B 'Jobs' in two places. It either links to our 'Job Number' column or the 'Client Number' column. The major issue is the fact that 'Client Number' and 'Job Number' can be the same number if we set the job up instead of the client setting the job up.
What I'm getting is that every time there is the same number in either column the results are duplicated.
Now this is extremely simplifying the situation to try and make it a bit more understandable, but I am essentially looking for a statement that looks at Table A gets the value then compares against Column B1 if that doesn't match then compares it against B2 if that doesn't match then excludes it from the results. The key would be that if it matches when it compares against B1 it doesn't go on to compare it against B2.
Any help with this would be greatly appreciated, even if it is just a point in the direction of the very obvious operator or function that does this. It's hitting the end of a very long day.
Thank you.
Edit:
A further description:
Invoice Table
---------------------------------
PK, INVOICE_NUMBER, LINK_TO_JOB
Job Table
---------------------------------
PK, JOB_NUMBER, CLIENT_JOB_NUMBER
Now the crux of the matter is that both PK are database generated sequential numbers, no overlap there. The invoice number and the job number are both application generated sequential numbers with no overlap the link to job is application generated and when an invoice is raised links to one of two fields in the jobs table based on rules. For simplicity lets say those rules are if there is a Client Job Number link to that if not link to the job number.
Now the Client job number is a field that is written into buy people, lots of mistakes can and do happen, but lots of crap gets put in this field as well. Stuff like 'Email' 'Fax' are very common answers. So when there is crap in there like 'Email' it links to a series of other fields holding the same 'Email' tag.
So that's problem one.
Problem two Where Statement:
SELECT INVOICE_NUMBER,
LINK_TO_JOB
JOB_NUMBER,
CLINET_JOB_NUMBER
FROM JOBS_TABLE a,
INVOICE_TABLE b
How do I set up the where statement to get the desire result, I've tried:
WHERE (LINK_TO_JOB = JOB_NUMBER OR LINK_TO_JOB = CLIENT_JOB_NUMBER)
This returns lots of multiples, such as when the job number and client job number are identical and when there are multiple identical written in answers 'email' etc. Now this might be unavoidable and I will end up using a Distinct with this where statement to do the best I can with what I have. However what I want to do is:
WHERE (LINK_TO_JOB = JOB_NUMBER (+) OR LINK_TO_JOB = CLIENT_JOB_NUMBER (+))
Which comes back with an error as you can use outer joins with an OR operator.
If nothing comes from this I might just have to go with the OR connection and then throw in the Select Distinct and then build redundancy into Invoicing process so that when the database misses links a manual process catches them.
Although I'm all ears for any ideas.
One way of doing this would be to use a set operation. UNION will give you a distinct set of values. You haven't given much detail so I'm guessing at the specifics: you'll need to amend them for your needs.
with j as ( select * from jobs )
select j.*, inv.*
from invoices inv
join j on ( inv.job_no = j.job_no)
union
select j.*, inv.*
from invoices inv
join j on ( inv.job_no = j.client_no)
The underlying reason for your difficulties is that the data model is half-cooked. In a proper design INVOICES.JOB_NO would have a foreign key relationship referencing JOBS.JOB_NO. Whereas JOBS.CLIENT_NO would be an additional piece of information, a business key, but would not be referenced by INVOICES. Of course it can be displayed on an actual invoice, that's why Nature gave us joins.
Use SELECT DISTINCT to remove the duplicates from your results set.
OK, well group effort here. I used the union join like suggested by APC. and modified to fit my data and all of it's eccentricities (read the French couldn't data model there way out of a paper bag) And then I surrounded everything in a distinct statement suggested by user1871207 and Hikaru-Shindo.
But negative marks go to me, the reason my question was so unclear was several fold, but the big piece of information that was difficult for me to grasp / explain was that Invoices are not always for jobs, coupled with the fact that Invoices can be consolidated (which just went and screwed everything up) and This is just a big mess that I've with your help managed to put a very small piece of two year old scotch tape on.
My only hope for a continued career here is to use the exceptions that come up (and they will come at me like a spider monkey!) to hopefully amend the entire invoice process so that we can report some basic profit and loss numbers.
Cheers for all your help.