SQL: extract IDs based on two columns in common


I need to figure out how to accomplish a task I was given. I imported an Excel file, cleaned the data, and started using it to join against the tables I need. Once I started, I realized the join has to be precise, so I need the ID of each row I'm using. The IDs are not in the Excel file (they are stored in the database, and the file was built by people who don't work with databases). A workmate told me to do an inner join on the columns the tables have in common, but my attempt raised an error and logically didn't work. So I thought that extracting the IDs from the table where they are stored might be a good idea (maybe not), but I don't know how to do it or whether it will work. Here are some examples of what the tables look like:
table 1
----------------------
|ID|column_a|column_b|
|1 |2234    |3       |
|2 |41245   |23      |
|3 |442     |434     |
|4 |1243    |1       |
----------------------
table 2
---------------------------------
|creation_date|column_a|column_b|
|1/12/2018    |2234    |3       |
|4/31/2011    |41245   |23      |
|7/22/2014    |442     |434     |
|10/14/2017   |1243    |1       |
---------------------------------
As you can see, the values in columns a and b match perfectly, so they could serve as a bridge between the two tables. I tried joining on column_a alone, but that did not work: the output was much larger than it should be. I also tried a simple query with an IN condition, but that did not work either, since it returned nearly the entire database, duplicated. (I'm working with big tables: table 1 contains nearly 35,000 rows and table 2 nearly 10,000.) Treating the ids as if they were row numbers won't work, since they are very different from the ids in the actual table I'm working with. What do you think would be the best way to achieve this? I would be grateful for any kind of help. Thanks in advance.
EDIT
Based on R3_'s answer, I tried his query adapted to my needs. It worked in some cases, but in others I got a Cartesian product. In the example I'm using, table 2 has 1000 in column_a and 1 in column_b, and table 1 has 10 ids for that combination (technically the 1000-1 rows are the same, but they store different information and are usually told apart by the ID). The output is therefore either 10 rows (assuming it only picks the rows that have an id) or 450, not the 45 I need. The query I'm using looks like this:
SELECT DISTINCT table_1.id, table_2.column_a, table_2.column_b --if i pick the columns from table 1 returns 10 rows if i pick them from table 2 it returns 450
FROM table_2
INNER JOIN table_1 ON table_2.column_a = table_1.column_a AND table_1.column_b = table_2.column_b
WHERE table_2.column_a = 1022 AND table_2.column_b = 1
So the problem is the 10 ids that share the 1000-1 combination: SQL doesn't know which id each row should get. How can I obtain the 45 rows I need?
I also found that when I run the general query, some rows are missing. Here is the query:
SELECT table_1.id, table_1.column_a, table_1.column_b
FROM table_2 --in this case i try switching the columns i return from table 1 or 2
INNER JOIN table_1 ON table_2.column_a = table_1.column_a AND table_2.column_b = table_1.column_b
The output of the latter query is 2666 rows and it should be 2733. What am I doing wrong?

SELECT DISTINCT -- Adding DISTINCT clause for unique pairs of ID and creation_date
ID, tab1.column_a, tab1.column_b, creation_date
FROM [table 1] as tab1
LEFT JOIN [table 2] as tab2 -- OR INNER JOIN
ON tab1.column_a = tab2.column_a
AND tab1.column_b = tab2.column_b
-- WHERE ID IN ('01', '02') -- Filtering by desired ID
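
Before joining, it may help to check whether the (column_a, column_b) pair is actually unique in each table, since every duplicated pair multiplies the joined rows. Below is a minimal sketch using Python's built-in sqlite3 module. The table names follow the question, but the rows are made up for illustration, and duplicated_pairs is a hypothetical helper, not part of any library.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_1 (id INTEGER PRIMARY KEY, column_a INTEGER, column_b INTEGER);
    CREATE TABLE table_2 (creation_date TEXT, column_a INTEGER, column_b INTEGER);
    -- Hypothetical data: the pair (1000, 1) appears twice in table_1
    -- and three times in table_2.
    INSERT INTO table_1 VALUES (1, 2234, 3), (2, 1000, 1), (3, 1000, 1);
    INSERT INTO table_2 VALUES ('2018-01-12', 2234, 3),
                               ('2017-10-14', 1000, 1),
                               ('2017-10-15', 1000, 1),
                               ('2017-10-16', 1000, 1);
""")

def duplicated_pairs(table):
    """Return (column_a, column_b, count) for pairs that occur more than once."""
    # The table name is interpolated only because it comes from this script,
    # never from user input.
    sql = f"""
        SELECT column_a, column_b, COUNT(*) AS n
        FROM {table}
        GROUP BY column_a, column_b
        HAVING COUNT(*) > 1
    """
    return conn.execute(sql).fetchall()

dups_1 = duplicated_pairs("table_1")
dups_2 = duplicated_pairs("table_2")

# Join on both columns, as in the answer above.
joined_rows = conn.execute("""
    SELECT t1.id, t2.creation_date
    FROM table_2 AS t2
    INNER JOIN table_1 AS t1
        ON t1.column_a = t2.column_a AND t1.column_b = t2.column_b
""").fetchall()

print(dups_1)            # pairs duplicated in table_1
print(dups_2)            # pairs duplicated in table_2
print(len(joined_rows))  # 1 row for (2234, 3) plus 2 * 3 rows for (1000, 1)
```

If both tables report the same pair as duplicated, the two columns alone cannot decide which id belongs to which row, and some additional column is needed to tell the rows apart.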

Related

Is there a way to validate the cardinality of SQL joins?

When using SQL (snowflake) I often join tables.
I can never be 100% sure that the join is a one-to-one, one-to-many, many-to-many, etc...
In Python pandas there is a setting in merge statements that will assert that the join is of the expected kind.
Is there an equivalent in SQL?
EDIT: This is the pandas API (which I like)
table1 = pd.DataFrame({
'userId': [1,2,3],
'age':[20,30,40]
})
#| userId | age |
#|---------:|------:|
#| 1 | 20 |
#| 2 | 30 |
#| 3 | 40 |
table2 = pd.DataFrame({
'userId': [1,2,2,3],
'gender':['M','M','F', 'F'],
'gender_valid_to_date': [None, '2020-01-01', None, None]
})
#| userId | gender | gender_valid_to_date |
#|---------:|:---------|:-----------------------|
#| 1 | M | |
#| 2 | M | 2020-01-01 |
#| 2 | F | |
#| 3 | F | |
pd.merge(table1, table2, on='userId', how='left', validate='one_to_one')
# This raises a merge error
MergeError: Merge keys are not unique in right dataset; not a one-to-one merge
You can imagine it's easy to naively go for a LEFT JOIN here assuming it'd just add the gender column, but you know what they say about assuming...
In this case the "right" solution to get a table and the user's current gender is
SELECT
userId,
age,
gender AS current_gender
FROM table1 t1
LEFT JOIN (
SELECT userId, gender FROM table2 WHERE gender_valid_to_date is Null
) t2 ON t2.userId = t1.userId
However, you can see how this can get complicated to "check" constantly.

Try the ERROR_ON_NONDETERMINISTIC_MERGE parameter, which makes a MERGE raise an error when the join matches a target row against more than one source row, i.e. when the join produces duplicates. Maybe this could help you.
CREATE TABLE T1(USERID NUMBER, NAME STRING);
CREATE TABLE T2(USERID NUMBER, NAME STRING);
INSERT INTO
T1 VALUES(1, 'E'),(1, 'D'),(2, 'C');
INSERT INTO
T2 VALUES(1, 'A'),(1, 'B'),(2, 'C');
MERGE INTO T2 USING (
Select
T1.USERID,
T1.NAME
FROM
T1
) AS T1 ON T1.USERID = T2.USERID(+)
WHEN MATCHED THEN UPDATE SET
T2.USERID = T1.USERID,
T2.NAME = T1.NAME;

The immediate answer to this question is that Snowflake itself does not provide this feature built-in, although IMO it would be an interesting and useful new feature (I might even call it an "innovation") in data warehouses.
In order to make this job easier, save all non-trivial query results to a view or temporary table.
The first solution is to craft ad-hoc validity checks to be run on the resulting view or table.
You can do this in SQL directly by putting each check into a field or row of a table and visually ensuring that they are all TRUE.
In the example you gave, if you can save the results to joined_1_2, then you can run a query like the following:
SELECT
NOT EXISTS (
SELECT
user_id, count(*) as n
FROM joined_1_2
GROUP BY user_id
HAVING n > 1
) AS no_duplicates
;
Depending on how fancy and/or automated you want things to be, you can write stored procedures, Python scripts, etc. to run such queries and provide various outputs based on their results. The Python option could be particularly interesting, as you can use native Python assert statements or even a test framework like Pytest.
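
As a rough illustration of the Python option, here is a sketch of an assert-based uniqueness check. It uses the standard-library sqlite3 module as a stand-in for a warehouse connection, and assert_unique is a hypothetical helper name. The tables mirror the question's example.

```python
import sqlite3

def assert_unique(conn, table, key):
    """Raise AssertionError if `key` has duplicate values in `table`."""
    # Identifiers are interpolated because they are fixed in this script.
    sql = f"SELECT {key}, COUNT(*) FROM {table} GROUP BY {key} HAVING COUNT(*) > 1"
    duplicates = conn.execute(sql).fetchall()
    assert not duplicates, f"{table}.{key} is not unique: {duplicates}"

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (userId INTEGER, age INTEGER);
    CREATE TABLE table2 (userId INTEGER, gender TEXT, gender_valid_to_date TEXT);
    INSERT INTO table1 VALUES (1, 20), (2, 30), (3, 40);
    INSERT INTO table2 VALUES (1, 'M', NULL), (2, 'M', '2020-01-01'),
                              (2, 'F', NULL), (3, 'F', NULL);
    -- The naive LEFT JOIN, saved so that it can be checked afterwards.
    CREATE TEMP TABLE joined_1_2 AS
        SELECT t1.userId, t1.age, t2.gender
        FROM table1 t1 LEFT JOIN table2 t2 ON t2.userId = t1.userId;
""")

try:
    assert_unique(conn, "joined_1_2", "userId")
    check_passed = True
except AssertionError as exc:
    check_passed = False
    print(exc)  # reports that userId 2 appears more than once
```

The same function could be called from a Pytest test, so that a join that stops being one-to-one fails the test run.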
If these joins are being automated in some kind of recurring ETL/ELT pipeline, you might also want to consider using a tool like Great Expectations to implement data quality and correctness checks.
Another strategy is to impose constraints on the temporary tables that you create.
For example, you can impose a PRIMARY KEY constraint on the table that is intended to hold the results of a 1:1 join. Then if the data being inserted into that table results in duplicated primary keys, then you know that there is a problem in the source tables.
I think this technique is the most direct analogy of setting validate= in Pandas, but it requires a bit more setup and forethought. Unfortunately, I don't think Snowflake supports combining constraints (e.g. foreign key) with create-table-as-select syntax, so you might need to run a separate CREATE TABLE command for every query output, which can be very annoying if you have a lot of columns and you only want to check a few of them.
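
A minimal sketch of the constraint idea, assuming an engine that actually enforces primary keys. SQLite, used here through Python's sqlite3 module, does. As far as I know, Snowflake treats PRIMARY KEY on standard tables as informational and does not enforce it, so check your engine's documentation before relying on this. The table layout mirrors the question's example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (userId INTEGER, age INTEGER);
    CREATE TABLE table2 (userId INTEGER, gender TEXT);
    INSERT INTO table1 VALUES (1, 20), (2, 30), (3, 40);
    INSERT INTO table2 VALUES (1, 'M'), (2, 'M'), (2, 'F'), (3, 'F');
    -- Result table declared up front so the key constraint exists before loading.
    CREATE TABLE joined_1_2 (userId INTEGER PRIMARY KEY, age INTEGER, gender TEXT);
""")

try:
    conn.execute("""
        INSERT INTO joined_1_2
        SELECT t1.userId, t1.age, t2.gender
        FROM table1 t1 LEFT JOIN table2 t2 ON t2.userId = t1.userId
    """)
    join_is_one_to_one = True
except sqlite3.IntegrityError as exc:
    # userId 2 matches two rows in table2, so the second insert violates the key.
    join_is_one_to_one = False
    print("join is not one-to-one:", exc)
```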

No, there isn’t (not that I’m aware of). A SQL join is normally 1 to many, is less commonly 1 to 1 and there is no such thing as a many to many join (hence the need for intersection tables to model many to many relationships).
Is there a specific reason why you need to know the join type or why you don’t know it anyway - I’m not sure how you would write almost any SQL query if you didn’t already know the data model?

Do a specific query for each row of a table, one by one

Let's say I have a table:
| key1 | key2 | value |
+------+------+-------+
| 1 | 1 | 1337 |
| 1 | 2 | 6545 |
| 2 | 1 | 213 |
| 3 | 1 | 131 |
What I would like to do is traverse this table row by row, then use the two key values in further queries (all other tables contain the unique combination of these two keys + other data)
How do I do this kind of thing in SQL?
EDIT: I would want to extract key1, key2 from row 1 (1,1) then do a query on it, which would result in a number.
Then I would move to the second row, an identical query which would again result in a number.
All of these numbers would be then inserted into a pre-prepared view.
EDIT2: I need to traverse it because of the specific use of my database.
It is a database of planets which contains sectors (the keys are the IDs of these two). All of these sectors contain resources, turrets and walls.
The table I have in my post is an example of table of sectors, with the value being enemy force.
Table of resources, turrets etc. contain these two keys so they are linked to and only to a specific sector.
I need to go row by row so I can use these keys to select only specific resources/turrets/walls from my tables, aggregate them and then subtract them from the value in my sector table. The resulting number would then be inserted into a pre-prepared view (again, into the row which matches the combination of my two keys)

This sounds like a correlated subquery or lateral join. You don't have that much explanation, but something like this:
select t1.*, t2.*
from table1 t1 cross join lateral
(select . . .
from table2 t2 . . .
where t2.key1 = t1.key1 and t2.key2 = t1.key2
) t2
You are not clear on what the second query looks like. The where clause is called a correlation clause. It connects the subquery to the outer query. A correlation clause is not strictly needed for this to work.
The columns from the outer query can be used elsewhere in the subquery. I am just assuming that an equality condition connects the two (lacking other information).
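
To make the correlated-subquery variant concrete, here is a small sketch run through Python's sqlite3 module (SQLite has no LATERAL keyword, so a scalar correlated subquery is used instead). The sectors rows come from the question. The turrets table and its power column are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sectors (key1 INTEGER, key2 INTEGER, value INTEGER);
    -- Hypothetical child table linked to a sector by the same two keys.
    CREATE TABLE turrets (key1 INTEGER, key2 INTEGER, power INTEGER);
    INSERT INTO sectors VALUES (1, 1, 1337), (1, 2, 6545), (2, 1, 213), (3, 1, 131);
    INSERT INTO turrets VALUES (1, 1, 300), (1, 1, 37), (2, 1, 13);
""")

# For every sector row, the subquery aggregates only the turrets that share
# its key pair. COALESCE covers sectors that have no turrets at all.
rows = conn.execute("""
    SELECT s.key1, s.key2,
           s.value - COALESCE((SELECT SUM(t.power)
                               FROM turrets t
                               WHERE t.key1 = s.key1 AND t.key2 = s.key2), 0) AS remaining
    FROM sectors s
    ORDER BY s.key1, s.key2
""").fetchall()

for row in rows:
    print(row)
```

No explicit loop over the sector rows is needed: the database evaluates the subquery once per outer row, which is the set-based equivalent of going row by row.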

Including columns from duplicate rows in select

I am working on creating a query which would return one row for each primary key. The question is that in the database there is another table I am trying to join with, and in this other table the primary key from the first table can appear multiple times but with a code which describes what type of information is stored in a column called text_info which stores text related to what the code represents.
For example:
PrimaryKey|Code |text_info
--------------------------------
5555 |1 |1/4/2017
5555 |2 |Approved
What I am trying to get to is a select statement that would return something like this.
PrimaryKey|Date |Status
----------------------------------
5555 |1/4/2017 |Approved
I have been trying to join the two tables in various ways but my attempts have always returned multiple rows which I do not want for this query. Any help would be greatly appreciated for this.

I think a simple conditional aggregation would do the trick here. If you have a large and/or variable number of codes, you may have to go DYNAMIC.
Select PrimaryKey
,Date = max(case when code=1 then text_info else null end)
,Status = max(case when code=2 then text_info else null end)
From YourTable
Group By PrimaryKey
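
The same conditional aggregation can be tried out with Python's sqlite3 module. This is only a sketch: it uses the portable AS alias form instead of the alias = expression form above, the output column names date_text and status are my own, and the second key (6666) is made-up data to show what happens when a code is missing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YourTable (PrimaryKey INTEGER, Code INTEGER, text_info TEXT);
    INSERT INTO YourTable VALUES (5555, 1, '1/4/2017'),
                                 (5555, 2, 'Approved'),
                                 (6666, 1, '2/5/2017');
""")

# Each CASE keeps text_info only for its own code. MAX then collapses the
# group to the single non-NULL value (or NULL when the code is absent).
pivoted = conn.execute("""
    SELECT PrimaryKey,
           MAX(CASE WHEN Code = 1 THEN text_info END) AS date_text,
           MAX(CASE WHEN Code = 2 THEN text_info END) AS status
    FROM YourTable
    GROUP BY PrimaryKey
    ORDER BY PrimaryKey
""").fetchall()

print(pivoted)
```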

Basic SQL: How to use properties of one table as a lookup on other tables

Suppose I have three Tables:
Table A:
Column A | Column B
Z Q
Q Z
Table Z:
Column A | Column B
100 50
Table Q:
Column A | Column B
200 75
What I'm looking to do is produce a result like the following using Table A as a sort of guide:
DESIRED RESULT:
#Temp Table
Column A | Column B
100 75
200 50
I was hoping to be able to perform something to this effect in SQL as a Stored Procedure but I'm having trouble getting the results I want. Could use some help.

If I'm understanding your requirements, here is one way to do it, but it won't work if you have multiple values in your z and q tables unless you have something to join on:
select
case when a.cola = 'Z' then z.cola else q.cola end cola,
case when a.colb = 'Z' then z.colb else q.colb end colb
from tablea a, tableq q, tablez z

It's a bit confusing in the contrived example to use the same column name on different tables, but I believe what you're trying to describe is a classic many-to-many relationship.
If so, your Table A would be your cross-reference table.
Unfortunately for this to work you need to have related foreign key values stored in your cross-reference table, not the actual values as you propose.
This is probably what you're after:
Table A
QA |ZA
1 |2
3 |4
Table Q
QA |QB
1 |100
3 |200
Table Z
ZA |ZB
2 |300
4 |400
SELECT Q.QB, Z.ZB
FROM Q INNER JOIN A ON Q.QA = A.QA
INNER JOIN Z ON Z.ZA = A.ZA
...results in:
QB |ZB
100|300
200|400
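
For anyone who wants to run the cross-reference join, here is a sketch using Python's sqlite3 module with exactly the three tables and values listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (QA INTEGER, ZA INTEGER);  -- cross-reference table
    CREATE TABLE Q (QA INTEGER, QB INTEGER);
    CREATE TABLE Z (ZA INTEGER, ZB INTEGER);
    INSERT INTO A VALUES (1, 2), (3, 4);
    INSERT INTO Q VALUES (1, 100), (3, 200);
    INSERT INTO Z VALUES (2, 300), (4, 400);
""")

# Each row of A names one row in Q and one row in Z, and the two joins fetch them.
pairs = conn.execute("""
    SELECT Q.QB, Z.ZB
    FROM Q
    INNER JOIN A ON Q.QA = A.QA
    INNER JOIN Z ON Z.ZA = A.ZA
    ORDER BY Q.QB
""").fetchall()

print(pairs)
```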

I DISTINCTly hate MySQL (help building a query)

This is straightforward, I believe:
I have a table with 30,000 rows. When I SELECT DISTINCT 'location' FROM myTable it returns 21,000 rows, about what I'd expect, but it only returns that one column.
What I want is to move those to a new table, but the whole row for each match.
My best guess is something like SELECT * from (SELECT DISTINCT 'location' FROM myTable) or something like that, but it says I have a vague syntax error.
Is there a good way to grab the rest of each DISTINCT row and move it to a new table all in one go?

SELECT * FROM myTable GROUP BY `location`
or if you want to move to another table
CREATE TABLE foo AS SELECT * FROM myTable GROUP BY `location`

Distinct means for the entire row returned. So you can simply use
SELECT DISTINCT * FROM myTable GROUP BY `location`
Using Distinct on a single column doesn't make a lot of sense. Let's say I have the following simple set
-id- -location-
1 store
2 store
3 home
if there were some sort of query that returned all columns, but just distinct on location, which row would be returned? 1 or 2? Should it just pick one at random? Because of this, DISTINCT works for all columns in the result set returned.

Well, first you need to decide what you really want returned.
The problem is that, presumably, for some of the location values in your table there are different values in the other columns even when the location value is the same:
Location OtherCol StillOtherCol
Place1 1 Fred
Place1 89 Fred
Place1 1 Joe
In that case, which of the three rows do you want to select? When you talk about a DISTINCT Location, you're condensing those three rows of different data into a single row, there's no meaning to moving the original rows from the original table into a new table since those original rows no longer exist in your DISTINCT result set. (If all the other columns are always the same for a given Location, your problem is easier: Just SELECT DISTINCT * FROM YourTable).
If you don't care which values come from the other columns you can use a (bad, IMHO) MySQL extension to SQL and do:
SELECT * FROM YourTable GROUP BY Location
which will give a result set with one row per location and values for the other columns derived from the original data in an undefined fashion.

Multiple rows with identical values in all columns don't make any sense. OK - the question might be a way to correct exactly that situation.
Considering this table, with id being the PK:
kram=# select * from foba;
id | no | name
----+----+---------------
2 | 1 | a
3 | 1 | b
4 | 2 | c
5 | 2 | a,b,c,d,e,f,g
you may extract a sample for every single no (:=location) by grouping over that column, and selecting the row with minimum PK (for example):
SELECT * FROM foba WHERE id IN (SELECT min (id) FROM foba GROUP BY no);
id | no | name
----+----+------
2 | 1 | a
4 | 2 | c
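
Applied to the original question, the same pattern can fill the new table in one statement. The sketch below uses Python's sqlite3 module. The id primary key and the other column are assumptions, since the question does not show the full layout of myTable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed layout: an id primary key plus location and one other column.
    CREATE TABLE myTable (id INTEGER PRIMARY KEY, location TEXT, other TEXT);
    INSERT INTO myTable VALUES (1, 'store', 'a'), (2, 'store', 'b'), (3, 'home', 'c');
    -- Keep the whole row with the smallest id for each location.
    CREATE TABLE foo AS
        SELECT * FROM myTable
        WHERE id IN (SELECT MIN(id) FROM myTable GROUP BY location);
""")

kept = conn.execute("SELECT id, location, other FROM foo ORDER BY id").fetchall()
print(kept)
```

Unlike GROUP BY with SELECT *, this makes the choice of surviving row explicit: it is always the one with the lowest id.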
