When using SQL (Snowflake) I often join tables.
I can never be 100% sure whether the join is one-to-one, one-to-many, many-to-many, etc.
In Python pandas, this is a setting (validate=) in merge statements that asserts the join is of the expected kind.
Is there an equivalent in SQL?
EDIT: This is the pandas API (which I like)
import pandas as pd

table1 = pd.DataFrame({
'userId': [1,2,3],
'age':[20,30,40]
})
#| userId | age |
#|---------:|------:|
#| 1 | 20 |
#| 2 | 30 |
#| 3 | 40 |
table2 = pd.DataFrame({
'userId': [1,2,2,3],
'gender':['M','M','F', 'F'],
'gender_valid_to_date': [None, '2020-01-01', None, None]
})
#| userId | gender | gender_valid_to_date |
#|---------:|:---------|:-----------------------|
#| 1 | M | |
#| 2 | M | 2020-01-01 |
#| 2 | F | |
#| 3 | F | |
pd.merge(table1, table2, on='userId', how='left', validate='one_to_one')
# This raises a merge error
MergeError: Merge keys are not unique in right dataset; not a one-to-one merge
You can imagine it's easy to naively go for a LEFT JOIN here assuming it'd just add the gender column, but you know what they say about assuming...
In this case, the "right" solution to get a table with each user's current gender is:
SELECT
    t1.userId,
    t1.age,
    t2.gender AS current_gender
FROM table1 t1
LEFT JOIN (
    SELECT userId, gender FROM table2 WHERE gender_valid_to_date IS NULL
) t2 ON t2.userId = t1.userId
However, you can see how this can get complicated to "check" constantly.
Give ERROR_ON_NONDETERMINISTIC_MERGE a try: it raises an error when duplicates coming from the join would make a MERGE nondeterministic. Maybe this could help you.
CREATE TABLE T1(USERID NUMBER, NAME STRING);
CREATE TABLE T2(USERID NUMBER, NAME STRING);

INSERT INTO T1 VALUES (1, 'E'), (1, 'D'), (2, 'C');
INSERT INTO T2 VALUES (1, 'A'), (1, 'B'), (2, 'C');

MERGE INTO T2 USING (
    SELECT
        T1.USERID,
        T1.NAME
    FROM T1
) AS T1 ON T1.USERID = T2.USERID(+)
WHEN MATCHED THEN UPDATE SET
    T2.USERID = T1.USERID,
    T2.NAME = T1.NAME;
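For reference, ERROR_ON_NONDETERMINISTIC_MERGE is a session parameter (its default in Snowflake is TRUE), so a minimal sketch of enabling it explicitly looks like this:

-- Make nondeterministic merges raise an error instead of
-- silently applying one of the duplicate source rows
ALTER SESSION SET ERROR_ON_NONDETERMINISTIC_MERGE = TRUE;

With the example tables above, the MERGE should then fail, because USERID 1 matches two source rows.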
The immediate answer to this question is that Snowflake itself does not provide this feature built-in, although IMO it would be an interesting and useful new feature (I might even call it an "innovation") in data warehouses.
In order to make this job easier, save all non-trivial query results to a view or temporary table.
The first solution is to craft ad-hoc validity checks to be run on the resulting view or table.
You can do this in SQL directly by putting each check into a field or row of a table and visually ensuring that they are all TRUE.
In the example you gave, if you save the results to joined_1_2, then you can run a query like the following:
SELECT
    COUNT(*) = 0 AS no_duplicates
FROM (
    SELECT userId
    FROM joined_1_2
    GROUP BY userId
    HAVING COUNT(*) > 1
)
;
Depending on how fancy and/or automated you want things to be, you can write stored procedures, Python scripts, etc. to run such queries and provide various outputs based on their results. The Python option could be particularly interesting, as you can use native Python assert statements or even a test framework like Pytest.
If these joins are being automated in some kind of recurring ETL/ELT pipeline, you might also want to consider using a tool like Great Expectations to implement data quality and correctness checks.
Another strategy is to impose constraints on the temporary tables that you create.
For example, you can impose a PRIMARY KEY constraint on the table that is intended to hold the results of a 1:1 join. If the data being inserted into that table results in duplicated primary keys, you know that there is a problem in the source tables. (Be aware, though, that Snowflake does not enforce PRIMARY KEY or UNIQUE constraints on standard tables, so this particular check only fires in databases that do enforce them.)
I think this technique is the most direct analogy of setting validate= in Pandas, but it requires a bit more setup and forethought. Unfortunately, I don't think Snowflake supports combining constraints (e.g. foreign key) with create-table-as-select syntax, so you might need to run a separate CREATE TABLE command for every query output, which can be very annoying if you have a lot of columns and you only want to check a few of them.
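A minimal sketch of the constraint approach, with column names taken from the question (again assuming a database that actually enforces PRIMARY KEY):

CREATE TABLE joined_1_2 (
    userId         NUMBER PRIMARY KEY,  -- duplicate keys are rejected here
    age            NUMBER,
    current_gender STRING
);

INSERT INTO joined_1_2
SELECT t1.userId, t1.age, t2.gender
FROM table1 t1
LEFT JOIN table2 t2 ON t2.userId = t1.userId;  -- errors out if the join fans out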
No, there isn’t (not that I’m aware of). A SQL join is normally one-to-many, is less commonly one-to-one, and there is no such thing as a many-to-many join (hence the need for intersection tables to model many-to-many relationships).
Is there a specific reason why you need to know the join type, or why you don’t already know it? I’m not sure how you would write almost any SQL query without already knowing the data model.
Related
I need to figure out how to accomplish a task that was given to me. I imported an Excel file, cleaned up the information, and used it to start joining the tables I need. When I started, I realized I needed to be very precise, so I need the ID of the data I'm using, which doesn't come in the imported Excel document (the IDs are stored in the database, and the Excel file was built by other people who don't handle databases). I asked a workmate how to do this task, and he told me to do an inner join on the columns in common, but the way I did it produced an error and logically didn't work. I therefore thought extracting the IDs from the table where they are stored would be a good idea (well, maybe not), but I don't know how to do it, nor whether it will work. I'll give you some examples of what the tables look like:
table 1
----------------------
|ID|column_a|column_b|
|1 |2234 |3 |
|2 |41245 |23 |
|3 |442 |434 |
|4 |1243 |1 |
----------------------
table 2
---------------------------------
|creation_date|column_a|column_b|
|1/12/2018 |2234 |3 |
|4/31/2011 |41245 |23 |
|7/22/2014 |442 |434 |
|10/14/2017 |1243 |1 |
---------------------------------
As you can see, the values of columns a and b match perfectly, so there could be a bridge between the two tables. I tried to join the data on column a, but that did not work, since the output was much larger than it should be. I also tried a simple query with an IN clause, but that did not work either, since it brought back nearly the entire database duplicated (I'm working with big databases: table 1 contains nearly 35,000 rows and table 2 nearly 10,000). Extracting the IDs as if they were raw files won't work either, since they are very different from the IDs in the actual table I'm working with. So what do you think would be the best way to achieve this task? Any kind of help would be appreciated; thanks in advance.
EDIT
Based on R3_'s answer, I tried his query adapted to my needs, and it worked in some cases, but in others I got the cartesian product. In the example I'm using, table 2 has the number 1000 in column_a and the number 1 in column_b, while table 1 has 10 IDs for that combination of numbers, since the 1000-1 pair is not actually the same record (technically it is, but it stores different information and is usually distinguished by the ID). So the output is either 10 rows (when picking the columns from table 1) or 450, not the 45 I need as the result. The query I'm using is like this:
SELECT DISTINCT table_1.id, table_2.column_a, table_2.column_b
-- picking the columns from table 1 returns 10 rows; picking them from table 2 returns 450
FROM table_2
INNER JOIN table_1 ON table_2.column_a = table_1.column_a AND table_2.column_b = table_1.column_b
WHERE table_2.column_a = 1022 AND table_2.column_b = 1
So the big problem is the 10 IDs that share that 1000-1 combination: SQL doesn't know which ID should go where. What can I do to obtain the 45 rows I need?
I also figured out that if I run the general query, some rows are missing. Here is how I run it:
SELECT table_1.id, table_1.column_a, table_1.column_b
FROM table_2 -- here I tried switching which table's columns I return
INNER JOIN table_1 ON table_2.column_a = table_1.column_a AND table_2.column_b = table_1.column_b
The output of the latter example is 2666 rows, but it should be 2733. What am I doing wrong?
SELECT DISTINCT -- Adding DISTINCT clause for unique pairs of ID and creation_date
ID, tab1.column_a, tab1.column_b, creation_date
FROM [table 1] as tab1
LEFT JOIN [table 2] as tab2 -- OR INNER JOIN
ON tab1.column_a = tab2.column_a
AND tab1.column_b = tab2.column_b
-- WHERE ID IN ('01', '02') -- Filtering by desired ID
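The fan-out usually comes from duplicated (column_a, column_b) pairs, so it can help to check for duplicates before joining. A minimal sketch, using the table and column names from the question:

SELECT column_a, column_b, COUNT(*) AS n
FROM table_1
GROUP BY column_a, column_b
HAVING COUNT(*) > 1;
-- Any row returned here means a join on (column_a, column_b) will multiply rows

Run the same check against table_2: if both sides contain duplicates of the same pair, the join returns their product (e.g. 10 rows on one side and 45 on the other yield 450).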
I have a sparse table structured like:
id | name | phone | account
There is no primary key or index
There are also null values. What I want is to "glue" data from different rows together, e.g.:
Given
id | name | phone | account
1 null '339-33-27' 4
null 'John' '339-33-27' 4
I want to end up with
id | name | phone | account |
1 'John' '339-33-27' 4
However, I don't know which values are missing in the table.
What is the general way to approach this kind of problem? Do I only need joins, or might recursive functions help?
Update: provided a clearer example:
id to account is many-to-many
account to name is many-to-many
phone to name is one-to-one
The database is basically raw transactional data
What I want is to get all the rows for which I already have / could find an account.
If I understand you correctly, then this might work. What you need is a self join:
select t2.id, t1.name, t1.phone, t1.account
from table1 t1
join table1 t2 on t1.account = t2.account and t1.phone = t2.phone
where t1.name is not null
However, this particular query relies on an assumption from your example data. My assumption is that if name is not null, id will be null, and the id can be found by looking at the phone number and account. If this assumption is not true, then we may need more sample data to solve your problem.
Depending on the data, you might need left joins, or to swap things around so that t1 supplies the id rather than the name and the WHERE condition becomes id is not null. It's hard to tell with such a small sample size.
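Another way to sketch the "gluing", assuming that (phone, account) identifies a person, is to collapse each group of rows with aggregates (a sketch only, not tested against your real data):

SELECT
    MAX(id)   AS id,    -- MAX ignores NULLs, so the non-null id in the group survives
    MAX(name) AS name,  -- likewise for name
    phone,
    account
FROM table1
GROUP BY phone, account;

Note that if a group contains two different non-null values, this silently picks one of them, so it only works when the data is merely sparse rather than conflicting.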
I have the following two given tables, which cannot be changed.
1: DataTypes
+----------------------+-----------------------+
| datatypename(String) | datatypetable(String) |
+----------------------+-----------------------+
Example data:
+-----------+------------+
| CycleTime | datalong   |
| InjTime1  | datadouble |
+-----------+------------+
2: datalong_1 (data model does not matter here)
I want to make a query now that reads the datatypetable attribute from the datatypes table, adds the String "_1" to it and selects all content from it.
From a programmatic perspective, I imagined it to look something like this statement, which obviously doesn't work:
SELECT * FROM
(SELECT datatypetable FROM datatypes WHERE datatypename = 'CycleTime') + '_1'
How can I make this happen in SQL using HSQLDB?
Thanks to Leonidas199x, I now know how to get the '_1' in, but how do I tell the FROM clause that the subselect is not a new table to read from, but rather the name of an existing table I want to read from?
SELECT * FROM
(SELECT RTRIM(datatypetable)+'_1' FROM datatypes WHERE datatypename = 'CycleTime')
According to this question which is identical to mine this is not possible:
using subquery instead of the tablename
:(
Can you explain your data model in a little more detail? I am not sure I understand exactly what it is you are looking to do.
If you want to add '_1' to the datatypename, you can use:
SELECT datatypename+'_1'
FROM datatypes
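Since the table name in FROM must be an identifier, not a value produced by a subquery, the usual workaround is dynamic SQL: fetch the name first, then build the second statement outside of SQL. A sketch of the two-step version, using the names from the question (|| is the SQL-standard concatenation operator, which HSQLDB supports):

-- Step 1: look up the target table name
SELECT RTRIM(datatypetable) || '_1'
FROM datatypes
WHERE datatypename = 'CycleTime';
-- returns e.g. 'datalong_1'

-- Step 2: splice that value into the next statement as an identifier
-- (done in application code, since SQL cannot parameterize identifiers)
SELECT * FROM datalong_1;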
Right now I am planning to add a filter system to my site.
Examples:
(ID=apple, COLOR=red, TASTE=sweet, ORIGIN=US)
(ID=mango, COLOR=yellow, TASTE=sweet, ORIGIN=MEXICO)
(ID=banana, COLOR=yellow, TASTE=bitter-sweet, ORIGIN=US)
so now I am interested in doing the following:
SELECT ID FROM thisTable WHERE COLOR='yellow' AND TASTE='sweet'
But my problem is that I am doing this for multiple categories on my site, and the columns are NOT consistent (for example, if the table is for handphones, the columns will be BRAND, 3G-ENABLED, PRICE, COLOR, WAVELENGTH, etc.).
how could I design a general schema that allows this?
Right now I am planning on doing:
table(ID, KEY, VALUE)
This allows arbitary number of columns, but for the query, I am using
SELECT ID FROM table WHERE (KEY=X1 AND VALUE=V1) AND (KEY=X2 AND VALUE=V2), ... which returns an empty set (no single row can match two different keys at once).
Can someone recommend a good solution to this? Note that the number of columns WILL change regularly
The entity-attribute-value model that you suggest could fit in this scenario.
Regarding the filtering query, you have to understand that with the EAV model you will sacrifice plenty of query power, so this can become quite tricky. However, here is one way to tackle your problem:
SELECT stuff.id
FROM stuff
JOIN (SELECT COUNT(*) matches, id
      FROM stuff
      WHERE (`key` = X1 AND `value` = V1) OR
            (`key` = X2 AND `value` = V2)
      GROUP BY id
     ) sub_t ON (sub_t.matches = 2 AND sub_t.id = stuff.id)
GROUP BY stuff.id;
One inelegant feature of this approach is that you need to specify the number of attribute/value pairs that you expect to match in sub_t.matches = 2. If we had three conditions we would have had to specify sub_t.matches = 3, and so on.
Let's build a test case:
CREATE TABLE stuff (`id` varchar(20), `key` varchar(20), `value` varchar(20));
INSERT INTO stuff VALUES ('apple', 'color', 'red');
INSERT INTO stuff VALUES ('mango', 'color', 'yellow');
INSERT INTO stuff VALUES ('banana', 'color', 'yellow');
INSERT INTO stuff VALUES ('apple', 'taste', 'sweet');
INSERT INTO stuff VALUES ('mango', 'taste', 'sweet');
INSERT INTO stuff VALUES ('banana', 'taste', 'bitter-sweet');
INSERT INTO stuff VALUES ('apple', 'origin', 'US');
INSERT INTO stuff VALUES ('mango', 'origin', 'MEXICO');
INSERT INTO stuff VALUES ('banana', 'origin', 'US');
Query:
SELECT stuff.id
FROM stuff
JOIN (SELECT COUNT(*) matches, id
FROM stuff
WHERE (`key` = 'color' AND `value` = 'yellow') OR
(`key` = 'taste' AND `value` = 'sweet')
GROUP BY id
) sub_t ON (sub_t.matches = 2 AND sub_t.id = stuff.id)
GROUP BY stuff.id;
Result:
+-------+
| id |
+-------+
| mango |
+-------+
1 row in set (0.02 sec)
Now let's insert another fruit with color=yellow and taste=sweet:
INSERT INTO stuff VALUES ('pear', 'color', 'yellow');
INSERT INTO stuff VALUES ('pear', 'taste', 'sweet');
INSERT INTO stuff VALUES ('pear', 'origin', 'somewhere');
The same query would return:
+-------+
| id |
+-------+
| mango |
| pear |
+-------+
2 rows in set (0.00 sec)
If we want to restrict this result to entities with origin=MEXICO, we would have to add another OR condition and check for sub_t.matches = 3 instead of 2.
SELECT stuff.id
FROM stuff
JOIN (SELECT COUNT(*) matches, id
FROM stuff
WHERE (`key` = 'color' AND `value` = 'yellow') OR
(`key` = 'taste' AND `value` = 'sweet') OR
(`key` = 'origin' AND `value` = 'MEXICO')
GROUP BY id
) sub_t ON (sub_t.matches = 3 AND sub_t.id = stuff.id)
GROUP BY stuff.id;
Result:
+-------+
| id |
+-------+
| mango |
+-------+
1 row in set (0.00 sec)
As in every approach, there are certain advantages and disadvantages to using the EAV model. Make sure you research the topic extensively in the context of your application. You may even want to consider alternatives to relational databases, such as Cassandra, CouchDB, MongoDB, Voldemort, HBase, SimpleDB, or other NoSQL and key-value stores.
The following worked for me:
SELECT * FROM mytable t
WHERE (t.`key` = 'key1' AND t.`value` = 'value1')
   OR (t.`key` = 'key2' AND t.`value` = 'value2')
   ....
   OR (t.`key` = 'keyN' AND t.`value` = 'valueN')
GROUP BY t.id
HAVING COUNT(*) = 3;
The COUNT(*) = 3 must match the number of key/value conditions.
What you are suggesting is known as an Entity-Attribute-Value (EAV) structure and is highly discouraged. One of the (many) big problems with EAV designs, for example, is data integrity. How do you enforce that colors only consist of "red", "yellow", "blue", etc.? In short, you can't without a lot of hacks. Another problem rears itself in querying (as you have seen) and in searching the data.
Instead, I would recommend creating a table that represents each type of entity and thus each table can have attributes (columns) that are specific to that type of entity.
In order to convert the data into columns in a result query as you are seeking, you will need to create what is often called a crosstab query. There are report engines that will do it, and you can do it in code, but most database products will not do it natively (meaning without building the SQL string manually). The performance of course will not be good if you have a lot of data, and you will run into problems filtering on the data. For example, suppose that some of the values are supposed to be numeric. Because the value part of the EAV row is likely to be a string, you will have to cast those values to an integer before you can filter on them, and that presumes the data is actually convertible to an integer.
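Reusing the stuff test table from the earlier answer, a crosstab can be approximated with conditional aggregation when the set of keys is known up front (a sketch; a real EAV table with changing keys would need generated SQL):

SELECT id,
       MAX(CASE WHEN `key` = 'color'  THEN `value` END) AS color,
       MAX(CASE WHEN `key` = 'taste'  THEN `value` END) AS taste,
       MAX(CASE WHEN `key` = 'origin' THEN `value` END) AS origin
FROM stuff
GROUP BY id;
-- Each attribute becomes a column; missing attributes come back as NULL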
The price you pay for a simplistic table design at this stage will cost you in performance in the long run. Using an ORM to reduce the cost of reshaping the database into an appropriate structure would probably be a good time investment, even in spite of the ORM's performance cost.
Otherwise, you may want to look for a "reverse ORM" that generates code from your existing database schema: slightly higher starting cost than an ORM, but better long-term performance and reliability.
It's a costly problem regardless of how you slice it. Do you want to pay now with development time or pay later when your performance tanks? ("Pay later" is the wrong answer.)
I have 3 different transaction tables, which look very similar, but have slight differences. This comes from the fact that there are 3 different transaction types; depending on the transaction types the columns change, so to get them in 3NF I need to have them in separate tables (right?).
As an example:
t1:
date,user,amount
t2:
date,user,who,amount
t3:
date,user,what,amount
Now I need a query that gets me all transactions in each table for the same user, something like:
select * from t1,t2,t3 where user='me';
(which of course doesn't work).
I am studying JOIN statements but haven't figured out the right way to do this. Thanks.
EDIT: Actually I need all of the columns from every table, not just the ones that are the same.
EDIT #2: Yeah, having a transaction_type column doesn't break 3NF, of course, so maybe my design is utterly wrong. Here is what really happens (it's an alternative currency system):
- Transactions are between users, like mutual credit. So units get swapped between users.
- Inventarizations are physical stuff brought into the system; a user gets units for this.
- Consumations are physical stuff consumed; a user has to pay units for this.
|---------+----------------------+----------------------+----------------------|
| type    | transactions         | inventarizations     | consumations         |
|---------+----------------------+----------------------+----------------------|
| columns | date                 | date                 | date                 |
|         | creditor (FK user)   | creditor (FK user)   |                      |
|         | debitor (FK user)    |                      | debitor (FK user)    |
|         | service (FK service) |                      |                      |
|         |                      | asset (FK asset)     | asset (FK asset)     |
|         | amount               | amount               | amount               |
|         |                      |                      | price                |
|---------+----------------------+----------------------+----------------------|
(Note that 'amount' is in different units; these are the entries, and calculations are made on those amounts. It's outside the scope of this question to explain why, but these are the fields.) So the question changes to: "Can/should this be one table, or multiple tables (as I have it for now)?"
I need the previously described SQL statement to display running balances.
(Should this now become a new question altogether or is that OK to EDIT?).
EDIT #3: As EDIT #2 actually transforms this to a new question, I also decided to post a new question. (I hope this is ok?).
You can supply defaults as constants in the select statements for columns where you have no data;
so
SELECT Date, User, Amount, 'NotApplicable' AS Who, 'NotApplicable' AS What FROM t1 WHERE user = 'me'
UNION
SELECT Date, User, Amount, Who, 'NotApplicable' FROM t2 WHERE user = 'me'
UNION
SELECT Date, User, Amount, 'NotApplicable', What FROM t3 WHERE user = 'me'
which assumes that Who and What are string-typed columns. You could use NULL as well, but some kind of placeholder is needed.
I think that placing your additional information in a separate table and keeping all transactions in a single table will work better for you though, unless there is some other detail I've missed.
I think the meat of your question is here:
depending on the transaction types the columns change, so to get them in 3NF I need to have them in separate tables (right?).
I'm no 3NF expert, but I would approach your schema a little differently (which might clear up your SQL a bit).
It looks like your data elements are as such: date, user, amount, who, and what. With that in mind, a more normalized schema might look something like this:
User
----
id, user info (username, etc)
Who
---
id, who info
What
----
id, what info
Transaction
-----------
id, date, amount, user_id, who_id, what_id
Your foreign key constraint verbiage will vary based on database implementation, but this is a little clearer (and extendable).
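A sketch of that schema as DDL (the types here are assumptions, since the question doesn't specify them):

CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name VARCHAR(50) NOT NULL
    -- other user info
);

CREATE TABLE whos (
    id INTEGER PRIMARY KEY
    -- who info
);

CREATE TABLE whats (
    id INTEGER PRIMARY KEY
    -- what info
);

CREATE TABLE transactions (
    id      INTEGER PRIMARY KEY,
    date    DATE NOT NULL,
    amount  DECIMAL(12,2) NOT NULL,
    user_id INTEGER NOT NULL REFERENCES users(id),
    who_id  INTEGER REFERENCES whos(id),   -- NULL when not applicable
    what_id INTEGER REFERENCES whats(id)   -- NULL when not applicable
);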
You should consider an STI "architecture" (single table inheritance), i.e. put all the different columns into one table, keyed by a single id.
In addition, you may want to add indexes to the other columns you select by.
What is the result schema going to look like? If you only want the minimal columns present in all 3 tables, then it's easy: you would just UNION the results:
SELECT Date, User, Amount FROM t1 WHERE user = 'me'
UNION
SELECT Date, User, Amount FROM t2 WHERE user = 'me'
UNION
SELECT Date, User, Amount FROM t3 WHERE user = 'me'
Or you could 'SubClass' them
CREATE TABLE [Transaction]
(
    TransactionId Integer PRIMARY KEY NOT NULL,
    TransactionDateTime DateTime NOT NULL,
    TransactionType Integer NOT NULL
    -- Other columns all transactions share
)

CREATE TABLE Type1Transactions
(
    TransactionId Integer PRIMARY KEY NOT NULL
    -- Type 1 specific columns
)

ALTER TABLE Type1Transactions WITH CHECK ADD CONSTRAINT
[FK_Type1Transaction_Transaction] FOREIGN KEY([TransactionId])
REFERENCES [Transaction] ([TransactionId])
Repeat for other types of transactions...
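To read the data back with this design, here is a sketch of a query joining the supertype to one subtype (names taken from the tables above):

SELECT t.TransactionId, t.TransactionDateTime, t.TransactionType,
       t1.* -- type-1 specific columns; NULL on rows of other types
FROM [Transaction] t
LEFT JOIN Type1Transactions t1 ON t1.TransactionId = t.TransactionId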
What about simply leaving the unnecessary columns null and adding a TransactionType column? This would result in a simple SELECT statement.
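A sketch of that single-table variant, with column names taken from the question's three tables and assumed types:

CREATE TABLE AllTransactions
(
    Id              Integer PRIMARY KEY NOT NULL,
    TransactionDate Date NOT NULL,
    [User]          Varchar(50) NOT NULL, -- bracketed since USER is reserved in some databases
    TransactionType Varchar(20) NOT NULL, -- 't1', 't2' or 't3'
    Who             Varchar(50),          -- NULL except for t2 rows
    What            Varchar(50),          -- NULL except for t3 rows
    Amount          Decimal(12,2) NOT NULL
)

SELECT * FROM AllTransactions WHERE [User] = 'me'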
select *
from (
select user from t1
union
select user from t2
union
select user from t3
) u
left outer join t1 on u.user=t1.user
left outer join t2 on u.user=t2.user
left outer join t3 on u.user=t3.user