How do I use sets in SQL?

In case you're not clear on what a set is: it lets you insert a value, check whether a value is already a member, and find the intersection of two sets (the values that exist in both). A set also holds any given value only once.
How do I use a set in SQL? I'm using SQLite but will be moving to PostgreSQL. The most obvious option to me was to have a Set table of (SetID, SetValue) with those two columns as the composite primary key. However, I have a feeling this wouldn't be efficient, since SetID would be repeated many times. The other option is to have a table of (SetID int, SetArray blob), but then I'd need to implement the set logic by hand.
What's the best way to do sets in SQL, for either SQLite or PostgreSQL?

A couple options:
A trigger to see if the value exists and, if it does, return NULL (accomplishes exactly what you are doing)
Just use unique constraints and handle errors
Both of these move the table from a bag to a set, but they do so in different ways: one gives the insert more select-like semantics (at a slight performance cost), while the other, more traditional SQL way forces data validation on entry.
If you want to select as a set, use select distinct instead.
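For concreteness, a minimal sketch of the unique-constraint approach on the (SetID, SetValue) table from the question (table and column names are purely illustrative); it should work in both SQLite (3.24+) and PostgreSQL:
CREATE TABLE set_member (
    set_id    INTEGER NOT NULL,
    set_value TEXT    NOT NULL,
    PRIMARY KEY (set_id, set_value)  -- a set holds any given value only once
);
-- Insert, silently ignoring duplicates (SQLite also accepts INSERT OR IGNORE)
INSERT INTO set_member (set_id, set_value) VALUES (1, 'a') ON CONFLICT DO NOTHING;
-- Membership test
SELECT EXISTS (SELECT 1 FROM set_member WHERE set_id = 1 AND set_value = 'a');
-- Intersection of sets 1 and 2
SELECT set_value FROM set_member WHERE set_id = 1
INTERSECT
SELECT set_value FROM set_member WHERE set_id = 2;
The repeated set_id is not a real problem in practice: the composite primary key gives you an index, so membership checks and intersections stay cheap.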

Related

How to copy data from one table to another with nested required fields in repeatable objects

I'm trying to copy data from one table to another. The schemas are identical except that the source table has its fields marked as nullable when they were meant to be required. BigQuery is complaining that the fields are null. I'm 99% certain the issue is that in many entries the repeated fields are absent, which causes no issues when inserting into the table using our normal process.
The table I'm copying from used to have the exact same schema, but accidentally lost the required fields when recreating the table with a different partitioning scheme.
From what I can tell, there is no way to change the fields from nullable to required in an existing table. It looks to me like you must create a new table then use a select query to copy data.
I tried enabling "Allow large results" and unchecking "flatten results" but I'm seeing the same issue. The write preference is "append to table"
(Note: see edit below as I am incorrect here - it is a data issue)
I tried building a query to better confirm my theory (that the repeated fields are genuinely absent, rather than present with null values), but I'm struggling to build one. I can definitely see in the preview that having some of the repeated fields be null is a real use case, so I would presume that translates to the nested required fields also being null. We have a backup of the table before it was converted to the new partitioning, and it has the same required schema as the table I'm trying to copy into. A simple SELECT count(*) WHERE this.nested.required.field IS NULL in legacy SQL on the backup indicates that there are quite a few rows that fit this criterion.
SQL used to select for insert:
select * from my_table
Edit:
The query that made the partition change on the table was also setting certain fields to a null value. It appears that somehow the select query created objects with all fields null rather than simply a null object. I used a conditional to set a nested object to either null or its existing value. Still investigating, but at this point I think what I'm attempting to do is normally supported, based on playing with some toy tables/queries.
When trying to copy from one table to another, and using SELECT AS STRUCT, run a null check like this:
IF(foo.bar IS NULL, NULL, (SELECT AS STRUCT foo.bar.* REPLACE(...)))
This prevents null nested structures from turning into structures full of null values.
To repair it via select statement, use a conditional check against a value that is required like this:
IF(bar.req IS NULL, NULL, bar)
Of course a real query is more complicated than that. The good news is that the repair query should look similar to the original query that messed up the format.
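For illustration only, a repair query along those lines might look like this (the dataset, table, and field names are invented; assume payment is a nested STRUCT whose amount field is supposed to be REQUIRED):
SELECT
  * REPLACE (
    IF(t.payment.amount IS NULL, NULL, t.payment) AS payment
  )
FROM my_dataset.my_table AS t
Writing the result into a table whose schema marks the nested fields as REQUIRED turns structs that contain only NULL values back into NULL structs, so the required child fields are never present with a NULL value.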

SELECT * - pros /cons

I always hear from SQL specialists that it is not efficient to use the '*' sign in a SELECT statement and that it is better to list all the field names instead.
But personally I don't find that convenient when it comes to adding new fields to a table and then updating all the stored procedures accordingly.
So what are the pros and cons of using '*'?
Thanks.
In general, the use of SELECT * is not a good idea.
Pros:
When you add/remove columns, you don't have to change the queries that use SELECT *
It is shorter to write
Also see the answers to: Can select * usage ever be justified?
Cons:
You are returning more data than you need. Say you add a VARBINARY column that contains 200k per row. You only need this data in one place for a single record - using SELECT * you can end up returning 2MB per 10 rows that you don't need (see the sketch after this list)
You are not explicit about what data is actually used
Specifying columns means you get an error when a column is removed
The query processor has to do some more work - figuring out what columns exist on the table (thanks @vinodadhikary)
You can find where a column is used more easily
You get all columns in joins if you use SELECT *
You can't use ordinal referencing (though using ordinal references for columns is bad practice in itself)
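To make the large-column point above concrete, a hedged example (the table and column names are made up): suppose documents carries a 200k VARBINARY payload that only one screen ever needs.
-- Drags the 200k binary column along for every row:
SELECT * FROM documents;
-- Returns only what this particular query actually uses:
SELECT document_id, title, modified_at FROM documents;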
Also see the answers to: What is the reason not to use select *?
Pros:
when you really need all the columns, it's shorter to write select *
Cons:
most of the time, you don't need all the columns, but only some of them. It's more efficient to only retrieve what you want
you have no guarantee of the order of the retrieved columns (or at least, the order is not obvious from the query), which forbids accessing columns by index (only by name), and even the names are not always unambiguous
when joining multiple tables that may have columns with the same name, an explicit column list lets you define aliases for those colliding columns; SELECT * does not (see the sketch below)
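A small sketch of that last point (table and column names invented): with SELECT * the two id and name columns collide, whereas an explicit list lets you alias them.
SELECT c.id   AS customer_id,
       c.name AS customer_name,
       o.id   AS order_id,
       o.total
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id;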

Why are positional queries bad?

I'm reading CJ Date's SQL and Relational Theory: How to Write Accurate SQL Code, and he makes the case that positional queries are bad — for example, this INSERT:
INSERT INTO t VALUES (1, 2, 3)
Instead, you should use attribute-based queries like this:
INSERT INTO t (one, two, three) VALUES (1, 2, 3)
Now, I understand that the first query is out of line with the relational model since tuples (rows) are unordered sets of attributes (columns). I'm having trouble understanding where the harm is in the first query. Can someone explain this to me?
The first query breaks pretty much any time the table schema changes. The second query accommodates any schema change that leaves its columns intact and doesn't add columns without defaults.
People who do SELECT * queries and then rely on positional notation for extracting the values they're concerned about are software maintenance supervillains for the same reason.
While the order of columns is defined in the schema, it should generally not be regarded as important because it's not conceptually important.
Also, it means that anyone reading the first version has to consult the schema to find out what the values are meant to mean. Admittedly this is just like using positional arguments in most programming languages, but somehow SQL feels slightly different in this respect - I'd certainly understand the second version much more easily (assuming the column names are sensible).
I don't really care about theoretical concepts in this regard (as in practice, a table does have a defined column order). The primary reason I would prefer the second one to the first is an added layer of abstraction. You can modify columns in a table without screwing up your queries.
You should try to make your SQL queries depend on the exact layout of the table as little as possible.
The first query relies on the table only having three fields, and in that exact order. Any change at all to the table will break the query.
The second query only relies on those three fields being present in the table, and the order of the fields is irrelevant. You can change the order of fields in the table without breaking the query, and you can even add fields as long as they allow null values or have a default value.
Although you don't rearrange the table layout very often, adding more fields to a table is quite common.
Also, the second query is more readable. You can tell from the query itself what the values put in the record mean.
Something that hasn't been mentioned yet is that you will often be having a surrogate key as your PK, with auto_increment (or something similar) to assign a value. With the first one, you'd have to specify something there — but what value can you specify if it isn't to be used? NULL might be an option, but that doesn't really fit in considering the PK would be set to NOT NULL.
But apart from that, the whole "locked to a specific schema" is a much more important reason, IMO.
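A minimal sketch of the surrogate-key situation described above (table and column names are invented, and the auto-increment syntax varies by DBMS):
CREATE TABLE speaker (
    speaker_id INT AUTO_INCREMENT PRIMARY KEY,  -- SERIAL / IDENTITY in other systems
    name       VARCHAR(100) NOT NULL,
    fee        DECIMAL(10,2)
);
-- Attribute-based: the surrogate key is simply omitted and generated for us
INSERT INTO speaker (name, fee) VALUES ('Jane Doe', 500.00);
-- Positional: something has to go in the first slot, and there is no good value to put there
-- INSERT INTO speaker VALUES (???, 'Jane Doe', 500.00);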
SQL gives you syntax for specifying the name of the column for both INSERT and SELECT statements. You should use this because:
Your queries are stable to changes in the column ordering, so that maintenance takes less work.
Column names map better to how people think, so the query is more readable. It's clearer to think of a column as the "Name" column rather than the 2nd column.
I prefer to use the UPDATE-like syntax:
INSERT t SET one = 1, two = 2, three = 3
Which is far easier to read and maintain than both the examples.
Long term, if you add one more column to your table, your INSERT will not work unless you explicitly specify the list of columns. If someone changes the order of the columns, your INSERT may silently succeed, inserting values into the wrong columns.
I'm going to add one more thing: the second query is less prone to error even before the tables are changed. Why do I say that? Because with the second form you can (and should, when you write the query) visually check that the columns in the insert list and the data in the values clause or select clause are in fact in the right order to begin with. Otherwise you may end up putting the Social Security Number in the Honoraria field by accident and paying speakers their SSN instead of the amount they should make for a speech (example not chosen at random, except we did catch it before it actually happened, thanks to that visual check!).

MERGE INTO insertion order

I have a statement that looks something like this:
MERGE INTO someTable st
USING
(
SELECT id,field1,field2,etc FROM otherTable
) ot on st.field1=ot.field1
WHEN NOT MATCHED THEN
INSERT (field1,field2,etc)
VALUES (ot.field1,ot.field2,ot.etc)
where otherTable has an autoincrementing id field.
I would like the insertion into someTable to be in the same order as the id field of otherTable, such that the order of ids is preserved when the non-matching fields are inserted.
A quick look at the docs would appear to suggest that there is no feature to support this.
Is this possible, or is there another way to do the insertion that would fulfil my requirements?
EDIT: One approach to this would be to add an additional field to someTable that captures the ordering. I'd rather not do this if possible.
... upon reflection the approach above seems like the way to go.
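For reference, a rough sketch of that extra-field approach (the column name source_id is invented); the MERGE copies the source's id into it, so the original ordering stays recoverable:
ALTER TABLE someTable ADD source_id INT;
MERGE INTO someTable st
USING (SELECT id, field1, field2 FROM otherTable) ot
ON st.field1 = ot.field1
WHEN NOT MATCHED THEN
    INSERT (field1, field2, source_id)
    VALUES (ot.field1, ot.field2, ot.id);
-- Whenever the order matters, ask for it explicitly:
-- SELECT field1, field2 FROM someTable ORDER BY source_id;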
I cannot speak to what the Questioner is asking for here because it doesn't make any sense.
So let's assume a different problem:
Let's say, instead, that I have a Heap-Table with no Identity-Field, but it does have a "Visited" Date field.
The Heap-Table logs Person WebPage Visits and I'm loading it into my Data Warehouse.
In this Data Warehouse I'd like to use the Surrogate-Key "WebHitID" to reference these relationships.
Let's use Merge to do the initial load of the table, then continue calling it to keep the tables in sync.
I know that if I'm inserting records into a table, then I'd prefer the IDs (that are being generated by an Identity-Field) to be sequential based on whatever Order-By I choose (let's say the "Visited" Date).
It is not uncommon to expect an Integer-ID to correlate to when it was created relative to the rest of the records in the table.
I know this is not always 100% the case, but humor me for a moment.
This is possible with Merge.
Using (what feels like a hack) TOP will allow for Sorting in our Insert:
MERGE DW.dbo.WebHit AS Target --This table as an Identity Field called WebHitID.
USING
(
SELECT TOP 9223372036854775807 --Biggest BigInt (to be safe).
PWV.PersonID, PWV.WebPageID, PWV.Visited
FROM ProdDB.dbo.Person_WebPage_Visit AS PWV
ORDER BY PWV.Visited --Works only with TOP when inside a MERGE statement.
) AS Source
ON Source.PersonID = Target.PersonID
AND Source.WebPageID = Target.WebPageID
AND Source.Visited = Target.Visited
WHEN NOT MATCHED BY Target THEN --Not in Target-Table, but in Source-Table.
INSERT (PersonID, WebPageID, Visited) --This Insert populates our WebHitID.
VALUES (Source.PersonID, Source.WebPageID, Source.Visited)
WHEN NOT MATCHED BY Source THEN --In Target-Table, but not in Source-Table.
DELETE --In case our WebHit log in Prod is archived/trimmed to save space.
;
You can see I opted to use TOP 9223372036854775807 (the largest BIGINT value) to pull everything.
If you have the resources to merge more than that, then you should be chunking it out.
While this screams "hacky workaround" to me, it should get you where you need to go.
I have tested this on a small sample set and verified it works.
I have not studied the performance impact of it on larger complex sets of data though, so YMMV with and without the TOP.
Following up on MikeTeeVee's answer.
Using TOP will allow you to ORDER BY within a sub-query; however, instead of TOP 9223372036854775807, I would go with
SELECT TOP 100 PERCENT
You're unlikely to ever reach that number, but this way just makes more sense and looks cleaner.
Why would you care about the order of the ids matching? What difference would that make to how you query the data? Related tables should be connected through primary and foreign keys, not the order in which records were inserted. Tables are not inherently ordered a particular way in databases; order should come from the ORDER BY clause.
More explanation as to why you want to do this might help us steer you to an appropriate solution.

MySQL SELECT statement using Regex to recognise existing data

My web application parses data from an uploaded file and inserts it into a database table. Due to the nature of the input data (bank transaction data), duplicate data can exist from one upload to another. At the moment I'm using hideously inefficient code to check for the existence of duplicates by loading all rows within the date range from the DB into memory, and iterating over them and comparing each with the uploaded file data.
Needless to say, this can become very slow as the data set size increases.
So, I'm looking to replace this with a SQL query (against a MySQL database) which checks for the existence of duplicate data, e.g.
SELECT count(*) FROM transactions WHERE desc = ? AND dated_on = ? AND amount = ?
This works fine, but my real-world case is a little bit more complicated. The description of a transaction in the input data can sometimes contain erroneous punctuation (e.g. "BANK 12323 DESCRIPTION" can often be represented as "BANK.12323.DESCRIPTION") so our existing (in memory) matching logic performs a little cleaning on this description before we do a comparison.
Whilst this works in memory, my question is can this cleaning be done in a SQL statement so I can move this matching logic to the database, something like:
SELECT count(*) FROM transactions WHERE CLEAN_ME(desc) = ? AND dated_on = ? AND amount = ?
Where CLEAN_ME is a proc which strips the field of the erroneous data.
Obviously the cleanest (no pun intended!) solution would be to store the already cleaned data in the database (either in the same column, or in a separate column), but before I resort to that I thought I'd try and find out whether there's a cleverer way around this.
Thanks a lot
can this cleaning be done in a SQL statement
Yes, you can write a stored procedure to do it in the database layer:
mysql> CREATE FUNCTION clean_me (s VARCHAR(255))
-> RETURNS VARCHAR(255) DETERMINISTIC
-> RETURN REPLACE(s, '.', ' ');
mysql> SELECT clean_me('BANK.12323.DESCRIPTION');
BANK 12323 DESCRIPTION
This will perform very poorly across a large table though.
Obviously the cleanest (no pun intended!) solution would be to store the already cleaned data in the database (either in the same column, or in a separate column), but before I resort to that I thought I'd try and find out whether there's a cleverer way around this.
No, as far as databases are concerned the cleanest way is always the cleverest way (as long as performance isn't awful).
Do that, and add indexes to the columns you're doing bulk compares on, to improve performance. If it's actually intrinsic to the type of data that desc/dated-on/amount are always unique, then express that in the schema by making it a UNIQUE index constraint.
The easiest way to do that is to add a unique index on the appropriate columns and to use ON DUPLICATE KEY UPDATE. I would further recommend transforming the file into a csv and loading it into a temporary table to get the most out of mysql's builtin functions, which are surely faster than anything that you could write yourself - if you consider that you would have to pull the data into your own application, while mysql does everything in place.
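A rough sketch of that flow (the file path, delimiter, and exact column list are assumptions, and desc is quoted because it is a reserved word); it relies on transactions having a unique index over the columns that define a duplicate:
CREATE TEMPORARY TABLE transactions_stage LIKE transactions;
LOAD DATA LOCAL INFILE '/tmp/upload.csv'
INTO TABLE transactions_stage
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
IGNORE 1 LINES;
INSERT INTO transactions (`desc`, dated_on, amount)
SELECT `desc`, dated_on, amount FROM transactions_stage
ON DUPLICATE KEY UPDATE amount = VALUES(amount);  -- duplicates are left effectively unchanged
Because amount is part of the unique key, the UPDATE assignment is a no-op; the statement simply skips rows that are already present.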
The cleanest way is indeed to make sure only correct data is in the database.
In this example the "BANK.12323.DESCRIPTION" would be returned by:
SELECT count(*) FROM transactions
WHERE desc LIKE 'BANK%12323%DESCRIPTION' AND dated_on = ? AND amount = ?
But this might impose performance issues when you have a lot of data in the table.
Another way that you could do it is as follows:
Clean the description before inserting.
Create a primary key for the table that is a combination of the columns that uniquely identify the entry. It sounds like that might be the cleaned description, date and amount.
Use either the REPLACE or ON DUPLICATE KEY syntax, whichever is more appropriate. REPLACE actually replaces the existing row in the db with the updated one when a unique key conflict occurs, e.g.:
REPLACE INTO transactions (desc, dated_on, amount) values (?,?,?)
ON DUPLICATE KEY allows you to specify which columns to update on a duplicate key error:
INSERT INTO transactions (desc, dated_on, amount) values (?,?,?)
ON DUPLICATE KEY UPDATE amount = amount
By using the multi-column primary key, you will gain a lot of performance since primary key lookups are usually quite fast.
If you prefer to keep your existing primary key, you could also create a unique index on those three columns.
Whichever way you choose, I would recommend cleaning the description before going into the db, even if you also store the original description and just use the cleaned one for indexing.
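If you do go the store-the-cleaned-value route, the schema change could be as small as this (the column and index names are made up); inserts then use the ON DUPLICATE KEY form shown above against the new unique key:
ALTER TABLE transactions
    ADD COLUMN cleaned_desc VARCHAR(255),
    ADD UNIQUE KEY uq_transaction (cleaned_desc, dated_on, amount);
Populate cleaned_desc in application code, or with a function like the clean_me() example above, before the row goes in.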