I just found out that whenever I call db.Create(), two SQL queries are issued: an insert and a select. The select query in particular looks like this:
SELECT "num", "my_text", "my_int" FROM "product" WHERE (id = 2)
(1) Why does it issue a select query? Without it, performance should be better.
(2) Why select these three columns? There are 18 columns in this table. I can't find anything these three columns have in common; they are all different types.
I think I figured out why. Please correct me if I am wrong.
For example, when I do db.Create(&product),
(1) GORM loads the values of fields it doesn't know from the database into the variable product.
(2) It selects only these three fields because I didn't provide their values in the original product variable, so GORM doesn't know what they will be. It therefore selects those fields and assigns them to product. For instance, num is an auto-incremented serial.
(3) If I provide all fields' values in product before creating the row, GORM will not issue a select after the insert.
Btw, GORM is not very smart about this, because my_text has a default value in its definition, such as
MyText string `gorm:"default:'abc'"`
Thus, even if I don't specify the field, GORM should know what the my_text value is and should not need to select it. But whatever, this may just be how GORM is designed for now.
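To illustrate the reasoning above, here is a minimal sketch in Python with sqlite3 rather than Go and GORM. It is not GORM's implementation; the table layout, the default values and the name column are assumptions made up for the example. It only shows that values filled in by the database are unknown to the caller until they are read back, which is the job the extra select appears to be doing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: three columns get their values from database defaults.
conn.execute(
    """
    CREATE TABLE product (
        id      INTEGER PRIMARY KEY,
        num     INTEGER DEFAULT 100,
        my_text TEXT    DEFAULT 'abc',
        my_int  INTEGER DEFAULT 7,
        name    TEXT
    )
    """
)

# The caller supplies only `name`; the database fills in the rest.
cur = conn.execute("INSERT INTO product (name) VALUES (?)", ("widget",))
new_id = cur.lastrowid

# Read back only the columns the caller did not provide.
row = conn.execute(
    "SELECT num, my_text, my_int FROM product WHERE id = ?", (new_id,)
).fetchone()
print(row)
```

If every value were supplied in the insert, there would be nothing left to read back, which matches point (3).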
I'm trying to write a SQL query that will insert test data into two tables, one of which references the other.
Tables are created from something like the following:
CREATE TABLE address (
address_id INTEGER IDENTITY PRIMARY KEY,
...[irrelevant columns]
);
CREATE TABLE member (
...[irrelevant columns],
address INTEGER,
FOREIGN KEY(address) REFERENCES address(address_id)
);
I want ids in both tables to auto increment, so that I can easily insert new rows later without having to look into the table for ids.
I need to insert some test data into both tables, about 25 rows in each. Hardcoding ids for the insert causes issues with inserting new rows later, as the automatic values for the id columns try and start with 1 (which is already in the database). So I need to let the ids be automatically generated, but I also need to know which ids are in the database for inserting test data into the member database - I don't believe the autogenerated ones are guaranteed to be consecutive, so can't assume I can safely hardcode those.
This is test data - I don't care which record I link each member row I am inserting to, only that there is an address record in the address table with that id.
My thoughts for how to do this so far include:
Insert addresses individually, returning the id, then use that to insert an individual member (cons: potentially messy, not sure of the syntax, harder to see expected sets of addresses/members in the test data)
Do the member insert with a SELECT address_id FROM address WHERE [some condition that will only give one row] for the address column (cons: also a bit messy, involves a quite long statement for something I don't care about)
Is there a neater way around this problem?
I particularly wonder if there is a way to either:
Let the auto increment controlling functions be aware of manually inserted id values, or
Get the list of inserted ids from the address table into a variable which I can use values from in turn to insert members.
Ideally, I'd like this to work with as many (irritatingly slightly different) database engines as possible, but I need to support at least postgresql and sqlite - ideally in a single query, although I could have two separate ones. (I have separate ones for creating the tables, the sole difference being INTEGER GENERATED BY DEFAULT AS IDENTITY instead of just IDENTITY.)
https://www.postgresql.org/docs/8.1/static/functions-sequence.html
Sounds like LASTVAL() is what you're looking for. It will also work in the real world to maintain transactional consistency between multiple selects, as it's scoped to your session's last insert.
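For the sqlite side, the same "insert, then reuse the generated id" pattern can be sketched with Python's sqlite3 module, where cursor.lastrowid exposes the id of the last inserted row. The street, name and member_id columns are hypothetical stand-ins for the irrelevant columns elided in the question, and the test data is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE address (
        address_id INTEGER PRIMARY KEY,
        street     TEXT
    );
    CREATE TABLE member (
        member_id INTEGER PRIMARY KEY,
        name      TEXT,
        address   INTEGER,
        FOREIGN KEY (address) REFERENCES address (address_id)
    );
    """
)

# Each tuple keeps a member next to its address, so the test set stays readable.
test_data = [
    ("Alice", "1 First St"),
    ("Bob", "2 Second St"),
    ("Carol", "3 Third St"),
]

for name, street in test_data:
    cur = conn.execute("INSERT INTO address (street) VALUES (?)", (street,))
    address_id = cur.lastrowid  # id generated for the row just inserted
    conn.execute(
        "INSERT INTO member (name, address) VALUES (?, ?)", (name, address_id)
    )
conn.commit()

member_count = conn.execute("SELECT COUNT(*) FROM member").fetchone()[0]
orphans = conn.execute(
    """
    SELECT COUNT(*) FROM member m
    LEFT JOIN address a ON a.address_id = m.address
    WHERE a.address_id IS NULL
    """
).fetchone()[0]
```

No id is hardcoded, so later inserts keep working with the automatically generated values.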
I'm building a web app which takes preferences and saves them.
The dataset I want to save will consist of a unique ID, a search string and some finite list of parameters which could be represented as True or False. This list of parameters could get up to say 10 in number.
I haven't decided what type of database I'm using but assuming it has rows and columns, would it be more efficient to have ID, search string and all the parameters as separate columns OR would it be more efficient to have ID, search string and then a single column representing all my parameters using some sort of dictionary that I would translate on the back end.
For example I could represent option A, C and D as A-C-D in a single column and then use a dictionary on retrieval to work with it in the application. Or else I would be using ColA: True, ColB: False, ColC: True, ColD: True, ..., ColN in the table and working with that when I pull it through
Would it be more useful to choose an SQL style DB over something like MongoDB in either case?
The answer to this depends. Normally, one uses relational databases to store relational information. This would mean that you have separate columns for options and values. There are traditionally two ways of doing this.
The most common is a normalized form, where each option has a column in a Users table. The key is the user id and you can just read the values. This works very well when there is a finite list of options that doesn't change much. It is also really useful when you want to query the table by options -- which users have a particular option, for instance.
Another method is called entity-attribute-value (EAV). In this method, the UserOptions table would have a separate row for each user and each option. The key would normally consist of the user/option pair (and the option itself might be an id that references a master list of options). This is flexible; it is easy to add values and it can handle an unlimited number of options per user. The downside is that getting all options for a user can be cumbersome; there is no data type validation on the values; implementing check constraints to validate values is tricky.
A third method can be useful for some purposes. That is to store all the options in a single "string" -- more typically, a JSON object. This is useful when you are using the database only for its ACID properties and don't need to query individual options. You can read the "options object" into your application, and it parses them into the options.
These are three examples of ways to solve the problem. There are also hybrid approaches that combine elements from more than one solution.
Which solution works best for you depends on your application. If you just have a handful of predetermined options, I would go with the first suggestion, a single column per option in a table.
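As a rough sketch of the first and third methods side by side, assuming sqlite via Python and made-up table and column names (prefs_columns, prefs_json, opt_a and so on):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Method 1: one column per option.
    CREATE TABLE prefs_columns (
        id     INTEGER PRIMARY KEY,
        search TEXT NOT NULL,
        opt_a  INTEGER NOT NULL DEFAULT 0 CHECK (opt_a IN (0, 1)),
        opt_b  INTEGER NOT NULL DEFAULT 0 CHECK (opt_b IN (0, 1)),
        opt_c  INTEGER NOT NULL DEFAULT 0 CHECK (opt_c IN (0, 1))
    );
    -- Method 3: all options serialized into one text column.
    CREATE TABLE prefs_json (
        id      INTEGER PRIMARY KEY,
        search  TEXT NOT NULL,
        options TEXT NOT NULL
    );
    """
)

conn.execute(
    "INSERT INTO prefs_columns (search, opt_a, opt_b, opt_c) VALUES (?, ?, ?, ?)",
    ("red shoes", 1, 0, 1),
)
conn.execute(
    "INSERT INTO prefs_json (search, options) VALUES (?, ?)",
    ("red shoes", json.dumps({"a": True, "b": False, "c": True})),
)

# With columns, the database itself can filter by option.
with_c = conn.execute(
    "SELECT search FROM prefs_columns WHERE opt_c = 1"
).fetchall()

# With a serialized object, the application decodes the options after reading.
raw = conn.execute("SELECT options FROM prefs_json WHERE id = 1").fetchone()[0]
options = json.loads(raw)
```

The column version also lets the database validate each value, which the serialized version leaves to the application.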
Neither of the two options you specified is ideal.
Would it be more efficient to have ID, search string and all the parameters as separate columns
The problem with this is not only does this assume that you have a fixed maximum number of parameters, but querying this data would require you to always include every param column. An example query for this would be like this:
SELECT Id, <other fields>, Param1, Param2, Param3, Param4, ..., Param10
FROM YourTable
WHERE <stuff>
This can be very cumbersome on the back-end trying to check for NULL values, and you may run into the situation where you don't have enough columns. Plus, indexing would be very high overhead to add an index to each Param.
In short, don't do that method.
OR would it be more efficient to have ID, search string and then a single column representing all my parameters using some sort of dictionary that I would translate on the back end.
Also, no. There is a large problem with this method when it comes to querying data. If, say, you wanted to retrieve all records with parameter xyz, you would need to construct a query that parses out all of the params and compares them. Such a query cannot be indexed, and performance will be dreadful. In addition, it requires more coding on the application layer to actually make sense of the data returned.
Proposed Solution
You should make a separate table for the parameters. The structure would look something similar to this:
Dataset:            DatasetParameters:
Id                  DatasetId
<Other Fields>      Parameter
Using this structure, let's say for ID 1, you have parameters A, B, C, and D. You can insert four rows into DatasetParameters:
DatasetId Parameter
----------------------
1 A
1 B
1 C
1 D
If you want to add more parameters later, you can simply insert (or delete, should you wish to remove) from this table with the DatasetId being the ID of the Dataset table.
To query this, all you would need to do is use a JOIN:
SELECT D.*, P.Parameter
FROM Dataset D
INNER JOIN DatasetParameters P ON D.Id = P.DatasetId
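A runnable sketch of this two-table layout, using sqlite through Python. The SearchString column and the composite primary key are assumptions added for the example; the table and column names otherwise follow the structure described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Dataset (
        Id           INTEGER PRIMARY KEY,
        SearchString TEXT
    );
    CREATE TABLE DatasetParameters (
        DatasetId INTEGER NOT NULL REFERENCES Dataset (Id),
        Parameter TEXT    NOT NULL,
        PRIMARY KEY (DatasetId, Parameter)
    );
    """
)

conn.execute("INSERT INTO Dataset (Id, SearchString) VALUES (1, 'red shoes')")
# One row per parameter that is set for dataset 1.
conn.executemany(
    "INSERT INTO DatasetParameters (DatasetId, Parameter) VALUES (?, ?)",
    [(1, p) for p in "ABCD"],
)

# All parameters of every dataset, via the join.
rows = conn.execute(
    """
    SELECT D.Id, D.SearchString, P.Parameter
    FROM Dataset D
    INNER JOIN DatasetParameters P ON D.Id = P.DatasetId
    ORDER BY P.Parameter
    """
).fetchall()

# Which datasets have parameter C? The primary key index can serve this lookup.
with_c = [
    r[0]
    for r in conn.execute(
        "SELECT DatasetId FROM DatasetParameters WHERE Parameter = 'C'"
    )
]
```

Removing a parameter is then a plain DELETE of one row, with no schema change.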
I am writing a program that recovers structured data as individual records from a (damaged) file and collects the results into a sqlite database.
The program is invoked several times with slightly different recovery parameters. That leads to recovering often the same, but sometimes different data from the file.
Now, every time I run my program with different parameters, it's supposed to add just the newly (different) found items to the same database.
That means that I need a fast way to tell if each recovered record is already present in the DB or not, in order to add them only if they're not existing in the DB yet.
I understand that for each record I want to add, I could first do a SELECT for all columns to see if there is already a matching record in the DB, and only add the new one if no same is found.
But since I'm adding 10000s of records, doing a SELECT for each of these records feels pretty inefficient (slow) to me.
I wonder if there's a smarter way to handle this? I.e, is there a way I can tell sqlite that I do not want duplicate entries, and so it automatically detects and rejects them? I know about the UNIQUE modifier, but that's not it because it applies to single columns only, doesn't it? I'd need to be able to say that the combination of COL1+COL2+COL3 must be unique. Is there a way to do that?
Note: I never want to update any existing records. I only want to collect a set of different records.
Bonus part - performance
In a classic programming language, I'd use a key-value dictionary where the key is the sum of all a record's values. Similarly, I could calculate a Hash code for each added record and look that hash code up first. If there's no match, then the record is surely not in the DB yet; If there is a match I'd still have to search the DB for any duplicates. That'd surely be faster already, but I still wonder if sqlite can make this more efficient.
Try:
sqlite> create table foo (
...> a int,
...> b int,
...> unique(a, b)
...> );
sqlite>
sqlite> insert into foo values(1, 2);
sqlite> insert into foo values(2, 1);
sqlite> insert into foo values(1, 2);
Error: columns a, b are not unique
sqlite>
You could use a UNIQUE column constraint or, to declare a unique constraint over multiple columns, a UNIQUE (...) table constraint with an ON CONFLICT clause:
CREATE TABLE name ( id int , col_name1 type , col_name2 type , UNIQUE (col_name1 , col_name2) ON CONFLICT IGNORE )
SQLite has two ways of expressing uniqueness constraints: PRIMARY KEY and UNIQUE. Both of them create an index and so the lookup happens through the created index.
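Putting the multi-column constraint and ON CONFLICT IGNORE together, a minimal sketch from Python could look like this. The table name records and the sample rows are invented; the columns are declared NOT NULL because sqlite treats NULLs as distinct from each other in a UNIQUE constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE records (
        col1 TEXT    NOT NULL,
        col2 INTEGER NOT NULL,
        col3 TEXT    NOT NULL,
        UNIQUE (col1, col2, col3) ON CONFLICT IGNORE
    )
    """
)

# Two recovery runs that partly overlap.
first_run = [("a", 1, "x"), ("b", 2, "y")]
second_run = [("a", 1, "x"), ("c", 3, "z")]

for batch in (first_run, second_run):
    with conn:  # one transaction per run keeps bulk inserts fast
        conn.executemany("INSERT INTO records VALUES (?, ?, ?)", batch)

count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```

The duplicate row from the second run is skipped silently, with no SELECT issued by the application.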
If you do not want to use an SQL approach (as mentioned in other answers), you can do a select for all your data when the program starts, store the data in a dictionary, and work with the dictionary to decide which records to insert into your DB.
The benefit of this approach is the single select is much faster than many small selects.
The disadvantage is that it won't work well if you don't have enough memory to store your data in.
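A small sketch of this in-memory approach, using a Python set of row tuples in place of a dictionary. The table name, columns and sample records are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (col1 TEXT, col2 INTEGER, col3 TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [("a", 1, "x"), ("b", 2, "y")],
)

# One select at startup: load every existing row into a set of tuples.
seen = set(conn.execute("SELECT col1, col2, col3 FROM records"))

# Records recovered in this run, including a repeat within the run itself.
recovered = [("a", 1, "x"), ("c", 3, "z"), ("c", 3, "z")]

new_rows = []
for rec in recovered:
    if rec not in seen:
        seen.add(rec)  # also guards against duplicates inside this run
        new_rows.append(rec)

with conn:
    conn.executemany("INSERT INTO records VALUES (?, ?, ?)", new_rows)

count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```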
In sqlite3, I can force two columns to alias to the same name, as in the following query:
SELECT field_one AS overloaded_name,
field_two AS overloaded_name
FROM my_table;
It returns the following:
overloaded_name overloaded_name
--------------- ---------------
1 2
3 4
... ...
... and so on.
However, if I create a named table using the same syntax, it suffixes one of the aliases with :1:
sqlite> CREATE TABLE temp AS
SELECT field_one AS overloaded_name,
field_two AS overloaded_name
FROM my_table;
sqlite> .schema temp
CREATE TABLE temp(
overloaded_name TEXT,
"overloaded_name:1" TEXT
);
I ran the original query just to see if this was possible, and I was surprised that it was allowed. Is there any good reason to do this? Assuming there isn't, why is this allowed at all?
EDIT:
I should clarify: the question is twofold: why is the table creation allowed to succeed, and (more importantly) why is the original select allowed in the first place?
Also, see my clarification above with respect to table creation.
I can force two columns to alias to the same name...
why is [this] allowed in the first place?
This can be attributed to the shackles of compatibility. In the SQL Standards, nothing is ever deprecated. An early version of the Standard allowed the result of a table expression to include columns with duplicate names, probably because an influential vendor had allowed it, possibly due to the inclusion of a bug or the omission of a design feature, and weren't prepared to take the risk of breaking their customers' code (the shackles of compatibility again).
Is there any use to duplicate column names in a table?
In the relational model, every attribute of every relation has a name that is unique within the relevant relation. Just because SQL allows duplicate column names, that doesn't mean that as a SQL coder you should utilise such a feature; in fact I'd say you have to be vigilant not to invoke this feature in error. I can't think of any good reason to have duplicate column names in a table but I can think of many obvious bad ones. Such a table would not be a relation and that can't be a good thing!
why is the [base] table creation allowed to succeed
Undoubtedly an 'extension' to (a.k.a. purposeful violation of) the SQL Standards. I suppose it could be perceived as a reasonable feature: if I attempt to create columns with duplicate names, the system automatically disambiguates them by suffixing an ordinal number. In fact, the SQL Standard specifies that there be an implementation-dependent way to ensure the result of a table expression does not implicitly have duplicate column names (but as you point out in the question this does not preclude the user from explicitly using duplicate AS clauses). However, I personally think the Standard behaviour of disallowing the duplicate name and raising an error is the correct one. Aside from the above reasons (i.e. that duplicate columns in the same table are of no good use), a SQL script that creates an object without knowing if the system has honoured that name will be error prone.
The table itself can't have duplicate column names because inserting and updating would be messed up. Which column gets the data?
During selects the "duplicates" are just column labels so do not hurt anything.
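The difference between result labels and stored column names can be observed from Python's sqlite3 module. This is only a sketch: the table contents are invented, and the exact disambiguated name in the second case is whatever the sqlite build in use chooses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (field_one TEXT, field_two TEXT)")
conn.execute("INSERT INTO my_table VALUES ('1', '2')")

# A plain select: both result columns simply carry the same label.
cur = conn.execute(
    """
    SELECT field_one AS overloaded_name,
           field_two AS overloaded_name
    FROM my_table
    """
)
labels = [d[0] for d in cur.description]
cur.fetchall()

# CREATE TABLE ... AS SELECT: the stored columns need distinct names.
conn.execute(
    """
    CREATE TABLE temp_copy AS
    SELECT field_one AS overloaded_name,
           field_two AS overloaded_name
    FROM my_table
    """
)
columns = [row[1] for row in conn.execute("PRAGMA table_info(temp_copy)")]

print(labels)
print(columns)
```

The labels list contains the alias twice, while the column names of the new table differ from each other.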
I assume you're talking about the CREATE TABLE ... AS SELECT command. This looks like an SQL extension to me.
Standard SQL does not allow you to use the same column name for different columns, and SQLite appears to be allowing that in its extension, but working around it. While a simple, naked select statement simply uses AS to set the column name, create table ... as select uses it to create a brand new table with those column names.
As an aside, it would be interesting to see what the naked select does when you try to use the duplicated column, such as in an order by clause.
If you were allowed to have multiple columns with the same name, it would be a little difficult for the execution engine to figure out what you meant with:
select overloaded_name from table;
The reason why you can do it in the select is to allow things like:
select id, surname as name from users where surname is not null
union all
select id, firstname as name from users where surname is null
so that you end up with a single name column.
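The union query above can be tried out with a throwaway users table; the rows below are made up, and an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, surname TEXT, firstname TEXT)"
)
conn.executemany(
    "INSERT INTO users (id, surname, firstname) VALUES (?, ?, ?)",
    [(1, "Smith", "Ann"), (2, None, "Bob")],
)

# Each branch aliases a different source column to the same label.
cur = conn.execute(
    """
    SELECT id, surname AS name FROM users WHERE surname IS NOT NULL
    UNION ALL
    SELECT id, firstname AS name FROM users WHERE surname IS NULL
    ORDER BY id
    """
)
labels = [d[0] for d in cur.description]
rows = cur.fetchall()
```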
As to whether there's a good reason, SQLite is probably assuming you know what you're doing when you specify the same column name for two different columns. Its essence seems to be to allow a great deal of latitude to the user (as evidenced by the fact that the columns are dynamically typed, for example).
The alternative would be to simply refuse your request, which is what I'd prefer, but the developers of SQLite are probably more liberal (or less anal-retentive) than I :-)
I'm copying a subset of some data, so that the copy will be independently modifiable in future.
One of my SQL statements looks something like this (I've changed table and column names):
INSERT Product(
ProductRangeID,
Name, Weight, Price, Color, And, So, On
)
SELECT
@newrangeid AS ProductRangeID,
Name, Weight, Price, Color, And, So, On
FROM Product
WHERE ProductRangeID = @oldrangeid and Color = 'Blue'
That is, we're launching a new product range which initially just consists of all the blue items in some specified current range, under new SKUs. In future we may change the "blue-range" versions of the products independently of the old ones.
I'm pretty new at SQL: is there something clever I should do to avoid listing all those columns, or at least avoid listing them twice? I can live with the current code, but I'd rather not have to come back and modify it if new columns are added to Product. In its current form it would just silently fail to copy the new column if I forget to do that, which should show up in testing but isn't great.
I am copying every column except for the ProductRangeID (which I modify), the ProductID (incrementing primary key) and two DateCreated and timestamp columns (which take their auto-generated values for the new row).
Btw, I suspect I should probably have a separate join table between ProductID and ProductRangeID. I didn't define the tables.
This is in a T-SQL stored procedure on SQL Server 2008, if that makes any difference.
In a SELECT, you can either use * or list all columns and expressions explicitly, there are no partial wildcards.
You could possibly do a SELECT * into a temporary table or a table variable followed by an UPDATE of that table if that is just a one-off query and performance is of no importance.
You can omit the field names for the target table, but generally you shouldn't.
If you don't specify the fields for the target table, you will be relying on the order that they are defined in the table. If someone changes that order, the query will stop working. Or even worse, it can put the values in the wrong field.
Another aspect is that it's easier to see what the query does, so that you don't have to open the table to see what the fields are to understand the query.
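Since there are no partial wildcards, one way to keep an explicit column list without maintaining it by hand is to build the statement from the table's metadata. The sketch below does this for sqlite from Python, purely as an illustration: it is not T-SQL, the Product schema is a cut-down assumption, and copy_range is a hypothetical helper. On SQL Server the column names would have to come from the catalog views instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Product (
        ProductID      INTEGER PRIMARY KEY,
        ProductRangeID INTEGER NOT NULL,
        Name           TEXT,
        Weight         REAL,
        Price          REAL,
        Color          TEXT,
        DateCreated    TEXT DEFAULT CURRENT_TIMESTAMP
    )
    """
)
conn.executemany(
    "INSERT INTO Product (ProductRangeID, Name, Weight, Price, Color) "
    "VALUES (?, ?, ?, ?, ?)",
    [(1, "Mug", 0.3, 4.5, "Blue"), (1, "Plate", 0.5, 6.0, "Red")],
)


def copy_range(conn, old_range_id, new_range_id, color):
    """Copy matching rows into a new range, listing columns from metadata."""
    # Columns that are modified or auto-generated and so must not be copied.
    skip = {"ProductID", "ProductRangeID", "DateCreated"}
    copied = [
        row[1]
        for row in conn.execute("PRAGMA table_info(Product)")
        if row[1] not in skip
    ]
    column_list = ", ".join(f'"{c}"' for c in copied)
    sql = (
        f"INSERT INTO Product (ProductRangeID, {column_list}) "
        f"SELECT ?, {column_list} FROM Product "
        "WHERE ProductRangeID = ? AND Color = ?"
    )
    with conn:
        conn.execute(sql, (new_range_id, old_range_id, color))


copy_range(conn, 1, 2, "Blue")
copied_rows = conn.execute(
    "SELECT Name, Weight, Price, Color FROM Product WHERE ProductRangeID = 2"
).fetchall()
```

A column added to Product later is picked up automatically, while the generated statement still names every target column explicitly.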