What is the fastest way to compare 2 rows in SQL? - sql

I have 2 different databases. Upon changing something in the big one (i don't have access to), i get some rows imported in my databases in a similar HUGE table. I have a job checking for records in this table, and if any, execute a stored procedure, process and delete from table.
Performance. (Huge amount of data) I would like to know what is the fastest way to know if something has changed using let's say 2 imported rows with 100 columns each. Don't have FK-s, don't need. Chances are, that even though I have records in my table, nothing has actually changed.
Also. Let's say there is actually changed something. Is it possible for example to check only for changes inside datetime columns?
Thanks

You can always use update triggers - these will give you access to two logical tables, inserted and updated. You can compare the values of these and base your action on the results.

Related

Merging multiple tables from multiple databases with all rows and columns

I have 30 databases from a survey application that all have a table of results with approximately 100 columns in each. Most of the columns are identical but each survey seems to have a unique column or two added in with no real pattern (these are the added questions and results of the survey). As I am working on the statement to join all of the tables into one large master table the code is getting quite complex. Is there a more efficient way to merge these tables from multiple databases and just select all rows and columns so it will merge if the column exists and create if it encounters a new column?
No, there isn't an automatic way to merge a bunch of similar, but not quite the same, tables into one. At least, not in any database system that I know of.
You could possibly automate something like that with a fairly simple script that relies on your database's information schema (or equivalent).
However, with only 30 tables and only a column or two different in each, I'm not sure it's worth it. A manual approach, with copying and pasting and making minor changes, would probably be faster.
Also, consider whether the "extra" columns that are unique to individual tables need to go into the combined table. The point of making a big single table is to process/analyze all the data together. If something only applies to a single source, this isn't possible.

Running an update query to a memo field

This may be a simple one (whether it is possible or not).. I have two tables which are identical with a set of records. I need to find the best way that when one of the memo fields is modified, that the modifications get carried to the identical record on the corrisponding identical table. In many cases the field may have hundreds of words.. The only way that i know how to potentially do this is using an update query, but i am wondering if the test within is really large, will it break? Or is there another way? Thanks, A
Since you're working with Access 2010 you can use an After Update data macro to copy the changes back to the main table:
For more information see
Create a data macro

PLSQL: Get number of records updated vs inserted when a merge statement is used

Merge will always give you the number of records merged regardless of how my records inserted or updated using SQL%ROWCOUNT.
But how to find out number of records that were actually inserted vs number of records that were actually updated.
I tried options from this post but this doesn't seem to work -
https://asktom.oracle.com/pls/asktom/f?p=100:11:0::NO::P11_QUESTION_ID:122741200346595110
Any help?
You cannot, in general, differentiate how a row affected by a MERGE statement was affected in order to get separate counts for inserted, updated, and deleted rows.
If you really need separate figures, you could issue separate INSERT and UPDATE statements though that is likely to be less efficient. There are non-general solutions that depend on particular query plans but those are going to be rather brittle and generally wouldn't be recommended.

Should I use a separate table for repetitive values (varchar)?

I have a table in which 3 rows of data are added per second and in which I intend to keep around 30M rows. (Older data will be removed).
I need to add a column: varchar(1000). I can't tell in advance what it's content will be but I do know it will be very repetitive : thousands to millions of rows will have the same value. It is usually around 200 characters long.
Since data is being added using a Stored Procedure I see two option
Add a column varchar(1000)
Create a table (int id,varchar(1000) value)
Within the StoredProcedure, look if the value exist in that other table or create it
I would expect this other table to have a maximum of 100 value at all time.
I know some of the tradeoff between these two options but I have difficulty making up my mind on the question.
Option 1 is heavier but I get faster inserts. Requires less joins hence query are simpler.
Option 2 is lighter insert take longers but query have the potential to be faster. I think I'm closer to normal form but then I also have a table with a single meaningful column.
From the information I gave you, which option seems better? (You can also come up with another option).
You should also investigate page compression, perhaps you can do the simple thing and still get a a small(ish) table. Although, if you say is SQL Express, you won't be able to use it as is an Enterprise Edition requirement.
I have used repeatedly in my projects your second approach. Every insert would have to go through a stored procedure that gets the lookup value id, or inserts a new one if not found and returns the id. Specially for such large columns like your seems to be, with plenty of rows yet so few distinct values, the saving in space should trump the additional overhead of the foreign key and lookup cost in query joins. See also Disk is Cheap... That's not the point!.

`active' flag or not?

OK, so practically every database based application has to deal with "non-active" records. Either, soft-deletions or marking something as "to be ignored". I'm curious as to whether there are any radical alternatives thoughts on an `active' column (or a status column).
For example, if I had a list of people
CREATE TABLE people (
id INTEGER PRIMARY KEY,
name VARCHAR(100),
active BOOLEAN,
...
);
That means to get a list of active people, you need to use
SELECT * FROM people WHERE active=True;
Does anyone suggest that non active records would be moved off to a separate table and where appropiate a UNION is done to join the two?
Curiosity striking...
EDIT: I should make clear, I'm coming at this from a purist perspective. I can see how data archiving might be necessary for large amounts of data, but that is not where I'm coming from. If you do a SELECT * FROM people it would make sense to me that those entries are in a sense "active"
Thanks
You partition the table on the active flag, so that active records are in one partition, and inactive records are in the other partition. Then you create an active view for each table which automatically has the active filter on it. The database query engine automatically restricts the query to the partition that has the active records in it, which is much faster than even using an index on that flag.
Here is an example of how to create a partitioned table in Oracle. Oracle doesn't have boolean column types, so I've modified your table structure for Oracle purposes.
CREATE TABLE people
(
id NUMBER(10),
name VARCHAR2(100),
active NUMBER(1)
)
PARTITION BY LIST(active)
(
PARTITION active_records VALUES (0)
PARTITION inactive_records VALUES (1)
);
If you wanted to you could put each partition in different tablespaces. You can also partition your indexes as well.
Incidentally, this seems a repeat of this question, as a newbie I need to ask, what's the procedure on dealing with unintended duplicates?
Edit: As requested in comments, provided an example for creating a partitioned table in Oracle
Well, to ensure that you only draw active records in most situations, you could create views that only contain the active records. That way it's much easier to not leave out the active part.
We use an enum('ACTIVE','INACTIVE','DELETED') in most tables so we actually have a 3-way flag. I find it works well for us in different situations. Your mileage may vary.
Moving inactive stuff is usually a stupid idea. It's a lot of overhead with lots of potential for bugs, everything becomes more complicated, like unarchiving the stuff etc. What do you do with related data? If you move all that, too, you have to modify every single query. If you don't move it, what advantage were you hoping to get?
That leads to the next point: WHY would you move it? A properly indexed table requires one additional lookup when the size doubles. Any performance improvement is bound to be negligible. And why would you even think about it until the distant future time when you actually have performance problems?
I think looking at it strictly as a piece of data then the way that is shown in the original post is proper. The active flag piece of data is directly dependent upon the primary key and should be in the table.
That table holds data on people, irrespective of the current status of their data.
The active flag is sort of ugly, but it is simple and works well.
You could move them to another table as you suggested. I'd suggest looking at the percentage of active / inactive records. If you have over 20 or 30 % inactive records, then you might consider moving them elsewhere. Otherwise, it's not a big deal.
Yes, we would. We currently have the "active='T/F'" column in many of our tables, mainly to show the 'latest' row. When a new row is inserted, the previous T row is marked F to keep it for audit purposes.
Now, we're moving to a 2-table approach, when a new row is inserted, the previous row is moved to an history table. This give us better performance for the majority of cases - looking at the current data.
The cost is slightly more than the old method, previously you had to update and insert, now you have to insert and update (ie instead of inserting a new T row, you modify the existing row with all the new data), so the cost is just that of passing in a whole row of data instead of passing in just the changes. That's hardly going to make any effect.
The performance benefit is that your main table's index is significantly smaller, and you can optimise your tablespaces better (they won't grow quite so much!)
Binary flags like this in your schema are a BAD idea. Consider the query
SELECT count(*) FROM users WHERE active=1
Looks simple enough. But what happens when you have a large number of users, so many that adding an index to this table would be required. Again, it looks straight forward
ALTER TABLE users ADD INDEX index_users_on_active (active)
EXCEPT!! This index is useless because the cardinality on this column is exactly two! Any database query optimiser will ignore this index because of it's low cardinality and do a table scan.
Before filling up your schema with helpful flags consider how you are going to access that data.
https://stackoverflow.com/questions/108503/mysql-advisable-number-of-rows
We use active flags quite often. If your database is going to be very large, I could see the value in migrating inactive values to a separate table, though.
You would then only require a union of the tables when someone wants to see all records, active or inactive.
In most cases a binary field indicating deletion is sufficient. Often there is a clean up mechanism that will remove those deleted records after a certain amount of time, so you may wish to start the schema with a deleted timestamp.
Moving off to a separate table and bringing them back up takes time. Depending on how many records go offline and how often you need to bring them back, it might or might not be a good idea.
If the mostly dont come back once they are buried, and are only used for summaries/reports/whatever, then it will make your main table smaller, queries simpler and probably faster.
We use both methods for dealing with inactive records. The method we use is dependent upon the situation. For records that are essentially lookup values, we use the Active bit field. This allows us to deactivate entries so they wont be used, but also allows us to maintain data integrity with relations.
We use the "move to separation table" method where the data is no longer needed and the data is not part of a relation.
The situation really dictates the solution, methinks:
If the table contains users, then several "flag" fields could be used. One for Deleted, Disabled etc. Or if space is an issue, then a flag for disabled would suffice, and then actually deleting the row if they have been deleted.
It also depends on policies for storing data. If there are policies for keeping data archived, then a separate table would most likely be necessary after any great length of time.
No - this is a pretty common thing - couple of variations depending on specific requirements (but you already covered them):
1) If you expect to have a whole BUNCH of data - like multiple terabytes or more - not a bad idea to archive deleted records immediately - though you might use a combination approach of marking as deleted then copying to archive tables.
2) Of course the option to hard delete a record still exists - though us developers tend to be data pack-rats - I suggest that you should look at the business process and decide if there is now any need to even keep the data - if there is - do so... if there isn't - you should probably feel free just to throw the stuff away.....again, according to the specific business scenario.
From a 'purist perspective' the realtional model doesn't differentiate between a view and a table - both are relations. So that use of a view that uses the discriminator is perfectly meaningful and valid provided the entities are correctly named e.g. Person/ActivePerson.
Also, from a 'purist perspective' the table should be named person, not people as the name of the relation reflects a tuple, not the entire set.
Regarding indexing the boolean, why not:
ALTER TABLE users ADD INDEX index_users_on_active (id, active) ;
Would that not improve the search?
However I don't know how much of that answer depends on the platform.
This is an old question but for those search for low cardinality/selectivity indexes, I'd like to propose the following approach that avoids partitioning, secondary tables, etc.:
The trick is to use "dateInactivated" column that stores the timestamp of when the record is inactivated/deleted. As the name implies, the value is NULL while the record is active, but once inactivated, write in the system datetime. Thus, an index on that column ends up having high selectivity as the number of "deleted" records grows since each record will have a unique (not strictly speaking) value.
Then your query becomes:
SELECT * FROM people WHERE dateInactivated is NULL;
The index will pull in just the right set of rows that you care about.
Filtering data on a bit flag for big tables is not really good in terms of performance. In case when 'active' determinate virtual deletion you can create 'TableName_delted' table with the same structure and move deleted data there using delete trigger.
That solution will help with performance and simplifies data queries.