I have essentially two tables that are copies of each other. One is dynamic, with DML statements happening quite constantly, so it serves as a stage table; the other is used to synchronize the changes from this stage table. The tables can therefore hold different data at different times, and I use a merge statement to sync them. Something along these lines:
MERGE INTO source s
USING (
    SELECT *
    FROM stage
) st ON ( s.eim_product_id = st.eim_product_id )
...
The problem is that eim_product_id is neither a primary key nor unique, so my merge statement essentially throws this error:
Error report -
ORA-30926: unable to get a stable set of rows in the source tables
The only candidates I can think of to use are something like an identity column (id_seq INTEGER GENERATED ALWAYS AS IDENTITY) or the rowid pseudo-column. However, the problem is that this approach will not consistently identify the same row across both tables, right? I believe I need some kind of hash that does the job, but I'm unsure what would be the best and simplest approach in this case.
The rowid pseudo-column won't match between the tables, and isn't necessarily constant. Creating a hash could get expensive in terms of CPU; an updated row in the first table wouldn't have a matching hash in the second table for the merge to find. If you only generate the hash at insert and never update then it's just a more expensive, complicated sequence.
Your best bet is an identity column with a unique constraint on the first table, copied to the second table by the merge: it is unique, is calculated (very efficiently) only once at insert, will always identify the same row in both tables, and need never change.
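Roughly something like this (untested, and with the second table called target here for clarity; only eim_product_id comes from your question, the other names are invented):
-- identity key on the stage table, with a unique constraint
ALTER TABLE stage ADD (row_key INTEGER GENERATED ALWAYS AS IDENTITY);
ALTER TABLE stage ADD CONSTRAINT stage_row_key_uq UNIQUE (row_key);
-- plain copy of the key on the second table
ALTER TABLE target ADD (row_key INTEGER);
ALTER TABLE target ADD CONSTRAINT target_row_key_uq UNIQUE (row_key);

MERGE INTO target t
USING stage st
ON ( t.row_key = st.row_key )
WHEN MATCHED THEN
  UPDATE SET t.eim_product_id = st.eim_product_id   -- plus the other columns
WHEN NOT MATCHED THEN
  INSERT ( row_key, eim_product_id )
  VALUES ( st.row_key, st.eim_product_id );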
I have a DB with about 100 tables. One table Client is referenced by almost all other tables (directly or indirectly).
I am trying to delete all data of one Client by executing this query:
DELETE FROM Client
WHERE Id = SomeNumber
This query should CASCADE delete all rows in all the tables that are connected, directly or indirectly, to this Id of Client.
The problem is that the query is getting stuck and I don't understand why.
This is the query plan. I checked for locks with this script
select * from sysprocesses where blocked > 0
but got no results. I didn't get any errors either, and I don't have any triggers in my DB.
I do see that a couple of hundred rows from some tables are being deleted, but after a few seconds the query gets stuck.
You can see quite clearly in the plan that some of the dependent tables do not have indexes on the foreign key.
When a cascade happens, the plan starts by dumping all rows into an in-memory table. You can see this with a Table Spool at the top left of the plan, feeding off the Clustered Delete.
Then it reads these rows back, and joins them to the dependent tables. These tables must have an index with the foreign key being the leading key, otherwise you will get a full table scan.
This has happened in your case with a large number of tables, in some cases double-cascaded with two scans and a hash join.
When you create indexes, don't just make a one-column index. Create sensible indexes with other columns and INCLUDE columns, just make sure the foreign key is the leading column.
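For example, something along these lines for one of the dependent tables (the table and column names here are invented):
-- hypothetical dependent table Orders referencing Client(Id) via ClientId
CREATE NONCLUSTERED INDEX IX_Orders_ClientId
    ON dbo.Orders (ClientId)               -- foreign key as the leading column
    INCLUDE (OrderDate, TotalAmount);      -- extra columns your queries need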
I must say, with this many foreign keys, you are always going to have some issues, and you may want to turn off CASCADE because of that.
I know it's better to use INSERT WHERE NOT EXISTS than a plain INSERT, since the latter can lead to duplicated records or unique key violation issues.
But with respect to performance, will it make any big difference?
INSERT WHERE NOT EXISTS internally triggers an extra SELECT statement to check whether the record already exists. In the case of large tables, which is recommended: INSERT or INSERT WHERE NOT EXISTS?
And could someone please explain the difference in execution cost between the two?
Most Oracle IN clause queries involve a series of literal values, and when a table is present a standard join is better. In most cases the Oracle cost-based optimizer will create an identical execution plan for IN vs EXISTS, so there is no difference in query performance.
The EXISTS keyword evaluates to true or false, but the IN keyword compares all values in the corresponding subquery column. If you are using the IN operator, the SQL engine will scan all records fetched by the inner query. On the other hand, if you are using EXISTS, the SQL engine stops scanning as soon as it finds a match.
The EXISTS subquery is used when we want to display all rows where we have a matching column in both tables. In most cases, this type of subquery can be rewritten with a standard join to improve performance.
The EXISTS clause is much faster than IN when the subquery result set is very large. Conversely, the IN clause is faster than EXISTS when the subquery result set is very small.
Also, the IN clause can't compare anything with NULL values, but the EXISTS clause can compare everything with NULLs.
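As a rough illustration, with invented table and column names, these two forms express the same filter and will often get the same plan:
-- IN form
SELECT o.*
FROM   orders o
WHERE  o.customer_id IN ( SELECT c.customer_id
                          FROM   customers c
                          WHERE  c.region = 'EMEA' );

-- EXISTS form
SELECT o.*
FROM   orders o
WHERE  EXISTS ( SELECT 1
                FROM   customers c
                WHERE  c.customer_id = o.customer_id
                AND    c.region = 'EMEA' );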
It's not a matter of "what's fastest" but a matter of "what's correct".
When you INSERT into a table (without any restriction) you simply add records to that table. If an identical record was already in there, you then end up with two such records. This may be fine or it may be an issue, depending on your needs (**).
When you add a WHERE NOT EXISTS() to your INSERT construction the system will only add records to the table that aren't there yet, thus avoiding the situation of ending up with multiple identical records.
(**: suppose you have a unique or primary key constraint on the target table; then the INSERT of a duplicate record will result in a UQ/PK violation error. If your question was "What's fastest: try to insert the row and simply ignore the error if one occurs, versus insert WHERE NOT EXISTS() and avoid the error?", then I can't give you a conclusive answer, but I'm fairly certain it would be a close call. What I can say, however, is that the WHERE NOT EXISTS() approach looks much nicer in code and (importantly!) also works for set-based operations; the try/catch approach will fail for the entire set even if only one record causes an issue.)
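A minimal sketch of the set-based WHERE NOT EXISTS() form, with invented table and column names:
-- insert only the source rows whose key is not already in the target
INSERT INTO target_table ( id, name )
SELECT s.id, s.name
FROM   source_table s
WHERE  NOT EXISTS ( SELECT 1
                    FROM   target_table t
                    WHERE  t.id = s.id );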
INSERT will check the inserted data against any existing schema constraints: PK, FK, unique indexes, NOT NULL and any other custom constraints, whatever the table schema demands. If those checks pass, the row is inserted and the process loops on to the next row.
INSERT WHERE NOT EXISTS will, prior to the above checks, compare the data of all columns of the row against the data of all rows of the table. If even one column is different, the row counts as new and it then proceeds exactly like the INSERT above.
The performance impact mostly depends on:
1. the number of existing rows in the table
2. the size of each row
So as the table gets larger and the rows get wider, the difference grows.
I am migrating data from CSV files of SQL tables (one per table) to a Cassandra database that uses a pre-determined, standardized format. As a result, I am doing transformations, joins, etc. on the SQL data to get it matching this format before writing it to Cassandra. My issue is that this migration is happening in batches (not all at once) and I cannot ensure that information from both sides of a table join will be present when an entry is written to Cassandra.
For example:
Table 1 and Table 2 both have the partitioning and clustering keys (allowing the join, since their combination is unique) and are joined using a full outer join. With the way we are being given data, however, there is a chance that we could get a record from Table 1 but not from Table 2 in a "batch" of data. When I perform the full outer join there are no problems: the extra columns from the other table are added and simply filled with nulls. On the next interval that I get data, I then receive the Table 2 portion that should have previously been joined to Table 1.
How do I get those entries combined?
I have looked for an update-or-insert type method in Spark, depending on whether that set of partitioning and clustering keys already exists, but have not turned up anything. Is this the most efficient way? Will I just have to add every entry with a spark.sql query and then update/write?
Note: using UUIDs to prevent the primary key conflict will not solve the issue; I do not want two partial entries. All data with that particular primary key needs to be in the same row.
Thanks for any help that you can provide!
I think you should be able to just write the data directly to Cassandra and not have to worry about it, assuming all primary keys are the same.
Cassandra's inserts are really "insert or update" so I believe when you insert one side of a join, it will just leave some columns empty. Then when you insert the other side of the join, it will update that row with the new columns.
Take this with a grain of salt, as I don't have a Spark+Cassandra cluster available to test and make sure.
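For illustration only (CQL, with an invented keyspace, table and columns; I haven't run this):
-- both statements target the same primary key (pk, ck)
INSERT INTO ks.merged (pk, ck, col_from_table1) VALUES (1, 'a', 'x');

-- a later batch writes the other side of the join: same key, different column,
-- so the existing row is updated and col_from_table1 keeps its value
INSERT INTO ks.merged (pk, ck, col_from_table2) VALUES (1, 'a', 'y');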
I have a question regarding Oracle.
I know that Oracle only supports the use of aliases up to the first subquery level. This poses a problem when I want to group more than once while updating a table.
Example: I have some server groups and a database containing information about them. I have one table that contains information about the groups and one table where I store, with a timestamp (to be exact, I actually used DATE), the workload of specific servers within the groups.
Now, for performance reasons, I have a denormalized field in the server table containing the highest workload the group had within one day.
What I would like to do is something like
update server_group
set last_day_workload=avg(workload1)
from (select max(workload) workload1
from server_performance
where server_performance.server_group_ID_fk=server_group.ID
and time>sysdate-1
group by server_performance.server_group_ID_fk)
Here ID is the primary key of server_group and server_group_ID_fk is a foreign key reference from the server_performance table. The solution I am using so far is to write the first join into a temporary table and then update from that temporary table in the next statement. Is there a better way to do this?
In this case it isn't such a problem yet, but as the amount of data increases, using a temporary table costs not only time but also a notable amount of RAM.
Thank you for your answers!
If I were you, I would work out the results that I wanted in a select statement, and then use a MERGE statement to do the necessary update.
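Something like this, perhaps (untested; I've used MAX since the stated goal is the highest workload within a day, though your pseudo-code also mentions AVG):
MERGE INTO server_group sg
USING ( SELECT sp.server_group_ID_fk,
               MAX(sp.workload) AS max_workload
        FROM   server_performance sp
        WHERE  sp.time > SYSDATE - 1
        GROUP  BY sp.server_group_ID_fk ) x
ON ( sg.ID = x.server_group_ID_fk )
WHEN MATCHED THEN
  UPDATE SET sg.last_day_workload = x.max_workload;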
I want to optimize the storage of a big table by moving the values of varchar columns out to an external lookup table (there are many duplicated values).
The process of doing this is very technical in its nature (creating a lookup table and referencing it instead of the actual value), and it sounds like it should be part of the infrastructure (SQL Server in this case, or any RDBMS).
Then I thought it should be an option of an index: do not store duplicate values, only a reference to the duplicated value.
Can an index be optimized in such a manner, not holding duplicated values but just a reference?
It should make the size of the table and index much smaller when there are many duplicated values.
SQL Server cannot do deduplication of column values. An index stores one row for each row of the base table. They are just sorted differently.
If you want to deduplicate you can keep a separate table that holds all possible (or actually occurring) values with a much shorter ID. You can then refer to the values by only storing their ID.
You can maintain that deduplication table in the application code or using triggers.
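For example, a minimal sketch of such a lookup table, with made-up names:
-- lookup table holding each distinct string once
CREATE TABLE dbo.CityLookup (
    CityId   int IDENTITY(1,1) PRIMARY KEY,
    CityName varchar(200) NOT NULL UNIQUE
);

-- the big table stores only the short ID instead of the repeated varchar
CREATE TABLE dbo.BigTable (
    Id     bigint IDENTITY(1,1) PRIMARY KEY,
    CityId int NOT NULL REFERENCES dbo.CityLookup (CityId)
    -- ... other columns ...
);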