SQL Server: why does my cascading delete get stuck?

I have a DB with about 100 tables. One table, Client, is referenced by almost all other tables (directly or indirectly).
I am trying to delete all data for one Client by executing this query:
DELETE FROM Client
WHERE Id = SomeNumber
This query should cascade-delete all rows in all the tables that are connected, directly or indirectly, to this Client Id.
The problem is that the query gets stuck and I don't understand why.
Here is the query plan (screenshot not reproduced):
I checked for locks with this script:
select * from sysprocesses where blocked > 0
but got no results. I didn't get any errors either, and I don't have any triggers in my DB.
I do see that a couple hundred rows from some tables are being deleted, but after a few seconds the query gets stuck.

You can see quite clearly in the plan that some of the dependent tables do not have an index on the foreign key column.
When a cascade happens, the plan starts by dumping all the deleted rows into an in-memory table. You can see this as a Table Spool at the top left of the plan, feeding off the Clustered Delete.
It then reads these rows back and joins them to the dependent tables. Those tables must have an index with the foreign key as the leading key; otherwise you get a full table scan.
This has happened in your case with a large number of tables, in some cases a double cascade, with two scans and a hash join.
When you create these indexes, don't just make one-column indexes. Create sensible indexes with other key columns and INCLUDE columns; just make sure the foreign key is the leading column.
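For instance, a minimal sketch in SQL Server syntax (the child table Orders and column ClientId are hypothetical stand-ins for your own dependent tables):

```sql
-- Index the foreign key so the cascading delete can seek instead of scan.
-- Keep the foreign key (ClientId) as the leading column; add further key
-- or INCLUDE columns only if your own queries benefit from them.
CREATE NONCLUSTERED INDEX IX_Orders_ClientId
    ON dbo.Orders (ClientId);
```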
I must say, with this many foreign keys you are always going to have some issues, and you may want to turn off CASCADE because of that.

Related

how to uniquely identify rows in two table copies

I have essentially two tables that are copies of each other. One is dynamic, with DML statements happening fairly constantly, so it serves as a stage table; the other is used to synchronize the changes from this stage table. So the tables can have different data at different times, and I use a MERGE statement to sync them. Something along these lines:
MERGE INTO source s
USING (
    SELECT *
    FROM stage
) st ON ( s.eim_product_id = st.eim_product_id )
...
The problem is that eim_product_id is neither a primary key, nor unique. So my merge statement essentially throws this error:
Error report -
ORA-30926: unable to get a stable set of rows in the source tables
And the only pseudo-column-style options I can think of are an identity column (id_seq INTEGER GENERATED ALWAYS AS IDENTITY) or a rowid. However, the problem is that this approach will not consistently identify the same row across both tables, right? I believe I need some kind of hash that does the job, but I'm unsure what the best and simplest approach would be in this case.
The rowid pseudo-column won't match between the tables, and isn't necessarily constant. Creating a hash could get expensive in terms of CPU, and an updated row in the first table wouldn't have a matching hash in the second table for the merge to find. If you only generate the hash at insert and never update it, then it's just a more expensive, complicated sequence.
Your best bet is an identity column with a unique constraint on the first table, copied to the second table by the merge: it is unique, is calculated efficiently once at insert, will always identify the same row in both tables, and need never change.
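A sketch of that approach, assuming Oracle 12c or later (copy_table is a hypothetical stand-in for the second table, where id_seq is a plain column rather than an identity):

```sql
-- Add a surrogate key to the stage table; Oracle populates existing rows too.
ALTER TABLE stage ADD (
    id_seq INTEGER GENERATED ALWAYS AS IDENTITY
           CONSTRAINT stage_id_seq_uq UNIQUE
);

-- Merge on the surrogate key instead of the non-unique eim_product_id.
MERGE INTO copy_table c
USING stage st
   ON (c.id_seq = st.id_seq)
WHEN MATCHED THEN
    UPDATE SET c.eim_product_id = st.eim_product_id  -- plus the other columns
WHEN NOT MATCHED THEN
    INSERT (id_seq, eim_product_id)
    VALUES (st.id_seq, st.eim_product_id);
```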

Scala Spark Cassandra update or insert rows on primary key match

I am migrating data from csv SQL files (1 per table) to a Cassandra database that is using a pre-determined and standardized format. As a result, I am doing transformations, joins, etc on the SQL data to get it matching this format before writing it to Cassandra. My issue is that this db migration is happening in batches (not all at once) and I cannot ensure that information from the multiple sides of a table join will be present when an entry to Cassandra is written.
ex.
Table 1 and Table 2 both have the partitioning and clustering keys (allowing the join, since their combination is unique) and are joined using a full outer join. With the way we are being given data, however, there is a chance that we get a record from Table 1 but not from Table 2 in a "batch" of data. When I perform the full outer join, no problems: the extra columns from the other table are added and just filled with nulls. In the next interval of data, I then receive the Table 2 portion that should previously have been joined to Table 1.
How do I get those entries combined?
I have looked for an update-or-insert type of method in Spark, depending on whether that set of partitioning and clustering keys already exists, but have not turned up anything. Is this the most efficient way? Will I just have to add every entry with a spark.sql query and then update/write?
Note: using uuids that would prevent the primary key conflict will not solve the issue, I do not want 2 partial entries. All data with that particular primary key needs to be in the same row.
Thanks for any help that you can provide!
I think you should be able to just write the data directly to Cassandra and not worry about it, assuming all the primary keys are the same.
Cassandra's inserts are really "insert or update", so I believe when you insert one side of a join, it will just leave some columns empty. Then, when you insert the other side of the join, it will update that row with the new columns.
Take this with a grain of salt, as I don't have a Spark + Cassandra cluster available to test and make sure.
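A minimal CQL sketch of that upsert behaviour (the table and values are hypothetical):

```sql
-- pk and ck together form the primary key of this hypothetical table.
CREATE TABLE demo (pk int, ck int, a text, b text, PRIMARY KEY (pk, ck));

INSERT INTO demo (pk, ck, a) VALUES (1, 1, 'from table 1');  -- b is left unset
INSERT INTO demo (pk, ck, b) VALUES (1, 1, 'from table 2');  -- same key: acts as an update

-- Row (1, 1) now has both a and b populated; neither insert nulled out
-- the other's column.
```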

Issue with big tables (no primary key available)

Table1 has around 10 lakh (1 million) records and does not contain any primary key. Retrieving the data using a SELECT command (with a specific WHERE condition) is taking a large amount of time. Can we reduce the retrieval time by adding a primary key to the table, or do we need to follow some other approach? Kindly help me.
A primary key does not have a direct effect on performance. But indirectly, it does, because when you add a primary key to a table, SQL Server creates a unique index (clustered by default) that is used to enforce entity integrity. You can also create your own unique indexes on a table. So, strictly speaking, the primary key itself does not affect performance, but the index used to enforce it does.
WHEN SHOULD PRIMARY KEY BE USED?
A primary key is needed for referring to a specific record.
To make your SELECTs run fast, you should consider adding an index on the appropriate columns used in your WHERE clause.
E.g. to speed up SELECT * FROM "Customers" WHERE "State" = 'CA', one should create an index on the State column.
A primary key will not help if the primary key column is not in your WHERE clause.
If you would like to make your query faster, you can create a non-clustered index on the columns in the WHERE clause. You may also want to INCLUDE columns in the index (it depends on your SELECT list).
The SQL optimizer will then seek on your indexes, which will make your query faster.
(But think about what happens when data is added to the table: INSERT operations may take longer if you create indexes on many columns.)
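A sketch in SQL Server syntax, reusing the Customers example from above (the INCLUDE columns are hypothetical):

```sql
-- Non-clustered index on the WHERE column; the INCLUDE columns let the
-- index cover the SELECT list, avoiding key lookups back into the table.
CREATE NONCLUSTERED INDEX IX_Customers_State
    ON Customers (State)
    INCLUDE (Name, City);  -- hypothetical columns the query actually selects
```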
It depends on the SELECT statement, and the size of each row in the table, the number of rows in the table, and whether you are retrieving all the data in each row or only a small subset of the data (and if a subset, whether the data columns that are needed are all present in a single index), and on whether the rows must be sorted.
If all the columns of all the rows in the table must be returned, then you can't speed things up by adding an index. If, on the other hand, you are only trying to retrieve a tiny fraction of the rows, then providing appropriate indexes on the columns involved in the filter conditions will greatly improve the performance of the query. If you are selecting all, or most, of the rows but only selecting a few of the columns, then if all those columns are present in a single index and there are no conditions on columns not in the index, an index can help.
Without a lot more information, it is hard to be more specific. There are whole books written on the subject, including:
Relational Database Index Design and the Optimizers
One way you can do it is to create indexes on your table. It's generally better to also create a primary key, which creates a unique index that by default will reduce the retrieval time.
The optimizer chooses an index scan if the index columns are referenced in the SELECT statement and if the optimizer estimates that an index scan will be faster than a table scan. Index files generally are smaller and require less time to read than an entire table, particularly as tables grow larger. In addition, the entire index may not need to be scanned. The predicates that are applied to the index reduce the number of rows to be read from the data pages.
Read more: Advantages of using indexes in database?

Does an index on a unique field in a table allow a select count(*) to happen instantly? If not why not?

I know just enough about SQL tuning to get myself in trouble. Today I was doing EXPLAIN plan on a query and I noticed it was not using indexes when I thought it probably should. Well, I kept doing EXPLAIN on simpler and simpler (and more indexable in my mind) queries, until I did EXPLAIN on
select count(*) from table_name
I thought for sure this would return instantly and that the explain would show use of an index, as we have many indexes on this table, including an index on the row_id column, which is unique. Yet the explain plan showed a FULL table scan, and it took several seconds to complete. (We have 3 million rows in this table).
Why would Oracle be doing a full table scan to count the rows in this table? I would like to think that since Oracle is already indexing unique fields, and has to track every insert and update on that table, it would be caching the row count somewhere. Even if it isn't, wouldn't it be faster to scan the entire index than to scan the entire table?
I have two theories. Theory one is that I am imagining how indexes work incorrectly. Theory two is that some setting or parameter somewhere in our Oracle setup is interfering with Oracle's ability to optimize queries (we are on Oracle 9i). Can anyone enlighten me?
Oracle does not cache COUNT(*).
MySQL with MyISAM does (it can afford to), because MyISAM is transactionless and the same COUNT(*) is visible to everyone.
Oracle is transactional, and a row deleted in another transaction is still visible to your transaction.
Oracle has to scan the table, see that a row is deleted, visit the UNDO, make sure the row is still in place from your transaction's point of view, and then add it to the count.
Indexing a UNIQUE value differs from indexing a non-UNIQUE one only logically.
In fact, you can create a UNIQUE constraint over a column with a non-unique index defined, and the index will be used to enforce the constraint.
If a column is marked as NOT NULL, then an INDEX FAST FULL SCAN over this column can be used for the COUNT.
It's a special access method, used for cases when the index order is not important: it does not traverse the B-Tree, but instead just reads the index pages sequentially.
Since an index has fewer pages than the table itself, the COUNT can be faster with an INDEX_FFS than with a FULL scan.
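A sketch of steering Oracle toward that plan, assuming the unique, indexed row_id column from the question:

```sql
-- Declaring the column NOT NULL guarantees every row appears in its index,
-- so the index alone can answer COUNT(*).
ALTER TABLE table_name MODIFY (row_id NOT NULL);

-- Optional hint asking for the fast full scan explicitly.
SELECT /*+ INDEX_FFS(t) */ COUNT(*) FROM table_name t;
```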
It is certainly possible for Oracle to satisfy such a query with an index (specifically with an INDEX FAST FULL SCAN).
In order for the optimizer to choose that path, at least two things have to be true:
Oracle has to be certain that every row in the table is represented in the index -- basically, that there are no NULL entries that would be missing from the index. If you have a primary key this should be guaranteed.
Oracle has to calculate the cost of the index scan as lower than the cost of a table scan. I don't think it is necessarily true to assume that an index scan is always cheaper.
Possibly, gathering statistics on the table would change the behavior.
Expanding a little on the "transactions" reason: when a database supports transactions, at any point in time there may be records in different states, even a "deleted" state. If a transaction fails, those states are rolled back.
A full table scan is done so that the current "version" of each record can be accessed as of that point in time.
MySQL's MyISAM doesn't have this problem, since it uses table locking instead of the record locking that transactions require, and it caches the record count, so the count is always returned instantly. InnoDB under MySQL works the same way as Oracle, but returns an estimate.
You may be able to get a quicker query by counting the distinct values of the primary key; then only the index would be accessed.

faster way to use sets in MySQL

I have a MySQL 5.1 InnoDB table (customers) with the following structure:
int record_id (PRIMARY KEY)
int user_id (ALLOW NULL)
varchar[11] postcode (ALLOW NULL)
varchar[30] region (ALLOW NULL)
..
..
..
There are roughly 7 million rows in the table. Currently, the table is being queried like this:
SELECT * FROM customers WHERE user_id IN (32343, 45676, 12345, 98765, 66010, ...
in the actual query, over 560 user_ids are currently in the IN clause. With several million records in the table, this query is slow!
There are secondary indexes on the table, the first of which is on user_id itself, which I thought would help.
I know that SELECT * is A Bad Thing, and this will be expanded to the full list of fields required. However, the fields not listed above are more ints and doubles; there are another 50 of those being returned, but they are needed for the report.
I imagine there's a much better way to access the data for these user_ids, but I can't think how to do it. My initial reaction is to remove the ALLOW NULL on the user_id field, as I understand NULL handling slows down queries?
I'd be very grateful if you could point me in a more efficient direction than the IN ( ) method.
EDIT
Ran EXPLAIN, which said:
select_type = SIMPLE
table = customers
type = range
possible_keys = userid_idx
key = userid_idx
key_len = 5
ref = (NULL)
rows = 637640
Extra = Using where
does that help?
First, check that there is an index on user_id and make sure it's used.
You can check by running EXPLAIN.
Second, create a temporary table, fill it with your ids, and use it in a JOIN:
CREATE TEMPORARY TABLE temptable (user_id INT NOT NULL, PRIMARY KEY (user_id));

INSERT INTO temptable (user_id) VALUES (32343), (45676), (12345), ...;

SELECT c.*
FROM temptable t
JOIN customers c
  ON c.user_id = t.user_id;
Third, how many rows does your query return?
If it returns almost all rows, then it will just be slow, since it has to pump all those millions of rows over the connection channel, to begin with.
NULL will not slow your query down, since an IN condition is only satisfied by non-NULL values, which are indexed.
Update:
The index is used and the plan is fine, except that it returns more than half a million rows.
Do you really need to put all 638,000 rows into the report?
Hope it's not printed: bad for rainforests, global warming and stuff.
Speaking seriously, you seem to need either aggregation or pagination in your query.
"Select *" is not as bad as some people think; row-based databases will fetch the entire row if they fetch any of it, so in situations where you're not using a covering index, "SELECT *" is essentially no slower than "SELECT a,b,c" (NB: There is sometimes an exception when you have large BLOBs, but that is an edge-case).
First things first: does your database fit in RAM? If not, get more RAM. No, seriously. Now, supposing your database is too huge to reasonably fit into RAM (say, > 32 GB), you should try to reduce the number of random I/Os, as they are probably what's holding things up.
I'll assume from here on that you're running proper server-grade hardware with a RAID controller in RAID 1 (or RAID 10, etc.) and at least two spindles. If you're not, go away and get that.
You could definitely consider using a clustered index. In MySQL InnoDB you can only cluster the primary key, which means that if something else is currently the primary key, you'll have to change it. Composite primary keys are ok, and if you're doing a lot of queries on one criterion (say user_id) it is a definite benefit to make it the first part of the primary key (you'll need to add something else to make it unique).
Alternatively, you might be able to make your query use a covering index, in which case you don't need user_id to be the primary key (in fact, it must not be). This will only happen if all of the columns you need are in an index which begins with user_id.
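A sketch of the clustering option in MySQL (user_id must be made NOT NULL first, since primary key columns cannot be nullable; keep a secondary index on record_id if you still look rows up by it):

```sql
-- Cluster the table on user_id so all rows for one user are stored together;
-- record_id is appended to keep the primary key unique.
ALTER TABLE customers
    MODIFY user_id INT NOT NULL,
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (user_id, record_id);
```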
As far as query efficiency is concerned, WHERE user_id IN (big list of IDs) is almost certainly the most efficient way of doing it from SQL.
BUT my biggest tips are:
Have a goal in mind, work out what it is, and when you reach it, stop.
Don't take anybody's word for it - try it and see
Ensure that your performance test system is the same hardware spec as production
Ensure that your performance test system has the same data size and kind as production (same schema is not good enough!).
Use synthetic data if it is not possible to use production data (copying production data may be logistically difficult, given that your database is > 32 GB, and may also violate security policies).
If your query is optimal (as it probably already is), try tuning the schema, then the database itself.
Is this your most important query? Is this a transactional table?
If so, try creating a clustered index on user_id. Your query might be slow because it still has to make random disk reads to retrieve the columns (key lookups), even after finding the matching records (an index seek on the user_id index).
If you cannot change the clustered index, then you might want to consider an ETL process (the simplest is a trigger that inserts into another table with better indexing). This should yield faster results.
Also note that such large queries may take some time to parse, so help the server out by putting the queried ids into a temp table if possible.
Are they the same ~560 ids every time, or different ids on different runs of the query?
You could just insert your 560 user_ids into a separate table (or even a temp table), stick an index on that table, and inner join it to your original table.
You can try inserting the ids you need to query on into a temp table and inner joining both tables. I don't know if that would help.