Are there problems with this 'Soft Delete' solution using EAV tables?

Are there problems with this 'Soft Delete' solution using EAV tables? - sql

I've read some information about the ugly side of just setting a deleted_at field in your tables to signify a row has been deleted.
Namely
http://richarddingwall.name/2009/11/20/the-trouble-with-soft-delete/
Are there any potential problems with taking a row from a table you want to delete and pivoting it into some EAV tables?
For instance.
Lets Say I have two tables deleted and deleted_row respectively described as follows.
mysql> describe deleted;
+------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| tablename | varchar(255) | YES | | NULL | |
| deleted_at | timestamp | YES | | NULL | |
+------------+--------------+------+-----+---------+----------------+
mysql> describe deleted_rows;
+--------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| entity | int(11) | YES | MUL | NULL | |
| name | varchar(255) | YES | | NULL | |
| value | blob | YES | | NULL | |
+--------+--------------+------+-----+---------+----------------+
Now when you wanted to delete a row from any table you would delete it from the table then insert it into these tables as such.
deleted
+----+-----------+---------------------+
| id | tablename | deleted_at |
+----+-----------+---------------------+
| 1 | products | 2011-03-23 00:00:00 |
+----+-----------+---------------------+
deleted_row
+----+--------+-------------+-------------------------------+
| id | entity | name | value |
+----+--------+-------------+-------------------------------+
| 1 | 1 | Title | A Great Product |
| 2 | 1 | Price | 55.00 |
| 3 | 1 | Description | You guessed it... it's great. |
+----+--------+-------------+-------------------------------+
A few things I see off the bat.
You'll need to use application logic
to do the pivot (Ruby, PHP, Python,
etc)
The table could grow pretty big
because I'm using blob to handle
the unknown size of the row value
Do you see any other glaring problems with this type of soft delete?

Why not mirror your tables with archive tables?
create table mytable(
col_1 int
,col_2 varchar(100)
,col_3 date
,primary key(col_1)
)
create table mytable_deleted(
delete_id int not null auto_increment
,delete_dtm datetime not null
-- All of the original columns
,col_1 int
,col_2 varchar(100)
,col_3 date
,index(col_1)
,primary key(delete_id)
)
And then simply add on-delete-triggers on your tables that inserts the current row in the mirrored table before the deletion? That would provide you with dead-simple and very performant solution.
You could actually generate the tables and trigger code using the data dictionary.
Note that I might not want to have a unique index on the original primary key (col_1) in the archive table, because you may actually end up deleting the same row twice over time if you are using natural keys. Unless you plan to hook up the archive tables in your application (for undo purposes) you can drop the index entirely. Also, I added the time of delete (deleted_dtm) and a surrogate key that can be used to delete the deleted (hehe) rows.
You may also consider range partitioning the archive table on deleted_dtm. This makes it pretty much effortless to purge data from the tables.

Related

Replacing for loop by sql

I have SQL for example
show tables from mydb;
It shows the list of table
|table1|
|table2|
|table3|
Then,I use sql sentence for each table.
such as "show full columns from table1 ;"
+----------+--------+-----------+------+-----+---------+----------------+---------------------------------+---------+
| Field | Type | Collation | Null | Key | Default | Extra | Privileges | Comment |
+----------+--------+-----------+------+-----+---------+----------------+---------------------------------+---------+
| id | bigint | NULL | NO | PRI | NULL | auto_increment | select,insert,update,references | |
| user_id | bigint | NULL | NO | MUL | NULL | | select,insert,update,references | |
| group_id | int | NULL | NO | MUL | NULL | | select,insert,update,references | |
+----------+--------+-----------+------+-----+---------+----------------+---------------------------------+---------+
So in this case I can use programming language such as .(this is not correct code just showing the flow)
tables = "show tables from mydb;"
for t in tables:
cmd.execute("show full columns from {t} ;")
However is it possible to do this in sql only?

If you are using MySQL you can use the system view - INFORMATION_SCHEMA.
It contains table name and column name (and other details). No loop is require and you can easily filter by other information, too.
SELECT *
FROM INFORMATION_SCHEMA.COLUMNS
If you are using Microsoft SQL Server, you can use the above command

MariaDB - insert record or update timestamp if record exists

Objective: cronjob runs a task; when completed successfully, insert new host record. If record exists, update timestamp to reflect this status.
# Table layout
> describe hosts_completed;
+-----------+---------------------+------+-----+-------------------+-----------------------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+---------------------+------+-----+-------------------+-----------------------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| timestamp | timestamp | NO | | CURRENT_TIMESTAMP | on update CURRENT_TIMESTAMP |
| hostname | varchar(32) | YES | MUL | NULL | |
+-----------+---------------------+------+-----+-------------------+-----------------------------+
# Current inventory
> select * from hosts_completed;
+----+---------------------+----------+
| id | timestamp | hostname |
+----+---------------------+----------+
| 10 | 2020-11-02 12:51:08 | myHost1 |
| 11 | 2020-11-02 14:32:16 | MyHost2 |
+----+---------------------+----------+
I want to update the status for myHost1 and my best shot would be like
> insert into hosts_completed(hostname) values("myHost1") ON DUPLICATE KEY UPDATE timestamp=now();
and it runs but adds a new record, it does not update the myHost1 record.
Where is the glitch?

The on duplicate key syntax requires a unique constraint on the column that is used to detect the conflict. Create it first:
alter table hosts_completed
add constraint unique_hostname
unique (hostname);
Note that this pre-requires no duplicates in the column (otherwise you need to housekeep your data before you can create the constraint).
Then you can use your curent query:
insert into hosts_completed(hostname)
values('myHost1')
on duplicate key update timestamp = now();

Select * sql query vs Select specific columns sql query [duplicate]

This question already has answers here:
Closed 12 years ago.
The community reviewed whether to reopen this question 2 months ago and left it closed:
Original close reason(s) were not resolved
Possible Duplicate:
Why is SELECT * considered harmful?
Probably a database nOOb question.
Our application has a table like the following
TABLE WF
Field | Type | Null | Key | Default | Extra |
+--------------------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| children | text | YES | | NULL | |
| w_id | int(11) | YES | | NULL | |
| f_id | int(11) | YES | | NULL | |
| filterable | tinyint(1) | YES | | 1 | |
| created_at | datetime | YES | | NULL | |
| updated_at | datetime | YES | | NULL | |
| status | smallint(6) | YES | | 1 | |
| visible | tinyint(1) | YES | | 1 | |
| weight | int(11) | YES | | NULL | |
| root | tinyint(1) | YES | | 0 | |
| mfr | tinyint(1) | YES | | 0 | |
+--------------------+-------------+------+-----+---------+----------------+
This table is expected to be upwards of ten million records. The schema is not expected to change much. I need to retrieve the columns f_id, children, status, visible, weight, root, mfr.
Which approach is faster for data retrieval?
1) Select * from WF where w_id = 1 AND status = 1;
I will strip the unnecessary columns in the application layer.
2) Select children,f_id,status,visible,weight,root,mfr from WF where w_id = 1 AND status = 1;
There is no need to strip the unnecessary columns as its pre-selected in the query.
Does any one have a real life benchmark as to which is faster. I know some say Select * is evil, but will MySQL respond faster while trying to get the whole chunk as opposed to retrieving selective columns?
I am using MySQL version: 5.1.37-1ubuntu5 (Ubuntu) and the application is Rails3 app.

As an example of how a select statement that includes a subset of columns can be significantly faster, it can use a covering index on the table that includes just those columns, potentially resulting in much better query performance.

If you return fewer columns there is less data to go across the network and less data for the database to process and it will almost always return faster. Databases also tend to be slower using select * because the database then has to go figure out what the columns are and thus do more work than when you specify. Further select * will often return bad results if the structure changes significantly. It may end up showing the user fields you don;t want them to see or if someone is silly enough to rearrange the columns, then the application may actually appear to show things in the wrong order or if doing an insert from the data, put them in the wrong column. It is almost alawys a poor practice to use selct * in production code.

Rewriting this subquery?

I am trying to build a new table such that the values in the existing table are NOT contained (but obviously the following checks for contained) in another table. Following is my table structure:
mysql> explain t1;
+-----------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-----------+---------------------+------+-----+---------+-------+
| id | int(11) | YES | | NULL | |
| point | bigint(20) unsigned | NO | MUL | 0 | |
+-----------+---------------------+------+-----+---------+-------+
mysql> explain whitelist;
+-------------+---------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------------------+------+-----+---------+----------------+
| id | bigint(20) unsigned | NO | PRI | NULL | auto_increment |
| x | bigint(20) unsigned | YES | | NULL | |
| y | bigint(20) unsigned | YES | | NULL | |
| geonetwork | linestring | NO | MUL | NULL | |
+-------------+---------------------+------+-----+---------+----------------+
My query looks like this:
SELECT point
FROM t1
WHERE EXISTS(SELECT source
FROM whitelist
WHERE MBRContains(geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)'))));
Explain:
+----+--------------------+--------------------+-------+-------------------+-----------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------------------+-------+-------------------+-----------+---------+------+------+--------------------------+
| 1 | PRIMARY | t1 | index | NULL | point | 8 | NULL | 1001 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | whitelist | ALL | _geonetwork | NULL | NULL | NULL | 3257 | Using where |
+----+--------------------+--------------------+-------+-------------------+-----------+---------+------+------+--------------------------+
The query is taking 6 seconds to execute for 1000 records in t1 which is unacceptable for me. How can I rewrite this query using Joins (or perhaps a faster way if that exists) if I don't have a column to join on? Even a stored procedure is acceptable I guess in the worst case. My goal is to finally create a new table containing entries from t1. Any suggestions?

Unless the query optimizer is failing, a WHERE EXISTS construct should result in the same plan as a join with a GROUP clause. Look at optimizing MBRContains(geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)')))), that's probably where your query is spending all its time. I don't have a suggestion for that, but here's your query written with a JOIN:
Select t1.point
from t1
join whitelist on MBRContains(whitelist.geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)'))))
group by t1.point
;
or to get the points in t1 not in whitelist:
Select t1.point
from t1
left join whitelist on MBRContains(whitelist.geonetwork, GeomFromText(CONCAT('POINT(', t1.point, ' 0)'))))
where whitelist.id is null
;

This seems like a case where de-nomalizing t1 might be beneficial. Adding a GeomFrmTxt column with a value of GeomFromText(CONCAT('POINT(', t1.point, ' 0)')) could speed up the query you already have.

SQL LIKE question

I was wondering if there's a drawback (other than bad practice) to using something like this
SELECT * FROM my_table WHERE id LIKE '1';
where id is an integer. I know you're supposed to use id=1 but I am writing a java program and if everything can use LIKE it'll be a lot easier for me. Also, so far, everything works fine; I get the correct query results, so if there is no drawback I will continue doing it like this.
edit: I am using MySQL.

MySQL will allow it, but will ignore the index:
mysql> describe METADATA_44;
+---------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------+--------------+------+-----+---------+-------+
| AtextId | int(11) | NO | PRI | NULL | |
| num | varchar(128) | YES | | NULL | |
| title | varchar(128) | YES | | NULL | |
| file | varchar(128) | YES | | NULL | |
| context | varchar(128) | YES | | NULL | |
| source | varchar(128) | YES | | NULL | |
+---------+--------------+------+-----+---------+-------+
6 rows in set (0.00 sec)
mysql> explain select * from METADATA_44 where Atextid like '7';
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
| 1 | SIMPLE | METADATA_44 | ALL | PRIMARY | NULL | NULL | NULL | 591 | Using where |
+----+-------------+-------------+------+---------------+------+---------+------+------+-------------+
mysql> explain select * from METADATA_44 where Atextid=7;
+----+-------------+-------------+-------+---------------+---------+---------+-------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+-------+---------------+---------+---------+-------+------+-------+
| 1 | SIMPLE | METADATA_44 | const | PRIMARY | PRIMARY | 4 | const | 1 | |
+----+-------------+-------------+-------+---------------+---------+---------+-------+------+-------+
1 row in set (0.00 sec)

You'd need to look at the Query Execution Plan on your RDBMS to verify that LIKE with no wildcards is treated as efficiently as an = would be. A quick test in SQL Server shows that it would give you an index scan rather than a seek so I guess it doesn't look at that when generating the plan and for SQL Server using = would be much more efficient. I don't have a MySQL install to test against.
Edit: Just to update this SQL Server seems to handle it fine and do a seek when the data type is varchar. When it is run against an int column though you get the scan. This is because it does an implicit conversion to varchar on the int column so can't use the index.

You are better off writing your query as
SELECT * FROM my_table WHERE id = 1;
otherwise mysql will have to typecast '1' to int which is the type of the column id
so obviously there is a small performance penalty, when u know the type of the column supply the value according to that type

Speed. [15-char filler as there's not much more to say]

Without using any wildcards with LIKE, is should be fine for your needs if the speed/efficiency is something you don't bother with.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Are there problems with this 'Soft Delete' solution using EAV tables? - sql

Related

Replacing for loop by sql

MariaDB - insert record or update timestamp if record exists

Select * sql query vs Select specific columns sql query [duplicate]

Rewriting this subquery?

SQL LIKE question

Categories

Resources