What is the best way to handle records in the database after end-to-end tests have passed? Should we set the affected records' deleted_at column to the current date, or remove these records entirely, since new tests will create more new records all the time?
The system will ignore records with the deleted_at flag set, but at the same time every new test will add new records, which effectively amounts to trashing the database.
What is the best practice to use?
We usually start a docker container with a clean database or an initial setup every time we run the tests.
You can also clean the database in a beforeAll() function.
There is no need to persist the records created by the tests, since you can add them again.
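If the tests have to run against a shared database instead of a throwaway container, a minimal cleanup that a beforeAll() hook could run might look like this (PostgreSQL syntax; the table names are placeholders):

-- wipe the tables the tests write to and reset their id sequences
TRUNCATE TABLE orders, order_items RESTART IDENTITY CASCADE;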
What I am trying to do
I am developing a web service which runs in multiple server instances, all accessing the same RDBMS (PostgreSQL). While the database is needed for persistence, it contains very little data, which is why every server instance has a cache of all the data. Further, the application is really simple in that it only ever inserts new rows into rather simple tables and selects that data in a scheduled fashion from all server instances (no updates or changes... only inserts and reads).
The way it is currently implemented
Basically I have a table which roughly looks like this:
CREATE TABLE my_table (  -- table name is illustrative
    id BIGSERIAL,
    creation_timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    -- further data columns...
);
The server is doing something like this every couple of seconds (pseudocode):
get all rows with creation_timestamp > lastMaxTimestamp
lastMaxTimestamp = max timestamp for all data just retrieved
insert new rows into application cache
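For illustration, the per-cycle query might look roughly like this (my_table stands for the table above; :last_max_timestamp is the value the application remembered from the previous cycle):

SELECT *
FROM my_table
WHERE creation_timestamp > :last_max_timestamp
ORDER BY creation_timestamp;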
The issue I am running into
The application skips certain rows when updating the caches. I analyzed the issue and figured out that the problem is caused in the following way:
1. One server instance is creating a new row in the context of a transaction. An id for the new row is retrieved from the associated sequence (id=n) and the creation_timestamp (with value ts_1) is set.
2. Another server does the same in the context of a different transaction. The new row in this transaction gets id=n+1 and a creation_timestamp ts_2 (where ts_1 < ts_2).
3. Transaction 2 finishes before transaction 1.
4. One of the servers executes a "select all rows with creation_timestamp > lastMaxTimestamp". It gets row n+1, but not row n. It sets lastMaxTimestamp to ts_2.
5. Transaction 1 completes.
6. Some time later the server from step 4 executes "select all rows with creation_timestamp > lastMaxTimestamp" again. But since lastMaxTimestamp=ts_2 and ts_2 > ts_1, row n will never be read on that server.
Note: CURRENT_TIMESTAMP has the same value during a transaction, which is the transaction start time.
So the application gets inconsistent data into its cache and can't get new rows based on the insertion timestamp OR based on the sequence id. Transaction isolation levels don't really change anything about the situation, since the problem is created in essence by transaction 2 finishing before transaction 1.
My question
Am I missing something? I am thinking there must be a straightforward way to get all new rows of an RDBMS, but I can't come up with a simple solution... at least not with a simple solution that is consistent. Extensive locking (e.g. of tables) wouldn't be acceptable for performance reasons. Simply trying to ensure that all ids from that sequence are seen seems a) complicated and b) not easily doable, since rollbacks during transactions can happen (which would lead to sequence ids not being used).
Does anyone have a solution?
After a lot of searching, I found the right keywords to google for... "transaction commit timestamp", which leads to all sorts of transaction timestamp tracking and system columns like xmin:
https://dba.stackexchange.com/questions/232273/is-there-way-to-get-transaction-commit-timestamp-in-postgres
This post has some more detailed information:
Questions about Postgres track_commit_timestamp (pg_xact_commit_timestamp)
In short:
you can turn on a PostgreSQL option (track_commit_timestamp) to track the timestamps of commits and compare those instead of the current_timestamp/clock_timestamp values inside the transaction (see the sketch after this list)
it seems, though, that the timestamp is only recorded once a transaction has completed, not at the moment it is committed, which makes the solution not bulletproof. There are also further issues to consider, like transaction id (xmin) rollover for example
logical decoding / replication is something to look into for a proper solution
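As a rough sketch of that first point (assuming track_commit_timestamp = on in postgresql.conf and the my_table example from above; :last_seen_commit_ts is whatever the application remembered from its previous poll):

-- requires track_commit_timestamp = on (server restart needed before commit timestamps are recorded)
SELECT t.*, pg_xact_commit_timestamp(t.xmin) AS committed_at
FROM my_table t
WHERE pg_xact_commit_timestamp(t.xmin) > :last_seen_commit_ts;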
Thanks to everyone trying to help me find an answer. I hope this summary is useful to someone in the future.
For an Android Launcher (Home Screen) app project I want to implement a feature called "Sort by usage". This will sort by the launch count of an app within a user-settable timeframe.
The current idea for the implementation is to store an array of unix epoch timestamps, one for each launch.
Additionally, it'll store a counter caching the current number of launches within the selected timeframe, incremented with every launch. Of course, this counter would regularly have to be rebuilt as time passes, but only every few hours or after some percentage of the selected timeframe, so the computation definitely wouldn't run as often as it would without the counter, since this information is required every time the app entries on screen need to be sorted - but I'm not quite sure if it matters in any way during actual use.
I am now unsure how to store the timestamp array inside the SQL database. As there is a table holding one record with information about each launcher entry, I thought about the following options:
1. Store the array of unix epochs in serialized form (maybe a JSON array) in one field of the entry's record
2. Create a separate table for launch times with
a. each record starting with an id associated with an entry, followed by all launch times, one per field
b. each record being a combination of entry id and one launch time
These options would obviously have the advantage of storing the timestamps using an appropriate type.
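For option 2b, a minimal sketch of such a table might be (all names are made up; launched_at stores a unix epoch):

CREATE TABLE launches (
    entry_id    INTEGER NOT NULL REFERENCES entries(id),
    launched_at INTEGER NOT NULL  -- unix epoch seconds of one launch
);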
I probably didn't quite understand why you need a second piece of data for your launch counter - the fact you saved a timestamp already means a launch - why not just count timestamps? Less updating, less record locking, more concurrency.
Now, let's say you've got a separate table with timestamps in a classic one to many setting.
Pros of this setup - you never need to update anything - just keep inserting. You can easily cluster your table by timestamp, run a filter on your timeframe and issue a group by and count rows. The client then will get the numbers and sort by count (I believe it's generally better to not sort in SQL). Cons - you need a join to parent table and probably need to get your indexes right.
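A rough sketch of that query, assuming a one-to-many launches(entry_id, launched_at) table as sketched above (:timeframe_start is the lower bound of the selected timeframe):

SELECT entry_id, COUNT(*) AS launch_count
FROM launches
WHERE launched_at >= :timeframe_start
GROUP BY entry_id;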
Alternatively you store timestamps in a blob text (JSON, CSV, whatever) with your main records. This definitely means you'll have to update your records a lot, which potentially opens you up to locking issues. Then, I'm not entirely sure what you'll have to do to get your final launch counts - you read all entities, deserialise all timestamps, filter by timeframe and then count? It does feel a bit more convoluted in your case.
I don't think there's such a thing as a "best" way. You have to consider pros and cons. From what I gather, you might be better off with the classic SQL approach unless there's something I didn't catch that will outweigh my points above.
I have an SQL table and a VB.NET application.
The application loads the SQL table into a DataTable, then it starts updating data in the records by fetching some websites; it takes an average of 1.4 seconds to fill a DataTable row with new data.
Now I was wondering if it's OK to use the SQL UPDATE command to update a single record in the SQL table and run it every time a record is updated, which means running the UPDATE command for a single record every 1.4 seconds.
The problem is that other applications use this table at the same time, and one of them writes to the same table but to other columns. Will the table get locked for other applications during this process?
SQL won't lock the table by default, but you probably should lock the table while updating it to prevent data corruption if those apps are doing alterations. Performance will take a small hit, yes, but better that than having to rebuild the table because it got messed up. This is a good explanation of locking:
http://www.developerfusion.com/article/84509/managing-database-locks-in-sql-server/
If the other applications are just querying the table while you're updating, there shouldn't be any impact, BUT they might get some odd results if they query it mid-update. Locking is mainly about the risk of two people modifying the same record at the same time.
You need to find out why it takes 1.4 seconds to update a single record. Chances are it's because VB.NET needs to do some processing (while it's fetching some websites). For example, it could be taking you 1.3 seconds to perform the necessary calculations (client time) and 0.1 second to update a single record (server time). In this case, you could perform the updates in batches, to minimize database access time.
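As an illustration of batching (assuming SQL Server and made-up table/column names), several per-row updates can be folded into one statement roughly like this:

UPDATE t
SET    t.fetched_data = v.fetched_data
FROM   my_table AS t
JOIN  (VALUES (1, 'data for row 1'),
              (2, 'data for row 2')) AS v(id, fetched_data)
       ON t.id = v.id;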
The table will get locked, but only for a short time, so you don't need to worry about that, in general.
What would be the easiest way to count the new records that are inserted into a database? Is it possible to include a count query in with the load query?
Or is something more complex needed, such as recording the existing last record and counting everything added after it?
edit:
I have a cron job that uses LOAD DATA INFILE in a script that is passed directly to mysql. This data is used with a PHP web application. As part of the PHP web application, I need to generate weekly reports, including how many records were inserted in the last week.
I am unable to patch mysql or drastically change the database schema/structure, but I am able to add new tables or fields. I would prefer not to count records from the CSV file and store the result in a text file or something. Instead, I would prefer to do everything from within PHP with queries.
Assuming you're using MySQL 5 or greater, you could create a trigger which would fire upon inserting into a specific table. Note that an "insert" trigger also fires with the "LOAD" command.
Using a trigger would require you to persist the count information in a separate table. Basically you'd need to create a new table with 1 row/column to hold the count. The trigger would then update that value with the amount of data loaded.
Here's the MySQL manual page on triggers; the syntax is fairly straightforward. http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html
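A minimal sketch of that idea (table and trigger names are made up; your_table stands for the table you LOAD into):

CREATE TABLE insert_count (total INT NOT NULL);
INSERT INTO insert_count (total) VALUES (0);

CREATE TRIGGER count_loaded_rows AFTER INSERT ON your_table
FOR EACH ROW
  UPDATE insert_count SET total = total + 1;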
edit
Alternatively, if you don't want to persist the data within the database, you could perform your "Load" operations within a stored procedure. This would allow you to perform a SELECT COUNT(*) on the table before you begin the Load and after the Load is complete. You would just need to subtract the resulting values to determine how many rows were inserted during the Load.
Here's the MySQL manual page on procedures.
http://dev.mysql.com/doc/refman/5.0/en/create-procedure.html
That would probably depend on what counts as being new. Is it entries entered into the database in the last five or ten minutes, etc.? Or is it any record past a certain auto ID?
If you are looking at a time-based method of determining what's new, you can have a field (probably of type datetime) that records the time when the record was inserted, and to get the number you simply do a...
select count(*) from your_table where inserted_at > 'time-you-consider-to-be-new'
If you don't want to go by recording the time, you can use an auto-increment key and simply keep track of the last inserted ID, then count the rows that come after it for any given time window. So if one hour ago the ID was 10000, a number of records have been inserted since then: count all records with an ID greater than 10000, note the new last-inserted ID, and repeat whenever needed.
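For instance (assuming the auto-increment column is named id and the last recorded value was 10000):

select count(*) from your_table where id > 10000;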
If you are not looking at a specific table, you can use the following:
show global status like "Com_%";
This will show you statistics for every type of query. These numbers just keep on counting, so if you want to use them, record the initial number when starting to track the queries, and subtract this from your final number (but yea, that's a given).
If you are looking for pure statistics, I can recommend using Munin with the MySQL plugins.
From where do you load the data? You might consider counting the records before you insert them into the database. If it's an SQL script, you might write a quick and dirty bash script (with grep or something similar) to count the fields.
You say you can't change the structure. Does that mean you can't change the table you are inserting into, or you can't change the database at all? If you can add a table, then just create a table with 2 columns - a timestamp and the key of the table you are loading. Before you load your csv file, create another csv file with just those two columns, and load that csv after your main one.
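A sketch of such a side table (names made up; main_key refers to the key of the table you are loading):

CREATE TABLE load_log (
    loaded_at TIMESTAMP NOT NULL,
    main_key  INT NOT NULL
);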
This might be simpler than you want, but what about a Nagios monitor to track the row count? (Also consider asking around on serverfault.com; this stuff is totally up their alley.)
Perhaps you could write a small shell script that queries the database for the number of rows. You could then have a Cron job that runs every minute/hour/day etc and outputs the COUNT to a log file. Over time, you could review the log file and see the rate at which the database is growing. If you also put a date in the log file, you could review it easier over longer periods.
See if this is the kind of MySQL data collection you're interested in: http://code.google.com/p/google-mysql-tools/wiki/UserTableMonitoring.
If that is the case, Google offers a MySQL patch (to apply to a clean mysql directory source) at http://google-mysql-tools.googlecode.com/svn/trunk/mysql-patches/all.v4-mysql-5.0.37.patch.gz. You can read more about the patch at http://code.google.com/p/google-mysql-tools/wiki/Mysql5Patches.
If this is not what you're looking for, I suggest you explain yourself a little more in order for us to help you better.
Could you use a trigger on the table which inserts into a table you created, whose structure includes a timestamp?
You could then use a date calculation on a period range to find the information needed.
I don't know what version of MySQL you are using, but here is a link to the syntax for trigger creation in version 5.0: http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html
Good luck,
Matt
Well, if you need exhaustive information: which rows were inserted, updated or deleted, it might make sense to create an additional audit table to store those things with a timestamp. You could do this with triggers. I would also write a stored procedure which would execute as an event and erase old entries (whatever you consider old).
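A rough sketch of that audit table plus an insert trigger (MySQL syntax; all names are made up, and your_table is assumed to have an id column):

CREATE TABLE audit_log (
    action     VARCHAR(10) NOT NULL,
    row_id     INT NOT NULL,
    changed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER audit_insert AFTER INSERT ON your_table
FOR EACH ROW
  INSERT INTO audit_log (action, row_id) VALUES ('INSERT', NEW.id);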
Refer to the link posted by Lima on how to create triggers in MySQL.
Refer to page 655 of "MySQL Cookbook" by Paul Dubois (2nd Edition) or page 158 of "SQL for smarties" by Joe Celko.
So will the 'load' only insert new data into the table? Or rewrite the whole table?
If it will load new data, then you can do a select count(*) from yourtable once before the loading and once after the loading... the difference will show you how many new records were inserted.
If, on the other hand, you rewrite the whole table and want to find the records that differ from the previous version, then you would need a completely different approach.
Which one is it?
Your question is a bit ambiguous, but the MySQL C API provides a function "mysql_affected_rows" that you can call after each query to get the number of affected rows. For an insert it returns the number of rows inserted. Be aware that for updates it returns the number of rows changed, not the number of rows that matched the where clause.
If you are performing a number of queries and need to know how many were inserted, the most reliable way would probably be doing a count before and after the queries.
As noted in sobbayi's answer adding a "created at" timestamp to your tables would allow you to query for records created after (or before) a given time.
UPDATE:
OK here is what you need to do to get a count before and after:
create a table for the counts:
create table row_counts (ts timestamp not null, row_count integer not null);
in your script, add the following before and after your LOAD DATA INFILE query:
insert into row_counts (ts,row_count) select now(),count(0) from YOUR_TABLE;
LOAD DATA INFILE ......
insert into row_counts (ts,row_count) select now(),count(0) from YOUR_TABLE;
the row_counts table will now have the count before and after your load.
show global status like 'Com_insert';
flush status and show session status... will work for just the current connection.
see http://dev.mysql.com/doc/refman/5.1/en/server-status-variables.html#statvar_Com_xxx
Since you asked for the easiest way, I would suggest you use a trigger on insert. You could use a single-column, single-row table as a counter and update it with the trigger.
I want to setup a mechanism for tracking DB schema changes, such the one described in this answer:
For every change you make to the database, you write a new migration. Migrations typically have two methods: an "up" method in which the changes are applied and a "down" method in which the changes are undone. A single command brings the database up to date, and can also be used to bring the database to a specific version of the schema.
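For illustration, a single migration's "up"/"down" pair might boil down to plain DDL like this (table and column names are made up):

-- up
ALTER TABLE users ADD COLUMN nickname VARCHAR(50);
-- down
ALTER TABLE users DROP COLUMN nickname;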
My question is the following: Is every DDL command in an "up" method reversible? In other words, can we always provide a "down" method? Can you imagine any DDL command that can not be "down"ed?
Please do not consider the typical data migration problem where, during the "up" method, we have loss of data: e.g. when changing a field type from datetime (DateOfBirth) to int (YearOfBirth), we lose data that cannot be restored.
In SQL Server, every DDL command that I know of has an up/down pair.
Other than loss of data, every migration I've ever done is reversible. That said, Rails offers a way to mark a migration as "destructive":
Some transformations are destructive in a manner that cannot be reversed. Migrations of that kind should raise an ActiveRecord::IrreversibleMigration exception in their down method.
See the API documentation here.
Yes, you've identified cases where you lose data, either by transforming it or by simply dropping a column (DROP COLUMN) in the "up" migration.
Another example is that you could drop a SEQUENCE object, thus losing its state. The "down" migration would recreate the sequence, but it would start over at 1. This could cause duplicate values to be generated by the sequence. Not a problem if you're performing a migration on an empty database, and you want the sequence to start at 1 anyway, but if you have some number of rows of data, you'd want the sequence to be reset to the greatest value currently in use, which is hard to do reliably, unless you have an exclusive lock on that table.
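As an illustration of that "down" step (PostgreSQL syntax; sequence, table and column names are made up), recreating the sequence and moving it past the values already in use might look like:

CREATE SEQUENCE my_seq;
SELECT setval('my_seq', COALESCE((SELECT MAX(id) FROM my_table), 1));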
Any other DDL that is dependent on the state of data in the database has similar problems. That's probably not a good schema design in the first place, I'm just trying to think of any cases that fit your question.