Is it viable to have a SQL table with only one row and one column? - sql

I'm currently working on my first application that uses a database so I'm very new to this. The database has multiple tables that are what you would expect to normally see.
However, I created one table which only has one row and one column used to keep a count of the total items processed by the program so it's available to access elsewhere. I can't just use
SELECT COUNT(*) FROM table_name
because these items that I am processing I do not want to actually keep in a table.
It seems like a waste to use a table to store one value so I am wondering if there a better way to keep track of this value.

What is your table storing? it's storing some kind of processing audit. So make it a little more useful - add a column storing the last datetime that the data was processed. Add a column for the time it took to process. Add another column which stores the username (or some identifier) of whoever ran the process. Now add a row for every table that is processed (there's only one now but there might be more in future). Try and envisage how your processing is going to grow in future

Related

Keeping track of mutated rows in BigQuery?

I have a large table whose rows get updated/inserted/merged periodically from a few different queries. I need a scheduled process to run (via API) to periodically check for which rows in that table were updated since the last check. So here are my issues...
When I run the merge query, I don't see a way for it to return which records were updated... otherwise, I could be copying those updated rows to a special updated_records table.
There are no triggers so I can't keep track of mutations that way.
I could add a last_updated timestamp column to keep track that way, but then repeatedly querying the entire table all day for that would be a huge amount of data billed (expensive).
I'm wondering if I'm overlooking something obvious or if maybe there's some kind of special BQ metadata that could help?
The reason I'm attempting this is that I'm wanting to extract and synchronize a smaller subset of this table into my PostgreSQL instance because the latency for querying BQ is just too much for smaller queries.
Any ideas? Thanks!
One way is to periodically save intermediate state of the table using the time travel feature. Or store only the diffs. I just want to leave this option here:
FOR SYSTEM_TIME AS OF references the historical versions of the table definition and rows that were current at timestamp_expression.
The value of timestamp_expression has to be within last 7 days.
The following query returns a historical version of the table from one hour ago.
SELECT * FROM table
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
The following query returns a historical version of the table at an absolute point in time.
SELECT * FROM table
FOR SYSTEM_TIME AS OF '2017-01-01 10:00:00-07:00';
An approach would be to have 3 tables:
one basetable in "append only" mode, only inserts are added, and updates as full row, in this table would be every record like a versioning system.
a table to hold deletes (or this can be incorporated as a soft delete if there is a special column kept in the first table)
a livetable where you hold the current data (in this table you would do your MERGE statements most probably from the first base table.
If you choose partitioning and clustering, you could end up leverage a lot for long time storage discounted price and scan less data by using partitioning and clustering.
If the table is large but the amount of data updated per day is modest then you can partition and/or cluster the table on the last_updated_date column. There are some edge cases, like the first today's check should filter for last_updated_date being either today or yesterday.
Depending of how modest this amount of data updated throughout a day is, even repeatedly querying the entire table all day could be affordable because BQ engine will scan one daily partition only.
P.S.
Detailed explanation
I could add a last_updated timestamp column to keep track that way
I inferred from that the last_updated column is not there yet (so the check-for-updates statement cannot currently distinguish between updated rows and non-updated ones) but you can modify the table UPDATE statements so that this column will be added to the newly modified rows.
Therefore I assumed you can modify the updates further to set the additional last_updated_date column which will contain the date portion of the timestamp stored in the last_updated column.
but then repeatedly querying the entire table all day
From here I inferred there are multiple checks throughout the day.
but the data being updated can be for any time frame
Sure, but as soon as a row is updated, no matter how old this row is, it will acquire two new columns last_updated and last_updated_date - unless both columns have already been added by the previous update in which cases the two columns will be updated rather than added. If there are several updates to the same row between the update checks, then the latest update will still make the row to be discoverable by the checks that use the logic described below.
The check-for-update statement will (conceptually, not literally):
filter rows to ensure last_updated_date=today AND last_updated>last_checked. The datetime of the previous update check will be stored in last_checked and where this piece of data is held (table, durable config) is implementation dependent.
discover if the current check is the first today's check. If so then additionally search for last_updated_date=yesterday AND last_updated>last_checked.
Note 1If the table is partitioned and/or clustered on the last_updated_date column, then the above update checks will not cause table scan. And subject to ‘modest’ assumption made at the very beginning of my answer, the checks will satisfy your 3rd bullet point.
Note 2The downside of this approach is that the checks for updates will not find rows that had been updated before the table UPDATE statements were modified to include the two extra columns. (Such rows will be in the__NULL__ partition with rows that never were updated.) But I assume until the changes to the UPDATE statements are made it will be impossible to distinguish between updated rows and non-updated ones anyway.
Note 3 This is an explanatory concept. In the real implementation you might need one extra column instead of two. And you will need to check which approach works better: partitioning or clustering (with partitioning on a fake column) or both.
The detailed explanation of the initial (e.g. above P.S.) answer ends here.
Note 4
clustering only helps performance
From the point of view of table scan avoidance and achieving a reduction in the data usage/costs, clustering alone (with fake partitioning) could be as potent as partitioning.
Note 5
In the comment you mentioned there is already some partitioning in place. I’d suggest to examine if the existing partitioning is indispensable, can it be replaced with clustering.
Some good ideas posted here. Thanks to those who responded. Essentially, there are multiple approaches to tackling this.
But anyway, here's how I solved my particular problem...
Suppose the data needs to ultimately end up in a table called MyData. I created two additional tables, MyDataStaging and MyDataUpdate. These two tables have an identical structure to MyData with the exception of MyDataStaging has an additional Timestamp field, "batch_timestamp". This timestamp allows me to determine which rows are the latest versions in case I end up with multiple versions before the table is processed.
DatFlow pushes data directly to MyDataStaging, along with a Timestamp ("batch_timestamp") value indicating when the process ran.
A scheduled process then upserts/merges MyDataStaging to MyDataUpdate (MyDataUpdate will now always contain only a unique list of rows/values that have been changed). Then the process upserts/merges from MyDataUpdate into MyData as well as being exported & downloaded to be loaded into PostgreSQL. Then staging/update tables are emptied appropriately.
Now I'm not constantly querying the massive table to check for changes.
NOTE: When merging to the main big table, I filter the update on unique dates from within the source table to limit the bytes processed.

The order of records in a regularly updated bigquery database

I am going to be maintaining a local copy of a database on bigquery. I will be using the API and tabledata:list. This database is not my own, and is regularly updated by the maintainers by appending new data (say every hour).
First, can I assume that when this data is appended, it will definitely be added to the end of the database?
Now, let's assume that currently the database has 1,000,000 rows and I am now downloading all of these by paging through tabledata:list. Also, let's assume that the database is updated partway through (with 10,000 rows). By using the page tokens, can I be assured that I will only download the 1m rows present when I started in the order they are in in the database?
Finally, now let's say that I come to update my copy. If I initiate the tabledata:list with a startIndex of 1,000,000 and I use a maxResults of 1000, will I get 10 pages containing the updated data that I am expecting?
I suppose all these questions boil down to whether bigquery respects the order the data is in, whether this order is used by tabledata:list, and whether appended data is guaranteed to follow previous data.
As there is a column whose values are unique, and I can perform a simple select count(1) from table to get the length of the table, I can of course check that my local copy is complete by comparing the length of my local db with that of the remote, however if the above weren't guaranteed and I ended up with holes in my data, it would be quite impractical to remedy as the primary key is not sequential (otherwise I could just fill in the missing rows) and the database is very large.
When you append data, we will append to the end of the table data list, however, bigquery may periodically coalesce data, which does not respect ordering. We have been discussing being able to preserve the ordering, or at least have a way of accessing the most recent data, but this is not yet implemented or designed. If it is an important feature for you, let us know and we'll prioritize it accordingly.
If you use page tokens, you are assured of a stable listing. If the table gets updated in the middle of paging through the data, you'll still only see the data that was in the table when you created the page token. Note that because of this, page tokens are only valid for 24 hours.
This should work as long as no coalesce has occurred since you have updated the table.
You can get the number of rows in the table by calling tables.get, which is usually simpler and faster than running a query.

How to keep track of which rows have been imported in SQL?

Let's say I want to import all the customers (or all the rows in some other specific table) to some external system. Not all at once but every one after they have been created in database. To do that I have to keep record of all the rows that have already been reported because I want to find only the ones that have not been reported yet. Is it generally better to add a column to do that or to create some kind of a batchlog table?
I'm using MS SQL Server if that is relevant
A Simplified example:
select * from Customer where reportedToExternalSystem is null
or
select * from Customer where cus_id not in (select cus_id from integrationBatchLog)
or is there maybe some more ways to do that that might be even better? This is the first time I do something like this so I don't know the best practise yet.
The simple solution is to add a column that marks the row as imported. A status int (0/1) or if you want to keep track of when it was imported an imported date. This solution does have some limitations:
You can only import the row once. Do you need to import the customer again when the record is updated? Are you going to clear the update field when the customer is updated?
It causes the row to be locked when you update the row status. Are you sure the application that inserts the customer record will be happy with your code locking the records?
On some system it causes the entire row to be written to the log system for recovery. Depending on the size of the row this can be a lot of log writing for just one field.
In a highly parallel import system you can have a lot of contention for resources. If one import program is locking the table, think how bad it would be if many import programs are locking the table at the same time.
If the customer data is updated several times between your import polling interval, you will only see the latest data and will skip over the intermediate updates. This is only an issue if you care about the intermedaite updates. For customers you might not care, for order statuses you might care a lot.
You have to modify the table structure. This might not be allowed by the source application due to data/support/political issues.
Besides putting a status column in the table, one technique that works well is to put a trigger on the table and mirror the import data to a second table. You would then 'consume' the data in the second table. This has several advantages:
It keeps the locking issues contained to the second table.
It allows you to process every update to the main table.
You can add an index to the second table that is used to keep track of the update statuses without the issues of changing the main table.
If you delete the rows from the second table (either immediately as they are consumed or after a short audit period) the size of the table/index will be kep to a minimum.
When I use this technique in Sql Server I put the second table in a seperate schema. Since most apps store their tables in dbo, you can end up with dbo.Customers and Import.Customers. This can help you to keep track of which tables you are importing and keeps you from having to come up with new names for your import tables.
Unless you have to complicate implementation, go with the simplest solution possible. One important thing you should consider, is how hard would it be to refactor this simple to more general one, in case if you need it.
In your case I see only one problem in upgrading from column to table. If you would need history of imports. Solution: make reportedToExternalSystem column of DateTime (or Timestamp) type
I would use a separate table indicating, say, import date cross-referenced to the key of the record in the table you're tracking. In other words, a table with 3 columns: auto-increment key, record-id-from-other-table, import-date. Something like that. This also allows the case if a record is ever re-imported later. You'd have track of all the imports by date.
I Prefer having a column for importing status. Maintaining a separate log leads to time consumable results with growing table size. I do have conceptual idea on SQL Servers but seems that it works. Keep posting!

collecting mysql statistics

What would be the easiest way to count the new records that are inserted into a database? Is it possible to include a count query in with the load query?
Or is something more complex needed, such as recording the existing last record and counting everything added after it?
edit:
I have a cron job, that uses LOAD DATA INFILE in a script that is passed directly to mysql. This data is used with a php web application. As part of the php web application, I need to generate weekly reports, including how many records were inserted in the last week.
I am unable to patch mysql, or drastically change the database schema/structure, but I am able to add in new tables or fields. I would prefer not to count records from the csv file and store this result in a textfile or something. INstead, I would prefer to do everything from within PHP with queries.
Assuming your using Mysql 5 or greater, you could create a trigger which would fire upon inserting into a specific table. Note that an "insert" trigger also fires with the "LOAD" command.
Using a trigger would require you to persist the count information in a separate table. Basically you'd need to create a new table with 1 row/column to hold the count. The trigger would then update that value with the amount of data loaded.
Here's the MySQL manual page on triggers, the syntax is fairly straight forward. http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html
edit
Alternatively, if you don't want to persist the data within the database you could perform your "Load" operations within a stored procedure. This would allow you to perform a select count() on the table before you begin the Load and after the Load is complete. You would just need to subtract the resulting values to determine how many rows were inserted during the Load.
Here's the MySQL manual page on procedures.
http://dev.mysql.com/doc/refman/5.0/en/create-procedure.html
That would probably depend on what is determined as being new. Is it entries entered into the database in the last five minutes or 10 minutes etc? Or is it any record past a certain Auto ID?
If you are looking at time based method of determining what's new, you can have a field (probably of type datetime) that records the time when the record was inserted and to get the number, you simply do a...
select count(*) from table where currentTime > 'time-you-consider-to-be-new'
If you don't want to go by recording the time, you can use an auto increment key and simply keep track of the last inserted ID and count the ones that come after that at any given time window. so if one hour ago the ID was 10000 then a number of records have been inserted since then. You will need to count all records greater than 10000 and keep track of the last insert ID and repeat whenever needed.
If you are not looking at a specific table, you can use the following:
show global status like "Com_%";
This will show you statistics for every type of query. These numbers just keep on counting, so if you want to use them, record the initial number when starting to track the queries, and subtract this from your final number (but yea, that's a given).
If you are looking for pure statistics, I can recommend using Munin with the MySQL plugins.
From where do you load the data? You might consider to count them befor you insert them into the database. If it's a sqlscript you might write a quick and dirty bash script (with grep or something similar) to count the fields.
You say you can't change the structure. Does that mean you can't change the table you are inserting into, or you can't change the database at all? If you can add a table, then just create a table with 2 columns - a timestamp and the key of the table you are loading. Before you load your csv file, create another csv file with just those two columns, and load that csv after your main one.
This might be simpler than you want, but what about a Nagios monitor to track the row count? (Also consider asking around on serferfault.com; this stuff is totally up their alley.)
Perhaps you could write a small shell script that queries the database for the number of rows. You could then have a Cron job that runs every minute/hour/day etc and outputs the COUNT to a log file. Over time, you could review the log file and see the rate at which the database is growing. If you also put a date in the log file, you could review it easier over longer periods.
See if this is the kind of MySQL data collection you're interested in: http://code.google.com/p/google-mysql-tools/wiki/UserTableMonitoring.
If that is the case, Google offers a MySQL patch (to apply to a clean mysql directory source) at http://google-mysql-tools.googlecode.com/svn/trunk/mysql-patches/all.v4-mysql-5.0.37.patch.gz. You can read more about the patch at http://code.google.com/p/google-mysql-tools/wiki/Mysql5Patches.
If this is not what you're looking for, I suggest you explain yourself a little more in order for us to help you better.
Could you use a trigger on the table which will insert into a table you created, which in the structure has a timestamp?
You could then use a date calculation on a period range to find the information needed.
I dont know what version of mysql you are using, but here is link to the syntax for trigger creation in version 5.0: http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html
Good luck,
Matt
Well, if you need exhaustive information: which rows were inserted, updated or deleted, it might make sense to create an additional audit table to store those things with a timestamp. You could do this with triggers. I would also write a stored procedure which would execute as event and erase old entries (whatever you consider old).
Refer to the link posted by Lima on how to create triggers in MySQL.
Refer to page 655 of "MySQL Cookbook" by Paul Dubois (2nd Edition) or page 158 of "SQL for smarties" by Joe Celko.
so the 'load' will only insert new data in the table ? or rewrite the whole table ?
If it will load new data, then you can do a select count(*) from yourtable
once before the loading and once after the loading ... the difference will show you how many new records where inserted..
If on the other hand you rewrite the whole table and want to find the different records from the previous version .. then you would need a completely different approach..
Which one is it ?
Your question is a bit ambiguous but they mysql c APIs provide a function "mysql_affected_rows" that you can call after each query to get the number of affected rows. For an insert it returns the number of rows inserted. Be aware that for updates it returns the number of rows changed not the number of rows that matched the where clause.
If you are performing a number of queries and need to know how many were inserted the most reliable way would probably be doing a count before and after the queries.
As noted in sobbayi's answer adding a "created at" timestamp to your tables would allow you to query for records created after (or before) a given time.
UPDATE:
OK here is what you need to do to get a count before and after:
create a table for the counts:
create table row_counts (ts timestamp not null, row_count integer not null);
in your script add the following before and after your load file inline query:
insert into row_counts (ts,row_count) select now(),count(0) from YOUR_TABLE;
load file inline......
insert into row_counts (ts,row_count) select now(),count(0) from YOUR_TABLE;
the row_counts table will now have the count before and after your load.
show global status like 'Com_insert';
flush status and show session status... will work for just the current connection.
see http://dev.mysql.com/doc/refman/5.1/en/server-status-variables.html#statvar_Com_xxx
Since you asked for the easiest way, I would suggest you to use a trigger on insert. You could use a single column, single row table as a counter and update it with the trigger.

Database history for client usage

I'm trying to figure out what would be the best way to have a history on a database, to track any Insert/Delete/Update that is done. The history data will need to be coded into the front-end since it will be used by the users. Creating "history tables" (a copy of each table used to store history) is not a good way to do this, since the data is spread across multiple tables.
At this point in time, my best idea is to create a few History tables, which the tables would reflect the output I want to show to the users. Whenever a change is made to specific tables, I would update this history table with the data as well.
I'm trying to figure out what the best way to go about would be. Any suggestions will be appreciated.
I am using Oracle + VB.NET
I have used very successfully a model where every table has an audit copy - the same table with a few additional fields (time stamp, user id, operation type), and 3 triggers on the first table for insert/update/delete.
I think this is a very good way of handling this, because tables and triggers can be generated from a model and there is little overhead from a management perspective.
The application can use the tables to show an audit history to the user (read-only).
We've got that requirement in our systems. We added two tables, one header, one detail called AuditRow and AuditField. The AuditRow contains one row per row changed in any other table, and the AuditField contains one row per column changed with old value and new value.
We have a trigger on every table that writes a header row (AuditRow) and the needed detail rows (one per changed colum) on each insert/update/delete. This system does rely on the fact that we have a guid on every table that can uniquely represent the row. Doesn't have to be the "business" or "primary" key, but it's a unique identifier for that row so we can identify it in the audit tables. Works like a champ. Overkill? Perhaps, but we've never had a problem with auditors. :-)
And yes, the Audit tables are by far the largest tables in the system.
If you are lucky enough to be on Oracle 11g, you could also use the Flashback Data Archive
Personally, I would stay away from triggers. They can be a nightmare when it comes to debugging and not necessarily the best if you are looking to scale out.
If you are using an PL/SQL API to do the INSERT/UPDATE/DELETEs you could manage this in a simple shift in design without the need (up front) for history tables.
All you need are 2 extra columns, DATE_FROM and DATE_THRU. When a record is INSERTed, the DATE_THRU is left NULL. If that record is UPDATEd or DELETEd, just "end date" the record by making DATE_THRU the current date/time (SYSDATE). Showing the history is as simple as selecting from the table, the one record where DATE_THRU is NULL will be your current or active record.
Now if you expect a high volume of changes, writing off the old record to a history table would be preferable, but I still wouldn't manage it with triggers, I'd do it with the API.
Hope that helps.