cassandra : send onse store multiple times - optimization

I need populate 4 different tables in cassandra, by absolutely same data.
I mean, I have N fields, each has some value, but I need store them in 4 different tables (with different PK definitions) to allow different selects.
For that I need trigger 4 inserts.
4 times more network overhead and work for both cassandra and data producer.
Is there way in cassandra to send 1-ce and save in N different tables?
I need some optimization, but batches are not looking appropriate for that.
Please, help!!

Cassandra triggers could also work for you. You create a trigger Java class and deploy it in a Jar to each node. When it intercepts your main table insert, it also tacks on writes to the others.

If you don't want to use batch inserts, and you have the chance to use Cassandra 3.x you can use a new feature Materialized Views.
With materialized views you can basically do exactly what you asked for.
You will have to create one main table where you do all your inserts on. Then you would have to create 3 materialized views.
When you insert data into your main table, all your views will be updated by cassandra. Internally cassandra will also do those inserts, so the traffic overhead is still there (only starting on the coordinator and not on your client)
Note that those new features are not for free. As I said, there will be overhead traffic and work to be done by the coordinator. Currently there are only simple select statements possible, more complex queries are not working right now.
The provided link gives more insights of how these views work, so please have an additional look there.

Related

Extending a set of existing tables into a dynamic client defined structure?

We have an old repair database that has alot of relational tables and it works as it should but i need to update it to be able to handle different clients ( areas ) - currenty this is done as a single client only.
So i need to extend the tables and the sql statements so ex i can login as user A and he will see his own system only and user B will have his own system too.
Is it correctly understood that you wouldnt create new tables for each client but just add a clientID to every record in every ( base ) table and then just filter with a clientid in all sql statements to be able to achieve multiple clients ?
Is this also something that would work ( how is it done ) on hosted solutions ? Am worried about performance if thats an issue lets say i had 500 clients ( i wont but from a theoretic viewpoint ) ?
The normal situation is to add a client key to each table where appropriate. Many tables don't need them -- such as reference tables.
This is preferred for many reasons:
You have the data for all clients in one place, so you can readily answers a question such as "what is the average X for each client".
If you change the data structure, then it affects all clients at the same time.
Your backup and restore strategy is only implemented once.
Your optimization is only implemented once.
This is not always the best solution. You might have requirements that specify that data must be separated -- in which case, each client should be in a separate database. However, indexes on the additional keys are probably a minor consideration and you shouldn't worry about it.
This question has been asked before. The problem with adding the key to every table is that you say you have a working system, and this means every query needs to be updated.
Probably the easiest is to create a new database for each client, so that the only thing you need to change is the connection string. This also means you can get automated query tools for example to work without worrying about cross-client data leakage.
And it also allows you to backup, transfer, delete a single client easily as well.
There are of course pros and cons to this approach, but it will simplify the development effort. Also remember that if you plan to spin it up in a cloud environment then spinning up databases like this is also very easy.

Stream data into rotating log tables in BigQuery

I want to stream some time series data into BigQuery with insertAll but only retain the last 3 months (say) to avoid unbounded storage costs. The usual answer is to save each day of data into a separate table but AFAICT this would require each such table to be created in advance. I intend to stream data directly from unsecured clients authorized with a token that only has bigquery.insertdata scope, so they wouldn't be able to create the daily tables themselves. The only solution I can think of would be to run a secure daily cron job to create the tables -- not ideal, especially since if it misfires data will be dropped until the table is created.
Another approach would be to stream data into a single table and use table decorators to control query costs as the table grows. (I expect all queries to be for specific time ranges so the decorators should be pretty effective here.) However, there's no way to delete old data from the table, so storage costs will become unsustainable after a while. I can't figure out any way to "copy and truncate" the table atomically either, so that I can partition old data into daily tables without losing rows being streamed at that time.
Any ideas on how to solve this? Bonus points if your solution lets me re-aggregate old data into temporally coarser rows to retain more history for the same storage cost. Thanks.
Edit: just realized this is a partial duplicate of Bigquery event streaming and table creation.
If you look at the streaming API discovery document, there's a curious new experimental field called "templateSuffix", with a very relevant description.
I'd also point out that no official documentation has been released, so special care should probably go into using this field -- especially in a production setting. Experimental fields could possibly have bugs etc. Things I could think to be careful of off the top of my head are:
Modifying the schema of the base table in non-backwards-compatible ways.
Modifying the schema of a created table directly in a way that is incompatible with the base table.
Streaming to a created table directly and via this suffix -- row insert ids might not apply across boundaries.
Performing operations on the created table while it's actively being streamed to.
And I'm sure other things. Anyway, just thought I'd point that out. I'm sure official documentation will be much more thorough.
Most of us are doing the same thing as you described.
But we don't use a cron, as we create tables advance for 1 year or on some project for 5 years in advance. You may wonder why we do so, and when.
We do this when the schema is changed by us, by the developers. We do a deploy and we run a script that takes care of the schema changes for old/existing tables, and the script deletes all those empty tables from the future and simply recreates them. We didn't complicated our life with a cron, as we know the exact moment the schema changes, that's the deploy and there is no disadvantage to create tables in advance for such a long period. We do this based on tenants too on SaaS based system when the user is created or they close their accounts.
This way we don't need a cron, we just to know that the deploy needs to do this additional step when the schema changed.
As regarding don't lose streaming inserts while I do some maintenance on your tables, you need to address in your business logic at the application level. You probably have some sort of message queue, like Beanstalkd to queue all the rows into a tube and later a worker pushes to BigQuery. You may have this to cover the issue when BigQuery API responds with error and you need to retry. It's easy to do this with a simple message queue. So you would relly on this retry phase when you stop or rename some table for a while. The streaming insert will fail, most probably because the table is not ready for streaming insert eg: have been temporary renamed to do some ETL work.
If you don't have this retry phase you should consider adding it, as it not just helps retrying for BigQuery failed calls, but also allows you do have some maintenance window.
you've already solved it by partitioning. if table creation is an issue have an hourly cron in appengine that verifies today and tomorrow tables are always created.
very likely the appengine wont go over the free quotas and it has 99.95% SLO for uptime. the cron will never go down.

Create BigQuery job that creates tables daily [duplicate]

I want to stream some time series data into BigQuery with insertAll but only retain the last 3 months (say) to avoid unbounded storage costs. The usual answer is to save each day of data into a separate table but AFAICT this would require each such table to be created in advance. I intend to stream data directly from unsecured clients authorized with a token that only has bigquery.insertdata scope, so they wouldn't be able to create the daily tables themselves. The only solution I can think of would be to run a secure daily cron job to create the tables -- not ideal, especially since if it misfires data will be dropped until the table is created.
Another approach would be to stream data into a single table and use table decorators to control query costs as the table grows. (I expect all queries to be for specific time ranges so the decorators should be pretty effective here.) However, there's no way to delete old data from the table, so storage costs will become unsustainable after a while. I can't figure out any way to "copy and truncate" the table atomically either, so that I can partition old data into daily tables without losing rows being streamed at that time.
Any ideas on how to solve this? Bonus points if your solution lets me re-aggregate old data into temporally coarser rows to retain more history for the same storage cost. Thanks.
Edit: just realized this is a partial duplicate of Bigquery event streaming and table creation.
If you look at the streaming API discovery document, there's a curious new experimental field called "templateSuffix", with a very relevant description.
I'd also point out that no official documentation has been released, so special care should probably go into using this field -- especially in a production setting. Experimental fields could possibly have bugs etc. Things I could think to be careful of off the top of my head are:
Modifying the schema of the base table in non-backwards-compatible ways.
Modifying the schema of a created table directly in a way that is incompatible with the base table.
Streaming to a created table directly and via this suffix -- row insert ids might not apply across boundaries.
Performing operations on the created table while it's actively being streamed to.
And I'm sure other things. Anyway, just thought I'd point that out. I'm sure official documentation will be much more thorough.
Most of us are doing the same thing as you described.
But we don't use a cron, as we create tables advance for 1 year or on some project for 5 years in advance. You may wonder why we do so, and when.
We do this when the schema is changed by us, by the developers. We do a deploy and we run a script that takes care of the schema changes for old/existing tables, and the script deletes all those empty tables from the future and simply recreates them. We didn't complicated our life with a cron, as we know the exact moment the schema changes, that's the deploy and there is no disadvantage to create tables in advance for such a long period. We do this based on tenants too on SaaS based system when the user is created or they close their accounts.
This way we don't need a cron, we just to know that the deploy needs to do this additional step when the schema changed.
As regarding don't lose streaming inserts while I do some maintenance on your tables, you need to address in your business logic at the application level. You probably have some sort of message queue, like Beanstalkd to queue all the rows into a tube and later a worker pushes to BigQuery. You may have this to cover the issue when BigQuery API responds with error and you need to retry. It's easy to do this with a simple message queue. So you would relly on this retry phase when you stop or rename some table for a while. The streaming insert will fail, most probably because the table is not ready for streaming insert eg: have been temporary renamed to do some ETL work.
If you don't have this retry phase you should consider adding it, as it not just helps retrying for BigQuery failed calls, but also allows you do have some maintenance window.
you've already solved it by partitioning. if table creation is an issue have an hourly cron in appengine that verifies today and tomorrow tables are always created.
very likely the appengine wont go over the free quotas and it has 99.95% SLO for uptime. the cron will never go down.

materialized view logging exclude deletes

I am using MVIEWs with Fast refresh to replicate some tables across a network. Everything works great, however I ran into an issue when considering my Delete/Purge process.
The source for the MVIEWs that are feeding the log tables have a data retention of 7 days. Ie I will be running a nightly purge process to delete data older than 7 days from current date.
The target MVIEWs however are on an ODS and have a data retention policy of 30 days. Also, these MVIEWs are NOT currently populating another schema or set of tables.
Problem is, when I Delete from the source tables, those delete statements will propagate through to the target MVIEWs and now I no longer have 30 days worth of data - only 7.
Is there a way to exclude logging DELETE for the MVIEW log tables? I noticed in the MLOG$_Table_Name there is a column 'DMLTYPE$$'. Could I somehow delete from the Log table all records where DMLTYPE$$ = 'D'?
Thanks everyone, and yes, I did try researching this online first.
Regards,
Steve
I suppose that you could manually delete data from the materialized view logs before running the refresh. That would probably work. But it would not be a solution that I'd be really comfortable with. It would be a very bespoke solution that would probably not be officially supported. And it if there might ever be another materialized view that depends on the materialized view log, you'd have to ensure that you're only deleting those rows that relate to your materialized view's subscription. Plus, the materialized view on the destination would need to be updatable in order for you to be able to manually remove the rows older than 30 days via a separate process.
If these are the business requirements, something like Oracle Streams (or GoldenGate) would be a much more appropriate architectural solution. Those products are designed to give you more flexibility about which logical change records (LCRs) you apply. In Streams, for example, it is easy enough to create a custom apply handler that discards delete LCRs. And since you're applying LCRs to a table on the destination rather than a materialized view, your 30 day purge process is much easier to manage. This would be a relatively common Streams setup rather than a very unique materialized view setup.

Is having a copy of SQL data in an application a good idea to save SQL SELECTS?

I am working on a multithreading .NET 4 application which acquires data continuously and writes them into a SQL database (MySQL or SQL Server - not yet sure).
Everytime when a INSERT is executed, at leat one prior SELECT is necessary in order so synchronize with the databaes. This means the applications gets a block which contains new and old data and then has to check which data sets are new and which are already in the database.
This means a lot of SELECTS which result everytime in more or less the same data.
Would it be a good idea to have a copy of the last x entries per table within the application?
This way the synchronization could be done on the copy instead of the database.
Pro:
Faster
Contra:
Uses a lot of memory
Risk of becomming unsynchronized with the database
What do you think? What is the best practice for such a use case?
Any other pros and cons?
Unless you have an external program writing to your database at the same time, you can use buffering.
But instead of buffering SELECT results, just add to the insert method a buffer of the last X (a reasonable number) insertion requests, and only insert the new one if it isn't on that list.
You might also want to lock the insertion method, to make sure the inclusion check is always correct.
If you have multiple processes writing to the database, it is non-trivial to maintain perfect synchronization between in-memory data and the database. In fact the only way to confirm you are synchronized is by making a SELECT query on database. So you have a trade-off between perfect synchronization and synchronization with some tolerance which is very efficient.
My suggestions, which may help in both cases, would be:
Tune your SELECT queries. Add indexes if necessary.
Create meta-data, like version numbers. So that you have to only check something very trivial to determine if you need synchronization.
Write a stored proc which implements your SELECT and INSERT logic. Then you do not have to worry about making multiple calls to the database.