Cassandra Batch - sql

I have just started out with Cassandra, and I have one common question:
"Suppose I need to insert about 2000+ records. Most people say not to use a batch here, but on the other hand I have also heard that 'the closest feature to a stored procedure is a batch, as it will allow you to "bundle" different DML statements associated with an insert, update or delete.'"
So can anyone suggest the best way to create something once, store it, and call it several times whenever required, with fast execution like stored procedures (SPs) in SQL?

Batches in Cassandra have very specific uses:
to apply multiple changes at once, often to multiple tables, to provide consistency in the update of the data, guaranteeing that they will all be applied or all fail. This is often called a "logged batch" - in this case, Cassandra makes a copy of the batch on multiple servers before applying the changes, and deletes it after the batch operations are successfully applied. As a result, such batches are much slower than ordinary operations.
to apply multiple operations inside a single partition - often called an "unlogged batch" - in this case, all operations are treated as one mutation, and as a result this is very fast compared to multiple individual operations.
So batches should be used only for multiple inserts/updates/deletes inside a single partition (otherwise you'll get worse performance than with individual statements), or when you need consistency of data between several tables. The fastest way to insert a lot of data is to issue multiple async operations. Also, if you want to load data from files, it may be better to look at tools like DSBulk that are heavily optimized for high-performance load & unload of data.
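For illustration, here's a minimal CQL sketch (the tables users_by_id, users_by_email and events_by_user are made up for the example): a logged batch that keeps two denormalized tables consistent, and an unlogged batch that only touches a single partition.

-- Logged batch: keeps two tables consistent, goes through the batch log (slower)
BEGIN BATCH
  INSERT INTO users_by_id    (user_id, email, name) VALUES (42, 'ann@example.com', 'Ann');
  INSERT INTO users_by_email (email, user_id, name) VALUES ('ann@example.com', 42, 'Ann');
APPLY BATCH;

-- Unlogged batch: every row shares the partition key user_id = 42,
-- so the whole batch is applied as a single mutation (fast)
BEGIN UNLOGGED BATCH
  INSERT INTO events_by_user (user_id, event_time, payload) VALUES (42, '2020-01-01 10:00:00', 'login');
  INSERT INTO events_by_user (user_id, event_time, payload) VALUES (42, '2020-01-01 10:00:05', 'click');
APPLY BATCH;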
You can read more details about good & bad uses of batches in the documentation and the DSE Architecture guide.
P.S. Technically speaking, Cassandra classifies batches as either multi-partition - in which case they are always logged - or single-partition - in which case they aren't logged.

Related

SSAS Tabular - Other options to update tables rather than Process Full...?

We have a fairly large SSAS Tabular cube with many different tables (some of which contain measures, dimensions, etc.). On occasion we run into scenarios where I have to optimize the cube partitions (break them into smaller parts) or the cube structure so that less memory is consumed when it processes (daily). Occasionally we've had to increase the memory limits of the server just to make sure the job doesn't crash. One of our SQL Server consultants asked if we had considered changing the process mode in the scripted job to 'Default' rather than 'Full' (since every table in the script is set to Full process mode). I said I hadn't considered it, but my concern, based on my research, is that Default won't actually update the data and will really only rebuild the table's structure if it changes in some way. I need a processing mode that will just pull in any new rows (and update any rows that have changed) since the last time the partition was processed. Is there any mode which accomplishes this rather than Process Full (which obviously wipes the partition it's processing and rebuilds the entire thing = memory intensive)? Anything less memory intensive that will still pull in new rows and update outdated ones?
FYI, all the tables are based on SQL queries.
One option is doing a Process Data instead of a Process Full on the tables in your tabular model. You may also want to consider partitioning your tables in order to take advantage of SSAS's ability to process partitions in parallel. Since your tables are already based on SQL queries, you'll only need to modify the filters in the queries so that the data is split across multiple non-overlapping partitions. Partitioning the tables will also allow for incremental processing, using Process Add to incrementally update the latest partition. Looking into other ways to reduce unnecessary memory use, such as removing unused columns and replacing calculated columns where possible (read about the cost of calculated columns here), will also help with the memory issues.
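For example, a rough sketch of what the partition queries could look like, assuming a hypothetical dbo.FactSales source table with an OrderDate column; each partition gets the same query with a non-overlapping date filter:

-- Partition "FactSales 2023-01"
SELECT * FROM dbo.FactSales
WHERE OrderDate >= '20230101' AND OrderDate < '20230201';

-- Partition "FactSales 2023-02"
SELECT * FROM dbo.FactSales
WHERE OrderDate >= '20230201' AND OrderDate < '20230301';

With a layout like that, the daily run would typically only need to reprocess the most recent partition (with Process Data or Process Add) rather than rebuilding every table in full.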

How to avoid slow query if data is too big for one partition in NoSQL?

I am learning Cassandra. Now I am thinking about the SQL problems that NoSQL addresses, and I have a question about very big data.
Regarding SQL handling very big data, many pages say that tables end up on different servers and queries are slow because of joining tables across different servers. This is a problem of SQL that NoSQL addresses. But even with NoSQL, if partitions are too big, don't I need to change my data model, make smaller partitions and run multiple queries on them to get the same result? And isn't that slow? Or do you never run out of space in a partition because 2B cells are big enough?
I think your question is mixing several different issues.
First of all, the problem with big data and SQL is usually not that queries become slow, but that the solution cannot scale as the data grows bigger and bigger. If you choose to manually split your tables to several servers, as you suggested, what do you do when you need even more servers - redesign your data model? Also, how do you ensure consistency when an update requires modifying several tables but they are on different hosts?
Second, you mentioned joins, and this is something which NoSQL solutions like Cassandra do not support. You need to manually denormalize the data yourself (i.e., put the already joined data in a table). For some things, Cassandra's new "Materialized Views" feature can come in handy.
Third, and perhaps most importantly, you asked about huge partitions. Indeed Cassandra is not designed to handle huge partitions, and the best practice is far below the 2-billion hard limit which you mentioned: Datastax (the commercial company behind Cassandra's development) suggests in https://docs.datastax.com/en/dse-planning/doc/planning/planningPartitionSize.html that a good rule of thumb is to "keep the maximum number of rows below 100,000 items and the disk size under 100 MB.".
There are several reasons why huge partitions are ill-advised in Cassandra. One of them is that the disk format (sstables and their so-called "promoted index") makes it inefficient to jump into the middle of a huge partition, and you need to do this when you want to read a specific row or iterate through all the rows. Some operations such as compaction and repair work on entire partitions and can become very slow (and in the worst case also use a lot of memory). E.g., consider a case where a billion-row partition differs on two nodes by just one row: partition-based repair then needs to send the entire partition over the network.
Scylla (https://en.wikipedia.org/wiki/Scylla_(database)), a Cassandra clone which is generally more efficient than Apache Cassandra, also has similar issues with huge partitions (as in Cassandra, moderately large partitions are fine), but these issues are actively being worked on, including re-designing the file format, so eventually Scylla should support arbitrary-sized partitions. However, we're still not there yet, and today the recommendation of not letting partitions grow too huge still applies to Scylla as well.
Finally, if you want to get around the problem of too many rows in a single partition, then, yes, you need to tweak your data model to avoid these huge partitions. Sometimes, you just need to fix design mistakes in your model - e.g., I have seen people stick a lot of unrelated data into the same partition when it could easily (and more efficiently!) have been put in separate partitions. Sometimes, you need to artificially split your partitions. This is common in so-called "time-series data" modeling in Cassandra, where we (for example) get a new value of some measurement every second and add it as a row to a partition. Here, instead of having one huge partition for all data ever, the accepted practice is to create a separate partition per time window (e.g., a new partition every day, or week, or whatever). Since most queries involve just one time window anyway, they don't even become slower.
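A minimal CQL sketch of that time-window bucketing (the measurements table and its columns are invented for the example); putting the day into the partition key gives each sensor a separate partition per day:

CREATE TABLE measurements (
    sensor_id text,
    day       date,       -- the "bucket": one partition per sensor per day
    ts        timestamp,
    value     double,
    PRIMARY KEY ((sensor_id, day), ts)
);

-- A query for one time window still hits exactly one partition:
SELECT ts, value
FROM measurements
WHERE sensor_id = 'sensor-1' AND day = '2020-06-01';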

SQL Server - Any better alternative to improve performance of a lengthy transaction with lot of inserts?

I have a scenario where a user action on screen results in new records being created in about 50 different tables, in real time. The use case is designed such that the new records created as a result of a user action are required immediately for the user to make changes, so there is no possibility of offline or delayed creation.
Having said that, the obvious problem is that the insert statements (along with some additional manipulation statements) are inside a transaction, which makes it a really lengthy transaction. It runs for about 30 seconds and often results in a timeout or blocks other queries.
A transaction is required for atomicity. Is there a better way I can split the transaction and still retain consistency? Or any other way to improve on the current situation?
insert queries are waiting on other (mostly select) queries that are running in parallel at that moment
You should consider using a row-versioning-based isolation level, aka SNAPSHOT, because under row-versioning-based isolation levels reads don't block writes and writes don't block reads. I would start by enabling READ_COMMITTED_SNAPSHOT and test with that:
ALTER DATABASE [...] SET READ_COMMITTED_SNAPSHOT ON;
I recommend reading the linked article for an explanation of the implications and trade-offs of row versioning.
Based on the comments exchange, I believe you have to look at both the insert transaction and the concurrent queries at the same time. You want to accommodate their load without losing transactional integrity. The available optimization techniques include:
Adding access indexes whenever you notice slow constructs (for example, nested loops) over large data sets in execution plans of frequently seen or slowly executing queries.
Adding covering indexes. These contain additional columns beyond the lookup columns and make it possible for a particular query to avoid a trip to the table at all. This is especially effective when the table is wide and the covering index narrow, but it may also be used to avoid locking issues between UPDATEs and SELECTs on different columns of the same rows (a sketch follows this list).
Denormalization. For example, switching some of the queries to access indexed views as opposed to the physical tables, or to secondary tables fed by triggers upon updates to the primary tables. These are costly, double-edged techniques and should only be considered for resolving proven top bottlenecks.
Make only those changes where the speed-up measured is very large as none of these techniques come for free in terms of performance. Never optimize without doing performance measurements at each step.
The following is trivial, but let's mention it for completeness - keep your statistics up to date (ANALYZE, UPDATE STATISTICS, ... as per your database engine), both while you analyze the execution plans and in production use.
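To illustrate the covering-index point from the list above, a T-SQL sketch (the Orders table and its columns are invented); the INCLUDE columns let the query be answered from the index alone:

-- Lookup column: CustomerId; OrderDate and TotalDue ride along in the index leaf level
CREATE NONCLUSTERED INDEX IX_Orders_CustomerId_Covering
ON dbo.Orders (CustomerId)
INCLUDE (OrderDate, TotalDue);

-- This query no longer needs to touch the base table at all:
SELECT OrderDate, TotalDue
FROM dbo.Orders
WHERE CustomerId = 42;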

What are the benefits of using database cursor?

It is based on an interview question that I faced.
A very short definition could be:
It can be used to manipulate the rows returned by a query.
Besides the uses of a cursor (points are listed here on MSDN), I have a question in my mind: if we can perform all the operations using a query or stored procedure (if I'm not wrong, like we can use Transact-SQL for MS SQL), is there any concrete reason to use a cursor?
Using cursors compared to big resultsets is like using video streaming instead of downloading a video in one go and watching it once it has downloaded.
If you download, you need a few gigs of space and the patience to wait until the download has finished. With streaming, no matter how fast your machine or network may be, everyone watches the movie at the same speed.
Normally any query gets sent to the server, executed, and the resultset sent over the network to you, in one burst of activity.
The cursor will give you access to the data row by row and stream every row only when you request it (can actually view it).
A cursor can save you time - because you don't need to wait for the processing and download of your complete recordset
It will save you memory, both on the server and on the client because they don't have to dedicate a big chunk of memory to resultsets
Load-balance both your network and your server - Working in "burst" mode is usually more efficient, but it can completely block your server and your network. Such delays are seldom desirable for multiuser environments. Streaming leaves room for other operations.
Allows operations on queried tables (under certain conditions) that do not affect your cursor directly. So while you are holding a cursor on a row, other processes are able to read, update and even delete other rows. This helps especially with very busy tables that see many concurrent reads and writes.
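To make the row-by-row access concrete, here is a minimal T-SQL sketch (the Customers table is made up) of declaring a cursor and fetching from it one row at a time:

DECLARE @Name nvarchar(100);

DECLARE customer_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT CustomerName FROM dbo.Customers ORDER BY CustomerName;

OPEN customer_cursor;
FETCH NEXT FROM customer_cursor INTO @Name;
WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT @Name;                                   -- process the current row
    FETCH NEXT FROM customer_cursor INTO @Name;    -- request the next row
END
CLOSE customer_cursor;
DEALLOCATE customer_cursor;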
Which brings us to some caveats, however:
Consistency: using a cursor, you (usually) do not operate on a consistent snapshot of the data, but on a single row. So your concurrency/consistency/isolation guarantees drop from the whole database (ACID) to just one row. You can usually inform your DBMS what level of concurrency you want, but if you are too nitpicky (locking the complete table you are working in), you will throw away many of the resource savings on the server side.
Transmitting every row by itself can be very inefficient, since every packet has negotiation overhead that you could avoid by sending big, maybe compressed, chunks of data per packet. (No DB server or client library is stupid enough to transmit every row individually; there's caching and chunking on both ends. Still, it is relevant.)
Cursors are harder to get right. Consider a query with a big resultset, motivating you to use a cursor, that uses a GROUP BY clause with aggregate functions. (Such queries are common in data warehouses.) The GROUP BY can completely trash your server, because it has to generate and store the whole resultset at once, maybe even holding locks on other tables.
Rule of thumb:
If you work on small, quickly created resultsets, don't use cursors.
Cursors excel at ad hoc, (referentially) complex queries of a sequential nature with big resultsets and low consistency requirements.
"Sequential nature" means there are no aggregate functions in heavy GROUP BY clauses in your query. The server can lazily decide to compute 10 rows for your cursor to consume from a cache and do other stuff in the meantime.
HTH
A cursor is a tool that allows you to iterate the records in a set. It has concepts of order and current record.
Generally, SQL operates with multisets: these are sets of possibly repeating records in no given order, taken as a whole.
Say, this query:
SELECT *
FROM a
JOIN b
ON b.a = a.id
operates on the multisets a and b.
Nothing in this query makes any assumptions about the order of the records, how they are stored, in which order they should be accessed, etc.
This makes it possible to abstract away implementation details and lets the system try to choose the best possible algorithm to run the query.
However, after you have transformed all your data, ultimately you will need to access the records in an ordered way and one by one.
You don't care how exactly the entries of a phonebook are stored on a hard drive, but a printer does require them to be fed in alphabetical order; and the formatting tags should be applied to each record individually.
That's exactly where the cursors come into play. Each time you are processing a resultset on the client side, you are using a cursor. You don't get megabytes of unsorted data from the server: you just get a tiny variable: a resultset descriptor, and just write something like this:
while (!rs.EOF) {      // loop until the cursor reports end-of-resultset
    process(rs);       // work with the current row only
    rs.moveNext();     // ask the cursor for the next row
}
That's the cursor that implements all this for you.
This of course concerns database-client interaction.
As for the database itself: inside the database, you rarely need cursors, since, as I said above, almost all data transformations can be implemented more efficiently using set operations.
However, there are exceptions:
Analytic operations in SQL Server are implemented very poorly. A cumulative sum, for instance, could be calculated much more efficiently with a cursor than using set-based operations (see the sketch after this list).
Processing data in chunks. There are cases where a set-based operation should be applied sequentially to portions of a set, with the results of each chunk committed independently. While it's still possible to do this using set-based operations, a cursor is often the preferred way to do it.
Recursion in the systems that do not support it natively.
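As a sketch of the cumulative-sum case from the first point (the Transactions table, its columns and the RunningTotal column are all invented for the example), the cursor walks the rows in order and carries the running total in a variable:

DECLARE @RunningTotal money = 0, @Amount money, @Id int;

DECLARE running_cursor CURSOR LOCAL STATIC FOR
    SELECT TransactionId, Amount FROM dbo.Transactions ORDER BY TransactionId;

OPEN running_cursor;
FETCH NEXT FROM running_cursor INTO @Id, @Amount;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @RunningTotal = @RunningTotal + @Amount;
    UPDATE dbo.Transactions
        SET RunningTotal = @RunningTotal      -- write the accumulated value back
        WHERE TransactionId = @Id;
    FETCH NEXT FROM running_cursor INTO @Id, @Amount;
END
CLOSE running_cursor;
DEALLOCATE running_cursor;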
You also may find this article worth reading:
The Island of Misfit Cursors
Using a cursor it is possible to read sequentially through a set of data, programmatically, so it behaves in a similar manner to conventional file access, rather than the set-based behaviour characteristic of SQL.
There are a couple of situations where this may be of use:
Where it is necessary to simulate file-based record access behaviour - for example, where a relational database is being used as the data storage mechanism for a piece of code that was previously written to use indexed files for data storage.
Where it is necessary to process data sequentially - a simple example might be calculating a running total balance for a specific customer. (A number of relational databases, such as Oracle and SQL Server, now have analytical extensions to SQL that should greatly reduce the need for this.)
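For instance, a hypothetical sketch (table and column names are invented) of that running balance expressed with such an analytic extension, a window function, instead of a cursor:

SELECT CustomerId, TransactionDate, Amount,
       SUM(Amount) OVER (PARTITION BY CustomerId
                         ORDER BY TransactionDate
                         ROWS UNBOUNDED PRECEDING) AS RunningBalance
FROM dbo.Transactions
ORDER BY CustomerId, TransactionDate;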
Inevitably, wikipedia has more: http://en.wikipedia.org/wiki/Database_cursor
With a cursor you access one row at a time, so it is good to use one when you want to manipulate a lot of rows but only one at any given time.
I was told in my classes that the reason to use a cursor is that you want to access more rows than you can fit in memory - so you can't just load all the rows into a collection and then loop through it.
Sometimes set-based logic can get quite complex and opaque. In these cases, and if performance is not an issue, a server-side cursor can be used to replace the relational logic with more manageable and familiar (to a non-relational thinker) procedural logic, resulting in easier maintenance.

Postgresql Application Insertion and Trigger Performance

I'm working on designing an application with a SQL backend (PostgreSQL) and I've got some design questions. In short, the DB will store network events as they occur on the fly, so insertion speed and performance are critical, because 'real-time' actions depend on these events. The data is dumped into a speedy default format across a few tables, and I am currently using PostgreSQL triggers to put this data into some other tables used for reporting.
On a typical event, data is inserted into two different tables that share the same primary key (an event ID). I then need to move and rearrange the data into some different tables that are used by a web-based reporting interface. My primary goal/concern is to keep the load off the initial insertion tables so they can do their thing. Reporting is secondary, but it would still be nice for this to occur on the fly via triggers, as opposed to a cron job where I have to query and manage events that have already been processed. Reporting should/will never touch the initial insertion tables. Performance-wise, does this make sense or am I totally off?
Once the data is in the appropriate reporting tables, I won't need to hang on to the data in the insertion tables for long, so I'll keep those regularly pruned for insertion performance. Thinking about this scenario, which I'm sure is semi-common, I've come up with three options:
Use triggers to trigger on the initial row insert and populate the reporting tables. This was my original plan.
Use triggers to copy the insertion data to a temporary table (same format), and then use a trigger or cron to populate the reporting tables. This was just a thought, but I figure that a simple copy operation to a temp table will offload any of the querying done by the triggers in the solution above.
Modify my initial output program to dump all the data into a single table (vs. across two) and then trigger on that insert to populate the reporting tables. So where solution 1 is a multi-table to multi-table trigger situation, this would be a single-table source to multi-table trigger.
Am I overthinking this? I want to get this right. Any input is much appreciated!
You may see a slight performance overhead since there are more "things" to do (although they should not affect operations in any way). But using triggers/other PL is a good way to keep that overhead to a minimum, since they are executed faster than code that gets sent from your application to the DB server.
I would go with your first idea (1), since it seems to me the cleanest and most efficient way.
2) is the most performance-hungry solution, since cron will issue more queries than the other solutions that use server-side functions. 3) would be possible but would result in an "uglier" database layout.
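A minimal PL/pgSQL sketch of option 1 (all table and column names here are hypothetical): an AFTER INSERT trigger on the ingest table that pushes each new row into a reporting table.

CREATE OR REPLACE FUNCTION copy_event_to_reporting() RETURNS trigger AS $$
BEGIN
    -- rearrange/copy the freshly inserted row into the reporting table
    INSERT INTO reporting_events (event_id, event_time, source, details)
    VALUES (NEW.event_id, NEW.event_time, NEW.source, NEW.details);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_copy_event
AFTER INSERT ON raw_events
FOR EACH ROW EXECUTE PROCEDURE copy_event_to_reporting();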
This is an old one but adding my answer here.
Reporting is secondary, but it would still be nice for this to occur on the fly via triggers, as opposed to a cron job where I have to query and manage events that have already been processed. Reporting should/will never touch the initial insertion tables. Performance-wise, does this make sense or am I totally off?
That may be way off, I'm afraid, but in a few cases it may not be. It depends on the effects of caching on the reports. Keep in mind that disk I/O and memory are your commodities, and that writers and readers rarely block each other in PostgreSQL (unless they explicitly elevate locks - a SELECT ... FOR UPDATE will block writers, for example). Basically, if your tables fit comfortably in RAM, you are better off reporting from them, since you are keeping disk I/O free for the WAL segment commit of your event entry. If they don't fit in RAM, then you may have cache-miss issues induced by reporting. Here, materializing your views (i.e., making trigger-maintained tables) may cut down on these, but it carries a significant complexity cost. This, btw, is your option 1. So I would provisionally chalk this one up as premature optimization. Also keep in mind you may induce cache misses and lock contention by materializing the views this way, so you might cause insert performance problems that way.
Keep in mind if you can operate from RAM with the exception of WAL commits, you will have no performance problems.
For #2: if you mean temporary tables as in CREATE TEMPORARY TABLE, that's asking for a mess, including performance issues and reports not showing what you want them to show. Don't do it. If you do, you might:
Force PostgreSQL to replan your trigger on every insert (or at least once per session). Ouch.
Add overhead creating/dropping tables
Possibilities of OID wraparound
etc.....
In short, I think you are overthinking it. You can get very far by bumping up the RAM on your Pg box and making sure you have enough cores to handle the appropriate number of inserting sessions plus the reporting one. If you plan your hardware right, none of this should be a problem.