I am thinking of simulating bank-account-like transactions in DynamoDB.
Being a NoSQL-type database, is DynamoDB suited for this use case?
Or should I stick with a SQL-based database?
Does anyone know how banks usually handle this?
Do they use something like DynamoDB, or do they keep our transactions in a separate table?
A SQL-based system allows transactions across multiple tables, while DynamoDB only allows transactions at the level of an individual item. However, there is a DynamoDB transaction library for Java that enables atomic transactions across multiple items, at the expense of speed and substantially more writes per request.
DynamoDB does support atomic counters, which are bank-account-like (the example on the documentation page is along those lines).
DynamoDB also provides conditional writes, which can help avoid certain types of race conditions when an item is updated concurrently.
Could you build a suitable system on top of DynamoDB? Maybe, but this all really depends on your specific requirements and how you go about implementing such a system. That said, DynamoDB does provide some functionality that I outlined above to address the types of concerns that arise when building such a system.
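For illustration, here is a minimal sketch of a conditional write using the Python SDK (boto3); the table and attribute names are invented, and this only shows the shape of the call, not a full account design:

    import boto3
    from botocore.exceptions import ClientError

    dynamodb = boto3.resource("dynamodb")
    accounts = dynamodb.Table("accounts")  # hypothetical table

    def withdraw(account_id, amount_cents):
        """Decrement the balance only if enough funds remain."""
        try:
            accounts.update_item(
                Key={"account_id": account_id},
                UpdateExpression="SET balance = balance - :a",
                ConditionExpression="balance >= :a",
                ExpressionAttributeValues={":a": amount_cents},  # integer cents
            )
            return True
        except ClientError as err:
            if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
                return False  # insufficient funds, nothing was written
            raise

If the condition fails, DynamoDB rejects the write atomically, which is the property you want for a balance that must never go negative.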
As of 2018, DynamoDB has transaction support for reading and writing items in multiple tables within a single region. It is not exactly the same as SQL transactions, because a transaction may fail and return an error if the data is modified by another parallel operation, but it is ACID.
https://aws.amazon.com/blogs/aws/new-amazon-dynamodb-transactions/
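As a hedged sketch of what that looks like with boto3 (table and attribute names are invented here), a transfer either fully applies to both items or is cancelled:

    import boto3

    client = boto3.client("dynamodb")

    def transfer(from_id, to_id, amount_cents):
        """Move funds atomically between two hypothetical account items."""
        amount = {"N": str(amount_cents)}
        client.transact_write_items(
            TransactItems=[
                {
                    "Update": {
                        "TableName": "accounts",
                        "Key": {"account_id": {"S": from_id}},
                        "UpdateExpression": "SET balance = balance - :a",
                        "ConditionExpression": "balance >= :a",
                        "ExpressionAttributeValues": {":a": amount},
                    }
                },
                {
                    "Update": {
                        "TableName": "accounts",
                        "Key": {"account_id": {"S": to_id}},
                        "UpdateExpression": "SET balance = balance + :a",
                        "ExpressionAttributeValues": {":a": amount},
                    }
                },
            ]
        )
        # Raises TransactionCanceledException if the condition fails or a
        # concurrent operation modifies one of the items.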
Our Data Warehouse team is evaluating BigQuery as a column-store data warehouse solution and has some questions regarding its features and best use. Our existing ETL pipeline consumes events asynchronously through a queue and persists the events idempotently into our existing database technology. The idempotent architecture allows us to occasionally replay several hours or days of events to correct for errors and data outages with no risk of duplication.
In testing BigQuery, we've experimented with using the real time streaming insert api with a unique key as the insertId. This provides us with upsert functionality over a short window, but re-streams of the data at later times result in duplication. As a result, we need an elegant option for removing dupes in/near real time to avoid data discrepancies.
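For context, this is roughly the kind of call we are making, assuming the Python google-cloud-bigquery client (the table name and key column are placeholders):

    from google.cloud import bigquery

    client = bigquery.Client()
    table = client.get_table("my_project.warehouse.events")  # placeholder

    def stream_events(events):
        """Stream rows, passing our unique event key as the insertId so
        BigQuery best-effort de-duplicates within its short window."""
        errors = client.insert_rows_json(
            table,
            events,                                    # list of dicts, one per row
            row_ids=[e["event_key"] for e in events],  # becomes the insertId
        )
        if errors:
            raise RuntimeError(f"streaming insert failed: {errors}")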
We have a couple of questions and would appreciate answers to any of them. Any additional advice on using BigQuery in an ETL architecture is also appreciated.
1. Is there a common implementation for de-duplication of real-time streaming beyond the use of the insertId?
2. If we attempt a "delsert" (a delete followed by an insert via the BigQuery API), will the delete always precede the insert, or do the operations arrive asynchronously?
3. Is it possible to implement real-time streaming into a staging environment, followed by a scheduled merge into the destination table? This is a common solution for other column-store ETL technologies, but we have seen no documentation suggesting its use with BigQuery.
We let duplication happen, and write our logic and queries in such a way that every entity is streamed data. E.g. a user profile is streamed data, so there are many rows placed over time, and when we need to pick the latest data, we use the most recent row.
Delsert is not suitable in my opinion, as you are limited to 96 DML statements per day per table. This means you would need to temporarily store batches in a staging table and later issue a single DML statement that processes a batch of rows and updates the live table from the temp table.
If you are considering delsert, it is probably easier to write a query that only reads the most recent row.
Streaming followed by a scheduled merge is possible. You can even rewrite some data in the same table, e.g. to remove duplicates, or have a scheduled query take the batch content from the temp table and write it to the live table. This is essentially the same as letting duplication happen and dealing with it later in a query, also called re-materialization if you write back to the same table.
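As a sketch of the "read only the most recent row" idea (Python client; the table and column names are invented), the query can simply rank rows per entity and keep the newest one:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Keep only the newest row per entity; duplicates and older versions drop out.
    LATEST_ROWS = """
    SELECT * EXCEPT(rn)
    FROM (
      SELECT *,
             ROW_NUMBER() OVER (PARTITION BY event_key
                                ORDER BY ingested_at DESC) AS rn
      FROM `my_project.warehouse.events`
    )
    WHERE rn = 1
    """

    rows = client.query(LATEST_ROWS).result()

The same statement can feed a scheduled query that re-materializes the table, which is the batch variant described above.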
I see that by using the RethinkDB connector one can achieve real-time querying capabilities by subscribing to specifically named lists. I assume that this is not actually the fastest solution, as the query probably only updates after changes to records are written to the database. Is there any recommended approach to achieve real-time querying capabilities deepstream-side?
There are some favourable properties like:
The number of unique queries is small compared to the number of records, or even the number of connected clients
All manipulation of records that are subject to querying is done via RPC.
I can imagine multiple ways to do this:
Imitate the RethinkDB connector approach. For that I am missing a list.listen() method. With it I could create a backend process that creates a list on demand and, on each RPC CRUD operation on records, updates all currently active lists (= queries); see the sketch after this list.
Reimplement basic list functionality on top of records and use the above approach with the .listen() that does exist there.
Use .listen() in events?
Or do we have list.listen() and I just missed it? Or is there a more elegant way to do it?
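To make the first option more concrete, this is the shape of the backend process I have in mind (plain Python sketch; the publish callback is hypothetical and stands in for whatever deepstream call would push the updated list to its subscribers):

    # Maintain named query results server-side and push the new membership
    # whenever an RPC mutates a record that affects one of the queries.
    active_queries = {
        # list name -> predicate over a record (illustrative example)
        "open-orders": lambda rec: rec.get("status") == "open",
    }
    query_results = {name: set() for name in active_queries}

    def on_rpc_write(record_id, record, publish):
        """Called from every CRUD RPC. publish(list_name, ids) is a
        hypothetical hook that updates the named list for subscribers."""
        for name, predicate in active_queries.items():
            members = query_results[name]
            if predicate(record):
                changed = record_id not in members
                members.add(record_id)
            else:
                changed = record_id in members
                members.discard(record_id)
            if changed:
                publish(name, sorted(members))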
Great question - generally lists are a client-side concept, implemented on top of records. Listen notifies you about clients subscribing to records, not necessarily changing them - change notifications arrive via mylist.subscribe(data => {}) or myRecord.subscribe(data => {}).
The tricky bit is the very limited querying capability of caches. Redis has a basic concept of secondary indices that can be searched for ranges and intersection, memcached and co are to my knowledge pure key-value stores, searchable only by ID - as a result the actual querying would make most sense on the database layer where your data will usually arrive in significantly less than 200ms.
The RethinkDB search provider offers support for RethinkDB's built-in realtime querying capabilities. Alternatively you could use MongoDB and tail its operations log, or use Postgres and deepstream's built-in subscribe feature for change notifications.
Background
We chose Cassandra as our storage engine since we have an application that must handle async messaging between many users on the website and event storing (some types of analytics: what happens on the site and when, etc.). We also have a voting platform, so we store votes per user per day, and Cassandra is a good fit for those use cases.
Recently we got new requirements to build a relational model on top of our existing system (at least we think it is relational). Some types of political candidates with lists of jobs, education, historical voting, endorsements, etc.
Problem
We have relations that can be edited on both ends (i.e. a candidate is supported by companies, but in our admin panel that company can be edited without the candidate). A candidate is one row in our Cassandra DB, identified by a UUID. On the front end we need full information about a candidate (political party, schools, jobs, voting history, supporting companies). We want to place the majority of candidate info in a single row so we can read the data with a single read. However, when we embed the list of supporting companies as a UDT, we have problems editing it (we need to change it in both the company_by_id and candidate_by_id tables).
Question
How should we solve the editing problem and the relational-model issues in our situation?
We came up with a few solutions:
1. Track relations in Cassandra with additional index-like tables, e.g. candidates_by_supporting_company. When updating a company, we also update the candidates that reference that company.
2. Similar to 1, but using a secondary index if the relation has low cardinality, and updating based on the secondary index (we have 10 political parties, so we can place an index on political party in the candidates table, and when a political party changes we can update its candidates via that index).
3. Use a relational database for the relational type of data and leave Cassandra to handle only the use cases it is suited for, like time-series data, messaging and event storing (this adds the maintenance and deployment cost of one more database, plus the problem of replicating data since our system is distributed).
4. Use Spark to do joins (this would not be the sole purpose of adding Spark to the system; we are thinking of adding it for importing huge CSV data sets and doing transformations, so having Spark would be an added bonus, and we could use Spark SQL where we need joins).
We are leaning towards the Spark option, since we will add Spark anyway, we stay with only the Cassandra database (which avoids complicating maintenance and deployment with one more database), and we get reasonably efficient JOINs and GROUP BYs at the application level.
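To illustrate what we have in mind for the Spark option (a rough sketch assuming the spark-cassandra-connector; keyspace, table and column names are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("candidate-joins").getOrCreate()

    def cassandra_table(keyspace, table):
        """Load a Cassandra table as a DataFrame via the connector."""
        return (spark.read
                .format("org.apache.spark.sql.cassandra")
                .options(keyspace=keyspace, table=table)
                .load())

    candidates = cassandra_table("elections", "candidate_by_id")
    companies = cassandra_table("elections", "company_by_id")

    # The join we cannot express in plain CQL: resolve current company details
    # at query time instead of duplicating them inside the candidate row.
    candidates_with_companies = candidates.join(
        companies,
        candidates.supporting_company_id == companies.company_id,
    )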
What do you think?
If you want to use only Cassandra, the right way to proceed is number 1: denormalization. But if you have a lot of relationships, it will require a lot of effort at the application level.
If adding another DBMS is not a problem in your environment, using the right tool for the right job is the best choice: number 3 for me.
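To show what number 1 looks like in practice, here is a rough sketch with the Python driver (keyspace, table and column names are placeholders): keep an index table from company to candidates and fan the update out when the company changes.

    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("elections")  # placeholder contact point

    def update_company_name(company_id, new_name):
        """Denormalized update: fix the company row, then every candidate row
        that embeds the company name, found via the index table."""
        session.execute(
            "UPDATE company_by_id SET name = %s WHERE company_id = %s",
            (new_name, company_id),
        )
        rows = session.execute(
            "SELECT candidate_id FROM candidates_by_supporting_company "
            "WHERE company_id = %s",
            (company_id,),
        )
        for row in rows:
            session.execute(
                "UPDATE candidate_by_id SET supporting_company_name = %s "
                "WHERE candidate_id = %s",
                (new_name, row.candidate_id),
            )

The price of the single-read candidate row is exactly this fan-out on every company edit, which is the application-level effort mentioned above.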
Are there any patterns for mixing eventually consistent systems with legacy ACID systems?
I want to store data in some (at least two) legacy systems on the mainframe that need ACID-like transactions. Those mainframe databases (let us call them OldWorld) run under the same transaction manager in the same process, so consistency within the mainframe systems is not a problem.
I have a transaction manager that can handle XA transactions spanning the mainframe TM and the ACID-capable relational database in the non-mainframe environment (let us call this NewWorld).
But I do not want to use XA transactions, because they often cause trouble with long-running locks on the mainframe side, and in many cases I do not need all ACID features across both worlds. I always want a consistent mainframe (all data in the OldWorld is consistent inside the OldWorld). The NewWorld system can handle inconsistent data (inconsistency between New and Old) when it reads data from the mainframe side. The operations used to store data in the OldWorld are simple and safe "add-only" operations which cannot fail functionally (they can fail technically, but that should always be a temporary failure).
My idea to work around the need for a distributed transaction is to update the data in the OldWorld asynchronously and use an event-sourcing data layer (in the NewWorld) to store the information about what needs to be done in the OldWorld, using "soft-transaction-ids" to prevent double-submitting to the OldWorld. These "soft-transaction-ids" are generated while storing the data to the event-sourcing data layer for a transaction that needs to be carried out in the OldWorld.
I don't have the chance to add my "soft-transaction-ids" to the OldWorld databases, but I can add a new database that stores a "done" state next to the "soft-transaction-id" and make the update of this database part of the OldWorld transactions. Then another async process can read the state information without any locking and update the NewWorld (e.g. update the relational model with data from the event-sourcing store and mark the soft-transaction-id as done ("globally consistent")). The update of the OldWorld will always check that the soft-transaction-id has not already been committed.
As I read through my own write-up, I get the feeling that this is like a global transaction, just with less locking. The knowledge that my update to the OldWorld will functionally succeed is essential; without that you need a manual merge process which can handle the functional conflicts. The NewWorld system needs the ability to handle an inconsistent global state. This can be done by reading the relational database and mimicking the OldSystem data requests by analysing the event-store entries not yet committed into the OldWorld database. For all other transactions I need to use distributed transactions with their locking behaviour.
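To make the idea concrete, here is a minimal sketch of the write path and the async apply step; every object and method name here is hypothetical, it only shows where the soft-transaction-id gives idempotency:

    import uuid

    def record_new_world_change(event_store, payload):
        """NewWorld write path: persist the event together with a
        soft-transaction-id that the OldWorld apply step de-duplicates on."""
        soft_tx_id = str(uuid.uuid4())
        event_store.append({"soft_tx_id": soft_tx_id, "payload": payload})
        return soft_tx_id

    def apply_pending_to_old_world(event_store, old_world, done_store):
        """Async worker: replay not-yet-done events into the OldWorld.
        The add and the done-marker are written in one OldWorld transaction,
        so the marker and the data can never diverge."""
        for event in event_store.pending():
            if done_store.is_done(event["soft_tx_id"]):
                continue  # already applied; this is a retry after a technical failure
            with old_world.transaction() as tx:
                tx.add(event["payload"])                  # safe add-only operation
                done_store.mark_done(event["soft_tx_id"], tx)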
Can you please give me a database design suggestion?
I want to sell tickets for events, but the problem is that the database can become a bottleneck when many users want to buy tickets for the same event simultaneously.
If I keep a counter of tickets left for each event, there will be many updates on this field (locking), but it will be easy to find out how many tickets are left.
If I generate the tickets for each event in advance, it will be hard to know how many tickets are left.
Maybe it would be better if each event used a separate database (if the request volume for that event is expected to be high)?
Maybe reservations should also be asynchronous operations?
Do I have to use a relational database (MySQL, Postgres), or a non-relational database (MongoDB)?
I'm planning to use AWS EC2 servers so I can run more servers if I need them.
I have heard that "relational databases don't scale", but I think I need one because of the transactions and data consistency required when working with a fixed number of tickets. Am I right or not?
Do you know of any resources on the internet for these kinds of topics?
If you sell 100,000 tickets in 5 minutes, you need a database that can handle at least 333 transactions per second. Almost any RDBMS on recent hardware can handle this amount of traffic.
Unless you have a suboptimal database schema and/or SQL, but that's another problem.
First things first: when it comes to selling stuff (ecommerce), you really do need transactional support. This basically excludes NoSQL solutions like MongoDB or Cassandra.
So you must use a database that supports transactions. MySQL does, but not in every storage engine: make sure to use InnoDB and not MyISAM.
Of course, many popular databases support transactions, so it's up to you which one to choose.
Why transactions? Because you need to complete a bunch of database updates and you must be sure that they all succeed as one atomic operation. For example:
1) Make sure a ticket is available.
2) Reduce the number of available tickets by one.
3) Process the credit card and get approval.
4) Record the purchase details in the database.
If any of the operations fails, you must roll back the previous updates. For example, if the credit card is declined, you should roll back the decrement of the available ticket count.
And the database will lock those tables for you, so there is no chance that between step 1 and step 2 someone else tries to purchase a ticket while the count of available tickets has not yet been decreased. Without the lock it would be possible for only 1 ticket to be left available yet sold to 2 people, because the second purchase started between step 1 and step 2 of the first transaction.
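A rough sketch of this flow in Python over a generic DB-API connection (MySQL/InnoDB or Postgres; the schema is invented), using a row lock rather than a full table lock:

    def purchase_ticket(conn, event_id, buyer_id, charge_card):
        """Run steps 1-4 as one transaction; any failure rolls everything back."""
        cur = conn.cursor()
        try:
            # 1) Check availability and lock the row so concurrent buyers wait.
            cur.execute(
                "SELECT tickets_left FROM events WHERE id = %s FOR UPDATE",
                (event_id,),
            )
            (tickets_left,) = cur.fetchone()
            if tickets_left < 1:
                conn.rollback()
                return False

            # 2) Reduce the number of available tickets by one.
            cur.execute(
                "UPDATE events SET tickets_left = tickets_left - 1 WHERE id = %s",
                (event_id,),
            )

            # 3) Charge the card; if it is declined, everything rolls back.
            charge_card(buyer_id)

            # 4) Record the purchase details.
            cur.execute(
                "INSERT INTO purchases (event_id, buyer_id) VALUES (%s, %s)",
                (event_id, buyer_id),
            )
            conn.commit()
            return True
        except Exception:
            conn.rollback()
            raise

Keeping the card charge inside the transaction mirrors the steps above; in practice you might prefer to reserve the ticket first and capture payment outside the lock, which is what the reservation idea in the next answer is about.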
It's essential that you understand this before you start programming an ecommerce project.
Check out this question regarding releasing inventory.
I don't think you'll run into the limits of a relational database system. You need one that handles transactions, however. As I recommended to the poster in the referenced question, you should be able to handle reserved tickets that affect inventory vs tickets on orders where the purchaser bails before the transaction is completed.
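A hedged sketch of that idea (the schema is invented): count availability as capacity minus sold minus unexpired reservations, so abandoned checkouts release their tickets automatically.

    AVAILABLE_TICKETS = """
    SELECT e.capacity
           - (SELECT COUNT(*) FROM purchases p WHERE p.event_id = e.id)
           - (SELECT COUNT(*) FROM reservations r
               WHERE r.event_id = e.id AND r.expires_at > NOW())
    FROM events e
    WHERE e.id = %s
    """

    def available_tickets(conn, event_id):
        """How many tickets can still be offered for sale right now."""
        cur = conn.cursor()
        cur.execute(AVAILABLE_TICKETS, (event_id,))
        (available,) = cur.fetchone()
        return available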
Your question seems broader than database design.
First of all, a relational database will scale perfectly well for this. You may need to consider a web services layer that provides the actual ticket brokering to the end users. There you will be able to manage things in a cached manner, independent of the actual database design. However, you need to think through the appropriate steps for data insertion, update and select in order to optimize your performance.
The first step would be to construct a well-normalized relational model to hold your information.
Second, build a web service interface to interact with the data model.
Then put that behind a user interface and stress test it with many simultaneous transactions.
My bet is that you will then need to rework your web services layer iteratively until you are happy, but your (well-normalized) database will not be causing you any bottleneck issues.