Handling relational model in Cassandra - apache-spark-sql

Background
We have chosen Cassandra as our storage engine since we have an application that must handle async messaging between many users on the website and event storing (some types of analytics, what happens on site and when, etc.). Also we have a voting platform so we are storing votes per users per day and Cassandra are good in those use cases.
Recently we got new requirements to build a relational model on top of our existing system (at least we think it is relational). Some types of political candidates with lists of jobs, education, historical voting, endorsements, etc.
Problem
We have relations which can be edited on both ends (i.e. candidate is supported by companies, but in our admin panel that company can be edited without candidate). A candidate is one row in our Cassandra DB identified by a UUID. On the front end, we would need full information about candidates (political party, schools, jobs, voting history, supporting companies). We want to place the majority of candidate info in a single row so we can read data with a single read. However when we place the list of supporting companies UDT we have problems editing it (we need to change it in company_by_id and candidate_by_id tables).
Question
How to solve the editing problem and relational model issues in our situation?
We came up with couple of solutions:
Track relations in Cassandra with additional index-like tables: candidates_by_supporting_company. When updating company, we update candidates who have that company as well.
Similar to 1, but using secondary index if relation is low carnality and updating based on secondary index (we have 10 political parties so we can place index on political party in candidates table and when political party changes we can change candidates by political party since we have index)
Use a relational database for relational type of data and leave Cassandra to handle only suitable use cases like time-series data, messaging, event sorting (this adds the maintenance cost of one more database, deployment costs and problems since our system is distributed how to have replication of data)
Use Spark to do joins (this will not be the sole purpose of adding Spark to the system, we are thinking of adding it for importing huge data sets in CSV and doing transformation so having Spark will be an added bonus and we can use SparkSQL for places where we need joins)
We are leaning towards option 3 since we will add Spark anyway, we will stay with only Cassandra database (which does not complicate maintenance and deployment of one more database) and we get sort of JOINS and GROUP BY efficient on application level with it.
What do you think?

If you want to use only cassandra the right way to proceed is the number 1: denormalization. But if yu have a lot of relationships it will bring a lot of effort at application level.
If adding an other dbms is not a problem in your environment, using the right tool for the right job is the best choice: number 3 for me

Related

MariaDB data separation in public and private, database design

I am working at a company that merged with another company a while ago.
There we have several business units that are basically equivalent. One in Europe, one in China, each. We already had an in-house MariaDB database, which we want to start sharing.
The problem is that there are different GDPR regulations and contracts that prohibit sharing certain data across sites. So what I can't do, is replicate data across sites and then just hide in from the user in the frontend. The private data has to stay at the facility, it belongs to.
So my idea was to separate each table that we have now and where possibly sensitive information is contained into two tables each.
One say table_contracts_private and table_contracts_public.
This would still seem pretty doable with basic database replication and replicating the public tables across sites. But how would you go about publishing private data? Also how would I best combine private and public data? Just using a view
I just could not find any good mechanisms for this, especially because we would also like to avoid data duplication, so the private entries would need to be removed and replaced by the public ones, which would entail also changing all referencing IDs.
Is this a possible application of sharding?
I'd be really grateful, if someone could point me in the right direction, or if someone has a demo project with similar requirements that I could check out.
Cheers
Is this a possible application of sharding?
I wouldn't think so. Sharding is a performance optimization method. What you need is to support legal constraints. Those are two very different problems.
I think you are on the right track. I call this a "walled garden" approach. You create a database with all non-PII information, using ids only. Nothing that remotely directly identifies people, their addresses, phones, credit cards, and so on. This can be tricky. In some jurisdictions combinations of demographics can be PII.
Some of those ids then refer to another database where you store all the sensitive information; this is the "walled garden". I would recommend that this second database be on a separate server. It has a very restricted access list. And this is where you implement requirements for things like "forgetting" a customer.
In any case, the point is that sharding is not the right approach. You want an application redesign with privacy and security as the top priorities. Happily, this is not actually that hard to implement, although if the databases are changing, you may need period auditing. For instance, in one database I worked with, we discovered that "coupon codes" sometimes contained unencrypted email addresses. Arrgggh!

How to move records from one DB2 database to another DB2 database?

At regular times we want to clean up (delete) records from our production DB (DB2) and move them to an archive DB (also DB2 database having the same schema).
To complete the story there are plenty of foreign key constraints in our DB.
So if record b in table B has a foreign key to record a in table A and we are deleting record a in production DB then also record b must be deleted in the production B and both records must be created in the archive DB.
Of course it is very important that no data gets lost. So that it is not possible that we delete records in the production DB while these records will never be inserted in the archive DB.
What is the best approach to do this ?
FYI I have checked https://www.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.admin.dm.doc/doc/r0024482.html and the proposed solutions have following short comings.
Load utility, Ingest utility, Import utility : is only addressing the part of inserting records in the archive DB. It doesn't cover the full move.
Export utility : is only covering a means of exporting data (which might be imported by the Import utility*.
db2move, restore command, db2relocatedb, ADMIN_COPY_SCHEMA, ADMIN_MOVE_TABLE and split mirror : are not an option if you only want to move specific records meeting a certain condition to the archive DB.
So based on my research, the current best solution seems to be a kind of in-house developed script that is
Exporting the records to move in IXF format
Importing those exported records in the archive DB
deleting those records in the production DB
In order to cause no transaction log full errors, this script should do this in batches (e.g. of 50000 records)
In order to have no foreign key constraint errors in step 3: we must also assure that in step 1 we are also exporting all records having foreign key constraint to the exported records and all records having a foreign key constraint to these records ...
Questions that ask the "best" approach have limited use because the assessment criteria are omitted.
Sometimes the assessment criteria differ between technicians and business people.
Sometimes multiple policies of the client company can determine such criteria, so awareness of local policies and procedures or patterns is crucial .
Often the operational-requirements and security-requirements and licensing-requirements will influence the approach, apart from the skill level and experience of the implementation-team.
Occasionally corporates have specific standardised tools for archival and deletion, or specific patterns sometimes influenced by the industry-sector or even industry-specific regulatory requirements.
As stackoverflow is a programming oriented website, questions like yours can be considered off-topic because you are asking for advice about which design-approach are possible while omitting lots of context that is specific to your company/industry-sector that may well influence the solution pattern.
Some typical requirements or questions that influence the approach are below:
do local security requirements allow the data to leave the Db2 environment? (i.e. data stored on disk outside of Db2 tables). Sometimes this constrains use of export, or load-from-file/pipe). The data can be at risk of modification or inspection or deletion (whether accidental or deliberate) whilst outside of the RDBMS.
the restartability of the solution in the event of runtime errors. This is often a crucial requirement. When copying data between different physical databases (even if the same RDBMS) there are many possibilities of error (network errors, resource issues, concurrency issues, operational issues etc). Must the solution guarantee that any restarts after failures resume from the point of failure, or must cleanup happen and the entire job be restarted? The answer can determine the design.
if federation exists between the two databases (or if it can be added within the Db2-licence terms), then this is often the easiest practical approach to push or pull content. Local and remote tables appear to be in the same logical database which simplifies the approach. The data never needs to leave the RDBMS. This also simplifies restartability of failed jobs. It also allows the data to remain encrypted if that is a requirement.
if SQL-replication or Q-based-replication is licensed then it can be configured to intelligently sync the source and target tables and respect RI if suitably configured. This approach requires significant configuration skills.
if the production database is highly-available, and/or if the archival database is highly-available then the solution must respect the HA approach. Sometimes this prevents use of LOAD, depending on the operating-system platform of the Db2-server.
timing windows for scheduling are often crucial. If the archival+removal job must guarantee to fully complete with specific time intervals this can influence the design pattern.
if fastest rollout is a key requirement then range-partitioning is usually the best option.
There are tools out there for this (such as Optim Archive) which may better satisfy requirements you didn't realize you had.
In the interim - look into federation and the tool asntdiff.
On the archive database you can define a connection to the live database (CREATE SERVER). Using this definition you can define nicknames to the live tables (CREATE NICKNAME). Using these nicknames you can load the appropriate data into your archive table. You can either use your favorite data movement utility - load, import, insert, etc.
Once loaded you can verify the tables by using the asntdiff tool with appropriate selection criteria. The -f option is great.
Once you are satisfied the data exists in both locations you can delete the rows in the live database.
For your foreign key relationships - use the view SYSCAT.TABDEP to find such dependencies. You can define your foreign keys as "not enforced" (or don't define them) in the archive database to avoid errors during the previous process.
Data archiving is a big and common topic regardless of the database. You may also want to look at range partitioned tables for better performance and control.

SQL: Joins vs Denormalization (lots of data)

I know, variations of this question had been asked before. But my case may be a little different :-)
So, I am building a site that tracks events. Each event has id and value. It is also performed by a user, which has id, age, gender, city, country and rank. (these attributes are all integers, if it matters)
I need to be able to quickly get answers to two queries:
get number of events from users with certain profile (for example, males with age 18-25 from Moscow, Russia)
get sum(maybe avg also) of values of events from users with certain profile -
Also, data is generated by multiple customers, which, in turn, can have multiple source_ids.
Access pattern: data will be mostly written by collector processes, but when queried (infrequently, by web ui) it has to respond quickly.
I expect LOTS of data, certainly more than one table or single server can handle.
I am thinking about grouping events in separate tables per day (that is, 'events_20111011'). Also I want to prefix table name with customer id and source id, so that data is isolated and can be trivially discarded (purge old data) and relatively easily moved around (distribute load to other machines).
This way, every such table will have limited amount of rows, let's say, 10M tops.
So, the question is: what to do with user's attributes?
Option 1, normalized: store them in separate table and reference from event tables.
(pro) No repetition of data.
(con) joins, which are expensive (or so
I heard).
(con) this requires user table and event tables to be on
the same server
Option 2, redundant: store user attributes in event tables and index them.
(pro) easier load balancing (self-contained tables can be moved around)
(pro) simpler (faster?) queries
(con) lots of disk space and memory used for repeating user attributes and corresponding indexes
Your design should be normalized, you physical schema may end up denormalized for performance reasons.
Is it possible to do both? There is a reason why SQL Server ships with Analysis Server. Even if you are not in the Microsoft realm, it is a common design to have a transactional system for the data entry and day to day processing while a reporting system is available for the kinds of queries that would cause heavy loads upon the transactional system.
Doing this means you get the best of both worlds: a normalized system for daily operations and a denormalized system for rollup queries.
In most cases nightly updates are fine for reporting systems, but it depends on your hours of operation and other factors what works best. I find most 8-5 businesses have more than enough time in the evening to update a reporting system.
Use an OLAP/Data Warehousing approach. That is, store your data in the standard normalized way, but also store aggregated versions of the data that will be queried frequently in separate fact tables. The user queries won't be on real-time data, but it is usually worth it for the performance trade off.
Also, if you are using SQL Server enterprise I wouldn't roll your own horizontal partitioning scheme (breaking the data into days). There are tools built into SQL server to automatically do that for you.
Please Normalize
use partitions and indexing to balance load

Database for microblogging startup

I will do microblogging web service (for school, so don't blast me for lack of new idea) and I worry that DB could be often be overloaded (user could following other users or even tag so I suppouse that SELECT will be heavy - check 20 latest messages which contains all observing tags and user).
My idea is create another table, and store in it only statusID and userID (who should pick up message). Danger of that is, if some tag or user has many followers there will be a lot of record with that status ID. So, is it good idea? Or maybe better is used M2M relation? (one status -> many receivers)
I think most databases can easily handle large record sets. The responsibility to have it preform lies in your design with properly setting up the indexes. If you create the right indexes the select clauses should perform really well.
I'd go with a users table, a table to have the m2m relationship between users and messages table.
You can then do one select to find all of the users a user is following and then a second select in to get all of the messages of interest (sorting and limiting the results as appropriate). Extending this to tagging should be pretty simple.
This design should be fine for large numbers of users and messages as long as you index the right columns. If you got massive then you could also run the users tables and messages tables to different servers or have read only replicates. I wouldn't even worry about that for the moment - you'd need to be huge.
When implementing Collabinate (http://www.collabinate.com), a service-based engine for microblogging and shared activity streams, I used a graph database. The fact that people create posts and follow other people lends itself to a graph structure. With the right relationships and algorithms, this can be a very efficient and performant solution.

Multi-tenancy with SQL/WCF/Silverlight

We're building a Silverlight application which will be offered as SaaS. The end product is a Silverlight client that connects to a WCF service. As the number of clients is potentially large, updating needs to be easy, preferably so that all instances can be updated in one go.
Not having implemented multi tenancy before, I'm looking for opinions on how to achieve
Easy upgrades
Data security
Scalability
Three different models to consider are listed on msdn
Separate databases. This is not easy to maintain as all schema changes will have to be applied to each customer's database individually. Are there other drawbacks? A pro is data separation and security. This also allows for slight modifications per customer (which might be more hassle than it's worth!)
Shared Database, Separate Schemas. A TenantID column is added to each table. Ensuring that each customer gets the correct data is potentially dangerous. Easy to maintain and scales well (?).
Shared Database, Separate Schemas. Similar to the first model, but each customer has its own set of tables in the database. Hard to restore backups for a single customer. Maintainability otherwise similar to model 1 (?).
Any recommendations on articles on the subject? Has anybody explored something similar with a Silverlight SaaS app? What do I need to consider on the client side?
Depends on the type of application and scale of data. Each one has downfalls.
1a) Separate databases + single instance of WCF/client. Keeping everything in sync will be a challenge. How do you upgrade X number of DB servers at the same time, what if one fails and is now out of sync and not compatible with the client/WCF layer?
1b) "Silos", separate DB/WCF/Client for each customer. You don't have the sync issue but you do have the overhead of managing many different instances of each layer. Also you will have to look at SQL licensing, I can't remember if separate instances of SQL are licensed separately ($$$). Even if you can install as many instances as you want, the overhead of multiple instances will not be trivial after a certain point.
3) Basically same issues as 1a/b except for licensing.
2) Best upgrade/management scenario. You are right that maintaining data isolation is a huge concern (1a technically shares this issue at a higher level). The other issue is if your application is data intensive you have to worry about data scalability. For example if every customer is expected to have tens/hundreds millions rows of data. Then you will start to run into issues and query performance for individual customers due to total customer base volumes. Clients are more forgiving for slowdowns caused by their own data volume. Being told its slow because the other 99 clients data is large is generally a no-go.
Unless you know for a fact you will be dealing with huge data volumes from the start I would probably go with #2 for now, and begin looking at clustering or moving to 1a/b setup if needed in the future.
We also have a SaaS product and we use solution #2 (Shared DB/Shared Schema with TenandId). Some things to consider for Share DB / Same schema for all:
As mention above, high volume of data for one tenant may affect performance of the other tenants if you're not careful; for starters index your tables properly/carefully and never ever do queries that force a table scan. Monitor query performance and at least plan/design to be able to partition your DB later on based some criteria that makes sense for your domain.
Data separation is very very important, you don't want to end up showing a piece of data to some tenant that belongs to other tenant. every query must have a WHERE TenandId = ... in it and you should be able to verify/enforce this during dev.
Extensibility of the schema is something that solutions 1 and 3 may give you, but you can go around it by designing a way to extend the fields that are associated with the documents/tables in your domain that make sense (ie. Metadata for tables as the msdn article mentions)
What about solutions that provide an out of the box architecture like Apprenda's SaaSGrid? They let you make database decisions at deploy and maintenance time and not at design time. It seems they actively transform and manage the data layer, as well as provide an upgrade engine.
I've similar case, but my solution is take both advantage.
Where data and how data being placed is the question from tenant. Being a tenant of course I don't want my data to be shared, I want my data isolated, secure and I can get at anytime I want.
Certain data it possibly share eg: company list. So database should be global and tenant database, just make sure to locked in operation tenant database schema, and procedure to update all tenant database at once.
Anyway SaaS model everything delivered as server / web service, so no matter where the database should come to client as service, then only render by client GUI.
Thanks
Existing answers are good. You should look deeply into the issue of upgrading and managing multiple databases. Without knowing the specific app, it might turn out easier to have multiple databases and not have to pay the extra cost of tracking the TenantID. This might not end up being the right decision, but you should certainly be wary of the dev cost of data sharing.