What is the best practice on Bluemix for purging data from the DB2 storage service? Say we want to purge a large amount of data, for example a million entries of a particular communication to customers.
You may look into this tutorial that describes the data purge algorithm for DB2.
http://www.ibm.com/developerworks/data/library/techarticle/dm-1501data-purge-db2/index.html
However, as SQL Database is a fully managed service, you will not be able to follow the exact instructions as described. For example, you will not be able to tune db cfg and dbm cfg for optimal performance. Also note that you will not have shell access to run a script, so you may have to enter the SQL commands one at a time through a client such as Data Studio.
On the other hand, if you are using the DB2 on Cloud service, you would be able to follow the above instructions.
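When no shell script is available, one common way to purge a large number of rows from a client is to delete in small batches and commit after each one. The following is only a rough sketch of that idea and is not taken from the linked tutorial. It uses Python's built-in sqlite3 module as a stand-in for DB2, and the table name `communications` and column `campaign_id` are hypothetical. DB2 syntax differs (for example, `FETCH FIRST n ROWS ONLY` rather than `LIMIT`).

```python
import sqlite3


def purge_in_batches(conn, campaign_id, batch_size=10_000):
    """Delete matching rows in small committed batches; return rows deleted.

    `communications` and `campaign_id` are hypothetical names used only
    for illustration.
    """
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM communications WHERE rowid IN ("
            " SELECT rowid FROM communications"
            " WHERE campaign_id = ? LIMIT ?)",
            (campaign_id, batch_size),
        )
        conn.commit()  # commit per batch so each transaction stays small
        if cur.rowcount == 0:
            break  # nothing left to purge
        total += cur.rowcount
    return total
```

The batch size is a trade-off: smaller batches hold locks for less time, while larger batches need fewer round trips.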
Related
I have a SQL DB which contains PHI, hosted on AWS. I want to access this data to perform analytics; however, I must de-identify the data first to comply with HIPAA.
How should I approach this? I have thought of a few approaches:
Simply de-identify the DB with SQL commands.
From now on, every time the DB is added to, add a de-identified version of that data to another DB. Then access this DB for analytics.
From now on, every time the DB is added to, add a de-identified version of that data to another table in that DB. Then access this table with SQL commands for analytics.
Which is the best approach to use to maintain compliance with HIPAA? Or, is there a better way?
Thanks!
Budget allowing, consider running your analytics on a separate system and de-identifying the data during the ETL. Changing the source system to accommodate this requirement will make it more complex to maintain and will likely affect other integrations; you might end up with a monolith.
There are various ways to do this. You could use AWS DMS (with ongoing replication) with the DB as the source and S3 as the target (Parquet format). From there you could use Athena for analytics, as jarmod highlighted; it supports Parquet and lets you analyze your data with SQL-like queries. Other options include Redshift, another relational DB, or other analytics platforms.
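As a rough illustration of de-identifying during the ETL step, the transform below drops direct identifiers and replaces the record key with a keyed hash before the row reaches the analytics store. This is a sketch under assumptions, not a statement of what HIPAA requires: the column names, the `patient_id` field, and the choice of HMAC-SHA256 are all hypothetical, and which fields must be removed is a question for your compliance review.

```python
import hashlib
import hmac

# Hypothetical column names treated as direct identifiers in this sketch.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "ssn"}


def deidentify(record, secret_key, id_field="patient_id"):
    """Return a copy of `record` without direct identifiers.

    The original id is replaced by a keyed pseudonym so rows belonging to
    the same person can still be joined in the analytics store.
    """
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    raw_id = str(out.pop(id_field)).encode("utf-8")
    out["pseudo_id"] = hmac.new(secret_key, raw_id, hashlib.sha256).hexdigest()
    return out
```

The key (`secret_key`, as bytes) would have to be kept out of the analytics environment; anyone holding it could recompute the pseudonym for a known id.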
I have a database connected to a website, and data from the website is inserted into that database. I need to transfer data from that database to another primary database (SQL) on another server in real time (minimum latency).
I cannot use transactional replication in this case. What are the alternatives? Can I integrate data streams such as Apache Kafka with SQL Server?
Without more detail it's hard to give a full answer. There's what's technically possible, and there's architecturally what actually makes sense :)
Yes, you can stream from an RDBMS to Kafka, and from Kafka to an RDBMS. You can use the Kafka Connect JDBC source and sink. There are also CDC tools (e.g. Attunity, GoldenGate) that support integration with MS SQL and other RDBMSs.
BUT…it depends why you want the data in the second database. Do you need an exact replica of the first? If so DB-DB replication may be a better option. Kafka's a great option if you want to process the data elsewhere and/or persist it in another store. But if you just want MS SQL-MS SQL…Kafka itself may be overkill.
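To make the query-based approach concrete, here is a minimal sketch of polling a source table with a high-water mark on an incrementing id and copying only the new rows to a target. It is not Kafka Connect code; it uses Python's built-in sqlite3 module as a stand-in for both databases, and the `events` table with `id` and `payload` columns is hypothetical.

```python
import sqlite3


def copy_new_rows(source, target, last_id):
    """Copy rows with id > last_id from source to target.

    Returns the new high-water mark to pass to the next poll.
    `events`, `id`, and `payload` are hypothetical names.
    """
    rows = source.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    if rows:
        target.executemany(
            "INSERT INTO events (id, payload) VALUES (?, ?)", rows
        )
        target.commit()
        last_id = rows[-1][0]  # highest id copied so far
    return last_id
```

Polling on an incrementing id only sees inserts; updates and deletes in the source are missed, which is one reason log-based CDC tools exist.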
Background: We currently have our data split between two relational databases (Oracle and Postgres). There is a need to run ad-hoc queries that involve tables in both databases. Currently we are doing this in one of two ways:
ETL from one database to another. This requires a lot of developer time.
Oracle foreign data wrapper on our Postgres server. This is working, but the queries run extremely slowly.
We already use Google Cloud Platform (for the project that uses the Postgres server). We are familiar with Google BigQuery (BQ).
What we want to do:
We want most of our tables from both these databases (as-is) available at a single location, so querying them is easy and fast. We are thinking of copying over the data from both DB servers into BQ, without doing any transformations.
It looks like we need to take full dumps of our tables on a periodic basis (daily) and update BQ since BQ is append-only. The recent availability of DML in BQ seems to be very limited.
We are aware that loading the tables as is to BQ is not an optimal solution and we need to denormalize for efficiency, but this is a problem we have to solve after analyzing the feasibility.
My question is whether BQ is a good solution for us, and if yes, how to efficiently keep BQ in sync with our DB data, or whether we should look at something else (like say, Redshift)?
WePay has been publishing a series of articles on how they solve these problems. Check out https://wecode.wepay.com/posts/streaming-databases-in-realtime-with-mysql-debezium-kafka.
To keep everything synchronized they:
The flow of data starts with each microservice’s MySQL database. These
databases run in Google Cloud as CloudSQL MySQL instances with GTIDs
enabled. We’ve set up a downstream MySQL cluster specifically for
Debezium. Each CloudSQL instance replicates its data into the Debezium
cluster, which consists of two MySQL machines: a primary (active)
server and secondary (passive) server. This single Debezium cluster is
an operational trick to make it easier for us to operate Debezium.
Rather than having Debezium connect to dozens of microservice
databases directly, we can connect to just a single database. This
also isolates Debezium from impacting the production OLTP workload
that the master CloudSQL instances are handling.
And then:
The Debezium connectors feed the MySQL messages into Kafka (and add
their schemas to the Confluent schema registry), where downstream
systems can consume them. We use our Kafka connect BigQuery connector
to load the MySQL data into BigQuery using BigQuery’s streaming API.
This gives us a data warehouse in BigQuery that is usually less than
30 seconds behind the data that’s in production. Other microservices,
stream processors, and data infrastructure consume the feeds as well.
I read in a few places that SQL Azure data is automatically replicated and that the Azure platform provides redundant copies of the data; therefore, SQL Server high-availability features such as database mirroring and failover clustering aren't needed.
Has anyone had a chance to look deeper into this? Are all those availability enhancements really not needed in Azure? Thanks!
To clarify, I'm talking about SQL as a service and not VM-hosted SQL.
The SQL Database service (database-as-a-service) is a multi-tenant database service, and your databases are triple-replicated within the data center, providing durable storage. The service itself, being large-scale, provides high availability (since there are many VMs running the service itself, along with replicated data). Nothing is needed in terms of mirroring or failover clusters. Having said that: If, say, your particular database became unavailable for a period of time, you'll need to consider how you'll handle that situation (perhaps sync'ing to another SQL Database, maybe even in another data center).
If you go with SQL Database (DBaaS), you'll still need to work out your backup strategy, and possibly syncing with another DC (or on-premises database server) for DR purposes.
More info on SQL Database fault tolerance is here.
The detail you are after is probably in this MSDN article on Business Continuity in Azure SQL Database (see: http://msdn.microsoft.com/en-us/library/windowsazure/hh852669.aspx). At the most basic level, Azure SQL Database keeps three replicas of your database: one primary and two secondaries.
While this helps with BCP / DR scenarios, you may also wish to investigate ways to back up your database so you have point-in-time restore capabilities. More information on backup / restore can be found here: http://msdn.microsoft.com/en-us/library/windowsazure/jj650016.aspx
This is not a traditional scale-up or scale-out question.
Please bear with me; first, allow me to give an example:
I created a SQL Azure server and a 1 GB database inside it, which costs $9.99 a month.
(It has a master database as well, 1 GB, but Microsoft does not charge us for that.)
Here is my question: I need another 1 GB database for my application. Why do I need another 1 GB database? You may ask this because Azure supports databases of up to 50 GB. My answer is distribution: I know the data will eventually reach 50 GB, so I designed the data model to be distributed and spread the data across different databases.
For the sake of performance, which option should I use:
Create another database in same server
Create another server and create a new database inside
Both options cost the same.
I guess option 2 would be better, wouldn't it?
I'm not sure there are strong (or any) performance implications. My understanding is that the consideration is mostly a management one, as some entities, mostly around security, are defined at the server level and some at the database level.
Behind the scenes the model is quite different anyway, and multi-tenant, so having a separate SQL Azure server does not actually mean you get a dedicated server per se. Theoretically, separate servers and separate databases may end up looking exactly the same.
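Whichever layout is chosen, the application still has to decide which database holds a given row. A minimal sketch of that routing, assuming the data is spread by a string key such as a customer id (the key and the database names below are hypothetical):

```python
import hashlib


def pick_database(customer_key, databases):
    """Map a key to one entry of `databases` using a stable hash.

    A cryptographic digest is used instead of Python's built-in hash()
    so the mapping stays the same across processes and restarts.
    """
    digest = hashlib.sha256(customer_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(databases)
    return databases[index]
```

For example, `pick_database("customer-17", ["appdb1", "appdb2"])` always returns the same one of the two names. With plain modulo routing, adding a database later changes the mapping for most keys, so existing rows would need to be moved.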