How to use batch processing for IBM MDM SE 11.6, Virtual MDM implementation

I have a requirement where an external source wants to send their data (CDE) to IBM MDM, which will run it through the PME algorithm and send EIDs back to the external source.
Note: The external source data will not be stored in the MDM database.
I want to go down the batch processing route, where the external source drops a file on a server, which is then picked up by MDM and processed through its PME engine.
Question: I am not familiar with batch processing for IBM MDM. If I can get some pointers/guidance on how to achieve this solution, that would be very helpful.
I proposed one solution using the member search web service, but due to the very high volume of external source data per day (100k-200k records), this option is not feasible.
Thank you for reading.

What is the most performant way to submit a POST to an API

A little background on what I need to accomplish. I am a developer of a cloud-based SaaS application. In one instance, my clients use my application to log the receipt of goods as they come across a conveyor line. Directly before the PC where they are logged into my app, there is another Windows PC that is collecting the moisture and weight of the item from instruments. I (personally, not my app) have full access to this PC and its database. I know how I am going to grab the latest record from the DB via a stored procedure/SQLCMD.
On the receiving end, I have an API endpoint that needs to receive the ID, Weight, Moisture, and ClientID. This all needs to happen in less than ~2 seconds since they are waiting to add this record to my software's database.
What is the most performant way for me to stand up a process that triggers retrieving the record from the DB and then calls the API? I also want to update the record, flagging success on a 200 response. My thought was to script all of this in a batch file and use cURL to make the API call, then call this batch file from a Windows scheduled task. But I feel like there may be a better way with fewer moving parts.
P.S. I am not looking for code solutions per se, just direction or tools that will help. Also, I am using the AWS stack to host my application.
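For reference, here is a minimal sketch of the poll-and-POST step described above, written as a small .NET console job that a Windows scheduled task could run. The table, column, connection string, endpoint URL and hard-coded ClientID are all assumptions made for illustration, not anything confirmed by the question.

```csharp
// Minimal sketch: poll the instrument PC's database for the newest unsent
// reading, POST it to the SaaS API, and flag it as sent on a 200 response.
// Table/column names, connection string and endpoint URL are assumptions.
using System;
using System.Net.Http;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

class ReadingForwarder
{
    const string ConnStr = "Server=localhost;Database=Instruments;Integrated Security=true;TrustServerCertificate=true";
    const string ApiUrl  = "https://example.com/api/readings";   // hypothetical endpoint
    static readonly HttpClient Http = new HttpClient { Timeout = TimeSpan.FromSeconds(2) };

    static async Task Main()
    {
        using var conn = new SqlConnection(ConnStr);
        await conn.OpenAsync();

        // Grab the latest reading that has not been forwarded yet.
        using var cmd = new SqlCommand(
            "SELECT TOP 1 Id, Weight, Moisture FROM Readings WHERE Sent = 0 ORDER BY Id DESC", conn);
        using var reader = await cmd.ExecuteReaderAsync();
        if (!await reader.ReadAsync()) return;            // nothing new to send

        var payload = new
        {
            Id = reader.GetInt32(0),
            Weight = reader.GetDecimal(1),
            Moisture = reader.GetDecimal(2),
            ClientId = 42                                 // hard-coded only for the example
        };
        await reader.CloseAsync();

        var body = new StringContent(JsonSerializer.Serialize(payload), Encoding.UTF8, "application/json");
        var response = await Http.PostAsync(ApiUrl, body);

        if (response.IsSuccessStatusCode)
        {
            // Flag the record so it is not sent again.
            using var update = new SqlCommand("UPDATE Readings SET Sent = 1 WHERE Id = @id", conn);
            update.Parameters.AddWithValue("@id", payload.Id);
            await update.ExecuteNonQueryAsync();
        }
    }
}
```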
The most performant way is to use AWS Amplify. It's a ready-made AWS framework and development environment that can connect your existing DB to a REST API easily.
You can check their documentation on how to build it:
https://docs.amplify.aws/lib/restapi/getting-started/q/platform/js

How does Bigtable handle Tablet Server failures?

I have found that a Bigtable cluster consists of a single master server and multiple tablet servers. During a failure, a message is sent to the tablet server by the master, a backup copy of the tablet is made primary, and an extra replica of the tablet is created automatically by GFS.
Is that the whole story? I want to know how the failure is actually handled: what is the procedure, and which steps are actually followed to resolve it?
There is a single distributed file system that Bigtable uses, not several different ones, and all the tablet servers access the same file system.
In this file system, the tablets (SSTables) are already stored durably and replicated, so when a single tablet server fails, the master assigns a new tablet server to handle the data served by the original one. No data copying is necessary, because the new tablet server can access the same data as the old one on the same distributed file system.
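As a purely illustrative sketch (not Bigtable's actual code), the recovery step can be thought of as the master simply reassigning ownership of the failed server's tablets; the server names and data structures below are invented for the example.

```csharp
// Illustrative-only sketch of the recovery idea described above: tablet data
// lives in a shared distributed file system, so "recovery" is just the master
// reassigning tablet ownership -- no SSTable data is copied anywhere.
using System;
using System.Collections.Generic;
using System.Linq;

class TabletMaster
{
    // tablet key range -> tablet server currently serving it
    readonly Dictionary<string, string> _assignments = new();
    readonly List<string> _liveServers = new() { "ts-1", "ts-2", "ts-3" };  // invented names

    public void AssignTablet(string tabletRange, string server) =>
        _assignments[tabletRange] = server;

    public void OnTabletServerLost(string deadServer)
    {
        _liveServers.Remove(deadServer);

        // Every tablet the dead server was serving is handed to a live server.
        // The new server opens the same SSTables from the shared file system,
        // replays the commit log, and starts serving -- nothing is copied.
        foreach (var tablet in _assignments.Where(a => a.Value == deadServer)
                                           .Select(a => a.Key).ToList())
        {
            var newServer = _liveServers[Math.Abs(tablet.GetHashCode()) % _liveServers.Count];
            _assignments[tablet] = newServer;
            Console.WriteLine($"Tablet {tablet} reassigned from {deadServer} to {newServer}");
        }
    }
}
```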

Bulk user account creation from CSV data import/ingestion

Hi all brilliant minds,
I am currently working on a fairly complex problem and I would love to get some idea brainstorming going on. I have a C# .NET web application running in Windows Azure, using SQL Azure as the primary datastore.
Every time a new user creates an account, all they need to provide is the name, email and password. Upon account creation, we store the core membership data in the SQL database, and all the secondary operations (e.g. sending emails, establishing social relationships, creating profile assets, etc.) get pushed onto an Azure Queue and get picked up/processed later.
Now I have a couple of CSV files that contain hundreds of new users (names & emails) that need to be created on the system. I am thinking of automating this by breaking it into two parts:
Part 1: Write a service that ingests the CSV files, parses out the names & emails, and saves this data in storage A
This service should be flexible enough to take files with different formats
This service does not actually create the user accounts, so this is decoupled from the business logic layer of our application
The choice of storage does not have to be SQL; it could also be a non-relational datastore (e.g. Azure Tables)
This service could be a third-party solution outside of our application platform - so it is open to all suggestions
Part 2: Write a process that periodically goes through storage A and creates the user accounts from there
This is in the "business logic layer" of our application
Whenever an account is successfully created, mark that specific record in storage A as processed
This needs to be retryable in case of failures in user account creation
I'm wondering if anyone has experience with importing bulk "users" from files, and if what I am suggesting sounds like a decent solution.
Note that Part 1 could be a third-party solution outside of our application platform, so there's no restriction on what language/platform it has to run on. We are thinking about using either BULK INSERT or Microsoft SQL Server Integration Services 2008 (SSIS) to ingest and load the data from CSV into the SQL datastore. If anyone has worked with these and can provide some pointers, that would be greatly appreciated too. Thanks so much in advance!
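For what it's worth, here is a rough sketch of what Part 1 could look like if storage A is an Azure Table. The table name, the "Name,Email" CSV layout, and the use of the current Azure.Data.Tables SDK are assumptions for illustration only (the original setup predates that SDK).

```csharp
// Rough sketch of Part 1: parse a CSV of "Name,Email" rows and stage them in
// an Azure Table ("storage A") for later processing. Header handling and
// validation are deliberately omitted; names here are placeholders.
using System;
using System.IO;
using Azure.Data.Tables;

class CsvStagingImporter
{
    public static void Ingest(string csvPath, string connectionString)
    {
        var table = new TableClient(connectionString, "PendingUsers");
        table.CreateIfNotExists();

        foreach (var line in File.ReadLines(csvPath))
        {
            var parts = line.Split(',');
            if (parts.Length < 2) continue;                  // skip malformed rows

            var entity = new TableEntity("pending", Guid.NewGuid().ToString())
            {
                ["Name"]      = parts[0].Trim(),
                ["Email"]     = parts[1].Trim(),
                ["Processed"] = false                        // Part 2 flips this on success
            };
            table.AddEntity(entity);
        }
    }
}
```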
If I understand this correctly, you already have a process that picks up messages from a queue and runs its core logic to create the user assets/etc. So it sounds like you only need to automate parsing the CSV files and dumping the contents into queue messages? That sounds like a trivial task.
You can also kick off the processing of the CSV file via a queue message (to a different queue). The message would contain the location of the CSV file, and the Worker Role running in Azure would pick it up (it could be the same worker role as the one that processes new users if the usual load is not high).
Since you're utilizing queues, the process is retryable.
HTH
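A rough sketch of that idea: turn each CSV row directly into a queue message that the existing account-creation worker already knows how to process. The queue name, the message shape, and the use of the current Azure.Storage.Queues SDK are assumptions for illustration only.

```csharp
// Sketch of the suggestion above: one queue message per parsed CSV row, so the
// normal account-creation pipeline does the rest. Names are placeholders.
using System.IO;
using System.Text.Json;
using Azure.Storage.Queues;

class CsvToQueue
{
    public static void Enqueue(string csvPath, string connectionString)
    {
        var queue = new QueueClient(connectionString, "new-user-signups",
            new QueueClientOptions { MessageEncoding = QueueMessageEncoding.Base64 });
        queue.CreateIfNotExists();

        foreach (var line in File.ReadLines(csvPath))
        {
            var parts = line.Split(',');
            if (parts.Length < 2) continue;                  // skip malformed rows

            // The worker dequeues this and runs the normal account-creation logic;
            // messages that fail become visible again, which gives retry for free.
            queue.SendMessage(JsonSerializer.Serialize(
                new { Name = parts[0].Trim(), Email = parts[1].Trim() }));
        }
    }
}
```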

SQL Service Broker: Collecting data -- plug-in scenario analysis

(2nd update from 2012/12/06 -- new protocol, a slightly different view)
The question is whether the solution below seems reasonable to you, or whether there is any flaw that I did not notice (being quite new to SQL Server Service Broker)...
I would like to continue the analysis of the problem presented in SQL Service Broker: Collecting data from distributed sources. I would like to focus on the protocol to be used when collecting data from the satellite SQL servers. The usage of SQL Server Service Broker is a must -- it is also dictated by other reasons not presented here. So, please, do not suggest completely alternative solutions.
I would like to focus on the details of what should be done and how to use the Service Broker naturally (in the best possible way) for this exact problem. The overall goal was presented in the above-mentioned question. The picture first:
Now more details to be considered...
Plug-in architecture wanted
The satellite machines are related to real physical production lines. It can happen that some machine is added to the technology process, some machine disappears, or some machine is replaced in the sense that it uses the same production-line identification but is physically different -- i.e. its SQL server is a different instance.
The central server knows nothing about a satellite until it gets the first messages from it. There is no centralized database of the satellite servers, and no knowledge about which and how many satellite SQL servers are to be included in the system. It is always decided at the satellite site.
Any activity related to collecting the data should be initiated by events generated by the satellite machines.
Important: The goal is to continually transfer all the newly created data (from sensors), and to discover and fix drop-outs -- independently of whatever could cause them.
To give you a concrete example:
1. The machine identified by line number 3 (yellow) was recently added to the environment. Its SQL Server Express was launched and it started to collect the sensor data (a third-party solution, a dedicated table with a special structure). The machine was not connected to the central server yet.
2. The only configured things are the reliably assigned fixed identification of the production line (here 3) and all the necessary details to connect to the central SQL server. But the central SQL server does not know this information. The central is just ready to accept data from any new source, but never knows when. (It was already tested for one machine using the approach suggested by Remus Rusanu's answer to the question SQL Service Broker — one central SQL and more satelite SQL….)
3. The piece of SQL software is deployed on machine 3 just a bit later. It starts to talk to the central. The satellite part is not dumb; its own activity is to send the sensor data whenever a new record is inserted into the sensor data table (see point 1 above). From the record, the UTC time is calculated (from the proprietary format), the several sensor values in one record are converted to the same number of normalized records (formatted as one XML message), and the message is sent to the central SQL server.
4. The central is activated by the message with the sensor data sent from the satellite machine. Failures of the physical connection are masked by the Service Broker queues.
5. After a reasonable interval (here one hour), the central server checks whether the data collected so far should be processed or not. There is a work unit that takes some production time, and the data should be processed and added to the documentation of the unit. The processing should happen only once the unit is finished.
6. The central also checks whether it has all the data for the unit. As the sensor sampling is done at known regular intervals (here about 1 minute), the central can check whether there are some drop-outs. There is also an initial "drop-out" for the time interval when the satellite was not connected to the central via SSB. The mechanism should recover from any such situation. It can also happen that the sensors were out of order or the data were not collected. A detected drop-out at the central effectively means that the central asks: "I have no data from you for this time interval. Send them if they exist, or tell me they do not exist." (A minimal gap-detection sketch follows after these points.)
7. The satellite should send only as much data as can be sent between sampling times. Recovery from drop-outs can therefore be rather slow. The delay in processing the data at the central server is not critical. However, the central should know when the data is ready (or does not exist for the detected time interval).
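A minimal sketch of the gap detection mentioned in point 6, assuming the timestamps already received for one production line are available in memory and the sampling interval is roughly one minute; the types and tolerance are assumptions, not part of the described protocol.

```csharp
// Sketch of the drop-out check: given the timestamps already received for one
// production line, find the gaps where samples are missing so the central can
// ask the satellite for exactly those intervals.
using System;
using System.Collections.Generic;
using System.Linq;

static class DropOutDetector
{
    public static IEnumerable<(DateTime From, DateTime To)> FindGaps(
        IEnumerable<DateTime> receivedUtc, TimeSpan samplingInterval)
    {
        var ordered = receivedUtc.OrderBy(t => t).ToList();
        for (int i = 1; i < ordered.Count; i++)
        {
            // Anything noticeably longer than one sampling interval is a drop-out.
            if (ordered[i] - ordered[i - 1] > samplingInterval + samplingInterval)
                yield return (ordered[i - 1] + samplingInterval, ordered[i] - samplingInterval);
        }
    }
}
```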
Some picture, more solution details
I have chosen "Recycling conversations" by Remus Rusanu as the basic framework for the communication between the satellite and the central. It defines the EndOfStream message type to signal that the conversation handle should be thrown away and a new one should be used. The lifetime is limited by the above-mentioned one-hour interval generated by the Service Broker timer.
The message is also (mis)used at the central server for activation of the data processing. At about the same time, the central checks for drop-outs. The central keeps the time below which the drop-outs have already been checked. This way it knows which data are ready to be processed.
Do you consider the scenario reasonable? Can you see any problem with it?
(I am going to refine the question to reflect your suggestions.)
Thanks for your time and experience, and have a nice day.
Petr
All data should be stored in a table. On the satellite side, you should create a table in which the last processed row is stored. When a new request from the central arrives, a new data pack is sent back to the central based on the last processed record value.
Note: I recommend limiting the number of rows to be sent, depending on your data, so as not to create very large data packs.
When the central has processed all rows, an appropriate message should be sent to the satellite. It should also contain information about any data import errors that occurred.
You can start the Service Broker conversation when database activity is registered (using DML/DDL triggers on both the central and satellite databases) or on a schedule (using a SQL Server Agent job on the central).
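As a hedged illustration (not a complete solution), the satellite-side data pack could be built and sent from application code via ADO.NET, using the last-processed-row table and the row cap described above. All Service Broker object names, table names, and the XML shape are placeholders invented for this sketch.

```csharp
// Sketch: pick up sensor rows beyond the last processed one (capped at 500),
// wrap them in a single XML message, SEND it on a new Service Broker dialog,
// and record how far we got. Every object name here is a placeholder.
using Microsoft.Data.SqlClient;

class SatelliteSender
{
    const string SendBatch = @"
DECLARE @lastId int = (SELECT LastProcessedId FROM dbo.SendState);

SELECT TOP (500) Id, SampleUtc, SensorId, Value
INTO #pack
FROM dbo.SensorData
WHERE Id > @lastId
ORDER BY Id;

IF EXISTS (SELECT 1 FROM #pack)
BEGIN
    DECLARE @handle uniqueidentifier;
    DECLARE @body xml =
        (SELECT Id, SampleUtc, SensorId, Value
           FROM #pack ORDER BY Id
            FOR XML PATH('Sample'), ROOT('Samples'));

    BEGIN DIALOG CONVERSATION @handle
        FROM SERVICE [SatelliteService]
        TO SERVICE 'CentralService'
        ON CONTRACT [SensorDataContract]
        WITH ENCRYPTION = OFF;

    SEND ON CONVERSATION @handle
        MESSAGE TYPE [SensorDataMessage] (@body);

    -- Remember how far we got so the next pack starts after this one.
    UPDATE dbo.SendState
       SET LastProcessedId = (SELECT MAX(Id) FROM #pack);
END";

    public static void SendNewSamples(string connectionString)
    {
        using var conn = new SqlConnection(connectionString);
        conn.Open();
        using var cmd = new SqlCommand(SendBatch, conn);
        cmd.ExecuteNonQuery();
    }
}
```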

Application Level Replication Technologies

I am building out a solution that will be deployed in multiple data centers in multiple regions around the world, with each data center having a replicated copy of data actively updated in each region. I will have a combination of multiple databases and file systems in each data center, the state of which must be kept consistent (within a data center). These multiple repositories will be fronted by a SOA service tier.
I can tolerate some latency in the replication, and need to allow for regions to be off-line, and then catch up later.
Given the multiple back-end repositories of data, I can't easily rely on independent replication solutions for each one to maintain a consistent state. I am thus led to implementing replication at the application layer -- by replicating the SOA requests in some manner. I'll need to make sure that replication loops don't occur, and that last-writer conditions are sorted out correctly.
In your experience, what is the best pattern for solving this problem, and are there good products (free or otherwise) that should be investigated?
Lotus/Domino is your answer. I've been working with it for ten years and it's exactly what you need. It may not be trendy (a perception that I would challenge) but it's powerful, adaptable and very secure. The latest version, R8, is the best yet.
You should definitely consider IBM Lotus Domino. A Lotus Notes database can replicate between sites on a predefined schedule. Replication in Notes/Domino is definitely a very powerful feature and enables full replication of data between sites. Even if a server is unavailable, the next time it connects it will simply replicate and get back in sync.
As for the SOA service tier, you could then use Domino Designer to write a web service. Since Notes/Domino 7.5.x (I believe), Domino has been able to provision and consume web services.
As others have advised, I will also recommend Lotus Notes/Domino. 8.5 is a really powerful application development platform.
You don't give enough specifics to be certain of your needs, but I think you should check out SQL Server merge replication. It allows for asynchronous replication of multiple databases with full conflict resolution. You will need to designate a global master, and all the other databases will replicate to that one, but all the database instances are fully functional (read/write), so you can schedule replication at whatever intervals suit you. If any region goes offline, it can catch up later with no issues -- if the master goes offline, everyone will work independently until replication can resume.
I would be interested to know of other solutions this flexible (apart from Lotus Notes/Domino of course which is not very trendy these days).
I think that your answer is going to have to be based on a pub/sub architecture. I am assuming that you have reliable messaging between your data centers so that you can rely on published updates eventually being received. If all of your access to the data repositories is via services, you can add an event notification to the orchestration of each of your update services that notifies all interested data centers of the event. Ideally the master database is the only one that sends out these updates. If the master database is the only one sending the updates, you can exclude routing the notifications to the node that generated them in the first place, thus avoiding update loops.
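A toy sketch of the loop-avoidance part of that idea: each replicated update carries the data center that originated it, and a node never applies or re-publishes its own events. The types and the in-memory fan-out are assumptions standing in for real durable messaging between regions.

```csharp
// Toy sketch of origin-tagged pub/sub replication: events carry their origin,
// and a data center drops events that came from itself, so no update loops form.
using System;
using System.Collections.Generic;

record UpdateEvent(string OriginDataCenter, string EntityId, string Payload);

class DataCenterNode
{
    public string Name { get; }
    readonly List<DataCenterNode> _peers = new();

    public DataCenterNode(string name) => Name = name;
    public void AddPeer(DataCenterNode peer) => _peers.Add(peer);

    // Called for local writes coming in through the SOA service tier.
    public void ApplyLocalWrite(string entityId, string payload)
    {
        Persist(entityId, payload);
        var evt = new UpdateEvent(Name, entityId, payload);
        foreach (var peer in _peers)
            peer.OnReplicatedEvent(evt);                 // publish to subscribers
    }

    // Called for events received from other data centers.
    public void OnReplicatedEvent(UpdateEvent evt)
    {
        if (evt.OriginDataCenter == Name) return;        // our own event: drop it, no loop
        Persist(evt.EntityId, evt.Payload);
        // Deliberately NOT re-published: only the origin fans out, so a
        // two-hop cycle (A -> B -> A) can never form.
    }

    void Persist(string entityId, string payload) =>
        Console.WriteLine($"[{Name}] {entityId} = {payload}");
}
```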