OpenRefine (formerly Google Refine) on private data - openrefine

Can OpenRefine be used securely for cleaning private data?
I've tried it on public data and it works, but I'm not sure about data security.

No data leaves your machine when using OpenRefine. Although it's implemented as a web client/server app, the OpenRefine server runs on your personal machine and nothing is sent out over the Internet. You can verify this by turning off your external network connections.
The one exception is if you use any external reconciliation services, in which case the data that you select as reconciliation criteria will be sent to whatever reconciliation service you are using. All of the data cleaning and transformation operations can be done without any external network services.

Related

How to create a process in Dell Boomi that gets data from one database and sends it to a SaaS

I would like to know how to create a process in Dell Boomi that meets the following criteria:
Read data directly from a production database table, then send the data to a SaaS (public internet) using a REST API.
Another process will read data from the SaaS (REST API) and then write it to another database table.
Please see the attached link for what I have done so far; I really don't know how to proceed. Hope you can help me out. Thank you. Boomi DB connector
You are actually off to a good start. For the first process (DB > SaaS) you need to:
1. Ensure you have access to the DB - if your Atom is local then this shouldn't be much of an issue, but if it is on the Boomi Cloud, then you need to enable access to this DB from the internet (not something I would recommend).
2. Check what you need to read and define the Boomi Operation - from the image you have linked I can see that you are doing that, but without knowing what data you need and how it is structured, it is impossible to say whether you have defined it all correctly.
3. Transform the data to the output system's format - once you get the data from the DB, use the Map shape to map it to the Profile of the SaaS you are sending your data to.
4. Send the data to the SaaS - you can use the HTTP Client connector to send data in JSON or XML (or any other format you like) to the SaaS REST API.
For the other process (SaaS > DB) the steps are practically the same, just in reverse order.

How to handle a temporarily unreachable online API

This is a more general question, so bear with my abstraction of the following problem.
I'm currently developing an application that interfaces with a remote server over a public API. The API in question provides mechanisms for fetching data based on a timestamp (e.g. "get me everything that changed since xxx"). Since the amount of data is quite large, I keep a local copy in a database and check for changes on the remote side every hour.
While this makes the application robust against network problems (remote server in maintenance, network outage, etc.) and enables employees to continue working with the application, there is one big gaping problem:
The API in question also offers write access. E.g. my application can instruct the remote server to create a new object. Currently I'm sending the request via the API and, upon success, creating the object in my local database too. It will eventually propagate via the hourly data fetching, where my application (ideally) sees that no changes need to be made to the local database.
Now when the API is unreachable, I create the object in my database and cache the request until the API is reachable again (a rough sketch of this flow follows the list below). This has multiple problems:
If the request fails (due to errors that cannot be validated beforehand), I end up with an object in the database which shouldn't even exist. I could delete it, but it seems hard to explain to the user(s) ("something went wrong with the API, so we deleted that object again").
The problem especially cascades when dependent actions queue up, e.g. creating the object and then two more requests modifying it. When the initial create fails, so will the modifying requests (since the object does not exist on the remote side).
The worst case is deletion - when an object is deleted locally but will not be deleted on the remote side, I have no (easy) way of restoring it.
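For illustration, here is a rough sketch (in C#, since the question doesn't specify a platform; the IRemoteApi and ILocalStore types are invented placeholders, not from any library) of the flow described above: try the remote call first, create the object locally either way so employees can keep working, and cache the request for replay when the call fails.

    using System;
    using System.Collections.Generic;

    // Placeholder abstractions, just to sketch the flow described in the question.
    public interface IRemoteApi  { void CreateObject(string payload); }
    public interface ILocalStore { void CreateObject(string payload); }

    public class SyncingClient
    {
        private readonly IRemoteApi _api;
        private readonly ILocalStore _local;
        // Requests that could not be sent yet; in practice this should be durable.
        private readonly Queue<Action<IRemoteApi>> _pending = new Queue<Action<IRemoteApi>>();

        public SyncingClient(IRemoteApi api, ILocalStore local)
        {
            _api = api;
            _local = local;
        }

        public void CreateObject(string payload)
        {
            try
            {
                _api.CreateObject(payload);   // try the remote call first
            }
            catch (Exception)                 // API unreachable (timeout, outage, ...)
            {
                // Cache the request for later replay; this is where the problems above
                // start, because the replay may still fail while the local object exists.
                _pending.Enqueue(api => api.CreateObject(payload));
            }
            _local.CreateObject(payload);     // create locally so employees can keep working
        }

        // Called periodically, e.g. together with the hourly sync.
        public void ReplayPendingRequests()
        {
            while (_pending.Count > 0)
            {
                var request = _pending.Dequeue();
                request(_api);                // a permanent failure here needs compensation
            }
        }
    }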
One might suggest never creating objects locally and letting them propagate only through the hourly data sync. Unfortunately, this is not an option. If the API is not accessible, it can stay that way for hours, and it is mandatory that employees can continue working with the application (which they cannot do when said objects don't exist locally).
So bottom line:
How do I handle such a scenario, where the API might not be reachable but certain requests must be cached locally and repeated when the API is reachable again? Especially, how do I handle cases where those requests fail unpredictably?

Bulk user account creation from CSV data import/ingestion

Hi all brilliant minds,
I am currently working on a fairly complex problem and I would love to get some brainstorming going. I have a C# .NET web application running in Windows Azure, using SQL Azure as the primary datastore.
Every time a new user creates an account, all they need to provide is their name, email, and password. Upon account creation, we store the core membership data in the SQL database, and all the secondary operations (e.g. sending emails, establishing social relationships, creating profile assets, etc.) get pushed onto an Azure Queue and are picked up/processed later.
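For context, the enqueue step described here might look roughly like the sketch below, using the classic Azure Storage SDK; the queue name and message shape are made up for illustration, and the SQL write is omitted.

    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Queue;
    using Newtonsoft.Json;

    public class AccountService
    {
        private readonly CloudQueue _secondaryOpsQueue;

        public AccountService(string storageConnectionString)
        {
            var account = CloudStorageAccount.Parse(storageConnectionString);
            var client = account.CreateCloudQueueClient();
            _secondaryOpsQueue = client.GetQueueReference("secondary-operations"); // illustrative name
            _secondaryOpsQueue.CreateIfNotExists();
        }

        public void CreateAccount(string name, string email, string password)
        {
            // 1. Store the core membership data in SQL Azure (omitted here).
            // 2. Push the secondary operations onto the queue to be processed later.
            var message = JsonConvert.SerializeObject(new { Name = name, Email = email });
            _secondaryOpsQueue.AddMessage(new CloudQueueMessage(message));
        }
    }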
Now I have a couple of CSV files that contain hundreds of new users (names & emails) that need to be created on the system. I am thinking of automating this by breaking it into two parts:
Part 1: Write a service that ingests the CSV files, parses out the names & emails, and saves this data in storage A
This service should be flexible enough to take files with different formats
This service does not actually create the user accounts, so this is decoupled from the business logic layer of our application
The choice of storage does not have to be SQL; it could also be a non-relational datastore (e.g. Azure Tables)
This service could be a third-party solution outside of our application platform - so it is open to all suggestions
Part 2: Write a process that periodically goes through storage A and creates the user accounts from there
This is in the "business logic layer" of our application
Whenever an account is successfully created, mark that specific record in storage A as processed
This needs to be retryable in case of failures during user account creation
I'm wondering if anyone has experience with importing bulk "users" from files, and if what I am suggesting sounds like a decent solution.
Note that Part 1 could be a third-party solution outside of our application platform, so there's no restriction on what language/platform it has to run on. We are thinking about using either BULK INSERT or Microsoft SQL Server Integration Services 2008 (SSIS) to ingest and load the data from CSV into the SQL datastore. If anyone has worked with these and can provide some pointers, that would be greatly appreciated too. Thanks so much in advance!
If I understand this correctly, you already have a process that picks up messages from a queue and does its core logic to create the user assets/etc. So it sounds like you only need to automate parsing the CSV files and dumping the contents into queue messages? That sounds like a fairly trivial task.
You can also kick off the processing of the CSV file via a queue message (to a different queue). The message would contain the location of the CSV file, and the Worker Role running in Azure would pick it up (it could be the same worker role as the one that processes new users, if the usual load is not high).
Since you're utilizing queues, the process is retryable.
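As a rough sketch (the queue name, CSV layout, and message format are assumptions, and a real implementation would use a proper CSV parser), the CSV-to-queue step and the retry behaviour the queue gives you could look like this with the classic Azure Storage SDK:

    using System;
    using System.IO;
    using Microsoft.WindowsAzure.Storage.Queue;

    public static class CsvIngestion
    {
        // Reads a simple "name,email" CSV and turns each row into a new-user queue message.
        public static void EnqueueUsersFromCsv(string csvPath, CloudQueue newUserQueue)
        {
            foreach (var line in File.ReadLines(csvPath))
            {
                var parts = line.Split(',');
                if (parts.Length < 2) continue;   // skip malformed rows
                var message = parts[0].Trim() + ";" + parts[1].Trim();
                newUserQueue.AddMessage(new CloudQueueMessage(message));
            }
        }

        // Worker-role side: a message that is not deleted becomes visible again after the
        // visibility timeout, which is what makes the processing retryable.
        public static void ProcessOne(CloudQueue newUserQueue)
        {
            var msg = newUserQueue.GetMessage(TimeSpan.FromMinutes(5)); // invisible while we work
            if (msg == null) return;

            CreateUserAccount(msg.AsString);   // the existing business logic (not shown)
            newUserQueue.DeleteMessage(msg);   // only delete after the account was created
        }

        private static void CreateUserAccount(string nameAndEmail) { /* ... */ }
    }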
HTH

Azure WCF web role / worker role - confused

I have several hardware devices that send large amounts of data to the cloud. I need to store the data in the cloud, process it, and send status reports based on the analysed data to the clients who are interested in those results. The clients are smartphone users.
A single client is interested in the status reports of one or more hardware devices.
I need to make this scalable using Azure, i.e. be able to monitor thousands of hardware devices. I need cloud storage, cloud computing power, and the ability to take in data from many devices and send reports to the clients that are monitoring them.
I am new to WCF and Azure; any guidance on how to write a scalable application using WCF and Azure would be very useful. Please explain how it can be made scalable. Do I have to use a worker role / web role? I have some computationally intensive data processing to do in order to produce the reports that clients are interested in.
Shashi
Sounds like an interesting project...
You can host WCF Services in a WCF Service Web Role, which is a web role with starting artifacts for hosting WCF services.
For intensive processing you can use worker roles. When data is received, a WCF service can place a message on a Service Bus queue, which will be received by a worker role, which can process the data asynchronously.
For data storage you could look at the Table and Blob storage in Windows Azure Storage, or look at Windows Azure SQL Database if you need relational storage. There are advantages and disadvantages to both approaches.
There is quite a lot of technology to evaluate, so it might be worth running through a few tutorials to get an idea of what will make for the best implementation. The Windows Azure Training Kit is a good starting place for this.
http://www.microsoft.com/en-us/download/details.aspx?id=8396
Regards,
Alan
You can scale by increasing the number of instances of both web and worker roles based on the load. Azure roles (cloud services) are stateless (they won't support sticky sessions by default), so requests from the same client will be distributed evenly across all your instances (round robin).
#coolshashi.
By default one Azure Cloud Solution can consist of 5 different roles (the mix of Web or Worker doesn't matter). Each of those roles can have multiple instances.
For example, 7 instances of your web role could form your front-end web farm that places orders on a Service Bus queue. These orders might be read by 2 instances of your worker role, which process them and put them into a database.
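As a minimal sketch of that example (the connection string, queue name, and message body are placeholders), the hand-off between the two roles with the classic Service Bus SDK looks roughly like this:

    using Microsoft.ServiceBus.Messaging;

    public static class OrderQueue
    {
        private const string QueueName = "orders"; // illustrative queue name

        // Web role side: accept the request and hand it off to the queue.
        public static void SendOrder(string connectionString, string orderJson)
        {
            var client = QueueClient.CreateFromConnectionString(connectionString, QueueName);
            client.Send(new BrokeredMessage(orderJson));
        }

        // Worker role side: typically called in a loop from the worker role's Run() method.
        public static void ReceiveAndProcess(string connectionString)
        {
            var client = QueueClient.CreateFromConnectionString(connectionString, QueueName);
            BrokeredMessage message = client.Receive();
            if (message == null) return;

            var orderJson = message.GetBody<string>();
            // ... computationally intensive processing, then store the result ...
            message.Complete();   // removes the message from the queue once processed
        }
    }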
The only difference between a web & worker role is that the Web Role has IIS installed & started.
It is easy to configure the number of instances per role to change dynamically based on some metric you define (e.g. CPU use or messages in a queue), so the solution can scale up to handle load and shrink to save money when it's not required.
Most Azure subscriptions (or accounts) are initially constrained to 20 cores. This is to prevent you from accidentally creating a massive bill. If your solution requires more, a quick chat to Microsoft can remove that limit to give you as much as you desire.

How to upload a file in WCF along with identifying credentials?

I've got an issue with WCF, streaming, and security that isn't the biggest deal but I wanted to get people's thoughts on how I could get around it.
I need to allow clients to upload files to a server, and I'm allowing this by using the transferMode="StreamedRequest" feature of the BasicHttpBinding. When they upload a file, I'd like to transactionally place this file in the file system and update the database with the metadata for the file (I'm actually using Sql Server 2008's FILESTREAM data type, that natively supports this). I'm using WCF Windows Authentication and delegating the Kerberos credentials to SQL Server for all my database authentication.
The problem is that, as the exception I get helpfully notes, "HTTP request streaming cannot be used in conjunction with HTTP authentication." So, for my upload file service, I can't pass the Windows authentication token along with my message call. Even if I weren't using SQL Server logins, I wouldn't even be able to identify my calling client by their Windows credentials.
I've worked around this temporarily by leaving the upload method unsecured, and having it dump the file to a temporary store and return a locator GUID. The client then makes a second call to a secure, non-streaming service, passing the GUID, which uploads the file from the temporary store to the database using Windows authentication.
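For what it's worth, the two-step workaround described above could be sketched roughly as below; the contract names and the temp-store path are invented for illustration. The first contract sits behind an unsecured BasicHttpBinding with TransferMode.StreamedRequest, the second behind a buffered binding with Windows authentication, which is why the caller's identity is only known on the second call.

    using System;
    using System.IO;
    using System.ServiceModel;

    // Unsecured, streamed endpoint: transferMode = StreamedRequest, no HTTP authentication.
    [ServiceContract]
    public interface IFileStreamUpload
    {
        [OperationContract]
        Guid UploadToTempStore(Stream fileData);   // returns a locator GUID
    }

    // Secured, non-streamed endpoint: Windows authentication, so the caller is known.
    [ServiceContract]
    public interface IFileCommit
    {
        [OperationContract]
        void CommitUpload(Guid locator, string fileName);   // moves the temp file into FILESTREAM
    }

    public class FileStreamUploadService : IFileStreamUpload
    {
        public Guid UploadToTempStore(Stream fileData)
        {
            var locator = Guid.NewGuid();
            // A shared location so the second call can find the file behind a load balancer.
            var tempPath = Path.Combine(@"\\shared\temp-uploads", locator.ToString());
            using (var target = File.Create(tempPath))
            {
                fileData.CopyTo(target);   // stream straight to disk without buffering in memory
            }
            return locator;
        }
    }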
Obviously, this isn't ideal. From a performance point of view, I'm doing an extra read/write to the disk. From a scalability point of view, there's (in principle, with a load balancer) no guarantee that I hit the same server with the two subsequent calls, which means the temporary file store needs to be in a shared location - not a scalable design.
Can anybody think of a better way to deal with this situation? Like I said, it's not the biggest deal, since a) I really don't need to scale this thing out much, there aren't too many users, and b) it's not like these uploads/downloads are getting called a lot. But still, I'd like to know if I'm missing an obvious solution here.
Thanks,
Daniel