Using the Apache Hudi library in Java clients - apache-hudi

I am a Hudi newbie. I was wondering if the Hudi client libraries can be used straight from Java clients to write to Amazon S3 folders. I am trying to build a system that can store a large number of events, up to 50k/second, emitted from a distributed system of over 10 components. Could I build a simple client using a Hudi client library that buffers this data and then periodically writes it into a Hudi datastore?
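Yes, Hudi ships a Java write client that can be used without Spark and pointed at an S3 path. Below is a minimal sketch of a buffering writer, loosely based on Hudi's HoodieJavaWriteClientExample; exact class and builder names vary between Hudi releases, and the table path, schema handling, and record keys here are assumptions, not a definitive implementation.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.UUID;

    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hudi.client.HoodieJavaWriteClient;
    import org.apache.hudi.client.common.HoodieJavaEngineContext;
    import org.apache.hudi.common.model.HoodieAvroPayload;
    import org.apache.hudi.common.model.HoodieAvroRecord;
    import org.apache.hudi.common.model.HoodieKey;
    import org.apache.hudi.common.model.HoodieRecord;
    import org.apache.hudi.common.util.Option;
    import org.apache.hudi.config.HoodieWriteConfig;

    public class HudiEventWriter {

      private final HoodieJavaWriteClient<HoodieAvroPayload> client;

      public HudiEventWriter(String tablePath, String tableName, String avroSchema) {
        // The Hadoop configuration supplies the s3a:// filesystem and AWS credential settings.
        Configuration hadoopConf = new Configuration();
        HoodieWriteConfig cfg = HoodieWriteConfig.newBuilder()
            .withPath(tablePath)     // e.g. "s3a://my-bucket/events" (bucket name is hypothetical)
            .withSchema(avroSchema)  // Avro schema of the buffered events
            .forTable(tableName)
            .build();
        this.client = new HoodieJavaWriteClient<>(new HoodieJavaEngineContext(hadoopConf), cfg);
      }

      // Flush one buffered batch of Avro events into the table as a single commit.
      public void flush(List<GenericRecord> events, String partitionPath) {
        List<HoodieRecord<HoodieAvroPayload>> records = new ArrayList<>();
        for (GenericRecord event : events) {
          HoodieKey key = new HoodieKey(UUID.randomUUID().toString(), partitionPath);
          records.add(new HoodieAvroRecord<>(key, new HoodieAvroPayload(Option.of(event))));
        }
        String instantTime = client.startCommit();  // open a new instant on the Hudi timeline
        client.insert(records, instantTime);        // write the batch; Hudi handles file sizing
      }
    }

Hudi's example also initializes the table once (via HoodieTableMetaClient) before the first commit, and at 50k events/second you would likely still want a queue such as Kafka in front of the writers rather than many clients committing to the same table independently.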

Related

How to transfer a file between cloud storage and a download/upload host (and vice versa)?

Let's say there is a file at the link example.com/file.bin and I want to transfer it to cloud storage (Mega or Google Drive, for example), or I have a download/upload host and want to transfer files from cloud storage to that host. How can I do that (besides the good old download-and-reupload way and transfer sites like MultCloud)?
At first I thought I could use Python + the Selenium framework to handle the cloud storage side, but that only works if I have the file on my own system. Can I deploy the code on a host and then use it to transfer the files? (Some cloud storage providers don't have an API that can be used for downloading, so I think it's necessary to use Selenium.)
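When the target storage does expose an upload API, Selenium isn't needed at all: you can deploy a small program on any host (a VPS, a cloud VM, etc.) and stream the download straight into the storage API without ever touching the local disk. The rough sketch below uses Google Cloud Storage's Java client purely as a stand-in for any provider with an API (Google Drive has a comparable upload API; Mega would indeed need its own SDK or a browser-automation workaround); the bucket name and URL are placeholders.

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URL;
    import java.nio.channels.Channels;

    import com.google.cloud.WriteChannel;
    import com.google.cloud.storage.BlobInfo;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;

    public class UrlToCloudStorage {
      public static void main(String[] args) throws Exception {
        Storage storage = StorageOptions.getDefaultInstance().getService();
        BlobInfo blob = BlobInfo.newBuilder("my-bucket", "file.bin").build(); // bucket is hypothetical

        // Stream the remote file straight into the bucket; nothing is written to local disk.
        try (InputStream in = new URL("https://example.com/file.bin").openStream();
             WriteChannel writer = storage.writer(blob);
             OutputStream out = Channels.newOutputStream(writer)) {
          in.transferTo(out);
        }
      }
    }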

BigQuery: retrieve data from SFTP

I have an internet-facing SFTP server where CSV files are regularly updated. Is there any command to have BigQuery retrieve data from this SFTP server and put it into tables? Alternatively, is there any API or Python library that supports this?
As for BigQuery - there's no integration I know of with SFTP.
You'll need to either:
Create/find a script that reads from SFTP and pushes to GCS (a sketch of this approach follows below), or
Add an HTTPS service to the SFTP server, so your data can be read with the GCS Transfer Service (https://cloud.google.com/storage-transfer/docs/).
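A minimal sketch of the first option in Java, assuming JSch for SFTP and the Google Cloud client libraries; the host, credentials, bucket, and dataset/table names are all placeholders, and in practice you would schedule this (cron, or Cloud Scheduler plus a small VM or job) and only pick up new files.

    import java.io.InputStream;

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.FormatOptions;
    import com.google.cloud.bigquery.JobInfo;
    import com.google.cloud.bigquery.LoadJobConfiguration;
    import com.google.cloud.bigquery.TableId;
    import com.google.cloud.storage.BlobInfo;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;
    import com.jcraft.jsch.ChannelSftp;
    import com.jcraft.jsch.JSch;
    import com.jcraft.jsch.Session;

    public class SftpToBigQuery {
      public static void main(String[] args) throws Exception {
        // 1. Open the CSV file on the SFTP server as a stream.
        JSch jsch = new JSch();
        Session session = jsch.getSession("sftp-user", "sftp.example.com", 22);
        session.setPassword("secret");
        session.setConfig("StrictHostKeyChecking", "no"); // fine for a sketch, not for production
        session.connect();
        ChannelSftp sftp = (ChannelSftp) session.openChannel("sftp");
        sftp.connect();

        // 2. Push the stream into a Google Cloud Storage staging bucket.
        Storage storage = StorageOptions.getDefaultInstance().getService();
        BlobInfo blob = BlobInfo.newBuilder("my-staging-bucket", "incoming/data.csv").build();
        try (InputStream in = sftp.get("/outgoing/data.csv")) {
          storage.createFrom(blob, in);
        }
        sftp.disconnect();
        session.disconnect();

        // 3. Load the staged GCS object into a BigQuery table.
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        LoadJobConfiguration load = LoadJobConfiguration
            .newBuilder(TableId.of("my_dataset", "my_table"),
                "gs://my-staging-bucket/incoming/data.csv")
            .setFormatOptions(FormatOptions.csv())
            .build();
        bigquery.create(JobInfo.of(load)).waitFor();
      }
    }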
Yet another third-party tool supporting (S)FTP in and out of GCP is Magnus - Workflow Automator, which is part of the Potens.io suite. It supports BigQuery, Cloud Storage, and most Google APIs, as well as multiple simple utility-type tasks like BigQuery Task, Export to Storage Task, Loop Task, and many more, along with advanced scheduling, triggering, etc. It is also available on the Marketplace.
The FTP-to-GCS Task accepts a source FTP URI and can transfer single or multiple files, based on input, to a destination location in Google Cloud Storage. The resulting list of objects uploaded to Google Cloud Storage is saved to a parameter for later use within the workflow. The source FTP can be of type SFTP, FTP, or FTPS.
See here for more
Disclosure: I am a GDE for Google Cloud, the creator of those tools, and the lead of the Potens team.

Trouble using AWS SWF

I am new to Amazon Simple Workflow Service (SWF). Is there a way to run SWF workflows on EMR? I have the AWS CLI set up and am able to bootstrap Hadoop and bring up the cluster. I have not found enough documentation on this and no sources on the web. Is there any chance that I can boot the EMR cluster using SWF instead of the AWS CLI? Thanks.
You should use one of the dedicated AWS SDKs to coordinate between the two services. I am successfully using the AWS SDK for Java to create a workflow that starts several EMR clusters in parallel with different jobs and then just waits for them to finish, failing the whole workflow if one of the jobs fails.
Out of all the available AWS SDKs, I highly recommend the Java one. It struck me as extremely robust. I have also used the PHP one in the past, but it is lacking in certain areas (it does not provide a 'flow' framework for SWF, for example).
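For illustration only, here is a rough sketch (not the poster's exact code) of what the EMR-launching part of such an SWF activity can look like with the AWS SDK for Java v1; the release label, IAM roles, instance types, and jar path are assumptions, and the SWF Flow Framework wiring (workflow/activity interfaces and workers) is omitted.

    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
    import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
    import com.amazonaws.services.elasticmapreduce.model.DescribeClusterRequest;
    import com.amazonaws.services.elasticmapreduce.model.HadoopJarStepConfig;
    import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
    import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
    import com.amazonaws.services.elasticmapreduce.model.StepConfig;

    public class EmrLauncherActivity {

      private final AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();

      // Starts a transient cluster that runs one jar step and terminates when done.
      // Returns the cluster id so the workflow can poll it and fail the run on error.
      public String startCluster(String name, String jarPath, String... jarArgs) {
        StepConfig step = new StepConfig()
            .withName("main-job")
            .withActionOnFailure("TERMINATE_CLUSTER")
            .withHadoopJarStep(new HadoopJarStepConfig().withJar(jarPath).withArgs(jarArgs));

        RunJobFlowRequest request = new RunJobFlowRequest()
            .withName(name)
            .withReleaseLabel("emr-6.10.0")        // assumed release label
            .withServiceRole("EMR_DefaultRole")    // assumed IAM roles
            .withJobFlowRole("EMR_EC2_DefaultRole")
            .withSteps(step)
            .withInstances(new JobFlowInstancesConfig()
                .withInstanceCount(3)
                .withMasterInstanceType("m5.xlarge")
                .withSlaveInstanceType("m5.xlarge")
                .withKeepJobFlowAliveWhenNoSteps(false));

        return emr.runJobFlow(request).getJobFlowId();
      }

      // Polled from the workflow (or a separate activity) to see whether the cluster is done.
      public String clusterState(String clusterId) {
        return emr.describeCluster(new DescribeClusterRequest().withClusterId(clusterId))
            .getCluster().getStatus().getState();
      }
    }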

Protocol Buffers and Hadoop

I am new to the Hadoop world. I know that Hadoop has its own serialization mechanism called Writables, and that Avro is another such library. I wanted to know whether we can write map-reduce jobs using Google's Protocol Buffers serialization. If yes, can someone point me to a good example to get me started?
Twitter has published its elephant-bird library, which allows Hadoop to work with Protocol Buffers files.
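elephant-bird provides ready-made input/output formats and a ProtobufWritable wrapper, so you usually don't deserialize by hand. As a bare-bones illustration of the idea, though, a mapper can also parse raw protobuf bytes itself; in this sketch, Event (and its getType() field) is a hypothetical protoc-generated class, and the input is assumed to be a SequenceFile of BytesWritable values holding one serialized message each.

    import java.io.IOException;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    import com.example.proto.EventProtos.Event; // hypothetical protoc-generated class

    public class EventTypeCountMapper
        extends Mapper<LongWritable, BytesWritable, Text, LongWritable> {

      private static final LongWritable ONE = new LongWritable(1);

      @Override
      protected void map(LongWritable key, BytesWritable value, Context context)
          throws IOException, InterruptedException {
        // Each value holds one serialized protobuf message; parseFrom is generated by protoc.
        Event event = Event.parseFrom(value.copyBytes());
        context.write(new Text(event.getType()), ONE); // getType() is a field of the hypothetical message
      }
    }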

Using Amazon S3 along with Amazon RDS

I'm trying to host a database on Amazon RDS, and the actual content the database will store information about (videos) will be hosted on Amazon S3. I have some questions about this process that I was hoping someone could help me with.
Can a database hosted on Amazon RDS interact with (search, update) something on Amazon S3? So if I have a database on Amazon RDS and run a delete command to remove a specific video, is it possible to have that command also remove the video on S3? Also, is there a tutorial on how to make the two services interact?
Thanks very much!
You will need an intermediary scripting language to maintain this process. For instance, if you're building a web-based application that stores videos on S3 and the information about these videos, including their locations, on RDS, you could write a PHP application (hosted on an EC2 instance, or elsewhere outside of Amazon's cloud) that connects to the MySQL database on RDS, runs the appropriate queries, and then interacts with Amazon S3 to complete a given task there (e.g. delete a video, as you stated).
To do this you would use the AWS SDK; for PHP the link is: http://aws.amazon.com/php/
You can use the Java, Ruby, Python, .NET/Windows, and mobile SDKs to do these various tasks on S3, as well as control other areas of AWS if you use them.
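For illustration, here is a hedged sketch of the same delete-the-row-then-delete-the-object pattern in Java, using plain JDBC against the RDS MySQL database and the AWS SDK for Java (v1) for S3; the endpoint, credentials, bucket, and table/column names are all placeholders.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;

    public class VideoDeleter {
      public static void deleteVideo(String videoId, String s3Key) throws Exception {
        // Remove the metadata row from the RDS MySQL database (table/column names are hypothetical).
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://my-rds-endpoint:3306/videos", "user", "password");
             PreparedStatement stmt = conn.prepareStatement("DELETE FROM videos WHERE id = ?")) {
          stmt.setString(1, videoId);
          stmt.executeUpdate();
        }
        // Then remove the video file itself from S3 (bucket name is hypothetical).
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        s3.deleteObject("my-video-bucket", s3Key);
      }
    }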
Alternatively, you can find third-party scripts that do what you want and build an application around them; for example, if someone has written a simpler S3 interaction class, you could use it instead of writing some of that code yourself.
For a couple of command-line applications I've built, I have used this handy and free tool: http://s3tools.org/s3cmd, which is basically a command-line tool for interacting with S3. It is very useful for bash scripts.
Tyler