Atomic update of multipart s3 - amazon-s3

I need to upload multiple files to S3 from a Java application. The catch is that we need all the files written atomically, i.e. all or nothing.
I am unable to find any solution for that.
Any suggestions are welcome.
Thanks!

S3 has no multi-object transactions (and was historically only eventually consistent), so you'll need some mechanism like a _commit marker. Parquet format and others do this for you. The format options depend on your readers; for example, there is no Redshift bulk loader for Parquet, so Avro is a better format for that use case.
What common formats are supported by all systems that need to work with these files?

To date, the only elegant solution I could find was to read the data into a DataFrame (using the Spark libraries) and write it back out.
I also implemented a check for a commit file (let's say _commit) for locking/sync purposes, which is essentially what the Spark APIs do as well.
Hope that helps. If anyone has another solution, please share. :)
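
A minimal sketch of the _commit marker idea, assuming a local directory stands in for an S3 key prefix (with S3 you would put the data objects under a prefix and put the marker object last). The function names write_batch and read_batch are hypothetical, and this is a convention between writer and readers rather than true atomicity: it only works if every reader checks for the marker.

```python
from pathlib import Path
from typing import Dict, Optional

COMMIT_MARKER = "_commit"  # marker name taken from the posts above


def write_batch(batch_dir: Path, files: Dict[str, bytes]) -> None:
    """Write every data file first, then the marker as the very last step."""
    batch_dir.mkdir(parents=True, exist_ok=True)
    for name, payload in files.items():
        (batch_dir / name).write_bytes(payload)
    # Readers treat the batch as visible only once this marker exists.
    (batch_dir / COMMIT_MARKER).write_bytes(b"")


def read_batch(batch_dir: Path) -> Optional[Dict[str, bytes]]:
    """Return the batch contents, or None if the batch was never committed."""
    if not (batch_dir / COMMIT_MARKER).exists():
        return None
    return {
        p.name: p.read_bytes()
        for p in sorted(batch_dir.iterdir())
        if p.is_file() and p.name != COMMIT_MARKER
    }
```

If the writer crashes halfway through, the marker is never written, so readers skip the partial batch and a cleanup job can delete it later.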

Related

Azure Data Factory - optimal design for an IOT pipeline

I am working on an Azure Data Factory solution to solve the following scenario:
Data files in CSV format are dumped into Data Lake Gen 2 paths. There are two varieties of files, let's call them TypeA and TypeB and each is dumped into a path reflecting a grouping of sensors and the date.
For example:
/mycontainer/csv/Group1-20210729-1130/TypeA.csv
/mycontainer/csv/Group1-20210729-1130/TypeB.csv
/mycontainer/csv/Group1-20210729-1138/TypeA.csv
/mycontainer/csv/Group1-20210729-1138/TypeB.csv
I need to extract data from TypeA files and write it in Delta format to a different location on Data Lake Gen 2 storage. I'll need to do similar processing for TypeB files but they'll have a different format.
I have successfully put together a "Data Flow" which, given a specific blob path, accomplishes this extraction. But I am struggling to put together a pipeline which applies it to each file that comes in.
My first thought was to do this based on a storage event trigger, whereby each time a CSV file appeared the pipeline would be run to process that one file. I was almost able to accomplish this using a combination of fileName and folderPath parameters and wildcards. I even had a pipeline which will work when triggered manually (meaning I entered a specific fileName and folderPath value manually). However I had two problems which made me question whether this was the correct approach:
a) I wasn't able to get it to work when triggered by real storage events, I suspect because my combination of parameters and wildcards was ending up including the container name twice in the path it was generating. It's hard to check this because the error message you get doesn't tell you what the various values actually resolve to (!).
b) The cluster that is needed to extract the CSV into parquet Delta and put the results into Data Lake takes several minutes to spin up - not great if working at the file level. (I realize I can mitigate this somewhat - at a cost - by setting a TTL on the cluster.)
So then I abandoned this approach and tried to set up a pipeline which will be triggered periodically, and will pick up all the CSV files matching a particular pattern (e.g. /mycontainer/csv/*/TypeA.csv), process them as a batch, then delete them. At this point I was very surprised to find out that the "Delimited Text" dataset doesn't seem to support wildcards, which is what I was kind of relying on to achieve this in a simple way.
So my questions are:
Am I broadly on the right track with my 'batch of files' approach? Is there a way to define a delimited text data source which reads its data from multiple blobs?
Or do I need a more 'iterative' approach using maybe a 'Foreach' step? I'm really really hoping this isn't the case as it seems an odd pattern to be adopting in 2021.
A much wider question: is ADF a suitable tool for this kind of scenario? I was excited about using it at first, but increasingly it feels like one of those 'exciting to demo but hard to actually use' things which so often pop up in the low/no code space. Are there popular alternatives which will work nicely with Azure storage?
Any pointers very much appreciated.
I believe you're very much on the right track.
Last week I was able to get wildcard CSVs to be imported if the wildcard is in the CSV file name. Maybe create an intermediate step to put all the TypeA files in the same folder?
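
A rough sketch of what such an intermediate step could compute, assuming you can list the blob paths yourself. It only plans the copies (source, destination); the actual copy would be done by whatever storage client you use. The function name plan_flatten and the staging folder are hypothetical, and whether ADF then accepts a file-name wildcard over the flattened folder is based on the experience described above, not something verified here.

```python
import fnmatch
import posixpath
from typing import List, Tuple


def plan_flatten(paths: List[str], pattern: str, dest_dir: str) -> List[Tuple[str, str]]:
    """Pair each matching source path with a flattened destination path."""
    plan = []
    for src in sorted(paths):
        # Note: fnmatch's "*" also matches "/", which is fine for this layout.
        if not fnmatch.fnmatchcase(src, pattern):
            continue
        folder, file_name = posixpath.split(src)
        group = posixpath.basename(folder)  # e.g. "Group1-20210729-1130"
        # Move the varying part into the file name so a wildcard can sit there.
        plan.append((src, posixpath.join(dest_dir, f"{group}_{file_name}")))
    return plan
```

With the paths from the question and the pattern /mycontainer/csv/*/TypeA.csv, every TypeA file ends up in one folder with names like Group1-20210729-1130_TypeA.csv, so a pattern such as *_TypeA.csv covers the whole batch.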
Concerning ADF - it's a cool technology, with a steep learning curve (and a lot of updates - incl. breaking changes sometimes) if you're looking to get data ingested without too much coding. Some drawbacks:
Monitoring - if you want to have it cheaper, there's a lot of hacking (e.g. mailing via Logic Apps)
Debugging - as you've noticed, debug messages are often cryptic or insufficient
Multiple monthly updates make it feel like a beta. Indeed, often there are straightforward tasks that are quite difficult to achieve.
Good luck ;)

Write a list of Julia DataFrames to file

I have a list of Julia DataFrames that I want to write to file. What is the fastest way to write these out? I'm looking for something akin to rds files in R.
I routinely use serialize and deserialize from the Serialization module. Note that this is Julia-version specific, but apart from that this is the most robust approach currently.
You can also consider https://github.com/JuliaData/Feather.jl, but it does not support all possible data types that you can store in a DataFrame (but covers all standard types).
Here https://github.com/bkamins/Julia-DataFrames-Tutorial/blob/master/04_loadsave.ipynb you can find some benchmarks (at the end of the notebook).
JLD2 solved my problem. Thanks.

Liquibase load data in a format other than CSV

With the load data option that Liquibase provides, one can specify seed data in a CSV format. Is there a way I can provide say, a JSON or XML file with data that Liquibase would understand?
The use case is that we are trying to load some sample data which is hierarchical, e.g. a Category - Subcategory relation, which would require putting in the parent id for all related categories. Is there a way to avoid including the ids in the seed data, say via JSON?
{
"MainCat1": ["SubCat11", "SubCat12"],
"MainCat2": ["SubCat21", "SubCat22"]
}
This is very likely not supported (I couldn't make Google help me), but is there a way to write a plugin or something that does this? A pointer to a guide (if any) would help.
NOTE: This is not about specifying the change log in that format.
This is not currently supported, and supporting it robustly would be pretty difficult. The main difficulty lies in the fact that Liquibase is designed to be database-platform agnostic, combined with the design goal of being able to generate the SQL required to do an operation without actually doing the operation live.
Inserting data like you want without knowing the keys and just generating SQL that could be run later is going to be very difficult, perhaps even impossible. I would suggest approaching Nathan, who is the main developer for Liquibase, more directly. The best way to do that might be through the JIRA bug database for Liquibase.
If you want to have a crack at implementing it, you could start by looking at the code for the LoadDataChange class (source on GitHub), which is where the CSV support currently lives.
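
As a workaround outside Liquibase, one option is to pre-process the hierarchical JSON into a flat CSV before the load step, assigning the ids yourself at build time. The sketch below assumes a hypothetical table with id, name and parent_id columns, and it writes NULL for a missing parent; adjust both to your schema and to however your loader represents nulls. It sidesteps the key problem described above only because the ids are fixed in advance rather than generated by the database.

```python
import csv
import io
from typing import Dict, List


def flatten_categories(tree: Dict[str, List[str]]) -> str:
    """Assign sequential ids and emit id,name,parent_id rows as CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "name", "parent_id"])  # hypothetical column names
    next_id = 1
    for parent, children in tree.items():
        parent_id = next_id
        writer.writerow([parent_id, parent, "NULL"])  # top level: no parent
        next_id += 1
        for child in children:
            writer.writerow([next_id, child, parent_id])
            next_id += 1
    return out.getvalue()
```

Parsing the JSON from the question with json.loads and passing the result in yields one row per category, with each subcategory pointing at its main category's id.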

"Best practice" for HBase data "serialization"

Hi, I am new to HBase and I wonder what the best approach is to serialize and store data in HBase. Is there a convenient way to transform "business objects" at the application level into HBase objects (Put), i.e. a transformation to byte[]? I doubt that it has to be done manually via helper methods like .toByte etc.
What are the best practices and experiences?
I read about Avro, Thrift, n-orm, ...
Can someone share their knowledge?
I would go with the default Java API and enable compression on HDFS rather than using a framework for serializing / deserializing efficiently during RPC calls.
Apparently, updates like the addition of a column to records in Avro/Thrift would be difficult, as you are forced to delete and recreate them.
Secondly, I don't see support for Filters in Thrift/Avro, in case you need to filter data at the source.
My two cents.
For an ORM solution, kindly have a look at https://github.com/impetus-opensource/Kundera .

Hive with Lucene

Is it possible to use Hive for querying a Lucene index which is distributed over Hadoop?
Hadapt is a startup whose software bridges Hadoop with a SQL front-end (like Hive) and hybrid storage engines. They offer an archival text search capability that may meet your needs.
Disclaimer: I work for Hadapt.
As far as I know you can essentially write custom "row-extraction" code in Hive so I would guess that you could. I've never used Lucene and barely used Hive, so I can't be sure. If you find a more conclusive answer to your question, please post it!
I know this is a fairly old post, but thought I could offer a better alternative.
In your case, instead of going through the hassle of mapping your HDFS Lucene index to a Hive schema, it's better to push the data into Pig, because Pig can read flat files. Unless you want a relational way of storing your data, you could probably process it through Pig and use HBase as your DB.
You could write a custom input format for Hive to access a Lucene index in Hadoop.