Write a list of Julia DataFrames to file

I have a list of Julia DataFrames that I want to write to file. What is the fastest way to write these out? I'm looking for something akin to rds files in R.

I routinely use serialize and deserialize from the Serialization module. Note that the serialized format is Julia-version specific, but apart from that it is currently the most robust approach.
You can also consider https://github.com/JuliaData/Feather.jl, though it does not support every data type you can store in a DataFrame (it does cover all the standard ones).
You can find some benchmarks at the end of the notebook at https://github.com/bkamins/Julia-DataFrames-Tutorial/blob/master/04_loadsave.ipynb.

JLD2 solved my problem. Thanks.

Related

How to use pandas in Apache Beam?

How can I implement pandas in Apache Beam?
I cannot perform a left join on multiple columns, and PCollections do not support SQL queries. Even the Apache Beam documentation is not laid out very clearly. I checked but could not find any kind of pandas implementation in Apache Beam.
Can anyone point me to the right link?
There's some confusion going on here.
pandas is "supported", in the sense that you can use the pandas library the same way you'd be using it without Apache Beam, and the same way you can use any other library from your Beam pipeline as long as you specify the proper dependencies. It is also "supported" in the sense that it is bundled as a dependency by default so you don't have to specify it yourself. For example, you can write a DoFn that performs some computation using pandas for every element; a separate computation for each element, performed by Beam in parallel over all elements.
It is not supported in the sense that Apache Beam currently provides no special integration with it, e.g. you can't use a PCollection as a pandas dataframe, or vice versa. A PCollection does not physically contain any data (this should be particularly clear for streaming pipelines) - it is just a placeholder node in Beam's execution plan.
That said, a pandas-like API for working with Beam PCollections would certainly be a good idea, and would simplify learning Beam for many existing pandas users, but I don't think anybody is working on implementing this currently. However, the Beam community is currently discussing the idea of adding schemas to PCollections, which is a step in this direction.
As well as using pandas directly from DoFns, Beam now has an API for manipulating PCollections as DataFrames. See https://s.apache.org/simpler-python-pipelines-2020 for more details.
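For reference, that DataFrame API looks roughly like the sketch below; the bucket paths and column names are invented, so check the linked document for the actual details:

    # Rough sketch of Beam's dataframe API (paths and columns are hypothetical).
    import apache_beam as beam
    from apache_beam.dataframe.io import read_csv

    with beam.Pipeline() as p:
        # read_csv produces a deferred dataframe backed by a PCollection.
        df = p | read_csv("gs://my-bucket/input*.csv")
        # Familiar pandas-style operations are deferred and executed by Beam.
        totals = df[["user_id", "amount"]].groupby("user_id").sum()
        totals.to_csv("gs://my-bucket/totals.csv")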
pandas is supported in the Dataflow SDK for Python 2.x. As of writing, workers have pandas v0.18.1 pre-installed, so you should not have any issue with that. Stack Overflow does not accept questions that ask the community to point you to external documentation and/or tutorials, so maybe you should first try an implementation yourself, and then come back with more information about what is or isn't failing and what you achieved before hitting an error.
In any case, if what you want to achieve is a left join, you can also have a look at the CoGroupByKey transform, which is documented in the Apache Beam documentation. It is used to perform relational joins of several PCollections that share a common key type. On that same page you will find some examples that use CoGroupByKey and ParDo to join the contents of several data objects.
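As an illustration of that approach (this is a sketch, not one of the documentation examples), here is a left join on a composite key using CoGroupByKey; the records and field names are made up:

    # Left join of two PCollections on a (user, country) composite key.
    import apache_beam as beam

    orders = [({"user": 1, "country": "DE"}, "order-a"),
              ({"user": 2, "country": "FR"}, "order-b")]
    profiles = [({"user": 1, "country": "DE"}, "alice")]

    def to_kv(record):
        key_dict, value = record
        # Join on multiple columns by packing them into one tuple key.
        return (key_dict["user"], key_dict["country"]), value

    def left_join(element):
        key, grouped = element
        rights = list(grouped["profiles"]) or [None]  # keep unmatched left rows
        for left in grouped["orders"]:
            for right in rights:
                yield key, left, right

    with beam.Pipeline() as p:
        orders_kv = p | "orders" >> beam.Create(orders) | "kv1" >> beam.Map(to_kv)
        profiles_kv = p | "profiles" >> beam.Create(profiles) | "kv2" >> beam.Map(to_kv)
        joined = ({"orders": orders_kv, "profiles": profiles_kv}
                  | beam.CoGroupByKey()
                  | beam.FlatMap(left_join))
        joined | beam.Map(print)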

Atomic update of multiple files in S3

I need to upload multiple files to S3 from a Java application. The catch is that all the files need to land atomically, i.e. all or nothing.
I am unable to find any solution for that.
Any suggestions are welcome.
Thanks!
S3 is an eventually consistent store, so you'll need some mechanism like a _commit marker. The Parquet format and others do this for you. The format options depend on your readers; for example, there is no Redshift bulk loader for Parquet, so Avro is a better format for that use case.
What common formats are supported by all systems that need to work with these files?
To date, the only elegant solution I could find was reading the data into a DataFrame (using the Spark libraries) and writing it back out.
I also implemented a check on commit marker files (say, _commit) for locking/sync purposes, which is essentially what the Spark APIs do as well.
Hope that helps. If anyone has another solution, please share. :)
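For anyone who wants the commit-marker pattern spelled out, here is a rough sketch in Python with boto3 (the question is about Java, but the idea is the same); the bucket and key names are invented:

    # Upload all files under a fresh prefix, then write a _commit marker last.
    import uuid
    import boto3

    s3 = boto3.client("s3")
    bucket = "my-bucket"                     # hypothetical bucket
    prefix = "batches/%s/" % uuid.uuid4()    # unique prefix per atomic batch

    # 1. Upload every data file under the new prefix.
    for name, payload in [("a.csv", b"..."), ("b.csv", b"...")]:
        s3.put_object(Bucket=bucket, Key=prefix + name, Body=payload)

    # 2. Write the marker last; readers only consume prefixes that have it.
    s3.put_object(Bucket=bucket, Key=prefix + "_commit", Body=b"")

    def is_committed(bucket, prefix):
        # Readers check for the marker before touching any data files.
        resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix + "_commit")
        return resp.get("KeyCount", 0) > 0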

Is it possible to use a Pig built-in function inside a Pig Java UDF?

I am new to Pig and am writing Java UDFs for operations that already exist in the builtin package, but the data types do not match when they are called from the application.
So I need to wrap the Pig built-in functions so that they accept the correct data types coming from my user-defined data types.
Please suggest.
As mentioned in the comments, the solution that you propose is not possible.
Though you did not ask this (and did not provide enough information for people to be more specific), it is probably possible to solve your problem with a different approach.

Julia: how stable are serialize() / deserialize()

I am considering the use of serialize() and deserialize() for all of my data i/o due to their convenience. I do not, however, want to be stuck with unreadable files on a Julia update.
How stable are serialize() and deserialize()? Should they work between updates of 0.3? Can I expect safe behavior if I stick to basic types like arrays of Float64?
Thank you.
If you want to store data that you might depend on being able to read in the future, you should not use a format that will incorporate breaking changes whenever the developers find it useful. As far as I understand, the default serialization format is meant for network communication, so it is designed for maximum performance.
There is also the HDF5.jl package, which uses a documented format and a common library that has wrappers for different languages.
I believe the official answer here is, "people will try not to break the serialization format, but you shouldn't depend upon it."

Is there an OCaml library to store/use a data structure on disk?

Something like bdb. However, I looked at ocaml-bdb, and it seems it's made to store only strings. My problem is that I have arrays holding giant amounts of data. Sure, I can serialize them into many files, or encode/decode my data and put it in a database or one of those key-value DB things, which is my last resort. I'm wondering if there's a better way.
The HDF4 / HDF5 file format might suit your needs. See http://forge.ocamlcore.org/projects/ocaml-hdf/
In addition to the HDF4 bindings mentioned by jrouquie, there are HDF5 bindings available (http://opam.ocaml.org/packages/hdf5/). Depending on the type of data you're storing, there are also bindings to GDAL (http://opam.ocaml.org/packages/gdal/).
For data which can fit in a bigarray you also have the option of memory mapping a large file on disk. See https://caml.inria.fr/pub/docs/manual-ocaml/libref/Bigarray.Genarray.html#VALmap_file for example. While it ties you to a rather strict on-disk format, it does make it relatively simple to manipulate arrays which are larger than the available RAM.
There was an OCaml BerkeleyDB wrapper in the past: OCamlDB. Apparently someone looked into it recently: there is a recent patch for OCamlDB.
However, the GDAL bindings from hcarty are probably production ready and in intensive usage somewhere.
Also, there are bindings for dbm in opam: dbm and cryptodbm
HDF5 is probably the answer, but since the question is somewhat vague, another solution is possible.
Disclaimer: I don't know OCaml (though I knew Caml Light), but I do know Berkeley DB (a.k.a. bsddb, a.k.a. bdb).
However, I looked at the ocaml-bdb, seems like it's made to store only string.
That may be true of ocaml-bdb, but in reality bdb stores bytes. I am not sure about your case, because in Python 2 there was no real distinction between byte strings and text strings; only recently did Python 3 get a proper bytes type, and the bdb bindings take and return bytes. That said, the difference is subtle, but you'd rather work with bytes, because that is what bdb understands and uses.
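To make the bytes point concrete, here is a tiny Python example using the standard-library dbm module as a stand-in for bdb (the file name is arbitrary); with any byte-oriented key-value store the pattern is the same: encode on write, decode on read.

    # Key-value stores deal in bytes: encode on write, decode on read.
    import dbm

    with dbm.open("example.db", "c") as db:
        db[b"greeting"] = "héllo".encode("utf-8")
        print(db[b"greeting"].decode("utf-8"))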
My problem is I have arrays that store giant data. Sure, I can serialize them into many files, or encode/decode my data and put them on database
or use those key-value db things, which is my last resort.
I'm wondering if there's a better way.
It depends on your needs and how the data looks.
If the data can all stay in memory, you'd rather dump the memory to a file and load it back.
If you need to share that data among several architectures or operating systems, you'd rather use a serialization framework like HDF5. Keep in mind that HDF5 doesn't handle circular references.
If the data cannot all stay in memory, then you need to use something like bdb (or wiredtiger).
Why bdb (or wiredtiger)?
Simply said, several decades of work have gone into:
splitting data
storing it on disk
retrieving data
As fast as possible.
wiredtiger is the successor of bdb.
So yes, you could split the files yourself and so on, but that requires a lot of work. Only specialized companies do that (Bloomberg included...); among those that manage all of the above themselves are the famous PostgreSQL, MariaDB, Google and Algolia.
Ordered key-value stores like wiredtiger and bdb use algorithms similar to those of higher-level databases like PostgreSQL and MySQL, or of specialized ones like Lucene/Solr or Sphinx: MVCC, B-trees, LSM trees, PSSI, etc.
Since 3.2, MongoDB has used the wiredtiger backend for storing all of its data.
Some people argue that key-value stores are not good at storing relational data; that said, several projects have started building distributed databases on top of key-value stores, which is a clue that the approach is useful, e.g. FoundationDB or CockroachDB.
The idea behind key-value stores is to deliver a generic framework for:
splitting data
storing it on disk
retrieving data
as fast as possible, while giving some guarantees (like ACID) and other nice-to-haves (like compression or encryption).
To take advantage of the power offered by those libraries, you need to learn about key-value composition.
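As a small illustration of what key composition means (the table and field names here are invented): you build byte keys whose lexicographic order matches the logical order, so an ordered store such as bdb or wiredtiger can range-scan all entries for one "table" or one user.

    # Compose byte keys so that lexicographic order matches logical order.
    import struct

    def make_key(table: bytes, user_id: int, timestamp: int) -> bytes:
        # Big-endian fixed-width integers keep byte order and numeric order aligned.
        return table + b"\x00" + struct.pack(">QQ", user_id, timestamp)

    k1 = make_key(b"events", user_id=42, timestamp=1000)
    k2 = make_key(b"events", user_id=42, timestamp=2000)
    assert k1 < k2  # later events sort after earlier ones for the same user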