How to use pandas in Apache Beam?

How do I use pandas in Apache Beam?
I cannot perform a left join on multiple columns, and PCollections do not support SQL queries. Even the Apache Beam documentation is not clear on this; I checked but couldn't find any kind of pandas integration in Apache Beam.
Can anyone point me to the relevant documentation?

There's some confusion going on here.
pandas is "supported", in the sense that you can use the pandas library the same way you'd be using it without Apache Beam, and the same way you can use any other library from your Beam pipeline as long as you specify the proper dependencies. It is also "supported" in the sense that it is bundled as a dependency by default so you don't have to specify it yourself. For example, you can write a DoFn that performs some computation using pandas for every element; a separate computation for each element, performed by Beam in parallel over all elements.
It is not supported in the sense that Apache Beam currently provides no special integration with it, e.g. you can't use a PCollection as a pandas dataframe, or vice versa. A PCollection does not physically contain any data (this should be particularly clear for streaming pipelines) - it is just a placeholder node in Beam's execution plan.
That said, a pandas-like API for working with Beam PCollections would certainly be a good idea, and would simplify learning Beam for many existing pandas users, but I don't think anybody is working on implementing this currently. However, the Beam community is currently discussing the idea of adding schemas to PCollections, which is a step in this direction.

As well as using pandas directly from DoFns, Beam now has an API to manipulate PCollections as DataFrames. See https://s.apache.org/simpler-python-pipelines-2020 for more details.
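For a rough idea of what that looks like, here is a minimal sketch (the file and column names are placeholders, and this requires a reasonably recent Beam SDK):

import apache_beam as beam
from apache_beam.dataframe.io import read_csv

with beam.Pipeline() as p:
    # read_csv yields a deferred, pandas-like DataFrame backed by a PCollection.
    df = p | read_csv('input.csv')
    # Familiar pandas-style operations; they run as Beam transforms when the pipeline executes.
    totals = df.groupby('key').sum()
    totals.to_csv('output.csv')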

pandas is supported in the Dataflow SDK for Python 2.x. As of this writing, workers have pandas v0.18.1 pre-installed, so you should not have any issue with that. Stack Overflow does not accept questions that ask the community to point you to external documentation and/or tutorials, so maybe you should first try an implementation yourself, and then come back with more information about what is and isn't working and what you achieved before hitting an error.
In any case, if what you want to achieve is a left join, maybe you can also have a look at the CoGroupByKey transform type, which is documented in the Apache Beam documentation. It is used to perform relational joins of several PCollections with a common key type. In that same page, you will be able to find some examples, which use CoGroupByKey and ParDo to join the contents of several data objects.
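As a rough illustration, here is a minimal left-join-style sketch with CoGroupByKey (the data is invented; keying each record on a tuple of columns gives you the multi-column join):

import apache_beam as beam

# Records keyed on a composite (country, account_id) key.
orders = [(('us', 'a1'), 100), (('us', 'a2'), 50)]
names = [(('us', 'a1'), 'Alice')]

def left_join(element):
    key, grouped = element
    # Emit every order even when there is no matching name (left join semantics).
    names_list = list(grouped['names']) or [None]
    for order in grouped['orders']:
        for name in names_list:
            yield key, order, name

with beam.Pipeline() as p:
    orders_pc = p | 'CreateOrders' >> beam.Create(orders)
    names_pc = p | 'CreateNames' >> beam.Create(names)
    ({'orders': orders_pc, 'names': names_pc}
     | beam.CoGroupByKey()
     | beam.FlatMap(left_join)
     | beam.Map(print))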

Related

Write a list of Julia DataFrames to file

I have a list of Julia DataFrames that I want to write to file. What is the fastest way to write these out? I'm looking for something akin to rds files in R.
I routinely use serialize and deserialize from the Serialization module. Note that this is Julia-version specific, but apart from that this is the most robust approach currently.
You can also consider https://github.com/JuliaData/Feather.jl, but it does not support all possible data types that you can store in a DataFrame (it does cover all standard types, though).
Here https://github.com/bkamins/Julia-DataFrames-Tutorial/blob/master/04_loadsave.ipynb you can find some benchmarks (at the end of the notebook).
JLD2 solved my problem. Thanks.

Using dfs and calculate_feature_matrix?

You could use ft.dfs to get back feature definitions as input to ft.calculate_feature_matrix or you could just use ft.dfs to compute the feature matrix. Is there a recommended way of using ft.dfs and ft.calculate_feature_matrix for best practice?
If you're in a situation where you might use either, the answer is to use ft.dfs to create both features and a feature matrix. If you're starting with a blank slate, you'll want to be able to examine and use a feature matrix for data analysis and feature selection. For that purpose, you're better off doing both at once with ft.dfs.
There are times when calculate_feature_matrix is the tool to use as well, though you'll often be able to tell if you're in that situation. The main cases are:
You've loaded in features that were previously saved
You want to rebuild the same features on new data
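As an illustration, here is a minimal sketch of both patterns (the entityset, dataframe names, and columns are invented, and the keyword names follow recent featuretools releases; older versions use target_entity and entity_from_dataframe instead):

import featuretools as ft
import pandas as pd

customers = pd.DataFrame({'customer_id': [1, 2],
                          'join_date': pd.to_datetime(['2020-01-01', '2020-02-01'])})
orders = pd.DataFrame({'order_id': [1, 2, 3],
                       'customer_id': [1, 1, 2],
                       'amount': [10.0, 20.0, 5.0]})

es = ft.EntitySet(id='shop')
es = es.add_dataframe(dataframe_name='customers', dataframe=customers, index='customer_id')
es = es.add_dataframe(dataframe_name='orders', dataframe=orders, index='order_id')
es.add_relationship('customers', 'customer_id', 'orders', 'customer_id')

# Pattern 1: ft.dfs builds the feature definitions and the feature matrix together.
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='customers')

# Pattern 2: save the definitions and rebuild the same features later (e.g. on new data).
ft.save_features(feature_defs, 'features.json')
reloaded = ft.load_features('features.json')
new_matrix = ft.calculate_feature_matrix(features=reloaded, entityset=es)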

apache storm/spark and data visualisation tool(s)

I have been searching for hours but I did not find a clear answer. I would like to know which data visualization tool(s) are the most suitable to use with Apache Storm/Spark. I know there are Tableau and Jaspersoft, but they are not free. Furthermore, there is the possibility of Elasticsearch and Kibana, but I would like to find/try something else. So, do you have any ideas?
Thanks a lot for your attention.
You are not giving much info here. Storm is a stream processing engine; Spark can do a lot more, but in both cases you need to deposit the information somewhere. If it is text-based data, you may go with Solr+Grafana or Elastic+Kibana. If it is a SQL or NoSQL database, there are many tools, mostly tied to the database type. There are BI tools for time series with InfluxDB, etc. With Spark, you have Zeppelin, which can do some level of BI. The last option is to build your own visualization, but I would be careful with D3 as it is not very good for dynamic charts. You may be better off with pure JS chart libraries like Highcharts, etc.
Best of luck.
Apache Zeppelin is a great web-based front end for Spark.
Highcharts is an excellent chart library.
spark-highcharts adds easy charting of Spark DataFrames with Highcharts. It can be used in Zeppelin, spark-shell, or other Spark applications.
spark-highcharts can generate a self-contained HTML page with full interactivity, which can be shared with other users.
Try it out with the following Docker command:
docker run -p 8080:8080 -d knockdata/zeppelin-highcharts
Have a look at the D3 JavaScript library. It provides very good visualization capabilities:
https://d3js.org/

Does Scalding support record filtering via predicate pushdown w/Parquet?

There are obvious speed benefits from not having to read records that would fail a filter. I see Spark support for it, but I haven't found any documentation on how to do it w/Scalding.
Unfortunately there is no support for this in scalding-parquet yet. We at Tapad have started working on implementing predicate support in Scalding; once we get something working, we'll share it.
We have implemented our own ParquetAvroSource that can read/store Avro records in Parquet. It's possible to use column projection and read only the columns/fields required by a Scalding job. In some cases, using this feature, jobs read only 1% of the input bytes.
Predicate pushdown was added to Scalding, but it is not documented yet.
For more details see scalding issue #1089

Using Pig and Python

Apologies if this question is poorly worded: I am embarking on a large-scale machine learning project and I don't like programming in Java. I love writing programs in Python. I have heard good things about Pig. I was wondering if someone could clarify for me how usable Pig is in combination with Python for mathematically related work. Also, if I am to write "streaming Python code", does Jython come into the picture? Is it more efficient if it does?
Thanks
P.S.: For several reasons, I would prefer not to use Mahout's code as is. I might want to use a few of their data structures, though; it would be useful to know whether that would be possible.
Another option to use Python with Hadoop is PyCascading. Instead of writing only the UDFs in Python/Jython, or using streaming, you can put the whole job together in Python, using Python functions as "UDFs" in the same script as where the data processing pipeline is defined. Jython is used as the Python interpreter, and the MapReduce framework for the stream operations is Cascading. The joins, groupings, etc. work similarly to Pig in spirit, so there is no surprise there if you already know Pig.
A word counting example looks like this:
@map(produces=['word'])
def split_words(tuple):
    # This is called for each line of text
    for word in tuple.get(1).split():
        yield [word]

def main():
    flow = Flow()
    input = flow.source(Hfs(TextLine(), 'input.txt'))
    output = flow.tsv_sink('output')

    # This is the processing pipeline
    input | split_words | GroupBy('word') | Count() | output

    flow.run()
When you use streaming in Pig, it doesn't matter what language you use... all it is doing is executing a command in a shell (e.g. via bash). You can use Python, just like you can use grep or a C program.
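For what it's worth, the streaming script itself is just an ordinary program that reads stdin and writes stdout. A minimal sketch of the kind of Python script Pig's STREAM operator could invoke (tab-separated fields are Pig's default serialization; the field positions here are invented):

#!/usr/bin/env python
import sys

# Pig pipes each tuple in as a tab-separated line and reads tab-separated lines back.
for line in sys.stdin:
    fields = line.rstrip('\n').split('\t')
    # Emit the first field upper-cased plus the field count, as an example transform.
    print('\t'.join([fields[0].upper(), str(len(fields))]))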
You can now define Pig UDFs in Python natively. These UDFs will be called via Jython when they are being executed.
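A minimal sketch of such a UDF (the schema and function are invented; the outputSchema decorator is provided by Pig's Jython UDF environment, so this only runs once registered from a Pig script, e.g. REGISTER 'my_udfs.py' USING jython AS my_udfs):

# my_udfs.py -- executed by Pig via Jython
@outputSchema('word_upper:chararray')
def to_upper(word):
    # Called by Pig once per input value.
    return word.upper() if word is not None else None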
The Programming Pig book discusses using UDFs. The book is indispensable in general. On a recent project, we used Python UDFs and occasionally had issues with Floats vs. Doubles mismatches, so be warned. My impression is that the support for Python UDFs may not be as solid as the support for Java UDFs, but overall, it works pretty well.