Using Pig and Python - jython

Apologies if this question is poorly worded: I am embarking on a large-scale machine learning project and I don't like programming in Java. I love writing programs in Python, and I have heard good things about Pig. I was wondering if someone could clarify how usable Pig is in combination with Python for mathematically related work. Also, if I write "streaming Python code", does Jython come into the picture? And is it more efficient if it does?
Thanks
P.S.: For several reasons, I would prefer not to use Mahout's code as is. I might, however, want to use a few of its data structures: it would be useful to know whether that is possible.

Another option to use Python with Hadoop is PyCascading. Instead of writing only the UDFs in Python/Jython, or using streaming, you can put the whole job together in Python, using Python functions as "UDFs" in the same script as where the data processing pipeline is defined. Jython is used as the Python interpreter, and the MapReduce framework for the stream operations is Cascading. The joins, groupings, etc. work similarly to Pig in spirit, so there is no surprise there if you already know Pig.
A word counting example looks like this:
@map(produces=['word'])
def split_words(tuple):
    # This is called for each line of text
    for word in tuple.get(1).split():
        yield [word]

def main():
    flow = Flow()
    input = flow.source(Hfs(TextLine(), 'input.txt'))
    output = flow.tsv_sink('output')

    # This is the processing pipeline
    input | split_words | GroupBy('word') | Count() | output

    flow.run()

When you use streaming in Pig, it doesn't matter what language you use: all Pig does is execute a command in a shell (e.g. via bash). You can use Python, just as you could use grep or a C program.
You can also now define Pig UDFs natively in Python; these UDFs are called via Jython when they are executed.
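For illustration, a minimal sketch of what such a Jython UDF could look like (the file name, function name and schema below are purely illustrative):

# myudfs.py -- illustrative Python UDF for Pig, executed via Jython
# The @outputSchema decorator tells Pig the type of the returned value.
@outputSchema("word:chararray")
def normalize(word):
    # Pig passes nulls through as None
    if word is None:
        return None
    return word.strip().lower()

In the Pig script you would then register it with REGISTER 'myudfs.py' USING jython AS myudfs; and call myudfs.normalize(...) inside a FOREACH ... GENERATE.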

The Programming Pig book discusses using UDFs. The book is indispensable in general. On a recent project, we used Python UDFs and occasionally had issues with Floats vs. Doubles mismatches, so be warned. My impression is that the support for Python UDFs may not be as solid as the support for Java UDFs, but overall, it works pretty well.


How to use pandas in Apache Beam?

How to implement pandas in Apache Beam?
I cannot perform a left join on multiple columns, and PCollections do not support SQL queries. Even the Apache Beam documentation is not clearly laid out. I checked but couldn't find any kind of pandas implementation in Apache Beam.
Can anyone point me to the relevant link?
There's some confusion going on here.
pandas is "supported", in the sense that you can use the pandas library the same way you'd be using it without Apache Beam, and the same way you can use any other library from your Beam pipeline as long as you specify the proper dependencies. It is also "supported" in the sense that it is bundled as a dependency by default so you don't have to specify it yourself. For example, you can write a DoFn that performs some computation using pandas for every element; a separate computation for each element, performed by Beam in parallel over all elements.
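As a rough sketch of that per-element pattern (the element shape, class and column names here are made up for illustration):

import apache_beam as beam
import pandas as pd

class SummarizeBatch(beam.DoFn):
    def process(self, element):
        # element is assumed to be a small batch of (key, value) records;
        # each batch gets its own DataFrame and its own independent computation
        df = pd.DataFrame(element, columns=['key', 'value'])
        yield df.groupby('key')['value'].mean().to_dict()

with beam.Pipeline() as p:
    (p
     | beam.Create([[('a', 1.0), ('a', 3.0), ('b', 2.0)]])
     | beam.ParDo(SummarizeBatch())
     | beam.Map(print))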
It is not supported in the sense that Apache Beam currently provides no special integration with it, e.g. you can't use a PCollection as a pandas dataframe, or vice versa. A PCollection does not physically contain any data (this should be particularly clear for streaming pipelines) - it is just a placeholder node in Beam's execution plan.
That said, a pandas-like API for working with Beam PCollections would certainly be a good idea, and would simplify learning Beam for many existing pandas users, but I don't think anybody is working on implementing this currently. However, the Beam community is currently discussing the idea of adding schemas to PCollections, which is a step in this direction.
As well as using Pandas directly from DoFns, Beam now has an API to manipulate PCollections as Dataframes. See https://s.apache.org/simpler-python-pipelines-2020 for more details.
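A minimal sketch of that DataFrame API, assuming a recent Beam release where apache_beam.dataframe is available (the schema and field names are illustrative):

import typing
import apache_beam as beam
from apache_beam.dataframe.convert import to_dataframe, to_pcollection

class Order(typing.NamedTuple):
    user: str
    amount: float

with beam.Pipeline() as p:
    orders = p | beam.Create([Order('alice', 10.0),
                              Order('alice', 5.0),
                              Order('bob', 3.0)]).with_output_types(Order)
    df = to_dataframe(orders)              # deferred, pandas-like DataFrame
    totals = df.groupby('user').sum()      # planned, not executed eagerly
    _ = to_pcollection(totals) | beam.Map(print)  # convert back to a PCollection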
pandas is supported in the Dataflow SDK for Python 2.x. As of writing, workers have pandas v0.18.1 pre-installed, so you should not have any issue with that. Stack Overflow is not really the place to ask the community to point you to external documentation and/or tutorials, so maybe you should first try an implementation yourself and then come back with more information about what is and isn't working and what you achieved before running into an error.
In any case, if what you want to achieve is a left join, you could also have a look at the CoGroupByKey transform, which is documented in the Apache Beam documentation. It is used to perform relational joins of several PCollections with a common key type. On that same page, you will find some examples that use CoGroupByKey and ParDo to join the contents of several data objects.
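For example, a key-based left join with CoGroupByKey could look roughly like the following sketch (the pipeline, names and data are illustrative):

import apache_beam as beam

def left_join(element):
    # element is (key, {'left': [...], 'right': [...]})
    key, grouped = element
    rights = list(grouped['right'])
    for left in grouped['left']:
        if rights:
            for right in rights:
                yield key, left, right
        else:
            yield key, left, None    # no match on the right side

with beam.Pipeline() as p:
    left = p | 'left' >> beam.Create([('alice', 'alice@example.com'),
                                      ('bob', 'bob@example.com')])
    right = p | 'right' >> beam.Create([('alice', 42)])
    ({'left': left, 'right': right}
     | beam.CoGroupByKey()
     | beam.FlatMap(left_join)
     | beam.Map(print))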

Best practice approaches to ETL on Bigquery?

I'm wondering what best practices/tools people have found for building and managing ETL jobs on BigQuery.
At the moment I have lots of SQL 'templates' (horribly parameterized by LOB, date, etc. using sed-style string replacements into a tmp.sql file that I then run), and I use the command-line tool to run sequences of them and send the output to tables. It works fine, but it is getting a bit unwieldy. I still don't get why I can't run stored-procedure-style parameterized scripts on BigQuery, or use some sort of GUI to build and manage pipelines.
I love BigQuery, but I really feel like I'm either missing something very obvious here or this is a real gap in the product (I'm pretty sure Apache Drill is more built out in this regard).
So I'm just wondering if anyone can share any best-practice ETL tips or approaches you use yourselves.
I also use Xplenty for some jobs, which is good, but it's a bit messy in that I can't just write SQL in it, so it can be painful to build and debug complicated pipelines.
I was also thinking about looking into Talend, but really, parameterized stored procedures, macros and SQL are all I'd ideally need.
Sorry if this is more of a discussion question than a specific code question. Happy to move it to reddit or something if it's more suited there.
Google Cloud Dataflow is closer to your needs than BigQuery in my opinion. We use it for real-time streaming ETL with automatic scaling. Works great, though you will need to code Java.

Is it possible to use a Pig built-in function inside a Pig Java UDF?

I am new to Pig and am writing Java UDFs for operations that already exist in the builtin package, but the data types do not match when they are called from the application.
So I need to wrap Pig's built-in functions so that they work with the correct data types for my user-defined data types.
Please suggest.
As mentioned in the comments, the solution that you propose is not possible.
Though you did not ask this (and did not provide relevant information to enable people to be more specific), it is probably possible to solve your problem with a different solution.

Do dplyr functions on a database tbl execute locally or remotely?

I've been using dplyr for a bit locally and I've found it a very powerful tool. One thing that gets showcased in a lot of the intro talks I've found is how you can use it to operate on a database table "to only work with the data you want" via its aggregation functions, summarize, mutate, etc. I understand how it translates those into sql statements, but not so much other operations.
For example, if I wanted to work on a database table as a tbl, and I wanted to run a function on the result of my pipeline through do(), such as glm, would glm be transported to the database somehow to be run there, or is the data necessarily downloaded (in whatever reduced form) and then glm is run locally?
Depending on the size of the table in question, this is an important distinction. Thanks!
Any R analyses, such as calls to glm(), are run locally. As @joran commented above, the databases vignette, the introductory documentation, the development information, and the many resources you can find on using dplyr are useful for learning how certain operations are converted to SQL and executed on the DB system. I believe you can introduce bottlenecks by putting R-specific analyses in the middle of a chain of operations, when finishing the DB-capable operations first might be more efficient.

What data generators?

I'm about to release a FOSS data generator that can generate random yet meaningful data in CSV format. Rather belatedly, I guess, I need to poll the state of the art for such products, because if there is a well-known and useful existing tool, I can write my work off to experience. I am aware of a couple of SQL Server specific tools, but mine is not database specific.
So, links? And if you have used such a product,
what features did you find it was missing?
Edit: To add a bit more info on my tool (Ooh, Matron!) it is intended to allow generation of any kind of random data from existing data files, and
supports weighting. It is XML based (sorry, folks) and lets you say things like:
<pick distribute="20,80">
    <datafile file="femalenames.dat"/>
    <datafile file="malenames.dat"/>
</pick>
to select female names about 20% of the time and male names 80% of the time.
But the purpose of this question is not to describe my product but to get info on other tools.
Latest: If anyone is interested, they can get the alpha of my data generator at http://code.google.com/p/csvtest
That can be a one-liner in R where I use the littler scripting front-end:
# generate the data as a one-liner from the command-line
# we set the RNG seed, and draw from a bunch of distributions
# indented just to fit the box here
edd@ron:~$ r -e'set.seed(42); write.csv(data.frame(y=runif(10), x1=rnorm(10),
x2=rt(10,4), x3=rpois(10, 0.4)), file="/tmp/neil.csv",
quote=FALSE, row.names=FALSE)'
edd@ron:~$ cat /tmp/neil.csv
y,x1,x2,x3
0.914806043496355,-0.106124516091484,0.830735621223563,0
0.937075413297862,1.51152199743894,1.6707628713402,0
0.286139534786344,-0.0946590384130976,-0.282485683052060,0
0.830447626067325,2.01842371387704,0.714442314565005,0
0.641745518893003,-0.062714099052421,-1.08008578470128,0
0.519095949130133,1.30486965422349,2.28674786332467,0
0.736588314641267,2.28664539270111,-0.73270267483628,1
0.134666597237810,-1.38886070111234,-1.45317770550920,1
0.656992290401831,-0.278788766817371,-1.01676025893376,1
0.70506478403695,-0.133321336393658,0.404860813371462,0
edd@ron:~$
You have not said anything about your data-generating process, but rest assured that R can probably cope with just about any requirement, including multivariate normal, t, skew-t, and more. The (six different) random-number generators in R are also of very high quality.
R can also write to DBs, or read parameters from them, and if it needs to be on Windoze then the Rscript front-end could be used instead of littler.
I asked a similar question some months ago:
Tools for Generating Mock Data?
I got some sincere suggestions, but most were not suitable for my needs. Either expensive (non-free) software, or else not flexible enough w.r.t. data types and database structure, or range of mock data, or way too slow (e.g. the Rails ActiveRecord solution).
Features I was looking for were:
Generate mock data to fill existing database tables
Quick to generate > 1 million rows
Produce either SQL script format or flat file suitable for importing
Scriptable command-line interface, not a GUI
Not dependent on Microsoft Windows environment
Nice-to-have features:
Extensible/configurable
Open-source, free license
Written in a dynamic language like Perl/PHP/Python
Point it at a database and let it "discover" the metadata
Integrated with testing tools (e.g. DbUnit)
Option to fill directly into the database as it generates data
The answer I accepted was Databene Benerator, though since asking the question, I admit I haven't used it very much.
I was surprised that even when asking the community, the range of tools for generating mock data was so thin. This seems like a niche waiting to be filled! I'll be interested to see what you release.