Hadoop: supporting multiple outputs for MapReduce jobs

It seems like this is supported in Hadoop (reference), but I don't know how to use it.
I want to :
a.) Map - Read a huge XML file and load the relevant data and pass on to reduce
b.) Reduce - write two .sql files for different tables
I am choosing map/reduce because I have to do this for over 100k (maybe many more) XML files residing on disk. Any better suggestions are welcome.
Any resources/tutorials explaining how to use this are appreciated.
I am using Python and would like to learn how to achieve this using streaming.
Thank you
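
For reference, here is a minimal Hadoop Streaming mapper sketch. It assumes each input line handed to the mapper is the path of one on-disk XML file, and the element and attribute names ("record", "id", "payload") are placeholders for the real schema:

```python
#!/usr/bin/env python
# Streaming-mapper sketch: parse one XML file per input line and emit
# key<TAB>value pairs for the reducer. Element/attribute names are placeholders.
import sys
import xml.etree.ElementTree as ET

for line in sys.stdin:
    path = line.strip()               # assumed: each input line is a file path
    if not path:
        continue
    try:
        root = ET.parse(path).getroot()
    except (ET.ParseError, OSError):
        continue                      # skip malformed or missing files
    for record in root.iter("record"):
        key = record.get("id", "")
        value = record.findtext("payload", "")
        sys.stdout.write("%s\t%s\n" % (key, value))
```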

Might not be an elegant solution, but you could create two templates to convert the output of the reduce tasks into the required format once the job is complete. Much of this could be automated with a shell script that looks for the reduce outputs and applies the templates to them. With the shell script the transformation happens sequentially and doesn't take advantage of the n machines in the cluster.
Or else, in the reduce tasks you could write the two output formats into a single file with some delimiter and split them later using the delimiter. In this approach the transformation happens in the reduce, so it is spread across all the nodes in the cluster.
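
A minimal streaming-reducer sketch of that delimiter approach (the key/value layout and the table and column names are placeholders):

```python
#!/usr/bin/env python
# Streaming-reducer sketch: prefix every emitted SQL statement with its target
# table so the part-* outputs can be split afterwards on the first field.
import sys

SEP = "\t"

for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition(SEP)
    sys.stdout.write("table_a%sINSERT INTO table_a (id, payload) VALUES ('%s', '%s');\n"
                     % (SEP, key, value))
    sys.stdout.write("table_b%sINSERT INTO table_b (id) VALUES ('%s');\n"
                     % (SEP, key))
```

After the job finishes, a small post-processing step can split each part-* file on the first field into the two .sql files.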

Related

Multiple step Pandas processing with Airflow

I have a multi-stage ETL transform using pandas. Basically, I load almost 2 GB of data from MongoDB and then apply several functions to the columns. My question is whether there's any way to break those transformations into multiple Airflow tasks.
The options I have considered are:
Creating a temporary table in MongoDB and loading/storing the transformed data frame between steps. I found this cumbersome and prone to unusual overhead due to disk I/O.
Passing data among the tasks using XCom. I think this is a nice solution, but I worry about the sheer size of the data. The docs explicitly state:
Any object that can be pickled can be used as an XCom value, so users should make sure to use objects of appropriate size.
Using in-memory storage between steps. Maybe saving the data in a Redis server or something, but I'm not really sure whether that would be any better than just using XCom.
So, do any of you have tips on how to handle this situation? Thanks!
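
For illustration, here is a minimal sketch of the first option, staging the intermediate DataFrame between tasks, but to a Parquet file rather than a temporary MongoDB collection. The DAG id, paths and helper functions are hypothetical, and pyarrow (or fastparquet) is assumed to be installed:

```python
# Staging a DataFrame between Airflow tasks via Parquet instead of XCom.
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

STAGE = "/tmp/etl_stage.parquet"      # assumed local/shared staging path

def load_from_mongo():
    # placeholder for the real ~2 GB MongoDB read
    return pd.DataFrame({"col": range(10)})

def apply_column_functions(df):
    # placeholder for the real per-column transformations
    return df

def extract():
    df = load_from_mongo()
    df.to_parquet(STAGE)              # persist instead of pushing via XCom

def transform():
    df = pd.read_parquet(STAGE)
    df = apply_column_functions(df)
    df.to_parquet(STAGE)

with DAG("pandas_etl", start_date=datetime(2019, 1, 1), schedule_interval=None) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_extract >> t_transform
```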

Hint parallelization for U-SQL outputter

I have a custom U-SQL outputter which has some reasonably heavy lifting to do.
My understanding is that an outputter will naturally parallelize and create separate files, as stitching the files back together in Azure Data Lake Store is a quick operation.
Running as either a custom outputter or processor (with a CSV outputter) I only ever see one vertex being run.
Is there a way to hint to multiple vertices?

Optimal data streaming and processing solution for enormous datasets into tf.data.Dataset

Context:
My text input pipeline currently consists of two main parts:
I. A complex text preprocessing step that exports tf.SequenceExamples to tfrecords (custom tokenization, vocabulary creation, statistics calculation, normalization and many more steps over the full dataset as well as per individual example). This is done once for each data configuration.
II. A tf.Dataset (TFRecords) pipeline that does quite a bit of processing during training, too (string_split into characters, table lookups, bucketing, conditional filtering, etc.).
The original dataset is spread across multiple locations (BigQuery, GCS, RDS, ...).
Problem:
The problem is that as the production dataset grows rapidly (several terabytes), it is not feasible to recreate tfrecords files for each possible data configuration (part I has a lot of hyperparameters), as each set would have an enormous size of hundreds of terabytes. Not to mention that tf.Dataset reading speed surprisingly slows down as tf.SequenceExamples or tfrecords grow in size.
There are quite a few possible solutions:
Apache Beam + Cloud DataFlow + feed_dict;
tf.Transform;
Apache Beam + Cloud DataFlow + tf.Dataset.from_generator;
tensorflow/ecosystem + Hadoop or Spark
tf.contrib.cloud.BigQueryReader
, but none of them seems to fully fulfill my requirements:
Streaming and processing on the fly data from BigQuery, GCS, RDS, ... as in part I.
Sending data (protos?) directly to tf.Dataset in one way or another to be used in part II.
Fast and reliable for both training and inference.
(optional) Being able to pre-calculate some full pass statistics over the selected part of the data.
EDIT: Python 3 support would be just wonderful.
What is the most suitable choice for the tf.data.Dataset pipeline? What are the best practices in this case?
Thanks in advance!
I recommend orchestrating the whole use case with Cloud Composer (the GCP-managed Airflow service).
Airflow provides operators that let you orchestrate a pipeline with a script.
In your case you can use the dataflow_operator to spin up a Dataflow job once you have enough data to process.
To get the data from BigQuery you can use the bigquery_operator.
Furthermore, you can use the Python operator or the Bash operator to monitor and pre-calculate statistics.
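
A rough DAG sketch of that orchestration (Airflow 1.x contrib import paths as used by Cloud Composer at the time; all project, dataset, bucket and job names are placeholders, and operator parameters differ between Airflow versions):

```python
# Cloud Composer / Airflow sketch: export a BigQuery slice, run a Dataflow
# (Apache Beam) preprocessing job, then pre-calculate statistics in Python.
from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator
from airflow.operators.python_operator import PythonOperator

def precalculate_stats():
    pass  # placeholder: full-pass statistics over the selected data slice

with DAG("tfrecord_preprocessing", start_date=datetime(2019, 1, 1),
         schedule_interval=None) as dag:

    export_slice = BigQueryOperator(
        task_id="export_slice",
        sql="SELECT * FROM `my-project.my_dataset.source`",    # placeholder query
        destination_dataset_table="my-project.my_dataset.slice",
        use_legacy_sql=False,
    )

    beam_preprocess = DataFlowPythonOperator(
        task_id="beam_preprocess",
        py_file="gs://my-bucket/pipelines/preprocess.py",      # placeholder Beam job
        options={"project": "my-project",
                 "temp_location": "gs://my-bucket/tmp"},
    )

    stats = PythonOperator(task_id="stats", python_callable=precalculate_stats)

    export_slice >> beam_preprocess >> stats
```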

How can I make a Spark paired RDD from many S3 files whose URLs are in an RDD?

I have millions of S3 files whose sizes average about 250 KB but are highly variable (a few are up to 4 GB in size). I can't easily use wildcards to pick out multiple files, but I can make an RDD holding the S3 URLs of the files I want to process at any time.
I'd like to get two kinds of paired RDDs. The first would have the S3 URL, then the contents of the file as a Unicode string. (Is that even possible when some of the files can be so long?) The second could be computed from the first, by split()-ting the long string at newlines.
I've tried a number of ways to do this, typically getting a Python PicklingError, unless I iterate through the S3 URLs one at a time. Then I can use union() to build up the big pair RDDs I want, as described in another question. But I don't think that will run in parallel, which will be important when dealing with lots of files.
I'm currently using Python, but can switch to Scala or Java if needed.
Thanks in advance.
The size of the files shouldn't matter as long as your cluster has the in-memory capacity. Generally, you'll need to do some tuning before everything works.
I'm not well versed in Python, so I can't comment much on the pickling error. Perhaps these links might help, but I'll add the python tag so that someone more familiar can take a look.
cloudpickle.py
pyspark serializer can't handle functions
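
For illustration, here is a minimal PySpark sketch of building the two pair RDDs without hitting the usual PicklingError; it assumes boto3 is installed on every worker, and the URL list is a placeholder:

```python
# Build (url, contents) and (url, line) pair RDDs from S3 URLs. The S3 client
# is created inside the task rather than captured in the closure, since
# non-serializable client objects are a common cause of PicklingError.
from urllib.parse import urlparse

from pyspark import SparkContext

sc = SparkContext(appName="s3-urls-to-contents")

s3_urls = ["s3://my-bucket/path/one.txt", "s3://my-bucket/path/two.txt"]  # placeholder
urls = sc.parallelize(s3_urls, numSlices=1000)  # partition generously so fetches run in parallel

def fetch(url):
    import boto3                      # import on the worker, not the driver
    parsed = urlparse(url)            # s3://bucket/key
    bucket, key = parsed.netloc, parsed.path.lstrip("/")
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    return url, body.decode("utf-8", errors="replace")

url_contents = urls.map(fetch)                                          # (url, whole file)
url_lines = url_contents.flatMapValues(lambda text: text.split("\n"))   # (url, line)
```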

Loading large amounts of data to an Oracle SQL Database

I was wondering if anyone has experience with what I am about to embark on. I have several CSV files, each around a GB in size, that I need to load into an Oracle database. While most of my work after loading will be read-only, I will need to load updates from time to time. Basically I just need a good tool for loading several rows of data at a time into my DB.
Here is what I have found so far:
I could use SQL*Loader to do a lot of the work
I could use Bulk-Insert commands
Some sort of batch insert.
Using prepared statements somehow might be a good idea. I guess I'm wondering what everyone thinks is the fastest way to get this insert done. Any tips?
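
For the prepared-statement / batch-insert route mentioned above, a minimal Python sketch using cx_Oracle's executemany could look like this (the connect string, table and column names, and CSV layout are assumptions; the SQL*Loader direct path load discussed below is generally faster for pure bulk loads):

```python
# Batch inserts with a single prepared statement via cx_Oracle.executemany.
import csv

import cx_Oracle

BATCH_SIZE = 10000
INSERT_SQL = "INSERT INTO target_table (id, name, amount) VALUES (:1, :2, :3)"

conn = cx_Oracle.connect("user/password@db-host:1521/service")  # placeholder DSN
cur = conn.cursor()

with open("data.csv", newline="") as f:
    batch = []
    for row in csv.reader(f):
        batch.append(row)
        if len(batch) == BATCH_SIZE:
            cur.executemany(INSERT_SQL, batch)   # one round trip per batch
            batch = []
    if batch:
        cur.executemany(INSERT_SQL, batch)

conn.commit()
cur.close()
conn.close()
```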
I would be very surprised if you could roll your own utility that will outperform SQL*Loader Direct Path Loads. Oracle built this utility for exactly this purpose - the likelihood of building something more efficient is practically nil. There is also the Parallel Direct Path Load, which allows you to have multiple direct path load processes running concurrently.
From the manual:
Instead of filling a bind array buffer and passing it to the Oracle database with a SQL INSERT statement, a direct path load uses the direct path API to pass the data to be loaded to the load engine in the server. The load engine builds a column array structure from the data passed to it.
The direct path load engine uses the column array structure to format Oracle data blocks and build index keys. The newly formatted database blocks are written directly to the database (multiple blocks per I/O request using asynchronous writes if the host platform supports asynchronous I/O).
Internally, multiple buffers are used for the formatted blocks. While one buffer is being filled, one or more buffers are being written if asynchronous I/O is available on the host platform. Overlapping computation with I/O increases load performance.
There are cases where Direct Path Load cannot be used.
With that amount of data, you'd better be sure of your backing store - the dbf disks' free space.
sqlldr is script driven and very efficient, generally more efficient than a SQL script.
The only thing I wonder about is the magnitude of the data. I would personally consider running several to many sqlldr processes, assigning each one a subset of the data, and letting the processes run in parallel.
You said you wanted to load a few records at a time? That may take a lot longer than you think. Did you mean a few files at a time?
You may be able to create an external table on the CSV files and load them by SELECTing from the external table into another table. I'm not sure whether this method will be quicker overall, but it might save you the hassle of getting SQL*Loader to work, especially when you have criteria for UPDATEs.