Nesting pipelines in apache beam - tensorflow

I am looking to do the following with Apache Beam, specifically pre-processing for a TensorFlow neural network:
for each file in a folder,
for each line in that file,
process the line into a 1D list of floats.
I need the result for each file to be a 2D list of floats.
I think I can accomplish this by creating nested pipelines.
I could create and run a pipeline inside of a ParDo of another pipeline.
This seems inefficient, but my problem seems like a pretty standard use case.
Is there a tool in Apache Beam that does this better?
Is there a way to restructure my problem so that it fits Apache Beam better?
Are nested pipelines not as bad as I think they are?
Thanks

Apache Beam is a great tool for pre-processing data for machine learning with TensorFlow. More information about this general use case and tf.Transform is available in this post.
Nothing described here seems to indicate the need for "nested pipelines". Processing each line of each file in a directory is a simple TextIO.Read transformation. It is unclear what your requirements are beyond that, but, in general, splitting each line into floats and joining it with other lines are straightforward ParDo and grouping operations.
As general guidance, I'd avoid nested pipelines and try to break the problem down so it fits into a single pipeline; a minimal sketch follows below.
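To make that concrete, here is a minimal sketch using the Beam Python SDK. It uses fileio.MatchFiles/ReadMatches rather than a plain text read so that the filename is kept for the per-file grouping; the folder pattern and the comma-separated float parsing are assumptions, not something taken from your description.

    # A minimal sketch, assuming the Beam Python SDK and comma-separated floats.
    import apache_beam as beam
    from apache_beam.io import fileio

    def file_to_floats(readable_file):
        """Turn one file into (path, 2D list of floats), one row per non-empty line."""
        path = readable_file.metadata.path
        rows = []
        for line in readable_file.read_utf8().splitlines():
            if line.strip():
                rows.append([float(x) for x in line.split(',')])  # placeholder parsing
        yield path, rows

    with beam.Pipeline() as p:
        _ = (
            p
            | 'MatchFiles' >> fileio.MatchFiles('/path/to/folder/*.txt')  # placeholder pattern
            | 'ReadMatches' >> fileio.ReadMatches()
            | 'FileToFloats' >> beam.FlatMap(file_to_floats)
            # each element is now (file_path, [[float, ...], ...])
            | 'Print' >> beam.Map(print)  # replace with your real sink or tf.Transform step
        )

The whole job stays in one pipeline: one read and one ParDo, with the per-file 2D lists coming out as keyed elements rather than as the results of nested pipelines.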

Related

Is it better to open or to read large matrices in Julia?

I'm in the process of switching over to Julia from other programming languages, and one of the things that Julia will let you hang yourself on is memory. I think this is likely a good thing: a programming language where you actually have to think about some amount of memory management forces the coder to write more efficient code. This is in contrast to something like R, where you can seemingly load datasets that are larger than the allocated memory. Of course, you can't actually do that, so I wonder: how does R get around that problem?
Part of what I've done in other programming languages is work on large tabular datasets, often converted to an R data frame or a matrix. I think the way this is handled in Julia is to stream data in wherever possible, so my main question is this:
Is it better to use readline("my_file.txt") to access data, or is it better to use open("my_file.txt", "r")? If possible, wouldn't it be better to access a large dataset all at once for speed? Or would it be better to always stream data?
I hope this makes sense. Any further resources would be greatly appreciated.
I'm not an extensive user of Julia's data-ecosystem packages, but CSV.jl offers the Chunks and Rows alternatives to File, and these might let you process the files incrementally.
While it may not be relevant to your use case, the mechanisms mentioned in @Przemyslaw Szufel's answer are used in other places as well. Two I'm familiar with are the TiffImages.jl and NRRD.jl packages, both I/O packages mostly for loading image data into Julia. With these, you can load terabyte-sized datasets on a laptop. There may be more packages that use the same mechanism, and many package maintainers would probably be grateful to receive a pull request that supports optional memory-mapping when applicable.
In R you cannot have a data frame larger than memory. There is no magical buffering mechanism. However, when running R-based analytics you could use the disk.frame package for that.
Similarly, in Julia, if you want to process data frames larger than memory you need to use an appropriate package. The most reasonable and natural option in the Julia ecosystem is JuliaDB.
If you want a more low-level solution, have a look at:
Mmap, which provides memory-mapped I/O and exactly solves the issue of conveniently handling data too large to fit into memory
SharedArrays, which offers a disk-mapped array whose implementation is based on Mmap.
In conclusion, if your data is data-frame based, try JuliaDB; otherwise have a look at Mmap and SharedArrays (note the filename parameter).

What exactly are orchestrators in ML?

In ML pipeline components we specify inputs and outputs clearly.
For example, in TFX, StatisticsGen takes its input from ExampleGen and outputs some statistics, so the input and output are clear, and this is the same for all components. So why do we need orchestrators? If anyone knows, please help me.
In real-life projects, everything can be much more complicated:
the input data can come from different sources: databases, file systems, third-party services. So we need to do classical ETL before we can start working with the data.
you can use different technologies in one pipeline. For instance, Spark as a preprocessing tool, after which you may need an instance with a GPU for model training.
last, but not least, in production you need to take care of many more things, for instance data validation, model evaluation, etc. I wrote a separate article about how to organize this part using Apache Airflow; a minimal sketch of such a DAG is shown below.
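As a hedged illustration of what the orchestrator itself adds (scheduling, retries, and the dependency graph between heterogeneous steps), here is a minimal Apache Airflow 2.x sketch; the three task callables are placeholders standing in for real ETL, Spark preprocessing, and GPU training steps, not a real TFX setup.

    # A minimal sketch of an Airflow DAG, assuming Airflow 2.x; the task bodies are placeholders.
    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def extract_data():            # stand-in for ETL from databases / files / third-party APIs
        print("running ETL")

    def preprocess_with_spark():   # stand-in for submitting a Spark preprocessing job
        print("preprocessing")

    def train_on_gpu():            # stand-in for launching training on a GPU instance
        print("training")

    with DAG(
        dag_id="ml_pipeline_sketch",
        start_date=datetime(2023, 1, 1),
        schedule_interval="@daily",                                        # scheduling...
        default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},  # ...and retries
        catchup=False,
    ) as dag:
        etl = PythonOperator(task_id="etl", python_callable=extract_data)
        prep = PythonOperator(task_id="spark_preprocess", python_callable=preprocess_with_spark)
        train = PythonOperator(task_id="gpu_training", python_callable=train_on_gpu)

        etl >> prep >> train  # the dependency graph the orchestrator executes and monitors

None of the individual components knows about scheduling, retries, or what runs before or after it; that wiring is exactly what the orchestrator provides.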

Is there a way to partition a tf.Dataset with TensorFlow’s Dataset API?

I checked the docs but I could not find a method for it. I want to do cross validation, so I kind of need it.
Note that I'm not asking how to split a tensor, as I know that TensorFlow provides an API for that and it has been answered in another question. I'm asking how to partition a tf.Dataset (which is an abstraction).
You could either:
1) Use the shard transformation to partition the dataset into multiple "shards". Note that for best performance, sharding should be applied to the data sources (e.g. filenames); see the sketch after this list.
2) As of TensorFlow 1.12, you can also use the window transformation to build a dataset of datasets.
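A short sketch of both options, using a toy range dataset purely for illustration (your real input would be filenames or records):

    import tensorflow as tf

    dataset = tf.data.Dataset.range(10)  # toy stand-in for your real dataset

    # Option 1: shard into k partitions; partition i keeps every k-th element.
    # For best performance this is usually applied to the filenames, before parsing.
    shard_0 = dataset.shard(num_shards=2, index=0)  # 0, 2, 4, 6, 8
    shard_1 = dataset.shard(num_shards=2, index=1)  # 1, 3, 5, 7, 9

    # Option 2 (TF >= 1.12): window into a dataset of datasets of size 5.
    for window in dataset.window(5):
        print(list(window.as_numpy_iterator()))  # [0..4], then [5..9]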
I am afraid you cannot. The Dataset API is a way to efficiently stream inputs to your net at run time. It is not a set of tools to manipulate datasets as a whole -- in that regard it might be a bit of a misnomer.
Also, if you could, this would probably be a bad idea. You would rather have this train/test split done once and for all:
it lets you review those sets offline
if the split is done each time you run an experiment, there is a risk that samples start swapping sets if you are not extremely careful (e.g. when you add more data to your existing dataset); a sketch of doing the split once, offline, follows below.
See also a related question about how to split a set into training & testing in tensorflow.
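To make the "once and for all" point concrete, here is one hedged sketch (not the only way) of splitting a list of input files offline with a fixed seed and persisting the split, so that later tf.data pipelines always see the same partition; the glob pattern, ratio, and TFRecord format are assumptions.

    # A minimal sketch of an offline, reproducible train/test split of file paths.
    import glob
    import random

    import tensorflow as tf

    files = sorted(glob.glob("data/*.tfrecord"))   # placeholder pattern
    random.Random(42).shuffle(files)               # fixed seed for a reproducible shuffle

    split = int(0.8 * len(files))                  # placeholder ratio
    train_files, test_files = files[:split], files[split:]

    # Persist the split so it can be reviewed offline and reused unchanged,
    # even after new files are added to the directory.
    with open("train_files.txt", "w") as f:
        f.write("\n".join(train_files))
    with open("test_files.txt", "w") as f:
        f.write("\n".join(test_files))

    # Later, build the datasets from the saved lists.
    train_ds = tf.data.TFRecordDataset(train_files)
    test_ds = tf.data.TFRecordDataset(test_files)

The saved file lists, not the shuffle itself, are what keep samples from swapping sets when the underlying data grows.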

Exporting Tensorflow Models for Eigen Only Environments

Has anyone seen any work done on this? I'd think this would be a reasonably common use case: train a model in Python, then export the graph and map it to a sequence of Eigen instructions?
I don't believe anything like this is available, but it is definitely something that would be useful. There are some obstacles to overcome though:
Not all operations are implemented by Eigen.
We'd need to know how to generate code for all operations we want to support.
The glue code to allocate buffers and schedule work can get pretty gnarly.
It's still a good idea though, and it might get more attention if posted as a feature request on https://github.com/tensorflow/tensorflow/issues/

How can I make a Spark paired RDD from many S3 files whose URLs are in an RDD?

I have millions of S3 files, whose sizes average about 250k but are highly variable (a few are up to 4 GB in size). I can't easily use wildcards to pick out multiple files, but I can make an RDD holding the S3 URLs of the files I want to process at any time.
I'd like to get two kinds of paired RDDs. The first would have the S3 URL, then the contents of the file as a Unicode string. (Is that even possible when some of the files can be so long?) The second could be computed from the first, by split()-ting the long string at newlines.
I've tried a number of ways to do this, typically getting a Python PicklingError, unless I iterate through the RDD of S3 URLs one at a time. Then I can use union() to build up the big pair RDDs I want, as was described in another question. But I don't think that is going to run in parallel, which will be important when dealing with lots of files.
I'm currently using Python, but can switch to Scala or Java if needed.
Thanks in advance.
The size of the files shouldn't matter as long as your cluster has enough in-memory capacity. Generally, you'll need to do some tuning before everything works.
I'm not well versed in Python, so I can't comment too much on the pickling error. Perhaps these links might help, but I'll add the python tag so that someone more familiar can take a look. A rough sketch of one way around the pickling issue follows after the links below.
cloudpickle.py
pyspark serializer can't handle functions
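As a rough sketch of one way to sidestep the PicklingError, the boto3 client below is created inside the function that runs on the executors, so nothing unpicklable is captured in the closure; the example URLs, the bucket/key parsing, and the lack of error handling are all simplifications.

    # A rough sketch: build (url, contents) and (url, line) pair RDDs in parallel.
    from urllib.parse import urlparse

    from pyspark import SparkContext

    sc = SparkContext(appName="s3-pair-rdd-sketch")
    urls = sc.parallelize(["s3://my-bucket/key1", "s3://my-bucket/key2"])  # placeholder URLs

    def fetch(url):
        import boto3                         # imported and constructed on the executor
        parsed = urlparse(url)
        body = boto3.client("s3").get_object(
            Bucket=parsed.netloc, Key=parsed.path.lstrip("/")
        )["Body"].read()
        return url, body.decode("utf-8")     # multi-GB files may not fit comfortably in one string

    contents_rdd = urls.map(fetch)                                          # (url, whole file)
    lines_rdd = contents_rdd.flatMapValues(lambda text: text.splitlines())  # (url, line)

The map runs in parallel across the partitions of the URL RDD, so there is no need to union() per-file RDDs together one at a time.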