I've been reading about the improvements of DataFrames/Datasets over RDDs: the Tungsten row format, code generation, etc. Some texts seem to imply that DataFrames/Datasets are converted to RDDs as part of the optimizer pipeline. Is this correct?
Spark: The Definitive Guide says:
Physical planning results in a series of RDDs and transformations. This is why you might have heard Spark referred to as a compiler - it takes queries in DataFrames, Datasets and SQL and compiles them into RDD transformations for you.
And on slide 10 of the slideset below we see RDDs as the end result of the optimizer pipeline.
https://www.slideshare.net/databricks/sparksql-a-compiler-from-queries-to-rdds
But if I understand correctly, DataFrames and Datasets are stored in memory in the Tungsten row format while RDDs are stored as Java objects. Does this mean that when we perform a query (which gets processed by the optimizer) on a DataFrame, we eventually end up with Java objects instead of Tungsten rows?
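For what it's worth, one way to peek at what the planner actually produces for a DataFrame query is explain(). A tiny PySpark sketch (the query itself is made up for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.range(10).withColumn("doubled", col("id") * 2)
# Prints the parsed/analyzed/optimized logical plans and the physical plan;
# stages marked with * are executed via whole-stage code generation.
df.explain(True)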
Thanks in advance
My current understanding is:
Different map_func: Both interleave and flat_map expect "A function mapping a dataset element to a dataset". In contrast, map expects "A function mapping a dataset element to another dataset element".
Arguments: Both interleave and map offer the argument num_parallel_calls, whereas flat_map does not. Moreover, interleave offers these magical arguments block_length and cycle_length. For cycle_length=1, the documentation states that the outputs of interleave and flat_map are equal.
Last, I have seen data loading pipelines without interleave as well as ones with interleave. Any advice on when to use interleave vs. map or flat_map would be greatly appreciated.
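To make the signature difference above concrete, a toy sketch (TF 2.x assumed): map's function returns an element, while flat_map's returns a whole dataset that is then flattened.

import tensorflow as tf

ds = tf.data.Dataset.range(3)  # [0, 1, 2]

# map: element -> element
doubled = ds.map(lambda x: x * 2)
print(list(doubled.as_numpy_iterator()))  # [0, 2, 4]

# flat_map: element -> dataset (then flattened)
repeated = ds.flat_map(lambda x: tf.data.Dataset.from_tensors(x).repeat(2))
print(list(repeated.as_numpy_iterator()))  # [0, 0, 1, 1, 2, 2]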
//EDIT: I do see the value of interleave if we start out with different datasets, such as in the code below:
files = tf.data.Dataset.list_files("/path/to/dataset/train-*.tfrecord")
dataset = files.interleave(tf.data.TFRecordDataset)
However, is there any benefit of using interleave over map in a scenario such as the one below?
files = tf.data.Dataset.list_files("/path/to/dataset/train-*.png")
dataset = files.map(load_img, num_parallel_calls=tf.data.AUTOTUNE)
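For comparison, a hypothetical interleave-based version of that same pipeline (load_img is the same assumed decoding function as above):

files = tf.data.Dataset.list_files("/path/to/dataset/train-*.png")
dataset = files.interleave(
    lambda path: tf.data.Dataset.from_tensors(path).map(load_img),
    num_parallel_calls=tf.data.AUTOTUNE)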
Edit:
Can map not also be used to parallelize I/O?
Indeed, you can read images and labels from a directory with the map function. Assume this case:
list_ds = tf.data.Dataset.list_files(my_path)

def process_path(path):
    # get the label here (e.g. derived from the path); images still need to be decoded
    return tf.io.read_file(path), label

new_ds = list_ds.map(process_path, num_parallel_calls=tf.data.experimental.AUTOTUNE)
Note that it is now multi-threaded, since num_parallel_calls has been set.
The advantage of the interleave() function:
Suppose you have a dataset. With cycle_length you pull that many elements from the dataset at once, e.g. 5: five elements are taken from the input dataset and map_func is applied to each of them, producing five nested dataset objects.
interleave then fetches block_length pieces of data at a time from each of those newly generated dataset objects, cycling through them.
In other words, interleave() function can iterate through your dataset while applying a map_func(). Also, it can work with many datasets or data files at the same time. For example, from the docs:
dataset = dataset.interleave(
    lambda x: tf.data.TextLineDataset(x).map(parse_fn, num_parallel_calls=1),
    cycle_length=4, block_length=16)
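To see how cycle_length and block_length shape the output order, the toy example from the tf.data.Dataset.interleave docs is illustrative (TF 2.x assumed):

import tensorflow as tf

ds = tf.data.Dataset.range(1, 6)  # [1, 2, 3, 4, 5]
ds = ds.interleave(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(6),
    cycle_length=2, block_length=4)
print(list(ds.as_numpy_iterator()))
# [1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 4, 4, 5, 5, 5, 5, 5, 5]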
However, is there any benefit of using interleave over map in a scenario such as the one below?
Both interleave() and map() seem a bit similar, but their use cases are not the same. If you want to read a dataset while applying some mapping, interleave() is your super-hero. Your images may need to be decoded while being read; reading everything first and only then decoding may be inefficient when working with large datasets. In the code snippet you gave, AFAIK, the one with tf.data.TFRecordDataset should be faster.
TL;DR interleave() parallelizes the data loading step by interleaving the I/O operation to read the file.
map() will apply the data pre-processing to the contents of the datasets.
So you can do something like:
ds = train_file.interleave(
    lambda x: tf.data.Dataset.list_files(directory_here).map(
        func, num_parallel_calls=tf.data.experimental.AUTOTUNE))
tf.data.experimental.AUTOTUNE will decide the level of parallelism based on buffer sizes, available CPU, and I/O; in other words, AUTOTUNE tunes the level dynamically at runtime.
The num_parallel_calls argument spawns multiple threads to use multiple cores and parallelize the tasks. With this you can load multiple datasets in parallel, reducing the time spent waiting for files to be opened; interleave can also take num_parallel_calls as an argument. (The description below refers to a figure in the docs, not reproduced here.)
In that figure, there are 4 overlapping datasets, which is determined by the cycle_length argument, so in this case cycle_length = 4.
FLAT_MAP: maps a function across the dataset and flattens the result. If you want to make sure the order stays the same, you can use this. It does not take num_parallel_calls as an argument. Please refer to the docs for more.
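For instance, a toy sketch of flat_map's order-preserving flattening:

import tensorflow as tf

ds = tf.data.Dataset.from_tensor_slices([[1, 2, 3], [4, 5, 6]])
flat = ds.flat_map(lambda row: tf.data.Dataset.from_tensor_slices(row))
print(list(flat.as_numpy_iterator()))  # [1, 2, 3, 4, 5, 6] -- order preserved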
MAP:
The map function will execute the selected function on every element of the Dataset separately. Obviously, data transformations on large datasets can be expensive as you apply more and more operations. The key point is that it can be even more time consuming if the CPU is not fully utilized. But we can use the parallelism APIs:
import multiprocessing

num_of_cores = multiprocessing.cpu_count()  # number of available CPU cores
mapped_data = data.map(function, num_parallel_calls=num_of_cores)
For cycle_length=1, the documentation states that the outputs of interleave and flat_map are equal
cycle_length --> the number of input elements that will be processed concurrently. When set to 1, inputs are processed one by one.
INTERLEAVE: Transformation operations like map can be parallelized.
With a parallelized map, the CPU tries to achieve parallelization in the transformation step, but the extraction of data from disk can still cause overhead.
Besides, once the raw bytes are read into memory, it may also be necessary to map a function over the data, which of course requires additional computation (decrypting data, etc.). The impact of these various data-extraction overheads needs to be mitigated by parallelizing the extraction step, i.e. by interleaving the contents of each dataset.
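A minimal sketch of that idea (the file pattern and parse_fn are assumptions): interleave parallelizes the file reads, map handles the per-record transformation, and prefetch overlaps the pipeline with the consumer.

import tensorflow as tf

files = tf.data.Dataset.list_files("/path/to/*.tfrecord")  # hypothetical pattern
ds = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,
    num_parallel_calls=tf.data.experimental.AUTOTUNE)  # parallelize the reads
# ds = ds.map(parse_fn, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds = ds.prefetch(tf.data.experimental.AUTOTUNE)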
So while reading the datasets, you want to maximize overall throughput by overlapping extraction with transformation. (Figure omitted; source: deeplearning.ai.)
TensorFlow recommends batching datasets before transformations with map in order to vectorize the transformation and reduce per-element overhead: https://www.tensorflow.org/guide/data_performance#vectorizing_mapping
However, there are cases where you want to perform transformations on the dataset and then do something (e.g., shuffle) on the UNBATCHED dataset.
I haven't been able to find anything to indicate which is more efficient:
1) dataset.map(my_transformations)
2) dataset.batch(batch_size).map(my_transformations).unbatch()
(2) has reduced map overhead from having vectorized with batch, but has additional overhead from having to unbatch the dataset after.
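For concreteness, a sketch of the two options (dataset and batch_size are assumed to exist; my_transformations here is just a placeholder element-wise op):

import tensorflow as tf

def my_transformations(img):
    # element-wise placeholder op; works identically on one image or a batch
    return tf.cast(img, tf.float32) / 255.0

# (1) per-element mapping
ds1 = dataset.map(my_transformations)

# (2) vectorized mapping: batch, map over whole batches, then unbatch
ds2 = dataset.batch(batch_size).map(my_transformations).unbatch()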
I could also see it being that there is not a universal rule. Short of testing every time I try a new dataset or transformation (or hardware!), is there a good rule of thumb here? I have seen several examples online use (2) without explanation, but I have no intuition on this subject, so...
Thanks in advance!
EDIT: I have since found that in at least some cases, (2) is MUCH less efficient than (1). For example, on our image dataset, applying random flips and rotations (with .map and the built-in TF functions tf.image.random_flip_left_right, tf.image.random_flip_up_down, and tf.image.rot90) per epoch for data augmentation takes 50% longer with (2). I still have no idea when to expect this to be the case, or not, but the tutorials' suggested approach is at least sometimes wrong.
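For reference, a sketch of the kind of per-element augmentation map this refers to (only the three tf.image ops are from the original; choosing the rotation count randomly is my assumption):

import tensorflow as tf

def augment(img):
    img = tf.image.random_flip_left_right(img)
    img = tf.image.random_flip_up_down(img)
    # rot90 itself is deterministic; picking k at random is an assumption here
    k = tf.random.uniform([], minval=0, maxval=4, dtype=tf.int32)
    return tf.image.rot90(img, k=k)

# applied per element, i.e. option (1)
# dataset = dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE)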
The answer is (1). https://github.com/tensorflow/tensorflow/issues/40386
TF is modifying the documentation to reflect that the overhead from unbatch will usually (always?) be higher than the savings from vectorized transformations.
I have a generic Spark 2.3 job that does many transformations and joins and produces a huge DAG.
That has a big impact on the driver side, as the DAG becomes very complex.
In order to release pressure on the driver I've thought about checkpointing some intermediate dataframes to cut the DAG, but I've noticed that dataframe.checkpoint uses RDDs underneath and spends a lot of time serializing and deserializing the dataframe.
According to this question (Spark: efficiency of dataframe checkpoint vs. explicitly writing to disk) and my own experience, writing the dataframe as Parquet and reading it back is faster than checkpointing, but it has a disadvantage: the dataframe loses its partitioner.
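For reference, a minimal sketch of the two approaches being compared (paths and the SparkSession are assumptions):

# checkpoint: goes through the RDD machinery (serialization/deserialization)
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # hypothetical dir
df_ckpt = df.checkpoint()

# manual "checkpoint": write as Parquet and read it back; usually faster,
# but the partitioning information is lost on the read side
df.write.mode("overwrite").parquet("/tmp/intermediate/df")  # hypothetical path
df_reloaded = spark.read.parquet("/tmp/intermediate/df")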
Is there any way of writing and reading the dataframe back that keeps the partitioner? I've thought about using buckets when writing the dataframe, so that when the dataframe is read back it knows how the data is partitioned.
The problem is: how can I know which columns the dataframe is partitioned by? The Spark job I'm running is kind of generic, so I can't hardcode the columns.
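A minimal sketch of the bucketing idea (the column name, bucket count, and table name are placeholders, and it assumes the partitioning column is known; bucketing only works via saveAsTable, not a plain path write):

# write bucketed by the (here hard-coded) partitioning column
(df.write
   .bucketBy(200, "customer_id")        # hypothetical column and bucket count
   .sortBy("customer_id")
   .saveAsTable("intermediate_table"))  # bucketing requires a table, not a path

# reading the table back lets Spark use the bucketing to avoid a shuffle
df_back = spark.table("intermediate_table")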
Thanks
Today I'm using pandas as my main data pre-processing tool in my project, where I need to do some transformations on the data to ensure it is in the correct format that my Python class expects.
So I heard about TF Transform (TFT) and tested it a little, but I didn't see any obvious advantage (obviously I'm referring to the data transformation itself, not to a machine learning pipeline).
For example, I wrote a simple snippet in TFT to uppercase all values in my dataframe column:
upper = tf.strings.upper(input, encoding='', name=None)
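For context, that line would normally sit inside a tf.Transform preprocessing_fn; a minimal sketch (the column name 'CITY' is taken from the pandas example below, everything else is an assumption):

import tensorflow as tf

def preprocessing_fn(inputs):
    # uppercase the CITY column; tf.Transform applies this over the whole dataset
    return {'CITY_upper': tf.strings.upper(inputs['CITY'], encoding='')}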
The execution time of this pre-processing function is: 17.1880068779
This, on the other hand, is the code I use to do exactly the same thing with the dataframe:
x = dataset['CITY'].str.upper()
The execution time is: 0.0188028812408
So, am I doing something wrong? I think that if we had dozens of transformations and a dataset of millions of lines, maybe TFT would come out ahead in this comparison, but for a 100k-row dataframe it doesn't seem so useful.
I have Flask app code where an API is exposed to dump data from an Oracle database to a Postgres database.
I am using pandas to copy the content of tables from Oracle, MySQL, and Postgres to Postgres.
After running constantly for 15 days or so, CPU and memory consumption are very high.
It usually transfers at least 5 million records every two days.
Can anyone help me optimize the pandas writes?
If you have a preprocessing step, I suggest using Dask. Dask offers parallel computation and does not fill memory unless you explicitly force it to (forcing means computing any task on the dataframe). Refer to the documentation here for the Dask read_sql_table method.
import dask.dataframe as dd

# Read the data as a Dask dataframe. This line is subject to change as your
# source changes (e.g. dd.read_sql_table for a database table); consider it pseudo-code.
df = dd.read_csv('path/to/file')

# ... do the preprocessing step on the data here ...

# finally, write it out (Dask only computes when forced, e.g. by the write)
This solution comes in very handy if you have to deal with a large dataset and a preprocessing step, possibly one that reduces the data. Refer to the documentation here for more information. It may bring a significant improvement depending on your preprocessing step.
Alternatively, you can use the chunksize parameter of pandas, as #TrigonaMinima suggested. This lets your machine retrieve the data in chunks, "x rows at a time", so you may want to process each chunk with the preprocessing as above; this may require you to create a temp file and append to it.
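A rough sketch of that chunked approach (connection strings, table names, and the chunk size are placeholders/assumptions):

import pandas as pd
from sqlalchemy import create_engine

src = create_engine("oracle+cx_oracle://user:pass@host:1521/?service_name=svc")
dst = create_engine("postgresql+psycopg2://user:pass@host:5432/db")

# stream the source table in chunks instead of loading it all into memory
for chunk in pd.read_sql_table("source_table", src, chunksize=50_000):
    # optional preprocessing on the chunk here
    chunk.to_sql("target_table", dst, if_exists="append", index=False)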