I have a question regarding the data pipeline in multi-host training.
In particular, I have multiple workers equipped with GPU and each worker can access central data strorage.
I would like to use tf.data.Dataset.shard to load part of the batch independently on each worker and join the shard in a single ShardedDeviceArray that can be handled by pmap.
It looks like jax.device_put_sharded does the job but it requires a list of shards on the host, and I want to load them independently on the workers.
I imagine my loop to be like
for it in ds.shard(n,jax.process_index()):
sharded_batch = unknow_function(it,...)
pmaped_train_step(sharded_batch)
What is the most efficient way to do it?
Related
I have a Spark script that reads data from amazon S3 and then writes in another bucket usion parquet format.
This is what the code looks like:
File = "LocationInFirstBucket.csv.gz"
df_ods = spark.read.csv(File, header=True, sep=";")
df_ods.repartition(25).write.format("parquet").mode("OverWrite").save("AnotherLocationInS3")
My question is: how does the repartition argument (here 25) affects the execution time? Should I increase it so the script runs faster?
Second question: Would it be better if I cache my df before the last line?
Thank you
In typical setups neither repartition nor cache will help you in this specific case. Since you read data from non-splittable format:
File = "LocationInFirstBucket.csv.gz"
df_ods = spark.read.csv(File, header=True, sep=";")
df_ods will have only one partition.
In such case repartitioning would make sense, if you performed any actual processing on this data.
However if you just write to distributed file system repartitioning will simply double the cost - you have to send data to other nodes first (that involves serialization, deserialization, network transfer, write to disk) and then still write to distributed file system.
There are of course edge cases when this makes sense. If network connecting your cluster is much faster than network connection your cluster to S3 nodes, effective latency might be a bit lower.
As of caching ‒ there is no value in caching here at all. Caching Dataset is expensive, and makes sense only if persisted data is reused.
Answer 1 :- Repartition of 25 or more or less it depends on how much data you have and no. of executors you provided. If your Spark code run in the cluster with more than one executor and it is not repartitioned then repartitioning will speedy to writing parallel your data.
Answer 2 :- There is no need to cache df before the last line because you are using only single action in your code. If you will perform multiple actions on your DF and don't want it will recalculate as the number of actions then you will Cache it.
The thing here is that Spark can parallelize writing to a certain point since one file can't be written by multiple executors at the same time.
Repartition helps you in this parallelization because it will write 25 different files (one for each partition). If you increase the number of partitions you will increase the number of written files hence speeding up the execution. This comes with a price because of the reading time will increase with the number of files to be read.
The limit is the number of executors you are running your job with, e.g. if you are running with 25 executors then setting repartition to 26 will not help you because to write the 26th partition one of the previous 25 would have to be finished.
For the other question, I don't think .cache() will help you because Spark is lazy, maybe this article can help you further.
Which class in Gem5 has access to all the Stats from different objects?
Do the statistics of each object being returned to specific class continuously or these stats are collected just at the end of the simulation?
For example servicedByWrQ is a Scalar stat defined in the dram_ctrl.hh. On the other hand condPredicted is another Scalar stat which is defined in the bpred_unit.hh. How can I monitor these two statistics at the same time during the simulation not through the output file in Gem5?
My ultimate goal is to change the behavior other hardware units during the simulations such as branch prediction or cache replacement policy, etc. based on those statistics.
I have a dataset stored in a tab-separated text file. The file looks as follows:
date time temperature
2010-01-01 12:00:00 10.0000
...
where the temperature column contains values in degrees Celsius (°C).
I compute the daily average temperature using Dask. Here is my code:
from dask.distributed import Client
import dask.dataframe as dd
client = Client("<scheduler URL")
inputDataFrame = dd.read_table("<input file>").drop('time', axis=1)
groupedData = inputDataFrame.groupby('date')
meanDataframe = groupedData.mean()
result = meanDataframe.compute()
result.to_csv('result.out', sep='\t')
client.close()
In order to improve the performance of my program, I would like to understand the data flow caused by Dask data frames.
How is the text file read into a data frame by read_table()? Does the client read the whole text file and send the data to the scheduler, which partitions the data and sends it to the workers? Or does each worker read the data partitions it works on directly from the text file?
When an intermediate data frame is created (e.g. by calling drop()) is the whole intermediate data frame sent back to the client and then sent to the workers for further processing?
The same question for groups: where is the data for a group object create and stored? How does it flow between client, scheduler and workers?
The reason for my question is that if I run a similar program using Pandas, the computation is roughly two times faster, and I am trying to understand what causes the overhead in Dask. Since the size of the result data frame is very small compared to the size of the input data, I suppose there is quite some overhead caused by moving the input and intermediate data between client, scheduler and workers.
1) The data are read by the workers. The client does read a little ahead of time, to figure out the column names and types and, optionally, to find line-delimiters for splitting files. Note that all workers must be able to reach the file(s) of interest, which can require some shared file-system when working on a cluster.
2), 3) In fact, the drop, groupby and mean methods do not generate intermediate data-frames at all, they just accumulate a graph of operations to be executed (i.e., they are lazy). You could time these steps and see they are fast. During execution, intermediates are made on workers, copies to other workers as required, and discarded as soon as possible. There are never copies to the scheduler or client, unless you explicitly request so.
So, to the root of your question: you can investigate the performance or your operation best by looking at the dashboard.
There are many factors that govern how quickly things will progress: the processes may be sharing an IO channel; some tasks do not release the GIL, and so parallelise poorly in threads; the number of groups will greatly affect the amount of shuffling of data into groups... plus there is always some overhead for every task executed by the scheduler.
Since Pandas is efficient, it is not surprising that for the case where data fits easily into memory, it performs well compared to Dask.
I have a defined list of S3 file paths and I want to read them as DataFrames:
ss = SparkSession(sc)
JSON_FILES = ['a.json.gz', 'b.json.gz', 'c.json.gz']
dataframes = {t: ss.read.json('s3a://bucket/' + t) for t in JSON_FILES}
The code above works, but in an unexpected way. When the code is submitted to a Spark cluster, only a single file is read at time, keeping only a single node occupied.
Is there a more efficient way to read multiple files? A way to make all nodes work at the same time?
More details:
PySpark - Spark 2.2.0
Files stored on S3
Each file contains one JSON object per line
The files are compressed, as it can be seen by their extensions
To read multiple inputs in Spark, use wildcards. That's going to be true whether you're constructing a dataframe or an rdd.
ss = SparkSession(sc)
dataframes = ss.read.json("s3a://bucket/*.json.gz")
The problem was: I didn't understand Spark's runtime architecture. Spark has the notion of "workers", which, if I now understand it better (don't trust me), are capable of doing stuff in parallel. When we submit a Spark job, we can set both things, the number of workers and the level of parallelism they can leverage.
If you are using the Spark command spark-submit, these variables are represented as the following options:
--num-executors: similar to the notion of number of workers
--executor-cores: how many CPU cores a single worker should use
This is a document that helped me understand these concepts and how to tune them.
Coming back to my problem, in that situation I would have one worker per file.
I am considering implementing a Lambda Architecture in order to process events transmitted by multiple devices.
In most cases (averages etc.) its seems to fit my requirements. However, I am stuck trying to model a specific use case. In short...
Each device has a device_id. Every device emits 1 event per second. Each event has an event_id ranging from {0-->10}.
An event_id of 0 indicates START & an event_id of 10 indicates END
All the events between START & END should be grouped into one single group (event_group).
This will produce tuples of event_groups i.e. {0,2,2,2,5,10}, (0,4,2,7,...5,10), (0,10)
This (event_group) might be small i.e. 10 minutes or very large say 3hours.
According to Lambda Architecture these events transmitted by every device are my "Master Data Set".
Currently, the events are sent to HDFS & Storm using Kafka (Camus, Kafka Spout).
In the Streaming process I group by device_id, and use Redis to maintain a set of incoming events in memory, based on a key which is generated each time an event_id=0 arrives.
The problem lies in HDFS. Say I save a file with all incoming events every hour. Is there a way to distinguish these (group_events)?
Using Hive I can group tuples in the same manner. However, each file will also contain "broken" event_groups
(0,2,2,3) previous computation (file)
(4,3,) previous computation (file)
(5,6,7,8,10) current computation (file)
so that I need to merge them based on device_id into (0,2,2,3,4,3,5,6,7,8,10) (multiple files)
Is a Lambda Architecture a fit for this scenario? Or should the streaming process be the only source of truth? I.e. write to hbase, hdfs itself won't this affect the overall latency.
As far as I understand your process, I don't see any issue, as the principle of Lambda Architecure is to re-process regularly all your data on a batch mode.
(by the way, not all your data, but a time frame, usually larger than the speed layer window)
If you choose a large enough time window for your batch mode (let's say your aggregation window + 3 hours, in order to include even the longest event groups), your map reduce program will be able to compute all your event groups for the desired aggregation window, whatever file the distincts events are stored (Hadoop shuffle magic !)
The underlying files are not part of the problem, but the time windows used to select data to process are.