Cache larger-than-memory dataframe to local disk with Dask - amazon-s3

I have a bunch of files in S3 which comprise a larger-than-memory dataframe.
Currently, I use Dask to read the files into a dataframe, perform an inner-join with a smaller dataset (which will change on each call to this function, whereas huge_df is basically the full dataset & does not change), call compute to get a much smaller pandas dataframe, and then do some processing. E.g:
huge_df = ddf.read_csv("s3://folder/**/*.part")
merged_df = huge_df.join(small_df, how='inner', ...)
merged_df = merged_df.compute()
...other processing...
Most of the time is spent downloading the files from S3. My question is: is there a way to use Dask to cache the files from S3 on disk, so that on subsequent calls to this code I could just read the dataframe files from local disk rather than from S3? I figure I can't just call huge_df.to_csv("./local-dir/"), since that would bring huge_df into memory, which won't work.
I'm sure there is a way to do this using a combination of other tools plus standard Python IO utilities, but I wanted to see if there was a way to use Dask to download the file contents from S3 and store them on the local disk without bringing everything into memory.

Doing huge_df.to_csv would have worked, because it would write each partition to a separate file locally, and so the whole thing would not have been in memory at once.
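For example, a minimal sketch (the local output path is hypothetical); because Dask writes one CSV file per partition, only one partition needs to be in memory at a time:
import dask.dataframe as dd
huge_df = dd.read_csv("s3://folder/**/*.part")
# one file per partition; the "*" in the name is replaced by the partition number
huge_df.to_csv("./local-dir/part-*.csv")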
However, to answer the specific question, dask uses fsspec to manage file operations, and it allows for local caching, e.g., you could do
huge_df = ddf.read_csv("simplecache::s3://folder/**/*.part")
By default, this will store the files in a temporary folder, which gets cleaned up when you exit the Python session. You can provide options via the optional argument storage_options={"simplecache": {..}} to specify the cache location, or use "filecache" instead of "simplecache" if you want the local copies to expire after some time or to check the target for updated versions.
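For instance, a rough sketch with a fixed cache directory (the path here is a hypothetical choice):
import dask.dataframe as dd
# files are copied to /tmp/s3-cache on first read; subsequent reads come from local disk
huge_df = dd.read_csv(
    "simplecache::s3://folder/**/*.part",
    storage_options={"simplecache": {"cache_storage": "/tmp/s3-cache"}},
)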
Note that, obviously, these will only work with a distributed cluster if all the workers have access to the same cache location, since the loading of a partition might happen on any of your workers.

Related

Saving single parquet file on spark at only one partition while using multi partitions

I'm trying to save my dataframe as a single parquet file instead of multiple part files (I'm not sure of the exact term for this, pardon my English).
Here is how I made my df:
val df = spark.read.format("jdbc").option("url","jdbc:postgresql://ipaddr:port/address").option("header","true").load()
I have 2 IP addresses where I run different master / worker servers.
For example:
ip 10.10.10.1 runs 1 master server
ip 10.10.10.2 runs 2 worker servers (and these workers work for ip1)
I'm trying to save my df as a parquet file only on the master server (ip1).
Normally I would save the file with (I use Spark on the master server using Scala):
df.write.parquet("originals.parquet")
Then the df gets saved as parquet on both servers (obviously, since it's Spark).
But now I'm trying to save the df as one single file, while keeping the parallel processing for better speed, yet saving the file only on one of the servers.
So I've tried using
df.coalesce(1).write.parquet("test.parquet")
and
df.coalesce(1).write.format("parquet").mode("append").save("testestest.parquet")
But the result of both is the same as the original write.parquet: the df gets saved on both servers.
I guess it's because of my lack of understanding of Spark and of the coalesce function on a df.
I was told that using coalesce or repartition would help me save the file on only one server, but I want to know why it is still being saved on both servers. Is this how it's supposed to be, and did I misunderstand the use of coalesce while studying it? Or is it the way I wrote the Scala query that keeps coalesce from working effectively?
I've also found that pandas can save a df into one file, but my df is very large and I also want a fast result, so I don't think I'm supposed to use pandas for data as big as mine (correct me if I'm wrong, please).
Also, I don't quite understand the explanations of how 'repartition' and 'coalesce' differ, namely that 'coalesce' minimizes the movement of data. Can somebody explain it to me in an easier way, please?
To summarize: Why is my use of coalesce / repartition to save a parquet file into only one partition not working? (Is saving the file to one partition not possible at all? I just realized that maybe coalesce/repartition saves 1 parquet file in EACH partition, AND NOT in ONE partition as I want.)

Dask not recovering partitions from simple (non-Hive) Parquet files

I have a two-part question about Dask+Parquet. I am trying to run queries on a dask dataframe created from a partitioned Parquet file as so:
import pandas as pd
import numpy as np
import dask.dataframe as dd
import fastparquet
##### Generate random data to simulate a process creating a Parquet file #####
test_df = pd.DataFrame(data=np.random.randn(10000, 2), columns=['data1', 'data2'])
test_df['time'] = pd.bdate_range('1/1/2000', periods=test_df.shape[0], freq='1S')
# some grouping column
test_df['name'] = np.random.choice(['jim', 'bob', 'jamie'], test_df.shape[0])
##### Write to partitioned parquet file, hive and simple #####
fastparquet.write('test_simple.parquet', data=test_df, partition_on=['name'], file_scheme='simple')
fastparquet.write('test_hive.parquet', data=test_df, partition_on=['name'], file_scheme='hive')
# now check partition sizes. Only Hive version works.
assert test_df.name.nunique() == dd.read_parquet('test_hive.parquet').npartitions # works.
assert test_df.name.nunique() == dd.read_parquet('test_simple.parquet').npartitions # !!!!FAILS!!!
My goal here is to be able to quickly filter and process individual partitions in parallel using dask, something like this:
df = dd.read_parquet('test_hive.parquet')
df.map_partitions(<something>) # operate on each partition
I'm fine with using the Hive-style Parquet directory, but I've noticed it takes significantly longer to operate on compared to directly reading from a single parquet file.
Can someone tell me the idiomatic way to achieve this? Still fairly new to Dask/Parquet so apologies if this is a confused approach.
Maybe it wasn't clear from the docstring, but partitioning by value simply doesn't happen for the "simple" file type, which is why it only has one partition.
As for speed, reading the data in one single function call is fastest when the data are so small - especially if you intend to do any operation such as nunique which will require a combination of values from different partitions.
In Dask, every task incurs an overhead, so unless the amount of work being done by the call is large compared to that overhead, you can lose out. In addition, disk access is not generally parallelisable, and some parts of the computation may not be able to run in parallel in threads if they hold the GIL. Finally, the partitioned version contains more parquet metadata to be parsed.
>>> # number of tasks in the graph for each version
>>> len(dd.read_parquet('test_hive.parquet').name.nunique().dask)
12
>>> len(dd.read_parquet('test_simple.parquet').name.nunique().dask)
6
TL;DR: make sure your partitions are big enough to keep dask busy.
(note: the set of unique values is already apparent from the parquet metadata, it shouldn't be necessary to load the data at all; but Dask doesn't know how to do this optimisation since, after all, some of the partitions may contain zero rows)
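If the goal is to pull out and process individual value-partitions quickly, one option is to push the filter into the read itself via the filters= argument of dd.read_parquet. A sketch based on the test_hive.parquet file created above (len is just a stand-in for real per-partition work):
import dask.dataframe as dd
# only the name=jim directory is read, thanks to the hive-style partitioning
df_jim = dd.read_parquet('test_hive.parquet', filters=[('name', '==', 'jim')])
partition_sizes = df_jim.map_partitions(len).compute()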

How to find less frequently accessed files in HDFS

Beside using Cloudera Navigator, how can I find the less frequently accessed files, in HDFS.
I assume that you are looking for the time a file was last accessed (opened, read, etc.), because the further in the past that time lies, the less frequently the file is being used.
While you can do this in Linux quite simply via ls -l (plus some more options), in HDFS more work is necessary.
Maybe you could monitor /hdfs-audit.log for cmd=open entries for the file in question. Or you could implement a small function to read out FileStatus.getAccessTime(), as mentioned under "Is there anyway to get last access time of HDFS files?" or "How to get last access time of any files in HDFS?" in the Cloudera Community.
In other words, it will be necessary to create a small program which scans all the files and reads out the access-time property
...
status = fs.getFileStatus(new Path(line));
...
long lastAccessTimeLong = status.getAccessTime();
Date lastAccessTimeDate = new Date(lastAccessTimeLong);
...
and orders them by that value. With that, you will be able to find files which have not been accessed for a long time.
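As a rough sketch of such a scan in Python (assuming WebHDFS is enabled and the third-party hdfs package is installed; the namenode URL and root path are hypothetical):
from hdfs import InsecureClient

client = InsecureClient("http://namenode:9870")  # hypothetical WebHDFS endpoint

files = []
for root, _dirs, names in client.walk("/data"):      # hypothetical root directory
    for name in names:
        path = root.rstrip("/") + "/" + name
        status = client.status(path)                 # FileStatus as a dict
        files.append((status["accessTime"], path))   # access time in ms since epoch

# least recently accessed files first
for access_ms, path in sorted(files)[:20]:
    print(access_ms, path)
Keep in mind that HDFS records access times only at a configurable precision (dfs.namenode.accesstime.precision, one hour by default), so the values are approximate.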

Spark - How to read multiple files as DataFrames in parallel?

I have a defined list of S3 file paths and I want to read them as DataFrames:
ss = SparkSession(sc)
JSON_FILES = ['a.json.gz', 'b.json.gz', 'c.json.gz']
dataframes = {t: ss.read.json('s3a://bucket/' + t) for t in JSON_FILES}
The code above works, but in an unexpected way. When the code is submitted to a Spark cluster, only a single file is read at a time, keeping only a single node occupied.
Is there a more efficient way to read multiple files? A way to make all nodes work at the same time?
More details:
PySpark - Spark 2.2.0
Files stored on S3
Each file contains one JSON object per line
The files are compressed, as can be seen from their extensions
To read multiple inputs in Spark, use wildcards. That's going to be true whether you're constructing a DataFrame or an RDD.
ss = SparkSession(sc)
dataframes = ss.read.json("s3a://bucket/*.json.gz")
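Alternatively, DataFrameReader.json also accepts a list of paths, so all the files can be read into a single DataFrame in one job (a sketch reusing the hypothetical bucket layout from the question):
ss = SparkSession(sc)
paths = ['s3a://bucket/' + t for t in JSON_FILES]
df = ss.read.json(paths)  # one DataFrame covering all the listed files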
The problem was that I didn't understand Spark's runtime architecture. Spark has the notion of "workers" which, if I now understand it correctly (don't trust me), are capable of doing work in parallel. When we submit a Spark job, we can set both the number of workers and the level of parallelism they can leverage.
If you are using the Spark command spark-submit, these variables are represented as the following options:
--num-executors: similar to the notion of number of workers
--executor-cores: how many CPU cores a single worker should use
This is a document that helped me understand these concepts and how to tune them.
Coming back to my problem, in that situation I would have one worker per file.

cache table in pyspark using sql [duplicate]

When a resilient distributed dataset (RDD) is created from a text file or collection (or from another RDD), do we need to call "cache" or "persist" explicitly to store the RDD data into memory? Or is the RDD data stored in a distributed way in the memory by default?
val textFile = sc.textFile("/user/emp.txt")
As per my understanding, after the above step, textFile is an RDD and is available in the memory of all/some of the nodes.
If so, why do we need to call "cache" or "persist" on textFile RDD then?
Most RDD operations are lazy. Think of an RDD as a description of a series of operations. An RDD is not data. So this line:
val textFile = sc.textFile("/user/emp.txt")
It does nothing. It creates an RDD that says "we will need to load this file". The file is not loaded at this point.
RDD operations that require observing the contents of the data cannot be lazy. (These are called actions.) An example is RDD.count — to tell you the number of lines in the file, the file needs to be read. So if you write textFile.count, at this point the file will be read, the lines will be counted, and the count will be returned.
What if you call textFile.count again? The same thing: the file will be read and counted again. Nothing is stored. An RDD is not data.
So what does RDD.cache do? If you add textFile.cache to the above code:
val textFile = sc.textFile("/user/emp.txt")
textFile.cache
It does nothing. RDD.cache is also a lazy operation. The file is still not read. But now the RDD says "read this file and then cache the contents". If you then run textFile.count the first time, the file will be loaded, cached, and counted. If you call textFile.count a second time, the operation will use the cache. It will just take the data from the cache and count the lines.
The cache behavior depends on the available memory. If the file does not fit in the memory, for example, then textFile.count will fall back to the usual behavior and re-read the file.
I think the question would be better formulated as:
When do we need to call cache or persist on a RDD?
Spark processes are lazy, that is, nothing will happen until it's required.
To quickly answer the question: after val textFile = sc.textFile("/user/emp.txt") is issued, nothing happens to the data; only a HadoopRDD is constructed, using the file as its source.
Let's say we transform that data a bit:
val wordsRDD = textFile.flatMap(line => line.split("\\W"))
Again, nothing happens to the data. Now there's a new RDD wordsRDD that contains a reference to textFile and a function to be applied when needed.
Only when an action is called on an RDD, like wordsRDD.count, is the RDD chain, called the lineage, executed. That is, the data, broken down into partitions, will be loaded by the Spark cluster's executors, the flatMap function will be applied, and the result will be calculated.
On a linear lineage, like the one in this example, cache() is not needed. The data will be loaded to the executors, all the transformations will be applied and finally the count will be computed, all in memory - if the data fits in memory.
cache is useful when the lineage of the RDD branches out. Let's say you want to filter the words of the previous example into counts for positive and negative words. You could do it like this:
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
Here, each branch issues a reload of the data. Adding an explicit cache statement will ensure that processing done previously is preserved and reused. The job will look like this:
val textFile = sc.textFile("/user/emp.txt")
val wordsRDD = textFile.flatMap(line => line.split("\\W"))
wordsRDD.cache()
val positiveWordsCount = wordsRDD.filter(word => isPositive(word)).count()
val negativeWordsCount = wordsRDD.filter(word => isNegative(word)).count()
For that reason, cache is said to 'break the lineage' as it creates a checkpoint that can be reused for further processing.
Rule of thumb: Use cache when the lineage of your RDD branches out or when an RDD is used multiple times like in a loop.
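Since the question mentions PySpark, here is roughly the same branching pattern in Python (a sketch; the positive/negative word sets are made-up stand-ins for the isPositive/isNegative predicates above):
import re
from pyspark import SparkContext

sc = SparkContext(appName="cache-branching-demo")

# hypothetical predicates standing in for isPositive / isNegative
POSITIVE = {"good", "great", "happy"}
NEGATIVE = {"bad", "awful", "sad"}

text_file = sc.textFile("/user/emp.txt")
words_rdd = text_file.flatMap(lambda line: re.split(r"\W+", line))
words_rdd.cache()  # without this, each count below would re-read and re-split the file

positive_words_count = words_rdd.filter(lambda w: w in POSITIVE).count()
negative_words_count = words_rdd.filter(lambda w: w in NEGATIVE).count()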
Do we need to call "cache" or "persist" explicitly to store the RDD data into memory?
Yes, only if needed.
Is the RDD data stored in a distributed way in memory by default?
No!
And these are the reasons why:
Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums.
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently – for example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
For more details please check the Spark programming guide.
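As a small illustration of the persist/cache API described above (a sketch; the RDD contents are made up), cache() uses the default MEMORY_ONLY storage level, while persist() lets you pick other levels that spill to disk or replicate:
from pyspark import SparkContext, StorageLevel

sc = SparkContext(appName="persist-demo")
squares = sc.parallelize(range(1000000)).map(lambda x: x * x)

squares.persist(StorageLevel.MEMORY_AND_DISK)  # keep in memory, spill partitions to disk if needed
print(squares.count())  # first action: computes and caches the RDD
print(squares.sum())    # reuses the cached data instead of recomputing the map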
Below are the three situations in which you should cache your RDDs:
using an RDD many times
performing multiple actions on the same RDD
for long chains of (or very expensive) transformations
Adding another reason to add (or temporarily add) a cache method call:
for debugging memory issues
With the cache method, Spark will give debugging information regarding the size of the RDD. In the Spark integrated UI, you will get RDD memory-consumption info, and this has proved very helpful for diagnosing memory issues.