Using dask to read data from Hive - pandas

I am using the as_pandas utility from impala.util to read data fetched from Hive into a dataframe. However, I think pandas will not be able to handle a large amount of data, and it will also be slower. I have been reading about dask, which provides excellent functionality for reading large data files. How can I use it to efficiently fetch data from Hive?
def as_dask(cursor):
    """Return a DataFrame out of an impyla cursor.

    This will pull the entire result set into memory. For richer pandas-
    like functionality on distributed data sets, see the Ibis project.

    Parameters
    ----------
    cursor : `HiveServer2Cursor`
        The cursor object that has a result set waiting to be fetched.

    Returns
    -------
    DataFrame
    """
    import pandas as pd
    import dask
    import dask.dataframe as dd

    names = [metadata[0] for metadata in cursor.description]
    dfs = dask.delayed(pd.DataFrame.from_records)(cursor.fetchall(),
                                                  columns=names)
    return dd.from_delayed(dfs).compute()

There is no straightforward way to do this at present. You would do well to look at the implementation of dask.dataframe.read_sql_table and similar code in intake-sql - you will probably want a way to partition your data, and to have each of your workers fetch one partition via a call to delayed(). dd.from_delayed and dd.concat could then be used to stitch the pieces together.
-edit-
Your function has the delayed idea back to front. You are delaying and then immediately materialising the data within a function that operates on a single cursor - it can't be parallelised, and it will break your memory if the data is big (which is the reason you are trying this).
Let's suppose you can form a set of 10 queries, where each query gets a different part of the data; do not use OFFSET, use a condition on some column that is indexed by Hive.
You want to do something like:
queries = [SQL_STATEMENT.format(i) for i in range(10)]

def query_to_df(query):
    # each task opens its own connection; impala.dbapi is impyla's DB-API module
    from impala.dbapi import connect
    cursor = connect(host=..., port=...).cursor()  # fill in your connection details
    cursor.execute(query)
    return pd.DataFrame.from_records(cursor.fetchall())
Now you have a function that returns a partition and has no dependence on global objects - it only takes as input a string.
parts = [dask.delayed(query_to_df)(q) for q in queries]
df = dd.from_delayed(parts)
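A hedged end-to-end sketch of the pattern above - the column names and dtypes in meta are placeholders and must match what query_to_df actually returns. Supplying an explicit meta tells Dask the schema up front, so it does not have to execute one of the queries eagerly just to infer it:

import pandas as pd
import dask
import dask.dataframe as dd

# hypothetical schema - replace with the real column names and dtypes
meta = pd.DataFrame({'id': pd.Series(dtype='int64'),
                     'value': pd.Series(dtype='float64')})

parts = [dask.delayed(query_to_df)(q) for q in queries]
df = dd.from_delayed(parts, meta=meta)
# df is now a lazy, partitioned dataframe; call .compute() only on
# aggregated results small enough to fit in memory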


How to generate, then reduce, a massive set of DataFrames from each row of one DataFrame in PySpark?

I cannot share my actual code or data, unfortunately, as it is proprietary, but I can produce an MWE if the problem isn't clear to readers from the text.
I am working with a dataframe containing ~50 million rows, each of which contains a large XML document. From each XML document, I extract a list of statistics relating to the number of occurrences and hierarchical relationships between tags (nothing like undocumented XML formats to brighten one's day). I can express these statistics in dataframes, and I can combine these dataframes over multiple documents using standard operations like GROUP BY/SUM and DISTINCT. The goal is to extract the statistics for all 50 million documents and express them in a single dataframe.
The problem is that I don't know how to efficiently generate 50 million dataframes from each row of one dataframe in Spark, or how to tell Spark to reduce a list of 50 million dataframes to one dataframe using binary operators. Are there standard functions that do these things?
So far, the only workaround I have found is massively inefficient (storing the data as a string, parsing it, doing the computations, and then converting it back into a string). It would take weeks to finish using this method, so it isn't practical.
The extractions and statistical data from each XML response can be stored in additional columns of the row itself. That way Spark can distribute the processing across its executors, improving performance.
Here is some pseudocode.
from pyspark.sql import Row
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               StringType, DateType, FloatType, ArrayType)

def extract_metrics_from_xml(row):
    j = row['xmlResponse']  # assuming your xml column name is xmlResponse
    # perform your xml extractions and computations for the xmlResponse in python
    ...
    load_date = ...
    stats_data1 = ...
    return Row(load_date, stats_data1, stats_data2, stats_group)

schema = StructType([StructField('load_date', DateType()),
                     StructField('stats_data1', FloatType()),
                     StructField('stats_data2', ArrayType(IntegerType())),
                     StructField('stats_group', StringType())
                     ])

df_with_xml_stats = original_df.rdd\
    .map(extract_metrics_from_xml)\
    .toDF(schema=schema, sampleRatio=1)\
    .cache()
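As a hedged follow-up (the aggregation below is only illustrative, reusing the column names from the pseudocode above): once each row carries its per-document statistics, the combining step the question describes can be expressed as ordinary DataFrame aggregations, rather than by materialising 50 million separate dataframes:

from pyspark.sql import functions as F

# combine the per-document stats across all rows in a single shuffle
combined = (df_with_xml_stats
            .groupBy('stats_group')
            .agg(F.sum('stats_data1').alias('stats_data1_total'),
                 F.countDistinct('load_date').alias('distinct_load_dates')))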

Output Dataframe to CSV File using Repartition and Coalesce

Currently, I am working on a single-node Hadoop setup, and I wrote a job to output a sorted dataframe with only one partition to one single CSV file. I discovered several different outcomes when using repartition.
At first, I used orderBy to sort the data and then used repartition to output a CSV file, but the output was sorted in chunks instead of in an overall manner.
Then, I tried discarding the repartition function, but the output contained only a part of the records. I realized that without repartition, Spark outputs 200 CSV files instead of 1, even though I am working on a one-partition dataframe.
Thus, what I did next was to place repartition(1), repartition(1, "column of partition"), and repartition(20) before orderBy. Yet the output remained the same, with 200 CSV files.
So I used coalesce(1) before orderBy, and the problem was fixed.
I do not understand why working on a single-partition dataframe requires repartition and coalesce, and how the aforesaid processes affect the output. I would be grateful if someone could elaborate a little.
Spark has two relevant parameters here: spark.sql.shuffle.partitions and spark.default.parallelism.
When you perform an operation like sort, as in your case, it triggers what is called a shuffle operation:
https://spark.apache.org/docs/latest/rdd-programming-guide.html#shuffle-operations
The shuffle will split your dataframe into spark.sql.shuffle.partitions partitions.
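A minimal sketch of how that plays out (assuming spark is an existing SparkSession and some_column is a placeholder): lowering spark.sql.shuffle.partitions changes how many partitions, and therefore how many output files, a shuffle such as orderBy produces.

# the default of 200 shuffle partitions is why you saw 200 CSV files
spark.conf.set("spark.sql.shuffle.partitions", "1")

df.orderBy("some_column")\
  .write.csv("output_path", header=True)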
I also struggled with the same problem and did not find any elegant solution.
Spark generally doesn't have a great concept of ordered data, because all your data is split across multiple partitions, and every time you call an operation that requires a shuffle, your ordering will be changed.
For this reason, you're better off only sorting your data in Spark for the operations that really need it.
Forcing your data into a single file will break when the dataset gets larger.
As Miroslav points out, your data gets shuffled between partitions every time you trigger what's called a shuffle stage (this includes things like grouping, join, or window operations).
You can set the number of shuffle partitions in the Spark config - the default is 200.
Calling repartition before a group-by operation is rather pointless, because Spark needs to repartition your data again to execute the groupBy.
Coalesce operations sometimes get pushed into the shuffle stage by Spark, so maybe that's why it worked - either that, or because you called it after the groupBy operation.
A good way to understand what's going on with your query is to use the Spark UI - it's normally available at http://localhost:4040.
More info here: https://spark.apache.org/docs/3.0.0-preview/web-ui.html
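A hedged sketch of the single-file pattern discussed above (the column and path names are placeholders): sorting first and coalescing to one partition just before the write keeps the global ordering produced by orderBy in the single output file.

df.orderBy("some_column")\
  .coalesce(1)\
  .write.mode("overwrite")\
  .csv("sorted_output", header=True)

Bear in mind this funnels everything through a single task, which is exactly the "will break when the dataset gets larger" caveat above.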

Filter TabularDataset in Azure ML

My dataset is huge. I am using Azure ML notebooks and using azureml.core to read the dataset and convert it to azureml.data.tabular_dataset.TabularDataset. Is there any way I can filter the data in the TabularDataset without converting it to a pandas dataframe?
I am using the code below to read the data. As the data is huge, the pandas dataframe runs out of memory. I don't have to load the complete data into the program; only a subset is required. Is there any way I could filter the records before converting to a pandas dataframe?
from azureml.core import Workspace

def read_Dataset(dataset):
    ws = Workspace.from_config()
    ds = ws.datasets
    tab_dataset = ds.get(dataset)
    dataframe = tab_dataset.to_pandas_dataframe()
    return dataframe
At this point in time, we only support simple sampling, filtering by column name, and filtering by datetime (reference here). Full filtering capability (e.g. by column value) on TabularDataset is an upcoming feature in the next couple of months. We will update our public documentation once the feature is ready.
You can subset your data in two ways (a short sketch follows below):
row-wise - use the TabularDataset class filter method
column-wise - use the TabularDataset class keep_columns method or drop_columns method
Hope this helps tackle the out-of-memory error.
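A minimal sketch of those two methods (the dataset and column names are hypothetical, and the filter method's availability depends on your SDK version):

from azureml.core import Workspace, Dataset

ws = Workspace.from_config()
tab_dataset = Dataset.get_by_name(ws, name='my-dataset')  # hypothetical name

# column-wise: keep only the columns you actually need
subset = tab_dataset.keep_columns(['id', 'timestamp', 'value'])

# row-wise: filter on a column value before materialising anything
subset = subset.filter(subset['value'] > 0)

# only the subset is pulled into memory
dataframe = subset.to_pandas_dataframe()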

Is there an idiomatic way to cache Spark dataframes?

I have a large parquet dataset that I am reading with Spark. Once read, I filter for a subset of rows which are used in a number of functions that apply different transformations:
The following is similar, though not identical, to the logic I'm trying to accomplish:
from pyspark.sql.functions import col, lit

df = spark.read.parquet(file)
special_rows = df.filter(col('special') > 0)

# Thinking about adding the following line
special_rows.cache()

def f1(df):
    new_df_1 = df.withColumn('foo', lit(0))
    return new_df_1

def f2(df):
    new_df_2 = df.withColumn('foo', lit(1))
    return new_df_2

new_df_1 = f1(special_rows)
new_df_2 = f2(special_rows)

output_df = new_df_1.union(new_df_2)
output_df.write.parquet(location)
Because a number of functions might use this filtered subset of rows, I'd like to cache or persist it in order to potentially improve execution time / memory consumption. I understand that in the above example, no action is called until my final write to parquet.
My question is: do I need to insert some sort of call to count(), for example, in order to trigger the caching, or will Spark, during that final write to parquet, be able to see that this dataframe is being used in f1 and f2 and cache it itself?
If yes, is this an idiomatic approach? Does this mean that production, large-scale Spark jobs that rely on caching frequently use random operations that force an action on the dataframe pre-emptively, such as a call to count?
Your statements
"there is no action called until my final write to parquet."
and
"Spark during that final write to parquet call will be able to see that this dataframe is being used in f1 and f2 and will cache the dataframe itself."
are correct. If you do output_df.explain(), you will see the query plan, which will show that what you said is correct.
Thus, there is no need to do special_rows.cache(). Generally, cache is only necessary if you intend to reuse the dataframe after forcing Spark to calculate something, e.g. after write or show. If you see yourself intentionally calling count(), you're probably doing something wrong.
You might want to repartition after running special_rows = df.filter(col('special') > 0). There can be a large number of empty partitions after running a filtering operation, as explained here.
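A one-line illustration of that suggestion (the partition count is arbitrary and would need tuning for your data):

special_rows = df.filter(col('special') > 0).repartition(100)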
Caching special_rows means the filtered data is computed once and then reused by both new_df_1 and new_df_2 in new_df_1.union(new_df_2). That's not necessarily a performance optimization, though. Caching is expensive. I've seen caching slow down a lot of computations, even when it's being used in a textbook manner (i.e. caching a DataFrame that gets reused several times downstream).
Counting does not necessarily make sure the data is cached. Counts avoid scanning rows whenever possible. They'll use the Parquet metadata when they can, which means they don't cache all the data like you might expect.
You can also "cache" data by writing it to disk. Something like this:
df.filter(col('special') > 0).repartition(500).write.parquet("some_path")
special_rows = spark.read.parquet("some_path")
To summarize, yes, the DataFrame will be cached in this example, but it's not necessarily going to make your computation run any faster. It might be better to have no cache or to "cache" by writing data to disk.

Spark group by several fields several times on the same RDD

My data is stored in CSV format, and the headers are given in the column_names variable.
I wrote the following code to read it into an RDD of Python dictionaries:
rdd = sc.textFile(hdfs_csv_dir)\
    .map(lambda x: x.split(','))\
    .filter(lambda row: len(row) == len(column_names))\
    .map(lambda row: dict([(column, row[index]) for index, column in enumerate(column_names)]))
Next, I wrote a function that counts the combinations of column values given the column names
import operator

def count_by(rdd, cols=[]):
    '''
    Equivalent to:
        SELECT col1, col2, COUNT(*) FROM MX3 GROUP BY col1, col2;
    but the number of columns can be more than 2.
    '''
    counts = rdd.map(lambda x: (','.join([str(x[c]) for c in cols]), 1))\
                .reduceByKey(operator.add)\
                .map(lambda t: t[0].split(',') + [t[1]])\
                .collect()
    return counts
I am running count_by several times, with many different parameters, on the same rdd.
What is the best way to optimize the query and make it run faster?
First, you should cache the RDD (by calling cachedRdd = rdd.cache()) before passing it into count_by multiple times, to prevent Spark from loading it from disk for each operation. Operating on a cached RDD means the data will be loaded into memory upon first use (the first call to count_by), then read from memory for subsequent calls.
You should also consider using the Spark DataFrame API instead of the lower-level RDD API, since:
You seem to articulate your intentions using SQL, and the DataFrame API allows you to actually use such a dialect.
When using DataFrames, Spark can perform some extra optimizations, since it has a better understanding of what you are trying to do and can design the best way to achieve it. A SQL-like dialect is declarative - you only say what you want, not how to get it - which gives Spark more freedom to optimize. A sketch of the DataFrame version is shown below.
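A minimal sketch of the DataFrame-based equivalent, assuming spark is an existing SparkSession, the CSV files have no header row, and column_names matches the file layout (assumptions carried over from the question):

df = spark.read.csv(hdfs_csv_dir).toDF(*column_names)
df.cache()  # reused across the repeated aggregations

def count_by(df, cols):
    # same shape as: SELECT col1, col2, COUNT(*) FROM MX3 GROUP BY col1, col2
    return df.groupBy(*cols).count().collect()

Note that, unlike the RDD version, this does not drop malformed rows; the filter on len(row) == len(column_names) would need a separate equivalent if your data contains them.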