PySpark: use a customized function to store each row of a dataframe in a self-defined object, for example a node object

Is there a way to utilize the map function to store each row of a PySpark dataframe in a self-defined Python class object?
[image: example PySpark dataframe with columns id, features, label]
For example, in the picture above I have a Spark dataframe, and I want to store every row of id, features, and label in a node object (with three attributes: node_id, node_features, and node_label). I am wondering whether this is feasible in PySpark. I have tried something like
for row in df.rdd.collect():
    do_something(row)
but this cannot handle big data and is extremely slow. I am wondering whether there is a more efficient way to do this. Many thanks.

You can use the foreach method for your operation; Spark will parallelize it across the executors.
Refer to Pyspark applying foreach if you need more details.
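A minimal sketch of what that could look like, assuming a simple Node class and a hypothetical save_node function that consumes each object (foreach runs on the executors and returns nothing, so whatever you build from a row has to be used inside the function, e.g. written to a store):

class Node:
    def __init__(self, node_id, node_features, node_label):
        self.node_id = node_id
        self.node_features = node_features
        self.node_label = node_label

def process_row(row):
    # build a Node from the row's columns
    node = Node(row['id'], row['features'], row['label'])
    save_node(node)  # hypothetical: persist or otherwise use the node on the executor

df.foreach(process_row)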

Related

Filter TabularDataset in Azure ML

My dataset is huge. I am using Azure ML notebooks and using azureml.core to read the dataset and convert it to azureml.data.tabular_dataset.TabularDataset. Is there any way I can filter the data in the TabularDataset without converting it to a pandas dataframe?
I am using the code below to read the data. As the data is huge, the pandas dataframe runs out of memory. I don't have to load the complete data into the program; only a subset is required. Is there any way I can filter the records before converting to a pandas dataframe?
from azureml.core import Workspace

def read_Dataset(dataset):
    ws = Workspace.from_config()
    ds = ws.datasets
    tab_dataset = ds.get(dataset)
    dataframe = tab_dataset.to_pandas_dataframe()
    return dataframe
At this point in time, we only support simple sampling, filtering by column name, and filtering by datetime (reference here). Full filtering capability (e.g. by column value) on TabularDataset is an upcoming feature in the next couple of months. We will update our public documentation once the feature is ready.
You can subset your data in two ways (a short sketch follows below):
row-wise - use the TabularDataset class filter method
column-wise - use the TabularDataset class keep_columns method or drop_columns method
Hope this helps tackle the out-of-memory error.
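A hedged sketch of that idea, using keep_columns and take to shrink the dataset before materializing it as pandas (the column names and row count here are placeholders):

from azureml.core import Workspace, Dataset

def read_dataset_subset(dataset_name):
    ws = Workspace.from_config()
    tab_dataset = Dataset.get_by_name(ws, dataset_name)
    # column-wise subset: keep only the columns you actually need
    subset = tab_dataset.keep_columns(['col_a', 'col_b', 'label'])
    # row-wise subset: take the first N rows (take_sample gives a random fraction instead)
    subset = subset.take(100000)
    return subset.to_pandas_dataframe()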

When do I need to persist a dataframe

I am curious to know when I need to persist my dataframe in Spark and when not. Cases:
If I read data from a file, do I need to persist it if I apply repeated counts, like:
val df = spark.read.json("file://root/Download/file.json")
df.count
df.count
Do I need to persist df? According to my understanding, it should keep df in memory after the first count and use the same df for the second count. The file has only 4 records, but when I actually check it, Spark reads the file again and again. So why doesn't Spark store it in memory?
Second question: is spark read an action or a transformation?
DataFrames by design are immutable, so every transformation done on them creates a new dataframe altogether. A Spark pipeline generally involves multiple transformations, leading to multiple dataframes being created. If Spark stored all of these dataframes, the memory requirement would be huge. So Spark leaves the responsibility of persisting dataframes to the user. Whichever dataframe you are planning to reuse, you can persist it and later unpersist it when done.
I don't think we can define spark read as an action or a transformation. Actions and transformations are applied to a dataframe. To identify the difference, remember that a transformation returns a new dataframe while an action returns some value(s).
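For illustration, a minimal PySpark sketch of the persist/unpersist pattern for a dataframe that is counted twice (using the path from the question):

df = spark.read.json("file://root/Download/file.json").persist()
df.count()      # first action: reads the file and materializes the cached data
df.count()      # second action: served from the cache, the file is not re-read
df.unpersist()  # release the cached data once you are done with it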

How do I add a new feature column to a tf.data.Dataset object?

I am building an input pipeline for proprietary data using Tensorflow 2.0's data module and using the tf.data.Dataset object to store my features. Here is my issue - the data source is a CSV file that has only 3 columns, a label column and then two columns which just hold strings referring to JSON files where that data is stored. I have developed functions that access all the data I need, and am able to use Dataset's map function on the columns to get the data, but I don't see how I can add a new column to my tf.data.Dataset object to hold the new data. So if anyone could help with the following questions, it would really help:
How can a new feature be appended to a tf.data.Dataset object?
Should this process be done on the entire Dataset before iterating through it, or during (I think during iteration would allow utilization of the performance boost, but I don't know how this functionality works)?
I have all the methods for taking the input as the elements from the columns and performing everything required to get the features for each element, I just don't understand how to get this data into the dataset. I could do "hacky" workarounds, using a Pandas Dataframe as a "mediator" or something along those lines, but I want to keep everything within the Tensorflow Dataset and pipeline process, for both performance gains and higher quality code.
I have looked through the Tensorflow 2.0 documentation for the Dataset class (https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/Dataset), but haven't been able to find a method that can manipulate the structure of the object.
Here is the function I use to load the original dataset:
def load_dataset(self):
    # TODO: Function to get max number of available CPU threads
    dataset = tf.data.experimental.make_csv_dataset(self.dataset_path,
                                                    self.batch_size,
                                                    label_name='score',
                                                    shuffle_buffer_size=self.get_dataset_size(),
                                                    shuffle_seed=self.seed,
                                                    num_parallel_reads=1)
    return dataset
Then, I have methods which allow me to take a string input (column element) and return the actual feature data. And I am able to access the elements from the Dataset using a function like ".map". But how do I add that as a column?
Wow, this is embarrassing, but I have found the solution, and its simplicity literally makes me feel like an idiot for asking this. But I will leave the answer up just in case anyone else ever faces this issue.
You first create a new tf.data.Dataset object using any function that returns a Dataset, such as ".map".
Then you create a new Dataset by zipping the original and the one with the new data:
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
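A self-contained toy sketch of that pattern (not the asker's pipeline; the derived feature is just an illustration):

import tensorflow as tf

# original dataset with two columns
base = tf.data.Dataset.from_tensor_slices({'id': [1, 2, 3], 'label': [0, 1, 0]})
# derive a new feature column with map
extra = base.map(lambda row: {'id_squared': row['id'] * row['id']})
# zip the original dataset with the derived one; each element is now a pair of dicts
combined = tf.data.Dataset.zip((base, extra))
for original, derived in combined:
    print(original['id'].numpy(), derived['id_squared'].numpy())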

Using dask to read data from Hive

I am using the as_pandas utility from impala.util to read data fetched from Hive into a dataframe. However, with pandas I think I will not be able to handle large amounts of data, and it will also be slower. I have been reading about dask, which provides excellent functionality for reading large data files. How can I use it to efficiently fetch data from Hive?
def as_dask(cursor):
    """Return a DataFrame out of an impyla cursor.

    This will pull the entire result set into memory. For richer pandas-
    like functionality on distributed data sets, see the Ibis project.

    Parameters
    ----------
    cursor : `HiveServer2Cursor`
        The cursor object that has a result set waiting to be fetched.

    Returns
    -------
    DataFrame
    """
    import pandas as pd
    import dask
    import dask.dataframe as dd
    names = [metadata[0] for metadata in cursor.description]
    dfs = dask.delayed(pd.DataFrame.from_records)(cursor.fetchall(),
                                                  columns=names)
    return dd.from_delayed(dfs).compute()
There is no current straightforward way to do this. You would do well to look at the implementation of dask.dataframe.read_sql_table and similar code in intake-sql - you will probably want a way to partition your data, and have each of your workers fetch one partition via a call to delayed(). dd.from_delayed and dd.concat could then be used to stitch the pieces together.
-edit-
Your function has the delayed idea back to front. You are delaying and then immediately materialising the data within a function that operates on a single cursor - it can't be parallelised and will break your memory if the data is big (which is the reason you are trying this).
Let's suppose you can form a set of 10 queries, where each query gets a different part of the data; do not use OFFSET, use a condition on some column that is indexed by Hive.
You want to do something like:
queries = [SQL_STATEMENT.format(i) for i in range(10)]

def query_to_df(query):
    # fresh connection per task; host and port are placeholders (from impala.dbapi import connect)
    cursor = connect(host='hive-host', port=21050).cursor()
    cursor.execute(query)
    return pd.DataFrame.from_records(cursor.fetchall())
Now you have a function that returns a partition and has no dependence on global objects - it only takes as input a string.
parts = [dask.delayed(query_to_df)(q) for q in queries]
df = dd.from_delayed(parts)
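The resulting df is lazy; nothing is fetched from Hive until you ask for a result, for example (the column name here is a placeholder):

counts = df.groupby('partition_column').size().compute()  # executes the ten queries and aggregates the results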

Spark group by several fields several times on the same RDD

My data is stored in CSV format, and the headers are given in the column_names variable.
I wrote the following code to read it into an RDD of Python dictionaries:
rdd = sc.textFile(hdfs_csv_dir)\
    .map(lambda x: x.split(','))\
    .filter(lambda row: len(row) == len(column_names))\
    .map(lambda row: dict([(column, row[index]) for index, column in enumerate(column_names)]))
Next, I wrote a function that counts the combinations of column values given the column names
import operator

def count_by(rdd, cols=[]):
    '''
    Equivalent to:
    SELECT col1, col2, COUNT(*) FROM MX3 GROUP BY col1, col2;
    But the number of columns can be more than 2
    '''
    counts = rdd.map(lambda x: (','.join([str(x[c]) for c in cols]), 1))\
        .reduceByKey(operator.add)\
        .map(lambda t: t[0].split(',') + [t[1]])\
        .collect()
    return counts
I am running count_by several times, with many different parameters, on the same RDD.
What is the best way to optimize the query and make it run faster?
First, you should cache the RDD (by calling cachedRdd = rdd.cache()) before passing it into count_by multiple times, to prevent Spark from loading it from disk for each operation. Operating on a cached RDD means the data will be loaded into memory upon first use (the first call to count_by), then read from memory for the following calls.
You should also consider using the Spark DataFrame API instead of the lower-level RDD API (a sketch follows below), since:
You seem to articulate your intentions using SQL, and the DataFrame API lets you actually use such a dialect
When using DataFrames, Spark can perform extra optimizations since it has a better understanding of what you are trying to do, i.e. it can plan the best way to achieve it. A SQL-like dialect is declarative - you only say what you want, not how to get it, which gives Spark more freedom to optimize
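A hedged sketch combining both suggestions; the column names are placeholders and the CSV read options may need adjusting:

# RDD route: cache once, then run count_by repeatedly against the cached RDD
cached_rdd = rdd.cache()
counts_ab = count_by(cached_rdd, cols=['col_a', 'col_b'])
counts_bc = count_by(cached_rdd, cols=['col_b', 'col_c'])

# DataFrame route: let Spark plan the grouped counts itself
df = spark.read.csv(hdfs_csv_dir).toDF(*column_names).cache()
counts_ab_df = df.groupBy('col_a', 'col_b').count().collect()
counts_bc_df = df.groupBy('col_b', 'col_c').count().collect()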