Include partition steps as columns when reading a Synapse Spark dataframe - apache-spark-sql

I have the following partition strategy in an ADLS Gen2 store
dir_parquet = "abfss://blah.windows.net/container_name/project=cars/make=*/model=*/*.parquet"
This loads the already-partitioned data into a dataframe accordingly. I am aware of using .filepath(n) in SQL to achieve this, and I effectively need the same thing, but in a notebook dataframe.
How can I keep the project, make and model values in the dataframe as separate columns?
According to this other SO thread, setting .option("mergeSchema", "true") on read should work; however, it did not.
Thanks.
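For reference, this is roughly what I tried (a minimal sketch; the path is the placeholder from above, and spark is the notebook session):

# Sketch of the attempted read with the mergeSchema option from the linked thread
dir_parquet = "abfss://blah.windows.net/container_name/project=cars/make=*/model=*/*.parquet"
df = spark.read.option("mergeSchema", "true").parquet(dir_parquet)
# project, make and model still do not appear as columns in df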

Since I received no answer to this and cannot find an official means of doing so, I wrote the code below.
People with this problem may also find it useful to recursively list blob directories; if so, see the deep_ls function here (not my code).
import pyspark
import pyspark.sql.functions as F
from typing import List

def load_dataframes_with_partition_steps(dir_urls: List[str]) -> List[pyspark.sql.dataframe.DataFrame]:
    """
    Written by: Paul Wilson, 2022-07-29

    Takes in a list of blob directories including their partition steps and returns a list of
    dataframes with the associated partition steps in the dataframe.

    Ex. input...:
    ['abfss://container@yourgen2store.dfs.core.windows.net/projects/cars/make=Vauxhall/model=Astra/transmission=Manual',
     'abfss://container@yourgen2store.dfs.core.windows.net/projects/cars/make=Ford/model=Fiesta/transmission=Automatic']

    ...which is turned into a list of dicts...
    [{'url': 'abfss://container@yourgen2store.dfs.core.windows.net/projects/cars/make=Vauxhall/model=Astra/transmission=Manual',
      'make': 'Vauxhall',
      'model': 'Astra',
      'transmission': 'Manual'},
     {'url': 'abfss://container@yourgen2store.dfs.core.windows.net/projects/cars/make=Ford/model=Fiesta/transmission=Automatic',
      'make': 'Ford',
      'model': 'Fiesta',
      'transmission': 'Automatic'}]

    ...and from that list a list of dataframes per url and associated partition steps, such as:
    [df1, df2, ..., dfn]
    """
    def load_dataframe(url: str = None, partition_steps: dict = {}, file_format: str = None,
                       df: pyspark.sql.dataframe.DataFrame = None) -> pyspark.sql.dataframe.DataFrame:
        """
        Recursively load a dataframe and apply the partition steps via withColumn
        """
        if file_format is None or len(file_format) == 0:
            raise ValueError('file_format must not be None; the URL must end in the file format (.parquet, .csv, etc.)')
        # if there is a url and non-empty partition steps without a df loaded, then load the dataframe
        if url is not None and len(partition_steps.keys()) > 0 and df is None:
            df = spark.read.format(file_format).load(url)
            # df is loaded, so do not pass a url, indicating it is loaded
            return load_dataframe(url=None, partition_steps=partition_steps, file_format=file_format, df=df)
        # if here, then the df is loaded, so proceed to apply withColumn
        if url is None and df is not None and len(partition_steps.keys()) > 0:
            # take the first item in the partition steps dict
            key = list(partition_steps.keys())[0]
            value = list(partition_steps.values())[0]
            # remove the first item from the partition steps dict
            partition_steps.pop(key)
            # add the partition step to the dataframe as a literal column
            df = df.withColumn(key, F.lit(value))
            return load_dataframe(url=None, partition_steps=partition_steps, file_format=file_format, df=df)
        # if it makes it here, then the dataframe is loaded and the partition steps are applied
        return df

    # list of dataframe dict values of url and partition steps
    list_df_dicts = list()
    if not isinstance(dir_urls, list):
        raise TypeError('dir_urls must be a list of string values')
    # iterate over all urls and generate a dict of partition values
    for url in dir_urls:
        # dict to store url and partition steps
        d_dict = dict()
        d_dict['url'] = url
        # get the format from the last part of the url
        file_format = url.split('.')[-1]
        d_dict['file_format'] = file_format
        # split the url, keeping only partition steps (ex. make=Vauxhall)
        url_split = [u for u in d_dict['url'].split('/') if '=' in u]
        if len(url_split) == 0:
            raise ValueError('The list of URLs must contain the partition steps, ex. make=ford')
        # turn the partition=item into a key:value
        partition_items = [u.split('=') for u in url_split]
        # iterate over every item in partition_items=[['key', 'value']] and set dict[key] = value
        for item in partition_items:
            key = item[0]
            value = item[1]
            d_dict[key] = value
        list_df_dicts.append(d_dict)

    # iterate over all the dicts and load the dataframes to a list with their partition steps in place
    list_dfs = list()
    for d_dict in list_df_dicts:
        # get the url from the d_dict
        url = d_dict['url']
        # get the format
        file_format = d_dict['file_format']
        # remove the url and file format from the d_dict so only the partition steps remain
        d_dict.pop('url')
        d_dict.pop('file_format')
        df = load_dataframe(url=url, partition_steps=d_dict, file_format=file_format)
        list_dfs.append(df)
    # return the list of dataframes
    return list_dfs
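For reference, a minimal usage sketch (the directory URLs are the docstring placeholders with a *.parquet glob appended so the file format can be derived from the end of the URL):

# Hypothetical call using the placeholder URLs from the docstring above
dir_urls = [
    'abfss://container@yourgen2store.dfs.core.windows.net/projects/cars/make=Vauxhall/model=Astra/transmission=Manual/*.parquet',
    'abfss://container@yourgen2store.dfs.core.windows.net/projects/cars/make=Ford/model=Fiesta/transmission=Automatic/*.parquet',
]
dfs = load_dataframes_with_partition_steps(dir_urls)
# each dataframe now carries make, model and transmission as literal columns
dfs[0].select('make', 'model', 'transmission').show(1)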

Related

TensorFlow Federated - Loading and preprocessing data on a remote client

Part of the simulation program that I am working on allows clients to load local data from their device without the server being able to access that data.
Following the idea from this post, I have the following code configured to assign the client a path to load the data from. Although the data is in svmlight format, loading it line-by-line can still allow it to be preprocessed afterwards.
client_paths = {
    'client_0': '<path_here>',
    'client_1': '<path_here>',
}

def create_tf_dataset_for_client_fn(id):
    path = client_paths.get(id)
    data = tf.data.TextLineDataset(path)
    return data  # return the line dataset so it can be built per client

path_source = tff.simulation.datasets.ClientData.from_clients_and_fn(client_paths.keys(), create_tf_dataset_for_client_fn)
The code above allows a path to be loaded at runtime from the remote client's side with the following line of code.
data = path_source.create_tf_dataset_for_client('client_0')
Here, the data variable can be iterated through, and calling tf.print() displays the contents of the client's data on the remote device. But I need to preprocess this data into an appropriate format before continuing. I am presently attempting to convert it from a string Tensor in svmlight format into a SparseTensor of the appropriate format.
The issue is that, although the defined preprocessing method works in a standalone scenario (i.e. when defined as a function and tested on a manually defined Tensor of the same format), it fails when the code is executed during the client update @tf.function in the TFF algorithm. Below is the error raised when executing the notebook cell containing a @tff.tf_computation function which calls a @tf.function that does the preprocessing and retrieves the data.
ValueError: Shape must be rank 1 but is rank 0 for '{{node Reshape_2}} = Reshape[T=DT_INT64, Tshape=DT_INT32](StringToNumber_1, Reshape_2/shape)' with input shapes: [?,?], [].
Since the issue occurs when executing the client's @tff.tf_computation update function, which calls the @tf.function containing the preprocessing code, I am wondering how I can allow the function to perform the preprocessing on the data without errors. I assume that if I can get the functions to run properly when defined, they will also work when called remotely.
Any ideas on how to address this issue? Thank you for your help!
For reference, the preprocessing function uses tf computations to manipulate the data. Although not optimal yet, below is the code presently being used. It is inspired by this link on string_split examples. I have also extracted the code to put directly into the client's @tf.function after loading the TextLineDataset, but this fails as well.
def decode_libsvm(line):
    # Split the line into columns, delimiting by a blank space
    cols = tf.strings.split([line], ' ')
    # Retrieve the labels from the first column as an integer
    labels = tf.strings.to_number(cols.values[0], out_type=tf.int32)
    # Split all column pairs
    splits = tf.strings.split(cols.values[1:], ':')
    # Convert splits into a sparse matrix to retrieve all needed properties
    splits = splits.to_sparse()
    # Reshape the tensor for further processing
    id_vals = tf.reshape(splits.values, splits.dense_shape)
    # Retrieve the indices and values within two separate tensors
    feat_ids, feat_vals = tf.split(id_vals, num_or_size_splits=2, axis=1)
    # Convert the indices into int64 numbers
    feat_ids = tf.strings.to_number(feat_ids, out_type=tf.int64)
    # To reload within a SparseTensor, add a dimension to feat_ids with a default value of 0
    feat_ids = tf.reshape(feat_ids, -1)
    feat_ids = tf.expand_dims(feat_ids, 1)
    feat_ids = tf.pad(feat_ids, [[0,0], [0,1]], constant_values=0)
    # Extract and flatten the values
    feat_vals = tf.strings.to_number(feat_vals, out_type=tf.float32)
    feat_vals = tf.reshape(feat_vals, -1)
    # Configure a SparseTensor to contain the indices and values
    sparse_output = tf.SparseTensor(indices=feat_ids, values=feat_vals, dense_shape=[1, <shape>])
    return {"x": sparse_output, "y": labels}
Update (Fix)
Following the advice from Jakub's comment, the issue was fixed by enclosing the reshape and expand_dims shape arguments in [] where needed. The code now runs within tff without issue.
def decode_libsvm(line):
    # Split the line into columns, delimiting by a blank space
    cols = tf.strings.split([line], ' ')
    # Retrieve the labels from the first column as an integer
    labels = tf.strings.to_number(cols.values[0], out_type=tf.int32)
    # Split all column pairs
    splits = tf.strings.split(cols.values[1:], ':')
    # Convert splits into a sparse matrix to retrieve all needed properties
    splits = splits.to_sparse()
    # Reshape the tensor for further processing
    id_vals = tf.reshape(splits.values, splits.dense_shape)
    # Retrieve the indices and values within two separate tensors
    feat_ids, feat_vals = tf.split(id_vals, num_or_size_splits=2, axis=1)
    # Convert the indices into int64 numbers
    feat_ids = tf.strings.to_number(feat_ids, out_type=tf.int64)
    # To reload within a SparseTensor, add a dimension to feat_ids with a default value of 0
    feat_ids = tf.reshape(feat_ids, [-1])
    feat_ids = tf.expand_dims(feat_ids, [1])
    feat_ids = tf.pad(feat_ids, [[0,0], [0,1]], constant_values=0)
    # Extract and flatten the values
    feat_vals = tf.strings.to_number(feat_vals, out_type=tf.float32)
    feat_vals = tf.reshape(feat_vals, [-1])
    # Configure a SparseTensor to contain the indices and values
    sparse_output = tf.SparseTensor(indices=feat_ids, values=feat_vals, dense_shape=[1, <shape>])
    return {"x": sparse_output, "y": labels}
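For completeness, a minimal usage sketch (not part of the original fix): the parser can be mapped over a client's line dataset before any further batching.

# Hypothetical usage: map the fixed parser over a client's TextLineDataset
data = path_source.create_tf_dataset_for_client('client_0')
parsed = data.map(decode_libsvm)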

Adding a column with a calculation to multiple CSVs

I'm SUPER green to Python and am having some issues trying to automate some calculations.
I know that this works to add a new column called "Returns" that divides the current "value" by the previous "value" for a single CSV:
import pandas as pd
import numpy as np
import csv
a = pd.read_csv("/Data/a_data.csv", index_col = "time")
a ["Returns"] = (a["value"]/a["value"].shift(1) -1)*100
However, I have a lot of these CSVs, and I need this calculation to happen prior to merging them all together. So I was hoping to write something that just looped through all of the CSVs, did the calculation, and added the column, but clearly this is incorrect, as I get a SyntaxError:
import pandas as pd
import numpy as np
import csv
a = pd.read_csv("/Data/a_data.csv", index_col = "time")
b = pd.read_csv("/Data/b_data.csv", index_col = "time")
c = pd.read_csv("/Data/c_data.csv", index_col = "time")
my_lists = ['a','b','c']
for my_list in my_lists:
    {my_list}["Returns"] = ({my_list}["close"]/{my_list}["close"].shift(1) -1)*100
    print(f"Calculating: {my_list.upper()}")
I'm sure there is an easy way to do this that I just haven't reached in my Python education yet, so any guidance would be greatly appreciated!
Assuming "close" and "time" are fields defined in each of your csv files, you could define a function that reads each file, do the shift and returns a dataframe:
def your_func(my_file):  # this function takes a file name as an argument
    my_df = pd.read_csv(my_file, index_col="time")  # the function reads its content into a data frame,
    my_df["Returns"] = (my_df["close"]/my_df["close"].shift(1) - 1)*100  # makes the calculation
    return my_df  # and returns it as an output
Then, as the main code, you collect all csv files from a folder with the glob package. Using the above function, you build a data frame for each file with the calculation done.
import glob

path = r'/Data/'  # path to the directory where you have the csv files
filenames = glob.glob(path + "/*.csv")  # grab the csv file names in that path using the glob package

for filename in filenames:  # loop over all csv file names found in the directory
    df = your_func(filename)  # call the function defined above: it reads the file, makes the calculation and returns the data frame
    print(df)
Above, the data frame is printed to show the results; I am not sure what you intend to do with upper (I don't think it is a function on a data frame).
Finally, this returns independent data frames with the calculation done prior to any further or final transformation.
1. Do the a, b, c data frames have the same dimensions?
2. You don't need to import the csv library; pandas handles reading CSV files on its own.
3. If you want to union the data frames, you can do it like this:
my_lists = [a,b,c]
and you can concatenate them this way:
result=pd.concat(my_lists)
Lastly, your calculation should be:
result["Returns"]=(result.loc[:, "close"].div(result.loc[:, "close"].shift()).fillna(0).replace([np.inf, -np.inf], 0))
You need to use label-based selection (loc) on the data frame in order to access the values. When dividing numbers, the results can be NaN (Not a Number) or infinite; the fillna and replace calls handle those NaN and Inf values.
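Putting the two suggestions together, a rough sketch (assuming the /Data/ path and the "close"/"time" columns from the question):

import glob
import pandas as pd

frames = []
for filename in glob.glob('/Data/*.csv'):
    df = pd.read_csv(filename, index_col='time')
    # per-file Returns calculation before merging
    df['Returns'] = (df['close'] / df['close'].shift(1) - 1) * 100
    frames.append(df)

# union all per-file frames into one
result = pd.concat(frames)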

How to filter some data by read_parquet() in pandas?

I want to reduce memory usage when loading by filtering on some gid values.
reg_df = pd.read_parquet('/data/2010r.pq',
                         columns=['timestamp', 'gid', 'uid', 'flag'])
But the accepted kwargs haven't been shown in the docs.
For example:
gid=[100,101,102,103,104,105]
gid_i_want_load = [100,103,105]
So, how can I load only the gid values that I want to calculate on?
The introduction of **kwargs to the pandas library is documented here. It looks like the original intent was to actually pass columns into the request to limit IO volume. The contributors took the next step and added a general pass-through for **kwargs.
For pandas/io/parquet.py the following is for read_parquet:
def read_parquet(path, engine='auto', columns=None, **kwargs):
    """
    Load a parquet object from the file path, returning a DataFrame.

    .. versionadded 0.21.0

    Parameters
    ----------
    path : string
        File path
    columns: list, default=None
        If not None, only these columns will be read from the file.

        .. versionadded 0.21.1
    engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
        Parquet library to use. If 'auto', then the option
        ``io.parquet.engine`` is used. The default ``io.parquet.engine``
        behavior is to try 'pyarrow', falling back to 'fastparquet' if
        'pyarrow' is unavailable.
    kwargs are passed to the engine

    Returns
    -------
    DataFrame
    """
    impl = get_engine(engine)
    return impl.read(path, columns=columns, **kwargs)
For pandas/io/parquet.py the following is for read on the pyarrow engine:
def read(self, path, columns=None, **kwargs):
    path, _, _, should_close = get_filepath_or_buffer(path)
    if self._pyarrow_lt_070:
        result = self.api.parquet.read_pandas(path, columns=columns,
                                              **kwargs).to_pandas()
    else:
        kwargs['use_pandas_metadata'] = True  # <-- only param for kwargs...
        result = self.api.parquet.read_table(path, columns=columns,
                                             **kwargs).to_pandas()
    if should_close:
        try:
            path.close()
        except:  # noqa: flake8
            pass

    return result
For pyarrow/parquet.py the following is for read_pandas:
def read_pandas(self, **kwargs):
    """
    Read dataset including pandas metadata, if any. Other arguments passed
    through to ParquetDataset.read, see docstring for further details

    Returns
    -------
    pyarrow.Table
        Content of the file as a table (of columns)
    """
    return self.read(use_pandas_metadata=True, **kwargs)  # <-- params being passed
For pyarrow/parquet.py the following is for read:
def read(self, columns=None, nthreads=1, use_pandas_metadata=False):  # <-- kwargs param at pyarrow
    """
    Read a Table from Parquet format

    Parameters
    ----------
    columns: list
        If not None, only these columns will be read from the file. A
        column name may be a prefix of a nested field, e.g. 'a' will select
        'a.b', 'a.c', and 'a.d.e'
    nthreads : int, default 1
        Number of columns to read in parallel. If > 1, requires that the
        underlying file source is threadsafe
    use_pandas_metadata : boolean, default False
        If True and file has custom pandas schema metadata, ensure that
        index columns are also loaded

    Returns
    -------
    pyarrow.table.Table
        Content of the file as a table (of columns)
    """
    column_indices = self._get_column_indices(
        columns, use_pandas_metadata=use_pandas_metadata)
    return self.reader.read_all(column_indices=column_indices,
                                nthreads=nthreads)
So, if I understand correctly, maybe you can access nthreads and use_pandas_metadata, but then again, neither is explicitly assigned (??). I haven't tested it, but it may be a start.
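As a follow-up sketch (untested here, and dependent on your pandas/pyarrow versions): newer versions of the pyarrow engine accept a filters argument, which pandas forwards through its kwargs and which covers the gid use case directly.

import pandas as pd

gid_i_want_load = [100, 103, 105]
# filters is forwarded to the pyarrow engine; row groups whose gid statistics
# cannot match are skipped, and remaining rows are filtered to the list
reg_df = pd.read_parquet('/data/2010r.pq',
                         engine='pyarrow',
                         columns=['timestamp', 'gid', 'uid', 'flag'],
                         filters=[('gid', 'in', gid_i_want_load)])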

How do I convert multiple Pandas DFs into a single Spark DF?

I have several Excel files that I need to load and pre-process before loading them into a Spark DF. I have a list of these files that need to be processed. I do something like this to read them in:
file_list_rdd = sc.emptyRDD()

for file_path in file_list:
    current_file_rdd = sc.binaryFiles(file_path)
    print(current_file_rdd.count())
    file_list_rdd = file_list_rdd.union(current_file_rdd)
I then have some mapper function that turns file_list_rdd from a set of (path, bytes) tuples to (path, Pandas DataFrame) tuples. This allows me to use Pandas to read the Excel file and to manipulate the files so that they're uniform before making them into a Spark DataFrame.
How do I take an RDD of (file path, Pandas DF) tuples and turn it into a single Spark DF? I'm aware of functions that can do a single transformation, but not one that can do several.
My first attempt was something like this:
sqlCtx = SQLContext(sc)

def convert_pd_df_to_spark_df(item):
    return sqlCtx.createDataFrame(item[0][1])

processed_excel_rdd.map(convert_pd_df_to_spark_df)
I'm guessing that didn't work because sqlCtx isn't distributed with the computation (it's a guess because the stack trace doesn't make much sense to me).
Thanks in advance for taking the time to read :).
This can be done by converting to Arrow RecordBatches, which Spark > 2.3 can process into a DF very efficiently.
https://gist.github.com/linar-jether/7dd61ed6fa89098ab9c58a1ab428b2b5
This snippet monkey-patches spark to include a createFromPandasDataframesRDD method.
The createFromPandasDataframesRDD method accepts an RDD of pandas DFs (assumes the same columns) and returns a single Spark DF.
I solved this by writing a function like this:
def pd_df_to_row(rdd_row):
    key = rdd_row[0]
    pd_df = rdd_row[1]

    rows = list()
    for index, series in pd_df.iterrows():
        # Take a row of the df, export it as a dict, and pass the unpacked dict into the Row constructor
        row_dict = {str(k): v for k, v in series.to_dict().items()}
        rows.append(Row(**row_dict))
    return rows
You can invoke it by calling something like:
processed_excel_rdd = processed_excel_rdd.flatMap(pd_df_to_row)
processed_excel_rdd now contains a collection of Spark Row objects. You can now say:
processed_excel_rdd.toDF()
There's probably something more efficient than the Series-> dict-> Row operation, but this got me through.
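As a rough variant (an untested sketch, assuming every pandas DF in the RDD shares the same columns), itertuples avoids building a Series and a dict per row:

from pyspark.sql import Row

def pd_df_to_rows(rdd_row):
    pd_df = rdd_row[1]
    columns = [str(c) for c in pd_df.columns]
    # name=None makes itertuples yield plain tuples, which zip cleanly with the column names
    return [Row(**dict(zip(columns, values)))
            for values in pd_df.itertuples(index=False, name=None)]

spark_df = processed_excel_rdd.flatMap(pd_df_to_rows).toDF()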
Why not make a list of the dataframes or filenames and then call union in a loop? Something like this:
If pandas dataframes:
dfs = [df1, df2, df3, df4]

sdf = None
for df in dfs:
    if sdf:
        sdf = sdf.union(spark.createDataFrame(df))
    else:
        sdf = spark.createDataFrame(df)
If filenames:
names = [name1, name2, name3, name4]

sdf = None
for name in names:
    if sdf:
        sdf = sdf.union(spark.createDataFrame(pd.read_excel(name)))
    else:
        sdf = spark.createDataFrame(pd.read_excel(name))
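Equivalently, the dataframe loop can be written with functools.reduce (a sketch assuming dfs is a non-empty list of pandas dataframes as above):

from functools import reduce

# union the per-dataframe Spark DFs pairwise into a single DF
sdf = reduce(lambda left, right: left.union(right),
             (spark.createDataFrame(df) for df in dfs))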

Slice Spark’s DataFrame SQL by row (pyspark)

I have a Spark DataFrame parquet file that can be read by Spark as follows:
df = sqlContext.read.parquet('path_to/example.parquet')
df.registerTempTable('temp_table')
I want to slice my dataframe, df, by row (i.e. the equivalent of df.iloc[0:4000], df.iloc[4000:8000], etc. in a Pandas dataframe), since I want to convert each small chunk to a pandas dataframe and work on it later. I only know how to do it by sampling a random fraction, i.e.
df_sample = df.sample(False, fraction=0.1) # sample 10 % of my data
df_pandas = df_sample.toPandas()
It would be great if there were a method to slice my dataframe df by row. Thanks in advance.
You can use monotonically_increasing_id() to add an ID column to your dataframe and use that to get a working set of any size.
import pyspark.sql.functions as f
# add an index column
df = df.withColumn('id', f.monotonically_increasing_id())
# Sort by index and get first 4000 rows
working_set = df.sort('id').limit(4000)
Then, you can remove the working set from your dataframe using subtract().
# Remove the working set, and use this `df` to get the next working set
df = df.subtract(working_set)
Rinse and repeat until you're done processing all the rows. It's not the ideal way to do things, but it works. Also consider filtering your Spark data frame down before converting it to Pandas.
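For illustration, a rough sketch of that rinse-and-repeat loop (chunk_size and the per-chunk pandas work are placeholders, not part of the original answer):

import pyspark.sql.functions as f

chunk_size = 4000
df = df.withColumn('id', f.monotonically_increasing_id())

remaining = df
while remaining.take(1):  # stop once no rows are left
    working_set = remaining.sort('id').limit(chunk_size)
    pdf = working_set.toPandas()          # work on this chunk in pandas
    remaining = remaining.subtract(working_set)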