Transforming Python Classes to Spark Delta Rows - pandas

I am trying to transform an existing Python package to make it work with Structured Streaming in Spark.
The package is quite complex with multiple substeps, including:
Binary file parsing of metadata
Fourier Transformations of spectra
The intermediate & end results were previously stored in an SQL database using SQLAlchemy, but we need to move them to Delta.
After lots of investigation, I've made the first part (the binary file parsing) work, but only by statically defining the column types in a UDF:
fileparser = F.udf(File()._parseBytes, FileDelta.getSchema())
where the _parseBytes() method takes a binary stream and outputs a dictionary of variables.
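For illustration, by "statically defining the column types" I mean something like the following simplified sketch of a schema (this is not the package's real FileDelta.getSchema(), just an example of the shape):
from pyspark.sql.types import StructType, StructField, StringType, BinaryType, LongType

# Hypothetical, simplified illustration of a statically defined return schema for
# the parsing UDF; the real FileDelta.getSchema() has different fields.
def getSchema():
    return StructType([
        StructField("filename", StringType(), False),
        StructField("acquired_at", LongType(), True),
        StructField("raw_header", BinaryType(), True),
    ])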
Now I'm trying to do this similarly for the spectrum generation:
spectrumparser = F.udf(lambda inputDict: vars(Spectrum(inputDict)), SpectrumDelta.getSchema())
However, the Spectrum() __init__ method generates multiple pandas DataFrames as fields.
I'm getting errors as soon as the executor nodes reach that part of the code.
Example error:
expected zero arguments for construction of ClassDict (for pandas.core.indexes.base._new_Index).
This happens when an unsupported/unregistered class is being unpickled that requires construction arguments.
Fix it by registering a custom IObjectConstructor for this class.
Overall, I feel like I'm spending way too much effort on the Delta adaptation. Is there maybe an easier way to make these steps work?
I read in [1] that we could switch to the pandas-on-Spark API, but to me that seems to be something to do within the package methods themselves. Is that the solution: to rewrite the entire package & parsers to work natively in PySpark?
I also tried reproducing the above issue in a minimal example, but it's hard to reproduce since the package code is so complex.

After testing, it turns out that the problem lies in serialization when producing output (with the show(), display() or save() methods).
The UDF expects ArrayType(xxxType()) values, but receives pandas.Series objects instead and does not know how to unpickle them.
If you explicitly tell the UDF how to transform those objects into plain Python containers, the UDF works:
def getSpectrumDict(inputDict):
    spectrum = Spectrum(inputDict["filename"], inputDict["path"], dict_=inputDict)
    out = {}
    for key, value in vars(spectrum).items():
        # Convert pandas objects into plain lists/dicts so they map onto the
        # ArrayType/MapType fields of the UDF's return schema.
        if isinstance(value, pd.Series):
            out[key] = value.tolist()
        elif isinstance(value, pd.DataFrame):
            out[key] = value.to_dict("list")
        else:
            out[key] = value
    return out

spectrumparser = F.udf(getSpectrumDict, SpectrumDelta.getSchema())
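Applying the UDF then serializes cleanly; for instance (the DataFrame df and the column parsed are placeholder names for this sketch, not names from the package):
# Sketch: apply the UDF to the column holding the parsed-file dict and materialize the result.
# "df" and "parsed" are placeholder names used only for illustration.
result = df.withColumn("spectrum", spectrumparser(F.col("parsed")))
result.select("spectrum").show(truncate=False)  # works now: the UDF returns only plain lists/dicts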

Dask unable to write to parquet with concatenated data

I am trying to do the following: read in a .dat file with pandas, convert it to a Dask dataframe, concatenate it with another Dask dataframe that I read in from a Parquet file, and then write the result out to a new Parquet file. I do the following:
import dask.dataframe as dd
import pandas as pd

# raw strings so the backslashes in the Windows-style paths are not treated as escapes
hist_pth = r"\path\to\hist_file"
hist_file = dd.read_parquet(hist_pth)

pth = r"\path\to\file"
daily_file = pd.read_csv(pth, sep="|", encoding="latin")
daily_file = daily_file.astype(hist_file.dtypes.to_dict(), errors="ignore")
dask_daily_file = dd.from_pandas(daily_file, npartitions=1)

combined_file = dd.concat([dask_daily_file, hist_file])

output_path = r"\path\to\output"
combined_file.to_parquet(output_path)
The combined_file.to_parquet(output_path) call always starts and then stops, or doesn't work correctly. In a Jupyter notebook I get a kernel failure when I do this. In a Python script the script completes, but the whole combined file isn't written (I can tell from the size: the CSV is 140 MB and the historical Parquet file is around 1 GB, yet the output of to_parquet is only 20 MB).
Some context: this is for an ETL process, and with the amount of data we're adding daily I'm soon going to run out of memory on the historical and combined datasets, so I'm trying to migrate the process from plain pandas to Dask to handle the larger-than-memory data I will soon have. The current data, daily + historical, still fits in memory, but just barely (I already make use of categoricals; these are stored in the Parquet file and I then copy that schema to the new file).
I also noticed that after the dd.concat([dask_daily_file, hist_file]) I am unable to call .compute() even on simple tasks without it crashing the same way it does when writing to Parquet. For example, on the original, pre-concatenation data I can call hist_file["Value"].div(100).compute() and get the expected value, but the same operation on combined_file crashes. Even just combined_file.compute() to turn it into a pandas DataFrame crashes. I have tried repartitioning as well, with no luck.
I was able to do these exact operations in plain pandas without issue. But again, I'm going to run out of memory soon, which is why I am moving to Dask.
Is this something Dask isn't able to handle? If it can handle it, am I processing it correctly? Specifically, it seems like the concat is causing issues. Any help is appreciated!
UPDATE
After playing around more I ended up with the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'categories'
There is an existing GitHub issue that seems like it could be related to this; I asked and am waiting for confirmation.
As a workaround I converted all categorical columns to strings/objects and tried again, and then ended up with:
ArrowTypeError: ("Expected a bytes object, got a 'int' object", 'Conversion failed for column Account with type object')
When I check that column's dtype, df["Account"].dtype returns dtype('O'), so I think I already have the correct dtype. The values in this column are mostly numbers, but there are some records with just letters.
Is there a way to resolve this?
I got this error in pandas after concatenating dataframes and saving the result to Parquet format:
data = pd.concat([df_1, df_2, df_3], axis=0, ignore_index=True)
data.to_parquet(filename)
apparently because the affected rows contained different data types, either int or float. The error goes away if you force those columns to a single data type before saving:
cols = ["first affected col", "second affected col", ..]
data[cols] = data[cols].apply(pd.to_numeric, errors='coerce', axis=1)
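The same idea applies to the mixed letters-and-numbers Account column from the question above: force a single type before writing. A minimal sketch, assuming the column should simply be treated as text:
# Sketch: cast every value in the object column to an actual string so Arrow can
# serialize the column as bytes instead of failing on the int entries.
combined_file["Account"] = combined_file["Account"].astype(str)
combined_file.to_parquet(output_path)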

How to serialize data in example-in-example format for tensorflow-ranking?

I'm building a ranking model with tensorflow-ranking. I'm trying to serialize a data set in the TFRecord format and read it back at training time.
The tutorial doesn't show how to do this. There is some documentation here on the example-in-example data format, but it's hard for me to understand: I'm not sure what the serialized_context or serialized_examples fields are or how they fit into examples, and I'm not sure what the Serialize() function in the code block is.
Concretely, how can I write and read data in example-in-example format?
The context is a map from feature name to tf.train.Feature. The examples list is a list of maps from feature name to tf.train.Feature. Once you have these, the following code will create an "example-in-example":
context = {...}
examples = [{...}, {...}, ...]

serialized_context = tf.train.Example(features=tf.train.Features(feature=context)).SerializeToString()

serialized_examples = tf.train.BytesList()
for example in examples:
    tf_example = tf.train.Example(features=tf.train.Features(feature=example))
    serialized_examples.value.append(tf_example.SerializeToString())

example_in_example = tf.train.Example(features=tf.train.Features(feature={
    'serialized_context': tf.train.Feature(bytes_list=tf.train.BytesList(value=[serialized_context])),
    'serialized_examples': tf.train.Feature(bytes_list=serialized_examples)
}))
To read the examples back, you may call
tfr.data.parse_from_example_in_example(
    example_pb,
    context_feature_spec=context_feature_spec,
    example_feature_spec=example_feature_spec)
where context_feature_spec and example_feature_spec are maps from feature name to tf.io.FixedLenFeature or tf.io.VarLenFeature.
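For example, the specs could look like the following (the feature names query_tokens, document_tokens and relevance are made up for this sketch; use the names you actually serialized):
import tensorflow as tf

# Hypothetical feature names, for illustration only.
context_feature_spec = {
    "query_tokens": tf.io.VarLenFeature(tf.string),
}
example_feature_spec = {
    "document_tokens": tf.io.VarLenFeature(tf.string),
    "relevance": tf.io.FixedLenFeature([1], tf.int64, default_value=[0]),
}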
First of all, I recommend reading this article to make sure you know how to create a tf.Example as well as a tf.SequenceExample (which, by the way, is the other data format supported by TF-Ranking):
Tensorflow Records? What they are and how to use them
In the second part of that article, you will see that a tf.SequenceExample has two components: 1) context and 2) sequence (or examples). This is the same idea that example-in-example is trying to implement. Basically, the context is the set of features that are independent of the items you want to rank (the search query in the case of search, or user features in the case of a recommendation system), and the sequence part is a list of items (aka examples). This could be a list of documents (in search) or movies (in recommendation).
Once you are comfortable with tf.Example, Example-in-Example will be easier to understand. Take a look at this piece of code for how to create an EIE instance:
https://www.gitmemory.com/issue/tensorflow/ranking/95/518480361
1) Bundle the context features together in a tf.Example object and serialize it.
2) Bundle the sequence (example) features (each of which could contain a list of values) in another tf.Example object and serialize this one too.
3) Wrap these inside a parent tf.Example.
4) (If you're writing to TFRecords) serialize the parent tf.Example object and write it to your TFRecord file, as sketched below.
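Step 4 in code is roughly the following (a sketch; output.tfrecord is just a placeholder path, and example_in_example is the parent tf.Example built above):
# Sketch of step 4: serialize the parent example and append it to a TFRecord file.
with tf.io.TFRecordWriter("output.tfrecord") as writer:
    writer.write(example_in_example.SerializeToString())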

Create new column from existing column in Dataset - Apache Spark Java

I am new to Spark ML and got stuck on a task which requires some data normalization, and there is very little documentation available online for Spark ML with Java. Any help is much appreciated.
Problem description:
I have a Dataset that contains an encoded URL in a column (ENCODED_URL), and I want to create a new column (DECODED_URL) in the existing Dataset that contains the decoded version of ENCODED_URL.
For example:
Current Dataset
ENCODED_URL
https%3A%2F%2Fmywebsite
New Dataset
ENCODED_URL | DECODED_URL
https%3A%2F%2Fmywebsite | https://mywebsite
I tried using withColumn but had no clue what I should pass as the second argument:
Dataset<Row> newDs = ds.withColumn("new_col", ?);
After reading the Spark documentation I got the idea that it may be possible with SQLTransformer, but I couldn't figure out how to customize it to decode the URL.
This is how I read the information from the CSV:
Dataset<Row> urlDataset = s_spark.read().option("header", true).csv(CSV_FILE).persist(StorageLevel.MEMORY_ONLY());
A Spark primer
The first thing to know is that Spark Datasets are effectively immutable. Whenever you do a transformation, a new Dataset is created and returned. Another thing to keep in mind is the difference between actions and transformations: actions cause Spark to actually start crunching numbers and compute your DataFrame, while transformations add to the definition of a DataFrame but are not computed unless an action is called. An example of an action is DataFrame#count, while an example of a transformation is DataFrame#withColumn. See the full list of actions and transformations in the Spark Scala documentation.
A solution
withColumn allows you to either create a new column or replace an existing column in a Dataset (if the first argument is an existing column's name). The docs for withColumn tell you that the second argument is supposed to be a Column object. Unfortunately, the Column documentation only describes methods available on Column objects and does not link to other ways of creating Column objects, so it's not your fault that you're at a loss for what to do next.
The thing you're looking for is org.apache.spark.sql.functions#regexp_replace. Putting it all together, your code should look something like this:
...
import org.apache.spark.sql.functions;

Dataset<Row> ds = ... // reading from your csv file
ds = ds.withColumn(
    "decoded_url",
    functions.regexp_replace(functions.col("encoded_url"), "^https%3A%2F%2F", "https://"));
regexp_replace requires that we pass a Column object as the first argument, but nothing requires that the column even exist on any Dataset, because Column objects are basically instructions for how to compute a column; they don't actually contain any real data themselves. To illustrate this principle, we could write the above snippet as:
...
import org.apache.spark.sql.functions;

Dataset<Row> ds = ... // reading from your csv file
Column myColExpression = functions.regexp_replace(functions.col("encoded_url"), "^https%3A%2F%2F", "https://");
ds = ds.withColumn("decoded_url", myColExpression);
If you wanted, you could reuse myColExpression on other datasets that have an encoded_url column.
Suggestion
If you haven't already, you should familiarize yourself with the org.apache.spark.sql.functions class. It's a utility class that's effectively the Spark standard library for transformations.

What is the lambda function doing in the info_dict parameter of the summary_col in this code?

I'm running summary statistics for a group of standard OLS regressions. The code was written by my professor and I'm trying to figure out what's going on specifically in a portion of the code.
summary_col(
    [reg0, reg1, reg2, reg3],
    stars=True,
    float_format='%0.2f',
    info_dict={
        'N': lambda x: "{0:d}".format(int(x.nobs)),
        'R2': lambda x: "{:.2f}".format(x.rsquared)
    })
I looked up lambda functions. I have a fairly decent understanding of how they work. Aspects of the code that I do understand:
info_dict is a dictionary of values that can be called if you wish to include them in your summary statistics
Lambda functions work by declaring an anonymous function ("lambda x"), then you place the ":" and state the operation you want to take place (i.e. x + 5), and then, if you already know what parameters you want it to run on, you can put them in after a second ":".
{0:d} will format as an integer, which makes perfect sense for observations, although I don't know why you can't just say {%.f}. Maybe it's because the former returns an explicit int and the latter returns a float that looks like an int.
{:.2f} will return a float with 2 decimal places.
What I don't fully understand is what somestring.format() does. Somehow x is getting defined as the results of a regression, I believe, and x.nobs is the "number of observations" attribute; similarly for x.rsquared.
Could someone fill in the gaps for me about what's going on in the formula? What exactly about the lambda function is enabling it to fetch data for each individual regression?
Let's break this out a little bit to make it obvious what is happening:
summary_col(
    [reg0, reg1, reg2, reg3],
    stars=True,
    float_format='%0.2f',
    info_dict={
        'N': lambda x: "{0:d}".format(int(x.nobs)),
        'R2': lambda x: "{:.2f}".format(x.rsquared)
    }
)
The summary_col function is taking in some input, the first argument being a list of regression result objects, [reg0, reg1, reg2, reg3]. Then there are three named arguments: stars, float_format, and info_dict. When we pass in the list of regression objects as the first argument, I believe summary_col knows to apply each anonymous function to each object. So all info_dict is doing is creating a dictionary with two keys, N and R2, which map to functions that return strings. When the members x.nobs and x.rsquared are referenced in the lambda functions, they are evaluated against the regression objects because of the context in which the lambdas are used.
If you try to use lambda in that line of code on something that does not exist in the regression objects, you'll almost certainly get an error. The key is in the context against which the lambda is applied.
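To see this concretely, here is a small self-contained sketch (the toy data and variable names are made up; it assumes numpy and statsmodels are installed):
import numpy as np
import statsmodels.api as sm
from statsmodels.iolib.summary2 import summary_col

# Two toy OLS fits; each fitted result exposes .nobs and .rsquared,
# which is exactly what the info_dict lambdas read.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(size=100)
X = sm.add_constant(x)
reg0 = sm.OLS(y, X).fit()
reg1 = sm.OLS(y + 1, X).fit()

print(summary_col(
    [reg0, reg1],
    stars=True,
    float_format='%0.2f',
    info_dict={
        'N': lambda x: "{0:d}".format(int(x.nobs)),    # called once per result object
        'R2': lambda x: "{:.2f}".format(x.rsquared),
    }
))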
A good example of how context determines what a lambda receives is iterating over a dictionary and sorting by value and key.
# sort the dict by value first, and key second...
# x is inferred from the context (my_dict.items())
for key, value in sorted(my_dict.items(), key=lambda x: (x[1], x[0])):
    print(key, value)

Pandas and Fuzzy Match

Currently I have two data frames. I am trying to get a fuzzy match of client names using fuzzywuzzy's process.extractOne function. When I run the following script on sample data I get good results and no errors, but when I run it on my current data frames I get both an AttributeError and a TypeError. I am not able to provide the data for security reasons, but if anyone can figure out why I am getting errors based on the script provided, I would be much obliged.
names2 = list(dftr3['Common Name'])
names3 = dict(zip(names2, names2))

def get_fuzz_match(row):
    match = process.extractOne(row['CLIENT_NAME'], choices=names3.keys(), score_cutoff=80)
    if match:
        return names3[match[0]]
    return np.nan

dfmi4['Match Name'] = dfmi4.apply(get_fuzz_match, axis=1)
I know that not having examples makes this more difficult to troubleshoot, so I will answer any questions and edit the post to help this process along. The specific errors are:
1. AttributeError: 'dict_keys' object has no attribute 'items'
2. TypeError: expected string or buffer
The AttributeError is straightforward and to be expected, I think. Fuzzywuzzy's process.extract function, which does most of the actual work in process.extractOne, uses a try:... except: clause to determine whether to process the choices parameter as dict-like or list-like. I think you are seeing the exception because the TypeError is raised during the except: clause.
The TypeError is trickier to pin down, but I suspect it occurs somewhere in the StringProcessor class, used in the processor module, again called by extract, which uses several string methods and doesn't catch exceptions. So it seems likely that your apply call is passing something that is not a string. Is it possible that you have any empty cells?
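A possible adjustment along those lines (just a sketch, not tested against your data): pass a plain list as choices and skip non-string cells so the string processor never sees NaN:
# Sketch: avoid passing dict_keys as choices and guard against NaN / non-string cells.
def get_fuzz_match(row):
    name = row['CLIENT_NAME']
    if not isinstance(name, str):      # empty cells come through as float NaN
        return np.nan
    match = process.extractOne(name, choices=list(names3), score_cutoff=80)
    return names3[match[0]] if match else np.nan

dfmi4['Match Name'] = dfmi4.apply(get_fuzz_match, axis=1)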