Dataframe turns empty after writing to mount

I have a dataframe called updatesDF as follows,
+-------+---------+--------------+-------------------+-----------+-------------------+-----------+
|PartyID|      TIN|SourceSystemID|      ODSInsertDate|ODSInsertBy|      ODSUpdateDate|ODSUpdateBy|
+-------+---------+--------------+-------------------+-----------+-------------------+-----------+
|  11111|222222222|             1|2021-07-20 01:56:25|      sneha|2021-07-20 01:56:25|      sneha|
+-------+---------+--------------+-------------------+-----------+-------------------+-----------+
So updatesDF.show() gives me the above output. Now I need to write this dataframe to a mount path:
updatesDF.write.format('delta').mode('append').save('/mnt/Sneha/Updates/')
So as soon as I write into this location, the updatesDF turns blank like this
+-------+---+--------------+-------------+-----------+-------------+-----------+
|PartyID|TIN|SourceSystemID|ODSInsertDate|ODSInsertBy|ODSUpdateDate|ODSUpdateBy|
+-------+---+--------------+-------------+-----------+-------------+-----------+
+-------+---+--------------+-------------+-----------+-------------+-----------+
There are no other steps in between, and I also tried taking a backup of this DF... both the backup and the original DF turn empty after the append step. Please help.

It's a weird situation; it looks like memory saturation.
Can you try updatesDF.persist() or .cache() just after creation?
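A minimal sketch of that suggestion, assuming the updatesDF and mount path from the question (this is a suggestion to try, not a confirmed fix):
# Persist and materialize the rows before the Delta write.
updatesDF = updatesDF.persist()   # or updatesDF.cache()
updatesDF.count()                 # forces the cached rows to materialize

updatesDF.write.format('delta').mode('append').save('/mnt/Sneha/Updates/')
updatesDF.show()                  # cached rows should still be visible here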

Related

Impossible to get post transform statistics by split

I'm running a simple penguin pipeline in interactive mode with a train/eval split. The Transform step runs, but I can't get the post_transform_statistics artifacts.
Inside the dedicated artifacts folder /tmp/tfx-penguin_custom_INTERACTIVE-nq5dn56x/Transform/post_transform_stats/5, I have just one FeaturesStats.pb, but not the subfolders Split-train and Split-eval with a FeaturesStats.pb inside each.
However, I do have those subfolders inside the artifacts dedicated to the transformed examples (/tmp/tfx-penguin_custom_INTERACTIVE-nq5dn56x/Transform/transformed_examples/5/).
Here is how I define the Transform component, explicitly providing splits and also disable_statistics=False:
transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    disable_statistics=False,
    splits_config=transform_pb2.SplitsConfig(
        analyze=['train'], transform=['train', 'eval']),
    module_file=_transformer_module_file)
I went through the docstring and even the __init__ of the component (https://github.com/tensorflow/tfx/blob/master/tfx/components/transform/component.py); it seems there is nothing I have forgotten or mistaken, but I was very confused to read the following comment, which points to a location for the stats that I cannot find:
disable_statistics: If True, do not invoke TFDV to compute pre-transform
and post-transform statistics. When statistics are computed, they will
be stored in the `pre_transform_feature_stats/` and
`post_transform_feature_stats/` subfolders of the `transform_graph`
export.
For now, the workaround is to explicitly disable stats in the Transform component and define, next to it, a dedicated statistics component that works on the transformed feature splits, but it would have been great to have the per-split statistics inside the Transform component directly.
Thanks for any help.
This is expected, as the statistics generation inside Transform currently works on the entire transformed dataset regardless of split/span.
To generate separate statistics for different splits, please use the StatisticsGen component.
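A minimal sketch of that suggestion, assuming the transform component defined in the question and an InteractiveContext named context from the interactive run:
# Sketch only: a separate StatisticsGen wired to the Transform output.
statistics_gen = tfx.components.StatisticsGen(
    examples=transform.outputs['transformed_examples'])
context.run(statistics_gen)   # produces per-split statistics artifacts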
Thank you!

How to "single File Save" empty dataframe in Spark?

I have a job which processes files and then lands them as single CSVs in a blob storage container. The problem I face is that I also need to land empty files, which contain only the header. How can this be achieved when I use .saveSingleFile?
Example code snippet:
df.coalesce(1)
  .write
  .options(configuration.readAndWriteOptions)
  .partitionBy(INGESTION_TIME)
  .format("csv")
  .mode("append")
  .saveSingleFile(path.toString)
Example readAndWriteOptions:
{"sep": ";", "header": "true"}
In other words:
In the above case, if df.show() displays only a header, no CSV file is written. However, I want to output a CSV file with no data but with the column names. Is there an option which would allow this? Both cases need to be possible, with and without data, so something like .take(1) will not be a sufficient solution.
Update:
Looks like this is related to a Spark API bug and should have been resolved with version 3.
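A rough PySpark sketch of what the update implies (the question's snippet is Scala, but the behavior is the same): on Spark 3.x, writing an empty DataFrame with the header option enabled should emit a header-only file. The schema and output path below are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Placeholder schema; on Spark 3.x the write below should produce a CSV
# part file containing only the header row.
schema = StructType([
    StructField("col_a", StringType()),
    StructField("col_b", StringType())])
empty_df = spark.createDataFrame([], schema)

(empty_df.coalesce(1)
    .write
    .option("header", "true")
    .option("sep", ";")
    .mode("append")
    .csv("/tmp/header_only_output"))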

Correct Malformed CSV and pull corrected data back into a dataframe

UPDATE BELOW.....
We have automated CSV data dumps into our backend, and it looks like there are some malformed items buried in the data. There is a job family title that errantly has a \n between two words, which is wrecking our data, so that's the problem.
I want to read in the CSV as wholetext, regexp_replace the title with the correction, then load this fixed wholetext into a new dataframe as if I had loaded a correct CSV to start with. Here's the madness of where I'm at right now: Lol.
# Import the functions I need
from pyspark.sql.functions import col, regexp_replace
# Looks like there is a job family title with an issue. There's a carriage return / line feed between two words messing up the csv
# This needs to be patched before we actually pull the data into the dataframes to begin work
data_requisitions_patch0 = spark.read.text('abfss://container#somethingcool.dfs.core.windows.net/Data/brokencsv.csv', wholetext=True)
data_requisitions_patch0.collect()
data_requisitions_new = data_requisitions_patch0
# print(data_requisitions_patch0)
# data_requisitions_patch0.printSchema()
# data_requisitions_patch0.show()
data_requisitions_patch1 = data_requisitions_patch0 \
.withColumn("value", regexp_replace(col('value'), 'Job - Starting\n', 'Job - Starting'))
data_requisitions_patch1.collect()
print('patch0')
data_requisitions_new.count()
print('patch1')
data_requisitions_patch1.count()
# print('Patch0 dataframe: ' + data_requisitions_patch0.count())
# print('Patch0 dataframe: ' + data_requisitions_patch1.count())
# data_requisitions_test0 = spark(data_requisitions_patch1, header=True)
# data_requisitions_test1 = spark.read.csv('abfss://container#somethingcool.dfs.core.windows.net/Data/brokencsv.csv', header=True)
# data_requisitions_test0.count()
# data_requisitions_test0.printSchema()
# data_requisitions_test1.count()
# data_requisitions_test1.printSchema()
It's obviously a mess right now. I'm trying to troubleshoot whether the regexp_replace is working, but I'm not having much luck. Then it occurred to me that I have a single-row, single-column dataframe. Now I'm attempting to figure out how to take the dataframe after the 'patch' and turn it back into a normal CSV-style dataframe, as if everything had been fine to begin with.
I left in all my testing nonsense; the thought was that you might see where my head is at... Unsure if that was helpful or not. Links have been faked, obviously.
First off: am I going in the right direction? No part of this is really working. I can't even get the counts to work: test1.count() does return, but test0.count() doesn't? I don't even really care about the counts; that's just me trying to figure out why it's not working.
Secondly: malformed CSV -> wholetext dataframe -> regexp fix -> fixed dataframe with correct headers and rows, like normal.
How off am I?
=======
UPDATE
Made some great progress: I ended up splitting the wholetext dataframe on \n line feeds and exploding that into rows. That works great, and now the dataframe has exactly as many rows as it's supposed to have. I'm now working on figuring out how to re-map the columns so they get created correctly.
My thought is to take the header row and try to use that as a map? I don't know, still researching.
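A minimal sketch of that flow, assuming a comma delimiter, a placeholder path, and that the header comes out as the first exploded row of the single-partition read:
from pyspark.sql import functions as F

raw = spark.read.text('/tmp/brokencsv.csv', wholetext=True)

# Patch the errant line feed inside the title.
patched = raw.withColumn(
    'value',
    F.regexp_replace(F.col('value'), 'Job - Starting\n', 'Job - Starting'))

# Explode the single wholetext row back into one row per CSV line.
lines = (patched
         .select(F.explode(F.split(F.col('value'), '\n')).alias('line'))
         .filter(F.col('line') != ''))

# Promote the first line to column names and split the rest into fields.
header_line = lines.first()['line']
header = header_line.split(',')
fixed = (lines.filter(F.col('line') != header_line)
              .select(F.split(F.col('line'), ',').alias('f'))
              .select(*[F.col('f')[i].alias(name) for i, name in enumerate(header)]))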
I wasn't approaching this right... I was handling it like a typical C# project: pull data from the db and process it. But that approach doesn't really work well here. I ended up putting the processed data into the dataframe itself and ran my if-checks on the contained columns. Works fantastic, and it's a lot faster than trying to extract the data to do the checks.

Missing data in SFrame

I'm trying to use graphlab.linear_regression.create and I get an error that I have missing data in the column I am using to predict my model; it says to use dropna to fix the problem. I use dropna, but it doesn't get rid of any of the rows with missing values. I am typing ti_train.dropna() to try to drop the missing data, and age_model = graphlab.linear_regression.create(ti_train, target='Survived', features=['Age'], validation_set=None) for my linear regression. I've also tried fillna with ti_train.fillna('Age', np.median(ti_train['Age'])). I got my data by reading a csv file into an SFrame. Thank you
Just executing this code won't modify your SFrame (assuming you are running an IPython notebook with SFrame):
ti_train.fillna('Age', np.median(ti_train['Age']))
Have you tried this (assigning the result back, so the SFrame is actually modified) and then running the regression?
ti_train = ti_train.fillna('Age', np.median(ti_train['Age']))
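The same point applies to dropna. A minimal sketch assuming GraphLab Create and a hypothetical train.csv; both calls return a new SFrame, so the result has to be reassigned:
import graphlab
import numpy as np

ti_train = graphlab.SFrame.read_csv('train.csv')   # hypothetical file name

# Either drop the rows where Age is missing (dropna returns a new SFrame)...
ti_train = ti_train.dropna(columns=['Age'])

# ...or fill the missing ages with the median instead:
# ti_train = ti_train.fillna('Age', np.median(ti_train['Age'].dropna()))

age_model = graphlab.linear_regression.create(
    ti_train, target='Survived', features=['Age'], validation_set=None)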

How do I add a pandas object (e.g. DataFrame) to a group within an HDF file?

Suppose I have an HDF5 file (myHDF.h5) with a hierarchy of groups, something like:
/root/groupA
/groupB
Now I want to add a DataFrame (myFrame) to groupA (along with some other objects such as dictionaries). How do I do that? If I open myHDF.h5 with pandas.io.HDFStore:
store = pandas.io.HDFStore('myHDF.h5')
and then try:
store['groupA']['myFrame'] = myFrame
I get:
AttributeError: Attribute 'pandas_type' does not exist in node: '/groupA'
What is the proper way to do this?
This is enabled as of version 0.10.0:
http://pandas.pydata.org/pandas-docs/stable/io.html#hierarchical-keys
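For example, a key containing '/' stores the frame under the corresponding group (sketch with an illustrative DataFrame, assuming pandas >= 0.10.0):
import numpy as np
import pandas as pd

myFrame = pd.DataFrame(np.random.randn(3, 2), columns=['a', 'b'])

store = pd.HDFStore('myHDF.h5')
store['groupA/myFrame'] = myFrame      # lands under the /groupA group
retrieved = store['groupA/myFrame']    # read it back with the same key
store.close()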
Currently pandas does not support hierarchical paths as you specified.
There is an open GitHub issue about this: https://github.com/pydata/pandas/issues/13
I'm not sure when we will get around to adding this feature; we would more than welcome a pull request if you're interested in completing the skeleton code that's in the issue discussion.