Update some rows of a dataframe or create a new dataframe in PySpark

I am new to PySpark, and my objective is to use a PySpark script in AWS Glue to:
read a dataframe from an input file in Glue => done
change columns of some rows which satisfy a condition => facing issue
write the updated dataframe with the same schema to S3 => done
The task seems very simple, but I could not find a way to complete it, and I keep facing different issues as I change the code.
So far, my code looks like this:
Transform2.printSchema()  # input schema after reading
Transform2 = Transform2.toDF()

def updateRow(row):
    # my logic to update row based on a global condition
    # if row["primaryKey"] == "knownKey": row["otherAttribute"] = None
    return row

LocalTransform3 = []  # creating new dataframe from Transform2
for row in Transform2.rdd.collect():
    row = row.asDict()
    row = updateRow(row)
    LocalTransform3.append(row)
print(len(LocalTransform3))

columns = Transform2.columns
Transform3 = spark.createDataFrame(LocalTransform3).toDF(*columns)
print('Transform3 count', Transform3.count())
Transform3.printSchema()
Transform3.show(1, truncate=False)

Transform4 = DynamicFrame.fromDF(Transform3, glueContext, "Transform3")
print('Transform4 count', Transform4.count())
I tried multiple things, like:
using map with a lambda to update existing rows
using collect()
using createDataFrame() to create a new dataframe
But I faced errors at the steps below:
not able to create a new updated RDD
not able to create a new dataframe from the RDD using the existing columns
Some of the errors I got in Glue, at different stages:
ValueError: Some of types cannot be determined after inferring
ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. Traceback (most recent call last):
Any working code snippet or help is appreciated.

from pyspark.sql.functions import col, lit, when
Transform2 = Transform2.toDF()
withKeyMapping = Transform2.withColumn('otherAttribute', when(col("primaryKey") == "knownKey", lit(None)).otherwise(col('otherAttribute')))
This should work for your use-case.
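To then write the result back to S3 through Glue, you can convert the updated DataFrame into a DynamicFrame again, just as your own Transform4 step does; a minimal sketch (assuming glueContext is the job's GlueContext):
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col, lit, when

withKeyMapping = Transform2.withColumn(
    'otherAttribute',
    when(col("primaryKey") == "knownKey", lit(None)).otherwise(col('otherAttribute'))
)
# convert back to a DynamicFrame so the rest of the Glue job (the S3 write) can use it
Transform4 = DynamicFrame.fromDF(withKeyMapping, glueContext, "Transform4")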

Query returns values that don't exist in a PySpark DataFrame

Is there a way to create a subset dataframe from a dataframe and be sure that its values will be used afterward?
I have a huge PySpark Dataframe like this (simplified example):
id  timestamp   value
1   1658919602  5
1   1658919604  9
2   1658919632  2
Now I want to take a sample from it to test something, before running on the entire Dataframe. I get a sample by:
# Big dataframe
df = ...
# Create sample
df_sample = df.limit(10)
df_sample.show() shows some values.
Then I run the command below, and sometimes it returns values that are present in df_sample, and sometimes it returns values that are not present in df_sample but are in df.
df_temp = df_sample.sort(F.desc('timestamp')).groupBy('id').agg(F.collect_list('value').alias('newcol'))
It is as if it's not using df_sample but instead picking 10 rows from df in a non-deterministic way.
Interestingly, if I run df_sample.show() afterwards, it shows the same values as when it was first called.
Why is this happening?
Here's full code:
# Big dataframe
df = ...
# Create sample
df_sample = df.limit(10)
# shows some values
df_sample.show()
# run query
df_temp = df_sample.sort(F.desc('timestamp')).groupBy('id').agg(F.collect_list('value').alias('newcol'))
# df_temp sometimes shows values that are present in df_sample, but sometimes shows values that aren't present in df_sample but in df
df_temp.show()
# Shows the exact same values as when it was first called
df_sample.show()
Edit1: I understand that Spark is lazy, but is there any way to force it to not be lazy in this scenario?
We can use the sample function provided by Spark to achieve this. Every time you run sample() it returns a different set of records. To regenerate the same sample on every run (for example, to compare results against a previous run), use the same seed value each time.
df=spark.range(100)
# Execute first time
print(df.sample(0.1,123).collect())
# Execute Second time with same seed-123
print(df.sample(0.1,123).collect())
# Execute with different seed-456
print(df.sample(0.1,456).collect())
Refer to the Spark docs
Stratified sampling in Spark
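For the stratified case those references cover, DataFrame.sampleBy lets you fix a sampling fraction per stratum; a minimal sketch (the key column and fractions here are just for illustration):
from pyspark.sql import functions as F

df = spark.range(100).withColumn("key", (F.col("id") % 3).cast("int"))
# keep 50% of the key==0 stratum and 10% of key==1; strata not listed default to 0.0
sampled = df.sampleBy("key", fractions={0: 0.5, 1: 0.1}, seed=123)
sampled.groupBy("key").count().show()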
What worked was using df_sample = df.limit(10).cache() or df_sample = df.limit(10).persist(). Samkart's comment pointed me in this direction.
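Putting the fix together with the original query, the flow looks roughly like this (a sketch; the cache simply pins down which 10 rows limit() picked, so later actions reuse them instead of re-evaluating the plan):
from pyspark.sql import functions as F

df_sample = df.limit(10).cache()   # or .persist()
df_sample.show()                   # first action materializes and caches the 10 rows

# later actions now reuse the cached rows instead of re-running limit() on df
df_temp = (df_sample
           .sort(F.desc('timestamp'))
           .groupBy('id')
           .agg(F.collect_list('value').alias('newcol')))
df_temp.show()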

Failed exporting df.to_csv using a variable name in the path

I am using a function MyFunction(DataName) that creates a pd.DataFrame(). After certain modifications to the data, I am able to export the dataframe to CSV with this code:
df.to_csv(r'\\kant\kjemi-u1\izarc\pc\Desktop\out.csv', index=True, header=True)
This creates an 'out.csv' file which is overwritten every time the code is run. However, when I try to give the file a specific name (for instance the name of the data used to fill the dataframe, for multiple exports) like this:
df.to_csv(fr'\\kant\kjemi-u1\izarc\pc\Desktop\{DataName}.csv', index=True, header=True)
I get this error:
---------------------------------------------------------------------------
FileNotFoundError Traceback (most recent call last)
in
----> 1 MyFunction(DataName)
I am new to the programming world, so any ideas on how I can overcome this problem are very welcome. Thank you very much!
If I understand you right (and given that the fr in your code should simply be r), you want your to_csv statement to be dynamic, with the part inside the braces changing. So, assume your dataframe is df. Then do this:
DataName = "df"
NewFinger.to_csv (r'\\kant\kjemi-u1\izarc\pc\Desktop\{}.csv'.format(DataName), index = True, header=True)
Thanks for your help. In the beginning I was confused by 'NewFinger'; I thought it was some sort of module I needed to install, and I could not find information on Google. However, I solved the issue based on your suggestion with the following code:
DataName = "whichever name"
df.to_csv (r'\\kant\kjemi-u1\izarc\pc\Desktop\{}.csv'.format(DataName), index = True, header=True)
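As a side note, the fr-string from the question is valid syntax on Python 3.6+; the FileNotFoundError usually points at the resulting path (for example a DataName containing characters the filesystem rejects) rather than at the string formatting itself. A small sketch to check what is actually being written:
DataName = "whichever_name"   # hypothetical value; use your real data name here
path = fr'\\kant\kjemi-u1\izarc\pc\Desktop\{DataName}.csv'
print(path)                   # inspect the path before writing
df.to_csv(path, index=True, header=True)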

Prepare a csv file for process mining

Hope you are doing well!
I was following tutorials for process mining using 'PM4PY', but I ran into difficulties with the CSV file.
In my CSV file I have these columns: 'id', 'status', 'mailID', 'date', ... ('status' is the same as 'activity'; it contains some specific choices).
My CSV file contains a lot of data.
To follow the process mining tutorial my columns must be named something like 'case:concept:name', ... but I don't know how to do that.
In your case, I assume 'id' would be the same as the Case ID in normal process mining terminology. Similarly, 'status' corresponds to Activity ID and 'date' would correspond to the timestamp.
The best option is to first read into a pandas dataframe before feeding into PM4Py.
For a detailed understanding of how to do this, here is an example. As you have not mentioned all the columns in your CSV file, let us assume that you currently have only ['id', 'status', 'date'] as your column list. The following code can be adapted to any number of columns (by adding them to the list named cols):
import pandas as pd
from pm4py.objects.conversion.log import converter as log_converter

path = ''  # Enter path to the csv file
data = pd.read_csv(path)

# rename the columns to the names PM4Py expects, then fix the dtypes
cols = ['case:concept:name', 'concept:name', 'time:timestamp']
data.columns = cols
data['time:timestamp'] = pd.to_datetime(data['time:timestamp'])
data['concept:name'] = data['concept:name'].astype(str)

log = log_converter.apply(data, variant=log_converter.Variants.TO_EVENT_LOG)
Here we have changed the column names and their datatypes as required by the PM4Py package, and converted the dataframe into an event log using the log_converter function. Now you can perform your regular process mining tasks on this event log object. For instance, if you wish to create a Directly-Follows Graph from the event log, you can use the following lines of code:
from pm4py.algo.discovery.dfg import algorithm as dfg_algorithm
dfg = dfg_algorithm.apply(log)
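Since the question mentions extra columns such as 'mailID', renaming by name (rather than assigning a positional list) may be less error-prone; a minimal sketch assuming the column names from the question:
import pandas as pd

data = pd.read_csv('')  # enter path to the csv file
data = data.rename(columns={
    'id': 'case:concept:name',    # case identifier
    'status': 'concept:name',     # activity
    'date': 'time:timestamp',     # timestamp
})
data['time:timestamp'] = pd.to_datetime(data['time:timestamp'])
data['concept:name'] = data['concept:name'].astype(str)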
First you need to import your CSV file using pandas, then convert it to an event log object, and finally you can use it in PM4Py.
Reference:
https://pm4py.fit.fraunhofer.de/documentation

How to import Pandas data frames in a loop [duplicate]

So what I'm trying to do is the following:
I have 300+ CSVs in a certain folder. What I want to do is open each CSV and take only the first row of each.
What I wanted to do was the following:
import os
list_of_csvs = os.listdir() # puts all the names of the csv files into a list.
The above generates a list for me like ['file1.csv','file2.csv','file3.csv'].
This is great and all, but where I get stuck is the next step. I'll demonstrate this using pseudo-code:
import pandas as pd
for index, file in enumerate(list_of_csvs):
    df{index} = pd.read_csv(file)
Basically, I want my for loop to iterate over my list_of_csvs object and read the first item into df1, the 2nd into df2, etc. But upon trying this I realized I have no idea how to change the variable being assigned when the assignment happens inside an iteration!
That's what prompts my question. I managed to find another way to get my original job done, but this issue of variable assignment over an iteration is something I haven't been able to find clear answers on.
If I understand your requirement correctly, we can do this quite simply. Let's use pathlib (added in Python 3.4+) instead of os.
from pathlib import Path
import pandas as pd

csvs = Path.cwd().glob('*.csv')  # creates a generator expression
# swap Path.cwd() for Path(your_path) if the script is in a different location

dfs = {}  # let's hold the csv's in this dictionary
for file in csvs:
    dfs[file.stem] = pd.read_csv(file, nrows=3)  # change nrows [number of rows] to your spec

# or with a dict comprehension
dfs = {file.stem: pd.read_csv(file) for file in Path(r'location\of\your\files').glob('*.csv')}
This will return a dictionary of dataframes, with the key being the CSV file name; .stem gives the name without the extension.
Much like:
{
    'csv_1': dataframe,
    'csv_2': dataframe
}
If you want to concat these, then do:
df = pd.concat(dfs)
The index will be the CSV file name.
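Since the original goal was only the first row of each CSV, the same pattern with nrows=1 plus a final concat would look roughly like this (a sketch; adjust the glob path to your folder):
import pandas as pd
from pathlib import Path

# read only the first data row of every CSV, keyed by file name (without extension)
firsts = {f.stem: pd.read_csv(f, nrows=1) for f in Path.cwd().glob('*.csv')}

# one dataframe whose first index level is the file name
df_first_rows = pd.concat(firsts)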

Trying to load an hdf5 table with dataframe.to_hdf before I die of old age

This sounds like it should be REALLY easy to answer with Google, but I'm finding it impossible to answer the majority of my nontrivial pandas/pytables questions this way. All I'm trying to do is to load about 3 billion records from about 6000 different CSV files into a single table in a single HDF5 file. It's a simple table, 26 fields, a mixture of strings, floats and ints. I'm loading the CSVs with df = pandas.read_csv() and appending them to my hdf5 file with df.to_hdf().
I really don't want to use df.to_hdf(data_columns = True) because it looks like that will take about 20 days versus about 4 days for df.to_hdf(data_columns = False). But apparently when you use df.to_hdf(data_columns = False) you end up with some pile of junk that you can't even recover the table structure from (or so it appears to my uneducated eye). Only the columns that were identified in the min_itemsize list (the 4 string columns) are identifiable in the hdf5 table; the rest are being dumped by data type into values_block_0 through values_block_4:
table = h5file.get_node('/tbl_main/table')
print(table.colnames)
['index', 'values_block_0', 'values_block_1', 'values_block_2', 'values_block_3', 'values_block_4', 'str_col1', 'str_col2', 'str_col3', 'str_col4']
And any query like df = pd.DataFrame.from_records(table.read_where(condition)) fails with the error "Exception: Data must be 1-dimensional".
So my questions are: (1) Do I really have to use data_columns = True, which takes 5x as long? I was expecting to do a fast load and then index just a few columns after loading the table. (2) What exactly is this pile of garbage I get when using data_columns = False? Is it good for anything if I need my table back with query-able columns? Is it good for anything at all?
This is how you can create an HDF5 file from CSV data using pytables. You could also use a similar process to create the HDF5 file with h5py.
Use a loop to read the CSV files with np.genfromtxt into a NumPy array.
After reading the first CSV file, write the data with the .create_table() method, referencing the NumPy array created in Step 1.
For additional CSV files, write the data with the .append() method, referencing the NumPy array created in Step 1.
End of loop.
Updated on 6/2/2019 to read a date field (mm/dd/YYYY) and convert it to a datetime object. Note the changes to the genfromtxt() arguments! The data used is added below the updated code.
import numpy as np
import tables as tb
from datetime import datetime

csv_list = ['SO_56387241_1.csv', 'SO_56387241_2.csv']
my_dtype = np.dtype([('a', int), ('b', 'S20'), ('c', float), ('d', float), ('e', 'S20')])

with tb.open_file('SO_56387241.h5', mode='w') as h5f:
    for PATH_csv in csv_list:
        csv_data = np.genfromtxt(PATH_csv, names=True, dtype=my_dtype, delimiter=',', encoding=None)
        # modify date in fifth field 'e'
        for row in csv_data:
            datetime_object = datetime.strptime(row['my_date'].decode('UTF-8'), '%m/%d/%Y')
            row['my_date'] = datetime_object
        if h5f.__contains__('/CSV_Data'):
            dset = h5f.root.CSV_Data
            dset.append(csv_data)
        else:
            dset = h5f.create_table('/', 'CSV_Data', obj=csv_data)
        dset.flush()
# the with-block closes the file, so no explicit h5f.close() is needed
Data for testing:
SO_56387241_1.csv:
my_int,my_str,my_float,my_exp,my_date
0,zero,0.0,0.00E+00,01/01/1980
1,one,1.0,1.00E+00,02/01/1981
2,two,2.0,2.00E+00,03/01/1982
3,three,3.0,3.00E+00,04/01/1983
4,four,4.0,4.00E+00,05/01/1984
5,five,5.0,5.00E+00,06/01/1985
6,six,6.0,6.00E+00,07/01/1986
7,seven,7.0,7.00E+00,08/01/1987
8,eight,8.0,8.00E+00,09/01/1988
9,nine,9.0,9.00E+00,10/01/1989
SO_56387241_2.csv:
my_int,my_str,my_float,my_exp,my_date
10,ten,10.0,1.00E+01,01/01/1990
11,eleven,11.0,1.10E+01,02/01/1991
12,twelve,12.0,1.20E+01,03/01/1992
13,thirteen,13.0,1.30E+01,04/01/1993
14,fourteen,14.0,1.40E+01,04/01/1994
15,fifteen,15.0,1.50E+01,06/01/1995
16,sixteen,16.0,1.60E+01,07/01/1996
17,seventeen,17.0,1.70E+01,08/01/1997
18,eighteen,18.0,1.80E+01,09/01/1998
19,nineteen,19.0,1.90E+01,10/01/1999
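Back on the original pandas question: to_hdf's data_columns argument also accepts a list, so you could make only the few columns you plan to query individually searchable instead of all 26, which is usually much cheaper than data_columns = True. A hedged sketch (file, key, and column names are placeholders based on the question):
import pandas as pd

for path in csv_files:  # csv_files: your ~6000 CSV paths
    df = pd.read_csv(path)
    df.to_hdf('all_records.h5', key='tbl_main',
              format='table', append=True,
              data_columns=['str_col1', 'str_col2'],          # only these become query-able
              min_itemsize={'str_col1': 32, 'str_col2': 32})  # fixed widths for appended strings

# later, query on one of the declared data columns
subset = pd.read_hdf('all_records.h5', key='tbl_main', where="str_col1 == 'some_value'")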