Applying a user-defined function to a PySpark dataframe and returning a dictionary

Suppose I have a pandas dataframe called df:
id  value1  value2
1   2       1
2   2       1
3   4       5
In plain Python, I wrote a function to process this dataframe and return a dictionary:
d = dict()
for row in df.itertuples():
    x = do_something(row)
    d[x[0]] = x[1:]
I am trying to reimplement this function using Spark.
d = dict()  # define a global var
def do_something(id, value1, value2):
    # business logic
    d[x0] = [x1, x2, x3]
    return 0
udf_do = udf(do_something)
then:
df_spark.select(udf_do('id', 'value1', 'value2'))
My idea is that calling df_spark.select will run do_something over every row of the dataframe, and the function will update the global variable d. I don't care about the return value of udf_do, so I return 0.
However, my solution does not work.
Could you suggest a way to iterate through a Spark dataframe (I know this is not the Spark way), or otherwise process it, and update an external dictionary?
Note that the dataframe is quite large. I tried converting it to pandas by calling toPandas(), but I ran out of memory (OOM).

A UDF cannot update any global state, because it runs in worker processes on the executors, so changes to d never reach the driver. However, you can do some business logic inside the UDF and then use toLocalIterator to bring all the data to the driver in a memory-efficient way (partition by partition). For example:
df = spark.createDataFrame([(10, 'b'), (20, 'b'), (30, 'c'),
                            (40, 'c'), (50, 'c'), (60, 'a')], ['col1', 'col2'])
df = df.withColumn('udf_result', ......)  # withColumn returns a new dataframe
df.cache()
df.count()  # force cache fill
for row in df.toLocalIterator():
    print(row)

Related

Concatenate multiple dfs with the same dimensions, apply a function to the cell values of all dfs, and store the result in the cell

df1 = pd.DataFrame(np.random.randint(0,9,size=(2, 2)))
df2 = pd.DataFrame(np.random.randint(0,9,size=(2, 2)))
Let's say that I concatenate df1 and df2 (in the real case I have many dfs of size 700*200) so that I get something like the table below (I don't need to see this table; it is just for explanation):
        col a   col b
row a   [1,4]   [7,8]
row b   [9,2]   [2,0]
Then I want to pass each cell's values to the compute function below and store the result in that cell:
def compute(row, column, cell_values):
    baseline_df = [2, 4, 6, 7, 8]
    result = baseline_df
    for values in cell_values:
        if (column-row) != dict[values]:  # dict contains specific values
            result = baseline_df
        else:
            result = result.apply(func, value=values)
    return result.loc[column-row]

def func(df, value):
    # operation
    result_df = df*value
    return result_df
What I want is to take df1 and df2, concatenate them, apply the function above, and get the results, as fast as possible.
In the actual use case the dfs are quite big, and running this for every cell takes a significant amount of time, so I need a faster way to do it.
Note:
This is my idea of how to do this. I hope my requirements are clear; please let me know if they are not.
Currently I am using something like the code below, which just takes the max value of each cell so that I can do the calculation (func) later.
This only gives the max value of each cell across all the dfs:
dfs = pd.concat(grid).max(level=0)
The final result should look something like this after the calculation (the same 2D array with new cell data):
        col a   col b
row a   0.1     0.7
row b   0.9     0.6
Different approaches are also welcome
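One possible direction, sketched under the assumption that every frame has the same shape and labels: stack the frames into a single 3D NumPy array so that each cell's values lie along the last axis, then reduce over that axis in one vectorised call. The list name grid comes from the question; the random data and the use of a plain maximum as the reduction are stand-ins, since the real compute logic is not fully specified. As far as I know, recent pandas versions no longer accept the level argument in max, so the comparison below uses groupby(level=0) instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real list of equally shaped frames.
rng = np.random.default_rng(0)
grid = [pd.DataFrame(rng.integers(0, 9, size=(2, 2))) for _ in range(3)]

# Stack into shape (rows, columns, n_frames): stacked[i, j] holds one cell's values.
stacked = np.stack([frame.to_numpy() for frame in grid], axis=-1)

# Reduce over the last axis in one vectorised call, here the per-cell maximum.
cell_max = pd.DataFrame(stacked.max(axis=-1),
                        index=grid[0].index, columns=grid[0].columns)

# The same reduction written with concat and a groupby on the row labels.
via_groupby = pd.concat(grid).groupby(level=0).max()
```

Any per-cell calculation that can be written as array arithmetic over the last axis can replace the max call without looping over cells in Python.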

How to combine 2 rows that meet specific conditions from a dataframe into another dataframe?

I am new to pandas. I can successfully retrieve the 2 rows with separate 'return' statements, as below:
df = pd.read_csv ('all_time_olympic_medals.csv')
df2 = df.iloc[:-1]
return df2[df2['no_summer_golds']==df2['no_summer_golds'].max()]
return df2[df2['no_winter_golds']==df2['no_winter_golds'].max()]
The question is how to combine them into a single dataframe of shape (2, 17), as below:
>>> the_king_of_summer_winter_olympics.shape
(2, 17)
Use boolean indexing and chain the conditions with | (bitwise OR):
return df2[(df2['no_summer_golds']==df2['no_summer_golds'].max()) |
           (df2['no_winter_golds']==df2['no_winter_golds'].max())]
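As a minimal, self-contained sketch of the same idea: the small medals table below is made up for illustration (only the two gold-count column names come from the question), and each condition is stored in its own variable to make the OR easier to read.

```python
import pandas as pd

# Hypothetical data; the real file has 17 columns.
medals = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],
    'no_summer_golds': [10, 3, 7, 1],
    'no_winter_golds': [2, 9, 4, 0],
})

# One boolean mask per condition.
is_summer_best = medals['no_summer_golds'] == medals['no_summer_golds'].max()
is_winter_best = medals['no_winter_golds'] == medals['no_winter_golds'].max()

# A row is kept when either mask is True.
kings = medals[is_summer_best | is_winter_best]
print(kings)
```

Each comparison must be wrapped in parentheses (or stored in a variable first, as here), because | binds more tightly than ==.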

How to pass function parameters if using one or more parameters

Thank you in advance for your assistance.
# Create df.
import pandas as pd
d = {'dep_var': pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd']),
     'one': pd.Series([9, 23, 37, 41], index=['a', 'b', 'c', 'd']),
     'two': pd.Series([1, 6, 5, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)
   dep_var  one  two
a       10    9    1
b       20   23    6
c       30   37    5
d       40   41    4
# Define function.
def df_two(dep_var, ind_var_1, ind_var_2):
    global two
    data = {
        dep_var: df[dep_var],
        ind_var_1: df[ind_var_1],
        ind_var_2: df[ind_var_2]
    }
    two = pd.DataFrame(data)
    return two
# Execute function.
df_two("dep_var", "one", "two")
   dep_var  one  two
a       10    9    1
b       20   23    6
c       30   37    5
d       40   41    4
This works perfectly. I'm fairly new at this, and I'd like to be able to use a single function whether I pass, say, three or four parameters. Of course, with the code above I get an error message when I add another parameter.
So, as a rookie move, I defined another function that takes 3 independent variables.
def df_three(dep_var, ind_var_1, ind_var_2, ind_var_3):
    global three
    data = {
        dep_var: df[dep_var],
        ind_var_1: df[ind_var_1],
        ind_var_2: df[ind_var_2],
        ind_var_3: df[ind_var_3]
    }
    three = pd.DataFrame(data)
    return three
I've tried *args, **kwargs, mapping, and a host of other things with no luck. My sense is that I'm close, but I need a way to tell the function that there might sometimes be one, two, or three parameters, and then map those parameters to the created dataframe.
Use *args to accept a variable number of arguments:
def foo(dep_var, *args):
    global df
    data = {dep_var: df[dep_var]}
    for a in args:
        data[a] = df[a]
    return pd.DataFrame(data)
And then you can call
foo('dep_var', 'one')
foo('dep_var', 'one', 'two')
To eliminate the need for the global variable, I'd pass df to the function as well:
def foo(df, dep_var, *args):
    data = {dep_var: df[dep_var]}
    for a in args:
        data[a] = df[a]
    return pd.DataFrame(data)
More information on *args.
It sounds like you want to select only some columns from a data frame, in a certain order. You can just pass a list of the column names for that:
two[["dep_var", "one", "two"]]
If you want to, you can pack that into a function, using tuple unpacking to have a variable number of arguments.
def select(df, *columns):
    return df[list(columns)]
This should directly work with your use cases:
select(two, "dep_var", "one", "two")
select(three, "dep_var", "one", "two", "three")
Note that I also passed the data frame variable, so you don't need to rely on a global variable.
The call to list is needed, because tuple unpacking produces, well, a tuple. And using a tuple as an index to the data frame produces different results than using a list.
You might want to append a .copy() to the return line, depending on how you use the return value of this.
A variable number of arguments also includes zero, so you might want to add a check for that.
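The last two remarks can be folded into the helper. This is only a sketch: raising ValueError for an empty call is one possible choice (not something the answer prescribes), and the sample frame reuses the data from the question.

```python
import pandas as pd

def select(df, *columns):
    """Return a copy of df restricted to the given columns, in that order."""
    if not columns:
        # Assumed policy: refuse an empty selection instead of returning an empty frame.
        raise ValueError("select() needs at least one column name")
    # list(...) turns the packed tuple into a list, which selects several columns.
    return df[list(columns)].copy()

df = pd.DataFrame({'dep_var': [10, 20, 30, 40],
                   'one': [9, 23, 37, 41],
                   'two': [1, 6, 5, 4]},
                  index=['a', 'b', 'c', 'd'])

subset = select(df, 'dep_var', 'two')
print(subset)
```

Because of the .copy(), later changes to subset do not touch df.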

Error in using Pandas groupby.apply to drop duplication

I have a Pandas data frame which has some duplicate values, not rows. I want to use groupby.apply to remove the duplication. An example is as follows.
df = pd.DataFrame([['a', 1, 1], ['a', 1, 2], ['b', 1, 1]], columns=['A', 'B', 'C'])
   A  B  C
0  a  1  1
1  a  1  2
2  b  1  1
# My function
def get_uniq_t(df):
    if df.shape[0] > 1:
        df['D'] = df.C * 10 + df.B
        df = df[df.D == df.D.max()].drop(columns='D')
    return df
df = df.groupby('A').apply(get_uniq_t)
Then I get the following ValueError. The issue seems to be related to creating the new column D: if I create column D outside the function, the code seems to run fine. Can someone help explain what caused the error?
ValueError: Shape of passed values is (3, 3), indices imply (2, 3)
The problem with your code is that it attempts to modify
the original group.
The other problem is that this function should return a single row,
not a DataFrame.
Change your function to:
def get_uniq_t(df):
    iMax = (df.C * 10 + df.B).idxmax()
    return df.loc[iMax]
Then its application returns:
   A  B  C
A
a  a  1  2
b  b  1  1
Edit following the comment
In my opinion, it is not allowed to modify the original group,
as it would indirectly modify the original DataFrame.
At least it displays a warning about this and is considered a bad practice.
Search the Web for SettingWithCopyWarning for more extensive description.
My code (get_uniq_t function) does not modify the original group.
It only returns one row from the current group.
The returned row is selected based on which row returns the greatest value
of df.C * 10 + df.B. So when you apply this function, the result is a new
DataFrame, with consecutive rows equal to results of this function
for consecutive groups.
You can perform an operation equivalent to modification, when you
create some new content, e.g. as the result of groupby instruction
and then save it under the same variable which so far held the source
DataFrame.
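A variant of the same idea that avoids apply altogether, offered as a sketch rather than as the answer's own method: compute the score once for the whole frame, ask each group for the index label of its largest score, and select those rows. The names score, best_rows and deduplicated are illustrative.

```python
import pandas as pd

df = pd.DataFrame([['a', 1, 1], ['a', 1, 2], ['b', 1, 1]], columns=['A', 'B', 'C'])

# Score every row without adding a column to df.
score = df['C'] * 10 + df['B']

# For each value of A, the index label of the row with the largest score.
best_rows = score.groupby(df['A']).idxmax()

# Select those rows; the original df is left untouched.
deduplicated = df.loc[best_rows.to_numpy()].reset_index(drop=True)
print(deduplicated)
```

The result is assigned to a new name, so nothing is modified inside a group.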

Replacing Specific Values in a Pandas Column [duplicate]

I'm trying to replace the values in one column of a dataframe. The column ('female') only contains the values 'female' and 'male'.
I have tried the following:
w['female']['female']='1'
w['female']['male']='0'
But I get back exactly the same values as before.
Ideally, I would like output equivalent to applying the following logic element-wise:
if w['female'] == 'female':
    w['female'] = '1'
else:
    w['female'] = '0'
I've looked through the gotchas documentation (http://pandas.pydata.org/pandas-docs/stable/gotchas.html) but cannot figure out why nothing happens.
Any help will be appreciated.
If I understand right, you want something like this:
w['female'] = w['female'].map({'female': 1, 'male': 0})
(Here I convert the values to numbers instead of strings containing numbers. You can convert them to "1" and "0", if you really want, but I'm not sure why you'd want that.)
The reason your code doesn't work is because using ['female'] on a column (the second 'female' in your w['female']['female']) doesn't mean "select rows where the value is 'female'". It means to select rows where the index is 'female', of which there may not be any in your DataFrame.
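To make the difference between index labels and values concrete, here is a small sketch; the three-row frame is made up for illustration.

```python
import pandas as pd

w = pd.DataFrame({'female': ['female', 'male', 'female']})

# Indexing a Series with a string looks the string up among the index labels.
# The default index here is 0, 1, 2, so there is no label 'female'.
print('female' in w['female'].index)  # False

# Selecting by value needs a boolean mask instead.
mask = w['female'] == 'female'

# map translates every value through the dictionary in one step.
encoded = w['female'].map({'female': 1, 'male': 0})
print(encoded.tolist())  # [1, 0, 1]
```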
You can edit a subset of a dataframe by using loc:
df.loc[<row selection>, <column selection>]
In this case:
w.loc[w.female != 'female', 'female'] = 0
w.loc[w.female == 'female', 'female'] = 1
w.female.replace(to_replace=dict(female=1, male=0), inplace=True)
See pandas.DataFrame.replace() docs.
Slight variation:
w.female.replace(['male', 'female'], [0, 1], inplace=True)
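This should also work, although it relies on chained assignment, which can raise SettingWithCopyWarning and has no effect when pandas Copy-on-Write is enabled: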
w.female[w.female == 'female'] = 1
w.female[w.female == 'male'] = 0
This is very compact:
w['female'][w['female'] == 'female']=1
w['female'][w['female'] == 'male']=0
Another good one:
w['female'] = w['female'].replace(regex='female', value=1)
w['female'] = w['female'].replace(regex='male', value=0)
You can also use apply with a dictionary's .get method, i.e.
w['female'] = w['female'].apply({'male':0, 'female':1}.get)
For example:
w = pd.DataFrame({'female':['female','male','female']})
print(w)
Dataframe w:
   female
0  female
1    male
2  female
Using apply to replace values from the dictionary:
w['female'] = w['female'].apply({'male':0, 'female':1}.get)
print(w)
Result:
   female
0       1
1       0
2       1
Note: apply with a dictionary's .get should only be used if all the possible values of the column are defined in the dictionary; otherwise, values that are not in the dictionary come back as missing (None/NaN).
Using Series.map with Series.fillna
If your column contains more strings than only female and male, Series.map will fail in this case since it will return NaN for other values.
That's why we have to chain it with fillna:
Example why .map fails:
df = pd.DataFrame({'female':['male', 'female', 'female', 'male', 'other', 'other']})
   female
0    male
1  female
2  female
3    male
4   other
5   other
df['female'].map({'female': '1', 'male': '0'})
0      0
1      1
2      1
3      0
4    NaN
5    NaN
Name: female, dtype: object
For the correct method, we chain map with fillna, so we fill the NaN with values from the original column:
df['female'].map({'female': '1', 'male': '0'}).fillna(df['female'])
0        0
1        1
2        1
3        0
4    other
5    other
Name: female, dtype: object
Alternatively there is the built-in function pd.get_dummies for these kinds of assignments:
w['female'] = pd.get_dummies(w['female'],drop_first = True)
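This gives you a data frame with one column for each value that occurs in w['female'], of which you drop the first (because you can infer it from the ones that are left). The remaining column is named after the value it flags, so check which value was dropped: the columns are sorted, so with only 'female' and 'male' it is the 'female' column that is dropped and the remaining column marks 'male'.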
This is especially useful if you have categorical variables with more than two possible values. This function creates as many dummy variables needed to distinguish between all cases. Be careful then that you don't assign the entire data frame to a single column, but instead, if w['female'] could be 'male', 'female' or 'neutral', do something like this:
w = pd.concat([w, pd.get_dummies(w['female'], drop_first=True)], axis=1)
w.drop('female', axis = 1, inplace = True)
Then you are left with two new columns giving you the dummy coding of 'female' and you got rid of the column with the strings.
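A small sketch of the three-value case, to show which columns survive. The id column and the sample values are made up; the astype(int) call is an addition here so the result holds 0/1 regardless of the dummy dtype the installed pandas version produces.

```python
import pandas as pd

# Hypothetical frame with one extra column to keep alongside the dummies.
w = pd.DataFrame({'id': [1, 2, 3, 4],
                  'female': ['female', 'male', 'neutral', 'female']})

# One indicator column per distinct value, minus the first one in sorted order.
dummies = pd.get_dummies(w['female'], drop_first=True)
print(list(dummies.columns))  # ['male', 'neutral']

# Replace the string column with the indicator columns.
encoded = pd.concat([w.drop(columns='female'), dummies.astype(int)], axis=1)
print(encoded)
```

A row with 0 in both indicator columns is a 'female' row, which is why the dropped column carries no extra information.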
w.replace({'female':{'female':1, 'male':0}}, inplace = True)
The above code will replace 'female' with 1 and 'male' with 0, only in the column 'female'
There is also a function in pandas called factorize which you can use to automatically do this type of work. It converts labels to numbers: ['male', 'female', 'male'] -> [0, 1, 0]. See this answer for more information.
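A minimal sketch of factorize on the example from the sentence above. Note that the codes follow the order of first appearance, so which label becomes 0 depends on the data.

```python
import pandas as pd

labels = pd.Series(['male', 'female', 'male'])

# codes holds one integer per element; uniques holds the label for each code.
codes, uniques = pd.factorize(labels)
print(codes.tolist())   # [0, 1, 0]
print(list(uniques))    # ['male', 'female']
```

If a fixed assignment such as female=1 and male=0 is required, an explicit mapping is the safer choice, because factorize gives code 0 to whichever label appears first.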
w.female = np.where(w.female=='female', 1, 0)
Use this if you are looking for a NumPy solution. It is useful for replacing values based on a condition, because both the if and the else branches are built into np.where(). The solutions that use df.replace() may not be feasible if the column includes many unique values in addition to 'male', all of which should be replaced with 0.
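A short sketch of the np.where approach on a column that contains a third value; the sample data is made up for illustration.

```python
import numpy as np
import pandas as pd

w = pd.DataFrame({'female': ['female', 'male', 'other', 'female']})

# 1 where the condition holds, 0 everywhere else, whatever the other values are.
w['female'] = np.where(w['female'] == 'female', 1, 0)
print(w['female'].tolist())  # [1, 0, 0, 1]
```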
Another solution is to use df.where() and df.mask() in succession. This is because neither of them implements an else condition.
w.female.where(w.female=='female', 0, inplace=True) # replace where condition is False
w.female.mask(w.female=='female', 1, inplace=True) # replace where condition is True
dic = {'female':1, 'male':0}
w['female'] = w['female'].replace(dic)
.replace has as argument a dictionary in which you may change and do whatever you want or need.
I think an answer should point out which type of object you get in each of the methods suggested above: a Series or a DataFrame.
When you select a column with w.female or w['female'] you get back a Series; when you select with a list, such as w[['female']], you get back a DataFrame.
Both Series and DataFrame have a .replace method, so you can use it in either case.
On a Series you can also use element-wise methods like apply, map and so on.
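A quick way to check which kind of object a given selection returns is to inspect it directly. The sketch below uses a made-up three-row frame and replaces strings with strings, as in the original attempt in the question.

```python
import pandas as pd

w = pd.DataFrame({'female': ['female', 'male', 'female']})

as_series = w['female']    # a single label gives a Series
as_frame = w[['female']]   # a list of labels gives a DataFrame
print(type(as_series).__name__, type(as_frame).__name__)

# replace is available on both kinds of object.
mapping = {'female': '1', 'male': '0'}
replaced_series = as_series.replace(mapping)
replaced_frame = as_frame.replace(mapping)
```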
To answer the question more generically, so that it applies to more use cases than just what the OP asked, consider this solution. I used jfs's solution to help me. Here, we create two functions that feed each other and can be used whether you know the exact replacements or not.
import numpy as np
import pandas as pd
class Utility:
    @staticmethod
    def rename_values_in_column(column: pd.Series, name_changes: dict = None) -> pd.Series:
        """
        Renames the distinct names in a column. If no dictionary is provided for the exact name changes, it will default
        to <column_name>_count. Ex. female_1, female_2, etc.
        :param column: The column in your dataframe you would like to alter.
        :param name_changes: A dictionary of the old values to the new values you would like to change.
        Ex. {1234: "User A"} This would change all occurrences of 1234 to the string "User A" and leave the other values as they were.
        By default, this is an empty dictionary.
        :return: The same column with the replaced values
        """
        name_changes = name_changes if name_changes else {}
        new_column = column.replace(to_replace=name_changes)
        return new_column
    @staticmethod
    def create_unique_values_for_column(column: pd.Series, except_values: list = None) -> dict:
        """
        Creates a dictionary where the key is the existing column item and the value is the new item to replace it.
        The returned dictionary can then be passed to the pandas rename function to rename all the distinct values in a
        column.
        Ex. column ["statement"]["I", "am", "old"] would return
        {"I": "statement_1", "am": "statement_2", "old": "statement_3"}
        If you would like a value to remain the same, enter the values you would like to stay in the except_values.
        Ex. except_values = ["I", "am"]
        column ["statement"]["I", "am", "old"] would return
        {"old": "statement_3"}
        :param column: A pandas Series for the column with the values to replace.
        :param except_values: A list of values you do not want to have changed.
        :return: A dictionary that maps the old values to their respective new values.
        """
        except_values = except_values if except_values else []
        column_name = column.name
        distinct_values = np.unique(column)
        name_mappings = {}
        count = 1
        for value in distinct_values:
            if value not in except_values:
                name_mappings[value] = f"{column_name}_{count}"
            # Count every distinct value, so excepted values still use up a number.
            count += 1
        return name_mappings
For the OP's use case, it is simple enough to just use
w["female"] = Utility.rename_values_in_column(w["female"], name_changes={"female": 1, "male": 0})
However, it is not always so easy to know all of the different unique values within a data frame that you may want to rename. In my case, the string values for a column are hashed values so they hurt the readability. What I do instead is replace those hashed values with more readable strings thanks to the create_unique_values_for_column function.
df["user"] = Utility.rename_values_in_column(
df["user"],
Utility.create_unique_values_for_column(df["user"])
)
This will change my user column values from ["1a2b3c", "a12b3c", "1a2b3c"] to ["user_1", "user_2", "user_1"]. Much easier to compare, right?
If you have only two classes you can use equality operator. For example:
df = pd.DataFrame({'col1':['a', 'a', 'a', 'b']})
df['col1'].eq('a').astype(int)
# (df['col1'] == 'a').astype(int)
Output:
0    1
1    1
2    1
3    0
Name: col1, dtype: int64