Replacing a column in a dataframe raises ValueError: Columns must be same length as key - pandas

There is a description of using the replace method here:
https://www.geeksforgeeks.org/replace-values-of-a-dataframe-with-the-value-of-another-dataframe-in-pandas/
Unfortunately, I get an error when replacing a column of one dataframe with a column from another dataframe.
import pandas as pd
# Initialise data as dicts of lists.
colors = {
    "first_set": ["99", "88", "77", "66", "55", "44", "33", "22"],
    "second_set": ["1", "2", "3", "4", "5", "6", "7", "8"],
}
color = {
    "first_set": ["a", "b", "c", "d", "e", "f", "g", "h"],
    "second_set": ["VI", "IN", "BL", "GR", "YE", "OR", "RE", "WI"],
}
# Calling the DataFrame constructor on each dict
df = pd.DataFrame(colors, columns=["first_set", "second_set"])
df1 = pd.DataFrame(color, columns=["first_set", "second_set"])
# Display the output
display(df)
display(df1)
Here is the code that produces the error:
# replace column of one DataFrame with
# the column of another DataFrame
ser1 = df1["first_set"]
ser2 = df["second_set"]
print(ser1)
print(ser2)
df["second_set"] = df1.replace(to_replace=ser1, value=ser2)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_28648/2104797653.py in <module>
      5 print(ser1)
      6 print(ser2)
----> 7 df['second_set'] = df1.replace(to_replace=ser1, value=ser2)

~\.virtualenvs\01_python_packages-rD-UbwAe\lib\site-packages\pandas\core\frame.py in __setitem__(self, key, value)
   3600             self._setitem_array(key, value)
   3601         elif isinstance(value, DataFrame):
-> 3602             self._set_item_frame_value(key, value)
   3603         elif (
   3604             is_list_like(value)

~\.virtualenvs\01_python_packages-rD-UbwAe\lib\site-packages\pandas\core\frame.py in _set_item_frame_value(self, key, value)
   3727         len_cols = 1 if is_scalar(cols) else len(cols)
   3728         if len_cols != len(value.columns):
-> 3729             raise ValueError("Columns must be same length as key")
   3730
   3731         # align right-hand-side columns if self.columns

ValueError: Columns must be same length as key
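For reference, the traceback already points at the cause: df1.replace(to_replace=ser1, value=ser2) returns a whole two-column DataFrame, and assigning a two-column frame to the single column df['second_set'] is exactly what raises "Columns must be same length as key". If the goal is simply to overwrite one column of df with a column from df1, a minimal sketch (assuming both frames share the same index, as in the example above) is plain column assignment instead of replace:
# Overwrite df['second_set'] with df1['first_set']; values align on the shared index
df["second_set"] = df1["first_set"]
display(df)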

Related

concatenate values in dataframe if a column has specific values and None or Null values

I have a dataframe with name+address/email information based on the type. Based on a type I want to concat name+address or name+email into a new column (concat_name) within the dataframe. Some of the types are null and are causing ambiguity errors. Identifying the nulls correctly in place is where I'm having trouble.
NULL = None
data = {
'Type': [NULL, 'MasterCard', 'Visa','Amex'],
'Name': ['Chris','John','Jill','Mary'],
'City': ['Tustin','Cleveland',NULL,NULL ],
'Email': [NULL,NULL,'jdoe#yahoo.com','mdoe#aol.com']
}
df_data = pd.DataFrame(data)
#Expected resulting df column:
df_data['concat_name'] = ['ChrisTustin', 'JohnCleveland', 'Jilljdoe#yahoo.com', 'Marymdoe#aol.com']
Attempt one using booleans
if df_data['Type'].isnull() | (df_data['Type'] == 'Mastercard'):
    df_data['concat_name'] = df_data['Name'] + df_data['City']
if (df_data['Type'] == 'Visa') | (df_data['Type'] == 'Amex'):
    df_data['concat_name'] = df_data['Name'] + df_data['Email']
else:
    df_data['concat_name'] = 'Error'
Error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Attempt two using np.where
df_data['concat_name'] = np.where(df_data['Type'].isna() | (df_data['Type'] == 'MasterCard'), df_data['Name'] + df_data['City'],
                         np.where((df_data['Type'] == "Visa") | (df_data['Type'] == "Amex"), df_data['Name'] + df_data['Email'], 'Error'))
Error
ValueError: Length of values(2) does not match length of index(12000)
Does the following code solve your use case?
# == Imports needed ===========================
import pandas as pd
import numpy as np
# == Example Dataframe =========================
df_data = pd.DataFrame(
{
"Type": [None, "MasterCard", "Visa", "Amex"],
"Name": ["Chris", "John", "Jill", "Mary"],
"City": ["Tustin", "Cleveland", None, None],
"Email": [None, None, "jdoe#yahoo.com", "mdoe#aol.com"],
# Expected output:
"concat_name": [
"ChrisTustin",
"JohnCleveland",
"Jilljdoe#yahoo.com",
"Marymdoe#aol.com",
],
}
)
# == Solution Implementation ====================
df_data["concat_name2"] = np.where(
(df_data["Type"].isin(["MasterCard", pd.NA, None])),
df_data["Name"].astype(str).replace("None", "")
+ df_data["City"].astype(str).replace("None", ""),
np.where(
(df_data["Type"].isin(["Visa", "Amex"])),
df_data["Name"].astype(str).replace("None", "")
+ df_data["Email"].astype(str).replace("None", ""),
"Error",
),
)
# == Expected Output ============================
print(df_data)
# Prints:
# Type Name City Email concat_name concat_name2
# 0 None Chris Tustin None ChrisTustin ChrisTustin
# 1 MasterCard John Cleveland None JohnCleveland JohnCleveland
# 2 Visa Jill None jdoe#yahoo.com Jilljdoe#yahoo.com Jilljdoe#yahoo.com
# 3 Amex Mary None mdoe#aol.com Marymdoe#aol.com Marymdoe#aol.com
Notes
You might also consider simplifying the problem, by replacing the first condition (Type == 'MasterCard' or None) with the opposite of your second condition (Type == 'Visa' or 'Amex'):
df_data["concat_name2"] = np.where(
(~df_data["Type"].isin(["Visa", "Amex"])),
df_data["Name"].astype(str).replace("None", "")
+ df_data["City"].astype(str).replace("None", ""),
df_data["Name"].astype(str).replace("None", "")
+ df_data["Email"].astype(str).replace("None", "")
)
Additionally, if you are dealing with messy data, you can also improve the implementation by converting the Type column to lowercase, or uppercase. This makes your code also account for cases where you have values like "mastercard", or "Mastercard", etc.:
df_data["concat_name2"] = np.where(
(df_data["Type"].astype(str).str.lower().isin(["mastercard", pd.NA, None, "none"])),
df_data["Name"].astype(str).replace("None", "")
+ df_data["City"].astype(str).replace("None", ""),
np.where(
(df_data["Type"].astype(str).str.lower().isin(["visa", "amex"])),
df_data["Name"].astype(str).replace("None", "")
+ df_data["Email"].astype(str).replace("None", ""),
"Error",
),
)
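If more branches are ever needed, numpy's np.select can read a little flatter than nested np.where calls. A rough equivalent of the solution above, as a sketch under the same assumptions about the columns:
conditions = [
    df_data["Type"].astype(str).str.lower().isin(["mastercard", "none"]),
    df_data["Type"].astype(str).str.lower().isin(["visa", "amex"]),
]
choices = [
    df_data["Name"].astype(str).replace("None", "") + df_data["City"].astype(str).replace("None", ""),
    df_data["Name"].astype(str).replace("None", "") + df_data["Email"].astype(str).replace("None", ""),
]
df_data["concat_name2"] = np.select(conditions, choices, default="Error")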

Pandas: imputing descriptive stats using a groupby with a variable

I have a data frame like this:
input_df = pd.DataFrame({"sex": ["M", "F", "F", "M", "M"], "Class": [1, 2, 2, 1, 1], "Age":[40, 30, 30, 50, NaN]})
What I want to do is to impute the missing value for the age based on the sex and class columns.
I have tried doing it with a function, conditional_impute. What the function does is take a data frame and a condition, and then use it to impute the age based on the sex and class grouping. But the caveat is that the condition can either be a mean or a median, and if it is not one of these two, the function has to raise an error.
So I did this:
### START FUNCTION
def conditional_impute(input_df, choice='median'):
    my_df = input_df.copy()
    # if choice is not median or mean, raise ValueError
    if choice == "mean" or choice == "median":
        my_df['Age'] = my_df['Age'].fillna(my_df.groupby(["Sex", "Pclass"])['Age'].transform(choice))
    else:
        raise ValueError()
    # round the values in the Age column
    my_df['Age'] = round(my_df['Age'], 1)
    return my_df
### END FUNCTION
But I am getting an error when I call it.
conditional_impute(train_df, choice='mean')
What could I possibly be doing wrong? I really cannot get a handle on this.
The function itself works if you give it the right inputs. The frame in the question uses the column names sex and Class, while the function groups by "Sex" and "Pclass", and bare NaN is not defined in plain Python (use np.nan). With the input fixed to match the function, it outputs just fine:
import numpy as np
import pandas as pd
# Fixed input to match the function's column names:
df = pd.DataFrame({"Sex": ["M", "F", "F", "M", "M"], "Pclass": [1, 2, 2, 1, 1], "Age": [40, 30, 30, 50, np.nan]})
def conditional_impute(input_df, choice='median'):
    my_df = input_df.copy()
    # if choice is not median or mean, raise ValueError
    if choice == "mean" or choice == "median":
        my_df['Age'] = my_df['Age'].fillna(my_df.groupby(["Sex", "Pclass"])['Age'].transform(choice))
    else:
        raise ValueError()
    # round the values in the Age column
    my_df['Age'] = round(my_df['Age'], 1)
    return my_df
conditional_impute(df, choice='mean')
Output:
Sex Pclass Age
0 M 1 40.0
1 F 2 30.0
2 F 2 30.0
3 M 1 50.0
4 M 1 45.0
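Alternatively, if you would rather keep the question's frame as it is, renaming the columns before the call should work too. A minimal sketch, assuming the lowercase names from the question:
import numpy as np
import pandas as pd
input_df = pd.DataFrame({"sex": ["M", "F", "F", "M", "M"], "Class": [1, 2, 2, 1, 1], "Age": [40, 30, 30, 50, np.nan]})
# Rename to the names the function's groupby expects, then impute
renamed = input_df.rename(columns={"sex": "Sex", "Class": "Pclass"})
print(conditional_impute(renamed, choice='mean'))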

How to sort a multiindex dataframe by column value and maintain the multiindex structure?

I have a multiindex (TestName and TestResult.Outcome) dataframe and want to sort it descending by a column value while maintaining the visual multiindex pairs (TestName and TestResult.Outcome). How can I achieve that?
For example, I want to sort the following table descending by the column "n * %" for the TestResult.Outcome index value "Failed":
I want to achieve the following outcome, maintaining the Pass Fail pairs in the indices:
I tried this:
orderedByTotalNxPercentDesc = myDf.sort_values(['TestResult.Outcome','n * %'], ascending=False)
but this orders first by the index value "Passed" and breaks the Passed/Failed index pairs.
This can help you:
import pandas as pd
import numpy as np
arrays = [np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),np.array(["one", "two", "one", "two", "one", "two", "one", "two"])]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df.reset_index().groupby(["level_0"]).apply(lambda x: x.sort_values([3], ascending = False)).set_index(['level_0','level_1'])
In your case 3 is your column n * %, level_0 is your index TestName and level_1 is your TestResult.Outcome.
Becomes:
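Applied to the names from the question, the same pattern would look roughly like this (a sketch: the index levels and the "n * %" column come from the question, the data is made up):
import numpy as np
import pandas as pd
idx = pd.MultiIndex.from_product([["TestA", "TestB"], ["Passed", "Failed"]], names=["TestName", "TestResult.Outcome"])
myDf = pd.DataFrame({"n * %": [60.0, 40.0, 10.0, 90.0]}, index=idx)
# Sort inside each TestName group so the Passed/Failed pairs stay together
orderedByTotalNxPercentDesc = (myDf.reset_index()
    .groupby("TestName", group_keys=False)
    .apply(lambda g: g.sort_values("n * %", ascending=False))
    .set_index(["TestName", "TestResult.Outcome"]))
print(orderedByTotalNxPercentDesc)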
I was able to get what I want by creating a dummy column for sorting:
iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]
df = pd.DataFrame(np.random.randn(8, 1), index=pd.MultiIndex.from_product(iterables))
df.index.names = ['level_0', 'level_1']
df = df.rename(columns={0: "myvalue"}, errors='raise')
for index, row in df.iterrows():
    df.loc[index, 'sort_dummy'] = df.loc[(index[0], 'two'), 'myvalue']
df = df.sort_values(['sort_dummy'], ascending=False)
df
Output:

Conditional mapping among columns of two data frames with Pandas Data frame

I need your advice regarding how to map columns between data frames.
I have put it in a simple way so that it's easier for you to understand:
df = dataframe
EXAMPLE:
df1 = pd.DataFrame({
    "X": [],
    "Y": [],
    "Z": []
})
df2 = pd.DataFrame({
    "A": ['', '', 'A1'],
    "C": ['', '', 'C1'],
    "D": ['D1', 'Other', 'D3'],
    "F": ['', '', ''],
    "G": ['G1', '', 'G3'],
    "H": ['H1', 'H2', 'H3']
})
Requirement:
1st step:
We need to pick a value for the X column of df1 from columns A, C, and D, in that order. It should stop searching once it finds any value and select it.
2nd step:
If the selected value is "Other", the X column of df1 should instead be taken from columns F, G, and H, in that order, until it finds any value.
Result:
X
0 D1
1 H2
2 A1
Thank you so much in advance
Try this:
def first_non_empty(df, cols):
    """Return the first non-empty, non-null value among the specified columns per row"""
    return df[cols].replace('', pd.NA).bfill(axis=1).iloc[:, 0]
col_x = first_non_empty(df2, ['A','C','D'])
col_x = col_x.mask(col_x == 'Other', first_non_empty(df2, ['F','G','H']))
df1['X'] = col_x
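With the example df2 above, col_x should come out exactly as the Result column from the question. One caveat: the df1 in the question has no rows, so assigning an index-aligned Series to it can yield an empty column; building the result frame directly from col_x avoids that (a sketch):
print(col_x)
# 0    D1
# 1    H2
# 2    A1
result = col_x.to_frame("X")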

Arrays to row in pandas

I have the following dict which I want to convert into a pandas DataFrame. This dict has a nested list which can appear for one node but not another.
dis={"companies": [{"object_id": 123,
"name": "Abd ",
"contact_name": ["xxxx",
"yyyy"],
"contact_id":[1234,
33455]
},
{"object_id": 654,
"name": "DDSPP"},
{"object_id": 987,
"name": "CCD"}
]}
As:
object_id, name, contact_name, contact_id
123,Abd,xxxx,1234
123,Abd,yyyy,
654,DDSPP,,
987,CCD,,
How can I achieve this?
I was trying to do it like:
abc = pd.DataFrame(dis).set_index['object_id','contact_name']
but it says
'method' object is not subscriptable
As for the error itself: set_index is a method, so it needs parentheses rather than square brackets. For the reshape, this is inspired by @jezrael's answer in this link: Splitting multiple columns into rows in pandas dataframe
Use:
s = {"companies": [{"object_id": 123,
"name": "Abd ",
"contact_name": ["xxxx",
"yyyy"],
"contact_id":[1234,
33455]
},
{"object_id": 654,
"name": "DDSPP"},
{"object_id": 987,
"name": "CCD"}
]}
df = pd.DataFrame(s) #convert into DF
df = df['companies'].apply(pd.Series) #this splits the internal keys and values into columns
split1 = df.apply(lambda x: pd.Series(x['contact_id']), axis=1).stack().reset_index(level=1, drop=True)
split2 = df.apply(lambda x: pd.Series(x['contact_name']), axis=1).stack().reset_index(level=1, drop=True)
df1 = pd.concat([split1,split2], axis=1, keys=['contact_id','contact_name'])
pd.options.display.float_format = '{:.0f}'.format
print (df.drop(['contact_id','contact_name'], axis=1).join(df1).reset_index(drop=True))
Output with regular index:
name object_id contact_id contact_name
0 Abd 123 1234 xxxx
1 Abd 123 33455 yyyy
2 DDSPP 654 nan NaN
3 CCD 987 nan NaN
Is this something you were looking for?
If only one row needs this conversion, then you can use something shorter, like this:
df = pd.DataFrame(s['companies'])
d = df.loc[0].apply(pd.Series)
d[1] = d[1].fillna(d[0])
pd.concat([df.drop(index=[0]), d.T])
Otherwise, if you need to do this with more than one row, you can use it, but it would have to be modified.
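On newer pandas (1.3 and later), DataFrame.explode accepts a list of columns, which can do the same reshape more directly. A sketch using the dict from the question:
df = pd.DataFrame(dis['companies'])
# Explode the paired list columns row-wise; rows without lists keep NaN
out = df.explode(['contact_name', 'contact_id']).reset_index(drop=True)
print(out)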