Error while setting one column equal to another in pandas (ValueError: Must have equal len keys and value when setting with an iterable)

I have the following dataframe in pandas
datadate fyear ebit glp ibc ... ind status year month a_date
gvkey ...
7767 20130831 NaN NaN NaN NaN ... 0 1 2013.0 8.0 0
10871 20110930 NaN NaN NaN NaN ... 0 1 2011.0 9.0 0
15481 20110930 NaN NaN NaN NaN ... 0 1 2011.0 9.0 0
15582 19821031 NaN NaN NaN NaN ... 1 1 1982.0 10.0 0
15582 19831031 NaN NaN NaN NaN ... 1 1 1983.0 10.0 0
... ... ... ... ... ... ... ... ... ... ...
282553 20071231 NaN NaN NaN NaN ... 0 1 2007.0 12.0 0
282553 20081231 NaN NaN NaN NaN ... 0 1 2008.0 12.0 0
282553 20091231 NaN NaN NaN NaN ... 0 1 2009.0 12.0 0
294911 20150930 NaN NaN NaN NaN ... 0 1 2015.0 9.0 0
321467 20161231 NaN NaN NaN NaN ... 0 1 2016.0 12.0 0
I want to run the following command to assign the year value to the column a_date if month is at least 6. (Please ignore the NaNs in the dataframe.)
df.iloc[(df['month']>=6).values,-1]=df.iloc[(df['month']>=6).values,-3]
but I get the error
ValueError: Must have equal len keys and value when setting with an iterable
How should I proceed? I cannot see why I get this error. I searched and found some solutions for the same ValueError, but they do not apply to my case. I would like to avoid dictionaries and keep everything on one line if possible. I know I could solve this with a loop, but I am looking for a more efficient solution.

I think the error comes from the iloc call on the right-hand side of the assignment: it returns a Series rather than a single value, so you are assigning a Series to the selected cells, which I believe is the source of the error. With pandas, I would write:
df.loc[df['month'] >= 6, 'a_date'] = df['year']
loc selects the rows that satisfy a condition (here df['month'] >= 6) and the column to modify (here 'a_date'); the right-hand side is the value to assign (here another column of the dataframe, df['year']). Because the assignment aligns df['year'] on the index, each selected row receives the year from the same row.
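A minimal sketch of the same idea, assuming a small made-up frame with only the year, month and a_date columns (the values are illustrative, not the asker's data). Applying the mask on both sides and passing the right-hand side as a plain array sidesteps index alignment entirely:

```python
import pandas as pd

# Hypothetical sample data; only the three relevant columns are kept.
df = pd.DataFrame(
    {
        "year": [2013.0, 2011.0, 1982.0, 2007.0],
        "month": [8.0, 9.0, 3.0, 12.0],
        "a_date": [0.0, 0.0, 0.0, 0.0],
    },
    index=[7767, 10871, 15582, 282553],
)

# Boolean mask selecting the rows to update.
mask = df["month"] >= 6

# Same mask on both sides; to_numpy() passes raw values, so no alignment is needed.
df.loc[mask, "a_date"] = df.loc[mask, "year"].to_numpy()
print(df)
```

Rows whose month is below 6 keep their previous a_date value.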

I found an efficient solution myself using np.where (with NumPy imported as np). It assigns year where month is at least 6 and year - 1 otherwise:
df['a_date']=np.where(df['month']>=6,df['year'],df['year']-1)
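As a rough illustration of how np.where behaves here, the sketch below uses made-up values and shows two variants: the one from this answer (year - 1 for earlier months) and one that keeps the existing a_date instead. The variable names prev_year and keep_old are only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame(
    {
        "year": [2013.0, 2011.0, 1982.0],
        "month": [8.0, 9.0, 3.0],
        "a_date": [0.0, 0.0, 0.0],
    }
)

late = df["month"] >= 6

# Variant from the answer: year where month >= 6, otherwise year - 1.
prev_year = np.where(late, df["year"], df["year"] - 1)

# Variant that leaves a_date untouched where the condition is False.
keep_old = np.where(late, df["year"], df["a_date"])

df["a_date"] = prev_year
print(df)
```

np.where evaluates both branches for every row and then picks one per row, so it always produces a full-length array.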

Related

Pandas Long to Wide conversion

I am new to pandas. I have a data set in this format:
UserID ISBN BookRatings
0 276725.0 034545104U 0.0
1 276726.0 155061224 5.0
2 276727.0 446520802 0.0
3 276729.0 052165615W 3.0
4 276729.0 521795028 6.0
I would like to create this
ISBN 276725 276726 276727 276729
UserID
0 034545104U
1 0 155061224 0 0 0
2 0 0 446520802 0 0
3 0 0 0 052165615W 0
4 0 0 0 521795028 0
I tried pivot but was not successful. Any advice, please?
I think pivot() is the right approach here. The hardest part is getting the arguments right: we keep the original index as the row index, the new columns should be the values of the UserID column, and the cells should be filled with the values from the ISBN column.
To do this, I first turn the original index into a column and then apply pivot():
df = df.reset_index()
result = df.pivot(index='index', columns='UserID', values='ISBN')
# Convert the float column labels to integers (only works if all user IDs are numeric; drop NaN values first)
result.columns = map(int,result.columns)
Output:
276725 276726 276727 276729
index
0 034545104U NaN NaN NaN
1 NaN 155061224 NaN NaN
2 NaN NaN 446520802 NaN
3 NaN NaN NaN 052165615W
4 NaN NaN NaN 521795028
Edit: To get the same appearance as the original dataframe (no index name), also apply the following line:
result = result.rename_axis(None, axis=0)
Output:
276725 276726 276727 276729
0 034545104U NaN NaN NaN
1 NaN 155061224 NaN NaN
2 NaN NaN 446520802 NaN
3 NaN NaN NaN 052165615W
4 NaN NaN NaN 521795028

Pandas DataFrame subtraction is getting an unexpected result. Concatenating instead?

I have two dataframes of the same size (510x6)
preds
0 1 2 3 4 5
0 2.610270 -4.083780 3.381037 4.174977 2.743785 -0.766932
1 0.049673 0.731330 1.656028 -0.427514 -0.803391 -0.656469
2 -3.579314 3.347611 2.891815 -1.772502 1.505312 -1.852362
3 -0.558046 -1.290783 2.351023 4.669028 3.096437 0.383327
4 -3.215028 0.616974 5.917364 5.275736 7.201042 -0.735897
... ... ... ... ... ... ...
505 -2.178958 3.918007 8.247562 -0.523363 2.936684 -3.153375
506 0.736896 -1.571704 0.831026 2.673974 2.259796 -0.815212
507 -2.687474 -1.268576 -0.603680 5.571290 -3.516223 0.752697
508 0.182165 0.904990 4.690155 6.320494 -2.326415 2.241589
509 -1.675801 -1.602143 7.066843 2.881135 -5.278826 1.831972
510 rows × 6 columns
outputStats
0 1 2 3 4 5
0 2.610270 -4.083780 3.381037 4.174977 2.743785 -0.766932
1 0.049673 0.731330 1.656028 -0.427514 -0.803391 -0.656469
2 -3.579314 3.347611 2.891815 -1.772502 1.505312 -1.852362
3 -0.558046 -1.290783 2.351023 4.669028 3.096437 0.383327
4 -3.215028 0.616974 5.917364 5.275736 7.201042 -0.735897
... ... ... ... ... ... ...
505 -2.178958 3.918007 8.247562 -0.523363 2.936684 -3.153375
506 0.736896 -1.571704 0.831026 2.673974 2.259796 -0.815212
507 -2.687474 -1.268576 -0.603680 5.571290 -3.516223 0.752697
508 0.182165 0.904990 4.690155 6.320494 -2.326415 2.241589
509 -1.675801 -1.602143 7.066843 2.881135 -5.278826 1.831972
510 rows × 6 columns
when I execute:
preds - outputStats
I expect a 510 x 6 dataframe with elementwise subtraction. Instead I get this:
0 1 2 3 4 5 0 1 2 3 4 5
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ...
505 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
506 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
507 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
508 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
509 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I've tried dropping columns and the like, and that hasn't helped. I also get the same result with preds.subtract(outputStats). Any ideas?
Two different values can look identical when displayed. A common cause is that they have different types but the same printed representation: depending on how you display them, the int 1 and the str '1' may be hard to tell apart. Whitespace can also differ, as in '1' versus ' 1'. Here, the column labels of the two dataframes probably differ in this way, so pandas treats them as distinct columns and fills the result with NaN.
If one set of column labels is int and the other is str, convert both to int or both to str. For the former, use df.columns = [int(col) for col in df.columns]; for the latter, use df.columns = [str(col) for col in df.columns]. Converting to str is somewhat safer, because converting to int raises an error if a string is not numeric (e.g. int('y') raises a ValueError), but int labels can be more useful because they keep their numerical structure.
You asked in a comment about dropping columns. You can do this with drop and including axis=1 as a parameter to tell it to drop columns rather than rows, or you can use the del keyword. But changing the column names should remove the need to drop columns.
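A small sketch of how one might check for and repair such a mismatch. The two tiny frames are made up, and it is only an assumption that the asker's frames differ in exactly this way (int labels versus str labels).

```python
import pandas as pd

# Hypothetical frames: same shape, but one has int labels and the other str labels.
preds = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=[0, 1])
output_stats = pd.DataFrame([[0.5, 1.0], [1.5, 2.0]], columns=["0", "1"])

# Diagnose: inspect the label types and compare the two column indexes.
label_types = [type(col).__name__ for col in output_stats.columns]
same_labels = preds.columns.equals(output_stats.columns)
print(label_types, same_labels)

# Repair: convert the str labels to int so both frames share the same columns.
output_stats.columns = [int(col) for col in output_stats.columns]

# The subtraction is now elementwise.
diff = preds - output_stats
print(diff)
```

If the labels cannot be reconciled, subtracting the underlying arrays (preds.to_numpy() - output_stats.to_numpy()) ignores labels altogether, at the cost of losing the alignment checks.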

Forward-fill dataframe based on mask. Fill with last valid value

I have a dataframe like the following:
index,col1,col2
1,NaN,NaN
2,NaN,NaN
3,NaN,20
4,NaN,21
5,10,22
6,11,23
7,12,24
8,13,NaN
9,NaN,NaN
And a boolean mask dataframe like the following:
index,col1,col2
1,False,False
2,False,False
3,False,False
4,False,True
5,False,False
6,False,False
7,True,True
8,True,False
9,False,False
I would like to convert them to this final dataframe:
index,col1,col2
1,NaN,NaN
2,NaN,NaN
3,NaN,20
4,NaN,20
5,10,22
6,11,23
7,11,23
8,11,NaN
9,NaN,NaN
That is: forward-fill the values where the mask is True with the last value in the same column where the mask is False.
How can I get this?
Let's try:
df.mask(mask).ffill().where(df.notna())
Output:
col1 col2
index
1 NaN NaN
2 NaN NaN
3 NaN 20.0
4 NaN 20.0
5 10.0 22.0
6 11.0 23.0
7 11.0 23.0
8 11.0 NaN
9 NaN NaN
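The one-liner above can be unpacked as follows. This sketch rebuilds the two frames from the question; the variable name result is only for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.Index(range(1, 10), name="index")

# Data frame from the question.
df = pd.DataFrame(
    {
        "col1": [np.nan, np.nan, np.nan, np.nan, 10, 11, 12, 13, np.nan],
        "col2": [np.nan, np.nan, 20, 21, 22, 23, 24, np.nan, np.nan],
    },
    index=idx,
)

# Boolean mask from the question: True marks the cells to overwrite.
mask = pd.DataFrame(False, index=idx, columns=df.columns)
mask.loc[[7, 8], "col1"] = True
mask.loc[[4, 7], "col2"] = True

result = (
    df.mask(mask)        # 1. blank out the masked cells
      .ffill()           # 2. carry the last remaining value forward
      .where(df.notna()) # 3. restore NaN wherever the original was NaN
)
print(result)
```

Step 3 matters because ffill would otherwise also fill the trailing NaNs that were missing in the original data.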

For every row in pandas, do until sample ID change

How can I iterate over rows in a dataframe until the sample ID changes?
my_df:
ID loc_start
sample1 10
sample1 15
sample2 10
sample2 20
sample3 5
Something like:
samples = ["sample1", "sample2", "sample3"]
out = pd.DataFrame()
for sample in samples:
    if my_df["ID"] == sample:
        my_list = []
        for index, row in my_df.iterrows():
            other_list = [row.loc_start]
            my_list.append(other_list)
        my_list = pd.DataFrame(my_list)
        out = pd.merge(out, my_list)
Expected output:
sample1 sample2 sample3
10 10 5
15 20
I realize, of course, that this could be done more easily if my_df really looked like this. However, what I'm after is the principle of iterating over rows until a certain column value changes.
Based on the input & output provided, this should work.
You need to provide more info if you are looking for something else.
df.pivot(columns='ID', values = 'loc_start').rename_axis(None, axis=1).apply(lambda x: pd.Series(x.dropna().values))
output
sample1 sample2 sample3
0 10.0 10.0 5.0
1 15.0 20.0 NaN
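A self-contained version of this one-liner, using the sample data from the question, might look like the sketch below (the name wide is only for this example):

```python
import pandas as pd

# Sample data from the question.
my_df = pd.DataFrame(
    {
        "ID": ["sample1", "sample1", "sample2", "sample2", "sample3"],
        "loc_start": [10, 15, 10, 20, 5],
    }
)

wide = (
    my_df.pivot(columns="ID", values="loc_start")  # one column per ID, NaN elsewhere
         .rename_axis(None, axis=1)                 # drop the "ID" label on the columns
         # compact each column: drop its NaNs and renumber its rows from 0
         .apply(lambda col: pd.Series(col.dropna().to_numpy()))
)
print(wide)
```

The values come out as floats because the shorter columns have to be padded with NaN.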
Ben.T is correct that a pivot works here. Here is an example:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(0, 5, (10, 2)), columns=list("AB"))
# what does the df look like? Here, I consider column A to be analogous to your "ID" column
In [5]: df
Out[5]:
A B
0 3 1
1 2 1
2 4 2
3 4 1
4 0 4
5 4 2
6 4 1
7 3 1
8 1 1
9 4 0
# now do a pivot and see what it looks like
df2 = df.pivot(columns="A", values="B")
In [8]: df2
Out[8]:
A 0 1 2 3 4
0 NaN NaN NaN 1.0 NaN
1 NaN NaN 1.0 NaN NaN
2 NaN NaN NaN NaN 2.0
3 NaN NaN NaN NaN 1.0
4 4.0 NaN NaN NaN NaN
5 NaN NaN NaN NaN 2.0
6 NaN NaN NaN NaN 1.0
7 NaN NaN NaN 1.0 NaN
8 NaN 1.0 NaN NaN NaN
9 NaN NaN NaN NaN 0.0
Not quite what you wanted. With a little help from Jezreal's answer:
df3 = df2.apply(lambda x: pd.Series(x.dropna().values))
In [20]: df3
Out[20]:
A 0 1 2 3 4
0 4.0 1.0 1.0 1.0 2.0
1 NaN NaN NaN 1.0 1.0
2 NaN NaN NaN NaN 2.0
3 NaN NaN NaN NaN 1.0
4 NaN NaN NaN NaN 0.0
The empty spots in the dataframe have to be filled with something, and NaN is used by default. Is this what you wanted?
If, on the other hand, you wanted to perform an operation on your data per group, you would use groupby instead:
df2 = df.groupby(by="A", as_index=False).mean()
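As for the general principle of processing rows one ID at a time, iterating over a groupby object is one way to sketch it without row-by-row loops. The example below uses the question's sample data; the names columns and out are only for illustration.

```python
import pandas as pd

# Sample data from the question.
my_df = pd.DataFrame(
    {
        "ID": ["sample1", "sample1", "sample2", "sample2", "sample3"],
        "loc_start": [10, 15, 10, 20, 5],
    }
)

columns = {}
# sort=False keeps the groups in order of first appearance.
for sample_id, group in my_df.groupby("ID", sort=False):
    # Each group holds every row for one ID; renumber its rows from 0.
    columns[sample_id] = group["loc_start"].reset_index(drop=True)

# Put the per-sample Series side by side; shorter ones are padded with NaN.
out = pd.concat(columns, axis=1)
print(out)
```

Note that groupby collects all rows with the same ID even when they are not adjacent. If only consecutive runs should count as one block, grouping on (my_df["ID"] != my_df["ID"].shift()).cumsum() instead starts a new group each time the ID changes.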

Compare 2 columns and replace to None if found equal

The following command replaces every value in the matching rows with NaN.
ndf.iloc[np.where(ndf.path3=='sys_bck_20190101.tar.gz')] = np.nan
What I really need to do is to replace the value of a single column called path4 if it matches with column path3. This does not work:
ndf.iloc[np.where(ndf.path3==ndf.path4), ndf.path3] = np.nan
Update:
There is a pandas method, fillna, that can be used with axis='columns'.
Is there a similar method to write NA values to the duplicate columns?
I can do this, but it does not look Pythonic.
ndf.loc[ndf.path1==ndf.path2, 'path1'] = np.nan
ndf.loc[ndf.path2==ndf.path3, 'path2'] = np.nan
ndf.loc[ndf.path3==ndf.path4, 'path3'] = np.nan
ndf.loc[ndf.path4==ndf.filename, 'path4'] = np.nan
Update 2
Let me explain the issue:
Assuming this dataframe:
ndf = pd.DataFrame({
'path1':[4,5,4,5,5,4],
'path2':[4,5,4,5,5,4],
'path3':list('abcdef'),
'path4':list('aaabef'),
'col':list('aaabef')
})
The expected results :
0 NaN 4.0 NaN NaN a
1 NaN 5.0 b NaN a
2 NaN 4.0 c NaN a
3 NaN 5.0 d NaN b
4 NaN 5.0 NaN NaN e
5 NaN 4.0 NaN NaN f
As you can see, this is the reverse of fillna, and I guess there is no easy way to do it in pandas. I have already shown the commands I can use. I would like to know if there is a better way to achieve this.
Use:
for c1, c2 in zip(ndf.columns, ndf.columns[1:]):
    ndf.loc[ndf[c1]==ndf[c2], c1] = np.nan
print (ndf)
path1 path2 path3 path4 col
0 NaN 4.0 NaN NaN a
1 NaN 5.0 b NaN a
2 NaN 4.0 c NaN a
3 NaN 5.0 d NaN b
4 NaN 5.0 NaN NaN e
5 NaN 4.0 NaN NaN f
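To make the pairing explicit, here is a runnable sketch of the same loop on a small made-up frame. It assumes all columns hold strings (the values are hypothetical path components, not the asker's data), which keeps the comparisons within one type.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each row is a path split into components.
ndf = pd.DataFrame(
    {
        "path1": ["usr", "usr", "etc"],
        "path2": ["usr", "lib", "etc"],
        "path3": ["bin", "lib", "etc"],
    }
)

# zip pairs every column with its right-hand neighbour:
# (path1, path2), (path2, path3).
for left, right in zip(ndf.columns, ndf.columns[1:]):
    # Blank the left column wherever it repeats the value to its right.
    ndf.loc[ndf[left] == ndf[right], left] = np.nan

print(ndf)
```

Each comparison runs before the right-hand column is modified, so a value is blanked based on its original neighbour. The last column is never changed because it has no neighbour to its right.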