New data table not updating based on if-else condition [duplicate] - pandas

I formulated this question about adding rows WITH an index, but it is still not clear to me how/why this happens when there is no index:
import pandas as pd

columnsList = ['A', 'B', 'C', 'D']
df8 = pd.DataFrame(columns=columnsList)
L = ['value aa', 'value bb', 'value cc', 'value dd']
s = pd.Series(dict(zip(df8.columns, L)))
df8.append(s, ignore_index=True)
df8.append(s, ignore_index=True)
I EXPECT HERE A 2X4 DATAFRAME.
Nevertheless, no values were added, nor did an error occur.
print(df8.shape)
#>>> (0,4)
Why is the series not being added, and why is no error raised?
If I try to add a row with .loc, an index label is added:
df8.loc[df8.index.max() + 1, :] = [4, 5, 6, 7]
print(df8)
result:
A B C D
NaN 4 5 6 7
I guess neither .loc nor .iloc can be used to append rows without an index name (i.e. .loc adds the index label NaN, and .iloc cannot be used when the row position is higher than the number of rows in the DataFrame).
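As an aside, on an empty frame df8.index.max() is NaN, which is why the new row ends up labeled NaN. A minimal sketch (not from the original question) that uses the current length as the label instead:
df8.loc[len(df8)] = [4, 5, 6, 7]  # on an empty frame this appends a row labeled 0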

DataFrame.append is not an in-place operation. From the docs,
DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=None)
Append rows of other to the end of this frame, returning a new object.
Columns not in this frame are added as new columns.
You need to assign the result back.
df8 = df8.append([s] * 2, ignore_index=True)
df8
A B C D
0 value aa value bb value cc value dd
1 value aa value bb value cc value dd

The statement data.append(sub_data) does not work on its own,
but the statement data = data.append(sub_data) will.
Assigning it back solved the issue for me; a good tip that is not easy to find elsewhere.
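A side note for readers on newer pandas: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on those versions the same assign-back pattern is written with pd.concat, for example:
# Equivalent assign-back where DataFrame.append no longer exists (pandas >= 2.0)
df8 = pd.concat([df8, pd.DataFrame([s] * 2)], ignore_index=True)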

What's the best way to insert columns in a pandas Dataframe when you don't know the exact number of columns?

I have an input dataframe.
I also have a list with the same length as the number of rows in the DataFrame.
Every element of the list is a dictionary: the key is the name of the new column, and the value is the value to be inserted in the cell.
I have to insert the columns from that list in the dataframe.
What is the best way to do so?
So far, given the input dataframe indf and the list l, I came up with something along the lines of:
from copy import deepcopy

outdf = deepcopy(indf)
for index, row in indf.iterrows():
    e = l[index]
    for key, value in e.items():  # iterate the dict's key/value pairs
        outdf.loc[index, key] = value
But it doesn't seem Pythonic (or idiomatic pandas), and I get performance warnings like:
<ipython-input-5-9dde586a9c14>:8: PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling `frame.insert` many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use `newframe = frame.copy()`
If the ordering of the list and the data frame is the same, you can convert your list of dictionaries to a data frame:
mylist = [
    {'a': 1, 'b': 2, 'c': 3},
    {'e': 11, 'f': 22, 'c': 33},
    {'a': 111, 'b': 222, 'c': 333},
]
mylist_df = pd.DataFrame(mylist)
       a      b    c     e     f
0    1.0    2.0    3   NaN   NaN
1    NaN    NaN   33  11.0  22.0
2  111.0  222.0  333   NaN   NaN
Then you can use pd.concat to merge the list to your input data frame:
result = pd.concat([input_df, mylist_df], axis=1)
This way, a column is created for every unique key across your dictionaries, regardless of whether a key exists in one dictionary and not in another.
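For illustration, a minimal end-to-end sketch (input_df is a made-up frame here; this assumes the list order matches the frame's default RangeIndex):
import pandas as pd

input_df = pd.DataFrame({'x': [10, 20, 30]})  # hypothetical input frame
mylist = [
    {'a': 1, 'b': 2, 'c': 3},
    {'e': 11, 'f': 22, 'c': 33},
    {'a': 111, 'b': 222, 'c': 333},
]
# Build all new columns at once, then join them side by side on the shared index.
result = pd.concat([input_df, pd.DataFrame(mylist)], axis=1)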

Error in using Pandas groupby.apply to drop duplication

I have a Pandas data frame which has some duplicate values, not rows. I want to use groupby.apply to remove the duplication. An example is as follows.
df = pd.DataFrame([['a', 1, 1], ['a', 1, 2], ['b', 1, 1]], columns=['A', 'B', 'C'])
A B C
0 a 1 1
1 a 1 2
2 b 1 1
# My function
def get_uniq_t(df):
    if df.shape[0] > 1:
        df['D'] = df.C * 10 + df.B
        df = df[df.D == df.D.max()].drop(columns='D')
    return df
df = df.groupby('A').apply(get_uniq_t)
Then I get the ValueError below. The issue seems to be related to creating the new column D: if I create column D outside the function, the code runs fine. Can someone help explain what causes the error?
ValueError: Shape of passed values is (3, 3), indices imply (2, 3)
The problem with your code is that it attempts to modify the original group.
Another problem is that this function should return a single row, not a DataFrame.
Change your function to:
def get_uniq_t(df):
    iMax = (df.C * 10 + df.B).idxmax()
    return df.loc[iMax]
Then its application returns:
A B C
A
a a 1 2
b b 1 1
Edit following the comment
In my opinion, you are not allowed to modify the original group, as that would indirectly modify the original DataFrame. At the very least pandas displays a warning about it, and it is considered bad practice. Search the web for SettingWithCopyWarning for a more extensive description.
My code (the get_uniq_t function) does not modify the original group. It only returns one row from the current group, selected as the row that yields the greatest value of df.C * 10 + df.B. So when you apply this function, the result is a new DataFrame whose consecutive rows are the results of this function for consecutive groups.
You can perform an operation equivalent to a modification by creating new content, e.g. as the result of a groupby, and then saving it under the same variable that previously held the source DataFrame.
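A minimal sketch of that assign-back pattern with the frame from the question (the reset_index(drop=True) simply discards the group-key index that apply produces):
df = pd.DataFrame([['a', 1, 1], ['a', 1, 2], ['b', 1, 1]], columns=['A', 'B', 'C'])
df = df.groupby('A').apply(get_uniq_t).reset_index(drop=True)  # saved under the same variable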

How to get a value of a column from another df based on index? pandas

I have two data frames, and I'd like to produce the first data frame enriched with data from the second, based on their index. The catch is that I do this iteratively, and the column index number of the first df (only) increases by one with each iteration, which causes an error.
An example would be:
First df after first iteration:
0
440 7.691
Second df after first iteration (doesn't change after each iteration):
1
0 M
1 M
2 M
3 M
4 M
.. ..
440 B
441 M
442 M
When I run the code, I get the DataFrame I want:
df_with_label = first_df.join(self.second_df)
0 1
440 7.691 B
After the second iteration, my first df is now:
1
3 10.72
and when I run the same df_with_label = first_df.join(self.second_df), I'd like to get:
1 2
3 10.72 M
But I get the error:
ValueError: columns overlap but no suffix specified: Int64Index([1], dtype='int64')
I'm guessing the problem is that the column index of the first df is 1 after the second iteration, but I don't know how to fix it.
I'd like the column index of the first df to keep increasing.
The best solution would be to give the second column a different name, like:
1 class
3 10.72 M
Any idea how to fix it?
If I got it right, your second dataframe doesn't change across iterations, so why don't you just change its column name once and for all:
second_df.columns = ['colname']
This should solve your naming conflicts.
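A small sketch with made-up frames that mirror the question (the values and index labels are hypothetical):
import pandas as pd

first_df = pd.DataFrame({1: [10.72]}, index=[3])             # second-iteration state
second_df = pd.DataFrame({1: list('MMMMM')}, index=range(5))
second_df.columns = ['class']                                # rename once, for good
print(first_df.join(second_df))
#        1 class
# 3  10.72     M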
Try:
df_with_label = first_df.join(self.second_df, rsuffix = "_2")
The thing is that first_df and second_df both have a column named 1, so rsuffix adds "_2" to the second_df column name: "1" becomes "1_2". You join on indexes, so every other column is included by default, which means you need to avoid naming conflicts.
Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html
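And a self-contained sketch of the rsuffix variant, with the same made-up frames as above:
import pandas as pd

first_df = pd.DataFrame({1: [10.72]}, index=[3])
second_df = pd.DataFrame({1: list('MMMMM')}, index=range(5))
print(first_df.join(second_df, rsuffix='_2'))
#        1 1_2
# 3  10.72   M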

Taking second last observed row

I am new to pandas. I know how to use drop_duplicates and take the last observed row in a dataframe. Is there any way I can use it to take only the second-to-last observed row? Or any other way of doing it?
For example:
I would like to go from
df = pd.DataFrame(data={'A': [1, 1, 1, 2, 2, 2], 'B': [1, 2, 3, 4, 5, 6]})
to
df1 = pd.DataFrame(data={'A': [1, 2], 'B': [2, 5]})
The idea is that you group the data by the duplicate column and then check the length of each group: if the length is greater than or equal to 2, you can slice the second element of the group; if the group has length one, that value is not duplicated, so you take index 0, the only element in the grouped data.
df.groupby(df['A']).apply(lambda x : x.iloc[1] if len(x) >= 2 else x.iloc[0])
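For the sample frame from the question, this returns (output reconstructed here for illustration; it is not shown in the original answer):
   A  B
A
1  1  2
2  2  5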
The first answer was, I think, on the right track, but possibly not quite right. I have extended your data to include 'A' groups with two observations and an 'A' group with one observation, for the sake of completeness.
import pandas as pd

df = pd.DataFrame(data={'A': [1, 1, 1, 2, 2, 2, 3, 3, 4], 'B': [1, 2, 3, 4, 5, 6, 7, 8, 9]})
def user_apply_func(x):
    if len(x) == 2:
        return x.iloc[0]
    if len(x) > 2:
        return x.iloc[-2]
    return
df.groupby('A').apply(user_apply_func)
Out[7]:
A B
A
1 1 2
2 2 5
3 3 7
4 NaN NaN
For your reference, the apply method automatically passes each group's data frame as the first argument.
Also, as you are always reducing each group of data to a single observation, you could also use the agg (aggregate) method. apply is more flexible in terms of the length of the sequences that can be returned, whereas agg must reduce the data to a single value.
df.groupby('A').agg(user_apply_func)
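As an aside, GroupBy.nth accepts negative positions, so a more concise alternative is possible; unlike the apply version it simply omits groups that have no second-to-last row rather than emitting NaN, and the exact output shape varies across pandas versions:
df.groupby('A').nth(-2)  # second-to-last row of each group, where one exists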