Issue looping through dataframes in Pandas - pandas

I have a dict 'd' set up which is a list of dataframes E.g.:
d["DataFrame1"]
Will return that dataframe with all its columns:
ID Name
0 123 John
1 548 Eric
2 184 Sam
3 175 Andy
Each dataframe has a column in it called 'Names'. I want to extract this column from each dataframe in the dict and to create a new dataframe consisting of these columns.
df_All_Names = pd.DataFrame()
for df in d:
df_All_Names[df] = df['Names']
Returns the error:
TypeError: string indices must be integers
Unsure where I'm going wrong here.

For example you have df as follow
df=pd.DataFrame({'Name':['X', 'Y']})
df1=pd.DataFrame({'Name':['X1', 'Y1']})
And we create a dict
d=dict()
d['df']=df
d['df1']=df1
Then presetting a empty data frame:
yourdf=pd.DataFrame()
Using items with for loop
for key,val in d.items():
yourdf[key]=val['Name']
yield :
yourdf
Out[98]:
df df1
0 X X1
1 Y Y1

Your can use reduce and concatenate all of the columns named ['Name'] in your dictionary of dataframes
Sample Data
from functools import reduce
d = {'df1':pd.DataFrame({'ID':[0,1,2],'Name':['John','Sam','Andy']}),'df2':pd.DataFrame({'ID':[3,4,5],'Name':['Jen','Cara','Jess']})}
You can stack the data side by side using axis=1
reduce(lambda x,y:pd.concat([x.Name,y.Name],axis=1),d.values())
Name Name
0 John Jen
1 Sam Cara
2 Andy Jess
Or on top of one an other usingaxis=0
reduce(lambda x,y:pd.concat([x.Name,y.Name],axis=0),d.values())
0 John
1 Sam
2 Andy
0 Jen
1 Cara
2 Jess

Related

convert some columns in a dataframe to la column of list in the dataframe

I would like to convert some of the columns to list in adataframe.
The dataframe, df:
Name salary department days other
0 ben 1000 A 90 abc
1 alex 3000 B 80 gf
2 linn 600 C 55 jgj
3 luke 5000 D 88 gg
The desired output, df1:
Name list other
0 ben [1000,A,90] abc
1 alex [3000,B,80] gf
2 linn [600,C,55] jgj
3 luke [5000,D,88] gg
You can slice and convert the columns to a list of list, then to a Series:
cols = ['salary', 'department', 'days']
out = (df.drop(columns=cols)
.join(pd.Series(df[cols].to_numpy().tolist(), name='list', index=df.index))
)
Output:
Name other list
0 ben abc [1000, A, 90]
1 alex gf [3000, B, 80]
2 linn jgj [600, C, 55]
3 luke gg [5000, D, 88]
If you want to preserve the order, then we can break it down into 3 parts, as #mozway mentioned in his answer
Define columns we want to group (as #mozway mentioned in his answer)
Find the first element's index (you can take it a step forward and find the smallest one, as the list won't be necessarily sorted as the DataFrame)
Insert the Series to the dataframe at the position we generated
cols = ['salary', 'department', 'other']
first_location = df.columns.get_loc(cols[0])
list_values = pd.Series(df[cols].values.tolist()) # converting values to one list
df.insert(loc=first_location, column='list', value=list_values) # inserting the Series in the desired location
df = df.drop(columns=cols) # dropping the columns we grouped together.
print(df)
Which results in:
Name list other
0 ben [1000, A, 90] abc
1 alex [3000, B, 80] gf
...

Subtract 1 from column based on another DataFrame. Pandas

Trying to figure out how to subtract a constant from a column based on the presence of a value in another DataFrame. For example, if I have the below DataFrame a that contains a column called person name and count:
a = pd.DataFrame({
"person":["Bob", "Kate", "Joe", "Mark"],
"count":[1, 2, 3, 4],
})
person count
0 Bob 3
1 Kate 4
2 Joe 5
2 Mark 4
And a second DataFrame that contains Person and whatever other arbitrary columns:
b = pd.DataFrame({
"person":["Bob", "Joe"],
"foo":['a', 'b'],
})
person foo
0 Bob a
1 Joe c
My hope is that I can change the first DataFrame to look like the below. Specifically decreasing count by one for any instance of person within DataFrame b. It is safe to assume that DataFrame b will always be a subset of DataFrame a and person will be unique.
Person Count
0 Bob 2
1 Kate 4
2 Joe 3
2 Mark 4
Many thanks in advance!
a["count"] -= a.person.isin(b.person)
With isin we get a boolean array of True and Falses per each person if it is in the other one or not. Then treating it as integers, we can subtract it from count column,
to get
>>> a
person count
0 Bob 2
1 Kate 4
2 Joe 4
3 Mark 4
This answer assumes that df2 can have multiple instances of a name. If it is just one instance, you can subtract by just iterating through and checking whether the person is named in the second data frame. In df2:
df2_counts = df2['Person'].value_counts()
In df1, join this data over and then subtract the counts:
df1['subtracts'] = df1.set_index('Person').join(df2_counts)
df1['count_new'] = df1['count'] - df1['subtracts']
Create a list of person names from Dataset B:
listDFB=DFB['person']
loop through dfa to fill new col accordingly
for i, rw in dfa.iterrows():
if rw['person'] in listDFB:
rw['count']=rw['count']-1

How to append/concat 2 Pandas Dataframe with different columns

How to concat/append based on common column values?
I'm creating some dfs from some files, and I want to compile them.
The columns don't always match, but there will always be some common columns (I only know a few columns guaranteed to match, but there's a lot of columns, and I'd like to retain as much info as possible)
df1:
Name
Status
John
1
Jane
2
df2:
Extra1
Extra2
Name
Status
a
b
Bob
2
c
d
Nancy
2
Desired output:
either this (doesn't matter of the order):
Extra1
Extra2
Name
Status
a
b
Bob
2
c
d
Nancy
2
NULL
NULL
John
1
NULL
NULL
Jane
2
Or this (doesn't matter of the order):
Name
Status
John
1
Jane
2
Bob
2
Nancy
2
I've tried these, but doesn't get the result I want:
df = pd.concat([df2, df], axis=0, ignore_index=True)
df = df.set_index('Name').combine_first(df2.set_index('Name')).reset_index()
Thanks
Not sure why the tables aren't being formatted, it shows up fine in the preview
import pandas as pd
df1 = pd.DataFrame({'Name':['John', 'Jane'],'Status':[1,2]})
df2 = pd.DataFrame({'Extra1':['a','b'],'Extra2':['c','d'],'Name':['bob', 'nancy'],'Status':[2,2]})
df = pd.concat([df1,df2], axis=0, ignore_index=True)
Gives me
Name
Status
Extra1
Extra2
John
1
NaN
NaN
Jane
2
NaN
NaN
bob
2
a
c
nancy
2
b
d
Which looks to me like your desired output.
And your tables aren't formatted correctly because you need empty newlines between text and tables.

How to order dataframe using a list in pandas

I have a pandas dataframe as follows.
import pandas as pd
data = [['Alex',10, 175],['Bob',12, 178],['Clarke',13, 179]]
df = pd.DataFrame(data,columns=['Name','Age', 'Height'])
print(df)
I also have a list as follows.
mynames = ['Emj', 'Bob', 'Jenne', 'Alex', 'Clarke']
I want to order the rows of my dataframe in the order of mynames list. In other words, my output should be as follows.
Name Age Height
0 Bob 12 178
1 Alex 10 175
2 Clarke 13 179
I was trying to do this as follows. I am wondering if there is an easy way to do this in pandas than converting the dataframe to list.
I am happy to provide more details if needed.
You can do pd.Categorical + argsort
df=df.loc[pd.Categorical(df.Name,mynames).argsort()]
Name Age Height
1 Bob 12 178
0 Alex 10 175
2 Clarke 13 179

Compare two data frames for different values in a column

I have two dataframe, please tell me how I can compare them by operator name, if it matches, then add the values ​​of quantity and time to the first data frame.
In [2]: df1 In [3]: df2
Out[2]: Out[3]:
Name count time Name count time
0 Bob 123 4:12:10 0 Rick 9 0:13:00
1 Alice 99 1:01:12 1 Jone 7 0:24:21
2 Sergei 78 0:18:01 2 Bob 10 0:15:13
85 rows x 3 columns 105 rows x 3 columns
I want to get:
In [5]: df1
Out[5]:
Name count time
0 Bob 133 4:27:23
1 Alice 99 1:01:12
2 Sergei 78 0:18:01
85 rows x 3 columns
Use set_index and add them together. Finally, update back.
df1 = df1.set_index('Name')
df1.update(df1 + df2.set_index('Name'))
df1 = df1.reset_index()
Out[759]:
Name count time
0 Bob 133.0 04:27:23
1 Alice 99.0 01:01:12
2 Sergei 78.0 00:18:01
Note: I assume time columns in both df1 and df2 are already in correct date/time format. If they are in string format, you need to convert them before running above commands as follows:
df1.time = pd.to_timedelta(df1.time)
df2.time = pd.to_timedelta(df2.time)