Concatenate/Append many dataframes in pandas

I have a list of dataframes df1 to df20 that are created in a loop, and I need to concatenate all of them in one go. The number of dataframes is dynamic and can be anywhere between 1 and 20, depending on the loop that generates them in my code.
So I tried to create an empty list first, add these dataframe names to it (in a loop from 1 to 20, for example), and then use this list in pd.concat(df_list) as below:
df_list = []
for i in range(1, 21):
    df_list.append(f'df{i}')
pd.concat(df_list)
The above code creates a list of dataframe names, but as quoted strings, as shown below, so I'm unable to concatenate the dataframes using pd.concat(df_list) because it treats all the dataframe names as string elements:
print(df_list)
['df1', 'df2', 'df3', 'df4', 'df5', 'df6', 'df7', 'df8', 'df9', 'df10', 'df11', 'df12', 'df13', 'df14', 'df15', 'df16', 'df17', 'df18','df19','df20']
I would appreciate it if anyone could help me get this concatenation of dataframes working.
I think that if I could add the dataframe names without quotes, like df_list=[df1, df2, df3, ...], then pd.concat would work; otherwise, please let me know if there is a better alternative to get this done. Thanks!
UPDATE
As per the suggestions in the comments, I've created a simple loop that creates multiple dataframes, and I tried to append the "names of these dataframes" to an empty list inside the same loop where the dataframes are created. But the output is not what I am expecting.
mylist = []
for i in range(1, 4):
    globals()[f"df{i}"] = pd.DataFrame(np.random.randint(99, size=(3, 3)), columns=['AA', 'BB', 'CC'])
    mylist.append(globals()[f"df{i}"])
The above code creates three dataframes df1, df2 and df3, and the empty list does get appended to, but with the contents of the dataframes, as shown below:
print(mylist)
[ AA BB CC
0 57 92 50
1 33 47 28
2 82 77 46, AA BB CC
0 18 8 75
1 1 15 52
2 4 69 38, AA BB CC
0 19 24 31
1 24 52 62
2 50 8 63]
But my desired output is not the contents of the dataframes, but the names of the dataframes themselves, like below:
print(mylist)
[df1,df2,df3]
I would appreciate it if anyone could show me how to get this done. I think there must be a simple way to do it.

That's because you're effectively appending strings to your list. If you have named variables df1 to df20, you can access them using locals() (or globals(), depending on where your named variables live and whether you are concatenating the dataframes inside a function or not). Here is an example:
df1 = 0
df2 = 1
df3 = 2
df_list = []
for i in range(1, 4):
    df_list.append(locals()[f'df{i}'])
>>> df_list
[0, 1, 2]
EDIT: I think what you want to do is the following:
import pandas as pd
import numpy as np
mylist = []
for x in range(1, 4):
    df = pd.DataFrame(np.random.randint(99, size=(3, 3)), columns=['AA', 'BB', 'CC'])
    mylist.append(df)
dfs = pd.concat(mylist)
Note that printing mylist is never going to show you something along the lines of mylist = [df1, df2, df3], even if you hardcode that; it will print the entire content of every dataframe inside your list. If you don't know in advance how many dataframes you're going to concatenate, just implement a while loop that breaks when you want to stop creating dataframes.
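A minimal sketch of that while-loop idea (the stopping condition here is only a placeholder for whatever ends your real loop):
mylist = []
while True:
    df = pd.DataFrame(np.random.randint(99, size=(3, 3)), columns=['AA', 'BB', 'CC'])
    mylist.append(df)
    if len(mylist) >= 5:  # placeholder: replace with your real stopping condition
        break
dfs = pd.concat(mylist)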
Consider another example
# create a list of 100 dataframes (df0 to df99)
mylist = []
for x in range(100):
    df = pd.DataFrame(np.random.randint(99, size=(3, 3)), columns=['AA', 'BB', 'CC'])
    mylist.append(df)
concat_range = input("Range of dataframes to concatenate (0-100): ")
i, j = concat_range.split(" ")
dfs = pd.concat(mylist[int(i) : int(j)])
# further operations on dfs
Now, let's say I am the user and I want to concatenate df5 to df32.
>>> Range of dataframes to concatenate (0-100): 5 32
>>> dfs
AA BB CC
0 28 37 36
1 34 18 14
2 39 41 97
0 44 66 76
1 57 16 3
.. .. .. ..
1 43 87 74
2 67 70 73
0 40 60 57
1 23 63 70
2 96 24 31
[81 rows x 3 columns]

Related

Pandas groupby custom nlargest

While trying to solve my own question here, I came up with an interesting problem. Consider this dataframe:
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame(dict(group=np.random.choice(["a", "b", "c", "d"], size=100),
                       values=np.random.randint(0, 100, size=100)))
I want to select the top values per group, but according to some range. Let's say, the top x to y values per group. If any group has fewer than x values in it, give the top(min((y-x), x)) values for that group.
In general, I am looking for a custom-made alternative function to be used with groupby objects that selects not the top n values, but the top x to y range of values.
EDIT: nlargest() is a special case of the solution to my problem, where x = 1 and y = n.
Any further help or guidance will be appreciated.
Adding an example with this df and top(3, 6): for every group, output the values from the top 3rd to the top 6th value:
group value
a 190
b 166
a 163
a 106
b 86
a 77
b 70
b 69
c 67
b 54
b 52
a 50
c 24
a 20
a 11
As group c has just two members, it will output top(3)
group value
a 106
a 77
a 50
b 69
b 54
b 52
c 67
c 24
There are other ways of doing this and, depending on how large your dataframe is, you may want to search for "groupby slice" or something similar (a sketch of that alternative is shown after the output below). You may also need to check that my conditions are correct (<, <=, etc.).
x=3
y=6
# this gets the groups which don't meet the x minimum
df1 = df[df.groupby('group')['value'].transform('count')<x]
# this df takes those groups meeting the minimum and then shifts by x-1; does some cleanup and chooses nlargest
df2 = df[df.groupby('group')['value'].transform('count')>=x].copy()
df2['shifted'] = df2.groupby('group').shift(periods=-(x-1))
df2.drop('value', axis=1, inplace=True)
df2 = df2.groupby('group')['shifted'].nlargest(y-x).reset_index().rename(columns={'shifted':'value'}).drop('level_1', axis=1)
# putting it all together
df_final = pd.concat([df1, df2])
df_final
group value
8 c 67.0
12 c 24.0
0 a 106.0
1 a 77.0
2 a 50.0
3 b 70.0
4 b 69.0
5 b 54.0
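As a rough sketch of the "groupby slice" alternative mentioned above (using the 'group'/'value' column names from the worked example, and keeping groups smaller than x whole, as in the solution above; the exact slice boundaries may need the same <, <= checking):
x, y = 3, 6
out = (df.sort_values('value', ascending=False)
         .groupby('group', group_keys=False)
         .apply(lambda g: g.iloc[x-1:y-1] if len(g) >= x else g))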

How to mark or select rows in one dataframe, where value lies between any of the ranges in another dataframe featuring additional identifier

I am a bloody beginner and working with pandas and numpy for scientific dataframes.
I have two large dataframes, one with coordinates (geneX - position) and the other with ranges of coordinates (genes - start - end). I would like to select (or mark) all rows in the first dataframe where the coordinate falls into any of the ranges in the second dataframe.
For example:
import pandas as pd
import numpy as np
test = {'A': ['gene1', 'gene1', 'gene2', 'gene3', 'gene3', 'gene3', 'gene4'],
        'B': [1, 11, 21, 31, 41, 51, 61],
        'C': [10, 20, 30, 40, 50, 60, 70],
        'D': [4, 64, 25, 7, 36, 56, 7]}
df1 = pd.DataFrame(test, columns = ['A', 'D'])
df2 = pd.DataFrame(test, columns = ['A', 'B', 'C'])
This gives two dataframes looking like this:
df1:
A D
0 gene1 4 <-- I want this row
1 gene1 64
2 gene2 25 <-- and this row
3 gene3 7
4 gene3 36 <-- and this row
5 gene3 56 <-- and this row
6 gene4 7
df2:
A B C
0 gene1 1 10
1 gene1 11 20
2 gene2 21 30
3 gene3 31 40
4 gene3 41 50
5 gene3 51 60
6 gene4 61 70
I managed to get this far:
df_final = []
for i, j in zip(df2["B"], df2["C"]):
    x = df1[(df1["D"] >= i) & (df1["D"] <= j)]
    df_final.append(x)
df_final = pd.concat(df_final, axis=0).reset_index(drop=True)
df_final = df_final.drop_duplicates()
df_final
But this only checks whether the value in df1['D'] is in any of the ranges in df2; I fail to incorporate the "gene" identifier. Basically, for each row in df1, I need to loop through df2 and first check whether the gene matches, and if that's the case, check whether the coordinate is in the range.
Can anyone help me to figure this out?
Additional question: Can I make it so that df1 is left intact and a new column is added with "True" for the rows that match the conditions? If not, I would create a new df from the selected rows, add a column with the "True" label, and then merge it back into the first one.
Thank you for your help. I really appreciate it !
Let us do it in steps:
Reset the index of df1
Merge df1 with df2 on column A (basically merge rows with the same gene)
Query the merged dataframe to filter the rows where column D falls between B and C
Flag the rows by testing the membership of the index of df1 in the index column of the filtered rows
m = df1.reset_index().merge(df2, on='A').query('B <= D <= C')
df1['flag'] = df1.index.isin(m['index'])
print(df1)
A D flag
0 gene1 4 True
1 gene1 64 False
2 gene2 25 True
3 gene3 7 False
4 gene3 36 True
5 gene3 56 True
6 gene4 7 False
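If you then want only the matching rows rather than the flag column, a one-line follow-up on the result above is:
df1[df1['flag']]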

Pandas Groupby -- efficient selection/filtering of groups based on multiple conditions?

I am trying to filter dataframe groups in Pandas based on multiple (any) conditions, but I cannot seem to get a fast Pandas 'native' one-liner.
Here I generate an example dataframe of 2*n*n rows and 4 columns:
import itertools
import random
import pandas as pd

n = 100
lst = range(0, n)
df = pd.DataFrame(
    {'A': list(itertools.chain.from_iterable(itertools.repeat(x, n*2) for x in lst)),
     'B': list(itertools.chain.from_iterable(itertools.repeat(x, 1*2) for x in lst)) * n,
     'C': random.choices(list(range(100)), k=2*n*n),
     'D': random.choices(list(range(100)), k=2*n*n)})
resulting in dataframes such as:
A B C D
0 0 0 26 49
1 0 0 29 80
2 0 1 70 92
3 0 1 7 2
4 1 0 90 11
5 1 0 19 4
6 1 1 29 4
7 1 1 31 95
I want to:
select groups grouped by A and B,
filter the groups down to those where any value in the group is greater than 50, in both columns C and D.
A "native" Pandas one-liner would be the following:
df.groupby([df.A, df.B]).filter(lambda x: ((x.C > 50).any() & (x.D > 50).any()))
which produces
A B C D
2 0 1 70 92
3 0 1 7 2
This is all fine for small dataframes (say n < 20), but this solution takes quite a long time for large dataframes (for example, 4.58 s when n = 100).
I have an alternative, step-by-step solution which achieves the same result, but runs much faster (28.1 ms when n = 100):
df_g = df.assign(key_C=df.C > 50, key_D=df.D > 50).groupby([df.A, df.B])
df_C_bool = df_g.key_C.transform('any')
df_D_bool = df_g.key_D.transform('any')
df[df_C_bool & df_D_bool]
but it is arguably a bit uglier. My questions are:
Is there a better "native" Pandas solution for this task? , and
Is there a reason for the sub-optimal performance of my version of the "native" solution?
Bonus question:
In fact, I only want to extract the groups themselves, not their data. I.e., I only need
A B
0 1
in the above example. Is there a way to do this with Pandas without going through the intermediate step I did above?
This is similar to your second approach, but chained together:
mask = (df[['C','D']].gt(50)            # in case you have different thresholds for `C`, `D`, e.g. [50, 60]
        .all(axis=1)                    # check for both True on the rows
        .groupby([df['A'], df['B']])    # normal groupby
        .transform('max')               # 'any' instead of 'max' also works
       )
df.loc[mask]
If you don't want the data, you can forgo the transform:
mask = df[['C','D']].min(axis=1).gt(50).groupby([df['A'],df['B']]).any()
mask[mask].index
# out
# MultiIndex([(0, 1)],
# names=['A', 'B'])

Why this inconsistency between a Dataframe and a column of it?

While debugging a nasty error in my code, I came across what looks like an inconsistency in the way DataFrames work (using pandas 1.0.3):
import pandas as pd
df = pd.DataFrame([[10*k, 11, 22, 33] for k in range(4)], columns=['d', 'k', 'c1', 'c2'])
y = df.k
X = df[['c1', 'c2']]
Then I tried to add a column to y (forgetting that y is a Series, not a Dataframe):
y['d'] = df['d']
I'm now aware that this adds a weird row to the Series; y is now:
0 11
1 11
2 11
3 11
d 0 0
1 10
2 20
3 30
Name: d, dtype...
Name: k, dtype: object
But the weird thing is that now:
>>> df.shape, df['k'].shape
((4, 4), (5,))
And df and df['k'] look like:
d k c1 c2
0 0 11 22 33
1 10 11 22 33
2 20 11 22 33
3 30 11 22 33
and
0 11
1 11
2 11
3 11
d 0 0
1 10
2 20
3 30
Name: d, dtype...
Name: k, dtype: object
There are a few things at work here:
A pandas series can store objects of arbitrary types.
y['d'] = _ adds a new entry to the series y with the name 'd'.
Thus, y['d'] = df['d'] adds a new entry to the series y whose name is 'd' and whose value is the series df['d'].
So you have added a series as the last entry of the series y. You can verify that
(y['d'] == y.iloc[-1]).all() == True and
(y.iloc[-1] == df['d']).all() == True.
To clarify the inconsistency between df and df.k: note that df.k, df['k'], and df.loc[:, 'k'] return a 'view' of column k as a series, so adding an entry to that series appends it directly to this view. However, df.k shows the entire series, whereas df only shows the series up to length df.shape[0]. Hence the inconsistent behavior.
I agree that this behavior is prone to bugs and should be fixed. View vs. copy is a common cause of many issues. In this case, df.iloc[:, 1] behaves correctly and should be used instead.
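A minimal sketch of a safer pattern (my addition, not from the answer above): take an explicit copy before mutating, so later writes cannot touch df:
y = df[['k']].copy()   # a one-column DataFrame copy instead of a Series view
y['d'] = df['d']       # adds a proper column; df is left untouched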

How to run assembled sample data

I have a pandas dataframe assembled from various samples that I randomly picked. Now, I want to run it 10,000 times and get the mean values of the columns ['MP_Learning'] and ['LCC_saving'] for each row.
How should I write the code?
I tried
output=np.mean(df), but it didn't work.
PC EL MP_Learning LCC_saving
0 1 0 24 95
1 1 1 35 67
2 1 2 12 23
3 1 3 23 45
4 2 0 36 67
5 2 1 74 10
6 2 2 80 23
np.random.seed()
output = []
for i in range(10000):
    output = np.mean(df)
output
You did not post your entire code, so I don't know where the data comes from. However, I replicated something similar, and here is the solution. For your loop code, though, you are supposed to append to output. Use only one of the two lines in the for loop below, unless you need them both.
import pandas as pd
import numpy as np

df = pd.DataFrame([[1, 0, 24, 95],
                   [1, 1, 35, 67],
                   [1, 2, 12, 23],
                   [1, 3, 23, 45],
                   [2, 0, 36, 67],
                   [2, 1, 74, 10],
                   [2, 2, 80, 23]],
                  columns=["PC", "EL", "MP_Learning", "LCC_saving"],
                  index=[0, 1, 2, 3, 4, 5, 6]).T

output = []
for i in range(10000):
    # Use the line below to get the mean across both columns
    output.append(np.mean([df.loc["MP_Learning"], df.loc["LCC_saving"]]))
    # Use the line below to get the mean of one column
    output.append(np.mean(df.loc["MP_Learning"]))
print(output)
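As a side note (my addition, not part of the original answer): if the goal is simply the average of those two columns over all rows of the original, untransposed frame, pandas can compute it directly without the loop:
df_orig = df.T                                        # undo the transpose from the snippet above
print(df_orig[['MP_Learning', 'LCC_saving']].mean())  # one mean per column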