Pandas: take the minimum of two operations on two dataframes, while preserving index

I'm a beginner with Pandas. I have two dataframes, df1 and df2, with three columns each, labelled by some index.
I would like to get a third dataframe whose entries are
min( df1-df2, 1-df1-df2 )
for each column, while preserving the index.
I don't know how to do this on all three columns at once. If I try e.g. np.min( df1-df2, 1-df1-df2 ) I get TypeError: 'DataFrame' objects are mutable, thus they cannot be hashed, whereas min( df1-df2, 1-df1+df2 ) gives ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I can't use apply because I've got more than one dataframe. Basically, I would like to use something like subtract, but with the ability to define my own function.
Example: consider these two dataframes
df0 = pd.DataFrame( [[0.1,0.2,0.3], [0.3, 0.1, 0.2], [0.1, 0.3, 0.9]], index=[2,1,3], columns=['px', 'py', 'pz'] )
In [4]: df0
Out[4]:
px py pz
2 0.1 0.2 0.3
1 0.3 0.1 0.2
3 0.1 0.3 0.9
and
df1 = pd.DataFrame( [[0.9,0.1,0.9], [0.1,0.2,0.1], [0.3,0.1,0.8]], index=[3,1,2], columns=['px', 'py', 'pz'])
px py pz
3 0.9 0.1 0.9
1 0.1 0.2 0.1
2 0.3 0.1 0.8
My desired output is a new dataframe df, made up of three columns 'px', 'py', 'pz', whose entries are:
for j in range(1, 4):
    dfx[j-1] = min( df0['px'][j] - df1['px'][j], 1 - df0['px'][j] + df1['px'][j] )
for df['px'], and similarly for 'py' and 'pz'.
px py pz
1 0.2 -0.1 0.1
2 -0.2 0.1 -0.5
3 -0.8 0.2 0.0
I hope it's clear now! Thanks in advance!

pandas is smart enough to match up the columns and index values for you in a vectorized way. If you're looping over a dataframe, you're probably doing it wrong.
m1 = df0 - df1
m2 = 1 - (df0 + df1)
# Take the values from m1 where they're less than
# the corresponding value in m2; otherwise, take m2:
out = m1[m1.lt(m2)].combine_first(m2)
# Another method: Combine our two calculated frames,
# groupby the index, and take the minimum.
out = pd.concat([m1, m2]).groupby(level=0).min()
print(out)
# Output:
px py pz
1 0.2 -0.1 0.1
2 -0.2 0.1 -0.5
3 -0.8 0.2 -0.8
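A third option, sketched below, is to let a NumPy ufunc do the element-wise comparison. This assumes the two intermediate frames share the same index and columns, which they do here because both come from aligned arithmetic on df0 and df1:
import numpy as np
m1 = df0 - df1            # indices are aligned automatically
m2 = 1 - (df0 + df1)
out = np.minimum(m1, m2)  # element-wise minimum of two aligned DataFrames
print(out)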

Related

How to remove columns that have all values below a certain threshold

I am trying to remove any columns in my dataframe that do not have at least one value above 0.9. I know this probably isn't the most efficient way to do it, but I can't find the problem with it. I know it isn't correct because it only removes one column, and I know it should be closer to 20. So I do a count to see how many values are below 0.9, and then if the count equals the length of the list of column values, I drop that column. Thanks in advance.
for i in range(len(df3.columns)):
    count = 0
    for j in df3.iloc[:, i].tolist():
        if j < .9:
            count += 1
    if len(df3.iloc[:, i].tolist()) == count:
        df4 = df3.drop(df3.columns[i], axis=1)
df4
You can loop through each column in the dataframe and check its maximum value against your defined threshold (0.9 in this case); if no value exceeds 0.9, drop the column.
The input:
col1 col2 col3
0 0.2 0.8 1.0
1 0.3 0.5 0.5
Code:
import pandas as pd

# define the dataframe
df = pd.DataFrame({'col1': [0.2, 0.3], 'col2': [0.8, 0.5], 'col3': [1, 0.5]})
# define the threshold
threshold = 0.9
# loop through each column in the dataframe
for col in df:
    # get the maximum value in the column and
    # check whether it is less than or equal to the defined threshold
    if df[col].max() <= threshold:
        # if so, drop the column
        df = df.drop([col], axis=1)
This outputs:
col3
0 1.0
1 0.5
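If you'd rather avoid the loop entirely, here is a minimal sketch of the same idea using boolean column selection, keeping only the columns whose maximum exceeds the threshold:
df = pd.DataFrame({'col1': [0.2, 0.3], 'col2': [0.8, 0.5], 'col3': [1, 0.5]})
threshold = 0.9
# keep only the columns whose maximum value is above the threshold
df = df.loc[:, df.max() > threshold]
print(df)
#    col3
# 0   1.0
# 1   0.5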

Converting Lists within Pandas Dataframe into New DataFrame

I have a dataframe:
df =
col1 col2
0 [0.1,0.2,0.3] [1,2,3]
1 [0.5,0.6,0.7] [11,12,13]
My goal is to re-create a dataframe from the row at index 0:
new_df =
new_col1 new_col2
0 0.1 1
1 0.2 2
2 0.3 3
What I tried was accessing it row by row:
new_col1 = df.col1[0]
new_col2 = df.col2[0]
But new_col1 results in the following instead of a list, so I am unsure how to approach this.
0 [0.1,0.2,0.3]
Name: col1, dtype: object
Thanks.
Here is a way using apply.
df.apply(pd.Series.explode).loc[0]
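On the sample frame this keeps the original column names and a repeated index 0, so you may want to tidy it up. A sketch, assuming df holds the lists shown above (the new_ prefix is only there to match the desired output):
new_df = (df.apply(pd.Series.explode)
            .loc[0]
            .reset_index(drop=True)
            .add_prefix('new_'))
print(new_df)
#   new_col1 new_col2
# 0      0.1        1
# 1      0.2        2
# 2      0.3        3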
You can create a new DataFrame by selecting the first row with DataFrame.loc or DataFrame.iloc, then transposing with DataFrame.T and using DataFrame.add_prefix for the new column names:
df1 = pd.DataFrame(df.iloc[0].tolist(), index=df.columns).T.add_prefix('new_')
print (df1)
new_col1 new_col2
0 0.1 1.0
1 0.2 2.0
2 0.3 3.0
new_df = pd.DataFrame([new_col1, new_col2]).transpose()
If you want to add column names,
new_df.columns = ["new_col1","new_col2"]
You can use the list() function for this:
>>> new_col1
[0.1, 0.2, 0.3]
>>> new_col1_ = list(new_col1)
>>> new_col1_
[0.1, 0.2, 0.3]
>>> type(new_col1_)
<class 'list'>
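For completeness, here is a loop-free sketch that builds the new frame straight from the row at index 0, assuming every cell in that row holds a list of the same length:
df = pd.DataFrame({'col1': [[0.1, 0.2, 0.3], [0.5, 0.6, 0.7]],
                   'col2': [[1, 2, 3], [11, 12, 13]]})
new_df = pd.DataFrame({f'new_{c}': df.loc[0, c] for c in df.columns})
print(new_df)
#    new_col1  new_col2
# 0       0.1         1
# 1       0.2         2
# 2       0.3         3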

Pandas convert array column into multiple columns with a condition

I have a pandas data frame with 2 columns:
embedding: an array column, where the size of each embedding is size_of_embedding
language
like this:
embedding language
[0.1 0.2 0.3] fr
[0.1 0.4 0.4] en
[0.8 0.1 0.1] fr
Given a starting integer n = 10, for each value of the embedding column I want to add columns to the above data frame like this:
embedding language feature1 feature2 feature3
[0.1 0.2 0.3] fr 10:0.1 11:0.2 12:0.3
[0.1 0.4 0.4] en 13:0.1 14:0.4 15:0.4
[0.8 0.1 0.1] fr 10:0.8 11:0.1 12:0.1
So feature1 = 1st embedding value, feature2 = 2nd embedding value, and so on. For the next language, the starting feature value is n + size_of_embedding.
So, for each language, the number of columns added is exactly equal to size_of_embedding, and for each new language encountered we start at n + size_of_embedding. Is there an easy way of doing this? Thanks.
First ensure that the embedding column is in fact an array. If it is stored as a string, you can convert it to a numpy array like so:
df.embedding = df.embedding.apply(lambda x: np.fromstring(x[1:-1], sep=' '))
Then create a lookup of languages and their starting values, and use that to generate the features:
lookup = {'fr': 10, 'en': 13}
If you have too many languages to create this by hand, you could try the following statement, replacing 10 and 3 as appropriate for your actual dataset:
lookup = {l:10+i*3 for i, l in enumerate(df.language.drop_duplicates().to_list())}
Generating the features is then just a lookup & a list comprehension. Here I've used the helper function f to keep the code tidy.
def f(lang, embedding):
    return [f'{lookup[lang] + i}:{e}' for i, e in enumerate(embedding)]
new_names = ['feature1', 'feature2', 'feature3']
df[new_names] = df.apply(lambda x: f(x.language, x.embedding), axis=1, result_type='expand')
df now looks like:
embedding language feature1 feature2 feature3
0 [0.1, 0.2, 0.3] fr 10:0.1 11:0.2 12:0.3
1 [0.1, 0.4, 0.4] en 13:0.1 14:0.4 15:0.4
2 [0.8, 0.1, 0.1] fr 10:0.8 11:0.1 12:0.1
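Putting the pieces together as one runnable sketch on the sample data (this assumes the embedding column already holds numeric arrays, and that n = 10 and size_of_embedding = 3 as in the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'embedding': [np.array([0.1, 0.2, 0.3]),
                                 np.array([0.1, 0.4, 0.4]),
                                 np.array([0.8, 0.1, 0.1])],
                   'language': ['fr', 'en', 'fr']})
n, size_of_embedding = 10, 3

# starting value per language, in order of first appearance
lookup = {l: n + i * size_of_embedding
          for i, l in enumerate(df.language.drop_duplicates().to_list())}

def f(lang, embedding):
    return [f'{lookup[lang] + i}:{e}' for i, e in enumerate(embedding)]

new_names = [f'feature{i + 1}' for i in range(size_of_embedding)]
df[new_names] = df.apply(lambda x: f(x.language, x.embedding),
                         axis=1, result_type='expand')
print(df)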
Longhand
df=pd.DataFrame({'embedding':['[0.1 0.2 0.3]','[0.1 0.4 0.4]','[0.8 0.1 0.1]'],'language':['fre','en','fr']})
df['feature1']=0
df['feature2']=0
df['feature3']=0
df['z'] = df.embedding.str.strip('[]')  # remove the square brackets
df['y'] = df.z.str.findall(r'(\d+[.]+\d+)')  # extract each digit-dot-digit value in the list
lst = ['10:', '11:', '12:']   # list lookup for `fr/fre`
lst2 = ['13:', '14:', '15:']  # list lookup for `en`
Create two frames, one for fr/fre and one for en, using a boolean selection
m=df.language.isin(['en'])
df2=df[~m]
df3=df[m]
Compute feature1, feature2 and feature3
df2['k'] = [lst + i for i in df2['y']]
df3['m'] = [lst2 + i for i in df3['y']]
# each element of 'k'/'m' is [prefix1, prefix2, prefix3, value1, value2, value3]
df2['feature1'] = [i[0] + i[3] for i in df2['k']]
df2['feature2'] = [i[1] + i[4] for i in df2['k']]
df2['feature3'] = [i[2] + i[5] for i in df2['k']]
df3['feature1'] = [i[0] + i[3] for i in df3['m']]
df3['feature2'] = [i[1] + i[4] for i in df3['m']]
df3['feature3'] = [i[2] + i[5] for i in df3['m']]
Concat df2 and df3
pd.concat([df3.iloc[:, :5], df2.iloc[:, :5]])

Reshape pandas dataframe and work with columns

I have dataset:
dat = {'Block': ['blk_-105450231192318816', 'blk_-1076549517733373559', 'blk_-1187723472581877455', 'blk_-1385756122847916710', 'blk_-1470784088028862059'], 'Seq': ['13 13 13 15',' 15 13 13', '13 13 15', '13 13 15 13', '13'], 'Time' : ['1257712532.0 1257712532.0 1257712532.0 1257712532.0','1257712533.0 1257712534.0 1257712534.0','1257712533.0 1257712533.0 1257712533.0','1257712532.0 1257712532.0 1257712532.0 1257712534.0','1257712535.0']}
df = pd.DataFrame(data = dat)
Block is id. Seq is id. Time is time in unix format.
I want to change columns or create new columns.
1) I need to join the Seq and Time columns by the index of the elements in the two columns.
2) Then I want to get the delta of the Time column (next element minus previous), with the first element set to zero.
And in the end, write to a file the rows from different blocks which have the same Seq id.
I want to solve this problem with pandas methods.
I tried to solve it with a dictionary, but this way is complicated.
dict_block = dict((key, []) for key in np.unique(df.Block))
for idx, row in enumerate(df.Seq):
    block = df.Block[idx]
    dict_seq = dict((key, []) for key in np.unique(row.split(' ')))
    for idy, key in enumerate(row.split(' ')):
        item = df.Time[idx].split(' ')[idy]
        dict_seq[key].append(item)
    dict_block[block].append(dict_seq)
1) For example:
blk_-105450231192318816 :
13: 1257712532.0, 1257712532.0, 1257712532.0
15: 1257712532.0
2) For example:
blk_-105450231192318816 :
13: 0, (1257712532.0 - 1257712532.0) = 0, (1257712532.0 - 1257712532.0) = 0
15: 0
Output for dictionary try:
{'blk_-105450231192318816':
[{'13': ['1257712532.0', '1257712532.0','1257712532.0'],
'15': ['1257712532.0']}],
'blk_-1076549517733373559':
[{'13': ['1257712534.0', '1257712534.0'],
'15': ['1257712533.0']}],
'blk_-1187723472581877455':
[{'13': ['1257712533.0', '1257712533.0'],
'15': ['1257712533.0']}],
'blk_-1385756122847916710':
[{'13': ['1257712532.0',
'1257712532.0',
'1257712534.0'],
'15': ['1257712532.0']}],
'blk_-1470784088028862059':
[{'13': ['1257712535.0']}]}
Summary:
I want to solve the following points with pandas/numpy methods:
1) Group the columns
2) Get the delta of time (t1 - t0)
Waiting for your comment :)
Solution 1: Working with dicts
If you prefer working with dictionaries, you can use apply and custom methods where you do your tricks with the dictionaries.
df is the sample dataframe you provided. Here I've made two methods. I hope the code is clear enough to be understandable.
def grouping(x):
    """Make a dictionary combining the 'Seq' and 'Time' columns.
    'Seq' elements are the keys, 'Time' elements are the values. 'Time' elements
    corresponding to the same key are stored in a list.
    """
    # splitting the strings and making them numeric
    keys = list(map(int, x['Seq'].split()))
    times = list(map(float, x['Time'].split()))
    # building the result dictionary
    res = {}
    for i, k in enumerate(keys):
        try:
            res[k].append(times[i])
        except KeyError:
            res[k] = [times[i]]
    return res

def timediffs(x):
    """Make a dictionary starting from the 'GroupedSeq' column, which can
    be created with the grouping function.
    It contains the difference between the times of each key.
    """
    ddt = x['GroupedSeq']
    res = {}
    # iterating over the dictionary to calculate the differences
    for k, v in ddt.items():
        res[k] = [0.0] + [t1 - t0 for t0, t1 in zip(v[:-1], v[1:])]
    return res
df['GroupedSeq'] = df.apply(grouping, axis=1)
df['difftimes'] = df.apply(timediffs, axis=1)
What apply does here is to apply the function on each row. The result is stored in a new column of the dataframe. Now df contains two new columns; you can drop the original 'Seq' and 'Time' columns if you wish by doing: df.drop(['Seq', 'Time'], axis=1, inplace=True). In the end, df looks like:
Block GroupedSeq difftimes
0 blk_-105450231192318816 {13: [1257712532.0, 1257712532.0, 1257712532.0... {13: [0.0, 0.0, 0.0], 15: [0.0]}
1 blk_-1076549517733373559 {15: [1257712533.0], 13: [1257712534.0, 125771... {15: [0.0], 13: [0.0, 0.0]}
2 blk_-1187723472581877455 {13: [1257712533.0, 1257712533.0], 15: [125771... {13: [0.0, 0.0], 15: [0.0]}
3 blk_-1385756122847916710 {13: [1257712532.0, 1257712532.0, 1257712534.0... {13: [0.0, 0.0, 2.0], 15: [0.0]}
4 blk_-1470784088028862059 {13: [1257712535.0]} {13: [0.0]}
As you can see, here pandas itself is used only to apply the custom methods, but inside those methods there is normal python code at work.
Solution 2: No dictionaries, more Pandas
Pandas itself is not very useful if you are storing lists or dicts in the dataframe, so I propose an alternative solution without dictionaries. I use groupby in combination with apply to perform operations on selected rows based on their values.
groupby selects a subsample of the dataframe based on the values of one or more columns: all rows with the same values in those columns are grouped, and a method or action is performed on this subsample.
Again, df is the sample dataframe you provided.
df1 = df.copy()  # working on a copy, not really needed but I wanted to preserve the original
# splitting the strings and making them numeric lists using apply
df1['Seq'] = df1['Seq'].apply(lambda x : list(map(int, x.split())))
df1['Time'] = df1['Time'].apply(lambda x : list(map(float, x.split())))
# for each index in 'Block', unnest the list in 'Seq', making it a secondary index
df2 = df1.groupby('Block').apply(lambda x : pd.DataFrame([[e] for e in x['Time'].iloc[0]], index=x['Seq'].tolist()))
# resetting the index and renaming the column names created by pandas
df2 = df2.reset_index().rename(columns={'level_1':'Seq', 0:'Time'})
# custom method to store the differences between times
def timediffs(x):
    x['tdiff'] = x['Time'].diff().fillna(0.0)
    return x
df3 = df2.groupby(['Block', 'Seq']).apply(timediffs)
The final df3 is:
Block Seq Time tdiff
0 blk_-105450231192318816 13 1.257713e+09 0.0
1 blk_-105450231192318816 13 1.257713e+09 0.0
2 blk_-105450231192318816 13 1.257713e+09 0.0
3 blk_-105450231192318816 15 1.257713e+09 0.0
4 blk_-1076549517733373559 15 1.257713e+09 0.0
5 blk_-1076549517733373559 13 1.257713e+09 0.0
6 blk_-1076549517733373559 13 1.257713e+09 0.0
7 blk_-1187723472581877455 13 1.257713e+09 0.0
8 blk_-1187723472581877455 13 1.257713e+09 0.0
9 blk_-1187723472581877455 15 1.257713e+09 0.0
10 blk_-1385756122847916710 13 1.257713e+09 0.0
11 blk_-1385756122847916710 13 1.257713e+09 0.0
12 blk_-1385756122847916710 15 1.257713e+09 0.0
13 blk_-1385756122847916710 13 1.257713e+09 2.0
14 blk_-1470784088028862059 13 1.257713e+09 0.0
As you can see, no dictionaries inside the dataframe. You have repetitions in columns 'Block' and 'Seq', but that's unavoidable.
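As a side note, on pandas 1.3 or newer (where explode accepts multiple columns) the same reshape can be sketched without the intermediate apply, assuming every row has equally many Seq and Time tokens:
df1 = df.copy()
df1['Seq'] = df1['Seq'].str.split()
df1['Time'] = df1['Time'].str.split()
# one row per (Block, Seq, Time) triple
df2 = df1.explode(['Seq', 'Time'], ignore_index=True)
df2['Time'] = df2['Time'].astype(float)
df2['tdiff'] = df2.groupby(['Block', 'Seq'])['Time'].diff().fillna(0.0)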

Pandas: Create a new column with random values based on conditional

I've tried reading similar questions before asking, but I'm still stumped.
Any help is appreciated.
Input:
I have a pandas dataframe with a column labeled 'radon' which has values in the range: [0.5, 13.65]
Output:
I'd like to create a new column where all radon values equal to 0.5 are changed to a random value between 0.1 and 0.5.
I tried this:
df['radon_adj'] = np.where(df['radon']==0.5, random.uniform(0, 0.5), df.radon)
However, I get the same random number for all values of 0.5.
I tried this as well. It creates random numbers, but the else statement does not copy the original values:
df['radon_adj'] = df['radon'].apply(lambda x: random.uniform(0, 0.5) if x == 0.5 else df.radon)
One way would be to create all the random numbers you might need before you select them using where:
>>> df = pd.DataFrame({"radon": [0.5, 0.6, 0.5, 2, 4, 13]})
>>> df["radon_adj"] = df["radon"].where(df["radon"] != 0.5, np.random.uniform(0.1, 0.5, len(df)))
>>> df
radon radon_adj
0 0.5 0.428039
1 0.6 0.600000
2 0.5 0.385021
3 2.0 2.000000
4 4.0 4.000000
5 13.0 13.000000
You could be a little smarter and only generate as many random numbers as you're actually going to need, but it probably took longer for me to type this sentence than you'd save. (It takes me 9 ms to generate ~1M numbers.)
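A sketch of that idea, drawing only as many random numbers as there are 0.5 entries (assumes numpy is imported as np):
mask = df["radon"].eq(0.5)
df["radon_adj"] = df["radon"]
df.loc[mask, "radon_adj"] = np.random.uniform(0.1, 0.5, mask.sum())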
Your apply approach would work too if you used x instead of df.radon:
>>> df['radon_adj'] = df['radon'].apply(lambda x: random.uniform(0.1, 0.5) if x == 0.5 else x)
>>> df
radon radon_adj
0 0.5 0.242991
1 0.6 0.600000
2 0.5 0.271968
3 2.0 2.000000
4 4.0 4.000000
5 13.0 13.000000