How to add values to a MultiIndex-column DataFrame - pandas

My dataframe is
a b
1 2 1 2
0 0.281045 0.975469 -0.538213 -0.180008
1 0.128696 1.875480 0.247637 -0.047927
I want to insert this matrix at columns (a, 3) and (b, 3):
[[1, 1],
[1, 1]]
a b
1 2 3 1 2 3
0 0.281045 0.975469 1. -0.538213 -0.180008 1.
1 0.128696 1.875480 1. 0.247637 -0.047927 1.
It seems there is no clean way to add values to a MultiIndex DataFrame. Here is the code that I tried:
df[:, :, 3] = [[1, 1],
               [1, 1]]
But it didn't work...

You can create a new DataFrame with a MultiIndex and then append it to the data with DataFrame.join, sorting the MultiIndex afterwards:
arr = np.array([[1, 1], [1, 1]])
df1 = pd.DataFrame(arr,
                   index=df.index,
                   columns=pd.MultiIndex.from_product([df.columns.levels[0], [3]]))
df = df.join(df1).sort_index(axis=1)
print(df)
a b
1 2 3 1 2 3
0 0.281045 0.975469 1 -0.538213 -0.180008 1
1 0.128696 1.875480 1 0.247637 -0.047927 1
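An alternative sketch (on a toy frame with the same two-level column layout as the question): assign each new column directly by its full tuple key, then sort the columns.

```python
import numpy as np
import pandas as pd

# Toy frame with the same two-level column structure as in the question.
df = pd.DataFrame(np.random.randn(2, 4),
                  columns=pd.MultiIndex.from_product([['a', 'b'], [1, 2]]))

# Assign each new column by its full (level0, level1) tuple key.
new_vals = {'a': [1, 1], 'b': [1, 1]}
for top in df.columns.levels[0]:
    df[(top, 3)] = new_vals[top]

df = df.sort_index(axis=1)  # keep the column MultiIndex ordered
print(df)
```

This avoids building a second frame, at the cost of one assignment per top-level label.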

Related

pandas merge on list type rows

I have two DataFrames.
df1:
import pandas as pd
values=[[1,[1,2]],[2,[2,2]],[3,[2,3]],[4,[]]]
df=pd.DataFrame(values,columns=['idx','value'])
print(df)
'''
idx value
1 [1,2]
2 [2,2]
3 [2,3]
4 []
'''
df2:
values=[[1,'json'],[2,'csv'],[3,'xml']]
df2=pd.DataFrame(values,columns=['id2','type'])
print(df2)
'''
id2 type
1 json
2 csv
3 xml
'''
I want to merge these two DataFrames, but the value column in the first df consists of lists. Expected output:
idx value type
1 [1,2] [json,csv]
2 [2,2] [csv,csv]
3 [2,3] [csv,xml]
4 [] []
I tried the code below but got an error. Is there a way I can merge on each element in the list?
final=df.merge(df2,how='left',left_on='value',right_on='id2')
#returns
TypeError: unhashable type: 'list'
Here is one way to do it: explode the value column, merge each element against id2, then aggregate back to lists. Note the merge has to be on the exploded value column (cast back to int, since explode produces an object column), and the inner merge drops the empty-list row:
out = df.explode('value').dropna(subset=['value'])
out['value'] = out['value'].astype(int)
out = (out.merge(df2, left_on='value', right_on='id2')
          .drop(columns='id2')
          .pivot_table(index='idx', aggfunc=list)
          .reset_index())
   idx         type   value
0    1  [json, csv]  [1, 2]
1    2   [csv, csv]  [2, 2]
2    3   [csv, xml]  [2, 3]
Explode the value column, map each element to its type via the id2 lookup, then group by the original index and aggregate back to lists:
d = df2.set_index('id2')['type']
df['type'] = df['value'].explode().map(d).groupby(level=0).agg(list)
Alternative approach with list comp and dict lookup
d = df2.set_index('id2')['type']
df['type'] = df['value'].map(lambda l: [d.get(i) for i in l])
idx value type
0 1 [1, 2] [json, csv]
1 2 [2, 2] [csv, csv]
2 3 [2, 3] [csv, xml]
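For reference, a self-contained sketch of the dict-lookup approach above, including the empty-list row from the expected output (the per-element d.get lookup simply yields an empty list there):

```python
import pandas as pd

df = pd.DataFrame([[1, [1, 2]], [2, [2, 2]], [3, [2, 3]], [4, []]],
                  columns=['idx', 'value'])
df2 = pd.DataFrame([[1, 'json'], [2, 'csv'], [3, 'xml']],
                   columns=['id2', 'type'])

# Build an id2 -> type lookup, then map every element of each list.
d = df2.set_index('id2')['type']
df['type'] = df['value'].map(lambda l: [d.get(i) for i in l])
print(df)
```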

Pandas locate and apply changes to column

This is something I always struggle with, and it is a very basic question. Essentially, I want to locate and apply changes to a column based on a filter from another column.
Example input.
import pandas as pd
cols = ['col1', 'col2']
data = [
[1, 1],
[1, 1],
[2, 1],
[1, 1],
]
df = pd.DataFrame(data=data, columns=cols)
# NOTE: In practice, I will be applying a more complex function
df['col2'] = df.loc[df['col1'] == 1, 'col2'].apply(lambda x: x+1)
Returned output:
col1 col2
0 1 2.0
1 1 2.0
2 2 NaN
3 1 2.0
Expected output:
col1 col2
0 1 2
1 1 2
2 2 1
3 1 2
What's happening:
Records that do not meet the filtering condition are being set to null because of my apply / lambda routine
What I request:
The correct locate/filter-and-apply approach. I can achieve the expected frame using update; however, I want to use locate and apply.
By doing df['col2'] = ..., you're setting all the values of col2. But, since you're only calling apply on some of the values, the values that aren't included get set to NaN. To fix that, save your mask and reuse it:
mask = df['col1'] == 1
df.loc[mask, 'col2'] = df.loc[mask, 'col2'].apply(lambda x: x+1)
Output:
>>> df
col1 col2
0 1 2
1 1 2
2 2 1
3 1 2
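If the transformation can be vectorised, another common pattern (a sketch on the same toy frame) is np.where, which applies the change where the condition holds and keeps the original value elsewhere in a single expression:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 1, 2, 1], 'col2': [1, 1, 1, 1]})

# Where col1 == 1 take col2 + 1, otherwise keep col2 unchanged.
df['col2'] = np.where(df['col1'] == 1, df['col2'] + 1, df['col2'])
print(df)
```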

how to generate random numbers that sum to a specific value?

I have 2 DataFrames as follows:
import pandas as pd
import numpy as np
# Create data set.
dataSet1 = {'id': ['A', 'B', 'C'],
            'value': [9, 20, 20]}
dataSet2 = {'id': ['A', 'A', 'A', 'B', 'B', 'B', 'C'],
            'id_2': [1, 2, 3, 2, 3, 4, 1]}
# Create dataframes with the data sets and named columns.
df_map1 = pd.DataFrame(dataSet1, columns=['id', 'value'])
df_map2 = pd.DataFrame(dataSet2, columns=['id', 'id_2'])
df_map1
id value
0 A 9
1 B 20
2 C 20
df_map2
id id_2
0 A 1
1 A 2
2 A 3
3 B 2
4 B 3
5 B 4
6 C 1
where id can have duplicates in df_map2 (namely, each id maps to one or more id_2 values).
#doing a quick merge, based on id.
df = df_map1.merge(df_map2 ,on=['id'])
id value id_2
0 A 9 1
1 A 9 2
2 A 9 3
3 B 20 2
4 B 20 3
5 B 20 4
6 C 20 1
I can represent what's the relationship between id and id_2 as follows
id_ref = df.groupby('id')['id_2'].apply(list).to_dict()
{'A': [1, 2, 3], 'B': [2, 3, 4], 'C': [1]}
Now, I would like to generate lists of random integers (say 0 to 3, with 5 elements for example), put them into the pandas df, and explode.
df['random_value'] = df.apply(lambda _: np.random.randint(0, 3, 5), axis=1)
id value id_2 random_value
0 A 9 1 [0, 0, 0, 0, 1]
1 A 9 2 [0, 2, 1, 2, 1]
2 A 9 3 [0, 1, 2, 2, 1]
3 B 20 2 [2, 1, 1, 2, 2]
4 B 20 3 [0, 0, 0, 0, 0]
5 B 20 4 [1, 0, 0, 1, 0]
6 C 20 1 [1, 2, 2, 2, 1]
The condition for generating these random_value lists is that, per id, the sum over all of its lists has to equal value.
That means for id A: if we sum all the elements inside its three lists above, we get a total of 13, but what we want is 9;
and the same concept applies for ids B and C, and so on.
Is there any way to achieve this?
# i was looking into multinomial from np.random function... seems this should be the solution but im not sure how to apply this with pandas.
np.random.multinomial(9, np.ones(5)/5, size = 1)[0]
=> array([2,3,3,0,1])
2+3+3+0+1 = 9
ATTEMPT/IDEA ...
Given the id_2 lists: e.g. id A maps to 3 distinct elements [1, 2, 3],
we can draw 3 * 5 = 15 numbers in one go (our "long list"), where
3 is the number of id_2 values and 5 is the number of elements per list, hence
list_A = np.random.multinomial(9, np.ones(3*5)/(3*5), size=1)[0]
and then we evenly split the long list into chunks of n = 5
using this list comprehension:
[list_A[i:i + n] for i in range(0, len(list_A), n)]
but I am still unsure how to do this dynamically.
The core idea is as you said (about getting 3*5=15 numbers), plus reshaping it into a 2D array with the same number of rows as that id has in the dataframe. The following function does that,
def generate_random_numbers(df):
    value = df['value'].iloc[0]
    list_len = 5
    num_rows = len(df)
    num_rand = list_len * num_rows
    return pd.Series(
        map(list, np.random.multinomial(value, np.ones(num_rand)/num_rand).reshape(num_rows, -1)),
        df.index
    )
And apply it:
df['random_value'] = df.groupby(['id', 'value'], as_index=False).apply(generate_random_numbers).droplevel(0)
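A quick sanity check (a sketch rebuilding the frames from the question; the per-group loop mirrors the groupby-apply above): sum each id's generated lists and compare with value. np.random.multinomial guarantees every draw sums exactly to value.

```python
import numpy as np
import pandas as pd

df_map1 = pd.DataFrame({'id': ['A', 'B', 'C'], 'value': [9, 20, 20]})
df_map2 = pd.DataFrame({'id': ['A', 'A', 'A', 'B', 'B', 'B', 'C'],
                        'id_2': [1, 2, 3, 2, 3, 4, 1]})
df = df_map1.merge(df_map2, on=['id'])

def generate_random_numbers(g):
    # One multinomial draw of len(g)*5 numbers that sums to the group's value,
    # reshaped so every row of the group receives a list of 5.
    value = g['value'].iloc[0]
    num_rand = len(g) * 5
    draws = np.random.multinomial(value, np.ones(num_rand) / num_rand)
    return pd.Series(list(map(list, draws.reshape(len(g), -1))), index=g.index)

parts = [generate_random_numbers(g) for _, g in df.groupby('id')]
df['random_value'] = pd.concat(parts)

# Per id, the grand total of all generated numbers equals value.
totals = df.groupby('id')['random_value'].apply(lambda s: sum(map(sum, s)))
print(totals)
```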

Get back to DataFrame after df.as_matrix()

I play with a dataset in pandas.
At some point I use it as a matrix (df.as_matrix()), then I do some transformations (with sklearn) and I want to go back to a DataFrame.
How can I get from df.as_matrix() back to df in the most straightforward way, preserving the index and column names?
Consider the data frame df
df = pd.DataFrame(1, list('xyz'), list('abc'))
df
a b c
x 1 1 1
y 1 1 1
z 1 1 1
as_matrix gives you:
df.as_matrix()
array([[1, 1, 1],
[1, 1, 1],
[1, 1, 1]])
It is completely reasonable to go back to a data frame with
pd.DataFrame(df.as_matrix())
0 1 2
0 1 1 1
1 1 1 1
2 1 1 1
But you lose the index and column information.
If you still have that info lying around
pd.DataFrame(df.as_matrix(), df.index, df.columns)
a b c
x 1 1 1
y 1 1 1
z 1 1 1
And you are back where you started.
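Note that as_matrix was deprecated and later removed from pandas; in current versions the same round-trip works with to_numpy() (or .values), and the recipe is otherwise unchanged:

```python
import pandas as pd

df = pd.DataFrame(1, index=list('xyz'), columns=list('abc'))

arr = df.to_numpy()          # modern replacement for df.as_matrix()
# ... transform arr (e.g. with sklearn) ...
df2 = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(df2)
```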

how to transpose dataframe?

I've got a dataframe (the result of a groupby by "nr")
id lap nr time
1 1 2 10
4 2 2 100
I need to rearrange this dataframe into the following format
nr lap1 time1 lap2 time2
2 1 10 2 100
Any idea how I can do it?
You can think of this as a pivot. If your DataFrame had an extra column called, say, colnum:
lap nr time colnum
0 1 2 10 1
1 2 2 100 2
then
df.pivot(index='nr', columns='colnum')
moves the nr column values into the row index, and the colnum column values into the column index:
lap time
colnum 1 2 1 2
nr
2 1 2 10 100
This is basically the desired result. All we need to do is fix up the column labels:
df.columns = ['{}{}'.format(col, num) for col,num in df.columns]
Thus,
import pandas as pd
df = pd.DataFrame({'id': [1, 4], 'lap': [1, 2], 'nr': [2, 2], 'time': [10, 100]})
df['colnum'] = df.groupby('nr').cumcount()+1
df = df[['lap','nr','time','colnum']]
df = df.pivot(index='nr', columns='colnum')
df.columns = ['{}{}'.format(col, num) for col,num in df.columns]
df = df.reset_index()
yields
nr lap1 lap2 time1 time2
0 2 1 2 10 100
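If the lap1 time1 lap2 time2 column order from the question is needed, one option (a sketch along the same lines) is to sort the pivoted columns by colnum before flattening the labels:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 4], 'lap': [1, 2], 'nr': [2, 2], 'time': [10, 100]})
df['colnum'] = df.groupby('nr').cumcount() + 1

p = df.pivot(index='nr', columns='colnum', values=['lap', 'time'])
p = p.sort_index(axis=1, level='colnum')  # interleave: lap1, time1, lap2, time2
p.columns = ['{}{}'.format(col, num) for col, num in p.columns]
p = p.reset_index()
print(p)
```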