Pandas DataFrame.update with MultiIndex label - pandas

Given a DataFrame A with a MultiIndex and a DataFrame B with a one-dimensional index, how do I update column values of A with new values from B, where the index of B is matched against the second index level of A?
Test data:
import pandas as pd

begin = [10, 10, 12, 12, 14, 14]
end = [10, 11, 12, 13, 14, 15]
values = [1, 2, 3, 4, 5, 6]
values_updated = [10, 20, 3, 4, 50, 60]
multiindexed = pd.DataFrame({'begin': begin,
                             'end': end,
                             'value': values})
multiindexed.set_index(['begin', 'end'], inplace=True)
singleindexed = pd.DataFrame.from_dict(dict(zip([10, 11, 14, 15],
                                                [10, 20, 50, 60])),
                                       orient='index')
singleindexed.columns = ['value']
And the desired result should be:
           value
begin end
10    10      10
      11      20
12    12       3
      13       4
14    14      50
      15      60
Now I was thinking about a variant of
multiindexed.update(singleindexed)
I searched the docs of DataFrame.update, but could not find anything w.r.t. index handling.
Am I missing an easier way to accomplish this?

You can use loc to select the matching rows in multiindexed and then assign the new values from singleindexed.values:
print(singleindexed.index)
Int64Index([10, 11, 14, 15], dtype='int64')
print(singleindexed.values)
[[10]
[20]
[50]
[60]]
idx = pd.IndexSlice
print(multiindexed.loc[idx[:, singleindexed.index], :])
           value
begin end
10    10       1
      11       2
14    14       5
      15       6
multiindexed.loc[idx[:, singleindexed.index],:] = singleindexed.values
print(multiindexed)
           value
begin end
10    10      10
      11      20
12    12       3
      13       4
14    14      50
      15      60
See Using slicers in the docs. (Note that the assignment with .values relies on the selected rows and singleindexed being in the same order.)
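If you specifically want DataFrame.update, a workaround (a sketch, assuming the 'end' labels are unique) is to temporarily move the first index level into the columns so that update can align on the remaining one:
# align update() on the 'end' level only
tmp = multiindexed.reset_index('begin')    # index is now just 'end'
tmp.update(singleindexed)                  # aligns on 'end' and the 'value' column
multiindexed = tmp.set_index('begin', append=True).swaplevel()  # restore (begin, end)
Note that update may upcast integer columns to float.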

Why doesn't .loc reverse slice correctly?

From my understanding, there are two ways to subset a dataframe in pandas:
a) df['columns']['rows']
b) df.loc['rows', 'columns']
I was following a guided case study, where the instruction was to select the first and last n rows of a column in a dataframe. The solution used Method A, whereas I tried Method B.
My method wasn't working and I couldn't for the life of me figure out why.
I've created a simplified version of the dataframe...
import numpy as np
import pandas as pd

male = [6, 14, 12, 13, 21, 14, 14, 14, 14, 18]
female = [9, 11, 6, 10, 11, 13, 12, 11, 9, 11]
df = pd.DataFrame({'Male': male,
                   'Female': female},
                  index=np.arange(1, 11))
df['Mean'] = df[['Male', 'Female']].mean(axis=1).round(1)
df
Selecting the first two rows works fine for both Method A and Method B:
print('Method A: \n', df['Mean'][:2])
print('Method B: \n', df.loc[:2, 'Mean'])
Method A:
1 7.5
2 12.5
Method B:
1 7.5
2 12.5
But selecting the last two rows doesn't work the same way. Method A returns the last two rows as it should.
Method B (.loc) doesn't; it returns the whole dataframe. Why is this, and how do I fix it?
print('Method A: \n', df['Mean'][-2:])
print('Method B: \n', df.loc[-2:, 'Mean'])
Method A:
9 11.5
10 14.5
Method B:
1 7.5
2 12.5
3 9.0
4 11.5
5 16.0
6 13.5
7 13.0
8 12.5
9 11.5
10 14.5
You could use .index[-2:] to get the index labels of the last two rows, which are 9 and 10, instead of only -2:. The reason -2: fails is that .loc slices by label, not position: -2 is not a label in the index, and on a sorted integer index the slice starts before the first label (1), so it spans the whole frame. Here is some reproducible code:
import numpy as np
import pandas as pd

male = [6, 14, 12, 13, 21, 14, 14, 14, 14, 18]
female = [9, 11, 6, 10, 11, 13, 12, 11, 9, 11]
df = pd.DataFrame({'Male': male,
                   'Female': female},
                  index=np.arange(1, 11))
df['Mean'] = df[['Male', 'Female']].mean(axis=1).round(1)
print('Method B: \n', df.loc[df.index[-2:], 'Mean'])
Output:
Method B:
9 11.5
10 14.5
Name: Mean, dtype: float64
As you can see, it returns the last two rows of your dataframe.
You can also get them with slicing, iloc, or the tail method, like this:
df['Mean'][-2:]
df['Mean'].iloc[-2:]
df['Mean'].tail(2)
We don't usually use loc for this; iloc or other methods are easier. But if you want to use loc, it could be like this:
df.loc[df.index[-2:],'Mean']
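To see the label-versus-position distinction directly, here is a small illustration (my own sketch, not from the answers above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(10)}, index=np.arange(1, 11))
# .loc slices by label: -2 is not a label, and on this sorted integer index
# it would sort before the first label 1, so the slice -2: covers every row
print(df.loc[-2:].shape)   # (10, 1)
# .iloc slices by position: -2: means "the last two rows"
print(df.iloc[-2:].shape)  # (2, 1)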

Getting the difference iterated over every row from one dataframe to another in pandas

I have two datasets: df1 and df2, each with a column named 'value' with 10 records. Currently I have:
df = df1.value - df2.value
but this code outputs only 10 rows (as expected). How would one compute the difference for every pair of rows, instead of just between corresponding row indices, and get a table of 100 records?
Thanks in advance!
You can use pandas.DataFrame.merge with how='cross' (Cartesian product), then get the difference between the columns with pandas.DataFrame.diff:
import pandas as pd

# setup
df1 = pd.DataFrame({"value": [7, 5, 4, 8, 9]})
df2 = pd.DataFrame({"value": [1, 7, 9, 5, 3]})

# how='cross' needs pandas >= 1.2
df2.merge(df1, how="cross", suffixes=["x", ""]).diff(axis=1).dropna(axis=1)
Output
value
0 6
1 4
2 3
3 7
4 8
5 0
6 -2
7 -3
8 1
9 2
10 -2
11 -4
12 -5
13 -1
14 0
15 2
16 0
17 -1
18 3
19 4
20 4
21 2
22 1
23 5
24 6
Try this (the key column is a dummy used to emulate a cross join):
ndf = df1.assign(key=1).merge(df2.assign(key=1), on='key', suffixes=('_l', '_r')).drop('key', axis=1)
ndf['value_l'] - ndf['value_r']
Use an outer subtraction.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({"value":[7,5,4,8,9]})
df2 = pd.DataFrame({"value":[1,7,9,5,3]})
np.subtract.outer(df1['value'].to_numpy(), df2['value'].to_numpy())
#array([[ 6, 0, -2, 2, 4],
# [ 4, -2, -4, 0, 2],
# [ 3, -3, -5, -1, 1],
# [ 7, 1, -1, 3, 5],
# [ 8, 2, 0, 4, 6]])
Add a .ravel() if you want the same order as a cross join.
np.subtract.outer(df1['value'].to_numpy(), df2['value'].to_numpy()).ravel('F')
#array([ 6, 4, 3, 7, 8, 0, -2, -3, 1, 2, -2, -4, -5, -1, 0, 2, 0,
# -1, 3, 4, 4, 2, 1, 5, 6])
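If a labeled table is handier than the flat array, the outer subtraction can be wrapped in a DataFrame so that both original indices are kept (a small sketch building on the code above):
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"value": [7, 5, 4, 8, 9]})
df2 = pd.DataFrame({"value": [1, 7, 9, 5, 3]})

# rows follow df1's index, columns follow df2's index
diff = pd.DataFrame(np.subtract.outer(df1['value'].to_numpy(),
                                      df2['value'].to_numpy()),
                    index=df1.index, columns=df2.index)
print(diff)
Calling diff.stack() then flattens it back to one value per (row, column) pair.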

Pandas Vlookup 2 DF Columns Different Lengths & Perform Calculation

I need to execute a vlookup-like calculation on two df's of different lengths that share a column name. Suppose I have a df called df1, such as:
Y M P D
2020 11 Red 10
2020 11 Blue 9
2020 11 Green 12
2020 11 Tan 7
2020 11 White 5
2020 11 Cyan 17
and a second df called df2 such as:
Y M P D
2020 11 Blue 4
2020 11 Red 12
2020 11 White 6
2020 11 Tan 7
2020 11 Green 20
2020 11 Violet 10
2020 11 Black 7
2020 11 BlackII 3
2020 11 Cyan 14
2020 11 Copper 6
I need a new df, df3, with two columns ['Res', 'P'] showing the result of subtracting df2's values from df1's, such as:
Res P
Red -2
Blue 5
Green -8
Tan 0
White -1
Cyan 3
I have not been able to find anything on the web about a lookup followed by a calculation. I've tried merging df1 and df2 into one df, but I do not see how to execute the calculation when the values in the 'P' column match. I think a merge of df1 and df2 is probably the first step, though?
Based on the example, columns 'Y' and 'M' do not matter for the merge. If these columns are relevant, then pass a list to the on parameter (e.g. on=['Y', 'M', 'P']).
Currently, only columns [['P', 'D']] are being used from df1 and df2.
The following code produces the desired output for the example, but it's difficult to say what will happen with larger dataframes or with repeated values in 'P' (a merge on duplicated keys produces one row per matching pair).
import pandas as pd
# setup the dataframes
df1 = pd.DataFrame({'Y': [2020, 2020, 2020, 2020, 2020, 2020], 'M': [11, 11, 11, 11, 11, 11], 'P': ['Red', 'Blue', 'Green', 'Tan', 'White', 'Cyan'], 'D': [10, 9, 12, 7, 5, 17]})
df2 = pd.DataFrame({'Y': [2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020], 'M': [11, 11, 11, 11, 11, 11, 11, 11, 11, 11], 'P': ['Blue', 'Red', 'White', 'Tan', 'Green', 'Violet', 'Black', 'BlackII', 'Cyan', 'Copper'], 'D': [4, 12, 6, 7, 20, 10, 7, 3, 14, 6]})
# merge the dataframes
df = pd.merge(df1[['P', 'D']], df2[['P', 'D']], on='P', suffixes=('_1', '_2')).rename(columns={'P': 'Res'})
# subtract the values
df['P'] = (df.D_1 - df.D_2)
# drop the unneeded columns
df = df.drop(columns=['D_1', 'D_2'])
# display(df)
Res P
0 Red -2
1 Blue 5
2 Green -8
3 Tan 0
4 White -1
5 Cyan 3
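An alternative that skips the explicit merge (a sketch, assuming the 'P' values are unique within each frame): set 'P' as the index on both sides and let index alignment do the "vlookup", then drop the colours that appear in only one frame:
# index alignment matches rows by 'P'; non-matching labels become NaN
s1 = df1.set_index('P')['D']
s2 = df2.set_index('P')['D']
res = (s1 - s2).dropna()
df3 = pd.DataFrame({'Res': res.index, 'P': res.astype(int).to_numpy()})
One difference: the rows come out in the sorted order of the labels, whereas the merge-based answer preserves df1's row order.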

Reshaping column values into rows with Identifier column at the end

I have measurements of Power for different sensors, i.e. A1_Pin, A2_Pin and so on. These measurements are recorded in a file as columns, and each record is uniquely identified by a timestamp.
import pandas as pd

df1 = pd.DataFrame({'DateTime': ['12/12/2019', '12/13/2019', '12/14/2019',
                                 '12/15/2019', '12/16/2019'],
                    'A1_Pin': [2, 8, 8, 3, 9],
                    'A2_Pin': [1, 2, 3, 4, 5],
                    'A3_Pin': [85, 36, 78, 32, 75]})
I want to reshape the table so that each row corresponds to one sensor. The last column indicates the sensor ID to which the row belongs.
The final table should look like:
df2 = pd.DataFrame({'DateTime': ['12/12/2019', '12/12/2019', '12/12/2019',
                                 '12/13/2019', '12/13/2019', '12/13/2019',
                                 '12/14/2019', '12/14/2019', '12/14/2019',
                                 '12/15/2019', '12/15/2019', '12/15/2019',
                                 '12/16/2019', '12/16/2019', '12/16/2019'],
                    'Power': [2, 1, 85, 8, 2, 36, 8, 3, 78, 3, 4, 32, 9, 5, 75],
                    'ModID': ['A1_Pin', 'A2_Pin', 'A3_Pin', 'A1_Pin', 'A2_Pin', 'A3_Pin',
                              'A1_Pin', 'A2_Pin', 'A3_Pin', 'A1_Pin', 'A2_Pin', 'A3_Pin',
                              'A1_Pin', 'A2_Pin', 'A3_Pin']})
I have tried groupby, melt, reshape, stack, and loops but could not get it to work. Can anyone help? Thanks
When you tried stack, you were on the right track. You need to set_index first and reset_index after, such as:
df2 = df1.set_index('DateTime').stack().reset_index(name='Power')\
         .rename(columns={'level_1': 'ModID'})  # to fit the names in your expected output
And you get:
print (df2)
DateTime ModID Power
0 12/12/2019 A1_Pin 2
1 12/12/2019 A2_Pin 1
2 12/12/2019 A3_Pin 85
3 12/13/2019 A1_Pin 8
4 12/13/2019 A2_Pin 2
5 12/13/2019 A3_Pin 36
6 12/14/2019 A1_Pin 8
7 12/14/2019 A2_Pin 3
8 12/14/2019 A3_Pin 78
9 12/15/2019 A1_Pin 3
10 12/15/2019 A2_Pin 4
11 12/15/2019 A3_Pin 32
12 12/16/2019 A1_Pin 9
13 12/16/2019 A2_Pin 5
14 12/16/2019 A3_Pin 75
I'd try something like this:
df1.set_index('DateTime').unstack().reset_index()
You'll still need to rename the resulting columns (level_0 and 0) to match your expected output.
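Since melt was one of the things you tried, it also gets there in one call; just note that the rows come out column-by-column rather than date-by-date (a sketch against the df1 above):
# id_vars stays as a column; every other column is unpivoted into ModID/Power
df2 = df1.melt(id_vars='DateTime', var_name='ModID', value_name='Power')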

Saving with numpy savetxt. Array elements as columns

I am pretty new to Python and trying to kick my Matlab addiction. I am converting a lot of my lab's machine vision code over to Python, but I am stuck on one aspect of the saving. At each iteration of the code we save 6 variables in an array. I'd like these to be entered as 6 columns of a txt file with numpy.savetxt. Each iteration of the tracking loop would then add similar variables for that frame as the next row in the txt file.
But I keep getting a single column that just grows with every loop. I've attached a simple example to show my problem. As it loops through, a variable called output is generated. I would like this to form the three columns of the txt file, with each iteration of the loop adding a new row. Is there an easy way to do this?
import numpy as np

dataFile_Path = "dataFile.txt"
dataFile_id = open(dataFile_Path, 'w+')
for x in range(0, 9):
    variable = np.array([2, 3, 4])
    output = x*variable + 1
    output.astype(float)
    print(output)
    np.savetxt(dataFile_id, output, fmt="%d")
dataFile_id.close()
In [160]: for x in range(0, 9):
     ...:     variable = np.array([2,3,4])
     ...:     output = x*variable+1
     ...:     output.astype(float)
     ...:     print(output)
     ...:
[1 1 1]
[3 4 5]
[5 7 9]
[ 7 10 13]
[ 9 13 17]
[11 16 21]
[13 19 25]
[15 22 29]
[17 25 33]
So you are writing one row at a time. savetxt is normally used to write a 2d array.
Notice that the print still shows integers - astype returns a new array, it does not change things in place.
And because you are giving it 1d arrays, it writes each one as a column:
In [177]: f = open('txt', 'bw+')
In [178]: for x in range(0, 9):
     ...:     variable = np.array([2,3,4])
     ...:     output = x*variable+1
     ...:     np.savetxt(f, output, fmt='%d')
     ...:
In [179]: f.close()
In [180]: cat txt
1
1
1
3
4
5
5
7
9
If instead I give savetxt a 2d array ((1,3) shape, by wrapping output in a list), it writes each one as a row:
In [181]: f = open('txt', 'bw+')
In [182]: for x in range(0, 9):
     ...:     variable = np.array([2,3,4])
     ...:     output = x*variable+1
     ...:     np.savetxt(f, [output], fmt='%d')
     ...:
In [183]: f.close()
In [184]: cat txt
1 1 1
3 4 5
5 7 9
7 10 13
9 13 17
11 16 21
13 19 25
15 22 29
17 25 33
But a better approach is to construct the 2d array, and write that with one savetxt call:
In [185]: output = np.array([2,3,4])*np.arange(9)[:,None]+1
In [186]: output
Out[186]:
array([[ 1, 1, 1],
[ 3, 4, 5],
[ 5, 7, 9],
[ 7, 10, 13],
[ 9, 13, 17],
[11, 16, 21],
[13, 19, 25],
[15, 22, 29],
[17, 25, 33]])
In [187]: np.savetxt('txt', output, fmt='%10d')
In [188]: cat txt
         1          1          1
         3          4          5
         5          7          9
         7         10         13
         9         13         17
        11         16         21
        13         19         25
        15         22         29
        17         25         33
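Applied back to the original loop, the "collect rows, then save once" pattern looks like this (a sketch; six variables per frame would work the same way as three):
import numpy as np

rows = []
for x in range(9):
    variable = np.array([2, 3, 4])
    rows.append(x*variable + 1)      # one 1d row per frame

# stack the rows into a (9, 3) array and write it in a single call
np.savetxt("dataFile.txt", np.array(rows), fmt="%d")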