I have a DataFrame like this, where for column2 I need to add 0.004 throughout the column to get a 0 value in row 1 of column 2. Similarly, for column 3 I need to subtract 0.4637 from the entire column to get a 0 value at row 1 column 3. How do I efficiently execute this?
Here is my code -
df2 = pd.DataFrame(np.zeros((df.shape[0], len(df.columns)))).round(0).astype(int)
for (i,j) in zip(range(0, 5999),range(1,len(df.columns))):
if j==1:
df2.values[i,j] = df.values[i,j] + df.values[0,1]
elif j>1:
df2.iloc[i,j] = df.iloc[i,j] - df.iloc[0,j]
print(df2)
Any help would be greatly appreciated. Thank you.
df2 = df - df.iloc[0]
Explanation:
Let's work through an example.
df = pd.DataFrame(np.arange(20).reshape(4, 5))
0
1
2
3
4
0
0
1
2
3
4
1
5
6
7
8
9
2
10
11
12
13
14
3
15
16
17
18
19
df.iloc[0] selects the first row of the dataframe:
0 0
1 1
2 2
3 3
4 4
Name: 0, dtype: int64
This is a Series. The first column printed here is its index (column names of the dataframe), and the second one - the actual values of the first row of the dataframe.
We can convert it to a list to better see its values
df.iloc[0].tolist()
[0, 1, 2, 3, 4]
Then, using broadcasting, we are subtracting each value from the whole column where it has come from.
Related
I would like to remove all rows in a pandas df that have an index value within 4 counts of the index value of the previous row.
In the pandas df below,
A B
0 1 1
5 5 5
8 9 9
9 10 10
Only the row with index value 0 should remain.
Thanks!
get the differences between the current and previous row as a list and pass to loc. Chose to get it as a list so i could return a dataframe as a final output.
ind = [ a for a,b in zip(df.index,df.index[1:]) if b-a > 4]
df.loc[ind]
A B
0 1 1
You can use reset_index, diff and shift:
In [1309]: df
Out[1309]:
A B
0 1 1
5 5 5
8 9 9
9 10 10
In [1310]: d = df.reset_index()
In [1313]: df = d[d['index'].diff(1).shift(-1) >=4].drop('index', 1)
In [1314]: df
Out[1313]:
A B
0 1 1
Having df of probabilities distribution, I get max probability for rows with df.idxmax(axis=1) like this:
df['1k-th'] = df.idxmax(axis=1)
and get the following result:
(scroll the tables to the right if you can not see all the columns)
0 1 2 3 4 5 6 1k-th
0 0.114869 0.020708 0.025587 0.028741 0.031257 0.031619 0.747219 6
1 0.020206 0.012710 0.010341 0.012196 0.812495 0.113863 0.018190 4
2 0.023585 0.735475 0.091795 0.021683 0.027581 0.054217 0.045664 1
3 0.009834 0.009175 0.013165 0.016014 0.015507 0.899115 0.037190 5
4 0.023357 0.736059 0.088721 0.021626 0.027341 0.056289 0.046607 1
the question is how to get the 2-th, 3th, etc probabilities, so that I get the following result?:
0 1 2 3 4 5 6 1k-th 2-th
0 0.114869 0.020708 0.025587 0.028741 0.031257 0.031619 0.747219 6 0
1 0.020206 0.012710 0.010341 0.012196 0.812495 0.113863 0.018190 4 3
2 0.023585 0.735475 0.091795 0.021683 0.027581 0.054217 0.045664 1 4
3 0.009834 0.009175 0.013165 0.016014 0.015507 0.899115 0.037190 5 4
4 0.023357 0.736059 0.088721 0.021626 0.027341 0.056289 0.046607 1 2
Thank you!
My own solution is not the prettiest, but does it's job and works fast:
for i in range(7):
p[f'{i}k'] = p[[0,1,2,3,4,5,6]].idxmax(axis=1)
p[f'{i}k_v'] = p[[0,1,2,3,4,5,6]].max(axis=1)
for x in range(7):
p[x] = np.where(p[x]==p[f'{i}k_v'], np.nan, p[x])
The loop does:
finds the largest value and it's column index
drops the found value (sets to nan)
again
finds the 2nd largest value
drops the found value
etc ...
I'm well and truly stumped on this
I have a MultiIndex dataframe that looks like this
data
index1 index2
0 1 8
2 7
3 6
4 9
1 1 3
2 4
3 3
4 6
2 1 5
2 5
.... and so on
and I'm trying to sum a load of values from the data column for each index1 based on a range of values from index2 to create a new dataframe.
i.e. if I were to create a new dataframe from the data values that correspond to the first 2 values of index2 per index1 from the example above I would want to get,
index1 summed_data
0 15
1 7
2 10
Does anyone know how to do this?
You don't need to change your input format, using the following statement:
x = df.groupby(level ='index1').agg({'data': lambda x: x[:2].sum()}).rename(columns = {'data':'summed_data'})
Then print:
summed_data
index1
0 15
1 7
2 10
Consider following dataframe which has columns with same name (Apparently this does happens, currently I have a dataset like this! :( )
>>> df = pd.DataFrame({"a":range(10,15),"b":range(5,10)})
>>> df.rename(columns={"b":"a"},inplace=True)
df
a a
0 10 5
1 11 6
2 12 7
3 13 8
4 14 9
>>> df.columns
Index(['a', 'a'], dtype='object')
I would expect that when dropping by index , only the column with the respective index would be gone, but apparently this is not the case.
>>> df.drop(df.columns[-1],1)
0
1
2
3
4
Is there a way to get rid of columns with duplicated column names?
EDIT: I choose missleading values for the first column, fixed now
EDIT2: the expected outcome is
a
0 10
1 11
2 12
3 13
4 14
Actually just do this:
In [183]:
df.ix[:,~df.columns.duplicated()]
Out[183]:
a
0 0
1 1
2 2
3 3
4 4
So this index all rows and then uses the column mask generated from duplicated and invert the mask using ~
The output from duplicated:
In [184]:
df.columns.duplicated()
Out[184]:
array([False, True], dtype=bool)
UPDATE
As .ix is deprecated (since v0.20.1) you should do any of the following:
df.iloc[:,~df.columns.duplicated()]
or
df.loc[:,~df.columns.duplicated()]
Thanks to #DavideFiocco for alerting me
I know how to set the pandas data frame equal to a column.
i.e.:
df = df['col1']
what is the equivalent for a row? let's say taking the index? and would I eliminate one or more of them?
Many thanks.
If you want to take a copy of a row then you can either use loc for label indexing or iloc for integer based indexing:
In [104]:
df = pd.DataFrame({'a':np.random.randn(10),'b':np.random.randn(10)})
df
Out[104]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
3 0.531293 -0.386598
4 -0.278565 1.224272
5 0.491417 -0.498816
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
In [106]:
row = df.iloc[3]
row
Out[106]:
a 0.531293
b -0.386598
Name: 3, dtype: float64
If you want to remove that row then you can use drop:
In [107]:
df.drop(3)
Out[107]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
4 -0.278565 1.224272
5 0.491417 -0.498816
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
You can also use a slice or pass a list of labels:
In [109]:
rows = df.loc[[3,5]]
row_slice = df.loc[3:5]
print(rows)
print(row_slice)
a b
3 0.531293 -0.386598
5 0.491417 -0.498816
a b
3 0.531293 -0.386598
4 -0.278565 1.224272
5 0.491417 -0.498816
Similarly you can pass a list to drop:
In [110]:
df.drop([3,5])
Out[110]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
4 -0.278565 1.224272
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
If you wanted to drop a slice then you can slice your index and pass this to drop:
In [112]:
df.drop(df.index[3:5])
Out[112]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
5 0.491417 -0.498816
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186