I know how to set a pandas DataFrame equal to one of its columns, i.e.:
df = df['col1']
What is the equivalent for a row, say by its index value? And how would I eliminate one or more rows?
Many thanks.
If you want to take a copy of a row, then you can use either loc for label-based indexing or iloc for integer position-based indexing:
In [104]:
df = pd.DataFrame({'a':np.random.randn(10),'b':np.random.randn(10)})
df
Out[104]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
3 0.531293 -0.386598
4 -0.278565 1.224272
5 0.491417 -0.498816
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
In [106]:
row = df.iloc[3]
row
Out[106]:
a 0.531293
b -0.386598
Name: 3, dtype: float64
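As an aside, loc and iloc coincide here only because the index is the default 0-9 range; a minimal sketch (with a made-up non-default index) of where the two differ:
tmp = pd.DataFrame({'a': [10, 20, 30]}, index=[3, 2, 1])
tmp.loc[1]   # label-based: the row labelled 1 -> a = 30
tmp.iloc[1]  # position-based: the second row  -> a = 20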
If you want to remove that row then you can use drop:
In [107]:
df.drop(3)
Out[107]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
4 -0.278565 1.224272
5 0.491417 -0.498816
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
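Note that drop returns a new DataFrame rather than modifying df in place; a quick sketch of the two ways to keep the change:
df = df.drop(3)           # reassign the result
df.drop(3, inplace=True)  # or modify df in place (use one or the other)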
You can also use a slice or pass a list of labels:
In [109]:
rows = df.loc[[3,5]]
row_slice = df.loc[3:5]
print(rows)
print(row_slice)
a b
3 0.531293 -0.386598
5 0.491417 -0.498816
a b
3 0.531293 -0.386598
4 -0.278565 1.224272
5 0.491417 -0.498816
Similarly you can pass a list to drop:
In [110]:
df.drop([3,5])
Out[110]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
4 -0.278565 1.224272
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
If you wanted to drop a slice then you can slice your index and pass this to drop:
In [112]:
df.drop(df.index[3:5])
Out[112]:
a b
0 1.216387 -1.298502
1 1.043843 0.379970
2 0.114923 -0.125396
5 0.491417 -0.498816
6 0.222941 0.183743
7 0.322535 -0.510449
8 0.695988 -0.300045
9 -0.904195 -1.226186
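The same result can also be obtained with boolean indexing instead of drop, e.g. keeping every row whose label is not in the slice (a sketch):
df[~df.index.isin(df.index[3:5])]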
I noticed this today and wanted to ask because I am a little confused about it.
Let's say we have two DataFrames:
df = pd.DataFrame(np.random.randint(0,9,size=(5,3)),columns = list('ABC'))
A B C
0 3 1 6
1 2 4 0
2 8 8 0
3 8 6 7
4 4 5 0
df2 = pd.DataFrame(np.random.randint(0,9,size=(5,3)),columns = list('CBA'))
C B A
0 3 5 5
1 7 4 6
2 0 7 7
3 6 6 5
4 4 0 6
If we wanted to conditionally assign values from the second df to the first, we could do this:
df.loc[df['A'].gt(3)] = df2
I would expect the columns to be aligned and, if there were missing columns, the values in the first df to be filled with NaN. However, when the above code is run, it replaces the data without taking the column names into account (it does take the index into account, however):
A B C
0 3 1 6
1 2 4 0
2 0 7 7
3 6 6 5
4 4 0 6
On index 2, instead of [7, 7, 0] we have [0, 7, 7].
However, if we pass the column names into the loc statement, without changing the column order in df2, it aligns on the columns:
df.loc[df['A'].gt(3),['A','B','C']] = df2
A B C
0 3 1 6
1 2 4 0
2 7 7 0
3 5 6 6
4 6 0 4
Why does this happen?
Interestingly, loc performs a number of optimizations to improve performance; one of those optimizations is checking the type of the index passed in.
Both Row and Column Indexes Included
When passing both a row index and a column index, the __setitem__ function:
def __setitem__(self, key, value):
if isinstance(key, tuple):
key = tuple(com.apply_if_callable(x, self.obj) for x in key)
else:
key = com.apply_if_callable(key, self.obj)
indexer = self._get_setitem_indexer(key)
self._has_valid_setitem_indexer(key)
iloc = self if self.name == "iloc" else self.obj.iloc
iloc._setitem_with_indexer(indexer, value, self.name)
Interprets the key as a tuple.
key:
(0 False
1 False
2 True
3 True
4 True
Name: A, dtype: bool,
['A', 'B', 'C'])
This is then passed to _get_setitem_indexer to convert the label-based key into a positional indexer:
indexer = self._get_setitem_indexer(key)
def _get_setitem_indexer(self, key):
"""
Convert a potentially-label-based key into a positional indexer.
"""
if self.name == "loc":
self._ensure_listlike_indexer(key)
if self.axis is not None:
return self._convert_tuple(key, is_setter=True)
ax = self.obj._get_axis(0)
if isinstance(ax, ABCMultiIndex) and self.name != "iloc":
with suppress(TypeError, KeyError, InvalidIndexError):
# TypeError e.g. passed a bool
return ax.get_loc(key)
if isinstance(key, tuple):
with suppress(IndexingError):
return self._convert_tuple(key, is_setter=True)
if isinstance(key, range):
return list(key)
try:
return self._convert_to_indexer(key, axis=0, is_setter=True)
except TypeError as e:
# invalid indexer type vs 'other' indexing errors
if "cannot do" in str(e):
raise
elif "unhashable type" in str(e):
raise
raise IndexingError(key) from e
This generates a tuple indexer (both rows and columns are converted):
if isinstance(key, tuple):
with suppress(IndexingError):
return self._convert_tuple(key, is_setter=True)
returns
(array([2, 3, 4], dtype=int64), array([0, 1, 2], dtype=int64))
Only Row Index Included
However, when only a row index is passed to loc, the indexer is not a tuple and, as such, only a single dimension is converted from label-based to positional:
if isinstance(key, range):
return list(key)
returns
[2 3 4]
For this reason, no alignment happens among the columns when only a row index is passed to loc, as no parsing is done to align the columns.
That is why a full column slice (:) is often used:
df.loc[df['A'].gt(3), :] = df2
As this is sufficient to align the columns appropriately.
import numpy as np
import pandas as pd
np.random.seed(5)
df = pd.DataFrame(np.random.randint(0, 9, size=(5, 3)), columns=list('ABC'))
df2 = pd.DataFrame(np.random.randint(0, 9, size=(5, 3)), columns=list('CBA'))
print(df)
print(df2)
df.loc[df['A'].gt(3), :] = df2
print(df)
Example:
df:
A B C
0 3 6 6
1 0 8 4
2 7 0 0
3 7 1 5
4 7 0 1
df2:
C B A
0 4 6 2
1 1 2 7
2 0 5 0
3 0 4 4
4 3 2 4
df.loc[df['A'].gt(3), :] = df2:
A B C
0 3 6 6
1 0 8 4
2 0 5 0
3 4 4 0 # Aligned as expected
4 4 2 3
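For quick reference, the two assignment forms side by side (a hedged sketch reusing the same seed and frame shapes as above; run each on a fresh copy to compare):
import numpy as np
import pandas as pd
np.random.seed(5)
a = pd.DataFrame(np.random.randint(0, 9, size=(5, 3)), columns=list('ABC'))
b = pd.DataFrame(np.random.randint(0, 9, size=(5, 3)), columns=list('CBA'))
no_cols = a.copy()
no_cols.loc[no_cols['A'].gt(3)] = b         # rows are aligned, but columns are copied positionally (C, B, A order kept)
with_cols = a.copy()
with_cols.loc[with_cols['A'].gt(3), :] = b  # columns are aligned by name before assignment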
I am trying to merge two Excel tables, but the rows don't line up because in one column information is split over several rows whereas in the other table it is contained in a single cell.
Is there a way with pandas to rename the cells in Table A so that they line up with the rows in Table B?
df_jobs = pd.read_excel(r"jobs.xlsx", usecols="Jobs")
df_positions = pd.read_excel(r"orders.xlsx", usecols="Orders")
Sample files:
https://drive.google.com/file/d/1PEG3nZc0183Gh-8A2xbIs9kEZIWLzLSa/view?usp=sharing
https://drive.google.com/file/d/1HfQ4q7pjba0TKNJAHBqcGeoqdY3Yr3DB/view?usp=sharing
I suppose your input data looks like:
>>> df1
A i j
0 O-20-003049 NaN NaN
1 1 0.643284 0.834937
2 2 0.056463 0.394168
3 3 0.773379 0.057465
4 4 0.081585 0.178991
5 5 0.667667 0.004370
6 6 0.672313 0.587615
7 O-20-003104 NaN NaN
8 1 0.916426 0.739700
9 O-20-003117 NaN NaN
10 1 0.800776 0.614192
11 2 0.925186 0.980913
12 3 0.503419 0.775606
>>> df2
A x y
0 O-20-003049.01 0.593312 0.666600
1 O-20-003049.02 0.554129 0.435650
2 O-20-003049.03 0.900707 0.623963
3 O-20-003049.04 0.023075 0.445153
4 O-20-003049.05 0.307908 0.503038
5 O-20-003049.06 0.844624 0.710027
6 O-20-003104.01 0.026914 0.091458
7 O-20-003117.01 0.275906 0.398993
8 O-20-003117.02 0.101117 0.691897
9 O-20-003117.03 0.739183 0.213401
We start by renaming the rows in column A:
# create a boolean mask
mask = df1["A"].str.startswith("O-")
# rename all rows
df1["A"] = df1.loc[mask, "A"].reindex(df1.index).ffill() \
+ "." + df1["A"].str.pad(2, fillchar="0")
# remove unwanted rows (where mask==True)
df1 = df1[~mask].reset_index(drop=True)
>>> df1
A i j
0 O-20-003049.01 0.643284 0.834937
1 O-20-003049.02 0.056463 0.394168
2 O-20-003049.03 0.773379 0.057465
3 O-20-003049.04 0.081585 0.178991
4 O-20-003049.05 0.667667 0.004370
5 O-20-003049.06 0.672313 0.587615
6 O-20-003104.01 0.916426 0.739700
7 O-20-003117.01 0.800776 0.614192
8 O-20-003117.02 0.925186 0.980913
9 O-20-003117.03 0.503419 0.775606
Now, we are able to merge data on column A:
>>> pd.merge(df1, df2, on="A")
A i j x y
0 O-20-003049.01 0.643284 0.834937 0.593312 0.666600
1 O-20-003049.02 0.056463 0.394168 0.554129 0.435650
2 O-20-003049.03 0.773379 0.057465 0.900707 0.623963
3 O-20-003049.04 0.081585 0.178991 0.023075 0.445153
4 O-20-003049.05 0.667667 0.004370 0.307908 0.503038
5 O-20-003049.06 0.672313 0.587615 0.844624 0.710027
6 O-20-003104.01 0.916426 0.739700 0.026914 0.091458
7 O-20-003117.01 0.800776 0.614192 0.275906 0.398993
8 O-20-003117.02 0.925186 0.980913 0.101117 0.691897
9 O-20-003117.03 0.503419 0.775606 0.739183 0.213401
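For reference, the same rename-and-merge flow as a self-contained sketch with small, deterministic stand-in data (the column names i and x are just placeholders, like i/j/x/y above):
import pandas as pd
df1 = pd.DataFrame({
    "A": ["O-20-003049", "1", "2", "O-20-003104", "1"],
    "i": [None, 0.1, 0.2, None, 0.3],
})
df2 = pd.DataFrame({
    "A": ["O-20-003049.01", "O-20-003049.02", "O-20-003104.01"],
    "x": [1.0, 2.0, 3.0],
})
mask = df1["A"].str.startswith("O-")
df1["A"] = df1.loc[mask, "A"].reindex(df1.index).ffill() \
    + "." + df1["A"].str.pad(2, fillchar="0")
df1 = df1[~mask].reset_index(drop=True)
print(pd.merge(df1, df2, on="A"))   # two rows for O-20-003049, one for O-20-003104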
I have a large DataFrame. One of my columns contains the names of other columns. I want to evaluate this column and set, in each row, the value of the referenced column:
| A | B | C | Column |
|---|---|---|--------|
| 1 | 3 | 4 | B      |
| 2 | 5 | 3 | A      |
| 3 | 5 | 9 | C      |
Desired output:
| A | B | C | Column |
|---|---|---|--------|
| 1 | 3 | 4 | 3      |
| 2 | 5 | 3 | 2      |
| 3 | 5 | 9 | 9      |
I am achieving this result using:
df.apply(lambda d: eval("d." + d['Column']), axis=1)
But it is very slow, even using swifter. Is there a more efficient way of performing this?
For better performance, use df.to_numpy():
In [365]: df['Column'] = df.to_numpy()[df.index, df.columns.get_indexer(df.Column)]
In [366]: df
Out[366]:
A B C Column
0 1 3 4 3
1 2 5 3 2
2 3 5 9 9
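Note that indexing the array with df.index relies on the default RangeIndex; for an arbitrary index, a sketch of the positional equivalent:
import numpy as np
df['Column'] = df.to_numpy()[np.arange(len(df)), df.columns.get_indexer(df['Column'])]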
For Pandas < 1.2.0, use lookup:
df['Column'] = df.lookup(df.index, df['Column'])
From 1.2.0 onwards, lookup is deprecated; you can just use a for loop:
df['Column'] = [df.at[idx, r['Column']] for idx, r in df.iterrows()]
Output:
A B C Column
0 1 3 4 3
1 2 5 3 2
2 3 5 9 9
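A slightly leaner variant of the same loop idea (a sketch), avoiding the construction of a full row Series per iteration:
df['Column'] = [df.at[i, c] for i, c in zip(df.index, df['Column'])]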
Since lookup is deprecated, try the NumPy method with get_indexer:
df['new'] = df.values[df.index,df.columns.get_indexer(df.Column)]
df
Out[75]:
A B C Column new
0 1 3 4 B 3
1 2 5 3 A 2
2 3 5 9 C 9
I would like to remove all rows in a pandas df that have an index value within 4 counts of the index value of the previous row.
In the pandas df below,
A B
0 1 1
5 5 5
8 9 9
9 10 10
Only the row with index value 0 should remain.
Thanks!
Get the differences between each index value and the next one as a list and pass it to loc. I chose a list so I could return a DataFrame as the final output.
ind = [ a for a,b in zip(df.index,df.index[1:]) if b-a > 4]
df.loc[ind]
A B
0 1 1
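An equivalent vectorised variant of the same idea (a sketch, without building a Python list):
gap_to_next = df.index.to_series().diff(-1).mul(-1)  # index[i+1] - index[i]; NaN for the last row
df[gap_to_next > 4]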
You can use reset_index, diff and shift:
In [1309]: df
Out[1309]:
A B
0 1 1
5 5 5
8 9 9
9 10 10
In [1310]: d = df.reset_index()
In [1313]: df = d[d['index'].diff(1).shift(-1) >= 4].drop('index', axis=1)
In [1314]: df
Out[1314]:
A B
0 1 1
pandas iloc can slice a DataFrame in two ways, such as df.iloc[:, 2:5] and df.iloc[:, [6, 10]].
If I want to select columns 2:5, 6 and 10, how can I use iloc to slice df?
Use numpy.r_:
From the docs:
Translates slice objects to concatenation along the first axis.
This is a simple way to build up arrays quickly. There are two use cases.
If the index expression contains comma separated arrays, then stack them along their first axis.
If the index expression contains slice notation or scalars, then create a 1-D array with a range indicated by the slice notation.
Demo:
In [16]: df = pd.DataFrame(np.random.rand(3, 12))
In [17]: df.iloc[:, np.r_[2:5, 6, 10]]
Out[17]:
2 3 4 6 10
0 0.760201 0.378125 0.707002 0.310077 0.375646
1 0.770165 0.269465 0.419979 0.218768 0.832087
2 0.253142 0.737015 0.652522 0.474779 0.094145
In [18]: df
Out[18]:
0 1 2 3 4 5 6 7 8 9 10 11
0 0.668062 0.581268 0.760201 0.378125 0.707002 0.249094 0.310077 0.336708 0.847258 0.705631 0.375646 0.830852
1 0.521096 0.798405 0.770165 0.269465 0.419979 0.455890 0.218768 0.833776 0.862483 0.817974 0.832087 0.958174
2 0.211815 0.747482 0.253142 0.737015 0.652522 0.274231 0.474779 0.256119 0.110760 0.224096 0.094145 0.525201
UPDATE: starting from Pandas 0.20.1, the .ix indexer is deprecated in favor of the stricter .iloc and .loc indexers.
So I updated my answer to fix that deprecated feature: .ix[...] was changed to .iloc[...].
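As a further aside, np.r_ also accepts a step inside the slice notation, which can be handy for strided column picks (a sketch using the same df as above):
df.iloc[:, np.r_[2:10:2, 11]]   # columns 2, 4, 6, 8 and 11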
I think you need numpy.r_ for concatenating the indices and then iloc for selecting by position:
ds = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9],
'D':[1,3,5],
'E':[5,3,6],
'F':[7,4,3],
'G':[1,3,5],
'H':[5,3,6],
'I':[4,4,3],
'J':[6,4,3],
'K':[9,4,3]})
print (ds)
A B C D E F G H I J K
0 1 4 7 1 5 7 1 5 4 6 9
1 2 5 8 3 3 4 3 3 4 4 4
2 3 6 9 5 6 3 5 6 3 3 3
print (np.r_[2:5, 6,10])
[ 2 3 4 6 10]
print (ds.iloc[:, np.r_[2:5, 6,10]])
C D E G K
0 7 1 5 1 9
1 8 3 3 3 4
2 9 5 6 5 3
On the ix vs iloc discussion: the main problem is that ix will be deprecated in Pandas 0.20.0, and it seems the new version is coming soon (in April), so it is better to use iloc.
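For completeness, a plain-Python sketch of the same selection without numpy.r_:
cols = list(range(2, 5)) + [6, 10]
print(ds.iloc[:, cols])   # same columns C, D, E, G and K as above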