How to select rows of a dataframe according to a list of IDs? [duplicate] - pandas

This question already has answers here:
Select rows from a DataFrame based on multiple values in a column in pandas [duplicate]
(1 answer)
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 1 year ago.
I have the following dataframe and data list, respectively:
import pandas as pd
df = pd.DataFrame({'ID': [1, 2, 4, 7, 30],
                   'Instrument': ['temp_sensor', 'temp_sensor', 'temp_sensor',
                                  'strain_gauge', 'light_sensor'],
                   'Value': [1000, 0, 1000, 0, 1000]})
print(df)
   ID    Instrument  Value
0   1   temp_sensor   1000
1   2   temp_sensor      0
2   4   temp_sensor   1000
3   7  strain_gauge      0
4  30  light_sensor   1000
list_ID = [2, 30]
I would like to generate a new dataframe that corresponds to the dataframe df, but contains only the rows whose ID belongs to list_ID.
I tried to implement the following code, but it does not work:
d = {'ID': [], 'Instrument': [], 'Value': []}
df_aux = pd.DataFrame(d)
for j in range(0, len(df)):
    for k in range(0, len(list_ID)):
        if df['ID'][j] == list_ID[k]:
            df_aux.append(df[df['ID'][j] == list_ID[k]])
This raises the error: KeyError: True
I would like the output of df_aux to be:
ID    Instrument  Value
 2   temp_sensor      0
30  light_sensor   1000
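For reference, the loop is not needed here: pandas can do this selection in one vectorized step with Series.isin, which builds a boolean mask from the list:

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 4, 7, 30],
                   'Instrument': ['temp_sensor', 'temp_sensor', 'temp_sensor',
                                  'strain_gauge', 'light_sensor'],
                   'Value': [1000, 0, 1000, 0, 1000]})
list_ID = [2, 30]

# Boolean mask: True for every row whose ID appears in list_ID
df_aux = df[df['ID'].isin(list_ID)]
print(df_aux)  # keeps only the ID 2 and ID 30 rows
```

Note that the result keeps the original index labels; add .reset_index(drop=True) if a fresh 0-based index is wanted.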


pandas read dataframe multi-header values

I have this dataframe with multiple headers
name, 00590BL, 01090BL, 01100MS, 02200MS
lat, 613297, 626278, 626323, 616720
long, 5185127, 5188418, 5188431, 5181393
elv, 1833, 1915, 1915, 1499
1956-01-01, 1, 2, 2, -2
1956-01-02, 2, 3, 3, -1
1956-01-03, 3, 4, 4, 0
1956-01-04, 4, 5, 5, 1
1956-01-05, 5, 6, 6, 2
I read this as
dfr = pd.read_csv(f_name,
                  skiprows=0,
                  header=[0, 1, 2, 3],
                  index_col=0,
                  parse_dates=True)
I would like to extract the values related to the rows named 'lat' and 'long'.
An easy way could be to read the dataframe in two steps, i.e. to end up with two dataframes. I do not like this because it is not very elegant and does not seem to take advantage of pandas' potential. I believe I could use some feature related to the multi-index.
What do you think?
You can use get_level_values:
dfr = pd.read_csv(f_name, skiprows=0, header=[0, 1, 2, 3], index_col=0,
                  parse_dates=[0], skipinitialspace=True)
lat = dfr.columns.get_level_values('lat').astype(int)
long = dfr.columns.get_level_values('long').astype(int)
elv = dfr.columns.get_level_values('elv').astype(int)
Output:
>>> lat.to_list()
[613297, 626278, 626323, 616720]
>>> long.to_list()
[5185127, 5188418, 5188431, 5181393]
>>> elv.to_list()
[1833, 1915, 1915, 1499]
If you only need the first row of the column header, use droplevel:
df = dfr.droplevel(['lat', 'long', 'elv'], axis=1).rename_axis(columns=None)
print(df)
# Output
00590BL 01090BL 01100MS 02200MS
1956-01-01 1 2 2 -2
1956-01-02 2 3 3 -1
1956-01-03 3 4 4 0
1956-01-04 4 5 5 1
1956-01-05 5 6 6 2
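For a self-contained run without the original file, the same read can be reproduced from an in-memory string (io.StringIO standing in for f_name; parse_dates is omitted here for brevity):

```python
import io
import pandas as pd

csv_data = """name, 00590BL, 01090BL, 01100MS, 02200MS
lat, 613297, 626278, 626323, 616720
long, 5185127, 5188418, 5188431, 5181393
elv, 1833, 1915, 1915, 1499
1956-01-01, 1, 2, 2, -2
1956-01-02, 2, 3, 3, -1
"""

# The header cells above the index column ('name', 'lat', 'long', 'elv')
# become the names of the column MultiIndex levels
dfr = pd.read_csv(io.StringIO(csv_data), header=[0, 1, 2, 3],
                  index_col=0, skipinitialspace=True)

lat = dfr.columns.get_level_values('lat').astype(int)
print(lat.tolist())  # [613297, 626278, 626323, 616720]
```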
Alternatively, if you read the file with a single header row (header=0) so that 'lat', 'long' and 'elv' remain in the index as row labels, you can use the .loc accessor to select those rows by label. For example, you could use the following code to extract the 'lat' values:
lat_values = dfr.loc['lat']
And similarly, you could use the following code to extract the 'long' values:
long_values = dfr.loc['long']
Alternatively, you can use the .xs method to extract the same rows:
lat_values = dfr.xs('lat')
long_values = dfr.xs('long')
Both of these approaches extract the 'lat' and 'long' rows from the dataframe and let you access them within one dataframe under one index.

How to sort a dataframe by a multiindex level? [duplicate]

This question already has answers here:
Sorting columns of multiindex dataframe
(2 answers)
Closed 7 months ago.
I have a pandas dataframe with a multiindex with various data in it. A minimal example could be this one:
import numpy as np
import pandas as pd

elev = [1, 100, 10, 1000]
number = [4, 3, 1, 2]
name = ['foo', 'bar', 'baz', 'qux']
idx = pd.MultiIndex.from_arrays([name, elev, number],
                                names=('name', 'elev', 'number'))
data = np.random.rand(4, 4)
df = pd.DataFrame(data=data, columns=idx)
Now I want to sort it by its elevation or number. There seems to be a built-in function for this: MultiIndex.sortlevel, but it only sorts the MultiIndex itself, and I can't figure out how to make it sort the dataframe along that index too.
df.columns.sortlevel(level=1) gives me a sorted Multiindex
(MultiIndex([('foo', 1, 4),
('baz', 10, 1),
('bar', 100, 3),
('qux', 1000, 2)],
names=['name', 'elev', 'number']),
array([0, 2, 1, 3], dtype=int64))
but trying to apply it with df.columns = df.columns.sortlevel(level=1) gives me ValueError: Length mismatch: Expected axis has 4 elements, new values have 2 elements (sortlevel returns a tuple of the sorted index and an indexer, not just an index), while df = df.columns.sortlevel(level=1) just turns df into the sorted MultiIndex. The axis and inplace keywords I'm used to for similar actions aren't supported by sortlevel.
How do I apply my sorting to my dataframe?
Use DataFrame.sort_index:
df = df.sort_index(level=1, axis=1)
print (df)
name foo baz bar qux
elev 1 10 100 1000
number 4 1 3 2
0 0.009359 0.113384 0.499058 0.049974
1 0.685408 0.897657 0.486988 0.647452
2 0.896963 0.831353 0.721135 0.827568
3 0.833580 0.368044 0.957044 0.494838
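The level can also be referred to by name rather than by position; a quick sketch assuming the question's setup:

```python
import numpy as np
import pandas as pd

elev = [1, 100, 10, 1000]
number = [4, 3, 1, 2]
name = ['foo', 'bar', 'baz', 'qux']
idx = pd.MultiIndex.from_arrays([name, elev, number],
                                names=('name', 'elev', 'number'))
df = pd.DataFrame(data=np.random.rand(4, 4), columns=idx)

# Sort the columns by the 'number' level, referring to it by name
df_sorted = df.sort_index(level='number', axis=1)
print(df_sorted.columns.get_level_values('number').tolist())  # [1, 2, 3, 4]
```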

Selecting Multiple Sets of Columns in a DataFrame [duplicate]

This question already has answers here:
Selecting multiple columns R vs python pandas
(2 answers)
Closed 3 years ago.
Is there a way to select multiple sets of columns from a dataframe, without naming the columns individually? For example, all rows of the 1st to 4th, 7th to 9th and 22nd to 29th columns.
I tried
df.loc[:, [1:5, 7:10, 22:30] ]
and
df.loc[:, [[1:5], [7:10], [22:30]] ]
without success
Try this:
df = pd.DataFrame(np.random.random((10,25)))
df.iloc[:, np.r_[1:5, 10:15, 24]]
Output:
1 2 3 4 10 11 12 \
0 0.919851 0.852250 0.296771 0.562167 0.926956 0.425690 0.347112
1 0.053743 0.709286 0.866658 0.873554 0.588566 0.349387 0.582820
2 0.910201 0.918976 0.170105 0.967791 0.839613 0.200846 0.680498
3 0.606104 0.932580 0.857744 0.876963 0.199340 0.303397 0.103754
4 0.310878 0.386755 0.792151 0.664561 0.295020 0.980937 0.161358
5 0.808738 0.473452 0.190060 0.882827 0.778226 0.054262 0.052157
6 0.381418 0.216191 0.034603 0.314118 0.806126 0.535102 0.903150
7 0.531248 0.411528 0.644153 0.994051 0.727920 0.587441 0.679924
8 0.585064 0.352427 0.940689 0.684018 0.544400 0.765451 0.018906
9 0.075305 0.526637 0.911727 0.945098 0.105858 0.299441 0.862912
13 14 24
0 0.084237 0.317501 0.906934
1 0.949726 0.744821 0.149304
2 0.529243 0.492711 0.933917
3 0.723055 0.898373 0.642724
4 0.929206 0.540533 0.467883
5 0.825112 0.357224 0.235781
6 0.258703 0.114978 0.506079
7 0.758599 0.440214 0.863970
8 0.936511 0.117202 0.089875
9 0.968953 0.509748 0.584470
You need to use iloc:
mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
          {'a': 100, 'b': 200, 'c': 300, 'd': 400},
          {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}]
df = pd.DataFrame(mydict)
Generate the dataframe; then, if you want a single contiguous selection:
print(df.iloc[:2])
Output:
     a    b    c    d
0    1    2    3    4
1  100  200  300  400
EDIT: If you want multiple selection ranges, you can use np.r_, which concatenates several slices into a single index array that can be passed to iloc:
df.iloc[:, np.r_[1:5, 10:15, 22:30]]
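If you'd rather avoid NumPy, the same selection can be built by concatenating plain iloc slices; a small sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((10, 25)))

# Same columns as np.r_[1:5, 10:15, 24], built from plain iloc slices
subset = pd.concat([df.iloc[:, 1:5], df.iloc[:, 10:15], df.iloc[:, [24]]],
                   axis=1)
print(list(subset.columns))  # [1, 2, 3, 4, 10, 11, 12, 13, 14, 24]
```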

Remove row having any value 0 [duplicate]

This question already has answers here:
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 3 years ago.
I have a dataframe with two columns, "service" and "value". I want to remove all the rows having 0 in the value column.
service value
abc 10
def 0
ghi 0
xyz 5
I want my dataframe to look like:
service value
abc 10
xyz 5
I tried the following
df = df[(df != 0).all(1)]
df = pd.DataFrame(list(result.items()),columns=['service', 'value'])
df = df[(df != 0).all(1)]
For a small dataframe having 6-7 rows it works fine, but on another dataframe having 125 rows I get the following error:
Illegal instruction
PS: I checked all the values under "value" column and these are numbers.
You can use the drop function combined with a condition:
df = pd.DataFrame(
    {'service': ['abc', 'def', 'ghi', 'xyz'],
     'value': [10, 0, 0, 5]})
df.drop(df[df.value == 0].index)
Out :
service value
0 abc 10
3 xyz 5
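A shorter equivalent is a boolean mask on just the value column; a minimal sketch, which also sidesteps the all(1) pitfall of comparing every column:

```python
import pandas as pd

df = pd.DataFrame({'service': ['abc', 'def', 'ghi', 'xyz'],
                   'value': [10, 0, 0, 5]})

# Test only the 'value' column, so string columns such as 'service'
# never enter the comparison
df = df[df['value'] != 0]
print(df)
```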

Pull out values in a dataframe column corresponding to numbers in a pandas.series - and transfer to new series

I have a pandas series and a pandas dataframe. The pandas dataframe has two columns, 'a' and 'b'.
For every number in the pandas series, I need to find the closest number in column 'b' of the dataframe, but place the corresponding value from column 'a' into a new series.
I've been trying to use the iloc and index.get_loc functions to find the right row in the dataframe, but I've had no success.
Example:
Series = 0, 1, 2, 3, 4, 5
Dataframe =
'a' 'b'
1 0
4 3
5 2
6 1
8 5
9 4
New_series = 1, 6, 5, 4, 9, 8
The order of the first series needs to be maintained in the new_series.
This should work for ya:
new_series = []
for num in old_series:
    # index of the row whose 'b' value is closest to num
    idx = (df['b'] - num).abs().idxmin()
    new_series.append(df.loc[idx, 'a'])
new_series = pd.Series(new_series)
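A vectorized alternative, a sketch assuming both key columns are numeric, is pd.merge_asof with direction='nearest', which pairs each series value with the closest 'b':

```python
import pandas as pd

old_series = pd.Series([0, 1, 2, 3, 4, 5])
df = pd.DataFrame({'a': [1, 4, 5, 6, 8, 9],
                   'b': [0, 3, 2, 1, 5, 4]})

# merge_asof needs both keys sorted; tag the series with its original
# position so the input order can be restored afterwards
left = old_series.rename('key').to_frame().assign(order=range(len(old_series)))
merged = pd.merge_asof(left.sort_values('key'), df.sort_values('b'),
                       left_on='key', right_on='b', direction='nearest')
new_series = merged.sort_values('order')['a'].reset_index(drop=True)
print(new_series.tolist())  # [1, 6, 5, 4, 9, 8]
```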