Drop columns according to header value - pandas

I have this dataframe with multiple headers
name, 00590BL, 01090BL, 01100MS, 02200MS
lat, 613297, 626278, 626323, 616720
long, 5185127, 5188418, 5188431, 5181393
elv, 1833, 1915, 1915, 1499
1956-01-01, 1, 2, 2, -2
1956-01-02, 2, 3, 3, -1
1956-01-03, 3, 4, 4, 0
1956-01-04, 4, 5, 5, 1
1956-01-05, 5, 6, 6, 2
I read this as
dfr = pd.read_csv(f_name,
skiprows = 0,
header = [0,1,2,3],
index_col = 0,
parse_dates = True
)
I would like to remove the columns 01090BL and 01100MS. The idea, in the main program, is to keep a list of the columns that I want to remove and then drop them. I have therefore done the following:
2bremoved = ['01090BL', '01100MS']
dfr = dfr.drop(2bremoved, axis=1, inplace=True)
but I get the following error:
PerformanceWarning: dropping on a non-lexsorted multi-index without a level parameter may impact performance.
obj = obj._drop_axis(labels, axis, level=level, errors=errors)
/usr/lib/python3/dist-packages/pandas/core/frame.py:4906: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
I have thus done the following:
aa = dfr.drop(2bremoved, axis=1, inplace=True,level = 0)
but I get an empty dataframe. What am I missing?
thanks

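Don't use inplace=True when assigning the output: with inplace=True, drop modifies dfr itself and returns None, so the assignment leaves you with None. Also, a variable name cannot start with a digit in Python, so 2bremoved is a syntax error: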
to_remove = ['01090BL', '01100MS']
aa = dfr.drop(to_remove, axis=1, level=0)
Output:
name 00590BL 02200MS
lat 613297 616720
long 5185127 5181393
elv 1833 1499
1956-01-01 1 -2
1956-01-02 2 -1
1956-01-03 3 0
1956-01-04 4 1
1956-01-05 5 2
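As a self-contained illustration of the fix above, here is a minimal sketch that rebuilds a smaller version of the frame in code instead of reading it from a file. The three data rows, the level names passed to names=, and the variable names kept and in_place_result are assumptions made for this example only.

```python
import pandas as pd

# Four header levels, mirroring the name/lat/long/elv rows of the CSV.
columns = pd.MultiIndex.from_arrays(
    [
        ["00590BL", "01090BL", "01100MS", "02200MS"],
        [613297, 626278, 626323, 616720],
        [5185127, 5188418, 5188431, 5181393],
        [1833, 1915, 1915, 1499],
    ],
    names=["name", "lat", "long", "elv"],
)
dfr = pd.DataFrame(
    [[1, 2, 2, -2], [2, 3, 3, -1], [3, 4, 4, 0]],
    index=pd.to_datetime(["1956-01-01", "1956-01-02", "1956-01-03"]),
    columns=columns,
)

to_remove = ["01090BL", "01100MS"]

# Without inplace, drop returns a new frame and leaves dfr untouched.
kept = dfr.drop(columns=to_remove, level="name")
print(kept.columns.get_level_values("name").tolist())

# With inplace=True, drop returns None, so assigning its result loses the data.
in_place_result = dfr.copy().drop(columns=to_remove, level="name", inplace=True)
print(in_place_result)
```

Passing level tells pandas which header row the labels belong to, so it does not have to search the whole MultiIndex.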


Applying a mask to a certain range in a pandas column

I'm currently trying to apply a mask to a column on a dataframe, in order to gain the mean from certain values. However, I don't want to do this over the whole column, just over a small range. This is my code at present:
data = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9, 4, 3, 2, 1, 4, 2, 2, 4, 2, 5]})
range_start = 5
range_finish = 17
mask = np.arange(len(data)) %4
measured_stress_ratio_overload = data.iloc[range_start:range_finish, mask == 0, 'test'].mean()
measured_stress_ratio_baseline = data.iloc[range_start:range_finish, mask!= 0, 'test'].mean()
My expected output is the average of the values at positions 8, 12 and 16 for measured_stress_ratio_overload, and the average of all the other values between positions 5 and 17 for measured_stress_ratio_baseline. However, when I try to run this code, I get this error:
IndexingError: Too many indexers
How do I use this range to properly index and retrieve the answer I'd like? Any help would be greatly appreciated!
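You shouldn't put the mask inside iloc, which accepts at most one indexer per axis. Since you are using the remainder of the row position divided by 4 to pick the rows you want, you can first add it as a new column of the dataframe, then slice the range and filter on that column.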
data['divisor'] = np.arange(len(data)) %4
window = data.iloc[range_start:range_finish]
measured_stress_ratio_overload = window[window['divisor'] == 0]['test'].mean()
measured_stress_ratio_baseline = window[window['divisor'] != 0]['test'].mean()
Or you can use DataFrame.where, which replaces the rows that fail the condition with NaN (mean skips NaN by default):
measured_stress_ratio_overload = data.iloc[range_start:range_finish].where(data['divisor'] == 0)['test'].mean()
measured_stress_ratio_baseline = data.iloc[range_start:range_finish].where(data['divisor'] != 0)['test'].mean()
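For comparison, here is a minimal sketch that does the same thing without adding a helper column, by combining two boolean arrays built from the row positions. The names positions, inside_range and every_fourth are assumptions for this example, not part of the original code.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9,
                              4, 3, 2, 1, 4, 2, 2, 4, 2, 5]})
range_start = 5
range_finish = 17

positions = np.arange(len(data))
# True for rows 5..16, matching the half-open slice range_start:range_finish.
inside_range = (positions >= range_start) & (positions < range_finish)
# True for every fourth row, counted from position 0.
every_fourth = positions % 4 == 0

# Rows 8, 12 and 16 hold the values 10, 2 and 2.
measured_stress_ratio_overload = data.loc[inside_range & every_fourth, "test"].mean()
# The remaining nine rows in the range sum to 33.
measured_stress_ratio_baseline = data.loc[inside_range & ~every_fourth, "test"].mean()

print(measured_stress_ratio_overload, measured_stress_ratio_baseline)
```

Because both masks have one entry per row of data, they can be combined with & and passed to loc together with the column label.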

tensor cumulative addition

Suppose I have the following two tensors
count = torch.tensor([5, 3], dtype = torch.long)
label = torch.tensor([1,1,0,0,2,0,0,1], dtype = torch.long)
I want to add the value k = 3 to label according to count. For the first 5 elements of label we add 0, and for elements 6 to 8 of label we add 3. The result should look like
torch.tensor([1, 1, 0, 0, 2, 3, 3, 4], dtype = torch.long)
How can this be made to work in the general case?
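One possible way to generalise this, sketched under an assumption the question does not state explicitly: group i (counting from 0) receives an offset of i * k, so with more entries in count the offsets would be 0, 3, 6, and so on. The names group_offsets and per_element_offsets are made up for this sketch.

```python
import torch

count = torch.tensor([5, 3], dtype=torch.long)
label = torch.tensor([1, 1, 0, 0, 2, 0, 0, 1], dtype=torch.long)
k = 3

# Assumption: the i-th group gets offset i * k, giving [0, 3] here.
group_offsets = k * torch.arange(len(count), dtype=torch.long)

# Repeat each group's offset count[i] times so it lines up with label.
per_element_offsets = torch.repeat_interleave(group_offsets, count)

result = label + per_element_offsets
print(result)
```

This relies on count summing to the length of label; if the offsets should follow a different rule, only group_offsets needs to change.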

Pandas - Row mask and 2d ndarray assignment

I'm having some problems with pandas. I think I'm not using it properly, and I would like some help to do it right.
I have a mask for the rows of a dataframe; this mask is a simple list of Boolean values.
I would like to assign a 2D array to a new or existing column, one row of the array per selected row.
mask = some_row_mask()
my2darray = some_operation(dataframe.loc[mask, column])
dataframe.loc[mask, new_or_exist_column] = my2darray
# Also tried this
dataframe.loc[mask, new_or_exist_column] = [f for f in my2darray]
Example data:
dataframe = pd.DataFrame({'Fun': ['a', 'b', 'a'], 'Data': [10, 20, 30]})
mask = dataframe['Fun']=='a'
my2darray = [[0, 1, 2, 3, 4], [4, 3, 2, 1, 0]]
column = 'Data'
new_or_exist_column = 'NewData'
Expected output
Fun Data NewData
0 a 10 [0, 1, 2, 3, 4]
1 b 20 NaN
2 a 30 [4, 3, 2, 1, 0]
dataframe[mask] and my2darray both have exactly the same number of rows, but it always ends with:
ValueError: Must have equal len keys and value when setting with ndarray.
Thanks for your help!
EDIT - In context:
To add some detail: this is used for filling folds step by step, so each time I compute and set some values from a subset of the dataframe.
Instead of this, as suggested by Parth:
dataframe[new_or_exist_column]=pd.Series(my2darray, index=mask[mask==True].index)
I changed to this:
dataframe.loc[mask, out] = pd.Series([f for f in features], index=mask[mask==True].index)
Otherwise, all values that were already set are overwritten with NaN.
I had left that detail out of the original question.
Thanks!
Try this:
dataframe[new_or_exist_column]=np.nan
dataframe[new_or_exist_column]=pd.Series(my2darray, index=mask[mask==True].index)
It will give desired output:
Fun Data NewData
0 a 10 [0, 1, 2, 3, 4]
1 b 20 NaN
2 a 30 [4, 3, 2, 1, 0]
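Here is a runnable version of that idea on the example data from the question, as a minimal sketch. It wraps the 2D list in a Series indexed by the selected rows, so pandas aligns on the index instead of trying to broadcast the array. The variable name rows is an assumption for this example.

```python
import pandas as pd

dataframe = pd.DataFrame({"Fun": ["a", "b", "a"], "Data": [10, 20, 30]})
mask = dataframe["Fun"] == "a"
my2darray = [[0, 1, 2, 3, 4], [4, 3, 2, 1, 0]]

# One list per selected row, labelled with that row's index value (0 and 2).
rows = pd.Series(my2darray, index=dataframe.index[mask])

# Column assignment aligns on the index; rows not in the Series become NaN.
dataframe["NewData"] = rows
print(dataframe)
```

Note that this replaces the whole column, which is why the edit in the question switches to dataframe.loc[mask, ...] when earlier values must be preserved.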

Slicing of a Pandas Series when index elements are not default (doesn't start with 0)

I created a Pandas Series in Python 3.7, providing 'data' and 'index'. The data is a list of lists and tuples with len(list) = 6, and the index list starts from 3 rather than from 0.
I want to slice the series.
import pandas as pd
li_a = [[1,2],[3,4],[5,6],[7,8],(9,10),(11,12)]
li_c = [3,4,5,6,7,8]
ser1 = pd.Series(data=li_a,index=li_c)
So ser1[3] outputs [1,2], i.e. the first element of the Series.
I expected the output of ser1[3:] to be the entire Series, but the output was
6 [7, 8]
7 (9, 10)
8 (11, 12)
dtype: object
It works that way because slicing with ser1[3:] selects by row position, not by index label, whereas ser1[3] with a single integer looks up the label 3:
print(ser1[3:])
output:
6 [7, 8]
7 (9, 10)
8 (11, 12)
If you want to select rows starting from a specific index label, you need to use loc:
print(ser1.loc[3:])
output:
3 [1, 2]
4 [3, 4]
5 [5, 6]
6 [7, 8]
7 (9, 10)
8 (11, 12)
Edited: changed from iloc to loc.
loc gets rows (or columns) with particular labels from the index.
Your full code (I have also changed your if __name__ line):
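import numpy as np
import pandas as pd

def main():
    arr = np.arange(10, 16)
    index1 = np.arange(3, 9)
    ser1 = pd.Series(data=arr, index=index1)
    print(ser1)
    # Label-based slice: starts at index label 3, so the whole Series is printed.
    print(ser1.loc[3:])

if __name__ == "__main__":
    main()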
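To make the difference between the two accessors explicit, here is a short sketch on the Series from the question. The names by_label and by_position are assumptions for this example.

```python
import pandas as pd

li_a = [[1, 2], [3, 4], [5, 6], [7, 8], (9, 10), (11, 12)]
li_c = [3, 4, 5, 6, 7, 8]
ser1 = pd.Series(data=li_a, index=li_c)

# loc slices by index label: everything from label 3 onwards.
by_label = ser1.loc[3:]
# iloc slices by row position: skips the first three rows.
by_position = ser1.iloc[3:]

print(by_label.index.tolist())
print(by_position.index.tolist())
```

Using loc or iloc explicitly avoids having to remember which rule the plain [] operator applies to an integer slice.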

How to use the values of one column to access values in another column?

How can I use the values of one column to access values in another?
import numpy
import pandas
numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10)), columns=[['Value']])
df['bleh'] = df.index.to_series().apply(lambda x: numpy.random.randint(0, x + 1, 1)[0])
So how do I access the value 'bleh' for each row?
df.Value.iloc[df['bleh']]
Edit:
Thanks to @ScottBoston. My DataFrame constructor had one layer of [] too many.
The correct answer is:
numpy.random.seed(123)
df = pandas.DataFrame((numpy.random.normal(0, 1, 10)), columns=['Value'])
df['bleh'] = df.index.to_series().apply(lambda x: numpy.random.randint(0, x + 1, 1)[0])
df['idx_int'] = range(df.shape[0])
df['haa'] = df['idx_int'] - df.bleh.values
df['newcol'] = df.Value.iloc[df['haa'].values].values
Try:
df['Value'].tolist()
Output:
[-1.0856306033005612,
0.9973454465835858,
0.28297849805199204,
-1.506294713918092,
-0.5786002519685364,
1.651436537097151,
-2.426679243393074,
-0.42891262885617726,
1.265936258705534,
-0.8667404022651017]
Your dataframe constructor still needs to be fixed.
Are you looking for:
df.set_index('bleh')
output:
Value
bleh
0 -1.085631
1 0.997345
2 0.282978
1 -1.506295
4 -0.578600
0 1.651437
0 -2.426679
4 -0.428913
1 1.265936
7 -0.866740
If so, your dataframe constructor has an extra set of [] in it.
np.random.seed(123)
df = pd.DataFrame((np.random.normal(0, 1, 10)), columns=['Value'])
df['bleh'] = df.index.to_series().apply(lambda x: np.random.randint(0, x + 1, 1)[0])
The columns parameter of DataFrame expects a flat list here, not a list of lists; a list of lists creates MultiIndex columns.
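To illustrate both points, here is a small sketch with fixed numbers instead of random ones. It assumes that 'bleh' is meant as a row position into 'Value'; the column name picked and the sample values are made up for this example.

```python
import pandas as pd

# A list of lists for columns builds a MultiIndex, which is what broke df.Value.
nested = pd.DataFrame([[1.0], [2.0]], columns=[["Value"]])
print(type(nested.columns).__name__)

# A flat list gives ordinary columns.
df = pd.DataFrame({"Value": [10.0, 20.0, 30.0, 40.0], "bleh": [0, 1, 0, 2]})

# Assumption: bleh holds row positions; look them up in Value with NumPy indexing.
df["picked"] = df["Value"].to_numpy()[df["bleh"].to_numpy()]
print(df)
```

Going through to_numpy() sidesteps index alignment, so each row gets the value found at its own 'bleh' position.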