cannot transform values in pandas dataframe using a mask [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 8 hours ago.
Here is an example to illustrate. I am doing something as follows:
import numpy as np
import pandas as pd
data = {'col_1': [3, 5, -1, 0], 'col_2': ['a', 'b', 'c', 'd']}
x = pd.DataFrame.from_dict(data)
mask = x['col_1'].values > 0
x[mask]['col_1'] = np.log(x[mask]['col_1'])
This comes back with:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
Also, the dataframe remains unchanged.

Chained indexing such as x[mask]['col_1'] = ... assigns into a temporary copy of the slice, which is why the warning appears and x stays unchanged. Use DataFrame.loc to select and set the column with the condition in a single step:
mask = x['col_1'].values > 0
x.loc[mask, 'col_1'] = np.log(x.loc[mask, 'col_1'])
print (x)
      col_1 col_2
0  1.098612     a
1  1.609438     b
2 -1.000000     c
3  0.000000     d
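As a side note (a sketch not taken from the answer above), the same result can be reached without .loc by letting np.where choose between the logged and original values; np.log still evaluates the non-positive entries, so its warnings are suppressed here:
import numpy as np
import pandas as pd

data = {'col_1': [3, 5, -1, 0], 'col_2': ['a', 'b', 'c', 'd']}
x = pd.DataFrame.from_dict(data)
mask = x['col_1'].values > 0

# np.where picks np.log(col_1) where the mask is True and keeps col_1 otherwise.
# log(0) and log(-1) are computed but discarded, so silence their warnings.
with np.errstate(divide='ignore', invalid='ignore'):
    x['col_1'] = np.where(mask, np.log(x['col_1']), x['col_1'])

print(x)
#       col_1 col_2
# 0  1.098612     a
# 1  1.609438     b
# 2 -1.000000     c
# 3  0.000000     d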

Related

drop rows from a Pandas dataframe based on which rows have missing values in another dataframe

I'm trying to drop rows with missing values in any of several dataframes.
They all have the same number of rows, so I tried this:
model_data_with_NA = pd.concat([other_df,
                                standardized_numerical_data,
                                encode_categorical_data], axis=1)
ok_rows = ~(model_data_with_NA.isna().all(axis=1))
model_data = model_data_with_NA.dropna()
assert(sum(ok_rows) == len(model_data))
The assertion fails!
As a Python newbie, I wonder why this doesn't work. Also, would it be better to use hierarchical indexing, so that I can then extract the original columns from model_data?
In Short
I believe the all in ~(model_data_with_NA.isna().all(axis=1)) should be replaced with any.
The reason is that all checks whether every value in a row is missing, while any checks whether at least one value is missing.
Full Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a':[1, 2, 3]})
df2 = pd.DataFrame({'b':[1, np.nan]})
df3 = pd.DataFrame({'c': [1, 2, np.nan]})
model_data_with_na = pd.concat([df1, df2, df3], axis=1)
ok_rows = ~(model_data_with_na.isna().any(axis=1))
model_data = model_data_with_na.dropna()
assert(sum(ok_rows) == len(model_data))
model_data_with_na
   a    b    c
0  1  1.0  1.0
1  2  NaN  2.0
2  3  NaN  NaN

model_data
   a    b    c
0  1  1.0  1.0
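As for the hierarchical-indexing idea in the question, a possible sketch (not part of the answer above) is to pass keys to pd.concat so each source DataFrame becomes the outer level of a column MultiIndex; the original columns can then be pulled back out by that key:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'b': [1, np.nan]})
df3 = pd.DataFrame({'c': [1, 2, np.nan]})

# keys= labels the columns coming from each frame with an outer level
model_data_with_na = pd.concat([df1, df2, df3], axis=1,
                               keys=['df1', 'df2', 'df3'])
model_data = model_data_with_na.dropna()

# Recover the columns that originally came from df1
print(model_data['df1'])
#    a
# 0  1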

How to sort a dataframe by a multiindex level? [duplicate]

This question already has answers here:
Sorting columns of multiindex dataframe
(2 answers)
Closed 7 months ago.
I have a pandas dataframe with a MultiIndex containing various data. A minimal example could be this one:
import numpy as np
import pandas as pd

elev = [1, 100, 10, 1000]
number = [4, 3, 1, 2]
name = ['foo', 'bar', 'baz', 'qux']
idx = pd.MultiIndex.from_arrays([name, elev, number],
                                names=('name', 'elev', 'number'))
data = np.random.rand(4, 4)
df = pd.DataFrame(data=data, columns=idx)
Now I want to sort it by its elevation or number. There seems to be a built-in function for this, MultiIndex.sortlevel, but it only sorts the MultiIndex itself, and I can't figure out how to make it sort the dataframe along with that index.
df.columns.sortlevel(level=1) gives me a sorted MultiIndex
(MultiIndex([('foo',    1, 4),
             ('baz',   10, 1),
             ('bar',  100, 3),
             ('qux', 1000, 2)],
            names=['name', 'elev', 'number']),
 array([0, 2, 1, 3], dtype=int64))
but trying to apply it with df.columns = df.columns.sortlevel(level=1) or df = ... either gives me ValueError: Length mismatch: Expected axis has 4 elements, new values have 2 elements or turns df into the sorted MultiIndex. The axis and inplace keywords I'm used to for similar operations aren't supported by sortlevel.
How do I apply my sorting to my dataframe?
Use DataFrame.sort_index:
df = df.sort_index(level=1, axis=1)
print (df)
name         foo       baz       bar       qux
elev           1        10       100      1000
number         4         1         3         2
0       0.009359  0.113384  0.499058  0.049974
1       0.685408  0.897657  0.486988  0.647452
2       0.896963  0.831353  0.721135  0.827568
3       0.833580  0.368044  0.957044  0.494838
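A small addition (not from the answer above): sort_index also accepts the level name, which can be clearer than a positional level, and the same call sorts by any other level such as number:
import numpy as np
import pandas as pd

elev = [1, 100, 10, 1000]
number = [4, 3, 1, 2]
name = ['foo', 'bar', 'baz', 'qux']
idx = pd.MultiIndex.from_arrays([name, elev, number],
                                names=('name', 'elev', 'number'))
df = pd.DataFrame(data=np.random.rand(4, 4), columns=idx)

# Same as level=1, but using the level name
print(df.sort_index(level='elev', axis=1))

# Sorting the columns by the 'number' level instead
print(df.sort_index(level='number', axis=1))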

How to select rows of a dataframe according to the list of ids? [duplicate]

This question already has answers here:
Select rows from a DataFrame based on multiple values in a column in pandas [duplicate]
(1 answer)
How do I select rows from a DataFrame based on column values?
(16 answers)
Closed 1 year ago.
I have the following dataframe and data list, respectively:
import pandas as pd
df = pd.DataFrame({'ID': [1, 2, 4, 7, 30],
                   'Instrument': ['temp_sensor', 'temp_sensor', 'temp_sensor',
                                  'strain_gauge', 'light_sensor'],
                   'Value': [1000, 0, 1000, 0, 1000]})
print(df)
ID Instrument Value
1 temp_sensor 1000
2 temp_sensor 0
4 temp_sensor 1000
7 strain_gauge 0
30 light_sensor 1000
list_ID = [2, 30]
I would like to generate a new dataframe that corresponds to df but contains only the rows whose ID belongs to list_ID.
I tried to implement the following code. However, it is not working:
d = {'ID': [], 'Instrument': [], 'Value': []}
df_aux = pd.DataFrame(d)
for j in range(0, len(df)):
    for k in range(0, len(list_ID)):
        if (df['ID'][j] == list_ID[k]):
            df_aux.append(df[df['ID'][j] == list_ID[k]])
The error appears: KeyError: True
I would like the output of df_aux to be:
ID Instrument Value
2 temp_sensor 0
30 light_sensor 1000
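A minimal sketch along the lines of the linked duplicates: Series.isin builds a boolean mask marking which IDs appear in list_ID, and that mask selects the rows directly, with no loops:
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 4, 7, 30],
                   'Instrument': ['temp_sensor', 'temp_sensor', 'temp_sensor',
                                  'strain_gauge', 'light_sensor'],
                   'Value': [1000, 0, 1000, 0, 1000]})
list_ID = [2, 30]

# Boolean mask: True where the ID is in list_ID
df_aux = df[df['ID'].isin(list_ID)]
print(df_aux)
#    ID    Instrument  Value
# 1   2   temp_sensor      0
# 4  30  light_sensor   1000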

Pandas: Why is the dataframe not appended? [duplicate]

This question already has answers here:
Appending to an empty DataFrame in Pandas?
(5 answers)
Creating an empty Pandas DataFrame, and then filling it
(8 answers)
Closed 3 years ago.
I am trying to append a new row to an empty DataFrame, and I found that the code below works fine:
import pandas as pd
df = pd.DataFrame(columns=['A'])
for i in range(5):
    df = df.append({'A': i}, ignore_index=True)
So, it gives me:
A
0 0
1 1
2 2
3 3
4 4
But when I try the code below, my DataFrame is still empty:
df = pd.DataFrame(columns=['A'])
df.append({'A': 2}, ignore_index=True)
df
Can someone explain how to add just one row?
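A short sketch of the behaviour the linked duplicates describe: DataFrame.append does not modify the frame in place, it returns a new DataFrame, so the result has to be assigned back; in pandas 2.0+ append has been removed, and pd.concat is the usual way to add a single row:
import pandas as pd

df = pd.DataFrame(columns=['A'])

# append returns a new frame; without the assignment df stays empty
# (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0)
# df = df.append({'A': 2}, ignore_index=True)

# Current idiom: concatenate a one-row frame
df = pd.concat([df, pd.DataFrame([{'A': 2}])], ignore_index=True)
print(df)
#    A
# 0  2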

Multiple, multi-value columns in pandas dataset - want to make multiple rows [duplicate]

This question already has answers here:
Split (explode) pandas dataframe string entry to separate rows
(27 answers)
Closed 4 years ago.
I have the following dataset from Twitter in a pandas DataFrame.
app_clicks billed_charge_local_micro billed_engagements card_engagements ... retweets tweets_send unfollows url_clicks
0 None [422040000, 422040000, 422040000] [59, 65, 63] None ... [0, 2, 0] None [0, 0, 1] [65, 68, 67]
I want to turn that into three rows, but I'm not sure of the best way to do it. I looked around and saw things like melt, merge and stack, but nothing that really looks like what I need.
Want it to be like this (don't care about index, just for visual purposes)
Index billed_charge_local_micro
0 422040000
1 422040000
2 422040000
Thanks.
You can do this with a couple of DataFrame operations:
import pandas as pd
df2 = pd.DataFrame({'billed_charge_local_micro': [[422040000, 422040000, 422040000]],
                    'other1': 10000,
                    'other2': 'abc'})
print(df2)
# billed_charge_local_micro other1 other2
# 0 [422040000, 422040000, 422040000] 10000 abc
df = df2['billed_charge_local_micro'].apply(pd.Series)
df = df.transpose()
df.columns = ["billed_charge_local_micro"]
print (df)
Final result:
billed_charge_local_micro
0 422040000
1 422040000
2 422040000
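As a side note (not part of the answer above), pandas 0.25+ also provides DataFrame.explode, which turns each element of a list-valued column into its own row and repeats the other columns:
import pandas as pd

df2 = pd.DataFrame({'billed_charge_local_micro': [[422040000, 422040000, 422040000]],
                    'other1': 10000,
                    'other2': 'abc'})

# Each list element becomes its own row; the index is reset afterwards
exploded = df2.explode('billed_charge_local_micro').reset_index(drop=True)
print(exploded)
#   billed_charge_local_micro  other1 other2
# 0                 422040000   10000    abc
# 1                 422040000   10000    abc
# 2                 422040000   10000    abc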