Pandas dataframe normalization returning NaNs

I'm using python 3.9.7 and pandas version 1.3.4.
I'm trying to create a normalized set of columns in pandas, but my columns keep coming back as NaNs. I broke the steps down and assigned intermediate variables, which have non-NaN values, but when I do the final reassignment back to the dataframe, everything becomes NaN. I wrote up a simpler example case:
import numpy as np
import pandas as pd
time = [1.0, 1.1, 2.0]
col1 = [1.0, 3.0, 6.0]
col2 = [3.0, 5.0, 9.0]
col3 = [1.5, 2.5, 3.5]
junk = ['wow', 'fun', 'times']
df2 = pd.DataFrame({'Time [days]': time, 'col1': col1, 'col2': col2,'col3': col3, 'junk':junk})
df2
num1 = len(df2.columns)
num2 = len(df2.columns[1:-1])
for col in df2.columns[1:-1]:
    df3 = pd.DataFrame({str(col) + '_normalized_values': df2[str(col)]})
    df2 = df2.join(df3)
    del df3
df2.head()
df2.index = df2['Time [days]'].values
t=df2.index[1]
cols = df2.columns
a = df2.loc[t,cols[1:(num1-1)]]
b = (df2.groupby('Time [days]').sum().loc[t,cols[1:(num1-1)]]+1.0e-20)
c = a/b #c is coming back as the expected values
df2.loc[t,cols[num1:(num1+num2)]] = c
df2.loc[t,cols[num1:(num1+num2)]] #This step always prints all NaNs
I've checked the shapes of c and the LHS selection, and they're the same. I also checked the dtypes, and they match. At this point, I'm at a loss for what could be causing the issue.

There is an index mismatch between c and df2: assigning a Series via .loc aligns on index labels rather than position, and c's index (the original column names 'col1', 'col2', ...) doesn't overlap the target *_normalized_values columns, so every target cell becomes NaN. Changing the RHS of your final assignment to c.values solves the problem:
df2.loc[t,cols[num1:(num1+num2)]] = c.values
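A minimal sketch of the same alignment effect (my own illustration, not from the question):
import pandas as pd
df = pd.DataFrame({'a': [0.0, 0.0], 'b': [0.0, 0.0]})
s = pd.Series([1.0, 2.0], index=['x', 'y'])  # labels don't match df's columns
df.loc[0, ['a', 'b']] = s         # aligns 'x','y' against 'a','b' -> all NaN
df.loc[1, ['a', 'b']] = s.values  # positional assignment -> 1.0, 2.0
print(df)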

Related

drop rows from a Pandas dataframe based on which rows have missing values in another dataframe

I'm trying to drop rows with missing values in any of several dataframes.
They all have the same number of rows, so I tried this:
model_data_with_NA = pd.concat([other_df,
                                standardized_numerical_data,
                                encode_categorical_data], axis=1)
ok_rows = ~(model_data_with_NA.isna().all(axis=1))
model_data = model_data_with_NA.dropna()
assert(sum(ok_rows) == len(model_data))
The assertion fails!
As a Python newbie, I wonder why this doesn't work. Also, would it be better to use hierarchical indexing, so that I can extract the original columns from model_data?
In Short
I believe the all in ~(model_data_with_NA.isna().all(axis=1)) should be replaced with any.
The reason is that all checks here whether every value in a row is missing, while any checks whether at least one value is missing; dropna() drops a row if any value is missing, so the mask has to match that.
Full Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a':[1, 2, 3]})
df2 = pd.DataFrame({'b':[1, np.nan]})
df3 = pd.DataFrame({'c': [1, 2, np.nan]})
model_data_with_na = pd.concat([df1, df2, df3], axis=1)
ok_rows = ~(model_data_with_na.isna().any(axis=1))
model_data = model_data_with_na.dropna()
assert(sum(ok_rows) == len(model_data))
model_data_with_na

   a    b    c
0  1  1.0  1.0
1  2  NaN  2.0
2  3  NaN  NaN
model_data

   a    b    c
0  1  1.0  1.0
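Equivalently, the boolean mask itself can be used to filter the rows, which makes the assert hold by construction. Continuing the example above (my own addition):
model_data = model_data_with_na[ok_rows]  # same rows dropna() keeps
assert sum(ok_rows) == len(model_data)
assert model_data.equals(model_data_with_na.dropna())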

Pandas: Replace pattern in column values with list values

Stuck on the following in Pandas...
I have the dataframe:
df = pd.DataFrame({'col1': ['aaCHANGEMEbb', 'nnANDMEdd', 'hhMETOOkk'], 'index': ['a', 'b', 'c']}).set_index('index')
col1
a aaCHANGEMEbb
b nnANDMEdd
c hhMETOOkk
And I want to replace each run of uppercase letters in column 'col1' with the corresponding value from m_list:
m_list = ['1.0', '2.0', '3.0']
One of my attempts that seems close to the truth:
df['col1'] = df['col1'].str.replace('[A-Z]+', lambda x: [i for i in m_list], regex=True)
And another one:
df['col1'].str.replace('([A-Z]+)', lambda x: m_list[x.group(0)])
It doesn't work. I got this in both cases:
col1
a NaN
b NaN
c NaN
But the expected df is below:
col1
a aa1.0bb
b nn2.0dd
c hh3.0kk
Please, share your thoughts about this. Thanks!
You can do it with split:
s = df['col1'].str.split(r'([A-Z]+)', expand=True)
s.loc[:, 1] = m_list
df['col1'] = s.agg(''.join, axis=1)
df
Out[255]:
col1
index
a aa1.0bb
b nn2.0dd
c hh3.0kk
import re

for idx, x in enumerate(m_list):
    df.iloc[idx, 0] = re.sub(r'([A-Z]+)', x, df.iloc[idx, 0])
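Since str.replace also accepts a callable replacement when regex=True, another option (my own sketch, assuming exactly one uppercase run per row so the list lines up with the rows) is to consume m_list with an iterator:
import pandas as pd
df = pd.DataFrame({'col1': ['aaCHANGEMEbb', 'nnANDMEdd', 'hhMETOOkk'],
                   'index': ['a', 'b', 'c']}).set_index('index')
m_list = ['1.0', '2.0', '3.0']
it = iter(m_list)
# elements are processed in row order, so next(it) yields the matching value
df['col1'] = df['col1'].str.replace('[A-Z]+', lambda m: next(it), regex=True)
print(df)  # aa1.0bb, nn2.0dd, hh3.0kk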

How to insert a tuple into row of pandas DataFrame

I want to insert a row of values into a DataFrame based on the values in a tuple. Below is an example where I want to insert the values from names['blue'] into columns 'a' and 'b' of the DataFrame.
import numpy as np
import pandas as pd
df = pd.DataFrame({'name': ['red', 'blue', 'green'],
                   'a': [1, np.nan, 2],
                   'b': [2, np.nan, 3]})
names = {'blue': (1, 2),
         'yellow': (5, 5)}
I have an attempt below (note that 'a' and 'b' will always be missing together):
names_needed = df.loc[df['a'].isnull(), 'name']
subset_dict = {colour: names[colour] for colour in names_needed}
for colour, values in subset_dict.items():
    df.loc[df['name'] == colour, ['a', 'b']] = values
I think there has to be a more elegant solution, possibly using some map function?
Applying a lambda function over the rows where there are missing values, and then unpacking the values appropriately:
names_needed = df.loc[df['a'].isnull(), 'name']
subset_dict = {colour:names[colour] for colour in names_needed}
mask = df['name'].isin(list(subset_dict.keys()))
df.loc[mask, ['a', 'b']] = df[mask].apply(lambda x: subset_dict.get(x["name"]), axis=1).values[0]
Then gives you:
df
name a b
0 red 1.0 2.0
1 blue 1.0 2.0
2 green 2.0 3.0
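For the map-based version the question asks about, one possibility (my own sketch; it assumes every masked name has an entry in names) is to map the names to their tuples and convert to a list so the assignment is positional:
import numpy as np
import pandas as pd
df = pd.DataFrame({'name': ['red', 'blue', 'green'],
                   'a': [1, np.nan, 2],
                   'b': [2, np.nan, 3]})
names = {'blue': (1, 2), 'yellow': (5, 5)}
mask = df['a'].isnull()
# map() looks each name up in the dict; tolist() turns the Series of
# tuples into an (n_rows, 2) array-like for positional assignment
df.loc[mask, ['a', 'b']] = df.loc[mask, 'name'].map(names).tolist()
print(df)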

pandas dataframe multiplication with missing values

I have a dataframe with 2 columns (floating types), but one of them has missing data represented by the string "..".
When performing a multiplication operation, an exception is raised and the whole operation is aborted.
What I'm trying to achieve is to perform the multiplication for the float values and leave ".." for the missing ones.
2 * 6
.. * 4
should give [12, ..]
I found a naive solution consisting of replacing ".." with 0, performing the multiplication, then replacing the 0 back with "..".
It doesn't seem very optimized. Any other solution?
def update(v):
    if v == 0:
        return ".."
    return v

df['x'] = pd.to_numeric(df['x'], errors='coerce').fillna(0)
mg['x'] = df['x'] * df["Value"]
for col in mg.columns:
    mg[col] = mg[col].apply(update)
You can use np.where and Series.isna:
import numpy as np
mg['x'] = np.where(df['X'].isna(), df['X'], df['X']*df['Value'])
If you want to replace the nulls with '..' and multiply the others:
mg['x'] = np.where(df['X'].isna(), '..', df['X']*df['Value'])
So anywhere the value of column X is null it stays as it is; otherwise it is multiplied by the value in the corresponding row of the Value column.
In your solution you can also do a fillna(1):
df['x'] = pd.to_numeric(df['x'], errors='coerce').fillna(1)
mg['x'] = df['x'] * df["Value"]
This is how I tried it:
df = pd.DataFrame({'X': [2, np.nan],
                   'Value': [6, 4]})
df
X Value
0 2.0 6
1 NaN 4
np.where(df['X'].isna(), df['X'], df['X']*df['Value'])
array([12., nan])
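A Series-only variant of the same idea (my own sketch) multiplies first and patches the gaps afterwards; note the column becomes object dtype because of the '..' strings:
import numpy as np
import pandas as pd
df = pd.DataFrame({'X': [2, np.nan], 'Value': [6, 4]})
# NaN propagates through the multiplication, then fillna swaps in '..'
result = (df['X'] * df['Value']).fillna('..')
print(result.tolist())  # [12.0, '..']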

How to make PyTables handle string columns

I have a dataframe with a few float columns and a few string columns. All columns contain NaN. The string columns hold either strings or NaN (which has type float). When I try df.to_hdf to store the dataframe, I get the following warning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->['operation', 'snl_datasource_period', 'ticker', 'cusip', 'end_fisca_perio_date', 'fiscal_period', 'finan_repor_curre_code', 'earni_relea_date', 'finan_perio_begin_on']]
How can I work around it?
You can fill each column with an appropriate placeholder for its missing values, e.g.
import pandas as pd
import numpy as np
col1 = [1.0, np.nan, 3.0]
col2 = ['one', np.nan, 'three']
df = pd.DataFrame(dict(col1=col1, col2=col2))
df['col1'] = df['col1'].fillna(0.0)
df['col2'] = df['col2'].fillna('')
df.to_hdf('eg.hdf', 'eg')
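Reading the file back confirms the round trip (continuing the snippet above):
df_back = pd.read_hdf('eg.hdf', 'eg')
print(df_back)  # col1 has 0.0 and col2 has '' where the NaNs were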