I want to insert a row of values into a DataFrame based on the values in a tuple. Below is an example where I want to insert the values from names['blue'] into columns 'a' and 'b' of the DataFrame.
import numpy as np
import pandas as pd
df = pd.DataFrame({'name': ['red', 'blue', 'green'], 'a': [1,np.nan,2], 'b':[2,np.nan,3]})
names = {'blue': (1, 2),
         'yellow': (5, 5)}
I have a working attempt below (note that 'a' and 'b' will always be missing together):
names_needed = df.loc[df['a'].isnull(), 'name']
subset_dict = {colour:names[colour] for colour in names_needed}
for colour, values in subset_dict.items():
    df.loc[df['name'] == colour, ['a', 'b']] = values
I think there has to be a more elegant solution, possibly using some map function?
You can apply a lambda function over the rows with missing values and then unpack the resulting tuples; tolist() turns the Series of tuples into a 2-D list that pandas can assign across the two columns:
names_needed = df.loc[df['a'].isnull(), 'name']
subset_dict = {colour:names[colour] for colour in names_needed}
mask = df['name'].isin(list(subset_dict.keys()))
df.loc[mask, ['a', 'b']] = df[mask].apply(lambda x: subset_dict.get(x['name']), axis=1).tolist()
Which then gives you:
df
name a b
0 red 1.0 2.0
1 blue 1.0 2.0
2 green 2.0 3.0
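As for the map-based idea: a minimal sketch, assuming (as noted in the question) that 'a' and 'b' are always missing together, is to map the 'name' column through the dict and assign the resulting tuples in one statement:

# select rows that are missing and have a replacement available
mask = df['a'].isnull() & df['name'].isin(names.keys())
# map() yields a Series of (a, b) tuples; tolist() turns it into a
# 2-D list that pandas can assign across the two columns
df.loc[mask, ['a', 'b']] = df.loc[mask, 'name'].map(names).tolist()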
I'm trying to drop rows with missing values in any of several dataframes.
They all have the same number of rows, so I tried this:
model_data_with_NA = pd.concat([other_df,
                                standardized_numerical_data,
                                encode_categorical_data], axis=1)
ok_rows = ~(model_data_with_NA.isna().all(axis=1))
model_data = model_data_with_NA.dropna()
assert(sum(ok_rows) == len(model_data))
The assertion fails!
As a Python newbie, I wonder why this doesn't work. Also, would it be better to use hierarchical indexing? Then I could extract the original columns from model_data.
In Short
I believe the all in ~(model_data_with_NA.isna().all(axis=1)) should be replaced with any.
The reason is that all checks whether every value in a row is missing, while any checks whether at least one value in the row is missing.
Full Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'a':[1, 2, 3]})
df2 = pd.DataFrame({'b':[1, np.nan]})
df3 = pd.DataFrame({'c': [1, 2, np.nan]})
model_data_with_na = pd.concat([df1, df2, df3], axis=1)
ok_rows = ~(model_data_with_na.isna().any(axis=1))
model_data = model_data_with_na.dropna()
assert(sum(ok_rows) == len(model_data))
model_data_with_na

   a    b    c
0  1  1.0  1.0
1  2  NaN  2.0
2  3  NaN  NaN
model_data

   a    b    c
0  1  1.0  1.0
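On the side question about hierarchical indexing: a small sketch of that idea, reusing the toy frames above, is to pass keys= to concat, which builds a MultiIndex on the columns so each original block can be extracted again:

# keys= labels each input frame, producing a column MultiIndex
model_data_with_na = pd.concat([df1, df2, df3], axis=1,
                               keys=['df1', 'df2', 'df3'])
model_data = model_data_with_na.dropna()
# selecting a top-level key recovers the columns of one original frame
original_c = model_data['df3']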
I'm using Python 3.9.7 and pandas 1.3.4.
I'm trying to create a normalized set of columns in pandas, but my columns keep coming back as NaNs. I broke the steps down and assigned intermediate variables, which have non-NaN values, but when I do the final reassignment back to the dataframe, everything becomes NaN. I wrote a simpler example case:
import numpy as np
import pandas as pd
time = [1.0, 1.1, 2.0]
col1 = [1.0, 3.0, 6.0]
col2 = [3.0, 5.0, 9.0]
col3 = [1.5, 2.5, 3.5]
junk = ['wow', 'fun', 'times']
df2 = pd.DataFrame({'Time [days]': time, 'col1': col1, 'col2': col2,'col3': col3, 'junk':junk})
df2
num1 = len(df2.columns)
num2 = len(df2.columns[1:-1])
for col in df2.columns[1:-1]:
    df3 = pd.DataFrame({str(col) + '_normalized_values': df2[str(col)]})
    df2 = df2.join(df3)
    del df3
df2.head()
df2.index = df2['Time [days]'].values
t=df2.index[1]
cols = df2.columns
a = df2.loc[t,cols[1:(num1-1)]]
b = (df2.groupby('Time [days]').sum().loc[t,cols[1:(num1-1)]]+1.0e-20)
c = a/b #c is coming back as the expected values
df2.loc[t,cols[num1:(num1+num2)]] = c
df2.loc[t,cols[num1:(num1+num2)]] #This step always prints all NaNs
I've checked the shapes of c and the LHS assignment, and they're the same. I also checked the dtypes, and they're also the same. At this point, I'm at a loss for what could be causing the issue.
There is an index mismatch between c and df2: .loc assignment with a Series aligns on labels rather than position, and c is labelled with the original column names while the target slice uses the _normalized_values column names, so nothing matches and pandas fills in NaN. Changing the RHS of your final assignment to c.values, a plain array that is assigned by position, solves the problem:
df2.loc[t,cols[num1:(num1+num2)]] = c.values
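The alignment behaviour is easy to reproduce on a toy frame (hypothetical data, just for illustration):

import pandas as pd

s = pd.Series([10, 20], index=['x', 'y'])
demo = pd.DataFrame({'col1': [0.0, 0.0], 'col2': [0.0, 0.0]})
demo.loc[0, ['col1', 'col2']] = s         # labels 'x'/'y' match no column -> both become NaN
demo.loc[1, ['col1', 'col2']] = s.values  # bare array, assigned by position -> 10.0, 20.0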
Demo dataframe:
import pandas as pd
df = pd.DataFrame({'a': [1,None,3], 'b': [5,10,15]})
I want to replace every NaN value in a with the corresponding value of b**2, and then set b to NaN in those rows (i.e. shift the NaN values over and apply an operation along the way).
Desired result:
  a    b
  1    5
100  NaN
  3   15
How is it possible with pandas?
You can get the rows you want to change using df['a'].isnull(). Then you can use that to update the columns with loc.
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [1, None, 3], 'b': [5, 10, 15]})
change = df['a'].isnull()
df.loc[change, ['a', 'b']] = [df.loc[change, 'b']**2, np.nan]  # np.NaN was removed in NumPy 2.0; use np.nan
print(df)
Note that the change variable is only there to avoid repeating df['a'].isnull() on both sides of the assignment. You could inline that expression to do this in one line, but I think that looks cluttered.
Result:
a b
0 1.0 5.0
1 100.0 NaN
2 3.0 15.0
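For completeness, the inlined single-statement version mentioned above would be:

# equivalent, just denser: the mask expression is repeated on both sides
df.loc[df['a'].isnull(), ['a', 'b']] = [df.loc[df['a'].isnull(), 'b']**2, np.nan]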
I am trying to find a way to get the Pearson correlation and p-value between two columns in a dataframe when a third column meets certain conditions.
df =

BucketID  Intensity  BW25113
825.326   3459870    0.5
825.326   8923429    0.95
734.321   12124      0.4
734.321   2387499    0.3
I originally tried something with the pd.Series.corr() function which is very fast and does what I want it to do to get my final outputs:
bio1 = df.columns[1:].tolist()
pcorrs2 = [s + '_Corr' for s in bio1]
coldict2 = dict(zip(bio1, pcorrs2))  # was zip(bios, ...): 'bios' is undefined
coldict2
df2 = df.groupby('BucketID')[bio1].corr(method='pearson').unstack()['Intensity'].reset_index().rename(columns=coldict2)
df3 = pd.melt(df2, id_vars='BucketID', var_name='Org', value_name='correlation')
df3['Org'] = df3.Org.apply(lambda x: x.replace('_Corr', ''))  # rstrip('_corr') strips a character set, not the suffix
df3
This then gives me the (mostly) desired table:
BucketID  Org        correlation
734.321   Intensity          1.0
825.326   Intensity          1.0
734.321   BW25113           -1.0
825.326   BW25113            1.0
This works for producing the Pearson correlations, but not the p-values, which would be helpful for judging the relevance of the correlations.
Is there a way to get the p-value associated with pd.Series.corr() in this way or would some version with scipy.stats.pearsonr that iterates over the dataframe for each BucketID be more efficient? I tried something of this flavor, but it has been incredibly slow (tens of minutes instead of a few seconds).
Thanks in advance for the assistance and/or comments.
You can use scipy.stats.pearsonr on a dataframe as follows:
import scipy.stats

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'col2': [1, 2, 6, 4, 5, 7, 7, 8, 7, 12]})
scipy.stats.pearsonr(df['col1'], df['col2'])
This returns a tuple: the first value is the correlation and the second is the p-value. (In recent SciPy versions the return value is a PearsonRResult object; it still unpacks like a tuple and also exposes .statistic and .pvalue attributes.)
(0.9049484650760702, 0.00031797789083818853)
Update
For doing this per group programmatically, you can use groupby() and then loop through the groups...
df = pd.DataFrame({'group': ['A', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B'],
                   'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'col2': [1, 2, 6, 4, 5, 7, 7, 8, 7, 12]})
for group_name, group_data in df.groupby('group'):
    print(group_name, scipy.stats.pearsonr(group_data['col1'], group_data['col2']))
Results in...
A (0.9817469600192116, 0.0029521879612042588)
B (0.8648495371134326, 0.05841898744667266)
These can also be stored in a new results DataFrame. DataFrame.append was deprecated and later removed (in pandas 2.0), so collect the rows in a list and build the frame once at the end:
rows = []
for group_name, group_data in df.groupby('group'):
    correlation, p_value = scipy.stats.pearsonr(group_data['col1'], group_data['col2'])
    rows.append({'group': group_name, 'corr': correlation, 'p_value': p_value})
results = pd.DataFrame(rows)
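A groupby/apply alternative that avoids the explicit loop might look like this (a sketch; the same scipy call is made once per group):

def corr_with_p(g):
    # reduce one group to a two-element row of correlation and p-value
    r, p = scipy.stats.pearsonr(g['col1'], g['col2'])
    return pd.Series({'corr': r, 'p_value': p})

results = df.groupby('group').apply(corr_with_p).reset_index()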
In Python, how best to combine all rows of each column of a multi-column DataFrame into one column, separated by a ' | ' separator, while still including null values?
import pandas as pd
html = 'https://en.wikipedia.org/wiki/Visa_requirements_for_Norwegian_citizens'
df = pd.read_html(html, header=0)
df= df[1]
df.to_csv('norway.csv')
(Screenshots of the input table and the desired one-row output are omitted.) A small example:
df = pd.DataFrame([
    {'A': 'x', 'B': 2, 'C': None},
    {'A': None, 'B': 2, 'C': 1},
    {'A': 'y', 'B': None, 'C': None},
])
pd.DataFrame(df.fillna('').apply(lambda x: '|'.join(x.astype(str)), axis=0)).transpose()
I believe you need to replace missing values with fillna if necessary, convert the values to strings with astype, and use apply with join. This returns a Series, so to get a one-row DataFrame add to_frame with transposing:
df = df.fillna(' ').astype(str).apply('|'.join).to_frame().T
print (df)
Country Allowed_stay Visa_requirement
0 Albania|Afganistan|Andorra 30|30|60 visa free| | visa free
Or use a list comprehension with the DataFrame constructor:
L = ['|'.join(df[x].fillna(' ').astype(str)) for x in df]
df1 = pd.DataFrame([L], columns=df.columns)
print (df1)
Country Allowed_stay Visa_requirement
0 Albania|Afganistan|Andorra 30|30|60 visa free| | visa free
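For what it's worth, the same column-wise join can also be written with DataFrame.agg (a sketch, equivalent to the first approach):

# agg applies '|'.join to each column of strings, returning a Series
df1 = df.fillna(' ').astype(str).agg('|'.join).to_frame().T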