I have a dataframe that I'd like to merge with another dataframe that shares the same column values, matching on specified row values.
Dataframe 1
d = {'id': ['111', '222', '333'], 'queries': ['High', 'Mid', 'Low'], 'time_stay': ['High', 'Mid', 'Low']}
dd = pd.DataFrame(data=d)
Dataframe 2
l = {'Features': ['queries', 'queries', 'queries', 'time_stay', 'time_stay', 'time_stay'], 'groups':['High', 'Mid', 'Low', 'High', 'Mid', 'Low'], 'parameters':[1.2, 1.1, 1.0, 1000, 2000, 3000]}
feature_data = pd.DataFrame(data=l)
feature_data
I transposed dataframe 2 and used its Features row as the column names.
feature_data = feature_data.T
feature_data.columns = feature_data.loc['Features', :]
Then I merged it
dd.merge(feature_data, on=list(feature_data.columns), how='left')
As expected, pandas doesn't let me merge it because the queries column is duplicated.
Expected output
What's a better way to do this? Thanks.
Filter the feature_data dataframe for the relevant Features value (using the original, untransposed feature_data), then merge it into the dd dataframe:
cols_name = 'queries'
queries = feature_data[feature_data['Features'] == cols_name]
dd = (dd.merge(queries[['groups', 'parameters']],
               left_on='queries',
               right_on='groups',
               how='left')
        .drop(columns='groups'))
print(dd)
id queries time_stay parameters
111 High High 1.2
222 Mid Mid 1.1
333 Low Low 1.0
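If you also need the time_stay parameters, one way to extend this is to repeat the filter-and-merge for every value of Features. A rough sketch, starting again from the original dd and the untransposed feature_data; the out variable and the queries_parameters / time_stay_parameters column names are just for illustration:
out = dd.copy()
for feat in feature_data['Features'].unique():
    # parameters for this feature, keyed by its High/Mid/Low groups
    grp = feature_data.loc[feature_data['Features'] == feat, ['groups', 'parameters']]
    grp = grp.rename(columns={'parameters': f'{feat}_parameters'})
    out = (out.merge(grp, left_on=feat, right_on='groups', how='left')
              .drop(columns='groups'))
print(out)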
I have four different datasets and have merged three of the dataframes correctly. The 3rd and 4th datasets share a column with the same name, and when I merge in the 4th dataset the values of that column don't line up: user_id gets repeated, which I don't want. Where the merge shows NaN in the del_keys column I want the actual value, instead of it being pushed to the end of the table. In short, I want to merge the same-named column on the basis of user_id.
In the above image you can see what kind of problem I am getting.
My expected output would look like this: there should be no repeated user_id.
Using merge on the user_id column:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'user_id': [1, 2, 3, 4],
'del_keys': [1.0, np.nan, np.nan, np.nan]
})
df2 = pd.DataFrame({
'user_id': [3, 4, 5],
'del_keys': [1.0, 2.0, 3.0]
})
final = df1.merge(df2, on="user_id", how="outer")
Use combine_first to get rid of the NaN values, then drop the duplicates:
final["del_keys"]=final['del_keys_y'].combine_first(final['del_keys_x'])
final.drop(columns=["del_keys_x","del_keys_y"],inplace=True)
final.drop_duplicates(subset="user_id")
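For the example frames above, this should give something like:
   user_id  del_keys
0        1       1.0
1        2       NaN
2        3       1.0
3        4       2.0
4        5       3.0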
I'm guessing that you use pd.concat to merge the dataframes.
Some dataframes:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({
'user_id': [1, 2, 3],
'del_keys': [1.0, np.nan, np.nan]
})
df2 = pd.DataFrame({
'user_id': [3, 4, 5],
'del_keys': [1.0, 2.0, 3.0]
})
Merge using pd.concat:
df = pd.concat([df1, df2])
>>> user_id del_keys
0 1 1.0
1 2 NaN
2 3 NaN
0 3 1.0
1 4 2.0
2 5 3.0
Remove duplicates using drop_duplicates:
(
df
.sort_values('del_keys')
.drop_duplicates('user_id', keep='first')
.sort_values('user_id')
)
>>> user_id del_keys
0 1 1.0
1 2 NaN
0 3 1.0
1 4 2.0
2 5 3.0
First, we sort the values by del_keys so that all NaNs end up at the bottom of the dataframe. Then we drop the duplicates, keeping the first occurrence of each user_id. Lastly, we sort again to order the result by user_id.
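An alternative sketch that avoids the sort (using the same df as above): GroupBy.first returns the first non-NaN value per group, so grouping by user_id gives the same values:
df.groupby('user_id', as_index=False)['del_keys'].first()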
I have a list of 1s and 0s where each element corresponds to a column (by position) of a data frame, for example:
df.columns = ['a','b','c']
binary_list = [0,1,0]
Based on that, I want to select only column b from the data frame, since the only 1 in my binary list corresponds to b.
Is there a way to do that in pandas?
P.S. This is my first time posting on Stack Overflow; apologies if I'm not following a specific style.
If the binary list is aligned with the columns, you can use boolean indexing:
df = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'c'])
binary_list = [0,1,0]
df.loc[:, map(bool, binary_list)]
Output:
b
0 2
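An alternative sketch is to pick the column labels with a NumPy boolean mask (same df and binary_list as above):
import numpy as np
df[df.columns[np.array(binary_list, dtype=bool)]]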
I am trying to find a way to get the Pearson correlation and p-value between two columns in a dataframe when a third column meets certain conditions.
df =
BucketID   Intensity   BW25113
825.326    3459870     0.5
825.326    8923429     0.95
734.321    12124       0.4
734.321    2387499     0.3
I originally tried something with the pd.Series.corr() function which is very fast and does what I want it to do to get my final outputs:
bio1 = df.columns[1:].tolist()
pcorrs2 = [s + '_Corr' for s in bio1]
coldict2 = dict(zip(bio1, pcorrs2))
coldict2
df2 = df.groupby('BucketID')[bio1].corr(method = 'pearson').unstack()['Intensity'].reset_index().rename(columns = coldict2)
df3 = pd.melt(df2, id_vars = 'BucketID', var_name = 'Org', value_name = 'correlation')
df3['Org'] = df3.Org.str.replace('_Corr', '', regex=False)
df3
This then gives me the (mostly) desired table:
BucketID   Org         correlation
734.321    Intensity    1.0
825.326    Intensity    1.0
734.321    BW25113     -1.0
825.326    BW25113      1.0
This gives me the Pearson correlations but not the p-values, which would be helpful for determining the relevance of the correlations.
Is there a way to get the p-value associated with pd.Series.corr() in this way or would some version with scipy.stats.pearsonr that iterates over the dataframe for each BucketID be more efficient? I tried something of this flavor, but it has been incredibly slow (tens of minutes instead of a few seconds).
Thanks in advance for the assistance and/or comments.
You can use scipy.stats.pearsonr on a dataframe as follows:
df = pd.DataFrame({'col1': [1,2,3,4,5,6,7,8,9,10],
'col2': [1,2,6,4,5,7,7,8,7,12]})
import scipy.stats
scipy.stats.pearsonr(df['col1'], df['col2'])
This returns a tuple: the first value is the correlation coefficient and the second is the p-value.
(0.9049484650760702, 0.00031797789083818853)
Update
To do this for groups programmatically, you can groupby() and then loop through the groups:
df = pd.DataFrame({'group': ['A', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B'],
'col1': [1,2,3,4,5,6,7,8,9,10],
'col2': [1,2,6,4,5,7,7,8,7,12]})
for group_name, group_data in df.groupby('group'):
print(group_name, scipy.stats.pearsonr(group_data['col1'], group_data['col2']))
Results in...
A (0.9817469600192116, 0.0029521879612042588)
B (0.8648495371134326, 0.05841898744667266)
These can also be collected into a new results dataframe:
rows = []
for group_name, group_data in df.groupby('group'):
    correlation, p_value = scipy.stats.pearsonr(group_data['col1'], group_data['col2'])
    rows.append({'group': group_name, 'corr': correlation, 'p_value': p_value})
results = pd.DataFrame(rows)
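If you prefer to avoid the explicit loop, a rough sketch using groupby().apply() on the same df could look like this (corr_with_p is just an illustrative helper name):
def corr_with_p(group):
    # pearsonr returns (correlation, p-value); unpack so it works across scipy versions
    r, p = scipy.stats.pearsonr(group['col1'], group['col2'])
    return pd.Series({'corr': r, 'p_value': p})
results = df.groupby('group').apply(corr_with_p).reset_index()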
I want to insert a row of values into a DataFrame based on the values in a tuple. Below is an example where I want to insert the values from names['blue'] into columns 'a' and 'b' of the DataFrame.
import numpy as np
import pandas as pd
df = pd.DataFrame({'name': ['red', 'blue', 'green'], 'a': [1,np.nan,2], 'b':[2,np.nan,3]})
names = {'blue': (1,2),
'yellow': (5, 5)}
I have an attempt below (note that 'a' and 'b' will always be missing together):
names_needed = df.loc[df['a'].isnull(), 'name']
subset_dict = {colour:names[colour] for colour in names_needed}
for colour, values in subset_dict.items():
df.loc[df['name']==colour, ['a','b']]=values
I think there has to be a more elegant solution, possibly using some map function?
Applying a lambda function over the rows where there are missing values, and then unpacking the values appropriately:
names_needed = df.loc[df['a'].isnull(), 'name']
subset_dict = {colour:names[colour] for colour in names_needed}
mask = df['name'].isin(list(subset_dict.keys()))
df.loc[mask, ['a', 'b']] = df[mask].apply(lambda x: subset_dict.get(x["name"]), axis=1).values[0]
Then gives you:
df
name a b
0 red 1.0 2.0
1 blue 1.0 2.0
2 green 2.0 3.0
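An alternative sketch (assuming, as in the question, that only the missing values should be filled): build a small frame from the names dict and let fillna align on the name index (fills is just an illustrative name):
fills = pd.DataFrame.from_dict(names, orient='index', columns=['a', 'b'])
# only NaNs in df get filled; rows whose name is not in names stay untouched
df = df.set_index('name').fillna(fills).reset_index()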
In Python, how best to combine all rows of each column of a multi-column DataFrame into one cell per column, separated by a '|' separator and including null values?
import pandas as pd
html = 'https://en.wikipedia.org/wiki/Visa_requirements_for_Norwegian_citizens'
df = pd.read_html(html, header=0)
df = df[1]
df.to_csv('norway.csv')
From This:
To This:
df = pd.DataFrame([
{'A' : 'x', 'B' : 2, 'C' : None},
{'A' : None, 'B' : 2, 'C' : 1},
{'A' : 'y', 'B' : None, 'C' : None},
])
pd.DataFrame(df.fillna('').apply(lambda x: '|'.join(x.astype(str)), axis=0)).transpose()
I believe you need to replace missing values with fillna if necessary, convert values to strings with astype, and apply with join. This returns a Series, so for a one-row DataFrame add to_frame and transpose:
df = df.fillna(' ').astype(str).apply('|'.join).to_frame().T
print (df)
Country Allowed_stay Visa_requirement
0 Albania|Afganistan|Andorra 30|30|60 visa free| | visa free
Or use a list comprehension with the DataFrame constructor:
L = ['|'.join(df[x].fillna(' ').astype(str)) for x in df]
df1 = pd.DataFrame([L], columns=df.columns)
print (df1)
Country Allowed_stay Visa_requirement
0 Albania|Afganistan|Andorra 30|30|60 visa free| | visa free