Join values in different dataframes - pandas

I am trying to join two dataframes so that the result contains information from both of them. My dataframes look similar to:
>> df_1
user_id  hashtag1      hashtag2  hashtag3
0000     '#breakfast'  '#lunch'  '#dinner'
0001     '#day'        '#night'  NaN
0002     '#breakfast'  NaN       NaN
The second dataframe contains a unique identifier of the hashtags and their respective score:
>> df_2
hashtag1      score
'#breakfast'  10
'#lunch'      8
'#dinner'     9
'#day'        -5
'#night'      6
I want to add a set of columns to my first dataframe containing the score of each hashtag used, like this:
user_id  hashtag1      hashtag2  hashtag3   score1  score2  score3
0000     '#breakfast'  '#lunch'  '#dinner'  10      8       9
0001     '#day'        '#night'  NaN        -5      6       NaN
0002     '#breakfast'  NaN       NaN        10      NaN     NaN
I tried to use df.join() but I get an error: "ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat"
My code is as follows:
new_df = df_1.join(df_2, how='left', on='hashtag1')
I appreciate any help, thank you

You should try pandas.merge:
pandas.merge(df_1, df_2, on='hashtag1', how='left')
If you want to use .join, you need to set the index of df_2.
df_1.join(df_2.set_index('hashtag1'), on='hashtag1', how='left')
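Since you ultimately want a score column for every hashtag column, a merge on hashtag1 alone only covers the first one. A minimal sketch (assuming the column names shown in the question) is to turn df_2 into a lookup Series and map it onto each hashtag column:
import pandas as pd
import numpy as np

df_1 = pd.DataFrame({'user_id': ['0000', '0001', '0002'],
                     'hashtag1': ['#breakfast', '#day', '#breakfast'],
                     'hashtag2': ['#lunch', '#night', np.nan],
                     'hashtag3': ['#dinner', np.nan, np.nan]})
df_2 = pd.DataFrame({'hashtag1': ['#breakfast', '#lunch', '#dinner', '#day', '#night'],
                     'score': [10, 8, 9, -5, 6]})

# build a hashtag -> score lookup, then map it onto each hashtag column
scores = df_2.set_index('hashtag1')['score']
for i in (1, 2, 3):
    df_1['score{}'.format(i)] = df_1['hashtag{}'.format(i)].map(scores)

print(df_1)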
Some resources:
https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#database-style-dataframe-or-named-series-joining-merging
Trouble with df.join(): ValueError: You are trying to merge on object and int64 columns

Related

pandas joining strings in a group, skipping na values

I'm using a combination of str.join (let's call the joined column col_str) and groupby (let's call the grouped column col_a) in order to summarize data row-wise.
col_str may contain NaN values. Unsurprisingly, and as documented for str.join, joining over NaN results in an empty string:
df = df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')))
To mitigate this, I tried converting col_str to string (e.g. df['col_str'] = df['col_str'].astype(str)). But then the empty values literally hold the string 'nan', and are therefore treated as non-empty.
Not only does str.join now include those 'nan' strings, but other calculations in the script that rely on those NaNs are ruined as well.
To address that, I thought about converting just the non-empty values, as follows:
df['col_str'] = np.where(pd.isnull(df['col_str']), df['col_str'],
                         df['col_str'].astype(str))
But now str.join returns empty values again :-(
So I tried fillna('') and even dropna(), but neither gave the desired result.
You see the vicious cycle here, right?
astype(str) => 'nan' strings end up in the join and other calculations are ruined
Leaving as-is => str.join returns empty results.
Thanks for your assistance!
Edit:
Data is read from a csv. Sample:
Code to test -
df = pd.read_csv('/Users/goidelg/Downloads/sample_data.csv', low_memory=False)
print("---Original DF ---")
print(df)
print("---Joining NaNs as NaN---")
print(df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')).rename('strings_concat')))
print("---Convertin col to str---")
df['col_str'] = df['col_str'].astype(str)
print(df.join(df['col_a'].map(df.groupby('col_a')['col_str'].unique().str.join(', ')).rename('strings_concat')))
And results for the script:
First remove the missing values, either with DataFrame.dropna or with Series.notna used as a boolean mask:
df = pd.DataFrame({'col_a': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2],
                   'col_str': ['a', 'b', 'c', 'd', np.nan, np.nan, np.nan, np.nan, 'a', 's']})

df1 = (df.join(df['col_a'].map(df[df['col_str'].notna()]
                               .groupby('col_a')['col_str'].unique()
                               .str.join(', ')).rename('labels')))
print (df1)
   col_a col_str labels
0      1       a      a
1      2       b   b, s
2      3       c      c
3      4       d      d
4      1     NaN      a
5      2     NaN   b, s
6      3     NaN      c
7      4     NaN      d
8      1       a      a
9      2       s   b, s
df2 = (df.join(df['col_a'].map(df.dropna(subset=['col_str'])
                               .groupby('col_a')['col_str']
                               .unique().str.join(', ')).rename('labels')))
print (df2)
   col_a col_str labels
0      1       a      a
1      2       b   b, s
2      3       c      c
3      4       d      d
4      1     NaN      a
5      2     NaN   b, s
6      3     NaN      c
7      4     NaN      d
8      1       a      a
9      2       s   b, s

Filling Empty Rows with Dictionary Values via For Loop Pandas

I have a dictionary that looks like this:
my_dict = {2078:'T20',2153:'T20',2223:'T21',2219:'T21'}
My data frame:
Date        Code  Fleet  KM
2021-20-03  2078  T20    20
2021-21-03  2078  NaN    22
2021-21-03  2153  T20    23
2021-21-03  2153  NaN    23
2021-22-03  2223  NaN    28
2021-22-03  2223  NaN    30
2021-22-03  2219  T21    23
2021-23-03  2219  NaN    23
I want to use the values of the dictionary to fill the empty rows in the Fleet column in my df.
So I wrote the code:
for index, row in df.iterrows():
    if (pd.isnull(row['Fleet'])):
        row['Fleet'] = my_dict.row['Fleet']
However, when I check df.info() I can see that the code did not apply even though it runs.
Could someone tell me what I am doing wrong?
Use Series.map with the dictionary and replace the missing values with Series.fillna:
df['Fleet'] = df['Fleet'].fillna(df['Code'].map(my_dict))
Or Series.combine_first:
df['Fleet'] = df['Fleet'].combine_first(df['Code'].map(my_dict))
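For context on why the loop has no effect: iterrows() yields a copy of each row, so assigning to row['Fleet'] never writes back into df. A small, self-contained sketch of the map/fillna approach, using a cut-down version of the sample data (Date and KM omitted):
import pandas as pd
import numpy as np

my_dict = {2078: 'T20', 2153: 'T20', 2223: 'T21', 2219: 'T21'}
df = pd.DataFrame({'Code': [2078, 2078, 2153, 2153, 2223, 2223, 2219, 2219],
                   'Fleet': ['T20', np.nan, 'T20', np.nan, np.nan, np.nan, 'T21', np.nan]})

# map each Code to its Fleet label and use it only where Fleet is missing
df['Fleet'] = df['Fleet'].fillna(df['Code'].map(my_dict))
print(df)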

Using python pandas, how can I merge datasets together and create a column that has the unique modifier? [duplicate]

This question already has answers here:
Convert columns into rows with Pandas
(6 answers)
Closed 1 year ago.
Here is the current dataset that I am working with.
df contains Knn, Kss, and Ktt in three separate columns.
What I have been unable to figure out is how to merge the three into a single column and have a column that has a label.
Here is what I currently have, but it is not quite what I am after:
df_CohBeh = pd.concat([pd.DataFrame(Knn),
                       pd.DataFrame(Kss),
                       pd.DataFrame(Ktt)],
                      keys=['Knn', 'Kss', 'Ktt'],
                      ignore_index=True)
Which looks like this:
display(df_CohBeh)
           Knn  Kss        Ktt
0    24.579131  NaN        NaN
1    21.673524  NaN        NaN
2    25.785409  NaN        NaN
3    20.686215  NaN        NaN
4    21.504863  NaN        NaN
..         ...  ...        ...
106        NaN  NaN  27.615440
107        NaN  NaN  27.636029
108        NaN  NaN  26.215347
109        NaN  NaN  27.626850
110        NaN  NaN  25.473380
This is in essence filtering them, but I would rather have a single value column plus a label column holding the strings "Knn", "Kss", "Ktt", so I can plot the various distributions on the same seaborn graph.
I'm not sure how to create a column that can label the Knn value in the label column.
If df looks like this:
>>> df
          Knn        Kss        Ktt
0   96.054660  72.301166  15.355594
1   36.221933  72.646999  41.670382
2   96.503307  78.597493  71.959442
3   53.867432  17.315678  35.006592
4   43.014227  75.122762  83.666844
5   63.301808  72.514763  64.597765
6    0.201688   1.069586  98.816202
7   48.558265  87.660352   9.140665
8   64.353999  43.534200  15.202242
9   41.260903  24.128533  25.963022
10  63.571747  17.474933  47.093538
11  91.006290  90.834753  37.672980
12  61.960163  87.308155  64.698762
13  87.403750  86.402637  78.946980
14  22.238364  88.394919  81.935868
15  56.356764  80.575804  72.925204
16  30.431063   4.466978  32.257898
17  21.403800  46.752591  59.831690
18  57.330671  14.172341  64.764542
19  54.163311  66.037043   0.822948
Try df.melt to merge the three into a single column and have a column that has a label. Called with no arguments it yields the variable/value frame below (a short sketch follows after the output):
   variable      value
0       Knn  96.054660
1       Knn  36.221933
2       Knn  96.503307
3       Knn  53.867432
4       Knn  43.014227
5       Knn  63.301808
...
20      Kss  72.301166
21      Kss  72.646999
22      Kss  78.597493
23      Kss  17.315678
24      Kss  75.122762
25      Kss  72.514763
...
40      Ktt  15.355594
41      Ktt  41.670382
42      Ktt  71.959442
43      Ktt  35.006592
44      Ktt  83.666844
45      Ktt  64.597765
...
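A minimal, self-contained sketch of that call, with hypothetical var_name/value_name choices for plotting (the column names below are just illustrative):
import numpy as np
import pandas as pd

# hypothetical wide frame with the same three columns as above
df = pd.DataFrame(np.random.rand(111, 3) * 100, columns=['Knn', 'Kss', 'Ktt'])

# stack the three columns; var_name/value_name rename the default 'variable'/'value'
long_df = df.melt(var_name='label', value_name='K')
print(long_df.head())
The long frame can then be passed straight to seaborn, e.g. sns.histplot(data=long_df, x='K', hue='label'), to compare the distributions.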
You could also build a single pandas Series:
knn = pd.DataFrame({...})
kss = pd.DataFrame({...})
ktt = pd.DataFrame({...})
# note: "+" on the flattened arrays would add them element-wise;
# np.concatenate stacks them into one long array instead
l = np.concatenate([knn.values.flatten(), kss.values.flatten(), ktt.values.flatten()])
s = pd.Series(l, name="Knn")

Can you prevent automatic alphabetical order of df.append()?

I am trying to append data to a log where the order of columns isn't alphabetical but makes logical sense, e.g.
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 Diff_Goals_2
I am running through several calculations based on different variables and logging the results through appending a dictionary of the values after each run. Is there a way to prevent the df.append() function to order the columns alphabetically?
Seems you have to reorder the columns after the append operation:
In [25]:
# assign the appended dfs to merged
merged = df1.append(df2)
# create a list of the columns in the order you desire
cols = list(df1) + list(df2)
# assign directly
merged.columns = cols
# column order is now as desired
merged.columns
Out[25]:
Index(['Org_Goals_1', 'Calc_Goals_1', 'Diff_Goals_1', 'Org_Goals_2', 'Calc_Goals_2', 'Diff_Goals_2'], dtype='object')
example:
In [26]:
df1 = pd.DataFrame(columns=['Org_Goals_1','Calc_Goals_1','Diff_Goals_1'], data = randn(5,3))
df2 = pd.DataFrame(columns=['Org_Goals_2','Calc_Goals_2','Diff_Goals_2'], data=randn(5,3))
merged = df1.append(df2)
cols = list(df1) + list(df2)
merged.columns = cols
merged
Out[26]:
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 \
0 0.028935 NaN -0.687143 NaN 1.528579
1 0.943432 NaN -2.055357 NaN -0.720132
2 0.035234 NaN 0.020756 NaN 1.556319
3 1.447863 NaN 0.847496 NaN -1.458852
4 0.132337 NaN -0.255578 NaN -0.222660
0 NaN 0.131085 NaN 0.850022 NaN
1 NaN -1.942110 NaN 0.672965 NaN
2 NaN 0.944052 NaN 1.274509 NaN
3 NaN -1.796448 NaN 0.130338 NaN
4 NaN 0.961545 NaN -0.741825 NaN
Diff_Goals_2
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
0 0.727619
1 0.022209
2 -0.350757
3 1.116637
4 1.947526
The same alphabetical sorting of the columns happens with concat as well, so it looks like you have to reorder after appending.
EDIT
An alternative is to use join:
In [32]:
df1.join(df2)
Out[32]:
Org_Goals_1 Calc_Goals_1 Diff_Goals_1 Org_Goals_2 Calc_Goals_2 \
0 0.163745 1.608398 0.876040 0.651063 0.371263
1 -1.762973 -0.471050 -0.206376 1.323191 0.623045
2 0.166269 1.021835 -0.119982 1.005159 -0.831738
3 -0.400197 0.567782 -1.581803 0.417112 0.188023
4 -1.443269 -0.001080 0.804195 0.480510 -0.660761
Diff_Goals_2
0 -2.723280
1 2.463258
2 0.147251
3 2.328377
4 -0.248114
Actually, I found "advanced indexing" to work quite well (.ix has since been removed; .loc is its modern replacement):
df2 = df.loc[:, 'order of columns']
As I see it, the order is lost on append, but the original data still has the correct column order. To keep it, assuming a DataFrame 'alldata' and a DataFrame of data to be appended 'newdata', appending while preserving the column order of 'alldata' would be:
alldata.append(newdata)[list(alldata)]
(I encountered this problem with named date fields, where 'Month' would be sorted between 'Minute' and 'Second')
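Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; the same column-order trick carries over to pd.concat. A small sketch under that assumption:
import pandas as pd
import numpy as np

alldata = pd.DataFrame(np.random.randn(5, 3),
                       columns=['Org_Goals_1', 'Calc_Goals_1', 'Diff_Goals_1'])
newdata = pd.DataFrame(np.random.randn(5, 3),
                       columns=['Org_Goals_1', 'Calc_Goals_1', 'Diff_Goals_1'])

# concatenate, then reselect the columns in alldata's original order
combined = pd.concat([alldata, newdata], ignore_index=True)[list(alldata)]
print(combined.columns)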

Pandas expanding window with min_periods

I want to compute expanding window statistics, but with a minimum number of periods of 3, rather than 1. That is, I want it to start computing the statistic only once the window holds 3 values, and then include all values up to that point:
value  expanding_min
--------------------
    6            NaN
    5            NaN
    2            NaN
    3              2
    1              1
however, using
df['expanding_min']= df.groupby(groupby)['value'].transform(lambda x: pd.rolling_min(x, window=len(x), min_periods=3))
or
df['expanding_min']= df.groupby(groupby)['value'].transform(lambda x: pd.expanding_min(x, min_periods=3))
I get the following error:
ValueError: min_periods (3) must be <= window (1)
This works for me, changing from value to df.value:
pd.expanding_min(df.value, min_periods=3)
or
pd.rolling_min(df.value, window=len(df.value), min_periods=3)
both output:
0 NaN
1 NaN
2 2
3 2
4 1
dtype: float64
Perhaps your window is being set by some other 'value' whose length is 1? That is why pandas is giving the error message.
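For current pandas versions (the module-level pd.expanding_min and pd.rolling_min helpers have since been removed), the same statistic comes from the .expanding() method. A sketch of both the plain and the grouped variant, with a hypothetical 'group' column standing in for your groupby key:
import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'a', 'a', 'a'],
                   'value': [6, 5, 2, 3, 1]})

# expanding window that only emits a result once 3 observations are available
df['expanding_min'] = df['value'].expanding(min_periods=3).min()

# grouped equivalent of the transform in the question
df['expanding_min_grouped'] = (df.groupby('group')['value']
                                 .transform(lambda x: x.expanding(min_periods=3).min()))
print(df)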