My pandas merge is not bringing over data from the right df. Why?

The code runs without error, but the right dataframe's data is not populating the result.
I've tried with and without the index and neither seems to work. I looked into dtypes and they appear to match on the fields I'm using as the index. I noticed the merge indicator says left_only, which makes me think the merge is not actually bringing anything over. It clearly isn't, because fields that are not null in the right df are showing null in the resulting dataframe.
# Drop rows with a missing key.
df = df[df['A'].notna()]

# Aggregate Monthly_Need per key.
group = df.groupby(['A', 'B', 'Period', 'D'])
df2 = group['Monthly_Need'].sum()
df2 = df2.reset_index()

# Align both frames on the same MultiIndex and left-join.
df = df.set_index(['A', 'B', 'Period', 'D'])
df2 = df2.set_index(['A', 'B', 'Period', 'D'])
df = df.merge(df2, how='left', left_index=True, right_index=True, indicator=True)
df = df.reset_index()
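A diagnostic you can run just before the merge (a sketch, not from the original post) is to compare the two MultiIndexes directly; left_only everywhere means no key ever matches:

# Per-level dtypes must agree on both sides.
for name, left_lvl, right_lvl in zip(df.index.names, df.index.levels, df2.index.levels):
    print(name, left_lvl.dtype, right_lvl.dtype)

# An empty intersection confirms the frames share no keys at all.
# With string keys, stray whitespace is a common silent culprit.
print(df.index.intersection(df2.index))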

Related

Merging pandas dataframe on unique values in a column

I have a df1 with many duplicated values in SUBJECT_ID (shown as an image in the original post). I have a df2 to merge from, but I want to merge it on unique SUBJECT_ID values only. For now I only know how to merge on the entire SUBJECT_ID column with this code:
df1 = pd.merge(df1,df2[['SUBJECT_ID', 'VALUE']], on='SUBJECT_ID', how='left' )
But this will merge on every SUBJECT_ID. I just need unique SUBJECT_ID. Please help me with this.
I think you will find your answer with the merge documentation.
It's not fully clear what you want, but here are some examples that may contain the answer you are looking for:
import pandas as pd

# The asker's data (temp.csv), with duplicated SUBJECT_ID values.
df1 = pd.read_csv('temp.csv')
display(df1)

# A small lookup table to merge in.
SUBJECT_ID = [31, 32, 33]
something_interesting = ['cat', 'dog', 'fish']
df2 = pd.DataFrame(list(zip(SUBJECT_ID, something_interesting)),
                   columns=['SUBJECT_ID', 'something_interesting'])
display(df2)

# Keep every row from both frames.
df_keep_all = df1.merge(df2, on='SUBJECT_ID', how='outer')
display(df_keep_all)

# Keep only rows whose SUBJECT_ID appears in both frames.
df_keep_df1 = df1.merge(df2, on='SUBJECT_ID', how='inner')
display(df_keep_df1)

# Drop duplicate rows from df1 before merging.
df_thinned = pd.merge(df1.drop_duplicates(), df2, on='SUBJECT_ID', how='inner')
display(df_thinned)
You can use the pandas drop_duplicates function for this; it removes duplicate values for a column or set of columns.
df2 = df.drop_duplicates(subset=['SUBJECT_ID'])
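A minimal sketch combining this with the question's merge (keep='first' is the default and keeps the first row for each ID):

df1_unique = df1.drop_duplicates(subset=['SUBJECT_ID'], keep='first')
result = df1_unique.merge(df2[['SUBJECT_ID', 'VALUE']], on='SUBJECT_ID', how='left')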

Conditional join in pandas

I am merging 2 datasets like this:
df1.merge(df2, how='left', on='ID')
I only want to select records where df2.NAME == 'ABC'.
What is the quickest way to do this? In SQL, it would be:
select * from df1 left join df2 on df1.id=df2.id and df2.name='ABC'
df1.merge(df2[df2.NAME=='ABC'], how='left', on='ID')
or
df = df1.merge(df2, how='left', on='ID')
df = df[df.NAME=='ABC']
depending on whether you want df1 rows without a matching 'ABC' record to remain in the resulting df (with NaNs) [snippet 1] or to be dropped entirely [snippet 2].
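A small sketch with made-up data (the ID/NAME values are illustrative, not from the question) showing the difference:

import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3]})
df2 = pd.DataFrame({'ID': [1, 2], 'NAME': ['ABC', 'XYZ']})

# Snippet 1: every df1 row survives; IDs 2 and 3 get NaN for NAME.
print(df1.merge(df2[df2.NAME == 'ABC'], how='left', on='ID'))

# Snippet 2: only ID 1 survives, since the filter runs after the join.
merged = df1.merge(df2, how='left', on='ID')
print(merged[merged.NAME == 'ABC'])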

Pandas dataframes and PyCharm IntelliSense

When I create new dataframes from old ones using concat or merge, PyCharm IntelliSense stops working for the resulting dataframe unless I explicitly pass it to a DataFrame constructor:
import pandas as pd
d1 = {1: [1, 2, 3], 2: [11, 22, 33]}
d2 = {1: [4], 2: [5]}
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
df3 = pd.concat([df1, df2], axis=0)
df3_ = pd.DataFrame(pd.concat([df1, df2], axis=0))
In the above example df3 and df3_ are the "same" dataframe, but IntelliSense only works on df3_. Am I doing something wrong? How can I avoid having to call the DataFrame constructor every time and still get IntelliSense out of PyCharm?
The answer is to use a type hint like this:
df3 = pd.concat([df1, df2], axis=0)  # type: pd.DataFrame
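On Python 3.6+, the same hint can be written as a variable annotation, which PyCharm also picks up:

df3: pd.DataFrame = pd.concat([df1, df2], axis=0)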

Pandas - understanding output of pivot table

Here is my example:
import pandas as pd

df = pd.DataFrame({
    'Student': ['A', 'B', 'B'],
    'Assessor': ['C', 'D', 'D'],
    'Score': [72, 19, 92]})

df = df.pivot_table(
    index='Student',
    columns='Assessor',
    values='Score',
    aggfunc=lambda x: x)
print(df)
The output looks like:
Assessor    C       D
Student
A          72     NaN
B         NaN  [1, 2]
I am not sure why I get '[1,2]' as output. I would expect something like:
Assessor    C    D
Student
A          72  NaN
B         NaN   19
B         NaN   92
Here is a related question:
if I replace my dataframe with
df = pd.DataFrame({
    'Student': ['A', 'B', 'B'],
    'Assessor': ['C', 'D', 'D'],
    'Score': ['foo', 'bar', 'foo']})
The output of the same pivot is going to be:
Process finished with exit code 255
Any thoughts?
pivot_table finds the unique values of the index/columns and aggregates when multiple rows of the original DataFrame fall into a particular cell.
Indexes/columns are generally meant to be unique, so if you want the data in that form, you have to do something a little ugly like this, although you probably don't want to:
pivoted = pd.DataFrame(columns=df['Assessor'], index=df['Student'])
for student, assessor, score in df.itertuples(index=False):  # values come in column order
    pivoted.loc[student, assessor] = score
For your second question, the reason that groupby generally fails is that there are no numeric columns to aggregate, although it seems to be a bug that it crashes completely like that. I added a note to the issue here.
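If you actually want every score kept in a cell, one sketch (using the question's original df, before it is overwritten by the pivot) is to ask for a list explicitly rather than relying on a pass-through lambda:

# Each (Student, Assessor) cell collects all of its scores in a list.
print(df.pivot_table(index='Student', columns='Assessor',
                     values='Score', aggfunc=list))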

Conditional on pandas DataFrames

Let df1, df2, and df3 be pandas DataFrames having the same structure but different numerical values. I want to perform (elementwise):
res = (df2 - df3) / (df1 - 1) where df1 > 1.0, otherwise df3
res should have the same structure as df1, df2, and df3 have.
numpy.where() returns a plain NumPy array, losing the index and columns.
Edit 1:
res should have the same indices as df1, df2, and df3 have.
For example, I can access df2 as df2["instanceA"]["parameter1"]["parameter2"]. I want to access the new calculated DataFrame/Series res as res["instanceA"]["parameter1"]["parameter2"].
Actually numpy.where should work fine there. The output here is 4x2, the same shape as df1, df2, and df3.
df1 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
df2 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
df3 = pd.DataFrame( np.random.randn(4,2), columns=list('xy') )
res = df3.copy()
res[:] = np.where( df1 > 1, (df2-df3)/(df1-1), df3 )
          x         y
0 -0.671787 -0.445276
1 -0.609351 -0.881987
2  0.324390  1.222632
3 -0.138606  0.955993
Note that this should work on both Series and DataFrames. The [:] is slicing syntax that preserves the index and columns; without it, res would come out as a plain array rather than a Series or DataFrame.
Alternatively, for a Series you could write it as #Kadir does in his answer:
res = pd.Series(np.where( df1>1, (df2-df3)/(df1-1), df3 ), index=df1.index)
Or similarly for a dataframe you could write:
res = pd.DataFrame(np.where(df1 > 1, (df2 - df3) / (df1 - 1), df3),
                   index=df1.index, columns=df1.columns)
Integrating the idea in this question into JohnE's answer, I have come up with this solution:
res = pd.Series(np.where( df1 > 1, (df2-df3)/(df1-1), df3 ), index=df1.index)
A better answer using DataFrames would be appreciated.
Say df is your initial dataframe and res is the new column. Use a combination of setting values and boolean indexing.
Set res to be a copy of df3:
df['res'] = df['df3']
Then adjust values where your condition holds. Note that chained indexing such as df[df['df1'] > 1.0]['res'] = ... assigns to a temporary copy and silently does nothing; use .loc instead:
mask = df['df1'] > 1.0
df.loc[mask, 'res'] = (df['df2'] - df['df3']) / (df['df1'] - 1)
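A quick end-to-end sketch (the column names df1/df2/df3 follow this answer's single-frame setup; the numbers are made up):

import pandas as pd

df = pd.DataFrame({'df1': [0.5, 2.0, 3.0],
                   'df2': [1.0, 4.0, 9.0],
                   'df3': [7.0, 8.0, 6.0]})

df['res'] = df['df3']
mask = df['df1'] > 1.0
df.loc[mask, 'res'] = (df['df2'] - df['df3']) / (df['df1'] - 1)
print(df)
# Row 0 keeps df3 (7.0); rows 1 and 2 get (4-8)/(2-1) = -4.0 and (9-6)/(3-1) = 1.5.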