Update one dataframe from another dataframe [duplicate] - pandas

I have a situation like this: one original pandas dataframe, for example:
columnA columnB
1 2
1 3
Then, after an update, the table looks like this:
columnA columnB columnC
2 3 2
2 4 3
1 3 3
However, I want to keep the original table, so what I want is shown below: only the new things (the third column and the third row) are added
columnA columnB columnC
1 2 2
1 3 3
1 3 3
So which kind of merge should I use to expand the original table with the new rows and columns? Thanks!

I think you need combine_first, which converts integer columns to floats, so astype is added to convert back:
df = df1.combine_first(df2).astype(int)
print (df)
columnA columnB columnC
0 1 2 2
1 1 3 3
2 1 3 3
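For reference, a self-contained version of this answer, with df1 and df2 reconstructed from the tables in the question:
import pandas as pd

df1 = pd.DataFrame({'columnA': [1, 1], 'columnB': [2, 3]})
df2 = pd.DataFrame({'columnA': [2, 2, 1],
                    'columnB': [3, 4, 3],
                    'columnC': [2, 3, 3]})

# df1's values win wherever the two frames overlap; df2 supplies the
# new columnC and the new third row, and astype(int) undoes the float
# upcast caused by the alignment
df = df1.combine_first(df2).astype(int)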

Merge and interleave rows of two dataframes [duplicate]

Suppose we have:
>>> df1
A B
0 1 a
1 2 a
2 3 a
3 4 a
>>> df2
A B
0 1 b
1 2 b
2 3 b
3 5 b
I would like to merge them on "A" and then list them by interleaving rows like:
A B
0 1 a
0 1 b
1 2 a
1 2 b
2 3 a
2 3 b
I tried merge, but it lists them column by column. For example, if I have 3 or more data frames, merge can merge them on some columns, but my problem would then be to interleave their rows.
If you need to match by A, filter the rows with Series.isin in boolean indexing, then pass the result to concat and sort with DataFrame.sort_index:
df = pd.concat([df1[df1.A.isin(df2.A)],
                df2[df2.A.isin(df1.A)]]).sort_index(kind='stable')
print (df)
A B
0 1 a
0 1 b
1 2 a
1 2 b
2 3 a
2 3 b
EDIT:
For general data, it is possible to sort by A and create a default index for correct interleaving:
df = (pd.concat([df1[df1.A.isin(df2.A)].sort_values('A', kind='stable').reset_index(drop=True),
                 df2[df2.A.isin(df1.A)].sort_values('A', kind='stable').reset_index(drop=True)])
        .sort_index(kind='stable'))
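For the "3 or more data frames" case mentioned in the question, the same reset-index-plus-stable-sort idea extends to a list of frames. A sketch (the interleave helper is hypothetical, and it assumes each frame has already been filtered to the shared keys and sorted consistently):
import pandas as pd

def interleave(frames):
    # give every frame the same 0..n-1 index, then a stable sort on
    # that index alternates the rows frame by frame
    frames = [f.reset_index(drop=True) for f in frames]
    return pd.concat(frames).sort_index(kind='stable').reset_index(drop=True)

df = interleave([df1[df1.A.isin(df2.A)], df2[df2.A.isin(df1.A)]])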

How to indicate count of values in categorical column in Pandas, Python?

I have the following Pandas DataFrame:
ID CAT
1 A
1 B
1 A
2 A
2 B
2 A
1 B
1 A
I'd like to have a table that indicates the number of occurrences per CAT value for each ID in separate columns, like this:
ID CAT_A_NUM CAT_B_NUM
1 3 2
2 2 1
I tried many ways, like this one with pivot_table, but unsuccessfully:
df.pivot_table(values='CAT', index='ID', columns='CAT', aggfunc='count')
You can use crosstab():
import pandas as pd

df = pd.DataFrame(data={'ID': [1,1,1,2,2,2,1,1], 'CAT': ['A','B','A','A','B','A','B','A']})
final = pd.crosstab(df['ID'], df['CAT'])
final.columns=['CAT_A_NUM','CAT_B_NUM']
final
ID CAT_A_NUM CAT_B_NUM
1 3 2
2 2 1
You can probably use groupby + unstack:
df.groupby(["ID","CAT"]).size().unstack()
which gives
CAT A B
ID
1 3 2
2 2 1
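If you want the exact CAT_A_NUM / CAT_B_NUM headers without hardcoding them, the groupby version can derive them programmatically. A sketch; fill_value=0 guards against IDs that never see a given category:
out = (df.groupby(['ID', 'CAT'])
         .size()
         .unstack(fill_value=0)       # missing ID/CAT pairs become 0
         .add_prefix('CAT_')
         .add_suffix('_NUM')
         .rename_axis(None, axis=1)   # drop the leftover 'CAT' axis label
         .reset_index())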

How to merge two datasets on incomplete columns?

I want to merge two datasets on the 'key1' and 'key2' columns so that, when a value is missing in (for example) the 'key2' column, the result takes all combinations of the second key that belong to the first key. Here is an example:
def merge_nan_as_any(mask, data, on, how):
...
mask = pd.DataFrame({'key1': [1,1,2,2],
'key2': [None,3,1,2],
'value2': [1,2,3,4]})
data = pd.DataFrame({'key1': [1,1,1,2,2,2],
'key2': [1,2,3,1,2,3],
'value1': [1,2,3,4,5,6]})
result = merge_nan_as_any(mask, data, on=['key1', 'key2'], how='left')
result = pd.DataFrame({'key1': [1,1,1,1,2,2],
'key2': [1,2,3,3,1,2],
'value2': [1,1,1,2,3,4],
'value1': [1,2,3,3,4,5]})
There is a missing value for the second key, so the result takes all rows from the second dataset that satisfy the condition: key1 must equal 1, and key2 can be any second-key value from the second dataset. How can I do that?
The first obvious solution that came to my mind is to iterate over the first dataset and filter out the combinations that satisfy the condition; the second is to split the first dataset into several pieces that have NaNs in the same columns and merge each of them on the columns that do have values.
But I don't like these solutions and suspect there is a more elegant way to do what I want.
I would appreciate any help!
Simple approach: merge on key1/key2 for the non-NaN values, merge on key1 only for the NaN values, and concat:
m = mask['key2'].notna()
result = pd.concat([data.merge(mask[~m].drop(columns='key2'), on='key1'),
                    data.merge(mask[m], on=['key1', 'key2'])],
                   ignore_index=True)
Output:
key1 key2 value1 value2
0 1 1 1 1
1 1 2 2 1
2 1 3 3 1
3 1 3 3 2
4 2 1 4 3
5 2 2 5 4
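The same approach can be wrapped into the helper the question asks for. A sketch, assuming exactly two join keys of which only the second may be missing (the merges are deliberately inner, so the how argument from the question's signature is dropped):
import pandas as pd

def merge_nan_as_any(mask, data, on):
    key1, key2 = on
    has_key2 = mask[key2].notna()
    # rows with a real key2 merge on both keys; rows with a missing
    # key2 match every key2 available for their key1
    return pd.concat(
        [data.merge(mask[~has_key2].drop(columns=key2), on=key1),
         data.merge(mask[has_key2], on=on)],
        ignore_index=True)

result = merge_nan_as_any(mask, data, on=['key1', 'key2'])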
I would begin by filling the null values with a list of all unique values from the other dataframe. Then explode it to get all possible combinations and transform back to numeric. Finally, merge them both, achieving the expected output:
mask['key2'] = (mask['key2']
                .fillna(' '.join([str(x) for x in data['key2'].unique()]))
                .astype(str)
                .str.split(' '))
mask = mask.explode('key2')
mask['key2'] = pd.to_numeric(mask['key2'])
pd.merge(mask,data,on=['key1','key2'],how='left')
Outputting:
key1 key2 value2 value1
0 1 1 1 1
1 1 2 1 2
2 1 3 1 3
3 1 3 2 3
4 2 1 3 4
5 2 2 4 5
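A variant of the same idea that skips the string round-trip: starting from the original mask, fill each missing key2 with a list of candidate keys directly and then explode (a sketch; pd.to_numeric restores a numeric dtype for the merge):
candidates = data['key2'].unique().tolist()
mask['key2'] = mask['key2'].apply(lambda v: candidates if pd.isna(v) else [v])
result = (mask.explode('key2')
              .assign(key2=lambda d: pd.to_numeric(d['key2']))
              .merge(data, on=['key1', 'key2'], how='left'))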
You can use pandasql, which lets you express it as SQL (pandasql runs on SQLite, where the NaN key becomes NULL):
from pandasql import sqldf

result = sqldf("""
    select data.*, mask.value2
    from mask left join data
    on mask.key1 = data.key1
       and (mask.key2 = data.key2 or mask.key2 is null)
""", globals())
Output:
key1 key2 value1 value2
0 1 1 1 1
1 1 2 2 1
2 1 3 3 1
3 1 3 3 2
4 2 1 4 3
5 2 2 5 4

Convert subset of rows to column pyspark dataframe

Suppose we have the following df
Id PlaceCod Val
1 1 0
1 2 3
2 2 4
2 1 5
3 1 6
How can I convert this DF to this one:
Id Store Warehouse
1 0 3
2 5 4
3 6 null
I've tried df.pivot(f.col("PlaceCod")) but got the error message 'DataFrame has no pivot attribute'.
As posted by @Emma in the comments:
df.groupby('Id').pivot('PlaceCod').agg(F.first('Val'))
Using the above solution, my problem was solved!
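For reference, a self-contained version of that answer; the PlaceCod-to-name mapping (1 = Store, 2 = Warehouse) is an assumption read off the desired output in the question:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 1, 0), (1, 2, 3), (2, 2, 4), (2, 1, 5), (3, 1, 6)],
    ['Id', 'PlaceCod', 'Val'])

out = (df.groupBy('Id')
         .pivot('PlaceCod')
         .agg(F.first('Val'))             # one Val per Id/PlaceCod pair
         .withColumnRenamed('1', 'Store')
         .withColumnRenamed('2', 'Warehouse'))
out.show()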

Group one column by another column in pandas?

I would like to get the median value of one column and use the associated value of another column. For example,
col1 col2 index
0 1 3 A
1 2 4 A
2 3 5 A
3 4 6 B
4 5 7 B
5 6 8 B
6 7 9 B
I group by the index to get the median value of col1 and use the associated value of col2 to get:
col1 col2 index
2 4 A
5 7 B
I can't use the actual median for index B because it would average the two middle values, and that averaged value has no corresponding value in col2.
What's the best way to do this? Will a groupby method work? Or somehow use sort? Do I need to define my own function?
It seems you need to take the middle position, not the median, from the original df:
df.groupby('index')[['col1','col2']].apply(lambda x : pd.Series(sorted(x.values.tolist())[len(x)//2]))
Out[297]:
0 1
index
A 2 4
B 6 8
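The same idea reads a little more explicitly with iloc. A sketch; using (len(g) - 1) // 2 instead would pick the lower middle row (5, 7 for group B), matching the expected output in the question:
out = (df.sort_values(['index', 'col1'])
         .groupby('index')[['col1', 'col2']]
         .apply(lambda g: g.iloc[len(g) // 2]))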