How to update at a specific row, after finding the same value in two tables - pandas

import pandas as pd
data1 = [['xx'], ['4']]
data2 = [['4', 'x0'], ['aa', 'bb'], ['cc', 'dd']]
df1 = pd.DataFrame(data=data1, columns=["isin"])
print(df1)
df2 = pd.DataFrame(data=data2, columns=["isin", "data"])
print(df2)
df1.loc[df1['isin'] == df2['isin'], 'data'] = df2['data']
print(df1)
# Exception has occurred: ValueError
# Can only compare identically-labeled Series objects
# df1.loc[df1['isin'] == df2['isin'], 'data'] = df2['data']
# Current state of the frames:
# df1:
#   isin
# 0   xx
# 1    4
# df2:
#   isin data
# 0    4   x0
# 1   aa   bb
# 2   cc   dd
Problem:
The algorithm should find the row with '4' in column 'isin' in both frames,
pull 'data' from df2 at that row (in this case 'x0'),
and add it to df1 at the row of '4', in a new column 'data'.
# desired df3:
#   isin data
# 0   xx  NaN
# 1    4   x0

I agree with tranbi that the question needs more clarity. But if you want to update just one cell in the dataframe, assuming we have this:
       years            isin  toast
0         55              55     55
1         55              55     55
2  this information        4     55
then
df2.loc[df2['years'] == 'this information', ['years']] = 'that information'
will update just that cell. You could use df.loc instead of 'this information' to look the value up in df1. I couldn't demonstrate it because that value doesn't exist in the example you gave, so I'm not quite sure that is what you are referring to.
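For instance, a minimal sketch of that lookup, assuming the value taken from df1 appears in df2's 'isin' column with a matching dtype:
value = df1.loc[1, 'isin']                     # '4' in the question's data
# hypothetical: use the looked-up value as the row filter in df2
df2.loc[df2['isin'] == value, ['years']] = 'that information'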

import pandas as pd
data1 = [['xx'], ['4']]
data2 = [['4', 'x0'], ['aa', 'bb'], ['cc', 'dd']]
df1 = pd.DataFrame(data=data1, columns=["isin"])
print(df1)
df2 = pd.DataFrame(data=data2, columns=["isin", "data"])
print(df2)
# merge was the solution I was looking for
df3 = df1.merge(df2, how='left')
print (df3)
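If you would rather write the new column onto df1 directly instead of creating a third frame, a map-based sketch (assuming the 'isin' values in df2 are unique) gives the same result:
# build a lookup Series keyed by 'isin' and map it onto df1;
# unmatched rows get NaN, matching the merge output
df1['data'] = df1['isin'].map(df2.set_index('isin')['data'])
print(df1)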

Related

fill only unique value from another dataframe based on condition

How to fill the '0' values in df1 with unique values from another dataframe (df2)? The expected output has no duplicates in df1.
Any reference links for this? Thanks for helping out.
import pandas as pd
data1 = {'test': ['b', 0, 'e', 0, 0, 'f']}
df1 = pd.DataFrame(data=data1)
data2 = {'test': ['a', 'b', 'c', 'd', 'e', 'f']}
df2 = pd.DataFrame(data=data2)
df1
  test
0    b
1    0
2    e
3    0
4    0
5    f
df2
  test
0    a
1    b
2    c
3    d
4    e
5    f
expected output:
  test
0    b   -- df1
1    a   -- fill with a from df2
2    e   -- df1
3    c   -- fill with c from df2
4    d   -- fill with d from df2
5    f   -- df1
Assuming you have enough unique values in df2 to fill the 0s in df1, extract those unique values, and assign them with boolean indexing:
# which rows are 0?
m = df1['test'].eq(0)
# extract values of df2 that are not in df1
vals = df2.loc[~df2['test'].isin(df1['test']), 'test'].tolist()
# ['a', 'c', 'd']
# assign the values, limited to the needed number
df1.loc[m, 'test'] = vals[:m.sum()]
print(df1)
Output:
  test
0    b
1    a
2    e
3    c
4    d
5    f
If there are not always enough values in df2 and you want to fill only the first possible 0s:
m = df1['test'].eq(0)
vals = df2.loc[~df2['test'].isin(df1['test']), 'test'].unique()
# ['a', 'c', 'd']
m2 = m.cumsum().le(len(vals))
df1.loc[m & m2, 'test'] = vals[:m.sum()]
print(df1)
Solution assumptions:
the number of '0's equals the number of unique values in df2 not present in df1
there is a column like 'test' to be manipulated
# get the unique values in df1
uni = df1['test'].unique()
# get the unique values in df2 which are not in df1
unique_df2 = df2[~df2['test'].isin(uni)]
# get the index of all the '0's in df1 as a list
index_df1_0 = df1.index[df1['test'] == 0].tolist()
# replace the '0's in df1 with unique values from df2 (assumption #1 imp!)
for val_ in range(len(index_df1_0)):
    df1.iloc[index_df1_0[val_]] = unique_df2.iloc[val_]
print(df1)
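A further sketch, assuming the same fresh data: fill the 0s lazily from an iterator of unused df2 values, so any leftover 0s simply stay 0 if df2 runs out:
# each 0 consumes the next unused df2 value; next(it, v) falls back to
# the original value (0) once the iterator is exhausted
it = iter(df2.loc[~df2['test'].isin(df1['test']), 'test'])
df1['test'] = [next(it, v) if v == 0 else v for v in df1['test']]
print(df1)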

How to append a column name to pandas.core.indexes.base.Index

Thanks for looking at my problem. I want to add the target column to new_thing; what should I do? Thanks.
import pandas as pd
# reading data from csv
df = pd.read_csv('data.csv')
df.head()
# The csv format
#     ID         cot1  num1  target   num2  cat3
# 0  123  Santa Elena   100       1  52.00     a
# 1  124        India    77       1  25.00     d
# 2  125       Ruanda    60       0  32.10     b
# 3  126       Lesoto    11       0 -11.00     h
# 4  127     Singapur    79       0   0.07     j
df.dtypes
# different column dtypes (int, string)
# out
# ID        int64
# cot1     object
# num1      int64
# target    int64
# num2    float64
# cat3     object
# dtype: object
new_thing = df.select_dtypes(include = ['object']).columns
new_thing
# out
# Index(['cot1', 'cat3'], dtype='object')
type(new_thing)
# out
#pandas.core.indexes.base.Index
# I want to add target column to the new_thing
# I have tried the below but no success
new_thing.append(df.target)
#
# TypeError: all inputs must be Index
As you see, I could not add target to new_thing.
This error occurs because new_thing is a pandas Index (the column labels), and Index.append accepts only Index objects, not a Series like df.target. It also returns a new Index rather than modifying in place, so assign the result. If you want to add 'target' to new_thing, you can do this:
new_thing = df.select_dtypes(include=['object']).columns
new_thing = new_thing.append(df.columns[[df.columns.get_loc('target')]])
To build a frame with the object-typed columns plus target (rows included), don't take .columns; keep the frame and concatenate along the columns:
new_thing = df.select_dtypes(include=['object'])
print(pd.concat([new_thing, df[['target']]], axis=1))
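If the end goal is just that combined frame, a shorter sketch selects by the extended label list:
# append 'target' to the object-dtype column labels, then select those columns
cols = df.select_dtypes(include=['object']).columns.append(pd.Index(['target']))
print(df[cols])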

Pyspark/SQL join a column having list values to another dataframe column

I want to join two tables the way it is asked here, Pandas merge a list in a dataframe column with another dataframe
# Input DataFrame
ID  LIST_VALUES
1   [a, b, c]
2   [a, n, t]
3   [x]
4   [h, h]
# Value mapping
VALUE  MAPPING
a      alpha
b      bravo
c      charlie
n      november
h      hotel
t      tango
x      xray
I want the following output. How do I do this in PySpark or in SQL?
# EXPECTED OUTPUT DATAFRAME
ID  LIST_VALUES  new_col
1   [a, b, c]    alpha,bravo,charlie
2   [a, n, t]    alpha,november,tango
3   [x]          xray
4   [h, h]       hotel
I have created the data below based on the provided link;
the program with the PySpark DataFrame API would look like the following:
# imports
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# create a session if one is not already available
spark = SparkSession.builder.getOrCreate()
# replicating the data
cols = ['ID', 'LIST_VALUES']
row_1 = [1, ['a', 'b', 'c']]
row_2 = [2, ['a', 'n', 't']]
row_3 = [3, ['x']]
row_4 = [4, ['h', 'h']]
rows = [row_1, row_2, row_3, row_4]
df_1 = spark.createDataFrame(rows, cols)
cols = ['VALUE','MAPPING']
row_1 = ['a','alpha']
row_2 = ['b', 'bravo']
row_3 = ['c', 'charlie']
row_4 = ['n', 'november']
row_5 = ['h', 'hotel']
row_6 = ['t', 'tango']
row_7 = ['x', 'xray']
rows = [row_1, row_2,row_3,row_4, row_5, row_6, row_7]
df_a = spark.createDataFrame(rows, cols)
# we need to explode the LIST_VALUES Column first
df_1 = df_1.withColumn("EXP_LIST_VALUES",F.explode(F.col('LIST_VALUES')))
df_2 = df_1.select('ID','EXP_LIST_VALUES')
# then we can do a left join with df_2 and df_a
df_new = df_a.join(df_2,df_a.VALUE == df_2.EXP_LIST_VALUES,'left')
# applying window functions
df_output = df_new.select(
    F.col('ID'),
    F.collect_set(F.col('VALUE')).over(Window.partitionBy(F.col('ID'))).alias('LIST_VALUES'),
    F.array_join(
        F.collect_set(F.col('MAPPING')).over(Window.partitionBy(F.col('ID'))),
        ','
    ).alias('new_col')
).dropDuplicates()
display(df_output)
The output looks like the following dataframe (note that collect_set deduplicates, so ID 4 ends up with [h]):
# +---+-----------+--------------------+
# | ID|LIST_VALUES|             new_col|
# +---+-----------+--------------------+
# |  1|  [c, b, a]| bravo,charlie,alpha|
# |  2|  [t, n, a]|november,tango,alpha|
# |  3|        [x]|                xray|
# |  4|        [h]|               hotel|
# +---+-----------+--------------------+
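A groupBy-based sketch, assuming the same df_new as above, should produce the same result without the window plus dropDuplicates step (ordering inside collect_set is still not guaranteed):
# aggregate once per ID: rebuild the de-duplicated list and join the mappings
df_output = (df_new.groupBy('ID')
             .agg(F.collect_set('VALUE').alias('LIST_VALUES'),
                  F.array_join(F.collect_set('MAPPING'), ',').alias('new_col')))
display(df_output)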

How to add a new row to pandas dataframe with non-unique multi-index

import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(4*3).reshape(4,3), index=[['a','a','b','b'],[1,2,1,2]], columns=list('xyz'))
where df looks like:
     x   y   z
a 1  0   1   2
  2  3   4   5
b 1  6   7   8
  2  9  10  11
Now I add a new row by:
df.loc['new',:]=[0,0,0]
Then df gains a row labeled new (with an empty second index level) holding the zeros.
Now I want to do the same but with a different df that has a non-unique multi-index:
df = pd.DataFrame(np.arange(4*3).reshape(4,3), index=[['a','a','b','b'],[1,1,2,2]], columns=list('xyz'))
, which looks like:
     x   y   z
a 1  0   1   2
  1  3   4   5
b 2  6   7   8
  2  9  10  11
and call
df.loc['new',:]=[0,0,0]
The result is "Exception: cannot handle a non-unique multi-index!"
How could I achieve the goal?
Use append or concat with a helper DataFrame:
df1 = pd.DataFrame([[0, 0, 0]],
                   columns=df.columns,
                   index=pd.MultiIndex.from_arrays([['new'], ['']]))
df2 = df.append(df1)        # note: DataFrame.append was removed in pandas 2.0
df2 = pd.concat([df, df1])  # equivalent and preferred
print(df2)
       x   y   z
a 1    0   1   2
  1    3   4   5
b 2    6   7   8
  2    9  10  11
new    0   0   0
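A quick usage check, assuming the empty string used for the second index level in the helper frame above:
# retrieve the appended row by its full MultiIndex key
print(df2.loc[('new', '')])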

Pandas changing value in a column for selected rows

Trying to create a new dataframe by first splitting the original one in two:
df1 - containing only the rows of the original frame whose selected column holds a value from a given list
df2 - containing only the remaining rows, with those values then changed to a new given value
Return the new dataframe as the concatenation of df1 and df2.
This works fine:
l1 = ['a','b','c','d','a','b']
l2 = [1,2,3,4,5,6]
df = pd.DataFrame({'cat':l1,'val':l2})
print(df)
  cat  val
0   a    1
1   b    2
2   c    3
3   d    4
4   a    5
5   b    6
df['cat'] = df['cat'].apply(lambda x: 'other')
print(df)
     cat  val
0  other    1
1  other    2
2  other    3
3  other    4
4  other    5
5  other    6
Yet when I define function:
def create_df(df, select, vals, other):
    df1 = df.loc[df[select].isin(vals)]
    df2 = df.loc[~df[select].isin(vals)]
    df2[select] = df2[select].apply(lambda x: other)
    result = pd.concat([df1, df2])
    return result
and call it:
df3 = create_df(df, 'cat', ['a','b'], 'xxx')
print(df3)
Which results in what I actually need:
  cat  val
0   a    1
1   b    2
4   a    5
5   b    6
2  xxx    3
3  xxx    4
And for some reason in this case I get a warning:
..\usr\conda\lib\site-packages\ipykernel\__main__.py:10: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
So how is this case (assigning a value to a column inside a function) different from the first one, where I assign the value outside a function?
What is the right way to change a column's value?
There are many ways this code could be optimized, but for it to work you can simply take explicit copies of the two slices and concat those. The warning is not really about the function: in your first example you assign directly on df itself, so there is no ambiguity, whereas in the function df2 is the result of slicing df with .loc, which returns a new object that pandas flags as a possible copy of a slice; assigning a column on it triggers SettingWithCopyWarning. Calling .copy() makes the ownership unambiguous:
def create_df(df, select, vals, other):
    df1 = df[df[select].isin(vals)].copy()   # boolean indexing + explicit copy
    df2 = df[~df[select].isin(vals)].copy()  # boolean indexing + explicit copy
    df2[select] = other                      # a plain scalar assignment is sufficient
    result = pd.concat([df1, df2])
    return result
Alternative version:
l1 = ['a','b','c','d','a','b']
l2 = [1,2,3,4,5,6]
df = pd.DataFrame({'cat': l1, 'val': l2})
# define a mask
mask = df['cat'].isin(list("ab"))
# concatenate masked and non-masked rows
df2 = pd.concat([df[mask], df[~mask]])
# change values to 'xxx' (use ~, not -, to invert a boolean mask)
df2.loc[~mask, ["cat"]] = "xxx"
Outputs
  cat  val
0   a    1
1   b    2
4   a    5
5   b    6
2  xxx    3
3  xxx    4
Or function:
def create_df(df, filter_, isin_, value):
    # define a mask
    mask = df[filter_].isin(isin_)
    # concatenate masked and non-masked rows
    df = pd.concat([df[mask], df[~mask]])
    # change the non-matching values
    df.loc[~mask, [filter_]] = value
    return df
df2 = create_df(df, 'cat', ['a','b'], 'xxx')
df2
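For completeness, a minimal sketch that performs the replacement in one vectorized step with Series.where, assuming the same arguments as above:
def create_df(df, select, vals, other):
    out = df.copy()
    keep = out[select].isin(vals)
    # where() keeps values where keep is True and substitutes `other` elsewhere
    out[select] = out[select].where(keep, other)
    # matching rows first, mirroring the original output order
    return pd.concat([out[keep], out[~keep]])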