How to take the union of two dataframes which have different numbers of columns - pyspark - dataframe

I have two dataframes:
df1, which consists of columns col1 through col7
df2, which consists of columns col1 through col9
I need to perform a union of these two dataframes, but it fails because of the two extra columns.
Any idea what other function can be used?

Add the two missing columns to df1 and then go ahead with the union.
Import -
from pyspark.sql.functions import lit
If col8 and col9 are numeric then do -
new_df = df1.withColumn("col8", lit(float('nan'))).withColumn("col9", lit(float('nan')))
Or if col8 and col9 are strings then do -
new_df = df1.withColumn("col8", lit("")).withColumn("col9", lit(""))
Now union new_df with df2.
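For a fuller picture, here is a minimal sketch (the sample frames and column types below are made up for illustration); on Spark 3.1+, unionByName(allowMissingColumns=True) can also fill the missing columns with nulls for you:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark = SparkSession.builder.getOrCreate()
# Hypothetical sample frames: df1 is missing col8 and col9.
df1 = spark.createDataFrame([(1, "a")], ["col1", "col2"])
df2 = spark.createDataFrame([(2, "b", 3.0, 4.0)], ["col1", "col2", "col8", "col9"])
# Option 1 (Spark 3.1+): let Spark add the missing columns as nulls.
result = df1.unionByName(df2, allowMissingColumns=True)
# Option 2: pad df1 explicitly, then union by column name.
padded = df1.withColumn("col8", lit(None).cast("double")).withColumn("col9", lit(None).cast("double"))
result = padded.unionByName(df2)
result.show()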

Related

Pandas get unique values of col1 which has max value in col3

My data looks something like this:
I am looking to get the unique values of col1 (which could have duplicates) and their corresponding max value in col3. I also need the col2 value of the row which has that max value.
I referred to this solution but it's not quite giving me what I am looking for.
Any help on this is appreciated. Thanks!
This can be done by finding the max values in a new dataframe and then merging it back with the first dataframe.
import pandas as pd
# initialize list of lists
data = [['a1','b1', 5], ['a1','b2', 6], ['c1', 'd1',3],['c1','d2', 4],['c1','d3', 1]]
# Create the pandas DataFrame
df1 = pd.DataFrame(data, columns=['col1','col2', 'col3'])
# Create a dataframe holding the max col3 value per col1
df2 = pd.DataFrame(df1.groupby(['col1'])['col3'].max()).reset_index()
# Merge on col1 and col3 to keep the rows carrying each max, together with their col2
df1.merge(df2, on=['col1', 'col3'])
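Alternatively, with the same sample data, idxmax selects the row holding each per-group maximum directly, keeping col2 along with it (a shortcut, not part of the original answer):
# Pick the row index of the max col3 within each col1 group.
result = df1.loc[df1.groupby('col1')['col3'].idxmax()]
# result keeps one row per col1: ('a1', 'b2', 6) and ('c1', 'd2', 4)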

removing square brackets when creating a new column in pandas

I have a pandas df, something like this:
col1 col2
ABC [hello, hi, hey, hiya]
My task is to extract the first three words of col2 into a new column, joined by hyphens. Something like this:
col1 col2 col3
ABC [hello, hi, hey, hiya] hello-hi-hey
This seemed simple enough, but whatever I try, I am not able to remove the square brackets in the new column. Is this possible to do? Any help will be appreciated.
Assuming a Series of lists, slice and join:
df['col3'] = df['col2'].str[:3].agg('-'.join)
If you rather have string representations of lists:
import re
df['col3'] = ['-'.join(re.split(', ', s[1:-1])[:3]) for s in df['col2']]
output:
col1 col2 col3
0 ABC [hello, hi, hey, hiya] hello-hi-hey
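For reference, a self-contained version of the list case (the sample frame is a guess at the poster's data):
import pandas as pd
# Minimal reproduction: col2 holds real Python lists, not strings.
df = pd.DataFrame({'col1': ['ABC'], 'col2': [['hello', 'hi', 'hey', 'hiya']]})
# Slice the first three elements of each list, then join them with hyphens.
df['col3'] = df['col2'].str[:3].agg('-'.join)
print(df['col3'].iloc[0])  # hello-hi-hey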

Pyspark or Pandas - compare col1 of DF1 and col2 of DF2 and remove duplicate words in col1 based on col2

I want to remove the words present in the list of col2 of DF2 from col1 of DF1.
DF2 has only one list.
DF1 has multiple lists and each list is in one row.
DF1, DF2, and the expected DF1 output/result are shown as tables in the original post (not reproduced here).
In the output/result, I want to keep the repeating words of DF1 in each row. I tried array_except(); it gives the desired output but also removes the duplicates within each row of DF1, e.g. instead of 2 D's only 1 D comes out in the result. Any other solution?
Much appreciated!
If DF2 contains only one list, it probably does not need to be a dataframe; it can just be a list. Try this:
df1 = pd.DataFrame({'col1': [['A', 'B', 'D', 'D'], ['A', 'G']]})
to_remove = ['A', 'B', 'E']
df1['col3'] = df1['col1'].apply(lambda list_: [val for val in list_ if val not in to_remove])
print(df1)
col1 col3
0 [A, B, D, D] [D, D]
1 [A, G] [G]
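If you do want to stay in PySpark, here is a rough sketch (assuming Spark 2.4+ for the filter higher-order function); unlike array_except(), filter() keeps the repeated words:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(['A', 'B', 'D', 'D'],), (['A', 'G'],)], ['col1'])
# Remove only the listed words; duplicates such as the two D's survive.
sdf = sdf.withColumn(
    'col3',
    F.expr("filter(col1, x -> NOT array_contains(array('A', 'B', 'E'), x))")
)
sdf.show(truncate=False)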

Keeping only one column after join

For the following code
d = {'col1': [33,34], 'col2': [1,2]}
d1 = {'col3': [33,34], 'col4': [3,4]}
df = pd.DataFrame(data=d)
df1 = pd.DataFrame(data=d1)
myDF=pd.merge(df, df1, how='inner', left_on=['col1'], right_on=['col3'])
in myDF, both key columns (col1 and col3) are kept. Is there a way to keep only one of them (col1 or col3) with the merge function? (Of course I can drop a column after the merge; I just want to see whether the step can be simplified.)
Rename the column so that only one key column appears in the output; the left_on and right_on parameters are then unnecessary, because on is enough:
myDF=pd.merge(df.rename(columns={'col1':'col3'}), df1, on='col3')
print (myDF)
col3 col2 col4
0 33 1 3
1 34 2 4
myDF=pd.merge(df, df1.rename(columns={'col3':'col1'}), on='col1')
print (myDF)
col1 col2 col4
0 33 1 3
1 34 2 4

append one CSV to another as a dataframe based on certain column names without headers in pandas

I have a CSV in a data frame with these columns and data
ID. Col1. Col2. Col3 Col4
I have another CSV with just
ID. Column2. Column3
How can I append the 2nd CSV's data to the 1st CSV under the corresponding headers, without including the 2nd CSV's header row?
My Expected Dataframe
ID. Col1. Col2. Col3 Col4
Data.CSV1 Data.CSV1 Data.CSV1 Data.CSV1 Data.CSV1
ID.DataCSV2. Column2.DataCSV2. Column3.DataCSV2
Given that the column names in the two CSVs are different.
IIUC, you'll need to clean your column names, then you can do a simple concat.
import re
def col_cleaner(cols):
new_cols = [re.sub(r'\s+|\.', '', x) for x in cols]
return new_cols
df1.columns = col_cleaner(df1.columns)
df2.columns = col_cleaner(df2.columns)
#output
#['ID', 'Val1', 'Val2', 'Val3', 'Val4']
#['ID', 'Val2', 'Val3']
new_df = pd.concat([df1,df2],axis=0)
new_df.to_csv('your_csv.csv')
I think you can use .append
df1.append(df2)
col1 col2 col3
0 1 2 2.0
1 2 3 3.0
2 3 4 4.0
0 3 2 NaN
1 4 3 NaN
2 5 4 NaN
Sample Data
df1 = pd.DataFrame({'col1': [1,2,3], 'col2':[2,3,4], 'col3':[2,3,4]})
df2 = pd.DataFrame({'col1': [3,4,5], 'col2':[2,3,4]})
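Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; with the same sample data, pd.concat gives the same result:
import pandas as pd
df1 = pd.DataFrame({'col1': [1,2,3], 'col2':[2,3,4], 'col3':[2,3,4]})
df2 = pd.DataFrame({'col1': [3,4,5], 'col2':[2,3,4]})
# Same output as df1.append(df2); df2's missing col3 becomes NaN.
print(pd.concat([df1, df2]))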