PySpark - joins / duplicate columns - dataframe

I have 3 dataframes.
Each of them has columns that look like the ones shown below, and I am using this code to join them:
cond = [df1.col8_S1 == df2.col8_S1, df1.col8_S2 == df2.col8_S2]
df = df1.join(df2, cond, how='inner').drop('df1.col8_S1', 'df1.col8_S2')
cond = [df.col8_S1 == df3.col8_S1, df.col8_S2 == df3.col8_S2]
df4 = df.join(df3, cond, how='inner').drop('df3.col8_S1', 'df3.col8_S2')
I am writing the dataframe out to a CSV file; however, since the dataframes share the same columns col1 to col7, the write fails due to duplicate columns. How do I drop the duplicate columns without specifying their names?

Just use the column names for the join instead of spelling out the equality conditions; Spark then keeps a single copy of each join key.
cond = ['col8_S1', 'col8_S2']
df = df1.join(df2, cond, how='inner')
cond = ['col8_S1', 'col8_S2']
df4 = df.join(df3, cond, how='inner')
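A minimal, self-contained sketch of the pattern (the tiny dataframes, their non-key columns, and the output path are placeholders, not from the question):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tiny stand-ins for df1/df2/df3; the non-key columns are deliberately distinct here.
df1 = spark.createDataFrame([(1, 'a', 10)], ['col8_S1', 'col8_S2', 'col1'])
df2 = spark.createDataFrame([(1, 'a', 20)], ['col8_S1', 'col8_S2', 'col2'])
df3 = spark.createDataFrame([(1, 'a', 30)], ['col8_S1', 'col8_S2', 'col3'])

cond = ['col8_S1', 'col8_S2']

# Joining on a list of column names keeps a single copy of the join keys,
# so there is nothing left to drop afterwards.
df = df1.join(df2, cond, how='inner')
df4 = df.join(df3, cond, how='inner')

# Placeholder output path; adjust to your environment.
df4.write.mode('overwrite').option('header', True).csv('/tmp/joined_output')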

Related

Create a new column after if-else in dask

df['new_col'] = np.where(df['col1'] == df['col2'], True, False), where col1 and col2 are both str data types, seems pretty straightforward. What is the more efficient way to create a column in dask after an if-else condition? I tried the recommendation from "Create an if-else condition column in dask dataframe" but it is taking forever; it has only processed about 30% after about an hour. I have 13 million rows and 70 columns.
If I understand correctly, and you only need to set the column to a boolean:
df['new_col'] = df['col1'] == df['col2']
If you need to set it to other values:
df['new_col'] = 'val for true'
ddf = df.assign(new_col=df.new_col.where(cond=df['col1'] == df['col2'], other='val for false'))
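A runnable sketch of both variants on a small dask dataframe (the column values and the 'val for true'/'val for false' labels are just illustrative):
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({'col1': ['a', 'b', 'c'], 'col2': ['a', 'x', 'c']})
ddf = dd.from_pandas(pdf, npartitions=2)

# Boolean flag: a lazy, vectorised comparison, no row-wise apply needed.
ddf = ddf.assign(new_col=ddf['col1'] == ddf['col2'])

# Mapping to arbitrary labels instead of True/False.
ddf = ddf.assign(label='val for true')
ddf = ddf.assign(label=ddf['label'].where(ddf['col1'] == ddf['col2'], other='val for false'))

print(ddf.compute())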

Create a Python script which compares several Excel files (snapshots) and creates a new dataframe with the rows which are different

I am new to Python and would appreciate your help.
I would like to create a Python script which performs data validation by using my first file, excel_file[0], as df1 and comparing it against several other files, excel_file[0:100], looping through them, comparing each with df1, and appending the rows which are different to a new dataframe df3. Even though I have several columns, I would like to base the comparison on two columns, one of which is a primary key column, so that if the keys in the two dataframes match, df1 and df2 (the file in the loop) are compared.
Here's what I have tried:
## pandasql allows SQL syntax on pandas DataFrames;
## it needs installation first: pip install -U pandasql
import glob
import os
import datetime as dt
import pandas as pd
from pandasql import sqldf

pysqldf = lambda q: sqldf(q, locals(), globals())

dateTimeObj = dt.datetime.now()
print('start file merge: ', dateTimeObj)

# path = os.getcwd()
# files = os.listdir(path1)
files = os.path.abspath('mydrive')

dff1 = pd.DataFrame()

# method 1
excel_files = glob.glob(files + "/*.xlsx")
# excel_files = [f for f in files if f[-5:] == '.xlsx' or f[-4:] == '.xls']

df1 = pd.read_excel(excel_files[14])
for f in excel_files[0:100]:
    df2 = pd.read_excel(f)
    # Drop any unnamed column
    # df1 = df1.drop(df1.iloc[:, [0]], axis=1)
    # Get all rows which differ between the two dataframes; the clause
    # "_key HAVING COUNT(*) = 1" keeps rows that appear in only one of them,
    # while "_key HAVING COUNT(*) = 2" would keep the rows they have in common.
    data = pysqldf("SELECT * FROM (SELECT * FROM df1 UNION ALL SELECT * FROM df2) t "
                   "GROUP BY _key HAVING COUNT(*) = 1;")
    dff1 = dff1.append(data).reset_index(drop=True)

print(dt.datetime.now().strftime("%x %X") + ': files appended to make a Master file')
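As a hedged alternative to pandasql (not part of the attempt above): plain pandas can flag differing rows with an indicator merge, assuming each snapshot shares the _key column plus one comparison column, hypothetically named value_col here:
import glob
import pandas as pd

excel_files = glob.glob('mydrive/*.xlsx')   # placeholder location
df1 = pd.read_excel(excel_files[0])         # the reference snapshot

diffs = []
for f in excel_files[1:]:
    df2 = pd.read_excel(f)
    # Keep rows of df2 whose (_key, value_col) pair does not occur in df1.
    merged = df2.merge(df1[['_key', 'value_col']], on=['_key', 'value_col'],
                       how='left', indicator=True)
    diffs.append(merged.loc[merged['_merge'] == 'left_only'].drop(columns='_merge'))

df3 = pd.concat(diffs, ignore_index=True)
print(df3)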

Spark dataframe filter issue

Coming from a SQL background here. I'm using df1 = spark.read.jdbc to load data from Azure SQL into a dataframe. I am trying to filter the data to exclude rows meeting the following criteria:
df2 = df1.filter("ItemID <> '75' AND Code1 <> 'SL'")
The dataframe ends up being empty, but when I run the equivalent SQL it is correct. When I change it to
df2 = df1.filter("ItemID = '75' AND Code1 = 'SL'")
it produces the rows I want to filter out.
What is the best way to remove the rows meeting the criteria, so they can be pushed to a SQL server?
Thank you
In the SQL world, <> checks whether the values of two operands are equal; if they are not equal, the condition becomes true.
Its equivalent in Spark SQL is !=, so your SQL condition inside filter becomes:
# A != B -> TRUE if expression A is not equivalent to expression B; otherwise FALSE
df2 = df1.filter("ItemID != '75' AND Code1 != 'SL'")
= has the same meaning in Spark SQL as in ANSI SQL:
df2 = df1.filter("ItemID = '75' AND Code1 = 'SL'")
Use the & operator together with != in PySpark; the <> operator was removed in Python 3.
Example:
from pyspark.sql.functions import col

df = spark.createDataFrame([(75, 'SL'), (90, 'SL1')], ['ItemID', 'Code1'])
df.filter((col("ItemID") != '75') & (col("Code1") != 'SL')).show()
#or using negation
df.filter(~(col("ItemID") == '75') & ~(col("Code1") == 'SL')).show()
#+------+-----+
#|ItemID|Code1|
#+------+-----+
#|    90|  SL1|
#+------+-----+
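Since the question also asks about pushing the remaining rows to SQL Server, one option is the DataFrameWriter's jdbc method; a sketch with placeholder connection details and a hypothetical target table name:
# Placeholder connection details; replace with your own server, database and credentials.
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<database>"
props = {
    "user": "<user>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
}

# 'dbo.FilteredItems' is a hypothetical target table name.
df2.write.jdbc(url=jdbc_url, table="dbo.FilteredItems", mode="append", properties=props)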

pandas data frame columns - how to select a subset of columns based on multiple criteria

Let us say I have the following columns in a data frame:
title
year
actor1
actor2
cast_count
actor1_fb_likes
actor2_fb_likes
movie_fb_likes
I want to select the following columns from the data frame and ignore the rest of the columns:
the first 2 columns (title and year)
some columns based on name - cast_count
some columns which contain the string "actor1" - actor1 and actor1_fb_likes
I am new to pandas. For each of the above operations I know which method to use, but I want to do all three operations together, because all I need is a dataframe containing the above columns for further analysis. How do I do this?
Here is example code that I have written:
import pandas as pd

data = {
    "title": ['Hamlet', 'Avatar', 'Spectre'],
    "year": ['1979', '1985', '2007'],
    "actor1": ['Christoph Waltz', 'Tom Hardy', 'Doug Walker'],
    "actor2": ['Rob Walker', 'Christian Bale', 'Tom Hardy'],
    "cast_count": ['15', '24', '37'],
    "actor1_fb_likes": [545, 782, 100],
    "actor2_fb_likes": [50, 78, 35],
    "movie_fb_likes": [1200, 750, 475],
}
df_input = pd.DataFrame(data)
print(df_input)
df1 = df_input.iloc[:,0:2] # Select first 2 columns
df2 = df_input[['cast_count']] #select some columns by name - cast_count
df3 = df_input.filter(like='actor1') #select columns which contain the string "actor1" - actor1 and actor1_fb_likes
df_output = pd.concat(df1, df2, df3)  # this throws an error whose cause I don't understand
print(df_output)
Question 1:
df_1 = df[['title', 'year']]
Question 2:
# This is an example but you can put whatever criteria you'd like
df_2 = df[df['cast_count'] > 10]
Question 3:
# This is an example but you can put whatever criteria you'd like this way
df_2 = df[(df['actor1_fb_likes'] > 1000) & (df['actor1'] == 'actor1')]
Make sure each filter is contained within its own set of parentheses () before using the & or | operators. & acts as an and operator; | acts as an or operator.
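To actually combine the three selections from the question into one dataframe, pd.concat needs a list of frames and axis=1 (which is why the call in the question fails); a short sketch reusing df_input from the question's code:
df1 = df_input.iloc[:, 0:2]            # first two columns: title, year
df2 = df_input[['cast_count']]         # selected by name
df3 = df_input.filter(like='actor1')   # columns containing "actor1"

# concat expects a list of frames; axis=1 stacks them side by side.
df_output = pd.concat([df1, df2, df3], axis=1)
print(df_output)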

Pandas: OverflowError when trying to create a new df column using a long-int value

I have a dataframe like this:
       ID NAME                                  group_id
0  205292    A  183144058824253894513539088231878865676
1  475121    B  183144058824253894513539088231878865676
2  475129    C  183144058824253894513539088231878865676
I want to transform it such that row 0 is linked to the other rows in the following way
  LinkedBy   By_Id LinkedTo   To_Id                                  group_id
1        A  205292        B  475121  183144058824253894513539088231878865676
2        A  205292        C  475129  183144058824253894513539088231878865676
Basically, I am compressing the first dataframe by linking the 0th-index row against all the others, so that an n-row df gives me an (n-1)-row df. I can accomplish this without the group_id (which is of type long and stays constant) with the following code:
pd.DataFrame({"LinkedBy": df['NAME'].iloc[0],"By_Id": df['ID'].iloc[0],"LinkedTo":df['NAME'].iloc[1:],"To_Id":df['ID'].iloc[1:]})
But I am facing problems when adding the group_id. When I do the following
pd.DataFrame({"LinkedBy": df['NAME'].iloc[0],"By_Id": df['ID'].iloc[0],"LinkedTo":df['NAME'].iloc[1:],"To_Id":df['ID'].iloc[1:],"GroupId":df['potential_group_id'].iloc[0]})
I get OverflowError: long too big to convert.
How do I add the group_id of type long to my new df?
Since your group_id in all rows appears to be the same, you could try this:
res = pd.merge(left=df.iloc[[0], :], right=df.iloc[1:, :], how='right', on=['group_id'])
res.columns = ['By_Id', 'LinkedBy', 'group_id', 'To_Id', 'LinkedTo']
Note that this will only work when group_id can be used as your join key.
Group by everything, then apply a custom function:
cond1 makes sure 'group_id' matches
cond2 makes sure 'NAME' does not match
subset df inside the apply function
rename and drop columns
then more renaming, dropping, and index resetting
def find_grp(x):
    cond1 = df.group_id == x.name[2]
    cond2 = df.NAME != x.name[1]
    temp = df[cond1 & cond2]
    rnm = dict(ID='To_ID', NAME='LinkedTo')
    return temp.drop('group_id', axis=1).rename(columns=rnm)

cols = ['ID', 'NAME', 'group_id']
df1 = df.groupby(cols).apply(find_grp)
df1.index = df1.index.droplevel(-1)
df1.rename_axis(['By_ID', 'LinkedBy', 'group_id']).reset_index()
OR
df1 = df.merge(df, on='group_id', suffixes=('_By', '_To'))
df1 = df1[df1.NAME_By != df1.NAME_To]
rnm = dict(ID_By='By_ID', ID_To='To_ID', NAME_To='LinkedTo', NAME_By='LinkedBy')
df1.rename(columns=rnm)
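A quick, self-contained check of the self-merge approach on the question's sample data; the large group_id stays a regular column (object dtype) rather than being broadcast as a scalar, which is what sidesteps the OverflowError. The final filter, keeping only links that start from the first row, is added here to match the desired output:
import pandas as pd

df = pd.DataFrame({
    'ID': [205292, 475121, 475129],
    'NAME': ['A', 'B', 'C'],
    'group_id': [183144058824253894513539088231878865676] * 3,
})

out = df.merge(df, on='group_id', suffixes=('_By', '_To'))
out = out[out.NAME_By != out.NAME_To]
rnm = dict(ID_By='By_ID', ID_To='To_ID', NAME_To='LinkedTo', NAME_By='LinkedBy')
out = out.rename(columns=rnm)

# Keep only links that start from the first row (A), matching the desired output.
out = out[out.LinkedBy == 'A'].reset_index(drop=True)
print(out)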