Spark dataframe filter issue - dataframe

Coming from a SQL background here. I'm using df1 = spark.read.jdbc to load data from Azure SQL into a dataframe. I am trying to filter the data to exclude rows meeting the following criteria:
df2 = df1.filter("ItemID <> '75' AND Code1 <> 'SL'")
The dataframe ends up being empty, but when I run the equivalent SQL it is correct. When I change it to
df2 = df1.filter("ItemID = '75' AND Code1 = 'SL'")
it produces the rows I want to filter out.
What is the best way to remove the rows meeting the criteria, so they can be pushed to a SQL server?
Thank you

In the SQL world, <> checks whether the values of two operands are equal; if they are not equal, the condition is true.
Its equivalent in Spark SQL is !=, so your SQL condition inside filter becomes:
# A != B -> TRUE if expression A is not equivalent to expression B; otherwise FALSE
df2 = df1.filter("ItemID != '75' AND Code1 != 'SL'")
= has the same meaning in Spark SQL as in ANSI SQL:
df2 = df1.filter("ItemID = '75' AND Code1 = 'SL'")

Use the & operator with != in PySpark. The <> operator was removed in Python 3.
Example:
from pyspark.sql.functions import col

df = spark.createDataFrame([(75, 'SL'), (90, 'SL1')], ['ItemID', 'Code1'])
df.filter((col("ItemID") != '75') & (col("Code1") != 'SL')).show()
# or using negation
df.filter(~(col("ItemID") == '75') & ~(col("Code1") == 'SL')).show()
#+------+-----+
#|ItemID|Code1|
#+------+-----+
#| 90| SL1|
#+------+-----+

Related

Create a new column after if-else in dask

df['new_col'] = np.where(df['col1'] == df['col2'], True, False), where col1 and col2 are both str data types, seems pretty straightforward. What is the more efficient method to create a column in dask after an if-else statement? I tried the recommendation from this Create an if-else condition column in dask dataframe but it is taking forever: it has only processed about 30% after about an hour. I have 13 million rows and 70 columns.
IIUC, if you need to set the column to a boolean:
df['new_col'] = df['col1'] == df['col2']
If you need to set it to other values instead:
df['new_col'] = 'val for true'
ddf = df.assign(new_col=df['new_col'].where(df['col1'] == df['col2'], other='val for false'))
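A minimal, self-contained sketch of the same idea; the column names and example values are illustrative, not from the original data:
import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"col1": ["a", "b", "c", "d"], "col2": ["a", "x", "c", "y"]})
ddf = dd.from_pandas(pdf, npartitions=2)

# boolean flag, evaluated lazily per partition
ddf["new_col"] = ddf["col1"] == ddf["col2"]

# or map the condition to custom values instead of True/False
ddf["new_col"] = "val for true"
ddf = ddf.assign(new_col=ddf["new_col"].where(ddf["col1"] == ddf["col2"], other="val for false"))
print(ddf.compute())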

Alteryx regex_countmatches equivalent in PySpark?

I am working on migrating an Alteryx workflow to PySpark and, as part of that task, came across the following filter condition.
length([acc_id]) = 9
AND
(REGEX_CountMatches(right([acc_id],7),"[[:alpha:]]")=0 AND
REGEX_CountMatches(left([acc_id],2),"[[:alpha:]]")=2)
OR
(REGEX_CountMatches(right([acc_id],7),"[[:alpha:]]")=0 AND
REGEX_CountMatches(left([acc_id],1),"[[:alpha:]]")=1 AND
REGEX_CountMatches(right(left([acc_id],2),1), '9')=1
)
Can someone help me rewrite this condition for a PySpark dataframe?
You can use length with regexp_replace to get the equivalent of Alteryx's REGEX_CountMatches function:
REGEX_CountMatches(right([acc_id],7),"[[:alpha:]]")=0
Becomes:
# replace all non-alphabetic characters with '' then count what is left
F.length(F.regexp_replace(F.expr("right(acc_id, 7)"), '[^A-Za-z]', '')) == 0
The right and left functions are only available in Spark SQL, so you can use them via expr.
Full example:
from pyspark.sql import Column
from pyspark.sql import functions as F

df = spark.createDataFrame([("AB1234567",), ("AD234XG1234TT5",)], ["acc_id"])

def regex_count_matches(c: Column, regex: str) -> Column:
    """
    Helper equivalent to REGEX_CountMatches: remove everything matching
    `regex` and count what is left, so pass the negated character class.
    """
    return F.length(F.regexp_replace(c, regex, ''))

df.filter(
    (F.length("acc_id") == 9) &
    (
        (regex_count_matches(F.expr("right(acc_id, 7)"), '[^A-Za-z]') == 0)
        & (regex_count_matches(F.expr("left(acc_id, 2)"), '[^A-Za-z]') == 2)
    ) | (
        (regex_count_matches(F.expr("right(acc_id, 7)"), '[^A-Za-z]') == 0)
        & (regex_count_matches(F.expr("left(acc_id, 1)"), '[^A-Za-z]') == 1)
        & (regex_count_matches(F.expr("right(left(acc_id, 2), 1)"), '[^9]') == 1)
    )
).show()
#+---------+
#| acc_id|
#+---------+
#|AB1234567|
#+---------+
You can use size and split. You also need to use '[a-zA-Z]' for the regex because expressions like "[[:alpha:]]" are not supported in Spark.
For example,
REGEX_CountMatches(right([acc_id],7),"[[:alpha:]]")=0
should be equivalent to (in Spark SQL)
size(split(right(acc_id, 7), '[a-zA-Z]')) - 1 = 0
You can put the Spark SQL string directly into the filter clause for a Spark dataframe:
df2 = df.filter("size(split(right(acc_id, 7), '[a-zA-Z]')) - 1 = 0")
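For completeness, a sketch of the whole Alteryx condition expressed with size/split in a single filter string, keeping the question's column name and the original AND/OR grouping:
cond = """
    length(acc_id) = 9
    AND (size(split(right(acc_id, 7), '[a-zA-Z]')) - 1 = 0
         AND size(split(left(acc_id, 2), '[a-zA-Z]')) - 1 = 2)
    OR  (size(split(right(acc_id, 7), '[a-zA-Z]')) - 1 = 0
         AND size(split(left(acc_id, 1), '[a-zA-Z]')) - 1 = 1
         AND size(split(right(left(acc_id, 2), 1), '9')) - 1 = 1)
"""
df2 = df.filter(cond)
df2.show()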

pandas data frame columns - how to select a subset of columns based on multiple criteria

Let us say I have the following columns in a data frame:
title
year
actor1
actor2
cast_count
actor1_fb_likes
actor2_fb_likes
movie_fb_likes
I want to select the following columns from the data frame and ignore the rest of the columns:
the first 2 columns (title and year)
some columns based on name - cast_count
some columns which contain the string "actor1" - actor1 and actor1_fb_likes
I am new to pandas. I know which method to use for each of the operations above, but I want to do all three together, ending up with one dataframe that contains just the columns I need for further analysis. How do I do this?
Here is example code that I have written:
import pandas as pd

data = {
    "title": ['Hamlet', 'Avatar', 'Spectre'],
    "year": ['1979', '1985', '2007'],
    "actor1": ['Christoph Waltz', 'Tom Hardy', 'Doug Walker'],
    "actor2": ['Rob Walker', 'Christian Bale', 'Tom Hardy'],
    "cast_count": ['15', '24', '37'],
    "actor1_fb_likes": [545, 782, 100],
    "actor2_fb_likes": [50, 78, 35],
    "movie_fb_likes": [1200, 750, 475],
}
df_input = pd.DataFrame(data)
print(df_input)
df1 = df_input.iloc[:, 0:2]            # select the first 2 columns
df2 = df_input[['cast_count']]         # select some columns by name - cast_count
df3 = df_input.filter(like='actor1')   # select columns which contain the string "actor1" - actor1 and actor1_fb_likes
df_output = pd.concat(df1, df2, df3)   # this throws an error and I can't understand the reason
print(df_output)
Question 1:
df_1 = df[['title', 'year']]
Question 2:
# This is an example but you can put whatever criteria you'd like
df_2 = df[df['cast_count'] > 10]
Question 3:
# This is an example but you can put whatever criteria you'd like this way
df_3 = df[(df['actor1_fb_likes'] > 1000) & (df['actor1'] == 'actor1')]
Make sure each filter is contained within its own set of parentheses () before using the & or | operators. & acts as an and operator; | acts as an or operator.
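As for the pd.concat error in the question: pd.concat expects a single list (or other iterable) of objects, and combining column selections side by side needs axis=1. A minimal sketch reusing the df_input defined above:
import pandas as pd

df1 = df_input.iloc[:, 0:2]            # first two columns (title, year)
df2 = df_input[['cast_count']]         # column selected by name
df3 = df_input.filter(like='actor1')   # columns whose name contains "actor1"

# pd.concat(df1, df2, df3) fails because the frames must be passed as one list
df_output = pd.concat([df1, df2, df3], axis=1)
print(df_output)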

Pyspark - Joins - duplicate columns

I have 3 dataframes, each with the same columns col1 to col7 plus the join keys col8_S1 and col8_S2.
I am using the code below to join them:
cond = [df1.col8_S1 == df2.col8_S1, df1.col8_S2 == df2.col8_S2]
df = df1.join(df2,cond,how ='inner').drop('df1.col8_S1','df1.col8_S2')
cond = [df.col8_S1 == df3.col8_S1, df.col8_S2 == df3.col8_S2]
df4 = df.join(df3,cond,how ='inner').drop('df3.col8_S1','df3.col8_S2')
I am writing the dataframe to a CSV file; however, since they have the same columns from col1 to col7, the write fails due to duplicate columns. How do I drop the duplicate columns without specifying their names?
Just use the column names for the join instead of an explicit equality condition; Spark then keeps only one copy of each join column in the result.
cond = ['col8_S1', 'col8_S2']
df = df1.join(df2, cond, how='inner')
df4 = df.join(df3, cond, how='inner')
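With the name-based join the key columns col8_S1 and col8_S2 appear only once in df4, so it can be written out; a minimal sketch, with an illustrative path and options:
# write the joined result to CSV (path is a placeholder)
df4.write.mode("overwrite").csv("/tmp/joined_output", header=True)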

Pandas Groupby: Groupby conditional statement

I am trying to identify the locations of stops from GPS data but need to account for some GPS drift.
I have identified stops and isolated them into a new dataframe:
df['Stopped'] = (df.groupby('DAY')['LAT'].diff().abs() <= 0.0005) & (df.groupby('DAY')['LNG'].diff().abs() <= 0.0005)
df2 = df.loc[(df['Stopped'] == True)]
Now I can label groups that have the exact match in coordinates using:
df2['StoppedEvent'] = df2.groupby(['LAT','LNG']).ngroup()
But I want to group by the same conditions of Stopped. Something like this but that works:
df2['StoppedEvent'] = df2.groupby((['LAT','LNG']).diff().fillna(0).abs() <= 0.0005).ngroup()
I would do something like the following:
df['Stopped'] = (df.groupby('DAY')['LAT'].diff().abs() <= 0.0005)\
& (df.groupby('DAY')['LNG'].diff().abs() <= 0.0005)
df["Stopped_Group"] = (~df["Stopped"]).cumsum()
df2 = df.loc[df['Stopped']]
Now you'll have a column, "Stopped_Group", which is constant within a set of rows that are close to each other as determined by your logic. In the original dataframe, df, this column won't have any meaning for rows that correspond to motion.
To get your desired output (if I understand you correctly), do something like the following:
df2["Stopped_Duration"] = df2.groupby("Stopped_Group").transform("size")