Building the PySpark syntax dynamically - apache-spark-sql

We have a requirement to save the ETL operation rules in a MySQL database and run an AWS Glue job based on the rules coded in PySpark.
We are going to save the actual PySpark syntax in the rules table as strings, like below:
s.no|rule|output
1|df1.join(df2, on=['age'], how='right_outer')|df3
2|df3.join(df4, on=['age'], how='right_outer')|df5
3|df5.join(df6, on=['age'], how='right_outer')|df7
We are going to pull this from the DB, store it as a DataFrame, and loop over it:
for i in DF:
    i.output = i.rule
    # after substituting the value it looks like: df3 = df1.join(df2, on=['age'], how='right_outer')
But the join operation is not happening. Since the values are stored as strings in the DB, the loop just substitutes the string values.
Please help me understand what needs to be changed for the join operations to actually be executed. Do I need to change the data type?
Many thanks in advance.

Can you try this:
df_all = df1.join(df2, on=['age'], how='right_outer')\
.join(df3, on=['age'], how='right_outer')\
.join(df4, on=['age'], how='right_outer')
df_all.show()
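If you need to execute the rule strings exactly as they are stored in MySQL, note that assigning the string to a variable will never run it; the string has to be evaluated as Python code. A rough sketch of one way to do that with Python's built-in eval follows (rules_df and the df1/df2/... DataFrames are assumed to already exist, the column names follow your rules table, and you should only eval strings whose source you fully trust):
# Registry of the DataFrames the stored rules are allowed to reference.
dataframes = {"df1": df1, "df2": df2, "df4": df4, "df6": df6}
# rules_df is assumed to be the DataFrame loaded from the MySQL rules table,
# with its rows already ordered by s.no.
for row in rules_df.collect():
    # Evaluate the stored PySpark expression against the known DataFrames.
    result = eval(row["rule"], {"__builtins__": {}}, dataframes)
    # Register the result under the name given in the output column (e.g. "df3").
    dataframes[row["output"]] = result
dataframes["df7"].show()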
Best,

Related

Multiple persists in the Spark Execution plan

I currently have some Spark code (PySpark) which loads data from S3 and applies several transformations to it. The current code is structured in such a way that there are a few persists along the way, in the following format:
df = spark.read.csv(s3path)
df = df.transformation1
df = df.transformation2
df = df.transformation3
df = df.transformation4
df.persist(MEMORY_AND_DISK)
df = df.transformation5
df = df.transformation6
df = df.transformation7
df.persist(MEMORY_AND_DISK)
.
.
.
df = df.transformationN-2
df = df.transformationN-1
df = df.transformationN
df.persist(MEMORY_AND_DISK)
When I do df.explain() at the very end of all transformations, as expected, there are multiple persists in the execution plan. Now when I do the following at the end of all these transformations
print(df.count())
All transformations get triggered, including the persists. Since Spark will flow through the execution plan, it will execute all of these persists. Is there any way I can tell Spark to unpersist the N-1th persist when performing the Nth persist, or is Spark smart enough to do this itself? My issue stems from the fact that later on in the program I run out of disk space, i.e., Spark errors out with the following error:
No space left on device
An easy solution is of course to increase the underlying number of instances. But my hypothesis is that the high number of persists eventually causes the disk to run out of space.
My question is: do these chained persists cause this issue? If they do, what is the best way/practice to structure the code so that the N-1th persist is unpersisted automatically?
I'm more experienced with Scala Spark, but it's definitely possible to unpersist a DataFrame.
In fact, the PySpark method on a DataFrame is also called unpersist. So in your example, you could do something like this (it is quite crude):
df = spark.read.csv(s3path)
df = df.transformation1
df = df.transformation2
df = df.transformation3
df = df.transformation4
df.persist(MEMORY_AND_DISK)
df1 = df.transformation5
df1 = df1.transformation6
df1 = df1.transformation7
df.unpersist()
df1.persist(MEMORY_AND_DISK)
.
.
.
dfM = dfM-1.transformationN-2
dfM = dfM.transformationN-1
dfM = dfM.transformationN
dfM-1.unpersist()
dfM.persist(MEMORY_AND_DISK)
Now, the way this code looks raises some questions for me. It might be that you've mostly written it as pseudocode to be able to ask the question, but the following points might still help you further:
I only see transformations in there, and no actions. If this is the case, do you even need to persist?
Also, you only seem to have 1 source of data (the spark.read.csv bit): this also seems to hint at not necessarily needing to persist.
This is more of a point about style (and maybe opinionated, so don't worry if you don't agree). As I said in the beginning, I have no experience with PySpark, but the way I would write something similar to what you have written (in Scala Spark) would be something like this:
df = spark.read.csv(s3path)
.transformation1
.transformation2
.transformation3
.transformation4
.persist(MEMORY_AND_DISK)
df = df.transformation5
.transformation6
.transformation7
.persist(MEMORY_AND_DISK)
.
.
.
df = df.transformationN-2
.transformationN-1
.transformationN
.persist(MEMORY_AND_DISK)
This is less verbose and, IMO, a little more "true" to what really happens: just a chaining of transformations on the original dataframe.
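For completeness, here is a small runnable PySpark sketch of the persist/unpersist pattern discussed above; the input path and the transformations are placeholders, and note that in PySpark the storage level has to be imported explicitly:
from pyspark.sql import SparkSession, functions as F
from pyspark import StorageLevel

spark = SparkSession.builder.getOrCreate()

# Placeholder input and transformations; replace with your own.
df = spark.read.csv("s3://your-bucket/path", header=True, inferSchema=True)

stage1 = df.filter(F.col("value").isNotNull()) \
           .withColumn("value_doubled", F.col("value") * 2)
stage1.persist(StorageLevel.MEMORY_AND_DISK)

stage2 = stage1.groupBy("key").agg(F.sum("value_doubled").alias("total"))
stage2.persist(StorageLevel.MEMORY_AND_DISK)  # lazy: only marks stage2 for caching
stage1.unpersist()  # removes stage1 from the cache (a no-op if it was never materialized)

print(stage2.count())  # an action finally triggers the computation and populates stage2's cache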

How to create a DataFrame based on a condition

I want to create a new DataFrame from another for rows that meet a condition such as:
uk_cities_df['location'] = cities_df['El Tarter'].where(cities_df['AD'] == 'GB')
uk_cities_df[:5]
but the resulting uk_cities_df is returning NaN values.
The CSV file that I am extracting from has no headers, so pandas used the first row's values as column names. I need uk_cities_df to include only the rows with the ISO code "GB", so "El Tarter" denotes the column holding the location values and "AD" the column holding the ISO code.
Could you please provide a visual of what uk_cities_df and cities_df look like?
From what I can gather, I think you might be looking for the .loc operator.
You could try, for example:
uk_cities_df['location'] = cities_df.loc[cities_df['AD'] == 'GB']['location']
Also, I did not really get what role 'El Tarter' plays here; maybe you could give more details?
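For what it's worth, a minimal pandas sketch of building the filtered frame as its own DataFrame might look like this (the file name and column names are assumed; reading with header=None and explicit names avoids the first data row being used as the header):
import pandas as pd

# Read the headerless CSV and assign explicit column names (both are assumptions).
cities_df = pd.read_csv("cities.csv", header=None, names=["location", "iso_code"])

# Keep only the rows whose ISO code is 'GB', as a new DataFrame.
uk_cities_df = cities_df.loc[cities_df["iso_code"] == "GB"].copy()
print(uk_cities_df.head())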

Spark Dataframe sql in java - How to escape single quote

I'm using spark-core, spark-sql, spark-hive 2.10 (1.6.1), and scala-reflect 2.11.2. I'm trying to filter a DataFrame created through a Hive context...
df = hiveCtx.createDataFrame(someRDDRow, someDF.schema());
One of the columns that I'm trying to filter on has multiple single quotes in it. My filter query will be something similar to:
df = df.filter("not (someOtherColumn= 'someOtherValue' and comment= 'That's Dany's Reply'"));
In my Java class where this filter occurs, I tried to escape the String variable (e.g. commentValueToFilterOut, which contains the value "That's Dany's Reply") with:
commentValueToFilterOut= commentValueToFilterOut.replaceAll("'","\\\\'");
But when I apply the filter to the DataFrame, I'm getting the below error...
java.lang.RuntimeException: [1.103] failure: ``)'' expected but identifier
s found
not (someOtherColumn= 'someOtherValue' and comment= 'That\'s Dany\'s Reply'' )
^
scala.sys.package$.error(package.scala:27)
org.apache.spark.sql.catalyst.SqlParser$.parseExpression(SqlParser.scala:49)
org.apache.spark.sql.DataFrame.filter(DataFrame.scala:768)
Please advise...
We implemented a workaround to overcome this issue.
Workaround:
Create a new column in the DataFrame and copy the values from the actual column (which contains special characters that may cause issues, like single quotes) into the new column without any special characters.
df = df.withColumn("comment_new", functions.regexp_replace(df.col("comment"),"'",""));
Strip the special characters from the filter value as well, and apply the filter.
String commentToFilter = "That's Dany's Reply";
commentToFilter = commentToFilter.replaceAll("'","");
df = df.filter("(someOtherColumn= 'someOtherValue' and comment_new= '"+commentToFilter+"')");
Now that the filter has been applied, you can drop the new column that you created for the sole purpose of filtering, restoring the original DataFrame.
df = df.drop("comment_new");
If you don't want to create a new column in the DataFrame, you can also replace the special character with some "never-happens" string literal in the same column, e.g.
df = df.withColumn("comment", functions.regexp_replace(df.col("comment"),"'","^^^^"));
and do the same with the string literal that you want to filter against:
String commentToFilter = "That's Dany's Reply";
commentToFilter = commentToFilter.replaceAll("'","^^^^");
df = df.filter("(someOtherColumn= 'someOtherValue' and comment= '"+commentToFilter+"')");
Once filtering is done, restore the actual values by reverse-applying the string literal (note that ^ is a regex metacharacter, so it has to be escaped in the pattern):
df = df.withColumn("comment", functions.regexp_replace(df.col("comment"), "\\^\\^\\^\\^", "'"));
Though it doesn't answer the actual issue, anyone having the same problem can try this out as a workaround.
The actual solution could be to use sqlContext (instead of hiveContext) and/or Dataset (instead of DataFrame) and/or to upgrade to spark-hive 2.12; I'll leave that for the experts to debate and answer.
PS: Thanks to KP, my lead.

How to filter 'NaN' in Pig

I have data that has some rows that look like this:
(1655,var0,var1,NaN)
The first column is an ID, the second and third come from the correlation. The fourth column is the correlation value (from using the COR function).
I would like to filter these rows.
From the Apache Pig documentation, I was under the impression that NaN is equivalent to a null. Therefore I added this to my code:
filter_corr = filter correlation by (corr IS NOT NULL);
This obviously did not work since apparently Pig does not treat null and NaN in the same way.
I would like to know the correct way to filter NaN, since it is not clear from the Pig documentation.
Thanks!
You could possibly declare your column as chararray in your schema and filter with a not matches 'NaN'.
Or, if you want to replace your NaNs with something else, keep the chararray in your schema as before and then:
Data = FOREACH Data GENERATE ..., (correlation matches 'NaN' ? 0 : (double) correlation), ...
I hope this helps, good luck ;)
You could read in the data as one chararray line and then use a UDF to parse the rows. I made a dataset; it looks like this:
1665,var0,var1,NaN
1453,var2,var3,5.432
3452,var4,var5,7.654
8765,var6,var7,NaN
Create the UDF:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
### name of file: udf.py ###
@outputSchema("t:(id:int, col2:chararray, col3:chararray, corr:float)")
def format_input(line):
    parsed = line.split(',')
    if parsed[len(parsed) - 1] == 'NaN':
        parsed.pop()
        parsed.append(None)
    return tuple(parsed)
Then in the pig shell
$ pig -x local
grunt>
/* register udf */
register 'udf.py' using jython as udf;
data = load 'file' as (line:chararray);
A = foreach data generate FLATTEN(udf.format_input(line));
filtered = filter A by corr is not null;
dump filtered;
output
(1453,var2,var3,5.432)
(3452,var4,var5,7.654)
I've gone with this solution:
filter_corr = filter data by (corr != 'NaN');
data1 = foreach filter_corr generate ID, (double)corr as double_corr;
I renamed the column and reassigned the data type from chararray to double.
I appreciate the responses, but I cannot use UDFs during prototyping due to a limitation in the UI that I am using (Cloudera).

Conditions in Pig storage

Say I have an input file containing maps.
sample.txt
[1#"anything",2#"something",3#"anotherthing"]
[2#"kish"]
[3#"mad"]
[4#"sun"]
[1#"moon"]
[1#"world"]
Since there are no values with the specified key, I do not want to save the output to a file. Is there any conditional statement that I can include with the STORE INTO statement? Please help me through this; the Pig script follows.
A = LOAD 'sample.txt';
B = FOREACH A GENERATE $0#'5' AS temp;
C = FILTER B BY temp is not null;
-- It actually generates an empty part-r-X file.
-- Is there any conditional statement I can include so that if C is empty, it does not store?
STORE C INTO '/user/logs/output';
Thanks.
Am I going wrong somewhere? Please correct me if I am wrong.
From Chapter 9 of Programming Pig,
Pig Latin is a dataflow language. Unlike general purpose programming languages, it does not include control flow constructs like if and for.
Thus, it is impossible to do this using just Pig.
I'm inclined to say you could achieve this using a combination of a custom StoreFunc and a custom OutputFormat, but that seems like it would be too much added overhead.
One way to solve this would be to just delete the output file if no records are written. This is not too difficult using embedded Pig. For example, using Python embedding:
from org.apache.pig.scripting import Pig
P = Pig.compile("""
A = load 'sample.txt';
B = foreach A generate $0#'5' AS temp;
C = filter B by temp is not null;
store C into 'output/foo/bar';
""")
bound = P.bind()
stats = bound.runSingle()
if not stats.isSuccessful():
    raise RuntimeError(stats.getErrorMessage())
result = stats.result('C')
if result.getNumberRecords() < 1:
    print 'Removing empty output directory'
    Pig.fs('rmr ' + result.getLocation())