Spark DataFrame SQL in Java - How to escape single quote - apache-spark-sql

I'm using spark-core, spark-sql, spark-hive_2.10 (1.6.1), and scala-reflect 2.11.2. I'm trying to filter a DataFrame created through a Hive context...
df = hiveCtx.createDataFrame(someRDDRow, someDF.schema());
One of the columns that I'm trying to filter on has multiple single quotes in it. My filter query is something like:
df = df.filter("not (someOtherColumn= 'someOtherValue' and comment= 'That's Dany's Reply')");
In the Java class where this filter is applied, I tried to escape the single quotes in the String variable, e.g. commentValueToFilterOut, which contains the value "That's Dany's Reply", with
commentValueToFilterOut= commentValueToFilterOut.replaceAll("'","\\\\'");
But when I apply the filter to the DataFrame, I get the error below:
java.lang.RuntimeException: [1.103] failure: ``)'' expected but identifier s found
not (someOtherColumn= 'someOtherValue' and comment= 'That\'s Dany\'s Reply'' )
^
scala.sys.package$.error(package.scala:27)
org.apache.spark.sql.catalyst.SqlParser$.parseExpression(SqlParser.scala:49)
org.apache.spark.sql.DataFrame.filter(DataFrame.scala:768)
Please advise...

We implemented a workaround to overcome this issue.
Workaround:
Create a new column in the DataFrame and copy the values from the actual column (the one containing special characters, such as single quotes, that may cause issues) into the new column with those special characters stripped out:
df = df.withColumn("comment_new", functions.regexp_replace(df.col("comment"),"'",""));
Strip the same special characters from the value used in the filter condition and apply the filter:
commentToFilter = "That's Dany's Reply";
commentToFilter = commentToFilter.replaceAll("'","");
df = df.filter("(someOtherColumn= 'someOtherValue' and comment_new= '"+commentToFilter+"')");
Once the filter has been applied, you can drop the new column that was created for the sole purpose of filtering, restoring the DataFrame to its original shape:
df = df.drop("comment_new");
If you don't want to create a new column in the DataFrame, you can instead replace the special character with some "never-occurring" placeholder string in the same column, e.g.
df = df.withColumn("comment", functions.regexp_replace(df.col("comment"),"'","^^^^"));
and do the same replacement on the string literal that you want to compare against:
commentToFilter = "That's Dany's Reply";
commentToFilter = commentToFilter.replaceAll("'","^^^^");
df = df.filter("(someOtherColumn= 'someOtherValue' and comment= '"+commentToFilter+"')");
Once filtering is done, restore the actual values by reverse-applying the placeholder. Note that the pattern argument of regexp_replace is a regular expression, so the ^ characters have to be escaped:
df = df.withColumn("comment", functions.regexp_replace(df.col("comment"),"\\^\\^\\^\\^", "'"));
Though this doesn't answer the actual question, someone facing the same issue can try it as a workaround.
The actual solution might be to use sqlContext (instead of hiveContext), and/or Dataset (instead of DataFrame), and/or upgrade to spark-hive 2.12.
Experts are welcome to debate and answer.
PS: Thanks to KP, my lead
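A cleaner alternative that sidesteps SQL-string escaping entirely is to build the predicate with the Column API instead of a SQL expression string, so quotes inside the literal never reach the SQL parser. A minimal sketch in PySpark, reusing the column names from the question (the Java Column API, with col(...).equalTo(...), and(...), and functions.not(...), is analogous):
from pyspark.sql import functions as F

# build the predicate from Column objects rather than a SQL string,
# so quotes inside the literal never have to be escaped
predicate = (F.col("someOtherColumn") == "someOtherValue") & \
            (F.col("comment") == "That's Dany's Reply")

# keep only the rows that do NOT match the predicate
df = df.filter(~predicate)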

Related

How to create a DataFrame based on a condition

I want to create a new DataFrame from another for rows that meet a condition such as:
uk_cities_df['location'] = cities_df['El Tarter'].where(cities_df['AD'] == 'GB')
uk_cities_df[:5]
but the resulting uk_cities_df contains only NaN.
The CSV file that I am extracting from has no headers, so pandas used the first row's values as column names. I need uk_cities_df to include only the rows with the ISO code "GB"; as a result, "El Tarter" denotes the location column and "AD" the ISO-code column.
Could you please provide a visual of what uk_cities_df and cities_df look like?
From what I can gather, I think you might be looking for the .loc indexer; you could try, for example:
uk_cities_df['location'] = cities_df.loc[cities_df['AD'] == 'GB']['location']
Also, I did not really get what role 'El Tarter' plays here; maybe you could give more details?
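Since the source CSV has no header row, part of the fix may be to assign column names at read time so that the first data row is not consumed as the header. A minimal sketch, with a hypothetical file name and column names:
import pandas as pd

# hypothetical file name and column names; the real CSV has no header row
cities_df = pd.read_csv('cities.csv', header=None, names=['location', 'iso_code'])

# keep only the rows whose ISO code is 'GB'
uk_cities_df = cities_df[cities_df['iso_code'] == 'GB'].reset_index(drop=True)
uk_cities_df.head()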

Building the PySpark syntax dynamically

We have a requirement to save the ETL operation rules in a MySQL database and run an AWS Glue job based on the rules coded in PySpark.
We are going to save the actual PySpark syntax in the rules table as strings, like below:
s.no|rule|output
1|df1.join(df2, on=['age'], how='right_outer')|df3
2|df3.join(df4, on=['age'], how='right_outer')|df5
3|df5.join(df6, on=['age'], how='right_outer')|df7
We are going to pull this from the DB, store it as a DataFrame, and loop over it:
for i in DF:
    i.output = i.rule
    # after substituting the value it looks like: df3 = df1.join(df2, on=['age'], how='right_outer')
But the join operation is not happening. Since the values are stored as strings in the DB, it is just substituting the values.
Please help me here: what needs to change for the join operations to actually be executed? Do I need to change the data type?
Many thanks in advance.
Can you try this:
df_all = df1.join(df2, on=['age'], how='right_outer')\
.join(df3, on=['age'], how='right_outer')\
.join(df4, on=['age'], how='right_outer')
df_all.show()
Best,
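Regarding the original question of executing rule strings pulled from the database: assigning a string to a variable never executes it, so the stored PySpark expression has to be evaluated explicitly. A minimal sketch, assuming a rules DataFrame (here called rules_df) with 'rule' and 'output' columns, that the input DataFrames (df1, df2, df4, df6) already exist, and that the rules come back in s.no order; note that eval() runs arbitrary code, so the rules table must be trusted:
# names that the stored rules are allowed to reference
dataframes = {'df1': df1, 'df2': df2, 'df4': df4, 'df6': df6}

# rules_df is the DataFrame loaded from the MySQL rules table,
# assumed here to be sorted by the sequence column (s.no)
for row in rules_df.collect():
    # evaluate the stored expression and bind the result under
    # the name given in the 'output' column (e.g. df3)
    dataframes[row['output']] = eval(row['rule'], {}, dataframes)

dataframes['df7'].show()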

How to find columns whose names contain a specific string

I have a pandas DataFrame with the following columns:
code nozzle_no nozzle_var nozzle_1 nozzle_2 nozzle_3 nozzle_4
I want to get the column names nozzle_1, nozzle_2, nozzle_3, nozzle_4 from the above DataFrame.
I am doing the following in pandas:
colnames= sir_df_subset.columns[sir_df_subset.columns.str.contains(pat = 'nozzle_')]
But it also includes nozzle_no and nozzle_var, which I do not want. How do I do this in pandas?
You can use the regex param of df.filter here:
df.filter(regex='nozzle_\d+')
.str.contains has a regex flag that is True by default, so you can pass a regex:
colnames= sir_df_subset.columns[sir_df_subset.columns.str.contains(pat = 'nozzle_\d+$')]
but the answer of @anky_91 with df.filter is MUCH better.
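If you only need the names rather than a sub-DataFrame, you can take .columns of the filtered frame; a minimal sketch using the column names from the question:
# df.filter returns a sub-DataFrame; .columns gives just the matching names
nozzle_cols = df.filter(regex=r'^nozzle_\d+$').columns.tolist()
print(nozzle_cols)  # ['nozzle_1', 'nozzle_2', 'nozzle_3', 'nozzle_4']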

Splunk: formatting a CSV file during indexing, values are being treated as new columns?

I am trying to create a new field during indexing; however, the fields become columns instead of values when I try to concat. What am I doing wrong? I have looked in the docs and it seems according ...
Would appreciate some help on this.
e.g. the .csv file:
Header1,Header2
Value1 ,121244
transform.config
[test_transformstanza]
SOURCE_KEY = fields:Header1,Header2
REGEX =^(\w+\s+)(\d+)
FORMAT = testresult::$1.$2
WRITE_META = true
fields.config
[testresult]
INDEXED = True
The regex is good and creates two groups from the data, but why is it creating a new field instead of assigning the value to testresult? If I use testresult::$1 or testresult::$2 alone it works fine, but when concatenating it creates multiple headers with the value as the header name. Is there an easier way to concat fields? E.g., if you have a CSV file with header names, can you not just refer to the header names? (I know how to do this using calculated fields, but I want to do it during indexing.)
Thanks

How to filter 'NaN' in Pig

I have data that has some rows that look like this:
(1655,var0,var1,NaN)
The first column is an ID, the second and third are the two variables being correlated, and the fourth column is the correlation value (from using the COR function).
I would like to filter out these rows.
From the Apache Pig documentation, I was under the impression that NaN is equivalent to a null. Therefore I added this to my code:
filter_corr = filter correlation by (corr IS NOT NULL);
This obviously did not work since apparently Pig does not treat null and NaN in the same way.
I would like to know what is the correct way to filter NaN since it is not clear from the Pig documentation.
Thanks!
Alternatively, you could specify your column as a chararray in your schema and filter with a not matches 'NaN'.
Or, if you want to replace your NaNs with something else, keep the chararray in your schema as before and then:
Data = FOREACH Data GENERATE ..., (correlation matches 'NaN' ? 0 : (double) correlation), ...
I hope this could help, good luck ;)
You could read in the data as one chararray line and then use a UDF to parse the rows. I made a dataset; it looks like this:
1665,var0,var1,NaN
1453,var2,var3,5.432
3452,var4,var5,7.654
8765,var6,var7,NaN
Create the UDF:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
### name of file: udf.py ###

@outputSchema("id:int, col2:chararray, col3:chararray, corr:float")
def format_input(line):
    # split the raw line and replace a trailing 'NaN' with null
    parsed = line.split(',')
    if parsed[len(parsed) - 1] == 'NaN':
        parsed.pop()
        parsed.append(None)
    return tuple(parsed)
Then in the Pig shell:
$ pig -x local
grunt>
/* register udf */
register 'udf.py' using jython as udf;
data = load 'file' as (line:chararray);
A = foreach data generate FLATTEN(udf.format_input(line));
filtered = filter A by corr is not null;
dump filtered;
Output:
(1453,var2,var3,5.432)
(3452,var4,var5,7.654)
I've gone with this solution:
filter_corr = filter data by (corr != 'NaN');
data1 = foreach filter_corr generate ID, (double)corr as double_corr;
I renamed the column and cast the data type from chararray to double.
I appreciate the responses, but I cannot use UDFs during prototyping due to a limitation in the UI that I am using (Cloudera).