String concatenation in a Spark SQL query

I'm experimenting with Spark and Spark SQL and I need to concatenate a value at the beginning of a string field that I retrieve as output from a select (with a join) like the following:
val result = sim.as('s)
.join(
event.as('e),
Inner,
Option("s.codeA".attr === "e.codeA".attr))
.select("1"+"s.codeA".attr, "e.name".attr)
Let's say my tables contain:
sim:
codeA,codeB
0001,abcd
0002,efgh
events:
codeA,name
0001,freddie
0002,mercury
And I would want as output:
10001,freddie
10002,mercury
In SQL or HiveQL I know the concat function is available, but Spark SQL does not seem to support it. Can somebody suggest a workaround?
Thank you.
Note:
I'm using Language Integrated Queries, but I could switch to a "standard" Spark SQL query if the solution requires it.

The prefix you want in the output does not seem to be part of your selection or your SQL logic, if I understand correctly. Why not format the output as a further step?
val results = sqlContext.sql("SELECT s.codeA, e.code FROM foobar")
results.map(t => ("1" + t(0), t(1))).collect()

It's relatively easy to implement new Expression types directly in your project. Here's what I'm using:
case class Concat(children: Expression*) extends Expression {
override type EvaluatedType = String
override def foldable: Boolean = children.forall(_.foldable)
def nullable: Boolean = children.exists(_.nullable)
def dataType: DataType = StringType
def eval(input: Row = null): EvaluatedType = {
children.map(_.eval(input)).mkString
}
}
val result = sim.as('s)
.join(
event.as('e),
Inner,
Option("s.codeA".attr === "e.codeA".attr))
.select(Concat("1", "s.codeA".attr), "e.name".attr)
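On newer Spark releases a custom expression should no longer be necessary, since org.apache.spark.sql.functions provides concat and lit (available since Spark 1.5). The following is a minimal sketch, assuming Spark 2.x or later on the classpath; the object name ConcatSketch, the local SparkSession setup, and the in-memory copies of the question's sample tables are illustrative assumptions, not part of the original answers.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit}

// Hypothetical helper object, named here only for illustration.
object ConcatSketch {
  // Joins the two sample tables and prefixes codeA with the literal "1".
  def run(spark: SparkSession): Seq[(String, String)] = {
    import spark.implicits._

    // Sample data copied from the question.
    val sim = Seq(("0001", "abcd"), ("0002", "efgh")).toDF("codeA", "codeB")
    val event = Seq(("0001", "freddie"), ("0002", "mercury")).toDF("codeA", "name")

    sim.as("s")
      .join(event.as("e"), col("s.codeA") === col("e.codeA"), "inner")
      // lit("1") turns the constant into a Column so concat can combine it with codeA.
      .select(concat(lit("1"), col("s.codeA")).as("code"), col("e.name"))
      .as[(String, String)]
      .collect()
      .toSeq
      .sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("concat-sketch").getOrCreate()
    try run(spark).foreach(println)
    finally spark.stop()
  }
}
```

The same function can be used in a SQL string, for example SELECT concat('1', s.codeA), e.name FROM ..., once both tables are registered as temporary views.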

Scala Spark: Parse SQL string to Column

I have two functions, foo and bar, that I want to write as follows:
def foo(df : DataFrame, conditionString : String) = {
val conditionColumn : Column = something(conditionString) //help me define "something"
bar(df, conditionColumn)
}
def bar(df : DataFrame, conditionColumn : Column) = {
df.where(conditionColumn)
}
Here conditionString is a SQL string such as "person.age >= 18 AND person.citizen == true".
Because reasons, I don't want to change the type signatures here. I feel this should work because if I could change the type signatures, I could just write:
def foobar(df : DataFrame, conditionString : String) = {
df.where(conditionString)
}
As .where is happy to accept a sql string expression.
So, how can I turn a string representing a column expression into a column? If the expression were just the name of a single column in df I could just do col(colName), but that doesn't seem to take the range of expressions that .where does.
If you need more context for why I'm doing this: I'm working on a Databricks notebook that can only accept string arguments (and needs to take a condition as an argument), and it calls a library that I want to take Column-typed arguments.
You can use functions.expr:
def expr(expr: String): Column
Parses the expression string into the column that it represents
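Applied to the question's foo and bar, this could look like the sketch below. It assumes Spark 2.x or later; the object name ExprSketch, the sample rows, and the column names name, age and citizen are made up for illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

// Hypothetical helper object, named here only for illustration.
object ExprSketch {
  // Signatures kept as in the question: bar takes a Column, foo takes a String.
  def bar(df: DataFrame, conditionColumn: Column): DataFrame =
    df.where(conditionColumn)

  def foo(df: DataFrame, conditionString: String): DataFrame =
    bar(df, expr(conditionString)) // expr parses the SQL string into a Column

  // Builds a small made-up table and returns the names that pass the filter.
  def adultCitizens(spark: SparkSession): Seq[String] = {
    import spark.implicits._
    val people = Seq(("ann", 17, true), ("bob", 42, true), ("cy", 30, false))
      .toDF("name", "age", "citizen")
    foo(people, "age >= 18 AND citizen = true")
      .select("name")
      .as[String]
      .collect()
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("expr-sketch").getOrCreate()
    try adultCitizens(spark).foreach(println)
    finally spark.stop()
  }
}
```

Because expr returns an ordinary Column, the result can also be combined with other Column expressions (for example with && or ||) before it is passed on.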

How to create a filter on an AWS Glue DynamicFrame that filters out a set of (literal) values

In a Glue script (running in a Zeppelin notebook forwarding to a dev endpoint in Glue), I've created a DynamicFrame from a Glue table, and I would like to keep only the records whose field "name" is not in a static list of values, for example ("a","b","c").
Filtering on non-equality works just fine like this:
def unknownNameFilter(rec: DynamicRecord): Boolean = {
rec.getField("name").exists(_ != "a")
}
I have tried several things like
!rec.getField("name").exists(_ isin ("a","b","c"))
but it gives errors (value isin is not a member of Any). On the web I can only find PySpark examples and examples that first convert the DynamicFrame to a DataFrame (which I want to avoid if possible).
Help much appreciated, thanks.
Okay, I found my answer; I'll post it for anyone else looking for this. It is done with
!(knownevents.contains(eventname))
Like this in a filter function:
def unknownEventFilter(rec: DynamicRecord): Boolean = {
val knownevents = List("evt_a","evt_b")
rec.getField("name") match {
case Some(eventname: String) => !(knownevents.contains(eventname))
case _ => throw new IllegalArgumentException(s"Unable to extract field name")
}
}
val dfUnknownEvents = df.filter(unknownEventFilter)
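The membership test itself does not depend on Glue, so it can be tried out in plain Scala. In the sketch below, Option[Any] stands in for the value returned by rec.getField("name") (as the error message in the question suggests), the known names are kept in a Set that is built once rather than per record, and records without a string name are dropped instead of raising an exception. That last choice, and the object name MembershipFilterSketch, are assumptions that differ from the answer above.

```scala
// Hypothetical helper object, named here only for illustration.
object MembershipFilterSketch {
  // Built once, outside the predicate, so it is not re-created for every record.
  val knownEvents: Set[String] = Set("evt_a", "evt_b")

  // `field` stands in for the result of rec.getField("name").
  def isUnknownEvent(field: Option[Any]): Boolean = field match {
    case Some(name: String) => !knownEvents.contains(name)
    case _                  => false // assumption: drop records with no string name
  }

  def main(args: Array[String]): Unit = {
    val fields: Seq[Option[Any]] = Seq(Some("evt_a"), Some("evt_c"), None, Some(42))
    println(fields.filter(isUnknownEvent)) // prints List(Some(evt_c))
  }
}
```

Inside a Glue filter function the body would stay the same, with rec.getField("name") passed in as the argument.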

DBArrayList to List<Map> Conversion after Query

Currently, I have a SQL query that returns information to me in a DBArrayList.
It returns data in this format: [{id=2kjhjlkerjlkdsf324523}]
For the next step, I need it to be in a List<Map> format without the id: [2kjhjlkerjlkdsf324523]
The Datatypes being used are DBArrayList, and List.
If it helps any, the next step is a function to collect the list and then to replace all single quotes if any [SQL-Injection prevention]. Using:
listMap = listMap.collect() { "'" + Util.removeSingleQuotes(it) + "'" }
public static String removeSingleQuotes(s) {
return s ? s.replaceAll(/'"/, '') : s
}
I spent this morning working on it, and I found out that I needed to actually collect the DBArrayList like this:
listMap = dbArrayList.collect { it.getAt('id')}
If you're in a bind like I was and constrained to a specific schema, this might help, but @ou_ryperd has the correct answer!
While using a DBArrayList is not wrong, Groovy's idiom is to use the db result as a collection. I would suggest you use it that way directly from the db:
Map myMap = [:]
dbhandle.eachRow("select fieldSomeID, fieldSomeVal from yourTable;") { row ->
myMap[row.fieldSomeID] = row.fieldSomeVal.replaceAll(/'"/, '')
}

How to use SQL to query a csv file with Scala?

I am new to Spark with Scala and I am trying to run a SQL query on a CSV file and return the records. Below is what I have, but it is not working:
val file = sc.textFile("file:///data/home_data.csv")
val records = file.sqlContext("SELECT id FROM home_data WHERE yr_built < 1979")
combined.collect().foreach(records)
I get errors with the file.sqlContext function.
Thanks
You can use a case class to map the data to the respective field names and data types, then run your query against the resulting DataFrame:
import sqlContext.implicits._ // needed for toDF
case class Person(first_name: String, last_name: String, age: Int)
// split each CSV line into its fields
val pmap = file.map(line => line.split(","))
val personRDD = pmap.map(p => Person(p(0), p(1), p(2).toInt))
val personDF = personRDD.toDF()
Then query personDF.
I don't know your schema, so I wrote it with a generic Person example.
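On Spark 2.x or later, the built-in CSV reader can load the file directly and the result can be registered as a temporary view for SQL. The sketch below assumes the file has a header row containing the id and yr_built columns used in the question's query; the object name CsvSqlSketch and the three sample rows written to a temporary file are made up for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, named here only for illustration.
object CsvSqlSketch {
  // Loads the CSV at `path`, exposes it as the view home_data, and runs the question's query.
  def oldHomeIds(spark: SparkSession, path: String): Seq[String] = {
    val homes = spark.read
      .option("header", "true")      // first line holds the column names
      .option("inferSchema", "true") // so yr_built is read as a number
      .csv(path)
    homes.createOrReplaceTempView("home_data")

    spark.sql("SELECT id FROM home_data WHERE yr_built < 1979")
      .collect()
      .map(_.get(0).toString)
      .toSeq
  }

  // Writes a tiny made-up CSV so the sketch can run without the real file.
  def writeSample(): String = {
    val tmp = Files.createTempFile("home_data", ".csv")
    Files.write(tmp, "id,yr_built\n1,1955\n2,1987\n3,1978\n".getBytes("UTF-8"))
    tmp.toUri.toString
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("csv-sql-sketch").getOrCreate()
    try oldHomeIds(spark, writeSample()).foreach(println)
    finally spark.stop()
  }
}
```

For the real data, the path from the question (file:///data/home_data.csv) would be passed in place of the temporary file.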

Entity Framework filter data by string sql

I am storing some filter data in my table. Let me make it more clear: I want to store some where clauses and their values in a database and use them when I want to retrieve data from a database.
For example, consider a people table (entity set) and some filters on it in another table:
"age" , "> 70"
"gender" , "= male"
Now when I retrieve data from the people table I want to get these filters to filter my data.
I know I can generate a SQL query as a string and execute that but is there any other better way in EF, LINQ?
One solution is to use the Dynamic LINQ library. With this library you can write:
filterTable = //some code to retrieve it
var whereClause = string.Join(" AND ", filterTable.Select(x=> x.Left + x.Right));
var result = context.People.Where(whereClause).ToList();
This assumes that the filter table has columns Left and Right and that you want to join the filters with AND.
My suggestion is to include more detail in the filter table: for example, separate the operators from the operands, add a column that says whether a row is joined with AND or OR, and add a column that identifies the row it is joined to. You need a tree structure if you want to handle more complex queries like (A AND B) OR (C AND D).
Another solution is to build an expression tree from the filter table. Here is a simple example:
var arg = Expression.Parameter(typeof(People));
Expression whereClause = null;
foreach (var row in filterTable)
{
Expression rowClause;
var left = Expression.PropertyOrField(arg, row.PropertyName);
// row.Right is stored as text, so convert it to the member's type first,
// e.g. int.Parse(row.Right) for an int property
var right = Expression.Constant(Convert.ChangeType(row.Right, left.Type), left.Type);
switch(row.Operator)
{
case "=":
rowClause = Expression.Equal(left, right);
break;
case ">":
rowClause = Expression.GreaterThan(left, right);
break;
case ">=":
rowClause = Expression.GreaterThanOrEqual(left, right);
break;
default:
throw new NotSupportedException("Unsupported operator: " + row.Operator);
}
if(whereClause == null)
{
whereClause = rowClause;
}
else
{
whereClause = Expression.AndAlso(whereClause, rowClause);
}
}
var lambda = Expression.Lambda<Func<People, bool>>(whereClause, arg);
context.People.Where(lambda);
This is a very simplified example; you will need a good deal of validation, type conversion, and so on to make it work for all kinds of queries.
This is an interesting question. First off, make sure you're honest with yourself: you are creating a new query language, and this is not a trivial task (however trivial your expressions may seem).
If you're certain you're not underestimating the task, then you'll want to look at LINQ expression trees (reference documentation).
Unfortunately, it's quite a broad subject, so I encourage you to learn the basics and ask more specific questions as they come up. Your goal is to interpret your filter expression records (fetched from your table) and create a LINQ expression tree for the predicate that they represent. You can then pass the tree to Where() calls as usual.
Without knowing what your UI looks like, here is a simple example of what I was talking about in my comments regarding the Serialize.Linq library:
public void QuerySerializeDeserialize()
{
var exp = "(User.Age > 7 AND User.FirstName == \"Daniel\") OR User.Age < 10";
var user = Expression.Parameter(typeof (User), "User");
var parsExpression =
System.Linq.Dynamic.DynamicExpression.ParseLambda(new[] {user}, null, exp);
//Convert the Expression to JSON
var query = parsExpression.ToJson();
//Deserialize JSON back to expression
var serializer = new ExpressionSerializer(new JsonSerializer());
var dExp = serializer.DeserializeText(query);
using (var context = new AppContext())
{
var set = context.Set<User>().Where((Expression<Func<User, bool>>) dExp);
}
}
You can probably get fancier using reflection and invoking your generic LINQ query based on the types coming in from the expression. This way you can avoid casting the expression like I did at the end of the example.