How can I write Spark SQL commands to return fields with case-insensitive results?
Example:
Sample_DF below
+--------+
| name |
+--------+
| Johnny|
| Robert|
| ROBERT|
| robert|
+--------+
It seems that by default Spark SQL is case sensitive in the field you query:
spark.sql("select name from Sample_DF where name like '%Robert%'").show
+------+
|name |
+------+
|Robert|
+------+
What can I do to make the above query case insensitive so that it returns the rows below, assuming there is a large list of Roberts in various lower/uppercase variations?
+--------+
| name |
+--------+
| Robert|
| ROBERT|
| robert|
+--------+
As I understand it, Spark SQL does not support the MSSQL COLLATE clause.
You can make all characters lowercase:
spark.sql("select name from Sample_DF where lower(name) like '%' || lower('Robert') || '%'").show
There is also a built-in function for this; see How to change case of whole column to lowercase?
If you want to take a look at all the names in the name column, you could use the lower function, which converts all characters to lowercase. This snippet is PySpark and assumes from pyspark.sql import functions as F:
Sample_DF.select(F.lower('name')).show()
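If you would rather keep the whole filter in SQL, another option is rlike with Java's inline case-insensitive flag (?i). A minimal sketch, assuming Sample_DF is registered as a temp view as above:
spark.sql("select name from Sample_DF where name rlike '(?i)robert'").show
+------+
|  name|
+------+
|Robert|
|ROBERT|
|robert|
+------+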
MariaDB version 10.3.34.
SQL to create the example tables is on this gist.
I have to work with a foreign database over which I have no control, so suggestions to modify the structure of the DB are, sadly, unacceptable. I can add functions, though.
Now, in this database, things can have from 0 to n colors, and the color references are coded as a string of all the applicable ids joined by a | char. I know this is bad practice, but it is not my DB and I can't change it.
+----------------------+
|        things        |
| name (pkey)| colorsid|
+------------+---------+
| 'door'     | '20|5'  |
| 'car'      | '10'    |
| 'hammer'   | null    |
| 'box'      | '5'     |
+------------+---------+
+------------------+
|      colors      |
|  id  |   color   |
+------+-----------+
|    5 | 'red'     |
|   10 | 'blue'    |
|   20 | 'black'   |
+------+-----------+
So the door is black and red, the car is blue, the hammer has no color, and the box is red.
Is there a way to build a thing_has_color function so I could do something like this:
SELECT name from things WHERE thing_has_color( name, 'red' );
The result would be
+--------+
| name |
+--------+
| 'door' |
| 'box' |
+--------+
Performance is not an issue (to a reasonable extent, of course). The DB is expected to contain at most a few tens of colors, and no more than 10 000 things.
MariaDB has a FIND_IN_SET function, where the set is a list of comma-separated values. Just replace the pipe with a comma:
SELECT name FROM things
WHERE FIND_IN_SET(
    (SELECT id FROM colors WHERE color = 'red'),
    REPLACE(colorsid, '|', ','));
For the door row, for example, REPLACE turns '20|5' into '20,5', and FIND_IN_SET returns 2, the (truthy) position of red's id 5 in that list.
Another option would be to use a regular expression:
SELECT name FROM things
WHERE colorsid REGEXP
concat("[[:<:]]",(SELECT ID FROM colors WHERE color="red"),"[[:>:]]");
However, both solutions will be slow, since they can't use an index.
You may join the tables as follows:
SELECT T.name
FROM things T JOIN colors D
ON CONCAT('|', T.colorsid, '|') LIKE CONCAT('%|', D.id, '|%')
WHERE D.color = 'red'
See a demo.
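If you specifically want the thing_has_color(name, 'red') call from the question, you can wrap the same lookup in a stored function. A minimal sketch, assuming VARCHAR(64) is wide enough for your names and colors (the DELIMITER lines are for the mysql/mariadb command-line client):
DELIMITER //
CREATE FUNCTION thing_has_color(p_name VARCHAR(64), p_color VARCHAR(64))
RETURNS BOOLEAN
READS SQL DATA
BEGIN
    RETURN EXISTS (
        SELECT 1
        FROM things t
        JOIN colors c
          ON FIND_IN_SET(c.id, REPLACE(t.colorsid, '|', ',')) > 0
        WHERE t.name = p_name
          AND c.color = p_color
    );
END//
DELIMITER ;
With that in place, the query from the question works as written:
SELECT name FROM things WHERE thing_has_color( name, 'red' );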
I'm using Spark and I found that my data is not being correctly interpreted. I've tried using the decode and encode built-in functions, but they can be applied to only one column at a time.
Update:
An example of the behaviour I am having:
+-----------+
| Pa�s |
+-----------+
| Espa�a |
+-----------+
And the one I'm expecting:
+-----------+
| País |
+-----------+
| España |
+-----------+
The statement is just a simple
SELECT * FROM table
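If the data comes from text files, one common cause is that the files were written in a non-UTF-8 charset (for example Latin-1) while Spark decodes them as UTF-8 by default. A minimal sketch, assuming a CSV source and an ISO-8859-1 encoding (the path and view name are hypothetical):
val df = spark.read
  .option("header", "true")
  .option("encoding", "ISO-8859-1") // decode bytes as Latin-1 instead of UTF-8
  .csv("/path/to/source.csv")
df.createOrReplaceTempView("my_table")
spark.sql("SELECT * FROM my_table").show()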
There is a column named keyword in the product table.
+---------+
| keyword |
+---------+
| dump |
| dump2 |
| dump4 |
| dump5 |
| pro |
+---------+
I am fetching the rows from the product table whose keyword contains the string du anywhere, using a pattern match.
I used select * from products where keyword LIKE '%[du]%';
but it is returning an empty set.
What am I doing wrong here?
If you must use regex, you can just use du as the regex; that will match the string du anywhere in the keyword:
SELECT *
FROM products
WHERE keyword REGEXP 'du'
Output:
keyword
dump
dump2
dump4
dump5
Demo on dbfiddle
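Note that the original query returns nothing because, unlike SQL Server, MySQL's LIKE has no [ ] character-class syntax, so '%[du]%' looks for the literal characters [du]. A plain wildcard pattern works without any regex:
SELECT *
FROM products
WHERE keyword LIKE '%du%';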
I have created DataFrames for exploding a row into multiple rows based on a delimiter. I have used the explode function for this. I would like to know if I can bypass the use of DataFrames here and use only Spark SQL to perform this operation.
For example, there is the STRTOK function in Teradata to perform this action.
Quick answer: there is no built-in SQL function that breaks a row into multiple rows based on a string value and delimiters as efficiently as flatMap() or explode() in the Dataset API can.
That is simply because in a DataFrame you can manipulate Rows programmatically at a much higher level and granularity than in Spark SQL.
Note: Dataset.explode() is deprecated starting from Spark 2.0:
explode() Deprecated: (Since version 2.0.0) use flatMap() or select() with
functions.explode() instead
Here are examples of both methods recommended in the quote above.
Examples
// Loading testing data
val mockedData = sc.parallelize(Seq("hello, world", "foo, bar")).toDF
+------------+
| value|
+------------+
|hello, world|
| foo, bar|
+------------+
Option 1 - flatMap()
Breaking rows into multiples using flatMap()
scala> mockedData.flatMap( r => r.getString(0).split(",")).show
+------+
| value|
+------+
| hello|
| world|
| foo|
| bar|
+------+
Option 2 - functions.explode()
Replacing the value column with a new set of Rows generated by functions.explode(), the recommended replacement for the deprecated Dataset.explode()
scala> mockedData.withColumn("value", explode(split($"value", "[,]"))).show
+------+
| value|
+------+
| hello|
| world|
| foo|
| bar|
+------+
Switching to the Spark SQL API:
If you want to use sqlContext and start querying the data through SQL, you can now create a temporary view from the resulting Dataset:
scala> val resultedDf = mockedData.flatMap( r => r.getString(0).split(","))
resultedDf: org.apache.spark.sql.Dataset[String] = [value: string]
scala> resultedDf.createOrReplaceTempView("temp")
scala> spark.sql("select * from temp").show
+------+
| value|
+------+
| hello|
| world|
| foo|
| bar|
+------+
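For completeness: split() and explode() are also available as functions inside Spark SQL itself, so the whole operation can stay in SQL once the original data is registered as a view. A short sketch against the mockedData from above:
scala> mockedData.createOrReplaceTempView("mocked")
scala> spark.sql("select explode(split(value, '[,]')) as value from mocked").show
+------+
| value|
+------+
| hello|
| world|
|   foo|
|   bar|
+------+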
I hope this answers your question.
I have a table like this,
+----+-----------+
| Id | Value |
+----+-----------+
| 1 | ABC_DEF |
| 31 | AcdEmc |
| 44 | AbcDef |
| 2 | BAA_CC_CD |
| 55 | C_D_EE |
+----+-----------+
I need a query to get the records whose Value is in camel case only (e.g. AcdEmc, AbcDef, etc., not ABC_DEF).
Please note that this table has only these two types of string values.
You can use UPPER() for this
select * from your_table
where upper(value) <> value COLLATE Latin1_General_CS_AS
If your default collation is case-insensitive, you can force a case-sensitive collation in your WHERE clause; otherwise you can remove that part from the query.
Based on the sample data, the following will work. I think the issue we're dealing with is checking whether the string contains underscores.
SELECT * FROM [Foo]
WHERE Value NOT LIKE '%[_]%';
See Fiddle
UPDATE: Corrected error. I forgot '_' meant "any character".
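If you prefer not to use a bracket class, an equivalent approach is to escape the underscore (otherwise a single-character wildcard in LIKE) with an ESCAPE clause:
SELECT * FROM [Foo]
WHERE Value NOT LIKE '%\_%' ESCAPE '\';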