I am facing a peculiar concatenation problem in a PySpark SQL query:
spark.sql("select *,rtrim(IncomeCat)+' '+IncomeCatDesc as trimcat from Dim_CMIncomeCat_handled").show()
In this query both the IncomeCat and IncomeCatDesc fields hold String values, so logically I thought they would concatenate, but the resulting field is null, where the expected result would be '14100abcd' (14100 being the IncomeCat part and abcd the IncomeCatDesc part). I have also tried explicit casting on the IncomeCat field:
spark.sql("select *,cast(rtrim(IncomeCat) as string)+' '+IncomeCatDesc as IncomeCatAndDesc from Dim_CMIncomeCat_handled").show()
but I am getting the same result. Am I missing something here? Kindly help me solve this.
Spark doesn't override the + operator for strings, and as a result the query you use doesn't express concatenation. If you take a look at a basic example you'll see what is going on:
spark.sql("SELECT 'a' + 'b'").explain()
== Physical Plan ==
*Project [null AS (CAST(a AS DOUBLE) + CAST(b AS DOUBLE))#48]
+- Scan OneRowRelation[]
Both arguments are assumed to be numeric; if either string cannot be cast to a double, the cast yields NULL and so does the sum, which is exactly why your query returns null. It will of course work for strings that can be cast to numbers:
spark.sql("SELECT '1' + '2'").show()
+---------------------------------------+
|(CAST(1 AS DOUBLE) + CAST(2 AS DOUBLE))|
+---------------------------------------+
| 3.0|
+---------------------------------------+
To concatenate strings you can use concat:
spark.sql("SELECT CONCAT('a', 'b')").show()
+------------+
|concat(a, b)|
+------------+
| ab|
+------------+
or concat_ws:
spark.sql("SELECT CONCAT_WS('*', 'a', 'b')").show()
+------------------+
|concat_ws(*, a, b)|
+------------------+
| a*b|
+------------------+
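Applied to your original query (a sketch, assuming the table and column names from your question), concat produces the '14100abcd' result you expect:
spark.sql("select *, concat(rtrim(IncomeCat), IncomeCatDesc) as IncomeCatAndDesc from Dim_CMIncomeCat_handled").show()
Note that concat returns NULL if any argument is NULL, while concat_ws skips NULL arguments.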
Related
I have an SQL issue using Teradata SQL Assistant with the LIKE operator, as shown in the example below:
table A
id|
23_0
111_10
201_540
I want to select only the ids that end with '_0'. I tried the query below, but it gives me all three ids:
select * from A
where id like '%_0'
but I expect only:
id|
23_0
Do you have any idea, please?
The problem is that _ is a special character in LIKE patterns: it matches any single character, so '%_0' matches every id that ends in '0'. One method is to escape it:
where id like '%$_0' escape '$'
You can also use right():
where right(id, 2) = '_0'
Like the SQL ISNUMERIC function, which validates whether an expression is numeric or not, is there an equivalent function in Spark SQL? I have tried to find one but couldn't. Could someone help or suggest an alternative?
Try using a Spark UDF; this approach will help you clone any missing function:
scala> spark.udf.register("IsNumeric", (inpColumn: Int) => BigInt(inpColumn).isInstanceOf[BigInt])
res46: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,BooleanType,Some(List(IntegerType)))
scala> spark.sql(s""" select "ABC", IsNumeric(123) as IsNumeric_1 """).show(false)
+---+-----------+
|ABC|IsNumeric_1|
+---+-----------+
|ABC|true |
+---+-----------+
scala> spark.sql(s""" select "ABC", IsNumeric("ABC") as IsNumeric_1 """).show(false)
+---+-----------+
|ABC|IsNumeric_1|
+---+-----------+
|ABC|null |
+---+-----------+
Here, the above function will return null if the column value is not an integer.
Hope this will be helpful.
For anyone coming here by way of Google :), there is an alternative regex-based answer for isnumeric in Spark SQL:
select
OldColumn,
CASE WHEN OldColumn not rlike '[^0-9]' THEN 1 ELSE 0 END AS OldColumnIsNumeric
from table
The regex checks whether the column contains any non-digit character: not rlike '[^0-9]' is true only when every character is a digit.
You can modify it to fit substrings of the column you are checking too.
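Note that this pattern treats values like '-3' or '12.5' as non-numeric, since only bare digits pass the check. A sketch of a stricter variant allowing an optional sign and decimal part (this pattern is my own assumption, not part of the original answer):
select
OldColumn,
CASE WHEN OldColumn rlike '^-?[0-9]+(\\.[0-9]+)?$' THEN 1 ELSE 0 END AS OldColumnIsNumeric
from table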
Is there a way to match zz-10% in find_in_set?
For example:
select find_in_set('zz-1000','zz-10%,zz-2000,zz-3000');
This should return 1, but Impala doesn't support wildcards in find_in_set.
I am wondering if there is some trick with regex to work around this? find_in_set seems to do only exact matches.
Ideally this should return 1, as I want to avoid hardcoding a bunch of zz-10% variations.
This is the definition of this function from https://www.cloudera.com/documentation/enterprise/5-14-x/topics/impala_string_functions.html
find_in_set(string str, string strList)
Purpose: Returns the position (starting from 1) of the first occurrence of a specified string within a comma-separated string. Returns NULL if either argument is NULL, 0 if the search string is not found, or 0 if the search string contains a comma.
Return type: int
I cannot change zz-1000 (the first parameter) because it's basically a column. I could do a bunch of IF / CASE WHEN expressions, though, if there is a way.
Thanks.
UPDATE 1
I tried this:
select find_in_set('zz-1000','zz-10\d+,zz-2000,zz-3000');
And got this:
+----------------------------------------------------+
| find_in_set('zz-1000', 'zz-10\d+,zz-2000,zz-3000') |
+----------------------------------------------------+
| 0 |
+----------------------------------------------------+
So that doesn't work either.
What about using the REGEXP_LIKE function:
select regexp_like('zz-1000', 'zz-10\\d+$|zz-2000');
+----------------------------------------------+
| regexp_like('zz-1000', 'zz-10\\d+$|zz-2000') |
+----------------------------------------------+
| true |
+----------------------------------------------+
When you have a static number of strings to compare, you can try this:
SELECT CASE
WHEN regexp_like('zz-1000', 'zz-10\\d+$') THEN 1
WHEN regexp_like('zz-1000', 'zz-2000') THEN 2
ELSE 0
END;
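Applied to a column rather than a literal (a sketch; the table t and column id are hypothetical):
SELECT id,
       CASE
           WHEN regexp_like(id, 'zz-10\\d+$') THEN 1
           WHEN regexp_like(id, 'zz-2000') THEN 2
           ELSE 0
       END AS match_pos
FROM t;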
Could someone explain what the code below means?
In an MS Access select join query, with the below condition (inside the WHERE clause):
Table1.col1 <> [table2]!col2
What does it mean?
It means not equal, the same as !=. It fetches all rows where Table1.col1 is not equal to table2's col2.
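For example, a minimal sketch in Access SQL (assuming the two tables are joined on a hypothetical id column; dot notation is used here in place of the bang reference):
SELECT Table1.col1, Table2.col2
FROM Table1 INNER JOIN Table2 ON Table1.id = Table2.id
WHERE Table1.col1 <> Table2.col2;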
Table of operators
https://support.office.com/en-us/article/table-of-operators-e1bc04d5-8b76-429f-a252-e9223117d6bd
Comparison operators
| Operator | Purpose | Example |
| --- | --- | --- |
| <> | Returns True if the first value is not equal to the second value. | Value1 <> Value2 |
---
Introduction to Access SQL
https://support.office.com/en-gb/article/introduction-to-access-sql-d5f21d10-cd73-4507-925e-bb26e377fe7e
Any relational comparison operator: "=," "<," ">," "<=," ">=," or "<>."
---
Definition: The Bang (!) Operator
What the bang operator does is simple and specific:
The bang operator provides late-bound access to the default member of an object, by passing the literal name following the bang operator as a string argument to that default member.
I have float data in a BigQuery table like 5302014.2 and 5102014.4.
I'd like to run a BigQuery SQL that returns the values in String format, but the following SQL yields this result:
select a, string(a) from my_table
5302014.2 "5.30201e+06"
5102014.4 "5.10201e+06"
How can I rewrite my SQL to return:
5302014.2 "5302014.2"
5102014.4 "5102014.4"
Using #standardSQL doesn't have this problem:
$ bq query '#standardSQL
SELECT a, CAST(a AS STRING) AS a_str FROM UNNEST(ARRAY[530201111114.2, 5302014.4]) a'
+-------------------+----------------+
| a | a_str |
+-------------------+----------------+
| 5302014.4 | 5302014.4 |
| 5.302011111142E11 | 530201111114.2 |
+-------------------+----------------+
SELECT STRING(INTEGER(f)) + '.' + SUBSTR(STRING(f-INTEGER(f)), 3)
FROM (SELECT 5302014.5642 f)
(Not a nice hack; a better method would make a great feature request to post at https://code.google.com/p/google-bigquery/issues/list?can=2&q=label%3DFeature-Request.)
Converting your legacy SQL to standard SQL is really the best way forward as far as working with BigQuery is concerned. Standard SQL is much faster and has much better feature support.
For your use case, going with standard SQL and CAST(a AS STRING) would be best.
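For example, a minimal sketch against the table from the question (my_table and column a are the asker's names; this assumes a is FLOAT64):
#standardSQL
SELECT a, CAST(a AS STRING) AS a_str
FROM my_table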