I need to replace a null character in a spark sql string.
I can't find an equivalent of COLLATE in spark sql.
Can you help?
If you want to replace every NULL character in a string, you can use regexp_replace. Depending on the encoding and the programming language, the NULL character is written differently: \000, \x00, \z, or \u0000. [Wikipedia]
from pyspark.sql import functions as F

null_character = u'\u0000'
replacement = ' '
df = df.withColumn('e', F.regexp_replace(F.col('columnX'), null_character, replacement))
Related response: https://stackoverflow.com/a/41152572/14338716
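To check that these escapes denote the same character, here is a minimal plain-Python sketch that needs no Spark session. The helper name strip_nulls is made up for this illustration; it applies the same kind of regular-expression replacement to an ordinary string.

```python
import re

# '\u0000', '\x00' and '\000' are three spellings of the same single character.
NULL_CHAR = "\u0000"
assert NULL_CHAR == "\x00" == "\000"


def strip_nulls(text: str, replacement: str = " ") -> str:
    """Replace every NULL character in text with replacement (hypothetical helper)."""
    # r"\x00" is interpreted by the regex engine as the NULL character.
    return re.sub(r"\x00", replacement, text)


cleaned = strip_nulls("ab\x00cd\x00")
print(repr(cleaned))  # 'ab cd '
```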
I need to clean a character column, and for that I am using the REGEXP_REPLACE function in Teradata 14.
The same code worked for another data source with the same LATIN encoding.
SHOW TABLE gives the following definition of the data:
CREATE SET TABLE pp_oap_cj_t.dc_loss_fdr_kn ,NO FALLBACK ,
NO BEFORE JOURNAL,
NO AFTER JOURNAL,
CHECKSUM = DEFAULT,
DEFAULT MERGEBLOCKRATIO
( PARENT_ID DECIMAL(38,0),
FS_MRCH_NM VARCHAR(25) CHARACTER SET LATIN NOT CASESPECIFIC
) PRIMARY INDEX ( PARENT_ID );
The query I am performing is as below:
CREATE TABLE pp_oap_pb_piyush_t.CHECKMERCHANT1 AS (
SELECT
FS_MRCH_NM,
REGEXP_REPLACE(trim(Upper(trim(REGEXP_REPLACE( (FS_MRCH_NM ) , '[^a-z]',' ',1,0,'i'))) ), '[[:space:]]+',' ',1,0,'i') as cleaned_merchant
FROM pp_oap_pb_piyush_t.CHECKMERCHANT)
WITH DATA PRIMARY INDEX (FS_MRCH_NM);
Error
CREATE TABLE Failed. 6706: The string contains an untranslatable character.
I need a quick workaround for this bottleneck.
Any help is much appreciated.
Thanks!
Under the hood, REGEXP_REPLACE converts character set LATIN to UNICODE. Your column is defined as CHARACTER SET LATIN, so you see the error whenever the data contains something that cannot be converted from Latin to Unicode. The best fix is to change your DDL so that the column uses character set UNICODE instead of LATIN. Alternatively, something like TRANSLATE(FS_MRCH_NM USING LATIN_TO_UNICODE WITH ERROR) in place of FS_MRCH_NM in your query should work. The drawback is that rows containing untranslatable characters can end up with null values.
UPDATE CFSYSUAT.Metadata
SET MetadataTxt =
CASE WHEN TRANSLATE_CHK(MetadataTxt USING UNICODE_TO_LATIN) > 0
THEN SUBSTRING(MetadataTxt,1,TRANSLATE_CHK(MetadataTxt USING UNICODE_TO_LATIN) - 1) ||
SUBSTRING(MetadataTxt,TRANSLATE_CHK(MetadataTxt USING UNICODE_TO_LATIN) + 1)
ELSE MetadataTxt END;
I had some luck with TRANSLATE_CHK.
It returns the position of the first offending character, or 0 if there is none.
I used it with SUBSTRING to remove the offending character.
If the text contains multiple bad characters, you have to run the update multiple times; each pass removes one more bad character.
HTH,
Nathan
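The repeated-pass idea can be sketched outside the database. This is only an illustration under stated assumptions: the Python function translate_chk below is a stand-in written for this example (not Teradata's implementation), and Python's latin-1 codec is assumed to approximate the LATIN character set.

```python
def translate_chk(text: str, encoding: str = "latin-1") -> int:
    """Stand-in for TRANSLATE_CHK: 1-based position of the first character
    that cannot be encoded in `encoding`, or 0 if everything translates."""
    for position, char in enumerate(text, start=1):
        try:
            char.encode(encoding)
        except UnicodeEncodeError:
            return position
    return 0


def remove_untranslatable(text: str) -> tuple:
    """Repeat the single-character removal until no bad character is left.
    Returns the cleaned text and the number of passes that changed it."""
    passes = 0
    pos = translate_chk(text)
    while pos > 0:
        # Same slicing as SUBSTRING(t, 1, pos - 1) || SUBSTRING(t, pos + 1)
        text = text[: pos - 1] + text[pos:]
        passes += 1
        pos = translate_chk(text)
    return text, passes


# The euro sign and the check mark are not representable in latin-1.
cleaned, passes = remove_untranslatable("Caf\u00e9 \u20ac5 \u2713 menu")
print(repr(cleaned), passes)
```

Each loop iteration corresponds to one run of the UPDATE, which is why a value with two bad characters needs two passes.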
A non-Unicode-compatible version of oreplace is installed in our syslib, and a Unicode-compatible version is in our td_sysfnlib. When the database is not specified, syslib is used before td_sysfnlib. So forcing TD to use the td_sysfnlib version of oreplace solved the problem.
Here's the code used:
SELECT td_sysfnlib.OREPLACE(item_name,'|','') FROM databaseB.sales;
I hope that helps anyone else who's running into the same issue!
TRANSLATE(OREPLACE(TRANSLATE(item_name USING LATIN_TO_UNICODE WITH ERROR),'|','') USING UNICODE_TO_LATIN WITH ERROR) AS LBL
I am working with Sybase SQL and want to exclude all entries that look like this:
(NOT PRESENT)
So I tried using:
SELECT col FROM table WHERE col NOT LIKE '(%)'
Do you know what is happening? I think I need to escape ( somehow, but I do not know how. The following returns an error:
SELECT col FROM table WHERE col NOT LIKE '\(%\)' ESCAPE '\'
Kind Regards
Try this:
SELECT col FROM table WHERE col NOT LIKE ('(%)')
You might find this helpful
Sybase Event Stream Processor 5.0 CCL Programmers Guide - String Functions
like()
Scalar. Determines whether a given string matches a specified pattern string.
Syntax
like ( string, pattern )
Parameters
string A string.
pattern A pattern of characters, as a string. Can contain wildcards.
Usage
Determines whether a string matches a pattern string. The function returns 1 if the string matches the pattern, and 0 otherwise. The pattern argument can contain wildcards: '_' matches a single arbitrary character, and '%' matches 0 or more arbitrary characters. The function takes in two strings as its arguments, and returns an integer.
Note: In SQL, the infix notation can also be used: sourceString like patternString.
Example
like ('MSFT', 'M%T') returns 1.
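To make the wildcard rules quoted above concrete, here is a rough Python sketch that translates a pattern into a regular expression. It assumes only the two wildcards described in the quoted text ('_' and '%') and treats every other character, parentheses included, as a literal. The function like here is written for this illustration and is not Sybase code.

```python
import re


def like(string: str, pattern: str) -> int:
    """Return 1 if string matches the LIKE-style pattern, 0 otherwise."""
    parts = []
    for char in pattern:
        if char == "%":
            parts.append(".*")  # zero or more arbitrary characters
        elif char == "_":
            parts.append(".")   # exactly one arbitrary character
        else:
            parts.append(re.escape(char))  # everything else is a literal
    regex = "".join(parts)
    return 1 if re.fullmatch(regex, string, flags=re.DOTALL) else 0


print(like("MSFT", "M%T"))           # 1
print(like("(NOT PRESENT)", "(%)"))  # 1
print(like("PRESENT", "(%)"))        # 0
```

Under these rules the parentheses need no escaping, so a pattern such as '(%)' already matches values like (NOT PRESENT).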
I'm trying to check whether a string contains a special character (anything other than 0 to 9, A to Z, or a to z). The in-house language I'm using has very limited functions for this (it is possible, but it would take a lot of lines), but I am able to run a SQL query. Is it possible to do this check with a query against the dual table? My plan is to pass the string to a variable, and this variable will be used in my SQL command. Thanks in advance.
Here is what you can use
SELECT REGEXP_INSTR('Test!ing','[^[:alnum:]]') FROM dual;
This will return a number other than 0 whenever your string has anything other than letters or numbers.
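For comparison, the same check can be sketched in Python. The function name first_special_position is invented for this example, and the character class is assumed to be ASCII-only (0-9, A-Z, a-z), which may differ from what [[:alnum:]] accepts in a given database character set.

```python
import re


def first_special_position(text: str) -> int:
    """1-based position of the first character outside 0-9, A-Z, a-z; 0 if none."""
    match = re.search(r"[^0-9A-Za-z]", text)
    return match.start() + 1 if match else 0


print(first_special_position("Test!ing"))  # 5
print(first_special_position("Testing1"))  # 0
```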
You can use TRANSLATE to remove all okay characters from the string. You get back a string containing only undesired characters - or an empty string when there are none.
select translate(
'AbcDefg1234%99.26éXYZ', -- your string
'.ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789',
'.') from dual;
returns: %.é
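The same "delete everything that is allowed" idea can be sketched in Python with str.translate. The names ALLOWED and undesired_characters are made up for this illustration. Python can delete characters directly, so the leading '.' placeholder used in the SQL version is not needed here.

```python
import string

# Characters considered acceptable: A-Z, a-z, 0-9.
ALLOWED = string.ascii_letters + string.digits

# A translation table that deletes every allowed character.
DELETE_ALLOWED = str.maketrans("", "", ALLOWED)


def undesired_characters(text: str) -> str:
    """Return only the characters of text that are not in ALLOWED."""
    return text.translate(DELETE_ALLOWED)


print(undesired_characters("AbcDefg1234%99.26éXYZ"))  # %.é
print(repr(undesired_characters("Abc123")))            # ''
```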
I have a Postgres function in which I append values to a query, so that I have:
DECLARE
clause text = '';
After appending, I have something like:
clause = "and name='john' and age='24' and location ='New York';"
I append the above to the WHERE clause of a query I already have. When the query executes, "and" comes directly after "where", which results in an error.
How can I use regexp_replace to remove the first "and" from clause before appending it to the query?
Instead of fixing clause after the fact, you could avoid the problem by using
concat_ws (concatenate with separator):
clause = concat_ws(' and ', 'name=''john''', 'age=''24''', 'location =''New York''');
will make clause equal to
"name='john' and age='24' and location ='New York'"
This can be even simpler: use right() with a negative length.
With a negative n, right() returns everything except the first |n| characters, so you don't need to specify the length of the string. Faster, simpler.
Double quotes (") are for identifiers in Postgres (and standard SQL) and incorrect in your example. Enclose string literals in single quotes (') and escape single quotes within - or use dollar quoting:
Insert text with single quotes in PostgreSQL
Since this is a plpgsql assignment, use the proper assignment operator :=. The SQL assignment operator = is tolerated, too, but can lead to ambiguity in corner cases.
Finally, you can assign a variable in plpgsql at declaration time. Assignments in plpgsql are still cheap but more expensive than in other programming languages.
DECLARE
clause text := right($$and name='john' and age='24' ... $$, -4);
All that said, it seems like you are trying to work with dynamic SQL and are starting off on the wrong foot here. If those values can change, pass them as values with the USING clause of EXECUTE instead, and be wary of SQL injection. Read some of the related questions and answers on the matter:
https://stackoverflow.com/search?q=[plpgsql]+[dynamic-sql]+EXECUTE+USING
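The principle of passing values separately from the SQL text can be sketched outside plpgsql too. The example below uses Python's sqlite3 module as a stand-in, so it is an analogy to EXECUTE ... USING rather than the same mechanism. The table people, its rows, and the filters are all hypothetical.

```python
import sqlite3

# Hypothetical table and data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER, location TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?)",
    [("john", 24, "New York"), ("john", 31, "Boston"), ("mary", 24, "New York")],
)

filters = {"name": "john", "age": 24, "location": "New York"}

# Column names come from a fixed whitelist; values travel separately as parameters.
allowed_columns = {"name", "age", "location"}
pairs = [(col, val) for col, val in filters.items() if col in allowed_columns]

query = "SELECT name, age, location FROM people"
if pairs:
    # Joining with a separator means there is never a leading "AND" to strip.
    query += " WHERE " + " AND ".join(f"{col} = ?" for col, _ in pairs)

rows = conn.execute(query, [val for _, val in pairs]).fetchall()
print(rows)  # [('john', 24, 'New York')]
```

Because the values never become part of the SQL string, quoting problems and injection through the values do not arise in this sketch.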
You do not need regex:
clause = substr(clause, 5, 10000);
clause = substr(clause, 5, length(clause)- 4); -- version for formalists
concat_ws sounds like the best option, but as a general solution for things like this (or any sort of list with a delimiter) you can use logic like (pseudocode):
delim = '';
while (more appendages)
clause = clause + delim + nextAppendage;
delim = ' AND ';
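A runnable Python version of this delimiter logic might look like the following. The function name join_conditions and the sample conditions are assumptions made for the example.

```python
def join_conditions(conditions):
    """Concatenate conditions, putting the delimiter before every item but the first."""
    clause = ""
    delim = ""  # empty for the first item, so nothing precedes it
    for condition in conditions:
        clause += delim + condition
        delim = " and "
    return clause


conditions = ["name='john'", "age='24'", "location ='New York'"]
clause = join_conditions(conditions)
print(clause)  # name='john' and age='24' and location ='New York'
```

In Python itself, " and ".join(conditions) gives the same result; the explicit loop is shown only because it carries over to languages without a join function.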
If you want to do it with regular expression try this:
result = regexp_replace(clause, '^and ', '')
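The anchored-pattern approach and the fixed-offset approach can be compared in a short Python sketch. The sample clause is taken from the question without its trailing semicolon, and the variable names are made up.

```python
import re

clause = "and name='john' and age='24' and location ='New York'"

# Regex version: '^' anchors the match to the start, so only the leading "and " is removed.
by_regex = re.sub(r"^and ", "", clause)

# Slicing version: drop the first four characters ("and "), whatever they are.
by_slice = clause[4:]

print(by_regex)              # name='john' and age='24' and location ='New York'
print(by_regex == by_slice)  # True
```

The regex version leaves the string untouched when it does not start with "and ", whereas the offset version always removes four characters.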
How can I compare two strings in MS SQL 2000 without taking their case into account?
Example:
String1 = Anish
String2 = anish
When comparing Anish = anish, the result is "the strings are not equal". How can we compare these strings so that they are treated as equal?
Here is some information about case sensitivity. As far as I can see, the problem comes from how the server was installed.
Case sensitive search
Change the collation of the strings to some form of CI (case insensitive).
E.g. COLLATE Latin1_General_CI_AS
Try the following queries separately in the Northwind database:
SELECT * FROM dbo.Customers WHERE Country COLLATE SQL_Latin1_General_CP1_CS_AS ='Germany'
SELECT * FROM dbo.Customers WHERE Country COLLATE SQL_Latin1_General_CP1_CS_AS ='geRmany'
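The effect of choosing a collation per comparison can be tried without a SQL Server instance. The sketch below uses Python's sqlite3 module, whose built-in BINARY and NOCASE collations stand in for the CS and CI collations above. This is an analogy only: the collation names differ from SQL Server's, and the customers table is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Alfreds", "Germany"), ("Bon app", "France")],
)


def count_matches(collation: str, value: str) -> int:
    """Count rows whose country equals value under the given SQLite collation."""
    if collation not in ("BINARY", "NOCASE"):
        raise ValueError("unsupported collation")
    sql = f"SELECT COUNT(*) FROM customers WHERE country COLLATE {collation} = ?"
    return conn.execute(sql, (value,)).fetchone()[0]


print(count_matches("BINARY", "Germany"))  # 1
print(count_matches("BINARY", "geRmany"))  # 0
print(count_matches("NOCASE", "geRmany"))  # 1
```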
String comparison in Java is used to compare two different strings.
We can compare strings irrespective of case (upper case/lower case).
Consider str1="HELLO WORLD";
str2="hello world";
If we want to compare these two strings, there are two methods:
int compareTo(String)
int compareToIgnoreCase(String)
Comparing strings:
str1.compareTo(str2);
This returns a nonzero integer, meaning the strings are not equal, because compareTo is case-sensitive.
You can also compare the strings irrespective of their case using:
str1.compareToIgnoreCase(str2);
This returns 0, meaning the strings are equal, because it compares the characters stored in str1 and str2 without regard to case.
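To show what the integer results look like, here is a small Python emulation of the two comparisons. It is an approximation written for this example, not Java's implementation: it compares code points and lowercases whole strings, which matches Java's behaviour for plain ASCII input such as the strings above.

```python
def compare_to(a: str, b: str) -> int:
    """Emulate a lexicographic compareTo: difference at the first mismatch,
    otherwise the difference in length. 0 means the strings are equal."""
    for ch_a, ch_b in zip(a, b):
        if ch_a != ch_b:
            return ord(ch_a) - ord(ch_b)
    return len(a) - len(b)


def compare_to_ignore_case(a: str, b: str) -> int:
    """Same comparison after lowercasing both strings (ASCII-safe approximation)."""
    return compare_to(a.lower(), b.lower())


str1 = "HELLO WORLD"
str2 = "hello world"
print(compare_to(str1, str2))              # -32  ('H' is 72, 'h' is 104)
print(compare_to_ignore_case(str1, str2))  # 0
```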