Replacing null values does not work in pig - apache-pig

I have some columns that are empty in my dataset.
C1;C2
;;;
;;;
;;;
;;;
I did a simple operation to replace the empty values with spaces of a specific length, but only when the field is empty.
C1 and C2 sometimes have these values, respectively:
ZZZZZZZZZZZZZZZZ
ZZZZZZZZZZZZZZ
So I want to replace the empty values with space strings of the same length.
I tried this:
(C1 == '' ? CONCAT(C1, ' ') : C1) AS C1,
(C2 == '' ? CONCAT(C2, ' ') : C2) AS C2;
But this doesn't resolve the problem.
Any help, please?

Maybe try
((C1 is null) OR (C1 == '')) ? ...
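For example, a minimal sketch of the full statement, assuming the relation is named records and that space strings sized to match the Z-values above should stand in for the empty fields (the relation name, field names and widths are illustrative, not from the original post):

-- hypothetical relation and replacement widths; adjust to your script
cleaned = FOREACH records GENERATE
    (((C1 IS NULL) OR (C1 == '')) ? '                ' : C1) AS C1,  -- 16 spaces
    (((C2 IS NULL) OR (C2 == '')) ? '              ' : C2) AS C2;    -- 14 spaces

Note that CONCAT on an empty field only appends a single space, so the bincond supplies the full-width space string directly instead.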

Related

Splitting the string into columns to extract values using BigQuery

How can I split the string by a specific character and extract each value? The idea is that I need to extract each word between the delimiters, including at the start/end of the string, as this information represents something. Is there a regex pattern, or a way to split the info into columns?
Input:
Name
A|B|C|D|E|F|G

Expected output:
Name           col1  col2  col3  col4  col5  col6  col7
A|B|C|D|E|F|G  A     B     C     D     E     F     G
I am using BigQuery for this and couldn't find a way to get all of the values. I tried the regex code below, which only works for the case where we have A|B|C.
I then have to compare each column value and create conditions using CASE WHEN.
CODE:
select
  regexp_extract(name, "\\w+\\S(x|y)") as c2,  -- gives either x or y
  left(regexp_substr(name, "\\w+\\S\\w+\\S\\w+"), 1) as c1,
  right(regexp_extract(name, "\\w+\\S\\w+\\S\\w+"), 1) as c3
from Table
Consider the approach below:
select * from (
  select *
  from your_table, unnest(split(name, '|')) value with offset
)
pivot (any_value(value) as col for offset in (0, 1, 2, 3, 4, 5, 6))
If applied to dummy data as in your question, the output is:
name           col_0  col_1  col_2  col_3  col_4  col_5  col_6
A|B|C|D|E|F|G  A      B      C      D      E      F      G
This seems like a use case for SPLIT().
select
  split(name, "|")[safe_offset(0)] as c1,
  split(name, "|")[safe_offset(1)] as c2,
  ...
from table
see https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#split
Edited to use safe_offset instead of offset, per "Array index 74 is out of bounds (overflow) google big query".
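For the seven-part values in the question, the full select list would look something like this (your_table is a placeholder; SAFE_OFFSET returns NULL instead of raising an error when a part is missing):

select
  split(name, "|")[safe_offset(0)] as c1,
  split(name, "|")[safe_offset(1)] as c2,
  split(name, "|")[safe_offset(2)] as c3,
  split(name, "|")[safe_offset(3)] as c4,
  split(name, "|")[safe_offset(4)] as c5,
  split(name, "|")[safe_offset(5)] as c6,
  split(name, "|")[safe_offset(6)] as c7
from your_table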

Replacing any String that contains "00" or "0" with 00 or 0 (Handling double quotes in SQL Server)

My output in a column currently has:
Pill Type
"00" Vegetarian Capsules : (
"0" Vegetarian Capsules : (
"0" Gelatin Capsules : (
"DINO" : (
I need to replace the entire string with only what is contained between the double quotes, with this being my desired result:
Pill Type
00
0
0
DINO
I'm newer to SQL and previously got by with a CASE statement or even a nested REPLACE() to clean up some strings.
Now that there are double quotes, too many distinct phrases to write out a replacement for each one, and the need to keep only what is contained within the double quotes, I'm stuck and can't quite figure out a solution.
Here's a script that demonstrates how you can get the desired result:
declare @tmp as table (s nvarchar(100) not null);
insert into @tmp values ('"00" Vegetarian Capsules : (');
insert into @tmp values ('"0" Vegetarian Capsules : (');
insert into @tmp values ('"0" Gelatin Capsules : (');
insert into @tmp values ('"DINO" : (');
select SUBSTRING(s, charindex('"', s) + 1, len(s) - charindex('"', reverse(s)) - charindex('"', s)) from @tmp;
However, please be aware that the entire string is the result if no double quote is present. You may use a CASE expression if you need to address this.
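A minimal sketch of that CASE wrapper, reusing the table variable above (rows without double quotes are passed through unchanged):

select case
         when charindex('"', s) > 0  -- assumes quotes, when present, come as a matched pair
           then SUBSTRING(s, charindex('"', s) + 1, len(s) - charindex('"', reverse(s)) - charindex('"', s))
         else s  -- no double quotes: keep the original value
       end as pill_type
from @tmp;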

How to select rows with only Numeric Characters in Oracle SQL

I would like to keep only rows with numeric characters, i.e. 0-9. My source data can have any type of character, e.g. 2, %, (.
Input (postcode)
3453gds sdg3
454232
sdg(*d^
452
Expected Output (postcode)
454232
452
I have tried using WHERE REGEXP_LIKE(postcode, '^[[:digit:]]+$');
however in my version of Oracle I get an error saying
function regexp_like(character varying, "unknown") does not exist
You want regexp_like() and your version should work:
select t.*
from t
where regexp_like(t.postcode, '^[0-9]+$');
However, your error looks more like a Postgres error, so perhaps this will work:
t.postcode ~ '^[0-9]+$'
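That is, a minimal sketch of the Postgres form, reusing the same table and column names as above:

select t.*
from t
where t.postcode ~ '^[0-9]+$';  -- ~ is the Postgres regex-match operator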
For Oracle 10g or higher you can use the regexp functions. In earlier versions the TRANSLATE function will help you:
SELECT postcode
FROM table_name
WHERE length(translate(postcode, 'x0123456789', 'x')) IS NULL  -- strip all digits; NULL means only digits were present
  AND postcode IS NOT NULL;
OR
SELECT translate(postcode, '0123456789' || translate(postcode,'x123456789','x'),'0123456789') nums
FROM table_name ;
The above answer also works for me:
SELECT translate('1234bsdfs3#23##PU', '0123456789' || translate('1234bsdfs3#23##PU','x123456789','x'),'0123456789') nums
FROM dual ;
Nums:
1234323
For an alternative to the Gordon Linoff answer, we can try using REGEXP_REPLACE:
SELECT *
FROM yourTable
WHERE REGEXP_REPLACE(postcode, '[0-9]+', '') IS NULL;
The idea here is to strip away all digit characters and then assert that nothing was left behind. For a mixed digit/letter value, the regex replacement results in a non-empty string.
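As a quick sketch against the sample data (yourTable is the placeholder from the answer), only the all-digit rows should survive:

-- expected to return 454232 and 452 from the sample input
SELECT *
FROM yourTable
WHERE REGEXP_REPLACE(postcode, '[0-9]+', '') IS NULL
  AND postcode IS NOT NULL;  -- optional guard: a NULL postcode would otherwise pass the check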

Space handling in Teradata

I have the following rows in the table
Record_Value
E1X4B1 20160822
E1XBA1 20160822
E1    X920160822
I need to select the values X4, XB and X9. I wrote the query:
SELECT SUBSTR(Record_Value,3,2)
It selects only X4 and XB. To select the value X9 (which starts at the 7th position) I thought of using the COALESCE function, but it handles only NULL values and not blank values. Can you please guide me? The expected output would be:
X4
XB
X9
A different solution simply removes all spaces before the substring:
SUBSTRING(OTRANSLATE(Record_value,' ','') FROM 3 FOR 2)
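As a complete statement (my_table is a placeholder table name; OTRANSLATE drops every space, so the target always starts at position 3):

SELECT SUBSTRING(OTRANSLATE(Record_Value, ' ', '') FROM 3 FOR 2) AS code
FROM my_table;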
Use TRIM to remove the leading blanks (and trailing...):
SELECT SUBSTR(TRIM(Record_Value),1,2)
Answer number 2, to the edited question, ANSI SQL compliant:
SELECT SUBSTRING(TRIM(SUBSTRING(Record_Value FROM 3)) FROM 1 FOR 2)

PostgreSQL count number of times substring occurs in text

I'm writing a PostgreSQL function to count the number of times a particular text substring occurs in another piece of text. For example, calling count('foobarbaz', 'ba') should return 2.
I understand that to test whether the substring occurs, I use a condition similar to the below:
WHERE 'foobarbaz' like '%ba%'
However, I need it to return 2 for the number of times 'ba' occurs. How can I proceed?
Thanks in advance for your help.
I would highly suggest checking out the answer I posted to "How do you count the occurrences of an anchored string using PostgreSQL?". The chosen answer there was shown to be massively slower than an adapted version of regexp_replace(). The overhead of creating the rows and then running the aggregate is simply too high.
The fastest way to do this is as follows...
SELECT (length(str) - length(replace(str, replacestr, '')))::int
       / length(replacestr)
FROM ( VALUES
  ('foobarbaz', 'ba')
) AS t(str, replacestr);
Here we:
Take the length of the string, L1.
Subtract from L1 the length of the string with all of the replacements removed, L2, to get L3, the difference in string length.
Divide L3 by the length of the replacement to get the number of occurrences.
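Since the question asks for a function, here is a minimal sketch wrapping that expression (the function name and signature are illustrative, not part of the original answer, and it assumes the substring is non-empty):

CREATE OR REPLACE FUNCTION count_occurrences(str text, sub text)
RETURNS integer
LANGUAGE sql
IMMUTABLE
AS $$
  -- same length-difference trick as above; division by zero if sub is empty
  SELECT (length(str) - length(replace(str, sub, ''))) / length(sub);
$$;

SELECT count_occurrences('foobarbaz', 'ba');  -- returns 2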
For comparison, the length-difference approach is about five times faster than the method using regexp_matches(), which looks like this.
SELECT count(*)
FROM ( VALUES
('foobarbaz', 'ba')
) AS t(str, replacestr)
CROSS JOIN LATERAL regexp_matches(str, replacestr, 'g');
How about using a regular expression:
SELECT count(*)
FROM regexp_matches('foobarbaz', 'ba', 'g');
The 'g' flag repeats multiple matches on a string (not just the first).
There is a
str_count( src, occurence )
function based on
SELECT (length( str ) - length(replace( str, occurrence, '' ))) / length( occurence )
and a
str_countm( src, regexp )
based on the @MikeT-mentioned
SELECT count(*) FROM regexp_matches( str, regexp, 'g')
available here: postgres-utils
Try with:
SELECT array_length(string_to_array('1524215121518546516323203210856879', '1'), 1) - 1
-- RESULT: 7