Removing null columns with U-SQL - azure-data-lake

I have many files that I'm attempting to join together. I happen to know that many of the columns in each of these files contain nothing but null values and I can do without having them in there. How can I write a U-SQL statement to extract the data from the files, check for columns of nothing but nulls and exclude them?
Thanks!

The best-performing approach would probably be to write a custom extractor that simply skips rows containing only null values.
Otherwise, you could write something like this (note the nullable markers (?) on the non-object types):
@data = EXTRACT c1 string, c2 int?, c3 DateTime? // ... more columns
FROM "/path/file.csv"
USING Extractors.Csv();
@data = SELECT * FROM @data WHERE c1 != null AND c2 != null AND c3 != null;
(note that you will most likely have to cast the null to the column type in the comparison).
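For example, a sketch with the nullable columns cast explicitly, in case the compiler complains about the untyped null literal (schema assumed from above):
@data = SELECT * FROM @data
        WHERE c1 != null AND c2 != (int?)null AND c3 != (DateTime?)null;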
If the schema differs between the files, you could also do the filtering with a so-called processor, which can look at the schema of the input row.
Something along the lines of:
@data = PROCESS @data PRODUCE c1 string, c2 int?, c3 DateTime?
USING new MyAsm.NullFilterProcessor();
where you would have to implement NullFilterProcessor as an IProcessor.

@data = EXTRACT c1 string, c2 int?, c3 DateTime? // ... more columns
FROM "/path/file.csv"
USING Extractors.Csv();
Sometimes the above code will still throw an error, namely when the nulls in the file have been encoded as some other token, such as "" or "\N". In that case we have to use the nullEscape parameter of the built-in extractor so that those tokens are read back as nulls:
USING Extractors.Csv(nullEscape:"\\N");
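End to end, a sketch combining the extraction and the null filter (the output path is hypothetical; note that U-SQL string literals follow C# escaping rules, hence the doubled backslash):
@data =
    EXTRACT c1 string, c2 int?, c3 DateTime?
    FROM "/path/file.csv"
    USING Extractors.Csv(nullEscape:"\\N"); // a literal \N in the file becomes a real null
@data = SELECT * FROM @data WHERE c1 != null AND c2 != null AND c3 != null;
OUTPUT @data TO "/path/file_nonnull.csv" USING Outputters.Csv();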

Related

Selecting substrings from different points in strings depending on another column entry SQL

I have two columns that look a little like this (Column C is what I want to produce):
Column A | Column B                            | Column C
ABC      | {"ABC":1.0,"DEF":24.0,"XYZ":10.50,} | 1.0
DEF      | {"ABC":1.0,"DEF":24.0,"XYZ":10.50,} | 24.0
I need a select statement to create Column C: the numerical digits in Column B that correspond to the letters in Column A. I have got as far as finding the starting point of the numbers I want to take out, but as they have different character lengths I can't use a fixed length; I want to extract the characters from the calculated starting point (below) up to the next comma.
STRPOS(Column B, Column A) + 5 gives me the correct starting point for a SUBSTRING query, but from here I am lost. Any help much appreciated.
NB: I am using Google BigQuery; it doesn't recognise CHARINDEX.
You can use a regular expression as well. FORMAT builds a per-row pattern such as "ABC":([0-9.]+), and REGEXP_EXTRACT returns the part matched by the capturing group.
WITH sample_table AS (
  SELECT 'ABC' ColumnA, '{"ABC":1.0,"DEF":24.0,"XYZ":10.50,}' ColumnB UNION ALL
  SELECT 'DEF', '{"ABC":1.0,"DEF":24.0,"XYZ":10.50,}' UNION ALL
  SELECT 'XYZ', '{"ABC":1.0,"DEF":24.0,"XYZ":10.50,}'
)
SELECT *,
       REGEXP_EXTRACT(ColumnB, FORMAT('"%s":([0-9.]+)', ColumnA)) AS ColumnC
FROM sample_table;
[Updated]
Regarding @Bihag Kashikar's suggestion: since ColumnB is invalid JSON (note the trailing comma), it will not be parsed properly by a JS UDF, as shown below. If it were valid JSON, a JS UDF reading the key could be an alternative to a regular expression, I think.
CREATE TEMP FUNCTION custom_json_extract(json STRING, key STRING)
RETURNS STRING
LANGUAGE js AS """
  try {
    obj = JSON.parse(json);
  } catch {
    return null;
  }
  return obj[key];
""";
SELECT custom_json_extract('{"ABC":1.0,"DEF":24.0,"XYZ":10.50,}', 'ABC') invalid_json,
custom_json_extract('{"ABC":1.0,"DEF":24.0,"XYZ":10.50}', 'ABC') valid_json;
Take a look at this post too; it shows a JS UDF together with split options:
Error when trying to have a variable pathsname: JSONPath must be a string literal or query parameter
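For completeness, a sketch of a split-based approach (no regex) against the sample_table above; it assumes the {}-wrapped, comma-separated shape shown in the sample data:
SELECT ColumnA,
       -- strip the braces and trailing comma, split into "key":value pairs,
       -- then pick the pair whose key matches ColumnA
       (SELECT SPLIT(kv, ':')[OFFSET(1)]
        FROM UNNEST(SPLIT(TRIM(ColumnB, '{},'), ',')) AS kv
        WHERE TRIM(SPLIT(kv, ':')[OFFSET(0)], '"') = ColumnA) AS ColumnC
FROM sample_table;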

Oracle ORA-01722 not showing up consistently

There seem to be inconsistencies in how the ORA-01722 error works. For those who don't know, the issue is due to an invalid number, and to fix it you need to convert the number to a char.
When filtering a VARCHAR2 column, it is stated that Oracle will convert the data of the column being filtered based on the value given to it (see: https://stackoverflow.com/a/10422418/5337433).
Yet for some reason the error is inconsistent. As an example I have this query, in which filter1 is VARCHAR2:
select *
from table
where filter1 = 12345
and filter2 = ''
and filter3 = '';
When this statement runs there are no issues, but when you run it like this:
select *
from table
where filter1 = 12345
and filter2 = '';
it errors out with ORA-01722. I'm not sure why it is acting this way, or how to fix it.
When you compare a varchar column to a number, Oracle will try to convert the column's content to a number, not the other way round (because 123 could be stored as '0123' or '00123').
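In other words, the filter is effectively evaluated like this (illustrative only, not Oracle's exact internal rewrite):
select *
from table
where TO_NUMBER(filter1) = 12345;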
In general you should always use constant values that match the data type of the column you compare them with. So it should be:
where filter1 = '12345'
However if you are storing numbers in that column, you should not define it as varchar - it should be converted to a proper number column.
The reason the error doesn't show up "consistently" is that you seem to have some values that can be converted to a number and some that can't. Whether those problematic values are included depends on the other conditions in the query.
Additionally: empty strings are converted to NULL in Oracle. So the condition filter2 = '' will never be true. You will have to use filter2 is null if you want to check for an "empty" column.
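Putting both corrections together, the original query would become something like:
select *
from table
where filter1 = '12345'
  and filter2 is null
  and filter3 is null;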

IS ISNULL() specific for integers?

This has been bothering me with my coding continuously and I can't seem to google a good workaround.
I have a number of columns which are data type nvarchar(255). Pretty standard I would assume.
Anyway, I want to run:
DELETE FROM Ranks WHERE ISNULL(INST,0) = 0
where INST is nvarchar(255). I am thrown the error:
Conversion failed when converting the nvarchar value 'Un' to data type int.
which is the first non-null value in the column. But I don't care about that value; the error showing it to me just means it's not null. I just want to delete the nulls!
Is there something simple I'm missing?
Any help would be fab!
An expression may only be of one type.
Expression ISNULL(INST,0) involves two source types, nvarchar(255) and int. However, no type change happens at this point, because ISNULL is documented to return the type of its first argument (nvarchar), and will convert the second argument to that type if needed, so the entire original expression is equivalent to ISNULL(INST, '0').
The next step is the comparison expression, ISNULL(INST, '0') = 0. It again has nvarchar(255) and int as the source data types, but this time nothing can stop the conversion - in fact, it must happen for the comparison operator, =, to work at all. According to the data type precedence list, int wins and is chosen as the resulting type of the comparison expression. Hence all values from column INST must be converted to int before the comparison = 0 is made.
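A minimal standalone repro of that behavior (hypothetical values):
-- ISNULL returns the type of its first argument, so this yields the nvarchar '0':
SELECT ISNULL(CAST(NULL AS nvarchar(255)), 0);
-- The comparison with an int literal then forces a conversion of the string side,
-- which fails on the first non-numeric value:
SELECT CASE WHEN ISNULL(N'Un', 0) = 0 THEN 1 ELSE 0 END;
-- Conversion failed when converting the nvarchar value 'Un' to data type int.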
If you "just want to delete the nulls", then just delete the nulls:
DELETE FROM Ranks WHERE INST IS NULL
If for some reason you absolutely have to use ISNULL in this fashion (there is no real reason to), then you should stay in the realm of strings:
DELETE FROM Ranks WHERE ISNULL(INST, '') = ''
That would have deleted null entries and entries with empty strings (''), just like the WHERE ISNULL(INST, 0) = 0 would have deleted null entries and entries with '0's if all values in INST could have been converted to int.
With ISNULL(INST,0) you are saying: If the string INST is null, replace it with the string 0. But 0 isn't a string, so this makes no sense.
With WHERE ISNULL(INST,0) = 0 you'd access all rows where INST is either NULL or 0 (but as mentioned a string is not an integer).
So what do you want to achieve? Delete all rows where INST is null? That would be
DELETE FROM ranks WHERE inst IS NULL;

How to handle null + number addition in pig gracefully

I have a data schema with 50+ columns. Now I have a scenario where I need to add four int columns together, and there is a chance that any one of the four can be null.
If I do null + 1 + null + 7, I get null as the result, which is true as per the Pig documentation:
https://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#Nulls
i.e. if either sub-expression is null, the resulting expression is null.
Could someone please let me know how to handle such scenarios? Do I need to define a UDF, or is checking for null before performing the addition good enough? Thanks in advance.
One option is: if the column value is null, substitute zero, otherwise proceed with the original value. Sample example below.
input.txt
1,,3
,5,6
7,8,
PigScript:
A = LOAD 'input.txt' USING PigStorage(',') AS (f1:int,f2:int,f3:int);
B = FOREACH A GENERATE ((f1 is null) ? 0 : f1) + ((f2 is null) ? 0 : f2) + ((f3 is null) ? 0 : f3);
DUMP B;
Output:
(4)
(11)
(15)
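With 50+ columns, naming the guarded values first can keep the expression readable. A sketch of the same idea (the aliases g1..g3 and relation names are my own):
A = LOAD 'input.txt' USING PigStorage(',') AS (f1:int,f2:int,f3:int);
-- replace each null with 0 once, then sum the cleaned values
B = FOREACH A GENERATE
        ((f1 is null) ? 0 : f1) AS g1,
        ((f2 is null) ? 0 : f2) AS g2,
        ((f3 is null) ? 0 : f3) AS g3;
C = FOREACH B GENERATE g1 + g2 + g3 AS total;
DUMP C;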

String input matched against a binary field in SQL WHERE

Here is the scenario:
I have a SQL select statement that returns a binary data object as a string. This cannot be changed; it is outside the area of what I can modify.
So for example it would return '1628258DB0DD2F4D9D6BC0BF91D78652'.
If I manually add a 0x in front of this string in a query, I retrieve the results I'm looking for. So for example:
SELECT a, b FROM mytable WHERE uuid = 0x1628258DB0DD2F4D9D6BC0BF91D78652
My result set is correct.
However, I need a means to do this programmatically that is compatible with Microsoft SQL Server 2008. Simply concatenating 0x to the string variable does not work. Obvious, but I did try it.
Help please :)
Thank you
Mark
My understanding of your question is that you have a column uuid, which is binary.
You are trying to select rows with a particular value in uuid, but you are trying to use a string like so:
SELECT a, b FROM mytable WHERE uuid = '0x1628258DB0DD2F4D9D6BC0BF91D78652'
which does not work. If this is correct, you can use the CONVERT function with a style of 2 to have SQL Server treat the string as hex and not require a '0x' as the first characters:
SELECT a, b
FROM mytable
WHERE uuid = CONVERT(binary(16), '1628258DB0DD2F4D9D6BC0BF91D78652', 2)
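Since the value arrives as a string at runtime, the same conversion works with a variable (a sketch; @hex is a hypothetical name):
DECLARE @hex varchar(32) = '1628258DB0DD2F4D9D6BC0BF91D78652';
SELECT a, b
FROM mytable
WHERE uuid = CONVERT(binary(16), @hex, 2);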