Prevent Inserting NULL while using Hive Regex Serde - hive

RegexSerDe uses a regular expression (regex) to deserialize data. It doesn't support data serialization. It deserializes the data using the regex and extracts the groups as columns. In the deserialization stage, if a row does not match the regex, then all columns in the row will be NULL. If a row matches the regex but has fewer groups than expected, the missing groups will be NULL. If a row matches the regex but has more groups than expected, the additional groups are simply ignored.
How can I prevent NULLs from being inserted when a row does not match the regex, and raise an exception instead?

select *
from mytable
where assert_true
(
mycol1 is not null
or mycol2 is not null
or mycol3 is not null
...
)
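The matching rules quoted in the question, and the "fail when every column is NULL" idea behind the assert_true query, can be sketched outside Hive. This is only an emulation under stated assumptions: the function names, the sample pattern and the use of a full match are hypothetical, None stands in for NULL, and nothing here is Hive's actual implementation.

```python
import re

def deserialize(line, pattern, num_cols):
    """Emulate the described behaviour; None stands in for NULL."""
    m = re.fullmatch(pattern, line)  # assumption: the whole line must match
    if m is None:
        return [None] * num_cols               # no match: every column is NULL
    groups = list(m.groups())[:num_cols]       # extra groups are ignored
    return groups + [None] * (num_cols - len(groups))  # missing groups become NULL

def deserialize_strict(line, pattern, num_cols):
    """Hypothetical strict variant: raise instead of returning an all-NULL row."""
    row = deserialize(line, pattern, num_cols)
    if all(v is None for v in row):
        raise ValueError(f"row does not match regex: {line!r}")
    return row

pattern = r"(\d+),(\w+)"  # hypothetical two-group pattern for a three-column table
print(deserialize("42,abc", pattern, 3))
print(deserialize("garbage", pattern, 3))
```

Calling deserialize_strict("garbage", pattern, 3) raises ValueError, which is the same all-columns-NULL test the query applies per row.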

Related

Oracle SQL : Select Query : search if clob Contains a string with pattern matching

I have an Oracle table with one CLOB column which contains JSON data. I need a query which will search within the CLOB data.
I have used the condition where DBMS_LOB.instr(colName,'apple:')>0 which gives the records having apple:. However, I need the query to return records with any number of apples other than blank, meaning the JSON apple key should have a value.
I am thinking of something like where DBMS_LOB.instr(colName,'apple:**X**')>0, where X can be any number not null. I tried regexp_instr but it seems this is not correct for CLOB.
Are there any alternatives to solve this?
Generic string functions for parsing JSON inputs are dangerous - you will get false positives, for example, when something that looks like a JSON object is in fact embedded in a string value. (Illustrated by ID = 101 in my example below.)
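To illustrate the false-positive risk, here is a small sketch that compares a text pattern with a search over parsed JSON. It is not Oracle code: the three strings are strict-JSON rewrites (all member names quoted) of the sample rows used in this answer, and the helper name and the regex are assumptions made for the illustration.

```python
import json
import re

# Strict-JSON versions of the sample rows, keyed by ID (keys quoted for json.loads).
rows = {
    101: '{"name":"Chen", "age":83, "values":["{apple:6}", "street"]}',
    102: '{"data": {"fruits": [{"orange":33}, {"apple":null}, {"plum":44}]}}',
    103: '[{"po":3, "prods":[{"apple":4}, {"banana":null}]}, {"po":4, "prods":null}]',
}

def has_non_null_apple(node):
    """Return True if any object in the parsed JSON has a non-null 'apple' member."""
    if isinstance(node, dict):
        if node.get("apple") is not None:
            return True
        return any(has_non_null_apple(v) for v in node.values())
    if isinstance(node, list):
        return any(has_non_null_apple(v) for v in node)
    return False  # strings, numbers, null: nothing to descend into

# Text matching also fires on the string value "{apple:6}" in row 101.
regex_hits = sorted(i for i, s in rows.items()
                    if re.search(r'apple"?\s*:\s*[0-9]', s))
# Parsing first only looks at real object members.
parsed_hits = sorted(i for i, s in rows.items()
                     if has_non_null_apple(json.loads(s)))

print(regex_hits)
print(parsed_hits)
```

The text pattern reports row 101 as well as 103, while the parsed search reports only 103.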
The ideal scenario is that you are using Oracle 19 or higher; in that case you can use a simple call to json_exists as illustrated below. In the sample table I create, the first JSON string does not contain a member named apple. In the second row, the string does contain a member apple but the value is null. The first query I show (looking for all JSON with an apple member) will include this row in the output. The last query is what you need: it adds a filter so that a JSON string must include at least one apple member with non-null value (regardless of whether it also includes other members named apple, possibly with null value).
create table sample_data
( id number primary key
, colname clob check (colname is json)
);
insert into sample_data
values (101, '{name:"Chen", age:83, values:["{apple:6}", "street"]}');
insert into sample_data
values (102, '{data: {fruits: [{orange:33}, {apple:null}, {plum:44}]}}');
insert into sample_data
values (103, '[{po:3, "prods":[{"apple":4}, {"banana":null}]},
{po:4, "prods":null}]');
Note that I intentionally mixed together quoted and unquoted member names, to verify that the queries below work correctly in all cases. (Remember also that member names in JSON are case sensitive, even in Oracle!)
select id
from sample_data
where json_exists(colname, '$..apple')
;
ID
---
102
103
This is the query you need. Notice the .. in the path (meaning - find an object member named apple anywhere in the JSON) and the filter at the end.
select id
from sample_data
where json_exists(colname, '$..apple?(@ != null)')
;
ID
---
103
You can use regexp_like function for this:
where regexp_like(colName,'apple : [0-9]')

Can one map more than one string to NULL in an SQL COPY command? [duplicate]

I have a source of csv files from a web query which contains two variations of a string that I would like to class as NULL when copying to a PostgreSQL table.
e.g.
COPY my_table FROM STDIN WITH CSV DELIMITER AS ',' NULL AS ('N/A', 'Not applicable');
I know this query will throw an error so I'm looking for a way to specify two separate NULL strings in a COPY CSV query?
Since COPY does not support multiple NULL strings, I think your best bet is to set the NULL string argument to one of them and then, once everything is loaded, run an UPDATE that sets any column holding the other NULL string to an actual NULL (the exact query depends on which columns can contain those values).
If you have a bunch of columns, you could use CASE statements in your SET clause to return NULL if it matches your special string, or the value otherwise. NULLIF could also be used (that would be more compact). e.g. NULLIF(col1, 'Not applicable')
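A minimal sketch of that post-load cleanup, using SQLite through Python as a stand-in for PostgreSQL (NULLIF behaves the same way in both). The column names and sample rows are made up, and the inserts simply pretend that COPY has already mapped 'N/A' to NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, col1 TEXT)")

# Pretend COPY ... NULL AS 'N/A' already ran: 'N/A' arrived as NULL,
# while 'Not applicable' is still stored as a literal string.
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(1, "apple"), (2, None), (3, "Not applicable")],
)

# NULLIF returns NULL when both arguments are equal, otherwise the first argument.
conn.execute("UPDATE my_table SET col1 = NULLIF(col1, 'Not applicable')")

result = conn.execute("SELECT id, col1 FROM my_table ORDER BY id").fetchall()
print(result)
```

With several affected columns, the SET clause would list one NULLIF per column.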

Hive query gives wrong results for IS NOT NULL with many OR conditions

I need to exclude all the rows that have null in a few specified columns of a Hive managed table.
When I use "col is not null" or "not isdbnull(col)" with one or two columns it works fine. But I need to check many columns, and when I add more OR conditions to the query, it ignores the null condition and returns all rows.
Trying to understand the cause, I concluded that the query filters correctly only when all the columns are null at the same time. If any one of the isdbnull(col) conditions fails, the row is included, even though other columns named in the OR conditions are still null.
Any clue much appreciated.
You mentioned you used "or" instead of "and" in your query. So you wrote "(not A) or (not B)", which is equivalent to "not (A and B)". That excludes a row only when both columns are null. This is different from "not (A or B)", which is the same as "(not A) and (not B)" and is how I wrote the query below. See De Morgan's laws for a further explanation.
If you want to select all rows that have non nulls then do this:
select col1, col2, col3 from table
where col1 is not null and col2 is not null and col3 is not null;
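The difference between the OR form and the AND form can be checked on a tiny table. The sketch below uses SQLite through Python rather than Hive, on the assumption that IS NOT NULL combined with AND/OR behaves the same way in both; the table name t and its rows are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, col1 TEXT, col2 TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (1, "a", "b"),    # no nulls
        (2, "a", None),   # col2 is null
        (3, None, "b"),   # col1 is null
        (4, None, None),  # both null
    ],
)

# OR keeps every row that has at least one non-null column.
or_ids = [r[0] for r in conn.execute(
    "SELECT id FROM t WHERE col1 IS NOT NULL OR col2 IS NOT NULL ORDER BY id")]

# AND keeps only rows where every listed column is non-null.
and_ids = [r[0] for r in conn.execute(
    "SELECT id FROM t WHERE col1 IS NOT NULL AND col2 IS NOT NULL ORDER BY id")]

print(or_ids)
print(and_ids)
```

The OR query drops only row 4, which matches the behaviour described in the question; the AND query keeps only row 1.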
Additionally, if you consider an empty string to be a null value, then you can do:
Select col1 .... where col1 != '';
I have seen people also do:
Select col1 .... where length(col1) > 0;
How does Hive understand nulls? An empty string is interpreted as empty by Hive, not as NULL. An empty string could have a different meaning to an application than a NULL, so they are interpreted differently.
When you load data, missing values are represented by the special value NULL. To import data with NULL fields, check the documentation of the SerDe used by the table. The default text format uses LazySimpleSerDe, which interprets the string \N as NULL when importing, so your data should contain \N wherever a value is NULL when loading into Hive.
You can change this by setting the "serialization.null.format" property when creating the table, to let Hive know that some other value represents NULL. In the case here, ("serialization.null.format"="") makes the empty string represent NULL.
Good luck!

is there any difference between the queries

select field from table where field = 'value'
select field from table where field in ('value')
The reason I'm asking is that the second version allows me to use the same syntax for null values, while in the first version I need to change the condition to 'where field is null'...
When you write field_name=NULL, you are comparing a field of a known data type, say varchar, with something whose value and data type are both unknown. The two cannot be compared, so the condition never evaluates to true, even when the field actually is NULL (strictly, the result is UNKNOWN, which a WHERE clause treats like false). With IS NULL, however, you are only testing whether the value itself is missing, without the implied data type comparison, so it returns true or false depending on the actual value of the field.
See reference here regarding the issue of NULL in computer science and here in relation to the similarity to your question.
Now, for the IN clause (i.e. IN(NULL)), I don't know what RDBMS you are using, because when I tried it with MS SQL and MySQL it returned nothing.
See MS SQL example and MySQL example.
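The same behaviour can be reproduced with SQLite through Python. This is only a stand-in for the MS SQL and MySQL tests mentioned above, and the table t and its two rows are made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (field TEXT)")
conn.executemany("INSERT INTO t VALUES (?)", [("value",), (None,)])

def count(where):
    """Count the rows of t that satisfy the given WHERE condition."""
    return conn.execute(f"SELECT COUNT(*) FROM t WHERE {where}").fetchone()[0]

eq_null = count("field = NULL")        # comparison with NULL is never true
in_null = count("field IN (NULL)")     # IN (NULL) is an equality test, same result
is_null = count("field IS NULL")       # the only test that finds the NULL row
in_or_null = count("field IN ('value') OR field IS NULL")

print(eq_null, in_null, is_null, in_or_null)
```

So IN does not give a uniform syntax for NULL: the IS NULL test still has to be added as a separate condition.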
There is no difference in your example. The second, slightly longer, query is not usually used for a single value; it is usually seen with multiple values, such as
select field from table where field in ('value1', 'value2')
Yes, there is a difference between these queries. In the first statement you can supply only one value in the WHERE clause, "where field = 'value'", but in the second statement you can supply many values using the IN clause, "where field in (value1,value2..)".
Examples:
1) select field from table where field ='value1';
2) select field from table where field in ('value1', 'value2')
To include NULL values as well:
SELECT field
FROM tbl_name
WHERE
(field IN ('value1', 'value2', 'value3') OR field IS NULL)

how to filter in sql script to not include any column null

Imagine there are 50 columns. I don't want any row that includes a null value. Is there a trick for this?
SQL Server 2005
Sorry, not really. All 50 columns have to be checked in one form or another.
Column1 IS NOT NULL AND ... AND Column50 IS NOT NULL
Of course, under these conditions, why not disallow NULLs in the first place by having NOT NULL in the table definition?
If it's SQL Server 2005+ you can do something like:
SELECT fields
FROM MyTable
WHERE stuff
EXCEPT -- This excludes the below results
SELECT fields
FROM MyTable
WHERE (Col1 + Col2 + Col3....) IS NULL
Adding a null to a value results in a null, so if any of your columns is NULL, the sum of all your columns will be NULL.
This may need to change based on your data types, but adding NULL to either a char/varchar or a number will result in another NULL.
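Here is a small sketch of the EXCEPT approach, run against SQLite through Python instead of SQL Server. Two things are assumptions of the stand-in: SQLite concatenates strings with || where SQL Server uses +, and the table contents and the id column are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (id INTEGER, Col1 TEXT, Col2 TEXT, Col3 TEXT)")
conn.executemany(
    "INSERT INTO MyTable VALUES (?, ?, ?, ?)",
    [
        (1, "a", "b", "c"),      # no nulls
        (2, "a", None, "c"),     # one null
        (3, None, None, None),   # all nulls
    ],
)

# Concatenating with NULL yields NULL, so the second SELECT finds every row
# that has at least one NULL column; EXCEPT removes those rows from the first.
query = """
SELECT id FROM MyTable
EXCEPT
SELECT id FROM MyTable WHERE (Col1 || Col2 || Col3) IS NULL
"""
kept_ids = sorted(r[0] for r in conn.execute(query))
print(kept_ids)
```

Only the row without any NULL column survives.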
If you are looking at the values not being null, you can do this in the select statement.
SELECT ISNULL(firstname,''), ISNULL(lastname,'') FROM TABLE WHERE SOMETHING=1
This will replace nulls with string blanks. If you want another value use: ISNULL(firstname,'empty') for example. You can use anything where the word empty is.
I prefer this query
select *
from table
where column1>''
and column2>''
and (column3>'' or column3<'')
This allows SQL Server to use an index seek if the proper indexes exist. You would have to use the column3 syntax for any numeric columns that could hold negative values.