Storing and Querying Blank Values in Hive Columns - hive

I have a requirement for storing blank strings of length 1, 2, and 3 in some columns of my Hive table.
Storing:
If my column type is char, then I see that the data is always trimmed before storing, i.e. length(column) is always 0.
If my column type is varchar, then the data is not trimmed, so length(column) is 1, 2 and 3 respectively.
So that solves my storing problem.
Querying:
I am unable to query the column by value.
say, select * from hive table where column = ' ';
It only works if I do something like
select * from hive table where length(column) > 0 and trim(column) = '';
Is there a way to handle this separately?
Say I want to query those records where the column value is a blank string of length 3. How do I do this?
This is what I tried (note that the issue seems to occur when the file is stored as Parquet):
CREATE EXTERNAL TABLE IF NOT EXISTS DUMMY5 (
col1 varchar(3))
STORED AS PARQUET
LOCATION "/DUMMY5";
insert into DUMMY5 values ('  ');   -- 2-character blank string
insert into DUMMY5 values ('   ');  -- 3-character blank string
select col1, length(col1) from DUMMY5;
+-------+------+--+
| col1  | _c1  |
+-------+------+--+
|       | 3    |
|       | 2    |
+-------+------+--+
select col1, length(col1) from DUMMY5 where col1 = '  ';   -- 2 spaces, 0 records
select col1, length(col1) from DUMMY5 where col1 = '   ';  -- 3 spaces, 0 records

Running Hive 2.1.1
drop table dummy_tbl;
CREATE TABLE dummy_tbl (
col1 char(1),
col2 varchar(1),
col3 char(3),
col4 varchar(3)) ;
insert into dummy_tbl values ('  ', '  ', '  ', '  '); -- every value is a two-space string
select length(col1), length(col2), length(col3), length(col4) from dummy_tbl;
Result:
c0    c1    c2    c3
0     1     0     2
The varchar columns work correctly. col2 was truncated on insert to fit varchar(1); this is documented behaviour.
col4 varchar(3) also works correctly; this query returns 1:
select count(*) from dummy_tbl where col4='  '; -- two spaces, returns 1
The length of both char columns shows 0, and comparisons ignore the spaces, exactly as documented:
select count(*) from dummy_tbl where col1=' ';  -- single space, returns 1
select count(*) from dummy_tbl where col1='  '; -- two spaces, also returns 1 because spaces are ignored
You can use varchar with the proper length, or the STRING type if you are not sure about the length.
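For the original requirement (finding rows whose value is a blank string of a specific length), a length-based predicate avoids the padding-insensitive comparison entirely. A minimal sketch, assuming the column stays varchar(3) or STRING so the spaces are preserved, run against the DUMMY5 table above:
select col1, length(col1) from DUMMY5 where length(col1) = 3 and trim(col1) = '';  -- exactly three spaces
select col1, length(col1) from DUMMY5 where length(col1) = 2 and trim(col1) = '';  -- exactly two spaces
This is the same length() plus trim() trick the question already uses, just pinned to an exact length instead of length(column) > 0.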

Related

SQL: How to split column by character count [duplicate]

I have one column with letters. I want to split this column into chunks of three. What SQL code for Microsoft SQL Server would I need? I have read about splitting by a special character, but I am not sure how to create a split by character count where the result is not restricted to a fixed number of columns either.
You can do:
select t.*, substring(col, 1, 3), substring(col, 4, 3), substring(col, 7, 3)
from table t
If you really want to do this dynamically, as stated in the question, and have a query that creates just as many columns as needed, then you do need dynamic SQL.
Here is a solution that uses a recursive CTE to generate the query string.
declare @sql nvarchar(max);

with cte as (
    select
        1 pos,
        cast('substring(code, 1, 3) col1' as nvarchar(max)) q,
        max(len(code)) max_pos
    from mytable
    union all
    select
        pos + 1,
        cast(
            q
            + ', substring(code, ' + cast(pos * 3 + 1 as nvarchar(3))
            + ', 3) col'
            + cast(pos + 1 as nvarchar(3))
        as nvarchar(max)),
        max_pos
    from cte
    where pos < max_pos / 3
)
select @sql = N'select ' + q + ' from mytable'
from cte
where len(q) = (select max(len(q)) from cte);

select @sql sql;         -- debug: show the generated query
EXEC sp_executesql @sql;
The anchor of the recursive query computes the length of the longest string in column code. Then, the recursive part generates a series of substring() expressions for each chunk of 3 characters, with dynamic column names like col1, col2 and so on. You can then (debug and) execute that query string.
Demo on DB Fiddle:
-- debug
| sql |
| :---------------------------------------------------------------------------------------------------------------------------------- |
| select substring(code, 1, 3) col1, substring(code, 4, 3) col2, substring(code, 7, 3) col3, substring(code, 10, 3) col4 from mytable |
-- results
col1 | col2 | col3 | col4
:--- | :--- | :--- | :---
ABC | DEF | GHI |
XYZ | ABC | |
JKL | MNO | PQR | STU
ABC | DEF | |
Try it like this, which does not need any dynamic SQL (as long as you can specify a maximum count of columns):
First we need to define a mockup scenario to simulate your issue
DECLARE @tbl TABLE(ID INT IDENTITY, YourString VARCHAR(100));
INSERT INTO @tbl VALUES ('AB')
                       ,('ABC')
                       ,('ABCDEFGHI')
                       ,('XYZABC')
                       ,('JKLMNOPQRSTU')
                       ,('ABCDEF');
--We can set the chunk length generically. Try it with other values...
DECLARE @ChunkLength INT=3;
--The query
SELECT p.*
FROM
(
    SELECT t.ID
          ,CONCAT('Col',A.Nmbr) AS ColumnName
          ,SUBSTRING(t.YourString,(A.Nmbr-1)*@ChunkLength + 1,@ChunkLength) AS Chunk
    FROM @tbl t
    CROSS APPLY
    (
        SELECT TOP((LEN(t.YourString)+(@ChunkLength-1))/@ChunkLength) ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) FROM master..spt_values
    ) A(Nmbr)
) src
PIVOT
(
    MAX(Chunk) FOR ColumnName IN(Col1,Col2,Col3,Col4,Col5,Col6 /*add the maximum column count here*/)
) p;
The idea in short:
By using an APPLY call we can create a row-wise tally. This will return multiple rows per input string. The row count is defined by the computed TOP-clause.
We use the row-wise tally first to create a column name and second as a parameter to SUBSTRING().
Finally we can use PIVOT to return this as a horizontal list.
One hint about generic result sets:
This might be kind of religion, but - at least in my point of view - I would prefer a fixed result set with a lot of empty columns, rather than a generically defined set. The consumer should know the result format in advance...
You might use exactly the same query as a dynamically created SQL statement. The only thing you would need to change is the actual list of column names in the PIVOT's IN-clause, as sketched below.
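A minimal sketch of generating that IN-list dynamically, reusing the @tbl and @ChunkLength declared above (only the list itself is built here; the full statement would be the query above pasted into a string and run with EXEC sp_executesql):
DECLARE @MaxCols INT = (SELECT MAX((LEN(YourString)+(@ChunkLength-1))/@ChunkLength) FROM @tbl);
DECLARE @ColList NVARCHAR(MAX) = N'';
DECLARE @i INT = 1;
WHILE @i <= @MaxCols
BEGIN
    SET @ColList += CASE WHEN @i > 1 THEN N',' ELSE N'' END + N'Col' + CAST(@i AS NVARCHAR(10));
    SET @i += 1;
END;
-- @ColList now holds e.g. N'Col1,Col2,Col3,Col4' and can be spliced into the PIVOT's IN(...) clause.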

Find the frequency of all words from a concatenated column

I have a concatenated text column derived from three columns in a table. I need the frequency of all single words from that concatenated column.
Column1     Column2    Column3
This is     Test       1
This was    Test       two
What I need is the concatenation of all three, i.e. 'This is Test 1' and 'This was Test two', and then the count of each word, i.e.
This - 2
is - 1
was - 1
Test - 2
1 - 1
two - 1
You can use string_split and cross apply to achieve the required result. Try the following:
Code:
declare @tab table (col1 varchar(100), col2 varchar(100), col3 varchar(100))

insert into @tab
select 'This is', 'Test', '1'
union
select 'This was', 'Test', 'two'

select value, count(*) rec_count
from @tab
cross apply string_split((col1 + ' ' + col2 + ' ' + col3), ' ')
group by value

Count the number of spaces in a string

I am working on data validation and I am trying to count the number of spaces in a string. My problem is that when I count the spaces, any string with more than one space between words, or any string with trailing space(s), is not counted correctly.
I have tried the following queries without luck. Each gives a different result, but not the desired output.
DECLARE @MyTbl TABLE (ID INT, Name VARCHAR(300))
INSERT INTO @MyTbl VALUES
(1, 'Alfreds Futterkiste'), -- 1 space
(2, 'Mike James Ray  '),    -- 4 spaces: 1 space between each word and 2 spaces after the text
(3, 'Hanari  Carnes'),      -- 2 spaces between the words
(4, 'James Michael')
-- 1
SELECT ID,
       LEN(Name) - LEN(REPLACE(Name, ' ', '')) AS Count_Of_Spaces
FROM @MyTbl
-- 2
SELECT ID,
       LEN(Name + ';') - LEN(REPLACE(Name, ' ', '')) AS Count_Of_Spaces2
FROM @MyTbl
-- 3
SELECT ID,
       LEN(Name) - LEN(REPLACE(Name, ' ', '')) AS Count_Of_Spaces3
FROM @MyTbl
Current output based on the first query:
ID   Count_Of_Spaces
1    1
2    2
3    2
4    1
Desired output:
ID   Count_Of_Spaces
1    1
2    4
3    2
4    1
You could use DATALENGTH:
SELECT ID,
       DATALENGTH(Name) - LEN(REPLACE(Name, ' ', '')) AS Count_Of_Spaces
FROM @MyTbl;
DBFiddle Demo
LEN does not count trailing spaces.
If NVARCHAR then you need to divide by 2.
DECLARE @MyTbl TABLE (ID INT, Name NVARCHAR(300))
INSERT INTO @MyTbl VALUES
(1, 'Alfreds Futterkiste'), -- 1 space
(2, 'Mike James Ray  '),    -- 4 spaces: 1 space between each word
                            -- and 2 spaces after the text
(3, 'Hanari  Carnes'),      -- 2 spaces between the words
(4, 'James Michael');

SELECT ID,
       DATALENGTH(Name)/2 - LEN(REPLACE(Name, ' ', '')) AS Count_Of_Spaces
FROM @MyTbl;
DBFiddle Demo2
You had the answer in your attempt #2. You probably just didn't realize you also need to do the appending in the second part (the REPLACE) of your query:
DECLARE @MyTbl TABLE (ID INT, Name VARCHAR(300))
INSERT INTO @MyTbl VALUES
(1, 'Alfreds Futterkiste'), -- 1 space
(2, 'Mike James Ray  '),    -- 4 spaces: 1 space between each word and 2 spaces after the text
(3, 'Hanari  Carnes'),      -- 2 spaces between the words
(4, 'James Michael')
-- 2
SELECT ID,
       LEN(';' + Name + ';') - LEN(REPLACE(';' + Name + ';', ' ', '')) AS Count_Of_Spaces2
FROM @MyTbl
When I need the length of a field passed into a function or stored procedure, and the field could have trailing spaces that are meant to be there, I use the following statement:
SET @Len = LEN(@Parm + '.') - 1
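A quick illustration of that trick (the variable names here are just placeholders):
DECLARE @Parm VARCHAR(50) = 'abc   ';                -- three trailing spaces
SELECT LEN(@Parm)           AS len_without_trailing, -- 3
       LEN(@Parm + '.') - 1 AS len_with_trailing;    -- 6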

Count the Null columns in a row in SQL

I was wondering about the possibility of counting the null columns of a row in SQL. I have a table Customer that has nullable values; simply put, I want a query that returns an int giving the number of null columns for a certain row (a certain customer).
This method assigns a 1 or 0 for null columns, and adds them all together. Hopefully you don't have too many nullable columns to add up here...
SELECT
((CASE WHEN col1 IS NULL THEN 1 ELSE 0 END)
+ (CASE WHEN col2 IS NULL THEN 1 ELSE 0 END)
+ (CASE WHEN col3 IS NULL THEN 1 ELSE 0 END)
...
...
+ (CASE WHEN col10 IS NULL THEN 1 ELSE 0 END)) AS sum_of_nulls
FROM table
WHERE Customer=some_cust_id
Note, you can also do this perhaps a little more syntactically cleanly with IF() if your RDBMS supports it.
SELECT
(IF(col1 IS NULL, 1, 0)
+ IF(col2 IS NULL, 1, 0)
+ IF(col3 IS NULL, 1, 0)
...
...
+ IF(col10 IS NULL, 1, 0)) AS sum_of_nulls
FROM table
WHERE Customer=some_cust_id
I tested this pattern against a table and it appears to work properly.
My answer builds on Michael Berkowski's answer, but to avoid having to type out hundreds of column names, what I did was this:
Step 1: Get a list of all of the columns in your table
SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = 'myTable';
Step 2: Paste the list in Notepad++ (any editor that supports regular expression replacement will work). Then use this replacement pattern
Search:
^(.*)$
Replace:
\(CASE WHEN \1 IS NULL THEN 1 ELSE 0 END\) +
Step 3: Prepend SELECT identityColumnName, and change the very last + to AS NullCount FROM myTable and optionally add an ORDER BY...
SELECT
identityColumnName,
(CASE WHEN column001 IS NULL THEN 1 ELSE 0 END) +
-- ...
(CASE WHEN column200 IS NULL THEN 1 ELSE 0 END) AS NullCount
FROM
myTable
ORDER BY
NullCount DESC
For ORACLE-DBMS only.
You can use the NVL2 function:
NVL2( string1, value_if_not_null, value_if_null )
Here is a select with a similar approach to the one Michael Berkowski suggested:
SELECT (NVL2(col1, 0, 1)
+ NVL2(col2, 0, 1)
+ NVL2(col3, 0, 1)
...
...
+ NVL2(col10, 0, 1)
) AS sum_of_nulls
FROM table
WHERE Customer=some_cust_id
A more generic approach would be to write a PL/SQL block and use dynamic SQL: build a SELECT string containing the NVL2 expression from above for every column listed in all_tab_columns for the specific table.
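A rough PL/SQL sketch of that idea (untested here; MYSCHEMA, CUSTOMER and CUSTOMER_ID are placeholder names, and the bind value 42 is just an example id):
DECLARE
  v_sql   VARCHAR2(32767) := 'SELECT 0';
  v_nulls NUMBER;
BEGIN
  FOR c IN (SELECT column_name
            FROM all_tab_columns
            WHERE owner = 'MYSCHEMA' AND table_name = 'CUSTOMER') LOOP
    v_sql := v_sql || ' + NVL2(' || c.column_name || ', 0, 1)';
  END LOOP;
  v_sql := v_sql || ' FROM customer WHERE customer_id = :id';
  EXECUTE IMMEDIATE v_sql INTO v_nulls USING 42;
  DBMS_OUTPUT.PUT_LINE('null columns: ' || v_nulls);
END;
/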
Unfortunately, in a standard SQL statement you will have to enter each column you want to test; to test them all programmatically you could use T-SQL. A word of warning though: ensure you are working with genuine NULLs. You can have blank stored values that the database will not recognise as a true NULL (I know this sounds strange).
You can avoid this by capturing the blank values and the NULLS in a statement like this:
CASE WHEN col1 & '' = '' THEN 1 ELSE 0 END
Or in some databases such as Oracle (not sure if there are any others) you would use:
CASE WHEN col1 || '' = '' THEN 1 ELSE 0 END
You don't state the RDBMS. For SQL Server 2008...
SELECT CustomerId,
(SELECT COUNT(*) - COUNT(C)
FROM (VALUES(CAST(Col1 AS SQL_VARIANT)),
(Col2),
/*....*/
(Col9),
(Col10)) T(C)) AS NumberOfNulls
FROM Customer
Depending on what you want to do, and if you ignore mavens, and if you use SQL Server 2012, you could do it another way.
The total number of candidate columns ("slots") must be known.
1. Select all the known "slots" column by column (they're known).
2. Unpivot that result to get a table with one row per original column. This works because the null columns don't unpivot, and you know all the column names.
3. Count(*) the result to get the number of non-nulls; subtract from that to get your answer.
Like this, for 4 "seats" in a car
select 'empty seats' = 4 - count(*)
from
(
select carId, seat1, seat2, seat3, seat4 from cars where carId = @carId
) carSpec
unpivot (FieldValue FOR seat in ([seat1],[seat2],[seat3],[seat4])) AS results
This is useful if you may need to do more later than just count the number of non-null columns, as it gives you a way to manipulate the columns as a set too.
This will give you the number of columns which are not null. You can apply this as appropriate:
SELECT ISNULL(COUNT(col1),'') + ISNULL(COUNT(col2),'') +ISNULL(COUNT(col3),'')
FROM TABLENAME
WHERE ID=1
The script below gives you the NULL value count within a row, i.e. how many columns do not have values.
SELECT
    *,
    (SELECT COUNT(*)
     FROM (VALUES (Tab.Col1)
                 ,(Tab.Col2)
                 ,(Tab.Col3)
                 ,(Tab.Col4)) InnerTab(Col)
     WHERE Col IS NULL) NullColumnCount
FROM (VALUES (1,2,3,4)
            ,(NULL,2,NULL,4)
            ,(1,NULL,NULL,NULL)) Tab(Col1,Col2,Col3,Col4)
Just to demonstrate I am using an inline table in my example.
Try to cast or convert all column values to a common type; this will help you compare columns of different types, as in the sketch below.
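A minimal illustration of that casting idea, using the same VALUES pattern with one int and one varchar column (the inline data is made up for the example):
SELECT Tab.*,
       (SELECT COUNT(*)
        FROM (VALUES (CAST(Tab.Col1 AS varchar(30)))
                    ,(CAST(Tab.Col2 AS varchar(30)))) InnerTab(Col)
        WHERE Col IS NULL) AS NullColumnCount
FROM (VALUES (1, 'a')
            ,(NULL, NULL)) Tab(Col1, Col2)
Without the CASTs, the inner VALUES constructor would try to force the varchar values into Col1's int type and fail.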
I haven't tested it yet, but I'd try to do it using a PL/SQL function:
CREATE OR REPLACE TYPE ANYARRAY AS TABLE OF ANYDATA
;
CREATE OR REPLACE Function COUNT_NULL
( ARR IN ANYARRAY )
RETURN number
IS
cnumber number := 0;  -- initialise, otherwise cnumber + 1 stays NULL
BEGIN
for i in 1 .. ARR.count loop
if ARR(i).column_value is null then
cnumber := cnumber + 1;
end if;
end loop;
RETURN cnumber;
EXCEPTION
WHEN OTHERS THEN
raise_application_error
(-20001,'An error was encountered - '
||SQLCODE||' -ERROR- '||SQLERRM);
END
;
Then use it in a select query like this
CREATE TABLE TEST (A NUMBER, B NUMBER, C NUMBER);
INSERT INTO TEST VALUES (NULL, NULL, NULL);
INSERT INTO TEST VALUES (1   , NULL, NULL);
INSERT INTO TEST VALUES (1   , 2   , NULL);
INSERT INTO TEST VALUES (1   , 2   , 3   );
SELECT ROWNUM,COUNT_NULL(A,B,C) AS NULL_COUNT FROM TEST;
Expected output
ROWNUM | NULL_COUNT
-------+-----------
1 | 3
2 | 2
3 | 1
4 | 0
This is how I tried it:
CREATE TABLE #temptablelocal (id int NOT NULL, column1 varchar(10) NULL, column2 varchar(10) NULL, column3 varchar(10) NULL, column4 varchar(10) NULL, column5 varchar(10) NULL, column6 varchar(10) NULL);
INSERT INTO #temptablelocal
VALUES (1,
NULL,
'a',
NULL,
'b',
NULL,
'c')
SELECT *
FROM #temptablelocal
WHERE id =1
SELECT count(1) countnull
FROM
(SELECT a.ID,
b.column_title,
column_val = CASE b.column_title
WHEN 'column1' THEN a.column1
WHEN 'column2' THEN a.column2
WHEN 'column3' THEN a.column3
WHEN 'column4' THEN a.column4
WHEN 'column5' THEN a.column5
WHEN 'column6' THEN a.column6
END
FROM
( SELECT id,
column1,
column2,
column3,
column4,
column5,
column6
FROM #temptablelocal
WHERE id =1 ) a
CROSS JOIN
( SELECT 'column1'
UNION ALL SELECT 'column2'
UNION ALL SELECT 'column3'
UNION ALL SELECT 'column4'
UNION ALL SELECT 'column5'
UNION ALL SELECT 'column6' ) b (column_title) ) AS pop WHERE column_val IS NULL
DROP TABLE #temptablelocal
Similarly, but dynamically:
drop table if exists myschema.table_with_nulls;
create table myschema.table_with_nulls as
select
n1::integer,
n2::integer,
n3::integer,
n4::integer,
c1::character varying,
c2::character varying,
c3::character varying,
c4::character varying
from
(
values
(1,2,3,4,'a','b','c','d'),
(1,2,3,null,'a','b','c',null),
(1,2,null,null,'a','b',null,null),
(1,null,null,null,'a',null,null,null)
) as test_records(n1, n2, n3, n4, c1, c2, c3, c4);
drop function if exists myschema.count_nulls(varchar,varchar);
create function myschema.count_nulls(schemaname varchar, tablename varchar) returns void as
$BODY$
declare
calc varchar;
sqlstring varchar;
begin
select
array_to_string(array_agg('(' || trim(column_name) || ' is null)::integer'),' + ')
into
calc
from
information_schema.columns
where
table_schema in ('myschema')
and table_name in ('table_with_nulls');
sqlstring = 'create temp view count_nulls as select *, ' || calc || '::integer as count_nulls from myschema.table_with_nulls';
execute sqlstring;
return;
end;
$BODY$ LANGUAGE plpgsql STRICT;
select * from myschema.count_nulls('myschema'::varchar,'table_with_nulls'::varchar);
select
*
from
count_nulls;
Though I see that I didn't finish parameterising the function.
My answer builds on Drew Chapin's answer, but with changes to get the result using a single script:
use <add_database_here>;
Declare @val Varchar(MAX);
Select @val = COALESCE(@val + str, str) From
(SELECT
'(CASE WHEN '+COLUMN_NAME+' IS NULL THEN 1 ELSE 0 END) +' str
FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = '<add table name here>'
) t1 -- getting the column names and adding the CASE WHEN to replace NULLs with zeros or ones
Select @val = SUBSTRING(@val,1,LEN(@val) - 1) -- removing the trailing plus sign
Select @val = 'SELECT <add_identity_column_here>, ' + @val + ' AS NullCount FROM <add table name here>' -- adding the SELECT for the identity column, the alias for the null count column, and the FROM
EXEC (@val) -- executing the resulting SQL
With ORACLE:
select <number_of_columns> - json_value(json_array(<comma-separated list of columns>), '$.size()') from your_table
json_array builds an array containing only the non-null columns, and the json_value expression gives you the size of that array.
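For example, reusing the three-column TEST table from the PL/SQL answer above (a sketch; the column list still has to be written out by hand, and the size() item method requires a reasonably recent Oracle release):
select 3 - json_value(json_array(a, b, c), '$.size()') as null_count
from test;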
There isn't a straightforward way of doing so like there would be with counting rows. Basically, you have to enumerate all the columns that might be null in one expression.
So for a table with possibly null columns a, b, c, you could do this:
SELECT key_column,
       (CASE WHEN a IS NULL THEN 1 ELSE 0 END)
     + (CASE WHEN b IS NULL THEN 1 ELSE 0 END)
     + (CASE WHEN c IS NULL THEN 1 ELSE 0 END) AS null_col_count
FROM my_table

Computed column should result to string

Here is a snap of my database.
Both col1 and col2 are declared as int.
My ComputedColumn currently adds columns 1 and 2, as follows...
col1   col2   ComputedColumn
1      2      3
4      1      5
Instead of this, my ComputedColumn should join columns 1 and 2 (including the '-' character in the middle), as follows...
col1   col2   ComputedColumn
1      2      1-2
4      1      4-1
So, what is the correct syntax?
You're probably defining your computed column as col1+col2. Try CAST(col1 AS NVARCHAR(MAX))+'-'+CAST(col2 AS NVARCHAR(MAX)) instead.
Or if you prefer, you can replace NVARCHAR(MAX) with NVARCHAR(10) or a different length of your choice.
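If the table already exists, the same expression can be added as a computed column with ALTER TABLE; a minimal sketch (dbo.MyTable is a placeholder name):
ALTER TABLE dbo.MyTable
    ADD ComputedColumn AS CAST(col1 AS NVARCHAR(10)) + '-' + CAST(col2 AS NVARCHAR(10));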
create table TableName
(
col1 int,
col2 int,
ComputedColumn as Convert(varchar, col1) + '-' + Convert(varchar, col2)
)
Bear in mind that if either value is null then the result of ComputedColumn will also be null (using the default collation and settings)
simple:
SELECT ComputedColumn = convert(varchar, col1) + '-' + convert(varchar, col2)
FROM Table
SELECT col1, col2, (CONVERT(varchar(11), col1) + '-' + CONVERT(varchar(11), col2)) as ComputedColumn
"+" is both the addition and the concatenation operator. Because col1 and col2 are int, they need an explicit conversion to a string type; otherwise SQL Server would try to convert '-' to a number and the query would fail.
First create the table in design mode.
Add 2 columns, col1 and col2.
Add another column ComputedColumn and set its computed-column property.
Alternatively, you can use the following script:
CREATE TABLE [dbo].[tbl](
[col1] [varchar](50) NOT NULL,
[col2] [varchar](50) NOT NULL,
[ComputedColumn] AS ((CONVERT([varchar],[col1],(0))+'-')+CONVERT([varchar],[col2],(0)))
)