Hive SQL: dynamically get null column counts from a table

I am using the DataStax + Spark integration and the Spark SQL Thrift server, which gives me a Hive SQL interface to query the tables in Cassandra.
The tables in my database are created dynamically. What I want is a count of the null values in each column of a table, given just the table name.
I can get the column names using describe database.table, but in Hive SQL, how do I use that output in another select query that counts the nulls in all the columns?
Update 1: Traceback with Dudu's solution
Error running query: TExecuteStatementResp(status=TStatus(errorCode=0,
errorMessage="org.apache.spark.sql.AnalysisException: Invalid usage of
'*' in explode/json_tuple/UDTF;", sqlState=None,
infoMessages=["org.apache.hive.service.cli.HiveSQLException:org.apache.spark.sql.AnalysisException:
Invalid usage of '*' in explode/json_tuple/UDTF;:16:15",
'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute:SparkExecuteStatementOperation.scala:258',
'org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation:runInternal:SparkExecuteStatementOperation.scala:152',
'org.apache.hive.service.cli.operation.Operation:run:Operation.java:257',
'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:388',
'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:369',
'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:262',
'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:437',
'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1313',
'org.apache.hive.service.cli.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1298',
'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39',
'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39',
'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56',
'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286',
'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1142',
'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:617',
'java.lang.Thread:run:Thread.java:745'], statusCode=3),
operationHandle=None)

In the following solution there is no need to deal with each column separately.
The result is a column index and the number of null values in that column.
You can later join it by the column index to information retrieved from the metastore.
One limitation is that strings containing the exact text null will be counted as nulls.
Demo
The CTE (mytable, as defined by with mytable as) can obviously be replaced by an actual table.
with mytable as
(
select stack
(
5
,1 ,1.2 ,date '2017-06-21' ,null
,2 ,2.3 ,null ,null
,3 ,null ,null ,'hello'
,4 ,4.5 ,null ,'world'
,5 ,null ,date '2017-07-22' ,null
) as (id,amt,dt,txt)
)
select pe.pos as col_index
,count(case when pe.val='null' then 1 end) as nulls_count
from mytable t lateral view posexplode (split(printf(concat('%s',repeat('\u0001%s',field(unhex(1),t.*,unhex(1))-2)),t.*),'\\x01')) pe
group by pe.pos
;
+-----------+-------------+
| col_index | nulls_count |
+-----------+-------------+
| 0 | 0 |
| 1 | 2 |
| 2 | 3 |
| 3 | 3 |
+-----------+-------------+

Instead of describe database.table, you can use
Select column_name from system_schema.columns where keyspace_name='YOUR KEYSPACE' and table_name='YOUR TABLE'
There is also a column called kind in the above table, with values like partition_key, clustering, and regular.
The columns of kind partition_key and clustering will not have null values.
For the other columns you can use
select sum(CASE WHEN col1 is NULL THEN 1 ELSE 0 END) as col1_cnt, sum(CASE WHEN col2 is NULL THEN 1 ELSE 0 END) as col2_cnt from table1;
You can also try the query below (not tried myself):
SELECT COUNT(*)-COUNT(col1) AS A, COUNT(*)-COUNT(col2) AS B, COUNT(*)-COUNT(col3) AS C
FROM YourTable;
Maybe for the above query you can store the row count in a variable instead of computing COUNT(*) every time.
Note: system_schema.columns is a Cassandra table, and the Cassandra user needs read permission on it.
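The COUNT(*) - COUNT(col) idea can be driven dynamically from the catalog. Below is a minimal Python sketch, using sqlite3 and PRAGMA table_info purely as stand-ins for the Spark Thrift connection and DESCRIBE / system_schema.columns (the table and data mirror the demo above):

```python
import sqlite3

# Stand-in demo: sqlite3 instead of the Spark SQL Thrift server,
# with the same sample data as the Hive demo above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, amt REAL, dt TEXT, txt TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        (1, 1.2, "2017-06-21", None),
        (2, 2.3, None, None),
        (3, None, None, "hello"),
        (4, 4.5, None, "world"),
        (5, None, "2017-07-22", None),
    ],
)

# Fetch the column names from the catalog (PRAGMA here; in Hive/Cassandra
# you would read DESCRIBE or system_schema.columns instead).
cols = [row[1] for row in conn.execute("PRAGMA table_info(t)")]

# COUNT(col) skips NULLs, so COUNT(*) - COUNT(col) is the null count.
select_list = ", ".join(
    f'COUNT(*) - COUNT("{c}") AS "{c}_nulls"' for c in cols
)
null_counts = conn.execute(f"SELECT {select_list} FROM t").fetchone()
print(dict(zip(cols, null_counts)))  # → {'id': 0, 'amt': 2, 'dt': 3, 'txt': 3}
```

The two-step shape (catalog query, then a generated aggregate query) is the part that carries over; only the catalog source changes per database.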

Related

Compare two rows (both with different ID) & check if their column values are exactly the same. All rows & columns are in the same table

I have a table named "ROSTER" and in this table I have 22 columns.
I want to query and compare any two rows of that table, checking whether each column's values in those two rows are exactly the same. The ID column always differs between rows, so I will exclude it from the comparison and only use it to identify which rows to compare.
If all column values are the same: Either just display nothing (I prefer this one) or just return the 2 rows as it is.
If there are some column values not the same: Either display those column names only or display both the column name and its value (I prefer this one).
Example:
ROSTER Table:
+----+------+------+
| ID | NAME | TIME |
+----+------+------+
| 1  | N1   | 0900 |
| 2  | N1   | 0801 |
+----+------+------+
Output:
+----+------+
| ID | TIME |
+----+------+
| 1  | 0900 |
| 2  | 0801 |
+----+------+
OR
Display "TIME"
Note: Actually I'm okay with whatever result or way of output as long as I can know in any way that the 2 rows are not the same.
What are the possible ways to do this in SQL Server?
I am using Microsoft SQL Server Management Studio 18, Microsoft SQL Server 2019-15.0.2080.9
Please try the following solution based on the ideas of John Cappelletti. All credit goes to him.
SQL
-- DDL and sample data population, start
DECLARE @roster TABLE (ID INT PRIMARY KEY, NAME VARCHAR(10), TIME CHAR(4));
INSERT INTO @roster (ID, NAME, TIME) VALUES
(1,'N1','0900'),
(2,'N1','0801');
-- DDL and sample data population, end
DECLARE @source INT = 1
, @target INT = 2;
SELECT id AS source_id, @target AS target_id
,[key] AS [column]
,source_Value = MAX( CASE WHEN Src=1 THEN Value END)
,target_Value = MAX( CASE WHEN Src=2 THEN Value END)
FROM (
SELECT Src=1
,id
,B.*
FROM @roster AS A
CROSS APPLY ( SELECT [Key]
,Value
FROM OPENJSON( (SELECT A.* FOR JSON PATH, WITHOUT_ARRAY_WRAPPER, INCLUDE_NULL_VALUES))
) AS B
WHERE id=@source
UNION ALL
SELECT Src=2
,id = @source
,B.*
FROM @roster AS A
CROSS APPLY ( SELECT [Key]
,Value
FROM OPENJSON( (SELECT A.* FOR JSON PATH, WITHOUT_ARRAY_WRAPPER, INCLUDE_NULL_VALUES))
) AS B
WHERE id=@target
) AS A
GROUP BY id, [key]
HAVING MAX(CASE WHEN Src=1 THEN Value END)
<> MAX(CASE WHEN Src=2 THEN Value END)
AND [key] <> 'ID' -- exclude this PK column
ORDER BY id, [key];
Output
+-----------+-----------+--------+--------------+--------------+
| source_id | target_id | column | source_Value | target_Value |
+-----------+-----------+--------+--------------+--------------+
| 1 | 2 | TIME | 0900 | 0801 |
+-----------+-----------+--------+--------------+--------------+
A general approach here might be to just aggregate over the entire table and report the state of the counts:
SELECT
CASE WHEN COUNT(DISTINCT ID) = COUNT(*) THEN 'Yes' ELSE 'No' END AS [ID same],
CASE WHEN COUNT(DISTINCT NAME) = COUNT(*) THEN 'Yes' ELSE 'No' END AS [NAME same],
CASE WHEN COUNT(DISTINCT TIME) = COUNT(*) THEN 'Yes' ELSE 'No' END AS [TIME same]
FROM yourTable;
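If you can post-process in application code instead, the same column-by-column comparison is a few lines of Python. This sqlite3 sketch (an illustration, not the OPENJSON query above; diff_rows is a hypothetical helper) reports the differing columns with both values:

```python
import sqlite3

# Same sample ROSTER data as in the question, in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE roster (ID INTEGER PRIMARY KEY, NAME TEXT, TIME TEXT)")
conn.executemany("INSERT INTO roster VALUES (?, ?, ?)",
                 [(1, "N1", "0900"), (2, "N1", "0801")])
conn.row_factory = sqlite3.Row  # rows become dict-like, keyed by column name

def diff_rows(source_id, target_id):
    """Return {column: (source_value, target_value)} for every column that
    differs between the two rows, excluding the ID column."""
    src = dict(conn.execute("SELECT * FROM roster WHERE ID = ?", (source_id,)).fetchone())
    tgt = dict(conn.execute("SELECT * FROM roster WHERE ID = ?", (target_id,)).fetchone())
    return {col: (src[col], tgt[col])
            for col in src
            if col != "ID" and src[col] != tgt[col]}

print(diff_rows(1, 2))  # → {'TIME': ('0900', '0801')}
```

An empty dict means the two rows are identical apart from ID, which matches the "display nothing when equal" preference in the question.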

MS SQL: Transform delimited string with key-value pairs into table where keys are column names

I have an MS SQL table that has a message field containing a string of key-value pairs in comma-delimited format. Example:
+----+-----------+-----------------------------------+
| id | date      | message                           |
+----+-----------+-----------------------------------+
| 1  | 11-5-2021 | species=cat,color=black,says=meow |
+----+-----------+-----------------------------------+
I need to read the data from the table's message field and insert it into a table where the keys are column names.
Format of the strings:
species=cat,color=black,says=meow
And this should be transformed into a table as follows:
+---------+-------+------+
| species | color | says |
+---------+-------+------+
| cat     | black | meow |
+---------+-------+------+
The order of key value pairs is not fixed in the message. Message can also contain additional keys that should be ignored.
How can I achieve this using MS SQL?
It is much easier to implement using JSON.
It will work from SQL Server 2016 onwards.
This way all the scenarios are taken into account. I added them to the DDL and sample data population section in the T-SQL.
The order of key value pairs is not fixed in the message. Message can
also contain additional keys that should be ignored.
SQL
-- DDL and sample data population, start
DECLARE @tbl TABLE (ID INT IDENTITY PRIMARY KEY, [Date] DATE, Message VARCHAR(500));
INSERT INTO @tbl VALUES
('2021-05-01', 'species=cat,color=black,says=meow'),
('2021-05-11', 'species=dog,says=bark,comment=wow,color=white');
-- DDL and sample data population, end
WITH rs AS
(
SELECT *
, '[{"' + REPLACE(REPLACE(Message
, '=', '":"')
, ',', '","') + '"}]' AS jsondata
FROM @tbl
)
SELECT rs.ID, rs.Date, report.*
FROM rs
CROSS APPLY OPENJSON(jsondata)
WITH
(
[species] VARCHAR(10) '$.species'
, [color] VARCHAR(10) '$.color'
, [says] VARCHAR(30) '$.says'
) AS report;
Output
+----+------------+---------+-------+------+
| ID | Date | species | color | says |
+----+------------+---------+-------+------+
| 1 | 2021-05-01 | cat | black | meow |
| 2 | 2021-05-11 | dog | white | bark |
+----+------------+---------+-------+------+
You can use string_split() and some string operations:
select t.*, ss.*
from t cross apply
(select max(case when s.value like 'color=%'
then stuff(s.value, 1, 6, '')
end) as color,
max(case when s.value like 'says=%'
then stuff(s.value, 1, 5, '')
end) as says
from string_split(t.message, ',') s
) ss
Assuming you are using a fully supported version of SQL Server you could do something like this:
SELECT MAX(CASE PN.ColumnName WHEN 'species' THEN PN.ColumnValue END) AS Species,
MAX(CASE PN.ColumnName WHEN 'color' THEN PN.ColumnValue END) AS Color,
MAX(CASE PN.ColumnName WHEN 'says' THEN PN.ColumnValue END) AS Says
FROM (VALUES(1,CONVERT(date,'20210511'),'species=cat,color=black,says=meow'))V(id,date,message)
CROSS APPLY STRING_SPLIT(V.message,',') SS
CROSS APPLY (VALUES(PARSENAME(REPLACE(SS.[value],'=','.'),2),PARSENAME(REPLACE(SS.[value],'=','.'),1)))PN(ColumnName, ColumnValue);
Hopefully the reason you are doing this exercise is to normalise your design. If it isn't, I suggest you do.
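Outside the database, this transformation is just dictionary parsing. The plain-Python sketch below mirrors what the OPENJSON and STRING_SPLIT answers do, including dropping unexpected keys (WANTED and parse_message are hypothetical names for illustration):

```python
# Split the message on commas, then each part on the first '=',
# keep only the columns of interest, and ignore any extra keys
# (the 'comment' key in the second example is dropped).
WANTED = ("species", "color", "says")

def parse_message(message):
    pairs = dict(part.split("=", 1) for part in message.split(","))
    # Missing keys come back as None, matching SQL NULL semantics.
    return {key: pairs.get(key) for key in WANTED}

print(parse_message("species=cat,color=black,says=meow"))
# → {'species': 'cat', 'color': 'black', 'says': 'meow'}
print(parse_message("species=dog,says=bark,comment=wow,color=white"))
# → {'species': 'dog', 'color': 'white', 'says': 'bark'}
```

Note that, like the JSON rewrite in the accepted answer, this assumes values contain no literal commas or equals signs.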

Pivot in SQL: count not working as expected

I have, in my Oracle Responsys database, a table that contains records with, among other fields, two variables:
status
location_id
I want to count the number of records grouped by status and location_id, and display it as a pivot table.
This seems to be the exact example that appears here
But when I use the following request :
select * from
(select status,location_id from $a$ )
pivot (count(status)
for location_id in (0,1,2,3,4)
) order by status
The values that appear in the pivot table are just the column names.
Output:
status 0 1 2 3 4
-1 0 1 2 3 4
1 0 1 2 3 4
2 0 1 2 3 4
3 0 1 2 3 4
4 0 1 2 3 4
5 0 1 2 3 4
I also tried the following:
select * from
(select status,location_id , count(*) as nbreports
from $a$ group by status,location_id )
pivot (sum(nbreports)
for location_id in (0,1,2,3,4)
) order by status
but it gives me the same result.
select status,location_id , count(*) as nbreports
from $a$
group by status,location_id
will of course give me the values I want, but displayed as rows rather than as a pivot table.
How can I get a pivot table with, in each cell, the number of records having the status of the row and the location of the column?
Example data:
CUSTOMER,STATUS,LOCATION_ID
1,-1,1
2,1,1
3,2,1
4,3,0
5,4,2
6,5,3
7,3,4
The table data types:
CUSTOMER Text Field (to 25 chars)
STATUS Text Field (to 25 chars)
LOCATION_ID Number Field
Please check whether my understanding of your requirement is correct; you can do the same the other way round for the location column.
create table test(
status varchar2(2),
location number
);
insert into test values('A',1);
insert into test values('A',2);
insert into test values('A',1);
insert into test values('B',1);
insert into test values('B',2);
select * from test;
select status,location,count(*)
from test
group by status,location;
select * from (
select status,location
from test
) pivot(count(*) for (status) in ('A' as STATUS_A,'B' as STATUS_B))
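When PIVOT itself misbehaves, conditional aggregation is a portable fallback that produces the same status-by-location grid. This sketch uses sqlite3 only as a stand-in for the Oracle/Responsys environment, with the question's sample data:

```python
import sqlite3

# Sample data from the question: (customer, status, location_id).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (customer INTEGER, status TEXT, location_id INTEGER)")
conn.executemany(
    "INSERT INTO reports VALUES (?, ?, ?)",
    [(1, "-1", 1), (2, "1", 1), (3, "2", 1), (4, "3", 0),
     (5, "4", 2), (6, "5", 3), (7, "3", 4)],
)

# One counted column per location_id; in SQLite the boolean
# (location_id = n) evaluates to 0/1, so SUM() counts matches.
loc_cols = ", ".join(
    f"SUM(location_id = {loc}) AS loc_{loc}" for loc in range(5)
)
rows = conn.execute(
    f"SELECT status, {loc_cols} FROM reports GROUP BY status ORDER BY status"
).fetchall()
for row in rows:
    print(row)  # first row → ('-1', 0, 1, 0, 0, 0)
```

In Oracle the equivalent per-cell expression would be COUNT(CASE WHEN location_id = n THEN 1 END); the grouping logic is identical.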

Oracle SQL - return record only if colB is the same for all of colA

I have a table like the following (there is, of course, other data in the table):
Col A Col B
1 Red
1 Red
2 Blue
2 Green
3 Black
I am trying to return a value for Col A only when ALL of its Col B values match, otherwise return null.
This will be used as part of another SQL statement that will be passed the Col A value, i.e.
Select * from Table where Col A = 1
I need to return the value in Col B. The correct result for the above table would be Red, Black.
Any ideas?
How about this?
SQL Fiddle
Oracle 11g R2 Schema Setup:
create table t( id number, color varchar2(20));
insert into t values(1,'RED');
insert into t values(1,'RED');
insert into t values(2,'BLUE');
insert into t values(2,'GREEN');
insert into t values(3,'BLACK');
Query 1:
select color from t where id in (
select id
from t
group by id having min(color) = max(color) )
group by color
Results:
| COLOR |
|-------|
| RED |
| BLACK |
If you just want the values in A (rather than each row), then use group by:
select a
from table t
group by a
having min(b) = max(b);
Note: this ignores NULL values. If you want to treat them as an additional value, then add another condition:
select a
from table t
group by a
having min(b) = max(b) and count(*) = count(b);
It is also tempting to use count(distinct). In general, though, count(distinct) requires more processing effort than a min() and a max().
You can use a case statement.
select cola,
case when max(colb) = min(colb) and count(*) = count(colb) then max(colb)
end as colb
from tablename
group by cola
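A quick way to convince yourself that the MIN(b) = MAX(b) trick works (sqlite3 here purely as a runnable stand-in; the HAVING clause itself is portable to Oracle):

```python
import sqlite3

# Sample data from the question: ids 1 and 3 have a single distinct
# color, id 2 has two.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(1, "Red"), (1, "Red"), (2, "Blue"), (2, "Green"), (3, "Black")])

# MIN(b) = MAX(b) holds only when every non-NULL b in the group is equal;
# COUNT(*) = COUNT(b) additionally rejects groups containing NULLs.
rows = conn.execute("""
    SELECT a
    FROM t
    GROUP BY a
    HAVING MIN(b) = MAX(b) AND COUNT(*) = COUNT(b)
    ORDER BY a
""").fetchall()
print([r[0] for r in rows])  # → [1, 3]
```

Groups 1 and 3 survive; group 2 (Blue/Green) is filtered out, matching the expected Red, Black result.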
SQL Fiddle
Oracle 11g R2 Schema Setup:
create table t( id number, color varchar2(20));
insert into t values(1,'RED');
insert into t values(1,'RED');
insert into t values(2,'BLUE');
insert into t values(2,'GREEN');
insert into t values(3,'BLACK');
Query 1:
select id
from t
group by id having min(color) = max(color)
Results:
| ID |
|----|
| 1 |
| 3 |
Hope this is what you were looking for. :)

SQL count one field two times in select with different parameters

I'd like my query to count one column twice in the select, based on its value. For example:
input: table
id | type
-------------|-------------
1 | 1
2 | 1
3 | 2
4 | 2
5 | 2
output: query (in 1 row, not two):
countfirst = 2 (two times 1)
countsecond = 3 (three times 2)
A default count in a select counts all rows in the query. But I'd like to count rows based on a value without limiting the query. When using, for example, WHERE type = '1', type 2 gets filtered out and can no longer be counted.
Is there a solution for this case in SQL?
--- EXAMPLE USE (the situation above is simplified but the case is the same) ---
With one query I get all cars from a table, grouped by type. There are two type signs: yellow (1 in the db) and grey (2 in the db). So in that query I have the following output:
Renault - ten times found - two yellow signs - eight grey signs
Create a table, script is given below.
CREATE TABLE [dbo].[temptbl](
[id] [int] NULL,
[type] [int] NULL
) ON [PRIMARY]
Execute the insert script as
insert into [temptbl] values(1,1)
insert into [temptbl] values(2,1)
insert into [temptbl] values(3,2)
insert into [temptbl] values(4,2)
insert into [temptbl] values(5,2)
Then execute the query.
;WITH cte as(
SELECT [type], Count([type]) cnt
FROM temptbl
GROUP BY [type]
)
SELECT * FROM cte
pivot (Sum([cnt]) for [type] in ([1],[2])) as AvgIncomePerDay
You can use the GROUP BY clause as Mureinik suggested, but with the addition of a WHERE clause to filter the results.
Below shows the results for type = 1 (assuming type is an INT):
SELECT type, COUNT(*) AS NoOfRecords
FROM table
WHERE type IN (1)
GROUP BY type
So if we wanted 1 and 2 we can use:
SELECT type, COUNT(*) AS NoOfRecords
FROM table
WHERE type IN (1, 2)
GROUP BY type
Lastly, that IN statement can pull type from another query:
SELECT type, COUNT(*) AS NoOfRecords
FROM table
WHERE type IN (SELECT type FROM someOtherTable)
GROUP BY type
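The conditional-count approach gives both totals in a single row, which is what the question asked for. A sqlite3 sketch with the question's data (the countfirst/countsecond names come from the question; any ANSI database supports this CASE form):

```python
import sqlite3

# Sample data from the question: types 1 and 2.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (id INTEGER, type INTEGER)")
conn.executemany("INSERT INTO cars VALUES (?, ?)",
                 [(1, 1), (2, 1), (3, 2), (4, 2), (5, 2)])

# One row, two counts: no WHERE clause, so neither type filters
# the other out of the query.
countfirst, countsecond = conn.execute("""
    SELECT SUM(CASE WHEN type = 1 THEN 1 ELSE 0 END) AS countfirst,
           SUM(CASE WHEN type = 2 THEN 1 ELSE 0 END) AS countsecond
    FROM cars
""").fetchone()
print(countfirst, countsecond)  # → 2 3
```

Adding a GROUP BY (e.g. on car make) turns this into the "Renault - two yellow - eight grey" report from the example, one row per make.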