I want to load an XML file into a Hive column, but a SELECT query on the Hive table returns NULLs.
Can anyone help?
Create table statement
CREATE TABLE `clobtest_h`(
`id` double,
`subject` string,
`body` string,
`purge_id` double,
`purge_date` timestamp,
`s_retention_applied` string,
`d_primary_column` double)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
'hdfs://nameservice1/user/hive/warehouse/support.db/clobtest_h'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='false',
'last_modified_by'='root',
'last_modified_time'='1488897940',
'numFiles'='0',
'numRows'='-1',
'rawDataSize'='-1',
'totalSize'='0',
'transient_lastDdlTime'='1488897940')
Insert query
insert into clobtest_h values(2,'Testing issue','<?xml version="1.0"?>
<?xml-stylesheet href="catalog.xsl" type="text/xsl"?>
<!DOCTYPE catalog SYSTEM "catalog.dtd">
<catalog>
<product description="Cardigan Sweater" product_image="cardigan.jpg">
<catalog_item gender="Mens">
<item_number>QWZ5671</item_number>
<price>39.95</price>
<size description="Medium">
<color_swatch image="red_cardigan.jpg">Red</color_swatch>
<color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch>
</size>
<size description="Large">
<color_swatch image="red_cardigan.jpg">Red</color_swatch>
<color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch>
</size>
</catalog_item>
<catalog_item gender="Womens">
<item_number>RRX9856</item_number>
<price>42.50</price>
<size description="Small">
<color_swatch image="red_cardigan.jpg">Red</color_swatch>
<color_swatch image="navy_cardigan.jpg">Navy</color_swatch>
<color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch>
</size>
<size description="Medium">
<color_swatch image="red_cardigan.jpg">Red</color_swatch>
<color_swatch image="navy_cardigan.jpg">Navy</color_swatch>
<color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch>
<color_swatch image="black_cardigan.jpg">Black</color_swatch>
</size>
<size description="Large">
<color_swatch image="navy_cardigan.jpg">Navy</color_swatch>
<color_swatch image="black_cardigan.jpg">Black</color_swatch>
</size>
<size description="Extra Large">
<color_swatch image="burgundy_cardigan.jpg">Burgundy</color_swatch>
<color_swatch image="black_cardigan.jpg">Black</color_swatch>
</size>
</catalog_item>
</product>
</catalog>',1234.0,'2017-03-07 20:15:04','N',6.0)
The SELECT query on the table returns NULLs after the first line:
"select * from support.clobtest_h"
Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/jars/hive-common-1.1.0-cdh5.6.0.jar!/hive-log4j.properties
OK
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/jars/parquet-pig-bundle-1.5.0-cdh5.6.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/jars/parquet-format-2.1.0-cdh5.6.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/jars/parquet-hadoop-bundle-1.5.0-cdh5.6.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/jars/hive-exec-1.1.0-cdh5.6.0.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/jars/hive-jdbc-1.1.0-cdh5.6.0-standalone.jar!/shaded/parquet/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [shaded.parquet.org.slf4j.helpers.NOPLoggerFactory]
2.0 Testing issue <?xml version="1.0"?> NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL NULL
NULL 1234.0 2017-03-07 20:15:04 NULL NULL NULL NULL
Time taken: 1.878 seconds, Fetched: 42 row(s)
Mar 8, 2017 1:37:37 PM WARNING: parquet.hadoop.ParquetRecordReader: Can not initialize counter due to context is not a instance of TaskInputOutputContext, but is org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Mar 8, 2017 1:37:37 PM INFO: parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 42 records.
Mar 8, 2017 1:37:37 PM INFO: parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
Mar 8, 2017 1:37:37 PM INFO: parquet.hadoop.InternalParquetRecordReader: block read in memory in 21 ms. row count = 42
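Setting the property below in hive-site.xml resolved the issue. The 42 fetched rows match the 42 lines of the inserted text, which suggests the newlines embedded in the XML were being treated as row delimiters.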
<property>
<name>hive.query.result.fileformat</name>
<value>SequenceFile</value>
</property>
I have a query that returns one ApplicationId associated with multiple dates. I want the query to count all occurrences of an ApplicationId as a single count for reporting purposes.
The code is currently:
SELECT
ap.ApplicationId,
ql.AssignedDate,
ql.UnassignedDate,
ql.CompletedDate,
d.CompletedDate AS FinalDate
FROM
Application ap
JOIN
QueueLog ql ON ap.ApplicationId=ql.ApplicationId
ORDER BY
ql.AssignedDate
Current results are:
|ApplicationId|AssignedDate|UnassignedDate|CompletedDate|FinalDate|
|:---:|:---:|:---:|:---:|:---:|
|2765201|2022-02-25 09:55:28.210|NULL|NULL|NULL|
|2765201|2022-02-25 09:55:28.167|NULL|NULL|NULL|
|2765205|2022-02-25 09:55:18.580|NULL|NULL|NULL|
|2765205|2022-02-25 09:55:18.567|NULL|NULL|NULL|
|2765206|2022-02-25 09:55:13.097|NULL|NULL|NULL|
|2765206|2022-02-25 09:55:13.067|NULL|NULL|NULL|
|2765212|2022-02-25 09:54:59.957|NULL|NULL|NULL|
|2765212|2022-02-25 09:54:59.940|NULL|NULL|NULL|
|2765219|2022-02-25 09:54:49.480|NULL|NULL|NULL|
|2765219|2022-02-25 09:54:49.467|NULL|NULL|NULL|
The desired output counts every occurrence of an ApplicationId in the QueueLog table as one when it meets these criteria:
ql.AssignedDate <= GETDATE()
AND (ql.CompletedDate IS NULL OR ql.UnassignedDate IS NULL)
AND d.CompletedDate IS NULL
EXPECTED RESULT
For ApplicationIds with multiple occurrences, return the occurrence with the most recent date; for those with a single occurrence, return that row. For example, given
|ApplicationId|AssignedDate|UnassignedDate|CompletedDate|FinalDate|
|:---:|:---:|:---:|:---:|:---:|
|2765201|2022-02-25 09:55:28.210|NULL|NULL|NULL|
|2765201|2022-02-25 09:55:28.167|NULL|NULL|NULL|
only the first line should be returned.
Try out this SQL Query:
SELECT
ap.ApplicationId,
COUNT(ap.ApplicationId) AS ApplicationCount
FROM
Application ap
JOIN
QueueLog ql ON ap.ApplicationId=ql.ApplicationId
WHERE
ql.AssignedDate <= GETDATE()
AND (ql.CompletedDate IS NULL OR ql.UnassignedDate IS NULL)
AND ql.CompletedDate IS NULL
GROUP BY ap.ApplicationId
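If the goal is a single report total in which each application counts once, COUNT(DISTINCT ...) may be closer to what is asked. The following is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server, so GETDATE() is swapped for datetime('now'); the sample rows are copied from the question, and the `d.CompletedDate` filter is left out because the table behind the `d` alias is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Application (ApplicationId INTEGER PRIMARY KEY);
    CREATE TABLE QueueLog (
        ApplicationId  INTEGER,
        AssignedDate   TEXT,
        UnassignedDate TEXT,
        CompletedDate  TEXT
    );
    INSERT INTO Application VALUES (2765201), (2765205);
    INSERT INTO QueueLog VALUES
        (2765201, '2022-02-25 09:55:28.210', NULL, NULL),
        (2765201, '2022-02-25 09:55:28.167', NULL, NULL),
        (2765205, '2022-02-25 09:55:18.580', NULL, NULL),
        (2765205, '2022-02-25 09:55:18.567', NULL, NULL);
""")

# COUNT(*) counts every joined QueueLog row;
# COUNT(DISTINCT ...) counts each ApplicationId only once.
row_count, app_count = conn.execute("""
    SELECT COUNT(*), COUNT(DISTINCT ap.ApplicationId)
    FROM Application ap
    JOIN QueueLog ql ON ap.ApplicationId = ql.ApplicationId
    WHERE ql.AssignedDate <= datetime('now')
      AND (ql.CompletedDate IS NULL OR ql.UnassignedDate IS NULL)
""").fetchone()

print(row_count, app_count)
```

With the four sample rows, the row count is 4 while the distinct application count is 2.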
You can use row_number to assign a priority to each repeating group and select only the most recent:
select
ApplicationId,
AssignedDate,
UnassignedDate,
CompletedDate,
FinalDate
from (
select
ap.ApplicationId,
ql.AssignedDate,
ql.UnassignedDate,
ql.CompletedDate,
d.CompletedDate as FinalDate,
Row_Number() over(partition by ap.ApplicationId order by ql.AssignedDate desc) rn
from Application ap
join QueueLog ql on ap.ApplicationId=ql.ApplicationId
)t
where rn=1
order by AssignedDate
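The row_number approach can be tried out without a SQL Server instance. This sketch assumes Python's built-in sqlite3 module with SQLite 3.25 or later (needed for window functions) and uses only a cut-down QueueLog table with rows taken from the question; it is an illustration of the technique, not the poster's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE QueueLog (ApplicationId INTEGER, AssignedDate TEXT);
    INSERT INTO QueueLog VALUES
        (2765201, '2022-02-25 09:55:28.210'),
        (2765201, '2022-02-25 09:55:28.167'),
        (2765205, '2022-02-25 09:55:18.580'),
        (2765205, '2022-02-25 09:55:18.567');
""")

# Number the rows of each ApplicationId from newest to oldest,
# then keep only the newest one (rn = 1).
latest = conn.execute("""
    SELECT ApplicationId, AssignedDate
    FROM (
        SELECT ql.ApplicationId,
               ql.AssignedDate,
               ROW_NUMBER() OVER (
                   PARTITION BY ql.ApplicationId
                   ORDER BY ql.AssignedDate DESC
               ) AS rn
        FROM QueueLog ql
    ) t
    WHERE rn = 1
    ORDER BY AssignedDate
""").fetchall()

for row in latest:
    print(row)
```

Each ApplicationId appears once, with its most recent AssignedDate.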
|AgencyId|SourceType|SourceCode|PropertyState|Code|
|:---:|:---:|:---:|:---:|:---:|
|a1002|NULL|Xyz1|NULL|test1|
|a1002|NULL|Xyz2|NULL|test2|
|a1002|NULL|Xyz3|NULL|test3|
|a1002|NULL|Xyz4|NULL|test4|
|a1002|NULL|Xyz5|NULL|test5|
|a1002|NULL|Xyz6|NULL|test6|
|a1002|NULL|Xyz7|NULL|test7|
|a1002|NULL|Xyz8|NULL|test8|
|a1002|NULL|Xyz9|NULL|test9|
|a1002|NULL|Xyz10|NULL|test10|
|a1002|NULL|Xyz11|NULL|test11|
|a1002|NULL|Xyz12|NULL|test12|
|a1002|NULL|Xyz13|NULL|test13|
|a1002|NULL|Xyz14|NULL|test14|
|a1002|NULL|Xyz15|NULL|test15|
|a1002|NULL|Xyz16|NULL|test16|
|a1002|NULL|Xyz17|NULL|test17|
|a1002|NULL|Xyz18|NULL|test18|
|a1002|NULL|Xyz19|NULL|test19|
|a1002|NULL|Xyz20|NULL|test20|
|a1002|NULL|Xyz21|NULL|test21|
|a1002|NULL|Xyz22|NULL|test22|
|a1002|NULL|Xyz23|NULL|test23|
|a1002|NULL|Xyz24|NULL|test24|
|a1002|NULL|Xyz25|NULL|test25|
|a1002|NULL|Xyz26|NULL|test26|
|a1003|NULL|Xyz27|FL|test27|
|a1003|NULL|Xyz28|NULL|test28|
|a1004|NULL|NULL|NULL|test29|
|a1005|NULL|NULL|NULL|test30|
|a1006|NULL|NULL|FL|test31|
|a1006|NULL|NULL|NULL|test32|
|a1007|NULL|NULL|NULL|test33|
|a1008|B|NULL|NULL|test34|
|a1008|O|NULL|NULL|test35|
I have a table named test with 5 columns: AgencyId, SourceType, SourceCode, PropertyState, Code.
I want to write a SQL query that returns Code based on the following drill-down logic:
First match by AgencyId, SourceCode and PropertyState; if there is no match, then by AgencyId and SourceCode; if not, then by AgencyId and SourceType; if not, then by AgencyId and PropertyState; if not, then by SourceType and PropertyState; if not, then by AgencyId only.
How can I write a SQL query for this? The requirement is that I cannot write stored procedures or functions. Kindly let me know the solution.
You could translate the requirement into a series of ORs and ANDs.
For example:
DECLARE
@AgencyId varchar(8) = 'a1002',
@SourceCode varchar(8) = 'Xyz1',
@SourceType char(1) = 'B',
@PropertyState char(2) = 'FL';
SELECT *
, CASE
WHEN AgencyId = @AgencyId
AND SourceCode = @SourceCode
AND PropertyState = @PropertyState
THEN 'match 1'
WHEN AgencyId = @AgencyId
AND SourceCode = @SourceCode
THEN 'match 2'
WHEN AgencyId = @AgencyId
AND SourceType = @SourceType
THEN 'match 3'
WHEN AgencyId = @AgencyId
AND PropertyState = @PropertyState
THEN 'match 4'
WHEN SourceType = @SourceType
AND PropertyState = @PropertyState
THEN 'match 5'
WHEN AgencyId = @AgencyId
THEN 'match 6'
ELSE 'no match'
END AS Match
FROM test
WHERE (AgencyId = @AgencyId AND SourceCode = @SourceCode AND PropertyState = @PropertyState)
OR (AgencyId = @AgencyId AND SourceCode = @SourceCode)
OR (AgencyId = @AgencyId AND SourceType = @SourceType)
OR (AgencyId = @AgencyId AND PropertyState = @PropertyState)
OR (SourceType = @SourceType AND PropertyState = @PropertyState)
OR (AgencyId = @AgencyId)
But those criteria can be simplified, because every branch except the SourceType/PropertyState one already requires a matching AgencyId:
SELECT *
FROM test
WHERE (AgencyId = @AgencyId
OR (SourceType = @SourceType AND PropertyState = @PropertyState)
);
|AgencyId|SourceType|SourceCode|PropertyState|Code|
|:---:|:---:|:---:|:---:|:---:|
|a1002|null|Xyz1|FL|test1|
|a1002|null|Xyz1|null|test2|
|a1002|B|null|null|test3|
|a1002|null|null|FL|test4|
|null|B|null|FL|test5|
|a1002|null|null|null|test6|
Demo on db<>fiddle here
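To return only the Code of the best drill-down level, one option is to turn the match level into a number, sort by it, and keep the first row. The sketch below assumes Python's built-in sqlite3 module instead of SQL Server (so LIMIT 1 stands in for TOP 1) and uses the six demo rows shown above; the helper name `best_code` and the parameter names are made up for illustration. If several rows share the best level, which one is returned is not defined.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE test (
    AgencyId TEXT, SourceType TEXT, SourceCode TEXT,
    PropertyState TEXT, Code TEXT)""")
conn.executemany("INSERT INTO test VALUES (?, ?, ?, ?, ?)", [
    ("a1002", None, "Xyz1", "FL", "test1"),
    ("a1002", None, "Xyz1", None, "test2"),
    ("a1002", "B",  None,   None, "test3"),
    ("a1002", None, None,   "FL", "test4"),
    (None,    "B",  None,   "FL", "test5"),
    ("a1002", None, None,   None, "test6"),
])

# Lower MatchLevel means a more specific match; the first row after sorting wins.
QUERY = """
SELECT Code,
       CASE
         WHEN AgencyId = :agency AND SourceCode = :code AND PropertyState = :state THEN 1
         WHEN AgencyId = :agency AND SourceCode = :code THEN 2
         WHEN AgencyId = :agency AND SourceType = :stype THEN 3
         WHEN AgencyId = :agency AND PropertyState = :state THEN 4
         WHEN SourceType = :stype AND PropertyState = :state THEN 5
         WHEN AgencyId = :agency THEN 6
       END AS MatchLevel
FROM test
WHERE AgencyId = :agency
   OR (SourceType = :stype AND PropertyState = :state)
ORDER BY MatchLevel
LIMIT 1
"""

def best_code(agency, code, stype, state):
    """Return (Code, MatchLevel) for the most specific match, or None."""
    params = {"agency": agency, "code": code, "stype": stype, "state": state}
    return conn.execute(QUERY, params).fetchone()

full_match = best_code("a1002", "Xyz1", "B", "FL")
fallback = best_code("a1002", "Nope", "B", "TX")
no_match = best_code("zzz", "Nope", "X", "TX")
print(full_match, fallback, no_match)
```

The first call matches at level 1; the second has no SourceCode match and falls through to the AgencyId plus SourceType level.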
I'm trying to write a query that is sort of like a running total, but not really. I want to carry the previous weight (kg) forward and keep outputting it for each day until another weight is recorded, then keep outputting that one until the next weight is recorded. Below is an example of what I'm trying to accomplish (see the KG column).
Current results:
ENCOUNTER_ID | KG | DATE_RECORDED | CALENDAR_DT
-----------------------------------------------
100 10 2019-01-01 2019-01-01
NULL NULL NULL 2019-01-02
100 12 2019-01-03 2019-01-03
NULL NULL NULL 2019-01-04
NULL NULL NULL 2019-01-05
NULL NULL NULL 2019-01-06
100 13 2019-01-07 2019-01-07
NULL NULL NULL 2019-01-08
Desired Results:
ENCOUNTER_ID | KG | DATE_RECORDED | CALENDAR_DT
-----------------------------------------------
100 10 2019-01-01 2019-01-01
NULL 10 NULL 2019-01-02
100 12 2019-01-03 2019-01-03
NULL 12 NULL 2019-01-04
NULL 12 NULL 2019-01-05
NULL 12 NULL 2019-01-06
100 13 2019-01-07 2019-01-07
NULL 13 NULL 2019-01-08
In standard SQL, you would use lag() with the ignore nulls option:
select t.*,
coalesce(kg, lag(kg ignore nulls) over (order by calendar_dt)) as kg_filled
from t;
Not all databases support ignore nulls. But it is standard SQL and you haven't specified the database you are using.
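Where ignore nulls is not available, a different technique gives the same forward fill: a running COUNT(kg) increases only on rows that have a weight, so it labels each recorded weight together with the empty days that follow it, and MAX(kg) within that label carries the value forward. This is a sketch under assumptions: it uses Python's built-in sqlite3 module (SQLite 3.25 or later for window functions), and the table name `t` and the alias names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (kg INTEGER, calendar_dt TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [
    (10,   "2019-01-01"),
    (None, "2019-01-02"),
    (12,   "2019-01-03"),
    (None, "2019-01-04"),
    (None, "2019-01-05"),
    (None, "2019-01-06"),
    (13,   "2019-01-07"),
    (None, "2019-01-08"),
])

# COUNT(kg) skips NULLs, so grp stays constant until the next recorded weight.
# Each grp holds exactly one non-NULL kg, which MAX() then spreads over the group.
filled = conn.execute("""
    SELECT calendar_dt,
           MAX(kg) OVER (PARTITION BY grp) AS kg_filled
    FROM (
        SELECT calendar_dt,
               kg,
               COUNT(kg) OVER (ORDER BY calendar_dt) AS grp
        FROM t
    )
    ORDER BY calendar_dt
""").fetchall()

for row in filled:
    print(row)
```

Rows before the first recorded weight would fall into group 0 and stay NULL.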
A solution can be achieved by combining a CASE with a subquery that fetches the most recent non-null value, ordered by date.
See the below example using T-SQL.
create table dbo.WeightLog
(
ENCOUNTER_ID int null,
KG int null,
DATE_RECORDED date null,
CALENDAR_DT date not null
)
GO
insert into dbo.WeightLog values
(100 , 10, '2019-01-01', '2019-01-01'),
(NULL, NULL, NULL, '2019-01-02'),
(100 , 12, '2019-01-03', '2019-01-03'),
(NULL, NULL, NULL, '2019-01-04'),
(NULL, NULL, NULL, '2019-01-05'),
(NULL, NULL, NULL, '2019-01-06'),
(100 , 13, '2019-01-07', '2019-01-07'),
(NULL, NULL, NULL, '2019-01-08')
GO
select
wl.ENCOUNTER_ID,
case when wl.KG is null
then (select top 1 x.KG from dbo.WeightLog x where x.CALENDAR_DT < wl.CALENDAR_DT
and x.KG is not null order by x.CALENDAR_DT desc)
else wl.KG end as [Kg],
wl.DATE_RECORDED,
wl.CALENDAR_DT
from dbo.WeightLog wl
Results in:
ENCOUNTER_ID Kg DATE_RECORDED CALENDAR_DT
------------ ----------- ------------- -----------
100 10 2019-01-01 2019-01-01
NULL 10 NULL 2019-01-02
100 12 2019-01-03 2019-01-03
NULL 12 NULL 2019-01-04
NULL 12 NULL 2019-01-05
NULL 12 NULL 2019-01-06
100 13 2019-01-07 2019-01-07
NULL 13 NULL 2019-01-08
Note: this does not handle the case where the first record is null.
I have the 4 columns below:
empid | name | dept | ph_no
---------------------------------
123 | null | null | null
124 | mike | science | null
125 | null | physics | 789
126 | null | null | 463
127 | john | null | null
I need to merge these into a single column that lists, for each empid, only the columns that are null.
I need something like this:
empid
------------
123 is missing name,dept,ph_no
124 is missing ph_no
125 is missing name
126 is missing name,dept
127 is missing dept,ph_no
This can be done with case expressions.
select empid,empid||' is missing '||
trim(',' from
(case when name is null then 'name,' else '' end||
case when dept is null then 'dept,' else '' end||
case when ph_no is null then 'ph_no' else '' end
)
)
from tbl
I agree with Vamsi and would like just to add a where clause so the "complete" ones won't be returned.
select empid,empid||' is missing '||
rtrim(
case when name is null then 'name,' else '' end||
case when dept is null then 'dept,' else '' end||
case when ph_no is null then 'ph_no' else '' end
, ',')
from tbl
where (name is null or dept is null or ph_no is null);
You can also use the NVL2 function.
SELECT empid||' is missing '||RTRIM(NVL2(name, NULL, 'name, ') ||NVL2(dept, NULL, 'dept, ')||NVL2(ph_no, NULL, 'ph_no'), ', ') empid
FROM table_
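The CASE approach can be checked against the sample data. This sketch assumes Python's built-in sqlite3 module rather than Oracle, so rtrim(x, ',') is used to drop the trailing comma; employee 128 is a made-up complete row added only to show that the WHERE clause filters it out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (empid INTEGER, name TEXT, dept TEXT, ph_no INTEGER)")
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?, ?)", [
    (123, None,   None,      None),
    (124, "mike", "science", None),
    (125, None,   "physics", 789),
    (126, None,   None,      463),
    (127, "john", None,      None),
    (128, "sara", "math",    111),   # hypothetical row with nothing missing
])

# Each CASE contributes 'column,' when that column is NULL;
# rtrim(..., ',') removes the comma left after the last missing column.
messages = [row[0] for row in conn.execute("""
    SELECT empid || ' is missing ' ||
           rtrim(
               CASE WHEN name  IS NULL THEN 'name,'  ELSE '' END ||
               CASE WHEN dept  IS NULL THEN 'dept,'  ELSE '' END ||
               CASE WHEN ph_no IS NULL THEN 'ph_no,' ELSE '' END,
               ','
           )
    FROM tbl
    WHERE name IS NULL OR dept IS NULL OR ph_no IS NULL
    ORDER BY empid
""")]

for message in messages:
    print(message)
```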
I have the following data:
A B C D E F
NULL 1122111 NULL 0 NULL XBK
9226978 NULL 0 NULL XGI NULL
NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL
NULL NULL NULL NULL NULL NULL
Now I need to collapse that to a single row with the below results:
A B C D E F
9226978 1122111 0 0 XGI XBK
I have no idea where to get started. Please help.
SELECT MAX(A) AS A, MAX(B) AS B, MAX(C) AS C, MAX(D) AS D, MAX(E) AS E, MAX(F) AS F
FROM Table_Name
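Aggregating with MAX works here because aggregate functions skip NULLs, so each column collapses to its only non-null value. A minimal sketch with the question's data, assuming Python's built-in sqlite3 module and an illustrative table name `t`; if a column held more than one distinct non-null value, MAX would simply pick the largest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (A INTEGER, B INTEGER, C INTEGER, D INTEGER, E TEXT, F TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?, ?)", [
    (None,    1122111, None, 0,    None,  "XBK"),
    (9226978, None,    0,    None, "XGI", None),
    (None, None, None, None, None, None),
    (None, None, None, None, None, None),
    (None, None, None, None, None, None),
    (None, None, None, None, None, None),
])

# With no GROUP BY, the aggregates collapse the whole table into a single row.
collapsed = conn.execute("""
    SELECT MAX(A), MAX(B), MAX(C), MAX(D), MAX(E), MAX(F)
    FROM t
""").fetchone()

print(collapsed)
```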
Try this:-
SELECT COALESCE(A,0), COALESCE(B,0), COALESCE(C,0), COALESCE(D,0), COALESCE(E,0), COALESCE(F,0)
FROM YOUR_TABLE;