Why is my Hive table returning NULL values - hive

I have written this to create an external table:
CREATE EXTERNAL TABLE IF NOT EXISTS carss (
  maker STRING,
  model STRING,
  mileage FLOAT,
  manufacture_year INT,
  engine_displacement FLOAT,
  engine_power STRING,
  body_type STRING,
  color_slug STRING,
  stk_year FLOAT,
  transmission STRING,
  door_count INT,
  seat_count INT,
  fuel_type STRING,
  date_created DATE,
  date_last_seen DATE,
  price_eur FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/BigData/Project2/'
TBLPROPERTIES ("skip.header.line.count"="1", 'creator'='Janina', 'created_on'='2020-11-05', 'description'='dataset for classified Ads of cars in Germany and Czech Republic');
When I run the query SELECT * FROM carss LIMIT 1; it returns NULL values and garbled characters:
hive> SELECT * FROM carss LIMIT 1;
OK
��T�;��9fu7z�C�WHqd��Y�P�c��/�^�B�4���*G����Ç�ǿN������y�z~>����Ǘ�?�Oo��ӿ��������r�������������ݷ����|������N����r����������o����������2}�����x=�ʗ����/�||;������9������߯�����z~:���~��\���/��㏟�vZ)5�5��_i�4���
�erS�>�O��D��I�O����տ�?D���?�o��d��1�_V�K�����?�h����׿.��������|��<��ң^
w��X���c�������Ӕ���S���F$z��J�FywP�.����X�S��T��CM6lE9�^��j�
h� NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL NULL
Time taken: 0.059 seconds, Fetched: 1 row(s)
hive>

Is your file in /BigData/Project2/ encoded in utf-8? If not, you might need to specify the encoding of the underlying file when creating the external table:
create external table carss
...
row format
serde 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
with serdeproperties("serialization.encoding"='WINDOWS-1252')
location
...
This article might be helpful.
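For reference, the full statement from the question could be rewritten with the encoding declared through LazySimpleSerDe, roughly as below. This is only a sketch: WINDOWS-1252 is an assumption about the file's real encoding, and the field delimiter moves into SERDEPROPERTIES because ROW FORMAT DELIMITED and ROW FORMAT SERDE cannot be combined in one statement.
-- Sketch only: drop the previous definition first (it is an external table, so the data stays),
-- and adjust 'serialization.encoding' to whatever the source file actually uses.
DROP TABLE IF EXISTS carss;

CREATE EXTERNAL TABLE IF NOT EXISTS carss (
  maker STRING,
  model STRING,
  mileage FLOAT,
  manufacture_year INT,
  engine_displacement FLOAT,
  engine_power STRING,
  body_type STRING,
  color_slug STRING,
  stk_year FLOAT,
  transmission STRING,
  door_count INT,
  seat_count INT,
  fuel_type STRING,
  date_created DATE,
  date_last_seen DATE,
  price_eur FLOAT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim' = ',',
  'serialization.encoding' = 'WINDOWS-1252')
LOCATION '/BigData/Project2/'
TBLPROPERTIES ('skip.header.line.count'='1');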

Related

I can't find what is wrong with the CASE statement in SQL Server. Can someone help me?

I want the code to report that the operation should cease if an operator's workload is above a certain number of minutes per week (480, 420, or 540, depending on the operator type). SQL Server puts an underline on the first CASE and I can't understand why. Thanks!
create table HOURS (
IDHOUR nvarchar(2) not null,
STARTHOUR time not null,
ENDHOUR time not null,
constraint PK_HOURS primary key (IDHOUR)
)
GO
create table OPERATOR (
IDOPERATOR nvarchar(3) not null,
TYPE nvarchar(20) null,
constraint TYPE
check (TYPE in ('assistant1','assistant2','assistant3')),
constraint PK_OPERATOR primary key (IDOPERATOR)
)
go
insert into HOURS (idhour,starthour, endhour) values (14,'14:30','15:30');
insert into HOURS (idhour,starthour, endhour) values (1,'09:00','09:45');
insert into OPERATOR (IDOPERATOR , type) values (1,'assistant1');
insert into OPERATOR (IDOPERATOR , type) values (2,'assistant2');
insert into OPERATOR (IDOPERATOR , type) values (3,'asssistant3');
select * from OPERATOR
case
when TYPE = 'assistant1'
when
select sum(datediff(minute,starthour,endhour)) as minutes from hours > 480 then 'cease operation'
when
TYPE = 'assistant2'
select sum(datediff(minute,starthour,endhour)) as minutes from hours> 420 then 'cease operation'
when
TYPE = 'asssistant3'
select sum(datediff(minute,starthour,endhour)) as minutes from hours> 540 then 'cease operation'
else 'without effect'
end
The HOURS table seems to need a foreign key that references the OPERATOR table. Then you can use the CASE expression in the SELECT list.
For example.
Sample data:
create table OPERATOR (
IDOPERATOR int not null,
[TYPE] nvarchar(20) not null,
constraint [TYPE]
check ([TYPE] in ('assistant1','assistant2','assistant3')),
constraint PK_OPERATOR
primary key (IDOPERATOR)
);
create table HOURS (
IDHOUR int identity(1,1) not null,
IDOPERATOR int not null,
STARTHOUR time not null,
ENDHOUR time not null,
constraint PK_HOURS
primary key (IDHOUR),
constraint FK_HOURS_OPERATOR
foreign key (IDOPERATOR) references OPERATOR(IDOPERATOR)
);
insert into OPERATOR
(IDOPERATOR , [TYPE]) values
(1,'assistant1')
, (2,'assistant2')
, (3,'assistant3')
;
insert into HOURS (idoperator, starthour, endhour) values
(2,'14:30','15:30')
, (1,'09:00','17:33')
;
Query:
SELECT op.IDOPERATOR, op.[type]
,sum(datediff(minute,starthour,endhour)) as tmdiff
,case
when [type] = 'assistant1'
and sum(datediff(minute,starthour,endhour)) > 480
then 'cease operation'
when [type] = 'assistant2'
and sum(datediff(minute,starthour,endhour)) > 420
then 'cease operation'
when [type] = 'assistant3'
and sum(datediff(minute,starthour,endhour)) > 540
then 'cease operation'
when max(hr.IDOPERATOR) is null
then 'unknown'
else 'without effect'
end AS [result]
FROM OPERATOR op
LEFT JOIN HOURS hr
ON hr.IDOPERATOR = op.IDOPERATOR
GROUP BY op.IDOPERATOR, op.[type];
Results:
IDOPERATOR | type | tmdiff | result
---------: | :--------- | -----: | :--------------
1 | assistant1 | 513 | cease operation
2 | assistant2 | 60 | without effect
3 | assistant3 | null | unknown
db<>fiddle here
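A possible follow-up, not part of the original answer: if only the operators that must stop are wanted, the same aggregate conditions can move into a HAVING clause.
-- Returns only the operators whose summed weekly minutes cross their threshold.
SELECT op.IDOPERATOR, op.[type]
      ,SUM(DATEDIFF(MINUTE, starthour, endhour)) AS tmdiff
FROM OPERATOR op
JOIN HOURS hr
  ON hr.IDOPERATOR = op.IDOPERATOR
GROUP BY op.IDOPERATOR, op.[type]
HAVING (op.[type] = 'assistant1' AND SUM(DATEDIFF(MINUTE, starthour, endhour)) > 480)
    OR (op.[type] = 'assistant2' AND SUM(DATEDIFF(MINUTE, starthour, endhour)) > 420)
    OR (op.[type] = 'assistant3' AND SUM(DATEDIFF(MINUTE, starthour, endhour)) > 540);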
Extra
Not really needed for just one query, but a UDF (user-defined function) is sometimes useful when the same logic has to be reused.
IF OBJECT_ID (N'dbo.ufnEvalHourDiff', N'FN') IS NOT NULL
DROP FUNCTION dbo.ufnEvalHourDiff;
GO
--
-- A User Defined Function
--
CREATE FUNCTION dbo.ufnEvalHourDiff
(
    @OperatorType NVARCHAR(20),
    @TimeDiffMinutes INT
)
RETURNS VARCHAR(30)
AS
BEGIN
    DECLARE @result VARCHAR(30);
    IF (@OperatorType = 'assistant1' AND @TimeDiffMinutes > 480) OR
       (@OperatorType = 'assistant2' AND @TimeDiffMinutes > 420) OR
       (@OperatorType = 'assistant3' AND @TimeDiffMinutes > 540)
    BEGIN
        SET @result = 'cease operation';
    END
    ELSE
        SET @result = 'without effect';
    RETURN @result;
END;
GO
SELECT op.IDOPERATOR, op.[type]
, SUM(DATEDIFF(MINUTE, starthour, endhour)) AS tmdiff
, dbo.ufnEvalHourDiff(op.[type], SUM(DATEDIFF(MINUTE, starthour, endhour))) as result
FROM OPERATOR op
LEFT JOIN HOURS hr
ON hr.IDOPERATOR = op.IDOPERATOR
GROUP BY op.IDOPERATOR, op.[type];
GO
IDOPERATOR | type | tmdiff | result
---------: | :--------- | -----: | :--------------
1 | assistant1 | 513 | cease operation
2 | assistant2 | 60 | without effect
3 | assistant3 | null | without effect
db<>fiddle here
If the IDhour is the same as the IDoperator, you can do something like this:
select b.IDoperator, b.Type,
       case when b.Type = 'assistant1' and sum(datediff(minute, a.starthour, a.endhour)) > 480 then 'cease operation'
            when b.Type = 'assistant2' and sum(datediff(minute, a.starthour, a.endhour)) > 420 then 'cease operation'
            when b.Type = 'assistant3' and sum(datediff(minute, a.starthour, a.endhour)) > 540 then 'cease operation'
            else 'without effect' end as flag_check
from HOURS a inner join OPERATOR b on a.idhour = b.idoperator
group by b.IDoperator, b.Type;

SELECT COL HIVE SQL VALUE WHERE VALUES <5000

I'm learning Hive and I have come across a question I cannot seem to find a workable answer for. I have to extract, from a table, all of the numeric columns that ONLY contain integer values below 5000, and write them to a space-separated text file. I am familiar with creating text files and selecting rows, but not with selecting columns that meet a specific condition, so any help or guidance will be appreciated. The structure of the table is listed below; there is also an image attached showing the data in table format. For the OUTPUT I need to go through ALL the COLUMNS and RETURN ONLY the COLUMNS whose integer values are all LESS THAN 5000.
create table lineorder (
lo_orderkey int,
lo_linenumber int,
lo_custkey int,
lo_partkey int,
lo_suppkey int,
lo_orderdate int,
lo_orderpriority varchar(15),
lo_shippriority varchar(1),
lo_quantity int,
lo_extendedprice int,
lo_ordertotalprice int,
lo_discount int,
lo_revenue int,
lo_supplycost int,
lo_tax int,
lo_commitdate int,
lo_shipmode varchar(10)
)
Data in tbl format
Conditional column selection is a terrible, horrible, no good, very bad idea.
That being said, here is a demo: it packs each row's columns into a single \x01-delimited string, works out (over the whole table) which column positions contain only integer values below 5000, and then uses regexp_replace to keep just those positions, space separated.
with t as
(
select stack
(
3
,10 ,100 ,1000 ,'X' ,null
,20 ,null ,2000 ,'Y' ,200000
,30 ,300 ,3000 ,'Z' ,300000
) as (c1,c2,c3,c4,c5)
)
select regexp_replace
(
printf(concat('%s',repeat(concat(unhex(1),'%s'),field(unhex(1),t.*,unhex(1))-2)),*)
,concat('([^\\x01]*)',repeat('\\x01([^\\x01]*)',field(unhex(1),t.*,unhex(1))-2))
,c.included_columns
) as record
from t
cross join (select ltrim
(
regexp_replace
(
concat_ws(' ',sort_array(collect_set(printf('$%010d',pos+1))))
,concat
(
'( ?('
,concat_ws
(
'|'
,collect_set
(
case
when cast(pe.val as int) >= 5000
or cast(pe.val as int) is null
then printf('\\$%010d',pos+1)
end
)
)
,'))|(?<=\\$)0+'
)
,''
)
) as included_columns
from t
lateral view posexplode(split(printf(concat('%s',repeat(concat(unhex(1),'%s'),field(unhex(1),*,unhex(1))-2)),*),'\\x01')) pe
) c
+---------+
| record |
+---------+
| 10 1000 |
| 20 2000 |
| 30 3000 |
+---------+
I don't think Hive supports variable substitution inside a function, so you would have to write a shell script that executes a first query returning the required columns, assigns the result to a shell variable, builds a second query that writes the files to a local directory, and runs it via hive -e from bash.
create table t1 (x int, y int);  -- table used for the query below
Sample bash script:
cols=$(hive -e "select concat_ws(',', case when max(x) < 5000 then 'x' end, case when max(y) < 5000 then 'y' end) from t1")
query="INSERT OVERWRITE LOCAL DIRECTORY '<directory name>' ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' select $cols from t1"
hive -e "$query"

How to copy a column value from one table and store the same value into another table as a row

I have two tables: the first is firsttbl and the second is testtbl. The first table has many records, and the second table needs a copy of certain values from the first.
The original records in firsttbl are below:
| CORRELATION_ID | NAME | VALUE | TYPE | OBJECT_ID | ARCHIVE_FLAG | ARCHIVE_DATE | WF_ID | REC_TIME | KEY_ID | VALUE_UPPER |
| 43344255015b9b192916node1 | TransactionSenderID | SANMINACORT | DOCUMENT | 14804515b9b1924f9node1 | -1 | null | 1912341 | 2017-04-23 08:56:09.0 | 20 | SANMINACORT |
| 43391255115b9b192916node1 | TransactionReceiverID | 4017395800 | DOCUMENT | 14804515b9b1924f9node1 | -1 | null | 1912341 | 2017-04-23 08:56:09.0 | 21 | 4017395800 |
| 43376255215b9b192916node1 | Level | Transaction | DOCUMENT | 14804515b9b1924f9node1 | -1 | null | 1912341 | 2017-04-23 08:56:09.0 | 41 | TRANSACTION |
| 43399255315b9b192916node1 | GroupSenderID | SANMINACORT | DOCUMENT | 14804515b9b1924f9node1 | -1 | null | 1912341 | 2017-04-23 08:56:09.0 | 28 | SANMINACORT |
| 43356255415b9b192916node1 | GroupReceiverID | 4017395800 | DOCUMENT | 14804515b9b1924f9node1 | -1 | null | 1912341 | 2017-04-23 08:56:09.0 | 30 | 4017395800 |
The second table has only been created; no values are stored in it yet.
TransactionSenderID TransactionReceiverID Level GroupSenderID GroupReceiverID
The second table's column names are already stored in firsttbl, under the NAME column.
I want to copy records from firsttbl so that the five entries in its VALUE column are stored in the second table (testtbl) as a single row.
This is the query I'm trying, without success:
INSERT INTO testtbl(TransactionSenderID,TransactionReceiverID,Level,GroupSenderID,GroupReceiverID) Select value from firsttbl;
This is the error coming from PostgreSQL:
ERROR: INSERT has more target columns than expressions
LINE 1: INSERT INTO testtbl(TransactionSenderID,TransactionReceiverI...
^
********** Error **********
ERROR: INSERT has more target columns than expressions
SQL state: 42601
Character: 41
This looks like a pivot/group-aggregation question. I don't have PostgreSQL, but since the question is tagged mysql, here is a MySQL solution (if the question is really about PostgreSQL, the approach should be similar).
drop table if exists t;
create table t (
CORRELATION_ID varchar(30), NAME varchar(30),VALUe varchar(30), TYPE VARCHAR(30), OBJECT_ID VARCHAR(30), ARCHIVE_FLAG VARCHAR(30),
ARCHIVE_DATE datetime, WF_ID int, REC_TIME datetime, KEY_ID int, VALUE_UPPER varchar(20));
insert into t values
('43344255015b9b192916node1', 'TransactionSenderID' ,'SANMINACORT' ,'DOCUMENT' ,'14804515b9b1924f9node1', '-1' ,null ,1912341 ,'2017-04-23 08:56:09.0', 20 ,'SANMINACORT'),
('43391255115b9b192916node1', 'TransactionReceiverID' ,'4017395800' ,'DOCUMENT' ,'14804515b9b1924f9node1', '-1' ,null ,1912341 ,'2017-04-23 08:56:09.0' ,21 ,'4017395800' ),
('43376255215b9b192916node1', 'Level', 'Transaction', 'DOCUMENT' ,'14804515b9b1924f9node1' ,'-1' ,null ,1912341 ,'2017-04-23 08:56:09.0', 41 ,'TRANSACTION' ),
('43399255315b9b192916node1', 'GroupSenderID' ,'SANMINACORT', 'DOCUMENT' ,'14804515b9b1924f9node1' ,'-1' ,null ,1912341 ,'2017-04-23 08:56:09.0' ,28, 'SANMINACORT'),
('43356255415b9b192916node1', 'GroupReceiverID', '4017395800', 'DOCUMENT' ,'14804515b9b1924f9node1' ,'-1' ,null ,1912341 ,'2017-04-23 08:56:09.0',30, '4017395800');
drop table if exists t2;
create table t2(wf_id int,TransactionSenderID varchar(20), TransactionReceiverID varchar(20), Level varchar(20), GroupSenderID varchar(20), GroupReceiverID varchar(20));
insert into t2
select wf_id,
max(case when name = 'TransactionSenderID' then value else null end),
max(case when name = 'TransactionReceiverID' then value else null end),
max(case when name = 'Level' then value else null end),
max(case when name = 'GroupSenderID' then value else null end),
max(case when name = 'GroupReceiverID' then value else null end)
from t
group by wf_id;
select * from t2;
MariaDB [sandbox]> select * from t2;
+---------+---------------------+-----------------------+-------------+---------------+-----------------+
| wf_id | TransactionSenderID | TransactionReceiverID | Level | GroupSenderID | GroupReceiverID |
+---------+---------------------+-----------------------+-------------+---------------+-----------------+
| 1912341 | SANMINACORT | 4017395800 | Transaction | SANMINACORT | 4017395800 |
+---------+---------------------+-----------------------+-------------+---------------+-----------------+
1 row in set (0.00 sec)
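Since the error message in the question comes from PostgreSQL, the same conditional-aggregation idea ported over might look like the sketch below. This is only a sketch: it assumes testtbl has exactly the five columns listed in the question and that one row per wf_id is wanted.
-- Hedged sketch: MAX(CASE ...) pivot written against the question's firsttbl/testtbl names;
-- verify the column names and types before running.
INSERT INTO testtbl (TransactionSenderID, TransactionReceiverID, Level, GroupSenderID, GroupReceiverID)
SELECT max(CASE WHEN name = 'TransactionSenderID'   THEN value END),
       max(CASE WHEN name = 'TransactionReceiverID' THEN value END),
       max(CASE WHEN name = 'Level'                 THEN value END),
       max(CASE WHEN name = 'GroupSenderID'         THEN value END),
       max(CASE WHEN name = 'GroupReceiverID'       THEN value END)
FROM firsttbl
GROUP BY wf_id;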

Postgresql - how to run a query on multiple tables with same schema

I have a postgres database that has several tables (a few hundreds). A subset Foo of the tables in the database have the same schema.
Ideally, I would like to create a stored procedure which can run a query against a single table, or against all tables in subset Foo.
Pseudocode:
CREATE TABLE tbl_a (id INTEGER, name VARCHAR(32), weight double, age INTEGER);
CREATE TABLE tbl_b (id INTEGER, name VARCHAR(32), weight double, age INTEGER);
CREATE TABLE tbl_c (id INTEGER, name VARCHAR(32), weight double, age INTEGER);
CREATE TABLE tbl_d (id INTEGER, name VARCHAR(32), weight double, age INTEGER);
CREATE TYPE person_info AS (id INTEGER, name VARCHAR(32), weight double, age INTEGER);
CREATE FUNCTION generic_func(ARRAY one_or_more_table_names)
RETURNS person_info
-- Run query on table or all specified tables
AS $$ $$
LANGUAGE SQL;
How could I implement this requirement in Postgresql 9.x ?
You should have a look at table inheritance in PostgreSQL; it allows exactly what you are describing.
For example, you could create a table parent_tbl:
CREATE TABLE parent_tbl (id INTEGER, name VARCHAR(32), weight numeric, age INTEGER);
Then link your tables to this parent table:
ALTER TABLE tbl_a INHERIT parent_tbl;
ALTER TABLE tbl_b INHERIT parent_tbl;
ALTER TABLE tbl_c INHERIT parent_tbl;
ALTER TABLE tbl_d INHERIT parent_tbl;
Then a SELECT query over parent_tbl will query all of tbl_x tables, while a query on tbl_x will query only this particular table.
INSERT INTO tbl_a VALUES (1, 'coucou', 42, 42);
SELECT * FROM tbl_a;
id | name | weight | age
----+--------+--------+-----
1 | coucou | 42 | 42
(1 row)
SELECT * FROM parent_tbl;
id | name | weight | age
----+--------+--------+-----
1 | coucou | 42 | 42
(1 row)
SELECT * FROM tbl_b;
id | name | weight | age
----+--------+--------+-----
(0 rows)
It is also possible to filter data from given children tables. For example, if you are interested in data coming from tables tbl_a and tbl_b, you can do
select id, name, weight, age
from parent_tbl
left join pg_class on oid = parent_tbl.tableoid
where relname in ('tbl_a', 'tbl_b');
EDIT: I used numeric for weight instead of double, since double on its own is not supported by my server (PostgreSQL's floating-point type is double precision).
To build the SELECT query dynamically from the table names in an array, you can use the following statement:
SELECT string_agg(q, ' union all ')
FROM (
SELECT 'select * from ' || unnest(array ['tble_a','tble_b']) AS q
) t
Result:
string_agg
---------------------------------------------------
select * from tble_a union all select * from tble_b
You can create a function that returns a table with the columns
id INTEGER
,name VARCHAR(32)
,weight numeric
,age INTEGER
P.S.: I am avoiding the TYPE person_info.
function:
CREATE OR REPLACE FUNCTION generic_func (tbl varchar[])
RETURNS TABLE (          -- to store the output
     id INTEGER
    ,name VARCHAR(32)
    ,weight numeric
    ,age INTEGER
) AS $BODY$
DECLARE qry text;
BEGIN
    SELECT string_agg(q, ' union all ')   -- build the select query dynamically
    INTO qry
    FROM (
        SELECT 'select * from ' || unnest(tbl) AS q
    ) t;
    RAISE NOTICE 'qry %', qry;            -- optional
    RETURN QUERY                          -- execute the query into the declared table
    EXECUTE qry;
END;
$BODY$
LANGUAGE plpgsql VOLATILE;
Usage:
select * from generic_func(array['tbl_a','tbl_b','tbl_c','tbl_d'])
Result:
id name weight age
-- ---- ------ ---
2 ABC 11 112
2 CBC 11 112
2 BBC 11 112
2 DBC 11 112
and
select * from generic_func(array['tbl_a'])
Result:
id name weight age
-- ---- ------ ---
2 ABC 11 112
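One optional hardening, not part of the original answer: if the table names ever come from user input, each name could be quoted as an identifier while the query is assembled, so unusual names cannot break the generated SQL. A minimal sketch of just the query-building step, using format('%I', ...):
-- Builds the same union-all query, but with each table name quoted as an identifier.
SELECT string_agg(format('select * from %I', t), ' union all ')
FROM unnest(array['tbl_a', 'tbl_b']) AS t;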

Formatting big int SQL Server

I have a table with the following data; example table (id int, sn bigint):
id sn
--------------------------
1 8921901414327625990
1 8921901414327625991
How can I remove the 892190 from sn?
You can do it this way, using modulo:
select sn % 10000000000000 from table1
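To make the arithmetic visible, here is a small self-contained check built from the two sample values in the question (the table variable is just for the demo; the real table name is whatever table1 stands for):
-- 10000000000000 is 10^13, which keeps the last 13 digits, i.e. everything after the 6-digit prefix 892190.
DECLARE @t TABLE (id int, sn bigint);
INSERT INTO @t VALUES (1, 8921901414327625990), (1, 8921901414327625991);

SELECT id,
       sn,
       sn % 10000000000000 AS sn_without_prefix   -- 1414327625990 and 1414327625991
FROM @t;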