Load data to Hive array of struct

I have data in a CSV file that looks like this:
David,"""SMARTPHONE,6""|""COMPUTER,3""|""LAPTOP,1"""
I am trying to load this into my Hive table:
create table user_device(name string, devices array<struct<devicename: string, number : int>>)
row format delimited
FIELDS TERMINATED BY ','
collection items terminated by '|'
STORED AS TEXTFILE
LOCATION 'maprfs:///user/david/';
I expected to see:
[{"devicename":"SMARTPHONE","number":6},{"devicename":"COMPUTER","number":3},{"devicename":"LAPTOP","number":1}]
But when I query the table, I see that the array of structs is
[{"devicename":"\"\"\"SMARTPHONE","number":null}]
The rest of the array and structs are gone.
Does anyone know how I can achieve this?
Thanks
David

Here is the code I used. In this approach I used Python for cleaning before proceeding to the HQL queries. So after some wrangling steps I have a file like the one below (saved without indices and headers) in my local file system, since it's a small file:
import pandas as pd
import numpy as np

    Name  devicename  number
0  David  SMARTPHONE       6
1         COMPUTER         3
2         LAPTOP           1
Then a temp table tempt is created and populated with the data from the local file system (or HDFS):
create table tempt
(
name string,
devicename string,
number int
)
row format delimited
FIELDS TERMINATED BY ',';
load data local inpath '/path_to_file' overwrite into table tempt;
select * from tempt;
+--------------------+--------------------------+----------------------+--+
| tempt.name | tempt.devicename | tempt.number |
+--------------------+--------------------------+----------------------+--+
| David | SMARTPHONE | 6 |
| | COMPUTER | 3 |
| | LAPTOP | 1 |
+--------------------+--------------------------+----------------------+--+
And now
Insert overwrite table user_device
select name,
array(named_struct("devicename",devicename,"number",number)) from tempt;
select * from user_device;
and the output is now as you expected:
+-----------------+-------------------------------------------+--+
|user_device.name | user_device.devices |
+-----------------+-------------------------------------------+--+
| David | [{"devicename":"SMARTPHONE","number":6}] |
| | [{"devicename":"COMPUTER","number":3}] |
| | [{"devicename":"LAPTOP","number":1}] |
+-----------------+-------------------------------------------+--+
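If you want to go one step further and collapse those rows into a single array per user (as in the expected output in the question), a hedged sketch using collect_list might look like the following. It assumes the name column is populated on every row of tempt (e.g. filled down during the Python wrangling step) and a Hive version recent enough to allow collect_list over structs:
-- Sketch only: assumes tempt.name is filled in on every row
insert overwrite table user_device
select name,
       collect_list(named_struct("devicename", devicename, "number", number)) as devices
from tempt
group by name;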
Cheers!

Related

How to display all columns and their data types in a table via SQL query

I am trying to print the column names from a table called 'meta', and I also need their data types.
I tried this query:
SELECT meta FROM INFORMATION_SCHEMA.TABLES;
but it throws an error saying no information schema is available. Could you please help me? I am a beginner in SQL.
Edit:
select tables.name from tables join schemas on
tables.schema_id=schemas.id where schemas.name='sprl_db';
This query gives me all the tables in database 'sprl_db'
You can use the MonetDB catalog:
select c.name, c.type, c.type_digits, c.type_scale
from sys.columns c
inner join sys.tables t on t.id = c.table_id and t.name = 'meta';
As you are using MonetDB, you can get that by querying sys.columns; it will return all information related to the table's columns.
You can also check the 'Schema, table and columns' documentation for MonetDB.
In SQL Server we get that like this:
exec sp_columns TableName
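For completeness, a hedged sketch of the INFORMATION_SCHEMA variant on SQL Server (assuming the table is named meta, as in the question):
-- SQL Server: list column names and data types of the table 'meta'
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'meta';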
If I understand correctly, you need to see the columns and the types of a table called meta that you (or some other user) defined?
There are at least two ways to do this:
First (as @GMB mentioned in their answer) you can query the SQL catalog: https://www.monetdb.org/Documentation/SQLcatalog/TablesColumns
SELECT * FROM sys.tables WHERE NAME='meta';
+------+------+-----------+-------+------+--------+---------------+--------+-----------+
| id | name | schema_id | query | type | system | commit_action | access | temporary |
+======+======+===========+=======+======+========+===============+========+===========+
| 9098 | meta | 2000 | null | 0 | false | 0 | 0 | 0 |
+------+------+-----------+-------+------+--------+---------------+--------+-----------+
1 tuple
So this gets all the relevant information about the table meta. We are mostly interested in the value of the column id because this uniquely identifies the table.
(Please note that this id will probably be different in your system)
After we have this information we can query the columns table with this table id:
SELECT * FROM sys.columns WHERE table_id=9098;
+------+------+------+-------------+------------+----------+---------+-------+--------+---------+
| id | name | type | type_digits | type_scale | table_id | default | null | number | storage |
+======+======+======+=============+============+==========+=========+=======+========+=========+
| 9096 | i | int | 32 | 0 | 9098 | null | true | 0 | null |
| 9097 | j | clob | 0 | 0 | 9098 | null | true | 1 | null |
+------+------+------+-------------+------------+----------+---------+-------+--------+---------+
2 tuples
Since you are only interested in the names and types of the columns, you can modify this query as follows:
SELECT name, type FROM sys.columns WHERE table_id=9098;
+------+------+
| name | type |
+======+======+
| i | int |
| j | clob |
+------+------+
2 tuples
You can combine the two queries above with a join:
SELECT col.name, col.type FROM sys.tables as tab JOIN sys.columns as col ON tab.id=col.table_id WHERE tab.name='meta';
+------+------+
| name | type |
+======+======+
| i | int |
| j | clob |
+------+------+
2 tuples
The second, and preferred, way to get this information if you are using the mclient utility of MonetDB is to use its describe meta-command (\d). When used without arguments it presents a list of the tables that have been defined in the current database, and when it is given the name of a table it prints that table's SQL definition:
sql>\d
TABLE sys.data
TABLE sys.meta
sql>\d sys.meta
CREATE TABLE "sys"."meta" (
"i" INTEGER,
"j" CHARACTER LARGE OBJECT
);
You can use the \? meta-command to see a list of all meta-commands in mclient:
sql>\?
\? - show this message
\<file - read input from file
\>file - save response in file, or stdout if no file is given
\|cmd - pipe result to process, or stop when no command is given
\history - show the readline history
\help - synopsis of the SQL syntax
\D table - dumps the table, or the complete database if none given.
\d[Stvsfn]+ [obj] - list database objects, or describe if obj given
\A - enable auto commit
\a - disable auto commit
\e - echo the query in sql formatting mode
\t - set the timer {none,clock,performance} (none is default)
\f - format using renderer {csv,tab,raw,sql,xml,trash,rowcount,expanded,sam}
\w# - set maximal page width (-1=unlimited, 0=terminal width, >0=limit to num)
\r# - set maximum rows per page (-1=raw)
\L file - save client-server interaction
\X - trace mclient code
\q - terminate session and quit mclient
For MySQL:
SELECT column_name,
       data_type
FROM information_schema.columns
WHERE table_schema = 'yourdatabasename'
AND table_name = 'yourtablename';
Output:
+-------------+-----------+
| COLUMN_NAME | DATA_TYPE |
+-------------+-----------+
| Id | int |
| Address | varchar |
| Money | decimal |
+-------------+-----------+

How do I add two CSVs (header and content separately) into SQL Server?

I received raw data from SAP that I need to add to one local database. The issue I have is that I received two separate data sets per table.
One header file (describing the Name, Type, Primary Key, Not Null)
Actual data file (input to the rows defined in the header file)
As far as I was able to research (and try out), I can only add them as a flat file, and that means that I can only add one of those files: either missing the header completely, or missing the input data.
Merging them manually within one CSV file would mean losing all additional information (Type, Primary Key, Not Null, etc.), right?
Any idea how I can proceed?
Thanks for helping me out.
Glad to learn something new here.
Sample header:
+-------------------------------+
| Col1 |
+-------------------------------+
| TABNAME CHAR 000030 000000 |
| DDLANGUAGE LANG 000001 000000 |
| ... |
+-------------------------------+
Sample data:
+------+-------+------+------+-----+
| Col1 | Col2 | Col3 | Col4 | ... |
+------+-------+------+------+-----+
| LFB1 | ZBOKD | A | ... | ... |
| ... | ... | ... | ... | ... |
+------+-------+------+------+-----+
Merged they would like this (and if I am not mistaken, they need to look like that):
+---------+------------+-----+-----+
| TABNAME | DDLANGUAGE | ... | ... |
+---------+------------+-----+-----+
| LFB1 | ZBOKD | A | ... |
| ... | ... | ... | ... |
+---------+------------+-----+-----+
You will want to CREATE TABLE, and then BULK INSERT into it.
Open up your header file and determine what the column names and datatypes are
Create your table in SQL Server based off the information in your header file
Bulk insert the data file into your table
Even if the header and data were in the same file, you'd ignore the first row since it doesn't contain data.
create table myTable (Col1 <datatype>, Col2 <datatype>, ...)
go
bulk insert myTable
from 'c:\somedirectory\somefile.csv'
with(
FIRSTROW = 1
,FIELDTERMINATOR = ','
,ROWTERMINATOR = '\n'
,ERRORFILE = 'c:\someDir\yourErrorFile')
Comma-separated files can be a pain, particularly if any value in any column can contain a comma; in that case SQL Server would treat it as the end of that column. If this applies to you, you need to do something outside of SQL Server, in PowerShell or Python or whatever, to make your file tab-delimited, or delimited by another special character that isn't found anywhere in the data.
Also, your ROWTERMINATOR may need to be '0x1E' or another value, depending on the source system. Drop the file into Notepad++ or some other text editor in which you can see the Unicode symbols.
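As an illustration only, a hedged sketch of how the BULK INSERT above might be adjusted once the file has been re-written as tab-delimited and the rows end with a 0x1E record separator (both of these are assumptions about your converted file):
-- Sketch: assumes the file was converted to tab-delimited with 0x1E row terminators
bulk insert myTable
from 'c:\somedirectory\somefile.tsv'
with(
FIRSTROW = 1
,FIELDTERMINATOR = '\t'
,ROWTERMINATOR = '0x1E'
,ERRORFILE = 'c:\someDir\yourErrorFile')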

How to analyze the contents of an fsimage via Hive queries

Help needed, please
I have downloaded the fsimage and converted it into a delimited CSV file via the OIV tool.
I also created a Hive table and inserted the CSV file into it.
I am not so familiar with SQL, hence querying the data is difficult.
e.g. each record in the file is something like this:
/tmp/hive/ltonakanyan/9c01cc22-55ef-4410-9f55-614726869f6d/hive_2017-05-08_08-44-39_680_3710282255695385702-113/-mr-10000/.hive-staging_hive_2017-05-08_08-44-39_680_3710282255695385702-113/-ext-10001/000044_0.deflate|3|2017-05-0808:45|2017-05-0808:45|134217728|1|176|0|0|-rw-r-----|ltonakanyan|hdfs
/data/lz/cpi/ofz/zd/cbt_ca_verint/new_data/2017-09-27/253018001769667.xml | 3| 2017-09-2723:41| 2017-09-2817:09| 134217728| 1| 14549| 0| 0| -rw-r----- | bc55_ah_appid| hdfs
Table description is:
| hdfspath | string
| replication | int
| modificationtime | string
| accesstime | string
| preferredblocksize | int
| blockscount | int
| filesize | bigint
| nsquota | bigint
| dsquota | bigint
| permissionx | string
| userx | string
| groupx | string
I need to know how to query only /tmp and /data with their file sizes, and then go to the second level (/tmp/hive, /data/lz) and subsequent levels with file sizes.
I created something like this:
select substr(hdfspath, 2, instr(substr(hdfspath,2), '/')-1) zone,
sum(filesize)
from example
group by substr(hdfspath, 2, instr(substr(hdfspath,2), '/')-1);
But it's not giving the data, and the file sizes are all in bytes.
select joinedpath, sumsize
from
(
    select joinedpath, round(sum(filesize)/1024/1024/1024, 2) as sumsize
    from
    (
        select concat('/', split(hdfspath,'\/')[1]) as joinedpath, accesstime, filesize, userx
        from default.hdfs_meta_d
    ) t
    where joinedpath != 'null'
    group by joinedpath
) h;
Please check the query above; it can help you!
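To drill into the second level (e.g. /tmp/hive or /data/lz), a hedged sketch along the same lines, against the same default.hdfs_meta_d table, might be:
-- Sketch only: sums file sizes (in GB) per second-level directory such as /tmp/hive or /data/lz
select joinedpath, round(sum(filesize)/1024/1024/1024, 2) as sumsize
from
(
    select concat('/', split(hdfspath,'\/')[1], '/', split(hdfspath,'\/')[2]) as joinedpath,
           filesize
    from default.hdfs_meta_d
) t
where joinedpath is not null
group by joinedpath;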
This job is failing due to a heap memory error. Try to increase the heap size before executing the hdfs oiv command:
export HADOOP_OPTS="-Xmx4096m"
If the command is still failing, you might need to move the fsimage to a different machine/server that has more memory, and increase the heap memory using the above environment variable.

Remove junk characters from Hive tables or from Unix

We have tables in Hive like the one below, and we are generating flat files from the Hive data. While generating them we found that there were junk characters within the data, like the ones below; we have many such characters in many columns. Can anyone help us remove those junk characters from the Hive table or from the Unix file?
ÿ,ä,í,ã
Here the problem is that the same data needs to be sent downstream; when they load it into their DB it shows as a double dollar, but we designed our code with the double dollar as the column delimiter.
Basic concept
hive> select regexp_replace('Hÿelloä íworlãd','[^a-zA-Z ]','');
OK
Hello world
Demo
Removing undesired characters from the whole table and exporting it to a file.
create table t (i int,s1 string,s2 string);
insert into t values (1,'Hÿelloä','íworlãd'),(2,'ãGããood','Byÿe');
select * from t;
+---+---------+---------+
| i | s1 | s2 |
+---+---------+---------+
| 1 | Hÿelloä | íworlãd |
| 2 | ãGããood | Byÿe |
+---+---------+---------+
create external table t_ext (rec string)
row format delimited
fields terminated by '0'
location '/user/hive/warehouse/t'
;
insert overwrite local directory '/tmp/t_ext'
select regexp_replace(regexp_replace(rec,'[^a-zA-Z0-9 \\01]',''),'\\x01','<--->')
from t_ext
;
! ls /tmp/t_ext
;
000000_0
! cat /tmp/t_ext/000000_0
;
1<--->Hello<--->world
2<--->Good<--->Bye
This works as long as your tables contain only "primitive" types (no structs, arrays, maps etc.).
I really pushed the envelope here: the query below serializes each row into a single '$$'-delimited string (via printf over all of the columns) and then strips the undesired characters while keeping the '$$' delimiters.
Demo
create table t (i int, dt date, str string, ts timestamp, bl boolean);
insert into t select 1,current_date,'Hello world',current_timestamp,true;
select * from t;
+-----+------------+-------------+-------------------------+------+
| t.i | t.dt | t.str | t.ts | t.bl |
+-----+------------+-------------+-------------------------+------+
| 1 | 2017-03-14 | Hello world | 2017-03-14 14:37:28.889 | true |
+-----+------------+-------------+-------------------------+------+
select regexp_replace
(
printf(concat('%s',repeat('$$%s',field(unhex(1),*,unhex(1))-2)),*)
,'(\\$\\$)|[^a-zA-Z0-9 -]'
,'$1'
)
from t
;
1$$2017-03-14$$Hello world$$2017-03-14 143728.889$$true
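To actually write this out as a flat file rather than just select it, a hedged sketch would wrap the same expression in an export, as in the first demo (the output path /tmp/t_clean is an assumption):
-- Sketch only: writes the cleaned, $$-delimited rows to a local directory
insert overwrite local directory '/tmp/t_clean'
select regexp_replace
(
printf(concat('%s',repeat('$$%s',field(unhex(1),*,unhex(1))-2)),*)
,'(\\$\\$)|[^a-zA-Z0-9 -]'
,'$1'
)
from t
;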

Parquet-backed Hive table: array column not queryable in Impala

Although Impala is much faster than Hive, we used Hive because it supports complex (nested) data types such as arrays and maps.
I notice that Impala, as of CDH 5.5, now supports complex data types. Since it's also possible to run Hive UDFs in Impala, we can probably do everything we want in Impala, but much, much faster. That's great news!
As I scan through the documentation, I see that Impala expects data to be stored in Parquet format. My data, in its raw form, happens to be a two-column CSV where the first column is an ID, and the second column is a pipe-delimited array of strings, e.g.:
123,ASDFG|SDFGH|DFGHJ|FGHJK
234,QWERT|WERTY|ERTYU
A Hive table was created:
CREATE TABLE `id_member_of`(
`id` INT,
`member_of` ARRAY<STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
The raw data was loaded into the Hive table:
LOAD DATA LOCAL INPATH 'raw_data.csv' INTO TABLE id_member_of;
A Parquet version of the table was created:
CREATE TABLE `id_member_of_parquet` (
`id` STRING,
`member_of` ARRAY<STRING>)
STORED AS PARQUET;
The data from the CSV-backed table was inserted into the Parquet table:
INSERT INTO id_member_of_parquet SELECT id, member_of FROM id_member_of;
And the Parquet table is now queryable in Hive:
hive> select * from id_member_of_parquet;
123 ["ASDFG","SDFGH","DFGHJ","FGHJK"]
234 ["QWERT","WERTY","ERTYU"]
Strangely, when I query the same Parquet-backed table in Impala, it doesn't return the array column:
[hadoop01:21000] > invalidate metadata;
[hadoop01:21000] > select * from id_member_of_parquet;
+-----+
| id |
+-----+
| 123 |
| 234 |
+-----+
Question: What happened to the array column? Can you see what I'm doing wrong?
It turned out to be really simple: we can access the array by adding it to the FROM clause with a dot, e.g.
Query: select * from id_member_of_parquet, id_member_of_parquet.member_of
+-----+-------+
| id | item |
+-----+-------+
| 123 | ASDFG |
| 123 | SDFGH |
| 123 | DFGHJ |
| 123 | FGHJK |
| 234 | QWERT |
| 234 | WERTY |
| 234 | ERTYU |
+-----+-------+
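For reference, a hedged variant of the same query that names the columns explicitly; Impala exposes each array element and its position through the ITEM and POS pseudocolumns:
-- Sketch: parent id joined with each array element and its position
SELECT p.id, m.pos, m.item
FROM id_member_of_parquet p, p.member_of m;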