How to analyze the contents of an fsimage via Hive queries - sql

Help needed, please
I have downloaded the fsimage and converted it into a delimited CSV file via the OIV tool.
I also created a Hive table and loaded the CSV file into it.
I am not very familiar with SQL, so querying the data is difficult.
E.g., each record in the file is something like this:
/tmp/hive/ltonakanyan/9c01cc22-55ef-4410-9f55-614726869f6d/hive_2017-05-08_08-44-39_680_3710282255695385702-113/-mr-10000/.hive-staging_hive_2017-05-08_08-44-39_680_3710282255695385702-113/-ext-10001/000044_0.deflate|3|2017-05-0808:45|2017-05-0808:45|134217728|1|176|0|0|-rw-r-----|ltonakanyan|hdfs
/data/lz/cpi/ofz/zd/cbt_ca_verint/new_data/2017-09-27/253018001769667.xml | 3| 2017-09-2723:41| 2017-09-2817:09| 134217728| 1| 14549| 0| 0| -rw-r----- | bc55_ah_appid| hdfs
Table description is:
| hdfspath | string
| replication | int
| modificationtime | string
| accesstime | string
| preferredblocksize | int
| blockscount | int
| filesize | bigint
| nsquota | bigint
| dsquota | bigint
| permissionx | string
| userx | string
| groupx | string
I need to know how to query only /tmp and /data with file sizes, and then go to the second level (/tmp/hive, /data/lz) and subsequent levels with file sizes.
I created something like this:
select substr(hdfspath, 2, instr(substr(hdfspath,2), '/')-1) zone,
sum(filesize)
from example
group by substr(hdfspath, 2, instr(substr(hdfspath,2), '/')-1);
But it's not giving the data I need, and the file sizes are all in bytes.

select joinedpath, sumsize
from
(
    select joinedpath, round(sum(filesize)/1024/1024/1024, 2) as sumsize
    from
    (
        select concat('/', split(hdfspath,'\/')[1]) as joinedpath, accesstime, filesize, userx
        from default.hdfs_meta_d
    ) t
    where joinedpath != 'null'
    group by joinedpath
) h
Please check the query above; it should help you.
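If you also need the second level (/tmp/hive, /data/lz), a sketch along the same lines, assuming the same default.hdfs_meta_d table (untested against your data):
-- aggregate sizes (in GB) per two-level path prefix
select concat('/', split(hdfspath,'\/')[1], '/', split(hdfspath,'\/')[2]) as joinedpath,
       round(sum(filesize)/1024/1024/1024, 2) as sumsize
from default.hdfs_meta_d
where split(hdfspath,'\/')[2] is not null  -- skip paths with fewer than two levels
group by concat('/', split(hdfspath,'\/')[1], '/', split(hdfspath,'\/')[2]);
The same pattern extends to deeper levels by appending further split(...)[n] components.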

This job is failing due to a heap memory error. Try increasing the heap size before executing the hdfs oiv command:
export HADOOP_OPTS="-Xmx4096m"
If the command still fails, you might need to move the fsimage to a different machine/server that has more memory, and increase the heap size there using the environment variable above.

Related

How to display all columns and their data types in a table via SQL query

I am trying to print the column names from a table called 'meta', and I also need their data types.
I tried this query
SELECT meta FROM INFORMATION_SCHEMA.TABLES;
but it throws an error saying there is no information schema available. Could you please help me? I am a beginner in SQL.
Edit:
select tables.name from tables join schemas on
tables.schema_id=schemas.id where schemas.name='sprl_db';
This query gives me all the tables in database 'sprl_db'
You can use the MonetDB catalog:
select c.name, c.type, c.type_digits, c.type_scale
from sys.columns c
inner join sys.tables t on t.id = c.table_id and t.name = 'meta';
Since you are using MonetDB, you can get that by querying sys.columns; it returns all the information related to table columns.
You can also check the "Schema, table and columns" documentation for MonetDB.
In SQL Server you would get the same with: exec sp_columns TableName
If I understand correctly, you need to see the columns and types of a table called meta that you (or some other user) defined?
There are at least two ways to do this:
First (as @GMB mentioned in their answer) you can query the SQL catalog: https://www.monetdb.org/Documentation/SQLcatalog/TablesColumns
SELECT * FROM sys.tables WHERE NAME='meta';
+------+------+-----------+-------+------+--------+---------------+--------+-----------+
| id | name | schema_id | query | type | system | commit_action | access | temporary |
+======+======+===========+=======+======+========+===============+========+===========+
| 9098 | meta | 2000 | null | 0 | false | 0 | 0 | 0 |
+------+------+-----------+-------+------+--------+---------------+--------+-----------+
1 tuple
So this gets all the relevant information about the table meta. We are mostly interested in the value of the column id because this uniquely identifies the table.
(Please note that this id will probably be different in your system)
After we have this information we can query the columns table with this table id:
SELECT * FROM sys.columns WHERE table_id=9098;
+------+------+------+-------------+------------+----------+---------+-------+--------+---------+
| id | name | type | type_digits | type_scale | table_id | default | null | number | storage |
+======+======+======+=============+============+==========+=========+=======+========+=========+
| 9096 | i | int | 32 | 0 | 9098 | null | true | 0 | null |
| 9097 | j | clob | 0 | 0 | 9098 | null | true | 1 | null |
+------+------+------+-------------+------------+----------+---------+-------+--------+---------+
2 tuples
Since you are only interested in the names and types of the columns, you can modify this query as follows:
SELECT name, type FROM sys.columns WHERE table_id=9098;
+------+------+
| name | type |
+======+======+
| i | int |
| j | clob |
+------+------+
2 tuples
You can combine the two queries above with a join:
SELECT col.name, col.type FROM sys.tables as tab JOIN sys.columns as col ON tab.id=col.table_id WHERE tab.name='meta';
+------+------+
| name | type |
+======+======+
| i | int |
| j | clob |
+------+------+
2 tuples
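If more than one schema happens to contain a table named meta, you can narrow the join down via sys.schemas as well; a possible refinement of the query above (replace 'sys' with the schema your table actually lives in):
select col.name, col.type
from sys.schemas as sch
join sys.tables as tab on tab.schema_id = sch.id
join sys.columns as col on col.table_id = tab.id
where sch.name = 'sys' and tab.name = 'meta';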
The second, and preferred, way to get this information if you are using the mclient utility of MonetDB is the describe meta-command. When used without arguments, it presents a list of the tables defined in the current database; when given the name of a table, it prints that table's SQL definition:
sql>\d
TABLE sys.data
TABLE sys.meta
sql>\d sys.meta
CREATE TABLE "sys"."meta" (
"i" INTEGER,
"j" CHARACTER LARGE OBJECT
);
You can use the \? meta-command to see a list of all meta-commands in mclient:
sql>\?
\? - show this message
\<file - read input from file
\>file - save response in file, or stdout if no file is given
\|cmd - pipe result to process, or stop when no command is given
\history - show the readline history
\help - synopsis of the SQL syntax
\D table - dumps the table, or the complete database if none given.
\d[Stvsfn]+ [obj] - list database objects, or describe if obj given
\A - enable auto commit
\a - disable auto commit
\e - echo the query in sql formatting mode
\t - set the timer {none,clock,performance} (none is default)
\f - format using renderer {csv,tab,raw,sql,xml,trash,rowcount,expanded,sam}
\w# - set maximal page width (-1=unlimited, 0=terminal width, >0=limit to num)
\r# - set maximum rows per page (-1=raw)
\L file - save client-server interaction
\X - trace mclient code
\q - terminate session and quit mclient
For MySQL:
SELECT column_name,
data_type
FROM information_schema.columns
WHERE table_schema = 'yourdatabasename'
  AND table_name = 'yourtablename';
Output:
+-------------+-----------+
| COLUMN_NAME | DATA_TYPE |
+-------------+-----------+
| Id | int |
| Address | varchar |
| Money | decimal |
+-------------+-----------+
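In MySQL you can also get a quick overview without querying information_schema directly, using SHOW COLUMNS (or its DESCRIBE shorthand):
SHOW COLUMNS FROM yourtablename;
-- or, equivalently:
DESCRIBE yourtablename;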

How to deduplicate in Presto SQL with only varchar data

My question is related to this one. However, I have only varchar data so I can't use the solutions there. My data looks like this:
id | activity | type
------------------------
al12 | a1a4 | MOVE
la23 | 2a5e | WAIT
la23 | 2a5e | WAIT
ie42 | 35a8 | STAY
The third row is a duplicate. How can I remove it?
Use DISTINCT?
SELECT DISTINCT id, activity, type
FROM your_table
https://prestodb.github.io/docs/current/sql/select.html
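If you want to materialize the deduplicated rows rather than just select them, one option is CREATE TABLE AS with the same DISTINCT query; the target table name here is just an example:
-- write the de-duplicated rows into a new table
CREATE TABLE your_table_dedup AS
SELECT DISTINCT id, activity, type
FROM your_table;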

Load data to hive array of struct

I have data in a CSV that looks like
David,"""SMARTPHONE,6""|""COMPUTER,3""|""LAPTOP,1"""
I tried to load this into my Hive table:
create table user_device(name string, devices array<struct<devicename: string, number : int>>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
collection items terminated by '|'
STORED AS TEXTFILE
LOCATION 'maprfs:///user/david/';
I expected to see
[{"devicename":"SMARTPHONE","number":6},{"devicename":"COMPUTER","number":3},{"devicename":"LAPTOP","number":1}]
But when I try to query the table, I see the array of struct is
[{"devicename":"\"\"\"SMARTPHONE","number":null}]
The rest of the array and structs are gone.
Does anyone know how I can achieve this?
Thanks
David
Here is the code I used. In this approach I used Python for cleaning before proceeding to the HQL queries. After doing some wrangling steps, I had a file like the one below (saved without indices and headers) in my local file system, since it's a small file:
import pandas as pd
import numpy as np
Name devicename number
0 David SMARTPHONE 6
1 COMPUTER 3
2 LAPTOP 1
Then a temp table tempt is created and populated with data from LFS or HDFS:
create table tempt
(
name string,
devicename string,
number int
)
row format delimited
FIELDS TERMINATED BY ',';
load data local inpath '/path_to_file' overwrite into table tempt;
select * from tempt;
+--------------------+--------------------------+----------------------+--+
| tempt.name | tempt.devicename | tempt.number |
+--------------------+--------------------------+----------------------+--+
| David | SMARTPHONE | 6 |
| | COMPUTER | 3 |
| | LAPTOP | 1 |
+--------------------+--------------------------+----------------------+--+
And now
Insert overwrite table user_device
select name,
array(named_struct("devicename",devicename,"number",number)) from tempt;
select * from user_device;
and the output now is:
+-----------------+-------------------------------------------+--+
|user_device.name | user_device.devices |
+-----------------+-------------------------------------------+--+
| David | [{"devicename":"SMARTPHONE","number":6}] |
| | [{"devicename":"COMPUTER","number":3}] |
| | [{"devicename":"LAPTOP","number":1}] |
+-----------------+-------------------------------------------+--+
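Note that this produces one single-element array per row. If you would rather have David's three devices collected into one array on a single row (as in the expected output in the question), a sketch using collect_list, assuming the name column is populated on every row of tempt and a Hive version recent enough to allow collect_list over structs:
-- group all devices per user into a single array of structs
Insert overwrite table user_device
select name,
       collect_list(named_struct("devicename", devicename, "number", number))
from tempt
group by name;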
Cheers!

Parquet-backed Hive table: array column not queryable in Impala

Although Impala is much faster than Hive, we used Hive because it supports complex (nested) data types such as arrays and maps.
I notice that Impala, as of CDH5.5, now supports complex data types. Since it's also possible to run Hive UDF's in Impala, we can probably do everything we want in Impala, but much, much faster. That's great news!
As I scan through the documentation, I see that Impala expects data to be stored in Parquet format. My data, in its raw form, happens to be a two-column CSV where the first column is an ID, and the second column is a pipe-delimited array of strings, e.g.:
123,ASDFG|SDFGH|DFGHJ|FGHJK
234,QWERT|WERTY|ERTYU
A Hive table was created:
CREATE TABLE `id_member_of`(
`id` INT,
`member_of` ARRAY<STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
The raw data was loaded into the Hive table:
LOAD DATA LOCAL INPATH 'raw_data.csv' INTO TABLE id_member_of;
A Parquet version of the table was created:
CREATE TABLE `id_member_of_parquet` (
`id` STRING,
`member_of` ARRAY<STRING>)
STORED AS PARQUET;
The data from the CSV-backed table was inserted into the Parquet table:
INSERT INTO id_member_of_parquet SELECT id, member_of FROM id_member_of;
And the Parquet table is now queryable in Hive:
hive> select * from id_member_of_parquet;
123 ["ASDFG","SDFGH","DFGHJ","FGHJK"]
234 ["QWERT","WERTY","ERTYU"]
Strangely, when I query the same Parquet-backed table in Impala, it doesn't return the array column:
[hadoop01:21000] > invalidate metadata;
[hadoop01:21000] > select * from id_member_of_parquet;
+-----+
| id |
+-----+
| 123 |
| 234 |
+-----+
Question: What happened to the array column? Can you see what I'm doing wrong?
It turned out to be really simple: we can access the array by adding it to the FROM with a dot, e.g.
Query: select * from id_member_of_parquet, id_member_of_parquet.member_of
+-----+-------+
| id | item |
+-----+-------+
| 123 | ASDFG |
| 123 | SDFGH |
| 123 | DFGHJ |
| 123 | FGHJK |
| 234 | QWERT |
| 234 | WERTY |
| 234 | ERTYU |
+-----+-------+
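Equivalently, you can alias the unnested array and refer to its item pseudocolumn explicitly (the aliases t and m here are just illustrative):
SELECT t.id, m.item
FROM id_member_of_parquet t, t.member_of m;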

How to get numbers arranged right to left in sql server SELECT statements

When performing SELECT statements including number columns (prices, for example), the result is always left-aligned, which reduces readability. Therefore I'm searching for a method to right-align the output of number columns.
I already tried to use something like
SELECT ... SPACE(15-LEN(A.Nummer))+A.Nummer ...
FROM Artikel AS A ...
which gives close results, but depending on the font, not quite. An alternative would be to replace SPACE() with REPLICATE('_',...), but I don't really like the underscores in the output.
Besides that, this formula will fail on numbers with more than 15 digits, so I searched for a way to find the maximum length of the entries to make it safer, like
SELECT ... SPACE(MAX(A.Nummer)-LEN(A.Nummer))+A.Nummer ...
FROM Artikel AS A ...
but this does not work due to the aggregate character of the MAX-function.
So, what's the best way to achieve the right-justified order for the number-columns?
Thanks,
Rainer
To get your problem with the list box solved, have a look at this link: http://www.lebans.com/List_Combo.htm
I strongly believe that this type of adjustment should be made in the UI layer and not mixed in with data retrieval.
But to answer your original question, I have created a SQL Fiddle:
MS SQL Server 2008 Schema Setup:
CREATE TABLE dbo.some_numbers(n INT);
Create some example data:
INSERT INTO dbo.some_numbers
SELECT CHECKSUM(NEWID())
FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1))X(x);
The following query uses the OVER() clause to specify that the MAX() is to be applied over all rows. The > and < that the result is wrapped in are just for illustration purposes and are not required for the solution.
Query 1:
SELECT '>'+
SPACE(MAX(LEN(CAST(n AS VARCHAR(MAX))))OVER()-LEN(CAST(n AS VARCHAR(MAX))))+
CAST(n AS VARCHAR(MAX))+
'<'
FROM dbo.some_numbers SN;
Results:
| COLUMN_0 |
|---------------|
| >-1486993739< |
| > 1620287540< |
| >-1451542215< |
| >-1257364471< |
| > -819471559< |
| >-1364318127< |
| >-1190313739< |
| > 1682890896< |
| >-1050938840< |
| > 484064148< |
This query does a straight cast to show the difference:
Query 2:
SELECT '>'+CAST(n AS VARCHAR(MAX))+'<'
FROM dbo.some_numbers SN;
Results:
| COLUMN_0 |
|---------------|
| >-1486993739< |
| >1620287540< |
| >-1451542215< |
| >-1257364471< |
| >-819471559< |
| >-1364318127< |
| >-1190313739< |
| >1682890896< |
| >-1050938840< |
| >484064148< |
With this query you still need to change the display font to a monospaced font like COURIER NEW. Otherwise, as you have noticed, the result is still misaligned.
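As an alternative to the SPACE() arithmetic, a common T-SQL idiom is to left-pad with RIGHT() over a fixed width; a minimal sketch, assuming 15 characters is wide enough for your values (wider values would be truncated):
-- left-pad each number to a fixed width of 15 characters
SELECT RIGHT(SPACE(15) + CAST(n AS VARCHAR(15)), 15)
FROM dbo.some_numbers SN;
A monospaced display font is still required for the columns to line up visually.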