How to store a Hive table in .xlsx format?

I created a Hive table with the query below:
create table t1 row format delimited fields terminated by '|' stored as textfile;
load data inpath 'l1.csv' overwrite into table t1;
But I want to store my table t1 in .xlsx format, not as a text file.

I am from a Python background; if you want to store Hive data in Excel format, you can use pandas in Python, which provides a way to write any output to Excel.
Before that, you need to connect to Hive using a module such as pyhs2, capture the data in a variable, and then write it to Excel. A rough sketch of that flow is below.
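A minimal sketch of that approach, assuming a reachable HiveServer2 instance; it uses pyhive (one common alternative to pyhs2) together with pandas and openpyxl, and the host, port, and username are placeholders:

# Minimal sketch: read a Hive table into pandas and save it as .xlsx.
# Assumes HiveServer2 on localhost:10000 and the pyhive, pandas and
# openpyxl packages installed; host/port/username are placeholders.
import pandas as pd
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="hive_user")

# Pull the table into a DataFrame, then write it out as an Excel file.
df = pd.read_sql("SELECT * FROM t1", conn)
df.to_excel("t1.xlsx", index=False)

conn.close()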

Related

Trouble converting CSV date from non-standard format in SQL

I have dates in the format 01jan2020 (without a space or any separator) and need to convert this to a date type in SQL Server 2016 Management Studio.
The data was loaded from a .CSV file into a table (call it TestData, column is Fill_Date).
To join on a separate table to pull back data for another process, I need the TestData column Fill_Date to be in the correct format (MM-DD-YYYY) for my query to run correctly.
Fill_Date is currently in table TestData as datatype varchar(50).
I want to either convert it in place in the TestData table or insert the converted result directly into a second table that is formatted correctly.
Thanks (NEWB)
I ended up solving this by converting the data while dropping it into a temp table, deleting the old values, and then inserting from that table back into the TestData table.
CONVERT(VARCHAR,CONVERT(date,[fill_date]),101) AS fill_date
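A hedged sketch of that temp-table approach (it assumes Fill_Date is the only column being carried over; a real table would need its other columns listed as well, and style 110 can be used instead of 101 if MM-DD-YYYY with dashes is required):

-- Illustrative only: stage the converted dates in a temp table,
-- clear the old values, then reload TestData.
SELECT CONVERT(VARCHAR(10), CONVERT(date, [Fill_Date]), 101) AS Fill_Date
INTO #ConvertedDates
FROM TestData;

DELETE FROM TestData;

INSERT INTO TestData (Fill_Date)
SELECT Fill_Date FROM #ConvertedDates;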

Select row from ORC Snappy table in Hive

I have created a table employee_orc, which is in ORC format with Snappy compression.
create table employee_orc(emp_id string, name string)
row format delimited fields terminated by '\t' stored as orc tblproperties("orc.compress"="SNAPPY");
I have uploaded data into the table using the insert statement.
employee_orc table has 1000 records.
When I run the below query, it shows all the records
select * from employee_orc;
But when I run the below query, it shows zero results even though the record exists.
select * from employee_orc where emp_id = "EMP456";
Why am I unable to retrieve a single record from the employee_orc table?
The record does not exist exactly as written. You may think the values are the same because they look the same, but there is some difference. One possibility is whitespace at the beginning or end of the string. For this, you can use like:
where emp_id like '%EMP456%'
This might help you.
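A quick, hedged way to confirm whether hidden whitespace is the problem, using the standard Hive trim and length functions:

-- Compare trimmed values and inspect the stored length to spot
-- leading or trailing whitespace in emp_id.
SELECT emp_id, length(emp_id) AS stored_length
FROM employee_orc
WHERE trim(emp_id) = 'EMP456';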
For my part, I don't understand why you want to specify a delimiter for ORC. Are you confusing CSV with ORC, or external with managed tables?
I advise you to create your table differently:
create table employee_orc(emp_id string, name string)
stored as ORC
TBLPROPERTIES (
"orc.compress"="ZLIB");

Is there a way to specify Date/Timestamp format for the incoming data within the Hive CREATE TABLE statement itself?

I have CSV files which contain date and timestamp values in the formats below, e.g.:
Col1|col2
01JAN2019|01JAN2019:17:34:41
But when I define Col1 as date and Col2 as timestamp in my create statement, the Hive table simply returns NULL when I query it.
CREATE EXTERNAL TABLE IF NOT EXISTS my_schema.my_table
(Col1 date,
Col2 timestamp)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'my_path';
Instead, if I define the data types simply as string, it works, but that's not how I want my tables to be.
I want the table to be able to read the incoming data in correct type. How can I achieve this? Is it possible to define the expected data format of the incoming data with the CREATE statement itself?
Can someone please help?
As of Hive 1.2.0 it is possible to provide the additional SerDe property "timestamp.formats". See this Jira for more details: HIVE-9298
ALTER TABLE timestamp_formats SET SERDEPROPERTIES ("timestamp.formats"="ddMMMyyyy:HH:mm:ss");
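To answer the "in the CREATE statement itself" part, the same property can be supplied at creation time by naming the SerDe explicitly, roughly as sketched below. Note that timestamp.formats only affects timestamp columns, so a ddMMMyyyy date column such as Col1 is usually loaded as string and converted afterwards; the field.delim property stands in for the FIELDS TERMINATED BY clause here:

CREATE EXTERNAL TABLE IF NOT EXISTS my_schema.my_table
(Col1 string,      -- ddMMMyyyy dates are safer read as string and cast later
 Col2 timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  "field.delim" = "|",
  "timestamp.formats" = "ddMMMyyyy:HH:mm:ss")
STORED AS TEXTFILE
LOCATION 'my_path';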

Create a table in Hive and populate it with data

While trying to load data into a Hive table I encountered behavior that looks strange to me. My data is made up of JSON objects loaded as records in a table called twitter_test containing a single column named "json".
Now I want to extract three fields from each JSON and build a new table called "my_twitter". I thus issue the command
CREATE TABLE my_twitter AS
SELECT
  regexp_replace(get_json_object(t.json, '$.body[0]'), '\n', '') AS text,
  get_json_object(t.json, '$.publishingdate[0]') AS created_at,
  get_json_object(t.json, '$.author_screen_name[0]') AS author
FROM twitter_test AS t;
The result is a table with three columns that contains no data. However, if I run the SELECT command alone it returns data as expected.
By trial and error I found out that I need to add LIMIT x at the end of the query for data to be inserted into the new table. The question is: why?
Furthermore, it seems strange that I need to know in advance the number x of rows returned by the SELECT statement for the CREATE to work correctly. Is there any workaround?
You could create a table over this JSON data using the JSON SerDe, which parses the JSON objects so that you can select each individual field as a column.
Below is a sample Hive DDL for creating a JSON table using the JSON SerDe:
CREATE EXTERNAL TABLE `json_table`(
A string
,B string
)
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'PATH'
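With the fields exposed as real columns, the CREATE TABLE AS SELECT from the question reduces to a plain select. A hedged sketch, with the illustrative column names a and b standing in for the actual fields (body, publishingdate, author_screen_name):

-- Illustrative: once the JSON SerDe exposes the fields as columns,
-- the derived table can be built with a simple CTAS.
CREATE TABLE my_twitter AS
SELECT a AS text, b AS created_at
FROM json_table;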

How to load data into SQL when the datatypes and column widths are in SAS code?

Hi, I'm trying to load this data into MySQL Server. It's a 2 GB txt file, and the delimiter is tab.
I can use the import data wizard and choose the file type Flat File Source to import it, but that way I additionally need to tell the database the datatype and length of each column.
Information about the datatype and length of each column is available in the SAS INFILE code (downloaded together with the data). (I replaced the code for many columns with .... for conciseness.)
DATA Medicare_PS_PUF;
LENGTH
npi $ 10
nppes_provider_last_org_name $ 70
nppes_provider_first_name $ 20
nppes_provider_mi $ 1
....
average_Medicare_standard_amt 8;
INFILE 'C:\My Documents\Medicare_Provider_Util_Payment_PUF_CY2014.TXT'
lrecl=32767
dlm='09'x
pad missover
firstobs = 3
dsd;
INPUT
npi
nppes_provider_last_org_name
nppes_provider_first_name
nppes_provider_mi
....
average_Medicare_standard_amt;
RUN;
I think maybe I should use the SAS INFILE code to load the txt file into SAS, save it as a SAS-format dataset, and then import that dataset into MySQL.
My question is: can the datatype and length information for each column be passed from SAS into MySQL?
Any suggestion/other methods to handle this is appreciated. Thanks-
One way would be to first submit a describe table query in proc sql. For example:
proc sql;
describe table sashelp.class;
quit;
SAS Log shows:
create table SASHELP.CLASS( label='Student Data' bufsize=65536 )
(
Name char(8),
Sex char(1),
Age num,
Height num,
Weight num
);
This will need to be adapted to MySQL's syntax -- labels removed, char may become varchar, num becomes int or float, and so on.
Then you could try inserting the data into the newly created MySQL table, either by point-and-click through your MySQL interface or, if that is not possible, by generating a text file from SAS containing a series of insert into queries (using a file statement in a data step with the put statement to generate the appropriate queries) and then submitting this text file to your MySQL database.
Quick example (not tested):
data _null_;
file 'insert_queries.sql';
set sashelp.class;
query = "insert into class values ("||
catx(",", quote(trim(Name)), quote(Sex), Height, Weight)||
");";
put query;
run;
If you have the right SAS/ACCESS modules licensed, you can copy the data directly from SAS to MySQL.
But it sounds like you are asking how to convert the SAS code into metadata about the table structure. I would recommend running the program in SAS to create the Medicare_PS_PUF dataset and then using PROC CONTENTS to get the metadata about the variables. You can add OBS=0 to the INFILE statement if you just want to create an empty dataset in SAS.