Hive - Loading into default location of external table

I have an external table created in hive with default location
CREATE EXTERNAL TABLE `testtable`(
`id` int,
`name` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'<hdfs-uri>/hive/warehouse/testtable'
I would like to confirm whether I can move a text file containing id/name values from local to the HDFS location /hive/warehouse/testtable/test.txt for the external table testtable? Thanks.

Yes. You can upload test.txt from local to the HDFS location for external table testtable (<hdfs-uri>/hive/warehouse/testtable). This will work even if <hdfs-uri>/hive/warehouse/ is the default Hive warehouse directory.
Just keep in mind: for non-external (Hive-managed) tables, dropping the table automatically deletes its warehouse HDFS directory. For an external table, dropping the table does not delete the HDFS directory backing it; that directory has to be removed as a separate operation.
Illustration:
Here the Hive warehouse directory is hdfs:///apps/hive/warehouse
Create table
hive> CREATE EXTERNAL TABLE `testtable`(
`id` int,
`name` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs:///apps/hive/warehouse/testtable';
Data in test.txt
1,name-1
2,name-2
3,name-3
Upload data to HDFS
hadoop fs -put test.txt hdfs:///apps/hive/warehouse/testtable
Query table
hive> select * from testtable;
1 name-1
2 name-2
3 name-3
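A minimal sketch of the drop behaviour described above, continuing the same illustration:
Drop table
hive> DROP TABLE testtable;
Check the backing directory (still present, because the table is external)
hadoop fs -ls hdfs:///apps/hive/warehouse/testtable
Remove the directory separately once it is no longer needed
hadoop fs -rm -r hdfs:///apps/hive/warehouse/testtable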

Created external table but it's empty

I want to create an external table from a .csv file I uploaded to the server earlier.
In Beeline (the Hive shell), I tried running this script:
CREATE EXTERNAL TABLE c_fink_category_mapping (
trench_code string,
fink_code string
)
row format delimited fields terminated by '\073' stored as textfile
location '/appl/trench/dev/data/in/main/daily_wf/fink_category_mapping'
TBLPROPERTIES ('serialization.null.format' = '')
;
which creates the table without any error, but the table itself is empty.
Help would be appreciated.
My textfile is populated with data.
First, check if the location path is correct.
Then try with this configuration:
CREATE EXTERNAL TABLE c_fink_category_mapping (
trench_code string,
fink_code string
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'quoteChar'='"',
'separatorChar'=',')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/appl/trench/dev/data/in/main/daily_wf/fink_category_mapping';
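For the first step (checking that the location path is correct), a quick hedged sketch of how to verify it, using the path from the question:
hadoop fs -ls /appl/trench/dev/data/in/main/daily_wf/fink_category_mapping
hadoop fs -cat /appl/trench/dev/data/in/main/daily_wf/fink_category_mapping/* | head -5
If the listing is empty or the preview shows an unexpected delimiter, adjust the LOCATION or the separatorChar accordingly.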
The answer provided above appears to be correct: use the OpenCSVSerde-based DDL shown there.
This will create the table using a comma as the delimiter, which should correctly parse the data in your CSV file and populate the table with the data from the file. You can also specify a different delimiter character, such as '\t', if that is more appropriate for your data.

Cloudera - Hive/Impala Show Create Table - Error with the syntax

I'm building some automated processes to create tables on Cloudera Hive.
For that I am using the SHOW CREATE TABLE statement, which gives me (for example) the following DDL:
CREATE TABLE clsd_core.factual_player ( player_name STRING, number_goals INT ) PARTITIONED BY ( player_name STRING ) WITH SERDEPROPERTIES ('serialization.format'='1') STORED AS PARQUET LOCATION 'hdfs://nameservice1/factual_player'
What I need is to run the ddl on a different place to create a table with the same name.
However, when I run that code it returns the following error:
Error while compiling statement: FAILED: ParseException line 1:123 missing EOF at 'WITH' near ')'
When I manually remove the part "WITH SERDEPROPERTIES ('serialization.format'='1')", the table is created successfully.
Is there a better function to retrieve table DDLs without the SERDE information?
The first issue in your DDL is that the partition column should not be listed in the column spec, only in PARTITIONED BY. A partition is a folder named partition_column=value, and this column is not stored in the table files, only in the partition directory name. If you want the partition column to also appear in the data files, it has to be named differently.
The second issue is that SERDEPROPERTIES is part of the SERDE specification; if you do not specify a SERDE, there should be no SERDEPROPERTIES. See the manual: Storage Formats and SerDe.
Fixed DDL:
CREATE TABLE factual_player (number_goals INT)
PARTITIONED BY (player_name STRING)
STORED AS PARQUET
LOCATION 'hdfs://nameservice1/factual_player';
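As a quick hedged check of the fixed DDL (the values are made up), note that player_name is supplied only through the PARTITION clause, never as a regular column:
INSERT INTO TABLE factual_player PARTITION (player_name='some_player')
VALUES (10);
SELECT player_name, number_goals FROM factual_player;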
STORED AS PARQUET already implies the SERDE, INPUTFORMAT and OUTPUTFORMAT.
If you want to specify the SERDE with its properties, use this syntax:
CREATE TABLE factual_player(number_goals int)
PARTITIONED BY (player_name string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('serialization.format'='1') --I believe you really do not need this
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://nameservice1/factual_player'

Hive select csv files only from directory

I have the following file structure
/base/{yyyy-mm-dd}/
folder1/
folderContainingCSV/
logs/
I want to load the data from my base directory for all dates. The problem is that there are files in a non-csv.gz format in the logs/ directory. Is there a way to select only the csv.gz files when querying from the base directory level?
Sample query:
CREATE EXTERNAL TABLE IF NOT EXISTS csvData (
`col1` string,
`col2` string,
`col3` string,
`col4` string,
`col5` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = '|'
) LOCATION 's3://base/'
TBLPROPERTIES ('has_encrypted_data'='true');
You cannot do this at the table-creation level. You need to copy all the *.gz files into a separate folder.
This can be done within the Hive script (the one containing the CREATE TABLE statement) itself. Just add the commands below at the beginning of the script, before the CREATE TABLE:
dfs -mkdir -p /new/path/folder
dfs -cp /regular/log/file/*.gz /new/path/folder
Now you can create the external table pointing to /new/path/folder, as sketched below.
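A hedged sketch of how the combined Hive script could look (the source glob and target folder are illustrative placeholders):
dfs -mkdir -p /new/path/folder;
dfs -cp /base/*/folderContainingCSV/*.csv.gz /new/path/folder;
-- then the same CREATE EXTERNAL TABLE as above, but with
-- LOCATION '/new/path/folder'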

Hive Json SerDE for ORC or RC Format

Is it possible to use a JSON SerDe with the RC or ORC file formats? I am trying to insert into a Hive table with the ORC file format and store it on an Azure blob as serialized JSON.
Apparently not
insert overwrite local directory '/home/cloudera/local/mytable'
stored as orc
select '{"mycol":123,"mystring","Hello"}'
;
create external table verify_data (rec string)
stored as orc
location 'file:////home/cloudera/local/mytable'
;
select * from verify_data
;
rec
{"mycol":123,"mystring","Hello"}
create external table mytable (myint int,mystring string)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe'
stored as orc
location 'file:///home/cloudera/local/mytable'
;
myint mystring
Failed with exception java.io.IOException:java.lang.ClassCastException:
org.apache.hadoop.hive.ql.io.orc.OrcStruct cannot be cast to org.apache.hadoop.io.Text
JsonSerDe.java:
...
import org.apache.hadoop.io.Text;
...
@Override
public Object deserialize(Writable blob) throws SerDeException {
    // With ORC, the incoming Writable is an OrcStruct, not Text, so this cast fails
    Text t = (Text) blob;
    ...
You can do this with some sort of conversion step, for example a bucketing step that produces ORC files in a target directory, and then mount a Hive table with the same schema on top of them, like below.
CREATE EXTERNAL TABLE my_fact_orc
(
some_id STRING,
mycol STRING,
mystring INT
)
PARTITIONED BY (dt string)
CLUSTERED BY (some_id) INTO 64 BUCKETS
STORED AS ORC
LOCATION 's3://dev/my_fact_orc'
TBLPROPERTIES ('orc.compress'='SNAPPY');
ALTER TABLE my_fact_orc ADD IF NOT EXISTS PARTITION (dt='2017-09-07') LOCATION 's3://dev/my_fact_orc/dt=2017-09-07';
ALTER TABLE my_fact_orc PARTITION (dt='2017-09-07') SET FILEFORMAT ORC;
SELECT * FROM my_fact_orc WHERE dt='2017-09-07' LIMIT 5;
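A hedged sketch of one possible conversion step, using a hypothetical staging table my_fact_json that reads the raw JSON as text with the JsonSerDe and then rewrites it into the ORC table above:
-- Hypothetical staging table over the raw JSON files (TEXTFILE + JsonSerDe works; ORC + JsonSerDe does not)
CREATE EXTERNAL TABLE my_fact_json (some_id STRING, mycol STRING, mystring INT)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION 's3://dev/my_fact_json/';

-- Rewrite the rows into the bucketed ORC table (on older Hive versions, bucketing had to be enforced explicitly)
SET hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE my_fact_orc PARTITION (dt='2017-09-07')
SELECT some_id, mycol, mystring FROM my_fact_json;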

How to create a table using the EXTERNAL keyword in Hive

I want to process data in HDFS. I am trying to create a table using the EXTERNAL keyword, but I am getting the following error. Can you please provide a solution?
hive> create EXTERNAL table samplecv(id INT, name STRING)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
LOCATION '/home/siva/jobportal/sample.csv';
I am getting the following error; can you please provide a solution?
FAILED: Error in metadata: MetaException(message:Got exception: org.apache.hadoop.ipc.RemoteException java.io.FileNotFoundException: Parent path is not a directory: /home/siva/jobportal/sample.csv
Can you please confirm that this path is on HDFS?
More info on creating external tables in Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ExternalTables
I use the following for an XML-parsing SerDe in Hive:
CREATE EXTERNAL TABLE XYZ(
X STRING,
Y STRING,
Z ARRAY<STRING>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.X"="/XX/#X",
"column.xpath.Y"="/YY/#Y"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/XXX'
TBLPROPERTIES (
"xmlinput.start"="<xml start",
"xmlinput.end"="</xml end>"
);
At the moment, Hive only allows you to set a directory as a partition location while adding a partition. What you are trying to do here is to set a file as the partition location. The workaround I use is to first add a partition with a dummy/non-existent directory (Hive doesn't demand that the directory exist while it's being set as a partition location) and then use ALTER TABLE ... PARTITION ... SET LOCATION to change the partition location to the desired file. Surprisingly, Hive doesn't force the location to be a directory when setting the location of an existing partition the way it does when adding a new partition. So in your case, it will look like this:
alter table samplecv add partition (id='11', name='somename') location '/home/siva/jobportal/somedirectory'
alter table samplecv partition (id='11', name='somename') set location '/home/siva/jobportal/sample.csv'
Hive always expects a directory name in the LOCATION path, not a file name.
Create your file inside a directory, for example /home/siva/jobportal/sample/sample.csv, and then try running the command below to create your Hive table.
create EXTERNAL table samplecv(id INT, name STRING)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
LOCATION '/home/siva/jobportal/sample';
If you still get an error, make sure the file is in HDFS (a short sketch follows) and try again; it should work.
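A hedged sketch of getting the file into an HDFS directory first (paths taken from the question):
hadoop fs -mkdir -p /home/siva/jobportal/sample
hadoop fs -put sample.csv /home/siva/jobportal/sample/
Then run the CREATE EXTERNAL TABLE above with LOCATION '/home/siva/jobportal/sample'.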