How to create a table using the EXTERNAL keyword in Hive - hive

I want to process data stored in HDFS. I am trying to create a table using the EXTERNAL keyword:
hive> create EXTERNAL table samplecv(id INT, name STRING)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
LOCATION '/home/siva/jobportal/sample.csv';
I am getting the following error; can you please provide a solution?
FAILED: Error in metadata: MetaException(message:Got exception: org.apache.hadoop.ipc.RemoteException java.io.FileNotFoundException: Parent path is not a directory: /home/siva/jobportal/sample.csv

Can you please confirm that this path is on HDFS?
More info on creating external tables in Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ExternalTables
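If you are not sure, you can check directly from the Hive prompt with a dfs command (a quick sketch; the path is the one from the question):
dfs -ls /home/siva/jobportal/sample.csv;
If that command reports that the path does not exist, the file only exists on the local filesystem and needs to be uploaded to HDFS first.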

I use the following for an XML-parsing SerDe in Hive:
CREATE EXTERNAL TABLE XYZ(
X STRING,
Y STRING,
Z ARRAY<STRING>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.X"="/XX/#X",
"column.xpath.Y"="/YY/#Y"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/XXX'
TBLPROPERTIES (
"xmlinput.start"="<xml start",
"xmlinput.end"="</xml end>"
);

At the moment, Hive only allows you to set a directory as a partition location while adding a partition; what you are trying to do here is set a file as the partition location. The workaround I use is to first add a partition with a dummy/non-existent directory (Hive doesn't demand that the directory exist while it's being set as a partition location) and then use ALTER TABLE ... PARTITION ... SET LOCATION to change the partition location to your desired file. Surprisingly, Hive doesn't force the location to be a directory when setting the location of an existing partition the way it does when adding a new partition. So in your case it will look like this:
alter table samplecv add partition (id='11', name='somename') location '/home/siva/jobportal/somedirectory'
alter table samplecv partition (id='11', name='somename') set location '/home/siva/jobportal/sample.csv'

Hive always expects a directory in the LOCATION path, not a file name.
Put your file inside a directory, for example /home/siva/jobportal/sample/sample.csv, and then run the command below to create your Hive table.
create EXTERNAL table samplecv(id INT, name STRING)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
LOCATION '/home/siva/jobportal/sample';
If you still get an error, make sure the file is actually in HDFS (not just on the local filesystem) and try again; it should work.
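For example, assuming the file currently sits on the local filesystem, the following commands (run from the Hive prompt; paths are the ones from the question) would stage it into an HDFS directory before creating the table:
dfs -mkdir -p /home/siva/jobportal/sample;
dfs -put /home/siva/jobportal/sample.csv /home/siva/jobportal/sample/;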

Related

Amazon Athena set location to single csv file

I would like to set the location value in my Athena SQL create table statement to a single CSV file as I do not want to query every file in the path. I can set and successfully query an s3 directory (object) path and all files in that path, but not a single file. Is setting a single file as the location supported?
Successfully queries CSV files in path:
LOCATION 's3://my_bucket/path/'
Returns zero results:
LOCATION 's3://my_bucket/path/filename.csv.gz'
Create table statement:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`my_db` (
`name` string,
`occupation` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim' = ','
) LOCATION 's3://bucket-name/path/filename.csv.gz'
TBLPROPERTIES ('has_encrypted_data'='false');
I have read this Q&A and this, but neither seems to address the same issue.
Thank you.
You could try adding the path of that particular object as a WHERE condition when querying:
SELECT * FROM default.my_db
WHERE "$path" = 's3://bucket-name/path/filename.csv.gz'

Cloudera - Hive/Impala Show Create Table - Error with the syntax

I'm making some automatic processes to create tables on Cloudera Hive.
For that I am using the show create table statement, which gives me (for example) the following DDL:
CREATE TABLE clsd_core.factual_player ( player_name STRING, number_goals INT ) PARTITIONED BY ( player_name STRING ) WITH SERDEPROPERTIES ('serialization.format'='1') STORED AS PARQUET LOCATION 'hdfs://nameservice1/factual_player'
What I need is to run the DDL in a different place to create a table with the same name.
However, when I run that code I get the following error:
Error while compiling statement: FAILED: ParseException line 1:123 missing EOF at 'WITH' near ')'
When I manually remove the part "WITH SERDEPROPERTIES ('serialization.format'='1')", the table is created successfully.
Is there a better way to retrieve table DDLs without the SERDE information?
The first issue in your DDL is that the partition column should not be listed in the columns spec, only in PARTITIONED BY. A partition is a folder named partition_column=value, and this column is not stored in the table files, only in the partition directory. If you want the partition column to also be present in the data files, it should be named differently.
The second issue is that SERDEPROPERTIES is part of the SERDE specification; if you do not specify a SERDE, there should be no SERDEPROPERTIES. See this manual: StorageFormat and SerDe
Fixed DDL:
CREATE TABLE factual_player (number_goals INT)
PARTITIONED BY (player_name STRING)
STORED AS PARQUET
LOCATION 'hdfs://nameservice1/factual_player';
STORED AS PARQUET already implies the SERDE, INPUTFORMAT and OUTPUTFORMAT.
If you want to specify the SERDE with its properties, use this syntax:
CREATE TABLE factual_player(number_goals int)
PARTITIONED BY (player_name string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
WITH SERDEPROPERTIES ('serialization.format'='1') --I believe you really do not need this
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'hdfs://nameservice1/factual_player'
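Either way, the partition column is supplied at write time rather than in the column list. A minimal usage sketch (the player name and value are made up):
INSERT INTO TABLE factual_player PARTITION (player_name='SomePlayer') VALUES (3);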

Zero results in Athena query of S3 object

I placed a comma-delimited text file in an S3 bucket. I am attempting to query the folder the file resides in, but it returns zero results.
Create table DDL:
CREATE EXTERNAL TABLE myDatabase.myTable (
`field_1` string,
`field_2` string,
...
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://bucket/files from boss/'
TBLPROPERTIES ('has_encrypted_data'='false');
The issue was the whitespace in the location:
LOCATION 's3://bucket/files from boss/'
I removed the whitespace from the folder name in S3 and I was able to query without issue:
LOCATION 's3://bucket/files_from_boss/'

Hive select csv files only from directory

I have the following file structure
/base/{yyyy-mm-dd}/
folder1/
folderContainingCSV/
logs/
I want to load the data from my base directory for all dates. The problem is that there are files in a format other than csv.gz in the logs/ directory. Is there a way to select only csv.gz files when querying from the base directory level?
Sample query:
CREATE EXTERNAL TABLE IF NOT EXISTS csvData (
`col1` string,
`col2` string,
`col3` string,
`col4` string,
`col5` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = '|'
) LOCATION 's3://base/'
TBLPROPERTIES ('has_encrypted_data'='true');
You cannot do this at the table-creation level. You need to copy all the *.gz files into a separate folder.
This can be done within the Hive script (the one containing the CREATE TABLE statement) itself. Just add the commands below at the beginning of the script, right before the CREATE TABLE:
dfs -mkdir -p /new/path/folder
dfs -cp /regular/log/file/*.gz /new/path/folder
Now you can create the external table pointing to /new/path/folder.
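A sketch of how the end of that script could look, reusing the columns and SerDe settings from the question and the placeholder folder above:
CREATE EXTERNAL TABLE IF NOT EXISTS csvData (
`col1` string,
`col2` string,
`col3` string,
`col4` string,
`col5` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = '|'
) LOCATION '/new/path/folder'
TBLPROPERTIES ('has_encrypted_data'='true');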

How do you add Data to an Existing Hive Metastore?

I have multiple subdirectories in S3 that contain .orc files. I'm trying to create a Hive metastore table so I can query the data with Presto / Hive, etc. The data is poorly structured (no consistent delimiter, ugly characters, etc.). Here's a scrubbed sample:
1488736466 199.199.199.199 0_b.www.sphericalcow.com.f9b1.qk-g6m6z24tdr.v4.url.name.com TXT IN: NXDOMAIN/0/143
1488736466 6.6.5.4 0.3399.186472.4306.6668.638.cb5a.names-things.update.url.name.com TXT IN: NOERROR/3/306 0\009253\009http://az.blargi.ng/%D3%AB%EF%BF%BD%EF%BF%BD/\009 0\009253\009http://casinoroyal.online/\009 0\009253\009http://d2njbfxlilvpsq.cloudfront.net/b_zq_ym_bangvideo/bangvideo0826.apk\009
I was able to create a table pointing to one of the subdirectories using a SerDe regex, and the fields parse properly, but as far as I can tell I can only load one subfolder at a time.
How does one add more data to an existing Hive metastore?
Here's an example of my create statement with the regex SerDe bit:
DROP TABLE IF EXISTS test;
CREATE EXTERNAL TABLE test (field1 string, field2 string, field3 string, field4 string)
COMMENT 'fill all the tables with the datas.'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([0-9]{10}) ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) (\\S*) (.*)",
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
STORED AS ORC
LOCATION 's3://path/to/one/of/10/folders/'
tblproperties ("orc.compress" = "SNAPPY", "skip.header.line.count"="2");
select * from test limit 10;
I realize there is probably a very simple solution. I tried INSERT INTO in place of CREATE EXTERNAL TABLE, but it understandably complains about the input, and I looked in both the Hive and SerDe documentation but was unable to find a reference to adding data to an existing store.
A possible solution using partitions:
CREATE EXTERNAL TABLE test (field1 string, field2 string, field3 string, field4 string)
partitioned by (mypartcol string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([0-9]{10}) ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) (\\S*) (.*)"
)
LOCATION 's3://whatever/as/long/as/it/is/empty'
tblproperties ("skip.header.line.count"="2");
alter table test add partition (mypartcol='folder 1') location 's3://path/to/1st/of/10/folders/';
alter table test add partition (mypartcol='folder 2') location 's3://path/to/2nd/of/10/folders/';
.
.
.
alter table test add partition (mypartcol='folder 10') location 's3://path/to/10th/of/10/folders/';
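Once the partitions are registered, you can query a single folder or all of them through the partition column; a small usage sketch (the partition value is from the example above):
select field1, field2 from test where mypartcol='folder 1' limit 10;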
For @TheProletariat (the OP):
It seems there is no need for RegexSerDe since the columns are delimited by space (' ').
Note the use of tblproperties ("serialization.last.column.takes.rest"="true")
create external table test
(
field1 bigint
,field2 string
,field3 string
,field4 string
)
row format delimited
fields terminated by ' '
tblproperties ("serialization.last.column.takes.rest"="true")
;
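If you still need to cover all ten folders, this delimited definition can be combined with the same partition trick shown above (a sketch; the empty location and folder path are the placeholders from the earlier example):
create external table test
(
field1 bigint
,field2 string
,field3 string
,field4 string
)
partitioned by (mypartcol string)
row format delimited
fields terminated by ' '
location 's3://whatever/as/long/as/it/is/empty'
tblproperties ("serialization.last.column.takes.rest"="true", "skip.header.line.count"="2");

alter table test add partition (mypartcol='folder 1') location 's3://path/to/1st/of/10/folders/';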