Hive: select only csv files from a directory

I have the following file structure:
/base/{yyyy-mm-dd}/
    folder1/
    folderContainingCSV/
    logs/
I want to load the data from my base directory for all dates. The problem is that the logs/ directory contains files that are not in csv.gz format. Is there a way to select only csv.gz files when querying at the base directory level?
Sample query:
CREATE EXTERNAL TABLE IF NOT EXISTS csvData (
`col1` string,
`col2` string,
`col3` string,
`col4` string,
`col5` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = '|'
) LOCATION 's3://base/'
TBLPROPERTIES ('has_encrypted_data'='true');

You can't do this at the table creation level. You need to copy all the *.gz files separately into another folder.
This can be done within the Hive script (the one containing the CREATE TABLE statement) itself. Just add the commands below at the beginning of the script, just before the CREATE TABLE:
dfs -mkdir -p /new/path/folder;
dfs -cp /regular/log/file/*.gz /new/path/folder;
Now you can create the external table pointing to /new/path/folder.
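Putting it together, a minimal sketch of such a script (reusing the placeholder paths from the commands above together with the DDL from the question) might look like:
-- stage only the .gz files into a separate folder
dfs -mkdir -p /new/path/folder;
dfs -cp /regular/log/file/*.gz /new/path/folder;

-- then point the external table at the staged folder
CREATE EXTERNAL TABLE IF NOT EXISTS csvData (
`col1` string,
`col2` string,
`col3` string,
`col4` string,
`col5` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = '|'
) LOCATION '/new/path/folder'
TBLPROPERTIES ('has_encrypted_data'='true');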

Related

Amazon Athena set location to single csv file

I would like to set the location value in my Athena SQL create table statement to a single CSV file as I do not want to query every file in the path. I can set and successfully query an s3 directory (object) path and all files in that path, but not a single file. Is setting a single file as the location supported?
Successfully queries CSV files in path:
LOCATION 's3://my_bucket/path/'
Returns zero results:
LOCATION 's3://my_bucket/path/filename.csv.gz'
Create table statement:
CREATE EXTERNAL TABLE IF NOT EXISTS `default`.`my_db` (
`name` string,
`occupation` string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim' = ','
) LOCATION 's3://bucket-name/path/filename.csv.gz'
TBLPROPERTIES ('has_encrypted_data'='false');
I have read this Q&A and this, but it doesn't seem to address the same issue.
Thank you.
You could try adding the path of that particular object in a WHERE condition when querying:
SELECT * FROM default.my_db
WHERE "$path" = 's3://bucket-name/path/filename.csv.gz'

Impala CREATE EXTERNAL TABLE and remove double quotes

I have data in a CSV, for example:
"Female","44","0","0","Yes","Govt_job","Urban","103.59","32.7","formerly smoked"
I put it onto HDFS with hdfs dfs -put
and now I want to create an external table from it in Impala (not in Hive).
Is there an option to do this without the double quotes?
This is what I run in impala-shell:
CREATE EXTERNAL TABLE IF NOT EXISTS test_test.test1_ext
( `gender` STRING,`age` STRING,`hypertension` STRING,`heart_disease` STRING,`ever_married` STRING,`work_type` STRING,`Residence_type` STRING,`avg_glucose_level` STRING,`bmi` STRING,`smoking_status` STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION "/user/test/tmp/test1"
Update 28.11
I managed to do it by creating the external table and then creating a VIEW as a SELECT with a case when / concat() expression for each column.
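For reference, a minimal sketch of that kind of quote-stripping view (shown here with regexp_replace rather than case when/concat, with only a few columns spelled out; the view name is made up):
CREATE VIEW test_test.test1_noquotes AS
SELECT
  regexp_replace(gender, '"', '') AS gender,
  regexp_replace(age, '"', '') AS age,
  -- ...the remaining columns follow the same pattern...
  regexp_replace(smoking_status, '"', '') AS smoking_status
FROM test_test.test1_ext;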
Impala uses the Hive metastore, so anything created in Hive is available from Impala after issuing an INVALIDATE METADATA dbname.tablename. HOWEVER, to remove the quotes you need to use the Hive SerDe 'org.apache.hadoop.hive.serde2.OpenCSVSerde', and this is not accessible from Impala. My suggestion would be to do the following:
Create the external table in Hive
CREATE EXTERNAL TABLE IF NOT EXISTS test_test.test1_ext
( gender STRING, age STRING, hypertension STRING, heart_disease STRING, ever_married STRING, work_type STRING, Residence_type STRING, avg_glucose_level STRING, bmi STRING, smoking_status STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = ",",
"quoteChar" = """
)
STORED AS TEXTFILE
LOCATION "/user/test/tmp/test1"
Create a managed table in Hive using CTAS
CREATE TABLE mytable AS SELECT * FROM test_test.test1_ext;
Make it available in Impala
INVALIDATE METADATA db.mytable;

Zero results in Athena query of S3 object

I placed a text file that is comma delimited in an S3 bucket. I am attempting to query the folder the file resides in but it returns zero results.
Create table DDL:
CREATE EXTERNAL TABLE myDatabase.myTable (
`field_1` string,
`field_2` string,
...
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = ',',
'field.delim' = ','
) LOCATION 's3://bucket/files from boss/'
TBLPROPERTIES ('has_encrypted_data'='false');
The issue was the whitespace in the location:
LOCATION 's3://bucket/files from boss/'
I removed the whitespace from the folder name in S3 and I was able to query without issue:
LOCATION 's3://bucket/files_from_boss/'

Hive - Loading into default location of external table

I have an external table created in Hive with the default location:
CREATE EXTERNAL TABLE `testtable`(
`id` int,
`name` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'<hdfs-uri>/hive/warehouse/testtable'
I would like to confirm whether I can move a text file containing ID/name values from local to the HDFS location /hive/warehouse/testtable/test.txt for the external table testtable. Thanks.
Yes. You can upload test.txt from local to the HDFS location for the external table testtable (<hdfs-uri>/hive/warehouse/testtable). This works even if <hdfs-uri>/hive/warehouse/ is the default Hive warehouse directory.
Just keep in mind: for non-external (Hive-managed) tables, dropping the table drops its warehouse HDFS directory automatically. For an external table, dropping the table does not drop the HDFS directory backing it; the directory has to be deleted as a separate operation (see the quick check after the illustration below).
Illustration:
Here the Hive warehouse directory is hdfs:///apps/hive/warehouse
Create table
hive> CREATE EXTERNAL TABLE `testtable`(
`id` int,
`name` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'hdfs:///apps/hive/warehouse/testtable';
Data in test.txt
1,name-1
2,name-2
3,name-3
Upload data to HDFS
hadoop fs -put test.txt hdfs:///apps/hive/warehouse/testtable
Query table
hive> select * from testtable;
1 name-1
2 name-2
3 name-3
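Regarding the drop-behavior note above, a quick check (sketched with the same paths as the illustration) is to drop the table and list the directory; because the table is EXTERNAL, the files are still there and have to be removed separately:
hive> DROP TABLE testtable;
hadoop fs -ls hdfs:///apps/hive/warehouse/testtable
hadoop fs -rm -r hdfs:///apps/hive/warehouse/testtable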

How to create a table using the EXTERNAL keyword in Hive

I want to process data in HDFS. I am trying to create a table using the EXTERNAL keyword, but I am getting the following error. Can you please provide a solution?
hive> create EXTERNAL table samplecv(id INT, name STRING)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
LOCATION '/home/siva/jobportal/sample.csv';
I am getting the following error:
FAILED: Error in metadata: MetaException(message:Got exception: org.apache.hadoop.ipc.RemoteException java.io.FileNotFoundException: Parent path is not a directory: /home/siva/jobportal/sample.csv
Can you please confirm that this path is on HDFS?
More info on creating external tables in Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ExternalTables
I use the following for an XML parsing SerDe in Hive:
CREATE EXTERNAL TABLE XYZ(
X STRING,
Y STRING,
Z ARRAY<STRING>
)
ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
WITH SERDEPROPERTIES (
"column.xpath.X"="/XX/#X",
"column.xpath.Y"="/YY/#Y"
)
STORED AS
INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION '/user/XXX'
TBLPROPERTIES (
"xmlinput.start"="<xml start",
"xmlinput.end"="</xml end>"
);
At the moment, Hive only allows you to set a directory as a partition location while adding a partition. What you are trying to do here is to set a file as the partition location. The workaround I use is to first add a partition with a dummy/non-existent directory (Hive doesn't demand that the directory exist while it's being set as a partition location) and then use ALTER TABLE ... PARTITION ... SET LOCATION to change the partition location to the desired file. Surprisingly, Hive doesn't force the location to be a directory when setting the location of an existing partition the way it does when adding a new partition. So in your case, it will look like:
alter table samplecv add partition (id='11', name='somename') location '/home/siva/jobportal/somedirectory'
alter table samplecv partition (id='11', name='somename') set location '/home/siva/jobportal/sample.csv'
Hive always expects a directory name in the LOCATION path, not a file name.
Put your file inside a directory, for example /home/siva/jobportal/sample/sample.csv, and then try running the command below to create your Hive table.
create EXTERNAL table samplecv(id INT, name STRING)
row format serde 'com.bizo.hive.serde.csv.CSVSerde'
with serdeproperties (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
LOCATION '/home/siva/jobportal/sample';
If you get any error, just make sure your file is in HDFS and try again; it should work.
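For example, something along these lines (assuming sample.csv is in your local working directory; the target directory matches the DDL above) stages the file before running the CREATE TABLE:
hdfs dfs -mkdir -p /home/siva/jobportal/sample
hdfs dfs -put sample.csv /home/siva/jobportal/sample/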