I am trying to create a Hive external table that points to an S3 output file.
The file name should reflect the current date (it is always a new file).
I tried this:
CREATE EXTERNAL TABLE s3_export (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION concat('s3://BlobStore/Exports/Daily_', from_unixtime(unix_timestamp(),'yyyy-MM-dd'));
but I get an error:
FAILED: Parse Error: line 3:9 mismatched input 'concat' expecting StringLiteral near 'LOCATION' in table location specification
Is there any way to specify the table location dynamically?
OK, I found the Hive variables feature.
So I pass the location on the command line as follows:
hive -d s3file=s3://BlobStore/Exports/APKsCollection_test/`date +%F`/
and then use the variable in the hive command
CREATE EXTERNAL TABLE s3_export (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${s3file}';
This doesn't work on my side.
How did you get it to work?
hive -d s3file=s3://BlobStore/Exports/APKsCollection_test/`date +%F`/
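The variable approach above can be sketched as a small wrapper script. This is only an illustration: the function name dated_location and the script file create_table.hql are hypothetical, and the hive command is printed rather than executed so the sketch runs without a Hive installation. Note that the variable name must match exactly (including case) between the command line and the ${s3file} reference in the statement.

```shell
#!/bin/bash
# Build a dated S3 prefix. The bucket and path come from the question;
# the function name is a hypothetical helper, not part of Hive.
dated_location() {
  local base="$1"
  local day="${2:-$(date +%F)}"   # default to today's date as YYYY-MM-DD
  printf '%s/%s/\n' "${base%/}" "$day"
}

s3file=$(dated_location "s3://BlobStore/Exports/APKsCollection_test")

# Print the command instead of running it. create_table.hql is assumed to
# contain the CREATE EXTERNAL TABLE statement with LOCATION '${s3file}'.
echo hive --hivevar "s3file=${s3file}" -f create_table.hql
```

Passing an explicit date as the second argument, for example dated_location "s3://BlobStore/Exports" 2016-01-27, makes it possible to re-create the table for an earlier export.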
Related
I need to load an s3 data in hive table. This s3 location is dynamic and is stored in another static s3 location.
The dynamic s3 location which I want to load in hive table has path format
s3://s3BucketName/some-path/yyyy-MM-dd
and the static location has data format
{"datasetDate": "datePublished", "s3Location": "s3://s3BucketName/some-path/yyyy-MM-dd"}
Is there a way to read this data in hive? I searched about this a lot but could not find anything.
You can read the JSON data from your static location file, parse the s3Location field, and pass it as a parameter to your ADD PARTITION clause.
One possible way to read the JSON is with Hive itself, but you can use any other means to do the same.
Example using Hive:
create table data_location(location_info string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://s3BucketName/some-static-location-path/';
Then get the location in the shell script and pass it as a parameter to ADD partition statement.
For example you have table named target_table partitioned by datePublished. You can add partitions like this:
#!/bin/bash
data_location=$(hive -e "set hive.cli.print.header=false; select get_json_object(location_info,'$.s3Location') from data_location")
#get partition name
partition=$(basename "${data_location}")
#Create partition in your target table:
hive -e "ALTER TABLE TARGET_TABLE ADD IF NOT EXISTS PARTITION (datePublished='${partition}') LOCATION '${data_location}'"
If you do not want a partitioned table, you can use
ALTER TABLE ... SET LOCATION instead of adding a partition (note that there is no '=' before the path):
hive -e "ALTER TABLE TARGET_TABLE SET LOCATION '${data_location}'"
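As a sketch of the "other means" mentioned above, the s3Location field can also be pulled out of the one-line JSON record with sed, without starting Hive for the lookup. The function name extract_s3_location and the concrete date in the sample record are assumptions made for illustration, and the resulting statement is only printed here. For anything beyond a flat one-line record, a real JSON parser is the safer choice.

```shell
#!/bin/bash
# Hypothetical helper: read one-line JSON records on stdin and print the
# value of the "s3Location" field. Assumes the value contains no escaped quotes.
extract_s3_location() {
  sed -n 's/.*"s3Location"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Sample record in the format from the question; the date is made up.
record='{"datasetDate": "datePublished", "s3Location": "s3://s3BucketName/some-path/2016-01-27"}'

data_location=$(printf '%s\n' "$record" | extract_s3_location)
partition=$(basename "$data_location")

# Print the statement; it could be passed to hive -e instead.
echo "ALTER TABLE target_table ADD IF NOT EXISTS PARTITION (datePublished='${partition}') LOCATION '${data_location}';"
```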
If only the last subfolder name (the date) is dynamic and the base directory is always the same, for example s3://s3BucketName/some-path/ where only yyyy-MM-dd changes, you can create the table once with location s3://s3BucketName/some-path/ and issue a RECOVER PARTITIONS statement (ALTER TABLE ... RECOVER PARTITIONS on Amazon EMR; MSCK REPAIR TABLE is the Apache Hive counterpart). Note that partition discovery normally expects subfolders named in the form datePublished=yyyy-MM-dd. In this case you do not need to read the file that holds the location. Just schedule RECOVER PARTITIONS to attach each new partition on a daily basis.
The below table returns no data while running a select statement
CREATE EXTERNAL TABLE foo (
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LINES TERMINATED BY '\n'
LOCATION '/user/data/CSV/2016/1/27/*/part-*';
I need my Hive table to point to a dynamic folder, so that when a MapReduce job puts a part file into a folder, Hive picks it up in the table.
Is there any way the location be made dynamic like
/user/data/CSV/*/*/*/*/part-*
or just /user/data/CSV/* would do fine ?
(The same code works fine when created as internal table and loaded with the file path - hence there is no issues due to formatting)
First of all, your table definition is missing columns. Second, an external table location always points to a folder, not to particular files. Hive will consider all files in the folder to be data for the table.
If you have data that is generated e.g. on a daily basis by some external process you should consider partitioning your table by date. Then you need to add a new partition to the table when the data is available.
Hive does not iterate through multiple folders.
So for the above scenario, I ran a command that iterates through these folders, cats (prints to the console) all the part files, and then puts the result into the desired location (the one Hive points to):
hadoop fs -cat /user/data/CSV/*/*/*/*/part-* | hadoop fs -put - <destination folder>
This line
LOCATION '/user/data/CSV/2016/1/27/*/part-*';
does not look correct. I don't think a table can be created from multiple locations. Have you tried importing from a single location to confirm this?
It could also be that the delimiter you're using is not correct. If you are importing a CSV file, try delimiting by ','.
You can use an alter table statement to change the locations. In the example below partitions are based on dates where data is stored in time dependent file locations. If I want to search many days I have to add an alter table statement for each location. This idea may extend to your situation quite well. You create a script to generate the create table statement as below using some other technology such as python.
CREATE EXTERNAL TABLE foo (
)
PARTITIONED BY (`date` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LINES TERMINATED BY '\n'
;
-- Each partition points at its own folder; the location must be a quoted string literal.
alter table foo add partition (`date`='20160201') location '/user/data/CSV/20160201/data';
alter table foo add partition (`date`='20160202') location '/user/data/CSV/20160202/data';
alter table foo add partition (`date`='20160203') location '/user/data/CSV/20160203/data';
alter table foo add partition (`date`='20160204') location '/user/data/CSV/20160204/data';
You can use as many add and drop statements as you need to define your locations. Then your table can find data held in many locations in HDFS rather than having all your files in one location.
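The script that generates these statements could look roughly like the sketch below (written in shell rather than Python). The function name gen_add_partitions is hypothetical, the folder layout follows the example above, and the partition column is quoted with backticks on the assumption that date is a reserved word in your Hive version. The statements are printed to stdout, so they can be reviewed or redirected to a file and run with hive -f.

```shell
#!/bin/bash
# Hypothetical helper: print one ADD PARTITION statement per day.
# $1 = table name, $2 = base folder, remaining arguments = days (yyyyMMdd).
gen_add_partitions() {
  local table="$1" base="$2"
  shift 2
  local day
  for day in "$@"; do
    printf "ALTER TABLE %s ADD IF NOT EXISTS PARTITION (\`date\`='%s') LOCATION '%s/%s/data';\n" \
      "$table" "$day" "${base%/}" "$day"
  done
}

gen_add_partitions foo /user/data/CSV 20160201 20160202 20160203 20160204
```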
You may also be able to leverage a CREATE TABLE ... LIKE statement to create a schema like the one you have in another table, and then alter the table to point at the files you want.
I know this isn't exactly what you want and is more of a work around. Good luck!
I can export a hive query results using this:
INSERT OVERWRITE LOCAL DIRECTORY '/home/user/events/'
but if I want to export it to an HDFS dir at /user/events/
how do I do that? I tried this:
INSERT OVERWRITE DIRECTORY '/user/user/events/'
> row format delimited
> fields terminated by '\t'
> select * from table;
but get this error then:
FAILED: ParseException line 2:0 cannot recognize input near 'row' 'format' 'delimited' in statement
Remove the LOCAL keyword; it specifies the local filesystem. Without it the result will go to HDFS. You may actually need to keep OVERWRITE though. So:
INSERT OVERWRITE DIRECTORY '/user/events/'
I have a table that was created using the following HiveQL script:
CREATE EXTERNAL TABLE Logs
(
ip STRING,
time STRING,
query STRING,
pageSize STRING,
statusCode STRING,
browser STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
-- some regexps
)
STORED AS TEXTFILE
LOCATION '/path';
I need to partition the table by the time field. But in all the examples I saw, partitioning is done only by the first field or by a sequence of fields starting with the first. I also saw that if I put a field in the PARTITIONED BY section, I mustn't also list it in the column list of CREATE TABLE.
I tried to create partitioning by time in several ways but always got different exceptions.
For example this:
ParseException line 11:20 cannot recognize input near ')' 'ROW' 'FORMAT' in column type
or this:
ParseException line 16:0 missing EOF at 'PARTITIONED' near ')'
and so on.
So, how can I create partitioning by time field in my case?
A partition column in Hive is not a real column. It just tells Hive where to find the files of a specific partition.
So if you want to store a file in different partitions based on one column in that file, there is no automatic way to do this when loading it: you have to split the input file yourself and load each split file into its partition. (In case you don't know how to split a file by a column, use awk '{print $0 >> ("filebase." $2)}'.)
Or you can load your input into an unpartitioned table first, and then use a query to insert the data into another, partitioned table.
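The awk split mentioned above can be sketched as follows. The function name split_by_column, the sample log lines and the whitespace-separated layout are assumptions for illustration; real log lines may need a different field separator. Each output file is named after the value in the chosen column and could then be loaded into the matching partition.

```shell
#!/bin/bash
# Hypothetical helper: split a whitespace-separated file into one file per
# distinct value of a column. $1 = input file, $2 = column number, $3 = output prefix.
split_by_column() {
  local input="$1" col="$2" prefix="$3"
  awk -v col="$col" -v prefix="$prefix" '{
    out = prefix $col     # e.g. filebase.2016-01-27
    print $0 >> out       # append the whole line to that file
    close(out)            # avoid running out of open file handles
  }' "$input"
}

# Demo with made-up data in a temporary folder.
workdir=$(mktemp -d)
printf '%s\n' \
  '10.0.0.1 2016-01-27 /a' \
  '10.0.0.2 2016-01-28 /b' \
  '10.0.0.3 2016-01-27 /c' > "$workdir/logs.txt"

split_by_column "$workdir/logs.txt" 2 "$workdir/filebase."
ls "$workdir"
```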
I hope this can help.
Below is the hive table i have created:
CREATE EXTERNAL TABLE Activity (
column1 type,
column2 type
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/exttable/';
In my HDFS location /exttable, I have a lot of CSV files, and each CSV file contains a header row. When I run select queries, the result contains the header rows as well.
Is there any way in Hive to ignore the header row or first line?
As of Hive 0.13.0 you can skip header rows with a table property:
tblproperties ("skip.header.line.count"="1");
If you are using Hive version 0.13.0 or higher you can specify "skip.header.line.count"="1" in your table properties to remove the header.
For detailed information on the patch see: https://issues.apache.org/jira/browse/HIVE-5795
Let's say you want to load a CSV file like the one below, located at /home/test/que.csv:
1,TAP (PORTUGAL),AIRLINE
2,ANSA INTERNATIONAL,AUTO RENTAL
3,CARLTON HOTELS,HOTEL-MOTEL
Now, we need to create a location in HDFS that holds this data.
hadoop fs -put /home/test/que.csv /user/mcc
The next step is to create a table. There are two types to choose from: managed (internal) and external.
Example for External Table.
create external table industry_
(
MCC string ,
MCC_Name string,
MCC_Group string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/mcc/'
tblproperties ("skip.header.line.count"="1");
Note: When accessed via Spark SQL, the header row of the CSV will be shown as a data row.
Tested on: spark version 2.4.
There is not. However, you can pre-process your files to skip the first row before loading into HDFS -
tail -n +2 withfirstrow.csv > withoutfirstrow.csv
Alternatively, you can build it into where clause in HIVE to ignore the first row.
If your hive version doesn't support tblproperties ("skip.header.line.count"="1"), you can use below unix command to ignore the first line (column header) and then put it in HDFS.
sed -n '2,$p' File_with_header.csv > File_with_No_header.csv
To remove the header from the csv file in place use:
sed -i 1d filename.csv
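Since the question mentions many CSV files, the same pre-processing can be applied to a whole folder before uploading. The sketch below is only an illustration: the function name strip_headers, the folder names and the header line in the sample file are assumptions, and the upload step is left as a comment.

```shell
#!/bin/bash
# Hypothetical helper: copy every .csv file from one folder to another,
# dropping the first line (the header) of each file.
strip_headers() {
  local src_dir="$1" dst_dir="$2" f
  mkdir -p "$dst_dir"
  for f in "$src_dir"/*.csv; do
    [ -e "$f" ] || continue                      # no matching files
    tail -n +2 "$f" > "$dst_dir/$(basename "$f")"
  done
}

# Demo with a small sample file; the header row is made up.
demo=$(mktemp -d)
mkdir -p "$demo/in"
printf '%s\n' \
  'MCC,MCC_Name,MCC_Group' \
  '1,TAP (PORTUGAL),AIRLINE' \
  '2,ANSA INTERNATIONAL,AUTO RENTAL' > "$demo/in/que.csv"

strip_headers "$demo/in" "$demo/out"
cat "$demo/out/que.csv"
# The cleaned files could then be uploaded, e.g. hadoop fs -put "$demo/out"/*.csv /user/mcc
```

Writing to a separate output folder keeps the original files untouched, unlike sed -i, which edits them in place.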