Partitioned by non-first column - hive

I have a table that was created using the following HiveQL script:
CREATE EXTERNAL TABLE Logs
(
ip STRING,
time STRING,
query STRING,
pageSize STRING,
statusCode STRING,
browser STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
-- some regexps
)
STORED AS TEXTFILE
LOCATION '/path';
I need to partition this table by the time field. But in all the examples I have seen, partitioning is done only on the first field or on a sequence of fields starting from the first one. I also saw that if I list a field in the PARTITIONED BY clause, I must not list it in the CREATE TABLE column list.
I tried to partition by time in several ways, but I always got different exceptions.
For example this:
ParseException line 11:20 cannot recognize input near ')' 'ROW' 'FORMAT' in column type
or this:
ParseException line 16:0 missing EOF at 'PARTITIONED' near ')'
and so on.
So, how can I create partitioning by time field in my case?

The partition column in Hive is not a real column. It just gives Hive a hint about where to find the files of a specific partition.
So if you have a single file and you want to store its rows in different partitions based on one of its columns, there is no automatic way to do this: you have to split the input file yourself and load each split into the corresponding partition. (In case you don't know how to split a file based on a column, use awk '{print $0 >> "filebase." $2}'.)
Alternatively, you can load your input into an unpartitioned table first, and then use a query to insert that data into another, partitioned table.
I hope this helps.
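For the second approach, here is a minimal sketch in HiveQL (the partitioned table name, its location, and the dynamic-partition settings are assumptions for illustration, not part of the original question):
-- The partition column "time" is declared only in PARTITIONED BY, not in the column list.
CREATE EXTERNAL TABLE Logs_partitioned
(
ip STRING,
query STRING,
pageSize STRING,
statusCode STRING,
browser STRING
)
PARTITIONED BY (time STRING)
STORED AS TEXTFILE
LOCATION '/path_partitioned';  -- hypothetical location
-- Allow Hive to create the partitions from the data itself.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
-- The dynamic partition column must come last in the SELECT list.
INSERT OVERWRITE TABLE Logs_partitioned PARTITION (time)
SELECT ip, query, pageSize, statusCode, browser, time
FROM Logs;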

Related

Getting null in some columns in hive due to next line "\n" within records

I have a table that contains newline characters ("\n") within its records. So when I do SELECT * on the table, I get null values in the columns that come after the field containing the "\n", or sometimes I get multiple rows for a single record.
I get the above problem everywhere: in the terminal, DB Visualizer, and Tableau.
The data is stored correctly; the error occurs because Hive is not returning the result in the proper format. So we need to change the query result file format of Hive by setting the property below:
set hive.query.result.fileformat=SequenceFile;
Its default value was TextFile, which was causing the error.
Default Value:
Hive 0.x, 1.x, and 2.0: TextFile
Hive 2.1 onward: SequenceFile
https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties
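As a usage sketch (the table name is made up), the property is set in the same session before running the query:
set hive.query.result.fileformat=SequenceFile;
select * from my_table;  -- my_table: hypothetical table whose string columns contain "\n"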

Import csv to impala

For my previous homework, we were asked to import a CSV file with no column names into Impala, explicitly giving the name and type of each column while creating the table. Now we have a CSV file that does include column names; in this case, do we still need to write down the name and type of each column even though they are provided in the data?
Yes, you still have to create an external table and define the column names and types. But you also have to pass the following option right at the end of the CREATE TABLE statement:
tblproperties ("skip.header.line.count"="1");
-- Once the table property is set, queries skip the specified number of lines
-- at the beginning of each text data file. Therefore, all the files in the table
-- should follow the same convention for header lines.
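A minimal sketch of such a statement (the table name, columns, and location are hypothetical; the real ones come from your CSV):
CREATE EXTERNAL TABLE sales_csv (
id INT,
product STRING,
price DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/student/sales_csv'  -- folder that holds the CSV file
TBLPROPERTIES ("skip.header.line.count"="1");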

External Table in Hive - Location

The table below returns no data when running a SELECT statement:
CREATE EXTERNAL TABLE foo (
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LINES TERMINATED BY '\n'
LOCATION '/user/data/CSV/2016/1/27/*/part-*';
I need Hive to point to a dynamic folder, so that as a MapReduce job puts a part file into a folder, Hive loads it into the table.
Is there any way the location can be made dynamic, like
/user/data/CSV/*/*/*/*/part-*
or would just /user/data/CSV/* do fine?
(The same code works fine when the table is created as an internal table and loaded with the file path, so there are no issues due to formatting.)
First off, your table definition is missing columns. Second, an external table's location always points to a folder, not to particular files. Hive will consider all files in the folder to be data for the table.
If your data is generated, e.g., on a daily basis by some external process, you should consider partitioning your table by date. Then you add a new partition to the table whenever the data becomes available.
Hive does not iterate through multiple folders.
Hence, for the above scenario, I ran a command that iterates through these multiple folders, cats (prints to the console) all the part files, and then puts the result into the desired location (the one Hive points to):
hadoop fs -cat /user/data/CSV/*/*/*/*/part-* | hadoop fs -put - <destination folder>
This line
LOCATION '/user/data/CSV/2016/1/27/*/part-*';
does not look correct; I don't think a table can be created from multiple locations. Have you tried importing from a single location to confirm this?
It could also be that the delimiter you're using is not correct. If you are importing a CSV file, try delimiting by ','.
You can use an ALTER TABLE statement to change the locations. In the example below, the partitions are based on dates, with the data stored in time-dependent file locations. If I want to search many days, I have to add an ALTER TABLE statement for each location. This idea may extend to your situation quite well. You can create a script, using some other technology such as Python, to generate the statements below.
CREATE EXTERNAL TABLE foo (
)
PARTITIONED BY (`date` STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LINES TERMINATED BY '\n'
;
alter table foo add partition (`date`='20160201') location '/user/data/CSV/20160201/data';
alter table foo add partition (`date`='20160202') location '/user/data/CSV/20160202/data';
alter table foo add partition (`date`='20160203') location '/user/data/CSV/20160203/data';
alter table foo add partition (`date`='20160204') location '/user/data/CSV/20160204/data';
You can use as many ADD and DROP PARTITION statements as you need to define your locations. Your table can then find data held in many locations in HDFS rather than having all your files in one place.
You may also be able to leverage a
create table like
statement to create a table with the same schema as another table, and then alter the new table to point at the files you want.
I know this isn't exactly what you want and is more of a work around. Good luck!
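A minimal sketch of that workaround (the new table name and the path are hypothetical):
-- Copy the schema of an existing table without copying its data.
CREATE EXTERNAL TABLE foo_copy LIKE foo;
-- Point the new table at the files you actually want to query.
ALTER TABLE foo_copy SET LOCATION '/user/data/CSV/20160201/data';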

How do I import data from a csv when the records are not separated by line breaks but with brackets

I'm looking at the AM data for a data analysis project and I'm having trouble importing it into my DBMS (PostgreSQL).
My SQL code is this:
DROP TABLE IF EXISTS member_details;
CREATE TABLE member_details(
pnum varchar(255),
.....
updatedon timestamp);
COPY member_details
FROM '/Users/etc/data/sample_dump.csv'
WITH DELIMITER ','
CSV;
The problem is that the CSV file has no line breaks to separate the records; instead, each record is enclosed in brackets, which my code above does not recognise, so it imports all the data as a single line and no records are created.
This is how the data is structured:
(dataA1, ....,dataAx),(dataB1,...,dataBx)
How can I alter my code so that PostgreSQL imports the data record by record, recognising the brackets?
Based on the PostgreSQL COPY documentation, I don't believe it allows row delimiters other than carriage returns and/or line feeds. I believe you'll need to process your file before importing it. You can simply replace every ,( with \n( and then remove all the parentheses to make it a standard CSV format that COPY will happily consume.
Perhaps there's another method for PostgreSQL that would work too, but I haven't come across anything yet.

Hadoop Hive: create external table with dynamic location

I am trying to create a Hive external table that points to an S3 output file.
The file name should reflect the current date (it is always a new file).
I tried this:
CREATE EXTERNAL TABLE s3_export (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION concat('s3://BlobStore/Exports/Daily_', from_unixtime(unix_timestamp(),'yyyy-MM-dd'));
but I get an error:
FAILED: Parse Error: line 3:9 mismatched input 'concat' expecting StringLiteral near 'LOCATION' in table location specification
Is there any way to dynamically specify the table location?
OK, I found the Hive variables feature.
So I pass the location on the CLI as follows:
hive -d s3file=s3://BlobStore/Exports/APKsCollection_test/`date +%F`/
and then use the variable in the Hive command:
CREATE EXTERNAL TABLE s3_export (...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '${s3file}';
This doesn't work on my side; how did you make this happen?
hive -d s3file=s3://BlobStore/Exports/APKsCollection_test/`date +%F`/