I was looking at code which listed listed a Hive file path for training data hive://<namespace>/<table>/split=train
The last part of this file path, "split=train", seems to be a query parameter for the hive table. The table contains a column called "split" in it some values are "train" (for ML purposes). At first I thought it might be a partition, but after inspecting the table I see it is not partitioned on the column "split".
Is this a feature of hive file paths? Can you add query parameters to them? Or is there something else you should do to the table to make them queryable?
Related
I have a requirement of validating CSV file before loading into staged-folder, and later have to load into sql table.
I need to validate metadata (the structure of the file must be same as target sql table)
No. of columns should be equal to the target sql table
order of columns should be same as target sql table
Data types of columns (no text values should exist in numeric field of csv file)
looking for some easy and efficient way achieve this.
Thanks for help
A Python program and module that does most of what you're looking for is chkcsv.py: https://pypi.org/project/chkcsv/. It can be used to verify that a CSV file contains a specified set of columns and that the data type of each column conforms to the specification. It does not, however, verify that the order of columns in the CSV file is the same as the order in the database table. Instead of loading the CSV file directly into the target table, you can load it into a staging table and then move it from there into the target table--this two-step process eliminates column order dependence.
Disclaimer: I wrote chkcsv.py
Edit 2020-01-26: I just added an option that allows you to specify that the column order should be checked also.
So for my previous homework, we were asked to import a csv file with no columns names to impala, where we explicitly give the name and type of each column while creating the table. However, now we have a csv file but with column names given, in this case, do we still need to write down the name and type of it even it is provided in the data?
Yes, you still have to create an external table and define the column names and types. But you have to pass the following option right at the end of the create table statement
tblproperties ("skip.header.line.count"="1");
-- Once the table property is set, queries skip the specified number of lines
-- at the beginning of each text data file. Therefore, all the files in the table
-- should follow the same convention for header lines.
I have created a hive table with dynamic partitioning on a column. Is there a way to directly load the data from files using "LOAD DATA" statement? Or do we have to only depend on creating a non-partitioned intermediate table and load file data to it and then inserting data from this intermediate table to partitioned table as mentioned in Hive loading in partitioned table?
No, the LOAD DATA command ONLY copies the files to the destination directory. It doesn't read the records of the input file, so it CANNOT do partitioning based on record values.
If your input data is already split into multiple files based on partitions, you could directly copy the files to table location in HDFS under their partition directory manually created by you (OR just point to their current location in case of EXTERNAL table) and use the following ALTER command to ADD the partition. This way you could skip the LOAD DATA statement altogether.
ALTER TABLE <table-name>
ADD PARTITION (<...>)
No other go, if we need to insert directly, we'll need to specify partitions manually.
For dynamic partitioning, we need staging table and then insert from there.
The below table returns no data while running a select statement
CREATE EXTERNAL TABLE foo (
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LINES TERMINATED BY '\n'
LOCATION '/user/data/CSV/2016/1/27/*/part-*';
I need my hive to point to a dynamic folder so as a mapreduce job puts a part file in a folder and hive loads into the table.
Is there any way the location be made dynamic like
/user/data/CSV/*/*/*/*/part-*
or just /user/data/CSV/* would do fine ?
(The same code works fine when created as internal table and loaded with the file path - hence there is no issues due to formatting)
First of, your table definition is missing columns. Second, external table location always points to folder, not particular files. Hive will consider all files in the folder to be data for the table.
If you have data that is generated e.g. on a daily basis by some external process you should consider partitioning your table by date. Then you need to add a new partition to the table when the data is available.
Hive does not iterate through multiple folders -
Hence for the above scenario
I ran a command line argument that iterates through these multiple folders and cat (print to the console) all the part files and then put it to a desired location.(that Hive points to)
hadoop fs -cat /user/data/CSV/*/*/*/*/part-* | hadoop fs -put - <destination folder>
This line
LOCATION '/user/data/CSV/2016/1/27/*/part-*';
Does not look correct, I don't think that the table can created from multiple locations. Have you tried just importing by a single location to confirm this?
Could also be the delimiter you're using is not correct. If you are using a CSV file to import your data try delimitating by ','.
You can use an alter table statement to change the locations. In the example below partitions are based on dates where data is stored in time dependent file locations. If I want to search many days I have to add an alter table statement for each location. This idea may extend to your situation quite well. You create a script to generate the create table statement as below using some other technology such as python.
CREATE EXTERNAL TABLE foo (
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\073'
LINES TERMINATED BY '\n'
;
alter table foo add partition (date='20160201') location /user/data/CSV/20160201/data;
alter table foo add partition (date='20160202') location /user/data/CSV/20160202/data;
alter table foo add partition (date='20160203') location /user/data/CSV/20160203/data;
alter table foo add partition (date='20160204') location /user/data/CSV/20160204/data;
You can use as many add and drop statements you need to define your locations. Then your table can find data held in many locations in HDFS rather than having all your files in one location.
You may also be able to leverage a
create table like
statement. To create a schema like you have in another table. Then alter the table to point at the files you want.
I know this isn't exactly what you want and is more of a work around. Good luck!
When we create using
Create external table employee (name string,salary float) row format delimited fields terminated by ',' location /emp
In /emp directory there are 2 emp files.
so when we run select * from employee, it get the data from both the file ad display.
What will be happen when there will be others file also having different kind of record which column is not matching with the employee table , so it will try to load all the files when we run "select * from employee"?
1.Can we specify the specific file name which we want to load?
2.Can we create other table also with the same location?
Thanks
Prashant
It will load all the files in emp directory even it doesn’t match with table.
for your first question. you can use Regex serde.if your data matches to regex.then it loads to the table.
regex for access log in hive serde
https://github.com/apache/hive/blob/trunk/contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java
other options:I am pointing some links.these links has some ways.
when creating an external table in hive can I point the location to specific files in a direcotry?
https://issues.apache.org/jira/browse/HIVE-951
for your second question: yes we can create other tables also with the same location.
Here are your answers
1. If the data in the file dosent match with table format, hive doesnt throw an error. It tries to read the data as best as it could. If data for some columns are missing it will put NULL for them.
No we cannot specify the file name for any table to read data. Hive will consider all the files under the table directory.
Yes, we can create other tables with the same location.