How to create a table using 3 .csv files from a location in Impala - impala

I have three files in the location '/user/hive/warehouse/dig.db/'; let the files be:
text1.csv
text2.csv
text3.csv
How do I create a table in Impala using these 3 files (which all have the same headers/schema)?
I have tried the statement below, but it only reads one file correctly, not all three csv files. The data from the other 2 files ends up stored under a single field.
create external table dig.Tunnel (
tbm string,
year smallint,
month tinyint,
day tinyint,
hour tinyint,
dist decimal(8,2),
lon decimal(9,6),
lat decimal(9,6))
row format delimited fields terminated by ","
location '/user/hive/warehouse/dig.db/'

Yes - to your last comment: if your files don't all have the same format, or that format doesn't match the format you have defined in the Impala table definition, then you won't see the data.
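If the three CSVs really do share the same comma-delimited layout, one common fix is to give them their own directory so that nothing else under dig.db/ gets scanned as table data. A minimal sketch, assuming a hypothetical subdirectory /user/hive/warehouse/dig.db/tunnel/ that holds only text1.csv, text2.csv and text3.csv:
-- Hypothetical: the three CSVs have been moved into their own directory first
create external table dig.tunnel (
tbm string,
year smallint,
month tinyint,
day tinyint,
hour tinyint,
dist decimal(8,2),
lon decimal(9,6),
lat decimal(9,6))
row format delimited fields terminated by ","
location '/user/hive/warehouse/dig.db/tunnel/';
-- Pick up files added to the directory after table creation
refresh dig.tunnel;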

Related

Create partitions using athena alter table statement

This "create table" statement is working correctly.
CREATE EXTERNAL TABLE default.no_details_2018_csv (
`id` string,
`client_id` string,
`client_id2` string,
`id_1` string,
`id_2` string,
`client_id3` string,
`code_1` string,
`code_2` string,
`code_3` string
)
STORED AS PARQUET
LOCATION 's3://some_bucket/athena-parquet/no_details/2018/'
tblproperties ("parquet.compress"="SNAPPY");
The data for the year 2018, available in parquet format, can be found in that bucket / folder.
1) How do I add partitions to this table? I need to add the year 2019 data to the same table by referring to the new location of s3://some_bucket/athena-parquet/no_details/2019/. The data for both years is available in parquet (snappy) format.
2) Is it possible to partition by month instead of by year? In other words, is it OK to have 24 partitions instead of 2? Will the new target table also have parquet format, just like the source data? The code_2 column mentioned above looks like this: "20181013133839". I need to use the first 4 characters for yearly (or the first 6 for monthly) partitions.
First, the table needs to be created as an EXTERNAL TABLE with a PARTITIONED BY clause. Check this
Sample -
CREATE EXTERNAL TABLE default.no_details_table (
`id` string,
`client_id` string,
`client_id2` string,
`id_1` string,
`id_2` string,
`client_id3` string,
`code_1` string,
`code_2` string,
`code_3` string
)
PARTITIONED BY (year string)
STORED AS PARQUET
LOCATION 's3://some_bucket/athena-parquet/no_details/'
tblproperties ("parquet.compress"="SNAPPY");
You can add a partition as
ALTER TABLE default.no_details_table ADD PARTITION (year='2018') LOCATION 's3://some_bucket/athena-parquet/no_details/2018/';
If you want more partitions, one per month or per day, create the table with
PARTITIONED BY (day string)
but then you need to put each day's data under a matching path, e.g.
s3://some_bucket/athena-parquet/no_details/20181013/
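For the monthly case from question 2 (24 partitions instead of 2), a minimal sketch along the same lines; the table name, the per-month prefixes like .../201810/, and the shortened column list are assumptions for illustration, not part of the original setup:
CREATE EXTERNAL TABLE default.no_details_monthly (
`id` string,
`client_id` string,
`code_2` string
)
PARTITIONED BY (month string)
STORED AS PARQUET
LOCATION 's3://some_bucket/athena-parquet/no_details_monthly/'
tblproperties ("parquet.compress"="SNAPPY");
ALTER TABLE default.no_details_monthly ADD PARTITION (month='201810') LOCATION 's3://some_bucket/athena-parquet/no_details_monthly/201810/';
The month value here corresponds to the first 6 characters of code_2, so the data would have to be written out under those per-month prefixes first (for example with a CTAS or an external job).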

How to read parquet data with partitions from Aws S3 using presto?

I have data stored in S3 in the form of parquet files with partitions. I am trying to read this data using Presto. I am able to read the data if I give the complete location of a partition's parquet files. Below is the query to read data from "section a":
presto> create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255)) WITH (external_location = 's3://bucket/presto/section=a', format = 'PARQUET');
But my data is partitioned with different sections i.e. s3://bucket/presto folder contains multiple folders like "section=a", "section=b", etc.
I am trying to read the data with partitions as follows:
presto> create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255)) WITH (partitioned_by = ARRAY['section'], external_location = 's3://bucket/presto', format = 'PARQUET');
The table gets created, but when I try to select the data, the table is empty.
I am new to Presto, please help.
Thanks
You created the table correctly:
create table IF NOT EXISTS default.sample(name varchar(255), age varchar(255), section varchar(255))
WITH (partitioned_by = ARRAY['section'], external_location = 's3://bucket/presto', format = 'PARQUET');
However, in "Hive table format" the partitions are not auto-discovered. Instead, they need to be declared explicitly. There are some reasons for this:
explicit declaration of partitions allows you to publish a partition "atomically", once you're done writing
section=a, section=b is only the convention, the partition location may be different. In fact the partition can be located in some other S3 bucket, or different storage
To auto-discover partitions in the case like yours, you can use the system.sync_partition_metadata procedure that comes with Presto.
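A minimal sketch of the call, assuming the Hive connector catalog is named hive (the catalog name is an assumption; adjust it to your setup). The third argument selects what to sync ('ADD', 'DROP' or 'FULL'):
-- Register partition directories found under the table's location that are not yet declared
CALL hive.system.sync_partition_metadata('default', 'sample', 'ADD');
After it runs, the section=a, section=b, ... directories under s3://bucket/presto should be registered as partitions and the SELECT should return rows.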

Filter Dynamic Partitioning in Apache Hive

Trying to create a Hive table but due to the folder structure it's going to take hours just to partition.
Below is an example of what I am currently using to create the table, but it would be really helpful if I could filter the partitioning.
In the below I need every child_company, just one year, every month, and just one type of report.
Is there any way to do something like set hcat.dynamic.partitioning.custom.pattern = '${child_company}/year=${2016}/${month}/report=${inventory}'; when partitioning, so as to avoid the need to read through all the folders (> 300k)?
Language: Hive
Version: 1.2
Interface: Quobole
use my_database;
set hcat.dynamic.partitioning.custom.pattern = '${child_company}/${year}/${month}/${report}';
drop table if exists table_1;
create external table table_1
(
Date_Date string,
Product string,
Quantity int,
Cost int
)
partitioned by
(
child_company string,
year int,
month int,
report string
)
row format delimited fields terminated by '\t'
lines terminated by '\n'
location 's3://mycompany-myreports/parent/partner_company-12345';
alter table table_1 recover partitions;
show partitions table_1;
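One way to sidestep recover partitions walking all 300k+ folders is to register only the partitions you actually need with explicit ALTER TABLE ... ADD PARTITION statements that point at their locations. A sketch under the layout implied above; the child company name, year/month values and report name are hypothetical:
alter table table_1 add if not exists partition (child_company='company_a', year=2016, month=1, report='inventory')
location 's3://mycompany-myreports/parent/partner_company-12345/company_a/2016/1/inventory/';
Whether this is practical depends on how many (child_company, month) combinations you need for the one year and one report type; a small script generating these statements can still be far cheaper than scanning every folder.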

Read multiple files in Hive table by date range

Let's imagine I store one file per day in a format:
/path/to/files/2016/07/31.csv
/path/to/files/2016/08/01.csv
/path/to/files/2016/08/02.csv
How can I read the files in a single Hive table for a given date range (for example from 2016-06-04 to 2016-08-03)?
Assuming every file follows the same schema, I would then suggest that you store the files with the following naming convention:
/path/to/files/dt=2016-07-31/data.csv
/path/to/files/dt=2016-08-01/data.csv
/path/to/files/dt=2016-08-02/data.csv
You could then create an external table partitioned by dt and pointing to the location /path/to/files/
CREATE EXTERNAL TABLE yourtable(id int, value int)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/path/to/files/'
If you have several partitions and don't want to write alter table yourtable add partition ... queries for each one, you can simply use the repair command that will automatically add partitions.
msck repair table yourtable
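For reference, registering a single partition explicitly (instead of using the repair command) would look something like this; the date value is just an example:
ALTER TABLE yourtable ADD PARTITION (dt='2016-07-31') LOCATION '/path/to/files/dt=2016-07-31/';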
You can then simply select data within a date range by specifying the partition range
SELECT * FROM yourtable WHERE dt BETWEEN '2016-06-04' and '2016-08-03'
Without moving your file:
Design your table schema. In hive shell, create the table (partitioned by date)
Loading files into tables
Query with HiveQL (select * from table where dt between '2016-06-04' and '2016-08-03')
Moving your file:
Design your table schema. In hive shell, create the table (partitioned by date)
move /path/to/files/2016/07/31.csv under /dbname.db/tableName/dt=2016-07-31, then you'll have
/dbname.db/tableName/dt=2016-07-31/file1.csv
/dbname.db/tableName/dt=2016-08-01/file1.csv
/dbname.db/tableName/dt=2016-08-02/file1.csv
Load the partition with
alter table tableName add partition (dt='2016-07-31');
See Add partitions
In Spark-shell, read hive table
/path/to/data/user_info/dt=2016-07-31/0000-0
1. Create the SQL
val sql = "CREATE EXTERNAL TABLE `user_info`( `userid` string, `name` string) PARTITIONED BY ( `dt` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 'hdfs://.../data/user_info'"
2. Run it
spark.sql(sql)
3. Load the data
val rlt = spark.sql("alter table user_info add partition (dt='2016-09-21')")
4. Now you can select data from the table
val df = spark.sql("select * from user_info")
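If many dt=... directories already exist under the table location, one way to register them all at once instead of adding partitions one by one is to run MSCK REPAIR from Spark SQL; a sketch, assuming the directories follow the dt=YYYY-MM-DD convention shown above:
// Discover and register every dt=... partition directory under the table's location
spark.sql("msck repair table user_info")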

Is there a way to partition an existing text file with Impala without pre-splitting the files into the partitioned directories?

Say I have a single file "fruitsbought.csv" that contains many records, each with a date field.
Is it possible to "partition" for better performance by creating the "fruits" table based on that text file, so that each partition (say by year and month) ends up holding all the rows of fruitsbought.csv that match it?
Or do I have to, as part of a separate process, create a directory for each year and then put the appropriate ".csv" files, filtered down to that year, into the directory structure on HDFS before creating the table in impala-shell?
I heard that you can create an empty table, set up partitions, and then use "insert" statements that specify the partition each record goes into. In my current case, though, I already have a single "fruitsbought.csv" that contains every record I want, and I like being able to turn it into a table right away (though it has no partitioning).
Do I have to develop a separate process to pre-split the one file into multiple files sorted under the right partitions? (The one file is very, very big.)
Create an external table over fruitsbought.csv, for example (id is just an example; "....." stands for the rest of the columns in the table):
CREATE EXTERNAL TABLE fruitsboughexternal
(
id INT,
.....
mydate STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'somelocationwithfruitsboughtfile/';
Create the table partitioned by date:
CREATE TABLE fruitsbought(id INT, .....)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Import the data into the fruitsbought table; the partition columns have to come last in the select (and of course mydate has to be in a format Impala understands, like 2014-06-20 06:05:25):
INSERT INTO fruitsbought PARTITION(year, month, day) SELECT id, ..., year(mydate), month(mydate), day(mydate) FROM fruitsboughexternal;
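To check how the rows were split across partitions afterwards (and, optionally, to refresh planner statistics), something like this should work in impala-shell; COMPUTE STATS is not required but is commonly run after a large load:
SHOW PARTITIONS fruitsbought;
COMPUTE STATS fruitsbought;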