Can we load a text file separated by :: into a Hive table?

Is there a way to load a simple text file whose fields are separated by "::" into a Hive table, other than replacing the "::" with "," and then loading it?
Replacing the "::" with "," is quick when the text file is small, but what if it contains millions of records?

Try creating the Hive table with the RegexSerDe.
Example:
I had a file with the following contents:
i::90
w::99
Create Hive table:
hive> create external table default.i
(Id STRING,
Name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '(.*?)::(.*)')
STORED AS TEXTFILE;
Select from Hive table:
hive> select * from i;
+-------+---------+--+
| i.id | i.name |
+-------+---------+--+
| i | 90 |
| w | 99 |
+-------+---------+--+
If you want to skip the header line, use the following syntax:
hive> create external table default.i
(Id STRING,
Name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '(.*?)::(.*)')
STORED AS TEXTFILE
tblproperties ('skip.header.line.count'='1');
UPDATE:
Check whether there are any older files in your table location; if some are there, delete them (assuming you don't want them).
1. Create the Hive table:
create external table <db_name>.<table_name>
(col1 STRING,
col2 STRING,
col3 string,
col4 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES ('input.regex' = '(.*?)::(.*?)::(.*?)::(.*)')
STORED AS TEXTFILE;
2. Then run:
load data local inpath '<source_path>' overwrite into table <db_name>.<table_name>;
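As an alternative to the RegexSerDe, recent Hive releases also ship a MultiDelimitSerDe that handles multi-character delimiters such as "::" directly. A minimal sketch using the two-column example above (the table name i_multidelim is just illustrative; on Hive 0.14–3.x the class lives in the contrib package shown here, and it moved to org.apache.hadoop.hive.serde2 in Hive 4, so check your version):
create external table default.i_multidelim
(Id STRING,
Name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'
WITH SERDEPROPERTIES ('field.delim'='::')
STORED AS TEXTFILE;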

Related

impala CREATE EXTERNAL TABLE and remove double quotes

I got data in CSV format, for example:
"Female","44","0","0","Yes","Govt_job","Urban","103.59","32.7","formerly smoked"
I put it into HDFS with hdfs dfs -put,
and now I want to create an external table from it in Impala (not in Hive).
Is there an option to load it without the double quotes?
This is what I run in impala-shell:
CREATE EXTERNAL TABLE IF NOT EXISTS test_test.test1_ext
( `gender` STRING,`age` STRING,`hypertension` STRING,`heart_disease` STRING,`ever_married` STRING,`work_type` STRING,`Residence_type` STRING,`avg_glucose_level` STRING,`bmi` STRING,`smoking_status` STRING )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION "/user/test/tmp/test1"
Update 28.11:
I managed to do it by creating the external table and then creating a VIEW as a SELECT that applies CASE WHEN / concat() to each column.
Impala uses the Hive metastore, so anything created in Hive is available from Impala after issuing INVALIDATE METADATA dbname.tablename. However, to remove the quotes you need the Hive SerDe 'org.apache.hadoop.hive.serde2.OpenCSVSerde', and this SerDe is not accessible from Impala. My suggestion would be to do the following:
Create the external table in Hive
CREATE EXTERNAL TABLE IF NOT EXISTS test_test.test1_ext
( gender STRING, age STRING, hypertension STRING, heart_disease STRING, ever_married STRING, work_type STRING, Residence_type STRING, avg_glucose_level STRING, bmi STRING, smoking_status STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES
(
"separatorChar" = ",",
"quoteChar" = """
)
STORED AS TEXTFILE
LOCATION "/user/test/tmp/test1"
Create a managed table in Hive using CTAS
CREATE TABLE test_test.mytable AS SELECT * FROM test_test.test1_ext;
Make it available in Impala
INVALIDATE METADATA test_test.mytable;
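A quick sanity check from impala-shell afterwards (column names taken from the example schema):
SELECT gender, age, avg_glucose_level FROM test_test.mytable LIMIT 5;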

How to output a table as a parquet file in spark-sql, not spark-shell?

It is easy to read a table from a CSV file using spark-sql:
CREATE TABLE MyTable (
X STRING,
Y STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\,",
"quoteChar" = "\""
)
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH 'input.csv' INTO TABLE MyTable;
But how can I output this result as a Parquet file?
PS: I know how to do that in spark-shell, but it is not what I'm looking for.
You have to create a table in Hive with the schema of your results, stored as Parquet. After getting the results you can export them into the Parquet-format table like this:
set hive.insert.into.external.tables = true
create external table mytable_parq (<columns from your source table DDL>) stored as parquet location '/hadoop/mytable';
insert into mytable_parq select * from MyTable;
or
insert overwrite directory '/hadoop/mytable' STORED AS PARQUET select * from MyTable;
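Putting the pieces together with the MyTable schema from the question (the path '/hadoop/mytable' is just an illustrative location):
set hive.insert.into.external.tables = true;
create external table mytable_parq (X STRING, Y STRING)
stored as parquet
location '/hadoop/mytable';
insert into mytable_parq select * from MyTable;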

Insert data into hive table without delimiters

I want 10 words in one column and another 10 words in another column. How do I insert data into a Hive table with no specified delimiters, using UDFs?
CREATE TABLE employees_stg (emplid STRING, name STRING, age STRING, salary STRING, dept STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.{4})(.{35})(.{3})(.{11})(.{4})", --Length of each column specified between braces "({})"
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s" --Output in string format
)
LOCATION '/path/to/input/employees_stg';
LOAD DATA INPATH '/path/to/sample_file.txt' INTO TABLE employees_stg;
SELECT * FROM employees_stg;
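For reference, the regex above expects fixed-width rows of 4 + 35 + 3 + 11 + 4 characters. A hypothetical input line that would match (the name field is space-padded out to 35 characters and the salary right-aligned in 11):
1001Jane Doe                           029   75000.00ENGG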

How do you add Data to an Existing Hive Metastore?

I have multiple subdirectories in S3 that contain .orc files. I'm trying to create a Hive metastore so I can query the data with Presto / Hive, etc. The data is poorly structured (no consistent delimiter, ugly characters, etc.). Here's a scrubbed sample:
1488736466 199.199.199.199 0_b.www.sphericalcow.com.f9b1.qk-g6m6z24tdr.v4.url.name.com TXT IN: NXDOMAIN/0/143
1488736466 6.6.5.4 0.3399.186472.4306.6668.638.cb5a.names-things.update.url.name.com TXT IN: NOERROR/3/306 0\009253\009http://az.blargi.ng/%D3%AB%EF%BF%BD%EF%BF%BD/\009 0\009253\009http://casinoroyal.online/\009 0\009253\009http://d2njbfxlilvpsq.cloudfront.net/b_zq_ym_bangvideo/bangvideo0826.apk\009
I was able to create a table pointing to one of the subdirectories using a serde regex and the fields are parsing properly, but as far as I can tell I can only load one subfolder at a time.
How does one add more data to an existing hive metastore?
Here's an example of my hive metastore create statement with the regex serde bit:
DROP TABLE IF EXISTS test;
CREATE EXTERNAL TABLE test (field1 string, field2 string, field3 string, field4 string)
COMMENT 'fill all the tables with the datas.'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([0-9]{10}) ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) (\\S*) (.*)",
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
STORED AS ORC
LOCATION 's3://path/to/one/of/10/folders/'
tblproperties ("orc.compress" = "SNAPPY", "skip.header.line.count"="2");
select * from test limit 10;
I realize there is probably a very simple solution. I tried INSERT INTO in place of CREATE EXTERNAL TABLE, but it understandably complains about the input, and I looked in both the Hive and SerDe documentation but was unable to find a reference to adding data to an existing store.
A possible solution using partitions:
CREATE EXTERNAL TABLE test (field1 string, field2 string, field3 string, field4 string)
partitioned by (mypartcol string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([0-9]{10}) ([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) (\\S*) (.*)"
)
LOCATION 's3://whatever/as/long/as/it/is/empty'
tblproperties ("skip.header.line.count"="2");
alter table test add partition (mypartcol='folder 1') location 's3://path/to/1st/of/10/folders/';
alter table test add partition (mypartcol='folder 2') location 's3://path/to/2nd/of/10/folders/';
...
alter table test add partition (mypartcol='folder 10') location 's3://path/to/10th/of/10/folders/';
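New folders can be attached the same way at any time, and queries can then prune on the partition column, for example:
select * from test where mypartcol='folder 1' limit 10;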
For @TheProletariat (the OP):
It seems there is no need for the RegexSerDe, since the columns are delimited by a single space (' ').
Note the use of tblproperties ("serialization.last.column.takes.rest"="true")
create external table test
(
field1 bigint
,field2 string
,field3 string
,field4 string
)
row format delimited
fields terminated by ' '
tblproperties ("serialization.last.column.takes.rest"="true")
;
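With serialization.last.column.takes.rest, the last declared column absorbs everything past the third space of each line. For illustration, against the first sample row from the question:
select field4 from test limit 1;
-- expected value for that row: TXT IN: NXDOMAIN/0/143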

Writing columns having NULL as some string using OpenCSVSerde - HIVE

I'm using 'org.apache.hadoop.hive.serde2.OpenCSVSerde' to write Hive table data.
CREATE TABLE testtable ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ","
"quoteChar" = "'"
)
STORED AS TEXTFILE LOCATION '<location>' AS
select * from foo;
So, if the 'foo' table has empty strings in it, e.g. '1','2','', the empty strings are written as-is to the text file; the data in the text file reads '1','2',''.
But if 'foo' contains NULL values, e.g. '1','2',null, the NULL value is not written to the text file.
The data in the text file reads '1','2',
How do I make sure that the NULLs are properly written to the text file with the CSV SerDe, either as empty strings or as some other string, say "nullstring"?
I also tried this:
CREATE TABLE testtable ROW FORMAT SERDE
....
....
STORED AS TEXTFILE LOCATION '<location>'
TBLPROPERTIES ('serialization.null.format'='')
AS select * from foo;
This should arguably at least map empty strings to NULL, but it doesn't even do that.
Please guide me on how to write NULLs to CSV files.
Will I have to check for NULL values per column in the SELECT itself (NVL or similar) and replace them with something?
OpenCSVSerde ignores the 'serialization.null.format' property. You can handle NULL values using the steps below:
1. CREATE TABLE testtable
(
name string,
title string,
birth_year string
)ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ","
,"quoteChar" = "'"
)
STORED AS TEXTFILE;
2. Load data into testtable.
3. CREATE TABLE testtable1
(
name string,
title string,
birth_year string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES('serialization.null.format'='');
4. INSERT OVERWRITE TABLE testtable1 SELECT * FROM testtable;
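Alternatively, as the question itself hints, you can normalize the NULLs in the SELECT instead of relying on table properties. A sketch with the column names above ('nullstring' is just an example replacement value; NVL is a standard Hive function):
INSERT OVERWRITE TABLE testtable1
SELECT NVL(name, 'nullstring')
,NVL(title, 'nullstring')
,NVL(birth_year, 'nullstring')
FROM testtable;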