How to insert raw data into a Hive table having a different column sequence?

Given: Hive version 2.3.0 onwards. I have a Hive table with a fixed DDL that has been in place for a long time. Raw data now arrives as text files with the columns in a different order, and it has to be stored in Parquet format with fixed partition criteria. My question is how to handle this situation when the incoming data has a different arrangement of columns.
Example:
CREATE TABLE users ( col1 string, col2 int, col3 string ... )
PARTITIONED BY (...)
STORED AS PARQUET;
and the incoming data arrangement in the text files is
col1  col3  col2
x     p     1
y     q     2
Notice that the column order changes.
I had a hard time finding correct information; can anyone explain best practices for handling such a situation? If it were a small file we could use scripts to correct the text, but when the data is in bulk and the text files have a different arrangement each time, what should we do? I appreciate any answer/feedback.

With changing column order and/or addition/deletion of columns, one option is to convert the text files to Parquet format before loading the files into the Hive table. Set the property hive.parquet.use-column-names = true (it is false by default) so that the Parquet files are read by column name rather than by column index; this way you eliminate the dependency on column order in the source file. Partitions can have different schemas, and you can create a table with the desired overall columns.
Note that an external table is easier to maintain than a managed table, since you do not have to move data around when the schema changes. When the schema changes, you can drop and re-create the table and execute an MSCK REPAIR TABLE .. to read the data.
To detect schema changes, you can have a process running that checks the first row of the text files (assuming it contains the column names) for any changes. The output of this process can be written to persistent storage such as a MongoDB/DynamoDB data store with appropriate schema versioning. This helps retain a history of all the schema changes.
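As a rough sketch of the approach above (the staging table name, landing path and the partition column ds are hypothetical, and the property name is the one given above):

-- read Parquet by column name rather than by index (property named above)
SET hive.parquet.use-column-names=true;

-- staging table declared in the order the text files actually arrive (col1, col3, col2)
CREATE EXTERNAL TABLE users_staging_txt (
  col1 string,
  col3 string,
  col2 int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/landing/users/2020-01-01';

-- reorder the columns in the SELECT to match the target table's declared order,
-- so the order in the source files no longer matters
INSERT INTO TABLE users PARTITION (ds='2020-01-01')
SELECT col1, col2, col3
FROM users_staging_txt;

-- if partition directories are ever added straight to HDFS (external table),
-- register them in the metastore
MSCK REPAIR TABLE users;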

Related

Transfer files from one table to another in impala

I have two tables in impala and I want to move the data from one to another.
Both tables have HDFS paths like
/user/hive/db/table1 or table2/partitiona/partitionb/partitionc/file
I know the procedure with INSERT INTO to move the data from one table to another.
What I do not know is whether I also have to move the files at the HDFS paths, or whether this happens automatically with the INSERT INTO statement.
Also, if a table is defined as sorted in its creation settings, will any data inserted into it be sorted too?
It happens automatically and is done by Hive. When you do INSERT INTO table1 SELECT * FROM table2, Hive reads the files under table2's HDFS location and writes new files under /user/hive/db/table1/....
You do not have to move anything. You may need to analyze table1 afterwards for better performance.
As for your second question: if you used SORT BY when creating table1, then the data will automatically be sorted in table1, regardless of whether the data in table2 is sorted or not.
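A minimal sketch in Impala SQL, assuming both tables share the same schema and that partitiona/partitionb/partitionc from the path above are the partition columns (an assumption, not stated in the question):

-- dynamic partition insert; with SELECT * the partition columns come last,
-- matching the PARTITION clause
INSERT INTO table1 PARTITION (partitiona, partitionb, partitionc)
SELECT * FROM table2;

-- refresh statistics on the target so the planner has up-to-date information
COMPUTE STATS table1;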

Partitioning BigQuery table, loaded from AVRO

I have a BigQuery table whose data is loaded from Avro files on GCS. This is NOT an external table.
One of the fields in every Avro object is created (a date stored as a long type) and I'd like to use this field to partition the table.
What is the best way to do this?
Thanks
Two issues prevent using created directly as a partition column:
The Avro file defines the schema at load time. The only partitioning option available at that step is Partition By Ingestion Time; however, you most probably want to use another field (created) for this purpose.
The field created is a long. The value seems to contain a datetime. If it were an integer you could use integer-range partitioned tables, but in this case you need to convert the long value into a DATE/TIMESTAMP in order to use date/timestamp partitioned tables.
So, in my opinion, you can try the following:
Import the data as it is into a first table.
Create a second, empty table partitioned by created with type TIMESTAMP.
Execute a query that reads from the first table and applies a timestamp function such as TIMESTAMP_SECONDS (or TIMESTAMP_MILLIS) to created to transform the value into a TIMESTAMP, so that each inserted row lands in the right partition.
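A hedged BigQuery Standard SQL sketch of those steps; the dataset, table and column names are hypothetical, and it assumes created holds epoch milliseconds (use TIMESTAMP_SECONDS if it holds seconds):

-- second table, partitioned on the converted created value
CREATE TABLE mydataset.events_partitioned (
  created TIMESTAMP,
  payload STRING
)
PARTITION BY DATE(created);

-- read from the first (as-imported) table and convert the long to a TIMESTAMP
INSERT INTO mydataset.events_partitioned (created, payload)
SELECT
  TIMESTAMP_MILLIS(created) AS created,
  payload
FROM mydataset.events_raw;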

Using HBase in place of Hive

Today we use Hive as our data warehouse, mainly for batch/bulk data processing: Hive analytics queries, joins, etc., in an ETL pipeline.
Recently we have been facing a problem while trying to expose our Hive-based ETL pipeline as a service. The problem relates to the fixed table schema nature of Hive. We have a situation where the table schema is not fixed: new columns could be added (at any position in the schema, not necessarily at the end), deleted, or renamed.
In Hive, once the partitions are created, I guess they cannot be changed, i.e. we cannot add a new column to an older partition and populate just that column with data. We have to re-create the partition with the new schema and populate data in all columns. New partitions, however, can have the new schema and would contain data for the new column (not sure whether a new column can be inserted at any position in the schema?). Trying to read the value of the new column from an older (unmodified) partition would return NULL.
I want to know whether I can use HBase in this scenario and whether it will solve my problems above:
1. Insert new columns at any position in the schema, delete columns, rename columns.
2. Backfill data in the new column, i.e. for older data (in older partitions), populate data only in the new column without re-creating the partition or re-populating data in the other columns.
I understand that HBase is schema-less (schema-free), i.e. each record/row can have a different number of columns. I am not sure whether HBase has a concept of partitions?
You are right that HBase is a semi schema-less database (column families are still fixed).
You will be able to create new columns.
You will be able to populate data only in the new column without re-creating partitions or re-populating data in the other columns.
but
Unfortunately, HBase does not support partitions (speaking in Hive terms); you can see this discussion. That means that if the partition date is not part of the row key, each query will do a full table scan.
Renaming a column is not a trivial operation at all.
Frequently updating existing records between major compaction intervals will increase query response time.
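One hedged way to picture this from the Hive side is a Hive table mapped onto the HBase table via the HBase storage handler; the table name, column family cf and row-key layout below are hypothetical:

CREATE EXTERNAL TABLE users_hbase (
  rowkey  string,   -- e.g. '<date>#<userid>', so the date is part of the row key
  col1    string,
  new_col string    -- a qualifier added later; older rows simply return NULL here
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf:col1,cf:new_col")
TBLPROPERTIES ("hbase.table.name" = "users");

Adding new_col only means re-declaring this metadata-only mapping; the HBase rows themselves are untouched and you can backfill just that qualifier, which corresponds to points 1 and 2 above.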
I hope it is helpful.

SQL Server Data Type Change

In SQL Server, we have a table with 80 million records in which 3 columns currently have the float data type. We now need to change those float columns to decimal. How do we proceed with minimum downtime?
We executed the usual ALTER statement to change the data type, but the log file filled up and we ran into an out-of-memory exception. So kindly let me know a better way to solve this issue.
We can't use this technique: creating 3 new temp columns, updating the existing data batch-wise, dropping the existing columns, and renaming the temp columns to the live columns.
I have done exactly the same in one of my projects. Here are the steps you can follow for minimal logging and minimum downtime with the least complexity.
Create a new table with the new data types and the same column names, but a different table name (without indexes; if the table requires any indexes, create them once the data has been loaded into the new table). For example, if the existing table is EMPLOYEE, the new table name should be EMPLOYEE_1. All constraints, such as foreign keys, can be created either before or after loading; it does not impact performance. However, I recommend not creating them beforehand, because the existing table already uses those constraint names, so you would have to rename the constraints after renaming the table.
Keep in mind that the precision of the new data type must cover the maximum precision present in your existing table.
Load the data from your existing table into the new table using SSIS with the fast-load option, so that the load is minimally logged.
During the downtime, rename the old table to EMPLOYEE_2 and rename EMPLOYEE_1 to EMPLOYEE.
Alter the table definition for foreign keys, defaults, or any other constraints.
Go live, and create the indexes on the table during a period of lower load.
Using this approach we changed the data types on a table with more than a billion records. With it you get minimum downtime and minimal logging, since the SSIS fast-load option performs a minimally logged bulk load.
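A hedged T-SQL sketch of this copy-and-swap approach; the column names and the DECIMAL precision/scale are hypothetical, and the SSIS fast-load step is represented here by a plain INSERT ... SELECT purely for illustration:

-- new table with the target data types and the same column names
CREATE TABLE dbo.EMPLOYEE_1 (
    EmployeeId INT            NOT NULL,
    Salary     DECIMAL(18, 6) NOT NULL,   -- was FLOAT in dbo.EMPLOYEE
    Bonus      DECIMAL(18, 6) NULL,
    Rate       DECIMAL(18, 6) NULL
);

-- in the real migration this step is an SSIS data flow with the fast-load option
INSERT INTO dbo.EMPLOYEE_1 (EmployeeId, Salary, Bonus, Rate)
SELECT EmployeeId,
       CAST(Salary AS DECIMAL(18, 6)),
       CAST(Bonus  AS DECIMAL(18, 6)),
       CAST(Rate   AS DECIMAL(18, 6))
FROM dbo.EMPLOYEE;

-- downtime window: swap the tables
EXEC sp_rename 'dbo.EMPLOYEE',   'EMPLOYEE_2';
EXEC sp_rename 'dbo.EMPLOYEE_1', 'EMPLOYEE';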
This is too long for a comment.
One of these approaches might work.
The simplest approach is to give up. Well almost, but do the following:
Rename the existing column to something else (say, _col)
Add a computed column that does the conversion.
If the data is infrequently modified (so you only need read access), then create a new table. That is, copy the data into a new table with the correct type, then drop the original table and rename the new table to the old name.
You can do something similar by swapping table spaces rather than renaming the table.
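A small sketch of the rename-plus-computed-column idea (the table and column names are hypothetical):

-- keep the original float data under a new name
EXEC sp_rename 'dbo.MyTable.Amount', '_Amount', 'COLUMN';

-- expose the old name as a computed column that converts on read
ALTER TABLE dbo.MyTable
    ADD Amount AS CAST(_Amount AS DECIMAL(18, 6));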
Another way could be:
Add a new column to the table with the correct datatype.
Copy the values from the float column into the decimal column, but do it in batches of x records.
When everything is done, remove the float column and rename the decimal column to what the float column name was.
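A hedged sketch of this batched approach; the table/column names, precision and batch size are hypothetical:

-- add the new decimal column alongside the float column
ALTER TABLE dbo.MyTable ADD Amount_dec DECIMAL(18, 6) NULL;

-- backfill in batches to keep each transaction (and the log growth) small
WHILE 1 = 1
BEGIN
    UPDATE TOP (50000) dbo.MyTable
    SET    Amount_dec = CAST(Amount AS DECIMAL(18, 6))
    WHERE  Amount_dec IS NULL
      AND  Amount IS NOT NULL;

    IF @@ROWCOUNT = 0 BREAK;
END;

-- once the backfill is complete, swap the columns
ALTER TABLE dbo.MyTable DROP COLUMN Amount;
EXEC sp_rename 'dbo.MyTable.Amount_dec', 'Amount', 'COLUMN';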

When creating an external table in hive can I point the location to specific files in a directory?

I have defined a table as such:
create external table PageViews (Userid string, Page_View string)
partitioned by (ds string)
row format delimited fields terminated by ','
stored as textfile location '/user/data';
I do not want all the files in the /user/data directory to be used as part of the table. Is it possible for me to do the following?
location 'user/data/*.csv'
What kmosley said is true. As of now, you can't selectively choose certain files to be a part of your Hive table. However, there are 2 ways to get around it.
Option 1:
You can move all the csv files into another HDFS directory and create a Hive table on top of that. If it works better for you, you can create a subdirectory (say, csv) within your present directory that houses all CSV files. You can then create a Hive table on top of this subdirectory. Keep in mind that any Hive tables created on top of the parent directory will NOT contain the data from the subdirectory.
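A sketch of Option 1, reusing the DDL from the question and assuming the CSV files have been moved into a csv subdirectory:
create external table PageViews (Userid string, Page_View string)
partitioned by (ds string)
row format delimited fields terminated by ','
stored as textfile location '/user/data/csv';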
Option 2:
You can change your queries to make use of a virtual column called INPUT__FILE__NAME.
Your query would look something like:
SELECT
*
FROM
my_table
WHERE
INPUT__FILE__NAME LIKE '%csv';
The ill effect of this approach is that the Hive query has to churn through all the data present in the directory, even though you only care about specific files. The query does not prune files based on the INPUT__FILE__NAME predicate; it merely filters out, during the map phase, the records that do not match the predicate on INPUT__FILE__NAME (consequently dropping all records from particular files), but the mappers still run over the unnecessary files as well. It will give you the correct result, with some, probably minor, performance overhead.
The benefit of this approach is that you can use the same Hive table when you have multiple files in your table and you want the ability to query all files of that table (or its partition) in some queries and only a subset of the files in other queries. You can make use of the INPUT__FILE__NAME virtual column to achieve that. As an example:
If a partition in your HDFS directory /user/hive/warehouse/web_logs/ looked like:
/user/hive/warehouse/web_logs/dt=2012-06-30/
/user/hive/warehouse/web_logs/dt=2012-06-30/00.log
/user/hive/warehouse/web_logs/dt=2012-06-30/01.log
.
.
.
/user/hive/warehouse/web_logs/dt=2012-06-30/23.log
Let's say your table definition looked like:
CREATE EXTERNAL TABLE IF NOT EXISTS web_logs_table (col1 STRING)
PARTITIONED BY (dt STRING)
LOCATION '/user/hive/warehouse/web_logs';
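For completeness, adding one of the partitions above might look like this (a sketch; the path matches the listing above):
ALTER TABLE web_logs_table ADD PARTITION (dt='2012-06-30')
LOCATION '/user/hive/warehouse/web_logs/dt=2012-06-30';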
After adding the appropriate partitions, you could query all logs in the partition using a query like:
SELECT
*
FROM
web_logs_table w
WHERE
dt='2012-06-30';
However, if you only cared about the logs from the first hour of the day, you could query the logs for the first hour using a query like:
SELECT
*
FROM
web_logs_table w
WHERE
dt ='2012-06-30'
AND INPUT__FILE__NAME LIKE '%00.log';
Another similar use case could be a directory that contains web logs from different domains and various queries need to analyze logs on different sets of domains. The queries can filter out domains using the INPUT__FILE__NAME virtual column.
In both the above use-cases, having a sub partition for hour or domain would solve the problem as well, without having to use the virtual column. However, there might exist some design trade-offs that require you to not create sub-partitions. In that case, arguably, using INPUT__FILE__NAME virtual column is your best bet.
Deciding between the 2 options:
It really depends on your use case. If you will never care about the files you are trying to exclude from the Hive table, using Option 2 is probably overkill, and you should fix up the directory structure and create a Hive table on top of the directory containing only the files that you care about.
If the files you are presently excluding follow the same format as the other files (so they can all be part of the same Hive table) and you could see yourself writing a query that would analyze all the data in the directory, then go with Option 2.
I came across this thread when I had a similar problem to solve. I was able to resolve it by using a custom SerDe. I then added SerDe properties which guided what RegEx to apply to the file name patterns for any particular table.
A custom SerDe might seem like overkill if you are only dealing with standard CSV files; I had a more complex file format to deal with. Still, this is a very viable solution if you don't shy away from writing some Java. It is particularly useful when you are unable to restructure the data in your storage location and you are looking for a very specific file pattern among a disproportionately large file set.
CREATE EXTERNAL TABLE PageViews (Userid string, Page_View string)
ROW FORMAT SERDE 'com.something.MySimpleSerDe'
WITH SERDEPROPERTIES ( "input.regex" = "*.csv")
LOCATION '/user/data';
No you cannot currently do that. There is a JIRA ticket open to allow regex selection of included files for Hive tables (https://issues.apache.org/jira/browse/HIVE-951).
For now your best bet is to create a table over a different directory and just copy in the files you want to query.