How to check if a LOAD DATA statement in Hive executed successfully or not? - hive

We have the LOAD DATA statement in Hive and Impala, which loads data from HDFS into a Hive or Impala table. My question is: what if there is an issue in the file (for example, the file has fewer columns than the table, or there is a data mismatch in one of the columns)? In such a scenario, will the file still get loaded, or will the load fail and show an error?
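In Hive, LOAD DATA INPATH essentially moves the file into the table's directory without validating it against the table schema, so column-count or type mismatches typically surface as NULLs at query time rather than as a load error. Below is a minimal sketch of one way to check both the statement's exit status and the loaded data from Python, assuming the hive CLI is on the PATH; the database, table, file, and column names are placeholders.

import subprocess

# Placeholder names; adjust to your environment.
LOAD_SQL = "LOAD DATA INPATH '/user/data/input.csv' INTO TABLE mydb.mytable;"
CHECK_SQL = "SELECT COUNT(*) FROM mydb.mytable WHERE some_column IS NULL;"

# Step 1: the statement itself only fails for file/permission problems,
# not for schema mismatches, so check the CLI exit code first.
load = subprocess.run(["hive", "-e", LOAD_SQL])
if load.returncode != 0:
    raise SystemExit("LOAD DATA statement failed")

# Step 2: mismatches surface as NULLs at query time, so run a sanity
# query and inspect the last line of stdout (the count).
check = subprocess.run(["hive", "-e", CHECK_SQL], capture_output=True, text=True)
null_count = int(check.stdout.strip().splitlines()[-1])
print("rows with NULL in some_column:", null_count)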

Related

When creating a BigQuery table I'm getting an error message about hive partitioning

I'm creating a table from a CSV text file on my machine, and I'm getting this message: "Hive partitioned loads require that the HivePartitioningOptions be set, with both a sourceUriPrefix and mode provided. No sourceUriPrefix was provided."
How can I fix this?
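For reference, this is roughly how the options named in the error message (HivePartitioningOptions, with mode and sourceUriPrefix / source_uri_prefix) are supplied when loading partitioned files from GCS with the google-cloud-bigquery Python client; the bucket, dataset, and table names below are made up, and a plain local CSV would normally be loaded without these options at all.

from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical GCS layout: gs://my-bucket/data/dt=2020-09-30/file.csv
hive_opts = bigquery.HivePartitioningOptions()
hive_opts.mode = "AUTO"
hive_opts.source_uri_prefix = "gs://my-bucket/data/"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    hive_partitioning=hive_opts,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/data/*",
    "my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish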

How to avoid reloading the same data into HDFS when there is a failure in PySpark

I have a PySpark program that performs business mapping and loads the data into two Hive external tables partitioned by month-end date.
Requested scenario:
If there is a failure after the first target table has been loaded but before the second table is loaded, then during reprocessing I should not touch the first table's load again and should continue with only the second table's load. Is there a wrapper/marker file I can touch in the HDFS location, or is there any other alternative available (see the sketch after the path listing below)?
HDFS location:
/home/gudirame/user/data_base_db/table_name1/_SUCCESS
/home/gudirame/user/data_base_db/table_name1/2020-09-30/part-001-dsfas.parquet
/home/gudirame/user/data_base_db/table_name1/2020-10-31/part-002-dsfas.parquet
/home/gudirame/user/data_base_db/table_name2/_SUCCESS
/home/gudirame/user/data_base_db/table_name2/2020-09-30/part-003-dsfas.parquet
/home/gudirame/user/data_base_db/table_name2/2020-10-31/part-004-dsfas.parquet
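One possible approach, sketched below on the assumption that each table load leaves a _SUCCESS marker like the ones listed above (the paths and function names are placeholders): check HDFS for the first table's marker before deciding whether to rerun that load.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("reprocess-guard").getOrCreate()

def hdfs_path_exists(path_str):
    # Uses the Hadoop FileSystem API through Spark's JVM gateway
    # (the _jvm/_jsc handles are internal but widely used).
    jvm = spark.sparkContext._jvm
    jsc = spark.sparkContext._jsc
    fs = jvm.org.apache.hadoop.fs.FileSystem.get(jsc.hadoopConfiguration())
    return fs.exists(jvm.org.apache.hadoop.fs.Path(path_str))

def load_table_one():
    ...  # existing business mapping + write for table_name1

def load_table_two():
    ...  # existing business mapping + write for table_name2

# Placeholder marker path; in practice it could be per run or per partition.
marker1 = "/home/gudirame/user/data_base_db/table_name1/_SUCCESS"

if not hdfs_path_exists(marker1):
    load_table_one()   # only rerun the first load if its marker is missing
load_table_two()       # always (re)run the second load when reprocessing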

How to create a log table in Hive to record job success/failure?

I want to create a logging table in Hive. It should contain basic details of a Sqoop job that runs every day: the job name / the target table name the data was loaded into, the number of records ingested, whether the ingestion succeeded or failed, and the time of ingestion.
A .log file is created after every Sqoop job run, but it is not structured in a way that can be loaded directly into a Hive table using the LOAD DATA INPATH command. I would really appreciate it if someone could point me in the right direction: should a shell script be written to achieve this, or something else?
Thank you in advance
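One possible direction, sketched in Python rather than a shell script and with the job, table, and column layout invented for illustration: wrap the Sqoop run, capture the exit status and timestamp yourself, and append one structured row to the Hive log table instead of trying to load the raw .log file. The "Retrieved N records." line it parses is an assumption about the Sqoop log format and may differ between versions.

import datetime
import subprocess

SQOOP_CMD = ["sqoop", "job", "--exec", "daily_orders_import"]  # placeholder saved job
LOG_TABLE = "audit_db.ingestion_log"                           # placeholder Hive table

start = datetime.datetime.now()
result = subprocess.run(SQOOP_CMD, capture_output=True, text=True)
status = "SUCCESS" if result.returncode == 0 else "FAILED"

# Pull the record count out of Sqoop's console output, if present.
records = 0
for line in result.stdout.splitlines() + result.stderr.splitlines():
    if "Retrieved" in line and "records" in line:
        records = int(line.split()[-2])

# Append one row to the log table (requires Hive 0.14+ for INSERT ... VALUES).
insert_sql = (
    "INSERT INTO TABLE {t} VALUES "
    "('daily_orders_import', 'orders', {r}, '{s}', '{ts}')"
).format(t=LOG_TABLE, r=records, s=status, ts=start.strftime("%Y-%m-%d %H:%M:%S"))
subprocess.run(["hive", "-e", insert_sql])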

Presto failed: com.facebook.presto.spi.type.VarcharType

I created a table with three columns: id, name, position. Then I stored the data in S3 in ORC format using Spark.
When I query select * from person it returns everything.
But when I query from Presto, I get this error:
Query 20180919_151814_00019_33f5d failed: com.facebook.presto.spi.type.VarcharType
I have found the answer to the problem: when I stored the data in S3, the file contained one more column than was defined in the Hive table metastore.
So when Presto tried to query the data, it found a varchar where it expected an integer.
This might also happen if one record has a type different from what is defined in the metastore.
I had to delete my data and import it again without that extra, unneeded column.
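For anyone hitting the same thing, a rough sketch of the Spark-side cleanup described above: select only the columns declared in the metastore before rewriting the ORC files. The column names mirror the question; the S3 paths are placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fix-person-orc").getOrCreate()

# Source data that still carries the extra, undeclared column (placeholder path).
df = spark.read.orc("s3a://my-bucket/person_raw/")

# Keep only the columns defined in the Hive metastore so Presto's expected
# types line up with what is actually in the files.
clean = df.select("id", "name", "position")

clean.write.mode("overwrite").orc("s3a://my-bucket/person/")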

Does Hive duplicate data?

I have a large log file which I loaded into HDFS. HDFS replicates it to different nodes based on rack awareness.
Now I load the same file into a Hive table. The commands are as below:
CREATE TABLE log_analysis (logtext STRING) STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/';
LOAD DATA INPATH '/user/log/apache.log' OVERWRITE INTO TABLE log_analysis;
Now when I look at the '/user/hive/warehouse/' directory, there is a table file, and when I copy it to local it contains all the log file data.
My question is: the existing file in HDFS is replicated, and then the file loaded into the Hive table, also stored on HDFS, gets replicated as well.
Is that not the same file stored 6 times (assuming a replication factor of 3)? That would be such a waste of resources.
Correct: when you load the data from HDFS, the data moves from its HDFS location to /user/hive/warehouse/yourdatabasename/tablename.
Your question indicates that you have created an INTERNAL (managed) table in Hive and are loading data into it from an HDFS location.
When you load data into an internal table using the LOAD DATA INPATH command, it moves the data from its original location to the table's location; in your case that should be /user/hive/warehouse/log_analysis. So basically the data gets a new HDFS address, and you won't see anything at the previous location.
When you move data from one location to another on HDFS, the NameNode records the new location of the data and drops the old metadata for it. Hence there is no duplicate copy of the data: with a replication factor of 3, it is still stored only 3 times.
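If you want to see the move happen, here is a quick check from Python using the paths from the question (assuming the hdfs client is on the PATH); the -ls output also shows the replication factor in its second column.

import subprocess

# Before the load: the file sits in its original location.
subprocess.run(["hdfs", "dfs", "-ls", "/user/log/apache.log"])

# After LOAD DATA INPATH: it is gone from /user/log/ and appears under the
# table location instead -- one copy, still replicated 3x by HDFS.
subprocess.run(["hdfs", "dfs", "-ls", "/user/hive/warehouse/"])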
I hope that is clear to you.