Is loading data into a Hive table an atomic operation? - hive

I am going to load data into a Hive table. I want to make sure: if an error occurs partway through the load, will the data be left incomplete, or will it be rolled back?

Related

Is there a way to automate truncating a staging table before loading new data through Snowpipe from an S3 bucket?

We are moving data from the staging table to the fact table and want to truncate the old data from the staging table before new data is loaded from the S3 bucket through Snowpipe. Is there a way to automate the truncate statement before Snowpipe runs or loads new data into the staging table?
Have you considered just continually adding the data to your staging tables, putting an append-only STREAM over each table, and then using tasks to load the downstream tables from the STREAM? The task could run every minute with a WHEN clause that checks whether data is in the STREAM, which would load the data and push it downstream whenever it happens to land from your ERP.
Then you can have a daily task, run at any time during the day, which checks the STREAM to make sure there is NO DATA in it, and if so, DELETEs everything in the underlying table. This step is only needed to save storage, and because the STREAM is append-only, the DELETE statement does not create records in your STREAM.
Using this method will remove the need to truncate before Snowpipe loads the data.
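For illustration, here is a minimal sketch of that pattern in Snowflake SQL. All object names (RAW_STAGE, RAW_STAGE_STRM, FACT_SALES, ETL_WH) and the column list are placeholders I made up, not anything from the original question:

-- Append-only stream that tracks the new rows Snowpipe lands in the staging table.
CREATE OR REPLACE STREAM RAW_STAGE_STRM
  ON TABLE RAW_STAGE
  APPEND_ONLY = TRUE;

-- Task that polls every minute but only runs when the stream actually has data.
CREATE OR REPLACE TASK LOAD_FACT_TASK
  WAREHOUSE = ETL_WH
  SCHEDULE  = '1 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('RAW_STAGE_STRM')
AS
  INSERT INTO FACT_SALES (order_id, amount, loaded_at)
  SELECT order_id, amount, CURRENT_TIMESTAMP()
  FROM RAW_STAGE_STRM;

-- Daily housekeeping: empty the staging table only once the stream has been
-- drained. Because the stream is APPEND_ONLY, this DELETE does not add records
-- to it. (Written here as a Snowflake Scripting block; adjust to taste.)
CREATE OR REPLACE TASK PURGE_STAGE_TASK
  WAREHOUSE = ETL_WH
  SCHEDULE  = 'USING CRON 0 3 * * * UTC'
AS
BEGIN
  IF (NOT SYSTEM$STREAM_HAS_DATA('RAW_STAGE_STRM')) THEN
    DELETE FROM RAW_STAGE;
  END IF;
END;

ALTER TASK LOAD_FACT_TASK RESUME;
ALTER TASK PURGE_STAGE_TASK RESUME;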

When is data ready for querying in Google BigQuery after a Load Job?

Say I have a very large CSV file that I'm loading into a BigQuery table. Will this data be available for querying only after the whole file has been uploaded and the job is finished, or will it be available for querying while the file is being uploaded?
BigQuery commits data from a load job in an all-or-nothing fashion. Partial results will not appear in the destination table while a job is in progress; results are committed at the end of the load job.
A load job that terminates with an error commits no rows. However, for cases where your data is poorly sanitized, you can optionally configure your load job to ignore bad or malformed records through configuration values such as MaxBadRecords. In such cases a job may finish with warnings and still commit the successfully processed data, but the commit semantics remain the same: everything at the end, or nothing if the defined threshold for bad data is exceeded.
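If it helps to see it concretely, here is a hedged sketch using BigQuery's LOAD DATA SQL statement; the dataset, table, and bucket path are placeholders. The whole statement commits at the end (or not at all), and max_bad_records only controls how many malformed rows may be skipped before the job is failed:

-- Nothing is visible in mydataset.sales until this statement completes.
LOAD DATA INTO mydataset.sales
FROM FILES (
  format = 'CSV',
  uris = ['gs://my-bucket/sales_*.csv'],
  skip_leading_rows = 1,
  max_bad_records = 50
);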

How to check whether a LOAD DATA statement into Hive executed successfully or not?

We have the LOAD DATA statement in Hive and Impala, which we use to load data from HDFS into a Hive or Impala table. My question is: what if there is an issue in the file (maybe the file has fewer columns than the table, or there is a data mismatch in one of the columns)? In such a scenario, will the file get loaded, or will it fail to load and show an error?
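For context, a typical load looks like the sketch below (the path and table name are placeholders). In Hive, LOAD DATA is essentially a file move or copy into the table's storage directory: the rows are not parsed or validated against the schema at load time, so problems such as missing columns or type mismatches usually only surface as NULLs or errors when the data is later queried.

-- Moves the files under /landing/sales/2023-01-01/ into the table's directory,
-- replacing its current contents; no row-level validation happens here.
LOAD DATA INPATH '/landing/sales/2023-01-01/'
OVERWRITE INTO TABLE staging.sales_raw;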

PDI or mysqldump to extract data without blocking the database or getting inconsistent data?

I have an ETL process that will run periodically. I was using Kettle (PDI) to extract the data from the source database and copy it to a staging database. For this I use several transformations with Table Input and Table Output steps. However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data. Furthermore, I don't know whether the source database would be blocked. This would be a problem if the extraction takes several minutes (and it will). The advantage of PDI is that I can select only the necessary columns and use timestamps to get only the new data.
On the other hand, I think mysqldump with --single-transaction would let me get the data in a consistent way without blocking the source database (all tables are InnoDB). The disadvantage is that I would get unnecessary data.
Can I use PDI, or do I need mysqldump?
PS: I need to read specific tables from specific databases, so I think xtrabackup is not a good option.
However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data
I think "Table Input" step doesn't take into account any modifications that are happening when you are reading. Try a simple experiment:
Take a .ktr file with a single table input and table output. Try loading the data into the target table. While in the middle of data load, insert few records in the source database. You will find that those records are not read into the target table. (note i tried with postgresql db and the number of rows read is : 1000000)
Now for your question, i suggest you using PDI since it gives you more control on the data in terms of versioning, sequences, SCDs and all the DWBI related activities. PDI makes it easier to load to the stage env. rather than simply dumping the entire tables.
Hope it helps :)
Interesting point. If you do all the Table Inputs in one transformation, then at least they all start at the same time, but while that is likely to be consistent, it's not guaranteed.
There is no reason you can't use PDI to orchestrate the process AND use mysqldump. In fact, for bulk inserts or extracts it's nearly always better to use the vendor-provided tools.
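To make the --single-transaction point concrete: for InnoDB tables, what mysqldump relies on is an ordinary consistent-read transaction, which other sessions can keep writing around without being blocked. A rough sketch of the equivalent SQL (table and column names are made up) looks like this, and the same idea applies if your extraction tool lets you run all of its queries inside one such transaction:

-- All SELECTs inside this transaction see a single consistent snapshot of the
-- InnoDB data; concurrent writers are not blocked (MVCC consistent reads).
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
START TRANSACTION WITH CONSISTENT SNAPSHOT;

SELECT id, customer_id, amount, updated_at
FROM   orders
WHERE  updated_at >= '2023-01-01 00:00:00';  -- incremental extract by timestamp

-- ...further SELECTs here read from the same snapshot...

COMMIT;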

Slow data load with conversion from nvarchar to varchar

I have a table which has approximately 140 columns. The data for this table comes from a transactional system and many of the columns are Unicode.
I dump the daily load to my Staging database, whose data types match exactly what the source system has. From Staging, I do some cleaning and load the data into a Reports database. When loading from Staging to the Reports database, I convert all the Unicode character data to non-Unicode strings. This process takes an awful lot of time and I am trying to optimize it (make the load times faster).
I use the Derived Column transformation to convert all the Unicode data to string data. Any suggestions for me?
How about casting your columns to varchar(size) in your source query?
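For example, a source query along these lines (column names and sizes are placeholders) performs the conversion inside the database engine, so the data flow already receives non-Unicode (DT_STR) columns and the Derived Column step can be dropped:

SELECT
    CAST(CustomerName AS varchar(100)) AS CustomerName,
    CAST(AddressLine1 AS varchar(200)) AS AddressLine1,
    OrderAmount,      -- non-Unicode columns pass through unchanged
    OrderDate
FROM dbo.DailyStagingLoad;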