Data update in a big data system - Hive

I'm using Spark Streaming and Hive. I want to insert or update data in an existing Hive table using Spark SQL, but I don't know how to check whether the data already exists. Please suggest an approach for this!
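Spark SQL has no row-level UPDATE or MERGE for plain (non-ACID) Hive tables, so a common pattern is to check the incoming batch against the keys already in the target table and split it into inserts and updates, then rewrite the affected rows. A minimal sketch of that existence check in plain Python follows; in Spark the same split is done with a `left_anti` join on the key column. The table layout and column names (`id`, `value`) are hypothetical.

```python
# Sketch of the "does this row already exist?" split that an upsert needs.
# In Spark the same logic is a join: new rows = batch.join(existing_keys,
# on="id", how="left_anti"); the remainder are updates to be rewritten.
# Column names (id, value) are hypothetical placeholders.

def split_upserts(existing_ids, batch):
    """Split an incoming batch into rows to INSERT (key unseen) and
    rows to UPDATE (key already present in the target table)."""
    inserts, updates = [], []
    for row in batch:
        (updates if row["id"] in existing_ids else inserts).append(row)
    return inserts, updates

# Example micro-batch against a table that already holds ids {1, 2}:
existing = {1, 2}
batch = [{"id": 2, "value": "b2"}, {"id": 3, "value": "c"}]
inserts, updates = split_upserts(existing, batch)
print(inserts)  # unseen ids -> plain INSERT path
print(updates)  # known ids -> rewrite/overwrite path
```

Note that rewriting updated rows from Spark usually means overwriting a partition (or a staging table) rather than updating in place, since Spark cannot overwrite a Hive table it is simultaneously reading from.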

Related

How to update a record in a Hive table using Informatica PowerCenter

I'm trying to update a record based on an ID field in a Hive table using Informatica Cloud Data Integration. But the problem is that instead of updating the existing record, it creates a new record. Can anyone please suggest a better approach?

ACID table error in Impala after Hive upgraded to Hive 3

I am very new to Hive and Impala.
I was trying to query an already existing table in Impala, but I got the following error:
AnalysisException: Table dev_test.customer not supported. Transactional (ACID) tables are only supported when they are configured as insert_only.
The version is Hive 3. I am clueless as to what to do. I did see some documentation and articles online, but still could not solve the issue. I have attached a screenshot of the error screen. Let me know if you need more information.
Any help is greatly appreciated. Thanks!
Unfortunately you can't see the data through Impala; you have to use Hive. Alternatively, you can change the table properties to insert_only to make the data visible in Impala:
ALTER TABLE tmp2 SET TBLPROPERTIES (
  'transactional'='true',
  'transactional_properties'='insert_only'
);
When you set a table to full ACID, or Hive upgrades it to full ACID, the table file format changes to ACID ORC, which Impala does not support, so you cannot access those tables from Impala; you need to use Hive instead.
If you choose the workaround and change the table properties, you will lose the full ACID benefits such as UPDATE and DELETE.

How to add/reflect query changes to the existing data in Spark

I have a table created in production with incorrect data. My rec_srt_dt and rec_end_dt columns were loaded wrongly: rec_srt_dt was set to sys_dt. I have now modified the query to load the data properly. My question is: how should I handle the existing data in the production table, and how do I apply the new changes to that data?
My source table is Oracle, I'm using Spark for the transformations, and the target table is in AWS.
Kindly help me with this.
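One common way to handle this is a one-time backfill: re-read the existing production rows, recompute the two date columns with the corrected logic, and overwrite the target (in Spark, a `withColumn` followed by an overwrite write), while new loads use the fixed query going forward. A minimal sketch of the recompute step in plain Python follows; the corrected rule shown here (taking rec_srt_dt from a hypothetical src_create_dt column instead of the load-time sys_dt) is an assumption standing in for the real fixed logic.

```python
# One-time backfill sketch: apply the corrected date rule to every
# existing row, then overwrite the table with the fixed rows. In Spark
# this would be df.withColumn(...) followed by an overwrite write.
# The rule below -- rec_srt_dt taken from src_create_dt instead of the
# load-time sys_dt -- is a hypothetical stand-in for the real fix.

def fix_row(row):
    fixed = dict(row)
    fixed["rec_srt_dt"] = row["src_create_dt"]  # corrected rule (assumed)
    return fixed

rows = [
    {"id": 1, "src_create_dt": "2021-01-05", "rec_srt_dt": "2021-03-01"},
    {"id": 2, "src_create_dt": "2021-02-10", "rec_srt_dt": "2021-03-01"},
]
backfilled = [fix_row(r) for r in rows]
print(backfilled)
```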

How to write a trigger to insert data from Aurora to Redshift

I have some data in an Aurora MySQL DB, and I would like to do two things:
HISTORICAL DATA:
Read the data from Aurora (say, table A), do some processing, and update some columns of a table in Redshift (say, table B).
ALSO,
LATEST DAILY LOAD:
Have a trigger-like condition so that whenever a new row is inserted into Aurora table A, the columns in Redshift table B are updated, again with some processing.
What would be the best approach to handle this situation? Please understand that I don't have a simple read-and-insert situation; I also have to perform some processing between the read and the write.
Not sure if you have already solved the issue; if so, please share the details.
We are looking at the following approach:
A cron job writes the daily data batch into S3 (say, 1 month or so of data).
Upon arrival in S3, load that file into Redshift via the COPY command (https://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-run-copy.html).
Looking for more ideas/thoughts for sure.
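The second step above boils down to issuing a Redshift COPY once the batch file lands, with any processing applied before the file is written to S3. A minimal sketch of building that statement follows; the table name, bucket path, and IAM role ARN are placeholders, and the statement would be executed over an ordinary Redshift connection (e.g. with psycopg2).

```python
# Sketch of the load step: once the (already-processed) daily batch
# lands in S3, issue a Redshift COPY to append it to the target table.
# Table, bucket path, and IAM role ARN are placeholder values.

def build_copy_sql(table, s3_path, iam_role):
    """Build a Redshift COPY statement for a CSV file in S3."""
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV;"
    )

sql = build_copy_sql(
    "table_b",
    "s3://my-batch-bucket/daily/2021-06-01.csv",
    "arn:aws:iam::123456789012:role/redshift-copy",
)
print(sql)
```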

AWS Glue: Is it possible to pull only specific data from a database?

I need to transform a fairly big database table to CSV with AWS Glue. However, I only need the newest table rows, from the past 24 hours. There is a column which specifies the creation date of the row. Is it possible to transform just those rows, without copying the whole table into the CSV file? I am using a Python script with Spark.
Thank you very much in advance!
There are some built-in transforms in AWS Glue which are used to process your data. These transforms can be called from ETL scripts.
Please refer to the link below:
https://docs.aws.amazon.com/glue/latest/dg/built-in-transforms.html
You haven't mentioned the type of database you are trying to connect to. In any case, for JDBC connections Spark has the query option, with which you can issue a normal SQL query to fetch only the rows you need.
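With the query option, the 24-hour filter is pushed down to the database, so Spark only pulls the recent rows instead of the whole table. A minimal sketch of building that pushdown query follows; the table and column names are hypothetical, and in a Glue/Spark script the resulting string would be passed as `spark.read.format("jdbc").option("query", pushdown_query)` along with the connection options.

```python
# Sketch of the JDBC pushdown: build a query that filters on the
# creation-date column so only the last 24 hours of rows leave the
# database. Table name (events) and column name (created_at) are
# hypothetical; pass the string via .option("query", pushdown_query).
from datetime import datetime, timedelta

def build_pushdown_query(table, created_col, since):
    return (
        f"SELECT * FROM {table} "
        f"WHERE {created_col} >= '{since:%Y-%m-%d %H:%M:%S}'"
    )

# Fixed "now" for the example; a real job would use datetime.utcnow().
cutoff = datetime(2021, 6, 1, 12, 0, 0) - timedelta(hours=24)
pushdown_query = build_pushdown_query("events", "created_at", cutoff)
print(pushdown_query)
```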