PutHiveQL NiFi Processor extremely slow - misconfiguration?

I am currently setting up a simple NiFi flow that reads from an RDBMS source and writes to a Hive sink. The flow works as expected until the PutHiveQL processor, which is running extremely slow. It inserts approximately one record every minute.
It is currently set up as a standalone instance running on one node.
The logs show an insert roughly every minute:
(INSERT INTO customer (id, name, address) VALUES (x, x, x))
Any ideas about why this may be? Improvements to try?
Thanks in advance

Inserting one record at a time into Hive will result in extreme slowness.
Since you are doing regular inserts into the Hive table, change your flow:
QueryDatabaseTable
PutHDFS
Then create a Hive Avro table on top of the HDFS directory where you have stored the data.
(or)
QueryDatabaseTable
ConvertAvroToORC //in case you need to store the data in ORC format
PutHDFS
Then create a Hive ORC table on top of the HDFS directory where you have stored the data.
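For the ORC variant, a minimal sketch of that DDL, assuming a hypothetical HDFS directory /data/customer_orc used by PutHDFS and the customer columns from the question:
-- external table pointing at the directory where PutHDFS wrote the ORC files
CREATE EXTERNAL TABLE customer_orc (
  id INT,
  name STRING,
  address STRING
)
STORED AS ORC
LOCATION '/data/customer_orc';
Queries then read the bulk-written files directly, with no per-row INSERT at all.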

Are you pushing one record at a time? If so, you can use the MergeRecord processor to create batches before pushing into PutHiveQL.
It is recommended to batch around 100 records:
See here: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hive-nar/1.5.0/org.apache.nifi.processors.hive.PutHiveQL/
Batch Size | 100 | The preferred number of FlowFiles to put to the database in a single transaction
Use the MergeRecord processor and set the number of records and/or a timeout; it should speed things up considerably.

Related

How to write a trigger to insert data from Aurora to Redshift

I have some data in an Aurora MySQL DB, and I would like to do two things:
HISTORICAL DATA:
Read the data from Aurora (say TABLE A), do some processing, and update some columns of a table in Redshift (say TABLE B).
ALSO,
LATEST DAILY LOAD:
Have a trigger-like condition where, whenever a new row is inserted into Aurora table A, a trigger updates the columns in Redshift table B with some processing.
What would be the best approach to handle such a situation? Please understand I don't have a simple read-and-insert situation; I also have to perform some processing between the read and the write.
Not sure if you have already solved the issue; if so, please share the details.
We are looking at the following approach:
A cron job will write the daily data batch into S3 (say 1 month, or of that order).
Upon arrival in S3, load that file into Redshift via the COPY command (https://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-run-copy.html), as sketched below.
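A minimal sketch of that COPY step, assuming a hypothetical target table; the bucket, file name, and IAM role ARN are placeholders:
-- load the daily batch file from S3 into the Redshift target table
COPY table_b
FROM 's3://my-bucket/daily-batch/2020-01-01.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
CSV;
Any processing between read and write would then be done in Redshift SQL (e.g. an INSERT ... SELECT from a staging table into TABLE B) rather than row by row.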
Looking for more ideas/thoughts for sure.

Hive Table deletion and query processing

As per my understanding of Hive concepts, if we load a dataset into a Hive table, the data file moves from the source path to the Hive warehouse within HDFS, and HDFS is set to three replicas for the data.
These questions might look silly, but as I am a beginner, I want to clear my doubts.
My questions are:
1) If I delete the Hive table, will it delete the data file from the Hive warehouse only, or the other two replicas in HDFS as well?
2) If we are processing a query on a Hive table, will that query be done as distributed processing?
Say one data file is 1 GB in size (that is, 8 blocks x 128 MB); as we have a replication factor of three, there would be a total of 24 blocks for this file.
Will our Hive query be distributed among all the data blocks, or will it be processed on the Hive warehouse blocks only?
Thanks in advance..
If you do "load data inpath" from a HDFS path the data will be moved from source to destination HDFS path,
If you do "load data local inpath", it doesn't move data from local to HDFS path, instead it copies
For your questions:
If you delete a file in HDFS, all the replicas are deleted.
If you have a 1 GB file (8 blocks) with a replication factor of 3, when you trigger the query in the Hive CLI it converts your query to a MapReduce job. It processes only 8 blocks; in case of a DataNode failure during the triggered job, it accesses a replica on a different node and processes the data there (speculative execution).

How to insert overwrite partitioned table in BigQuery UI?

We can insert data into a specific partition of a partitioned table; here we need to specify the partition value. But my requirement is to overwrite all partitions in a table in one query using the UI. Can we perform this operation?
I consulted a BigQuery team member. You can NOT write to all partitions in one query.
You can only write to one partition at a time.
As YY has pointed out, you would not be able to do this directly in BigQuery/SQL with one query (you could script something to run N queries). However, if you spun up a Cloud Dataflow pipeline and configured it to have multiple BigQueryIO sinks, with each sink overwriting a partition, that is one way I can think of doing it in one shot. It would be a straightforward pipeline to spin up and run.
At this time, BigQuery allows updating up to 2,000 partitions in a single statement. If you just need to insert data into a partitioned table, you can use the INSERT DML statement to write to up to 2,000 partitions in one statement. If you are updating or deleting existing partitions, you can use the UPDATE or DELETE statements respectively. If you are both updating and inserting new data, you can use the MERGE DML statement to achieve this.
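A minimal sketch of the MERGE approach, assuming hypothetical tables mydataset.target (partitioned on a date column dt) and mydataset.staging with matching schemas:
-- upsert staged rows across many partitions of the target in one statement
MERGE mydataset.target T
USING mydataset.staging S
ON T.id = S.id AND T.dt = S.dt
WHEN MATCHED THEN
  UPDATE SET name = S.name
WHEN NOT MATCHED THEN
  INSERT (id, dt, name) VALUES (S.id, S.dt, S.name);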

Either insert or update database records via Apache NiFi flow

I am trying to transfer data between two databases with similar table structures using NiFi. Example of the data structure:
User: {varchar name, integer id}.
There are no "Maximum-value Columns" so it is impossible to determine if there is new data or not. So each time I create "snapshot" of the full table content. The problem is that it is unclear either particular record should be inserted or updated in the target database.
I created two branches of processors: with inserts and with updates. Only insert works for new records and only update for existing. But (!) PutSQL processor works with bunch of flow files.
For example batch size is 100 and processors work once a day. Assume there was 98 records yesterday. They will be inserted. Today there are 200 records (98 from yesterday and 102 new). In this flow if NiFi tries to update first 100 records and insert them then both actions will fail: first 98 records should be updated while last 2 should be inserted.
How to solve this issue? I know it is possible to use batch size 1 but it work too slow.
I recommend solving this in your SQL statements, since NiFi will not know the prior status of the records. A MERGE statement would be ideal if your database supports it (Oracle and SQL Server have MERGE; MySQL has INSERT ... ON DUPLICATE KEY UPDATE). Otherwise, you can craft both an INSERT and an UPDATE for each record in the source table, making them conditional on the user existing in the table.
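For example, a minimal upsert sketch for the User structure above, assuming MySQL as the target and id as the primary key (both assumptions, not stated in the question):
-- insert the row, or update the existing row when the id already exists
INSERT INTO user (id, name)
VALUES (42, 'Alice')
ON DUPLICATE KEY UPDATE name = VALUES(name);
With a statement like this, one branch can handle the whole batch regardless of which records already exist.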

Hive insert vs Hive Load: What are the trade offs?

I'm learning Hadoop/big data technologies. I would like to ingest data in bulk into Hive. I started working with a simple CSV file, and when I tried to use the INSERT command to load it record by record, one record insertion alone took around 1 minute. When I put the file into HDFS and then used the LOAD command, it was instantaneous, since it just copies the file into Hive's warehouse. I just want to know what trade-offs one has to face when opting for LOAD instead of INSERT.
Load - Hive does not do any transformation while loading data into tables. Load operations are currently pure copy/move operations that move data files into locations corresponding to Hive tables.
Insert - Query results can be inserted into tables by using the insert clause, which in turn runs MapReduce jobs, so it takes some time to execute.
In case you want to optimize/tune the insert statements, below are some techniques:
1. Set the execution engine in hive-site.xml to Tez (if it is already installed):
set hive.execution.engine=tez;
2. Use the ORC file format:
CREATE TABLE A_ORC (
customerID int, name string, age int, address string
) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");
INSERT INTO TABLE A_ORC SELECT * FROM A;
3. Concurrent job runs in Hive can save overall job running time. To achieve that, the below config needs to be changed in hive-default.xml:
set hive.exec.parallel=true;
set hive.exec.parallel.thread.number=<your value>;
For more info, you can visit http://hortonworks.com/blog/5-ways-make-hive-queries-run-faster/
Hope this helps.