Validate a Sqoop import that uses QUERY and WHERE clauses - sql

I am operationalizing a data import process that takes data from an existing database and partitions it within an HDFS directory scheme. By default, the job is split into four map processes, and right now I have the job configured to run on a daily interval through Apache Oozie.
Since Oozie is DAG-oriented, is there a way to create a validation step within the Oozie workflow that would:
Run HIVE query on newly imported data to return count of rows
Run SQL query to return count of rows in original source of data
Compare the two values
If they do not match, fail and kill the job; if they match, return success and continue
I understand there is a validation option within Sqoop, but my understanding is that it is not applicable here, since I am not running the import against a single whole table (each of my Sqoop imports is partitioned by a specific date).
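Roughly, the two counts I want to compare would come from queries like the following (table, column, and date values are just placeholders):
-- Hive: count rows in the newly imported partition
SELECT COUNT(*) FROM imported_table WHERE partition_date = '2016-01-15';
-- Source database: count rows for the same date in the original table
SELECT COUNT(*) FROM source_table WHERE record_date = '2016-01-15';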
Is this possible?

Related

SSIS Incremental Load - 15 mins

I have 2 tables. The source table is on a linked server and the destination table is on another server.
I want my data load to happen in the following manner:
Every day at night I have scheduled a job to do a full dump, i.e. truncate the table and load all the data from the source to the destination.
Every 15 minutes, do an incremental load, as data gets ingested into the source on a per-second basis; I need to replicate the same on the destination too.
For the incremental load, as of now I have created scripts that live in a stored procedure, but going forward we would like to implement this in SSIS.
The scripts run in the below manner:
I have an Inserted_Date column. Based on this column, I take Max(Inserted_Date), delete all the rows that are greater than or equal to that max, and insert the corresponding rows from the source into the destination. This job runs every 15 minutes.
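As a rough sketch, the stored procedure logic looks something like this (table and server names are placeholders):
DECLARE @MaxInserted DATETIME;
-- take the max of Inserted_Date
SELECT @MaxInserted = MAX(Inserted_Date) FROM dbo.DestinationTable;
-- delete all rows greater than or equal to that max
DELETE FROM dbo.DestinationTable WHERE Inserted_Date >= @MaxInserted;
-- insert the corresponding rows from the source (linked server)
INSERT INTO dbo.DestinationTable
SELECT * FROM LinkedServer.SourceDb.dbo.SourceTable
WHERE Inserted_Date >= @MaxInserted;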
How to implement similar scenario in SSIS?
I have worked in SSIS using Lookup and Conditional Split on ID columns, but the tables I am working with have a lot of rows, so the lookup takes a lot of time and is not the right solution for my scenario.
Is there any way I can get the Max(Inserted_Date) logic into the SSIS solution too? My end goal is to drop the script-based approach and replicate the same logic in SSIS.
Here is the general Control Flow:
There's plenty to go on here, but you may need to learn how to set variables from an Execute SQL Task and so on.
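For example, the Execute SQL Task that seeds the watermark variable could run a query like this (the destination table name is a placeholder):
SELECT MAX(Inserted_Date) AS MaxInsertedDate FROM dbo.DestinationTable;
On the task's Result Set page, map MaxInsertedDate to an SSIS variable; downstream Execute SQL Tasks can then reference that variable in their DELETE and INSERT statements.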

PutHiveQL NiFi Processor extremely slow - misconfiguration?

I am currently setting up a simple NiFi flow that reads from an RDBMS source and writes to a Hive sink. The flow works as expected until the PutHiveQL processor, which runs extremely slowly. It inserts approximately one record every minute.
It is currently set up as a standalone instance running on one node.
The logs show an insert roughly every minute:
(INSERT INTO customer (id, name, address) VALUES (x, x, x))
Any ideas about why this may be? Improvements to try?
Thanks in advance
Inserting one record at a time into Hive will result in extreme slowness.
Since you are doing regular inserts into the Hive table, change your flow to:
QueryDatabaseTable
PutHDFS
Then create a Hive Avro table on top of the HDFS directory where you have stored the data.
(or)
QueryDatabaseTable
ConvertAvroToORC // in case you need to store the data in ORC format
PutHDFS
Then create a Hive ORC table on top of the HDFS directory where you have stored the data.
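For example, the ORC case would look roughly like this (path and columns are placeholders; for the Avro case use STORED AS AVRO instead):
CREATE EXTERNAL TABLE customer_orc (
  id      INT,
  name    STRING,
  address STRING
)
STORED AS ORC
LOCATION '/data/nifi/customer_orc';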
Are you pushing one record at a time? If so, you can use the MergeRecord processor to create batches before pushing to PutHiveQL.
It is recommended to batch around 100 records:
See here: https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-hive-nar/1.5.0/org.apache.nifi.processors.hive.PutHiveQL/
Batch Size | 100 | The preferred number of FlowFiles to put to the database in a single transaction
Use the MergeRecord processor and set the number of records and/or a timeout; it should speed things up considerably.

Result of a BigQuery job running on a table into which data is loaded via the streaming API

I have a BQ wildcard query that merges a couple of tables with the same schema (company_*) into a new, single table (all_companies). (all_companies will be exported later into Google Cloud Storage)
I'm running this query using the BQ CLI with all_companies as the destination table and this generates a BQ Job (runtime: 20mins+).
The company_* tables are populated constantly using the streaming API.
I've read about BigQuery jobs, but I can't find any information about streaming behavior.
If I start the BQ CLI query at T0, the streamingAPI adds data to company_* tables at T0+1min and the BQ CLI query finishes at T0+20min, will the data added at T0+1min be present in my destination table or not?
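For context, the query is essentially just a wildcard select of this shape (standard SQL; project and dataset names are placeholders), run with all_companies configured as the destination table:
SELECT *
FROM `my_project.my_dataset.company_*`;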
As described here, the query engine will look at both the columnar storage and the streaming buffer, so the query should potentially see the streamed data.
It depends on what you mean by a runtime of 20 minutes+. If the query is run 20 minutes after you create the job, then all data in the streaming buffer by T0+20min will be included.
If on the other hand the job starts immediately and takes 20 minutes to complete, you will only see data that is in the streaming buffer at the moment the table is queried.

Will Sqoop extract only the records present at initiation, or also records added to the table while the Sqoop job is running?

I have a table A that is constantly updated with new records. I start a Sqoop import of the records from table A to HDFS at, say, 2:00 PM CT (source table A has 5M records), and the Sqoop job ends at 4:00 PM CT (table A has 5.5M records). My question is:
Will there be 5M records in the target or 5.5M?
According to the documentation, Sqoop uses the read-committed transaction isolation level. So once the (one or more) SELECT queries that Sqoop performs underneath have been executed, the "selected" records are the ones that will be inserted into Hive (I assume you're importing data into Hive because of the tag you used in the question). What determines the number of records that are finally imported (5M or 5.5M) is therefore the execution of the SELECT queries, not the total amount of time that the whole import process takes.
Bear in mind that you can control the parallelism of the import process by specifying the number of mappers that are going to be used (parameter --num-mappers). Each mapper will perform an independent SELECT query.
Also, you can consider using incremental imports to retrieve the new data that has been added to the database after the import process has finished. In addition, you can use free-form queries to have finer-grained control over the amount of data you want to import.
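As a sketch, the free-form query you would pass to Sqoop's --query option is plain SQL plus the $CONDITIONS token that Sqoop uses to split the work among the mappers (table and column names are placeholders):
SELECT *
FROM table_a
WHERE record_date = '2018-06-01'
  AND $CONDITIONS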

Hive - How can I store the hive query results to be referred later?

I usually connect to gateway node through putty and run hive queries over there.
On several occasions the queries run for hours at a stretch. And at least a few times, PuTTY gets disconnected and the execution of the queries aborts as well.
Is there a way to store hive query results somehow, so that I can inspect them at later points of time?
I don't want to create another table just to store the results.
You can store your result by writing it to an HDFS directory:
INSERT OVERWRITE DIRECTORY 'outputpath' SELECT * FROM table
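If you want the output as plain delimited text (supported since Hive 0.11), you can also specify a row format, and you can use LOCAL DIRECTORY to write to the local filesystem instead of HDFS (the path below is a placeholder):
INSERT OVERWRITE DIRECTORY '/user/yourname/query_results'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT * FROM table;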