How to efficiently use the SelectHiveQL processor in NiFi? - hive

I have been using the SelectHiveQL processor to fetch data from Hive and create CSV files. I am observing that for around 7 million records it takes around 5 minutes. When I looked more closely, I found that the data fetch from Hive is fast and takes less than 10% of the overall time, but writing the CSV files takes far too long. I am using 8 cores and 32 GB RAM, and I have configured a 16 GB heap. Can someone please help me improve this performance? Do I need to change any system-level settings?

The CSV output option of SelectHiveQL could certainly be improved: currently it builds each row as a string in memory and then writes it to the flow file, but it could probably just write straight to the flow file instead. Please feel free to file a Jira for this improvement.
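For anyone experimenting with a custom processor in the meantime, here is a rough sketch (not the actual SelectHiveQL code) of that streaming idea in Java: rows from an already-executed ResultSet are written directly to the flow file's output stream instead of being assembled as strings first. The helper name and the lack of CSV quoting/escaping are purely illustrative.

    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.io.OutputStream;
    import java.io.OutputStreamWriter;
    import java.nio.charset.StandardCharsets;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.nifi.flowfile.FlowFile;
    import org.apache.nifi.processor.ProcessSession;

    // Hypothetical helper that would live inside a custom processor's onTrigger().
    static FlowFile writeResultSetAsCsv(ProcessSession session, FlowFile flowFile, ResultSet rs) {
        return session.write(flowFile, (OutputStream out) -> {
            try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
                int cols = rs.getMetaData().getColumnCount();
                while (rs.next()) {
                    for (int i = 1; i <= cols; i++) {
                        if (i > 1) writer.write(',');
                        String value = rs.getString(i);
                        writer.write(value == null ? "" : value); // no quoting/escaping in this sketch
                    }
                    writer.newLine();
                }
            } catch (SQLException e) {
                throw new IOException(e); // the flow file callback may only throw IOException
            }
        });
    }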

Related

How to improve frequent BigQuery reads?

I'm using BigQuery for Java to do small reads on a table with about ~5GB of data. The queries I run follow standard SQL, like SELECT foo FROM my-table WHERE bar=$1, where the result is at most one row. I need to do this at a high frequency, so performance is a big concern. How do I optimize for this?
I thought about pulling the entire data set periodically since it's only 5GB, but then again 5GB sounds like a lot to keep constantly in memory.
Running this query in the BigQuery console shows something like Query complete (0.6 sec elapsed, 4.2 GB processed). Fast for 4.2 GB, but not fast enough. Again, I need to read from it very frequently but rarely (maybe once a day or week) write to it.
Maybe tell the server to cache the processed data somehow?
You don't have control over the cache layer in BigQuery. That is something the service does automatically for you. Unfortunately, the typical cache lifetime is 24 hours, and the cached results are best-effort and may be invalidated sooner (see the official docs).
A query completing in 0.6s seems to be good for BigQuery. I'm afraid that if you are looking for something faster, maybe BigQuery isn't the data warehouse for your use case.
BigQuery is built for analytical processing, not for interacting with individual rows. The best practice would be, as you mentioned, to hold a copy of the data in a place that allows quicker and more efficient reads of individual rows (like a MySQL database).
However, you can still vastly reduce the amount of data scanned by your query by clustering the table on the field that you're filtering on.
https://cloud.google.com/bigquery/docs/creating-clustered-tables
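For illustration, since the question mentions BigQuery for Java, a table clustered on the filtered column could be created with the Java client roughly like this (the dataset and table names and the schema are made up; foo and bar are the columns from the example query):

    import com.google.cloud.bigquery.*;
    import java.util.Collections;

    public class CreateClusteredTable {
        public static void main(String[] args) {
            BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

            // Hypothetical schema matching "SELECT foo FROM my-table WHERE bar = $1"
            Schema schema = Schema.of(
                    Field.of("foo", StandardSQLTypeName.STRING),
                    Field.of("bar", StandardSQLTypeName.STRING));

            StandardTableDefinition definition = StandardTableDefinition.newBuilder()
                    .setSchema(schema)
                    // Cluster on the column used in the WHERE clause so lookups scan far less data
                    .setClustering(Clustering.newBuilder()
                            .setFields(Collections.singletonList("bar"))
                            .build())
                    .build();

            bigquery.create(TableInfo.of(TableId.of("my_dataset", "my_table"), definition));
        }
    }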

SQL Server: Bulk Insert Data Loading to Partitioned Table with Multiple File Groups

I am trying to load a series of CSV files, ranging from 100MB to 20GB in size (total of ~3TB), so I need every performance enhancement I can get. I am aiming to use filegroups and partitioning as a means to that end. I performed a series of tests to find the optimal approach.
First, I tried various filegroup combinations; the best result I get is when I am loading into a table that is on one filegroup with multiple files assigned to it, all sitting on one disk. This combination outperformed the cases where I have multiple filegroups.
The next step was naturally partitioning. Oddly, all the partitioning combinations that I examined have lower performance. I tried defining various partition functions/schemes and various filegroup combinations, but all showed a lower loading speed.
I am wondering what I am missing here.
So far, I have managed to load (using BULK INSERT) a 1GB CSV file in 3 minutes. Any idea is much appreciated.
To achieve optimal data loading speed you first need to understand the SQL Server data load process, which means understanding how SQL Server achieves the optimizations listed below.
Minimal Logging.
Parallel Loading.
Locking Optimization.
These two articles explain in detail how you can achieve all of the above optimizations: Fastest Data Loading using Bulk Load and Minimal Logging and Bulk Loading data into HEAP versus CLUSTERED Table.
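As a rough sketch, a table-locked, batched CSV load using the Microsoft JDBC driver's bulk copy API could look like the following (connection string, table, file path, and column layout are all hypothetical; the same locking optimization is available from T-SQL via BULK INSERT ... WITH (TABLOCK), and minimal logging also requires the SIMPLE or BULK_LOGGED recovery model):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import com.microsoft.sqlserver.jdbc.SQLServerBulkCSVFileRecord;
    import com.microsoft.sqlserver.jdbc.SQLServerBulkCopy;
    import com.microsoft.sqlserver.jdbc.SQLServerBulkCopyOptions;

    public class CsvBulkLoad {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:sqlserver://localhost;databaseName=Staging;integratedSecurity=true";
            try (Connection conn = DriverManager.getConnection(url)) {
                // Describe the CSV file (hypothetical two-column layout, no header row)
                SQLServerBulkCSVFileRecord csv =
                        new SQLServerBulkCSVFileRecord("C:\\data\\load\\file0001.csv", "UTF-8", ",", false);
                csv.addColumnMetadata(1, "Id", java.sql.Types.BIGINT, 0, 0);
                csv.addColumnMetadata(2, "Payload", java.sql.Types.NVARCHAR, 4000, 0);

                SQLServerBulkCopyOptions options = new SQLServerBulkCopyOptions();
                options.setTableLock(true);      // TABLOCK: prerequisite for minimal logging on a heap
                options.setBatchSize(100_000);   // commit in large batches rather than per row

                try (SQLServerBulkCopy bulkCopy = new SQLServerBulkCopy(conn)) {
                    bulkCopy.setBulkCopyOptions(options);
                    bulkCopy.setDestinationTableName("dbo.TargetTable");
                    bulkCopy.writeToServer(csv);
                }
            }
        }
    }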
Hope this helps.

SQL HW to performance ratio

I am seeking a way to find bottlenecks in SQL Server, and it seems that more than 32GB of RAM and more than 32 spindles on 8 cores are not enough. Are there any metrics, best practices or hardware comparisons (e.g. transactions per second)? Our daily closure takes hours and I want it in minutes, or in real time if possible. I was not able to merge more than 12k rows/sec. For now, I had to split the traffic across more than one server, but is that a proper solution for a ~50GB database?
The merge is wrapped in a stored procedure and kept as simple as it can be: deduplicate the input, insert new rows, update existing rows. I found that the more rows we put into a single merge, the more rows per second we get. The application server runs multiple threads and uses all the memory and CPU on its dedicated server.
Follow a methodology like Waits and Queues to identify the bottlenecks. That's exactly what it is designed for. Once you have identified the bottleneck, you can judge whether it is a hardware provisioning and calibration issue (and if so, which piece of hardware is the bottleneck) or whether it is something else.
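As a starting point for Waits and Queues, the aggregated wait statistics show where SQL Server spends its time. The DMV query below is standard; the JDBC wrapper and connection string are just one way to run it:

    import java.sql.*;

    public class TopWaits {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:sqlserver://localhost;databaseName=master;integratedSecurity=true";
            // Aggregated since the last service restart; look for IO-, lock- or CPU-related wait types.
            String sql = "SELECT TOP 10 wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms "
                       + "FROM sys.dm_os_wait_stats ORDER BY wait_time_ms DESC";
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    System.out.printf("%-40s tasks=%d wait_ms=%d signal_ms=%d%n",
                            rs.getString("wait_type"), rs.getLong("waiting_tasks_count"),
                            rs.getLong("wait_time_ms"), rs.getLong("signal_wait_time_ms"));
                }
            }
        }
    }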
The basic idea is to avoid random disk access, for both reads and writes. Without doing any analysis, a 50 GB database needs at least 50 GB of RAM. Then you have to make sure indexes are on a separate spindle from the data and the transaction logs, that you write as late as possible, and that critical tables are split over multiple spindles. Are you doing all that?

Take advantage of multiple cores executing SQL statements

I have a small application that reads XML files and inserts the information into a SQL database.
There are ~300,000 files to import, each one with ~1,000 records.
I started the application on 20% of the files and it has been running for 18 hours now; I hope I can improve this time for the rest of the files.
I'm not using a multi-threaded approach, but since the computer I'm running the process on has 4 cores, I was thinking of doing so to get some performance improvement (although I guess the main problem is the I/O and not only the processing).
I was thinking of using the BeginExecuteNonQuery() method on the SqlCommand object I create for each insertion, but I don't know whether I should limit the maximum number of simultaneous threads (nor do I know how to do it).
What's your advice to get the best CPU utilization?
Thanks
If I understand you correctly, you are reading those files on the same machine that runs the database. Although I don't know much about your machine, I bet that your bottleneck is disk IO. This doesn't sound terribly computation intensive to me.
Have you tried using SqlBulkCopy? Basically, you load your data into a DataTable instance, then use the SqlBulkCopy class to load it into SQL Server. It should offer a HUGE performance increase without as much change to your current process as using bcp or another utility.
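SqlBulkCopy and DataTable are .NET APIs; the key idea either way is to send many rows per round trip instead of issuing one command per record. As a rough sketch of that batching idea in JDBC terms (table and column names are made up; a real bulk copy API is generally faster still):

    import java.sql.*;

    public class BatchedInsert {
        // Sketch: insert many parsed records per round trip instead of one command per row.
        static void insertRecords(Connection conn, java.util.List<String[]> records) throws SQLException {
            String sql = "INSERT INTO dbo.Records (Field1, Field2) VALUES (?, ?)"; // hypothetical table
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                int pending = 0;
                for (String[] rec : records) {
                    ps.setString(1, rec[0]);
                    ps.setString(2, rec[1]);
                    ps.addBatch();
                    if (++pending == 1_000) {   // roughly one batch per source file
                        ps.executeBatch();
                        pending = 0;
                    }
                }
                ps.executeBatch();
            }
            conn.commit();
        }
    }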
Look into bulk insert.
Imports a data file into a database table or view in a user-specified format.
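For reference, a BULK INSERT statement looks roughly like this (table name, path, and options are hypothetical; the path is resolved on the SQL Server machine, and the statement can just as well be run from SSMS as from JDBC):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class RunBulkInsert {
        public static void main(String[] args) throws Exception {
            String url = "jdbc:sqlserver://localhost;databaseName=MyDb;integratedSecurity=true";
            // BULK INSERT runs server-side, so the file must be readable by the SQL Server service.
            String sql = "BULK INSERT dbo.Records "
                       + "FROM 'C:\\data\\records0001.dat' "
                       + "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', TABLOCK, BATCHSIZE = 100000)";
            try (Connection conn = DriverManager.getConnection(url);
                 Statement stmt = conn.createStatement()) {
                stmt.executeUpdate(sql);
            }
        }
    }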

Spark RDD.saveAsTextFile writing empty files to S3

I'm trying to execute a map-reduce job using Spark 1.6 (spark-1.6.0-bin-hadoop2.4.tgz) that reads input from and writes output to S3.
The reads are working just fine with: sc.textFile("s3n://bucket/path/to/file/file.gz")
However, I'm having a bunch of trouble getting the writes to work. I'm using the same bucket to output the files: outputRDD.saveAsTextFile("s3n://bucket/path/to/output/")
When my input is extremely small (< 100 records), this seems to work fine. I'm seeing a part-NNNNN file written per partition, with some of those files having 0 bytes and the rest being under 1 KB. Spot checking the non-empty files shows the correctly formatted map-reduce output. When I move to a slightly bigger input (~500 records), I'm seeing the same number of part-NNNNN files (my number of partitions is constant for these experiments), but each one is empty.
When I was experimenting with much bigger data sets (millions of records), my thought was that I was exceeding some S3 limits which was causing this problem. However, 500 records (which amounts to ~65 KB zipped) is still a trivially small amount of data that I would think Spark and S3 should handle easily.
I've tried using the S3 Block FileSystem instead of the S3 Native FileSystem as outlined here, but I get the same results. I've turned on logging for my S3 bucket, but I can't seem to find a smoking gun there.
Has anyone else experienced this? Or can otherwise give me a clue as to what might be going wrong?
Turns out I was working on this too late at night. This morning, I took a step back and found a bug in my map-reduce which was effectively filtering out all the results.
You should use coalesce before saveAsTextFile.
From the Spark programming guide:
Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.
e.g.:
outputRDD.coalesce(100).saveAsTextFile("s3n://bucket/path/to/output/")