Speed Up Hot Folder Data Upload - sap

I have CSV file of around 188MB. When I try to upload data using hot folder technique its taking too much time 10-12 hrs. How can I speedup the data upload?
Thanks

Default value of impex.import.workers is 1. Try to change this value. And I recommend making performance test with a bit smaller file first, than 188Mb (just to get swift results)
Adjust the number of impex threads on the backoffice server to speed up ImpEx file processing. It is recommended that you start with it equal to the number of cores available on a backoffice node. You should not adjust it any higher than 2 * number of cores, and this is only if the IMPEX processes will be the only item running on the node. The actual value could be somewhere in between and will only be determined by testing and analyzing the number of other processes, jobs, apps running on your server to ensure you are not maxing out CPU.
NOTE: this value could be higher for lower environments since Hybris will likely be the only process running.
Taken from Tuning Parameters - Hybris Wiki

Related

How much maximum thread capacity of jmeter command line mode?

I have to test load testing for 32000 users, duration 15 minutes. And I have run it on command line mode. Threads--300, ramp up--100, loop 1. But after showing some data, it is freeze. So I can't get the full report/html. Even i can't run for 50 users. How can I get rid of this. Please let me know.
From 0 to 2147483647 threads depending on various factors including but not limited to:
Hardware specifications of the machine where you run JMeter
Operating system limitations (if any) of the machine where you run JMeter
JMeter Configuration
The nature of your test (protocol(s) in use, the size of request/response, presence of pre/post processors, assertions and listeners)
Application response time
Phase of the moon
etc.
There is no answer like "on my macbook I can have about 3000 threads" as it varies from test to test, for GET requests returning small amount of data the number will be more, for POST requests uploading huge files and getting huge responses the number will be less.
The approach is the following:
Make sure to follow JMeter Best Practices
Set up monitoring of the machine where you run JMeter (CPU, RAM, Swap usage, etc.), if you don't have a better idea you can go for JMeter PerfMon Plugin
Start your test with 1 user and gradually increase the load at the same time looking into resources consumption
When any of monitored resources consumption starts exceeding reasonable threshold, i.e. 80% of maximum available capacity stop your test and see how many users were online at this stage. This is how many users you can simulate from particular this machine for particular this test.
Another machine or test - repeat from the beginning.
Most probably for 32000 users you will have to go for distributed testing
If your test "hangs" even for smaller amount of users (300 could be simulated even with default JMeter settings and maybe even in GUI mode):
take a look at jmeter.log file
take a thread dump and see what threads are doing

jmeter Load Test Serevr down issues

I was used a load of 100 using ultimate thread group for execution in NON GUI Mode .
The Execution takes place around 5 mins. only . After that my test environment got shut down. I am not able to drill down the issues. What could be the reason for server downs. my environment supports for 500 users.
How do you know your environment supports 500 users?
100 threads don't necessarily map to 100 real users, you need to consider a lot of stuff while designing your test, in particular:
Real users don't hammer the server non-stop, they need some time to "think" between operations. So make sure you add Timers between requests and configure them to represent reasonable think times.
Real users use real browsers, real browsers download embedded resources (images, scripts, styles, fonts, etc) but they do it only once, on subsequent requests the resources are being returned from cache and no actual request is being made. Make sure to add HTTP Cache Manager to your Test Plan
You need to add the load gradually, this way you will be able to state what was amount of threads (virtual users) where response time start exceeding acceptable values or errors start occurring. Generate a HTML Reporting Dashboard, look into metrics and correlate them with the increasing load.
Make sure that your application under test has enough headroom to operate in terms of CPU, RAM, Disk space, etc. You can monitor these counters using JMeter PerfMon Plugin.
Check your application logs, most probably they will have some clue to the root cause of the failure. If you're familiar with the programming language your application is written in - using a profiler tool during the load test can tell you the full story regarding what's going on, what are the most resources consuming functions and objects, etc.

flink streaming or batch processing

I am tasked with redesigning an existing catalog processor and the requirement goes as belowRequirement I have 5 to 10 vendors(each vendor can have multiple stores) who would provide me with 'XML' file per store. Basically, 1 products xml file per Store, and multiple Store files per Vendor. Max file size can be 500 MB and min can be 100 MB Avg products per file could be 100,000.
Sample xml format could be like this ... ... ...
It doesnt take more than 30 mins to download the file per store, and these files are updated once per day or every 3 to 6 hours.
Now priority requirement is that, the product details are highly unorganized and these files have to organized, processed(10+ processes) and converted to another common object(json) and then file stored in Cassandra.
My technology head advised me to design with Apache Flink and Kafka on top of HDFS, where flink directly stream the files from the vendor servers and start processing them while streaming.
My view was that, either case the files are of finite size and there is not much need to stream them. So thought of having a standalone scheduler come downloader to download and load the files to HDFS. As soon as the files are loaded to HDFS, I can trigger the Flink processing and store the same in Cassandra.
My question here is that, knowing the files are of finite size and finite counts irrespsective of the number of vendors, Is stream processing a overkill or a Batch processing would be a latency burden later?
The question is highly dependent on the tool you will use. If you go for Flink I believe that using the stream is fine and won't create problem in the long run. If you write your functions and jobs properly, moving from DataStream API to DataSet API would be easy, if needed. Batch here introduces an useless delay and without further informations doesn't seem the appropriate approach. I believe it would work fine anyway but it's not clear if latency is a strict requirement.
That said, I believe Flink in itself is an overkill. In this particular use case a more traditional like Spark would be a better option in terms of usability but if you want to invest on Flink, it's totally fine and given the use case, I don't think you will need any particular library that is present/integrated with spark but missing on Flink.

Processing data while it is loading

We have a tool which loads data from some optical media, and once it's all copied to the hard drive runs it through a third-party tool for processing. I would like to optimise this process so each file is processed as it is read in. Trouble is, the third-party tool (which naturally I cannot change) has a 12 second startup overhead. What is the best way I can deal with this, in terms of finishing the entire process as soon as possible? I can pass any number of files to the processing tool in each run, so I need to be able to determine exactly when to run the tool to get the fastest result overall. The data being copied could be anything from one large file (which can't be processed until it's fully copied) to hundreds of small files.
The simplest would be to create and run 2 threads, one that runs the tool and one that loads data. Start 12 seconds timer and trigger both threads. Upon each file load completion check the passed time. If 12 seconds passed, fetch the data into the thread running the tool. Restart loading the data in parallel to processing of previous bulk. Once previous bulk processing completes restart the 12 sec timer and continue checking it upon every file load completion. Repeat till no more data remains.
For better results a more complex solution might be required. You can do some benchmarking to get an evaluation of average data loading time. Since it might be different for small and large files, several evaluations may be needed for different categories of files (according to size). Optimal resources utilization would be the one that processes the data in the same rate the new data arrives. Processing time includes the 12 seconds startup. The benchmarking should give you a ratio of processing threads number vs. reading threads number (you can also decrease/increase the number of active reading threads according to the incoming file sizes). Actually, it's a variation of producer-consumer problem with multiple producers and consumers.

Will doing fork multiple times affect performance?

I need to read log files (.CSV) using fastercsv and save the contents of it in a db (each cell value is a record). The thing is there are around 20-25 log files which has to be read daily and those log files are really large (each CSV file is more then 7Mb). I had forked the reading process so that user need not have to wait a long time but still reading 20-25 files of that size is taking time (more then 2hrs). Now I want to fork reading of each file i.e there will be around 20-25 child process getting created, my question is can I do that? If yes will it affect the performance and is fastercsv able to handle this?
ex:
for report in #reports
pid = fork {
.
.
.
}
Process.dispatch(pid)
end
PS:I'm using rails 3.0.7 and Its going to happen in server which is running in amazon's large instance(7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of local instance storage, 64-bit platform)
If the storage is all local (and I'm not sure you can really say that if you're in the cloud), then forking isn't likely to provide a speedup because the slowest part of the operation is going to be disc I/O (unless you're doing serious computation on your data). Hitting the disc via several processes isn't going to speed that up at once, though I suppose if the disc had a big cache it might help a bit.
Also, 7MB of CSV data isn't really that much - you might get a better speedup if you found a quicker way to insert the data. Some databases provide a bulk load function, where you can load in formatted data directly, or you could turn each row into an INSERT and file that straight into the database. I don't know how you're doing it at the moment so these are just guesses.
Of course, having said all that, the only way to be sure is to try it!