While using Get Data from xml input step along with prune path to handle large file still causing memory issue in server - pentaho

I am loading 80Mb xml file to table in local job completed within 10 min but in server its performance slow and occupying entire JVM memory
I tried prune path to handle large files step mention the higher level path its working in local in server its causing slow down server.

Related

Memory issues with GraphDB 7.0

I am trying to load a dataset to GraphDB 7.0. I wrote a Python script to transform and load the data on Sublime Text 3. The program suddenly stopped working and closed, the computer threatened to restart but didn't, and I lost several hours worth of computing as GraphDB doesn't let me query the inserts. This is the error I get on GraphDB:
The currently selected repository cannot be used for queries due to an error:
org.openrdf.repository.RepositoryException: java.lang.RuntimeException: There is not enough memory for the entity pool to load: 65728645 bytes are required but there are 0 left. Maybe cache-memory/tuple-index-memory is too big.
I set the JVM as follows:
-Xms8g
-Xmx9g
I don't exactly remember what I set as the values for the cache and index memories. How do I resolve this issue?
For the record, the database I need to parse has about 300k records. The program shut shop at about 50k. What do I need to do to resolve this issue?
Open the workbench and check the amount of memory you have given to cache memory.
Xmx should be a value that is enough for
cache-memory + memory-for-queries + entity-pool-hash-memory
sadly the latter cannot be calculated easily because it depends on the number of entities in the repository. You will either have to:
Increase the java memory with a bigger value for Xmx
Decrease the value for cache memory

Optimize tempdb log files

Have a dedicated DB server with 16 cores and 200+ GB of memory.
Use tempdb a lot but it typically stays under 4 GB
Recently added a dedicated SSD stripe for tempdb
Based on this page should create multiple files
Optimizing tempdb Performance
Understand multiple row files.
Here is my question:
Should I also create multiple tempdb log files?
It does say "create one data file for each CPU".
So my thought is that data means row (not log) files.
No, as with all databases, SQL server can only use one log file at a time so there is no benefit at all in having multiple log files.
The best thing you can do with log files really is keep them on separate drives to the data files as they have different IO requirements, pre-size them so they don't have to auto grow and if they do have to autogrow, make sure they do so at a sensible level to manage the number of virtual log files that are created inside them.

Node-Local Map reduce job

I am currently attempting to write a map-reduce job where the input data is not in HDFS and cannot be loaded into HDFS basically because the programs using the data cannot use data from HDFS and there is too much to copy it into HDFS, at least 1TB per node.
So I have 4 directories on each of the 4 nodes in my cluster. Ideally I would like my mappers to just receive the paths for these 4 local directories and read them, using something like file:///var/mydata/... and then 1 mapper can work with each directory. i.e. 16 Mappers in total.
However to be able to do this I need to ensure that I get exactly 4 mappers per node and exactly the 4 mappers which have been assigned the paths local to that machine. These paths are static and so can be hard coded into my fileinputformat and recordreader, but how do I guarantee that given splits end up on a given node with a known hostname. If it were in HDFS I could use a varient on FileInputFormat setting isSplittable to false and hadoop would take care of it but as all the data is local this causes issues.
Basically all I want is to be able to crawl local directory structures on every node in my cluster exactly once, process a collection of SSTables in these directories and emit rows (on the mapper), and reduce the results (in the reduce step) into HDFS for further bulk processing.
I noticed that the inputSplits provide a getLocations function but I believe that this does not guarantee locality of execution, only optimises it and clearly if I try and use file:///some_path in each mapper I need to ensure exact locality otherwise I may end up reading some directories repeatedly and other not at all.
Any help would be greatly appreciated.
I see there are three ways you can do it.
1.) Simply load the data into HDFS, which you do not it want to do. But it is worth trying as it will be useful for future processing
2.) You can make use of NLineInputFormat. Create four different files with the URLs of the input files in each of your node.
file://192.168.2.3/usr/rags/data/DFile1.xyz
.......
You load these files into HDFS and write your program on these files to access the data data using these URLs and process your data. If you use NLineInputFormat with 1 line. You will process 16 mappers, each map processing an exclusive file. The only issue here, there is a high possibility that the data on one node may be processed on another node, however there will not be any duplicate processing
3.) You can further optimize the above method by loading the above four files with URLs separately. While loading any of these files you can remove the other three nodes to ensure that the file exactly goes to the node where the data files are locally present. While loading choose the replication as 1 so that the blocks are not replicated. This process will increase the probability of the maps launched processing the local files to a very high degree.
Cheers
Rags

Will doing fork multiple times affect performance?

I need to read log files (.CSV) using fastercsv and save the contents of it in a db (each cell value is a record). The thing is there are around 20-25 log files which has to be read daily and those log files are really large (each CSV file is more then 7Mb). I had forked the reading process so that user need not have to wait a long time but still reading 20-25 files of that size is taking time (more then 2hrs). Now I want to fork reading of each file i.e there will be around 20-25 child process getting created, my question is can I do that? If yes will it affect the performance and is fastercsv able to handle this?
ex:
for report in #reports
pid = fork {
.
.
.
}
Process.dispatch(pid)
end
PS:I'm using rails 3.0.7 and Its going to happen in server which is running in amazon's large instance(7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of local instance storage, 64-bit platform)
If the storage is all local (and I'm not sure you can really say that if you're in the cloud), then forking isn't likely to provide a speedup because the slowest part of the operation is going to be disc I/O (unless you're doing serious computation on your data). Hitting the disc via several processes isn't going to speed that up at once, though I suppose if the disc had a big cache it might help a bit.
Also, 7MB of CSV data isn't really that much - you might get a better speedup if you found a quicker way to insert the data. Some databases provide a bulk load function, where you can load in formatted data directly, or you could turn each row into an INSERT and file that straight into the database. I don't know how you're doing it at the moment so these are just guesses.
Of course, having said all that, the only way to be sure is to try it!

Priority of SQL Server log files

I have SQL SERVER 2005 standard edition.
I have a "small" but fast hard drive and "big" and slow hard drive. Sometimes there is not enough space on the "small" drive for my SQL server database log file. So I restricted the size of the log file on the fast drive and created another one on slow drive. I hoped that SQL server would use first log file on the fast drive till it was full and then switched to second one. But I see that SQL server started with second log file.
Question: is it possible to specify which log file must be used first?
Thanks!
There is a way. For example: set file grow in 100 MB for the "fast" log and 1 MB for the "slow" log. Then they will be filling in proportion of file frow setting (100/1 in this case). When "fast" log file will be full, only "slow" log will be working. This is not an ideal solution, but at least something.
No Multiple log files don’t really help
Edit:
Better to detach, drop LDFs, single file attach though...