Pentaho Text File Input step crashing (out of memory)

I am using Pentaho to read a very large file: 11 GB.
The process sometimes crashes with an out-of-memory exception, and sometimes it just says the process was killed.
I am running the job on a machine with 12 GB of RAM and giving the process 8 GB.
Is there a way to run the Text File Input step with some configuration so that it uses less memory? Maybe use the disk more?
Thanks!

Open up spoon.sh/.bat, or the pan/kitchen .sh or .bat scripts, and change the -Xmx figure; search for JAVAMAXMEM. Even though you have spare memory, unless Java is allowed to use it, it won't work. Although, to be fair, in your example above I can't really see why or how it would be consuming much memory anyway!
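Depending on the PDI version, the launcher script exposes either a JAVAMAXMEM value (in megabytes) or a PENTAHO_DI_JAVA_OPTIONS line; the 8 GB figures below are just examples, so adjust them for your machine:
JAVAMAXMEM="8192"
PENTAHO_DI_JAVA_OPTIONS="-Xms1g -Xmx8g"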

Related

Why does Nano with error.log open use 40% of RAM?

Edit: I should clarify, that's 40% of 2 GB of RAM.
I just happened to catch this on my server: I had used nano earlier to open an error log and it was still open, I don't know for how long. After killing that task, my RAM usage dropped from just over 1 GB to 250 MB.
I remember coming across this before somewhere, and I want to know how to prevent/avoid it in the future. I like nano for its simplicity, but yeah, I guess I should be sure to kill the process or something.
I will have to look into remote status updates or something on the server's "livelihood", haha.
Maybe it's because error.log is a big file (you don't say how big it is).
Did you try using a pager like less on it?
less error.log
You probably don't want to edit (i.e. have the opportunity to change) that error.log file, you just want to look inside it (with a terminal pager like less, or more, or most); a pager uses less memory than an editor, because it does not enable you to change the file.
BTW, consider tuning your logrotate(8) configuration.
Notice that nano, like all editors, needs to keep the content of the edited file in some complex data structures, organized so that modifications are efficient. This explains why it takes a lot of memory. Since nano is free software (and so is less), you could study its source code for more details.
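On the logrotate suggestion: a minimal stanza along these lines (the path and rotation schedule here are just assumptions) keeps error.log from growing without bound:
/var/log/myapp/error.log {
    weekly
    rotate 4
    compress
    missingok
    notifempty
}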

How to resume loading in PhpMyAdmin?

I am using XAMPP and phpMyAdmin and I'm trying to load the English Wikipedia dump. Since the file is so big (1.7 GB), it takes a lot of time. I'm wondering if there is any way to resume the loading process. I have no problem with timeouts or anything like that. The problem is that if my Firefox crashes for any reason, the process must start from scratch.
The "allow interrupt" option is already checked. But the problem is that for such a big file, it's really hard to expect the import to finish without any interruption. If the laptop is shut down or restarted, the process is repeated from the beginning. Is there any way to solve this problem?
In the meantime, I am using
$cfg['UploadDir'] = 'upload';
and load the file from the upload directory on my computer.
Thanks in advance
First, I would recommend against using phpMyAdmin for such a large file. You're going to be constrained by PHP/Apache resource limits for things such as execution time and memory used (or, apparently, some Firefox resource on the client side), to a degree that, even if it works properly, the import will have to be done in so many small chunks that it's just not ideal. Even using the UploadDir functionality, you're going to be limited in ways that make it non-ideal to import your file this way. I suggest using the command-line tool for importing a file of this size.
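For reference, a typical command-line import looks something like the line below; the database name wikidb and dump file enwiki.sql are placeholders:
mysql -u root -p wikidb < enwiki.sql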
Secondly, if you're going to use phpMyAdmin anyway, it's better to uncompress the file and deal with the raw .sql. This is not intuitive, because of course you think the smaller filesize is better, but phpMyAdmin has to first uncompress the compressed file before it can begin working with it, which can cause problems such as the resource limits (or even running out of disk space). phpMyAdmin can pick up an aborted import, but if you're spending 95% of the execution time uncompressing the file each time, you're going to make very, very slow progress. Actually, I wonder if you're even getting the full file uncompressed on execution before PHP kills the process due to timeout.
phpMyAdmin can pick up execution part way through; you can select which line to begin the import from. So if you restart your computer part way through the import, you can use this to resume your partial import.

Can I use MRJob to process big files in local mode?

I have a relatively big file to process - around 10 GB. I suspect it won't fit into my laptop's RAM if MRJob decides to sort it in RAM or something similar.
At the same time, I don't want to set up Hadoop or EMR - the job is not urgent, and I can simply start the worker before going to sleep and get the results the next morning. In other words, I'm quite happy with local mode. I know the performance won't be perfect, but it's OK for now.
So, can it process such 'big' files on a single weak machine? If yes, what would you recommend doing (besides setting a custom tmp dir to point at the filesystem rather than at a ramdisk, which would be exhausted quickly)? Let's assume we use version 0.4.1.
I think RAM size won't be an issue with the Python runner of mrjob. The output of each step should be written out to a temporary file on disk, so it should not fill up the RAM, I believe. Dumping output to disk is the way it works with Hadoop anyway (and the reason it is slow: IO). So I would just run the job and see how it goes.
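To make that concrete, here is a minimal sketch of a job run with the local runner. The script name, input path, and per-line logic are made up for illustration, and the option for relocating mrjob's temporary directory has been renamed across releases (roughly base_tmp_dir in the 0.4.x era, local_tmp_dir later), so check the docs for your version:

# line_count.py - hypothetical minimal job: counts input lines
from mrjob.job import MRJob

class MRLineCount(MRJob):

    def mapper(self, _, line):
        # the runner streams the input line by line; emit one count per line
        yield 'lines', 1

    def reducer(self, key, counts):
        # counts is a generator, so the sum is computed without building a list
        yield key, sum(counts)

if __name__ == '__main__':
    MRLineCount.run()

Run it with the local runner and redirect the output to a file:
python line_count.py -r local /path/to/big_file.txt > line_count.out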
If RAM size is an issue, you can create enough swap space on your laptop to at least make it run, though it will be slow if the swap partition isn't on an SSD.
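On a typical Linux laptop, adding a swap file is only a few commands; the 16 GB size and /swapfile path are just examples:
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile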

Max file size for File.ReadAllLines

I need to read and process a text file. My processing would be easier if I could use the File.ReadAllLines method, but I'm not sure what the maximum file size is that can be read with this method without falling back to reading in chunks.
I understand that it depends on the computer's memory. But are there still any recommendations for an average machine?
On a 32-bit operating system, you'll get at most a contiguous chunk of memory of around 550 megabytes, which allows loading a file of about half that size. That goes downhill quickly after your program has been running for a while and the virtual memory address space gets fragmented; 100 megabytes is about all you can hope for then.
This is not an issue on a 64-bit operating system.
Since reading a text file one line at a time is just as fast as reading all lines, this should never be a real problem.
I've done stuff like this with 1-2GB before, albeit in Python. I do not think .NET would have a problem, though. But I would only do this for one-off processing.
If you are doing this on a regular basis, you might want to go through the file line by line.
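A sketch of the line-by-line approach, assuming a made-up input path and a placeholder Process method; File.ReadLines streams the file lazily, so only the current line has to fit in memory:

using System;
using System.IO;

class LineProcessor
{
    static void Main()
    {
        // hypothetical path; File.ReadLines yields one line at a time
        foreach (string line in File.ReadLines(@"C:\data\big.txt"))
        {
            Process(line);
        }
    }

    static void Process(string line)
    {
        // placeholder for the real per-line work
        Console.WriteLine(line.Length);
    }
}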
It's bad design unless you know the file sizes versus the computer memory that will be available to the running app.
A better solution would be to consider memory-mapped files. The mapped file acts as its own backing store (rather than going through the page file), so you don't have to pull the whole thing into your process's memory at once.
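A minimal sketch of that idea, again with a hypothetical path; the OS pages the mapped file in and out on demand, so the whole file never has to sit in the managed heap:

using System;
using System.IO;
using System.IO.MemoryMappedFiles;

class MappedReader
{
    static void Main()
    {
        long lines = 0;

        // map the existing file; nothing is read until the pages are touched
        using (var mmf = MemoryMappedFile.CreateFromFile(@"C:\data\big.txt", FileMode.Open))
        using (var stream = mmf.CreateViewStream())
        using (var reader = new StreamReader(stream))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                lines++; // placeholder for the real per-line work
            }
        }

        Console.WriteLine(lines);
    }
}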

Reading a whole file on a network drive the fast way (Windows, C/C++, C#, ...)

Lately I've been having problems reading big files on a network drive and I just can't pinpoint what I may be doing wrong. I tried both C++ (unmanaged) and C# and got about the same performance in both... which was somewhat abysmal.
Sometimes it will read a file on the network at 4 KB/s, but if the same file is located on the local HD it easily achieves the maximum data rate the HD can output. That is with reading 64 KB chunks at a time... I tried bigger buffers, up to insane sizes, and smaller ones, and it doesn't make much difference.
I tried async IO in C# with BeginRead on the FileStream, and OVERLAPPED IO in C++, as well as synchronous reads, and they all had the same problem: being slow over the network.
The only solution we came up with is to copy the file to the local HD with the OS CopyFile function before actually reading it, but I'm not too satisfied with this approach. It just seems like CopyFile is doing something we are not, which makes it incredibly faster than our approach.
Does anyone have a clue as to why this is?
We would have to guess, since you aren't showing us your code. My guess is that the Windows file copy opens the file with the FILE_FLAG_SEQUENTIAL_SCAN flag, which in turn causes the file system/cache manager to choose optimal block sizes and submit read requests in anticipation of read calls that haven't been issued yet.
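One way to test that theory from C# (the UNC path and buffer size are just examples) is to open the stream with FileOptions.SequentialScan, which corresponds to FILE_FLAG_SEQUENTIAL_SCAN:

using System;
using System.IO;

class SequentialReader
{
    static void Main()
    {
        const int bufferSize = 1 << 20; // 1 MB buffer; tune as needed
        long total = 0;

        // SequentialScan hints the cache manager to read ahead aggressively,
        // the same hint CopyFile is suspected of using above.
        using (var stream = new FileStream(@"\\server\share\big.bin",
                   FileMode.Open, FileAccess.Read, FileShare.Read,
                   bufferSize, FileOptions.SequentialScan))
        {
            var buffer = new byte[bufferSize];
            int read;
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                total += read; // placeholder for real per-chunk processing
            }
        }

        Console.WriteLine("Read {0} bytes", total);
    }
}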
We can only assume that you have really tried all possible methods of reading/writing. Have you been reading synchronously or asynchronously? Did you try I/O completion ports? Or the ReadFileEx() function? I would guess that the Windows CopyFile() function detects that you want to read a file from the network and uses a different method for reading than it would for local disk access.
If you have really exhausted all possible reading methods, and you really need this solved, then I would suggest checking out a bit what the CopyFile() function is doing. There are numerous tools for doing that, e.g. this one (or some other -- links on the same page).