Lambda out of memory with pandas dataframe

I am trying to convert a JSON file to CSV using Lambda.
I am using Pandas for this operation.
Initially I started with the following configuration:
File size: 5 MB
Memory: 128 MB
It took around 5 seconds to complete the conversion.
Then I increased the file size to 10 MB and started seeing some odd behavior; basically I am trying to benchmark this operation.
Sometimes the file is processed successfully, and sometimes the invocation times out with the message:
REPORT RequestId: 28e55591-e6a7-4344-b5bc-321bd03422b6 Duration: 900089.03 ms Billed Duration: 900000 ms Memory Size: 128 MB Max Memory Used: 129 MB
It can be clearly seen that this is a memory issue, but I am not able to understand the root cause. It would be great if someone could help me understand this behavior.
Sometimes the Lambda also gets re-triggered, and then the file gets processed.

It's due to your use of the pandas DataFrame: it needs a lot more memory to hold the data than just the size of the file itself. You can check how much memory the DataFrame needs with df.info(memory_usage='deep').
If you just need to convert a JSON file to CSV, a better way would be to use the stdlib modules csv and json and code the conversion yourself.
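As a minimal sketch of that suggestion (Python, assuming the input is a JSON array of flat objects; the file names and field layout here are hypothetical):

import csv
import json

# Load the JSON array (hypothetical file name).
with open("input.json") as src:
    records = json.load(src)

# Write one CSV row per object, using the first object's keys as the header.
with open("output.csv", "w", newline="") as dst:
    writer = csv.DictWriter(dst, fieldnames=records[0].keys())
    writer.writeheader()
    writer.writerows(records)

json.load still keeps the parsed document in memory, but it avoids the per-column overhead of a DataFrame; for really large inputs, a streaming JSON parser would be the next step.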

Is there a file size limit for the imported .dat file?

I am having a problem with large stacked images of several GB. I can directly open a 9 GB stacked image (dm4 format, 1000x1000x1000), but if I try to rotate it using a volume operation such as "rotate about x", GMS or DM exits automatically. I wrote a simple script that completes the operation with the Slice3 function and displays the result correctly, but I cannot save it! If I try to save the resulting stacked image, the software says "sorry" and forces me to close it.
OK, I think this file is too large for the software's capability. So I saved the original data to .dat format and wrote a Fortran code to rotate it, then saved the result as a .dat file. When I use the import function of GMS or DM, it only imports the first several hundred frames, not all of them.
How do I deal with this?
There certainly are size restrictions in both total size and maximum length along one dimension, but I don't think 1000 x 1000 x 1000 should be a limiting factor.
I just ran the following two scripts in order and saved the data on my GMS 3.4.3 without problem.
image big := RealImage("Big First",4,1000,1000,1000)
big = icol*sin(irow/iheight*100*pi())*10000+iplane
big.showimage()
image bigIn := GetFrontImage() // second script: take the stack created and shown by the first script
image bigOut := bigIn.Slice3(0,0,0, 1,1000,1,0,1000,1,2,1000,1)
bigOut.ShowImage()
Can you edit your question to include the script code that is failing for you, and any other useful information?

STM32F407VET FatFs f_close returns FR_DISK_ERR

I am interfacing an SD card (16 GB SanDisk Ultra microSD) with an STM32F407 microcontroller over the SDIO protocol, using ChaN's FatFs library. When I try to write data into an existing file, f_write returns FR_OK and reports the number of written bytes (equal to the number of bytes requested), but f_close() returns FR_DISK_ERR, and in the end the file is empty.
With more experimenting I found that if I format the microSD card with a 64 KB allocation unit size and the existing file already contains some text, then I am able to write 64 KB of data to the file; f_close() still returns FR_DISK_ERR, but the file is not empty and I can see the data under Windows 10.
If the existing file has no text in it, then I get an empty file, even though f_write returns FR_OK and f_close returns FR_DISK_ERR.
In short, when I use f_write on a text file I created from my PC, I can overwrite the content of that file up to 64 KB. But I can't get it to work with an empty file I created with f_open.
I came across a similar post with the same issue:
TMS320F2812 FatFs f_write returns FR_DISK_ERR
I tried the solutions given in that post, but they didn't work. Since my controller has 192 KB of RAM, I assume that is sufficient for the FatFs module. My stack size is around 13 KB and my heap size 4 KB, which should be more than enough for this application. The SD card is supplied with 3.3 V.
I went a little deeper into the code to see where the error occurs and found that I am getting an SD_ILLEGAL_CMD error while setting the block size for the card: f_close (ff.c) -> f_sync (ff.c) -> move_window (ff.c) -> disk_read (diskio.c) -> SD_ReadBlock (sdcard.c), and SD_ReadBlock returns SD_ILLEGAL_CMD while setting the block size.
Any solutions are appreciated. If more information is required, please feel free to ask and I will update the question.
ChaN FatFs version: R0.07e

ora2pg out of memory error - after every table

When I try to export data, ora2pg runs out of memory regardless of table size (even with empty tables):
Out of memory! ] 162926/498508267 rows (0.0%) on total estimated data (14 sec., avg: 11637 recs/sec)
Issuing rollback() due to DESTROY without explicit disconnect() of DBD::Oracle::db handle (DESCRIPTION=(ADDRESS=(PORT=1521)(HOST=192.168.0.42)
(PROTOCOL=tcp))(CONNECT_DATA=(SID=orcl))) at
/usr/local/lib/perl/5.18.2/DBD/Oracle.pm line 348.
Asking the author, I got this response:
You don't have enough memory. If you can't increase the memory size, then reduce the value of DATA_LIMIT in ora2pg.conf. Try 5000, and if that doesn't work, use 2500.
Opening ./config/ora2pg.conf and setting DATA_LIMIT 5000 solved the issue.
I originally tried adding more RAM, but doubling it from 2 GB to 4 GB did not help. Reducing DATA_LIMIT was the solution.
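For reference, the change lives in ora2pg.conf; an illustrative excerpt (the comment wording is mine, the directive and value are the ones used above):

# Number of rows fetched from Oracle and processed per batch.
# Lower values reduce ora2pg's memory footprint at the cost of speed.
DATA_LIMIT      5000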

how ext4 works with fallocate

Recently I have been testing the proper usage of the ext4 filesystem. What I expect is this:
when the system crashes, data whose write has already returned OK must not be lost, but metadata may be.
Here is my usage:
1. Call fallocate to allocate a certain amount of space:
fallocate(fd, 0, 0, 4*1024*1024); // 4 MB
2. Call fsync(fd) so that the data and metadata are written to disk.
3. Then I write to random offsets in the file in 4 KB chunks (random data, not zeros), with the O_DIRECT flag but without calling fsync. I log each offset whose write returns OK.
4. Check the logged offsets. At some offsets, reading 4 KB of data returns zeros. It seems those offsets are unused, as in a sparse (hole) file.
My questions are:
1. Why, after calling fallocate and fsync, does the metadata of the file still seem to indicate that some blocks are unused, so that reading them returns zeros? That is how I understand what I am seeing.
2. Is there another API I can call to make sure the allocated space in the file does not contain holes, so that after a write returns OK with O_DIRECT the data will not be lost even if the system crashes?
Thanks.
Only writing to the file space can eliminate the hole. Without writing, there is no dirty page and fsync simply does nothing.
I am wondering how you executed step 4. It seems that you did it after a deliberate crash, did you? If you read the data right after the write, without a crash, it should not be zero, provided you wrote non-zeros. If you read it after a crash, zeros can happen if a disk cache was involved. However, this kind of zero is not like a hole; the zeros are read from the disk (very probably the disk really contains zeros).
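To make the hole/unwritten-extent point concrete, here is a minimal sketch (Python on Linux, hypothetical file path, O_DIRECT and its alignment requirements left out for brevity): space reserved with fallocate reads back as zeros until it is actually written.

import os

path = "/tmp/fallocate_demo.bin"  # hypothetical demo file
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)

os.posix_fallocate(fd, 0, 4 * 1024 * 1024)  # reserve 4 MB of space
os.fsync(fd)                                # persist data and metadata

# The extents exist but are still "unwritten", so reads return zeros.
print(os.pread(fd, 16, 4096))               # b'\x00\x00...'

os.pwrite(fd, b"A" * 4096, 4096)            # writing initializes the extent
os.fsync(fd)                                # now the written data is durable
print(os.pread(fd, 16, 4096))               # b'AAAA...'

os.close(fd)

Until a block is actually written, ext4 keeps it flagged as unwritten, so it reads back as zeros just as a hole would.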

File reading and checksums in go. Difference between methods

Recently I have been creating checksums for files in Go. My code works with both small and big files. I tried two methods: the first uses ioutil.ReadFile("filename") and the second works with os.Open("filename").
Examples:
The first function works with io/ioutil and is fine for small files. When I try to copy a big file, my RAM gets blasted: for a 1.5 GB ISO it uses 3 GB of RAM.
func byteCopy(fileToCopy string) {
    file, err := ioutil.ReadFile(fileToCopy) // 1.5GB file
    omg(err)                                 // error handling function
    ioutil.WriteFile("2.iso", file, 0777)
    os.Remove("2.iso")
}
It's even worse when I want to create a checksum with crypto/sha512 and io/ioutil: it never finishes and aborts because it runs out of memory.
func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    fmt.Printf("%x", h.Sum(file))
}
When using the function below everything works fine.
func ioHash() {
    f, err := os.Open(iso) // iso is a big ~1.5GB file
    omg(err)               // error handling function
    defer f.Close()
    h := sha512.New()
    io.Copy(h, f)
    fmt.Printf("%x", h.Sum(nil))
}
My questions:
Why is the ioutil.ReadFile() function not working right? The 1.5 GB file should not fill my 16 GB of RAM. I don't know where to look right now.
Could somebody explain the differences between the two methods? I don't get it from reading the godoc and the examples.
Having usable code is nice, but understanding why it works is way above that.
Thanks in advance!
The following code doesn't do what you think it does.
func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    fmt.Printf("%x", h.Sum(file))
}
This first reads your 1.5 GB ISO. As jnml pointed out, it continuously allocates bigger and bigger buffers to hold it. In the end, the total buffer size is no less than 1.5 GB and no greater than 1.875 GB (by the current implementation).
However, after that you then make another buffer! h.Sum(file) doesn't hash file. It appends the current hash to file! This may or may not cause yet another allocation.
The real problem is that you then take that file, now with the hash appended, and print it with %x. fmt actually pre-computes the result using the same kind of buffer growing that jnml pointed out ioutil.ReadAll uses, so it keeps allocating bigger and bigger buffers to store the hex encoding of your file. Since each hex character encodes only 4 bits, every input byte becomes two output bytes, which means no less than a 3 GB buffer for that and no greater than 3.75 GB.
This means your active buffers may be as big as 5.625 GB. Combine that with the GC not being perfect and not removing all of the intermediate buffers, and it can very easily fill your available memory.
The correct way to write that code would have been:
func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    h.Write(file)
    fmt.Printf("%x", h.Sum(nil))
}
This doesn't do nearly the same number of allocations.
The bottom line is that ReadFile is rarely what you want to use. IO streaming (using readers and writers) is always the best way when it is an option. Not only do you allocate much less when you use io.Copy, you also hash and read the disk concurrently. In your ReadFile example, the two resources are used synchronously when they don't depend on each other.
ioutil.ReadFile is working right. It's your fault for abusing the system resources by using that function for things you know are huge.
ioutil.ReadFile is a handy helper for files you're pretty sure in advance are going to be small, like configuration files or most source code files. (Actually it is optimized for files <= 1e9 bytes, but that's an implementation detail and not part of the API contract. Your 1.5 GB file forces it to use slice growing and thus to allocate more than one big buffer for your data in the process of reading the file.)
Even your other approach using os.File is not okay. You definitely should be using the "bufio" package for sequential processing of large files, see bufio.NewReader.