Same Pandas Data Frames are different in size - pandas

I have two data frames, A and B, saved as pickles on both Computer 1 (running Ubuntu) and Computer 2 (running Mac OS X).
I merge A and B into a data frame C and save it as X.h5 (HDF format) on both computers.
However, X.h5 on Computer 1 is smaller (188MB) than X.h5 on Computer 2 (196MB). I copied X.h5 from Computer 1 to Computer 2 and compared them there: the two files still differ in size (and in their binary content). But when I read both with pandas, X_h5_computer1.equals(X_h5_computer2) returns True.
Why does this happen? The size difference is significant.
Thank you

Related

Reading and handling many small CSV-s to concatenate one large Dataframe

I have two folders, each containing about 8,000 small csv files. One has an aggregated size of around 2GB and the other around 200GB.
The files are stored like this so that they are easier to update on a daily basis. However, when I conduct EDA, I would like them to be assigned to a single variable. For example:
import os
import pandas as pd
path = "some random path"
df = pd.concat([pd.read_csv(f"{path}//{files}") for files in os.listdir(path)])
Reading the 2GB dataset takes much less time on my local machine than on the supercomputer cluster. And it is impossible to read the 200GB dataset on the local machine without some sort of scaled-Pandas solution. The situation does not seem to improve on the cluster, even when using popular open-source tools like Dask and Modin.
Is there an effective way to read those csv files in this situation?
Q: "Is there an effective way to read those csv files ... ?"
A: Oh, sure, there is:
The CSV format ( standardisation was attempted in RFC4180 ) is not unambiguous, and the standard is not obeyed under all circumstances ( commas inside fields, header present or not ), so some caution & care is needed here. Given you are your own data curator, you should be able to decide on plausible steps for handling your own data properly.
So, the as-is state is :
# in <_folder_1_>
:::::::: # 8000 CSV-files ~ 2GB in total
||||||||||||||||||||||||||||||||||||||||||| # 8000 CSV-files ~ 200GB in total
# in <_folder_2_>
Speaking of efficiency, the O/S coreutils provide the best, most stable, proven and efficient tools ( as system tools have always been ) for the phase of merging the content of thousands and thousands of plain CSV-files :
###################### if need be,
###################### use an in-place remove of all CSV-file headers first
###################### ( GNU sed syntax; globbing directly avoids parsing ls output ) :
for F in *.csv; do sed -i '1d' "$F"; done
This helps in cases where we cannot avoid headers on the CSV-exporter side. It works like this :
(base):~$ cat ?.csv
HEADER
1
2
3
HEADER
4
5
6
HEADER
7
8
9
(base):~$ for i in $( ls ?.csv ); do sed -i '1d' $i; done
(base):~$ cat ?.csv
1
2
3
4
5
6
7
8
9
Now, the merging phase :
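###################### join ( write outside the *.csv glob first, so cat never reads its own output )
cat *.csv > ___all_CSVs_JOINED.tmp && mv ___all_CSVs_JOINED.tmp ___all_CSVs_JOINED.csv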
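For readers who would rather stay in Python, the header-stripping and joining steps can be sketched with the standard library alone. This is only a minimal sketch under stated assumptions, not the method used in the answer: the helper name join_csvs, its has_header flag and the demo file names are all made up for illustration, and it assumes that no header contains an embedded newline inside a quoted field.

```python
import tempfile
from pathlib import Path


def join_csvs(folder, out_path, has_header=True):
    """Stream every *.csv in `folder` into `out_path`, skipping header lines.

    Hypothetical helper (name and parameters are illustrative only).
    Returns the number of data lines written.
    """
    out_path = Path(out_path)
    written = 0
    with out_path.open("w", newline="") as out:
        for src in sorted(Path(folder).glob("*.csv")):
            if src.resolve() == out_path.resolve():
                continue  # never read the output file back into itself
            with src.open("r", newline="") as fh:
                if has_header:
                    next(fh, None)  # drop the first line, like sed '1d'
                for line in fh:
                    # Guard against a file whose last line lacks a newline.
                    out.write(line if line.endswith("\n") else line + "\n")
                    written += 1
    return written


# Tiny demo on throw-away files (the file names are arbitrary).
demo_dir = Path(tempfile.mkdtemp())
(demo_dir / "a.csv").write_text("HEADER\n1\n2\n")
(demo_dir / "b.csv").write_text("HEADER\n3\n4")
joined_path = demo_dir / "joined.txt"
lines_written = join_csvs(demo_dir, joined_path)
joined_text = joined_path.read_text()
```

Only one line is held in memory at a time, so the same loop should cope with the 200GB folder, although it is unlikely to beat cat for raw throughput.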
Given the nature of the said file-storage policy, performance can be boosted by using more processes that handle the small files and the large files independently, as defined above, with the logic put inside a pair of conversion_script_?.sh shell-scripts :
parallel --jobs 2 conversion_script_{1}.sh ::: $( seq -f "%1g" 1 2 )
The transformation for the sake of removing the CSV-headers is a "just"-[CONCURRENT] flow of processing, while the joining is pure-[SERIAL] ( for a larger number of files, it might become interesting to use a multi-staged tree of trees, using several stages of [SERIAL]-collections of [CONCURRENT]-ly pre-processed leaves, yet for just 8000 files, not knowing the actual file-system details, the latency-masking from a just-[CONCURRENT] processing of both directories independently will be fine to start with ).
Last but not least, the final pair of ___all_CSVs_JOINED.csv files are safe to open in a way that prevents moving all disk-stored data into RAM at once ( using a chunk-size-fused file-reading-iterator, avoiding RAM-spillovers by using mmap-ed mode as a context manager ) :
with pandas.read_csv( "<_folder_1_>//___all_CSVs_JOINED.csv",
                      # sep / delimiter are left at their defaults
                      header     = None,             # the header lines were removed above
                      ...
                      chunksize  = SAFE_CHUNK_SIZE,  # rows per chunk, sized to fit RAM
                      ...
                      memory_map = True,
                      ...
                      ) \
  as df_reader_MMAPer_CtxMGR:
    ...
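A complete, runnable variant of the chunked reader may help to see what happens inside the with-block. This is a minimal sketch under assumptions: the tiny demo file, the column names a and b and the CHUNK_ROWS value are invented for illustration, and using the reader as a context manager assumes pandas 1.2 or newer.

```python
import os
import tempfile

import pandas as pd

# Build a small header-less CSV so the sketch is runnable on its own.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "joined.csv")
with open(csv_path, "w") as fh:
    for i in range(1, 10):
        fh.write(f"{i},{i * 10}\n")

CHUNK_ROWS = 4  # assumed value; tune it to the RAM you can spare

row_count = 0
column_sums = None
with pd.read_csv(csv_path, header=None, names=["a", "b"],
                 chunksize=CHUNK_ROWS, memory_map=True) as reader:
    for chunk in reader:  # each chunk is an ordinary DataFrame
        row_count += len(chunk)
        sums = chunk.sum()
        column_sums = sums if column_sums is None else column_sums + sums

print(row_count, column_sums["a"], column_sums["b"])
```

Any aggregate that can be updated chunk by chunk (counts, sums, min/max) fits this pattern; operations that need all rows at once, such as a global sort, do not.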
When tweaking for ultimate performance, details matter and depend on physical hardware bottlenecks ( disk-I/O-wise, filesystem-wise, RAM-I/O-wise ), so due care may bring further improvements in minimising the repeatedly performed end-to-end processing times ( sometimes even by turning the data into a compressed/zipped form, in cases where CPU/RAM resources permit sufficient performance advantages over the limited performance of disk-I/O throughput : moving fewer bytes is so much faster that the CPU/RAM-decompression costs are still lower than those of moving 200+ [GB]s of uncompressed plain-text data ).
Details matter: tweak options, benchmark, tweak options, benchmark, tweak options, benchmark.
It would be nice to post your progress on testing the performance :
end-2-end duration of strategy ... [s] AS-IS now
end-2-end duration of strategy ... [s] with parallel --jobs 2 ...
end-2-end duration of strategy ... [s] with parallel --jobs 4 ...
end-2-end duration of strategy ... [s] with parallel --jobs N ... + compression ...
Keep us posted.

Why Rmarkdown shows different random numbers in pdf output than the ones in the Rmd file?

I call set.seed in the Rmd file to generate random numbers, but when I knit the document I get different random numbers. Here is a screenshot of the Rmd and PDF documents side by side.
In R 3.6.0 the internal algorithm used by sample() has changed. The default for a new session is
> set.seed(2345)
> sample(1:10, 5)
[1] 3 7 10 2 4
which is what you get in the PDF file. One can manually change to the old "Rounding" method, though:
> set.seed(2345, sample.kind="Rounding")
Warning message:
In set.seed(2345, sample.kind = "Rounding") :
non-uniform 'Rounding' sampler used
> sample(1:10, 5)
[1] 2 10 6 1 3
You have at some point made this change in your R session, as can be seen from the output of sessionInfo(). You can either change this back with RNGkind(sample.kind="Rejection") or by starting a new R session.
BTW, in general please include code samples as text, not as images.

unable to merge large files in r

I have run into a problem.
I have 10 large separate files (file type "File", without column headers), nearly 4GB in total, which require merging. I have been told they are pipe-delimited text files, so I added the file extension txt to each file, which I hope is not the problem. R Studio crashes when I use the following code...
multmerge = function(mypath){
  filenames = list.files(path=mypath, full.names=TRUE)
  datalist = lapply(filenames, function(x){read.csv(file=x, header=F, sep="|")})
  Reduce(function(x,y) {merge(x,y, all=T)}, datalist)
}
mymergeddata = multmerge("C://FolderName//FolderName")
or when I try to do something like this...
temp1 <- read.csv(file="filename.txt", sep="|")
:
temp10 <- read.csv(file="filename.txt", sep="|")
SomeData = Reduce(function(x, y) merge(x, y), list(temp1...,
temp10))
I am seeing errors such as
"Error: C stack usage is too close to the limit r" and
"In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
Reached total allocation of 8183Mb: see help(memory.size)"
Then, as I was writing this question, I saw that someone had asked a question on SO,
here, so I was wondering whether a SQL command can be used in R Studio or SSMS to merge these large files? If so, please can you advise me how to do this. I will keep looking around on the net.
If it can't, then what is the best method to merge these rather large files? Can this be achieved in R Studio, or is there an open-source tool?
I am working on a PC which has 64-bit Windows with 8GB RAM. I have included the R and SQL tags to see what options there are.
Thanks in advance if anyone can help me.
Your machine doesn't have enough memory for your selected operations.
You have 10 files ~ 4GB in total.
When you merge the 10 files you create another object which is also about 4GB, putting you very close to your machine's limit.
Your operating system and R and whatever else you're running also consume RAM so it's no surprise you run out of RAM.
I'd suggest taking a stepwise approach if you don't have access to a bigger machine:
- take the first two files and merge them.
- delete the file objects from R and keep only the merged one.
- load the third object and merge it with the earlier merger.
Repeat until done.

3D Graph in Octave/Matlab from a CSV file

I'm new to Octave/Matlab and I want to plot a 3D-Graph.
I was able to do so using a predefined formula, like this:
x=1:.1:5;
y=1:.1:5;
[xx,yy] = meshgrid(x,y);
z = sin(xx)+sin(yy);
mesh(x,y,z);
But now the question is how to do the same getting the data from a CSV (for example). I know I can use the function csvread, but the big question is how to format the CSV to contain such data.
An example of doing the same graph above but this time grabbing the data from Excel/CSV would be appreciated. Thanks!
Done! I was finally able to do it!
Here's how I did it:
1) I've created a file in Excel with the X values in the cells A2:A42, and the Y values in the cells B1:AP1 (so you form a rectangle).
2) Then in the cells in the middle I put the formula I want (i.e. =sin($A2)+sin(B$1) in cell B2, with column A and row 1 anchored so that it can be filled across the whole rectangle)
3) Saved the file as CSV (but separated by spaces!) and manually edited it to look this way (the way QtOctave opens matrix files, in Matlab it might be different). For example (note the extra space before each column):
# Created by Octave 3.2.4, Thu Jan 12 19:32:05 2012 ART <diego#notebook2>
# name: z
# type: matrix
# rows: 3
# columns: 3
 1 2 3
 4 5 6
 7 8 9
(if you're not sure how to do it, do what I did: create a simple matrix and export it to see what the exported file looks like!)
4) Octave has a function under Data -> Load matrix from file, which loads that kind of file. Alternatively, run this command (varname is the name of the resulting variable):
load("-text", "file-where-the-data-is", "varname")
5) Create the graph (ex is the name of the matrix I've just imported):
x=1:.1:5;
y=1:.1:5;
mesh(x,y,ex)
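The grid layout from step 1 can also be generated without Excel. The following is a minimal sketch in Python (not Octave, and not what I actually did): it writes the Y values across the first row, the X values down the first column, and sin(x)+sin(y) in between. The function name write_grid and the file name grid.csv are hypothetical, and the output is comma-separated, so it would still need converting to the space-separated Octave text format shown in step 3.

```python
import csv
import math
import os
import tempfile


def write_grid(path, xs, ys):
    """Write a header row of Y values, then one row per X value."""
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow([""] + list(ys))  # cell A1 stays empty
        for x in xs:
            writer.writerow([x] + [math.sin(x) + math.sin(y) for y in ys])


# 1, 1.1, ..., 5 gives 41 values, matching A2:A42 and B1:AP1 above.
axis = [round(1 + 0.1 * i, 1) for i in range(41)]
grid_path = os.path.join(tempfile.mkdtemp(), "grid.csv")
write_grid(grid_path, axis, axis)

# Read the file back to check its shape: 42 rows of 42 cells each.
with open(grid_path, newline="") as fh:
    rows = list(csv.reader(fh))
```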

Hadoop S3 No Space Left On Device

I am running a map reduce job that takes a small input (~3MB, list of integers of size z),
with a sparse matrix cache of size n x m, and basically outputs z sparse vectors of dimension (n x 1). The output here is pretty big (~2TB). I am running 20 m1.small nodes on Amazon EC2 with S3 storage as inputs and output.
However, I am getting an IOException: No space left on device.
The Hadoop logs show s3 bytes being written, but no files are created.
When I used a smaller input (smaller z), the output is correctly there after the job is done.
Thus, I believe that it runs out of temporary storage.
Is there a way to check where this temporary storage is?
Also, the funny thing is that the log says all the bytes are written to s3, but I see no files and don't know where these bytes are being written.
Thank you for your help.
Example code (I have also tried splitting it into separate map and reduce jobs, with the same error)
public void map(LongWritable key, Text value,
        Mapper<LongWritable, Text, LongWritable, VectorWritable>.Context context)
        throws IOException, InterruptedException
{
    // Assume the input is id \t number
    String[] input = value.toString().split("\t");
    int idx = Integer.parseInt(input[0]) - 1;
    // Some operations to do, but basically outputting a vector
    // Collect the output
    context.write(new LongWritable(idx), new VectorWritable(matrix.getColumn(idx)));
}
Amazon EMR supports a couple of Hadoop versions. These are the default values for 0.20.205:
hadoop.tmp.dir - /tmp/hadoop-${user.name} - A base for other temporary directories.
mapred.local.dir - ${hadoop.tmp.dir}/mapred/local - The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk i/o. Directories that do not exist are ignored.
mapred.temp.dir - ${hadoop.tmp.dir}/mapred/temp - A shared directory for temporary files.
Run a command such as du --max-depth=7 /home/xyz | sort -n against the hadoop.tmp.dir directory and check which directory is occupying the most space. Although the name hadoop.tmp.dir says temporary, it also stores system and data files.
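If the same check has to run from a script, the du | sort idea can be approximated in Python. This is a rough sketch under assumptions: the helper name dir_sizes and the demo directory layout are invented for illustration, and it sums apparent file sizes rather than allocated disk blocks, so the numbers can differ somewhat from what du reports.

```python
import os
import tempfile


def dir_sizes(root):
    """Return {directory: total bytes of the files at or below it}."""
    totals = {}
    # Walk bottom-up so child totals exist before their parent is visited.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        size = 0
        for name in filenames:
            try:
                size += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
        for sub in dirnames:
            size += totals.get(os.path.join(dirpath, sub), 0)
        totals[dirpath] = size
    return totals


# Demo on a throw-away tree that loosely mimics hadoop.tmp.dir.
demo_root = tempfile.mkdtemp()
local_dir = os.path.join(demo_root, "mapred", "local")
os.makedirs(local_dir)
with open(os.path.join(local_dir, "spill.out"), "wb") as fh:
    fh.write(b"x" * 1000)
with open(os.path.join(demo_root, "small.log"), "wb") as fh:
    fh.write(b"x" * 10)

sizes = dir_sizes(demo_root)
largest_first = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
for path, size in largest_first:
    print(size, path)
```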