unable to merge large files in r

I have run into a problem.
I have 10 large separate files without column headers, about 4GB in total, that need to be merged. Windows shows their type as "File"; I have been told they are pipe-delimited text files, so I added the .txt extension to each one, which I hope is not the problem. RStudio crashes when I use the following code:
multmerge = function(mypath){
  filenames = list.files(path=mypath, full.names=TRUE)
  datalist = lapply(filenames, function(x){read.csv(file=x, header=F, sep="|")})
  Reduce(function(x,y) {merge(x,y, all=T)}, datalist)
}
mymergeddata = multmerge("C://FolderName//FolderName")
or when I try to do something like this...
temp1 <- read.csv(file="filename.txt", sep="|")
:
temp10 <- read.csv(file="filename.txt", sep="|")
SomeData = Reduce(function(x, y) merge(x, y), list(temp1..., temp10))
I am seeing errors such as
"Error: C stack usage is too close to the limit r" and
"In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
Reached total allocation of 8183Mb: see help(memory.size)"
While writing this question I saw that someone had asked a related question on SO, so I was wondering whether SQL commands can be used in RStudio or SSMS to merge these large files. If so, how, and into what, can they be merged? Please advise me how to do this. I will keep looking around on the net.
If not, what is the best method to merge these rather large files? Can this be achieved in RStudio, or is there an open-source alternative?
I am working on a PC running 64-bit Windows with 8GB of RAM. I have included the R and SQL tags to see what options there are.
Thanks in advance if anyone can help me.

Your machine doesn't have enough memory for your selected operations.
You have 10 files ~ 4GB in total.
When you merge the 10 files you create another object which is also about 4GB, putting you very close to your machine's limit.
Your operating system, R, and whatever else you're running also consume RAM, so it's no surprise that you run out of it.
I'd suggest taking a stepwise approach if you don't have access to a bigger machine:
- take the first two files and merge them.
- delete the file objects from R and keep only the merged one.
- load the third object and merge it with the earlier merger.
Repeat until done.
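The stepwise idea can be pushed further so that only a small buffer is ever held in memory. Below is a minimal sketch in Python, assuming that "merging" means stacking the rows of files that share the same pipe-delimited column layout (the question does not say whether a key-based join is intended). The function name append_files and the buffer size are hypothetical.

```python
import shutil


def append_files(input_paths, output_path, buffer_bytes=1024 * 1024):
    """Append each input file to output_path, one file at a time.

    Only buffer_bytes of data are held in memory at any moment, so the
    combined size of the inputs does not matter.
    """
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as src:
                # Stream the file across in fixed-size blocks.
                shutil.copyfileobj(src, out, buffer_bytes)
                # If the file did not end with a newline, add one so the
                # next file starts on a fresh line.
                if src.tell() > 0:
                    src.seek(-1, 2)
                    if src.read(1) != b"\n":
                        out.write(b"\n")
```

It could be called as append_files(sorted(list_of_txt_paths), "merged.txt"), with the output written outside the input folder. A key-based merge would need a different approach, for example loading the files into a database first.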

Related

Reading and handling many small CSV-s to concatenate one large Dataframe

I have two folders, each containing about 8,000 small CSV files: one with an aggregated size of around 2GB and another with an aggregated size of around 200GB.
These files are stored like this so that they are easier to update on a daily basis. However, when I conduct EDA, I would like them to be assigned to a single variable. For example:
import os
import pandas as pd
path = "some random path"
df = pd.concat([pd.read_csv(f"{path}//{files}") for files in os.listdir(path)])
Reading the 2GB dataset takes much less time on my local machine than on the supercomputer cluster, and reading the 200GB dataset on the local machine is impossible without some sort of scaled-up pandas solution. The situation does not seem to improve on the cluster even with popular open-source tools like Dask and Modin.
Is there an effective way that enables to read those csv files effectively with given situation?

Q :"Is there an effective way that enables to read those csv files effectively ... ?"
A :Oh, sure, there is :
CSV format ( standard attempts in RFC4180 ) is not unambiguous and is not obeyed under all circumstances ( commas inside fields, header present or not ), so some caution & care is needed here. Given you are your own data curator, you shall be able to decide plausible steps for handling your own data properly.
So, the as-is state is :
# in <_folder_1_>
:::::::: # 8000 CSV-files ~ 2GB in total
||||||||||||||||||||||||||||||||||||||||||| # 8000 CSV-files ~ 200GB in total
# in <_folder_2_>
Speaking of efficiency, O/S coreutils provide the best, stable, proven and most efficient tools ( as system tools have always been ) for the phase of merging thousands and thousands of plain CSV-files' content :
###################### if need be,
###################### use an in-place remove of all CSV-file headers first :
for F in *.csv; do sed -i '1d' "$F"; done
This helps in case we cannot avoid headers on the CSV-exporter side. It works like this :
(base):~$ cat ?.csv
HEADER
1
2
3
HEADER
4
5
6
HEADER
7
8
9
(base):~$ for i in $( ls ?.csv ); do sed -i '1d' $i; done
(base):~$ cat ?.csv
1
2
3
4
5
6
7
8
9
Now, the merging phase :
###################### join
###################### ( on a re-run, delete the joined file first, as it matches *.csv too )
cat *.csv > ___all_CSVs_JOINED.csv
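The header removal and the join can also be done in one pass that leaves the source files untouched and keeps a single header line in the result. This is a sketch of an alternative, not the coreutils approach above. It assumes every file starts with the same one-line header, and the function name join_csvs_keep_one_header is hypothetical.

```python
from pathlib import Path


def join_csvs_keep_one_header(folder, output_path, pattern="*.csv"):
    """Concatenate CSV files line by line, keeping only the first header.

    Returns the number of data lines written.
    """
    output_path = Path(output_path).resolve()
    header_written = False
    data_lines = 0
    with open(output_path, "w", newline="") as out:
        for path in sorted(Path(folder).glob(pattern)):
            if path.resolve() == output_path:
                continue  # never read the file we are writing to
            with open(path, "r", newline="") as src:
                header = src.readline()
                if header and not header_written:
                    out.write(header if header.endswith("\n") else header + "\n")
                    header_written = True
                for line in src:
                    # Guard against a last line without a trailing newline.
                    out.write(line if line.endswith("\n") else line + "\n")
                    data_lines += 1
    return data_lines
```

Because the files are copied line by line, memory use stays flat regardless of the total size, at the price of being slower than cat.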
Given the nature of the said file-storage policy, performance can be boosted by using more processes that handle the small files and the large files separately and independently, as defined above, with the logic put inside a pair of conversion_script_?.sh shell scripts :
parallel --jobs 2 conversion_script_{1}.sh ::: $( seq -f "%1g" 1 2 )
Removing the CSV headers is a "just"-[CONCURRENT] flow of processing, whereas the joining itself is pure-[SERIAL]. ( For a larger number of files, it might become interesting to use a multi-staged tree of trees, using several stages of [SERIAL] collections of [CONCURRENT]-ly pre-processed leaves; yet for just 8000 files, and not knowing the actual file-system details, the latency masking from a just-[CONCURRENT] processing of both directories independently will be fine to start with. )
Last but not least, the final pair of ___all_CSVs_JOINED.csv files can safely be opened in a way that prevents moving all disk-stored data into RAM at once ( using a chunk-size-fused file-reading iterator and avoiding RAM spillovers by using the memory-mapped mode, as a context manager ) :
# context-manager use of the reader needs pandas 1.2 or newer
with pandas.read_csv( "<_folder_1_>//___all_CSVs_JOINED.csv",
                      # sep, delimiter : left at their defaults
                      # ...
                      chunksize  = SAFE_CHUNK_SIZE,
                      # ...
                      memory_map = True,
                      # ...
                      ) as df_reader_MMAPer_CtxMGR:
    ...
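The chunked-reading pattern itself does not depend on pandas. Below is a stdlib-only sketch of the same idea, where each chunk is processed and dropped before the next one is read. This is not the pandas API, and the names iter_chunks and count_rows are hypothetical.

```python
import csv
from itertools import islice


def iter_chunks(path, chunk_rows, delimiter=","):
    """Yield lists of at most chunk_rows parsed rows from a CSV file."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        while True:
            chunk = list(islice(reader, chunk_rows))
            if not chunk:
                return
            yield chunk


def count_rows(path, chunk_rows=100_000):
    """Example aggregate: count rows while holding one chunk at a time."""
    return sum(len(chunk) for chunk in iter_chunks(path, chunk_rows))
```

Any per-chunk reduction (sums, counts, filtered subsets written back to disk) fits the same loop. Peak memory is governed by chunk_rows, not by the file size.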
When tweaking for ultimate performance, details matter and depend on physical hardware bottlenecks ( disk-I/O-wise, filesystem-wise, RAM-I/O-wise ), so due care may bring further improvements in minimising the repeatedly incurred end-to-end processing times. Sometimes it even pays to turn the data into a compressed/zipped form, in cases where CPU/RAM resources permit sufficient performance advantages over the limited disk-I/O throughput: moving fewer bytes is so much faster that the CPU/RAM decompression costs are still lower than the cost of moving 200+ [GB] of uncompressed plain-text data.
Details matter: tweak options, benchmark, tweak options, benchmark, tweak options, benchmark.
It would be nice to post your progress on testing the performance:
- end-2-end duration of strategy ... [s] AS-IS now
- end-2-end duration of strategy ... [s] with parallel --jobs 2 ...
- end-2-end duration of strategy ... [s] with parallel --jobs 4 ...
- end-2-end duration of strategy ... [s] with parallel --jobs N ... + compression ...
Keep us posted.
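A small timing harness is enough to collect such end-to-end durations in a comparable way. This is a sketch only. The name time_strategy is hypothetical, and each strategy is assumed to be wrapped in a zero-argument callable.

```python
import time


def time_strategy(label, func, repeats=3):
    """Run func() several times and report the best wall-clock duration."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    best = min(durations)
    print(f"end-2-end duration of strategy {label}: {best:.3f} [s]")
    return best
```

Keep in mind that the operating system's file cache usually makes repeated runs over the same files faster than the first one, so the first and the later durations are worth recording separately.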

In MS Access VBA get Number of Processor Cores

I need to get the number of processor cores available on a computer programmatically from within MS Access. As an example, the computer I work from most frequently has one processor with 6 cores. I want to grab the number '6' through VBA.
Thus far, I have found two ways to find this information through CMD. (1) I can execute the line echo %NUMBER_OF_PROCESSORS% and the result is 6 (simple and clean, I like it). (2) I have also tried wmic cpu get NumberOfCores, but the result of that prompt is as follows:
NumberOfCores
6
I intend to pipe the output to the clipboard and read it from there. The reason I use the clipboard is to avoid creating, reading, and deleting little text files of data. Prompt (2) works: I can successfully pipe the output to the clipboard and read it into a variable in VBA, but it's messy and I would have to parse the result to get the information I need. I would much prefer using prompt (1), but it's not working and the problem seems to be echo. I have tried using Shell() and CreateObject("WScript.Shell").Run without success. The strings I have used to try to execute the echo prompt are as follows:
str = "echo %NUMBER_OF_PROCESSORS% | clip"
str = "cmd ""echo %NUMBER_OF_PROCESSORS% | clip"""
So, is there a way to successfully send an echo prompt to CMD through VBA and get a result?
Alternatively, is there a different way in VBA to get the number of cores?
TIA!

Why not keep it simple like this:
Dim result As Variant
result = Environ("NUMBER_OF_PROCESSORS")
Debug.Print "Number of processors is " & result

Qlikview does not upload what I ask for

I have this simple script to upload filenames:
Files:
LOAD
Distinct
FileName() as File
FROM [C:\Matias\Capacity Tracker\AllFiles\*];
When I run the script, the following happens:
Files << Analyst Time Sheet - Adam W - 0730-0805 0 lines fetched
Files << Analyst Time Sheet - Adam W - 0806-0812 0 lines fetched
Files << Analyst Time Sheet - Agnieszka J - 0702-0708 2 lines fetched
Files << Analyst Time Sheet - Agnieszka J - 0709-0715 3 lines fetched
Files << Analyst Time Sheet - Agnieszka J - 0716-0722 4 lines fetched
And so on...
So, the strange thing is that nothing is loaded for the files from "Adam W" (no lines fetched), and I end up with the list of files without those. I find this very strange: since I am only asking for the file name, it can't be a formatting issue (I think).
Any idea what could be happening and how I could solve it?
Thank you in advance
Matias

Although QlikView offers the * wildcard in the file name of the LOAD statement, the results are sometimes a little bit random. I would recommend trying a different approach to see if it works:
For Each FILE in FileList('C:\Matias\Capacity Tracker\AllFiles\*')
  Files:
  LOAD
    Distinct FileName() as File
  FROM [$(FILE)];
Next FILE
Hope this helps.

Thanks for your idea. I tried that and unfortunately had the same problem. In the end, I solved it like this:
Files:
LOAD
Distinct
FileName() as File
FROM [C:\Matias\Capacity Tracker\AllFiles\*];
SET ErrorMode=0;
Files:
LOAD
Distinct
FileName() as File
FROM [C:\Matias\Capacity Tracker\AllFiles\*]
(ooxml, no labels, table is [Task Log])
Where Not Exists(File,FileName());
IF ScriptError <> 0 THEN
Files:
LOAD
FileName() as File
FROM [C:\Matias\Capacity Tracker\AllFiles\*]
(biff, no labels, table is [Task Log$])
Where Not Exists(File,FileName());
ENDIF
Although they are all .xls files, there seem to be formatting differences between them. The files that were not loaded by the first statement were picked up by the second one (ooxml) or, if that failed, by the third one (biff). Quite strange.
Maybe this is not the best or most proper solution, but it was the only one that worked to load all the file names from the folder.

AMPL:How to print variable output using NEOS Server, when you can't include data and model command in the command file?

I'm doing some optimization using a model whose number of constraints and variables exceeds the cap for the student version of, say, AMPL, so I've found a webpage [http://www.neos-server.org/neos/solvers/milp:Gurobi/AMPL.html] which can solve my type of model.
However, for solvers that accept a command file (which I assume is the same as a .run file), the NEOS Server documentation only refers you to the documentation of the input format. I'm using AMPL input, which according to [http://www.neos-guide.org/content/FAQ#ampl_variables] should be able to print the decision variables using a command file that looks like this:
solve;
display _varname, _var;
The problem is that NEOS states that you cannot add the
data datafile;
model modelfile;
commands to the .run file, with the result that the interpreter cannot find the variables.
Does anyone know of a way to work around this?
Thanks in advance!
EDIT: If anyone else has this problem (which, based on my Internet search, I believe many people do): try removing any reset; command from the .run file!

You don't need to specify model or data commands in the script file submitted to NEOS. It loads the model and data files automatically, solves the problem, and then executes the script (command file) you provide. For example, submitting the model diet1.mod, the data diet1.dat, and this trivial command file
display _varname, _var;
produces the output which includes
:   _varname                              _var :=
1   "Buy['Quarter Pounder w/ Cheese']"    0
2   "Buy['McLean Deluxe w/ Cheese']"      0
3   "Buy['Big Mac']"                      0
4   "Buy['Filet-O-Fish']"                 0
5   "Buy['McGrilled Chicken']"            0
6   "Buy['Fries, small']"                 0
7   "Buy['Sausage McMuffin']"             0
8   "Buy['1% Lowfat Milk']"               0
9   "Buy['Orange Juice']"                 0
;
As you can see, this is the output from the display command.

[Q]uestion about reading and saving a large txt-file via {RSQLite} line by line into a DB

Since my hardware is very limited (a dual core with 32-bit Win7 and 4GB of RAM; I need to make the best of it), I am trying to save a large text file (about 1.2GB) into a DB, which I can then query with SQL-like statements to do some analytics on particular subgroups.
To be honest, I'm not familiar with this area, and since I could not find help with my issue by googling, I will quickly show what I came up with and how I thought things would look:
First I check how many columns my txt-file has:
k <- length(scan("data.txt", nlines=1, sep="\t", what="character"))
Then I open a connection to the text file so that it does not need to be opened again for every single line:
filecon<-file("data.txt", open="r")
Then I initialize a connection (dbcon) to an SQLite database
dbcon<- dbConnect(dbDriver("SQLite"), dbname="mydb.dbms")
I find out where the position of the first line is
pos<-seek(filecon, rw="r")
Since the first line contains the column-names I save them for later use
col_names <- unlist(strsplit(readLines(filecon, n=1), "\t"))
Next, as a test, I read the first 10 lines, line by line, and save them into a DB table, which should contain k columns with column names = col_names.
for(i in 1:10) {
  # prints the iteration number in hundreds
  if(i %% 100 == 0) {
    print(i)
  }
  # read one line into a variable tt
  tt<-readLines(filecon, n=1)
  # parse tt into a variable tt2, since tt is a string
  tt2<-unlist(strsplit(tt, "\t"))
  # Every line, read and parsed from the text file, is immediately saved
  # in the SQLite database table "results" using the command dbWriteTable()
  dbWriteTable(conn=dbcon, name="results", value=as.data.frame(t(tt2[1:k]),stringsAsFactors=T), col.names=col_names, append=T)
  pos<-c(pos, seek(filecon, rw="r"))
}
If I run this I get the following error
Warning messages:
1: In value[[3L]](cond) :
RS-DBI driver: (error in statement: table results has 738 columns but 13 values were supplied)
Why should I supply 738 columns? If I change k (which is 12) to 738, the code works, but then I have to refer to the columns as V1, V2, V3, ... and not by the column names I intended to supply:
res <- dbGetQuery(dbcon, "select V1, V2, V3, V4, V5, V6 from results")
Any help or even a small hint is very much appreciated!
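For comparison, the same line-by-line load can be sketched in Python with the standard sqlite3 module, creating the table from the header line so that the columns carry their intended names. The sketch recreates the table on every run, which avoids appending to a table left over from an earlier attempt with a different column count. Whether that is the cause of the error above is an assumption, not something the question confirms. The function name load_tsv_into_sqlite and the batch size are hypothetical, and column names are assumed to be unique and free of double quotes.

```python
import sqlite3
from itertools import islice


def load_tsv_into_sqlite(txt_path, db_path, table="results", batch_rows=10_000):
    """Stream a tab-separated file with a header line into an SQLite table."""
    con = sqlite3.connect(db_path)
    try:
        with open(txt_path, "r", newline="") as src:
            # The first line holds the column names.
            col_names = src.readline().rstrip("\r\n").split("\t")
            width = len(col_names)
            columns = ", ".join(f'"{name}"' for name in col_names)
            con.execute(f'DROP TABLE IF EXISTS "{table}"')
            con.execute(f'CREATE TABLE "{table}" ({columns})')
            placeholders = ", ".join("?" for _ in col_names)
            insert_sql = f'INSERT INTO "{table}" VALUES ({placeholders})'
            total = 0
            while True:
                lines = list(islice(src, batch_rows))
                if not lines:
                    break
                rows = []
                for line in lines:
                    fields = line.rstrip("\r\n").split("\t")
                    # Pad short lines with NULLs, cut overlong ones.
                    rows.append((fields + [None] * width)[:width])
                con.executemany(insert_sql, rows)
                total += len(rows)
        con.commit()
        return total
    finally:
        con.close()
```

Inserting in batches inside one transaction is usually much faster than one write per line, while memory use stays bounded by batch_rows.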
