Reading, parsing and storing .txt files contents in Torch tensors efficiently - optimization

I have a huge number of .txt files (around 10 million), each with the same number of rows and columns. They are actually single-channel images whose pixel values are separated by spaces. Here's the code I've written to do the work, but it's very slow. Can someone suggest a more efficient way of doing this?
require 'torch'
f = assert(io.open(txtFilePath, 'r'))
local tempTensor = torch.Tensor(1, 64, 64):fill(0)
local i = 1
for line in f:lines() do
    local l = line:split(' ')
    for key, val in ipairs(l) do
        tempTensor[{1, i, key}] = tonumber(val)
    end
    i = i + 1
end
f:close()

In brief: change your source files if possible.
The only thing I can suggest is to use binary data instead of text as the source.
Your code relies on slow operations: f:lines(), line:split(' ') and tonumber(val). All of them work on strings.
As I understand it, you have a file like this:
0 10 20
11 18 22
....
so convert your source into binary, like this:
<0><10><20><11><18><22> ...
where each <n> is a single byte holding the value n; for example, <18> is the byte 0x12 and <20> is the byte 0x14.
To read it:
local fid = assert(io.open(sup_filename, "rb"))
while true do
    local byte = fid:read(1) -- one-character string, or nil at EOF
    if byte == nil then break end
    local st = byte:byte() -- numeric value (0-255) of that byte
    print(st)
end
fid:close()
https://www.lua.org/pil/21.2.2.html
It would be dramatically faster.
Using Lua patterns (instead of :split() and lines()) might also help, but I doubt it.
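The text-versus-binary idea above can be sketched in Python (used here only for illustration; the question itself is Lua/Torch). The file names, the tiny 2x3 image size, and the helper names text_to_binary and read_binary are assumptions, and the sketch assumes every pixel value fits in one byte (0 to 255).

```python
import os
import tempfile
from array import array

ROWS, COLS = 2, 3  # tiny stand-in for the 64x64 images in the question


def text_to_binary(txt_path, bin_path):
    """One-off conversion: parse the space-separated text once, store raw bytes."""
    with open(txt_path) as src:
        pixels = [int(tok) for line in src for tok in line.split()]
    with open(bin_path, "wb") as dst:
        dst.write(bytes(pixels))  # assumption: every pixel value is in 0..255


def read_binary(bin_path):
    """Fast path: a single read, no string splitting or number parsing."""
    with open(bin_path, "rb") as src:
        flat = array("B", src.read())  # unsigned 8-bit values
    return [list(flat[r * COLS:(r + 1) * COLS]) for r in range(ROWS)]


# Hypothetical sample file matching the small example in the answer.
workdir = tempfile.mkdtemp()
txt_path = os.path.join(workdir, "image.txt")
bin_path = os.path.join(workdir, "image.bin")
with open(txt_path, "w") as f:
    f.write("0 10 20\n11 18 22\n")

text_to_binary(txt_path, bin_path)
image = read_binary(bin_path)
print(image)
```

The conversion cost is paid once per file; every later load is a plain byte read, which is where the speed-up is expected to come from.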

Related

How do you detect blank lines in Fortran?

Given an input that looks like the following:
123
456
789

42
23
1337

3117
I want to iterate over this file in whitespace-separated chunks in Fortran (any version is fine). For example, let's say I wanted to take the average of each chunk (e.g. mean(123, 456, 789) then mean(42, 23, 1337) then mean(3117)).
I've tried iterating through the file normally (e.g. READ), reading in each line as a string and then converting to an int and doing whatever math I want to do on each chunk. The trouble here is that Fortran "helpfully" ignores blank lines in my text file - so when I try and compare against the empty string to check for the blank line, I never actually get a .True. on that comparison.
I feel like I'm missing something basic here, since this is a typical functionality in every other modern language, I'd be surprised if Fortran didn't somehow have it.
If you're using so-called "list-directed" input (format = '*'), Fortran handles spaces, commas, and blank lines specially.
To your point, read accepts a BLANK keyword that controls how blanks are treated:
read(iunit,'(i10)',blank="ZERO",err=1,end=2) array
You can set:
blank="ZERO" will return a valid zero value if a blank is found;
blank="NULL" is the default behavior that skips blank/returns an error depending on the input format.
If all your input values are positive, you could use blank="ZERO" and then use the location of zero values to process your data.
EDIT: as #vladimir-f has correctly pointed out, you have blanks not only between lines but also after the end of the numbers on most lines, so this strategy will not work.
You can instead load everything into an array, and process it afterwards:
program array_with_blanks
    integer :: ierr,num,iunit
    integer, allocatable :: array(:)
    open(newunit=iunit,file='stackoverflow',form='formatted',iostat=ierr)
    allocate(array(0))
    do
        read(iunit,'(i10)',iostat=ierr) num
        if (is_iostat_end(ierr)) then
            exit
        else
            array = [array,num]
        endif
    end do
    close(iunit)
    print *, array
end program
Just read each line as a character (but note Francescalus's comment on the format). Then read the character as an internal file.
program stuff
    implicit none
    integer io, n, value, sum
    character (len=1000) line
    n = 0
    sum = 0
    io = 0
    open( 42, file="stuff.txt" )
    do while( io == 0 )
        read( 42, "( a )", iostat = io ) line
        if ( io /= 0 .or. line == "" ) then
            if ( n > 0 ) print *, ( sum + 0.0 ) / n
            n = 0
            sum = 0
        else
            read( line, * ) value
            n = n + 1
            sum = sum + value
        end if
    end do
    close( 42 )
end program stuff
456.000000
467.333344
3117.00000
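For comparison, the same "read the line as text, then decide" logic can be sketched in Python (not Fortran; it only illustrates the control flow of the program above). The function name chunk_means and the in-memory sample list are assumptions standing in for the input file, with an empty string marking each blank line.

```python
def chunk_means(lines):
    """Average each run of integer lines; a blank line closes the current run."""
    means = []
    total, count = 0, 0
    for line in lines:
        text = line.strip()
        if text:
            # Non-blank line: accumulate into the current chunk.
            total += int(text)
            count += 1
        elif count:
            # Blank line after at least one value: emit the mean and reset.
            means.append(total / count)
            total, count = 0, 0
    if count:
        # The last chunk is closed by end of input, not by a blank line.
        means.append(total / count)
    return means


# Hypothetical stand-in for the file contents; "" marks a blank line.
sample = ["123", "456", "789", "", "42", "23", "1337", "", "3117"]
result = chunk_means(sample)
print(result)
```

As in the Fortran version, the key point is that the blank line is seen as text first, so it can act as a separator instead of being skipped by numeric input.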

psycopg2 copy_from Problems in Python 3

I'm new to Python (and coding) and bit off more than I can chew trying to use copy_from.
I am reading rows from a CSV, manipulating them a bit, then writing them into SQL. Using the normal INSERT commands takes a very long time with hundreds of thousands of rows, so I want to use copy_from. It does work with INSERT though.
The example at https://www.psycopg.org/docs/cursor.html#cursor.copy_from uses tabs as column separators and a newline at the end of each row, so I built each IO line accordingly:
43620929 2018-04-11 11:38:14 30263506 30263503 30262500 0 0 0 0 0 1000 1000 0
That's what the below outputs with the first print statement:
def copyFromIO(thisOutput):
    print(thisOutput.getvalue())
    cursor.copy_from(thisOutput, 'hands_new')
    thisCommand = 'SELECT * FROM hands_new'
    cursor.execute(thisCommand)
    print(cursor.fetchall())
hands_new is an existing, empty SQL table. The second print statement is just [], so it isn't writing to the db. What am I getting wrong?
Obviously if it worked, I could make thisOutput much longer, with lots of rows instead of just the one.
I think I figured it out, so in case anyone comes across this in the future:
The format of 'thisOutput' was wrong; I had built it from smaller pieces, adding '\t' etc. as I went. It works if I do this instead:
copyFromIO(io.StringIO('43620929\t2018-04-11 11:38:14\t30263506\t30263503\t30262500\t0\t0\t0\t0\t0\t1000\t1000\t0\n'))
I also needed to pass the right columns to copy_from:
def copyFromIO(thisOutput):
    print(thisOutput.getvalue())
    thisCol = ('pkey', 'created', 'gameid', 'tableid', 'playerid', 'bet', 'pot',
               'isout', 'outround', 'rake', 'endstack', 'startstack', 'stppaid')
    cursor.copy_from(thisOutput, 'hands_new', columns=thisCol)
    thisCommand = 'SELECT * FROM hands_new'
    cursor.execute(thisCommand)
    print(cursor.fetchall())
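To scale this to many rows, the buffer can be assembled in a loop. The sketch below is one possible way, not necessarily what the poster ended up with: the helper name rows_to_copy_buffer is hypothetical, and it assumes no value contains a tab, newline, or backslash (those would need escaping in the COPY text format). It also rewinds the buffer with seek(0), since a StringIO that has just been written to is positioned at its end; whether that was part of the original problem is not known.

```python
import io


def rows_to_copy_buffer(rows):
    """Build a tab-separated, newline-terminated buffer from row tuples."""
    buffer = io.StringIO()
    for row in rows:
        # One line per row: columns joined by tabs, terminated by a newline.
        buffer.write("\t".join(str(value) for value in row) + "\n")
    buffer.seek(0)  # rewind so a later reader starts at the first row
    return buffer


# Sample row taken from the question.
rows = [
    (43620929, "2018-04-11 11:38:14", 30263506, 30263503, 30262500,
     0, 0, 0, 0, 0, 1000, 1000, 0),
]
copy_buffer = rows_to_copy_buffer(rows)
print(repr(copy_buffer.getvalue()))

# With an open psycopg2 cursor (not created in this sketch), the buffer
# would then be passed on as in the answer above:
# cursor.copy_from(copy_buffer, 'hands_new', columns=thisCol)
```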

Octave strread can't return parsed results to an array (?)

In Octave, I am reading very large text files from disk and parsing them. The function textread() does just what I want except for the way it is implemented. Looking at the source, textread.m pulls the entire text file into memory before attempting to parse lines. If the text file is large, it fills all my free RAM (16 GB) with text and then starts saving back to disk (virtual memory), before parsing. If I wait long enough, textread() will complete, but it takes almost forever.
Notice that after parsing into a matrix of floating point values, the same data fit into memory quite easily. So I'm using textread() in an intermediate zone, where there is enough memory for the floats, but not enough memory for the same data as text.
All of that is preparation for my question, which is about strread(). The data in my text files looks like this
0.0647148 -2.0072535 0.5644875 8.6954257
0.1294296 -8.4689583 0.6567095 144.3090450
0.1941444 -9.2658037 -1.0228742 173.8027785
0.2588593 -6.5483359 -1.5767574 90.7337329
0.3235741 -0.7646807 -0.5320896 1.7357120
... and so on. There are no header lines or comments in the file.
I wrote a function that reads the file line by line, and notice the two ways I'm attempting to use strread() to parse a line of data.
function dest = readPowerSpectrumFile(filename, dest)
    % read enough lines to fill destination array
    [rows, cols] = size(dest);
    fid = fopen(filename, 'r');
    for line = 1 : rows
        lstr = fgetl(fid);
        % this line works, but is very brittle
        [dest(line, 1), dest(line, 2), dest(line, 3), dest(line, 4)] = strread(lstr, "%f %f %f %f");
        % This line doesn't work. Or anything similar I can think of.
        % dest(line, 1:4) = strread(lstr, "%f %f %f %f");
    endfor
    fclose(fid);
endfunction
Is there an elegant way of having strread return parsed values to an array? Otherwise I'll have to write a new function any time I change the number of columns.
Thanks
The format you describe is a matrix of floating-point values. In this case you can just use load
d = load ("yourfile");
which is much faster than any other function. The implementation is in libinterp/corefcn/ls-mat-ascii.cc: read_mat_ascii_data
If you feed fprintf more values than its format specification contains, it reapplies the format until it has used them all up:
>> fprintf("%d %d \n", 1:6)
1 2
3 4
5 6
It appears this also works with strread. If you specify only one value to read, but there are multiple on the current line, it will keep reading them and add them to a column vector. All we need to do is to assign those values to the correct row of dest:
function dest = readPowerSpectrumFile(filename, dest)
    % read enough lines to fill destination array
    [rows, cols] = size(dest);
    fid = fopen(filename, 'r');
    for line = 1 : rows
        lstr = fgetl(fid);
        % read all values from current line into column vector
        % and store values into row of dest
        dest(line,:) = strread(lstr, "%f");
        % this will also work since values are assumed to be numeric by default:
        % dest(line,:) = strread(lstr);
    endfor
    fclose(fid);
endfunction
Output:
readPowerSpectrumFile(filename, zeros(5,4))
ans =
6.4715e-02 -2.0073e+00 5.6449e-01 8.6954e+00
1.2943e-01 -8.4690e+00 6.5671e-01 1.4431e+02
1.9414e-01 -9.2658e+00 -1.0229e+00 1.7380e+02
2.5886e-01 -6.5483e+00 -1.5768e+00 9.0734e+01
3.2357e-01 -7.6468e-01 -5.3209e-01 1.7357e+00

How to find every combination of a binary 16 digit number

I have 16 different options in my program and I have a 16-character variable which is filled with 1s or 0s depending on the options that are selected (0000000000000000 means nothing is selected, 0010101010000101 means options 3, 5, 7, 9, 14 and 16 are selected, 1111111111111111 means everything is selected).
When I run my program, the code looks (using an If statement) for a 1 in the designated character of the 16-digit number; if there is one, it runs the code for that option, otherwise it skips it.
E.g. option 3 looks to see whether the 3rd character (0010000000000000) is a 1, and if it is, it runs the code.
Now what I am trying to do is generate a list of every possible combination, so I can add an option that just loops through and runs every one of them:
0000000000000001
0000000000000010
0000000000000011
...
1111111111111100
1111111111111110
1111111111111111
I have tried this, but I think it may take a couple of years to run, jaja:
Dim binString As String
Dim binNUM As Decimal = "0.0000000000000001"
Do Until binNUM = 0.11111111111111111
binString = binNUM.ToString
If binString.Contains(1) Then
If binString.Contains(2) Or binString.Contains(3) Or binString.Contains(4) Or binString.Contains(5) Or binString.Contains(6) Or binString.Contains(7) Or binString.Contains(8) Or binString.Contains(9) Then
Else
Debug.Print(binNUM)
End If
End If
binNUM = binNUM + 0.0000000000000001
After the code above is complete, I would take the output list, remove any instances of "0.", and then pad any line with fewer than 16 characters (because trailing 0s would not be shown) with 0s until it had 16 characters. I know this bit might be stupid, but it's as far as I've got.
Is there a faster way to generate a list like this in VB.net?
You should be able to get the list by using Convert.ToString as follows:
Dim sb As New System.Text.StringBuilder
For i As Integer = 0 To 65535
    sb.AppendLine(Convert.ToString(i, 2).PadLeft(16, "0"c))
Next
Debug.Print(sb.ToString())
BTW: This should finish in under one second, depending on your system ;-)
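The same counting idea, sketched in Python purely as an illustration of why the list is small (2^16 = 65536 entries) and quick to build. The function names all_option_strings and selected_options are hypothetical; the second one decodes a string back into 1-based option numbers, counting from the left as in the question.

```python
def all_option_strings(width=16):
    """Every bit pattern of the given width, as zero-padded binary strings."""
    return [format(i, "0{}b".format(width)) for i in range(2 ** width)]


def selected_options(bits):
    """1-based positions (from the left) of the characters that are '1'."""
    return [pos for pos, ch in enumerate(bits, start=1) if ch == "1"]


combos = all_option_strings()
print(len(combos), combos[0], combos[-1])

# Example string taken from the question.
print(selected_options("0010101010000101"))
```

Counting an integer from 0 to 65535 and formatting it in base 2 visits each combination exactly once, which is what the Convert.ToString(i, 2) loop above does.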
Create an enum with the Flags attribute (FlagsAttribute), which lets you do the key operations you list. Here is an example of setting it up, from a small project I am working on:
<FlagsAttribute>
Public Enum MyFlags As Integer
    None = 0
    One = 1
    Two = 2
    Three = 4
    Four = 8
    Five = 16
    Recon = 32
    Saboteur = 64
    Mine = 128
    Headquarters = 256
End Enum
e.g.
Dim temp as MyFlags
Dim doesIt as Boolean
temp = MyFlags.One
doesIt = temp.HasFlag(MyFlags.Two)
temp = temp OR MyFlags.Three
'etc.
The real advantage is how it prints out: if you want something other than 0s and 1s, the output is much more human-friendly.

Using a table made from input file Lua

I have a text file with contents like this
Jack 17
Will 16
Jordan 15
Elsie 16
You get the idea, it's a list of people's names with their ages.
I have a program that reads the file in. Like so:
file = io.open("ages.txt")
for line in file:lines()
do
    local name, age = line:match("(%a+) (%d+)")
    print(age) --Not exactly what I want
end
file:close()
print(age) gives me the ages of all the people, without names. It runs for everyone, as expected, since it's within the loop (as an aside, why does it not work outside the loop? It gives me nil there).
What I want to do is load the data into a table. That way, if I want to know Jack's age, I can write print(Jack.age) and it will give me 17. How can this program be constructed to support this functionality?
Perhaps you are looking for something like this to build a table in the loop:
file = io.open("ages.txt")
names = {}
for line in file:lines()
do
    local n, a = line:match("(%a+) (%d+)")
    names[n] = {age = tonumber(a)} -- store the age as a number, not a string
end
file:close()
Here is a sample interaction:
> print(names.Will.age)
16
> print(names.Jordan.age)
15
> print(names.Elsie.age)
16