Octave strread can't return parsed results to an array (?) - indexing

In Octave, I am reading very large text files from disk and parsing them. The function textread() does just what I want except for the way it is implemented. Looking at the source, textread.m pulls the entire text file into memory before attempting to parse lines. If the text file is large, it fills all my free RAM (16 GB) with text and then starts saving back to disk (virtual memory), before parsing. If I wait long enough, textread() will complete, but it takes almost forever.
Note that after parsing into a matrix of floating-point values, the same data fits into memory quite easily. So I'm using textread() in an intermediate zone, where there is enough memory for the floats, but not enough memory for the same data as text.
All of that is preparation for my question, which is about strread(). The data in my text files looks like this:
0.0647148 -2.0072535 0.5644875 8.6954257
0.1294296 -8.4689583 0.6567095 144.3090450
0.1941444 -9.2658037 -1.0228742 173.8027785
0.2588593 -6.5483359 -1.5767574 90.7337329
0.3235741 -0.7646807 -0.5320896 1.7357120
... and so on. There are no header lines or comments in the file.
I wrote a function that reads the file line by line; note the two ways I'm attempting to use strread() to parse a line of data.
function dest = readPowerSpectrumFile(filename, dest)
  % read enough lines to fill destination array
  [rows, cols] = size(dest);
  fid = fopen(filename, 'r');
  for line = 1 : rows
    lstr = fgetl(fid);
    % this line works, but is very brittle
    [dest(line, 1), dest(line, 2), dest(line, 3), dest(line, 4)] = strread(lstr, "%f %f %f %f");
    % This line doesn't work. Or anything similar I can think of.
    % dest(line, 1:4) = strread(lstr, "%f %f %f %f");
  endfor
  fclose(fid);
endfunction
Is there an elegant way of having strread return parsed values to an array? Otherwise I'll have to write a new function any time I change the number of columns.
Thanks

The format you describe is a matrix of floating-point values. In this case you can just use load:
d = load ("yourfile");
which is much faster than any other function. You can have a look at the implementation used in libinterp/corefcn/ls-mat-ascii.cc: read_mat_ascii_data.
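As a minimal usage sketch (the file name here is just a placeholder for your own data file), the result is an ordinary N-by-4 matrix you can index directly:
d = load("powerspectrum.txt");   % hypothetical file name
size(d)                          % for the five sample rows above this prints 5 4
first_col = d(:, 1);             % columns can then be indexed directly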

If you feed fprintf more values than its format specification consumes, it will reapply the format until the values are used up:
>> fprintf("%d %d \n", 1:6)
1 2
3 4
5 6
It appears this also works with strread. If you specify only one value to read, but there are multiple on the current line, it will keep reading them and add them to a column vector. All we need to do is to assign those values to the correct row of dest:
function dest = readPowerSpectrumFile(filename, dest)
  % read enough lines to fill destination array
  [rows, cols] = size(dest);
  fid = fopen(filename, 'r');
  for line = 1 : rows
    lstr = fgetl(fid);
    % read all values from current line into column vector
    % and store values into row of dest
    dest(line,:) = strread(lstr, "%f");
    % this will also work since values are assumed to be numeric by default:
    % dest(line,:) = strread(lstr);
  endfor
  fclose(fid);
endfunction
Output:
readPowerSpectrumFile(filename, zeros(5,4))
ans =
6.4715e-02 -2.0073e+00 5.6449e-01 8.6954e+00
1.2943e-01 -8.4690e+00 6.5671e-01 1.4431e+02
1.9414e-01 -9.2658e+00 -1.0229e+00 1.7380e+02
2.5886e-01 -6.5483e+00 -1.5768e+00 9.0734e+01
3.2357e-01 -7.6468e-01 -5.3209e-01 1.7357e+00

Related

How do you detect blank lines in Fortran?

Given an input that looks like the following:
123
456
789

42
23
1337

3117
I want to iterate over this file in whitespace-separated chunks in Fortran (any version is fine). For example, let's say I wanted to take the average of each chunk (e.g. mean(123, 456, 789), then mean(42, 23, 1337), then mean(3117)).
I've tried iterating through the file normally (e.g. READ), reading in each line as a string and then converting to an int and doing whatever math I want to do on each chunk. The trouble here is that Fortran "helpfully" ignores blank lines in my text file - so when I try and compare against the empty string to check for the blank line, I never actually get a .True. on that comparison.
I feel like I'm missing something basic here; since this is typical functionality in every other modern language, I'd be surprised if Fortran didn't somehow have it.
If you're using so-called "list-directed" input (format = '*'), Fortran applies special handling to spaces, commas, and blank lines.
To your point, there is a feature for this: the BLANK keyword of read:
read(iunit,'(i10)',blank="ZERO",err=1,end=2) array
You can set:
blank="ZERO", which returns a valid zero value if a blank is found;
blank="NULL", the default behavior, which skips blanks or returns an error depending on the input format.
If all your input values are positive, you could use blank="ZERO" and then use the location of zero values to process your data.
EDIT: as @vladimir-f has correctly pointed out, you not only have blanks in between lines, but also after the end of the numbers on most lines, so this strategy will not work.
You can instead load everything into an array, and process it afterwards:
program array_with_blanks
  integer :: ierr, num, iunit
  integer, allocatable :: array(:)
  open(newunit=iunit, file='stackoverflow', form='formatted', iostat=ierr)
  allocate(array(0))
  do
    read(iunit, '(i10)', iostat=ierr) num
    if (is_iostat_end(ierr)) then
      exit
    else
      array = [array, num]
    endif
  end do
  close(iunit)
  print *, array
end program
Just read each line into a character variable (but note Francescalus's comment on the format). Then read the value from that character variable as an internal file.
program stuff
  implicit none
  integer io, n, value, sum
  character (len=1000) line
  n = 0
  sum = 0
  io = 0
  open( 42, file="stuff.txt" )
  do while( io == 0 )
    read( 42, "( a )", iostat = io ) line
    if ( io /= 0 .or. line == "" ) then
      if ( n > 0 ) print *, ( sum + 0.0 ) / n
      n = 0
      sum = 0
    else
      read( line, * ) value
      n = n + 1
      sum = sum + value
    end if
  end do
  close( 42 )
end program stuff
Output:
456.000000
467.333344
3117.00000

Lua - Is it possible to stop inputs while "ex.sleep" is running?

Basic stuff that I can't figure out or find on the internet:
The little code I'm using for tests is simple:
require("ex")
a = true
b = nil
while (a == true) do
b = io.read()
ex.sleep(5)
print(b)
end
Very simple. If I input "1" (I am using Notepad++ and the Windows command prompt), it will wait 5 seconds and print it, then repeat. But my problem is: if I input more numbers during the 5 seconds of sleeping, they will all be processed automatically, in order, when the sleep ends.
Is it possible to stop that? I don't want any input being read during that time. Where are these "ghost" inputs stored?
You can control reading by means of the "buffer size" argument, in bytes:
b = io.read(1)
In this case reading completes after the first byte is taken from the input. The remaining input bytes will be available for the next "read" statement.
Important note: if you input "1" and press "Enter", then there will be 3 bytes to read (including "\r\n").
See https://www.lua.org/pil/21.1.html for details.
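For instance, a small sketch of this behavior (type "123" and press Enter when prompted):
b = io.read(1)            -- reads only "1"
rest = io.read("*line")   -- reads the remaining "23"; the line ending is consumed
print(b, rest)            -- prints: 1    23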
In addition, you want a way to clear the input buffer before the next read. This is easy: use the io.read("*line") statement as follows:
b = io.read("*line") -- suppose, input is: "1234"
b = string.sub(b, 0, 1)
print(b) -- prints 1
b = io.read("*line") -- suppose, input is: "567"
b = string.sub(b, 0, 1)
print(b) -- prints 5
b = io.read("*line") -- suppose, input is: ""
b = string.sub(b, 0, 1)
print(b) -- prints empty string
io.read("*line") gets whole line from input, but you can take only the first character from it.

Reading, parsing and storing .txt files contents in Torch tensors efficiently

I have a huge number of .txt files (maybe around 10 million), each having the same number of rows/columns. They are actually single-channel images, and the pixel values are separated with a space. Here's the code I've written to do the work, but it's very slow. I wonder if someone can suggest a more optimized/efficient way of doing this:
require 'torch'
f = assert(io.open(txtFilePath, 'r'))
local tempTensor = torch.Tensor(1, 64, 64):fill(0)
local i = 1
for line in f:lines() do
  local l = line:split(' ')
  for key, val in ipairs(l) do
    tempTensor[{1, i, key}] = tonumber(val)
  end
  i = i + 1
end
f:close()
In brief, change your source files if at all possible.
The only thing I can suggest is to use binary data instead of text as the source.
The slow parts are the string-based steps: f:lines(), line:split(' ') and tonumber(val). All of them operate on strings.
As I understand it, you have files like this:
0 10 20
11 18 22
....
So, change your source into binary like this:
<00><0A><14><0B><12><16> ...
where each value is stored as a single raw byte, written here in hex form: decimal 18 becomes the byte 0x12, decimal 22 becomes 0x16, etc.
To read it:
fid = io.open(sup_filename, "rb")
while true do
  local bytes = fid:read(1)          -- read one byte at a time
  if bytes == nil then break end     -- EOF
  local st = string.byte(bytes)      -- convert the one-character string to its numeric value
  print(st)
end
fid:close()
https://www.lua.org/pil/21.2.2.html
It would be dramatically faster.
Maybe using regular expressions (instead of :split() and lines()) could help, but I doubt it.
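For completeness, here is a rough sketch of how such a binary file could be loaded into the 64x64 tensor from the question; it assumes each pixel value fits in a single byte and that the file holds exactly 64*64 raw bytes written row by row (sup_filename is the same placeholder as above):
require 'torch'
local fid = assert(io.open(sup_filename, 'rb'))
local data = fid:read(64 * 64)             -- read the whole image in one call
fid:close()
local tempTensor = torch.Tensor(1, 64, 64):fill(0)
for i = 1, 64 do
  for j = 1, 64 do
    -- string.byte returns the numeric value of the ((i-1)*64 + j)-th byte
    tempTensor[{1, i, j}] = string.byte(data, (i - 1) * 64 + j)
  end
end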

Dynamically creating variables, while doing map/apply on a dataframe in pandas to get key names for the values in Series object returned

I am writing code for a Naive Bayes model (I know there's a standard implementation in sklearn, but I want to code it anyway). For this I have, say, upwards of 30 features, against all of which I have the corresponding click & impression counts (treat them as True/False flags).
What I need then, is to calculate
P(Click | F1, F2, ..., F30) = P(Click) * P(F1 | Click) * P(F2 | Click) * ... * P(F30 | Click) / P(F1, F2, ..., F30), and
P(NoClick | F1, F2, ..., F30) = P(NoClick) * P(F1 | NoClick) * P(F2 | NoClick) * ... * P(F30 | NoClick) / P(F1, F2, ..., F30),
where I will disregard the denominator, as it affects the Click and NoClick cases equally.
For example, for two features, day_custom & is_tablet_phone, I have:
is_tablet_phone click impression
FALSE 375417 28291280
TRUE 17743 4220980
day_custom click impression
Fri 77592 7029703
Mon 43576 3773571
Sat 65950 5447976
Sun 66460 5031271
Thu 74329 6971541
Tue 55282 4575114
Wed 51555 4737712
My approach to the problem: assuming I read the individual files into a data frame, one after another, I want the ability to calculate & store the corresponding probabilities back in a file, which I will then use for real-time prediction of the probability of click vs. no click.
One possible structure of the "processed file" would thus be:
Here's my entire code:
In the full-blown example, I am traversing the entire directory structure (of 30 txt files, one at a time, from the base path), which is why I need the ability to create "names" at runtime.
for base_path in base_paths:
    for root, dirs, files in os.walk(base_path):
        for file in files:
            file_paths.append(os.path.join(root, file))
For reasons of tractability, follow from here by taking the 2 txt files as sample input:
file_paths = ['/home/ekta/Desktop/NB/day_custom.txt', '/home/ekta/Desktop/NB/is_tablet_phone.txt']
flag = 0
for filehandle in file_paths:
    feature_name = filehandle.split("/")[-1].split(".")[0]
    df = pd.read_csv(filehandle, skiprows=0, encoding='utf-8', sep='\t', index_col=False,
                     dtype={feature_name: object, 'click': int, 'impression': int})
    df2 = df[(df.impression - df.click > 0) & (df.click > 0)]
    if flag == 0:
        MySumC, MySumNC, Mydict = 0, 0, collections.defaultdict(dict)
        MySumC = sum(df2['click'])
        MySumNC = sum(df2['impression'])
        P_C = float(MySumC) / float(MySumC + MySumNC)
        P_NC = 1 - P_C
    for feature_value in df2[feature_name]:
        Mydict[feature_name + '_' + feature_value] = {
            'P_' + feature_name + '_' + feature_value + '_C': (df2[df2[feature_name] == feature_value]['click'] * float(P_C)) / MySumC,
            'P_' + feature_name + '_' + feature_value + '_NC': (df2[df2[feature_name] == feature_value]['impression'] * float(P_NC)) / MySumNC}
    flag = 1  # set the flag to 1 because we don't need to compute MySumC, MySumNC, P_C & P_NC again
Question:
It looks like THIS loop is the killer here. Also, intuitively, looping over a dataframe is BAD practice. How can I rewrite this, perhaps using map/apply?
for feature_value in df2[feature_name]:
    Mydict[feature_name + '_' + feature_value] = {
        'P_' + feature_name + '_' + feature_value + '_C': (df2[df2[feature_name] == feature_value]['click'] * float(P_C)) / MySumC,
        'P_' + feature_name + '_' + feature_value + '_NC': (df2[df2[feature_name] == feature_value]['impression'] * float(P_NC)) / MySumNC}
What I need in Mydict, which is a hash storing each feature name and each feature value, is something like:
{'day_custom_Mon':{'P_day_custom_Mon_C': 0.787, 'P_day_custom_Mon_NC': 0.556},
 'day_custom_Tue':{'P_day_custom_Tue_C': 0.887, 'P_day_custom_Tue_NC': 0.156},
 'day_custom_Wed':{'P_day_custom_Wed_C': 0.087, 'P_day_custom_Wed_NC': 0.167},
 'day_custom_Thu':{'P_day_custom_Thu_C': 0.947, 'P_day_custom_Thu_NC': 0.196},
 'is_tablet_phone_True':{'P_is_tablet_phone_True_C': 0.787, 'P_is_tablet_phone_True_NC': 0.066},
 'is_tablet_phone_False':{'P_is_tablet_phone_False_C': 0.787, 'P_is_tablet_phone_False_NC': 0.077},
 ... and so on ...
PS: I just made up those float numbers, but you get the point.
Also, because I will later serialize this file & pass it to Redis directly, for other systems to feed on it in a cron-job manner, I need to preserve some sort of dynamic naming.
What I tried:
Since I am reading feature_name as
feature_name = filehandle.split("/")[-1].split(".")[0]  # thereby abstracting & creating variables dynamically
def funct1(row):
    return row[feature_name]

def funct2(row):
    return row['click']

def funct3(row):
    return row['impression']
then...
(df2.apply(funct2, axis=1) * float(P_C)) / MySumC and (df2.apply(funct3, axis=1) * float(P_NC)) / MySumNC give me both the values I need for a feature_value (say Mon, Tue, Wed, and so on) of a feature_name (say, day_custom).
I also know that df2.apply(funct1, axis=1) contains part of my custom "names" (i.e. the feature values); how would I then build these names using map/apply?
I.e. I will have the values, but how would I create the "key" 'P_'+feature_name+'_'+feature_value+'_C', since the feature value after apply is returned as a Series object?
Check out the following recipe, which does exactly what you want using only data frame manipulations. I also simplified the actual frequency calculation a bit ;)
#set the feature name values as the index of df2
df2.set_index(feature_name, inplace=True)
#This is what df2.set_index() looks like:
# click impression
#day_custom
#Fri 9917 3163
#Mon 2566 3818
#Sat 8725 7753
#Sun 6938 8642
#Thu 6136 2556
#Tue 5234 2356
#Wed 9463 9433
#rename the index of your data frame
df2.rename(index=lambda x:"%s_%s"%('day_custom', x), inplace=True)
#compute the total sum of your data frame entries
totsum = float(df2.values.sum())
#use apply to multiply every data frame element by the total sum
df2 = df2.applymap(lambda x:x/totsum)
#transpose the data frame to have the following shape
#day_custom day_custom_Fri day_custom_Mon ...
#click 0.102019 0.037468 ...
#impression 0.087661 0.045886 ...
#
#
dftranspose = df2.T
# template kw for formatting
templatekw = {'click':"P_%s_C", 'impression':"P_%s_NC"}
# build a list of small data frames with correct index names P_%s_NC etc
dflist = [dftranspose[[col]].rename(lambda x:templatekw[x]%col) for col in dftranspose]
#use the concatenate function to produce a sparse dictionary
MyDict= pd.concat(dflist).to_dict()
Instead of assigning to MyDict at the end, you can use the update-method during the loop.
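A minimal sketch of that idea (the per-file dictionaries below are made up; in the recipe above each one would come from pd.concat(dflist).to_dict()):
MyDict = {}
# each pass through the file loop would produce one small per-file dictionary
for perfile in ({'day_custom_Mon': {'P_day_custom_Mon_C': 0.1, 'P_day_custom_Mon_NC': 0.2}},
                {'is_tablet_phone_TRUE': {'P_is_tablet_phone_TRUE_C': 0.3, 'P_is_tablet_phone_TRUE_NC': 0.4}}):
    # update merges the new keys in instead of overwriting the whole dictionary
    MyDict.update(perfile)
print(MyDict)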
For understanding the comments below, here is my original answer:
Try to use a pivot_table:
def clickfunc(x):
    return np.sum(x) * P_C / MySumC

def impressionfunc(x):
    return np.sum(x) * P_NC / MySumNC

newtable = df2.pivot_table(['click', 'impression'], 'feature_name',
                           aggfunc=[clickfunc, impressionfunc])
#transpose the table for the dictionary to have the right form
newtable = newtable.T
#to_dict functionality already gives the correct result
MyDict = newtable.to_dict()
#rename by copying
copydict = {}
for feature_value, subdict in MyDict.items():
    word = feature_name + "_" + feature_value
    copydict[word] = {'P_' + word + '_C': subdict['click'],
                      'P_' + word + '_NC': subdict['impression']}
This gives you the result you want in copydict.
itertuples() is what worked for me (it worked at light speed), though it still doesn't use the map/apply approach that I so much wanted to see. itertuples on a pandas dataframe returns the whole row, so I no longer have to do df2[df2[feature_name]==feature_value]['click'] - be aware that this matching by value is not only expensive, but also undesired, since it may return a series if there are duplicate rows. itertuples solves that problem elegantly, though I then need to access the individual objects/columns by integer indexes, which means less reusable code. I could abstract this, but it won't be like accessing by column names, the status quo.
for row in df2.itertuples():
    Mydict[feature_name + '_' + str(row[1])] = {
        'P_' + feature_name + '_' + str(row[1]) + '_C': (row[2] * float(P_C)) / MySumC,
        'P_' + feature_name + '_' + str(row[1]) + '_NC': (row[3] * float(P_NC)) / MySumNC}
Note that I am accessing each column in the row by row[1], row[2] and the like. For example, a row looks like (0, u'Fri', 77592, 7029703).
After this, dict(Mydict) gives me:
{'day_custom_Thu': {'P_day_custom_Thu_NC': 0.18345372640838162, 'P_day_custom_Thu_C': 0.0019559423132143377}, 'day_custom_Mon': {'P_day_custom_Mon_C': 0.0011466875948906617, 'P_day_custom_Mon_NC': 0.099300235316209587}, 'day_custom_Sat': {'P_day_custom_Sat_NC': 0.14336163246883712, 'P_day_custom_Sat_C': 0.0017354517827023852}, 'day_custom_Tue': {'P_day_custom_Tue_C': 0.001454726996987919, 'P_day_custom_Tue_NC': 0.1203925662982053}, 'day_custom_Sun': {'P_day_custom_Sun_NC': 0.13239618235343156, 'P_day_custom_Sun_C': 0.0017488722589598259}, 'is_tablet_phone_TRUE': {'P_is_tablet_phone_TRUE_NC': 0.11107365073163174, 'P_is_tablet_phone_TRUE_C': 0.00046690100046229593}, 'day_custom_Wed': {'P_day_custom_Wed_NC': 0.12467127727567069, 'P_day_custom_Wed_C': 0.0013566522616712882}, 'day_custom_Fri': {'P_day_custom_Fri_NC': 0.1849842396242351, 'P_day_custom_Fri_C': 0.0020418070466026303}, 'is_tablet_phone_FALSE': {'P_is_tablet_phone_FALSE_NC': 0.74447539516197614, 'P_is_tablet_phone_FALSE_C': 0.0098789704610580936}}
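As a side note (not what the code above used, where rows came back as plain tuples): on newer pandas versions itertuples yields namedtuples, so columns can be read by name. A self-contained sketch of that idea, with a made-up miniature data frame:
import pandas as pd

df2 = pd.DataFrame({'day_custom': ['Fri', 'Mon'],
                    'click': [77592, 43576],
                    'impression': [7029703, 3773571]})
feature_name = 'day_custom'
for row in df2.itertuples(index=False):
    # columns are available as attributes, so no integer indexes are needed
    feature_value = getattr(row, feature_name)
    print(feature_value, row.click, row.impression)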

How to load 2D array from a text(csv) file into Octave?

Consider the following text(csv) file:
1, Some text
2, More text
3, Text with comma, more text
How to load the data into a 2D array in Octave? The number can go into the first column, and all text to the right of the first comma (including other commas) goes into the second text column.
If necessary, I can replace the first comma with a different delimiter character.
AFAIK you cannot put strings of different sizes into an array. You need to create a so-called cell array.
A possible way to read the data from your question, stored in a file Test.txt, into a cell array is:
t1 = textread("Test.txt", "%s", "delimiter", "\n");
for i = 1:length(t1)
j = findstr(t1{i}, ",")(1);
T{i,1} = t1{i}(1:j - 1);
T{i,2} = strtrim(t1{i}(j + 1:end));
end
Now
T{3,1} gives you 3 and
T{3,2} gives you Text with comma, more text.
After many long hours of searching and debugging, here's how I got it to work on Octave 3.2.4, using | as the delimiter (instead of a comma).
The data file now looks like:
1|Some text
2|More text
3|Text with comma, more text
Here's how to call it: data = load_data('data/data_file.csv', NUMBER_OF_LINES);
Limitation: You need to know how many lines you want to get. If you want to get all of them, you will need to write a function to count the number of lines in the file in order to initialize the cell array (see the sketch after the function below). It's all very clunky and primitive. So much for "high level languages like Octave".
Note: After the unpleasant exercise of getting this to work, it seems that Octave is not very useful unless you enjoy wasting your time writing code to do the simplest things. Better choices seem to be R, Python, or C#/Java with a machine learning or matrix library.
function all_messages = load_data(filename, NUMBER_OF_LINES)
  fid = fopen(filename, "r");
  all_messages = cell(NUMBER_OF_LINES, 2);
  counter = 1;
  line = fgetl(fid);
  while line != -1
    separator_index = index(line, '|');
    all_messages{counter, 1} = substr(line, 1, separator_index - 1); % Up to the separator
    all_messages{counter, 2} = substr(line, separator_index + 1, length(line) - separator_index); % After the separator
    counter++;
    line = fgetl(fid);
  endwhile
  fprintf("Processed %i lines.\n", counter - 1);
  fclose(fid);
end
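For the line-counting limitation mentioned above, a minimal sketch of a helper (hypothetical name count_lines) that could supply NUMBER_OF_LINES might look like this:
function num_lines = count_lines(filename)
  % count the lines by reading the file once with fgetl
  fid = fopen(filename, "r");
  num_lines = 0;
  line = fgetl(fid);
  while line != -1
    num_lines++;
    line = fgetl(fid);
  endwhile
  fclose(fid);
end
It could then be combined with the loader above, e.g. data = load_data('data/data_file.csv', count_lines('data/data_file.csv'));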