How to load 2D array from a text(csv) file into Octave? - file-io

Consider the following text(csv) file:
1, Some text
2, More text
3, Text with comma, more text
How to load the data into a 2D array in Octave? The number can go into the first column, and all text to the right of the first comma (including other commas) goes into the second text column.
If necessary, I can replace the first comma with a different delimiter character.

AFAIK you cannot put stings of different size into an array. You need to create a so called cell array.
A possible way to read the data from your question stored in a file Test.txt into a cell array is
t1 = textread("Test.txt", "%s", "delimiter", "\n");
for i = 1:length(t1)
j = findstr(t1{i}, ",")(1);
T{i,1} = t1{i}(1:j - 1);
T{i,2} = strtrim(t1{i}(j + 1:end));
end
Now
T{3,1} gives you 3 and
T{3,2} gives you Text with comma, more text.

After many long hours of searching and debugging, here's how I got it to work on Octave 3.2.4. Using | as the delimiter (instead of comma).
The data file now looks like:
1|Some text
2|More text
3|Text with comma, more text
Here's how to call it: data = load_data('data/data_file.csv', NUMBER_OF_LINES);
Limitation: You need to know how many lines you want to get. If you want to get all, then you will need to write a function to count the number of lines in the file in order to initialize the cell_array. It's all very clunky and primitive. So much for "high level languages like Octave".
Note: After the unpleasant exercise of getting this to work, it seems that Octave is not very useful unless you enjoy wasting your time writing code to do the simplest things. Better choices seems to be R, Python, or C#/Java with a Machine Learning or Matrix library.
function all_messages = load_data(filename, NUMBER_OF_LINES)
fid = fopen(filename, "r");
all_messages = cell (NUMBER_OF_LINES, 2 );
counter = 1;
line = fgetl(fid);
while line != -1
separator_index = index(line, '|');
all_messages {counter, 1} = substr(line, 1, separator_index - 1); % Up to the separator
all_messages {counter, 2} = substr(line, separator_index + 1, length(line) - separator_index); % After the separator
counter++;
line = fgetl(fid);
endwhile
fprintf("Processed %i lines.\n", counter -1);
fclose(fid);
end

Related

How do you detect blank lines in Fortran?

Given an input that looks like the following:
123
456
789
42
23
1337
3117
I want to iterate over this file in whitespace-separated chunks in Fortran (any version is fine). For example, let's say I wanted to take the average of each chunk (e.g. mean(123, 456, 789) then mean(42, 23, 1337) then mean(31337)).
I've tried iterating through the file normally (e.g. READ), reading in each line as a string and then converting to an int and doing whatever math I want to do on each chunk. The trouble here is that Fortran "helpfully" ignores blank lines in my text file - so when I try and compare against the empty string to check for the blank line, I never actually get a .True. on that comparison.
I feel like I'm missing something basic here, since this is a typical functionality in every other modern language, I'd be surprised if Fortran didn't somehow have it.
If you're using so-called "list-directed" input (format = '*'), Fortran does special handling to spaces, commas, and blank lines.
To your point, there's a feature which is using the BLANK keyword with read
read(iunit,'(i10)',blank="ZERO",err=1,end=2) array
You can set:
blank="ZERO" will return a valid zero value if a blank is found;
blank="NULL" is the default behavior that skips blank/returns an error depending on the input format.
If all your input values are positive, you could use blank="ZERO" and then use the location of zero values to process your data.
EDIT as #vladimir-f has correctly pointed out, you not only have blanks in between lines, but also after the end of the numbers in most lines, so this strategy will not work.
You can instead load everything into an array, and process it afterwards:
program array_with_blanks
integer :: ierr,num,iunit
integer, allocatable :: array(:)
open(newunit=iunit,file='stackoverflow',form='formatted',iostat=ierr)
allocate(array(0))
do
read(iunit,'(i10)',iostat=ierr) num
if (is_iostat_end(ierr)) then
exit
else
array = [array,num]
endif
end do
close(iunit)
print *, array
end program
Just read each line as a character (but note Francescalus's comment on the format). Then read the character as an internal file.
program stuff
implicit none
integer io, n, value, sum
character (len=1000) line
n = 0
sum = 0
io = 0
open( 42, file="stuff.txt" )
do while( io == 0 )
read( 42, "( a )", iostat = io ) line
if ( io /= 0 .or. line == "" ) then
if ( n > 0 ) print *, ( sum + 0.0 ) / n
n = 0
sum = 0
else
read( line, * ) value
n = n + 1
sum = sum + value
end if
end do
close( 42 )
end program stuff
456.000000
467.333344
3117.00000

Octave strread can't return parsed results to an array (?)

In Octave, I am reading very large text files from disk and parsing them. The function textread() does just what I want except for the way it is implemented. Looking at the source, textread.m pulls the entire text file into memory before attempting to parse lines. If the text file is large, it fills all my free RAM (16 GB) with text and then starts saving back to disk (virtual memory), before parsing. If I wait long enough, textread() will complete, but it takes almost forever.
Notice that after parsing into a matrix of floating point values, the same data fit into memory quite easily. So I'm using textread() in an intermediate zone, where there is enough memory for the floats, but not enough memory for the same data as text.
All of that is preparation for my question, which is about strread(). The data in my text files looks like this
0.0647148 -2.0072535 0.5644875 8.6954257
0.1294296 -8.4689583 0.6567095 144.3090450
0.1941444 -9.2658037 -1.0228742 173.8027785
0.2588593 -6.5483359 -1.5767574 90.7337329
0.3235741 -0.7646807 -0.5320896 1.7357120
... and so on. There are no header lines or comments in the file.
I wrote a function that reads the file line by line, and notice the two ways I'm attempting to use strread() to parse a line of data.
function dest = readPowerSpectrumFile(filename, dest)
% read enough lines to fill destination array
[rows, cols] = size(dest);
fid = fopen(filename, 'r');
for line = 1 : rows
lstr = fgetl(fid);
% this line works, but is very brittle
[dest(line, 1), dest(line, 2), dest(line, 3), dest(line, 4)] = strread(lstr, "%f %f %f %f");
% This line doesn't work. Or anything similar I can think of.
% dest(line, 1:4) = strread(lstr, "%f %f %f %f");
endfor
fclose(fid);
endfunction
Is there an elegant way of having strread return parsed values to an array? Otherwise I'll have to write a new function any time I change the number of columns.
Thanks
Your described format is a matrix with floating point values. In this case you can just use load
d = load ("yourfile");
which is much faster than any other function. You can have a look at the used implementation in libinterp/corefcn/ls-mat-ascii.cc: read_mat_ascii_data
If you feed fprintf more values than are in its format specification, it will reapply the print statement until it's used them up:
>> fprintf("%d %d \n", 1:6)
1 2
3 4
5 6
It appears this also works with strread. If you specify only one value to read, but there are multiple on the current line, it will keep reading them and add them to a column vector. All we need to do is to assign those values to the correct row of dest:
function dest = readPowerSpectrumFile(filename, dest)
% read enough lines to fill destination array
[rows, cols] = size(dest);
fid = fopen(filename, 'r');
for line = 1 : rows
lstr = fgetl(fid);
% read all values from current line into column vector
% and store values into row of dest
dest(line,:) = strread(lstr, "%f");
% this will also work since values are assumed to be numeric by default:
% dest(line,:) = strread(lstr);
endfor
fclose(fid);
endfunction
Output:
readPowerSpectrumFile(filename, zeros(5,4))
ans =
6.4715e-02 -2.0073e+00 5.6449e-01 8.6954e+00
1.2943e-01 -8.4690e+00 6.5671e-01 1.4431e+02
1.9414e-01 -9.2658e+00 -1.0229e+00 1.7380e+02
2.5886e-01 -6.5483e+00 -1.5768e+00 9.0734e+01
3.2357e-01 -7.6468e-01 -5.3209e-01 1.7357e+00

Checking errors in my program

I'm trying to make some changes to my dictionary counter in python. I want make some changes to my current counter, but not making any progress so far. I want my code to show the number of different words.
This is what I have so far:
# import sys module in order to access command line arguments later
import sys
# create an empty dictionary
dicWordCount = {}
# read all words from the file and put them into
#'dicWordCount' one by one,
# then count the occurance of each word
you can use the Count function from collections lib:
from collections import Counter
q = Counter(fileSource.read().split())
total = sum(q.values())
First, your first problem, add a variable for the word count and one for the different words. So wordCount = 0 and differentWords = 0. In the loop for your file reading put wordCount += 1 at the top, and in your first if statement put differentWords += 1. You can print these variables out at the end of the program as well.
The second problem, in your printing, add the if statement, if len(strKey)>4:.
If you want a full example code here it is.
import sys
fileSource = open(sys.argv[1], "rt")
dicWordCount = {}
wordCount = 0
differentWords = 0
for strWord in fileSource.read().split():
wordCount += 1
if strWord not in dicWordCount:
dicWordCount[strWord] = 1
differentWords += 1
else:
dicWordCount[strWord] += 1
for strKey in sorted(dicWordCount, key=dicWordCount.get, reverse=True):
if len(strKey) > 4: # if the words length is greater than four.
print(strKey, dicWordCount[strKey])
print("Total words: %s\nDifferent Words: %s" % (wordCount, differentWords))
For your first qs, you can use set to help you count the number of different words. (Assume there is a space between every two words)
str = 'apple boy cat dog elephant fox'
different_word_count = len(set(str.split(' ')))
For your second qs, using a dictionary to help you record the word_count is ok.
How about this?
#gives unique words count
unique_words = len(dicWordCount)
total_words = 0
for k, v in dicWordCount.items():
total_words += v
#gives total word count
print(total_words)
You don't need a separate variable for counting word counts since you're using dictionary, and to count the total words, you just need to add the values of the keys(which are just counts)

Fortran: read variables that are not present in a file

I need help understanding this 50 line program
implicit none
integer maxk, maxb, maxs
parameter (maxk=6000, maxb=1000, maxs=5)
integer nk, nspin, nband, ik, is, ib
double precision e(maxb, maxs, maxk), k(maxk)
double precision ef, kmin, kmax, emin, emax
logical overflow
read(5,*) ef
read(5,*) kmin, kmax
read(5,*) emin, emax
read(5,*) nband, nspin, nk
overflow = (nband.gt.maxb) .or. (nk.gt.maxk) .or. (nspin.gt.maxs)
if (overflow) stop 'Dimensions in gnubands too small'
write(6,"(2a)") '# GNUBANDS: Utility for SIESTA to transform ',
. 'bands output into Gnuplot format'
write(6,"(a)") '#'
write(6,"(2a)") '# ',
. ' Emilio Artacho, Feb. 1999'
write(6,"(2a)") '# ------------------------------------------',
. '--------------------------------'
write(6,"(a,f10.4)") '# E_F = ', ef
write(6,"(a,2f10.4)") '# k_min, k_max = ', kmin, kmax
write(6,"(a,2f10.4)") '# E_min, E_max = ', emin, emax
write(6,"(a,3i6)") '# Nbands, Nspin, Nk = ', nband, nspin, nk
write(6,"(a)") '#'
write(6,"(a)") '# k E'
write(6,"(2a)") '# ------------------------------------------',
. '--------------------------------'
read(5,*) (k(ik),((e(ib,is,ik),ib=1,nband), is=1,nspin), ik=1,nk)
do is = 1, nspin
do ib = 1, nband
write(6,"(2f14.6)") ( k(ik), e(ib,is,ik), ik = 1, nk)
write(6,"(/)")
enddo
enddo
This is a free format Fortran file. The name of the program is gnubands and rearranges numbers in an input (which the user specifies). I would like to know how this program operates. Here is what I do not understand. The program takes input from a file, it reads
ef, kmin,kmax,emin,emax,nband,nspin,nk
However, all of these variables are not found inside the input file. I opened the input file in vi and conducted a search using /. I do not obtain any results. Nevertheless, the program appears to correctly pick all values. What is happening?
Also, I do not understand the read format
read(5,*) (k(ik),((e(ib,is,ik),ib=1,nband), is=1,nspin), ik=1,nk)
I am not familiar with the syntax and would like to know what it is saying or any references.
Some tutorial PDF of SIESTA shows that the input for gnubands.f is something like this:
whose header part is to be read by the first four read statements of gnubands.f. With this input, the variables are set as
ef = -5.018...
kmin = 0.000...
kmax = 3.338...
emin = -25.187...
emax = 143.069...
nband = 18
nspin = 1
nk = 150
by giving the input file from the standard input (assumed unit number 5) as
gfortran -o gnubands.x gnubands.f
gnubands.x < your_data_file.bands
Note that there are (and should be) no keywords like "ef" or "EF" or "Ef" (capitalization does not matter), because the numbers are directly read into the variables in gnubands.f. This is in contrast to other cases like using XML files, where (human-readable) tags or keywords are embedded in the file itself (e.g., pseudopotential files used by Quantum ESPRESSO). I guess your confusion might be coming from the use of namelist for obtaining input values, which looks like
namelist /your_inp/ a, b, c
read( funit, nml = your_inp )
with an input file
&your_inp
a = 1.0
b = "method1"
c = 77
/
In this case, the variable names (here, a, b, and c) appear literally in the input file.
Historically, 5 (in your read(5,*)) is stdin, so either
(1)you are supplying the value, when you are running the code,
or,(2) I guess when you run the SIESTA, (gnuband is a postprocessor of that) it creates a file, possibly named fort.5. Check that.

Convert cell text in progressive number

I have written this SQL in PostgreSQL environment:
SELECT
ST_X("Position4326") AS lon,
ST_Y("Position4326") AS lat,
"Values"[4] AS ppe,
"Values"[5] AS speed,
"Date" AS "timestamp",
"SourceId" AS smartphone,
"Track" as session
FROM
"SingleData"
WHERE
"OsmLineId" = 44792088
AND
array_length("Values", 1) > 4
AND
"Values"[5] > 0
ORDER BY smartphone, session;
Now I have imported the result in Matlab and I have six vectors and one cell (because the text from the UUIDs was converted in cell) all of 5710x1 size.
Now I would like convert the text in the cell, in a progressive number, like 1, 2, 3... for each different session code.
In Excel it is easy with FIND.VERT(obj, matrix, col), but I do not know how do it in Matlab.
Now I have a big cell with a lot of codes like:
ff95465f-0593-43cb-b400-7d32942023e1
I would like convert this cell in an array of numbers where at the first occurrence of
ff95465f-0593-43cb-b400-7d32942023e1 -> 1
and so on. And you put 2 when a different code appear, and so on.
OK, I have solve.
I put the single session code in a second cell C.
At this point, with a for loop, I obtain:
%% Converting of the UUIDs into integer
C = unique(session);
N = length(session);
session2 = zeros(N, 1);
for i = 1:N
session2(i) = find(strcmp(C, session(i)));
end
Thanks to all!