Clojure way of reading large files and transforming data therein - file-io

I am processing a Subrip subtitles file which is quite large and need to process it one subtitle at a time. In Java, to extract the subtitles from file, I would write a method with following signature:
Iterator<Subtitle> fromSubrip(final Iterator<String> lines);
The use of Iterator gives me two benefits:
The file is never in memory in its entirety, nor is any transformed stage of it.
An abstraction wherein I can loop over a collection of Subtitle objects without the memory overhead.
Since iterators are by nature imperative and mutable, they're probably not idiomatic in Clojure. So what is the Clojure way to deal with this sort of situation?

As Vladimir said, you need to handle the laziness and file closing correctly. Here's how I did it, as shown in "Read a very large text file into a list in clojure":
(defn lazy-file-lines
  "Open a (probably large) file and make it available as a lazy seq of lines."
  [filename]
  (letfn [(helper [rdr]
            (lazy-seq
              (if-let [line (.readLine rdr)]
                (cons line (helper rdr))
                (do (.close rdr) nil))))]
    (helper (clojure.java.io/reader filename))))
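With a lazy seq of lines in hand, the subtitle extraction can stay lazy as well. Here is a minimal sketch of the idea; the file name, parse-subtitle, and the blank-line grouping are assumptions standing in for real Subrip parsing:

(require '[clojure.string :as str])

;; Placeholder parser -- replace with real Subrip parsing of one block.
(defn parse-subtitle [block-lines]
  {:raw (vec block-lines)})

(defn subtitles
  "Lazily turn a seq of lines into a seq of subtitle blocks."
  [lines]
  (->> lines
       (partition-by str/blank?)        ; split at blank separator lines
       (remove #(str/blank? (first %))) ; drop the separator groups
       (map parse-subtitle)))

;; Usage: only the block currently being processed is held in memory.
(doseq [sub (subtitles (lazy-file-lines "movie.srt"))]
  (println (:raw sub)))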

Read all the files from a directory in a lazy way, using a go block and a channel.
Code:
(ns user
  (:require [clojure.core.async :as async :refer :all
             :exclude [map into reduce merge partition partition-by take]]))

(defn read-dir [dir]
  (let [directory (clojure.java.io/file dir)
        files (filter #(.isFile %) (file-seq directory))
        ch (chan)]
    (go
      (doseq [file files]
        (with-open [rdr (clojure.java.io/reader file)]
          (doseq [line (line-seq rdr)]
            (>! ch line))))
      (close! ch))
    ch))
Invoke:
(def aa "D:\\Users\\input")

(let [ch (read-dir aa)]
  (loop []
    (when-let [line (<!! ch)]
      (println line)
      (recur))))
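If you just want to drain the channel into a collection instead of printing each line, core.async's into can be used; a sketch (note that this realizes every line in memory at once, giving up the streaming benefit):

(let [ch (read-dir aa)]
  ;; async/into returns a channel that delivers the whole collection
  (<!! (async/into [] ch)))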
================
Reify the Iterable interface; this can be used from Java.
MyFiles.clj:
(ns user
  ;; :name MyFiles makes the generated class match the Java usage below
  (:gen-class :name MyFiles
              :methods [#^{:static true} [readDir [String] Iterable]])
  (:require [clojure.core.async :as async :refer :all
             :exclude [map into reduce merge partition partition-by take]]))

(defn -readDir [dir]
  (def i nil)  ; top-level var used as a one-element buffer for the iterator
  (let [ch (read-dir dir)
        it (reify java.util.Iterator
             (hasNext [this]
               (alter-var-root #'i (fn [_] (<!! ch)))
               (not (nil? i)))
             (next [this] i))
        itab (reify Iterable
               (iterator [this] it))]
    itab))
Java code:
for (Object line : MyFiles.readDir("/dir")) {
    System.out.println(line);
}

You can use lazy sequences for this, for example, line-seq.
Be careful, however, that the sequence returned by line-seq (and by other functions that build lazy sequences on top of an external resource) never leaks out of the with-open scope: once the source is closed, further reading from the lazy sequence will throw an exception.
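A small sketch of the pitfall and two usual fixes (the file argument is hypothetical):

(require '[clojure.java.io :as io])

;; Broken: the lazy seq escapes with-open, so the reader is closed
;; before most of the lines are realized by the caller.
(defn lines-broken [file]
  (with-open [rdr (io/reader file)]
    (line-seq rdr)))   ; realizing the rest of the result throws "Stream closed"

;; Safe but eager: force the whole seq while the reader is still open.
(defn lines-eager [file]
  (with-open [rdr (io/reader file)]
    (doall (line-seq rdr))))

;; Safer for large files: do the processing inside with-open.
(defn count-lines [file]
  (with-open [rdr (io/reader file)]
    (count (line-seq rdr))))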

Related

How does read-line work in Lisp when reaching eof?

Context:
I have a text file called fr.txt with 3 columns of text in it:
65 A #\A
97 a #\a
192 À #\latin_capital_letter_a_with_grave
224 à #\latin_small_letter_a_with_grave
etc...
I want to create a function to read the first (and eventually the third one too) column and write it into another text file called alphabet_code.txt.
So far I have this function:
(defun alphabets ()
  (setq source (open "fr.txt" :direction :input :if-does-not-exist :error))
  (setq code (open "alphabet_code.txt" :direction :output
                   :if-does-not-exist :create :if-exists :supersede))
  (loop
    (setq ligne (read-line source nil nil))
    (cond
      ((equal ligne nil) (return))
      (t (print (read-from-string ligne) code))))
  (close code)
  (close source))
My problems:
I don't really understand how the parameters of the read-line function work. I have read this doc, but it's still very obscure to me. If someone had some very simple examples, that would help.
With the current code, I get this error: *** - read: input stream #<input string-input-stream> has reached its end even if I change the nil nil in (read-line source nil nil) to other values.
Thanks for your time!
Your questions
read-line optional arguments
read-line accepts 3 optional arguments:
eof-error-p: what to do on EOF (default: error)
eof-value: what to return instead of the error when you see EOF
recursive-p: are you calling it from your print-object method (forget about this for now)
E.g., when the stream is at EOF,
(read-line stream) will signal the end-of-file error
(read-line stream nil) will return nil
(read-line stream nil 42) will return 42.
Note that (read-line stream nil) is the same as (read-line stream nil nil) but people usually still pass the second optional argument explicitly.
eof-value of nil is fine for read-line because nil is not a string and read-line only returns strings.
Note also that in case of read the second optional argument is, traditionally, the stream itself: (read stream nil stream). It's quite convenient.
Error
You are getting the error from read-from-string, not read-line, because, apparently, you have an empty line in your file.
I know that because the error mentions string-input-stream, not file-stream.
Your code
Your code is correct functionally, but very wrong stylistically.
You should use with-open-file whenever possible.
You should not use print in code, it's a weird legacy function mostly for interactive use.
You can't create local variables with setq - use let or other equivalent forms (in this case, you never need let! :-)
Here is how I would re-write your function:
(defun alphabets (input-file output-file)
  (with-open-file (source input-file)
    (with-open-file (code output-file :direction :output :if-exists :supersede)
      (loop for line = (read-line source nil nil)
            as num = (and line (parse-integer line :junk-allowed t))
            while line do
              (when num
                (write num :stream code)
                (write-char #\Newline code))))))
(alphabets "fr.txt" "alphabet_code.txt")
See the docs:
loop: for/as, while, do
write, write-char
parse-integer
Alternatively, instead of (when num ...) I could have used the corresponding loop conditional.
Also, instead of write+write-char I could have written (format code "~D~%" num).
Note that I do not pass those of your with-open-file arguments that are identical to the defaults.
The defaults are set in stone, and the less code you have to write and your reader has to read, the smaller the chance of an error.

Is clojure's read file structure, i.e. with-open and clojure.java.io/reader, efficient enough for frequent access?

Suppose I write a function to parse data from a txt file with with-open and clojure.java.io/reader, and then another function that calls the reader function multiple times in order to process the data, e.g.
(defn grabDataFromFile [file patternString]
  (let [data (atom [])]
    (with-open [rdr (clojure.java.io/reader file)]
      (doseq [line (line-seq rdr)]
        (if (re-matches (re-pattern patternString) line)
          (swap! data conj line))))
    @data))
(defn myCalculation [file]
  (let [data1 (grabDataFromFile file "pattern1")
        data2 (grabDataFromFile file "pattern2")
        data3 (grabDataFromFile file "pattern3")]
    ;; calculations or processing of data1, data2, data3 ...
    ))
My question is: inside this myCalculation function, is the underlying code smart enough to open the file just once with the Clojure reader and get all the data needed in one shot, or does it open and close the file as many times as grabDataFromFile is called (in this example, 3)?
A follow-up question: if the reader is not that smart and I have to keep the "parser" code separate from the "processing" code, what can I do to speed things up?
grabDataFromFile will open and close the reader (on exit from with-open) every time it is called. The underlying code cannot be that smart: a function cannot detect the context of its caller without some explicitly provided information.
Make grabDataFromFile accept another function containing your parser logic that operates on each line (or any other function you want to apply to each line):
(defn grabDataFromFile [file process-fn]
  (with-open [rdr (clojure.java.io/reader file)]
    (doseq [line (line-seq rdr)]
      (process-fn line))))

(defn myCalculation [file]
  (let [patterns [["pattern1" (atom [])]
                  ["pattern2" (atom [])]
                  ["pattern3" (atom [])]]
        pattern-fns (map (fn [[p data]]
                           (fn [line]
                             (if (re-matches (re-pattern p) line)
                               (swap! data conj line))))
                         patterns)
        pattern-fn (apply juxt pattern-fns)]
    (grabDataFromFile file pattern-fn)
    ;; perform calculations on the patterns atoms
    ))
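The final step could then turn the pattern atoms into an ordinary map; a sketch, where collect-matches is a hypothetical helper that would sit where the "perform calculations" comment is:

(defn collect-matches
  "Turn [[pattern atom-of-lines] ...] into {pattern [lines ...] ...}."
  [patterns]
  (zipmap (map first patterns)
          (map (comp deref second) patterns)))

;; e.g. (collect-matches [["pattern1" (atom ["a line"])]])
;; => {"pattern1" ["a line"]}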

Why is line-seq returning clojure.lang.Cons instead of clojure.lang.LazySeq?

According to the ClojureDocs entry for line-seq (http://clojuredocs.org/clojure_core/clojure.core/line-seq) and the accepted answer for the Stack question (In Clojure 1.3, How to read and write a file), line-seq should return a lazy seq when passed a java.io.BufferedReader.
However when I test this in the REPL, the type is listed as clojure.lang.Cons. See the code below:
=> (ns stack-question
(:require [clojure.java.io :as io]))
nil
=> (type (line-seq (io/reader "test-file.txt")))
clojure.lang.Cons
=> (type (lazy-seq (line-seq (io/reader "test-file.txt"))))
clojure.lang.LazySeq
Wrapping up the line-seq call in a lazy-seq call gives a lazy seq, but according to the docs, this shouldn't be necessary: line-seq should return a lazy seq anyway.
Note:
Inside the REPL (I'm using nrepl) it seems that lazy seqs get fully realized, so I thought perhaps it was just a quirk of the REPL; however, the same problem exists when I test it with Speclj. Besides, I don't think realizing a lazy seq has anything to do with what is going on here anyway.
EDIT:
So I went to check the source code after mobyte's answer said there is a lazy seq in the tail of the cons...
(defn line-seq
  "Returns the lines of text from rdr as a lazy sequence of strings.
  rdr must implement java.io.BufferedReader."
  {:added "1.0"}
  [^java.io.BufferedReader rdr]
  (when-let [line (.readLine rdr)]
    (cons line (lazy-seq (line-seq rdr)))))
That call to cons would explain why the type of the return value of line-seq is clojure.lang.Cons.
You don't need to "wrap" the output Cons, because it already has a lazy seq as its "tail":
(type (line-seq (io/reader "test-file.txt")))
=> clojure.lang.Cons
(type (rest (line-seq (io/reader "test-file.txt"))))
=> clojure.lang.LazySeq
(type (cons 'a (rest (line-seq (io/reader "test-file.txt")))))
=> clojure.lang.Cons
Edit.
Note: Inside the REPL (I'm using nrepl) it seems that lazy seqs get
fully realized
Not correct. You can test it:
(with-open [r (io/reader "test-file.txt")] (line-seq r))
=> IOException Stream closed java.io.BufferedReader.ensureOpen (BufferedReader.java:97)
That's because line-seq returns a lazy seq which is not fully realized, and the reader is already closed by the time the REPL tries to realize the result in order to print it. But if you realize it explicitly, it gives the normal result without any exceptions:
(with-open [r (io/reader "/home/mobyte/2")] (doall (line-seq r)))
=> ... output ...
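A quick way to see that only the head is eager (a sketch, using the same test-file.txt as above):

(with-open [r (io/reader "test-file.txt")]
  (let [s (line-seq r)]
    ;; the first line is read eagerly (it lives in the Cons head),
    ;; the rest stays an unrealized LazySeq until forced
    [(first s) (realized? (rest s)) (count (doall s))]))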

Input stream ends within an object

I want to count the number of rows in a flat file, and so I wrote the code:
(defun ff-rows (dir file)
  (with-open-file (str (make-pathname :name file
                                      :directory dir)
                       :direction :input)
    (let ((rownum 0))
      (do ((line (read-line str file nil 'eof)
                 (read-line str file nil 'eof)))
          ((eql line 'eof) rownum)
        (incf rownum)))))
However I get the error:
*** - READ: input stream
#<INPUT BUFFERED FILE-STREAM CHARACTER #P"/home/lambda/Documents/flatfile"
#4>
ends within an object
May I ask what the problem is here? I tried counting the rows; this operation is fine.
Note: Here is contents of the flat file that I used to test the function:
2 3 4 6 2
1 2 3 1 2
2 3 4 1 6
A bit shorter.
(defun ff-rows (dir file)
  (with-open-file (stream (make-pathname :name file
                                         :directory dir)
                          :direction :input)
    (loop for line = (read-line stream nil nil)
          while line count line)))
Note that you need to get the arguments for READ-LINE right. The first one is the stream; a file is not part of the parameter list.
Also, it is generally not a good idea to mix pathname handling into general Lisp functions.
(defun ff-rows (pathname)
  (with-open-file (stream pathname :direction :input)
    (loop for line = (read-line stream nil nil)
          while line count line)))
Do the pathname handling in another function or some other code. Passing pathname components to functions is usually a wrong design. Pass complete pathnames.
Using a LispWorks file selector:
CL-USER 2 > (ff-rows (capi:prompt-for-file "some file"))
27955
Even better is when all the basic I/O functions work on streams, not pathnames. That way you could count lines in a network stream, a serial line, or some other stream.
The problem, as far as I can tell, is the "file" in your (read-line ... ) call.
Based on the hyperspec, the signature of read-line is:
read-line &optional input-stream eof-error-p eof-value recursive-p
=> line, missing-newline-p
...which means that "file" is interpreted as eof-error-p, nil as eof-value and 'eof as recursive-p. Needless to say, problems ensue. If you remove "file" from the read-line call (e.g. (read-line str nil :eof)), the code runs fine without further modifications on my machine (AllegroCL & LispWorks.)
(defun ff-rows (dir file)
  (with-open-file (str (make-pathname :name file :directory dir)
                       :direction :input)
    (let ((result 0))
      (handler-case
          (loop (progn (incf result) (read-line str)))
        (end-of-file () (1- result))
        (error () result)))))
Now, of course, if you were more pedantic than I am, you could have specified exactly which kind of error you want to handle, but for this simple example it will do.
EDIT: I think #Moritz answered the question better; still, this may serve as an example of how to use the error signaled by read-line to your advantage instead of trying to avoid it.

How do you use a type outside of its own namespace in clojure?

I have a project set up with leiningen called techne. I created a module called scrub with a type in it called Scrub and a function called foo.
techne/scrub.clj:
(ns techne.scrub)

(deftype Scrub [state]
  Object
  (toString [this]
    (str "SCRUB: " state)))

(defn foo [item]
  (Scrub. "foo")
  "bar")
techne/scrub_test.clj:
(ns techne.scrub-test
  (:use [techne.scrub] :reload-all)
  (:use [clojure.test]))

(deftest test-foo
  (is (= "bar" (foo "foo"))))

(deftest test-scrub
  (is (= (Scrub. :a) (Scrub. :a))))
When I run the test, I get the error:
Exception in thread "main" java.lang.IllegalArgumentException: Unable to resolve classname: Scrub (scrub_test.clj:11)
at clojure.lang.Compiler.analyzeSeq(Compiler.java:5376)
at clojure.lang.Compiler.analyze(Compiler.java:5190)
at clojure.lang.Compiler.analyzeSeq(Compiler.java:5357)
If I remove test-scrub everything works fine. Why does :use techne.scrub 'import' the function definitions but not the type definitions? How do I reference the type definitions?
Because deftype generates a class, you will probably need to import that Java class in techne.scrub-test with (:import [techne.scrub Scrub]) in your ns definition.
I actually wrote up this same thing with respect to defrecord here:
http://tech.puredanger.com/2010/06/30/using-records-from-a-different-namespace-in-clojure/
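For reference, a sketch of what the test namespace might look like with the import added (keeping the original :use style):

(ns techne.scrub-test
  (:use [techne.scrub] :reload-all)
  (:use [clojure.test])
  (:import [techne.scrub Scrub]))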
Another thing you could do would be to define a constructor function in scrub:
(defn new-scrub [state]
  (Scrub. state))
and then you would not need to import Scrub in test-scrub.
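For example, a test placed in techne.scrub-test that goes through the factory function needs no import at all; a sketch relying on the toString defined in the question (test-scrub-factory is a made-up name):

(deftest test-scrub-factory
  ;; new-scrub is an ordinary var in techne.scrub, so :use brings it in
  (is (= "SCRUB: :a" (str (new-scrub :a)))))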
I added the import, but I get the same problem. I'm testing with the Expectations package 2.0.9, trying to import the deftype Node and the interface INode.
In core.clj:
(ns linked-list.core)

(definterface INode
  (getCar [])
  (getCdr [])
  (setCar [x])
  (setCdr [x]))

(deftype Node [^:volatile-mutable car ^:volatile-mutable cdr]
  INode
  (getCar [_] car)
  (getCdr [_] cdr)
  (setCar [_ x] (set! car x) _)
  (setCdr [_ x] (set! cdr x) _))
In core_test.clj:
(ns linked-list.core-test
  (:require [expectations :refer :all]
            [linked-list.core :refer :all])
  (:import [linked-list.core INode]
           [linked-list.core Node]))
and the output from lein autoexpect:
*************** Running tests ***************
Error refreshing environment: java.lang.ClassNotFoundException: linked-list.core.INode, compiling:(linked_list/core_test.clj:1:1)
Tests completed at 07:29:36.252
The suggestion to use a factory method, however, is a viable work-around.