How to list directories faster? - optimization

I have a few situations where I need to list files recursively, but my implementations have been slow. I have a directory structure with 92784 files. find lists the files in less than 0.5 seconds, but my Haskell implementation is a lot slower.
My first implementation took a bit over 9 seconds to complete, next version a bit over 5 seconds and I'm currently down to a bit less than two seconds.
listFilesR :: FilePath -> IO [FilePath]
listFilesR path = let
isDODD "." = False
isDODD ".." = False
isDODD _ = True
in do
allfiles <- getDirectoryContents path
dirs <- forM allfiles $ \d ->
if isDODD d then
do let p = path </> d
isDir <- doesDirectoryExist p
if isDir then listFilesR p else return [d]
else return []
return $ concat dirs
The test takes about 100 megabytes of memory (+RTS -s), and the program spends around 40% in GC.
I was thinking of doing the listing in a WriterT monad with Sequence as the monoid to prevent the concats and list creation. Is it likely this helps? What else should I do?
Edit: I have edited the function to use readDirStream, and it helps keeping the memory down. There's still some allocation happening, but productivity rate is >95% now and it runs in less than a second.
This is the current version:
list path = do
de <- openDirStream path
readDirStream de >>= go de
closeDirStream de
where
go d [] = return ()
go d "." = readDirStream d >>= go d
go d ".." = readDirStream d >>= go d
go d x = let newpath = path </> x
in do
e <- doesDirectoryExist newpath
if e
then
list newpath >> readDirStream d >>= go d
else putStrLn newpath >> readDirStream d >>= go d

I think that System.Directory.getDirectoryContents constructs a whole list and therefore uses much memory. How about using System.Posix.Directory? System.Posix.Directory.readDirStream returns an entry one by one.
Also, FileManip library might be useful although I have never used it.

Profiling your code shows that most of the CPU time goes in getDirectoryContents, doesDirectoryExist and </>. This means that only changing the data structure won't help very much. If you want to match the performance of find you should use lower level functions for accessing the filesystem, probably the ones which Tsuyoshi pointed out.

One problem is that it has to construct the entire list of directory contents, before the program can do anything with them. Lazy IO is usually frowned upon, but using unsafeInterleaveIO here cut memory use significantly.
listFilesR :: FilePath -> IO [FilePath]
listFilesR path =
let
isDODD "." = False
isDODD ".." = False
isDODD _ = True
in unsafeInterleaveIO $ do
allfiles <- getDirectoryContents path
dirs <- forM allfiles $ \d ->
if isDODD d then
do let p = path </> d
isDir <- doesDirectoryExist p
if isDir then listFilesR p else return [d]
else return []
return $ concat dirs

Would it be an option to use some sort of cache system combined with the read? I was thinking of an async indexing service/thread that kept this cache up-to-date in the background, perhaps you could do the cache as a simple SQL-DB which would then give you some nice performance when doing queries against it?
Can you elaborate anything on your "project/idea" so we can come up with something alternative?
I wouldn't go for a "full index" myself as I mostly build webbased services and "resposnetime" is criticial to me, on the other hand - if its an initial way of starting up a new server I am sure the customers wouldnt mind waiting that first time. I would just store the result in the DB for later lookups.

Related

moving individual 230 pdf files to already created folders

#nmnhI'm trying to move over 200 pdf files, each to separate folders that are already created and named 2018. The destination path for each is like- GFG-0777>>2018. Each pdf has an unique GFG-0### name that matches the folders I already created that lead to the 2018 destination folders. Not sure how to iterate and get each pdf into the right folder.... :/
I've tried shutil.move which i think is best but have issues with paths I think.
import os
import shutil
srcDir = r'C:\Complete'
#print (srcDir)
dstDir = r'C:\Python27\end_dir'
dirList = os.listdir(srcDir)
for f in dirList:
fp = [f for f in dirList if ".pdf" in f] #list comprehension to iterate task (flat for loop)
for file in fp:
dst = (srcDir+"/"+file[:-4]+"/"+dstDir+"/"+"2018")
shutil.move(os.path.join(srcDir, dst, dstDir))
error: shutil.move(os.path.join(srcDir, dst, dstDir))
TypeError: move() missing 1 required positional argument: 'dst'
AFAICT you are calling
shutil.move(os.path.join(srcDir, dst, dstDir))
without a to.
According to the documentation you need to have a from and to folder.
https://docs.python.org/3/library/shutil.html#shutil.move
I guess your idea was to somehow create a string containing the dst and src :
dst = (srcDir+"/"+file[:-4]+"/"+dstDir+"/"+"2018")
What you actually want is something along this line:
dst_dir = dstDir+"/"+"2018"
src_dir = srcDir+"/"+file[:-4]
shutil.move(src_dir,dst_dir)
Above code is just for demonstration.
If this does not work you could tree or ls -la example a small part of your srcdir and dstdir and we could work something out.
#nmanh
I managed to work it out. Thanks for calling out the issue to create string with src and dst. After removing the string, I tweaked a bit more but found I had too many "file" in code. I had to make two of them "file1" and add a comma in the shutil.move between src and dst.
Thanks again
import os
import shutil
srcDir = r'C:\Complete'
#print (srcDir)
dstDir = r'C:\Python27\end_dir'
dirList = os.listdir(srcDir)
for file in dirList:
fp = [f for f in dirList if ".pdf" in f] #list comprehension to iterate task
(flat for loop)
for file in fp:
if ' ' in file: #removing space in some of pdf names noticed during fp print
file1 = file.split(' ')[0]# removing space continued
else:
file1 = file[:-4]# removing .pdf
final = dstDir+"\\"+file1+"\\2018"
print (srcDir+"\\"+file1+" "+final)
shutil.move(srcDir+"\\"+file,final)

IDL batch processing: fully automatic input selection

I need to process MODIS ocean level 2 data and I obtained an external plugin for ENVI https://github.com/dawhite/EPOC/releases. Now, I want to batch process hundreds of images for which I modified the code like the following code. The code is running fine, but I have to select the input file every time. Can anyone please help me to make the program fully automatic? I really appreciate and thanks a lot for your help!
Pro OCL2convert
dir = 'C:\MODIS\'
CD, dir
; batch processing of level 2 ocean chlorophyll data
files=file_search('*.L2_LAC_OC.x.hdf', count=numfiles)
; this command will search for all files in the directory which end with
; the specified one
counter=0
; this is a counter that tells IDL which file is being read-starts at 0
While (counter LT numfiles) Do begin
; this command tells IDL to start a loop and to only finish when the counter
; is equal to the number of files with the name specified
name=files(counter)
openr, 1, name
proj = envi_proj_create(/utm, zone=40, datum='WGS-84')
ps = [1000.0d,1000.0d]
no_bowtie = 0 ;same as not setting the keyword
no_msg = 1 ;same as setting the keyword
;OUTPUT CHOICES
;0 -> standard product only
;1 -> georeferenced product only
;2 -> standard and georeferenced products
output_choice = 2
;RETURNED VALUES
;r_fid -> ENVI FID for the standard product, if requested
;georef_fid -> ENVI FID for the georeferenced product, if requested
convert_oc_l2_data, fname=fname, output_path=output_path, $
proj=proj, ps=ps, output_choice=output_choice, r_fid=r_fid, $
georef_fid=georef_fid, no_bowtie=no_bowtie, no_msg=no_msg
print,'done!'
close, 1
counter=counter+1
Endwhile
End
Not knowing what convert_oc_l2_data does (it appears to be a program you created, there is no public documentation for it), I would say that the problem might be that the out_path variable is not defined in the rest of your program.

How to enable hints and warnings in the online REPL

I figured that I can do it on the command line REPL like so:
java -jar frege-repl-1.0.3-SNAPSHOT.jar -hints -warnings
But how can I do the same in http://try.frege-lang.org
Hints and warnings are already enabled by default. For example,
frege> f x = f x
function f :: α -> β
3: application of f will diverge.
Perhaps we can make it better by explicitly saying it as warning or hint (instead of colors distinguishing them) something like:
[Warning] 3: application of f will diverge.
and providing an option to turn them on/off.
Update:
There was indeed an issue (Thanks Ingo for pointing that out!) with showing warnings that are generated in a later phase during the compilation. This issue has been fixed and the following examples now correctly display warnings in the REPL:
frege> h x = 0; h false = 42
function h :: Bool -> Int
4: equation or case alternative cannot be reached.
frege> f false = 6
function f :: Bool -> Int
5: function pattern is refutable, consider
adding a case for true

Jython not finding file when using variable to pass file name

So heres the issue guys,
I have a very simple little program that reads in some setup details from a file (to make it reuseable for other sets of data) and stores them into variables.
It then uses one of those variables to open another file that I need to write some results to, as well as various search parameters.
When passing the variable to the .open() function, it fails saying it cant find the file, but when passing the exact same information, but as a written string instead of a variable, it works.
Is this a known problem, or am I just doing something wrong?
The code(problem bit bolded)
def urlTrawl(filename):
import urllib
read = open(getMediaPath(filename), "rt")
baseurl = read.readline()
orgurl = read.readline()
lasturlfile = read.readline()
linksfile = read.readline()
read.close()
webpage = ""
links = ""
counter = 0
lasturl = ""
nexturl = ""
url = ""
connection = ""
try:
read = open(lasturlfile, "rt")
lasturl = read.readline()
except IOError:
print "IOError"
webpage = connection.read()
connection.close()
**file = open(linksfile, "wt")**
file.close()
file = open(lasturlfile, "wt")
file.write(nexturl)
return 1
The information being passed in
http://www.questionablecontent.net/
http://www.questionablecontent.net/view.php?comic=2480
C:\\Users\\James\\Desktop\\comics\\qclast.txt
C:\\Users\\James\\Desktop\\comics\\comiclinksqc.txt
strip\"
src=\"
\"
Pevious
Next
f=\"
\"
EDIT: removed working code, to narrow down the problem area and updated code to use a direct reference rather then a relative one.
I found the problem in the end.
The problem was that it was reading in the \n at the end of each line in my details file, and of course the \n isn't anywhere in the website data I'm reading. Removing the last character of each read did the trick:
baseurl = baseurl[:-1]
orgurl = orgurl[:-1]
lasturlfile = lasturlfile[:-1]
linksfile = linksfile[:-1]
search1 = search1[:-1]
search2 = search2[:-1]
search3 = search3[:-1]
search4 = search4[:-1]
search5 = search5[:-1]
search6 = search6[:-1]
I might not be right, but I think this is what's happening.
You're saying this works fine:
file = open('C:\\Users\\James\\Desktop\\comics\\comiclinksqc.txt', "wt")
But this doesn't:
# After reading three lines
linksfile = read.readline()
file = open(linksfile, "wt")
There is a difference between these two. In the first piece of code, the double slashes are escapes. They resolve to single slashes when Python is done parsing. Like so:
>>> print 'C:\\Users\\James\\Desktop\\comics\\comiclinksqc.txt'
C:\Users\James\Desktop\comics\comiclinksqc.txt
But when you read that same text from the file, there's no parsing of the text. That means that the string stored in your variable still has double slashes.
Try this command out. I bet it fails the same way as when you read the file path in:
file = open(r'C:\\Users\\James\\Desktop\\comics\\comiclinksqc.txt', "wt")
The r stands for "raw"; it prevents Python from interpreting escape characters. If it does fail the same way, then the double slashes are your problem. To fix it, in your file, you need to remove the double slashes:
C:\Users\James\Desktop\comics\comiclinksqc.txt
This isn't a problem in CPython 2.7; I'm betting it's not in 3.x, either. CPython interprets double slashes in some manner that they are effectively a single slash (in most cases, at least). So this may be an issue specific to Jython.
If unclean paths cause errors, you might want to consider doing something to clean them up. os.path.abspath might be helpful, although I can't say if Jython's implementation works as well as CPython's:
>>> print os.path.abspath(r'C:\\Users\\James\\Desktop\\comics\\comiclinksqc.txt')
C:\Users\James\Desktop\comics\comiclinksqc.txt
>>> print os.path.abspath(r'C:/Users/James/Desktop/comics/comiclinksqc.txt')
C:\Users\James\Desktop\comics\comiclinksqc.txt
I am trying to create a script which will list the datasource name and will show the connection pool utilization(pooled connection, Free Pool Size ext.)
But facing the issue when list the connection pool, if the data source name having space in between the name like "Default Datasource"
then it is listing list "Default Datasource and it is not parsing the datasource name correctly to the next function.
datasource = AdminConfig.list('DataSource', AdminConfig.getid( '/Cell:'
+ cell + '/')).splitlines()
for datasourceID in datasource:
datasourceName = datasourceID.split('(')[0]
print datasourceName
Request you to help if possible drop me mail at bubuldey#gmail.com
Regards,
Bubul

How to prevent common sub-expression elimination (CSE) with GHC

Given the program:
import Debug.Trace
main = print $ trace "hit" 1 + trace "hit" 1
If I compile with ghc -O (7.0.1 or higher) I get the output:
hit
2
i.e. GHC has used common sub-expression elimination (CSE) to rewrite my program as:
main = print $ let x = trace "hit" 1 in x + x
If I compile with -fno-cse then I see hit appearing twice.
Is it possible to avoid CSE by modifying the program? Is there any sub-expression e for which I can guarantee e + e will not be CSE'd? I know about lazy, but can't find anything designed to inhibit CSE.
The background of this question is the cmdargs library, where CSE breaks the library (due to impurity in the library). One solution is to ask users of the library to specify -fno-cse, but I'd prefer to modify the library.
How about removing the source of the trouble -- the implicit effect -- by using a sequencing monad that introduces that effect? E.g. the strict identity monad with tracing:
data Eval a = Done a
| Trace String a
instance Monad Eval where
return x = Done x
Done x >>= k = k x
Trace s a >>= k = trace s (k a)
runEval :: Eval a -> a
runEval (Done x) = x
track = Trace
now we can write stuff with a guaranteed ordering of the trace calls:
main = print $ runEval $ do
t1 <- track "hit" 1
t2 <- track "hit" 1
return (t1 + t2)
while still being pure code, and GHC won't try to get to clever, even with -O2:
$ ./A
hit
hit
2
So we introduce just the computation effect (tracing) sufficient to teach GHC the semantics we want.
This is extremely robust to compile optimizations. So much so that GHC optimizes the math to 2 at compile time, yet still retains the ordering of the trace statements.
As evidence of how robust this approach is, here's the core with -O2 and aggressive inlining:
main2 =
case Debug.Trace.trace string trace2 of
Done x -> case x of
I# i# -> $wshowSignedInt 0 i# []
Trace _ _ -> err
trace2 = Debug.Trace.trace string d
d :: Eval Int
d = Done n
n :: Int
n = I# 2
string :: [Char]
string = unpackCString# "hit"
So GHC has done everything it could to optimize the code -- including computing the math statically -- while still retaining the correct tracing.
References: the useful Eval monad for sequencing was introduced by Simon Marlow.
Reading the source code to GHC, the only expressions that aren't eligible for CSE are those which fail the exprIsBig test. Currently that means the Expr values Note, Let and Case, and expressions which contain those.
Therefore, an answer to the above question would be:
unit = reverse "" `seq` ()
main = print $ trace "hit" (case unit of () -> 1) +
trace "hit" (case unit of () -> 1)
Here we create a value unit which resolves to (), but which GHC can't determine the value for (by using a recursive function GHC can't optimise away - reverse is just a simple one to hand). This means GHC can't CSE the trace function and it's 2 arguments, and we get hit printed twice. This works with both GHC 6.12.4 and 7.0.3 at -O2.
I think you can specify the -fno-cse option in the source file, i.e. by putting a pragma
{-# OPTIONS_GHC -fno-cse #-}
on top.
Another method to avoid common subexpression elimination or let floating in general is to introduce dummy arguments. For example, you can try
let x () = trace "hi" 1 in x () + x ()
This particular example won't necessarily work; ideally, you should specify a data dependency via dummy arguments. For instance, the following is likely to work:
let
x dummy = trace "hi" $ dummy `seq` 1
x1 = x ()
x2 = x x1
in x1 + x2
The result of x now "depends" on the argument dummy and there is no longer a common subexpression.
I'm a bit unsure about Don's sequencing monad (posting this as answer because the site doesn't let me add comments). Modifying the example a bit:
main :: IO ()
main = print $ runEval $ do
t1 <- track "hit 1" (trace "really hit 1" 1)
t2 <- track "hit 2" 2
return (t1 + t2)
This gives us the following output:
hit 1
hit 2
really hit 1
That is, the first trace fires when the t1 <- ... statement is executed, not when t1 is actually evaluated in return (t1 + t2). If we define the monadic bind operator as
Done x >>= k = k x
Trace s a >>= k = k (trace s a)
instead, the output will reflect the actual evaluation order:
hit 1
really hit 1
hit 2
That is, the traces will fire when the (t1 + t2) statement is executed, which is (IMO) what we really want. For example, if we change (t1 + t2) to (t2 + t1), this solution produces the following output:
hit 2
really hit 2
hit 1
The output of the original version remains unchanged, and we don't see when our terms are really evaluated:
hit 1
hit 2
really hit 2
Like the original solution, this also works with -O3 (tested on GHC 7.0.3).