I'm keeping some basic info in a file, but 99% of the time I just need to count the number of lines in it, as efficiently as is reasonably possible.
Is there a way to get the line count directly, or do I need to loop through the file?
Read the file and treat it as a list delimited by CR/LF characters; listLen() will then give the number of lines in the file. Depending on whether you want to count empty lines, you may need to use the includeEmptyValues option.
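For comparison, the same idea outside CFML is only a few lines; here is a rough Python sketch (the filename is a placeholder), counting both with and without empty lines:

    # Read the file, split on line breaks, and count the entries.
    with open("data.txt") as f:           # "data.txt" is a placeholder path
        lines = f.read().splitlines()

    total = len(lines)                                    # empty lines included
    non_empty = sum(1 for line in lines if line.strip())  # empty lines excluded
    print(total, non_empty)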
Here is the situation: my workplace files engineering drawing PDFs by drawing number and sorts them into folders. There are 200+ of these folders, and each folder is named with the range of drawing numbers that go in it. For example:
"0.001.000 - 0.001.999"
Most of them run from x.xxx.000 - x.xxx.999; however, there are a few that are x.000.000 - x.500.999. Additionally, any with a first digit of 9 use the format 9.xxx.x00 - 9.xxx.x99. So basically, the range a number will fall into is not always consistent.
I currently have two ways I have figured out to tackle this:
(1) I take the input and use substrings to build the x.xxx. prefix, then add 000 and 999 on the end. I use if statements to handle the first-digit-9 case and the others that don't follow the usual format, since I already know which ones these are (roughly sketched below).
(2) The other method I came up with seems more elegant, but it also seems to be slower:
I get all the folder names into a list, then loop through every folder in the list with a for loop. In the loop I first take the dots out of the input and the folder name, and then do foldername.split("-"c) to get it into a min and a max. Then I use a select case to see if the input is between those two numbers. If it is, I set that as the folder to look in and exit the loop; if not, I go on to the next folder and repeat. The problem is that because it loops through 200+ folders, the process gets noticeably slow if you're looking up 10 or 20 drawing numbers.
Is there a better way to do this? Perhaps a way to go directly to the right folder without having to loop through a bunch of them for every input? I have some ideas for reducing the number of folders the loop has to go through (for example, only looping through folders with the same first digit would speed it up quite a bit in some cases), but I am not sure whether it's possible to bypass the loop entirely without using the first method, which, while simple, also seems a little brute force.
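To make method (1) concrete, it boils down to roughly the following (a Python sketch of the logic rather than my actual code; only the first-digit-9 special case is shown, and the known exceptions like x.000.000 - x.500.999 would need their own if branches):

    # Build the folder name directly from the drawing number.
    def folder_for(drawing):              # e.g. "0.001.234" or "9.123.456"
        first, middle, last = drawing.split(".")
        if first == "9":
            # 9.xxx.x00 - 9.xxx.x99: the last block is split on its first digit
            return f"{first}.{middle}.{last[0]}00 - {first}.{middle}.{last[0]}99"
        return f"{first}.{middle}.000 - {first}.{middle}.999"

    print(folder_for("0.001.234"))        # 0.001.000 - 0.001.999
    print(folder_for("9.123.456"))        # 9.123.400 - 9.123.499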
I have about 30,000 different files, all with different file name formats. I want to put together a list of "unique" file names, given that the dates etc. are replaced by generic characters/symbols.
For example:
20160105asdf_123456_CODE.txt
Would be normalized to:
YYYYMMDD*_######_XXXX.txt
Any ideas on how to do this efficiently on a large scale? I thought about parsing it out per delimiter ("_"), but I'm sure there's something a lot easier out there.
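For illustration, a regex-based sketch of this kind of normalization (the substitution rules here are only guesses at the actual naming conventions):

    import re

    def to_pattern(name):
        name = re.sub(r"^\d{8}", "YYYYMMDD", name)   # leading 8-digit date
        name = re.sub(r"\d+", "######", name)        # any remaining digit runs
        return name

    filenames = ["20160105asdf_123456_CODE.txt"]     # stand-in for the 30,000 names
    unique_patterns = {to_pattern(f) for f in filenames}
    print(unique_patterns)                           # {'YYYYMMDDasdf_######_CODE.txt'}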
Problem: Reading a data file with multiple entries on a single line
The easiest way I have found to do this is to read the whole line as a string and then use internal reads to extract the non-blank values.
~~~~~~~~~~~~
Problems with that solution:
Requires you to know the maximum length of any given line in the data file, which is often not possible,
or
Requires you to create an arbitrary, excessively long string variable, which wastes memory.
~~~~~~~~~~~~
Is there any other way of doing this?
You can directly read multiple items from one or more lines. For example:
read (5, *) a, b, c, d
will read four values spread over one or more lines.
Using a deferred-length character variable and non-advancing reads avoids the problems you mention in your question.
Continuing to parse the resulting line using internal I/O with explicit formats then avoids the potential for user "surprise" associated with the more obscure features of list-directed formatting, and allows far more scope and control over input error detection and reporting.
I have an ETL job that produces text file output, and I have to check whether that text content contains the word "error" or "bad", using Pentaho.
Is there a simple way to find this?
If you are trying to process a number of files, you can use a Get Filenames step to get all the filenames. Then, if your text files are small, you can use a Get File Content step to get each whole file as one row, then use a Java Filter or other matching step (e.g. RegEx) to search for the words.
If your text files are too big but line-based or otherwise in a fixed format (which they likely are if you used a text file output step), you can use a Text File Input step to get the lines, then a matcher step (see above) to find the words in each line. Then you can use a Filter Rows step to keep just those rows that contain the words, then Select Values to keep just the filename, then a Sort Rows on the filename, then a Unique Rows step. The result should be a list of filenames whose contents contain the search words.
This may seem like a lot of steps, but Pentaho Data Integration, or PDI (aka Kettle), is designed as a flow of steps with distinct (and very reusable) functionality. A smaller but less "PDI" approach is to write a User Defined Java Class (or other scripting) step to do all the work; that solution uses fewer steps but is not very configurable or reusable.
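To be clear about what that step chain (or a single scripting step) computes, the equivalent logic outside PDI is roughly this plain-Python sketch, with placeholder paths and search words:

    import glob
    import re

    # Collect the names of files whose contents mention "error" or "bad".
    pattern = re.compile(r"\b(error|bad)\b", re.IGNORECASE)
    matching_files = []
    for path in glob.glob("/path/to/etl/output/*.txt"):   # placeholder path
        with open(path, errors="replace") as f:
            if any(pattern.search(line) for line in f):
                matching_files.append(path)
    print(matching_files)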
If you're writing these files out yourself, then don't you already know the content? So scan the fields at the point where you already have them in memory.
If you're trying to see if Pentaho has written an error to the file, then you should use error handling on the output step.
Finally, PDI is not a text-searching tool. If you really need to do this, then your best bet is probably good old grep.
I wish to get some expert advice on this problem.
I have two text files, one very large (~GB) and the other small (~MB). Each file essentially holds one piece of information per line. I can say that the bigger file has a subset of the information about the smaller file. Each line in the files is organized as a tuple of fields separated by spaces, and the diff is found by looking at one or more of the columns in the two files. Both files are sorted on one such column (the document id).
I implemented this by keeping an index of document id and line number, and doing a random access to that line in the larger file to start the diff. But this method is slow. I would like to know of a better mechanism for this scenario.
Thanks in advance.
If the files are known to be sorted in the same order by the same key, and the lines that share a common key are expected to match exactly, then comm is probably what you want - it has flags to allow you to show only the lines that are common between two files, or the lines that are in one file but not the other.
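If whole-line comparison is not enough (say you only want to compare the document-id column), the same single-pass merge idea that comm relies on can be sketched directly; this assumes both files are sorted ascending on that column and that the keys compare correctly as strings (KEY_COL and the paths are placeholders):

    KEY_COL = 0   # placeholder: index of the document-id column

    def keys(path):
        # Yield the key column of each non-blank line, in file order.
        with open(path) as f:
            for line in f:
                fields = line.split()
                if fields:
                    yield fields[KEY_COL]

    def only_in_first(path_a, path_b):
        # Single pass over two sorted files: yield keys in path_a missing from path_b.
        b_keys = keys(path_b)
        current_b = next(b_keys, None)
        for a in keys(path_a):
            while current_b is not None and current_b < a:
                current_b = next(b_keys, None)
            if current_b != a:
                yield a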