Reading lines from a data file - file-io

Problem: Reading a data file with multiple entries on a single line
The easiest way I have found to do this is to read the whole line as string and then use internal reads to extract the non-blank values.
~~~~~~~~~~~~
Problems with that solution:
Requires you to know the maximum length of any given line in the data file, which is often not possible,
or
Requires you to declare an arbitrarily long string variable, which wastes memory.
~~~~~~~~~~~~
Is there any other way of doing this?

You can directly read multiple items from one or more lines. For example:
read (5, *) a, b, c, d
will read four values, spread across one or more lines.

Using deferred length character and non-advancing reads avoids the problems you mention in your question.
Parsing the resulting line with internal I/O and explicit formats then avoids the potential for user "surprise" associated with the more obscure features of list-directed formatting, and allows far more scope and control over input error detection and reporting.
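As a rough sketch of that technique (the unit number, chunk size, and procedure name here are illustrative, not part of the original answer):

module line_io
  use iso_fortran_env, only: iostat_eor
  implicit none
contains
  ! Read one whole record of unknown length into a deferred-length string.
  subroutine get_line(unit, line, iostat)
    integer, intent(in) :: unit
    character(:), allocatable, intent(out) :: line
    integer, intent(out) :: iostat
    character(256) :: chunk   ! fixed-size work buffer, not a limit on line length
    integer :: nread
    line = ''
    do
      read (unit, '(a)', advance='no', iostat=iostat, size=nread) chunk
      if (iostat /= 0 .and. iostat /= iostat_eor) return   ! end of file or error
      line = line // chunk(:nread)
      if (iostat == iostat_eor) then
        iostat = 0    ! end of record: the whole line is now in `line`
        return
      end if
    end do
  end subroutine get_line
end module line_io

Once the line is in hand, an internal read such as read (line, '(4f12.4)') a, b, c, d (or a list-directed read (line, *) a, b, c, d) does the actual parsing, with iostat/iomsg available for error handling.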

Related

How to overcome the 2GB limit for a single column value in Spark

I am ingesting json files where the entire data payload is on a single row, single column.
This column is an array of complex objects that I want to explode so that each object represents a row.
I'm using a Databricks notebook and spark.read.json() to load the file contents to a dataframe.
This results in a dataframe with a single row, with the data payload in a single column (let's call it obj_array).
The problem I'm having is that the obj_array column is greater than 2GB so Spark cannot handle the explode() function.
Are there any alternatives to splitting the json file into more manageable chunks?
Thanks.
Code example...
#set path to file
jsonFilePath='/mnt/datalake/jsonfiles/filename.json'
#read file to dataframe
#entitySchema is a schema struct previously extracted from a sample file
rawdf=spark.read.option("multiline","true").schema(entitySchema).format("json").load(jsonFilePath)
#rawdf contains a single row of file_name, timestamp_created, and obj_array
#obj_array is an array field containing the entire data payload (>2GB)
explodeddf=rawdf.selectExpr("file_name","timestamp_created","explode(obj_array) as data")
#this column explosion fails due to obj_array exceeding 2GB
When you hit limits like this, you need to reframe the problem. Spark is choking on 2 GB in a single column, and that's a pretty reasonable choke point. Why not write your own custom data reader that emits records in whatever shape you deem reasonable? (That is likely the best solution if you want to leave the files as they are.)
You could probably read all the records in with a simple text read and then "paint" in the columns afterwards. You could use SQL tricks to try to expand and fill rows with window functions/lag.
You could do file level cleaning/formatting to make the data more manageable for the out of the box tools to work with.
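One hedged sketch of that last idea, run once on the driver before involving Spark (the paths, the /dbfs mount view, and the field names such as obj_array are assumptions taken from the question, and the raw document has to fit in driver memory):

import json

src = "/dbfs/mnt/datalake/jsonfiles/filename.json"        # driver-local view of the mount (assumed)
dst = "/dbfs/mnt/datalake/jsonfiles/filename_lines.json"

with open(src, "r") as f:
    doc = json.load(f)          # loads the whole raw document, so driver memory must cover it

with open(dst, "w") as out:
    for obj in doc["obj_array"]:
        # one object per line: Spark can then parallelise records instead of holding one >2GB cell
        out.write(json.dumps({
            "file_name": doc.get("file_name"),
            "timestamp_created": doc.get("timestamp_created"),
            "data": obj,
        }) + "\n")

The rewritten file is ordinary JSON Lines, so the default (non-multiline) spark.read.json can load it, and no explode() over a single giant column is needed.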

Text editor with multiple undo/redo in C

I'm starting a school project: we have to code an efficient text editor in C. To control it, I can use:
(row1,row2)c
(row1,row2)d
(row1,row2)p
(number)u
(number)r
These commands are used to change the text between row1 and row2 (the c), delete the text between row1 and row2 (the d; the text is replaced with a single dot), print to stdout the rows between row1 and row2 (the p), and undo (number) times or redo (number) times (these last two commands don't affect print, just c and d).
To start I was thinking what data structure I can use.
I thought, for the rows, of using a singly linked list holding the row number and a second list for the text itself.
This is because the code has to be efficient in both time and space.
But I can't find a good way to implement undo/redo in my case: I was thinking of creating two stacks, one for undo and one for redo. Every command I give is pushed onto the undo stack and, if I undo something, I pop the most recent action off the undo stack and push it onto the redo stack.
But I don't know how to represent these commands: I was thinking of saving a complementary command, so I can run it and return to the previous state. Then, when I undo, I create the complementary command in the redo stack, and I clear that stack on every new command to free space.
I hope it's understandable; I just want your opinion about this possible structure.
NB: I can theoretically code only in C11 with stdlib and stdio, but I can copy and modify other libraries' functions if needed.
---UPDATE---
I was wondering whether it would be better to use a red-black tree to keep the row structure, because it would take O(log(n)) to find and edit the X-th row, instead of O(n).
The only problem is that, when I have to change many rows in a single command (e.g. 1,521c), it takes longer to find every row.
Maybe a sort of hybrid could be a good choice: I use the RBT structure to find the address of the start row, then I use the list structure to find the others. So every node of the tree holds two pointers for the RBT and one pointer for the list.
Your design ideas are spot on.
The part still needed is how to represent the undo and redo entries.
A redo entry could be a struct that indicates what span of text to replace and the text to replace it with. A "span" here gives the offsets into the text, and since it could be an empty span (just a position), that suggests using a half-open interval [start .. end) or a start and length. That can express any single text-change operation. In theory the replacement text could be a zero-length string; even if the current commands never produce one, anticipate that future assignments may add feature requests that do.
An undo entry can be the same struct, describing as you noted the complementary text replacement operation.
The other design decision is how to represent the document text. The simplest thing is a sequential buffer of characters, in which case every insertion requires moving all following text downwards after ensuring the memory buffer is large enough.
An alternative is a list of lines of text, each line being a separate memory node. That way, inserting, deleting, and replacing lines doesn't have to move the bulk of the text around in memory, just some of the line node pointers. Furthermore, for line replacement commands the redo/undo entries can just list which range of line pointers to replace with other line pointers.
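A minimal sketch of those two representations (names and fields are illustrative, not a reference implementation):

#include <stddef.h>

/* The document: an array of pointers to heap-allocated lines, so replacing a
 * range of rows shuffles pointers rather than moving the bulk of the text. */
typedef struct {
    char   **lines;   /* lines[i] holds row i + 1 */
    size_t   count;   /* number of rows currently in the document */
} document_t;

/* One edit (a c or d command) and, equally, its antidote for undo/redo:
 * "replace the half-open row range [start, end) with these lines". */
typedef struct {
    size_t   start;      /* first affected row (0-based) */
    size_t   end;        /* one past the last affected row */
    char   **new_lines;  /* rows to install in place of [start, end) */
    size_t   new_count;  /* may be 0: a pure deletion */
} edit_t;

Applying an edit_t then naturally produces the complementary edit_t (the old range and the old line pointers), which is exactly what goes onto the opposite stack.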
Suppose you create just one stack (array?) and call it 'history'. As commands are made, add their 'antidote' (or pointers to them) to the stack, and adjust a pointer/counter to the last command. As your user steps back ('undo'), replace each command with its 'antidote' (the code that put it there in the first place could be reused), so it's there for a subsequent 'redo', and reposition the counter as needed. You'll have to allocate storage for deleted text and link it to the stack position (a 2-dimensional (pointer?) array, or perhaps a struct?). If your stack gets full, delete the oldest entry, since it's now 'out of range', and move everything accordingly... Or... just allocate more memory... ;-)
Just an idea...
Remember, if it works properly, it isn't wrong. Perhaps just not the most efficient or effective way of doing it...
Don't forget to clear the stack on 'save', and most importantly, release any allocated memory on 'terminate'.
Mike.

Elm: Search Number in Bytes

I'm trying to find some exif data in an image.
So first I need to find the number 0x45786966 ('Exif' as unsignedInt32) and store the offset.
The next two bytes should be zeros and after that the endianness as unsignedInt16 (either 0x4d4d or 0x4949) which should be stored too.
I can get the image as Bytes with the elm/file module.
But how do I search the 'Exif' start and parse the endianness in those Bytes?
I looked at the loop-example from elm/bytes but do not fully understand it.
First it reads the length of a list (unsignedInt32) and then it reads byte by byte?
How would this work if I want to read unsignedInt32s instead of bytes?
How do I set an offset to indicate where functions like unsignedInt32 should read next?
The example is talking about structured data with a known size field at the start. In your case, what you want to do is a search, so it is a rather different problem.
The problem is elm/bytes isn't really designed to handle searching. If you can guarantee the part you are looking for will be byte aligned, it may well be possible to do this, but given just what you have said, there isn't an easy way, as you can't iterate bit-by-bit.
You would have to read in values without alignment and then manually search for the part of the number you want within that. Given the difficulty and inefficiency of that approach, I would recommend using ports instead for that use case.
If you can guarantee that what you are searching for will be byte-aligned (or, better yet, aligned to the length of your number), you can decode a byte at a time until you find what you are looking for. There is no way to read from a given offset; if you want to start reading at a certain point, you'd need to read and throw away the values before it.
To do this, you would want to set up a loop where your state contains how much of the value you are looking for you have found. Each step, you check if you have the whole thing (success), you have the next part (continue), or you have something different (reset the state to search from the start again). If you reach the end without finding it, you have failed.
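A rough sketch of that loop, assuming the marker is byte-aligned (module layout and names are illustrative, and this is not tested against real EXIF data):

import Bytes exposing (Bytes)
import Bytes.Decode as Decode exposing (Decoder, Step(..))

-- The bytes of "Exif": 0x45 0x78 0x69 0x66
exifMarker : List Int
exifMarker =
    [ 0x45, 0x78, 0x69, 0x66 ]

-- Succeeds with the offset just past the marker, or yields Nothing if the bytes run out first.
findMarker : Decoder Int
findMarker =
    Decode.loop { offset = 0, matched = 0 } findStep

findStep : { offset : Int, matched : Int } -> Decoder (Step { offset : Int, matched : Int } Int)
findStep state =
    if state.matched == List.length exifMarker then
        Decode.succeed (Done state.offset)
    else
        Decode.unsignedInt8
            |> Decode.map
                (\byte ->
                    if Just byte == List.head (List.drop state.matched exifMarker) then
                        Loop { offset = state.offset + 1, matched = state.matched + 1 }
                    else if byte == 0x45 then
                        -- a mismatched byte can itself start a new "Exif" run
                        Loop { offset = state.offset + 1, matched = 1 }
                    else
                        Loop { offset = state.offset + 1, matched = 0 }
                )

search : Bytes -> Maybe Int
search image =
    Decode.decode findMarker image

From the Done branch you could instead andThen straight into reading the two zero bytes and the unsignedInt16 endianness marker, so the whole thing stays a single decoder.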

NetLogo: how to read values from a data set, assigning values at each tick?

I'm modelling salmon population dynamics and I have a real data set about temperature and flow. I would like to assign a daily value of these two parameters during each tick, setting the first tick as the first day in the dataset and making it keep reading the file.
How can I do that?
Jacopo
NetLogo has fairly extensive IO capabilities for text files (and thus for CSV). You apparently have your data in a simple CSV file, so you will need to use these capabilities. For simple IO examples, see https://subversion.american.edu/aisaac/notes/netlogo-intro.xhtml#file-based-io. There are also lots of examples of reading CSV files on the web (e.g., http://netlogoabm.blogspot.com/2014/01/reading-from-csv-file.html). Unfortunately, NetLogo does not provide a CSV reader.
You suggest you would like to repeatedly read from the file. You will then have to leave the file open for the entire simulation. Each tick you can read in one line from each open file.
Unless it is a very large dataset, I would rather read all the data into two global lists (e.g., temperatures and flows) at the very beginning. Since you say you want to update the values each tick, use the current tick value to index into these lists, e.g., set temp item ticks temperatures. (Here I assume you only use tick to advance the tick counter, so that you get successive integers. Also, if you tick before you start reading data, you'll need to use ticks - 1.)
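A rough sketch of that approach (the file name, globals, and the assumption that each line holds a whitespace-separated temperature and flow pair, so that file-read works, are all illustrative):

globals [ temperatures flows temp flow ]

to load-data                    ;; call once in setup
  set temperatures []
  set flows []
  file-open "salmon-climate.txt"
  while [ not file-at-end? ] [
    set temperatures lput file-read temperatures
    set flows lput file-read flows
  ]
  file-close
end

to update-climate               ;; call once per tick, before any agent uses temp/flow
  set temp item ticks temperatures
  set flow item ticks flows
end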
hth

How to look for the content of a text file in Pentaho?

I have an ETL which gives text file output, and I have to check whether that text content contains the word "error" or "bad", using Pentaho.
Is there any simple way to find it?
If you are trying to process a number of files, you can use a Get Filenames step to get all the filenames. Then, if your text files are small, you can use a Get File Content step to get the whole file as one row, then use a Java Filter or other matching step (RegEx, e.g.) to search for the words.
If your text files are too big but line-based or otherwise in a fixed format (which it likely is if you used a text file output step), you can use a Text File Input step to get the lines, then a matcher step (see above) to find the words in the line. Then you can use a Filter Rows step to choose just those rows that contain the words, then Select Values to choose just the filename, then a Sort Rows on the filename, then a Unique Rows step. The result should be a list of filenames whose contents contain the search words.
This may seem like a lot of steps, but Pentaho Data Integration or PDI (aka Kettle) is designed to be a flow of steps with distinct (and very reusable) functionality. A smaller but less "PDI" method is to write a User Defined Java Class (or other scripting) step to do all the work. This solution has a smaller number of steps but is not very configurable or reusable.
If you're writing these files out yourself, then don't you already know the content? So scan the fields at the point where you already have them in memory.
If you're trying to see if Pentaho has written an error to the file, then you should use error handling on the output step.
Finally, PDI is not a text-searching tool. If you really need to do this, then your best bet is probably good old grep.