How do associative arrays work in awk? - awk

I wanted to remove duplicate lines from a file based on a column. A quick search led me to this page, which had the following solution:
awk '!x[$1]++' filename
It works, but I am not sure how. I know it uses associative arrays in awk, but I am not able to infer anything beyond that.
Update:
Thanks everyone for the explanation. With my new knowledge, I have written a blog post with further explanation of how it works.

That awk script, !x[$1]++, fills an array named x. Suppose the first word in a line of text is line1 ($1 refers to the first field of a line). The script effectively results in this operation on the array:
x["line1"]++
The "index" (the key) of the array is the text encountered in the file (line1 in this example), and the value associated with that key is an integer that is incremented by 1.
When a unique line of text is encountered, the current value of the array element is zero, which is then post-incremented to 1. Since the post-increment yields the old value (zero), the not operator ! turns it into true, the pattern matches, and awk's default action prints the line. The next time the same value is encountered, the value in the array is already non-zero, so the not operation results in zero (false) and the line is not printed.
A less "clever" way of writing the same thing (but possibly more clear and less fun) would be this:
{
    if (x[$1] == 0)
        print
    x[$1]++
}
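For example, given a hypothetical file named data.txt (the name and contents are made up for illustration):
apple 1
banana 2
apple 3
running awk '!x[$1]++' data.txt keeps only the first line seen for each distinct first column:
apple 1
banana 2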

Related

How to split a CSV file into groups using Pentaho?

I am new to Pentaho and am trying to read a CSV file (which I already did) and create blocks of data based on an identifier.
Eg
1|A|B|C
2|D|E|F
8|G|H|I|J|K
4|L|M
1|N|O|P
4|Q|R|S|T
5|U|V|W
I need to split and group this as such:
(each block starts when the first column is equal to '1')
Block a)
1|A|B|C
2|D|E|F
8|G|H|I|J|K
4|L|M
Block b)
1|N|O|P
4|Q|R|S|T
5|U|V|W
Eg
a |1|A|B|C
a |2|D|E|F
a |8|G|H|I|J|K
a |4|L|M
b |1|N|O|P
b |4|Q|R|S|T
b |5|U|V|W
How can this be achieved using Pentaho? Thanks.
I found a similar question but answers don't really help my case
Pentaho Kettle split CSV into multiple records
I think I got the answer.
I created the transformation in this zip; it can transform your "csv" file into rows almost like you described, but I don't know what you intend to do next, so maybe you can give us more details. =)
I'll explain what I did:
1) First, we grab the row full text with a Text input step
When you look at the configuration of the Text input step, you'll see I used ';' as the separator, while your input file uses '|', so I'm not splitting columns on the '|' but loading the whole line into one column: the row's full text, nothing else.
2) Next we apply a regex eval to separate the ID from the rest of our string.
^(\d+)\|(.*)
Which means: at the beginning of the text I expect one or more digits, followed by a pipe and then anything. The digits at the start of the string are captured into one column and everything after the pipe into another.
That gives you two new columns: the ID captured at the start of the line, and everything after the pipe.
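As an aside, if you want to test that regex outside Pentaho, a rough gawk equivalent of the split looks like this (gawk's regex syntax has no \d, so [0-9] stands in; the file name sample.txt is an assumption):
gawk 'match($0, /^([0-9]+)\|(.*)/, m) { print "id=" m[1], "rest=" m[2] }' sample.txt
On the sample input this prints id=1 rest=A|B|C, id=2 rest=D|E|F, and so on.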
3) Now what you need is to add a 'sequence' that only goes up when a row_id of 1 appears, which I did in the Modified Java Script Value step with the following code:
var sequence;
// if it's the first row, set sequence to 1
if (sequence == null) {
    sequence = 1;
} else {
    // if it's not the first row, check if the row_id is equal to 1 (string)
    if (row_id == '1') {
        // increment the sequence
        sequence++;
    }
}
And that gives you an output with the group sequence filled in, which seems to be what you expected.
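(As an aside, the same grouping can be sketched outside Pentaho with a gawk one-liner; the file name input.txt and the a, b, c letter mapping are assumptions based on your example output, and it breaks past 26 blocks, but it shows the idea. n increments whenever the first field is 1, and the nth letter labels the block:)
gawk -F'|' '$1 == "1" { n++ } { print substr("abcdefghijklmnopqrstuvwxyz", n, 1), "|" $0 }' input.txt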
Hope it helps =)

Cannot print subsequent rows of array to file

I'm trying to write a rudimentary bit of code to print a 50*50 array called 'arr'. Unfortunately it so far only prints the first row of the array, although the formatting for that row is correct. I've attached the code below and was wondering if anyone could point out where I was going wrong? Thank you!
program testing
  implicit none
  integer :: i, j
  integer, dimension (1:50, 1:50) :: arr
  arr = 1
  do i=1,50
    open(unit=6, file="array.txt", action="write")
    write(6, '(2500I3)') (arr(i,j), j=1,50)
    close(6)
  end do
end program testing
Your open statement is inside the loop (along with a matching close statement). That means that for every row of the array, you open the file. That's probably not what you meant to do.
The default position specifier for an OPEN statement, if there is no POSITION specifier, is 'ASIS'. For a file that already exists (your case after the first iteration, and perhaps even for the first iteration) that means the position is unspecified. Your processor probably takes that to be the start of the file, so on each iteration of the loop you simply overwrite the first record, over and over again.
If you must open the file each iteration, then use the POSITION='APPEND' specifier to position the file at the end when the open statement is executed. Otherwise, move the open and close statements out of the loop.
(The way that the default of 'ASIS' behaves means that you should always specify the initial position of a file via a POSITION specifier when executing an OPEN statement for an existing "on disk" file.)
IanH's answer is correct. Your program can be fixed as follows. Note that output units should be parameterized and not set to 6 and that arrays and array sections can be written as shown.
program testing
  implicit none
  integer :: i
  integer, dimension (1:50, 1:50) :: arr
  integer, parameter :: outu = 20 ! better to parameterize the unit and not to
                                  ! use the number 6, which most compilers
                                  ! use for standard output
  arr = 1
  open(unit=outu, file="array.txt", action="write")
  do i=1,50
    write(outu, '(2500I3)') arr(i,:) ! an array section can be written without an implied do loop
  end do
  close(outu)
end program testing

awk: delete short line from long line

I have a long file of text strings, sorted by length. What I need to do is find any short lines contained within long lines, breaking up the long line into two shorter lines, and leaving the original short line intact, like this:
input:
here is an example of a long line
an example of
output:
here is
a long line
an example of
You haven't posted a sufficient dataset to allow us to post a complete solution, but here is something to get you started:
$ awk '
NR==FNR {
    a[$0]++
    next
}
{
    for (x in a)
        if (x != $0 && index($0, x) > 0)
            sub(x FS, "\n")
}1' file file
here is
a long line
an example of
We are doing two passes over the file. In the first pass, we read the lines and store them as keys in an array (duplicate lines get stored as a single key).
In the second pass, we iterate through the array; if a key is not equal to the current line but is a substring of it, we substitute that shorter line (and the following separator) with a newline, splitting the long line in two.
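One caveat with the above: sub() treats x as a regular expression, so a short line containing characters like . or * could match more than intended. Here is a minimal sketch of a literal-match variant using index() and substr(), invoked the same two-pass way; trimming a single separating blank around the match is an assumption about the data:
awk '
# first pass: remember every line as an array key
NR==FNR { a[$0]; next }
# second pass: if a stored short line occurs inside this line, print the
# pieces around the literal match instead of using it as a regex
{
    for (x in a) {
        i = index($0, x)
        if (x != $0 && i > 0) {
            before = substr($0, 1, i - 1)
            after = substr($0, i + length(x))
            sub(/ $/, "", before)   # drop the blank left of the match
            sub(/^ /, "", after)    # drop the blank right of the match
            if (before != "") print before
            if (after != "") print after
            next
        }
    }
}1' file file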

Fortran: How do I get Fortran to ignore lines from a data file that has random spacing?

I am writing a Fortran code that uses data in a file made by an MD program. The data is a list of values, but it has breaks for list updates in the form (# Neighbor list update .. 6527 indexes in list). These breaks come at random intervals, so I can't just skip every x-th line.
When I run my code it doesn't ignore these lines, and it repeats the value from the previous step:
1, 0.98510699999999995, 0.98510699999999995
2, 1.9654170000000000, 0.98031000000000001
3, 2.9427820000000002, 0.97736500000000004
4, 3.9186540000000001, 0.97587199999999996
4, 4.8945259999999999, 0.97587199999999996
5, 5.8697910000000002, 0.97526500000000005
Note the doubled step 4 with a value identical to the true step 4.
How would I go about skipping these lines? Please find the sample code below:
Open(Unit=10, File='prod._100.tup')
do i=1,50
  Read(10,*,IOSTAT=ios) step, temp, kinetic, potential, total, pressure
  If (IS_IOSTAT_END(ios)) Exit
  test = test + temp
  print *, step, test, temp
End Do
It is not clear to me what the "breaks" in the file are. Are they blank lines? If so, the following code should work:
use, intrinsic :: iso_fortran_env
character (len=200) :: line
Open(Unit=10, File='prod._100.tup')
read_loop: do
  Read (10, '(A)', IOSTAT=ios) line
  If (ios == iostat_end) exit read_loop
  if (len_trim(line) == 0) then
    write (*, *) "blank line"
    cycle read_loop
  end if
  read (line, *) step, temp, kinetic, potential, total, pressure
  test = test + temp
  print *, step, test, temp
end do read_loop
write (*, *) "total is", test
The above is not tested. The "len_trim" test is based on bad records being blank lines. If breaks are otherwise defined you will have to create a different test.
Try:
i=1
do while (i <= 50)
  Read(10,*,IOSTAT=ios) step, temp, kinetic, potential, total, pressure
  If (IS_IOSTAT_END(ios)) Exit
  If (ios.ne.0) cycle
  test = test + temp
  i = i + 1
enddo
When a bad record is read, ios is assigned a system-dependent non-zero number (it is zero on success). IS_IOSTAT_END is actually an intrinsic function (since Fortran 2003) that tells you whether you've reached the end of the file, but other error conditions can exist (for example, the read statement doesn't match the data). Those produce a different non-zero ios than an end-of-file condition, so you should just restart the loop at that point (hence the cycle).
I assume you want to read exactly 50 lines from the file, so I changed your do loop to a do while, but if the number of records you read doesn't actually matter, then feel free to change it back.

How to skip records that turn on/off the range pattern?

gawk '/<Lexer>/,/<\/Lexer>/' file
This works, but it prints the first and last records, which I'd like to omit. How can I do that?
The gawk manual says: "The record that turns on the range pattern and the one that turns it off both match the range pattern. If you don't want to operate on these records, you can write if statements in the rule's action to distinguish them from the records you are interested in." But it gives no example.
I tried something like
gawk '/<Lexer>/,/<\/Lexer>/' {1,FNR-1} file
but it doesn't work.
If you have a better way to do this, without using awk, say so.
You can do it with two separate match statements and a flag variable:
gawk '/<Lexer>/{p=1; next} /<\/Lexer>/ {p=0} p==1 {print}' file
This matches <Lexer>, sets p to 1, and skips to the next line. While p is 1, each subsequent line is printed. When it matches </Lexer>, p is set to 0 before the print rule is evaluated, so that line (and everything after it) is suppressed.
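For example, given a hypothetical file containing:
<Lexer>
rule one
rule two
</Lexer>
the command prints only the two inner lines:
rule one
rule two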