Awk print single element of array - awk

This ought to be ridiculously easy. I simply want to print a single element of an array. However, all I get from a command like print arr[1] is an empty line.
Here is my entire bash script:
#!/bin/bash
find -X $1 -type f |
xargs md5 |
awk '
NF == 4 {
md5[$4]++;
files[$2]++;
}
END {
for (i = 1; i <= NF; i++)
for (j = i + 1; j <= NF; j++)
if (md5[i] == md5[j]) {
print "These are duplicates: "
print files[j+1]
print files[i]
}
'
exit 0
It is a very simple duplicate-file finder. The problematic part is in the END{} statement within awk.
This just gives me a bunch of "These are duplicates: " with empty lines after them. I know that the information is available, because when I add this to END{}: for (x in arr) print x it flawlessly prints every element in arr, as expected.
I must be doing something very silly.

What you're currently doing is assigning the values you want to save as the indices of the two arrays, which is common in awk code examples. However, that's usually used in conjunction with the for (x in y) syntax. The fix that comes to mind is to modify your awk script like so:
BEGIN {
md5idx = 0;
filesidx = 0;
}
And then change:
NF == 4 {
md5[md5idx++] = $4;
files[filesidx++] = $2;
}
That should about do it, I think, but I haven't tested it.
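For completeness, the END block would then loop over those integer counters instead of NF. A rough, untested sketch (it relies on md5idx and filesidx staying in step, which they do because both are incremented on every NF == 4 line):
END {
    for (i = 0; i < md5idx; i++)
        for (j = i + 1; j < md5idx; j++)
            if (md5[i] == md5[j]) {
                print "These are duplicates: "
                print files[i]
                print files[j]
            }
}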

Instead of using variables, you can also use NR, which contains the line number, as the index to store field values into your arrays.
NF == 4 {
md5[NR]=$4;
files[NR]=$2;
}
and then in your END portion, you can use something like for (i=1; i<=NR; i++). Since in the END block NR still holds the last line number, you don't need an arbitrary counter or even awk's length function to find the length of the array.
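A sketch of how the END comparison might then look (untested; it assumes the NR-indexed arrays above and skips line numbers that never matched NF == 4, since only those lines were stored):
END {
    for (i = 1; i <= NR; i++) {
        if (!(i in md5)) continue          # this line never matched NF == 4
        for (j = i + 1; j <= NR; j++)
            if ((j in md5) && md5[i] == md5[j]) {
                print "These are duplicates: "
                print files[i]
                print files[j]
            }
    }
}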

It took me a while to find a standard md5 as opposed to my own home-brew version, but the example output from a version on MacOS X 10.7.2 is:
$ /sbin/md5 $(which -a md5)
MD5 (./md5) = 57f49e1c53ca7875fe63a33958ab0b0b
MD5 (/Users/jleffler/bin/md5) = 57f49e1c53ca7875fe63a33958ab0b0b
MD5 (/sbin/md5) = dd00b1dc4dd11c8443a70b5d33e0cade
$
Assuming that the output of md5 is a hash in column 4 and a file name in column 2 with the parentheses around the name not mattering, and also assuming that the names do not contain any spaces (because spaces in the file name will mess up the column numbering), then you probably want something like:
#!/bin/bash
find -X "${#:-'.'}" -type f |
xargs /sbin/md5 |
awk '
NF == 4 {
if (file[$4] != "") printf "Duplicate: MD5 %s - %s & %s\n", $4, file[$4], $2;
else file[$4] = $2;
}'
exit 0
Example output:
Duplicate: MD5 57f49e1c53ca7875fe63a33958ab0b0b - (./md5) & (/Users/jleffler/bin/md5)
This identifies duplicate MD5 values as it goes. If there is no entry in the (associative) array file for the given MD5 hash, then an entry is created with the file's name. If there is an entry, then the MD5 value and the two file names are printed; you can debate the format, which might be better spread over three lines than cramped onto one.
The "${#:-'.'}" notation means 'use the command line arguments if there are any; otherwise, use . (the current directory)'. This seems likely to be more usable than using the first argument (only) and failing if no argument is supplied.

Related

Loop through files in a directory and select rows based on column value using awk for large files

I have 15 text files (each about 1.5 - 2 GB) in a folder, each with about 300,000 to 500,000 rows and about 250 columns, each with a header row with column names. I also have a list of five values ("a123", "b234", "c345", "d456", and "e567"). (These are arbitrary values and the values are not in order and they do not have any relation with each other)
For each of the five values, I would like to query each of the 15 text files and select the rows where "COL_ABC" or "COL_DEF" equals the value. ("COL_ABC" and "COL_DEF" are arbitrary names and the column names do not have any relation with each other.) I do not know which column number "COL_ABC" or "COL_DEF" is. They differ between the files because each file has a different number of columns, but the columns are named "COL_ABC"/"COL_DEF" in every file. Additionally, some of the files have both "COL_ABC" and "COL_DEF" but others have only "COL_ABC". If only "COL_ABC" exists, I would like to run the query on "COL_ABC", but if both exist, I would like to run the query on both columns (i.e. check whether "a123" is present in either "COL_ABC" or "COL_DEF" and select the row if true).
I'm very new to awk, so forgive me if this is a simple question. I am able to only do simple filtering such as:
awk -F "\t" '{ if(($1 == "1") && ($2 == "2")) { print } }' file1.txt
For each of the fifteen files, I would like to print the results to a new file.
Typically I could do this in R but my files are too big to be read into R. Thank you!
Assuming:
The input filenames have the form as "*.txt".
The columns are separated by a tab character.
Each of the five values is compared with the target column (COL_ABC or COL_DEF) one by one, and individual result files are created per value. Then 15 x 5 = 75 files will be created. (If this is not what you want, please let me know.)
Then would you please try:
awk -F"\t" '
BEGIN {
values["a123"] # assign values
values["b234"]
values["c345"]
values["d456"]
values["e567"]
}
FNR==1 { # header line
for (i in values) { # loop over values
if (outfile[i] != "") close(outfile[i]) # close previous file
outfile[i] = "result_" i "_" FILENAME # filename to create
print > outfile[i] # print the header
}
abc = def = 0 # reset the indexes
for (i = 1; i <= NF; i++) { # loop over the column names
if ($i == "COL_ABC") abc = i # "COL_ABC" is found: assign abc to the index
else if ($i == "COL_DEF") def = i # "COL_DEF" is found: assign def to the index
}
next
}
{
for (i in values) {
if (abc > 0 && $abc == i || def > 0 && $def == i)
print > outfile[i] # abc_th column or def_th column matches i
}
}
' *.txt
If your 15 text files are located in the directory, e.g. /path/to/the/dir/ and you want to specify the directory as an argument, change the *.txt in the last line to /path/to/the/dir/*.txt.
Another approach is to run awk once per file from a shell loop, writing one result file per input file:
for f in file*.txt; do
awk -F'\t' '
BEGIN {
n1="COL_DEF"
n2="COL_ABC"
val["a123"]
val["b234"]
val["c345"]
val["d456"]
val["e567"]
}
NR==1 {
for(i=1; i<=NF; i++)
col[$i]=i
c=col[n1]
if(!c) c=col[n2]
next
}
$c in val { print }
' "$f" > "$f.new"
done
We don't really need to set n1 and n2 (we could use the string values directly), but it keeps all definitions in one place.
awk doesn't have a very nice way to declare all elements of an entire array at once, so we set the val elements individually (alternatively, for simple values we could use split).
On the first line of the file (NR==1), we store the header names, then immediately look up the ones we care about and store the index in c: we use col[n1] if it is defined (non-zero), otherwise col[n2], as the column index to search.
next skips the remaining awk actions for this line.
Then for every remaining line we check if the value in the relevant column is one of the values in val and, if so, print that line.
The awk script is enclosed in a bash for loop and we write output to a new file based on the loop variable. (This could all be done in awk itself, but this way is easy enough.)

Adding daily data to existing file, based on list in file using awk

I'm trying to add the daily totals from separate files to a single file. This is to count the number of daily transactions for different accounts. An account might not have a count on a specific day, so the set of accounts differs between the files. The data could be combined into a single file and handled in Excel with a pivot table, but the number of accounts exceeds the number of rows Excel allows.
File1_2022-03-01.dat
26159933386,12
26359222592,34
26459979727,56
26359925994,1
26461265992,22
26591926740,33
26465926740,44
File2_2022-03-02.dat
26159933386,3
26359222592,324
26459979727,43
26527939259,543
26461265992,32
26591926740,2
26465926740,443
26332060759,5
26465993472,33
Below is the required output. A header would be optional, but nice to have. Every day's data should be added as a new column.
(At the end of the month, you will have daily data for each account, from which a total can be calculated, which can then be used to find the top users, and the result can be exported to Excel.)
Output_2022-03.dat
Account,2022-03-01, 2022-03-02
26159933386,12,3
26332060759,,5
26359222592,34,324
26359925994,1,
26459979727,56,43
26461265992,22,32
26465926740,44,443
26465993472,,33
26527939259,,543
26591926740,33,2
I have tried something like awk -F, 'FNR==NR{var[$1]=$2;next;}{print $1","var[$1]FS$2}' File1_2022-03-01.dat File2_2022-03-02.dat but I am not sure how to ensure that the unique account numbers from both files all appear in the output file. The same script should be usable to add additional days.
The problem simply cries out for SQLite. You need one table with 4 columns: filename, date, account, and count. Parse the date out of the filename, and produce a CSV or similar that can be loaded into a temporary table. Use SQL to insert the rows from the temporary table to the main table, and delete the temporary rows. Now you have your data.
SQL has "pivot tables" like Excel does, but (depending on the DBMS) can handle virtually limitless numbers of rows. But it sounds like you don't need to pivot the values onto rows; it sounds like what you need is a sum by time-span and account. SQLite and every other SQL implementation has date-time functions to operate on parts of dates, and GROUP BY to compute aggregates.
If for whatever reason your ultimate target has to be Excel, you can write the SQL to produce the data in whatever arrangement the Excel spreadsheet expects. SQLite can export query results as a CSV file.
Could it be done just in awk? Sure. It could be done in C or APL, too. I'm just telling you the easiest way, one that is far less error-prone than using a language that can't guard against data errors.
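As a rough illustration of that workflow (a sketch only: the database name, table names and column names are made up, and it assumes daily files named like File1_2022-03-01.dat with no header row):
for f in File*_2022-03-*.dat; do
    day=${f#*_}; day=${day%.dat}           # parse the date out of the filename
    sqlite3 counts.db <<SQL
CREATE TABLE IF NOT EXISTS counts (filename TEXT, day TEXT, account TEXT, cnt INTEGER);
CREATE TEMP TABLE staging (account TEXT, cnt INTEGER);  -- vanishes when this session ends
.mode csv
.import $f staging
INSERT INTO counts SELECT '$f', '$day', account, cnt FROM staging;
SQL
done
sqlite3 -csv counts.db "SELECT account, SUM(cnt) AS total FROM counts GROUP BY account ORDER BY total DESC;"
The last query gives the monthly total per account, ready to feed into a top-users report or an Excel import.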
You can do this fairly easily in awk by using an array indexed by the account number and concatenating the second field onto the stored string as the value. To provide sorted output, you can pipe to sort when done.
One solution in awk would be:
awk -F, '{acct[$1]=acct[$1]","$2} END {for (i in acct) {print i acct[i]}}' File1 File2
Example Use/Output
Using your data files, and piping to sort when done you would have:
$ awk -F, '{acct[$1]=acct[$1]","$2} END {for (i in acct) {print i acct[i]}}' File1_2022-03-01.dat File2_2022-03-02.dat | sort
26159933386,12,3
26332060759,5
26359222592,34,324
26359925994,1
26459979727,56,43
26461265992,22,32
26465926740,44,443
26465993472,33
26527939259,543
26591926740,33,2
(Note: in the output above, the empty fields are omitted.)
Handling Empty Fields
To handle empty fields, you need to keep track of the number of files seen (nfiles), and then for each file after the first, any account not seen before needs to have its string prefixed with ",". At the end, you also need to check the number of fields being output (using split() to indirectly count the "," characters in the string) and append "," to any accounts that come up short, e.g.
awk -F, '
FNR == 1 { nfiles++ }
{
if (FNR < NR) {
$1 in acct || acct[$1] = acct[$1] ","
}
acct[$1] = acct[$1] "," $2
}
END {
for (i in acct) {
n = split(acct[i], tmp, ",")
for (j = n - 1; j < nfiles; j++)
acct[i] = acct[i] ","
print i acct[i]
}
}
' File1_2022-03-01.dat File2_2022-03-02.dat | sort
Example Use/Output
Running the same script on the two data files and piping to sort produces:
26159933386,12,3
26332060759,,5
26359222592,34,324
26359925994,1,
26459979727,56,43
26461265992,22,32
26465926740,44,443
26465993472,,33
26527939259,,543
26591926740,33,2
Let me know if you have further questions.

How to print the user specified fields

I am writing an AWK script where the user inputs the fields and the script counts the number of times each word appears in those fields. I have the code set up so that it prints out all of the fields and the number of times each word occurs, but I am trying to have only the user-specified fields counted. The user will be inputting CSV files, so I am setting the FS to a comma.
Knowing that AWK assumes that all arguments passed to it are files, I copy the arguments into an array and then delete them from the ARGV array so it will not throw an error.
#!/usr/bin/awk -f
BEGIN{ FS = ",";
for(i = 1; i < ARGC-1; i++){
arg[i] = ARGV[i];
delete ARGV[i];
}
}
{
for(i=1; i <=NF; i++)
words[($i)]++
}
END{
for( i in words)
print i, words[i];
}
So if the user inputs a CSV file such as...
A,B,D,D
Z,C,F,G
Z,A,C,D
Z,Z,C,Q
and the user wants only field 3 counted, the output should be...
C 3
F 1
Or if the user enters 1 and 3 for the fields...
A 2
B 1
C 1
Z 4
Could you please try the following (I have written this on mobile so couldn't test it).
awk -v fields="1,3" '
BEGIN{
FS=OFS=","
num=split(fields,array,",")
for(j=1;j<=num;j++){
a[array[j]]
}
}
{
for(i=1;i<=NF;i++){
if(i in a){
count[$i]++
}
}
}
END{
for(h in count){
print h,count[h]
}
}
' Input_file
I believe this should work for parsing multiple Input_files too. If needed you could try passing multiple files to it.
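For instance, if the program above is saved to a file (count_fields.awk is a hypothetical name), several CSV files can be passed in one go and the counts are accumulated across all of them:
# count_fields.awk holds the program shown above (hypothetical filename)
awk -v fields="1,3" -f count_fields.awk file1.csv file2.csv file3.csv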
Explanation: the following is for explanation purposes only.
-v fields="1,3" creates a variable named fields whose value is user defined; it should be comma separated. As an example I have taken 1 and 3; you can set it as per your need.
BEGIN{......} starts the BEGIN section, where the field separator and output field separator are set to comma for all lines of the Input_file(s). Then split splits the variable fields into an array named array on commas; the variable num holds the number of elements. A for loop from 1 to num then creates an array named a whose indices are the requested field numbers.
MAIN section: a for loop traverses all fields of each line and checks whether the field number is present in the array a created in the BEGIN section; if so, the value of that field is counted in the array count, which is what the OP requires.
Finally, the END section traverses the array count and prints its indexes with their counts.
Another:
$ awk -F, -v p="1,2" '{ # parameters in comma-separated var
split(p,f) # split parameters to fields var
for(i in f) # for the given fields
c[$f[i]]++ # count chars in them
}
END { # in the end
for(i in c)
print i,c[i] # output chars and counts
}' file
Output for fields 1 and 2:
A 2
B 1
C 1
Z 4

Print every line from a large file where the previous N lines meet specific criteria

I'd like to print every line from a large file where the previous 10 lines have a specific value in a specific column (in the example below, column 9 has a value < 1). I don't want to store the whole file in memory. I am trying to use awk for this purpose as follows:
awk 'BEGIN{FS=","}
{
for (i=FNR,i<FNR+10, i++) saved[++s] = $0 ; next
for (i=1,i<s, i++)
if ($9<1)
print saved[s]; delete saved; s=0
}' file.csv
The goal of this command is to save the 10 previous lines, then check that column 9 in each of those lines meets my criteria, then print the current line. Any help with this, or a suggestion on a more efficient way to do this, is much appreciated!
No need to store anything in memory or do any explicit looping on values. To print the current line if the last 10 lines (inclusive) had a $9 value < 1 is just:
awk -F, '(c=($9<1?c+1:0))>9' file
Untested of course, since you didn't provide any sample input or expected output, so check the math; but that is the right approach, and if the math is off the tweak is just to change >9 to >10 or whatever you need.
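Written out long-hand, the same counter logic reads like this (an equivalent sketch of the one-liner above):
awk -F, '
{
    # c counts consecutive lines (including this one) with $9 < 1
    if ($9 < 1) c++
    else        c = 0
    # print once the current line is at least the 10th such line in a row
    if (c > 9) print
}' file.csv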
Here is a solution for GNU Awk:
chk_prev_lines.awk
BEGIN { FS=","
CMP_LINE_NR=10
CMP_VAL = 1 }
FNR > CMP_LINE_NR {
ok = 1
# check the stored values
for( i = 0; i< CMP_LINE_NR; i++ ) {
if ( !(prev_Field9[ i ] < CMP_VAL) ) {
ok = 0
break # early return
}
}
if( ok ) print
}
{ # store $9 for the comparison
prev_Field9[ FNR % CMP_LINE_NR] = $9
}
Use it like this: awk -f chk_prev_lines.awk your_file.
Explanation
CMP_LINE_NR determines how many values from previous lines are stored
CMP_VAL determines the values used for the comparison
The condition FNR > CMP_LINE_NR ensures that the first line checked against its previous lines is line CMP_LINE_NR + 1, the first line that has that many previous lines.
The last action stores the value of $9; it is executed for every line.

Is it possible to append an item to an array in awk without specifying an index?

I realize that awk has associative arrays, but I wonder if there is an awk equivalent to this:
http://php.net/manual/en/function.array-push.php
The obvious workaround is to just say:
array[$new_element] = $new_element
However, this seems less readable and more hackish than it needs to be.
I don't think an array length is immediately available in awk (at least not in the versions I fiddle around with). But you could simply maintain the length and then do something like this:
array[arraylen++] = $0;
And then access the elements it via the same integer values:
for ( i = 0; i < arraylen; i++ )
print array[i];
In gawk you can find the length of an array with length(var) so it's not very hard to cook up your own function.
function push(A,B) { A[length(A)+1] = B }
Notice this discussion, though -- all the places I can access right now have gawk 3.1.5 so I cannot properly test my function, duh. But here is an approximation.
vnix$ gawk '# BEGIN: make sure arr is an array
> BEGIN { delete arr[0] }
> { print "=" length(arr); arr[length(arr)+1] = $1;
> print length(arr), arr[length(arr)] }
> END { print "---";
> for (i=1; i<=length(arr); ++i) print i, arr[i] }' <<HERE
> fnord foo
> ick bar
> baz quux
> HERE
=0
1 fnord
=1
2 ick
=2
3 baz
---
1 fnord
2 ick
3 baz
As others have said, awk provides no functionality like this out of the box. Your "hackish" workaround may work for some datasets, but not others. Consider that you might add the same array value twice, and want it represented twice within the array.
$ echo 3 | awk 'BEGIN{ a[1]=5; a[2]=12; a[3]=2 }
> { a[$1] = $1 }
> END {print length(a) " - " a[3]}'
3 - 3
The best solution may be informed by what data are in the array, but here are some thoughts.
First off, if you are certain that your index will always be numeric, will always start at 1, and that you will never delete array elements, then tripleee's suggestion of A[length(A)+1]="value" may work for you. But if you do delete an element, then your next write may overwrite your last element.
If your index does not matter, and you're not worried about wasting space with long keys, you could use a random number that's long enough to reduce the likelihood of collisions. A quick & dirty option might be:
srand()
a[rand() rand() rand()]="value"
Remember to use srand() for better randomization, and don't trust rand() to produce actual random numbers. This is a less than perfect solution in a number of ways, but it has the advantage of being a single line of code.
If your keys are numeric but possibly sparse, as in the example that would break tripleee's solution, you can add a small search to your push function:
function push (a, v, n) {
n=length(a)+1
while (n in a) n++
a[n]=v
}
The while loop ensures that you'll assign an unused index. This function is also compatible with arrays that use non-numeric indices -- it assigns keys that are numeric, but it doesn't care what's already there.
Note that awk does not guarantee the order of elements within an array, so the idea that you will "push an item onto the end of the array" is wrong. You'll add this element to the array, but there's no guarantee it will appear last when you step through with a for loop.
$ cat a
#!/usr/bin/awk -f
function push (a, v, n) {
n=length(a)+1
while (n in a) n++
a[n]=v
}
{
push(a, $0)
}
END {
print "length=" length(a)
for(i in a) print i " - " a[i]
}
$ printf '3\nfour\ncinq\n' | ./a
length=3
2 - four
3 - cinq
1 - 3
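One last note on the ordering caveat above: if you do need to visit the elements in index order, iterate over the numeric indices explicitly instead of using for (i in a). A small sketch of an END block that could replace the one in the script above, assuming the push() function and gawk's length() on arrays:
END {
    # walk indices 1..length(a); "for (i in a)" makes no ordering promise
    for (i = 1; i <= length(a); i++)
        if (i in a)                    # skip holes left by any deletions
            print i " - " a[i]
}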