Finding common values across multiple files containing single-column values - awk

I have 100 text files, each containing a single column of values. The files look like:
file1.txt
10032
19873
18326
file2.txt
10032
19873
11254
file3.txt
15478
10032
11254
and so on.
The size of each file is different.
Kindly tell me how to find the numbers which are common to all these 100 files.
The same number appears only once in any one file.

This will work whether or not the same number can appear multiple times in 1 file:
$ awk '{a[$0][ARGIND]} END{for (i in a) if (length(a[i])==ARGIND) print i}' file[123]
10032
The above uses GNU awk for true multi-dimensional arrays and ARGIND. There are easy tweaks for other awks if necessary, e.g.:
$ awk '!seen[$0,FILENAME]++{a[$0]++} END{for (i in a) if (a[i]==ARGC-1) print i}' file[123]
10032
If the numbers are unique in each file then all you need is:
$ awk '(++c[$0])==(ARGC-1)' file*
10032

awk to the rescue!
To find the element common to all files (assuming uniqueness within the same file):
awk '{a[$1]++} END{for(k in a) if(a[k]==ARGC-1) print k}' files
Count all occurrences and print the values whose count equals the number of files.
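As a quick sanity check against the three sample files shown in the question (only 10032 appears in all of them), running it with explicit file names should print just that value:
$ awk '{a[$1]++} END{for(k in a) if(a[k]==ARGC-1) print k}' file1.txt file2.txt file3.txt
10032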

Files with a single column?
You can sort and compare these files using the shell:
for f in file*.txt; do sort "$f"|uniq; done|sort|uniq -c -d
The last -c is not necessary; it's needed only if you want to count the number of occurrences.
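Note that uniq -d only requires a value to appear in at least two files. If you specifically want values present in all 100 files, one variation (a sketch; the n=100 is an assumption matching the question) keeps the count from -c and filters on it:
$ for f in file*.txt; do sort -u "$f"; done | sort | uniq -c | awk -v n=100 '$1 == n {print $2}'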

Here's one using Bash and comm, because I needed to know if it would work. My test files were named 1, 2 and 3, hence the for f in ?:
f=$(shuf -n1 -e ?) # pick one file randomly for initial comms file
sort "$f" > comms
for f in ? # this time for all files
do
comm -1 -2 <(sort "$f") comms > tmp # comms should be in sorted order always
# grep -Fxf "$f" comms > tmp # another solution, thanks @Sundeep
mv tmp comms
done
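When the loop finishes, comms holds the intersection of all the files. The same idea adapted to the file*.txt names from the question might look like this (a sketch; on the three sample files the result is 10032):
$ sort file1.txt > comms
$ for f in file*.txt; do comm -12 <(sort "$f") comms > tmp && mv tmp comms; done
$ cat comms
10032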

Related

I need to sum all the values in a column across multiple files

I have a directory with multiple csv text files, each with a single line in the format:
field1,field2,field3,560
I need to output the sum of the fourth field across all files in a directory (there can be hundreds or thousands of files). So, for example, given:
file1.txt
field1,field2,field3,560
file2.txt
field1,field2,field3,415
file3.txt
field1,field2,field3,672
The output would simply be:
1647
I've been trying a few different things, with the most promising being an awk command that I found here in response to another user's question. It doesn't quite do what I need it to do, and I am an awk newb so I'm unsure how to modify it to work for my purpose:
awk -F"," 'NR==FNR{a[NR]=$4;next}{print $4+a[FNR]:' file1.txt file2.txt
This correctly outputs 975.
However, if I try to pass it a 3rd file, rather than adding field 4 from all 3 files, it adds file1 to file2, then file1 to file3:
awk -F"," 'NR==FNR{a[NR]=$4;next}{print $4+a[FNR]}' file1.txt file2.txt file3.txt
975
1232
Can anyone show me how I can modify this awk statement to accept more than two files or, ideally (because there are thousands of files to sum up), a * glob to output the sum of the fourth field of all files in the directory?
Thank you for your time and assistance.
A couple of issues with the current code:
NR==FNR is used to indicate special processing for the 1st file; in this case there is no processing that is 'special' for just the 1st file (ie, all files are to be processed the same)
an array (eg, a[NR]) is used to maintain a set of values; in this case you only have one global value to maintain so there is no need for an array
Since you're only looking for one global sum, a bit simpler code should suffice:
$ awk -F',' '{sum+=$4} END {print sum+0}' file{1..3}.txt
1647
NOTES:
in the (unlikely?) case all files are empty, sum will be undefined so print sum would display a blank line; sum+0 ensures we print 0 if sum remains undefined (ie, all files are empty)
for a variable number of files, file{1..3}.txt can be replaced with whatever pattern matches the desired set of files, eg, file*.txt, *.txt, etc (for very large sets of files, see the find-based sketch below)
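If there are so many files that an expanded file*.txt overflows the shell's argument-length limit, a find-based variant (a sketch, not tested against the OP's directory) avoids putting every name on one command line; cat, rather than awk, goes inside -exec so a single awk still sees all the data even if find splits the file list across several cat invocations:
$ find . -maxdepth 1 -name 'file*.txt' -exec cat {} + | awk -F',' '{sum+=$4} END {print sum+0}'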
Here we go (no need to test NR==FNR in a concatenation):
$ cat file{1,2,3}.txt | awk -F, '{count+=$4}END{print count}'
1647
Or the same thing, without wasting a pipe:
$ awk -F, '{count+=$4}END{print count}' file{1,2,3}.txt
1647
$ perl -MList::Util=sum0 -F, -lane'push @a,$F[3];END{print sum0 @a}' file{1..3}.txt
1647
$ perl -F, -lane'push @a,$F[3];END{foreach(@a){ $sum +=$_ };print "$sum"}' file{1..3}.txt
1647
$ cut -d, -f4 file{1..3}.txt | paste -sd+ - | bc
1647

How to randomly split a file into three equal files, by row, in bash?

I have a line for randomly splitting a file in half:
awk 'BEGIN {srand()} {f = FILENAME (rand() <= 0.5 ? ".base" : ".target"); print > f}' file.txt
I need a method like this one, but to split the file into three ~equal parts.
One messy solution would be to split it 0.3/0.7 with the existing script and then split the "0.7" part 0.5/0.5, but I would appreciate a shorter solution.
For approximately equal sizes (no guarantees, since it's based on underlying randomness):
$ awk 'BEGIN{srand()} {s=int(rand()*3)+1; print > (FILENAME"."s)}' file
for exact equality (within rounding), you can do
$ awk -v n=3 '{print > FILENAME"."(NR%n + 1)}' file
however, the file will be split without any random selection of rows.
If you want random selection while keeping the relative order of the rows, the best solution I can think of is a combination of shuf and the script above:
$ cat -n file | shuf > file.shuf
$ awk -v n=3 '{c=NR%n+1; print | "sort -n | cut -f2 > " FILENAME "." c}' file.shuf && rm file.shuf
We add line numbers to the original file so that each split file keeps the original record order.
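An equivalent multi-step version of the same decorate/shuffle/split/restore idea, with the pipe spelled out (the part.* names are hypothetical; GNU shuf is assumed):
$ cat -n file | shuf | awk -v n=3 '{print > ("part." (NR % n + 1))}'
$ for p in part.1 part.2 part.3; do sort -n "$p" | cut -f2- > "$p.txt"; done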

How to parse a column from one file in multiple other columns and concatenate the output?

I have one file like this:
head allGenes.txt
ENSG00000128274
ENSG00000094914
ENSG00000081760
ENSG00000158122
ENSG00000103591
...
and I have multiple files named like *.v7.egenes.txt in the current directory. For example, one file looks like this:
head Stomach.v7.egenes.txt
ENSG00000238009 RP11-34P13.7 1 89295 129223 - 2073 1.03557 343.245
ENSG00000237683 AL627309.1 1 134901 139379 - 2123 1.02105 359.907
ENSG00000235146 RP5-857K21.2 1 523009 530148 + 4098 1.03503 592.973
ENSG00000231709 RP5-857K21.1 1 521369 523833 - 4101 1.07053 559.642
ENSG00000223659 RP5-857K21.5 1 562757 564390 - 4236 1.05527 595.015
ENSG00000237973 hsa-mir-6723 1 566454 567996 + 4247 1.05299 592.876
I would like to get lines from all *.v7.egenes.txt files that match any entry in allGenes.txt
I tried using:
grep -w -f allGenes.txt *.v7.egenes.txt > output.txt
but this takes forever to complete. Is there any way to do this in awk or something else?
Without knowing the size of the files, but assuming the host has enough memory to hold allGenes.txt in memory, one awk solution comes to mind:
awk 'NR==FNR { gene[$1] ; next } ( $1 in gene )' allGenes.txt *.v7.egenes.txt > output.txt
Where:
NR==FNR - this test only matches the first file to be processed (allGenes.txt)
gene[$1] - store each gene as an index in an associative array
next - stop processing and go to the next line in the file
$1 in gene - applies to all lines in all other files; if the first field is found to be an index in our associative array then we print the current line
I wouldn't expect this to run much (if any) faster than the grep solution the OP is currently using (especially with shelter's suggestion to use -F instead of -w), but it should be relatively quick to test and see.
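For reference, shelter's fixed-string variant of the OP's grep (keeping -w for whole-word matches) would look something like this:
grep -Fwf allGenes.txt *.v7.egenes.txt > output.txt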
GNU Parallel has a whole section dedicated to grepping n lines for m regular expressions:
https://www.gnu.org/software/parallel/man.html#EXAMPLE:-Grepping-n-lines-for-m-regular-expressions
You could try a while read loop:
#!/bin/bash
while read -r line; do
grep -rnw Stomach.v7.egenes.txt -e "$line" >> output.txt
done < allGenes.txt
So here we tell the while loop to read all the lines from allGenes.txt and, for each line, check whether there are matching lines in the egenes file. Would that do the trick?
EDIT :
New version :
#!/bin/bash
for name in $(cat allGenes.txt); do
grep -rnw *v7.egenes.txt* -e "$name" >> output.txt
done

AWK: How to extract rows from one file based on rows of another file? [duplicate]

I've got a pretty big comma-delimited CSV log file (>50000 rows, let's call it file1.csv) that looks something like this:
field1,field2,MM-DD-YY HH:MM:SS,field4,field5...
...
field1,field2,07-29-10 08:04:22.7,field4,field5...
field1,field2,07-29-10 08:04:24.7,field4,field5...
field1,field2,07-29-10 08:04:26.7,field4,field5...
field1,field2,07-29-10 08:04:28.7,field4,field5...
field1,field2,07-29-10 08:04:30.7,field4,field5...
...
As you can see, there is a field in the middle that is a time stamp.
I also have a file (let's call it file2.csv) that has a short list of times:
timestamp,YYYY,MM,DD,HH,MM,SS
20100729180031,2010,07,29,18,00,31
20100729180039,2010,07,29,18,00,39
20100729180048,2010,07,29,18,00,48
20100729180056,2010,07,29,18,00,56
20100729180106,2010,07,29,18,01,06
20100729180115,2010,07,29,18,01,15
What I would like to do is to extract only the lines in file1.csv that have times specified in file2.csv.
How do I do this with a bash script? Since file1.csv is quite large, efficiency would also be a concern. I've done very simple bash scripts before, but really don't know how to deal with this. Perhaps some implementation of awk? Or is there another way?
P.S. Complication 1: I manually spot checked some of the entries in both files to make sure they would match, and they do. There just needs to be a way to remove (or ignore) the extra ".7" at the end of the seconds ("SS") field in file1.csv.
P.P.S. Complication 2: Turns out the entries in file1.csv are all separated by about two seconds. Sometimes the time stamps in file2.csv fall right in between two of the entries in file1.csv! Is there a way to find the closest match in this case?
Taking advantage of John's answer, you could sort and join the files, printing just the columns you want (or all of them, if need be). Please take a look below (note that I'm assuming you're on a UNIX such as Solaris, where nawk could be faster than awk, and gawk, which would make this even easier, may not be available):
# John's nice code
awk -F, '! /timestamp/ {print $3 "-" $4 "-" ($2-2000) " " $5 ":" $6 ":" $7}' file2.csv > times.list
# Sorting times.list file to prepare for the join
sort times.list -o times.list
# Sorting file1.csv
sort -t, -k3,3 file1.csv -o file1.csv
# Finally joining files and printing the rows that match the times
join -t, -1 3 -2 1 -o 1.1 1.2 1.3 1.4 1.5......1.50 file1.csv times.list
One nice particularity of this method is that you can adapt it to work in several different cases, such as a different column order, or cases where the key columns are not concatenated. It would be very hard to do that with grep (with or without regexps).
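If listing every output column with -o is a nuisance, one simplification (a sketch, under the same sorting assumptions) is to drop -o entirely; join then prints the join field first, followed by the remaining fields of each matching line from file1.csv:
join -t, -1 3 -2 1 file1.csv times.list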
If you have GNU awk (gawk), you can use this technique.
In order to match the nearest times, one approach would be to have awk print two lines for each line in file2.csv, then use that with grep -f as in John Kugelman's answer. The second line will have one second added to it.
awk -F, 'NR>1 {$1=""; print strftime("%m-%d-%y %H:%M:%S", mktime($0));
print strftime("%m-%d-%y %H:%M:%S", mktime($0) + 1)}' file2.csv > times.list
grep -f times.list file1.csv
This illustrates a couple of different techniques:
skip record number one to skip the header (matching on the header text would actually be better)
instead of dealing with each field individually, $1 is emptied and strftime creates the output in the desired format
mktime converts the string in the format "yyyy mm dd hh mm ss" (the -F, and the assignment to $1 removes the commas) to a number of seconds since the epoch, and we add 1 to it for the second line
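As a concrete check, the first data line of the sample file2.csv (20100729180031,2010,07,29,18,00,31) should expand to these two lines in times.list:
07-29-10 18:00:31
07-29-10 18:00:32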
One approach is to use awk to convert the timestamps in file2.csv to file1.csv's format, then use grep -f to search through file1.csv. This should be quite fast as it will only make one pass through file1.csv.
awk -F, '! /timestamp/ {print $3 "-" $4 "-" ($2-2000) " " $5 ":" $6 ":" $7}' file2.csv > times.list
grep -f times.list file1.csv
You could combine this all into one line if you wish:
grep -f <(awk -F, '! /timestamp/ {print $3 "-" $4 "-" ($2-2000) " " $5 ":" $6 ":" $7}' file2.csv) file1.csv
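Since the generated times.list entries are plain strings rather than regular expressions, adding -F should give the same matches and is usually a bit faster on a large file1.csv:
grep -Ff times.list file1.csv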

Shell script: How to split line?

here's my scenario:
my input file looks like:
/tmp/abc.txt
/tmp/cde.txt
/tmp/xyz/123.txt
and I'd like to obtain the following output in 2 files:
first file
/tmp/
/tmp/
/tmp/xyz/
second file
abc.txt
cde.txt
123.txt
thanks a lot
Here it is all in one single awk:
awk -F\/ -vOFS=\/ '{print $NF > "file2";$NF="";print > "file1"}' input
cat file1
/tmp/
/tmp/
/tmp/xyz/
cat file2
abc.txt
cde.txt
123.txt
Here we set the input and output separators to /
Then print the last field, $NF, to file2
Set the last field to nothing, then print the rest to file1
I realize you already have an answer, but you might be interested in the following two commands:
basename
dirname
If they're available on your system, you'll be able to get what you want just by piping through these:
cat input | xargs -l dirname > file1
cat input | xargs -l basename > file2
Enjoy!
Edit: Fixed per quantdev's comment. Good catch!
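If any of the paths might contain spaces, a slightly more defensive variant (GNU xargs assumed, for -d) splits the input on newlines only:
xargs -d '\n' -n 1 dirname < input > file1
xargs -d '\n' -n 1 basename < input > file2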
Through grep,
grep -o '.*/' file > file1.txt
grep -o '[^/]*$' file > file2.txt
.*/ Matches all the characters from the start up to the last / symbol.
[^/]*$ Matches any character other than /, zero or more times. $ asserts that we are at the end of a line.
The awk solution is probably the best, but here is a pure sed solution:
#n sed script to get base and file paths
h
s/.*\/\(.*.txt\)/\1/
w file1
g
s/\(.*\)\/.*.txt/\1/
w file2
Note how we save a copy of the line with h (and restore it with g), and how we use the write (w) command to produce the output files. There are many other ways to do it with sed, but I like this one for using multiple different commands.
To use it :
> sed -f sed_script testfile
Here is another one-liner that uses tee:
cat f1.txt | tee >(xargs -n 1 dirname >> f2.txt) >(xargs -n 1 basename >> f3.txt) &>/dev/random