Awk for handling three files - awk

Input:
File 1
col1 col2 col3 col4 col5 col6 col7
A 91 - E Abu 7 -
B 82 - f Anu 9 -
C 93 - G Aru 8 -
File 2
col1 col2 col3 col4 col5 col6 col7
A 91 - x Bob 7 -
B 82 - y Bag 9 -
C 93 - z Bui 8 -
File 3
col1 col2 col3 col4 col5 col6 col7
A 91 - T Can 7 -
B 82 - U Con 9 -
C 93 - V Cuu 8 -
Output Expected:
col1 col2 col3 col4 col5
A 91 Abu Bob Can
B 82 Anu Bag Con
C 93 Aru Bui Cuu
I have three files with the same data in col1 and col2. I need to print the fifth column of all files along with the first two columns.
I am able to do this with two files. Can anyone help me do it with three or more files?

Here is one way using awk:
$ awk '
BEGIN {
    SUBSEP = FS                        # join the (col1,col2) composite key with a space
    print "col1 col2 col3 col4 col5"
}
FNR > 1 {                              # skip each file's header line
    a[$1,$2] = (a[$1,$2] ? a[$1,$2] FS $5 : $5)   # append this file's col5 to the key
}
END {
    for (x in a) print x, a[x]
}' file1 file2 file3
col1 col2 col3 col4 col5
C 93 Aru Bui Cuu
A 91 Abu Bob Can
B 82 Anu Bag Con
You can pipe the output to sort if you require sorted output. This is not limited to three files; it scales to any number of files. Just add the file names at the end, or use a glob such as * to match all files in a given directory.
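For example, a minimal sketch assuming the data files match the glob file* and use the default space separators; the header is printed separately so that sort only sees the data rows:
$ { printf 'col1 col2 col3 col4 col5\n'
    awk 'FNR > 1 { a[$1 FS $2] = (a[$1 FS $2] ? a[$1 FS $2] FS $5 : $5) }
         END { for (x in a) print x, a[x] }' file* | sort; }
col1 col2 col3 col4 col5
A 91 Abu Bob Can
B 82 Anu Bag Con
C 93 Aru Bui Cuu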

Assuming all three files have the same number of rows, based on this sentence:
"I have three files with the same data in col1 and col2."
awk 'BEGIN { OFS = "\t"
    getline < "file1"; getline < "file2"; getline < "file3"   # skip the header lines
    print "col1", "col2", "col3", "col4", "col5"
    while (1) {
        getline < "file1"; a = $1; b = $2; c = $5
        getline < "file2"; d = $5
        f = getline < "file3"; e = $5    # f becomes 0 once file3 is exhausted
        if (!f) exit
        print a, b, c, d, e } }'
Output:
col1 col2 col3 col4 col5
A 91 Abu Bob Can
B 82 Anu Bag Con
C 93 Aru Bui Cuu
This discards the first line of each file, then reads the files line by line, printing the desired fields.

Related

What is the name of the function to create dataframe from text in R, with each element being separated by space or tabulation?

I know that it's possible, but I can't remember the name of this function.
# Does this function exist? For a character element formatted more or less as follows
function_readText(
"col1 col2
1 2
3 4")
col1 col2
1 1 2
2 3 4
It's read.table:
read.table(textConnection(
"col1 col2
1 2
3 4"), header = T)
col1 col2
1 1 2
2 3 4
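As a side note, read.table also accepts the string directly through its text argument, which avoids the explicit textConnection (an equivalent sketch):
read.table(text = "col1 col2
1 2
3 4", header = TRUE)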

Remove multiple strings at a time in sql

I want to remove 'hi', 'by' and 'dy' from col2 in one shot in SQL. I'm very new to SQL Server; if anyone could give an outline of how such problems are solved, that would be really helpful.
Col1 col2 col3
A hi!abcd 123
B bypython 678
C norm 888
D dupty dy 999
output:
Col1 col2 col3
A abcd 123
B python 678
C norm 888
D dupty 999
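One common outline in SQL Server is to chain one REPLACE per unwanted substring and trim any leftover spaces. A minimal sketch, assuming a table named t and that the substrings to strip are 'hi!', 'by' and 'dy' (adjust if the '!' should be kept):
SELECT col1,
       LTRIM(RTRIM(REPLACE(REPLACE(REPLACE(col2, 'hi!', ''), 'by', ''), 'dy', ''))) AS col2,
       col3
FROM t;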

Creating multiple pandas column from combination of a list in python

I have a pandas data frame with one column as a string separated by commas,
eg:
col1 col2
B1,B2,B3 20
B4,B5,B6 15
and I want to create another data frame with combinations like:
Col1 Col2 col3 col4 col5
B1,B2,B3 20 B1,B2 B2,B3 B1,B3
B4,B5,B6 15 B4,B5 B5,B6 B4,B6
How can I do this in pandas?
If there are always 3 values in col1, you could use itertools.combinations:
from itertools import combinations
splits = df['col1'].str.split(',', expand=True).values.tolist()
df[['col3', 'col4', 'col5']] = [[','.join(c) for c in combinations(split, 2)] for split in splits]
print(df)
Output
col1 col2 col3 col4 col5
0 B1,B2,B3 20 B1,B2 B1,B3 B2,B3
1 B4,B5,B6 15 B4,B5 B4,B6 B5,B6
A more general solution is to do:
from itertools import combinations
import pandas as pd
splits = df['col1'].str.split(',').values.tolist()
rows = [[','.join(c) for c in combinations(split, 2)] for split in splits]
length = max(len(row) for row in rows)
new_cols = pd.DataFrame(data=rows, columns=[f'col{i}' for i in range(3, length + 3)])
res = pd.concat((df, new_cols), axis=1)
print(res)
Output
col1 col2 col3 col4 col5
0 B1,B2,B3 20 B1,B2 B1,B3 B2,B3
1 B4,B5 15 B4,B5 None None
Note that the input for the second example was:
col1 col2
B1,B2,B3 20
B4,B5 15

Select column index in pandas dataframe based on values

I have a csv file like so:
Id col1 col2 col3
a 7.04 0.3 1.2
b 0.3 1.7 .1
c 0.34 0.05 1.3
d 0.4 1.60 3.1
I want to convert it to a data frame by thresholding on 0.5: if a value is greater than or equal to 0.5 then the column is counted, otherwise it is not counted.
Id classes
a col1,col3
b col2
c col3
d col2,col3
The closest solution I found is this one; however, it deals with single rows, not multiple rows. For multiple rows, the best I have is to iterate through all the rows. I need a succinct expression without a for loop.
Use set_index first and then numpy.where to extract column names by condition. Last, remove the empty strings with a list comprehension:
import numpy as np
import pandas as pd
df = df.set_index('Id')
s = np.where(df > .5, ['{}, '.format(x) for x in df.columns], '')
df['new'] = pd.Series([''.join(x).strip(', ') for x in s], index=df.index)
print (df)
col1 col2 col3 new
Id
a 7.04 0.30 1.2 col1, col3
b 0.30 1.70 0.1 col2
c 0.34 0.05 1.3 col3
d 0.40 1.60 3.1 col2, col3
Similarly, for a new DataFrame:
df1 = pd.DataFrame({'classes': [''.join(x).strip(', ') for x in s],
'Id': df.index})
print (df1)
Id classes
0 a col1, col3
1 b col2
2 c col3
3 d col2, col3
And if necessary, remove the space after each comma:
df1 = pd.DataFrame({'classes': [''.join(x).strip(', ').replace(', ',',') for x in s],
'Id': df.index})
print (df1)
Id classes
0 a col1,col3
1 b col2
2 c col3
3 d col2,col3
Detail:
print (s)
[['col1, ' '' 'col3, ']
['' 'col2, ' '']
['' '' 'col3, ']
['' 'col2, ' 'col3, ']]
Alternative with apply (slower):
df1 = (df.set_index('Id')
         .apply(lambda x: ','.join(x.index[x > .5]), 1)
         .reset_index(name='classes'))
print (df1)
Id classes
0 a col1,col3
1 b col2
2 c col3
3 d col2,col3
Comprehension after clever multiplication... This assumes that Id is the index.
df.assign(classes=[
    ','.join(s for s in row if s)
    for row in df.ge(.5).mul(df.columns).values
])
col1 col2 col3 classes
Id
a 7.04 0.30 1.2 col1,col3
b 0.30 1.70 0.1 col2
c 0.34 0.05 1.3 col3
d 0.40 1.60 3.1 col2,col3
Setup for the fun trick
Custom subclass of str that redefines string addition to include a ','
class s(str):
    def __add__(self, other):
        if self and other:
            return s(super().__add__(',' + other))
        else:
            return s(super().__add__(other))
Fun Trick
df.ge(.5).mul(df.columns).applymap(s).sum(1)
Id
a col1,col3
b col2
c col3
d col2,col3
dtype: object

Linux remove space/column if matching string pattern

I would like awk or sed, or any other display/filter mechanism available in a native shell, to remove the space between two strings (columns) on lines that match, leave the space alone on lines that do not match, and then display the output.
My file:
# cat test2
col0 col1 col2 col3 col4 col5
ln1 abc def ghi jkl mno
ln2 abc defghi jkl mno pqr
My output I would like to be:
# <something to view/grep/filter with> test2
col0 col1 col2 col3 col4 col5
ln1 abc defghi jkl mno
ln2 abc defghi jkl mno pqr
I have tried multiple combinations of grep, awk and cut, but was not able to do it. I am not good with sed, but I can try. I even tried to use an interim file, i.e. echo the output to some file and then grep it, but I failed at that too.
Edited with more of my requirements:
My biggest problem is that I can't predict where the space will be or what the contents of the row entries will be, so I would like sed to produce the output based on column numbers rather than on a specific string.
My file:
# cat test2
col0 col1 col2 col3 col4 col5
ln1 abc def ghi jkl mno
ln2 abc defghi jkl mno pqr
ln3 aaa bbb ccc ddd eee
ln4 aaa bbbccc ddd eee
Output File:
# <something to view/grep/filter with> test2
col0 col1 col2 col3 col4 col5
ln1 abc def ghi jkl mno
ln2 abc defghi jkl mno pqr
ln3 aaa bbbccc ddd eee
ln4 aaa bbbccc ddd eee
sed 's/def ghi/defghi/' file
If that's not what you wanted then edit your question to clarify your requirements and provide input/output that better demonstrates your problem.