Counting Pattern Per Column From Line X to Y Using AWK - awk

I have a file that looks like this:
Sample_ID Population CP026243.1_309 CP026243.1_318 CP026243.1_427 CP026243.1_449 CP026243.1_515
Turbot-BalticSea_01 BalticSea 0 0 0
Turbot-BalticSea_02 BalticSea -1 -1 -1
Turbot-BalticSea_03 BalticSea -1 -1 0
Turbot-BalticSea_04 BalticSea 0 0 -1
Turbot-BalticSea_05 BalticSea -1 -1 0
Turbot-BalticSea_06 BalticSea -1 0 0
Turbot-BalticSea_07 BalticSea -1 -1 -1
Turbot-BalticSea_08 BalticSea 0 0 0
Turbot-BalticSea_09 BalticSea 0 0 1
Turbot-BalticSea_10 BalticSea 0 0 0
Turbot-BalticSea_11 BalticSea -1 -1 1
Turbot-BalticSea_12 BalticSea 0 0 0
Turbot-BalticSea_13 BalticSea 0 0 0
Turbot-BalticSea_14 BalticSea 0 1 -1
Turbot-BalticSea_15 BalticSea 0 0 0
Turbot-BalticSea_16 BalticSea 0 0 0
Turbot-BalticSea_17 BalticSea 0 0 -1
Turbot-Kattegat_01 Kattegat 0 -1 0
Turbot-Kattegat_02 Kattegat -1 -1 0
Turbot-Kattegat_03 Kattegat -1 -1 -1
Turbot-Kattegat_04 Kattegat 0 0 0
Turbot-Kattegat_05 Kattegat -1 -1 -1
Turbot-Kattegat_06 Kattegat -1 -1 -1
I would like to know how I could count the number of "-1" in each column (starting from the 3rd column and from the second row) per population (BalticSea / Kattegat).
I guess AWK would be good for doing this?
cat TEMP.tsv | awk 'NR>1{...}'
Thanks a lot in advance, George.

With awk you could do something like this:
tail -n +2 TEMP.tsv | awk '$3=="-1" { count[$2]++ } END {for (word in count) print word, count[word]}'
which counts the occurrences of -1 in the 3rd column grouped by 2nd column value.
Note that tail command removes the header.
$3 in the awk command is checking for third column. Use $4, $5 for the fourth, fifth column and so on..
To do it for all the columns using a single awk script create a file for example grpbycount.awk:
BEGIN {}
{
for ( counter = 3; counter <=NF; counter++ ) {
if ($counter=="-1") count[counter","$2]++
}
}
END {
for ( word in count) print word, count[word]
}
Then use this command:
tail -n +2 TEMP.tsv | awk -f grpbycount.awk | sort
For your input, this results in the count as below
3,BalticSea 6
3,Kattegat 4
4,BalticSea 5
4,Kattegat 5
5,BalticSea 5
5,Kattegat 3
First numeric value is the column number followed by the value from column 2, and then it prints the count.

Related

Looping through combinations of selected strings in specific columns and counting their occurrence

I have
A 34 missense fixed
A 33 synonymous fixed
B 12 synonymous var
B 34 missense fixed
B 34 UTR fixed
B 45 missense var
TRI 4 synonymous var
TRI 4 intronic var
3 3 synonymous fixed
I wanna output the counts of the combinations missense && fixed, missense && var, synonymous && fixed, synonymous && var , for each element in $1
missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 0
TRI 0 0 0 1
3 0 0 1 0
I can do this way with 4 individual commands selecting for each combination and concatenating the outputs
awk -F'\t' '($3~/missense/ && $4~/fixed/)' file | awk -F'\t' '{count[$1"\t"$3"\t"$4]++} END {for (word in count) print word"\t"count[word]}' > out
But I'm would like to do this for all combinations at once. I've tried some variations of this but not able to make it work
awk print a[i] -v delim=":" -v string='missense:synonymous:fixed:var' 'BEGIN {n = split(string, a, delim); for (i = 1; i <= n-2; ++i) {count[xxxx}++}} END ;for (word in count) print word"\t"count[word]}
You may use this awk with multiple arrays to hold different counts:
awk -v OFS='\t' '
{keys[$1]}
/missense fixed/ {++mf[$1]}
/missense var/ {++mv[$1]}
/synonymous fixed/ {++sf[$1]}
/synonymous var/ {++sv[$1]}
END {
print "-\tmissensefixed\tmissensevar\tsynonymousfixed\tsynonymousvar"
for (i in keys)
print i, mf[i]+0, mv[i]+0, sf[i]+0, sv[i]+0
}
' file | column -t
- missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0
I have used column -t for tabular output only.
GNU awk supports arrays of arrays, so if it is your awk you can count your records with something as simple as num[$1][$3$4]++. The most complex part is the final human-friendly printing:
$ cat foo.awk
{ num[$1][$3$4]++ }
END {
printf(" missensefixed missensevar synonymousfixed synonymousvar\n");
for(r in num) printf("%3s%14d%12d%16d%14d\n", r, num[r]["missensefixed"],
num[r]["missensevar"], num[r]["synonymousfixed"], num[r]["synonymousvar"])}
$ awk -f foo.awk data.txt
missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0
Using any awk in any shell on every Unix box with an assist from column to convert the tab-separated awk output to a visually tabular display if you want it:
$ cat tst.awk
BEGIN {
OFS = "\t"
numTags = split("missensefixed missensevar synonymousfixed synonymousvar",tags)
}
{
keys[$1]
cnt[$1,$3 $4]++
}
END {
for (tagNr=1; tagNr<=numTags; tagNr++) {
tag = tags[tagNr]
printf "%s%s", OFS, tag
}
print ""
for (key in keys) {
printf "%s", key
for (tagNr=1; tagNr<=numTags; tagNr++) {
tag = tags[tagNr]
val = cnt[key,tag]
printf "%s%d", OFS, val
}
print ""
}
}
$ awk -f tst.awk file
missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0
$ awk -f tst.awk file | column -s$'\t' -t
missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0
I'd highly recommend you always give every column a header string though so it doesn't make further processing of the data harder (e.g. reading it into Excel and sorting on headers), so if I were you I'd add printf "key" or something else that more accurately identifies that columns contents as the first line of the END section (i.e. on a line immediately before the first for loop) so the first column gets a header too:
$ awk -f tst.awk file | column -s$'\t' -t
key missensefixed missensevar synonymousfixed synonymousvar
A 1 0 1 0
B 1 1 0 1
TRI 0 0 0 1
3 0 0 1 0

How to speed up Pandas' "iterrows"

I have a Pandas dataframe which I want to transform in the following way: I have some sensor data from an intelligent floor which is in column "CAPACITANCE" (split by ",") and that data comes from the device indicated in column "DEVICE". Now I want to have one row with a column per sensor - each device has 8 sensors, so I want to have devices x 8 columns and in that column I want the sensor data from exactly that sensor.
But my code seems to be super slow since I have about 90.000 rows in that dataframe! Does anyone have a suggestion how to speed it up?
BEFORE:
CAPACITANCE DEVICE TIMESTAMP \
0 0.00,-1.00,0.00,1.00,1.00,-2.00,13.00,1.00 01,07 2017/11/15 12:24:42
1 0.00,0.00,-1.00,-1.00,-1.00,0.00,-1.00,0.00 01,07 2017/11/15 12:24:42
2 0.00,-1.00,-2.00,0.00,0.00,1.00,0.00,-2.00 01,07 2017/11/15 12:24:43
3 2.00,0.00,-2.00,-1.00,0.00,0.00,1.00,-2.00 01,07 2017/11/15 12:24:43
4 1.00,0.00,-2.00,1.00,1.00,-3.00,5.00,1.00 01,07 2017/11/15 12:24:44
AFTER:
01,01-0 01,01-1 01,01-2 01,01-3 01,01-4 01,01-5 01,01-6 01,01-7 \
0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0
01,02-0 01,02-1 ... 05,07-1 05,07-2 05,07-3 05,07-4 05,07-5 \
0 0 0 ... 0 0 0 0 0
1 0 0 ... 0 0 0 0 0
2 0 0 ... 0 0 0 0 0
3 0 0 ... 0 0 0 0 0
4 0 0 ... 0 0 0 0 0
05,07-6 05,07-7 TIMESTAMP 01,07-8
0 0 0 2017-11-15 12:24:42 1.00
1 0 0 2017-11-15 12:24:42 0.00
2 0 0 2017-11-15 12:24:43 -2.00
3 0 0 2017-11-15 12:24:43 -2.00
4 0 0 2017-11-15 12:24:44 1.00
# creating new dataframe based on the old one
floor_df_resampled = floor_df.copy()
floor_device = ["01,01", "01,02", "01,03", "01,04", "01,05", "01,06", "01,07", "01,08", "01,09", "01,10",
"02,01", "02,02", "02,03", "02,04", "02,05", "02,06", "02,07", "02,08", "02,09", "02,10",
"03,01", "03,02", "03,03", "03,04", "03,05", "03,06", "03,07", "03,08", "03,09",
"04,01", "04,02", "04,03", "04,04", "04,05", "04,06", "04,07", "04,08", "04,09",
"05,06", "05,07"]
# creating new columns
floor_objects = []
for device in floor_device:
for sensor in range(8):
floor_objects.append(device + "-" + str(sensor))
# merging new columns
floor_df_resampled = pd.concat([floor_df_resampled, pd.DataFrame(columns=floor_objects)], ignore_index=True, sort=True)
# part that takes loads of time
for index, row in floor_df_resampled.iterrows():
obj = row["DEVICE"]
sensor_data = row["CAPACITANCE"].split(',')
for idx, val in enumerate(sensor_data):
col = obj + "-" + str(idx + 1)
floor_df_resampled.loc[index, col] = val
floor_df_resampled.drop(["DEVICE"], axis=1, inplace=True)
floor_df_resampled.drop(["CAPACITANCE"], axis=1, inplace=True)
Like commented, I'm not sure why you want that many columns, but the new columns can be created as follows:
def explode(x):
dev_name = x.DEVICE.iloc[0]
ret_df = x.CAPACITANCE.str.split(',', expand=True).astype(float)
ret_df.columns = [f'{dev_name}-{col}' for col in ret_df.columns]
return ret_df
new_df = df.groupby('DEVICE').apply(explode).fillna(0)
and then you can merge this with the old data frame:
df = df.join(new_df)

replace values in two columns using awk

Hi I have a text file that look like this
P383 0 II-5 2 0 1/2 0 0 42.7 0 54.67 0
T528 0 P383 2 0 1/2 0 0 0 0 34.06 0
T529 III-8 0 2 0 0 0 0 0 0 37.74 0
T530 III-8 0 2 0 0 0 0 0 0 36.73 0
3B888 III-4 III-5 2 0 1/2 38.4 0 0 0 44.38 0
I want to replace the 0 in columns 2, 3 and 5 with " " (a blank), I know how to do it for one column
The desired output would be
P383 II-5 2 1/2 0 0 42.7 0 54.67 0
T528 P383 2 1/2 0 0 0 0 34.06 0
T529 III-8 2 0 0 0 0 0 37.74 0
T530 III-8 2 0 0 0 0 0 36.73 0
3B888 III-4 III-5 2 1/2 38.4 0 0 0 44.38 0
I know how to do it for a single column
awk '$3=="0"{$3=" "}; {print}' file
But how I do it for the three columns at the same time?
Thanks
Following awk may help you on same.
awk '{$2=$2==0?"":$2;$3=$3==0?"":$3;$5=$5==0?"":$5;} 1' OFS="\t" Input_file
Solution 2nd: Adding a more generic solution by which you could pass number of fields in function as follows.
awk '
function check_fields(a){
num=split(a,array,",");
for(i=1;i<=num;i++){
$array[i]=$array[i]==0?"":$array[i]}
}
check_fields("2,3,5")
1
' OFS="\t" Input_file
awk '
BEGIN { split("2 3 5", tmp); for (i in tmp) flds[tmp[i]] }
{ for (i in flds) if ($i == 0) $i = ""; print }
' file

Check how many 1 are in the column of a matrix #minizinc

Given a matrix Z[n,m]:
0 0 0 0 0
0 0 1 0 0
0 0 0 0 0
1 0 0 0 0
0 0 1 0 0
I'd like to check how many "1" there are in the different columns of the matrix. So given k=1 in this case, the problem should be unsatisfiable since in the column there are 2 "1", so "number of 1">k. I tried in this way but it doesn't work:
constraint forall(i in n, j in m) forall(k in n) k<=( Z[i,j]\/Z[k,j])
Where am I wrong?
In the case I have this variables how I can do?
int b;
int: k;
set of int: PEOPLE = 1..p;
set of int: STOPS = 1..s;
array [1..b, PEOPLE, STOPS] of var bool: Z;
Z[1]
0 0 0 0 0
0 0 1 0 0
0 0 0 0 0
1 0 0 0 0
0 0 1 0 0
Z[2]
0 1 0 0 0
0 0 0 0 0
0 1 0 0 0
0 0 0 0 0
0 0 0 0 0
p = 5;
s =5;
k=1;
b=2;
So in this case the result should be:
Z[1]: 1 0 1 0 0 , the number of "1" is 2, "2 > K"
Z[2]: 0 1 0 0 0, the number of "1" is 1, "1<=K"
UNSATISFIABLE
I just solved in this way:
array [1..b, STOPS] of var bool: M;
constraint forall (m in 1..b) ( forall (j in STOPS) ( M[m,j]= exists([Z[m,i,j] | i in PEOPLE ])));
constraint forall (m in 1..b) ( let {
var int: s = sum (j in STOPS)(M[m,j]>0);
} in
s <= t );
thank you all for the answers :)

How can I change my index vector into sparse feature vector that can be used in sklearn?

I am doing a News recommendation system and I need to build a table for users and news they read. my raw data just like this :
001436800277225 [12,456,157]
009092130698762 [248]
010003000431538 [361,521,83]
010156461231357 [173,67,244]
010216216021063 [203,97]
010720006581483 [86]
011199797794333 [142,12,86,411,201]
011337201765123 [123,41]
011414545455156 [62,45,621,435]
011425002581540 [341,214,286]
the first column is userID, the second column is the newsID.newsID is a index column, for example, after transformation, [12,456,157] in the first row means that this user has read the 12th, 456th and 157th news (in sparse vector, the 12th column, 456th column and 157th column are 1, while other columns have value 0). And I want to change these data into a sparse vector format that can be used as input vector in Kmeans or DBscan algorithm of sklearn.
How can I do that?
One option is to construct the sparse matrix explicitly. I often find it easier to build the matrix in COO matrix format and then cast to CSR format.
from scipy.sparse import coo_matrix
input_data = [
("001436800277225", [12,456,157]),
("009092130698762", [248]),
("010003000431538", [361,521,83]),
("010156461231357", [173,67,244])
]
NUMBER_MOVIES = 1000 # maximum index of the movies in the data
NUMBER_USERS = len(input_data) # number of users in the model
# you'll probably want to have a way to lookup the index for a given user id.
user_row_map = {}
user_row_index = 0
# structures for coo format
I,J,data = [],[],[]
for user, movies in input_data:
if user not in user_row_map:
user_row_map[user] = user_row_index
user_row_index+=1
for movie in movies:
I.append(user_row_map[user])
J.append(movie)
data.append(1) # number of times users watched the movie
# create the matrix in COO format; then cast it to CSR which is much easier to use
feature_matrix = coo_matrix((data, (I,J)), shape=(NUMBER_USERS, NUMBER_MOVIES)).tocsr()
Use MultiLabelBinarizer from sklearn.preprocessing
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df.newsID), columns=mlb.classes_)
12 41 45 62 67 83 86 97 123 142 ... 244 248 286 341 361 411 435 456 521 621
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0
2 0 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 1 0 0 0 1 0
3 0 0 0 0 1 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 0 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 1 0 0 0 0 0 1 0 0 1 ... 0 0 0 0 0 1 0 0 0 0
7 0 1 0 0 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
8 0 0 1 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 1 0 0 1
9 0 0 0 0 0 0 0 0 0 0 ... 0 0 1 1 0 0 0 0 0 0