I have a table like this:
Var score
1 1.00
1 1.06
1 1.03
1 0.65
1 0.68
2 1.06
2 1.07
2 0.64
2 1.05
3 0.71
3 0.72
3 1.03
4 0.68
4 1.08
5 0.11
I want to convert this into a matrix like:
1 2 3 4 5 6
1 0.00 1.00 1.06 1.03 0.65 0.68
2 1.00 0.00 1.06 1.07 0.64 1.05
3 1.06 1.06 0.00 0.71 0.72 1.03
4 1.03 1.07 0.71 0.00 0.68 1.08
5 0.65 0.64 0.72 0.68 0.00 0.11
6 0.68 1.05 1.03 1.08 0.11 0.00
I tried awk but it keeps running:
awk '{if(NF>max) max=NF} END{while(getline<"file"){for(i=NF+1;i<=max;i++)$i="0";print}}'
It keeps running because you forgot to pass it the file name, so awk reads its input from standard input and waits for you to type something at the keyboard. Use awk '...' file, not just awk '...'. But even with that fixed, it will not work as you expect.
You don't need to read the file twice. You can build the matrix in a single pass and populate the missing cells in the END block (tested with GNU and BSD awk):
awk 'NR > 1 {                 # skip the header line
    num[$1] += 1              # how many scores seen so far for this Var
    mat[$1, $1 + num[$1]] = mat[$1 + num[$1], $1] = $2   # fill both halves of the symmetric matrix
    n = num[$1] > n ? num[$1] : n                        # remember the largest group size
}
END {
    n += 1                    # matrix dimension (6 for the sample data)
    mat[0, 0] = ""            # empty top-left corner of the label row/column
    for(i = 1; i <= n; i += 1) {
        mat[0, i] = mat[i, 0] = i                        # row and column labels
        mat[i, i] = "0.00"                               # zero diagonal
    }
    for(i = 0; i <= n; i += 1)
        for(j = 0; j <= n; j += 1)
            printf("%s%s", mat[i, j], j == n ? "\n" : "\t")
}' file
1 2 3 4 5 6
1 0.00 1.00 1.06 1.03 0.65 0.68
2 1.00 0.00 1.06 1.07 0.64 1.05
3 1.06 1.06 0.00 0.71 0.72 1.03
4 1.03 1.07 0.71 0.00 0.68 1.08
5 0.65 0.64 0.72 0.68 0.00 0.11
6 0.68 1.05 1.03 1.08 0.11 0.00
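For comparison, the same reshaping can be sketched with pandas/numpy. This is only a sketch; it assumes the two-column data sits in a whitespace-separated file named file, header line included, as in the awk command above:
import numpy as np
import pandas as pd

# Read the long two-column table (header: "Var score").
df = pd.read_csv('file', sep=r'\s+')

n = int(df['Var'].max()) + 1          # 5 groups of distances -> a 6x6 matrix
mat = np.zeros((n, n))

# Group k holds the distances from k to k+1, k+2, ..., n.
for var, grp in df.groupby('Var'):
    vals = grp['score'].to_numpy()
    mat[var - 1, var:var + len(vals)] = vals

mat += mat.T                          # mirror the upper triangle; the diagonal stays 0
out = pd.DataFrame(mat, index=range(1, n + 1), columns=range(1, n + 1))
print(out.to_string(float_format=lambda v: f'{v:.2f}'))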
An example pandas DataFrame, df, is as follows:
----------------------------------
col_1 col_2 col_3 col_4 ... etc.
----------------------------------
0 34.91 12.45 0.00 256.95
1 0.00 0.00 0.00 0.00
2 2.34 346.78 1.23 0.02
3 0.00 78.95 36.78 2.95
4 0.03 46.21 128.05 30.00
5 0.05 0.10 0.07 0.05
----------------------------------
df = df.assign(col_new_bool = lambda x: True if ((x['col_1'] < 0.0001) and
(x['col_2'] < 0.0002) and
(x['col_3'] < 0.0003) and
(x['col_3'] < 0.0004))
else False)
I want to create a new column (named new_col_bool) in dataframe df.
new_col_bool will contain boolean True if all 4 columns have zeroes.
new_col_bool will contain boolean False if any of the 4 columns is non-zero.
Please help with the correct lambda function.
NOTE:
df has 100+ columns but my new_col_bool is calculated based on only 4 columns.
How do I check a different threshold value for each of those 4 columns?
You don't need a lambda function for something this trivial; use DataFrame.all over axis=1:
df['new_col_bool'] = df.eq(0).all(axis=1)
col_1 col_2 col_3 col_4 new_col_bool
0 34.91 12.45 0.00 256.95 False
1 0.00 0.00 0.00 0.00 True
2 2.34 346.78 1.23 0.02 False
3 0.00 78.95 36.78 2.95 False
4 0.03 46.21 128.05 30.00 False
5 0.05 0.10 0.07 0.05 False
To only check certain columns, select these first:
cols = ['col_1', 'col_2', 'col_3', 'col_4']
df['new_col_bool'] = df[cols].eq(0).all(axis=1)
col_1 col_2 col_3 col_4 new_col_bool
0 34.91 12.45 0.00 256.95 False
1 0.00 0.00 0.00 0.00 True
2 2.34 346.78 1.23 0.02 False
3 0.00 78.95 36.78 2.95 False
4 0.03 46.21 128.05 30.00 False
5 0.05 0.10 0.07 0.05 False
To check any condition:
cols = ['col_1', 'col_2', 'col_3', 'col_4']
cond = df[cols] > 0.5
# or cond = df[cols] <= -1.3
df['new_col_bool'] = cond.all(axis=1)
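For the follow-up about using a different threshold for each of the four columns, one possible sketch (not the only way) is to compare against a Series of per-column thresholds; lt aligns the Series with the DataFrame's columns, so each column is tested against its own value. The threshold numbers below are only illustrative, borrowed from the attempt in the question:
import pandas as pd  # df as defined above

# Hypothetical per-column thresholds.
thresholds = pd.Series({'col_1': 0.0001, 'col_2': 0.0002,
                        'col_3': 0.0003, 'col_4': 0.0004})

# Each column is compared against its own threshold; all(axis=1)
# then requires every one of the four tests to pass.
df['new_col_bool'] = df[thresholds.index].lt(thresholds).all(axis=1)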
I think it would be efficient to transpose the dataframe and take the sum:
df['new_col_bool'] = df.T.sum() == 0
df
Out[1]:
col_1 col_2 col_3 col_4 new_col_bool
0 34.91 12.45 0.00 256.95 False
1 0.00 0.00 0.00 0.00 True
2 2.34 346.78 1.23 0.02 False
3 0.00 78.95 36.78 2.95 False
4 0.03 46.21 128.05 30.00 False
Or for specific columns:
df['new_col_bool'] = df.T.iloc[0:4].sum() == 0
df
Out[1]:
col_1 col_2 col_3 col_4 new_col_bool
0 34.91 12.45 0.00 256.95 False
1 0.00 0.00 0.00 0.00 True
2 2.34 346.78 1.23 0.02 False
3 0.00 78.95 36.78 2.95 False
4 0.03 46.21 128.05 30.00 False
To apply a threshold, use max:
df['new_col_bool'] = df.T.iloc[0:4].max() < 100
df
Out[1]:
col_1 col_2 col_3 col_4 new_col_bool
0 34.91 12.45 0.00 256.95 False
1 0.00 0.00 0.00 0.00 True
2 2.34 346.78 1.23 0.02 False
3 0.00 78.95 36.78 2.95 True
4 0.03 46.21 128.05 30.00 False
5 0.05 0.10 0.07 0.05 True
I am looking for a way to perform a matrix multiplication on two sets of columns in a dataframe. One set of columns needs to be transposed and then multiplied with the other set. Then I need to take the resulting matrix, do an element-wise product with a scalar matrix, and add everything up. Below is an example:
Data for testing:
import pandas as pd
import numpy as np
dftest = pd.DataFrame(
    data=[['A', 0.18, 0.25, 0.36, 0.21, 0, 0.16, 0.16, 0.64, 0.04, 0, 0],
          ['B', 0, 0, 0.5, 0.5, 0, 0, 0, 0.25, 0.75, 0, 0]],
    columns=['Ticker', 'f1', 'f2', 'f3', 'f4', 'f5',
             'p1', 'p2', 'p3', 'p4', 'p5', 'multiplier'])
The starting dataframe holds the data for each Ticker; f1 through f5 represent one set of categories and p1 through p5 represent another.
dftest
Out[276]:
Ticker f1 f2 f3 f4 f5 p1 p2 p3 p4 p5 multiplier
0 A 0.18 0.25 0.36 0.21 0 0.16 0.16 0.64 0.04 0 0
1 B 0.00 0.00 0.50 0.50 0 0.00 0.00 0.25 0.75 0 0
For each row, I need to transpose columns p1 through p5 and then multiply them by columns f1 through f5. I think I have found a solution using the statement below.
dftest.groupby('Ticker')['f1','f2','f3','f4','f5','p1','p2','p3','p4','p5'].apply(lambda x: x[['p1','p2','p3','p4','p5']].T.dot(x[['f1','f2','f3','f4','f5']]))
Out[408]:
f1 f2 f3 f4 f5
Ticker
A p1 0.0288 0.04 0.0576 0.0336 0.0
p2 0.0288 0.04 0.0576 0.0336 0.0
p3 0.1152 0.16 0.2304 0.1344 0.0
p4 0.0072 0.01 0.0144 0.0084 0.0
p5 0.0000 0.00 0.0000 0.0000 0.0
B p1 0.0000 0.00 0.0000 0.0000 0.0
p2 0.0000 0.00 0.0000 0.0000 0.0
p3 0.0000 0.00 0.1250 0.1250 0.0
p4 0.0000 0.00 0.3750 0.3750 0.0
p5 0.0000 0.00 0.0000 0.0000 0.0
Next I need to do an element-wise product of the above matrix against another 5x5 matrix that lives in another DataFrame (m below) and then add up the columns or rows (you get the same result either way). If I extend the above statement as below, I get the result I want.
dftest.groupby('Ticker')['f1','f2','f3','f4','f5','p1','p2','p3','p4','p5'].apply(lambda x: pd.DataFrame(m.values * x[['p1','p2','p3','p4','p5']].T.dot(x[['f1','f2','f3','f4','f5']]).values, columns = m.columns, index = m.index).sum().sum())
Out[409]:
Ticker
A 2.7476
B 1.6250
dtype: float64
So far so good, I think. I'd be happy to learn a better and faster way to do this. The next question is where I am stuck.
How do I take this and update the "multiplier" column on my original dataFrame?
If I try to do the following:
dftest['multiplier']=dftest.groupby('Ticker')['f1','f2','f3','f4','f5','p1','p2','p3','p4','p5'].apply(lambda x: pd.DataFrame(m.values * x[['p1','p2','p3','p4','p5']].T.dot(x[['f1','f2','f3','f4','f5']]).values, columns = m.columns, index = m.index).sum().sum())
I get NaNs in the multiplier column.
dftest
Out[407]:
Ticker f1 f2 f3 f4 f5 p1 p2 p3 p4 p5 multiplier
0 A 0.18 0.25 0.36 0.21 0 0.16 0.16 0.64 0.04 0 NaN
1 B 0.00 0.00 0.50 0.50 0 0.00 0.00 0.25 0.75 0 NaN
I suspect it has to do with indexing and whether the indices after grouping translate back to the original dataframe. Second, do I need a groupby statement for this at all? Since it is a row-by-row calculation, can't I just do it without grouping, or group by the index? Any suggestions on that?
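To make the index suspicion concrete, here is a minimal sketch (the numbers are simply copied from Out[409] for illustration): the groupby result is indexed by Ticker ('A', 'B'), while dftest has a default integer index, so a plain column assignment has nothing to align on; mapping through the Ticker column lines the values up explicitly.
import pandas as pd

# The Series produced by the groupby/apply above, reproduced by hand:
result = pd.Series({'A': 2.7476, 'B': 1.6250})

# dftest['multiplier'] = result                       # aligns 0/1 vs 'A'/'B' -> NaN
dftest['multiplier'] = dftest['Ticker'].map(result)   # aligns by Ticker value instead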
I need to do this without iterating row by row because the whole code will be run repeatedly as part of an optimization: I run this whole process, look at the results, and if they are outside some constraints, calculate new f1 through f5 and p1 through p5 and run the whole thing again.
I posted a question on this earlier but it was confusing, so this is a second attempt. Hope it makes sense.
Thanks in advance for all your help.
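For what it's worth, the per-row quantity being computed is the sum over i and j of m[i, j] * p_i * f_j, which can also be written for all rows at once with np.einsum and no groupby. This is only a sketch and assumes m is the 5x5 DataFrame of scalars mentioned above, indexed by p1..p5 with columns f1..f5:
import numpy as np

pcols = ['p1', 'p2', 'p3', 'p4', 'p5']
fcols = ['f1', 'f2', 'f3', 'f4', 'f5']

# sum_{i,j} m[i, j] * p_i * f_j, computed for every row in one shot.
dftest['multiplier'] = np.einsum('ri,ij,rj->r',
                                 dftest[pcols].to_numpy(),
                                 m.to_numpy(),            # the 5x5 scalar DataFrame
                                 dftest[fcols].to_numpy())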
I'm trying to run an awk command according to some documentation (linky).
However, whenever I add {1} or {2} to the awk command as the documentation describes (see the link above or the examples below), my search stops working: zero results, even on gigantic multi-gigabyte files. Any advice?
These work
awk '($3=="+" && $4~/^CG/)' example
awk '($3=="+" && $4~/..CG/)' example
awk '($3=="+" && $4~/.CG/)' example
awk '($3=="+" && $4~/^..CG/)' example
These don't return anything (even on a 3 gigabyte file)
awk '($3=="+" && $4~/.{2}CG/)' example
awk '($3=="+" && $4~/.{1}CG/)' example
awk '($3=="+" && $4~/^.{2}CG/)' example
Full command according to documentation:
awk '($3=="-" && $4~/^.{1}CG/ ) || ($3=="+" && $4~/^.{2}CG/)' BSMAPexample.txt > CpG.txt
Example dataset
EDIT (A COLUMN DISAPPEARED WHEN I PASTED INTO STACK EXCHANGE, TYPO FIXED)
chr pos strand context ratio total_C methy_C CI_lower CI_upper
chr1 3121589 + CGCGT 0.000 56 0 0.000 0.064
chr1 3121597 + ATCGG 0.000 56 0 0.000 0.064
chr1 3121599 + GTCGT 0.000 56 0 0.000 0.064
chr1 3121605 + CTCGG 0.000 56 0 0.000 0.064
chr1 3121606 + TGCGC 0.000 56 0 0.000 0.064
chr1 3121607 + GGCGC 0.000 56 0 0.000 0.064
chr1 3121611 + CTCGA 0.000 56 0 0.000 0.064
chr1 3121614 + TACGC 0.000 56 0 0.000 0.064
chr1 3121631 + CTCGT 0.000 56 0 0.000 0.064
You have removed some columns from the original sample data.
This is the original data in the link you sent:
chr pos strand context ratio total_C methy_C CI_lower CI_upper
chr1 3121589 + CGCGT 0.000 56 0 0.000 0.064
chr1 3121597 + ATCGG 0.000 56 0 0.000 0.064
chr1 3121599 + GTCGT 0.000 56 0 0.000 0.064
And this is the sample data you posted:
chr pos strand context ratio total_C methy_C CI_lower CI_upper
chr1 3121589 CGCGT 0.000 56 0 0.000 0.064
chr1 3121597 ATCGG 0.000 56 0 0.000 0.064
chr1 3121599 GTCGT 0.000 56 0 0.000 0.064
This is a problem for an expression like this:
awk '($3=="+" && $4~/.{2}CG/)' example
This expects a + symbol in the third column ($3, which does not exist in your data) and CG preceded by two characters in the fourth column ($4, whereas in your data that field is in position 3). It won't match any line in your file.
If you modify the expression to refer to the proper column ($3) and drop the + test (since that sign does not appear in your data), it will match lines in your file:
$ awk '($3~/.{2}CG/)' example
chr1 3121589 CGCGT 0.000 56 0 0.000 0.064
chr1 3121597 ATCGG 0.000 56 0 0.000 0.064
chr1 3121599 GTCGT 0.000 56 0 0.000 0.064
chr1 3121605 CTCGG 0.000 56 0 0.000 0.064
chr1 3121606 TGCGC 0.000 56 0 0.000 0.064
chr1 3121607 GGCGC 0.000 56 0 0.000 0.064
chr1 3121611 CTCGA 0.000 56 0 0.000 0.064
chr1 3121614 TACGC 0.000 56 0 0.000 0.064
chr1 3121631 CTCGT 0.000 56 0 0.000 0.064
$
Actually, every data line in the example file has two characters before the CG, so only the header line is skipped.
Problem solved. I used gawk with --posix:
gawk --posix '($3=="-" && $4~/^.{1}CG/ ) || ($3=="+" && $4~/^.{2}CG/)'
It works just fine now: with --posix, gawk honours interval expressions such as {1} and {2}, which some awk builds do not enable by default.
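For reference, the interval syntax itself is standard ERE, so the pattern was never the problem; a quick Python re sketch against the sample context strings shows it matching as expected, which is consistent with the gawk --posix fix above:
import re

contexts = ['CGCGT', 'ATCGG', 'GTCGT', 'CTCGG', 'TGCGC']
pattern = re.compile(r'^.{2}CG')        # same interval expression as in the awk command

for c in contexts:
    print(c, bool(pattern.search(c)))   # every context matches: two characters, then CG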
I have a text file that looks like this.
A 102
B 456
C 678
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
I need to extract all the lines that start with B or H, plus the two lines after H. How can I do this using awk?
The expected output would be
B 456

H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
Any suggestions please.
Ignoring the blank line after B in your output (your problem specifications give no indication as to why that blank line is in the output, so I'm assuming it should not be there):
awk '/^H/{t=3} /^B/ || t-- >0' input.file
will print all lines that start with B and each line that starts with H along with the next two lines.
awk '/^[BH]/ || /^[[:blank:]]*[[:digit:]]/' inputfile
bash-3.00$ cat t
A 102
B 456
C 678
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
bash-3.00$ awk '{if(( $1 == "B") || ($1 == "H") || ($0 ~ /^ / )) print;}' t
B 456
H A B C D E F G H I J
1.18 0.20 0.23 0.05 1.89 0.72 0.11 0.49 0.31 1.45
3.23 0.06 2.67 1.96 0.76 0.97 0.84 0.77 0.39 1.08
Or, in short:
awk '{if($0 ~ /^[BH ]/ ) print;}' t
Or, even shorter:
awk '/^[BH ]/' t
If H and B aren't the only headers that precede tabular data, and you intend to omit those other blocks of data (you don't specify the requirements fully), you have to use a flip-flop to remember whether you're currently in a block you want to keep:
awk '/^[^ 0-9]/ {inblock=0}; /^[BH]/ {inblock=1}; { if (inblock) print }' d.txt
awk '/^B/ {print} /^H/ {c=3} c && c--' filename.txt > output.txt
EDIT: Updated for OP's edit