Perl: Combine duplicated keys in a Hash of Arrays

I'm having issues with this and wondering if someone could provide some help. I'm parsing a .txt file and want to combine duplicated keys and their values. Essentially, for each identifier I want to store its height value. Each "sample" has 2 entries (A & B). I read the file in like this:
while (...) {
    @data = split("\t", $line);
    $curr_identifier = $data[0];
    $markername = $data[1];
    $position1 = $data[2];
    $height = $data[4];
    if ($line > 0) {
        $result[0] = $markername;
        $result[1] = $position1;
        $result[2] = $height;
        $result[3] = $curr_identifier;
        $data{$curr_identifier} = [@result];
    }
}
This seems to work fine, but my issue is that when I send this data to the function below, it prints $curr_identifier twice. I only want to populate unique identifiers and check for the presence of its $height value.
if (!defined $data{$curr_identifier}[2]) {
    $output1 = "no height for both markers - failed";
} else {
    if ($data{$curr_identifier}[2] eq " ") {
        $output1 = $markername;
    }
}
print $curr_identifier, $output1 . "\t" . $output1 . "\n";
Basically, if the sample height is present for both markers (A & B), then the output is both markers.
'1', 'A', 'B'
If a height is not present, then the output is empty for that marker.
'2', 'A', ' '
'3', ' ', 'B'
My current output is printing out like this:
1, A
1, B
2, A
2, ' '
3, ' '
3, B
__DATA__
Name Marker Position1 Height Time
1 A A 6246 0.9706
1 B B 3237 0.9706
2 A 0
2 B B 5495 0.9775
3 A A 11254 0.9694
3 B 0

Your desired output can essentially be boiled down to these few lines of Perl code:
while (<DATA>) {
    ($name, $mark, $pos, $heig, $time) = split /\t/;
    print "'$name','$mark','$pos'\n";
}
__DATA__
... your tab-separated data here ...
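For the combining step itself, here is a minimal sketch (my code; the %fields_for hash, the value_for helper, and its height test are illustrative assumptions, not from the original post). The key move is pushing one field per input row into a hash keyed by identifier, which collapses the two per-marker rows into a single output line:

use strict;
use warnings;

# Hypothetical helper: return the marker name when a height was read,
# a blank field otherwise -- replace with your own height checks.
sub value_for {
    my ($marker, $height) = @_;
    return (defined $height && $height ne '' && $height ne ' ') ? $marker : ' ';
}

my %fields_for;    # identifier => one output field per input row

while (my $line = <DATA>) {
    chomp $line;
    my ($name, $marker, $pos, $height, $time) = split /\t/, $line;
    next if $name eq 'Name';    # skip the header row
    push @{ $fields_for{$name} }, value_for($marker, $height);
}

# One line per identifier instead of one line per row.
for my $name (sort { $a <=> $b } keys %fields_for) {
    print join(', ', map { "'$_'" } ($name, @{ $fields_for{$name} })), "\n";
}

__DATA__
(paste the tab-separated rows from the question here)

With the sample data pasted under __DATA__, this prints one quoted row per identifier; tune value_for() until the per-marker fields match the desired output above.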

Replacing sequences of space-delimited numbers in huge input with awk

How can I replace sequences of space-delimited numbers when those sequences span multiple lines and the input is too big to fit in RAM?
A sample input would be:
edit: I re-worked the sample input and input parameters to introduce border cases (excluding ones that have to do with the length of the matched sequence or replacement priorities)
3 12 3 4
0 6 7 10
8 9 12 3
4 6 7 8
10 6 6 7
9 199 10 11
11
note: the number of fields per line is homogeneous but not known in advance; the last line might contain fewer fields
From that input I would like to:
replace 3 4 with &
replace 6 7 8 with 9 9
replace 6 7 9 with 8 8
replace 7 10 with 11 12
replace 0 with nothing
replace 10 with 13 10
replace 8 9 12 3 5 with #
The expected output would have one number or replacement per line:
3
12
&
6
11 12
8
9
12
&
9 9
13 10
6
8 8
199
13 10
11
11
I'm trying to do the task with awk but I'm having a hard time implementing a dynamic state machine with a pseudo B-Tree:
tr -s '[:space:]' '\n' < input.txt |
awk '
BEGIN {
    for (i = 2; i < ARGC; i += 2) {
        n = split(ARGV[i], arr)
        k = ""
        for (j = 1; j <= n; j++) {
            k = j SUBSEP k SUBSEP arr[j]
            Tree[k]
        }
        Tree[k] = "$" ARGV[i+1]    #=> now can test "if (Tree[k])"
        delete ARGV[i]
        delete ARGV[i+1]
    }
}
{
    Key = (int(Key) + 1) SUBSEP Key SUBSEP $1
    if ( Key in Tree ) {
        if (Tree[Key]) {
            print substr(Tree[Key], 2)
            Buffer = ""
            Key = ""
        }
        else
            Buffer = Buffer $1 "\n"
    } else {
        print Buffer $1
        Buffer = ""
        Key = ""
    }
}
END { if (Buffer != "") printf ("%s", Buffer) }
' - \
'3 4' '&' \
'6 7 8' '9 9' \
'6 7 9' '8 8' \
'7 10' '11 12' \
'0' '' \
'10' '13 10' \
'8 9 12 3 5' '#'
edit: I realised that the code doesn't backtrack after failing to find a complete match in the B-tree, so it's wrong...
How I'm planning to tackle the problem
I'm emulating a B-tree with an array and keys in the following format:
from the middle to the left of the key are the consecutive depths
from the middle to the right of the key are the consecutive values
When a key exists in Tree:
if it doesn't have an associated value then it's a node
if there's a value then it's a leaf
So, for the current input parameters, the content of the Tree array will be:
# from param: "3 4" => "&"
Tree[ 1,"",3 ]
Tree[2,1,"",3,4] = "$&"
# from param: "6 7 8" => "9 9"
Tree[ 1,"",6 ]
Tree[ 2,1,"",6,7 ]
Tree[3,2,1,"",6,7,8] = "$9 9"
# from param: "6 7 9" => "8 8"
Tree[ 1,"",6 ]
Tree[ 2,1,"",6,7 ]
Tree[3,2,1,"",6,7,9] = "$8 8"
# from param: "7 10" => "11 12"
Tree[ 1,"",7 ]
Tree[2,1,"",7,10] = "$11 12"
# from param: "0" => ""
Tree[1,"",0] = "$"
# from param: "10" => "13 10"
Tree[1,"",10] = "$13 10"
# from param: "8 9 12 3 5" => "#"
Tree[ 1,"",8 ]
Tree[ 2,1,"",8,9 ]
Tree[ 3,2,1,"",8,9,12 ]
Tree[ 4,3,2,1,"",8,9,12,3 ]
Tree[5,4,3,2,1,"",8,9,12,3,5] = "$#"
FWIW I'd approach this by figuring out the max number of records that you might need to search in based on the mappings you want, keeping a rolling buffer of that number of records, and then doing the comparison part on each buffer, e.g.:
$ cat tst.awk
BEGIN {
    RS = "[[:space:]]+"
    map("3,4"   , "&")
    map("6,7,8" , "9")
    map("9"     , "")
    map("0"     , "\\000")
    map("13,10" , "10")
}
{ buf[((NR-1) % maxRecs) + 1] = $0 }
NR >= maxRecs { prt() }
END { prt() }

function prt(   nr, sep, str) {
    for ( nr = NR - maxRecs + 1; nr <= NR; nr++ ) {
        str = str sep buf[((nr-1) % maxRecs) + 1]
        sep = ORS
    }
    print ">>>>" ORS str ORS "<<<<"
    # Replace the above with something that loops through the
    # strings you want replaced, e.g.
    #
    # for ( mapNr=1; mapNr<=numMaps; mapNr++ ) {
    #     old = olds[mapNr]
    #     if ( str ~ old ) { # add something to avoid partial matches
    #         new = news[mapNr]
    #         replace old with new in the output
    #     }
    # }
}

function map(old, new,   numRecs) {
    ++numMaps
    numRecs = gsub(/,/, ORS, old) + 1
    maxRecs = ( numRecs > maxRecs ? numRecs : maxRecs )
    olds[numMaps] = old
    news[numMaps] = new
}
$ awk -f tst.awk file
>>>>
112
3
4
<<<<
>>>>
3
4
6
<<<<
>>>>
4
6
7
<<<<
>>>>
6
7
8
<<<<
>>>>
7
8
9
<<<<
>>>>
8
9
12
<<<<
>>>>
9
12
0
<<<<
>>>>
12
0
3
<<<<
>>>>
0
3
4
<<<<
>>>>
3
4
15
<<<<
>>>>
4
15
255
<<<<
>>>>
15
255
13
<<<<
>>>>
255
13
10
<<<<
>>>>
13
10
6
<<<<
>>>>
10
6
7
<<<<
>>>>
6
7
8
<<<<
>>>>
7
8
199
<<<<
>>>>
8
199
9
<<<<
>>>>
199
9
0
<<<<
>>>>
9
0
13
<<<<
>>>>
9
0
13
<<<<
The above just prints the buffer-sized strings; the part to be added is replacing the target strings with the new ones in a way that the next target doesn't match the replaced part. That's a common problem with, I expect, lots of solutions online, so it's left as an exercise (one possible shape is sketched after these notes).
You'll also need to tweak it to make sure it doesn't revisit lines at the end of the input.
The above uses GNU awk for multi-char RS; if you don't have GNU awk then just pipe the input through tr -s '[:space:]' '\n' as shown in the question.
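For the exercise mentioned above, one hedged possibility (my sketch, not part of the original answer) is to anchor each candidate mapping at the front of the buffered string and require the match to end on a record boundary, so a replacement is printed and the buffer advances past it rather than being re-scanned:

# Sketch: return the number of the first mapping whose ORS-separated
# records exactly match the leading records of str, or 0 if none do.
# Assumes the olds[] and numMaps populated by map() in tst.awk above.
function first_match(str,   mapNr, old) {
    for ( mapNr = 1; mapNr <= numMaps; mapNr++ ) {
        old = olds[mapNr]
        # Appending ORS to both strings forces whole-record matches,
        # so "3" cannot match the front of "34".
        if ( index(str ORS, old ORS) == 1 )
            return mapNr
    }
    return 0
}

On a hit, prt() would print news[first_match(...)] and drop the matched records from the front of the buffer; on a miss, it would print and drop a single record.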
UPDATE:
The previous answer (see the edit revisions) was woefully slow (several minutes) when run against a ramped-up input (7K mappings in map.txt; 25M tokens in input.txt).
The new answer (below) is a complete rewrite and processes the same 7K-mappings/25M-tokens input in ~45 seconds.
The main component of this design centers around a tree-like node structure used to manage the series of tokens (lines of input from map.txt):
tree [ParentNodeNbr] [token] [NodeType] = value
Where:
ParentNodeNbr == 0 for the root
token from map.txt
NodeType has one of two values 'node' or 'leaf'
for NodeType = 'node' the value stored in the array is a numeric node number (implemented as a counter that's incremented each time a new node is added to the tree); this node number becomes the ParentNodeNbr for the next token in the series
for NodeType = 'leaf' this designates the 'end' of a series of tokens (line of input from map.txt) and the value stored in the array is the line number (aka FNR) from map.txt; this line number (FNR) is used as an index into a couple of other arrays and to determine precedence when an input sequence (from input.txt) has multiple matches from map.txt
when processing a series of tokens from a map.txt line of input we start at ParentNodeNbr == 0 looking for a series of matching nodes, adding new nodes as needed
Setup: storing replacements in a comma-delimited file (map.txt), and adding one additional line to input.txt:
$ head map.txt input.txt
==> map.txt <==
2 3 4,X # "2 3 4" has precedence over ...
2 3,Y # "2 3"
3 4,&
6 7 8,9
9,
0,\000
13 10,10
==> input.txt <==
2 3 4 # keep an eye on "2 3" vs "2 3 4" precedence
112 3
4 6 7
8 9 12 0 3
4 15 255 13
10 6
7 8 199 9
0 13
NOTE: here's what tree[][][] looks like when populated from map.txt:
tree [Parent] [Token] [NodeType] = NodeVal

Parent   Token   NodeType   NodeVal   MapTo      ** MapTo only applies to NodeType = leaf
======   =====   ========   =======   =====
     0       0   leaf             6   "\000"
     0       2   node             1
     0       3   node             3
     0       6   node             4
     0       9   leaf             5   ""
     0      13   node             6
     1       3   node             2
     1       3   leaf             1   "Y"
     2       4   leaf             2   "X"
     3       4   leaf             3   "&"
     4       7   node             5
     5       8   leaf             4   "9"
     6      10   leaf             7   "10"
One GNU awk approach (for true multidimensional arrays):
awk '
function replace(op) {
    while ( ((maxToken - minToken + 1) >= maxlen) || op == "flush" ) {
        NodeNbr = root
        minOrd = maxOrd
        # loop through tokens in buffer[]
        for (j = 0; j < maxlen; j++) {
            token = buffer[ ((minToken + j - 1) % maxlen) + 1 ]
            # if we find a matching "leaf" node then keep track of the ordering
            # (ie, FNR from map.txt; lower order == higher precedence)
            if ( token in tree[NodeNbr] && "leaf" in tree[NodeNbr][token] )
                minOrd = ( tree[NodeNbr][token]["leaf"] < minOrd ) ? tree[NodeNbr][token]["leaf"] : minOrd
            # if we find a matching "node" node then grab the next node to compare
            # against the next token from buffer[]
            if ( token in tree[NodeNbr] && "node" in tree[NodeNbr][token] ) {
                NodeNbr = tree[NodeNbr][token]["node"]
                continue
            }
            # if we get here we have a token from buffer[] that does not match any
            # of our replacement mappings so abort checking the rest of buffer[]
            break
        }
        if (minOrd < maxOrd) {     # if we found at least one complete match (ie, hit a "leaf" node) then ...
            print map[minOrd]      # use the associated "ord"er to print the associated replacement string and ...
            minToken = minToken + len[minOrd]    # update the pointer into the buffer[] array
        }
        else {                     # otherwise we did not find a match so ...
            print buffer[ ((minToken - 1) % maxlen) + 1 ]    # print the first token from buffer[] and ...
            minToken++             # update the pointer into the buffer[] array
        }
        if (minToken > maxToken)
            break
    }
}

BEGIN {
    root = maxNodeNbr = maxToken = 0
    minToken = 1
    maxOrd = 9999999999
}

FNR==NR {
    split($0, a, ",")
    map[FNR] = a[2]                # save replacement string for this input line from map.txt
    n = split(a[1], b)             # break our matching pattern into tokens
    len[FNR] = n                   # make note of number of tokens in this line of input
    maxlen = (n > maxlen) ? n : maxlen    # keep track of longest series of tokens
    NodeNbr = root                 # initiate our tree search
    for (i = 1; i <= n; i++) {     # loop through our list of tokens
        token = b[i]
        if (i == n)                # if the last token for this line then create a "leaf" node and store the line number (aka "order")
            tree[NodeNbr][token]["leaf"] = FNR
        else
            if ( tree[NodeNbr][token]["node"] )    # else if we already have a node at this point in the tree then grab its associated node number for the next level in the tree
                NodeNbr = tree[NodeNbr][token]["node"]
            else {                 # else create a new "node" node and populate with the next available node number
                tree[NodeNbr][token]["node"] = ++maxNodeNbr
                NodeNbr = maxNodeNbr    # use this as the next level in our tree traversal
            }
    }
    maxrec = FNR                   # keep track of total number of replacement sets from map.txt (only used if we decide to print the contents of map[] to stdout)
    next
}

FNR==1 {
    # Uncomment following to display the contents of the map[] array:
    # for (i = 1; i <= maxrec; i++)
    #     print "map:" i ":" map[i] ":"
    #
    # Uncomment following to display the contents of the tree[][][] array:
    # fmt = "%6s%8s%10s%10s%10s\n"
    # printf "tree [Parent] [Token] [NodeType]\n\n"
    # printf fmt, "Parent", "Token", "NodeType", "NodeVal", "MapTo"
    # printf fmt, "======", "=====", "========", "=======", "====="
    #
    # for (NodeNbr = root; NodeNbr <= maxNodeNbr; NodeNbr++)
    #     for (token in tree[NodeNbr])
    #         for (NodeType in tree[NodeNbr][token]) {
    #             NodeVal = tree[NodeNbr][token][NodeType]
    #             printf fmt, NodeNbr, token, NodeType, NodeVal, (NodeType=="leaf") ? "\"" map[NodeVal] "\"" : ""
    #         }
}

{
    for (i = 1; i <= NF; i++) {    # loop through tokens in current line from input.txt
        maxToken++
        buffer[ ((maxToken - 1) % maxlen) + 1 ] = $i
        if ( (maxToken - minToken + 1) >= maxlen )    # if we have a "full" buffer then ...
            replace()              # look for replacement match
    }
}

END { replace("flush") }           # flush the rest of buffer[]
' map.txt input.txt
This generates:
X # "2 3 4" has precendence over "2 3"
112
&
9
12
\000
&
15
255
10
9
199
\000
13
If we switch the first 2 lines of map.txt like so:
==> map.txt <==
2 3,Y # "2 3" has precedence over ...
2 3 4,X # "2 3 4"
We now generate:
Y # "2 3" has precendence over "2 3 4" thus ...
4 # leaving "4" by itself
112
&
9
12
\000
&
15
255
10
9
199
\000
13

Sum up all numbers before and after an operator sign in a string user input

I am new to programming and starting out with Kotlin. I've been stuck on this problem for a few days and I really need some assistance. I am trying to read a user input string in a format like this: 3 + 2 + 1. I want to go through the string and, wherever there is an operator, add up the numbers before and after the operator sign, so the above 3 + 2 + 1 should output 6.
Here's a snippet of my code:
fun main() {
    val userInput = readLine()!!.split(" ")
    var sum = 0
    for (i in 0 until userInput.size) {
        if (userInput.get(i) == "+") {
            sum += userInput.get(i - 1).toInt() + userInput.get(i + 1).toInt()
        }
    }
    println(sum)
}
My code works until the point of adding up the numbers. It repeats the number next to the operator, so using the above example of 3 + 2 + 1 it outputs 8, i.e. 3 + 2 + 2 + 1. I'm so confused and don't know how to go about this.
Try not to increment the sum value each time; instead, rewrite the last number that participated in the sum. Like this:
You have the case: 1 + 2 + 3 + 4
Split them
Now you have the array [1, +, 2, +, 3, +, 4]
Then you iterate this array, stop at the first plus, and sum the values.
Rewrite the second summed value with your sum.
Now you have new array [1, +, 3, +, 3, +, 4]
At the end of the loop, you will have this array [1, +, 3, +, 6, +, 10]
And your sum is the last element of the array
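A minimal Kotlin sketch of that rewrite-in-place idea (my code, following the steps above; it assumes well-formed input like "1 + 2 + 3 + 4"):

fun main() {
    val tokens = readLine()!!.split(" ").toMutableList()
    for (i in tokens.indices) {
        if (tokens[i] == "+") {
            // Fold the running total into the number to the right of this "+".
            tokens[i + 1] = (tokens[i - 1].toInt() + tokens[i + 1].toInt()).toString()
        }
    }
    // The last element now holds the total, e.g. [1, +, 3, +, 6, +, 10] -> 10.
    println(tokens.last())
}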
The logic of your code is that for each "+" encountered, it adds the sum of the numbers left and right of the "+" to the sum. For the example "1 + 2 + 3", here's what is happening:
Starting sum is 0.
At first "+", add 1 + 2 to sum, so now sum is 3.
At second "+", add 2 + 3 to sum, so now the sum is 3 + 5 = 8.
So you are adding all the middle numbers to the total twice, because they each appear next to two operators.
One way to do this is start with the first number as your sum. Then add only the number to the right of each "+", so numbers are only counted once.
fun main() {
    val userInput = readLine()!!.split(" ")
    var sum = userInput[0].toInt()
    for (i in userInput.indices) {
        if (userInput[i] == "+") {
            sum += userInput[i + 1].toInt()
        }
    }
    println(sum)
}
sum += userInput.get(i-1).toInt() + userInput.get(i+1).toInt()
This is only valid for the first iteration. Say the user inputs 1 + 2 + 3, so:
userInput[0] is 1
userInput[1] is +...
The first time that line is triggered, sum will be 1 + 2, and that's quite fine; but the second time, at the second +, you will sum i-1 (which is 2 and was in the total sum already) and i+1 (which is 3), so you are doing 1+2+2+3.
You need to understand why this is happening and think of another way to implement it.
Check this, it is working for me:
var sum = 0
readLine()?.split(" ")?.filter { it.toIntOrNull() != null }?.forEach { sum += it.toInt() }
println(sum)

Successive averaging of repeating data but different number of lines

I have the following format of data:
1 3
1.723608 0.8490000
1.743011 0.8390000
1.835833 0.7830000
2 5
1.751377 0.8350000
1.907603 0.7330000
1.780053 0.8190000
1.601427 0.9020000
1.950540 0.6970000
3 2
1.993951 0.6610000
1.796519 0.8090000
4 4
1.734961 0.8430000
1.840741 0.7800000
1.818444 0.7950000
1.810717 0.7980000
5 1
2.037940 0.6150000
6 7
1.738221 0.8330000
1.767678 0.8260000
1.788517 0.8140000
2.223586 0.4070000
1.667492 0.8760000
2.039232 0.6130000
1.758823 0.8300000
...
The data consists of data blocks. Each data block has the same format, as follows:
The very first line is the header line. The header line contains the ID number and the total number of lines of each data block. For example, the first data block's ID is 1, and the total number of lines is 3. For the third data block, the ID is 3, and the total number of lines is 2. All data blocks have this header line.
Then the "real data" follows. As explained, the number of lines of "real data" is designated by the second integer of the header line.
Accordingly, the total number of lines for each data block will be number_of_lines + 1. In this example, data block 1 takes 4 lines in total, and data block 2 takes 6 lines...
This format repeats for up to 10000 data blocks in my current data, and I can provide this 10000 as an input variable to the bash or awk script. I know the total number of data blocks.
Now, what I wish to do is get the average of each of the two data columns and print it together with the data block's ID number and total number of lines. The output text will have:
ID_number number_of_lines average_of_column_1 average_of_column_2
using 5 spaces between columns, with 6 decimal places. The result will have 10000 lines, and each line will have the ID, the number of lines, the average of column 1, and the average of column 2 for each data block. The result for this example will look like
1 3 1.767484 0.823666
2 5 1.798200 0.797200
3 2 1.895235 0.735000
...
I know how to get the average of a simple data column in awk and bash; that has been answered on Stack Overflow many times. For example, I really favor using
awk '{ total += $2; count++ } END { print total/count }' data.txt
So, I wish to do this using awk or bash. But I really have no clue how to approach, or even start on, this kind of average over multiple repeating data blocks with a different number of lines in each block.
I was trying to do this in awk, following
Awk average of n data in each column
and
https://www.unix.com/shell-programming-and-scripting/135829-partial-average-column-awk.html
But I'm not sure how I can use NR or FNR to average data with a varying number of lines per data block.
You may try this awk (it uses the decimal point in the second field to tell data rows from header rows):
awk -v OFS='\t' '$2 ~ /\./ {s1 += $1; s2 += $2; next} {if (id) {print id, num, s1/num, s2/num; s1=s2=0} id=$1; num=$2} END {print id, num, s1/num, s2/num}' file
1 3 1.76748 0.823667
2 5 1.7982 0.7972
3 2 1.89524 0.735
4 4 1.80122 0.804
5 1 2.03794 0.615
6 7 1.85479 0.742714
You can use OFMT to get fixed-size decimal numbers like this:
awk -v OFMT="%.6f" -v OFS='\t' '$2 ~ /\./ {s1 += $1; s2 += $2; next} {if (id) {print id, num, s1/num, s2/num; s1=s2=0} id=$1; num=$2} END {print id, num, s1/num, s2/num}' file
1 3 1.767484 0.823667
2 5 1.798200 0.797200
3 2 1.895235 0.735000
4 4 1.801216 0.804000
5 1 2.037940 0.615000
6 7 1.854793 0.742714
An expanded form:
awk -v OFMT='%.6f' -v OFS='\t' '
$2 ~ /\./ {
    s1 += $1
    s2 += $2
    next
}
{
    if (id) {
        print id, num, s1/num, s2/num
        s1 = s2 = 0
    }
    id = $1
    num = $2
}
END {
    print id, num, s1/num, s2/num
}' file
And yet another one:
awk -v num_blocks=10000 '
BEGIN {
    OFS = "\t"
    OFMT = "%.6f"
}
num_lines == 0 {
    id = $1
    num_lines = $2
    sum1 = sum2 = 0
    next
}
lines_read < num_lines {
    sum1 += $1
    sum2 += $2
    lines_read++
}
lines_read >= num_lines {
    print id, num_lines,
          sum1 / num_lines,
          sum2 / num_lines
    num_lines = lines_read = 0
    num_blocks--
}
num_blocks <= 0 {
    exit
}' file
You could try:
awk -v qnt=none 'qnt == "none" {id = $1; qnt = $2; s1 = s2 = line = 0;next}{s1 += $1; s2 += $2; ++line} line == qnt{printf "%d %d %.6f %.6f\n", id, qnt, s1/qnt, s2/qnt; qnt="none"}'
The above is expanded as follows (note that in awk the opening brace must stay on the same line as its pattern):
qnt == "none" {
    id = $1
    qnt = $2
    s1 = s2 = line = 0
    next
}
{
    s1 += $1
    s2 += $2
    ++line
}
line == qnt {
    printf "%d %d %.6f %.6f\n", id, qnt, s1/qnt, s2/qnt
    qnt = "none"
}
After a data block is processed (or at the very beginning), record the header info. Otherwise, add to the sums, and print the result once we are done with all lines in the current block.

Iterate awk function for every unique field in column

I have written an awk script to analyse my table data - I am calculating a p-value and a log2 odds ratio. This is an example of the data table I have.
Label Value1 Value2
Label1 9 6
Label1 7 6
Label1 1 6
Label2 5 7
Label2 3 7
Label2 8 7
For every label (Label1/2) I count how many times value1 > value2 and divide this number by the total number of times the label was observed - this gives the p-value. In addition, I compare their log2 ratio.
This is my awk script.
awk '{a[$1]=$1}; ($2>=$3) {c++}; {sum+=$2}
END {print c/NR, log($3/(sum/NR))/log(2), a[$1]}'
And this is the result I get:
0.666667 0.0824622 Label1
Column 1 is the p-value, column 2 is the odds ratio, and column 3 is the label.
The problem is that I don't know how to apply this calculation to both labels - I am getting a result only for the first one.
My question is: how do I iterate such an awk function over every unique field in column 1 (Label1/2)?
I assume two lines before the first line of data, so I compare NR with 3. The program saves the previous label name ($1) and only when it changes ($1 != label) does it do the calculations and print. The other condition (NR >= 3) only accumulates data while processing the same label.
awk '
NR == 3 { label = $1 }
NR >= 3 && $1 != label {
    printf "%.6f %.6f %s\n", c/l, log( v / (sum/l) ) / log(2), label
    c = l = sum = 0
    label = $1
}
NR >= 3 {
    if ( $2 >= $3 ) { c++ }
    l++
    sum += $2
    v = $3
}
END {
    printf "%.6f %.6f %s\n", c/l, log( v / (sum/l) ) / log(2), label
}
' infile
It yields:
0.666667 0.082462 Label1
0.333333 0.392317 Label2
Another way with awk (using arrays):
awk '
NR > 1 && $2 > $3 {
    times[$1]++
}
{
    total[$1] += $2
    col3[$1] = $3
    seen[$1]++
}
END {
    for (label in times) {
        print times[label]/seen[label], log(col3[label]/(total[label]/seen[label]))/log(2), label
    }
}' inputFile
Output:
0.666667 0.0824622 Label1
0.333333 0.392317 Label2

Efficient way to calculate averages, standard deviations from a txt file

Here is a copy of what one of many txt files looks like.
Class 1:
Subject A:
posX posY posZ x(%) y(%)
0 2 0 81 72
0 2 180 63 38
-1 -2 0 79 84
-1 -2 180 85 95
. . . . .
Subject B:
posX posY posZ x(%) y(%)
0 2 0 71 73
-1 -2 0 69 88
. . . . .
Subject C:
posX posY posZ x(%) y(%)
0 2 0 86 71
-1 -2 0 81 55
. . . . .
Class 2:
Subject A:
posX posY posZ x(%) y(%)
0 2 0 81 72
-1 -2 0 79 84
. . . . .
The number of classes, subjects, row entries all vary.
Class1-Subject A always has posZ entries that have 0 alternating with 180
Calculate average of x(%), y(%) by class and by subject
Calculate standard deviation of x(%), y(%) by class and by subject
Also ignore the posZ of 180 row when calculating averages and std_deviations
I have developed an unwieldy solution in Excel (using macros and VBA) but I would rather go for a more optimal solution in Python.
numpy is very helpful but the .mean(), .std() functions only work with arrays - I am still researching it some more, as well as pandas' groupby function.
I would like the final output to look as follows (1. By Class, 2. By Subject)
1. By Class
            X    Y
Average
std_dev

2. By Subject
            X    Y
Average
std_dev
I think working with dictionaries (and a list of dictionaries) is a good way to get familiar with handling data in Python. To format your data like this, you'll want to read in your text files and define variables line by line.
To start:
outputList = []
for line in infile:
    line = line.strip()
    if line.startswith("Class"):
        temp, class_var = line.split(' ')
        class_var = class_var.replace(':', '')
    elif line.startswith("Subject"):
        temp, subject = line.split(' ')
        subject = subject.replace(':', '')
This will create variables that correspond to the current class and current subject. Then you want to read in your numeric variables. A good way to read in those values is through a try statement, which will attempt to convert them into integers.
else:
    line = line.split()    # split() handles runs of spaces between columns
    try:
        keys = ['posX', 'posY', 'posZ', 'x_perc', 'y_perc']
        values = [int(item) for item in line]
        entry = dict(zip(keys, values))
        entry['class'] = class_var
        entry['subject'] = subject
        outputList.append(entry)
    except ValueError:
        pass
This will put them into dictionary form, including the earlier-defined class and subject variables, and append them to outputList. You'll end up with this:
[{'posX': 0, 'x_perc': 81, 'posZ': 0, 'y_perc': 72, 'posY': 2, 'class': '1', 'subject': 'A'},
{'posX': 0, 'x_perc': 63, 'posZ': 180, 'y_perc': 38, 'posY': 2, 'class': '1', 'subject': 'A'}, ...]
etc.
You can then average/take the SD by subsetting the list of dictionaries (applying rules like excluding posZ == 180, etc.). Here's the averaging by class:
import numpy as np

classes = ['1', '2']
print "By Class:"
print "Class", "Avg X", "Avg Y", "X SD", "Y SD"
for class_var in classes:
    x_m = np.mean([item['x_perc'] for item in outputList if item['class'] == class_var and item['posZ'] != 180])
    y_m = np.mean([item['y_perc'] for item in outputList if item['class'] == class_var and item['posZ'] != 180])
    x_sd = np.std([item['x_perc'] for item in outputList if item['class'] == class_var and item['posZ'] != 180])
    y_sd = np.std([item['y_perc'] for item in outputList if item['class'] == class_var and item['posZ'] != 180])
    print class_var, x_m, y_m, x_sd, y_sd
You'll have to play around with the printed output to get exactly what you want, but this should get you started.
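Since the question mentions pandas' groupby, here is an alternative sketch (my addition, assuming the outputList built above): once the rows are dictionaries, a DataFrame gives both groupings in a few lines.

import pandas as pd

df = pd.DataFrame(outputList)
df = df[df['posZ'] != 180]   # drop posZ == 180 rows before aggregating

# 1. By class: mean and standard deviation of x(%) and y(%)
print(df.groupby('class')[['x_perc', 'y_perc']].agg(['mean', 'std']))

# 2. By subject within each class
print(df.groupby(['class', 'subject'])[['x_perc', 'y_perc']].agg(['mean', 'std']))

Note that pandas' std defaults to the sample standard deviation (ddof=1) while np.std defaults to the population version (ddof=0), so the two approaches will differ slightly unless you align ddof.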