How to replace data in pandas by using values in dict?

I have a series which contains several numbers. I want to replace them with other, string-type data by using dictionary values, but I don't know how to do that...
GDP_group['GdpForYearPer$1M'].head(6)
0    46.919625
1    47.515189
2    47.737955
3    54.832578
4    56.338028
5    63.101272
This is the dict that I made to replace data.
range_GDP = {'$0 ~ $100M':   np.arange(0, 100),
             '$100M ~ $1B':  np.arange(100.0000001, 1000),
             '$1B ~ $10B':   np.arange(1000.000001, 10000),
             '$10B ~ $100B': np.arange(10000.000001, 100000),
             '$100B ~ $1T':  np.arange(100000.000001, 1000000),
             '$1T ~':        np.arange(1000000.000001, 20000000)}

You can use pd.cut to segment your data into ranges and apply labels.
First, (re)generate dummy data sampled uniformly in log space:
import numpy as np
import pandas as pd
GdpForYearPer1M = pd.Series(10**np.random.randint(0, 8, 100))
"""
0 1
1 1000
2 100
3 10
4 100
...
95 1000000
96 100
97 100000
98 10000
99 10
"""
Solution:
# generate "cuts" (bins) and associated labels from `range_GDP`
# (relies on the dict preserving its ascending insertion order, as in Python 3.7+)
cut_data = [(np.min(v), k) for k, v in range_GDP.items()]
bins, labels = zip(*cut_data)
# bins must contain one more value than labels
bins = list(bins) + [np.inf]
pd.cut(GdpForYearPer1M, bins=bins, labels=labels)
Output:
0 $0 ~ $100M
1 $100M ~ $1B
2 $0 ~ $100M
3 $0 ~ $100M
4 $0 ~ $100M
...
95 $100B ~ $1T
96 $0 ~ $100M
97 $10B ~ $100B
98 $1B ~ $10B
99 $0 ~ $100M
Length: 100, dtype: category
Categories (6, object): [$0 ~ $100M < $100M ~ $1B < $1B ~ $10B < $10B ~ $100B < $100B ~ $1T < $1T ~]

Compare two numerical ranges in two distinct files with awk and print ALL lines from file1 and the matching ones from file2

This new question is a follow-up to a recent question: Compare two numerical ranges in two distinct files with awk. The proposed solution worked perfectly, but it was not practical for downstream analysis (due to a misconception in my question, not a problem with the solution).
I have a file1 with 3 columns. Columns 2 and 3 define a numerical range. Data are sorted from the smallest to the biggest value in column 2. Numerical ranges never overlap.
file1
S 24 96
S 126 352
S 385 465
S 548 600
S 621 707
S 724 736
I have a second file2 (test) structured similarly.
file2
S 27 93
S 123 348
S 542 584
S 726 740
S 1014 2540
S 12652 12987
Desired output: print ALL lines from file1 and, next to them, the lines of file2 whose numerical ranges overlap (even partially) with the ones of file1. If no range from file2 overlaps a range of file1, print a zero next to the range of file1.
S 24 96 S 27 93 * 27-93 overlaps with 24-96
S 126 352 S 123 348 * 123-348 overlaps with 126-352
S 385 465 0 * nothing in file2 overlaps with this range
S 548 600 S 542 584 * 542-584 overlaps with 548-600
S 621 707 0 * nothing in file2 overlaps with this range
S 724 736 S 726 740 * 726-740 overlaps with 724-736
Based on the answer to the previous question from @EdMorton, I modified the print command of the tst.awk script to add these new features. I also changed the file order from file1/file2 to file2/file1 so that all lines from file1 are printed (whether or not there is a match in the second file):
NR == FNR {
    begs2ends[$2] = $3
    next
}
{
    for (beg in begs2ends) {
        end = begs2ends[beg] + 0
        beg += 0
        if ( ( ($2 >= beg) && ($2 <= end) ) ||
             ( ($3 >= beg) && ($3 <= end) ) ||
             ( ($2 <= beg) && ($3 >= end) ) ) {
            print $0,"\t",$1,"\t",beg,"\t",end
        else
            print $0,"\t","0"
        next
    }
}
}
Note: $1 is identical in file1 and file2, which is why I used print ... $1 to make it appear. I have no idea how to print it from file2 rather than file1 (if I understand correctly, this $1 refers to file1).
And I launch the analysis with awk -f tst.awk file2 file1.
The script does not accept the else clause and I don't understand why. I assume it is linked to the looping, but I tried several changes without any success.
Thanks if you can help me with this.
Assumptions:
a range from file1 can only overlap with one range from file2
The current code is almost correct; the else fails because it is attached to an if whose block is never closed before it. It just needs some work on the placement of the braces (using some consistent indentation helps):
awk '
BEGIN { OFS="\t" }      # output field delimiter is "\t"
NR == FNR { begs2ends[$2] = $3; next }
{
    # $1=$1            # uncomment to have the current line ($0) reformatted with "\t" delimiters during print
    for (beg in begs2ends) {
        end = begs2ends[beg] + 0
        beg += 0
        if ( ( ($2 >= beg) && ($2 <= end) ) ||
             ( ($3 >= beg) && ($3 <= end) ) ||
             ( ($2 <= beg) && ($3 >= end) ) ) {
            print $0, $1, beg, end    # spacing within $0 unchanged, 3 new fields prefaced with "\t"
            next
        }
    }
    # if we get this far we have exhausted the "for" loop
    # (ie, found no overlaps), so print the current line + "0"
    print $0, "0"                     # spacing within $0 unchanged, 1 new field prefaced with "\t"
}
' file2 file1
This generates:
S 24 96 S 27 93
S 126 352 S 123 348
S 385 465 0
S 548 600 S 542 584
S 621 707 0
S 724 736 S 726 740
With the $1=$1 line uncommented the output becomes:
S 24 96 S 27 93
S 126 352 S 123 348
S 385 465 0
S 548 600 S 542 584
S 621 707 0
S 724 736 S 726 740
S 900 1000 S 901 905
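As a side note, the answer above keeps the stated assumption: it prints on the first overlap found and moves on (next), so at most one file2 range is reported per file1 line. If several file2 ranges can overlap the same file1 range, here is a sketch of a variant (mine, not part of the original answer) that accumulates every hit before printing:
awk '
BEGIN { OFS="\t" }
NR == FNR { begs2ends[$2] = $3; next }
{
    hits = ""
    for (beg in begs2ends) {
        end = begs2ends[beg] + 0
        beg += 0
        if ( ( ($2 >= beg) && ($2 <= end) ) ||
             ( ($3 >= beg) && ($3 <= end) ) ||
             ( ($2 <= beg) && ($3 >= end) ) )
            hits = hits OFS $1 OFS beg OFS end   # append every overlapping range
    }
    if (hits == "") print $0, "0"                # no overlaps at all
    else            print $0 hits                # hits already starts with OFS
}
' file2 file1
Each overlapping range is appended as three extra tab-separated fields on the same output line.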
A slight variation on @markp-fuso's answer.
It works with GNU awk; save it as overlaps.awk:
BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" }
function in_range(val, min, max) { return min <= val && val <= max }
NR == FNR {
    line[FNR] = $0
    lo[FNR] = $2
    hi[FNR] = $3
    next
}
{
    overlap = "0"
    for (i in line) {
        if (in_range(lo[i], $2, $3) || in_range(hi[i], $2, $3)) {
            overlap = line[i]
            delete line[i]
            break
        }
    }
    print $0, overlap
}
Then
gawk -f overlaps.awk file2 file1 | column -t
outputs
S 24 96 S 27 93
S 126 352 S 123 348
S 385 465 0
S 548 600 S 542 584
S 621 707 0
S 724 736 S 726 740
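One caveat (my observation, not part of the original answer): in_range is only applied to file2's endpoints, so a file2 range that strictly contains a file1 range would not be reported. Testing one of the file1 endpoints against the file2 range as well covers that case; a sketch of the script with that single extra test:
BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" }
function in_range(val, min, max) { return min <= val && val <= max }
NR == FNR {
    line[FNR] = $0
    lo[FNR] = $2
    hi[FNR] = $3
    next
}
{
    overlap = "0"
    for (i in line) {
        if (in_range(lo[i], $2, $3) || in_range(hi[i], $2, $3) ||
            in_range($2, lo[i], hi[i])) {   # file2 range fully contains the file1 range
            overlap = line[i]
            delete line[i]
            break
        }
    }
    print $0, overlap
}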
$ cat tst.awk
BEGIN { OFS="\t" }
NR == FNR {
    ranges[++numRanges] = $0
    next
}
{
    overlapped = 0
    for ( i=1; i<=numRanges; i++ ) {
        range = ranges[i]
        split(range,vals)
        beg = vals[2]+0
        end = vals[3]+0
        if ( ( ($2 >= beg) && ($2 <= end) ) ||
             ( ($3 >= beg) && ($3 <= end) ) ||
             ( ($2 <= beg) && ($3 >= end) ) ) {
            overlapped = 1
            break
        }
    }
    if ( overlapped ) {
        print $0, range, sprintf("* %d-%d overlaps with %d-%d", beg, end, $2, $3)
    }
    else {
        print $0, 0, sprintf("* nothing in %s overlaps with this range", ARGV[1])
    }
}
$ awk -f tst.awk file2 file1 | column -s$'\t' -t
S 24 96 S 27 93 * 27-93 overlaps with 24-96
S 126 352 S 123 348 * 123-348 overlaps with 126-352
S 385 465 0 * nothing in file2 overlaps with this range
S 548 600 S 542 584 * 542-584 overlaps with 548-600
S 621 707 0 * nothing in file2 overlaps with this range
S 724 736 S 726 740 * 726-740 overlaps with 724-736

rearrange from specific string into respective column

I'm trying to rearrange specific strings into their respective columns.
e.g.:
126N (will be sorted into the "Normal" column)
Value 1 (the integer 1 will be concatenated with 126)
Resulting in:
N=Normal
126 # 1
Here is the input
(N=Normal, W=Weak)
Value 1
126N,
Value 3
18N,
Value 4
559N, 562N, 564N,
Value 6
553W, 565A, 553N,
Value 5
490W,
Value 9
564N,
And the output should be
W=Weak
490 # 5
553 # 6
A=Absolute
565 # 6
N=Normal
126 # 1
18 # 3
559 # 4
562 # 4
564 # 4
553 # 6
564 # 9
Let me know your thoughts on this.
I've tried this script, but I'm still figuring out how to concatenate the value:
cat input.txt | sed '/^\s*$/d' | awk 'BEGIN{RS=","};match($0,/N/){print $3"c"$2}' | sed ':a;N;$!ba;s/\n/;/g' | sed 's/W//g;s/N//g;s/S//g'
And some of the output is missing.
This should give you what you want using GNU awk.
It will work with any number of letters, not just A, N and W:
awk -F, '
!/Value/ {
    for (i=1; i<NF; i++) {
        hd = substr($i, length($i), 1)         # last character is the category letter
        arr[hd][++cnt[hd]] = ($i+0 " # " f)    # numeric prefix, then " # ", then the current value
    }
}
{ split($0, b, " "); f = b[2] }                # on "Value N" lines this captures N for the lines that follow
                                               # (it also runs harmlessly on data lines)
END {
    for (i in arr) {
        print "\n" i "\n---"
        for (j in arr[i])
            print arr[i][j]
    }
}' file
A
---
565 # 6
N
---
562 # 4
564 # 4
553 # 6
564 # 9
126 # 1
18 # 3
559 # 4
W
---
553 # 6
490 # 5
Another alternative in awk would be:
awk -F',| ' '
$1 == "Value" { value = $2; next }
{
    for (i=1; i<=NF; i++) {
        if ($i ~ "N$")
            N[substr($i, 1, length($i) - 1)] = value
        if ($i ~ "W$")
            W[substr($i, 1, length($i) - 1)] = value
    }
}
END {
    print "W=Weak"
    for (i in W)
        print i, "#", W[i]
    print "\nN=Normal"
    for (i in N)
        print i, "#", N[i]
}
' file
(note: this relies on knowing that the wanted headers are W=Weak and N=Normal. It would take a few additional expressions if the headers are subject to change; see the sketch after the output below.)
Output
$ awk -F',| ' '
> $1 == "Value" {value = $2; next}
> { for (i=1; i<=NF; i++) {
> if ($i~"N$")
> N[substr($i, 1, length($i) - 1)] = value
> if ($i~"W$")
> W[substr($i, 1, length($i) - 1)] = value
> }
> }
> END {
> print "W=Weak"
> for (i in W)
> print i, "#", W[i]
> print "\nN=Normal"
> for (i in N)
> print i, "#", N[i]
> }
> ' file
W=Weak
490 # 5
N=Normal
18 # 3
126 # 1
559 # 4
562 # 4
564 # 9
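If the headers are subject to change, note that the sample input carries the mapping on its first line: (N=Normal, W=Weak). Here is a sketch (my own, assuming the header always uses that key=value form; letters missing from the header, such as the A in 565A, fall back to a ? placeholder) that builds the header table from the file itself:
awk -F',| ' '
NR == 1 {                              # header line like "(N=Normal, W=Weak)"
    gsub(/[()]/, "")
    for (i = 1; i <= NF; i++)
        if (split($i, kv, "=") == 2)
            name[kv[1]] = kv[2]        # e.g. name["N"] = "Normal"
    next
}
$1 == "Value" { value = $2; next }
{
    for (i = 1; i <= NF; i++)
        if ($i != "") {
            abbr = substr($i, length($i), 1)
            vals[abbr] = vals[abbr] substr($i, 1, length($i) - 1) " # " value ORS
        }
}
END {
    for (abbr in vals)
        print abbr "=" (abbr in name ? name[abbr] : "?") ORS vals[abbr]
}
' input.txt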
$ cat tst.awk
NR%2 { val = $NF; next }
{
    for (i=1; i<=NF; i++) {
        num = $i+0
        abbr = $i
        gsub(/[^[:alpha:]]/,"",abbr)
        list[abbr] = list[abbr] num " # " val ORS
    }
}
END {
    n = split("Weak Absolute Normal",types)
    for (i=1; i<=n; i++) {
        name = types[i]
        abbr = substr(name,1,1)
        print abbr "=" name ORS list[abbr]
    }
}
$ awk -f tst.awk file
W=Weak
553 # 6
490 # 5
A=Absolute
565 # 6
N=Normal
126 # 1
18 # 3
559 # 4
562 # 4
564 # 4
553 # 6
564 # 9

AWK: Average of each row from different measurement series

My objective is to calculate the average of the second column across multiple measurement series (the average of the first row of the K blocks, the average of the second row of the K blocks, etc.). All data are contained in one file and are separated into blocks with a blank line. The file has the following structure:
#
#
33 -0.23
34.5 -0.32
36 -0.4
.
.
.
#
#
33 -0.25
34.5 -0.31
36 -0.38
.
.
.
$ cat avg.awk
BEGIN { FS=" " }
/^#/ { next }    # skip comment lines
/^\s*$/ { print col1/nr " " col2/nr; col1=col2=nr=0; next }    # blank line: print the block averages and reset
{ col1 += $1; col2 += $2; nr++ }
END { print col1/nr " " col2/nr }    # flush the last block
with input:
$ cat test.txt
#
#
33 -0.23
34.5 -0.32
36 -0.4

#
#
33 -0.25
34.5 -0.31
36 -0.38
gives as a result:
$ awk -f avg.awk test.txt
34.5 -0.316667
34.5 -0.313333
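Note that avg.awk prints the column averages within each block. If the goal is literally the average of the i-th row across the K blocks, a row-indexed variant would do it; this is a sketch (the name avg_rows.awk and the assumption that every block lists the same first-column values in the same order are mine):
$ cat avg_rows.awk
/^#/             { next }             # skip comment lines
/^[[:space:]]*$/ { row = 0; next }    # blank line: next block, restart the row counter
{
    ++row
    x[row] = $1                       # first column assumed identical across blocks
    sum[row] += $2
    cnt[row]++
    if (row > maxrow) maxrow = row
}
END {
    for (r = 1; r <= maxrow; r++)
        print x[r], sum[r] / cnt[r]
}
For test.txt above this prints 33 -0.24, 34.5 -0.315 and 36 -0.39.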

Selecting one output when regex matches more than once

Given an input file containing multiple rows matching my target:
...
100 100 100 100
Expression: out1
200 200 200 200
300 300 300 300
Expression: out2
400 400 400 400
500 500 500 500
Expression: out3
...
If I do
awk '/Expression:/ {printf " %s ",$2 } ' $file
I get multiple outputs
out1 out2 out3
How could I choose only one of the outputs based on the position at which it is found in the file, for instance out3?
If n rows match the regex, how do you select the last one?
awk '/Expression:/ { last_found = $2 }END{print last_found }' file
To display the nth match of the given regex:
awk '/Expression:/ { if(++i==3){ print $2; exit } }' file
Input
akshay@db-3325:~$ cat f
...
100 100 100 100
Expression: out1
200 200 200 200
300 300 300 300
Expression: out2
400 400 400 400
500 500 500 500
Expression: out3
...
Output
# For the 3rd one
akshay@db-3325:~$ awk '/Expression:/ { if(++i==3){ print $2; exit } }' f
out3
# For the 2nd one
akshay@db-3325:~$ awk '/Expression:/ { if(++i==2){ print $2; exit } }' f
out2
# For the last one
akshay@db-3325:~$ awk '/Expression:/ { last_found = $2 }END{print last_found }' f
out3
Simply count the occurrences and print the value once the counter reaches the threshold, exiting so nothing further is printed:
awk '/Expression/{c+=1; s=$2} c==3{print s; exit}' file

awk setting variables to make a range

I have the following two files:
File 1:
1 290 rs1345
2 450 rs5313
1 1120 rs4523
2 790 rs4325
File 2:
1 201 LDLR
2 714 APOA5
1 818 NOTCH5
1 514 TTN
I wish to isolate only the rows in file2 in which the second field is within 100 units of the second field in file1 (when the first fields match):
Desired output (note: the fourth field comes from the matching line in file1):
1 201 LDLR rs1345
2 714 APOA5 rs4325
I tried using the following code:
for i in {1..4} #there are 4 lines in file2
do
chr=$(awk 'NR=="'${i}'" { print $1 }' file2)
pos=$(awk 'NR=="'${i}'" { print $2 }' file2)
gene=$(awk 'NR=="'${i}'" { print $3 }' file2)
start=$(echo $pos | awk '{print $1-100}') #start and end variables for 100 unit range
end=$(echo $pos | awk '{print $1+100}')
awk '{if ($1=="'$chr'" && $2 > "'$start'" && $2 < "'$end'") print "'$chr'","'$pos'","'$gene'"$3}' file1
done
The code is not working. I believe something is wrong with my start and end variables, because when I echo $start I get 414, which doesn't make sense to me, and I get 614 when I echo $end.
I understand this question might be difficult to follow, so please ask me if any clarification is necessary.
Thank you.
The difficulty is that $1 is not a unique key, so some care needs to be taken with the data structure used to store the contents of file1.
With GNU awk, you can use arrays of arrays:
gawk '
NR==FNR { f1[$1][$2] = $3; next }
$1 in f1 {
    for (val in f1[$1])
        if (val-100 <= $2 && $2 <= val+100)
            print $0, f1[$1][val]
}
' file1 file2
Otherwise, you have to use a one-dimensional array and pack the two pieces of information into the key:
awk '
NR==FNR { f1[$1,$2] = $3; next }
{
    for (key in f1) {
        split(key, a, SUBSEP)
        if (a[1] == $1 && a[2]-100 <= $2 && $2 <= a[2]+100)
            print $0, f1[key]
    }
}
' file1 file2
That works with mawk and nawk (and gawk).
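A middle ground (a sketch of mine, not from the answers above) that also works in mawk and nawk: key a one-dimensional array by $1 alone and append position:name pairs to its value, so each file2 line only scans the entries whose first field matches:
awk '
NR==FNR { f1[$1] = f1[$1] " " $2 ":" $3; next }   # e.g. f1["1"] = " 290:rs1345 1120:rs4523"
$1 in f1 {
    n = split(f1[$1], pairs)                      # split the pair list on whitespace
    for (i = 1; i <= n; i++) {
        split(pairs[i], pn, ":")                  # pn[1] = position, pn[2] = name
        if (pn[1] - 100 <= $2 && $2 <= pn[1] + 100)
            print $0, pn[2]
    }
}
' file1 file2
For the sample files this prints 1 201 LDLR rs1345 and 2 714 APOA5 rs4325.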
#!/usr/bin/env python3
import pandas as pd
from io import StringIO

file1 = """
1 290 rs1345
2 450 rs5313
1 1120 rs4523
2 790 rs4325
"""

file2 = """
1 201 LDLR
2 714 APOA5
1 818 NOTCH5
1 514 TTN
"""

df1 = pd.read_csv(StringIO(file1), sep=" ", header=None, names=["a", "b", "c"])
df2 = pd.read_csv(StringIO(file2), sep=" ", header=None, names=["a", "b", "c"])

df = pd.merge(df2, df1, on="a", how="outer")
# query is intuitive
r = df.query("b_y - 100 < b_x < b_y + 100")
print(r[["a", "b_x", "c_x", "c_y"]])
Output:
a b_x c_x c_y
0 1 201 LDLR rs1345
7 2 714 APOA5 rs4325
pandas is the right tool for this kind of tabular data manipulation.