obtain averages of field 2 after grouping by field 1 with awk - awk

I have a file with two fields containing numbers that I have sorted numerically based on field 1. The numbers in field 1 range from 1 to 200000 and the numbers in field 2 between 0 and 1. I want to obtain averages for both field 1 and field 2 in batches (based on rows).
Here are example input and output when specifying batches of 4 rows:
1 0.12
1 0.34
2 0.45
2 0.40
50 0.60
301 0.12
899 0.13
1003 0.14
1300 0.56
1699 0.43
2100 0.25
2500 0.56
The output would be:
1.5 0.327
563.25 0.247
1899.75 0.45

Here you go:
awk -v n=4 '{s1 += $1; s2 += $2; if (++i % n == 0) { print s1/n, s2/n; s1=s2=0; } }'
Explanation:
Initialize n=4, the size of the batches
Collect the sums: sum of 1st column in s1, the 2nd in s2
Increment counter i by 1 (default initial value is 0, no need to set it)
If i is divisible by n, then we print the averages and reset the sum variables. (Any trailing rows that do not fill a final batch of n are silently dropped.)
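If you want to sanity-check the awk output, here is a minimal pandas sketch of the same batch averaging (the file name data.txt and whitespace separation are assumptions, not part of the question):
import pandas as pd

n = 4  # batch size, matching awk -v n=4

# read the two whitespace-separated columns
df = pd.read_csv("data.txt", sep=r"\s+", header=None, names=["f1", "f2"])

# average consecutive batches of n rows; like the awk one-liner, a trailing
# partial batch is dropped
full_rows = (len(df) // n) * n
print(df.iloc[:full_rows].groupby(df.index[:full_rows] // n).mean())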

Related

How can I group a continuous column (0-1) into equal sizes? Scala spark

I have a dataframe column that I want to split into equal size buckets. The values in this column are floats between 0-1. Most of the data is skewed, so most values fall in the 0.90's and 1.
Bucket 10: All 1's (the size of this bucket will be different from 2-9 and 1)
Bucket 2-9: Any values > 0 and < 1 (equal sized)
Bucket 1: All 0's (the size of this bucket will be different from 2-9 and 10)
Example:
continuous_number_col  Bucket
0.001                  2
0.95                   9
1                      10
0                      1
This should be how it looks when I groupBy("Bucket")
Counts for buckets 1 and 10 aren't significant here; they will just be in their own buckets. And the count of 75 will be different in practice; it is just used as an example.
Bucket  Count  Values
1       1000   0
2       75     0.01 - 0.50
3       75     0.51 - 0.63
4       75     0.64 - 0.71
5       75     0.72 - 0.83
6       75     0.84 - 0.89
7       75     0.90 - 0.92
8       75     0.93 - 0.95
9       75     0.95 - 0.99
10      2000   1
I've tried using the QuantileDiscretizer() function like this:
val df = {
  rawDf
    // Taking 1s and 0s out for the moment
    .filter(col("continuous_number_col") =!= 1 && col("continuous_number_col") =!= 0)
}
val discretizer = new QuantileDiscretizer()
  .setInputCol("continuous_number_col")
  .setOutputCol("bucket_result")
  .setNumBuckets(8)
val result = discretizer.fit(df).transform(df)
However, this gives me the following, not equal buckets:
bucket_result  count
7.0            20845806
6.0            21096698
5.0            21538813
4.0            21222511
3.0            21193393
2.0            21413413
1.0            21032666
0.0            21681424
Hopefully this gives enough context to what I'm trying to do. Thanks in advance.
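The question asks for a Spark solution, but the intended bucketing scheme is perhaps easier to see in a small pandas sketch (an illustration on made-up data, not a Spark answer): pin 0 to bucket 1 and 1 to bucket 10, and cut only the strictly interior values into eight equal-frequency buckets 2-9:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# made-up stand-in for continuous_number_col: many 0s and 1s plus interior values
vals = pd.Series(np.concatenate([np.zeros(50),
                                 rng.uniform(0.01, 0.99, 600),
                                 np.ones(100)]))

bucket = pd.Series(np.nan, index=vals.index)
bucket[vals == 0] = 1            # all 0s -> bucket 1
bucket[vals == 1] = 10           # all 1s -> bucket 10

# eight equal-frequency buckets (labels 2..9) for values strictly between 0 and 1
interior = vals[(vals > 0) & (vals < 1)]
bucket.loc[interior.index] = pd.qcut(interior, q=8, labels=False) + 2

print(bucket.astype(int).value_counts().sort_index())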

how to format a large txt file to bed

I am trying to format CpG methylation calls from the R package "methylKit" to simple BED format. Since it is a large file, I cannot do it in Excel. I also tried SeqMonk, but it does not allow me to export the data in the format I want. Linux awk/sed might be a good option, but I am new to them as well. Basically, I need to trim the "chr" column, add a "stop" column, convert "F" to "+" and "R" to "-", and keep freqC with 2 decimal places. Can you please help?
From:
chrBase chr base strand coverage freqC freqT
chr1.339 chr1 339 F 7 0.00 100.00
chr1.183 chr1 183 R 4 0.00 100.00
chr1.192 chr1 192 R 6 0.00 100.00
chr1.340 chr1 340 R 5 40.00 60.00
chr1.10007 chr1 10007 F 13 53.85 46.15
chr1.10317 chr1 10317 F 8 0.00 100.00
chr1.10346 chr1 10346 F 9 88.89 11.11
chr1.10349 chr1 10349 F 9 88.89 11.11
To:
chr start stop freqc Coverage strand
1 67678 67679 0 3 -
1 67701 67702 0 3 -
1 67724 67725 0 3 -
1 67746 67747 0 3 -
1 67768 67769 0.333333 3 -
1 159446 159447 0 3 +
1 162652 162653 0 3 +
1 167767 167768 0.666667 3 +
1 167789 167790 0.666667 3 +
1 167797 167798 0 3 +
This should do what you actually want, producing a BED6 file with the methylation percentage in the score column:
$ cat foo.awk
BEGIN{OFS="\t"}
{
    if (NR>1) {
        if ($4=="F") {
            strand="+"
        } else {
            strand="-"
        }
        gsub("chr", "", $2)   # gsub returns a count, so print the modified $2 itself
        print $2,$3-1,$3,$1,$6,strand,$5
    }
}
That would then be run with:
awk -f foo.awk input.txt > output.bed
The additional column 7 is the coverage, since genome viewers will only display a single score column:
1 338 339 chr1.339 0.00 + 7
1 182 183 chr1.183 0.00 - 4
1 191 192 chr1.192 0.00 - 6
1 339 340 chr1.340 40.00 - 5
1 10006 10007 chr1.10007 53.85 + 13
1 10316 10317 chr1.10317 0.00 + 8
1 10345 10346 chr1.10346 88.89 + 9
1 10348 10349 chr1.10349 88.89 + 9
You can tweak that further as needed.
It is not entirely clear exactly what you want, since your "From" data does not correspond to what you show as your "To" results. But if what you are showing is the general format change, then from the "From" data you want to:
discard field 1,
retrieve the "chr" value from the end of field 2,
if the 4th field is "F" make it "+" else if it is "R" make it "-" otherwise leave it unchanged,
use the 3rd field as "start" and 3rd + 1 as "stop" (adjust whether to add or subtract 1 as needed to get the desired "start" and "stop" values),
print 6th field as "freqc",
output 5th field as "Coverage", and finally
output modified 4th field as "strand"
If that is your goal, then with your from data in the file named from, you can do something like the following:
awk '
BEGIN { OFS="\t"; print "chr","start","stop","freqc","Coverage","strand" }
FNR > 1 {
    match($2, /[[:digit:]]+$/, arr)
    if ($4 == "F")
        $4 = "+"
    else if ($4 == "R")
        $4 = "-"
    print arr[0], $3, $3 + 1, $6, $5, $4
}
' from
Explanation: the BEGIN rule is run before awk starts processing records (lines) from the file. Here it simply sets the output field separator (OFS) to a tab and prints the heading.
The condition (pattern) of FNR > 1 on the second rule processes the from file from the 2nd record (line) on (skipping the heading row). FNR is awk's way of saying File Record Number (even though it looks like the N and R are backwards).
match($2, /[[:digit:]]+$/, arr) captures the trailing digits of the second field into the first element of arr (i.e. arr[0]); this three-argument form of match() is a GNU awk extension. It also sets the RSTART and RLENGTH built-in variables (not needed here), which tell you which character the match starts on and how many characters it spans.
The if and else if statement does the "F" to "+" and "R" to "-" change. And, finally, the print statement just prints the modified values and unchanged fields in the order specified above.
Example Output
Running the above on your original "From" data will produce:
chr start stop freqc Coverage strand
1 339 340 0.00 7 +
1 183 184 0.00 4 -
1 192 193 0.00 6 -
1 340 341 40.00 5 -
1 10007 10008 53.85 13 +
1 10317 10318 0.00 8 +
1 10346 10347 88.89 9 +
1 10349 10350 88.89 9 +
Let me know if this is close to what you explained in your question, and if not, drop a comment below.
The GNU Awk User's Guide is a great gawk/awk reference.
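If a cross-check of the awk logic is useful, here is a minimal Python sketch of the same transformation (it assumes the whitespace-separated from file above, and strips a leading "chr" rather than matching trailing digits, which is equivalent for names like chr1):
import re

print("\t".join(["chr", "start", "stop", "freqc", "Coverage", "strand"]))
with open("from") as fh:
    next(fh)  # skip the heading row, like FNR > 1 in the awk script
    for line in fh:
        chrbase, chrom, base, strand, cov, freqc, freqt = line.split()
        chrom = re.sub(r"^chr", "", chrom)               # "chr1" -> "1"
        strand = {"F": "+", "R": "-"}.get(strand, strand)
        print("\t".join([chrom, base, str(int(base) + 1), freqc, cov, strand]))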

Get value of variable quantile per group

I have data that is categorized in groups, with a given quantile percentage per group. I want to create a threshold for each group that separates all values within the group based on the quantile percentage. So if one group has q=0.8, I want the lowest 80% of values assigned 1 and the upper 20% assigned 0.
So, given data like this:
I want objects 1, 2 and 5 to get result 1 and the other 3 result 0. In total my data consists of 7,000,000 rows with 14,000 groups. I tried doing this with groupby.quantile, but that requires a constant quantile value, whereas my data has a different one for each group.
Setup:
import numpy as np
import pandas as pd

num = 7_000_000
grp_num = 14_000

qua = np.around(np.random.uniform(size=grp_num), 2)

df = pd.DataFrame({
    "Group": np.random.randint(low=0, high=grp_num, size=num),
    "Quantile": 0.0,
    "Value": np.random.randint(low=100, high=300, size=num)
}).sort_values("Group").reset_index(0, drop=True)

def func(grp):
    grp["Quantile"] = qua[grp.Group]
    return grp

df = df.groupby("Group").apply(func)
Answer: (This is basically a for loop, so for performance you can try to apply numba to this)
def func2(grp):
    return grp.Value < grp.Value.quantile(grp.Quantile.iloc[0])

df["result"] = df.groupby("Group").apply(func2).reset_index(0, drop=True)
print(df)
Outputs:
Group Quantile Value result
0 0 0.33 156 1
1 0 0.33 259 0
2 0 0.33 166 1
3 0 0.33 183 0
4 0 0.33 111 1
... ... ... ... ...
6999995 13999 0.83 194 1
6999996 13999 0.83 227 1
6999997 13999 0.83 215 1
6999998 13999 0.83 103 1
6999999 13999 0.83 115 1
[7000000 rows x 4 columns]
CPU times: user 14.2 s, sys: 362 ms, total: 14.6 s
Wall time: 14.7 s
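The apply above builds a full boolean Series per group; a somewhat faster variant (a sketch, assuming the same df and qua from the setup) is to compute one scalar threshold per group and broadcast it back onto the rows:
# one quantile call per group (14,000 scalars instead of 7,000,000 booleans built in apply)
thresholds = (
    df.groupby("Group")
      .apply(lambda g: g["Value"].quantile(g["Quantile"].iloc[0]))
      .rename("threshold")
)

# map each row's group to its threshold and compare once, vectorized
df["result"] = (df["Value"] < df["Group"].map(thresholds)).astype(int)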

Select every nth row as a Pandas DataFrame without reading the entire file

I am reading a large file that contains ~9.5 million rows x 16 cols.
I am interested in retrieving a representative sample, and since the data is organized by time, I want to do this by selecting every 500th element.
I am able to load the data, and then select every 500th row.
My question: Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?
Question 2: How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
Here is a snippet of what the data looks like (first five rows). The first 4 rows are out of order, but the remaining dataset looks ordered (by time):
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance RatecodeID store_and_fwd_flag PULocationID DOLocationID payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 1 2017-01-09 11:13:28 2017-01-09 11:25:45 1 3.30 1 N 263 161 1 12.5 0.0 0.5 2.00 0.00 0.3 15.30
1 1 2017-01-09 11:32:27 2017-01-09 11:36:01 1 0.90 1 N 186 234 1 5.0 0.0 0.5 1.45 0.00 0.3 7.25
2 1 2017-01-09 11:38:20 2017-01-09 11:42:05 1 1.10 1 N 164 161 1 5.5 0.0 0.5 1.00 0.00 0.3 7.30
3 1 2017-01-09 11:52:13 2017-01-09 11:57:36 1 1.10 1 N 236 75 1 6.0 0.0 0.5 1.70 0.00 0.3 8.50
4 2 2017-01-01 00:00:00 2017-01-01 00:00:00 1 0.02 2 N 249 234 2 52.0 0.0 0.5 0.00 0.00 0.3 52.80
Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?
Something you could do is to use the skiprows parameter in read_csv, which accepts a list-like argument of row indices to skip (and thus also lets you select). So you could create a np.arange with a length equal to the number of rows in the file and remove every 500th element from it using np.delete, so that only every 500th row is read:
import numpy as np
import pandas as pd

n_rows = 9_500_000   # total number of rows in the file (np.delete needs integer indices)
skip = np.arange(n_rows)
skip = np.delete(skip, np.arange(0, n_rows, 500))
df = pd.read_csv('my_file.csv', skiprows=skip)
Can I immediately read every 500th element (using pd.read_csv() or some other method), without having to read first and then filter my data?
First get the length of the file with a custom function, remove every 500th row with numpy.setdiff1d, and pass the result to the skiprows parameter of read_csv:
import numpy as np
import pandas as pd

# https://stackoverflow.com/q/845058
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

len_of_file = file_len('test.csv')
print(len_of_file)

skipped = np.setdiff1d(np.arange(len_of_file), np.arange(0, len_of_file, 500))
print(skipped)

df = pd.read_csv('test.csv', skiprows=skipped)
How would you approach this problem if the date column was not ordered? At the moment, I am assuming it's ordered by date, but all data is prone to errors.
The idea is to read only the datetime column via the usecols parameter, then sort it and select every 500th index value, take the difference, and pass that to the skiprows parameter again:
import numpy as np
import pandas as pd

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

len_of_file = file_len('test.csv')

df1 = pd.read_csv('test.csv',
                  usecols=['tpep_pickup_datetime'],
                  parse_dates=['tpep_pickup_datetime'])

# original row positions of every 500th value in time-sorted order
sorted_idx = (df1['tpep_pickup_datetime'].sort_values()
                 .iloc[np.arange(0, len(df1), 500)].index)

# data row i is file line i + 1 (line 0 is the header), so shift by one
# and never skip the header line itself
skipped = np.setdiff1d(np.arange(1, len_of_file), sorted_idx + 1)
print(skipped)

df = pd.read_csv('test.csv', skiprows=skipped).sort_values(by=['tpep_pickup_datetime'])
use a lambda with skiprows:
pd.read_csv(path, skiprows=lambda i: i % N)
to keep only every Nth row: the callable receives each row index and skips the row whenever it returns a truthy value, so only indices divisible by N (including the header at index 0) are read.
source: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
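For example (a small usage sketch; the file name test.csv is an assumption), to keep the header row and every 500th row after it:
import pandas as pd

N = 500
# the callable gets each row index; returning True skips that row, so only
# indices divisible by N (including 0, the header line) are kept
df = pd.read_csv('test.csv', skiprows=lambda i: i % N != 0)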
You can use the csv module, which returns an iterator, and itertools.cycle to select every nth row.
import csv
from itertools import cycle

source_file = 'D:/a.txt'
cycle_size = 500

# yields True on the first row and then once every cycle_size rows
chooser = (x == 0 for x in cycle(range(cycle_size)))

with open(source_file) as f1:
    rdr = csv.reader(f1)
    data = [row for pick, row in zip(chooser, rdr) if pick]
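If the end goal is still a pandas DataFrame, the selected rows can then be handed to pandas (a small follow-up sketch; it assumes the file's first line is a header, which the generator above happens to keep because its index is 0):
import pandas as pd

# first kept row is the header line; the rest are every 500th data row
df = pd.DataFrame(data[1:], columns=data[0])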

Sum a column in awk and print each line of said column divided by the sum

I have a column of counts that I want to transform into frequencies. This sums the first column and prints the sum:
awk '{ sum+=$1 } END {print sum}'
However, I can't quite figure out how to divide each line of the file by the sum (in one-liner format). Bonus points for being able to specify the number of decimals.
Example input:
0
0
0
0
1
0
5
0
Output should be:
0
0
0
0
0.166666667
0
0.833333333
0
And preferably I could also adjust the number of decimals, e.g.:
0.00
0.00
0.00
0.00
0.16
0.00
0.83
0.00
Scan the input twice. On the first pass, NR==FNR is true only while reading the first copy of the file, so the sum is accumulated and next skips the second block; on the second pass, each value is printed divided by the sum, with the printf format controlling the number of decimals:
awk 'NR==FNR{sum+= $1; next}{printf("%4.2f\n", $1/sum)}' input input
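If a cross-check helps, the same idea in Python reads the values once and keeps them in memory in place of awk's second pass (a minimal sketch, assuming one count per line in a file named input):
# read the counts, then print each one divided by the total
with open("input") as fh:
    values = [float(line.split()[0]) for line in fh if line.strip()]

total = sum(values)
for v in values:
    print(f"{v / total:.2f}")  # change .2f to adjust the number of decimals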