R adding dynamic reference line to ggplot - ggplot2

I tried to use the solution mentioned in this question: Add vline to existing plot and have it appear in ggplot2 legend?
However, part of the code doesn't work for my case; in particular, the vline part didn't work. I'm wondering if there are other tricks?
I have two data sets; the first one is the main dataset:
str(sf)
$ date : chr "2020-02-02" "2020-02-03" "2020-02-04" "2020-02-05" ...
$ School_Closure_Date: chr "2020-03-16" "2020-03-16" "2020-03-16" "2020-03-16" ...
$ variable : Factor w/ 5 levels "cases","deaths",..: 1 1 1 1 1 1 1 1 1 1 ...
$ value : num 2 2 2 2 2 2 2 2 2 2 ...
$ mindate : Date, format: "2020-01-24" "2020-01-24" …
$ maxdate : Date, format: "2020-03-30" "2020-03-30" ...
$ closure : logi FALSE FALSE FALSE FALSE FALSE FALSE …
str(sfs)
'data.frame': 1 obs. of 4 variables:
$ fips: int 6075
$ City: chr "San_Francisco"
$ date: chr "2020-03-16"
$ type: chr "school closure"
My purpose is to add a vline to the ggplot. However, I got error messages:
y <- ggplot(sf,aes(date,value)) + geom_line(aes(group=variable,color=variable))
x <- geom_vline(data=sfs,xintercept=date,show.legend=TRUE)
y+x
Error in UseMethod("rescale") :
no applicable method for 'rescale' applied to an object of class "factor"
In addition: Warning message:
Removed 33 rows containing missing values (geom_path).
The y part works, but after adding part x, the figure disappeared. Any thoughts? Thanks for your help!
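One likely cause, hedged: both date columns are character strings, so ggplot2 treats the x values as discrete (factor-like) and geom_vline cannot rescale a factor intercept. A minimal sketch of a fix, assuming the strings parse cleanly with as.Date:
library(ggplot2)
# Convert the character dates to Date so the x-axis is continuous
sf$date  <- as.Date(sf$date)
sfs$date <- as.Date(sfs$date)
# Map the closure date to xintercept inside aes() so the line gets a legend entry
ggplot(sf, aes(date, value)) +
  geom_line(aes(group = variable, color = variable)) +
  geom_vline(data = sfs, aes(xintercept = date, linetype = type), show.legend = TRUE)
Mapping linetype = type is what puts "school closure" into the legend; a bare xintercept outside aes() would not appear there.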

Related

How to replace a value to another value in a specific column on a gzipped file using awk?

I have a compressed file (.gz). The file has approximately 7,000,000 rows, and the first few lines look like this:
CHROM POS ID REF ALT A1 TEST OBS_CT BETA SE T_STAT P
1 54712 1:54712 TTTTC T ADD 1460 0.00428077 0.0561095 0.0762931 0.939196
1 825069 rs4475692 G C G ADD 1460 -0.000411661 0.0413083 -0.00996558 0.99205
1 825410 rs13303179 G A G ADD 1460 0.00489633 0.041967 0.116671 0.907137
The end of the file has X in the first column
X 154929637 rs35185538:154929637:CT:C CT C C ADD 1460 0.0787708 0.0396199 1.98816 0.0469823
X 154929952 rs4012982:154929952:CAA:C CAA C C ADD 1460 0.0265508 0.0522027 0.50861 0.611104
X 154930230 rs781880:154930230:A:G A G G ADD 1460 0.0827871 0.0356246 2.32387 0.0202707
I want to replace the X (only the X) with 23 and preserve the header. I have tried, to no avail.
gunzip -c file.gz | awk 'NR==1{gsub(/\X/,"23",$1)} 1' > out.txt
Any help will be appreciated.
Avni.
You could check for X in the first column and also check that the row number is greater than 1, so the header is preserved.
Then you can replace the X at the start of the line, matched with ^X, with 23.
gunzip -c file.gz | awk 'NR > 1 && $1=="X" {sub(/^X/,"23")} 1' > out.txt
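If the result should stay compressed, one variant (the output name is just illustrative) pipes the result straight back into gzip:
gunzip -c file.gz | awk 'NR > 1 && $1=="X" {sub(/^X/,"23")} 1' | gzip > out.txt.gz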

Error in dimnames(x) <- dn : length of 'dimnames' [2] not equal to array extent error using "sqlSave"

I'm trying to use the sqlSave command to import an R data frame into a SQL database. Below is my code:
> head(final_series)
Price Time FactorID CountryID id
1 5.363334e+01 1980-01-01 1 1 1
2 5.143333e+01 1980-04-01 1 1 16384
3 5.060000e+01 1980-07-01 1 1 32767
4 5.250000e+01 1980-10-01 1 1 49150
5 5.266667e+01 1981-01-01 1 1 65533
6 5.280000e+01 1981-04-01 1 1 81916
> sqlSave(dbhandle, final_series, tablename = "db_time_price", varTypes = c(id="uniqueidentifier", FactorID= "float", CountryID="float", Time="date", Price="float"), append=TRUE, verbose = T, fast = F)
But I got the following error:
Error in dimnames(x) <- dn :
length of 'dimnames' [2] not equal to array extent
Does anyone know why? Thanks!
Did you check whether the table already exists? If the table already exists but with different dimensions, you would see this error.
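A hedged way to check with RODBC (the comparison itself is illustrative; the table and column names come from the post) is to compare the existing table's columns with the data frame before appending, and to recreate the table if the layouts differ:
library(RODBC)
# Columns the target table already has, from the ODBC metadata
existing <- sqlColumns(dbhandle, "db_time_price")$COLUMN_NAME
# Columns in the data frame that the table lacks (ideally empty)
setdiff(names(final_series), existing)
# If the layouts really do differ, drop and recreate instead of appending:
# sqlDrop(dbhandle, "db_time_price")
# sqlSave(dbhandle, final_series, tablename = "db_time_price", append = FALSE)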

How to get CPU(idle) value in the vmstat result

I am trying to get the value of 'id' in the vmstat result.
However, I found that the position of the 'id' column differs between platforms such as Linux/AIX/HP...
## Linux
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 35268 117568 158244 1849104 0 0 3 11321 5 2 9 15 73 3 0
So, I think I should find the string 'id', get its position, and then get the value at that position in the next row.
How can I do that with awk script?
This one-liner does what you want:
awk '{for(i=NF;i>0;i--)if($i=="id"){x=i;break}}END{print $x}'
It first finds the index of the id column (in the header), then prints the corresponding column from the last line.
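For example, feeding live vmstat output straight in (the interval/count arguments are just one way to invoke it):
vmstat 1 2 | awk '{for(i=NF;i>0;i--)if($i=="id"){x=i;break}}END{print $x}'
The header line sets x, and the END block then prints that field from the last sample, wherever the platform happens to put the id column.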

Gnuplot - Iteration with two commands

I'm trying to build a sort of bar-chart using a simple data file (.example) containing only 0s or 1s. Here is the data contained in .example:
dest P1 P2 P3 P4 P5 NA
D1 0 1 1 0 0 0
D2 0 0 1 0 0 0
D3 0 1 0 1 0 0
""
GPV 1 1 1 1 1 1
and here is the code I'm using:
set style histogram rowstacked title textcolor lt -1
set datafile missing 'nan'
set style data histograms
plot '.example' using ( $2==0 ? 1 : 0 ) ls 17 title 'NA', \
'' using ( $2==1 ? 1 : 0 ) ls 1, \
for [i=3:5] '.example' using ( column(i)==0 ? 1 : 0) ls 17 notitle, \
for [i=3:5] '' using ( column(i)==1 ? 1 : 0) ls i-1
where the last two commands iterate over a potentially large number of columns, stacking white or colored boxes depending on the value of column(i). To keep the same color order among different columns in the histogram, I would need to merge the two iterations into a single one with two commands.
Is it possible? Any suggestion on how to do that?
You can use nested loops, which I think is what you want to achieve: an outer loop iterating over your large number of columns and an inner loop iterating over the two options (white vs. colored), for [i=3:5] for [j=0:1]. Then tell gnuplot to ignore the column if its content doesn't match the value of j, either with 1/0 or with the trick, valid for histograms, of setting it to 0, as you're already doing:
set style histogram rowstacked title textcolor lt -1
set datafile missing 'nan'
set style data histograms
plot '.example' using ( $2==0 ? 1 : 0 ) ls 17 title 'NA', \
'' using ( $2==1 ? 1 : 0 ) ls 1, \
for [i=3:5] for [j=0:1] '.example' using ( column(i) == j ? 1 : 0 ) \
ls ( j == 0 ? 17 : i-1 ) notitle
The code above is equivalent to what you already have; the value of j is simply used to switch the style depending on whether the column's value is a 0 or a 1.

How to awk every nth line starting from different lines each iteration

I would like awk to print every nth line of a file, starting from line 0. Then, after awk has gone through the whole file, I would like it to print every nth line starting from line 1, then every nth line starting from line 2, and so on, up to printing every nth line starting from line n-1. My sad attempt thus far:
#!/bin/bash
rm *.sad *.sadd *.out
#Create loop index
for i in $(seq 20 1 36);
do
listm+=($i)
done
#Create input file
for j in "${listm[@]}"
do
if [ $j -eq 20 ];
then
awk 'NR % 20 == 0' vel_VMDout > atomvel.dat
awk '{print $2,$3,$4}' atomvel.dat > velocity.dat
else
awk 'NR % 20 == 1' vel_VMDout > $j.sad
egrep -v "^[[:space:]]*$|^#" $j.sad > $j.sadd
awk '{print $2, $3, $4}' $j.sadd > $j.out
paste velocity.dat $j.out > taste
fi
done
Let me try to clarify this by providing the input and what the output should look like. The input is an xyz file from an MD simulation consisting of frames of the atoms' xyz coordinates.
INPUT:
This image shows the 1st snapshot and part of the second snapshot. Because these are snapshots, the ordering of the atoms does not change. Thus, I am trying to print the xyz coordinates of each specific atom from each snapshot in their own columns, as shown below. This would eventually make a file consisting of 3N columns, where N is the number of atoms.
OUTPUT:
As you can see, each atom's coordinates are in their own columns, and the total file is an N x 3N array. My bash script was my attempt to do this, but it could only handle the first two atoms. I wanted to print every nth line (the coordinates of the nth atom) so they look like the output. I really appreciate your patience, all.
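For the literal "every nth line starting from a given offset" part, a minimal sketch (n=20 and the offset_* output names are illustrative; vel_VMDout is the file from the attempt above) would be:
n=20
for ((k = 0; k < n; k++))
do
    # lines k, k+n, k+2n, ... counting the first line of the file as line 0
    awk -v n="$n" -v k="$k" 'NR % n == (k + 1) % n' vel_VMDout > "offset_$k.dat"
done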
Generating sample data
This is a step that should not be necessary; the question should have included usable sample data and the required output from that sample data.
At one level, it won't help much because you don't have my random number generator program, but the script below shows how I generated the data that follows, and it illustrates the lengths to which it might be necessary to go when the question doesn't supply readable data. I generated some data that looks similar to the data in the question (at least superficially):
18
Generated by VMD in absentia
C 0.979485 -6.665347 0.575383
C 1.191999 -3.002386 2.859484
C 3.151517 -5.610077 0.429413
C 3.439828 -6.454984 1.319724
C 3.726201 -0.123038 2.096854
C 1.363325 -3.031238 0.016019
C 6.090283 -3.915340 2.396358
C 0.407755 -7.957784 -0.846842
C 0.203074 -0.796428 2.659573
O 2.600610 -2.259674 -0.260378
O 4.773839 -6.765097 0.588508
H 2.743424 -2.890016 2.906452
H 2.810233 -6.641054 -0.797672
H 6.854169 -3.191721 -0.925670
O 2.914233 -1.060001 0.776983
H 3.803923 -1.497032 2.908799
H 5.669443 -7.227666 -0.647552
H 0.092455 -5.850637 2.959987
18
Generated by VMD in absentia
C 6.042840 -7.254720 2.093573
C 2.551942 -6.044322 2.061072
C 3.523150 -6.167163 2.451689
C 5.197316 -3.429866 -0.412062
C 2.548777 -6.422851 1.282846
C 3.775197 -2.012031 1.377440
C 3.405112 -3.206415 -0.879886
C 1.448359 -5.419629 0.467291
C 3.661964 -2.789234 2.644294
O 4.214854 -2.439574 -0.951704
O 5.297609 -2.320418 2.709898
H 2.653940 -4.431080 -0.511743
H 5.040635 -0.676199 -0.590970
H 1.546725 -1.294582 2.562937
O 4.231461 -7.180908 1.629901
H 3.297836 -1.557133 -0.133280
H 3.442481 -4.489962 2.111930
H 1.423611 -7.982655 0.715618
18
Generated by VMD in absentia
C 1.432495 -7.686243 2.525734
C 5.038409 -4.976270 2.826846
C 6.184137 -7.303094 2.711561
C 3.208125 -0.606556 1.978725
C 2.171859 -6.792060 0.678988
C 6.521124 -5.622797 -0.773797
C 1.725619 -5.768633 -0.223397
C 3.602427 -2.325680 1.762008
C 1.937521 -1.686895 1.743159
O 0.745526 -0.114246 -0.949490
O 4.754360 -6.531145 1.998913
H 1.114732 -1.158810 1.486939
H 6.410490 -5.411647 0.062737
H 4.164330 -6.743763 1.802804
O 2.587841 -3.979700 2.609748
H 2.192073 -2.815376 -0.809569
H 5.501795 -2.326438 1.325829
H 3.285032 -1.212541 1.284453
18
Generated by VMD in absentia
C 3.564424 -3.117406 -0.032879
C 2.894745 -0.632591 0.532311
C 3.384916 -5.383135 1.179585
C 0.793488 -0.894539 -0.886891
C 1.348785 -6.501867 1.648604
C 2.189941 -2.438067 0.616090
C 2.043378 -4.966472 0.691603
C 3.124161 -5.792896 0.545362
C 5.741472 -0.640590 2.825374
O 0.300550 -7.149663 0.942726
O 1.344387 -0.121382 2.169401
H 4.963296 -0.964665 -0.230523
H 6.651423 -4.905053 2.509626
H 5.059694 -6.166516 0.102255
O 5.046864 -3.288883 0.853948
H 2.389007 -3.057664 1.806301
H 2.365876 -0.956860 1.458959
H 2.892502 -0.097422 -0.531714
The script I used to do it was:
random -n $((4 * 18)) -T '%8:6[0:7]F %8:6[-8:0]F %8:6[-1:3]F' |
awk 'BEGIN { n = split("CCCCCCCCCOOHHHOHHH", atoms, ""); atoms[0] = atoms[n] }
NR % n == 1 { print n; print " Generated by VMD in absentia" }
{ print "", atoms[NR%18], " ", $0 }'
The -n option to random says how many rows to generate; I chose 72. The -T option is a template, and the notation %8:6[0:7]F means use %8.6F format to print uniformly distributed random numbers between 0 and 7. The awk script takes the data that is so generated and interpolates the noise (the number of atoms and a variant on the 'generated by VMD' line), as well as tagging the lines with the appropriate atomic symbol.
Processing the sample data
Given some data, you then need to munge it to get the required output. This script more or less does the job. There are endless ways it could be improved, of course, such as taking file names as command-line arguments, using temporary file names instead of fixed names, cleaning up the intermediate files, handling different compounds and different atoms (nitrogen, phosphorus, etc.), and so on. However, it should adapt reasonably easily.
input="data"
output="output"
n=$(sed 1q "$input")
n2=$(($n+2))
for ((i = 3; i <= n2; i++))
do
colno=$(printf "%.2d" $(($i-2)))
awk -v N=$n2 -v R=$i \
' BEGIN { name["C"] = "Carbon"; name["H"] = "Hydrogen"; name["O"] = "Oxygen";
R0 = R % N }
NR > 2 && NR <= R { count[$1]++; }
NR == R { printf "%-32.32s\n", name[$1] " " count[$1]; }
NR % N == R0 { xyz = sprintf("%s %s %s", $2, $3, $4); printf "%-32.32s\n", xyz }
' "$input" > "column.$colno"
done
paste -d ' ' column.* > "$output"
The first four lines set up the control parameters, collecting the number of lines per unit of data from the input file, and adjusting things accordingly. The for loop iterates over offsets 3 to $n2 inclusive (skipping the two header lines), and runs the awk script. That encodes atom types (BEGIN), determines which atom it is processing this time (NR > 2 && NR <= R and NR == R), and then arranges to print the triplets of data for the relevant atom. The formatting is carefully organized so that the column headings and the actual xyz-triplets are uniformly spaced. These are written to a file column.$colno. When all's done, the column.* files are pasted to generate a single output file, which looks like this:
Carbon 1 Carbon 2 Carbon 3 Carbon 4 Carbon 5 Carbon 6 Carbon 7 Carbon 8 Carbon 9 Oxygen 1 Oxygen 2 Hydrogen 1 Hydrogen 2 Hydrogen 3 Oxygen 3 Hydrogen 4 Hydrogen 5 Hydrogen 6
0.979485 -6.665347 0.575383 1.191999 -3.002386 2.859484 3.151517 -5.610077 0.429413 3.439828 -6.454984 1.319724 3.726201 -0.123038 2.096854 1.363325 -3.031238 0.016019 6.090283 -3.915340 2.396358 0.407755 -7.957784 -0.846842 0.203074 -0.796428 2.659573 2.600610 -2.259674 -0.260378 4.773839 -6.765097 0.588508 2.743424 -2.890016 2.906452 2.810233 -6.641054 -0.797672 6.854169 -3.191721 -0.925670 2.914233 -1.060001 0.776983 3.803923 -1.497032 2.908799 5.669443 -7.227666 -0.647552 0.092455 -5.850637 2.959987
6.042840 -7.254720 2.093573 2.551942 -6.044322 2.061072 3.523150 -6.167163 2.451689 5.197316 -3.429866 -0.412062 2.548777 -6.422851 1.282846 3.775197 -2.012031 1.377440 3.405112 -3.206415 -0.879886 1.448359 -5.419629 0.467291 3.661964 -2.789234 2.644294 4.214854 -2.439574 -0.951704 5.297609 -2.320418 2.709898 2.653940 -4.431080 -0.511743 5.040635 -0.676199 -0.590970 1.546725 -1.294582 2.562937 4.231461 -7.180908 1.629901 3.297836 -1.557133 -0.133280 3.442481 -4.489962 2.111930 1.423611 -7.982655 0.715618
1.432495 -7.686243 2.525734 5.038409 -4.976270 2.826846 6.184137 -7.303094 2.711561 3.208125 -0.606556 1.978725 2.171859 -6.792060 0.678988 6.521124 -5.622797 -0.773797 1.725619 -5.768633 -0.223397 3.602427 -2.325680 1.762008 1.937521 -1.686895 1.743159 0.745526 -0.114246 -0.949490 4.754360 -6.531145 1.998913 1.114732 -1.158810 1.486939 6.410490 -5.411647 0.062737 4.164330 -6.743763 1.802804 2.587841 -3.979700 2.609748 2.192073 -2.815376 -0.809569 5.501795 -2.326438 1.325829 3.285032 -1.212541 1.284453
3.564424 -3.117406 -0.032879 2.894745 -0.632591 0.532311 3.384916 -5.383135 1.179585 0.793488 -0.894539 -0.886891 1.348785 -6.501867 1.648604 2.189941 -2.438067 0.616090 2.043378 -4.966472 0.691603 3.124161 -5.792896 0.545362 5.741472 -0.640590 2.825374 0.300550 -7.149663 0.942726 1.344387 -0.121382 2.169401 4.963296 -0.964665 -0.230523 6.651423 -4.905053 2.509626 5.059694 -6.166516 0.102255 5.046864 -3.288883 0.853948 2.389007 -3.057664 1.806301 2.365876 -0.956860 1.458959 2.892502 -0.097422 -0.531714
Your task is to understand why all the bits of the awk script are present. For example, why is R0 needed? (Hint: experiment without the R0 calculation, using R in its place.)