I was looking for an answer everywhere, but I just couldn't find one to this problem (maybe I was just too stupid to use other answers, because I'm new to R).
I have two data frames with different numbers of rows. I want to create a plot containing a single bar per data frame. Both bars should have the same height, with the counts of the different values stacked on top of each other. For example, I want to compare the proportions of gender in these two data sets.
t1<-data.frame(cbind(c(1:6), factor(c(1,2,2,1,2,2))))
t2<-data.frame(cbind(c(1:4), factor(c(1,2,2,1))))
1 represents male, 2 represents female
I want to create two bars next to each other showing that the proportion of genders is 2:4 in the first data frame and 2:2 in the second.
My attempt looked like this:
ggplot() + geom_bar(aes(1, t1$X2, position = "fill")) + geom_bar(aes(1, t2$X2, position = "fill"))
That leads to the error: "Error: stat_count() must not be used with a y aesthetic."
First you should merge the two data frames. To do that, add a column to each data frame that identifies where its rows came from (for example "t1" and "t2"). Because the column names are the same in both data frames, you can then combine them with rbind.
t1$data <- "t1"
t2$data <- "t2"
t <- (rbind(t1,t2))
Now you can make the plot:
ggplot(t[order(t$X2),], aes(data, X2, fill=factor(X2))) +
geom_bar(stat="identity", position="stack")
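The plot above stacks the raw X2 values, so the two bars end up with different heights. If equal-height bars showing proportions are what is wanted, one possible approach is to tabulate the counts and convert them to per-data-frame shares. The following is only a sketch: the column names id and gender and the name combined are assumptions made for readability (the question's data frames use X1 and X2), and the ggplot2 part runs only if that package is installed.

```r
# Example data from the question, with named columns (1 = male, 2 = female).
# The names id, gender and combined are illustrative assumptions.
t1 <- data.frame(id = 1:6, gender = factor(c(1, 2, 2, 1, 2, 2)))
t2 <- data.frame(id = 1:4, gender = factor(c(1, 2, 2, 1)))
t1$data <- "t1"
t2$data <- "t2"
combined <- rbind(t1, t2)

# Counts per gender (rows) and data frame (columns), then column-wise shares.
counts <- table(gender = combined$gender, data = combined$data)
props <- prop.table(counts, margin = 2)

# Base graphics: one stacked bar per data frame, each with total height 1.
barplot(props, legend.text = c("male", "female"), ylab = "proportion")

# Comparable ggplot2 call; position = "fill" rescales each bar to height 1.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(combined, ggplot2::aes(x = data, fill = gender)) +
    ggplot2::geom_bar(position = "fill")
  print(p)
}
```

Note that geom_bar() is given no y aesthetic here, because it counts the rows itself. Supplying a y value is what triggers the stat_count() error quoted in the question.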
I've been trying to figure out how to calculate across a data frame using lapply/tapply/apply while dynamically selecting the column to use. I have a data frame of wind speeds at different heights (Date, Year, Wind.100, Wind.120, Wind.80, etc.) and I want to do calculations based on a height that varies depending on which turbine I am simulating.
Height <- 100
Level <- paste0("Wind.", Height)
I've tried:
tapply(df[[Level]], list(DF$Hour, DF$Year), mean)
tapply(df[Level], list(DF$Hour, DF$Year), mean)
tapply(df[,..Level], list(DF$Hour, DF$Year), mean)
but it fails and says: object 'Level' not found.
There has got to be a way to script this. I know doing a
paste0("df$Wind.",Height)
doesn't work but I can't figure it out.
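For what it's worth, selecting a column by a name stored in a variable normally works with double brackets. Judging only from the snippets above, one possible source of trouble is that they mix df and DF, which are different objects in R because names are case-sensitive. Below is a minimal sketch with made-up data: the values are hypothetical, and only the column naming scheme (Wind.100, Wind.120, Hour, Year) follows the question.

```r
# Hypothetical data; only the column naming scheme comes from the question.
DF <- data.frame(
  Year     = c(2020, 2020, 2021, 2021),
  Hour     = c(0, 1, 0, 1),
  Wind.100 = c(5, 7, 6, 8),
  Wind.120 = c(6, 8, 7, 9)
)

Height <- 100
Level <- paste0("Wind.", Height)  # "Wind.100"

# DF[[Level]] extracts the column whose name is stored in Level.
# The same object name (DF) is used for the values and the grouping.
hourly_mean <- tapply(DF[[Level]], list(DF$Hour, DF$Year), mean)
print(hourly_mean)
```

Changing Height to 120 and rebuilding Level selects the other column with no further changes to the tapply call.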
I'm a beginner to R from a SAS background trying to do a basic "case when" match on two tables to get a flag where I have and have not found a match. Please see the SAS code I have in mind below. I just need something analogous to this in R. Thanks in advance.
proc sql;
create table
x as
select
a.*,
b.*,
case when a.first_column=b.column_first and
a.second_column=b.column_second
then 1 else 0 end as matched_flag
from table1 as a
left join
table2 as b
on a.first_column=b.column_first and a.second_column=b.column_second;
quit;
I'm not familiar with SAS, but I think I understand what you are trying to do. To see how many rows/columns are similar between two tables, you can use %in% and the length function.
For example, initialize two matrices of different dimensions and give them partly overlapping row names and column names:
mat.a <- matrix(1, nrow=3, ncol = 2)
mat.b <- matrix(1, nrow=2, ncol = 3)
rownames(mat.a) <- c('a','b','c')
rownames(mat.b) <- c('a','d')
colnames(mat.a) <- c('g','h')
colnames(mat.b) <- c('h','i','j')  # mat.b has 3 columns, so it needs 3 names
mat.a and mat.b now exist with different row and column names. To match the rows by names, you can use:
row.match <- rownames(mat.a)[rownames(mat.a) %in% rownames(mat.b)]
num.row.match <- length(row.match)
Note that row.match can now be used to index into both of the matrices. The %in% operator returns a logical vector of the same length as its first argument (in this case, rownames(mat.a)) that indicates whether the ith element of the first argument was found anywhere in the second argument. Because of this, you have to pay attention to the order of the arguments when you use the result for indexing.
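To make the indexing and the argument order concrete, here is a small self-contained sketch. The matrices are rebuilt with dimnames so the block runs on its own; the fill values, the third column name 'j', and the names sub.a, sub.b, in.a and in.b are illustrative assumptions.

```r
# Rebuild two small matrices with partly overlapping row names.
mat.a <- matrix(1, nrow = 3, ncol = 2,
                dimnames = list(c("a", "b", "c"), c("g", "h")))
mat.b <- matrix(2, nrow = 2, ncol = 3,
                dimnames = list(c("a", "d"), c("h", "i", "j")))

# Row names of mat.a that also occur in mat.b.
row.match <- rownames(mat.a)[rownames(mat.a) %in% rownames(mat.b)]

# The same character vector indexes both matrices by name.
# drop = FALSE keeps the result a matrix even when one row matches.
sub.a <- mat.a[row.match, , drop = FALSE]
sub.b <- mat.b[row.match, , drop = FALSE]

# The logical result follows the length of the left-hand argument.
in.a <- rownames(mat.a) %in% rownames(mat.b)  # length 3
in.b <- rownames(mat.b) %in% rownames(mat.a)  # length 2
```

Here in.a can be used to subset mat.a directly, but not mat.b, which is why the argument order matters for logical indexing.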
If you simply want to quantify how many rows or columns are the same between the two matrices, then you can use the sum function with the %in% operator:
sum(rownames(mat.a) %in% rownames(mat.b))
With the sum function used like this, the order of the arguments does not matter as long as the row names within each matrix are unique: the number of row names of mat.a found in mat.b then equals the number of row names of mat.b found in mat.a. In that case this usage of %in% is symmetric. With duplicated names the two counts can differ.
I hope this helps!
You will want to use data frame objects. These are like datasets in SAS. You can use cbind to put two data frames together side by side. Then you can select rows based on conditions and set the flag accordingly. In the code below you will see that I did this twice: once to set the 1 flag and once to set the 0 flag.
To select the rows where all fields match you can do something similar, but instead of assigning a new column you can assign all the results back to the name of the table you are working on.
Here's the code:
# make up example a and b data frames
table1 <- data.frame(list(a.first_column=c(1,2,3),a.second_column=c(4,5,6)))
table2 <- data.frame(list(b.first_column=c(1,3,6),b.second_column=c(4,5,9)))
# Combine columns (horizontally)
x <- cbind(table1, table2)
print("Combined Data Frames")
print(x)
# create matched flag (1 when the first columns match)
x$matched_flag[x$a.first_column==x$b.first_column] <- 1
x$matched_flag[!x$a.first_column==x$b.first_column] <- 0
# only select records that match both data frames
x <- x[x$a.first_column==x$b.first_column & x$a.second_column==x$b.second_column,]
print("Matched Data Frames")
print(x)
BTW: since you are used to using SQL, you might want to try the sqldf package in R. It will let you use the same techniques that you are used to but in R and on data frames.
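For a version that joins on the key columns rather than pairing rows by position, base R's merge with all.x = TRUE behaves like a left join. The sketch below is one possible way to derive the flag, not the only one. The column names come from the SAS code in the question, while the data values and the helper name lookup are made up for illustration.

```r
# Hypothetical tables using the column names from the SAS code.
table1 <- data.frame(first_column = c(1, 2, 3), second_column = c(4, 5, 6))
table2 <- data.frame(column_first = c(1, 3, 6), column_second = c(4, 5, 9))

# Mark every row of the right-hand table, so that rows surviving the
# join carry a 1 and unmatched rows come back as NA.
lookup <- table2
lookup$matched_flag <- 1

# Left join on both key columns (all.x = TRUE keeps every row of table1).
x <- merge(table1, lookup,
           by.x = c("first_column", "second_column"),
           by.y = c("column_first", "column_second"),
           all.x = TRUE)

# Rows of table1 without a partner in table2 get the flag 0.
x$matched_flag[is.na(x$matched_flag)] <- 0
print(x)
```

Unlike the SAS query's b.*, merge keeps only one copy of the key columns (named as in table1). Any further columns of table2 would appear in the result and be NA for unmatched rows.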
I have several zoo ordered observations that contain data indexed by date. For example (HTML used for spacing):
A B
2014-10-5 60.272 84.019
2014-10-6 61.183 84.024
2014-10-7 61.611 84.010
A B
2014-10-5 61.376 84.028
2014-10-6 61.761 84.032
2014-10-7 62.210 84.025
A B
2014-10-5 61.159 84.006
2014-10-6 61.550 84.029
2014-10-7 61.996 84.024
I have 3 of these objects, say, x, y, and z, with different data in each.
I want to extract data (from last row) from these and put them into a non-zoo data frame because I no longer want the date indexing. Here is what I tried to do:
main.result <- data.frame(x[nrow(x),])
main.result <- data.frame(rbind(main.result, y[nrow(y),]))
main.result <- data.frame(rbind(main.result, z[nrow(z),]))
What I get in main.result is not what I expected:
A B
main.result 61.61163, 62.21080 84.02096, 84.02505
61.99627 84.02423
I should have gotten:
A B
61.611 84.010
62.210 84.025
61.996 84.024
I want to continue to add rows to the main.result matrix. Where am I jumping the track here?
SOLVED: Sorry, the formatting of the output for ‘main.result’ threw me off. The answer I am getting is correct. Please disregard this post!!
I have an ecology data table with about 12,000 rows. There are three columns: site, species, and value. I need to add up the values for each set of matching site and species - for example, all "red maple" values at "site A". I have the data sorted by site and species, so I can do it by hand, but it's slow going. The number of site/species matches varies, so I can't just add up the values in sets of three or anything.
Similar types of questions have talked about pivot tables, but none have needed to match two columns and add a third column, and I haven't been able to figure out how to extrapolate to my situation.
I'm reasonably comfortable coding and would like to do something that looks like this pseudocode, but I'm not clear on the syntax in VBA:
For each row
    if a(x) = a(x+1) and b(x) = b(x+1) then
        sum = sum + c(x)
    else
        d(x) = sum
        sum = 0
next
Any ideas?
In a PivotTable, put site in Row Labels and species in Column Labels (or vice versa) and Sum of value in Σ Values.
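If the data can be exported from the spreadsheet, the same grouped sum can also be sketched outside Excel. The R example below uses base R's aggregate and is only an illustration of the computation, not a replacement for the PivotTable answer. The column names follow the question, and the rows are made up.

```r
# Hypothetical rows; the column names site, species and value follow the question.
d <- data.frame(
  site    = c("A", "A", "A", "B"),
  species = c("red maple", "red maple", "oak", "red maple"),
  value   = c(1, 2, 5, 4)
)

# One output row per site/species combination, with the values summed.
totals <- aggregate(value ~ site + species, data = d, FUN = sum)
print(totals)
```

The input does not need to be sorted, and the number of rows per site/species pair can vary freely.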
SELECT id, ST_Box2D(areas) AS bbox FROM mytable;
In this example, the table "mytable" contains two columns: "id" is the unique id number of the row and "areas" is a geometry field containing one MULTIPOLYGON per row.
This works fine for multipolygons containing only one polygon, but some rows have polygons very spread apart, hence the bounding box is not relevant when the multipolygon contains one polygon in Europe and one in Canada for example.
So I would need a way to get one box2d per polygon per multipolygon, but I haven't found how just yet.
More exactly, my goal is to return one multipolygon per row, containing one box2d per polygon.
First example
id: 123
area: a multipolygon containing only one oval polygon in Australia
therefore bbox should return a multipolygon containing only one rectangle (the bounding box) in Australia
Second example
id: 321
area: a multipolygon containing one circle in Paris, one circle in Toronto
therefore bbox should return a multipolygon containing one rectangle in Paris, one rectangle in Toronto
You should use ST_Dump https://postgis.net/docs/ST_Dump.html
Then you will get one row per polygon. The other fields are duplicated when the geometry is split. It works like an aggregate function in reverse.
The syntax gets a little special since it outputs a compound data type so you have to extract the geometry part like this:
SELECT (ST_Dump(the_geom)).geom from mytable;
Since this gives you more rows than the original table has, you should create a new table from the query.
Then you can create an index on the geometry column of the new table, and it will be built on the bounding boxes of the individual polygons.
HTH
/Nicklas
Do you also want your polygons at one row each? That is what I assumed, but if you only want a table of bboxes, one per row, with an id that references the original multipolygon (you will of course get the same id repeated for every part of the multipolygon), then you can do the same but extract just the bboxes, something like:
CREATE TABLE newTable AS
SELECT id, BOX2D((ST_Dump(the_geom)).geom) AS myBox FROM originalTable;
I am afraid I don't really get what you want, but you have a lot of possibilities with ST_Dump in cases like this.
You would have to box the relevant bits (say the Canadian and French components) separately. The best tool for this in PostGIS is the geometry accessor ST_GeometryN(geometry,int) (reference: http://postgis.refractions.net/docs/ST_GeometryN.html ). That link has a good example of combining the accessor with ST_NumGeometries.
UPDATE PER COMMENT:
Here is a simple example from San Francisco. The table contains a geometry field called the_geom, and the record with gid 1 is a multipolygon with two component geometries, as reported by st_numgeometries (note that the ordinal is indexed from 1, not 0):
=> select st_box2d(st_geometryn(the_geom, 1)) from tl_2009_06075_cousub00 \
where gid = 1;
st_box2d
-------------------------------------------------------------------------
BOX(-123.173828125 37.6398277282715,-122.935707092285 37.8230590820312)
(1 row)
=> select st_box2d(st_geometryn(the_geom, 2)) from tl_2009_06075_cousub00 \
where gid = 1;
st_box2d
----------------------------------------------------------------------------
BOX(-122.612289428711 37.7067184448242,-122.281776428223 37.9298248291016)
(1 row)