Restructuring a complicated and large dataframe

I have a large dataframe with 41040 obs. and 20 variables.
Here I will simplify the mock data set so it's easier to understand the question.
What I have:
rm(list = ls())
variable <- rep(c('var1', 'var1_2', 'var1_3', 'var1_4'), 5)
group <- as.factor(rep(c('county1', 'county2', 'county3', 'county4'), 5))
year <- rep(c(2000:2004), 4)
month <- c(rep(1:12, 1), 1:8)
value1 <- sample(1:10000, 20)
value2 <- sample(1:10000, 20)
value3 <- sample(1:10000, 20)
mydata <- data.frame(variable, group, year, month, value1, value2, value3)
head(mydata)
variable group year month value1 value2 value3
1 var1 county1 2000 1 4848 4759 6029
2 var1_2 county2 2001 2 7624 3486 6745
3 var1_3 county3 2002 3 4612 9155 4266
4 var1_4 county4 2003 4 1496 2420 9451
5 var1 county1 2004 5 6739 4312 5577
6 var1_2 county2 2000 6 5127 5030 5479
What I want from this is another data.frame where values are not mixed up across counties, years or months, but where each level of the variable column becomes its own column. To clarify, using the same example, I am looking for the quickest way to get this:
var1 <- c(t(mydata[1, 5:7]))
var1_2 <- c(t(mydata[2, 5:7]))
var1_3 <- c(t(mydata[3, 5:7]))
var1_4 <- c(t(mydata[4, 5:7]))
group2 <- rep('county1', 3)
year2 <- rep(2000, 3)
month2 <- rep(1, 3)
mydata2 <- data.frame(group2, year2, month2, var1, var1_2, var1_3, var1_4)
head(mydata2)
group2 year2 month2 var1 var1_2 var1_3 var1_4
county1 2000 1 4848 7624 4612 1496
county1 2000 1 4759 3486 9155 2420
county1 2000 1 6029 6745 4266 9451
After all values for county1, year 2000 and month 1 are written, I want it to move on to month 2 of year 2000 for county1, then month 3, etc. After all months are done, I want year 2001 for county1, and so on, eventually moving on to county2.
I tried various ways with melt(), dcast(), stack(), unstack(), gather() and spread() with no success.

I did it, though not super-elegantly. I just split the original data.frame into new data.frames, selecting the first four columns plus, in turn, each value column that needed to be cast. Like this:
res <- dplyr::select(mydata, c(1:4, 5)) # then changed this 5 to 6, then to 7, etc.
base <- dcast(res, group + year + month ~ variable, value.var = 'value1')
After I did this for each value column, I used cbind to create the new, cast dataframe:
cbind(base, var1_2[ , 5:14], var1_3[ , 6:14])
It works, although I would still like to see a nicer way to do this automatically in one or two lines.
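One possible way to do it in a single cast (a sketch on the mock data; the 'series' name is just an illustrative label chosen to avoid clashing with the existing variable column) is to melt the three value columns first and then spread the variable column:
library(reshape2)
# stack value1:value3 into long format, keeping the keys
long <- melt(mydata, id.vars = c('variable', 'group', 'year', 'month'),
             variable.name = 'series')
# one row per group/year/month/series, one column per level of 'variable'
wide <- dcast(long, group + year + month + series ~ variable, value.var = 'value')
Depending on how the keys combine in the real data, some cells may come out NA, and dcast() will warn if a cell matches more than one row.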


How to append two dataframes when column numbers differ in PostgreSQL in R

What I am trying to do is bind rows into my PostgreSQL database by matching columns, like the fill argument of rbindlist (from data.table).
In short, the table I'd like to see in my database looks like this:
a <- data.frame(no=c(234,235,346),year=2012:2014,col1=c(1,1,1))
b <- data.frame(no=c(333,353,324),year=2014:2016,col2=c(2,2,2))
union_data_frame <- data.table::rbindlist(list(a,b),fill=T)
union_data_frame
no year col1 col2
1 234 2012 1 NA
2 235 2013 1 NA
3 346 2014 1 NA
4 333 2014 NA 2
5 353 2015 NA 2
6 324 2016 NA 2
I tried it with RPostgres this way:
library(RPostgres)
a <- data.frame(no=c(234,235,346),year=2012:2014,col1=c(1,1,1))
b <- data.frame(no=c(333,353,324),year=2014:2016,col2=c(2,2,2))
drv <- dbDriver('Postgres')
con <- dbConnect(drv,user='postgres',dbname='dummy_db')
dbWriteTable(con,'dummy_table',a,append = T,row.names = F)
dbWriteTable(con,'dummy_table',b,append = T,row.names = F)
But it doesn't work and yields an error, because the existing table (created from a) has no column called col2.
How can I append tables using only their common columns?
Thanks in advance.
I think you need to:
identify which columns are missing,
ALTER TABLE those new columns into existence, and then
upload the data, assuming every column present in the existing table but missing from the second frame is nullable.
### pg <- dbConnect(...)
dbWriteTable(pg, "some_table", a)
newcolumns <- setdiff(colnames(b), dbListFields(pg, "some_table"))
newcolumns
# [1] "col2"
addqry <- paste("alter table some_table",
paste("add", newcolumns, dbDataType(pg, b[,newcolumns]),
collapse = ", "))
addqry
# [1] "alter table some_table add col2 DOUBLE PRECISION"
dbExecute(pg, addqry)
dbWriteTable(pg, "some_table", b, append = TRUE)
dbGetQuery(pg, "select * from some_table")
# no year col1 col2
# 1 234 2012 1 NA
# 2 235 2013 1 NA
# 3 346 2014 1 NA
# 4 333 2014 NA 2
# 5 353 2015 NA 2
# 6 324 2016 NA 2
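If both data frames are available in R at the same time anyway, a simpler variant is to do the fill in R (reusing the rbindlist call from the question) and write the combined result once; a sketch:
library(data.table)
# pad missing columns with NA in R, then write a single table
union_df <- rbindlist(list(a, b), fill = TRUE)
dbWriteTable(pg, "some_table", union_df, overwrite = TRUE, row.names = FALSE)
The ALTER TABLE route above is still the one you want when the table already exists and has to be appended to incrementally.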

Name-Specific Variability Calculations Pandas

I'm trying to calculate variability statistics from two dfs: one with current data and one with average data for each month. Suppose I have a df "DF1" that looks like this:
Name year month output
0 A 1991 1 10864.8
1 A 1997 2 11168.5
2 B 1994 1 6769.2
3 B 1998 2 3137.91
4 B 2002 3 4965.21
and a df called "DF2" that contains monthly averages from multiple years such as:
Name month output_average
0 A 1 11785.199
1 A 2 8973.991
2 B 1 8874.113
3 B 2 6132.176667
4 B 3 3018.768
and I need a new df, call it "DF3", that looks like this, with the calculation done per Name and per month:
Name year month Variability
0 A 1991 1 -0.078097875
1 A 1997 2 0.24454103
2 B 1994 1 -0.237197002
3 B 1998 2 -0.488287737
4 B 2002 3 0.644782
I have tried options like the one below, but got errors about duplicating the axis or key errors:
DF3['variability'] = (DF1.output / DF2.set_index('month')['output_average'].reindex(DF1['name']).values) - 1
Thank you for your help in learning Python row calculations, coming from MATLAB!
For matching on two columns, you are better off using merge instead of set_index:
df3 = df1.merge(df2, on=['Name','month'], how='left')
df3['variability'] = df3['output']/df3['output_average'] - 1
Output:
Name year month output output_average variability
0 A 1991 1 10864.80 11785.199000 -0.078098
1 A 1997 2 11168.50 8973.991000 0.244541
2 B 1994 1 6769.20 8874.113000 -0.237197
3 B 1998 2 3137.91 6132.176667 -0.488288
4 B 2002 3 4965.21 3018.768000 0.644780
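If DF3 should contain only the columns shown in the desired output, the helper columns can be dropped afterwards; a sketch:
# keep only Name, year, month and variability
df3 = df3.drop(columns=['output', 'output_average'])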

Applying a function to two columns, referring to the previous row - Pandas

I've got a data frame that includes x and y variables, and the indexes are ID, date and time.
I want to create a new variable by applying a user-defined function.
For example, the function could be:
def some_function(x1, x2, y1, y2):
    z = x1*x2 + y1*y2
    return z
The real function is more complex.
Note: The function should be applied on each ID separately.
Data illustration:
ID date time x y
1 08/27/2019 18:00 1 2
19:00 3 4
20:00 .. ..
21:00 .. ..
2 08/28/2019 18:00 .. ..
19:00 .. ..
19:31 .. ..
19:32 .. ..
19:34 .. ..
So for example, the first row in the new variable should be 0, since there is no previous row, and the second row should be 3*1 + 4*2 = 11.
You can try:
def myfunc(d):
    return d['x'].mul(d['x'].shift()) + d['y'].mul(d['y'].shift())

# groupby().apply() adds ID as an extra index level; drop it so the
# result aligns with df's original index
df['new_col'] = df.groupby('ID').apply(myfunc).reset_index(level=0, drop=True)
Assuming the index is numeric:
(df.join(df.groupby('ID')[['x','y']].shift(), lsuffix='1', rsuffix='2')
   .apply(lambda r: some_function(r.x1, r.x2, r.y1, r.y2), axis=1))
You can do this with shift:
df_shifted = df[['x', 'y']].shift(1).fillna(0)
df['new_col'] = df['x']*df_shifted['x'] + df['y']*df_shifted['y']
The output looks like this:
import pandas as pd

df = pd.DataFrame(dict(
    ID=[1, 1, 2, 3, 3],
    time=['02:37', '05:28', '09:01', '10:05', '10:52'],
    x=[1, 3, 4, 7, 1],
    y=[2, 4, 3, 2, 6]
))
df_shifted = df.shift(1).fillna(0)
df['new_col'] = df['x']*df_shifted['x'] + df['y']*df_shifted['y']
df
df
Out[474]:
ID time x y new_col
0 1 02:37 1 2 0.0
1 1 05:28 3 4 11.0
2 2 09:01 4 3 24.0
3 3 10:05 7 2 34.0
4 3 10:52 1 6 19.0
So it mixes the rows of different IDs: the value for ID 2 is calculated from the last row of ID 1. If you don't want that, you need to work with groupby, like this:
# make sure the dataframe is sorted
df.sort_values(['ID', 'time'], inplace=True)

# define a function that gets the sub-dataframes
# which belong to the same ID
def calculate(sub_df):
    df_shifted = sub_df.shift(1).fillna(0)
    sub_df['new_col'] = sub_df['x']*df_shifted['x'] + sub_df['y']*df_shifted['y']
    return sub_df

df.groupby('ID').apply(calculate)
The output looks like this on the same data as above:
Out[472]:
ID time x y new_col
0 1 02:37 1 2 0.0
1 1 05:28 3 4 11.0
2 2 09:01 4 3 0.0
3 3 10:05 7 2 0.0
4 3 10:52 1 6 19.0
You see that the first entry of each group is now 0.0; the mixing doesn't happen anymore.
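A variant that avoids apply() altogether is to shift within each group directly: groupby(...).shift() returns NaN at the start of every group, and fillna(0) then produces the 0 the question asks for. A sketch on the same df:
x_prev = df.groupby('ID')['x'].shift()
y_prev = df.groupby('ID')['y'].shift()
df['new_col'] = (df['x']*x_prev + df['y']*y_prev).fillna(0)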

Return rows whose elements are duplicates, not the logical vector

I know base R's duplicated() function. The problem is that it only returns a logical vector indicating which elements (rows) are duplicates.
I want to get back the rows themselves, not the logical vector.
I want all the observations of A and B, because they have duplicated values for the key Name and Year.
I have already coded this:
df %>% group_by(Name) %>% filter(any( ????? ))
but I don't know how to write the last part of the code.
Any ideas?
Thanks :)
An option using dplyr is to group on both Name and Year to calculate a count, then group on only Name and filter for groups having any count > 1 (meaning a duplicate):
library(dplyr)
df %>%
  group_by(Name, Year) %>%
  mutate(count = n()) %>%
  group_by(Name) %>%
  filter(any(count > 1)) %>%
  select(-count)
# # A tibble: 7 x 3
# # Groups: Name [2]
# Name Year Value
# <chr> <int> <int>
# 1 A 1990 5
# 2 A 1990 3
# 3 A 1991 5
# 4 A 1995 5
# 5 B 2000 0
# 6 B 2000 4
# 7 B 1998 5
Data:
df <- read.table(text =
"Name Year Value
A 1990 5
A 1990 3
A 1991 5
A 1995 5
B 2000 0
B 2000 4
B 1998 5
C 1890 3
C 1790 2",
header = TRUE, stringsAsFactors = FALSE)
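The same idea can be written more compactly with duplicated() inside the group, which avoids the helper count column; a sketch on the same df:
library(dplyr)
df %>%
  group_by(Name) %>%
  # keep a group if any Year within it occurs more than once
  filter(any(duplicated(Year))) %>%
  ungroup()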

Calculating Growth-Rates by applying log-differences

I am trying to transform my data.frame by calculating the log-differences of each column while controlling for the row id. So basically I'd like to calculate the growth rates of each id's variables.
So here is a random df with an id column, a time-period column p, and three variable columns:
df <- data.frame (id = c("a","a","a","c","c","d","d","d","d","d"),
p = c(1,2,3,1,2,1,2,3,4,5),
var1 = rnorm(10, 5),
var2 = rnorm(10, 5),
var3 = rnorm(10, 5)
)
df
id p var1 var2 var3
1 a 1 5.375797 4.110324 5.773473
2 a 2 4.574700 6.541862 6.116153
3 a 3 3.029428 4.931924 5.631847
4 c 1 5.375855 4.181034 5.756510
5 c 2 5.067131 6.053009 6.746442
6 d 1 3.846438 4.515268 6.920389
7 d 2 4.910792 5.525340 4.625942
8 d 3 6.410238 5.138040 7.404533
9 d 4 4.637469 3.522542 3.661668
10 d 5 5.519138 4.599829 5.566892
Now I have written a function which does exactly what I want, BUT I had to take a detour which is possibly unnecessary. However, somehow I am not able to find the shortcut.
Here is the function and the output for the posted data frame:
library(plyr)
fct.logDiff <- function (df) {
  df.log <- dlply(df, "id", function(x) data.frame(p = x$p, log(x[, -c(1, 2)])))
  list.nalog <- llply(df.log, function(x) data.frame(p = x$p, rbind(NA, sapply(x[, -1], diff))))
  ldply(list.nalog, data.frame)
}
fct.logDiff(df)
id p var1 var2 var3
1 a 1 NA NA NA
2 a 2 -0.16136569 0.46472004 0.05765945
3 a 3 -0.41216720 -0.28249264 -0.08249587
4 c 1 NA NA NA
5 c 2 -0.05914281 0.36999681 0.15868378
6 d 1 NA NA NA
7 d 2 0.24428771 0.20188025 -0.40279188
8 d 3 0.26646102 -0.07267311 0.47041227
9 d 4 -0.32372771 -0.37748866 -0.70417351
10 d 5 0.17405309 0.26683625 0.41891802
The trouble comes from the added NA rows. I don't want to collapse the frame, which the diff() function would do automatically: I had 10 rows in the original frame and want to keep 10 rows after the transformation. In order to keep the same length I had to add NAs. My detour transforms the data.frame into a list, adds an NA to each id's first line, and then transforms the list back into a data.frame. That looks tedious.
Any ideas to avoid the data.frame-list-data.frame class transformation and optimize the function?
How about this?
nadiff <- function(x, ...) c(NA, diff(x, ...))
ddply(df, "id", colwise(nadiff, c("var1", "var2", "var3")))
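For completeness, a dplyr equivalent of the same pattern; a sketch (note that the log() from the original fct.logDiff is applied explicitly here, and across() needs dplyr >= 1.0):
library(dplyr)
df %>%
  group_by(id) %>%
  # NA in the first row of each id, log-differences afterwards
  mutate(across(c(var1, var2, var3), ~ c(NA, diff(log(.x))))) %>%
  ungroup()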