This is my code:
p14 <- ggplot(plot14, aes(x = Harvest, y = Percentage, fill = factor(Plant, levels = orderplants))) +
  geom_col(show.legend = FALSE) +
  geom_vline(xintercept = 3.5) +
  labs(y = "Bedekking %",
       x = NULL,
       fill = "Plantensoort") +
  theme_classic()
p14
The code plots the plant coverage of one plot (I have 70 plots in total); "bedekking" is the Dutch word for coverage. The problem: the numbers on the x-axis represent the time periods of the measurements:
July 2020
August 2020
October 2020
May 2021
June 2021
July 2021
August 2021
October 2021
I would like the bars of each month to line up, so there would be two rows (2020 and 2021) where the bars of the same months are above/below each other (see ugly sketch below). Is this possible to code, or do I need to change my entire dataset?
[sketch: very quick example of the goal]
It would be better if you could include some sample raw data as part of a reproducible example, so I've created a little made-up data to illustrate.
Ideally, you'd want the raw dates behind the numbered time measurements so that you can split the dates into months and years as separate variables. Assuming you only have the numbers, you could create some logic like this to make the months and years.
And you could use facet_wrap to show one year above another for the respective months.
library(tidyverse)
library(scales)
tibble(
  harvest = seq(1, 8, 1),            # measurement number 1-8
  percentage = rep(1, 8),            # dummy coverage values
  plant = rep(c("this", "that"), 4)  # two made-up plant species
) |>
  mutate(
    # recover the calendar month from the measurement number
    month = case_when(
      harvest %in% c(1, 6) ~ 7,
      harvest %in% c(2, 7) ~ 8,
      harvest %in% c(3, 8) ~ 10,
      harvest %in% c(4) ~ 5,
      harvest %in% c(5) ~ 6,
      TRUE ~ NA_real_
    ),
    # measurements 1-3 took place in 2020, the rest in 2021
    year = case_when(
      harvest <= 3 ~ 2020,
      TRUE ~ 2021
    )
  ) |>
  ggplot(aes(month, percentage, fill = plant)) +
  geom_col(show.legend = FALSE) +
  labs(
    y = "Bedekking %",
    x = NULL,
    fill = "Plantensoort"
  ) +
  facet_wrap(~year, ncol = 1) +  # one panel per year, stacked vertically
  scale_y_continuous(labels = label_percent()) +
  theme_classic()
Created on 2022-05-13 by the reprex package (v2.0.1)
Is there any way to calculate percentage growth (into the future) in pandas?
pandas has a .pct_change method to calculate the percent change of some columns.
I would like to do the opposite and project this growth into the future. My function below does the work; however, I find it kind of weird to be using a for loop for this calculation.
def cf_future_projection(
    cashflow_of_last_year: float,
    cashflow_pct_grow: float,
    last_observed_year: int,
    n_year_future: int = 5,
) -> dict:
    # grow the first projected year from the last observed value
    grow_values = {}
    grow_values[last_observed_year + 1] = cashflow_of_last_year * (
        1 + cashflow_pct_grow
    )
    # every further year grows from the previous projected year
    for year in range(1, n_year_future):
        grow_values[last_observed_year + 1 + year] = grow_values[
            last_observed_year + 1 + year - 1
        ] * (1 + cashflow_pct_grow)
    return grow_values

cf_future_projection(150, 0.15, 2020, 15)
Is there any way to do that in pandas, without a for loop?
When you run cf_future_projection(150, 0.15, 2020, 15), the basic calculation you are performing is 150*(1+0.15)^n for n years into the future, so I think that your function, while nicely written, is unnecessarily complicated.
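For instance, the whole 15-year projection collapses to a single vectorized expression (a minimal sketch; the variable names are my own):

import numpy as np

# 150 * 1.15**n for n = 1..15, computed in one shot
n = np.arange(1, 16)
projection = dict(zip(2020 + n, 150 * (1 + 0.15) ** n))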
I don't know exactly what your use case is, but if you want to create a dataframe with new rows at the end, adding new rows one at a time is an expensive operation, and you probably don't want to use a for loop with dataframes, as you mentioned. You might be better off taking the last row of an existing dataframe, creating a new dataframe with the projected future values, and concatenating the original and new dataframes together.
For example, let's say you're starting with a dataframe that looks like:
import pandas as pd

df = pd.DataFrame({'year': [2019, 2020], 'value': [140, 150]})
To do something similar to cf_future_projection(150, 0.15, 2020, 15), we can take the starting value from the row of the dataframe corresponding to 2020, and then use a list comprehension to create our new future years and values. If you like, you can wrap this operation in a function (a sketch follows the result below).
# starting value from the row corresponding to 2020
year, value = df[df['year'] == 2020].values[0]
n = 5
# future years and compounded values via list comprehensions
year_future = [2020 + i for i in range(1, n + 1)]
value_future = [value * (1 + 0.15) ** i for i in range(1, n + 1)]
df_future = pd.DataFrame({'year': year_future, 'value': value_future})
df_future = pd.concat([df, df_future])
Result:
>>> df_future
year value
0 2019 140.000000
1 2020 150.000000
0 2021 172.500000
1 2022 198.375000
2 2023 228.131250
3 2024 262.350937
4 2025 301.703578
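As mentioned above, you can wrap this operation in a function. A minimal sketch (the function name and signature are my own):

import pandas as pd

def project_growth(df, base_year, rate, n):
    # take the observed value for base_year and compound it n years forward
    year, value = df[df['year'] == base_year].values[0]
    future = pd.DataFrame({
        'year': [year + i for i in range(1, n + 1)],
        'value': [value * (1 + rate) ** i for i in range(1, n + 1)],
    })
    # ignore_index=True renumbers the rows instead of repeating the old index
    return pd.concat([df, future], ignore_index=True)

project_growth(pd.DataFrame({'year': [2019, 2020], 'value': [140, 150]}), 2020, 0.15, 5)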
I am working in R, where I have a string variable '2 month 3 day 6 hour 70 minute'. The variable changes over time and therefore does not always have the same length/structure. I need this variable to run a query on a PostgreSQL database by casting it to an interval, which works just fine.
Now I need this interval/string variable as an integer number of minutes to do some mathematical calculations.
I thought of using sqldf as follows:
library(sqldf)
my_interval = '2 month 3 day 6 hour 70 minute'
interval_minutes <- sqldf(paste("SELECT EXTRACT(EPOCH FROM '",my_interval,"'::INTERVAL)/60"))
interval_minutes_novar <- sqldf("SELECT EXTRACT(EPOCH FROM '2 month 3 day 6 hour 70 minute'::INTERVAL)/60")
but I am getting Error: near "FROM": syntax error. From my research I know that sqldf uses SQLite by default, which does not support EXTRACT().
How can I convert a SQL-Interval to minutes using R?
1) sqldf/gsubfn Using gsubfn, replace each unit word in my_interval with *, the appropriate number of minutes, and +. Then remove any trailing + and spaces, and either parse and evaluate mins or substitute mins into the SQL statement. There are 365.25 / 12 days in the average month (over four calendar years, one of which is a leap year), but if you want to get the same answer as PostgreSQL, replace 365.25 / 12 with 30, as noted in the comments.
library(sqldf) # this also pulls in gsubfn
# input
my_interval = '2 month 3 day 6 hour 70 minute'
L <- list(minute = " +", hour = "*60 +", day = "*60*24 +",
month = "*365.25 * 60 * 24 /12 +")
mins <- my_interval |>
gsubfn(pattern = "\\w+", replacement = L) |>
trimws(whitespace = "[+ ]")
eval(parse(text = mins))
## [1] 92410
fn$sqldf("select $mins mins")
## mins
## 1 92410
2) Base R This is a base R solution. Extract the numbers and words into separate vectors, translate the words into the appropriate minute factors, and take the inner product of the two. The discussion about 30-day months in (1) applies here too.
v <- c(minute = 1, hour = 60, day = 60 * 24, month = 365.25 * 60 * 24 /12)
nums <- my_interval |>
gsub(pattern = "[a-z]", replacement = "") |>
textConnection() |>
scan(quiet = TRUE)
words <- my_interval |>
gsub(pattern = "\\d", replacement = "") |>
textConnection() |>
scan(what = "", quiet = TRUE)
sum(v[words] * nums)
## [1] 92410
3) lubridate lubridate duration objects can be used.
library(lubridate)
as.numeric(duration(my_interval), "minute")
## [1] 92410
Although lubridate does not handle 30-day months (and Hadley says support is not planned), we can preprocess my_interval to get the effect.
library(gsubfn)
library(lubridate)
my_interval |>
gsubfn(pattern = "(\\d+) +month", replacement = ~paste(30*as.numeric(x),"day")) |>
duration() |>
as.numeric("minute")
## [1] 91150
Adapting my answer to another question here, I'll restate a rather gaping problem with this conversion: converting "month" into "seconds" is not constant, as months vary between 28 and 31 days. If we assume 30 days, though, for the sake of argument, then:
func <- function(x, ptn) {
out <- gsub(paste0(".*?\\b([0-9.]+)\\s*", ptn, ".*"), "\\1", x, ignore.case = TRUE)
ifelse(out == x, NA, out)
}
res1 <- lapply(c(mon = "month", day = "day", hr = "hour", min = "minute"),
function(ptn) as.numeric(func(my_interval, ptn)))
res2 <- lapply(res1, function(z) ifelse(is.na(z), 0, z))
res2
# $mon
# [1] 2
# $day
# [1] 3
# $hr
# [1] 6
# $min
# [1] 70
86400 * (res2$mon*30 + res2$day) + 3600*res2$hr + 60*res2$min
# [1] 5469000
That is 5469000 seconds, i.e. 91150 minutes, matching the 30-day-month result above.
Because I'm using lapply and simple vectorizable operations here, this also works if my_interval is more than one string (of similar format). It is robust to missing variables (presumed 0), and can include "year" (albeit with leap-year inaccuracies) and/or "second" if desired.
intervals <- c("2 month 3 day 6 hour 70 minute", "1 year", "1 hour 1 second")
res1 <- lapply(c(yr = "year", mon = "month", day = "day", hr = "hour", min = "minute", sec = "second"),
function(ptn) as.numeric(func(intervals, ptn)))
res2 <- lapply(res1, function(z) ifelse(is.na(z), 0, z))
str(res2)
# List of 6
# $ yr : num [1:3] 0 1 0
# $ mon: num [1:3] 2 0 0
# $ day: num [1:3] 3 0 0
# $ hr : num [1:3] 6 0 1
# $ min: num [1:3] 70 0 0
# $ sec: num [1:3] 0 0 1
86400 * (res2$yr*365 + res2$mon*30 + res2$day) + 3600*res2$hr + 60*res2$min + res2$sec
# [1]  5469000 31536000     3601
My workaround is to use my PostgreSQL connection to do it:
library(sf)
library(RPostgres)
my_postgresql_connection <- dbConnect(Postgres(), dbname = "my_db", host = "my_host", port = 1234, user = "my_user", password = "my_password")
my_interval = '2 month 3 day 6 hour 70 minute'
my_dataframe <- st_read(my_postgresql_connection, query = paste("SELECT EXTRACT(EPOCH FROM '",my_interval,"'::INTERVAL)/60 as minutes"))
my_interval_in_minutes <- as.double(my_dataframe$minutes[1])
In Norway we have something called D- and S-numbers. These are national identification numbers where the day or month of birth is modified.
D-number
[d+4]dmmyy (4 is added to the first digit of the day)
S-number
dd[m+5]myy (5 is added to the first digit of the month)
I have a column with dates, some of them normal (ddmmyy) and some of them formatted as D- or S-numbers. The leading zeros are also missing, since the values are stored as integers.
import pandas as pd

df = pd.DataFrame({'dates': [241290,   # 24.12.90
                             710586,   # 31.05.86
                             105299,   # 10.02.99
                             56187]})  # 05.11.87
dates
0 241290
1 710586
2 105299
3 56187
I've written this function to add the leading zeros and convert the dates, but this solution doesn't feel that great.
def func(s):
    s = s.astype(str)
    res = []
    for index, value in s.items():
        # Make sure all dates have 6 digits (add leading zero)
        if len(value) == 5:
            value = '0' + value
        # Convert S- and D-dates to regular dates
        if int(value[0]) > 3:
            # subtract 4 from the first digit
            res.append(str(int(value[0]) - 4) + value[1:])
        elif int(value[2]) > 1:
            # subtract 5 from the third digit
            res.append(value[:2] + str(int(value[2]) - 5) + value[3:])
        else:
            res.append(value)
    return pd.Series(res)
Is there a smoother and faster way of accomplishing the same result?
Normalize the dates by padding them with 0, then explode them into 3 two-digit columns (day, month, year). Apply your rules and combine the columns with pd.to_datetime:
# Suggested by @HenryEcker
# Changed: .pad(6, fillchar='0') to .zfill(6)
# vectorized rules: subtract 40 from out-of-range days, 50 from out-of-range months
dates = df['dates'].astype(str).str.zfill(6).str.findall(r'(\d{2})') \
          .apply(pd.Series).astype(int) \
          .rename(columns={0: 'day', 1: 'month', 2: 'year'}) \
          .transform({'day': lambda d: d.where(d <= 31, d - 40),
                      'month': lambda m: m.where(m <= 12, m - 50),
                      'year': lambda y: 1900 + y})
df['dates2'] = pd.to_datetime(dates)
Output:
>>> df
dates dates2
0 241290 1990-12-24
1 710586 1986-05-31
2 105299 1999-02-10
3 56187 1987-11-05
>>> dates
day month year
0 24 12 1990
1 31 5 1986
2 10 2 1999
3 5 11 1987
You can also keep the Series as integers until the final step. The disadvantage of the method below is that the offsets do not match what the specification says, and it may take a bit more mental effort to comprehend:
import numpy as np

def func2(s):
    # In mathematical operations, digits are counted from the right,
    # so the "first digit" becomes the sixth and the "third digit"
    # becomes the fourth in a 6-digit number
    delta = np.select(
        [s // 10**5 % 10 > 3, s // 10**3 % 10 > 1],
        [4 * 10**5, 5 * 10**3],
        0
    )
    return (s - delta).astype('str').str.pad(6, fillchar='0')
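A quick usage sketch, applying func2 to the sample column from the question (the to_datetime call and its format are my own addition; note that %y maps two-digit years 69-99 to the 1900s and 00-68 to the 2000s):

normalized = func2(df['dates'])  # ['241290', '310586', '100299', '051187']
df['dates2'] = pd.to_datetime(normalized, format='%d%m%y')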
The dataframe below shows the monthly revenue of two shops (shop_id=11, shop_id=15) over a few years:
import pandas as pd

data = {'shop_id': [11, 15, 15, 15, 11, 11],
        'month':   [1, 1, 2, 3, 2, 3],
        'year':    [2011, 2015, 2015, 2015, 2014, 2014],
        'revenue': [11000, 5000, 4500, 5500, 10000, 8000]}
df = pd.DataFrame(data)
df = df[['shop_id', 'month', 'year', 'revenue']]
display(df)
You can notice that shop_id=11 has only one entry in 2011 (January) and shop_id=15 has a few entries in 2015 (January, February, March). Nevertheless, it's interesting to note that the first shop has a few more entries in 2014:
I'm trying to optimize a custom function (used along with .apply()) that creates a new feature called diff_revenue: this feature shows the change in revenue from the previous month, for each shop:
I would like to offer some explanation of how some of the values in diff_revenue were generated:
The first cell is 0 (red) because there is no previous information for shop_id=11;
The 2nd cell is also 0 (orange) for the same reason: there is no previous information for shop_id=15;
The 3rd cell is 500 (green), because the change from this shop's last entry (January 2015) to the current cell's revenue (February 2015) is 500 Trumps.
The 5th cell is 1000 (dark blue), because the change from this shop's last entry (January 2011) to the current cell's revenue (February 2014) was 1000 Trumps.
I'm no expert in pandas and was wondering if the pandas gods knew a better way. The DataFrame I have to work with is quite large (over 1M observations) and my current approach is too slow. I'm looking for a faster alternative, or maybe something more readable.
You more or less want to use Series.diff on the 'revenue' column, but need to do a few additional things:
Sort to ensure your DataFrame is in chronological order (can undo this later)
Perform a groupby on 'shop_id' to do group level operations
Take the absolute value, since you don't want to distinguish between positive and negative
In terms of code:
# sort the values so they're in order when we perform a groupby
df = df.sort_values(by=['year', 'month'])
# perform a groupby on 'shop_id' and get the row-wise difference within each group
df['diff_revenue'] = df.groupby('shop_id')['revenue'].diff()
# fill NA as zero (no previous info), take absolute value, convert float -> int
df['diff_revenue'] = df['diff_revenue'].fillna(0).abs().astype('int')
# revert to original order
df = df.sort_index()
The resulting output:
shop_id month year revenue diff_revenue
0 11 1 2011 11000 0
1 15 1 2015 5000 0
2 15 2 2015 4500 500
3 15 3 2015 5500 1000
4 11 2 2014 10000 1000
5 11 3 2014 8000 2000
Edit
A little less straightforward solution, but maybe slightly more performant, since it replaces the per-group diff with a single vectorized diff plus a boundary mask (a quick consistency check follows the code):
# sort the values so they're chronological order by shop_id
df = df.sort_values(by=['shop_id', 'year', 'month'])
# take the row-wise difference ignoring changes in shop_id
df['diff_revenue'] = df['revenue'].diff()
# zero out locations where shop_id changes (no previous info)
df.loc[df['shop_id'] != df['shop_id'].shift(), 'diff_revenue'] = 0
# Take the absolute value, convert float -> int
df['diff_revenue'] = df['diff_revenue'].abs().astype('int')
# revert to original order
df = df.sort_index()
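A quick way to check that the two approaches agree on the sample data from the question (a sketch; the helper names s1, s2, a and b are my own):

# approach 1: per-group diff
s1 = df.sort_values(by=['year', 'month'])
a = s1.groupby('shop_id')['revenue'].diff().fillna(0).abs().astype('int').sort_index()

# approach 2: single diff, zeroed where shop_id changes
s2 = df.sort_values(by=['shop_id', 'year', 'month'])
b = s2['revenue'].diff()
b[s2['shop_id'] != s2['shop_id'].shift()] = 0
b = b.abs().astype('int').sort_index()

assert a.equals(b)  # both give [0, 0, 500, 1000, 1000, 2000]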
I'm trying some marketing analytics, using R and SQL, on this dataset:
user_id install_date app_version
1 000a0efdaf94f2a5a09ab0d03f92f5bf 2014-12-25 v1
2 000a0efdaf94f2a5a09ab0d03f92f5bf 2014-12-25 v1
3 000a0efdaf94f2a5a09ab0d03f92f5bf 2014-12-25 v1
4 000a0efdaf94f2a5a09ab0d03f92f5bf 2014-12-25 v1
5 000a0efdaf94f2a5a09ab0d03f92f5bf 2014-12-25 v1
6 002a9119a4b3dfb05e0159eee40576b6 2015-12-29 v2
user_session_id event_timestamp app time_seconds
1 f3501a97f8caae8e93764ff7a0a75a76 2015-06-20 10:59:22 draw 682
2 d1fdd0d46f2aba7d216c3e1bfeabf0d8 2015-05-04 18:06:54 build 1469
3 b6b9813985db55a4ccd08f9bc8cd6b4e 2016-01-31 19:27:12 build 261
4 ce644b02c1d0ab9589ccfa5031a40c98 2016-01-31 18:44:01 draw 195
5 7692f450607a0a518d564c0a1a15b805 2015-06-18 15:39:50 draw 220
6 4403b5bc0b3641939dc17d3694403773 2016-03-17 21:45:12 build 644
link for the dataset
I want to create a plot that looks like this [image: stacked bar chart of retention vs. churn percentage by month], but shown once per app version [image: the same chart, repeated for each version]. Basically, it shows the percentage of retention and churn across the months, per version.
This is what I have done so far:
ee <- sqldf("select user_id, count(user_id) as n, strftime('%Y-%m', event_timestamp) as dt, app_version
             from w
             group by user_id, strftime('%Y-%m', event_timestamp), app_version
             having count(*) > 1 order by n desc")
ee
user_id n dt app_version
1 fab9612cea12e2fcefab8080afa10553 238 2015-11 v2
2 fab9612cea12e2fcefab8080afa10553 204 2015-12 v2
3 121d81e4b067e72951e76b7ed8858f4e 173 2016-01 v2
4 121d81e4b067e72951e76b7ed8858f4e 169 2016-02 v2
5 fab9612cea12e2fcefab8080afa10553 98 2015-10 v2
The above shows the unique users that used the app more than once, so these are the population that the retention-rate analysis refers to.
What I am having difficulty with is summarizing each user_id's events over time (via the event_timestamp column) to find the retention/churn outcome, as in the first image I mentioned.
I don't think your question is very clear, nor do I think you know exactly what you're trying to do. You say you're trying to use R to calculate churn and retention, but you've provided no actual R code, only SQL statements that you appear to be running from inside an R environment.
If you want to know the SQL to do all of this in one step, you need to ask a different, better question. However, given that you've provided a .csv file of the data, I have ignored the SQL portion of your question and provided an R solution, including data handling.
library(dplyr)
library(zoo)
library(ggplot2)
library(reshape2)
library(scales)
df <-
  read.csv([location of the .csv on your machine], header = TRUE) %>%
  mutate(month = format(as.Date(event_timestamp), format = "%Y-%m"))

installs <- # calculates all installs for all versions of the app by month
  df %>%
  group_by(user_id) %>%
  slice(1) %>%
  group_by(month) %>%
  summarise(tot_installs = n())

last_use_date <- # finds the last time a user actually used any version of the app (i.e., when they "churned" away)
  df %>%
  group_by(user_id, month) %>%
  summarise(tot_uses = n()) %>%
  group_by(user_id) %>%
  filter(month == max(month)) %>%
  group_by(month) %>%
  summarise(stopped_using = n())
installs %>%
  full_join(last_use_date) %>%
  mutate(cum_sum_install = cumsum(tot_installs),  # running total of installs
         cum_sum_stopped = cumsum(stopped_using), # running total of churned users
         Churn = cum_sum_stopped/cum_sum_install,
         Retention = 1 - Churn) %>%
  select(month, Churn, Retention) %>%
  melt(id.vars = "month") %>% # melt the data frame for easy plotting
  ggplot(aes(x = month, y = value, fill = variable)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(name = "", values = c("red", "blue")) +
  labs(x = "Month", y = "") +
  scale_y_continuous(labels = percent) +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))