I'm doing some marketing analytics using R and SQL. This is the dataset:
user_id install_date app_version
1 000a0efdaf94f2a5a09ab0d03f92f5bf 2014-12-25 v1
2 000a0efdaf94f2a5a09ab0d03f92f5bf 2014-12-25 v1
3 000a0efdaf94f2a5a09ab0d03f92f5bf 2014-12-25 v1
4 000a0efdaf94f2a5a09ab0d03f92f5bf 2014-12-25 v1
5 000a0efdaf94f2a5a09ab0d03f92f5bf 2014-12-25 v1
6 002a9119a4b3dfb05e0159eee40576b6 2015-12-29 v2
user_session_id event_timestamp app time_seconds
1 f3501a97f8caae8e93764ff7a0a75a76 2015-06-20 10:59:22 draw 682
2 d1fdd0d46f2aba7d216c3e1bfeabf0d8 2015-05-04 18:06:54 build 1469
3 b6b9813985db55a4ccd08f9bc8cd6b4e 2016-01-31 19:27:12 build 261
4 ce644b02c1d0ab9589ccfa5031a40c98 2016-01-31 18:44:01 draw 195
5 7692f450607a0a518d564c0a1a15b805 2015-06-18 15:39:50 draw 220
6 4403b5bc0b3641939dc17d3694403773 2016-03-17 21:45:12 build 644
link for the dataset
I want to create a plot that looks like this:
but broken out per app version, like this (just focus on the versions part of that graph - i.e. the same chart as the first picture, repeated once per version):
Basically, it should show the percentage of retention and churn across months, per version.
This is what I have done so far:
ee <- sqldf("select user_id, count(user_id) as n,
                    strftime('%Y-%m', event_timestamp) as dt,
                    app_version
             from w
             group by user_id, strftime('%Y-%m', event_timestamp), app_version
             having count(*) > 1
             order by n desc")
ee
user_id n dt app_version
1 fab9612cea12e2fcefab8080afa10553 238 2015-11 v2
2 fab9612cea12e2fcefab8080afa10553 204 2015-12 v2
3 121d81e4b067e72951e76b7ed8858f4e 173 2016-01 v2
4 121d81e4b067e72951e76b7ed8858f4e 169 2016-02 v2
5 fab9612cea12e2fcefab8080afa10553 98 2015-10 v2
The above shows the unique users that used the app more than once, i.e. the population that the retention rate analysis refers to.
What I am having difficulty with is summarizing each user_id over time, through their events in the event_timestamp column, to get the retention/churn outcome like the first image I mentioned.
I don't think your question is very clear, nor do I think you know exactly what you're trying to do. You say you're trying to use R to calculate churn and retention, but you've provided no actual R code, only SQL statements that you appear to be running from inside an R environment.
If you want to know the SQL to do all of this in one step, you need to ask a different, better question. However, given that you've provided a .csv file of the data, I have ignored the SQL portion of your question and provided an R solution, including data handling.
library(dplyr)
library(zoo)
library(ggplot2)
library(reshape2)
library(scales)

df <-
  read.csv([location of the .csv on your machine], header = TRUE) %>%
  mutate(month = format(as.Date(event_timestamp), format = "%Y-%m"))

installs <- # calculates all installs for all versions of the app by month
  df %>%
  group_by(user_id) %>%
  slice(1) %>%
  group_by(month) %>%
  summarise(tot_installs = n())

last_use_date <- # finds the last time a user actually used any version of the app (i.e., when they "churned" away)
  df %>%
  group_by(user_id, month) %>%
  summarise(tot_uses = n()) %>%
  group_by(user_id) %>%
  filter(month == max(month)) %>%
  group_by(month) %>%
  summarise(stopped_using = n())

installs %>%
  full_join(last_use_date) %>%
  mutate(cum_sum_install = cumsum(tot_installs),
         cum_sum_stopped = cumsum(stopped_using),
         Churn = cum_sum_stopped / cum_sum_install,
         Retention = 1 - Churn) %>%
  select(month, Churn, Retention) %>%
  melt(id.vars = "month") %>% # melt the data frame for easy plotting
  ggplot(aes(x = month, y = value, fill = variable)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(name = "", values = c("red", "blue")) +
  labs(x = "Month", y = "") +
  scale_y_continuous(labels = percent) +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
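Since the question specifically asks for the breakdown per app version, here is a minimal, untested sketch of how the same approach might be extended: carry app_version through the grouping steps and facet the final plot on it. It assumes the .csv also contains the app_version column (as in the sqldf query above) and that each user_id is associated with a single app_version, as in the sample data.

# A sketch only: per-version churn/retention, assuming one app_version per user_id
installs_v <- df %>%
  group_by(user_id) %>%
  slice(1) %>%
  group_by(app_version, month) %>%
  summarise(tot_installs = n())

last_use_v <- df %>%
  group_by(user_id) %>%
  filter(month == max(month)) %>%
  slice(1) %>%
  group_by(app_version, month) %>%
  summarise(stopped_using = n())

installs_v %>%
  full_join(last_use_v) %>%
  arrange(app_version, month) %>%
  group_by(app_version) %>%
  mutate(Churn = cumsum(coalesce(stopped_using, 0L)) / cumsum(coalesce(tot_installs, 0L)),
         Retention = 1 - Churn) %>%
  ungroup() %>%
  select(app_version, month, Churn, Retention) %>%
  melt(id.vars = c("app_version", "month")) %>%
  ggplot(aes(x = month, y = value, fill = variable)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ app_version, ncol = 1) +
  scale_fill_manual(name = "", values = c("red", "blue")) +
  scale_y_continuous(labels = percent) +
  labs(x = "Month", y = "") +
  theme(legend.position = "bottom",
        axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))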
This is my code:
p14 <- ggplot(plot14, aes(x = Harvest, y = Percentage, fill = factor(Plant, level = orderplants))) +
  geom_col(show.legend = FALSE) +
  geom_vline(xintercept = 3.5) +
  labs(y = "Bedekking %",
       x = NULL,
       fill = "Plantensoort") +
  theme_classic()
plot 14
The code is about plant coverage of a plot (I have 70 plots in total); "bedekking" is the Dutch word for coverage. The problem is that the numbers on the x-axis represent the time periods of the measurements:
July 2020
August 2020
October 2020
May 2021
June 2021
July 2021
August 2021
October 2021
I would like the bars of each month to line up, so there would be two rows (2020 and 2021) where the bars of the same months are above/below each other (see ugly sketch below). Is this possible to code, or do I need to change my entire dataset?
very quick example of goal
It would be better if you could include some sample raw data as part of a reproducible example. I've created a little made-up data to illustrate.
Ideally, you'd want the raw dates behind the numbered time measurements so that you can split the dates into months and years as separate variables. Assuming you only have the numbers, you could create some logic like this to make the months and years.
And you could use facet_wrap to show one year above another for the respective months.
library(tidyverse)
library(scales)

tibble(
  harvest = seq(1, 8, 1),
  percentage = rep(1, 8),
  plant = rep(c("this", "that"), 4)
) |>
  mutate(
    month = case_when(
      harvest %in% c(1, 6) ~ 7,
      harvest %in% c(2, 7) ~ 8,
      harvest %in% c(3, 8) ~ 10,
      harvest %in% c(4) ~ 5,
      harvest %in% c(5) ~ 6,
      TRUE ~ NA_real_
    ),
    year = case_when(
      harvest <= 3 ~ 2020,
      TRUE ~ 2021
    )
  ) |>
  ggplot(aes(month, percentage, fill = plant)) +
  geom_col(show.legend = FALSE) +
  labs(
    y = "Bedekking %",
    x = NULL,
    fill = "Plantensoort"
  ) +
  facet_wrap(~year, ncol = 1) +
  scale_y_continuous(labels = label_percent()) +
  theme_classic()
Created on 2022-05-13 by the reprex package (v2.0.1)
Is there a way to use numpy to add numbers in a series up to a threshold, then restart the counter? The intention is to form a groupby based on the categories created.
amount price
0 27 22.372505
1 17 126.562276
2 33 101.061767
3 78 152.076373
4 15 103.482099
5 96 41.662766
6 108 98.460743
7 143 126.125865
8 82 87.749286
9 70 56.065133
The only solutions I found iterate with .loc, which is slow. I tried building a solution based on this answer: https://stackoverflow.com/a/56904899
sumvals = np.frompyfunc(lambda a,b: a+b if a <= 100 else b,2,1)
df['cumvals'] = sumvals.accumulate(df['amount'], dtype=np.object)
The use-case is to find the average price of every 75 sold amounts of the thing.
Solution #1
Interpreting the stated use-case ("find the average price of every 75 sold amounts of the thing") one way gives the solution below. If you want to do this calculation the "hard way" instead of with pd.cut, this works well, but its speed and memory depend on the cumulative sum of the amount column, which you can check with df['amount'].cumsum(). The output takes about 1 second per 10 million of that cumsum, because that is how many rows np.repeat creates. So this solution is still reasonable if the cumsum is under ~10 million (about 1 second) or even 100 million (about 10 seconds):
i = 75
df = np.repeat(df['price'], df['amount']).to_frame().reset_index(drop=True)
g = df.index // i
df = df.groupby(g)['price'].mean()
df.index = (df.index * i).astype(str) + '-' + (df.index * i +75).astype(str)
df
Out[1]:
0-75 78.513748
75-150 150.715984
150-225 61.387540
225-300 67.411182
300-375 98.829611
375-450 126.125865
450-525 122.032363
525-600 87.326831
600-675 56.065133
Name: price, dtype: float64
Solution #2 (I believe this is wrong, but keeping it just in case)
I do not believe you are trying to do it this way, which was my initial solution, but I will keep it here just in case, as you haven't included the expected output. You can create a new series with cumsum and then use pd.cut, passing bins=np.arange(0, df['Group'].max(), 75) to create groups of cumulative 75. Then, groupby the groups of cumulative 75 and take the mean. Finally, use pd.IntervalIndex to clean up the format and convert to a string:
df['Group'] = df['amount'].cumsum()
s = pd.cut(df['Group'], bins=np.arange(0, df['Group'].max(), 75))
df = df.groupby(s)['price'].mean().reset_index()
df['Group'] = pd.IntervalIndex(df['Group']).left.astype(str) + '-' + pd.IntervalIndex(df['Group']).right.astype(str)
df
Out[1]:
Group price
0 0-75 74.467390
1 75-150 101.061767
2 150-225 127.779236
3 225-300 41.662766
4 300-375 98.460743
5 375-450 NaN
6 450-525 126.125865
7 525-600 87.749286
I would like to know whether there is a way of importing JSON data from a MySQL DB to an R dataframe.
I have a table like this:
id created_at json
1 2020-07-01 {"name":"Dent, Arthur","group":"Green","age (y)":43,"height (cm)":187,"wieght (kg)":89,"sensor":34834834}
2 2020-07-01 {"name":"Doe, Jane","group":"Blue","age (y)":23,"height (cm)":172,"wieght (kg)":67,"sensor":12342439}
3 2020-07-01 {"name":"Curt, Travis","group":"Red","age (y)":13,"height (cm)":128,"wieght (kg)":47,"sensor":83287699}
I would like to get the columns 'id' and 'json'.
I am using the RMySQL package to get the data from the DB into an R dataframe, but this gives me only the column 'id'; the column 'json' contains only NAs in each row.
Is there any way how to import/load the data and get the json column displayed? And possibly to extract the "sensor" part of the json values?
The result would be a dataframe (df) like this:
id json
1 {"name":"Dent, Arthur","group":"Green","age (y)":43,"height (cm)":187,"wieght (kg)":89,"sensor":34834834}
2 {"name":"Doe, Jane","group":"Blue","age (y)":23,"height (cm)":172,"wieght (kg)":67,"sensor":12342439}
3 {"name":"Curt, Travis","group":"Red","age (y)":13,"height (cm)":128,"wieght (kg)":47,"sensor":83287699}
Or with the extracted value:
id sensor
1 "sensor":34834834
2 "sensor":12342439
3 "sensor":83287699
Thank you very much for any suggestions.
Using unnest_wider from tidyr
library(dplyr)
con <- DBI::dbConnect(RMySQL::MySQL(), 'db_name', user = 'user', password = 'pass', host = 'hostname')
t <- tbl(con, 'table_name')
t %>%
  as_tibble() %>%
  transmute(j = purrr::map(json, jsonlite::fromJSON)) %>%
  tidyr::unnest_wider(j)
DBI::dbDisconnect(con)
Result:
# A tibble: 3 x 6
name group `age (y)` `height (cm)` `wieght (kg)` sensor
<chr> <chr> <int> <int> <int> <int>
1 Dent, Arthur Green 43 187 89 34834834
2 Doe, Jane Blue 23 172 67 12342439
3 Curt, Travis Red 13 128 47 83287699
If you want to only retrieve data from the last 24 hours (as the OP requested) change the tbl(con, 'table_name') statement to:
t <- DBI::dbGetQuery(con, 'SELECT * FROM `table_name` WHERE DATE(`created_at`) > NOW() - INTERVAL 1 DAY')
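If you only need the "sensor" value rather than all of the JSON fields, a minimal sketch (assuming the same t and column names as above) could parse each JSON string and keep just that element:

# Sketch: extract only the "sensor" element from each JSON string
t %>%
  as_tibble() %>%
  transmute(id,
            sensor = purrr::map_dbl(json, ~ jsonlite::fromJSON(.x)$sensor))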
Converting your JSON response to a data frame should be straightforward, but because the structure of a JSON response is essentially arbitrary and you haven't given us details of how you obtain it or the exact details of its content, it's impossible to give you code that will work in your specific case. However, this is the basic process that works in one of my applications, starting with the POST call to the API that provides access to the database.
library(httr)
library(jsonlite)
# Query the API
response <- POST(<your code here>)
# Extract the content of the response. Amend the format and encoding if necessary.
content <- content(response, as="text", encoding="UTF-8")
# Convert the content to an R object
content <- fromJSON(content, flatten=FALSE)
# Coerce to data.frame
df <- as.data.frame(content)
You should, of course, incorporate error and status checking throughout the process.
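For instance, a minimal sketch of such checking with httr's built-in helpers might look like this:

# Sketch of basic status checking; adapt to your own error-handling strategy
if (http_error(response)) {
  stop_for_status(response)  # raises an R error carrying the HTTP status
}
status_code(response)        # e.g. 200 on success, for manual inspection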
Note: your data contains a spelling mistake. "wieght" should be "weight".
I'm trying to write code that will loop through a list of integers, which relate to a number of sensors, to provide summary statistics (at this stage just cor()).
# GOOD TO HERE
corr_table <- data.frame(ID = integer(),
                         HxT = double())

for (j in gt_thrsh_key) { # currently set to 2:5 for testing - it's a list of sensors I want to summarise
  # extract humidity and time vectors
  x <- sqldf(sprintf("SELECT humidity FROM data_agg_2 WHERE ID = %s", j))
  y <- sqldf(sprintf("SELECT time_elapsed FROM data_agg_2 WHERE ID = %s", j))
  # format into row
  new_row <- data.frame(ID = c(j), HxT = c(cor(x, y))) # insert new variables into row
  # append to dataframe
  corr_table <- rbind(corr_table, new_row)
  print(sprintf("Sensor %s has been summarised.", j)) # check 1
  print(cor(x, y)) # check 2
}

print(corr_table)
assign("data_agg_2", data_agg_2, envir = .GlobalEnv)
I get output:
[1] "Sensor 2 has been summarised." "Sensor 3 has been summarised." "Sensor 4 has been summarised." "Sensor 5 has been summarised."
humidity -0.08950285
ID HxT
1 2 -0.08950285 #INCORRECT
2 3 -0.08950285 #INCORRECT
3 4 -0.08950285 #INCORRECT
4 5 -0.08950285 #correct
This is only the correct measurement for the final iteration of the loop (ID = 5), so somehow I must be overwriting previous entries. Does anyone know why this is happening? Or can you recommend a better way to perform this loop?
Thanks!!
EDIT: check 2, which prints the cor() of x and y through the loop, confirms that only the final run of the loop calculates a value. Has anyone seen this before?
Here is a base R solution that uses lapply() to generate the correlations and write them to a list(). The list is converted to a data frame with do.call(rbind,...).
# simulate some data
set.seed(19041798) # ensure consistency across multiple runs
ID <- rep(1:10, 20)
humidity <- rnorm(200, mean = 30, sd = 15)
elapsed_time <- rpois(200, 2.5)
data <- data.frame(ID, humidity, elapsed_time)

uniqueIDs <- unique(data$ID)

correlationList <- lapply(uniqueIDs, function(x) {
  y <- subset(data, ID == x)
  HxT <- cor(y$humidity, y$elapsed_time)
  # return as data frame
  data.frame(ID = x, HxT = HxT)
})

correlations <- do.call(rbind, correlationList)
...and the output:
> correlations
ID HxT
1 1 -0.1805885
2 2 -0.3166290
3 3 0.1749233
4 4 -0.2517737
5 5 0.1428092
6 6 0.3112812
7 7 -0.3180825
8 8 0.3774637
9 9 -0.3790178
10 10 -0.3070866
>
sqldf() version
We can restructure the code from the original post so it extracts all the data it needs through a single SQL query, and performs all subsequent processing in R.
First, we simulate 60,000 rows of data.
set.seed(19041798) # ensure consistency across multiple runs
ID <- rep(1:30,2000)
humidity <- rnorm(60000,mean = 30,sd = 15)
elapsed_time <- rpois(60000,2.5)
data <- data.frame(ID,humidity, elapsed_time)
Next, we extract data for the first 5 sensors from the data with sqldf(), as well as the vector of uniqueIDs.
library(sqldf)
# select ID <= 5
sqlStmt <- "select ID, humidity,elapsed_time from data where ID <= 5"
dataSubset <- sqldf(sqlStmt)
sqlStmt <- "select distinct ID from data where ID <= 5"
uniqueIDs <- sqldf(sqlStmt)[[1]]
At this point, the dataSubset data frame has 10,000 observations. We use lapply() with the vector of uniqueIDs to generate correlations by ID, count the complete.cases() included in each correlation, and write the results to a list of data frames.
correlationList <- lapply(uniqueIDs, function(x) {
  y <- subset(dataSubset, ID == x)
  count <- sum(complete.cases(y)) # number of obs included in cor()
  HxT <- cor(y$humidity, y$elapsed_time)
  # return as data frame
  data.frame(ID = x, count = count, HxT = HxT)
})
Finally, a do.call(rbind,...) and a print, and we have our list of correlations including counts of rows used to calculate the correlation.
correlations <- do.call(rbind,correlationList)
correlations
...and the output:
> correlations
ID count HxT
1 1 2000 0.015640244
2 2 2000 0.017143573
3 3 2000 -0.011283180
4 4 2000 0.052482666
5 5 2000 0.002083603
>
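As an aside, if you're already working in the tidyverse, a minimal dplyr sketch of the same per-sensor correlation (using the simulated data above) could be:

# Sketch: per-sensor correlation with dplyr instead of lapply()
library(dplyr)

data %>%
  filter(ID <= 5) %>%
  group_by(ID) %>%
  summarise(count = n(),  # rows per sensor (all complete in the simulated data)
            HxT = cor(humidity, elapsed_time, use = "complete.obs"))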
I have a DataFrame that consists of many stacked time series. The index is (poolId, month) where both are integers, the "month" being the number of months since 2000. What's the best way to calculate one-month lagged versions of multiple variables?
Right now, I do something like:
cols_to_shift = ["bal", ...5 more columns...]
df_shift = df[cols_to_shift].groupby(level=0).transform(lambda x: x.shift(-1))
For my data, this took me a full 60 s to run. (I have 48k different pools and a total of 718k rows.)
I'm converting this from R code and the equivalent data.table call:
dt.shift <- dt[, list(bal=myshift(bal), ...), by=list(poolId)]
only takes 9 s to run. (Here "myshift" is something like "function(x) c(x[-1], NA)".)
Is there a way I can get the pandas version to be back in line speed-wise? I tested this on 0.8.1.
Edit: Here's an example of generating a close-enough data set, so you can get some idea of what I mean:
ids = np.arange(48000)
lens = np.maximum(np.round(15+9.5*np.random.randn(48000)), 1.0).astype(int)
id_vec = np.repeat(ids, lens)
lens_shift = np.concatenate(([0], lens[:-1]))
mon_vec = np.arange(lens.sum()) - np.repeat(np.cumsum(lens_shift), lens)
n = len(mon_vec)
df = pd.DataFrame.from_items([('pool', id_vec), ('month', mon_vec)] + [(c, np.random.rand(n)) for c in 'abcde'])
df = df.set_index(['pool', 'month'])
%time df_shift = df.groupby(level=0).transform(lambda x: x.shift(-1))
That took 64 s when I tried it. This data has every series starting at month 0; really, they should all end at month np.max(lens), with ragged start dates, but good enough.
Edit 2: Here's some comparison R code. This takes 0.8 s. Factor of 80, not good.
library(data.table)
ids <- 1:48000
lens <- as.integer(pmax(1, round(rnorm(ids, mean=15, sd=9.5))))
id.vec <- rep(ids, times=lens)
lens.shift <- c(0, lens[-length(lens)])
mon.vec <- (1:sum(lens)) - rep(cumsum(lens.shift), times=lens)
n <- length(id.vec)
dt <- data.table(pool=id.vec, month=mon.vec, a=rnorm(n), b=rnorm(n), c=rnorm(n), d=rnorm(n), e=rnorm(n))
setkey(dt, pool, month)
myshift <- function(x) c(x[-1], NA)
system.time(dt.shift <- dt[, list(month=month, a=myshift(a), b=myshift(b), c=myshift(c), d=myshift(d), e=myshift(e)), by=pool])
I would suggest you reshape the data and do a single shift versus the groupby approach:
result = df.unstack(0).shift(1).stack()
This switches the order of the levels so you'd want to swap and reorder:
result = result.swaplevel(0, 1).sortlevel(0)
You can verify it's been lagged by one period (you want shift(1) instead of shift(-1)):
In [17]: result.ix[1]
Out[17]:
a b c d e
month
1 0.752511 0.600825 0.328796 0.852869 0.306379
2 0.251120 0.871167 0.977606 0.509303 0.809407
3 0.198327 0.587066 0.778885 0.565666 0.172045
4 0.298184 0.853896 0.164485 0.169562 0.923817
5 0.703668 0.852304 0.030534 0.415467 0.663602
6 0.851866 0.629567 0.918303 0.205008 0.970033
7 0.758121 0.066677 0.433014 0.005454 0.338596
8 0.561382 0.968078 0.586736 0.817569 0.842106
9 0.246986 0.829720 0.522371 0.854840 0.887886
10 0.709550 0.591733 0.919168 0.568988 0.849380
11 0.997787 0.084709 0.664845 0.808106 0.872628
12 0.008661 0.449826 0.841896 0.307360 0.092581
13 0.727409 0.791167 0.518371 0.691875 0.095718
14 0.928342 0.247725 0.754204 0.468484 0.663773
15 0.934902 0.692837 0.367644 0.061359 0.381885
16 0.828492 0.026166 0.050765 0.524551 0.296122
17 0.589907 0.775721 0.061765 0.033213 0.793401
18 0.532189 0.678184 0.747391 0.199283 0.349949
In [18]: df.ix[1]
Out[18]:
a b c d e
month
0 0.752511 0.600825 0.328796 0.852869 0.306379
1 0.251120 0.871167 0.977606 0.509303 0.809407
2 0.198327 0.587066 0.778885 0.565666 0.172045
3 0.298184 0.853896 0.164485 0.169562 0.923817
4 0.703668 0.852304 0.030534 0.415467 0.663602
5 0.851866 0.629567 0.918303 0.205008 0.970033
6 0.758121 0.066677 0.433014 0.005454 0.338596
7 0.561382 0.968078 0.586736 0.817569 0.842106
8 0.246986 0.829720 0.522371 0.854840 0.887886
9 0.709550 0.591733 0.919168 0.568988 0.849380
10 0.997787 0.084709 0.664845 0.808106 0.872628
11 0.008661 0.449826 0.841896 0.307360 0.092581
12 0.727409 0.791167 0.518371 0.691875 0.095718
13 0.928342 0.247725 0.754204 0.468484 0.663773
14 0.934902 0.692837 0.367644 0.061359 0.381885
15 0.828492 0.026166 0.050765 0.524551 0.296122
16 0.589907 0.775721 0.061765 0.033213 0.793401
17 0.532189 0.678184 0.747391 0.199283 0.349949
Perf isn't too bad with this method (it might be a touch slower in 0.9.0):
In [19]: %time result = df.unstack(0).shift(1).stack()
CPU times: user 1.46 s, sys: 0.24 s, total: 1.70 s
Wall time: 1.71 s