How to separate a field with tidyr::separate, keep the separator, and use a negative lookbehind - tidyr

I would like to use tidyr::separate with a negative lookbehind and keep the separator. My solution
below does not keep the first capital letter of the last name.
The answer to the question linked below does not use a negative lookbehind, and I can't figure out
how to modify it to use one.
How do I split a string with tidyr::separate in R and retain the values of the separator string?
tidyr::tibble(myname = c("HarlanNelson")) |>
tidyr::separate(col = myname, into = c("first", "last"), sep = "(?<!^)[[:upper:]]")
#> # A tibble: 1 × 2
#> first last
#> <chr> <chr>
#> 1 Harlan elson
Created on 2022-10-20 by the reprex package (v2.0.1)
tidyr::tibble(myname = c("HarlanNelson", "Another Person")) |>
tidyr::separate(col = myname, into = c("first", "last"), sep = c(" ", "(?<!^)[[:upper:]]"))
#> Warning in gregexpr(pattern, x, perl = TRUE): argument 'pattern' has length > 1
#> and only the first element will be used
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].
#> # A tibble: 2 × 2
#> first last
#> <chr> <chr>
#> 1 HarlanNelson <NA>
#> 2 Another Person
tidyr::tibble(myname = c("HarlanNelson", "Another Person", "someone else")) |>
tidyr::separate(col = myname, into = c("first", "last"), sep = c(" ", "(?<!^)[[:upper:]]"))
#> Warning in gregexpr(pattern, x, perl = TRUE): argument 'pattern' has length > 1
#> and only the first element will be used
#> Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].
#> # A tibble: 3 × 2
#> first last
#> <chr> <chr>
#> 1 HarlanNelson <NA>
#> 2 Another Person
#> 3 someone else

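This is what I figured out.
It is really just working through the answer at
https://stackoverflow.com/a/51415101/4629916
from #cameron
and applying it to my problem. Both patterns below consist only of lookarounds, so they match a zero-width position: the split consumes no character and the capital letter is kept.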
tidyr::tibble(myname = c("HarlanNelson", "Another Person", "someone else")) |>
tidyr::separate(col = myname, into = c("first", "last"), sep = "(?<=[[:lower:]])(?=[[:upper:]])", extra = 'merge', fill = 'right') |>
tidyr::separate(col = first, into = c("first", "last2"), sep = " ", fill = 'right', extra = 'merge') |>
dplyr::mutate(last = dplyr::coalesce(last, last2)) |>
dplyr::select(-last2)
#> # A tibble: 3 × 2
#> first last
#> <chr> <chr>
#> 1 Harlan Nelson
#> 2 Another Person
#> 3 someone else
tidyr::tibble(myname = c("HarlanNelson", "Another Person", "someone else")) |>
tidyr::separate(col = myname, into = c("first", "last"), sep = "(?<!^)(?=[[:upper:]])", extra = 'merge', fill = 'right') |>
tidyr::separate(col = first, into = c("first", "last2"), sep = " ", extra = 'merge', fill = 'right') |>
dplyr::mutate(last = dplyr::coalesce(last, last2)) |>
dplyr::select(-last2)
#> # A tibble: 3 × 2
#> first last
#> <chr> <chr>
#> 1 Harlan Nelson
#> 2 Another Person
#> 3 someone else

Related

How can I get the percentage makeup of TRUE from a vector?

This is the code used to derive the first table in my question.
JH %>% group_by(ATT_ID, CAR=="B") %>%
summarize(count = n(), .groups = "drop")
ATT_ID  CAR == "B"  Count
ONE     FALSE       1
TWO     TRUE        1
THREE   TRUE        3
THREE   FALSE       5
FOUR    FALSE       2
FIVE    TRUE        4
SIX     TRUE        8
SIX     FALSE       4
How can I get the table above to look like:
ATT_ID  Percentage of "B"
ONE     0%
TWO     100%
THREE   37.5%
FOUR    0%
FIVE    100%
SIX     67%
Notice that some IDs appear twice because they have both FALSE and TRUE rows, whereas other IDs appear once because they have only one or the other.
Thank you
You can do the following:
dt %>%
group_by(ATT_ID) %>%
summarize(perc = sprintf("%3.1f%%", 100*sum(Count*`CAR =="B"`)/sum(Count)))
Output:
# A tibble: 6 × 2
ATT_ID perc
<chr> <chr>
1 FIVE 100.0%
2 FOUR 0.0%
3 ONE 0.0%
4 SIX 66.7%
5 THREE 37.5%
6 TWO 100.0%
Input:
dt <- structure(list(ATT_ID = c("ONE", "TWO", "THREE", "THREE", "FOUR",
"FIVE", "SIX", "SIX"), `CAR =="B"` = c(FALSE, TRUE, TRUE, FALSE,
FALSE, TRUE, TRUE, FALSE), Count = c(1, 1, 3, 5, 2, 4, 8, 4)), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -8L))

Merge rows with same id, different values in 1 column to multiple columns

What I have: the number of rows per id can differ, so sometimes one id has 4 rows with different values in column val. The other columns have the same value on every row.
df1 = pd.DataFrame({'id':[1,1,1,2,2,2,3,3,3], 'val': ['06123','nick','#gmail','06454','abey','#gmail','06888','sisi'], 'media': ['nrc','nrc','nrc','nrc','nrc','nrc','nrc','nrc']})
What I need:
id  kolom 1  kolom 2  kolom 3  media
1   06123    nick     #gmail   nrc
2   06454    abey     #gmail   nrc
3   06888    sisi     None     nrc
I hope I gave a good example, in the corrected way, thanks for the help
df2 = df1.groupby('id').agg(list)
df2['col 1'] = df2['val'].apply(lambda x: x[0] if len(x) > 0 else 'None')
df2['col 2'] = df2['val'].apply(lambda x: x[1] if len(x) > 1 else 'None')
df2['col 3'] = df2['val'].apply(lambda x: x[2] if len(x) > 2 else 'None')
df2['media'] = df2['media'].apply(lambda x: x[0] if len(x) > 0 else 'None')
df2.drop(columns='val')
Here is another way. Since the lists in your original dataframe don't all have the same length (which will get you a ValueError), you can define it as:
import numpy as np
import pandas as pd

data = {"id":[1,1,1,2,2,2,3,3,3],
"val": ["06123","nick","#gmail","06454","abey","#gmail","06888","sisi"],
"media": ["nrc","nrc","nrc","nrc","nrc","nrc","nrc","nrc"]}
df = pd.DataFrame.from_dict(data, orient="index")
df = df.transpose()
>>> df
id val media
0 1 06123 nrc
1 1 nick nrc
2 1 #gmail nrc
3 2 06454 nrc
4 2 abey nrc
5 2 #gmail nrc
6 3 06888 nrc
7 3 sisi nrc
8 3 NaN NaN
Afterwards, you can replace the np.nan values with an empty string, so that you can group by your id column and join the values in val, separated by a comma.
df = df.replace(np.nan, "", regex=True)
df_new = df.groupby(["id"])["val"].apply(lambda x: ",".join(x)).reset_index()
>>> df_new
id val
0 1.0 06123,nick,#gmail
1 2.0 06454,abey,#gmail
2 3.0 06888,sisi,
Then, you only need to transform the new val column into 3 columns by splitting the string inside, with any method you want. For example,
new_cols = df_new["val"].str.split(",", expand=True) # Good ol' split
df_new["kolom 1"] = new_cols[0] # Assign to new columns
df_new["kolom 2"] = new_cols[1]
df_new["kolom 3"] = new_cols[2]
df_new.drop(columns="val", inplace=True) # Delete previous val (positional axis is not accepted by pandas 2.x)
df_new["media"] = "nrc" # Add the media column again
df_new = df_new.replace("", np.nan, regex=True) # If necessary, replace empty string with np.nan
>>> df_new
id kolom 1 kolom 2 kolom 3 media
0 1.0 06123 nick #gmail nrc
1 2.0 06454 abey #gmail nrc
2 3.0 06888 sisi NaN nrc
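As an alternative to joining and re-splitting strings, the same reshaping can be sketched with a per-id row counter and a pivot. This is only a sketch: it assumes the data has been entered with equal-length lists (so id 3 simply has two rows), the helper column name pos is made up for illustration, and media is assumed to be constant within each id.

```python
import pandas as pd

# Toy data mirroring the question; id 3 has only two values (assumption:
# the lists are entered with equal length, unlike the original snippet).
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 3, 3],
    "val": ["06123", "nick", "#gmail", "06454", "abey", "#gmail", "06888", "sisi"],
    "media": ["nrc"] * 8,
})

# Number the rows within each id: 1, 2, 3, ... ("pos" is a hypothetical name).
df["pos"] = df.groupby("id").cumcount() + 1

# Spread val across columns, one column per position; shorter groups get NaN.
wide = df.pivot(index="id", columns="pos", values="val")
wide.columns = [f"kolom {i}" for i in wide.columns]

# Re-attach media, taking the first value per id (assumed constant within an id).
wide["media"] = df.groupby("id")["media"].first()
wide = wide.reset_index()
print(wide)
```

Because the column names are derived from the counter, an id with four values would produce a "kolom 4" column without any code change.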

Fix a pie-chart plot to make it more readable by adding a Legend

I need to fix the following pie chart: I need a legend with the names of the countries (the percentages are fine as they are).
COMP_plot <- comp_plot %>%
select(-commodity) %>%
pivot_longer(cols = names(.)) %>%
mutate(name = factor(name, levels = rev(name))) %>%
mutate(position = cumsum(lag(value, default = 0)) + value/2) %>%
ggplot(aes(x = value, y = 1, fill = name)) +
geom_col() +
geom_bar(stat = "identity",color = "black") +
geom_text(aes(x = position, y = 2.3, label = name)) +
geom_text(aes(x = position, y = 1.7, label = value), colour = "black") +
coord_polar() +
theme_void() +
guides(fill = "none") +
labs(title = "Cotton Exports by country of origin")
COMP_plot
this is the data frame I used:
> comp_plot
# A tibble: 1 x 7
commodity China Australia India Rep.of.Korea Thailand ROW
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Cotton 14.1 2.1 6.2 2.2 1.1 74.3

geom_violin using the weight aesthetic unexpectedly drops levels

library(tidyverse)
set.seed(12345)
dat <- data.frame(year = c(rep(1990, 100), rep(1991, 100), rep(1992, 100)),
fish_length = sample(x = seq(from = 10, 131, by = 0.1), 300, replace = F),
nb_caught = sample(x = seq(from = 1, 200, by = 0.1), 300, replace = T),
stringsAsFactors = F) %>%
mutate(age = ifelse(fish_length < 20, 1,
ifelse(fish_length >= 20 & fish_length < 100, 2,
ifelse(fish_length >= 100 & fish_length < 130, 3, 4)))) %>%
arrange(year, fish_length)
head(dat)
year fish_length nb_caught age
1 1990 10.1 45.2 1
2 1990 10.7 170.0 1
3 1990 10.9 62.0 1
4 1990 12.1 136.0 1
5 1990 14.1 80.8 1
6 1990 15.0 188.9 1
dat %>% group_by(year) %>% summarise(ages = n_distinct(age)) # Only 1992 has age 4 fish
# A tibble: 3 x 2
year ages
<dbl> <int>
1 1990 3
2 1991 3
3 1992 4
dat %>% filter(age == 4) # only 1 row for age 4
year fish_length nb_caught age
1 1992 130.8 89.2 4
Here:
year = year of sampling
fish_length = length of the fish in cm
nb_caught = number of fish caught following the use of an age-length key, hence explaining the presence of decimals
age = age of the fish
graph1: geom_violin not using the weight aesthetic.
Here, I have to copy each row of dat according to the value found in nb_caught.
dim(dat) # 300 rows
dat_graph1 <- dat[rep(1:nrow(dat), floor(dat$nb_caught)), ]
dim(dat_graph1) # 30932 rows
dat_graph1$nb_caught <- NULL # useless now
sum(dat$nb_caught) - nrow(dat_graph1) # 128.2 rows lost here
Since I have decimal values of nb_caught, I took the integer value to create dat_graph1. I lost 128.2 "rows" in the process.
Now for the graph:
dat_tile <- data.frame(year = sort(unique(dat$year))[sort(unique(dat$year)) %% 2 == 0])
# for the figure's background
graph1 <- ggplot(data = dat_graph1,
aes(x = as.factor(year), y = fish_length, fill = as.factor(age),
color = as.factor(age), .drop = F)) +
geom_tile(data = dat_tile, aes(x = factor(year), y = 1, height = Inf, width = 1),
fill = "grey80", inherit.aes = F) +
geom_violin(draw_quantiles = c(0.05, 0.5, 0.95), color = "black",
scale = "width", position = "dodge") +
scale_x_discrete(expand = c(0,0)) +
labs(x = "Year", y = "Fish length", fill = "Age", color = "Age", title = "graph1") +
scale_fill_brewer(palette = "Paired", drop = F) + # drop = F for not losing levels
scale_color_brewer(palette = "Paired", drop = F) + # drop = F for not losing levels
scale_y_continuous(expand = expand_scale(mult = 0.01)) +
theme_bw()
graph1
Note here that I have a flat bar for age 4 in year 1992.
dat_graph1 %>% filter(year == 1992, age == 4) %>% pull(fish_length) %>% unique
[1] 130.8
That is because I only have one length for that particular year-age combination.
graph2: geom_violin using the weight aesthetic.
Now, instead of copying each row of dat by the value of nb_caught, let's use the weight aesthetic.
Let's calculate the weight wt that each row of dat will have in the calculation of the density curve of each year-age combination.
dat_graph2 <- dat %>%
group_by(year, age) %>%
mutate(wt = nb_caught / sum(nb_caught)) %>%
as.data.frame()
head(dat_graph2)
year fish_length nb_caught age wt
1 1990 10.1 45.2 1 0.03573123
2 1990 10.7 170.0 1 0.13438735
3 1990 10.9 62.0 1 0.04901186
4 1990 12.1 136.0 1 0.10750988
5 1990 14.1 80.8 1 0.06387352
6 1990 15.0 188.9 1 0.14932806
graph2 <- ggplot(data = dat_graph2,
aes(x = as.factor(year), y = fish_length, fill = as.factor(age),
color = as.factor(age), .drop = F)) +
geom_tile(data = dat_tile, aes(x = factor(year), y = 1, height = Inf, width = 1),
fill = "grey80", inherit.aes = F) +
geom_violin(aes(weight = wt), draw_quantiles = c(0.05, 0.5, 0.95), color = "black",
scale = "width", position = "dodge") +
scale_x_discrete(expand = c(0,0)) +
labs(x = "Year", y = "Fish length", fill = "Age", color = "Age", title = "graph2") +
scale_fill_brewer(palette = "Paired", drop = F) + # drop = F for not losing levels
scale_color_brewer(palette = "Paired", drop = F) + # drop = F for not losing levels
scale_y_continuous(expand = expand_scale(mult = 0.01)) +
theme_bw()
graph2
dat_graph2 %>% filter(year == 1992, age == 4)
year fish_length nb_caught age wt
1 1992 130.8 89.2 4 1
Note here that the flat bar for age 4 in year 1992 seen on graph1 has been dropped here even though the line exists in dat_graph2.
My questions
Why is the age 4 in 1992 level dropped when using the weight aesthetic? How can I overcome this?
Why are the two graphs not visually alike even though they used the same data?
Thanks in advance for your help!
1.
Problem 1 is not related to the weight aesthetic. You can check this by dropping the weight aesthetic from the code for your second graph. The problem is that the algorithm for computing the density fails when there are too few observations.
That is why age 4 shows up in graph 1, which uses the expanded dataset: replicating rows increases the number of observations.
Unfortunately, geom_violin gives no warning in your specific case. However, if you filter dat_graph2 for age == 4, geom_violin gives you the warning
Warning message:
Computation failed in `stat_ydensity()`:
replacement has 1 row, data has 0
geom_density is much clearer on this issue: it warns that groups with fewer than two observations have been dropped.
Unfortunately, I have no solution for this other than working with the expanded dataset.
2.
Concerning problem 2, I have no convincing answer. My guess is that it is related to the details of the kernel density estimator used by geom_violin, geom_density, etc., and perhaps also to the number of data points.

Generating a non BOOLEAN result to a CSV using R

I have taken over a report from a colleague who has left the organisation. The report is written in R; it accesses an Oracle database and runs SQL scripts from R.
The R code then pushes the data out to a csv. The difficulty I have is that it is pushing TRUE, FALSE or a blank cell to the csv. I want to update the code to print Updated, Not Updated, or a blank cell. I am struggling to find the correct place to make these updates.
Below is the code that generates the CSV, as well as a snippet that the code refers to, that may be generating the TRUE and FALSE that I referred to. Any help would be greatly appreciated.
# prepare a data frame based on Today_Active_address_wf for printing.
ForPrintONLYToday_Active_address_wf <- Today_Active_address_wf
# add an empty row to dataframe
ForPrintONLYToday_Active_address_wf[nrow(ForPrintONLYToday_Active_address_wf)+1, ] <- NA
# write today Active address wf data frame to .csv file
filename <- paste("Daily Address Change Workflow Report___", format(Sys.time(), "%Y-%m-%d__%Hh%M"), ".csv", sep ="")
filelocation <- "\\\\DCV-PANAPP-P001\\File Share\\BIS\\Data Administration\\Daily Workflow Reports\\"
filewrite <- paste(filelocation, filename, sep = "")
write.csv(ForPrintONLYToday_Active_address_wf, file = filewrite, row.names = FALSE, na="")
# add additional info (totals for each column)
addInfo1 <- c("Total_duplicated_WF"
,"Total_inconsistent_wf_addr"
,"Total_QAS_validated_addr"
,"Total_invalid_Wf_addr_date"
,"Postal_unmatched_with_valid_WF"
,"Total_No_Policy"
,"Total_home_risk"
,"Total_motor_risk"
,"Total_risk_addr_change"
)
addInfo2 <- c(sum(ForPrintONLYToday_Active_address_wf$dupCloseFlag, na.rm = TRUE)
,sum(ForPrintONLYToday_Active_address_wf$inconsistentDataFlag, na.rm = TRUE)
,sum(ForPrintONLYToday_Active_address_wf$WF_ADDRESS_VALIDATED, na.rm = TRUE)
,sum(ForPrintONLYToday_Active_address_wf$valid_addr_date == FALSE, na.rm = TRUE)
,sum(ForPrintONLYToday_Active_address_wf$MatchedPO_WF_addr == FALSE & ForPrintONLYToday_Active_address_wf$valid_addr_date == TRUE, na.rm = TRUE)
,sum(ForPrintONLYToday_Active_address_wf$hasNOpolicy, na.rm = TRUE)
,sum(ForPrintONLYToday_Active_address_wf$hasHomeRisk, na.rm = TRUE)
,sum(ForPrintONLYToday_Active_address_wf$hasMotorRisk, na.rm = TRUE)
,sum(ForPrintONLYToday_Active_address_wf$riskAddrChange_Flag, na.rm = TRUE)
)
addInfoTable <- as.table(setNames(addInfo2, addInfo1))
write.table(addInfoTable, file = filewrite, row.names = FALSE, col.names = FALSE, na="",append = TRUE, sep = ":,")
### Additional Function ###
# function that converts "Y" to TRUE and "N" to FALSE
yn_to_logical <- function(x) {
y <- rep.int(NA, length(x))
y[x == "Y"] <- TRUE
y[x == "N"] <- FALSE
y
}
Thanks all, I know this is a pile of code, but R is not my strong point.
There are many ways to replace one set of values with another set of values. Here is one way:
# Create a simple data set that contains a column with "Y","N" and NAs (missing values)
df <- data.frame(x = sample(c("Y","N",NA), 10, replace = TRUE),
val = rnorm(10) )
df
# x val
# 1 N 0.56554865
# 2 <NA> -1.81437749
# 3 Y -1.21385694
# 4 Y -1.30173545
# 5 N -0.18994710
# 6 N -0.67519801
# 7 <NA> 0.02093869
# 8 N 0.69082204
# 9 Y 0.01715652
# 10 Y 1.34007199
# use function recode() from car package
library(car)
df$x <- recode(df$x, " 'Y' = 'Updated'; 'N' = 'Not Updated'")
df
# x val
# 1 Not Updated 0.56554865
# 2 <NA> -1.81437749
# 3 Updated -1.21385694
# 4 Updated -1.30173545
# 5 Not Updated -0.18994710
# 6 Not Updated -0.67519801
# 7 <NA> 0.02093869
# 8 Not Updated 0.69082204
# 9 Updated 0.01715652
# 10 Updated 1.34007199