Generating a non BOOLEAN result to a CSV using R - sql

I have taken over a report from a colleague who has left the organisation. The report is written in R, it accesses an Oracle database, and runs SQL scripts into R.
The R code then pushes the data out to a csv. The difficulty I have is that it is pushing TRUE, FALSE or a blank cell to the csv. I want to update the code to print Updated, Not Updated, or a blank cell. I am struggling to find the correct place to make these updates.
Below is the code that generates the CSV, along with a snippet it calls that may be producing the TRUE and FALSE values I referred to. Any help would be greatly appreciated.
# prepare a data frame based on Today_Active_address_wf for printing.
ForPrintONLYToday_Active_address_wf <- Today_Active_address_wf
# add an empty row to dataframe
ForPrintONLYToday_Active_address_wf[nrow(ForPrintONLYToday_Active_address_wf)+1, ] <- NA
# write today Active address wf data frame to .csv file
filename <- paste("Daily Address Change Workflow Report___", format(Sys.time(), "%Y-%m-%d__%Hh%M"), ".csv", sep ="")
filelocation <- "\\\\DCV-PANAPP-P001\\File Share\\BIS\\Data Administration\\Daily Workflow Reports\\"
filewrite <- paste(filelocation, filename, sep = "")
write.csv(ForPrintONLYToday_Active_address_wf, file = filewrite, row.names = FALSE, na="")
# add additional info (totals for each column)
addInfo1 <- c("Total_duplicated_WF"
,"Total_inconsistent_wf_addr"
,"Total_QAS_validated_addr"
,"Total_invalid_Wf_addr_date"
,"Postal_unmatched_with_valid_WF"
,"Total_No_Policy"
,"Total_home_risk"
,"Total_motor_risk"
,"Total_risk_addr_change"
)
addInfo2 <- c(sum(ForPrintONLYToday_Active_address_wf$dupCloseFlag, na.rm = TRUE)
,sum(ForPrintONLYToday_Active_address_wf$inconsistentDataFlag, na.rm = TRUE)
,sum(ForPrintONLYToday_Active_address_wf$WF_ADDRESS_VALIDATED, na.rm = TRUE)
,sum(ForPrintONLYToday_Active_address_wf$valid_addr_date == FALSE, na.rm = TRUE)
,sum(ForPrintONLYToday_Active_address_wf$MatchedPO_WF_addr == FALSE & ForPrintONLYToday_Active_address_wf$valid_addr_date == TRUE, na.rm = TRUE)
,sum(ForPrintONLYToday_Active_address_wf$hasNOpolicy, na.rm = TRUE)
,sum(ForPrintONLYToday_Active_address_wf$hasHomeRisk, na.rm = TRUE)
,sum(ForPrintONLYToday_Active_address_wf$hasMotorRisk, na.rm = TRUE)
,sum(ForPrintONLYToday_Active_address_wf$riskAddrChange_Flag, na.rm = TRUE)
)
addInfoTable <- as.table(setNames(addInfo2, addInfo1))
write.table(addInfoTable, file = filewrite, row.names = FALSE, col.names = FALSE, na="",append = TRUE, sep = ":,")
### Additional Function ###
# function that converts "Y" to TRUE and "N" to FALSE
yn_to_logical <- function(x) {
y <- rep.int(NA, length(x))
y[x == "Y"] <- TRUE
y[x == "N"] <- FALSE
y
}
Thanks all, I know this is a pile of code, but R is not my strong point.

There are many ways to replace one set of values with another set of values. Here is one way:
# Create a simple data set that contains a column with "Y","N" and NAs (missing values)
df <- data.frame(x = sample(c("Y","N",NA), 10, replace = TRUE),
val = rnorm(10) )
df
# x val
# 1 N 0.56554865
# 2 <NA> -1.81437749
# 3 Y -1.21385694
# 4 Y -1.30173545
# 5 N -0.18994710
# 6 N -0.67519801
# 7 <NA> 0.02093869
# 8 N 0.69082204
# 9 Y 0.01715652
# 10 Y 1.34007199
# use function recode() from car package
library(car)
df$x <- recode(df$x, " 'Y' = 'Updated'; 'N' = 'Not Updated'")
df
# x val
# 1 Not Updated 0.56554865
# 2 <NA> -1.81437749
# 3 Updated -1.21385694
# 4 Updated -1.30173545
# 5 Not Updated -0.18994710
# 6 Not Updated -0.67519801
# 7 <NA> 0.02093869
# 8 Not Updated 0.69082204
# 9 Updated 0.01715652
# 10 Updated 1.34007199
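If you would rather stay in base R (no car dependency), another way is to relabel every logical column just before writing the CSV. This is a minimal sketch, assuming the flag columns really are logicals (TRUE/FALSE/NA) as produced by yn_to_logical(); logical_to_label, PrintCopy and logical_cols are made-up names:
logical_to_label <- function(x) {
  ifelse(is.na(x), NA, ifelse(x, "Updated", "Not Updated"))
}

# work on a print-only copy so the sum() totals computed from the logical
# columns elsewhere in the script are unaffected
PrintCopy <- ForPrintONLYToday_Active_address_wf
logical_cols <- vapply(PrintCopy, is.logical, logical(1))
PrintCopy[logical_cols] <- lapply(PrintCopy[logical_cols], logical_to_label)
write.csv(PrintCopy, file = filewrite, row.names = FALSE, na = "")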

Related

In R, I want to pull out the minimum value of z when x and y are duplicates

In R, I want to pull out the minimum value of z when x and y are duplicates.
line x y z
3559 1284283 10315402 20995
3560 1284283 10315402 20982
3596 1284341 10315344 20995
3597 1284341 10315344 20982
3633 1284399 10315286 20995
3634 1284399 10315286 20982
I did this so far:
I identify duplicate x,y locations in the zone, which creates a logical vector, and then add that vector to the data frame as a fourth column:
z1_repeat_xy <- duplicated(df_Zone1[,1:2])
df_Zone1$repeat_xy <- z1_repeat_xy # feeds the true or false value to a new column called repeat_xy.
line# x y z repeat_xy
135161 1283668 10314903 19994 FALSE
135164 1283726 10314845 19994 FALSE
135167 1283784 10314787 19994 FALSE
135170 1283842 10314729 19994 FALSE
135171 1283842 10314729 19981 TRUE
135172 1283842 10314729 19968 TRUE
Now I want the minimum z value corresponding to TRUE values in the 4th column, along with the rows containing FALSE.
I create a new df with only TRUE values in the repeat column
df_repeats <- filter(df_Zone1, repeat_xy == TRUE) # dplyr::filter
I created the test_max and test_min data frames using the ave() function, which seems to do what I want: it gives the min or max z value for the repeated x,y values.
test_max <- df_repeats[df_repeats$z == ave(df_repeats$z, df_repeats$x , FUN =max),]
test_min <- df_repeats[df_repeats$z == ave(df_repeats$z, df_repeats$x , FUN =min),]
write.table(test_max, file = "test_max.txt", sep = " ", row.names = FALSE)
write.table(test_min, file = "test_min.txt", sep = " ", row.names = FALSE)
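As a side note, a dplyr-only sketch of the same goal (keep the minimum z for every x,y location, and let unique locations pass through unchanged) could look like this; it assumes dplyr 1.0 or later for slice_min() and that df_Zone1 has columns x, y and z:
library(dplyr)

df_min <- df_Zone1 %>%
  group_by(x, y) %>%
  slice_min(z, n = 1, with_ties = FALSE) %>%  # one row per location, smallest z
  ungroup()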

Julia: Collapsing DataFrame by multiple values retaining additional variables

I have some data that has duplicate fields with the exception of a single field which I would like to join. In the data everything but the report should stay the same on each day and each company. Companies can file multiple reports on the same day.
I can join using the following code but I am losing the variables which are not in my by function. Any suggestions?
Mock Data
using DataFrames
# Number of observations
n = 100
words = split("the wigdet drop air flat fall fling flap freeze flop tool fox", " ")
df = DataFrame(day = cumsum(rand(0:1, n)), company = rand(0:3, n),
report = [join(rand(words, rand(1:5, 1)[1]), " ") for i in 1:n])
x = df[:, [:day, :company]]
# Number of variables which are identical for each day/company.
nv = 100
for i in 1:nv
df[:, Symbol("v" * string(i))] = ""
end
for i in 1:size(x, 1),j in 1:nv
df[(df.day .== x[i,1]) .& (df.company .== x[i,2]), Symbol("v" * string(j))] =
join(rand('a':'z', 3), "")
end
Collapsed data
outdf = by(df, [:company, :day]) do sub
t = DataFrame(fullreport = join(sub.report, "\n(Joined)\n"))
end
Here are some minor tweaks in your data preparation code:
using DataFrames
# Number of observations
n = 100
words = split("the wigdet drop air flat fall fling flap freeze flop tool fox", " ")
df = DataFrame(day = cumsum(rand(0:1, n)), company = rand(0:3, n),
report = [join(rand(words, rand(1:5, 1)[1]), " ") for i in 1:n])
x = df[:, [:day, :company]]
# Number of variables which are identical for each day/company.
nv = 100
for i in 1:nv
df[:, Symbol("v", i)] .= ""
end
for i in 1:size(x, 1), j in 1:nv
df[(df.day .== x[i,1]) .& (df.company .== x[i,2]), Symbol("v", j)] .= join(rand('a':'z', 3), "")
end
and here is a by call that keeps all the other variables (assuming they are constant per group, this code should be efficient even for relatively large data):
outdf = by(df, [:company, :day]) do sub
merge((fullreport = join(sub.report, "\n(Joined)\n"),),
copy(sub[1, Not([:company, :day, :report])]))
end
I put the fullreport variable first.
Here is the code that would keep all rows from the original data frame:
outdf = by(df, [:company, :day]) do sub
insertcols!(select(sub, Not([:company, :day, :report])), 1,
fullreport = join(sub.report, "\n(Joined)\n"))
end
and now you can e.g. check that unique(outdf) produces the same data frame as the one generated by the first by.
(In the code above I also dropped the :report variable, as I guess you did not want it in the result - right?)
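For what it's worth, on newer DataFrames.jl releases (roughly 0.22 and later) by has been removed in favour of combine(groupby(...)); an untested sketch of the row-keeping variant above would be:
outdf = combine(groupby(df, [:company, :day])) do sub
    insertcols!(select(sub, Not([:company, :day, :report])), 1,
        :fullreport => join(sub.report, "\n(Joined)\n"))
end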

Create Dataframe name from 2 strings or variables pandas

I am extracting selected pages from a PDF file and want to assign data frame names based on the pages extracted:
file = "abc"
selected_pages = ['10','11'] #can be any combination, e.g. ['6','14','20']
for i in selected_pages():
df{str(i)} = read_pdf(path + file + ".pdf",encoding = 'ISO-8859-1', stream = True,area = [100,10,740,950],pages= (i), index = False)
print (df{str(i)} )
The idea, ultimately, as in the example above, is to end up with dataframes df10 and df11. I have tried "df" + str(i), "df" & str(i) and df{str(i)}; however, all of them give the error message: SyntaxError: invalid syntax.
Any better way of doing it is most welcome. Thanks.
This is where a dictionary would be a much better option.
Also note the error you have at the start of the loop. selected_pages is a list, so you can't do selected_pages().
file = "abc"
selected_pages = ['10','11'] #can be any combination, e.g. ['6','14','20']
df = {}
for i in selected_pages:
    df[i] = read_pdf(path + file + ".pdf", encoding = 'ISO-8859-1', stream = True, area = [100,10,740,950], pages = (i), index = False)

# after the loop i is '11'; subtracting 1 gives the previous page's key
i = int(i) - 1 # this will bring it to 10
dfB = df[str(i)]
# select row numbers to drop: 0:4
dfB.drop(dfB.index[0:4], axis = 0, inplace = True)
dfB.columns = ['col1','col2','col3','col4','col5']
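Once the loop has run, each extracted table sits in the dictionary under its page number, so there is no need to build names like df10 or df11 at all. A quick usage sketch (assuming read_pdf returned a DataFrame per page):
for page, frame in df.items():
    print(page, frame.shape)

df_page10 = df['10']   # the table that would have been called df10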

geom_vline() dateRangeInput()

I have set up a line graph in shiny. The x axis has dates covering 2014 to current date.
I have set up various vertical lines using geom_vline() to highlight points in the data.
I'm using dateRangeInput() so the user can choose the start/end date range to look at on the graph.
One of my vertical lines is in Feb 2014. If the user uses the dateRangeInput() to look at dates from say Jan 2016 the vertical line for Feb 2014 is still showing on the graph. This is also causing the x axis to go from 2014 even though the data line goes from Jan 2016 to current date.
Is there a way to stop this vertical line showing on the graph when it's outside of the dateRangeInput()? Maybe there's an argument in geom_vline() to deal with this?
library(shiny)
library(tidyr)
library(dplyr)
library(ggplot2)
d <- seq(as.Date("2014-01-01"),Sys.Date(),by="day")
df <- data.frame(date = d , number = seq(1,length(d),by=1))
lines <- data.frame(x = as.Date(c("2014-02-07","2017-10-31", "2017-08-01")),
y = c(2500,5000,7500),
lbl = c("label 1", "label 2", "label 3"))
#UI
ui <- fluidPage(
#date range select:
dateRangeInput(inputId = "date", label = "choose date range",
start = min(df$date), end = max(df$date),
min = min(df$date), max = max(df$date)),
#graph:
plotOutput("line")
)
#SERVER:
server <- function(input, output) {
data <- reactive({ subset(df, date >= input$date[1] & date <= input$date[2])
})
#graph:
output$line <- renderPlot({
my_graph <- ggplot(data(), aes(date, number )) + geom_line() +
geom_vline(data = lines, aes(xintercept = x, color = factor(x) )) +
geom_label(data = lines, aes(x = x, y = y,
label = lbl, colour = factor(x),
fontface = "bold" )) +
scale_color_manual(values=c("#CC0000", "#6699FF", "#99FF66")) +
guides(colour = "none", size = "none")
return(my_graph)
})
}
shinyApp(ui = ui, server = server)
As mentioned by Aimée in a different thread:
In a nutshell, ggplot2 will always plot all of the data that you provide and the axis limits are based on that unless you specify otherwise. So because you are telling it to plot the line & label, they will appear on the plot even though the rest of the data doesn't extend that far.
You can resolve this by telling ggplot2 what you want the limits of your x axis to be, using the coord_cartesian function.
# Set the upper and lower limit for the x axis
dateRange <- c(input$date[1], input$date[2])
my_graph <- ggplot(df, aes(date, number)) + geom_line() +
geom_vline(data = lines, aes(xintercept = x, color = factor(x) )) +
geom_label(data = lines, aes(x = x, y = y,
label = lbl, colour = factor(x),
fontface = "bold" )) +
scale_color_manual(values=c("#CC0000", "#6699FF", "#99FF66")) +
guides(colour = "none", size = "none") +
coord_cartesian(xlim = dateRange)
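Another option, if you would rather the out-of-range annotation disappear entirely rather than just being clipped, is to filter the lines data frame against the selected range inside renderPlot(). A rough sketch reusing the app's objects (visible_lines is a made-up name):
output$line <- renderPlot({
  visible_lines <- subset(lines, x >= input$date[1] & x <= input$date[2])
  ggplot(data(), aes(date, number)) + geom_line() +
    geom_vline(data = visible_lines, aes(xintercept = x, color = factor(x))) +
    geom_label(data = visible_lines,
               aes(x = x, y = y, label = lbl, colour = factor(x)),
               fontface = "bold") +
    scale_color_manual(values = c("#CC0000", "#6699FF", "#99FF66")) +
    guides(colour = "none", size = "none")
})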

Pandas - Find and index rows that match row sequence pattern

I would like to find a pattern going down the rows of a categorical variable in a dataframe. I can see how to use Series.shift() to look up/down and use boolean logic to find the pattern; however, I want to do this with a grouping variable and also label all rows that are part of the pattern, not just the starting row.
Code:
import pandas as pd
from numpy.random import choice, randn
import string
# df constructor
n_rows = 1000
df = pd.DataFrame({'date_time': pd.date_range('2/9/2018', periods=n_rows, freq='H'),
'group_var': choice(list(string.ascii_uppercase), n_rows),
'row_pat': choice([0, 1, 2, 3], n_rows),
'values': randn(n_rows)})
# sorting
df.sort_values(by=['group_var', 'date_time'], inplace=True)
df.head(10)
I can find the start of the pattern (with no grouping though) like this:
# the row ordinal pattern to detect
p0, p1, p2, p3 = 1, 2, 2, 0
# flag the row at the start of the pattern
df['pat_flag'] = \
df['row_pat'].eq(p0) & \
df['row_pat'].shift(-1).eq(p1) & \
df['row_pat'].shift(-2).eq(p2) & \
df['row_pat'].shift(-3).eq(p3)
df.head(10)
What I can't figure out is how to do this only within the "group_var", and instead of returning True for the start of the pattern, return True for all rows that are part of the pattern.
Appreciate any tips on how to solve this!
Thanks...
I think you have two options - a simpler but slower solution, or a faster but more complicated one.
For the simpler one:
use Rolling.apply and test the pattern
replace 0s with NaNs using mask
use bfill with a limit (same as fillna with method='bfill') to repeat the 1s over the preceding rows of the window
then fillna the remaining NaNs with 0
finally cast to bool with astype
import numpy as np

pat = np.asarray([1, 2, 2, 0])
N = len(pat)
df['rm0'] = (df['row_pat'].rolling(window=N , min_periods=N)
.apply(lambda x: (x==pat).all())
.mask(lambda x: x == 0)
.bfill(limit=N-1)
.fillna(0)
.astype(bool)
)
If performance is important, use strides; the solution from the link was modified:
use a rolling window approach
compare with the pattern and return True for matches via all
get the indices of the first occurrences via np.mgrid and indexing
create all the indices with a list comprehension
compare with numpy.in1d and create the new column
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c
arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]
d = [i for x in c for i in range(x, x+N)]
df['rm2'] = np.in1d(np.arange(len(arr)), d)
Another solution, thanks #divakar:
from scipy.ndimage.morphology import binary_dilation

arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
m = (rolling_window(arr, len(pat)) == pat).all(1)
m_ext = np.r_[m,np.zeros(len(arr) - len(m), dtype=bool)]
df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))
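As an aside, on NumPy 1.20+ the hand-rolled strides helper can be swapped for numpy's sliding_window_view; a sketch of the same row-flagging idea (rm3 is just an illustrative column name):
from numpy.lib.stride_tricks import sliding_window_view

windows = sliding_window_view(arr, len(pat))            # shape (len(arr)-N+1, N)
starts = np.flatnonzero((windows == pat).all(axis=1))   # positions where a match begins
hits = (starts[:, None] + np.arange(len(pat))).ravel()  # every row index inside a match
df['rm3'] = np.in1d(np.arange(len(arr)), hits)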
Timings:
import numpy as np
import pandas as pd
from numpy.random import choice, randn
from scipy.ndimage.morphology import binary_dilation
import string

np.random.seed(456)
# df constructor
n_rows = 100000
df = pd.DataFrame({'date_time': pd.date_range('2/9/2018', periods=n_rows, freq='H'),
'group_var': choice(list(string.ascii_uppercase), n_rows),
'row_pat': choice([0, 1, 2, 3], n_rows),
'values': randn(n_rows)})
# sorting
df.sort_values(by=['group_var', 'date_time'], inplace=True)
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    c = np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
    return c
arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
m = (rolling_window(arr, len(pat)) == pat).all(1)
m_ext = np.r_[m,np.zeros(len(arr) - len(m), dtype=bool)]
df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))
arr = df['row_pat'].values
b = np.all(rolling_window(arr, N) == pat, axis=1)
c = np.mgrid[0:len(b)][b]
d = [i for x in c for i in range(x, x+N)]
df['rm2'] = np.in1d(np.arange(len(arr)), d)
print (df.iloc[460:480])
date_time group_var row_pat values rm0 rm1 rm2
12045 2019-06-25 21:00:00 A 3 -0.081152 False False False
12094 2019-06-27 22:00:00 A 1 -0.818167 False False False
12125 2019-06-29 05:00:00 A 0 -0.051088 False False False
12143 2019-06-29 23:00:00 A 0 -0.937589 False False False
12145 2019-06-30 01:00:00 A 3 0.298460 False False False
12158 2019-06-30 14:00:00 A 1 0.647161 False False False
12164 2019-06-30 20:00:00 A 3 -0.735538 False False False
12210 2019-07-02 18:00:00 A 1 -0.881740 False False False
12341 2019-07-08 05:00:00 A 3 0.525652 False False False
12343 2019-07-08 07:00:00 A 1 0.311598 False False False
12358 2019-07-08 22:00:00 A 1 -0.710150 True True True
12360 2019-07-09 00:00:00 A 2 -0.752216 True True True
12400 2019-07-10 16:00:00 A 2 -0.205122 True True True
12404 2019-07-10 20:00:00 A 0 1.342591 True True True
12413 2019-07-11 05:00:00 A 1 1.707748 False False False
12506 2019-07-15 02:00:00 A 2 0.319227 False False False
12527 2019-07-15 23:00:00 A 3 2.130917 False False False
12600 2019-07-19 00:00:00 A 1 -1.314070 False False False
12604 2019-07-19 04:00:00 A 0 0.869059 False False False
12613 2019-07-19 13:00:00 A 2 1.342101 False False False
In [225]: %%timeit
...: df['rm0'] = (df['row_pat'].rolling(window=N , min_periods=N)
...: .apply(lambda x: (x==pat).all())
...: .mask(lambda x: x == 0)
...: .bfill(limit=N-1)
...: .fillna(0)
...: .astype(bool)
...: )
...:
1 loop, best of 3: 356 ms per loop
In [226]: %%timeit
...: arr = df['row_pat'].values
...: b = np.all(rolling_window(arr, N) == pat, axis=1)
...: c = np.mgrid[0:len(b)][b]
...: d = [i for x in c for i in range(x, x+N)]
...: df['rm2'] = np.in1d(np.arange(len(arr)), d)
...:
100 loops, best of 3: 7.63 ms per loop
In [227]: %%timeit
...: arr = df['row_pat'].values
...: b = np.all(rolling_window(arr, N) == pat, axis=1)
...:
...: m = (rolling_window(arr, len(pat)) == pat).all(1)
...: m_ext = np.r_[m,np.zeros(len(arr) - len(m), dtype=bool)]
...: df['rm1'] = binary_dilation(m_ext, structure=[1]*N, origin=-(N//2))
...:
100 loops, best of 3: 7.25 ms per loop
You could make use of the pd.rolling() methods and then simply compare the arrays that it returns with the array that contains the pattern that you are attempting to match on.
pattern = np.asarray([1.0, 2.0, 2.0, 0.0])
n_obs = len(pattern)
df['rolling_match'] = (df['row_pat']
.rolling(window=n_obs , min_periods=n_obs)
.apply(lambda x: (x==pattern).all())
.astype(bool) # All as bools
.shift(-1 * (n_obs - 1)) # Shift back
.fillna(False) # convert NaNs to False
)
It is important to specify the min periods here in order to ensure that you only find exact matches (and so the equality check won't fail when the shapes are misaligned). The apply function is doing a pairwise check between the two arrays, and then we use the .all() to ensure all match. We convert to a bool, and then call shift on the function to move it to being a 'forward looking' indicator instead of only occurring after the fact.
Help on the rolling functionality available here -
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html
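If you also need the later rows of each match flagged (the question asks for all rows that are part of the pattern, not only the first), one hedged extension of the column built above is to OR the start flag with forward-shifted copies of itself (rolling_match_full is a made-up name):
base = df['rolling_match'].astype(bool)
full = base.copy()
for k in range(1, n_obs):
    full = full | base.shift(k, fill_value=False)   # mark rows 1..n_obs-1 of each match
df['rolling_match_full'] = full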
This works as follows:
a) For every group, it takes a window of size 4 and scans through the column until it finds the combination (1,2,2,0) in exact sequence. As soon as it finds the sequence, it populates the corresponding index values of the new column 'pat_flag' with 1.
b) If it doesn't find the combination, it populates the column with 0.
pattern = [1,2,2,0]
def get_pattern(df):
    df = df.reset_index(drop=True)
    df['pat_flag'] = 0
    get_indexes = []
    temp = []
    for index, row in df.iterrows():
        mindex = index + 1
        # get the next 4 values
        for j in range(mindex, mindex+4):
            if j == df.shape[0]:
                break
            else:
                get_indexes.append(j)
                temp.append(df.loc[j, 'row_pat'])
        # check if sequence is matched
        if temp == pattern:
            df.loc[get_indexes, 'pat_flag'] = 1
        else:
            # reset if the pattern is not found in given window
            temp = []
            get_indexes = []
    return df
# apply function to the groups
df = df.groupby('group_var').apply(get_pattern)
## snippet of output
date_time group_var row_pat values pat_flag
41 2018-03-13 21:00:00 C 3 0.731114 0
42 2018-03-14 05:00:00 C 0 1.350164 0
43 2018-03-14 11:00:00 C 1 -0.429754 1
44 2018-03-14 12:00:00 C 2 1.238879 1
45 2018-03-15 17:00:00 C 2 -0.739192 1
46 2018-03-18 06:00:00 C 0 0.806509 1
47 2018-03-20 06:00:00 C 1 0.065105 0
48 2018-03-20 08:00:00 C 1 0.004336 0
Expanding on Emmet02's answer: using the rolling function for all groups and setting the match column to True for all indices that are part of a matching pattern:
pattern = np.asarray([1,2,2,0])
# Create a match column in the main dataframe
# (note: DataFrame.assign has no inplace argument, so assign directly)
df['match'] = False

for group_var, group in df.groupby("group_var"):
    # Per group do rolling window matching, the last
    # values of matching patterns in array 'match'
    # will be True
    match = (
        group['row_pat']
        .rolling(window=len(pattern), min_periods=len(pattern))
        .apply(lambda x: (x==pattern).all())
    )
    # Get indices of matches in current group
    idx = np.arange(len(group))[match == True]
    # Include all indices of matching pattern,
    # counting back from last index in pattern
    idx = idx.repeat(len(pattern)) - np.tile(np.arange(len(pattern)), len(idx))
    # Update matches
    match.values[idx] = True
    df.loc[group.index, 'match'] = match

df[df.match==True]
edit: Without a for loop
# Do rolling matching in group clause
match = (
df.groupby("group_var")
.rolling(len(pattern))
.row_pat.apply(lambda x: (x==pattern).all())
)
# Convert NaNs
match = (~match.isnull() & match)
# Get indices of matches in current group
idx = np.arange(len(df))[match]
# Include all indices of matching pattern
idx = idx.repeat(len(pattern)) - np.tile(np.arange(len(pattern)), len(idx))
# Mark all indices that are selected by "idx" in match-column
df = df.assign(match=df.index.isin(df.index[idx]))
You can do this by defining a custom aggregate function, using it in a groupby statement, and finally merging the result back into the original dataframe. Something like this:
Aggregate function:
def pattern_detect(column):
    # define any other pattern to detect here
    p0, p1, p2, p3 = 1, 2, 2, 0
    found = column.eq(p0) & \
            column.shift(-1).eq(p1) & \
            column.shift(-2).eq(p2) & \
            column.shift(-3).eq(p3)
    return found.any()
Use the groupby function next:
grp = df.groupby('group_var').agg([pattern_detect])['row_pat']
Now merge it back to the original dataframe:
df = df.merge(grp, left_on='group_var',right_index=True, how='left')
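Worth noting: unlike the rolling-window answers above, this flags whole groups rather than the individual pattern rows. Assuming the merge produced a boolean column called pattern_detect (the exact name can vary with how .agg labels it on your pandas version), a usage sketch:
groups_with_pattern = df.loc[df['pattern_detect'], 'group_var'].unique()
print(groups_with_pattern)   # every group_var value whose rows contain the 1,2,2,0 sequence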