rxDataStep using lagged values - revolution-r

In SAS it's possible to step through a dataset and use lagged values.
The way I would do it here is to use a transformation function that computes a "lag", but this would presumably produce a wrong value at the beginning of each chunk. For example, if a chunk starts at row 200,000, the function would assume an NA for a lagged value that should instead come from row 199,999.
Is there a solution for this?
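For concreteness, here is a minimal sketch of the naive per-chunk lag the question describes (the names naiveLagVar, Open, and previousOpen are illustrative, not from the original post). Because each chunk is processed in isolation, the first row of every chunk gets an NA rather than the true value from the preceding chunk:
# Naive lag inside a transformFunc: shift values down one row
# within the current chunk only.
naiveLagVar <- function(dataList) {
    n <- length(dataList[["Open"]])
    # NA in position 1, then rows 1..(n-1) of this chunk -
    # correct within a chunk, wrong at every chunk boundary.
    dataList[["previousOpen"]] <- c(NA, dataList[["Open"]][-n])
    dataList
}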

Here's another approach to lagging: self-merging on a shifted date. This is dramatically simpler to code and can lag several variables at once. The downsides are that it takes two to three times longer to run than the transformFunc answer below, and it requires a second copy of the dataset.
# Get a sample dataset
sourcePath <- file.path(rxGetOption("sampleDataDir"), "DJIAdaily.xdf")
# Set up paths for two copies of it
xdfPath <- tempfile(fileext = ".xdf")
xdfPathShifted <- tempfile(fileext = ".xdf")
# Convert "Date" to be Date-classed
rxDataStep(inData = sourcePath,
           outFile = xdfPath,
           transforms = list(Date = as.Date(Date)),
           overwrite = TRUE)
# Then make the second copy, but shift all the dates up
# one (or however much you want to lag)
# Use varsToKeep to subset to just the date and
# the variables you want to lag
rxDataStep(inData = xdfPath,
           outFile = xdfPathShifted,
           varsToKeep = c("Date", "Open", "Close"),
           transforms = list(Date = as.Date(Date) + 1),
           overwrite = TRUE)
# Create an output XDF (or just overwrite xdfPath)
xdfLagged2 <- tempfile(fileext = ".xdf")
# Use that incremented date to merge variables back on.
# duplicateVarExt will automatically tag variables from the
# second dataset as "Lagged".
# Note that there's no need to sort manually in this one -
# rxMerge does it automatically.
rxMerge(inData1 = xdfPath,
        inData2 = xdfPathShifted,
        outFile = xdfLagged2,
        matchVars = "Date",
        type = "left",
        duplicateVarExt = c("", "Lagged"))
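As a quick sanity check, mirroring the check used in the transformFunc answer below, you could peek at the first few rows of the merged file:
# Read the first 10 rows of the merged, lagged dataset into memory
rxDataStep(xdfLagged2, numRows = 10)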

You're exactly right about the chunking problem. The workaround is to use .rxGet and .rxSet to pass values between chunks. Here's the function:
lagVar <- function(dataList) {
    # .rxStartRow returns the overall row number of the first row in this
    # chunk - so the first row of the first chunk is equal to one.
    # If this is the very first row, there's no previous value to use, so
    # it's just an NA.
    if (.rxStartRow == 1) {
        # Put the NA out front, then shift all the other values down one row.
        # newName is the desired name of the lagged variable, set using
        # transformObjects - see below.
        dataList[[newName]] <- c(NA, dataList[[varToLag]][-.rxNumRows])
    } else {
        # If this isn't the very first chunk, fetch the previous value
        # from the previous chunk using .rxGet, then shift all other
        # values down one row, just as before.
        dataList[[newName]] <- c(.rxGet("lastValue"),
                                 dataList[[varToLag]][-.rxNumRows])
    }
    # Finally, once this chunk is done processing, set its lastValue so that
    # the next chunk can use it.
    .rxSet("lastValue", dataList[[varToLag]][.rxNumRows])
    # Return dataList with the new variable
    dataList
}
and how to use it in rxDataStep:
# Get a sample dataset
xdfPath <- file.path(rxGetOption("sampleDataDir"), "DJIAdaily.xdf")
# Set a path to a temporary file
xdfLagged <- tempfile(fileext = ".xdf")
# Sort the dataset chronologically - otherwise, the lagging will be random.
rxSort(inData = xdfPath,
       outFile = xdfLagged,
       sortByVars = "Date")
# Finally, put the lagging function to use:
rxDataStep(inData = xdfLagged,
           outFile = xdfLagged,
           transformObjects = list(varToLag = "Open",
                                   newName = "previousOpen"),
           transformFunc = lagVar,
           append = "cols",
           overwrite = TRUE)
# Check the results
rxDataStep(xdfLagged,
           varsToKeep = c("Date", "Open", "previousOpen"),
           numRows = 10)

Related

Change number format in Excel using names of headers - openpyxl

I have an Excel (.xlsx) file that I'm trying to parse, row by row. I have a header (first row) that has a bunch of column titles like School, First Name, Last Name, Email, etc.
When I loop through each row, I want to be able to say something like:
row['School']
and get back the value of the cell in the current row and the column with 'School' as its title.
I've looked through the OpenPyXL docs but can't seem to find anything terribly helpful.
Any suggestions?
I'm not incredibly familiar with OpenPyXL, but as far as I can tell it doesn't have any kind of dict reader/iterator helper. However, it's fairly easy to iterate over the worksheet rows, as well as to create a dict from two lists of values.
def iter_worksheet(worksheet):
    # It's necessary to get a reference to the generator, as
    # `worksheet.rows` returns a new iterator on each access.
    rows = worksheet.rows
    # Get the header values as keys and advance the iterator past the header row
    keys = [c.value for c in next(rows)]
    for row in rows:
        values = [c.value for c in row]
        yield dict(zip(keys, values))
Excel sheets are far more flexible than CSV files, so it makes little sense to have something like DictReader.
Just create an auxiliary dictionary from the relevant column titles.
If you have columns like "School", "First Name", "Last Name", "EMail", you can create the dictionary like this (here values holds the header-row cell values):
keys = dict((value, idx) for (idx, value) in enumerate(values))
for row in list(ws.rows)[1:]:
    school = row[keys['School']].value
I wrote a DictReader based on openpyxl. Save the second listing to a file named 'excel.py' and use it like csv.DictReader. See the usage example in the first listing.
from excel import DictReader
with open('example01.xlsx', 'rb') as source_data:
    for row in DictReader(source_data, sheet_index=0):
        print(row)
excel.py:
__all__ = ['DictReader']
from openpyxl import load_workbook
from openpyxl.cell import Cell
# Change the default cell value from None to '' so blank cells behave
# the same way as in csv.DictReader
Cell.__init__.__defaults__ = (None, None, '', None)
class DictReader(object):
    def __init__(self, f, sheet_index,
                 fieldnames=None, restkey=None, restval=None):
        self._fieldnames = fieldnames  # list of keys for the dict
        self.restkey = restkey         # key to catch long rows
        self.restval = restval         # default value for short rows
        self.reader = load_workbook(f, data_only=True).worksheets[sheet_index].iter_rows(values_only=True)
        self.line_num = 0
    def __iter__(self):
        return self
    @property
    def fieldnames(self):
        if self._fieldnames is None:
            try:
                self._fieldnames = next(self.reader)
                self.line_num += 1
            except StopIteration:
                pass
        return self._fieldnames
    @fieldnames.setter
    def fieldnames(self, value):
        self._fieldnames = value
    def __next__(self):
        if self.line_num == 0:
            # Used only for its side effect of reading the header row.
            self.fieldnames
        row = next(self.reader)
        self.line_num += 1
        # Unlike the basic reader, we prefer not to return blank rows,
        # because we would typically wind up with a dict full of None
        # values.
        while row == ():
            row = next(self.reader)
        d = dict(zip(self.fieldnames, row))
        lf = len(self.fieldnames)
        lr = len(row)
        if lf < lr:
            d[self.restkey] = row[lf:]
        elif lf > lr:
            for key in self.fieldnames[lr:]:
                d[key] = self.restval
        return d
The following seems to work for me.
header = True
headings = []
for row in ws.rows:
    if header:
        # Collect the column titles from the first row
        for cell in row:
            headings.append(cell.value)
        header = False
        continue
    # Map each heading to the corresponding cell in this row
    rowData = dict(zip(headings, row))
    wantedValue = rowData['myHeading'].value
I was running into the same issue as described above, so I created a simple extension called openpyxl-dictreader that can be installed through pip. It is very similar to the suggestion made by @viktor earlier in this thread.
The package is largely based on the source code of Python's native csv.DictReader class. It allows you to select items based on column names using openpyxl. For example:
import openpyxl_dictreader
reader = openpyxl_dictreader.DictReader("names.xlsx", "Sheet1")
for row in reader:
    print(row["First Name"], row["Last Name"])
Putting this here for reference.

Is there a method for converting a wimids object to a mids object?

Suppose I create 10 multiply-imputed datasets and use the (wonderful) MatchThem package in R to create weights for my exposure variable. The MatchThem package takes a mids object and converts it to an object of class wimids.
My desired output is a mids object, but with weights. I hope to pass this mids object to brms as follows:
library(brms)
m0 <- brm_multiple(Y|weights(weights) ~ A, data = mids_data)
Open to suggestions.
EDIT: Noah's solution below will unfortunately not work.
The package's first author, Farhad Pishgar, sent me the following elegant solution. It creates a mids object from a wimids object. Thank you, Farhad!
library(mice)
library(MatchThem)
#"weighted.dataset" is our .wimids object
#Extracting the original dataset with missing value
maindataset <- complete(weighted.datasets, action = 0)
#Some spit-and-polish
maindataset <- data.frame(.imp = 0, .id = seq_len(nrow(maindataset)), maindataset)
#Extracting imputed-weighted datasets in the long format
alldataset <- complete(weighted.datasets, action = "long")
#Binding them together
alldataset <- rbind(maindataset, alldataset)
#Converting to .mids
newmids <- as.mids(alldataset)
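From there, the reconstructed mids object should work directly with brm_multiple(), which accepts mids objects for its data argument. Here is a hedged sketch using the Y and A placeholders from the question (this call is not part of Farhad's reply):
library(brms)
# The weights column created by weightthem() is carried through complete()
# and as.mids(), so it can be referenced in the model formula.
m0 <- brm_multiple(Y | weights(weights) ~ A, data = newmids)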
Additionally, for brms, I worked out the following solution, which instead creates a list of data frames. It works in fewer steps.
library("mice")
library("dplyr")
library("MatchThem")
library("brms") # for bayesian estimation.
# Note, I realise that my approach here is not fully Bayesian, but that is a good thing! I need to ensure balance in the exposure.
# impute missing data
data("nhanes2")
imp <- mice(nhanes2, printFlag = FALSE, seed = 0, m = 10)
# MathThem. This is just a fast method
w_imp <- weightthem(hyp ~ chl + age, data = imp,
approach = "within",
estimand = "ATE",
method = "ps")
# get individual data frames with weights
out <- complete(w_imp, action ="long", include = FALSE, mild = TRUE)
# assemble individual data frames into a list
m <- 10
listdat<- list()
for (i in 1:m) {
listdat[[i]] <- as.data.frame(out[[i]])
}
# pass the list to brms, and it runs as it should!
fit_1 <- brm_multiple(bmi|weights(weights) ~ age + hyp + chl,
data = listdat,
backend = "cmdstanr",
family = "gaussian",
set_prior('normal(0, 1)',
class = 'b'))
brm_multiple() can take in a list of data frames for its data argument. You can produce this from the wimids object using complete(). The output of complete() with action = "all" is a mild object, which is a list of data frames, but this is not recognized by brm_multiple() as such. So, you can just convert it to a list. This should look like the following:
df_list <- complete(mids_data, "all")
class(df_list) <- "list"
m0 <- brm_multiple(Y|weights(weights) ~ A, data = df_list)
Using complete() automatically adds a weights column to the resulting imputed data frames.
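A quick, hypothetical check (not from the answer above) that the weights column made it into each imputed data frame:
# Inspect the weights of the first imputed dataset
head(df_list[[1]][["weights"]])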

efficient way to join 65,000 .csv files

I have, say, 65,000 .csv files that I need to work with in the Julia language.
The goal is to perform basic statistics on the data set.
I have a couple of ways of joining all the data sets:
#1 - set a common index and leftjoin() - perform statistics row-wise
#2 - vcat() the dataframes on top of each other - vertically stacked, then use groupby()
Either way, the final data frames are very large and become slow to process.
Is there an efficient way of doing this?
I thought of performing either #1 or #2 and splitting the joining operations into thirds: say, after 20,000 joins, save to .csv and operate in chunks, then join all three at the end in one last operation.
Well, I'm not sure how to replicate making 65k .csv files, but basically, below I loop through the files in the directory, load each .csv, then vcat() it onto one DataFrame. The question relates more to whether there is a better way to manage the size of the operation, since vcat() makes something grow on every iteration. Ahead of time, maybe I can cycle through the .csv files, obtain the dimensions of each .csv, initialize the full DataFrame to the final output size, then cycle through each .csv row by row and populate the initialized DataFrame.
using CSV
using DataFrames
# read all files in directory
csv_dir_tmax = cd(readdir, "C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tmax")
# initialize outputs
tmax_all = DataFrame(Date = [], TMAX = [])
c = 1
for c = 1:length(csv_dir_tmax)
    print("Starting csv file ", csv_dir_tmax[c], " - Iteration ", c, "\n")
    if c <= length(csv_dir_tmax)
        csv_tmax = CSV.read(join(["C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tmax/", csv_dir_tmax[c]]), DataFrame, header=true)
        tmax_all = vcat(tmax_all, csv_tmax)
    end
end
The following approach should be relatively efficient (assuming that data fits into memory):
tmax_all = reduce(vcat, [CSV.read("YOUR_DIR$x", DataFrame) for x in csv_dir_tmax])
Initializing the final output to the total size of the final output (the size that vcat() would eventually build), and then populating it element-wise, seems to work much better:
# get the dimensions of each .csv file
tmax_all_total_output_size = fill(0, size(csv_dir_tmax, 1))
tmin_all_total_output_size = fill(0, size(csv_dir_tmin, 1))
tavg_all_total_output_size = fill(0, size(csv_dir_tavg, 1))
tmax_dim = Int64[]
tmin_dim = Int64[]
tavg_dim = Int64[]
c = 1
for c = 1:length(csv_dir_tmin) # 47484 - last point
    print("Starting csv file ", csv_dir_tmin[c], " - Iteration ", c, "\n")
    if c <= length(csv_dir_tmax)
        tmax_csv = CSV.read(join(["C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tmax/", csv_dir_tmax[c]]), DataFrame, header=true)
        global tmax_dim = size(tmax_csv, 1)
        tmax_all_total_output_size[c] = tmax_dim
    end
    if c <= length(csv_dir_tmin)
        tmin_csv = CSV.read(join(["C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tmin/", csv_dir_tmin[c]]), DataFrame, header=true)
        global tmin_dim = size(tmin_csv, 1)
        tmin_all_total_output_size[c] = tmin_dim
    end
    if c <= length(csv_dir_tavg)
        tavg_csv = CSV.read(join(["C:/Users/andrew.bannerman/Desktop/Julia/scripts/GHCN data/ghcnd_all_csv/tavg/", csv_dir_tavg[c]]), DataFrame, header=true)
        global tavg_dim = size(tavg_csv, 1)
        tavg_all_total_output_size[c] = tavg_dim
    end
end
# sum total dimension of all .csv files
tmax_sum = sum(tmax_all_total_output_size)
tmin_sum = sum(tmin_all_total_output_size)
tavg_sum = sum(tavg_all_total_output_size)
# initialize final output to total final dimension
tmax_date_array = fill(Date("13000101", "yyyymmdd"), tmax_sum)
tmax_array = zeros(tmax_sum)
tmin_date_array = fill(Date("13000101", "yyyymmdd"), tmin_sum)
tmin_array = zeros(tmin_sum)
tavg_date_array = fill(Date("13000101", "yyyymmdd"), tavg_sum)
tavg_array = zeros(tavg_sum)
# initialize outputs
tmax_all = DataFrame(Date = tmax_date_array, TMAX = tmax_array)
tmin_all = DataFrame(Date = tmin_date_array, TMIN = tmin_array)
tavg_all = DataFrame(Date = tavg_date_array, TAVG = tavg_array)
tmax_count = 0
tmin_count = 0
tavg_count = 0
Then begin filling the initialized df.

How can I use a loop to apply a function to a list of csv files?

I'm trying to loop through all files in a directory and add "indicator" data to them. I had the code working where I could select one file and do this, but now I'm trying to make it work on all files. The problem is that when I make the loop, it says:
ValueError: Invalid file path or buffer object type: <class 'list'>
The goal is for each loop iteration to read another file from the list, make changes, and save the file back to the folder with the changes.
Here is the complete code without imports. I copied one of the file paths from the list and put it in a comment at the bottom.
### open dialog to select file
#file_path = filedialog.askopenfilename()
### create list from dir
listdrs = os.listdir('c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/')
### append full path to list
string = 'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/'
listdrs_path = [string + x for x in listdrs]
print(listdrs_path)
### start loop: for each "file" in listdrs run the 2 functions below and overwrite the saved csv.
for file in listdrs_path:
    file_path = listdrs_path
    data = pd.read_csv(file_path, index_col=0)
    ########################################
    #### function 1
    def get_price_hist(ticker):
        # Put stock price data in dataframe
        data = pd.read_csv(file_path)
        #listdr = os.listdir('Users\17409\AppData\Local\Programs\Python\Python38\Indicators\Sentdex Tutorial\stock_dfs')
        print(listdr)
        # Convert date to timestamp and make index
        data.index = data["Date"].apply(lambda x: pd.Timestamp(x))
        data.drop("Date", axis=1, inplace=True)
        return data
    df = data
    ##print(data)
    ###### Indicator data #####################
    def get_indicators(data):
        # Get MACD
        data["macd"], data["macd_signal"], data["macd_hist"] = talib.MACD(data['Close'])
        # Get MA10 and MA30
        data["ma10"] = talib.MA(data["Close"], timeperiod=10)
        data["ma30"] = talib.MA(data["Close"], timeperiod=30)
        # Get RSI
        data["rsi"] = talib.RSI(data["Close"])
        return data
    ##### end functions #######
    data2 = get_indicators(data)
    print(data2)
    data2.to_csv(file_path)
###################################################
# here is an example of what a path from the list looks like:
# 'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/A.csv'
The problem is in your loop: the current filename is in the variable file, but you are reading from file_path, to which you've assigned the whole file list. Because of this you are getting the ValueError. Loop over the paths directly instead. Try this:
### open dialog to select file
#file_path = filedialog.askopenfilename()
### create list from dir
listdrs = os.listdir('c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/')
### append full path to list
string = 'c:/Users/17409/AppData/Local/Programs/Python/Python38/Indicators/Sentdex Tutorial/stock_dfs/'
listdrs_path = [string + x for x in listdrs]
print(listdrs_path)
### start loop: for each file path in listdrs_path run the 2 functions below and overwrite the saved csv.
for file_path in listdrs_path:
    data = pd.read_csv(file_path, index_col=0)
    ########################################
    #### function 1
    def get_price_hist(ticker):
        # Put stock price data in dataframe
        data = pd.read_csv(file_path)
        #listdr = os.listdir('Users\17409\AppData\Local\Programs\Python\Python38\Indicators\Sentdex Tutorial\stock_dfs')
        print(listdr)
        # Convert date to timestamp and make index
        data.index = data["Date"].apply(lambda x: pd.Timestamp(x))
        data.drop("Date", axis=1, inplace=True)
        return data
    df = data
    ##print(data)
    ###### Indicator data #####################
    def get_indicators(data):
        # Get MACD
        data["macd"], data["macd_signal"], data["macd_hist"] = talib.MACD(data['Close'])
        # Get MA10 and MA30
        data["ma10"] = talib.MA(data["Close"], timeperiod=10)
        data["ma30"] = talib.MA(data["Close"], timeperiod=30)
        # Get RSI
        data["rsi"] = talib.RSI(data["Close"])
        return data
    ##### end functions #######
    data2 = get_indicators(data)
    print(data2)
    data2.to_csv(file_path)
Let me know if it helps.

stat_cor function in ggplot2: Print R and p-values on two lines

Is it possible to print the correlation value and the p-value on two lines instead of comma-separated, as is the default?
default:
R=0.8, p=0.004
want:
R=0.8
p=0.004
The stat_cor() function is from the ggpubr package (not base ggplot2). Regardless, the documentation for the function has your answer, which is to use the label.sep argument in stat_cor(). You can set it to "\n" to use a newline character as the separator and get the label over two lines. See the example from the documentation with that adjustment:
library(ggplot2)
library(ggpubr)
data("mtcars")
df <- mtcars
df$cyl <- as.factor(df$cyl)
sp <- ggscatter(df, x = "wt", y = "mpg",
                add = "reg.line",                                      # Add regression line
                add.params = list(color = "blue", fill = "lightgray"), # Customize reg. line
                conf.int = TRUE                                        # Add confidence interval
)
# Add correlation coefficient, with R and p on separate lines
sp + stat_cor(method = "pearson", label.x = 3, label.y = 30, label.sep = "\n")