How to read multiple .xls files in one go in R with fread

I have tried the code below multiple times, but nothing happens when I run it. I think fread does not read the .xls format, so I also tried two other approaches, one with the rio package and another with the openxlsx package. Sorry, I am new to this. There are 38 files, each named like "Cust+Txn+Details+Customer (36).xls". Thank you.
## First put all file names into a list
library(data.table)
files <- list.files(path = "F:\\MUMuniv\\machine learning class\\price sensitivty\\PS works\\Customer files",
                    pattern = ".xls", full.names = T)

readdata <- function(fn){
  dt_temp <- fread(fn)
  return(dt_temp)
}

# then using
all.files <- lapply(files, readdata)
final.data <- rbindlist(all.files)
Error I got: " Error in fread(fn) : mmap'd region has EOF at the end "
#Example 2
#rio package
require("rio")
xls <- dir(path = ".", all.files = T)
created <- mapply(convert, xls, gsub(".xlsx", ".csv", "xls"))
unlink(xls)
Error in get_ext(file) : 'file' has no extension
#example 3
# using openxlsx package
require("openxlsx")
# Create a vector of Excel files to read
files.to.read = list.files(path = ".", all.files = T)
# Read each file and write it to csv
lapply(files.to.read, function(f) {
  df = read.xlsx(f, sheet = 1)
  write.csv(df, gsub("xlsx", "csv", f), row.names = FALSE)
})
Error in file(con, "r") : invalid 'description' argument In addition: Warning message:
In unzip(xlsxFile, exdir = xmlDir) : error 1 in extracting from zip file
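If it helps, one way to read legacy .xls workbooks (which fread cannot parse, since they are a binary format rather than delimited text) is the readxl package. A minimal sketch, assuming each workbook holds its data on the first sheet and all files share the same columns:

library(readxl)       # read_excel() reads both .xls and .xlsx
library(data.table)

files <- list.files(path = "F:\\MUMuniv\\machine learning class\\price sensitivty\\PS works\\Customer files",
                    pattern = "\\.xls$", full.names = TRUE)

# read the first sheet of each workbook, then stack the results
all.files  <- lapply(files, read_excel, sheet = 1)
final.data <- rbindlist(all.files, fill = TRUE)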


Wait until the blob storage folder is created

I would like to download a picture into a blob folder, but I need to create the folder first.
The code below is what I am doing.
The issue is that the folder needs time to be created: when execution reaches with open(abs_file_name, "wb") as f: it cannot find the folder.
I am wondering whether there is an 'await'-like way to know that the folder creation has completed before doing the write operation.
for index, row in data.iterrows():
    url = row['Creatives']
    file_name = url.split('/')[-1]
    r = requests.get(url)
    abs_file_name = lake_root + file_name
    dbutils.fs.mkdirs(abs_file_name)
    if r.status_code == 200:
        with open(abs_file_name, "wb") as f:
            f.write(r.content)
The final sub-folder will not be created the way you expect when dbutils.fs.mkdirs() is given a file path on blob storage.
It creates an object with the final sub-folder's name, and that object is treated as a directory rather than a file. Look at the following demonstration:
dbutils.fs.mkdirs('/mnt/repro/s1/s2/s3.csv')
When I try to open this path as a file, the error says that it is a directory.
This is likely the issue with your code, so try the following instead:
for index, row in data.iterrows():
    url = row['Creatives']
    file_name = url.split('/')[-1]
    r = requests.get(url)
    abs_file_name = lake_root + 'fail'   # creates the fake directory (to counter the problem we are facing above)
    dbutils.fs.mkdirs(abs_file_name)
    if r.status_code == 200:
        with open(lake_root + file_name, "wb") as f:
            f.write(r.content)

Dropbox - automatic refresh token using OAuth 2.0 with offline access

I know that automatic token refreshing is not a new topic.
This is the use case that generates my problem: let's say we want to extract data from Dropbox. Below you can find the code. The first time it works perfectly: 1) the user goes to the generated link; 2) after allowing the app, they copy and paste the authorization code into the input box.
The problem arises when, some hours later, the user wants to do the same operation. How can I avoid or bypass generating a new authorization code and go straight to the operation?
As you can see in the code, for a short period it is possible to re-inject the auth code (see the commented line). But after an hour or more this is no longer possible.
Any help is welcome.
#!/usr/bin/env python3
import dropbox
import pandas as pd
from dropbox import DropboxOAuth2FlowNoRedirect

'''
Populate your app key in order to run this locally
'''
APP_KEY = ""

auth_flow = DropboxOAuth2FlowNoRedirect(APP_KEY, use_pkce=True, token_access_type='offline')
target = '/DVR/DVR/'

authorize_url = auth_flow.start()
print("1. Go to: " + authorize_url)
print("2. Click \"Allow\" (you might have to log in first).")
print("3. Copy the authorization code.")
auth_code = input("Enter the authorization code here: ").strip()
#auth_code="3NIcPps_UxAAAAAAAAAEin1sp5jUjrErQ6787_RUbJU"

try:
    oauth_result = auth_flow.finish(auth_code)
except Exception as e:
    print('Error: %s' % (e,))
    exit(1)

with dropbox.Dropbox(oauth2_refresh_token=oauth_result.refresh_token, app_key=APP_KEY) as dbx:
    dbx.users_get_current_account()
    print("Successfully set up client!")
    for entry in dbx.files_list_folder(target).entries:
        print(entry.name)

    # function to get the list of files in a folder
    def dropbox_list_files(path):
        try:
            files = dbx.files_list_folder(path).entries
            files_list = []
            for file in files:
                if isinstance(file, dropbox.files.FileMetadata):
                    metadata = {
                        'name': file.name,
                        'path_display': file.path_display,
                        'client_modified': file.client_modified,
                        'server_modified': file.server_modified
                    }
                    files_list.append(metadata)
            df = pd.DataFrame.from_records(files_list)
            return df.sort_values(by='server_modified', ascending=False)
        except Exception as e:
            print('Error getting list of files from Dropbox: ' + str(e))

    def create_links(target, csvfile):
        filesList = []
        print("creating links for folder " + target)
        files = dbx.files_list_folder('/' + target)
        filesList.extend(files.entries)
        print(len(files.entries))
        while files.has_more:
            files = dbx.files_list_folder_continue(files.cursor)
            filesList.extend(files.entries)
            print(len(files.entries))
        for file in filesList:
            if isinstance(file, dropbox.files.FileMetadata):
                filename = file.name + ',' + file.path_display + ',' + str(file.size) + ','
                link_data = dbx.sharing_create_shared_link(file.path_lower)
                filename += link_data.url + '\n'
                csvfile.write(filename)
                print(file.name)
            else:
                create_links(target + '/' + file.name, csvfile)

    # create links for all files in the folder belgeler
    create_links(target, open('links.csv', 'w', encoding='utf-8'))

    listing = dbx.files_list_folder(target)
    # todo: add implementation for files_list_folder_continue
    for entry in listing.entries:
        if entry.name.endswith(".pdf"):
            # note: this simple implementation only works for files in the root of the folder
            res = dbx.sharing_get_shared_links(target + entry.name)
            #f.write(res.content)
            print('\r', res)

How to update an .RDS file with the files a user uploads and store it for the next time in Shiny (on the server)?

Basically, there are 50 files of about 50 MB each, so I processed them into roughly 30k rows of processed data saved as an .RDS file to render some visualizations.
However, the user needs the app to accept uploads of recent files and keep updating the visualizations.
Is it possible to update the .RDS file with the files the user uploads?
Can the user access this updated .RDS file the next time (in a new session)?
In the following example, there is an upload input and the uploaded file is simply rendered.
Can we store the uploaded files somehow, so that we can use them to update the .RDS file?
relevant links:
R Shiny Save to Server
https://shiny.rstudio.com/articles/persistent-data-storage.html
library(shiny)

# Define UI for data upload app ----
ui <- fluidPage(
  # App title ----
  titlePanel("Uploading Files"),
  # Sidebar layout with input and output definitions ----
  sidebarLayout(
    # Sidebar panel for inputs ----
    sidebarPanel(
      # Input: Select a file ----
      fileInput("files", "Upload", multiple = TRUE, accept = c(".csv")),
      # In order to update the database when the user clicks the action button
      actionButton("update_database", "update Database")
    ),
    # Main panel for displaying outputs ----
    mainPanel(
      # Output: Data file ----
      dataTableOutput("tbl_out")
    )
  )
)

# Define server logic to read selected files ----
server <- function(input, output, session) {
  lst1 <- reactive({
    validate(need(input$files != "", "select files..."))
    ##### Here, the .RDS file can be accessed and updated for visualization
    if (is.null(input$files)) {
      return(NULL)
    } else {
      path_list <- as.list(input$files$datapath)
      tbl_list <- lapply(input$files$datapath, read.table, header = TRUE, sep = ";")
      df <- do.call(rbind, tbl_list)
      return(df)
    }
  })
  output$tbl_out <- renderDataTable({
    ##### Can also access this data for visualizations
    lst1()
  })
  session$onSessionEnded({
    observeEvent(input$update_database, {
      s3save(appended_data, object = "object_path_of_currentRDS_file", bucket = "s3_bucket")
    })
  })
}

# Create Shiny app ----
shinyApp(ui, server)
But there was an error:
Warning: Error in private$closedCallbacks$register: callback must be a function
50: stop
49: private$closedCallbacks$register
48: session$onSessionEnded
47: server [/my_folder_file_path.R#157]
Error in private$closedCallbacks$register(sessionEndedCallback) :
callback must be a function
Solution:
Replace this code

session$onSessionEnded({
  observeEvent(input$update_database, {
    s3save(appended_data, object = "object_path_of_currentRDS_file", bucket = "s3_bucket")
  })
})

## with this

observeEvent(input$update_database, {
  s3saveRDS(appended_data, object = "object_path_of_currentRDS_file", bucket = "s3_bucket")
})
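If S3 is not required, the same idea can be sketched with plain local files. A minimal sketch, assuming the app has write access to a persistent folder on the server; the path, the appended_data layout, and the assumption that old and new data share the same columns are mine, not the original poster's:

observeEvent(input$update_database, {
  rds_path <- "data/processed_data.rds"            # hypothetical persistent location on the server
  old_data <- if (file.exists(rds_path)) readRDS(rds_path) else NULL
  new_data <- lst1()                               # data built from the files just uploaded
  appended_data <- rbind(old_data, new_data)       # assumes identical columns in old and new data
  saveRDS(appended_data, rds_path)                 # the updated .RDS is available to later sessions
})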

Quanteda: error message while tokenizing "unable to find an inherited method for function ‘tokens’ for signature ‘"corpus"’"

I have been trying to tokenise and clean my 400 txt documents before using structured topic modelling (STM). I wanted to remove punctuations, stopwords, symbols, etc. However, I get the following error message:
"Error in (function (classes, fdef, mtable): unable to find an inherited method for function ‘tokens’ for signature ‘"corpus"’". This is my original code:
answers2 <- tokens(answers_corpus, what = c("word"), remove_numbers = TRUE, remove_punct = TRUE,
                   remove_symbols = TRUE, remove_separators = TRUE,
                   remove_twitter = TRUE, remove_hyphens = TRUE, remove_url = TRUE,
                   ngrams = 1L, verbose = quanteda_options("verbose"), include_docvars = TRUE, text_field = "text")
I also tried to tokenise a simple string text - just to check whether it was an encoding problem while importing my txt files - but I got the same error message, plus a couple of extra ones when I tried to tokenise the text directly, without converting it to a corpus: "Error: Unable to locate Ciao bella ciao" and "Error: No language specified!". Here is my example code in case someone wants to replicate the error message:
prova <- c("Ciao bella ciao")
prova2 <- "Ciao bella ciao"
prova_corpus <- corpus(prova)
prova2_corpus <- corpus(prova2)
prova_tok <- tokens(prova2_corpus)
prova2_tok <- tokens(prova_corpus)
The packages that are loaded are: data.table, ggplot2, quanteda, readtext, stm, stringi, stringr, tm, textstem. Any suggestion on how I could proceed to tokenise and clean my texts?
After several attempts, I managed to find a solution. When several text analysis/topic modelling packages are loaded in RStudio, their "tokens" functions can mask one another. You need to force the call to use quanteda's tokens, i.e. quanteda::tokens(answers). Here is the updated code:
answers2 <- quanteda::tokens(answers_corpus, what = c("word"), remove_numbers = TRUE, remove_punct = TRUE,
                             remove_symbols = TRUE, remove_separators = TRUE,
                             remove_twitter = TRUE, remove_hyphens = TRUE, remove_url = TRUE,
                             verbose = quanteda_options("verbose"), include_docvars = TRUE, text_field = "text")
And the updated example code too:
prova <- c("Ciao bella ciao")
prova2 <- "Ciao bella ciao"
prova_corpus <- corpus(prova)
prova2_corpus <- corpus(prova2)
prova_tok <- quanteda::tokens(prova2_corpus)
prova2_tok <- quanteda::tokens(prova_corpus)
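As a quick check (not part of the original answer), you can ask R which attached packages define a tokens object; the first entry returned is the one a bare tokens() call would use, assuming more than one package on the search path exports that name:

# list every attached package that defines an object called "tokens"
find("tokens")

# calling the quanteda version explicitly side-steps any masking
prova_tok <- quanteda::tokens(prova_corpus)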

R dynamic SQL query using RODBC and export to .csv

If I run the following code in RStudio it works, but only because I have set Sys.sleep. I have a large batch of queries to run and I don't know how long each will take. If I exclude the Sys.sleep then the exports are blank, as the export runs before the query is complete. Is there a way of getting R to wait until the query is complete?
# setup
# install.packages("stringr", dependencies = TRUE)
require(stringr)
library(RODBC)

# odbc connection
db <- odbcDriverConnect("dsn=DW Master;uid=username;pwd=password;")

# sql to be run
qstr <- "select top 10 * from prod"

# variable
weeknum <- c('201401', '201402', '201403')

for (i in weeknum) {
  data <- sqlQuery(db, qstr, believeNRows = FALSE)
  Sys.sleep(10)
  filename <- paste("data_", str_trim(i), ".csv")
  filename
  write.csv(data, file = filename)
}
From this SO post, try adding the rows_at_time argument:
data <- sqlQuery(db, qstr, believeNRows = FALSE, rows_at_time = 1)
Alternatively you can break up the two processes:
# QUERIES TO DATA FRAMES
weeknum <- c('201401', '201402', '201403')
for (i in weeknum) {
  data <- sqlQuery(db, qstr, believeNRows = FALSE, rows_at_time = 1)
  assign(paste("data", i, sep = ""), data)
}

# DATA FRAMES TO CSV FILES
dfList <- c('data201401', 'data201402', 'data201403')
for (n in dfList) {
  df <- get(n)
  filename <- paste(n, ".csv", sep = "")
  write.csv(df, file = filename)
}
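Note that qstr in the question never actually uses the week number, so every iteration runs the same statement. If the intent is one query per week, here is a minimal sketch of a truly dynamic query; the weeknum column in prod is an assumption on my part, not shown in the question:

weeknum <- c('201401', '201402', '201403')
for (i in weeknum) {
  # build the statement for this week; sprintf keeps the template readable
  qstr <- sprintf("select top 10 * from prod where weeknum = '%s'", i)
  data <- sqlQuery(db, qstr, believeNRows = FALSE, rows_at_time = 1)
  write.csv(data, file = paste0("data_", i, ".csv"), row.names = FALSE)
}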