In the R disk.frame package, how to read CSV files using `csv_to_disk.frame()`? - tidyverse

There are CSV files in "D:\dc", as shown in the image. When I run `csv_to_disk.frame` to read them, it fails. Can anyone help? Thanks!
library(disk.frame)
library(tidyverse)
files <- file.path("D:\\dc\\*.csv")
mydisk_frame <- csv_to_disk.frame(
  files,
  outdir = "C:\\mydisk_frame.df",
  overwrite = TRUE
)
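One likely culprit is the wildcard path: as far as I know, `csv_to_disk.frame()` expects concrete file paths rather than a glob pattern, so a sketch along these lines, which expands the directory with base R's list.files() first, may work:

library(disk.frame)
setup_disk.frame()  # start the parallel workers disk.frame uses

# expand the directory into concrete file paths;
# a bare "D:\\dc\\*.csv" glob may not be expanded by csv_to_disk.frame()
files <- list.files(path = "D:\\dc", pattern = "\\.csv$", full.names = TRUE)

mydisk_frame <- csv_to_disk.frame(
  files,
  outdir = "C:\\mydisk_frame.df",
  overwrite = TRUE
)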

Related

How to prevent errors in format (string-variable) when downloading data from Google Big Query to R?

I have some data (Tweets streamed from Twitter's REST API) stored in Google BigQuery which, in the preview, looks like this:
'I’m up by myself.'
However, when I download it into R, it looks like this (the curly apostrophe comes through garbled):
'I’m up by myself.'
Is there any way to prevent it?
I am using this code to download the data in R:
library(bigrquery)
project_id <- "my_project"
sql_string <- "
  SELECT text
  FROM my_under_project.my_table
  LIMIT 500;"
test <- query_exec(sql_string, project = project_id, useLegacySql = FALSE, allowLargeResults=TRUE, max_pages = Inf)
str(test)
# 'data.frame': 500 obs. of 1 variable:
#$ text: chr "tweets" ...
The data in 'text' is stored as a string in BigQuery.
Any help is appreciated! Thanks in advance!
I downloaded the data with 'bq_table_download' (instead of query_exec) from the same package and that solved the problem!
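For reference, a minimal sketch of that working approach with the newer bq_* interface in bigrquery (project, dataset and table names are the placeholders from the question):

library(bigrquery)

project_id <- "my_project"
sql_string <- "SELECT text FROM my_under_project.my_table LIMIT 500;"

# run the query, then fetch the result with bq_table_download(),
# which is what returned the text with its special characters intact here
tb   <- bq_project_query(project_id, sql_string)
test <- bq_table_download(tb)
str(test)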
Special characters when importing from BigQuery to R

Read a locked .db file from Photos App with R

I want to analyze my Photos App data using R. I know there is a photos.db file in my Library which, I'd guess, has all the metadata on my pictures. I tried opening the file but I am getting: Error in result_create(conn@ptr, statement) : database is locked
# To open any db file, you simply:
library(RSQLite)
filename <- "~/Pictures/MyLibrary.photoslibrary/database/photos.db"
sqlite.driver <- dbDriver("SQLite")
db <- dbConnect(sqlite.driver, dbname = filename)
dbListTables(db)
Any idea how I can analyze Photos App data, or how to open this file as a read-only database?
Thanks! :)
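Not an authoritative fix, but a common workaround for a locked SQLite file is to copy it somewhere else and open the copy read-only; a sketch with RSQLite (the library path is the one from the question):

library(RSQLite)

src  <- "~/Pictures/MyLibrary.photoslibrary/database/photos.db"
copy <- file.path(tempdir(), "photos.db")
file.copy(src, copy, overwrite = TRUE)   # work on a copy so the live library is never touched

# open the copy read-only
db <- dbConnect(SQLite(), dbname = copy, flags = SQLITE_RO)
dbListTables(db)
dbDisconnect(db)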

RevoScaleR cannot find a file/directory that exists

I am using RevoScaleR and I have successfully converted CSV files to xdf files, which I have saved to my local disk.
However, when I try to run functions that call these xdf files I get an error message that there is no such file or directory:
The file or directory 'P:/PROPENSITY/CL_Generic_Retail_201506' cannot be found.
Let me walk through the whole process:
My working directory:
> getwd()
[1] "P:/PROPENSITY"
I used this code to convert csv file to xdf:
rx_CL_Generic_Retail_201506 <- rxImport(
  inData = "CL_Generic_Retail_201506_23-05-2017.csv",
  outFile = "CL_Generic_Retail_201506.xdf",
  overwrite = TRUE
)
Then I used this code to check that the conversion was successful:
rxSummary(formula = ~ Avg_Deposits + Total_Num_ + Sumof_CC_AVGBAL_,
          data = "CL_Generic_Retail_201506.xdf")
Summary Statistics Results for: ~Avg_Deposits + Total_Num_ + Sumof_CC_AVGBAL_
Data: "CL_Generic_Retail_201506.xdf" (RxXdfData Data Source)
File name: CL_Generic_Retail_201506.xdf
Number of valid observations: 7155413
Name Mean StdDev Min Max ValidObs MissingObs
Avg_Deposits 4562.914627 128614.5683 -325684032 69317080.0 7155413 0
Total_Num_ 7.062068 247.1506 1 224579.0 831567 6323846
Sumof_CC_AVGBAL_ 951.484138 2249.3149 0 164746.6 601304 6554109
Up to that point everything was fine.
I continued to convert files to xdf files.
Then I returned to that same file and tried to run the same function (summary) and I got the following error message:
> rxSummary(formula = ~ Avg_Deposits + Total_Num_ + Sumof_CC_AVGBAL_,
+           data = "CL_Generic_Retail_201506.xdf")
The file or directory 'CL_Generic_Retail_201506.xdf' cannot be found.
If I repeat the process and run rxImport again, rxSummary works again. But then, after a while, the same error comes back.
Could this have to do with backslashes?
I.e., the message is:
The file or directory 'P:\PROPENSITY\CL_Generic_Retail_201506.xdf' cannot be found.
But when I ask R to print the working directory it returns:
> getwd()
[1] "P:/PROPENSITY"
Observe that in the RevoScaleR error message the slashes are \ while R's output of getwd() has /.
If this is the problem, what could I do about it?
By the way, this problem occurs on a workstation where Windows and RevoScaleR are installed. On a notebook also running RevoScaleR, the problem does not appear.
I would appreciate any suggestion.
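To rule out a path or working-directory problem before re-importing, one can check from R whether the xdf is actually where rxSummary is looking (paths as in the question):

# does the file exist relative to the working directory, and with the full path?
file.exists("CL_Generic_Retail_201506.xdf")
file.exists("P:/PROPENSITY/CL_Generic_Retail_201506.xdf")

# forward and back slashes are interchangeable on Windows;
# normalizePath() shows the canonical path R resolves to
normalizePath("CL_Generic_Retail_201506.xdf", winslash = "/", mustWork = FALSE)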
---------------------------------------------------------------------------
Here is an image of the directory where it is apparent that the files exist.
Image of the PROPENSITY folder with the xdf files
Try using append = "rows". The last CSV is probably empty, resulting in overwriting the xdf with an empty xdf, which is no file at all.
rx_CL_Generic_Retail_201506 <- rxImport(
  inData = "CL_Generic_Retail_201506_23-05-2017.csv",
  outFile = "CL_Generic_Retail_201506.xdf",
  overwrite = TRUE,
  append = "rows"
)

BigQuery error (ASCII 0) encountered for external table and when loading table

I'm getting this error
"Error: Error detected while parsing row starting at position: 4824. Error: Bad character (ASCII 0) encountered."
The data is not compressed.
My external table points to multiple CSV files, and one of them contains a couple of lines with that character. In my table definition I added "MaxBadRecords", but that had no effect. I also get the same problem when loading the data in a regular table.
I know I could use Dataflow or even try to fix the CSVs, but is there an alternative that does not involve writing a parser, and is hopefully just as easy and efficient?
is there an alternative that does not involve writing a parser, and is hopefully just as easy and efficient?
Try the below in the Google Cloud SDK Shell (using the tr utility):
gsutil cp gs://bucket/badfile.csv - | tr -d '\000' | gsutil cp - gs://bucket/fixedfile.csv
This will:
Read your "bad" file
Remove ASCII 0
Save the "fixed" content into a new file
After you have the new file, just make sure your table now points to that fixed one.
Sometimes a stray final byte appears in the file.
What can help is replacing it with:
tr '\0' ' ' < file1 > file2
You can clean the file using an external tool like Python or PowerShell. There is no way to load a file containing ASCII 0 into BigQuery.
This is a script that can clean the file with Python:
import os
import shutil
from tempfile import mkstemp

def replace_chars(file_path, original_string, new_string):
    # Create a temp file to write the cleaned content into
    fh, abs_path = mkstemp()
    with os.fdopen(fh, 'w', encoding='utf-8') as new_file:
        with open(file_path, encoding='utf-8', errors='replace') as old_file:
            print("\nCurrent line: \t")
            i = 0
            for line in old_file:
                print(i, end="\r", flush=True)
                i = i + 1
                line = line.replace(original_string, new_string)
                new_file.write(line)
    # Copy the file permissions from the old file to the new file
    shutil.copymode(file_path, abs_path)
    # Remove original file
    os.remove(file_path)
    # Move new file into place
    shutil.move(abs_path, file_path)
The same but for PowerShell:
(Get-Content "C:\Source.DAT") -replace "`0", " " | Set-Content "C:\Destination.DAT"

Unzip a file to s3

I am looking for a simple way to extract a zip/gzip present in an S3 bucket to the same bucket location and delete the parent zip/gzip file after extraction.
I am unable to achieve this with any of the APIs currently.
I have tried native boto, pyfilesystem (fs), and s3fs.
The source and destination links seem to be an issue for these functions.
(Using Python 2.x/3.x & Boto 2.x.)
I see there is an API for Node.js (unzip-to-s3) to do this job, but none for Python.
A couple of implementations I can think of:
A simple API to extract the zip file within the same bucket.
Use S3 as a filesystem and manipulate the data.
Use a data pipeline to achieve this.
Transfer the zip to EC2, extract, and copy back to S3.
Option 4 would be the least preferred, to minimise the architectural overhead of the EC2 add-on.
I need support getting this feature implemented, with integration with Lambda at a later stage. Any pointers to these implementations are greatly appreciated.
Thanks in advance,
Sundar.
You could try https://www.cloudzipinc.com/, which unzips/expands several different formats of archives from S3 into a destination in your bucket. I used it to unzip components of a digital catalog into S3.
Solved by using an EC2 instance: copy the S3 files to a local directory on EC2, extract them there, and copy that directory back to the S3 bucket.
Sample code to unzip to a local directory on the EC2 instance:
import os
import logging
import tarfile
import zipfile

import boto  # boto 2.x

logger_s3 = logging.getLogger("s3")
s3Conn = boto.connect_s3()  # credentials taken from the environment / boto config

def s3Unzip(srcBucket, dst_dir):
    '''
    function to decompress the s3 bucket contents to the local machine
    Args:
        srcBucket (string): source bucket name
        dst_dir (string): destination location in the local/ec2 local file system
    Returns:
        None
    '''
    s3 = s3Conn
    bucket = s3.lookup(srcBucket)
    try:
        os.mkdir(dst_dir)
        print("local directories created")
    except Exception:
        logger_s3.warning("Exception in creating local directories to extract zip file / folder already exists")
    for key in bucket:
        # download the object, then pick an extractor based on its extension
        path = os.path.join(dst_dir, key.name)
        key.get_contents_to_filename(path)
        if path.endswith('.zip'):
            opener, mode = zipfile.ZipFile, 'r'
        elif path.endswith('.tar.gz') or path.endswith('.tgz'):
            opener, mode = tarfile.open, 'r:gz'
        elif path.endswith('.tar.bz2') or path.endswith('.tbz'):
            opener, mode = tarfile.open, 'r:bz2'
        else:
            raise ValueError('unsupported format')
        cwd = os.getcwd()
        os.chdir(dst_dir)
        try:
            archive = opener(path, mode)
            try:
                archive.extractall()
            finally:
                archive.close()
            logger_s3.info('(%s) extracted successfully to %s' % (key, dst_dir))
        except Exception:
            logger_s3.error('failed to extract (%s) to %s' % (key, dst_dir))
        os.chdir(cwd)
    s3.close()
Sample code to upload to a MySQL instance, using the "LOAD DATA LOCAL INFILE" query to upload to MySQL directly:
def upload(file_path, timeformat):
    '''
    function to upload csv file data to mysql rds
    Args:
        file_path (list): list of local csv file paths
        timeformat (string): datetime format string for str_to_date()
    Returns:
        None
    '''
    for file in file_path:
        con = connect()      # helper returning a MySQL connection (not shown)
        cursor = con.cursor()
        try:
            qry = """LOAD DATA LOCAL INFILE '%s' INTO TABLE xxxx
                     FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
                     (col1, col2, col3, @datetime, col4)
                     SET datetime = str_to_date(@datetime, '%s');""" % (file, timeformat)
            cursor.execute(qry)
            con.commit()
            logger_rds.info("Loading file: " + file)
        except Exception:
            logger_rds.error("Exception in uploading " + file)
            # Rollback in case there is any error
            con.rollback()
        cursor.close()
        # disconnect from server
        con.close()
Lambda function:
You can use a Lambda function where you read zipped files into a buffer, gzip the individual files, and re-upload them to S3. Then you can either archive the original files or delete them using boto.
You can also set an event-based trigger that runs the Lambda automatically every time a new zipped file lands in S3. There's a full tutorial for exactly this here: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9