Using R functions lapply and read.csv.sql - sql

I am trying to open multiple csv files using a list such as the one below:
filenames <- list.files("temp", pattern="*.csv", full.names=TRUE)
I have found examples that use lapply and read.csv to open all the files in the temp directory, but I know a priori what data I need to extract from each file, so to save time reading I want to use the SQL extension of this:
somefile = read.csv.sql("temp/somefile.csv", sql="select * from file ",eol="\n")
However, I am having trouble combining these two pieces of functionality into a single command so that I can read all the files in a directory while applying the same SQL query.
Has anybody had success doing this?

If you want a list of data frames, one per file (assuming your working directory contains the .csv files):
library(sqldf)  # provides read.csv.sql
filenames <- list.files(".", pattern = "*.csv")
df.list <- sapply(filenames, read.csv.sql, sql = "select * from file", eol = "\n", simplify = FALSE)
Or if you want them all combined:
library(plyr)  # provides ldply
df <- ldply(filenames, read.csv.sql, sql = "select * from file", eol = "\n")

Related

Merging multiple files in Pig

I have several files (around 10 files) which I would like to merge together in Pig:
Student01.txt
Student02.txt
...
Student10.txt
I am aware that I could merge two datasets together by:
data = UNION Student01, Student02;
Is there any way that I could iterate over a loop to merge the dataset from Student01 to Student10?
Assuming the files are all in the same format, the LOAD command lets you read all of them at once if you give it a directory or a glob.
From the docs:
The input data to the load can be a file, a directory or a glob
Example
STUDENTS = LOAD '/path/to/students/Student*.txt' USING PigStorage();

pyspark dataframe writing csv files twice in s3

I have created a PySpark dataframe and am trying to write it to an S3 bucket in CSV format. The file is written, but the issue is that it is written twice (i.e., one file with the actual data and another that is empty). I have checked the dataframe by printing it and the data looks fine. Please suggest a way to prevent the empty file from being created.
Code snippet:
df = spark.createDataFrame(data=dt1, schema = op_df.columns)
df.write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
One possible solution to make sure that the output will include only one file is to do repartition(1) or coalesce(1) before writing.
So something like this:
df.repartition(1).write.option("header", "true").csv("s3://" + src_bucket_name + "/src/output/" + row.brand + '/' + fileN)
Note that having one partition doesn't necessarily mean that it will result in one file, as this can also depend on the spark.sql.files.maxRecordsPerFile configuration. Assuming this config is set to 0 (the default), you should get only one file in the output.
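If you want to be explicit about that, you can check or pin the setting before writing. A minimal sketch, assuming spark is the existing SparkSession from the question and output_path is a placeholder for the S3 destination built above:
# 0 (the default) means no per-file record limit, so one partition yields one output file
spark.conf.set("spark.sql.files.maxRecordsPerFile", 0)
df.coalesce(1).write.option("header", "true").mode("overwrite").csv(output_path)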

Not able to filter files using pathGlobFilter

We are trying to read files from a directory in Azure Blob Storage based on a pattern, using the
pathGlobFilter option to select the files. The directory contains the following files:
Sales_51820_14529409_T_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_51820_14529409_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_61820_17529409_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_61820_17529409_T_7a3cc7d1d17261fd17e7e1fabd3.csv
We need to process only those files which do not have "T" in the file name, i.e. only these two files:
Sales_51820_14529409_7a3cc7d1d17261fd17e7e1fabd3.csv
Sales_61820_17529409_7a3cc7d1d17261fd17e7e1fabd3.csv
But we are not able to read only these two files.
Here is the code:
df = spark.read.format("csv").schema(structSchema).options(header=False,inferSchema=True,sep='|',pathGlobFilter="Sales_\d{5}_\d{8}_[a-z0-9]+.csv$").load("wasbs://abc#xxxxx.blob.core.windows.net/abc/2022/02/11/")
A glob is not a standard regular expression; there are differences between the two. For example, glob has no quantifier syntax such as \d{5} for matching a pattern a given number of times.
For details, see here.
Coming back to this question, a somewhat clumsy workaround is the following; hopefully someone can suggest a more elegant solution:
pathGlobFilter="Sales_[0-9][0-9][0-9][0-9][0-9]_[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]_[a-z0-9]*.csv"

Importing a *random* csv file from a folder into pandas

I have a folder with several csv files, with file names between 100 and 400 (e.g. 142.csv, 278.csv, etc.). Not all the numbers between 100 and 400 are associated with a file; for example, there is no 143.csv. I want to write a loop that imports 5 random files into separate dataframes in pandas instead of manually searching and typing out the file names over and over. Any ideas to get me started with this?
You can use glob to list all the csv files in the directory and then read a random sample of them.
import glob
import numpy as np
import pandas as pd

files = glob.glob('*.csv')
random_files = np.random.choice(files, 5, replace=False)  # replace=False avoids picking the same file twice
dataframes = []
for fp in random_files:
    dataframes.append(pd.read_csv(fp))
This picks 5 random files from the directory and reads each one into its own dataframe. Hope this answers your question.
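If you also want to remember which file each dataframe came from, a dict keyed by filename works too; a small alternative sketch using the standard-library random module, under the same assumption that the csv files sit in the working directory:
import glob
import random
import pandas as pd

files = glob.glob('*.csv')
picked = random.sample(files, 5)                 # 5 distinct files, no repeats
frames = {fp: pd.read_csv(fp) for fp in picked}  # filename -> dataframe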

R: can a function in a package read a text file in the same package?

The use case is as follows: I have created a package which contains functions that run queries against a database. Instead of defining the query in an R script, e.g. (excuse the pseudo-code):
house_price_query <- "select * from house_prices"
get.house_prices <- function() run_query(house_price_query)
Can I save the query as a text file in, say, queries/house_prices.sql and then read this text file in to build the query?
You can put your file house_prices.sql in inst/queries in your package folder, and then (once your package is installed) read it from within R with:
path <- system.file(file.path("queries", "house_prices.sql"), package = "your_package_name")
house_price_query <- paste(readLines(path), collapse = "\n")