how to EXTRACT against a ZIP file containing multiple CSV datasets - azure-data-lake

I understand that if we have one zip file containing a single csv file, we can simply EXTRACT it:
DECLARE @file1 string = @"/input/input.csv.zip";

@file =
    EXTRACT col1 string,
            col2 string,
            col3 string
    FROM @file1
    USING Extractors.Csv(silent: true);
However, what if we have multiple csv files in one zip:
inputfiles.zip
-file1.csv
-file2.csv
-file3.csv
How do we EXTRACT / SELECT from inputfiles.zip?

U-SQL cannot extract it natively, but you can create your own extractor to do that.
I used the code from this post and it works:
https://ryansimpson.net/2016/10/15/query-zipfile-adla/
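For orientation, a custom extractor along those lines might look roughly like the sketch below. This is not the code from the linked post; the single string column, UTF-8 encoding and class name are assumptions of mine.

using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Text;
using Microsoft.Analytics.Interfaces;

// Sketch: walk every entry in the zip and emit each line as one string column,
// leaving the comma-splitting to a later SELECT.
[SqlUserDefinedExtractor(AtomicFileProcessing = true)]  // the whole archive must be read by a single vertex
public class ZipCsvExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        using (var archive = new ZipArchive(input.BaseStream, ZipArchiveMode.Read))
        {
            foreach (var entry in archive.Entries)
            {
                using (var reader = new StreamReader(entry.Open(), Encoding.UTF8))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        output.Set(0, line);            // column 0 = the raw csv line
                        yield return output.AsReadOnly();
                    }
                }
            }
        }
    }
}

After registering the assembly, the U-SQL side would then be along the lines of EXTRACT line string FROM @file1 USING new ZipCsvExtractor();, with a later SELECT splitting line into its columns.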

U-SQL cannot do this natively. Consider using Data Factory (e.g. a ForEach loop with child items) to extract the files first.

Having worked through this recently, the go-to blog post on this looks out of date, i.e. Azure Data Factory's Get Metadata activity no longer deals with zip files. Instead, the Copy activity can handle the zip directly. I've tried to document the approach without relying on screenshots from the GUI.
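Purely as an illustrative sketch (this is not the write-up referred to above, and the property names can differ between dataset types and ADF versions), a delimited-text source dataset that points the Copy activity at the zip might look something like this:

{
  "name": "ZippedCsvSource",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": { "referenceName": "MyBlobStorage", "type": "LinkedServiceReference" },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "input",
        "fileName": "inputfiles.zip"
      },
      "columnDelimiter": ",",
      "compressionCodec": "ZipDeflate"
    }
  }
}

The idea is that the Copy activity decompresses the archive as it copies, so the individual CSVs come out the other side without a separate unzip step.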

Related

Merging multiple files in Pig

I have several files (around 10 files) which I would like to merge together in Pig:
Student01.txt
Student02.txt
...
Student10.txt
I am aware that I could merge two datasets together by:
data = UNION Student01, Student02;
Is there any way that I could iterate over a loop to merge the datasets from Student01 to Student10?
Assuming the files are in the same format, the LOAD command allows you to read them all if you give it a directory or a glob.
From the docs:
The input data to the load can be a file, a directory or a glob
Example
STUDENTS = LOAD '/path/to/students/Student*.txt' USING PigStorage();
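For example, with an assumed tab-separated layout of (name, score), the whole merge collapses to one LOAD; the schema below is purely illustrative:

-- glob over all ten files; adjust the delimiter and schema to match the real data
STUDENTS = LOAD '/path/to/students/Student*.txt'
           USING PigStorage('\t')
           AS (name:chararray, score:int);
DUMP STUDENTS;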

Create an ADF Dataset to load multiple csv files (same format) from the Blob

I am trying to create a dataset containing multiple csv files from the Blob. In the file path of the dataset settings I create a parameter - @dataset().FolderName - and add FolderName under Parameters.
I leave the file name (from File Path) empty because I want to grab all files in the folder. However, there is no data when I preview the data. Is there anything missing? Thank you
I have tested this on my side and it works fine: add the FolderName parameter to the dataset and preview the data.
If you want to merge all csv files in Data Flow, you can do this:
1. Output to single file
2. Set Single partition
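For reference, a parameterised dataset like the one described corresponds roughly to JSON along these lines; the container, names and exact property layout are placeholders and may differ between ADF versions:

{
  "name": "CsvFolderDataset",
  "properties": {
    "type": "DelimitedText",
    "linkedServiceName": { "referenceName": "MyBlobStorage", "type": "LinkedServiceReference" },
    "parameters": { "FolderName": { "type": "string" } },
    "typeProperties": {
      "location": {
        "type": "AzureBlobStorageLocation",
        "container": "data",
        "folderPath": { "value": "@dataset().FolderName", "type": "Expression" }
      },
      "columnDelimiter": ",",
      "firstRowAsHeader": true
    }
  }
}

Leaving the file name out of the location is what makes the dataset pick up every file in the folder.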

spark RDD saveAsTextFile does not use the specified filename

I have some code like this
wordCounts
  .map { case (word, count) =>
    Seq(word, count).mkString("\t")
  }
  .coalesce(1, true)
  .saveAsTextFile("s3n://mybucket/data/myfilename.csv")
However, myfilename.csv was created as a directory in my S3 bucket, and the actual file is always something like myfilename.csv/part-00000. Is there a way I can change the name of the file I am writing to? Thanks!
I strongly suggest that you use the spark-csv package from Databricks to read and write csv files in Spark. One of the (many) benefits of using this package is that it allows you to specify the name of the output csv file :)
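For what it's worth, a minimal sketch with spark-csv (Spark 1.x-era syntax; the DataFrame construction, column names and output path are assumptions, not taken from the question):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// build a DataFrame from the (word, count) pairs -- assumed shape of wordCounts
val df = wordCounts.toDF("word", "count")

df.coalesce(1)                              // one partition -> one part file
  .write
  .format("com.databricks.spark.csv")
  .option("delimiter", "\t")
  .save("s3n://mybucket/data/wordcounts")   // Spark still writes part files under this path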

Using Pentaho Kettle, how can I convert a csv using commas to a csv with pipe delimiters?

I have a CSV input file with commas. I need to change the delimiter to pipe. Which step should I use in Pentaho Kettle? Please suggest.
Thanks!
Do not use a big gun to shoot a small target: sed or awk will do. Or, if you want to integrate it with Kettle, use a step that runs a shell script and call sed inside that script, for example.
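For example, the sed variant is a one-liner (note that it naively replaces every comma, including any inside quoted fields):

# swap every comma for a pipe; fine as long as no field contains an embedded comma
sed 's/,/|/g' input.csv > output.csv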
If your goal is to output a pipe separated CSV file from data within a transform and you're already running Kettle, just use a Text File output step.
If the goal is to do something unusual with CSV data within the transform itself, you might look into the Concat Fields step.
If the goal is simply to take a CSV file and write out another CSV with different separators, use the solution @martinnovoty suggests.
You can achieve this easily:
Add a JavaScript step after the step that loads your csv into a variable "foo", and add this code to the JS step:
var newFoo = replace(foo,",", "|");
Now your csv content is loaded into the newFoo variable with pipes.

Using R functions lapply and read.csv.sql

I am trying to open multiple csv files using a list such as the one below:
filenames <- list.files("temp", pattern="*.csv", full.names=TRUE)
I have found examples that use lapply and read.csv to open all the files in the temp directory, but I know a priori what data I need to extract from each file, so to save reading time I want to use the SQL extension of this:
somefile = read.csv.sql("temp/somefile.csv", sql="select * from file ",eol="\n")
However, I am having trouble combining these two pieces of functionality into a single command so that I can read all the files in a directory while applying the same SQL query.
Has anybody had success doing this?
If you want a list of data frames, one per file (assuming your working directory contains the .csv files):
library(sqldf)
filenames <- list.files(".", pattern="*.csv")
df.list <- sapply(filenames, read.csv.sql, sql="select * from file", eol="\n", simplify=FALSE)
Or if you want them all combined:
library(plyr)
df <- ldply(filenames, read.csv.sql, sql="select * from file", eol="\n")