Create an ADF Dataset to load multiple CSV files (same format) from Blob storage

I am trying to create a dataset containing multiple CSV files from Blob storage. In the file path of the dataset settings, I created a parameter, @dataset().FolderName, and added FolderName under Parameters.
I left the file name (in File Path) empty because I want to pick up all files in the folder. However, no data appears when I preview the data. Is anything missing? Thank you.

I have tested this on my side and it works fine:
1. Add the FolderName parameter.
2. Preview the data.
If you want to merge all the CSV files in a Data Flow, you can do this:
1. Output to a single file.
2. Set Single partition.

Related

pyspark dataframe writing csv files twice in s3

I have created a PySpark DataFrame and am trying to write it to an S3 bucket in CSV format. The file is written as CSV, but it is written twice: once with the actual data and once empty. I checked the DataFrame by printing it and it looks fine. Please suggest a way to prevent the empty file from being created.
Code snippet:
df = spark.createDataFrame(data=dt1, schema = op_df.columns)
df.write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)

One possible way to make sure that the output includes only one file is to call repartition(1) or coalesce(1) before writing.
So something like this:
df.repartition(1).write.option("header","true").csv("s3://"+ src_bucket_name+"/src/output/"+row.brand +'/'+fileN)
Note that having one partition does not necessarily mean that the result is a single file, as this can also depend on the spark.sql.files.maxRecordsPerFile configuration. Assuming this config is set to 0 (the default), you should get only one file in the output.
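As a rough way to reason about the note above, the number of part files can be estimated from the partition sizes and the maxRecordsPerFile value. This is a plain-Python sketch, not Spark code: estimate_part_files is a hypothetical helper, and it assumes that empty partitions write no file, which may not hold for every Spark version.

```python
import math


def estimate_part_files(rows_per_partition, max_records_per_file=0):
    """Estimate how many part files a write produces (hypothetical helper).

    Assumptions: empty partitions produce no file, and a value of 0 for
    max_records_per_file means "no limit" (the default mentioned above).
    """
    total = 0
    for rows in rows_per_partition:
        if rows == 0:
            continue  # assumed: nothing is written for an empty partition
        if max_records_per_file <= 0:
            total += 1  # no limit: one file per non-empty partition
        else:
            total += math.ceil(rows / max_records_per_file)
    return total


print(estimate_part_files([1000]))         # one partition, no limit
print(estimate_part_files([1000], 300))    # one partition, at most 300 rows per file
print(estimate_part_files([600, 0, 400]))  # three partitions, one of them empty
```

With a single partition and the limit left at 0, the estimate is one file, which is the case the answer relies on.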

Copy multiple files from multiple folders to a single folder in Azure Data Factory

I have a folder structure like this as a source:
Source/2021/01/01/*.xlsx files
Source/2021/03/02/*.xlsx files
Source/2021/04/03/*.xlsx files
Source/2021/05/04/*.xlsx files
I want to drop all these Excel files into a different folder called Output.
Method 1:
With a Copy activity and the Binary file format, I get the files in the Output folder, but with the source folder structure preserved (which is not a requirement).
Method 2:
With Flatten Hierarchy, I get the files in my Output folder, but each one is named with a random ID (.xlsx).
My requirement is to get the files with the same names as in the source.

This is what I suggest. I have implemented something similar in the past, and I am fairly confident it should work.
Steps:
Use a Get Metadata activity to list all the folders inside Source/2021/.
Use a ForEach (FE) loop with an If Condition that checks that the item type is Folder (so that you get folders only and no files; I know you don't have files at this level right now).
Inside the If, add an Execute Pipeline activity that points to a new pipeline taking a parameter such as:
Source/2021/01/01/
Source/2021/03/02/
The new pipeline should have a Get Metadata activity and a ForEach loop, and this time we look for files only.
Inside the ForEach loop, add a Copy activity, using the full file name as the source name.
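The control flow of the steps above (an outer loop over folders, an inner loop over files, one copy per file under its own name) can be sketched in plain Python against a local directory tree. This is only an illustration of the logic, not ADF code; flatten_copy and the sample file names are made up for the example.

```python
import shutil
import tempfile
from pathlib import Path


def flatten_copy(source_root, output_dir):
    """Copy .xlsx files from every subfolder of source_root into output_dir.

    File names are kept; the folder structure is dropped.
    """
    source_root, output_dir = Path(source_root), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    # Outer loop: folders only, like the first Get Metadata + ForEach.
    folders = sorted(p for p in source_root.rglob("*") if p.is_dir())
    for folder in folders:
        # Inner loop: files directly inside this folder, like the child pipeline.
        for item in sorted(folder.iterdir()):
            if item.is_file() and item.suffix == ".xlsx":
                target = output_dir / item.name
                if target.exists():
                    raise FileExistsError(f"name clash in output: {item.name}")
                shutil.copy2(item, target)  # one copy per file, same name
                copied.append(item.name)
    return copied


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "Source" / "2021"
    for sub, name in [("01/01", "a.xlsx"), ("03/02", "b.xlsx")]:
        (src / sub).mkdir(parents=True)
        (src / sub / name).write_bytes(b"demo")
    print(flatten_copy(src, Path(tmp) / "Output"))
```

The sketch raises an error when two source folders hold a file with the same name, since a flat Output folder cannot keep both under the original name.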

how to read multiple text files into a dataframe in pyspark

I have a few .txt files in a directory (I have only the path, not the names of the files) that contain JSON data, and I need to read all of them into a DataFrame.
I tried this:
df=sc.wholeTextFiles("path/*")
But I can't even display the data, and my main goal is to perform queries on the data in different ways.

Instead of wholeTextFiles (which gives key-value pairs with the file name as the key and the file content as the value), try read.json and give it your directory name; Spark will read all the files in the directory into a DataFrame.
df=spark.read.json("<directory_path>/*")
df.show()
From the docs:
wholeTextFiles(path, minPartitions=None, use_unicode=True)
Read a directory of text files from HDFS, a local file system
(available on all nodes), or any Hadoop-supported file system URI.
Each file is read as a single record and returned in a key-value pair,
where the key is the path of each file, the value is the content of
each file.
Note: Small files are preferred, as each file will be loaded fully in
memory.
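To make the difference concrete without a Spark session, the following plain-Python sketch mimics the two shapes: (path, content) pairs, as wholeTextFiles returns, versus one parsed record per line, as read.json gives for line-delimited JSON. It is an illustration only, not Spark's implementation; the helper names and sample file contents are made up, and it assumes each line of a file holds one JSON object.

```python
import json
import tempfile
from pathlib import Path


def whole_text_files(directory):
    """Return (path, full content) pairs, one pair per file."""
    files = sorted(p for p in Path(directory).iterdir() if p.is_file())
    return [(str(p), p.read_text()) for p in files]


def read_json_lines(directory):
    """Return one dict per non-empty line across all files in the directory."""
    records = []
    for _, content in whole_text_files(directory):
        for line in content.splitlines():
            if line.strip():
                records.append(json.loads(line))
    return records


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text('{"id": 1}\n{"id": 2}\n')
    Path(tmp, "b.txt").write_text('{"id": 3}\n')
    print(len(whole_text_files(tmp)))  # one pair per file
    print(read_json_lines(tmp))        # one record per JSON line
```

The first shape still needs parsing before it can be queried, which is why the record-per-line shape is the more convenient starting point here.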

Remove files with Pig script after merging them

I'm trying to merge a large number of small files (200k+) and have come up with the following super-easy Pig code:
Files = LOAD 'hdfs/input/path' using PigStorage();
store Files into 'hdfs/output/path' using PigStorage();
Once Pig is done with the merging, is there a way to remove the input files? I'd like to check that the output file has been written and is not empty (i.e. not 0 bytes). I can't simply remove everything in the input path because new files may have been inserted in the meantime, so ideally I'd remove only the ones in the Files variable.

With Pig alone this is probably not possible. Instead, you can use -tagsource with the LOAD statement to get the file names and store them somewhere. Then use the HDFS FileSystem API to read the stored file names and remove the files that Pig merged.
A = LOAD '/path/' using PigStorage('delimiter','-tagsource');

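You should be able to use Hadoop commands in your Pig script (through Pig's fs command):
1. Move the input files to a new folder.
2. Merge the input files into the output folder.
3. Remove the input files from the new folder.
-- move the current input files aside so that files arriving later are left alone
fs -mkdir -p hdfs/input/new_path
fs -mv hdfs/input/path/* hdfs/input/new_path
Files = LOAD 'hdfs/input/new_path' USING PigStorage();
STORE Files INTO 'hdfs/output/path' USING PigStorage();
-- remove the moved files once the merged output has been stored
fs -rm -r hdfs/input/new_path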
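The underlying idea in both answers (fix the list of input files first, merge, check the output, then delete only the files on that list) can be sketched in plain Python. A local directory stands in for HDFS here, and merge_and_cleanup is a hypothetical helper, so treat this as an outline of the logic rather than something to run against a cluster.

```python
import tempfile
from pathlib import Path


def merge_and_cleanup(input_dir, output_file):
    """Merge a snapshot of the input files, then delete only those files."""
    input_dir, output_file = Path(input_dir), Path(output_file)
    # Snapshot first: files that arrive after this point are not in the list.
    snapshot = sorted(p for p in input_dir.iterdir() if p.is_file())
    with output_file.open("wb") as out:
        for path in snapshot:
            out.write(path.read_bytes())
    # Check that the merged output is not empty before deleting anything.
    if not snapshot or output_file.stat().st_size == 0:
        return []
    for path in snapshot:
        path.unlink()
    return [p.name for p in snapshot]


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp, "input")
    src.mkdir()
    Path(src, "part1.txt").write_text("a\n")
    Path(src, "part2.txt").write_text("b\n")
    print(merge_and_cleanup(src, Path(tmp, "merged.txt")))
```

Because deletion works from the snapshot and not from a fresh directory listing, a file that lands in the input folder during the merge is left in place for the next run.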

Avoiding multiple headers in pig output files

We use Pig to load files from directories containing thousands of files, transform them, and then output files that are a consolidation of the input.
We've noticed that the output files contain the header record of every file processed, i.e. the header appears multiple times in each file.
Is there any way to have the header only once per output file?
raw_data = LOAD '$INPUT'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',');
-- DO SOME TRANSFORMS (producing the relation 'data')
STORE data INTO '$OUTPUT'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage('|');

Did you try this option?
SKIP_INPUT_HEADER
See https://github.com/apache/pig/blob/31278ce56a18f821e9c98c800bef5e11e5396a69/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/CSVExcelStorage.java#L85
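What skipping the input header achieves can be shown with a small plain-Python sketch: the first row of every input is dropped, and the sketch itself writes the header once at the top of the consolidated output. This is not Pig or CSVExcelStorage code; consolidate and the sample data are made up for illustration, and it assumes all inputs share the same header.

```python
import csv
import io


def consolidate(csv_texts, in_delim=",", out_delim="|"):
    """Merge CSV texts that share a header, writing that header only once."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter=out_delim, lineterminator="\n")
    header_written = False
    for text in csv_texts:
        rows = csv.reader(io.StringIO(text), delimiter=in_delim)
        header = next(rows, None)  # skip this input's header row
        if header is None:
            continue               # empty input, nothing to merge
        if not header_written:
            writer.writerow(header)
            header_written = True
        writer.writerows(rows)     # data rows only
    return out.getvalue()


print(consolidate(["id,name\n1,a\n", "id,name\n2,b\n"]))
```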
