How to read all files contained in subfolders for external tables in SQL Server Data Warehouse - sql-server-2016

I have to load data from a data lake into a SQL Server data warehouse using PolyBase external tables. I have created the setup for external tables and created an external table with the location "/A/B/PARQUET/*.parquet/", but I'm getting an invalid path error. Under the PARQUET folder there are subfolders whose names end in .parquet, and those subfolders contain the actual .parquet files. Since there is no path literally called *.parquet, how can I read all the subfolders (.parquet) under the PARQUET folder?
Is there any way to get all the subfolders containing .parquet files under the PARQUET folder? Can someone help me with this? Thanks in advance.
CREATE EXTERNAL TABLE [dbo].[EXT_TEST1]
( A VARCHAR(10), B VARCHAR(20) )
WITH (DATA_SOURCE = [Azure_Datalake], LOCATION = N'/A/B/PARQUET/*.parquet/', FILE_FORMAT = csvfileformat, REJECT_TYPE = VALUE, REJECT_VALUE = 1)
folder structure:
A -> B -> PARQUET -> asdfolder.parquet -> file1.parquet
                  -> dfgfolder.parquet -> file2.parquet
                  -> shdfolder.parquet -> file3.parquet

Please change the location to:
LOCATION = '/A/B/PARQUET'
PolyBase will load all files in that folder and its subfolders. The only exceptions are files or folders whose names begin with a period (.) or an underscore (_), as described here.
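For reference, a sketch of the adjusted statement from the question, keeping the Azure_Datalake data source and csvfileformat names as given (for .parquet data an external file format created with FORMAT_TYPE = PARQUET would normally be used instead):

CREATE EXTERNAL TABLE [dbo].[EXT_TEST1]
(
    A VARCHAR(10),
    B VARCHAR(20)
)
WITH (
    DATA_SOURCE = [Azure_Datalake],
    LOCATION = N'/A/B/PARQUET',    -- point at the parent folder; files in its subfolders are read as well
    FILE_FORMAT = csvfileformat,   -- name kept from the question
    REJECT_TYPE = VALUE,
    REJECT_VALUE = 1
);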

Related

Copy multiple files from multiple folders to a single folder in Azure Data Factory

I have a folder structure like this as a source
Source/2021/01/01/*.xlsx files
Source/2021/03/02/*.xlsx files
Source/2021/04/03/*.xlsx files
Source/2021/05/04/*.xlsx files
I want to drop all these Excel files into a different folder called Output.
Method 1:
When I try this, I use a Copy activity and I am able to get the files, but with their folder structure (not a requirement), in the Output folder. I used the Binary file format.
Method 2:
Also, I am able to get the files, but they end up named with some random id .xlsx in my output folder. I used Flatten Hierarchy.
My requirement is to get the files with the same names as in the source.
This is what I suggest; I have implemented something similar in the past and I am pretty confident this should work.
Steps
Use a Get Metadata activity and loop through all the folders inside Source/2021/.
Use a ForEach loop and check the item type, so that you get folders only and NO files (I know at this point you don't have files).
Inside the If condition, add an Execute Pipeline activity; it should point to a new pipeline which will take a parameter like
Source/2021/01/01/
Source/2021/03/02/
The new pipeline should have a Get Metadata activity and a ForEach loop, and this time we will look for files only.
Inside the ForEach loop, add a Copy activity; you will now have to use the full file name as the source name.

Azure Data Factory "Name file as column data" option in the Sink transformation of a data flow is creating blob names for virtual folders

I have created an ADF pipeline. The source and sink are both storage accounts.
I want to create files based on the date in a data column, so I selected the option "Name file as column data".
With this option, we provide the file name together with a virtual folder path.
But when the process is completed, a blob (Hot, inferred) is also created for the virtual folder, which I don't need. I just need the blobs for the files (which are also present). If I delete those virtual-folder blobs, I can't add files incrementally to those folders.
What should I do?
You should set the column's format in the source projection before setting "Name file as column data".
Here's my source file:
Source projection settings: specify the column date type and format.
Sink settings:
Output:
Then we can get the output files without the virtual folder.
HTH.

Create an ADF dataset to load multiple CSV files (same format) from Blob storage

I am trying to create a dataset containing multiple CSV files from Blob storage. In the file path of the dataset settings I create a parameter, @dataset().FolderName, and add FolderName under Parameters.
I leave the file part of the file path empty, as I want to grab all the files in the folder. However, there is no data when I preview the data. Am I missing something? Thank you.
I have tested this on my side and it works fine.
add FolderName parameter
preview data
If you want to merge all the CSV files in a Data Flow, you can do this:
1. Output to single file
2. Set Single partition

Load a file with the csv extension in SSIS

I have to load a file with a .csv extension from one particular folder into a database in SSIS. The file name is not known, but the folder and extension are fixed.
To load the content of a file, the file name with its folder path is required; otherwise the connection manager cannot be validated and configured.
The easiest way to get the file name is to use a Foreach Loop container:
Select the option [Foreach File Enumerator]
Provide the Folder path and extension (like *.csv) you already have.
Store the file name in a variable and use it in the source of the Data Flow Task inside the Foreach Loop container.

Hive: reading an external table from a compressed bz2 file

This is my scenario.
I have a bz2 file in Amazon S3. Within the bz2 file there are files with .dat, .met, and .sta extensions. I am only interested in the files with the *.dat extension. You can download this sample file to take a look at the bz2 file.
create external table cdr (
  anum string,
  bnum string,
  numOfTimes int
)
row format delimited
fields terminated by ','
lines terminated by '\n'
location 's3://mybucket/dir'; -- the bz2 file is inside this directory
The problem is that when I execute the above command, some of the records/rows have issues:
1) All the data from the *.sta and *.met files is also included.
2) The metadata of the file names is also included.
The only idea I had was to look at the INPUT_FILE_NAME, but then all the records/rows had the same INPUT_FILE_NAME, which was filename.tar.bz2.
Any suggestions are welcome. I am currently completely lost.
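As a side note, the virtual column Hive exposes is spelled INPUT__FILE__NAME (two underscores on each side). A hedged sketch of filtering on it, assuming the .dat, .met and .sta files are stored as individually compressed objects under s3://mybucket/dir rather than packed inside a single tar.bz2 (a tar archive is read as one file, which is why every row reported the same name):

-- keep only rows that came from the *.dat files; this does not help
-- when everything is inside one tar.bz2 archive
select anum, bnum, numOfTimes
from cdr
where INPUT__FILE__NAME like '%.dat%';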