In Databricks I have several CSV files that I need to load. I would like to add a column to my table with the file path, but I can't seem to find that option.
My data is structured as follows:
FileStore/subfolders/DATE01/filenameA.csv
FileStore/subfolders/DATE01/filenameB.csv
FileStore/subfolders/DATE02/filenameA.csv
FileStore/subfolders/DATE02/filenameB.csv
I'm using this SQL statement in Databricks, since the wildcard path loops through all the dates and loads every filenameA into clevertablenameA, every filenameB into clevertablenameB, and so on.
DROP VIEW IF EXISTS clevertablenameA;
CREATE TEMPORARY VIEW clevertablenameA
USING csv
OPTIONS (path "dbfs:/FileStore/subfolders/*/filenameA.csv", header = true)
My desired outcome is something like this:
col1 | col2 | ... | path
data | data | ... | dbfs:/FileStore/subfolders/DATE02/filenameA.csv
data | data | ... | dbfs:/FileStore/subfolders/DATE02/filenameA.csv
data | data | ... | dbfs:/FileStore/subfolders/DATE02/filenameA.csv
Is there a clever option, or should I load my data another way?
The function input_file_name() could be used to retrieve the file name while reading.
SELECT *, input_file_name() as path FROM clevertablenameA
Note that this does not add a column to the view; it merely returns the full path of the file each row was read from.
Refer to the link below for more information.
https://learn.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/functions/input_file_name
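If you want the path available as a column of a view, one option is to wrap the query above in a second view (the view name clevertablenameA_withpath is made up here):

CREATE OR REPLACE TEMPORARY VIEW clevertablenameA_withpath AS
SELECT *, input_file_name() AS path
FROM clevertablenameA;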
Alternatively, you could read the files in a PySpark/Scala cell, add the file path with .withColumn("path", input_file_name()), and then create the view on top of the resulting DataFrame.
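A minimal PySpark sketch of that approach, reusing the path and view name from the question:

from pyspark.sql.functions import input_file_name

# Read all filenameA.csv files across the date folders and tag each row
# with the full path of the file it came from.
df = (spark.read
      .option("header", True)
      .csv("dbfs:/FileStore/subfolders/*/filenameA.csv")
      .withColumn("path", input_file_name()))

df.createOrReplaceTempView("clevertablenameA")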
I have a requirement where blob storage has multiple files named file_1.csv, file_2.csv, file_3.csv, file_4.csv, file_5.csv, file_6.csv and file_7.csv. From these I have to read only files 5 to 7.
How can we achieve this in an ADF/Synapse pipeline?
I have reproduced this in my lab; please see the repro steps below.
ADF:
Using the Get Metadata activity, get a list of all files.
(Parameterize the source file name in the source dataset to pass ‘*’ in the dataset parameters to get all files.)
Pass the Get Metadata output child items to the ForEach activity:
@activity('Get Metadata1').output.childItems
Add an If Condition activity inside the ForEach and set the true-case expression so that only the required files are copied to the sink:
@and(greater(int(substring(item().name,5,1)),4),lessOrEquals(int(substring(item().name,5,1)),7))
When the If Condition evaluates to true, add a Copy Data activity to copy the current item (file) to the sink.
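Inside the true branch, the Copy activity's source dataset can take its file name from the current item; assuming the source dataset has a file-name parameter, its value would simply be:

@item().name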
I took a slightly different approach, using a Filter activity and the endsWith function.
The filter expression is:
@or(or(endsWith(item().name, '_5.csv'),endsWith(item().name, '_6.csv')),endsWith(item().name, '_7.csv'))
Slightly different approaches, similar results; it depends on what you need.
You can always do what @NiharikaMoola-MT suggested. But since you already know the range of the files (5-7), I suggest the following:
Declare two parameters for the lower and upper limits of the range.
Create a ForEach loop and pass the parameters to create a range [lowerlimit, upperlimit].
Create a parameterized dataset for the source.
Use the file number from the ForEach loop to create a dynamic expression like
@concat('file_',string(item()),'.csv')
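For the ForEach items themselves, a hedged sketch of the expression, assuming the two pipeline parameters are named lowerlimit and upperlimit (ADF's range() takes a start index and a count):

@range(int(pipeline().parameters.lowerlimit), add(sub(int(pipeline().parameters.upperlimit), int(pipeline().parameters.lowerlimit)), 1))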
I have to upload data into SQL Server from .dbf files through SSIS.
My output columns are fixed, but the input columns are not, because the files come from clients who may have laid out the data in their own style. There may be some unused columns, and the input column names can differ from the output column names.
One idea I had was to map each file's input columns to the output columns in a SQL database table, and use only the columns present in the rows for that file ID.
But I am not sure how to do that. Any ideas?
Table example:
FileID | InputColumn   | OutputColumn | Active
1      | CustCd        | CustCode     | 1
1      | CName         | CustName     | 1
1      | Address       | CustAdd      | 1
2      | Cust_Code     | CustCode     | 1
2      | Customer Name | CustName     | 1
2      | Location      | CustAdd      | 1
If you create a similar table, you can use it in two approaches to map columns dynamically inside the SSIS package; otherwise you must build the whole package programmatically. In this answer I will try to give you some insights on how to do that.
(1) Building Source SQL command with aliases
Note: this approach will only work if all .dbf files have the same column count but different column names.
In this approach you generate the SQL command that will be used as the source, based on the FileID and the mapping table you created. The FileID and the .dbf file path must each be stored inside a variable. For example:
Assuming that the mapping table name is inputoutputMapping:
Add an Execute SQL Task with the following command:
DECLARE @strQuery as VARCHAR(4000)
SET @strQuery = 'SELECT '
SELECT @strQuery = @strQuery + '[' + InputColumn + '] as [' + OutputColumn + '],'
FROM inputoutputMapping
WHERE FileID = ?
SET @strQuery = SUBSTRING(@strQuery,1,LEN(@strQuery) - 1) + ' FROM ' + CAST(? as Varchar(500))
SELECT @strQuery
In the Parameter Mapping tab, map the variable that contains the FileID to parameter 0, and the variable that contains the .dbf file name (which stands in for the table name) to parameter 1.
Set the ResultSet type to Single row and store result set index 0 inside a variable of type String, for example @[User::SourceQuery].
The resulting value will look like the following:
SELECT [CustCd] as [CustCode],[CNAME] as [CustName],[Address] as [CustAdd] FROM database1
In the OLE DB Source, set the data access mode to 'SQL command from variable' and use the @[User::SourceQuery] variable as the source.
(2) Using a Script Component as Source
In this approach you have to use a Script Component as Source inside the Data Flow Task:
First of all, you need to pass the .dbf file path and SQL Server connection to the script component via variables if you don't want to hard code them.
Inside the script editor, you must add an output column for each column found in the destination table and map them to the destination.
Inside the script, you must read the .dbf file into a DataTable; the following links show how:
C# Read from .DBF files into a datatable
Load a DBF into a DataTable
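As a rough sketch of that step, assuming the 32-bit Jet OLE DB provider is installed (the provider, connection-string details, and the folder/file names here are assumptions; adjust for your environment):

using System;
using System.Data;
using System.Data.OleDb;

class DbfLoader
{
    static DataTable LoadDbf(string folder, string fileName)
    {
        // With the dBASE OLE DB provider, the folder acts as the "database"
        // and each .dbf file inside it is queried as a table.
        string connStr = "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" + folder +
                         ";Extended Properties=dBASE IV;";

        var table = new DataTable();
        using (var conn = new OleDbConnection(connStr))
        using (var adapter = new OleDbDataAdapter("SELECT * FROM [" + fileName + "]", conn))
        {
            adapter.Fill(table);  // Fill opens and closes the connection itself
        }
        return table;
    }
}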
After loading the data into a DataTable, also fill a second DataTable (MappingTable below) with the data found in the mapping table you created in SQL Server.
After that, loop over the first DataTable's columns and change each .ColumnName to the relevant output column, for example:
foreach (DataColumn col in myTable.Columns)
{
    // Requires System.Linq and System.Data.DataSetExtensions.
    col.ColumnName = MappingTable.AsEnumerable()
        .Where(x => x.Field<int>("FileID") == 1 && x.Field<string>("InputColumn") == col.ColumnName)
        .Select(y => y.Field<string>("OutputColumn"))
        .First();
}
After that, loop over each row in the DataTable and create a script output row.
Note that while assigning output rows you must check whether a column exists. You can first add all the column names to a list of strings, then use it to check, for example:
var columnNames = myTable.Columns.Cast<DataColumn>()
    .Select(x => x.ColumnName)
    .ToList();

foreach (DataRow row in myTable.Rows)
{
    OutputBuffer0.AddRow();

    if (columnNames.Contains("CustCode"))
    {
        OutputBuffer0.CustCode = row["CustCode"].ToString();
    }
    else
    {
        OutputBuffer0.CustCode_IsNull = true;
    }
    // continue checking all other columns
}
If you need more details about using a Script Component as a source, then check one of the following links:
SSIS Script Component as Source
Creating a Source with the Script Component
Script Component as Source – SSIS
SSIS – USING A SCRIPT COMPONENT AS A SOURCE
(3) Building the package dynamically
I don't think there are other methods you can use to achieve this goal, unless you choose to build the whole package dynamically; in that case you should go with:
BIML
Integration Services managed object model
EzApi library
(4) SchemaMapper: C# schema mapping class library
Recently I started a new project on GitHub: a class library developed in C#. You can use it to import tabular data from Excel, Word, PowerPoint, text, CSV, HTML, JSON and XML into a SQL Server table with a different schema definition, using a schema mapping approach. Check it out at:
SchemaMapper: C# Schema mapping class library
You can follow this Wiki page for a step-by-step guide:
Import data from multiple files into one SQL table step by step guide
In the following code I load data from two Excel documents that have exactly the same column names, which is why I've aliased the field in one of the tables.
My problem occurs when I try to put in a not match() condition at the end of the script.
// New table
NewTable:
LOAD
[namn] as namnNy
FROM
[pglistaNy.xlsx]
(ooxml, embedded labels);
// Old table
OldTable:
LOAD
[namn]
FROM
[pglistaOld.xlsx]
(ooxml, embedded labels)
Where not match(namn, namnNy);
I get an error telling me that it does not recognize the namnNy field. Why is that, and what's a better solution/method?
The match function will not work in your case: in the Where clause of a load, only fields of the table currently being loaded are in scope, so namnNy from the other table cannot be referenced there. You should use the Exists function instead, which checks values already loaded into the data model (full documentation is on Qlik's help page).
So your script will be:
// New table
NewTable:
LOAD
[namn] as namnNy
FROM
[pglistaNy.xlsx]
(ooxml, embedded labels);
// Old table
OldTable:
LOAD
[namn]
FROM
[pglistaOld.xlsx]
(ooxml, embedded labels)
Where
not Exists(namnNy, namn);
I want to rename image files using the SchoolId from the [School] table. Is there another approach to do this?
Currently, I am doing the following steps for each file:
1. Copy the image file name.
2. Use this query to get the SchoolId:
SELECT * FROM [School]
WHERE SchoolCDS = '01611926000962'
3. Rename the image file with the SchoolId.
Is there a better approach?
Have you stored the image file name?
If yes:
Fetch the image file name from the DB, then fetch the school id from the DB based on the file name, and finally update the DB with the new file name.
If no (in the case of physically stored images):
Get the file name as shown below:
string fileName = @"C:\mydir\myfile.ext";
string path = @"C:\mydir\";
string result;
result = Path.GetFileName(fileName);  // result = "myfile.ext"
The rest of the code is the same as above.
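A minimal sketch of the whole loop, assuming each image file is named after its SchoolCDS value (the folder path and connection string are placeholders):

using System;
using System.Data.SqlClient;
using System.IO;

class RenameSchoolImages
{
    static void Main()
    {
        string dir = @"C:\school-images";  // hypothetical image folder

        using (var conn = new SqlConnection("<your connection string>"))
        {
            conn.Open();
            foreach (string file in Directory.GetFiles(dir))
            {
                // Assumption: each file is named after its SchoolCDS,
                // e.g. 01611926000962.jpg
                string cds = Path.GetFileNameWithoutExtension(file);

                using (var cmd = new SqlCommand(
                    "SELECT SchoolId FROM [School] WHERE SchoolCDS = @cds", conn))
                {
                    cmd.Parameters.AddWithValue("@cds", cds);
                    object schoolId = cmd.ExecuteScalar();

                    if (schoolId != null)
                    {
                        string newName = Path.Combine(dir,
                            schoolId.ToString() + Path.GetExtension(file));
                        File.Move(file, newName);  // rename in place
                    }
                }
            }
        }
    }
}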
Is it possible in QlikView to concatenate multiple files from different paths?
Suppose I am loading multiple files from one path and want to concatenate files from other paths that have the same number and names of columns as the first path's files. My question is how I can do that.
Thanks in Advance.
When you say "load a file", I am assuming you mean that you are loading the contents into a table, as you would a QVD, XML, or Excel file.
If this is the case, if the columns are identical in each load, QlikView will attempt to concatenate them by default if they are loaded in sequence.
Otherwise, name your first table, such as TableName:, then preface the following loads of other files with concatenate(TableName).
Ex:
TableName:
LOAD Col1, Col2
from [file.qvd];
CONCATENATE(TableName)
LOAD Col1, Col2
from [file2.qvd];
Note: As I mentioned above, since these are in sequence and have identically named columns, QlikView will attempt to autoconcatenate them in my example, so the CONCATENATE line, though still functional, is not required.
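Conversely, if you ever need the opposite behaviour, forcing a load with identical columns into its own table, the NOCONCATENATE prefix does that (the table and file names here are made up):

SeparateTable:
NOCONCATENATE
LOAD Col1, Col2
from [file3.qvd];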
I just want to add an example of how to do it when there is a dynamic number of files with a given name across multiple directories:
SUB LoadFromFolder (RootDir)
TRACE Loading data ...;
TRACE Directory: $(RootDir);
TRACE ;
FOR Each FoundFile in FileList(RootDir & '\FileName.xml')
TRACE Loading data from '$(FoundFile)' ...;
Data:
LOAD Prop1,
Prop2,
Prop3
From [$(FoundFile)] (XmlSimple, Table is [XmlRoot/XmlTag]);
TRACE Loaded.;
NEXT FoundFile
FOR Each SubDirectory in DirList(RootDir & '\*' )
CALL LoadFromFolder(SubDirectory);
NEXT SubDirectory
TRACE ;
END Sub
CALL LoadFromFolder ('C:\Path\To\Dir\WithoutslashAtTheEnd');
As Dickie already mentioned, each time you load into "Data:", the new rows will be concatenated onto that table.