Mosaic Decisions Azure BLOB writer node creating multiple files - mosaic-decisions

I'm using the Mosaic Decisions data flow feature to read a file from Azure Blob, do a few transformations, and write the data back to Azure. It worked fine, except that at the output file path I gave, it created a folder, and I can see many files with strange names containing "part-000" and so on. What I need is a single file in that output location, not many. Is there a way around this?

Mosaic Decisions uses Apache Spark as its backend execution engine. In Spark, the dataframe that is read is split into multiple partitions, and these partitions are written to the output location in parallel. That is why it creates multiple files at the target location named "part-0000", "part-0001", etc. ("part" here stands for partition).
The workaround is to check "combine-output-files-into-one" in the writer node. This combines all of the part files into one big file. Use it with caution, and only if you really need a single file, as it comes with a performance tradeoff.
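If the part files have already been written and you only need one file afterwards, the same effect can be approximated by concatenating them yourself. The sketch below is a plain-Python illustration of that idea, not what Mosaic Decisions does internally. It assumes the part files are headerless text files that have been downloaded to a local folder, and the function name merge_part_files is hypothetical.

```python
import shutil
from pathlib import Path


def merge_part_files(output_dir: str, merged_path: str) -> int:
    """Concatenate Spark-style part files into one file; return how many were merged."""
    # Sort by name so part-0000, part-0001, ... are appended in partition order.
    # Marker files such as _SUCCESS are ignored because they do not start with "part-".
    parts = sorted(
        p for p in Path(output_dir).iterdir()
        if p.is_file() and p.name.startswith("part-")
    )
    with open(merged_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, merged)
    return len(parts)
```

Write the merged file outside the part-file folder so it is not picked up on a second run. If each part file carries its own header row, the headers would need to be stripped from all but the first file, which this sketch does not do.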

Related

Download big number of files (400k) from S3 bucket into Azure Datalake Gen2 using Azure Data Factory

I need to download a large number of files (around 400k) from an S3 bucket. I have the paths stored in a CSV file. Some of the paths may not exist.
The two options I see are:
Use the ForEach activity and somehow pass the contents of the file to it. But I think this would flood my monitor pane with a huge number of runs, and it feels like it is meant for smaller pipelines.
Use the listOfFiles option, which is supported in the S3 source. The problem with this approach is that the list must be in the S3 bucket and cannot be loaded from Azure Data Lake Gen2 (if anybody knows why, please let me know as well).
I have tried the listOfFiles approach, but the pipeline fails as soon as it hits the first missing file. The fault tolerance options include a "skip missing file" option, but it is defined as "Skip the files if it is being deleted from source store during the data movement", so it is of no use to me.
I don't want to download more files than needed, so copying the bucket as-is is not an option. How can I approach this with ADF? I'm looking for a solution that uses the predefined transformations; ideally I would not involve Azure Batch or Azure Functions for such a simple task.

Creating multiple files for uploading to Snowflake

Currently, my company uses SSIS and BCP to export data from SQL Server to CSV files. However, we are only able to create a single file per SQL table (due to the limitations of BCP). Most of these files are quite large; if I am correct, they are too large to get the best performance when loading them into Snowflake. On their website, they state that we should be working with multiple gzip files to offer the best performance.
I am wondering how other people have made this work. Do you split the CSV into multiple files and zip them? Are there any good tools that can do this during the export from SSIS?
I'd keep the current process that exports the large .csv files using SSIS, then run 7zip via command line to create a split gzip set for each text file, either within the SSIS package or via Powershell.
The -v switch is used to specify the volume size.
https://sevenzip.osdn.jp/chm/cmdline/switches/volume.htm
You may be able to start importing/uploading the completed chunks before the later ones are finished to pick up some additional time savings, but I've not tested that.
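As an alternative to 7zip volumes, the split can also be scripted so that every chunk is a complete gzip file in its own right. The sketch below is a minimal Python illustration of that variant, not the 7zip approach described above. The function name split_csv_to_gzip, the output naming scheme, and the default chunk size are all assumptions, and whether the header row should be repeated in each chunk depends on how your load is configured.

```python
import csv
import gzip
from pathlib import Path


def split_csv_to_gzip(src_path: str, out_dir: str, rows_per_chunk: int = 1_000_000) -> list:
    """Split one CSV into gzip chunks, repeating the header row in each chunk."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    stem = Path(src_path).stem
    chunk_paths = []
    handle = None
    writer = None
    try:
        with open(src_path, newline="", encoding="utf-8") as src:
            reader = csv.reader(src)
            header = next(reader, None)
            if header is None:
                return chunk_paths  # empty input: nothing to write
            for i, row in enumerate(reader):
                if i % rows_per_chunk == 0:
                    # Close the previous chunk and start a new one.
                    if handle is not None:
                        handle.close()
                    path = out / f"{stem}_{len(chunk_paths):04d}.csv.gz"
                    handle = gzip.open(path, "wt", newline="", encoding="utf-8")
                    writer = csv.writer(handle)
                    writer.writerow(header)
                    chunk_paths.append(path)
                writer.writerow(row)
    finally:
        if handle is not None:
            handle.close()
    return chunk_paths
```

Because the source file is streamed row by row, memory use stays small even for very large exports. The chunk size should be tuned against the file sizes Snowflake recommends for your workload.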

Approach for large data set for reporting

I have 220 million raw files in AWS S3, which I am considering merging into a single file estimated at around 10 terabytes. The merged file will serve as a fact table, but in file format, for audit reporting purposes.
The raw files are source data from an application. Whenever data changes in the application, the content of the corresponding file changes.
Has anybody come across an end-to-end process for this use case?
S3 --> ETL (file merging) --> S3 --> reporting (Tableau)
I haven't personally tried it, but this is kind of what Athena is made for... Skipping your ETL process, and querying directly from the files. Is there a reason you are dumping this all into a single file instead of keeping it dispersed? Rewriting a 10TB file over and over again is very expensive and time consuming... I'd personally at least investigate keeping the files 1-1 with the source files.
Create an S3 trigger that fires when a file is rewritten on S3
Create a Lambda that creates your "audit ready" report files on S3
Use AWS Athena to query those report files
Use the Tableau connector to Athena for your reports
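The first two steps above could start from something like the following. This is a minimal sketch of a Python Lambda handler that only works out which objects changed; it relies on the standard S3 event notification layout (Records, then s3.bucket.name and s3.object.key). The helper name changed_objects and the returned "rebuilt" structure are hypothetical, and the actual report generation is left out.

```python
from urllib.parse import unquote_plus


def changed_objects(event: dict) -> list:
    """Return (bucket, key) pairs for each object named in an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 notifications (spaces become '+').
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def lambda_handler(event, context):
    # Hypothetical entry point: a real function would rebuild the report file
    # for each changed object here instead of just listing them.
    targets = changed_objects(event)
    return {"rebuilt": [f"s3://{bucket}/{key}" for bucket, key in targets]}
```

Keeping one report file per source file means each invocation only touches the file that changed, which is the point of avoiding the single 10 TB merge.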

Joining ADLS files created with Append and ConcurrentAppend

We have several large CSV files in Azure Data Lake Store that were created using the Append method of the .NET API. Recently, we switched over to ConcurrentAppend for performance reasons. Since ConcurrentAppend and Append cannot be used interchangeably, the switch required us to create a new folder structure for the files, to make sure that the ConcurrentAppend would never hit any files created using Append.
However, our downstream application needs to load all data, both from before and after the switch. Instead of changing our application, we wanted to join the files (using the PowerShell SDK Join-AzureRmDataLakeStoreItem cmdlet), but the documentation does not specify whether files joined this way can be written to by ConcurrentAppend after the join. I suspect that we will face issues, since we are going to join files created by both methods (maybe it's not even possible to do the join?)
So my questions are as follows:
Can ConcurrentAppend write to a file that has been joined using Join-AzureRmDataLakeStoreItem, even if one or more of the source files have been created using Append?
If not, we will use U-SQL to combine the files, but can ConcurrentAppend write to a file that has been outputted from a U-SQL job?
If not, do we have any other options than executing a local script (using the .NET API for example), which will read all files, and write a new set of files back to the lake using only ConcurrentAppend?
Cost is a concern, which is why we prefer to use the PowerShell cmdlet if possible, and would like to avoid the last option.
At present, no append operations can be executed on a file after the join operation. We are currently working on a feature to remove this limitation, but until then, appends will not work on concatenated files.

Large file to LookUp other large file when files are dependent- Mule ESB

Could you please suggest an approach? I have two files, each with 80 to 90k products, and the two files are interlinked (one file has information about the other). I need to generate a single file by looking up one file against the other. The files will probably arrive at the same time with different names.
Both files are CSV, and I need to generate a new CSV.
Is the only way to keep one of these files in memory and look things up by iterating over it?
I planned to use Batch inside DataMapper. Is there any way to keep the first file in a DataMapper user-defined table or something similar, and then have the new file look things up in it? (I have not been provided with an external DB.)
If one of the files had only 5,000 or 10k lines, I could keep it in memory and have the 80k file look things up in it. I'm not comfortable keeping an 80 or 90k file in memory.
I have referred to this link: Mule ESB - design a multi file processing flow when files are dependent on each other.
Could you please suggest the best solution?
Also, any idea how long processing the files would take? Thanks in advance.
Mule Studio: 5.3.1 and Runtime: 3.7.2
I would think of the problem as two distinct events from Mule's perspective, and plan to keep state from the first one in a "database" of some kind. This doesn't have to be an Oracle cluster or anything, you can run H2 in process or Redis on the same server as Mule for example.
I think you're on the right track with the Batch idea. When the first file is received, I'd create a record for each in a batch job. Then when the second file is received, I'd run a second batch job that looks up the relevant information from the database, and generates the CSV file you need. It could also remove the records that have been matched from the database in a subsequent batch step.
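To make the two-step idea concrete, here is a minimal sketch of the same pattern outside Mule, using Python with SQLite standing in for the in-process database (H2 in the suggestion above). It is not Mule or Batch code. The function names, the "pending" table, and the assumption that both files share a key column are all hypothetical.

```python
import csv
import json
import sqlite3


def store_first_file(conn: sqlite3.Connection, path: str, key: str) -> None:
    """Step 1: store every row of the first file, keyed by the shared column."""
    conn.execute("CREATE TABLE IF NOT EXISTS pending (k TEXT PRIMARY KEY, payload TEXT)")
    with open(path, newline="", encoding="utf-8") as f:
        rows = ((r[key], json.dumps(r)) for r in csv.DictReader(f))
        conn.executemany("INSERT OR REPLACE INTO pending VALUES (?, ?)", rows)
    conn.commit()


def join_second_file(conn: sqlite3.Connection, path: str, key: str, out_path: str) -> int:
    """Step 2: stream the second file, look up each match, and write the merged CSV."""
    matched = 0
    writer = None
    with open(path, newline="", encoding="utf-8") as f, \
            open(out_path, "w", newline="", encoding="utf-8") as out:
        for r in csv.DictReader(f):
            hit = conn.execute(
                "SELECT payload FROM pending WHERE k = ?", (r[key],)
            ).fetchone()
            if hit is None:
                continue  # no counterpart in the first file; skipped in this sketch
            merged = {**json.loads(hit[0]), **r}
            if writer is None:
                writer = csv.DictWriter(out, fieldnames=list(merged))
                writer.writeheader()
            writer.writerow(merged)
            # Remove matched records so only unmatched ones remain afterwards.
            conn.execute("DELETE FROM pending WHERE k = ?", (r[key],))
            matched += 1
    conn.commit()
    return matched
```

Neither file is held in memory as a whole: the first is streamed into the database, and the second is streamed past it one row at a time. With a file-backed database rather than an in-memory one, the stored rows also survive if the second file arrives later than the first.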
For the transformations, I'd recommend trying DataWeave instead of DataMapper. It's a better way to write transformation logic, and Mulesoft has deprecated DataMapper, to be removed as of Mule 4.0.