Joining ADLS files created with Append and ConcurrentAppend - azure-data-lake

We have several large CSV files in Azure Data Lake Store that were created using the Append method of the .NET API. Recently, we switched over to ConcurrentAppend for performance reasons. Since ConcurrentAppend and Append cannot be used interchangeably, the switch required us to create a new folder structure for the files, to make sure that the ConcurrentAppend would never hit any files created using Append.
However, our downstream application needs to load all data, both from before and after the switch. Instead of changing our application, we wanted to join the files (using the PowerShell SDK Join-AzureRmDataLakeStoreItem cmdlet), but the documentation does not specify whether files joined this way can be written to by ConcurrentAppend after the join. I suspect that we will face issues, since we are going to join files created by both methods (maybe it's not even possible to do the join?)
So my questions are as follows:
Can ConcurrentAppend write to a file that has been joined using Join-AzureRmDataLakeStoreItem, even if one or more of the source files have been created using Append?
If not, we will use U-SQL to combine the files, but can ConcurrentAppend write to a file that has been outputted from a U-SQL job?
If not, do we have any other options than executing a local script (using the .NET API for example), which will read all files, and write a new set of files back to the lake using only ConcurrentAppend?
Cost is a concern, which is why we prefer to use the PowerShell cmdlet if possible, and would like to avoid the last option.

At present, no append operations can be executed on a file after a join (concatenate) operation. We are working on a feature to remove this limitation.

Related

Creating multiple files for uploading to Snowflake

Currently, my company uses SSIS and BCP to export data from SQL Server to CSV files. However, we are only able to create a single file per SQL table (due to the limitations of BCP). Most of these files are quite large; if I am correct, they are too large to get the best performance when loading them into Snowflake. Snowflake's website states that we should work with multiple gzip files to get the best performance.
I am wondering how other people made this work? Splitting up the CSV to multiple files and zipping them? Any good tools that can do this during export from SSIS?
I'd keep the current process that exports the large .csv files using SSIS, then run 7-Zip from the command line to create a split gzip set for each text file, either within the SSIS package or via PowerShell.
The -v switch is used to specify the volume size.
https://sevenzip.osdn.jp/chm/cmdline/switches/volume.htm
You may be able to start importing/uploading the completed chunks before the later ones are finished to pick up some additional time savings, but I've not tested that.
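If the loader needs each chunk to be a complete gzip file that can be decompressed on its own, an alternative to archive volumes is to split the CSV by rows and compress each piece separately. The following is a minimal sketch of that alternative, not the 7-Zip approach described above; the function name `split_csv_to_gzip`, the output naming scheme, and the default `rows_per_chunk` are assumptions for illustration.

```python
import csv
import gzip
from pathlib import Path


def split_csv_to_gzip(src, out_dir, rows_per_chunk=100_000):
    """Split one CSV into independently readable .csv.gz chunks.

    Every chunk repeats the header row. Returns the list of chunk paths.
    Names and the default chunk size are illustrative assumptions.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = []
    out_file = None
    writer = None
    with open(src, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, None)
        if header is None:
            return chunk_paths  # empty input: nothing to write
        try:
            for i, row in enumerate(reader):
                if i % rows_per_chunk == 0:
                    # Start a new chunk: close the previous one first.
                    if out_file is not None:
                        out_file.close()
                    path = out_dir / f"{src.stem}_{len(chunk_paths):04d}.csv.gz"
                    out_file = gzip.open(path, "wt", newline="", encoding="utf-8")
                    writer = csv.writer(out_file)
                    writer.writerow(header)
                    chunk_paths.append(path)
                writer.writerow(row)
        finally:
            if out_file is not None:
                out_file.close()
    return chunk_paths
```

Because each chunk is closed before the next one is opened, finished chunks could in principle be uploaded while later ones are still being written, though that is untested here as well.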

Mosaic Decisions Azure BLOB writer node creating multiple files

I’m using the Mosaic Decisions data flow feature to read a file from Azure Blob storage, do a few transformations, and write the data back to Azure. It worked fine, except that at the output file path I gave, it created a folder containing many files with strange names like “part-000”. What I need is a single file in that output location, not many. Is there a way around this?
Mosaic-Decisions uses Apache Spark as its backend execution engine. In Spark, the dataframe that is read is split into multiple partitions, and these partitions are written to the output location in parallel. That is why it creates multiple files at the target location named "part-0000", "part-0001", etc. ("part" here stands for partition).
The workaround for this is to check "combine-output-files-into-one" in the writer node. This combines all of the part files into one big file. Use it with caution, and only if you really need a single file, as it comes with a performance tradeoff.
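To illustrate what combining part files amounts to, here is a minimal sketch that concatenates already-downloaded part files into one file on a local disk. It is not how Mosaic Decisions or Spark implement the option; the function name `combine_part_files` and the `has_header` parameter are hypothetical, and the sketch assumes text (CSV-like) part files whose names start with `part-`.

```python
from pathlib import Path


def combine_part_files(part_dir, dest, has_header=True):
    """Concatenate part-* files (sorted by name) into one file.

    If has_header is True, the first line of every part after the first
    is skipped so the header appears only once. Returns the number of
    parts merged. Illustrative helper, not a Spark or Mosaic API.
    """
    # Files such as _SUCCESS or .part-0000.crc do not match "part-*".
    parts = sorted(p for p in Path(part_dir).glob("part-*") if p.is_file())
    with open(dest, "w", encoding="utf-8", newline="") as out:
        for index, part in enumerate(parts):
            last = ""
            with open(part, "r", encoding="utf-8", newline="") as src:
                for line_no, line in enumerate(src):
                    if has_header and index > 0 and line_no == 0:
                        continue  # drop repeated header
                    out.write(line)
                    last = line
            # Guard against a part that lacks a trailing newline.
            if last and not last.endswith("\n"):
                out.write("\n")
    return len(parts)
```

The merge is sequential, which is one way to see the performance tradeoff: the parallel write is followed by a single-threaded pass over all of the data.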

How to store auto-generated files in a different AWS S3 folder while running Tableau using the Athena connector?

I am using Athena to connect a single CSV file stored in an AWS S3 folder to Tableau Desktop, and the connection works.
However, whenever I perform any activity in Tableau (drag and drop, slice and dice), an auto-generated CSV file and a metadata file are saved in the same folder as my input file.
Because these additional files are generated in the same folder as the input file, the visuals in Tableau are also affected (they pick up the additional records).
How do I ensure that, for any activity I perform in Tableau, the auto-generated files are stored in a different folder (rather than the folder the input file is read from)?
This will solve my problem as the visuals and the analysis will show correct numbers.
Currently, my workaround is that after every activity I perform in Tableau (slice, filter, etc.), I go back to the S3 folder, delete the additional files that were auto-generated, continue with the activity in Tableau, go back to the S3 folder to delete again, and so on (definitely not ideal).
When executing an Athena query directly, I store the query results in a different folder, because there is a provision for doing so.
Please suggest whether there is a similar provision for storing the auto-generated files (while working in Tableau) in a different folder.
P.S. If there is an option of preventing these files from getting generated, that will also be helpful.
Anand
How do I ensure that the auto-generated files get stored in a different folder?
To store the results of your queries in a different location, you need to specify a different path for the S3 Staging Directory. To do that, edit the connection to AWS Athena.
Here we did everything within Tableau itself, but the same result can be achieved through the AWS Athena setting for the query result location.
If there is an option of preventing these files from getting generated, that will also be helpful.
On the left side of the toolbar, there is a Pause/Resume Auto Updates option. When auto updates are paused, Tableau does not send new queries to AWS Athena.

SQL Database in GitHub

I am building a Java app that uses an SQLite database to hold most of its data. For the end-user, the database would be almost entirely read-only, with very occasional edits. I'll (theoretically) be displaying/distributing it through my GitHub page, so my question is:
What's the best way to load the database into GitHub? (I'm using IntelliJ with DataGrip.)
I'd prefer to be able to update the database when I commit/push, instead of having to overwrite the whole file. The closest question I can find is How to include MySQL database schema on GitHub? but there could potentially be hundreds or thousands of entries, so I can't just rebuild the tables when the user installs the app.
I'm applying for entry-level developer jobs, and this project is going to be my main portfolio piece during job-hunting. I'm trying to make sure it is not only functional but also makes a good impression. Any help is (very) greatly appreciated.
EDIT:
After moving my .db file into the folder connected to GitHub (same level as my src folder) apparently I can now commit/push it with the rest of my files. How do I make sure that the connection from my Java code to the database stays valid once it is loaded onto another user's system? Can I just stick with
connection = DriverManager.getConnection("jdbc:sqlite:mydatabase.db");
or do I need to rework the path?
On startup, if your application can't find the corresponding SQLite database file, have it create one. Then do an initial load of your tables from CSV, JSON, or XML files.
You can upload these files to Git, as they are text formats.
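The create-if-missing flow could look roughly like the sketch below. It is written in Python with the standard `sqlite3` module only to keep it short and self-contained; the same steps map onto JDBC in a Java app. The table name `entries`, its columns, and the function name `open_or_create_db` are hypothetical and stand in for whatever schema the app actually uses.

```python
import csv
import sqlite3
from pathlib import Path


def open_or_create_db(db_path, seed_csv):
    """Open the SQLite file, creating and seeding it on first run.

    Hypothetical schema: one table entries(id, name), seeded from a
    CSV file with an "id,name" header row.
    """
    first_run = not Path(db_path).exists()
    conn = sqlite3.connect(str(db_path))  # creates the file if missing
    if first_run:
        with open(seed_csv, newline="", encoding="utf-8") as f:
            rows = [(int(r["id"]), r["name"]) for r in csv.DictReader(f)]
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "CREATE TABLE entries (id INTEGER PRIMARY KEY, name TEXT)"
            )
            conn.executemany(
                "INSERT INTO entries (id, name) VALUES (?, ?)", rows
            )
    return conn
```

With this approach only the text seed file needs to be tracked in Git, so diffs stay readable when entries are added or changed.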

Synchronized shared definition in Pentaho

Is there a way in Pentaho to create a synchronized shared definition?
Let's say we have a source file s1 that is used in two transformations, t1 and t2. Now suppose I make a change in t1 and add one more column to s1 there; I want that change to be reflected in t2 as well. Is there a way in Pentaho to achieve this?
When we share a database connection in Pentaho, all changes are reflected wherever we use it. Can we do something similar with files (if I create a shared definition of a file, store it in the repository, and then use it in other transformations)?
Thanks for your time.
Normally, I would recommend using a subtransformation for shared logic. But in this case, the number of fields in the resulting stream is going to change, so the subtransformation won't buy you much; you would still need to go into the parent transformations to change the stream metadata.
Another approach is to use the Metadata Injection step, so that you can have dynamic stream structures. This is probably overkill if you only have one source file used in two transformations, but if you have lots of source files shared by lots of transformations, it is a good approach. There are multiple sources on the web on how to use this step; one such can be found here.