U-SQL: How to merge two U-SQL files with the same import statement - azure-data-lake

I want to deploy multiple table-creation scripts as one ADLA job to save on cost. I am using packages to get a set of defined partition keys for all tables. When I try to deploy the merged script, it complains that the import statement is declared multiple times and fails.
I can still deploy the scripts one by one, but I wanted to see whether we can merge them for faster deployment.
Thanks
Amit

I am not sure I completely get your scenario. If you want to deploy a single object by itself, then that file needs to include all the dependencies (e.g., your package). If you want to deploy several objects, you should include the dependencies only once.
You should probably set up something that generates your script from the underlying "fragments". One fragment would be the reference to the package, and each of the other fragments would create one object. Your deployment system would then concatenate the files as needed.
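The fragment idea can be sketched in a few lines of Python. This is only a minimal illustration, not a deployment tool: the function name build_script and the fragment contents are hypothetical, and it assumes every fragment that still carries the package import repeats it with exactly the same text as the shared import fragment.

```python
from typing import Iterable


def build_script(prelude: str, fragments: Iterable[str]) -> str:
    """Emit the shared import fragment once, then every object fragment."""
    prelude = prelude.strip()
    parts = [prelude]
    for fragment in fragments:
        # Drop any copy of the import that a fragment still carries
        # (assumes it is textually identical to the prelude).
        body = fragment.replace(prelude, "").strip()
        if body:
            parts.append(body)
    return "\n\n".join(parts) + "\n"


if __name__ == "__main__":
    # Placeholder fragment contents; real ones would be read from files.
    package_import = "IMPORT PACKAGE MyDb.dbo.PartitionKeys() AS keys;"
    table_fragments = [
        package_import + "\n// statements that create table A",
        package_import + "\n// statements that create table B",
    ]
    print(build_script(package_import, table_fragments))
```

The merged output contains the import once, followed by the object fragments in the order given, so the same fragments can still be deployed individually by passing a single-element list.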

Related

ADF: Using ForEach and Execute Pipeline with Pipeline Folder

I have a folder of pipelines, and I want to execute the pipelines inside the folder using a single pipeline. There will be times when there will be another pipeline added to the folder, so creating a pipeline filled with Execute Pipelines is not an option (well, it is the current method, but it's not very "automate-y" and adding another Execute Pipeline whenever a new pipeline is added is, as you can imagine, a pain). I thought of the ForEach Activity, but I don't know what the approach is.
I have not tried this approach, but I think we can use the ADF REST API to get the details of all the pipelines that need to be executed. Since the response is JSON, you can write it to a temporary blob, then filter it down to what you need.
https://learn.microsoft.com/en-us/rest/api/datafactory/pipelines/list-by-factory?tabs=HTTP
You can then use the Create Run API to trigger each pipeline.
https://learn.microsoft.com/en-us/rest/api/datafactory/pipelines/create-run?tabs=HTTP
As Joel called out, if different pipelines have different numbers of parameters, it will be a little messy to maintain.
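As a rough sketch of that idea, the helpers below build the two request URLs and pick out the pipelines that sit in a given folder from a list response. Nothing here is tested against a live factory: the helper names are made up for illustration, the api-version value and the properties.folder.name field are taken from my reading of the linked REST reference and should be checked there, and authentication plus the actual HTTP calls are left out.

```python
from urllib.parse import quote

MANAGEMENT_HOST = "https://management.azure.com"
API_VERSION = "2018-06-01"  # assumption: verify against the linked reference


def factory_url(subscription_id, resource_group, factory_name):
    """Base URL of one data factory (all three arguments are placeholders)."""
    return (
        f"{MANAGEMENT_HOST}/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.DataFactory/factories/{factory_name}"
    )


def list_pipelines_url(factory):
    """URL for the 'list by factory' call (HTTP GET)."""
    return f"{factory}/pipelines?api-version={API_VERSION}"


def create_run_url(factory, pipeline_name):
    """URL for the 'create run' call (HTTP POST) for one pipeline."""
    name = quote(pipeline_name, safe="")
    return f"{factory}/pipelines/{name}/createRun?api-version={API_VERSION}"


def pipelines_in_folder(list_response, folder_name):
    """Names of pipelines whose folder matches, given the parsed JSON body.

    Assumes each item looks like
    {"name": ..., "properties": {"folder": {"name": ...}}}.
    """
    names = []
    for pipeline in list_response.get("value", []):
        folder = pipeline.get("properties", {}).get("folder") or {}
        if folder.get("name") == folder_name:
            names.append(pipeline["name"])
    return names
```

A caller would GET list_pipelines_url(...), pass the parsed body to pipelines_in_folder(...), and POST to create_run_url(...) for each returned name. If the list response is paged, the remaining pages would need to be fetched as well before filtering.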
Folders are really just organizational structures for the code assets that describe pipelines (the same goes for Datasets and Data Flows); they have no real substance or purpose inside the executing environment. This is why pipeline names have to be globally unique rather than unique to their containing folder.
Another problem you are going to face is that the "Execute Pipeline" activity is not very dynamic. The pipeline name has to be known at design time, and while parameter values are dynamic, the parameter names are not. For these reasons, you can't have a ForEach loop that dynamically executes child pipelines.
If I were tackling this problem, it would be through an external pipeline management system that you would have to build yourself. This is not trivial, and in your case would have additional challenges because of the folder level focus.

Joining ADLS files created with Append and ConcurrentAppend

We have several large CSV files in Azure Data Lake Store that were created using the Append method of the .NET API. Recently, we switched over to ConcurrentAppend for performance reasons. Since ConcurrentAppend and Append cannot be used interchangeably, the switch required us to create a new folder structure for the files, to make sure that the ConcurrentAppend would never hit any files created using Append.
However, our downstream application needs to load all data, both from before and after the switch. Instead of changing our application, we wanted to join the files (using the PowerShell SDK Join-AzureRmDataLakeStoreItem cmdlet), but the documentation does not specify whether files joined this way can be written to by ConcurrentAppend after the join. I suspect that we will face issues, since we are going to join files created by both methods (maybe it's not even possible to do the join?)
So my questions are as follows:
Can ConcurrentAppend write to a file that has been joined using Join-AzureRmDataLakeStoreItem, even if one or more of the source files have been created using Append?
If not, we will use U-SQL to combine the files, but can ConcurrentAppend write to a file that has been outputted from a U-SQL job?
If not, do we have any other options than executing a local script (using the .NET API for example), which will read all files, and write a new set of files back to the lake using only ConcurrentAppend?
Cost is a concern, which is why we prefer to use the PowerShell cmdlet if possible, and would like to avoid the last option.
At present, no append operations can be executed on a file after the join operation. We are currently working on a feature to remove this limitation, but until then appends will not work on concatenated files.

Is there a way to clone data from CrateDB into Crate running on a new container?

I currently have one container which runs Crate, and stores all its data in the /data/ directory. I am trying to create a clone of this container for debugging purposes -- ideally, the clone would be running Crate (which I can query) using the exact same data. I've tried mounting the same data directory into the /data/ directory of the cloned container and starting Crate, but when I run any queries, I notice that Crate shows 0 tables (that is, it doesn't recognize the data in the folder as database tables). How do I get around this? I know I can export and import data using COPY TO and COPY FROM, but I have so many tables that that would be quite cumbersome to write.
I'm wondering why you want to use the same data directory for debugging purposes, since you would then modify data that you probably don't want to change. Also, the two instances would overwrite each other's data when using the same data directory at the same time. That's the reason why this is not working.
What you can still do is simply copy the folder in your file system and mount the cloned folder into the second (debugging) node.
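The copy step could be scripted like this. It is a minimal sketch using only the Python standard library; the function name and paths are hypothetical, and it assumes the original node is stopped (or at least not writing) while the copy runs, which is my assumption rather than something stated above.

```python
import shutil
from pathlib import Path


def clone_data_dir(source: str, target: str) -> Path:
    """Copy a data directory to a new location for a second node to mount."""
    src, dst = Path(source), Path(target)
    if not src.is_dir():
        raise NotADirectoryError(f"{src} is not a directory")
    # copytree refuses to overwrite an existing target, so an earlier
    # clone is never silently merged with a new one.
    shutil.copytree(src, dst)
    return dst
```

For example, clone_data_dir("/srv/crate/data", "/srv/crate/data-debug") would produce a separate folder that the debugging container can mount as its /data/ directory, leaving the original untouched.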
Another solution would be to create a cluster containing both nodes as documented here: https://crate.io/docs/crate/guide/best_practices/docker.html.
Hope that helps.

FlywayDB ignore sub-folder in migration

I have a situation where I would like to ignore specific folders inside of where Flyway is looking for the migration files.
Example:
/db/Migration
    2.0-newBase.sql
    /oldScripts
        1.1-base.sql
        1.2-foo.sql
I want to ignore everything inside of the 'oldScripts' sub folder. Is there a flag that I can set in Flyway configs like ignoreFolder=SOME_FOLDER or scanRecursive=false?
An example of why I would do this: say I have 1000 scripts in my migration folder. If we onboard a new member, instead of having them run the migration on 1000 files, they could just run the one script (the new base) and proceed from there. The alternative would be to never sync those files in the first place, but then people would need to remember to check source control for prior migrations instead of just looking on their local drive.
This is not currently supported directly. You could put both directories at the same level in the hierarchy (without nesting them) and selectively configure flyway.locations to achieve the same thing.
Since Flyway 6.4.0 wildcards are supported in flyway.locations. Examples:
db/**/test
db/release1.*
db/release1.?
More info at https://flywaydb.org/blog/organising-your-migrations

Choosing multiple Hibernate import.sql based on conditions

How can I specify which import file I want Hibernate to run? Is there a configuration option (I think I have seen something like this somewhere) where I can name a custom .sql file and have Hibernate run it?
I want to split my creation into multiple files. I also want to run different scripts that will generate data depending on which Hibernate config I am using. So if I am running locally it should use one set of .sql files, and if I am testing in QA it should use another.
I have multiple config files that I can run depending on what I want, so now I need to figure out how to specify which script should run in which configuration.
cheers
'hibernate.hbm2ddl.import_files' is the setting you want (org.hibernate.cfg.AvailableSettings#HBM2DDL_IMPORT_FILES).
http://docs.jboss.org/hibernate/orm/4.1/javadocs/org/hibernate/cfg/AvailableSettings.html#HBM2DDL_IMPORT_FILES