I've just downloaded Pentaho Data Integration Community (pdi-ce-6.1.0.1-196), a.k.a. Kettle, with the goal of designing an ETL routine to perform nightly migrations from a MongoDB schema into PostgreSQL.
I couldn't get past the very first task: creating a MongoDB connection. MongoDB is not listed as a Connection Type in the New Connection dialog, so I chose Generic database. Then I failed to find anything MongoDB-related to put in the Custom Driver Class Name field required for the generic connection.
Is it possible that the Kettle installation/configuration went wrong? I remember that I had to kill the first startup because it hung forever.
Or does PDI-CE lack some component that I must get somewhere else?
PDI handles MongoDB differently from other databases.
If you are working on a transformation (as opposed to a job), go to the "Big Data" group of steps; there you will find two steps, MongoDB Input and MongoDB Output.
Within those steps you specify the connection information for your database.
Hope that helps,
Mark
P.S. There is also a "MongoDB Delete" step in the Marketplace that comes in useful when deleting data from collections.
I switched to Corda Enterprise mainly to try out how it handles automated database migration.
The documentation here says that tools-database-manager generates only the SQL version of the Liquibase script for the initial DB, and that the SQL version is database specific, so it should not be used in production.
But it is also possible to generate the XML with the Liquibase CLI using this command:
/snap/bin/liquibase --url="jdbc:h2:tcp://localhost:10039/node" --driver=org.h2.Driver --classpath=/home/corda/Downloads/h2.jar generateChangeLog
which I did; I then removed all the changesets related to Corda's internal tables, leaving only my own, and everything seems to work.
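For illustration, the trimmed changelog ends up looking roughly like this (the changeset id and the table definition below are made up for the example, not real output):

    <databaseChangeLog
        xmlns="http://www.liquibase.org/xml/ns/dbchangelog"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://www.liquibase.org/xml/ns/dbchangelog
            http://www.liquibase.org/xml/ns/dbchangelog/dbchangelog-3.5.xsd">

        <!-- changesets for Corda's internal tables removed; only my own remain -->
        <changeSet author="generated" id="1">
            <createTable tableName="IOU_STATES"> <!-- hypothetical CorDapp table -->
                <column name="TRANSACTION_ID" type="VARCHAR(64)"/>
                <column name="OUTPUT_INDEX" type="INT"/>
                <column name="VALUE" type="INT"/>
            </createTable>
        </changeSet>
    </databaseChangeLog>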
So the question is: might this approach have hidden dangers that I don't know about? Why else would the Corda team have developed tools-database-manager, and why don't they yet support XML generation with it?
And this leads to another question: what if, for example, I forget to include one of my tables in the initial script? Corda does not seem to complain about it. Won't my table be created? Will I ever be able to migrate that table if it is missing from the initial script?
Firstly, tools-database-manager is a helper tool that makes it easy for developers to perform database migrations.
Let's say you have 2 nodes in your network, each using a different database: PartyA uses PostgreSQL and PartyB uses Oracle. If PartyA uses this tool to create the migration script by connecting to PostgreSQL, it will output SQL statements specific to PostgreSQL.
Such a script is not portable across databases, which is why the generated script is said to be database specific.
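For example, an auto-incrementing primary key comes out differently on each database (illustrative DDL, not actual tool output):

    -- generated against PostgreSQL: uses a PostgreSQL-specific column type
    CREATE TABLE iou_states (id BIGSERIAL PRIMARY KEY, value INT);

    -- the Oracle equivalent needs entirely different syntax
    CREATE TABLE iou_states (id NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY, value NUMBER);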
Also, you do not want to blindly trust a script and fire it directly at your production database; it contains DDL statements. So it is strongly recommended that every time a script is generated, you look through it manually and make sure you know what it is doing.
There are a lot of enhancements going on in this space; supporting XML for the migration script is one of them.
As mentioned earlier, you should review the migration script manually. If you forget to add one of your tables, Corda will not complain; it will fail some time later, when your code tries to access that table.
Yes, you can stop the node and create the missing table by adding a create-table script.
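A minimal sketch of such a changeset, assuming Liquibase's XML format and a hypothetical MY_TABLE:

    <changeSet author="dev" id="add-my-table-1">
        <!-- MY_TABLE is a placeholder; adjust the columns to your actual schema -->
        <createTable tableName="MY_TABLE">
            <column name="ID" type="BIGINT">
                <constraints primaryKey="true" nullable="false"/>
            </column>
            <column name="NAME" type="VARCHAR(255)"/>
        </createTable>
    </changeSet>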
Based on the Pentaho guide (https://help.pentaho.com/Documentation/8.2/Setup/Installation/Archive/MySQL_Repository), I successfully converted the Pentaho file-based repository to a MySQL database repository.
Now, does anyone have any idea how the MySQL repository stores its data? That is, if I create a new folder, a new dashboard, or a new connection, how does Pentaho store this data in the MySQL database? I also need to know which tables are used to store which kinds of data.
Attached are the schemas and tables that are created by default for the MySQL-based Pentaho repository.
Please provide any input or reference material on this.
Pentaho's repository comprises three third-party technologies: Jackrabbit, Hibernate, and Quartz. Reports, jobs, transformations, and any other artifacts kept inside the Pentaho Server generally live in Jackrabbit. Scheduling info and triggers are stored in Quartz. And diagnostic info is stored in Hibernate (such as who accessed which reports, how long a report took to run, etc.).
None of this info is designed to be human readable directly out of the database tables. These are sort of "black box" technologies. These are third party technologies that Pentaho simply leverages for its repository functions. If you have additional questions, I'd recommend checking out the technologies themselves on their project pages.
I'm new to Pentaho and I'm trying to set up an automatic deployment process for the Pentaho Business Analytics Platform repository, but I'm having trouble figuring out how to proceed with the data sources.
I would like to export/import all the data sources, in the same way that is explained here for the repository (Reporting, Analyzer, Dashboards, Solution Files...), but for the data connections, mondrian files, schemas, and so on.
I know there's a way to back up and restore the entire repository (explained here), but that's not how I want to proceed, since the entire repository could contain changes that are undesired in production.
This would need to be done with the command line, a REST API, or something else that can be triggered by Jenkins.
Did you try import-export with the -ds (DataSource) qualifier? This will include the data connections, mondrian schemas, and metadata models.
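A sketch of what that export call could look like, using the import-export script that ships with the server (the URL, credentials, and the exact spelling of the datasource flag are assumptions; check ./import-export.sh --help on your version):

    ./import-export.sh --export --url=http://localhost:8080/pentaho \
        --username=admin --password=password \
        --path=/public --file-path=/tmp/export.zip -ds=true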
Otherwise, you can export everything, unzip it, filter it according to a certain logic (to be defined by whoever is in charge of the deployment), zip it again, and import it into prod. A half-day project with Pentaho Data Integration.
Today I maintain a project that has a really messy DB that needs a lot of refactoring and has to be published to clients' machines.
I know that I could add a SQL Server Database Project that contains just the scripts of the database and creates a .dacpac file, which would let me update clients' databases automatically.
I also know that I could just add an .mdf file to App_Data (or even a Solution_Data folder) and have my database there. I suppose the LocalDB that already exists would let me start up my solution without SQL Server.
And lastly, I know that Entity Framework exists with its own migrations. But I don't want to use it, because I can't add and change indexes with its migrations, and I don't have enough flexibility when I need to describe difficult migration scenarios.
My goals:
Generate migration scripts for clients' DBs automatically.
Make my solution self-contained, so that any new programmer who joins the project doesn't even need to install SQL Server on his machine.
Be able to update the local (development) database in 1-2 clicks.
Be able to move back through the history of DB changes (I have a TFS server).
Be able to keep a clean DB (containing only dictionary or lookup tables) in the solution, with an up-to-date DB schema.
Additionally, I want to be able to update my DB model (EF or .dbml) automatically or in a very easy way.
So what I want to ask is:
What are the strengths and weaknesses of these two approaches, given my goals?
Could it be that I should use some combination of these tools?
Or is there another existing tool from MS that I don't know about?
Is there a way to update my DAL model from this DB?
What are the strengths and weaknesses of these two approaches, given my goals?
Using a database project allows you to version control all of the database objects. You can publish to various database instances and roll out changes incrementally, rather than having to drop and recreate the database, thus preserving data. These changes can be in the form of a dacpac, a SQL script, or done right through the VS interface. You gain a lot of control over deployments using pre- and post-deployment scripts and publishing profiles. Developers will be required to install SQL Server (the developer/express edition is usually good enough).
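For example, a command-line publish of a dacpac with SqlPackage might look like this (the file, server, and database names are placeholders):

    SqlPackage.exe /Action:Publish /SourceFile:"MyDb.dacpac" /TargetServerName:"(localdb)\MSSQLLocalDB" /TargetDatabaseName:"MyDb"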
LocalDB is a little easier to work with -- you can make your changes directly in the database without having to publish. LocalDB doesn't have a built-in publish process for pushing changes to other instances. No SQL Server installation required.
Use a database project if you need version control for your database objects, if you have multiple users concurrently making changes, or if you have multiple applications that use the same database. Use LocalDB if none of those conditions apply or for small apps that require their own standalone database.
Could it be that I should use some combination of these tools?
Yes. According to Kevin's comment below, "If the Database Project is set as your startup project, hitting F5 will automatically deploy it to LocalDB. You don't even need a publish profile in this case."
Or is there another existing tool from MS that I don't know about?
Entity Framework's Code First approach comes close.
Is there a way to update my DAL model from this DB?
Entity Framework's POCO generator works well unless you make changes to your DAL classes; those changes get lost the next time you run the generator.
There is a new tool called SqlSharpener which can generate classes from the SQL files in a database project. I have not used it, so I cannot vouch for it, but it looks promising.
One way to generate client scripts for DB changes is to use a database modeling tool like ERwin, which has a free community edition. The best way to meet your database version control requirement with easy script generation is Redgate SQL Source Control. Using the Redgate tool you will meet the first five goals mentioned. Moreover, you can then update the EF model with a single click after changing the DB schema (i.e. the database-first approach), as required in goal 6.
I do not recommend using LocalDB at all. It always causes issues with source control, like "DB file is in use and can't commit...". In addition, the developers on the project will not have a common set of up-to-date data to work on, unless a developer adds test data to the database and asks the others to get the latest version and overwrite their own databases, or generates an update script with the previously mentioned tool and asks every developer to run it on his LocalDB.
The best approach in your situation is to use a SQL Server instance on the network: a master version that all the developers use. Since you have version control on the database with the previously mentioned tool, you can roll back any buggy change on the database server.
If you think the Redgate tool is too expensive for your project's budget, a second approach is to generate a single SQL file from your database that contains all the database objects, and have the other developers update that SQL file in source control as they make changes. This can be done easily with the schema compare tool in Visual Studio, appending the generated script to the SQL file in source control. With the EF database-first approach, you will not have to add as many migration classes as with EF code-first.
I am working with a client that has data in an MSSQL database. I only have read access over a remote ODBC connection and cannot modify the database in any way.
I'd like to replicate a subset of the data locally in an open-source alternative, syncing once per day or so. This is largely to eliminate reads against the data during peak hours. The local data will be used in a Rails 4 application. Note that syncing only needs to be one-way, as I don't have write access.
How can I best accomplish this?
FreeTDS?
Are there any libraries that will help with the syncing, or can I expect to write all the glue code myself?
I would advise you to create a Ruby script that can be scheduled to do the data retrieval.
To connect to the MSSQL database, take a look at this simple project I've created.
Then you only need to code which data you want to retrieve and how you store it.
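A minimal sketch of such a script using the tiny_tds gem (the host, credentials, query, and local model are placeholders):

    require 'tiny_tds'

    # Read-only connection to the client's MSSQL server (placeholder credentials).
    client = TinyTds::Client.new(
      host: 'client-db.example.com', port: 1433,
      username: 'readonly_user', password: 'secret',
      database: 'clientdb'
    )

    # Pull only the rows changed in the last day; each row comes back as a hash.
    result = client.execute(
      "SELECT id, name, updated_at FROM dbo.customers " \
      "WHERE updated_at >= DATEADD(day, -1, GETDATE())"
    )
    result.each do |row|
      LocalCustomer.create!(row) # hypothetical ActiveRecord model backed by your local DB
    end

    client.close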
I prefer the approach of keeping this decoupled from your Rails application, although you could use a scheduler like rufus-scheduler or sidekiq and run it within your application.