A large dataset to test Apache Ignite?

I am new to Apache Ignite. Can you please suggest a way to get a large data set (preferably CSVs along with DDL statements that are Ignite-compliant) which I could use to create schemas and tables in Ignite (using native persistence), in order to test a few use cases that I have?

You can use the Web Console to copy data from a relational database into Apache Ignite, creating the data structures and project files along the way.
Apply it to an existing database, or to something like the MySQL Employees sample database.
The Web Console connects to an existing, internally deployed database through an 'agent' program that runs locally.
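If you end up with plain CSVs plus DDL (for example an export of the Employees sample data), a minimal sketch of loading them into Ignite is to create the table over the JDBC thin driver and bulk-load the file with the COPY command. This assumes a local Ignite node with the thin driver on its default port; the table layout and the file path are placeholders.

```java
// Minimal sketch: create an Ignite table over the JDBC thin driver and
// bulk-load a CSV with the COPY command. Assumes an Ignite node is running
// locally; the columns and file path are placeholders for the sample data.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class IgniteCsvLoad {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.ignite.IgniteJdbcThinDriver");

        try (Connection conn = DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1/");
             Statement stmt = conn.createStatement()) {

            // DDL: Ignite accepts ANSI-style CREATE TABLE statements
            stmt.executeUpdate(
                "CREATE TABLE IF NOT EXISTS employees (" +
                "  emp_no INT PRIMARY KEY," +
                "  first_name VARCHAR," +
                "  last_name VARCHAR," +
                "  hire_date DATE)");

            // Bulk-load the CSV exported from the source database
            stmt.executeUpdate(
                "COPY FROM '/data/employees.csv' " +
                "INTO employees (emp_no, first_name, last_name, hire_date) " +
                "FORMAT CSV");
        }
    }
}
```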

Related

Can Apache Ignite update when a 3rd-party SQL Server database changes something directly?

Can I get some advice on whether it is possible to proceed with the steps below?
SQL Server data is loaded into the Ignite cluster.
The data in SQL Server is then changed directly.
-> Is there a way to reflect this changed data without reloading everything from SQL Server?
When Ignite is used as a cache in front of the database and changes are made directly to the DB without going through the Ignite cluster, can those changes be reflected in the already loaded cache data?
Is it possible to update only the changed values without loading the data again?
If so, which part should I configure? Please advise.
I suppose the real question is - how to propagate changes applied to SQL Server first to the Apache Ignite cluster. And the short answer is - you need to do it yourself, i.e. you need to implement some synchronization logic between the two databases. This should not be a complex task if most of the data updates come through Ignite and SQL Server-first updates are rare.
As for the general approach, you can look at implementations of the Change Data Capture (CDC) pattern. There are multiple articles on how you can achieve this using external tools, for example, CDC Between MySQL and GridGain With Debezium or this video.
It's worth mentioning that the Apache Ignite community is currently working on its own native CDC implementation.
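As a rough illustration of that do-it-yourself synchronization, here is a minimal sketch that polls SQL Server for recently changed rows and upserts them into Ignite over the JDBC thin driver. The table, the column names, and the last_modified tracking column are assumptions; a real setup would more likely rely on SQL Server CDC/rowversion or a tool such as Debezium.

```java
// Hand-rolled sync sketch: read rows changed in SQL Server since the last run
// and upsert them into Ignite via the JDBC thin driver. Connection strings,
// table and columns are placeholders.
import java.sql.*;

public class SqlServerToIgniteSync {
    public static void syncChanges(Timestamp since) throws SQLException {
        String sqlServerUrl = "jdbc:sqlserver://mssql-host;databaseName=mydb;user=app;password=secret";
        String igniteUrl = "jdbc:ignite:thin://127.0.0.1/";

        try (Connection src = DriverManager.getConnection(sqlServerUrl);
             Connection dst = DriverManager.getConnection(igniteUrl);
             PreparedStatement read = src.prepareStatement(
                 "SELECT id, name, price FROM dbo.products WHERE last_modified > ?");
             PreparedStatement upsert = dst.prepareStatement(
                 "MERGE INTO products (id, name, price) VALUES (?, ?, ?)")) {

            read.setTimestamp(1, since);
            try (ResultSet rs = read.executeQuery()) {
                while (rs.next()) {
                    upsert.setInt(1, rs.getInt("id"));
                    upsert.setString(2, rs.getString("name"));
                    upsert.setBigDecimal(3, rs.getBigDecimal("price"));
                    upsert.addBatch();
                }
            }
            upsert.executeBatch();  // apply all changed rows to Ignite in one round trip
        }
    }
}
```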
Take a look at Ignite's external storage integration and the read-through/write-through features. See: https://ignite.apache.org/docs/latest/persistence/external-storage
and https://ignite.apache.org/docs/latest/persistence/custom-cache-store
examples here: https://github.com/apache/ignite/tree/master/examples/src/main/java/org/apache/ignite/examples/datagrid/store
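As a minimal sketch of what the custom-cache-store docs describe, the class below backs an Ignite cache with a SQL Server table for read-through and write-through. The connection string, table, and columns are placeholders, and note that write-through only covers changes made through Ignite, not changes made directly in SQL Server.

```java
// Minimal read-/write-through CacheStore sketch backed by a SQL Server table.
// Connection details and SQL are placeholders.
import java.sql.*;
import javax.cache.Cache;
import javax.cache.integration.CacheLoaderException;
import javax.cache.integration.CacheWriterException;
import org.apache.ignite.cache.store.CacheStoreAdapter;

public class ProductStore extends CacheStoreAdapter<Integer, String> {
    private static final String URL =
        "jdbc:sqlserver://mssql-host;databaseName=mydb;user=app;password=secret";

    @Override public String load(Integer key) throws CacheLoaderException {
        // Read-through: called on a cache miss.
        try (Connection c = DriverManager.getConnection(URL);
             PreparedStatement ps = c.prepareStatement(
                 "SELECT name FROM dbo.products WHERE id = ?")) {
            ps.setInt(1, key);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString(1) : null;
            }
        } catch (SQLException e) {
            throw new CacheLoaderException(e);
        }
    }

    @Override public void write(Cache.Entry<? extends Integer, ? extends String> entry)
        throws CacheWriterException {
        // Write-through: called on cache.put().
        try (Connection c = DriverManager.getConnection(URL);
             PreparedStatement ps = c.prepareStatement(
                 "MERGE dbo.products AS t USING (VALUES (?, ?)) AS s (id, name) " +
                 "ON t.id = s.id WHEN MATCHED THEN UPDATE SET name = s.name " +
                 "WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name);")) {
            ps.setInt(1, entry.getKey());
            ps.setString(2, entry.getValue());
            ps.executeUpdate();
        } catch (SQLException e) {
            throw new CacheWriterException(e);
        }
    }

    @Override public void delete(Object key) throws CacheWriterException {
        // Write-through delete: called on cache.remove().
        try (Connection c = DriverManager.getConnection(URL);
             PreparedStatement ps = c.prepareStatement(
                 "DELETE FROM dbo.products WHERE id = ?")) {
            ps.setInt(1, (Integer) key);
            ps.executeUpdate();
        } catch (SQLException e) {
            throw new CacheWriterException(e);
        }
    }
}
```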

Options for ingesting and processing data in Azure SQL

I need an expert opinion on a project I am working on. We currently get data files that we load into our Azure SQL database using a local script that calls stored procedures. I am planning on replacing the script with SSIS jobs to load the data into our Azure SQL, but I am wondering whether that's a good option given our needs. I am open to different suggestions too. The process we go through is to load the data files into staging tables and validate them before making updates to the live tables. The validation and updates are done by calling stored procedures, so the SSIS package will just load the data and call those stored procedures. I have looked at ADF IR and Databricks but they seem like overkill, though I am open to hearing from people with experience using those as well. I am currently running the SSIS package locally as well. Any suggestions on a better architecture or tools for this scenario? Thanks!
I would definitely have a look at Azure Data Factory Data Flows. With these you can easily build your ETL pipelines in the Azure Data Factory GUI.
As an example, two text files from Blob Storage can be read and joined, a surrogate key added, and the data finally loaded into Azure Synapse Analytics (it would be the same for Azure SQL).
You then put this Mapping Data Flow into a pipeline and can trigger it, e.g. when new data arrives.
You can just BULK INSERT data from Azure Blob Storage:
https://learn.microsoft.com/en-us/sql/relational-databases/import-export/examples-of-bulk-access-to-data-in-azure-blob-storage?view=sql-server-ver15#accessing-data-in-a-csv-file-referencing-an-azure-blob-storage-location
Then you can use ADF (without an IR), Databricks, Azure Batch, or Azure Elastic Jobs to schedule the execution.
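To make that concrete, here is a rough sketch of a small scheduled job that runs the BULK INSERT and then calls the existing validation stored procedure. It assumes an external data source named MyAzureBlobStorage has already been created with a SAS credential, as in the linked Microsoft docs; the server, table, file, and procedure names are placeholders.

```java
// Rough sketch: run BULK INSERT from Blob Storage into a staging table, then
// call the existing validation/merge stored procedure. Names are placeholders;
// an external data source 'MyAzureBlobStorage' is assumed to exist already.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BlobToAzureSql {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:sqlserver://myserver.database.windows.net:1433;"
                   + "databaseName=mydb;user=loader;password=secret;encrypt=true";

        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement()) {

            // Load the CSV from Blob Storage into the staging table
            stmt.executeUpdate(
                "BULK INSERT dbo.staging_orders " +
                "FROM 'incoming/orders.csv' " +
                "WITH (DATA_SOURCE = 'MyAzureBlobStorage', FORMAT = 'CSV', FIRSTROW = 2)");

            // Then run the existing validation/update stored procedure
            stmt.execute("EXEC dbo.usp_validate_and_merge_orders");
        }
    }
}
```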

Best way to set up a new database on a new server which periodically refreshes tables from a live SQL Server?

I need to create a database solely for analytical purposes. The idea is for it to start off as a 1:1 replica of a current SQL Server database, to which we will then add additional tables. The goal is to have read-write access to a database without inadvertently dropping anything in production.
We would ideally like to set a daily refresh schedule to update all tables in the new db to match the tables in the live environment.
In terms of the DBMS for the new database, I am flexible - MySQL, SQL Server, or PostgreSQL would be great. I am not hugely familiar with the Google Cloud Storage/BigQuery stack, but if it is an easy option, I'm open to it.
You could use a standard HA/DR solution with a readable secondary (Availability Groups/mirroring/log shipping), then have a second database on the new server for your additional tables.
Cloud Storage and BigQuery are not RDBMS services themselves, but could be used in this case to store the backups/exports/dumps from the replica, and then have the analytical work performed on those backups.
Here is an example workflow:
Perform a backup and restore it into a different database
Add the new tables in the new database
Export the database as a CSV file on your local machine
Here you could either load the CSV file directly into BigQuery, or upload it to a previously created Cloud Storage bucket
Query the data
I suggest taking a look at the multiple methods for loading data into BigQuery, as well as the methods for querying external data sources, which may help you determine which database replication/export method is best for your use case.
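As an illustration of the load step, here is a minimal sketch using the google-cloud-bigquery Java client to load a CSV that has already been uploaded to Cloud Storage. The bucket, dataset, and table names are placeholders, and it assumes application default credentials are configured.

```java
// Minimal sketch: load a CSV from a Cloud Storage bucket into a BigQuery
// table. Bucket, dataset and table names are placeholders.
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class CsvToBigQuery {
    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        LoadJobConfiguration loadConfig =
            LoadJobConfiguration.newBuilder(
                    TableId.of("analytics_dataset", "orders"),
                    "gs://my-analytics-bucket/exports/orders.csv")
                .setFormatOptions(FormatOptions.csv())
                .setAutodetect(true)                                           // infer the schema from the CSV
                .setWriteDisposition(JobInfo.WriteDisposition.WRITE_TRUNCATE)  // replace on each daily refresh
                .build();

        // Start the load job and wait for it to finish.
        Job job = bigquery.create(JobInfo.of(loadConfig)).waitFor();
        if (job == null || job.getStatus().getError() != null) {
            throw new RuntimeException("BigQuery load failed: "
                + (job == null ? "job no longer exists" : job.getStatus().getError()));
        }
    }
}
```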

Is there a native SQL source in Apache Flume?

I need to create a simple data warehouse. The data sources for the data warehouse are heterogeneous, so I'm experimenting with frameworks like Apache Flume for data collection. I went through the documentation but didn't find anything about SQL. (http://flume.apache.org/FlumeDeveloperGuide.html and http://flume.apache.org/FlumeUserGuide.html#flume-sources)
Question: Are there any (native) possibilities to connect an Apache Flume source to an SQL server?
Apache Flume is designed to collect, aggregate and move log data to HDFS.
If you are considering moving large amounts of data from a SQL database, take a look at Apache Sqoop:
http://sqoop.apache.org/
Look into the flume-ng-sql-source project. Here are some examples as well:
http://www.toadworld.com/platforms/oracle/w/wiki/11093.streaming-oracle-database-logs-to-hdfs-with-flume
http://www.toadworld.com/platforms/oracle/w/wiki/11100.streaming-mysql-table-data-to-oracle-nosql-database-with-flume

Concern with using external SQL server for DIH

I am looking to import entries into my Solr server by using the DIH, connecting to an external PostgreSQL server using the JDBC driver. I will be importing about 50,000 entries each time.
Is connecting to an external SQL server for my data unreliable or risky, or is it instead perfectly reasonable?
My only alternative is to export the SQL file on the other server, download it to my Solr server, import it into my Solr server's copy of PostgreSQL, and then run the DIH on the local database.
The way you're using it is pretty much why the DIH exists. Otherwise, you could just use the /update handler with XML documents. The core I'm working on right now regularly indexes 11,000,000 rows per batch.
This is a standard use case, importing from a remote DB. Proceed with confidence!