Spark SQL - SQL scripts processing

I'm new to Spark and would like to know whether it is possible to pass Spark an SQL script for processing.
My goal is to bring data into Spark from both MySQL (through JDBC) and Cassandra, and then run an SQL script file against it with no changes, or only minimal ones. The reason I say minimal is that I have a lot of SQL scripts (similar in structure to stored procedures) that I don't want to convert to RDD code by hand.
The main purpose is to process the data (execute these SQL scripts) through Spark, taking advantage of its capabilities and speed.

This guy found a pretty general way to run SQL scripts; you just pass in the connection to your database:
https://github.com/syncany/syncany/blob/15dc5344696a800061e8b363f94986e821a0b362/syncany-lib/src/main/java/org/syncany/util/SqlRunner.java
One limitation is that each of the statements in your SQL script has to be delimited with a semicolon. It basically just parses the script like a text document and executes each statement as it goes. You could probably modify it to take advantage of Spark's SQLContext instead of using a Connection (see the sketch below).
In terms of performance, it probably won't be as fast as a stored procedure because you're bottlenecked on the InputStream. But it is a workaround.
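For what it's worth, here is a minimal sketch of that idea in PySpark (assuming the tables the scripts refer to are first registered as temp views; the JDBC options, view name, and file path are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-script-runner").getOrCreate()

# Register the sources the scripts refer to as temp views so their table
# names resolve inside Spark (requires the MySQL JDBC driver on the classpath;
# all connection options below are placeholders).
orders_df = (spark.read.format("jdbc")
             .option("url", "jdbc:mysql://dbhost:3306/mydb")
             .option("dbtable", "orders")
             .option("user", "user")
             .option("password", "secret")
             .load())
orders_df.createOrReplaceTempView("orders")

def run_sql_script(spark, path):
    """Naive runner: splits on ';', so it will break on semicolons that
    appear inside string literals or comments."""
    result = None
    with open(path) as f:
        script = f.read()
    for stmt in script.split(";"):
        stmt = stmt.strip()
        if stmt:
            result = spark.sql(stmt)
    return result  # DataFrame produced by the last statement, if any

# run_sql_script(spark, "monthly_report.sql")
```

The same split-and-execute loop would work against a SQLContext on older Spark versions; spark.sql covers DDL (e.g. CREATE TEMPORARY VIEW) as well as queries.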

Related

Can I replicate my ADF dataflow by using SQL stored procedure to transform JSON

I understand this may not be too precise, so feel free to delete.
I am currently using Azure Data Factory as an ETL pipeline to transform JSON files that come in daily with the same schema and load them into Azure SQL (screenshots: ADF Dataflow, ADF Pipeline, Output to SQL table).
This process is working exactly as I want, but we would like to move everything on-premises.
I have a PowerShell script that moves the files from SFTP to a local folder; I'd then like to use a stored procedure to transform them into their respective tables.
However, I have very limited experience with SQL, and writing such a procedure to parse these rather large files seems intimidating.
My current ADF process is entirely automated: files are copied from the SFTP server to blob storage, run through a parameterized dataflow pipeline for transformations on a trigger, and loaded into SQL.
I would like to maintain the same functionality, i.e. no manual input to run the process.
I'm aware of OPENJSON + CROSS APPLY and made an attempt; if anyone can point me in the right direction (assuming what I'm asking is possible using T-SQL), that would help. My code is moot, I'm just using it as an example (a sketch of the pattern is below).
tl;dr: Can I replicate my ADF pipeline + dataflow by using a SQL stored procedure to parse a complex JSON file?
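For reference, the OPENJSON + CROSS APPLY pattern can look roughly like the following, driven here from Python via pyodbc; the target table, columns, and JSON shape are invented for the example, and the file is assumed to contain a JSON array of records:

```python
import pyodbc

# T-SQL that shreds a JSON array into rows; nested items are exposed with
# AS JSON and expanded with CROSS APPLY OPENJSON.
TSQL = """
DECLARE @json NVARCHAR(MAX) = ?;

INSERT INTO dbo.DailyRecords (Id, Name, Amount)
SELECT r.Id, r.Name, i.Amount
FROM OPENJSON(@json)
     WITH (Id    INT           '$.id',
           Name  NVARCHAR(100) '$.name',
           Items NVARCHAR(MAX) '$.items' AS JSON) AS r
CROSS APPLY OPENJSON(r.Items)
     WITH (Amount DECIMAL(18, 2) '$.amount') AS i;
"""

def load_json_file(path, conn_str):
    with open(path, encoding="utf-8") as f:
        payload = f.read()
    with pyodbc.connect(conn_str) as conn:
        conn.execute(TSQL, payload)  # parameterized, so no string concatenation
        conn.commit()

# load_json_file(r"C:\incoming\2024-01-01.json",
#                "DRIVER={ODBC Driver 17 for SQL Server};SERVER=...;DATABASE=...;Trusted_Connection=yes")
```

The same T-SQL could instead live in a stored procedure that takes the JSON as an NVARCHAR(MAX) parameter, which is closer to what the question asks for; the Python wrapper is only there to show the end-to-end flow.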

Lambda and SQL queries that change over time

I want to deploy multiple Lambda functions, each issuing Athena SQL queries. These queries may change as the schema of the tables involved changes.
I'm considering either creating a SQL file in S3 or redeploying the Lambda function every time the queries change. Is there a recommended approach for this use case?
It depends on which of the following is more important:
Speed of modifying SQL statements
Speed of Lambda function execution
You could have the Lambda function reach out to an S3 bucket to grab a SQL file on every invocation, but that would be wildly inefficient and more expensive. You could slightly improve upon this by using a caching strategy that checks whether the file has changed (by hash/checksum or ETag) before pulling it down (see the sketch below).
The better approach would be to include the SQL in the function and simply redeploy when you want to change the SQL. Without further context I can't say whether this will suit your needs (perhaps you need to be able to swap the SQL very quickly). Deploying Lambda functions can be very quick, however, if set up properly (under a minute to update a single function).
There might be a third approach where you parameterize your Lambda function and SQL commands such that you don't need to change the SQL as frequently and simply pass different parameters to the function.
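To make the first trade-off concrete, here is a hedged sketch of the caching idea: a handler that re-reads the query text from S3 only when the object's ETag changes, then submits it to Athena. The bucket, key, database, and output location are all placeholders:

```python
import boto3

s3 = boto3.client("s3")
athena = boto3.client("athena")

BUCKET, KEY = "my-sql-bucket", "queries/report.sql"
_cache = {"etag": None, "sql": None}  # module-level, so it survives warm invocations

def _load_sql():
    head = s3.head_object(Bucket=BUCKET, Key=KEY)
    if head["ETag"] != _cache["etag"]:  # re-download only if the object changed
        body = s3.get_object(Bucket=BUCKET, Key=KEY)["Body"].read()
        _cache.update(etag=head["ETag"], sql=body.decode("utf-8"))
    return _cache["sql"]

def handler(event, context):
    resp = athena.start_query_execution(
        QueryString=_load_sql(),
        QueryExecutionContext={"Database": "analytics"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )
    return {"query_execution_id": resp["QueryExecutionId"]}
```

The parameterized variant would instead keep a static query template in the deployment package and substitute values from the invocation event, avoiding the S3 round trip entirely.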

How to execute a macro in BigQuery

I have a requirement to move some existing frontend applications that use Teradata as the backend to Google BigQuery. A common pattern in these frontend applications is to call a macro in Teradata, based on different inputs selected by users. Considering that BigQuery doesn't have a macro entity, how can I replace this and have the frontend call BigQuery to execute something similar? The connection to BigQuery is through ODBC/JDBC or Java services.
A macro in Teradata is just a way to execute multiple SQL statements as a single request, which is in turn treated as a single transaction. It also allows you to parameterize your query.
If your new DB backend supports it, you can convert the macros into stored procedures / functions. Otherwise, you can pull out the individual SQL statements from the macro and try to run them together as a single transaction.
These links may be helpful: Functions, DML.
Glancing at the documentation, it looks like writing a function may be your best bet: "There is no support for multi-statement transactions."
You can look at BigQuery scripting, which is in beta (https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting#bigquery-scripting), for migrating your macros from Teradata. With this release you can write procedures in which you define all your business logic, and then execute the procedure using a CALL statement (a sketch follows below).
Thanks,
Jayadeep
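As an illustration of the scripting approach, here is a rough sketch using the google-cloud-bigquery Python client (dataset, table, and column names are invented): the macro's statements become a stored procedure, and the frontend issues a parameterized CALL.

```python
from google.cloud import bigquery

client = bigquery.Client()

# One-time setup: the statements that used to live in the Teradata macro
# become the body of a BigQuery procedure.
client.query("""
CREATE OR REPLACE PROCEDURE mydataset.load_daily_orders(run_date DATE)
BEGIN
  DELETE FROM mydataset.daily_report WHERE report_date = run_date;
  INSERT INTO mydataset.daily_report (report_date, customer_id, amount)
  SELECT run_date, customer_id, amount
  FROM mydataset.orders
  WHERE order_date = run_date;
END;
""").result()

# What the frontend would run per user request, roughly the equivalent of
# EXEC my_macro('2024-01-01') in Teradata.
job = client.query(
    "CALL mydataset.load_daily_orders(@run_date);",
    job_config=bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("run_date", "DATE", "2024-01-01")]
    ),
)
job.result()
```

The same CALL statement should also be usable over the ODBC/JDBC drivers mentioned in the question, so the frontend change would largely be limited to swapping the macro invocation for a CALL.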
As mentioned above:
"A macro in Teradata is just a way to execute multiple SQL statements as a single request, which is in turn treated as a single transaction. It also allows you to parameterize your query."
Having said that, you just need to do the migration part from Teradata; here you can find the guide for that. To answer your question, the connection is made through JDBC, whose drivers are tdgssconfig.jar and terajdbc4.jar.

Transfer Data from Oracle database 11G to MongoDB

I want to have an automatic, timed transfer from an Oracle database to MongoDB. In a typical RDBMS scenario, I would have established a connection between the two databases by creating a dblink and transferred the data using PL/SQL procedures.
But I don't know what to do in the MongoDB case: how and what should I implement so that I can have an automatic transfer from the Oracle database to MongoDB?
I would look at using Oracle GoldenGate. It has a MongoDB handler.
https://docs.oracle.com/goldengate/bd123110/gg-bd/GADBD/using-mongodb-handler.htm#GADBD-GUID-084CCCD6-8D13-43C0-A6C4-4D2AC8B8FA86
https://oracledb101.wordpress.com/2016/07/29/using-goldengate-to-replicate-to-mongodb/
What type of data do you want to transfer from the Oracle database to MongoDB? If you just want to export/import a small number of tables on a set schedule, you could use something like UTL_FILE on the Oracle side to create a .csv export of the table(s) and use DBMS_SCHEDULER to schedule the export to happen automatically based on your desired time frame.
You could also use an application like SQL Developer to export tables as .csv files by browsing to the table in the schema list, then Right Click -> Export and choosing the .csv format. You may also find it a little easier to use UTL_FILE and DBMS_SCHEDULER through SQL Developer instead of relying on SQL*Plus.
Once you have your .csv file(s), you can use mongoimport to import the data, though I'm not sure whether MongoDB supports scheduled jobs the way Oracle does (I work primarily with the latter). If you are using Linux, you could use cron to schedule a script that imports the .csv file on a set interval.
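If you go the mongoimport + cron route, the import-side script could be as small as the following sketch; the export directory, database, and collection names are placeholders, and the CSV files are assumed to have a header line:

```python
import glob
import os
import subprocess

EXPORT_DIR = "/data/oracle_exports"        # where the Oracle job drops its .csv files
ARCHIVE_DIR = "/data/oracle_exports/done"  # processed files are moved here

def import_new_files():
    for path in sorted(glob.glob(os.path.join(EXPORT_DIR, "*.csv"))):
        subprocess.run(
            ["mongoimport",
             "--db", "reporting",
             "--collection", "orders",
             "--type", "csv",
             "--headerline",               # first row of the CSV holds the field names
             "--file", path],
            check=True,
        )
        os.rename(path, os.path.join(ARCHIVE_DIR, os.path.basename(path)))

if __name__ == "__main__":
    # Example crontab entry: 0 2 * * * /usr/bin/python3 /opt/etl/import_csv.py
    import_new_files()
```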

Is it possible to transfer data to SQL Server from an unconventional database

My company has a SQL Server database which they would like to populate with data from a hierarchical database (as opposed to a relational one). I have already written a .NET application to map its schema to a relational database and have done that successfully. However, my problem is that the tech being used here is so old that I see no obvious way to transfer the data.
I do have some ideas about how to do this. One involves writing file scans in the unconventional database, dumping out files as CSV, and then doing a bulk upload into SQL Server. I'm not keen on this, because invalid data quite often terminates the bulk upload.
I was hoping to explore options around Service Broker: dump out live transactions whenever a record changes in my database, and have these picked up somehow?
Secondly, if I dump out live or changed records to a file (I can format the file however it is needed), is there something that can pull it into SQL Server?
Any help would be greatly appreciated.
Regards,
Waqar
Service Broker is a very powerful queue/messaging management system. I am not sure why you want to use it for this.
You can set up an SSIS job that keeps checking a folder for CSV files and, when it detects a new one, reads it into SQL Server and then zips and archives it somewhere else. This is very common. SSIS can then either process the data (it's a wonderful ETL tool) or invoke procedures in SQL Server to process the data. SSIS is very fast and is rarely overwhelmed, so why would you use Service Broker?
If it's IMS (mainframe)-type data, you have to convert it to flat tables and then to CSV-type text files for SQL Server to read.
SQL Server is very good at processing XML and, as of SQL Server 2016, JSON-shaped data, so if that is your data format you can import it directly into SQL Server.
Skip bulk insert. The SQL Server xml data type lends itself to doing what you're doing. If you can output data from your other system into an XML format, you can push that XML directly into an argument for a stored procedure.
After that, you can use the functions for the XML type to iterate through the data as needed.
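A hedged illustration of that approach, with the procedure created and called from Python via pyodbc just to show the round trip; the procedure name, target table, and XML shape are invented for the example:

```python
import pyodbc

# The stored procedure takes the XML payload as a parameter and shreds it
# with the xml type's nodes()/value() methods.
CREATE_PROC = """
CREATE OR ALTER PROCEDURE dbo.LoadRecords @payload XML
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO dbo.Records (Id, Name)
    SELECT r.value('(Id/text())[1]',   'INT'),
           r.value('(Name/text())[1]', 'NVARCHAR(100)')
    FROM @payload.nodes('/Records/Record') AS t(r);
END
"""

xml_payload = "<Records><Record><Id>1</Id><Name>Alpha</Name></Record></Records>"

conn_str = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=...;DATABASE=...;Trusted_Connection=yes")

with pyodbc.connect(conn_str) as conn:
    conn.execute(CREATE_PROC)
    conn.execute("EXEC dbo.LoadRecords @payload = ?", xml_payload)  # nvarchar converts to xml implicitly
    conn.commit()
```

The calling side could just as easily be the existing .NET application passing its XML string as a parameter; the point is that validation and iteration happen inside the procedure rather than during a bulk insert.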