My source is SQL Server and I am using SSIS to export data to S3 buckets, but now my requirement is to send files in Parquet file format.
Can you guys give some clues on how to achieve this?
Thanks,
Ven
For folks stumbling on this answer, Apache Parquet is a project that specifies a columnar file format employed by Hadoop and other Apache projects.
Unless you find a custom component or write some .NET code to do it, you're not going to be able to export data from SQL Server to a Parquet file. KingswaySoft's SSIS Big Data Components might offer one such custom component, but I have no firsthand experience with them.
If you were exporting to Azure, you'd have two options:
Use the Flexible File Destination component (part of the Azure feature pack), which exports to a Parquet file hosted in Azure Blob or Data Lake Gen2 storage.
Leverage PolyBase, a SQL Server feature. It lets you export to a Parquet file via the external table feature. However, that file has to be hosted in a location mentioned here. Unfortunately S3 isn't an option.
If it were me, I'd move the data to S3 as a CSV file and then use Athena to convert the CSV file to Parquet. There is a nifty article here that talks through the Athena piece:
https://www.cloudforecast.io/blog/Athena-to-transform-CSV-to-Parquet/
Net-net, you'll need to spend a little money, get creative, switch to Azure, or do the conversion in AWS.
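The Athena conversion in that article boils down to a CREATE TABLE AS SELECT (CTAS) statement. As a hedged sketch (database, table, and bucket names below are placeholders, not anything from the original post), you can build and submit the statement from Python with boto3:

```python
def build_ctas(src_table, dest_table, s3_location):
    """Build an Athena CTAS statement that rewrites a CSV-backed table
    as Parquet files under the given S3 prefix."""
    return (
        f"CREATE TABLE {dest_table} "
        f"WITH (format = 'PARQUET', external_location = '{s3_location}') "
        f"AS SELECT * FROM {src_table}"
    )


def run_ctas(database, query, output_location):
    """Submit the query to Athena; returns the query execution id."""
    import boto3  # imported here so the SQL builder stays dependency-free

    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return resp["QueryExecutionId"]


# Example with placeholder names:
# query = build_ctas("mydb.events_csv", "mydb.events_parquet",
#                    "s3://my-bucket/parquet/")
# run_ctas("mydb", query, "s3://my-bucket/athena-results/")
```

The CTAS run writes the Parquet files to `external_location`, so after it completes the S3 prefix holds your converted data.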
I read a few threads on this but noticed most are outdated, since Excel became a supported integration in 2020.
I have a few Excel files stored in Dropbox. I would like to automate the extraction of that data into Azure Data Factory, perform some ETL functions with data coming from other sources, and finally push the final, complete table to Azure SQL.
I would like to ask what is the most efficient way of doing so?
Would the best approach be to automate a Logic App to extract the .xlsx files into Azure Blob Storage, use Data Factory for ETL, join with other SQL tables, and finally push the final table to Azure SQL?
Appreciate it!
Before using a Logic App to extract the Excel files, review the known issues and limitations of the Excel connectors.
If you are importing large files with a Logic App, also consider this thread, depending on the size of the files involved - logic apps vs azure functions for large files.
To summarize the approach, I have listed the steps below:
Step 1: Use an Azure Logic App to upload the Excel files from Dropbox to Blob Storage.
Step 2: Create a Data Factory pipeline with a Copy Data activity.
Step 3: Use the Blob Storage service as the source dataset.
Step 4: Create the SQL database with the required schema.
Step 5: Do the schema mapping.
Step 6: Finally, use the SQL database table as the sink.
I have a requirement to read data from Azure SQL Server and write it to an Excel blob using Data Factory. I created a CSV file from Azure SQL Server using the Data Factory Copy activity, but I have no idea how to convert the CSV to Excel or write Excel directly from Azure SQL using Data Factory. I searched the internet and found Azure Functions as an option.
Any suggestions you all have about saving CSV to XLSX via Azure Functions?
Excel format is supported for the following connectors: Amazon S3, Azure Blob, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Azure File Storage, File System, FTP, Google Cloud Storage, HDFS, HTTP, and SFTP. It is supported as source but not sink.
As the docs say, Excel format is not supported as a sink for now, so you can't directly convert a CSV file to an Excel file using the Copy activity.
In Azure Functions, you can create a Python function that uses pandas to read the CSV file and then convert it to an Excel file, as #Marco Massetti comments.
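A minimal sketch of that pandas conversion (the function name is illustrative, not from the original post; it assumes pandas plus openpyxl, the default .xlsx engine, are available in the Function app):

```python
import io

import pandas as pd


def csv_bytes_to_xlsx_bytes(csv_bytes: bytes) -> bytes:
    """Read CSV data from bytes and return the same table as .xlsx bytes."""
    df = pd.read_csv(io.BytesIO(csv_bytes))
    out = io.BytesIO()
    df.to_excel(out, index=False)  # openpyxl handles the .xlsx writing
    return out.getvalue()


# In the Azure Function you would read csv_bytes from the source blob
# (e.g. a blob-trigger input stream) and write the returned bytes to the
# destination blob.
```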
I've just consolidated 100 CSV files into a single monster file with a total size of about 50 GB.
I now need to load this into my Azure database. Given that I have already created my table in the database, what would be the quickest method to get this single file into the table?
The methods I've read about include: the Import Flat File wizard, Blob Storage/Data Factory, and BCP.
Which is the quickest method someone can recommend, please?
Azure Data Factory should be a good fit for this scenario, as it is built to process and transform data without worrying about scale.
Assuming the large CSV file is stored somewhere on disk and you do not want to move it to any external storage (to save time and cost), it would be better to simply create a self-hosted integration runtime pointing to the machine hosting your CSV file and create a linked service in ADF to read the file. Once that is done, simply ingest the file and point it at the sink, which is your Azure SQL database.
https://learn.microsoft.com/en-us/azure/data-factory/connector-file-system
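If you'd rather try the BCP route mentioned in the question, a hedged sketch of driving the bcp command-line utility from Python (server, database, table, and user names are placeholders):

```python
import subprocess


def bcp_import_args(table, csv_path, server, database, user):
    """Build the argument list for a bcp bulk import.

    -c   character (text) mode
    -t,  comma field terminator
    -F 2 start at row 2 (skip a header row)
    The password flag (-P) is omitted; bcp prompts for it interactively.
    """
    return [
        "bcp", table, "in", csv_path,
        "-S", server, "-d", database, "-U", user,
        "-c", "-t,", "-F", "2",
    ]


# Example with placeholder names:
# subprocess.run(bcp_import_args("dbo.BigTable", "monster.csv",
#                                "myserver.database.windows.net",
#                                "mydb", "loader"), check=True)
```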
I am looking for a way to directly export SQL query results to a CSV file from AWS Lambda. I found this similar question - Exporting table from Amazon RDS into a csv file - but it will not work with the AWS Golang API.
Actually, I want to schedule a Lambda function that will query some views/tables from RDS (SQL Server) daily and write the results to an S3 bucket in CSV format. So I want to download the query results as CSV directly in the Lambda and then upload the file to S3.
I have also found the AWS Data Pipeline service for copying RDS data to S3 directly, but I am not sure whether I can make use of it here.
It would be helpful if anyone could suggest the right process and references for implementing it.
You can transfer files between a DB instance running Amazon RDS for SQL Server and an Amazon S3 bucket. By doing this, you can use Amazon S3 with SQL Server features such as BULK INSERT. For example, you can download .csv, .xml, .txt, and other files from Amazon S3 to the DB instance host and import the data from D:\S3\ into the database. All files are stored in D:\S3\ on the DB instance.
Reference:
https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/User.SQLServer.Options.S3-integration.html
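The asker wants Go, but the flow (query, serialize to CSV, upload to S3) is the same in any language. A hedged Python sketch for illustration: only the CSV serialization below is concrete; the query and upload are comments with placeholder names, assuming a SQL Server driver and boto3 are packaged with the function. The same shape translates to database/sql plus the S3 client in Go.

```python
import csv
import io


def rows_to_csv_bytes(columns, rows):
    """Serialize query results (column names plus row tuples) to CSV bytes."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(columns)
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")


# Sketch of the Lambda handler body (placeholder names):
# cursor.execute("SELECT * FROM dbo.DailyView")
# cols = [c[0] for c in cursor.description]
# boto3.client("s3").put_object(
#     Bucket="my-export-bucket",
#     Key="daily/export.csv",
#     Body=rows_to_csv_bytes(cols, cursor.fetchall()),
# )
```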
I have a MySQL DB in AWS. Can I use that database as a data source in BigQuery?
Right now I'm uploading CSVs to a Google Cloud Storage bucket and loading them from there.
I would like to keep it synchronized by pointing at the data source directly rather than loading it every time.
You can create a permanent external table in BigQuery that is connected to Cloud Storage. Then BQ is just the interface while the data resides in GCS. It can be connected to a single CSV file and you are free to update/overwrite that file. I'm not sure, though, whether you can link BQ to a directory full of CSV files or even a tree of directories.
Anyway, have a look here: https://cloud.google.com/bigquery/external-data-cloud-storage
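For reference, the external table definition is just a small JSON body on the tables.insert REST call (or the equivalent client-library object). A hedged sketch that builds the externalDataConfiguration piece, with bucket and path as placeholders; note that a trailing * wildcard in a source URI matches everything under that prefix, which covers the directory-of-CSVs case:

```python
def external_csv_table_def(source_uris, skip_leading_rows=1):
    """Build the externalDataConfiguration body used when defining a
    permanent external table over Cloud Storage (field names follow the
    REST schema of the BigQuery tables resource)."""
    return {
        "sourceFormat": "CSV",
        "sourceUris": list(source_uris),
        "csvOptions": {"skipLeadingRows": skip_leading_rows},
    }


# One wildcard entry covers a whole "directory" of CSV files
# (bucket and path are placeholders):
# external_csv_table_def(["gs://my-bucket/exports/*.csv"])
```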