Quickest way to import a large (50 GB) CSV file into an Azure SQL database

I've just consolidated 100 CSV files into a single monster file with a total size of about 50 GB.
I now need to load this into my Azure database. Given that I have already created the table in the database, what would be the quickest method to get this single file into that table?
The methods I've read about include: Import Flat File, Blob Storage/Data Factory, and BCP.
I'm looking for the quickest method; can anyone recommend one, please?

Azure Data Factory should be a good fit for this scenario, as it is built to process and transform data without you having to worry about scale.
Assuming the large CSV file is stored on a local disk and you do not want to move it to any external storage (to save time and cost), it would be better to install a self-hosted integration runtime on the machine hosting the CSV file and create a linked service in ADF that reads the file through it. Once that is done, simply ingest the file with a copy activity and point it at the sink, which is your Azure SQL database.
https://learn.microsoft.com/en-us/azure/data-factory/connector-file-system
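If you would rather define the same setup in code than click through the portal, below is a rough, untested sketch using the azure-mgmt-datafactory Python SDK. Every name in it (subscription, resource group, factory, the "SelfHostedIR" runtime, paths, connection string, table) is a placeholder, the model and parameter names should be checked against the SDK version you install, and it assumes the self-hosted integration runtime has already been registered on the machine that hosts the file.

```python
# Rough sketch: create linked services, datasets, and a copy pipeline with
# the azure-mgmt-datafactory SDK. All names and credentials are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    LinkedServiceResource, FileServerLinkedService, AzureSqlDatabaseLinkedService,
    IntegrationRuntimeReference, SecureString, LinkedServiceReference,
    DatasetResource, DelimitedTextDataset, FileServerLocation, AzureSqlTableDataset,
    DatasetReference, PipelineResource, CopyActivity, DelimitedTextSource, AzureSqlSink,
)

sub_id, rg, df = "<subscription-id>", "<resource-group>", "<data-factory-name>"
adf = DataFactoryManagementClient(DefaultAzureCredential(), sub_id)

# Linked service to the machine that hosts the 50 GB CSV, reached through
# the self-hosted integration runtime installed on that machine.
file_ls = FileServerLinkedService(
    host="D:\\exports",
    user_id="localuser",
    password=SecureString(value="<password>"),
    connect_via=IntegrationRuntimeReference(
        type="IntegrationRuntimeReference", reference_name="SelfHostedIR"),
)
adf.linked_services.create_or_update(
    rg, df, "OnPremFiles", LinkedServiceResource(properties=file_ls))

# Linked service to the Azure SQL Database sink.
sql_ls = AzureSqlDatabaseLinkedService(
    connection_string="<azure-sql-connection-string>")
adf.linked_services.create_or_update(
    rg, df, "AzureSqlDb", LinkedServiceResource(properties=sql_ls))

# Source dataset: the consolidated CSV. Sink dataset: the pre-created table.
src = DelimitedTextDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="OnPremFiles"),
    location=FileServerLocation(file_name="monster.csv"),
    column_delimiter=",",
    first_row_as_header=True,
)
sink = AzureSqlTableDataset(
    linked_service_name=LinkedServiceReference(
        type="LinkedServiceReference", reference_name="AzureSqlDb"),
    table_name="dbo.MyTable",
)
adf.datasets.create_or_update(rg, df, "CsvSource", DatasetResource(properties=src))
adf.datasets.create_or_update(rg, df, "SqlSink", DatasetResource(properties=sink))

# Copy pipeline: file-system source -> Azure SQL sink, then trigger a run.
copy = CopyActivity(
    name="CopyCsvToSql",
    inputs=[DatasetReference(type="DatasetReference", reference_name="CsvSource")],
    outputs=[DatasetReference(type="DatasetReference", reference_name="SqlSink")],
    source=DelimitedTextSource(),
    sink=AzureSqlSink(),
)
adf.pipelines.create_or_update(rg, df, "LoadMonsterCsv",
                               PipelineResource(activities=[copy]))
adf.pipelines.create_run(rg, df, "LoadMonsterCsv", parameters={})
```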

Related

Excel into Azure Data Factory into SQL

I read a few threads on this but noticed most are outdated, since Excel only became a supported integration in 2020.
I have a few Excel files stored in Dropbox. I would like to automate the extraction of that data into Azure Data Factory, perform some ETL with data coming from other sources, and finally push the final, complete table to Azure SQL.
What is the most efficient way of doing this?
Would it be to automate a Logic App that extracts the .xlsx files into Azure Blob Storage, use Data Factory for the ETL, join with the other SQL tables, and finally push the final table to Azure SQL?
Appreciate it!
Before using a Logic App to extract the Excel files, review the known issues and limitations of the Excel connectors.
If you are importing large files with a Logic App then, depending on their size, also look at this thread: logic apps vs azure functions for large files.
To summarize the approach, here are the steps:
Step 1: Use an Azure Logic App to upload the Excel files from Dropbox to Blob Storage (a scripted alternative is sketched after this list).
Step 2: Create a Data Factory pipeline with a Copy Data activity.
Step 3: Use the Blob Storage container as the source dataset.
Step 4: Create the SQL database with the required schema.
Step 5: Do the schema mapping.
Step 6: Finally, use the SQL database table as the sink.
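If you would rather script the first step than build a Logic App, a minimal sketch with the dropbox and azure-storage-blob Python SDKs could look like the following; the access token, connection string, folder, and container names are placeholders.

```python
# Minimal sketch: copy .xlsx files from a Dropbox folder into a blob container.
import dropbox
from azure.storage.blob import BlobServiceClient

dbx = dropbox.Dropbox("<dropbox-access-token>")
blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = blob_service.get_container_client("raw-excel")

# Walk the Dropbox folder and upload every Excel file it contains.
for entry in dbx.files_list_folder("/reports").entries:
    if entry.name.lower().endswith(".xlsx"):
        _, response = dbx.files_download(entry.path_lower)
        container.upload_blob(name=entry.name, data=response.content, overwrite=True)
```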

SSIS sending source OLE DB data to S3 buckets as Parquet files

My source is SQL Server and I am using SSIS to export data to S3 buckets, but now my requirement is to send the files in Parquet format.
Can you guys give some clues on how to achieve this?
Thanks,
Ven
For folks stumbling on this answer, Apache Parquet is a project that specifies a columnar file format employed by Hadoop and other Apache projects.
Unless you find a custom component or write some .NET code to do it, you're not going to be able to export data from SQL Server to a Parquet file. KingswaySoft's SSIS Big Data Components might offer one such custom component, but I have no familiarity with it.
If you were exporting to Azure, you'd have two options:
Use the Flexible File Destination component (part of the Azure feature pack), which exports to a Parquet file hosted in Azure Blob or Data Lake Gen2 storage.
Leverage PolyBase, a SQL Server feature. It lets you export to a Parquet file via the external table feature. However, that file has to be hosted in one of the locations mentioned here, and unfortunately S3 isn't an option.
If it were me, I'd move the data to S3 as a CSV file and then use Athena to convert the CSV to Parquet. There is a nifty article here that talks through the Athena piece:
https://www.cloudforecast.io/blog/Athena-to-transform-CSV-to-Parquet/
Net-net, you'll need to spend a little money, get creative, switch to Azure, or do the conversion in AWS.
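If the "write some code" route is acceptable, a hedged sketch of doing the conversion outside SSIS with Python (pyodbc/pandas to read, pyarrow to write Parquet, boto3 to upload) might look like this; the connection string, query, bucket, and key are placeholders.

```python
# Sketch: SQL Server -> local Parquet file -> S3 upload.
import boto3
import pandas as pd
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;"
    "UID=myuser;PWD=<password>"
)

# Read the source table (chunk this query for very large tables).
df = pd.read_sql("SELECT * FROM dbo.SourceTable", conn)

# Write a Parquet file using the pyarrow engine.
df.to_parquet("source_table.parquet", engine="pyarrow", index=False)

# Upload the file to the S3 bucket.
boto3.client("s3").upload_file(
    "source_table.parquet", "my-bucket", "exports/source_table.parquet")
```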

Best approach for loading a text file (.txt) into a BigQuery table

Does anyone have any practical ideas about the best possible approach for uploading a text file to a BigQuery table? I have a few zipped text files I need to download from a remote SFTP server and load into a BigQuery table. Should I download them to Google Cloud Storage and load them from there into BigQuery for faster speed? The text files are about 5 GB each and will grow further.
Thank you.
The first thing to consider, if you are loading files from a local data source, is that there are limitations on doing so, according to the documentation:
Loading data from a local data source is subject to the following limitations:
Wildcards and comma-separated lists are not supported when you load files from a local data source. Files must be loaded individually.
When using the classic BigQuery web UI, files loaded from a local data source must be 10 MB or less and must contain fewer than 16,000 rows.
Besides that, the link provided above has instructions for uploading your data with the Console or the CLI.
Nevertheless, by using Cloud Storage you can take advantage of long-term storage: you are not charged for loading data into BigQuery, only for storing the data in Cloud Storage. You can read more about it here.
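As a minimal sketch, assuming the files have already landed in a Cloud Storage bucket, a load job with the google-cloud-bigquery client could look like this; the bucket, file, table names, and delimiter are placeholders.

```python
# Sketch: load a (gzipped) delimited text file from Cloud Storage into BigQuery.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="\t",   # assumption: adjust to whatever delimiter the .txt files use
    skip_leading_rows=1,    # assumption: the files carry a header row
    autodetect=True,
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/file1.txt.gz",
    "my-project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
```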
Finally, I would like you to consider two concepts: external and native tables in BigQuery.
Native tables: tables backed by native BigQuery storage.
External tables: tables backed by storage external to BigQuery. For more information, see Querying External Data Sources.
In other words, with native tables you import the full data into BigQuery, so data analysis tends to be faster. External tables, by contrast, do not store data in BigQuery; they reference the data from an external source.
The cost of storing data in BigQuery is higher than in Cloud Storage, while querying external tables is slower than querying native tables, especially if the files are significantly large. On the other hand, since external tables are just pointers to files, you do not have to wait for the data to load.
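For comparison, here is a sketch of defining an external table that only points at the files in Cloud Storage instead of loading them; the names are placeholders again.

```python
# Sketch: define an external table over files that stay in Cloud Storage.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/exports/*.txt.gz"]
external_config.autodetect = True

table = bigquery.Table("my-project.my_dataset.my_external_table")
table.external_data_configuration = external_config
client.create_table(table)
```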

Import a large table into an Azure SQL database

I want to transfer one table from my SQL Server instance to a newly created database on Azure. The problem is that the insert script is 60 GB.
I know that one approach is to create a backup file, load it into storage, and then run an import on Azure. The problem is that when I try this, the import on Azure fails with an error:
Could not load package.
File contains corrupted data.
File contains corrupted data.
The second problem is that with this approach I can't copy only one table; the whole database has to be in the backup file.
So is there any other way to perform such an operation? What is the best solution? And if a backup is the best option, why do I get this error?
There are tools out there that make this very easy (point and click). If it's a one-time thing, you can use virtually any tool (Red Gate, BlueSyntax...). You always have BCP as well. Most of these approaches will allow you to back up or restore a single table.
If you need something more repeatable, you should consider using a backup API or coding it yourself with the SqlBulkCopy class.
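If you would rather script it than write .NET, a rough Python counterpart to the SqlBulkCopy idea is pyodbc with fast_executemany; this is only a sketch, with a placeholder connection string, file name, and table/columns, not a tuned loader.

```python
# Sketch: stream an exported CSV into Azure SQL in batches with pyodbc.
import csv
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;UID=myuser;PWD=<password>"
)
cursor = conn.cursor()
cursor.fast_executemany = True  # send parameter batches instead of row-by-row inserts

BATCH = 10_000
with open("big_table.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    next(reader)  # assumption: the file has a header row to skip
    batch = []
    for row in reader:
        batch.append(row)
        if len(batch) == BATCH:
            cursor.executemany(
                "INSERT INTO dbo.MyTable (col1, col2, col3) VALUES (?, ?, ?)", batch)
            conn.commit()
            batch = []
    if batch:
        cursor.executemany(
            "INSERT INTO dbo.MyTable (col1, col2, col3) VALUES (?, ?, ?)", batch)
        conn.commit()
```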
I don't know that I'd ever try to execute a 60 GB script. Scripts generally do single-row inserts, which aren't very optimized. Have you explored the various bulk import/export options?
http://msdn.microsoft.com/en-us/library/ms175937.aspx/css
http://msdn.microsoft.com/en-us/library/ms188609.aspx/css
If this is a one-time load, using an IaaS VM to do the import into the Azure SQL database might be a good alternative. The data file, once exported, could be compressed/zipped and uploaded to Blob Storage. Then pull that file back out of storage onto the VM so you can operate on it.
Have you tried using BCP in the command prompt?
As explained here: Bulk Insert Azure SQL.
You basically create a text file with all your table data in it and bulk copy it into your Azure SQL database by using the BCP command at the command prompt.
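A sketch of that BCP route, driven from Python for convenience (the same two commands can be typed directly at the command prompt); the server names, credentials, terminator, and table name are placeholders.

```python
# Sketch: bcp the table out of the local SQL Server, then bcp it into Azure SQL.
import subprocess

TABLE = "dbo.MyTable"
DATA_FILE = "mytable.dat"

# 1) Export the table from the local SQL Server instance to a flat file.
#    -T uses Windows authentication, -c exports character data, -t| sets the
#    field terminator.
subprocess.run([
    "bcp", TABLE, "out", DATA_FILE,
    "-S", "localhost", "-d", "SourceDb", "-T",
    "-c", "-t|",
], check=True)

# 2) Bulk copy the flat file into the Azure SQL database, batching the inserts.
subprocess.run([
    "bcp", TABLE, "in", DATA_FILE,
    "-S", "myserver.database.windows.net", "-d", "TargetDb",
    "-U", "myuser", "-P", "<password>",
    "-c", "-t|", "-b", "100000",
], check=True)
```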

Storing files in SQL Server vs. something like Amazon S3

What are the advantages and disadvantages of storing files as a byte array in a SQL table versus using something like Amazon S3 to store them? What advantage of S3 makes it the better choice over SQL?
Pros for storing files in the database:
transactional consistency
security (assuming you need it and that your database isn't wide open anyway)
Cons for storing files in the database:
much larger database files + backups (which can be costly if you are hosting on someone else's storage)
much more difficult to debug (you can't say "SELECT doc FROM table" in Management Studio and have Word pop up)
more difficult to present the documents to users (and allow them to upload) - instead of just presenting a link to a file on the file system, you must build an app that takes the file and stores it in the database, and pulls the file from the database to present it to the user.
typically, database file storage and I/O are charged at a much higher premium than flat file storage
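To make the S3 side concrete, a small sketch of the usual pattern is below: keep the bytes in S3, store only the object key in the SQL table, and hand users a pre-signed link instead of serving bytes out of the database. The bucket, table, and connection details are placeholders.

```python
# Sketch: upload files to S3, keep only the object key in SQL, serve via pre-signed URLs.
import boto3
import pyodbc

s3 = boto3.client("s3")
conn = pyodbc.connect("DSN=mydb")  # assumption: a DSN or full connection string to your SQL database
cursor = conn.cursor()

def save_document(doc_id: int, local_path: str) -> None:
    """Upload the file to S3 and record its key next to the row in SQL."""
    key = f"documents/{doc_id}/{local_path.rsplit('/', 1)[-1]}"
    s3.upload_file(local_path, "my-doc-bucket", key)
    cursor.execute("UPDATE dbo.Documents SET s3_key = ? WHERE id = ?", key, doc_id)
    conn.commit()

def presigned_link(s3_key: str, expires: int = 3600) -> str:
    """Hand users a temporary download link instead of pulling bytes from the database."""
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "my-doc-bucket", "Key": s3_key},
        ExpiresIn=expires,
    )
```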