Transfer large file from Google BigQuery to Google Cloud Storage

I need to transfer a large table in BigQuery, about 2B records, to Cloud Storage in CSV format. I am doing the transfer using the console.
Because of the size of the export I need to specify a URI that includes a * so the export is sharded. I end up with 400 CSV files in Cloud Storage, and each one has a header row.
This makes combining the files time consuming, since I need to download the CSV files to another machine, strip out the header rows, combine the files, and then re-upload. FYI, the combined CSV file is about 48 GB.
Is there a better approach for this?

Using the API, you can tell BigQuery not to print the header row during the table extraction. This is done by setting the configuration.extract.printHeader option to false. See the documentation for more info. The command-line utility can do the same thing.
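For example, the bq command-line tool exposes this as the --print_header flag on bq extract. A minimal sketch (the project, dataset, table, and bucket names below are placeholders):

# export the table as sharded CSV without header rows
bq extract --destination_format=CSV --print_header=false \
  'myproject:mydataset.mytable' \
  'gs://mybucket/export/part-*.csv'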
Once you've done this, concatenating the files is much easier. On a Linux/Mac machine it would be a single cat command. However, you could also try to concatenate directly in Cloud Storage by using the compose operation. See more details here. Composition can be performed either from the API or the command-line utility.
Since a compose operation is limited to 32 components, you will have to compose the files 32 at a time. That works out to around 13 compose operations for 400 files. Note that I have never tried the composition operation, so I'm just guessing on this part.
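For example, with gsutil each compose call accepts up to 32 source objects; a rough sketch (the bucket and object names are illustrative, and the batching would need to be repeated or scripted):

# merge one batch of up to 32 shards into an intermediate object
gsutil compose gs://mybucket/export/part-000000000000.csv \
               gs://mybucket/export/part-000000000001.csv \
               gs://mybucket/export/part-000000000002.csv \
               gs://mybucket/export/intermediate-00.csv
# ...repeat for each batch of 32, then compose the intermediate
# objects into the final object the same way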

From the command line, the bq utility can also handle the headers; note that --skip_leading_rows applies when loading the files back into BigQuery (to export without headers in the first place, use --print_header=false on bq extract, as described above):
bq load --skip_leading_rows=1

Related

Load file from Cloud Storage to BigQuery to single string column

We are designing a new ingestion framework (Cloud Storage -> BigQuery) using Cloud Functions. However, we receive some files (JSON, CSV) that are corrupted and cannot be inserted as-is (bad field names, missing columns, etc.), not even as external tables. Therefore, we would like to ingest every row into one cell as a JSON string and deal with the issues when we cleanse the data in BigQuery.
Is there a way to do that natively and efficiently, with as little processing as possible (so the Cloud Function wouldn't time out)? I wrote a function that processes the files and wraps the lines one by one, but for bigger files it won't be an option. We would prefer to stay with Cloud Functions to keep this as lightweight as possible.
My option in that case is to ingest the CSV with a dummy separator, for instance # or |. I know those characters will never appear in the data, which is why I chose them.
That way, schema autodetection detects only one column and creates a table with a single string column.
If you can pick a character like that, it's the easiest solution, though of course without any guarantee (the files are corrupted, so it's hard to know in advance which characters will go unused).
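For instance, with the bq tool you can set the dummy delimiter explicitly and declare a single STRING column yourself (the dataset, table, bucket, and column names below are placeholders):

# load every line of the file into one STRING column
bq load --source_format=CSV \
  --field_delimiter='|' \
  --quote='' \
  --schema='raw:STRING' \
  mydataset.raw_events \
  'gs://mybucket/incoming/broken_file.csv'

Setting --quote to an empty string disables quote handling, so stray quote characters in the corrupted rows don't break the load.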

Creating multiple files for uploading to Snowflake

Currently, my company uses SSIS and BCP to export data from SQL Server to CSV files. However, we are only able to create a single file per SQL table (due to the limitations of BCP). Most of these files are quite large; if I am correct, they are too large to get the best performance when loading them into Snowflake. On its website, Snowflake states that we should be working with multiple gzipped files to get the best performance.
I am wondering how other people have made this work? Splitting the CSV up into multiple files and zipping them? Are there any good tools that can do this during the export from SSIS?
I'd keep the current process that exports the large .csv files using SSIS, then run 7-Zip via the command line to create a split gzip set for each text file, either within the SSIS package or via PowerShell.
The -v switch is used to specify the volume size.
https://sevenzip.osdn.jp/chm/cmdline/switches/volume.htm
You may be able to start importing/uploading the completed chunks before the later ones are finished to pick up some additional time savings, but I've not tested that.
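If scripting 7-Zip proves awkward, a rough alternative sketch with GNU coreutils (this assumes a Linux-like shell such as WSL or Git Bash is available on the export host; the file names and chunk size are placeholders, and split -C keeps whole lines together, which matters when loading):

# split the export into ~250 MB line-aligned chunks, then gzip each chunk
split -C 250M -d orders.csv orders_part_
gzip orders_part_*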

Mosaic Decisions Azure BLOB writer node creating multiple files

I'm using the Mosaic Decisions data flow feature to read a file from Azure Blob storage, do a few transformations, and write the data back to Azure. It worked fine, except that at the output file path I gave it created a folder, and I can see many files with strange "part-000" names in it. What I need is a single file in that output location, not many. Is there a way around this?
Mosaic Decisions uses Apache Spark as its back-end execution engine. In Spark, the DataFrame being read is split into multiple partitions, and these partitions are written to the output location in parallel. That's why it creates multiple files named "part-0000", "part-0001", etc. at the target location ("part" here refers to a partition).
The workaround is to check "combine-output-files-into-one" in the writer node. This will combine all of the part files into one big file. But use this with caution, and only if you really need a single file, as it comes with a performance trade-off.

Joining ADLS files created with Append and ConcurrentAppend

We have several large CSV files in Azure Data Lake Store that were created using the Append method of the .NET API. Recently, we switched over to ConcurrentAppend for performance reasons. Since ConcurrentAppend and Append cannot be used interchangeably, the switch required us to create a new folder structure for the files, to make sure that the ConcurrentAppend would never hit any files created using Append.
However, our downstream application needs to load all data, both from before and after the switch. Instead of changing our application, we wanted to join the files (using the PowerShell SDK Join-AzureRmDataLakeStoreItem cmdlet), but the documentation does not specify whether files joined this way can be written to by ConcurrentAppend after the join. I suspect that we will face issues, since we are going to join files created by both methods (maybe it's not even possible to do the join?)
So my questions are as follows:
Can ConcurrentAppend write to a file that has been joined using Join-AzureRmDataLakeStoreItem, even if one or more of the source files have been created using Append?
If not, we will use U-SQL to combine the files, but can ConcurrentAppend write to a file that has been outputted from a U-SQL job?
If not, do we have any other options than executing a local script (using the .NET API for example), which will read all files, and write a new set of files back to the lake using only ConcurrentAppend?
Cost is a concern, which is why we prefer to use the PowerShell cmdlet if possible, and would like to avoid the last option.
At present, after the join operation no append operations can be executed on the file. We are currently working on a feature to remove this limitation, but for now appends to a concatenated file will not work.

Enrich CSV with metadata from database

I've been looking around for a lightweight, scalable solution to enrich a CSV file with additional metadata from a database. Each line in the CSV represents a data item, and the columns hold the metadata belonging to that item.
Basically I have a CSV extract and I need to add additional metadata from a database. The metadata can be accessed via ODBC or REST API call.
I have a number of options in my head but I'm looking for other ideas. My options are as follows:
Import the CSV into a database table, apply the additional metadata with SQL UPDATE statements (finding the necessary metadata with SELECT statements), and then export the data back into CSV format. For this solution I was thinking of using an ETL tool, which may be a bit heavyweight for this problem; a rough sketch of this approach is included below.
I also thought about a Node.js-based solution where I read the CSV in, call a web service to get the metadata, and write the data back into the CSV file. However, the CSV can be quite large, with potentially tens of thousands of rows, so this could be heavy on memory or, in the case of line-by-line processing, not very performant.
If you have a better solution in mind, please post. Many thanks.
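As a rough sketch of the first option without a full ETL tool, the sqlite3 command-line shell can do the import, join, and export in one pass. This assumes the metadata has already been dumped from the source database into metadata.csv (sqlite3 cannot call ODBC or a REST API itself), and all file, table, and column names here are made up:

# items.csv is the CSV extract; metadata.csv was exported from the database beforehand
sqlite3 enrich.db <<'SQL'
.mode csv
.import items.csv items
.import metadata.csv metadata
.headers on
.once enriched.csv
SELECT i.*, m.category, m.owner   -- hypothetical metadata columns
FROM items AS i
LEFT JOIN metadata AS m ON m.item_id = i.item_id;
SQL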
I think you've come up with a couple of pretty good ideas here already.
Running with your first suggestion of using an ETL tool to enrich your CSV files, you should check out https://github.com/streamsets/datacollector
It's a continuous ingestion approach, so you could even monitor a directory of CSV files and load them as you get them. While there's no specific functionality yet for doing lookups in a database, it's certainly possible in a number of ways (including writing your own custom logic in Java, or a script in Python or JavaScript).
*Full disclosure: I work on this project.