Load GZIP Binary Data in a column in Snowflake DB - blob

I have a CSV file with two columns: the first column is an ID and the second column is GZIP-compressed binary data. I want to load these records into a Snowflake table with two columns, id as a number data type and bin_text as a binary data type.
I tried to load the CSV file (tab as separator) with the COPY INTO command, but the GZIP-compressed binary data contains multiple newlines, which Snowflake treats as separate records.
I need to load the whole GZIP-compressed binary value, newlines and all, into a single record.
Please help.
Table structure: id as NUMBER, compress_data as BINARY
For example,
first record - 1, gzip of ("hello world. This is snowflake example. I am having some doubts so went for stackoverflow to clear the doubts. The issue is to load the bianry data into snowflake table. I have a csv file which has two columns. First Column is an id and second column is the compressed GZIP binary data. I want to load this record into the Snowflake table with having two columns id as number data type and bin_text as binary data type. Tried to load csv file(tab as seperator) with "COPY into" command but the GZIP compressed binary data has a multiple new lines which snowflake considers as seperate recod. I need to load the whole GZIP compressed binary data which has multiple new lines into a single record.").
To generate the compressed form of the text, I am using the following command:
echo "hello world. This is snowflake example. I am having some doubts so went for stackoverflow to clear the doubts. The issue is to load the bianry data into snowflake table. I have a csv file which has two columns. First Column is an id and second column is the compressed GZIP binary data. I want to load this record into the Snowflake table with having two columns id as number data type and bin_text as binary data type. Tried to load csv file(tab as seperator) with "COPY into" command but the GZIP compressed binary data has a multiple new lines which snowflake considers as seperate recod. I need to load the whole GZIP compressed binary data which has multiple new lines into a single record." | gzip -cf9 | wc -l
This command produces 4 lines of compressed output. I want to store these 4 lines as a single record.
The output file is a CSV (tab-separated) stored in an internal stage of Snowflake.
COPY command options used:
copy into compress
from (
  select
    t.$1,
    t.$2
  from <INTERNAL STAGE> t
)
file_format = ( type = csv
  field_delimiter = '\t' escape_unenclosed_field = none
  binary_format = UTF8 );
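One way to avoid the embedded newlines (offered only as a hedged sketch, not the definitive fix) is to hex-encode the gzip bytes before writing the staged CSV, so each value becomes a single line, and then load with binary_format = HEX instead of UTF8. The file name compress.csv, the id value, and the shortened sample text below are placeholders:

import csv
import gzip

sample_text = "hello world. This is a snowflake example."  # placeholder text

# Compress the text and hex-encode the bytes so the staged value is a single
# line with no embedded newlines or tabs.
compressed_hex = gzip.compress(sample_text.encode("utf-8")).hex()

with open("compress.csv", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow([1, compressed_hex])

# The COPY above would then use binary_format = HEX so Snowflake decodes the
# hex string back into the original gzip bytes in the BINARY column.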

Related

bq load command to load parquet file from GCS to BigQuery with column names starting with a number

I am loading a Parquet file into BigQuery using the bq load command; my Parquet file contains column names that start with a number (e.g. 00_abc, 01_xyz). Since BigQuery doesn't support column names starting with a number, I have created columns in BigQuery such as _00_abc and _01_xyz.
But I am unable to load the Parquet file into BigQuery using the bq load command.
Is there any way to tell bq load that the source column 00_abc (from the Parquet file) should load into the target column _00_abc (in BigQuery)?
Thanks in advance.
Regards,
Gouranga Basak
It's general best practice to not start a Parquet column name with a number. You will experience compatibility issues with more than just bq load. For example, many Parquet readers use the parquet-avro library, and Avro's documentation says:
The name portion of a fullname, record field names, and enum symbols must:
start with [A-Za-z_]
subsequently contain only [A-Za-z0-9_]
The solution here is to rename the column in the Parquet file. Depending on how much control you have over the Parquet file's creation, you may need to write a Cloud Function to rename the columns (Pandas Dataframes won't complain about your column names).
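For illustration, here is a minimal sketch of that renaming step using pandas (requires pyarrow or fastparquet; the file names input.parquet and renamed.parquet are placeholders, and inside a Cloud Function you would read from and write back to GCS instead):

import pandas as pd

# Read the Parquet file whose columns start with digits (e.g. 00_abc, 01_xyz)
df = pd.read_parquet("input.parquet")

# Prefix any column that starts with a digit with an underscore so the names
# match the BigQuery schema (_00_abc, _01_xyz)
df = df.rename(columns={c: f"_{c}" for c in df.columns if c[0].isdigit()})

# Write the renamed file back out; bq load can then ingest it as usual
df.to_parquet("renamed.parquet", index=False)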

CSV file metadata validation (comparing with existing SQL table)

I have a requirement to validate a CSV file before loading it into a staging folder; later it has to be loaded into a SQL table.
I need to validate the metadata (the structure of the file must be the same as the target SQL table):
The number of columns should be equal to that of the target SQL table
The order of columns should be the same as in the target SQL table
The data types of the columns should match (no text values should exist in a numeric field of the CSV file)
I'm looking for an easy and efficient way to achieve this.
Thanks for the help
A Python program and module that does most of what you're looking for is chkcsv.py: https://pypi.org/project/chkcsv/. It can be used to verify that a CSV file contains a specified set of columns and that the data type of each column conforms to the specification. It does not, however, verify that the order of columns in the CSV file is the same as the order in the database table. Instead of loading the CSV file directly into the target table, you can load it into a staging table and then move it from there into the target table--this two-step process eliminates column order dependence.
Disclaimer: I wrote chkcsv.py
Edit 2020-01-26: I just added an option that allows you to specify that the column order should be checked also.
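If you would rather hand-roll the check than pull in a module, a rough sketch along these lines covers the column-count, column-order, and numeric-type checks (the schema in EXPECTED is a made-up example, not your real table):

import csv

# Expected columns of the target SQL table, in order, with a Python type used
# only to test whether values parse (str columns are not type-checked).
EXPECTED = [("id", int), ("name", str), ("amount", float)]

def validate(path):
    errors = []
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        expected_names = [name for name, _ in EXPECTED]
        # Column count and order must match the target table exactly
        if header != expected_names:
            return [f"header mismatch: {header} != {expected_names}"]
        for line_no, row in enumerate(reader, start=2):
            for (name, typ), value in zip(EXPECTED, row):
                if typ is not str:
                    try:
                        typ(value)
                    except ValueError:
                        errors.append(f"line {line_no}: {name}={value!r} is not {typ.__name__}")
    return errors

print(validate("data.csv"))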

Redis - how to load CSV into a table?

Is it possible to load a CSV file directly into Redis (5.x), or does the CSV first need to be converted into JSON and then loaded programmatically?
Depending on how you want to store the data in your CSV file, you may or may not need to process it programmatically. For example, running this:
redis-cli SET foo "$(cat myfile.csv)"
will result in the contents of the file being stored under the key 'foo' as a Redis String. If you want to store each line in its own data structure under a key (perhaps a Hash with all the columns), you'll need to process it with code and populate the database accordingly.
Note: there is no need, however, to convert it to JSON.
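As a rough sketch of the per-line approach using the redis-py client (the key pattern row:<n> and the assumption that the first CSV line is a header are mine, not part of the question):

import csv
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

with open("myfile.csv", newline="") as f:
    reader = csv.DictReader(f)  # first line is treated as the column header
    for i, row in enumerate(reader, start=1):
        # Store each CSV line as a Redis Hash, one field per column
        r.hset(f"row:{i}", mapping=row)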

HiveQL Where In Clause That Points to a Set of Files

I have a set of ~100 files each with 50k IDs in them. I want to be able to make a query against Hive that has a Where In clause using the IDs from these files. I could also do this directly from Groovy, but I'm thinking the code would be cleaner if I did all of the processing from Hive instead of referencing an external Set. Is this possible?
Create an external table describing the format of your files, and set the location to the HDFS path of a directory containing the files, e.g. for tab-delimited files:
create external table my_ids(
  id bigint,
  other_col string
)
row format delimited fields terminated by "\t"
stored as textfile
location 'hdfs://mydfs/data/myids'
Now you can use Hive to access this data.
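As a hedged illustration of that last step, a LEFT SEMI JOIN against my_ids plays the role of the WHERE IN clause; here it is driven from Python via the PyHive client (the connection details and the events table name are assumptions):

from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="hive")
cursor = conn.cursor()

# LEFT SEMI JOIN is Hive's idiomatic equivalent of WHERE id IN (SELECT id ...):
# it keeps only the rows of events whose id appears in the external my_ids table.
cursor.execute("""
    SELECT e.*
    FROM events e
    LEFT SEMI JOIN my_ids m ON (e.id = m.id)
""")
for row in cursor.fetchall():
    print(row)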

load data infile MySQL: how to ignore invalid records when inserting

I have a text file and want to load it into my DB table using LOAD DATA INFILE, but this file contains invalid values, namely empty strings. How do I ignore these lines?
Edit your text file first to search-and-replace the invalid values, then load the data into MySQL.
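As an illustrative sketch of that pre-processing step (the tab delimiter, the file names, and the rule "drop any line with an empty field" are assumptions; adapt them to whatever "invalid" means for your data):

import csv

# Copy input.txt to cleaned.txt, skipping any line that has an empty field,
# so LOAD DATA INFILE only ever sees valid records.
with open("input.txt", newline="") as src, open("cleaned.txt", "w", newline="") as dst:
    reader = csv.reader(src, delimiter="\t")
    writer = csv.writer(dst, delimiter="\t")
    for row in reader:
        if all(field.strip() for field in row):
            writer.writerow(row)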