How to generate a single file per partition - Snowflake COPY into location - sql

I've managed to unload my data into partitions, but each partition is itself being split into multiple files. Is there a way to force Snowflake to generate a single file per partition?
It would also be great if I could zip all the files.
This is what I have so far:
COPY INTO 'gcs_bucket'
FROM test
PARTITION BY TRUNC(number_of_rows/500000)
STORAGE_INTEGRATION = gcs_int
FILE_FORMAT = (TYPE = CSV, COMPRESSION = gzip, NULL_IF = ('NULL','null'), RECORD_DELIMITER= '\r\n', FIELD_OPTIONALLY_ENCLOSED_BY = "'")
HEADER = TRUE
PS: I'm using the CSV format (I can't change that).

The upper size limit of each file can be changed with the MAX_FILE_SIZE option. The default is 16 MB.
COPY INTO 'gcs_bucket'
FROM test
PARTITION BY TRUNC(number_of_rows/500000)
STORAGE_INTEGRATION = gcs_int
...
MAX_FILE_SIZE = 167772160 -- (160MB)
MAX_FILE_SIZE = num
Definition
Number (> 0) that specifies the upper size limit (in bytes) of each file to be generated in parallel per thread. Note that the actual file size and number of files unloaded are determined by the total amount of data and number of nodes available for parallel processing.
Snowflake utilizes parallel execution to optimize performance. The number of threads cannot be modified.
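The byte value passed to MAX_FILE_SIZE is plain arithmetic on the size in MB. A minimal sketch of that conversion follows; the helper name max_file_size_bytes is hypothetical and is not part of Snowflake.

```python
def max_file_size_bytes(megabytes: int) -> int:
    """Convert a size in MB (1 MB = 1024 * 1024 bytes) to a MAX_FILE_SIZE value."""
    if megabytes <= 0:
        raise ValueError("MAX_FILE_SIZE must be greater than 0")
    return megabytes * 1024 * 1024


print(max_file_size_bytes(16))   # the 16 MB default
print(max_file_size_bytes(160))  # the 160 MB value used in the COPY INTO above
```

Note that this only raises the upper limit per file. As the quoted definition says, the actual number of files still depends on the amount of data and on the parallelism.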

Related

Reading and handling many small CSV-s to concatenate one large Dataframe

I have two folders, each containing about 8,000 small csv files: one with an aggregated size of around 2GB and another with an aggregated size of around 200GB.
These files are stored like this so that they are easier to update on a daily basis. However, when I conduct EDA, I would like them to be assigned to a single variable. For example:
path = "some random path"
df = pd.concat([pd.read_csv(f"{path}//{files}") for files in os.listdir(path)])
Reading the dataset with 2GB in total size takes much less time on my local machine than on the supercomputer cluster. And it is impossible to read the 200GB dataset on the local machine without some sort of scaled Pandas solution. The situation does not seem to improve on the cluster, even when using popular open-source tools like Dask and Modin.
Is there an effective way to read those csv files in this situation?
Q: "Is there an effective way to read those csv files ... ?"
A: Oh, sure, there is:
The CSV format (RFC 4180 is an attempt at a standard) is not unambiguous and is not obeyed under all circumstances (commas inside fields, header present or not), so some caution and care are needed here. Given that you are your own data curator, you should be able to decide on plausible steps for handling your own data properly.
So, the as-is state is :
# in <_folder_1_>
:::::::: # 8000 CSV-files ~ 2GB in total
||||||||||||||||||||||||||||||||||||||||||| # 8000 CSV-files ~ 200GB in total
# in <_folder_2_>
In terms of efficiency, the O/S coreutils are the most stable, proven and efficient tools (as system tools have always been) for merging the content of thousands and thousands of plain CSV files:
###################### if need be,
###################### remove the header line of every CSV file in place first :
for F in *.csv; do sed -i '1d' "$F"; done
This helps in case we cannot avoid headers on the CSV-exporter side. It works like this:
(base):~$ cat ?.csv
HEADER
1
2
3
HEADER
4
5
6
HEADER
7
8
9
(base):~$ for i in $( ls ?.csv ); do sed -i '1d' $i; done
(base):~$ cat ?.csv
1
2
3
4
5
6
7
8
9
Now, the merging phase :
###################### join
cat *.csv > ___all_CSVs_JOINED.csv
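For readers who would rather keep the source files untouched, the sed and cat steps above can be approximated in a single streaming pass in Python. This is only a sketch under stated assumptions: all files share the same header, every file ends with a newline, and the function name merge_csvs is hypothetical.

```python
import glob
import os
import shutil


def merge_csvs(folder: str, out_path: str, has_header: bool = True) -> int:
    """Concatenate every *.csv in `folder` into `out_path`, keeping one header.

    Returns the number of input files merged. Assumes all files share the
    same header and that every file ends with a newline.
    """
    out_abs = os.path.abspath(out_path)
    merged = 0
    with open(out_path, "wb") as out:
        for name in sorted(glob.glob(os.path.join(folder, "*.csv"))):
            if os.path.abspath(name) == out_abs:
                continue  # never read the output file back into itself
            with open(name, "rb") as src:
                if has_header:
                    header = src.readline()
                    if merged == 0:
                        out.write(header)  # keep the header only once
                shutil.copyfileobj(src, out)  # stream the rest in blocks
            merged += 1
    return merged
```

Unlike the sed -i approach, this leaves the input files unchanged and writes the header once at the top of the joined file.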
Given the nature of the said file-storage policy, performance can be boosted by using more processes that handle the small files and the large files independently, as defined above, with the logic put inside a pair of conversion_script_?.sh shell scripts:
parallel --jobs 2 conversion_script_{1}.sh ::: $( seq -f "%1g" 1 2 )
Removing the CSV headers is a "just"-[CONCURRENT] flow of processing, whereas the merging itself is pure-[SERIAL]. For a larger number of files it might become interesting to use a multi-staged tree of trees, using several stages of [SERIAL] collections of [CONCURRENT]-ly pre-processed leaves. Yet for just 8000 files, and not knowing the actual file-system details, the latency masking from a just-[CONCURRENT] processing of both directories independently will be fine to start with.
Last but not least, the final pair of ___all_CSVs_JOINED.csv files is safe to open in a way that prevents moving all disk-stored data into RAM at once (using a chunk-size-fused file-reading iterator, and avoiding RAM spillovers by using the mmap-ed mode, as a context manager):
with pandas.read_csv( "<_folder_1_>//___all_CSVs_JOINED.csv",
                      sep        = NoDefault.no_default,
                      delimiter  = None,
                      ...
                      chunksize  = SAFE_CHUNK_SIZE,
                      ...
                      memory_map = True,
                      ...
                      ) \
    as df_reader_MMAPer_CtxMGR:
    ...
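To make the chunked, memory-mapped reading concrete, here is a minimal sketch that aggregates one column chunk by chunk, so only one chunk is held in RAM at a time. It assumes pandas 1.2 or newer (where the chunked reader works as a context manager) and a file that has a header row with a numeric column; the function name summarize_in_chunks and its parameters are hypothetical.

```python
import pandas as pd


def summarize_in_chunks(csv_path: str, column: str, chunk_size: int = 100_000):
    """Return (row_count, column_sum) without loading the whole file at once."""
    rows = 0
    total = 0.0
    # With chunksize set, read_csv returns an iterator of DataFrames that can
    # also be used as a context manager (assumption: pandas >= 1.2).
    with pd.read_csv(csv_path, chunksize=chunk_size, memory_map=True) as reader:
        for chunk in reader:
            rows += len(chunk)
            total += chunk[column].sum()
    return rows, total
```

Any per-chunk reduction (counts, sums, filtered subsets appended to a list) fits this pattern. Operations that need all rows at once, such as a global sort, do not.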
When tweaking for ultimate performance, details matter and depend on physical hardware bottlenecks (disk-I/O-wise, filesystem-wise, RAM-I/O-wise), so due care may bring further improvements in minimising the repeatedly performed end-to-end processing times. Sometimes it even pays to turn the data into a compressed/zipped form, in cases where CPU/RAM resources offer a sufficient performance advantage over limited disk-I/O throughput: moving fewer bytes is so much faster that the CPU/RAM decompression costs are still lower than those of moving 200+ [GB] of uncompressed plain-text data.
Details matter: tweak options, benchmark, tweak options, benchmark, tweak options, benchmark.
It would be nice to post your progress on testing the performance:
end-2-end duration of strategy ... [s] AS-IS now
end-2-end duration of strategy ... [s] with parallel --jobs 2 ...
end-2-end duration of strategy ... [s] with parallel --jobs 4 ...
end-2-end duration of strategy ... [s] with parallel --jobs N ... + compression ...
Keep us posted.

Snowflake - Azure File upload - How can i partition the file if size is more than 40MB

I have to upload the data from a Snowflake table to Azure BLOB using the COPY INTO command. The copy command I have works with the SINGLE = TRUE property, but I want to break the output into multiple files if the size exceeds 40MB.
For example, there is a table 'TEST' in Snowflake with 100MB, and I want to upload this data to Azure BLOB.
The COPY INTO command should create files in the format below:
TEST_1.csv (40MB)
TEST_2.csv (40MB)
TEST_3.csv (20MB)
-- COPY INTO command I am using
copy into @stage/test.csv from snowflake.test file_format = (format_name = PRW_CSV_FORMAT) header=true OVERWRITE = TRUE SINGLE = TRUE max_file_size = 40000000
We cannot control the exact output size of unloaded files, only the maximum file size. The number and size of the files are chosen for maximum performance, because Snowflake parallelizes the operation. Controlling the exact number or size of the files would be a feature request. Otherwise, just work out a process outside of Snowflake to combine the files afterward. For more details about unloading, please refer to the blog.
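One possible shape for such an outside-of-Snowflake combine step is sketched below in Python. It assumes the unloaded parts are gzip-compressed CSV files that have already been downloaded to local disk, that each part starts with the same single-line header, and that the lexicographic order of the file names is acceptable. The function name merge_gzip_csv_parts and the file-name pattern are hypothetical.

```python
import glob
import gzip
import os


def merge_gzip_csv_parts(pattern: str, out_path: str, has_header: bool = True) -> int:
    """Merge gzip-compressed CSV parts matching `pattern` into one .csv.gz file.

    Returns the number of parts merged. The header of the first part is kept;
    the headers of all later parts are dropped.
    """
    out_abs = os.path.abspath(out_path)
    # Resolve the input list first and exclude the output file itself.
    parts = sorted(p for p in glob.glob(pattern) if os.path.abspath(p) != out_abs)
    with gzip.open(out_path, "wb") as out:
        for index, part in enumerate(parts):
            with gzip.open(part, "rb") as src:
                if has_header:
                    header = src.readline()
                    if index == 0:
                        out.write(header)
                for line in src:  # stream line by line to keep memory flat
                    out.write(line)
    return len(parts)
```

The merged file can then be uploaded to the target container in place of the individual parts.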

how to split large text files into smaller text files using vba?

I have a database text file.
It is a large text file of about 387,480 KB. The file contains table names, the headers of each table, and the values. I need to split this file into multiple files, each containing the table creation and insertion statements for one table, with the table name as the file name.
Can anyone please help me?
I don't see how Excel will open a 347MB file. You can try to load it into Access and do the split using VBA. However, importing a file that large may fragment the database enough to blow it up to Access's 2 GB size limit, and then it's all over. SQL Server would handle this kind of job. Alternatively, you could use Python or R to do the work for you.
### Python:
import pandas as pd
# Read the file in chunks of 3 rows and write each chunk to its own CSV file.
for i, chunk in enumerate(pd.read_csv('C:/your_path/main.csv', chunksize=3)):
    chunk.to_csv('chunk{}.csv'.format(i))
### R
setwd("C:/your_path/")
mydata = read.csv("annualsinglefile.csv")
# If you want 5 different chunks with the same number of lines, let's say 30
# (this assumes the data has 150 rows):
# Chunks = split(mydata, sample(rep(1:5, 30))) ## 5 chunks of 30 lines each
# If you want the first 100000 rows, index that row range directly:
First_chunk <- mydata[1:100000, ] ## this contains the first 100000 rows
# Or you can select any range of rows within the number of rows:
# Second_chunk <- mydata[100:70, ] ## rows 100 down to 70, in reverse order
# If you want to write these chunks out to a csv file:
write.csv(First_chunk, file="First_chunk.csv", quote=F, row.names=F)
# write.csv(Second_chunk, file="Second_chunk.csv", quote=F, row.names=F)
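Since the question asks for one output file per table rather than fixed-size chunks, a line-by-line split is sketched below in Python. The dump format is not shown in the question, so this assumes (purely for illustration) that each table's section begins with a line starting with CREATE TABLE followed by the table name. The pattern TABLE_START and the function name split_dump_by_table are hypothetical and must be adapted to the real file.

```python
import os
import re

# Assumed marker: each table's section starts with a line such as
# "CREATE TABLE customers (". Adjust the pattern to the real dump format.
TABLE_START = re.compile(r"^\s*CREATE\s+TABLE\s+\W?(\w+)", re.IGNORECASE)


def split_dump_by_table(dump_path: str, out_dir: str) -> list:
    """Stream `dump_path` and write one <table>.sql file per table section."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    out = None
    try:
        with open(dump_path, "r", encoding="utf-8", errors="replace") as dump:
            for line in dump:
                match = TABLE_START.match(line)
                if match:
                    if out is not None:
                        out.close()
                    name = match.group(1)
                    target = os.path.join(out_dir, name + ".sql")
                    out = open(target, "w", encoding="utf-8")
                    written.append(name)
                if out is not None:  # lines before the first table are skipped
                    out.write(line)
    finally:
        if out is not None:
            out.close()
    return written
```

Because the file is read one line at a time, memory use stays small regardless of the size of the dump. If the same table name occurs in more than one section, the later section overwrites the earlier file in this sketch.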

Create a 350000 column csv file by merging smaller csv files

I have about 350000 one-column csv files, which are essentially 200 - 2000 numbers printed one under another. The numbers are formatted like this: "-1.32%" (no quotes). I want to merge the files to create a monster of a csv file where each file is a separate column. The merged file will have 2000 rows maximum (each column may have a different length) and 350000 columns.
I thought of doing it with MySQL, but there is a 30000 column limit. An awk or sed script could do the job, but I don't know them all that well and I am afraid it will take a very long time. I could use a server if the solution requires it. Any suggestions?
This python script will do what you want:
#!/usr/bin/env python2
import sys
import codecs

# Open every file named on the command line.
fhs = [codecs.open(filename, 'r', 'utf-8') for filename in sys.argv[1:]]

while True:
    # Read one line from each file; an exhausted file yields ''.
    lines = [fh.readline() for fh in fhs]
    if not any(lines):
        break  # every file is exhausted
    # Shorter files contribute an empty cell once they run out of lines.
    sys.stdout.write(','.join(line.rstrip() for line in lines))
    sys.stdout.write('\n')

for fh in fhs:
    fh.close()
Call it with all the CSV files you want to merge and it will print a new file to stdout.
Note that you can't merge all files at once: for one, you can't pass 350,000 file names as arguments to a single process, and secondly, a process can typically only open 1024 files at once (a common default limit).
So you'll have to do it in several passes. I.e. merge files 1-1000, then 1001-2000, etc. Then you should be able to merge the 350 resulting intermediate files at once.
Or you could write a wrapper script which uses os.listdir() to get the names of all files and calls this script several times.
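The multi-pass idea can be sketched as a single Python 3 wrapper. This is only an illustration under stated assumptions: the function names merge_columns and merge_in_batches are hypothetical, and the batch size of 1000 mirrors the numbers above. One detail matters in the second pass: an intermediate file that runs out of rows has to be padded with as many empty cells as it has columns, otherwise later columns shift to the left.

```python
import os
import tempfile


def merge_columns(in_paths, out_path, widths=None):
    """Write the lines of each input file side by side, comma-separated.

    widths[i] is the number of columns in in_paths[i] (1 for the raw files).
    It controls how many empty cells an exhausted file is padded with.
    """
    widths = widths or [1] * len(in_paths)
    handles = [open(p, "r", encoding="utf-8") for p in in_paths]
    try:
        with open(out_path, "w", encoding="utf-8") as out:
            while True:
                lines = [fh.readline() for fh in handles]
                if not any(lines):
                    break  # every input is exhausted
                cells = [
                    line.rstrip("\r\n") if line else "," * (width - 1)
                    for line, width in zip(lines, widths)
                ]
                out.write(",".join(cells) + "\n")
    finally:
        for fh in handles:
            fh.close()


def merge_in_batches(in_paths, out_path, batch_size=1000):
    """Two-pass merge so that no pass opens more than about batch_size files."""
    with tempfile.TemporaryDirectory() as tmp:
        partials, widths = [], []
        for start in range(0, len(in_paths), batch_size):
            batch = in_paths[start:start + batch_size]
            partial = os.path.join(tmp, "part_%06d.csv" % len(partials))
            merge_columns(batch, partial)
            partials.append(partial)
            widths.append(len(batch))
        merge_columns(partials, out_path, widths)
```

A call could look like merge_in_batches(sorted(os.path.join(folder, n) for n in os.listdir(folder)), "merged.csv"), where folder is whatever directory holds the one-column files.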

Hadoop S3 No Space Left On Device

I am running a MapReduce job that takes a small input (~3MB, a list of integers of size z) with a sparse matrix cache of size n x m, and basically outputs z sparse vectors of dimension (n x 1). The output here is pretty big (~2TB). I am running 20 m1.small nodes on Amazon EC2 with S3 storage for inputs and output.
However, I am getting an IOException: No space left on device.
It seems like S3 bytes are written according to the Hadoop logs, but no files are created.
When I use a smaller input (smaller z), the output is correctly there after the job is done.
Thus, I believe that it runs out of temporary storage.
Is there a way to check where this temporary storage is?
Also, the funny thing is that the log says all the bytes are written to S3, but I see no files and don't know where these bytes are being written.
Thank you for your help.
Example code (I have also tried splitting it into separate map and reduce jobs, with the same error):
public void map(LongWritable key, Text value,
                Mapper<LongWritable, Text, LongWritable, VectorWritable>.Context context)
        throws IOException, InterruptedException
{
    // Assume the input is id \t number
    String[] input = value.toString().split("\t");
    int idx = Integer.parseInt(input[0]) - 1;
    // Some operations to do, but basically outputting a vector
    // Collect the output
    context.write(new LongWritable(idx), new VectorWritable(matrix.getColumn(idx)));
}
Amazon EMR supports a couple of Hadoop versions. These are the default values for 0.20.205:
hadoop.tmp.dir - /tmp/hadoop-${user.name} - A base for other temporary directories.
mapred.local.dir - ${hadoop.tmp.dir}/mapred/local - The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk i/o. Directories that do not exist are ignored.
mapred.temp.dir - ${hadoop.tmp.dir}/mapred/temp - A shared directory for temporary files.
Run the du --max-depth=7 /home/xyz | sort -n command against the hadoop.tmp.dir directory and check which subdirectory is occupying the most space. Although the name hadoop.tmp.dir says temporary, it also stores system and data files.
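Where du is not convenient, a rough Python equivalent of the du | sort check can be sketched as follows. It sums apparent file sizes rather than allocated disk blocks, so the numbers can differ somewhat from what du reports; the function name directory_sizes is hypothetical.

```python
import os


def directory_sizes(root: str) -> list:
    """Return (bytes, path) for each directory under `root`, largest first.

    Sizes are cumulative: a directory's total includes its subdirectories.
    """
    totals = {}
    # Walk bottom-up so that child totals exist before the parent is summed.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        size = 0
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                try:
                    size += os.path.getsize(path)
                except OSError:
                    pass  # file vanished or is unreadable; skip it
        for name in dirnames:
            size += totals.get(os.path.join(dirpath, name), 0)
        totals[dirpath] = size
    return sorted(((size, path) for path, size in totals.items()), reverse=True)
```

Printing the first few entries of the result for the directory configured as hadoop.tmp.dir on a node shows which subdirectory is filling the disk.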