Snowflake - Azure file upload - How can I partition the file if the size is more than 40MB?

I have to upload data from a Snowflake table to Azure BLOB storage using the COPY INTO command. The copy command I have works with the SINGLE = TRUE property, but I want to break the output into multiple files if the size exceeds 40MB.
For example, there is a table 'TEST' in Snowflake with 100MB of data, and I want to upload this data to Azure BLOB.
The COPY INTO command should create files in the format below:
TEST_1.csv (40MB)
TEST_2.csv (40MB)
TEST_3.csv (20MB)
-- COPY INTO command I am using
copy into @stage/test.csv from snowflake.test
file_format = (format_name = PRW_CSV_FORMAT)
header = true OVERWRITE = TRUE SINGLE = TRUE
max_file_size = 40000000

You cannot control the exact output size of file unloads, only the maximum file size; the number and size of the files are chosen for maximum performance as Snowflake parallelizes the operation. Also note that SINGLE = TRUE forces everything into one file: remove it (SINGLE defaults to FALSE) and keep MAX_FILE_SIZE so the unload is split across multiple files, although the exact sizes are not guaranteed. If you need precise control over the number/size of files, that would be a feature request; otherwise, work out a process outside of Snowflake to combine or split the files afterward. For more details about unloading, please refer to the blog.
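As a sketch, the same command with SINGLE removed so MAX_FILE_SIZE can take effect (the 40MB value is only an upper bound per file, and Snowflake appends numeric suffixes rather than producing the exact TEST_1.csv names):
copy into @stage/test from snowflake.test
file_format = (format_name = PRW_CSV_FORMAT)
header = true
OVERWRITE = TRUE
max_file_size = 40000000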

Related

How to generate a single file per partition - Snowflake COPY into location

I've managed to unload my data into partitions, but each partition is also being split into multiple files. Is there a way to force Snowflake to generate a single file per partition?
It would also be great if I could zip all the files.
This is what I got so far:
COPY INTO 'gcs_bucket'
FROM test
PARTITION BY TRUNC(number_of_rows/500000)
STORAGE_INTEGRATION = gcs_int
FILE_FORMAT = (TYPE = CSV, COMPRESSION = gzip, NULL_IF = ('NULL','null'), RECORD_DELIMITER= '\r\n', FIELD_OPTIONALLY_ENCLOSED_BY = "'")
HEADER = TRUE
PS. I'm using csv format (can't change that)
The upper size limit of each file can be changed with the MAX_FILE_SIZE copy option. The default is 16MB.
COPY INTO 'gcs_bucket'
FROM test
PARTITION BY TRUNC(number_of_rows/500000)
STORAGE_INTEGRATION = gcs_int
...
MAX_FILE_SIZE = 167772160 -- (160MB)
MAX_FILE_SIZE = num
Definition
Number (> 0) that specifies the upper size limit (in bytes) of each file to be generated in parallel per thread. Note that the actual file size and number of files unloaded are determined by the total amount of data and number of nodes available for parallel processing.
Snowflake utilizes parallel execution to optimize performance. The number of threads cannot be modified.

How to use the taildir source in Flume to append only newest lines of a .txt file?

I recently asked the question Apache Flume - send only new file contents
I am rephrasing the question in order to learn more and provide more benefit to future users of Flume.
Setup: Two servers, one with a .txt file that gets lines appended to it regularly.
Goal: Use flume TAILDIR source to append the most recently written line to a file on the other server.
Issue: Whenever the source file has a new line of data added, the current configuration appends everything in the file on server 1 to the file on server 2. This results in duplicate lines in file 2 and does not properly recreate the file from server 1.
Configuration on server 1:
#configure the agent
agent.sources=r1
agent.channels=k1
agent.sinks=c1
#using memory channel to hold up to 1000 events
agent.channels.k1.type=memory
agent.channels.k1.capacity=1000
agent.channels.k1.transactionCapacity=100
#connect source, channel,sink
agent.sources.r1.channels=k1
agent.sinks.c1.channel=k1
#define source
agent.sources.r1.type=TAILDIR
agent.sources.r1.channels=k1
agent.sources.r1.filegroups=f1
agent.sources.r1.filegroups.f1=/home/tail_test_dir/test.txt
agent.sources.r1.maxBackoffSleep=1000
#connect to another box using avro and send the data
agent.sinks.c1.type=avro
agent.sinks.c1.hostname=10.10.10.4
agent.sinks.c1.port=4545
Configuration on server 2:
#configure the agent
agent.sources=r1
agent.channels=k1
agent.sinks=c1
#using memory channel to hold up to 1000 events
agent.channels.k1.type=memory
agent.channels.k1.capacity=1000
agent.channels.k1.transactionCapacity=100
#connect source, channel, sink
agent.sources.r1.channels=k1
agent.sinks.c1.channel=k1
#here source is listening at the specified port using AVRO for data
agent.sources.r1.type=avro
agent.sources.r1.bind=0.0.0.0
agent.sources.r1.port=4545
#use file_roll and write file at specified directory
agent.sinks.c1.type=file_roll
agent.sinks.c1.sink.directory=/home/Flume_dump
You have to set a position file: point the positionFile property at a JSON file. The TAILDIR source then records how far it has read in that file and sends only newly appended lines to the sink. Note that the property must use your source name (r1 in your configuration), e.g.:
agent.sources.r1.positionFile = /var/log/flume/tail_position.json
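For context, a sketch of how this fits into the server 1 source definition from the question (the JSON path is just an example location):
#define source with a position file so only newly appended lines are shipped
agent.sources.r1.type=TAILDIR
agent.sources.r1.channels=k1
agent.sources.r1.positionFile=/var/log/flume/tail_position.json
agent.sources.r1.filegroups=f1
agent.sources.r1.filegroups.f1=/home/tail_test_dir/test.txt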

Unzip a file to s3

I am looking for a simple way to extract a zip/gzip file present in an S3 bucket to the same bucket location and delete the parent zip/gzip file after extraction.
I am unable to achieve this with any of the APIs currently.
I have tried native boto, pyfilesystem (fs), and s3fs.
The source and destination links seem to be an issue for these functions.
(Using Python 2.x/3.x & Boto 2.x)
I see there is an API for Node.js (unzip-to-s3) to do this job, but none for Python.
A couple of implementations I can think of:
1. A simple API to extract the zip file within the same bucket.
2. Use S3 as a filesystem and manipulate the data.
3. Use a data pipeline to achieve this.
4. Transfer the zip to EC2, extract it, and copy it back to S3.
Option 4 would be the least preferred, to minimise the architecture overhead of an EC2 add-on.
I need support in getting this feature implemented, with integration to Lambda at a later stage. Any pointers to these implementations are greatly appreciated.
Thanks in Advance,
Sundar.
You could try https://www.cloudzipinc.com/ that unzips/expands several different formats of archives from S3 into a destination in your bucket. I used it to unzip components of a digital catalog into S3.
I have solved this by using an EC2 instance: copy the S3 files to a local directory on EC2, extract them there, and copy the directory back to the S3 bucket.
Sample code to unzip to a local directory on the EC2 instance:
import os
import tarfile
import zipfile

# assumes s3Conn (an existing boto 2 S3 connection) and logger_s3 (a logging.Logger)
# are defined at module level

def s3Unzip(srcBucket, dst_dir):
    '''
    function to decompress the s3 bucket contents to the local machine
    Args:
        srcBucket (string): source bucket name
        dst_dir (string): destination location in the local/ec2 local file system
    Returns:
        None
    '''
    s3 = s3Conn
    bucket = s3.lookup(srcBucket)
    # make sure the destination directory exists before downloading into it
    try:
        os.mkdir(dst_dir)
        print("local directories created")
    except Exception:
        logger_s3.warning("exception in creating local directories to extract zip file / folder already exists")
    cwd = os.getcwd()
    for key in bucket:
        # download the object, then pick an archive opener based on its extension
        path = os.path.join(dst_dir, key.name)
        key.get_contents_to_filename(path)
        if path.endswith('.zip'):
            opener, mode = zipfile.ZipFile, 'r'
        elif path.endswith('.tar.gz') or path.endswith('.tgz'):
            opener, mode = tarfile.open, 'r:gz'
        elif path.endswith('.tar.bz2') or path.endswith('.tbz'):
            opener, mode = tarfile.open, 'r:bz2'
        else:
            raise ValueError('unsupported format')
        os.chdir(dst_dir)
        try:
            archive = opener(path, mode)
            try:
                archive.extractall()
            finally:
                archive.close()
            logger_s3.info('(%s) extracted successfully to %s' % (key, dst_dir))
        except Exception:
            logger_s3.error('failed to extract (%s) to %s' % (key, dst_dir))
        os.chdir(cwd)
    s3.close()
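A hypothetical call, assuming s3Conn and logger_s3 are already set up (the bucket name and path below are placeholders):
# pull every archive from the bucket and extract it on the EC2 instance
s3Unzip('my-source-bucket', '/home/ec2-user/unzip_work')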
Sample code to upload to a MySQL instance
Use the "LOAD DATA LOCAL INFILE" query to load each CSV into MySQL directly:
def upload(file_path, timeformat):
    '''
    function to upload csv file data to mysql rds
    Args:
        file_path (list): list of local csv file paths
        timeformat (string): str_to_date() format of the datetime column
    Returns:
        None
    '''
    for file in file_path:
        try:
            con = connect()          # assumes connect() returns an open MySQL connection
            cursor = con.cursor()
            qry = """LOAD DATA LOCAL INFILE '%s' INTO TABLE xxxx
                     FIELDS TERMINATED BY ',' LINES TERMINATED BY '\\n'
                     (col1, col2, col3, @datetime, col4)
                     SET datetime = str_to_date(@datetime, '%s');""" % (file, timeformat)
            cursor.execute(qry)
            con.commit()
            logger_rds.info("Loading file: " + file)
        except Exception:
            logger_rds.error("Exception in uploading " + file)
            # roll back in case there is any error
            con.rollback()
        cursor.close()
        # disconnect from the server
        con.close()
Lambda function:
You can use a Lambda function to read the zipped files into a buffer, gzip the individual files, and re-upload them to S3. You can then either archive the original files or delete them using boto.
You can also set an event-based trigger that runs the Lambda automatically every time a new zipped file arrives in S3. There is a full tutorial for exactly this here: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9
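A rough boto3 sketch of that approach (the bucket and key come from the S3 event, the "unzipped/" prefix is just an assumption, and the archive must fit in the function's memory):
import io
import gzip
import zipfile
import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # triggered by an S3 ObjectCreated event for a .zip upload
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = event["Records"][0]["s3"]["object"]["key"]
    # read the whole archive into memory
    zipped = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    with zipfile.ZipFile(io.BytesIO(zipped)) as archive:
        for name in archive.namelist():
            # gzip each member individually and upload it under an "unzipped/" prefix
            data = archive.read(name)
            s3.put_object(Bucket=bucket, Key="unzipped/" + name + ".gz",
                          Body=gzip.compress(data))
    # optionally delete the original archive afterwards
    s3.delete_object(Bucket=bucket, Key=key)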

How to fill CSV Data Set Config dynamically before running test?

I have a page that creates bulk users in my application, and I was wondering whether it's possible to take the created users and put them in my users.csv file (CSV Data Set Config element) so that I can use those users in the current test only.
The idea is to have a dynamic users.csv file for each test instead of a fixed one shared by all concurrent tests.
Yes, you can do this in the current test, but in a different Thread Group that runs consecutively.
Use a BeanShell PostProcessor in Thread Group 1 to write the created users to a CSV file, as shown below.
import org.apache.jmeter.services.FileServer;

// open the CSV in append mode and add the newly created credentials
f = new FileOutputStream("CSV file Path.csv", true);
p = new PrintStream(f);
p.println(vars.get("username") + "," + vars.get("password"));
p.close();
f.close();
Then you can use a CSV Data Set Config to read the same file and get the username and password in the next Thread Group.
If you want to use the values in the same Thread Group, you can still write them to the CSV file, but reference vars.get("username") and vars.get("password") in your test, because you cannot read a CSV file that has not been created yet with a CSV Data Set Config; see the sketch below.
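A minimal sketch of reading those values back in the same Thread Group from another BeanShell/JSR223 element:
// use the JMeter variables directly instead of the not-yet-readable CSV file
String user = vars.get("username");
String pass = vars.get("password");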

Hadoop S3 No Space Left On Device

I am running a MapReduce job that takes a small input (~3MB, a list of integers of size z),
with a sparse matrix cache of size n x m, and basically outputs z sparse vectors of dimension (n x 1). The output here is pretty big (~2TB). I am running 20 m1.small nodes on Amazon EC2 with S3 storage as input and output.
However, I am getting an IOException: No space left on device.
The Hadoop logs show S3 bytes written, but no files are created.
When I use a smaller input (smaller z), the output is correctly there after the job is done.
Thus, I believe that the job runs out of temporary storage.
Is there a way to check where this temporary storage is?
Also, the funny thing is that the log says all the bytes are written to S3, but I see no files and don't know where those bytes are being written.
Thank you for your help.
Example code (I have also tried splitting it into separate map and reduce jobs, with the same error):
public void map(LongWritable key, Text value,
                Mapper<LongWritable, Text, LongWritable, VectorWritable>.Context context)
        throws IOException, InterruptedException {
    // Assume the input is id \t number
    String[] input = value.toString().split("\t");
    int idx = Integer.parseInt(input[0]) - 1;
    // Some operations to do, but basically outputting a vector
    // Collect the output
    context.write(new LongWritable(idx), new VectorWritable(matrix.getColumn(idx)));
}
Amazon EMR supports a couple of Hadoop versions. These are the default values for 0.20.205:
hadoop.tmp.dir - /tmp/hadoop-${user.name} - A base for other temporary directories.
mapred.local.dir - ${hadoop.tmp.dir}/mapred/local - The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk i/o. Directories that do not exist are ignored.
mapred.temp.dir - ${hadoop.tmp.dir}/mapred/temp - A shared directory for temporary files.
Run du --max-depth=7 /home/xyz | sort -n on the hadoop.tmp.dir location and check which directory occupies the most space. Although hadoop.tmp.dir says temporary, it stores system and data files as well.
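If the intermediate data is overflowing the default /tmp-based location, one option is to point mapred.local.dir at volumes with more space in mapred-site.xml; this is a sketch only, and the mount points below are assumptions about where larger instance-store volumes are attached:
<property>
  <name>mapred.local.dir</name>
  <value>/mnt/mapred/local,/mnt1/mapred/local</value>
</property>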