Merge small files from S3 to create a 10 MB file - amazon-s3

I am new to MapReduce. I have an S3 bucket that receives 3000 files every minute. I am trying to use MapReduce to merge these files into files of 10-100 MB each. The Python code will use mrjob and run on AWS EMR. mrjob's documentation says that mapper_raw can be used to pass entire files to the mapper.
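from mrjob.job import MRJob

class MRCrawler(MRJob):
    def mapper_raw(self, wet_path, wet_uri):
        # wet_path is a local copy of the input file; wet_uri is its original URI
        from warcio.archiveiterator import ArchiveIterator
        with open(wet_path, 'rb') as f:
            for record in ArchiveIterator(f):
                ...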
Is there a way to limit each run to only 5000 files, and to delete those files after the reducer saves the results to S3, so that the same files are not picked up in the next run?

You can do it as follows:
1. Configure the S3 bucket to send its event notifications to an SQS queue.
2. Have a Lambda function, triggered by cron, read the events from the queue and copy the referenced files into a staging folder. You can configure this Lambda to read only 5000 messages per run.
3. Do all your processing on the staging folder, and clean the staging folder once your Spark job on EMR is done.
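The staging step above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the function name `stage_batch`, the `staging/` prefix, and the injected `sqs` and `s3` clients (for example `boto3.client("sqs")` and `boto3.client("s3")`) are assumptions for illustration. A single `receive_message` call returns at most 10 messages, so reaching 5000 takes a loop.

```python
import json
from urllib.parse import unquote_plus


def stage_batch(sqs, s3, queue_url, staging_prefix="staging/", limit=5000):
    """Copy up to about `limit` newly arrived objects into a staging prefix.

    `sqs` and `s3` are boto3-style clients passed in by the caller
    (hypothetical wiring). Returns the (bucket, key) pairs that were staged.
    """
    staged = []
    while len(staged) < limit:
        # One receive_message call returns at most 10 messages.
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=min(10, limit - len(staged)),
        )
        messages = resp.get("Messages", [])
        if not messages:
            break  # the queue is drained for now
        for msg in messages:
            body = json.loads(msg["Body"])
            # Assumes one S3 event per message; extra records may exceed `limit`.
            for rec in body.get("Records", []):
                bucket = rec["s3"]["bucket"]["name"]
                # Object keys in S3 event notifications are URL-encoded.
                key = unquote_plus(rec["s3"]["object"]["key"])
                s3.copy_object(
                    Bucket=bucket,
                    Key=staging_prefix + key.rsplit("/", 1)[-1],
                    CopySource={"Bucket": bucket, "Key": key},
                )
                staged.append((bucket, key))
            # Delete the message only after its objects were copied.
            sqs.delete_message(
                QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"]
            )
    return staged
```

Because each message is deleted once its file has been copied, the next run only sees files that arrived afterwards; the job itself then reads from and cleans up the staging prefix.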

Related

How to read large text file stored in S3 from sagemaker jupyter notebook?

I have a large (approx. 25 MB) CSV file stored in S3. It contains two columns: each cell of the first column contains a file reference, and each cell of the second column contains a large body of text (500 to 1000 words). There are several thousand rows in this CSV.
I want to read it from a SageMaker Jupyter notebook and keep it in memory as a list of strings, which I will then use in my NLP models.
I am using the following code:
def load_file(bucket, key, sep=','):
    client = boto3.client('s3')
    obj = client.get_object(Bucket=bucket, Key=key)
    data = obj['Body'].read().decode('utf-8')
    text = open(data)
    string_io = StringIO(data)
    return pd.read_csv(string_io, sep=sep)

file = load_file("bucket", 'key',sep=',')
I am getting the following error:
OSError: [Errno 36] File name too long:
25 MB is relatively small, so you shouldn't have any problem with that. There are a number of different methods you can use within a SageMaker notebook instance. Since a SageMaker notebook has an AWS execution role, it handles credentials for you automatically, which makes using the AWS CLI easy. This example copies the file to the notebook's local file system, after which you can access it locally (relative to the notebook):
!aws s3 cp s3://$bucket/$key ./
You can find other examples of ingesting data into SageMaker Notebooks in both Studio and notebook instances in this tutorial hosted on GitHub.
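As for the error itself: the `OSError` in the question comes from `text = open(data)`, which passes the decoded file contents to `open()` as if they were a path. If you would rather stay in memory than copy the file locally, something like the following may work. It is a minimal sketch using only the standard `csv` module; the function name `load_texts` and the `text_column` and `skip_header` parameters are assumptions for illustration, and `client` is expected to be an S3 client such as `boto3.client("s3")`.

```python
import csv
import io


def load_texts(client, bucket, key, text_column=1, skip_header=False):
    """Return one column of a CSV object in S3 as a list of strings.

    `client` is a boto3-style S3 client passed in by the caller.
    """
    obj = client.get_object(Bucket=bucket, Key=key)
    # Decode the bytes; do not pass the result to open(), which expects a path.
    data = obj["Body"].read().decode("utf-8")
    # newline="" lets the csv module handle line breaks inside quoted cells.
    reader = csv.reader(io.StringIO(data, newline=""))
    if skip_header:
        next(reader, None)
    # Skip rows that are too short to contain the text column.
    return [row[text_column] for row in reader if len(row) > text_column]
```

Calling `load_texts(boto3.client("s3"), "bucket", "key")` would then give the second column as a plain Python list.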

Redshift Unload command with CSV extension

I'm using the following Unload command -
unload ('select * from '')to 's3://**summary.csv**'
CREDENTIALS 'aws_access_key_id='';aws_secret_access_key=''' parallel off allowoverwrite CSV HEADER;
The file created in S3 is summary.csv000
If I change and remove the file extension from the command like below
unload ('select * from '')to 's3://**summary**'
CREDENTIALS 'aws_access_key_id='';aws_secret_access_key=''' parallel off allowoverwrite CSV HEADER;
The file created in S3 is summary000
Is there a way to get summary.csv, so I don't have to change the file extension before importing it into excel?
Thanks.
Quite a lot of people have asked a similar question. Right now it's not possible to set an extension on the unloaded files (Parquet files are the exception).
The reason is that Redshift exports in parallel by default, which is a good thing: each slice exports its own data. Also, from the docs:
PARALLEL
By default, UNLOAD writes data in parallel to multiple files,
according to the number of slices in the cluster. The default option
is ON or TRUE. If PARALLEL is OFF or FALSE, UNLOAD writes to one or
more data files serially, sorted absolutely according to the ORDER BY
clause, if one is used. The maximum size for a data file is 6.2 GB.
So, for example, if you unload 13.4 GB of data, UNLOAD creates the
following three files.
So it has to start a new file once the 6.2 GB limit is reached, which is why a numeric suffix is added to the file name.
How do we solve this?
There are no native options in Redshift, but we can work around it with Lambda.
1. Create a new S3 bucket and a folder inside it specifically for this process (e.g. s3://unloadbucket/redshift-files/).
2. Unload your files to this folder.
3. Trigger a Lambda function on the S3 put-object event.
Then, in the Lambda function:
1. Download the file (if it is large, use EFS).
2. Rename it with a .csv extension.
3. Upload it to the same bucket (or a different bucket) under a different path (e.g. s3://unloadbucket/csvfiles/).
Or, even simpler, use a shell/PowerShell script to do the following:
1. Download the file.
2. Rename it with a .csv extension.
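The Lambda variant could be sketched as below. Note the swap: instead of downloading and re-uploading, this sketch uses a server-side `copy_object`, which works for objects up to 5 GB (larger files would need a multipart copy). It assumes files unloaded with PARALLEL OFF, whose names end in a three-digit part number; the names `csv_key_for` and `rename_unloaded` and the `csvfiles/` prefix are assumptions for illustration.

```python
import re
from urllib.parse import unquote_plus


def csv_key_for(key, dest_prefix="csvfiles/"):
    """Map 'redshift-files/summary000' to 'csvfiles/summary.csv'.

    A non-zero part number is kept in the name so that multi-file
    unloads do not overwrite each other.
    """
    name = key.rsplit("/", 1)[-1]
    match = re.fullmatch(r"(.*?)(\d{3})", name)
    if match is None:
        return dest_prefix + name + ".csv"
    base, part = match.groups()
    if base.endswith(".csv"):
        base = base[: -len(".csv")]  # avoid 'summary.csv.csv'
    suffix = "" if part == "000" else "_" + part
    return dest_prefix + base + suffix + ".csv"


def rename_unloaded(s3, event):
    """Copy each newly unloaded object to a key ending in .csv.

    `s3` is a boto3-style S3 client; `event` is an S3 put notification.
    """
    new_keys = []
    for rec in event.get("Records", []):
        bucket = rec["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(rec["s3"]["object"]["key"])
        new_key = csv_key_for(key)
        s3.copy_object(
            Bucket=bucket,
            Key=new_key,
            CopySource={"Bucket": bucket, "Key": key},
        )
        new_keys.append(new_key)
    return new_keys
```

Writing to a different prefix than the one that triggers the function matters here: otherwise each copy would fire the trigger again.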
According to the AWS documentation for the UNLOAD command, it's possible to save data as CSV.
In your case, this is what your code would look like:
unload ('select * from '')
to 's3://summary/'
CREDENTIALS 'aws_access_key_id='';aws_secret_access_key='''
CSV -- <<< this option selects CSV output
parallel off
allowoverwrite
HEADER;

How to use pentaho kettle to load multiple files from s3 bucket

I want to use the S3 CSV Input step to load multiple files from an S3 bucket, then transform them and load them back into S3. But as far as I can see, this step supports only one file at a time and I need to supply the file name. Is there any way to load all files at once by supplying only the bucket name, i.e. <s3-bucket-name>/*?
S3-CSV-Input is inspired by CSV-Input and doesn't support multi-file-processing like Text-File-Input does, for example. You'll have to retrieve the filenames first, so you can loop over the filename list as you would do with CSV-Input.
Two options:
AWS CLI method
Write a simple shell script that calls the AWS CLI, put it in your path, and call it s3.sh:
aws s3 ls s3://bucket.name/path | cut -c32-
In PDI:
1. Generate Rows: Limit 1, Fields: Name: process, Type: String, Value: s3.sh
2. Execute a Process: Process field: process, Output Line Delimiter: |
3. Split Field to Rows: Field to split: Result output, Delimiter: |, New field name: filename
4. S3 CSV Input: The filename field: filename
S3 Local Sync
Mount the S3 directory to a local directory using s3fs:
$ s3fs my-bucket.example.com/path/ ~/my-s3-files -o use_path_request_style -o url=https://s3.us-west-2.amazonaws.com
Then use the standard file reading tools.
If you have many large files in that bucket directory, this won't be very fast, although it might be okay if your PDI runs on an Amazon machine.

How to get the first 100 lines of a file on S3?

I have a huge (~6 GB) file on Amazon S3 and want to get the first 100 lines of it without having to download the whole thing. Is this possible?
Here's what I'm doing now:
aws s3 cp s3://foo/bar - | head -n 100
But this takes a while to execute. I'm confused: shouldn't head close the pipe once it has read enough lines, causing aws s3 cp to crash with a BrokenPipeError before it has time to download the entire file?
Using the Range HTTP header in a GET request, you can retrieve a specific range of bytes from an object stored in Amazon S3 (see http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectGET.html).
If you use the AWS CLI, you can use aws s3api get-object --range bytes=0-xxx (see http://docs.aws.amazon.com/cli/latest/reference/s3api/get-object.html).
This selects a number of bytes rather than a number of lines, but it lets you retrieve just part of the file and so avoid downloading the full object.
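The same ranged request can be made from Python. The following is a minimal sketch, assuming a boto3-style S3 client is passed in and that the first 100 lines fit in the first `max_bytes` bytes; the function name `head_lines` and the 1 MiB default are assumptions for illustration. If fewer than `n` lines come back, retry with a larger `max_bytes`.

```python
def head_lines(client, bucket, key, n=100, max_bytes=1 << 20):
    """Return up to `n` lines taken from the first `max_bytes` bytes.

    `client` is a boto3-style S3 client. Only the requested byte range
    is transferred, not the whole object.
    """
    resp = client.get_object(
        Bucket=bucket, Key=key, Range="bytes=0-%d" % (max_bytes - 1)
    )
    chunk = resp["Body"].read()
    lines = chunk.split(b"\n")
    if len(chunk) == max_bytes:
        # The range was probably cut mid-line; drop the trailing fragment.
        lines.pop()
    elif lines and lines[-1] == b"":
        lines.pop()  # remove the empty piece after a final newline
    # errors="replace" guards against a multi-byte character cut by the range.
    return [
        line.decode("utf-8", errors="replace").rstrip("\r") for line in lines[:n]
    ]
```

For example, `head_lines(boto3.client("s3"), "foo", "bar")` would fetch at most 1 MiB of the 6 GB object.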

Apache Spark to S3 upload performance Issue

I'm seeing a major performance issue when Apache Spark uploads its results to S3. As I understand it, it goes through these steps:
1. The output of the final stage is written to a _temp/ table in HDFS, and this is then moved into a "_temporary" folder inside the target S3 folder.
2. Once the whole process is done, Apache Spark completes the saveAsTextFile stage, and the files inside the "_temporary" folder in S3 are then moved into the main folder. This is what takes a long time [approximately 1 min per file (average size: 600 MB BZ2)]. This part is not logged in the usual stderr log.
I'm using Apache Spark 1.0.1 with Hadoop 2.2 on AWS EMR.
Has anyone encountered this issue?
Update 1
How can I increase the number of threads that perform this move?
Any suggestions are highly appreciated.
Thanks
This was fixed by SPARK-3595 (https://issues.apache.org/jira/browse/SPARK-3595), which was incorporated in builds 1.1.0.e and later (see https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark).
I use the calls below to upload files to S3. They upload around 60 GB of gz files in 4-6 minutes.
ctx.hadoopConfiguration().set("mapred.textoutputformat.separator", ",");
counts.saveAsHadoopFile(s3outputpath, Text.class, Text.class, TextOutputFormat.class);
Make sure that you create more output files: a larger number of smaller files makes the upload faster.
API details
saveAsHadoopFile[F <: org.apache.hadoop.mapred.OutputFormat[_, _]](path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[F], codec: Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]): Unit
Output the RDD to any Hadoop-supported file system, compressing with the supplied codec.