How to get the first 100 lines of a file on S3?

I have a huge (~6 GB) file on Amazon S3 and want to get the first 100 lines of it without having to download the whole thing. Is this possible?
Here's what I'm doing now:
aws s3 cp s3://foo/bar - | head -n 100
But this takes a while to execute. I'm confused -- shouldn't head close the pipe once it has read enough lines, causing aws s3 cp to crash with a BrokenPipeError before it has time to download the entire file?

Using the Range HTTP header in a GET request, you can retrieve a specific range of bytes from an object stored in Amazon S3 (see http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectGET.html).
If you use the AWS CLI, you can use aws s3api get-object --range bytes=0-xxx; see http://docs.aws.amazon.com/cli/latest/reference/s3api/get-object.html
It is not exactly a number of lines, but it should let you retrieve the file in parts and avoid downloading the full object.
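For example, here is a minimal boto3 sketch of the same idea (the bucket and key come from the question; the 1 MB chunk size is an arbitrary choice). It keeps requesting byte ranges until it has 100 complete lines, so only a small slice of the 6 GB object is downloaded:

import boto3

s3 = boto3.client("s3")
bucket, key = "foo", "bar"  # from the question's s3://foo/bar

lines = []
buffer = b""
offset, chunk_size = 0, 1024 * 1024  # fetch 1 MB per request

while len(lines) < 101:  # 101 splits means the first 100 lines are complete
    # Range is inclusive on both ends; S3 returns at most what the object contains
    resp = s3.get_object(
        Bucket=bucket, Key=key,
        Range="bytes={}-{}".format(offset, offset + chunk_size - 1),
    )
    buffer += resp["Body"].read()
    lines = buffer.split(b"\n")
    total_size = int(resp["ContentRange"].split("/")[-1])
    if offset + chunk_size >= total_size:
        break  # reached the end of the object
    offset += chunk_size

for line in lines[:100]:
    print(line.decode("utf-8", errors="replace"))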

Related

Merge small files from S3 to create a 10 Mb file

I am new to MapReduce. I have an S3 bucket that gets 3,000 files every minute. I am trying to use MapReduce to merge these files into files of between 10 and 100 MB. The Python code will use mrjob and will run on AWS EMR. mrjob's documentation says mapper_raw can be used to pass entire files to the mapper.
from mrjob.job import MRJob

class MRCrawler(MRJob):
    def mapper_raw(self, wet_path, wet_uri):
        from warcio.archiveiterator import ArchiveIterator
        with open(wet_path, 'rb') as f:
            for record in ArchiveIterator(f):
                ...
Is there a way to limit it to reading only 5,000 files in one run, and to delete those files after the reducer saves the results to S3, so that the same files are not picked up in the next run?
You can do as follows:
Configure S3 event notifications on the bucket so that new objects are published to an SQS queue.
Have a Lambda that is triggered on a cron schedule, reads the events from SQS, and copies the relevant files into a staging folder; you can configure this Lambda to read only 5,000 messages at a time (see the sketch after this list).
Do all your processing on top of the staging folder and, once your Spark job on EMR is done, clean out the staging folder.
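A rough sketch of that Lambda, assuming the standard S3 event notification format in SQS (the queue URL, staging bucket, and prefix are made-up placeholders):

import json
from urllib.parse import unquote_plus

import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/incoming-files"  # placeholder
STAGING_BUCKET = "my-staging-bucket"  # placeholder
STAGING_PREFIX = "staging/"
MAX_FILES = 5000


def handler(event, context):
    copied = 0
    while copied < MAX_FILES:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10,
                                   WaitTimeSeconds=1)
        messages = resp.get("Messages", [])
        if not messages:
            break  # queue is drained for now
        for msg in messages:
            body = json.loads(msg["Body"])
            for rec in body.get("Records", []):  # S3 event notification structure
                src_bucket = rec["s3"]["bucket"]["name"]
                src_key = unquote_plus(rec["s3"]["object"]["key"])  # keys arrive URL-encoded
                s3.copy_object(Bucket=STAGING_BUCKET,
                               Key=STAGING_PREFIX + src_key,
                               CopySource={"Bucket": src_bucket, "Key": src_key})
                copied += 1
            # delete the message so the same file is not picked up in the next run
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
    return {"copied": copied}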

Make all files private using aws-shell for S3

I want to change all the files in a bucket to private, and I'm wondering how to do it with aws-shell. I think the mv command might be useful for this, but I can't figure out how to use it, because this is the first time I've used aws-shell.
Edit 1
I tried using s3 mv s3://bucket s3://temporary --recursive --acl private, but I needed to create another temporary bucket to make the swap because of this error:
Cannot mv a file onto itself [...]
Is there a way to do this without creating a temporary bucket? Duplicating the files could incur charges for the extra requests and for the storage used by the copies.
You can copy the files onto themselves and change the Access Control List.
Test it out, but it would be something like:
aws s3 cp s3://bucket s3://bucket --recursive --acl private
Keep the source and destination the same.
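If you'd rather not copy each object onto itself, here is a boto3 sketch that updates the ACL in place (the bucket name is a placeholder); put_object_acl changes permissions without rewriting the object data:

import boto3

s3 = boto3.client("s3")
bucket = "my-bucket"  # placeholder

# walk every object in the bucket and set its ACL to private
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        s3.put_object_acl(Bucket=bucket, Key=obj["Key"], ACL="private")
        print("made private:", obj["Key"])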

How to use pentaho kettle to load multiple files from s3 bucket

I want to use the S3 CSV Input step to load multiple files from an S3 bucket, then transform them and load them back into S3. But as far as I can see, this step supports only one file at a time and I need to supply the file names. Is there any way to load all files at once by supplying only the bucket name, i.e. <s3-bucket-name>/*?
S3-CSV-Input is inspired by CSV-Input and doesn't support multi-file-processing like Text-File-Input does, for example. You'll have to retrieve the filenames first, so you can loop over the filename list as you would do with CSV-Input.
Two options:
AWS CLI method
Write a simple shell script that calls the AWS CLI. Put it in your path and call it s3.sh (a boto3 alternative that produces the same filename list is sketched after these steps):
aws s3 ls s3://bucket.name/path | cut -c32-
In PDI:
Generate Rows: Limit 1, Fields: Name: process, Type: String, Value s3.sh
Execute a Process: Process field: process, Output Line Delimiter |
Split Field to Rows: Field to split: Result output. Delimiter | New field name: filename
S3 CSV Input: The filename field: filename
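As an alternative to parsing the ls output with cut, here is a boto3 sketch that prints the same one-filename-per-line list (the bucket name and prefix are placeholders); its output can feed the same Split Field to Rows / S3 CSV Input steps:

import boto3

s3 = boto3.client("s3")

# print one object key per line, like the s3.sh script above
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="bucket.name", Prefix="path/"):
    for obj in page.get("Contents", []):
        print(obj["Key"])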
S3 Local Sync
Mount the S3 directory to a local directory, using s3fs
If you have many large files in that bucket directory, it won't be very fast... though it might be okay if your PDI instance runs on an Amazon machine
Then use the standard file reading tools
$ s3fs my-bucket.example.com/path/ ~/my-s3-files -o use_path_request_style -o url=https://s3.us-west-2.amazonaws.com

Apache Spark to S3 upload performance Issue

I'm seeing a major performance issue when Apache Spark uploads its results to S3. As per my understanding, it goes through these steps...
The output of the final stage is written to a _temp/ table in HDFS, and the same is moved into a "_temporary" folder inside the specific S3 folder.
Once the whole process is done, Apache Spark completes the saveAsTextFile stage, and then the files inside the "_temporary" folder in S3 are moved into the main folder. This is taking a long time [approximately 1 minute per file (average size: 600 MB BZ2)]. This part is not logged in the usual stderr log.
I'm using Apache Spark 1.0.1 with Hadoop 2.2 on AWS EMR.
Has anyone encountered this issue?
Update 1
How can I increase the number of threads that does this move process ?
Any suggestion is highly appreciated...
Thanks
This was fixed with SPARK-3595 (https://issues.apache.org/jira/browse/SPARK-3595), which was incorporated in builds 1.1.0.e and later (see https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark).
I use the functions below; they upload the files to S3 and move around 60 GB of gz files in 4-6 minutes.
// set the key/value separator used by TextOutputFormat
ctx.hadoopConfiguration().set("mapred.textoutputformat.separator", ",");

// write the RDD straight to the S3 output path
counts.saveAsHadoopFile(s3outputpath, Text.class, Text.class, TextOutputFormat.class);
Make sure that you create more output files; a larger number of smaller files will make the upload faster.
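As an illustration, a PySpark sketch of the same idea (the input path, output path, and partition count are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="s3-upload-sketch")
counts = sc.textFile("hdfs:///path/to/input")  # placeholder for the RDD being saved

# repartitioning before the save produces many smaller part files, which are
# written and moved to S3 in parallel instead of a few huge files one by one
counts.repartition(200).saveAsTextFile("s3n://my-bucket/output/")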
API details
saveAsHadoopFile[F <: org.apache.hadoop.mapred.OutputFormat[_, _]](path: String, keyClass: Class[_], valueClass: Class[_], outputFormatClass: Class[F], codec: Class[_ <: org.apache.hadoop.io.compress.CompressionCodec]): Unit
Output the RDD to any Hadoop-supported file system, compressing with the supplied codec.

s3.exe: S3 PUT not working when bucket contains hyphens

I'm trying to use s3.exe, a Windows CLI for S3 from s3.codeplex.com, to PUT an object.
Here is the command I'm running:
c:\>s3 put My-Bucket file.txt /key:MYKEY /secret:MYSECRET
It returns: <403> Forbidden.
But when I try to PUT the file into a bucket without a hyphen, it works.
c:\>s3 put MyNoHyphenBucket file.txt /key:MYKEY /secret:MYSECRET
Can someone else try it and see if they have the same issue? Any help on how to get it working with hyphenated bucket names would be greatly appreciated.
I'd be open to trying alternative s3 CLI for Windows.
Are you using an EU or NA bucket?
I found this:
"European Bucket allows only lower case letters. Although Buckets created in the US may contain lower case and upper case both, Amazon recommends that you use all lower case letters when creating a bucket."
Apparently whatever's behind that also impacts hyphens.
With an EU bucket, I get the same behaviour (403) as yourself. Repeat experiment with an NA bucket, and it succeeds.
I saw this error on non-US buckets.
So I created a US bucket (selecting region US Standard when creating it) and everything works fine!
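Since the question mentions being open to alternatives: not a CLI, but a short boto3 script (the bucket name, file, and region are placeholders) can do the same PUT; the SDK resolves the bucket's regional endpoint itself, which should avoid the region-related 403 described above.

import boto3

# set region_name to the bucket's actual region; credentials come from the usual
# environment/config sources rather than /key: and /secret: arguments
s3 = boto3.client("s3", region_name="eu-west-1")
s3.upload_file("file.txt", "my-bucket-with-hyphens", "file.txt")
print("uploaded")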