Using blockchain_parser to parse blk files stored on S3 - amazon-s3

I'm trying to use alecalve's bitcoin-parser package for Python.
The problem is that my node's data is stored in an S3 bucket.
Since the parser uses os.path.expanduser for the .dat files directory and expects a local filesystem, I can't simply pass my S3 path. The example from the documentation is:
import os
from blockchain_parser.blockchain import Blockchain
blockchain = Blockchain(os.path.expanduser('~/.bitcoin/blocks'))
for block in blockchain.get_ordered_blocks(os.path.expanduser('~/.bitcoin/blocks/index'), end=1000):
    print("height=%d block=%s" % (block.height, block.hash))
And the error I'm getting is as follows:
'name' arg must be a byte string or a unicode string
Is there a way to use s3fs, or any other S3-to-filesystem method, so the parser can treat the S3 paths as directories and work as intended?
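One workaround, rather than mounting with s3fs, is to copy the block files (and the LevelDB index) down from S3 with boto3 and point the parser at the local copy. A rough sketch, where the bucket name, prefix and cache directory are placeholders, and the parser calls mirror the documentation example above:
import os
import boto3
from blockchain_parser.blockchain import Blockchain

s3 = boto3.client('s3')
bucket = 'my-node-bucket'      # placeholder: bucket holding the node's data
prefix = 'blocks/'             # placeholder: prefix containing blk*.dat and index/
local_dir = os.path.expanduser('~/blocks-cache')

# Mirror everything under the prefix (block files plus the index) locally
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get('Contents', []):
        rel = obj['Key'][len(prefix):]
        if not rel or rel.endswith('/'):
            continue  # skip "directory" placeholder keys
        dest = os.path.join(local_dir, rel)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        s3.download_file(bucket, obj['Key'], dest)

# The parser now sees an ordinary filesystem path
blockchain = Blockchain(local_dir)
for block in blockchain.get_ordered_blocks(os.path.join(local_dir, 'index'), end=1000):
    print("height=%d block=%s" % (block.height, block.hash))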

Related

Aws s3 batch operation error: Task target couldn't be URL decoded

I need to restore a lot of objects from AWS S3 Glacier Deep Archive, so I'm trying to use an S3 Batch Operations job. For that I use a Python script to create a manifest as a CSV with two columns: Bucket,Key.
But my first issue: some keys contain a comma, so the job failed.
To (partially) work around this I cut the CSV down to keep only the first two columns, hoping that not many files were affected.
But now I have another issue:
ErrorMessage: Task target couldn't be URL decoded
Any idea?
As mentioned on https://docs.aws.amazon.com/AmazonS3/latest/userguide/batch-ops-create-job.html#specify-batchjob-manifest, the manifest CSV file must be URL encoded. The , character in a key name gets converted to %2C with URL encoding, so the resulting file will be valid CSV even with commas in the key name.
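For illustration, a minimal sketch of writing a URL-encoded manifest from Python (the bucket name and keys below are made-up placeholders):
import csv
from urllib.parse import quote

bucket = 'my-bucket'                                      # placeholder
keys = ['reports/2021,Q1.csv', 'archive/other file.txt']  # placeholder keys

with open('manifest.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for key in keys:
        # quote() leaves '/' intact but percent-encodes ',' (as %2C) and spaces
        writer.writerow([bucket, quote(key)])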

How do I save csv file to AWS S3 with specified name from AWS Glue DF?

I am trying to generate a file from a DataFrame that I have created in AWS Glue, and I want to give it a specific name. Most answers on Stack Overflow use filesystem modules, but here the CSV file is generated in S3, and I want to name the file while generating it, not rename it after it is generated. Is there any way to do that?
I have tried df.save('s3://PATH/filename.csv'), which actually creates a new directory in S3 named filename.csv and then generates part-*.csv files inside that directory:
df.repartition(1).write.mode('append').format('csv').option("header", "true").save('s3://PATH')
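Spark's writer always produces a directory of part files, so a common workaround is to write a single partition to a temporary prefix and then copy that part file to the desired key with boto3. A rough sketch, assuming df is the Spark DataFrame from the question (convert a Glue DynamicFrame with .toDF() first) and that the bucket, prefixes and file name below are placeholders:
import boto3

bucket = 'my-bucket'                  # placeholder
tmp_prefix = 'tmp/csv-output/'        # temporary Spark output location (placeholder)
final_key = 'exports/filename.csv'    # desired final object name (placeholder)

# 1) Write a single part file to the temporary prefix
df.repartition(1).write.mode('overwrite').option('header', 'true') \
    .csv('s3://{}/{}'.format(bucket, tmp_prefix))

# 2) Copy the part file to the final key, then clean up the temporary output
s3 = boto3.client('s3')
listing = s3.list_objects_v2(Bucket=bucket, Prefix=tmp_prefix)
for obj in listing.get('Contents', []):
    if obj['Key'].endswith('.csv'):
        s3.copy_object(Bucket=bucket,
                       CopySource={'Bucket': bucket, 'Key': obj['Key']},
                       Key=final_key)
    s3.delete_object(Bucket=bucket, Key=obj['Key'])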

Boto3: upload file from base64 to S3

How can I directly upload a base64 encoded file to S3 with boto3?
object = s3.Object(BUCKET_NAME, email + "/" + save_name)
object.put(Body=base64.b64decode(file))
I tried to upload the base64 encoded file like this, but then the file is broken. Directly uploading the string without the base64 decoding also doesn't work.
Is there anything similar to set_contents_from_string() from boto2?
I just fixed the problem and found out that the way of uploading was correct, but the base64 string was incorrect because it still contained the prefix data:image/jpeg;base64,. Removing that prefix solved the problem.
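For illustration, a small sketch of stripping such a data-URI prefix before decoding (the bucket, key and payload below are made-up placeholders, not from the original code):
import base64
import boto3

s3 = boto3.resource('s3')
obj = s3.Object('my-bucket', 'user@example.com/photo.jpg')  # placeholder bucket/key

# A data URI looks like "data:image/jpeg;base64,<payload>"
data_uri = 'data:image/jpeg;base64,' + base64.b64encode(b'raw image bytes').decode()

# Keep only the part after the first comma, then decode and upload
b64_payload = data_uri.split(',', 1)[1]
obj.put(Body=base64.b64decode(b64_payload))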
If you read the documentation for Object.put carefully, you will see this:
response = object.put(
ACL='private'......,
Body=b'bytes'|file,
.....,
Body only accepts a file object or bytes; any other type will fail. base64.b64decode doesn't read a file object automatically, so you must read the data yourself and pass it to the decoder.
# FIX
object.put(Body=base64.b64decode(file.read()))
As a reminder, always post the stack trace.

Write in memory object to S3 via boto3

I am attempting to write files directly to S3 without creating a local file which is then uploaded.
I am using cStringIO to generate a file in memory, but I am having trouble figuring out the proper way to upload it in boto3.
def writetos3(sourcedata, filename, folderpath):
    s3 = boto3.resource('s3')
    data = open(sourcedata, 'rb')
    s3.Bucket('bucketname').put_object(Key=folderpath + "/" + filename, Body=data)
Above is the standard boto3 method I was previously using with a local file. It does not work without a local file; I get the following error: coercing to Unicode: need string or buffer, cStringIO.StringO found.
Because the in-memory file (I believe) is already considered open, I tried changing the code to the version below, but it still does not work. No error is given; the script simply hangs on the last line of the method.
def writetos3(sourcedata, filename, folderpath):
    s3 = boto3.resource('s3')
    s3.Bucket('bucketname').put_object(Key=folderpath + "/" + filename, Body=sourcedata)
Just for more info, the value I am attempting to write looks like this
<cStringIO.StringO object at 0x045DC540>
Does anyone have an idea of what I am doing wrong here?
It looks like you want this:
data = open(sourcedata, 'rb').read().decode()
It defaults to UTF-8. Also, I encourage you to run your code under Python 3, and to use appropriate language tags for your question.
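As a side note, under Python 3 the usual pattern is to build the content in an in-memory buffer and hand its bytes to put_object; a minimal sketch with placeholder bucket and key names:
import io
import boto3

s3 = boto3.resource('s3')

buffer = io.StringIO()
buffer.write("col1,col2\n1,2\n")     # build the file content in memory

# put_object expects bytes (or a binary file-like object), so encode the text
s3.Bucket('my-bucket').put_object(   # placeholder bucket name
    Key='folder/file.csv',           # placeholder key
    Body=buffer.getvalue().encode('utf-8')
)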

How to use pentaho kettle to load multiple files from s3 bucket

I want to use the S3 CSV Input step to load multiple files from an S3 bucket, then transform them and load the result back into S3. But as far as I can tell, this step supports only one file at a time and I need to supply the file names. Is there any way to load all files at once by supplying only the bucket name, i.e. <s3-bucket-name>/*?
S3-CSV-Input is inspired by CSV-Input and doesn't support multi-file-processing like Text-File-Input does, for example. You'll have to retrieve the filenames first, so you can loop over the filename list as you would do with CSV-Input.
Two options:
AWS CLI method
Write a simple shell script that calls AWS CLI. Put it in your path. Call it s3.sh
aws s3 ls s3://bucket.name/path | cut -c32-
In PDI:
Generate Rows: Limit 1, Fields: Name: process, Type: String, Value s3.sh
Execute a Process: Process field: process, Output Line Delimiter |
Split Field to Rows: Field to split: Result output. Delimiter | New field name: filename
S3 CSV Input: The filename field: filename
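If you prefer not to depend on a shell script, the same filename list could be produced with a small boto3 script and called from the Execute a Process step instead; a rough sketch, where the bucket name and prefix are placeholders mirroring the shell example:
import boto3

# Prints one object key per line, similar to what the s3.sh script produces
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='bucket.name', Prefix='path/'):  # placeholders
    for obj in page.get('Contents', []):
        print(obj['Key'])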
S3 Local Sync
Mount the S3 directory to a local directory, using s3fs
If you have many large files in that bucket directory, this won't be very fast, although it might be acceptable if your PDI instance runs on an Amazon machine.
Then use the standard file reading tools
$ s3fs my-bucket.example.com:/path ~/my-s3-files -o use_path_request_style -o url=https://s3.us-west-2.amazonaws.com