Julia load dataframe from s3 csv file - amazon-s3

I'm having trouble finding an example to follow online for this simple use-case:
Load a CSV file from an s3 object location to julia DataFrame.
Here is what I tried that didn't work:
using AWSS3, DataFrames, CSV
filepath = S3Path("s3://muh-bucket/path/data.csv")
CSV.File(filepath) |> DataFrames # fails
# but I am able to stat the file
stat(filepath)
#=
Status( mode = -rw-rw-rw-,
...etc
size = 2141032 (2.0M),
blksize = 4096 (4.0K),
blocks = 523,
mtime = 2021-09-01T23:55:26,
...etc
=#
I can also read the file to a string object locally:
data_as_string = String(AWSS3.read(filepath);
#"column_1\tcolumn_2\tcolumn_3\t...etc..."
My AWS config is in order, I can access the object from julia locally.
How to I get this into a dataframe?

Thanks to help from the nice people on julia slack channel (#data).
bytes = AWSS3.read(S3Path("s3://muh-bucket/path/data.csv"))
typeof(bytes)
# Vector{UInt8} (alias for Array{UInt8, 1})
df = CSV.read(bytes, DataFrame)
Bingo, I'm in business. The CSV.jl maintainer mentions that S3Path types used to work when passed to CSV.read, so perhaps this will be even simpler in the future.
Helpful SO post for getting AWS configs in order

Related

Skip csv header row using boto3 in lambda and copy_from in psycopg2

I'm loading a csv into memory from s3 and then I need to insert it into postgres. I think the problem is I'm not using the right call for the s3 object or something as I don't appear to be able to skip the header line. On my local machine I would just load the file from the directory:
cur = DBCONN.cursor()
for filename in absolute_file_paths('/path/to/file/csv.log'):
print('Importing: ' + filename)
with open(filename, 'r') as log:
next(log) # Skip the header row.
cur.copy_from(log, 'vesta', sep='\t')
DBCONN.commit()
I have the below in lambda which I would like to work kind of like above, but it's different with s3. What is the correct way to have the below work like above? Or perhaps - what IS the correct way to do this?
s3 = boto3.client('s3')
#Load the file from s3 into memory
obj = s3.get_object(Bucket=bucket, Key=key)
contents = obj['Body']
next(contents, None) # Skip the header row - this does not seem to work
cur = DBCONN.cursor()
cur.copy_from(contents, 'my_table', sep='\t')
DBCONN.commit()
Seemingly, my problem had something to do with an incredibly wide csv file (I have over 200 columns) and somehow that messed up the next() function to not give the next row. SO! I will say that IF your file is not seemingly that wide, then the code I placed in the question should work. Below however is how I got it work, basically by just reading the file into memory and then writing that back to an in memory file after skipping the header row. This honestly seems a little like overkill so I'd be happy if someone could provide something more efficient but seeing as how I spend the last eight hours on this, I'm just happy to have SOMETHING that works.
s3 = boto3.client('s3')
...
def remove_header(contents):
# Reformat the file, removing the header row
data = csv.reader(io.StringIO(contents), delimiter='\t') #read data in
mem_file = io.StringIO() #create in memory file object
next(data) #skip header row
writer = csv.writer(mem_file, delimiter='\t') #set up the csv writer
writer.writerows(data) #write the data in memory to the in mem file
mem_file.getvalue() # Get the string from the buffer
mem_file.seek(0) # Go back to the beginning of the memory stream
return mem_file
...
#Load the file from s3 into memory
obj = s3.get_object(Bucket=bucket, Key=key)
contents = obj['Body'].read().decode('utf-8')
mem_file = remove_header(contents)
#Insert into postgres
try:
cur = DBCONN.cursor()
cur.copy_from(mem_file, 'my_table', sep='\t')
DBCONN.commit()
except BaseException as e:
DBCONN.rollback()
raise e
or if you want to do it with pandas
def remove_header_pandas(contents):
df = pd.read_csv(io.StringIO(contents), sep='\t')
mem_file = io.StringIO()
df.to_csv(mem_file, header=False, index=False) #remove header
mem_file.getvalue()
mem_file.seek(0)
return mem_file

Using pandas to open Excel files stored in GCS from command line

The following code snippet is from a Google tutorial, it simply prints the names of files on GCP in a given bucket:
from google.cloud import storage
def list_blobs(bucket_name):
"""Lists all the blobs in the bucket."""
# bucket_name = "your-bucket-name"
storage_client = storage.Client()
# Note: Client.list_blobs requires at least package version 1.17.0.
blobs = storage_client.list_blobs(bucket_name)
for blob in blobs:
print(blob.name)
list_blobs('sn_project_data')
No from the command line I can run:
$ python path/file.py
And in my terminal the files in said bucket are printed out. Great, it works!
However, this isn't quite my goal. I'm looking to open a file and act upon it. For example:
df = pd.read_excel(filename)
print(df.iloc[0])
However, when I pass the path to the above, the error returned reads "invalid file path." So I'm sure there is some sort of GCP specific function call to actually access these files...
What command(s) should I run?
Edit: This video https://www.youtube.com/watch?v=ED5vHa3fE1Q shows a trick to open files and needs to use StringIO in the process. But it doesn't support excel files, so it's not an effective solution.
read_excel() does not support google cloud storage file path as of now but it can read data in bytes.
pandas.read_excel(io, sheet_name=0, header=0, names=None,
index_col=None, usecols=None, squeeze=False, dtype=None, engine=None,
converters=None, true_values=None, false_values=None, skiprows=None,
nrows=None, na_values=None, keep_default_na=True, na_filter=True,
verbose=False, parse_dates=False, date_parser=None, thousands=None,
comment=None, skipfooter=0, convert_float=True, mangle_dupe_cols=True,
storage_options=None)
Parameters: io : str, bytes, ExcelFile, xlrd.Book, path object, or
file-like object
What you can do is use the blob object and use download_as_bytes() to convert the object into bytes.
Download the contents of this blob as a bytes object.
For this example I just used a random sample xlsx file and read the 1st sheet:
from google.cloud import storage
import pandas as pd
bucket_name = "your-bucket-name"
blob_name = "SampleData.xlsx"
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(blob_name)
data_bytes = blob.download_as_bytes()
df = pd.read_excel(data_bytes)
print(df)
Test done:

reading paritionned dataset in aws s3 with pyarrow doesn't add partition columns

i'm trying to read a partitionned dataset in aws s3, it looks like :
MyDirectory--code=1--file.parquet
--code=2--another.parquet
--code=3--another.parquet
i created a file_list containing the path to all the files in the directory then executed
df = pq.ParquetDataset(file_list, filesystem=fs).read().to_pandas()
everything works except that the partition column code doesn't exist in the dataframe df.
i tried it also using one path to MyDirectory insted of file_list, but found an error
"Found files in an intermediate directory: s3://bucket/Mydirectoty", i can't find any answer online.
Thank you!
AWS has a project (AWS Data Wrangler) that helps with the integration between Pandas/PyArrow and their services.
This snippet should work:
import awswrangler as wr
# Write
wr.s3.to_parquet(
df=df,
path="s3://...",
mode="overwrite",
dataset=True,
database="my_databse", # Optional, only if you want it available on Athena/Glue Catalog
table="my_table",
partition_cols=["PARTITION_COL_NAME"])
# READ
df = wr.s3.read_parquet(path="s3://...", dataset=True)
If you're happy with other tools you can give dask a try. Assume all the data you want to read is in s3://folder you can just use
import dask.dataframe as dd
storage_options = {
'key': your_key,
'secret': your_secret}
df = dd.read_parquet("s3://folder",
storage_options=storage_options)

How to load lists in pelilcanconf.py from external file

There are different lists available in pelicanconf.py such as
SOCIAL = (('Facebook','www.facebook.com'),)
LINKS =
etc.
I want to manage these content and create my own lists by loading these values from an external file which can be edited independently. I tried importing data as a text file using python but it doesn't work. Is there any other way?
What exactly did not work? Can you provide code?
You can execute arbitrary python code in your pelicanconf.py.
Example for a very simple CSV reader:
# in pelicanconf.py
def fn_to_list(fn):
with open(fn, 'r') as res:
return tuple(map(lambda line: tuple(line[:-1].split(';')), res.readlines()))
print(fn_to_list("data"))
CSV file data:
A;1
B;2
C;3
D;4
E;5
F;6
Together, this yields the following when running pelican:
# ...
((u'A', u'1'), (u'B', u'2'), (u'C', u'3'), (u'D', u'4'), (u'E', u'5'), (u'F', u'6'))
# ...
Instead of printing you can also assign this list to a variable, say LINKS.

Using the BigQuery Connector with Spark

I'm not getting the Google example work
https://cloud.google.com/hadoop/examples/bigquery-connector-spark-example
PySpark
There are a few mistakes in the code i think, like:
'# Output Parameters
'mapred.bq.project.id': '',
Should be: 'mapred.bq.output.project.id': '',
and
'# Write data back into new BigQuery table.
'# BigQueryOutputFormat discards keys, so set key to None.
(word_counts
.map(lambda pair: None, json.dumps(pair))
.saveAsNewAPIHadoopDataset(conf))
will give an error message. If I change it to:
(word_counts
.map(lambda pair: (None, json.dumps(pair)))
.saveAsNewAPIHadoopDataset(conf))
I get the error message:
org.apache.hadoop.io.Text cannot be cast to com.google.gson.JsonObject
And whatever I try I can not make this work.
There is a dataset created in BigQuery with the name I gave it in the 'conf' with a trailing '_hadoop_temporary_job_201512081419_0008'
And a table is created with '_attempt_201512081419_0008_r_000000_0' on the end. But are always empty
Can anybody help me with this?
Thanks
We are working to update the documentation because, as you noted, the docs are incorrect in this case. Sorry about that! While we're working to update the docs, I wanted to get you a reply ASAP.
Casting problem
The most important problem you mention is the casting issue. Unfortunately,PySpark cannot use the BigQueryOutputFormat to create Java GSON objects. The solution (workaround) is to save the output data into Google Cloud Storage (GCS) and then load it manually with the bq command.
Code example
Here is a code sample which exports to GCS and loads the data into BigQuery. You could also use subprocess and Python to execute the bq command programatically.
#!/usr/bin/python
"""BigQuery I/O PySpark example."""
import json
import pprint
import pyspark
sc = pyspark.SparkContext()
# Use the Google Cloud Storage bucket for temporary BigQuery export data used
# by the InputFormat. This assumes the Google Cloud Storage connector for
# Hadoop is configured.
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory ='gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)
conf = {
# Input Parameters
'mapred.bq.project.id': project,
'mapred.bq.gcs.bucket': bucket,
'mapred.bq.temp.gcs.path': input_directory,
'mapred.bq.input.project.id': 'publicdata',
'mapred.bq.input.dataset.id': 'samples',
'mapred.bq.input.table.id': 'shakespeare',
}
# Load data in from BigQuery.
table_data = sc.newAPIHadoopRDD(
'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
'org.apache.hadoop.io.LongWritable',
'com.google.gson.JsonObject',
conf=conf)
# Perform word count.
word_counts = (
table_data
.map(lambda (_, record): json.loads(record))
.map(lambda x: (x['word'].lower(), int(x['word_count'])))
.reduceByKey(lambda x, y: x + y))
# Display 10 results.
pprint.pprint(word_counts.take(10))
# Stage data formatted as newline delimited json in Google Cloud Storage.
output_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_output'.format(bucket)
partitions = range(word_counts.getNumPartitions())
output_files = [output_directory + '/part-{:05}'.format(i) for i in partitions]
(word_counts
.map(lambda (w, c): json.dumps({'word': w, 'word_count': c}))
.saveAsTextFile(output_directory))
# Manually clean up the input_directory, otherwise there will be BigQuery export
# files left over indefinitely.
input_path = sc._jvm.org.apache.hadoop.fs.Path(input_directory)
input_path.getFileSystem(sc._jsc.hadoopConfiguration()).delete(input_path, True)
print """
###########################################################################
# Finish uploading data to BigQuery using a client e.g.
bq load --source_format NEWLINE_DELIMITED_JSON \
--schema 'word:STRING,word_count:INTEGER' \
wordcount_dataset.wordcount_table {files}
# Clean up the output
gsutil -m rm -r {output_directory}
###########################################################################
""".format(
files=','.join(output_files),
output_directory=output_directory)