Skip csv header row using boto3 in lambda and copy_from in psycopg2 - amazon-s3

I'm loading a CSV into memory from S3 and then I need to insert it into Postgres. I think the problem is that I'm not using the right call on the S3 object, because I don't appear to be able to skip the header line. On my local machine I would just load the file from the directory:
cur = DBCONN.cursor()
for filename in absolute_file_paths('/path/to/file/csv.log'):
    print('Importing: ' + filename)
    with open(filename, 'r') as log:
        next(log)  # Skip the header row.
        cur.copy_from(log, 'vesta', sep='\t')
        DBCONN.commit()
I have the code below in Lambda, which I would like to work roughly like the above, but it's different with S3. What is the correct way to have the code below work like the above? Or perhaps: what IS the correct way to do this?
s3 = boto3.client('s3')
#Load the file from s3 into memory
obj = s3.get_object(Bucket=bucket, Key=key)
contents = obj['Body']
next(contents, None) # Skip the header row - this does not seem to work
cur = DBCONN.cursor()
cur.copy_from(contents, 'my_table', sep='\t')
DBCONN.commit()

Seemingly, my problem had something to do with an incredibly wide CSV file (I have over 200 columns), and somehow that kept the next() call from consuming the header row. So, if your file is not that wide, the code I placed in the question should work. Below, however, is how I got it to work: basically by reading the file into memory and then writing it back to an in-memory file after skipping the header row. This honestly seems a little like overkill, so I'd be happy if someone could provide something more efficient, but seeing as I spent the last eight hours on this, I'm just happy to have SOMETHING that works.
s3 = boto3.client('s3')
...
def remove_header(contents):
    # Reformat the file, removing the header row
    data = csv.reader(io.StringIO(contents), delimiter='\t')  # read the data in
    mem_file = io.StringIO()  # create an in-memory file object
    next(data)  # skip the header row
    writer = csv.writer(mem_file, delimiter='\t')  # set up the csv writer
    writer.writerows(data)  # write the remaining rows to the in-memory file
    mem_file.seek(0)  # go back to the beginning of the memory stream
    return mem_file
...
#Load the file from s3 into memory
obj = s3.get_object(Bucket=bucket, Key=key)
contents = obj['Body'].read().decode('utf-8')
mem_file = remove_header(contents)
#Insert into postgres
try:
    cur = DBCONN.cursor()
    cur.copy_from(mem_file, 'my_table', sep='\t')
    DBCONN.commit()
except BaseException as e:
    DBCONN.rollback()
    raise e
Or, if you want to do it with pandas:
def remove_header_pandas(contents):
    df = pd.read_csv(io.StringIO(contents), sep='\t')
    mem_file = io.StringIO()
    df.to_csv(mem_file, sep='\t', header=False, index=False)  # write back without the header
    mem_file.seek(0)
    return mem_file
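As an aside on the "something more efficient" point: if the only goal is to drop the first line, a lighter variant (just a sketch, reusing the same contents, cur and DBCONN objects from above and assuming the header ends with a plain newline) is to wrap the decoded string in an in-memory file and advance past the header before calling copy_from:

import io

buf = io.StringIO(contents)  # contents = obj['Body'].read().decode('utf-8')
next(buf)  # consume the header line
cur.copy_from(buf, 'my_table', sep='\t')
DBCONN.commit()

That said, the answer above notes next() misbehaving on very wide files, so treat this as an untested alternative rather than the accepted approach.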

Related

Julia load dataframe from s3 csv file

I'm having trouble finding an example to follow online for this simple use-case:
Load a CSV file from an S3 object location into a Julia DataFrame.
Here is what I tried that didn't work:
using AWSS3, DataFrames, CSV
filepath = S3Path("s3://muh-bucket/path/data.csv")
CSV.File(filepath) |> DataFrames # fails
# but I am able to stat the file
stat(filepath)
#=
Status( mode = -rw-rw-rw-,
...etc
size = 2141032 (2.0M),
blksize = 4096 (4.0K),
blocks = 523,
mtime = 2021-09-01T23:55:26,
...etc
=#
I can also read the file to a string object locally:
data_as_string = String(AWSS3.read(filepath));
#"column_1\tcolumn_2\tcolumn_3\t...etc..."
My AWS config is in order; I can access the object from Julia locally.
How do I get this into a DataFrame?
Thanks to help from the nice people on the Julia Slack channel (#data):
bytes = AWSS3.read(S3Path("s3://muh-bucket/path/data.csv"))
typeof(bytes)
# Vector{UInt8} (alias for Array{UInt8, 1})
df = CSV.read(bytes, DataFrame)
Bingo, I'm in business. The CSV.jl maintainer mentions that S3Path types used to work when passed to CSV.read, so perhaps this will be even simpler in the future.
Helpful SO post for getting AWS configs in order

Using pandas to open Excel files stored in GCS from command line

The following code snippet is from a Google tutorial; it simply prints the names of the files in a given GCS bucket:
from google.cloud import storage

def list_blobs(bucket_name):
    """Lists all the blobs in the bucket."""
    # bucket_name = "your-bucket-name"
    storage_client = storage.Client()
    # Note: Client.list_blobs requires at least package version 1.17.0.
    blobs = storage_client.list_blobs(bucket_name)
    for blob in blobs:
        print(blob.name)

list_blobs('sn_project_data')
Now from the command line I can run:
$ python path/file.py
And in my terminal the files in said bucket are printed out. Great, it works!
However, this isn't quite my goal. I'm looking to open a file and act upon it. For example:
df = pd.read_excel(filename)
print(df.iloc[0])
However, when I pass the path to the above, the error returned reads "invalid file path." So I'm sure there is some sort of GCP-specific function call to actually access these files...
What command(s) should I run?
Edit: This video https://www.youtube.com/watch?v=ED5vHa3fE1Q shows a trick for opening files that relies on StringIO in the process, but that doesn't support Excel files, so it's not an effective solution.
read_excel() does not support a Google Cloud Storage file path as of now, but it can read data as bytes.
pandas.read_excel(io, sheet_name=0, header=0, names=None,
index_col=None, usecols=None, squeeze=False, dtype=None, engine=None,
converters=None, true_values=None, false_values=None, skiprows=None,
nrows=None, na_values=None, keep_default_na=True, na_filter=True,
verbose=False, parse_dates=False, date_parser=None, thousands=None,
comment=None, skipfooter=0, convert_float=True, mangle_dupe_cols=True,
storage_options=None)
Parameters: io : str, bytes, ExcelFile, xlrd.Book, path object, or
file-like object
What you can do is use the blob object and use download_as_bytes() to convert the object into bytes.
Download the contents of this blob as a bytes object.
For this example I just used a random sample xlsx file and read the 1st sheet:
from google.cloud import storage
import pandas as pd
bucket_name = "your-bucket-name"
blob_name = "SampleData.xlsx"
storage_client = storage.Client()
bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(blob_name)
data_bytes = blob.download_as_bytes()
df = pd.read_excel(data_bytes)
print(df)
Test done.
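read_excel() also accepts a file-like object, so on pandas versions that are fussy about raw bytes the same download can be wrapped in an in-memory buffer (a sketch reusing data_bytes from above; sheet_name=0 just selects the first sheet):

import io

df = pd.read_excel(io.BytesIO(data_bytes), sheet_name=0)
print(df)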

how to add header row to df.to_csv output file only after a certain condition is met

I am reading a bulk download CSV file of stock prices and splitting it into many individual CSVs based on ticker, where the ticker name is the name of the outputted file. The header row, which contains "ticker, date, open, high, low, close, volume", is written ONLY the first time I run the script, because if I run it again with the header set to True, it writes a new header row mixed in with the stock data. I have mode set to "a", meaning append, because I want each new row of data added to the file. However, I now see a situation where a new ticker has appeared in the source file, and because I have the header set to False, there is no header in this newly created output file, which causes processing to fail. How can I include a condition so that it writes a header row ONLY for new files which never existed before? Here is my code. Thanks
import pandas as pd
import os
import csv
import itertools
import datetime
datetime = datetime.datetime.today().strftime('%Y-%m-%d')
filename = "Bats_"+(datetime)+".csv"
csv_file = ("H:\\EOD_DATA_RECENT\\DOWNLOADS\\"+filename)
path = 'H:\\EOD_DATA_RECENT\\VIA-API-CALL\\BATS\\'
df = pd.read_csv(csv_file)
for i, g in df.groupby('Ticker'):
    # SET HEADER TO TRUE THE FIRST RUN, THEN SET TO FALSE THEREAFTER
    g.to_csv(path + '{}.csv'.format(i), mode='a', header=False, index=False, index_label=None)
print(df.tail(5))
FINAL CODE SNIPPET BELOW THAT WORKS. Thanks
for i, g in df.groupby('Ticker'):
    if os.path.exists(path + i + ".csv"):
        g.to_csv(path + '{}.csv'.format(i), mode='a', header=False, index=False, index_label=None)
    else:
        g.to_csv(path + '{}.csv'.format(i), mode='w', header=True, index=False, index_label=None)
print(df.tail(5))
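A slightly more compact variant of the same idea is to let header depend directly on whether the file already exists (just a sketch; out is a hypothetical name for the per-ticker path):

for i, g in df.groupby('Ticker'):
    out = path + '{}.csv'.format(i)
    # write the header only when the file is being created for the first time
    g.to_csv(out, mode='a', header=not os.path.exists(out), index=False, index_label=None)
print(df.tail(5))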

How to skip duplicate headers in multiple CSV files having identical columns and merge as one big data frame

I have copied 34 CSV files with identical columns into Google Colab and am trying to merge them into one big data frame. However, each CSV has a duplicate header which needs to be skipped.
The actual header will be skipped anyway while concatenating, since my CSV files have identical columns, correct?
dfs = [pd.read_csv(path.join('/content/drive/My Drive/', x), skiprows=1) for x in os.listdir('/content/drive/My Drive/') if path.isfile(path.join('/content/drive/My Drive/', x))]
df = pd.concat(dfs)
The above code throws the error below.
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 1: invalid continuation byte
The code below works for sample files, but I need an efficient way to skip the duplicate headers and merge everything into one data frame. Please suggest.
df1=pd.read_csv("./Aug_0816.csv",skiprows=1)
df2=pd.read_csv("./Sep_0916.csv",skiprows=1)
df3=pd.read_csv("./Oct_1016.csv",skiprows=1)
df4=pd.read_csv("./Nov_1116.csv",skiprows=1)
df5=pd.read_csv("./Dec_1216.csv",skiprows=1)
dfs=[df1,df2,df3,df4,df5]
df=pd.concat(dfs)
Have you considered using glob from the standard library?
Try this
path = ('/content/drive/My Drive/')
os.chdir(path)
allFiles = glob.glob("*.csv")
dfs = [pd.read_csv(f,header=None,error_bad_lines=False) for f in allFiles]
#or if you know the specific delimiter for your csv
#dfs = [pd.read_csv(f,header=None,delimiter='yourdelimiter') for f in allFiles]
df = pd.concat(dfs)
Try this, a fairly generic script for concatenating any number of CSV files in a specific path with a common file name format:
def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)
path = r"C:\Users\Jyotsna\Documents"
fmask = os.path.join(path, 'Detail**.csv')
df = get_merged_csv(glob.glob(fmask), index_col=None)
df.head()
If you want to skip some fixed rows and/or columns in each of the files before concatenating, edit the code accordingly on this line!
    return pd.concat([pd.read_csv(f, skiprows=4, usecols=range(9), **kwargs) for f in flist], ignore_index=True)
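Since get_merged_csv just forwards its keyword arguments to read_csv, the duplicate header in each file can also be skipped at the call site instead of editing the function, and an explicit encoding can be passed the same way if the UnicodeDecodeError from the question reappears (a sketch; whether latin-1 is the right encoding depends on how the files were produced):

df = get_merged_csv(glob.glob(fmask), skiprows=1, encoding='latin-1', index_col=None)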

Get file size in bazaar

How can I get the size of a versioned file (or files)? This seems to be a very common operation, but I can't find how (am I missing something?). Well, I can get it indirectly by catting the file and counting the bytes, but this is very slow.
Any help would be appreciated.
You can't get information about file size from the command line, but you can get it with Python, using bzrlib.
import bzrlib
from bzrlib import branch

b = branch.Branch.open('.')
b.lock_read()
try:
    rt = b.repository.revision_tree(revision_id)
    fileid = rt.path2id('filename')
    if fileid is None:
        print 'file is not versioned!'
    else:
        print 'file size:', rt.get_file_size(fileid)
finally:
    b.unlock()
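The snippet assumes a revision_id is already in hand; to measure the file as of the branch tip, something along these lines should work (a sketch only: last_revision() is taken here to be the bzrlib Branch call that returns the tip's revision id, so treat it as an assumption):

revision_id = b.last_revision()  # revision id of the branch tip (assumed bzrlib API)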