Boto3 S3 Select CSV to Pandas DataFrame: trouble delimiting

I am trying to use Boto3 to 'query' a CSV within an S3 bucket and spit the data into a Pandas DataFrame object. It is 'working', except that almost all of the data ends up in a single column.
Here is the Python (thanks to 20 Chrome tabs and Stack Overflow threads):
import pandas as pd
import boto3
import io

s3 = boto3.client(service_name='s3',
                  aws_access_key_id='redacted',
                  aws_secret_access_key='redacted')

# just selecting everything until I get this proof of concept finished
query = """SELECT *
FROM S3Object"""

obj = s3.select_object_content(
    Bucket='redacted',
    Key='redacted',
    ExpressionType='SQL',
    Expression=query,
    InputSerialization={'CSV': {'FileHeaderInfo': 'Use',
                                'RecordDelimiter': '|'}},
    OutputSerialization={'CSV': {}})

records = []
for event in obj['Payload']:
    if 'Records' in event:
        records.append(event['Records']['Payload'])
    elif 'Stats' in event:
        stats = event['Stats']['Details']

file_str = ''.join(r.decode('utf-8') for r in records)
df = pd.read_csv(io.StringIO(file_str))
This is what the CSV in the S3 bucket looks like:
Field_1
"HeaderA""|""HeaderB""|""HeaderC""|""HeaderD"
"valueA1""|""valueB1""|""valueC1""|""valueD1"
"valueA2""|""valueB2""|""valueC2""|""valueD2"
"valueA3""|""valueB3""|""valueC3""|""valueD3"
.
.
.
"valueAn""|""valueBn""|""valueCn""|""valueDn"
And here is my current DataFrame output:
HeaderB
------------
HeaderC
HeaderD
valueA1
valueB1
valueC1
valueD1
valueA2
valueB2
valueC2
valueD2
...
valueDn
Desired output is 4 columns by n rows (plus headers).
Any ideas on how to fix this?
Edit:
InputSerialization={'CSV': {'FileHeaderInfo': 'None',
                            'FieldDelimiter': '"',
                            'AllowQuotedRecordDelimiter': True
                            }}
That got me 95% of the way there; the pipes came through as their own columns in the DataFrame. Solution:
for col in df.columns:
    if col[0] == '|':
        df = df.drop(col, axis=1)
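An equivalent one-liner, on the assumption that every unwanted column name starts with a pipe:

# drop every column whose name starts with '|' in a single step
df = df.loc[:, ~df.columns.str.startswith('|')]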
Edit 2:
This solution works when pulling the entire CSV with SELECT *.
Now that this works, I've moved on to the next proof of concept: using a more specific query. There were some discrepancies between what was returned and what I could verify by looking directly at the CSV. I think this is due to the first line of the CSV being Field_1, followed by the actual header fields and record values. My current theory is that with this first line removed from the original input, I will be able to field-delimit on the quoted pipe and record-delimit on the newline and get the results I want. I am reaching out to the team responsible for these S3 dumps to see if the first line can be removed; in the meantime, a client-side workaround is sketched below.
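A minimal sketch of that client-side workaround, assuming the file is small enough to pull down whole and that its layout matches the sample above:

import boto3
import pandas as pd

s3 = boto3.client('s3')

# pull the raw object, skip the Field_1 line, and split each record
# on the quoted-pipe separator ourselves
body = s3.get_object(Bucket='redacted', Key='redacted')['Body'].read().decode('utf-8')
lines = body.splitlines()[1:]                   # drop the Field_1 line
rows = [line.strip('"').split('""|""') for line in lines]
df = pd.DataFrame(rows[1:], columns=rows[0])    # the next line holds the real headers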

New CSV file:
Field_1
"HeaderA""|""HeaderB""|""HeaderC""|""HeaderD"
"a_val1""|""bv3""|""1""|""10"
"a_val2""|""bv4""|""1""|""20"
"a_val3""|""bv4""|""3""|""40"
"a_val4""|""bv6""|""4""|""40"
def get_results(query):
    obj = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType='SQL',
        Expression=query,
        InputSerialization={'CSV': {'FileHeaderInfo': 'IGNORE',
                                    'FieldDelimiter': '"',
                                    'AllowQuotedRecordDelimiter': True
                                    }},
        OutputSerialization={'CSV': {}})
    # print(list(obj['Payload']))
    records = []
    for event in obj['Payload']:
        if 'Records' in event:
            records.append(event['Records']['Payload'])
        elif 'Stats' in event:
            stats = event['Stats']['Details']
    file_str = ''.join(r.decode('utf-8') for r in records)
    df = pd.read_csv(io.StringIO(file_str))
    # df = df.filter(regex='Header')
    return df
To get this to work, ignore the headers (the first line of the file) and then match them explicitly in the WHERE/AND clause. Figuring out the column positions is the time-consuming part (a sketch for deriving them is below the outputs).
query = '''SELECT s._2, s._6, s._10, s._14 FROM S3Object s where s._6 = 'bv4' or s._6 = 'HeaderB' '''
query = '''SELECT s._2, s._6 FROM S3Object s where s._6 = 'bv4' or s._6 = 'HeaderB' '''
get_results(query)
Here are the outputs of the two queries:
  HeaderA HeaderB  HeaderC  HeaderD
0  a_val2     bv4        1       20
1  a_val3     bv4        3       40

  HeaderA HeaderB
0  a_val2     bv4
1  a_val3     bv4
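The positions can also be derived rather than counted by hand. With FieldDelimiter set to '"', the header record appears to tokenize the same way a plain split on the quote character does (an assumption, but one that matches the s._2/s._6/s._10/s._14 positions observed above). A sketch using the sample header line:

raw_header = '"HeaderA""|""HeaderB""|""HeaderC""|""HeaderD"'
fields = raw_header.split('"')   # mimic how S3 Select tokenizes on the '"' delimiter
positions = {name: 's._%d' % (i + 1)
             for i, name in enumerate(fields)
             if name not in ('', '|')}
# positions -> {'HeaderA': 's._2', 'HeaderB': 's._6', 'HeaderC': 's._10', 'HeaderD': 's._14'}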

Related

convert TB to GB on specific index

I wrote the code below. I wanted to reduce every column that contains TB and GB values to a single integer; for example, if a column has 2 TB, this code deletes the TB and keeps the 2. The program works well. What I now want to do is convert 2 TB to 2048 GB so that I can sum all column values. Is there any way to remove the TB and do the calculation on a specific row at the same time?
def removeend():
    df = pd.read_csv('ExportList.csv')
    if (df["Used Space"].str.contains("GB|TB").any()
            or df["Memory Size"].str.contains("GB|TB").any()
            or df["Host CPU"].str.contains("Hz|MHz|GHz").any()):
        df['Used Space'] = df['Used Space'].str.replace(r'GB|TB', '', regex=True)
        df["Memory Size"] = df["Memory Size"].str.replace(r'GB|TB', '', regex=True)
        df['Host CPU'] = df['Host CPU'].str.replace(r'MHz|Hz|GHz', '', regex=True)
        df = df.convert_dtypes()
        df["Used Space"] = pd.to_numeric(df["Used Space"])
        df["Memory Size"] = pd.to_numeric(df["Memory Size"])
        df["Host CPU"] = pd.to_numeric(df["Host CPU"])
    else:
        print("Error occurred!!!")
    return df
Define/create a custom function:
def converter(x):
    try:
        return pd.eval(x)
    except Exception:
        return x
Finally:
cols = ["Used Space", "Memory Size"]
df[cols] = df[cols].replace({'GB': '', 'TB': '*1024'}, regex=True).applymap(converter)
df["Host CPU"] = df["Host CPU"].replace({'MHz': '', 'GHz': '*0.001', 'Hz': '*0.000001'}, regex=True).map(converter)

Pandas combine multiple columns in a BQ table to generate payload for FB Conversions API

I am reading from a BigQuery table to generate a payload to upload to the FB Conversions API.
cols=["payload","client_user_agent","event_source_url"]
I am copying the column values directly from the BQ table, as I am unable to print the full output of the DataFrame in the notebook.
payload="{"pageDetail":{"pageName":"Confirmation","pageContentType":"cart","pageSiteSection":"cart","breadcrumbs":[{"title":"Home","url":"/en/home.html"},{"title":"Cart","url":"/cart"},{"title":"Confirmation","url":"/order-confirmation="}],"pageCategory":"Home","pageCategory1":"Cart","pageCategory2":"Confirmation","proBtbGlobalHeader":false},"orderDetails":{"hceid":"3b94a","orderConfirmed":true,"orderDate":"2021-01-15","orderId":"0123","unique":2,"pricingSummary":{"total":54.01},"items":[{"productId":"0456","quantity":1,"shippingAddress":{"postalCode":"V4N 3X3"},"promotion":{"voucherCode":null},"clickToInstall":{"eligible":false}},{"productId":"0789","quantity":1,"fulfillment":{"fulfillmentCost":""},"shippingAddress":{"postalCode":"A4N 3Y3"},"promotion":{"voucherCode":null},"clickToInstall":{"eligible":false}}],"billingAddress":{"postalCode":"M$X1A7"}},"event":{"type":"Load","page":"Confirmation","timestamp":1610706772998,"language":"English","url":"https://www"}}"
client_user_agent="Mozilla/5.0"
event_source_url= "https://www.def.com="
I need the values for email = ["orderDetails"]["hceid"] and value = ["orderDetails"]["pricingSummary"]["total"].
Initially, all the payload I wanted was in a single column, and I was able to achieve the uploads with the following code:
import time
from facebook_business.adobjects.serverside.event import Event
from facebook_business.adobjects.serverside.event_request import EventRequest
from facebook_business.adobjects.serverside.user_data import UserData
from facebook_business.adobjects.serverside.custom_data import CustomData
from facebook_business.api import FacebookAdsApi
import pandas as pd
import json

FacebookAdsApi.init(access_token=access_token)

query = '''SELECT JSON_EXTRACT(payload, '$') AS payload FROM `project.dataset.events` WHERE eventType = 'Page Load' AND pagename = "Confirmation" limit 1'''
df = pd.read_gbq(query, project_id=project, dialect='standard')
payload = df.to_dict(orient="records")

for i in payload:
    # print(type(i["payload"]))
    k = json.loads(i["payload"])
    email = k["orderDetails"]["hcemuid"]
    user_data = UserData(email)
    value = k["orderDetails"]["pricingSummary"]["total"]
    order_id = k["orderDetails"]["orderId"]
    custom_data = CustomData(
        currency='CAD',
        value=value)
    event = Event(
        event_name='Purchase',
        event_time=int(time.time()),
        user_data=user_data,
        custom_data=custom_data,
        event_id=order_id,
        data_processing_options=[])
    events = [event]
    # print(events)
    event_request = EventRequest(
        events=events,
        test_event_code='TEST8609',
        pixel_id=pixel_id)
    # print(event_request)
    a = event_request.execute()
    print(a)
Now there are additional values: client_user_agent, which needs to be part of the user data, and event_source_url, which belongs on the event in the above code; they are present as two different columns in the GBQ table.
I have tried similar code as above for multiple columns, but I am receiving a
TypeError: Object of type Series is not JSON serializable
So I tried concatenating the columns and then creating a JSON-serializable object, but I am not able to do the upload.
Below is where I am stuck and lost, and I am not sure how to proceed further; any inputs appreciated.
import time
from facebook_business.adobjects.serverside.event import Event
from facebook_business.adobjects.serverside.event_request import EventRequest
from facebook_business.adobjects.serverside.user_data import UserData
from facebook_business.adobjects.serverside.custom_data import CustomData
from facebook_business.api import FacebookAdsApi
import pandas as pd
import json

FacebookAdsApi.init(access_token=access_token)

query = '''SELECT payload AS payload, location.userAgent AS client_user_agent, location.referrer AS event_source_url FROM `project.Dataset.events` WHERE eventType = 'Page Load' AND pagename = "Confirmation" limit 1'''
df = pd.read_gbq(query, project_id=project, dialect='standard')
df.reset_index(drop=True, inplace=True)
payload = df.to_dict(orient="records")
print(payload)

## cols = ['payload', 'client_user_agent', 'event_source_url']
## df['combined'] = df[cols].apply(lambda row: ','.join(row.values.astype(str)), axis=1)
## del df["payload"]
## del df["client"]
## del df["source"]
## payload = df.to_dict(orient="records")

# tried concatenating all columns in the dataframe but not able to create a valid json object for upload
columns = ['payload', 'client_user_agent', 'event_source_url']
df['payload'] = df['payload'].str.replace(r'}"$', '')
payload = df[columns].to_dict(orient='records')
print(payload)

## df = df.drop(columns=columns)
## pd.options.display.max_rows = 4000
# for i in payload:
#     print(i["payload"])
#     k = json.loads(i["payload"])
#     email = k["orderDetails"]["hcemuid"]
#     print(email)
I am following the instructions from this page: https://developers.facebook.com/docs/marketing-api/conversions-api
I have used the BigQuery JSON_EXTRACT_SCALAR function to extract the data from the nested column instead of pandas, which is a relatively better solution for my scenario.
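A sketch of what that approach might look like, with the JSON paths taken from the payload above (the field names are assumptions; adjust to the real schema):

query = '''
SELECT
  JSON_EXTRACT_SCALAR(payload, '$.orderDetails.hcemuid') AS email,
  JSON_EXTRACT_SCALAR(payload, '$.orderDetails.pricingSummary.total') AS value,
  JSON_EXTRACT_SCALAR(payload, '$.orderDetails.orderId') AS order_id,
  location.userAgent AS client_user_agent,
  location.referrer AS event_source_url
FROM `project.Dataset.events`
WHERE eventType = 'Page Load' AND pagename = "Confirmation"
'''
df = pd.read_gbq(query, project_id=project, dialect='standard')

# every cell is now a plain scalar, so each record serializes cleanly
for row in df.to_dict(orient='records'):
    user_data = UserData(email=row['email'], client_user_agent=row['client_user_agent'])
    # ... build CustomData / Event / EventRequest as in the working code above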

Web scraping several URLs into a pandas df

Need some help appending several web scraping results to a pandas df.
Currently I'm only getting the output from one of the URLs into the df.
I left out the URLs; if you need them I will supply them.
## libs
import bs4
import requests
import re
from time import sleep
import pandas as pd
from bs4 import BeautifulSoup as bs

## webscraping targets
URLs = ["URL1", "URL2", "URL3"]

## get columns
column_list = []
r1 = requests.get(URLs[0])
soup1 = bs(r1.content)
data1 = soup1.find_all('dl', attrs={"class": "border XSText rightAlignText noMarginTop highlightOnHover thickBorderBottom noTopBorder"})
columns = soup1.find_all('dt')
for col in columns:
    column_list.append(col.text.strip())  # strip() removes extra space from the text

## get values
value_list = []
for url in URLs:
    r1 = requests.get(url)
    soup1 = bs(r1.content)
    data1 = soup1.find_all('dl', attrs={"class": "border XSText rightAlignText noMarginTop highlightOnHover thickBorderBottom noTopBorder"})
    values = soup1.find_all('dd')
    for val in values:
        value_list.append(val.text.strip())

df = pd.DataFrame(list(zip(column_list, value_list)))
df.transpose()
Current output, only showing the results of one URL: [image]
Expected output: [image]
The problem here is with your zip function: it only zips values up to the length of the shortest list, in this case column_list, leaving all the other values unused.
If you want to append the other values to the dataframe as well, you will have to iterate over them. So change the last two lines of your code to this and it should work:
result = [[i] for i in column_list]
for i, a in enumerate(value_list):
    result[i % len(column_list)].extend([a])

df = pd.DataFrame(result)
df.transpose()
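An alternative sketch that sidesteps the index arithmetic: build one dict per URL so each page becomes one row (this assumes every page lists its dd values in the same order as the dt headers on the first page):

rows = []
for url in URLs:
    soup = bs(requests.get(url).content)
    values = [dd.text.strip() for dd in soup.find_all('dd')]
    rows.append(dict(zip(column_list, values)))
df = pd.DataFrame(rows)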

Adding multiple dictionaries into a single pandas DataFrame

I have a set of Python dictionaries that I have obtained by means of a for loop. I am trying to have these added to a Pandas DataFrame.
Output for a variable called output:
{'name':'Kevin','age':21}
{'name':'Steve','age':31}
{'name':'Mark','age':11}
I am trying to append each of these dictionaries to a single DataFrame. I tried the below, but it just added the first row.
df = pd.DataFrame(output)
Could anyone advise where I am going wrong, so that all the dictionaries get added to the DataFrame?
Update on the loop statement:
The below code reads XML and converts it to a dataframe. Right now I am able to loop through multiple XML files and create a dictionary for each XML file. I am trying to see how I could add each of these dictionaries to a single DataFrame:
def f(elem, result):
    result[elem.tag] = elem.text
    for c in elem:  # iterate over the element's children
        result = f(c, result)
    return result

result = {}
for file in allFiles:
    tree = ET.parse(file)
    root = tree.getroot()
    result = f(root, result)
    print(result)
You can append each dictionary to a list and finally call the DataFrame constructor:
out = []
for file in allFiles:
    tree = ET.parse(file)
    root = tree.getroot()
    result = f(root, result)
    out.append(result)

df = pd.DataFrame(out)
We can add these dicts to a list:
ds = []
for ...:       # your loop
    ds += [d]  # where d is one of the dicts
When we have the list of dicts, we can simply use pd.DataFrame on that list:
ds = [
    {'name': 'Kevin', 'age': 21},
    {'name': 'Steve', 'age': 31},
    {'name': 'Mark', 'age': 11},
]
pd.DataFrame(ds)
Output:
    name  age
0  Kevin   21
1  Steve   31
2   Mark   11
Update:
And it's not a problem if different dicts have different keys, e.g.:
ds = [
    {'name': 'Kevin', 'age': 21},
    {'name': 'Steve', 'age': 31, 'location': 'NY'},
    {'name': 'Mark', 'age': 11, 'favorite_food': 'pizza'},
]
pd.DataFrame(ds)
Output:
   age favorite_food location   name
0   21           NaN      NaN  Kevin
1   31           NaN       NY  Steve
2   11         pizza      NaN   Mark
Update 2:
Building on our previous discussion in Python - Converting xml to csv using Python pandas, we can do:
results = []
for file in glob.glob('*.xml'):
    tree = ET.parse(file)
    root = tree.getroot()
    result = f(root, {})
    result['filename'] = file  # added filename to our results
    results += [result]

pd.DataFrame(results)

Pandas - Trying to save a set of files by reading them using Pandas, but only the latest file gets saved

I am trying to read a set of txt files into Pandas as below. I am able to read them into a DataFrame; however, when I try to save the DataFrame, it only saves the last file it read, even though print(df) prints all the records.
Given below is the code I am using:
files = '/users/user/files'
list = []
for file in files:
    df = pd.read_csv(file)
    list.append(df)
    print(df)
df.to_csv('file_saved_path')
Could anyone advise why only the last file is being saved to the csv file and not the entire list?
Expected output:
output1
output2
output3
Current output:
output1,output2,output3
Try this:
import os

path = '/users/user/files'
for id in range(len(os.listdir(path))):
    file = os.listdir(path)[id]
    data = pd.read_csv(path + '/' + file, sep='\t')
    if id == 0:
        df1 = data
    else:
        df1 = pd.concat([df1, data], ignore_index=True)

df1.to_csv('file_saved_path')
First change the variable name list, because list is a built-in name in Python, and loop over the actual files in the folder; then build the final DataFrame with concat:
import glob

L = []
for file in glob.glob('/users/user/files/*.txt'):  # the *.txt pattern is an assumption; adjust as needed
    df = pd.read_csv(file)
    L.append(df)

bigdf = pd.concat(L, ignore_index=True)
bigdf.to_csv('file_saved_path')