Data overwrite google sheet - Jupyter connection - api

I created a connection between my Jupyter notebook and a Google Sheet.
My idea was to create a log, so every time I run the notebook it updates my Google Sheet with the new data. However, I don't want to overwrite the existing data; I want to append to it. I have tried many solutions, but none of them worked.
Currently my code is:
## Connect to our service account
import gspread
from oauth2client.service_account import ServiceAccountCredentials
from df2gspread import df2gspread as d2g

scope = ["https://spreadsheets.google.com/feeds",
         "https://www.googleapis.com/auth/spreadsheets",
         "https://www.googleapis.com/auth/drive.file",
         "https://www.googleapis.com/auth/drive"]
credentials = ServiceAccountCredentials.from_json_keyfile_name('jupyter-and-gsheet-303208-63903bea8f5d.json', scope)
gc = gspread.authorize(credentials)

spreadsheet_key = '1RbPnMdJ-EcJHbly280vrJxc8UvqwiBPkUTFLyo4efEA'
wks_name = 'Data04'
# Upload the DataFrame to the worksheet
d2g.upload(df_apn1, spreadsheet_key, wks_name, credentials=credentials)
It works, but it always overwrites the existing data.
Does anybody know how I can append instead of replace?
Thank you.

The df2gspread documentation for upload() indicates that:
if spreadsheet already exists, all data of provided worksheet (or first as default) will be replaced with data of given DataFrame, make sure that this is what you need!
One workaround is to convert your DataFrame to a list of lists and use gspread's append_rows.
Example:
import gspread
import pandas as pd

gc = gspread.service_account()
sh = gc.open_by_key("someid").sheet1

df = pd.DataFrame({'Name': ['Bea', 'Andrew', 'Mike'], 'Age': [20, 19, 23]})
values = df.values.tolist()  # convert the DataFrame rows to a list of lists
sh.append_rows(values)       # appends after the last non-empty row of the sheet
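If the worksheet starts out empty, you may also want to write the header row once before appending the data; a small sketch (the empty-sheet check is my own addition, not part of the original answer):
# Optionally write the column names first if the worksheet is still empty
if not sh.get_all_values():
    sh.append_rows([df.columns.tolist()])

# Then append the data rows after whatever is already there
sh.append_rows(values)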
You may also check the following libraries:
gspread-pandas
gspread-dataframe
Reference:
gspread

Related

How to convert S3 bucket content (.csv format) into a dataframe in AWS Lambda

I am trying to ingest S3 data (CSV file) into RDS (MSSQL) through Lambda. Sample code:
import csv
import json
import boto3
from urllib.parse import unquote_plus

s3 = boto3.client('s3')

def lambda_handler(event, context):
    if event:
        file_obj = event["Records"][0]
        bucketname = str(file_obj["s3"]["bucket"]["name"])
        csv_filename = unquote_plus(str(file_obj["s3"]["object"]["key"]))
        print("Filename: ", csv_filename)
        csv_fileObj = s3.get_object(Bucket=bucketname, Key=csv_filename)
        file_content = csv_fileObj["Body"].read().decode("utf-8").split()
I have tried putting my CSV contents into a list, but it didn't work.
        results = []
        for row in csv.DictReader(file_content):
            results.append(row.values())
        print(results)
        print(file_content)
    return {
        'statusCode': 200,
        'body': json.dumps('S3 file processed')
    }
Is there any way I could convert "file_content" into a dataframe in Lambda? I have multiple columns to load.
Later I would follow this approach to load the data into RDS:
import pyodbc
import pandas as pd

# insert data from csv file into dataframe (df).
server = 'yourservername'
database = 'AdventureWorks'
username = 'username'
password = 'yourpassword'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()

# Insert Dataframe into SQL Server:
for index, row in df.iterrows():
    cursor.execute("INSERT INTO HumanResources.DepartmentTest (DepartmentID,Name,GroupName) values(?,?,?)",
                   row.DepartmentID, row.Name, row.GroupName)
cnxn.commit()
cursor.close()
Can anyone suggest how to go about it?
You can use io.BytesIO to get the bytes into memory and then use pandas' read_csv to transform them into a dataframe. Note that there is a strange SSL download limit that leads to issues when downloading data > 2GB; that is why I have used chunking in the code below.
import io
import pandas as pd

obj = s3.get_object(Bucket=bucketname, Key=csv_filename)
# This should prevent the 2GB download limit from a python ssl internal
chunks = (chunk for chunk in obj["Body"].iter_chunks(chunk_size=1024**3))
data = io.BytesIO(b"".join(chunks))  # This keeps everything fully in memory
df = pd.read_csv(data)  # here you can also provide any necessary args and kwargs
It appears that your goal is to load the contents of a CSV file from Amazon S3 into SQL Server.
You could do this without using Dataframes, as sketched below:
- Loop through the event Records (multiple records can be passed in)
- For each object:
  - Download the object to /tmp/
  - Use the Python csv reader to loop through the contents of the file
  - Generate INSERT statements to insert the data into the SQL Server table
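A rough sketch of those steps (my own illustration; the table, column names, and connection string are taken from the question and are placeholders):
import csv
import boto3
import pyodbc
from urllib.parse import unquote_plus

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Placeholder connection string; reuse whatever server/credentials you already have
    cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER=yourservername;DATABASE=AdventureWorks;UID=username;PWD=yourpassword')
    cursor = cnxn.cursor()
    for record in event["Records"]:                  # multiple records can be passed in
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        local_path = "/tmp/" + key.split("/")[-1]
        s3.download_file(bucket, key, local_path)    # download the object to /tmp/
        with open(local_path, newline="") as f:
            for row in csv.DictReader(f):            # loop through the CSV contents
                cursor.execute(
                    "INSERT INTO HumanResources.DepartmentTest (DepartmentID, Name, GroupName) VALUES (?, ?, ?)",
                    row["DepartmentID"], row["Name"], row["GroupName"])
    cnxn.commit()
    cursor.close()
    return {"statusCode": 200, "body": "S3 file processed"}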
You might also consider using aws-data-wrangler: Pandas on AWS, which is available as a Lambda Layer.

Load data only once on the RAM using Python

Hopefully someone can help me. I have a set of static data files for some data analysis; however, every time I run my script it takes a really long time to see what is happening, because the data is loaded every time. Is there a way to load the data once and afterwards just work with it?
I have been using Jupyter notebooks and that works really well, but I would like a way to fix this problem using plain Python code.
The sequence of my code is:
File 1: contains all the functions;
File 2: contains all the variables and calls File 1 in order to know what to do with the data.
File 1 = functions.py
import numpy as np

def dict_files(filepath_lst):
    dictoffiles = {}
    for namefile in filepath_lst:
        content_file = np.loadtxt(namefile)
        dictoffiles[namefile] = content_file
    ## Sorting files according to smallest timestamp to largest ##
    sorted_dictoffiles = {keys: values for keys, values in sorted(dictoffiles.items(), key=lambda item: item[1][0, 0])}
    return sorted_dictoffiles
File 2
import glob
from os.path import join as filejoin  # assuming filejoin refers to os.path.join

import functions as f

### ----------File Path -----------###
directory = 'some_file_path'
file_path = glob.glob(filejoin(directory, '*.dat'))
dictionary_of_files = f.dict_files(file_path)
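One common way to handle this (not from the original thread; just a sketch under the assumption that the .dat files rarely change) is to cache the parsed dictionary to disk once and reload the cached copy on later runs:
import os
import glob
import pickle
from os.path import join as filejoin  # same assumption as above

import functions as f

CACHE = 'parsed_data.pkl'  # hypothetical cache file name

def load_once(directory):
    # Reuse the cached dictionary if it exists; otherwise parse the .dat files once
    if os.path.exists(CACHE):
        with open(CACHE, 'rb') as fh:
            return pickle.load(fh)
    data = f.dict_files(glob.glob(filejoin(directory, '*.dat')))
    with open(CACHE, 'wb') as fh:
        pickle.dump(data, fh)
    return data

dictionary_of_files = load_once('some_file_path')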

Read/Write Data in Google Spreadsheet

I am not using MS Office on my local machine, so I am using Google Docs.
Now I need to create a script that fetches the data from a Google spreadsheet in Selenium.
I want to read/write data from the Google spreadsheet using Selenium WebDriver.
Does anyone have an idea about how to do it?
Technologies:
Selenium Web-Driver
JAVA
TestNG
Eclipse IDE
I don't have access to Google Sheets right now, but I'm guessing it would look something like this.
pip install gspread oauth2client
Then...
import gspread
from oauth2client.service_account import ServiceAccountCredentials
# use creds to create a client to interact with the Google Drive API
scope = ['https://spreadsheets.google.com/feeds']
creds = ServiceAccountCredentials.from_json_keyfile_name('client_secret.json', scope)
client = gspread.authorize(creds)
# Find a workbook by name and open the first sheet
# Make sure you use the right name here.
sheet = client.open("Copy of Legislators 2017").sheet1
# Extract and print all of the values
list_of_hashes = sheet.get_all_records()
print(list_of_hashes)
Or, get a list of lists:
sheet.get_all_values()
Finally, you could just pull the data from a single row, column, or cell:
sheet.row_values(1)
sheet.col_values(1)
sheet.cell(1, 1).value
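For the "write" side of the question, the same client can update cells or append rows; a minimal sketch (the cell references and values are just placeholders):
# Overwrite a single cell (row 1, column 1)
sheet.update_cell(1, 1, "New value")

# Append a row after the existing data
sheet.append_row(["Alice", "Senator", "CA"])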
https://www.twilio.com/blog/2017/02/an-easy-way-to-read-and-write-to-a-google-spreadsheet-in-python.html
https://towardsdatascience.com/accessing-google-spreadsheet-data-using-python-90a5bc214fd2

reading from hive table and updating same table in pyspark - using checkpoint

I am using Spark version 2.3 and trying to read a Hive table in Spark as:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

# Hive-enabled session (assuming Hive support is configured)
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("emp.emptable")
Here I am adding a new column with the current system date to the existing dataframe:
import pyspark.sql.functions as F
newdf = df.withColumn('LOAD_DATE', F.current_date())
Now I am facing an issue when I try to write this dataframe back as a Hive table:
newdf.write.mode("overwrite").saveAsTable("emp.emptable")
pyspark.sql.utils.AnalysisException: u'Cannot overwrite table emp.emptable that is also being read from;'
So I am checkpointing the dataframe to break the lineage, since I am reading from and writing to the same table:
checkpointDir = "/hdfs location/temp/tables/"
spark.sparkContext.setCheckpointDir(checkpointDir)
df = spark.table("emp.emptable").coalesce(1).checkpoint()
newdf = df.withColumn('LOAD_DATE', F.current_date())
newdf.write.mode("overwrite").saveAsTable("emp.emptable")
This way it works fine and the new column is added to the Hive table, but I have to delete the checkpoint files every time they get created. Is there a better way to break the lineage, write the same dataframe with the updated column, and save it to an HDFS location or as a Hive table?
Or is there any way to specify a temporary location for the checkpoint directory that gets deleted once the Spark session completes?
As we discussed in this post, setting the property below is the way to go.
spark.conf.set("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
That question had a different context: we wanted to retain the checkpointed dataset, so we did not care to add a cleanup solution.
Setting the above property works sometimes (tested in Scala, Java, and Python), but it's hard to rely on it. The official documentation says this property "Controls whether to clean checkpoint files if the reference is out of scope." I don't know exactly what that means, because my understanding is that once the Spark session/context is stopped it should clean them up. It would be great if someone could shed light on it.
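One thing worth noting (my assumption, not something stated in the original answer): spark.cleaner.* settings are read when the SparkContext is created, so passing the flag in the builder config before getOrCreate() is more likely to take effect than calling spark.conf.set() on an already-running session. A sketch:
from pyspark.sql import SparkSession

# Hypothetical session setup: set the cleaner flag as part of the builder config
# so it is present when the SparkContext is created.
spark = (SparkSession.builder
         .appName("checkpoint-cleanup")  # name is just a placeholder
         .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
         .enableHiveSupport()
         .getOrCreate())
spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoint/")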
Regarding:
Is there any best way to break the lineage
Check this question; @BiS found a way to cut the lineage using the createDataFrame(RDD, Schema) method. I haven't tested it myself, though.
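Roughly, that approach looks like this (just a sketch; as noted above, it is untested here):
from pyspark.sql import functions as F

# Rebuild the DataFrame from its RDD and schema; the rebuilt plan no longer
# references emp.emptable, which is what lets the overwrite pass analysis.
df = spark.table("emp.emptable")
newdf = df.withColumn('LOAD_DATE', F.current_date())
cut = spark.createDataFrame(newdf.rdd, newdf.schema)
cut.write.mode("overwrite").saveAsTable("emp.emptable")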
Just FYI, I usually don't rely on the above property and instead delete the checkpoint directory in code itself, to be on the safe side.
We can get the checkpoint directory as shown below:
Scala:
// Set directory
scala> spark.sparkContext.setCheckpointDir("hdfs:///tmp/checkpoint/")
scala> spark.sparkContext.getCheckpointDir.get
res3: String = hdfs://<name-node:port>/tmp/checkpoint/625034b3-c6f1-4ab2-9524-e48dfde589c3
// It gives a String, so we can use org.apache.hadoop.fs to delete the path
PySpark:
# Set directory
>>> spark.sparkContext.setCheckpointDir('hdfs:///tmp/checkpoint')
>>> t = sc._jsc.sc().getCheckpointDir().get()
>>> t
u'hdfs://<name-node:port>/tmp/checkpoint/dc99b595-f8fa-4a08-a109-23643e2325ca'
# Notice the 'u' at the start: it returns a unicode object, so use str(t)
# Below are the steps to get a Hadoop FileSystem object and delete the path
>>> fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
>>> fs.exists(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
True
>>> fs.delete(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
True
>>> fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
>>> fs.exists(sc._jvm.org.apache.hadoop.fs.Path(str(t)))
False

AWS SageMaker pd.read_pickle() doesn't work but read_csv() does?

I've recently been trying to train some models on an AWS SageMaker Jupyter notebook instance.
Everything worked very well until I tried to load a custom dataset (REDD) from files.
I have the dataframes stored in Pickle (.pkl) files in an S3 bucket. I couldn't manage to read them into SageMaker, so I decided to convert them to CSVs, as this seemed to work, but I ran into a problem. This data has an index of type datetime64, and when using .to_csv() this index gets converted to plain text and loses its data structure (and I need to keep this specific index for correct plotting).
So I decided to try the Pickle files again, but I can't get it to work and have no idea why.
The following code for CSVs works, but I can't use it due to the index problem:
import pandas as pd

bucket = 'sagemaker-peno'
houses_dfs = {}
data_key = 'compressed_data/'
data_location = 's3://{}/{}'.format(bucket, data_key)

for file in range(6):
    houses_dfs[file+1] = pd.read_csv(data_location + 'house_' + str(file+1) + '.csv',
                                     index_col='Unnamed: 0')
But this code does NOT work even though it uses almost the exact same syntax:
bucket = 'sagemaker-peno'
houses_dfs = {}
data_key = 'compressed_data/'
data_location = 's3://{}/{}'.format(bucket, data_key)

for file in range(6):
    houses_dfs[file+1] = pd.read_pickle(data_location + 'house_' + str(file+1) + '.pkl')
Yes it's 100% the correct path, because the csv and pkl files are stored in the same directory (compressed_data).
It throws me this error while using the Pickle method:
FileNotFoundError: [Errno 2] No such file or directory: 's3://sagemaker-peno/compressed_data/house_1.pkl'
I hope to find someone who has dealt with this before and can either solve the read_pickle() issue or, as an alternative, fix my datetime64 index issue with CSVs.
Thanks in advance!
read_pickle() likes the full path more than a relative path from where it was run. This fixed my issue.
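If the full path alone doesn't help, note that older pandas versions may not accept s3:// URLs in read_pickle() even though read_csv() does (my assumption about the version in use). One workaround, sketched here with the same bucket and key layout as above, is to download each object first and read it from a full local path:
import boto3
import pandas as pd

s3 = boto3.client('s3')
bucket = 'sagemaker-peno'

houses_dfs = {}
for file in range(6):
    key = 'compressed_data/house_' + str(file + 1) + '.pkl'
    local_path = '/tmp/house_' + str(file + 1) + '.pkl'
    s3.download_file(bucket, key, local_path)          # copy the object locally
    houses_dfs[file + 1] = pd.read_pickle(local_path)  # read using a full local path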