Geocoding: iterrows() and itertuples() do not get the job done for a larger DataFrame - pandas

I'm trying to add coordinates to a set of addresses that are saved in an Excel file, using the Google Geocoding API. See the code below:
for i, row in df.iterrows():
    # combine the address columns into one variable, to push to the geocoder API
    apiAddress = str(df.at[i, 'adresse1']) + ',' + str(df.at[i, 'postnr']) + ',' + str(df.at[i, 'By'])
    # dictionary with the API key and the address info, pushed to the geocoder API on each iteration
    parameters = {
        'key': API_KEY,
        'address': apiAddress
    }
    # response from the API, based on the base url + the dictionary above
    response = requests.get(base_url, params=parameters).json()
    # the response is given as a dictionary; this accesses its geometry part
    geometry = response['results'][0]['geometry']
    # within the geometry part of the dictionary, access the lat and lng respectively
    lat = geometry['location']['lat']
    lng = geometry['location']['lng']
    # append the lat / lng to new columns in the dataframe on each iteration
    df.at[i, 'Geo_Lat_New'] = lat
    df.at[i, 'Geo_Lng_New'] = lng

# print the first 10 rows
print(df.head(10))
The above code works perfectly fine for 20 addresses, but when I try to run it on the entire dataset of 90,000 addresses using iterrows(), I get an IndexError:
File "C:\Users\...", line 29, in <module>
geometry = response['results'][0]['geometry']
IndexError: list index out of range
Using itertuples() instead, with:
for i, row in df.itertuples():
I get a ValueError:
File "C:\Users\...", line 22, in <module>
for i, row in df.itertuples():
ValueError: too many values to unpack (expected 2)
When I use:
for i in df.itertuples():
I get a complicated KeyError that is too long to post here.
Any suggestions on how to properly add coordinates for each address in the entire DataFrame?

Update: in the end I found out what the issue was. The Google Geocoding API only handles 50 requests per second, so I used the following code to take a 1-second break after every 49 requests:
if count == 49:
    print('Taking a 1 second break, total count is:', total_count)
    time.sleep(1)
    count = 0
Here count tracks the number of loop iterations; as soon as it hits 49, the if statement above is executed, taking a 1-second break and resetting count back to zero.

Although you have already found the error - the Google API limits the number of requests that can be made - it isn't usually good practice to use explicit for loops with pandas. Therefore, I would rewrite your code to take advantage of pd.DataFrame.apply.
def get_geometry(row: pd.Series, API_KEY: str, base_url: str, tries: int = 0):
    # join takes a single iterable, so wrap the fields in a list
    # (str() guards against numeric postcodes)
    apiAddress = ",".join([str(row["adresse1"]), str(row["postnr"]), str(row["By"])])
    parameters = {"key": API_KEY, "address": apiAddress}
    try:
        response = requests.get(base_url, params=parameters).json()
        geometry = response["results"][0]["geometry"]
    except IndexError:  # limit reached
        # sleep before making the next requests, but beware that
        # consistently hitting the limit could throttle you further.
        # This is why you might want to keep track of how many tries
        # you have already made, to stop the process once a threshold is met.
        if tries > 3:  # tries > arbitrary threshold
            raise
        time.sleep(1)
        return get_geometry(row, API_KEY, base_url, tries + 1)
    return geometry["location"]["lat"], geometry["location"]["lng"]

# pass kwargs to the applied function and iterate over every row
lat_lon = df.apply(get_geometry, API_KEY=API_KEY, base_url=base_url, axis=1)
df["Geo_Lat_New"] = lat_lon.apply(lambda latlon: latlon[0])
df["Geo_Lng_New"] = lat_lon.apply(lambda latlon: latlon[1])

Related

Workaround: Google Sheets API does not accept a range request without specifying the desired final line

My spreadsheet has values laid out as in the screenshot from the original post (a column of scores with empty cells in between), and I need to create a list to use in Python, including the empty fields that exist between values:
CLIENT_SECRET_FILE = 'client_secrets.json'
API_NAME = 'sheets'
API_VERSION = 'v4'
SCOPES = ['https://www.googleapis.com/auth/spreadsheets']
service = Create_Service(CLIENT_SECRET_FILE, API_NAME, API_VERSION, SCOPES)
spreadsheet_id = sheet_id
get_page_id = 'Winning_Margin'
range_score = 'O1:O10000'
spreadsheets_match_score = []
range_names2 = get_page_id + '!' + range_score
result2 = service.spreadsheets().values().get(
    spreadsheetId=spreadsheet_id, range=range_names2,
    valueRenderOption='UNFORMATTED_VALUE').execute()
sheet_output_data2 = result2["values"]
for i, eventao2 in enumerate(sheet_output_data2):
    try:
        spreadsheets_match_score.append(sheet_output_data2[i][0])
    except IndexError:  # empty cell -> empty inner list
        spreadsheets_match_score.append('')
In this case, the list (spreadsheets_match_score) would result in:
["0-0","0-0","4-0","0-1","6-0","","","","0-3","2-2","","","","","0-1","","","3-0","1-1","3-1","","","",""]
My spreadsheet currently has 24 rows, but it will grow without a fixed ending value.
So I tried to use the range without specifying the last line (range_score = 'O1:O'), but it was not accepted; the range needed to specify a final line (range_score = 'O1:O10000').
I put 10000 precisely so I don't have to change it, but this feels very wrong, because it searches a range that doesn't exist, and I'm afraid that in the future it will generate an error.
Is there any way to avoid specifying the last row of the worksheet? It would be something like:
range_score = 'O1:O'
The problem is not in how the range is specified for data collection: you can use either range_score = 'O1:O' or range_score = 'O1:O100000000000' if you are looking for all the rows in the column.
In the case of the question, the problem occurred when line 1 of the desired column had no value: the request failed, but because of the empty ["values"] return.
In short, I was looking for the error in the wrong place.

Understanding Pandas Series Data Structure

I am trying to get my head around the Pandas module and started learning about the Series data structure.
I have created the following Series in Spyder:
songs = pd.Series(data = [145,142,38,13], name = "Count")
I can obtain information about the Series index using the code:
songs.index
The output of the above code is as follows:
RangeIndex(start=0, stop=4, step=1)
My question is: where it states start = 0 and stop = 4, what are these referring to?
I have interpreted start = 0 to mean that the first element in the Series is in row 0.
But I am not sure what the stop value refers to, as there is no element in row 4 of the Series.
Can someone explain?
Thank you.
This concept, already explained adequately in the comments (the stop bound is one past the last index, i.e. equal to the count of items), is prevalent in many places.
For instance, take the list data structure:
z = songs.to_list()
z
# [145, 142, 38, 13]
len(z)
# 4 -- the length is four
# however, indexing stops at position i-1, 'i' being the length/count of items in the list
z[4]  # this will raise an IndexError
# you have to start at index 0 and can only go up to index 3 (i.e. 4 items)
z[0], z[1], z[2], z[-1]  # notice how -1 can be used to directly access the last element

Karate API - how can we get a specific array element whose index is dynamic? I need the value of the last array element, which is not fixed

I read that a variable will not work as an array index when we try to access array[i] or something like that.
So how can I get the value of the last element of the array, which is dynamic? It will change after every API call.
Variables will work in JS. They won't work in JsonPath (e.g. on the match left-hand side); read the docs to understand the difference:
* def foo = [1, 2, 3, 4]
* def size = karate.sizeOf(foo)
* def last = foo[size - 1]
* match last == 4

TypeError: 'DataFrame' object is not callable when concatenating different dataframes of certain types

I keep getting the error below.
I read a file that contains time series data with 3 columns: [meter ID] [daycode (explained later)] [meter reading in kWh]
consum = pd.read_csv("data/File1.txt", delim_whitespace=True, encoding = "utf-8", names =['meter', 'daycode', 'val'], engine='python')
consum.set_index('meter', inplace=True)
test = consum.loc[[1048]]
I will observe meter readings over the full length of data that I have in this file, but first I filter by meter ID.
test['day'] = test['daycode'].astype(str).str[:3]
test['hm'] = test['daycode'].astype(str).str[-2:]
For readability, I convert daycode based on its rule: the first 3 digits are in the range 1 to 730 (365 x 2), and the last 2 digits are in the range 1 to 48. These are 30-minute interval readings spanning 2 years (though not all meters have the full span).
So I created two separate files, one containing the dates and the other the times. I use the index to convert the digits of daycode into the corresponding date & time that these files contain.
#dcodebook index starts from 0. So minus 1 from the daycode before match
dcodebook = pd.read_csv("data/dcode.txt", encoding = "utf-8", sep = '\r', names =['match'])
#hcodebook starts from 1
hcodebook = pd.read_csv("data/hcode.txt", encoding = "utf-8", sep ='\t', lineterminator='\r', names =['code', 'print'])
hcodebook = hcodebook.drop(['code'], axis= 1)
For some weird reason, dcodebook was indexed using the .iloc function as I understood it, but hcodebook needed .loc.
#iloc: by int-position
#loc: by label value
#ix: by both
day_df = dcodebook.iloc[test['day'].astype(int) - 1].reset_index(drop=True)
#to avoid a duplicate index ValueError, create separate dataframes..
hm_df = hcodebook.loc[test['hm'].astype(int) - 1]
#.to_frame error / do I need .reset_index(drop=True)?
The following line is where the code crashes.
datcode_df = day_df(['match']) + ' ' + hm_df(['print'])
print(datcode_df)
print(test)
What I don't understand:
I tested earlier that columns of different dataframes can be merged using simple addition, as seen above.
I initially assigned the result to the existing column ['daycode'] in the test dataframe, so that the previous values would be replaced, and the same error message was returned.
Please advise.
You need both DataFrames to be the same size, so it is necessary that day and hm are unique.
Then call reset_index with drop=True so the indices match, and finally remove the () in the join (use [] to select columns, since calling a DataFrame with () raises the TypeError):
day_df = dcodebook.iloc[test['day'].astype(int) - 1].reset_index(drop=True)
hm_df = hcodebook.loc[test['hm'].astype(int) - 1].reset_index(drop=True)
datcode_df = day_df['match'] + ' ' + hm_df['print']

Update stimulus attribute every ... ms or frame in PsychoPy

I'm trying to update the orientation of a GratingStim every 100 ms or so in the PsychoPy Coder. Currently, I'm updating the attribute (or trying to) with these lines:
orientationArray = orientation.split(',')  # read the csv line as a list
selectOri = 0  # my tool to select the desired value in the list
gabor.ori = int(orientationArray[selectOri])  # select a value as a function of selectOri, in this case always the first one

continueroutine = True
while continueroutine:
    if timer == 0.1:  # this doesn't work, but it shows what is planned
        selectOri = selectOri + 1  # update value
        gabor.ori = int(orientationArray[selectOri])  # update value
    win.flip()
I can't find a proper way to update the attribute at the desired time interval.
A neat way to do something every x frames is to use the modulo operation in combination with a loop containing win.flip(). So if you want to do something every 6 frames (100 ms on a 60 Hz monitor), just do this on every frame:
frame = 0  # the current frame number
while continueroutine:
    if frame % 6 == 0:  # % is modulo; here, every sixth frame
        selectOri = selectOri + 1  # advance to the next orientation
        gabor.ori = int(orientationArray[selectOri])
    # run these on every iteration to synchronize the while-loop with the monitor's frames
    gabor.draw()
    win.flip()
    frame += 1