I have two questions regarding the Google Spreadsheets API using Python. My Google spreadsheet is as follows:
a b
1 2
3 4

5 6
When I run the script below I only get
root@darkbox:~/google_py# python test.py
1
2
3
4
I only want to get column 1, so I want to see
1
3
5
My second issue is that, because there is a blank row between the rows, my script is not picking up the second block (it should pick up 5 in this case).
How can I get only the specified column and skip over the blank row?
#!/usr/bin/env python

import gdata.docs
import gdata.docs.service
import gdata.spreadsheet.service
import re, os

email = 'xxxx@gmail.com'
password = 'passw0rd'

spreadsheet_key = '14cT5KKKWzup1jK0vc-TyZt6BBwSIyazZz0sA_x0M1Bg' # key param
worksheet_id = 'od6' # default
#doc_name = 'python_test'

def main():
    client = gdata.spreadsheet.service.SpreadsheetsService()
    client.debug = False
    client.email = email
    client.password = password
    client.source = 'test client'
    client.ProgrammaticLogin()

    q = gdata.spreadsheet.service.DocumentQuery()
    feed = client.GetSpreadsheetsFeed(query=q)
    feed = client.GetWorksheetsFeed(spreadsheet_key)

    rows = client.GetListFeed(spreadsheet_key, worksheet_id).entry
    for row in rows:
        for key in row.custom:
            print "%s" % (row.custom[key].text)
    return

if __name__ == '__main__':
    main()
To ignore the blank rows:
I suggest you switch to the cell feed (CellFeed). I think the list feed stops reading when it hits a blank row. Sorry, I forget the fine details, but I dropped the list feed and switched to the cell feed a long time ago.
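As a rough sketch of that approach (untested, and assuming this version of gdata exposes cell rows and columns as strings), you could reuse the same client as above, fetch the cell feed, and keep only the cells in column 1; blank rows simply have no cell entries, so they are skipped automatically:

cells = client.GetCellsFeed(spreadsheet_key, worksheet_id).entry
for entry in cells:
    # entry.cell carries the row, column and displayed text of each non-empty cell
    if entry.cell.col == '1' and entry.cell.row != '1':  # column 1, skipping the header row
        print "%s" % entry.cell.text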
I'm trying to add coordinates to a set of addresses that are saved in an Excel file, using the Google Geocoding API. See the code below:
for i, row in df.iterrows():
    # combine the address columns into one variable to push to the Geocoding API
    apiAddress = str(df.at[i, 'adresse1']) + ',' + str(df.at[i, 'postnr']) + ',' + str(df.at[i, 'By'])

    # build a dictionary with the API key and the address info for each request
    parameters = {
        'key': API_KEY,
        'address': apiAddress
    }

    # response from the API, based on the base url plus the parameters above
    response = requests.get(base_url, params=parameters).json()

    # the response is a dictionary; this accesses its geometry part
    geometry = response['results'][0]['geometry']

    # within the geometry part of the response, access the lat and lng respectively
    lat = geometry['location']['lat']
    lng = geometry['location']['lng']

    # append the lat / lng to new columns in the dataframe for each iteration
    df.at[i, 'Geo_Lat_New'] = lat
    df.at[i, 'Geo_Lng_New'] = lng

# print the first 10 rows
print(df.head(10))
The above code works perfectly fine for 20 addresses, but when I try to run it on the entire dataset of 90,000 addresses using iterrows(), I get an IndexError:
File "C:\Users\...", line 29, in <module>
geometry = response['results'][0]['geometry']
IndexError: list index out of range
Using itertuples() instead, with:
for i, row in df.itertuples():
I get a ValueError:
File "C:\Users\...", line 22, in <module>
for i, row in df.itertuples():
ValueError: too many values to unpack (expected 2)
When I use:
for i in df.itertuples():
I get a complicated KeyError that is too long to put here.
Any suggestions on how to properly add coordinates for each address in the entire dataframe?
Update: in the end I found out what the issue was. The Google Geocoding API only handles 50 requests per second. Therefore I used the following code to take a 1-second break after every 49 requests:
if count == 49:
    print('Taking a 1 second break, total count is:', total_count)
    time.sleep(1)
    count = 0
Here count keeps track of the number of loop iterations; as soon as it hits 49, the if statement above is executed, taking a 1-second break and resetting the count back to zero.
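For reference, a rough sketch of where the counter and the pause sit inside the loop (geocode_row here is just a hypothetical stand-in for the requests.get call and response parsing shown earlier; both counters start at zero):

import time

count = 0        # requests since the last pause
total_count = 0  # requests overall

for i, row in df.iterrows():
    lat, lng = geocode_row(row)  # hypothetical helper wrapping the API request above
    df.at[i, 'Geo_Lat_New'] = lat
    df.at[i, 'Geo_Lng_New'] = lng
    count += 1
    total_count += 1
    if count == 49:  # stay under the 50-requests-per-second limit
        print('Taking a 1 second break, total count is:', total_count)
        time.sleep(1)
        count = 0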
Although you have already found the error (the Google API limits the number of requests that can be made), it isn't usually good practice to use for loops with pandas. Therefore, I would rewrite your code to take advantage of pd.DataFrame.apply.
import time

import pandas as pd
import requests


def get_geometry(row: pd.Series, API_KEY: str, base_url: str, tries: int = 0):
    apiAddress = ",".join([str(row["adresse1"]), str(row["postnr"]), str(row["By"])])
    parameters = {"key": API_KEY, "address": apiAddress}
    try:
        response = requests.get(base_url, params=parameters).json()
        geometry = response["results"][0]["geometry"]
    except IndexError:  # reached the request limit
        # Sleep before making the next batch of requests, but beware that
        # consistently hitting the limit could throttle your requests even
        # further. This is why you might want to keep track of how many
        # tries you have already made, so the process stops once a
        # threshold has been met.
        if tries > 3:  # tries > arbitrary threshold
            raise
        time.sleep(1)
        return get_geometry(row, API_KEY, base_url, tries + 1)
    else:
        return geometry["location"]["lat"], geometry["location"]["lng"]


# pass kwargs to the apply function and iterate over every row
lat_lon = df.apply(get_geometry, API_KEY=API_KEY, base_url=base_url, axis=1)
df["Geo_Lat_New"] = lat_lon.apply(lambda latlon: latlon[0])
df["Geo_Lng_New"] = lat_lon.apply(lambda latlon: latlon[1])
I have a very wide Excel sheet, from Column A - DIE (about 2500 columns wide), of survey data. Each column is a question, and each row is a response. I'm trying to upload the data to SQL and convert it to a more SQL-friendly format using the UNPIVOT function, but I can't even get it loaded into SQL because it exceeds the 1024-column limit.
Basically, I have an Excel sheet laid out with one row per respondent and one column per question, and I want to convert it so that each output row holds a single response ID, question, and answer.
What options do I have to make this change, either in Excel (prior to upload) or in SQL (while circumventing the 1024-column limit)?
I have had to do this quite a bit. My solution was to write a Python script that un-crosstabs a CSV file (typically exported from Excel), creating another CSV file. The Python code is here: https://pypi.python.org/pypi/un-xtab/ and the documentation is here: http://pythonhosted.org/un-xtab/. I've never run it on a file with 2500 columns, but I don't see why it wouldn't work.
R has a function for exactly this in one of its libraries (gather, from tidyr). You can also connect to, read from, and write to a database with R. I would suggest downloading R and RStudio.
Here is a working script to get you started that does what you need:
Sample data:
df <- data.frame(id = c(1,2,3), question_1 = c(1,0,1), question_2 = c(2,0,2))
df
Input table:
id question_1 question_2
1 1 1 2
2 2 0 0
3 3 1 2
Code to reshape the data from wide to long format:
df2 <- gather(df, key = question, value = values, -id)
df2
Output:
  id   question values
1  1 question_1      1
2  2 question_1      0
3  3 question_1      1
4  1 question_2      2
5  2 question_2      0
6  3 question_2      2
Some helper functions for you to import and export the csv data:
# Install and load the necessary libraries
install.packages(c('tidyr','readr'))
library(tidyr)
library(readr)
# to read a csv file
df <- read_csv('[some directory][some filename].csv')
# To output the csv file
write.csv(df2, '[some directory]data.csv', row.names = FALSE)
Thanks for all the help. I ended up using Python due to limitations in both SQL (over 1024 columns wide) and Excel (well over 1 million rows in the output). I borrowed the concepts from rd_nielson's code, but that was a bit more complicated than I needed. In case it's helpful to anyone else, this is the code I used. It outputs a csv file with 3 columns and 14 million rows that I can upload to SQL.
import csv

with open('Responses.csv') as f:
    reader = csv.reader(f)
    headers = next(reader)  # capture current field headers
    newHeaders = ['ResponseID', 'Question', 'Response']  # establish new header names

    with open('PythonOut.csv', 'w') as outputfile:
        writer = csv.writer(outputfile, dialect='excel', lineterminator='\n')
        writer.writerow(newHeaders)  # write new headers to output

        QuestionHeaders = headers[1:len(headers)]  # slice the question headers from the original header list
        for row in reader:
            questionCount = 0  # start counter to loop through each question (column) for every response (row)
            while questionCount <= len(QuestionHeaders) - 1:
                newRow = [row[0], QuestionHeaders[questionCount], row[questionCount + 1]]
                writer.writerow(newRow)
                questionCount += 1
I'm trying to make some changes to my dictionary counter in Python, but I'm not making any progress so far. I want my code to show the number of different words.
This is what I have so far:
# import sys module in order to access command line arguments later
import sys

# create an empty dictionary
dicWordCount = {}

# read all words from the file and put them into 'dicWordCount'
# one by one, then count the occurrence of each word
You can use the Counter class from the collections library:
from collections import Counter
q = Counter(fileSource.read().split())
total = sum(q.values())
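Since a Counter is just a dictionary keyed by word, the number of different words is simply its length, so (building on the snippet above) you could print both figures like this:

print("Total words:", total)
print("Different words:", len(q))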
First, for your first problem, add a variable for the word count and one for the different words, so wordCount = 0 and differentWords = 0. In the loop for your file reading, put wordCount += 1 at the top, and in your first if statement put differentWords += 1. You can print these variables out at the end of the program as well.
For the second problem, wrap your printing in the if statement if len(strKey) > 4:.
If you want full example code, here it is.
import sys

fileSource = open(sys.argv[1], "rt")
dicWordCount = {}
wordCount = 0
differentWords = 0

for strWord in fileSource.read().split():
    wordCount += 1
    if strWord not in dicWordCount:
        dicWordCount[strWord] = 1
        differentWords += 1
    else:
        dicWordCount[strWord] += 1

for strKey in sorted(dicWordCount, key=dicWordCount.get, reverse=True):
    if len(strKey) > 4:  # if the word's length is greater than four
        print(strKey, dicWordCount[strKey])

print("Total words: %s\nDifferent words: %s" % (wordCount, differentWords))
For your first question, you can use a set to help you count the number of different words (assuming there is a space between every two words):
text = 'apple boy cat dog elephant fox'
different_word_count = len(set(text.split(' ')))
For your second question, using a dictionary to record the word counts works fine.
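For instance, a minimal sketch of that dictionary approach, assuming the words have already been split into a list called words:

words = 'apple boy cat apple dog apple'.split(' ')

word_count = {}
for word in words:
    # add the word the first time we see it, otherwise increment its count
    word_count[word] = word_count.get(word, 0) + 1

print(word_count)       # {'apple': 3, 'boy': 1, 'cat': 1, 'dog': 1}
print(len(word_count))  # number of different words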
How about this?
# gives the unique word count
unique_words = len(dicWordCount)

total_words = 0
for k, v in dicWordCount.items():
    total_words += v

# gives the total word count
print(total_words)
You don't need a separate variable for counting words since you're using a dictionary; to count the total words, you just need to add up the values of the keys (which are the counts).
I'm currently trying to implement my own source block in GNU Radio. What this source block does is take the values of four arguments (titled Value1, Value2, Value3 and Value4) and convert them to binary values of varying lengths. For example, Value1 is converted into a binary value with a length of 11 bits, while Value2 is converted into a binary value with a length of 16 bits, and so on. The conversion of these values works.
The issue I'm having is that I am unable to output these converted values to a vector sink. When the work function returns len(output_items[0]) and I execute the code using QA code, I end up with an endless loop which, when interrupted, outputs an extremely large vector filled with values of "0L".
Now, from reading similar questions, I understand that since my block is set up as a source, it will keep sending out data as long as the vector sink has enough room in its input buffer. Therefore I set the return value to -1 to force the flow graph to stop. This gives me an empty value "()" in the vector sink.
The code is shown below:
import numpy
import time
import sys
import random
import os

from gnuget import *
from gnuradio import gr
from gnuradio import blocks
from gnuradio.blocks.blocks_swig1 import vector_source_b


class ConvertVariables(gr.sync_block):
    """
    docstring for block ConvertVariables
    """
    def __init__(self, Value1, Value2, Value3, Value4):
        self.value1 = Value1
        self.value2 = Value2
        self.value3 = Value3
        self.value4 = Value4

        gr.sync_block.__init__(self,
            name="ConvertVariables",
            in_sig=None,
            out_sig=[(numpy.byte)])

    def work(self, input_items, output_items):
        out = output_items[0]

        self.value1_R = getvalue1(self.value1)
        self.value2_R = getvalue2(self.value2)
        self.value3_R = getvalue3(self.value3)
        self.value4_R = getvalue4(self.value4)

        self.value1_B = bytearray(self.value1_R, 'utf-8')
        self.value2_B = bytearray(self.value2_R, 'utf-8')
        self.value3_B = bytearray(self.value3_R, 'utf-8')
        self.value4_B = bytearray(self.value4_R, 'utf-8')

        print(type(self.value1_B))
        print(type(self.value2_B))
        print(type(self.value3_B))
        print(type(self.value4_B))
        print(len(output_items), "output_items")

        self.out[:] = [self.value1_B, self.value2_B, self.value3_B, self.value4_B]

        print(len(output_items[0]))
        return (len(output_items[0]))
So ideally, when the arguments (2034, 5000, 50, 5) are sent, the expected result in the vector sink is
('11111110010','0001001110001000','110010','0101')
However, what appears instead is either
(0L,0L,0L,0L)
(The length of the vector is not consistent; if I set the return value of work to len(output_items[0]), the terminal window fills with 0L.)
or
()
This only occurs if the return value of work is set to -1.
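For what it's worth, a common pattern for a finite Python source block (this is only a sketch, with a made-up class name, and it assumes the converted values have already been packed into a flat iterable of byte values passed in as payload) is to copy as much of the remaining data as fits into the output buffer on each call to work, and to return -1 (WORK_DONE) once everything has been sent:

import numpy
from gnuradio import gr

class FiniteByteSource(gr.sync_block):
    """Emit a fixed byte payload once, then stop the flowgraph."""
    def __init__(self, payload):
        gr.sync_block.__init__(self,
            name="FiniteByteSource",
            in_sig=None,
            out_sig=[numpy.byte])
        self.payload = numpy.array(payload, dtype=numpy.byte)
        self.offset = 0

    def work(self, input_items, output_items):
        if self.offset >= len(self.payload):
            return -1  # WORK_DONE: tell the scheduler this source is finished
        out = output_items[0]
        n = min(len(out), len(self.payload) - self.offset)
        out[:n] = self.payload[self.offset:self.offset + n]
        self.offset += n
        return n  # number of items actually produced on this call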
We use grep, cut, sort, uniq, and join at the command line all the time to do data analysis. They work great, although there are shortcomings. For example, you have to give column numbers to each tool. We often have wide files (many columns) and a column header that gives column names. In fact, our files look a lot like SQL tables. I'm sure there is a driver (ODBC?) that will operate on delimited text files, and some query engine that will use that driver, so we could just use SQL queries on our text files. Since doing analysis is usually ad hoc, it would have to be minimal setup to query new files (just use the files I specify in this directory) rather than declaring particular tables in some config.
Practically speaking, what's the easiest? That is, the SQL engine and driver that is easiest to set up and use to apply against text files?
David Malcolm wrote a little tool named "squeal" (formerly "show"), which allows you to use SQL-like command-line syntax to parse text files of various formats, including CSV.
An example on squeal's home page:
$ squeal "count(*)", source from /var/log/messages* group by source order by "count(*)" desc
count(*)|source |
--------+--------------------+
1633 |kernel |
1324 |NetworkManager |
98 |ntpd |
70 |avahi-daemon |
63 |dhclient |
48 |setroubleshoot |
39 |dnsmasq |
29 |nm-system-settings |
27 |bluetoothd |
14 |/usr/sbin/gpm |
13 |acpid |
10 |init |
9 |pcscd |
9 |pulseaudio |
6 |gnome-keyring-ask |
6 |gnome-keyring-daemon|
6 |gnome-session |
6 |rsyslogd |
5 |rpc.statd |
4 |vpnc |
3 |gdm-session-worker |
2 |auditd |
2 |console-kit-daemon |
2 |libvirtd |
2 |rpcbind |
1 |nm-dispatcher.action|
1 |restorecond |
q - Run SQL directly on CSV or TSV files:
https://github.com/harelba/q
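For example (from memory, so double-check q's own documentation for the exact flags; the file and column names here are placeholders), -H tells q that the file has a header line and -d sets the delimiter:

q -H -d ',' "SELECT source, COUNT(*) FROM ./myfile.csv GROUP BY source ORDER BY COUNT(*) DESC"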
Riffing off someone else's suggestion, here is a Python script for sqlite3. A little verbose, but it works.
I don't like having to completely copy the file to drop the header line, but I don't know how else to convince sqlite3's .import to skip it. I could create INSERT statements, but that seems just as bad if not worse.
Sample invocation:
$ sql.py --file foo --sql "select count(*) from data"
The code:
#!/usr/bin/env python
"""Run a SQL statement on a text file"""

import os
import sys
import getopt
import tempfile
import re


class Usage(Exception):
    def __init__(self, msg):
        self.msg = msg


def runCmd(cmd):
    if os.system(cmd):
        print "Error running " + cmd
        sys.exit(1)
        # TODO(dan): Return actual exit code


def usage():
    print >>sys.stderr, "Usage: sql.py --file file --sql sql"


def main(argv=None):
    if argv is None:
        argv = sys.argv
    try:
        try:
            opts, args = getopt.getopt(argv[1:], "h",
                                       ["help", "file=", "sql="])
        except getopt.error, msg:
            raise Usage(msg)
    except Usage, err:
        print >>sys.stderr, err.msg
        print >>sys.stderr, "for help use --help"
        return 2

    filename = None
    sql = None
    for o, a in opts:
        if o in ("-h", "--help"):
            usage()
            return 0
        elif o in ("--file"):
            filename = a
        elif o in ("--sql"):
            sql = a
        else:
            print "Found unexpected option " + o

    if not filename:
        print >>sys.stderr, "Must give --file"
        sys.exit(1)
    if not sql:
        print >>sys.stderr, "Must give --sql"
        sys.exit(1)

    # Get the first line of the file to make a CREATE statement
    #
    # Copy the rest of the lines into a new file (datafile) so that
    # sqlite3 can import data without header. If sqlite3 could skip
    # the first line with .import, this copy would be unnecessary.
    foo = open(filename)
    datafile = tempfile.NamedTemporaryFile()
    first = True
    for line in foo.readlines():
        if first:
            headers = line.rstrip().split()
            first = False
        else:
            print >>datafile, line,
    datafile.flush()
    #print datafile.name
    #runCmd("cat %s" % datafile.name)

    # Create columns with NUMERIC affinity so that if they are numbers,
    # SQL queries will treat them as such.
    create_statement = "CREATE TABLE data (" + ",".join(
        map(lambda x: "`%s` NUMERIC" % x, headers)) + ");"

    cmdfile = tempfile.NamedTemporaryFile()
    #print cmdfile.name
    print >>cmdfile, create_statement
    print >>cmdfile, ".separator ' '"
    print >>cmdfile, ".import '" + datafile.name + "' data"
    print >>cmdfile, sql + ";"
    cmdfile.flush()
    #runCmd("cat %s" % cmdfile.name)
    runCmd("cat %s | sqlite3" % cmdfile.name)


if __name__ == "__main__":
    sys.exit(main())
Maybe write a script that creates an SQLite instance (possibly in memory), imports your data from a file/stdin (accepting your data's format), runs a query, then exits?
Depending on the amount of data, performance could be acceptable.
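A rough sketch of that idea, say with Python's built-in sqlite3 module and an in-memory database (the table name, delimiter and command-line interface here are arbitrary choices for illustration, not an existing tool):

#!/usr/bin/env python
"""Load a delimited text file into an in-memory SQLite database and run one query on it."""
import csv
import sqlite3
import sys

filename, query = sys.argv[1], sys.argv[2]

with open(filename) as f:
    reader = csv.reader(f, delimiter=',')  # adjust the delimiter to match your files
    headers = next(reader)
    rows = list(reader)

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE data (%s)' % ', '.join('"%s"' % h for h in headers))
conn.executemany('INSERT INTO data VALUES (%s)' % ', '.join('?' * len(headers)), rows)

for row in conn.execute(query):
    print(row)

An invocation might then look like: python textsql.py foo.csv "select count(*) from data"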
MySQL has a CSV storage engine that might do what you need, if your files are CSV files.
Otherwise, you can use mysqlimport to import text files into MySQL. You could create a wrapper around mysqlimport, which figures out columns etc. and creates the necessary table.
You might also be able to use DBD::AnyData, a Perl module which lets you access text files like a database.
That said, it sounds a lot like you should really look at using a database. Is it really easier keeping table-oriented data in text files?
I have used Microsoft LogParser to query CSV files several times, and it serves the purpose. It was surprising to see such a useful tool from Microsoft, and a free one at that!