Text not read when using pd.read_csv() on a Google Sheet - pandas

I am trying to read a Google Sheet using pandas pd.read_csv(); however, when a column contains both text cells and numeric cells, the text is not read. My code is:
def build_sheet_url(doc_id, sheet_id):
    return r"https://docs.google.com/spreadsheets/d/{}/gviz/tq?tqx=out:csv&sheet={}".format(doc_id, sheet_id)
sheet_url = build_sheet_url(doc_id, sheet_name)
df = pd.read_csv(sheet_url)
> df
Column1 Column2
0 12 21
1 13 22
2 14 23
3 15 24
This is what the spreadsheet looks like (screenshot omitted; its columns mix text cells with the numeric values shown above).
I have tried using dtype=str and dtype=object but could not get the text to show in my dataframe. Specifying the encoding encoding='utf-8' did not work either.

This is because the gviz query endpoint doesn't support mixed data types; from the documentation:
Data type. Supported data types are string, number, boolean, date, datetime and timeofday. All values of a column will have a data type that matches the column type, or a null value. These types are similar, but not identical, to the JavaScript types.
Use the /export endpoint (or the Drive API endpoint) instead:
https://docs.google.com/spreadsheets/d/[SPREADSHEET_ID]/export?format=[FORMAT]&gid=[SHEET_ID]&range=[A1_NOTATION]
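For example, here is a minimal sketch of the export approach (build_export_url and the doc_id placeholder are hypothetical names; gid is the numeric sheet id, 0 for the first sheet):
import pandas as pd

def build_export_url(doc_id, gid=0):
    # /export serves the sheet verbatim as CSV, so mixed-type columns
    # keep their text cells instead of being nulled out by the gviz query.
    return ("https://docs.google.com/spreadsheets/d/{}/export"
            "?format=csv&gid={}".format(doc_id, gid))

doc_id = "YOUR_SPREADSHEET_ID"  # hypothetical placeholder
df = pd.read_csv(build_export_url(doc_id), dtype=str)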
Related:
Google sheet to pandas via shared link without credentials in python
Query is ignoring string (non numeric) value


Python: Remove exponential in Strings

I have been trying to remove the exponential notation in a string for the longest time, to no avail.
The column contains strings with letters in them as well as long numbers of more than 24 digits. I tried converting the column to string with .astype(str), but the long numbers still read as "1.234123E+23". An example of the table is:
A
345223423dd234324
1.234123E+23
How do I get the table to show the full string of digits in pandas?
b = "1.234123E+23"
str(int(float(b)))
output is '123412299999999992791040'
no idea how to do it in pandas with mixed data type in column
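For the pandas side, here is a minimal sketch (the helper name expand_exponential is mine; it assumes the long numbers were already collapsed to floats when the data was read, so the precision loss shown above is unavoidable at this point):
import pandas as pd

df = pd.DataFrame({"A": ["345223423dd234324", "1.234123E+23"]})

def expand_exponential(value):
    # Rewrite only entries that parse as a float; leave other strings alone.
    try:
        return str(int(float(value)))
    except ValueError:
        return value

df["A"] = df["A"].astype(str).map(expand_exponential)
print(df)
If the data originally comes from a CSV, reading it with dtype=str from the start keeps the full digit strings and avoids the float round-trip entirely.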

Pandas dataframe selection df['a'][50][:51]

I have a dataframe where one of the column names is 'a'.
I came across the following selection expression:
dataframe['a'][50][:50]
I understand dataframe['a'][50] selects the value at index label 50 in column 'a', but what does [:50] do?
Thank you
If dataframe['a'][50][:50] doesn't error out and actually returns something, it means the value at index label 50 in column 'a' is an iterable (more precisely, a sequence type) such as a list, string, or tuple.
dataframe['a'][50][:50] returns elements 0 through 49 of that sequence.
As said above, if that value is not a sequence type, you will get an error. Check dataframe['a'][50] to see whether it is one.
Note: dataframe['a'][50] is chained indexing, which is not recommended. That is out of the scope of this question, though, so I won't go into detail here.
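A small runnable sketch of both points (the data here is made up):
import pandas as pd

# Column 'a' holds strings, i.e. sequence types that support slicing.
df = pd.DataFrame({"a": ["x" * 100 for _ in range(60)]})

value = df["a"][50]   # the value stored at index label 50 (a 100-char string)
print(value[:50])     # its first 50 characters

# Equivalent without chained indexing:
print(df.loc[50, "a"][:50])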

Replacing substrings based on lists

I am trying to replace substrings in a data frame using the lists name and lemma. As long as I enter the lists manually, the code delivers the result in the dataframe m:
name = ['Charge', 'charge', 'Prepaid']
lemma = ['Hallo', 'hallo', 'Hi']
m = sdf.replace(regex=name, value=lemma)
As soon as I read both lists in from an Excel file, my code no longer replaces the substrings. I need to use an Excel file, since the lists live in one very large table.
sdf = pd.read_excel('training_data.xlsx')
synonyms = pd.read_excel('synonyms.xlsx')
lemma = synonyms['lemma'].tolist()
name = synonyms['name'].tolist()
m = sdf.replace(regex=name, value=lemma)
Thanks for your help!
df.replace()
Replace values given in to_replace with value.
Values of the DataFrame are replaced with other values dynamically. This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.
In short, this method replaces values; it won't make changes at the Series level.
This may achieve what you want:
m = sdf.replace(regex=synonyms['name'].tolist(), value=synonyms['lemma'].tolist())
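One possible culprit when the lists come from Excel is empty cells, which pandas reads as NaN floats and which a regex replace then chokes on. Here is a hedged sketch that cleans the lists first (the dropna step is an assumption about your data, not something the question confirms):
import pandas as pd

sdf = pd.read_excel('training_data.xlsx')
synonyms = pd.read_excel('synonyms.xlsx')

# Empty Excel cells come back as NaN (a float); drop those rows and
# force everything to str so replace() receives clean pattern/value lists.
synonyms = synonyms.dropna(subset=['name', 'lemma'])
name = synonyms['name'].astype(str).tolist()
lemma = synonyms['lemma'].astype(str).tolist()

m = sdf.replace(regex=name, value=lemma)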
If you are just trying to replace 'Charge' with 'Hallo', 'charge' with 'hallo', and 'Prepaid' with 'Hi', then you can use replace(), passing the list of words to find as the first argument and the list of replacements as the value keyword argument.
Try this:
df = df.replace(name, value=lemma)
Example:
import pandas as pd

name = ['Charge', 'charge', 'Prepaid']
lemma = ['Hallo', 'hallo', 'Hi']
df = pd.DataFrame([['Bob', 'Charge', 'E333', 'B442'],
                   ['Karen', 'V434', 'Prepaid', 'B442'],
                   ['Jill', 'V434', 'E333', 'charge'],
                   ['Hank', 'Charge', 'E333', 'B442']],
                  columns=['Name', 'ID_First', 'ID_Second', 'ID_Third'])
df = df.replace(name, value=lemma)
print(df)
Output:
Name ID_First ID_Second ID_Third
0 Bob Hallo E333 B442
1 Karen V434 Hi B442
2 Jill V434 E333 hallo
3 Hank Hallo E333 B442

Using Pyspark to convert column from string to timestamp

I have a pyspark dataframe with 2 columns (Violation_Time, Time_First_Observed) which are captured as strings. A sample of the data is below; times are captured as HHmm with "A" or "P" representing am or pm. The data also has errors where some entries exceed 24HH.
Violation_Time Time_First_Observed
0830A 1600P
1450P 0720A
1630P 2540P
0900A 0100A
I would like to use pyspark to remove the "A" and "P" from both columns and subsequently convert the data (e.g., 0800, 1930, etc.) into a timestamp for analysis purposes. I have tried to do this for the "Violation_Time" column, creating a new column "timestamp" to store it (see code below), but I can't seem to get it to work. Any form of help is appreciated, thank you.
sparkdf3.withColumn('timestamp',F.to_timestamp("Violation_Time", "HH"))
sparkdf3.select(['Violation_Time','timestamp']).show()
You can use the following
sparkdf3 = sparkdf3.withColumn('timestamp', func.split(func.to_timestamp('Violation_Time', 'HHmm'), ' ').getItem(1))
sparkdf3.select(['Violation_Time','timestamp']).show()
Explanation
sparkdf3.withColumn('timestamp',
    func.split(
        func.to_timestamp('Violation_Time', 'HHmm')  # parse the string into a timestamp (rendered as "yyyy-MM-dd HH:mm:ss" when cast back to a string)
        , ' '
    ).getItem(1)  # split on the space and take the second item, i.e. the time-of-day part
)
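Alternatively, following the question's own plan (strip the trailing letter, then parse as HHmm), here is a minimal sketch; entries that exceed 24 hours, such as 2540, won't parse as HHmm (depending on your Spark version and parser settings they come back null or raise an error), which flags the bad rows:
from pyspark.sql import functions as F

for c in ['Violation_Time', 'Time_First_Observed']:
    sparkdf3 = sparkdf3.withColumn(
        c, F.regexp_replace(F.col(c), '[AP]$', '')   # drop the trailing A/P
    ).withColumn(
        c + '_ts', F.to_timestamp(F.col(c), 'HHmm')  # invalid hours like 2540 fail to parse
    )

sparkdf3.select('Violation_Time', 'Violation_Time_ts').show()
Note that this discards the am/pm information, which is what the question asks for; the sample data mixes 24-hour values with A/P markers, so the markers can't be trusted anyway.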

Why can't I access df data in pandas?

I have a table where the column names are not really organized: different years of data sit under different column headers. So I have to access each piece of data through a specific column name.
I am using this syntax to access a column.
df = df[["2018/12"]]
But when I just want to extract numbers under that column, using
df.iloc[0,0]
it throws an error like
single positional indexer is out-of-bounds
So I am using
df.loc[0]
but it has the column name with the numeric data.
How can I extract just the number of each row?
Below is the CSV data
Closing Date,2014/12,2015/12,2016/12,2017/12,2018/12,Trend
Net Sales,"31,634","49,924","62,051","68,137","72,590",
""
Net increase,"-17,909","-16,962","-34,714","-26,220","-29,721",
Net Received,-,-,-,-,-,
Net Paid,-328,"-6,038","-9,499","-9,375","-10,661",
Assuming you have the following data frame df imported from your csv:
Closing Date 2014/12 2015/12 2016/12 2017/12 2018/12
0 Net Sales 31,634 49,924 62,051 68,137 72,590
1 Net increase -17,909 -16,962 -34,714 -26,220 -29,721
2 Net Received - - - - -
3 Net Paid -328 -6,038 -9,499 -9,375 -10,661
then by doing df = df[["2018/12"]] you create a new data frame with one column, and df.iloc[0,0] will work perfectly well here, returning 72,590. If you wrote df = df["2018/12"], you'd create a new Series, and here df.iloc[0,0] would throw the error 'too many indexers', because it's a one-dimensional Series.
Anyway, if you need the values of a Series, use the values attribute (or to_numpy() for version 0.24 or later) to get the data as an array, or to_list() to get them as a list.
But I guess what you really want is to have your table transposed:
df = df.set_index('Closing Date').T
to the following more logical form:
Closing Date Net Sales Net increase Net Received Net Paid
2014/12 31,634 -17,909 - -328
2015/12 49,924 -16,962 - -6,038
2016/12 62,051 -34,714 - -9,499
2017/12 68,137 -26,220 - -9,375
2018/12 72,590 -29,721 - -10,661
Here, df.loc['2018/12','Net Sales'] gives you 72,590 etc.
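Putting it together, here is a minimal sketch (thousands=',' and na_values='-' are my assumptions about how you'd want the comma-grouped numbers and the '-' placeholders parsed; the stray empty row from the CSV is omitted):
import pandas as pd
from io import StringIO

csv_data = """Closing Date,2014/12,2015/12,2016/12,2017/12,2018/12,Trend
Net Sales,"31,634","49,924","62,051","68,137","72,590",
Net increase,"-17,909","-16,962","-34,714","-26,220","-29,721",
Net Received,-,-,-,-,-,
Net Paid,-328,"-6,038","-9,499","-9,375","-10,661",
"""

# thousands=',' turns "31,634" into 31634; na_values='-' turns the dashes into NaN.
df = pd.read_csv(StringIO(csv_data), thousands=',', na_values='-')
df = df.drop(columns=['Trend']).set_index('Closing Date').T

print(df.loc['2018/12', 'Net Sales'])  # 72590.0
With the numbers parsed as actual numerics, you can aggregate and plot directly instead of carrying comma-formatted strings around.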