xlwings excel selection to dataframe - error with the date column - pandas

I am trying to write a UDF in python/Excel using xlwings. I have time-series data in three columns in a spreadsheet of the form:
Date        Hour  Value
01/11/2017  1     43.1
01/11/2017  2     41.8
01/11/2017  3     38.6
01/11/2017  4     38.6
01/11/2017  5     38.6
And I want to be able to select this range, manipulate it in several ways (monthly averages etc.) using pandas groupby functions, then output the results back into a new spreadsheet. This code works for me:
import pandas as pd
import xlwings as xw

@xw.sub
def get_df_from_range():
    """Read the current selection into a DataFrame and echo it back."""
    # make the current selection a dataframe
    wb = xw.Book.caller()
    df = wb.selection.options(pd.DataFrame, index=False).value
    # simple check: add a sheet and print the dataframe
    sht = wb.sheets.add()
    sht.range('A1').options(index=False).value = df
However, as soon as I try to do any manipulation of the dataframe before printing it, I get error messages. For example:
@xw.sub
def get_df_from_range():
    """Read the current selection into a DataFrame, convert Date, and echo it back."""
    # make the current selection a dataframe
    wb = xw.Book.caller()
    df = wb.selection.options(pd.DataFrame, index=False).value
    # simple manipulation task
    df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
    # add a sheet and print the dataframe
    sht = wb.sheets.add()
    sht.range('A1').options(index=False).value = df
This gives an error message:
Run-time error '2147467259 (80004005)':
AttributeError: 'tuple' object has no attribute 'lower'
if value.lower() in _unit_map:
File
"C:\User\AppData\Local\Programs\Python\Python36-32
line 441, in f
unit = {k: f(k) for k in arg.keys()}
I thought I could debug this better by writing the same code, but not as a UDF: just running it in Spyder and connecting to the spreadsheet, so I would have the df variable in my Variable Explorer. But when I wrote almost exactly the same code, it did not give me an error message:
wb = xw.Book("my_spreadsheet.xlsm")
df = wb.selection.options(pd.DataFrame, index=False).value
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
I am really stuck as to why this would be. Can someone please help?
I should note that I am aware xlwings automatically reads Excel dates as datetime64[ns]. That isn't the point I am trying to make: I want to do other things with the dataframe (e.g. left join it to another dataframe), and all those other tasks also fail when I try the UDF method but work fine when I just connect to the spreadsheet from Spyder. I am hoping that if I can get that one "simple manipulation task" to work, then all the other tasks may also work.
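For reference, a minimal sketch of a debugging sub one might use to see exactly what the UDF received, writing the column names and dtypes to a fresh sheet (illustrative only; the function name is made up):

@xw.sub
def inspect_selection():
    """Dump the selection's column names and dtypes to a new sheet (debug aid)."""
    wb = xw.Book.caller()
    df = wb.selection.options(pd.DataFrame, index=False).value
    sht = wb.sheets.add()
    # A tuple showing up among the column names would explain the
    # "'tuple' object has no attribute 'lower'" error from pd.to_datetime.
    sht.range('A1').value = repr(list(df.columns))
    sht.range('A2').value = repr([str(t) for t in df.dtypes])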

Related

Working on multiple data frames with data for NBA players during the season, how can I modify all the dataframes at the same time?

I have a list of 16 dataframes that contain stats for each player in the NBA during the respective season. My end goal is to run unsupervised learning algorithms on the data frames. For example, I want to see if I can determine a player's position by their stats or if I can determine their total points during the season based on their stats.
What I would like to do is modify the list (df_list) of these dataframes, unless there's a better solution, instead of modifying each dataframe individually, to:
Change the datatype of the MP (minutes played) column from str to int.
Filter each dataframe so there are only players with 1000 or more MP and no duplicate players (Rk).
(For instance, in a season a player (Rk) can play for three teams and have 200 MP, 300 MP, and 400 MP with each team. He'll have a row for each team and a row called TOT, which renders his MP as 900 (200+300+400), for a total of four rows in the dataframe. I only need the TOT row.)
Do simple algebra with various individual columns, for example: totalling the MP column and the PTS column and then dividing the sum of the PTS column by the sum of the MP column.
Or dividing the total of the PTS column by the length of the PTS column.
What I've done so far is this:
Import my libraries and create 16 dataframes using pd.read_html(url).
The first dataframes were created using two lines of code:
url = "https://www.basketball-reference.com/leagues/NBA_1997_totals.html"
ninetysix = pd.read_html(url)[0]
HOWEVER, the next four data frames had to be created using a few additional lines of code (I received an error that said "html5lib not found, please install it", so I downloaded both html5lib and requests). I say that to say... this distinction in how the DFs were created may have to be considered in a solution.
The code I used:
import requests
import uuid
url = 'https://www.basketball-reference.com/leagues/NBA_1998_totals.html'
cookies = {'euConsentId': str(uuid.uuid4())}
html = requests.get(url, cookies=cookies).content
ninetyseven = pd.read_html(html)[0]
These four data frames look like this: [screenshot not included]
I tried this but it didn't do anything:
df_list = [
    eightyfour, eightyfive, eightysix, eightyseven,
    eightyeight, eightynine, ninety, ninetyone,
    ninetytwo, ninetyfour, ninetyfive,
    ninetysix, ninetyseven, ninetyeight, owe_one, owe_two
]

for df in df_list:
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    df = list(df[df['MP'] >= 1000]['Rk'])
    df = df[df['Rk'].isin(df)]
owe_two
============================UPDATE===================================
This code solves a portion of problem #2:
url = 'https://www.basketball-reference.com/leagues/NBA_1997_totals.html'
dd = pd.read_html(url)[0]
dd = dd[dd['Rk'].ne('Rk')]
dd['MP'] = dd['MP'].astype(int)
players_1000_rk_list = list(dd[dd['MP'] >= 1000]['Rk'])
players_dd = dd[dd['Rk'].isin(players_1000_rk_list)]
But it doesn't remove the duplicates.
==================== UPDATE 10/11/22 ================================
Let's say I take the rows with the value "TOT" in the "Tm" column and create a new DF with them...
could I then compare the new DF with the original data frame and remove names from the original data IF they match names from the new data frame?
The problem is that the df you are working on in the loop is not the same df that is in df_list. You could solve this by saving the new df back to the list, overwriting the old df:
for i, df in enumerate(df_list):
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    df = list(df[df['MP'] >= 1000]['Rk'])
    df = df[df['Rk'].isin(df)]
    df_list[i] = df
These two lines are probably wrong as well:
df = list(df[df['MP'] >= 1000]['Rk'])
df = df[df['Rk'].isin(df)]
Perhaps you want this:
for i, df in enumerate(df_list):
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    # df = list(df[df['MP'] >= 1000]['Rk'])
    # df = df[df['Rk'].isin(df)]
    # keep just the rows where MP >= 1000
    df_list[i] = df[df['MP'] >= 1000]
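As a side note on the duplicate-Rk part of the question, here is a hedged sketch of one way to do the dedup, assuming (as it appears on these pages) that the TOT row is listed before the per-team rows for traded players:

import pandas as pd

url = 'https://www.basketball-reference.com/leagues/NBA_1997_totals.html'
dd = pd.read_html(url)[0]
dd = dd[dd['Rk'].ne('Rk')]        # drop the repeated header rows
dd['MP'] = dd['MP'].astype(int)
# One row per player: keep each Rk's first row, which is the TOT row
# for traded players if TOT is listed before the per-team rows.
dd = dd.drop_duplicates(subset='Rk', keep='first')
players_dd = dd[dd['MP'] >= 1000]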

How do I do an OLS on GLS time series regression in python?

I am attempting to transfer my team's EViews code to Python and I got stuck with the following line in EViews:
equation eq_LSTrend.ls(cov=hac) log({Price}) = c(1)*@trend + c(2)
Here, a regression over a certain time window is to be performed on log(price), and the slope c(1) on the time trend as well as the intercept c(2) have to be determined.
Let's say I have the following df:
import pandas as pd
Range = pd.date_range('1990-01-01', periods=8, freq='D')
log_price = [5.0835, 5.0906, 5.0946, 5.0916, 5.0825, 5.0833, 5.0782, 5.0709]
df = pd.DataFrame({ 'Date': Range, 'Log Price': log_price })
df.set_index('Date', inplace=True)
And the df looks like this:
            Log Price
Date
1990-01-01     5.0835
1990-01-02     5.0906
1990-01-03     5.0946
1990-01-04     5.0916
1990-01-05     5.0825
1990-01-06     5.0833
1990-01-07     5.0782
1990-01-08     5.0709
How could I, for example, take a rolling 5-period window, do an OLS or GLS analysis, and get the parameters I want (the slope and the intercept)?
Also, which library would be appropriate for it (statsmodels, or maybe some other library)?
Ideally, the code would look something like this:
df_window = df.rolling(window = 5)
slope_output = sm.GLS(df_window).slope
or if separate columns have to be provided as an input (in this case I would leave "Date" as a separate column in df)
df_window = df.rolling(window = 5)
slope_output = sm.GLS(depend_var = df_window["Log Price"], independ_var = df_window["Date"]).slope
I am quite new to python so please pardon my bad coding!
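Not a definitive translation of the EViews line, but a minimal sketch of a rolling time-trend OLS with statsmodels (the trend column built here stands in for @trend; the window length of 5 matches the question):

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.regression.rolling import RollingOLS

Range = pd.date_range('1990-01-01', periods=8, freq='D')
log_price = [5.0835, 5.0906, 5.0946, 5.0916, 5.0825, 5.0833, 5.0782, 5.0709]
df = pd.DataFrame({'Log Price': log_price}, index=Range)

# Regress log price on a linear time trend (0, 1, 2, ...) plus a constant.
trend = pd.Series(np.arange(len(df)), index=df.index, name='trend')
X = sm.add_constant(trend)
res = RollingOLS(df['Log Price'], X, window=5).fit()
print(res.params)  # one row per window end: 'const' (intercept) and 'trend' (slope)

For the cov=hac option, a plain sm.OLS(y, X).fit(cov_type='HAC', cov_kwds={'maxlags': 1}) gives HAC standard errors on a single window; I am not certain RollingOLS exposes HAC, so that part may need a manual loop over windows.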

How to convert working pandas code to a dask code?

I have two dates in YYYYMM format:
date1 = 203201
date2 = 201204
I have a dataframe [testdf] with 235 million rows which contains a date variable 'DATE_TO_COMPARE' that I need to compare with the above two dates for a filter.
I need to filter this dataframe as follows:
# Step 1: Create two date variables in the dataframe for comparison purposes
testdf['date1'] = pd.to_datetime(testdf['date1'], format = '%Y%m', errors='ignore')
testdf['date2'] = pd.to_datetime(testdf['date2'], format = '%Y%m', errors='ignore')
# Step 2: Apply the filter
testdf_filtered = testdf[(testdf['DATE_TO_COMPARE'] <= testdf['date1']) &
                         (testdf['DATE_TO_COMPARE'] > testdf['date2'])]
Problem is, the above operations take 70 years to execute on 235 million rows :--)
So I recently realized I have multiple cores on my PC, a sexy 5 cores lol. I did some research and read about, drumroll... DASK!
So here I am trying to daskize this code as follows:
# Daskize pandas dataframe
import dask as dd
ddata = dd.from_pandas(testdf, npartitions=5)
# Step 1: Create two date variables in the dataframe for comparison purposes
ddata['date1'] = pd.to_datetime(ddata['date1'], format='%Y%m', errors='ignore')
ddata['date2'] = pd.to_datetime(ddata['date2'], format='%Y%m', errors='ignore')
# Step 2: Apply the filter
ddata_filtered = ddata[(ddata['DATE_TO_COMPARE'] <= ddata['date1']) &
                       (ddata['DATE_TO_COMPARE'] > ddata['date2'])]
# Re-Pandize Daskized dataframe
testdf_filtered = ddata_filtered.compute(scheduler='processes')
I obviously run into a host of errors in the dask code! Example:
TypeError: 'DataFrame' object does not support item assignment etc.
Any education/advice/example will be much appreciated. Thanks.
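For what it's worth, here is a sketch of how this might look with the dask.dataframe API, assuming date1 and date2 are the two scalar YYYYMM cut-offs from the top of the question and DATE_TO_COMPARE is already a datetime column (both assumptions on my part):

import dask.dataframe as dd  # the dataframe API lives in dask.dataframe, not the top-level dask module
import pandas as pd

# Parse the two scalar cut-offs once, outside the dataframe.
d1 = pd.to_datetime(str(203201), format='%Y%m')
d2 = pd.to_datetime(str(201204), format='%Y%m')

ddata = dd.from_pandas(testdf, npartitions=5)  # testdf as in the question

# Comparing against scalars sidesteps item assignment on the dask frame entirely.
ddata_filtered = ddata[(ddata['DATE_TO_COMPARE'] <= d1) &
                       (ddata['DATE_TO_COMPARE'] > d2)]
testdf_filtered = ddata_filtered.compute(scheduler='processes')

If new columns really are needed, ddata = ddata.assign(date1=...) is the dask-friendly spelling rather than bracket assignment, which is what triggers the "'DataFrame' object does not support item assignment" error on older dask versions.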

Pandas - Appending data from one Dataframe to another

I have a Dataframe (called df) that has a list of tickets worked for a given date. I have a script that runs each day in which this df gets generated, and I would like a new master dataframe (let's say df_master) that appends the values from df. So any time I view df_master I should be able to see all the tickets worked across multiple days. I would also like a new column in df_master that shows the date when each row was inserted.
Given below is how df looks:
1001
1002
1003
1004
I tried to perform concat but it threw an error:
TypeError: first argument must be an iterable of pandas objects, you passed an object of type "Series"
Update
df_ticket = tickets['ticket']
df_master = df_ticket
df_master['Date'] = pd.Timestamp('now').normalize()
L = [df_master,tickets]
master_df = pd.concat(L)
master_df.to_csv('file.csv', mode='a', header=False, index=False)
I think you need to pass a sequence to concat; a list is typically used:
objs : a sequence or mapping of Series, DataFrame, or Panel objects
If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised
L = [s1,s2]
df = pd.concat(L)
And it seems you passed only a Series, so the error was raised:
df = pd.concat(s)
To insert a Date column you can set pd.Timestamp('now').normalize(); for the master df I suggest creating one file and appending each day's DataFrame:
df_ticket = tickets[['ticket']]
df_ticket['Date'] = pd.Timestamp('now').normalize()
df_ticket.to_csv('file.csv', mode='a', header=False, index=False)
df_master = pd.read_csv('file.csv', header=None)
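One small note: because the CSV is written without a header, the read-back might name the columns explicitly (the names here are assumed from the example):

df_master = pd.read_csv('file.csv', header=None,
                        names=['ticket', 'Date'], parse_dates=['Date'])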

Parse JSON to Excel - Pandas + xlwt

I'm about halfway through this functionality. However, I need some help with formatting the data in the sheet that contains the output.
My current code...
response = {"sic2":[{"confidence":1.0,"label":"73"}],"sic4":[{"confidence":0.5,"label":"7310"}],"sic8":[{"confidence":0.5,"label":"73101000"},{"confidence":0.25,"label":"73102000"},{"confidence":0.25,"label":"73109999"}]}
# Create a Pandas dataframe from the data.
df = pd.DataFrame.from_dict(json.loads(response), orient='index')
# Create a Pandas Excel writer using XlsxWriter as the engine.
writer = pd.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter')
# Convert the dataframe to an XlsxWriter Excel object.
df.to_excel(writer, sheet_name='Sheet1')
# Close the Pandas Excel writer and output the Excel file.
writer.save()
The output is as follows... [screenshot not included]
What I want is something like this... [screenshot not included]
I suppose that first I would need to extract and organise the headers.
This would also include manually assigning a header for a column that cannot have one by default, as in the case of the SIC column.
After that, I can feed data into the columns under their respective headers.
You can loop over the keys of your json object and create a dataframe from each, then use pd.concat to combine them all:
import json
import pandas as pd

response = '{"sic2":[{"confidence":1.0,"label":"73"}],"sic4":[{"confidence":0.5,"label":"7310"}],"sic8":[{"confidence":0.5,"label":"73101000"},{"confidence":0.25,"label":"73102000"},{"confidence":0.25,"label":"73109999"}]}'
json_data = json.loads(response)

all_frames = []
for k, v in json_data.items():
    df = pd.DataFrame(v)
    df['SIC Category'] = k
    all_frames.append(df)

final_data = pd.concat(all_frames).set_index('SIC Category')
print(final_data)
This prints:
              confidence     label
SIC Category
sic2                1.00        73
sic4                0.50      7310
sic8                0.50  73101000
sic8                0.25  73102000
sic8                0.25  73109999
This can then be exported to Excel as before, via final_data.to_excel(writer, sheet_name='Sheet1').
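For completeness, a minimal version of that export, reusing the writer pattern from the question (the context-manager form closes the file automatically, so no explicit save call is needed):

with pd.ExcelWriter('pandas_simple.xlsx', engine='xlsxwriter') as writer:
    final_data.to_excel(writer, sheet_name='Sheet1')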