I have two dates in YYYYMM format:
date1 = 203201
date2 = 201204
I have a dataframe [testdf] with 235 million rows that contains a date variable 'DATE_TO_COMPARE', which I need to compare against these two dates for a filter.
I need to filter this dataframe as follows:
# Step 1: Create two date variables in the dataframe for comparison purposes
testdf['date1'] = pd.to_datetime(testdf['date1'], format = '%Y%m', errors='ignore')
testdf['date2'] = pd.to_datetime(testdf['date2'], format = '%Y%m', errors='ignore')
# Step 2: Apply the filter
testdf_filtered = testdf[(testdf['DATE_TO_COMPARE'] <= testdf['date1']) & \
(testdf['DATE_TO_COMPARE'] > testdf['date2'])]
The problem is that the above operations take roughly 70 years to execute on 235 million rows :-)
So I recently realized I have multiple cores on my PC (5 of them), did some research, and read about... drumroll... DASK!
So here I am trying to daskize this code as follows:
# Daskize pandas dataframe
import dask.dataframe as dd
ddata = dd.from_pandas(testdf, npartitions=5)
# Step 1: Create two date variables in the dataframe for comparison purposes
ddata['date1'] = pd.to_datetime(ddata['date1'], format = '%Y%m', errors='ignore')
ddata['date2'] = pd.to_datetime(ddata['date2'], format = '%Y%m', errors='ignore')
# Step 2: Apply the filter
ddata_filtered = ddata[(ddata['DATE_TO_COMPARE'] <= ddata['date1']) & \
(ddata['DATE_TO_COMPARE'] > ddata['date2'])]
# Re-Pandize Daskized dataframe
testdf_filtered = ddata_filtered.compute(scheduler='processes')
I obviously run into a host of errors in the Dask code. For example:
TypeError: 'DataFrame' object does not support item assignment etc.
Any education/advice/example will be much appreciated. Thanks.
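One way this filter might be expressed (a rough sketch, not a tested answer, and assuming DATE_TO_COMPARE is already a datetime64 column): since date1 and date2 are scalars, convert them once with pd.to_datetime and compare them directly against the column, which avoids assigning new columns and therefore the item-assignment error. Note also that from_pandas lives in dask.dataframe, not in the top-level dask namespace.
import dask.dataframe as dd
import pandas as pd
# convert the two scalar bounds once (YYYYMM -> Timestamp)
d1 = pd.to_datetime(str(date1), format='%Y%m')   # 203201 -> 2032-01-01
d2 = pd.to_datetime(str(date2), format='%Y%m')   # 201204 -> 2012-04-01
# partition the existing pandas dataframe and build the filter lazily
ddata = dd.from_pandas(testdf, npartitions=5)
mask = (ddata['DATE_TO_COMPARE'] <= d1) & (ddata['DATE_TO_COMPARE'] > d2)
# bring the filtered result back to pandas
testdf_filtered = ddata[mask].compute(scheduler='processes')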
Related
I have 12 months of sales data, one file for each month. I want to analyze the dataset as a whole.
I have tried using the concat function, but it produces NaN (not a number) in my dataframe fields.
In R, the cbind function solves this. How do I approach this differently in Python?
I tried using the concat function to bind the rows because all the column names are the same across the datasets.
What other options can I explore?
sales_1 = pd.read_csv('Sales_January_2019.csv')
sales_2 = pd.read_csv('Sales_February_2019.csv')
sales_3 = pd.read_csv('Sales_March_2019.csv')
sales_4 = pd.read_csv('Sales_April_2019.csv')
sales_5 = pd.read_csv('Sales_May_2019.csv')
sales_6 = pd.read_csv('Sales_June_2019.csv')
sales_7 = pd.read_csv('Sales_July_2019.csv')
sales_8 = pd.read_csv('Sales_August_2019.csv')
sales_9 = pd.read_csv('Sales_September_2019.csv')
sales_10 = pd.read_csv('Sales_October_2019.csv')
sales_11 = pd.read_csv('Sales_November_2019.csv')
sales_12 = pd.read_csv('Sales_December_2019.csv')
I expect all the data frames to be merged into one since the column names are the same for all.
Perhaps:
# use concat with the list of DataFrames you already read in to combine them into a single DataFrame
pd.concat([sales_1, sales_2, sales_3, sales_4, sales_5, sales_6, sales_7, sales_8, sales_9, sales_10, sales_11, sales_12])
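A variation on the same idea (a sketch, assuming the twelve CSVs sit in the working directory and really do share identical column names): build the list with glob instead of twelve separate read_csv calls, and pass ignore_index=True so the combined frame gets a fresh 0..n-1 row index. If NaN values still appear after concatenating, the usual culprit is column names that differ slightly between files (for example trailing spaces), which is worth checking with df.columns on each frame.
import glob
import pandas as pd
# read every monthly file that matches the naming pattern from the question
monthly = [pd.read_csv(f) for f in glob.glob('Sales_*_2019.csv')]
# stack the rows and rebuild a clean index
all_sales = pd.concat(monthly, ignore_index=True)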
I have a list of 16 dataframes that contain stats for each player in the NBA during the respective season. My end goal is to run unsupervised learning algorithms on the data frames. For example, I want to see if I can determine a player's position by their stats or if I can determine their total points during the season based on their stats.
What I would like to do is modify the list (df_list) of these dataframes, unless there's a better solution, instead of modifying each dataframe individually, to:
Change the datatype of the MP (minutes played) column from str to int.
Modify the dataframe so that it only contains players with 1000 or more MP and there are no duplicate players (Rk).
(For instance, a player (Rk) can play for three teams in a season and have 200 MP, 300 MP, and 400 MP with each team. He'll have a row for each team plus a row called TOT, which renders his MP as 900 (200+300+400), for a total of four rows in the dataframe. I only need the TOT row.)
Use simple algebra with various individual columns, for example: totaling the MP column and the PTS column and then dividing the sum of the PTS column by the sum of the MP column.
Or dividing the total of the PTS column by the len of the PTS column.
What I've done so far is this:
Import my libraries and create 16 dataframes using pd.read_html(url).
The first dataframes were created using two lines of code:
url = "https://www.basketball-reference.com/leagues/NBA_1997_totals.html"
ninetysix = pd.read_html(url)[0]
HOWEVER, the next four data frames had to be created using a few additional lines of code (I received an error that said "html5lib not found, please install it", so I installed both html5lib and requests). I mention this because the distinction in how these DFs were created may have to be considered in a solution.
The code I used:
import requests
import uuid
url = 'https://www.basketball-reference.com/leagues/NBA_1998_totals.html'
cookies = {'euConsentId': str(uuid.uuid4())}
html = requests.get(url, cookies=cookies).content
ninetyseven = pd.read_html(html)[0]
These four data frames look like this:
I tried this but it didn't do anything:
df_list = [
eightyfour, eightyfive, eightysix, eightyseven,
eightyeight, eightynine, ninety, ninetyone,
ninetytwo, ninetyfour, ninetyfive,
ninetysix, ninetyseven, ninetyeight, owe_one, owe_two
]
for df in df_list:
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    df = list(df[df['MP'] >= 1000]['Rk'])
    df = df[df['Rk'].isin(df)]
owe_two
============================UPDATE===================================
This code solves a portion of problem #2:
url = 'https://www.basketball-reference.com/leagues/NBA_1997_totals.html'
dd = pd.read_html(url)[0]
dd = dd[dd['Rk'].ne('Rk')]
dd['MP'] = dd['MP'].astype(int)
players_1000_rk_list = list(dd[dd['MP'] >= 1000]['Rk'])
players_dd = dd[dd['Rk'].isin(players_1000_rk_list)]
But it doesn't remove the duplicates.
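One possible way to get rid of them (a sketch, continuing from the players_dd frame above, and relying on the key argument of sort_values, which needs pandas 1.1+): when a player (Rk) appears on more than one row, keep only his TOT row; players with a single row are kept as-is.
# sort so that TOT rows come first within the frame, then keep the first row per Rk
players_dd = players_dd.sort_values('Tm', key=lambda s: s.ne('TOT'))
players_dd = players_dd.drop_duplicates(subset='Rk', keep='first')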
==================== UPDATE 10/11/22 ================================
Let's say I take the rows with the value "TOT" in the "Tm" column and create a new DF out of them, pulled from the original data frame...
Could I then compare the new DF with the original data frame and remove the names from the original data IF they match the names from the new data frame?
The problem is that the df you are working on in the loop is not the same df that is in df_list. You could solve this by saving the new df back to the list, overwriting the old df:
for i, df in enumerate(df_list):
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    df = list(df[df['MP'] >= 1000]['Rk'])
    df = df[df['Rk'].isin(df)]
    df_list[i] = df
These 2 lines are probably wrong as well:
df = list(df[df['MP'] >= 1000]['Rk'])
df = df[df['Rk'].isin(df)]
Perhaps you want this:
for i, df in enumerate(df_list):
    df = df.loc[df['Tm'] == 'TOT']
    df = df.copy()
    df['MP'] = df['MP'].astype(int)
    df['Rk'] = df['Rk'].astype(int)
    #df = list(df[df['MP'] >= 1000]['Rk'])
    #df = df[df['Rk'].isin(df)]
    # keep just the rows where MP >= 1000
    df_list[i] = df[df['MP'] >= 1000]
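Once df_list holds the filtered frames, the "simple algebra" goals from the question might look like this (a sketch, assuming each frame has a PTS column which, like MP, comes out of read_html as text and needs an astype first):
for season in df_list:
    season['PTS'] = season['PTS'].astype(int)
    pts_per_minute = season['PTS'].sum() / season['MP'].sum()    # total PTS over total MP
    pts_per_player = season['PTS'].sum() / len(season['PTS'])    # total PTS over number of rows
    print(pts_per_minute, pts_per_player)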
I am doing the following in Dask because the df dataframe has 7 million rows and 50 columns, so pandas is extremely slow. However, I might not be using Dask correctly, or Dask might not be appropriate for my goal. I need to do some preprocessing on the df dataframe, which is mainly creating some new columns, and then eventually save df (I am saving to csv, but I have also tried parquet). However, before I save, I believe I have to call compute(), and compute() is taking very long -- I left it running for 3 hours and it still wasn't done. I tried to persist() throughout the calculations, but persist() also took a long time. Is this expected with Dask given the size of my data? Could this be because of the number of partitions (I have 20 logical processors and Dask is using 24 partitions -- I have 128 GB of memory if this helps too)? Is there something I could do to speed this up?
import dask.dataframe as dd
import numpy as np
import pandas as pd
from re import match
from dask_ml.preprocessing import LabelEncoder
df1 = dd.read_csv("data1.csv")
df2 = dd.read_csv("data2.csv")
df = df1.merge(df2, how='inner', left_on=['country', 'region'],
               right_on=['country', 'region'])
df['actual_adj'] = (df['actual'] * df['travel'] + 809 * df['stopped']) / (
    df['travel_time'] + df['stopped_time'])
df['c_adj'] = 1 - df['actual_adj'] / df['free']
df['stopped_tom'] = 1 * (df['stopped'] > 0)
def func(df):
    df = df.sort_values('region')
    df['first_established'] = 1 * (df['region_d'] == df['region_d'].min())
    df['last_established'] = 1 * (df['region_d'] == df['region_d'].max())
    df['actual_established'] = df['noted_timeframe'].shift(1, fill_value=0)
    df['actual_established_2'] = df['noted_timeframe'].shift(-1, fill_value=0)
    df['time_1'] = df['time_book'].shift(1, fill_value=0)
    df['time_2'] = df['time_book'].shift(-1, fill_value=0)
    df['stopped_investing'] = df['stopped'].shift(1, fill_value=1)
    return df
df = df.groupby('country').apply(func).reset_index(drop=True)
df['actual_diff'] = np.abs(df['actual'] - df['actual_book'])
df['length_diff'] = np.abs(df['length'] - df['length_book'])
df['Investment'] = df['lor_index'].values * 1000
df = df.compute().to_csv("path")
Saving to csv or parquet will by default trigger computation, so the last line should be:
df.to_csv("path_*.csv")
The asterisk is needed to specify the pattern of csv file names (each partition is saved into a separate file, unless you specify single_file=True).
My guess is that most of the computation time is spent on this step:
df = df1.merge(df2, how='inner', left_on=['country', 'region'],
right_on=['country', 'region'])
If one of the dfs is small enough to fit in memory, then it would be good to keep it as a pandas dataframe, see further tips in the documentation.
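A sketch of that suggestion (assuming data2.csv is the smaller of the two files and fits comfortably in memory): keep it as a plain pandas DataFrame so Dask can broadcast it to every partition during the merge, and let to_csv drive the computation at the end instead of an explicit compute().
import dask.dataframe as dd
import pandas as pd
df1 = dd.read_csv("data1.csv")          # the large table stays lazy in Dask
df2_small = pd.read_csv("data2.csv")    # assumed small enough to hold in memory
df = df1.merge(df2_small, how='inner', on=['country', 'region'])
# ... rest of the preprocessing ...
df.to_csv("path_*.csv")                 # one file per partition; this triggers the computation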
I am trying to write a UDF in python/Excel using xlwings. I have time-series data in three columns in a spreadsheet of the form:
Date Hour Value
01/11/2017 1 43.1
01/11/2017 2 41.8
01/11/2017 3 38.6
01/11/2017 4 38.6
01/11/2017 5 38.6
And I want to be able to select this range, manipulate it in several ways (monthly average etc.) using pandas groupby functions then output the results back into a new spreadsheet. This code works for me:
@xw.sub
def get_df_from_range():
    """gets df"""
    # make the current selection a dataframe
    wb = xw.Book.caller()
    df = wb.selection.options(pd.DataFrame, index=False).value
    # simple check: add a sheet and print the dataframe
    sht = wb.sheets.add()
    sht.range('A1').options(index=False).value = df
However, as soon as I try to do any manipulation of the dataframe before printing it, I get error messages. For example:
@xw.sub
def get_df_from_range():
    """gets df"""
    # make the current selection a dataframe
    wb = xw.Book.caller()
    df = wb.selection.options(pd.DataFrame, index=False).value
    # simple manipulation task
    df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
    # add a sheet and print the dataframe
    sht = wb.sheets.add()
    sht.range('A1').options(index=False).value = df
This gives an error message:
Run-time error '2147467259 (80004005)':
AttributeError: 'tuple' object has no attribute 'lower'
if value.lower() in _unit_map:
File
"C:\User\AppData\Local\Programs\Python\Python36-32
line 441, in f
unit = {k: f(k) for k in arg.keys()}
I thought I could debug this better by creating the same code, but not as a UDF; just by writing code in Spyder and connecting to the spreadsheet - so I would have the df variable in my variable explorer. But when I wrote almost exactly the same code, it did not give me an error message:
wb = xw.Book("my_spreadsheet.xlsm")
df = wb.selection.options(pd.DataFrame, index = False).value
df['Date'] = pd.to_datetime(df['Date'], format = '%d/%m/%Y')
I am really stuck as to why this would be. Can someone please help?
I should note, I am aware that xlwings automatically reads Excel dates as datetime64[ns] formats. That isn't the point I am trying to make. I want to do other things with the dataframe (e.g. left join it to another dataframe) and all those other tasks also fail when I try the UDF method, but work O.K. when I just connect to the spreadsheet from Spyder. I am hoping that if I can get that one "simple manipulation task" to work, then all the other tasks may also work.
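One way to narrow this down (a sketch of a diagnostic, not a confirmed fix): the traceback line unit = {k: f(k) for k in arg.keys()} comes from the branch of pd.to_datetime that assembles a date out of a whole DataFrame (or dict) of year/month/day columns, a branch it only takes when its argument is not a plain Series. So it is worth checking, inside the sub, what df and df['Date'] actually come back as; the hypothetical inspect_selection sub below writes the types to a fresh sheet, since a sub has no console to print to.
import pandas as pd
import xlwings as xw
@xw.sub
def inspect_selection():
    """Diagnostic: see what the current selection really turns into."""
    wb = xw.Book.caller()
    df = wb.selection.options(pd.DataFrame, index=False).value
    sht = wb.sheets.add()
    sht.range('A1').value = str(type(df))           # expected: a pandas DataFrame
    sht.range('A2').value = str(list(df.columns))   # are the column labels plain strings?
    sht.range('A3').value = str(type(df['Date']))   # a Series, or a sub-DataFrame?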
This turned out to be a more difficult problem than I thought when I started.
I want to take the contents of a particular column from each table in an SQLite database, make each into a Series, and then combine them into a single data frame.
I have tried this, but failed:
import pandas as pd
from pandas import Series, DataFrame
import sqlite3
con = sqlite3.connect("C:/Users/Kun/Documents/Dashin/data.db")  # my sqlite db
tmplist = ['A003060', 'A003070']  # the db contains these tables; I use only two for practice
for i in tmplist:
    tmpSeries = pd.Series([])
    listSeries = pd.read_sql("SELECT * FROM %s" % (i), con, index_col=None)['Close'].head(5)
    tmpSeries2 = tmpSeries.append(listSeries)
    print(tmpSeries2)
The result of that code only shows the two Series separately, like this:
0 7150.0
1 6770.0
2 7450.0
3 7240.0
4 6710.0
dtype: float64
0 14950.0
1 15500.0
2 15000.0
3 14800.0
4 14500.0
What I want to do is like this:
A003060 A003070
0 7150.0 14950.0
1 6770.0 15500.0
2 7450.0 15000.0
3 7240.0 14800.0
4 6710.0 14500.0
I asked a similar question earlier and got an answer, but that answer used predefined variables. I must use a loop here because I have to deal with a series of large databases. I have already tried another approach using dataframe.append and transpose(), but failed.
I would appreciate some small hints. Thank you.
To append pandas series using for loop
I think you can create a list, then append the data, and finally use concat:
dfs = []
for i in tmplist:
    listSeries = pd.read_sql("SELECT * FROM %s" % (i), con, index_col=None)['Close'].head(5)
    dfs.append(listSeries)
df = pd.concat(dfs, axis=1, keys=tmplist)
print(df)
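Here concat with axis=1 places the Series side by side as columns, and keys=tmplist labels those columns with the table names, so the printed df should match the desired A003060 / A003070 layout shown above.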