Adding the most recent data to a pandas data frame

I am trying to build, and keep up to date, a data frame/time series where I scrape data from a website table. I want to take the most recent data and add it to what I already have. A sample of what the data frame looks like is:
Date Price
0 10/01/19 100
1 09/01/19 95
2 08/01/19 96
3 07/01/19 97
What I then want to do is run my little program and have it identify that I am missing data for the 11th and 12th of January, and then add it to the top of the data frame. I am quite comfortable with compiling a data frame using .read_html, and with building data frames generally, but this is currently a bit beyond my abilities.
I know the done thing is usually to show what I have attempted so far, but to be honest I don't know where to begin with this one.
Many thanks

Let's say the old dataframe is df, which looks like:
Date Price
0 2019-01-10 100
1 2019-01-09 95
2 2019-01-08 96
3 2019-01-07 97
After two days you download new data, which gives you two rows for 2019-01-11 and 2019-01-12; let's name it new_df (values are just examples):
Date Price
0 2019-01-12 67
1 2019-01-11 89
2 2019-01-10 100
3 2019-01-09 95
Note: a few rows in the new df are already present in the old df.
Using df.append(), df.drop_duplicates() and df.sort_values():
>>> df.append(new_df, ignore_index=True).drop_duplicates().sort_values(by='Date', ascending=False)
Date Price
4 2019-01-12 67
5 2019-01-11 89
0 2019-01-10 100
1 2019-01-09 95
2 2019-01-08 96
3 2019-01-07 97
This will append the new values and sort them in descending order on the Date column, keeping the latest date at the top.
If you also want the index sorted, just add sort_index() at the end: df.append(new_df, ignore_index=True).drop_duplicates().sort_values(by='Date', ascending=False).sort_index()
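Note: in recent pandas versions DataFrame.append has been removed, so a concat-based equivalent may be needed. A minimal sketch with pd.concat, assuming the same df and new_df as above:
import pandas as pd

combined = (
    pd.concat([df, new_df], ignore_index=True)      # stack old and new rows
      .drop_duplicates()                            # discard rows already present
      .sort_values(by='Date', ascending=False)      # latest date at the top
      .reset_index(drop=True)
)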

Related

Pandas: to get mean for each data category daily [duplicate]

I am a somewhat beginner programmer learning Python (+pandas) and hope I can explain this well enough. I have a large time-series pandas dataframe of over 3 million rows, initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations, denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many questions, like counting records per hour per day and getting an average per hour over several years. However, I run into trouble when including the 'Id' variable.
I'm looking to get the mean number of people taking a ticket for each hour, for each day of the week (Mon-Fri), per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Trip_hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby([Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Trip_hour']]).mean().reset_index()
However, this does not give the desired result, as the mean values are incorrect.
I hope I have explained this issue in a clear way. I am looking for the mean per hour, per day, per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model to these groups.
Any help would be appreciated, and if possible an explanation of what I am doing wrong, either code-wise or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question on a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per Hour, to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years, has 3 million rows, and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
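To see the two-step aggregation on the toy dataset from the question, here is a small self-contained sketch (column names taken from the question); it reproduces the expected means of 4, 1 and 2.5:
import pandas as pd

# Toy data from the question: one row per ticket taken
toy = pd.DataFrame({
    'Date':  ['12/12/2014'] * 5 + ['19/12/2014'] * 3 + ['26/12/2014']
             + ['27/12/2014'] * 4 + ['04/01/2015'],
    'Id':    [1234] * 14,
    'Dow':   [0] * 9 + [1] * 5,
    'Hour':  [9] * 8 + [10] + [11] * 5,
    'Count': [1] * 14,
})

# Step 1: total tickets per Id, calendar date, day of week and hour
daily = toy.groupby(['Date', 'Id', 'Dow', 'Hour'], as_index=False)['Count'].sum()

# Step 2: mean of the daily totals per Id, day of week and hour
mean_count = daily.groupby(['Id', 'Dow', 'Hour'], as_index=False)['Count'].mean()
print(mean_count)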
You can also group by the 'Id' column and then use the resample function followed by .sum() (the old how='sum' argument is deprecated in recent pandas).
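A rough sketch of that variant, assuming Start_date is the DatetimeIndex as in the question. Note that resample creates zero-count rows for empty hours, which will pull the means down compared with the date-based grouping above:
# Hourly ticket counts per Id, then the mean per Id, weekday and hour
hourly = df.groupby('Id')['Count'].resample('H').sum().reset_index()
hourly['dow'] = hourly['Start_date'].dt.dayofweek
hourly['hour'] = hourly['Start_date'].dt.hour
mean_count = hourly.groupby(['Id', 'dow', 'hour'])['Count'].mean()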

Pandas adding row to categorical index

I have a scenario where I would like to group my dataset by personally defined week indexes, average the groups, and then aggregate the averages into a "Total" row. I am able to achieve the first half of my scenario, but when I try to append/insert a new "Total" row that sums these rows, I receive error messages.
I attempted to create this row via two different methods:
Method 1:
week_index_avg_unit.loc['Total'] = week_index_avg_unit.sum()
TypeError: cannot append a non-category item to a CategoricalIndex
Method 2:
week_index_avg_unit.index.insert(['Total'], week_index_avg_unit.sum())
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I have used the first approach in this scenario multiple times, but this is the first time I'm cutting the data into multiple categories, and I can clearly see that the CategoricalIndex type is the problem.
Here is the format of my data:
date organic ppc oa other content_partnership total \
0 2018-01-01 379 251 197 51 0 878
1 2018-01-02 880 527 405 217 0 2029
2 2018-01-03 859 589 403 323 0 2174
3 2018-01-04 835 533 409 335 0 2112
4 2018-01-05 760 449 355 272 0 1836
year_month day weekday weekday_name week_index
0 2018-01 1 0 Monday Week 1
1 2018-01 2 1 Tuesday Week 1
2 2018-01 3 2 Wednesday Week 1
3 2018-01 4 3 Thursday Week 1
4 2018-01 5 4 Friday Week 1
Here is the code:
import pandas as pd
import numpy as np
from datetime import datetime
import matplotlib.pyplot as plt
historicals = pd.read_csv("2018-2019_plants.csv")
# Capture dates for additional date columns
date_col = pd.to_datetime(historicals['date'])
historicals['year_month'] = date_col.dt.strftime("%Y-%m")
historicals['day'] = date_col.dt.day
historicals['weekday'] = date_col.dt.dayofweek
historicals['weekday_name'] = date_col.dt.day_name()
# create week ranges segment (7 day range)
historicals['week_index'] = pd.cut(historicals['day'],[0,7,14,21,28,32], labels=['Week 1','Week 2','Week 3','Week 4','Week 5'])
# Week Index Average (Units)
week_index_avg_unit = historicals[df_monthly_average].groupby(['week_index']).mean().astype(int)  # df_monthly_average: presumably the list of columns to aggregate (not shown in the question)
type(week_index_avg_unit.index)
pandas.core.indexes.category.CategoricalIndex
Here is the week_index_avg_unit table:
organic ppc oa other content_partnership total day weekday
week_index
Week 1 755 361 505 405 22 2027 4 3
Week 2 787 360 473 337 19 1959 11 3
Week 3 781 382 490 352 18 2006 18 3
...
A pd.CategoricalIndex is a special animal. It is immutable, so to do the trick you may need to use something like pd.CategoricalIndex.set_categories (or add_categories) to register 'Total' as a valid category first.
See pandas docs: https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.CategoricalIndex.html
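A minimal sketch of that fix, using add_categories to register the new label before assigning the row (assuming week_index_avg_unit from the question):
# 'Total' must be a valid category before .loc can create the row
week_index_avg_unit.index = week_index_avg_unit.index.add_categories('Total')
week_index_avg_unit.loc['Total'] = week_index_avg_unit.sum()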

dataframe fill in next row with previous row value

AMA(1) = close(1)
AMA(t) = AMA(t-1) + alpha(t) * (close(t) - AMA(t-1))
id date close alpha AMA
1 2016-01-04 343 1 343
2 2016-01-05 335 2 327 (343+2*(335-343))
3 2016-01-06 337 3 357 (327+3*(337-327))
4 2016-01-07 338 -1 376 (357-1*(338-357))
I want to calculate the AMA column of a dataframe; the rule is as follows:
if id==1, AMA(1)=close(1)
if id>1, AMA(t)=AMA(t-1)+alpha(t)*(close(t)-AMA(t-1))
I used a for loop, but it's very slow when I'm dealing with large data.
Does anyone know how to do this without a for loop? Many thanks!
for i in range(len(df)):
    if i == 0:
        df.loc[i, 'AMA'] = df.loc[i, 'close']
    else:
        df.loc[i, 'AMA'] = df.loc[i - 1, 'AMA'] + df.loc[i, 'alpha'] * (df.loc[i, 'close'] - df.loc[i - 1, 'AMA'])
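The recurrence can't easily be fully vectorized, because each AMA value depends on the previous one, but a common speed-up (a sketch, not the only option) is to run the loop over plain NumPy arrays instead of per-row .loc indexing; if numba is installed, decorating the function with numba.njit compiles it further:
import numpy as np

def ama(close, alpha):
    # AMA(1) = close(1); AMA(t) = AMA(t-1) + alpha(t) * (close(t) - AMA(t-1))
    out = np.empty(len(close))
    out[0] = close[0]
    for t in range(1, len(close)):
        out[t] = out[t - 1] + alpha[t] * (close[t] - out[t - 1])
    return out

df['AMA'] = ama(df['close'].to_numpy(dtype=float), df['alpha'].to_numpy(dtype=float))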

How to handle missing values like this, and the date format, for regression?

I want to build a regression model from this dataset (the first two columns are independent variables and the last one is the dependent variable). I have imported the dataset using dataset = pd.read_csv('data.csv').
I have built models before, but never with a date-format column as an independent variable, so how should I handle these dates to build the regression model?
Also, how should I handle the 0 values in the given dataset?
My dataset, in .csv format, looks like:
Month/Day, Sales, Revenue
01/01 , 0 , 0
01/02 , 100000, 0
01/03 , 400000, 0
01/06 ,300000, 0
01/07 ,950000, 1000000
01/08 ,10000, 15000
01/10 ,909000, 1000000
01/30 ,12200, 12000
02/01 ,950000, 1000000
02/09 ,10000, 15000
02/13 ,909000, 1000000
02/15 ,12200, 12000
I don't know how to handle this date format or the 0 values.
Here's a start. I saved your data into a file and stripped all the whitespace.
import pandas as pd
df = pd.read_csv('20180112-2.csv')
df['Month/Day'] = pd.to_datetime(df['Month/Day'], format = '%m/%d')
print(df)
Output:
Month/Day Sales Revenue
0 1900-01-01 0 0
1 1900-01-02 100000 0
2 1900-01-03 400000 0
3 1900-01-06 300000 0
4 1900-01-07 950000 1000000
5 1900-01-08 10000 15000
6 1900-01-10 909000 1000000
7 1900-01-30 12200 12000
8 1900-02-01 950000 1000000
9 1900-02-09 10000 15000
10 1900-02-13 909000 1000000
11 1900-02-15 12200 12000
The year defaults to 1900 since it is not provided in your data. If you need to change it, that's an additional, different question. To change the year, see: Pandas: Change day
df['Month/Day'] = df['Month/Day'].apply(lambda d: d.replace(year=2017))
print(df)
Output:
Month/Day Sales Revenue
0 2017-01-01 0 0
1 2017-01-02 100000 0
2 2017-01-03 400000 0
3 2017-01-06 300000 0
4 2017-01-07 950000 1000000
5 2017-01-08 10000 15000
6 2017-01-10 909000 1000000
7 2017-01-30 12200 12000
8 2017-02-01 950000 1000000
9 2017-02-09 10000 15000
10 2017-02-13 909000 1000000
11 2017-02-15 12200 12000
Finally, to find the correlation between columns, just use df.corr():
print(df.corr())
Output:
Sales Revenue
Sales 1.000000 0.953077
Revenue 0.953077 1.000000
How to handle missing data?
There are a number of ways to replace it: by the average, by the median, using a moving-average window, or even an RF approach (or similar, MICE and so on).
For the 'Sales' column you can try any of these methods.
For the 'Revenue' column it is better not to use any of them, especially if you have many missing values (it will harm the model). Just remove the rows with missing values in the 'Revenue' column.
By the way, a few ML methods accept missing values: XGBoost, and in some way Trees/Forests. For the latter you may replace zeros with some very different value like -999999.
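A minimal sketch of the simplest of these options, assuming the df from the answer above and treating zeros as missing:
import numpy as np

# Treat zeros as missing values rather than real observations
df[['Sales', 'Revenue']] = df[['Sales', 'Revenue']].replace(0, np.nan)

# Sales (a feature): impute with the median, one of the simple options above
df['Sales'] = df['Sales'].fillna(df['Sales'].median())

# Revenue (the target): drop rows with missing values instead of imputing
df = df.dropna(subset=['Revenue'])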
What to do with the data?
Many things related to feature engineering can be done here:
1. Day of week
2. Weekday or weekend
3. Day in month (number)
4. Pre- or post-holiday
5. Week number
6. Month number
7. Year number
8. Indication of some factors (for example, if it is fruit sales data you can add some boolean columns related to it)
9. And so on...
Almost every feature here should be preprocessed via one-hot encoding, as sketched below.
And cleaned of correlations, of course, if you use linear models.
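A minimal sketch of a few of these calendar features, assuming the df with the parsed Month/Day column from the answer above:
import pandas as pd

# Calendar features derived from the parsed dates
df['day_of_week'] = df['Month/Day'].dt.dayofweek
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)
df['day_of_month'] = df['Month/Day'].dt.day
df['week_number'] = df['Month/Day'].dt.isocalendar().week
df['month'] = df['Month/Day'].dt.month

# One-hot encode the categorical calendar features for a linear model
df = pd.get_dummies(df, columns=['day_of_week', 'month'])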

How do I loop through a table until a condition is reached

I have a table with columns product, pick_qty, shortfall, location, loc_qty:
Product Picked Qty Shortfall Location Location Qty
1742 4 58 1 15
1742 4 58 2 20
1742 4 58 3 15
1742 4 58 4 20
1742 4 58 5 20
1742 4 58 6 20
1742 4 58 7 15
1742 4 58 8 15
1742 4 58 9 15
1742 4 58 10 20
I want a report to loop around and show the number of locations, and the quantity I need to drop, to fulfil the shortfall for replenishment. So the report would look like this:
Product Picked Qty Shortfall Location Location Qty
1742 4 58 1 15
1742 4 58 2 20
1742 4 58 3 15
1742 4 58 4 20
Note that it is best not to think about SQL "looping through a table" and instead to think about it as operating on some subset of the rows in a table.
What it sounds like you need to do is create a running total that tells you how many of the item you would have if you took all of them from a location and all of the locations that came before it, and then check whether that would give you enough of the item to fulfil the shortfall.
Based on your example data, the following query would work, though if Locations aren't actually numeric then you would need to add a row-number column and tweak the query a bit to use the row number instead of the Location number; it would still be very similar to the query below.
SELECT
    Totals.Product, Totals.PickedQty, Totals.ShortFall, Totals.Location, Totals.LocationQty
FROM (
    SELECT
        TheTable.Product, TheTable.PickedQty, TheTable.ShortFall,
        TheTable.Location, TheTable.LocationQty, SUM(ForRunningTotal.LocationQty) AS RunningTotal
    FROM TheTable
    JOIN TheTable ForRunningTotal ON TheTable.Product = ForRunningTotal.Product
        AND TheTable.Location >= ForRunningTotal.Location
    GROUP BY TheTable.Product, TheTable.PickedQty, TheTable.ShortFall, TheTable.Location, TheTable.LocationQty
) Totals
-- Note: you could also change the join above so the running total is the count of only the rows above,
-- not including the current row; then the WHERE clause below could be "Totals.RunningTotal < Totals.ShortFall".
-- I liked RunningTotal as the sum of this row and all prior rows; it seems more appropriate to me.
WHERE Totals.RunningTotal - Totals.LocationQty <= Totals.ShortFall
    AND Totals.LocationQty > 0
Also, as long as you are reading my answer, an unrelated side note: based on the data you showed above, your database schema isn't as normalized as it could be. It seems the Picked Quantity and the ShortFall depend only on the Product, so those belong in a table of their own, and the Location Quantity depends on the Product and Location, so that would be another table. I point this out because if your data contained different Picked Quantities/ShortFalls for a single product, the above query would break; that situation would be impossible with the normalized tables I mentioned.