Upsampling datetime - ValueError: cannot reindex a non-unique index with a method or limit - indexing

I get the error below when I try to upsample...
import pandas as pd
from datetime import date
df1=pd.read_csv("C:/Codes/test.csv")
df1['Date'] = pd.to_datetime(df1['Date'])
df1 = df1.set_index(['Date'])
df2 = pd.DataFrame()
df2 = df1.Gen.resample('H').ffill()
I get this error...ValueError: cannot reindex a non-unique index with a method or limit. Please advise.
My test.csv is a simple file with two columns containing these 5 records
Date|Gen
----|----
5/1/2017|Ggulf
5/2/2017|Ggulf
5/1/2017|Nelson
5/3/2017|Ggulf
5/4/2017|Nelson

An index needs to have unique values. Your first record and third record have the same date '5/1/2017' which makes it impossible to set the date column as an index column.

Related

Set DateTime to index and then sum over a day

i would like to change the index of my dataframe to datetime to sum the colum "Heizung" over a day.
But it dont work.
After i set the new index, i like to use resample to sum over a day.
Here is an extraction from my dataframe.
Nr;DatumZeit;Erdtemp;Heizung
0;25.04.21 12:58:42;21.8;1
1;25.04.21 12:58:54;21.8;1
2;25.04.21 12:59:06;21.9;1
3;25.04.21 12:59:18;21.9;1
4;25.04.21 12:59:29;21.9;1
5;25.04.21 12:59:41;22.0;1
6;25.04.21 12:59:53;22.0;1
7;25.04.21 13:00:05;22.1;1
8;25.04.21 13:00:16;22.1;0
9;25.04.21 13:00:28;22.1;0
10;25.04.21 13:00:40;22.1;0
11;25.04.21 13:00:52;22.2;0
12;25.04.21 13:01:03;22.2;0
13;25.04.21 13:01:15;22.2;1
14;25.04.21 13:01:27;22.2;1
15;25.04.21 13:01:39;22.3;1
16;25.04.21 13:01:50;22.3;1
17;25.04.21 13:02:02;22.4;1
18;25.04.21 13:02:14;22.4;1
19;25.04.21 13:02:26;22.4;0
20;25.04.21 13:02:37;22.4;1
21;25.04.21 13:02:49;22.4;0
22;25.04.21 13:03:01;22.4;0
23;25.04.21 13:03:13;22.5;0
24;25.04.21 13:03:25;22.4;0
This is my code
import pandas as pd
Tab = pd.read_csv('/home/kai/Dokumente/TempData', delimiter=';')
Tab1 = Tab[["DatumZeit","Erdtemp","Heizung"]].copy()
Tab1['DatumZeit'] = pd.to_datetime(Tab1['DatumZeit'])
Tab1.plot(x='DatumZeit', figsize=(20, 5),subplots=True)
#Tab1.index.to_datetime()
#Tab1.index = pd.to_datetime(Tab1.index)
Tab1.set_index('DatumZeit')
Tab.info()
Tab1.resample('D').sum()
print(Tab1.head(10))
This is how we can set index and create Timestamp object and then resample it for 'D' and sum a column over it.
Tab1['DatumZeit'] = pd.to_datetime(Tab1.DatumZeit)
Tab1 = Tab1.set_index('DatumZeit') ## missed here
Tab1.resample('D').Heizung.sum()
If we don't want to set index explicitly then other way to resample is pd.Grouper.
Tab1['DatumZeit'] = pd.to_datetime(Tab1.DatumZeit
Tab1.groupby(pd.Grouper(key='DatumZeit', freq='D')).Heizung.sum()
If we want output to be dataframe, then we can use to_frame method.
Tab1 = Tab1.groupby(pd.Grouper(key='DatumZeit', freq='D')).Heizung.sum().to_frame()
Output
Heizung
DatumZeit
2021-04-25 15
Pivot tables to the rescue:
import pandas as pd
import numpy as np
Tab1.pivot_table(index=["DatumZeit"], values=["Heizung"], aggfunc=np.sum)
If you need to do it with setting the index first, you need to use inplace=True on set_index
Tab1.set_index("DatumZeit", inplace=True)
Just note if you do this way, you can't go back to a pivot table. In the end, it's whatever works best for you.

pd.datetime not indexing correctly

I have dataset with date of every transaction in restaurant. I tried to set date as index, before converting it with df.to_datetime:
df['dateTransaction'] = pd.to_datetime(df['dateTransaction'])
df.info()
And I really get 'dateTransaction' type as datetime64[ns]. But than I tried to set index with
df = df.set_index('dateTransaction')
but my dataset didn't sorted by date correctly.
enter image description here
Please advice how to index dataframe by date in sorted way?
When you set_index it never rearranges rows, it just "moves" a column to an index. So you have to explicitly sort (either before or after set_index):
df = df.set_index('dateTransaction').sort_index()
# or
df = df.sort_values("dateTransaction").set_index('dateTransaction')
If you are reading it from CSV you can also try
data=pd.read_csv('SomeData.csv',index_col=['Date'],parse_dates=['Date'],dayfirst=True)
without dayfirst=True the days read in as months and vice versa

Key error for set_index of a pandas dataframe by list object

I want to set the index of a pandas dataframe by a list which includes dates in a common format YYYY:MM:DD hh:mm:ss
index=df.index.tolist()
df2=df1.set_index(index)
the outcome
KeyError: '2011-06-21 00:00:00'
I tried to
df2=df1.set_index(str(index))
because of the backspace between date and time but the result was a KeyError for every single date in my index list.
Add [] for nested list, else it looking for columns names:
df2 = df.set_index([index])

how to extract the unique values and its count of a column and store in data frame with index key

I am new to pandas.I have a simple question:
how to extract the unique values and its count of a column and store in data frame with index key
I have tried to:
df = df1['Genre'].value_counts()
and I am getting a series but I don't know how to convert it to data frame object.
Pandas series has a .to_frame() function. Try it:
df = df1['Genre'].value_counts().to_frame()
And if you wanna "switch" the rows to columns:
df = df1['Genre'].value_counts().to_frame().T
Update: Full example if you want them as columns:
import pandas as pd
import numpy as np
np.random.seed(400) # To reproduce random variables
df1 = pd.DataFrame({
'Genre': np.random.choice(['Comedy','Drama','Thriller'], size=10)
})
df = df1['Genre'].value_counts().to_frame().T
print(df)
Returns:
Thriller Comedy Drama
Genre 5 3 2
try
df = pd.DataFrame(df1['Genre'].value_counts())

dataframe multiply some columns with a series

I have a dataframe df1 where the index is a DatetimeIndex and there are 5 columns, col1, col2, col3, col4, col5.
I have another df2 which has an almost equal datetimeindex (some days of df1 may be missing from df1), and a single 'Value' column.
I would like to multiply df1 in-place by the Value from df2 when the dates are the same. But not for all columns col1...col5, only col1...col4
I can see it is possible to multiply col1*Value, then col2*Value and so on... and make up a new dataframe to replace df1.
Is there a more efficient way?
You an achieve this, by reindexing the second dataframe so they are the same shape, and then using the dataframe operator mul:
Create two data frames with datetime series. The second one using only business days to make sure we have gaps between the two. Set the dates as indices.
import pandas as pd
# first frame
rng1 = pd.date_range('1/1/2011', periods=90, freq='D')
df1 = pd.DataFrame({'value':range(1,91),'date':rng1})
df1.set_index('date', inplace =True)
# second frame with a business day date index
rng2 = pd.date_range('1/1/2011', periods=90, freq='B')
df2 = pd.DataFrame({'date':rng2})
df2['value_to_multiply'] = range(1-91)
df2.set_index('date', inplace =True)
reindex the second frame with the index from the first. Df1 will now have gaps for non-business days filled with the first previous valid observation.
# reindex the second dataframe to match the first
df2 =df2.reindex(index= df1.index, method = 'ffill')
Multiple df2 by df1['value_to_multiply_by']:
# multiple filling nans with 1 to avoid propagating nans
# nans can still exists if there are no valid previous observations such as at the beginning of a dataframe
df1.mul(df2['value_to_multiply_by'].fillna(1), axis=0)