Using Python 2.7 and a notebook, I am trying to display a simple Series that looks like this:
year
2014 1
2015 3
2016 2
Name: mySeries, dtype: int64
I would like to:
Name the second column. I can't seem to succeed with s.columns = ['a','b']; how do we do this?
Plot the result with the years displayed as such. When I run s.plot() I get the years on the x-axis, which is good, but the values look weird.
Thanks for the help!
If it helps, this series comes from the following code:
df = pd.read_csv(file, usecols=['dates'], parse_dates=True)
df['dates'] = pd.to_datetime(df['dates'])
df['year'] = pd.DatetimeIndex(df['dates']).year
df
which gives me:
dates year
0 2015-05-05 14:21:00 2015
1 2014-06-06 14:22:00 2014
2 2015-05-05 14:14:00 2015
On which I do:
s = pd.Series(df.groupby('year').size())
Casting the index to string with astype works for me:
print s
year
2014 1
2015 3
2016 2
Name: mySeries, dtype: int64
s.index = s.index.astype(str)
s.plot()
Just cast your index first:
df.set_index(df.index.astype(str), inplace=True)
and then you will have what you expect.
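Putting the pieces together, here is a minimal end-to-end sketch (the data is typed in by hand to match the question, and 'count' is just an example name):
import pandas as pd

# Rebuild the series from the question.
s = pd.Series([1, 3, 2], index=[2014, 2015, 2016], name='mySeries')
s.index.name = 'year'

# A Series has no .columns; its single column of values is named via
# s.name (or s.rename('newname')).
s = s.rename('count')

# Cast the integer index to strings so the plot shows the years
# literally instead of as a continuous numeric axis.
s.index = s.index.astype(str)
s.plot()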
Related
Say I have a DataFrame:
foo bar
0 1998 abc
1 1999 xyz
2 2000 123
What is the best way to combine the first two (or any n) characters of foo with the last two of bar to make a third column holding 19bc, 19yz, 2023? Normally, to combine columns I'd simply do
df['foobar'] = df['foo'] + df['bar']
but I believe I can't do slicing on these objects.
If you convert your columns to strings with astype, you can use the str accessor and slice the values:
df['foobar'] = df['foo'].astype(str).str[:2] + df['bar'].str[-2:]
print(df)
# Output
foo bar foobar
0 1998 abc 19bc
1 1999 xyz 19yz
2 2000 123 2023
Using astype(str) is unnecessary if the column already has object dtype, like the bar column here.
You can read: Working with text data
Example:
import pandas as pd
df = pd.DataFrame({"foo": [1998, 1999, 2000], "bar": ["abc", "xyz", "123"]})
df["foobar"] = df.apply(lambda x: str(x["foo"])[0:2] + x["bar"][1:], axis=1)
print(df)
# foo bar foobar
# 0 1998 abc 19bc
# 1 1999 xyz 19yz
# 2 2000 123 2023
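For the "any n" part of the question, here is a small generalization (the combine helper is a made-up name, not a pandas function):
import pandas as pd

def combine(df, left, right, n=2):
    # First n characters of `left` plus last n characters of `right`,
    # with both columns coerced to string first.
    return df[left].astype(str).str[:n] + df[right].astype(str).str[-n:]

df = pd.DataFrame({"foo": [1998, 1999, 2000], "bar": ["abc", "xyz", "123"]})
df["foobar"] = combine(df, "foo", "bar", n=2)
print(df)
#     foo  bar foobar
# 0  1998  abc   19bc
# 1  1999  xyz   19yz
# 2  2000  123   2023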
import pandas as pd
data = [['2017-09-30','A',123],['2017-12-31','A',23],['2017-09-30','B',74892],['2017-12-31','B',52222],['2018-09-30','A',37599],['2018-12-31','A',66226]]
df = pd.DataFrame.from_records(data,columns=["Date", "Company", "Revenue YTD"])
df['Date'] = pd.to_datetime(df['Date'])
df = df.groupby(['Company',df['Date'].dt.year]).diff()
print(df)
Date Revenue YTD
0 NaT NaN
1 92 days -100.0
2 NaT NaN
3 92 days -22670.0
4 NaT NaN
5 92 days 28627.0
I would like to calculate each company's revenue difference between September and December. I have tried grouping by company and year, but the result is not what I am expecting.
Expected result:
Date Company Revenue YTD
0 2017 A -100
1 2017 B -22670
2 2018 A 28627
IIUC, this should work
(df.assign(Date=df['Date'].dt.year,
Revenue_Diff=df.groupby(['Company',df['Date'].dt.year])['Revenue YTD'].diff())
.drop('Revenue YTD', axis=1)
.dropna()
)
Output:
Date Company Revenue_Diff
1 2017 A -100.0
3 2017 B -22670.0
5 2018 A 28627.0
Try this:
Set it up:
import pandas as pd
import numpy as np
data = [['2017-09-30','A',123],['2017-12-31','A',23],['2017-09-30','B',74892],['2017-12-31','B',52222],['2018-09-30','A',37599],['2018-12-31','A',66226]]
df = pd.DataFrame.from_records(data,columns=["Date", "Company", "Revenue YTD"])
df['Date'] = pd.to_datetime(df['Date'])
Update with np.diff():
my_func = lambda x: np.diff(x)[0]  # each group has exactly two rows, so take the single diff
df = (df.groupby([df.Date.dt.year, df.Company])
.agg({'Revenue YTD':my_func}))
print(df)
Revenue YTD
Date Company
2017 A -100
B -22670
2018 A 28627
Hope this helps.
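If you want the flat three-column shape from the expected result, a reset_index on top of the code above should do it (a sketch):
out = df.reset_index()
print(out)
#    Date Company  Revenue YTD
# 0  2017       A         -100
# 1  2017       B       -22670
# 2  2018       A        28627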
This is my input table in pandas, and the second image below shows the output table I want.
I am new to pandas; how do I get the result shown in the second image?
I am currently getting the output below using this approach:
df.groupby(['Months', 'Status']).size()
Months Status
Apr-20 IW 2
OW 1
Jun-20 IW 4
OW 4
May-20 IW 3
OW 2
dtype: int64
But how do I convert this output into the table shown in the second image?
Thanks in advance.
Use crosstab with the margins=True parameter. Then, if necessary, remove the last Total column, reorder the columns with DataFrame.reindex using the original Months order, convert the index to a column with DataFrame.reset_index, and remove the columns' name with DataFrame.rename_axis:
df = (pd.crosstab(df['Status'], df['Months'], margins_name='Total', margins=True)
.iloc[:, :-1]
.reindex(df['Months'].unique(), axis=1)
.reset_index()
.rename_axis(None, axis=1))
print (df)
Status Apr-20 May-20 Jun-20
0 IW 2 3 4
1 OW 1 2 4
2 Total 3 5 8
Unstack, and then transpose:
df = df.groupby(['Months', 'Status']).size().unstack().T
To get a total row:
df.sum().rename('Total').to_frame().T.append(df)
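Note that DataFrame.append was removed in pandas 2.0; on current versions the same Total row can be prepended with pd.concat:
import pandas as pd

# df is the unstacked frame from the line above
total = df.sum().rename('Total').to_frame().T
df = pd.concat([total, df])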
What would be the best way to parse the above Excel file (shown in the screenshot) into a pandas DataFrame? The idea is to be able to update the data easily, adding columns and dropping rows. For example, for every origin I would like to keep only output3; then, for every year column (2000, ..., 2013), divide it by 2 given a condition (say value > 6000).
Below is what I tried: first parsing and then dropping the unnecessary rows. But it's not satisfactory, as I had to rename the columns manually, so this doesn't look like an optimal solution. Any better idea?
df = pd.read_excel("myExcel.xlsx", skiprows=2, sheet_name='1')
cols1 = list(df.columns)
cols1 = [str(x)[:4] for x in cols1]
cols2 = list(df.iloc[0,:])
cols2 = [str(x) for x in cols2]
cols = [x + "_" + y for x,y in zip(cols1,cols2)]
df.columns = cols
df = df.drop(["Unna_nan"], axis =1).rename(columns ={'Time_Origine':'Country','Unna_Output' : 'Series','Unna_Ccy' : 'Unit','2000_nan' : '2000','2001_nan': '2001','2002_nan':'2002','2003_nan' : '2003','2004_nan': '2004','2005_nan' : '2005','2006_nan' : '2006','2007_nan' : '2007','2008_nan' : '2008','2009_nan' : '2009','2010_nan' : '2010','2011_nan': '2011','2012_nan' : '2012','2013_nan':'2013','2014_nan':'2014','2015_nan':'2015','2016_nan':'2016','2017_nan':'2017'})
df.drop(0,inplace=True)
df.drop(df.tail(1).index, inplace=True)
idx = ['Country', 'Series', 'Unit']
df = df.set_index(idx)
df = df.query('Series == "Output3"')
Without having an Excel file like this to test on, I think something like the following might work.
To get only the rows for output3, you can use:
df = pd.read_excel("myExcel.xlsx", skiprows=2, sheet_name='1')
df = df.loc[df['Output'] == 'output3']
Then, to divide every cell by 2 when its value is greater than 6000, use pandas apply:
def foo(bar):
    if bar > 6000:
        return bar / 2
    return bar

for col in df.columns:
    try:
        int(col)  # check whether this column name is a year
        df[col] = df[col].apply(foo)
    except ValueError:
        pass
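The loop can also be written without apply, using DataFrame.mask (a sketch; it assumes the year columns are exactly those whose names parse as integers):
# Select the year columns, then halve every value above 6000.
year_cols = [c for c in df.columns if str(c).isdigit()]
df[year_cols] = df[year_cols].mask(df[year_cols] > 6000, df[year_cols] / 2)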
#read the first 2 rows into a MultiIndex and remove the last row
df = pd.read_excel("Excel1.xlsx", skiprows=2, header=[0,1], skipfooter=1)
print (df)
#create helper DataFrame
cols = df.columns.to_frame().reset_index(drop=True)
cols.columns=['a','b']
cols['a'] = pd.to_numeric(cols['a'], errors='ignore')
cols['b'] = cols['b'].replace('Unit.1','tmp', regex=False)
#create new column by condition
cols['c'] = np.where(cols['b'].str.startswith('Unnamed'), cols['a'], cols['b'])
print (cols)
a b c
0 Time Country Country
1 Time Series Series
2 Time Unit Unit
3 Time tmp tmp
4 2000 Unnamed: 4_level_1 2000
5 2001 Unnamed: 5_level_1 2001
6 2002 Unnamed: 6_level_1 2002
7 2003 Unnamed: 7_level_1 2003
8 2004 Unnamed: 8_level_1 2004
9 2005 Unnamed: 9_level_1 2005
10 2006 Unnamed: 10_level_1 2006
11 2007 Unnamed: 11_level_1 2007
12 2008 Unnamed: 12_level_1 2008
13 2009 Unnamed: 13_level_1 2009
14 2010 Unnamed: 14_level_1 2010
15 2011 Unnamed: 15_level_1 2011
16 2012 Unnamed: 16_level_1 2012
17 2013 Unnamed: 17_level_1 2013
18 2014 Unnamed: 18_level_1 2014
19 2015 Unnamed: 19_level_1 2015
20 2016 Unnamed: 20_level_1 2016
21 2017 Unnamed: 21_level_1 2017
#overwrite columns by column c
df.columns = cols['c'].tolist()
#forward filling missing values
df['Country'] = df['Country'].ffill()
df = df.drop('tmp', axis=1).set_index(['Country','Series','Unit'])
print (df)
I have a DataFrame in pandas called munged_data with two columns, entry_date and dob, which I have converted to Timestamps using pd.to_datetime. I am trying to figure out how to calculate people's ages based on the time difference between entry_date and dob, and to do this I need to get the difference in days between the two columns (so that I can then do something like round(days/365.25)). I cannot seem to find a way to do this with a vectorized operation. When I do munged_data.entry_date - munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However, I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the pandas Timedelta type, available since v0.15.0, you can also do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
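Applied to the original question, the age calculation would then be something like this (assuming munged_data holds the two datetime columns):
# Whole days between the two datetime columns, as int64.
days = (munged_data['entry_date'] - munged_data['dob']).dt.days
munged_data['age'] = (days / 365.25).round()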
You need pandas 0.11 for this (0.11rc1 is out; the final release is probably coming next week):
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (similar to how we now use Timestamps for datetime64[ns]); that support is coming in 0.12.
Not sure if you still need it, but since pandas 0.14 I usually use the .astype('timedelta64[X]') method, where X is the desired frequency unit:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html (frequency conversion)
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
Hope that will help
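For anyone on a recent pandas: .ix was removed in 1.0 and the 'Y' unit for timedelta64 conversions is no longer supported, so a current equivalent would be something like:
diff = df.iloc[0] - df.iloc[1]     # timedelta64[ns] Series
years = diff.dt.days / 365.2425    # approximate years as a float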
Say you have a pandas Series named time_difference with dtype timedelta64[ns].
One way of extracting just the days (or whatever attribute you need) is the following:
just_day = time_difference.apply(lambda x: pd.Timedelta(x).days)
The apply is used because a raw numpy.timedelta64 object does not have a days attribute.
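On modern pandas a timedelta64[ns] Series also exposes the .dt accessor directly, so just_day = time_difference.dt.days gives the same result without the apply.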
To convert any timedelta value into days, just use pd.Timedelta().days:
pd.Timedelta(1985, unit='Y').days
84494