pandas: changing timestamp in cell results in incorrect value

I have a pandas DataFrame where I am trying to change one of the cells' timestamps to a different timestamp. However, I'm not getting the result I expect. Here's what I have:
>>> df = pd.DataFrame({"d": [np.datetime64('2013-07-14T10:30:30.521Z'), np.datetime64('2013-07-21T10:30:30.521Z')]})
>>> df
d
0 2013-07-14 10:30:30.521000
1 2013-07-21 10:30:30.521000
>>> df.iloc[-1, df.columns.get_loc("d")] = np.datetime64('2013-08-29T10:30:30.521Z')
>>> df
d
0 2013-07-14 10:30:30.521000
1 1970-01-01 00:22:57.772230521
As the example illustrates, the timestamp I get for df.loc[1, "d"] is not the one I am assigning to that cell. I don't understand this behavior or where I'm going wrong. Is there some other way I need to be changing the value of a timestamp?
Edit: the above is just a simple example. My actual df has many columns, not just 1. I'm using pandas version 0.16.1 (and can't change the version).

Since your df has only one column, you can do it this way:
In [29]: df
Out[29]:
d
0 2013-07-14 10:30:30.521
1 2013-07-21 10:30:30.521
In [30]: df.iloc[-1] = pd.to_datetime('2013-08-29T10:30:30.521Z')
In [31]: df
Out[31]:
d
0 2013-07-14 10:30:30.521
1 2013-08-29 10:30:30.521
UPDATE: if you have multiple columns in your DF:
In [47]: df
Out[47]:
d a
0 2013-07-14 10:30:30.521 1
1 2013-07-21 10:30:30.521 1
In [48]: df.loc[df.index[-1], 'd'] = pd.to_datetime('2013-08-29T10:30:30.521Z')
In [49]: df
Out[49]:
d a
0 2013-07-14 10:30:30.521 1
1 2013-08-29 10:30:30.521 1
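As for why the original assignment produced 1970-01-01 00:22:57.772230521: that value is the Unix epoch plus the intended timestamp's millisecond count reinterpreted as nanoseconds, which suggests pandas 0.16.1 coerced the datetime64[ms] value through a raw integer on assignment. A minimal sketch of that suspicion (the internal coercion path is an assumption; the arithmetic, however, matches the question's output exactly):
import numpy as np
import pandas as pd

value = np.datetime64('2013-08-29T10:30:30.521Z')  # datetime64[ms]; the 'Z' suffix is deprecated in newer numpy
raw = value.astype('int64')                        # 1377772230521, i.e. milliseconds since epoch
print(pd.Timestamp(raw))                           # a bare integer is read as *nanoseconds* since epoch:
# 1970-01-01 00:22:57.772230521  <- the bad value from the question
Using pd.to_datetime / pd.Timestamp on the right-hand side, as in the answer above, avoids the raw-integer round trip.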

Related

Convert and replace a string value in a pandas df with its float type

I have a value in a pandas df which was accidentally stored as a string, as follows:
df.iloc[5329]['values']
'72,5'
I want to convert this value to float and replace it in the df. I have tried the following ways:
df.iloc[5329]['values'] = float(72.5)
also,
df.iloc[5329]['values'] = 72.5
and,
df.iloc[5329]['values'] = df.iloc[5329]['values'].replace(',', '.')
It runs successfully with a warning, but when I check the df, it's still stored as '72,5'.
The entire df at that index is as follows:
df.iloc[5329]
value 36.25
values 72,5
values1 72.5
currency MYR
Receipt Kuching, Malaysia
Delivery Male, Maldives
How can I solve that?
iloc needs a specific (row, col) position, i.e. both indices in a single call.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': np.random.choice(100, 3),
    'B': [15.2, '72,5', 3.7]
})
print(df)
df.info()
Output:
A B
0 84 15.2
1 92 72,5
2 56 3.7
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 3 non-null int64
1 B 3 non-null object
Update to value:
df.iloc[1,1] = 72.5
print(df)
Output:
A B
0 84 15.2
1 92 72.5
2 56 3.7
Make sure you don't use chained indexing (i.e. [][]) when assigning, since df.iloc[5329] makes a copy of the data and the subsequent assignment happens on that copy, not on the original df. Instead, assign in a single indexing call:
df.loc[5329, 'values'] = 72.5
or, purely by position:
df.iloc[5329, df.columns.get_loc('values')] = 72.5
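If more cells in that column have the same comma-decimal problem, here is a vectorized sketch that fixes the whole column at once (this assumes the column is really named 'values' as in the question; the regex=False flag needs pandas 0.23+):
# Cast to str first so already-numeric cells survive the replace,
# then swap the decimal comma and convert the whole column to float.
df['values'] = (
    df['values']
    .astype(str)
    .str.replace(',', '.', regex=False)
    .astype(float)
)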

How to change datetime to numeric discarding 0s at end [duplicate]

I have a dataframe in pandas called 'munged_data' with two columns, 'entry_date' and 'dob', which I have converted to Timestamps using pd.to_datetime. I am trying to figure out how to calculate people's ages based on the time difference between 'entry_date' and 'dob', and to do this I need the difference in days between the two columns (so that I can then do something like round(days/365.25)). I cannot seem to find a way to do this using a vectorized operation. When I do munged_data.entry_date - munged_data.dob I get the following:
internal_quote_id
2 15685977 days, 23:54:30.457856
3 11651985 days, 23:49:15.359744
4 9491988 days, 23:39:55.621376
7 11907004 days, 0:10:30.196224
9 15282164 days, 23:30:30.196224
15 15282227 days, 23:50:40.261632
However, I do not seem to be able to extract the days as an integer so that I can continue with my calculation.
Any help appreciated.
Using the pandas Timedelta type, available since v0.15.0, you can also do:
In[1]: import pandas as pd
In[2]: df = pd.DataFrame([ pd.Timestamp('20150111'),
pd.Timestamp('20150301') ], columns=['date'])
In[3]: df['today'] = pd.Timestamp('20150315')
In[4]: df
Out[4]:
date today
0 2015-01-11 2015-03-15
1 2015-03-01 2015-03-15
In[5]: (df['today'] - df['date']).dt.days
Out[5]:
0 63
1 14
dtype: int64
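Applied to the frame from the question, this gives the ages directly (a sketch assuming 'entry_date' and 'dob' are datetime64[ns] columns of munged_data, as described):
# Day difference as integers, then the rough year conversion from the question
days = (munged_data['entry_date'] - munged_data['dob']).dt.days
munged_data['age'] = (days / 365.25).round().astype(int)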
You need pandas 0.11 for this (0.11rc1 is out; the final release should land next week):
In [9]: df = DataFrame([ Timestamp('20010101'), Timestamp('20040601') ])
In [10]: df
Out[10]:
0
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [11]: df = DataFrame([ Timestamp('20010101'),
Timestamp('20040601') ],columns=['age'])
In [12]: df
Out[12]:
age
0 2001-01-01 00:00:00
1 2004-06-01 00:00:00
In [13]: df['today'] = Timestamp('20130419')
In [14]: df['diff'] = df['today']-df['age']
In [16]: df['years'] = df['diff'].apply(lambda x: float(x.item().days)/365)
In [17]: df
Out[17]:
age today diff years
0 2001-01-01 00:00:00 2013-04-19 00:00:00 4491 days, 00:00:00 12.304110
1 2004-06-01 00:00:00 2013-04-19 00:00:00 3244 days, 00:00:00 8.887671
You need this odd apply at the end because there is not yet full support for timedelta64[ns] scalars (e.g. like how we use Timestamps now for datetime64[ns]; coming in 0.12).
Not sure if you still need it, but in pandas 0.14 I usually use the .astype('timedelta64[X]') method; see the frequency conversion section of http://pandas.pydata.org/pandas-docs/stable/timeseries.html:
df = pd.DataFrame([ pd.Timestamp('20010101'), pd.Timestamp('20040605') ])
df.ix[0]-df.ix[1]
Returns:
0 -1251 days
dtype: timedelta64[ns]
(df.ix[0]-df.ix[1]).astype('timedelta64[Y]')
Returns:
0 -4
dtype: float64
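Since the question asks for days specifically, the same pattern works with 'D' (a sketch; note that .ix was removed in later pandas, where .loc is the replacement, and non-nanosecond astype casts are deprecated from pandas 2.0):
(df.ix[0]-df.ix[1]).astype('timedelta64[D]')
Returns:
0   -1251
dtype: float64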
Hope that helps.
Say you have a pandas Series named time_difference with dtype timedelta64[ns].
One way of extracting just the day (or whatever desired attribute) is the following:
just_day = time_difference.apply(lambda x: pd.tslib.Timedelta(x).days)
This function is used because the numpy.timedelta64 object does not have a 'days' attribute.
To convert any type of data into days just use pd.Timedelta().days:
pd.Timedelta(1985, unit='Y').days
84494

How to expand one row to multiple rows according to its value in Pandas

This is the DataFrame I have, for example. (The original post linked 'Before' and 'After' images; the code below reconstructs the 'Before' frame.)
Before:
d = {1: ['2134',20, 1,1,1,0], 2: ['1010',5, 1,0,0,0], 3: ['3457',15, 0,1,1,0]}
columns=['Code', 'Price', 'Bacon','Onion','Tomato', 'Cheese']
df = pd.DataFrame.from_dict(data=d, orient='index').sort_index()
df.columns = columns
What I want to do is expand a single row into multiple rows, so that the DataFrame looks like the 'After' image below. The intention is to use some of the columns (from 'Bacon' to 'Cheese') as categories.
After: (image link; the expected result is the expanded frame shown in the first answer's output below)
I tried to find the answer, but failed. Thanks.
You can first reshape with set_index and stack, then filter rows with query, call str.get_dummies on column level_2, reindex the columns to add back any that had no 1s, and finally reset_index:
df = df.set_index(['Code', 'Price']) \
.stack() \
.reset_index(level=2, name='val') \
.query('val == 1') \
.level_2.str.get_dummies() \
.reindex(columns=df.columns[2:], fill_value=0) \
.reset_index()
print (df)
Code Price Bacon Onion Tomato Cheese
0 2134 20 1 0 0 0
1 2134 20 0 1 0 0
2 2134 20 0 0 1 0
3 1010 5 1 0 0 0
4 3457 15 0 1 0 0
5 3457 15 0 0 1 0
You can use stack and transpose to do this operation and format accordingly.
df = df.stack().to_frame().T
df.columns = ['{}_{}'.format(*c) for c in df.columns]
Use pd.melt to put all the food in one column and then pd.get_dummies to expand the columns.
df1 = pd.melt(df, id_vars=['Code', 'Price'])
df1 = df1[df1['value'] == 1]
df1 = pd.get_dummies(df1, columns=['variable'], prefix='', prefix_sep='').sort_values(['Code', 'Price'])
df1 = df1.reindex(columns=df.columns, fill_value=0)
Edited after I saw how jezrael used reindex to both add and drop a column.

Get an index label as a string in Pandas

I have a pandas dataframe whose index labels are dates. I just want to get the starting date (i.e. the first entry) and the ending date (i.e. the last entry). What is the best way to do that?
Any help would be much appreciated.
You could use the index's format method. For example,
In [44]: df = pd.DataFrame({'foo':1}, index=pd.date_range('2000-1-1', periods=5, freq='D'))
In [45]: df
Out[45]:
foo
2000-01-01 1
2000-01-02 1
2000-01-03 1
2000-01-04 1
2000-01-05 1
[5 rows x 1 columns]
In [46]: df.index[[0,-1]].format()
Out[46]: ['2000-01-01', '2000-01-05']
To get the index of a dataframe as a list of strings, use:
df.index.format()
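If you only need the two endpoints, plain indexing plus strftime also works, since each element of a DatetimeIndex is a Timestamp (a small sketch on the frame above):
start = df.index[0].strftime('%Y-%m-%d')   # '2000-01-01'
end = df.index[-1].strftime('%Y-%m-%d')    # '2000-01-05'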

How to make first level of MultiIndex as the columns?

Say I have a MultiIndex dataframe like:
In [1]: arrays = [['one','one','one','two','two','two'],[1,2,3,1,2,3]]
In [2]: df = pa.DataFrame(randn(6,1),index=pa.MultiIndex.from_tuples(zip(*arrays)),columns=['A'])
In [3]: df
Out[3]:
A
one 1 0.229037
2 -1.640695
3 0.908127
two 1 -0.918750
2 1.170112
3 -2.620850
I would like to change this into a new dataframe whose columns come from the first level of the MultiIndex. Is there an easy way? (Below is an example of what I want, built by hand.)
In [12]: dft = df.ix['one']
In [13]: dft = dft.rename(columns={'A':'one'})
In [14]: dft['two'] = df.ix['two']['A']
In [15]: dft
Out[15]:
one two
1 0.229037 -0.918750
2 -1.640695 1.170112
3 0.908127 -2.620850
Perhaps you are looking for DataFrame.unstack:
In [56]: df
Out[56]:
A
one 1 0.229037
2 -1.640695
3 0.908127
two 1 -0.918750
2 1.170112
3 -2.620850
In [57]: df.unstack(level=0)
Out[57]:
A
one two
1 0.229037 -0.918750
2 -1.640695 1.170112
3 0.908127 -2.620850
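Note that the result above still carries 'A' as the outer level of the column MultiIndex; selecting it gives exactly the frame asked for (a short sketch on the same df):
In [58]: df.unstack(level=0)['A']
Out[58]:
one two
1 0.229037 -0.918750
2 -1.640695 1.170112
3 0.908127 -2.620850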
Just to add something to this: there is another option for turning a MultiIndex into columns, the reset_index() function. The difference is that it simply "pops" the index levels out as new columns. It depends on your use case:
In [5]: df
Out[5]:
A
one 1 -1.598591
2 -0.354813
3 -0.435924
two 1 1.408328
2 0.448303
3 0.381360
In [6]: df.reset_index()
Out[6]:
level_0 level_1 A
0 one 1 -1.598591
1 one 2 -0.354813
2 one 3 -0.435924
3 two 1 1.408328
4 two 2 0.448303
5 two 3 0.381360
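The generic level_0/level_1 names appear because the index levels are unnamed; naming them first gives friendlier columns (a sketch, assuming the same df):
In [7]: df.index = df.index.set_names(['letter', 'number'])
In [8]: df.reset_index()
Out[8]:
letter number A
0 one 1 -1.598591
1 one 2 -0.354813
2 one 3 -0.435924
3 two 1 1.408328
4 two 2 0.448303
5 two 3 0.381360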