How do I sort column by targeting a specific number within that cell? - pandas

I would like to use Pandas Python to sort a specific column by date (more specifically the year). However, the year is buried within a bunch of other numbers. How do I just target the 2 digits that I need?
In the example below, I want to sort this column by the numbers [16,14,15...] rather than considering all the numbers in that row.
3/18/16 11:46
6/19/14 14:58
7/27/15 14:22
8/3/15 12:59
2/20/13 12:33
9/27/16 12:08
7/27/15 14:22

Given a dataframe like this,
date
0 3/18/16
1 6/19/14
2 7/27/15
3 8/3/15
4 2/20/13
5 9/27/16
6 7/27/15
You can convert the date column to datetime format and then sort.
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(by = 'date')
The resulting dataframe
date
4 2013-02-20
1 2014-06-19
2 2015-07-27
6 2015-07-27
3 2015-08-03
0 2016-03-18
5 2016-09-27

Related

Difference in weeks between first occurance and current occurance

I have a large dataset including items and dates. The simplified version looks as follows:
Item
Date
1
2018-10-01
2
2018-04-03
1
2018-10-16
2
2018-04-15
1
2018-10-20
2
2018-04-30
I want to add the column df['ItemAge'], displaying the number of weeks between the date of the occurrence of the item and the date of the first date of occurrence of the item. Here, the number of weeks is rounded to whole integers. The value of this variable is 0 on the date that the item first occurred. Hence I want to obtain:
Item
Date
ItemAge
1
2018-10-01
0
2
2018-04-03
0
1
2018-10-16
2
2
2018-04-15
2
1
2018-10-20
3
2
2018-04-30
4
I am thinking about creating the variable StartDate for each item as the date of the first occurrence of each item. Subsequently looping over the items and taking the difference in days between the new occurrence of the item and its StartDate. Then dividing this number of days by 7 and rounding to whole integers.
However, I don't know how to write this code in python. Does anyone have any suggestions?
You can convert to datetime, groupby Item and transform to take the first date per group. Then get the number of days by dividing the resulting TimeDelta by 7:
d = pd.to_datetime(df['Date'])
df['ItemAge'] = ((d-d.groupby(df['Item']).transform('first')).dt.days/7).round().astype(int)
output:
Item Date ItemAge
0 1 2018-10-01 0
1 2 2018-04-03 0
2 1 2018-10-16 2
3 2 2018-04-15 2
4 1 2018-10-20 3
5 2 2018-04-30 4

How to join columns in Julia?

I have opened a dataframe in julia where i have 3 columns like this:
day month year
1 1 2011
2 4 2015
3 12 2018
how can I make a new column called date that goes:
day month year date
1 1 2011 1/1/2011
2 4 2015 2/4/2015
3 12 2018 3/12/2018
I was trying with this:
df[!,:date]= df.day.*"/".*df.month.*"/".*df.year
but it didn't work.
in R i would do:
df$date=paste(df$day, df$month, df$year, sep="/")
is there anything similar?
thanks in advance!
Julia has an inbuilt Date type in its standard library:
julia> using Dates
julia> df[!, :date] = Date.(df.year, df.month, df.day)
3-element Vector{Date}:
2011-01-01
2015-04-02
2018-12-03

How to concatenate a dataframe to a multiindex main dataframe along columns

I have tried a few answers but was not able to get the desired result in my case.
I am working with stocks data.
I have a list ['3MINDIA.NS.csv', 'AARTIDRUGS.NS.csv', 'AARTIIND.NS.csv', 'AAVAS.NS.csv', 'ABB.NS.csv']
for every stock in the list I get an output which contains trades and related info.. it looks something like this:
BUY SELL profits rel_profits
0 2004-01-13 2004-01-27 -44.200012 -0.094606
1 2004-02-05 2004-02-16 18.000000 0.044776
2 2005-03-08 2005-03-11 25.000000 0.048077
3 2005-03-31 2005-04-01 13.000000 0.025641
4 2005-10-11 2005-10-26 -20.400024 -0.025342
5 2005-10-31 2005-11-04 67.000000 0.095578
6 2006-05-22 2006-06-05 -55.100098 -0.046693
7 2007-03-06 2007-03-14 3.000000 0.001884
8 2007-03-19 2007-03-28 41.500000 0.028222
9 2007-07-31 2007-08-14 69.949951 0.038224
10 2008-01-24 2008-02-05 25.000000 0.013055
11 2009-11-04 2009-11-05 50.000000 0.031250
12 2010-12-10 2010-12-15 63.949951 0.018612
13 2011-02-02 2011-02-15 -53.050049 -0.015543
14 2011-09-30 2011-10-07 74.799805 0.018181
15 2015-12-09 2015-12-18 -215.049805 -0.019523
16 2016-01-18 2016-02-01 -475.000000 -0.046005
17 2016-11-16 2016-11-30 -1217.500000 -0.096877
18 2018-03-26 2018-04-02 0.250000 0.000013
19 2018-05-22 2018-05-25 250.000000 0.012626
20 2018-06-05 2018-06-12 101.849609 0.005361
21 2018-09-25 2018-10-10 -2150.000000 -0.090717
22 2021-01-27 2021-02-03 500.150391 0.024638
23 2021-06-30 2021-07-07 393.000000 0.016038
24 2021-08-12 2021-08-13 840.000000 0.035279
25 NaN NaN -1693.850281 0.995277
# note: every dataframe will have a last row with NaN values in buy, sell columns
# each datafram has different number of rows
Now I tried to add an extra level of index to this dataframe like this:
symbol = name of the stock from given list for ex. for 3MINDIA.NS.csv symbol is 3MINDIA
trades.columns = pd.MultiIndex.from_product([[symbol], trades.columns])
after this I tried to concatenate each trades dataframe that is generated in the loop to a main dataframe using:
result_df = pd.concat([result_df, trades], axis=1)
# I am trying to do this so that Whenever
I call result_df[symbol] I should be able
to see the trade dates for that particular symbol.
But I get a result_df that has lot of NaN values because each trades dataframe has variable number of rows in it.
IS there any way I can combine trades dataframes along the columns with stock symbol as higher level index and not get all the NaN values in my result_df
result_df I got
So I found a way to get what I wanted.
first I added this code in loop
trades = pd.concat([trades], keys=[symbol], names=['Stocks'])
after this I used concatenate again on result_df and trades
# Desired Result
result_df = pd.concat([result_df, trades], axis=0, ignore_index=False)
And BAM!!! This is exactly what I wanted

Insert items from MultiIndexed dataframe into regular dataframe based on time

I have this regular dataframe indexed by 'Date', called ES:
Price Day Hour num_obs med abs_med Ret
Date
2006-01-03 08:30:00 1260.583333 1 8 199 1260.416667 0.166667 0.000364
2006-01-03 08:35:00 1261.291667 1 8 199 1260.697917 0.593750 0.000562
2006-01-03 08:40:00 1261.125000 1 8 199 1260.843750 0.281250 -0.000132
2006-01-03 08:45:00 1260.958333 1 8 199 1260.895833 0.062500 -0.000132
2006-01-03 08:50:00 1261.214286 1 8 199 1260.937500 0.276786 0.000203
I have this other dataframe indexed by the following MultiIndex. The first index goes from 0 to 23 and the second index goes from 0 to 55. In other words we have daily 5 minute increment data.
5min_Ret
0 0 2.235875e-06
5 9.814064e-07
10 -1.453213e-06
15 4.295757e-06
20 5.884896e-07
25 -1.340122e-06
30 9.470660e-06
35 1.178204e-06
40 -1.111621e-05
45 1.159005e-05
50 6.148861e-06
55 1.070586e-05
1 0 1.485287e-05
5 3.018576e-06
10 -1.513273e-05
15 -1.105312e-05
20 3.600874e-06
...
I want to create a column in the original dataframe, ES, that has the appropriate '5min_Ret' at each appropriate hour/5minute combo.
I've tried multiple things: looping over rows, finding some apply function. But nothing has worked so far. I feel like I'm overlooking a simple and Pythonic solution here.
The expected output creates a new column called '5min_ret' to the original dataframe in which each row corresponds to the correct hour/5minute pair from the smaller dataframe containing the 5min_ret
Price Day Hour num_obs med abs_med Ret 5min_ret
Date
2006-01-03 08:30:00 1260.583333 1 8 199 1260.416667 0.166667 0.000364 xxxx
2006-01-03 08:35:00 1261.291667 1 8 199 1260.697917 0.593750 0.000562 xxxx
2006-01-03 08:40:00 1261.125000 1 8 199 1260.843750 0.281250 -0.000132 xxxx
2006-01-03 08:45:00 1260.958333 1 8 199 1260.895833 0.062500 -0.000132 xxxx
2006-01-03 08:50:00 1261.214286 1 8 199 1260.937500 0.276786 0.000203 xxxx
I think one way is to use merge on hour and minute. First create a column 'min' in ES from the datetimeindex such as:
ES['min'] = ES.index.minute
Now you can merge with your multiindex DF containing the column '5min_Ret' that I named df_multi such as:
ES = ES.merge(df_multi.reset_index(), left_on = ['hour','min'],
right_on = ['level_0','level_1'], how='left')
Here you merge on 'hour' and 'min' from ES with 'level_0' and 'level_1', which are created from your multiindex of df_multi when you do reset_index, and on the value of the left df (being ES)
You should get a new column in ES named '5min_Ret' with the value you are looking for. You can drop the colum 'min' if you don't need it anymore by ES = ES.drop('min',axis=1)

When plotting a dataframe, how to set the x-range for a 'YYYY-MM' value

I have a pandas df with the below values. I can create a nifty chart that looks like the following:
import matplotlib.pyplot as plt
ax = pdf_month.plot(x="month", y="count", kind="bar")
plt.show()
I want to truncate the date range (to ignore 1900-01-01 and other months that not import, but everytime I try I get error messages (see below). The date range would be something like '2016-01' to '2018-04'
ax.set_xlim(pdf_month['month'][17],pdf_date['count'].values.max())
where pdf_month['month'][17] gives you a value of u'2017-01'.
pdf_month.printSchema
root
|-- month: string (nullable = true)
|-- count: long (nullable = false)
How do I set the range on the month values for a x-value that isn't really an int or a date. I still have the original, pre-grouped dates. Is there a better way to group by month that would allow you to customize the x-axis?
error messages:
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
sample output of pd_month
month count
0 1900-01 353
1 2015-09 1
2 2015-10 2
3 2015-11 2
4 2015-12 1
5 2016-01 1
6 2016-02 1
7 2016-03 3
8 2016-04 2
9 2016-05 5
10 2016-06 7
11 2016-07 13
12 2016-08 12
13 2016-09 41
14 2016-10 19
15 2016-11 17
16 2016-12 20
You can try Series date indexing, Pandas Series allow for date slicing as follows:
df.month['2016-01': '2018-04']
This works with datetime indexes.