Custom index for dataframe - pandas

I have a dataframe:
1/3/2011 4/1/2021 3/31/2021 ... 1/6/2011 1/5/2011 1/4/2011
0 0.424247 0.209943 0.209942 ... 0.494431 0.478675 0.442692
1 0.428014 0.210241 0.210241 ... 0.499361 0.484653 0.446848
2 1.127805 0.356723 0.362666 ... 1.332204 1.339620 1.178887
3 2.220437 0.913629 0.949701 ... 2.366291 1.942443 2.209880
4 3.213573 1.556866 1.624042 ... 3.312162 3.151239 3.213159
5 4.025311 2.030883 2.105356 ... 4.069105 4.866949 4.019084
6 4.555883 2.315257 2.397099 ... 4.601537 4.717146 4.527070
7 4.803622 2.447491 2.529781 ... 4.847289 4.939370 4.784244
8 4.982618 2.528838 2.605640 ... 5.006495 5.110961 4.936500
9 5.049444 2.548326 2.641850 ... 5.096861 5.214397 5.045123
10 5.183262 2.588166 2.668743 ... 5.207816 5.301237 5.166312
11 5.227674 2.624253 2.703215 ... 5.276020 5.367465 5.241496
12 5.247942 2.633068 2.716046 ... 5.318237 5.399149 5.286811
13 5.274951 2.615380 2.701839 ... 5.343966 5.393002 5.310892
14 5.267717 2.588757 2.676699 ... 5.332641 5.356331 5.299973
15 5.207645 2.557645 2.644474 ... 5.271866 5.287435 5.244266
16 5.097226 2.522334 2.605892 ... 5.163959 5.188291 5.145620
17 4.987161 2.486294 2.567273 ... 5.056751 5.091500 5.046471
18 4.893541 2.450168 2.530107 ... 4.965447 5.007443 4.959943
19 4.823184 2.414322 2.494957 ... 4.896898 4.942282 4.892097
20 4.782205 2.378828 2.462084 ... 4.857204 4.901275 4.848124
21 4.769757 2.344163 2.431462 ... 4.845584 4.883970 4.827838
22 4.754436 2.310346 2.400372 ... 4.830733 4.863281 4.805725
23 4.727185 2.277665 2.368254 ... 4.803574 4.831505 4.774819
24 4.684759 2.246143 2.334896 ... 4.760727 4.785601 4.732571
25 4.627248 2.216407 2.300755 ... 4.702254 4.726026 4.679888
26 4.562172 2.190393 2.268304 ... 4.635846 4.660460 4.624084
27 4.530679 2.175844 2.248211 ... 4.603836 4.629887 4.602763
28 4.550864 2.175709 2.244352 ... 4.624965 4.652757 4.633093
29 4.632196 2.191131 2.258115 ... 4.709154 4.739082 4.724527
30 4.787292 2.223629 2.291247 ... 4.869629 4.902260 4.889937
31 4.893589 2.246091 2.314864 ... 4.979662 5.013891 5.001563
I want to change the index so that the first one is 3M USD 0Y1Y, the next one is 3M USD 1Y1Y, then 3M USD 2Y1Y, and so on, but I am not sure how to do this.

I'm not sure if this really answers your question, but you should be able to do this with a list comprehension:
df.index = [f"3M USD {x}Y1Y" for x in range(len(df))]


checking for duplicates in pandas data frame

import pandas as pd
from io import StringIO
import requests
import matplotlib.pyplot as plt
import numpy as np
from scipy.interpolate import make_interp_spline
url = 'https://m-selig.ae.illinois.edu/ads/coord/b737a.dat'
response = requests.get(url).text
lines = []
for idx, line in enumerate(response.split('\n'), start=1):
    if all([x.replace('.', '').replace('-', '').isdecimal() for x in line.split()]):
        lines.append(line)
lines = [x.split() for x in lines]
df = pd.DataFrame(lines)
df = df.dropna(axis=0)
df = df.astype(float)
df = df[~(df > 1).any(axis=1)]
print(df)
output...
0 1
2 0.0000 0.0177
3 0.0023 0.0309
4 0.0050 0.0372
5 0.0076 0.0415
6 0.0143 0.0499
7 0.0249 0.0582
8 0.0495 0.0730
9 0.0740 0.0814
10 0.0990 0.0866
11 0.1530 0.0907
12 0.1961 0.0905
13 0.2504 0.0887
14 0.3094 0.0858
15 0.3520 0.0833
16 0.3919 0.0804
17 0.4477 0.0756
18 0.5034 0.0696
19 0.5593 0.0626
20 0.5965 0.0575
21 0.6488 0.0498
22 0.8351 0.0224
23 0.9109 0.0132
24 1.0000 0.0003
26 0.0000 0.0177
27 0.0022 0.0038
28 0.0049 -0.0018
29 0.0072 -0.0053
30 0.0119 -0.0106
31 0.0243 -0.0204
32 0.0486 -0.0342
33 0.0716 -0.0457
34 0.0979 -0.0516
35 0.1488 -0.0607
36 0.1953 -0.0632
37 0.2501 -0.0632
38 0.2945 -0.0626
39 0.3579 -0.0610
40 0.3965 -0.0595
41 0.4543 -0.0563
42 0.5050 -0.0527
43 0.5556 -0.0482
44 0.6063 -0.0427
45 0.6485 -0.0375
46 0.8317 -0.0149
47 0.9410 -0.0053
48 1.0000 -0.0003
This is my code for scraping data from a website. I'm running into a problem where the x points start from zero, go up, and come back down to zero, creating a line in the middle of the plot which I don't need.
Notice how there are two rows where df[0] is 0 (at index 2 and 26). How can I write code that detects these duplicates?
Try one of the following?
Out of the loop
df1=df.drop_duplicates(keep='first', inplace=False, ignore_index=False)
Inside your loop
lines = []
lines1 = []
for idx, line in enumerate(response.split('\n'), start=1):
    if all([x.replace('.', '').replace('-', '').isdecimal() for x in line.split()]):
        if line not in lines1:
            lines.append(line)
        lines1.append(line)
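If what you really want is to drop rows whose x-value (column 0) has already been seen, rather than only fully identical rows, a subset-based variant may be closer (a sketch against the df built above; note it also removes the second x == 1.0000 row, whose y differs slightly):
# keep only the first occurrence of each x-value in column 0
df1 = df.drop_duplicates(subset=[0], keep='first')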

Statsample-glm gem IndexError: Specified vector y does not exist

I am trying to create a Poisson regression for some school performance data, and this seems like the best gem so far.
Going through the practice analysis from this post, I run into this error:
irb(main):001:0> require 'daru'
require 'statsample-glm'
=> false
=> false
irb(main):003:0> data_set = Daru::DataFrame.from_csv "logistic_mle.csv"
=> #<Daru::DataFrame(200x4)>
a b c y
0 0.75171213 -3.2683591 1.70092606 0
1 0.55421406 -2.9565972 2.66368360 0
2 -1.8533164 -2.8293733 3.34679611 0
3 -2.8861015 -0.7389824 4.74970154 0
4 -2.6055309 0.56102031 5.48308397 0
5 -4.2735321 1.62383436 5.35813425 0
6 -4.7701259 1.22025583 6.41070111 0
7 -6.9231483 2.86547174 8.73185919 0
8 -7.5641950 4.94028695 8.94193466 0
9 -8.6309366 4.27420502 9.27002100 0
10 -8.9911114 5.10389362 11.7669513 0
11 -9.9905763 7.87484596 12.4794035 0
12 -10.381878 8.84300238 13.7498993 0
13 -11.047682 9.44613324 13.5025027 0
14 -12.434424 9.70515870 15.1221173 0
15 -13.627294 10.4190343 16.3289942 0
16 -15.620222 11.3788332 17.7367653 0
17 -16.292239 13.1516565 18.6939344 0
18 -16.715913 14.9076297 18.0246863 0
19 -17.950125 15.8533651 20.6826094 0
20 -18.989884 15.4331557 20.9101142 0
21 -19.908508 16.8542366 22.0721145 0
22 -21.146652 18.6785324 23.4977598 0
23 -21.367574 18.3208056 23.9121114 0
24 -22.131396 20.7616214 24.1683442 0
25 -23.163631 21.1293492 25.2695476 0
26 -24.136076 21.7035705 27.9161820 0
27 -25.386072 23.3588003 27.8755285 0
28 -27.254627 24.9201403 28.9810564 0
29 -28.845061 25.1681854 29.6749936 0
... ... ... ... ...
irb(main):004:0> glm = Statsample::GLM.compute data_set, :y, :logistic, {constant: 1, algorithm: :mle}
Traceback (most recent call last):
1: from (irb):4
IndexError (Specified vector y does not exist)
Further inspection of the error reveals this:
Caused by:
IndexError: Specified index :y does not exist
I've tried reformatting the header to "date" instead of "string", based on a comment in a marginally related Stack Overflow post, with no change in the error.
Any thoughts from the SO community?
Sorry SO, I posted this too quickly. I found a solution that works:
Instead of
data_set = Daru::DataFrame.from_csv "logistic_mle.csv"
This line works:
data_set = Daru::DataFrame.from_csv("logistic_mle.csv", headers: true, header_converters: :symbol)
My understanding is that Statsample::GLM looks the dependent vector up by the symbol :y, while by default the CSV headers come in as plain strings, so the lookup fails until the headers are converted to symbols.

I am sure that the type of "items_tmp_dic2" is dict, so why does it report this error?

import pandas as pd
import numpy as np
path = 'F:/datasets/kaggle/predict_future_sales/'
train_raw = pd.read_csv(path + 'sales_train.csv')
items = pd.read_csv(path + 'items.csv')
item_category_id = items['item_category_id']
item_id = train_raw.item_id
train_raw.head()
date date_block_num shop_id item_id item_price item_cnt_day
0 02.01.2013 0 59 22154 999.00 1.0
1 03.01.2013 0 25 2552 899.00 1.0
2 05.01.2013 0 25 2552 899.00 -1.0
3 06.01.2013 0 25 2554 1709.05 1.0
4 15.01.2013 0 25 2555 1099.00 1.0
items.head()
item_name item_id item_category_id
0 ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D 0 40
1 !ABBYY FineReader 12 Professional Edition Full... 1 76
2 ***В ЛУЧАХ СЛАВЫ (UNV) D 2 40
3 ***ГОЛУБАЯ ВОЛНА (Univ) D 3 40
4 ***КОРОБКА (СТЕКЛО) D 4 40
Then I want to add an "item_category_id" column to train_raw, taken from the items data, so I want to create a dict mapping item_id to item_category_id.
item_category_id = items['item_category_id']
item_id = train_raw.item_id
items_tmp = items.drop(['item_name'],axis=1)
items_tmp_dic = items_tmp.to_dict('split')
items_tmp_dic = items_tmp_dic.get('data')
items_tmp_dic2 = dict(items_tmp_dic)
ic_id = []
for i in np.nditer(item_id.values[:10]):
    ic_id.append(items_tmp_dic2.get(i))
print(len(ic_id))
The error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-50-be637620ea6d> in <module>
6 ic_id = []
7 for i in np.nditer(item_id.values[:10]):
----> 8 ic_id.append(items_tmp_dic2.get(i))
9 print(len(ic_id))
TypeError: unhashable type: 'numpy.ndarray'
but when I run
for i in np.nditer(item_id.values[:10]):
    print(i)
I get
22154
2552
2552
2554
2555
2564
2565
2572
2572
2573
I have ensured that the type of "items_tmp_dic2" is dict, so why does this happen?
I have solved it by using int(). The cause: np.nditer yields zero-dimensional numpy arrays rather than plain Python ints, and ndarrays are unhashable, so they cannot be used as dict keys; int(i) converts each element into a hashable key:
for i in np.nditer(item_id.values[:10]):
    ic_id.append(items_tmp_dic2.get(int(i)))
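As a side note, the loop and the intermediate dict can be avoided entirely; pandas can build and apply the mapping itself (a sketch using the column names from the question):
# build an item_id -> item_category_id mapping and apply it vectorized
mapping = items.set_index('item_id')['item_category_id']
train_raw['item_category_id'] = train_raw['item_id'].map(mapping)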

Why does changing "Date" column to datetime ruin graph?

I have a dataframe with financial data in it (Date, Open, Close, Low, High).
I want to graph the date versus one column (eg. Open). When I convert the "Date" column to a date, the graph does not display correctly. Why could this be happening?
import pandas as pd
import matplotlib.pyplot as plt
def plot_one_data(df, column, title="No Title", color="black"):
    df.plot(x="Date", y=column, title=title)
    # save the plot to a file
    filename = title + ".png"
    plt.savefig(filename)
#Load in the csv files
df = pd.read_csv('data/df.csv')
print(df.head())
plot_one_data(df, "Open", "Before Converting to Date - good but no dates on x axis")
df['Date'] = pd.to_datetime(df["Date"])
print(df.head())
plot_one_data(df, "Open", "After Converting to Date - bad!")
Text Output:
Date Open ... Adj Close Volume
0 14-08-06 1266.670044 ... 1268.209961 2118020000
1 15-08-06 1268.189941 ... 1285.579956 2334100000
2 16-08-06 1285.270020 ... 1295.430054 2554570000
3 17-08-06 1295.369995 ... 1297.479980 2458340000
4 18-08-06 1297.479980 ... 1302.300049 2033910000
[5 rows x 7 columns]
Date Open ... Adj Close Volume
0 2006-08-14 1266.670044 ... 1268.209961 2118020000
1 2006-08-15 1268.189941 ... 1285.579956 2334100000
2 2006-08-16 1285.270020 ... 1295.430054 2554570000
3 2006-08-17 1295.369995 ... 1297.479980 2458340000
4 2006-08-18 1297.479980 ... 1302.300049 2033910000
Solution - add format
Updated line
df['Date'] = pd.to_datetime(df["Date"], format='%d-%m-%y')
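If you would rather not hard-code a full format string, the same ambiguity can also be resolved with pandas' dayfirst flag (a sketch of the alternative):
df['Date'] = pd.to_datetime(df["Date"], dayfirst=True)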
I found that looking at the first 30 entries (before and after changing to datetime) helped. It works until the end of August 2006, then jumps to the 9th of January! Without an explicit format, pandas guesses the order for each value: 14-08-06 can only be day-first (there is no month 14), but an ambiguous value like 01-09-06 is parsed month-first, turning the 1st of September into the 9th of January.
First 30 entries in original data
Date Open ... Adj Close Volume
0 14-08-06 1266.670044 ... 1268.209961 2118020000
1 15-08-06 1268.189941 ... 1285.579956 2334100000
2 16-08-06 1285.270020 ... 1295.430054 2554570000
3 17-08-06 1295.369995 ... 1297.479980 2458340000
4 18-08-06 1297.479980 ... 1302.300049 2033910000
5 21-08-06 1302.300049 ... 1297.520020 1759240000
6 22-08-06 1297.520020 ... 1298.819946 1908740000
7 23-08-06 1298.729980 ... 1292.989990 1893670000
8 24-08-06 1292.969971 ... 1296.060059 1930320000
9 25-08-06 1295.920044 ... 1295.089966 1667580000
10 28-08-06 1295.089966 ... 1301.780029 1834920000
11 29-08-06 1301.569946 ... 1304.280029 2093720000
12 30-08-06 1303.699951 ... 1305.369995 2060690000
13 31-08-06 1304.250000 ... 1303.819946 1974540000 #Smooth change from August to September
14 01-09-06 1303.800049 ... 1311.010010 1800520000
15 05-09-06 1310.939941 ... 1313.250000 2114480000
16 06-09-06 1313.040039 ... 1300.260010 2329870000
17 07-09-06 1300.209961 ... 1294.020020 2325850000
18 08-09-06 1294.020020 ... 1298.920044 2132890000
19 11-09-06 1298.859985 ... 1299.540039 2506430000
20 12-09-06 1299.530029 ... 1313.000000 2791580000
21 13-09-06 1312.739990 ... 1318.069946 2597220000
22 14-09-06 1318.000000 ... 1316.280029 2351220000
23 15-09-06 1316.280029 ... 1319.660034 3198030000
24 18-09-06 1319.849976 ... 1321.180054 2325080000
25 19-09-06 1321.170044 ... 1317.640015 2390850000
26 20-09-06 1318.280029 ... 1325.180054 2543070000
27 21-09-06 1324.890015 ... 1318.030029 2627440000
28 22-09-06 1318.030029 ... 1314.780029 2162880000
29 25-09-06 1314.780029 ... 1326.369995 2710240000
First 30 entries after changing to datetime
[30 rows x 7 columns]
Date Open ... Adj Close Volume
0 2006-08-14 1266.670044 ... 1268.209961 2118020000
1 2006-08-15 1268.189941 ... 1285.579956 2334100000
2 2006-08-16 1285.270020 ... 1295.430054 2554570000
3 2006-08-17 1295.369995 ... 1297.479980 2458340000
4 2006-08-18 1297.479980 ... 1302.300049 2033910000
5 2006-08-21 1302.300049 ... 1297.520020 1759240000
6 2006-08-22 1297.520020 ... 1298.819946 1908740000
7 2006-08-23 1298.729980 ... 1292.989990 1893670000
8 2006-08-24 1292.969971 ... 1296.060059 1930320000
9 2006-08-25 1295.920044 ... 1295.089966 1667580000
10 2006-08-28 1295.089966 ... 1301.780029 1834920000
11 2006-08-29 1301.569946 ... 1304.280029 2093720000
12 2006-08-30 1303.699951 ... 1305.369995 2060690000
13 2006-08-31 1304.250000 ... 1303.819946 1974540000 #Fine until here
14 2006-01-09 1303.800049 ... 1311.010010 1800520000 #Problem here
15 2006-05-09 1310.939941 ... 1313.250000 2114480000
16 2006-06-09 1313.040039 ... 1300.260010 2329870000
17 2006-07-09 1300.209961 ... 1294.020020 2325850000
18 2006-08-09 1294.020020 ... 1298.920044 2132890000
19 2006-11-09 1298.859985 ... 1299.540039 2506430000
20 2006-12-09 1299.530029 ... 1313.000000 2791580000
21 2006-09-13 1312.739990 ... 1318.069946 2597220000
22 2006-09-14 1318.000000 ... 1316.280029 2351220000
23 2006-09-15 1316.280029 ... 1319.660034 3198030000
24 2006-09-18 1319.849976 ... 1321.180054 2325080000
25 2006-09-19 1321.170044 ... 1317.640015 2390850000
26 2006-09-20 1318.280029 ... 1325.180054 2543070000
27 2006-09-21 1324.890015 ... 1318.030029 2627440000
28 2006-09-22 1318.030029 ... 1314.780029 2162880000
29 2006-09-25 1314.780029 ... 1326.369995 2710240000
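A quick way to see the ambiguity in isolation (a sketch):
import pandas as pd
print(pd.to_datetime('01-09-06'))                     # 2006-01-09 (month first)
print(pd.to_datetime('01-09-06', format='%d-%m-%y'))  # 2006-09-01 (day first)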

Sub-group finding

I have records (rows) of this kind:
0 1 4 8 2 3 7 9 3 4 8 9 4 7 9 1 0 0 2 5 8 2 4 5 6 1 0 2 4 8 9 0
Definitions:
group: collection of numbers which are separated by 0-s (zeros)
sub-group: collection of numbers which are separated by local minima in the groups
local minimum: the numbers before and after it are greater
In the above example there are 3 groups and 7 sub-groups, i.e.
groups: 1 4 8 2 3 7 9 3 4 8 9 4 7 9 1 , 2 5 8 2 4 5 6 1 , 2 4 8 9
sub-groups: 1 4 8 , 3 7 9 , 4 8 9 , 7 9 1 , 2 5 8 , 4 5 6 1 , 2 4 8 9 (this last is identical to the group itself)
So, in records of this kind I have to:
find the minima (print out: 2, 3, 4, 2)
find the size (number of characters) of these sub-groups
find the positions of the numbers of the sub-groups in the groups
I have already started to write something, but I am stuck here...
Can anyone help me to solve this?
Here is the code so far:
#!/usr/bin/awk -f
{
    db = split($0, a, /( 0)+ */)
    for (i=1; i<=db; i++) {
        split_at_max(a[i])
        for (j=1; j<=ret_count; j++) {
            print ""
            for (k=1; k<=maximums[j]; k++) {
                print ret[j,k]
            }
        }
    }
}

function split_at_max(x) {
    m_db = split(x, values, " ")
    for (mx in ret) {
        delete ret[mx]
    }
    ret_count = 1
    ret_curr_db = 0
    for (mi=2; mi<m_db; mi++) {
        ret_curr_db++
        ret[ret_count, ret_curr_db] = values[mi-1]
        if ( (values[mi-1] <= values[mi]) &&
             (values[mi] >= values[mi+1]) &&
             (values[mi+1] <= values[mi+2]) ) {
            maximums[ret_count] = ret_curr_db
            ret_count++
            ret_curr_db = 0
        }
    }
    ret_curr_db++
    ret[ret_count, ret_curr_db] = values[mi-1]
    ret_curr_db++
    ret[ret_count, ret_curr_db] = values[mi]
    maximums[ret_count] = ret_curr_db
}
Interesting assignment. I wrote a quick and dirty awk script; there should be a lot of room to optimize. I don't know what kind of output you are expecting...
awk -v RS="0" 'NF>1{
delete g;
print "group:";
for(i=1;i<=NF;i++){
printf $i" ";
g[i]=$i
}
print "";
t=1;
delete m;
for(i=2;i<length(g);i++){
if(g[i-1]>g[i] && g[i]<g[i+1]) {
print "found minima:"g[i]
m[t]=i;
t++;
}
}
if(length(m)>0){
s=0;
for(x=1;x<=length(m);x++){
printf "sub-group: "
for(i=s+1;i<m[x];i++){
printf g[i]" "
s=m[x];
}
print "";
if(x+1>length(m)){
printf "sub-group: ";
for(i=s+1;i<=length(g);i++)
printf g[i]" "
print "";
}
}
}else{
print "no minima found. sub-group is the same as group:"
printf "sub-group: "
for(i=1;i<=NF;i++){
printf $i" ";
g[i]=$i
}
}
print "\n-----------------------------"
} yourFile
the output on your example input:
group:
1 4 8 2 3 7 9 3 4 8 9 4 7 9 1
found minima:2
found minima:3
found minima:4
sub-group: 1 4 8
sub-group: 3 7 9
sub-group: 4 8 9
sub-group: 7 9 1
-----------------------------
group:
2 5 8 2 4 5 6 1
found minima:2
sub-group: 2 5 8
sub-group: 4 5 6 1
-----------------------------
group:
2 4 8 9
no minima found. sub-group is the same as group:
sub-group: 2 4 8 9
-----------------------------
update
Fixing for those "special" elements like 20, 30, 40... (with RS="0" awk would also split inside them). Still quick and dirty: change my awk invocation above to
sed 's/^0$//g' yourFile | awk -v RS="" '[following code is the same as above]'
then the output is:
group:
6 63 81 31 37 44 20
found minima:31
sub-group: 6 63 81
sub-group: 37 44 20
-----------------------------
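For comparison, here is a minimal Python sketch of the same logic (split the record on zeros, then cut each group after every local minimum, dropping the minimum itself as in the example above); an illustration alongside the awk answer, not a replacement for it:
def split_groups(record):
    nums = [int(x) for x in record.split()]
    # groups: maximal runs of non-zero numbers
    groups, cur = [], []
    for n in nums:
        if n == 0:
            if cur:
                groups.append(cur)
            cur = []
        else:
            cur.append(n)
    if cur:
        groups.append(cur)
    # sub-groups: cut each group at its local minima (minimum excluded)
    result = []
    for g in groups:
        subs, start = [], 0
        for i in range(1, len(g) - 1):
            if g[i - 1] > g[i] and g[i] < g[i + 1]:
                subs.append(g[start:i])
                start = i + 1
        subs.append(g[start:])
        result.append((g, subs))
    return result

record = "0 1 4 8 2 3 7 9 3 4 8 9 4 7 9 1 0 0 2 5 8 2 4 5 6 1 0 2 4 8 9 0"
for group, subs in split_groups(record):
    print("group:", group)
    for s in subs:
        print("  sub-group:", s)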