Statsample-glm gem IndexError: Specified vector y does not exist

I am trying to create a Poisson regression for some school performance data, and this seems like the best gem so far.
Going through the practice analysis from this post, I run into this error:
irb(main):001:0> require 'daru'
=> false
irb(main):002:0> require 'statsample-glm'
=> false
irb(main):003:0> data_set = Daru::DataFrame.from_csv "logistic_mle.csv"
=> #<Daru::DataFrame(200x4)>
a b c y
0 0.75171213 -3.2683591 1.70092606 0
1 0.55421406 -2.9565972 2.66368360 0
2 -1.8533164 -2.8293733 3.34679611 0
3 -2.8861015 -0.7389824 4.74970154 0
4 -2.6055309 0.56102031 5.48308397 0
5 -4.2735321 1.62383436 5.35813425 0
6 -4.7701259 1.22025583 6.41070111 0
7 -6.9231483 2.86547174 8.73185919 0
8 -7.5641950 4.94028695 8.94193466 0
9 -8.6309366 4.27420502 9.27002100 0
10 -8.9911114 5.10389362 11.7669513 0
11 -9.9905763 7.87484596 12.4794035 0
12 -10.381878 8.84300238 13.7498993 0
13 -11.047682 9.44613324 13.5025027 0
14 -12.434424 9.70515870 15.1221173 0
15 -13.627294 10.4190343 16.3289942 0
16 -15.620222 11.3788332 17.7367653 0
17 -16.292239 13.1516565 18.6939344 0
18 -16.715913 14.9076297 18.0246863 0
19 -17.950125 15.8533651 20.6826094 0
20 -18.989884 15.4331557 20.9101142 0
21 -19.908508 16.8542366 22.0721145 0
22 -21.146652 18.6785324 23.4977598 0
23 -21.367574 18.3208056 23.9121114 0
24 -22.131396 20.7616214 24.1683442 0
25 -23.163631 21.1293492 25.2695476 0
26 -24.136076 21.7035705 27.9161820 0
27 -25.386072 23.3588003 27.8755285 0
28 -27.254627 24.9201403 28.9810564 0
29 -28.845061 25.1681854 29.6749936 0
... ... ... ... ...
irb(main):004:0> glm = Statsample::GLM.compute data_set, :y, :logistic, {constant: 1, algorithm: :mle}
Traceback (most recent call last):
1: from (irb):4
IndexError (Specified vector y does not exist)
Further inspection of the error reveals this:
Caused by:
IndexError: Specified index :y does not exist
I've tried reformatting the header as "date" instead of "string", based on a comment in a marginally related Stack Overflow post, with no change in the error.
Any thoughts from the SO community?

Sorry SO, I posted this too quickly. I found a solution that works.
Instead of
data_set = Daru::DataFrame.from_csv "logistic_mle.csv"
this line works:
data_set = Daru::DataFrame.from_csv("logistic_mle.csv", headers: true, header_converters: :symbol)
With header_converters: :symbol, the CSV headers are read in as symbols, so the column exists under the name :y when Statsample::GLM.compute looks it up.

Related

average on dataframe segments

I have a DataFrame (shown as a picture in the original post) that renders zero after each cycle of operation (the cycles have random lengths). I want to calculate the average (or perform other operations) for each patch: for example, the average of [0.762, 0.766] alone, then [0.66, 1.37, 2.11, 2.29] alone, and so forth to the end of the DataFrame.
So I worked with this data :
random_value
0 0
1 0
2 1
3 2
4 3
5 0
6 4
7 4
8 0
9 1
There is probably a way better solution, but here is what I came up with:
def avg_function(df):
    avg_list = []
    value_list = list(df["random_value"])
    temp_list = []
    for i in range(len(value_list)):
        if value_list[i] == 0:
            if temp_list:
                avg_list.append(sum(temp_list) / len(temp_list))
                temp_list = []
        else:
            temp_list.append(value_list[i])
    if temp_list:  # for the last values
        avg_list.append(sum(temp_list) / len(temp_list))
    return avg_list
test_list = avg_function(df=df)
test_list
[Out] : [2.0, 4.0, 1.0]
Edit: since it was requested in the comments, here is a way to add the means to the dataframe. I don't know if there is a built-in pandas way to do that (and there might be!), but I came up with this:
def add_mean(df, mean_list):
    temp_mean_list = []
    list_index = 0  # will be the index for the value of mean_list
    df["random_value_shifted"] = df["random_value"].shift(1).fillna(0)
    random_value = list(df["random_value"])
    random_value_shifted = list(df["random_value_shifted"])
    for i in range(df.shape[0]):
        if random_value[i] == 0 and random_value_shifted[i] == 0:
            temp_mean_list.append(0)
        elif random_value[i] == 0 and random_value_shifted[i] != 0:
            temp_mean_list.append(0)
            list_index += 1
        else:
            temp_mean_list.append(mean_list[list_index])
    df = df.drop(["random_value_shifted"], axis=1)
    df["mean"] = temp_mean_list
    return df
df = add_mean(df=df, mean_list=test_list)
Which gave me:
df
[Out] :
random_value mean
0 0 0
1 0 0
2 1 2
3 2 2
4 3 2
5 0 0
6 4 4
7 4 4
8 0 0
9 1 1
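For what it's worth, a more pandas-idiomatic sketch of the same segment-mean idea (the small frame below simply reproduces the sample data above) marks each zero row as a segment boundary with a cumulative sum, then lets groupby compute the per-segment means:
import pandas as pd

df = pd.DataFrame({"random_value": [0, 0, 1, 2, 3, 0, 4, 4, 0, 1]})

# Every zero row starts a new segment; the cumulative sum of the
# zero mask turns that into a segment id.
is_zero = df["random_value"].eq(0)
segment = is_zero.cumsum()

# Mean per segment, computed only over the non-zero rows.
means = df.loc[~is_zero, "random_value"].groupby(segment[~is_zero]).mean()
print(means.tolist())  # [2.0, 4.0, 1.0]

# Mapping the means back onto the frame, with 0 for the zero rows,
# reproduces the "mean" column built by add_mean above.
df["mean"] = segment.map(means).where(~is_zero, 0)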

Splitting a coordinate string into X and Y columns with a pandas data frame

So I created a pandas data frame showing the coordinates for an event and the number of times those coordinates appear; the coordinates are stored in a string like this:
Coordinates Occurrences x
0 (76.0, -8.0) 1 0
1 (-41.0, -24.0) 1 1
2 (69.0, -1.0) 1 2
3 (37.0, 30.0) 1 3
4 (-60.0, 1.0) 1 4
.. ... ... ..
63 (-45.0, -11.0) 1 63
64 (80.0, -1.0) 1 64
65 (84.0, 24.0) 1 65
66 (76.0, 7.0) 1 66
67 (-81.0, -5.0) 1 67
I want to create a new data frame that shows the x and y coordinates individually, along with their occurrences, like this:
x Occurrences y Occurrences
76 ... -8 ...
-41 ... -24 ...
69 ... -1 ...
37 ... -30 ...
60 ... 1 ...
I have tried to split the string, but I don't think I am doing it correctly, and I don't know how to add the results to the table regardless; I think I'd have to do something like a for loop later on in my code. I scraped the data from an API; here is the code that sets up the data frame shown.
for key in contents['liveData']['plays']['allPlays']:
    # for plays in key['result']['event']:
    #     print(key)
    if key['result']['event'] == "Shot":
        # print(key['result']['event'])
        scoordinates = (key['coordinates']['x'], key['coordinates']['y'])
        if scoordinates not in shots:
            shots[scoordinates] = 1
        else:
            shots[scoordinates] += 1
    if key['result']['event'] == "Goal":
        # print(key['result']['event'])
        gcoordinates = (key['coordinates']['x'], key['coordinates']['y'])
        if gcoordinates not in goals:
            goals[gcoordinates] = 1
        else:
            goals[gcoordinates] += 1

# create data frames using pandas
gdf = pd.DataFrame(list(goals.items()), columns=['Coordinates', 'Occurences'])
print(gdf)
sdf = pd.DataFrame(list(shots.items()), columns=['Coordinates', 'Occurences'])
print()
Try this:
import re

df[['x', 'y']] = df.Coordinates.apply(
    lambda c: pd.Series(dict(zip(['x', 'y'],
                                 re.findall(r'-?[0-9]+\.[0-9]+', c.strip())))))
Using the built-in string methods should be performant:
df[["x", "y"]] = df["Coordinates"].str.strip("()").str.split(",", expand=True).astype(float)
(This also converts x and y to float values; although not requested, that is probably desired.)
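A minimal, self-contained sketch of that string-method approach, assuming the Coordinates column really does hold strings like "(76.0, -8.0)" (the tiny frame below is made up for illustration):
import pandas as pd

df = pd.DataFrame({
    "Coordinates": ["(76.0, -8.0)", "(-41.0, -24.0)", "(69.0, -1.0)"],
    "Occurences": [1, 1, 1],
})

# Strip the parentheses, split on the comma, and cast both parts to float.
df[["x", "y"]] = df["Coordinates"].str.strip("()").str.split(",", expand=True).astype(float)
print(df)
If the column actually holds tuples rather than strings (the question builds it from dict keys that are tuples), feeding df["Coordinates"].tolist() to pd.DataFrame would split it into two columns without any string parsing.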

Create bins with totals and percentage

I would like to create bins to get a histogram with totals and percentages, e.g. starting from 0. If possible, I'd like to set the minimum and maximum values of the bins (in my case min=0 and max=20).
Input file
8 5
10 1
11 4
12 4
12 4
13 5
16 7
18 9
16 9
17 7
18 5
19 5
20 1
21 7
output desired
0 0 0.0%
0 - 2 0 0.0%
2 - 4 0 0.0%
4 - 6 0 0.0%
6 - 8 0 0.0%
8 - 10 5 6.8%
10 - 12 5 6.8%
12 - 14 13 17.8%
14 - 16 0 0.0%
16 - 18 23 31.5%
18 - 20 19 26.0%
> 20 8 11.0%
---------------------
Total: 73
I use this code from Mr Ed Morton; it works perfectly, but the percentage is missing.
awk 'BEGIN { delta = (delta == "" ? 2 : delta) }
{
    bucketNr = int(($0+delta) / delta)
    cnt[bucketNr]++
    numBuckets = (numBuckets > bucketNr ? numBuckets : bucketNr)
}
END {
    for (bucketNr=1; bucketNr<=numBuckets; bucketNr++) {
        end = beg + delta
        printf "%0.1f %0.1f %d\n", beg, end, cnt[bucketNr]
        beg = end
    }
}' file
Thanks in advance
Your expected output doesn't seem to correspond to your sample input data, but try this variation of the awk code in your question (intended to be put in an executable file to run as a script, not as a one-liner, due to size):
#!/usr/bin/awk -f
BEGIN { delta = (delta == "" ? 2 : delta) }
{
    bucketNr = int(($0+delta) / delta)
    cnt[bucketNr]++
    max[bucketNr] = max[bucketNr] < $2 ? $2 : max[bucketNr]
    sum += $2
    numBuckets = (numBuckets > bucketNr ? numBuckets : bucketNr)
}
END {
    for (bucketNr=1; bucketNr<=numBuckets; bucketNr++) {
        end = beg + delta
        printf "%d-%d %d %.1f\n", beg, end, max[bucketNr],
            (cnt[bucketNr] / NR) * 100
        beg = end
    }
    print "-------------"
    print "Total " sum
}
It adds tracking of the maximum of the second column for each bin the first column falls into, and prints a percentage instead of a count of how many rows were in each bin, plus some tweaks to the output format to better match your desired output.
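For comparison, a rough Python sketch of the same binning logic (my own illustration, not the answerer's code, assuming the same two-column whitespace-separated input in a file named "file"):
from collections import defaultdict

delta = 2
cnt = defaultdict(int)    # rows whose first column falls in each bucket
mx = defaultdict(float)   # max of the second column per bucket
total = rows = 0

with open("file") as f:
    for line in f:
        a, b = map(float, line.split())
        bucket = int((a + delta) / delta)  # same bucketing as the awk code
        cnt[bucket] += 1
        mx[bucket] = max(mx[bucket], b)
        total += b
        rows += 1

beg = 0
for bucket in range(1, max(cnt) + 1):
    end = beg + delta
    print(f"{beg}-{end} {int(mx[bucket])} {cnt[bucket] / rows * 100:.1f}")
    beg = end
print("-------------")
print("Total", int(total))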

I am sure that the type of "items_tmp_dic2" is dict, so why does it report this error?

import pandas as pd
import numpy as np
path = 'F:/datasets/kaggle/predict_future_sales/'
train_raw = pd.read_csv(path + 'sales_train.csv')
items = pd.read_csv(path + 'items.csv')
item_category_id = items['item_category_id']
item_id = train_raw.item_id
train_raw.head()
date date_block_num shop_id item_id item_price item_cnt_day
0 02.01.2013 0 59 22154 999.00 1.0
1 03.01.2013 0 25 2552 899.00 1.0
2 05.01.2013 0 25 2552 899.00 -1.0
3 06.01.2013 0 25 2554 1709.05 1.0
4 15.01.2013 0 25 2555 1099.00 1.0
items.head()
item_name item_id item_category_id
0 ! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D 0 40
1 !ABBYY FineReader 12 Professional Edition Full... 1 76
2 ***В ЛУЧАХ СЛАВЫ (UNV) D 2 40
3 ***ГОЛУБАЯ ВОЛНА (Univ) D 3 40
4 ***КОРОБКА (СТЕКЛО) D 4 40
Then I want to add an "item_category_id" column to train_raw, taken from the data in items, so I want to create a dict mapping item_id to item_category_id:
item_category_id = items['item_category_id']
item_id = train_raw.item_id
items_tmp = items.drop(['item_name'],axis=1)
items_tmp_dic = items_tmp.to_dict('split')
items_tmp_dic = items_tmp_dic.get('data')
items_tmp_dic2 = dict(items_tmp_dic)
ic_id = []
for i in np.nditer(item_id.values[:10]):
    ic_id.append(items_tmp_dic2.get(i))
print(len(ic_id))
This goes wrong:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-50-be637620ea6d> in <module>
6 ic_id = []
7 for i in np.nditer(item_id.values[:10]):
----> 8 ic_id.append(items_tmp_dic2.get(i))
9 print(len(ic_id))
TypeError: unhashable type: 'numpy.ndarray'
but when I run
for i in np.nditer(item_id.values[:10]):
    print(i)
I get
22154
2552
2552
2554
2555
2564
2565
2572
2572
2573
I have ensured that the type of "items_tmp_dic2" is dict, so why does this happen?
I have solved it by using int(). np.nditer yields zero-dimensional numpy arrays rather than plain Python ints, and numpy arrays are unhashable, so they cannot be used as dict keys; wrapping each value in int() gives a hashable key:
for i in np.nditer(item_id.values[:10]):
    ic_id.append(items_tmp_dic2.get(int(i)))
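As an aside, for the stated goal of adding item_category_id to train_raw, a more direct pandas sketch (assuming item_id is unique in items) skips the hand-built dict entirely:
# Build a lookup Series indexed by item_id, then map it onto train_raw.
lookup = items.set_index('item_id')['item_category_id']
train_raw['item_category_id'] = train_raw['item_id'].map(lookup)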

Extracting table data using BeautifulSoup

Having a little trouble using BeautifulSoup to extract data (zip code and population). Any help appreciated.
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
pop_source = requests.get("https://www.zip-codes.com/city/tx-austin.asp").text
soup = BeautifulSoup(pop_source, 'html5lib')
zip_pop_table = soup.find('table',class_='statTable')
austin_pop = pd.DataFrame(columns=['Zip Code','Population'])
for row in zip_pop_table.find_all('tr'):
    cols = row.find_all('td')
Now I'm stuck: I don't really know how to pull the data from the columns I want and append it to the columns I made in the empty dataframe.
Any help appreciated.
You just need to loop over your cols and dump the result into your austin_pop dataframe.
I did that by building a list of the data from the cols, using a list comprehension:
row_list = [data.text for data in cols]
The list comprehension is equivalent to this for loop; you can use either:
row_list = []
for data in cols:
    row_list.append(data.text)
Then I created a single-row frame, kept the 2 columns you wanted, and dumped that into austin_pop:
temp_df = pd.DataFrame([row_list], columns=['Zip Code', 'type', 'county', 'Population', 'area_codes'])
temp_df = temp_df[['Zip Code', 'Population']]
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
austin_pop = pd.concat([austin_pop, temp_df]).reset_index(drop=True)
Full code:
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

url = "https://www.zip-codes.com/city/tx-austin.asp"
pop_source = requests.get(url).text
soup = BeautifulSoup(pop_source, 'html5lib')
zip_pop_table = soup.find('table', class_='statTable')
austin_pop = pd.DataFrame(columns=['Zip Code', 'Population'])
for row in zip_pop_table.find_all('tr'):
    cols = row.find_all('td')
    row_list = [data.text for data in cols]
    temp_df = pd.DataFrame([row_list], columns=['Zip Code', 'type', 'county', 'Population', 'area_codes'])
    temp_df = temp_df[['Zip Code', 'Population']]
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    austin_pop = pd.concat([austin_pop, temp_df]).reset_index(drop=True)
austin_pop = austin_pop.iloc[1:, :]  # drop the first row (not actual data)
austin_pop['Zip Code'] = austin_pop['Zip Code'].apply(lambda x: x.split()[-1])
Output:
print (austin_pop)
Zip Code Population
1 73301 0
2 73344 0
3 78681 50,606
4 78701 6,841
5 78702 21,334
6 78703 19,690
7 78704 42,117
8 78705 31,340
9 78708 0
10 78709 0
11 78710 0
12 78711 0
13 78712 860
14 78713 0
15 78714 0
16 78715 0
17 78716 0
18 78717 22,538
19 78718 0
20 78719 1,764
21 78720 0
22 78721 11,425
23 78722 5,901
24 78723 28,330
25 78724 21,696
26 78725 6,083
27 78726 13,122
28 78727 26,689
29 78728 20,299
30 78729 27,108
.. ... ...
45 78746 26,928
46 78747 14,808
47 78748 40,651
48 78749 34,449
49 78750 26,814
50 78751 14,385
51 78752 18,064
52 78753 49,301
53 78754 15,036
54 78755 0
55 78756 7,194
56 78757 21,310
57 78758 44,072
58 78759 38,891
59 78760 0
60 78761 0
61 78762 0
62 78763 0
63 78764 0
64 78765 0
65 78766 0
66 78767 0
67 78768 0
68 78772 0
69 78773 0
70 78774 0
71 78778 0
72 78779 0
73 78783 0
74 78799 0
[74 rows x 2 columns]
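As a side note, a shorter sketch using pd.read_html might work for a table like this one (assuming the site serves the page to non-browser clients; the attrs filter targets the same statTable class as above):
import pandas as pd

# Let pandas locate and parse the HTML table directly.
tables = pd.read_html("https://www.zip-codes.com/city/tx-austin.asp",
                      attrs={"class": "statTable"})
austin = tables[0]
print(austin.head())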