While working through Andrew Ng's Machine Learning course, Exercise 1 in Python,
I had to predict the price of a house given its size in square feet and the number of bedrooms, using multivariate linear regression.
In the step where we predict the price of a house for a new example X = [1,1650,3], where 1 is the bias term, 1650 is the size of the house and 3 is the number of bedrooms, I used the code below to normalize the features and predict the price:
X_vect = np.array([1,1650,3])
X_vect[1:3] = (X_vect[1:3] - mu)/sigma
pred_price = np.dot(X_vect,theta)
print("the predicted price for 1650 sq-ft,3 bedroom house is ${:.0f}".format(pred_price))
Here mu is the mean of the training set, calculated previously as [2000.68085106 3.17021277], sigma is the standard deviation of the training data, calculated previously as [7.86202619e+02 7.52842809e-01], and theta is [340412.65957447 109447.79558639 -6578.3539709]. The value of X_vect after the normalization step was [1 0 0]. Hence the prediction code:
pred_price = np.dot(X_vect,theta_vals[0])
gave the result: the predicted price for a 1650 sq-ft, 3 bedroom house is $340413.
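(For reference, with X_vect reduced to [1, 0, 0] the dot product collapses to the intercept alone: 1*340412.66 + 0*109447.80 + 0*(-6578.35) ≈ 340413, which is exactly the figure printed above.)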
But this was wrong according to the answer key, so I did it manually as below:
print((np.array([1650,3]).reshape(1,2) - np.array([2000.68085106,3.17021277]).reshape(1,2))/sigma)
This is the normalized form of X_vect, and the output was [[-0.44604386 -0.22609337]].
The next line of code to calculate the hypothesis was:
print(340412.65957447 + 109447.79558639*-0.44604386 + 6578.3539709*-0.22609337)
Or in cleaner code:
X1_X2 = (np.array([1650,3]).reshape(1,2) - np.array([2000.68085106,3.17021277]).reshape(1,2))/sigma
xo = 1
x1 = X1_X2[:,0:1]
x2 = X1_X2[:,1:2]
hThetaOfX = (340412.65957447*xo + 109447.79558639*x1 + 6578.3539709*x2)
print("The price of a 1650 sq-feet house with 3 bedrooms is ${:.02f}".format(hThetaOfX[0][0]))
This gave a predicted price of $290106.82, which matched the answer key.
My question is: where did I go wrong in my first approach?
I am very new to TensorFlow - so please bear with me if this is a trivial question.
I'm coding in Python+TensorFlow. I have a dataframe with the following structure -
Y | X_1 | X_2 | ... | X_p | Grp
where Y is the continuous response, X_1 through X_p are features, and Grp is a categorical variable indicating the group. I want to fit a separate linear regression of Y on (X_1,...,X_p) for each Grp and save the weights/coefficients. I do not want to use the off-the-shelf tf.estimator.LinearRegressor. Instead I want to go the loss function-optimizer-session.run() route.
The relevant tutorial pages I have found online cover linear regression, but not per-group regression. I would appreciate any suggestions. I am thinking of doing this (a rough sketch follows the steps):
For each g in Grps:
1. Call the optimizer, passing the data for group g through the placeholders.
2. Get the estimated weights (for group g) and save them in a dataframe: Grp | weights
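For what it's worth, here is a minimal sketch of that loop, assuming TensorFlow 1.x (placeholder/Session API) and a pandas dataframe df with columns Y, X_1..X_p and Grp; the names and hyperparameters are only illustrative, not a definitive implementation:
import pandas as pd
import tensorflow as tf  # assumes TF 1.x

def fit_group(X_np, y_np, lr=0.01, n_steps=1000):
    # Build a fresh graph per group so the per-group models stay independent
    graph = tf.Graph()
    with graph.as_default():
        p = X_np.shape[1]
        X = tf.placeholder(tf.float32, [None, p])
        y = tf.placeholder(tf.float32, [None, 1])
        w = tf.Variable(tf.zeros([p, 1]))
        b = tf.Variable(tf.zeros([1]))
        loss = tf.reduce_mean(tf.square(tf.matmul(X, w) + b - y))  # MSE loss
        train_op = tf.train.GradientDescentOptimizer(lr).minimize(loss)
        with tf.Session(graph=graph) as sess:
            sess.run(tf.global_variables_initializer())
            for _ in range(n_steps):
                sess.run(train_op, feed_dict={X: X_np, y: y_np})
            return sess.run(w).ravel(), sess.run(b)[0]

# df is assumed to be your dataframe with columns Y, X_1..X_p, Grp
feature_cols = [c for c in df.columns if c.startswith('X_')]
rows = []
for g, sub in df.groupby('Grp'):
    w, b = fit_group(sub[feature_cols].values.astype('float32'),
                     sub[['Y']].values.astype('float32'))
    rows.append(dict({'Grp': g, 'intercept': b}, **dict(zip(feature_cols, w))))
weights = pd.DataFrame(rows)
Because each call builds its own tf.Graph and tf.Session, this is effectively the "separate graphs" variant as well; running the groups sequentially in a loop is usually simpler than juggling several live sessions.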
Another approach that sounds reasonable is to have a separate graph for each group and run them all with separate sessions.
Are these reasonable and feasible in TF? Which one is easier, or are there better approaches?
Thank you,
Sai
Here is the Stata code that I have tried:
eststo clear
sysuse auto, clear
eststo Dom: estpost sum rep78 mpg turn trunk weight length if foreign==0
eststo For: estpost sum rep78 mpg turn trunk weight length if foreign==1
esttab Dom For, cells("mean(fmt(2))" "sd") ///
nonumber nodepvars noobs se collabels(none) mlabels(, lhs("Var") title)
Below is the output:
--------------------------------------
Var Dom For
--------------------------------------
rep78 3.02 4.29
0.84 0.72
mpg 19.83 24.77
4.74 6.61
turn 41.44 35.41
3.97 1.50
trunk 14.75 11.41
4.31 3.22
weight 3317.12 2315.91
695.36 433.00
length 196.13 168.55
20.05 13.68
--------------------------------------
What this does is compute the mean and standard deviation for several variables using summarize. This is done separately based on a condition (once for foreign observations and once for non-foreign observations).
The results, mean and standard deviation, are then displayed via esttab. I will ultimately want to get this in LaTeX, but this example shows what the result is in Stata for the sake of simplicity.
I have two questions:
How can I get the standard deviations to be shown in parentheses?
Is it possible to include any lines between the variables to separate the two different groups?
I have something like this in mind:
--------------------------------------
Var Dom For
--------------------------------------
Variable Group 1:
--------------------------------------
rep78 3.02 4.29
(0.84) (0.72)
mpg 19.83 24.77
(4.74) (6.61)
turn 41.44 35.41
(3.97) (1.50)
--------------------------------------
Variable Group 2:
--------------------------------------
trunk 14.75 11.41
(4.31) (3.22)
weight 3317.12 2315.91
(695.36) (433.00)
length 196.13 168.55
(20.05) (13.68)
--------------------------------------
I would like to use eststo, etc. if possible. I would prefer that it be as automated as possible, but I am open to exporting matrices from Stata into LaTeX or using fragments if that is what it takes. If this is not possible, I am also open to other solutions.
Regarding the first question, you need to specify the par option within sd inside cells():
sysuse auto, clear
eststo clear
eststo Dom: estpost sum rep78 mpg turn trunk weight length if foreign==0
eststo For: estpost sum rep78 mpg turn trunk weight length if foreign==1
esttab Dom For, cells("mean(fmt(2))" "sd(par)") ///
nonumber nodepvars noobs se collabels(none) mlabels(, lhs("Var") title)
With regard to the second question, you could do the following:
eststo clear
eststo Dom: estpost sum rep78 mpg turn if foreign==0
eststo For: estpost sum rep78 mpg turn if foreign==1
esttab Dom For using output.txt, cells("mean(fmt(2))" "sd(par)") ///
nonumber nodepvars noobs collabels(none) mlabels(, lhs("Vars") title) ///
posthead("#hline" "Variable Group 1:" "#hline" ) postfoot(" ") replace
eststo clear
eststo Dom: estpost sum trunk weight length if foreign==0
eststo For: estpost sum trunk weight length if foreign==1
esttab Dom For using output.txt, cells("mean(fmt(2))" "sd(par)") ///
nonumber nodepvars noobs collabels(none) mlabels(none) ///
prehead("#hline" "Variable Group 2:") append
This will produce the desired output:
type output.txt
--------------------------------------
Vars Dom For
--------------------------------------
Variable Group 1:
--------------------------------------
rep78 3.02 4.29
(0.84) (0.72)
mpg 19.83 24.77
(4.74) (6.61)
turn 41.44 35.41
(3.97) (1.50)
--------------------------------------
Variable Group 2:
--------------------------------------
trunk 14.75 11.41
(4.31) (3.22)
weight 3317.12 2315.91
(695.36) (433.00)
length 196.13 168.55
(20.05) (13.68)
--------------------------------------
This is a two-part problem:
PART 1:
I am using the Cloudera Pig editor to transform my data. The data set is derived from the US patent citations data set. The first column is the "cited" patent. The remaining data is the list of patents that cite the first patent.
3858241 3634889,3557384,3398406,1324234,956203
3858242 3707004,3668705,3319261,1515701
3858243 3684611,3681785,3574238,3221341,3156927,3146465,2949611
3858244 2912700,2838924,2635670,2211676,17445,14040
3858245 3755824,3699969,3621837,3608095,3553737,3176316,2072303
3858246 3601877,3503079,3451067
3858247 3755824,3694819,3621837,2807431,1600859
I need to create Pig code that will count the number of citations that the first patent has. So, I need the output to be:
3858241 5
3858242 4
3858243 7
3858244 6
3858245 7
3858246 3
3858247 5
PART 2:
I need to create a histogram of the output from Part 1, using a Pig script.
Any help would be greatly appreciated.
Thanks
This script should work:
X = LOAD 'pigpatient.txt' USING PigStorage(' ') AS (pid:int, str:chararray);
-- Split the comma-separated list of citing patents into a tuple
X1 = FOREACH X GENERATE pid, STRSPLIT(str, ',') AS (y:tuple());
-- The citation count is the number of fields in that tuple
X2 = FOREACH X1 GENERATE pid, SIZE(y) AS numofcitan;
DUMP X2;
-- Histogram: group by citation count and count how many patents have each count
X3 = GROUP X2 BY numofcitan;
Histograms = FOREACH X3 GENERATE group AS numofcitan, COUNT(X2.pid);
DUMP Histograms;
input:
3858241 3634889,3557384,3398406,1324234,956203
3858242 3707004,3668705,3319261,1515701
3858243 3684611,3681785,3574238,3221341,3156927,3146465,2949611
3858244 2912700,2838924,2635670,2211676,17445,14040
3858245 3755824,3699969,3621837,3608095,3553737,3176316,2072303
3858246 3601877,3503079,3451067
3858247 3755824,3694819,3621837,2807431,1600859
Result:
(3858241,5)
(3858242,4)
(3858243,7)
(3858244,6)
(3858245,7)
(3858246,3)
(3858247,5)
Histogram output:
(number of citations, number of patents)
(3,1)
(4,1)
(5,2)
(6,1)
(7,2)
#Sravan K Reddy's answer is good enough to be a solution, but it is essential to know what a histogram is.
A histogram is a frequency distribution of a dataset and gives statistical information about the data. The most commonly used histogram types are equi-width and equi-depth, the latter also called equi-height or height-balanced.
In database tools, the equi-depth histogram is preferred, e.g. in Oracle.
#Sravan K Reddy intends to create an equi-width histogram of the patent citations. However, in order to create a histogram the data must be sorted; that is vital for histogram construction.
If you want to create a histogram of your big data, read this paper and check the Apache Pig scripts.
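For illustration only, here is a small Python sketch (outside Pig) contrasting the two histogram types on the citation counts produced above; the choice of 3 bins is arbitrary:
import numpy as np

counts = np.array([5, 4, 7, 6, 7, 3, 5])  # citation counts from the result above

# Equi-width: split the value range [3, 7] into bins of equal width
width_hist, width_edges = np.histogram(counts, bins=3)

# Equi-depth (equi-height): pick cut points so each bin holds roughly the same
# number of values; quantiles do this and implicitly sort the data first
depth_edges = np.quantile(counts, [0, 1/3, 2/3, 1])

print(width_hist, width_edges)  # counts per equal-width bin and the bin edges
print(depth_edges)              # value cut points for roughly equal-count bins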
I am playing around with learning MVC and want to create a recipe recorder application to store my recipes.
I am using .NET with SQL Server 2008 R2, though I don't think that really matters for what I am trying to do.
I want to be able to record all of the measures I use. In my country we use metric, but I want people to be able to use imperial with my application.
How do I structure my table to cope with the differences? I was thinking of storing all of the measurements as ints and having a foreign key to store the kind of weight.
Ideally I would like to be able to share the recipes between people and display the measurements in their preferred way.
Is this the right kind of approach?
IngredientID PK
Weight int
TypeOfWeight int e.g. tsp=1,tbl=2,kilogram=3,pound=4,litre=5,ounce=6 etc
UserID int
Or is this way off track? Any suggestions would be great!
I think you should store the weights (kilo/pound etc.) in a single unit system (metric) and simply display them converted according to the user's preference. If the user has their weight setting set to imperial, values entered into the system would need to be converted on the way in as well. This should simplify your data.
It is similar to dates: you could store every date along with the timezone it came from, or store all dates in a single timezone (or with no timezone) and then display them in the application with an offset according to the user's preference.
If you are storing weights (a non-discrete value), I would strongly suggest using numeric or decimal for this data. You have the right idea with the typeofweight column. Store a reference table somewhere holding the conversion ratio for each unit (to a chosen standard unit).
This gets quite tricky when you want to show ounces as tsp, because the conversion depends on the ingredient itself, so you need a third table: ingredient (id, name, volume-to-weight ratio).
Example typeofweight table, where the standard unit is grams
type | conversion
gram | 1
ounce | 28.35
kg | 1000
tsp | 5 // assuming that 1 tsp = 5 grams of water
pound | 453.59
Example ingredient volume to weight conversion
type | vol-to-weight
water | 1
sugar | 1.4 // i.e. 1 tsp holds 5g of water, but 7g of sugar
So to display 500 ounces of sugar in tsp, you would use the formula
units x ounce.conversion / (tsp.conversion x sugar.vol-to-weight)
= 500 x 28.35 / (5 x 1.4)
= 2025 tsp
Another example, with two weight units:
Ingredient is specified as 3 ounces of starch. Show in grams:
= 3 x 28.35 (straightforward, isn't it)
Or: ingredient is specified as 3 ounces of starch. Show in pounds:
= 3 x 28.35 / 453.59
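A minimal Python sketch of this conversion logic, using the example ratios above; the dictionary and function names are just illustrative, not a prescribed design:
# Conversion factors to the standard unit (grams); tsp is a volume unit,
# expressed here as grams of water per teaspoon
TO_GRAMS = {'gram': 1, 'ounce': 28.35, 'kg': 1000, 'pound': 453.59, 'tsp': 5}
VOL_TO_WEIGHT = {'water': 1.0, 'sugar': 1.4}  # grams per tsp relative to water

def convert(amount, from_unit, to_unit, ingredient='water'):
    # Normalize to grams first; the volume unit (tsp) also needs the ingredient's ratio
    grams = amount * TO_GRAMS[from_unit]
    if from_unit == 'tsp':
        grams *= VOL_TO_WEIGHT[ingredient]
    if to_unit == 'tsp':
        return grams / (TO_GRAMS['tsp'] * VOL_TO_WEIGHT[ingredient])
    return grams / TO_GRAMS[to_unit]

print(convert(3, 'ounce', 'gram'))            # 3 oz in grams -> 85.05
print(convert(3, 'ounce', 'pound'))           # 3 oz in pounds -> ~0.19
print(convert(500, 'ounce', 'tsp', 'sugar'))  # 500 oz of sugar in teaspoons -> 2025.0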