How should I impute NaN values in a categorical column? - pandas

Should I label-encode the categorical column and then impute the NaN values with the most frequent value, or are there other ways?
Encoding requires converting the DataFrame to an array, and imputing would then require converting the array back to a DataFrame (all this for a single column, and there are more columns like it).
For example, I have the variable BsmtQual, which evaluates the height of a basement and has the following categories:
Ex Excellent (100+ inches)
Gd Good (90-99 inches)
TA Typical (80-89 inches)
Fa Fair (70-79 inches)
Po Poor (<70 inches)
NA No Basement
Out of 2919 values in BsmtQual, 81 are NaN values.

For future problems like this that don't involve coding, you should post at https://datascience.stackexchange.com/.
This depends on a few things. First of all, how important is this variable in your exercise? Assuming that you are doing classification, you could try removing all rows with NaN values, running a few models, then removing the variable entirely and running the same models again. If you don't see a dip in accuracy, you might consider removing the variable completely.
If you do see a dip in accuracy, or can't judge the impact because the problem is unsupervised, then there are several other methods you can try. If you just want a quick fix, and there aren't too many NaNs or categories, you can impute with the most frequent value. This shouldn't cause too many problems if those conditions are satisfied.
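For instance, a minimal pandas sketch of most-frequent imputation, done directly on the DataFrame so there is no array round-trip (the column name is from the question; the sample data is made up):

import pandas as pd

df = pd.DataFrame({"BsmtQual": ["Gd", "TA", None, "Ex", "Gd", None]})

# mode() ignores NaN, so [0] is the most frequent real category
most_frequent = df["BsmtQual"].mode()[0]
df["BsmtQual"] = df["BsmtQual"].fillna(most_frequent)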
If you want to be more exact, you could use the other variables you have to predict the class of the categorical variable (obviously this will only work if the categorical variable is correlated with some of your other variables). You could use a variety of algorithms for this, including classifiers or clustering. It all depends on the distribution of your categorical variable and how much effort you want to put in to solve the issue. A rough sketch of the classifier route follows.
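As a hedged sketch of that predictive approach with scikit-learn (the predictor columns here are hypothetical stand-ins for whatever correlated variables your data actually has):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# df is assumed to hold BsmtQual plus some numeric predictor columns
known = df[df["BsmtQual"].notna()]
unknown = df[df["BsmtQual"].isna()]

features = ["TotalBsmtSF", "OverallQual"]  # hypothetical predictors
clf = RandomForestClassifier(random_state=0)
clf.fit(known[features], known["BsmtQual"])

# fill the missing categories with the model's predictions
df.loc[unknown.index, "BsmtQual"] = clf.predict(unknown[features])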
(I'm only learning as well; however, I think that's most of your options.)

"… or there are other ways."
Example:
Ex Excellent (100+ inches) 5 / 5 = 1.0
Gd Good (90-99 inches) 4 / 5 = 0.8
TA Typical (80-89 inches) 3 / 5 = 0.6
Fa Fair (70-79 inches) 2 / 5 = 0.4
Po Poor (<70 inches) 1 / 5 = 0.2
NA No Basement 0 / 5 = 0.0
However, the labels express less precision than actual measurements, which affects accuracy if the two are combined.
This could be solved either by scaling values over each category's range (e.g. scaling 0-69 inches over 0.0-0.2), or by using an approximate value for each category (more linearly accurate). For example, if the highest value is 200 inches:
Ex Excellent (100+ inches) 100 / 200 = 0.5000
Gd Good (90-99 inches) (((99 - 90) / 2) + 90) / 200 = 0.4725
TA Typical (80-89 inches) (((89 - 80) / 2) + 80) / 200 = 0.4225
Fa Fair (70-79 inches) (((79 - 70) / 2) + 70) / 200 = 0.3725
Po Poor (<70 inches) (69 / 2) / 200 = 0.1725
NA No Basement 0 / 200 = 0.0000
Actual measurement 120 inch 120 / 200 = 0.6000
This produces a decent approximation (the mid-point of each range, except Ex, for which the minimum is used). If calculations on such columns produce inaccuracies, it is because of the imprecision of the notation (the labels express ranges rather than exact values).
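A minimal pandas sketch of the midpoint encoding above, assuming the 200-inch maximum from the example:

import pandas as pd

df = pd.DataFrame({"BsmtQual": ["Ex", "Gd", "TA", "Fa", "Po", "NA"]})

midpoints = {
    "Ex": 100 / 200,                    # 0.5000 (minimum of open-ended range)
    "Gd": ((99 - 90) / 2 + 90) / 200,   # 0.4725
    "TA": ((89 - 80) / 2 + 80) / 200,   # 0.4225
    "Fa": ((79 - 70) / 2 + 70) / 200,   # 0.3725
    "Po": (69 / 2) / 200,               # 0.1725
    "NA": 0.0,                          # no basement
}
df["BsmtQual_num"] = df["BsmtQual"].map(midpoints)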

Related

Snowflake CEIL Function - round up to next 0.1 kilometer

I have a column containing measurement values in meters.
I want to round them up (ceil) to the next 100 m and return the result as a km value.
The special thing is: if the original value is a "round" number (a 100 m increment), it should still be ceiled up to the next 100 m increment (see line 3 in the example below).
Example:
meter_value kilometer_value
1111 1.2
111 0.2
1000 1.1
I think I can get the first two lines by doing:
ceil(meter_value/1000,1) as kilometer_value
The solution I thought of to fix the edge case in line three is to always add 1 meter:
ceil((meter_value+1)/1000,1) as kilometer_value
It seems a bit clumsy; is there a better way/alternative function to achieve this?
You can check to see if it's divisible by 100 and only add one if it is:
ceil(((meter_value + iff(meter_value % 100 = 0, 1, 0))/1000), 1)
This prevents situations where, if fractional meter values are allowed, always adding 1 would be inaccurate: for example, 999.5 should round up to 1.0 km, but 999.5 + 1 = 1000.5 would round up to 1.1 km.
Greg's answer is good; simpler to read, to me, would be to:
divide by 100
floor
add 1
ceil
divide by 10
select
column1 as meter_value
,ceil(((meter_value + iff(meter_value % 100 = 0, 1, 0))/1000), 1) as greg
,ceil(floor(meter_value/100)+1)/10 as simeon
from values
(1111)
,(111)
,(1000)
,(1)
,(0)
;
METER_VALUE  GREG  SIMEON
1,111        1.2   1.2
111          0.2   0.2
1,000        1.1   1.1
1            0.1   0.1
0            0.1   0.1
Do we want to mention negative values? I mean, it's a distance, so it's a directionless magnitude, right?
Anyway, with negative values, the +1 in both our methods forces the boundary case to be wrong.
Actually:
Once you have floored, you are already adding the 1 (or 0.1 if you divide by 1000 instead of 100 first), so you don't need to ceil at all.
Thus, with the original kept as version_a for comparison, the two shorter forms can be:
,ceil(floor(meter_value/100)+1)/10 as version_a
,(floor(meter_value/100)+1)/10 as version_b
,floor(meter_value/1000,1)+0.1 as version_c
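If you want to sanity-check the arithmetic outside Snowflake, a minimal Python sketch of the version_b logic (math.floor on whole 100 m units mirrors the SQL FLOOR, and the +1 keeps exact 100 m multiples stepping up):

import math

def next_tenth_km(meters: float) -> float:
    # floor to whole 100 m units, step up one unit, convert to km
    return (math.floor(meters / 100) + 1) / 10

for m in (1111, 111, 1000, 1, 0):
    print(m, next_tenth_km(m))  # 1.2, 0.2, 1.1, 0.1, 0.1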

Snowflake: computation dividing 2 columns gives wrong values

I am doing exactly this: Sum(2322933.99 / 1161800199.8) * 100
I should get 1.9-something, but I am getting 64-something.
Can anyone tell me why this division in Snowflake gives wrong results?
I tried converting the values into decimals and tried the formula DIV0().
Nothing worked.
I guess that your database table has 33 rows. So you get 33 * 1.9 (because of SUM), which is about 64.
My guess, with the few details that you gave us:
sum(x)/sum(y) is different from sum(x/y):
1/2 + 2/4 + 4/8 = 1.5
(1+2+4)/(2+4+8) = 0.5
Try writing sum(total gross weight)/sum(total cases filled) instead of sum(total gross weight /total cases filled).
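A tiny pandas check of that point (the column names are shortened stand-ins for "total gross weight" and "total cases filled"):

import pandas as pd

df = pd.DataFrame({"weight": [1, 2, 4], "cases": [2, 4, 8]})

print((df["weight"] / df["cases"]).sum())      # sum(x/y)      -> 1.5
print(df["weight"].sum() / df["cases"].sum())  # sum(x)/sum(y) -> 0.5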

Chi-square test gives wrong result. Should I reject the proposed distribution?

I want to fit a Poisson distribution to my data points and decide, based on a chi-square test, whether I should accept or reject this proposed distribution. I only used 10 observations. Here is my code:
#Fitting function:
import numpy as np
import matplotlib.pyplot as plt
from scipy import optimize
from scipy.stats import chisquare

def Poisson_fit(x, a):
    return a * np.exp(-x)

#Code (x holds the raw observations)
hist, bins = np.histogram(x, bins=10, density=True)
print("hist: ", hist)
#hist: [5.62657158e-01, 5.14254073e-01, 2.03161280e-01, 5.84898068e-02,
# 1.35995217e-02, 2.67094169e-03, 4.39345778e-04, 6.59603327e-05,
# 1.01518320e-05, 1.06301906e-06]
XX = np.arange(len(hist))
print("XX: ", XX)
#XX: [0 1 2 3 4 5 6 7 8 9]
plt.scatter(XX, hist, marker='.', color='red')
popt, pcov = optimize.curve_fit(Poisson_fit, XX, hist)
x_data = np.linspace(XX.min(), XX.max(), 100)  # x_data was not defined in the original
plt.plot(x_data, Poisson_fit(x_data, *popt), linestyle='--', color='red',
         label='Fit')
plt.xlabel('s')
plt.ylabel('P(s)')
#Chisquare test:
f_obs = hist
f_exp = Poisson_fit(XX, *popt)
#f_exp: [6.76613820e-01, 2.48912314e-01, 9.15697229e-02, 3.36866185e-02,
# 1.23926144e-02, 4.55898806e-03, 1.67715798e-03, 6.16991940e-04,
# 2.26978650e-04, 8.35007789e-05]
chi, p_value = chisquare(f_obs, f_exp)
print("chi: ", chi)
print("p_value: ", p_value)
#chi: 0.4588956658201067
#p_value: 0.9999789643475111
I am using 10 observations, so the degrees of freedom would be 9. For these degrees of freedom I can't find my p-value and chi value in the chi-square distribution table. Is there anything wrong in my code? Or are my input values too small, so that the test fails? If the p-value is > 0.05, the distribution is accepted. Although the p-value is large (0.999), I can't find the chi-square value 0.4588 in the table. I think there is something wrong in my code. How do I fix this error?
Is the returned chi value the critical value of the tails? How do I check the proposed hypothesis?

Plotting data from two sets with different shapes in the same plot

I am using data collected from two different instruments, which have different resolutions because of their sampling rates. For a given period, one of the sets has >10k entries while the other has ~2.5k. They capture data over the same time interval, however, and I want to plot them on top of each other even though they have different resolutions. The minimum and maximum x of both sets are the same, but one of them has more entries.
Simplified it could look like this:
1st set from instrument with higher sampling rate:
time(s) value
0.0 10
0.2 11
0.4 12
0.6 13
0.8 14
... ..
100 50
2nd set from instrument with lower sampling rate:
time(s) value
0 100
1 120
2 125
3 128
4 130
. ...
100 430
They are measuring different things, but I would like to display them in the same plot. How can I accomplish this?
I found the mistake. I was trying to plot both datasets using the time data from the first instrument. Of course they need to be plotted with their respective time data; I put the first time data in the second plot by mistake.
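For reference, a minimal matplotlib sketch of the working approach: each series keeps its own time array, and twinx() gives the second, differently-scaled quantity its own y-axis (the arrays are illustrative placeholders):

import numpy as np
import matplotlib.pyplot as plt

t1 = np.arange(0, 100.2, 0.2)   # higher sampling rate
v1 = 10 + 0.4 * t1              # placeholder values
t2 = np.arange(0, 101, 1.0)     # lower sampling rate
v2 = 100 + 3.3 * t2             # placeholder values

fig, ax1 = plt.subplots()
ax1.plot(t1, v1, color="tab:blue")
ax1.set_xlabel("time (s)")
ax1.set_ylabel("instrument 1 value")

ax2 = ax1.twinx()               # second y-axis, shared time axis
ax2.plot(t2, v2, color="tab:orange")
ax2.set_ylabel("instrument 2 value")

plt.show()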

Complex Formulas within Excel Using VBA

I am working on vba code where I have data (for Slope Inclinometers) at various depths like so:
Depth A0 A180 Checksum B0 B180 Checksum
4.5 (-1256) 1258 2 (-394) 378 (-16)
4.5 (-1250) 1257 7 (-396) 376 (-20)
4.5 (-1257) 1257 0 (-400) 374 (-26)
Depth A0 A180 Checksum B0 B180 Checksum
5 (-1214) 1214 0 (-472) 459 (-13)
5 (-1215) 1212 -3 (-472) 455 (-17)
5 (-1216) 1211 -5 (-473) 455 (-18)
AN UNKNOWN AMOUNT OF DATA WILL BE PRESENT (depending on how much the user transfers to this sheet)
Now I need to be able to calculate the A Axis Displacement, the B Axis Displacement, and the resultant which have formulas as followed:
A Axis Displacement = [((A0 - A180)/2) - ((A0* - A180*)/2)] * (constant/constant)
Where * denotes the initial readings, which are always the first row of data at the specified depth.
B Axis Displacement = [((B0 - B180)/2) - ((B0* - B180*)/2)] * (constant/constant)
Where * denotes the initial readings, which are always the first row of data at the specified depth.
Resultant = SQRT[(A Axis Displacement)^2 + (B Axis Displacement)^2]
I'm struggling to find examples of how I can implement this using VBA, as there will be various depths present (an unknown amount) on the same sheet, and the formula will need to start over at each new depth.
Any helps/tips would be greatly appreciated!
how I can implement this using vba as there will be various depths present...
You can still do it purely with formulas and easy auto-fill, because the formula can find the first occurrence of the current depth and perform all the necessary calculations, leaving blanks at header rows or empty rows. For instance, you can enter these formulas in row 2 and fill down all the rows.
H2 (A Axis Displacement):
=IF(ISNUMBER($A2),0.5*(B2-C2-VLOOKUP($A2,$A:$F,2,0)+VLOOKUP($A2,$A:$F,3,0)), "")
I2 (B Axis Displacement):
=IF(ISNUMBER($A2),0.5*(E2-F2-VLOOKUP($A2,$A:$F,5,0)+VLOOKUP($A2,$A:$F,6,0)), "")
J2 (Resultant):
=IF(ISNUMBER($A2),SQRT(SUMSQ(H2,I2)),"")
P.S. In the displacement formulas I omitted the (constant/constant) factor as it is irrelevant to the answer; you can easily multiply the 0.5 factor by anything you need.
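Not part of the formulas above, but if the data ever ends up outside Excel, a hedged pandas sketch of the same per-depth "first reading" logic (column names mirror the sheet; CONSTANT is a placeholder for the constant/constant factor):

import numpy as np
import pandas as pd

CONSTANT = 1.0  # placeholder for (constant/constant)

# df is assumed to have columns Depth, A0, A180, B0, B180
first = df.groupby("Depth").transform("first")  # initial reading per depth
df["A_disp"] = ((df["A0"] - df["A180"]) / 2
                - (first["A0"] - first["A180"]) / 2) * CONSTANT
df["B_disp"] = ((df["B0"] - df["B180"]) / 2
                - (first["B0"] - first["B180"]) / 2) * CONSTANT
df["Resultant"] = np.sqrt(df["A_disp"] ** 2 + df["B_disp"] ** 2)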