Handling Nulls while calculating percentiles in Hive - sql

I am having some trouble handling nulls while calculating percentiles. Below is the sample data.
Code that I am using now: percentile(column_1, array(0, 0.25, 0.50, 0.75, 1)) as column_1_p
Here it considers null values too while calculating percentiles, but I need to eliminate them and use only the other valid values. I couldn't find any other function that does this.
Data: Values range from zero to 1000. I cannot replace nulls with zeros, as I already have zeros in data.
Any help here is highly appreciated.
Thanks in advance.
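One common workaround is to drop the null rows before aggregating, e.g. computing the percentile over a subquery with WHERE column_1 IS NOT NULL. The numerical effect can be sketched in Python with NumPy, whose nanpercentile ignores missing values the same way (the sample values below are made up):

```python
import numpy as np

# Hypothetical column values in the 0-1000 range, with one missing entry.
values = np.array([0.0, 250.0, 500.0, np.nan, 750.0, 1000.0])

# Plain percentile propagates the NaN into every result...
with_nulls = np.percentile(values, [0, 25, 50, 75, 100])

# ...while nanpercentile first drops the NaN, mirroring a
# "WHERE column_1 IS NOT NULL" pre-filter in the SQL.
without_nulls = np.nanpercentile(values, [0, 25, 50, 75, 100])

print(without_nulls)  # percentiles over the five valid values only
```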

Related

How can I round up values in Polars

I have some calculated float columns. I want to display the values of one column rounded, but round(pl.col("value"), 2) is not working properly in Polars. How can I do it?
df = df.with_columns(
    pl.col("value").round(2)
)

How to calculate the difference between row values based on another column value without filtering the values in between

How to calculate the difference between row values based on another column value without filtering the values in between. I want to calculate the difference between seconds for turn_marker == 1, but when I use the following method, it filters out all the zeros, and I need the zeros because I need the entire data set.
Here you can see my data set, with a column called turn_marker that has the values zero and 1, and another column with seconds. Now I want to calculate the time between those rows where turn_marker equals 1.
dataframe = main_dataframe.query("turn_marker=='1;'")
main_dataframe["seconds_diff"] = dataframe["seconds"].diff()
main_dataframe
I would be grateful if you could help me.
You can do this:
import numpy as np

main_dataframe['indx'] = main_dataframe.index
main_dataframe['diff'] = main_dataframe.sort_values(by=['turn_marker', 'indx'], ascending=[False, True])['seconds'].diff()
main_dataframe.loc[main_dataframe.turn_marker == '0;', 'diff'] = np.nan
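An alternative that avoids the sort trick: compute the diff on the filtered rows only and assign it back through the index, so the rows where turn_marker is 0 are kept but simply stay NaN (column names follow the question; the sample data is made up):

```python
import pandas as pd

# Hypothetical sample data shaped like the question's dataset.
main_dataframe = pd.DataFrame({
    "turn_marker": [0, 1, 0, 0, 1, 0, 1],
    "seconds":     [1, 5, 6, 7, 12, 13, 20],
})

# Diff only the turn_marker == 1 rows; index alignment leaves
# every other row as NaN, so the full dataset is preserved.
mask = main_dataframe["turn_marker"] == 1
main_dataframe.loc[mask, "seconds_diff"] = main_dataframe.loc[mask, "seconds"].diff()

print(main_dataframe)
```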

Decimal Round figure Condition

I want to round the figure of "Rate" up if the decimal part is .5 or higher (for example, 17.57 becomes 18 and 20.98 becomes 21).
On the other hand, if the decimal part is lower than .5, round down (for example, 17.23 becomes 17 and 20.49 becomes 20).
I am attaching an image. Please let me know the condition. Thank you.
You just need to use the ROUND() function, as below:
select round(86.54, 0) -- zero will work in your case
Use the ROUND function:
select round('19.6', 0)
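If the same rule is ever needed in Python, note that the built-in round() uses round-half-to-even, so .5 does not always go up; the decimal module's ROUND_HALF_UP matches the behavior asked for here (a minimal sketch):

```python
from decimal import Decimal, ROUND_HALF_UP

def round_half_up(value: str) -> int:
    """Round to the nearest integer, with .5 always rounding up."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

print(round_half_up("17.57"))  # 18
print(round_half_up("20.98"))  # 21
print(round_half_up("17.23"))  # 17
print(round_half_up("20.49"))  # 20
```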

Need explanation on how pandas.drop is working here

I have a data frame, let's say xyz. I have written code to find out the percentage of null values each column possesses in the dataframe. My code below:
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)
Let's say I got the following results:
abc 26.63
def 36.58
ghi 78.46
I want to drop column ghi because it has more than 70% of null values.
I achieved it using the following code:
xyz = xyz.drop(xyz.loc[:,round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70].columns, 1)
but I did not understand how this code works; can anyone please explain it?
The code is doing the following:
xyz.drop([...], 1)
removes the specified elements along a given axis, either rows or columns. In this particular case, df.drop(..., 1) means you're dropping along axis 1, i.e., columns.
xyz.loc[:, ... ].columns
returns the column names resulting from your slicing condition.
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70
this instruction counts the nulls in each column and divides by the number of rows, effectively computing the percentage of NaN in each column. The result is then rounded to two decimal places, and the comparison returns True where the NaN share exceeds 70%. Hence, you get a mapping between columns and a True/False array.
Putting everything together: you first produce a Boolean array marking which columns have more than 70% NaN; then, using .loc, you apply Boolean indexing to select only the columns you want to drop (NaN % > 70%); then .columns recovers the names of those columns, which are finally passed to .drop.
Hopefully this clears things up!
If the code is hard to understand, you can just use dropna with thresh, since pandas already covers this case.
df = df.dropna(axis=1, thresh=round(len(df) * 0.3))
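Both approaches can be checked side by side on a small frame (column names and values invented for the demo):

```python
import numpy as np
import pandas as pd

# 10 rows; 'ghi' is 80% null, 'abc' is only 20% null.
xyz = pd.DataFrame({
    "abc": [1] * 8 + [np.nan] * 2,
    "ghi": [1] * 2 + [np.nan] * 8,
})

# Original approach: boolean-index the columns whose null % exceeds 70.
pct_null = round(100 * (xyz.isnull().sum() / len(xyz.index)), 2)
dropped = xyz.drop(columns=xyz.loc[:, pct_null > 70].columns)

# dropna equivalent: keep columns with at least 30% non-null values.
via_thresh = xyz.dropna(axis=1, thresh=round(len(xyz) * 0.3))

print(list(dropped.columns), list(via_thresh.columns))  # both keep only 'abc'
```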

How to identify the indices for the nth smallest values in a multi dimensional array in VBA?

I currently have a 3 dimensional array full of different values. I would like to find the indices corresponding to the "nth" smallest values in the array. For example... If the 3 smallest values were 0.1, 0.2 and 0.3 I would like to see, in order, the indices for these values. Any help would be greatly appreciated.
A possible way to approach this would be to store each element's original indices alongside its value, sort by the values to find the smallest items, and then read back the original indices. Take a look at this: VBA array sort function?
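VBA has no built-in for this, but the shape of the solution can be sketched in Python with NumPy: flatten the array, argsort it, and map the n smallest flat positions back to 3-D indices (the array values here are made up):

```python
import numpy as np

arr = np.array([[[0.9, 0.2], [0.7, 0.1]],
                [[0.3, 0.8], [0.5, 0.6]]])

n = 3
# Flat positions of the n smallest values, ascending by value.
flat_order = np.argsort(arr, axis=None)[:n]

# Convert each flat position back to an (i, j, k) index tuple.
indices = [tuple(int(x) for x in np.unravel_index(i, arr.shape))
           for i in flat_order]

print(indices)  # 3-D positions of 0.1, 0.2 and 0.3, in that order
```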