I want to create a new column from "TotalPrice" with the qcut function, but some of the values come back as NaN and I don't know why.
I tried changing the data type of the column, but nothing changed.
Edit:
You are doing a qcut on df rather than on the rfm dataframe. Make sure this is what you intend to do.
Because you did not provide data for a minimal reproducible example, I can only guess that there is either not enough data or too many repeated values. In that case the underlying quantile function may fail to find the bin edges and returns NaN.
(This did not make any sense, because the "M" buckets did not make sense with "TotalPrice".)
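To illustrate the first suggestion (qcut on df rather than on rfm), here is a minimal sketch with made-up data, since none was posted: if the qcut is computed on df but assigned to a column of rfm, pandas aligns on the index, and any rfm rows whose index labels are missing from df end up as NaN.

import pandas as pd

# hypothetical stand-ins for the original df and rfm frames
df = pd.DataFrame({'TotalPrice': [10, 20, 30, 40]}, index=[0, 1, 2, 3])
rfm = pd.DataFrame({'TotalPrice': [10, 20, 30, 40]}, index=[100, 101, 102, 103])

# qcut is computed on df but assigned to rfm; pandas aligns on the index,
# so every rfm row whose label does not exist in df becomes NaN
rfm['M'] = pd.qcut(df['TotalPrice'], 4, labels=[1, 2, 3, 4])
print(rfm['M'])  # all NaN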
I have a pandas dataframe that I've extracted from a json object using pd.json_normalize.
It has 4 rows and over 60 columns, and with the exception of the 'ts' column no column has more than one value.
Is it possible to merge the four rows together to give one row that can then be written to a .csv file? I have searched the documentation and found no information on this.
To give context, the data is a one-time record from a weather station; I will have records at 5 minute intervals and need to put all the records into a database for further use.
I've managed to get the desired result. It's a little convoluted, and I would expect there is a much more succinct way to do it, but I basically manipulated the dataframe, replaced all NaNs with zero, replaced some strings with ints, and added the columns together, as shown in the code below:
import json
import pandas as pd

with open(fname, 'r') as d:
    ws = json.loads(next(d))

# one row per sensor reading; 60+ columns, mostly NaN
df = pd.json_normalize(ws['sensors'], record_path='data')

# put the four rows side by side as columns a-d
df3 = pd.concat([df.iloc[0], df.iloc[1], df.iloc[2], df.iloc[3]], axis=1)
df3.rename(columns={0: 'a', 1: 'b', 2: 'c', 3: 'd'}, inplace=True)

# zero out the NaNs and the stray values so the columns can be summed
df3 = df3.fillna(0)
df3.loc['ts', ['b', 'c', 'd']] = 0
df3.loc[['ip_v4_gateway', 'ip_v4_netmask', 'ip_v4_address'], 'c'] = 0

# collapse the four columns into one and transpose back to a single row
df3['comb'] = df3['a'] + df3['b'] + df3['c'] + df3['d']
df3.drop(columns=['a', 'b', 'c', 'd'], inplace=True)
df3 = df3.T
As quite a few people have said, the documentation on this is very patchy, so I hope this helps someone else who is struggling with the same problem!
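For what it's worth, a possibly more succinct alternative (untested against the real JSON, and assuming, as described above, that every column except 'ts' has at most one non-null value) is to take the first non-null entry of each column of the df built by json_normalize:

# first non-null value per column; falls back to 0 for all-NaN columns
# and keeps the 'ts' of the first row, matching the behaviour of the code above
row = df.apply(lambda col: col.dropna().iloc[0] if col.notna().any() else 0)
df3 = row.to_frame().T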
Can anyone explain to me exactly what is meant by the dummy variable trap? And why do we want to remove one column to avoid that trap? Please provide me with some links or explain this; I am not clear about this process.
In regression analysis there's often talk about the issue of multicollinearity, which you might already be familiar with. The dummy variable trap is simply perfect collinearity between two or more variables. It can arise if, for one binary variable, two dummies are included. Imagine you have a variable x which is equal to 1 when something is True. If you include x in your regression model along with another variable z that is the opposite of x (i.e. 1 when that same thing is False), you have two perfectly negatively correlated variables.
Here's a simple demonstration. Let's say your x is a column of True/False values in a pandas dataframe. See what happens when you use pd.get_dummies(df.x) below: the two dummies that are created mirror each other, so one of them is redundant. In simpler terms, you only need one of them, since you can always infer the value of the other from the one you have.
import pandas as pd
df = pd.DataFrame({'x': [True, False]})
pd.get_dummies(df.x)
   False  True
0      0     1
1      1     0
The same applies if you have a categorical variable that can take on more than two values. Whether binary or not, there is always a "base scenario" that is defined by the variation in the other case(s). This "base scenario" is therefore redundant and will only introduce perfect collinearity into the model if included.
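In practice the common fix is simply to keep one dummy and drop the other; pandas can do this directly with the drop_first argument of get_dummies:

import pandas as pd

df = pd.DataFrame({'x': [True, False]})
# drop_first=True keeps a single dummy; the dropped level becomes the "base scenario"
pd.get_dummies(df.x, drop_first=True)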
So what's the issue with multicollinearity/linear dependence? The short answer is that if there is imperfect multicollinearity among your explanatory variables, your estimated coefficients can be distorted/biased. If there is perfect multicollinearity (which is the case with the dummy variable trap), you can't estimate your model at all. Think of it like this: if you have a variable that can be perfectly explained by another variable, your sample data only contains valuable information about one, not two, truly unique variables, so it is impossible to obtain two separate coefficient estimates for what is effectively the same variable.
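As a toy sketch of that last point (not from the original question): with an intercept plus both dummies, one column of the design matrix is an exact linear combination of the others, so the matrix is rank deficient and ordinary least squares has no unique solution.

import numpy as np

x = np.array([1, 0, 1, 0])              # dummy: 1 when True
z = 1 - x                                # complementary dummy: 1 when False
X = np.column_stack([np.ones(4), x, z])  # intercept + both dummies

# rank is 2 rather than 3, because intercept = x + z
print(np.linalg.matrix_rank(X))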
Further Reading
Multicollinearity
Dummy Variable Trap
I'm a beginner with Python and the pandas library, and I'm rather confused by some basic DataFrame functionality. I was dropping rows from my DataFrame and passed inplace=True, so the data should be gone. But why do I still see the data when I display it with head or iloc? I checked with .info() and, judging by the difference in the row count, the data has in fact been dropped.
So why can I still see my dropped data? Any explanation or pointer would be great. Thanks
If you have NaN in only one column, just use df.dropna(inplace=True).
This should get you the result you want.
The reason your code is not working is that when you do df['to_address'], you are working with only that column; the output is a Series containing the column's values with the NaN rows removed, and using inplace=True there has no effect on the original DataFrame.
You can use df = df.dropna(subset=['to_address']) as well.
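A small sketch of the difference, using a made-up to_address column since the original data was not shown:

import numpy as np
import pandas as pd

df = pd.DataFrame({'to_address': ['a@x.com', np.nan, 'b@x.com'],
                   'amount': [1, 2, 3]})

# this operates on a Series extracted from df; the DataFrame is untouched
# (pandas may even warn about it)
df['to_address'].dropna(inplace=True)
print(len(df))  # still 3 rows

# this operates on the DataFrame itself and removes the NaN row
df = df.dropna(subset=['to_address'])
print(len(df))  # 2 rows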
For the mixError function in missForest, the documentation says
Usage:
mixError(ximp, xmis, xtrue)
Arguments
ximp: imputed data matrix with variables in the columns and observations in the rows. Note there should not be any missing values.
xmis: data matrix with missing values.
xtrue: complete data matrix. Note there should not be any missing values.
Then my question is:
If I already have xtrue, why do I need this function?
All the examples start from complete data: they introduce some NAs on purpose, then use missForest to fill in the NAs, and then calculate the error by comparing the imputed data with the original data without NAs.
But what is the point of that, if I already have the complete data?
So, the question is also
Could xtrue be the original data with all the rows containing NAs removed?
I am running a sum over one Hive table in Hue, and the return value is NaN.
Here is my code:
select sum(v1) from hivedb.tb1;
I don't know why it is giving me a NaN result. I checked whether any of my v1 values are null:
select * from hivedb.tb1 where v1 is null;
and it turns out that no record has a null value. The table has 100 million rows, so I cannot check each record manually.
Does anybody know why I am getting a NaN result?
And if it is because some rows have abnormal values, how can I find them?
Any help is appreciated. Thank you in advance!
UPDATE 1
I manually screened the first 1000 rows and luckily spotted some abnormal NaN values in tb1. They resulted from rounding errors in previous steps, so my question 1 is probably answered. Please feel free to comment if you think there could be other reasons.
I still don't know an efficient way to spot the rows with NaN values, so I am still looking forward to answers to my question #2. Please feel free to share; I appreciate your help.
UPDATE 2
The problem is solved with the help of the accepted answer below and the discussion in its comments. There are multiple ways to deal with it:
Use the condition v1 + 1 > v1; it selects the rows with non-NaN values.
Use the condition cast(v1 as string) = 'NaN'; it selects the rows with NaN values.
Hive relies on Java (plus SQL-specific semantics for NULL and friends), and Java honors the IEEE standard for number semantics, which means that NaN is tricky.
Quoting that post...
(Float.NaN == Float.NaN) always returns false. In fact, if you look at the JDK implementation of Float.isNaN(), a number is not-a-number if it is not equal to itself (which makes sense, because a number should be equal to itself). The same holds for Double.NaN.
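(The same IEEE behaviour is easy to reproduce outside Java; as a quick illustration in Python:)

import math

nan = float('nan')
print(nan == nan)        # False: NaN is never equal to anything, including itself
print(math.isnan(nan))   # True: the reliable way to test for NaN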
So, there is no point in showing you how to use the (undocumented) Hive function called reflect2, which allows you to invoke raw Java methods on Hive columns, i.e.
where v1 is not null and not reflect2(v1, "isNaN")
...because -- in theory -- you can simply state:
where v1 is not null and v1=v1
Disclaimer -- I have seen cases where the Hive optimizer makes aggressive "optimizations" and produces wrong results. In other words, if the simple v1=v1 clause does not filter out the NaN values as expected, then look into reflect2...
Edit -- indeed, the optimizer appears to ignore the v1=v1 clause in some versions of Hive (see comments) so a more devious formula is necessary:
v1 +1.0 > v1 should work... except when rounding errors make either abs(v1) << 1 or abs(v1) >> 1
other "numeric" tricks will fail similarly in edge cases, especially when v1 =0.0
In the end, the most robust approach appears to try cast(v1 as String) <>'NaN' (because all possible NaN values are displayed as "NaN" even if they are not strictly "equal" in the arithmetical sense).
Side note about reflect2 -- you can see that it is indeed not mentioned in the official Hive documentation, while reflect is mentioned (and even has a specific wiki entry). But it has been implemented as early as Hive 0.11, cf. HIVE-4025.
Edit -- Java "reflection" is now disabled by default for ODBC / JDBC / Hue connections (see comments), and cannot be re-enabled when using security plug-ins such as ranger or Sentry. So its usage is restricted to the (deprecated) hive CLI.
You can handle NaN like this:
SELECT SUM(CAST(IF(v1 = 'NaN', 0, v1) AS DOUBLE)) FROM hivedb.tb1;
Not sure if this applies in many cases, but in Hive 3 I'm getting:
select float('NaN') = float('NaN')
returns True
So in theory:
select * from hivedb.tb1 where v1 <> float('NaN');
should accomplish this.