The shape of my dataset is (130, 20), which I found using df.shape in Python. I also found the total number of missing values in the dataset using df.isnull().sum().sum().
Now I want to know the % of missing values in the dataset.
Total values: 130 * 20 = 2600
Total missing values: 850
% of missing values: (850 / 2600) * 100 = 32.69%
I am not sure whether my method is right for finding the % of missing values.
Any help would be appreciated.
I usually do
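One way to check the calculation from the question, as a minimal sketch (the DataFrame df below is hypothetical and its values are made up; this is not necessarily the snippet the answer referred to). It divides the count of missing cells by the total number of cells, which is the same arithmetic as 850 / 2600, and also shows a per-column breakdown:

```python
import numpy as np
import pandas as pd

# Hypothetical 4x3 frame with 3 missing cells (illustration only).
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [np.nan, 2.0, 3.0, 4.0],
    "c": [1.0, 2.0, np.nan, 4.0],
})

total_cells = df.size                    # rows * columns, same as shape[0] * shape[1]
total_missing = df.isnull().sum().sum()  # number of NaN cells in the whole frame
pct_missing = total_missing / total_cells * 100

# Equivalent shortcut: the mean of the boolean mask over all cells.
pct_missing_alt = df.isnull().to_numpy().mean() * 100

# Percentage of missing values per column.
pct_by_column = df.isnull().mean() * 100

print(pct_missing, pct_missing_alt)
print(pct_by_column)
```

So the method in the question looks fine; df.size just saves multiplying the two shape values by hand.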
I don't know why it is giving the warning message "Removed 20 rows containing missing values (geom_segment)". I have checked for NAs using sum( and the answer is 0, and the maximum data point is 72 on the y-axis.
I tried coord_cartesian(ylim = c(0, 80)) but nothing changed.
I would highly appreciate any suggestions.
Here is my code
plot3 <- plot2+scale_y_continuous(breaks=seq(0,70,10))
I am analyzing the following dataset with this information.
The column ['program_number'] is an object, but I want to change it to an integer column.
I have tried to replace some values, but it doesn't work.
As you can see, some values are duplicated, for example '6 ' and 6.
How can I resolve it? Many thanks
Didn't see 1X and 3X at first.
If you need those numbers and just want to remove the X then:
df["Program"] = df["Program"].str.strip(" X").astype(int)
If there is data in the column that isn't numeric or shouldn't be converted, you can use pd.to_numeric with errors='coerce'. Cells that can't be converted become NaN. Be aware that this results in floating-point numbers.
df["Program"] = pd.to_numeric(df["Program"], errors="coerce")
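A small sketch of what the coercion does, assuming a hypothetical Series with made-up values. The last step, casting to the nullable "Int64" dtype, is an addition that is not part of the answer above; it is one possible way to keep integers while still allowing missing values:

```python
import pandas as pd

# Hypothetical column: two clean numbers and two entries that cannot be parsed.
s = pd.Series(["6", "12", "1X", "n/a"])

# Unparseable entries become NaN, so the result is a float64 column.
converted = pd.to_numeric(s, errors="coerce")

# Optional: nullable integer dtype keeps whole numbers and marks gaps as <NA>.
as_int = converted.astype("Int64")

print(converted)
print(as_int)
```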
You want to use str.strip() here, rather than replace.
Try this:
df1['program_number'] = df1['program_number'].str.strip().astype(int)
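One caveat, sketched below with a hypothetical df1: if the object column mixes real integers with strings (as '6 ' and 6 in the question suggests it might), the .str accessor returns NaN for the non-string cells and the final astype(int) then fails. Casting to str first is one way around that; this assumes every value is a whole number once the whitespace is removed:

```python
import pandas as pd

# Hypothetical object column mixing padded strings and a real integer.
df1 = pd.DataFrame({"program_number": ["6 ", 6, " 12", "3"]})

# Cast everything to str so .str.strip() applies to every cell,
# then remove surrounding whitespace and convert to integers.
cleaned = df1["program_number"].astype(str).str.strip().astype(int)

df1["program_number"] = cleaned
print(df1["program_number"].value_counts())
```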
I am doing exactly this: Sum(2322933.99/1161800199.8)*
I should get 1.9 something, but I am getting 64 something.
Can anyone explain why this division in Snowflake is giving wrong results?
I tried converting the values to decimals and also tried the div0() function.
Nothing worked.
I guess that your database table has 33 rows. So you get 33 * 1.9 (because of SUM), which is about 64.
My guess, with the few details that you gave us:
sum(x)/sum(y) is different than sum(x/y)
1/2 + 2/4 + 4/8 = 1.5
(1+2+4)/(2+4+8) = 0.5
Try writing sum(total gross weight)/sum(total cases filled) instead of sum(total gross weight /total cases filled).
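The same arithmetic in a few lines of Python, as a sketch with the three made-up rows from the example above (the tuple layout is only an assumption for illustration, not the actual table):

```python
# Hypothetical rows: (total gross weight, total cases filled)
rows = [(1, 2), (2, 4), (4, 8)]

# SUM(x / y): one ratio per row, then added up.
sum_of_ratios = sum(x / y for x, y in rows)

# SUM(x) / SUM(y): totals first, then a single division.
ratio_of_sums = sum(x for x, _ in rows) / sum(y for _, y in rows)

print(sum_of_ratios)  # 1.5
print(ratio_of_sums)  # 0.5
```

When every row has the same ratio, the sum of ratios is simply that ratio multiplied by the number of rows, which is the effect described in the first answer.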
I have a dataframe with approximately 100 columns and 20,000 rows. I want to encode one categorical column so that it has a numerical code. After checking its value counts, the result looks something like this:
aaa 650
baa 350
cad 50
dae 10
ef3 1
There are about 3300 unique values in total, so the codes might range from 1 to 3300. I will normalize the numerical code before training. As I already have many columns in the dataset, I prefer not to use one-hot encoding. How can I do it? Thank you!
You can enumerate each group using ngroup(). It would look something like:
df.assign(num_code=lambda x: x.groupby(['name']).ngroup())
I don't know what kind of information the column contains, however I am not sure it makes sense to assign an incremental numerical code to a column that seems to be categorical for training models.
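A runnable sketch of the ngroup() approach on a tiny hypothetical frame (the column name "name" and its values are made up). pd.factorize is shown next to it as an alternative that was not mentioned above; it numbers categories in order of first appearance rather than in sorted order:

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"name": ["baa", "aaa", "cad", "aaa", "baa"]})

# ngroup() numbers the groups in sorted key order: aaa -> 0, baa -> 1, cad -> 2.
with_code = df.assign(num_code=lambda x: x.groupby(["name"]).ngroup())

# Alternative: factorize numbers categories by first appearance: baa -> 0, aaa -> 1, cad -> 2.
codes, uniques = pd.factorize(df["name"])

print(with_code)
print(codes, list(uniques))
```

Both start at 0; add 1 if the codes should run from 1 to 3300.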
There is an inconsistency with dataframes that I can't explain. In the following, I'm not looking for a workaround (I already found one) but for an explanation of what is going on under the hood and how it explains the output.
One of my colleagues, whom I talked into using Python and pandas, has a dataframe "data" with 12,000 rows.
"data" has a column "length" that contains numbers from 0 to 20. She wants to divide the dataframe into groups by length range: 0 to 9 in group 1, 10 to 14 in group 2, 15 and more in group 3. Her solution was to add another column, "group", and fill it with the appropriate values. She wrote the following code:
data['group'] = np.nan
mask = data['length'] < 10;
data['group'][mask] = 1;
mask2 = (data['length'] > 9) & (data['length'] < 15);
data['group'][mask2] = 2;
mask3 = data['length'] > 14;
data['group'][mask3] = 3;
This code is not good, of course. The reason is that you don't know at run time whether data['group'][mask3], for example, will be a view and thus actually change the dataframe, or a copy, in which case the dataframe remains unchanged. It took me quite some time to explain it to her, since she argued correctly that she is doing an assignment, not a selection, so the operation should always return a view.
But that was not the strange part. The part that even I couldn't understand is this:
After performing this set of operation, we verified that the assignment took place in two different ways:
1. By typing data in the console and examining the dataframe summary. It told us we had a few thousand null values. The number of null values was the same as the size of mask3, so we assumed the last assignment was made on a copy and not on a view.
2. By typing That returned 3 values: 1, 2 and 3 (surprise). We then typed and it summed up to 12,000!
So by method 2, the group column contained no null values and all the values we wanted it to have. But by method 1, it didn't!
Can anyone explain this?
see docs here.
You don't want to set values this way for exactly the reason you pointed out: since you don't know if it's a view, you don't know that you are actually changing the data. pandas 0.13 will raise/warn that you are attempting to do this, but it is easiest and best to just access it like:
data.loc[mask3,'group'] = 3
which guarantees an in-place setitem.
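Applied to all three groups from the question, that could look like the sketch below. The seven-row frame is hypothetical and only stands in for the 12,000-row "data"; the column names follow the question:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataframe.
data = pd.DataFrame({"length": [0, 5, 9, 10, 14, 15, 20]})
data["group"] = np.nan

# A single .loc call selects rows and column together, so the
# assignment always lands on the original dataframe.
data.loc[data["length"] < 10, "group"] = 1
data.loc[(data["length"] > 9) & (data["length"] < 15), "group"] = 2
data.loc[data["length"] > 14, "group"] = 3

print(data)
print(data["group"].isna().sum())  # no nulls left
```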