Finding the % of missing values from the entire dataset - pandas

The shape of my dataset is (130, 20), which I found using df.shape. I also found the total number of missing values in the dataset using df.isnull().sum().sum().
Now I want to know the % of missing values in the dataset.
Total values: 130*20 = 2600
Total missing values: 850
% of missing values: (850/2600)*100 = 32.69%
I am not sure my method is right for finding the % of missing values.
Any help would be appreciated.

I usually do
df.isna().to_numpy().ravel().mean()
Or
df.isna().mean().mean()
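To double-check that both one-liners agree with the manual calculation, here is a minimal sketch using a made-up 4x2 frame; multiply by 100 to get a percentage:
import numpy as np
import pandas as pd

# Hypothetical small frame with 3 missing cells out of 8
df = pd.DataFrame({"a": [1, np.nan, 3, np.nan], "b": [np.nan, 2, 3, 4]})

total_cells = df.size                               # rows * columns = 8
total_missing = df.isna().sum().sum()               # 3

pct_manual = total_missing / total_cells * 100      # 37.5
pct_ravel = df.isna().to_numpy().ravel().mean() * 100
pct_colmeans = df.isna().mean().mean() * 100        # mean of per-column means

print(pct_manual, pct_ravel, pct_colmeans)          # 37.5 37.5 37.5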

Removed 20 rows containing missing values (geom_segment)?

I don't know why it is giving the warning message Removed 20 rows containing missing values (geom_segment). I have checked for NAs using sum(is.na(BS_cell_228)) and the answer is 0, and the maximum data point is 72 on the y-axis.
I tried coord_cartesian(ylim = c(0, 80)) but nothing has changed.
Would highly appreciate any suggestions.
Here is my code
plot3 <- plot2 + scale_y_continuous(breaks = seq(0, 70, 10))
plot3 + expand_limits(y = 0)

Pandas Replace_ column values

Hello,
I am analyzing a dataset with the following information.
The column ['program_number'] is an object but I want to change it to an integer column.
I have tried to replace some values but it doesn't work.
As you can see, some values like 6 are duplicated, e.g. '6 ' and 6.
How can I resolve it? Many thanks.
UPDATE
Didn't see 1X and 3X at first.
If you need those numbers and just want to remove the X then:
df["Program"] = df["Program"].str.strip(" X").astype(int)
If the column contains data which aren't numbers or which shouldn't be converted, you can use pd.to_numeric with errors='coerce'. Any cell which can't be converted becomes NaN. Be aware that this will result in a floating-point column.
df["Program"] = pd.to_numeric(df["Program"], errors="coerce")
Old answer
You want to use str.strip() here, rather than replace.
Try this:
df1['program_number'] = df1['program_number'].str.strip().astype(int)
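For reference, a small self-contained sketch (with made-up values mimicking the '6 ', '1X', '3X' cases described above) showing how the two approaches behave:
import pandas as pd

# Hypothetical column mixing stray spaces, X suffixes, and a non-numeric entry
df = pd.DataFrame({"Program": ["6 ", "6", "1X", "3X", "N/A"]})

# Strip spaces and a trailing X from both ends of each string
clean = df["Program"].str.strip(" X")
print(clean.tolist())                                  # ['6', '6', '1', '3', 'N/A']

# astype(int) would fail on 'N/A'; pd.to_numeric turns it into NaN instead,
# which makes the column float rather than int
print(pd.to_numeric(clean, errors="coerce").tolist())  # [6.0, 6.0, 1.0, 3.0, nan]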

Snowflake computation for dividing 2 columns giving wrong values

I am doing exactly this: Sum(2322933.99/1161800199.8)*100
I should get 1.9-something but I am getting 64-something.
Can anyone tell me why this division in Snowflake gives wrong results?
I tried converting the values to decimal and also tried the DIV0() function.
Nothing worked.
I guess that your database table has 33 rows. So you get 33 * 1.9 (because of SUM), which is about 64.
My guess, with the few details that you gave us:
sum(x)/sum(y) is different from sum(x/y)
1/2 + 2/4 + 4/8 = 1.5
(1+2+4)/(2+4+8) = 0.5
Try writing sum(total gross weight)/sum(total cases filled) instead of sum(total gross weight /total cases filled).
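The same point illustrated in pandas (the data lives in Snowflake, but the arithmetic is the same anywhere; the column names below are made up to stand in for total gross weight and total cases filled):
import pandas as pd

# Hypothetical rows reproducing the 1/2, 2/4, 4/8 example above
df = pd.DataFrame({"gross_weight": [1, 2, 4], "cases_filled": [2, 4, 8]})

sum_of_ratios = (df["gross_weight"] / df["cases_filled"]).sum()      # 1.5
ratio_of_sums = df["gross_weight"].sum() / df["cases_filled"].sum()  # 0.5

print(sum_of_ratios, ratio_of_sums)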

Encode all data in one column and assign the same code if data has the same value

I have a dataframe which has approx. 100 columns and 20000 rows. Now I want to encode one categorical column so that it has a numerical code. After checking its value counts, the result looks something like this:
df['name'].value_counts()
aaa 650
baa 350
cad 50
dae 10
ef3 1
....
The total unique values are about 3300, so I might have a code range from 1 to 3300. I will normalize the numerical codes before training on them. As I already have many columns in the dataset, I prefer not to use one-hot encoding. So how can I do it? Thank you!
You can enumerate each group using ngroup(). It would look something like:
df.assign(num_code=lambda x: x.groupby(['name']).ngroup())
I don't know what kind of information the column contains; however, I am not sure it makes sense to assign an incremental numerical code to a column that seems to be categorical when training models.
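A minimal sketch of how ngroup() behaves, using a toy frame with repeated values (the column name 'name' comes from the question; everything else is made up):
import pandas as pd

# Toy frame with a repeated categorical value
df = pd.DataFrame({"name": ["aaa", "baa", "aaa", "cad", "baa"]})

# ngroup() gives every distinct 'name' one 0-based integer label,
# so identical values always receive the same code
df = df.assign(num_code=lambda x: x.groupby(["name"]).ngroup())
print(df)
#   name  num_code
# 0  aaa         0
# 1  baa         1
# 2  aaa         0
# 3  cad         2
# 4  baa         1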

Dataframe non-null values differ from value_counts() values

There is an inconsistency with dataframes that I can't explain. In the following, I'm not looking for a workaround (I already found one) but an explanation of what is going on under the hood and how it explains the output.
One of my colleagues, whom I talked into using Python and pandas, has a dataframe "data" with 12,000 rows.
"data" has a column "length" that contains numbers from 0 to 20. she wants to divided the dateframe into groups by length range: 0 to 9 in group 1, 9 to 14 in group 2, 15 and more in group 3. her solution was to add another column, "group", and fill it with the appropriate values. she wrote the following code:
data['group'] = np.nan
mask = data['length'] < 10
data['group'][mask] = 1
mask2 = (data['length'] > 9) & (data['length'] < 15)
data['group'][mask2] = 2
mask3 = data['length'] > 14
data['group'][mask3] = 3
This code is not good, of course. The reason it is not good is that you don't know at run time whether data['group'][mask3], for example, will be a view and thus actually change the dataframe, or a copy, in which case the dataframe remains unchanged. It took me quite some time to explain this to her, since she argued, correctly, that she is doing an assignment, not a selection, so the operation should always return a view.
But that was not the strange part. The part that even I couldn't understand is this:
After performing this set of operations, we verified in two different ways that the assignment took place:
By typing data in the console and examining the dataframe summary. It told us we had a few thousand null values. The number of null values was the same as the number of rows selected by mask3, so we assumed the last assignment was made on a copy and not on a view.
By typing data.group.value_counts(). That returned 3 values: 1, 2 and 3 (surprise). We then typed data.group.value_counts().sum() and it summed up to 12,000!
So by method 2 the group column contained no null values and held all the values we wanted it to have, but by method 1 it didn't!
Can anyone explain this?
See the docs here.
You don't want to set values this way for exactly the reason you pointed out: since you don't know whether it's a view, you don't know that you are actually changing the data. pandas 0.13 will raise/warn that you are attempting to do this, but it's easiest/best to just access it like:
data.loc[mask3,'group'] = 3
which guarantees that the setitem happens in place.
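A minimal, self-contained sketch of the same pattern applied to all three groups, using a made-up six-row frame in place of the 12,000-row one from the question:
import numpy as np
import pandas as pd

# Hypothetical data standing in for the frame from the question
data = pd.DataFrame({"length": [3, 9, 10, 14, 15, 20]})

data["group"] = np.nan
data.loc[data["length"] < 10, "group"] = 1
data.loc[(data["length"] > 9) & (data["length"] < 15), "group"] = 2
data.loc[data["length"] > 14, "group"] = 3

print(data["group"].tolist())   # [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]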