Filter values in a dataframe column based on null values in a different column (python, dataframe)

I've been stuck on this for a bit so hopefully someone has better guidance.
I currently have a dataframe that looks something like this (only with way more rows):
+---------------+----------+
| released_date | status   |
+---------------+----------+
| 12/12/20      | released |
+---------------+----------+
| 10/01/20      | NaN      |
+---------------+----------+
| NaN           | NaN      |
+---------------+----------+
| NaN           | released |
+---------------+----------+
I wanted to do df['status'].fillna('released' if df.released_date.notnull())
i.e., fill any NaN value in the status column of df with "released" as long as df.released_date isn't a null value.
I keep getting various error messages when I try different variations of this. The code above gives a syntax error, which I imagine is because notnull() returns a boolean array?
I feel like there is a simple answer for this and I am somehow not seeing it. I haven't found any questions like this where someone is trying to organize something based on the null values in a dataframe, which leads me to wonder whether my methodology is ideal in the first place. How can I filter values in a dataframe column based on null values in a different column without using isnull() or notnull(), if those only return boolean arrays anyway? Using == Null doesn't seem to work either...

Try:
idx = df[(df['status'].isnull()) & (~df['released_date'].isnull())].index
df.loc[idx,'status'] = 'released'
First get the index of all rows where 'status' is null and 'released_date' is not null. Then use df.loc to set the status column to 'released' for those rows.
Prints:
released_date status
0 12/12/20 released
1 10/01/20 released
2 NaN NaN
3 NaN released
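For reference, here is a minimal, self-contained sketch of the same idea. The sample data is made up to mirror the table in the question, and the boolean mask is passed straight to df.loc instead of going through an intermediate index. isna() and notna() are the pandas aliases of isnull() and notnull().

```python
import numpy as np
import pandas as pd

# Hypothetical sample data mirroring the table in the question.
df = pd.DataFrame({
    "released_date": ["12/12/20", "10/01/20", np.nan, np.nan],
    "status": ["released", np.nan, np.nan, "released"],
})

# Boolean mask: status is missing AND released_date is present.
mask = df["status"].isna() & df["released_date"].notna()

# Assign through .loc using the mask directly.
df.loc[mask, "status"] = "released"

print(df)
```

Boolean arrays are exactly what .loc expects for row selection, so the fact that isnull() and notnull() return them is what makes this work.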

Related

Create a new column based on another column in a dataframe

I have a df with multiple columns. One of my columns is extra_type. Now I want to create a new column based on the values of the extra_type column. For example:
extra_type
NaN
legbyes
wides
byes
Now I want to create a new column that is 1 if extra_type is not equal to 'wides' and 0 otherwise.
I tried it like this:
df1['ball_faced'] = df1[df1['extra_type'].apply(lambda x: 1 if [df1['extra_type']!= 'wides'] else 0)]
It is not working this way. Any help on how to make this work is appreciated.
The expected output is like below:
extra_type ball_faced
NaN 1
legbyes 1
wides 0
byes 1
Note that there's no need to use apply() or a lambda as in the original question, since comparison of a pandas Series and a string value can be done in a vectorized manner as follows:
df1['ball_faced'] = df1.extra_type.ne('wides').astype(int)
Output:
extra_type ball_faced
0 NaN 1
1 legbyes 1
2 wides 0
3 byes 1
Here are links to docs for ne() and astype().
For some useful insights on when to use apply (and when not to), see this SO question and its answers. TL;DR from the accepted answer: "If you're not sure whether you should be using apply, you probably shouldn't."
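As a small runnable sketch of the vectorized approach (the sample frame is made up from the values in the question), note that a NaN entry compares as "not equal" to 'wides', so it ends up as 1, which matches the expected output. The numpy.where variant is shown only as an equivalent alternative and is not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data taken from the question.
df1 = pd.DataFrame({"extra_type": [np.nan, "legbyes", "wides", "byes"]})

# Vectorized comparison: True wherever extra_type is not 'wides' (NaN included),
# then cast the booleans to 1/0.
df1["ball_faced"] = df1["extra_type"].ne("wides").astype(int)

# Equivalent formulation with numpy.where: 0 for 'wides', 1 for everything else.
df1["ball_faced_where"] = np.where(df1["extra_type"] == "wides", 0, 1)

print(df1)
```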
df['ball_faced'] = df.extra_type.apply(lambda x: x != 'wides').astype(int)
  extra_type  ball_faced
0 NaN         1
1 legbyes     1
2 wides       0
3 byes        1

SQL Impala convert NaN values to NULL values

I have the following column in my table:
col 1
1
3
NULL
NaN
5
"Bad" aggregations return NaN instead of NULL, and the variable ends up with type DOUBLE.
I want to have only one type of missing value, so I need to convert NULL to NaN or the other way around.
My problem is that when I partition with a window function, it does not treat NaNs as equal to NULLs and creates separate subgroups, which is something I do not want.
Any suggestions on how to convert them?

How to create new columns using groupby based on logical expressions

I have this CSV file
http://www.sharecsv.com/s/2503dd7fb735a773b8edfc968c6ae906/whatt2.csv
I want to create three columns, 'MT_Value', 'M_Value', and 'T_Data'. The first one holds the mean of the data grouped by year and month, which I accomplished by doing this:
data.groupby(['Year','Month']).mean()
But for M_Value I need the mean of only the nonzero values, and for T_Data I need the number of zero values divided by the total number of values in each group. Honestly, I am a bit lost. I searched on Google and found mentions of transform, but I didn't understand them very well.
Thank you.
You could do something like this:
(data.assign(M_Value=data.Valor.where(data.Valor!=0),
             T_Data=data.Valor.eq(0))
     .groupby(['Year','Month'])
     [['Valor','M_Value','T_Data']]
     .mean()
)
Explanation: assign creates new columns with the given names.
data.Valor.where(data.Valor!=0) replaces 0 values with NaN, which are ignored when we call mean().
data.Valor.eq(0) gives True where the value is 0 and False otherwise. mean() treats these as 1 and 0, so it computes count(Valor==0)/total_count().
Output:
                Valor    M_Value    T_Data
Year Month
1970 1       2.306452   6.500000  0.645161
     2       1.507143   4.688889  0.678571
     3       2.064516   7.111111  0.709677
     4      11.816667  13.634615  0.133333
     5       7.974194  11.236364  0.290323
...               ...        ...       ...
1997 10      3.745161   7.740000  0.516129
     11     11.626667  21.800000  0.466667
     12      0.564516   4.375000  0.870968
1998 1       2.000000  15.500000  0.870968
     2       1.545455   5.666667  0.727273
[331 rows x 3 columns]
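Since the linked CSV may not be available, here is a small sketch of the same pattern on made-up data. The column names Year, Month and Valor follow the answer above. The values are invented, and the final rename to MT_Value is an assumption about how the asker wants the plain mean to be labelled.

```python
import pandas as pd

# Hypothetical data; only the column names come from the answer above.
data = pd.DataFrame({
    "Year":  [1970, 1970, 1970, 1970, 1970, 1970],
    "Month": [1, 1, 1, 1, 2, 2],
    "Valor": [0.0, 4.0, 0.0, 8.0, 0.0, 5.0],
})

result = (
    data.assign(
        # Zeros become NaN, so mean() skips them.
        M_Value=data["Valor"].where(data["Valor"] != 0),
        # True for zeros; the mean of booleans is the fraction of True values.
        T_Data=data["Valor"].eq(0),
    )
    .groupby(["Year", "Month"])[["Valor", "M_Value", "T_Data"]]
    .mean()
    .rename(columns={"Valor": "MT_Value"})  # assumed label for the plain mean
)

print(result)
```

For January 1970 in this toy data, the values are 0, 4, 0 and 8, so the plain mean is 3.0, the mean of the nonzero values is 6.0, and the share of zeros is 0.5.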

Pandas Duplicated returns some not duplicate values?

I am trying to remove duplicates from a dataset.
Before using df.drop_duplicates(), I run df[df.duplicated()] to check which values are treated as duplicates. Values that I don't consider to be duplicates are returned; see the example below. All columns are checked.
How can I get accurate duplicate results and drop only the real duplicates?
city price year manufacturer cylinders fuel odometer
whistler 26880 2016.0 chrysler NaN gas 49000.0
whistler 17990 2010.0 toyota NaN hybrid 117000.0
whistler 15890 2010.0 audi NaN gas 188000.0
whistler 8800 2007.0 nissan NaN gas 163000.0
I encountered the same problem.
At first, it looks like
df.duplicated(subset='my_column_of_interest')
returns rows that actually have unique values in the my_column_of_interest field.
This is not the case, though. The documentation shows that duplicated takes a keep parameter that controls which occurrences are marked as duplicates: 'first' (the default) marks every occurrence except the first, 'last' marks every occurrence except the last, and False marks all of them.
This means that if a value is present twice in this column, filtering with
df.duplicated(subset='my_column_of_interest') returns that value only once (its first occurrence is not flagged as a duplicate), so the result can look as if it contains only unique values.
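A short sketch of how the keep parameter changes what gets flagged. The data is made up, and only the city column name is borrowed from the question.

```python
import pandas as pd

# Hypothetical data with one fully repeated row (rows 0 and 1).
df = pd.DataFrame({
    "city":  ["whistler", "whistler", "whistler", "squamish"],
    "price": [26880, 26880, 17990, 8800],
})

# Default keep='first': every repeat except the first occurrence is flagged.
first = df.duplicated(subset="city")

# keep=False: every row whose value occurs more than once is flagged.
all_dupes = df.duplicated(subset="city", keep=False)

# No subset: rows are compared across all columns.
full_row = df.duplicated()

# Drops only rows that repeat an earlier row in every column.
deduped = df.drop_duplicates()

print(df[all_dupes])
```

Inspecting df[df.duplicated(keep=False)] shows each flagged row together with the row it duplicates, which makes it easier to see why a row was flagged.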

SSRS Chart with Grouping like in Excel

I wasn't able to find anything like this yet, but here is what I need to do.
I have a query result like this:
ID Data1 Data2 Data3 Data4 ... Data7
1 12 13 15 1 ... 12
2 12 13 15 1 ... 12
3 12 13 15 1 ... 12
4 12 13 15 1 ... 12
I need to make a bar chart with 2 values: one is the first row (ID=1) and one is the last row (ID=4). The DataX column headers are what I need the series to be paired by.
Example:
ID Insured Uninsured Rejected
1 12 3 0
4 16 9 2
In the bar chart I need to see the number of Insured for ID=1 and ID=2 next to each other, and the same for the numbers of Uninsured and Rejected.
I feel like I have tried every way possible, but I was not able to get anything besides a bar chart where all values of ID=1 were displayed and then all values for ID=2 were displayed next to each other.
I'm sure this was a very confusing way to describe it, but I hope someone can understand what I am looking for.
NOTE: I tried to do this in Excel, and it worked within 2 minutes. I set the Series filter to the 2 rows that I wanted and set the Categories to the DataX columns as described, and everything looked great. When I tried to translate this into SSRS, I was able to do all the same things in the Series and Categories, but then I had to put in values and that messed everything up.
Please help!
I bet you need to add a grouping to your values by a spanning factor.