Getting only True values and their respective indices from a Pandas series

I have a pandas series that looks like this, extracted on querying a dataframe.
t_loc=
312 False
231 True
324 True
286 False
123 False
340 True
I want only the indices whose boolean value is True.
I tried t_loc.index, but that gives me all the indices. t_loc['True'] and t_loc[True] are both futile. Need help.
Also, I need to update these locations with a single number where the value is True. How can I update a column in a dataframe given the location numbers?
Desired O/P:
[231,324,340]
Need to update e.g. df[col1] at 231... is it df[col1].loc[231]? How do I specify multiple locations? Can I pass the entire list, since I need to update all the locations with a single value?

This actually works too (compare against True or False depending on which rows you want):
t_loc.index[t_loc == True]

You can try this as well:
t_loc.astype(int).index[t_loc == 1]
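For the update part of the question, a minimal sketch (col1 and the fill value 99 are placeholders, not from the original post):
true_idx = t_loc[t_loc].index.tolist()   # [231, 324, 340]
df.loc[true_idx, 'col1'] = 99
Because .loc accepts a list of row labels, the single value on the right-hand side is broadcast to every listed row.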

How to update column A value with column B value based on column B's string length property?

I scraped a real estate website and produced a CSV output with data that requires cleaning and structuring. So far, my code has properly organized and reformatted the data so it works with stats software.
However, every now and then, my 'Gross area' column has the wrong value in m2. The correct value appears in another column ('Furbished').
Gross_area        Furbished
170 #erroneous    190 m2
170 #erroneous    190 m2
160 #correct      Yes
155 #correct      No
I tried using the np.where function. However, I could not specify the condition based on string length, which would allow me to target all '_ _ _ m2' values in column 'Furbished' and reinsert them in 'Gross_area'. It just doesn't work.
df['Gross area']=np.where(len(df['Furbished']) == 6, df['Furbished'], df['Gross area'])
As an alternative, I tried setting cumulative conditions to precisely target my '_ _ _ m2' values and insert them in my 'Gross area' column. It does not work either:
df['Gross area']=np.where(df['Furbished'] != 'Yes' or 'No', df['Furbished'], df['Gross area'])
The outcome I seek is:
Gross_area    Furbished
190 m2        190 m2
190 m2        190m2
160           Yes
Any suggestions? A string-length criterion on the 'Furbished' column would be the best option, as I have other instances that require the same treatment :)
Thanks in advance for your help!
There is probably a better way to do this, but you can get the intended effect with a simple df.apply() call.
df['Gross area'] = df.apply(lambda row: row['Furbished'] if len(row['Furbished']) == 6 else row['Gross area'], axis=1)
With a simple change, you can also keep the 'Gross area' column in the right type.
df['Gross area'] = df.apply(lambda row: float(row['Furbished'][:-2]) if len(row['Furbished']) == 6 else row['Gross area'], axis=1)
You can use Series.where:
df['Gross_area'] = df['Furbished'].where(df['Furbished'].str.len() == 6, df['Gross_area'])
This tells you to use the value in the Furbished column if its length is 6, otherwise use the value in the Gross_area column.
Result:
Gross_area Furbished
0 190 m2 190 m2
1 190 m2 190 m2
2 160 #correct Yes
3 155 #correct No
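For completeness, the np.where approach from the question also works once the length test uses the vectorized .str.len() instead of Python's len() (a sketch using the column names from the question):
import numpy as np
# take 'Furbished' where its string length is 6, otherwise keep 'Gross area'
df['Gross area'] = np.where(df['Furbished'].str.len() == 6, df['Furbished'], df['Gross area'])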
Thanks a lot for your help! Derek's suggestion was the simplest to implement in my program:
df['Gross area']=df['Furbished'].where(df['Furbished'].str.len()==6,df['Gross area'])
I could create a set of rules to replace or delete all the misreferenced data :)
To update column A with the value from column B where column A equals a given string:
df['Energy_Class']=np.where(df['Energy_Class']=='Usado',df['Bathrooms'],df['Energy_Class'])
To replace string segments found within column rows:
net=[]
for row in net_col:
    net.append(row)
net_in=[s for s in prices if 'm²' in s]
print(net_in)
net_1=[s.replace('m²','') for s in net]
net_2=[s.replace(',','.') for s in net_1]
net_3=[s.replace('Sim','') for s in net_2]
df['Net area']=np.array(net_3)
To create a new column and fill it with a standard value B when value A is found in the existing target column's rows:
Terrace_list=[]
caracl0=df['Caracs/0']
for row in caracl0:
    if row == 'Terraço':
        Terrace_list.append('Yes')
    else:
        Terrace_list.append('No')
df['Terraces']=np.array(Terrace_list)
To set a pre-set value B in existing column X if value A is found in existing column Y:
df.loc[df['Caracs/1']=='Terraço','Terraces']='Yes'
Hope this helps someone out.

Finding the mean of a column, but excluding a single value

Imagine I have a dataset that is like so:
ID birthyear weight
0 619040 1962 0.1231231
1 600161 1963 0.981742
2 25602033 1963 1.3123124
3 624870 1987 10,000
and I want to get the mean of the weight column, but the obvious 10,000 is distorting the actual mean. In this situation I cannot change the value and must work around it. This is what I've got so far, but obviously it's still including that last value.
avg_num_items = df_cleaned['trans_quantity'].mean()
translist = df_cleaned['trans_quantity'].tolist()
My dataframe is df_cleaned and the column I'm actually working with is 'trans_quantity', so how do I compute the mean while working around that value?
Since you added SQL to your tags: in SQL you'd exclude it in the WHERE clause:
SELECT AVG(trans_quantity)
FROM your_data_base
WHERE trans_quantity <> 10000
In Pandas:
avg_num_items = df_cleaned[df_cleaned["trans_quantity"] != 10000]["trans_quantity"].mean()
You can also replace the value with NaN and skip it in the mean:
avg_num_items = df_cleaned["trans_quantity"].replace(10000, np.nan).mean(skipna=True)
With pandas, ensure you have numeric data (10,000 is a string), filter out the values above a threshold, and take the mean:
(pd.to_numeric(df['weight'], errors='coerce')
.loc[lambda x: x<10000]
.mean()
)
output: 0.8057258333333334

Adding column value for a list of indexes

I have a list of indexes and I'm trying to populate a column 'Type' for these rows only.
What I tried to do:
index_list={1,5,9,10,13}
df.loc[index_list,'Type']=="gain+loss"
Output:
1 False
5 False
9 False
10 False
13 False
But the output just gives the list with all False instead of populating these rows.
Thanks for any advice.
You need to put a single equals sign instead of a double one. In Python, and in most programming languages, == is the comparison operator; in your case you need the assignment operator =.
So the following code will do what you want:
index_list={1,5,9,10,13}
df.loc[index_list,'Type']="gain+loss"
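One caveat (worth checking against your pandas version): recent pandas releases reject a set as a .loc indexer, so passing a plain list is the safer form:
index_list = [1, 5, 9, 10, 13]          # a list, not a set
df.loc[index_list, 'Type'] = "gain+loss"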

groupby 2 columns and count into separate columns based on one column's cases

I'm trying to group by 2 columns, of which the first has 5 distinct values and the second has 2.
My data looks like this (screenshot omitted). Using
df_counted = (df_analysis
    .groupby(['TYPE', 'RESULT'])
    .size()
    .sort_values(ascending=False)
    .reset_index(name='COUNT'))
I was able to transform it into the cases I want (screenshot omitted). However, I don't want a column for RESULT, just the counts. It's supposed to look like this:
COUNT_TRUE COUNT_FALSE
FORWARD 21 182
BACKWARD 34 170
RIGHT 24 298
LEFT 20 242
NEUTRAL 16 82
The best I could do there was this (screenshot omitted). How do I get there?
Pandas can build a pivot table directly from a DataFrame, and your task can be done that way as well.
df_counted.pivot_table(index="TYPE", columns="RESULT", values="COUNT")
Result (screenshot omitted):
Solved it by going kind of full SQL on it. It's not elegant, but it works:
df_counted is the last df from the question with the NaN values.
# keep the first occurrence of each TYPE
df_pos = df_counted.drop_duplicates(subset=['TYPE'], keep='first').drop(columns=['COUNT_POS'])
# keep the last occurrence of each TYPE
df_neg = df_counted.drop_duplicates(subset=['TYPE'], keep='last').drop(columns=['COUNT_NEG'])
# join on TYPE
df = df_pos.set_index('TYPE').join(df_neg.set_index('TYPE'))
If someone has a more elegant way of doing this, I'd be super interested to see it.
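A possibly more elegant route (a sketch, assuming df_analysis still holds the raw TYPE and RESULT columns and that RESULT is boolean; the rename only matches the desired headers):
import pandas as pd

# count each TYPE/RESULT combination and spread the RESULT values into their own columns
counts = pd.crosstab(df_analysis['TYPE'], df_analysis['RESULT'])
# for a boolean RESULT the columns come out as False, True
counts.columns = ['COUNT_FALSE', 'COUNT_TRUE']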

Need explanation on how pandas.drop is working here

I have a data frame, let's say xyz. I have written code to find out the % of null values each column possesses in the dataframe. My code is below:
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)
Let's say I got the following results:
abc 26.63
def 36.58
ghi 78.46
I want to drop column ghi because it has more than 70% null values.
I achieved it using the following code:
xyz = xyz.drop(xyz.loc[:,round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70].columns, 1)
but I did not understand how this code works. Can anyone please explain it?
The code is doing the following:
xyz.drop( [...], 1)
removes the specified elements along a given axis, either rows or columns. In this particular case, xyz.drop(..., 1) means you're dropping along axis 1, i.e., columns.
xyz.loc[:, ... ].columns
will return the column names (an Index) resulting from your slicing condition.
round(100*(xyz.isnull().sum()/len(xyz.index)), 2)>70
this instruction counts the nulls in each column and normalizes by the number of rows, effectively computing the percentage of NaN per column. The amount is then rounded to 2 decimal places and, finally, it returns True if the share of NaN is more than 70%. Hence, you get a mapping between columns and a True/False array.
Putting everything together: you first produce a Boolean array that marks which columns have more than 70% NaN; then, with .loc, you use Boolean indexing to look only at the columns you want to drop (NaN % > 70%); then, with .columns, you recover the names of those columns, which are finally passed to the .drop instruction.
Hopefully this clears things up!
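Broken into steps, the one-liner does roughly the following (a sketch on the same xyz dataframe):
null_pct = round(100 * (xyz.isnull().sum() / len(xyz.index)), 2)   # % of nulls per column
mask = null_pct > 70                      # True for columns with more than 70% nulls
cols_to_drop = xyz.loc[:, mask].columns   # names of those columns
xyz = xyz.drop(cols_to_drop, axis=1)      # drop them along the column axis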
If the code is hard to understand, you can just use dropna with thresh, since pandas already covers this case.
df=df.dropna(axis=1,thresh=round(len(df)*0.3))