How do I clean this dataframe? - pandas

In row 2, I have a value "AVE" in the 'address' column that I would like to join with the 'address' value in row 1. The result should be row 1 'address' reads as "NEWPORT AVE / HIGHLAND AVE". How do I do this?
I also need to perform the same function with row 3 where 'action_taken' reads as "SERVICE RENDERED" with "RENDERED" taken from row 4.
|incident_no | date_reported | time_reported | address | incident_type | action_taken
------------------------------------------------------------------------------------------------------
1 | 2100030948 | 2021-05-16 | 23:21:00 | NEWPORT AVE / HIGHLAND | ERRATIC M/V | UNFOUNDED
2 | <NA> | NaT | NaT | AVE | NaN | NaN
3 | 2100030947 | 2021-05-16 | 23:16:00 | FALMOUTH ST | SECURITY CHECK| SERVICE
4 | <NA> | NaT | NaT | NaN | NaN | RENDERED
5 | 2100030946 | 2021-05-16 | 22:55:00 | PINE RD | SECURITY CHECK| SERVICE
``

First columns from list forward filling missing values, then group by them and aggregate join with remove missing values:
cols = ['incident_no','date_reported','time_reported']
df[cols] = df[cols].ffill()
df = df.groupby(cols).agg(lambda x: ' '.join(x.dropna())).reset_index()

Related

Pandas apply to a range of columns

Given the following dataframe, I would like to add a fifth column that contains a list of column headers when a certain condition is met on a row, but only for a range of dynamically selected columns (ie subset of the dataframe)
| North | South | East | West |
|-------|-------|------|------|
| 8 | 1 | 8 | 6 |
| 4 | 4 | 8 | 4 |
| 1 | 1 | 1 | 2 |
| 7 | 3 | 7 | 8 |
For instance, given that the inner two columns ('South', 'East') are selected and that column headers are to be returned when the row contains the value of one (1), the expected output would look like this:
Headers
|---------------|
| [South] |
| |
| [South, East] |
| |
The following one liner manages to return column headers for the entire dataframe.
df['Headers'] = df.apply(lambda x: df.columns[x==1].tolist(),axis=1)
I tried adding the dynamic column range condition by using iloc but to no avail. What am I missing?
For reference, these are my two failed attempts (N1 and N2 being column range variables here)
df['Headers'] = df.iloc[N1:N2].apply(lambda x: df.columns[x==1].tolist(),axis=1)
df['Headers'] = df.apply(lambda x: df.iloc[N1:N2].columns[x==1].tolist(),axis=1)
This works:
df=pd.DataFrame({'North':[8,4,1,7],'South':[1,4,1,3],'East':[8,8,1,7],\
'West':[6,4,2,8]})
df1=df.melt(ignore_index=False)
condition1=df1['variable']=='South'
condition2=df1['variable']=='East'
condition3=df1['value']==1
df1=df1.loc[(condition1|condition2)&condition3]
df1=df1.groupby(df1.index)['variable'].apply(list)
df=df.join(df1)

How to print rows of data where the difference in columns is >1

the data table i'm using is:
enter code here
|State| Sno Center| Mar-21| Apr-21|
|AP 1 | Guntur | 121 | 121.1 |
| 2 | Nellore | 118.8 | 118.3|
| 3 | Visakhapatnam| 131.6 | 131.5|
|ASM | 4 Biswanath-| 123.7 | 124.5|
| 5 | Doom-Dooma | 127.8 |128.2|
| 6 | Guwahati | 125.9 |128.2|
| 7 | Labac-Silchar| 114.2 | 115.4|
| 8 | Numaligarh- | 114.2 | 115.1|
| 9 | Sibsagar | 117.7 | 117.3|
| 10| Munger-Jamalpur|117.2 | 118.3|
I want to find the difference between the columns Mar-21 and Apr-21 and print those rows only where the difference is >1.
I tried the following
from numpy import median
import pandas as pd
from pandas.core.tools.numeric import to_numeric
df=pd.read_csv('CPIIW_421a.csv')
mydiff=(df['Apr-21']-df['Mar-21'])
print(mydiff)
df['diff']=(df['Apr-21']-df['Mar-21'])
print(df['diff'])
this code put displays only one column of difference as under, instead of details of rows.
0 | 0.1|
1 | -0.5|
2 | -0.1|
3 | 0.8|
4 | 0.4|
I need all rows display where difference is >1.
How I need to proceed further.
I also want to copy the required data in a new csv file. please advise.
I am a beginner.
thanks
I think you just need this in your last print:
print(df.loc[df['diff'] > 1])
to save it to csv:
df.loc[df['diff'] > 1].to_csv('yourcsv.csv')
You can try as follows
df_new = df[(df["Apr-21"]-df["Mar-21"]) > 1].copy()
For example
df = pd.DataFrame(data={'State':[1,2,3,4,5,6, 7, 8, 9, 10],
'Sno Center': ["Guntur", "Nellore", "Visakhapatnam", "Biswanath", "Doom-Dooma", "Guwahati", "Labac-Silchar", "Numaligarh", "Sibsagar", "Munger-Jamalpu"],
'Mar-21': [121, 118.8, 131.6, 123.7, 127.8, 125.9, 114.2, 114.2, 117.7, 117.7],
'Apr-21': [121.1, 118.3, 131.5, 124.5, 128.2, 128.2, 115.4, 115.1, 117.3, 118.3]})
df
State Sno Center Mar-21 Apr-21
0 1 Guntur 121.0 121.1
1 2 Nellore 118.8 118.3
2 3 Visakhapatnam 131.6 131.5
3 4 Biswanath 123.7 124.5
4 5 Doom-Dooma 127.8 128.2
5 6 Guwahati 125.9 128.2
6 7 Labac-Silchar 114.2 115.4
7 8 Numaligarh 114.2 115.1
8 9 Sibsagar 117.7 117.3
9 10 Munger-Jamalpu 117.7 118.3
df_new = df[(df["Apr-21"]-df["Mar-21"]) > 1].copy()
The result
df_new
State Sno Center Mar-21 Apr-21
5 6 Guwahati 125.9 128.2
6 7 Labac-Silchar 114.2 115.4

Iterate through pandas data frame and replace some strings with numbers

I have a dataframe sample_df that looks like:
bar foo
0 rejected unidentified
1 clear caution
2 caution NaN
Note this is just a random made up df, there are lot of other columns lets say with different data types than just text. bar and foo might also have lots of empty cells/values which are NaNs.
The actual df looks like this, the above is just a sample btw:
| | Unnamed: 0 | user_id | result | face_comparison_result | created_at | facial_image_integrity_result | visual_authenticity_result | properties | attempt_id |
|-----:|-------------:|:---------------------------------|:---------|:-------------------------|:--------------------|:--------------------------------|:-----------------------------|:----------------|:---------------------------------|
| 0 | 58 | ecee468d4a124a8eafeec61271cd0da1 | clear | clear | 2017-06-20 17:50:43 | clear | clear | {} | 9e4277fc1ddf4a059da3dd2db35f6c76 |
| 1 | 76 | 1895d2b1782740bb8503b9bf3edf1ead | clear | clear | 2017-06-20 13:28:00 | clear | clear | {} | ab259d3cb33b4711b0a5174e4de1d72c |
| 2 | 217 | e71b27ea145249878b10f5b3f1fb4317 | clear | clear | 2017-06-18 21:18:31 | clear | clear | {} | 2b7f1c6f3fc5416286d9f1c97b15e8f9 |
| 3 | 221 | f512dc74bd1b4c109d9bd2981518a9f8 | clear | clear | 2017-06-18 22:17:29 | clear | clear | {} | ab5989375b514968b2ff2b21095ed1ef |
| 4 | 251 | 0685c7945d1349b7a954e1a0869bae4b | clear | clear | 2017-06-18 19:54:21 | caution | clear | {} | dd1b0b2dbe234f4cb747cc054de2fdd3 |
| 5 | 253 | 1a1a994f540147ab913fcd61b7a859d9 | clear | clear | 2017-06-18 20:05:05 | clear | clear | {} | 1475037353a848318a32324539a6947e |
| 6 | 334 | 26e89e4a60f1451285e70ca8dc5bc90e | clear | clear | 2017-06-17 20:21:54 | suspected | clear | {} | 244fa3e7cfdb48afb44844f064134fec |
| 7 | 340 | 41afdea02a9c42098a15d94a05e8452b | NaN | clear | 2017-06-17 20:42:53 | clear | clear | {} | b066a4043122437bafae3ddcf6c2ab07 |
| 8 | 424 | 6cf6eb05a3cc4aabb69c19956a055eb9 | rejected | NaN | 2017-06-16 20:00:26 |
I want to replace any strings I find with numbers, per the below mapping.
def no_strings(df):
columns=list(df)
for column in columns:
df[column] = df[column].map(result_map)
#We will need a mapping of strings to numbers to be able to analyse later.
result_map = {'unidentified':0,"clear": 1, 'suspected': 2,"caution" : 3, 'rejected':4}
So the output might look like:
bar foo
0 4 0
1 1 3
2 3 NaN
For some reason, when I run no_strings(sample_df) I get errors.
What am I doing wrong?
df['bar'] = df['bar'].map(result_map)
df['foo'] = df['foo'].map(result_map)
df
bar foo
0 4 0
1 1 3
2 3 2
However, if you wish to be on the safe side (assuming a key/value is not in your result_map and you dont want to see a NaN) do this:
df['foo'] = df['foo'].map(lambda x: result_map.get(x, 'not found'))
df['bar'] = df['bar'].map(lambda x: result_map.get(x, 'not found'))
so an out put for this df
bar foo
0 rejected unidentified
1 clear caution
2 caution suspected
3 sdgdg 0000
will result in:
bar foo
0 4 0
1 1 3
2 3 2
3 not found not found
To be extra efficient:
cols = ['foo','bar','other_columns']
for c in cols:
df[c] = df[c].map(lambda x: result_map.get(x, 'not found'))
Lets try stack, map the dict and then unstack
df.stack().to_frame()[0].map(result_map).unstack()
bar foo
0 4 0
1 1 3
2 3 2

df.replace not having any effect when trying to replace dates in pandas dataframe

I've been through the various comments on here about df.replace but I'm still not able to get it working.
Here is a snippet of my code:
# Name columns
df_yearly.columns = ['symbol', 'date', ' annuual % price change']
# Change date format to D/M/Y
df_yearly['date'] = pd.to_datetime(df_yearly['date'], format='%d/%m/%Y')
The df_yearly dataframe looks like this:
| symbol | date | annuual % price change
---|--------|------------|-------------------------
0 | APX | 12/31/2017 |
1 | APX | 12/31/2018 | -0.502554278
2 | AURA | 12/31/2018 | -0.974450706
3 | BASH | 12/31/2016 | -0.998110828
4 | BASH | 12/31/2017 | 8.989361702
5 | BASH | 12/31/2018 | -0.083599574
6 | BCC | 12/31/2017 | 121718.9303
7 | BCC | 12/31/2018 | -0.998018734
I want to replace all dates of 12/31/2018 with 06/30/2018. The next section of my code is:
# Replace 31-12-2018 with 30-06-2018 as this is final date in monthly DF
df_yearly_1 = df_yearly.date.replace('31-12-2018', '30-06-2018')
print(df_yearly_1)
But the output is still coming as:
| 0 | 2017-12-31
| 1 | 2018-12-31
| 2 | 2018-12-31
| 3 | 2016-12-31
| 4 | 2017-12-31
| 5 | 2018-12-31
Is anyone able to help me with this? I thought this might be related to me having the date format incorrect in my df.replace statement but I've tried to search and replace 12-31-2018 and it's still not doing anything.
Thanks in advance!!
try '.astype(str).replace'
df.date.astype(str).replace('2016-12-31', '2018-06-31')

Pandas: need to create dataframe for weekly search per event occurrence

If I have this events dataframe df_e below:
|------|------------|-------|
| group| event date | count |
| x123 | 2016-01-06 | 1 |
| | 2016-01-08 | 10 |
| | 2016-02-15 | 9 |
| | 2016-05-22 | 6 |
| | 2016-05-29 | 2 |
| | 2016-05-31 | 6 |
| | 2016-12-29 | 1 |
| x124 | 2016-01-01 | 1 |
...
and also know the t0 which is the beginning of time (let's say for x123 it's 2016-01-01) and tN which is the end of experiment from another dataframe df_s (2017-05-25), then how can I create the dataframe df_new which should like this
|------|------------|---------------|--------|
| group| obs. weekly| lifetime, week| status |
| x123 | 2016-01-01 | 1 | 1 |
| | 2016-01-08 | 0 | 0 |
| | 2016-01-15 | 0 | 0 |
| | 2016-01-22 | 1 | 1 |
| | 2016-01-29 | 2 | 1 |
...
| | 2017-05-18 | 1 | 1 |
| | 2017-05-25 | 1 | 1 |
...
| x124 | 2017-05-18 | 1 | 1 |
| x124 | 2017-05-25 | 1 | 1 |
Explanation: take t0 and generate rows until tN per week period. For each row R, search with that group if the event date falls within R, if True, then count how long in weeks it lives there, also set status = 1 as alive, otherwise set lifetime, status columns for this R as 0, e.g. dead.
Questions:
1) How to generate dataframes per group given t0 and tN values, e.g. generate [group, obs. weekly, lifetime, status] columns for (tN - t0) / week rows?
2) How to accomplish the construction of such df_new dataframe explained above?
I can begin with this so far =)
import pandas as pd
# 1. generate dataframes per group to get the boundary within `t0` and `tN` from df_s dataframe, where each dataframe has "group, obs, lifetime, status" columns X (tN - t0 / week) rows filled with 0 values.
df_all = pd.concat([df_group1, df_group2])
def do_that(R):
found_event_row = df_e.iloc[[R.group]]
# check if found_event_row['date'] falls into R['obs'] week
# if True, then found how long it's there
df_new = df_all.apply(do_that)
I'm not really sure if I get you but group one is not related to group two, right? if that's the case I think what you want is something like this:
import pandas as pd
df_group1 = df_group1.set_index('event date')
df_group1.index = pd.to_datetime(df_group1.index) #convert the index to datetime so you can 'resample'
df_group1['lifetime, week'] = df_group1.resample('1W').apply(lamda x: yourfuncion(x))
df_group1 = df_group1.reset_index()
df_group1['status']= df_group1.apply(lambda x: 1 if x['lifetime, week']>0 else 0)
#do the same with group2 and concat to create df_all
I'm not sure how you get 'lifetime, week' but all that's left is creating the function that generates it.