Keep the elements with particular letters in a dataframe

I have a list:
list=['Nikolas', 'Niki', 'Niko', 'George', 'Kate']
and I want to keep only the names that have the letters "Nik" ('Nikolas', 'Niki', 'Niko')
I tried this code, but I get an error "TypeError: argument of type 'int' is not iterable".
list=['Nikolas', 'Niki', 'Niko', 'George', 'Kate']
df_1 = pd.DataFrame(list, columns =['name'])
df_1_transposed = df_1.T
df_2 = [colname for colname in df_1_transposed.columns if 'Nik' in colname]
Do you know how to fix it?
thanks in advance!

Code
import pandas as pd
names = ['Nikolas', 'Niki', 'Niko', 'George', 'Kate']  # renamed to avoid shadowing the built-in list
df_1 = pd.DataFrame(names, columns=['name'])
df_1 = df_1[df_1['name'].str.find('Nik') == 0]  # find() == 0 keeps only names that start with 'Nik'
returns
name
0 Nikolas
1 Niki
2 Niko
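For reference, the TypeError in the original attempt comes from the transpose: df_1.T turns the integer row labels 0-4 into column labels, so 'Nik' in colname tests membership against an int. There is no need to transpose at all; a minimal sketch filtering the rows directly with str.contains:
import pandas as pd

names = ['Nikolas', 'Niki', 'Niko', 'George', 'Kate']
df_1 = pd.DataFrame(names, columns=['name'])

# Boolean mask: True wherever the name contains the substring 'Nik'
df_2 = df_1[df_1['name'].str.contains('Nik')]
print(df_2)
#       name
# 0  Nikolas
# 1     Niki
# 2     Niko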

Related

Dataframe columns cleaning

I am trying to clean a number of columns in a dataset by iterating over the columns.
import pandas as pd
df = pd.DataFrame({
    'A': [r'7.3\N\P', r'nan\T\Z', r'11.0\R\Z'],
    'B': [r'nan\J\N', r'nan\A\G', r'10.8\F\U'],
    'C': [r'12.4\A\I', r'13.3\H\Z', r'8.200000000000001\B\W']})
for name, values in df.iloc[:, 0:3].iteritems():
    def myreplace(s):
        for char in ['\A','\B','\C','\D','\E','\F','\G','\H','\I',
                     '\J','\K','\L','\M','\\N','\O','\P','\Q','\R',
                     '\S','\T','\V','\W','\X','\Y','\Z','\\U']:
            s = s.map(lambda x: x.replace(char, ''))
        return s
    df = df.apply(myreplace)
I get the error: 'float' object has no attribute 'replace'.
I could run this part on one column and it works, but I need to run it on several columns, and then it fails with "'DataFrame' object has no attribute 'str'":
df_data.str.replace('[\\\|A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z]', '')
I am really new to pandas dataframes. I would appreciate the help.
Given, assuming the goal is to extract numbers from the strings:
A B C
0 7.3\N\P nan\J\N 12.4\A\I
1 nan\T\Z nan\A\G 13.3\H\Z
2 11.0\R\Z 10.8\F\U 8.200000000000001\B\W
Doing:
cols = ['A', 'B', 'C']
for col in cols:
    df[col] = df[col].str.extract(r'(\d*\.\d*)', expand=False).astype(float)
Output:
A B C
0 7.3 NaN 12.4
1 NaN NaN 13.3
2 11.0 10.8 8.2
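If the columns are not known up front, the same extraction can also be applied to every column at once with apply; this is just an alternative sketch, assuming every column holds strings. expand=False makes str.extract return a Series, so each result assigns back cleanly:
# Run the same extraction over every column of the frame
df = df.apply(lambda col: col.str.extract(r'(\d*\.\d*)', expand=False).astype(float))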

Why pandas does not want to subset given columns in a list

I'm trying to remove certain values with the code below, but pandas does not let me; instead it outputs
ValueError: Unable to coerce to Series, length must be 10: given 2
Here is my code:
import pandas as pd
df = pd.read_csv("/Volumes/SSD/IT/DataSets/Automobile_data.csv")
print(df.shape)
columns_df = ['index', 'company', 'body-style', 'wheel-base', 'length', 'engine-type',
'num-of-cylinders', 'horsepower', 'average-mileage', 'price']
prohibited_symbols = ['?','Nan''n.a']
df = df[df[columns_df] != prohibited_symbols]
print(df)
Try:
import re
pattern = '|'.join(map(re.escape, prohibited_symbols))  # escape '?' so it is not read as a regex quantifier
df = df[~df[columns_df].astype(str).apply(lambda c: c.str.contains(pattern)).any(axis=1)]
The regex '|' alternation matches any of your prohibited symbols; since .str only exists on Series, contains is applied column by column, and rows that match in any column are dropped.
Because what you are trying is not doing what you imagine it should.
df = df[df[columns_df] != prohibited_symbols]
Comparing a DataFrame to a list makes pandas try to align the list against the columns as if it were a row, and because the lengths differ (10 columns versus 2 items) it raises the ValueError you saw. Even if the lengths matched, != would only do an elementwise inequality check against positionally aligned values, not a membership test, and it would not delete anything from your cells.
You'll have to use a for loop and clean every column like this.
import re
for column in columns_df:
    df[column] = df[column].str.replace('|'.join(map(re.escape, prohibited_symbols)), '', regex=True)
You can also specify the values to treat as null with the na_values argument when reading the data, and then use dropna from pandas.
Example:
import pandas as pd
df = pd.read_csv("/Volumes/SSD/IT/DataSets/Automobile_data.csv", na_values=['?', 'Nan', 'n.a'])
df = df.dropna()
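A minimal, self-contained sketch of the na_values approach; the CSV content here is invented for illustration, since the original file is not available:
import io
import pandas as pd

# Made-up CSV standing in for Automobile_data.csv
csv_data = io.StringIO(
    "company,price\n"
    "toyota,9500\n"
    "bmw,?\n"
    "audi,n.a\n"
)

df = pd.read_csv(csv_data, na_values=['?', 'n.a'])
print(df.dropna())
#   company   price
# 0  toyota  9500.0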

How to find and replace second comma (,) with - in a dataframe using Pandas?

I have a dataframe like:
Name    Address
Anuj    Anuj,Sinha,BB
Sinha   Sinha,Anuj BB
In the Address column, I want to replace every comma except the first one in each row with -.
Can anyone please suggest a possible solution?
provided:
df.dtypes
Customer ID Int64
First_name-Last_name string
Address string
Phone string
Secondary_station string
Customer_disconnected string
If there is a maximum of 2 commas, you can use this simple regex (regex=True is needed explicitly, since pandas 2.0 defaults str.replace to literal matching):
df['Address'] = df['Address'].str.replace(r'(,.*),', r'\1-', regex=True)
output:
Name Address
0 Anuj Anuj,Sinha-BB
1 Sinha Sinha,Anuj BB
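This works because .* is greedy: the group captures everything from the first comma up to the last comma, so only that final comma is rewritten by \1-. With three or more commas the middle ones would survive untouched, which is why this is limited to at most 2 commas.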
If there are possibly more than 2 commas, you can do:
df['Address'] = df['Address'].str.split(',').apply(lambda x: x[0]+','+'-'.join(x[1:]))
or, more efficiently:
splits = df['Address'].str.split(',', n=1)
df['Address'] = splits.str[0] + ',' + splits.str[1].str.replace(',', '-')
You can use the replace function this way:
txt = "I like bananas"
x = txt.replace("bananas", "apples")
print(x)
It will display:
I like apples
For your dataframe, you just need to iterate through the values this way:
import pandas as pd

df = pd.DataFrame(
    {
        'name': ['Anuj', 'Sinha'],
        'adresse': ['Anuj,Sinha,BB', 'Sinha,Anuj BB']
    }
)

for column in df.columns:
    for index, value in enumerate(df[column]):
        df.loc[index, column] = value.replace(',', '-')  # .loc avoids chained-assignment pitfalls

print(df)
It will display:
name adresse
0 Anuj Anuj-Sinha-BB
1 Sinha Sinha-Anuj BB
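Note that this replaces every comma, including the first one, which differs from the requested output. A variant of the same loop that keeps the first comma, splitting each value once and joining only the remainder with - (a sketch, assuming it runs on the freshly built df before the all-comma loop above):
for index, value in enumerate(df['adresse']):
    first, _, rest = value.partition(',')  # split off everything before the first comma
    df.loc[index, 'adresse'] = first + ',' + rest.replace(',', '-')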

find a value from df1 in df2 and replace other values of the matching rows

I have the following code with 2 dataframes (df1 & df2)
import pandas as pd

data = {'Name': ['Name1', 'Name2', 'Name3', 'Name4', 'Name5'],
        'Number': ['456', 'A977', '132a', '6783r', '868354']}
replace = {'NewName': ['NewName1', 'NewName3', 'NewName4', 'NewName5', 'NewName2'],
           'ID': ['I753', '25552', '6783r', '868354', 'A977']}
df1 = pd.DataFrame(data, columns=['Name', 'Number'])
df2 = pd.DataFrame(replace, columns=['NewName', 'ID'])
Now I would like to compare every item in the 'Number' column of df1 with the 'ID' column of df2. If there is a match, I would like to replace the 'Name' of df1 with the 'NewName' of df2, otherwise it should keep the 'Name' of df1.
First I tried the following code, but unfortunately it mixed up the names and numbers across rows.
df1.loc[df1['Number'].isin(df2['ID']), ['Name']] = df2.loc[df2['ID'].isin(df1['Number']),['NewName']].values
The next code that I tried worked a bit better, but it replaced the 'Name' of df1 with the 'Number' of df1 when there was no match.
df1['Name'] = df1['Number'].replace(df2.set_index('ID')['NewName'])
How can I stop this behavior in my last code or are there better ways in general to achieve what I would like to do?
You can use map instead of replace to substitute each value in the Number column of df1 with the corresponding value from the NewName column of df2, then fill the NaN values (the values that could not be substituted) in the mapped column with the original values from the Name column of df1:
df1['Name'] = df1['Number'].map(df2.set_index('ID')['NewName']).fillna(df1['Name'])
>>> df1
Name Number
0 Name1 456
1 NewName2 A977
2 Name3 132a
3 NewName4 6783r
4 NewName5 868354
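To see why the fillna is needed, it helps to inspect the map step on its own; Numbers with no match in df2['ID'] come back as NaN, which fillna then restores from the original Name column:
>>> df1['Number'].map(df2.set_index('ID')['NewName'])
0         NaN
1    NewName2
2         NaN
3    NewName4
4    NewName5
Name: Number, dtype: object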

How to concatenate values from multiple rows using Pandas?

In the screenshot, the 'Ctrl' column contains a key value. I have two duplicate rows for OTC-07 which I need to consolidate. I would like to concat the rest of the column values for OTC-07, i.e., OTC-07 should have Type A,B and Assertion a,b,c,d after consolidation. Can anyone help me on this? :o
First, define a dataframe with the given structure:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'Ctrl': ['OTC-05', 'OTC-06', 'OTC-07', 'OTC-07', 'OTC-08'],
    'Type': ['A', 'A', 'A', 'B', np.nan],
    'Assertion': ['a,b,c', 'c,b', 'a,c', 'b,c,d', 'a,b,c']
})
df
Output:
     Ctrl Type Assertion
0  OTC-05    A     a,b,c
1  OTC-06    A       c,b
2  OTC-07    A       a,c
3  OTC-07    B     b,c,d
4  OTC-08  NaN     a,b,c
Then replace NaN values with empty strings:
df = df.replace(np.nan, '')
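This step matters for the aggregation below: ','.join raises a TypeError when it hits a float NaN (such as the Type value for OTC-08), so the NaN has to become a string first.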
Then group by the 'Ctrl' column and aggregate the 'Type' and 'Assertion' columns. Please note that the assertion aggregation is a bit tricky: you need not a simple concatenation but a sorted list of unique letters:
df.groupby(['Ctrl']).agg({
    'Type': ','.join,
    'Assertion': lambda x: ','.join(sorted(set(','.join(x).split(','))))
})
Output:
       Type Assertion
Ctrl
OTC-05    A     a,b,c
OTC-06    A       b,c
OTC-07  A,B   a,b,c,d
OTC-08          a,b,c
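If Ctrl should remain a regular column rather than becoming the index, append .reset_index() to the chain, or pass as_index=False to groupby.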