Pandas: remove rows where a column contains numeric values

How can I remove the rows where the column "Name" contains numeric values?
Below is the input DataFrame:
data = [['tom', 10], ['nick', 15], ['juli', 14], ['00012', 14], ['abc123', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
The expected result is:
Name Age
0 tom 10
1 nick 15
2 juli 14

One option is to drop any row whose Name contains a digit:
df[~df.Name.str.contains(r'\d')]

Try this:
df = df[df.Name.str.isalpha()]
str.isalpha() checks whether all characters in the string are alphabetic and returns False otherwise, so non-alphabetic names are filtered out.
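For reference, a minimal end-to-end sketch of both approaches on the sample data above. Note that str.isalpha() is stricter: it would also drop names containing spaces or punctuation, not just digits.
import pandas as pd

data = [['tom', 10], ['nick', 15], ['juli', 14], ['00012', 14], ['abc123', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])

# keep rows whose Name contains no digit at all
no_digits = df[~df.Name.str.contains(r'\d')]

# keep rows whose Name is purely alphabetic
alpha_only = df[df.Name.str.isalpha()]

print(no_digits)   # tom, nick, juli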

Related

Column names of a dataframe from a list of names

I have a list of names:
list_names = ['Albert','Marcos','Alberta']
and I have an empty dataframe:
t=pd.DataFrame()
How can I add list_names as columns in the dataframe t with values = 10, like this:
Albert Marcos Alberta
10 10 10
Use reindex to create the columns and loc to append the data:
t = t.reindex(list_names, axis='columns')
t.loc[0] = 10
Alternatively, build it in one step (a dict of all scalar values needs an explicit index):
t = pd.DataFrame({'Albert': 10, 'Marcos': 10, 'Alberta': 10}, index=[0])
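If the column names should come straight from list_names rather than being typed out, a dict-comprehension sketch works too:
t = pd.DataFrame({name: [10] for name in list_names})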

Translate my SKUs using a dictionary with Pandas

I have a table with internal SKUs in column 0 and synonyms along each row. The number of synonyms is not constant (currently ranging from 0 to 7, but it will tend to grow).
I need an efficient function that takes SKUs from one column in a large table and translates them to the canonical SKU (column SKU0) from my dictionary table.
This is my current function which takes an array of SKUs from one table, searches for them in another and gives me the first column value where it finds a synonym.
import sys

def new_array(dfarray1, array1, trans_dic):
    missing_values = set()
    new_array = []
    for value in array1:
        # boolean Series: rows of the dictionary table containing this value
        pos = trans_dic.eq(str(value)).any(axis=1)
        if len(pos[pos]) > 0:
            new_array.append(trans_dic['sku_0'][pos[pos].index[0]])
        else:
            missing_values.add(str(value))
    if len(missing_values) > 0:
        print("The following values are missing in the dictionary. They are in the DF called: " + dfarray1)
        print(missing_values)
        sys.exit()
    else:
        return new_array
I'm sure this is badly written, because it takes my laptop about 3 minutes to process only about 75K values. Can anyone help me make this faster?
Some questions asked previously:
What types are your function parameters? (can guess pandas, but no way to know for sure)
Yes. I am working on two pandas dataframes.
What does your table even look like?
Dictionary table:
SKU0    Synonym 0   Synonym 1   Synonym 2
foo     bar         bar1
foo1    baar1
foo2    baaar0                  baar2
Values table:
SKU      Value   Value1   value1
foo      3       1        7
baar1    4       5        7
baaar0   5       5        9
Desired table:
SKU     Value   Value1   value1
foo     3       1        7
foo1    4       5        7
foo2    5       5        9
What does the rest of your code that is calling this function look like?
df1.sku = new_array('df1', list(df1.sku), sku_dic)
Given the dictionary dataframe in the format
df_dict = pd.DataFrame({
    "SKU0": ["foo", "foo1", "foo2"],
    "Synonym 0": ["bar", "baar1", "baaar0"],
    "Synonym 1": ["bar1", np.nan, np.nan],
    "Synonym 2": [np.nan, np.nan, "baar2"]
})
and a values dataframe in the format
df_values = pd.DataFrame({
    "SKU": ["foo", "baar1", "baaar0"],
    "Value": [3, 4, 5],
    "Value1": [1, 5, 5],
    "value1": [7, 7, 9]
})
you can get the output you want by first using pd.melt to restructure your dictionary dataframe into long format and then joining it to your values dataframe. Some extra logic then determines which column the final value comes from and selects the final columns needed.
(
    df_dict
    # convert the dict df from wide to long format
    .melt(id_vars=["SKU0"])
    # filter out rows where there is no synonym
    .loc[lambda x: x["value"].notna()]
    # join the dictionary with the values df
    .merge(df_values, how="right", left_on="value", right_on="SKU")
    # final value: take "SKU0" if available, else keep the original "SKU"
    .assign(SKU=lambda x: np.where(x["SKU0"].isna(), x["SKU"], x["SKU0"]))
    # select the final columns needed in the output
    [["SKU", "Value", "Value1", "value1"]]
)
# output
SKU Value Value1 value1
0 foo 3 1 7
1 foo1 4 5 7
2 foo2 5 5 9
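If only the translated SKU column is needed, a vectorized lookup tends to be much faster than scanning the dictionary table once per value. A sketch, assuming the df_dict and df_values frames above: build a synonym-to-SKU0 mapping Series once, then map the whole column in one pass.
import pandas as pd

# build a flat synonym -> SKU0 lookup from the dictionary table (once)
synonym_map = (
    df_dict
    .melt(id_vars=["SKU0"])        # wide -> long
    .dropna(subset=["value"])      # drop empty synonym slots
    .set_index("value")["SKU0"]    # index: synonym, values: canonical SKU
)

# translate the whole column in one vectorized pass;
# SKUs with no synonym entry (e.g. "foo") are kept as-is
df_values["SKU"] = df_values["SKU"].map(synonym_map).fillna(df_values["SKU"])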

Is there a function to find the index of a float value in a column using pandas?

I have a dataframe df with a numeric index, a datetime column, and ozone concentrations, among several other columns. Here are the important columns for this question:
index, date, ozone
0, 4-29-2018, 55.4375
1, 4-29-2018, 52.6375
2, 5-2-2018, 50.4128
3, 5-2-2018, 50.3
4, 5-3-2018, 50.3
5, 5-4-2018, 51.8845
I need to look up the index of a row based on a column value. However, multiple rows have a column value of 50.3. First, how do I find the index based on a specific column value? I've tried:
np.isclose(df['ozone'], 50.3).argmax()
(from "Getting the index of a float in a column using pandas"), but this only gives me the first index at which the number appears. Is there a way to look up the index based on two parameters (e.g. the index where date = 5-2-2018 and ozone = 50.3)?
I've also tried df.loc but it doesn't work for floating points.
here's some sample code:
df = pd.read_csv('blah.csv')
df.set_index('date', inplace = True)
df.index = pd.to_datetime(df.index)
date = pd.to_datetime(df.index)
dv = df.groupby([date.month,date.day]).mean()
dv.drop(columns=['dir_5408'], inplace=True)
df['ozone'] = df.oz_3186.rolling('8H', min_periods=2).mean().shift(-4)
ozone = df.groupby([date.month,date.day])['ozone'].max()
df['T_mda8_3186'] = df.Temp_C_3186.rolling('8H', min_periods=2).mean().shift(-4)
T_mda8_3186 = df.groupby([date.month,date.day])['T_mda8_3186'].max()
df['T_mda8_5408'] = df.Temp_C_5408.rolling('8H', min_periods=2).mean().shift(-4)
T_mda8_5408 = df.groupby([date.month,date.day])['T_mda8_5408'].max()
df['ws_mda8_5408'] = df.ws_5408.rolling('8H', min_periods=2).mean().shift(-4)
ws_mda8_5408 = df.groupby([date.month,date.day])['ws_mda8_5408'].max()
dv_MDA8 = df.drop(columns=['Temp_C_3186', 'Temp_C_5408', 'dir_5408', 'ws_5408', 'u_5408', 'v_5408', 'rain(mm)_5724',
                           'rain(mm)_5408', 'rh_3186', 'rh_5408', 'pm10_5408', 'pm10_3186', 'pm25_5408', 'oz_3186'])
dv_MDA8.reset_index(inplace=True)
I need the date as a datetime index for the beginning of my code.
Thanks in advance for your help.
This is what you might be looking for:
import pandas as pd
import datetime

data = pd.DataFrame({
    'index': [0, 1, 2, 3, 4, 5],
    'date': ['4-29-2018', '4-29-2018', '5-2-2018', '5-2-2018', '5-3-2018', '5-4-2018'],
    'ozone': [55.4375, 52.6375, 50.4128, 50.3, 50.3, 51.8845]
})
data.set_index(['index'], inplace=True)
data['date'] = data['date'].apply(lambda x: datetime.datetime.strptime(x, '%m-%d-%Y'))
data['ozone'] = data['ozone'].astype('float')
data.loc[(data['date'] == datetime.datetime.strptime('5-3-2018', '%m-%d-%Y'))
         & (data['ozone'] == 50.3)]
The index identifies each row, so you can find the relevant indexes and store/reuse them later, provided, of course, that the index of df has not changed.
Code:
import pandas as pd

students = [('jack', 34, 'Sydeny', 'Engineering'),
            ('Sachin', 30, 'Delhi', 'Medical'),
            ('Aadi', 16, 'New York', 'Computer Science'),
            ('Riti', 30, 'Delhi', 'Data Science'),
            ('Riti', 30, 'Delhi', 'Data Science'),
            ('Riti', 30, 'Mumbai', 'Information Security'),
            ('Aadi', 40, 'London', 'Arts'),
            ('Sachin', 30, 'Delhi', 'Medical')]

df = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'Subject'])
print(df)

# similarly, your ind_date_ozone may have one or more values
ind_name_sub = df.where((df.Name == 'Riti') & (df.Subject == 'Data Science')).dropna().index
print(ind_name_sub)
print(df.loc[ind_name_sub])
Output:
Name Age City Subject
0 jack 34 Sydeny Engineering
1 Sachin 30 Delhi Medical
2 Aadi 16 New York Computer Science
3 Riti 30 Delhi Data Science
4 Riti 30 Delhi Data Science
5 Riti 30 Mumbai Information Security
6 Aadi 40 London Arts
7 Sachin 30 Delhi Medical
Int64Index([3, 4], dtype='int64')
Name Age City Subject
3 Riti 30 Delhi Data Science
4 Riti 30 Delhi Data Science
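The same lookup can also be written as a plain boolean mask, which skips the where()/dropna() round-trip; a minimal sketch on the df above:
ind_name_sub = df.index[(df.Name == 'Riti') & (df.Subject == 'Data Science')]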

How to implement a WHERE clause in Python

I want to replicate what a WHERE clause does in SQL, using Python. Conditions in a WHERE clause can often be complex and combine multiple criteria. I am able to do it in the following way, but I think there should be a smarter way to achieve this. I have the following data and code.
My requirement: I want to select all columns, but only the rows where the first letter of the address is 'N'. This is the initial data frame.
d = {'name': ['john', 'tom', 'bob', 'rock', 'dick'],
     'Age': [23, 32, 45, 42, 28],
     'YrsOfEducation': [10, 15, 8, 12, 10],
     'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
import pandas as pd
df = pd.DataFrame(data = d)
df['col1'] = df['Address'].str[0:1] #creating a new column which will have only the first letter from address column
n = df['col1'] == 'N' #creating a filtering criteria where the letter will be equal to N
newdata = df[n] # filtering the dataframe
newdata1 = newdata.drop('col1', axis = 1) # finally dropping the extra column 'col1'
So after 7 lines of code I am getting the desired output.
My question: how can I do this more efficiently, or is there a smarter way to do it?
A new column is not necessary:
newdata = df[df['Address'].str[0] == 'N'] # filtering the dataframe
print (newdata)
Address Age YrsOfEducation name
0 NY 23 10 john
1 NJ 32 15 tom
3 NY 42 12 rock
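If you want something that reads even closer to SQL's LIKE 'N%', a str.startswith sketch is equivalent here:
newdata = df[df['Address'].str.startswith('N')]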

Return the row number of the string value in a numeric column?

I am trying to convert a column to numeric. If most of the values in the column are numeric but some contain strings, the function should return the row numbers of the values that are strings.
My dataset:
received
11
12
0
-340
2
9
1
aa
nn
qbb
Expected output: rows 8, 9, and 10 contain string values.
Use to_numeric with errors='coerce' to turn non-numeric values into NaN, then filter with isnull:
i = df.index[pd.to_numeric(df['received'], errors='coerce').isnull()]
print (i)
Int64Index([7, 8, 9], dtype='int64')
Python counts from 0, so if you need 1-based row numbers, add 1:
i = df.index[pd.to_numeric(df['received'], errors='coerce').isnull()] + 1
print (i)
Int64Index([8, 9, 10], dtype='int64')
For a dictionary mapping row numbers to the offending values, use:
d = df.loc[pd.to_numeric(df['received'], errors='coerce').isnull(), 'received'].to_dict()
print (d)
{8: 'nn', 9: 'qbb', 7: 'aa'}
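Since the original goal was converting the column to numeric, a follow-up sketch that coerces the column and drops the offending rows:
df['received'] = pd.to_numeric(df['received'], errors='coerce')
df_numeric = df.dropna(subset=['received'])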