Pandas - Check if all values from a list are in the headers of a dataframe; if not, add the column with value 'null'

I have the code below, which creates a dataframe from an API response. I have a list of headers (headers_list) and I want to check whether each element is a column in the dataframe; if not, add the column to the dataframe with the value 'null'. It's also important that the columns end up in the same order as the list.
(I've hardcoded the data as an example of a response that is missing 'biography'. Since it's missing, I want to add it with the value 'null'.)
headers_list = ['followers_count', 'biography', 'media_count', 'profile_picture_url', 'username', 'website', 'id']
Below is an example of my code:
import pandas as pd
user_fields_data = {'followers_count': 8192, 'follows_count': 427, 'media_count': 317, 'profile_picture_url': 'https://90860011962368_a.jpg', 'username': 'yes', 'website': 'http://GOL.COM/', 'id': '17843651'}
x = pd.DataFrame(user_fields_data.items())
x.set_index(0, inplace=True)
user_fields_df = x.transpose()
I know I can do this, but then I would have to write several 'if' statements. Is there a better way?
if 'biography' not in user_fields_df:
    user_fields_df.insert(1, "biography", 'null')
Also, I tried this, but it adds the column at the end, and I need it added at the correct location:
for col in headers_list:
    if col not in user_fields_df.columns:
        user_fields_df[col] = 'null'

You can reindex columns (axis = 1) with the headers_list:
user_fields_df.reindex(headers_list, axis=1)
#0 followers_count biography media_count ... username website id
#1 8192 NaN 317 ... yes http://GOL.COM/ 17843651
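Note that reindex fills the missing column with NaN rather than the string 'null', and it drops columns that are not in headers_list (here follows_count). If the literal string 'null' is wanted, one option is the fill_value argument. A minimal sketch, assuming the data from the question and building the one-row frame directly from the dict instead of via transpose:

```python
import pandas as pd

headers_list = ['followers_count', 'biography', 'media_count',
                'profile_picture_url', 'username', 'website', 'id']
user_fields_data = {'followers_count': 8192, 'follows_count': 427,
                    'media_count': 317,
                    'profile_picture_url': 'https://90860011962368_a.jpg',
                    'username': 'yes', 'website': 'http://GOL.COM/',
                    'id': '17843651'}

# One-row frame: each dict key becomes a column
user_fields_df = pd.DataFrame([user_fields_data])

# Reorder to headers_list; columns missing from the data are created
# and filled with the string 'null' instead of NaN
result = user_fields_df.reindex(columns=headers_list, fill_value='null')
print(result)
```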

Related

problem with pandas drop_duplicates removing empty values

I'm using drop_duplicates to remove duplicates from my dataframe based on a column. The problem is that this column is empty for some entries, and those rows end up being removed too. Is there a way to make the function ignore empty values?
Here is an example:
   Title    summary
0  TITLE A  summaryA
1  TITLE A  summaryB
2           summaryC
3           summaryD
Using this:
data.drop_duplicates(subset="Title",
                     keep='first', inplace=True)
I get a result like this:
   Title    summary
0  TITLE A  summaryA
2           summaryC
But since the last two rows are not duplicates, I want to keep them. Is there a way for drop_duplicates to ignore empty values?
One option is to fill the missing values with the index number. Maybe not the prettiest way, but it works:
df = pd.DataFrame(
    {'Title': ['TITLE A', 'TITLE A', None, None],
     'summary': ['summaryA', 'summaryB', 'summaryC', 'summaryD']}
)
# Helper column holding the index as a string
df['_id'] = df.index
df['_id'] = df['_id'].apply(str)
# Missing titles get a unique placeholder, so they never count as duplicates
df['Title2'] = df['Title'].fillna(df['_id'])
df.drop_duplicates(subset="Title2", keep='first')
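An alternative that avoids the helper columns is to build a boolean mask with duplicated() and always keep rows whose Title is missing. This is only a sketch on the same toy data, not the answer's method:

```python
import pandas as pd

df = pd.DataFrame({
    'Title': ['TITLE A', 'TITLE A', None, None],
    'summary': ['summaryA', 'summaryB', 'summaryC', 'summaryD'],
})

# True for rows whose Title repeats an earlier row
# (missing values are treated as equal to each other here)
is_repeat = df.duplicated(subset='Title', keep='first')

# Keep a row if its Title is missing, or if it is the first occurrence
result = df[df['Title'].isna() | ~is_repeat]
print(result)
```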

Why is this pandas df.loc() call selecting all the records that satisfy only one condition and not both?

So I have this dataframe:
# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks']
# second list of strings (some of them look numeric)
lst2 = ["gdadsf", '23', 'gggg', '22', 'df', '66', '77']
# Calling DataFrame constructor after zipping
# both lists, with columns specified
df = pd.DataFrame(list(zip(lst, lst2)),
                  columns=['Name', 'val'])
df.loc[(df['Name']=='Geeks')&('gggg' in df['val'].to_string())]
and the result is below; it selects all the rows that contain Geeks instead of just row 2:
0 Geeks gdadsf
2 Geeks gggg
6 Geeks 77
Update: This is a continuation of a question that stemmed from How do I test if a string is in a cell of a pandas data frame, cell that contains a list of strings?
Update 2: If I bring this closer to the other question referred to above, I get nothing in response to my query. Notice the lists of strings that are now stored in the cells:
# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks']
# list of lists of strings
lst2 = [["gdadsf",'jjjj'], ['23'], ['gggg','hhh'], ['22'], ['df'], ['66'], ['77','zzz'] ]
# Calling DataFrame constructor after zipping
# both lists, with columns specified
df = pd.DataFrame(list(zip(lst, lst2)),
                  columns=['Name', 'val'])
df.loc[(df['Name']=='Geeks')&(df['val'].str.contains('gggg'))]
to_string() concatenates everything into a single long string, so the in test yields one Boolean for the whole column rather than one per row. Essentially your code is
df.loc[(df['Name']=='Geeks')& True]
which gives you all the rows where Name equals 'Geeks'. You don't want that; you want:
df.loc[(df['Name']=='Geeks')&( df['val'].str.contains('gggg'))]
In your case the condition is always True, because the membership test in runs once against the whole string and evaluates to True every time.
You should add an element-wise condition instead, checking for equality with the value, like below.
print(df.loc[(df['Name']=='Geeks')&(df['val']=='gggg')])
If you have multiple values to check in the and condition, use isin like below.
print(df.loc[(df['Name']=='Geeks')&(df.val.isin(['gggg','77']))])
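For the Update 2 data, where each cell holds a list rather than a string, the string-based checks above do not apply directly (presumably because the .str methods expect string elements). One possible approach is a row-wise membership test with apply. A minimal sketch on the question's data:

```python
import pandas as pd

lst = ['Geeks', 'For', 'Geeks', 'is', 'portal', 'for', 'Geeks']
lst2 = [['gdadsf', 'jjjj'], ['23'], ['gggg', 'hhh'], ['22'],
        ['df'], ['66'], ['77', 'zzz']]
df = pd.DataFrame(list(zip(lst, lst2)), columns=['Name', 'val'])

# Row-wise test: is 'gggg' one of the items in this cell's list?
has_gggg = df['val'].apply(lambda items: 'gggg' in items)

# Both conditions are now boolean Series of the same length
result = df.loc[(df['Name'] == 'Geeks') & has_gggg]
print(result)
```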

How to split a column into multiple columns and then count the null values in the new column in SQL or Pandas?

I have a relatively large table with thousands of rows and a few tens of columns. Some columns are metadata and others are numerical values. The problem is that some metadata columns are incomplete or partial; that is, the string after the ":" is missing. I want to count how many entries are missing the part after the colon.
If you look at the miniature example below, what I should get is a small table telling me that in group A, MetaData is complete for 2 entries and incomplete (missing after ":") for the other 2 entries. Ideally I also want to get some statistics on SomeValue (count, max, min, etc.).
How do I do this in an SQL query or in Python pandas?
It might turn out to be simple using some built-in function; however, I am not getting it right.
Data:
Group MetaData SomeValue
A AB:xxx 20
A AB: 5
A PQ:yyy 30
A PQ: 2
Expected Output result:
Group MetaDataComplete Count
A Yes 2
A No 2
No reason to use split functions (unless the value can contain a colon character.) I'm just going to assume that the "null" values (not technically the right word) end with :.
select
    "Group",
    case when MetaData like '%:' then 'No' else 'Yes' end as MetaDataComplete,
    count(*) as "Count"
from T
group by "Group", case when MetaData like '%:' then 'No' else 'Yes' end
You could also use right(MetaData, 1) = ':'.
Or supposing that values can contain their own colons, try charindex(':', MetaData) = len(MetaData) if you just want to ask whether the first colon is in the last position.
Here is an example:
# 1- Create the dataframe
import pandas as pd
import numpy as np
cols = ['Group', 'MetaData', 'SomeValue']
data = [['A', 'AB:xxx', 20],
        ['A', 'AB:', 5],
        ['A', 'PQ:yyy', 30],
        ['A', 'PQ:', 2]]
df = pd.DataFrame(columns=cols, data=data)
# 2- New data frame with split value columns
new = df["MetaData"].str.split(":", n=1, expand=True)
df["MetaData_1"] = new[0]
df["MetaData_2"] = new[1]
# 3- Drop the old MetaData column
df.drop(columns=["MetaData"], inplace=True)
# 4- Replace empty strings with NaN and count them
df.replace('', np.nan, inplace=True)
df.isnull().sum()
Output:
Group 0
SomeValue 0
MetaData_1 0
MetaData_2 2
dtype: int64
From a SQL perspective, performing a split is painful, not to mention that using the split results means performing one query first and then querying its results:
SELECT
    Results.[Group],
    Results.MetaData,
    Results.MetaValue,
    COUNT(Results.MetaValue)
FROM (SELECT
          [Group],
          MetaData,
          SUBSTRING(MetaData, CHARINDEX(':', MetaData) + 1, LEN(MetaData)) AS MetaValue
      FROM VeryLargeTable) AS Results
GROUP BY Results.[Group],
         Results.MetaData,
         Results.MetaValue
If you're just after a count, you could also try the algorithmic approach. Just loop over the data and use a regular expression with a negative lookahead.
import pandas as pd
import re
pattern = '.*:(?!.)'  # matches strings where nothing follows the colon
missing = 0
not_missing = 0
# assumes data is a DataFrame with a 'MetaData' column
for i in data['MetaData'].tolist():
    match = re.findall(pattern, i)
    if match:
        missing += 1
    else:
        not_missing += 1
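For the pandas side of the question, including the statistics on SomeValue, a grouped aggregation may be simpler than a loop. The following is a sketch on the example data, assuming (as the SQL answer does) that an incomplete entry is one that ends with a colon:

```python
import pandas as pd

df = pd.DataFrame({
    "Group": ["A", "A", "A", "A"],
    "MetaData": ["AB:xxx", "AB:", "PQ:yyy", "PQ:"],
    "SomeValue": [20, 5, 30, 2],
})

# Entries ending with ':' have nothing after the colon, so they are incomplete
complete = df["MetaData"].str.endswith(":").map({True: "No", False: "Yes"})

# Count the rows and summarise SomeValue per (Group, completeness) pair
summary = (
    df.assign(MetaDataComplete=complete)
      .groupby(["Group", "MetaDataComplete"])["SomeValue"]
      .agg(["count", "min", "max"])
      .reset_index()
)
print(summary)
```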

Replacing Specific Values in a Pandas Column [duplicate]

I'm trying to replace the values in one column of a dataframe. The column ('female') only contains the values 'female' and 'male'.
I have tried the following:
w['female']['female']='1'
w['female']['male']='0'
But receive the exact same copy of the previous results.
I would ideally like to get output equivalent to applying the following logic element-wise:
if w['female'] == 'female':
    w['female'] = '1'
else:
    w['female'] = '0'
I've looked through the gotchas documentation (http://pandas.pydata.org/pandas-docs/stable/gotchas.html) but cannot figure out why nothing happens.
Any help will be appreciated.
If I understand right, you want something like this:
w['female'] = w['female'].map({'female': 1, 'male': 0})
(Here I convert the values to numbers instead of strings containing numbers. You can convert them to "1" and "0", if you really want, but I'm not sure why you'd want that.)
The reason your code doesn't work is because using ['female'] on a column (the second 'female' in your w['female']['female']) doesn't mean "select rows where the value is 'female'". It means to select rows where the index is 'female', of which there may not be any in your DataFrame.
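To make the difference concrete, here is a small sketch on hypothetical toy data: indexing a Series with a string looks up an index label, while map translates the values.

```python
import pandas as pd

# A Series whose index happens to contain the label 'female'
s = pd.Series(['female', 'male'], index=['a', 'female'])

# s['female'] selects by index label, not by value
by_label = s['female']

# map translates each value through the dictionary
w = pd.DataFrame({'female': ['female', 'male', 'female']})
w['female'] = w['female'].map({'female': 1, 'male': 0})
print(by_label)
print(w)
```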
You can edit a subset of a dataframe by using loc:
df.loc[<row selection>, <column selection>]
In this case:
w.loc[w.female != 'female', 'female'] = 0
w.loc[w.female == 'female', 'female'] = 1
w.female.replace(to_replace=dict(female=1, male=0), inplace=True)
See pandas.DataFrame.replace() docs.
Slight variation:
w.female.replace(['male', 'female'], [0, 1], inplace=True)
This should also work:
w.female[w.female == 'female'] = 1
w.female[w.female == 'male'] = 0
This is very compact:
w['female'][w['female'] == 'female']=1
w['female'][w['female'] == 'male']=0
Another good one:
w['female'] = w['female'].replace(regex='female', value=1)
w['female'] = w['female'].replace(regex='male', value=0)
You can also use apply with .get, i.e.
w['female'] = w['female'].apply({'male':0, 'female':1}.get)
For example:
w = pd.DataFrame({'female':['female','male','female']})
print(w)
Dataframe w:
female
0 female
1 male
2 female
Using apply to replace values from the dictionary:
w['female'] = w['female'].apply({'male':0, 'female':1}.get)
print(w)
Result:
female
0 1
1 0
2 1
Note: apply with a dictionary's .get should be used only if all possible values of the column are defined in the dictionary; otherwise, values not defined in the dictionary come back as missing (None/NaN).
Using Series.map with Series.fillna
If your column contains more strings than only female and male, Series.map will fail in this case since it will return NaN for other values.
That's why we have to chain it with fillna:
Example why .map fails:
df = pd.DataFrame({'female':['male', 'female', 'female', 'male', 'other', 'other']})
female
0 male
1 female
2 female
3 male
4 other
5 other
df['female'].map({'female': '1', 'male': '0'})
0 0
1 1
2 1
3 0
4 NaN
5 NaN
Name: female, dtype: object
For the correct method, we chain map with fillna, so we fill the NaN with values from the original column:
df['female'].map({'female': '1', 'male': '0'}).fillna(df['female'])
0 0
1 1
2 1
3 0
4 other
5 other
Name: female, dtype: object
Alternatively there is the built-in function pd.get_dummies for these kinds of assignments:
w['female'] = pd.get_dummies(w['female'],drop_first = True)
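get_dummies gives you a data frame with two columns, one for each value that occurs in w['female'], of which you drop the first (because you can infer it from the one that is left). The remaining column is named after the value it indicates. Note that the values are ordered alphabetically, so here 'female' is the one dropped and the column that is left indicates 'male'.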
This is especially useful if you have categorical variables with more than two possible values. This function creates as many dummy variables needed to distinguish between all cases. Be careful then that you don't assign the entire data frame to a single column, but instead, if w['female'] could be 'male', 'female' or 'neutral', do something like this:
w = pd.concat([w, pd.get_dummies(w['female'], drop_first = True)], axis = 1)
w.drop('female', axis = 1, inplace = True)
Then you are left with two new columns giving you the dummy coding of 'female' and you got rid of the column with the strings.
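As a sketch of the three-category case on hypothetical toy data (dtype=int is passed only so the dummies print as 0/1):

```python
import pandas as pd

w = pd.DataFrame({'female': ['female', 'male', 'neutral', 'female']})

# One indicator column per category; the first (alphabetically) is dropped
dummies = pd.get_dummies(w['female'], drop_first=True, dtype=int)

# Attach the indicator columns and remove the original string column
w = pd.concat([w, dummies], axis=1).drop(columns='female')
print(w)
```

A row where both remaining columns are 0 corresponds to the dropped category, 'female'.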
w.replace({'female':{'female':1, 'male':0}}, inplace = True)
The above code will replace 'female' with 1 and 'male' with 0, only in the column 'female'
There is also a function in pandas called factorize which you can use to automatically do this type of work. It converts labels to numbers: ['male', 'female', 'male'] -> [0, 1, 0]. See this answer for more information.
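A minimal sketch of factorize on the example labels. The codes follow order of first appearance, so 'male' gets 0 here only because it comes first:

```python
import pandas as pd

labels = pd.Series(['male', 'female', 'male'])

# codes: integer label per element; uniques: the distinct values,
# in order of first appearance
codes, uniques = pd.factorize(labels)
print(codes)
print(list(uniques))
```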
w.female = np.where(w.female=='female', 1, 0)
This is a numpy solution, if someone is looking for one. It is useful for replacing values based on a condition. Both the if and else branches are inherent in np.where(). The solutions that use df.replace() may not be feasible if the column includes many unique values in addition to 'male', all of which should be replaced with 0.
Another solution is to use df.where() and df.mask() in succession. This is because neither of them implements an else condition.
w.female.where(w.female=='female', 0, inplace=True) # replace where condition is False
w.female.mask(w.female=='female', 1, inplace=True) # replace where condition is True
dic = {'female':1, 'male':0}
w['female'] = w['female'].replace(dic)
.replace accepts a dictionary as its argument, in which you can define whatever replacements you want or need.
I think the answer should point out which type of object you get in each of the methods suggested above: a Series or a DataFrame.
When you select a column with w.female or w['female'], you get back a Series. When you select with a list, such as w[['female']], you get back a DataFrame.
Both Series and DataFrame have a .replace method, so it can be used in either case.
Series.map with a dictionary, as used above, is a Series method, so apply it to a single column selected as a Series (including one selected with .loc or .iloc).
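A quick way to check which type a given selection returns, as a minimal sketch on a hypothetical toy frame:

```python
import pandas as pd

w = pd.DataFrame({'female': ['female', 'male', 'female']})

as_series = w['female']       # single label: a Series
as_frame = w[['female']]      # list of labels: a DataFrame
via_loc = w.loc[:, 'female']  # single column label with .loc: a Series

print(type(as_series).__name__, type(as_frame).__name__, type(via_loc).__name__)

# Both types expose a replace method
print(hasattr(as_series, 'replace'), hasattr(as_frame, 'replace'))
```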
To answer the question more generically, so that it applies to more use cases than just what the OP asked, consider this solution. I used jfs's solution to help me. Here, we create two functions that feed each other and can be used whether you know the exact replacements or not.
import numpy as np
import pandas as pd
class Utility:
    @staticmethod
    def rename_values_in_column(column: pd.Series, name_changes: dict = None) -> pd.Series:
        """
        Renames the distinct values in a column. If no dictionary is provided for the exact name changes,
        the column is returned unchanged. To generate names of the form <column_name>_<count>
        (ex. female_1, female_2, etc.), pass the result of create_unique_values_for_column.
        :param column: The column in your dataframe you would like to alter.
        :param name_changes: A dictionary of the old values to the new values you would like to change.
        Ex. {1234: "User A"} This would change all occurrences of 1234 to the string "User A" and leave the other values as they were.
        By default, this is an empty dictionary.
        :return: The same column with the replaced values
        """
        name_changes = name_changes if name_changes else {}
        new_column = column.replace(to_replace=name_changes)
        return new_column
    @staticmethod
    def create_unique_values_for_column(column: pd.Series, except_values: list = None) -> dict:
        """
        Creates a dictionary where the key is the existing column item and the value is the new item to replace it.
        The returned dictionary can then be passed to the pandas replace function to rename all the distinct values in a
        column.
        Ex. column ["statement"]["I", "am", "old"] would return
        {"I": "statement_1", "am": "statement_2", "old": "statement_3"}
        If you would like a value to remain the same, enter the values you would like to stay in the except_values.
        Ex. except_values = ["I", "am"]
        column ["statement"]["I", "am", "old"] would return
        {"old": "statement_3"}
        :param column: A pandas Series for the column with the values to replace.
        :param except_values: A list of values you do not want to have changed.
        :return: A dictionary that maps the old values to their respective new values.
        """
        except_values = except_values if except_values else []
        column_name = column.name
        # np.unique returns the distinct values in sorted order
        distinct_values = np.unique(column)
        name_mappings = {}
        count = 1
        for value in distinct_values:
            if value not in except_values:
                name_mappings[value] = f"{column_name}_{count}"
            # count advances for every distinct value, including excepted ones
            count += 1
        return name_mappings
For the OP's use case, it is simple enough to just use
w["female"] = Utility.rename_values_in_column(w["female"], name_changes = {"female": 1, "male": 0})
However, it is not always so easy to know all of the different unique values within a data frame that you may want to rename. In my case, the string values for a column are hashed values so they hurt the readability. What I do instead is replace those hashed values with more readable strings thanks to the create_unique_values_for_column function.
df["user"] = Utility.rename_values_in_column(
df["user"],
Utility.create_unique_values_for_column(df["user"])
)
This will change my user column values from ["1a2b3c", "a12b3c", "1a2b3c"] to ["user_1", "user_2", "user_1"]. Much easier to compare, right?
If you have only two classes you can use equality operator. For example:
df = pd.DataFrame({'col1':['a', 'a', 'a', 'b']})
df['col1'].eq('a').astype(int)
# (df['col1'] == 'a').astype(int)
Output:
0 1
1 1
2 1
3 0
Name: col1, dtype: int64

Data slicing a pandas frame - I'm having problems with unique

I am facing issues trying to select a subset of columns and running unique on it.
Source Data:
df_raw = pd.read_csv('data/master.csv', nrows=10000)
df_raw.shape
Produces:
(10000, 86)
Process Data:
df = df_raw[['A','B','C']]
df.shape
Produces:
(10000, 3)
Furthermore, doing:
df_raw.head()
df.head()
produces a correct list of rows and columns.
However,
print('RAW:',sorted(df_raw['A'].unique()))
works perfectly
Whilst:
print('PROCESSED:',sorted(df['A'].unique()))
produces:
AttributeError: 'DataFrame' object has no attribute 'unique'
What am I doing wrong? If the shape and head output are exactly what I want, I'm confused why my processed dataset is throwing errors. I did read Pandas 'DataFrame' object has no attribute 'unique' on SO which correctly states that unique needs to be applied to columns which is what I am doing.
This was a case of a duplicate column. Given this is proprietary data, I abstracted it as 'A', 'B', 'C' in this question and therefore masked the problem. (The real data set had 86 columns and I had duplicated one of those columns twice in my subset, and was trying to do a unique on that)
My problem was this:
df_raw = pd.read_csv('data/master.csv', nrows=10000)
df = df_raw[['A','B','C', 'A']] # <-- I did not realize I had duplicated A later.
This was causing problems when doing a unique on 'A'
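The effect can be reproduced on a small hypothetical frame: selecting a label that appears twice returns a DataFrame (which has no unique method) rather than a Series. Dropping the repeated column labels is one possible fix; this is a sketch, not necessarily what the asker ended up doing:

```python
import pandas as pd

df_raw = pd.DataFrame({'A': [1, 2, 2], 'B': [3, 4, 5], 'C': [6, 7, 8]})
df = df_raw[['A', 'B', 'C', 'A']]  # 'A' selected twice by mistake

# With a duplicated label, df['A'] is a two-column DataFrame
selected = df['A']
print(type(selected).__name__, selected.shape)

# Keep only the first occurrence of each column label
deduped = df.loc[:, ~df.columns.duplicated()]
values = sorted(deduped['A'].unique())
print(values)
```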
To extract a subset of data from the entire dataframe based on a column ID, this works:
df = df.drop_duplicates(subset=['Id']) # where 'Id' is the column used to filter
print (df)