Pandas, concatenate certain columns if other columns are empty - pandas

I've got a CSV file that is supposed to look like this:
ID, years_active, issues
-------------------------------
'Truck1', 8, 'In dire need of a paintjob'
'Car 5', 3, 'To small for large groups'
However, the CSV is somewhat malformed and currently looks like this.
ID, years_active, issues
------------------------
'Truck1', 8, 'In dire need'
'','', 'of a'
'','', 'paintjob'
'Car 5', 3, 'To small for'
'', '', 'large groups'
Now, I am able to identify faulty rows by the lack of 'ID' and 'years_active' values, and would like to append the value of 'issues' of such a row to the last preceding row that had 'ID' and 'years_active' values.
I am not very experienced with pandas, but came up with the following code:
for index, row in df.iterrows():
    if row['years_active'] == None:
        df.loc[index-1]['issues'] += row['issues']
Yet - the IF condition fails to trigger.
Is the thing I am trying to do possible? And if so, does anyone have an idea what I am doing wrong?
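For what it's worth, the likely reason the if condition never fires is that pandas reads blank CSV fields as NaN (or as empty strings, depending on how the file is read), and row['years_active'] == None is False in both cases. A minimal sketch of a check that does trigger, keeping the rest of the original loop (which, like the original, still only looks one row back):

import pandas as pd

for index, row in df.iterrows():
    if pd.isna(row['years_active']) or row['years_active'] == '':
        # df.loc[index - 1, 'issues'] writes back reliably; the chained
        # df.loc[index - 1]['issues'] form may modify a temporary copy
        df.loc[index - 1, 'issues'] += ' ' + row['issues']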

Given your sample input:
df = pd.DataFrame({
    'ID': ['Truck1', '', '', 'Car 5', ''],
    'years_active': [8, '', '', 3, ''],
    'issues': ['In dire need', 'of a', 'paintjob', 'To small for', 'large groups']
})
You can use:
new_df = df.groupby(df.ID.replace('', method='ffill')).agg({'years_active': 'first', 'issues': ' '.join})
Which'll give you:
        years_active                      issues
ID
Car 5              3   To small for large groups
Truck1             8  In dire need of a paintjob
So what we're doing here is forward filling the non-blank IDs into the subsequent blank IDs and using those to group the related rows. We then aggregate, taking the first occurrence of years_active and joining the issues values in the order they appear, to create a single result.
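Note that newer pandas versions deprecate the method= argument of replace; the same idea can be written with mask and an explicit forward fill, for example:

key = df.ID.mask(df.ID == '').ffill()   # blank IDs inherit the last non-blank ID
new_df = df.groupby(key).agg({'years_active': 'first', 'issues': ' '.join})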

The following uses a for loop to find and concatenate the strings (dataframe taken from JonClements' answer):
df = pd.DataFrame({
    'ID': ['Truck1', '', '', 'Car 5', ''],
    'years_active': [8, '', '', 3, ''],
    'issues': ['In dire need', 'of a', 'paintjob', 'To small for', 'large groups']
})
ss = ""; ii = 0; ilist = [0]
for i in range(len(df.index)):
if i>0 and df.ID[i] != "":
df.issues[ii] = ss
ss = df.issues[i]
ii = i
ilist.append(ii)
else:
ss += ' '+df.issues[i]
df.issues[ii] = ss
df = df.iloc[ilist]
print(df)
Output:
ID issues years_active
0 Truck1 In dire need of a paintjob 8
3 Car 5 To small for large groups 3

It might be worth mentioning, in the context of this question, that there is an often overlooked way of processing awkward input: io.StringIO.
The essential point is that read_csv can read from a StringIO 'file'.
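As a tiny illustration of that point (with made-up data):

from io import StringIO
import pandas as pd

raw = "a,b\n1,2\n3,4"              # any cleaned-up text built in memory
df = pd.read_csv(StringIO(raw))    # read_csv accepts the file-like object directly
print(df)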
In this case, I arrange to discard the single quotes and the repeated commas that would confuse read_csv, and I append the second and subsequent lines of each record to its first line, to form complete, conventional CSV lines for read_csv.
Here is the dataframe that results.
ID years_active issues
0 Truck1 8 In dire need of a paintjob
1 Car 5 3 To small for large groups
The code is ugly but easy to follow.
import pandas as pd
from io import StringIO

for_pd = StringIO()
with open('jasper.txt') as jasper:
    # copy the header line straight through
    print(jasper.readline(), file=for_pd)
    # skip the row of dashes under the header
    jasper.readline()
    complete_record = ''
    for line in jasper:
        # drop the quotes and the space after each comma
        line = line.rstrip().replace(', ', ',').replace("'", '')
        if line.startswith(','):
            # continuation row: strip the empty fields and append the text
            complete_record += line.replace(',,', ',').replace(',', ' ')
        else:
            if complete_record:
                print(complete_record, file=for_pd)
            complete_record = line
    if complete_record:
        print(complete_record, file=for_pd)

for_pd.seek(0)
df = pd.read_csv(for_pd)
print(df)

Related

Scrapy - Unable to get the right data from the table

I am trying to pull the data from a particular table on this link -
https://www.moneycontrol.com/mutual-funds/canara-robeco-blue-chip-equity-fund-direct-plan/portfolio-holdings/MCA212
[screenshot of the equityCompleteHoldingTable table on the page]
The table ID in the HTML is - equityCompleteHoldingTable
Please refer to the screenshot above, and help in getting the stock data as a dictionary from the website table.
Thanks.
What I tried
In Scrapy Shell, I am trying the following commands -
scrapy shell 'https://www.moneycontrol.com/mutual-funds/canara-robeco-blue-chip-equity-fund-direct-plan/portfolio-holdings/MCA212'
table = response.xpath('//*[@id="equityCompleteHoldingTable"]')
rows = table.xpath('//tr')
row = rows[2]
row.xpath('td//text()')[0].extract()
This returns "No. of Stocks". Here the extracted data is coming from a different table on the webpage.
I have found that the class this table uses is also used by other tables, and one of those tables is what is actually returning "No. of Stocks".
What I expected
I expected the data to come from the equityCompleteHoldingTable table (screenshot above)
Your primary problem is that you are not using relative xpath expressions.
For example rows = table.xpath("//tr") is an absolute xpath path. Absolute paths are parsed from the root of the page, regardless of how deeply nested the selector is.
A relative path query starts parsing from the current selector element. To use a relative xpath expression you only need to add a . as the very first character, similar to filesystem relative paths. For example: rows = table.xpath(".//tr")
With that in mind you will probably have more luck with the following:
>>> table = response.xpath('//*[#id="equityCompleteHoldingTable"]')
>>> rows = table.xpath('.//tr')
>>> row = rows[2]
>>> row.xpath('.//td/text()').extract()[3:]
['Banks', '30.99', '8247.9', '9.34%', '0.14%', '9.69% ', '7.66% ', '86.56 L', '0.00 ', 'Large Cap', '75.79']
>>>
In [1]: table = response.xpath('//*[@id="equityCompleteHoldingTable"]')
In [2]: rows = table.xpath('.//tr')
In [3]: row = rows[2]
In [4]: row.xpath('.//td//text()').getall()
Out[4]:
['\n ',
'\n ',
'ICICI Bank Ltd. ',
'\n ',
'Banks',
'30.99',
'8247.9',
'9.34%',
'0.14%',
'9.69% ',
'(Aug 2022)',
'7.66% ',
'(Dec 2021)',
'86.56 L',
'0.00 ',
'Large Cap',
'75.79']
In [5]: cells = row.xpath('.//td//text()').getall()
In [6]: [i.strip() for i in cells]
Out[6]:
['',
'',
'ICICI Bank Ltd.',
'',
'Banks',
'30.99',
'8247.9',
'9.34%',
'0.14%',
'9.69%',
'(Aug 2022)',
'7.66%',
'(Dec 2021)',
'86.56 L',
'0.00',
'Large Cap',
'75.79']
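If the goal is the whole table as a list of dictionaries rather than one row of cells, a rough sketch along the same lines is below. It assumes the first tr holds th header cells; on the real page the footnote cells (the '(Aug 2022)' entries above) and blank cells will likely need extra, page-specific cleanup before headers and cells line up one-to-one.

rows = response.xpath('//*[@id="equityCompleteHoldingTable"]//tr')
headers = [h.strip() for h in rows[0].xpath('.//th//text()').getall() if h.strip()]
records = []
for row in rows[1:]:
    cells = [c.strip() for c in row.xpath('.//td//text()').getall() if c.strip()]
    if cells:
        records.append(dict(zip(headers, cells)))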

Drop rows which contain specific words (not as substrings)

I have the following data frame, df:
id text
1 'a little table'
2 'blue lights'
3 'food and drink'
4 'build an atom'
5 'fast animals'
and a list of stop words, that is:
sw = ['a', 'an', 'and']
I want to delete the lines that contain at least one of the stop words (as words themselves, not as substrings). That is, the result I would like is:
id text
2 'blue lights'
5 'fast animals'
I was trying with:
df[~df['text'].str.contains('|'.join(sw), regex=True, na=False)]
but it doesn't seem to work: this way it matches substrings, and 'a' is a substring of every text except 'blue lights'. How should I change my line of code?
Here is one way to do it:
# '|'.join(sw) : joins the stop words with |, forming an OR condition
# \\b : adds a word boundary around the capture group, so only whole words match
# build the pattern surrounded by word boundaries, then
# filter out what is found using loc
df.loc[~df['text'].str.contains('\\b(' + '|'.join(sw) + ')\\b')]
OR
df[df['text'].str.extract('\\b('+ '|'.join(sw) + ')\\b' )[0].isna()]
id text
1 2 'blue lights'
4 5 'fast animals'
li = ['a', 'an', 'and']
for i in li:
    for k in df.index:
        if i in df.text[k].split():
            df.drop(k, inplace=True)
If you want to use str.contains, you could try as follows:
import pandas as pd

data = {'id': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5},
        'text': {0: "'a little table'", 1: "'blue lights'",
                 2: "'food and drink'", 3: "'build an atom'",
                 4: "'fast animals'"}}
df = pd.DataFrame(data)

sw = ['a', 'an', 'and']

res = df[~df['text'].str.contains(fr'\b(?:{"|".join(sw)})\b',
                                  regex=True, na=False)]
print(res)
print(res)
id text
1 2 'blue lights'
4 5 'fast animals'
In the regex pattern, \b asserts position at a word boundary, while ?: at the start of the pattern between (...) creates a non-capturing group. Strictly speaking, you could do without ?:, but it suppresses a UserWarning ("This pattern ... has match groups", etc.).
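To see the effect of the word boundaries on this data, a quick standalone check with the re module:

import re

pattern = r'\b(?:a|an|and)\b'
print(bool(re.search(pattern, 'fast animals')))   # False: 'a'/'an' appear only inside words
print(bool(re.search(pattern, 'build an atom')))  # True: 'an' stands alone as a word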
Another possible solution, which works as follows:
Split each string by space, producing a list of words
Check whether each of those lists of words is disjoint with sw.
Use the result for boolean indexing.
df[df['text'].str.split(' ').map(lambda x: set(x).isdisjoint(sw))]
Output:
id text
1 2 blue lights
4 5 fast animals
You can also use a custom function with apply():
def string_present(List, string):
    return any(ele + ' ' in string for ele in List)

df['status'] = df['text'].apply(lambda row: string_present(sw, row))
df[df['status'] == False].drop(columns=['status'])
The output is,
id text
1 2 blue lights
4 5 fast animals
sw = ['a', 'an', 'and']
df.loc[~df.text.str.split(' ').map(lambda x: pd.Series(x).isin(sw).any())]

Pandas - Check if all values from a list are in the headers of dataframe, if not add column and value 'null'

I have the code below, which creates a dataframe from an API response. I have a list with the headers (headers_list) and I want to check whether each element is in the dataframe; if not, add that column to the dataframe with the value 'null'. It is also important that the columns end up in the order given by the list.
(I've hardcoded data as an example of a response missing 'biography'; since it's missing, I want to add it with the value 'null'.)
headers_list = ['followers_count', 'biography', 'media_count', 'profile_picture_url', 'username', 'website', 'id']
Below is an example of my code:
data = {'followers_count': 8192, 'follows_count': 427, 'media_count': 317, 'profile_picture_url': 'https://90860011962368_a.jpg', 'username': 'yes', 'website': 'http://GOL.COM/', 'id': '17843651'}
x = pd.DataFrame(data.items())
x.set_index(0, inplace=True)
user_fields_df = x.transpose()
I know I can do this, but then I would have to write several 'if' statements; I'm wondering if there is a better way:
if 'biography' not in user_fields_df:
    user_fields_df.insert(1, "biography", 'null')
Also, I tried this, but it adds the column at the end, and I need it added in the correct location:
for col in headers_list:
    if col not in user_fields_df.columns:
        user_fields_df[col] = 'null'
You can reindex columns (axis = 1) with the headers_list:
user_fields_df.reindex(headers_list, axis=1)
#0 followers_count biography media_count ... username website id
#1 8192 NaN 317 ... yes http://GOL.COM/ 17843651
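By default reindex fills the missing 'biography' column with NaN; if the literal string 'null' is wanted instead, reindex also takes a fill_value:

user_fields_df = user_fields_df.reindex(headers_list, axis=1, fill_value='null')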

How to implement where clause in python

I want to replicate what a WHERE clause does in SQL, using Python. Conditions in a WHERE clause can often be complex and involve multiple criteria. I am able to do it in the following way, but I think there should be a smarter way to achieve this. I have the following data and code.
My requirement: I want to select all columns, but only the rows where the first letter of the address is 'N'. This is the initial dataframe.
d = {'name': ['john', 'tom', 'bob', 'rock', 'dick'], 'Age': [23, 32, 45, 42, 28], 'YrsOfEducation': [10, 15, 8, 12, 10], 'Address': ['NY', 'NJ', 'PA', 'NY', 'CA']}
import pandas as pd
df = pd.DataFrame(data = d)
df['col1'] = df['Address'].str[0:1] #creating a new column which will have only the first letter from address column
n = df['col1'] == 'N' #creating a filtering criteria where the letter will be equal to N
newdata = df[n] # filtering the dataframe
newdata1 = newdata.drop('col1', axis = 1) # finally dropping the extra column 'col1'
So after 7 lines of code I get the output I want: the three rows whose address starts with 'N', without the helper column.
My question is: how can I do this more efficiently, or is there a smarter way to do it?
A new column is not necessary:
newdata = df[df['Address'].str[0] == 'N'] # filtering the dataframe
print (newdata)
Address Age YrsOfEducation name
0 NY 23 10 john
1 NJ 32 15 tom
3 NY 42 12 rock
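If you prefer, the same filter can also be written with the startswith string method, which reads a little closer to the intent:

newdata = df[df['Address'].str.startswith('N')]
print(newdata)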

Pandas - possible to aggregate two columns using two different aggregations?

I'm loading a csv file, which has the following columns:
date, textA, textB, numberA, numberB
I want to group by the columns date, textA and textB, but apply "sum" to numberA and "min" to numberB.
data = pd.read_table("file.csv", sep=",", thousands=',')
grouped = data.groupby(["date", "textA", "textB"], as_index=False)
...but I cannot see how to then apply two different aggregate functions, to two different columns?
I.e. sum(numberA), min(numberB)
The agg method can accept a dict, in which case the keys indicate the column to which the function is applied:
grouped.agg({'numberA':'sum', 'numberB':'min'})
For example,
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'number A': np.arange(8),
                   'number B': np.arange(8) * 2})

grouped = df.groupby('A')
print(grouped.agg({
    'number A': 'sum',
    'number B': 'min'}))
yields
     number B  number A
A
bar         2         9
foo         0        19
This also shows that Pandas can handle spaces in column names. I'm not sure what the origin of the problem was, but literal spaces should not have posed a problem. If you wish to investigate this further,
print(df.columns)
without reassigning the column names, will show us the repr of the names. Maybe there was a hard-to-see character in the column name that looked like a space (or some other character) but was actually a u'\xa0' (NO-BREAK SPACE), for example.
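A quick way to make such hidden characters visible is to look at the repr of each name, for example:

# repr() shows characters like '\xa0' that print() would render as an ordinary space
print([repr(c) for c in df.columns])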