Scrapy - Unable to get the right data from the table

I am trying to pull the data from a particular table on this link -
https://www.moneycontrol.com/mutual-funds/canara-robeco-blue-chip-equity-fund-direct-plan/portfolio-holdings/MCA212
[Screenshot: the equityCompleteHoldingTable stock holdings table]
The table ID in the HTML is - equityCompleteHoldingTable
Please refer to the screenshot above, and help in getting the stock data as a dictionary from the website table.
Thanks.
What I tried
In Scrapy Shell, I am trying the following commands -
scrapy shell 'https://www.moneycontrol.com/mutual-funds/canara-robeco-blue-chip-equity-fund-direct-plan/portfolio-holdings/MCA212'
table = response.xpath('//*[@id="equityCompleteHoldingTable"]')
rows = table.xpath('//tr')
row = rows[2]
row.xpath('td//text()')[0].extract()
This returns "No. of Stocks". Here the extracted data is coming from a different table on the same webpage.
I have found that the class this table uses is shared by other tables as well, and one of those tables is the one actually returning "No. of Stocks".
What I expected
I expected the data to come from the equityCompleteHoldingTable table (screenshot above)

Your primary problem is that you are not using relative xpath expressions.
For example, rows = table.xpath("//tr") is an absolute xpath expression. Absolute expressions are parsed from the root of the page, regardless of how deeply nested the selector is.
A relative path query starts parsing from the current selector element. To use a relative xpath expression you only need to add a . as the very first character, similar to filesystem relative paths. For example: rows = table.xpath(".//tr")
With that in mind you will probably have more luck with the following:
>>> table = response.xpath('//*[@id="equityCompleteHoldingTable"]')
>>> rows = table.xpath('.//tr')
>>> row = rows[2]
>>> row.xpath('.//td/text()').extract()[3:]
['Banks', '30.99', '8247.9', '9.34%', '0.14%', '9.69% ', '7.66% ', '86.56 L', '0.00 ', 'Large Cap', '75.79']
>>>
In [1]: table = response.xpath('//*[@id="equityCompleteHoldingTable"]')
In [2]: rows = table.xpath('.//tr')
In [3]: row = rows[2]
In [4]: row.xpath('.//td//text()').getall()
Out[4]:
['\n ',
'\n ',
'ICICI Bank Ltd. ',
'\n ',
'Banks',
'30.99',
'8247.9',
'9.34%',
'0.14%',
'9.69% ',
'(Aug 2022)',
'7.66% ',
'(Dec 2021)',
'86.56 L',
'0.00 ',
'Large Cap',
'75.79']
In [5]: cells = row.xpath('.//td//text()').getall()
In [6]: [i.strip() for i in cells]
Out[6]:
['',
'',
'ICICI Bank Ltd.',
'',
'Banks',
'30.99',
'8247.9',
'9.34%',
'0.14%',
'9.69%',
'(Aug 2022)',
'7.66%',
'(Dec 2021)',
'86.56 L',
'0.00',
'Large Cap',
'75.79']
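Since the question asks for the stock data as a dictionary, here is a rough sketch of a parse method that zips each data row with the table's header cells. It assumes the table keeps its headers in a thead and its data rows in a tbody, and that every cell yields one clean text node; cells with extra text such as '(Aug 2022)' may need special handling, so treat this as a starting point rather than a drop-in solution.
def parse(self, response):
    table = response.xpath('//*[@id="equityCompleteHoldingTable"]')
    # Column labels taken from the table's header row.
    headers = [h.strip() for h in table.xpath('.//thead//th//text()').getall() if h.strip()]
    for row in table.xpath('.//tbody/tr'):
        cells = [c.strip() for c in row.xpath('.//td//text()').getall() if c.strip()]
        if cells:
            # zip() stops at the shorter of the two lists.
            yield dict(zip(headers, cells))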

Related

Pyspark Dataframe - How to create new column with only first 2 words

I have a dataframe df with a column for Full Name (first, middle & last). The column name is full_name and the words are separated by a space (delimiter).
I'd like to create a new column having only 1st and middle name.
I have tried the following
df = df.withColumn('new_name', split(df['full_name'], ' '))
But this returns all the words in a list.
I also tried
df = df.withColumn('new_name', split(df['full_name'], ' ').getItem(1))
But this returns only the 2nd name in the list (middle name)
Please advise how to proceed with this.
Try this
import pyspark.sql.functions as F
split_col = F.split(df['FullName'], ' ')
df = df.withColumn('FirstMiddle', F.concat_ws(' ',split_col.getItem(0),split_col.getItem(1)))
df.show()
It took me some time thinking, but I came up with this:
import pyspark.sql.functions as f

df1 = df.withColumn('first_name', f.split(df['full_name'], ' ').getItem(0))\
    .withColumn('middle_name', f.split(df['full_name'], ' ').getItem(1))\
    .withColumn('New_Name', f.concat(f.col('first_name'), f.lit(' '), f.col('middle_name')))\
    .drop('first_name')\
    .drop('middle_name')
This is working code and the output is as expected, but I am not sure how efficient it is. If someone has any better ideas, please reply.
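If efficiency is the worry, one possibly more compact variant is to split once and join the first two elements directly. This is only a sketch and assumes Spark 2.4+ (for slice and array_join); it has not been tested against the original data:
import pyspark.sql.functions as F

# Split once, keep the first two words (slice is 1-based), and join them back with a space.
df = df.withColumn(
    'new_name',
    F.array_join(F.slice(F.split(F.col('full_name'), ' '), 1, 2), ' ')
)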

How to replace element in pandas DataFrame column [duplicate]

I have a column in my dataframe like this:
range
"(2,30)"
"(50,290)"
"(400,1000)"
...
and I want to replace the comma (,) with a dash (-). I'm currently using this method, but nothing changes.
org_info_exc['range'].replace(',', '-', inplace=True)
Can anybody help?
Use the vectorised str method replace:
df['range'] = df['range'].str.replace(',','-')
df
range
0 (2-30)
1 (50-290)
EDIT: so if we look at what you tried and why it didn't work:
df['range'].replace(',','-',inplace=True)
from the docs we see this description:
str or regex: str: string exactly matching to_replace will be replaced
with value
So because the str values do not match, no replacement occurs, compare with the following:
df = pd.DataFrame({'range':['(2,30)',',']})
df['range'].replace(',','-', inplace=True)
df['range']
0 (2,30)
1 -
Name: range, dtype: object
here we get an exact match on the second row and the replacement occurs.
For anyone else arriving here from Google search on how to do a string replacement on all columns (for example, if one has multiple columns like the OP's 'range' column):
Pandas has a built-in replace method available on a DataFrame object.
df.replace(',', '-', regex=True)
Source: Docs
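As a quick illustration with the OP's sample values (a sketch; the column is assumed to be named 'range'):
import pandas as pd

df = pd.DataFrame({'range': ['(2,30)', '(50,290)', '(400,1000)']})
df = df.replace(',', '-', regex=True)
# df['range'] is now: '(2-30)', '(50-290)', '(400-1000)'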
If you only need to replace characters in one specific column, and somehow regex=True and inplace=True both failed, this approach should work:
data["column_name"] = data["column_name"].apply(lambda x: x.replace("characters_need_to_replace", "new_characters"))
The lambda here acts like a small function applied to each entry of the column, much like a for loop.
x represents each entry in the current column.
The only thing you need to do is change "column_name", "characters_need_to_replace" and "new_characters".
Replace all spaces with underscores in the column names:
data.columns = data.columns.str.replace(' ', '_', regex=True)
In addition, for those looking to replace more than one character in a column, you can do it using regular expressions:
import re
chars_to_remove = ['.', '-', '(', ')', '']
regular_expression = '[' + re.escape(''.join(chars_to_remove)) + ']'
df['string_col'].str.replace(regular_expression, '', regex=True)
Almost similar to the answer by Nancy K, this works for me:
data["column_name"] = data["column_name"].apply(lambda x: x.str.replace("characters_need_to_replace", "new_characters"))
If you want to remove two or more characters from a string, for example '$' and ',':
Column_Name
===========
$100,000
$1,100,000
... then use:
data.Column_Name.str.replace("[$,]", "", regex=True)
=> [ 100000, 1100000 ]

Pandas - Check if all values from a list are in the headers of dataframe, if not add column and value 'null'

I have the below code that creates a dataframe from data coming from an API. I have a list with the headers (headers_list) and I want to check whether each element is in the dataframe; if not, add that column to the dataframe with the value 'null'. It is also important that the columns end up in the order given by the list.
(I've hardcoded data as an example of a response missing 'biography'; since it is missing, I want to add it in with the value 'null'.)
headers_list = ['followers_count', 'biography', 'media_count', 'profile_picture_url', 'username', 'website', 'id']
Below is an example of my code:
data = {'followers_count': 8192, 'follows_count': 427, 'media_count': 317, 'profile_picture_url': 'https://90860011962368_a.jpg', 'username': 'yes', 'website': 'http://GOL.COM/', 'id': '17843651'}
x = pd.DataFrame(data.items())
x.set_index(0, inplace=True)
user_fields_df = x.transpose()
I know I can do this but then I would have to make several 'if' statements, wondering if there is a better way?
if 'biography' not in user_fields_df:
    user_fields_df.insert(1, "biography", 'null')
Also, I tried this but it add the column to the end, and I need to to add to the correct location:
for col in headers_list:
    if col not in user_fields_df.columns:
        user_fields_df[col] = 'null'
You can reindex columns (axis = 1) with the headers_list:
user_fields_df.reindex(headers_list, axis=1)
#0 followers_count biography media_count ... username website id
#1 8192 NaN 317 ... yes http://GOL.COM/ 17843651
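If you want the literal string 'null' in the added columns rather than NaN, reindex also accepts a fill_value. A small sketch based on the data above:
x = pd.DataFrame(data.items())
x.set_index(0, inplace=True)
user_fields_df = x.transpose()

# Reorders the existing columns to match headers_list and creates any missing
# ones (here 'biography'), filling them with the string 'null'.
user_fields_df = user_fields_df.reindex(headers_list, axis=1, fill_value='null')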

Adding a row to a FITS table with astropy

I have a problem which ought to be trivial but seems to have been massively over-complicated by the column-based nature of FITS BinTableHDU.
The script I'm writing should be trivial: iterate through a FITS file and write a subset of rows to an identically formatted FITS file, reducing the row count from c700k/3.6GB to about 350 rows. I have processed the input file and have each row that I want to save in a python array of FITS records:
outarray = []
self._indata = Table.read(self.infile, hdu=1)
for r in self._indata:
    RecPassesFilter = FilterProc(r, self)
    #
    # Add to output array only if passes all filters...
    #
    if RecPassesFilter:
        outarray.append(r)
Now, I've created an empty BinTableHDU with exactly the same columns and formats and I want to add the filtered data:
[...much omitted code later...]
mycols = []
for inputcol in self._coldefs:
    mycols.append(fits.Column(name=inputcol.name, format=inputcol.format))
# Next line should produce an empty BinTableHDU in the identical format to the output data
SaveData = fits.BinTableHDU.from_columns(mycols)
for s in self._outdata:
    SaveData.data.append(s)
Now that last line not only fails, but every variant of it (SaveData.append() or .add_row() or whatever) also fails with a "no such method" error. There seems to be a singular lack of documentation on how to do the trivial task of adding a record. Clearly I am missing something, but two days later I'm still drawing a blank.
Can anyone point me in the right direction here?
OK, I managed to resolve this with some brute force and nested iterations essentially to create column data arrays on the fly. It's not much in terms of code and I don't care that it's inefficient as I won't need to run it too often. Example code here:
with fits.open(self._infile) as HDUSet:
    tableHDU = HDUSet[1]
    self._coldefs = tableHDU.columns

FITScols = []
for inputcol in self._coldefs:
    NewColData = []
    for r in self._outdata:
        NewColData.append(r[inputcol.name])
    FITScols.append(fits.Column(name=inputcol.name, format=inputcol.format, array=NewColData))

SaveData = fits.BinTableHDU.from_columns(FITScols)
SaveData.writeto(fname)
This solves my problem for a 350 row subset. I haven't yet dared try it for the 250K row subset that I need for the next part of my project!
I just recalled that BinTableHDU.from_columns takes an nrows argument. If you pass that along with the columns of an existing table HDU, it will copy the column structure but initialize subsequent rows with empty data:
>>> hdul = fits.open('astropy/io/fits/tests/data/table.fits')
>>> table = hdul[1]
>>> table.columns
ColDefs(
name = 'target'; format = '20A'
name = 'V_mag'; format = 'E'
)
>>> table.data
FITS_rec([('NGC1001', 11.1), ('NGC1002', 12.3), ('NGC1003', 15.2)],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '>f4')]))
>>> new_table = fits.BinTableHDU.from_columns(table.columns, nrows=5)
>>> new_table.columns
ColDefs(
name = 'target'; format = '20A'
name = 'V_mag'; format = 'E'
)
>>> new_table.data
FITS_rec([('NGC1001', 11.1), ('NGC1002', 12.3), ('NGC1003', 15.2),
('', 0. ), ('', 0. )],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4')]))
As you can see, this still copies the data from the original columns. I think the idea behind this originally was for adding new rows to an existing table. However, you can also initialize a completely empty new table by passing fill=True:
>>> new_table_zeroed = fits.BinTableHDU.from_columns(table.columns, nrows=5, fill=True)
>>> new_table_zeroed.data
FITS_rec([('', 0.), ('', 0.), ('', 0.), ('', 0.), ('', 0.)],
dtype=(numpy.record, [('target', 'S20'), ('V_mag', '<f4')]))
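Putting the two pieces together, a rough sketch of copying the filtered rows into such a pre-allocated table might look like this (names such as self._outdata and tableHDU come from the earlier snippets, 'filtered.fits' is just a placeholder, and compatible column dtypes are assumed):
n_keep = len(self._outdata)
new_hdu = fits.BinTableHDU.from_columns(tableHDU.columns, nrows=n_keep, fill=True)

# Copy the kept rows into the empty table, column by column.
for i, r in enumerate(self._outdata):
    for col in tableHDU.columns:
        new_hdu.data[col.name][i] = r[col.name]

new_hdu.writeto('filtered.fits', overwrite=True)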

Pandas, concatenate certain columns if other columns are empty

I've got a CSV file that is supposed to look like this:
ID, years_active, issues
-------------------------------
'Truck1', 8, 'In dire need of a paintjob'
'Car 5', 3, 'To small for large groups'
However, the CSV is somewhat malformed and currently looks like this.
ID, years_active, issues
------------------------
'Truck1', 8, 'In dire need'
'','', 'of a'
'','', 'paintjob'
'Car 5', 3, 'To small for'
'', '', 'large groups'
Now, I am able to identify faulty rows by the lack of an 'ID' and 'years_active' value, and I would like to append the value of 'issues' of that row to the last preceding row that had 'ID' and 'years_active' values.
I am not very experienced with pandas, but came up with the following code:
for index, row in df.iterrows():
    if row['years_active'] == None:
        df.loc[index-1]['issues'] += row['issues']
Yet the if condition never triggers.
Is the thing I am trying to do possible? And if so, does anyone have an idea what I am doing wrong?
Given your sample input:
df = pd.DataFrame({
    'ID': ['Truck1', '', '', 'Car 5', ''],
    'years_active': [8, '', '', 3, ''],
    'issues': ['In dire need', 'of a', 'paintjob', 'To small for', 'large groups']
})
You can use:
new_df = df.groupby(df.ID.replace('', method='ffill')).agg({'years_active': 'first', 'issues': ' '.join})
Which'll give you:
years_active issues
ID
Car 5 3 To small for large groups
Truck1 8 In dire need of a paintjob
So what we're doing here is forward filling the non-blank IDs into subsequent blank IDs and using those to group the related rows. We then aggregate to take the first occurrence of the years_active and join together the issues columns in the order they appear to create a single result.
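To make the intermediate step visible, the forward-filled grouping key for the sample df above would be:
df.ID.replace('', method='ffill')
# 0    Truck1
# 1    Truck1
# 2    Truck1
# 3     Car 5
# 4     Car 5
# Name: ID, dtype: object
(On newer pandas versions, where the method argument of replace is deprecated, df.ID.mask(df.ID == '').ffill() should produce the same key.)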
Following uses a for loop to find and add strings (dataframe from JonClements' answer):
df = pd.DataFrame({
    'ID': ['Truck1', '', '', 'Car 5', ''],
    'years_active': [8, '', '', 3, ''],
    'issues': ['In dire need', 'of a', 'paintjob', 'To small for', 'large groups']
})
ss = ""; ii = 0; ilist = [0]
for i in range(len(df.index)):
if i>0 and df.ID[i] != "":
df.issues[ii] = ss
ss = df.issues[i]
ii = i
ilist.append(ii)
else:
ss += ' '+df.issues[i]
df.issues[ii] = ss
df = df.iloc[ilist]
print(df)
Output:
ID issues years_active
0 Truck1 In dire need of a paintjob 8
3 Car 5 To small for large groups 3
It might be worth mentioning in the context of this question that there is an often overlooked way of processing awkward input by using the StringIO library.
The essential point is that read_csv can read from a StringIO 'file'.
In this case, I arrange to discard single quotes and multiple commas that would confuse read_csv, and I append the second and subsequent lines of input to the first line, to form complete, conventional csv lines for read_csv.
Here is the DataFrame that read_csv produces:
ID years_active issues
0 Truck1 8 In dire need of a paintjob
1 Car 5 3 To small for large groups
The code is ugly but easy to follow.
import pandas as pd
from io import StringIO

for_pd = StringIO()
with open('jasper.txt') as jasper:
    # Copy the header line, then discard the separator line.
    print(jasper.readline(), file=for_pd)
    line = jasper.readline()
    complete_record = ''
    for line in jasper:
        line = ''.join(line.rstrip().replace(', ', ',').replace("'", ''))
        if line.startswith(','):
            # Continuation line (starts with a comma): append its text to the current record.
            complete_record += line.replace(',,', ',').replace(',', ' ')
        else:
            if complete_record:
                print(complete_record, file=for_pd)
            complete_record = line
    if complete_record:
        print(complete_record, file=for_pd)

for_pd.seek(0)
df = pd.read_csv(for_pd)
print(df)