Scrape table from JSP website using Python - selenium

I would like to scrape the table that appears when you go to this website: https://www.eprocure.gov.bd/resources/common/SearcheCMS.jsp
I used the following code based on the example shown here.
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup
import pandas as pd
import time

options = Options()
options.add_argument('--headless')
driver = webdriver.Firefox(executable_path="C:/Users/DefaultUser/AppData/geckodriver.exe", options=options)
driver.get("https://www.eprocure.gov.bd/resources/common/SearcheCMS.jsp")
time.sleep(5)
res = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()

soup = BeautifulSoup(res, 'html.parser')
table_rows = soup.find_all('table')[1].find_all('tr')
rows = []
for tr in table_rows:
    td = tr.find_all('td')
    rows.append([i.text for i in td])
delaydata = rows[3:]
df = pd.DataFrame(delaydata, columns=['S. No.', 'Ministry, Division, Organization PE', 'Procurement Nature, Type & Method', 'Tender/Proposal ID, Ref No., Title & Publishing Date', 'Contract Awarded To', 'Company Unique ID', 'Experience Certificate No', 'Contract Amount', 'Contract Start & End Date', 'Work Status'])
df

Finding the URL
Well, actually, there's no need to use Selenium. The data is available by sending a POST request to:
https://www.eprocure.gov.bd/AdvSearcheCMSServlet
How did I find this URL?
If you inspect your browser's network calls (press F12 and open the Network tab), you'll see a POST request to that endpoint when the page loads. Take note of the "Payload" tab: its contents are used as the data dictionary in the example below.
Great, but how do I get the data, including pagination?
To get the data across pages, see the example below, where we fetch the HTML table and increase pageNo to paginate (this is for the "eTenders" table/tab):
import requests
import pandas as pd
from bs4 import BeautifulSoup
data = {
    "action": "geteCMSList",
    "keyword": "",
    "officeId": "0",
    "contractAwardTo": "",
    "contractStartDtFrom": "",
    "contractStartDtTo": "",
    "contractEndDtFrom": "",
    "contractEndDtTo": "",
    "departmentId": "",
    "tenderId": "",
    "procurementMethod": "",
    "procurementNature": "",
    "contAwrdSearchOpt": "Contains",
    "exCertSearchOpt": "Contains",
    "exCertificateNo": "",
    "tendererId": "",
    "procType": "",
    "statusTab": "eTenders",
    "pageNo": "1",
    "size": "10",
}
_columns = [
    "S. No",
    "Ministry, Division, Organization, PE",
    "Procurement Nature, Type & Method",
    "Tender/Proposal ID, Ref No., Title..",
    "Contract Awarded To",
    "Company Unique ID",
    "Experience Certificate No",
    "Contract Amount",
    "Contract Start & End Date",
    "Work Status",
]
for page in range(1, 11):  # <--- Increase number of pages here
    print(f"Page: {page}")
    data["pageNo"] = page
    response = requests.post(
        "https://www.eprocure.gov.bd/AdvSearcheCMSServlet", data=data
    )
    # The response HTML is missing a `table` tag, so we need to add it
    soup = BeautifulSoup("<table>" + response.text + "</table>", "html.parser")
    df = pd.read_html(str(soup))[0]
    df.columns = _columns
    print(df.to_string())
Going further
How do I select the different tabs/tables on the page?
To select the different tabs on the page, you can change the "statusTab" in the data. Inspect the payload tab again, and you'll see what I mean.
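For instance, you can keep one base payload and derive a per-tab copy before each POST. Note that payload_for below is a hypothetical helper, and "eTenders" is the only statusTab value confirmed above; take the other tab names from the Payload tab in your browser:

```python
# Base payload from the example above, abbreviated to the fields that change.
base = {"action": "geteCMSList", "statusTab": "eTenders", "pageNo": "1", "size": "10"}

def payload_for(tab_name, page_no, base_payload=base):
    """Return a copy of the payload targeting one tab and page (no mutation)."""
    payload = dict(base_payload)
    payload["statusTab"] = tab_name
    payload["pageNo"] = str(page_no)  # the servlet expects string values
    return payload

print(payload_for("eTenders", 3))
```

Each copy can then be passed as data= to requests.post without the loops for different tabs interfering with each other.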
Output
The above code outputs:
S. No Ministry, Division, Organization, PE Procurement Nature, Type & Method Tender/Proposal ID, Ref No., Title.. Contract Awarded To Company Unique ID Experience Certificate No Contract Amount Contract Start & End Date Work Status
0 1 Ministry of Education, Education Engineering Department, Office of the Executive Engineer, EED,Kishoreganj Zone. Works, NCT, LTM 300580, 932/EE/EED/KZ/Rev-5974/2018-19/23, Dt: 28/03/2019 Repair and Renovation Works at Chowganga Shahid Smrity High School Itna Kishoreganj. 01-Apr-2019 M/S KAZI RASEL NIRMAN SONGSTA 1051854 WD-5974- 25/e-GP/20221228/300580/0060000 475000.000 10-Jun-2019 03-Sep-2019 Completed
1 2 Ministry Of Water Resourses, Bangladesh Water Development Board (BWDB), Chattogram Mechanical Division Works, NCT, LTM 558656, CMD/T-19/100 Dated: 14-03-2021 Manufacturing supplying & installation of 01 No MS Flap gate size - 1.65 m 1.95m and 01 no. Padestal type lifting device for sluice no S-15 6-vent 02 nos MS Vertical gate size - 1.65 m 1.95m for sluice no S-15 6-vent and sluice no S-14 new 1-vent at Coxs Bazar Sadar Upazilla of CEP Polder No 66/1 under Coxsbazar O&M Division implemented by Chattogram Mechanical Division BWDB Madunaghat Chattogram during the financial year 2020-21. 15-Mar-2021 M/S. AN Corporation 1063426 CMD/COX/LTM-16/2020-21/e-GP/20221228/558656/0059991 503470.662 12-Apr-2021 05-May-2021 Completed
2 3 Ministry Of Water Resourses, Bangladesh Water Development Board (BWDB), Chattogram Mechanical Division Works, NCT, LTM 633496, CMD/T-19/263 Dated: 30-11-2021 Manufacturing, supplying & installation of 07 No M.S Flap gate for sluice no.- 6 (1-vent), sluice no.- 7 (2-vent), sluice no.-8 (2-vent), sluice no.-35 (2-vent) size :- (1.00 m ×1.00m), 01 No Padestal type lifting device for sluice no- 13(1-vent) for CEP Polder No 64/2B, at pekua Upazilla under Chattogram Mechanical Division, BWDB, Madunaghat, Chattogram, during the financial year 2021-22. 30-Nov-2021 M/S. AN Corporation 1063426 CMD/LTM-08/2021-22/e-GP/20221228/633496/0059989 648808.272 26-Dec-2021 31-Jan-2022 Completed
...
...

Related

How to get a product's multiple locations in odoo?

I want to get a product's location and display it on a custom report table. The "Warehouse" cell should show all of the product's locations, so if a product has multiple locations they should all be displayed there. To put that there I tried this code:
class StockInventoryValuationReport(models.TransientModel):
    _name = 'report.stock.inventory.valuation.report'
    _description = 'Stock Inventory Valuation Report'

    location_id = fields.Many2one('stock.location')  # filter domain wizard

    @api.multi
    def _compute_results(self):
        self.ensure_one()
        stockquant_obj = self.env['stock.quant'].search([("location_id", "=", self.location_id.id)])
        print(stockquant_obj.location_id)
        line = {
            'name': product.name,
            'reference': product.default_code,
            'barcode': product.barcode,
            'qty_at_date': product.qty_at_date,
            'uom_id': product.uom_id,
            'currency_id': product.currency_id,
            'cost_currency_id': product.cost_currency_id,
            'standard_price': standard_price,
            'stock_value': product.qty_at_date * standard_price,
            'cost_method': product.cost_method,
            'taxes_id': product.taxes_id,
            'location_id': stockquant_obj.location_id,
        }
        if product.qty_at_date != 0:
            self.results += ReportLine.new(line)
but when I print stockquant_obj.location_id it is an empty recordset; basically it's not finding any locations. Can someone please give me a hint?
I actually managed to get the product's locations using this code:
class StockInventoryValuationReport(models.TransientModel):
    _name = 'report.stock.inventory.valuation.report'
    _description = 'Stock Inventory Valuation Report'

    location_id = fields.Many2one('stock.location')  # filter domain wizard

    @api.multi
    def _compute_results(self):
        self.ensure_one()
        stockquant_obj = self.env['stock.quant'].search([("location_id", "=", self.location_id.id)])
        for xyz in stockquant_obj:
            line = {
                'name': product.name,
                'reference': product.default_code,
                'barcode': product.barcode,
                'qty_at_date': product.qty_at_date,
                'uom_id': product.uom_id,
                'currency_id': product.currency_id,
                'cost_currency_id': product.cost_currency_id,
                'standard_price': standard_price,
                'stock_value': product.qty_at_date * standard_price,
                'cost_method': product.cost_method,
                'taxes_id': product.taxes_id,
                'location_id': xyz.location_id,
            }
            if product.qty_at_date != 0:
                self.results += ReportLine.new(line)
I debugged further and discovered that stock.quant() does return a recordset, but Odoo expected a singleton at stockquant_obj.location_id in my old code. Since the usual fix for a singleton error is a for loop, I added one.
The problem is that now not only is the warehouse added, but the same product is repeated as many times as there are records in the recordset. How can I avoid this? How do I loop through stockquant_obj only for the location value, while keeping xyz inside the line variable?
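One way around this is to build the line once and aggregate the locations into a single value instead of creating a line per xyz. The snippet below is plain Python over stand-in dicts (not real Odoo recordsets), just to show the aggregation; in Odoo itself something like ", ".join(stockquant_obj.mapped('location_id.name')) should produce the same joined string, since mapped over a relational field deduplicates.

```python
# Stand-ins for the quants returned by search(); real code would use stockquant_obj.
quants = [
    {"location": "WH/Stock"},
    {"location": "WH/Stock/Shelf 1"},
    {"location": "WH/Stock"},   # same location appearing twice
]

# Collect unique location names, preserving first-seen order.
unique_locations = []
for quant in quants:
    if quant["location"] not in unique_locations:
        unique_locations.append(quant["location"])

location_cell = ", ".join(unique_locations)  # one value for the single report line
print(location_cell)
```

The report then gets one line per product with all locations in the "Warehouse" cell, rather than one line per quant.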

How to extract text of specific tags with multiple occurrences

HTML:
<span class="font-weight-bold color-primary small text-right text-nowrap">29,95 €</span>
url = https://www.cardmarket.com/en/Magic/Cards/Bloodstained-Mire?sellerCountry=13&sellerReputation=2&language=1&minCondition=4#articleFilterSellerLocation
I wish to extract the text of 29,95 €.
Currently using BeautifulSoup. However, the page has a table with many other texts like this which I also wish to extract. How do I find all of these tags and extract only the text at the end to a list?
The current code I have tried is:
for price in new_page:
    new_page.find("div", class_="table-body")
    price = new_page.find_all("span", attrs="font-weight-bold color-primary small text-right text-nowrap")
    output_price = [x["font-weight-bold color-primary small text-right text-nowrap"] for x in price]
import requests
from bs4 import BeautifulSoup

def main(url):
    params = {
        "sellerCountry": "13",
        "sellerReputation": "2",
        "language": "1",
        "minCondition": "4"
    }
    r = requests.get(url, params=params)
    soup = BeautifulSoup(r.text, 'lxml')
    print(soup.select_one('dl.labeled dd:nth-child(6)').text)

main('https://www.cardmarket.com/en/Magic/Cards/Bloodstained-Mire')
Output:
29,95 €
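If you want every price on the page rather than a single cell, find_all with the full class string works too. The fragment below is made-up HTML mimicking the page's markup, so the extraction logic can be seen in isolation:

```python
from bs4 import BeautifulSoup

# Inline stand-in for the live page's price table.
html = """
<div class="table-body">
  <span class="font-weight-bold color-primary small text-right text-nowrap">29,95 €</span>
  <span class="font-weight-bold color-primary small text-right text-nowrap">31,00 €</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# class_ given the full space-separated string matches the exact class attribute.
spans = soup.find_all("span", class_="font-weight-bold color-primary small text-right text-nowrap")
prices = [span.get_text(strip=True) for span in spans]
print(prices)
```

The same find_all call against the live soup should return all matching price spans, and get_text pulls just the text out of each.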

Google Public Patent Data SQL (BigQuery)

I am trying to retrieve specific cpc codes AND assignees via SQL in the Google public patent data. I am trying to search for the term "VOLKSWAGEN" and cpc.code "H01M8".
But I got the error:
No matching signature for operator = for argument types: ARRAY
<STRUCT<name STRING, country_code STRING>>, STRING. Supported
signature: ANY = ANY at [15:3]
code:
SELECT
publication_number application_number,
family_id,
publication_date,
filing_date,
priority_date,
priority_claim,
ipc,
cpc.code,
inventor,
assignee_harmonized,
FROM
`patents-public-data.patents.publications`
WHERE
assignee_harmonized = "VOLKSWAGEN" AND cpc.code = "H01M8"
LIMIT
1000
I'm also interested in searching multiple assignees such as:
in ("VOLKSWAGEN", "PORSCHE", "AUDI", "SCANIA", "SKODA", "MAZDA", "TOYOTA", "HONDA", "BOSCH", "KYOCERA", "PANASONIC", "TOTO", "NISSAN", "LG FUEL CELL SYSTEMS", "SONY", "HYUNDAI", "SUZUKI", "PLUG POWER", "SFC ENERGY", "BALLARD", "KIA MOTORS", "SIEMENS", "KAWASAKI", "BAYERISCHE MOTORENWERKE", "HYDROGENICS", "POWERCELL SWEDEN", "ELRINGKLINGER", "PROTON MOTOR")
I have recently started to work with SQL and do not see the mistake :/
Many thanks for your help!
Many thanks! I have now created this code to screen multiple companies.
Is it possible to get the cpc__u.code values into one cell per row, separated by ", " in the output string? I would like the same for assignee_harmonized__u.name.
Do you think the companies will be screened correctly with this procedure and the IN operator?
SELECT
publication_number application_number,
family_id,
publication_date,
filing_date,
priority_date,
priority_claim,
cpc__u.code,
inventor,
assignee_harmonized,
assignee
FROM
`patents-public-data.patents.publications`,
UNNEST(assignee_harmonized) AS assignee_harmonized__u,
UNNEST(cpc) AS cpc__u
WHERE
assignee_harmonized__u.name in ("VOLKSWAGEN", "PORSCHE", "AUDI", "SCANIA", "SKODA", "MAZDA", "TOYOTA", "HONDA", "BOSCH", "KYOCERA", "PANASONIC", "TOTO", "NISSAN", "LG FUEL CELL SYSTEMS", "SONY", "HYUNDAI", "SUZUKI", "PLUG POWER", "SFC ENERGY", "BALLARD", "KIA MOTORS", "SIEMENS", "KAWASAKI", "BAYERISCHE MOTORENWERKE", "HYDROGENICS", "POWERCELL SWEDEN", "ELRINGKLINGER", "PROTON MOTOR")
AND cpc__u.code LIKE "H01M8%"
LIMIT
100000
In Google BigQuery UNNEST is needed to access ARRAY elements. This is described here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays
The following query works for me.
SELECT
publication_number application_number,
family_id,
publication_date,
filing_date,
priority_date,
priority_claim,
ipc,
cpc__u.code,
inventor,
assignee_harmonized,
FROM
`patents-public-data.patents.publications`,
UNNEST(assignee_harmonized) AS assignee_harmonized__u,
UNNEST(cpc) AS cpc__u
WHERE
assignee_harmonized__u.name = "VOLKSWAGEN AG"
AND cpc__u.code LIKE "H01M8%"
LIMIT
1000
The following are changes I made to generate results:
UNNEST(assignee_harmonized) as assignee_harmonized__u to access assignee_harmonized__u.name.
UNNEST(cpc) as cpc__u to access cpc__u.code.
assignee_harmonized__u.name = "VOLKSWAGEN AG" as "VOLKSWAGEN" returns no results.
cpc__u.code LIKE "H01M8%" as "H01M8" returns no results. An example value is H01M8/10.
This returns the following:
Query complete (2.3 sec elapsed, 29.2 GB processed)
If you want to screen multiple assignee names, IN will work like the following, however, you need to have an exact match like VOLKSWAGEN AG or AUDI AG.
assignee_harmonized__u.name IN ("VOLKSWAGEN", "PORSCHE", "AUDI", "SCANIA", "SKODA", "MAZDA", "TOYOTA", "HONDA", "BOSCH", "KYOCERA", "PANASONIC", "TOTO", "NISSAN", "LG FUEL CELL SYSTEMS", "SONY", "HYUNDAI", "SUZUKI", "PLUG POWER", "SFC ENERGY", "BALLARD", "KIA MOTORS", "SIEMENS", "KAWASAKI", "BAYERISCHE MOTORENWERKE", "HYDROGENICS", "POWERCELL SWEDEN", "ELRINGKLINGER", "PROTON MOTOR")
If you want to do a LIKE style match with multiple strings, you can try REGEXP_CONTAINS:
https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_contains
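The regular expression you would pass to REGEXP_CONTAINS can be prototyped locally before running a paid query, since Python's re.search accepts the same simple alternation syntax. The list below is an illustrative subset of the assignees:

```python
import re

# Alternation: the string matches if any listed name occurs in it.
pattern = r"VOLKSWAGEN|PORSCHE|AUDI"

assignees = ["VOLKSWAGEN AG", "AUDI AG", "DAIMLER AG"]
matching = [name for name in assignees if re.search(pattern, name)]
print(matching)
```

Once the pattern behaves as expected, the same pattern string can be used in REGEXP_CONTAINS(assignee_harmonized__u.name, r"...") in the WHERE clause.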

How to get section heading of tables in wikipedia through API

How do I get section headings for individual tables: Xia dynasty (夏朝) (2070–1600 BC), Shang dynasty (商朝) (1600–1046 BC), Zhou dynasty (周朝) (1046–256 BC) etc. for the Chinese Monarchs list on Wikipedia via API? I use the code below to connect:
from pprint import pprint
import requests, wikitextparser
r = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={
        'action': 'query',
        'titles': 'List_of_Chinese_monarchs',
        'prop': 'revisions',
        'rvprop': 'content',
        'format': 'json',
    }
)
r.raise_for_status()
pages = r.json()['query']['pages']
body = next(iter(pages.values()))['revisions'][0]['*']
doc = wikitextparser.parse(body)
print(f'{len(doc.tables)} tables retrieved')
han = doc.tables[5].data()
doc.tables[6].data()
doc.tables[i].data() only returns the table values, without its <h2> section headings. I would like the API to return a list of title strings that correspond to each of the 83 tables returned.
Original website:
https://en.wikipedia.org/wiki/List_of_Chinese_monarchs
I'm not sure why you are using doc.tables when it is the sections you are interested in. This works for me:
for i in range(1, 94):
    print(doc.sections[i].title.replace('[[', '').replace(']]', ''))
I get 94 sections rather than 83, though, and while you can use len(doc.sections), that count includes sections such as See also. There must be a more elegant way of removing the wikilinks.

How partial payment hit accounting statements in odoo 10 Using Point Of sale?

I purchased the pos_partial_payment module but found it incomplete for my requirements, so I need to modify it and add new functionality. When I perform a partial payment (of the remaining credit amount), nothing changes in accounting.
For example, the partially paid amount does not show up in accounting, and the amount is not moved from the receivable account to the Cash journal.
@api.one
def pay_partial_payment(self, amount):
    # Comment this method out if you don't want to generate the payment accounting entry
    print "partnerrrrrrrrrrrrrrrrrrrrrrr id", self
    print "****************codeeeeeeeeeeeeeeeeeeeeeeeee", amount
    # find the receivable account
    print "id of account rec account", self.property_account_receivable_id.id
    lines = []
    partner_line = {
        'account_id': self.property_account_receivable_id.id,
        'name': '/',
        'date': date.today(),
        'partner_id': self.id,
        'debit': float(amount),
        'credit': 0.0,
    }
    lines.append(partner_line)
    # find a receivable-type account
    recivable_ids = self.env['account.account'].search([('user_type_id.name', '=', 'Receivable')])
    print "recivable_ids", recivable_ids
    rec_id = False
    if recivable_ids:
        rec_id = recivable_ids[0]
    pos_line = {
        'account_id': rec_id.id,
        'name': 'pos partial payment',
        'date': date.today(),
        'partner_id': self.id,
        'credit': float(amount),
        'debit': 0.0,
    }
    lines.append(pos_line)
    line_list = [(0, 0, x) for x in lines]
    move_id = self.env['account.move'].create({
        'partner_id': self.id,
        'date': date.today(),
        # 'journal_id': 7,
        'journal_id': 6,
        'line_ids': line_list,
    })
    return True
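Whatever journal is used, an accounting entry must balance (total debit equals total credit across line_ids) before it can be posted. A plain-Python sanity check of that invariant, with stand-in dicts shaped like partner_line and pos_line above:

```python
amount = 150.0

# Shaped like partner_line and pos_line in the method above.
lines = [
    {"account": "receivable", "debit": float(amount), "credit": 0.0},
    {"account": "receivable", "credit": float(amount), "debit": 0.0},
]

total_debit = sum(line["debit"] for line in lines)
total_credit = sum(line["credit"] for line in lines)
print(total_debit == total_credit)  # must be True for the move to post
```

If the partial payment still does not appear in accounting, it is worth checking that the move is actually posted (not left in draft) and that the credit side targets the intended cash journal account rather than a second receivable account.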