python read excel and search data - python-3.8

It is currently possible to read the Excel sheet, but the feature I want is to let users enter a row item and a column item and then print the corresponding cell directly. Is there sample syntax for reference?
import pandas as pd
import numpy as np
import xlrd
sheets=pd.ExcelFile(r'D:\data.xlsx')  # raw string so the backslash is not treated as an escape
df1=pd.read_excel(sheets,'a')
df1.columns=[1,2,3,4,5,6,7]
list_t=range(10,50,5) # columns table list
list_d=np.arange(1,3.5,0.5) # row table list
#user input
d=float(input("d=")) #input row item (matches list_d)
t=float(input("t=")) #input column item (matches list_t)
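One way to do the asked-for lookup is to label the DataFrame's index and columns with the row/column lists and then read the cell with .loc. The snippet below is only a minimal sketch: it assumes the sheet named 'a' has exactly one row per list_d value and one column per list_t value, so adjust the labels to your actual layout.
import numpy as np
import pandas as pd
df1 = pd.read_excel(r'D:\data.xlsx', sheet_name='a')
list_t = list(range(10, 50, 5))        # column labels
list_d = list(np.arange(1, 3.5, 0.5))  # row labels
df1.columns = list_t                   # assumes one sheet column per list_t value
df1.index = list_d                     # assumes one sheet row per list_d value
d = float(input("d="))  # row label
t = int(input("t="))    # column label
# .loc looks the cell up by its row and column labels
print(df1.loc[d, t])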

Related

How to export Pandas styled dataframe as an image to Databricks dbfs?

Context: I am writing a bot on Databricks using Python that will send to a Slack channel the image of a pandas dataframe table. That table was formatted using .style to make it faster for people to see the most important numbers.
I have two problems:
how can I save as an image a pandas dataframe that went through the .style method?
how can I open that image in another Databricks notebook?
Step 1 - OK: generating a sample dataframe.
import pandas as pd
my_df = pd.DataFrame({'fruits':['apple','banana'], 'count': [1,2]})
Step 2 - OK: then, I save a new variable in the following way to add to the table several formatting modifications that I need:
my_df_styled = (my_df.style
    .set_properties(**{'text-align': 'center', 'padding': '15px'})
    .hide_index()
    .set_caption('My Table')
    .set_table_styles([{'selector': 'caption',
                        'props': [('text-align', 'bottom'),
                                  ('padding', '10px')]}])
)
Step 3 - Problem: trying to save the new variable as an image. But here, I am not being able to correctly do it. I tried to follow what was mentioned here, but they are using matplotlib to save it and it is something that I don't want to do, because I don't want to lose the formatting on my table.
my_df_styled.savefig('/dbfs/path/figure.png')
But I get the following error:
AttributeError: 'Styler' object has no attribute 'savefig'
Step 4 - Problem: opening the image in a different notebook. Not sure how to do this. I tried the following using another image:
opening_image = open('/dbfs/path/otherimage.png')
opening_image
But instead of getting the image, I get:
Out[#]: <_io.TextIOWrapper name='/dbfs/path/otherimage.png' mode='r'
encoding='UTF-8'>
For the first question: savefig() is a Matplotlib method, so it will certainly not work if you call something like df.savefig().
Rather, you should use another wrapper (matplotlib or one of the other libraries in the link below) to convert the dataframe into an image with the desired styling.
https://stackoverflow.com/a/69250659/4407905
For the second question, I have not tried Databricks myself, but I suspect it would be better to use the to_csv(), to_excel(), to_json(), etc. methods to export the data in a text-based format.
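As a rough illustration of the first part, below is a minimal sketch using the dataframe_image package, which can render a pandas Styler to a PNG; whether that matches the linked answer's approach is an assumption on my part, and the /dbfs path is taken from the question and untested on Databricks.
import dataframe_image as dfi
from PIL import Image
# Render the Styler, including its .style formatting, to a PNG on DBFS
dfi.export(my_df_styled, '/dbfs/path/figure.png')
# In another notebook, open the file as an image rather than as text;
# a PIL Image renders inline when it is the last expression in a cell
Image.open('/dbfs/path/figure.png')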

Is it expected behaviour for scrapy nested loaders to create duplicate values from base item?

I'm using Scrapy and want to collate values from multiple related pages into a single item (e.g. a user profile spread across a number of pages -> user item). So I create an ItemLoader and, after scraping each page, pass the item to the parser for the next request so it can add that page's values. The problem I'm having is that as soon as I nest a loader, all the values in the base item are duplicated.
from scrapy.loader import ItemLoader
from scrapy.selector import Selector
l = ItemLoader(item={'k': 'v'}, response='', selector=Selector(text=''))
nl = l.nested_css('.test')
print(l.load_item())
>>> {'k': ['v', 'v']}
So the workaround is to not use nested loaders, but am I doing something wrong, or is this a defect?
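For illustration, a minimal sketch of that workaround's shape: put the full selector on the base loader instead of calling nested_css (the field name and markup here are hypothetical, and this makes no claim about whether the nesting behaviour is a defect).
from scrapy.loader import ItemLoader
from scrapy.selector import Selector
l = ItemLoader(item={'k': 'v'}, selector=Selector(text='<p class="test">x</p>'))
l.add_css('field', '.test::text')   # full selector instead of l.nested_css('.test')
item = l.load_item()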

Scrape wikipedia table using BeautifulSoup

I would like to scrape the table titled "List of chemical elements" from the Wikipedia link below and display it using pandas:
https://en.wikipedia.org/wiki/List_of_chemical_elements
I am new to BeautifulSoup, and this is currently what I have.
from bs4 import BeautifulSoup
import requests as r
import pandas as pd
response = r.get('https://en.wikipedia.org/wiki/List_of_chemical_elements')
wiki_text = response.text
soup = BeautifulSoup(wiki_text, 'html.parser')
table_soup = soup.find_all('table')
You can select the table with beautifulsoup in different ways:
By its "title":
soup.select_one('table:-soup-contains("List of chemical elements")')
By order in tree (it is the first one):
soup.select_one('table')
soup.select('table')[0]
By its class (there is no id in your case):
soup.select_one('table.wikitable')
Or simply with pandas
pd.read_html('https://en.wikipedia.org/wiki/List_of_chemical_elements')[0]
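For completeness, a minimal sketch that combines the caption-based selector with pandas, so only the titled table gets parsed (assuming the page structure stays as it is today):
from bs4 import BeautifulSoup
import requests as r
import pandas as pd
response = r.get('https://en.wikipedia.org/wiki/List_of_chemical_elements')
soup = BeautifulSoup(response.text, 'html.parser')
# Select the table containing the caption text, then let pandas parse just that table
table = soup.select_one('table:-soup-contains("List of chemical elements")')
df = pd.read_html(str(table))[0]
print(df.head())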
To get to the expected result from there, try it yourself first; if you run into difficulties, ask a new question.

I am not sure between which two elements I should be looking to scrape and formatting error (jupyter + selenium)

I finally got around to displaying the page I need as text/HTML and confirmed that the data I need is included. For now I just have it printing the entire page, because I am still torn between the elements I potentially need to get what I want. Between the three highlighted elements 1, 2, and 3, I am having trouble identifying which one to reference (I would go with the 'table' element, but it doesn't cover the left-most column with the ticker names, which is literally half the point of getting this data, though the name is referenced as shown in the highlighted yellow part). Also, the class descriptions seem really long, and sometimes there appear to be two within the same element, so how would I address that? And, though this is less immediate, if you take that code, print it, and scroll down a bit, the table data comes out in straight columns; would that be fixed once I reference the proper element, or do I have to write something extra for it? Would the fact that I have multiple pages to scan change anything in the code? Thank you in advance!
Code:
!pip install selenium
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome("D:/chromedriver/chromedriver.exe")
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
soup = BeautifulSoup(driver.page_source, 'html.parser')
# get text
text = soup.get_text()
print(text)
Edit: read_html without bs4
You won't need BeautifulSoup to reach your goal: pandas picks all HTML tables from the page source and pushes them into a list of DataFrames.
In your case there is only one table in the page source, so you get your df by selecting the first element of the list with [0]:
df = pd.read_html(driver.page_source)[0]
Example
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome('D:/chromedriver/chromedriver.exe')
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
df = pd.read_html(driver.page_source)[0]
driver.close()
Initial answer based on bs4
You're close to a solution: let pandas take over by feeding it the prettified, bs4-flavored HTML, and modify the result there to your needs:
pd.read_html(soup.select_one('table').prettify(), flavor='bs4')
Example
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
driver = webdriver.Chrome('D:/chromedriver/chromedriver.exe')
driver.get('https://www.barchart.com/options/unusual-activity/stocks')
soup = BeautifulSoup(driver.page_source, 'html.parser')
df = pd.read_html(soup.select_one('table').prettify(), flavor='bs4')[0]
df

Scrape table data using pandas/beautiful soup (instead of selenium which is slow?), BS implementation not working

I am trying to scrape web data on this website, and the only way I was able to access the data was by iterating through the rows of the table, adding them to a list (then adding them to a pandas data frame/writing to a csv), and then clicking to the next page and repeating the process [there are about 50 pages per search and my program does 100+ searches]. It's super slow/inefficient, and I was wondering if there was a way to efficiently add all the data using pandas or beautiful soup instead of iterating through each line/column.
url = "https://claimittexas.org/app/claim-search"
rows = driver.find_elements_by_xpath("//tbody/tr")
try:
    for row in rows[1:]:
        row_array = []
        # print(row.text)  # prints the whole row
        for col in row.find_elements_by_xpath('td')[1:]:
            row_array.append(col.text.strip())
        table_array.append(row_array)
    df = pd.DataFrame(table_array)
    df.to_csv('my_csv.csv', mode='a', header=False)
except:
    print(letters + "no table exists")
EDIT: I tried to scrape using beautiful soup, something I tried earlier in the week and posted about, but I can't seem to access the table without using selenium
With the bs version, I put a bunch of print statements in to see what was wrong, and it turns out that the rows value is just an empty list.
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
rows = soup.find('table').find('tbody').find_all(('tr')[1:])
for row in rows[1:]:
    cells = row.find_all('td')
    for cell in cells[1:]:
        print(cell.get_text())
Use this line in the BS4 implementation:
rows = soup.find('table').find('tbody').find_all('tr')[1:]
instead of
rows = soup.find('table').find('tbody').find_all(('tr')[1:])
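For context, here is a minimal sketch of how the corrected line fits into the loop, assuming driver already holds the loaded page; note that the [1:] slice after find_all already skips the header row, so the loop itself no longer needs one.
from bs4 import BeautifulSoup
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
# find_all('tr') returns every row; the [1:] slice drops the header row
rows = soup.find('table').find('tbody').find_all('tr')[1:]
for row in rows:
    cells = row.find_all('td')
    for cell in cells[1:]:
        print(cell.get_text())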