How to scrape data from the chart in this page? - Selenium

I am looking for a way to scrape data from this website using Selenium. I am talking about the floor price history chart on the page shared. I unfortunately have no idea how to scrape charts, so I am asking here. Thank you.

You don't need Selenium here: the chart is populated from an API endpoint, and you can read the data directly with pandas:
import pandas as pd

# API endpoint the chart loads its data from
url = "https://api-bff.nftpricefloor.com/nft/bored-ape-yacht-club/chart/pricefloor?interval=all"
df = pd.read_json(url)

# print sample data:
print(df.head().to_markdown(index=False))
Prints:
| dataPriceFloorETH | dataPriceFloorUSD | dataVolumeETH | dataVolumeUSD | dates                    | sales |
|------------------:|------------------:|--------------:|--------------:|:-------------------------|------:|
|            5.5853 |             12845 |         11.55 |       26561.4 | 2021-07-30T00:00:00.000Z |     2 |
|            5.4953 |             12637 |       102.726 |        236237 | 2021-07-30T08:00:00.000Z |    10 |
|            5.4547 |             12544 |        121.42 |        279228 | 2021-07-30T16:00:00.000Z |    19 |
|            5.5301 |             12717 |       234.009 |        538148 | 2021-07-31T00:00:00.000Z |    29 |
|            6.3304 |             15588 |       418.771 |   1.03118e+06 | 2021-07-31T08:00:00.000Z |    58 |
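If you want to go further with the data, here is a small follow-up sketch (assuming the df and column names shown above): parse the dates column into real datetimes and index by it, so time-based slicing and resampling work naturally.
# continues from the df built above
df["dates"] = pd.to_datetime(df["dates"])
df = df.set_index("dates").sort_index()
# e.g. average daily floor price in ETH
print(df["dataPriceFloorETH"].resample("D").mean().head())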

Related

Appending data into gsheet using google colab without giving a range

I have a simple example to test how to append data to a tab in a gsheet using Colab. For example, here is the first code snippet, which writes the data the first time:
from google.colab import auth
auth.authenticate_user()
import gspread
import pandas as pd
from google.auth import default

creds, _ = default()
gc = gspread.authorize(creds)

df = pd.DataFrame({'a': ['apple','airplane','alligator'], 'b': ['banana', 'ball', 'butterfly'], 'c': ['cantaloupe', 'crane', 'cat']})
# header row plus data rows, as a list of lists
df2 = [df.columns.to_list()] + df.values.tolist()

wb = gc.open('test_wbr_feb13')
wsresults2 = wb.worksheet('Sheet2')
wsresults2.update(None, df2)
This works for me, as shown in the first screenshot:
First screenshot
Since it is my work account, I am not able to share a link to the gsheet, apologies for that. Next I need to check whether we can append data to the existing data. To this end, I use the following code:
from gspread_dataframe import get_as_dataframe, set_with_dataframe
wb = gc.open('test_wbr_feb13')
wsresults2 = wb.worksheet('Sheet2')
set_with_dataframe(wsresults2, df)
Please note that we don't know the row at which we need to insert the data; it can vary depending on the data size. But the output is still the same, please see the second screenshot. Can I please get help on how to append data into a gsheet using this approach? Thanks.
Second screenshot
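One way to do this without knowing the insertion row is gspread's append_rows, which writes below the last non-empty row of the sheet. A minimal sketch, assuming the same workbook and worksheet as above:
# append df below whatever is already in the sheet;
# append_rows locates the first row after the existing data itself
wb = gc.open('test_wbr_feb13')
wsresults2 = wb.worksheet('Sheet2')
wsresults2.append_rows(df.values.tolist(), value_input_option='USER_ENTERED')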

Using BS4 and requests to scrape links of a specific class?

I was trying to use requests and BeautifulSoup4 to scrape the top page of r/AskReddit, but when I tried to pull links using the class of the link element, I would sometimes receive an empty list. Using this code:
import requests
from bs4 import BeautifulSoup

base_url = 'https://www.reddit.com/r/AskReddit/top/?t=day'
r = requests.get(base_url)
soup = BeautifulSoup(r.text, 'html.parser')

links = []
for link in soup.find_all('a', 'SQnoC3ObvgnGjWt90zD9Z _2INHSNB8V5eaWp4P0rY_mE'):
    print(link.get('href'))
    links.append(link.get('href'))
print(links)
Sometimes the code would return a printed version of each link as well as a list of the links as intended:
/r/AskReddit/comments/yaqpzk/what_is_a_cult_that_pretends_its_not_cult/
/r/AskReddit/comments/yaugmy/whats_a_name_you_would_never_give_to_your_child/
/r/AskReddit/comments/yavldx/what_is_the_single_greatest_animated_series_of/
/r/AskReddit/comments/yb64tg/what_have_you_survived_that_wouldve_killed_you/
/r/AskReddit/comments/yat0xj/what_is_your_favorite_movie_that_most_people_have/
/r/AskReddit/comments/yasntt/what_is_the_craziest_cult_of_all_time/
/r/AskReddit/comments/yas0s7/54_of_americans_between_the_ages_of_16_and_74/
['/r/AskReddit/comments/yaqpzk/what_is_a_cult_that_pretends_its_not_cult/', '/r/AskReddit/comments/yaugmy/whats_a_name_you_would_never_give_to_your_child/', '/r/AskReddit/comments/yavldx/what_is_the_single_greatest_animated_series_of/', '/r/AskReddit/comments/yb64tg/what_have_you_survived_that_wouldve_killed_you/', '/r/AskReddit/comments/yat0xj/what_is_your_favorite_movie_that_most_people_have/', '/r/AskReddit/comments/yasntt/what_is_the_craziest_cult_of_all_time/', '/r/AskReddit/comments/yas0s7/54_of_americans_between_the_ages_of_16_and_74/']
>>>
but most of the time I would simply receive:
[]
>>>
I am confused as to why the same code produces two different outputs, and I don't understand why I only sometimes receive the data I actually want to scrape. I have looked at some of the other posts about these libraries on this site, but I haven't found anything that matches my problem. I have also looked over the BS4 documentation, albeit a bit ineffectively because I am a beginner, but I am still unsure of where the program is going wrong.
The empty list most likely means Reddit served your script a page without those elements: the markup (and its autogenerated class names) is not stable, and requests without a browser User-Agent are often throttled. I recommend parsing old.reddit.com instead (note the old. at the beginning of the URL - much simpler HTML) or using their JSON API (add .json at the end of the URL). For example:
import requests

base_url = "https://www.reddit.com/r/AskReddit/top/.json?t=day"
headers = {
    # a browser-like User-Agent avoids Reddit throttling default clients
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:105.0) Gecko/20100101 Firefox/105.0"
}

data = requests.get(base_url, headers=headers).json()["data"]["children"]
for i, d in enumerate(data, 1):
    print("{:>3} {:<90} {}".format(i, d["data"]["title"], d["data"]["url"]))
Prints:
1 What is a cult that pretends it’s not cult? https://www.reddit.com/r/AskReddit/comments/yaqpzk/what_is_a_cult_that_pretends_its_not_cult/
2 Men of Reddit, what was something you didn't know about women till you got with one? https://www.reddit.com/r/AskReddit/comments/yb54lt/men_of_reddit_what_was_something_you_didnt_know/
3 What's a name you would NEVER give to your child? https://www.reddit.com/r/AskReddit/comments/yaugmy/whats_a_name_you_would_never_give_to_your_child/
4 What is the single greatest animated series of all time? https://www.reddit.com/r/AskReddit/comments/yavldx/what_is_the_single_greatest_animated_series_of/
5 What have you survived that would’ve killed you 150+ years ago? https://www.reddit.com/r/AskReddit/comments/yb64tg/what_have_you_survived_that_wouldve_killed_you/
6 What is your “favorite movie” that most people have never seen? https://www.reddit.com/r/AskReddit/comments/yat0xj/what_is_your_favorite_movie_that_most_people_have/
7 What is the craziest cult of all time? https://www.reddit.com/r/AskReddit/comments/yasntt/what_is_the_craziest_cult_of_all_time/
8 [Serious]: What are some early warning signs of an abusive relationship? https://www.reddit.com/r/AskReddit/comments/yar1os/serious_what_are_some_early_warning_signs_of_an/
9 54% of Americans between the ages of 16 and 74 read below a 6th grade reading level. Why do you think that is? https://www.reddit.com/r/AskReddit/comments/yas0s7/54_of_americans_between_the_ages_of_16_and_74/
10 What is something positive going on in your life? https://www.reddit.com/r/AskReddit/comments/yb28sm/what_is_something_positive_going_on_in_your_life/
11 The alien overlords demand that one American major metropolitan city be sacrificed and turned into a no human zone. Which one goes? https://www.reddit.com/r/AskReddit/comments/yb0l7y/the_alien_overlords_demand_that_one_american/
12 What movie has a great soundtrack in your opinion? https://www.reddit.com/r/AskReddit/comments/yapb09/what_movie_has_a_great_soundtrack_in_your_opinion/
13 What is an obscure reference to something that only true fans will understand? https://www.reddit.com/r/AskReddit/comments/yb7og2/what_is_an_obscure_reference_to_something_that/
14 What's socially acceptable within your own gender, but not with the opposite? https://www.reddit.com/r/AskReddit/comments/yb7pei/whats_socially_acceptable_within_your_own_gender/
15 What stages of drunk do you have? https://www.reddit.com/r/AskReddit/comments/yb39tn/what_stages_of_drunk_do_you_have/
16 What is the worst chocolate? https://www.reddit.com/r/AskReddit/comments/yawn49/what_is_the_worst_chocolate/
17 What would the US state mottos be if they were brutally honest? https://www.reddit.com/r/AskReddit/comments/yb5eaw/what_would_the_us_state_mottos_be_if_they_were/
18 What’s your opinion on circumcision? https://www.reddit.com/r/AskReddit/comments/yat9pm/whats_your_opinion_on_circumcision/
19 What examples of 'Internet etiquette' do you feel deserve more awareness? https://www.reddit.com/r/AskReddit/comments/yb25n3/what_examples_of_internet_etiquette_do_you_feel/
20 What 90s song will always be a banger? https://www.reddit.com/r/AskReddit/comments/ybeys0/what_90s_song_will_always_be_a_banger/
21 What show never had a 'meh' season? https://www.reddit.com/r/AskReddit/comments/yb4ww4/what_show_never_had_a_meh_season/
22 What was the scariest thing you have witnessed? https://www.reddit.com/r/AskReddit/comments/yb0ajj/what_was_the_scariest_thing_you_have_witnessed/
23 You start a new job, what's an instant red flag in the workplace social atmosphere? https://www.reddit.com/r/AskReddit/comments/yb2siq/you_start_a_new_job_whats_an_instant_red_flag_in/
24 which fictional world would you like to live in the most? https://www.reddit.com/r/AskReddit/comments/yawtqi/which_fictional_world_would_you_like_to_live_in/
25 How did you come up with your username? https://www.reddit.com/r/AskReddit/comments/yb7zm8/how_did_you_come_up_with_your_username/
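The code above shows the JSON route; for completeness, here is a hedged sketch of the old.reddit.com route also mentioned. The a.title selector is an assumption based on old Reddit's simpler markup and may change:
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:105.0) Gecko/20100101 Firefox/105.0"
}
html = requests.get("https://old.reddit.com/r/AskReddit/top/?t=day", headers=headers).text
soup = BeautifulSoup(html, "html.parser")
# a.title is the post-title link in old Reddit's markup (assumption)
for link in soup.select("a.title"):
    print(link.get_text(), link.get("href"))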

Python KeyError when using pandas

I'm following a tutorial on NLP but have encountered a KeyError when trying to group my raw data into good and bad reviews. Here is the tutorial link: https://towardsdatascience.com/detecting-bad-customer-reviews-with-nlp-d8b36134dc7e
#reviews.csv
I am so angry about the service
Nothing was wrong, all good
The bedroom was dirty
The food was great
#nlp.py
import pandas as pd

# read data
reviews_df = pd.read_csv("reviews.csv")
# append the positive and negative text reviews
reviews_df["review"] = reviews_df["Negative_Review"] + reviews_df["Positive_Review"]
reviews_df.columns
I'm seeing the following error:
File "pandas\_libs\hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Negative_Review'
Why is this happening?
You're getting this error because your data isn't structured the way the code expects.
When you do reviews_df["review"] = reviews_df["Negative_Review"] + reviews_df["Positive_Review"], you're actually concatenating the values of the Negative_Review column (which does not currently exist) with the Positive_Review column (which does not exist either) into a review column (which also does not exist).
Your csv is nothing more than a plaintext file with one text in each row. Also, since you're working with text, remember to enclose every string in quotation marks ("), otherwise stray commas will create fake columns.
With your approach, it seems that you'll still tag all your reviews manually (usually, if you're working with machine learning, you'll do this outside the code and load the labels into your machine learning file).
In order for your code to work, you want to do the following:
import pandas as pd
df = pd.read_csv('TestFileFolder/57886076.csv', names=['text'])
## Fill with placeholder values
df['Positive_review']=0
df['Negative_review']=1
df.head()
Result:
                              text  Positive_review  Negative_review
0  I am so angry about the service                0                1
1      Nothing was wrong, all good                0                1
2            The bedroom was dirty                0                1
3               The food was great                0                1
However, I would recommend having a single column (is_review_positive) set to true or false. You can easily encode it later on.
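A small sketch of that single-column suggestion; the True/False tags here are hypothetical placeholders standing in for your real manual labels:
import pandas as pd

df = pd.DataFrame({
    'text': [
        'I am so angry about the service',
        'Nothing was wrong, all good',
        'The bedroom was dirty',
        'The food was great',
    ],
    # hypothetical manual labels for the four sample reviews
    'is_review_positive': [False, True, False, True],
})
# encode later as 0/1 for a model
df['label'] = df['is_review_positive'].astype(int)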

Pandas, importing JSON-like file using read_csv

I would like to import data from a .txt file into a dataframe. I cannot import it with a plain pd.read_csv; with different types of sep it throws errors. The data I want to import, Cell_Phones_&_Accessories.txt.gz, is in this format:
product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
review/userId: A1RXYH9ROBAKEZ
review/profileName: A. Igoe
review/helpfulness: 0/0
review/score: 1.0
review/time: 1233360000
review/summary: Don't buy!
review/text: First of all, the company took my money and sent me an email telling me the product was shipped. A week and a half later I received another email telling me that they are sorry, but they don't actually have any of these items, and if I received an email telling me it has shipped, it was a mistake.When I finally got my money back, I went through another company to buy the product and it won't work with my phone, even though it depicts that it will. I have sent numerous emails to the company - I can't actually find a phone number on their website - and I still have not gotten any kind of response. What kind of customer service is that? No one will help me with this problem. My advice - don't waste your money!
product/productId: B000JVER7W
product/title: Mobile Action MA730 Handset Manager - Bluetooth Data Suite
product/price: unknown
....
You can use a separator character that never appears in the data (¥ here), then split each row by the first : and pivot:
import pandas as pd

df = pd.read_csv('Cell_Phones_&_Accessories.txt', sep='¥', names=['data'], engine='python')
# split each line into key/value on the first colon
df1 = df.pop('data').str.split(':', n=1, expand=True)
df1.columns = ['a', 'b']
# start a new record at every product/productId line
df1 = df1.assign(c=(df1['a'] == 'product/productId').cumsum())
df1 = df1.pivot(index='c', columns='a', values='b')
A pure-Python solution with defaultdict and the DataFrame constructor, for improved performance:
from collections import defaultdict
import pandas as pd

data = defaultdict(list)
with open("Cell_Phones_&_Accessories.txt") as f:
    for line in f:
        # skip the blank lines between records
        if len(line) > 1:
            key, value = line.strip().split(':', 1)
            data[key].append(value)
df = pd.DataFrame(data)
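Since the source file is gzipped, note that neither approach strictly needs a decompressed copy; a short sketch, assuming the same file layout:
import gzip
import pandas as pd

# pandas infers gzip compression from the .gz extension
df = pd.read_csv('Cell_Phones_&_Accessories.txt.gz', sep='¥', names=['data'], engine='python')

# for the manual loop, open the archive in text mode instead
with gzip.open('Cell_Phones_&_Accessories.txt.gz', 'rt', encoding='utf-8') as f:
    first_line = next(f)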

Is there a way to speed up this webscraping iteration? Pandas

So I'm collecting data on a list of stocks and putting all that info into a dataframe. The list has about 700 stocks.
import pandas as pd
stock = ['adma', 'aapl', 'fb'] # list has about 700 stocks which I extracted from a pickled dataframe that was storing the info.
#The site I'm visiting is below, with the name of the stock added to the end of the link
##http://finviz.com/quote.ashx?t=adma
##http://finviz.com/quote.ashx?t=aapl
I'm just extracting one portion of each page, as evident from the [-2] index in the code below.
df2 = pd.DataFrame()
for i in stock:
    df = pd.read_html('http://finviz.com/quote.ashx?t={}'.format(i), header=0)[-2].set_index('SEC Form 4')
    df['Stock'] = i.upper() # creating a column which has the name of the stock, so I can differentiate between stocks
    df2 = df2.append(df)
Each iteration takes a few seconds, and I have around 700 to go through at the moment. It's not terribly slow, but I was curious whether there is a more efficient method. Thanks.
Your current code is blocking: you don't start retrieving the next URL until you are done with the current one. Instead, you can switch to, for example, Scrapy, which is based on Twisted and works asynchronously, processing multiple pages at the same time.
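If you'd rather not adopt a whole framework, here is a minimal sketch of the same idea using only the standard library's ThreadPoolExecutor, assuming the page layout from the question (finviz may also require browser-like headers, in which case fetch the HTML with requests and pass it to read_html):
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

stock = ['adma', 'aapl', 'fb']  # the full ~700-ticker list from the question

def fetch(ticker):
    # same table extraction as in the question
    url = 'http://finviz.com/quote.ashx?t={}'.format(ticker)
    df = pd.read_html(url, header=0)[-2].set_index('SEC Form 4')
    df['Stock'] = ticker.upper()
    return df

# fetch several pages at the same time instead of one after another
with ThreadPoolExecutor(max_workers=10) as pool:
    frames = list(pool.map(fetch, stock))
df2 = pd.concat(frames)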