Getting URLs for all zip files from a webpage - BeautifulSoup

I am looking at a webpage with a bunch of zip files on it.
Each zip file has a URL like https://www.ercot.com/misdownload/servlets/mirDownload?mimic_duns=000000000&doclookupId=814778337
I want to extract the URLs for only the _csv.zip files, extract their contents into a csv file, and discard the URLs for the _xml.zip files. Both the xml.zip and the csv.zip contain the same data, but I prefer to use the csv.zip.
I am not sure how to approach this or where to start.
Edit:
If you are getting "Access Denied", please note that the webpage might only be accessible from USA IP addresses.
Clicking one of the URLs downloads a zip file to the PC. I basically want to:
download the zip file to the PC
load the contents of the csv file inside the zip to a pandas dataframe

All the zip files and a merged csv file (21 MB) are here, so no need for scraping.
But if you feel like it, here's my take on it.
import os.path
from shutil import copyfileobj

import pandas as pd
import requests
from bs4 import BeautifulSoup

base_url = "https://www.ercot.com"
entry_url = f"{base_url}/misapp/GetReports.do?reportTypeId=12331&reportTitle=DAM%20Settlement%20Point%20Prices&showHTMLView=&mimicKey"
download_dir = "ercot"


def scrape_zips():
    with requests.Session() as connection:
        print("Finding all zip files...")
        # Each report row carries two links (csv first, then xml),
        # so [::2] keeps only the csv ones.
        zip_urls = [
            f"{base_url}{source_url['href']}" for source_url in
            BeautifulSoup(
                connection.get(entry_url).text,
                "lxml"
            ).find_all("a")[::2]
        ]
        os.makedirs(download_dir, exist_ok=True)
        total_urls = len(zip_urls)
        for idx, url in enumerate(zip_urls, start=1):
            file_name = url.split("=")[-1]  # the doclookupId
            zip_object = connection.get(url, stream=True)
            print(f"Fetching file {file_name} -> {idx} out of {total_urls}")
            with open(os.path.join(download_dir, f"{file_name}.zip"), "wb") as output:
                copyfileobj(zip_object.raw, output)
            zip_object.close()


def list_files(dir_name: str):
    # Yield the names of the files (not directories) directly inside dir_name
    yield from next(os.walk(dir_name), (None, None, []))[2]


def merge_zips_to_df():
    print("Merging csv files...")
    # pandas infers the compression from the .zip extension and reads
    # the single csv inside each archive directly.
    df = pd.concat(
        pd.read_csv(os.path.join(download_dir, csv_file)) for csv_file
        in list_files(download_dir)
    )
    print(df.head(20))
    df.to_csv(os.path.join(download_dir, "merged_csv_files.csv"), index=False)


if __name__ == "__main__":
    scrape_zips()
    merge_zips_to_df()
This should give you the following output:
Finding all zip files...
Fetching file 816055622 -> 1 out of 31
Fetching file 815870449 -> 2 out of 31
Fetching file 815686938 -> 3 out of 31
Fetching file 815503551 -> 4 out of 31
Fetching file 815315296 -> 5 out of 31
Fetching file 815127892 -> 6 out of 31
Fetching file 814952388 -> 7 out of 31
Fetching file 814778337 -> 8 out of 31
Fetching file 814599101 -> 9 out of 31
Fetching file 814416972 -> 10 out of 31
Fetching file 814224618 -> 11 out of 31
Fetching file 814040277 -> 12 out of 31
Fetching file 813865857 -> 13 out of 31
Fetching file 813688802 -> 14 out of 31
Fetching file 813516414 -> 15 out of 31
Fetching file 813341752 -> 16 out of 31
Fetching file 813159478 -> 17 out of 31
Fetching file 812976112 -> 18 out of 31
Fetching file 812784659 -> 19 out of 31
Fetching file 812599985 -> 20 out of 31
Fetching file 812424952 -> 21 out of 31
Fetching file 812241625 -> 22 out of 31
Fetching file 812053445 -> 23 out of 31
Fetching file 811874015 -> 24 out of 31
Fetching file 811685701 -> 25 out of 31
Fetching file 811501577 -> 26 out of 31
Fetching file 811319918 -> 27 out of 31
Fetching file 811147926 -> 28 out of 31
Fetching file 810973966 -> 29 out of 31
Fetching file 810793357 -> 30 out of 31
Fetching file 810615891 -> 31 out of 31
Merging csv files...
DeliveryDate HourEnding SettlementPoint SettlementPointPrice DSTFlag
0 12/22/2021 01:00 AEEC 25.07 N
1 12/22/2021 01:00 AJAXWIND_RN 25.07 N
2 12/22/2021 01:00 ALGOD_ALL_RN 25.01 N
3 12/22/2021 01:00 ALVIN_RN 24.11 N
4 12/22/2021 01:00 AMADEUS_ALL 25.07 N
5 12/22/2021 01:00 AMISTAD_ALL 25.06 N
6 12/22/2021 01:00 AMOCOOIL_CC1 25.98 N
7 12/22/2021 01:00 AMOCOOIL_CC2 25.98 N
8 12/22/2021 01:00 AMOCO_PUN1 25.98 N
9 12/22/2021 01:00 AMOCO_PUN2 25.98 N
10 12/22/2021 01:00 AMO_AMOCO_1 25.98 N
11 12/22/2021 01:00 AMO_AMOCO_2 25.98 N
12 12/22/2021 01:00 AMO_AMOCO_5 25.98 N
13 12/22/2021 01:00 AMO_AMOCO_G1 25.98 N
14 12/22/2021 01:00 AMO_AMOCO_G2 25.98 N
15 12/22/2021 01:00 AMO_AMOCO_G3 25.98 N
16 12/22/2021 01:00 AMO_AMOCO_S1 25.98 N
17 12/22/2021 01:00 AMO_AMOCO_S2 25.98 N
18 12/22/2021 01:00 ANACACHO_ANA 25.05 N
19 12/22/2021 01:00 ANCHOR_ALL 25.08 N
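A note on the merge step: it works without unzipping anything because pandas infers the compression from the .zip extension and reads the lone csv inside each archive. So loading a single downloaded file into a dataframe is just (a minimal sketch, using one of the doclookupId file names from the run above):

import pandas as pd

# The archive holds exactly one csv, so read_csv opens it directly
df = pd.read_csv("ercot/814778337.zip")
print(df.head())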

Related

How to convert timelogs (timestamps) in a dataframe into seconds (integer)?

I have a dataframe with columns: Text, Start time, and End time.
All three of them are strings.
What I am currently focused on is converting the elements of the Start and End columns into a number of seconds. That is, converting 00:00:26 into 26, or 00:01:27 into 87, etc.
The output elements need to be in int format.
I have already figured out how to convert the timelog string into a proper timestamp:
datetime_str = '00:00:26'
datetime_object = datetime.strptime(datetime_str, '%H:%M:%S').time()
print(datetime_object)
print(type(datetime_object))
Output:
00:00:26
<class 'datetime.time'>
But how do I convert this 00:00:26 into the integer 26?
Since you're manipulating a df, you can simply use Timedelta and total_seconds from pandas:
df["Start_2"] = pd.to_timedelta(df["Start"]).dt.total_seconds().astype(int)
df["End_2"] = pd.to_timedelta(df["End"]).dt.total_seconds().astype(int)
Output:
print(df)
Start End Start_2 End_2
0 00:00:05 00:00:13 5 13
1 00:00:13 00:00:21 13 21
2 00:00:21 00:00:27 21 27
3 00:00:27 00:00:36 27 36
4 00:00:36 00:00:42 36 42
5 00:00:42 00:00:47 42 47
6 00:00:47 00:00:54 47 54
7 00:00:54 00:00:59 54 59
8 00:00:59 00:01:07 59 67
9 00:01:07 00:01:13 67 73
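For completeness: if you only have a single datetime.time object, as in the strptime snippet from the question, the integer seconds can be computed directly (a minimal sketch):

from datetime import datetime

t = datetime.strptime('00:01:27', '%H:%M:%S').time()
seconds = t.hour * 3600 + t.minute * 60 + t.second
print(seconds)  # 87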

How to calculate the slope of a dataframe, up to a specific row number?

I have this data frame that looks like this:
PE CE time
0 362.30 304.70 09:42
1 365.30 303.60 09:43
2 367.20 302.30 09:44
3 360.30 309.80 09:45
4 356.70 310.25 09:46
5 355.30 311.70 09:47
6 354.40 312.98 09:48
7 350.80 316.70 09:49
8 349.10 318.95 09:50
9 350.05 317.45 09:51
10 352.05 315.95 09:52
11 350.25 316.65 09:53
12 348.63 318.35 09:54
13 349.05 315.95 09:55
14 345.65 320.15 09:56
15 346.85 319.95 09:57
16 348.55 317.20 09:58
17 349.55 316.26 09:59
18 348.25 317.10 10:00
19 347.30 318.50 10:01
In this data frame, I would like to calculate the slope of both the first and second columns separately, over the time period starting from the first row (09:42 in this case, which is not fixed and can vary) up to the time 12:00.
Please help me write it.
Computing the slope can be accomplished by use of the equation:
Slope = Rise/Run
Given you want to compute the slope between two time entries, all you need to do is find:
the Run = the timedelta between the start and end times
the Rise = the difference between the cell entries at the start and end.
The tricky part of these calculations is making sure you properly handle the time functions:
import pandas as pd
from datetime import datetime
Thus you can define a function:
def computeSelectedSlope(df: pd.DataFrame, start: str, end: str, timecol: str, datacol: str) -> float:
    assert timecol in df.columns  # prove timecol exists
    assert datacol in df.columns  # prove datacol exists
    rise = (df[datacol][df[timecol] == datetime.strptime(end, '%H:%M:%S').time()].values[0] -
            df[datacol][df[timecol] == datetime.strptime(start, '%H:%M:%S').time()].values[0])
    run = (int(df.index[df[timecol] == datetime.strptime(end, '%H:%M:%S').time()].values[0]) -
           int(df.index[df[timecol] == datetime.strptime(start, '%H:%M:%S').time()].values[0]))
    return rise / run
Now given a dataframe df of the form:
A B T
0 2.632 231.229 00:00:00
1 2.732 239.026 00:01:00
2 2.748 251.310 00:02:00
3 3.018 285.330 00:03:00
4 3.090 308.925 00:04:00
5 3.366 312.702 00:05:00
6 3.369 326.912 00:06:00
7 3.562 330.703 00:07:00
8 3.590 379.575 00:08:00
9 3.867 422.262 00:09:00
10 4.030 428.148 00:10:00
11 4.210 442.521 00:11:00
12 4.266 443.631 00:12:00
13 4.335 444.991 00:13:00
14 4.380 453.531 00:14:00
15 4.402 462.531 00:15:00
16 4.499 464.170 00:16:00
17 4.553 471.770 00:17:00
18 4.572 495.285 00:18:00
19 4.665 513.009 00:19:00
You can find the slope for any time difference by:
computeSelectedSlope(df, '00:01:00', '00:15:00', 'T', 'B')
Which yields 15.964642857142858
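If, instead of a two-point rise-over-run, you would rather fit a least-squares slope through all rows up to the cutoff, numpy's polyfit can do it. A sketch, assuming the PE/CE/time columns from the original question, and that time holds HH:MM strings (which compare correctly as strings within a single day):

import numpy as np
import pandas as pd

def slope_upto(df: pd.DataFrame, datacol: str, cutoff: str) -> float:
    sub = df[df['time'] <= cutoff]  # rows from the start up to the cutoff time
    x = np.arange(len(sub))  # run: one unit per row (one minute here)
    return np.polyfit(x, sub[datacol].to_numpy(), 1)[0]  # fitted slope

# slope_upto(df, 'PE', '10:00')  # slope of PE from 09:42 through 10:00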

Pandas doesn't split EIA API data into two different columns for easy access

I am importing EIA data which contains weekly storage data. The first column is the reported week and the second is storage.
When I import the data it shows two columns. The first column has no title and the second one has the title "Weekly Lower 48 States Natural Gas Working Underground Storage, Weekly (Billion Cubic Feet)".
I would like to plot the data using matplotlib, but I need to separate the columns first. I used df.iloc[100:,:0], which gives the first column (the week), but I somehow cannot separate the second column.
import eia
import pandas as pd
import os
api_key = "mykey"
api = eia.API(api_key)
series_search = api.data_by_series(series='NG.NW2_EPG0_SWO_R48_BCF.W')
df = pd.DataFrame(series_search)
df1 = df.iloc[100:,:0]
Here is a sample of the code output (from all 486 rows). When I use the df.shape command it shows (486, 1) when it should show (486, 2):
2010 0101 01 3117
2010 0108 08 2850
2010 0115 15 2607
2010 0122 22 2521
2019 0322 22 1107
2019 0329 29 1130
2019 0405 05 1155
2019 0412 12 1247
2019 0419 19 1339
You can first cut the last 3 characters of the string and then convert it to datetime:
df['Date'] = pd.to_datetime(df['Date'].str[:-3], format='%Y %m%d')
print(df)
Date Value
0 2010-01-01 3117
1 2010-01-08 2850
2 2010-01-15 2607
3 2010-01-22 2521
4 2019-03-22 1107
5 2019-03-29 1130
6 2019-04-05 1155
7 2019-04-12 1247
8 2019-04-19 1339
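Note that the series comes back with the date string as the index rather than as a column, so if your dataframe looks like the sample above, move the index into a column first. A sketch; the column names below are assumptions:

import pandas as pd

df = df.reset_index()  # move the date strings out of the index
df.columns = ['Date', 'Value']  # assumed names, pick whatever suits you
df['Date'] = pd.to_datetime(df['Date'].str[:-3], format='%Y %m%d')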

How to sum data grouped by week, but summed from the beginning?

I have a database like this:
Entry_No Week Registering_Date Bin_Code Item_No Quantity
=====================================================================
1 26 6/26/2015 BIN 1 A 10
2 26 6/26/2015 BIN 1 B 20
3 26 6/26/2015 BIN 1 C 30
4 26 6/26/2015 BIN 1 D 40
5 27 6/29/2015 BIN 1 A -3
6 27 6/29/2015 BIN 2 A 3
7 27 6/29/2015 BIN 1 A -2
8 27 6/29/2015 BIN 3 A 2
9 28 7/5/2015 BIN 1 B -15
10 28 7/5/2015 BIN 3 B 15
11 28 7/5/2015 BIN 1 C -25
12 28 7/5/2015 BIN 2 C 25
13 28 7/5/2015 BIN 1 B 50
I would like to sum, grouped by BIN_CODE and ITEM_NO, by WEEK, but accumulated from the beginning of the data. I know how to sum grouped by week, but that only shows the total for that week (not from the beginning).
The result I expect is like this:
WEEK BIN CODE ITEM NO QUANTITY
====================================
26 BIN 1 A 10
26 BIN 1 B 20
26 BIN 1 C 30
26 BIN 1 D 40
26 BIN 2 A -
26 BIN 2 B -
26 BIN 2 C -
26 BIN 2 D -
26 BIN 3 A -
26 BIN 3 B -
26 BIN 3 C -
26 BIN 3 D -
27 BIN 1 A 5
27 BIN 1 B 20
27 BIN 1 C 30
27 BIN 1 D 40
27 BIN 2 A 3
27 BIN 2 B -
27 BIN 2 C -
27 BIN 2 D -
27 BIN 3 A 2
27 BIN 3 B -
27 BIN 3 C -
27 BIN 3 D -
28 BIN 1 A 5
28 BIN 1 B 55
28 BIN 1 C 5
28 BIN 1 D 40
28 BIN 2 A 3
28 BIN 2 B -
28 BIN 2 C 25
28 BIN 2 D -
28 BIN 3 A 2
28 BIN 3 B 15
28 BIN 3 C -
28 BIN 3 D -
Sorry, I'm a newbie here at writing a good question. Could you please help me?
Thanks in advance :)
Hope this works:
SELECT
    t.Week AS WEEK,
    t.Bin_Code AS BIN_CODE,
    t.Item_No AS ITEM_NO,
    SUM(CASE WHEN items.Item_No = t.Item_No THEN t.Quantity ELSE 0 END) AS QUANTITY
FROM your_table AS t
CROSS JOIN (SELECT DISTINCT Item_No FROM your_table) AS items
GROUP BY
    t.Week,
    t.Bin_Code,
    t.Item_No
ORDER BY
    1, 2, 3
Try this query:
SELECT Week, Bin_Code, Item_No,
       CASE WHEN asum > 0 THEN asum ELSE '-' END AS QUANTITY
FROM (
    SELECT Week, Bin_Code, Item_No,
           SUM(CASE WHEN Quantity > 0 THEN Quantity ELSE 0 END) AS asum
    FROM tablename
    GROUP BY Week, Bin_Code, Item_No
) a
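If SQL is not a hard requirement, the same running total is straightforward in pandas. A sketch, assuming the table above is loaded into a dataframe df with the question's column names:

import pandas as pd

# Weekly totals per (Bin_Code, Item_No) combination, zero-filled
weekly = df.pivot_table(index='Week', columns=['Bin_Code', 'Item_No'],
                        values='Quantity', aggfunc='sum', fill_value=0)
# Expand to the full Bin x Item cross so never-used combinations appear too
full = pd.MultiIndex.from_product([sorted(df['Bin_Code'].unique()),
                                   sorted(df['Item_No'].unique())])
# cumsum down the weeks gives the "sum from the beginning"; the zeros
# correspond to the '-' entries in the expected output
running = weekly.reindex(columns=full, fill_value=0).cumsum()
print(running.stack([0, 1]))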

Moving sum over date range

I have this table that has a wide range of dates and a corresponding value for each of those dates; an example is shown below.
Date Value
6/01/2013 8
6/02/2013 4
6/03/2013 1
6/04/2013 7
6/05/2013 1
6/06/2013 1
6/07/2013 3
6/08/2013 8
6/09/2013 4
6/10/2013 2
6/11/2013 10
6/12/2013 4
6/13/2013 7
6/14/2013 3
6/15/2013 2
6/16/2013 1
6/17/2013 7
6/18/2013 5
6/19/2013 1
6/20/2013 4
What I am trying to do is create a query that will create a new column displaying the sum of the Value column over a specified date range. For example, down below the Sum column contains the sum for its corresponding date going back one full week, so the Sum for the date 6/09/2013 would be the sum of the values from 6/03/2013 to 6/09/2013.
Date Sum
6/01/2013 8
6/02/2013 12
6/03/2013 13
6/04/2013 20
6/05/2013 21
6/06/2013 22
6/07/2013 25
6/08/2013 25
6/09/2013 25
6/10/2013 26
6/11/2013 29
6/12/2013 32
6/13/2013 38
6/14/2013 38
6/15/2013 32
6/16/2013 29
6/17/2013 34
6/18/2013 29
6/19/2013 26
6/20/2013 23
I've tried using the LIMIT clause, but I could not get it to work; any help would be greatly appreciated.
zoo has a function rollapply which can do what you need:
z <- zoo(x$Value, order.by=x$Date)
rollapply(z, width = 7, FUN = sum, partial = TRUE, align = "right")
## 2013-06-01 8
## 2013-06-02 12
## 2013-06-03 13
## 2013-06-04 20
## 2013-06-05 21
## 2013-06-06 22
## 2013-06-07 25
## 2013-06-08 25
## 2013-06-09 25
## 2013-06-10 26
## 2013-06-11 29
## 2013-06-12 32
## 2013-06-13 38
## 2013-06-14 38
## 2013-06-15 32
## 2013-06-16 29
## 2013-06-17 34
## 2013-06-18 29
## 2013-06-19 26
## 2013-06-20 23
Using data.table:
require(data.table)
# Build some sample data
data <- data.table(Date = 1:20, Value = rpois(20, 10))
# Build reference table
Ref <- data[, list(Compare_Value = list(I(Value)), Compare_Date = list(I(Date)))]
# Use lapply to get the trailing window of values for each date
data[, Roll.Val := lapply(Date, function(x) {
    d <- as.numeric(Ref$Compare_Date[[1]] - x)
    sum((d <= 0 & d >= -7) * Ref$Compare_Value[[1]])
})]
head(data, 10)
Date Value Roll.Val
1: 1 14 14
2: 2 7 21
3: 3 9 30
4: 4 5 35
5: 5 10 45
6: 6 10 55
7: 7 15 70
8: 8 14 84
9: 9 8 78
10: 10 12 83
Here is another solution if anyone is interested:
library("devtools")
install_github("boRingTrees","mgahan")
require(boRingTrees)
rollingByCalcs(data,dates="Date",target="Value",stat=sum,lower=0,upper=7)
Here is one way of doing it
> input <- read.table(text = "Date Value
+ 6/01/2013 8
+ 6/02/2013 4
+ 6/03/2013 1
+ 6/04/2013 7
+ 6/05/2013 1
+ 6/06/2013 1
+ 6/07/2013 3
+ 6/08/2013 8
+ 6/09/2013 4
+ 6/10/2013 2
+ 6/11/2013 10
+ 6/12/2013 4
+ 6/13/2013 7
+ 6/14/2013 3
+ 6/15/2013 2
+ 6/16/2013 1
+ 6/17/2013 7
+ 6/18/2013 5
+ 6/19/2013 1
+ 6/20/2013 4 ", as.is = TRUE, header = TRUE)
> input$Date <- as.Date(input$Date, format = "%m/%d/%Y") # convert Date
>
> # create a sequence that goes a week back from the current data
> x <- data.frame(Date = seq(min(input$Date) - 6, max(input$Date), by = '1 day'))
>
> # merge
> merged <- merge(input, x, all = TRUE)
>
> # replace NAs with zero
> merged$Value[is.na(merged$Value)] <- 0L
>
> # use 'filter' for the running sum and delete first 6
> input$Sum <- filter(merged$Value, rep(1, 7), sides = 1)[-(1:6)]
> input
Date Value Sum
1 2013-06-01 8 8
2 2013-06-02 4 12
3 2013-06-03 1 13
4 2013-06-04 7 20
5 2013-06-05 1 21
6 2013-06-06 1 22
7 2013-06-07 3 25
8 2013-06-08 8 25
9 2013-06-09 4 25
10 2013-06-10 2 26
11 2013-06-11 10 29
12 2013-06-12 4 32
13 2013-06-13 7 38
14 2013-06-14 3 38
15 2013-06-15 2 32
16 2013-06-16 1 29
17 2013-06-17 7 34
18 2013-06-18 5 29
19 2013-06-19 1 26
20 2013-06-20 4 23
>
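And since the question itself mentions the LIMIT clause rather than R, here is the same 7-day trailing sum in pandas, for anyone arriving from Python; a sketch assuming Date/Value columns as in the question:

import pandas as pd

df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')
df = df.sort_values('Date')  # a time-based window needs Date ascending
df['Sum'] = df.rolling('7D', on='Date')['Value'].sum().astype(int)
print(df)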