How can pvlib-python retrieve a year of archived weather forecasts from the global model (GFS)? - unidata

I have seen how easily pvlib-python can obtain weather forecasts, as presented in this link: https://pvlib-python.readthedocs.io/en/latest/forecasts.html
In that link the example is just an illustration, and the retrieved weather data seem to be limited in length (no more than a month into the past). So I wonder whether the archived weather forecasts retrieved by pvlib can cover a longer period in a practical implementation.
Can pvlib-python retrieve archived GFS weather forecasts for a year?
For example, I am looking for the temperature and solar irradiance (GHI) for the whole of 2018. Can pvlib-python do that, and if so, how?

This is not possible with pvlib-python. I think it's out-of-scope and I don't anticipate adding this feature in the future.
However, I wrote a Python script to download some archived point forecast data from the NOAA NOMADS server: https://github.com/wholmgren/get_nomads/ It's efficient in that it only downloads the data that you need, but it's still fairly slow and error-prone.
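If you do end up scripting your own archive downloads (with get_nomads or anything similar), one practical tip: request the year in monthly chunks, so a failed request only costs you one month of data rather than the whole run. A minimal sketch of the chunking step only, using pandas; `month_chunks` is an illustrative helper name, and the actual fetch call would be whatever your download script provides:

```python
import pandas as pd

def month_chunks(year):
    """Split a year into (start, end) month boundaries for chunked downloads."""
    starts = pd.date_range(start=f"{year}-01-01", periods=12, freq="MS")
    ends = starts + pd.offsets.MonthEnd(1)
    return list(zip(starts, ends))

# Example: chunk 2018 so each archive request stays small and restartable.
chunks = month_chunks(2018)
print(len(chunks))          # 12
print(chunks[0][0].date())  # 2018-01-01
print(chunks[0][1].date())  # 2018-01-31
```

Each (start, end) pair can then drive one download request, and you can retry or resume per month when the server errors out.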

I wrote a small client for the CAMS radiation service: https://github.com/GiorgioBalestrieri/cams_radiation_python.
It contains a notebook showing how to combine this with pvlib.
From the website:
Copernicus Atmosphere Monitoring Service (CAMS) radiation service provides time series of Global, Direct, and Diffuse Irradiations on horizontal surface, and Direct Irradiation on normal plane (DNI) for the actual weather conditions as well as for clear-sky conditions. The geographical coverage is the field-of-view of the Meteosat satellite, roughly speaking Europe, Africa, Atlantic Ocean, Middle East (-66° to 66° in both latitudes and longitudes). Time coverage is 2004-02-01 up to 2 days ago. Data are available with a time step ranging from 1 min to 1 month. The number of automatic or manual requests is limited to 40 per day.
See the repo readme file for more information.

Related

CrUX dataset BigQuery - Query for Min/Avg/Max LCP, FID and CLS

I have been exploring the CrUX dataset in BigQuery for the last 10 days to extract data for a Data Studio report. Though I consider myself good at SQL, as I have mostly worked with Oracle and SQL Server, I am finding it very hard to write queries against this dataset. I started from this article by Rick Viscomi and explored the queries in his GitHub repo, but I am still unable to figure it out.
I am trying to use the materialized table chrome-ux-report.materialized.metrics_summary to get some of the metrics, but I am not sure whether the Min/Avg/Max LCP (in milliseconds) for a time period (a month, for example) can be extracted from this table. What other queries could I try that require less data processing? (Some of the queries that I tried used up my free TB of data processing on BigQuery.)
Any suggestions, advice, or example queries are more than welcome, since the documentation about the structure of the dataset and queries against it is not very clear.
For details about the fields used in the report, you can check the main documentation for the Chrome UX Report, especially the last part on the data format, which shows the dimensions and how they are interpreted, as shown below:
Dimension: Value
origin: "https://example.com"
effective_connection_type.name: 4G
form_factor.name: "phone"
first_paint.histogram.start: 1000
first_paint.histogram.end: 1200
first_paint.histogram.density: 0.123
For example, the above shows a sample record from the Chrome User Experience Report, which indicates that 12.3% of page loads had a “first paint time” measurement in the range of 1000-1200 milliseconds when loading “http://example.com” on a “phone” device over a ”4G”-like connection. To obtain a cumulative value of users experiencing a first paint time below 1200 milliseconds, you can add up all records whose histogram’s “end” value is less than or equal to 1200.
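The cumulative calculation described above can be sketched in Python. The bin values here are made up for illustration (only the 1000-1200 bin's density of 0.123 comes from the sample record); real bins come from the BigQuery tables:

```python
# Hypothetical histogram bins (start, end, density) for one origin/dimension slice.
bins = [
    (0, 200, 0.05),
    (200, 400, 0.10),
    (400, 600, 0.20),
    (600, 800, 0.15),
    (800, 1000, 0.20),
    (1000, 1200, 0.123),
    (1200, 1400, 0.08),
]

# Share of page loads with a first paint at or below 1200 ms:
# sum the densities of all bins whose "end" value is <= 1200.
cumulative = sum(density for (_, end, density) in bins if end <= 1200)
print(round(cumulative, 3))  # 0.823
```

The same "sum densities below a threshold" shape is what the SQL example further down expresses with SUM over UNNESTed bins.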
For the metrics, in the initial link there is a section called Methodology where you can get information about the metrics and dimensions of the report. I recommend going to the actual origin source tables, per country and per site, rather than the summary, as the data you are looking for can be obtained there. In the BigQuery part of the documentation you will find samples of how to query those tables. I find this example relevant:
SELECT
  SUM(bin.density) AS density
FROM
  `chrome-ux-report.chrome_ux_report.201710`,
  UNNEST(first_contentful_paint.histogram.bin) AS bin
WHERE
  bin.start < 1000 AND
  origin = 'http://example.com'
In the example above we’re adding all of the density values in the FCP histogram for “http://example.com” where the FCP bin’s start value is less than 1000 ms. The result is 0.7537, which indicates that ~75.4% of page loads experience the FCP in under a second.
Regarding query cost estimation, see the estimating query costs guide in the official BigQuery documentation. These tables, by their nature, consume a lot of processing, so filter as much as possible.

OptaPlanner example for Capacitated Vehicle Routing with Time Windows?

I am new to OptaPlanner.
I want to build a solution where I will have a number of locations to deliver items to from one single location, and I also want to use OpenStreetMap distance data for calculating the distances.
Initially I used jsprit, but for more than 300 deliveries it takes more than 8 minutes with 20 threads. That's why I am trying OptaPlanner.
I want to plan 1000 deliveries within 1 minute.
Does anyone know any reference code or reference material I can start from?
Thanks in advance :)
CVRPTW is a standard example: just open the examples app, select Vehicle Routing, and import one of the Belgium datasets with time windows. The code is in the zip too.
To scale to 1k deliveries, and especially beyond, you'll want to use "Nearby selection" (see the reference manual). It isn't on by default, but it makes a huge difference.
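For reference, nearby selection is configured in the solver config XML. A rough sketch only, assuming a VRP-style model and a custom distance meter class (the class name and the selector ids here are illustrative; check the reference manual for the exact elements your OptaPlanner version supports):

```xml
<localSearch>
  <unionMoveSelector>
    <changeMoveSelector>
      <entitySelector id="entitySelector1"/>
      <valueSelector>
        <nearbySelection>
          <originEntitySelector mimicSelectorRef="entitySelector1"/>
          <!-- Your NearbyDistanceMeter implementation (illustrative name) -->
          <nearbyDistanceMeterClass>com.example.DeliveryDistanceMeter</nearbyDistanceMeterClass>
          <parabolicDistributionSizeMaximum>40</parabolicDistributionSizeMaximum>
        </nearbySelection>
      </valueSelector>
    </changeMoveSelector>
  </unionMoveSelector>
</localSearch>
```

The distance meter is where OpenStreetMap-based distances would plug in: it only has to answer "how far is entity A from value B", so it can read from a precomputed distance matrix.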

Bloomberg - get real world prices from the API

For a number of financial instruments, Bloomberg scales the prices that are shown in the Terminal - for example:
FX Futures at CME: e.g. ADZ3 Curncy (Dec-2013 AUD Futures at CME) show as 93.88 (close on 04-Oct-2013), whereas the actual (CME) market/settlement price was 0.9388
FX Rates: sometimes FX rates are scaled - this may vary by which way round the FX rate is asked for, so EURJPY Curncy (i.e. JPY per EUR) has a BGN close of 132.14 on 04-Oct-2013. The inverse (EUR per JPY) would be 0.007567. However, for JPYEUR Curncy (i.e. EUR per JPY), BGN has a close of 0.75672 for 04-Oct-2013.
FX Forwards: Depending on whether you are asking for rates or forward points (which can be set by overrides)... if you ask for rates, you might get these in terms of the original rate, so for EURJPY1M Curncy, BGN has a close of 132.1174 on 04-Oct-2013. But if you ask for forward points, you would get these scaled by some factor - i.e. -1.28 for EURJPY1M Curncy.
Now, I am not trying to criticise Bloomberg for the way that they represent this data in the Terminal. Goodness only knows when they first wrote these systems, and they have to maintain the functionality that market practitioners have come to know and perhaps love... In that context, scaling to the significant figures might make sense.
However, when I am using the API, I want to get real-world, actual prices. Like... the actual price at the exchange or the actual price that you can trade EUR for JPY.
So... how can I do that?
Well... the approach that I have come to use is to find the FLDS that communicate this scaling information, and then I fetch that value to reverse the scale that they have applied to the values. For futures, that's PX_SCALING_FACTOR. For FX, I've found PX_POS_MULT_FACTOR most reliable. For FX forward points, it's FWD_SCALE.
(It's also worth mentioning that how these are applied varies: PX_SCALING_FACTOR is what futures prices should be divided by, PX_POS_MULT_FACTOR is what FX rates should be multiplied by, and FWD_SCALE is the number of decimal places by which to divide the forward points to get a value that can be added to the actual FX rate.)
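To make that bookkeeping concrete, here is a sketch of the three un-scaling rules. The field names are the real FLDS discussed above, but the fetch itself is left out (swap in your own reference-data request), and the numeric factors below are only assumptions consistent with the examples in this question:

```python
def unscale_future(displayed, px_scaling_factor):
    """Futures: divide the displayed price by PX_SCALING_FACTOR."""
    return displayed / px_scaling_factor

def unscale_fx(displayed, px_pos_mult_factor):
    """FX rates: multiply the displayed rate by PX_POS_MULT_FACTOR."""
    return displayed * px_pos_mult_factor

def unscale_forward_points(displayed_points, fwd_scale):
    """Forward points: shift FWD_SCALE decimal places before adding to the spot rate."""
    return displayed_points / (10 ** fwd_scale)

# ADZ3 Curncy shows 93.88; a scaling factor of 100 (consistent with the
# example above) recovers the exchange price of 0.9388.
print(round(unscale_future(93.88, 100), 4))  # 0.9388

# Purely hypothetical forward-points example: 4 decimal places of scaling.
print(unscale_forward_points(-128.0, 4))     # -0.0128
```

The point of the question stands: each of these helpers needs a second fetch per instrument to learn its factor, unless you cache the factors locally.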
The problem with that approach is that it doubles the number of fetches I have to make, which adds significant overhead to my use of the API (reference data fetches also seem to take longer than historical data fetches). (FWIW, I'm using the API in Java, but the question should be equally applicable to using the API in Excel or any of the other supported languages.)
I've thought about finding out this information and storing it somewhere... but I'd really rather not hard-code it. It would also require spending a very long time finding the right scaling factors for all the different instruments I'm interested in. Even then, I would have no guarantee that they wouldn't change the scale on me at some point!
What I would really like is to apply an override in my fetch that would allow me to specify what scale should be used. (And no, the fields above do not seem to be override-able.) I've asked the "helpdesk" about this on many occasions - I've been badgering them about it for about 12 months - but, as ever with Bloomberg, nothing seems to have happened.
So...
has anyone else in the SO community faced this problem?
has anyone else found a way of setting this as an override?
has anyone else worked out a better solution?
Short answer: you seem to have all the available information at hand, and there is not much more you can do. But these conventions are stable over time, so it is fine to store the scales/factors instead of fetching the data every time (the scale of EURGBP points will always be 4).
For FX, I have a file with:
number of decimals (for spot, points and all-in forward rate)
points scale
spot date
To answer your specific questions:
FX Futures at CME: on ADZ3 Curncy > DES > 3:
For this specific contract, the price is quoted in cents/AUD instead of exchange convention USD/AUD in order to show greater precision for both the futures and options. Calendar spreads are also adjusted accordingly. Please note that the tick size has been adjusted by 0.01 to ensure the tick value and contract value are consistent with the exchange.
Not sure there is much you can do about this, apart from manually checking the factor...
FX Rates: PX_POS_MULT_FACTOR is your best bet indeed - note that the value of that field for a given pair is extremely unlikely to change. Alternatively, you could follow market conventions for pairs and AFAIK the rates will always be the actual rate. So use EURJPY instead of JPYEUR. The major currencies, in order, are: EUR, GBP, AUD, NZD, USD, CAD, CHF, JPY. For pairs that don't involve any of those you will have to fetch the info.
FX Forwards: the points follow the market conventions, but the scale can vary (it is 4 most of the time, but it is 3 for GBPCZK for example). However it should not change over time for a given pair.

Creating a testing strategy to check data consistency between two systems

A quick search on Stack Overflow did not turn up anything, so here is my question.
I am trying to write down the testing strategy for an application in which two systems sync with each other every day to keep a huge amount of data in sync.
As it is a huge amount of data, I don't really want to cross-check everything; I just want to do a random check every time a data sync happens. What should the strategy be for such a system?
I am thinking of these 2 approaches:
1) Get a count of all the data and cross-check that both sides are the same.
2) Choose 5 random data entries and verify that their properties are in sync.
Any suggestion would be great.
What you need is known as Risk Management; in Software Testing it is called Software Risk Management.
It seems your question is not about "how to test" what you are about to test, but about how to describe what you do and why you do it (based on the question, I assume you need this explanation for yourself too...).
Adding SRM to your Test Strategy should describe:
The risks of not fully testing each and every piece of data in the mirrored system
A table scaling SRM against the amount of data tested (i.e. the probability of error if only n% of the data is tested versus, e.g., 2n%), in other words stating, e.g., a 5% risk of lost data/invalid data/data corruption/etc. if x% of the data is tested with a k minute/hour execution time
Based on the previous point, a breakdown of the resources used for the different options (e.g. HW load of n% for m hours, y man-hours used, HW/SW/HR costs of z USD)
The probability - and cost - of errors/issues in the automation code (i.e. the data comparison goes wrong and produces a false positive or false negative, creating overhead for the DBAs, devs and/or testers)
What happens if the chosen SRM option (e.g. 10% of data tested, giving a 3% risk of data corruption/loss and a 0.75% overhead risk from false positive/negative results) results in an actual failure, i.e. a reference to Business Continuity and the effects of losing data, integrity, etc.
Anything else that comes to mind and that you feel applies to your *current issue* in your *current system* with your *actual preferences*.
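The questioner's two approaches (a count check plus a random spot check) can be sketched as follows. This is a toy sketch: `consistency_check` is an illustrative name, and `fetch_a`/`fetch_b` stand in for whatever record lookup each system actually exposes:

```python
import random

def consistency_check(ids_a, ids_b, fetch_a, fetch_b, sample_size=5, seed=None):
    """Approach 1: compare record counts. Approach 2: compare a random sample of records."""
    report = {"count_match": len(ids_a) == len(ids_b), "mismatched_ids": []}
    rng = random.Random(seed)  # seeded for reproducible audits
    common = sorted(set(ids_a) & set(ids_b))
    for record_id in rng.sample(common, min(sample_size, len(common))):
        if fetch_a(record_id) != fetch_b(record_id):
            report["mismatched_ids"].append(record_id)
    return report

# Toy systems represented as dicts keyed by record id:
system_a = {i: {"qty": i * 2} for i in range(100)}
system_b = dict(system_a)
system_b[7] = {"qty": 0}  # inject one inconsistency

result = consistency_check(list(system_a), list(system_b),
                           system_a.get, system_b.get,
                           sample_size=100, seed=1)
print(result["count_match"])          # True
print(7 in result["mismatched_ids"])  # True
```

Note how this sketch makes the SRM trade-off discussed above explicit: `sample_size` is the dial between coverage (risk reduction) and execution time/cost.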

How to group nearby latitude and longitude locations stored in SQL

I'm trying to analyse data on cycle accidents in the UK to find statistical black spots. Here is an example of the data from another website: http://www.cycleinjury.co.uk/map
I am currently using SQLite to store ~100k lat/lon locations. I want to group nearby locations together. This task is called cluster analysis.
I would like to simplify the dataset by ignoring isolated incidents and instead only showing the origin of clusters where more than one accident has taken place in a small area.
There are 3 problems I need to overcome.
Performance - How do I ensure that finding nearby points is quick? Should I use SQLite's implementation of an R-tree, for example?
Chains - How do I avoid picking up chains of nearby points?
Density - How do I take cyclist population density into account? There is a far greater density of cyclists in London than in, say, Bristol, so there appears to be a greater number of black spots in London.
I would like to avoid 'chain' scenarios like this:
Instead I would like to find clusters:
London screenshot (I hand drew some clusters)...
Bristol screenshot - Much lower density - the same program run over this area might not find any black spots if relative density were not taken into account.
Any pointers would be great!
Well, your problem description reads exactly like the DBSCAN clustering algorithm (Wikipedia). It avoids chain effects in the sense that it requires clusters to contain at least minPts objects.
As for the differences in density across regions, that is what OPTICS (Wikipedia) is supposed to solve. You may need to use a different way of extracting clusters, though.
Well, OK, maybe not 100% - you may want single hotspots, not areas that are "density connected". Thinking of an OPTICS plot, I figure you are only interested in small but deep valleys, not in large valleys. You could probably take the OPTICS plot and scan for local minima of "at least 10 accidents".
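As a minimal, self-contained sketch of the DBSCAN idea (using scikit-learn here rather than the ELKI tooling; the haversine metric expects coordinates in radians, and eps below is roughly 100 m on Earth's ~6371 km radius; the coordinates are made up):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data: two tight accident clusters plus one isolated incident (lat, lon in degrees).
points = np.array([
    [51.5005, -0.1005], [51.5006, -0.1004], [51.5004, -0.1006],  # cluster A
    [51.4500, -2.5900], [51.4501, -2.5901], [51.4499, -2.5899],  # cluster B
    [52.0000, -1.0000],                                          # isolated -> noise
])

earth_radius_km = 6371.0
eps_km = 0.1  # neighbourhood radius: ~100 m
labels = DBSCAN(eps=eps_km / earth_radius_km,
                min_samples=3,
                metric="haversine").fit_predict(np.radians(points))
print(labels)  # the isolated point gets label -1 (noise)
```

The minPts/`min_samples` parameter is exactly the knob that suppresses isolated incidents and weak chains, as described above.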
Update: Thanks for the pointer to the data set. It's really interesting. I did not filter it down to cyclists; right now I'm using all 1.2 million records with coordinates. I fed them into ELKI for analysis, because it's really fast and it can use geodetic distance (i.e. on latitude and longitude) instead of Euclidean distance, to avoid bias. I enabled the R*-tree index with STR bulk loading, because that is supposed to help get the runtime down a lot. I'm running OPTICS with Xi=.1, epsilon=1 (km) and minPts=100 (looking for large clusters only). The runtime was around 11 minutes - not too bad. The OPTICS plot would of course be 1.2 million pixels wide, so it's no longer much good for full visualization. Given the huge threshold, it identified 18 clusters with 100-200 instances each. I'll try to visualize these clusters next. But definitely try a lower minPts for your experiments.
So here are the major clusters found:
51.690713 -0.045545 a crossing on A10 north of London just past M25
51.477804 -0.404462 "Waggoners Roundabout"
51.690713 -0.045545 "Halton Cross Roundabout" or the crossing south of it
51.436707 -0.499702 Fork of A30 and A308 Staines By-Pass
53.556186 -2.489059 M61 exit to A58, North-West of Manchester
55.170139 -1.532917 A189, North Seaton Roundabout
55.067229 -1.577334 A189 and A19, just south of this, a four lane roundabout.
51.570594 -0.096159 Manor House, Piccadilly Line
53.477601 -1.152863 M18 and A1(M)
53.091369 -0.789684 A1, A17 and A46, a complex construct with roundabouts on both sides of A1.
52.949281 -0.97896 A52 and A46
50.659544 -1.15251 Isle of Wight, Sandown.
...
Note, these are just random points taken from the clusters. It may be sensible to compute e.g. cluster center and radius instead, but I didn't do that. I just wanted to get a glimpse of that data set, and it looks interesting.
Here are some screenshots, with minPts=50, epsilon=0.1, xi=0.02:
Notice that with OPTICS, clusters can be hierarchical. Here is a detail:
First, your example is quite misleading. You have two different sets of data, and you don't control the data. If it appears in a chain, then you will get a chain out.
This problem is not exactly suitable for a database. You'll have to write code or find a package that implements this algorithm on your platform.
There are many different clustering algorithms. One, k-means, is an iterative algorithm in which you look for a fixed number of clusters. k-means requires a few complete scans of the data and, voila, you have your clusters. Indexes are not particularly helpful.
Another, which is usually appropriate for slightly smaller data sets, is hierarchical clustering: you put the two closest things together, and then build up the clusters. An index might be helpful here.
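For comparison, a k-means sketch with scikit-learn (made-up coordinates; note the caveats for geographic data: k-means uses Euclidean distance and you must pick the number of clusters up front, which is why the density-based answer above fits this problem better):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy accident locations (lat, lon in degrees).
points = np.array([
    [51.50, -0.10], [51.51, -0.11], [51.49, -0.09],  # around London
    [51.45, -2.59], [51.46, -2.58], [51.44, -2.60],  # around Bristol
])

# k must be chosen in advance; here we "know" there are two groups.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # two groups of three points each
print(kmeans.cluster_centers_)  # one centre near each city
```

Unlike DBSCAN, this assigns every point to a cluster, so isolated incidents are never marked as noise.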
I recommend, though, that you peruse a site such as KDnuggets to see what software - free and otherwise - is available.