RVest html_nodes() not returning second table on webpage - html-table

I am trying to scrape data from football reference using this code:
season<-2020
team<-"nwe"
html<-paste("https://www.pro-football-reference.com/teams/",team,"/",season,"_roster.htm", sep="")
x<-read_csv(html)
table <- "table#roster"
html %>%
read_html() %>%
html_nodes(table) %>%
html_table()
Running this code gives me an empty list. My intention is to scrape the second table on the page. Scraping the first table works regardless of which method I use, but getting the second one is impossible. I am getting really annoyed because there is no logical reason for this to be happening.
IF I replaced
table <- "table#roster"
with
table <-"table#starters"
then the code effectively reads in the first table from the website.

Related

How to update a whole column based on a HTTP call done on each cell?

I am fumbling around with Pandas (I want to avoid using Excel, I have very basic knowledge of Pandas and a reasonable of Python), trying to add a column based on another column.
Specifically, I have a column with IDs, and I want to enrich my data by making a HTTP query to an API and using a field in the JSON response:
d['m0'] = pd.read_json(f"http://localhost:3000/{d['id']}")['H']['M0']
What I wanted to say in the above was
take the data from a cell in the column id, run the API query, and put the ['H']['M0'] of the JSON response (a string) into the column m0
What I get is
InvalidURL Traceback (most recent call last)
Cell In[5], line 1
----> 1 d['m0'] = pd.read_json(f"http://localhost:3000/{d['id']}")['H']['M0']
I feel that the way th eURI was built is not correct, i.e. the content of the cell for column id was not used, but rather the whole column:
InvalidURL: URL can't contain control characters. '/0 AA13\n1 BB10\n2
AA13, BB10, ...are the ids in the column
If I understand it correctly, you have various ID values and you want to automate it by fetching the JSON corresponding to that ID and use values from JSON to further populate the dataframe.
I have not seen the structure of your JSON or return type of accessed fields; but I feel that you are looking for following:
d['m0'] = d['id'].apply(lambda id: pd.read_json(f"http://localhost:3000/{id}")['H']['M0'])

Understanding the "Not found: Dataset ### was not found in location US" error

I know this topic has come up many times but still here I am. Data processing location seems consistent (dataset, US; query: US) and I am using backticks & long format in the FROM clause
Below are two sequences of code. The first one works perfectly:
SELECT station_id
FROM `bigquery-public-data.austin_bikeshare.bikeshare_stations`
Whereas the following returns an error message:
SELECT bikeshare_stations.station_id
FROM `bigquery-public-data.austin_bikeshare`
Not found: Dataset glassy-droplet-347618:bigquery-public-data was not found in location US
My question, thus, is why do the first lines of text work while the second doesn't?
You need to understand the different parts of the backticks:
bigquery-public-data is the name of the project;
austin_bikeshare is the name of the schema (aka dataset in BQ); and
bikeshare_stations is the name of the table/view.
Therefore, the shorter format you are looking for is: austin_bikeshare.bikeshare_stations (instead of bigquery-public-data.austin_bikeshare).
Using bigquery-public-data.austin_bikeshare means that you have a schema called bigquery-public-data that contains a table called austin_bikeshare , when this is not true.

How to create a view against a table that has record fields?

We have a weekly backup process which exports our production Google Appengine Datastore onto Google Cloud Storage, and then into Google BigQuery. Each week, we create a new dataset named like YYYY_MM_DD that contains a copy of the production tables on that day. Over time, we have collected many datasets, like 2014_05_10, 2014_05_17, etc. I want to create a data set Latest_Production_Data that contains a view for each of the tables in the most recent YYYY_MM_DD dataset. This will make it easier for downstream reports to write their query once and always retrieve the most recent data.
To do this, I have code that gets the most recent dataset and the names of all the tables that dataset contains from the BigQuery API. Then, for each of these tables, I fire a tables.insert call to create a view that is a SELECT * from the table I am looking to create a reference to.
This fails for tables that contain a RECORD field, from what looks to be a pretty benign column-naming rule.
For example, I have this table:
For which I issue this API call:
{
'tableReference': {
'projectId': 'redacted',
'tableId': u'AccountDeletionRequest',
'datasetId': 'Latest_Production_Data'
}
'view': {
'query': u'SELECT * FROM [2014_05_17.AccountDeletionRequest]'
},
}
This results in the following error:
HttpError: https://www.googleapis.com/bigquery/v2/projects//datasets/Latest_Production_Data/tables?alt=json returned "Invalid field name "__key__.namespace". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.">
When I execute this query in the BigQuery web console, the columns are renamed to translate the . to an _. I kind of expected the same thing to happen when I issued the create view API call.
Is there an easy way I can programmatically create a view for each of the tables in my dataset, regardless of their underlying schema? The problem I'm encountering now is for record columns, but another problem I anticipate is for tables that have repeated fields. Is there some magic alternative to SELECT * that will take care of all these intricacies for me?
Another idea I had was doing a table copy, but I would prefer not to duplicate the data if I can at all avoid it.
Here is the workaround code I wrote to dynamically generate a SELECT statement for each of the tables:
def get_leaf_column_selectors(dataset, table):
schema = table_service.get(
projectId=BQ_PROJECT_ID,
datasetId=dataset,
tableId=table
).execute()['schema']
return ",\n".join([
_get_leaf_selectors("", top_field)
for top_field in schema["fields"]
])
def _get_leaf_selectors(prefix, field):
if prefix:
format = prefix + ".%s"
else:
format = "%s"
if 'fields' not in field:
# Base case
actual_name = format % field["name"]
safe_name = actual_name.replace(".", "_")
return "%s as %s" % (actual_name, safe_name)
else:
# Recursive case
return ",\n".join([
_get_leaf_selectors(format % field["name"], sub_field)
for sub_field in field["fields"]
])
We had a bug where you needed to need to select out the individual fields in the view and use an 'as' to rename the fields to something legal (i.e they don't have '.' in the name).
The bug is now fixed, so you shouldn't see this issue any more. Please ping this thread or start a new question if you see it again.

xts objects and split

I have many "problems" but i will try to split them up as good as i can.
First i will present my code:
# Laster pakker
library(RODBC)
library(plyr)
library(lattice)
library(colorRamps)
library(Perfor)
# Picking up data
query <- "select convert(varchar(10),fk_dim_date,103) fk_dim_date,fk_dim_portfolio, dtd_portfolio_return_pct, dtd_benchmark_return_pct, * from nbim_dm..v_fact_performance
where fk_dim_date > '20130103' and fk_dim_portfolio in ('6906', '1812964')
"
# Formatting SQLen
query <- strwrap(query,width=nchar(query),simplify=TRUE)
# quering
ch <- odbcDriverConnect("driver={SQL Server};server=XXXX;Database=XXXX;", rows_at_time = 1024)
result <- sqlQuery(ch, query, as.is=c(TRUE, TRUE, TRUE))
close(ch)
# Do some cleanup
`enter code here`resultt$v_d <- as.Date(as.POSIXct(t$v_d))
#split
y <- split(qt,qt$fk_dim_portfolio)
#making names
new_names <- c("one","two")
for (i in 1:length(y){assign(new_names[i],y[[i]])})
So far so good:
The table that my SQL is running on has approx 178 diff. port_ids, some of which are useless and others that are highly useful. However i want this code to pull all fk_dim_ports (pulling: '6906', '1812964 was just for example purposes). After pulling the data i want to seperate it into n (now 178 sets) and make them xts objects which i have run into some trouble using:
qt <- xts(t[,-1],order.by=t[,1])
But works perfectly well if i don`t split the data using:
y <- split(qt,qt$fk_dim_portfolio)
Assuming this will work, my intention is to create charts.PerformanceSummary(mydata) for every table of my previous created data frames.
If you have any tips on how to split, make timeseries objects and loop the generation of the charts i would highly appreciate this.
I am aware that this post probably don`t comply to your rules/customs etc, but thanks for helping.
Lars

jqGrid/NHibernate/SQL: navigate to selected record

I use jqGrid to display data which is retrieved using NHibernate. jqGrid does paging for me, I just tell NHibernate to get "count" rows starting from "n".
Also, I would like to highlight specific record. For example, in list of employees I'd like a specific employee (id) to be shown and pre-selected in table.
The problem is that this employee may be on non-current page. E.g. I display 20 rows from 0, but "highlighted" employee is #25 and is on second page.
It is possible to pass initial page to jqGrid, so, if I somehow use NHibernate to find what page the "highlighted" employee is on, it will just navigate to that page and then I'll use .setSelection(id) method of jqGrid.
So, the problem is narrowed down to this one: given specific search query like the one below, how do I tell NHibernate to calculate the page where the "highlighted" employee is?
A sample query (simplified):
var query = Session.CreateCriteria<T>();
foreach (var sr in request.SearchFields)
query = query.Add(Expression.Like(sr.Key, "%" + sr.Value + "%"));
query.SetFirstResult((request.Page - 1) * request.Rows)
query.SetMaxResults(request.Rows)
Here, I need to alter (calculate) request.Page so that it points to the page where request.SelectedId is.
Also, one interesting thing is, if sort order is not defined, will I get the same results when I run the search query twice? I'd say that SQL Server may optimize query because order is not defined... in which case I'll only get predictable result if I pull ALL query data once, and then will programmatically in C# slice the specified portion of query results - so that no second query occur. But it will be much slower, of course.
Or, is there another way?
Pretty sure you'd have to figure out the page with another query. This would surely require you to define the column to order by. You'll need to get the order by and restriction working together to count the rows before that particular id. Once you have the number of rows before your id, you can figure what page you need to select and perform the usual paging query.
OK, so currently I do this:
var iquery = GetPagedCriteria<T>(request, true)
.SetProjection(Projections.Property("Id"));
var ids = iquery.List<Guid>();
var index = ids.IndexOf(new Guid(request.SelectedId));
if (index >= 0)
request.Page = index / request.Rows + 1;
and in jqGrid setup options
url: "${Url.Href<MyController>(c => c.JsonIndex(null))}?_SelectedId=${Id}",
// remove _SelectedId from url once loaded because we only need to find its page once
gridComplete: function() {
$("#grid").setGridParam({url: "${Url.Href<MyController>(c => c.JsonIndex(null))}"});
},
loadComplete: function() {
$("#grid").setSelection("${Id}");
}
That is, in request I lookup for index of id and set page if found (jqGrid even understands to display the appropriate page number in the pager because I return the page number to in in json data). In grid setup, I setup url to include the lookup id first, but after grid is loaded I remove it from url so that prev/next buttons work. However I always try to highlight the selected id in the grid.
And of course I always use sorting or the method won't work.
One problem still exists is that I pull all ids from db which is a bit of performance hit. If someone can tell how to find index of the id in the filtered/sorted query I'd accept the answer (since that's the real problem); if no then I'll accept my own answer ;-)
UPDATE: hm, if I sort by id initially I'll be able to use the technique like "SELECT COUNT(*) ... WHERE id < selectedid". This will eliminate the "pull ids" problem... but I'd like to sort by name initially, anyway.
UPDATE: after implemented, I've found a neat side-effect of this technique... when sorting, the active/selected item is preserved ;-) This works if _SelectedId is reset only when page is changed, not when grid is loaded.
UPDATE: here's sources that include the above technique: http://sprokhorenko.blogspot.com/2010/01/jqgrid-mvc-new-version-sources.html