xml_nodeset to tibble, one row per xml_nodeset (item) - purrr

I have a complicated XML file with items as first-level child nodes. The items can have different structures, and some attributes are missing in some of them. I need to store one item (nodeset) per tibble row, so that I can keep track of missing attributes and write a function that handles all the variants.
I found a solution for the first step in an answer by Felix Ebert:
https://stackoverflow.com/questions/49253021/how-to-extract-xml-attr-and-xml-text-on-different-levels-with-xml2-and-purrr
I copy part of the code here:
xml <- xml2::read_xml("input/example.xml")
rows <- xml %>% xml_find_all("//xmlsubsubnode")
rows_df <- data_frame(node = rows)
The data_frame() function has been deprecated, and I get error messages if I replace it with
tibble()
as_tibble()
data.frame()
With "tibble" I get following ERROR:
df_articles <- tibble(item = xml_articles)
Error:
! All columns in a tibble must be vectors.
✖ Column `item` is a `xml_nodeset` object.
Backtrace:
1. tibble::tibble(item = xml_articles)
2. tibble:::tibble_quos(xs, .rows, .name_repair)
3. tibble:::check_valid_col(res, col_names[[j]], j)
4. tibble:::check_valid_cols(set_names(list(x), name))
I would be grateful if anybody could update the original post.
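In case it is useful, here is a minimal sketch of two workarounds that keep one row per item. The element and attribute names (item, id, title) are placeholders for the real structure; tibble() accepts list columns, and xml_attr() returns NA when an attribute is missing, which keeps the gaps visible:
library(xml2)
library(purrr)
library(tibble)

xml  <- read_xml("input/example.xml")
rows <- xml_find_all(xml, "//item")   # placeholder XPath

# Option 1: keep the nodes themselves in a list column, one row per node
rows_df <- tibble(node = map(rows, ~ .x))

# Option 2: extract attributes/children directly; missing ones become NA
items_df <- map_dfr(rows, function(node) {
  tibble(
    id    = xml_attr(node, "id"),
    title = xml_text(xml_find_first(node, "./title"))
  )
})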

Related

RStudio Error: Unused argument (by = ...) when fitting gam model, and smoothing separately for a factor

I am still a beginner in R. For a project I am trying to fit a GAM on a simple dataset with a time variable and a year variable. I am doing it in R and I keep getting an error message that claims an argument is unused, even though I specify it in the code.
The dataset includes a categorical variable "Year" with only two levels, 2020 and 2022. I want to investigate whether there is a peak in the hourly rate of visitors ("H1") in a nature reserve. For each observation period the average time was taken, which is the predictor variable used here ("T"). I want to use a GAM for this, with the smoothing applied separately for the two years.
The following is the line of code that I tried to use
`gam1 <- gam(H1~Year+s(T,by=Year),data = d)`
When I try to run this code, I get the following error message
`Error in s(T, by = Year) : unused argument (by = Year)`
I also tried simply getting rid of the "by" argument
`gam1 <- gam(H1~Year+s(T,Year),data = d)`
This allows me to run the code, but when I try to display the output using summary(gam1), I get
Error in `[<-`(tmp, snames, 2, value = round(nldf, 1)) : subscript out of bounds
Since I feel like both errors are probably related to the same thing I'm doing wrong, I decided to combine them into one question.
Did you load the {mgcv} package or the {gam} package? The latter doesn't have factor by smooths and as such the first error message is what I would expect if you did library("gam") and then tried to fit the model you showed.
To fit the model you showed, you should restart R and try in a clean session:
library("mgcv")
# load your data
# fit model
gam1 <- gam(H1 ~ Year + s(T, by = Year), data = d)
It could well be that you have both {gam} and {mgcv} loaded, in which case whichever you loaded last will be earlier on the function search path. As both packages have functions gam() and s(), R might just be finding the wrong versions (masking), so you might also try
gam1 <- mgcv::gam(H1 ~ Year + mgcv::s(T, by = Year), data = d)
But you would be better off only loading {mgcv} if you want factor by smooths.
@Gavin Simpson
I did have both loaded, and I tried just using mgcv as you suggested. However, then I get the following error.
Error in names(dat) <- object$term :
'names' attribute [1] must be the same length as the vector [0]
I am assuming this is simply because it's not actually trying to use the gam() function, but rather it attempts to name something gam1. So I would assume I actually need the 'gam' package loaded before I could do this.
The second line of code also doesn't work. I get the following error
Error in model.frame.default(formula = H1 ~ Year + mgcv::s(T, by = Year), :
invalid type (list) for variable 'mgcv::s(T, by = Year)'
This happens no matter the order in which I load the two packages. And if I don't load 'gam', I get the error as described above.
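For reference, a minimal sketch of what a clean, {mgcv}-only session might look like (assuming a data frame d with columns H1, T and Year, as in the question):
# start a fresh R session so {gam} is not attached, then:
library(mgcv)

d$Year <- factor(d$Year)   # factor 'by' smooths need Year to be a factor
gam1 <- gam(H1 ~ Year + s(T, by = Year), data = d)
summary(gam1)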

Issue when importing dataset: Rows that contain more elements/columns than the previous row are divided between two rows

For a project I receive datasets in the form of text files. These text files are generated by the measuring software of a machine. The data in the files is separated by spaces and has no header; example of a row:
Mo 27.06.2022 12:01:11 MP2 mv:(mean. 5s): 4,824 mg/mü org.C
When loading this data using
my_data <- read.table("File.txt", header = FALSE, sep = "", dec = ",", fill=TRUE, na.strings=c("","NA"))
I obtain 9 columns in the following format (example), as intended.
|Mo|27.06.2022|12:01:11|MP2|mv:(mean.| 5s):| 4,824| mg/mü| org.C|
However, sometimes the data set starts with a notification from the machine (example):
Mo 27.06.2022 11:42:04 {SE14} service requestend
When this happens, the 'regular' 9-column rows are split across two rows (example):
Row 1: Mo|27.06.2022|11:58:26|MP1|mv:(mean.| 5s):|
Row 2: 7,858| mg/mü |org.C
How do I tell R not to split these rows in two? As I understand it, this happens because earlier in the text file an input of only 6 columns is recognized.
This is a script that we will use for years to come, so help is greatly appreciated!
I've tried removing the fill argument from the read.table call, I have tried removing the na.strings, and of course I looked for answers on Stack Overflow, but I was not able to find this specific problem.
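One thing that may help: read.table() determines the number of columns from the first five lines unless col.names is longer, so explicitly requesting 9 columns should stop a short notification line from setting the column count. A minimal sketch, with placeholder column names V1..V9:
my_data <- read.table(
  "File.txt",
  header     = FALSE,
  sep        = "",
  dec        = ",",
  fill       = TRUE,
  na.strings = c("", "NA"),
  col.names  = paste0("V", 1:9)   # force 9 columns regardless of the first lines
)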

Python append entry KeyError problem because of missing data from the API

So, I'm trying to collect data from an API to make a dataframe. The problem is that when I get the response in JSON, some of the values are missing for some rows. That means that one row has all 10 out of 10 values and some only have 8 out of 10.
For example, I have this code to fill in the data from the API and then form a DataFrame:
response = r.json()
cols = ['A', 'B', 'C', 'D']
l = []
for entry in response:
    l.append([
        entry['realizationreport_id'],
        entry['suppliercontract_code'],
        entry['rid'],
        entry['ppvz_inn'],
I get this error because in one of the rows the API didn't give a value in response:
KeyError: 'ppvz_inn'
So I'm trying to fix it so that the DataFrame cell is filled with 0 or NaN if the API doesn't have a value for that specific row:
l = []
for entry in response:
    l.append([
        entry['realizationreport_id'],
        entry['suppliercontract_code'],
        entry['rid'],
        entry['ppvz_inn'],
        try:
            entry['ppvz_supplier_name'],
        except KeyError:
            '0',
And now I get this error:
try:
^
SyntaxError: invalid syntax
How do I actually make this work and fill those cells that have no data?
You cannot have a try-except statement in the middle of your append statement.
You could either work with if statements or first fix your JSON data by filling in empty values. You could also use dict.setdefault().
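A minimal sketch of the if-style approach using dict.get(), which returns a default value (here 0) instead of raising KeyError when a key is missing; the field names are the ones from the question:
l = []
for entry in response:
    l.append([
        entry.get('realizationreport_id', 0),
        entry.get('suppliercontract_code', 0),
        entry.get('rid', 0),
        entry.get('ppvz_inn', 0),
        entry.get('ppvz_supplier_name', 0),   # 0 is used when the key is absent
    ])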
Use collections.defaultdict. It's a subclass of dict that does not raise KeyError; instead it creates the requested key on access.
You can convert your existing dict to a defaultdict using unpacking:
from collections import defaultdict

for entry in response:
    entry_defaultdict = defaultdict(list, **entry)
In this case, every lookup of a missing key will create an empty list as that key's value.

How to use arcpullr::get_spatial_layer() and arcpullr::get_layer_by_poly()

I couldn't figure this out through the package documentation https://cran.r-project.org/web/packages/arcpullr/vignettes/intro_to_arcpullr.html.
My code returns the errors described below.
library(arcpullr)
url <- "https://arcgis.deq.state.or.us/arcgis/rest/services/WQ/WBD/MapServer/1"
huc8_1 <- get_spatial_layer(url)
huc8_2 <- get_layer_by_poly(url,geometry = "esriGeometryPolygon")
huc8_1:
Error in if (layer_info$type == "Group Layer") { :
argument is of length zero
huc8_2:
Error in get_sf_crs(geometry) : "sf" %in% class(sf_obj) is not TRUE
Any help explaining the errors or suggesting solutions would be greatly appreciated. Thanks!
I didn't use the arcpullr package. Using leaflet.esri::addEsriFeatureLayer with a where clause works.
See the relevant code below as an example:
leaflet::leaflet() %>%   # addEsriFeatureLayer() is added onto a leaflet map
  leaflet.esri::addEsriFeatureLayer(
    url = "https://arcgis.deq.state.or.us/arcgis/rest/services/WQ/IR_201820_byParameter/MapServer/2",
    options = leaflet.esri::featureLayerOptions(where = IR_where_huc12)
  )
You have to pass an sf object as the second argument to any of the get_layer_by_* functions. I altered your example a bit, using a point instead of a polygon for the spatial query (since it's easier to create), but get_layer_by_poly would work the same way with an sf polygon instead of a point. Also, the service you used requires a token, so I changed the URL to USGS HU 6-digit basins instead:
library(arcpullr)
url <- "https://hydro.nationalmap.gov/arcgis/rest/services/wbd/MapServer/3"
query_pt <- sf_point(c(-90, 45))
# this would query everything in the feature layer, which may or may not be huge
# huc8_1 <- get_spatial_layer(url)
huc8_2 <- get_layer_by_point(url, query_pt)
huc_map <- plot_layer(huc8_2)
huc_map
huc_map + ggplot2::geom_sf(data = query_pt)
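Following the same idea, a minimal sketch of get_layer_by_poly() with a polygon built from plain sf (the coordinates are just an arbitrary box around the query point above):
library(sf)

query_poly <- st_sf(
  geometry = st_sfc(
    st_polygon(list(rbind(
      c(-90.5, 44.5), c(-89.5, 44.5), c(-89.5, 45.5),
      c(-90.5, 45.5), c(-90.5, 44.5)
    ))),
    crs = 4326
  )
)

huc8_3 <- get_layer_by_poly(url, query_poly)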

sqlFetch: table not found on channel when using a regular expression

I am trying to fetch tables from multiple Access files, in which the table I need has a different name each time.
Example:
in Access file 1, table name is "base1"
in Access file 2, table name is "base2"
etc.
I tried the following function, which will later be used within a map function to fetch all Access files from my directory:
fetch <- function(x) {
  y <- odbcConnectAccess2007(x)
  sqlFetch(y, "^base.$")
  odbcCloseAll()
}
R does not seem to like regular expressions in sqlFetch, since I get the following message:
Error in odbcTableExists(channel, sqtable) : ‘^base.$’: table not found on channel
Please note that this works perfectly when I use "base1" as sqtable instead of "^base.$".
Can you help me, please?
I have found the solution to this problem:
library(RODBC)      # odbcConnectAccess2007(), sqlTables(), sqlFetch()
library(stringr)    # str_extract()
library(magrittr)   # %>%

fetch <- function(x) {
  y <- odbcConnectAccess2007(x)
  # find the table whose name starts with "base"
  find_table_name <-
    str_extract(sqlTables(y)$TABLE_NAME, "^(base.*)$") %>%
    na.omit()
  table_result <- sqlFetch(y, find_table_name[1])
  odbcCloseAll()    # close the connection before returning; code placed after return() never runs
  return(table_result)
}
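The map step mentioned in the question could then look something like this (the directory path and file extension are placeholders):
library(purrr)

access_files <- list.files("my_directory", pattern = "\\.accdb$", full.names = TRUE)
all_tables   <- map(access_files, fetch)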