How to run SPARQL queries in R (WBG Topical Taxonomy) without parallelization

I am an R user and I am interested in using the World Bank Group (WBG) Topical Taxonomy through SPARQL queries.
This can be done directly against the API at https://vocabulary.worldbank.org/PoolParty/sparql/taxonomy, but it can also be done in R by using load.rdf (to load the taxonomy.rdf RDF/XML file downloaded from https://vocabulary.worldbank.org/) and then sparql.rdf to perform the query. Both functions are available in the "rrdf" package.
These are the three lines of code:
taxonomy_file <- load.rdf("taxonomy.rdf")
query <- "SELECT DISTINCT ?nodeOnPath WHERE { <http://vocabulary.worldbank.org/taxonomy/435> <http://www.w3.org/2004/02/skos/core#narrower>* ?nodeOnPath }"
result_query_1 <- sparql.rdf(taxonomy_file,query)
What I obtain in result_query_1 is exactly what I get through the API.
However, the load.rdf function uses all the cores available on my machine, not just one. It somehow parallelizes the loading task across every available core, and I do not want that. I haven't found any option in that function to force serial execution.
Therefore, I am trying to find other solutions. For instance, I have tried rdf_parse and rdf_query from the "rdflib" package, but without any encouraging result. These are the code lines I have used:
taxonomy_file <- rdf_parse("taxonomy.rdf")
query <- "SELECT DISTINCT ?nodeOnPath WHERE { <http://vocabulary.worldbank.org/taxonomy/435> <http://www.w3.org/2004/02/skos/core#narrower>* ?nodeOnPath }"
result_query_2 <- rdf_query(taxonomy_file, query = query)
Is there any other function that performs this task? The objective of my work is to run several queries simultaneously using foreach.
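Whatever R function ends up doing the single-threaded load, the foreach-style fan-out itself is straightforward. As a language-agnostic illustration, here is a minimal stdlib-only Python sketch of the same pattern; run_query is a stand-in for sparql.rdf/rdf_query (it only builds the query text, so the sketch runs anywhere):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a real SPARQL call such as sparql.rdf() or rdf_query();
# here it just returns the query text for the given concept URI.
def run_query(concept_uri):
    return (f"SELECT DISTINCT ?nodeOnPath WHERE {{ <{concept_uri}> "
            f"<http://www.w3.org/2004/02/skos/core#narrower>* ?nodeOnPath }}")

concepts = [f"http://vocabulary.worldbank.org/taxonomy/{i}" for i in (435, 436, 437)]

# foreach-style fan-out: each query runs in its own worker thread,
# and pool.map preserves the input order of the results.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_query, concepts))

print(len(results))
```

The same shape carries over to R's foreach/%dopar%: one worker per query, results collected in input order.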
Thank you very much for any suggestion you could provide me.

Related

Is there a way to execute a text Gremlin query with PartitionStrategy

I'm looking for an implementation to run a text query, e.g. "g.V().limit(1).toList()", while using the PartitionStrategy in Apache TinkerPop.
I'm attempting to build a REST interface to run queries on selected graph partitions only. I know how to run a raw query using Client, but I'm looking for an implementation where I can create a multi-tenant graph (https://tinkerpop.apache.org/docs/current/reference/#partitionstrategy) and query only selected tenants using a raw text query instead of a GLV. I'm able to query only selected partitions using gremlin-python, but I could not find a reference implementation for running a text query against a single tenant.
Here is the tenant query implementation:
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.strategies import PartitionStrategy

connection = DriverRemoteConnection('ws://megamind-ws:8182/gremlin', 'g')
g = traversal().withRemote(connection)
partition = PartitionStrategy(partition_key="partition_key",
                              write_partition="tenant_a",
                              read_partitions=["tenant_a"])
partitioned_g = g.withStrategies(partition)
x = partitioned_g.V().limit(1).next()  # <---- query on partition only
Here is how I execute a raw query on the entire graph, but I'm looking for an implementation to run text-based queries on only selected partitions:
from gremlin_python.driver import client
client = client.Client('ws://megamind-ws:8182/gremlin', 'g')
results = client.submitAsync("g.V().limit(1).toList()").result().one() <-- runs on entire graph.
print(results)
client.close()
Any suggestions appreciated. TIA!
It depends on how the backend store handles text mode queries, but for the query itself, essentially you just need to use the Groovy/Java style formulation. This will work with GremlinServer and Amazon Neptune. For other backends you will need to make sure that this syntax is supported. So from Python you would use something like:
client.submit("""
    g.withStrategies(new PartitionStrategy(partitionKey: "_partition",
                                           writePartition: "b",
                                           readPartitions: ["b"])).V().count()""")
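Since only the strategy arguments vary per tenant, the Groovy text can be assembled per request before submitting it. A minimal, illustrative sketch (the helper name and the "_partition" key are assumptions, not part of the TinkerPop API); it only builds the string, so it runs without a Gremlin server:

```python
def tenant_query(tenant, body, partition_key="_partition"):
    """Wrap a raw Gremlin text query so it only reads the given tenant's
    partition. This builds the Groovy string; submit it with client.submit()."""
    strategy = (f'new PartitionStrategy(partitionKey: "{partition_key}", '
                f'writePartition: "{tenant}", '
                f'readPartitions: ["{tenant}"])')
    return f"g.withStrategies({strategy}).{body}"

# Example: scope the original raw query to tenant_a only
q = tenant_query("tenant_a", "V().limit(1).toList()")
print(q)
```

Note that the tenant name is interpolated into executable Groovy text, so in a real REST interface it must come from a trusted whitelist, never from raw user input.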

PyPika control order of with clauses

I am using PyPika (version 0.37.6) to create queries to be used in BigQuery. I am building up a query that has two WITH clauses, and one clause is dependent on the other. Due to the dynamic nature of my application, I do not have control over the order in which those WITH clauses are added to the query.
Here is example working code:
a_alias = AliasedQuery("a")
b_alias = AliasedQuery("b")
a_subq = Query.select(Term.wrap_constant("1").as_("z")).select(Term.wrap_constant("2").as_("y"))
b_subq = Query.from_(a_alias).select("z")
q = Query.with_(a_subq, "a").from_(a_alias).select(a_alias.y)
q = q.with_(b_subq, "b").from_(b_alias).select(b_alias.z)
sql = q.get_sql(quote_char=None)
That generates a working query:
WITH a AS (SELECT '1' z,'2' y) ,b AS (SELECT a.z FROM a) SELECT a.y,b.z FROM a,b
However, if I add the b WITH clause first, then since a is not yet defined, the resulting query:
WITH b AS (SELECT a.z FROM a), a AS (SELECT '1' z,'2' y) SELECT a.y,b.z FROM a,b
does not work. Since BigQuery does not support WITH RECURSIVE, that is not an option for me.
Is there any way to control the order of the WITH clauses? I see the _with list in the QueryBuilder (the type of variable q), but since that's a private variable, I don't want to rely on that, especially as new versions of PyPika may not operate the same way.
One way I tried to do this is to always insert the first WITH clause at the beginning of the _with list, like this:
q._with.insert(0, q._with.pop())
Although this works, I'd like to use a PyPika supported way to do that.
In a related question, is there a supported way within PyPika to see what has already been added to the select list or other parts of the query? I noticed the q.selects member variable, but selects is not part of the public documentation. Using q.selects did not actually work for me when using our project's Python version (3.6) even though it did work in Python 3.7. The code I was trying to use is:
if any(field.name == "date" for field in q.selects if isinstance(field, Field))
The error I got was as follows:
def __getitem__(self, item: slice) -> "BetweenCriterion":
if not isinstance(item, slice):
> raise TypeError("Field' object is not subscriptable")
Thank you in advance for your help.
I could not figure out how to control the order of the WITH clauses after calling query.with_() (except for the hack already noted). As a result, I restructured my application to get around this problem. I am now calling query.with_() before building up the rest of the query.
This also made my related question moot, because I no longer need to see what I've already added to the query.
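If restructuring the application had not been an option, another route would have been to topologically sort the WITH clauses by their dependencies before making any with_() calls. A minimal stdlib sketch (the clause names and the dependency map are illustrative; in the real code the PyPika subqueries would be looked up by name in the resulting order):

```python
from graphlib import TopologicalSorter

# clause name -> names of other WITH clauses it references (illustrative)
deps = {"a": set(), "b": {"a"}}

# static_order() yields every clause after all of its dependencies;
# call query.with_(subquery, name) in exactly this order.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

graphlib is in the standard library from Python 3.8+ (TopologicalSorter from 3.9), so this stays within supported, documented APIs rather than touching PyPika's private _with list.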

Best design patterns for refactoring code without breaking other parts

I have some PHP code from an application that was written using Laravel. One of the modules was written quite poorly. The controller of this module has a whole bunch of reporting functions which use the functions defined inside the model object to fetch data, and the functions inside the model object are super messy.
Following is a list of some of the functions from the controller (ReportController.php) and the model (Report.php)(I'm only giving names of functions and no implementations as my question is design related)
Functions from ReportController.php
questionAnswersReportPdf()
fetchStudentAnswerReportDetail()
studentAnswersReport()
wholeClassScoreReportGradebookCSV()
wholeClassScoreReportCSV()
wholeClassScoreReportGradebook()
formativeWholeClassScoreReportGradebookPdf()
wholeClassScoreReport()
fetchWholeClassScoreReport()
fetchCurriculumAnalysisReportData()
curriculumAnalysisReportCSV()
curriculumAnalysisReportPdf()
studentAnswersReportCSV()
fetchStudentScoreReportStudents()
Functions from Report.php
getWholeClassScoreReportData
getReportsByFilters
reportMeta
fetchCurriculumAnalysisReportData
fetchCurriculumAnalysisReportGraphData
fetchCurriculumAnalysisReportUsersData
fetchTestHistoryClassAveragesData
fetchAllTestHistoryClassAveragesData
fetchAllTestHistoryClassAveragesDataCsv
fetchHistoryClassAveragesDataCsv
fetchHistoryClassAveragesGraphData
The business logic has been written in quite a messy way also. Some parts of it are in the controller while other parts are in the model object.
I have 2 specific questions :
a) I have an ongoing goal of reducing code complexity and optimizing code structure. How can I leverage common OOP design patterns to ensure altering the code in any given report does not negatively affect the other reports? I specifically want to clean up the code for some critical reports first but want to ensure that by doing this none of the other reports will break.
b) The reporting module is relatively static in definition and unlikely to change over time. The majority of reports generated by the application involve nested sub-queries as well as standard grouping & filtering options. Most of these SQL queries are housed within the functions of the model object and contain some really complex joins. Without spending time evaluating the database structure or table indices, which solution architecture techniques would you recommend for scaling the report functionality to ensure optimized performance? Below is a snippet of one of the SQL queries:
$sql = 'SELECT "Parent"."Id",
"Parent"."ParentId",
"Parent"."Name" as systemStandardName,
string_agg(DISTINCT((("SubsectionQuestions"."QuestionSerial"))::text) , \', \') AS "quesions",
count(DISTINCT("SubsectionQuestions"."QuestionId")) AS "totalQuestions",
case when sum("SQUA"."attemptedUsers")::float > 0 then
(COALESCE(round((
(
sum(("SQUA"."totalCorrectAnswers"))::float
/
sum("SQUA"."attemptedUsers")::float
)
*100
)::numeric),0))
else 0 end as classacuracy,
case when sum("SQUA"."attemptedUsers")::float > 0 then
(COALESCE((round(((1 -
(
(
sum(("SQUA"."totalCorrectAnswers"))::float
/
sum("SQUA"."attemptedUsers")::float
)
)
)::float * count(DISTINCT("SubsectionQuestions"."QuestionId")))::numeric,1)),0))
else 0 end as pgain
FROM "'.$gainCategoryTable.'" as "Parent"
'.$resourceTableJoin.'
INNER JOIN "SubsectionQuestions"
ON "SubsectionQuestions"."QuestionId" = "resourceTable"."ResourceId"
INNER JOIN "Subsections"
ON "Subsections"."Id" = "SubsectionQuestions"."SubsectionId"
LEFT Join (
Select "SubsectionQuestionId",
count(distinct case when "IsCorrect" = \'Yes\' then CONCAT ("UserId", \' \', "SubsectionQuestionId") else null end) AS "totalCorrectAnswers"
, count(distinct CONCAT ("UserId", \' \', "SubsectionQuestionId")) AS "attemptedUsers"
From "SubsectionQuestionUserAnswers"';
if(!empty($selectedUserIdsArr)){
$sql .= ' where "UserId" IN (' .implode (",", $selectedUserIdsArr).')' ;
}else {
$sql .= ' where "UserId" IN (' .implode (",", $assignmentUsers).')' ;
}
$sql .= ' AND "AssessmentAssignmentId" = '.$assignmentId.' AND "SubsectionQuestionId" IN ('.implode(",", $subsectionQuestions).') Group by "SubsectionQuestionId"
) as "SQUA" on "SQUA"."SubsectionQuestionId" = "SubsectionQuestions"."Id"
INNER JOIN "AssessmentAssignment"
ON "AssessmentAssignment"."assessmentId" = "Subsections"."AssessmentId"
INNER JOIN "AssessmentAssignmentUsers"
ON "AssessmentAssignmentUsers"."AssignmentId" = "AssessmentAssignment"."Id"
AND "AssessmentAssignmentUsers"."Type" = \'User\'
'.$conditaionlJoin.'
WHERE "Parent"."Id" IN ('.implode(',', $ssLeaf).')
'.$conditionalWhere.'
GROUP BY "Parent"."Id",
"Parent"."ParentId",
"Parent"."Name"
'.$sorter;
$results = DB::select(DB::raw($sql));
My take on a): In my experience, when I want to reduce code complexity or sheer messiness, I slowly refactor out code that violates the single responsibility principle while I'm already working in that area for a bug fix or a feature update. I try not to spend hours upon hours updating "working" code that I'm not actively changing for a business reason. Follow the "leave it better than you found it" approach as you work in this code base and it will get better over time, while you keep features and bug fixes going out the door and business owners/project managers happy.
About a): The first thing I do to ensure none of my refactoring breaks anything is to cover the code with, at least, unit tests (doing TDD ensures the most optimal coverage). It's easier when, as @DavidY says, you respect principles like SRP (does my class try to answer too many problems?). With tests you'll feel safer when you need to refactor, and the tests will tell you exactly where things broke.
About b): Do not optimize until you need to, and optimize only when you know what costs you the most. That is the best way to know which pattern you need; otherwise you may try to force the wrong solution onto the wrong problem.
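The separation both answers point at can be sketched abstractly. Below is a minimal, illustrative Python sketch (all class and method names are invented, not taken from the application) showing one report isolated behind its own data-access and formatting objects, so changing that report's query cannot break the others:

```python
class CurriculumReportQuery:
    """Owns the data access for exactly one report (single responsibility)."""
    def __init__(self, db):
        self.db = db

    def rows(self, class_id):
        # In the real app this would run the report's SQL via the ORM;
        # here db is any object exposing a fetch() method, so it is
        # trivially replaceable with a stub in unit tests.
        return self.db.fetch(class_id)

class CsvFormatter:
    """Owns presentation only; reusable across all reports."""
    def render(self, rows):
        return "\n".join(",".join(map(str, r)) for r in rows)

class StubDb:
    """Test double standing in for the database layer."""
    def fetch(self, class_id):
        return [("Algebra", 12), ("Geometry", 9)]

report = CurriculumReportQuery(StubDb())
csv = CsvFormatter().render(report.rows(class_id=1))
print(csv)
```

The same shape maps onto Laravel: one query/repository class per report, thin controllers, and formatters (CSV/PDF/Gradebook) shared across reports, each piece testable in isolation.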

How to set an SQL parameters in Apps Scripts and BigQuery

I am trying to avoid SQL injection. This topic has been dealt with in Java (How to prevent query injection on Google Big Query) and PHP.
How is this accomplished in App Scripts? I did not find how to add a parameter to a SQL statement. Here is what I had hoped to do:
var sql = 'SELECT [row],etext,ftext FROM [hcd.hdctext] WHERE (REGEXP_MATCH(etext, esearch = ?) AND REGEXP_MATCH(ftext, fsearch = ?));';
var queryResults;
var resource = {
  query: sql,
  timeoutMs: 1000,
  esearch: 'r"[^a-zA-z]comfortable"',
  fsearch: 'r"[a-z,A-z]confortable"'
};
queryResults = BigQuery.Jobs.query(resource,projectNumber);
And then have esearch and fsearch filled in with the values (which could be set elsewhere).
That does not work, according to the doc.
Any suggestions on how to get a parameter in an SQL query? (I could not find a setString function...)
Thanks!
Unfortunately, BigQuery doesn't support this type of parameter substitution. It is on our list of features to consider, and I'll bump the priority since it seems like this is a common request.
The only suggestion I can make in the meantime is that if you are building query strings by hand, you will need to make sure you escape them carefully (which is a non-trivial operation).
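As a rough illustration of why hand-escaping is non-trivial, here is a minimal sketch (shown in Python for brevity) that handles only backslashes and single quotes; it is deliberately incomplete and not a substitute for real parameter binding where available:

```python
def escape_sql_string(value):
    """Escape a value for use inside a single-quoted SQL string literal.

    Handles backslashes and single quotes only; a complete escaper must
    also consider the engine's quoting dialect, control characters,
    multi-byte encodings, etc.
    """
    return value.replace("\\", "\\\\").replace("'", "\\'")

user_input = "O'Brien"
sql = "SELECT * FROM t WHERE name = '%s'" % escape_sql_string(user_input)
print(sql)
```

Even this tiny example already has dialect-specific assumptions baked in (backslash escaping), which is exactly why building query strings by hand is easy to get wrong.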

From within a grails HQL, how would I use a (non-aggregate) Oracle function?

If I were retrieving the data I wanted from a plain sql query, the following would suffice:
select * from stvterm where stvterm_code > TT_STUDENT.STU_GENERAL.F_Get_Current_term()
I have a grails domain set up correctly for this table, and I can run the following code successfully:
def a = SaturnStvterm.findAll("from SaturnStvterm as s where id > 201797") as JSON
a.render(response)
return false
In other words, I can hardcode the result of the Oracle function and the HQL runs correctly, but it chokes on every way I can figure out to call the function itself. I have read through some of the Hibernate documentation on using procs and functions, but I'm having trouble making much sense of it. Can anyone give me a hint as to the proper way to handle this?
Also, since I think it is probably relevant, there aren't any synonyms in place that would allow the function to be called without qualifying it as schema.package.function(). I'm sure that'll make things more difficult. This is all for Grails 1.3.7, though I could use a later version if needed.
To call a function in HQL, the SQL dialect must be aware of it. You can add your function at runtime in BootStrap.groovy like this:
import org.hibernate.dialect.function.SQLFunctionTemplate
import org.hibernate.Hibernate
def dialect = applicationContext.sessionFactory.dialect
def getCurrentTerm = new SQLFunctionTemplate(Hibernate.INTEGER, "TT_STUDENT.STU_GENERAL.F_Get_Current_term()")
dialect.registerFunction('F_Get_Current_term', getCurrentTerm)
Once registered, you should be able to call the function in your queries:
def a = SaturnStvterm.findAll("from SaturnStvterm as s where id > TT_STUDENT.STU_GENERAL.F_Get_Current_term()")