I have an entity predicate, e.g. "Person", with related functional predicates storing attributes about the entity.
E.g.
Person(x), Person:id(x:s) -> string(s).
Person:dateOfBirth[a] = b -> Person(a), datetime(b).
Person:height[a] = b -> Person(a), decimal(b).
Person:eyeColor[a] = b -> Person(a), string(b).
Person:occupation[a] = b -> Person(a), string(b).
What I would like to do, in the terminal, is the equivalent of the SQL query:
SELECT id, dateOfBirth, eyeColor FROM Person
I am aware of the print command for getting the contents of a single functional predicate, but I would like to get a combination of them:
lb print /workspace 'Person:dateOfBirth'
You can use the "lb query" command to execute arbitrary LogiQL queries against your database. Effectively, you create a temporary, anonymous predicate with the results you want to see, and then a rule for populating that predicate, written in the LogiQL language. So in your case it would be something like:
lb query <workspace> '_(id, dob, eye) <-
Person(p),
Person:id(p:id),
Person:dateOfBirth[p] = dob,
Person:eyeColor[p] = eye.'
Try the query command with joins:
lb query /workspace '_(x, y, z) <- Person(p), Person:id(p:x), Person:dateOfBirth[p] = y, Person:eyeColor[p] = z.'
I'm very well-versed in SQL, and an absolute novice in R. Unfortunately, due to an update in company policy, we must use Athena to run our SQL queries. Athena is weak, so despite having a complete/correct SQL query, I cannot run it to manipulate my large, insurance-based dataset.
I have seen similar posts, but haven't managed to crack my own problem trying to utilize the methodologies provided. Here are the details:
After running the SQL block in R (using a connection string), I have a countrywide data block denoted CW_Data in R
Each record contains a policy with a multitude of characteristics (columns) such as the Policy_Number, Policy_Effective_Date, Policy_Earned_Premium
Athena breaks down when I try to add two columns based on the already-existing ones
Namely, I want to left join such that I can obtain new columns for Policy_Prior_Year_Earned_Premium and Policy_Second_Prior_Year_Earned_Premium
Per the above, I know I need to add columns such that, for a given policy, I can find the record where the Policy_Number=Policy_Number and Policy_Effective_Date = Policy_Effective_Date-1 or Policy_Effective_Date-2 years. This is quite simple in SQL, but I cannot get it in R for the life of me.
Here is the (watered-down) left join I attempted in SQL using CTEs that breaks Athena (even if the SQL is run via R):
All_Info as (
Select
PC.Policy_Number
,PC.Policy_Effective_Date
,PC.Policy_EP
from Policy_Characteristics as PC
left join Almost_All_Info as AAI
on AAI.Policy_Number = PC.Policy_Number
and AAI.Policy_Effective_Date = date_add('year', -1, PC.Policy_Effective_Date)
left join All_Segments as AST
on AST.Policy_Number = PC.Policy_Number
and AST.Policy_Effective_Date = date_add('year', -2, PC.Policy_Effective_Date)
Group by
PC.Policy_Number
,PC.Policy_Effective_Date
,PC.Policy_EP
As @zephryl pointed out, examples of the data and the expected result would be very helpful.
From your description, the R equivalent might look like this:
library(dplyr)
library(lubridate) ## datetime helpers
All_Info <-
Policy_Characteristics |>
select(Policy_Number,
Policy_Effective_Date, ## make sure this has class "Date"
Policy_EP
) |>
mutate(one_year_earlier = Policy_Effective_Date - years(1), ## period arithmetic keeps exact calendar dates
two_years_earlier = Policy_Effective_Date - years(2) ## (a duration of 365.25 days would break the equality joins below)
) |>
left_join(Almost_All_Info,
by = c('Policy_Number' = 'Policy_Number',
'one_year_earlier' = 'Policy_Effective_Date'
)
) |>
left_join(All_Segments,
by = c('Policy_Number' = 'Policy_Number',
'two_years_earlier' = 'Policy_Effective_Date'
)
) |>
group_by(Policy_Number,
Policy_Effective_Date,
Policy_EP
)
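One caveat: a trailing group_by() on its own does not collapse any rows. If the GROUP BY in the original SQL was only there to deduplicate, distinct() is the closer translation. A sketch of that alternative ending (.keep_all = TRUE retains the premium columns brought in by the joins):
All_Info_dedup <- All_Info |>
ungroup() |> ## drop the grouping first
distinct(Policy_Number, Policy_Effective_Date, Policy_EP, .keep_all = TRUE)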
Given a SQL query:
SELECT *
FROM Database..Pizza pizza
JOIN Database..Toppings toppings ON pizza.ToppingId = toppings.Id
WHERE toppings.Name LIKE '%Mushroom%' AND
toppings.GlutenFree = 0 AND
toppings.ExtraFee = 1.25 AND
pizza.Location = 'Minneapolis, MN'
How do you determine what index to write to improve the performance of the query? (Assume every value to the right of the equals sign is calculated at runtime.)
Is there a built-in SQL command to suggest the proper index?
To me, it gets confusing when there are multiple JOINs that use fields from both tables.
For this query:
SELECT *
FROM Database..Pizza p JOIN
Database..Toppings t
ON p.ToppingId = t.Id
WHERE t.Name LIKE '%Mushroom%' AND
t.GlutenFree = 0 AND
t.ExtraFee = 1.25 AND
p.Location = 'Minneapolis, MN';
You basically have two options for indexes:
Pizza(Location, ToppingId) and Toppings(Id)
or:
Toppings(GlutenFree, ExtraFee, Name, Id) and Pizza(ToppingId, Location)
Which works better depends on how selective the different conditions are in the WHERE clause.
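For reference, a sketch of what creating those indexes could look like (the index names are made up, and the exact syntax varies by database):
-- option 1: seek on Pizza first, then look up Toppings by primary key
CREATE INDEX IX_Pizza_Location ON Pizza (Location, ToppingId);
-- Toppings(Id) is usually already covered by the primary key

-- option 2: filter Toppings first, then join back to Pizza
CREATE INDEX IX_Toppings_Filter ON Toppings (GlutenFree, ExtraFee, Name, Id);
CREATE INDEX IX_Pizza_ToppingId ON Pizza (ToppingId, Location);
Note that the leading wildcard in LIKE '%Mushroom%' prevents a seek on Name, so Name mostly acts as a covering column there rather than a search key.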
I have a table of employees similar to this:
Department | Data
A | [{"name":"John", "age":10, "job":"Manager"},{"name":"Eli", "age":40, "job":"Worker"},{"name":"Sam", "age":32, "job":"Manager"}]
B | [{"name":"Jack", "age":50, "job":"CEO"},{"name":"Mike", "age":334, "job":"CTO"},{"name":"Filip", "age":63, "job":"Worker"}]
I want to get the department, name, and age of all employees, something similar to this:
Department | Data
A | [{"name":"John", "age":10},{"name":"Eli", "age":40},{"name":"Sam", "age":32}]
B | [{"name":"Jack", "age":50},{"name":"Mike", "age":334},{"name":"Filip", "age":63}]
How can I achieve this using a SQL query?
I assume you are using Hive/Spark and the datatype of the column is an array of maps.
This uses the explode, collect_list and map functions:
select dept,collect_list(map("name",t.map_elem['name'],"age",t.map_elem['age'])) as res
from tbl
lateral view explode(data) t as map_elem
group by dept
Note that this would not be as performant as a Spark solution or a UDF with which you can access the required keys in an array of maps without a function like explode.
One more way to do this is with the Spark SQL functions transform and map_filter (the latter is only available starting with Spark 3.0.0).
spark.sql("select dept,transform(data, map_elem -> map_filter(map_elem, (k, v) -> k != \"job\")) as res from tbl")
Another option with Spark versions 2.4 and later is using the function element_at with transform and selecting the required keys.
spark.sql("select dept," +
"transform(data, map_elem -> map(\"name\",element_at(map_elem,\"name\"),\"age\",element_at(map_elem,\"age\"))) as res " +
"from tbl")
I'd get your table into tabular format first, with one row per employee:
Department | Name | Age | Job
Then it's a simple projection:
SELECT Department, Name, Age
FROM Employee
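If the Data column is an array of maps as in the answer above, the flattening step itself could look like this (a sketch, assuming Hive/Spark SQL and an Employee table with the Department and Data columns from the question):
SELECT Department,
map_elem['name'] AS Name,
map_elem['age'] AS Age,
map_elem['job'] AS Job
FROM Employee
LATERAL VIEW explode(Data) t AS map_elem;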
As an example, let's say I load two different files into a Pig script:
A = LOAD 'file1' USING PigStorage('\t') AS (
day:chararray,
month:chararray,
year:chararray,
message:chararray);
B = LOAD 'file2' USING PigStorage('\t') AS (
month:chararray,
day:chararray,
year:chararray,
message:chararray);
Now, notice the order of the fields is different, so if I combine them into one relation with C = UNION A, B; I get...
(2,OCT,2013,INFO INVALID USERNAME)
(OCT,3,2013,WARN STACK OVERFLOW)
If for no other reason than to make the data easier to read, I'd like to reorder the fields, so that both of them follow a common format and have the same positional notation for each field.
(2,OCT,2013,INFO INVALID USERNAME)
(3,OCT,2013,WARN STACK OVERFLOW)
This also crops up in a few other places with messages, levels, hosts, etc. It's not just date fields, I'd like to make everything "prettier" all around.
In some weird pseudo-code, I'd be looking for something like:
D = FOREACH B
REORDER (month,day,year) TO (day,month,year);
I haven't been able to find an example of anyone trying to do this and don't see a function that would do it. So maybe it's not possible and I'm alone here, but if anyone has any ideas I'd appreciate some hints.
In general, this is not necessary in Pig because you can just refer to fields by name and not worry about their position in the record. If your goal is to do a UNION of the two relations, you can achieve this by using the ONSCHEMA keyword:
C = UNION ONSCHEMA A, B;
That said, if you do really need to reorder a relation, a simple FOREACH...GENERATE is all you need:
D = FOREACH B GENERATE day, month, year, message;
Note that in your example, you are not actually working with tuples; you are working with entire records. If you did have a tuple, though, you can use the TOTUPLE built-in UDF to get where you need to go:
DESCRIBE E;
E: {t: (month: chararray,day: chararray,year: chararray,message: chararray)}
F = FOREACH E GENERATE TOTUPLE(t.day, t.month, t.year, t.message) AS t;
DESCRIBE F;
F: {t: (day: chararray,month: chararray,year: chararray,message: chararray)}
I have several SQL queries that I simply want to fire at the database.
I am using Hibernate throughout the whole application, so I would prefer to use Hibernate to call these SQL queries.
In the example below I want to get count + name, but can't figure out how to get that info when I use createSQLQuery().
I have seen workarounds where people only need to get a single "count()" out of the result, but in this case I am using count() + a column as output:
SELECT count(*), a.name as count FROM user a
WHERE a.user_id IN (SELECT b.user_id FROM user b)
GROUP BY a.name
HAVING COUNT(*) BETWEEN 2 AND 5;
FYI, the above query would deliver a result like this if I call it directly on the database:
1, John
2, Donald
1, Ralph
...
Alternatively, you can use
SQLQuery query = session.createSQLQuery("SELECT count(*) as num, a.name as name FROM user a WHERE a.user_id IN (SELECT b.user_id FROM user b) GROUP BY a.name HAVING COUNT(*) BETWEEN 2 AND 5");
query.addScalar("num", Hibernate.INTEGER).addScalar("name", Hibernate.STRING);
// you might need to use org.hibernate.type.StandardBasicTypes.INTEGER / STRING
// for Hibernate v3.6+,
// see https://hibernate.onjira.com/browse/HHH-5138
List<Object[]> result = query.list();
// with two scalars, each list element is an Object[] row:
// result.get(i)[0] -> i-th row num, result.get(i)[1] -> i-th row name
I'm using this when under time pressure; IMO it's much faster to code than creating your own beans & transformers.
Cheers!
Jakub
Cheers for the info Thomas, worked wonderfully for generating objects.
The problem I had with my initial query was that "count" was a reserved word :P
When I changed the name to something else it worked.
If your SQL statement looks like this SELECT count(*) as count, a.name as name... you could use setResultTransformer(new AliasToBeanResultTransformer(YourSimpleBean.class)) on your Query.
Where YourSimpleBean has the fields Integer count and String name, with the corresponding setters setCount and setName.
On execution of the query with query.list(), Hibernate will return a List of YourSimpleBeans.
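A minimal sketch of what that could look like, assuming Hibernate 3.x, an open Session named session, and an abbreviated version of the query from this thread (per the comment above, rename the "count" alias and the matching bean property if "count" is reserved in your database):
// imports: org.hibernate.SQLQuery, org.hibernate.Hibernate,
//          org.hibernate.transform.AliasToBeanResultTransformer
public class YourSimpleBean {
    private Integer count;
    private String name;
    public void setCount(Integer count) { this.count = count; } // matched to alias "count"
    public void setName(String name) { this.name = name; }      // matched to alias "name"
    public Integer getCount() { return count; }
    public String getName() { return name; }
}

// usage:
SQLQuery query = session.createSQLQuery(
        "SELECT count(*) as count, a.name as name FROM user a GROUP BY a.name");
query.addScalar("count", Hibernate.INTEGER)  // declare the scalar aliases and their types
     .addScalar("name", Hibernate.STRING)
     .setResultTransformer(new AliasToBeanResultTransformer(YourSimpleBean.class));
List<YourSimpleBean> beans = query.list();   // one bean per result row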