Is there a way to represent temporal data in RDFs? - semantics

I have bunch of temporal data which I want to convert to RDF format. Is there any accepted way of doing so?
Example of tabulated data which should be somehow converted into RDF format:
| Name | Date | Salary |
-----------------------------------
| John | Jan 2012 | 3,244 |
| John | Feb 2012 | 4,012 |
| John | Mar 2012 | 3,112 |
Found one way to do it, however it is rather cubersome and introduce very large vocabularies. Assuming syntax (Subject, Predicate, Object)
(JohnJan2012, date, Jan 2012)
(JohnJan2012, name, John)
(JohnJan2012, salary, 3244)
Does anyone know of a better way to do it?

You can use bNodes here and do, in Turtle syntax, something like:
#prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
:john :salary [
:amount "3244.0"^^xsd:decimal;
:date "2012-01-01T00:00:00"^^xsd:datetime;
] .
:john :salary [
:amount "4012.0"^^xsd:decimal;
:date "2012-02-01T00:00:00"^^xsd:datetime;
] .
Here I am creating two records of those examples you gave. the [] syntax creates a blank node that is basically a node without a name (URI). From each of these blank nodes in the example we have two pieces of information date and amount.
Also, make sure you use valid xsd:datetime dates if you want to use SPARQL later on to query your data.

For a fuller discussion of the different ways you can represent time in RDF see Ian Davis'
s excellent blog post series on this topic

Related

Filtering out holidays in Splunk

I am attempting to use a search lookup table csv to filter out holidays for some Splunk queries.
To do this, I created a holidays.csv in the following style:
dateof,dateafter,description
01/17/2022,01/18/2022,MLK Day 22
02/21/2022,02/22/2022,Presidents Day 22
05/30/2022,05/31/2022,Memorial Day 22
[...]
Some of the queries run the day after the holiday, which is why I created a dateof and datefater field. I am trying to pipe the condition onto the end of the existing queries.
environment=staging "message=\"This line would contain the original search query"
| eval date=strftime(_time,"%m/%d/%Y")
| search NOT [ inputlookup holidays.csv | fields dateof ]
Note that an Event from the original query will look something like this:
time=2022-08-31T12:01:39,495Z [...] message="This line would contain the original search query"
Despite giving read access to the csv, the above condition does not filter out anything, regardless of whether it is a holiday listed in the csv file or not.
I suspect something is missing. However, I have a limited knowledge of the Spunk querying language. Would anyone be able to give guidance on this? Thanks in advance!
Subsearches can be tricky. If the result of the subsearch isn't just right, the query will fail. It helps to run the subsearch by itself to confirm the string produced makes sense as part of a query. In this case, check that
| inputlookup holidays.csv | fields dateof | rename dateof as date | format
produces something that works with search NOT.
An alternative to try is to explicitly look for the date field in the lookup.
environment=staging "message=\"This line would contain the original search query"
| eval date=strftime(_time,"%m/%d/%Y")
| where NOT [ | inputlookup holidays.csv | fields dateof | rename dateof as date | format ]
Here is another way to do it without a subsearch. A null description field tells us the date was not found in the lookup file and so is not a holiday.
environment=staging "message=\"This line would contain the original search query"
| eval date=strftime(_time,"%m/%d/%Y")
| lookup holidays.csv dateof as date output description
| where isnull(description)

How to compare value with multiple modified values from another table in BigQuery?

I am using Google BigQuery and I got the following issue:
I have a table (A) like this:
| time | request |
|------------------------|-----------------|
|2019-09-24 11:10:00 UTC | fakewebsite.com |
|2019-09-24 11:10:00 UTC | realwebsite.com |
|........................|.................|
|2019-09-24 11:10:00 UTC | foobwebsite.com |
|2019-09-24 11:10:00 UTC | barrwebsite.com |
And another table (B) like this:
| blacklist |
|---------------|
| foo.com |
| ... |
| bar.com |
I want to make a query that will grab a modified version of the values inside the blacklist field of table B as follows:
SPLIT(NET.REG_DOMAIN(blacklist), CONCAT('.',NET.PUBLIC_SUFFIX(blacklist)))[OFFSET(0)] AS to_exclude --this will return only "foo" from "foo.com"
and then return all values from the request field of table A where none of the to_exclude was found.
I know how to do this for one value but I don't know how to do this for multiple. I am looking for something like the following:
#standardSQL
WITH tmp_blacklist AS
(SELECT
SPLIT(NET.REG_DOMAIN(blacklist), CONCAT('.',NET.PUBLIC_SUFFIX(blacklist)))[OFFSET(0)] AS to_exclude
FROM
mydataset.B)
SELECT
request
FROM
mydataset.A
WHERE
request NOT LIKE ("%value1%", "%value2%", ..., "%valuen%") -- I can't use OR along with the NOT LIKE since the values are too many and they will change.
The n values are the values of the tmp_blacklist table.
Also if I don't define the table with the WITH and I define it after the NOT LIKE I am going to get the following error: Scalar subquery produced more than one element which makes sense if LIKE expects only one element. But then again that's half of the job done if it get's fixed since I want the "%value%" and not just the value of the table.
Now I searched online for a way to do this and I found people saying that it can't be done and then some workarounds with combinations of LIKE and IN where people said it will be very slow if one of the tables grows to have tons of data(my case).
What is the best way to do this?
One method uses not exists:
SELECT a.request
FROM mydataset.A a
WHERE NOT EXISTS (SELECT 1
FROM tmp_blacklist bl
WHERE a.request LIKE CONCAT('%', bl.to_exclude, '%'
);
Note that this can be expensive. You might want to test constructing the exclusion string as:
'value1|value2|value3'
and then using regular expressions.

Querying array of text in postgres

I have an array type I want to store in Postgres. One of the major use cases I have is to see if any of the records has an array which has a string in it.
eg.
| A | ["NY", "Paris", "Milan"] |
| B | ["Paris", "NY"] |
| C | [] |
| D | ["Milan"] |
Does there exist a row with Paris in the array? Which rows have Milan in the array? and so on.
I have 2 options on how to store the column. I can either make it a type text[] or convert it into a json as {"cities": ["NY", "Paris", "Milan"]} and then store as a JSONB field
However, I am not sure what would allow the fastest querying for the use case I have. Is there any one obviously better way of doing this? Am I tying myself down in any way by choosing one over the other? If I choose one over the other then how can I query the DB?
As you seem to be storing simple lists of values, I would recommend to use datataype Array over JSON, which better fits more complex cases (nested datastructures, associative arrays, ...).
To check for the value of an element at any position in the array, you can use array function ANY().
Here is a query that will return all records where the array stored in column cities contains 'Paris' :
SELECT t.* FROM mytable t WHERE 'Paris' = ANY(t.cities);
Yields :
id cities
---------------------------
A ["NY","Paris","Milan"]
B ["Paris","NY"]
Demo on DB Fiddle
For more information :
Postgres Arrays Documentation
Postgres Arrays Tutorial
I've noticed it is better to query JSONB, if it is a simple key-value store.
As in for instance you want to store arbitrary info on a row that your not sure what the columns(keys) would be.
info = {"a":"apple", "b":"ball"}
For use cases like yours, it would be better if you could design the db with simple tables so you could use JOINS and Indexes to your advantage.
You could restructure the tables like :
Location
id | name
----------
1 | Paris
2 | NY
3 | Milan
Other Table (with foreign key on location table)
user | location_id
--------------------
A | 1
A | 3
B | 2
Using these set of tables it would be easy to query all users with location paris using JOINS.

sqlite variable and unknown number of entries in column

I am sure this question has been asked before, but I'm so new to SQL, I can't even combine the correct search terms to find an answer! So, apologies if this is a repetition.
The db I'm creating has to be created at run-time, then the data is entered after creation. Some fields will have a varying number of entries, but the number is unknown at creation time.
I'm struggling to come up with a db design to handle this variation.
As an (anonymised) example, please see below:
| salad_name | salad_type | salad_ingredients | salad_cost |
| apple | fruity | apple | cheap |
| unlikely | meaty | sausages, chorizo | expensive |
| normal | standard | leaves, cucumber, tomatoes | mid |
As you can see, the contents of "salad_ingredients" varies.
My thoughts were:
just enter a single, comma-separated string and separate at run-time. Seems hacky, and couldn't search by salad_ingredients!
have another table, for each salad, such as "apple_ingredients", which could have a varying number of rows for each ingredient. However, I can't do this, because I don't know the salad_name at creation time! :(
Have a separate salad_ingredients table, where each row is a salad_name, and there is an arbitrary number of ingredients fields, say 10, so you could have up to 10 ingredients. Again, seems slightly hacky, as I don't like to unused fields, and what happens if a super-complicated salad comes along?
Is there a solution that I've missed?
Thanks,
Dan
based on my experience the best solution is based on a normalized set of tables
table salads
id
salad_name
salad_type
salad_cost
.
table ingredients
id
name
and
table salad_ingredients
id
id_salad
id_ingredients
where id_salad is the corresponding if from salads
and id_ingredients is the corresponding if from ingredients
using proper join you can get (select) and filter (where) all the values you need

How to flatten a one-to-many relationship

While trying to build a data warehousing application using Talend, we are faced with the following scenario.
We have two tables tables that look like
Table master
ID | CUST_NAME | CUST_EMAIL
------------------------------------
1 | FOO | FOO_BAR#EXAMPLE.COM
Events Table
ID | CUST_ID | EVENT_NAME | EVENT_DATE
---------------------------------------
1 | 1 | ACC_APPLIED | 2014-01-01
2 | 1 | ACC_OPENED | 2014-01-02
3 | 1 | ACC_CLOSED | 2014-01-02
There is a one-to-many relationship between master and the events table.Since, given a limited number of event names I proposing that we denormalize this structure into something that looks like
ID | CUST_NAME | CUST_EMAIL | ACC_APP_DATE_ID | ACC_OPEN_DATE_ID |ACC_CLOSE_DATE_ID
-----------------------------------------------------------------------------------------
1 | FOO | FOO_BAR#EXAMPLE.COM | 20140101 | 20140102 | 20140103
THE DATE_ID columns refer to entries inside the time dimension table.
First question : Is this a good idea ? What are the other alternatives to this scheme ?
Second question : How do I implement this using Talend Open Studio ? I figured out a way in which I moved the data for each event name into it's own temporary table along with cust_id using the tMap component and later linked them together using another tMap. Is there another way to do this in talend ?
To do this in Talend you'll need to first sort your data so that it is reliably in the order of applied, opened and closed for each account and then denormalize it to a single row with a single delimited field for the dates using the tDenormalizeRows component.
After this you'll want to use tExtractDelimitedFields to split the single dates field.
Yeah, this is a good idea, this is called a cumulative snapshot fact. http://www.kimballgroup.com/2012/05/design-tip-145-time-stamping-accumulating-snapshot-fact-tables/
Not sure how to do this in Talend (dont know the tool) but it would be quite easy to implement in SQL using a Case or Pivot statement
Regarding only your first question, it's certainly a good idea -- unless there is any possibility of the same persons applying-opening-closing their account more than once AND you want to keep all this information in their history (so UPDATE wouldn't help).
Snowflaking is definitely not a good option if you are going to design a data warehouse. So, denormalizing will certainly be a good choice in this case. Following article almost fits perfectly to clear the air over such scenarios,
http://www.kimballgroup.com/2008/09/design-tip-105-snowflakes-outriggers-and-bridges/