How to generate UUIDs for updates via SPARQL

Using an RDF database, accessed via a SPARQL endpoint, what's the best way of generating new UUID IRIs and using them for new resources?
Here is an overview of some approaches I've tried. I am sharing this because I would have liked to find this question answered. My favourite is the last approach, but I'd say it's still up for debate.
Generate a UUID in the client and use it in the update request
pro: fast
con: you cannot be sure the UUID is unique in the database. The chance of a collision is small, though.
Generate a UUID in the client and check if the RDF store contains triples with that id. Iterate until the UUID is new.
pro: you can be reasonably sure that the UUID is unique in the db (except for ones added in concurrent updates)
con: quite slow
Ask the RDF store for a UUID and use it
Query: SELECT (UUID() AS ?id) WHERE {}
pro: UUID is guaranteed to be unique (i.e. not yet used) in the store
con: it's an additional request (but it's a quick one)
Ask the RDF store for N >> 1 UUIDs before doing a larger batch of updates
Query (returns 1000 result rows):
SELECT (UUID() AS ?id) WHERE {
  VALUES ?index1 { 0 1 2 3 4 5 6 7 8 9 }
  VALUES ?index2 { 0 1 2 3 4 5 6 7 8 9 }
  VALUES ?index3 { 0 1 2 3 4 5 6 7 8 9 }
}
pro: probably the fastest per UUID when many are needed
pro: UUIDs are guaranteed to be unused in the store
con: this approach requires more client-side programming
and: this is an unusual query, is there a better way to achieve this?
Related:
How to get UUID() from INSERT Sparql Request?
How to use Jena SPARQL UUID?
SPARQL CONSTRUCT+INSERT

Following AndyS's comment, I decided to look that up again and adjust my expectations: you have to generate 1 billion UUIDs per second for about 85 years to reach a 50% probability of one or more collisions. Therefore, Variant 1 is best:
Generate a UUID in the client and use it in the update request.
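For concreteness, here is a minimal sketch of variant 1 in Java (the class IRI is a placeholder; urn:uuid: is the standard IRI scheme for UUIDs):

import java.util.UUID;

public class MintUuidIri {
    public static void main(String[] args) {
        // Mint the UUID client-side; no round trip to the store is needed.
        String iri = "urn:uuid:" + UUID.randomUUID();

        // Splice it into a SPARQL Update request.
        String update = "INSERT DATA { <" + iri + "> a <http://example.org/Thing> . }";

        // POST `update` to the store's SPARQL Update endpoint.
        System.out.println(update);
    }
}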


Redis - is there a way to divide one Redis server instance into 2 sub-servers?

I was wondering if there is a way to divide one Redis server into 2 different sub-servers? Specifically is there a way to have:
Sub-server1:
a 1 (a - key, 1 - value)
b 2
c 3
Sub-server2:
a 4
b 5
c 6
so that I can search for keys in one sub-server (for example use something like "GET a in Sub-server2", and get value 4)?
I am including the jedis tag because the final goal is to make this work in Java.
Redis already has the concept of databases. A single Redis instance has 16 separate databases by default (numbered 0 to 15).
You can map your sub-servers onto these databases: for example, use database 1 as sub-server1 and database 2 as sub-server2.
Jedis also supports databases. Choose a constructor that takes a database parameter, pass the proper database index, and you're done!
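A minimal sketch with Jedis (host, port and key values are assumptions; if your Jedis version lacks a constructor with a database parameter, calling select() after connecting achieves the same):

import redis.clients.jedis.Jedis;

public class SubServerDemo {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.select(1);                    // use database 1 as sub-server1
            jedis.set("a", "1");

            jedis.select(2);                    // use database 2 as sub-server2
            jedis.set("a", "4");

            System.out.println(jedis.get("a")); // prints "4"
        }
    }
}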

Doing multiple queries in PostgreSQL - conditional loop

Let me first start by stating that in the last two weeks I have received ENORMOUS help from just about all of you (ok ok not all... but I think perhaps two dozen people commented, and almost all of these comments were helpful). This is really amazing and I think it shows that the stackoverflow team really did something GREAT altogether. So thanks to all!
Now as some of you know, I am working at a campus right now and I have to use a windows machine. (I am the only one who has to use windows here... :( )
Now I managed to set up (ok, the IT department did that for me) and populate a Postgres database (this I did on my own) with about 400 MB of data. That is perhaps not much for most of you heavy Postgres users, but I was more used to SQLite databases for personal use, which rarely ever exceeded 2 MB.
Anyway, sorry for being so chatty - the queries against that database now work nicely. I actually use Ruby to run the queries.
The entries in the Postgres database are interconnected, insofar as they act like "pointers": each entry has one field that points to another entry.
Example:
entry 3667 points to entry 35785 which points to entry 15566. So it is quite simple.
The main entry is 1, so every chain of lookups eventually ends at 1: from any other number, we reach 1 as the last result.
I am using Ruby to make as many individual queries to the database as needed until the last result returned is 1. This can take up to 10 individual queries. I do this by logging into psql with my password and connection data and then performing the SQL query via -c. This is probably not ideal: each of these logins and queries takes a little time, and ideally I would log in only once, perform ALL the queries inside Postgres, and then exit with the complete result (all the entries of the chain).
Now here comes my question:
- Is there a way to make conditional queries all inside of Postgres?
I know how to do it in a shell script and in Ruby, but I do not know if this is available in PostgreSQL at all.
I would need to make the query, in literal english, like so:
"Please give me all the entries that point to the parent entry, until the last found entry is eventually 1, then return all of these entries."
I already "solved" it by using ruby to make several queries until 1 is eventually returned, but this strikes me as fairly inelegant and possibly not effective.
Any information is very much appreciated - thanks!
Edit (argh, I fail at pasting...):
Example dataset, the table would be like this:
 id | parent
----+--------
  1 |      1
  2 | 131567
  6 | 335928
  7 |      6
  9 |      1
 10 | 135621
 11 |      9
I hope that works; I tried to narrow it down to a minimal example.
For instance, id 11 points to id 9, and id 9 points to id 1.
It would be great if one could use SQL to return:
11 -> 9 -> 1
Unless you give some example table definitions, what you're asking for vaguely reminds me of a tree structure, which can be traversed with recursive queries: http://www.postgresql.org/docs/8.4/static/queries-with.html
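For example, assuming your table is called entries (a made-up name), a recursive CTE that walks the parent chain from id 11 up to the root could look like this:

WITH RECURSIVE chain AS (
    SELECT id, parent
    FROM entries
    WHERE id = 11                  -- starting entry
  UNION ALL
    SELECT e.id, e.parent
    FROM entries e
    JOIN chain c ON e.id = c.parent
    WHERE c.id <> c.parent         -- stop at the self-referencing root
)
SELECT id FROM chain;              -- returns 11, 9, 1

This runs entirely inside Postgres as a single query, so the Ruby side only needs one round trip.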

Creating a workable Redis store with several filters

I am working on a system to display information about real estate. It runs in Angular, with the data stored as a JSON file on the server, which is updated once a day.
I have filters on number of bedrooms, bathrooms, price and a free text field for the address. It's all very snappy, but the problem is the load time of the app. This is why I am looking at Redis. Trouble is, I just can't get my head round how to get data with several different filters running.
Let's say I have some data like this: (missing off lots of fields for simplicity)
id beds price
0 3 270000
1 2 130000
2 4 420000
etc...
I am thinking I could set up three sets, one to hold the whole dataset, one to create an index on bedrooms and another for price:
beds id
2 1
3 0
4 2
and the same for price:
price id
130000 1
270000 0
420000 2
Then I was thinking I could use SINTER to return the overlapping sets.
Let's say I am looking for a house with more than 2 bedrooms that costs less than 300000.
From the bedrooms set I get IDs 0,2 for beds > 2.
From the prices set I get IDs 0,1 for price < 300000
So the common id is 0, which I would then look up in the main dataset.
It all sounds good in theory, but being a Redis newbie, I have no clue how to go about achieving it!
Any advice would be gratefully received!
You're on the right track; sets + sorted sets is the right answer.
Two sources for all of the information that you could ever want:
Chapter 7 of my book, Redis in Action - http://bitly.com/redis-in-action
My Python/Redis object mapper - https://github.com/josiahcarlson/rom (it uses ideas directly from chapter 7 of my book to implement sql-like indices)
Both of those resources use Python as the programming language, though chapter 7 has been translated into Java: https://github.com/josiahcarlson/redis-in-action/ (go to the java path to see the code).
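For the concrete example above, a minimal sketch in plain Redis commands (key names invented): index each listing id in two sorted sets, scored by bedrooms and price, then query each range.

ZADD idx:beds 3 0 2 1 4 2
ZADD idx:price 270000 0 130000 1 420000 2

ZRANGEBYSCORE idx:beds (2 +inf
1) "0"
2) "2"
ZRANGEBYSCORE idx:price -inf (300000
1) "1"
2) "0"

Intersecting the two result lists client-side (or via ZINTERSTORE on temporary sets) leaves id 0, which you then look up in the main dataset.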
... That said, a normal relational database (especially one with built-in Geo handling like Postgres) should handle this data with ease. Have you considered a relational database?

Use Cases for Redis' "Score" and "Ranking" Features for Sets

What are some use cases for Redis' "score" and "ranking" features for sets (outside of the typical "leaderboard" examples for games)? I'm trying to figure out how to make use of these features as I anticipate moving from a traditional relational database to Redis as a persistent data store.
ZSETs are great for selections or ranges based on scores, but scores can be any numerical value, like a timestamp.
We store daily stock prices for all US stocks in Redis. Here's an example for eBay...
ZADD key score member [score member ...]
...
ZADD stocks:ebay 1 30.39 2 32.70 3 31.25 4 31.75 5 29.12 6 29.87 7 29.93
The score values in this case would normally be long timestamps. That aside, if we want daily prices for the last 3 days, we simply convert two dates to timestamps and pull from Redis using the timestamp range 1 3...
zrangebyscore stocks:ebay 1 3
1) "30.39"
2) "32.70"
3) "31.25"
The query is very fast and works well for our needs.
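For illustration, with real epoch-second timestamps as scores, the same pattern might look like this (the dates are 2014-01-01 through 2014-01-03 as Unix timestamps; prices invented):

ZADD stocks:ebay 1388534400 30.39 1388620800 32.70 1388707200 31.25
ZRANGEBYSCORE stocks:ebay 1388534400 1388707200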
Hope it helps!
A zset is the only Redis key type that is kept sorted.
For example, you could put the ids of all comments on a specific article into a zset;
users vote each comment up or down, and those votes change the comment's score.
When you later need to render the comments, you can fetch them already ordered, best comments first (like here on Stack Overflow).
With ZREMRANGEBYSCORE you could, for instance, delete all badly rated comments once a day, as sketched below.
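A sketch in plain Redis commands (key and member names invented): seed two comments at score 0, apply an upvote and a downvote, fetch the ten best, then purge anything rated -5 or worse.

ZADD article:42:comments 0 comment:1001 0 comment:1002
ZINCRBY article:42:comments 1 comment:1001
ZINCRBY article:42:comments -1 comment:1002
ZREVRANGE article:42:comments 0 9 WITHSCORES
ZREMRANGEBYSCORE article:42:comments -inf -5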
But as with every Redis type, zsets stay fairly basic; it's hard to hand you a dedicated use case, though there surely are some :-)

How can I store and index a list in a relational database?

I'm working on constructing a database (SQLite) to store information about each run of a Mathematica script I've written. The script takes several input parameters, so my DB has a table with a column for each parameter (among other columns).
Some of the input parameters are lists of numbers. My first thought for storing these is to use a junction table as described in the accepted answer to this question. But I typically use the same list for several different runs. How can I look up whether any given list is already in the database, so I can reuse its ID rather than storing it again?
Constraints as mentioned in comments:
There is no explicit upper bound on the length of a list but in practice it ranges from 1 to about 50.
The number of distinct lists will be small, on the order of 10.
I actually have 3 list parameters. For two of them, the values in the list are non-negative, double precision floating point numbers; for the third, the values are pairs of such numbers.
There are no duplicate entries. (These are more precisely sets, so no duplicates and order is irrelevant)
I can easily arrange for the list elements to be in sorted order.
For example: suppose my table is set up like this
CREATE TABLE jobs (id INTEGER PRIMARY KEY, param1 REAL, param2_id INTEGER);
CREATE TABLE param2 (param2_id INTEGER, value REAL); -- param2_id repeats: one row per list element
When I run the script, it sets the parameters and then calls a function to run the calculation, like so:
param1 = 4;
param2 = {.1, .3, .5};
runTheCalculation[param1, param2]
Assuming this is the very first run of the script, it will insert the following contents into the DB:
jobs: id param1 param2_id
1 4.0 1
param2: param2_id value
1 0.1
1 0.3
1 0.5
So far, so good. Now let's say I run the script again with one different parameter,
param1 = 2;
param2 = {.1, .3, .5};
runTheCalculation[param1, param2]
In a naive implementation, this will result in the database containing this:
jobs: id param1 param2_id
1 4.0 1
2 2.0 2
param2: param2_id value
1 0.1
1 0.3
1 0.5
2 0.1
2 0.3
2 0.5
But I would like it to be able to look up the fact that the list {.1, .3, .5} is already in the database, so that after the second run the DB contains this instead:
jobs: id param1 param2_id
1 4.0 1
2 2.0 1
param2: param2_id value
1 0.1
1 0.3
1 0.5
What sort of a query can I use to find that the list {.1, .3, .5} already exists in the table param2?
I'm not opposed to creating additional tables if necessary. Or if there is some model other than using a junction table that makes more sense, that's fine too.
If the list is short, and the quantity of lists is relatively small, then you can simply scan the lists in TBL_Lists and see if yours matches. This is pretty inefficient, as it enumerates all stored lists to compare against your one list.
Another way, and the better way in my opinion, would be to hash the list and store its hash in a TBL_List_Hashes table.
Hashing the list will require enumerating it one time.
An example hashing algorithm might be to build a string of all the sorted numerical values, uniformly padded, then run any hashing method on the concatenated string.
It should be relatively easy to obtain a hash of a given list and then retrieve the matching hash from the DB.
Even a relatively simple hash algorithm with collisions will significantly reduce the number of lists you need to check to make the comparison.
If your hash algorithm does produce collisions, you just pay an extra enumeration (and query) for each false match.
EDIT:
Here is a relevant answer for .net
.net 3.5 List<T> Equality and GetHashCode
EDIT2:
And if you are order-agnostic in your matching, simply standardize the list order before hashing:
GetHashCode for a Class with a List Object
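Here is a minimal sketch of that canonicalize-then-hash idea in Java (class and method names are my own; any stable hash would do in place of SHA-256):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class ListHash {
    // Sort, format uniformly, concatenate, then hash, so that equal sets
    // always produce the same key regardless of their original order.
    static String hash(double[] values) throws Exception {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        StringBuilder sb = new StringBuilder();
        for (double v : sorted) {
            sb.append(String.format("%.10e", v)).append(';'); // uniform padding
        }
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(sb.toString().getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hash(new double[] {.1, .3, .5}));
        System.out.println(hash(new double[] {.5, .1, .3})); // same hash
    }
}

Look the hash up in TBL_List_Hashes; on a hit, verify that the stored list really matches before reusing its id.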
You ask: How can I look up whether any given list is already in the database?
The normal way is to use an index, and indexes are always row-oriented. So standard database design suggests you somehow need to get the whole list (normalized) into a row.
Since you're on SQLite, you don't have too many options:
http://www.sqlite.org/datatype3.html
I recommend TEXT! You can index BLOB as well, and BLOB will save some space, but probably TEXT will work just fine, and TEXT is usually a lot more convenient to debug and work with. Try to invent some kind of canonical String format for your lists that you can parse/generate, and always INSERT/SELECT that from the database in a consistent way (e.g., consistent rounding, pre-sorted, duplicates removed, trailing and leading zeroes always consistent), and you should be fine.
Warning: it's a low-engineering approach, and perhaps even "not-the-right-way (TM)," but if it gets the job done....
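A sketch of what that might look like (table name and string format are invented; one comma-separated, sorted, consistently rounded string per list):

CREATE TABLE param2_lists (param2_id INTEGER PRIMARY KEY, value TEXT UNIQUE);

-- reuse an existing list id if the canonical string is already stored
SELECT param2_id FROM param2_lists WHERE value = '0.1,0.3,0.5';

-- otherwise insert it; the UNIQUE index makes the lookup fast
INSERT INTO param2_lists (value) VALUES ('0.1,0.3,0.5');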
In general, don't use lists, unless you have a very unusual set of requirements, and enough hands-on experience to anticipate the consequences.
A many-to-many relationship contained in a junction table, with appropriate indexes, will perform just as well and be much easier to use. It's also more flexible.
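With the junction table from the question, the lookup for an existing list such as {.1, .3, .5} can be a single grouped query (beware exact floating-point comparison; store consistently rounded values):

SELECT param2_id
FROM param2
GROUP BY param2_id
HAVING COUNT(*) = 3
   AND SUM(CASE WHEN value IN (0.1, 0.3, 0.5) THEN 1 ELSE 0 END) = 3;

A returned param2_id identifies a list containing exactly those three values; if no row comes back, insert the list under a new id.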