Doing multiple queries in PostgreSQL - conditional loop - SQL

Let me first start by stating that in the last two weeks I have received ENORMOUS help from just about all of you (ok, ok, not all... but I think perhaps two dozen people commented, and almost all of these comments were helpful). This is really amazing and I think it shows that the Stack Overflow team really did something GREAT altogether. So thanks to all!
Now, as some of you know, I am working at a campus right now and I have to use a Windows machine. (I am the only one who has to use Windows here... :( )
Now I managed to set up (ok, the IT department did that for me) and populate a Postgres database (this I did on my own) with about 400 MB of data. That is perhaps not much for most of you heavy Postgres users, but I was more used to SQLite databases for personal use, which rarely ever exceeded 2 MB.
Anyway, sorry for being so chatty - the queries against that database now work nicely. I actually use Ruby to do the queries.
The entries in the Postgres database are interconnected, insofar as they are like "pointers" - each entry has one field that points to another entry.
Example:
entry 3667 points to entry 35785 which points to entry 15566. So it is quite simple.
The main entry is 1, so every such chain of queries eventually ends at 1. In other words, from any other number we can reach 1 as the last result.
I am using Ruby to make as many individual queries to the database as needed until the last result returned is 1. This can take up to 10 individual queries. I do this by logging into psql with my password and connection data, then performing the SQL query via -c. This is probably not ideal: these logins and queries take a little time, and ideally I would log in only once, perform ALL the queries inside Postgres, and then exit with one result (all these entries).
Now here comes my question:
- Is there a way to make conditional queries entirely inside of Postgres?
I know how to do it in a shell script and in Ruby, but I do not know if this is available in PostgreSQL at all.
I would need to make the query, in plain English, like so:
"Please give me all the entries that point to the parent entry, until the last found entry is eventually 1, then return all of these entries."
I already "solved" it by using Ruby to make several queries until 1 is eventually returned, but this strikes me as fairly inelegant and probably inefficient.
Any information is very much appreciated - thanks!
Edit (argh, I fail at pasting...):
Example dataset, the table would be like this:
 id | parent
----+--------
  1 |      1
  2 | 131567
  6 | 335928
  7 |      6
  9 |      1
 10 | 135621
 11 |      9
I hope that works - I tried to narrow it down to a minimal example.
For instance, id 11 points to id 9, and id 9 points to id 1.
It would be great if one could use SQL to return:
11 -> 9 -> 1

Unless you give some example table definitions, what you're asking for vaguely reminds me of a tree structure, which can be traversed with recursive queries: http://www.postgresql.org/docs/8.4/static/queries-with.html
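As a minimal sketch against the example table from the question (here called entries; the table name is an assumption), a recursive CTE can walk the whole chain from any starting id up to the root in a single query:

WITH RECURSIVE chain AS (
    SELECT id, parent
    FROM entries
    WHERE id = 11                   -- starting entry
  UNION ALL
    SELECT e.id, e.parent
    FROM entries e
    JOIN chain c ON e.id = c.parent
    WHERE c.id <> 1                 -- stop once the root has been reached
)
SELECT id FROM chain;

With the sample data this returns 11, 9, 1 in one round trip, so the Ruby loop (and the repeated psql logins) is no longer needed. The WHERE c.id <> 1 guard matters because the root points to itself (1 | 1) and would otherwise recurse forever.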

Related

What JOIN would be equivalent to this query?

I have three tables, the relevant structure looking like the following:
Routes
| ID |
Runs
| ID | RouteID |
Stops
| ID | RunID | Code | Timestamp |
I’m working on a portion of an application that needs to find the next run given a first run. I’ve got a SQL query that’s doing the job, but it’s turning out to be very slow, even though all of the fields being searched are indexed. It looks like this:
SELECT "RunID"
FROM "Stops"
WHERE "Code" = 'ABC'
AND "RunID" IN ('101', '202', '303')
AND "Timestamp" > '2017-02-07 12:34:56'
ORDER BY "Timestamp" ASC
FETCH FIRST 1 ROWS ONLY
Note that this is just the form the query generally takes. The primary keys are actually UUIDs, and obviously the tables are more complicated than shown above. But the idea is that I want to find the Stops that have a given code, one of a subset of RunIDs, and a timestamp after a given timestamp.
I’m wondering if the IN clause is causing the speed issue. All the above fields within the Stops table are indexed, so I would expect this to be a rather quick search, but it’s taking a few seconds each time, and this is within a loop, so this query is making the entire routine very slow.
So, is a JOIN perhaps the answer? The last piece that leads me to this question is that all the runs in the IN clause's list have the same parent route. So I'm really searching for all the stops that have a given code, are after a given timestamp, and have a parent run whose parent route is a given ID.
But, I’m honestly weak with SQL joins. I keep studying them, but I’ve never really gotten them to click for me. Is a join possibly the answer? And if so, how would I write it?
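For what it's worth, a sketch of the JOIN version: instead of enumerating RunIDs, join Stops to Runs and filter on the parent RouteID (the route value below is a placeholder; the quoted column names are taken from the query above):

SELECT s."RunID"
FROM "Stops" s
JOIN "Runs" r ON r."ID" = s."RunID"
WHERE r."RouteID" = '00000000-0000-0000-0000-000000000000'  -- hypothetical route UUID
  AND s."Code" = 'ABC'
  AND s."Timestamp" > '2017-02-07 12:34:56'
ORDER BY s."Timestamp" ASC
FETCH FIRST 1 ROWS ONLY;

Whether this is faster depends on the indexes: a composite index covering ("Code", "RunID", "Timestamp") will typically serve either form better than three separate single-column indexes.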

SQL: Most effective way to store every word in a document separately

Here's my situation (or see the TL;DR at the bottom): I'm trying to make a system that will search for user-entered words across several documents and return the documents that contain those words. The user(s) will be searching through thousands of documents, each of which will be 10 to 100+ pages long, stored on a webserver.
The solution I have right now is to store each unique word in a table with an ID (there are only maybe 120,000 relevant words in the English language), and then in a separate table store the word ID, the document it appears in, and the number of times it appears in that document.
E.g: Document foo's text is
abc abc def
and document bar's text is
abc def ghi
Documents table will have
id | name
---+-------
 1 | 'foo'
 2 | 'bar'
Words table:
id | word
---+-------
 1 | 'abc'
 2 | 'def'
 3 | 'ghi'
Word Document table:
word id | doc id | occurrences
--------+--------+-------------
      1 |      1 |           2
      1 |      2 |           1
      2 |      1 |           1
      2 |      2 |           1
      3 |      2 |           1
As you can see, when you have thousands of documents and each has thousands of unique words, the Word Document table blows up very quickly and takes way too long to search through.
TL;DR My question is this:
How can I store searchable data from large documents in an SQL database, while retaining the ability to use my own search algorithm (I am aware SQL Server has one built in for .docs and PDFs) based on custom factors (like occurrence counts, among others), without ending up with an outright massive table linking each word to a document and its properties in that document?
Sorry for the long read and thanks for any help!
Rather than building your own search engine on SQL Server, have you considered using a C# .NET implementation of the Lucene search APIs? Have a look at https://github.com/apache/lucene.net
Good question. I would piggyback on the existing solution in SQL Server (full-text indexing). It has a nicely integrated indexing engine which optimises considerably better than your own code probably could (either that, or the developers at Microsoft were lazy, or they just got a dime to build it :-)
Please see the SQL Server full-text indexing background. You could query views such as sys.fulltext_index_fragments or use the stored procedures.
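As an illustrative sketch (this assumes a full-text index already exists on a Documents table; the table name comes from the question, while the body column is an assumption), SQL Server can even hand you the word/document/occurrence data the question describes, via a documented dynamic management function:

-- per-keyword, per-document occurrence counts from the full-text index
SELECT display_term, document_id, occurrence_count
FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID('dbo.Documents'));

-- ordinary searches go through CONTAINS
SELECT id, name
FROM Documents
WHERE CONTAINS(body, 'abc');

That gives you the raw material for a custom ranking formula without maintaining the giant word-document table yourself.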
Of course, piggybacking on an existing solution has some drawbacks:
You need to have a license for the solution.
When your needs can no longer be served, you will have to program it all yourself.
But if you let SQL Server do the indexing, you can build your own solution on top of it more easily and in less time.
Your question strikes me as being naive. In the first place... you are begging the question. You are giving a flawed solution to your own problem... and then explaining why it can't work. Your question would be much better if you simply described what your objective is... and then got out of the way so that people smarter than you could tell you HOW to accomplish that objective.
Just off hand... the database sounds like a really dumb idea to me. People have been grepping text with command-line tools in UNIX-like environments for a long time. Either something already exists that will solve your problem, or else a decent Perl script will "fake" it for you, depending on your real-world constraints, of course.
Depending on what your problem actually is, I suspect that this could get into some really interesting computer science questions-- indexing, Bayesian filtering, and who knows what else. I suspect, however, that you're making a very basic task more complicated than it needs to be.
TL;DR My answer is this:
Why wouldn't you just write a script to go through a directory... and then use regexes to count the occurrences of the word in each file that is found there?

Storing IP addresses efficiently, for faster lookups and insertions (proxy checking)

I'm writing a small Python 3 program that is supposed to test the validity of a large number of proxies. I want to organize the data so I can quickly look up an IP, test it via curl, and write into the database whether it works, along with a timestamp.
With about 50,000 rows, the 'simple way' takes too long, so I need some clever way of searching through the IPs.
I'm new to SQL, but if I were to do it in some programming language, I would make something like this:
| IP_BYTE1 | IP_BYTE2 | IP_BYTE3 | IP_BYTE4 | TIMESTAMP | WORKS |
and then search 'left to right'.
Can anyone help me with creation of such a table and algorithm for fast lookup/insertion?
The simple way is to store them in a table using your favorite data type (varchar or int) and then build an index on them.
If you are looking for different kinds of IP addresses, then you might want to break them into separate pieces. Are you generally looking at class D addresses? Or do you need to also look at classes A, B, and C?
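A minimal sketch of the simple approach in PostgreSQL syntax (the table and column names are assumptions; inet is Postgres-specific, and a plain varchar or integer column works the same way with an ordinary index):

CREATE TABLE proxies (
    ip         inet NOT NULL,   -- or varchar(15) / a 32-bit integer
    works      boolean,
    checked_at timestamp
);

CREATE UNIQUE INDEX proxies_ip_idx ON proxies (ip);

-- indexed exact-match lookup
SELECT works, checked_at
FROM proxies
WHERE ip = '203.0.113.7';

With an index, a single-IP lookup over 50,000 rows is effectively instant, so the byte-by-byte 'left to right' scheme from the question should not be necessary.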

mysqldumpslow: What do these fields indicate?

Recently we have started optimizing live slow queries. As part of that, we thought we would use mysqldumpslow to prioritize the slow queries. I am new to this tool. I am able to understand some basic info, but I would like to know what exactly the below fields in the output tell us.
OUTPUT: Count: 6 Time=22.64s (135s) Lock=0.00s (0s) Rows=1.0 (6)
What about the below fields?
Time: Is it the average time taken across all 6 occurrences?
135s: What is this 135 seconds?
Rows=1.0 (6): again, what does this mean?
I didn't find a good explanation anywhere. Really, thanks in advance.
Regards,
UDAY
I did some research on this because I wanted to know too.
I have a log from a pretty highly used DB server.
The command mysqldumpslow has several optional parameters (https://dev.mysql.com/doc/refman/5.7/en/mysqldumpslow.html), including sort by (-s)
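For example, to see the ten entries with the highest query time, sorted with -s t (a typical invocation; the log path here is just an assumption):

mysqldumpslow -s t -t 10 /var/log/mysql/mysql-slow.log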
Thanks to the many queries I have to work with, I can tell that:
the value before the brackets is the average over all occurrences of the same query ('Count' of them in total), and the value within the brackets is the total summed over all of those occurrences. Meaning, in your case:
you have a query that was called 6 times; each execution took 22.64 seconds on average, and all 6 together took about 135 seconds (6 × 22.64s ≈ 135s). The same applies to locks (if provided) and rows: Rows=1.0 (6) means it returned 1 row on average and 6 rows in total across the 6 calls.

What's the best way to handle categories, sub-categories - hierarchical data?

Duplicate:
SQL - how to store and navigate hierarchies
If I have a database where the client requires categories, sub-categories, sub-sub-categories and so on, what's the best way to do that? If they only needed three levels, and always knew they'd need three, I could just create three tables (cat, subcat, subsubcat) or the like. But what if they want further depth? I don't like the three tables, but it's the only way I know how to do it.
I have seen the "SQL adjacency list", but didn't know if that was the only way possible. I was hoping for input so that the client can have any number of levels of categories and subcategories. I believe this is called hierarchical data.
EDIT: I was also hoping for the SQL to get the list back out, if possible.
Thank you.
table categories: id, title, parent_category_id
id | title | parent_category_id
----+-------+-------------------
1 | food | NULL
2 | pizza | 1
3 | wines | NULL
4 | red | 3
5 | white | 3
6 | bread | 1
I usually do a SELECT * and assemble the tree algorithmically in the application layer.
You might have a look at Joe Celko's book, or this previous question.
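To address the EDIT (getting the list back out in SQL rather than in the application layer): if the database supports recursive CTEs, a sketch against the categories table above would be (PostgreSQL 8.4+ syntax; SQL Server, for instance, omits the RECURSIVE keyword):

WITH RECURSIVE subtree AS (
    SELECT id, title, parent_category_id
    FROM categories
    WHERE id = 1                          -- branch root, e.g. 'food'
  UNION ALL
    SELECT c.id, c.title, c.parent_category_id
    FROM categories c
    JOIN subtree s ON c.parent_category_id = s.id
)
SELECT * FROM subtree;

This returns 'food' plus every descendant ('pizza', 'bread', and anything nested deeper), at arbitrary depth, in one query.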
Creating a table with a relation to itself is the best way of doing this. It's easy and flexible to whatever extent you want, without any limitation. I don't think I need to repeat the structure you should use, since that has already been suggested in the first answer.
I have worked with a number of methods, but I still stick to the plain "id, parent_id" intra-table relationship, where root items have parent_id = 0. If you need to query the items in a tree a lot, especially when you only need 'branches' (all underlying elements of one node), you could use a second table, "id, path_id, level", holding a reference to each node in the upward path of each node. This might look like a lot of data, but it drastically improves branch lookups, and is quite manageable to maintain with triggers.
Not a recommended method, but I have seen people use dot-notation on the data.
Food.Pizza or Wines.Red.Cabernet
You end up doing lots of LIKE or mid-string queries, which don't use indexes terribly well. And you end up parsing things a lot.
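For illustration, the kind of query the dot-notation pattern forces (the path column and the stored values are hypothetical):

-- fetch everything under Wines.Red; a leading-anchored LIKE can sometimes use
-- an index, but any '%...' mid-string match forces a full scan
SELECT *
FROM categories
WHERE path LIKE 'Wines.Red.%';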