Most effective way to store every word in a document separately

Here's my situation (or see TLDR at bottom): I'm trying to make a system that will search for user entered words through several documents and return the documents that contain those words. The user(s) will be searching through thousands of documents, each of which will be 10 - 100+ pages long, and stored on a webserver.
The solution I have right now is to store each unique word in a table with an ID (there are only maybe 120,000 relevant words in the English language), and then in a separate table store the word ID, the document it appears in, and the number of times it appears in that document.
E.g: Document foo's text is
abc abc def
and document bar's text is
abc def ghi
Documents table will have:
id | name
1  | 'foo'
2  | 'bar'
Words table:
id | word
1  | 'abc'
2  | 'def'
3  | 'ghi'
Word Document table:
word id | doc id | occurrences
1       | 1      | 2
1       | 2      | 1
2       | 1      | 1
2       | 2      | 1
3       | 2      | 1
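In SQL terms, the schema I have now is roughly this (a sketch; names are illustrative):
CREATE TABLE documents (
    id   INT PRIMARY KEY,
    name VARCHAR(255) NOT NULL
);

CREATE TABLE words (
    id   INT PRIMARY KEY,
    word VARCHAR(100) NOT NULL UNIQUE
);

CREATE TABLE word_documents (
    word_id     INT NOT NULL REFERENCES words(id),
    doc_id      INT NOT NULL REFERENCES documents(id),
    occurrences INT NOT NULL,
    PRIMARY KEY (word_id, doc_id)
);

-- Documents containing 'abc', ranked by how often it appears:
SELECT d.name, wd.occurrences
FROM words w
JOIN word_documents wd ON wd.word_id = w.id
JOIN documents d ON d.id = wd.doc_id
WHERE w.word = 'abc'
ORDER BY wd.occurrences DESC;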
As you can see, when you have thousands of documents and each has thousands of unique words, the Word Document table blows up very quickly and takes far too long to search through.
TL;DR My question is this:
How can I store searchable data from large documents in an SQL database, while retaining the ability to use my own search algorithm (I am aware SQL Server has one built in for .doc and PDF files) based on custom factors (like occurrence counts, among others), without ending up with an outright massive table linking each word to a document and its properties in that document?
Sorry for the long read and thanks for any help!

Rather than building your own search engine using SQL Server, have you considered using a C#/.NET implementation of the Lucene search APIs? Have a look at https://github.com/apache/lucene.net

Good question. I would piggyback on the existing SQL Server solution (full-text indexing). It has an integrated indexing engine that will optimise considerably better than your own code probably could (unless the developers at Microsoft were lazy or only got a dime to build it :-)
Please see the SQL Server full-text indexing background. You can query views such as sys.fulltext_index_fragments or use the stored procedures.
Of course, piggybacking on an existing solution has some drawbacks:
You need to have a license for the solution.
When your needs can no longer be served, you will have to program it all yourself.
But if you let SQL Server do the indexing, you can build your own solution on top of it more easily and in less time.
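A hedged sketch of what that could look like in T-SQL (object names are illustrative; Documents(Body) is assumed to hold the raw text, and PK_Documents must be a unique index on the table):
CREATE FULLTEXT CATALOG DocumentCatalog;

CREATE FULLTEXT INDEX ON Documents (Body)
    KEY INDEX PK_Documents
    ON DocumentCatalog;

-- Rank documents containing the search term, most relevant first:
SELECT d.id, d.name, ft.RANK
FROM CONTAINSTABLE(Documents, Body, 'abc') AS ft
JOIN Documents AS d ON d.id = ft.[KEY]
ORDER BY ft.RANK DESC;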

Your question strikes me as being naive. In the first place... you are begging the question. You are giving a flawed solution to your own problem... and then explaining why it can't work. Your question would be much better if you simply described what your objective is... and then got out of the way so that people smarter than you could tell you HOW to accomplish that objective.
Just off hand... the database sounds like a really dumb idea to me. People have been grepping text with command line tools in UNIX-like environments for a long time. Either something already exists that will solve your problem or else a decent perl script will "fake" it for you-- depending on your real world constraints, of course.
Depending on what your problem actually is, I suspect that this could get into some really interesting computer science questions-- indexing, Bayesian filtering, and who knows what else. I suspect, however, that you're making a very basic task more complicated than it needs to be.
TL;DR My answer is this:
Why wouldn't you just write a script to go through a directory... and then use regexes to count the occurrences of the word in each file found there?

Related

How to implement version history for notes in a note taking database application?

So I've built myself a super simple notetaking application using a relational database (if you're curious, I've used Excel VBA + MySQL).
The app works fantastically for me as a replacement for Evernote, but I had this other feature idea:
Could I implement version control/history for each individual note?
To be clear I'm not talking about version control for the database's records or schema. I'm trying to make a user-facing (not developer) interface to take notes "back in time".
So yes, this could be done quite easily by assigning a unique ID to each note "thread", where the thread contains the running history of that note, but if possible I'd also like to compress this data as much as possible and only store the differences of what changed.
So for example, if I have a note with body:
“This is the note body. It’s a super long text”
And I change it to:
“This is the note body. It’s a very long text”
I would like to not store all those character bytes all over again in the database, and instead somehow store only what changed (“super” -> “very”).
This is probably similar to how Git works, except I don't need branching capabilities.
Would anybody have any suggestions for algorithms on how to do this sort of thing?
Thanks!
As a first choice, I would stick to storing and versioning the entire note as a whole, even if only one letter changed. It keeps things simple: no need to compute diffs on write and reconstruct the note on read. Storage is cheap, and MySQL performance will easily suffice for a small to medium amount of data.
[notes]
note_id version text
1 1 This is the note body. It’s a super long text
1 2 This is the note body. It’s a very long text
1 3 This is the note body. It’s really a very long text
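A minimal sketch of that table in MySQL (column types are illustrative):
CREATE TABLE notes (
    note_id INT        NOT NULL,
    version INT        NOT NULL,
    text    MEDIUMTEXT NOT NULL,
    PRIMARY KEY (note_id, version)
);

-- Current version of a note:
SELECT text
FROM notes
WHERE note_id = 1
ORDER BY version DESC
LIMIT 1;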
I would only consider the following options if you really expect a huge number of users and notes, or if you're doing this for educational purposes.
Instead of versioning notes as a whole, you can split them into chunks: paragraphs, sections, or any other unit you can distinguish.
[sections]
section_id text
1 This is the note body.
2 It’s a super long text
3 It’s a very long text
4 It’s really a very long text
[notes]
note_id version position section_id
1 1 1 1
1 1 2 2
1 2 1 1
1 2 2 3
1 3 1 1
1 3 2 4
Here notes and their versions refer to specific sections at specific positions. See how section_id = 1 gets reused in subsequent versions. It also allows a section to be reused across different notes.
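As a rough sketch (names are illustrative), the two tables and the query to reassemble a given version could look like this:
CREATE TABLE sections (
    section_id INT PRIMARY KEY,
    text       TEXT NOT NULL
);

CREATE TABLE notes (
    note_id    INT NOT NULL,
    version    INT NOT NULL,
    position   INT NOT NULL,
    section_id INT NOT NULL,
    PRIMARY KEY (note_id, version, position)
);

-- Rebuild version 2 of note 1 by concatenating its sections in order:
SELECT s.text
FROM notes n
JOIN sections s ON s.section_id = n.section_id
WHERE n.note_id = 1 AND n.version = 2
ORDER BY n.position;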
Or, as you suggested, you could try to store diffs. For example, using unified diff:
[notes]
note_id version text_or_diff
1 1 This is the note body.
It’s a super long text
1 2 @@ -1,2 +1,2 @@
This is the note body.
-It’s a super long text
+It’s a very long text
1 3 @@ -1,2 +1,2 @@
This is the note body.
-It’s a very long text
+It’s really a very long text
Here, of course, the diff is longer than the actual text of the note, but with bigger notes it will be more efficient. As mentioned, this comes at a cost: when reading such a note you need to load all of its version records and apply the diffs in order.
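Schema-wise this stays very simple; the work moves into application code, which has to produce the diff on write and replay the diffs on read (a sketch, with illustrative names):
CREATE TABLE notes (
    note_id      INT  NOT NULL,
    version      INT  NOT NULL,
    text_or_diff TEXT NOT NULL,  -- full text for version 1, a unified diff for later versions
    PRIMARY KEY (note_id, version)
);

-- Reading a note means loading every version in order and applying the diffs outside SQL:
SELECT version, text_or_diff
FROM notes
WHERE note_id = 1
ORDER BY version;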
From here you can explore various options and optimizations:
Use another diff format
Only store diff if it's shorter, otherwise just store full note
Split notes into chunks/sections and maintain chunk history as diffs

Apache Lucene: Creating an index between strings and doing intelligent searching

My problem is as follows: Let's say I have three files. A, B, and C. Each of these files contains 100-150M strings (one per line). Each string is in the format of a hierarchical path like /e/d/f. For example:
File A (RTL):
/arbiter/par0/unit1/sigA
/arbiter/par0/unit1/sigB
...
/arbiter/par0/unit2/sigA
File B (SCH):
/arbiter_sch/par0/unit1/sigA
/arbiter_sch/par0/unit1/sigB
...
/arbiter_sch/par0/unit2/sigA
File C (Layout):
/top/arbiter/par0/unit1/sigA
/top/arbiter/par0/unit1/sigB
...
/top/arbiter/par0/unit2/sigA
We can think of file A corresponding to circuit signals in a hardware modeling language. File B corresponding to circuit signals in a schematic netlist. File C corresponding to circuit signals in a layout (for manufacturing).
Now a signal will have a mapping between File A <-> File B <-> File C. For example in this case, /arbiter/par0/unit1/sigA == /arbiter_sch/par0/unit1/sigA == /top/arbiter/par0/unit1/sigA. Of course, this association (equivalence) is established by me, and I don't expect the matcher to figure this out for me.
Now say, I give '/arbiter/par0/unit1/sigA'. In this case, the matcher should return a direct match from file A since it is found. For file B/C a direct match is not possible. So it should return the best possible matches (i.e., edit distance?) So in this example, it can give /arbiter_sch/par0/unit1/sigA from file B and /top/arbiter/par0/unit1/sigA from file C.
Instead of giving a full string search, I could also give something like *par0*unit1*sigA and it should give me all the possible matches from fileA/B/C.
I am looking for solutions, and came across Apache Lucene. However, I am not totally sure if this would work. I am going through the docs to get some idea.
My main requirements are the following:
There will be 3 text files with full path to signals. (I can adjust the format to make it more compact if it helps building the indexer more quickly).
Building the index should be fairly fast (take a couple of hours). The files above are static (no modifications).
Searching should be comprehensive. It is OK if it takes ~1s / search but the matching should support direct match, regex match, and edit distance matching. The main challenge is each file can have 100-150 million signals.
Can someone tell me if such a use case can be easily addressed by Lucene? What would be the correct way to go about building an index and doing quick searching? I would like to write some proof-of-concept code and test the performance. Thanks.
I think, based on your requirements, the best approach would be a PoC with a given test set of entries. Based on that you can evaluate whether the indexing time you are aiming for is achievable. Because you only use static information it's easier, since you don't have to care about topics like NRT (near-real-time search).
Personally I have never used Lucene on a data set this big, but I think Lucene should be able to handle it.
How I would do it:
Read tutorials and best practices about Lucene, indexing and searching, and understand how it works
Define a data set for indexing, let's say 1,000 lines from each file
Define your Lucene document structure. This is really important because your searches will be built on top of it. Take care with analyzer tasks such as tokenization, if and how you need them. If you need full-text search, look at using a TextField.
Write code for simple indexing
Run small tests with indexing and inspect your index with Luke
Write code for simple searching
Define queries and your expected results, then execute the searches and check the results
Try to structure your code: separate indexing and searching, so it will be easier to refactor

UniData - record count of all files / tables

Looking for a shortcut here. I am pretty adept with SQL database engines and ERPs. I should clarify... I mean databases like MS SQL, MySQL, PostgreSQL, etc.
One of the things that I like to do when I am working on a new project is to get a feel for what is being utilized and what isn't. In T-SQL this is pretty easy. I just query the information schema and get a row count of all the tables and filter out the ones having rowcount = 0. I know this isn't truly a precise row count, but it does give me an idea of what is in use.
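For reference, the sort of T-SQL check I mean is roughly this (a sketch; it pulls approximate counts from sys.partitions, which is one common way to do it):
SELECT t.name AS table_name, SUM(p.rows) AS approx_rows
FROM sys.tables AS t
JOIN sys.partitions AS p
  ON p.object_id = t.object_id AND p.index_id IN (0, 1)  -- heap or clustered index
GROUP BY t.name
HAVING SUM(p.rows) > 0
ORDER BY approx_rows DESC;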
So I recently started at a new company and one of their systems is running on UniData. This is a pretty radical shift from mainstream databases and there isn't a lot of help out there. I was wondering if anybody knew of a command to do the same thing listed above in UniBasic/UniQuery/whatever else.
Which tables (files) are heavily populated, and which ones are not?
You can start with a special "table" (or file in Unidata terminology) named VOC - it will have a list of all the other files that are in your current "database" (aka account), as well as a bunch of other things.
To get a list of files in (or pointed to) the current account:
:SORT VOC WITH F1 = "F]" "L]" "DIR" F1 F2
Try HELP CREATE.FILE if you're curious about the difference between F and LF and DIR.
Once you have a list of files, weed out the ones named *TEMP* or *WORK* and start digging into the ones that seem important. There are other ways to get at what's important (e.g using triggers or timestamps), but browsing isn't a bad idea to see what conventions are used.
Once you have a file that looks interesting (let's say CUSTOMERS), you can look at the dictionary of that file to see what fields it has:
:SORT DICT CUSTOMERS F1 F2 BY F1 BY F2 USING DICT VOC
It can help to create something like F2.LONG in DICT VOC to increase the display size up from 15 characters.
Now that you have a list of "columns" (aka fields or attributes), you're looking for the D-type attributes, which tell you what columns are actually stored in the file. V- or I-types are calculations.
https://github.com/ianmcgowan/SCI.BP/blob/master/PIVOT is helpful with profiling when you see an attribute that looks interesting and you want to see what the data looks like.
http://docs.rocketsoftware.com/nxt/gateway.dll/RKBnew20/unidata/previous%20versions/v8.1.0/unidata_userguide_v810.pdf has some generally good information on the concepts and there are many other online manuals available there. It can take a lot of reading to get to the right thing if you don't know the terminology.

Display ALL countries in an autocomplete form

First time posting on here because Google is yielding no results!
So, I have a website that is based around travelling and locations. Every time someone enters content into the site, they select a location, and that content then has lat and long, country, etc.
The issue I face is that I have a DB of all the "cities and areas" of the world and there are a good 3.5 million records in the database I believe.
My question to you is how would you guys recommend doing a 1 field autocomplete form for all the cities? I don't need advice on the autocomplete form itself, I need advice on HOW and WHERE I should be storing the data... text files? SQL? Up until now, I have been using SQL but I don't know how it should be done. Would an AJAX autoloader be able to handle it if I only returned 100 records or so? Should all the results be preloaded?
Thanks for your help guys!
EDIT: I have actually found another way to do it. I found this awesome little plugin to integrate Google Maps with it
http://xilinus.com/jquery-addresspicker/demos/index.html
Fantastic.
Benny
I have a few thoughts here:
Since you don't know whether a user will enter the English or local (native) name, each city record in your database should have both. Make sure to index these fields.
Do not do auto-complete until you have a minimum number of characters. Otherwise, you will match way too many rows in your table. For example, assuming an even distribution of English characters (26), at 3.5 million records you would statistically get the following number of matches per prefix length:
1 char = 135k
2 char = 5.2k
3 char = 200
4 char = 8
If you are using MySQL you will want to use the LIKE operator.
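A sketch of the kind of query I mean (MySQL; table and column names are illustrative):
CREATE INDEX idx_cities_name_en    ON cities (name_english);
CREATE INDEX idx_cities_name_local ON cities (name_local);

-- Only run this once the user has typed at least 3 characters; the trailing-only
-- wildcard keeps LIKE index-friendly, and the UNION lets each branch use its own index.
SELECT name_english AS name, country FROM cities WHERE name_english LIKE 'par%'
UNION
SELECT name_local   AS name, country FROM cities WHERE name_local   LIKE 'par%'
LIMIT 100;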
There are much more advanced methods for predictive matching, but this should be a good start.

Looking up a word's sentences in a corpus of 15 million words

I have a corpus of 15 million words, which I'd like to store in a database. I'd then like to be able to find, for a given word, its context within the corpus. For example, for the word "friends" I might select the following, where I also select the five words before and after each occurrence of "friends":
... night i went to my FRIENDS house for a cup of tea ...
... what did you say my FRIENDS cat is sick and ...
... if you like my FRIENDS dad can pick you up ...
How best might I organise my database to efficiently select for a given word in such a manner? I usually use sqlite when I need a database but maybe something else is better in this case.
If you want to find a word in a corpus, then you need full text search capabilities. SQLite does actually offer such capabilities as an extension, which are explained here.
Full text search is going to return the document that matches a given query. You will first need to break up the corpus into separate documents. Usually, this is a very easy task -- the documents might be emails, or customer service records, or doctor's notes, or reports, or whatever. However, you do not describe what the documents are in your case.
I am not at all familiar with the full-text extensions to SQLite. You might consider other database solutions such as MySQL, which also offers full-text support.
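From the documentation, a minimal setup with SQLite's FTS5 extension looks roughly like this (a sketch; names are illustrative, and each "document" here is assumed to be one line or sentence of the corpus):
CREATE VIRTUAL TABLE corpus USING fts5(body);

INSERT INTO corpus (body) VALUES
    ('night i went to my friends house for a cup of tea'),
    ('what did you say my friends cat is sick and');

-- Find entries containing "friends" and show a window of surrounding words:
SELECT snippet(corpus, 0, '', '', '...', 11) AS context
FROM corpus
WHERE corpus MATCH 'friends';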