What's the best way to store the bible in SQL? - sql

What I'm looking for is a breakdown of table names w/ corresponding fields/types.
The bible I want to store will be in English and needs to support the following:
Books
Chapters
Section Titles (can show up within verses and in-between verses)
Smallcaps Text
Red Letter Text
Verse Numbers
Footnotes (can show up within verses and within section titles) (may optionally reference another verse)
Cross-references (essentially a footnote that only references another verse and doesn't add any commentary)
Anything else I'm forgetting

Here is another collection / example for you:
https://github.com/scrollmapper/bible_databases
Here you will see SQL, XML, CSV, and JSON. Of special note is the cross-reference table (quite extensive and amazing) and a simple verse-id system for fast queries.
EDIT: Notice the ids of the tables are book-chapter-verse combinations, always unique.

SQL is the BEST way to store this. Considering your requirement we can split them into two major parts
Information that's dependent on individual version
Small caps
Red letter print
Information that isn't dependent on individual version
Book, Chapter, Verse numbers
Section title
Foot notes (??????)
Cross Reference
Commentary
For various reasons I prefer to store the whole bible project into one SINGLE table, Yes call it as bible
For your visual here is my screen I have stored nearly 15 versions of bible in single table. Luckily the different version names are just kept as column wide. When you add more version in future your table grows horizontally which is okay thus the # of rows remain constant(31102). Also, I will request you to realize the convenience of keeping the combination of ('Book, Chapter, Verse') as the PRIMARY key because in most situations that's the look-up way.
That said here is my recommended table structure.
CREATE TABLE IF NOT EXISTS `bible` (
`id` int(11) NOT NULL AUTO_INCREMENT, --Global unique number or verse
`book` varchar(25) NOT NULL, --Book, chapter, verse is the combined primary key
`chapter` int(11) NOT NULL,
`verse` int(11) NOT NULL,
`section_title` varchar(250) NOT NULL, -- Section title, A section starts from this verse and spans across following verses until it finds a non-empty next section_title
`foot_note` varchar(1000) NOT NULL, -- Store foot notes here
`cross_reference` int(11) NOT NULL, -- Integer/Array of integers, Just store `id`s of related verses
`commentary` text NOT NULL, -- Commentary, Keep adding more columns based on commentaries by difference authors
`AMP` text NOT NULL, -- Keep, keep, keep adding columns and good luck with future expansion
`ASV` text NOT NULL,
`BENG` text NOT NULL,
`CEV` text NOT NULL,
PRIMARY KEY (`book`,`chapter`,`verse`),
KEY `id` (`id`)
)
Oh, What about the Small caps and Red letters?
Well, Small caps & Red letters you can store in version columns using HTML or appropriate formats. In the interface you can strip them off based on user's choice whether he requires red letter or small caps.
For reference, you can download the SQLs from below and customize in your way
Bibles in SQL format

Rather than reinventing the wheel, you might consider using a "Bible SDK" such as AV Bible, which stores text, formatting, verse numbers, etc. in an open, custom binary format.
I think they have everything you've listed except cross-references.

I also found http://www.lyricue.org/downloads/ that includes several bible translations in mysql format.

This repository contains entire Bible given in sql.
https://github.com/godlytalias/Bible-Database

Everything WernerCD's answer, but store the verseText as xml so you can add formatting tags like <red>e.g. Red Text</red> and use the tags to format it in your application

Mark Rushakoff's answer is probably the best for you specific need. However more generally if need to store content that either has data within the content or if you need to store data about the content a Content Management System is typically used. You can build your own (which WernerCD's answer had a table structure for) or use a CMS product. The list here shows the wide variety of technologies used (around 30 in this list use MySQL)

expanding the DB horizontally isn't very efficient with the potential of having very large tables and complex updates. so id, book, chapter, verse, V1, V2, V3, V4... Vn just seems to be looking at the problem like a spreadsheet rather than taking advantage of what a DB has to offer.
the references are static (book, chapter, verse) so they can be populated in one table with an id and with that you have the framework of the entire bible. the verse content can potentially have hundreds of versions so it would be better stored in its own table and linked with a foreign key to identifying the references. the structure would be primary_id, foreign_id, version, content.
now the content just fills in as needed and there is no need to have thousands of empty fields that in the future you have to go back and fill in or needing to expand the table and backfill all the existing data every time you add a new version. just fill in the verses as you get them which works better I think if you building it yourself.
This also makes sense as some versions only have the NT or some verses that they think were added later aren't available so there is no need to have empty fields you just have the data and it links to the verse reference. "version" can also be a foreign key to identify more information in the version like a publishing date or long/short name (ie. "NIV", "New International Version") This also works well when using more than one revision of a translation like the 1984 NIV vs 2011 NIV. Both can be identified as "NIV" but differ in content so the version_id can link another table with expanded information about the version it's using. Once that data is in and linked properly you can display it however you wish for example combining the publishing date/short version name making a name like "NIV1984" or have a separate column unique for a display name.
I'm not sure how red letter or footnotes could be displayed and I know sites like biblegateway have this as a toggle switch so it's nice to have the option to sort it like this. with red letters, this could be a special static identifier directly in the verse content that is parsed out later as a CSS identifier. It could be its own foreign table too but since it is so little a delimiter would be really easy. It really depends what you're using the data for and if you wanted to do queries for the red letters then it would be best as a foreign table (fast) rather then search the db for the delimiter (slow)
with footnotes, since it depends on unique content it would be best stored in its own table. how it is identified and placed in the content could use static reference points within the content like x number of characters in or x number of words in and then linked with the content using a foreign key again. So the footnote table could be something like primary_id, foreign_id, reference, footnote and an example of the data could be 2304, 452, 64, "some manuscripts don't include this". Primary key would be 2304, the foreign key that links to the content table is 452, the footnote is placed 64 characters in, and the footnote is "some manuscripts don't include this" As for the incrementing footnote like A, B, C or 1, 2, 3 all of this can be dynamically generated. If it's important to be a static letter/number then you might want to include it in the data but I would rather have good data that allows this automatically then list it as static data.
here's the hint, Don't add hundreds of columns... that would just a headache and it's spreadsheet thinking. it's better to work through the perfect queries to link tables with the right content.

Related

How do I perform a query in Postgres using a URL slug?

Let's say I have a URL as Follows:
www.somewebsite.com/dining/caseys+grille
I have a business_listings table in Postgres that contains a column business_name. I have a record in the table with 'Casey's Grille'
How can I query 'caseys+grill' against 'Casey's Grille'?
Would I need to use full text search? How would I go about doing this?
Since you are not searching for regular words, but for proper names, and you probably also want to find results that are similar in spelling, you should use trigram GIN indexes and similarity search.
This problem looks simple at first, but it is a can of worms.
The solution should consider all the use cases: is it only a matter of removing/rewriting special characters? Do you need to consider typos (is casey grill the same)? Do you need to consider distinctive marks (is Casey's Grill #2 the same)? Do you need to consider abbreviations (is NY Grill the same as New-York Grill?) Do you need to consider numbers (is 1st av. Grill the same as first avenue grill)?
If it is your database + website, the simplest is to record/compare the URL slug directly.
Else, or if you don't control the URL (like if it is the result of a search box), you may want to store/compare a parsed name. Using both the DB title and the URL slug, you transform the name to common elements. For example you change common abbreviations to their full text, you remove all special characters, you remove/add space, if your language has accents you can remove them, standardize the casing etc. Only you can find and apply the suitable transformations.
Then you can compare the two parsed named, using any suitable comparison method (trigram, plain equality, like queries etc)
I assume you actually want a single slug of the text value in business_name and you want this to be a unique identifier for this particular business.
You can create an additional column business_name_slug and create a unique index on this column.
Then you can create a before insert or update trigger that writes the slug created from business_name into this column.
The tricky part is to create a logic that
generates an url friendly version of the the business name (there should be some example in Blog Posts,Githuhib Gists etc., for example)
avoids naming collisions so your unique constraint will not raise an error when inserting/updating

FTS doesn't work as expected with emails with dots

We're developing a search as a part of a bigger system.
We have Microsoft SQL Server 2014 - 12.0.2000.8 (X64) Standard Edition (64-bit) with this setup:
CREATE TABLE NewCompanies(
[Id] [uniqueidentifier] NOT NULL,
[Name] [nvarchar](400) NOT NULL,
[Phone] [nvarchar](max) NULL,
[Email] [nvarchar](max) NULL,
[Contacts1] [nvarchar](max) NULL,
[Contacts2] [nvarchar](max) NULL,
[Contacts3] [nvarchar](max) NULL,
[Contacts4] [nvarchar](max) NULL,
[Address] [nvarchar](max) NULL,
CONSTRAINT PK_Id PRIMARY KEY (Id)
);
Phone is a structured comma separated digits string like
"77777777777, 88888888888"
Email is structured emails string with commas like
"email1#gmail.com, email2#gmail.com" (or without commas at all like
"email1#gmail.com")
Contacts1, Contacts2, Contacts3, Contacts4 are text fields where users can specify contact details in free form. Like "John Smith +1 202 555 0156" or "Bob, +1-999-888-0156, bob#company.com". These fields can contain emails and phones we want to search further.
Here we create full-text stuff
-- FULL TEXT SEARCH
CREATE FULLTEXT CATALOG NewCompanySearch AS DEFAULT;
CREATE FULLTEXT INDEX ON NewCompanies(Name, Phone, Email, Contacts1, Contacts2, Contacts3, Contacts4, Address)
KEY INDEX PK_Id
Here is a data sample
INSERT INTO NewCompanies(Id, Name, Phone, Email, Contacts1, Contacts2, Contacts3, Contacts4)
VALUES ('7BA05F18-1337-4AFB-80D9-00001A777E4F', 'PJSC Azimuth', '79001002030, 78005005044', 'regular#hotmail.com, s.m.s#gmail.com', 'John Smith', 'Call only at weekends +7-999-666-22-11', NULL, NULL)
Actually we have about 100 thousands of such records.
We expect users can specify a part of email like "#gmail.com" and this should return all the rows with Gmail email addresses in any of Email, Contacts1, Contacts2, Contacts3, Contacts4 fields.
The same for phone numbers. Users can search for a pattern like "70283" and a query should return phones with these digits in them. It's even for free form Contacts1, Contacts2, Contacts3, Contacts4 fields where we probably should remove all but digits and space characters firstly before searching.
We used to use LIKE for the search when we had about 1500 records and it worked fine but now we have a lot of records and the LIKE search takes infinite to get results.
This is how we try to get data from there:
SELECT * FROM NewCompanies WHERE CONTAINS((Email, Contacts1, Contacts2, Contacts3, Contacts4), '"s.m.s#gmail.com*"') -- this doesn't get the row
SELECT * FROM NewCompanies WHERE CONTAINS((Phone, Contacts1, Contacts2, Contacts3, Contacts4), '"6662211*"') -- doesn't get anything
SELECT * FROM NewCompanies WHERE CONTAINS(Name, '"zimuth*"') -- doesn't get anything
Actually requests
SELECT [...] CONTAINS([...], '"6662211*"') -- doesn't get anything
against 'Call only at weekends +7-999-666-22-11'
and
SELECT [...] CONTAINS(Name, '"zimuth*"') -- doesn't get anything
against 'PJSC Azimuth'
do work as expected.
See Prefix Term. Because 6662211* is not a prefix of +7-999-666-22-11 as well as zimuth* is not a prefix of Azimuth
As for
SELECT [...] CONTAINS([...], '"s.m.s#gmail.com*"') -- this doesn't get the row
This is probably due to word breakers as alwayslearning pointed out in comments. See word-breakers
I don't think that Full-Text Search is applicable for your task.
Why use for FTS in the exact same tasks that LIKE operator is used for? If there were a better index type for LIKE queries... then there would be the better index type, not the totally different technology and syntax.
And in no way it will help you to match "6662211*" against "666some arbitrary char22some arbitrary char11".
Full Text search is not about regex-es (and "6662211*" is not even a correct expression for the job - there is nothing about "some arbitrary char" part) it's about synonyms, word forms, etc.
But is it at all possible to search for substrings effectively?
Yes it is. Leaving aside such prospects as writing your own search engine, what can we do within SQL?
First of all - it is an imperative to cleanup your data!
If you want to return to the users the exact strings they have entered
users can specify contact details in free form
...you can save them as is... and leave them along.
Then you need to extract data from the free form text (it is not so hard for emails and phone numbers) and save the data in some canonical form.
For email, the only thing you really need to do - make them all lowercase or uppercase (doesn't matter), and maybe split then on the # sing. But in phone numbers you need to leave only digits
(...And then you can even store them as numbers. That can save you some space and time. But the search will be different... For now let's dive into a more simple and universal solution using strings.)
As MatthewBaker mentioned you can create a table of suffixes.
Then you can search like so
SELECT DISTINCT * FROM NewCompanies JOIN Sufficies ON NewCompanies.Id = Sufficies.Id WHERE Sufficies.sufficies LIKE 'some text%'
You should place the wildcard % only at the end. Or there would be no benefits from the Suffixes table.
Let take for example a phone number
+7-999-666-22-11
After we get rid of waste chars in it, it will have 11 digits. That means we'll need 11 suffixes for one phone number
1
11
211
2211
62211
662211
6662211
96662211
996662211
9996662211
79996662211
So the space complexity for this solution is linear... not so bad, I'd say... But wait it's complexity in the number of records. But in symbols... we need N(N+1)/2 symbols to store all the suffixes - that is quadratic complexity... not good... but if you have now 100 000 records and do not have plans for millions in the near future - you can go with this solution.
Can we reduce space complexity?
I will only describe the idea, implementing it will take some effort. And probably we'll need to cross the boundaries of SQL
Let's say you have 2 rows in NewCompanies and 2 strings of free form text in it:
aaaaa
11111
How big should the Suffixes table be? Obviously, we need only 2 records.
Let's take another example. Also 2 rows, 2 free text strings to search for. But now it's:
aa11aa
cc11cc
Let's see how many suffixes do we need now:
a // no need, LIKE `a%` will match against 'aa' and 'a11aa' and 'aa11aa'
aa // no need, LIKE `aa%` will match against 'aa11aa'
1aa
11aa
a11aa
aa11aa
c // no need, LIKE `c%` will match against 'cc' and 'c11cc' and 'cc11cc'
cc // no need, LIKE `cc%` will match against 'cc11cc'
1cc
11cc
c11cc
cc11cc
No so bad, but not so good either.
What else can we do?
Let's say, user enters "c11" in the search field. Then LIKE 'c11%' needs 'c11cc' suffix to succeed. But if instead of searching for "c11" we first search for "c%", then for "c1%" and so on? The first search will give as only one row from NewCompanies. And there would be no need for subsequent searches. And we can
1aa // drop this as well, because LIKE '1%' matches '11aa'
11aa
a11aa // drop this as well, because LIKE 'a%' matches 'aa11aa'
aa11aa
1cc // same here
11cc
c11cc // same here
cc11cc
and we end up with only 4 suffixes
11aa
aa11aa
11cc
cc11cc
I can't say what the space complexity would be in this case, but it feels like it would be acceptable.
In cases like this full text searching is less than ideal. I was in the same boat as you are. Like searches are too slow, and full text searches search for words that start with a term rather than contains a term.
We tried several solutions, one pure SQL option is to build your own version of full text search, in particular an inverted index search. We tried this, and it was successful, but took a lot of space. We created a secondary holding table for partial search terms, and used full text indexing on that. However this mean we repeatedly stored multiple copies of the same thing. For example we stored "longword" as Longword, ongword, ngword, gword.... etc. So any contained phrase would always be at the start of the indexed term. A horrendous solution, full of flaws, but it worked.
We then looked at hosting a separate server for lookups. Googling Lucene and elastisearch will give you good information on these off the shelf packages.
Eventually, we developed our own in house search engine, which runs along side SQL. This has allowed us to implement phonetic searches (double metaphone) and then using levenshtein calculations along side soundex to establish relevance. Overkill for a lot of solutions, but worth the effort in our use case. We even now have an option of leveraging Nvidia GPUs for cuda searches, but this represented a whole new set of headaches and sleepless nights. Relevance of all these will depend on how often you see your searches being performed, and how reactive you need them to be.
Full-Text Indexes have a number of limitations. You can use wildcards on words that the index finds are whole "parts" but even then you are constrained to the ending part of the word. That is why you can use CONTAINS(Name, '"Azimut*"') but not CONTAINS(Name, '"zimuth*"')
From the Microsoft documentation:
When the prefix term is a phrase, each token making up the phrase is
considered a separate prefix term. All rows that have words beginning
with the prefix terms will be returned. For example, the prefix term
"light bread*" will find rows with text of "light breaded," "lightly
breaded," or "light bread," but it will not return "lightly toasted
bread."
The dots in the email, as indicated by the title, are not the main issue. This, for example, works:
SELECT * FROM NewCompanies
WHERE CONTAINS((Email, Contacts1, Contacts2, Contacts3, Contacts4), 's.m.s#gmail.com')
In this case, the index identifies the whole email string as valid, as well as "gmail" and "gmail.com." Just "s.m.s" though is not valid.
The last example is similar. The parts of the phone number are indexed (666-22-11 and 999-666-22-11 for example), but removing the hyphens is not a string that the index is going to know about. Otherwise, this does work:
SELECT * FROM NewCompanies
WHERE CONTAINS((Phone, Contacts1, Contacts2, Contacts3, Contacts4), '"666-22-11*"')

Advice on sql naming conventions

I'm looking for some advice on SQL naming conventions. I know this topic has been discussed before but my question is a little more specific and I cannot find an answer elsewhere.
I have some integer variables - generally they would have a name like 'Timeout'. Is there an adopted standard prefixing/suffixing the value so that I know what it contains when I come back to it in 6 months time?
For instance is it 'TimeoutMilliseconds'.
I'm not talking about labelling every variable this way, just those with generic values.
Lookup ISO-11179 for the international database naming standard. for this you can grab this online for free download (though sorry I forget where). There is a lot in it, so here are some some basic summary form it:
Take your field description, remove joining words and write it backwards.
Always end with a class name. There are standard abbreviations like ID for identifier and such.
eg:
Date of Entry:
Entry_Date
Seconds_For_Delivery:
Delivery_Seconds
Name of Widget:
Widget_Name
Location of Widget:
Widget_Location
Size of Widget:
Widget_Size
Also a field should have the same name if it is a primary key or a referenced foreign key. This will pay off in readability for people that come after you, and also most DB tools will assume they are matching keys so you will also save time in using reporting tools and the like (less manual stuffing around putting links in by hand).
In the above examples, the class names are date, seconds, name, location, size. It surprises me that this ISO is not more well known.

Internationalization in .NET. Database driven? Resources interactively?

I am working on the internationalization of a CMS in .NET (VB.NET). In the administration part we used resources, but for the clients we want a UI so someone can see the labels and translate them from one language to another.
So, the first thought was to do it database driven with three tables:
Label Translation Language
----- ----------- --------
id id id
name keyname_id name
filename language_id
value
And then create an UI so you can allow the client to first select the filename of the page you want to translate, the label, and then select the language he wants and translate it, and it would be stored in the translations table.
I see here a problem: How would I take from the page all the labels?
I also spotted an example of a resources manager that can translate in an interactive way. This is the example.
The benefits of this solution is that you are working with resources, so everything seems easier because some work is done. On the other hand, this structure can be more difficult to implement, I don't know as I'm not experienced on this.
So, what do you think about these two approaches? What is better for you? Maybe there is a third approach that is easier?
EDIT: I also found this link about Resource-provider model. What do you think about it? Maybe it can be useful,but I don't know, maybe it's too much for my purposes. I am thinking where to start
In LedgerSMB, we went with the following approach:
Application strings (and strings in code) get translated by a standard i18n framework (GNU gettext basically).
Business data gets manual translation in the database. So you can add translations to department names, project names,descriptions of goods and services etc.
Our approach to the problem you say is to join the other tables, so we might have:
CREATE TABLE parts (
id int primary key.... -- autoincrements but not relevant to this example
description text,
...
);
CREATE TABLE language (
code varchar(5) primary key, -- like en_US
name text unique,
);
CREATE TABLE parts_translation (
parts_id int not null references parts(id),
language_code varchar(5) not null references language(code),
translation text
);
Then we can query based on desired language at run time.

Tag suggestion (not tag autocomplete)

AJAX autocomplete is fairly simple to implement. However, I wonder how to handle smart tag suggestion like this on SO.
To clarify the difference between autocomplete and suggestion:
autocomplete: foo [foobar, foobaz]
suggestion: foo [barfoo, foobar, foobaz], or even better, with 'did you mean' feature: [barfoo, foobar, foobaz, fobar, fobaz]
I suppose I need some full text search in tags (all letters indexed, not just words). There would be no problem to do it witch regex or other patterns for limited number of tags (even client side).
But how to implement this feature for big number of tags?
Is there any particular reason (besides URL) the tags on SO are dash separated? What about Unicode characters in tags?
I store the tags in the table with the following columns: id, tagname.
My SQL query returns objects with following fields: id, tagname, count
(I use Doctrine ORM and pgsql as default db driver.)
I would go with SELECTING them from database by REGEXP at every keypress. I did this on my sites and the was no prefrormance problem (I do not have heavy loaded server thought). If you do not like this idea, I would cash all 1-5 letters combinations which will users enter and refresh them on daily basis in separate table. If this table is indexed than you have very fast implementation.
To elaborate more on the second appreach:
Briefly: 1. Make a table SEARCHTABLE representing 1-n relationship betwean keywords (limit it to 3-4 letters) and primary IDs of tags. 2. INDEX on both fields. 3. Everytime the user makes a search do look at the SEARCHTABLE and if the combination is there, use that - very fast, as everything is indexed. If not do the regexp search and put all results to SEARCHTABLE.
Notes:
You should invalidate the table if
you add tags, but this should much
less often than a search. When
invalidating table you do not
necesarilly TRUNCATE it, you can
easily rebuild it taking all
keywords into account.
If you want to speed it up, you can "pregenerate" all two or even three
letters searches.
If you care enough, you should be using information from n-1 letter kewords to generate
the n letter keyword. It speeds the things tremendously. Imagine that user has typed "mo"
and you have shown them appropriate result from SEARCHTABLE. Than when she types "n"
giving it "mon" you need only serach trough already selected items to generate new
response.
Hope it is more comprehensive now.