Finding Interlanguage Related Articles from Wiki Dump - wikipedia-api

Finding the full list of Wikipedia's English articles together with their related articles in other languages, such as French and Spanish, is a problem for which there seems to be no answer. You can find some similar questions, but most of them relate to the previous structure of Wikipedia, and the others have been left without a correct answer.
We can download the dump file of Wikipedia's English and Spanish articles from here:
English Wiki and Spanish Wiki.
There is some data named langlinks (aka sitelinks) in both enwiki and eswiki, intended for finding interlanguage related articles. But it's not clear how to use it to find them (the Spanish article related to each English one). The langlinks schema looks like this:
CREATE TABLE `langlinks` (
`ll_from` int(10) unsigned NOT NULL DEFAULT '0',
`ll_lang` varbinary(20) NOT NULL DEFAULT '',
`ll_title` varbinary(255) NOT NULL DEFAULT '',
UNIQUE KEY `ll_from` (`ll_from`,`ll_lang`),
KEY `ll_lang` (`ll_lang`,`ll_title`)
) ENGINE=InnoDB DEFAULT CHARSET=binary;
Is a record with a particular 'll_from' value in the English file related to the record with the same 'll_from' value in the Spanish file? If yes, why can't I find records with the same ll_from value in these two langlinks files?
Again, how can I use these langlinks files to find interlanguage related articles? I don't want to use other tools like the Wikidata Toolkit.

This page is helpful: Manual:langlinks table
Fields
ll_from
page_id of the referring page.
ll_lang
Language code of the target, in the ISO 639-1 standard.
ll_title
Title of the target, including namespace (FULLPAGENAMEE style).
As shown in the schema, the combination of ll_from and ll_lang is unique (the key on ll_lang and ll_title is an ordinary, non-unique index).
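Since ll_from is the page_id of the referring page, an ll_from value in the enwiki file is an English page id and one in the eswiki file is a Spanish page id, so the two files are not expected to share ids. A minimal sketch of one way to pair titles follows, assuming the enwiki `page` table (with `page_id` and `page_title`) has been loaded alongside langlinks. SQLite stands in for MySQL here, the rows are made-up sample data, and `titles_in` is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-ins for tables loaded from the enwiki dump (assumption).
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT NOT NULL);
CREATE TABLE langlinks (
    ll_from  INTEGER NOT NULL,  -- page_id of the referring (English) page
    ll_lang  TEXT NOT NULL,     -- language code of the target
    ll_title TEXT NOT NULL,     -- title of the target page
    UNIQUE (ll_from, ll_lang)
);
""")
# Illustrative rows only, not taken from a real dump.
conn.executemany("INSERT INTO page VALUES (?, ?)",
                 [(1, "Table_(furniture)"), (2, "Desk"), (3, "Page_without_links")])
conn.executemany("INSERT INTO langlinks VALUES (?, ?, ?)",
                 [(1, "es", "Mesa"), (1, "fr", "Table (meuble)"), (2, "es", "Escritorio")])


def titles_in(conn, lang):
    """Return (english_title, target_title) pairs for one target language."""
    rows = conn.execute(
        """
        SELECT p.page_title, ll.ll_title
        FROM page AS p
        JOIN langlinks AS ll ON ll.ll_from = p.page_id
        WHERE ll.ll_lang = ?
        ORDER BY p.page_id
        """,
        (lang,),
    )
    return rows.fetchall()


pairs = titles_in(conn, "es")
print(pairs)
```

Pages without a link in the requested language simply do not appear in the result; a LEFT JOIN would keep them with an empty target title.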

Related

SQL - Referencing a table at a specific position in content?

I am not sure if this is possible, and I'm also not sure if I am using the correct terminology here, so forgive and correct me if I'm not. Also, this question is about database design more generally.
Say I have something like Article:
Title: Stem cells are soon being used for stuff
Text: "Here is the content for an article about stuff. Here is some more info on stem cells and stuff. [To the uninitiated, here comes an Info Box on Stem Cells in general, you can expand it!] Now some more text about stem cells and stuff"
In my app I would like to display the article, and then at an exact position (here after sentence no. 2, but this will vary from article to article) insert an info-box on stem cells, which is in its own SQL table.
I know that the idea of SQL in general is that I reference InfoBox in my Article and simply point to it. That would be the relation between article and infoBox.
But how do I specify that the infoBox should come exactly after sentence no. 2 (as in the example)? This will not always be the case: sometimes there might be no infoBox for an article, or several, and sometimes it will come after sentence no. 25 or 100, etc.
I don't want to mix relations/fields and content, but I don't understand how I would realise something like this in SQL.
A table (base table or query result) holds rows of values that participate in a relation(ship)/association. Those are the rows that make the table's associated (characteristic) predicate (sentence template parameterized by column names) into a true proposition (statement).
You seem to want a base table for triples where
infoBox [i] should come exactly after Sentence No. [n] of article [a]
PS Time to read a published academic textbook on information modeling & database design. (Manuals for languages & tools to record & use designs are not textbooks on doing information modeling & database design.)
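The triple table suggested above could be sketched roughly as follows. The table and column names (`article_info_box`, `sentence_no`, and so on), the sample rows, and the naive split on full stops are all assumptions made for illustration; SQLite is used only so the sketch runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article  (id INTEGER PRIMARY KEY, title TEXT, body TEXT);
CREATE TABLE info_box (id INTEGER PRIMARY KEY, content TEXT);
-- One row per true statement of the form:
-- "infoBox [info_box_id] should come exactly after sentence no. [sentence_no]
--  of article [article_id]"
CREATE TABLE article_info_box (
    article_id  INTEGER NOT NULL REFERENCES article(id),
    sentence_no INTEGER NOT NULL,
    info_box_id INTEGER NOT NULL REFERENCES info_box(id),
    PRIMARY KEY (article_id, sentence_no, info_box_id)
);
""")
conn.execute("INSERT INTO article VALUES (1, 'Stem cells', "
             "'First sentence. Second sentence. Third sentence.')")
conn.execute("INSERT INTO info_box VALUES (7, 'Info box on stem cells')")
conn.execute("INSERT INTO article_info_box VALUES (1, 2, 7)")


def render(conn, article_id):
    """Return the article as a list of sentences with info boxes interleaved."""
    (body,) = conn.execute(
        "SELECT body FROM article WHERE id = ?", (article_id,)
    ).fetchone()
    # Naive sentence split on full stops; a real app would need something better.
    sentences = [s.strip() + "." for s in body.split(".") if s.strip()]
    boxes = {}
    for sentence_no, content in conn.execute(
        """
        SELECT aib.sentence_no, ib.content
        FROM article_info_box AS aib
        JOIN info_box AS ib ON ib.id = aib.info_box_id
        WHERE aib.article_id = ?
        ORDER BY aib.sentence_no, ib.id
        """,
        (article_id,),
    ):
        boxes.setdefault(sentence_no, []).append(content)
    parts = []
    for n, sentence in enumerate(sentences, start=1):
        parts.append(sentence)
        parts.extend(f"[{content}]" for content in boxes.get(n, []))
    return parts


rendered = render(conn, 1)
print(rendered)
```

An article with no info box just has no rows in `article_info_box`, and one with several has several rows, so the content itself never needs to carry placement markers.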

How to localize sql server data?

We have a requirement to develop an application that supports multiple languages (English, German, French, Russian). We know we can use ASP.NET localization to localize the static text of a web form, but what would be the approach for localizing the data in a SQL Server database?
For example, my database schema is something like this:
Table-Questions
QID-PK
Question
CreatedBy
Table- Answer
AID-PK
QID- FK
Answer
AddedBy
In the above schema, I want the column "Question" from the Questions table and the column "Answer" from the Answer table to hold localized values.
How do I achieve this?
Add a Language table:
LanguageID-PK
LanguageIdentifier (name as accepted by CultureInfo's constructor, e.g. "de" for German)
Add a TranslatedQuestion table:
TQID-PK
QID-FK
LanguageID
Translation
Likewise, add a TranslatedAnswer table:
TAID-PK
AID-FK
LanguageID
Translation
This way, of course, there is nothing in the data model to guarantee that every question/answer has a translation for a given language. But you can always fall back to the untranslated question/answer.
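The fallback to the untranslated text can be done in a single query with LEFT JOIN and COALESCE. This is a minimal sketch on SQLite rather than SQL Server, showing only the question side. The UNIQUE constraint on (QID, LanguageID), the sample rows, and the helper name `localized_questions` are assumptions, not part of the answer above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Questions (QID INTEGER PRIMARY KEY, Question TEXT NOT NULL);
CREATE TABLE Language (
    LanguageID INTEGER PRIMARY KEY,
    LanguageIdentifier TEXT NOT NULL UNIQUE   -- e.g. 'de'
);
CREATE TABLE TranslatedQuestion (
    TQID INTEGER PRIMARY KEY,
    QID INTEGER NOT NULL REFERENCES Questions(QID),
    LanguageID INTEGER NOT NULL REFERENCES Language(LanguageID),
    Translation TEXT NOT NULL,
    UNIQUE (QID, LanguageID)   -- assumption: one translation per language
);
INSERT INTO Questions VALUES (1, 'What is your name?'), (2, 'Where do you live?');
INSERT INTO Language VALUES (1, 'de'), (2, 'fr');
INSERT INTO TranslatedQuestion VALUES (1, 1, 1, 'Wie heißt du?');
""")


def localized_questions(conn, language_identifier):
    """Return (QID, text) rows, using the original text when no translation exists."""
    return conn.execute(
        """
        SELECT q.QID, COALESCE(t.Translation, q.Question)
        FROM Questions AS q
        LEFT JOIN Language AS l
               ON l.LanguageIdentifier = ?
        LEFT JOIN TranslatedQuestion AS t
               ON t.QID = q.QID AND t.LanguageID = l.LanguageID
        ORDER BY q.QID
        """,
        (language_identifier,),
    ).fetchall()


german = localized_questions(conn, "de")
russian = localized_questions(conn, "ru")   # language not in the table at all
print(german)
print(russian)
```

Question 2 has no German translation, so the German result mixes translated and original text; an unknown language falls back to the originals for every row.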
Add a culture column to the table, then repeat the questions and answers in the culture specific format.

Internationalization in .NET. Database driven? Resources interactively?

I am working on the internationalization of a CMS in .NET (VB.NET). In the administration part we used resources, but for the clients we want a UI so someone can see the labels and translate them from one language to another.
So, the first thought was to do it database driven with three tables:
Label       Translation    Language
-----       -----------    --------
id          id             id
name        keyname_id     name
filename    language_id
            value
And then create a UI that lets the client first select the filename of the page to translate, then the label, and then the target language, and enter the translation, which would be stored in the Translation table.
I see a problem here: how would I collect all the labels from a page?
I also spotted an example of a resources manager that can translate in an interactive way. This is the example.
The benefit of this solution is that you are working with resources, so everything seems easier because some of the work is already done. On the other hand, this structure may be more difficult to implement; I don't know, as I'm not experienced with this.
So, what do you think about these two approaches? What is better for you? Maybe there is a third approach that is easier?
EDIT: I also found this link about the resource-provider model. What do you think about it? Maybe it can be useful, but I don't know; maybe it's too much for my purposes. I am wondering where to start.
In LedgerSMB, we went with the following approach:
Application strings (and strings in code) get translated by a standard i18n framework (GNU gettext basically).
Business data gets manual translation in the database. So you can add translations to department names, project names, descriptions of goods and services, etc.
Our approach to the problem you describe is to join to translation tables, so we might have:
CREATE TABLE parts (
id int primary key.... -- autoincrements but not relevant to this example
description text,
...
);
CREATE TABLE language (
code varchar(5) primary key, -- like en_US
name text unique
);
CREATE TABLE parts_translation (
parts_id int not null references parts(id),
language_code varchar(5) not null references language(code),
translation text
);
Then we can query based on desired language at run time.
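The run-time query mentioned above might look something like this. It is only a sketch: SQLite stands in for the real database, the sample rows are invented, `translated_parts` is a hypothetical helper, and the composite primary key on `parts_translation` is an addition that the schema above does not have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parts (id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE language (code VARCHAR(5) PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE parts_translation (
    parts_id INTEGER NOT NULL REFERENCES parts(id),
    language_code VARCHAR(5) NOT NULL REFERENCES language(code),
    translation TEXT,
    PRIMARY KEY (parts_id, language_code)  -- assumption: one translation per part and language
);
INSERT INTO parts VALUES (1, 'Hammer'), (2, 'Screwdriver');
INSERT INTO language VALUES ('en_US', 'English (US)'), ('de_DE', 'German (Germany)');
INSERT INTO parts_translation VALUES (1, 'de_DE', 'Hammer'), (2, 'de_DE', 'Schraubendreher');
""")


def translated_parts(conn, language_code):
    """Return (part id, translated description) rows for the desired language."""
    return conn.execute(
        """
        SELECT p.id, pt.translation
        FROM parts AS p
        JOIN parts_translation AS pt ON pt.parts_id = p.id
        WHERE pt.language_code = ?
        ORDER BY p.id
        """,
        (language_code,),
    ).fetchall()


german_parts = translated_parts(conn, "de_DE")
print(german_parts)
```

The language code is passed as a bound parameter, so the same query serves every language chosen at run time.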

What's the best way to store the bible in SQL?

What I'm looking for is a breakdown of table names w/ corresponding fields/types.
The bible I want to store will be in English and needs to support the following:
Books
Chapters
Section Titles (can show up within verses and in-between verses)
Smallcaps Text
Red Letter Text
Verse Numbers
Footnotes (can show up within verses and within section titles) (may optionally reference another verse)
Cross-references (essentially a footnote that only references another verse and doesn't add any commentary)
Anything else I'm forgetting
Here is another collection / example for you:
https://github.com/scrollmapper/bible_databases
Here you will see SQL, XML, CSV, and JSON. Of special note is the cross-reference table (quite extensive and amazing) and a simple verse-id system for fast queries.
EDIT: Notice the ids of the tables are book-chapter-verse combinations, always unique.
SQL is the BEST way to store this. Considering your requirements, we can split them into two major parts:
Information that's dependent on individual version
Small caps
Red letter print
Information that isn't dependent on individual version
Book, Chapter, Verse numbers
Section title
Foot notes (??????)
Cross Reference
Commentary
For various reasons I prefer to store the whole bible project in one SINGLE table; call it bible.
I have stored nearly 15 versions of the bible in a single table, with the different versions simply kept as separate columns. When you add more versions in the future, your table grows horizontally, which is okay, because the number of rows remains constant (31102). Also, I would ask you to note the convenience of keeping the combination of (book, chapter, verse) as the PRIMARY key, because in most situations that's how verses are looked up.
That said here is my recommended table structure.
CREATE TABLE IF NOT EXISTS `bible` (
`id` int(11) NOT NULL AUTO_INCREMENT, -- Globally unique number of the verse
`book` varchar(25) NOT NULL, -- Book, chapter, verse is the combined primary key
`chapter` int(11) NOT NULL,
`verse` int(11) NOT NULL,
`section_title` varchar(250) NOT NULL, -- Section title, A section starts from this verse and spans across following verses until it finds a non-empty next section_title
`foot_note` varchar(1000) NOT NULL, -- Store foot notes here
`cross_reference` int(11) NOT NULL, -- Integer/Array of integers, Just store `id`s of related verses
`commentary` text NOT NULL, -- Commentary, Keep adding more columns based on commentaries by difference authors
`AMP` text NOT NULL, -- Keep, keep, keep adding columns and good luck with future expansion
`ASV` text NOT NULL,
`BENG` text NOT NULL,
`CEV` text NOT NULL,
PRIMARY KEY (`book`,`chapter`,`verse`),
KEY `id` (`id`)
)
Oh, What about the Small caps and Red letters?
Well, small caps and red letters can be stored in the version columns using HTML or another appropriate format. In the interface you can strip them out depending on whether the user wants red letters or small caps.
For reference, you can download the SQLs from below and customize in your way
Bibles in SQL format
Rather than reinventing the wheel, you might consider using a "Bible SDK" such as AV Bible, which stores text, formatting, verse numbers, etc. in an open, custom binary format.
I think they have everything you've listed except cross-references.
I also found http://www.lyricue.org/downloads/ that includes several bible translations in mysql format.
This repository contains entire Bible given in sql.
https://github.com/godlytalias/Bible-Database
Everything in WernerCD's answer, but store the verseText as XML so you can add formatting tags like <red>e.g. Red Text</red> and use the tags to format it in your application.
Mark Rushakoff's answer is probably the best for your specific need. However, more generally, if you need to store content that either has data within the content, or if you need to store data about the content, a Content Management System is typically used. You can build your own (which WernerCD's answer had a table structure for) or use a CMS product. The list here shows the wide variety of technologies used (around 30 in this list use MySQL).
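As a rough illustration of using such tags at display time, the sketch below either converts `<red>` spans to a styled HTML span or strips the tags, depending on the reader's preference. The `red-letter` CSS class and the `render_verse` helper are made up for this example, and a regular expression is a shortcut that only suits simple, non-nested tags; a real XML parser would be more robust.

```python
import re

# Matches one non-nested <red>...</red> span.
RED_TAG = re.compile(r"<red>(.*?)</red>", re.DOTALL)


def render_verse(verse_text, red_letters=True):
    """Turn <red> tags into a styled span, or strip them if red letters are off."""
    if red_letters:
        return RED_TAG.sub(r'<span class="red-letter">\1</span>', verse_text)
    return RED_TAG.sub(r"\1", verse_text)


sample = "He said, <red>e.g. Red Text</red> and left."
print(render_verse(sample, red_letters=True))
print(render_verse(sample, red_letters=False))
```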
Expanding the DB horizontally isn't very efficient, with the potential for very large tables and complex updates. So id, book, chapter, verse, V1, V2, V3, V4... Vn just seems to be looking at the problem like a spreadsheet rather than taking advantage of what a DB has to offer.
The references are static (book, chapter, verse), so they can be populated in one table with an id, and with that you have the framework of the entire bible. The verse content can potentially have hundreds of versions, so it would be better stored in its own table and linked with a foreign key identifying the reference. The structure would be primary_id, foreign_id, version, content.
Now the content just fills in as needed, and there is no need to have thousands of empty fields that you have to go back and fill in later, or to expand the table and backfill all the existing data every time you add a new version. Just fill in the verses as you get them, which I think works better if you're building it yourself.
This also makes sense because some versions only have the NT, or leave out verses that they think were added later, so there is no need for empty fields; you just have the data, and it links to the verse reference. "version" can also be a foreign key identifying more information about the version, like a publishing date or long/short name (i.e. "NIV", "New International Version"). This also works well when using more than one revision of a translation, like the 1984 NIV vs. the 2011 NIV. Both can be identified as "NIV" but differ in content, so the version_id can link to another table with expanded information about the version in use. Once that data is in and linked properly, you can display it however you wish, for example by combining the publishing date and short version name into a name like "NIV1984", or by having a separate column for a unique display name.
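One possible shape for this normalized layout is sketched below on SQLite. The table names (`verse_ref`, `version`, `verse_content`), the column names beyond those listed in the answer, and the helper `verse_versions` are assumptions, and the verse text is placeholder content rather than real scripture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Static framework: one row per (book, chapter, verse).
CREATE TABLE verse_ref (
    id INTEGER PRIMARY KEY,
    book TEXT NOT NULL,
    chapter INTEGER NOT NULL,
    verse INTEGER NOT NULL,
    UNIQUE (book, chapter, verse)
);
-- Expanded information about each version or revision.
CREATE TABLE version (
    id INTEGER PRIMARY KEY,
    short_name TEXT NOT NULL,
    long_name TEXT NOT NULL,
    published INTEGER NOT NULL
);
-- primary_id, foreign_id, version, content
CREATE TABLE verse_content (
    id INTEGER PRIMARY KEY,
    verse_ref_id INTEGER NOT NULL REFERENCES verse_ref(id),
    version_id INTEGER NOT NULL REFERENCES version(id),
    content TEXT NOT NULL,
    UNIQUE (verse_ref_id, version_id)
);
INSERT INTO verse_ref VALUES (1, 'John', 11, 35);
INSERT INTO version VALUES
    (1, 'NIV', 'New International Version', 1984),
    (2, 'NIV', 'New International Version', 2011);
INSERT INTO verse_content VALUES
    (1, 1, 1, 'placeholder text A'),
    (2, 1, 2, 'placeholder text B');
""")


def verse_versions(conn, book, chapter, verse):
    """Return (display_name, content) for every stored version of one verse."""
    return conn.execute(
        """
        SELECT v.short_name || v.published AS display_name, c.content
        FROM verse_ref AS r
        JOIN verse_content AS c ON c.verse_ref_id = r.id
        JOIN version AS v ON v.id = c.version_id
        WHERE r.book = ? AND r.chapter = ? AND r.verse = ?
        ORDER BY v.published
        """,
        (book, chapter, verse),
    ).fetchall()


versions = verse_versions(conn, "John", 11, 35)
print(versions)
```

Adding another version means inserting rows into `version` and `verse_content`; no column is added and no existing row is touched.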
I'm not sure how red letters or footnotes should be displayed, and I know sites like biblegateway offer them as a toggle switch, so it's nice to have the option to handle them this way. With red letters, this could be a special static identifier placed directly in the verse content that is parsed out later as a CSS identifier. It could be its own foreign table too, but since there is so little of it, a delimiter would be really easy. It really depends on what you're using the data for: if you want to run queries for the red letters, then a foreign table would be best (fast) rather than searching the db for the delimiter (slow).
With footnotes, since they depend on unique content, they would be best stored in their own table. How a footnote is identified and placed in the content could use static reference points within the content, like x number of characters in or x number of words in, linked to the content using a foreign key again. So the footnote table could be something like primary_id, foreign_id, reference, footnote, and an example of the data could be 2304, 452, 64, "some manuscripts don't include this". The primary key would be 2304, the foreign key that links to the content table is 452, the footnote is placed 64 characters in, and the footnote is "some manuscripts don't include this". As for the incrementing footnote labels like A, B, C or 1, 2, 3, all of this can be dynamically generated. If it's important for it to be a static letter/number, then you might want to include it in the data, but I would rather have good data that allows this automatically than list it as static data.
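A small sketch of the dynamic placement described above: footnote rows carry a character offset, and the numeric markers are generated when the verse is rendered. The row layout follows the (primary_id, foreign_id, reference, footnote) example; the function name `place_footnotes`, the bracketed marker style, and the placeholder content are assumptions.

```python
def place_footnotes(content, footnotes):
    """Insert numbered markers into content at each footnote's character offset.

    footnotes: iterable of (primary_id, foreign_id, reference, footnote) rows,
    where reference is the number of characters into the content.
    Returns the marked-up text and a list of (number, footnote) pairs.
    """
    ordered = sorted(footnotes, key=lambda row: row[2])
    pieces, notes, last = [], [], 0
    for number, (_, _, offset, text) in enumerate(ordered, start=1):
        pieces.append(content[last:offset])
        pieces.append(f"[{number}]")   # marker generated at render time
        notes.append((number, text))
        last = offset
    pieces.append(content[last:])
    return "".join(pieces), notes


# Placeholder content: 64 filler characters followed by the rest of the verse.
content = "x" * 64 + " rest of the verse"
rows = [(2304, 452, 64, "some manuscripts don't include this")]
marked, notes = place_footnotes(content, rows)
print(marked)
print(notes)
```

Because the markers are numbered in offset order at render time, adding or removing a footnote row renumbers the rest automatically.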
Here's the hint: don't add hundreds of columns... that would just be a headache, and it's spreadsheet thinking. It's better to work through the perfect queries to link tables with the right content.

Searching and sorting by language

I am testing Lucene.NET for our searching requirements, and I've got a couple of questions.
We have documents in XML format. Every document contains multi-language text. The number of languages and the languages themselves vary from document to document. See the example below:
<document>This is a sample document, which is describing a <word lang="de">tisch</word>, a <word lang="en">table</word> and a <word lang="en">desk</word>.</document>
The keywords of a document are tagged with a special element and language attribute.
When creating the Lucene index, I extract the text content from the XML along with pairs of language and keyword (I am not sure if I have to), like this:
This is a sample document, which is describing a tisch, a table and a desk.
de - tisch
en - table
en - desk
I don't know exactly how to create an index that will let me search for, for example:
- all the documents that contain the word tisch in German (and not the documents that contain the word tisch in other languages).
I also want to specify sorting at runtime:
I want to sort by a user-specified language order (depending on the user interface). For example, if we have two documents:
<document>This is a sample document, which is describing a <word lang="de">tisch</word>.</document>
<document>This is another sample document, which is describing a <word lang="en">table</word>.</document>
and a user on an English interface searches for "tisch OR table", I want to get the second result first.
Any information or advice is appreciated.
Many thanks!
You have a design decision to make, where the options are:
Use a single index, where each document has a field per each language it uses, or
Use M indexes, M being the number of languages in the corpus.
If you use the multi-index approach, it will be easier to restrict search to a specific language or set of languages: just search the indexes for these languages and leave the others out. Sorting by language also becomes easier. Therefore, if you do not have an "AND" search that requires keywords from different languages to appear in the same document, I would suggest the M-index approach.
Based on your example, I assume that the part of the documents not specially tagged is in English. If this is so, you can add the document text to the English index as a separate field; the other indexes need only store a document id, which will make them lighter.
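To make the M-index idea concrete, here is a toy in-memory stand-in written in plain Python. It does not use Lucene.NET or mirror its API; the class `PerLanguageIndex` and its methods are hypothetical and exist only to show one keyword index per language, language-restricted search, and ordering of results by the user's language preference.

```python
from collections import defaultdict


class PerLanguageIndex:
    """Toy model of the M-index approach: one keyword index per language."""

    def __init__(self):
        # language -> keyword -> set of document ids
        self._indexes = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, keywords):
        """keywords: iterable of (language, word) pairs extracted from the XML."""
        for lang, word in keywords:
            self._indexes[lang][word.lower()].add(doc_id)

    def search(self, word, lang):
        """Documents containing the word tagged with exactly this language."""
        return set(self._indexes.get(lang, {}).get(word.lower(), set()))

    def search_any(self, words, language_order):
        """OR-search across languages, results ordered by language preference."""
        results = []
        for lang in language_order:
            hits = set()
            for word in words:
                hits |= self.search(word, lang)
            for doc_id in sorted(hits):
                if doc_id not in results:
                    results.append(doc_id)
        return results


index = PerLanguageIndex()
index.add(1, [("de", "tisch")])   # first sample document
index.add(2, [("en", "table")])   # second sample document

german_only = index.search("tisch", "de")
english_first = index.search_any(["tisch", "table"], ["en", "de"])
print(german_only, english_first)
```

With an English-first language order, the document tagged with the English keyword comes back before the German one, which is the ordering asked for in the question.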