How to find non-English languages and their data in a SQL Server table - sql

I have a SQL Server table loaded with data from multiple countries: Japanese, Thai, Urdu, Portuguese, Spanish and many more that I haven't identified.
How can I identify the language of each row and its relevant data in that table?
sample:
colid | colname
1 | stackoverflow
2 | 龍梅子, 老貓
I need a query that produces:
stackoverflow, english
龍梅子, 老貓 , chinese
Is this possible?

You might try this:
DECLARE @str NVARCHAR(100)=N'龍梅子, 老貓';
SELECT UNICODE(LEFT(@str,1));
The result is 40845
A Google search for Unicode 40845 shows that this is 龍 (U+9F8D).
It seems to be the character for dragon (or emperor) in the Chinese character set.
There is more information available on Unicode character ranges.
The given code point belongs to the CJK Unified Ideographs block.
UPDATE
To use this, you will have to do one of the following:
get a local copy of a full character list into your database and do a look-up, using something like http://www.tamasoft.co.jp/en/general-info/unicode.html
or create a range table (much smaller!) like the one at http://jrgraphix.net/research/unicode_blocks.php
or do some kind of web lookup...
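The range-table idea can be sketched outside the database first. The following Python is only an illustration: the table holds a handful of Unicode blocks (the names and ranges are real, but the selection is arbitrary), and `block_of` and `blocks_in` are hypothetical helper names. Note that a block identifies a script rather than a language: CJK Unified Ideographs covers Chinese and Japanese text alike.

```python
# Deliberately tiny range table: (first code point, last code point, block name).
# A real table would cover every block from the linked list.
UNICODE_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0E00, 0x0E7F, "Thai"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
]


def block_of(ch):
    """Return the block name for one character, or 'Unknown' if it is not in the table."""
    code_point = ord(ch)  # same number that T-SQL UNICODE() returns for this character
    for first, last, name in UNICODE_BLOCKS:
        if first <= code_point <= last:
            return name
    return "Unknown"


def blocks_in(text):
    """Return the sorted distinct blocks of all letters and digits in text."""
    return sorted({block_of(ch) for ch in text if ch.isalnum()})


print(blocks_in("stackoverflow"))
print(blocks_in("龍梅子, 老貓"))
```

In SQL Server the same table would be a small lookup table joined on `UNICODE(...) BETWEEN first AND last`.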

You can't. What if several languages use the same alphabet or, even worse, share the same words? You obviously wouldn't want to ask an internet AI-based program to decide.
You should add a column language_id, have a table for the languages and hope that it's not too many entries to update.

Related

Postgres: Is there a way to target specific tables based on your data?

I'm new to SQL and I'm currently thinking about an effective way to build out my database. It's a language learning application and I'm torn between two approaches:
Keeping all of my words, regardless of their language, in one giant words table
Splitting my words into separate tables based on their language, ie: words_french, words_italian, etc.
In the second scenario, are there approaches that I can use (perhaps within Postgres) that would allow me target the words_french table in the event that I'm currently working through french lessons / content and need to lookup associated french words?
I feel like there would be some sort of concat process like so: words_${language}, and as of this moment I figure I'd have to resolve this within JS or something else on the frontend.
-- Also, is breaking words and other content into their respective table_language even a valid approach?
Any ideas?
Use Option 1. Option 2 would be horribly difficult to work with.
Word table:
WordId | Word | Language
1 | a | English
2 | un | French
As Dimitar Spasovski suggests, if you need additional attributes associated with the language, you should also have a Language table. Then replace the Language column in the Word table with a LanguageId column to make the relationship.
Watching or reading some data modeling or data architecture classes online will help.
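As a rough sketch of the single-table design, the snippet below uses Python's built-in sqlite3 module in place of Postgres so that it runs anywhere. The table and column names (`language`, `word`, `language_id`) and the `words_for` helper are assumptions made for illustration. The point is that the language becomes a query parameter, so no table name ever needs to be concatenated.

```python
import sqlite3

# In-memory SQLite database standing in for Postgres (an assumption for portability).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE language (
        language_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL UNIQUE
    );
    CREATE TABLE word (
        word_id     INTEGER PRIMARY KEY,
        word        TEXT NOT NULL,
        language_id INTEGER NOT NULL REFERENCES language(language_id)
    );
    INSERT INTO language VALUES (1, 'English'), (2, 'French');
    INSERT INTO word VALUES (1, 'a', 1), (2, 'un', 2);
""")


def words_for(language_name):
    """Return all words of one language; the language is data, not part of a table name."""
    rows = conn.execute(
        "SELECT w.word FROM word AS w "
        "JOIN language AS l ON l.language_id = w.language_id "
        "WHERE l.name = ? ORDER BY w.word_id",
        (language_name,),
    ).fetchall()
    return [row[0] for row in rows]


print(words_for("French"))
```

Adding Italian later is then an INSERT into `language`, not a new table and a code change.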

SQL Server - Creating a "Search library" of terms to use in a query

Firstly I apologise in advance if this question is a bit bare bones or has misleading/confusing terminology but I'm not sure how else to phrase it.
I have a few tables which capture the language of interactions based on a few different factors. What I would like to do is set up a sort of temporary library of language based terms that I can reference in a query so that I can search the various tables and find matches against the terms stored in the library.
I'll try and give an example:
The library might consist of the following terms:
English, German, French, Italian, Spanish
I then want to search these tables:
teacherSpokenLanguages, courseLanguages, studentLanguages
And find all the rows that contain the search terms in any particular field (and specify which field that term is being found).
I hope there's enough information to piece together my request. Is this even remotely possible? Could I create a temporary table to contain these values perhaps? I can't do anything permanent on the database, it all has to be housed within this one query and has to be non-destructive.

SQL - searching database with the LIKE operator

Given your data stored somewhere in a database:
Hello my name is Tom I like dinosaurs to talk about SQL.
SQL is amazing. I really like SQL.
We want to implement a site search, allowing visitors to enter terms and return relating records. A user might search for:
Dinosaurs
And the SQL:
WHERE articleBody LIKE '%Dinosaurs%'
Copes fine with returning the correct set of records.
How would we cope, however, if a user misspells dinosaurs? For example:
Dinosores
(Poor sore dino). How can we search allowing for error in spelling? We can associate common misspellings we see in search with the correct spelling, and then search on the original terms + corrected term, but this is time consuming to maintain.
Is there any way to do this programmatically?
Edit
Appears SOUNDEX could help, but can anyone give me an example using soundex where entering the search term:
Dinosores wrocks
returns records instead of doing:
WHERE articleBody LIKE '%Dinosaurs%' OR articleBody LIKE '%Wrocks%'
which would return squadoosh?
If you're using SQL Server, have a look at SOUNDEX.
For your example:
select SOUNDEX('Dinosaurs'), SOUNDEX('Dinosores')
Returns identical values (D526) .
You can also use DIFFERENCE function (on same link as soundex) that will compare levels of similarity (4 being the most similar, 0 being the least).
SELECT DIFFERENCE('Dinosaurs', 'Dinosores'); --returns 4
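To see why both spellings collapse to the same code, here is a simplified American Soundex in Python. It is a sketch for illustration only: `soundex` here is a hypothetical helper, and its output may differ from SQL Server's SOUNDEX on some inputs (for example names containing H or W, depending on the database compatibility level).

```python
# Consonant groups of American Soundex; vowels, H, W and Y have no code.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(word):
    """Return a 4-character Soundex code: first letter plus up to three digits."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    previous = CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != previous:
            result += code          # new consonant sound
        if c not in "HW":
            previous = code         # vowels reset the run; H and W do not
        if len(result) == 4:
            break
    return result.ljust(4, "0")     # pad short codes with zeros


print(soundex("Dinosaurs"), soundex("Dinosores"))
```

Because only the first letter and the following consonant sounds survive, swapping or dropping vowels leaves the code unchanged.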
Edit:
After hunting around a bit for a multi-text option, it seems that this isn't all that easy. I would refer you to the link on the Fuzzy Logic answer provided by @Neil Knight (+1 to that, for me!).
This Stack Overflow article also details possible sources for implementations of Fuzzy Logic in T-SQL. One respondent also outlined Full-Text Indexing as an option that you might want to investigate.
Perhaps your RDBMS has a SOUNDEX function? You didn't mention which one was involved here.
SQL Server's SOUNDEX
Just to throw an alternative out there. If SSIS is an option, then you can use Fuzzy Lookup.
SSIS Fuzzy Lookup
I'm not sure if introducing a separate "search engine" is possible, but if you look at products like the Google search appliance or Autonomy, these products can index a SQL database and provide more searching options - for example, handling misspellings as well as synonyms, search results weighting, alternative search recommendations, etc.
Also, SQL Server's full-text search feature can be configured to use a thesaurus, which might help:
http://msdn.microsoft.com/en-us/library/ms142491.aspx
Here is another SO question from someone setting up a thesaurus to handle common misspellings:
FORMSOF Thesaurus in SQL Server
Short answer: most SQL engines have nothing built in that does dictionary-based correction of "fat fingers". SOUNDEX does work as a tool to find words that sound alike and thus corrects phonetic misspellings; because it ignores vowels, it even matches "Dinosars" (missing the U) and "Dinosayrs". But if the user truly "fat-fingers" a consonant and enters "Dinodaurs", the code changes (D536 instead of D526) and SOUNDEX no longer returns a match.
Sounds like you want something on the level of Google Search's "Did you mean __?" feature. I can tell you that is not as simple as it looks. At a 10,000-foot level, the search engine would look at each of those keywords and see if it's in a "dictionary" of known "good" search terms. If it isn't, it uses an algorithm much like a spell-checker suggestion to find the dictionary word that is the closest match (requires the fewest letter substitutions, additions, deletions and transpositions to turn the given word into the dictionary word). This will require some heavy procedural code, either in a stored proc or CLR Db function in your database, or in your business logic layer.
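A minimal sketch of that spell-checker step, in Python rather than a stored procedure or CLR function: the distance below is the "optimal string alignment" variant of Damerau-Levenshtein, which counts substitutions, additions, deletions and adjacent transpositions. The function names and the three-word dictionary are hypothetical; a real dictionary would come from the indexed terms.

```python
def osa_distance(a, b):
    """Edit distance counting substitutions, insertions, deletions and adjacent swaps."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def closest(term, dictionary):
    """Return the dictionary word with the smallest distance to term."""
    term = term.lower()
    return min(dictionary, key=lambda word: osa_distance(term, word.lower()))


known_terms = ["dinosaurs", "sql", "amazing"]  # hypothetical "good" search terms
print(closest("Dinosores", known_terms))
```

Comparing against every dictionary word is quadratic per pair, so in practice the candidates are usually narrowed first (for example by first letter or by Soundex code).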
You can also try SUBSTR() to compare only the first three or so characters of each value. Below is an example of how that can be achieved:
SELECT Table1.Fname, Table1.Lname
FROM Table1, Table2
-- substr() and || are Oracle/PostgreSQL-style; on SQL Server use SUBSTRING() and +
WHERE substr(Table1.Fname, 1, 3) || substr(Table1.Lname, 1, 3) = substr(Table2.Fname, 1, 3) || substr(Table2.Lname, 1, 3)
ORDER BY Table1.Fname;

FREETEXT queries in SQL Server 2008 not phrase matching

I have a full-text indexed table in SQL Server 2008 that I am trying to query for an exact phrase match using FREETEXT. I don't believe using CONTAINS or LIKE is appropriate for this, because in other cases the query might not be exact (the user doesn't surround the phrase in double quotes) and in general I want the flexibility of FREETEXT.
According to the documentation[MSDN] for FREETEXT:
If freetext_string is enclosed in double quotation marks, a phrase match is instead performed; stemming and thesaurus are not performed.
which would lead me to believe a query like this:
SELECT Description
FROM Projects
WHERE FREETEXT(Description, '"City Hall"')
would only return results where the term "City Hall" appears in the Description field, but instead I get results like this:
1 Design of handicap ramp at Manning Hall.
2 Antenna investigation. Client: City of Cranston Engineering Dept.
3 Structural investigation regarding fire damage to International Tennis Hall of Fame.
4 Investigation Roof investigation for proposed satellite design on Herald Hall.
... etc
Obviously those results include at least one of the words in my phrase, but not the phrase itself. What's worse, I had thought the results would be ranked but the two results I actually wanted (because they include the actual phrase) are buried.
SELECT Description
FROM Projects
WHERE Description LIKE '%City Hall%'
1 Major exterior and interior renovation of the existing city hall for Quincy Massachusetts
2 Cursory structural investigation of Pawtucket City Hall tower plagued by leaks.
I'm sure this is a case of me not understanding the documentation, but is there a way to achieve what I'm looking for? Namely, to be able to pass in a search string without quotes and get exactly what I'm getting now or with quotes and get only that exact phrase?
As you said, FREETEXT looks up every word in your phrase, not the phrase as a whole. For that you need to use the CONTAINS predicate, like this:
SELECT Description
FROM Projects
WHERE CONTAINS(Description, '"City Hall"')
If you want to get the rank of the results, you have to use CONTAINSTABLE. It works roughly the same, but it returns a table with two columns: [Key], which contains the primary key of the searched table, and [Rank], which gives you the rank of the result.
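One way to get both behaviours from a single search box is to choose the predicate in application code. The Python below is a hedged sketch, not a tested SQL Server integration: `build_search` is a hypothetical helper, and the `?` placeholder assumes an ODBC-style driver such as pyodbc. It only builds the statement and its parameter; it does not execute anything.

```python
def build_search(user_input):
    """Pick CONTAINS for a single double-quoted phrase, FREETEXT for anything else."""
    text = user_input.strip()
    is_phrase = (len(text) > 2
                 and text.startswith('"')
                 and text.endswith('"')
                 and text.count('"') == 2)   # exactly one quoted phrase
    predicate = "CONTAINS" if is_phrase else "FREETEXT"
    sql = f"SELECT Description FROM Projects WHERE {predicate}(Description, ?)"
    return sql, (text,)                      # search string is passed as a parameter


print(build_search('"City Hall"'))
print(build_search("City Hall"))
```

Passing the search string as a parameter rather than concatenating it keeps user input out of the SQL text.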

How can I create an ordered list of the most common substrings inside of my MySQL varchar column?

I have a MySQL database table with a couple thousand rows. The table is setup like so:
id | text
The id column is an auto-incrementing integer, and the text column is a 200-character varchar.
Say I have the following rows:
3 | I think I'll have duck tonight
4 | Maybe the chicken will be alright
5 | I have a pet duck now, awesome!
6 | I love duck
Then the list I'm wanting to generate might be something like:
3 occurrences of 'duck'
3 occurrences of 'I'
2 occurrences of 'have'
1 occurrences of 'chicken'
.etc .etc
Plus, I'll probably want to maintain a list of substrings to ignore, like 'I', 'will' and 'have'. It's important to note that I do not know what people will post.
I do not have a list of words that I want to monitor, I just want to find the most common substrings. I'll then filter out any erroneous substrings that are not interesting from the list manually by editing the query.
Can anyone suggest the best way to do this? Thanks everyone!
MySQL already does this for you.
First make sure your table is a MyISAM table
Define a FULLTEXT index on your column
On a shell command line navigate to the folder where your MySQL data is stored, then type:
myisam_ftdump -c yourtablename 1 >wordfreq.dump
You can then process wordfreq.dump to eliminate the unwanted column and sort by frequency descending.
You could do all the above with a single command line and some sed/awk wizardry no doubt.
And you could incorporate it into your program without needing a dump file.
More info on myisam_ftdump here:
http://dev.mysql.com/doc/refman/5.0/en/myisam-ftdump.html
Oh... one more thing: the default stopwords for MySQL are precompiled into the engine.
And words with 3 or fewer characters are not indexed by default.
The full list is here:
http://dev.mysql.com/doc/refman/5.0/en/fulltext-stopwords.html
If this list isn't adequate for your needs, or you need shorter words to count, you can set the ft_stopword_file and ft_min_word_len server variables and then rebuild the FULLTEXT index; recompiling MySQL with different rules for FULLTEXT is the last resort, and I don't recommend that!
Extract to flat file and then use your favorite quick language, perl, python, ruby, etc to process the flat file.
If you don't have one of these languages as part of your skillset, this is a perfect little task to start using one, and it won't take you long.
Some database tasks are just so much easier to do OUTSIDE of the database.
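A minimal Python sketch of that outside-the-database approach, assuming the text column has already been exported into a list of strings (the sample rows are the ones from the question). The names `word_frequencies`, `STOPWORDS` and `SAMPLE_ROWS` are hypothetical, and the tokenizer is a simple regular expression, so contractions such as "I'll" count as their own word.

```python
import re
from collections import Counter

STOPWORDS = {"i", "will", "have"}  # the ignore list mentioned in the question

SAMPLE_ROWS = [
    "I think I'll have duck tonight",
    "Maybe the chicken will be alright",
    "I have a pet duck now, awesome!",
    "I love duck",
]


def word_frequencies(rows, stopwords=STOPWORDS):
    """Count lower-cased words across all rows, skipping the stopwords."""
    counts = Counter()
    for text in rows:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in stopwords)
    return counts


# Print the most frequent words first.
for word, occurrences in word_frequencies(SAMPLE_ROWS).most_common(3):
    print(f"{occurrences} occurrences of '{word}'")
```

Editing the ignore list is then a one-line change to `STOPWORDS` instead of a change to the query.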
You might want to look into the MySQL Full-Text Parser Plugins