SQLite Select first row per category - sql

I'm trying to write multi-language blog software using Python and SQLite, and I'm struggling to make an SQL query elegant.
I've got all articles of the blog in two tables:
articleindex (contains most of the metadata of the articles like the URL, etc)
articlecontent (contains well, the content of the article, and a flag for the language, and when this specific translation was written (aka, the date))
I'm now trying to select all articles ordered by date and by language. This is for the main view of the blog. It should list all articles in chronological order, regardless of the language they are in, but each article only once (I don't want the English version of an article below or above the German version). If there are multiple translations, the main view should contain the default language (English) if it exists. If there is no English version it should show the German version (if it exists); if there is no German version it should show the Esperanto version, etc.
Of course I can do this in Python: select all articles, and skip a record if another version of the article has already been written out. However, this seems inelegant. I'd rather have SQLite return the data as needed.
So far I've managed to get the data in the order I want; I just don't seem to be able to eliminate the unneeded records.
Here is the table structure:
CREATE TABLE articleindex (id INTEGER PRIMARY KEY,
category text,
translationid text,
webid text);
CREATE TABLE articlecontent (id INTEGER PRIMARY KEY,
articleindexid INTEGER,
lang text,
content text,
date text);
I came up with this query, which gives me the right order, but has duplicates in it:
SELECT * FROM articlecontent AS ac LEFT JOIN articleindex AS ai
ON ac.articleindexid = ai.id ORDER BY ac.date DESC, CASE ac.lang
WHEN "en" THEN 0
WHEN "de" THEN 1
WHEN "eo" THEN 2
END
This results in the (shortened) output:
articleindexid, lang
21, en
21, de
12, en
12, de
8, en
8, de
2, en
2, de
2, eo
How do I skip, for example, the second record with the articleindexid 21 or 12?
Using search engines I came across suggestions about using partitions (PARTITION BY), but it seems SQLite doesn't support those. I'm also not sure what to search for, so any suggestion is appreciated.
Thanks
Ben

I think you should create a table for language priorities and use it instead of CASE in your SQL statements. For example:
LANG_PRIORITY(lang text,Ord INTEGER) = (("en",0),("de",1),("eo",2))
Anyway, with your current setup try the following query. The subquery with LIMIT 1 selects, for each date, the one row whose language has the highest priority (this relies on all translations of an article sharing the same date):
SELECT * FROM articlecontent AS ac
LEFT JOIN articleindex AS ai
ON ac.articleindexid = ai.id
WHERE ac.id =
(
SELECT ID FROM articlecontent as ac2
WHERE ac.date=ac2.date
ORDER BY CASE ac2.lang
WHEN "en" THEN 0
WHEN "de" THEN 1
WHEN "eo" THEN 2
END
LIMIT 1
)
ORDER BY ac.date DESC
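The two ideas above (a priority table and a LIMIT 1 subquery) can be combined. Here is a minimal sketch using Python's sqlite3 module. The lang_priority table, the sample rows and the trimmed-down columns are made up for illustration, and the subquery is correlated on articleindexid instead of date, which is my own variation so that translations written on different days still collapse to one row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE articleindex (id INTEGER PRIMARY KEY, webid TEXT);
CREATE TABLE articlecontent (id INTEGER PRIMARY KEY, articleindexid INTEGER,
                             lang TEXT, content TEXT, date TEXT);
-- Hypothetical lookup table that replaces the CASE expression.
CREATE TABLE lang_priority (lang TEXT PRIMARY KEY, ord INTEGER);
INSERT INTO lang_priority VALUES ('en', 0), ('de', 1), ('eo', 2);
INSERT INTO articleindex VALUES (2, 'hello'), (8, 'news'), (12, 'only-eo');
-- Invented sample rows: article 2 has de+en, article 8 has eo+de, article 12 only eo.
INSERT INTO articlecontent (articleindexid, lang, content, date) VALUES
    (2, 'de', 'Hallo', '2015-01-01'),
    (2, 'en', 'Hello', '2015-01-02'),
    (8, 'eo', 'Novajoj', '2015-02-01'),
    (8, 'de', 'Nachrichten', '2015-02-01'),
    (12, 'eo', 'Saluton', '2015-03-01');
""")

# For every article keep only the translation whose language ranks first.
FIRST_PER_ARTICLE = """
SELECT ac.articleindexid, ac.lang
FROM articlecontent AS ac
JOIN articleindex AS ai ON ac.articleindexid = ai.id
WHERE ac.id = (
    SELECT ac2.id
    FROM articlecontent AS ac2
    JOIN lang_priority AS lp ON lp.lang = ac2.lang
    WHERE ac2.articleindexid = ac.articleindexid
    ORDER BY lp.ord
    LIMIT 1
)
ORDER BY ac.date DESC
"""

rows = conn.execute(FIRST_PER_ARTICLE).fetchall()
print(rows)
```

Note that a translation whose language is missing from lang_priority is never picked in this sketch, because the inner join drops it.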

Related

Cannot figure out how to write a compound SQL query for an Oracle database

Please help me sort out this task:
Develop a query to calculate the number of news items written by each author and the most popular tag among that author's news. All this information must be output in one single result set.
I wrote the first part of the query; it displays the number of news items for each author:
SELECT news_author.author_id AS "Author ID", COUNT(*) AS "Amount of news"
FROM news
JOIN news_author ON id = news_author.news_id
JOIN news_tag ON id = news_tag.news_id
GROUP BY news_author.author_id
ORDER BY news_author.author_id;
Please tell me how to query the most popular tag among each author's news and how to combine the two queries into one result set.
You can use standard ANSI SQL features or Oracle-specific extensions.
The table schema is attached.
In statistics, the most popular value is called the mode, and Oracle has an aggregate function, STATS_MODE(), to calculate it. So you can use:
SELECT na.author_id AS "Author ID",
COUNT(DISTINCT n.id) AS num_news,
STATS_MODE(nt.tag_id)
FROM news n JOIN
news_author na
ON n.id = na.news_id JOIN
news_tag nt
ON n.id = nt.news_id
GROUP BY na.author_id
ORDER BY na.author_id;
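For engines without STATS_MODE, the same result can be approximated with a correlated subquery. The following is only a sketch run against SQLite from Python; the table and column names are taken from the queries above, the rows are invented, and breaking ties by the lowest tag_id is my own assumption (STATS_MODE simply returns one of the modes):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE news_author (news_id INTEGER, author_id INTEGER);
CREATE TABLE news_tag (news_id INTEGER, tag_id INTEGER);
-- Invented data: author 10 wrote news 1 and 2, author 20 wrote news 3.
INSERT INTO news_author VALUES (1, 10), (2, 10), (3, 20);
INSERT INTO news_tag VALUES (1, 100), (1, 200), (2, 100), (3, 300);
""")

# News count per author plus the most frequent tag across that author's news.
MODE_QUERY = """
SELECT na.author_id,
       COUNT(DISTINCT na.news_id) AS num_news,
       (SELECT nt.tag_id
        FROM news_author AS na2
        JOIN news_tag AS nt ON nt.news_id = na2.news_id
        WHERE na2.author_id = na.author_id
        GROUP BY nt.tag_id
        ORDER BY COUNT(*) DESC, nt.tag_id   -- assumed tie-break: lowest tag_id
        LIMIT 1) AS top_tag
FROM news_author AS na
GROUP BY na.author_id
ORDER BY na.author_id
"""

result = conn.execute(MODE_QUERY).fetchall()
print(result)
```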

Ways to clean up messy records in SQL

I have the following sql data:
ID | Company Name | Customer | Address 1 | City | State | Zip | Date
0108500 | AAA Test | Mish~Sara | Newa Claims | Chtiana | CO | 123 | 06FE0046
0108500 | AAA.Test | Mish~Sara | Newa Claims | Chtiana | CO | 123 | 06FE0046
1802600 | AAA Test Company | Ban, Adj.~Gorge | PO Box 83 | MouLaurel | CA | 153 | 09JS0025
1210600 | AAA Test Company | Biwel~Brce | 97kehst ve | Jacn | CA | 153 | 04JS0190
AAA Test, AAA.Test and AAA Test Company are considered one company.
Since the data is messy, I'm thinking of doing one of the following:
Is there a way to search all the records in the DB for company names that are almost the same and then rename them to the longest name?
In this case, AAA Test and AAA.Test would become AAA Test Company.
Or is there a way to filter only the records whose company names are almost the same, so that someone can then choose whether to change them?
If there's no way to do it with an SQL query, what are your suggestions for cleaning up the records? There are almost 1 million records in the database and it's hard to clean them up manually.
Thank you in advance.
You could use a string matching algorithm like Jaro-Winkler. I've written an SQL version that is used daily to deduplicate people's names that have been typed in differently. It can take a while, but it works well for the fuzzy match you're looking for.
Something like a self join? || is the ANSI SQL concatenation operator; some products have a concat function instead.
select *
from tablename t1
join tablename t2 on t1.companyname like '%' || t2.companyname || '%'
Depending on the datatype you may have to remove blanks from t2.companyname; use TRIM(t2.companyname) in that case.
And, as Miguel suggests, use REPLACE to remove commas, dots, etc.
Use a case-insensitive collation. SOUNDEX can also be used.
I think most database servers support full-text search, and if so there are some full-text functions that support proximity matching.
For example, there is a NEAR function in SQL Server; here is its documentation: https://msdn.microsoft.com/en-us/library/ms142568.aspx
You can do the clean-up in several stages.
Create new columns
Convert everything to upper case, remove punctuation and whitespace, then match on the first 6 to 10 characters (using a self join). Assuming your table is called "vendor": add two columns, "status" and "dupstr", then update as follows:
/** Populate dupstr column for fuzzy match **/
update vendor v
set v.dupstr = left(upper(replace(replace(v.companyname,'.',''),' ','')),6)
;
Identify duplicate records
Add an index on the dupstr column, then do an update like this to identify "good" records:
/** Mark the good duplicates **/
update vendor v
set v.status = 'keep' --indicate keeper record
where
--dupes to clean up
exists ( select 1 from vendor v1 where v.dupstr = v1.dupstr
and v.id != v1.id )
and
( --keeper has longest name
length(v.companyname) =
( select max(length(v2.companyname)) from vendor v2
where v.dupstr = v2.dupstr
)
or
--keeper has latest record (assuming ID is sequential)
v.id =
( select max(v3.id) from vendor v3
where v.dupstr = v3.dupstr
)
)
;
The above SQL can be refined to add a "dupe" status to the other records, or you can do that in a separate update.
Clean Up Stragglers
Report any remaining partial matches for review by a human (i.e. dupe records without a keeper record).
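If it is easier to prototype outside the database, the staged approach above can be sketched in plain Python. This is only an illustration: dup_key and pick_keepers are hypothetical helper names, the key is the first 6 alphanumeric characters as suggested above, and the "keeper" is simply the longest name in each group:

```python
import re
from collections import defaultdict


def dup_key(name: str, length: int = 6) -> str:
    """Upper-case the name, drop everything except A-Z and 0-9, keep a prefix."""
    return re.sub(r"[^A-Z0-9]", "", name.upper())[:length]


def pick_keepers(names):
    """Map every name to the longest name that shares its fuzzy key."""
    groups = defaultdict(set)
    for name in names:
        groups[dup_key(name)].add(name)
    mapping = {}
    for variants in groups.values():
        # Longest name wins; the name itself breaks ties deterministically.
        keeper = max(variants, key=lambda n: (len(n), n))
        for name in variants:
            mapping[name] = keeper
    return mapping


names = ["AAA Test", "AAA.Test", "AAA Test Company", "Other Vendor"]
canonical = pick_keepers(names)
for original, keeper in canonical.items():
    print(f"{original!r} -> {keeper!r}")
```

A short prefix key like this will also merge unrelated companies that happen to start with the same letters, so the resulting mapping is better treated as a list of candidates for review than as a final answer.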
You can use an SQL query with SOUNDEX or DIFFERENCE.
For example:
SELECT DIFFERENCE ('AAA Test','AAA Test Company')
DIFFERENCE returns 0 to 4 (4 = almost the same, 0 = totally different).
See also: https://learn.microsoft.com/en-us/sql/t-sql/functions/difference-transact-sql?view=sql-server-2017

Inconsistent geolocation query results in SQL

I wrote a simple web application that lets you mark flea market stands on a Google map.
Each stand is stored in a sqlite3 database with its geolocation and other information.
This is the CREATE statement for the stands table:
CREATE TABLE stands (
id INTEGER PRIMARY KEY,
name TEXT,
address TEXT,
u REAL,
v REAL
);
u and v are the latitude and longitude, respectively.
Additionally, I have a cities table that stores the name and geographic bounds of each city that hosts a stand. This is used to let users quickly navigate between cities.
CREATE TABLE cities
(name TEXT PRIMARY KEY,
u_min REAL,
u_max REAL,
v_min REAL,
v_max REAL);
When a new stand is added, either a new row is added to the cities table or, if the stand is in a known city, the bounds of that city are widened if needed.
Here are some sample stands:
592077673|Kierrätystori Rovaniemellä|Urheilukatu 1, 96100 Rovaniemi, Suomi|66.4978306921681|25.7220569153442
1321495145|Kruununhaka|Liisankatu, 00170 Helsinki, Suomi|60.1742596|24.9555782
571688977|Viikki asukastalo LAVAn edusta|Biologinkatu 5, 00790 Helsinki, Suomi|60.2342312|25.04058
563089951|Hämeentie 156|Hämeentie 156, 00560 Helsinki, Suomi|60.2130467082539|24.9785856067459
518892420|Joensuu - Ilosaari|Siltakatu 1, 80100 Joensuu, Finland|62.5990455742272|29.7706540507875
and cities:
Rovaniemi|66.4978306921681|66.4978306921681|25.7220569153442|25.7220569153442
Helsinki|60.1577049447137|60.2556221042622|24.9216988767212|25.0662129772156
Järvenpää|60.4513724|60.4513724|25.0819323000001|25.0819323000001
Joensuu|62.5990455742272|62.5990653244875|29.7706540507874|29.7706540507875
Vantaa|60.2731724937748|60.2731724937748|24.9571491285278|24.9571491285278
The issue I'm having is retrieving the number of stands per city.
So far I've been using the following query:
SELECT cities.name AS city,
cities.u_min,
cities.u_max,
cities.v_min,
cities.v_max,
count(stands.id) AS count
FROM cities
LEFT OUTER JOIN stands
ON ((stands.u BETWEEN cities.u_min AND cities.u_max)
AND(stands.v BETWEEN cities.v_min AND cities.v_max))
GROUP BY cities.name;
This returns:
Helsinki|60.1577049447137|60.2556221042622|24.9216988767212|25.0662129772156|9
**Joensuu|62.5990455742272|62.5990653244875|29.7706540507874|29.7706540507875|0**
Järvenpää|60.4513724|60.4513724|25.0819323000001|25.0819323000001|1
Rovaniemi|66.4978306921681|66.4978306921681|25.7220569153442|25.7220569153442|1
Vantaa|60.2731724937748|60.2731724937748|24.9571491285278|24.9571491285278|1
This is not correct, as the city named Joensuu does have 1 stand within its boundaries:
518892420|Joensuu - Ilosaari|Siltakatu 1, 80100 Joensuu, Finland|62.5990455742272|29.7706540507875
But the following query returns the expected stand:
SELECT * FROM stands where u between 62.5990455742272 and 62.5990653244875 and v between 29.7706540507874 and 29.7706540507875;
I really can't understand what is going wrong here.
Any help would be greatly appreciated.
By the way, I imported this database into MySQL and the same thing happens, so I doubt this is a sqlite3 bug.
I think this has to do with floating-point precision. The values you see are rounded for display, so a stand can appear to sit exactly on a boundary while the stored number lies slightly outside it. One way to deal with such problems is to pick a small number and add it to your boundaries to make them a little wider, which absorbs the precision error.
One approach is to widen the boundaries directly in every query:
SET @e = 0.0000000000001;
SELECT cities.name AS city,
cities.u_min,
cities.u_max,
cities.v_min,
cities.v_max,
count(stands.id) AS count
FROM cities
LEFT OUTER JOIN stands
ON ((stands.u BETWEEN cities.u_min - @e AND cities.u_max + @e)
AND(stands.v BETWEEN cities.v_min - @e AND cities.v_max + @e))
GROUP BY cities.name;
Another approach is to store the widened boundaries in the cities table:
SET @e = 0.0000000000001;
UPDATE cities
SET cities.u_min = cities.u_min - @e,
cities.u_max = cities.u_max + @e,
cities.v_min = cities.v_min - @e,
cities.v_max = cities.v_max + @e;
P.S. I am not sure whether this variable syntax works in SQLite; if it doesn't, just replace every use of the variable e with 0.0000000000001.
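To make the precision argument concrete, here is a small sketch in Python with sqlite3. The stored value for the stand is invented so that it reproduces the symptom (the real stored values in the question are not known), and it assumes the values are shown with about 15 significant digits: both numbers then print identically, yet the strict BETWEEN misses the stand while the widened one finds it.

```python
import sqlite3

# Hypothetical stored values: the stand lies a hair above the city's upper bound.
stand_v = 29.77065405078751
city_v_max = 29.7706540507875

# Rounded to 15 significant digits, the two values look the same.
shown_stand = format(stand_v, ".15g")
shown_bound = format(city_v_max, ".15g")
print(shown_stand, shown_bound, stand_v > city_v_max)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stands (id INTEGER PRIMARY KEY, v REAL)")
conn.execute("CREATE TABLE cities (name TEXT PRIMARY KEY, v_min REAL, v_max REAL)")
conn.execute("INSERT INTO stands (v) VALUES (?)", (stand_v,))
conn.execute("INSERT INTO cities VALUES ('Joensuu', ?, ?)",
             (29.7706540507874, city_v_max))

# Same join as in the question (longitude only), with the margin passed as :e.
COUNT_SQL = """
SELECT count(stands.id)
FROM cities
LEFT OUTER JOIN stands
  ON stands.v BETWEEN cities.v_min - :e AND cities.v_max + :e
"""
strict_count = conn.execute(COUNT_SQL, {"e": 0.0}).fetchone()[0]
widened_count = conn.execute(COUNT_SQL, {"e": 1e-13}).fetchone()[0]
print(strict_count, widened_count)
```

Passing the margin as a bound parameter also sidesteps the question of whether SET-style variables are available in the engine at hand.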
The reason is that you use LEFT JOIN, so it includes all cities, regardless of whether there is a match in the stands table. Notice how COUNT returns 0 for that row.
Just use INNER JOIN instead of LEFT JOIN. That will tell MySQL to only return rows for which the condition in ON is met.
UPD: I misunderstood the question, so this answer is wrong. I'll post another answer.

SQL query on a condition

I'm writing a query to retrieve translated content. If there is no translation for the given language ID, I want it to automatically return the translation for the default language, which has ID 1.
select Translation.Title
,Translation.Summary
from Translation
where Translation.FkLanguageId = 3
-- If there is no LanguageId of 3, select the record with LanguageId of 1.
I'm working in MS SQL but I think the theory is not DBMS-specific.
Thanks in advance.
This assumes there is only one row per language in Translation, based on how you phrased the question. If you have multiple rows per FkLanguageId and I've misunderstood, please let us know; the query then becomes more complex, of course.
select TOP 1
Translation.Title
,Translation.Summary
from
Translation
where
Translation.FkLanguageId IN (1, 3)
ORDER BY
FkLanguageId DESC
In other RDBMSs you'd use LIMIT instead of TOP.
Assuming the table contains different phrases grouped by PhraseId:
WITH Trans As
(
select Translation.Title
,Translation.Summary
,ROW_NUMBER() OVER (PARTITION BY PhraseId ORDER BY FkLanguageId DESC) RN
from Translation
where Translation.FkLanguageId IN (1,3)
)
SELECT *
FROM Trans WHERE RN=1
This assumes the existence of a TranslationKey that associates one "topic" with several different translation languages:
SELECT
isnull(tX.Title, t1.Title) Title
,isnull(tX.Summary, t1.Summary) Summary
from Translation t1
left outer join Translation tX
on tX.TranslationKey = t1.TranslationKey
and tX.FkLanguageId = @TargetLanguageId
where t1.FkLanguageId = 1 -- default language
Maybe this is a dirty solution, but it may help you:
if not exists(select t.Title ,t.Summary from Translation t where t.FkLanguageId = 3)
select t.Title ,t.Summary from Translation t where t.FkLanguageId = 1
else
select t.Title ,t.Summary from Translation t where t.FkLanguageId = 3
Since your reference to pastie.org shows that you're looking up phrases or specific menu item names in a table, I'm going to assume that there is a phrase ID that identifies the phrases in question.
SELECT ISNULL(forn_lang.Title, default_lang.Title) Title,
ISNULL(forn_lang.Summary, default_lang.Summary) Summary
FROM Translation default_lang
LEFT OUTER JOIN Translation forn_lang ON default_lang.PhraseID = forn_lang.PhraseID AND forn_lang.FkLanguageId = 3
WHERE default_lang.FkLanguageId = 1
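The self-join fallback is easy to try out in SQLite from Python. This is a sketch under a few assumptions: the PhraseID column and the sample rows are invented, and COALESCE stands in for ISNULL because SQLite has no two-argument ISNULL function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Translation (PhraseID INTEGER, FkLanguageId INTEGER,
                          Title TEXT, Summary TEXT);
-- Invented rows: phrase 1 exists in languages 1 and 3, phrase 2 only in language 1.
INSERT INTO Translation VALUES
    (1, 1, 'Home', 'Start page'),
    (1, 3, 'Startseite', 'Die Startseite'),
    (2, 1, 'Contact', 'How to reach us');
""")

# Start from the default language (1) and overlay the target language if present.
FALLBACK_SQL = """
SELECT COALESCE(t.Title, d.Title) AS Title,
       COALESCE(t.Summary, d.Summary) AS Summary
FROM Translation AS d
LEFT OUTER JOIN Translation AS t
  ON t.PhraseID = d.PhraseID AND t.FkLanguageId = ?
WHERE d.FkLanguageId = 1
ORDER BY d.PhraseID
"""

rows = conn.execute(FALLBACK_SQL, (3,)).fetchall()
print(rows)
```

Because the fallback is applied per column, a translated row with a NULL Summary would end up with the translated Title next to the default-language Summary; whether that is desirable depends on the application.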

How to write a query returning non-chosen records

I have written a psychological testing application, in which the user is presented with a list of words, and s/he has to choose ten words which very much describe himself, then choose words which partially describe himself, and words which do not describe himself. The application itself works fine, but I was interested in exploring the meta-data possibilities: which words have been most frequently chosen in the first category, and which words have never been chosen in the first category. The first query was not a problem, but the second (which words have never been chosen) leaves me stumped.
The table structure is as follows:
table words: id, name
table choices: pid (person id), wid (word id), class (value between 1-6)
Presumably the answer involves a left join between words and choices, but there has to be a modifying statement - where choices.class = 1 - and this is causing me problems. Writing something like
select words.name
from words left join choices
on words.id = choices.wid
where choices.class = 1
and choices.pid = null
causes the database manager to go on a long trip to nowhere. I am using Delphi 7 and Firebird 1.5.
TIA,
No'am
Maybe this is a bit faster:
SELECT w.name
FROM words w
WHERE NOT EXISTS
(SELECT 1
FROM choices c
WHERE c.class = 1 and c.wid = w.id)
Something like that should do the trick:
SELECT name
FROM words
WHERE id NOT IN
(SELECT DISTINCT wid -- DISTINCT is actually redundant
FROM choices
WHERE class = 1)
SELECT words.name
FROM
words
LEFT JOIN choices ON words.id = choices.wid AND choices.class = 1
WHERE choices.pid IS NULL
Make sure you have an index on choices (class, wid).
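Here is a small sketch in Python with sqlite3 (not Firebird, so treat it as an illustration of the logic only; the sample rows and the index name are invented). It shows that the NOT EXISTS and LEFT JOIN ... IS NULL forms agree, and why the query in the question returns nothing: a comparison with = NULL is never true, and the class = 1 condition in the WHERE clause discards the unmatched rows anyway.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE words (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE choices (pid INTEGER, wid INTEGER, class INTEGER);
CREATE INDEX choices_class_wid ON choices (class, wid);
INSERT INTO words VALUES (1, 'calm'), (2, 'bold'), (3, 'shy');
-- Only 'calm' was ever chosen in class 1; 'bold' only in class 3; 'shy' never.
INSERT INTO choices VALUES (7, 1, 1), (7, 2, 3), (8, 1, 2);
""")

NOT_EXISTS_SQL = """
SELECT w.name FROM words w
WHERE NOT EXISTS (SELECT 1 FROM choices c
                  WHERE c.class = 1 AND c.wid = w.id)
ORDER BY w.id
"""

LEFT_JOIN_SQL = """
SELECT words.name
FROM words
LEFT JOIN choices ON words.id = choices.wid AND choices.class = 1
WHERE choices.pid IS NULL
ORDER BY words.id
"""

# The form from the question: "= NULL" evaluates to NULL, so no row passes.
BROKEN_SQL = """
SELECT words.name
FROM words LEFT JOIN choices ON words.id = choices.wid
WHERE choices.class = 1 AND choices.pid = NULL
"""

never_chosen = [r[0] for r in conn.execute(NOT_EXISTS_SQL)]
via_left_join = [r[0] for r in conn.execute(LEFT_JOIN_SQL)]
broken = conn.execute(BROKEN_SQL).fetchall()
print(never_chosen, via_left_join, broken)
```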