How to calculate TF-IDF in OracleSQL? - sql

This is a text-mining project. The goal is to see how each word is weighted differently in different documents.
I have two tables: one with TF information (WORD | WordFrequency_in_EachFile) and another with IDF information (WORD | HowManyFile_have_EachWord). I am not sure what query to use for this calculation.
The math I am trying to do here is:
WordFrequency_in_EachFile*(log(N/HowManyFile_have_EachWord)+1)
N is the total number of documents.
Below is my code:
create table TF_IDF (WORD, TF*IDF) as
select A.frequency*((log(10,132366/B.totalcount)+1))
from term_frequency A, document_frequency B
where A.WORD=B.WORD;
Here 1323266 is the total number of my documents, and totalcount is the number of documents in which a word appears.
Since I am new to SQL, I would appreciate a little explanation to your code. Thanks a lot!

The calculation looks good, but the syntax is invalid in a few places.
A corrected version might look like this:
create table TF_IDF as
select
    A.Word as Word,
    A.frequency * (log(10, 132366 / B.totalcount) + 1) as TFIDF
from
    term_frequency A,
    document_frequency B
where
    A.WORD = B.WORD
;
In a CREATE ... AS SELECT ... statement you don't need column specifications. Column names and types are derived from the aliases in the select list.
Also, you must select the Word column so that the new table contains it.
One more point: the original expression had a redundant pair of parentheses.
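To sanity-check the numbers the query produces, the same weighting can be reproduced outside the database. The following is a minimal sketch in Python, not part of the original answer; the sample rows and the value used for N are made up for illustration.

```python
import math

def tf_idf(term_frequency, docs_with_word, total_docs):
    """Return tf * (log10(N / df) + 1), the same weighting as the query above."""
    return term_frequency * (math.log10(total_docs / docs_with_word) + 1)

# Hypothetical rows: (word, frequency in one file, number of files containing the word)
rows = [("oracle", 3, 100), ("the", 50, 1000)]
total_docs = 1000  # stands in for N; illustrative value only

weights = {word: tf_idf(tf, df, total_docs) for word, tf, df in rows}
for word, weight in weights.items():
    print(word, weight)
```

A word that appears in every document gets log10(1) = 0, so its weight falls back to the plain frequency because of the +1 term.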

Related

How can I assign pre-determined codes (1,2,3, etc,) to a JSON-type column in PostgreSQL?

I'm extracting a table of 2000+ rows containing park details. One of the columns is of JSON type. Image of the table
We have about 15 attributes like this, and we also have documentation of the pre-determined codes assigned to each attribute.
Each row in the extracted table has a different set of attributes that you can see in the image. Right now, I have cast(parks.services AS text) AS "details" to get all the attributes for a particular park or extract just one of them using the code below:
CASE
WHEN cast(parks.services AS text) LIKE '%uncovered%' THEN '2'
WHEN cast(parks.services AS text) LIKE '%{covered%' THEN '1' END AS "details"
This time around, I need to extract these attributes by assigning them the codes. As an example, let's just say
Park 1 - {covered, handicap_access, elevator} to be {1,3,7}
Park 2 - {uncovered, always_open, handicap_access} to be {2,5,3}
I have thought of using a subquery to pre-assign the codes, but I cannot wrap my head around the JSON operators. In fact, I don't know how to extract the attributes across 2000+ rows.
It would be helpful if someone could guide me in this topic. Thanks a lot!
You should really think about normalizing your tables. Don't store arrays. You should add a mapping table to map the parks and the attribute codes. This makes everything much easier and more performant.
step-by-step demo:db<>fiddle
SELECT
    t.name,
    array_agg(c.code ORDER BY elems.index) as codes -- 3
FROM mytable t,
    unnest(attributes) WITH ORDINALITY as elems(value, index) -- 1
JOIN codes c ON c.name = elems.value -- 2
GROUP BY t.name
1. Extract the array elements into one record per element. Add WITH ORDINALITY to preserve the original order.
2. Join your codes on the elements.
3. Create the code arrays. To ensure the correct order, you can use the index values created by the WITH ORDINALITY clause.
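The same three steps (split into elements, look up each code, reassemble in the original order) can be sketched outside the database. This is an illustrative Python version, not the answerer's code; the code table contains only the values given in the question's example, and dropping unknown attributes is an assumption that mirrors the inner join.

```python
# Code table taken from the example in the question; other attributes
# would be added the same way.
CODES = {
    "covered": 1,
    "uncovered": 2,
    "handicap_access": 3,
    "always_open": 5,
    "elevator": 7,
}

def encode(attributes):
    """Map each attribute to its code, keeping the original order.

    Attributes without a code are dropped, as an inner join would do.
    """
    return [CODES[name] for name in attributes if name in CODES]

parks = {
    "Park 1": ["covered", "handicap_access", "elevator"],
    "Park 2": ["uncovered", "always_open", "handicap_access"],
}

encoded = {park: encode(attrs) for park, attrs in parks.items()}
print(encoded)
```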

How can I replace a column with another column in a different table in sql?

I have two different tables. One of them has text data, and the other has words and their stems. I want to look at every word in the text data and compare it with the second table of words and stems. If a word in table TEXT_DATA matches an entry in the second table, DICTIONARY, I want to replace it with its stem.
I wrote a simple query, but it didn't work.
TEXT_DATA
TEXT
I have chocolates
DICTIONARY
WORD_FORM | STEM
chocolates | chocolate
SELECT
REPLACE(TEXT,(SELECT WORD_FORM FROM DICTIONARY),(SELECT STEM FROM DICTIONARY))
FROM TEXT_DATA
I want the new text to read: I have chocolate
Thanks in advance
Consider joining tables with INSTR():
SELECT
REPLACE(t.TEXT, d.WORD_FORM, d.STEM) AS NEW_TEXT
FROM TEXT_DATA t
INNER JOIN DICTIONARY d
ON INSTR(t.TEXT, d.WORD_FORM, 1, 1) > 0
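Note that the join returns one row per matching dictionary entry, so a text containing several dictionary words comes back several times, each with a single replacement applied. As a rough illustration of applying every matching entry to the same text, here is a minimal Python sketch. It is not the answerer's code, and like REPLACE it matches plain substrings, not whole words.

```python
def stem_text(text, dictionary):
    """Replace every word form found in the text with its stem.

    `dictionary` maps WORD_FORM to STEM. Matching is by substring,
    like INSTR/REPLACE, so "chocolates" also matches inside longer words.
    """
    for word_form, stem in dictionary.items():
        if word_form in text:  # same check as INSTR(...) > 0
            text = text.replace(word_form, stem)
    return text

dictionary = {"chocolates": "chocolate"}
new_text = stem_text("I have chocolates", dictionary)
print(new_text)
```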

How to use more than 200 nested if conditions in excel?

I have the following data in excel sheet A.
Category Name
Fruit Apple
Vegetable Brinjal
XYZ Abc
I want to create a formula that takes a value from the Name column and outputs the corresponding value from the Category column.
If I use VLOOKUP, I have to copy this reference table into every sheet where I need this operation.
Hence I am looking for something similar to
IF(input="Apple","Fruit",IF(input="Brinjal","Vegetable",IF(input="Abc","XYZ","")))
But there is a limit on nested IFs in Excel, and the number of cases allowed in a switch is also limited.
I have around 200 rows of this table.
Use the INDEX and MATCH functions: INDEX on "category" by matching "name".
You certainly don't need so many IF statements (though I note your Q Title), for example:
=CHOOSE(MATCH(D13,{"Apple","Brinjal","Abc"},0),"Fruit","Vegetable","XYZ")
which should not grow at quite the rate your version would - but with 200 'pairs' would be getting close to the limit for CHOOSE.
(D13 as example in spreadsheet.)

Excel randomly select name from list with multiple entries

I have an Excel 2007 worksheet with employee names in column A and the total number of entries in column B. I need to be able to randomly select x employee names from the total number of entries, allowing for the fact that some employees will have multiple entries.
For example:
Amy............30
Brian..........12
Charlene.......15
Michael.........1
Nathan..........7
What is the best way to do this?
My initial thoughts are:
1) For each name, find the max() of column B occurrences of a random number and put it in another column, like C. Then find the top values in that new column.
2) create a VBA array of all of the potential entries and randomly pick one from there.
3) loop through all of the names in column A and create a temp worksheet with column B instances of each, then assign a random num generator and choose the top n.
Having said that, there may be something a lot easier. I am not sure where to begin. Normally I can find code that is similar to what I need, but I am not having any luck. Any help that you can offer would be appreciated.
Thank you in advance.
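As a rough illustration of idea 2 above (one pool holding every entry, with random picks from it), here is a minimal sketch in Python rather than VBA. It assumes each employee can be selected at most once, which the question does not state; the function name and the seed are hypothetical.

```python
import random

def draw_winners(entries, count, rng=None):
    """Draw up to `count` distinct names; each pick is weighted by entry count."""
    rng = rng or random.Random()
    # One pool element per entry, so Amy appears 30 times, Michael once.
    pool = [name for name, n in entries.items() for _ in range(n)]
    winners = []
    while pool and len(winners) < count:
        winner = rng.choice(pool)
        winners.append(winner)
        # Remove all of the winner's remaining entries (assumed rule).
        pool = [name for name in pool if name != winner]
    return winners

entries = {"Amy": 30, "Brian": 12, "Charlene": 15, "Michael": 1, "Nathan": 7}
winners = draw_winners(entries, 3, random.Random(42))  # seeded for repeatability
print(winners)
```

If the same person may win more than once, the pool filter would instead remove only the single drawn entry.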
I would probably do something like this, if I understand your question correctly (I just read your question title):

Related rows based on text columns

Given that I have a table with a TEXT column in it (MySQL or SQLite), is it possible to use the value of that column to find similar rows with somewhat related text values?
For example, if I wanted to find rows related to row_3, both 1 & 2 would match:
row_1 = this is about sports
row_2 = this is about study
row_3 = this is about study and sports
I know that I could use FULLTEXT or FTS3 if I had a key word I wanted to MATCH against the column values - but I'm just trying to find text that is somewhat related among the rows.
MySQL supports a fulltext search option called QUERY EXPANSION. The idea is that you search for a keyword, it finds a row, and then it uses the words in that row as keywords, to search for more matching rows.
SELECT ... FROM StudiesTable WHERE MATCH(description_text)
AGAINST ('sports' IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION);
Read about it here: http://dev.mysql.com/doc/refman/5.1/en/fulltext-query-expansion.html
You're using the wrong hammer to pound that screw in. A single string in a database column isn't the way to store that data. You can't easily get at the part you care about, which is the individual words.
There is a lot of research into the problem of comparison of text. If you're serious about this need, you'll want to start reading about the variety of techniques in that problem domain.
The first clue is that you want to access / index the data not by complete text string, but by word or sentence fragment (unless you're interested in words that are spelled similarly being matched together, which is harder).
As an example of one technique, generate a chain out of your sentences by grabbing overlapping sets of three words, and store the chain. Then you can search for entries that have a large number of chain segments in common. A set of chain segments for your statements above would be:
row_1 = this is about sports
row_2 = this is about study
row_3 = this is about study and sports
this is about (3 matches)
is about sports
is about study (2 matches)
about study and
study and sports
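The chain technique above can be sketched in a few lines. This is a minimal Python illustration, not the answerer's code; it assumes words are separated by whitespace and counts how many three-word segments each row shares with row_3.

```python
def chain_segments(text, size=3):
    """Return the set of overlapping `size`-word segments in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}

rows = {
    "row_1": "this is about sports",
    "row_2": "this is about study",
    "row_3": "this is about study and sports",
}

segments = {name: chain_segments(text) for name, text in rows.items()}

# Number of segments each other row has in common with row_3.
shared = {
    name: len(segments["row_3"] & segs)
    for name, segs in segments.items()
    if name != "row_3"
}
print(shared)
```

In a database, the segments would be stored in their own indexed table keyed by row id, so the overlap becomes a join with a count.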
Maybe it would be enough to take each relevant word in the base row (more than 4 letters? or checked against a list of common words?), use those words as keywords for the fulltext search, and build a tmp table (id, row_matched_id, count) that records the matches for each row, adding 1 to count on each match. At the end, the tmp table holds every row that matched and how many times it matched (how many relevant words were the same).
If you want to run it once against the whole database and keep the results, use a persisted table, add a column for the id of the base row, and run the search for each newly inserted (or updated) row to update the results table.
Using this results table, you can quickly find the rows that match the most words of the base row without running the search again.
Edit: with this you can "score" the results. For example, if the base row has x relevant words, you can calculate a score in % as (matches/x * 100) and filter out all results with, for example, less than 50% matches. In your example, row_1 and row_2 would each score 50% if only "study" and "sports" count as relevant words (treating "about" as a common word), or 67% if you consider all the words.
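The scoring formula (matches/x * 100) can be checked with a short sketch. This Python version is illustrative only; the list of common words is an assumption chosen for this example, and words are split on whitespace.

```python
def match_score(base, other, common_words=frozenset()):
    """Percentage of the base row's relevant words that also occur in `other`."""
    relevant = {w for w in base.lower().split() if w not in common_words}
    if not relevant:
        return 0.0
    matches = relevant & set(other.lower().split())
    return len(matches) / len(relevant) * 100

# Hypothetical list of common words to ignore.
COMMON = frozenset({"this", "is", "about", "and"})

base = "this is about study and sports"
all_words = match_score(base, "this is about sports")         # 4 of 6 words match
filtered = match_score(base, "this is about sports", COMMON)  # 1 of 2 words match
print(round(all_words), round(filtered))
```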