How to organize an SQL database to store book text analysis data

I have analysed 3 books using the Stanford NLP library. I run my analysis on a page basis, and for every book this is the output I get:
// An array of length P, where P is the total number of pages in the book,
// so that pageSentiment[0] represents the sentiment of page 1.
float[] pageSentiment
// An array of length P, where P is the total number of pages in the book,
// so that pageWords[0] represents the number of words on page 1.
int[] pageWords
// An array of length W, where W is the number of unique words in the book,
// and where, for example, bookWords[0] has the following values:
// word = "then"
// data[0] = {1, 1, 2} => the word "then" occurs 2 times on page 1 (which belongs to chapter 1)
// data[1] = {1, 2, 1} => the word "then" occurs 1 time on page 2 (which belongs to chapter 1)
// data[2] = {1, 3, 0} => the word "then" occurs 0 times on page 3 (which belongs to chapter 1)
// data[3] = {1, 4, 0} => the word "then" occurs 0 times on page 4 (which belongs to chapter 1)
// data[4] = {2, 5, 3} => the word "then" occurs 3 times on page 5 (which belongs to chapter 2)
// data[5] = ...
// (each data[i] is a {chapter, page, count} triple, so data is a jagged 2-D array)
struct WordData { string word; int[][] data; }
WordData[] bookWords
Now... I have to store all those results in an SQL database so that I can access it to plot graphs and statistical tables within a web page. What I'm trying to figure out is the proper way to store all those values in a flexible way, so that I can easily send different queries to the database and obtain different outputs that follow my current needs. For example, I need to be able to:
plot a histogram of the word counts (pageWords), in which each column can be either a page or a chapter (in the latter case I need to aggregate page values);
see the frequency of a word by page or by chapter;
print global book values for every book;
etc.
Any suggestion about the structure of my SQL tables, please?

Just 3 tables
book
---
book_id
title
...
word
---
word_id
text
...
and a many-to-many table with the results
word_2_book
---
word_id
book_id
page_no
chapter_no
word_count
Then just
select *
from word_2_book wb
where wb.book_id=? and wb.word_id=?
and you can apply any aggregate functions
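For example, here is a minimal runnable sketch of this schema and one aggregated query, using Python with SQLite for brevity; the column types and the composite primary key are assumptions, not part of the answer above:
import sqlite3

con = sqlite3.connect(":memory:")

# The three tables from the answer above; types and keys are illustrative.
con.executescript("""
CREATE TABLE book (book_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE word (word_id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE word_2_book (
    word_id    INTEGER REFERENCES word(word_id),
    book_id    INTEGER REFERENCES book(book_id),
    page_no    INTEGER,
    chapter_no INTEGER,
    word_count INTEGER,
    PRIMARY KEY (book_id, word_id, page_no)
);
""")

# Frequency of one word per chapter (pages aggregated), matching the
# "by chapter" requirement from the question:
rows = con.execute("""
    SELECT chapter_no, SUM(word_count) AS freq
    FROM word_2_book
    WHERE book_id = ? AND word_id = ?
    GROUP BY chapter_no
    ORDER BY chapter_no
""", (1, 1)).fetchall()
Grouping by page_no instead of chapter_no gives the per-page view, so the same table serves both histogram granularities.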

Related

Understanding Pandas Series Data Structure

I am trying to get my head around the Pandas module and started learning about the Series data structure.
I have created the following Series in Spyder:
songs = pd.Series(data = [145,142,38,13], name = "Count")
I can obtain information about the Series index using the code:
songs.index
The output of the above code is as follows:
RangeIndex(start=0, stop=4, step=1)
My question is: where it states start = 0 and stop = 4, what are these referring to?
I have interpreted start = 0 as meaning that the first element in the Series is in row 0.
But I am not sure what the stop value refers to, as there is no element in row 4 of the Series.
Can someone explain?
Thank you.
This concept, as already explained adequately in the comments (the last valid index is one less than the count of items), is prevalent in many places.
For instance, take the list data structure:
z = songs.to_list()
z         # [145, 142, 38, 13]
len(z)    # 4 -- the length is four
# However, indexing stops at position i-1, 'i' being the length/count of items in the list.
z[4]      # this will raise an IndexError
# You have to start at index 0 and can go only up to index 3 (i.e. 4 items).
z[0], z[1], z[2], z[-1]  # notice how -1 can be used to directly access the last element
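In the Series itself this shows up as a RangeIndex, which follows the same half-open convention as Python's range(): start is inclusive, stop is exclusive. A quick check:
import pandas as pd

songs = pd.Series(data=[145, 142, 38, 13], name="Count")

print(songs.index)        # RangeIndex(start=0, stop=4, step=1)
print(list(songs.index))  # [0, 1, 2, 3] -- stop=4 is never itself a row label
print(songs[3])           # 13, the last element, at index stop-1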

How to get same rank for same scores in Redis' ZRANK?

If I have 5 members with scores as follows
a - 1
b - 2
c - 3
d - 3
e - 5
ZRANK of c returns 2, ZRANK of d returns 3
Is there a way to get same rank for same scores?
Example: ZRANK c = 2, d = 2, e = 3
If yes, then how to implement that in spring-data-redis?
Any real solution needs to fit the requirements, which are somewhat missing from the original question. My first answer assumed a small dataset, but that approach does not scale, as dense ranking is done (e.g. via Lua) in at least O(N).
So, assuming that there are a lot of users with scores, the direction that for_stack suggested is better, in which multiple data structures are combined. I believe this is the gist of his last remark.
To store users' scores you can use a Hash. While conceptually you can use a single key to store a Hash of all users scores, in practice you'd want to hash the Hash so it will scale. To keep this example simple, I'll ignore Hash scaling.
This is how you'd add (update) a user's score in Lua:
local hscores_key = KEYS[1]
local user = ARGV[1]
local increment = ARGV[2]
local new_score = redis.call('HINCRBY', hscores_key, user, increment)
Next, we want to track the current count of users per discrete score value so we keep another hash for that:
local old_score = new_score - increment
local hcounts_key = KEYS[2]
local old_count = redis.call('HINCRBY', hcounts_key, old_score, -1)
local new_count = redis.call('HINCRBY', hcounts_key, new_score, 1)
Now, the last thing we need to maintain is the per score rank, with a sorted set. Every new score is added as a member in the zset, and scores that have no more users are removed:
local zdranks_key = KEYS[3]
if new_count == 1 then
    redis.call('ZADD', zdranks_key, new_score, new_score)
end
if old_count == 0 then
    redis.call('ZREM', zdranks_key, old_score)
end
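Put together and registered from a client, the update might look like this sketch (redis-py shown here; the key names are placeholders I chose, passed in the KEYS order used above):
import redis

r = redis.Redis()

# The three Lua fragments above, concatenated into one script.
update_score = r.register_script("""
local new_score = redis.call('HINCRBY', KEYS[1], ARGV[1], ARGV[2])
local old_score = new_score - ARGV[2]
local old_count = redis.call('HINCRBY', KEYS[2], old_score, -1)
local new_count = redis.call('HINCRBY', KEYS[2], new_score, 1)
if new_count == 1 then redis.call('ZADD', KEYS[3], new_score, new_score) end
if old_count == 0 then redis.call('ZREM', KEYS[3], old_score) end
return new_score
""")

update_score(keys=["hscores", "hcounts", "zdranks"], args=["alice", 10])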
This three-piece script's complexity is O(logN) due to the use of the Sorted Set, but note that N is the number of discrete score values, not the number of users in the system. Getting a user's dense ranking is done via another, shorter and simpler script:
local hscores_key = KEYS[1]
local zdranks_key = KEYS[2]
local user = ARGV[1]
local score = redis.call('HGET', hscores_key, user)
return redis.call('ZRANK', zdranks_key, score)
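Invoked from redis-py, that lookup could look like the following sketch (the guard for unknown users is my addition; the original script assumes the user exists):
import redis

r = redis.Redis()

get_drank = r.register_script("""
local score = redis.call('HGET', KEYS[1], ARGV[1])
if not score then return nil end
return redis.call('ZRANK', KEYS[2], score)
""")

print(get_drank(keys=["hscores", "zdranks"], args=["alice"]))  # 0-based dense rank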
You can achieve the goal with two Sorted Sets: one for the member-to-score mapping, and one for the score-to-rank mapping.
Add
Add items to the member-to-score mapping: ZADD mem_2_score 1 a 2 b 3 c 3 d 5 e
Add the scores to the score-to-rank mapping: ZADD score_2_rank 1 1 2 2 3 3 5 5
Search
Get the score first: ZSCORE mem_2_score c; this should return the score, i.e. 3.
Get the rank for the score: ZRANK score_2_rank 3; this should return the dense ranking, i.e. 2.
In order to run it atomically, wrap the Add and Search operations in two Lua scripts.
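For illustration, here is the same flow from redis-py, as a plain sketch without the Lua wrapping (so not atomic as written); the str(int(score)) conversion assumes integer scores:
import redis

r = redis.Redis()

# Member-to-score and score-to-rank mappings, as described above.
r.zadd("mem_2_score", {"a": 1, "b": 2, "c": 3, "d": 3, "e": 5})
r.zadd("score_2_rank", {"1": 1, "2": 2, "3": 3, "5": 5})

def dense_rank(member):
    score = r.zscore("mem_2_score", member)   # e.g. 3.0 for "c"
    if score is None:
        return None
    return r.zrank("score_2_rank", str(int(score)))

print(dense_rank("c"), dense_rank("d"), dense_rank("e"))  # 2 2 3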
Then there's this Pull Request - https://github.com/antirez/redis/pull/2011 - which is dead, but appears to make dense rankings on the fly. The original issue/feature request (https://github.com/antirez/redis/issues/943) got some interest so perhaps it is worth reviving it /cc #antirez :)
The rank is unique in a sorted set, and elements with the same score are ordered (ranked) lexically.
There is no Redis command that does this "dense ranking"
You could, however, use a Lua script that fetches a range from a sorted set and reduces it to your requested form. This could work on small data sets, but you'd have to devise something more complex in order to scale.
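For example, here is a client-side version of that reduction (the answer suggests doing it in Lua; plain Python with a placeholder key name is shown to keep the sketch short):
import redis

r = redis.Redis()

def dense_ranks(key):
    # Walk the zset in score order and bump the rank only when the score changes.
    ranks, last_score, rank = {}, None, -1
    for member, score in r.zrange(key, 0, -1, withscores=True):
        if score != last_score:
            rank += 1
            last_score = score
        ranks[member] = rank
    return ranks

# With members a=1 b=2 c=3 d=3 e=5 this yields {a: 0, b: 1, c: 2, d: 2, e: 3}.
print(dense_ranks("myzset"))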
unsigned long zslGetRank(zskiplist *zsl, double score, sds ele) {
    zskiplistNode *x;
    unsigned long rank = 0;
    int i;

    x = zsl->header;
    for (i = zsl->level-1; i >= 0; i--) {
        while (x->level[i].forward &&
               (x->level[i].forward->score < score ||
                (x->level[i].forward->score == score &&
                 sdscmp(x->level[i].forward->ele,ele) <= 0))) {
            rank += x->level[i].span;
            x = x->level[i].forward;
        }
        /* x might be equal to zsl->header, so test if obj is non-NULL */
        if (x->ele && x->score == score && sdscmp(x->ele,ele) == 0) {
            return rank;
        }
    }
    return 0;
}
https://github.com/redis/redis/blob/b375f5919ea7458ecf453cbe58f05a6085a954f0/src/t_zset.c#L475
This is the piece of code Redis uses to compute the rank in sorted sets. Right now, it just gives the rank based on the position in the skiplist (which is sorted by score).
See also: What does the skiplistNode variable "span" mean in redis.h? (what is span?)

How to find all offset positions of a term using Apache Lucene

I am trying to find all offset positions of a given term. For instance, I have the input "dog cat orange dog green dog" and I would like to find the offsets for the term "dog". The result would be: 0, 15, 25.
Terms terms = indexReader.getTermVector(0, "text");
TermsEnum iterator = terms.iterator();
BytesRef byteRef = null;
while ((byteRef = iterator.next()) != null) {
    String term = byteRef.utf8ToString(); // here I find the term text
    /* Here I only know the term frequency and the first offset (0)
       for the given term, not all of them */
}
Let's say I have a term that occurred 3 times while indexing, like above. I would like to get an array containing all offsets for the term's occurrences.
Right now I am getting only one offset for each term. How can I gather more information? I would be grateful for any help.
EDIT:
FieldType fieldType = new FieldType();
fieldType.setTokenized(true);
fieldType.setStoreTermVectors(true);
fieldType.setStoreTermVectorPositions(true);
fieldType.setStoreTermVectorOffsets(true);
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);

Convert cell text into progressive numbers

I have written this SQL in a PostgreSQL environment:
SELECT
ST_X("Position4326") AS lon,
ST_Y("Position4326") AS lat,
"Values"[4] AS ppe,
"Values"[5] AS speed,
"Date" AS "timestamp",
"SourceId" AS smartphone,
"Track" as session
FROM
"SingleData"
WHERE
"OsmLineId" = 44792088
AND
array_length("Values", 1) > 4
AND
"Values"[5] > 0
ORDER BY smartphone, session;
Now I have imported the result into Matlab, and I have six vectors and one cell array (because the text of the UUIDs was converted into a cell array), all of size 5710x1.
Now I would like to convert the text in the cell array into a progressive number, like 1, 2, 3, ... for each different session code.
In Excel it is easy with a VLOOKUP-style function (FIND.VERT(obj, matrix, col)), but I do not know how to do it in Matlab.
Now I have a big cell array with a lot of codes like:
ff95465f-0593-43cb-b400-7d32942023e1
I would like to convert this cell array into an array of numbers where the first occurrence of
ff95465f-0593-43cb-b400-7d32942023e1 -> 1
and so on, putting 2 when a different code appears, and so on.
OK, I have solved it.
I put the unique session codes in a second cell array C.
At that point, with a for loop, I obtain:
%% Converting the UUIDs into integers
C = unique(session);        % unique session codes, sorted
N = length(session);
session2 = zeros(N, 1);
for i = 1:N
    session2(i) = find(strcmp(C, session(i)));  % position of this code in C
end
% Note: [C, ~, session2] = unique(session); gives the same mapping without the loop.
Thanks to all!

How to make training and testing data files for LIBSVM and/or TinySVM

When I open the sample training data files for LIBSVM, I can't understand the file structure. Can someone please show me how to make one?
Below is my training data to predict the songwriter of a song (as an example):
Feature 1: number of occurrences of the word "love" in the lyrics
Feature 2: number of occurrences of the word "friend" in the lyrics
Feature 3: number of occurrences of the word "zone" in the lyrics
Training data:
Song A (3, 0, 0), song writer is David
Song B (0, 3, 1), song writer is Peter
Song C (1, 3, 1), song writer is Tom
Testing data:
Song D (3, 0, 1)
Thank you very much.
The LIBSVM README file can help you.
The training data must be something like this:
label index1:value1 index2:value2 ... -1:? (? can be any number)
(the trailing -1 entry is the terminator used for svm_node arrays in the C API; it does not appear in the data files themselves)
But in LIBSVM there is also a class called svm_node that represents the same thing.
Sample code in Java:
for (int k = 0; k < dataCount; k++) {
    prob.x[k] = new svm_node[features.length];
    for (int j = 0; j < features.length; j++) {
        svm_node node = new svm_node();
        node.index = featuresIndex[j];  // 1-based feature index
        node.value = features[j];       // feature value
        prob.x[k][j] = node;
    }
    prob.y[k] = label;                  // class label of sample k
}
In this classification problem we have three classes for our whole dataset, David, Peter and Tom, and we assign them the values 0, 1 and 2 respectively.
The format of the data set will be:
[label] [feature number]:[number of times that feature occurs] ...
Our training data file will look like this.
0 1:3 2:0 3:0
1 1:0 2:3 3:1
2 1:1 2:3 3:1
This file can be used to train our model. It has three rows and four columns: the first column is the actual result (the class label), and the other columns hold [feature number]:[number of times that feature occurs].
The testing data is treated the same way:
1:3 2:0 3:1
This will be passed to the SVM model, and then a prediction can be drawn. (In an actual test file a placeholder label is prepended to each row; it is only used to report accuracy.)
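For completeness, here is a small Python sketch that writes the example above in this format. The file names are my choice, and the zero-valued features are kept to match the rows shown, although LIBSVM's sparse format lets you omit them:
# Labels: David = 0, Peter = 1, Tom = 2, as assigned above.
train = [
    (0, [3, 0, 0]),  # Song A, David
    (1, [0, 3, 1]),  # Song B, Peter
    (2, [1, 3, 1]),  # Song C, Tom
]

def libsvm_line(label, features):
    # LIBSVM feature indices are 1-based.
    pairs = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
    return f"{label} {pairs}"

with open("train.txt", "w") as f:
    for label, feats in train:
        f.write(libsvm_line(label, feats) + "\n")

# Song D (3, 0, 1): a placeholder label (here 0) is prepended in the test
# file; it is only used by svm-predict to report accuracy.
with open("test.txt", "w") as f:
    f.write(libsvm_line(0, [3, 0, 1]) + "\n")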