Cassandra returns inconsistent results with a secondary index - indexing

Suppose I have a column family named PEOPLE with the following columns:
1. id text (PRIMARY KEY)
2. first_name text
3. last_name text
4. countries_visited Set<text>
I have created a secondary index on "countries_visited".
Now if I run a query like this:
select id, first_name, countries_visited from PEOPLE where countries_visited CONTAINS 'FRANCE';
For a large number of records, it returns many rows where countries_visited contains other strings but not 'FRANCE'.
I also ran nodetool rebuild_index, but I still get the same result.
Is this the expected behavior?
I am using:
CQL 3.2.1
Cassandra 2.1.11.908

Related

PostgreSQL database: need suggestions for creating an index on a table

I have a table with the structure below. To fetch data from the table I use the search criteria listed below, and I am writing a single SQL query for the requirement (a sample query is shown below). I need to create an index on the table that covers all of the search criteria. Any advice would be helpful.
Table structure (columns):
applicationid varchar(15),
trans_tms timestamp,
SSN varchar,
firstname varchar,
lastname varchar,
DOB date,
Zipcode smallint,
adddetais json
The search criteria come from an API and fall into 4 categories. All 4 categories are mandatory: I will always receive values for all 4 categories for a single applicant.
Search criteria:
ssn & lastname (the last name must be compared through a function, i.e. soundex(lastname) = soundex('inputvalue'))
ssn & DOB
ssn & zipcode
firstname & lastname & DOB
Query:
The sample query I am trying to write is:
Select *
from table
where (ssn='aaa' and soundex(lastname)=soundex('xxx'))
or (ssn='aaa' and dob=xxx)
or (ssn='aaa' and zipcode = 'xxx')
or (firstname='xxx' and lastname='xxx' and dob= xxxx);
For performance I need to create an index on the table, possibly a composite one. Any suggestion will be helpful.
Some approaches I would follow:
Yes, you are correct: a composite (multicolumn) index helps with AND conditions on two columns. However, the indexes would overlap on columns for the given conditions.
Documentation: https://www.postgresql.org/docs/10/indexes-multicolumn.html
You can use a UNION instead of OR.
Reference: https://www.cybertec-postgresql.com/en/avoid-or-for-better-performance/
If several conditions can be combined, e.g. ssn must be 'aaa' in more than one branch, then rewriting the WHERE clause with fewer ORs is preferable.
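As a rough illustration of the UNION suggestion, the sketch below compares the OR form with the UNION form on a small in-memory SQLite database. It is only a sketch under assumptions: the table name applicants, the index names and the sample rows are made up, SQLite stands in for PostgreSQL, and the soundex branch is left out because SQLite does not ship soundex by default. It shows that the two forms return the same rows when each row is unique; it does not measure performance.

```python
import sqlite3

# Hypothetical stand-in for the table in the question (SQLite, not PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE applicants ("
    "applicationid TEXT, ssn TEXT, firstname TEXT, "
    "lastname TEXT, dob TEXT, zipcode INTEGER)"
)
conn.executemany(
    "INSERT INTO applicants VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("A1", "111", "Ann", "Lee", "1990-01-01", 10001),
        ("A2", "111", "Bob", "Ray", "1985-05-05", 20002),
        ("A3", "222", "Ann", "Lee", "1990-01-01", 30003),
        ("A4", "333", "Cy", "Poe", "1970-07-07", 10001),
    ],
)

# One composite index per search category (index names are illustrative).
conn.execute("CREATE INDEX idx_ssn_dob ON applicants (ssn, dob)")
conn.execute("CREATE INDEX idx_ssn_zip ON applicants (ssn, zipcode)")
conn.execute("CREATE INDEX idx_name_dob ON applicants (firstname, lastname, dob)")

params = {"ssn": "111", "dob": "1990-01-01", "zip": 20002, "fn": "Ann", "ln": "Lee"}

or_query = """
SELECT applicationid FROM applicants
WHERE (ssn = :ssn AND dob = :dob)
   OR (ssn = :ssn AND zipcode = :zip)
   OR (firstname = :fn AND lastname = :ln AND dob = :dob)
"""

# Each UNION branch is a plain AND condition that one composite index can serve.
union_query = """
SELECT applicationid FROM applicants WHERE ssn = :ssn AND dob = :dob
UNION
SELECT applicationid FROM applicants WHERE ssn = :ssn AND zipcode = :zip
UNION
SELECT applicationid FROM applicants
WHERE firstname = :fn AND lastname = :ln AND dob = :dob
"""

or_ids = sorted(r[0] for r in conn.execute(or_query, params))
union_ids = sorted(r[0] for r in conn.execute(union_query, params))
print(or_ids, union_ids)
```

Note that UNION removes duplicate rows, so the two forms are only interchangeable when the selected rows are unique (for example, when applicationid is unique).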

In SQL, for a column with data type text, how can we show the rest of the entries in that column after a particular entry?

In SQL, in any given table, there is a column named "name" with data type text.
Suppose there are ten entries and one entry in the column is "rohit". I want to show all the entries in the name column after rohit, and I do not know the row id or id. Can it be done?
select * from your_table where name > 'rohit'
But in general you should not treat text columns like that.
A database is more than a collection of tables.
Think about how to organize your data and what defines a data row.
Maybe, besides the name, there is another attribute by which you would classify such a row, something like "shall be displayed?", "is modified" or "is active"?
So if you had a second column, say display of type int, and your table looked like
CREATE TABLE MYDATA (
NAME TEXT,
DISPLAY INT NOT NULL DEFAULT 1
);
you could flag every row with 1 or 0 to say whether it should be displayed, and then your query could look like
SELECT * FROM MYDATA WHERE DISPLAY=1 ORDER BY NAME
to get your list of values.
It's not much of a difference with ten rows, and you don't even need indexes here, but if you build something bigger, say 10,000+ rows, you'd be surprised how slow that can become!
In general, TEXT columns are good to select and display, but should be avoided in WHERE conditions as much as you can. Use descriptive columns, preferably int fields, which can be indexed very efficiently, so that an application doesn't get slower even when the table grows beyond 100k records.
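Both queries from this answer can be tried against an in-memory SQLite database. The following is a minimal sketch; the sample names and flag values are made up for illustration. Note that name > 'rohit' compares strings lexicographically, so "after rohit" here means "sorts after rohit", not "was inserted after rohit".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mydata (name TEXT, display INT NOT NULL DEFAULT 1)")
conn.executemany(
    "INSERT INTO mydata (name, display) VALUES (?, ?)",
    [("amit", 1), ("rohit", 1), ("sonia", 0), ("vikram", 1)],  # made-up rows
)

# Names that sort after 'rohit' (lexicographic comparison on the text column).
after_rohit = [
    r[0]
    for r in conn.execute(
        "SELECT name FROM mydata WHERE name > ? ORDER BY name", ("rohit",)
    )
]

# Rows flagged for display, using the integer flag column instead of the text.
displayed = [
    r[0]
    for r in conn.execute("SELECT name FROM mydata WHERE display = 1 ORDER BY name")
]

print(after_rohit)
print(displayed)
```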
You can use "default" keyword for it.
CREATE TABLE Persons (
ID int NOT NULL,
name varchar(255) DEFAULT 'rohit'
);

Avoiding duplicate rows being inserted where unique rows are identified from two tables

I have two tables, customer_name and customer_phone, and a unique customer is identified by the combination of four columns across the two tables.
Since we have multiple source systems inserting into the tables below at the same time, every job validates before inserting, using a function that checks whether the customer already exists by the combination (f_name, l_name, area_code, phone_num). However, we still see duplicates being inserted, because the validation runs while another job has already inserted the row but not yet committed it. Is there any solution to avoid duplicates?
customer_name
Col: ID, First_name, last_name
customer_phone
Col: ID, area_code, Phone_number
Yes. Don't do the checking in the application. Let the database do the checking by using unique indexes/constraints. If I had to guess on the constraints you want:
create unique index idx_customer_name_2 on customer_name(first_name, last_name);
create unique index idx_customer_phone_2 on customer_phone(customer_id, phone_number);
The database will then do the checking and you don't have to worry about duplicates -- although you should check for errors on the insert.
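A minimal sketch of "let the database check, then handle the error", using SQLite from Python as a stand-in for whatever database the jobs actually use. The helper insert_customer and the sample data are hypothetical; only the first index from the answer is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_name ("
    "id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)"
)
conn.execute(
    "CREATE UNIQUE INDEX idx_customer_name_2 "
    "ON customer_name (first_name, last_name)"
)


def insert_customer(conn, first_name, last_name):
    """Hypothetical helper: try the insert and report whether it succeeded."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO customer_name (first_name, last_name) VALUES (?, ?)",
                (first_name, last_name),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique index rejected a duplicate; no application-side check needed.
        return False


first = insert_customer(conn, "Ada", "Lovelace")
second = insert_customer(conn, "Ada", "Lovelace")
count = conn.execute("SELECT COUNT(*) FROM customer_name").fetchone()[0]
print(first, second, count)
```

Because the uniqueness rule lives in the index, a concurrent job that tries to insert the same combination gets an error instead of silently creating a duplicate.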

How to construct an SQLite table that assigns and returns IDs for any name?

I would like to have an sqlite table that maps names into unique IDs. I can create this table in the following way:
CREATE TABLE name_to_id (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)
With a select statement I can get the row containing a needed name and get from this row the corresponding ID.
The problem appears if I try to get ID for a name that is not yet in the table. The expected behavior in this case is that the new name will be added and its newly generated ID will be returned. I have two possible solutions/implementations of that.
The first solution is trivial:
We check if name is in the table.
If not we insert a row with the name.
We select the row with the name and read the needed ID from that row.
I do not like this solution because the following can happen: the first process checks whether the name is in the table and sees that it is not there; meanwhile another process adds the name to the table; then the first process tries to add the same name.
The second solution seems to be better:
For any name we use insert if not exist.
We select from the table the row containing the name and get its ID.
Is the second solution optimal, or are there better solutions?
The normal way to avoid duplicate entries in a table is to create a unique constraint. The database will then check for you whether the record is already there and fail if so. That should be the best option in terms of reliability and performance.
Next, the SQLite FAQ suggests using the function last_insert_rowid() to fetch the ID instead of running a second query. This is actually the very first question in the FAQ ;)
In pseudocode, the first solution looks like this:
cursor = db.execute("SELECT id FROM name_to_id WHERE name = ?", name)
if cursor.has_some_row:
    id = cursor["id"]
else:
    db.execute("INSERT INTO name_to_id(name) VALUES(?)", name)
    id = db.last_insert_rowid
and the second like this:
db.execute("INSERT OR IGNORE INTO name_to_id(name) VALUES(?)", name)
cursor = db.execute("SELECT id FROM name_to_id WHERE name = ?", name)
id = cursor["id"]
The first solution requires a transaction around both commands, but this would be a good idea for the second solution, too, to avoid the overhead of multiple implicit transactions.
The second solution requires a unique constraint on name, but this would be a good idea for the first solution, too, for correctness and to speed up the name lookups.
Both solutions use two SQL statements and have similar speed.
(The second searches for the row twice, but that data is cached.)
So there isn't anything obvious that makes one better than the other.
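For reference, the second solution can be written as runnable Python with the standard sqlite3 module. This is a sketch under the assumptions discussed above: name carries a UNIQUE constraint and both statements run in one transaction. The function name get_or_create_id is made up for illustration.

```python
import sqlite3


def get_or_create_id(conn, name):
    """Return the id for name, inserting the name first if it is missing."""
    with conn:  # one transaction around both statements
        # Relies on the UNIQUE constraint: an existing name is silently skipped.
        conn.execute("INSERT OR IGNORE INTO name_to_id (name) VALUES (?)", (name,))
        row = conn.execute(
            "SELECT id FROM name_to_id WHERE name = ?", (name,)
        ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE name_to_id ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT UNIQUE)"
)

a = get_or_create_id(conn, "alice")
b = get_or_create_id(conn, "bob")
a_again = get_or_create_id(conn, "alice")
total = conn.execute("SELECT COUNT(*) FROM name_to_id").fetchone()[0]
print(a, b, a_again, total)
```

Calling the function twice with the same name returns the same id and leaves a single row for that name.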

How does Oracle perform read operation?

Suppose we have a table which holds information about person. Columns like NAME or SURNAME are small (I mean their size isn't very large), but columns that hold a photo or maybe a person's video (blob columns) may be very large. So when we perform a select operation:
select * from person
it will retrieve all this information. But in most cases we need only retrieve name or surname of person, so we perform this query:
select name, surname from person
Question: will Oracle read the whole record (including the blob columns) and then simply filter out name and surname columns, or will it only read name and surname columns?
Also, suppose we create a separate table for such large data (a person's photo and video), keep a foreign key to that table in the person table, and want to retrieve only the photo with this query:
select photo
from person p
join largePesonData d on p.largeDataID = d.largeDataID
where p.id = 1
Will Oracle read a whole record in person table and whole record in largePesonData or will it simply read the column with photo in largePesonData?
Oracle reads the data in blocks.
Let's assume that your block size is 8192 bytes and your average row size is 100 bytes. That would mean each block holds about 8192/100 = 81 rows (this is not exact, since the block header adds some overhead, but I'm trying to keep things simple).
So when you run
select name, surname from person;
you actually retrieve at least one block with all of its data (81 rows), and only after that is the data filtered so that you get back just the columns you requested.
Two exceptions to this are:
BLOB columns - "select name, surname from person" will not retrieve the BLOB contents themselves, because a BLOB column contains a reference to the actual BLOB (which is stored somewhere else in the tablespace or even in another tablespace).
Indexed columns - if you created an index on the table using the columns name and surname, it is possible that Oracle will scan only that index and retrieve only those two columns.
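The block arithmetic from this answer can be written out as a tiny calculation. This is only a back-of-the-envelope sketch: the function rows_per_block and its header_overhead parameter are hypothetical, and the 100-byte overhead used in the second call is an arbitrary illustration, not an Oracle figure.

```python
def rows_per_block(block_size, avg_row_size, header_overhead=0):
    """Estimate how many whole rows fit into one block."""
    usable = block_size - header_overhead
    return usable // avg_row_size


# Figures from the answer: 8192-byte block, 100-byte average row.
simple_estimate = rows_per_block(8192, 100)

# Same estimate with an arbitrary, made-up 100 bytes reserved for the header.
with_overhead = rows_per_block(8192, 100, header_overhead=100)

print(simple_estimate, with_overhead)
```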