I have been trying to use the ALTER TABLE command with HSQLDB to add two columns to a table, but with no luck. I know MySQL and other systems support adding several columns in one statement, so why doesn't HSQLDB? Perhaps I'm using the wrong syntax, I don't know. I also know that I could add the columns one by one, but my application requires the addition of 1000 columns and adding them one at a time is too slow.
The reason that I'm using HSQLDB is that I need to use it in file mode. I have also tried SmallSQL, but it is much slower than HSQLDB.
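What does work for me, adding one column at a time, looks like this (the table and column names are just placeholders for illustration):
ALTER TABLE survey_response ADD COLUMN q_0001 VARCHAR(1000);
ALTER TABLE survey_response ADD COLUMN q_0002 VARCHAR(1000);
-- ...and so on, one ALTER TABLE statement per new column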
You don't need thousands of columns for this. This is a standard one-to-many relationship between three tables: questionaire, question and answer:
create table questionaire
(
id integer not null primary key,
customer_name varchar(100) not null
);
create table question
(
id integer not null primary key,
questionaire_id integer not null references questionaire,
question_text varchar(20000),
sort_order integer
);
create table answer
(
question_id integer not null references question,
answer_text varchar(20000),
user_name varchar(50) not null,
primary key (question_id, user_name)
);
In reality you wouldn't actually store the user's name in the answer table. If you have named users that log in, you probably also need a user_account table, and the answer table would reference the user_account table.
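A rough sketch of that variant (the user_account table and its column names are assumptions; the queries below stick with the simpler schema above):
create table user_account
(
id integer not null primary key,
user_name varchar(50) not null unique
);
create table answer
(
question_id integer not null references question,
user_account_id integer not null references user_account,
answer_text varchar(20000),
primary key (question_id, user_account_id)
);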
You can easily query this model without needing to resort to a key/value store or JSON.
To get all questionaires from a customer
select *
from questionaire qu
where customer_name = 'Some company';
Get all questionaires and the number of questions per customer
select qu.customer_name,
count(distinct qu.id) as num_questionaires,
count(q.id) as total_questions
from questionaire qu
join question q on qu.id = q.questionaire_id
group by qu.customer_name;
Get all answers for a questionaire from a specific user
select q.question_text, a.answer_text
from answer a
join question q on q.id = a.question_id
join questionaire qu on qu.id = q.questionaire_id
where qu.id = 1
and a.user_name = 'Marvin'
order by q.sort_order;
A bit more complicated, but probably still fast enough even with thousands of questions and answers: find users that haven't answered all questions
select aq.user_name, aq.questionaire_id, aq.answered_questions, tq.num_questions
from (
select a.user_name, q.questionaire_id, count(*) as answered_questions
from answer a
join question q on q.id = a.question_id
group by a.user_name, q.questionaire_id
) aq join (
select questionaire_id, count(*) as num_questions
from question
group by questionaire_id
) tq on tq.questionaire_id = aq.questionaire_id
where aq.answered_questions < tq.num_questions;
SQLFiddle example: http://sqlfiddle.com/#!15/0a4e5/1
You also shouldn't try to transpose the rows for each question (or answer) into columns in SQL - you will eventually hit the limit on the number of columns the database can manage. Relational databases were designed to handle rows, lots of rows - not "thousands of columns". Transposing rows to columns is typically done in the presentation layer of your application (or e.g. with a pivot function in a spreadsheet).
I am trying to create a simple chat application database schema, and query the conversations. My current table setup is the following:
CREATE TABLE chat_user (
id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
display_name VARCHAR(140),
... other user stuff ...
);
CREATE TABLE conversation (
id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
title VARCHAR(140),
created timestamp with time zone NOT NULL
);
CREATE TABLE conversation_message (
id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
conversation_id bigint NOT NULL,
sender_id bigint NOT NULL,
body TEXT NOT NULL,
created timestamp with time zone NOT NULL
);
CREATE TABLE conversation_participant (
id bigint GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY,
conversation_id bigint NOT NULL,
user_id bigint NOT NULL
);
So basically each conversation has its own title and multiple participants. What I would like is to fetch my conversations sorted by the date of the latest message in the conversation (so the conversations with the newest messages are shown first). The result set should contain the id and title of the conversation and the list of participants, plus the id, sender_id and body of the latest message.
It would also be necessary to fetch the conversations paginated by the conversation's creation date (20 per page).
Is my table setup efficient enough to satisfy the above constraints? It seems to me that this could result in a rather large query with multiple subqueries.
This answers the original version of the question.
You seem to want a join and aggregation:
select cm.conversation_id, max(created)
from conversation_message cm join
conversation_participant cp
on cm.conversation_id = cp.conversation_id
where cp.user_id = ?
group by cm.conversation_id
order by max(created) desc;
You can try using a lateral join.
Your query would look something like this.
You can fetch all of the data you need, apply limits and offsets, and retrieve the last message for each conversation. Hope it helps.
select * from conversation c
left join lateral (
select * from conversation_message cm
where cm.conversation_id=c.id
order by created desc
limit 1
) cm on true
left join conversation_participant cp on cp.conversation_id = c.id and cp.user_id = cm.sender_id;
The left joins here keep the chat rooms that have no messages yet.
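If you also need the pagination from the question (20 conversations per page, by conversation creation date), one way is to order and limit the outer query. A trimmed sketch of the query above (the participant join is left out for brevity):
select *
from conversation c
left join lateral (
select * from conversation_message cm
where cm.conversation_id = c.id
order by created desc
limit 1
) cm on true
order by c.created desc
limit 20 offset 0; -- offset 20 for the second page, 40 for the third, and so on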
In short: I think you have a reasonable design for a normalized (3NF) OLTP database. That's what you should aim for, not the number of JOINs for a specific use case. The design you have will satisfy the use case you defined and many other use cases, which I'm sure are involved in this application of yours.
In detail:
You're designing for an OLTP system, where the data is kept normalized to ensure data consistency and to keep OLTP transactions efficient.
This, however, means you will have to do many more JOINs than in a de-normalized database (which is better suited to OLAP, reporting and analytics systems). This is just the nature of OLTP relational databases.
Trying to reduce the number of JOINs in a normalized database (i.e. 3NF, third normal form) means combining data of different granularity into the same table, which causes duplication, makes updates harder and slower, and eventually leads to data inconsistency.
So you really shouldn't design with the aim of reducing the number of JOINs. Instead, make sure you have a normalized design and avoid over-normalizing. In cases where you want to avoid writing long queries, you can add views and query those instead (though that can sometimes cause sub-optimal query performance by bringing in unnecessary joins).
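For example, a view along these lines (just a sketch, and the view name is made up) can hide the conversation-to-message join so later queries stay short:
create view conversation_message_view as
select c.id as conversation_id,
c.title,
cm.id as message_id,
cm.sender_id,
cm.body,
cm.created as message_created
from conversation c
left join conversation_message cm on cm.conversation_id = c.id;
-- queries can then select from the view instead of repeating the join
select * from conversation_message_view where conversation_id = 1;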
To get the latest message for each conversation there are several ways to achieve it, such as self joins or window functions (row_number(), rank(), etc.). Using a window function, you can write your query as:
with cm as (
select *,
rank() over (partition by conversation_id order by created desc) as r
from conversation_message
)
select c.id,
c.title,
cm.body,
cm.created,
cm.r,
cu.display_name
from conversation as c
left join cm on c.id = cm.conversation_id and cm.r <= 1
left join chat_user cu on cu.id = cm.sender_id
In the above query I have used left joins to include conversations with no messages. If you only need conversations that have messages, use inner joins. If you need more than one latest message per conversation, raise the bound in cm.r <= #no.
To get the participant list for each conversation you can add a new CTE like:
with cm as (
select *,
rank() over (partition by conversation_id order by created desc) as r
from conversation_message
),
message_participants as (
select
m.conversation_id,
array_agg(u.display_name order by m.created desc) as participants
from chat_user as u
join conversation_message as m on u.id = m.sender_id
group by m.conversation_id
)
select c.id,
c.title,
cm.body,
cm.created,
cm.r,
cu.display_name,
cmp.participants
from conversation c
left join cm on c.id = cm.conversation_id and cm.r <= 1
left join chat_user cu on cu.id = cm.sender_id
left join message_participants cmp on c.id = cmp.conversation_id
Improvements
Add a user_id column to the conversation table to identify who created the conversation (a sketch follows below).
The conversation_participant table is redundant, since you can extract the list of participants from conversation_message.
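A minimal sketch of the first suggestion, in PostgreSQL-style syntax (the column name created_by is an assumption):
alter table conversation add column created_by bigint references chat_user(id);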
I am interested in designing a SQL-based (SQLite, actually) storage for an application processing a large number of similar data entries. For this example, let it be a chat messages storage.
The application has to provide the capabilities of filtering and analyzing the data by message participants, tags, etc., all of those implying N-to-N relationships.
So the schema (a kind of star schema) will look something like:
create table messages (
message_id INTEGER PRIMARY KEY,
time_stamp INTEGER NOT NULL
-- other fact fields
);
create table users (
user_id INTEGER PRIMARY KEY
-- user dimension data
);
create table message_participants (
user_id INTEGER references users(user_id),
message_id INTEGER references messages(message_id)
);
create table tags (
tag_id INTEGER PRIMARY KEY,
tag_name TEXT NOT NULL
-- tag dimension data
);
create table message_tags (
tag_id INTEGER references tags(tag_id),
message_id INTEGER references messages(message_id)
);
-- etc.
So, all well and good, until I have to perform analytic operations and filtering based on the N-to-N dimensions. Given millions of rows in the messages table and thousands in the dimension tables (there are more than shown in the example), all the joins are simply too much of a performance hit.
For example, I would like to analyze the number of messages each user participated in, given the data is filtered based on selected tags, selected users and other aspects:
select U.user_id, U.user_name, count(1)
from messages as M
join message_participants as MP on M.message_id = MP.message_id
join users as U on MP.user_id = U.user_id
where
MP.user_id not in ( /* some user ID's set */ )
and M.time_stamp between #StartTime and #EndTime
-- and ... more fact table fields filtering
and M.message_id in
(select message_id
from message_tags
where tag_id in ( /* some tag ID's set */ ))
-- and ... more N-to-N filtering
group by U.user_id
I am constrained to SQL and, specifically, SQLite. And I do use indices on the tables.
Is there some way I don't see to improve the schema, maybe a clever way to de-normalize it?
Or maybe there is a way to somehow index the dimension keys inside the message row (I thought about using FTS capabilities, but I'm not sure whether searching the textual index and joining on the results would provide any performance leverage)?
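For reference, the kind of indices I mean on the junction and fact tables (the names and column order are just illustrative):
create index idx_message_tags_tag on message_tags (tag_id, message_id);
create index idx_message_participants_msg on message_participants (message_id, user_id);
create index idx_messages_time on messages (time_stamp);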
Too long to put in a comment, and might help with performance but isn't exactly a direct answer to your question (your schema seems fine): have you tried messing with your query itself?
I often see that kind of subselect filter for many-to-many, and I have found that on large queries like this I frequently see improvements in performance from running a CTE/join rather than a "where ... in (subselect)":
;with tagMessages as (
select distinct message_id
from message_tags
where tag_id in ( /* some tag ID's set */ )
) -- more N-to-N filtering
select U.user_id, U.user_name, count(1)
from messages as M
join message_participants as MP on M.message_id = MP.message_id
join users as U on MP.user_id = U.user_id
join tagMessages on M.message_id = tagMessages.message_id
where
MP.user_id not in ( /* some user ID's set */ )
and M.time_stamp between #StartTime and #EndTime
-- and ... more fact table fields filtering
group by U.user_id
We can tell they're the same, but the query planner can sometimes find this form more helpful.
Disclaimer: I don't do SQLite, I do SQL Server, so sorry if I've made some obvious (or otherwise) error.
Above is my schema. What you can't see in tblPatientVisits is the foreign key from tblPatient, which is patientid.
tblPatient contains a distinct copy of each patient in the dataset as well as their gender. tblPatientVisits contains their demographic information: where they lived at the time of admission and which hospital they went to. I chose to put that information into a separate table because it changes throughout the data (a person can move between visits and go to a different hospital).
I don't get any strange numbers from my queries until I add tblPatientVisits. There are just under one million claims in tblClaims, but when I add tblPatientVisits so I can check where each person was from, it returns over a million rows. I think this is because the same patientID shows up more than once in tblPatientVisits (since patients have different admission/discharge dates).
For the life of me I can't see where this design is incorrect, nor do I know how to rectify it beyond doing one query with count(tblPatientVisits.patientID) = 1 and then a union with count(tblPatientVisits.patientID) > 1.
Any insight into this type of design, or how I might more elegantly get the claimType from tblClaims to give me the correct number of rows when I associate a claim ID with a patientID?
EDIT: The biggest problem I'm having is that if I include the admissionDate, dischargeDate or the patientState in the tblPatient table, I can't use the patientID as a primary key.
It should be noted that tblClaims is NOT necessarily related to tblPatientVisits.admissionDate or tblPatientVisits.dischargeDate.
EDIT: sample queries to show that when tblPatientVisits is added, more rows are returned than claims
SELECT tblclaims.id, tblClaims.claimType
FROM tblClaims INNER JOIN
tblPatientClaims ON tblClaims.id = tblPatientClaims.id INNER JOIN
tblPatient ON tblPatientClaims.patientid = tblPatient.patientID INNER JOIN
tblPatientVisits ON tblPatient.patientID = tblPatientVisits.patientID
more than one million query rows returned
SELECT tblClaims.id, tblPatient.patientID
FROM tblClaims INNER JOIN
tblPatientClaims ON tblClaims.id = tblPatientClaims.id INNER JOIN
tblPatient ON tblPatientClaims.patientid = tblPatient.patientID
less than one million query rows returned
I think this is crying for a better design. I really think that a visit should be associated with a claim, and that a claim can only be associated with a single patient, so I think the design should be (and eliminating the needless tbl prefix, which is just clutter):
CREATE TABLE dbo.Patients
(
PatientID INT PRIMARY KEY
-- , ... other columns ...
);
CREATE TABLE dbo.Claims
(
ClaimID INT PRIMARY KEY,
PatientID INT NOT NULL FOREIGN KEY
REFERENCES dbo.Patients(PatientID)
-- , ... other columns ...
);
CREATE TABLE dbo.PatientVisits
(
PatientID INT NOT NULL FOREIGN KEY
REFERENCES dbo.Patients(PatientID),
ClaimID INT NULL FOREIGN KEY
REFERENCES dbo.Claims(ClaimID),
VisitDate DATE
, -- ... other columns ...
, PRIMARY KEY (PatientID, ClaimID, VisitDate) -- not convinced on this one
);
There is some redundant information here, but it's not clear from your model whether a patient can have a visit that is not associated with a specific claim, or even whether you know that a visit belongs to a specific claim (this seems like crucial information given the type of query you're after).
In any case, given your current model, one query you might try is:
SELECT c.id, c.claimType
FROM dbo.tblClaims AS c
INNER JOIN dbo.tblPatientClaims AS pc
ON c.id = pc.id
INNER JOIN dbo.tblPatient AS p
ON pc.patientid = p.patientID
-- where exists tells SQL server you don't care how many
-- visits took place, as long as there was at least one:
WHERE EXISTS (SELECT 1 FROM dbo.tblPatientVisits AS pv
WHERE pv.patientID = p.patientID);
This will still return one row for every patient / claim combination, but it will no longer multiply those rows by the number of visits. Again, it really feels like the design isn't right here. You should also get in the habit of using table aliases - they make your query much easier to read, especially if you insist on the messy tbl prefix. And you should always use the dbo (or whatever schema you use) prefix when creating and referencing objects.
I'm not sure I understand the concept of a claim but I suspect you want to remove the link table between claims and patient and instead make the association between patient visit and a claim.
Would that work out better for you?
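A rough sketch of what that association could look like (the column and constraint names are made up):
ALTER TABLE tblPatientVisits ADD claimID INT NULL;
ALTER TABLE tblPatientVisits ADD CONSTRAINT FK_PatientVisits_Claims
FOREIGN KEY (claimID) REFERENCES tblClaims(id);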
My table contains category_name and parent_category_Id columns.
The parent_category_Id data contains the same table's primary key id.
I need to select all rows, and instead of the parent_category_Id I need to select the category_name.
I think this is what you're after, though it's hard to discern from the question:
Select c.*
From category c
Join parent_category pc ON c.parent_category_id = pc.id
Where pc.category_name = 'Some Name'
Try something like:
SELECT c.category_name, p.category_name
FROM categories c LEFT JOIN parent_categories p
ON c.parent_id = p.id
PS: you may think about restructuring your database, it would make more sense to store all the categories in the same table. See for instance: http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/
You should restructure your database to have one table with id, name, and parent columns, with the parent column referencing the same table's id column. Your current database is not normalized and will likely cause issues in the future.
At a minimum you should have an auto_increment id column in the categories table.
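A minimal sketch of that single-table layout, together with the self-join that returns each category alongside its parent's name (MySQL-style syntax; names are illustrative):
CREATE TABLE categories (
id INT AUTO_INCREMENT PRIMARY KEY,
name VARCHAR(100) NOT NULL,
parent INT NULL,
FOREIGN KEY (parent) REFERENCES categories(id)
);
SELECT c.name AS category_name, p.name AS parent_name
FROM categories c
LEFT JOIN categories p ON p.id = c.parent;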
The other answers here are correct (depending on which SQL database you are using).
Requirements: I have a table of several thousand questions. Users can view these multiple choice questions and then answer them. Once a question is answered, it should not be shown to the same user again even if he logs in after a while.
Question
How would I go about doing this efficiently? Would Bloom Filters work?
Create a QuestionsAnswered table and join on it in your select. When the user answers a question, insert the question ID and the user ID into the table.
CREATE TABLE QuestionsAnswered (UserID INT, QuestionID INT)
SELECT *
FROM Question
WHERE ID NOT IN (SELECT QuestionID
FROM QuestionsAnswered
WHERE UserID = #UserID)
INSERT INTO QuestionsAnswered
(UserID, QuestionID)
VALUES
(#UserID, #QuestionID)
Could you add something to the user's info in the database that contains a list of answered questions?
Then, when that user comes back, you could show them only the questions that are NOT yet answered.
Create a many-to-many table between users and questions (userQuestions) to store the questions that have been answered already. Then you'd only display questions that don't exist in that userQuestions table for that user.
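A minimal sketch of that idea (the table and column names are assumptions, and <current_user> is a placeholder as in the other answers):
CREATE TABLE userQuestions (
user_id INT NOT NULL,
question_id INT NOT NULL,
PRIMARY KEY (user_id, question_id)
);
SELECT q.*
FROM questions AS q
WHERE NOT EXISTS (SELECT 1
FROM userQuestions AS uq
WHERE uq.user_id = <current_user>
AND uq.question_id = q.question_id);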
You insert each question shown into a log table with question_id/user_id, then show him the ones that don't match:
SELECT [TOP 1] ...
FROM questions
WHERE question_id NOT IN (
SELECT question_id
FROM question_user_log
WHERE user_id = <current_user>)
[ORDER BY ...]
or
SELECT [TOP 1] ...
FROM questions AS q
LEFT OUTER JOIN question_user_log AS l ON q.question_id = l.question_id
AND l.user_id = <current_user>
WHERE l.question_id IS NULL
[ORDER BY...]
after you show the question, you
INSERT INTO question_user_log (question_id, user_id)
VALUES (<question shown>, <current_user>);
BTW, if you cannot create a table to track the questions shown, then you can query the questions in a deterministic order (i.e. by Id or by Title) and each time select the one ranked just above the last rank shown (using ROW_NUMBER() in SQL Server/Oracle/DB2, or LIMIT in MySQL). You'd track the last rank shown somewhere in your user state (you do have user state, otherwise the whole question is pointless). A sketch of this follows below.
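A rough sketch of that approach with ROW_NUMBER() (here @last_rank_shown stands for the value kept in user state):
SELECT TOP 1 ranked.*
FROM (SELECT q.*,
ROW_NUMBER() OVER (ORDER BY q.question_id) AS rank_no
FROM questions AS q) AS ranked
WHERE ranked.rank_no > @last_rank_shown
ORDER BY ranked.rank_no;
-- in MySQL the equivalent is ORDER BY question_id LIMIT <last rank shown>, 1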