How to check data integrity within an SQL table? - sql

I have a table for logging access data for a lab. The table structure is like this:
create table accesslog
(
userid int not null,
direction int not null,
accesstime datetime not null
);
This lab has only one gate, which is under access control, so users must first "enter" the lab before they can "leave". In my original design I set the "direction" field as a flag that is either 1 (for entering the lab) or -1 (for leaving the lab), so that I can use queries like:
SELECT SUM(direction) FROM accesslog;
to get the total user count currently within the lab. Theoretically it worked, since "direction" will always follow the pattern 1 => -1 => 1 => -1 for any given userid.
But soon I found that log messages could be lost on the transmission path from the lab gate to the server, dropped either by a busy network or by hardware glitches. Of course I can harden the transmission path with sequence numbers, ACKs, retransmission, hardware redundancy, etc., but in the end I might still get something like this:
userid direction accesstime
-------------------------------------
1 1 2013/01/03 08:30
1 -1 2013/01/03 09:20
1 1 2013/01/03 10:10
1 -1 2013/01/03 10:50
1 -1 2013/01/03 13:40
1 1 2013/01/03 18:00
This is a recent log for user "1". It's clear that I lost one log message for that user entering the lab between 10:50 and 13:40. At the time I query this data he is still in the lab, so there is no exit log after 2013/01/03 18:00 yet; that much is expected.
My question is: is there any way to "find" this kind of data inconsistency with an SQL command? There are 5,000 users in total in my system and the lab operates 24 hours a day, so there is no "magic time" at which the lab is guaranteed to be empty. It would be horrible if I had to write code checking the continuity of the "direction" field line by line, user by user.
I know it's not possible to "fix" the log with the correct data. I just want to know "Oh, I have a data inconsistency issue for userid=1" so that I can add a marked amending record to correct the final statistics.
Any advice would be appreciated, even changing the table structure would be OK.
Thanks.
Edit: Sorry, I didn't mention the details.
Currently I'm using a mixed SQL solution. The table shown above is in MySQL, and it contains only the logs from the last 24 hours, serving as the "real time" status for fast browsing.
Every day at 03:00 AM a pre-scheduled process written in C++ on POSIX is launched. This process calculates the statistics, adds the daily statistics to an Oracle DB via a proprietary-protocol TCP socket, and then removes the old data from MySQL.
The Oracle part is not handled by me and I can do nothing about it. I just want to make sure that the final statistics of each day is correct.
The data size is about 200,000 records per day -- I know it sounds crazy, but it's true.

You didn't state your DBMS, so this is ANSI SQL (which works on most modern DBMS).
select userid,
direction,
accesstime,
case
when lag(direction) over (partition by userid order by accesstime) = direction then 'wrong'
else 'correct'
end as status
from accesslog
where userid = 1
For each row in accesslog you'll get a column "status" that indicates whether or not the row "breaks" the rule.
You can filter out those that are invalid using:
select *
from (
select userid,
direction,
accesstime,
case
when lag(direction) over (partition by userid order by accesstime) = direction then 'wrong'
else 'correct'
end as status
from accesslog
where userid = 1
) t
where status = 'wrong'
I don't think there is a way to enforce this kind of rule using constraints in the database (although I have a feeling that PostgreSQL's exclusion constraints could help here).

Why not use SUM() with a WHERE clause to filter by user?
If you get anything other than 0 or 1 then you surely have a problem.
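A minimal sketch of that idea, here grouped over all users at once rather than filtered one user at a time (the HAVING clause is my addition):
SELECT userid, SUM(direction) AS balance
FROM accesslog
GROUP BY userid
HAVING SUM(direction) NOT IN (0, 1);
-- any row returned here is a user whose log is definitely inconsistent
Note that this only flags users whose lost messages leave the sum unbalanced; a lost enter/leave pair would still sum to 0 or 1.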

OK, I figured it out. Thanks for the idea provided by a_horse_with_no_name.
My final solution is this query:
SELECT userid, COUNT(*), SUM(direction * rule) FROM (
SELECT userid, direction, @inout := @inout * -1 AS rule
FROM accesslog l, (SELECT @inout := -1) r
ORDER BY userid, accesstime
) g GROUP BY userid;
First I created a pattern with @inout that yields 1 => -1 => 1 => -1 for each row in the "rule" column. Then I compare the direction field with the rule column by taking their product.
It's OK even if certain users have an odd number of records, since each user is supposed to follow either the identical or the reversed pattern of "rule". So the sum of the products should equal either COUNT(*) or -1 * COUNT(*).
By comparing SUM() and COUNT(), I can know exactly which userid has gone wrong.
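To list only the users that need amending, the same query can be wrapped with a HAVING clause (a sketch building on the query above; the HAVING condition is my addition):
SELECT userid, COUNT(*) AS cnt, SUM(direction * rule) AS prod
FROM (
    SELECT userid, direction, @inout := @inout * -1 AS rule
    FROM accesslog l, (SELECT @inout := -1) r
    ORDER BY userid, accesstime
) g
GROUP BY userid
HAVING ABS(SUM(direction * rule)) <> COUNT(*);
-- returned rows are the userids whose direction sequence broke the alternating pattern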

Related

How to improve the efficiency of the query below in SQL Server?

I have a database on the order of ten million rows. The client needs to read data and perform calculations.
Due to the large amount of data, if it is saved in the application cache, memory will overflow and a crash will occur.
If I use a SELECT statement to query data from the database in real time, it may take too long and the database may be hit too frequently.
Is there a better way to read the data? I use C++ and C# to access the SQL Server database.
My database statement is similar to the following:
SELECT TOP 10 y.SourceName, MAX(y.EndTimeStamp - y.StartTimeStamp) AS ProcessTimeStamp
FROM
(
SELECT x.SourceName, x.StartTimeStamp, IIF(x.EndTimeStamp IS NOT NULL, x.EndTimeStamp, 134165256277210658) AS EndTimeStamp
FROM
(
SELECT
SourceName,
Active,
LEAD(Active) OVER(PARTITION BY SourceName ORDER BY TicksTimeStamp) NextActive,
TicksTimeStamp AS StartTimeStamp,
LEAD(TicksTimeStamp) OVER(PARTITION BY SourceName ORDER BY TicksTimeStamp) EndTimeStamp
FROM Table1
WHERE Path = N'App1' and TicksTimeStamp >= 132165256277210658 and TicksTimeStamp < 134165256277210658
) x
WHERE (x.Active = 1 and x.NextActive = 0) OR (x.Active = 1 and x.NextActive = null)
) y
GROUP BY y.SourceName
ORDER BY ProcessTimeStamp DESC, y.SourceName
The database structure is roughly as follows:
ID Path SourceName TicksTimeStamp Active
1 App1 Pipe1 132165256277210658 1
2 App1 Pipe1 132165256297210658 0
3 App1 Pipe1 132165956277210658 1
4 App2 Pipe2 132165956277210658 1
5 App2 Pipe2 132165956277210658 0
I use ExecuteReader in C#. The same SQL statement runs in SQL Server Management Studio in 4 s, but through ExecuteReader it takes 8-9 s. Does the slowness have anything to do with this interface?
I don't really 'get' the entire query but I'm wondering about this part:
WHERE (x.Active = 1 and x.NextActive = 0) OR (x.Active = 1 and x.NextActive = null)
SQL doesn't really like ORs, so why not convert this to:
WHERE x.Active = 1 and ISNULL(x.NextActive, 0) = 0
This might cause a completely different query plan. (or not)
As @Charlieface mentioned, it's probably best to share the query plan so we can get an idea of what's going on.
PS: I'm also not sure what those 'TicksTimeStamp' values represent, but it looks like you're fetching a pretty wide range there; bigger volumes will also mean longer processing time. Even though you only return the top 10, it still has to go through the entire range to calculate those durations.
I agree with @Charlieface. I think the index you want is as follows:
CREATE INDEX idx ON Table1 (Path, TicksTimeStamp) INCLUDE (SourceName, Active);
You can add both indexes (with different names of course) and see which one the execution engine chooses.
I can suggest adding the following index which should help the inner query using LEAD:
CREATE INDEX idx ON Table1 (SourceName, TicksTimeStamp, Path) INCLUDE (Active);
The key point of the above index is that it should allow the lead values to be rapidly computed. It also has an INCLUDE clause for Active, to cover the entire select.

SQL question with attempt on customer information

Schema
Question: List all paying customers with users who had 4 or 5 activities during the week of February 15, 2021; also include how many of the activities sent were paid, organic and/or app store. (i.e. include a column for each of the three source types).
My attempt so far:
SELECT source_type, COUNT(*)
FROM activities
WHERE activity_time BETWEEN '02-15-21' AND '02-19-21'
GROUP BY source_type
I would like to get a second opinion on it. I didn't include the accounts table because I don't believe that I need it for this query, but I could be wrong.
Have you tried to run this? It doesn't satisfy the brief on FOUR counts:
List all the ... customers (that match criteria)
There is no customer information included in the results at all, so this is an outright fail.
paying customers
This is the top-level criterion; only customers that are not free should be included in the results.
Criteria: users who had 4 or 5 activities
There has been no attempt to evaluate this user criteria in the query, and the results do not provide enough information to deduce it.
There is further ambiguity in this requirement: does it mean that the account must have individual users who each had 4 or 5 activities, or simply that the account should have 4 or 5 activities overall?
If this is a test question (it is clearly contrived; if it is not, please ask for help on how to design a better schema), then the use of the term User is usually very specific and would suggest that you need to group by or otherwise make specific use of this facet in your query.
Bonus: (i.e. include a column for each of the three source types).
This is the only element that was attempted, as the data is grouped by source_type but the information cannot be correlated back to any specific user or customer.
Next time please include example data and the expected outcome with your post. In preparing the data for this post you would have come across these issues yourself and may have been inspired to ask a different question, or through the process of writing the post up you may have resolved the issue yourself.
Without further clarification we can still start to evolve this query. A good place to start is to set aside the criteria and focus on the format of the output. The requirement mentions the following output requirements:
List Customers
Include a column for each of the source types.
Firstly, the request clearly states that Customer is an important facet in the output, and in your schema account holds the customer information; so even though we could technically do without it, including information from the account table makes the data readable by humans.
This is a standard PIVOT-style response: we want a row for each customer, presenting a count that aggregates each of the values of source_type. Most RDBMSes support some variant of a PIVOT operator or function, but we can achieve the same thing with simple CASE expressions that conditionally put a value into projected columns matching the values we want to aggregate, and then use GROUP BY to evaluate the aggregation, in this case a COUNT.
The following syntax is for MS SQL Server; however, you can achieve something similar easily enough in other RDBMSes.
OP please tag this question with your preferred database engine...
NOTE: there is NO filtering in this query... yet
SELECT accounts.company_id
, accounts.company_name
, paid = COUNT(st_paid)
, organic = COUNT(st_organic)
, app_store = COUNT(st_app_store)
FROM activities
INNER JOIN accounts ON activities.company_id = accounts.company_id
-- PIVOT the source_type
CROSS APPLY (SELECT st_paid = CASE source_type WHEN 'paid' THEN 1 END
,st_organic = CASE source_type WHEN 'organic' THEN 1 END
,st_app_store = CASE source_type WHEN 'app store' THEN 1 END
) as PVT
GROUP BY accounts.company_id, accounts.company_name
This results in the following shape of result:
company_id  company_name  paid  organic  app_store
----------  ------------  ----  -------  ---------
apl01       apples        4     8        0
ora01       oranges       6     12       0
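For comparison, SQL Server's native PIVOT operator can produce the same shape. This is only a sketch, and it assumes activities has an id primary-key column to count (not confirmed by the schema in the question):
SELECT p.company_id, p.company_name,
       [paid] AS paid, [organic] AS organic, [app store] AS app_store
FROM (
    SELECT acc.company_id, acc.company_name, act.source_type, act.id
    FROM activities act
    INNER JOIN accounts acc ON act.company_id = acc.company_id
) AS src
PIVOT (COUNT(id) FOR source_type IN ([paid], [organic], [app store])) AS p;
-- the columns left in src (company_id, company_name) become the implicit GROUP BY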
Criteria
When you are happy with the shape of the results and that all the relevant information is available, it is time to apply the criteria to filter the data.
From the requirement, the following criteria can be identified:
paying customers
The spec doesn't mention paying specifically, but it does include a note that free customers have current_mrr = 0.
Now aren't we glad we did join on the account table :)
users who had 4 or 5 activities
This is very specific about explicitly 4 or 5 activities, no more, no less.
For the sake of simplicity, let's assume that the user facet of this requirement is not important and that it is simply a reference to all users on an account, not just users who have individually logged 4 or 5 activities on their own - proving that out would require more demo data than I care to manufacture right now.
during the week of February 15, 2021.
This one was correctly identified in the original post, but we need to call it out just the same.
The OP has used Monday to Friday of that week; there is no mention that weeks start on a Monday or end on a Friday, but we'll go along with it, since it's only the syntax we need to explore today.
In the real world the actual values specified in the criteria should be parameterised, mainly because you don't want to manually re-construct the entire query every time, but also to sanitise input and prevent SQL injection attacks.
Even though it seems overkill for this post, using parameters even in simple queries helps to identify the variable elements, so I will use parameters for the 2nd criteria to demonstrate the concept.
DECLARE @from DateTime = '2021-02-15' -- Date in ISO format
DECLARE @to DateTime = (SELECT DateAdd(d, 5, @from)) -- will match Friday: 2021-02-19
/* NOTE: requirement only mentioned the start date, not the end
so your code should also only rely on the single fixed start date */
SELECT accounts.company_id, accounts.company_name
, paid = COUNT(st_paid), organic = COUNT(st_organic), app_store = COUNT(st_app_store)
FROM activities
INNER JOIN accounts ON activities.company_id = accounts.company_id
-- PIVOT the source_type
CROSS APPLY (SELECT st_paid = CASE source_type WHEN 'paid' THEN 1 END
,st_organic = CASE source_type WHEN 'organic' THEN 1 END
,st_app_store = CASE source_type WHEN 'app store' THEN 1 END
) as PVT
WHERE -- paid accounts = exclude 'free' accounts
accounts.current_mrr > 0
-- Date range filter
AND activity_time BETWEEN @from AND @to
GROUP BY accounts.company_id, accounts.company_name
-- The fun bit, we use HAVING to apply a filter AFTER the grouping is evaluated
-- Wording was explicitly 4 OR 5, not BETWEEN so we use IN for that
HAVING COUNT(source_type) IN (4,5)
I believe you are missing some information there.
Without more information on the tables, I can only guess that you also have a customer table. I am going to assume there is a customer_id key that serves as the key between both tables.
I would take your query and do something like:
SELECT customer_id,
       COUNT(*) AS total,
       MAX(CASE WHEN source_type = 'app' THEN num_operations END) AS app_totals,
       MAX(CASE WHEN source_type = 'paid' THEN num_operations END) AS paid_totals,
       MAX(CASE WHEN source_type = 'organic' THEN num_operations END) AS organic_totals
FROM (
    SELECT customer_id, source_type, COUNT(*) AS num_operations
    FROM activities
    WHERE activity_time BETWEEN '02-15-21' AND '02-19-21'
    GROUP BY customer_id, source_type
) tb1
GROUP BY customer_id
This is the most generic case I can think of, but it does not scale very well. If you get new source types you need to modify the query, and the structure of the output table also changes. Depending on the SQL engine you are using (e.g. MySQL vs. Microsoft SQL Server) you could also use a pivot function.
The previous query is a little rough, but it will give you the general idea. You can add ELSE branches to the CASE expressions to zero the fields when they have no values, join with the customer table if you want only active customers, etc.
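For instance, the ELSE idea could look like the following rough sketch (conditional aggregation with explicit zeroes, keeping the hypothetical customer_id column from above):
SELECT customer_id,
       COUNT(*) AS total,
       SUM(CASE WHEN source_type = 'paid'    THEN 1 ELSE 0 END) AS paid_totals,
       SUM(CASE WHEN source_type = 'organic' THEN 1 ELSE 0 END) AS organic_totals,
       SUM(CASE WHEN source_type = 'app'     THEN 1 ELSE 0 END) AS app_totals
FROM activities
WHERE activity_time BETWEEN '02-15-21' AND '02-19-21'
GROUP BY customer_id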

SQL: get records based on user likes of other records

I'm trying to write a SQL Server query that will provide some results based on what other users like.
It is a bit like on Amazon when it says 'Users who bought this also bought...'
It is based on the vote field, where a vote of '1' means a user liked a record; or a vote of '0' means they disliked it.
So when a user is on a particular record, I want to list 3 other records that users who liked the current record also liked.
A snippet of the relevant table is provided below:
ID UserID Record ID Vote DateAdded
16 9999 12013011290 1 2008-11-11 13:23:44.000
17 8888 12013011290 0 2008-11-11 13:23:44.000
18 7777 12013011290 0 2008-11-11 13:23:44.000
20 4930 12013011290 1 2013-11-19 15:04:06.263
I think this requires ordering by a sub-select, but I'm not sure. Can anyone advise me on whether this is possible and, if so, how? Thanks.
p.s.
To maintain the quality of the results I think it would be extra useful to filter by DateAdded. That is,
- 'user x' is seeing recommended records about 'record z'
- 'user y' is someone who has liked 'record z' and 'record a'
- only count 'user y's' like of 'record a' IF they liked 'record a' an HOUR before or after they liked 'record z'
- in other words, only count the 'record a's' like if it was during the same website-browsing session as 'record z'
Hope this makes sense!
Something like this?
select r.description
from record r
join (
select top 3 v.recordid from votes v
where v.vote = 1 and recordid != 123456789
and userid in
(
select userid from votes where recordid = 123456789 and vote =1
)
order by dateadded desc
) as x on x.recordid = r.id
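To fold in the one-hour window from the question, one option is to join each candidate vote back to the same user's vote on the current record and compare timestamps. This is a rough sketch only, reusing the table and column names from the query above:
SELECT TOP 3 v.recordid, COUNT(*) AS likes
FROM votes v
JOIN votes z
    ON z.userid = v.userid
   AND z.recordid = 123456789
   AND z.vote = 1
WHERE v.vote = 1
  AND v.recordid != 123456789
  -- only count a like if it was cast within an hour of the like on the current record
  AND ABS(DATEDIFF(minute, z.dateadded, v.dateadded)) <= 60
GROUP BY v.recordid
ORDER BY likes DESC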
A method I used for the basic version of this problem is indeed multiple selects: figure out which users liked a specific item, then query further on what else they liked.
with Likers as
(select user_id from likes where content_id = 10)
select count(user_id) as like_count, content_id
from likes
natural join likers
where content_id <> 10
group by content_id
order by like_count desc;
(Tested using Sqlite3)
What you will receive is a list of items that were liked by the users who liked item 10, ordered by the number of likes (within the search domain). I would probably want to limit this as well, since on a larger dataset it's likely to produce a large number of stray items with only one or two overlapping likes, buried under the items with hundreds of likes.
I suspect the reason you are checking timestamps in the first place is so that if somebody likes laundry detergent, then comes back two days later to like a movie, the system would not associate "people who like Epic Shootout 17 also like Clean More."
I would not recommend using date arithmetic for this. I might suggest creating another table to represent individual "sessions" and using the session_id for this task. Since there are (hopefully!) many, many like records on your database, you want to reduce the amount of work you are making it do. You can also use this session_id for logging any other actions a person did (for analytics purposes.) It is also computationally cheaper to ask for all things that happened within a session with a simple index and identity comparison than to perform date computations on potentially millions of records.
For reference, Piwik defines a new session as thirty minutes since the last action taken.
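A sketch of what that could look like; the sessions table and the session_id column on likes are illustrative assumptions, not part of the original schema:
-- one row per browsing session
CREATE TABLE sessions (
    session_id INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL,
    started_at DATETIME NOT NULL
);
-- if likes also records the session_id it was cast in, "same session"
-- becomes a plain equality join instead of date arithmetic
SELECT b.content_id, COUNT(*) AS like_count
FROM likes a
JOIN likes b
    ON b.session_id = a.session_id
   AND b.content_id <> a.content_id
WHERE a.content_id = 10
GROUP BY b.content_id
ORDER BY like_count DESC;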

SQL convert sample points into durations

This is similar to Compute dates and durations in mysql query, except that I don't have a unique ID column to work with, and I have samples not start/end points.
As an interesting experiment, I set cron to run ps aux > `date +%Y-%m-%d_%H-%M`.txt. I now have around 250,000 samples of "what the machine was running".
I would like to turn this into a list of "process | cmd | start | stop". The assumption is that a 'start' event is the first time when the pair existed, and a 'stop' event is the first sample where it stopped existing: there is no chance of a sample "missing" or anything.
That said, what ways exist for doing this transformation, preferably using SQL (on the grounds that I like SQL, and this seems like a nice challenge)? Assuming that pids cannot be repeated this is a trivial task (put everything in a table, SELECT MIN(time), MAX(time), pid GROUP BY pid). However, since PID/cmd pairs are repeated (I checked, there are duplicates), I need a method that does a true "find all contiguous segments" search.
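For reference, the trivial case mentioned above would simply be the following (a sketch assuming the samples are loaded into a table t(pid, cmd, time), similar to the table queried in the answer below):
-- only valid if a pid is never reused
SELECT pid, cmd, MIN(time) AS start_time, MAX(time) AS stop_time
FROM t
GROUP BY pid, cmd;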
If necessary I can do something of the form
Load file0 -> oldList
ForEach fileN:
Load fileN ->newList
oldList-newList = closedN
newList-oldList = openedN
oldList=newList
But that is not SQL and not interesting. And who knows, I might end up having to deal with real SQL data that has this property at some point.
I'm thinking something where one first constructs a table of diff's, and then joins all close's against all open's and pulls the minimum-distance close after each open, but I'm wondering if there's a better way.
You don't mention what database you are using. Let me assume that you are using a database that supports ranking functions, since that simplifies the solution.
The key to solving this is an observation: you want to assign an id to each occurrence of a pid so you can tell which process instance it belongs to. I am going to assume that a pid represents a new process when the pid did not appear in the previous timestamped output.
Now, the idea is:
Assign a sequential number to each set of output. The first call to ps gets 1, the next 2, and so on, based on date.
Assign a sequential number to each pid, based on date. The first appearance gets 1, the next 2, and so on.
For pids that appear in sequence, the difference is a constant. We can call this the groupid for that set.
So, this is the query in action:
select groupid, pid, min(time), max(time)
from (select t.*,
(dense_rank() over (order by time) -
row_number() over (partition by pid order by time)
) as groupid
from t
) t
group by groupid, pid
This works in most databases (SQL Server, Oracle, DB2, Postgres, Teradata, among others). It does not work in MySQL because MySQL does not support the window/analytic functions.

How Do I Get the Total Only Once for Multiple Rows

I've been asked to modify a report (which unfortunately was written horribly!! not by me!) to include a count of days. Please note that "Days" is not calculated using "StartDate" & "EndDate" below. The problem is that there are multiple rows per record (users want to see the detail for each start & end date), so the total for "Days" is repeated on every row. How can I show the total only once, without it repeating down the column?
This is what the data looks like right now:
ID Description startdate enddate Days
REA145681 Emergency 11/17/2011 11/19/2011 49
REA145681 Emergency 12/6/2011 12/9/2011 49
REA145681 Emergency 12/10/2011 12/14/2011 49
REA146425 Emergency 11/23/2011 12/8/2011 54
REA146425 Emergency 12/9/2011 12/12/2011 54
I need this:
ID Description startdate enddate Days
REA145681 Emergency 11/17/2011 11/19/2011 49
REA145681 Emergency 12/6/2011 12/9/2011
REA145681 Emergency 12/10/2011 12/14/2011
REA146425 Emergency 11/23/2011 12/8/2011 54
REA146425 Emergency 12/9/2011 12/12/2011
Help please. This is how the users want to see the data.
Thanks in advance!
Liz
--- Here is the query simplified:
select id
,description
,startdate -- users want to see all start dates and enddates
,enddate
,days = datediff(d,Isnull(actualstardate,anticipatedstartdate) ,actualenddate)
from table
As you didn't provide your table definitions, I'll operate over your result set as if it were a table. This will give you what you're looking for:
select *,
case row_number() over (partition by id order by startdate)
when 1 then days
end
from t
Edit:
Looks like you DID add some SQL code. This should be what you're looking for:
select *,
case row_number() over (partition by id order by startdate)
when 1 then
datediff(d,Isnull(actualstardate,anticipatedstartdate) ,actualenddate)
end
from t
That is a task for the reporting tool. You will have to write something like the following in the Display Properties of the Days field:
if RowNumber > 1 AND id = previous_row(id)
then -- hide the value of Days
Colour = BackgroundColour
Days = NULL
Days = ' '
Display = false
... (anything that works)
So they want the output to be exactly the same except that they don't want to see the days listed multiple times for each ID value? And they're quite happy to see the ID and Description repeatedly but the Days value annoys them?
That's not really an SQL question. SQL is about which rows, columns and derived values are supposed to be presented in what order and that part seems to be working fine.
Suppressing the redundant occurrences of the Days value is more a matter of using the right tool. I'm not up on the current tools, but the last time I was, QMF was very good for this kind of thing. If a column was the basis for a control break, you could, in effect, select an option for that column that told it not to repeat the value of the control break. That way, you could keep it from repeating ID, Description AND Days if that's what you wanted. But I don't know if people are still using QMF and I have no idea if you are. And unless the price has come way down, you don't want to go out and buy QMF just to suppress those redundant values.
Other tools might do the same kind of thing but I can't tell you which ones. Perhaps the tool you are using to do your reporting - Crystal Reports or whatever - has that feature. Or not. I think it was called Outlining in QMF but it may have a different name in your tool.
Now, if this report is being generated by an application program, that is a different kettle of fish. An application could handle that quite nicely. But most people use end-user reporting tools to do this kind of thing to avoid the greater cost involved in writing programs.
We might be able to help further if you specify what tool you are using to generate this report.