Retrieving statistical information when 2 rows are involved - sql

I need to get some information from a data set (csv) which I have boiled down to the following simple table:

Date_Time            Id   passed
2013-06-23 20:13:10  112  A
2013-06-23 20:58:11  112  B
2013-06-23 21:01:10  118  A
2013-06-23 21:03:31  118  A
2013-06-23 21:05:49  118  A
2013-06-23 23:05:08  118  B
2013-06-24 08:10:03  118  B
The first two records show the simple case: after a check-in (A) we see a check-out (B) 0:45:01 later.
But there can also be several check-ins in a row (records 3, 4, 5), with the check-out following later. Normally there would be a corresponding check-out for every check-in.
Unfortunately, the data is not perfect and records are sometimes missing. (In the example there are only two check-outs for three check-ins.)
I would like to get some statistical values for the times between check-in and check-out, perhaps on a monthly basis or by weekday and so on. But I also have to find a way to discard records if there is no check-out within X hours, or if I find a check-out without a check-in.
I have been trying with pandas and it looked so promising, but as a newbie I got stuck among the huge range of possibilities that this magical package offers.
I hope someone can help me out and maybe explain a little bit where to look.
Many thanks in advance,
avm

Your table is not structured in such a way that you can do this with one query. If you had an added check_in_id column, then you could do it with one query; the idea being that there would be at most two rows with the same check_in_id, and they would always have the same Id.
So instead, write a stored procedure to create a temp table that contains the added column. The stored procedure would need to iterate over the rows of the table and, for each check-in, find the most recent check-out for that Id that is not already in the temp table.
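For illustration, the pairing itself can also be sketched without a temp table, as a correlated subquery (assuming the rows sit in a table named checks with the columns shown above; this is a sketch, not a drop-in answer):

-- Sketch only: 'checks' is an assumed name for the table holding the rows above.
-- For each check-in (A), find the earliest later check-out (B) for the same Id.
SELECT a.Id,
       a.Date_Time AS check_in,
       (SELECT MIN(b.Date_Time)
          FROM checks b
         WHERE b.Id = a.Id
           AND b.passed = 'B'
           AND b.Date_Time > a.Date_Time) AS check_out
FROM checks a
WHERE a.passed = 'A';

Durations and the X-hour cutoff can then be computed from check_out minus check_in, and rows with a NULL check_out discarded. Note that consecutive check-ins without an intervening check-out (records 3, 4, 5) would all pair with the same check-out, so those still need to be filtered by whatever rule fits the data.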

Related

Is it possible to recursively combine similar records (keeping - and adding - only specific columns) using a select?

I've been wracking my brain here trying to figure out a way to achieve a solution to the following without external applications (such as Excel).
I'll set up the structure: We are using a 3rd party ERP that provides a nicely configured conversion system for product packaging types. I'm trying to create a query that will take all conversions for a given product and return them inline. Because the number of conversion records is indeterminate, the query would need to be recursive.
To make things simple, let's use package quantities for this scenario example. If a product can be shipped in [eaches, pairs, sets, packages, and cartons], the conversion table records would look something like this:
pkConvKey    fkProdID   childUnit   parentUnit   chPerParent
ConvRec001   Prod123    each        pair        2
ConvRec002   Prod123    pair        set         3
ConvRec003   Prod123    set         pack        7
ConvRec004   Prod123    pack        carton      24
Using the table above, I can determine how many pairs of Prod123 are contained in a carton by following the math:
24 packs x 7 sets x 3 pairs = 504 pairs per carton.
I could further multiply that by 2 to get the count of individual pieces in a carton (1,008). That's the idea behind the conversion table but here's my actual problem.
I'd like to return a table of records where associated conversions are in-line, thusly:
fkProdID   unit1   unit2   qtyInUnit2   unit3   qtyInUnit3   unit4   qtyInUnit4   unit5    qtyInUnit5
Prod123    each    pair    2            set     3            pack    7            carton   24
Complicating the matter is that the unit types are unknown (arbitrary) values and there is no requirement to have a full, intact chain from unit A to unit Z. (For example, there might be a conversion record from each to pair, and another from set to pack, but not one from pair to set).
In this scenario, the select can't recursively link the records, and they would appear in the resulting table as two separate records - which is fine.
I have attempted to join the table to itself on t1.parentUnit = t2.childUnit, but that obviously doesn't work recursively.
I fear my only solution is to left join the table to itself over and over - as many as 20 times in the query - settling for NULL values where additional conversions do not exist, but then I would also have many duplicate rows (with incomplete conversion chains) to weed out.
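As an aside: in dialects that support recursive CTEs, such as SQL Server or PostgreSQL, the chain-walking itself can be expressed without a fixed number of self-joins. A rough sketch, with conversions standing in for the real conversion table name:

-- Sketch only: assumes a recursive-CTE-capable dialect and a table
-- named 'conversions' with the columns from the example above.
WITH RECURSIVE chain (fkProdID, startUnit, endUnit, qtyPerEnd, depth) AS (
    -- anchor: every conversion record is a chain of length 1
    SELECT fkProdID, childUnit, parentUnit, chPerParent, 1
    FROM conversions
    UNION ALL
    -- extend a chain whenever another record converts its end unit onward
    SELECT c.fkProdID, ch.startUnit, c.parentUnit,
           ch.qtyPerEnd * c.chPerParent, ch.depth + 1
    FROM chain ch
    JOIN conversions c
      ON c.fkProdID  = ch.fkProdID
     AND c.childUnit = ch.endUnit
)
SELECT fkProdID, startUnit, endUnit, qtyPerEnd
FROM chain
ORDER BY fkProdID, depth;

SQL Server omits the RECURSIVE keyword. This only lists every derivable conversion (e.g. pair-to-carton = 504); turning the chains into the single in-line row above would still need a pivot or conditional aggregation on top.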
Can this be done in a select query?
Thanks in advance!
-Dan

MSAccess Slow Updates on Self-Joined table

I am trying to improve the performance of updating only about 60K rows with data coming from different rows in the same table. At about 2 minutes, it's not terrible, but it's not great either, and my application really doesn't work if you have to wait so long between recalculations.
The app generates a set of financial statements for a business, where it calculates basic formulas on 1300 line items, like Rent, or Direct Labor, or Inventory costs, all of which roll up to totals that mimic the Balance Sheet, P&Ls, Cash Flow, etc. Many of the line items need to be calculated on a month-by-month basis, where for instance it has to figure out April's On Hand Inventory before knowing what April's Inventory Value is. So the total program ends up looping through 48 months over 30 calculation passes, requiring about 8000 SQL statements. (Fortunately it figures it all out by itself!) Each SQL statement takes only a few milliseconds, but it adds up.
I'm pretty sure I can't reduce the number of loops, so I keep trying to figure out how to make each SQL quicker. The basic structure is as follows:
LI: Line item table that holds the basic info of each item, primary key LID
LID Name
123 Sales_1
124 Sales_2
200 Total Sales
Formula: Master/Detail tables that create any formula from the line items
Total sales=Sales_1 + Sales_2
or
{200}={123}+{124}
(I use curly braces to be able to find and replace the LIDs within the formula, as shown in the SQL below)
FC: Formula Calculation table: all line items by month, about 1300 items x 48 months=62K records. Primary key FID
FID    SQL_ID   LID   LID_brace   LIN           OutputMonth   Formula       Amount
3232   25       123   {123}       Sales_1       1                           1200
3255   26       124   {124}       Sales_2       1                           1500
5454   177      200   {200}       Total Sales   1             {123}+{124}
DMO: Operand join table, which links a formula to its detail lines within the same table. Once Sales_1 is calculated, it can find the Total Sales record and update it, which then evaluates and sends its amount up the chain to the other LIDs that depend on it, such as Total Income. It locates the record to update based on the SQL_ID, which is set based on the calc pass and month. It's complex to set up, but pretty straightforward once you actually run things.
Master_FID Detail_FID
5454 3232 (links total sales to sales_1)
5454 3255 (links total sales to sales_2)
SQL1:
UPDATE (FC INNER JOIN DMO ON FC.FID = DMO.Master_FID)
INNER JOIN FC AS FC2 ON DMO.Detail_FID = FC2.FID
SET FC.Formula = Replace(FC.Formula, FC2.LID_brace, FC2.Amount)
WHERE FC.SQL_ID = 177
The above will change {123} + {124} to 1200+1500 which will then evaluate to 2700 when I run the following
SQL2:
UPDATE FC SET FC.Amount = Eval([FC].[Formula]) WHERE FC.calc_sql_id = 177
So those two SQL statements are run over and over again, with the only thing changing being the SQL_ID.
There are indexes on SQL_ID, LID, FID, etc.
When measuring, the milliseconds per record can range from 0.04 ms when many records are included (~10K for some passes) up to 10 or 15 ms when just one record is updated. Perhaps it is the setup of the query causing a whole lot of overhead, because it doesn't seem to be a function of the actual number of records updated? It's also not very consistent: some runs take 20+ ms compared to less than 3 ms when run again.
I know this is a complex question I'm asking that probably doesn't have a simple answer, but I'm just looking for directions on what might help. For instance, a parameter query if there isn't a whole lot of change between runs? Does Access have an easier time running a query if it knows about it in advance, i.e. a named query with parameters vs. dynamic SQL? Am I just doomed because it still needs to run those 8000 queries?
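For reference, a sketch of what SQL2 could look like as a saved Access parameter query, so only the parameter value changes between passes (pSqlId is a hypothetical parameter name; column name as in SQL2):

-- Sketch only: saved Access parameter query; pSqlId is an assumed name.
PARAMETERS pSqlId Long;
UPDATE FC
SET FC.Amount = Eval([FC].[Formula])
WHERE FC.calc_sql_id = [pSqlId];

From code it could then be run through a DAO QueryDef, setting the parameter before each Execute; whether the saved, pre-compiled query actually beats re-built dynamic SQL here is something that would have to be measured.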
Also, is there inherently a problem with trying to update the same table through a secondary join table, and/or is there a better way to do it?
Is it also because string replacing isn't efficient this way? If I tried RegEx, would that be quicker? I would have to make a function that could do that within a query, but it seems like that's going to be slower.
Thanks in advance, this has been a most vexing problem!!!

SQL to Split Rows into Multiple Columns

I have the table TEST below with a single column DATA:
00001900-01-01Aseopenigaccount-RF RF-ADIT
00341900-02-01Aseopenigaccount-RASS RASS-ADIT
00761900-03-01Adminopenigaccount-RASS OPEN-System
I need the DATA column above split into the 4 columns below:

Code   Date         Description               ShortDesc
0000   1900-01-01   Aseopenigaccount-RF       RF-ADIT
0034   1900-02-01   Aseopenigaccount-RASS     RASS-ADIT
0076   1900-03-01   Adminopenigaccount-RASS   OPEN-System
@at9063, welcome to the community. As the comments indicate, you should provide a sample of your own attempt in future questions. It would also be really helpful to provide any logical assumptions behind your dataset.
The solution is based on the data that you provided as an example. The first two columns can be extracted by taking the first 4 characters and the following 10. The Description column starts at the 15th character and goes up until the first space. ShortDesc starts from the first space.
SELECT LEFT(my_data, 4) AS My_Code,
       SUBSTRING(TRIM(my_data), 5, 10) AS My_Date,
       SUBSTRING(TRIM(my_data), 15, CHARINDEX(' ', my_data) - 15) AS My_Description,
       SUBSTRING(TRIM(my_data), CHARINDEX(' ', my_data), LEN(TRIM(my_data)) + 1 - CHARINDEX(' ', my_data)) AS My_ShortDesc
FROM test

SQL - SAP HANA - Use only first entry in Column table (based on date/time)

I have some tables on SAP HANA and use "create column table" to combine multiple "raw tables".
In one table there are duplicate rows; to be more specific, every piece of information (column) is the same but the date/time is not. The source system has this weird habit of creating the same entry several times (which is wrong). I do not have the possibility to manipulate data in the source system.
The table looks something like:
Table name: Testsubject_status
Status     info   Timestamp
Test me    bla    05.01.2017 05:05:00
Test me    bla    01.01.2017 11:15:00
Test him   blub   01.01.2017 11:17:00
Test her   blab   01.01.2017 11:25:00
Test me    bla    01.01.2017 11:35:00
Test it    blue   01.01.2017 12:15:00
Test me    bla    07.01.2017 12:15:00
All duplicates after the first entry (date/time wise) should not be considered in the newly created table.
Table name: Testsubject_status_NEW
Status     info   Timestamp
Test me    bla    01.01.2017 11:15:00
Test him   blub   01.01.2017 11:17:00
Test her   blab   01.01.2017 11:25:00
Test it    blue   01.01.2017 12:15:00
This problem does appear multiple times, not only with Test me.
Is the solution something like:
Select
xxx AS "tri"
yyy AS "tre"
zzz AS "tru"
Case when Testsubject_status.Status Count > 1 Then "take first entry"
From ...
Where …
???
I am grateful for any help or advice.
Based on the description it should be sufficient to aggregate for the minimum (earliest) timestamp:
SELECT tri, tre, tru,
       MIN(timestamp)
FROM
    ....
GROUP BY tri, tre, tru
That works if the "de-duplication" indeed should happen based on all remaining columns except timestamp.
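An equivalent window-function formulation is also possible; a sketch only, using the example's names (SAP HANA supports ROW_NUMBER):

-- Keep only the earliest row per (Status, info) combination; further
-- columns can simply be added to the inner and outer SELECT lists.
SELECT Status, info, Timestamp
FROM (
    SELECT Status, info, Timestamp,
           ROW_NUMBER() OVER (PARTITION BY Status, info
                              ORDER BY Timestamp) AS rn
    FROM Testsubject_status
) ranked
WHERE rn = 1;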

SQL Query with multiple values in one column

I've been beating my head on the desk trying to figure this one out. I have a table that stores job information, and reasons for a job not being completed. The reasons are numeric: 01, 02, 03, etc. You can have two reasons for a pending job. If you select two reasons, they are stored in the same column, separated by a comma. This is an example from the JOBID table:
Job_Number User_Assigned PendingInfo
1 user1 01,02
There is another table named Pending, that stores what those values actually represent. 01=Not enough info, 02=Not enough time, 03=Waiting Review. Example:
Pending_Num PendingWord
01 Not Enough Info
02 Not Enough Time
What I'm trying to do is query the database to give me all the job numbers, users, pendinginfo, and pending reasons. I can break out the first value, but can't figure out how to do the second. What my limited skills have produced so far:
select Job_number,user_assigned,SUBSTRING(pendinginfo,0,3),pendingword
from jobid,pending
where
SUBSTRING(pendinginfo,0,3)=pending.pending_num and
pendinginfo!='00,00' and
pendinginfo!='NULL'
What I would like to see for this example would be:
Job_Number User_Assigned PendingInfo PendingWord PendingInfo PendingWord
1 User1 01 Not Enough Info 02 Not Enough Time
Thanks in advance
You really shouldn't store multiple items in one column if your SQL is ever going to want to process them individually. The "SQL gymnastics" you have to perform in those cases are both ugly hacks and performance degraders.
The ideal solution is to split the individual items into separate columns and, for 3NF, move those columns to a separate table as rows if you really want to do it properly (but baby steps are probably okay if you're sure there will never be more than two reasons in the short-medium term).
Then your queries will be both simpler and faster.
However, if that's not an option, you can use the afore-mentioned SQL gymnastics to do something like:
where find(',' || fld || ',', ',02,') > 0
assuming your SQL dialect has a string search function (find in this case, but I think charindex for SQLServer).
This will ensure all sub-columns begin and end with a comma (comma plus field plus comma) and look for a specific desired value (with the commas on either side to ensure it's a full sub-column match).
If you can't control what the application puts in that column, I would opt for the DBA solution - DBA solutions are defined as those a DBA has to do to work around the inadequacies of their users :-).
Create two new columns in that table and make an insert/update trigger which will populate them with the two reasons that a user puts into the original column.
Then query those two new columns for specific values rather than trying to split apart the old column.
This means that the cost of splitting is only paid on row insert/update, not on every single select, amortising that cost efficiently.
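A rough sketch of what such a trigger could look like, assuming SQL Server syntax (pending_1 and pending_2 are the hypothetical new columns, and Job_Number is taken as the key):

-- Sketch only: SQL Server syntax assumed; pending_1 / pending_2 are
-- hypothetical split-out columns kept in sync with PendingInfo.
CREATE TRIGGER trg_jobid_split_pending
ON jobid
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE j
    SET pending_1 = LEFT(i.PendingInfo, CHARINDEX(',', i.PendingInfo + ',') - 1),
        pending_2 = CASE
                      WHEN CHARINDEX(',', i.PendingInfo) > 0
                      THEN SUBSTRING(i.PendingInfo, CHARINDEX(',', i.PendingInfo) + 1, 10)
                      ELSE NULL
                    END
    FROM jobid AS j
    JOIN inserted AS i ON i.Job_Number = j.Job_Number;
END;

The exact syntax will differ by DBMS; the point is only that the splitting happens once, at write time.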
Still, my answer is to re-do the schema. That will be the best way in the long term in terms of speed, readable queries and maintainability.
I hope you are just maintaining the code and it's not a brand new implementation.
Please consider using a different approach with support tables like these:
JOBS TABLE
jobID | userID
--------------
1 | user13
2 | user32
3 | user44
--------------
PENDING TABLE
pendingID | pendingText
---------------------------
01 | Not Enough Info
02 | Not Enough Time
---------------------------
JOB_PENDING TABLE
jobID | pendingID
-----------------
1 | 01
1 | 02
2 | 01
3 | 03
3 | 01
-----------------
You can easily query these tables using JOINs or subqueries.
If you need backward compatibility for your software, you can add a view to achieve this.
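For example, listing every pending reason per job then becomes a plain join (a sketch using the table and column names above):

-- Sketch: one output row per (job, pending reason) pair.
SELECT j.jobID,
       j.userID,
       p.pendingID,
       p.pendingText
FROM jobs AS j
JOIN job_pending AS jp ON jp.jobID = j.jobID
JOIN pending AS p ON p.pendingID = jp.pendingID
ORDER BY j.jobID, p.pendingID;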
I have tables like:
Events
---------
eventId int
eventTypeIds nvarchar(50)
...
EventTypes
--------------
eventTypeId
Description
...
Each Event can have multiple eventtypes specified.
All I do is write 2 procedures in my site code, not SQL code.
One procedure converts the table field (eventTypeIds) value, like "3,4,15,6", into a ViewState array, so I can use it anywhere in code.
The other procedure does the opposite: it collects any options you checked and converts them back into the comma-separated field value.
If changing the schema is an option (which it probably should be) shouldn't you implement a many-to-many relationship here so that you have a bridging table between the two items? That way, you would store the number and its wording in one table, jobs in another, and "failure reasons for jobs" in the bridging table...
Have a look at a similar question I answered here
;WITH Numbers AS
(
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 0)) AS N
FROM JobId
),
Split AS
(
SELECT JOB_NUMBER, USER_ASSIGNED, SUBSTRING(PENDING_INFO, Numbers.N, CHARINDEX(',', PENDING_INFO + ',', Numbers.N) - Numbers.N) AS PENDING_NUM
FROM JobId
JOIN Numbers ON Numbers.N <= DATALENGTH(PENDING_INFO) + 1
AND SUBSTRING(',' + PENDING_INFO, Numbers.N, 1) = ','
)
SELECT *
FROM Split JOIN Pending ON Split.PENDING_NUM = Pending.PENDING_NUM
The basic idea is that you have to multiply each row as many times as there are PENDING_NUMs. Then, extract the appropriate part of the string.
While I agree with the DBA perspective of not storing multiple values in a single field, it is doable, as shown below, and can be practical for application logic and some performance situations. Let's say you have 10,000 user groups, each having on average 1,000 members. You may want to have a table user_groups with columns such as groupID and membersID. Your membersID column could be populated like this:
(',10,2001,20003,333,4520,'), each number being a memberID, all separated by commas. Also add a comma at the start and end of the data. Then your select would use LIKE '%,someID,%'.
If you cannot change your data ('01,02,03' or similar), and let's say you want rows containing 01, you can still use SELECT ... WHERE pendinginfo LIKE '01,%' OR pendinginfo LIKE '%,01' OR pendinginfo LIKE '%,01,%', which will ensure it matches at the start, end, or inside, while avoiding similar numbers (e.g. 101).
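Put together with the question's tables, that comma-wrapping idea might look like this (a sketch; SQL Server-style + concatenation assumed, as in the earlier answer):

-- Sketch: wrap PendingInfo in commas, then match each pending code as a
-- whole sub-column, so 01 does not also match 101.
SELECT j.Job_Number,
       j.User_Assigned,
       p.Pending_Num,
       p.PendingWord
FROM jobid AS j
JOIN pending AS p
  ON ',' + j.PendingInfo + ',' LIKE '%,' + p.Pending_Num + ',%';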