I'm storing some very basic information about the "data sources" coming into my application. These data sources can be in the form of a document (e.g. PDF, etc.), audio (e.g. MP3, etc.) or video (e.g. AVI, etc.). Say, for example, I am only interested in the filename of the data source. Thus, I have the following table:
DataSource
Id (PK)
Filename
For each data source, I also need to store some of its attributes. An example for a PDF would be "number of pages." An example for audio would be "bit rate." An example for video would be "duration." Each DataSource will have different requirements for the attributes that need to be stored. So, I have modeled "data source attribute" this way:
DataSourceAttribute
Id (PK)
DataSourceId (FK)
Name
Value
Thus, I would have records like these:
DataSource->Id = 1
DataSource->Filename = 'mydoc.pdf'
DataSource->Id = 2
DataSource->Filename = 'mysong.mp3'
DataSource->Id = 3
DataSource->Filename = 'myvideo.avi'
DataSourceAttribute->Id = 1
DataSourceAttribute->DataSourceId = 1
DataSourceAttribute->Name = 'TotalPages'
DataSourceAttribute->Value = '10'
DataSourceAttribute->Id = 2
DataSourceAttribute->DataSourceId = 2
DataSourceAttribute->Name = 'BitRate'
DataSourceAttribute->Value = '16'
DataSourceAttribute->Id = 3
DataSourceAttribute->DataSourceId = 3
DataSourceAttribute->Name = 'Duration'
DataSourceAttribute->Value = '1:32'
My problem is that this doesn't seem to scale. For example, say I need to query for all the PDF documents along with their total number of pages:
Filename, TotalPages
'mydoc.pdf', '10'
'myotherdoc.pdf', '23'
...
The JOINs needed to produce the above result are just too costly. How should I address this problem?
Scaling is one of the most common problems with EAV (Entity-Attribute-Value) data structures. In short, you have to query the metadata (i.e. locate the attributes) before you can get to the data. However, here is a query that you can use to get the data you want:
Select DataSourceId
, Min( Case When Name = 'TotalPages' Then Value End ) As TotalPages
, Min( Case When Name = 'BitRate' Then Value End ) As BitRate
, Min( Case When Name = 'Duration' Then Value End ) As Duration
From DataSourceAttribute
Group By DataSourceId
In order to improve performance, you'll want an index on DataSourceId and perhaps Name as well. To get to the results you posted, you would do:
Select DataSource.FileName
, Min( Case When DataSourceAttribute.Name = 'TotalPages' Then Value End ) As TotalPages
, Min( Case When DataSourceAttribute.Name = 'BitRate' Then Value End ) As BitRate
, Min( Case When DataSourceAttribute.Name = 'Duration' Then Value End ) As Duration
From DataSourceAttribute
Join DataSource
On DataSource.Id = DataSourceAttribute.DataSourceId
Group By DataSource.FileName
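As a rough illustration of the query above, here is a small self-contained sketch that runs it against the sample rows from the question. It uses SQLite through Python's sqlite3 module purely so the example can be executed anywhere; the index name IX_Attr_Source_Name and the filter on '%.pdf' are assumptions added for the demo, not part of the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DataSource (Id INTEGER PRIMARY KEY, Filename TEXT);
    CREATE TABLE DataSourceAttribute (
        Id INTEGER PRIMARY KEY,
        DataSourceId INTEGER REFERENCES DataSource(Id),
        Name TEXT,
        Value TEXT
    );
    -- Hypothetical index name; supports lookups by data source and attribute name.
    CREATE INDEX IX_Attr_Source_Name ON DataSourceAttribute (DataSourceId, Name);

    INSERT INTO DataSource VALUES
        (1, 'mydoc.pdf'), (2, 'mysong.mp3'), (3, 'myvideo.avi');
    INSERT INTO DataSourceAttribute VALUES
        (1, 1, 'TotalPages', '10'),
        (2, 2, 'BitRate', '16'),
        (3, 3, 'Duration', '1:32');
""")

# Conditional aggregation: one output row per data source, with the
# attribute of interest pivoted into its own column.
rows = conn.execute("""
    SELECT ds.Filename,
           MIN(CASE WHEN a.Name = 'TotalPages' THEN a.Value END) AS TotalPages
    FROM DataSource AS ds
    JOIN DataSourceAttribute AS a ON a.DataSourceId = ds.Id
    WHERE ds.Filename LIKE '%.pdf'      -- assumption: PDFs are identified by extension
    GROUP BY ds.Id, ds.Filename
    ORDER BY ds.Id
""").fetchall()

print(rows)
```

Grouping by the Id as well as the Filename keeps two data sources that happen to share a filename from being merged into one row.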
It seems like you want something a bit looser than a typical relational database. This sounds like a good candidate for something like Lucene or MongoDB. Lucene is an indexing engine which allows any type of document to be stored and indexed. MongoDB sits in the middle between an RDBMS and free-form document storage. JSON in some form or other (MongoDB is a good example) should fit nicely.
This might work, but define too costly...
select
datasource.id,
d1.id as d1id,
d1.value as d1filename,
d2.id as d2id,
d2.value as d2totalpages
from datasource
inner join datasourceattribute d1
on datasource.id = d1.datasourceid and d1.name = 'filename'
inner join datasourceattribute d2
on datasource.id = d2.datasourceid and d2.name = 'totalpages'
where d1.value like '%pdf'
I've been working on this problem, researching what I could be doing wrong but I can't seem to find an answer or fault in the code that I've written. I'm currently extracting data from a MS SQL Server database, with a WHERE clause successfully filtering the results to what I want. I get roughly 4 rows per employee, and want to add together a value column. The moment I add the GROUP BY clause against the employee ID, and put a SUM against the value, I'm getting a number that is completely wrong. I suspect the SQL code is ignoring my WHERE clause.
Below is a small selection of data:
hr_empl_code hr_doll_paid
1 20.5
1 51.25
1 102.49
1 560
I expect that a GROUP BY and SUM clause would give me the value of 734.24. The value I'm given is 211461.12. While troubleshooting, I added a COUNT(*) column to my query to work out how many rows it's running against, and it gives a result of 1152, which further reinforces my belief that it's ignoring my WHERE clause.
My SQL code is as below. Most of it has been generated by the front-end application that I'm running it from, so there is some additional code in there that I believe does assist the query.
SELECT DISTINCT
T000.hr_empl_code,
SUM(T175.hr_doll_paid)
FROM
hrtempnm T000,
qmvempms T001,
hrtmspay T166,
hrtpaytp T175,
hrtptype T177
WHERE 1 = 1
AND T000.hr_empl_code = T001.hr_empl_code
AND T001.hr_empl_code = T166.hr_empl_code
AND T001.hr_empl_code = T175.hr_empl_code
AND T001.hr_ploy_ment = T166.hr_ploy_ment
AND T001.hr_ploy_ment = T175.hr_ploy_ment
AND T175.hr_paym_code = T177.hr_paym_code
AND T166.hr_pyrl_code = 'f' AND T166.hr_paid_dati = 20180404
AND (T175.hr_paym_type = 'd' OR T175.hr_paym_type = 't')
GROUP BY T000.hr_empl_code
ORDER BY hr_empl_code
I'm really lost as to where it could be going wrong. I have stripped out the additional WHERE ... AND conditions and brought it down to just T166.hr_empl_code = T175.hr_empl_code, but it doesn't make a difference.
By no means am I an expert in SQL Server and queries, but I have a decent grasp of the technology. Any help would be much appreciated!
GROUP BY is not wrong; how you are using it is wrong.
SELECT
T000.hr_empl_code,
T.totpaid
FROM
hrtempnm T000
inner join (SELECT
hr_empl_code,
SUM(hr_doll_paid) as totPaid
FROM
hrtpaytp T175
where hr_paym_type = 'd' OR hr_paym_type = 't'
GROUP BY hr_empl_code
) T on t.hr_empl_code = T000.hr_empl_code
where exists
(select * from qmvempms T001,
hrtmspay T166,
hrtpaytp T175,
hrtptype T177
WHERE T000.hr_empl_code = T001.hr_empl_code
AND T001.hr_empl_code = T166.hr_empl_code
AND T001.hr_empl_code = T175.hr_empl_code
AND T001.hr_ploy_ment = T166.hr_ploy_ment
AND T001.hr_ploy_ment = T175.hr_ploy_ment
AND T175.hr_paym_code = T177.hr_paym_code
AND T166.hr_pyrl_code = 'f' AND T166.hr_paid_dati = 20180404
)
ORDER BY hr_empl_code
Note: It would be clearer if you used explicit JOINs instead of old-style joins in the WHERE clause.
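The numbers in the question are consistent with join fan-out rather than an ignored WHERE clause: 1152 rows / 4 expected rows = 288, and 211461.12 / 734.24 = 288, so each payment row appears to be repeated 288 times by the other joined tables. The sketch below reproduces the effect on a small scale with SQLite via Python's sqlite3 module. The table names emp, pay and other are simplified stand-ins invented for the demo, not the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp   (hr_empl_code INTEGER);
    CREATE TABLE pay   (hr_empl_code INTEGER, hr_doll_paid REAL);
    -- Hypothetical stand-in for any joined table with several rows per employee.
    CREATE TABLE other (hr_empl_code INTEGER);

    INSERT INTO emp VALUES (1);
    INSERT INTO pay VALUES (1, 20.5), (1, 51.25), (1, 102.49), (1, 560);
    INSERT INTO other VALUES (1), (1), (1);
""")

# Joining first and aggregating afterwards: every pay row is repeated once
# per matching row in "other", so COUNT and SUM are both inflated 3x.
inflated = conn.execute("""
    SELECT e.hr_empl_code, COUNT(*), SUM(p.hr_doll_paid)
    FROM emp AS e
    JOIN pay   AS p ON p.hr_empl_code = e.hr_empl_code
    JOIN other AS o ON o.hr_empl_code = e.hr_empl_code
    GROUP BY e.hr_empl_code
""").fetchone()

# Aggregating in a derived table first, then using EXISTS for the other
# table, so it filters employees without multiplying rows.
correct = conn.execute("""
    SELECT e.hr_empl_code, t.n, t.total
    FROM emp AS e
    JOIN (SELECT hr_empl_code, COUNT(*) AS n, SUM(hr_doll_paid) AS total
          FROM pay
          GROUP BY hr_empl_code) AS t ON t.hr_empl_code = e.hr_empl_code
    WHERE EXISTS (SELECT 1 FROM other AS o
                  WHERE o.hr_empl_code = e.hr_empl_code)
""").fetchone()

print("joined then summed:", inflated[1], round(inflated[2], 2))
print("summed then joined:", correct[1], round(correct[2], 2))
```

A quick way to check for this in a real query is to run it with COUNT(*) alongside the SUM, as the question did, and compare the count with the number of rows expected from the table being summed.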
I'm learning SAP queries.
I want to get all the measurement documents for a piece of equipment.
To do that, I use 3 tables :
EQUI, IMPTT, IMRG
The query works, but it returns all documents, whereas I only want the latest one by date. I can't work out how to do that. I'm sure that I have to add a custom field, but none of the ones I have tried work.
For example, my last code :
select min( IMRG~INVTS ) IMRG~RECDV
from IMRG inner join IMPTT on
IMRG~POINT = IMPTT~POINT into (INVTS, IMRGVAL)
where IMRG~POINT = IMPTT-POINT AND
IMPTT~MPOBJ = EQUI-OBJNR
and IMRG~CANCL = '' group by IMRG~MDOCM IMRG~RECDV.
ENDSELECT.
Thanks for your help.
You will need to get the date from IMRG, and the inverted timestamp field, so the MIN() of this will be the most recent - that looks correct.
However your GROUP BY looks wrong. You should be grouping on the IMPTT~POINT field so that you get one record per measurement point. Note that one Point IMPTT can have many measurements (IMRG), so something like this:
SELECT EQUI-OBJNR, IMPTT~POINT, MIN(IMRG~IMRC_INVTS)
...
GROUP BY EQUI-OBJNR, IMPTT~POINT
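To make the grouping idea concrete outside of ABAP, here is a minimal sketch in plain SQL, run on SQLite through Python's sqlite3 module. It assumes the inverted timestamp is stored as 99999999999999 minus a YYYYMMDDHHMMSS value, so that the smallest INVTS is the newest reading; that formula, the cut-down table layouts and the sample rows are all assumptions for illustration only.

```python
import sqlite3

INVERT_BASE = 99999999999999  # assumption: base used to invert YYYYMMDDHHMMSS


def invert(timestamp: int) -> int:
    """Return the inverted timestamp; newer readings get smaller values."""
    return INVERT_BASE - timestamp


conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Heavily simplified stand-ins for IMPTT and IMRG.
    CREATE TABLE imptt (point TEXT, mpobj TEXT);
    CREATE TABLE imrg  (mdocm TEXT, point TEXT, invts INTEGER,
                        recdv REAL, cancl TEXT);
""")
conn.executemany("INSERT INTO imptt VALUES (?, ?)",
                 [("P1", "EQ1"), ("P2", "EQ1")])
conn.executemany("INSERT INTO imrg VALUES (?, ?, ?, ?, ?)", [
    ("D1", "P1", invert(20240101120000), 10.0, ""),
    ("D2", "P1", invert(20240301120000), 12.5, ""),
    ("D3", "P2", invert(20240201120000), 7.0, ""),
    ("D4", "P2", invert(20240401120000), 9.0, "X"),  # cancelled, must be ignored
])

# Step 1 (derived table): smallest inverted timestamp per measuring point.
# Step 2 (outer query): join back to fetch the document and reading itself.
latest = conn.execute("""
    SELECT m.point, m.mdocm, m.recdv
    FROM imrg AS m
    JOIN (SELECT r.point, MIN(r.invts) AS min_invts
          FROM imrg AS r
          JOIN imptt AS p ON p.point = r.point
          WHERE p.mpobj = 'EQ1' AND r.cancl = ''
          GROUP BY r.point) AS newest
      ON newest.point = m.point AND newest.min_invts = m.invts
    WHERE m.cancl = ''
    ORDER BY m.point
""").fetchall()

print(latest)
```

The join back to the detail table is needed because the reading is not part of the GROUP BY; grouping by the reading as well, as in the original attempt, produces one row per distinct value instead of one row per point.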
If I understood you correctly, you are trying to get the most recent measurement for the equipment regardless of measurement point. You can try this query, which is not pretty, but it works.
SELECT objnr COUNT(*) MIN( invts )
FROM equi AS eq
JOIN imptt AS tt
ON tt~mpobj = eq~objnr
JOIN imrg AS ig
ON ig~point = tt~point
INTO (wa_objnr, count, wa_invts)
WHERE ig~cancl = ''
GROUP BY objnr.
SELECT SINGLE recdv FROM imrg JOIN imptt ON imptt~point = imrg~point INTO wa_imrgval WHERE invts = wa_invts AND imptt~mpobj = wa_objnr.
WRITE: / wa_objnr, count, wa_invts, wa_imrgval.
ENDSELECT.
I have two tables DOCUMENT and ATTRIBUTES like these
DOCUMENT(id),
ATTRIBUTE(name, value, doc_fk).
I need to run a query that works like this "abstract query"
select top 100 documents
where $state='COMPLETED'
order by $creationDate
Where $state and $creationDate are two attributes.
Note that the limit is on documents, not attributes, and sort and filter are on two different attributes. The final query should return all document attributes, not only the filtered/sorted ones.
I was able to write this with a very complex query and I'm looking for better alternatives. I could post my solution if useful, but I do not want to point you in the, possibly, wrong direction.
It's ok to get a FEW extra documents, like 1000 instead of 100, and filter/sort in memory.
Could be ok for the limit not to be exact, like 74 instead of the required limit 100, but not too far from it.
Extra "soft" requirements:
the query should work with several databases (oracle, mysql and sqlserver), so weird analytic functions should be avoided unless available on all platforms
should work with JPA (eclipselink 2.4.0 implementation)
The expected output is something like this
DOC_ID ATTRIBUTE_NAME VALUE
123 state COMPLETED
123 creationDate 21/11/2012
123 userid someone
456 state COMPLETED
...
Ah, the flaws of an EAV design.
Try this.
select
top 100
document.*
from document
inner join attribute astate on document.id = astate.doc_fk
and astate.name='state'
and astate.value = 'COMPLETED'
inner join attribute acreation on document.id = acreation.doc_fk
and acreation.name='creationDate'
order by cast(acreation.value as date)
But it's only going to get more complicated if you persist with this EAV structure.
(PS. MySQL doesn't use TOP, but LIMIT instead)
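Since the question asks for every attribute of the selected documents, one possible way to build on the join above is a two-step approach: first pick the ids of the top N documents, then fetch all attributes for just those ids. The sketch below shows this with SQLite via Python's sqlite3 module. The helper name top_documents is hypothetical, the sample dates are stored as ISO strings so that plain text ordering is chronological (an assumption; the question shows 21/11/2012), and LIMIT would need to become TOP or a ROWNUM/FETCH FIRST clause on other databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE document  (id INTEGER PRIMARY KEY);
    CREATE TABLE attribute (name TEXT, value TEXT,
                            doc_fk INTEGER REFERENCES document(id));
""")

# Sample data invented for the demo.
docs = {
    123: {"state": "COMPLETED", "creationDate": "2012-11-21", "userid": "someone"},
    456: {"state": "COMPLETED", "creationDate": "2012-11-19", "userid": "other"},
    789: {"state": "PENDING", "creationDate": "2012-11-18", "userid": "third"},
}
for doc_id, attrs in docs.items():
    conn.execute("INSERT INTO document VALUES (?)", (doc_id,))
    conn.executemany("INSERT INTO attribute VALUES (?, ?, ?)",
                     [(name, value, doc_id) for name, value in attrs.items()])


def top_documents(conn, limit):
    """Return (doc_id, name, value) rows for the first `limit` completed documents."""
    # Step 1: filter on one attribute, sort on another, limit on documents.
    ids = [row[0] for row in conn.execute("""
        SELECT d.id
        FROM document AS d
        JOIN attribute AS s ON s.doc_fk = d.id
                           AND s.name = 'state' AND s.value = 'COMPLETED'
        JOIN attribute AS c ON c.doc_fk = d.id
                           AND c.name = 'creationDate'
        ORDER BY c.value
        LIMIT ?
    """, (limit,))]
    if not ids:
        return []
    # Step 2: fetch every attribute of the selected documents only.
    placeholders = ",".join("?" * len(ids))
    rows = conn.execute(
        f"SELECT doc_fk, name, value FROM attribute WHERE doc_fk IN ({placeholders})",
        ids,
    ).fetchall()
    # Restore the document order from step 1; sort attributes by name within it.
    position = {doc_id: index for index, doc_id in enumerate(ids)}
    return sorted(rows, key=lambda row: (position[row[0]], row[1]))


for row in top_documents(conn, 100):
    print(row)
```

Splitting the work in two keeps each statement simple enough to express in JPQL or a native query, at the cost of a second round trip to the database.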
SELECT doc_id, attr_name, attr_val, creationDate FROM
(
SELECT * FROM (
SELECT
doc.id as 'doc_id', attr.name as 'attr_name', null as 'attr_val', attr.value as 'creationDate'
FROM
ATTRIBUTE attr
LEFT JOIN
DOCUMENT doc ON attr.doc_fk = doc.id
WHERE
attr.name='creationDate'
ORDER BY creationDate desc
) AS dt1
UNION ALL
SELECT * FROM(
SELECT
doc.id as 'doc_id', attr.name as 'attr_name', attr.value as 'attr_val', null as 'creationDate'
FROM
ATTRIBUTE attr
LEFT JOIN
DOCUMENT doc ON attr.doc_fk = doc.id
) as dt2
) as dt0 GROUP BY doc_id ORDER by creationDate desc LIMIT 100;
Derived table 1 (dt1) gives you all the date attributes, which lets you order your results by each document's creation date.
Derived table 2 (dt2) gives you all the attributes. Putting the two together with "union all" lets you group by document and then order by the date of creation.
Hope this is in the right direction.
I have the following query that I've used to pull out a vehicle ID, its registration, a driver ID and the driver's name. The _core_people table you see referenced in this query also has a field I wish to set to the vehicle's plate.
Here's the query:
SELECT fv._plate, cp._people_name
FROM
_fleet_vehicle fv,
_fleet_vehicle_status fvs,
_core_people cp,
_fleet_allocation fa
WHERE
cp._id_hierarchy = fv._id_hierarchy
AND fv._id_status = fvs._id
AND fvs._status_live = 1
AND fa._id_person = cp._id
AND fa._id_vehicle = fv._id
AND fa._alloc_end_date IS NULL
ORDER BY cp._people_name ASC
Now, I want to write an UPDATE...FROM clause that utilises this to set _core_people._plate (not shown here) to the plate field of their vehicle.
However, I'm not sure how to go about structuring the UPDATE...FROM clause.
Also, some drivers have 2 cars. Will it still work?
Thanks in advance!
UPDATE _core_people
SET _plate = fv._plate
FROM _fleet_vehicle fv,
_fleet_vehicle_status fvs,
_core_people cp,
_fleet_allocation fa
WHERE cp._id_hierarchy = fv._id_hierarchy
AND fv._id_status = fvs._id
AND fvs._status_live = 1
AND fa._id_person = cp._id
AND fa._id_vehicle = fv._id
AND fa._alloc_end_date IS NULL
If more than one vehicle matches the condition for a given _core_people record, that record will be updated with the plate of only one of them (it is not possible to tell which one exactly).
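If the arbitrary pick for drivers with two cars is a concern, one option is to make the choice explicit with a correlated subquery. The sketch below does this on SQLite through Python's sqlite3 module; it uses cut-down tables named people, vehicle and allocation (the hierarchy and status tables are left out for brevity), and choosing MIN(plate) as the tie-breaker is purely an assumption for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified stand-ins for _core_people, _fleet_vehicle, _fleet_allocation.
    CREATE TABLE people     (id INTEGER PRIMARY KEY, name TEXT, plate TEXT);
    CREATE TABLE vehicle    (id INTEGER PRIMARY KEY, plate TEXT);
    CREATE TABLE allocation (id_person INTEGER, id_vehicle INTEGER,
                             alloc_end_date TEXT);

    INSERT INTO people VALUES (1, 'Ann', NULL), (2, 'Bob', NULL), (3, 'Cy', NULL);
    INSERT INTO vehicle VALUES (10, 'AAA111'), (11, 'BBB222'), (12, 'CCC333');
    INSERT INTO allocation VALUES
        (1, 10, NULL),
        (1, 11, NULL),           -- Ann currently has two cars
        (2, 12, '2020-01-01');   -- Bob's allocation has ended
""")

# The subquery returns exactly one plate per person (the alphabetically
# first one), so the outcome no longer depends on join order.
conn.execute("""
    UPDATE people
    SET plate = (SELECT MIN(v.plate)
                 FROM allocation AS a
                 JOIN vehicle AS v ON v.id = a.id_vehicle
                 WHERE a.id_person = people.id
                   AND a.alloc_end_date IS NULL)
    WHERE EXISTS (SELECT 1
                  FROM allocation AS a
                  WHERE a.id_person = people.id
                    AND a.alloc_end_date IS NULL)
""")

plates = conn.execute("SELECT name, plate FROM people ORDER BY id").fetchall()
print(plates)
```

The WHERE EXISTS guard leaves people without a current allocation untouched instead of overwriting their plate with NULL.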
Having a mental block with going around this query.
I have the following tables:
review_list: has most of the data, but in this case the only important thing is review_id, the id of the record that I am currently interested in (int)
variant_list: model (varchar), enabled (bool)
variant_review: model (varchar), id (int)
variant_review is a many-to-many table linking the review_id in review_list to the model(s) in variant_list, and contains (e.g.):
..
test1,22
test2,22
test4,22
test1,23
test2,23... etc
variant_list is a list of all possible models and whether they are enabled and contains (eg):
test1,TRUE
test2,TRUE
test3,TRUE
test4,TRUE
what I am after in mysql is a query that when given a review_id (ie, 22) will return a resultset that will list each value in variant_review.model, and whether it is present for the given review_id such as:
test1,1
test2,1
test3,0
test4,1
or similar, which I can farm off to some webpage with a list of checkboxes for the types. This would show all the models available and whether each one was present in the table
Given a bit more information about the column names:
Select variant_list.model
, Case When variant_review.model Is Not Null Then 1 Else 0 End As HasReview
From variant_list
Left join variant_review
On variant_review.model = variant_list.model
And variant_review.review_id = 22
Just for completeness, if it is the case that you can have multiple rows in the variant_review table with the same model and review_id, then you need to do it differently:
Select variant_list.model
, Case
When Exists (
Select 1
From variant_review As VR
Where VR.model = variant_list.model
And VR.review_id = 22
) Then 1
Else 0
End
From variant_list
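The detail that makes the first query work is that the review_id condition sits in the ON clause rather than in a WHERE clause. The sketch below, run on SQLite via Python's sqlite3 module with the sample rows from the question, shows both placements side by side; the column name review_id follows the answer, and the variable names are invented for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE variant_list   (model TEXT, enabled INTEGER);
    CREATE TABLE variant_review (model TEXT, review_id INTEGER);

    INSERT INTO variant_list VALUES
        ('test1', 1), ('test2', 1), ('test3', 1), ('test4', 1);
    INSERT INTO variant_review VALUES
        ('test1', 22), ('test2', 22), ('test4', 22),
        ('test1', 23), ('test2', 23);
""")

# Filter in the ON clause: unmatched models survive the LEFT JOIN
# and are reported with a flag of 0.
flag_in_on = conn.execute("""
    SELECT vl.model,
           CASE WHEN vr.model IS NOT NULL THEN 1 ELSE 0 END AS has_review
    FROM variant_list AS vl
    LEFT JOIN variant_review AS vr
           ON vr.model = vl.model AND vr.review_id = ?
    ORDER BY vl.model
""", (22,)).fetchall()

# Filter in the WHERE clause: rows where vr.review_id is NULL are
# discarded, so the LEFT JOIN behaves like an INNER JOIN.
flag_in_where = conn.execute("""
    SELECT vl.model,
           CASE WHEN vr.model IS NOT NULL THEN 1 ELSE 0 END AS has_review
    FROM variant_list AS vl
    LEFT JOIN variant_review AS vr ON vr.model = vl.model
    WHERE vr.review_id = ?
    ORDER BY vl.model
""", (22,)).fetchall()

print(flag_in_on)
print(flag_in_where)
```

With the sample data, only the first form returns a row for test3, which is what a page of checkboxes needs.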