I have several gigabytes of ArduCopter binary flight logs. Each log is a series of messages.
MessageType1: param1, param2, param3
MessageType2: param3, param4, param5, param6
...
The logs are self-describing in the sense that the first time a message type appears in the log, it declares the names of its params.
MessageType1: timestamp, a, b
MessageType1: value 1, value 2, value 3
MessageType2: timestamp, c, d, e
MessageType1: value 4, value 5, value 6
MessageType1: value 7, value 8, value 9
MessageType2: value 10, value 11, value 12, value 13
I have written a Python script that takes the logs apart and creates a table for each message type in a SQLite database, where the message type becomes the table name and the parameter names become the column names (a DDL sketch follows the example tables below).
Table MessageType1
| Flight Index | Timestamp | a | b |
|--------------|-----------|-------|---------|
| ... | | | |
| "Flight 1" | 111 | 14725 | 10656.0 |
| "Flight 1" | 112 | 57643 | 10674.0 |
| "Flight 1" | 113 | 57157 | 13674.0 |
| ... | | | |
| "Flight 2" | 111 | 56434 | 16543.7 |
| "Flight 2" | 112 | 56434 | 16543.7 |
Table MessageType2
| Flight Index | Timestamp | c | d | e |
|--------------|-----------|-------|---------|--------|
| ... | | | | |
| "Flight 1" | 111 | 14725 | 10656.0 | 462642 |
| "Flight 1" | 112 | 57643 | 10674.0 | 426428 |
| "Flight 1" | 113 | 57157 | 13674.0 | 642035 |
| ... | | | | |
| "Flight 2" | 111 | 56434 | 16543.7 | 365454 |
| "Flight 2" | 112 | 56434 | 16543.7 | 754632 |
| ... | | | | |
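For concreteness, the DDL behind the example tables above might look roughly like this. It's only a sketch, since the question doesn't show the script's actual output, and the column types are assumptions (SQLite's type affinity is forgiving either way):

-- Hypothetical reconstruction of the generated schema:
CREATE TABLE IF NOT EXISTS MessageType1 (
    flight_index TEXT,     -- currently a label such as "Flight 1"
    timestamp    INTEGER,
    a            INTEGER,
    b            REAL
);
CREATE TABLE IF NOT EXISTS MessageType2 (
    flight_index TEXT,
    timestamp    INTEGER,
    c            INTEGER,
    d            REAL,
    e            INTEGER
);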
For a single log this database is good enough, but I would like to add several logs, meaning that messages of the same type from several logs go into a single table.
In the example above I added a column "Flight Index", which is what I would like to have, but:
- Each log processed should have a unique identifier.
- The identifier should be minimal in size, as I'm dealing with tables that could have millions of rows.

I'm thinking of adding the flight index as an integer and simply incrementing it as logs are processed; if the database already exists, I would take the last row of a table and use its index + 1. Is this optimal, or is there a SQL-native way of doing this?
Am I doing something wrong in general? I'm not experienced with SQL.
EDIT: added a second table and example messages to show that messages don't have the same number of parameters.
You can achieve this with two tables.
Table 1
Flights (plus date, device, etc. — any other data points that make sense)

| Flight name | Flight number | Date     | ... |
|-------------|---------------|----------|-----|
| "Flight 1"  | 1             | 1/1/2018 | ... |
| "Flight 2"  | 2             | 1/2/2018 | ... |

Table 2
Flight_log

| Flight_number | Timestamp | parameter1 | parameter2 |
|---------------|-----------|------------|------------|
| 1             | 111       | 14725      | 10656.0    |
| 1             | 112       | 57643      | 10674.0    |
| 1             | 113       | 57157      | 13674.0    |
| ...           |           |            |            |
| 2             | 111       | 56434      | 16543.7    |
| 2             | 112       | 56434      | 16543.7    |
Before you load the Flight_log table you should have an entry in the Flights table; you can then do a "lookup" to get the Flight_number from the Flights table.
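For the lookup, SQLite can also hand back the key it just assigned, so a sketch along these lines avoids scanning any table for the last used index (table and column names are illustrative):

-- Flights gets an auto-assigned integer key (a SQLite rowid alias):
CREATE TABLE IF NOT EXISTS Flights (
    flight_number INTEGER PRIMARY KEY,
    flight_name   TEXT,
    flight_date   TEXT
);

-- Register the new log before loading its messages ...
INSERT INTO Flights (flight_name, flight_date) VALUES ('Flight 3', '2018-01-03');

-- ... then fetch the key that was just assigned and stamp it into every
-- row loaded into the per-message tables as the Flight_number:
SELECT last_insert_rowid();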
After reading about data normalization I ended up with the following database.
This minimizes the number of tables. I could have created 35 tables (one for each message type) with the right parameters as the columns of each, but that would make the database more fragile in the case where the parameters in a message change.
EDIT: replaced the image as the data modeler got fixed.
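The diagram itself is an image, but purely as an illustration, a generic normalized layout along the lines described (a handful of fixed tables, with parameters stored as rows rather than columns) might look like this; all names here are hypothetical:

-- One row per flight, per message type, and per parameter definition:
CREATE TABLE flight       (flight_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE message_type (message_type_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE parameter    (parameter_id INTEGER PRIMARY KEY,
                           message_type_id INTEGER REFERENCES message_type,
                           name TEXT);
-- One row per message occurrence, and one row per parameter value in it:
CREATE TABLE message      (message_id INTEGER PRIMARY KEY,
                           flight_id INTEGER REFERENCES flight,
                           message_type_id INTEGER REFERENCES message_type);
CREATE TABLE param_value  (message_id INTEGER REFERENCES message,
                           parameter_id INTEGER REFERENCES parameter,
                           value REAL);

A schema like this survives parameter changes between firmware versions, at the cost of more joins when querying.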
Related
I have a database where I need to retrieve the data in the same order in which it was populated into the table. The table name is bible. When I type table bible; in psql, it prints the data in the order it was populated, but when I try to query it, some rows always come back out of order, as in the example below:
table bible
-[ RECORD 1 ]-----------------------------------------------------------------------------------------------------------------------------------------
id | 1
day | 1
book | Genesis
chapter | 1
verse | 1
text | In the beginning God created the heavens and the earth.
link | https://api.biblia.com/v1/bible/content/asv.txt.txt?passage=Genesis1.1&key=dc5e2d416f46150bf6ceb21d884b644f
-[ RECORD 2 ]-----------------------------------------------------------------------------------------------------------------------------------------
id | 2
day | 1
book | John
chapter | 1
verse | 1
text | In the beginning was the Word, and the Word was with God, and the Word was God.
link | https://api.biblia.com/v1/bible/content/asv.txt.txt?passage=John1.1&key=dc5e2d416f46150bf6ceb21d884b644f
-[ RECORD 3 ]-----------------------------------------------------------------------------------------------------------------------------------------
id | 3
day | 1
book | John
chapter | 1
verse | 2
text | The same was in the beginning with God.
link | https://api.biblia.com/v1/bible/content/asv.txt.txt?passage=John1.2&key=dc5e2d416f46150bf6ceb21d884b644f
Everything is in order, but when I query the same thing using, for example, select * from bible where day='1', select * from bible where day='1' order by day, or select * from bible where day='1' order by day, id;, I always get some rows out of order, either within the selected day (here 1) or within any other day.
I have been using Django to interface with the Postgres database, but since I found this problem I have tried querying with plain SQL; still nothing, I get rows out of order, although they all have unique ids, which I verified with select count(distinct id), count(id) from bible;
-[ RECORD 1 ]------------------------------------------------------------------------------------------------------
id | 1
day | 1
book | Genesis
chapter | 1
verse | 1
text | In the beginning God created the heavens and the earth.
link | https://api.biblia.com/v1/bible/content/asv.txt.txt?passage=Genesis1.1&key=dc5e2d416f46150bf6ceb21d884b644f
-[ RECORD 2 ]-----------------------------------------------------------------------------------------------------------------------------------------
id | 10
day | 1
book | Colossians
chapter | 1
verse | 18
text | And he is the head of the body, the church: who is the beginning, the firstborn from the dead; that in all things he might have the preeminence.
link | https://api.biblia.com/v1/bible/content/asv.txt.txt?passage=Colossians1.18&key=dc5e2d416f46150bf6ceb21d884b644f
-[ RECORD 3 ]-----------------------------------------------------------------------------------------------------------------------------------------
id | 11
day | 1
book | Genesis
chapter | 1
verse | 2
text | And the earth was waste and void; and darkness was upon the face of the deep: and the Spirit of God moved upon the face of the waters.
link | https://api.biblia.com/v1/bible/content/asv.txt.txt?passage=Genesis1.2&key=dc5e2d416f46150bf6ceb21d884b644f
As you can see above, the ids are out of order: 1, 10, 11.
My table:
Table "public.bible"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
---------+------+-----------+----------+---------+----------+--------------+-------------
id | text | | | | extended | |
day | text | | | | extended | |
book | text | | | | extended | |
chapter | text | | | | extended | |
verse | text | | | | extended | |
text | text | | | | extended | |
link | text | | | | extended | |
Access method: heap
The id field is of type text because I used pandas' to_sql() method to populate the bible table. I tried dropping the id column and then adding it back as a PK with ALTER TABLE bible ADD COLUMN id SERIAL PRIMARY KEY; but I still get data returned out of order.
Is there any way I can retrieve the data ordered by id, without having some of the rows totally out of order? Thank you in advance!
Thou shalt cast thy id to integer to order it as a number.
SELECT * FROM bible ORDER BY cast(id AS integer);
While @jordanvrtanoski is correct, the way to do this in Django is:
>>> Bible.objects.extra(select={'id': 'CAST(id AS INTEGER)'}).order_by('id').values('id')
<QuerySet [{'id': 1}, {'id': 2}, {'id': 3}, {'id': 10}, {'id': 20}]>
Side note: If you want to filter on day as an example, you can do this:
>>> Bible.objects.extra(select={
'id': 'CAST(id AS INTEGER)',
'day': 'CAST(day AS INTEGER)'}
).order_by('id').values('id', 'day').filter(day=2)
<QuerySet [{'id': 2, 'day': 2}, {'id': 10, 'day': 2}, {'id': 11, 'day': 2}, {'id': 20, 'day': 2}]>
Otherwise you get this issue: (notice 1 is followed by 10 and not 2)
>>> Bible.objects.order_by('id').values('id')
<QuerySet [{'id': '1'}, {'id': '10'}, {'id': '2'}, {'id': '20'}, {'id': '3'}]>
I HIGHLY suggest you DO NOT do any of this, and instead set your tables up correctly (use the correct column types rather than having everything as text), or your query performance is going to suck... BIG TIME.
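For illustration, a correctly typed version of the table could be declared along these lines (a sketch; the exact types and constraints are assumptions):

-- Hypothetical correctly-typed schema; adjust NOT NULLs to taste:
CREATE TABLE bible (
    id      serial  PRIMARY KEY,
    day     integer NOT NULL,
    book    text    NOT NULL,
    chapter integer NOT NULL,
    verse   integer NOT NULL,
    text    text    NOT NULL,
    link    text
);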
Building on both answers from @jordanvrtanoski and @Javier Buzzi, plus some searching online: the issue is that the ids are of type TEXT (or VARCHAR), so you need to cast the id column to type INTEGER, as follows:
ALTER TABLE bible ALTER COLUMN id TYPE integer USING (id::integer);
Now here is my table
Table "public.bible"
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
---------+---------+-----------+----------+-----------------------------------------+----------+--------------+-------------
id | integer | | | nextval('bible_id_seq'::regclass) | plain | |
day | text | | | | extended | |
book | text | | | | extended | |
chapter | text | | | | extended | |
verse | text | | | | extended | |
text | text | | | | extended | |
link | text | | | | extended | |
Indexes:
"lesson_unique_id" UNIQUE CONSTRAINT, btree (id)
Referenced by:
TABLE "notes_note" CONSTRAINT "notes_note_verse_id_5586a4bf_fk" FOREIGN KEY (verse_id) REFERENCES days_lesson(id) DEFERRABLE INITIALLY DEFERRED
Access method: heap
Hope this helps other people, and thank you everyone!
I have 3 tables, A, B, and C, which I wish to join together, removing duplicate values of a field in A but giving preference to a certain value in C.
My tables are as follows.
A
+--------------+--------------+-----------------+
| Installation | Substructure | Description |
+--------------+--------------+-----------------+
| A | 12 | non-unique text |
+--------------+--------------+-----------------+
| A | 22 | Non-unique text |
+--------------+--------------+-----------------+
| B | 54 | Non-unique text |
+--------------+--------------+-----------------+
This is left joined with table B on the Substructure:
+--------------+-----------+
| Substructure | Reference |
+--------------+-----------+
| 12 | REF001 |
+--------------+-----------+
| 12 | REF002 |
+--------------+-----------+
| 12 | REF003 |
+--------------+-----------+
| 22 | REF004 |
+--------------+-----------+
| 22 | REF005 |
+--------------+-----------+
| 54 | REF006 |
+--------------+-----------+
| 54 | REF007 |
+--------------+-----------+
| 54 | REF008 |
+--------------+-----------+
This is further right joined with table C on the Reference:
+-----------+-----------------+---------------+
| Reference | Description | Template_Type |
+-----------+-----------------+---------------+
| REF001 | Some Text | PNID |
+-----------+-----------------+---------------+
| REF002 | More Text | ISO |
+-----------+-----------------+---------------+
| REF003 | Non-Unique Text | Phot |
+-----------+-----------------+---------------+
The current form of the code is something like
SELECT DISTINCT
A.Substructure,
A.Description,
B.Reference,
C.Description AS REF_DES
FROM A
LEFT JOIN B ON (A.SUBSTRUCTURE = B.SUBSTRUCTURE)
RIGHT JOIN C ON (B.REFERENCE = C.REFERENCE)
This works and returns every Template_Type and Reference associated with a given Substructure. However, what I'd like to do now is remove the duplicate substructure entries from the returned query, keeping those whose Template_Type is PNID; if a substructure has no entry with a PNID, I'd still like an entry for that substructure to be returned. If there's no document entry at all, then I'd also like an entry for that substructure returned.
I tried using various WHERE conditions to filter the results further, but obviously filtering on TEMPLATE_TYPE = value excludes all the substructures that do not have PNIDs.
Unfortunately I have no control over how the data is stored in the tables.
The solution to this was to run a sub-query filtering Table C for PNIDs and then joining against the query results.
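In sketch form, that looks like the following (names follow the question; reading "prefer PNID" as "join only PNID documents, and let the outer join keep every substructure" is my interpretation of the requirement):

SELECT
    A.Substructure,
    A.Description,
    P.Reference,
    P.REF_DES
FROM A
LEFT JOIN (
    -- Sub-query filtering table C down to PNID documents:
    SELECT B.Substructure,
           C.Reference,
           C.Description AS REF_DES
    FROM B
    JOIN C ON B.Reference = C.Reference
    WHERE C.Template_Type = 'PNID'
) P ON A.Substructure = P.Substructure

Substructures without a PNID (or without any document at all) still come back, just with NULL reference columns.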
Is there any way to have a primary key with a feature that increments it but fills in gaps? Assuming I have the following table:
____________________
| ID | Value |
| 1 | A |
| 2 | B |
| 3 | C |
^^^^^^^^^^^^^^^^^^^^^
Notice that the values are only examples; the order has nothing to do with the question.
Once I remove the row with the ID of 2, the table will look like this:
____________________
| ID | Value |
| 1 | A |
| 3 | C |
^^^^^^^^^^^^^^^^^^^^^
And when I add another row, with the regular auto-increment feature it will look like this:
____________________
| ID | Value |
| 1 | A |
| 3 | C |
| 4 | D |
^^^^^^^^^^^^^^^^^^^^^
As expected.
The output I'd want would be:
____________________
| ID | Value |
| 1 | A |
| 2 | D |
| 3 | C |
^^^^^^^^^^^^^^^^^^^^^
Here the gap is filled by the new row. Note that in memory it might look different; the point is that the primary key would fill in the gaps.
Given the primary keys 1, 2, 3, 6, 7, 10, 11 (for instance), 4 should be filled in first, then 5, then 8, and so on... When the table is empty (even if it had a million rows before), it should start over from 1.
How do I accomplish that? Is there any built-in feature similar to that? Can I implement it?
EDIT: If it's not possible, why not?
No, you don't want to do that, as juergen-d said. It's unlikely to do what you think it is doing, and it will do it even less in a multi-user environment.
In a multi-user environment you are likely to get gaps even when there are no deletes, simply from aborted inserts.
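For completeness, if you really did want to find the first free id despite the advice above, the usual self-join sketch looks like this (assuming a table t with an integer primary key id; it misses a gap before the smallest id and races with concurrent inserts):

-- Smallest id whose successor is missing; the first gap is that id + 1:
SELECT MIN(t1.id) + 1 AS first_gap
FROM t AS t1
LEFT JOIN t AS t2 ON t2.id = t1.id + 1
WHERE t2.id IS NULL;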
I'm working on an old SQL Server database that has several tables that look like the following:
|-------------|-----------|-------|------------|------------|-----|
| MachineName | AlarmName | Event | AlarmValue | SampleTime | ... |
|-------------|-----------|-------|------------|------------|-----|
| 3 | 180 | 8 | 6.780 | 2014-02-24 | |
| 9 | 67 | 8 | 1.45 | 2014-02-25 | |
| ... | | | | | |
|-------------|-----------|-------|------------|------------|-----|
There is a separate table in the database that only contains unique strings, as well as the index for each unique string. The unique string table looks like this:
|----------|--------------------------------|
| Id | String |
|----------|--------------------------------|
| 3 | MyMachine |
| ... | |
| 8 | High CPU Usage |
| ... | |
| 67 | 404 Error |
| ... | |
|----------|--------------------------------|
Thus, when we want to get something out of the database, we get the respective rows out, then look up each missing string based on the index value.
What I'm hoping to do is replace all of the string indexes with the actual values in a single query, without having to do post-processing on the query result.
However, I can't figure out how to do this in a single query. Do I need to use multiple JOINs? I've only been able to figure out how to replace a single value by doing something like:
SELECT UniqueString.String AS "MachineName" FROM UniqueString
JOIN Alarm ON Alarm.MachineName = UniqueString.Id
Any help would be much appreciated!
Yes, you can do multiple joins to the UniqueString table, but change the order to start with the table you are reporting on, and use a unique alias for each join. Something like:
SELECT MN.String AS 'MachineName', AN.String as 'AlarmName' FROM Alarm A
JOIN UniqueString MN ON A.MachineName = MN.Id
JOIN UniqueString AN ON A.AlarmName = AN.Id
and so on for any other columns.
Description
According to the EXPLAIN command, there is a range condition that is causing the query to perform a full table scan (160k rows). How do I keep the range condition and reduce the scanning? I expect the culprit to be:
Y.YEAR BETWEEN 1900 AND 2009 AND
Code
Here is the code that has the range condition (the STATION_DISTRICT is likely superfluous).
SELECT
COUNT(1) as MEASUREMENTS,
AVG(D.AMOUNT) as AMOUNT,
Y.YEAR as YEAR,
MAKEDATE(Y.YEAR,1) as AMOUNT_DATE
FROM
CITY C,
STATION S,
STATION_DISTRICT SD,
YEAR_REF Y FORCE INDEX(YEAR_IDX),
MONTH_REF M,
DAILY D
WHERE
-- For a specific city ...
--
C.ID = 10663 AND
-- Find all the stations within a specific unit radius ...
--
6371.009 *
SQRT(
POW(RADIANS(C.LATITUDE_DECIMAL - S.LATITUDE_DECIMAL), 2) +
(COS(RADIANS(C.LATITUDE_DECIMAL + S.LATITUDE_DECIMAL) / 2) *
POW(RADIANS(C.LONGITUDE_DECIMAL - S.LONGITUDE_DECIMAL), 2)) ) <= 50 AND
-- Get the station district identification for the matching station.
--
S.STATION_DISTRICT_ID = SD.ID AND
-- Gather all known years for that station ...
--
Y.STATION_DISTRICT_ID = SD.ID AND
-- The data before 1900 is shaky; insufficient after 2009.
--
Y.YEAR BETWEEN 1900 AND 2009 AND
-- Filtered by all known months ...
--
M.YEAR_REF_ID = Y.ID AND
-- Whittled down by category ...
--
M.CATEGORY_ID = '003' AND
-- Into the valid daily climate data.
--
M.ID = D.MONTH_REF_ID AND
D.DAILY_FLAG_ID <> 'M'
GROUP BY
Y.YEAR
Update
The SQL is performing a full table scan, which results in MySQL performing a "copy to tmp table", as shown here:
+----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+
| 1 | SIMPLE | C | const | PRIMARY | PRIMARY | 4 | const | 1 | |
| 1 | SIMPLE | Y | range | YEAR_IDX | YEAR_IDX | 4 | NULL | 160422 | Using where |
| 1 | SIMPLE | SD | eq_ref | PRIMARY | PRIMARY | 4 | climate.Y.STATION_DISTRICT_ID | 1 | Using index |
| 1 | SIMPLE | S | eq_ref | PRIMARY | PRIMARY | 4 | climate.SD.ID | 1 | Using where |
| 1 | SIMPLE | M | ref | PRIMARY,YEAR_REF_IDX,CATEGORY_IDX | YEAR_REF_IDX | 8 | climate.Y.ID | 54 | Using where |
| 1 | SIMPLE | D | ref | INDEX | INDEX | 8 | climate.M.ID | 11 | Using where |
+----+-------------+-------+--------+-----------------------------------+--------------+---------+-------------------------------+--------+-------------+
Answer
After using STRAIGHT_JOIN:
+----+-------------+-------+--------+-----------------------------------+---------------+---------+-------------------------------+------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+--------+-----------------------------------+---------------+---------+-------------------------------+------+---------------------------------+
| 1 | SIMPLE | C | const | PRIMARY | PRIMARY | 4 | const | 1 | Using temporary; Using filesort |
| 1 | SIMPLE | S | ALL | PRIMARY | NULL | NULL | NULL | 7795 | Using where |
| 1 | SIMPLE | SD | eq_ref | PRIMARY | PRIMARY | 4 | climate.S.STATION_DISTRICT_ID | 1 | Using index |
| 1 | SIMPLE | Y | ref | PRIMARY,STAT_YEAR_IDX | STAT_YEAR_IDX | 4 | climate.S.STATION_DISTRICT_ID | 1650 | Using where |
| 1 | SIMPLE | M | ref | PRIMARY,YEAR_REF_IDX,CATEGORY_IDX | YEAR_REF_IDX | 8 | climate.Y.ID | 54 | Using where |
| 1 | SIMPLE | D | ref | INDEX | INDEX | 8 | climate.M.ID | 11 | Using where |
+----+-------------+-------+--------+-----------------------------------+---------------+---------+-------------------------------+------+---------------------------------+
Related
http://dev.mysql.com/doc/refman/5.0/en/how-to-avoid-table-scan.html
http://dev.mysql.com/doc/refman/5.0/en/where-optimizations.html
Optimize SQL that uses between clause
Thank you!
ONE Request... It looks like you KNOW your data. Add the keyword "STRAIGHT_JOIN" and see the results...
SELECT STRAIGHT_JOIN ... the rest of your query...
STRAIGHT_JOIN tells MySQL to DO IT AS I HAVE LISTED. So, your CITY table is first in the FROM list, indicating that you expect it to be your primary table... Additionally, your WHERE clause on the CITY is the immediate filter. With that being said, it will probably fly through the rest of the query...
Hope it helps... It's worked for me with gov't data of millions of records queried and joined to 10+ lookup tables where MySQL was trying to think for me.
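For reference, STRAIGHT_JOIN is MySQL-specific; a minimal, self-contained illustration of the syntax with two hypothetical tables a and b:

-- Tables are read in the order listed in FROM,
-- rather than the order the optimizer would choose:
SELECT STRAIGHT_JOIN a.id, b.val
FROM a
JOIN b ON b.a_id = a.id;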
In order to do efficient BETWEEN queries you are going to want a B-tree index on your YEAR column, for example:
CREATE INDEX id_index USING BTREE ON YEAR_REF (YEAR);
B-tree indexes allow for efficient range queries. If this is in fact the root problem, then having an index like this should get rid of the full table scan and have it scan only the part of the table that is in the range. Read more about B-trees on Wikipedia.
However, as with any optimisation advice, you should measure to make sure that you don't do more harm than good.
Can you change from searching within a radius to searching within a bounding box?
You know the city, so you can calculate a bounding box in your application.
Perhaps this:
S.LATITUDE_DECIMAL >= latitude_lower and
S.LATITUDE_DECIMAL <= latitude_upper and
S.LONGITUDE_DECIMAL >= longitude_lower and
S.LONGITUDE_DECIMAL <= longitude_upper
could be a little faster?
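If it helps, the box can even be derived in SQL from the city row itself. A rough sketch, assuming roughly 111 km per degree of latitude (an approximation) and the 50 km radius used in the query above:

-- Hypothetical bounding-box computation for city 10663:
SELECT
    C.LATITUDE_DECIMAL  - 50 / 111.0 AS latitude_lower,
    C.LATITUDE_DECIMAL  + 50 / 111.0 AS latitude_upper,
    C.LONGITUDE_DECIMAL - 50 / (111.0 * COS(RADIANS(C.LATITUDE_DECIMAL))) AS longitude_lower,
    C.LONGITUDE_DECIMAL + 50 / (111.0 * COS(RADIANS(C.LATITUDE_DECIMAL))) AS longitude_upper
FROM CITY C
WHERE C.ID = 10663;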