Hypothetical example:
I have an SQL table that contains a billion or so transactions:
| Cost | DateTime |
| 1.00 | 2009-01-02 |
| 2.00 | 2009-01-03 |
| 2.00 | 2009-01-04 |
| 3.00 | 2009-01-05 |
| 1.00 | 2009-01-06 |
...
What I want is to pare down the data so that I only see the cost transitions:
| Cost | DateTime |
| 1.00 | 2009-01-02 |
| 2.00 | 2009-01-03 |
| 3.00 | 2009-01-05 |
| 1.00 | 2009-01-06 |
...
The simplest (and slowest) way to do this is to iterate over the entire table, tracking the changes. Is there a faster/better way to do this in SQL?
No. There is no faster way. You could write a query that does the same job, but it will be much slower. You (as a developer) know that each value only needs to be compared with the one directly before it, and there is no way to express that in SQL, so you can apply optimizations that SQL cannot.
So I imagine the fastest approach is to write a program that streams the results from disk, holding in RAM only the last valid value and the current one (filtering out every value that equals the last valid one).
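For reference, a query of that kind (comparing each row with its predecessor) can be sketched with a window function where one is available; this is only a sketch, assuming the table is named transactions and that DateTime alone orders the rows:
SELECT Cost, DateTime
FROM (
    SELECT Cost,
           DateTime,
           LAG(Cost) OVER (ORDER BY DateTime) AS prev_cost
    FROM transactions
) t
WHERE prev_cost IS NULL      -- keep the first row
   OR prev_cost <> Cost      -- keep rows where the cost changed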
This is a classic example of trying to use a sledgehammer when a hammer is needed. You want to extract some crazy reporting data out of a table, but doing so is going to KILL your SQL Server. What you need to do to track changes is create a tracking table specifically for this purpose, then use a trigger that records each change in value into that table. So on my products table, when I change the price, the change goes into the price tracking table.
If you are using this to track stock prices or something similar, then again you use the same approach, except you compare against the price table and save a row only when a change occurs. The comparison then happens only on new data; all the old results are already housed in one location, so you don't need to rerun the query that would kill your SQL Server's performance.
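A rough sketch of the kind of trigger described above, in SQL Server syntax (the products table and the price_history tracking table, and their columns, are assumed names for illustration only):
CREATE TABLE price_history (
    product_id INT,
    old_price  DECIMAL(10,2),
    new_price  DECIMAL(10,2),
    changed_at DATETIME DEFAULT GETDATE()
);
GO
CREATE TRIGGER trg_products_price_change
ON products
AFTER UPDATE
AS
BEGIN
    -- record only rows whose price actually changed
    INSERT INTO price_history (product_id, old_price, new_price)
    SELECT d.product_id, d.price, i.price
    FROM inserted i
    JOIN deleted d ON d.product_id = i.product_id
    WHERE d.price <> i.price;
END;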
I have a very basic understanding of SQL.
I have tables on a Confluence page with various data. I would like to create a template my team can use: they fill in those tables with some values, and then an automatic calculation is performed by an SQL query using the Table Transformer module.
The structure would be something like this:
workload table
+-------+-------+
| task | hours |
+-------+-------+
| task1 | 2 |
| task2 | 5 |
+-------+-------+
rate table (yes, usually it will be one person)
+--------+------+
| person | rate |
+--------+------+
| John | 500 |
+--------+------+
project table (yes, usually it will be a single project)
+----------+----------+
| project | duration |
+----------+----------+
| project1 | 15 |
+----------+----------+
What I would like to achieve is to compute the sum of the task hours multiplied by the rate, multiplied by the duration. Don't mind the units; I have put together a simple example, but on my side the units work out.
Of course I could use constants, and that would be way easier. But the way Confluence works, I can't do it easily: I can't fetch constants from the template in the SQL query (or at least I did not succeed in achieving that), and putting constants in the SQL code is not very user-friendly for my colleagues, who would for sure forget to change the numbers from time to time.
Basically, if I could easily reduce a table in SQL to a constant I can use everywhere, I think that would achieve what I want, but I don't know if this is easily feasible in SQL. Apparently, in Confluence, I cannot use the CREATE FUNCTION capabilities.
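For what it's worth, the calculation itself boils down to a single cross join; a minimal sketch, assuming the three tables are visible to the query as workload, rate and project (the actual names or aliases depend on how the Table Transformer exposes the source tables):
-- with one rate row and one project row this equals SUM(hours) * rate * duration
SELECT SUM(w.hours * r.rate * p.duration) AS total_cost
FROM workload w
CROSS JOIN rate r
CROSS JOIN project p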
For my future project I have a ClickHouse DB. This DB is fed by several micro-services, themselves fed by RabbitMQ.
The data look like:
| Datetime | nodekey | value |
| 2018-01-01 00:10:00 | 15 | 156 |
| 2018-01-01 00:10:00 | 18 | 856 |
| 2018-01-01 00:10:00 | 86 | 8 |
| 2018-01-01 00:20:00 | 15 | 156 |
| 2018-01-01 00:20:00 | 18 | 84 |
| 2018-01-01 00:20:00 | 86 | 50 |
......
So for hundreds of different nodekeys, I have a value every 10 minutes.
I need another table with the sum or the mean (depending on the nodekey type) of the values for every hour...
My first idea is just to use a crontab...
But the data doesn't come in a steady flow: sometimes a micro-service adds 2-3 new values, sometimes a week of data comes in at once... and, rarely, I have to bulk insert a year of new data...
And for the moment I only have hundreds of nodekeys, but the project is going to grow.
So I think using a crontab or looping through the DB to update the data isn't a good idea...
What are my other options?
How about just creating a view?
create view myview as
select
toStartOfHour(datetime) date_hour,
nodekey,
sum(value) sum_value
from mytable
group by
toStartOfHour(datetime),
nodekey
The advantage of this approach is that you don't need to worry about refreshing the data. When querying the view, you actually access the underlying live data. The downside is that it might not scale well when your dataset becomes really big (queries addressing the view will tend to slow down).
An intermediate option would be to use a materialized view, which will persist the data. If I understand the ClickHouse documentation correctly, materialized views are automatically updated when data is inserted into the source table, which seems to be close to what you are looking for (however, you need to use the proper engine, and this might impact the performance of your inserts).
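A sketch of such a materialized view, assuming the source table is called mytable and that a SummingMergeTree target is acceptable (when reading it, you would still sum(sum_value) and group by, because the engine merges partial sums in the background):
create materialized view myview_hourly
engine = SummingMergeTree()
order by (date_hour, nodekey)
populate
as select
    toStartOfHour(datetime) as date_hour,
    nodekey,
    sum(value) as sum_value
from mytable
group by date_hour, nodekey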
I made a reminder application that heavily writes and reads records with future datetimes, and much less so records with past datetimes. These reminders are indexed by remind_at, so a million records means a million entries in the index, but it speeds up checking which records must be reminded in the next hour.
| uuid | user_id | text | remind_at | ... | ... | ... |
| ------- | ------- | ------------ | ------------------- | --- | --- | --- |
| 45c1... | 23 | Buy paint | 2019-01-01 20:00:00 | ... | ... | ... |
| 23f1... | 924 | Pick up car | 2019-02-01 20:00:00 | ... | ... | ... |
| 2d84... | 650 | Call mom | 2020-03-01 20:00:00 | ... | ... | ... |
| 3f1a... | 81 | Get shoes | 2020-04-01 20:00:00 | ... | ... | ... |
The problem is performance. Once the database grows big, retrieving any record becomes relatively slow.
I'm trying to find out which RDBMSs offer a fully or semi-automated way to get better retrieval performance for future datetimes, since past datetimes are rarely retrieved or checked.
A neat solution, if it exists, would be to instruct the RDBMS to prune old entries from the index. I don't know if any RDBMS allows that, but in PostgreSQL, SQL Server, and SQLite there is a way to use a "partial index". But what would happen if I recreate an index on a table with millions of records?
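For example, the kind of partial index I mean would look roughly like this in PostgreSQL (the reminders table name and the cutoff date are made up; the predicate has to be a constant, so the index would need to be dropped and recreated periodically with a new cutoff, which CREATE INDEX CONCURRENTLY can do without blocking writes):
-- sketch only: the WHERE predicate must be a constant expression
CREATE INDEX CONCURRENTLY reminders_upcoming_idx
    ON reminders (remind_at)
    WHERE remind_at >= DATE '2020-01-01';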
Some solutions that didn't fit the bill:
Horizontal scaling: It would replicate the same problem n times.
Vertical scaling: Still doesn't fix the problem.
Sharding: Could be, since every instance would hold a part of the database, but the app would have to handle the "sharding key".
Two databases: Okay, one fast and one slow. Moving old entries to the "slow instance" (toaster) would have to be done manually. Also, the app would have to be heavily modified to check both databases, since it doesn't initially know where a record lives. The logic grows heavily.
Anyway, the whole point is to make retrieval of future (or upcoming) reminders snappier, while disregarding the performance of retrieving older entries.
Is it possible to convert a table with many columns into many tables of two columns without losing data?
I will show what I mean:
Let's say I have a table
+------------+----------+-------------+
|country code| site | advertiser |
+------------+----------+-------------+
| US | facebook | Cola |
| US | yahoo | Pepsi |
| FR | facebook | BMW |
| FR | yahoo | BMW |
+------------+----------+-------------+
The number of rows = (number of countries) × (number of sites), and the advertiser column is a variable that takes its value from a limited list of advertisers.
Is it possible to transform the 3-column table into several tables with 2 columns without losing data?
If I create two tables like this, I will surely lose data:
+------------+------------+
|country code| advertiser |
+------------+------------+
| US | Cola,Pepsi |
|-------------------------|
| FR | BMW |
+-------------------------+
+------------+------------+
| site | advertiser |
+------------+------------+
| facebook | Cola,BMW |
|-------------------------|
| yahoo | Pepsi,BMW |
+-------------------------+
But if I add a third "connection" table, will that help keep all the data and make it possible to recreate the original table?
+--------------+--------------------+
| country code | site |
+--------------+--------------------+
| US | facebook,yahoo |
|-----------------------------------|
| FR | facebook,yahoo |
+-----------------------------------+
Whether the table you specify can be 'converted' into multiple tables is determined by whether the table is in fifth normal form, i.e. if and only if every non-trivial join dependency in it is implied by the candidate keys.
If the table is in fifth normal form then it cannot be converted into multiple tables. If the table is not in fifth normal form then it is in one of the four lower normal forms and can be further normalized into fifth normal form by 'converting' it into multiple tables.
A table's normal form is determined by the column dependencies. These are determined by the meaning of the table i.e. what this table represents in the real world. You have not stated what the meaning of this table is and so whether this particular table can be converted into multiple tables is unknown.
You need to understand the process of normalization; using it, you should be able to determine whether it is possible to convert a table with many columns into many tables of two columns without losing data, based on the column dependencies in the table.
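To make that concrete with the example table, a two-column decomposition would use plain rows rather than comma-separated lists (the table and column names below are assumptions for illustration):
CREATE TABLE country_site       (country_code VARCHAR(2),  site VARCHAR(50));
CREATE TABLE country_advertiser (country_code VARCHAR(2),  advertiser VARCHAR(50));
CREATE TABLE site_advertiser    (site VARCHAR(50), advertiser VARCHAR(50));

-- attempted reconstruction of the original three-column table
SELECT cs.country_code, cs.site, sa.advertiser
FROM country_site cs
JOIN site_advertiser sa    ON sa.site = cs.site
JOIN country_advertiser ca ON ca.country_code = cs.country_code
                          AND ca.advertiser   = sa.advertiser;
If the join dependency is not implied by the keys, this rejoin can produce rows that were never in the original table, which is exactly why the normal form matters.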
You may be looking for Entity-Attribute-Value. Certainly it is much better than your proposal for keeping field values organized and not requiring a search of the field to determine if a value is present.
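A minimal EAV sketch (the table and column names are made up for illustration):
CREATE TABLE placement_attribute (
    placement_id INT,          -- the entity, e.g. one country/site placement
    attribute    VARCHAR(50),  -- 'country_code', 'site' or 'advertiser'
    value        VARCHAR(50),
    PRIMARY KEY (placement_id, attribute)
);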
I was playing with the following, but it's not there just yet.
ALTER TABLE `product_price` CHANGE `price` = `price` - 20;
What you're looking for is this:
UPDATE product_price SET price = price - 20;
So if your data looks like this:
| id | price |
|----|---------------|
| 1 | 25.20 |
| 2 | 26.50 |
| 3 | 27.00 |
| 4 | 24.25 |
It will turn it to this:
| id | price |
|----|---------------|
| 1 | 5.20 |
| 2 | 6.50 |
| 3 | 7.00 |
| 4 | 4.25 |
As tehvan pointed out in your comments, ALTER is used when you want to change the structure of the table. From the docs:
ALTER TABLE enables you to change the structure of an existing table. For example, you can add or delete columns, create or destroy indexes, change the type of existing columns, or rename columns or the table itself. You can also change the comment for the table and type of the table.
If you want to update information in any way you want to use the UPDATE statement.
As Paolo Bergantino mentioned, you tried to alter the structure of the table rather than the data contained in it. SQL is made up of different parts, each responsible for something different. For defining your data structures (tables, views, etc.) you use DDL (Data Definition Language). For manipulating data, on the other hand, you use DML (Data Manipulation Language).
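For instance (the added currency column here is purely illustrative):
-- DDL: changes the table's structure
ALTER TABLE product_price ADD COLUMN currency CHAR(3);

-- DML: changes the data stored in the table
UPDATE product_price SET price = price - 20;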
This site shows the different parts of SQL, along with examples.