Evening all,
I am attempting to create a table that stores a series of daily web-usage stats for my application (trivial things like number of new users, total visits, etc.). I am currently querying these on the fly, but I would now like to start storing them, partly for performance (reducing a load of aggregate queries to a single lookup) and partly to allow for historic analysis.
I have come up with the following basic schema for the table (there will be more columns than this; it is just to give an idea):
create table web_stats(
    web_stat_id bigserial primary key,
    date_created timestamp not null default now(),
    user_count integer not null,
    new_user_count integer not null
);
comment on table web_stats is 'Table stores statistics on web usage';
Now, I am happy to create the queries to populate the table going forward (I am using the Quartz scheduler to run the queries daily).
However, I am not so sure of the best way to populate the table retrospectively for past dates. Should I use an INSERT statement to create a blank row for every day since the application went live (about two years ago), and then use an UPDATE to populate the blank rows? Or can this be done in one fell swoop? Can someone provide some SQL for creating the rows?
If there is anything wrong with my design assumptions please let me know!
This is how I ended up doing it:
INSERT INTO web_stats (date_created)
SELECT DATE('2011-08-20')+x.id
FROM generate_series(0,521) AS x(id);
Where 2011-08-20 is the date the application went live, and 521 is the number of days from then until now.
This creates the empty rows so that I can use the date_created field to populate the other fields with UPDATE statements.
Maybe not the most efficient method, but it works.
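For anyone who wants the one-fell-swoop version instead, the whole backfill can be done with a single INSERT ... SELECT that joins the generated dates to the source data. The following is only a sketch; the users table and its date_created column are assumptions and would need to be adapted to the real schema:
-- Sketch only: 'users' and its 'date_created' column are hypothetical names.
INSERT INTO web_stats (date_created, user_count, new_user_count)
SELECT d.stat_date,
       (SELECT count(*) FROM users u
         WHERE u.date_created < d.stat_date + 1),  -- total users up to the end of that day
       (SELECT count(*) FROM users u
         WHERE u.date_created >= d.stat_date
           AND u.date_created <  d.stat_date + 1)  -- users created on that day
FROM (
    SELECT DATE '2011-08-20' + x.id AS stat_date
    FROM generate_series(0, 521) AS x(id)
) d;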
I manage data-tier applications for a small company, and my software is receiving criticism for the fact that part-costing information can't be retrieved historically. So, for instance, what they would like is to be able to retrieve, at any point in time, the cost of a part as it was six months ago.
They used to do this through spreadsheets. They would copy the part table every day into a .xlsx file, and then any time they wanted to know "hey, what was the cost of that part on Jan 20 of last year?", they could just pull it up in Excel.
So, we've begun doing the same thing in SQL, and the plan so far is that we will create a new table each time the part costs are updated, name the table with today's date, and persist it in a database for archived information. Then we're planning to pull in whichever table we need according to its timestamp.
I can't help but think this is going to get very messy. Is this a bad approach for archiving data? Are there any industry standards I can adhere to for solving this problem in as few headaches as possible?
You are right ... this solution will be messy.
The simplest thing you can do is to create a history table, say Parts_History, that has all the same columns as the main Parts table plus additional timestamp column(s) to track updates. Every time there is a new price for a part (which I hope is done through a stored procedure), the existing price gets moved into the history table and the main table gets updated, all inside one transaction. If you don't have a single stored procedure that handles the update, then you can do the same thing inside a trigger.
I will try and see if there are any good examples out there.
As far as I know there is no standard, but the approach is rather obvious. You have a table, say part(partId int primary key, price decimal). Create an audit table part_audit(auditId int identity(1,1) primary key, partId int, price decimal, dateChange datetime default getdate()) and a trigger on part after update, delete. In the trigger, check update(price) and, if so, insert into part_audit from deleted. To find a historical price, select the nearest dateChange after the date of interest.
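A minimal T-SQL sketch of that trigger-based audit, assuming the part and part_audit tables exist exactly as described above (names, types, and the example values in the lookup are illustrative only):
-- Sketch only: archives the old price on every price change or delete.
create trigger trg_part_audit on part
after update, delete
as
begin
    -- update(price): the price column was touched; empty inserted: the row was deleted
    if update(price) or not exists (select 1 from inserted)
        insert into part_audit (partId, price)
        select partId, price
        from deleted;
end;

-- Historical lookup: the price in effect on a given date is the first audit row recorded after it.
select top (1) price
from part_audit
where partId = 42 and dateChange >= '2012-01-20'
order by dateChange;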
I want a table to be sync-able by a web API.
For example,
GET /projects?sequence_latest=2113&limit=10
[{"state":"updated", "id":12,"sequence":2116},
{"state":"deleted" "id":511,"sequence":2115}
{"state":"created", "id":601,"sequence":2114}]
What is a good schema to achieve this?
I intend this for PostgreSQL with the Django ORM, which uses surrogate keys. The presence of an ORM may rule out answers that rely on UNION queries.
I can come up with only half-solutions.
I could have a modified_time column, but we cannot convey deletions.
I could have a table for storing deleted IDs; when returning 10 new/updated rows, I could also return all the deleted rows between them. But this works only when the latest change is an insert/update and there is a moderate number of deleted rows.
I could set a deleted flag on the row and null the rest, but it's kind of bad schema design to make every column nullable.
I could have another table that stores the ID, a modification sequence number and a state (new, updated, deleted), but that's another table to maintain, and setting sequence numbers causes contention; imagine n concurrent requests querying for the latest ID.
If you're using an ORM you want simple(ish) and if you're serving the data via an API you want quick.
To go through your suggested options:
Correct, so this doesn't help you. You could have a deleted flag in your main table though.
This seems quite a random way of doing it and breaks your insistence that there be no UNION queries.
Not sure why you would need to NULL the rest of the columns here? What benefit does this bring?
I would strongly advise against having a table that has a modification sequence number. Either this means that you're performing a lot of analytic queries in order to find out the most recent state or you're updating the same rows multiple times and maintaining a table with the same PK as your normal one. At that point you might as well have a deleted flag in your main table.
Essentially, the design of your API gives you one easy option: you should have everything in the same table because all data is being returned through the same method. I would follow your point 2 and Wolph's suggestion and have a deleted_on column in your table, making it look like:
create table my_table (
id ... primary key
, <other_columns>
, created_on date
, modified_on date
, deleted_on date
);
I wouldn't even bother updating all the other columns to be NULL. If you want to ensure that you return no data for deleted rows, create a view on top of your table that nulls the data where the deleted_on column is populated. Then your API only accesses the table through the view.
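One possible sketch of such a view (the column list follows the hypothetical my_table above; any real columns would need the same CASE treatment):
-- Sketch only: expose NULLs for everything except the key once a row is soft-deleted.
create view my_table_api as
select id,
       case when deleted_on is null then created_on  end as created_on,
       case when deleted_on is null then modified_on end as modified_on,
       deleted_on
from my_table;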
If you are really, really worried about space and the volume of records and will perform regular database maintenance to ensure that both are controlled then maybe go with option 4. Create a second table that has the state of each ID in your main table and actually delete the data from your main table. You then can do a LEFT OUTER JOIN to the main table to get the data. When there is no data that ID has been deleted. Honestly, this is overkill until you know whether you will definitely require it.
You don't mention why you're using a web API for data transfers; but if you're going to be transferring a lot of data, or using this for internal systems only, it might be worth using a lower-level transfer mechanism.
I have a database of users for a web API, but I also want to store usage history for each user, i.e. page request count, data volumes, etc. What is the best way to implement this, in terms of database structure? My initial thought was to retain the main table but then create a history table for each user. That seems horribly impractical, however. My gut feeling is that I probably need one separate table for usage history, but I am unclear as to how to structure it.
I am using SQLite.
For an event-logging model (which is what you want), I can recommend two options.
Option 1: one table, let's call it activity_log.
create table activity_log (
    id integer primary key,
    user_id integer not null,
    event_type varchar(10),
    event_time timestamp
);
For each event in your system affecting a user, you insert a record into this table (I believe the column names are self-explanatory). I believe SQLite doesn't provide a native TIMESTAMP type, so you'll have to handle the storage in your application code. What this design leaves you with is a table that has the potential to grow very large, but it will give you fine-grained statistics. SQLite doesn't support clustered indexes, but there are some options here that will help you out with performance tuning.
Option 2: the same table as above, only instead of inserting a new row for every event, you perform a conditional insert, i.e. update the existing row for users already in the table and insert a row for new users. This option will keep your table several times smaller than the one above, but you'll only have access to the most recent use of your API.
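If you do go with option 2, recent SQLite versions (3.24+) can express that conditional insert as an upsert. A rough sketch, assuming this variant adds a UNIQUE constraint on user_id:
-- Sketch only: requires a unique index on user_id for ON CONFLICT to target.
insert into activity_log (user_id, event_type, event_time)
values (:user_id, :event_type, :event_time)
on conflict (user_id) do update
set event_type = excluded.event_type,
    event_time = excluded.event_time;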
If you can afford it, I'd say go with number 1.
In one of my programs, I maintain a table of module usage per user. The structure of the table is
table id
user id
prog id
date/time
history flag (0=current, 1=history)
runs (number of time user has run program on date)
About once a week, I aggregate the data in the table: if user 1 has run program 1 twice on a given date, then initially there will be two entries in the table:
1;1;1;04/10/12 08:56;0;1
2;1;1;04/10/12 09:33;0;1
After aggregation, the table becomes
3;1;1;04/10/12 00:00;1;2
Whilst the aggregation loses the time part, no other data is lost and queries against the table will be quicker.
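In SQL, that weekly aggregation might look roughly like the sketch below; the table name module_usage and its column names are stand-ins for the informal column list above, and both statements belong in one transaction:
-- Sketch only: collapse current rows (flag 0) into one history row (flag 1) per user/program/date.
insert into module_usage (user_id, prog_id, run_time, history_flag, runs)
select user_id, prog_id, date(run_time), 1, sum(runs)
from module_usage
where history_flag = 0
group by user_id, prog_id, date(run_time);

-- remove the detail rows that have just been rolled up
delete from module_usage
where history_flag = 0;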
I'm building a social network, and I have run into a problem.
So, which one is faster (to keep messages):
To have one database,
and to create a new table (for messages) per new user?
Like this:
CREATE DATABASE `user_messages`;
CREATE TABLE `user_id` (
    id int(32) NOT NULL PRIMARY KEY,
    `new` ENUM ('Y', 'N') NOT NULL DEFAULT 'Y',
    `time` timestamp NOT NULL,
    from_id int(32)
);
OR,
To keep all messages in one single table (with replication)?
(Using INDEXes)
What if there are a billion rows?
Like this:
INSERT INTO `user_messages` (id, new, time, from_id) VALUES ('id_value', 'Y', now(), 'friend_id');
Creating one table per user will become a nightmare to query without using dynamically generated SQL all the time.
A far better option is to create one single table and store all messages for all users in that table, with a foreign key back to the users table. The foreign key will be indexed and should not cause a massive performance problem. If you think you are going to have billions of rows (or messages), then your database architecture should be scaled accordingly to handle that quantity of data, but your database design shouldn't change because of this.
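A minimal sketch of that single-table design in MySQL (names are illustrative; the users table and its user_id key are assumptions):
-- Sketch only: one row per message, for all users; InnoDB indexes the foreign key column automatically.
CREATE TABLE user_messages (
  message_id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  user_id INT NOT NULL,
  from_id INT NOT NULL,
  `new` ENUM('Y','N') NOT NULL DEFAULT 'Y',
  `time` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  CONSTRAINT fk_user_messages_user FOREIGN KEY (user_id) REFERENCES users (user_id)
) ENGINE=InnoDB;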
So, which one is faster (to keep messages):
It's probable that neither is faster, although I suspect SQL Server has an upper limit on tables that is significantly lower than its upper limit on rows. And I don't know whether you're going down some slippery performance slope with transactions that span different databases.
Proper database design would dictate the single table.
(Using INDEXes) What if there're billion rows? Like this:
Well, maybe you should get to a billion rows before you start solving that problem. But note that you can partition the data (Google SQL Server Table Partitioning) for numerous reasons, typically for performance.
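For reference, a bare-bones sketch of SQL Server table partitioning (boundary dates and names are purely illustrative, and this is only worth the effort once volumes genuinely demand it):
-- Sketch only: range-partition a hypothetical messages table by year.
CREATE PARTITION FUNCTION pf_messages_by_year (datetime)
    AS RANGE RIGHT FOR VALUES ('2012-01-01', '2013-01-01');

CREATE PARTITION SCHEME ps_messages_by_year
    AS PARTITION pf_messages_by_year ALL TO ([PRIMARY]);

CREATE TABLE user_messages (
    message_id bigint NOT NULL,
    user_id int NOT NULL,
    sent_at datetime NOT NULL
) ON ps_messages_by_year (sent_at);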
What would be the easiest way to count the new records that are inserted into a database? Is it possible to include a count query in with the load query?
Or is something more complex needed, such as recording the existing last record and counting everything added after it?
edit:
I have a cron job that uses LOAD DATA INFILE in a script that is passed directly to mysql. This data is used with a PHP web application. As part of the PHP web application, I need to generate weekly reports, including how many records were inserted in the last week.
I am unable to patch MySQL or drastically change the database schema/structure, but I am able to add new tables or fields. I would prefer not to count the records from the CSV file and store the result in a text file or something; instead, I would prefer to do everything from within PHP with queries.
Assuming you're using MySQL 5 or greater, you could create a trigger which would fire upon inserting into a specific table. Note that an "insert" trigger also fires with the LOAD DATA INFILE command.
Using a trigger would require you to persist the count information in a separate table. Basically you'd need to create a new table with 1 row/column to hold the count. The trigger would then update that value with the amount of data loaded.
Here's the MySQL manual page on triggers, the syntax is fairly straight forward. http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html
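A rough sketch of what that counter table and trigger could look like (your_table stands in for whichever table LOAD DATA INFILE targets):
-- Sketch only: a one-row counter table kept up to date by an AFTER INSERT trigger.
CREATE TABLE insert_counter (
    total_inserts BIGINT NOT NULL
);
INSERT INTO insert_counter (total_inserts) VALUES (0);

DELIMITER //
CREATE TRIGGER trg_count_inserts AFTER INSERT ON your_table
FOR EACH ROW
BEGIN
    UPDATE insert_counter SET total_inserts = total_inserts + 1;
END//
DELIMITER ;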
edit
Alternatively, if you don't want to persist the data within the database, you could perform your LOAD operations within a stored procedure. This would allow you to perform a SELECT COUNT(*) on the table before you begin the load and after the load is complete. You would just need to subtract the resulting values to determine how many rows were inserted during the load.
Here's the MySQL manual page on procedures.
http://dev.mysql.com/doc/refman/5.0/en/create-procedure.html
That would probably depend on what counts as new. Is it entries entered into the database in the last five or ten minutes, etc.? Or is it any record past a certain auto-increment ID?
If you are looking at a time-based method of determining what's new, you can have a field (probably of type datetime) that records the time when the record was inserted, and to get the number you simply do a...
select count(*) from table where currentTime > 'time-you-consider-to-be-new'
If you don't want to go by recording the time, you can use an auto-increment key and simply keep track of the last inserted ID, then count the rows that come after it in any given time window. So if one hour ago the ID was 10000, then some number of records has been inserted since then; you count all records with an ID greater than 10000, keep track of the new last insert ID, and repeat whenever needed.
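In query form, that check is just a count above the last ID you recorded (10000 and your_table are placeholders):
-- Sketch only: id is the auto-increment column of the table being loaded.
SELECT COUNT(*) FROM your_table WHERE id > 10000;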
If you are not looking at a specific table, you can use the following:
show global status like "Com_%";
This will show you statistics for every type of query. These numbers just keep on counting, so if you want to use them, record the initial number when starting to track the queries, and subtract this from your final number (but yea, that's a given).
If you are looking for pure statistics, I can recommend using Munin with the MySQL plugins.
From where do you load the data? You might consider counting them before you insert them into the database. If it's an SQL script, you might write a quick and dirty bash script (with grep or something similar) to count the fields.
You say you can't change the structure. Does that mean you can't change the table you are inserting into, or that you can't change the database at all? If you can add a table, then just create a table with two columns: a timestamp and the key of the table you are loading. Before you load your CSV file, create another CSV file with just those two columns, and load that CSV after your main one.
This might be simpler than you want, but what about a Nagios monitor to track the row count? (Also consider asking around on serverfault.com; this stuff is totally up their alley.)
Perhaps you could write a small shell script that queries the database for the number of rows. You could then have a cron job that runs every minute/hour/day etc. and outputs the COUNT to a log file. Over time, you could review the log file and see the rate at which the database is growing. If you also put a date in the log file, you could review it more easily over longer periods.
See if this is the kind of MySQL data collection you're interested in: http://code.google.com/p/google-mysql-tools/wiki/UserTableMonitoring.
If that is the case, Google offers a MySQL patch (to apply to a clean mysql directory source) at http://google-mysql-tools.googlecode.com/svn/trunk/mysql-patches/all.v4-mysql-5.0.37.patch.gz. You can read more about the patch at http://code.google.com/p/google-mysql-tools/wiki/Mysql5Patches.
If this is not what you're looking for, I suggest you explain yourself a little more in order for us to help you better.
Could you use a trigger on the table that inserts into a table you created, one whose structure includes a timestamp?
You could then use a date calculation on a period range to find the information needed.
I don't know what version of MySQL you are using, but here is a link to the syntax for trigger creation in version 5.0: http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html
Good luck,
Matt
Well, if you need exhaustive information about which rows were inserted, updated or deleted, it might make sense to create an additional audit table to store those changes with a timestamp. You could do this with triggers. I would also write a stored procedure that executes as an event and erases old entries (whatever you consider old).
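A small sketch of such a scheduled cleanup, assuming MySQL 5.1+ with the event scheduler enabled and a hypothetical audit table row_audit with a ts timestamp column:
-- Sketch only: drop audit rows older than 90 days, once a day.
CREATE EVENT purge_old_audit_rows
    ON SCHEDULE EVERY 1 DAY
    DO
        DELETE FROM row_audit WHERE ts < NOW() - INTERVAL 90 DAY;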
Refer to the link posted by Lima on how to create triggers in MySQL.
Refer to page 655 of "MySQL Cookbook" by Paul Dubois (2nd Edition) or page 158 of "SQL for smarties" by Joe Celko.
So, will the LOAD only insert new data into the table, or rewrite the whole table?
If it only loads new data, then you can do a select count(*) from yourtable
once before the loading and once after the loading; the difference will show you how many new records were inserted.
If, on the other hand, you rewrite the whole table and want to find the records that differ from the previous version, then you would need a completely different approach.
Which one is it ?
Your question is a bit ambiguous, but the MySQL C API provides a function, mysql_affected_rows, that you can call after each query to get the number of affected rows. For an insert it returns the number of rows inserted. Be aware that for updates it returns the number of rows changed, not the number of rows that matched the WHERE clause.
If you are performing a number of queries and need to know how many were inserted the most reliable way would probably be doing a count before and after the queries.
As noted in sobbayi's answer adding a "created at" timestamp to your tables would allow you to query for records created after (or before) a given time.
UPDATE:
OK here is what you need to do to get a count before and after:
create a table for the counts:
create table row_counts (ts timestamp not null, row_count integer not null);
in your script, add the following before and after your LOAD DATA INFILE query:
insert into row_counts (ts,row_count) select now(),count(0) from YOUR_TABLE;
load data infile ......
insert into row_counts (ts,row_count) select now(),count(0) from YOUR_TABLE;
the row_counts table will now have the count before and after your load.
show global status like 'Com_insert';
flush status followed by show session status ... will work for just the current connection.
see http://dev.mysql.com/doc/refman/5.1/en/server-status-variables.html#statvar_Com_xxx
Since you asked for the easiest way, I would suggest using a trigger on insert. You could use a single-column, single-row table as a counter and update it with the trigger.