I'm getting the following data from a MySQL database
+----------------+------------+---------------------+----------+
| account_number | total_paid | doc_date            | doc_type |
+----------------+------------+---------------------+----------+
|             18 |    54.0700 | 2009-10-22 02:37:09 | IN       |
|            425 |    49.9500 | 2009-10-22 02:31:47 | PO       |
+----------------+------------+---------------------+----------+
The query is fine and I'm getting the data I need, except that the doc_type isn't very human-readable. To fix this, I've done the following:
CREATE TEMPORARY TABLE doc_type (id char(2), string varchar(60));
INSERT INTO doc_type VALUES
('IN', 'Invoice'),
('PO', 'Online payment'),
('PF', 'Offline payment'),
('CA', 'Credit adjustment'),
('DA', 'Debit adjustment'),
('OR', 'Order');
I then add a join against this temporary table so my doc_type column is easier to read, which gives me this:
+----------------+------------+---------------------+----------------+
| account_number | total_paid | doc_date            | document_type  |
+----------------+------------+---------------------+----------------+
|             18 |    54.0700 | 2009-10-22 02:37:09 | Invoice        |
|            425 |    49.9500 | 2009-10-22 02:31:47 | Online payment |
+----------------+------------+---------------------+----------------+
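The join itself is along these lines (the main billing table's name isn't shown above, so "transactions" here is just a placeholder):
SELECT t.account_number, t.total_paid, t.doc_date, dt.string AS document_type
FROM transactions t                    -- placeholder name for the real billing table
JOIN doc_type dt ON dt.id = t.doc_type;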
Is this the best way to do this? Is it possible to replace the text in one query? I started looking at IF statements, but that doesn't seem to be what I'm after, or maybe I just read it incorrectly.
// EDIT //
Thanks everyone. I suppose I'll keep doing it this way.
Unfortunately, it's not possible to change doc_type to integer as this is an existing database for a billing application I didn't write. I'd end up breaking functionality if I made any changes other than adding a table here and there.
Also appreciate the easy to understand case statement from Rahul. May come in handy later.
Your current way is the best. Arguably, doc_type could be changed to an int to save space and whatnot, but that's irrelevant here.
Doing the join will be much faster and more readable than any chained IFs.
Not to mention extensible: should you need to add a new doc_type, it's just an insert vs. potentially several queries.
You can use the SQL CASE statement to do this in a single query.
SELECT account_number, total_paid, doc_date,
    CASE doc_type
        WHEN 'IN' THEN 'Invoice'
        WHEN 'PO' THEN 'Online payment'
    END AS document_type
FROM your_table
It is the best way to do this :)
If doc_type could be an integer, you could also use the ELT function, as in
SELECT ELT(doc_type, 'Invoice', 'Document') FROM your_table;
but it is still worse than a simple join, as you have to put this into every query and every application that uses the database, and changing a description becomes hell.
IIRC this is the correct way to achieve what you want to do. It's a normalized design.
I think you are asking about the design and not how the data has to be fetched? If so, I should say I have always used this kind of design.
This design leads to a normalized database. There won't be consistency problems if you ever need to change a description like Invoice or Online payment.
I would suggest changing the doc_type field to an int, as it not only saves space (as Tordek said) but is also faster to query.
First, if you stored a string like Invoice directly in doc_type, string comparison would be slow compared to other data types.
Second, strings can be case sensitive, which may lead to mistakes.
Third, since strings take up more space, the main table would need much more storage.
Fourth, if you ever needed to change the name Invoice to, say, Billing, every row containing that value would have to be updated.
Related
I am using BigQuery to give my colleagues access to aggregated data in our system.
I have a raw_orders table where I store orders data. The thing is that the lines in this table are subject to change across time. When a change occurs, I add a new line in this table. So my table looks like this:
+-----+-------+---------------------+---------------------+
| id  | total | created_at          | updated_at          |
+-----+-------+---------------------+---------------------+
| ABC | 15.76 | 2020-01-01 12:56:32 | 2020-01-02 14:58:43 |
| ABC | 12.43 | 2020-01-01 12:56:32 | 2020-01-01 12:56:32 |
| DEF | 19.03 | 2020-01-01 12:56:32 | 2020-01-02 14:58:43 |
| DEF | 12.03 | 2020-01-01 12:56:32 | 2020-01-01 12:56:32 |
+-----+-------+---------------------+---------------------+
To allow my collaborators to query on a deduplicated table easily, I created a view of deduplicated lines using:
CREATE OR REPLACE VIEW xxx.orders as
select ro2.*
from (
  select ro.id, max(ro.updated_at) as max_updated_at
  from xxx.raw_orders ro
  group by ro.id
) tmp
inner join xxx.raw_orders ro2
  on ro2.id = tmp.id and ro2.updated_at = tmp.max_updated_at
order by ro2.created_at desc
This works great, but I feel that I am spending too much budget on simple requests like:
SELECT * FROM xxx.orders WHERE created_at > '2020-11-01 00:00:00';
If I understand correctly, because of the view step, BigQuery must scan and process a lot of data to deduplicate lines before returning even a single result.
Am I doing something wrong here? How do you give access to deduplicated data without spending too much budget? Would you have a better strategy for what I'm trying to do?
Ideally, you would use a materialized view for this purpose, but right now BigQuery has limited support for materialized views. You cannot create a materialized view to replace the view you are using.
It is possible to create a materialized view for the inner query, which may make the whole query less expensive but please read on.
Cost. There is no simple answer as to whether you are "spending too much budget" on the query.
If you're on the pay-per-query plan and charged by "processed bytes", then although the query is more expensive for BigQuery to process, you're charged no more than scanning the whole table once (although technically the table was scanned more than once). In other words, deduplication is free. However, if your query pattern would allow you to cluster/partition your table to avoid scanning the whole table, then this "self-join" view does prevent you from saving budget.
If you have a reservation on slots, then you will benefit from making the query faster.
Suggestions. Given that the situation differs case by case, the general suggestions are:
If possible, separate the data into "archived" and "active" so that the "archived" data stays deduplicated (and partitioned/clustered to allow efficient search), and you only need a view to dedup the "active" data.
Creating a materialized view (on the inner "GROUP BY" query) may speed the query up a bit but not necessarily make it "cheaper"; you may be charged for the size of the base table + the materialized view. A rough sketch follows.
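As an illustration of that second suggestion, a materialized view over the inner aggregation might look something like this (the dataset and table names come from the question; the materialized view name is made up):
CREATE MATERIALIZED VIEW xxx.orders_latest_update AS
SELECT
  ro.id,
  MAX(ro.updated_at) AS max_updated_at  -- latest version of each order id
FROM xxx.raw_orders ro
GROUP BY ro.id;
The xxx.orders view could then join xxx.raw_orders against this materialized view instead of recomputing the GROUP BY on every query.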
I am sure this question has been asked before, but I'm so new to SQL, I can't even combine the correct search terms to find an answer! So, apologies if this is a repetition.
The db I'm creating has to be created at run-time, then the data is entered after creation. Some fields will have a varying number of entries, but the number is unknown at creation time.
I'm struggling to come up with a db design to handle this variation.
As an (anonymised) example, please see below:
| salad_name | salad_type | salad_ingredients          | salad_cost |
| apple      | fruity     | apple                      | cheap      |
| unlikely   | meaty      | sausages, chorizo          | expensive  |
| normal     | standard   | leaves, cucumber, tomatoes | mid        |
As you can see, the contents of "salad_ingredients" varies.
My thoughts were:
just enter a single, comma-separated string and separate at run-time. Seems hacky, and couldn't search by salad_ingredients!
have another table, for each salad, such as "apple_ingredients", which could have a varying number of rows for each ingredient. However, I can't do this, because I don't know the salad_name at creation time! :(
Have a separate salad_ingredients table, where each row is a salad_name, and there is an arbitrary number of ingredient fields, say 10, so you could have up to 10 ingredients. Again, seems slightly hacky, as I don't like unused fields, and what happens if a super-complicated salad comes along?
Is there a solution that I've missed?
Thanks,
Dan
Based on my experience, the best solution is a normalized set of tables:
table salads
id
salad_name
salad_type
salad_cost
table ingredients
id
name
and
table salad_ingredients
id
id_salad
id_ingredients
where id_salad is the corresponding id from salads
and id_ingredients is the corresponding id from ingredients.
Using proper joins you can get (SELECT) and filter (WHERE) all the values you need; a sketch is below.
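A minimal sketch of that layout and a join query, with column types assumed for illustration:
CREATE TABLE salads (
    id         INT PRIMARY KEY,
    salad_name VARCHAR(60),
    salad_type VARCHAR(60),
    salad_cost VARCHAR(20)
);

CREATE TABLE ingredients (
    id   INT PRIMARY KEY,
    name VARCHAR(60)
);

CREATE TABLE salad_ingredients (
    id             INT PRIMARY KEY,
    id_salad       INT,  -- the corresponding id from salads
    id_ingredients INT   -- the corresponding id from ingredients
);

-- all salads that contain cucumber
SELECT s.salad_name, s.salad_cost
FROM salads s
JOIN salad_ingredients si ON si.id_salad = s.id
JOIN ingredients i ON i.id = si.id_ingredients
WHERE i.name = 'cucumber';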
While trying to build a data warehousing application using Talend, we are faced with the following scenario.
We have two tables that look like this:
Table master
ID | CUST_NAME | CUST_EMAIL
------------------------------------
1  | FOO       | FOO_BAR@EXAMPLE.COM
Events Table
ID | CUST_ID | EVENT_NAME  | EVENT_DATE
---------------------------------------
1  | 1       | ACC_APPLIED | 2014-01-01
2  | 1       | ACC_OPENED  | 2014-01-02
3  | 1       | ACC_CLOSED  | 2014-01-03
There is a one-to-many relationship between the master and events tables. Since there is a limited number of event names, I am proposing that we denormalize this structure into something that looks like this:
ID | CUST_NAME | CUST_EMAIL          | ACC_APP_DATE_ID | ACC_OPEN_DATE_ID | ACC_CLOSE_DATE_ID
---------------------------------------------------------------------------------------------
1  | FOO       | FOO_BAR@EXAMPLE.COM | 20140101        | 20140102         | 20140103
The DATE_ID columns refer to entries in the time dimension table.
First question: Is this a good idea? What are the other alternatives to this scheme?
Second question: How do I implement this using Talend Open Studio? I figured out a way in which I moved the data for each event name into its own temporary table along with cust_id using the tMap component and later linked them together using another tMap. Is there another way to do this in Talend?
To do this in Talend you'll need to first sort your data so that it is reliably in the order of applied, opened and closed for each account and then denormalize it to a single row with a single delimited field for the dates using the tDenormalizeRows component.
After this you'll want to use tExtractDelimitedFields to split the single dates field.
Yeah, this is a good idea; this is called an accumulating snapshot fact. http://www.kimballgroup.com/2012/05/design-tip-145-time-stamping-accumulating-snapshot-fact-tables/
Not sure how to do this in Talend (don't know the tool), but it would be quite easy to implement in SQL using a CASE or PIVOT statement; a rough sketch follows.
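A sketch of the CASE-based pivot, using the table and column names from the question (the lookup of each date against the time dimension's DATE_ID is left out):
SELECT m.id,
       m.cust_name,
       m.cust_email,
       MAX(CASE WHEN e.event_name = 'ACC_APPLIED' THEN e.event_date END) AS acc_app_date,
       MAX(CASE WHEN e.event_name = 'ACC_OPENED'  THEN e.event_date END) AS acc_open_date,
       MAX(CASE WHEN e.event_name = 'ACC_CLOSED'  THEN e.event_date END) AS acc_close_date
FROM master m
LEFT JOIN events e ON e.cust_id = m.id
GROUP BY m.id, m.cust_name, m.cust_email;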
Regarding only your first question: it's certainly a good idea, unless there is any possibility of the same person applying, opening and closing their account more than once AND you want to keep all of this information in their history (so an UPDATE wouldn't help).
Snowflaking is definitely not a good option if you are going to design a data warehouse, so denormalizing will certainly be a good choice in this case. The following article fits almost perfectly to clear the air over such scenarios:
http://www.kimballgroup.com/2008/09/design-tip-105-snowflakes-outriggers-and-bridges/
I have 2 tables, a purchases table and a users table. Records in the purchases table look like this:
purchase_id | product_ids | customer_id
---------------------------------------
1           | (99)(34)(2) | 3
2           | (45)(3)(74) | 75
Users table looks like this:
user_id | email              | password
----------------------------------------
3       | joeShmoe@gmail.com | password
75      | nolaHue@aol.com    | password
To get the purchase history of a user I use a query like this:
mysql_query(" SELECT * FROM purchases WHERE customer_id = '$users_id' ");
The problem is: what will happen when tens of thousands of records are inserted into the purchases table? I feel like this will take a performance toll.
So I was thinking about storing the purchases in an additional field directly in the user's row:
user_id | email              | password | purchases
----------------------------------------------------
3       | joeShmoe@gmail.com | password | (99)(34)(2)
75      | nolaHue@aol.com    | password | (45)(3)(74)
And when I query the user's table for things like username, etc. I can just as easily grab their purchase history using that one query.
Is this a good idea, will it help better performance or will the benefit be insignificant and not worth making the database look messier?
I really want to know what the pros do in these situations. For example, how does Amazon query its database for a user's purchase history when they have millions of customers? How come their queries don't take hours?
EDIT
Ok, so I guess keeping them separate is the way to go. Now the question is a design one:
Should I keep using the "purchases" table I illustrated earlier? In that design I am separating the product ids of each purchase using parentheses and using them as the delimiter to tell the ids apart when extracting them via PHP.
Or should I instead store each product id in its own row in the "purchases" table, so it looks like this?
purchase_id | product_ids | customer_id
---------------------------------------
1           | 99          | 3
1           | 34          | 3
1           | 2           | 3
2           | 45          | 75
2           | 3           | 75
2           | 74          | 75
Nope, this is a very, very, very bad idea.
You're breaking first normal form because you don't know how to page through a large data set.
Amazon and Yahoo! and Google bring back (potentially) millions of records - but they only display them to you in chunks of 10 or 25 or 50 at a time.
They're also smart about guessing or calculating which ones are most likely to be of interest to you - they show you those first.
Which purchases in my history am I most likely to be interested in? The most recent ones, of course.
You should consider building these into your design before you violate relational database fundamentals.
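As a sketch of that idea, keep purchases in their own table and page through them newest first. The purchase_date column here is an assumption, since the example schema doesn't show a date:
-- 25 most recent purchases for one customer, first page
SELECT *
FROM purchases
WHERE customer_id = 3
ORDER BY purchase_date DESC   -- assumed column; any "most recent" ordering works
LIMIT 25 OFFSET 0;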
Your database already looks messy, since you are storing multiple product_ids in a single field, instead of creating an "association" table like this.
_____product_purchases____
purchase_id | product_id |
--------------------------
1           | 99         |
1           | 34         |
1           | 2          |
You can still fetch it in one query:
SELECT * FROM purchases p LEFT JOIN product_purchases pp USING (purchase_id)
WHERE p.customer_id = $user_id
But this also gives you more possibilities, like finding out how many units of product #99 were bought, getting a list of all customers that purchased product #34, etc.
And of course don't forget about indexes, that will make all of this much faster.
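For example, roughly (the index names are arbitrary):
-- how many times product #99 was bought
SELECT COUNT(*) FROM product_purchases WHERE product_id = 99;

-- all customers who purchased product #34
SELECT DISTINCT p.customer_id
FROM purchases p
JOIN product_purchases pp USING (purchase_id)
WHERE pp.product_id = 34;

-- indexes that support these lookups
CREATE INDEX idx_pp_product ON product_purchases (product_id);
CREATE INDEX idx_purchases_customer ON purchases (customer_id);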
By merging purchases into the users table like this, you will break the entity relationships of your database.
You might want to look into Memcached, NoSQL, and Redis.
These are all tools that will help you improve your query performances, mostly by storing data in the RAM.
For example: run the query once and store the result in Memcached; if the user refreshes the page, you get the data from Memcached, not from MySQL, which avoids querying your database a second time.
Hope this helps.
First off, tens of thousands of records is nothing. Unless you're running on a teensy weensy machine with limited RAM and hard drive space, a database won't even blink at 100,000 records.
As for storing purchase details in the users table... what happens if a user makes more than one purchase?
MySQL is hugely extensible, and don't let the fact that it's free convince you otherwise. Keeping the two tables separate is probably best, not only because it keeps the db more normalized, but because having more indexes will speed up queries. A 10,000-record database is tiny compared to multi-hundred-million-record health record databases.
As for Amazon and Google, they hire hundreds of developers to write specialized query languages for their specific application needs; that's not something developers like us have the resources to fund.
I'm going to start work on a medium-sized application, and I'm planning its db design.
One thing that I'm not sure about is this.
I will have many tables which will need internationalization, such as: "membership_options, gender_options, language_options etc"
Each of these tables will share common i18n fields, like:
"title, alternative_title, short_description, description"
In your opinion which is the best way to do it?
Have an i18n table with the same fields for each of the tables that will need them?
Or do something like this:
Membership table        Gender table
----------------        --------------
id | created_at         id | created_at
1  | 22.03.2001         1  | 14.08.2002
2  | 22.03.2001         2  | 14.08.2002
General translation table
-------------------------
record_id | table_name | string_name | alternative_title | .... | id_language
1         | membership | regular     | null              |      | 1 (english)
1         | membership | normale     | null              |      | 2 (italian)
1         | gender     | man         | null              |      | 1 (english)
1         | gender     | uomo        | null              |      | 2 (italian)
This would save me from repeating something like:
membership_translation table
-----------------------------
membership_id | name    | alternative_title | id_lang
1             | regular | null              | 1
1             | normale | null              | 2
gender_translation table
-----------------------------
gender_id | name | alternative_title | id_lang
1         | man  | null              | 1
1         | uomo | null              | 2
and so on. This would probably reduce the number of db tables, but I'm not sure about performance. I'm not much of a DB designer, so please let me know.
The most common way I've seen this done is with two tables, membership and membership_ml, with one storing the base values and the ml table storing the localized strings. This is similar to your second option. Most of the systems I see like this are made that way because they weren't designed with internationalization in mind from the get go, so the extra _ml tables were "tacked on" later.
What I think is a better option is similar to your first option, but a little bit different. You would have a central table for storing all the translations, but instead of putting the table name and field name in there, you would use tokens and a central "Content" table to store all the translations. That way you can enforce some kind of RI between the tokens in the base table and the translations in the Content table if you want as well.
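A minimal sketch of that token-based approach, with made-up table and column names:
CREATE TABLE membership (
    id          INT PRIMARY KEY,
    title_token INT,        -- references content.token
    created_at  DATETIME
);

CREATE TABLE content (
    token       INT,
    id_language INT,
    title       VARCHAR(255),
    PRIMARY KEY (token, id_language)
);

-- membership titles in Italian (id_language = 2)
SELECT m.id, c.title
FROM membership m
JOIN content c ON c.token = m.title_token AND c.id_language = 2;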
I actually asked a question about this very thing a while back, so you can have a look at that for some more info (rather than repasting the schema examples here).
I also think the best solution is to keep translations in a different table. OpenCart, which is open source, uses this approach, and you can take a look at the way it deals with the problem. Another source of information is http://www.gsdesign.ro/blog/multilanguage-database-design-approach/ (especially the comments section).