BQ / Big Data QA - google-bigquery

I work in a small team as a big data ETL QA.
We work extensively on source-to-target mappings, with both source and target in GCP BigQuery.
The mappings and dev logic are extremely complex.
This digital transformation of the data will be done for 30 countries.
My managers can't provide budget for extra resources or expensive tools, and I am tasked with finishing QA in less time without compromising quality.
Currently, QA of each array (about 10 fields) takes me approximately 1 day.
My test strategy is as follows:
1. Create a source CTE with the dev logic and the source fields (partitioning, case statements, unpivoting, pivot filters per the requirement).
2. Create a target CTE with the target fields.
3. Join source and target.
4. Finally, validate the transformation using validation conditions at the end of the SQL (a stripped-down sketch of this pattern is below).
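For reference, each per-country check is essentially this skeleton (all table names, field names, and the sample CASE logic here are placeholders, not our real ones):

-- Skeleton of one source-to-target check (placeholder names only).
WITH source_cte AS (
  SELECT id,
         -- dev logic reproduced here: partitioning, CASE statements, unpivoting, filters
         CASE WHEN country_code = 'US' THEN UPPER(field_1) ELSE field_1 END AS field_1_expected
  FROM `project.dataset.source_table`
),
target_cte AS (
  SELECT id,
         field_1 AS field_1_actual
  FROM `project.dataset.target_table`
)
SELECT
  COUNTIF(t.id IS NULL) AS missing_in_target,
  COUNTIF(s.id IS NULL) AS missing_in_source,
  COUNTIF(s.field_1_expected != t.field_1_actual) AS value_mismatches  -- NULL-safe compares need IFNULL
FROM source_cte AS s
FULL OUTER JOIN target_cte AS t
  ON t.id = s.id;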
The source is different for each country, and so are the requirements.
Repeating this exercise for 30 countries is exhausting.
Does anyone have ideas on how to reduce the QA time, or any new QA strategies or tools?
Please advise

Related

Capacity Management database

I am designing a database using Microsoft Access for capacity management. I have different types of equipment and need to track the power usage every month. For instance, I have an HVAC system and a Chiller system. Within each system there are different pieces of equipment, such as AHU_1, AHU_2, AHU_3, MAU_1, MAU_2, etc. in the HVAC system and CHWP_1, CHWP_2, CWP_1, CWP_2, etc. in the Chiller system.
I need to track the power usage every month. For each system I have separate tables containing the respective equipment. What would be a suitable way to track the usage? I believe there are three options:
1. Creating a main table (e.g. Chiller_usage) which holds all the equipment and dates with a usage value. The con I see is that each piece of equipment is repeated for every date; the pro is that there are not many tables.
2. Creating a table per piece of equipment, each holding dates and usage. The problem is that I have around 60 to 70 pieces of equipment across 5 major systems, which would lead to a massive number of tables and make queries and reports very difficult.
3. Creating a table per date (month) with equipment and usage values. This looks promising for now because I would have few tables initially, but as time goes on there would be 12 new tables each year, which is a lot in the long run.
I'm leaning toward the first option since it is easy to manage when writing custom queries, because I need to perform calculations for costing and usage analysis of each piece of equipment, with graphs, etc. But I believe it will be clumsy because equipment names are repeated across the varying dates. Are there any other viable options? Thank you.
Assuming you need to store monthly energy usage for each piece of equipment: normalize the tables. Neither the person entering the data nor the manager asking for reports needs to see the complexity of the underlying tables. The person entering the data sees a form for adding systems/equipment and a form for entering energy usage per piece of equipment per month. The manager sees a report with just the information they want, such as energy costs per system per year. The normalized tables can be recombined into human-readable tables with queries. Access tries to make building the forms and reports as simple as clicking on the appropriate queries and then clicking Create Form/Report; in practice, some assembly is required. Behind the scenes, the forms put the correct data into the correct tables and the report shows only the data the manager wants. For instance, here is a normalized table structure based on the data you provided and some assumptions:
The tables are normalized and related to each other as described. Beneath them sits a query that shows the total power each system uses between any two dates, such as for a yearly report, turning the raw per-equipment usage rows into a per-system summary.
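As a rough sketch only (the table and field names below are assumptions, not the ones from the original design), the structure and the totalling query might look like this in Access SQL:

-- Assumed normalized tables:
--   Systems(SystemID, SystemName)
--   Equipment(EquipmentID, SystemID, EquipmentName)
--   UsageReadings(ReadingID, EquipmentID, ReadingMonth, PowerUsage)

-- Total power per system between any two dates, e.g. for a yearly report:
SELECT s.SystemName, SUM(u.PowerUsage) AS TotalPower
FROM (Systems AS s
      INNER JOIN Equipment AS e ON e.SystemID = s.SystemID)
     INNER JOIN UsageReadings AS u ON u.EquipmentID = e.EquipmentID
WHERE u.ReadingMonth BETWEEN [Start Date] AND [End Date]
GROUP BY s.SystemName;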

Data warehousing Model approach

We are in the process of building a health data warehouse and have been having discussions about its basic structure. I need your suggestions on the pros and cons of the structures below. The DWH will be used for reporting and research purposes. It will be a near-real-time data warehouse with a latency of around 5-10 minutes.
The source database has one Encounter/Visit table. Everything is saved in this table; it's the central table which links everything. So if I need to get a patient's journey in the production database, I just go to the Encounter/Visit table and see how many times a patient has come for treatment, has been admitted, has been seen in emergency and sent home, has been admitted from emergency, etc.
Model 1 ->
An Encounter/Visit table holding the common fields (like encounter_id, arrival_date, care_type, etc.),
with further tables built per encounter type holding the encounter-specific fields:
Encounter_Emergency (emergency-specific fields such as emergency diagnosis, triage category, etc.)
Encounter_Inpatient
Encounter_outpatient
Model 2 ->
Separate tables as base tables, with a view created on top that brings all the encounter types together:
Encounter_Emergency (emergency-specific fields such as emergency diagnosis, triage category, etc.)
Encounter_Inpatient
Encounter_outpatient
Model 3 ->
An Encounter/Visit table having all the fields from the source database,
with views created per encounter type exposing the encounter-specific fields:
view_Encounter_Emergency
view_Encounter_Inpatient
view_Encounter_outpatient
These views can be further combined with the emergency_diagnosis table to get the diagnosis, or with the emergency_alerts table to access the alerts, etc.
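For example, something along these lines (the join keys and any column names beyond those mentioned above are just my illustration of the idea):

CREATE VIEW view_Encounter_Emergency AS
SELECT encounter_id, arrival_date, care_type,
       triage_category, emergency_diagnosis_id   -- assumed FK to emergency_diagnosis
FROM Encounter
WHERE care_type = 'Emergency';

-- combined with the diagnosis table when the diagnosis is needed
SELECT v.encounter_id, v.arrival_date, d.diagnosis_description
FROM view_Encounter_Emergency AS v
JOIN emergency_diagnosis AS d
  ON d.emergency_diagnosis_id = v.emergency_diagnosis_id;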
A prime consideration would be how often there will be additions, deletions, or alterations to Encounter Types.
Model B (model 2 above) will require extensive rework in advance of any such change just to make sure the data continues to be captured. Either of the other two models will continue to capture reclassified data, but will require rework to report on it.
As between A (model 1) and C (model 3), the question becomes traffic. Views are comparatively easy to spin up and down, but they'll put load on that big base table. That might be acceptable if the DW won't have much load on it. But if there will be extensive reporting (pro tip: there's always more extensive reporting than the business tells you there will be), it may be more advantageous to break the data out into standalone tables.
There is, of course, ETL overhead to maintaining all of those tables.
For speed of delivery, perhaps build Model C, but architect Model A in case consumption requires the more robust model. For the record, you could build Views that don't have any kind of vw_ prefix, or any other identifier in their names that lets users know that they're views. Then, later, you can replace them with tables of the same name, and legacy queries against the old views will continue to work. I've done just the same thing in the opposite direction, sneaking in views to replace redundant tables.
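As an illustration only (generic SQL, with SELECT ... INTO as the SQL Server spelling of "create a table from a query"; names are assumed), the same consumer-facing name can start life as a view and later be swapped for a table:

-- Phase 1: the consumer-facing name is a view (no vw_ prefix), reading the wide Encounter table.
CREATE VIEW Encounter_Emergency AS
SELECT encounter_id, arrival_date, triage_category
FROM Encounter
WHERE care_type = 'Emergency';

-- Phase 2, later, if load on the base table becomes a problem:
DROP VIEW Encounter_Emergency;
SELECT encounter_id, arrival_date, triage_category
INTO Encounter_Emergency           -- SQL Server syntax; CREATE TABLE ... AS SELECT elsewhere
FROM Encounter
WHERE care_type = 'Emergency';
-- Legacy queries against Encounter_Emergency keep working unchanged.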

In SQL, how do I update data to the data warehouse as a daily activity?

I'm trying to figure out how to create a strategy to UPDATE both the dimension and fact tables of a data warehouse, as a daily activity.
Essentially, on the date the data is inserted into the operational DB, I want it to then be populated into the dimension tables and hence the fact table.
To do this I presume I would use SYSDATE() as part of a WHERE condition, but I don't know how to set the date portion of SYSDATE() to equal the day the operational data was taken. Would it be 'today'? I don't know.
I can't even give a cogent example because I don't really know what I'm doing here.
I've seen somewhere that if I only use the date, the update will occur at midnight?
I don't really know where to go next. Can anyone help?
@Theant6118, it sounds like you have a lot of research and reverse engineering to do. There is no set rule about when or what time of day to do an update. If someone says midnight, that could depend on when all of the ETL (extract, transform, load) jobs for your operational DB complete, and on what date your ETL processes use to filter what they do and do not extract from your transactional data sources.
Next, it sounds like you are also wondering what datetime function to use to get the current system time, and that is somewhat dependent on the database engine you are using.
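For example, if this turned out to be SQL Server, a daily pull of yesterday's rows might look roughly like the sketch below; every table and column name here is made up for illustration, and on Oracle you would reach for TRUNC(SYSDATE) instead of the GETDATE() casts.

-- Illustrative only: load yesterday's operational rows into the warehouse (SQL Server syntax).
INSERT INTO dw.FactSales (order_id, customer_key, order_date, amount)
SELECT o.order_id, c.customer_key, o.order_date, o.amount
FROM ops.Orders AS o
JOIN dw.DimCustomer AS c
  ON c.customer_id = o.customer_id
WHERE o.order_date >= CAST(DATEADD(day, -1, GETDATE()) AS date)   -- yesterday at midnight
  AND o.order_date <  CAST(GETDATE() AS date);                    -- today at midnight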
As far as timing: if your operational DB is the data source for your data warehouse (which is pretty standard), whenever that completes you can then process the cubes/tabular models or ETL for your data warehouse, independent of the date of the transactional information you are syncing.
Definitely update this question with the technology being used, e.g. sql-server, oracle, etc.
If you are on sql-server, do some discovery on the SQL Agent jobs: when they execute, what they do, and what their dependencies are. I am not positive what the equivalent is on other platforms. And update this question.
Query the data! Test your hypotheses to see what information is contained in the operational data store and what is in the data warehouse.
Build your game plan and start to execute.
Continuously learn, learn, learn about ETL, the specific technologies being used, and the current configuration!

Tableau - blend, join, or modify raw?

I have several large files which have metric information. I am linking these files to a roster database so I know who an employee reported to on a certain date (this allows me to track performance per manager or per site, etc).
The way I am doing this is creating a RosterID field which is EmployeeID+Date (Example: x112x20141022). So if I filter on one sheet on my dashboard, it will filter for all sheets.
Question: Do I use a blend between the files (there are about 8 different data sources right now), a left join, or do I just use python/pandas to literally add the information to the raw data source? I want the twbx file to be as fast as possible for the end users.
(What I mean by adding the information to the raw data is literally adding columns like Manager, Director, Location, Hire Date, Training Class. This would add to the file sizes marginally, but would it speed up Tableau?)
A data blend (creating a relationship between different data sources in Tableau) is the worst-performing solution. While it is easy to use, it's highly inefficient, because it effectively performs the join every single time a calculation is made.
Performing a left join in Tableau when connecting to the data (I'm assuming you're using CSV files) is a very good solution, as is building a single table with the information in pandas. For the end user, there shouldn't be any difference in performance (especially if you extract the data to a .tde in Tableau). Here you need to weigh whether it's easier for you to maintain a process in python/pandas or in Tableau's extract tooling. I believe it's simpler to have Tableau do the join, but it's your call.
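Whichever tool does the flattening (pandas, a database, or Tableau's own join), the end result is essentially the metrics left-joined to the roster on employee and date; expressed as SQL it would be something like the sketch below, where every name is a placeholder:

-- Illustrative only: attach roster attributes to each metric row, keyed by employee and date.
SELECT m.employee_id,
       m.metric_date,
       m.metric_value,
       r.manager,
       r.director,
       r.location,
       r.hire_date,
       r.training_class
FROM metrics AS m
LEFT JOIN roster AS r
  ON  r.employee_id = m.employee_id
  AND r.roster_date = m.metric_date;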

Google BigQuery pricing

I'm a PhD student from Singapore Management University. Currently I'm working at Carnegie Mellon University on a research project which needs the historical events from GitHub Archive (http://www.githubarchive.org/). I noticed that Google BigQuery has the GitHub Archive data, so I ran a program to crawl data using the Google BigQuery service.
I just found that the price Google BigQuery shows on the console is not updated in real time... After I had been running the program for a few hours, the fee was only a little over $4, so I thought the price was reasonable and I kept running the program. After 1-2 days, I checked the price again on Sep 13, 2013, and it had become $1,388... I therefore immediately stopped using the BigQuery service. And just now I checked the price again; it turns out I need to pay $4,179...
It is my fault that I didn't realize I would need to pay such a large amount of money for executing queries and obtaining data from Google BigQuery.
This project is only for research, not for commercial purposes. I would like to know whether it is possible to waive the fee. I really need the [Google BigQuery team]'s kind help.
Thank you very much & Best Regards,
Lisa
A year later update:
Please note some big developments since this situation:
Query prices are down 85%.
GitHub Archive is publishing daily and yearly tables now - so while developing your queries, always test them on the smaller datasets.
BigQuery pricing is based on the amount of data queried. One of its highlights is how easily it scales, going from scanning a few gigabytes to terabytes in very few seconds.
Pricing scaling linearly is a feature: Most (or all?) other databases I know of would require exponentially more expensive resources, or are just not able to handle these amounts of data - at least not in a reasonable time frame.
That said, linear scaling means that a query over a terabyte is 1,000 times more expensive than a query over a gigabyte. BigQuery users need to be aware of this and plan accordingly. For this purpose BigQuery offers the "dry run" flag, which lets you see exactly how much data will be queried before running the query - and adjust accordingly.
In this case WeiGong was querying a 105 GB table. Ten SELECT * LIMIT 10 queries quickly amount to a terabyte of scanned data (10 × 105 GB ≈ 1 TB), and so on.
There are ways to make these same queries consume much less data:
Instead of querying SELECT * LIMIT 10, select only the columns you are looking for. BigQuery charges based on the columns you query, so unnecessary columns add unnecessary cost.
For example, SELECT * ... queries 105 GB, while SELECT repository_url, repository_name, payload_ref_type, payload_pull_request_deletions FROM [githubarchive:github.timeline] only goes through 8.72 GB, making this query more than 10 times less expensive.
Instead of "SELECT *" use tabledata.list when looking to download the whole table. It's free.
The GitHub Archive table contains data for all time. Partition it if you only want to look at one month of data.
For example, extracting all of the January data with a query leaves a new table of only 91.7 MB. Querying this table is a thousand times less expensive than the big one!
SELECT *
FROM [githubarchive:github.timeline]
WHERE created_at BETWEEN '2014-01-01' AND '2014-01-02'
-- save the result into a new table, e.g. 'timeline_201401'
Combining these methods you can go from a $4,000 bill to a $4 one, for the same amount of quick and insightful results.
(I'm working with Github archive's owner to get them to store monthly data instead of one monolithic table to make this even easier)