Data Modeling for Google BigQuery - google-bigquery

Attached is a sample of my current table structure.
My data source is Google Campaign Manager. When I extract the different tables as indicated in the sheet, I get a difference in figures (I am taking over from a person who did the initial design). E.g. Impressions might be one figure in my “fact table” and another figure somewhere else.
The problem is that there are no primary keys, and tying the tables to one another is also difficult; the only link might be between dates. The database is Google BigQuery.
Do you have any idea how to do a proper data design with this type of marketing data, and does it need to be denormalised? From the research I gathered, BigQuery data design works best denormalised.
I would also like to move away from spreadsheets; I believe this is also part of the chaos.
I believe CM360 Data Platform is my fact table.
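For illustration only (every table and field name below is made up, not the actual CM360 transfer schema), a denormalised, date-partitioned design in BigQuery might look something like this:

CREATE TABLE `my_project.marketing.cm360_delivery_fact` (
  report_date       DATE,                                     -- the date the extracts are currently tied together on
  campaign_id       INT64,
  campaign          STRUCT<name STRING, advertiser STRING>,   -- dimension attributes nested into the fact row
  placement         STRUCT<id INT64, name STRING, site STRING>,
  impressions       INT64,
  clicks            INT64,
  total_conversions FLOAT64
)
PARTITION BY report_date
CLUSTER BY campaign_id;

Because the dimension attributes travel with each fact row, there is no need for surrogate keys or joins on dates alone; whether that grain matches what the CM360 extracts actually deliver is something only the ETL can confirm.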
[CM360_Tables Overview.xlsx - Google Sheets]
[etl_process.txt - Google Drive]

Related

Joining Ads Data in Ads Data Hub with GA360 Data in BigQuery

I need to find a way to (SQL-)join my GA360 tables in BigQuery (BQ) with data within Ads Data Hub (ADH).
I already know how to query tables from BQ within ADH:
SELECT *
FROM `projectname.table_name`
But I can't find any resources on what matching key to use in the JOIN statement:
SELECT
  *
FROM
  adh.*** AS adh_data
LEFT JOIN ??? AS ga360
  ON adh_data.??? = ga360.???
I read through this https://developers.google.com/ads-data-hub/guides/join-your-data
But it's not really clear to me what to get/use from it and I couldn't find any information on this topic anywhere.
Thank you in advance!
AFAIK, ADH doesn't currently allow querying across Google Analytics data sets (which would already be in ADH's "clean room" if they wanted you to be able to make such queries...)
Your best option might be to A: make sure that you're capturing first-party IDs in your Google Analytics implementation, and B: ensure those IDs are also captured in your CRM platforms as users interact with your properties (the assumption being that your CRM can capture, along with that ID, any Google Analytics related data you may find useful, though it won't be log level, I don't think...)
From there, with "onboarding" of sorts, you may eventually be able to drop your CRM data into ADH-queryable tables which can be joined (per the link you shared, "join your data") and then, well... you're at Google's behest for the most part, but I think that's the path you're looking for...
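To make the shape of that concrete, a sketch only (the BigQuery table `my_project.crm.first_party_ids`, its columns, and whether the IDs are directly joinable all depend on your own match/onboarding setup and are assumptions here):

SELECT
  adh_imp.campaign_id,
  crm.customer_segment,
  COUNT(DISTINCT adh_imp.user_id) AS users,
  COUNT(*) AS impressions
FROM adh.google_ads_impressions AS adh_imp         -- an ADH impression table
LEFT JOIN `my_project.crm.first_party_ids` AS crm  -- hypothetical first-party table you loaded into BigQuery
  ON adh_imp.user_id = crm.matched_user_id         -- placeholder join key; the real key comes out of the matching/onboarding process
GROUP BY 1, 2;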
PS: Google may have some solutions with guides that include some useful example queries regarding join keys across CM/DV/GoogleAds tables, and they may be high quality, but they may not be EXACTLY what you're looking for... It's entirely possible they are not publicly available though...

Backfill Google Analytics in BigQuery

I'm looking for a workaround for the following issue. Hope someone can help.
I'm unable to backfill data in the ga_sessions_ table in BigQuery through product linking in GA, e.g. the partition ga_sessions_20180517 is missing.
This specific view has already been linked before. Google documentation says that historical load is only done once per view (hence, the issue) (https://support.google.com/analytics/answer/3416092?hl=en)
Is there any way to work around it?
Kind regards,
Martijn
You can use the Google Analytics Reporting API to get the data for that view. This method has a lot of restrictions (sometimes the data is sampled, and only 7 dimensions can be exported in one call), but at least you will be able to fetch your data in a partitioned manner.
Documentation here.
If you need a lot of dimensions/metrics in hit level format, scitylana.com has a service that can provide this data historically.
If you have a clientId set in a custom dimension, the data quality is near perfect.
It also works without a clientId set.
You can get all history as available through the API.
You can get 100+ dimensions/metrics in one batch into BQ.

Database schema sample for storing social media post

I am building a small social networking website, and I have a question regarding the database schema:
How should I store the posts (text) by a user?
I'll have a separate POST table and will link the USERS table to it through a USERS_POST table.
But then every time the posts on a user's profile are displayed, won't the system have to search the entire USERS_POST table for that user's id?
What else should I do?
Similarly how should I store the multiple places the user has worked or studied?
I understand it's broad but I am new to databases. :)
First, don't worry too much: start by making it work and see where you get performance problems. The database might be a lot quicker than you expect. Also, it is often much easier to see what the best solution is when you have an actual query that is too slow.
Regarding your design, if a post is never linked to more than one user then forget the USERS_POST table and put the user id in the POST table. In either case an index on the user id would help (as in not having to read the whole table) when the database grows large.
Multiple places for a single user you would store in an additional table, for instance called USERS_PLACES; give it a user_id column to link it to USERS, plus other columns for the data you wish to store per place.
BTW, in PostgreSQL you might want to keep all object (tables, columns, ...) names lowercase, because unless you take care to always quote them like "USERS", PostgreSQL will make them lowercase, which can be confusing.
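A minimal sketch of that layout in PostgreSQL (column names are only placeholders), kept lowercase as suggested:

CREATE TABLE users (
  user_id SERIAL PRIMARY KEY,
  name    TEXT NOT NULL
);

CREATE TABLE post (
  post_id    SERIAL PRIMARY KEY,
  user_id    INTEGER NOT NULL REFERENCES users (user_id),  -- each post belongs to exactly one user
  body       TEXT NOT NULL,
  created_at TIMESTAMPTZ DEFAULT now()
);

-- lets "all posts by this user" use an index instead of reading the whole table
CREATE INDEX post_user_id_idx ON post (user_id);

-- one row per place a user has worked or studied
CREATE TABLE users_places (
  user_id    INTEGER NOT NULL REFERENCES users (user_id),
  place_name TEXT NOT NULL,
  kind       TEXT   -- e.g. 'worked' or 'studied'
);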

Efficient Database Structure

I'm working on an app where part of it involves people liking and commenting on pictures other people posted. Obviously I want the user to be notified when someone comments on/likes their picture, but I also want that user to be able to see the pictures that they posted. This brings up a couple of structuring questions.
I have a table that stores an image with its ID, the image, other info such as likes/comments, date posted info, and finally the userID of the user that posted the image:
Here's that table structure:
Image Posts Table: |postID|image|misc. image info|userID|
The userID is used to grab information from the users entry in the user table for notifications. Now when that user looks at a page containing his own posts I have two options:
1.) Query the Image Posts Table for any image containing that user's userID.
2.) Create a table for each user and put a postID of each image they posted :
Said User's Table: |postID|
I would assume that the second option would be more efficient because I don't have to query a table with a large amount of entries. Are there any more efficient ways to do this?
Obviously I should read up on good database design so do any of you have any good recommendations?
Multiple tables of identical structure almost never make sense. Writing queries using your 2nd option would become ugly in short order. Stick with one large posts table; databases are designed to handle tables with many rows.
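To illustrate in plain relational terms (ignoring Parse specifics for a moment; the names follow your table sketch and are only placeholders), option 1 is an indexed lookup rather than a scan of the whole table:

-- index the poster column once
CREATE INDEX image_posts_user_id_idx ON image_posts (user_id);

-- "pages containing his own posts" then becomes a cheap index seek
SELECT post_id, image, misc_image_info
FROM image_posts
WHERE user_id = 42;   -- the current user's id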
I would recommend against manually storing the userID, as Parse will do its own internal magic if you just set a property called user to the current user. Internally it stores the ID and marks it as an object reference. They may or may not have extra optimizations in there for query performance.
Given that the system is designed around the concept of references, you should keep to just the two tables/classes you mentioned.
When you query the Image Posts table you can just add a where expression using the current user (again it internally gets the ID and searches on that). It is a fully indexed search so should perform well.
The other advantage is that when you query the Image Posts table you can use the include method to include the User object it is linked to, avoiding a 2nd query. This is only available if you store a reference instead of manually extracting and storing the userID.
Have a look at the AnyPic sample app on the tutorial page as it is very similar to what you mention and will demonstrate the ideas.

How to import complex relational data into SQL Server from Excel

We have business users who are entering product information into Excel spreadsheets. I have been tasked with coming up with a way of entering this information into our SQL Server DB. The problem is that the Excel spreadsheets aren't just flat tables, they're hierarchical. They're something like this:
-[Product 1] [Other fields]...
    -[Maintenance item 1] [Other fields]...
        -[Maintenance task 1] [other fields]...
    -[Maintenance item 2] [Other fields]...
        -[Maintenance task 2] [other fields]...
        -[Maintenance task 3] [other fields]...
-[Product 2] [Product Description] [Other fields]...
ETC.......
So there can be 0-many maintenance items for a product and 0-many maintenance tasks for a maintenance item. This is how the database is structured. I need to come up with a standard Excel template I can send out to our business users so they can input this information, and then figure out how to export this into SQL Server. The volume is going to be high, so I need the import to be somewhat automated. How should I do this?
Welcome to the worst possible way to store data and try to import it into a database. If at all possible do not let them create garbage Excel spreadsheets like that. That method is bound to create very many bugs in the data imports and you will hate your life forever if you have to support this mess.
I can't believe I'm even suggesting this, but can you get them to use a simple Access database instead? It could even link directly to the SQL server database and store the data correctly. By using Access forms, the users will find it relatively easy to add and maintain information and you will have far fewer problems than trying to import Excel data in the form you described. It would be a far less expensive and far less error prone solution to your problem.
You are stuck with the format. The best way I have found to do something like this is to import it as-is into a staging table, add the ids to every subordinate row (you may end up looping to do this), and then drag the information out to relational staging tables and then import into the production database.
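Roughly, the staging step could look like this (SQL Server syntax with invented names; the logic that ties children to parents is the fiddly part and is only hinted at here):

-- 1. land the spreadsheet as-is, preserving the original row order
CREATE TABLE staging_raw (
  row_num    INT IDENTITY(1,1) PRIMARY KEY,
  row_type   VARCHAR(20),    -- 'Product', 'Item' or 'Task', derived during the import
  parent_row INT NULL,       -- filled in below
  col1       NVARCHAR(255),
  col2       NVARCHAR(255)
);

-- 2. point each maintenance item at the nearest preceding product row
UPDATE i
SET parent_row = (SELECT MAX(p.row_num)
                  FROM staging_raw p
                  WHERE p.row_num < i.row_num AND p.row_type = 'Product')
FROM staging_raw i
WHERE i.row_type = 'Item';
-- (repeat the same pattern for tasks against items, then move the rows into relational staging tables)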
You can create all this using SSIS, but it won't be easy, it won't be quick, and it will be very prone to bugs if users aren't disciplined about exactly how they enter data (and they never are without a set of forms to fill out). Make sure you reject the Excel spreadsheet completely and send it back to the user if it strays at all from the prescribed structure. Trust me on this.
I'd estimate the Access solution to take about a month and the Excel solution to take at least six months of development. Really, that's how bad this is going to be.
I don't believe you'll find an import tool that will do this for you. Instead, you're going to have to write a script to ETL the spreadsheet files. I do a lot of this in Python (I'm doing it today, in fact).
Make sure that you handle exceptions at the per-cell level, reporting to the user exactly which cell had unexpected information. With spreadsheets created by hand, it is guaranteed that you will have to handle this on a regular basis.
That said, if this is coming to you as XLSX it might be possible to develop an XML translation to convert it to some more tractable XML document.
It probably makes more sense to break it up into several Excel sheets: one for products, another for maintenance items, and another for maintenance tasks. For each one, they'll have to enter some kind of ID to link them back together (e.g. maintenance_task_id=1 links to maintenance_item_id=4). That can be a pain for business users to remember, but the only alternative is to enter lots of redundant data for each line.
Next, create a normalized database model (to avoid storing redundant data) and fill it by writing an app or script to parse through your Excel sheets. Vague and high-level, but that's how I'd do it.
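For what it's worth, the normalized target shape (which presumably already matches the existing database) would be roughly this, with placeholder column names:

CREATE TABLE product (
  product_id  INT IDENTITY(1,1) PRIMARY KEY,
  name        NVARCHAR(255) NOT NULL,
  description NVARCHAR(MAX)
);

CREATE TABLE maintenance_item (
  maintenance_item_id INT IDENTITY(1,1) PRIMARY KEY,
  product_id          INT NOT NULL REFERENCES product (product_id),   -- 0..many items per product
  name                NVARCHAR(255)
);

CREATE TABLE maintenance_task (
  maintenance_task_id INT IDENTITY(1,1) PRIMARY KEY,
  maintenance_item_id INT NOT NULL REFERENCES maintenance_item (maintenance_item_id),  -- 0..many tasks per item
  name                NVARCHAR(255)
);

The IDs the business users type into the sheets only need to be consistent within one workbook; the import script can map them onto the real identity values as it loads.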
I agree with previous posts in general...
My suggestion: avoid the spreadsheet entirely. Spend your time making a simple front-end form, preferably a web-based one. Catch the data as cleanly as possible (ANYTHING here will be better than the spreadsheet's cleanliness, including just having named fields).
You will spend less time in the end.
I would add VBA code to the template to add as much structure and intelligence as possible to the user data entry and validation.
In the extreme case of this you make the user enter all data via Forms which put all the validated data on the sheet, and then have an overall validation routine built into the Save or Close event.
Less extreme would be to add 3 command buttons driving code for
- add product
- add maintenance item
- add maintenance task
and some overall validation code at save/close. This way you add as much smarts as possible to the data entry tasks.
Use Named Cells or other hidden metadata created by the VBA code as markers so that your DB update routine can make better sense of the data.
The last one I did like this took 3-4 man-weeks including the DB update routines, but I think it was probably more complicated than your example. But if you are not experienced with VBA and the Excel object model and events, it would obviously take much longer.