Is it possible to mandate a table expiration date? - google-bigquery

We supply data to our user base in BigQuery. Our users are free to query that data as much as they like; however, we currently don't allow them to create new BigQuery tables or write data into those tables. The reason for this is that it keeps us compliant with various regulations we have to conform to, particularly where that data contains PII.
Our users have told us this limits the usefulness of the data, which we think is very fair. There are legitimate reasons for a user to want to do something like:
CREATE TABLE `MyProject.MyDataset.t2` AS
SELECT customer_id, date1 FROM `AnotherProject.AnotherDataset.t1`
They also want and need to do something similar in code:
from google.cloud import bigquery
client = bigquery.Client()
# Create the dataset that will hold the new table
dataset_id = "MyDataset"
dataset_id_full = f"{client.project}.{dataset_id}"
dataset = bigquery.Dataset(dataset_id_full)
dataset = client.create_dataset(dataset)
# Write the query results into a new table in that dataset
job_config = bigquery.QueryJobConfig()
job_config.destination = f"{dataset_id_full}.t2"
query = """
SELECT customer_id, date1 FROM `AnotherProject.AnotherDataset.t1`
"""
query_job = client.query(query, job_config=job_config)
query_job.result()  # Waits for the query to finish
Our compliance experts have decreed that users will be allowed to write data into such tables as long as the table only exists temporarily (say, for 20 days).
I am hoping we can enforce this by creating the datasets ourselves and setting a default table expiration time. The documentation for that setting states:
If you set the expiration when the table is created, the dataset's default table expiration is ignored.
At any point after the table is created, you can update the table's expiration time
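For reference, that dataset-level default can be set with BigQuery DDL when we create the dataset, along these lines (a sketch only, assuming a 20-day default and reusing the dataset name from the example above):
CREATE SCHEMA `MyProject.MyDataset`
OPTIONS (default_table_expiration_days = 20);
-- or, for a dataset that already exists:
ALTER SCHEMA `MyProject.MyDataset`
SET OPTIONS (default_table_expiration_days = 20);
As the quoted documentation says, though, this default is ignored whenever a user sets an expiration at table-creation time, which is why we need the permission restrictions below.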
We would like to allow users to create tables in a dataset that we provide for them but enforce the table expiration. Hence we want to give users permission to:
create tables
write data into those tables
read from those tables
while at the same time prohibiting them from:
defining a table expiration time of their choosing when they create a table
choosing to create a table without an expiration time
updating a table's expiration time
Is there a combination of IAM permissions that allows us to do this?

Related

Database designing - "1 table or 2 tables", "If 2 tables: create copy or join"

I will be storing data with the same structure but of a different nature in one or more tables.
Scenario: There will be a template of data that will be shown to customers on registration.
On registration the customers can edit the details and save them as their local settings. This will not impact the global template settings that were displayed to the customer initially.
I have a few options now.
Option # 1: I can store all global and local settings in just one table
For customer registration page
select * from Settings where RegistrationID is null
For retrieving customer settings
select * from Settings where RegistrationID = {RegID}
Therefore, every time a new customer comes to register, the system will query this ever-growing table to retrieve the template (global) settings.
Option # 2: Storing global and local settings in separate tables
This will have following further options
2.a)
Copy all the settings into the local.Settings table for the user on registration, for later retrieval. In this case the system will only ever refer to one table, but there will be a lot of duplication for each registration.
select * from local.Settings where RegistrationID = '{RegID}'
2.b) Store only the edited settings in the local.Settings table and create another table, local.RegistrationSettings(RegistrationID, SettingID), to handle the one-to-many relationship. In this case the system will need to use a union of global.Settings and local.Settings. I will also have to consider the uniqueness of setting IDs across both tables, which I think is manageable if I define a separate ID range for each table.
select * from
((select * from global.Settings) union all (select * from local.Settings where RegistrationID = {RegID})) as s
inner join RegistrationSettings as r on r.SettingID = s.ID where r.RegistrationID = {RegID}
Question
Which one is the better option?
Is there any other way this could be handled?
I'd think it's better to have two different tables, allowing you to have better control of your data and making your system more efficient.
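For illustration, a minimal sketch of the two-table layout from option 2.b (the schema and table names come from the question; the column names and types here are only assumptions):
-- template (global) settings shown to every customer on registration
CREATE TABLE global.Settings (
ID INT PRIMARY KEY,
Name VARCHAR(100),
Value VARCHAR(255)
);
-- only the settings a customer has edited
CREATE TABLE local.Settings (
ID INT PRIMARY KEY, -- keep these IDs in a range separate from global.Settings
Name VARCHAR(100),
Value VARCHAR(255)
);
-- one-to-many link between a registration and its settings
CREATE TABLE local.RegistrationSettings (
RegistrationID INT,
SettingID INT, -- may point into global.Settings or local.Settings
PRIMARY KEY (RegistrationID, SettingID)
);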

Storing logs in postgres database as text vs json type

Let's say we want to create a table to store logs of user activity in a database. I can think of 2 ways of doing this:
A table having a single row for each log entry that contains a log id, a foreign key to the user, and the log content. This way we will have a separate row for each activity that happens.
A table having a single row for the activity of each unique user (foreign key to the user) and a log id. We can have a json type column to store the logs associated with each user. Each time an activity occurs, we can get the associated log entry and update its JSON column by appending the new activity to it.
Approach 1 provides a clean way of adding new log entries without the need to update old ones. But querying such a table to get the activity of a single user would have to scan the entire table (unless indexed).
Approach 2 adds complexity to adding a new user activity since we would have to fetch and update the JSON object but querying would just return a single row.
I need help to understand if one approach can be clearly advantageous over the other.
Databases are optimized to store and retrieve small rows from a big table. So go for the first solution. Indexes make joins like that fast.
Lumping all data for a user into a single JSON object won't make you happy: each update would have to read, modify and write the whole JSON, which is not efficient at all.
If your logs change a lot in terms of properties, I would create a table with:
log_id, user_id (FK), and the log in JSON format, with each row as one activity.
It won't be a performance problem if you index your table. In PostgreSQL you can even index fields inside a JSON column.
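For example, an expression index on a single field inside the JSON column could look like this (a sketch, assuming a table such as user_log(user_id int, log_content json) and an 'action' key inside the log; with jsonb you could also use a GIN index):
CREATE INDEX idx_user_log_action ON user_log ((log_content ->> 'action'));
-- queries filtering on that field can then use the index
SELECT * FROM user_log WHERE log_content ->> 'action' = 'login';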
Approach 2 will become slower to update after each update, as the column size grows. Also, querying will be more complex.
Also consider a logging framework that can parse semi-structured data into database columns, such as Serilog.
Otherwise I would also recommend your option 1: a single row per log entry with an index on user_id. I would, however, suggest adding a timestamp column so the query engine can sort events into order without having to parse the json itself for a timestamp:
CREATE TABLE user_log
(
log_id bigint, -- (PRIMARY KEY),
log_ts timestamp NOT NULL DEFAULT(now()),
user_id int NOT NULL, --REFERENCES users(user_id),
log_content json
);
CREATE INDEX ON user_log(user_id);
SELECT user_id, log_ts, log_content ->> 'action' AS user_action FROM user_log WHERE user_id = ? ORDER BY log_ts;

Table design in Azure Table Storage

I need to organize a REST service for messaging using Azure. Right now I have a problem with the DB design. I have 3 tables: users, chats, and messages of chats.
Users contains user data such as login, password hash, and salt.
Chats contains PartitionKey = userLogin, RowKey = chatId, and nowInChat (whether the user is currently in the chat).
Messages of chat contains a PartitionKey which consists of
userlogin_chatId_datetimeticks
(zevis_8a70ff8d-c363-4eb4-8a51-f853fa113fa8_634292263478068039),
RowKey = messageId, plus message and sender (userLogin).
I see a disadvantage in this design: imagine two users communicated actively a year ago and no longer talk, and now one of them wants to look at the history. I would have to send a large number of requests to the server, requesting the data in time intervals such as a week. Sending a single request for everything earlier than today would be ineffective, because we would get the whole history back at once.
How should we change the design of the table?
Because Azure Storage Tables do not support secondary indexes, and storage is very inexpensive, your best option is to store the data twice, using different partition and/or row keys. From the Azure Storage Table Design Guide:
To work around the lack of secondary indexes, you can store multiple copies of each entity with each copy using different PartitionKey and RowKey values
https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/#table-design-patterns
Thank you for your post; you have two options here. The easiest answer, with the least amount of design change, would be to include a StartTime and EndTime in the Chat table. Although these properties would not be indexed, I'm guessing there will not be many rows to scan once you filter on the UserID.
The second option, which requires a bit more work but is cleaner, would be to create an additional table with PartitionKey = UserID, RowKey = DateTimeTicks, and entity properties containing the ChatID. This would enable you to quickly filter by user on a given date or date range. (This is the denormalization answer provided above.)
Hopefully this helps your design progress.
I would create a separate table with these PK and RK values:
PartitionKey = UserID, RowKey = DateTime.MaxValue.Ticks - DateTimeTicks
Optionally you can also append the ChatId to the end of the RowKey above.
This way the most recent communication made by the user will always be on top. You can then simply query the table passing in only the UserId and a take count (i.e. take count = 1 if you want the latest chat entry from the user). The query will also be very fast because, since you use inverted ticks for your row keys, the Azure Table storage service will sort the entries for the same user id in increasing lexicographical order of RowKeys, always keeping the latest chat at the top of the partition, as it will have the minimum inverted tick value.
Even if you add the ChatId at the end of the RowKey (i.e. InvertedTicks_ChatId) the sort order will not change, and the latest conversation will be on top regardless of chat id.
Once you read the entity back, you subtract the inverted ticks from DateTime.MaxValue.Ticks to find the actual date.

Multiple locations and different user privileges for database

I am not sure what the best route to take on this is. I have a client who has 3 different locations for his business. Each location's employees can only access their location's data. The owner can access everything. In addition, different roles should only be able to access their own area (finance can see finance but not sales, etc.).
What is the best way to go about this? The solutions I can think of are:
Create a user table, give it a location ID and role ID, and base the data access on that. This would require adding the location ID to a lot of tables, though.
Create 3 separate databases and have the information display based off of a role ID. This doesn't seem ideal
Use functionality on the DB side, stored procedures, etc...
Retrofitting a multi-tenancy security model into an existing database isn't a simple task - IMO this should be designed into the model from the start.
An extremely simple model (One Role per user, One Location per User) would look like this:
-- You need to add simple lookup tables for Role, Location
CREATE TABLE User
(
UserId INT, -- PK
RoleId INT, -- FK
LocationId INT NULL -- FK
);
All sensitive tables would either directly need the LocationId classification, or need to be joinable to a table which has the LocationId classification, i.e.:
CREATE TABLE SomeTable -- with location-sensitive data
(
Col1 ... Col N,
LocationId INT
);
The hard part however is to adjust all of your system's queries on the sensitive data tables such that they now enforce the Location-specific restriction. This is commonly done as an additional predicate filter which is appended to the where clause of queries done on these tables, and then joining back to the user-location table:
SELECT Col1 ... ColN
FROM SomeTable
INNER JOIN User ON SomeTable.LocationId = User.LocationId
WHERE -- usual filter criteria
AND User.UserId = @UserIdExecutingThisQuery
AND (User.RoleId = @FinanceRoleId -- i.e. the Id for Finance
OR User.RoleId = @AdminRoleId) -- i.e. the Id for Admin
Given the size of that redesign effort, as a short-term solution you might instead look at maintaining 3 distinct regional databases (or 3 regional schemas in the same database), and then using replication or similar to centralize all the data into a master database for the owner role to use.
This will give you the time to redesign your database (and app(s)) to use a multi-tenancy design. I would suggest a more comprehensive model of allowing multiple roles per user, and multiple locations per user (i.e. many-many junction tables), and not the simplistic model shown here.
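A rough sketch of the junction tables for that many-to-many model (the table and column names here are illustrative only):
-- a user can hold several roles
CREATE TABLE UserRole
(
UserId INT, -- FK to User
RoleId INT, -- FK to Role
PRIMARY KEY (UserId, RoleId)
);
-- a user can be granted access to several locations
CREATE TABLE UserLocation
(
UserId INT, -- FK to User
LocationId INT, -- FK to Location
PRIMARY KEY (UserId, LocationId)
);
The location filter in the query above would then join through UserLocation instead of reading LocationId directly from the User table.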

Need help in designing friendship request SQL table

I have an SQL table for friendship requests.
Table is on server - clients are mobile phones
Table:
Key = index, int, auto increment
C1 = userA_ID
C2 = userB_ID
(C1, C2 = unique)
C3 = status (pending, accepted, declined, unfriend....)
For better performance on the mobile side, and to avoid querying the entire friendship-request table all the time, I also store the table in the local DB on the device.
Once the table has been queried, it is stored in the local DB, so if nothing has changed the device does not need to query the server again.
So, on app init (or every time the user enters the app's mailbox), the device asks the server whether there are new messages and friendship-request updates.
For messages it is simple, since each new message has a different ID and I search the server for all messages where id > the last stored id.
But for friendship requests I update the existing row in the server's DB, so the index stays the same.
I thought of two options:
Add a Date column and check for updates done later than the last check (the last check time will be stored in the local DB). I would prefer to do the comparison on indexes rather than on a date.
Get all of the user's friendship-request entries when the app inits and do the comparison locally.
Any recommendations?
Better ideas?
You could add a bit column that tracks whether the request has been viewed, set it to true when you retrieve the request, and then filter on that column. That would probably perform slightly better than storing the date, plus the date last retrieved, and comparing the two each time.
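A minimal sketch of what that could look like (the table, column, and parameter names are illustrative; the bit column is the only addition to the table described in the question):
ALTER TABLE FriendshipRequest ADD Viewed BIT NOT NULL DEFAULT 0;
-- on each check from the device: fetch only the rows the client has not seen yet
SELECT userA_ID, userB_ID, status
FROM FriendshipRequest
WHERE userB_ID = @UserId AND Viewed = 0;
-- then mark them as delivered
UPDATE FriendshipRequest SET Viewed = 1
WHERE userB_ID = @UserId AND Viewed = 0;
-- whenever a request's status changes on the server, reset Viewed to 0 so the client picks up the change on its next check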