How to get segmentation data from a Snowflake table into an API efficiently and cost-effectively?

I have a segmentation project I am working on for my company. We have to build a pipeline that gathers data from our app users, and when a user fits a segment, the app receives that information and does something with it (not in my scope). Currently, the client connects and authenticates to an endpoint that lets it send JSON event data (app started, level completed, etc.) to an Elasticsearch cluster. An Azure Function then grabs the new data every 5 minutes and stores it in Azure Blob Storage, which pushes a message onto a queue that Snowflake reads in order to ingest the JSON files (a rough sketch of this collection step is shown after the tables below). We then use Snowflake to run a task per segment (segments will be decided by the analysts or executives), and the output is written to a table like the one below:
| AccountID | Game   | SegmentID | CreatedAt  | DeletedAt  |
|-----------|--------|-----------|------------|------------|
| 123456789 | Game 1 | 1         | 2021-04-20 | 2021-04-21 |
| 123456789 | Game 1 | 2         | 2021-04-20 |            |
| 123456789 | Game 1 | 3         | 2021-04-20 |            |
Where SegmentID can represent something like
| SegmentID | SegmentType   | SegmentDescription                  |
|-----------|---------------|-------------------------------------|
| 1         | 5 Day Streak  | User played for 5 consecutive days  |
| 2         | 10 Day Streak | User played for 10 consecutive days |
| 3         | 15 Day Streak | User played for 15 consecutive days |
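For reference, the collection step described above (the Azure Function that copies the last five minutes of events from Elasticsearch into Blob Storage for Snowflake to pick up) looks roughly like the sketch below. The index name, container name, connection strings and the elasticsearch-py 8.x style of the search call are assumptions for illustration, not our actual code:

import json
from datetime import datetime, timedelta, timezone

import azure.functions as func
from azure.storage.blob import BlobServiceClient
from elasticsearch import Elasticsearch

def main(mytimer: func.TimerRequest) -> None:
    # Assumed endpoints/credentials; in practice these come from app settings.
    es = Elasticsearch("https://my-es-cluster:9200")
    blobs = BlobServiceClient.from_connection_string("<storage-connection-string>")
    container = blobs.get_container_client("raw-events")

    # Pull everything indexed in the last 5 minutes (the timer interval).
    since = (datetime.now(timezone.utc) - timedelta(minutes=5)).isoformat()
    hits = es.search(
        index="app-events",
        query={"range": {"@timestamp": {"gte": since}}},
        size=10000,
    )["hits"]["hits"]

    # Write one newline-delimited JSON file per run; the blob-created event
    # lands on the queue that Snowflake's ingestion watches.
    blob_name = "events/{0}.json".format(datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S"))
    payload = "\n".join(json.dumps(h["_source"]) for h in hits)
    container.upload_blob(name=blob_name, data=payload)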
In the next step of the pipeline, the same API the user authenticated with should issue a request when the game boots up to grab all the segments that user matches. The dev team will then decide where, when in the session, and how to use that information to personalize content. Something like:
select
    SegmentID
from
    SegmentTable
where
    AccountID = '{AccountID the App authenticated with}'
    and Game = '{Game the App authenticated with}'
    and DeletedAt is null
Response:
| SegmentID |
|-----------|
| 2         |
| 3         |
Serialised:
{"SegmentID": [2,3]}
We expect about 300K-500K users per day. My question: what would be the most efficient and cost-effective way to get this information from Snowflake back to the client, so that this volume of users can hit the same endpoint without performance issues and without it becoming costly?

OK, so a bit of a workaround, but I created an external function on Snowflake (backed by Azure Functions) that upserts the data into a local MongoDB cluster. The API connects to the MongoDB instance, which can handle the large volume of concurrent connections, and since it runs on a local server it is quite cheap. The only costs are the data transfer from Snowflake to MongoDB, the App Service Plan for the Azure Functions (I could not use the consumption plan because sending data to our internal server required a VNET, a NAT Gateway and a static outbound IP address in Azure), and the API Management service I had to create in Azure.
So how does it work? At the end of each stored procedure in Snowflake, I collect the segments that have changed (a new row, or DELETED_AT newly set) and trigger the external function, which upserts the data into MongoDB using the pymongo client.
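For anyone trying the same thing, the handler behind the external function is roughly the sketch below. The column order, database/collection names and connection string are my simplifications; the request/response shape follows the standard Snowflake external-function contract ({"data": [[row_number, col1, ...], ...]} in, one result row per input row out):

import json

import azure.functions as func
from pymongo import MongoClient, UpdateOne

# Assumed internal cluster and collection layout.
client = MongoClient("mongodb://internal-mongo:27017/")
segments = client["segmentation"]["user_segments"]

def main(req: func.HttpRequest) -> func.HttpResponse:
    # Snowflake sends {"data": [[row_number, AccountID, Game, SegmentID, DeletedAt], ...]}
    rows = req.get_json()["data"]
    ops = [
        UpdateOne(
            {"AccountID": account_id, "Game": game, "SegmentID": segment_id},
            {"$set": {"DeletedAt": deleted_at}},
            upsert=True,
        )
        for _, account_id, game, segment_id, deleted_at in rows
    ]
    if ops:
        segments.bulk_write(ops, ordered=False)

    # The external function must return one result per input row.
    body = {"data": [[row[0], "ok"] for row in rows]}
    return func.HttpResponse(json.dumps(body), mimetype="application/json")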

Related

How to manually test a data retention requirement in a search functionality?

Say data needs to be kept for 2 years. Then all data created 2 years + 1 day ago should no longer be displayed and should be deleted from the server. How do you manually test that?
I'm new to testing and can't think of any other ways. Also, we cannot do automation due to time constraints.
You can create data in the database backdated by more than two years and test whether it is deleted automatically. Alternatively, you can change the current business date in the database and test it that way.
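As a concrete illustration of the backdating idea (the table and column names here are invented, and SQLite stands in for whatever database the application actually uses):

import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect("app.db")
conn.execute("create table if not exists search_history (keyword text, searched_on text)")

# Seed a row dated 2 years + 1 day ago, i.e. just past the retention window.
backdated = (date.today() - timedelta(days=2 * 365 + 1)).isoformat()
conn.execute("insert into search_history (keyword, searched_on) values (?, ?)",
             ("sport", backdated))
conn.commit()

# After the retention/cleanup job runs, this row should no longer exist.
count = conn.execute("select count(*) from search_history where searched_on = ?",
                     (backdated,)).fetchone()[0]
print("deleted as expected" if count == 0 else "still present - retention failed")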
For the data retention functionality, a manual tester needs to remember the search data so that they can later run the test cases for the retention feature.
Taking a social networking app as an example: as a manual tester you need to remember all the users you searched for recently.
To check the retention period, you can ask a backend developer to shorten it (from, say, one year to 10 minutes) for testing purposes.
Even if you delete the search history and then start typing an already-entered search term, the related result should appear in the first position of the results. Data retention policies define what data should be stored or archived, where, and for exactly how long. Once the retention period for a particular data set expires, that data can be deleted or moved to secondary or tertiary storage as historical data, depending on the requirement.
Let us understand with an example. Suppose we have the data below in our database table, based on past searches made by users. With the help of this table you can perform this testing with minimum effort and optimum results. The current date is '2022-03-10', and the Status column states whether the data is still available in the database: Visible means available, while Expired means deleted from the table.

| Search Keyword | Search On Date | Search Expiry Date | Status            |
|----------------|----------------|--------------------|-------------------|
| sport          | 2022-03-05     | 2024-03-04         | Visible           |
| cricket news   | 2020-03-10     | 2022-03-09         | Expired - Deleted |
| holy books     | 2020-03-11     | 2022-03-10         | Visible           |
| dance          | 2020-03-12     | 2022-03-11         | Visible           |

Redis Caching Structure use Case

Problem statement: one of our modules (the Ticket module, a stored procedure) takes 4-5 seconds to return data from the DB. The stored procedure supports 7-8 filters and joins 4-5 tables to resolve the display text for the IDs stored in the Ticket table (e.g. client name, ticket status, ticket type), which has hampered the performance of the SP.
Current tech stack: ASP.NET 4.0 Web API, MS SQL Server 2008.
We are planning to introduce Redis as a caching server and Node.js, with the aim of improving performance and scalability.
Use case: we have a Service Ticket module and it has the following attributes:
TicketId
ClientId
TicketDate
Ticket Status
Ticket Type
Each user of this module has access to a fixed set of clients, i.e.
User1 has access to tickets of Clients 1, 2, 3, 4, ...
User2 has access to tickets of Clients 1, 2, 5, 7, ...
So when User1 accesses the Ticket module, he should be able to filter service tickets on TicketId, Client, Ticket Date (from and to), Ticket Status (Open, Hold, In Process, ...) and Ticket Type (Request, Complaint, Service, ...). And since User1 only has access to Clients 1, 2, 3, 4, ..., the cache should only return the tickets of the clients he has access to.
I would appreciate it if you could share your views on how we should structure this in Redis, i.e. what we should use for each of the above items (hash, set, sorted set, ...), and how we should filter the tickets depending on which clients the respective user has access to.
Redis is a key/value store. I would use a hash per ticket with a structure like:
Key: ticketId
Subkeys: clientId, ticketDate, ticketStatus, ticketType
Handle search, sorting, etc. programmatically in the application and/or in Lua.
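To also cover the per-user client access, one possible layout is sketched below with redis-py. The key names are mine, not an established schema, and (as noted above) sorting and date-range filtering would still happen in application code or Lua:

import redis

r = redis.Redis(decode_responses=True)

# ticket:<id>         -> hash with the ticket fields
# client:<id>:tickets -> set of ticket ids belonging to that client
# user:<id>:clients   -> set of client ids the user is allowed to see

def save_ticket(ticket):
    r.hset("ticket:{0}".format(ticket["TicketId"]), mapping=ticket)
    r.sadd("client:{0}:tickets".format(ticket["ClientId"]), ticket["TicketId"])

def tickets_for_user(user_id, status=None, ticket_type=None):
    client_ids = r.smembers("user:{0}:clients".format(user_id))
    if not client_ids:
        return []
    # Union of ticket ids across only the clients this user may access.
    ticket_ids = r.sunion(["client:{0}:tickets".format(c) for c in client_ids])
    tickets = [r.hgetall("ticket:{0}".format(t)) for t in ticket_ids]
    # Remaining filters (status, type, date range) applied in application code.
    return [t for t in tickets
            if (status is None or t.get("TicketStatus") == status)
            and (ticket_type is None or t.get("TicketType") == ticket_type)]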

Amount of Azure SQL Database space required

I want to save the addresses of some 100,000 people in Azure SQL Database. The schema for the address table will look like this.
Each table field is NVARCHAR(250).
First Name
Middle Name
Last Name
Email
Phone 1
Phone 2
Phone 3
Fax 1
Twitter
Facebook
LinkedIn
Address 1
Address 2
City
State
Country
How many GB of storage do I need to store 100,000 addresses, and how much additional space does SQL Server require?
Does Microsoft Azure charge per call for retrieving or saving data?
Azure Active Directory B2C is suitable for managing a large number of people and their data.
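On the sizing question itself, a rough worst-case estimate, assuming all 16 NVARCHAR(250) columns are filled to their limit (NVARCHAR stores UTF-16, roughly 2 bytes per character):

columns = 16
max_row_bytes = columns * 250 * 2          # 8,000 bytes per row at the limit
rows = 100_000
print(max_row_bytes * rows / 1024 ** 3)    # ~0.75 GB before indexes and overhead

Real rows will be far smaller because NVARCHAR only stores the characters actually present, so expect a small fraction of that plus index overhead. As for the second question, Azure SQL Database is billed by provisioned compute (DTUs or vCores, or per-second serverless compute) plus storage, not per query, though outbound data transfer from the region can incur bandwidth charges.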

How to send data to only one Azure SQL DB Table from Azure Streaming Analytics?

Background
I have set up an IoT project using an Azure Event Hub and Azure Stream Analytics (ASA), based on tutorials from here and here. JSON-formatted messages are sent from a Wi-Fi-enabled device to the Event Hub using webhooks, then fed through an ASA query and stored in one of three Azure SQL tables based on the input stream they came from.
The device (a Particle Photon) transmits 3 different messages with different payloads, for which there are 3 SQL tables defined for long-term storage/analysis. The next steps include real-time alerts and visualization through Power BI.
The ASA Query
SELECT
    ParticleId,
    TimePublished,
    PH,
    -- and other fields
INTO TpEnvStateOutputToSQL
FROM TpEnvStateInput

SELECT
    ParticleId,
    TimePublished,
    EventCode,
    -- and other fields
INTO TpEventsOutputToSQL
FROM TpEventsInput

SELECT
    ParticleId,
    TimePublished,
    FreshWater,
    -- and other fields
INTO TpConsLevelOutputToSQL
FROM TpConsLevelInput
Problem: for every message received, the data is pushed to all three tables in the database, not only to the output specified in the query. The table the data belongs in gets a new row as expected, while the other two tables are populated with NULLs in the columns for which no data existed.
From the ASA documentation, my understanding was that the INTO keyword directs the output to the specified sink. That does not seem to be the case, as the output from all three inputs gets pushed to all three sinks (all three SQL tables).
The test script I wrote for the Particle Photon will send one of each type of message with hardcoded fields, in the order: EnvState, Event, ConsLevels, each 15 seconds apart, repeating.
Here is the query (run in Visual Studio) I used to show the output being sent to all tables, with one column from each table:
SELECT
    t1.TimePublished as t1_t2_t3_TimePublished,
    t1.ParticleId as t1_t2_t3_ParticleID,
    t1.PH as t1_PH,
    t2.EventCode as t2_EventCode,
    t3.FreshWater as t3_FreshWater
FROM dbo.EnvironmentState as t1, dbo.Event as t2, dbo.ConsumableLevel as t3
WHERE t1.TimePublished = t2.TimePublished AND t2.TimePublished = t3.TimePublished
For an input event of type TpEnvStateInput where the key 'PH' would exist (and not keys 'EventCode' or 'FreshWater', which belong to TpEventInput and TpConsLevelInput, respectively), an entry into only the EnvironmentState table is desired.
Question:
Is there a bug somewhere in the ASA query, or a misunderstanding on my part on how ASA should be used/setup?
I was hoping I would not have to define three separate Stream Analytics containers, as they tend to be rather pricey. After running through this tutorial, and leaving 4 ASA containers running for one day, I used up nearly $5 in Azure credits. At a projected $150/mo cost, there's just no way I could justify sticking with Azure.
ASA is intended for Complex Event Processing. You are using ASA in your queries essentially to pass data from the Event Hub to tables. It will be much cheaper to host a simple "worker web app" that processes the incoming events instead.
This blog post covers the best practices:
http://blogs.msdn.com/b/servicebus/archive/2015/01/16/event-processor-host-best-practices-part-1.aspx
ASA is great if you are doing transformations, filters, or light analytics on your input data in real time. It also works well if you have Azure Machine Learning models exposed as functions (currently in preview).
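If you do go the worker route, a minimal sketch with the azure-eventhub Python SDK is below (the linked post describes the same pattern with the .NET EventProcessorHost). The connection string, hub name and routing rules are placeholders:

from azure.eventhub import EventHubConsumerClient

consumer = EventHubConsumerClient.from_connection_string(
    conn_str="<event-hub-connection-string>",
    consumer_group="$Default",
    eventhub_name="<hub-name>",
)

def on_event(partition_context, event):
    payload = event.body_as_json()
    # Route on the payload shape instead of paying for a separate ASA job per stream.
    if "PH" in payload:
        table = "EnvironmentState"
    elif "EventCode" in payload:
        table = "Event"
    else:
        table = "ConsumableLevel"
    print("would insert into", table, payload)   # replace with the actual SQL insert
    partition_context.update_checkpoint(event)

with consumer:
    consumer.receive(on_event=on_event, starting_position="-1")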
In your example, all three "SELECT ... INTO" statements read from the same input source and have no filter clauses, so every row is selected into every output.
If you only want specific rows to go to each output, you have to specify a filter condition. For example, assuming you only want records with a non-null value in column "PH" to reach the output "TpEnvStateOutputToSQL", the ASA query would look like this:
SELECT
    ParticleId,
    TimePublished,
    PH
    -- and other fields
INTO TpEnvStateOutputToSQL
FROM TpEnvStateInput
WHERE PH IS NOT NULL

How to clean up inactive players in redis?

I'm making a game that uses redis to store game state. It keeps track of locations and players pretty well, but I don't have a good way to clean up inactive players.
Every time a player moves (it's a semi-slow moving game. Think 1-5 frames per second), I update a hash with the new location and remove the old location key.
What would be the best way to keep track of active players? I've thought of the following
Set some key on the user to expire. Update every heartbeat or move. Problem is the locations are stored in a hash, so if the user key expires the player will still be in the same spot.
Same, but use pub/sub to listen for the expiration and finish cleaning up (seems overly complicated)
Store heartbeats in a sorted set, have a process run every X seconds to look for old players. Update score every heartbeat.
Completely revamp the way I store locations so I can use expire.. somehow?
Any other ideas?
Perhaps use separate redis data structures (though in the same database) to track user activity and user location.
For instance, track users currently online separately using redis sets:
[my code snippet is in python using the redis-python bindings, and adapted from example app in Flask (python micro-framework); example app and the framework both by Armin Ronacher.]
from redis import Redis
from time import time

r1 = Redis(db=1)

When the function below is called, it creates a key based on the current unix time in minutes and then adds the player to a set under that key. I would imagine you would want to set the expiry at, say, 10 minutes, so at any given time you have 10 live keys (one per minute).

def record_online(player_id):
    now = int(time())
    k1 = "playersOnline:{0}".format(now // 60)  # one key per minute
    r1.sadd(k1, player_id)
    r1.expire(k1, 600)  # keep each per-minute set alive for 10 minutes
So to get all active users, just union all of the live keys (in this example that's 10 keys, a purely arbitrary number), like so:
def active_users(listOfKeys):
    return r1.sunion(listOfKeys)
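A small usage example (the helper name is mine), building the keys for the last 10 minutes to pass into active_users:

def last_n_minute_keys(n=10):
    this_minute = int(time()) // 60
    return ["playersOnline:{0}".format(this_minute - i) for i in range(n)]

currently_active = active_users(last_n_minute_keys())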
This solves your clean-up issue thanks to the TTL: inactive users simply stop appearing in the live keys because those keys constantly recycle; that is, inactive users are only keyed to old timestamps, which don't persist in this example (though they could be written to a permanent store by redis before expiry). In any event, this clears inactive users from your active redis db.