ReducyByKeyAndWindowByCount in Spark stateful streaming aggegations - apache-spark-sql

I've to inner join two relational tables extracted from Oracle.
Actually i want to perform 1-to-1 join to get one row per primary key with aggegated in list values from the second table. So before joining 1-to-1 two tables i have to reduce all my rows by key to a 1 with values kept in the list.
Here is the illustration of what i need:
[![tables aggregation][1]][1]
And here i've met a problem which is when to stop aggegation for my key and pass aggegated entity to the next step. Spark offers solutions for that by providing window intervals and watermaking for late data. And so assumption for keeping data consistency is the time it receives the data. It is feasible and applicable for infinite datasets but in my case i exactly know the count of aggegations for each key. For exampe for customer_id 1000 i know exactly that there are only 3 products and after i've aggegated 3 products i know that i can stop aggegation now and go to the next streaming step in my pipeline. How can this solution be implemented using Spark and streaming? I know there is reduceByKeyAndWindow operation but in my case i need something like reduceByKeyAndWindowByCount.
Count will be stored in a static dataset or simply store it in a row as an additional data.

Finally we've decided to switch from streaming to core spark with batch processing because we have finite dataset and that thing works well for our use case. We've came to a conclusion that spark streaming was designed for processing continuous (which was actually obvious only from it's naming) datasets. And thats why we have only window intervals by time and watermarks to correct network or other delays during transportation. We've also found our design with counters ugly, complex and in the other words bad. It is a live example of a bad design and such growing complexity was a marker that we were moving in the wrong direction and were trying to use a tool for a purpose it was not designed for.

Related

How to decide on a data warehouse schema for clinical dataset?

I am trying to design a schema for the a clinical dataset, MIMIC-III. I am trying to figure out a way to store the events in a way such that users can query the dataset with ease using possibly a star schema. Almost all of the entries such as diagnoses, procedures, notes, charts etc. are related to a single admission. I had a few things in my mind but I have no experience so am finding it difficult to figure out what the best way to do this is.
Create multiple fact tables, for example I would have one fact table for diagnoses, one for procedures, one for lab notes but this just seems like too many fact tables with little to gain. Like I could have a fact table with an entry for each diagnosis per user per admission but would that give me more benefit than the OLTP schema already implemented?
Create one fact table with a row per admission with multiple columns/dimensions like diagnoses, procedures etc. But the issue with is that for most there are multiple diagnoses per admission so I will have to link to a bridge table in most fact tables and then it would look like the image below. The issue with this is the required joins for the queries.
The third way that I read about is using something like an entity-attribute-value fact table where I have a fact table with each row being one fact. For example, one row could be something like (patientid - addmissionid - (Attribute) Heart Rate Reading - (VALUE) 120bpm) this would create a single fact table with almost everything inside and fewer joins required but it would require the end user to pivot the table after querying due to the nature of EAV's thus making it more complicated for the end user. It would look like the i2b2 star schema.
The last way I thought of was doing an entry per event into the fact table but having many columns in the fact table to store dimensions, like (patientid, admissionid, icustay_id, diagnosis, procedure, labnote, labevent, microbiologyevent, etc.) in which patientid, and admissionid will be in all rows but the rest will depend on the entry so one entry could have just patientid, admissionid and a single procedure. I don't know how the end result of this will be like in terms of querying due to my lack of experience. I also don't know whether or not all these entries with almost every column being irrelevant is the way to go.
Any help would be greatly appreciated, I'm trying to have this implemented into BigQuery.
Some tests have concluded that reducing the number of joins can enhance the BigQuery's performance. In other words, if you denormalize most of your data in a single table you will probably get better results since you'll not need to perform many joins.
There are some points that you should think about when deciding the data modeling:
Is it easy to write queries in your model?
If you need many joins, your model will be probably difficult to build queries. If you use nested fields to reduce the number of joins you can simplify the queries you will write. However, if you create very complex nested structures, you will start having problems to query the model again.
Is it easy to join data in your model?
If you have many tables to be joined, your data will be difficult to join. The more denormalized the data is, the easier it is to join.
Easy to update data
If you need to update your data, you should consider that denormalization can help you. If you reduce the number of tables, you will need to update fewer tables. Its important to say that if you create very complex nested fields, it will also be difficult to be updated.
Make data understandable
This is the most important point. Will your data be understandable in a given model? All the previous points are technical and not exactly related to your data. You should design your model considering these points, but your data must make sense to you.
Finally, I would like to summarize some advice:
You can get a better performance denormalizing your data as long as the data keeps understandable.
Use nested fields to denormalize the model but try not creating very complex structures (more than 2 level of nesting).
Keep in mind that your model will need more disk space when you denormalize it so your billing for storage in BigQuery will probably be higher.
I hope it helps
Initial Reading
Trying reading The Data Warehouse Toolkit, chapter 14 has a section on healthcare data modelling.
Modelling vs Storage
But what you should try and figure out is what is the important and high value data, and what is the less important and low value data. Only model and load into the database the high value data. If you try and build the perfect data model, you are never going to deliver any useful reporting to your customers/clients.
What will be used on a daily or hourly basis? This information needs to be in your data model and loaded into the database for aggregation and slicing.
What will be used only occasionally? Once the slicing and dicing is complete, there will be further questions about the minutiae of a small cohort. This is when you go to the bulk storage and retrieve this data from your Data Lake.
Data Lake
Instrument measurements are great examples of low value information. Most individual measurements are not useful, but your ETL could review them and make an overall determination of some kind. E.G. Blood Pressure Normal, Heart Rate High, etc.
Prescription drugs are another example of low value information. In the database you may set a flag if multiple prescriptions are present or something similar. Freeform notes are another. Once some cohort has been decided on based on many other factors, you enable a data scientist to process the note with some machine learning but doing this for all notes is not very useful.
Group/Junk Dimensions
A number of these measurement determinations could be lumped together inside of a group/junk dimension as a way of resolving maintaining the grain of the fact and keeping the fact from exploding with low value rows. You could even hold off on modelling a junk dimension until your customers/clients start telling you about long executions for specific types of data extracts. You can then design a junk dimension to serve these higher value measurement determinations to enable additional aggregations or slicing prior to extracting the insightful data from the data lake.
File Structure
In your data lake I would have many file schemas for the low value data. These could be JSON, parquet, csv, or whatever you prefer. You would include the data needed to connect it back to the fact along with the data specific to the file type.
PatientId
AdmissionId
MeasureType
MeasureValue
Date
Time
The point is most of this data will never be looked at, but occasionally a few records are high value. You just don't know which they will be, so you store them as cheaply as possible until they are needed.
The data lake also allows you to change your file schema as new information becomes available and executing this change is trivial compared with changing a database star schema.
Use your favourite scripting language to create these files. Python, C#, Azure Function App, AWS Lamda, whatever. This will depend on your own skills and resources available.

Normalization of SQL Database with similar data managed by different tools

I'm designing a database for storing a bunch of product data that is both pulled via an API and scraped off the web. This scraper will pull some data that is static and some data that varies with time. Therefore there will one table for each type of data (static/variable). I'm trying to decide if there should be a separate table for variable data that is scraped compared to variable data that is pulled through an API.
At first, I thought they should be stored in separate tables because they are managed by separate tools. However, data will be pulled through the API and scraped on the same schedule (daily), and so they will both be mapped with the same ProductID and date. So, it seems like I could just combine the schema of both tables to save on the join time during queries for processing the data later. The obvious downside to this is managing whether rows need to be created or updated whenever one of the processes runs (which of the scraper vs API tools create or update rows).
For what it's worth, these scripts will be pulling data for millions (maybe tens of millions) of rows per day, and storing it for quite a while. So, the tables are going to get quite huge, and that's why I'm concerned with join times later on.
Here's an example in case this is all a little cloudy as an example. There are multiple industries for this, but I'll just use real estate:
Scraped Static Data: ProductID, Address, City, State, Zip, SquareFeet, etc.
Scraped Variable Data: ProductID, Price, PricePerSqFt, etc.
API Variable Data: ProductID, PageHits, UniqueVisitors, etc.
Mainly just the variable data is the concern here. So, just summarize, separate tables for the sake of general design principles, or one table for the sake of speed on joins?
Thanks in advance for the input
The example you give indicates that, apart from having 2 or 3 tables, you should also consider having just one table for both static and variable data. As long as the key of everything is just the product id, you can keep all information describing a particular id value in one record. Or do you intend to have a time stamp as part of the key of your variable data?
Once this has been decided, I can't see any advantage in having more tables than necessary.
The joins you mention won't be particularly complicated, as they basically mean to read a single record from each of your tables, each time using a primary key, which is fast. But still reading 3 records means more effort than reading 2, or only one.
There is no general design principle saying you should have a separate table for each way to collect data. On the contrary, it's the purpose of a database to contain data according to their logical structure without (too much) regard of the technical means of collecting or accessing them.
The logic to decide whether to insert or update a row isn't complicated. Also, if you want to verify your data, you might need some logic anyway, e.g. making sure that variable data only get inserted for an object that already has static data.

SQL Server Time Series Modelling Huge datacollection

I have to implement data collection for replay for electrical parameters for 100-1000's of devices with at least 20 parameters to monitor. This amounts to huge data collection as it will be based very similar to time series.I have to support resolution for 1 second. thinking about 1 year [365*24*60*60*1000]=31536000000 rows.
I did my research but still have few questions
As data will be huge is it good to keep data in same table or should the tables be spitted. [data structure is same] or i should
rely on indexes?
Data inserts also will be very frequent but i can batch them still what is the best way? Is it directly writing to same database
or using a temporary database for write and sync with it?
Does SQL Server has a specific schema recommendation to do time series optimization for select,update and inserts? any out of box
helps for day average ? or specific common aggregate functions i can
write my own but just to know as this a standard problem so they
might have some best practices and samples out of box.**
please let me know any help is appreciated, thanks in advance
1) You probably want to explore the use of partitions. This will allow very effective inserts (its a meta operation if you do the partitioning correctly) and very fast (2). You may want to explore columnstore indexes because the data (once collected) will never change and you will have very large data sets. Partitioning and columnstore require a learning curve but its very doable. There are lots of code on the internet describing the use of date functions in SQL Server.
That is a big number but I would start with one table see if it hold up. If you split it in multiple tables it is still the same amount of data.
Do you ever need to search across devices? If not you can have a separate table for each device.
I have some audit tables that are not that big but still big and have not had any problems. If the data is loaded in time order then make date the first (or only) column of the clustered index.
If the the PK is date, device then fine but if you can get two reading in the same seconds you cannot do that. If this is the PK then if you can load the data by that sort. Even if you have to stage each second and load. You just cannot afford to fragment a table that big. If you cannot load by the sort then take a fill factor of 50%.
If you cannot have a PK then just use date as clustered index but not as PK and put a non clustered index on device.
I have some tables of 3,000,000,000 and I have the luxury of loading by PK with no other indexes. There is no measurable degradation in insert from row 1 to row 3,000,000,000.

What's the probability of repeating a rowkey that depends on datetime in Azure

Is the there the chance to repeat a RowKey in an Azure table that is constructed like this?
string.Format("{0:d19}", DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks)
This is the same to ask:
Will my webrole create two records before we get to a new value for Ticks?
I want to return the entities in reverse chronological order. This table is not expected to be constantly added new entities, but hopefully will have many insert transactions.
Solution
OK, this is the solution I came across thanks to Mark Seeman:
string.Format("{0:d19}+{1}", DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks, Guid.NewGuid().ToString("N"))
Using a Guid, I know for sure that the rowkey will not be repeated.
The probability of getting more than one identical rowkey is 1 as the number of concurrent writes approaches infinity :)
Seriously, a Tick is 100 nanoseconds. Modern processors run at speeds where you will have hundreds of CPU cycles per Tick, so getting identical Ticks will happen.
In fact, I was just talking to one of my customers yesterday. He had that exact problem and had had to hack around it to ensure uniqueness.
Consider introducing a truly unique component (such as GUID) as part of a (sortable) composite.

How should I keep accurate records summarising multiple tables?

I have a normalized database and need to produce web based reports frequently that involve joins across multiple tables. These queries are taking too long, so I'd like to keep the results computed so that I can load pages quickly. There are frequent updates to the tables I am summarising, and I need the summary to reflect all update so far.
All tables have autoincrement primary integer keys, and I almost always add new rows and can arrange to clear the computed results in they change.
I approached a similar problem where I needed a summary of a single table by arranging to iterate over each row in the table, and keep track of the iterator state and the highest primary keen (i.e. "highwater") seen. That's fine for a single table, but for multiple tables I'd end up keeping one highwater value per table, and that feels complicated. Alternatively I could denormalise down to one table (with fairly extensive application changes), which feels a step backwards and would probably change my database size from about 5GB to about 20GB.
(I'm using sqlite3 at the moment, but MySQL is also an option).
I see two approaches:
You move the data in a separate database, denormalized, putting some precalculation, to optimize it for quick access and reporting (sounds like a small datawarehouse). This implies you have to think of some jobs (scripts, separate application, etc.) that copies and transforms the data from the source to the destination. Depending on the way you want the copying to be done (full/incremental), the frequency of copying and the complexity of data model (both source and destination), it might take a while to implement and then to optimizie the process. It has the advantage that leaves your source database untouched.
You keep the current database, but you denormalize it. As you said, this might imply changing in the logic of the application (but you might find a way to minimize the impact on the logic using the database, you know the situation better than me :) ).
Can the reports be refreshed incrementally, or is it a full recalculation to rework the report? If it has to be a full recalculation then you basically just want to cache the result set until the next refresh is required. You can create some tables to contain the report output (and metadata table to define what report output versions are available), but most of the time this is overkill and you are better off just saving the query results off to a file or other cache store.
If it is an incremental refresh then you need the PK ranges to work with anyhow, so you would want something like your high water mark data (except you may want to store min/max pairs).
You can create triggers.
As soon as one of the calculated values changes, you can do one of the following:
Update the calculated field (Preferred)
Recalculate your summary table
Store a flag that a recalculation is necessary. The next time you need the calculated values check this flag first and do the recalculation if necessary
Example:
CREATE TRIGGER update_summary_table UPDATE OF order_value ON orders
BEGIN
UPDATE summary
SET total_order_value = total_order_value
- old.order_value
+ new.order_value
// OR: Do a complete recalculation
// OR: Store a flag
END;
More Information on SQLite triggers: http://www.sqlite.org/lang_createtrigger.html
In the end I arranged for a single program instance to make all database updates, and maintain the summaries in its heap, i.e. not in the database at all. This works very nicely in this case but would be inappropriate if I had multiple programs doing database updates.
You haven't said anything about your indexing strategy. I would look at that first - making sure that your indexes are covering.
Then I think the trigger option discussed is also a very good strategy.
Another possibility is the regular population of a data warehouse with a model suitable for high performance reporting (for instance, the Kimball model).