Spark joins - save as DataFrames or partitioned Hive tables - apache-spark-sql

I’m working on a project with test data of close to 1 million records, and 4 such files.
The task is to perform around 40 calculations by joining the data from the 4 different files, each close to 1 GB.
Currently, I save the data from each file into a Spark table using saveAsTable and perform the operations from there. For example, table1 joins with table2 and the result is saved to table3; table3 (the result of 1 and 2) joins with table4, and so on. Finally, I save these calculations to a separate table and generate the reports.
The entire process takes around 20 minutes, and my concern is that when this code gets to production, with probably 5 times more data, there will be performance issues.
Or is it better to save the data from each file in a partitioned way and then perform the joins to arrive at the final result set?
P.S. The objective is to get near-instant results, and there may be cases where the user updates a few rows in a file and expects an immediate result. The data arrives on a monthly basis, basically once every month, with categories and sub-categories within it.

What you are doing is just fine, but make sure to cache + count after every resource-intensive operation instead of writing all the joins and only saving at the last step.
If you do not cache in between, Spark will run the entire DAG from top to bottom at the last step; this may cause the JVM memory to overflow and spill to disk during the operations, which in turn can hurt the execution time.
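A minimal sketch of that pattern in Spark SQL (CACHE TABLE is eager by default, so it plays the role of cache + count); the table names table3/table5, join keys and columns below are illustrative, not taken from the question:

CACHE TABLE table3 AS            -- result of table1 joined with table2
SELECT t1.*, t2.amount
FROM table1 t1
JOIN table2 t2 ON t1.key = t2.key;

CACHE TABLE table5 AS            -- result of table3 joined with table4
SELECT t3.*, t4.rate
FROM table3 t3
JOIN table4 t4 ON t3.key = t4.key;

-- Only the final result is written out; the cached intermediates stop Spark
-- from re-running the whole DAG at this last step.
CREATE TABLE final_report AS
SELECT * FROM table5;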

Related

What is the best approach for bulk cleaning a database table that has a large amount of duplicated data loaded every day (Snowflake DB)

Thanks in advance for reading this; I hope I explain my problem clearly.
In one of our domains we have multiple pipelines where data flows from S3 into a Snowflake staging table using Airflow. The data itself originates from a number of different applications, but the process is always the same: the data is extracted from the application by the support teams (multiple support teams across multiple countries, using different technologies), landed in AWS S3, and then bulk loaded into Snowflake. Due to limitations on the source data there often isn't any filter on the data itself, and effectively the staging table is loaded with the raw CSV every single day, with a file date column added to the data. The result is that we have tables that have been loaded with the same data every single day since 2009.
However, the data does change: from day to day a column value will change, so the file date is very useful for tracking changed attributes and is something I want to exploit. Furthermore, if the data were cleansed we would need only approximately 1% of it.
These tables are huge (some contain around 16 trillion rows) but can be quite narrow.
I would like to loop through each day's worth of data and load only the new data into the staging tables, as opposed to just loading everything each day.
I have tried the following:
A query that windows over the entire set, compares the hashed value of each row (minus the file date), and only returns a row if it did not appear in the previous date's data set (sketched below). This works, but not for the larger tables, where the warehouse starts to write to disk and it then takes hours.
A day-by-day loop that looks at each file date's data set, compares it to the previous day, and only loads the difference. This takes too long for the initial clean of the tables, but it is what I am doing once the data has been cleaned, and it will form the initial load procedure.
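For reference, the windowed version looks roughly like this (the column names are placeholders, the hash covers every column except the file date, and it assumes FILE_DATE is an actual DATE column):

-- a row survives only if an identical row (same hash) was not present the day before
SELECT *
FROM TABLE_A
QUALIFY LAG(FILE_DATE) OVER (PARTITION BY HASH(COL1, COL2, COL3) ORDER BY FILE_DATE)
        IS DISTINCT FROM DATEADD(day, -1, FILE_DATE);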
The current solution is to dynamically create multiple MINUS set statements, where I take each day minus the day before, and then batch these into blocks of 10-20 based on the average daily row size. As an example:
INSERT INTO TEMP_TABLE
(SELECT * FROM TABLE_A WHERE FILE_DATE = 040123
 MINUS
 SELECT * FROM TABLE_A WHERE FILE_DATE = 030123)
UNION ALL
(SELECT * FROM TABLE_A WHERE FILE_DATE = 030123
 MINUS
 SELECT * FROM TABLE_A WHERE FILE_DATE = 020123)
etc...
This is not pretty, though it does work; however, it is taking around 12 hours to process 70-odd tables.
I would like advice on whether there is another approach.
Please bear in mind that I am limited to using Snowflake due to resourcing issues and politics.
Any guidance and ideas would be much appreciated.
Regards

Reducing database load from consecutive queries

I have an application which calls the database multiple times to achieve one simple goal.
A little information about this application: in short, it scrapes data from a webpage and stores specific information from that page in a database. The important information here is: player name, position (there can be multiple players sitting at one specific position), kill points, and class.
Player name has every potential to change or remain the same from day to day.
Regarding position, there can be multiple players sitting in one position.
Kill points have the potential to increase or remain the same every day.
Class: there are only 2 possibilities a player can be; e.g. A can change to B or remain A (and the same in reverse), but it cannot become C, D, E or F.
The player name can change on any particular day, and position can also change depending on the kill point increase since the last update, which brings us back to the goal: to search the database day by day, from the current date back as far as 2021-02-22, starting at the most recent entry for a player name and backtracking to the previous day to check whether that player name is still the same or has changed.
The main reference point for a change is the kill points. As the days go on, this number will either stay exactly the same or increase; it can never decrease.
So now onto the implementation of this application.
The first query that runs finds the most recent entry for the player name:
SELECT TOP(1) * FROM [changes] WHERE [CharacterName]=#charname AND [Territory]=#territory AND [Archived]=0 ORDER BY [Recorded] DESC
Then it continues to check the previous day's entries with the following query:
SELECT TOP(1) * FROM [changes] WHERE [Territory]=#territory AND [CharacterName]=#charname AND [Recorded]=#searchdate AND ([Class] LIKE '%{Class}%' OR [Class] LIKE '%{GetOpposite(Class)}%' AND [Archived]=0 )
If no results are found, it then proceeds to find an alternative name with the following query:
SELECT TOP(5) * FROM [changes] WHERE [Kills] <= #kills AND [Recorded]='{Data.Recorded.AddDays(-1):yyyy-MM-dd}' AND [Territory]=#territory AND [Mode]=#mode AND ([Class] LIKE #original OR [Class] LIKE #opposite) AND [Archived]=0 ORDER BY [Kills] DESC
The aim of the query above is to get the top 5 entries that are the closest possible matches, and then cross-reference them with the day ahead:
SELECT COUNT(*) FROM [changes] WHERE [CharacterName]=#CharacterName AND [Territory]=#Territory AND [Recorded]=#SearchedDate AND [Archived]=0
When checking the day ahead: if the character name is not found in the day ahead, it is considered to be the old player name for this specific character; otherwise, if all 5 of the results are found to be present in the day-ahead searches, the name is considered to be new to the table.
From the date this application started running up to today's date, that is over 400 individual queries against the database to achieve one goal.
It is also worth noting that this table grows by 14,400-14,500 rows each and every day.
The overall question: is it possible to bring all of this down to fewer calls to the database, reduce the number of queries, and improve performance?
What you can do to improve performance will be based on what parts of the application stack you can manipulate. Things to try:
Store Less Data - Database content retrieval speed is largely determined by how well the database is ordered/normalized and how much data needs to be searched for each query. Managing a cache of prior scraped pages and only storing data when there has been a change between the current scrape and the last one would guarantee fewer redundant requests to the DB.
Separate specific classes of data - Separating data into dedicated tables would allow you to query a specific table for a specific character, etc., effectively removing one WHERE clause.
Reduce time between queries - Fewer incoming concurrent requests mean less resource contention and faster response times for prior requests.
Use another data structure - The only reason you're using TOP() is that you need data ordered in some specific way (most recent, etc.). If you used an in-memory data structure that keeps the data ordered and still easily queryable, you could offload some SQL requests to that structure instead of the DB.
The suggestions above are not exhaustive, but what you do to improve performance is largely a function of what in the application stack you have the ability to modify.
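As a purely hypothetical illustration of the "fewer calls" idea (not something from the original answer), the per-day lookups could be collapsed into a single set-based query over the whole date range, with the day-by-day name walk then done in application memory. The column names come from the queries above; the parameter style and exact shape are assumptions.

SELECT [CharacterName], [Territory], [Class], [Kills], [Recorded]
FROM [changes]
WHERE [Territory] = @territory
  AND [Archived] = 0
  AND [Recorded] >= '2021-02-22'
ORDER BY [Recorded] DESC, [Kills] DESC;
-- One round trip returns every candidate row; the backtracking comparison
-- (same name? kill points never decreasing?) then runs over this result set in code.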

Billing Charges when Querying a DataStudio Report Connected to a View in BigQuery

I have created a report in DataStudio using data pulled from BigQuery and saved as a view. After playing around with the report for a while, I noticed that I have been billed 100+ times for the query (all for the exact same amount of data), but I only ran it once to build the view. Am I being charged every time I interact with the report, e.g. when I apply a filter? If not, what is causing these costs?
Your report will run a query against the view for each element on the page, so 4 graphs = 4 queries.
If you then change a filter, for example, that will run a further 4 queries (assuming the filter affects them all).
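If you want to see those per-element queries and what each one billed, one way (a sketch; swap the region qualifier for wherever your dataset lives) is to list recent jobs from BigQuery's INFORMATION_SCHEMA:

SELECT creation_time, user_email, total_bytes_billed, query
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE job_type = 'QUERY'
  AND creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
ORDER BY creation_time DESC;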

Load order of entries in BigQuery tables

I have some sample data that I've been loading into Google BigQuery. I have been importing the data in NDJSON format. If I load the data all in one file, the rows show up in a different order in the table's preview tab than when I import them sequentially, one NDJSON line at a time.
When importing sequentially, I wait until I see the following output:
Waiting on bqjob_XXXX ... (2s) Current status: RUNNING
Waiting on bqjob_XXXX ... (2s) Current status: DONE
The order the rows show up in seems to match the order I append them, as the job importing them finishes before I move on to the next. But when I load them all in one file, they show up in a different order than they appear in my data file.
So why do the data entries show up in a different order when loading in bulk? How are the data entries queued to be loaded, and how are they indexed in the table?
BigQuery has no notion of indexes. Data in BigQuery tables has no particular order that you can rely on. If you need to get ordered data out of BigQuery, you will need to use an explicit ORDER BY in your query - which, by the way, is not recommended for large results, as it increases resource cost and can end in a Resources Exceeded error.
BigQuery's internal storage can "shuffle" your data rows internally for the best / most optimal query performance. So again - there is no such thing as a physical order of data in BigQuery tables.
The official wording in the docs is: line ordering is not guaranteed for compressed or uncompressed files.
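A minimal sketch of the explicit-ORDER BY approach, assuming you add your own sequence column to the NDJSON at load time (line_no and the project/dataset/table names are illustrative):

SELECT *
FROM `my_project.my_dataset.my_table`
ORDER BY line_no   -- explicit ordering; preview order is never guaranteed
LIMIT 1000;        -- keep ordered result sets small to avoid Resources Exceeded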

SSRS - Report doesn't load via Report Builder, while it does via SQL

This is my first question here.
I have struggled for days trying to find a solution everywhere, with no success.
Basically I have a standard stored procedure pulling out a report dataset in a few seconds (5-6 seconds).
It aggregates (GROUP BY and SUM) 23,000 rows.
Indeed, my final dataset comes out with 4 rows and 33 columns, executing, as said, in 5-6 seconds.
Unfortunately, when I try to load it via Report Builder, it loads endlessly (querying SQL Server, the stored procedure remains stuck in a RUNNING status forever).
Everything in Report Builder (DB access, dataset, parameters, matrix...) is configured correctly: I was indeed able to load the report until I added a few (4) additional fields.
The SQL dataset is basically something like:
-- PARAMETERS DECLARATION
SELECT
    FIELDS
FROM (
    SELECT
        FIELD_A,
        SUMS
    FROM TABLE
    JOIN TABLES
    WHERE PARAMETERS_MATCHING
    GROUP BY FIELD_A
) AS B
ORDER BY FIELD
An "external layer" SELECT was needed to make some calculations on some FIELDS, also in some cases using some PARAMETERS.
That's it.
I usually work with huge datasets, sometimes pulling out 30,000 rows with 110 fields, and if something loads via SQL it has always loaded via Report Builder too: this is the very first time it has behaved this way.
So I'm asking whether there are some strange SSRS/Report Builder limitations I have never faced in my experience.
Any help would be really really appreciated!
Thanks in advance to everyone who'll spend time :)