U-SQL Maximum Request Length Exceeded - azure-data-lake

I am attempting to collapse a large number of disparate files into a unified layout. There are roughly 3,000 files that I am collapsing down to 40 subject areas. My original design was 40 jobs, each outputting a single unified layout. I then decided to submit them all at once. When doing so, I receive an Internal Server Error stating "Maximum Request Length Exceeded".
I am trying to understand what that maximum is and how I can work around it. I am new to Azure Data Lake Analytics and U-SQL, so there may be an obvious path that I do not see. Maybe I can create the 40 jobs and execute them via a coordinating script? Looking for efficiency. Suggestions?

You may be running into one of the capacity limits for your account; it's hard to tell from what you've shared. More details on the ADLA quota limits can be found here: https://learn.microsoft.com/en-us/azure/data-lake-analytics/data-lake-analytics-quota-limits
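The coordinating-script idea from the question is a reasonable workaround: submit each of the 40 scripts as its own job, so no single submission carries the full payload. A minimal sketch, assuming the Azure CLI with the Data Lake Analytics extension is installed and logged in, one `.usql` file per subject area, and a hypothetical account name:

```python
import subprocess
from pathlib import Path

def build_submit_command(account: str, job_name: str, script_path: str) -> list:
    """Build one `az dla job submit` invocation for a single U-SQL script.
    The @ prefix tells the CLI to read the script body from the file."""
    return [
        "az", "dla", "job", "submit",
        "--account", account,
        "--job-name", job_name,
        "--script", f"@{script_path}",
    ]

def submit_all(account: str, script_dir: str) -> None:
    """Submit every .usql script in script_dir as a separate job, so no
    single request approaches the request-size limit."""
    for script in sorted(Path(script_dir).glob("*.usql")):
        cmd = build_submit_command(account, script.stem, str(script))
        subprocess.run(cmd, check=True)
```

The account name, directory layout, and one-script-per-subject-area convention are assumptions; adapt them to however your 40 scripts are organized.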

Related

How to enrich events using a very large database with azure stream analytics?

I'm in the process of analyzing Azure Stream Analytics to replace a stream-processing solution based on NiFi and some REST microservices.
One step is the enrichment of sensor data from a very large database of sensors (>120 GB).
Is it possible with Azure Stream Analytics? I tried with a very small subset of the data (60 MB) and couldn't even get it to run.
Job logs give me warnings of memory usage being too high. Tried scaling to 36 stream units to see if it was even possible, to no avail.
What strategies do I have to make it work?
If I deterministically partition the input stream (via a hash function on the ID) into N partitions, and partition the database using the same hash function (so that a given ID lands in the same partition on both the stream side and the database side), can I make this work? Do I need to create several separate Stream Analytics jobs to be able to do that?
I suppose I can use 5 GB chunks, but I could not get it to work with an ADLS Gen2 data lake. Does it really only work with Azure SQL?
Stream Analytics supports reference datasets of up to 5 GB. Please note that large reference datasets come with the downside of making job/node restarts very slow (up to 20 minutes for the reference data to be distributed); restarts may be user-initiated, triggered by service updates, or caused by various errors.
If you can downsize that 120 GB to 5 GB (keeping only the columns and rows you need, and converting to types that are smaller in size), then you should be able to run that workload. Sadly, we don't support partitioned reference data yet. This means that as of now, if you have to use ASA and can't reduce those 120 GB, you will have to deploy one distinct job for each subset of stream/reference data.
Now I'm surprised you couldn't get 60 MB of reference data to run; if you have details on what exactly went wrong, I'm happy to provide guidance.
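The one-job-per-subset approach hinges on the question's own idea: a stable hash (not Python's per-process-randomized `hash()`) maps each sensor ID to one of N partitions, and the same function splits the reference data, so partition k of the stream is only ever joined against bucket k of the reference data. A minimal sketch; the ID field name is hypothetical:

```python
import hashlib

def partition_for(sensor_id: str, n_partitions: int) -> int:
    """Deterministically map an ID to one of N partitions. A cryptographic
    digest is used only for its stability across processes and machines."""
    digest = hashlib.sha256(sensor_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_partitions

def split_reference_data(rows, key, n_partitions):
    """Split reference rows into N buckets using the same hash; each
    bucket then backs exactly one of the N Stream Analytics jobs."""
    buckets = [[] for _ in range(n_partitions)]
    for row in rows:
        buckets[partition_for(row[key], n_partitions)].append(row)
    return buckets
```

Because both sides use the same function, every event is guaranteed to find its matching reference rows in the bucket assigned to its job; the cost is N jobs to deploy and monitor instead of one.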

Google big query backfill takes very long

I am new to Stack Overflow. I use Google BigQuery to connect data from multiple sources together. I have made a connection to Google Ads (using a data transfer in BigQuery) and this works well. But when I run a backfill of older data, it takes more than 3 days to get 180 days of data into BigQuery. Google advises 180 days as the maximum, but it takes so long. I want to do this for the past 2 years and for multiple clients (we are an agency), so I need to do it in chunks of 180 days.
Does anybody have a solution for this taking so long?
Thanks in advance.
According to the documentation, the BigQuery Data Transfer Service supports a maximum of 180 days (as you said) per backfill request, and simultaneous backfill requests are not supported [1].
The BigQuery Data Transfer Service limits the maximum rate of incoming requests and enforces quotas on a per-project basis [2], and other BigQuery tasks in the project may be limiting the resources available to the transfer. Load jobs created by transfers count toward BigQuery's quotas on load jobs, so it's important to consider how many transfers you enable in each project, to prevent transfers and other load jobs from producing quotaExceeded errors.
If you need to increase the number of transfers, you can create additional projects.
If you want to speed up the transfers for all your clients, you could split them across several projects, since it sounds like you are going to run a significant number of transfers.
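Since each backfill request is capped at 180 days and requests cannot run simultaneously, a small helper can pre-compute the sequential windows needed to cover two years; how each window is then submitted (console, `bq` CLI, or the Data Transfer API) is up to you. A sketch:

```python
from datetime import date, timedelta

MAX_BACKFILL_DAYS = 180  # documented per-request maximum

def backfill_windows(start: date, end: date, max_days: int = MAX_BACKFILL_DAYS):
    """Split the inclusive range [start, end] into consecutive windows of
    at most max_days each, one window per backfill request. Windows must
    be submitted one after another, since simultaneous backfills are not
    supported."""
    windows = []
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=max_days - 1), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows
```

For a full two-year range this yields five sequential requests (four of 180 days plus a short remainder), which you would repeat per client; spreading clients across projects, as suggested above, lets those per-client sequences run in parallel.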

AWS DynamoDB Strange Behavior -- Provisioned Capacity & Queries

I have some strange things occurring with my AWS DynamoDB tables. To give you some context, I have several tables for an AWS Lambda function to query and modify. The source code for the function is housed in an S3 bucket. The function is triggered by an AWS Api.
A few days ago I noticed a massive spike in the amount of read and write requests I was being charged for in AWS. To be specific, the number of read and write requests increased by 3,000 from what my tables usually experience (they usually have fewer than 750 requests). Additionally, I have seen similar numbers in my Tier 1 S3 requests, with an increase of nearly 4,000 requests in the past six days.
Immediately, I suspected something malicious had happened, and I suspended all IAM roles and changed their keys. I couldn't see anything in the logs from Lambda denoting it was coming from my function, nor had the API received a volume of requests consistent with what was happening on the tables or the bucket.
When I was looking through the logs on the tables, I was met with very strange behavior relating to the provisioned write and read capacity of the table. It seems like the table's capacities are ping-ponging back and forth wildly, as shown in the photo.
I'm relatively new to DynamoDB and AWS as a whole, but I thought I had set the table up with very specific provisioned write and read limits. The requests have continued to come in, and I am unable to figure out where in the world they're coming from.
Would one of you AWS Wizards mind helping me solve this bizarre situation?
Any advice or insight would be wildly appreciated.
Turns out refreshing the table view in the DynamoDB management console causes the table to be read from, hence the unexplainable jump in reads. I was doing it the whole time 🤦‍♂️

Allowing many users to view stale BigQuery data query results concurrently

If I have a BigQuery dataset with data that I would like to make available to 1,000 people (where each of these people would only be allowed to view their subset of the data, and it's OK for them to view a 24-hour-stale version of it), how can I do this without exceeding the 50-concurrent-queries limit?
The BigQuery documentation mentions that 50 concurrent queries are permitted, which give on-the-spot accurate data. I would surpass that limit if they all needed on-the-spot accurate data, which they don't.
The documentation also mentions batch jobs and saving results into destination tables, which I'm hoping could provide a reliable solution for my scenario. But I'm having difficulty finding information on how reliably or frequently those batch jobs can be expected to run, and on whether someone querying results in those destination tables counts toward the 50-concurrent-queries limit.
Any advice appreciated.
Without knowing the specifics of your situation and depending on how much data is in the output, I would suggest putting your own cache in front of BigQuery.
This sounds kind of like a dashboarding/reporting solution, so I assume there is a large amount of data going in and a relatively small amount coming out (per user).
Run one query per day with a batch script to generate your output (grouped by user) and then export it to GCS. You can then break it up into multiple flat files (or just read it into memory on your frontend). Each user hits your frontend, you determine which part of the output to serve up to them and respond.
This should be relatively cheap if you can work off the cached data and it is small enough that handling the BigQuery output isn't too much additional processing.
Google Cloud Functions might be an easy way to handle this, if you don't want the extra work of setting up a new VM to host your frontend.

What is the upper file count limit for concatenating files using MsConcat?

We are thinking of using the MsConcat function to merge a large number of small files stored in Azure Data Lake Store. I am wondering if there is any limit on the number of files; I have not seen any information about such a limit in the documentation.
The msconcat API is designed to concatenate up to 500 files. In most typical cases, it should work fine for operations of this size. There are some uncommon cases, such as when the system is under high load, in which you may see failures even for 500 files or fewer; however, these are not expected during normal operation.
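If you have more than 500 source files, one approach (an assumption on my part, not something the API does for you) is to batch the inputs yourself and concatenate in rounds: at most 500 inputs per call, and if a round leaves more than one intermediate file, run another round over the intermediates. A sketch of the batching step, with the 500 taken from the limit described above:

```python
def concat_batches(paths, batch_size=500):
    """Group source paths into batches that respect the msconcat per-call
    input limit; each batch becomes one concatenate call producing one
    intermediate file."""
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 to make progress")
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]
```

For example, 1,200 files become three calls (500 + 500 + 200 inputs), leaving three intermediates that a final call merges into one file.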