Azure Data Factory Limits - azure-data-factory-2

I have created a simple pipeline that operates as such:
Generates an access token via an Azure Function. No problem.
Uses a Lookup activity to create a table to iterate through the rows (4 columns by 0.5M rows). No problem.
For Each activity (sequential off, batch-size = 10):
(within For Each): Set some variables for checking important values.
(within For Each): Pass values through web activity to return a json.
(within For Each): Copy Data activity mapping parts of the json to the sink-dataset (postgres).
Problem: The pipeline slows to a crawl after approximately 1000 entries/inserts.
I was looking at this documentation regarding the limits of ADF.
ForEach items: 100,000
ForEach parallelism: 20
I would expect that this falls within in those limits unless I'm misunderstanding it.
I also cloned the pipeline and tried it by offsetting the query in one, and it tops out at 2018 entries.
Anyone with more experience be able to give me some idea of what is going on here?

As a suggestion, whenever I have to fiddle with variables inside a foreach, I made a new pipeline for the foreach process, and call it from within the foreach. That way I make sure that the variables get their own context for each iteration of the foreach.
Have you already checked that the bottleneck is not at the source or sink? If the database or web service is under some stress, then going sequential may help if your scenario allows that.
Hope this helped!

Related

Airflow: BigQueryOperator vs BigQuery Quotas and Limits

Is there any pratical way to control quotas and limits on Airflow?.
I'm specially interested on controlling BigQuery concurrency.
There are different levels of quotas on BigQuery . So according to the Operator inputs, there should be a way to check if conditions are met, otherwise waiting for it to fulfill.
It seems to be a composition of Sensor-Operators, querying against a database like redis for example:
QuotaSensor(Project, Dataset, Table, Query) >> QuotaAddOperator(Project, Dataset, Table, Query)
QuotaAddOperator(Project, Dataset, Table, Query) >> BigQueryOperator(Project, Dataset, Table, Query)
BigQueryOperator(Project, Dataset, Table, Query) >> QuotaSubOperator(Project, Dataset, Table, Query)
The Sensor must check conditions like:
- Global running queries <= 300
- Project running queries <= 100
- .. etc
Is there any lib that already does that for me? A plugin perhaps?
Or any other easier solution?
Otherwise, following the Sensor-Operators approach.
How can I encapsulate all of it under a single operator? To avoid repetition of code,
a single operator: QuotaBigQueryOperator
Currently, it is only possible to get the Compute Engine quotas programmatically. However, there is an opened feature request to get/set other project quotas via API. You can post there about the specific case you would like to have implemented and follow it to track it and ask for updates.
Meanwhile, as workaround you can try to use the PythonOperator. With it you can define your own custom code and you would be able to implement retries for the queries that you send that get a quotaExceeded error (or the specific error you are getting). In this way you wouldn't have to explicitly check for the quota levels. You just run the queries and retry until they get executed. This is a simplified code for the strategy I am thinking about:
for query in QUERIES_TO_RUN:
while True:
try:
run(query)
except quotaExceededException:
continue # Jumps to the next cycle of the nearest enclosing loop.
break

How to get multiple data from gemfire cacheloader?

We are going to implement gemfire for our project. We are currently syncing gemfire cache with our DB2 database. So, we are facing issue while putting DB data into cache.
To put DB data into region. I have implement com.gemstone.gemfire.cache.CacheLoader and override load method of it. As written in java doc load method will return only one Object. But for our requirement we will have to return multiple VO from load method
public List<CmDvceInvtrGemfireBean> load(LoaderHelper<CmDvceInvtrGemfireBean, CmDvceInvtrGemfireBean> helper)
throws CacheLoaderException
While returining multiple VO in form of List<CmDvceInvtrGemfireBean> gemfire region consider it's as single value.
So, when i invoke,
System.out.println("return COUNT" + cmDvceInvtrRecord.query("SELECT COUNT(*) FROM /cmDvceInvtrRecord"));
It return count of one. But i can see total 7 number of data into it.
So, I want to implement the kind of mechanism that will put all the 7 values as a separate VO in Region
Is there any way to do this using Gemfire CacheLoader?
A CacheLoader was meant to load a value only for a single entry in the GemFire Region on a cache miss. As the Javadoc states...
..creates the value for the desired key..
While a key can map to a multi-valued (e.g. an array/Collection) value, the CacheLoader can only populate a single entry.
You will have to resort to other means of populating the cache with multiple "entries" in a single operation.
Out of curiosity, why do you need (requirement?) to load multiple entries (from the DB) at once? Are you trying to minimize the number of round trips to the DB?
Also, what logic are you using to decide what VO from the DB will be loaded based on the information (i.e. key) provided in the CacheLoader?
For instance, are you somehow trying to predictably select values from the DB based on the CacheLoader key that would subsequently minimize cache misses on future Region.get(key) calls?
Sorry, I don't have a better answer for you right now, but answers to some of these questions may help me give you some ideas for alternatives.
Cheers,
John

neo4j count nodes performance on 200K nodes and 450K relations

We're developing an application based on neo4j and php with about 200k nodes, which every node has a property like type='user' or type='company' to denote a specific entity of our application. We need to get the count of all nodes of a specific type in the graph.
We created an index for every entity like users, companies which holds the nodes of that property. So inside users index resides 130K nodes, and the rest on companies.
With Cypher we quering like this.
START u=node:users('id:*')
RETURN count(u)
And the results are
Returned 1 row.Query took 4080ms
The Server is configured as default with a little tweaks, but 4 sec is too for our needs. Think that the database will grow in 1 month 20K, so we need this query performs very very much.
Is there any other way to do this, maybe with Gremlin, or with some other server plugin?
I'll cache those results, but I want to know if is possible to tweak this.
Thanks a lot and sorry for my poor english.
Finaly, using Gremlin instead of Cypher, I found the solution.
g.getRawGraph().index().forNodes('NAME_OF_USERS_INDEX').query(
new org.neo4j.index.lucene.QueryContext('*')
).size()
This method uses the lucene index to get "aproximate" rows.
Thanks again to all.
Mmh,
this is really about the performance of that Lucene index. If you just need this single query most of the time, why not update an integer with the total count on some node somewhere, and maybe update that together with the index insertions, for good measure run an update with the query above every night on it?
You could instead keep a property on a specific node up to date with the number of such nodes, where updates are done guarded by write locks:
Transaction tx = db.beginTx();
try {
...
...
tx.acquireWriteLock( countingNode );
countingNode.setProperty( "user_count",
((Integer)countingNode.getProperty( "user_count" ))+1 );
tx.success();
} finally {
tx.finish();
}
If you want the best performance, don't model your entity categories as properties on the node. In stead, do it like this :
company1-[:IS_ENTITY]->companyentity
Or if you are using 2.0
company1:COMPANY
The second would also allow you automatically update your index in a separate background thread by the way, imo one of the best new features of 2.0
The first method should also proof more efficient, since making a "hop" in general takes less time than reading a property from a node. It does however require you to create a separate index for the entities.
Your queries would look like this :
v2.0
MATCH company:COMPANY
RETURN count(company)
v1.9
START entity=node:entityindex(value='company')
MATCH company-[:IS_ENTITIY]->entity
RETURN count(company)

django objects...values() select only some fields

I'm optimizing the memory load (~2GB, offline accounting and analysis routine) of this line:
l2 = Photograph.objects.filter(**(movie.get_selectors())).values()
Is there a way to convince django to skip certain columns when fetching values()?
Specifically, the routine obtains all rows of the table matching certain criteria (db is optimized and performs it very quickly), but it is a bit too much for python to handle - there is a long string referenced in each row, storing the urls for thumbnails.
I only really need three fields from each row, but, if all the fields are included, it suddenly consumes about 5kB/row which sadly pushes the RAM to the limit.
The values(*fields) function allows you to specify which fields you want.
Check out the QuerySet method, only. When you declare that you only want certain fields to be loaded immediately, the QuerySet manager will not pull in the other fields in your object, till you try to access them.
If you have to deal with ForeignKeys, that must also be pre-fetched, then also check out select_related
The two links above to the Django documentation have good examples, that should clarify their use.
Take a look at Django Debug Toolbar it comes with a debugsqlshell management command that allows you to see the SQL queries being generated, along with the time taken, as you play around with your models on a django/python shell.

Caching of Map applications in Hadoop MapReduce?

Looking at the combination of MapReduce and HBase from a data-flow perspective, my problem seems to fit. I have a large set of documents which I want to Map, Combine and Reduce. My previous SQL implementation was to split the task into batch operations, cumulatively storing what would be the result of the Map into table and then performing the equivalent of a reduce. This had the benefit that at any point during execution (or between executions), I had the results of the Map at that point in time.
As I understand it, running this job as a MapReduce would require all of the Map functions to run each time.
My Map functions (and indeed any function) always gives the same output for a given input. There is simply no point in re-calculating output if I don't have to. My input (a set of documents) will be continually growing and I will run my MapReduce operation periodically over the data. Between executions I should only really have to calculate the Map functions for newly added documents.
My data will probably be HBase -> MapReduce -> HBase. Given that Hadoop is a whole ecosystem, it may be able to know that a given function has been applied to a row with a given identity. I'm assuming immutable entries in the HBase table. Does / can Hadoop take account of this?
I'm made aware from the documentation (especially the Cloudera videos) that re-calculation (of potentially redundant data) can be quicker than persisting and retrieving for the class of problem that Hadoop is being used for.
Any comments / answers?
If you're looking to avoid running the Map step each time, break it out as its own step (either by using the IdentityReducer or setting the number of reducers for the job to 0) and run later steps using the output of your map step.
Whether this is actually faster than recomputing from the raw data each time depends on the volume and shape of the input data vs. the output data, how complicated your map step is, etc.
Note that running your mapper on new data sets won't append to previous runs - but you can get around this by using a dated output folder. This is to say that you could store the output of mapping your first batch of files in my_mapper_output/20091101, and the next week's batch in my_mapper_output/20091108, etc. If you want to reduce over the whole set, you should be able to pass in my_mapper_output as the input folder, and catch all of the output sets.
Why not apply your SQL workflow in a different environment? Meaning, add a "processed" column to your input table. When time comes to run a summary, run a pipeline that goes something like:
map (map_function) on (input table filtered by !processed); store into map_outputs either in hbase or simply hdfs.
map (reduce function) on (map_outputs); store into hbase.
You can make life a little easier, assuming you are storing your data in Hbase sorted by insertion date, if you record somewhere timestamps of successful summary runs, and open the filter on inputs that are dated later than last successful summary -- you'll save some significant scanning time.
Here's an interesting presentation that shows how one company architected their workflow (although they do not use Hbase):
http://www.scribd.com/doc/20971412/Hadoop-World-Production-Deep-Dive-with-High-Availability