Version/Environment (if relevant):
All.
Issue:
After starting a Spark cluster, you might run into a situation where subsequent executions will not start. Those executions never leave the Assigning stage. In versions after 5.1.0 you will see the following warning below the Stage list...
User is over quota; limit is X; current execution is number X in queue.
Note: In Domino versions 4.x, 5.0.x, and 5.1.0 there is no user-facing warning.
Root Cause:
There is a per-user limit on the number of executions that can run simultaneously. The default value is 25, which most users will never hit. However, Spark clusters also count toward this execution quota. If you start a Spark cluster whose maximum number of workers is set to the cluster size limit, you will not be able to start any subsequent executions.
A workspace or job with an on-demand Spark cluster consumes execution slots as follows...
- one slot for each Spark executor (including executors added by autoscaling)
- one slot for the Spark master
- one slot for the workspace or job itself
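The slot accounting above can be sketched in a few lines of Python. The function names here are hypothetical helpers for illustration, not part of any Domino API; the arithmetic (executors + master + the owning workspace/job) comes from the list above, and the default quota of 25 from the Root Cause section.

```python
def execution_slots_used(num_executors: int) -> int:
    """Total execution slots a workspace or job with an on-demand
    Spark cluster consumes: one per executor, one for the Spark
    master, and one for the workspace/job itself."""
    return num_executors + 1 + 1

def max_workers_within_quota(quota: int = 25) -> int:
    """Largest Spark worker count that still fits inside the per-user
    quota, leaving room for the master and the workspace/job."""
    return quota - 2

# With the default quota of 25, a 23-worker cluster fills it exactly,
# so no further execution can start for that user.
print(execution_slots_used(23))    # 25
print(max_workers_within_quota())  # 23
```

This is why a cluster sized to the full quota blocks every subsequent execution: the workspace itself and the Spark master are charged against the same quota as the workers.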
Apart from the lack of a warning in some Domino versions, this is expected behavior.
Resolution:
There are two ways to work around this issue:
1) Decrease the number of workers in the on-demand Spark cluster.
2) An administrator can increase the Central Config value of the setting...
com.cerebro.domino.computegrid.userExecutionsQuota.maximumExecutionsPerUser
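As a sketch, a Central Config entry is a key/value pair; the value 50 below is purely illustrative (not a recommendation), and the exact entry format may differ in your Domino version:

```
com.cerebro.domino.computegrid.userExecutionsQuota.maximumExecutionsPerUser = 50
```

After changing this value, the quota applies per user across all of that user's running executions, including Spark masters and executors.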
Notes/Information:
There are several feature requests in progress with engineering around this...
1) Add messaging to the spark clusters setup UI to notify users of this outcome; DOM-39661
2) Notify users that they are over the executions quota; DOM-17370
3) Document this behavior in the public documentation set; DOCS-1423