The Model API is configured to use uWSGI as the front-end web server, which throttles requests to the backend (a container running inside a pod).
uWSGI exposes several configuration options that can be tuned to our benefit. One of these is the timeout setting, a value that tells uWSGI how long to keep the connection to the socket alive.
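If you want to double-check which timeout value the running workers actually picked up, one option is to look at the uWSGI process command lines inside the run container. This is only a sketch: the namespace, pod, and container names below are placeholders, and it assumes ps is available in the image.

    # Print the uWSGI process command lines from inside the run container;
    # look for timeout-related options (the UI default corresponds to 60 seconds).
    kubectl exec -n namespace po/your-model-pod-name -c the-relevant-run-container -- ps -ef | grep -i uwsgi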
What are the implications of this? If the connection is severed while the OS is reading from or writing to the socket, or while the invoked function is still executing, the API call will fail, with unpredictable effects on your system.
We can tune the setting above based on how long the invoked function takes to run and how large the request payload is. The best way to measure this is to invoke the function inside a workspace or a local deployment, gauge how long it takes to return, and set the timeout to the next largest value. The setting can be found on the Model API UI page, under Settings, in the Advanced tab. The default is 60 seconds if no override is configured there.
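A simple way to gauge the timing is to time a representative request against the workspace or local deployment and compare it with the configured timeout. This is only an illustration: the URL, port, and payload file below are placeholders for your own endpoint.

    # Time a representative request end to end; if this approaches the configured
    # timeout (60 seconds by default), raise the timeout to the next largest value.
    curl -s -o /dev/null -w 'total: %{time_total}s\n' \
      -X POST -H 'Content-Type: application/json' \
      -d @sample-payload.json \
      http://localhost:8080/your-model-endpoint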
If these errors/exceptions persist in the stdout logs, the next best step is to capture a tcpdump and look specifically for errors or HTTP response codes. We can also review the capture for long gaps between messages and check where the FINs that closed the connection are coming from.
Tcpdump should be run on the model instance (the run container). However, since the node's physical interface has to process all of the packets anyway, if the capture inside the container doesn't reveal much we can also try running tcpdump on the node itself.
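As a sketch of that fallback, the node-level capture can be filtered down to the pod's IP so the file stays manageable. The namespace, pod identifier, output path, and the assumption that you have shell access to the node are all placeholders/assumptions here.

    # Find the pod's IP address (the -o wide output includes the IP and the node it runs on).
    kubectl get po -n namespace -o wide | grep your-model-pod-id

    # On the node itself (e.g. via SSH), capture only traffic to/from that pod IP.
    sudo tcpdump -i any host <pod-ip-from-above> -w /tmp/node-capture.pcap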
Please follow the steps below to run the tcpdump session:
- Since a model can be deployed over multiple instances/pods, loop through them and start the capture on each one (a quick check that the captures are running is sketched after this list):

    for POD in $(kubectl get po -A | grep your-model-pod-id | awk '{print $2}' | xargs echo) ; do kubectl cp tcpdump.sh -n namespace "${POD}":/tmp/tcpdump.sh -c the-relevant-run-container && kubectl exec -n namespace po/"${POD}" -c the-relevant-run-container -- chmod +x /tmp/tcpdump.sh && kubectl exec -n namespace po/"${POD}" -c the-relevant-run-container -- /tmp/tcpdump.sh ; done

- The tcpdump.sh script referenced above consists of the following; it installs tcpdump if it is missing and then starts the capture in the background, writing to /tmp/"${HOSTNAME}".pcap (inside the pod, $HOSTNAME resolves to the pod name):

    set -x
    if [ ! -f /usr/sbin/tcpdump ]; then
        apt-get update -y
        apt-get install -y tcpdump psmisc
    fi && echo "Running tcpdump in the background" && /usr/sbin/tcpdump -A -w /tmp/"${HOSTNAME}".pcap -i eth0 &
    echo "Done"

- Lastly, run the following to kill the tcpdump processes and copy the capture file out of each pod:

    for POD in $(kubectl get pods -A | grep your-model-pod-id | awk '{print $2}' | xargs echo) ; do kubectl exec -n namespace po/"${POD}" -c the-relevant-run-container -- killall tcpdump && kubectl cp -n namespace "${POD}":/tmp/"${POD}".pcap "${POD}".$(date +%Y%m%d).pcap -c the-relevant-run-container ; done
The resulting .pcap files can be viewed with tshark or, for a GUI, Wireshark.
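To come back to the items mentioned earlier (HTTP response codes and where the FINs are coming from), a few illustrative tshark display filters are shown below; the capture file name is a placeholder.

    # Show HTTP responses with their status codes.
    tshark -r your-pod.pcap -Y 'http.response'

    # Show only error responses (4xx/5xx).
    tshark -r your-pod.pcap -Y 'http.response.code >= 400'

    # Show packets with the FIN flag set, to see which side closed the connection first.
    tshark -r your-pod.pcap -Y 'tcp.flags.fin == 1'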