old lessons relearned
challenge: automate the provisioning of 400 Teltonika devices
“My main takeaway when automating is to start small and then build out. Yes, I know, this seems like something that should be really obvious; apparently I still needed a reminder of it!”
- Roger, IT powder monkey at iunxi
When one of iunxi’s main customers needed a failover option for their brick-and-mortar stores, our architects came up with a design that fails over from the main internet connection to a 4G router. These 4G routers would be supplied by Teltonika and placed at around 400 locations. I was tasked with automating the provisioning process for these Teltonika devices.
The Teltonikas run BusyBox Linux, and with Ansible being my personal preference for desired-state configuration, I thought: how hard could it be? I will just use the Linux modules available to me in Ansible and Bob’s your uncle. Was I wrong there. It turned out that there is no Python available on these boxes, so no Ansible Linux modules for you, my friend! As we wanted to change as little as possible on these devices, we weren’t going to cross-compile Python for them.
how I eventually… got it to work
So, on to the gritty details. To make my Ansible playbook connect I need three ingredients: username, password, and hostname/IP. The username and password are easy, since those are the defaults; the IP, however, was a different beast altogether.
The intended users of this playbook have little to no technical knowledge, so I wanted the process to be ‘zero touch’. When the Teltonika boots up it has a default config in which it runs as a DHCP server (it is still a router) and has the address 192.168.1.1, so every fresh device shows up with the same IP. We solved this by creating a dedicated subnet per device, applying VRF route leaking, and then ‘natting’ each device to a unique IP. Now we are able to SSH to the box and use Ansible’s command module together with delegate_to to copy the firmware and configuration to the routers as needed. See, it’s not all bad, we are okay. At least, that’s what I thought at this point.
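To make the "no Python" constraint concrete, here is a minimal sketch of what such a play could look like. It is not the actual playbook: the group name teltonika, the firmware path, and the root login are assumptions for illustration, and it presumes the control node can already authenticate over SSH (for example with keys or a wrapper around the default password).

```yaml
# Sketch only: group name, paths, and login user are hypothetical.
- name: Talk to routers that have no Python
  hosts: teltonika              # hypothetical inventory group
  gather_facts: false           # fact gathering would need Python on the target
  tasks:
    - name: Confirm SSH works using the raw module (no Python required)
      ansible.builtin.raw: uname -a
      register: uname_result
      changed_when: false       # read-only command, never report a change

    - name: Copy the firmware with scp, executed on the control node
      ansible.builtin.command:
        argv:
          - scp
          - files/firmware.bin                              # hypothetical local file
          - "root@{{ inventory_hostname }}:/tmp/firmware.bin"
      delegate_to: localhost    # the control node has Python, the router does not
```

The raw module sends the command straight over SSH, and delegate_to moves the Python-dependent work to a machine that can run it.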
Then came applying the config and loading the firmware. Teltonika has a process for doing this using the ‘sysupgrade’ command.
I called ‘sysupgrade’ using the raw module in Ansible (yes indeed, no Python) and soon came to the realization that, because I wasn’t using Python, Ansible had no way other than the exit code of the process to know how this ended.
And as this process takes approximately 8 to 10 minutes, it was timing out. So I did something bad: I used the script module and delegated the command to run in a bash script on the Ansible control server. This worked but wasn’t very trustworthy, as I only knew the exit code of the script and not the exit codes of the commands run by the script. I can, however, check the outcome of the work that was supposed to be done by the script later. To deal with the long run time I decided to use Ansible’s async keyword, which allows a process to run in the background. When async is used in conjunction with a poll value greater than 0, the task is not truly asynchronous. You still get a long-running task in the background, but the Ansible play will not continue until the task has finished or the maximum timeout (as indicated by the value of async) is reached.
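As an illustration of async combined with poll, a task along these lines would give a long-running upgrade a fixed time budget. This is a sketch under assumptions: the ssh invocation, the root login, the firmware path, and the 15-minute ceiling are made up for the example, and the task is meant to sit inside a play that has gather_facts disabled.

```yaml
# Sketch only: command, path, and timings are hypothetical.
- name: Start the firmware upgrade from the control node with a time limit
  ansible.builtin.command:
    argv:
      - ssh
      - "root@{{ inventory_hostname }}"
      - sysupgrade /tmp/firmware.bin     # hypothetical path on the router
  delegate_to: localhost
  async: 900     # maximum runtime in seconds before Ansible gives up
  poll: 30       # check every 30 seconds; the play waits here until done or timed out
  register: upgrade_result
```

Because the router reboots during the upgrade, the SSH session may end abruptly, so the return code in upgrade_result may need extra handling and the real verdict is better taken from a later check.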
So the process now roughly looks like this: we connect up to 20 Teltonikas to a provisioning switch. Ansible runs as a cron job and tries to connect to the IPs specified in the inventory file. It then copies the config and firmware to the devices, waits for the reboot, checks that the correct config is applied and that the firmware is the desired version, and finally emails the results to the department that does the provisioning.
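The verification step in that flow could be sketched as follows. The file /etc/version as the source of the firmware string and the variable desired_firmware are assumptions for illustration, not details confirmed above; the tasks belong in a play with gather_facts disabled.

```yaml
# Sketch only: version file location and variable name are hypothetical.
- name: Read the firmware version without Python
  ansible.builtin.raw: cat /etc/version      # assumed location of the version string
  register: fw_version
  changed_when: false                        # read-only command

- name: Fail the host when the firmware is not the desired version
  ansible.builtin.assert:
    that:
      - desired_firmware in fw_version.stdout   # desired_firmware: hypothetical variable
    fail_msg: "Expected {{ desired_firmware }}, found {{ fw_version.stdout | trim }}"
```

The assert runs on the control node, so it works even though the router itself cannot run Ansible modules.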
The above method worked and all was well; we could now start using the provisioning. Or so I thought. After about a week of provisioning I was approached by a colleague with the question: “Only 30% of the Teltonikas I hook up to be provisioned are successful. Is this normal?” This, of course, was not what I had expected.
It turned out the vendor was shipping the devices with new firmware (I should have checked which version was on there to begin with, doh). These devices were taking a lot longer to update the firmware, which caused timeouts and failures. As I didn’t want to meddle with the async value for every future change the vendor made to the device, I decided I needed something more robust (still no Python, so my options were limited).
I found an Ansible module called wait_for, which I used to wait for a regular expression match after my async task of upgrading the firmware. So now I know SSH is up and I am able to log in on the device.
My main takeaway when automating is to start small and then build out. Yes, I know, this seems like something that should be really obvious; apparently I still needed a reminder of it!
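A wait_for task of this kind might look like the sketch below. The pattern, delay, and timeout are assumptions chosen for the example; the only thing relied on is that an SSH server announces itself with a string starting with "SSH-" when a client connects.

```yaml
# Sketch only: pattern and timings are hypothetical.
- name: Wait until the router answers on SSH again after the upgrade
  ansible.builtin.wait_for:
    host: "{{ inventory_hostname }}"
    port: 22
    search_regex: "SSH-"     # matches the SSH identification string sent on connect
    delay: 60                # give the router time to go down before the first check
    timeout: 1800            # generous ceiling, so slower firmware does not fail the run
  delegate_to: localhost     # wait_for needs Python, so it runs on the control node
```

The task returns as soon as the pattern shows up, so a generous timeout only costs time on devices that never come back.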