Pioreactors Crashing after 26.3.0 Update

Hello,

I am using 10 Pioreactors (40 mL v1.5 hardware) with two designated leaders (no Pioreactor hardware).

I recently updated my system to version 26.3.0. I have two leader-only systems, each running on a Raspberry Pi 5, and 10 workers (5 per leader) running on Raspberry Pi Zero 2 Ws. Most of my workers were updated from version 26.2.23; however, one of the leaders was set up recently and was therefore already on 26.3.0. I updated all systems over the internet through the UI.

After the update, all my workers are disconnecting from their respective leaders (on both leaders). By disconnecting I mean a combination of jobs being “lost” and needing to be manually restarted through the UI, and occasionally some workers going offline entirely and needing to be rebooted through the UI or manually power-cycled.

I am running stirring and heating on 5 of my workers under leader01, and stirring, heating, and LED control on my other 5 workers under leader02.

When I first updated the system I was getting this error:
```
Exception on /api/bioreactor/descriptors [GET]
Traceback (most recent call last):
  File "/opt/pioreactor/venv/lib/python3.13/site-packages/pioreactor/whoami.py", line 225, in _get_pioreactor_model_name
    result.raise_for_status()
  File "/opt/pioreactor/venv/lib/python3.13/site-packages/pioreactor/mureq.py", line 250, in raise_for_status
    raise HTTPErrorStatus(self.status_code)
pioreactor.mureq.HTTPErrorStatus: HTTP response returned error code 404

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/pioreactor/venv/lib/python3.13/site-packages/flask/app.py", line 1511, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/pioreactor/venv/lib/python3.13/site-packages/flask/app.py", line 919, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/opt/pioreactor/venv/lib/python3.13/site-packages/flask/app.py", line 917, in full_dispatch_request
    rv = self.dispatch_request()
  File "/opt/pioreactor/venv/lib/python3.13/site-packages/flask/app.py", line 902, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)  # type: ignore[no-any-return]
  File "/opt/pioreactor/venv/lib/python3.13/site-packages/pioreactor/web/api.py", line 2318, in get_bioreactor_variable_descriptors
    return attach_cache_control(jsonify(to_builtins(get_bioreactor_descriptors())), max_age=0)
  File "/opt/pioreactor/venv/lib/python3.13/site-packages/pioreactor/bioreactor.py", line 76, in get_bioreactor_descriptors
    default=get_default_bioreactor_value(metadata.key),
  File "/opt/pioreactor/venv/lib/python3.13/site-packages/pioreactor/bioreactor.py", line 60, in get_default_bioreactor_value
    return validate_bioreactor_value(variable_name, resolved_default)
  File "/opt/pioreactor/venv/lib/python3.13/site-packages/pioreactor/bioreactor.py", line 92, in validate_bioreactor_value
    maximum = get_pioreactor_model().reactor_max_fill_volume_ml
  File "/opt/pioreactor/venv/lib/python3.13/site-packages/pioreactor/whoami.py", line 176, in get_pioreactor_model
    name = _get_pioreactor_model_name(target_unit_name)
  File "/opt/pioreactor/venv/lib/python3.13/site-packages/pioreactor/whoami.py", line 230, in _get_pioreactor_model_name
    raise NoWorkerFoundError(f"Worker {unit_name} is not found.")
pioreactor.exc.NoWorkerFoundError: Worker leader01 is not found.
```

I eventually added my leader as a worker by SSHing into the leader and running commands suggested by Gemini, and now it shows up in the UI. I did not assign it to an experiment, and this seemed to temporarily resolve the issue. I set the leader’s model to 40 mL v1.5 as a placeholder. The errors then stopped.

I was able to successfully run heating, stirring, and LEDs on leader02’s 5 Pioreactors for 6 hours. Then I started jobs on leader01 for the other 5 Pioreactors, and all of the jobs were lost. I suspected this could be because both leaders were trying to access the same workers. In the config file for leader02 I found that all 10 Pioreactors were being called, while leader01 was calling the correct 5, so both leaders were calling 5 of the same workers. I fixed the config files, used static IP addresses for everything just to be safe, and started another test with 5 workers per leader. All jobs were lost almost immediately.

Not sure what else I can try, any help is greatly appreciated!

ah shoot - once again I didn’t plan for leader-only Pioreactors. Let me fix this with a hot-fix asap.

Thank you! I was able to fix it by adding the leaders as workers. If you have any ideas on how to fix the disconnects I am experiencing that would also be great!

When you visit the Inventory page on either leader, do you see all 10 workers, or just the subset of 5?

Just the subset of 5: for leader01 I see workers 1-5 along with leader01, and for leader02 I see workers 6-10 along with leader02.

So a job is “lost” when it can’t talk to MQTT. MQTT is the messaging software running on each leader. Double-check the following in each cluster’s config.ini:

  1. Find the section [mqtt], and confirm that broker_address matches the respective leader’s IP.
  2. Find the section [cluster.topology], and confirm that leader_hostname and leader_address are correct. (Compare them to the Leader page in the UI.)
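For a concrete picture, the relevant sections of config.ini would look roughly like this for a cluster led by leader01 (the IP address below is a placeholder; substitute your leader’s actual static IP or hostname):

```ini
# config.ini on the leader01 cluster (192.168.1.10 is a hypothetical address)
[mqtt]
broker_address=192.168.1.10

[cluster.topology]
leader_hostname=leader01
leader_address=192.168.1.10
```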

(Maybe send me the config.ini’s, too, to cam@pioreactor.com)


I’ll keep thinking. This problem only showed up after upgrading?

Another change I just made: I switched the broker address for both clusters from pioreactor.local to leader01.local and leader02.local respectively. And I just started another test!

oh yea that’s probably it. pioreactor.local is an mDNS alias, and you have two leaders sharing that alias. Both clusters were trying to connect to pioreactor.local, but it’s not clear which machine “owns” that alias at any given time: leader01 or leader02. So the workers were definitely engaging in some sort of cross-talk.

Keep broker_address as either the hostname.local form (ex: leader01.local) or a static IPv4 address.
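If you want to confirm a worker can actually reach its leader’s broker, a quick sanity check (not part of the Pioreactor tooling; the hostname below and the default MQTT port 1883 are assumptions) is to try opening a TCP connection from the worker:

```python
import socket

def broker_reachable(host: str, port: int = 1883, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the MQTT broker succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failure, refusal, and timeout
        return False

# Hypothetical leader address; substitute your leader's hostname or static IP.
print(broker_reachable("leader01.local"))
```

If this prints False from a worker while the leader is up, the problem is name resolution or networking rather than the Pioreactor software itself.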

If you are still seeing problems, try a cluster-wide reboot to reset.

gotcha that makes sense. I checked the broker_address, leader_hostname, and leader_address and they are all correct. I don’t see anything called cluster_topology but I will send you the config files in case I am missing anything!

Sorry, cluster.topology

Marking as resolved: part of the issue was a power supply not providing enough power per port.