Updating made the UI lose the Pioreactors

Hello guys,

We have been using our Pioreactors non-stop lately and I kind of fell behind with the updates. I tried today to update to a later version. I was on 24.6.10, I think (I am not sure about the last two digits :innocent:). I was being lazy and selected “update to next release over the internet”. But it was taking forever, and since then all the Pioreactors have disappeared from the UI. When I try to update using a .zip file I get this message:

“Unexpected token ‘<’, “<?xml vers”… is not valid JSON”

Maybe relevant: when I started the first update, one of the workers was not online. The UI briefly showed a green message saying leader1 updated to a 24.7 version, but then there was a red window saying leader1 cannot find worker7. Since then the UI shows no messages.
I can still SSH into the leader and I tried updating again, but no luck.

Any ideas?

Hi @Eleni_M,

It sounds like the webserver is offline. You can try restarting the leader to see if the UI looks correct again. If so, here’s how you can proceed:

  1. Have all your workers powered up again. SSH into the leader, and enter
    pios update app -v 24.7.7 -y
    
    Wait for that to finish.
  2. Your units should now all be on the same version, and you can continue the update process.

If you rebooted the leader, and things still don’t seem right, let me know and I’ll provide other instructions!
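
If you haven’t restarted a unit from the command line before, a minimal way to do it over SSH is the following (assuming the default pioreactor user and a leader hostname of leader1.local — adjust to your setup):

ssh pioreactor@leader1.local
sudo reboot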

I restarted the leader and I entered
pios update -v 24.7.7 -y

and i got this:

Traceback (most recent call last):
File "/usr/local/bin/pios", line 8, in <module>
sys.exit(pios())
^^^^^^
File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/pioreactor/cli/pios.py", line 262, in update
units = universal_identifier_to_all_workers(units)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/pioreactor/cli/pios.py", line 61, in universal_identifier_to_all_workers
units = get_workers_in_inventory()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/pioreactor/cluster_management/__init__.py", line 28, in get_workers_in_inventory
return tuple(worker["pioreactor_unit"] for worker in result.json())
^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/pioreactor/mureq.py", line 234, in json
return loads(self.body)
^^^^^^^^^^^^^^^^
msgspec.DecodeError: JSON is malformed: invalid character (byte 0)

It still looks like the web server is down. Hm, can you run and show me the output of:

sudo systemctl status lighttpd.service

and

sudo systemctl status huey.service
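
If either of those reports a failure, the recent service logs can help too. For example, with standard systemd tooling (nothing Pioreactor-specific):

sudo journalctl -u lighttpd.service -n 50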

So this is what I get with the first one:

pioreactor@leader1:~ $ sudo systemctl status lighttpd.service
● lighttpd.service - Lighttpd Daemon
Loaded: loaded (/lib/systemd/system/lighttpd.service; enabled; preset: enabled)
Active: active (running) since Fri 2024-09-27 15:41:10 CEST; 24min ago
Process: 669 ExecStartPre=/usr/sbin/lighttpd -tt -f /etc/lighttpd/lighttpd.conf (code=exited, status=0/SUCCESS)
Main PID: 686 (lighttpd)
Tasks: 1 (limit: 497)
CPU: 15min 47.097s
CGroup: /system.slice/lighttpd.service
└─686 /usr/sbin/lighttpd -D -f /etc/lighttpd/lighttpd.conf

Sep 27 16:06:06 leader1 lighttpd[2029]: File "/var/www/pioreactorui/tasks.py", line 18, in <module>
Sep 27 16:06:06 leader1 lighttpd[2029]: from pioreactor.pubsub import get_from
Sep 27 16:06:06 leader1 lighttpd[2029]: ImportError: cannot import name 'get_from' from 'pioreactor.pubsub' (/usr/local>
Sep 27 16:06:06 leader1 lighttpd[2030]: Traceback (most recent call last):
Sep 27 16:06:06 leader1 lighttpd[2030]: File "/var/www/pioreactorui/main.fcgi", line 7, in <module>
Sep 27 16:06:06 leader1 lighttpd[2030]: import tasks # noqa: F401
Sep 27 16:06:06 leader1 lighttpd[2030]: ^^^^^^^^^^^^
Sep 27 16:06:06 leader1 lighttpd[2030]: File "/var/www/pioreactorui/tasks.py", line 18, in <module>
Sep 27 16:06:06 leader1 lighttpd[2030]: from pioreactor.pubsub import get_from
Sep 27 16:06:06 leader1 lighttpd[2030]: ImportError: cannot import name 'get_from' from 'pioreactor.pubsub' (/usr/local>
lines 1-20/20 (END)

Also, this is what I see on the Updates page, not sure if it helps:

Oooo okay. I think I know what happened, and it’s a bug in our update process. Here’s what I want you to do:

  1. SSH into your leader, and run:
    wget -O /tmp/pioreactorui_24.7.3.tar.gz https://github.com/Pioreactor/pioreactor/releases/download/24.7.7/pioreactorui_24.7.3.tar.gz
    
    then
    pio update ui --source /tmp/pioreactorui_24.7.3.tar.gz
    
  2. This should fix your web server. Check your UI and make sure it looks normal.
  3. Next, proceed with updating your cluster (ideally via the release archive method)

It’s getting better: now the leader is updated, but the workers are stuck on version 24.8.22, and the temperature, dosing, and LED automations are not responding (on both the leader and the workers).

Okay, we are making progress! Can you SSH into a worker, and try to manually update with:

pio update app -v 24.9.19

Does that succeed or fail? If it fails, can you post the output?

I tried with worker1 and it worked!
I mean the update worked, but the automations are still not responding.

I tried with worker1 and it worked!

So you should be able to SSH to your leader, and run:

pios update app -v 24.9.19

And that will update all your workers at once (or you can do it one-by-one by SSHing into each worker and using the original command).


I’m guessing other jobs like stirring still work, and it’s just the automations that fail? We did change how automations work in an in-between release.

Can you show me the output of:

ls /var/www/pioreactorui/contrib/jobs | grep temperature

The leader doesn’t update the workers unfortunately…but it also doesn’t give an error message in the UI logs.

Here is what i get:
pioreactor@leader1:~ $ ls /var/www/pioreactorui/contrib/jobs | grep temperature
03_temperature_automation.yaml
03_temperature_control.yaml

The leader doesn’t update the workers unfortunately…but it also doesn’t give an error message in the UI logs.

Actually, that makes sense now that I think about your cluster. Best to just do it manually, one by one.


Finally, run this on your leader:

rm /var/www/pioreactorui/contrib/jobs/*_control.yaml

That should solve the automation issue.
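
To double-check, the same listing from before should now only show the automation file, with no *_control.yaml left:

ls /var/www/pioreactorui/contrib/jobs | grep temperature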

But does that mean that my leader has lost connection to the workers? Is it something we can fix or its going to stay like this?

I ran it, but it didn’t fix the automations.

But does that mean that my leader has lost connection to the workers? Is it something we can fix or its going to stay like this?

No, it’s temporary. This is a consequence of the new way we do updates on the workers. Your workers need to be on a later version (24.9.x) before the leader can communicate successfully.
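
If you want to confirm which version each unit is on, you can query it over SSH — assuming the pio version subcommand is available on your release (swap in each worker’s hostname):

ssh pioreactor@worker1.local pio version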

I ran it, but it didn’t fix the automations.

Make sure you (hard) refresh the UI first. You should only see one “row” of “Temperature Automation” and other automations in the job list (previously, you may have seen two!):

I see one line and I did a (hard) refresh, still not responding:

Okay, hm, again, on a worker (say worker1), can you try the following:

pio update 

That should bring this worker up to date with the leader, and get it communicating with it correctly. Then try starting an automation in the UI.

It brought worker1 ahead of the leader. The automations are still not running, but they are not running on the leader either anyway.

Ah okay, I thought the leader was up to date. That’s okay. I think the goal now is to get your leader and your workers up to 24.9.26. First bring them to 24.9.19, and then to 24.9.26, using the commands I mentioned above.
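
Spelled out with the commands already used in this thread, that’s roughly the following, run on the leader and then on each worker over SSH (since pios hasn’t been reaching your workers):

pio update app -v 24.9.19
pio update app -v 24.9.26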

For the automations: is it still only the automations that aren’t responding from the UI, or is it now any activity?

One thing to look at is: on a machine like worker1 or leader1, open up the logs with pio logs, and then try running the activity from the UI. You should see something like:

INFO Executing pio run temperature_automation --automation-name thermostat --skip-first-run 0 --target-temperature 30

(make sure it says temperature_automation and not temperature_control)

All reactors are up to date now. And yes, it is only the automations that are not working. Stirring, OD, etc. run normally.

I tried what you said, this is what I get:
2024-09-27T22:10:24+0200 [monitor] DEBUG Running JOB_SOURCE=user EXPERIMENT='pH control SC002' nohup pio run temperature_automation >/dev/null 2>&1 & from monitor job.

That command looks like it’s missing arguments. For example, it needs a --automation-name argument:

pio run temperature_automation --automation-name thermostat

So why is it missing arguments? Those arguments should come from the UI when you click “START”, go to the leader’s web server, and then be sent to a worker from there.

I’m kinda stumped. Your leader is version 24.9.26? Try rebooting the cluster (leader and workers) maybe? I’ll have to think more about this and get back to you shortly!