Pioreactor workers not connecting on newest version

Hi all,

I recently updated my cluster to release 24.5.22, and when I did this all my active worker Pioreactors disconnected and no longer show up in the inventory (not just that they are unassigned). I’ve tried power cycling my cluster leader and multiple different workers, reflashing workers, and adding from the UI or command line, but I cannot get any of my workers to be added. When I try to add from the UI, I get the error:

"Unable to complete connection. The following error occurred: ... ERROR [add_pioreactor] Did not add
Pioreactor to backende[0m Traceback (most recent call last): File "/usr/local/lib/python3.11/dist
packages/pioreactor/cluster_management/__init__.py", line 110, in add_worker result.raise_for_status()
File "/usr/local/lib/python3.11/dist-packages/pioreactor/mureq.py", line 230, in raise_for_status raise
HTTPErrorStatus(self.status_code) pioreactor.mureq.HTTPErrorStatus: HTTP response returned error
code 500 During handling of the above exception, another exception occurred: Traceback (most recent
call last): File "/usr/local/bin/pio", line 8, in <module> sys.exit(pio()) ^^^^^ File
"/usr/local/lib/python3.11/dist-packages/click/core.py", line 1157, in __call__ return self.main(*args,
**kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/click/core.py", line
1078, in main rv = self.invoke(ctx) ^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist
packages/click/core.py", line 1688, in invoke return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1688, in
invoke return _process_result(sub_ctx.command.invoke(sub_ctx)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File
"/usr/local/lib/python3.11/dist-packages/click/core.py", line 1434, in invoke return
ctx.invoke(self.callback, **ctx.params) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File
"/usr/local/lib/python3.11/dist-packages/click/core.py", line 783, in invoke return __callback(*args,
**kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.11/dist
packages/pioreactor/cluster_management/__init__.py", line 113, in add_worker raise HTTPException("Did
not add Pioreactor to backend") http.client.HTTPException: Did not add Pioreactor to backend"

Any ideas on how to fix this? I am running on a local access point and previously had 4 workers connected to my cluster leader (a Raspberry Pi without a HAT). All my Raspberry Pis are 3 Model A+, and all my Pioreactors are version 1.0. I had tried updating the config.ini file to include

model=pioreactor_20ml
version=1.0

but got even more errors when I tried that:

Unable to complete connection. The following error occurred: ... ERROR [add_pioreactor]
+ set -e
+ export LC_ALL=C
+ LC_ALL=C
+ SSHPASS=raspberry
+ PIO_VERSION=1.0
+ PIO_MODEL=pioreactor_20ml
+ HOSTNAME=worker1
+ HOSTNAME_local=worker1.local
+ USERNAME=pioreactor
++ hostname
+ LEADER_HOSTNAME=leader
+ ssh-keygen -R worker1.local
+ ssh-keygen -R worker1
++ getent hosts worker1.local
++ cut '-d ' -f1
+ ssh-keygen -R 10.42.0.61
+ N=120
+ counter=0
+ sshpass -p raspberry ssh pioreactor@worker1.local 'test -d /home/pioreactor/.pioreactor && echo '\''exists'\'''
Warning: Permanently added 'worker1.local' (ED25519) to the list of known hosts.
+ pio workers discover -t
+ grep -q worker1
+ sshpass -p raspberry ssh-copy-id pioreactor@worker1.local
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/home/pioreactor/.ssh/id_rsa.pub"
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: WARNING: All keys were skipped because they already exist on the remote system. (if you think this is a mistake, you may want to use -f option)
+ UNIT_CONFIG=/home/pioreactor/.pioreactor/config_worker1.ini
+ rm -f /home/pioreactor/.pioreactor/config_worker1.ini
+ touch /home/pioreactor/.pioreactor/config_worker1.ini
+ echo -e '# Any settings here are specific to worker1, and override the settings in shared config.ini'
+ crudini --set /home/pioreactor/.pioreactor/config_worker1.ini pioreactor version 1.0 --set /home/pioreactor/.pioreactor/config_worker1.ini pioreactor model pioreactor_20ml
+ ssh-keyscan worker1.local
# worker1.local:22 SSH-2.0-OpenSSH_9.2p1 -2+deb12u1
# worker1.local:22 SSH-2.0-OpenSSH_9.2p1 -2+deb12u1
# worker1.local:22 SSH-2.0-OpenSSH_9.2p1 -2+deb12u1
# worker1.local:22 SSH-2.0-OpenSSH_9.2p1 -2+deb12u1
# worker1.local:22 SSH-2.0-OpenSSH_9.2p1 -2+deb12u1
+ pios sync-configs --units worker1 --skip-save
2023-12-11T17:11:10-0500 DEBUG [sync_configs] Syncing configs on worker1...
+ sleep 1
+ N=120
+ counter=0
+ sshpass -p raspberry ssh pioreactor@worker1.local 'test -f /home/pioreactor/.pioreactor/config.ini && echo '\''exists'\'''
+ ssh pioreactor@worker1.local 'echo "server leader.local iburst prefer" | sudo tee -a /etc/chrony/chrony.conf'
tee: /etc/chrony/chrony.conf: No such file or directory
Traceback (most recent call last):
  File "/usr/local/bin/pio", line 8, in <module>
    sys.exit(pio())
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.11/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.11/dist-packages/pioreactor/cluster_management/__init__.py", line 102, in add_worker
    raise BashScriptError(res.stderr)
pioreactor.exc.BashScriptError: + set -e + export LC_ALL=C [... same bash trace as above, ending with ...] tee: /etc/chrony/chrony.conf: No such file or directory

Hi @DonalC

Sorry about this trouble! We recently abstracted how we make API calls to the leader.

HTTPException("Did not add Pioreactor to backend")

This makes me think there’s an API error somewhere. A few questions:

  1. Do you recall what software version you were on before? Here’s a list of releases.
  2. Were all the Pioreactors on the same software version?
  3. What is the value of the leader_address entry under the [cluster.topology] section in config.ini? (See the snippet below for one way to check.)
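
For question 3, one quick way to check from the leader’s command line is with crudini (which the add-worker script above already uses); a small convenience sketch, using the shared config path seen elsewhere in this thread:

    # print the leader_address entry from the shared config.ini on the leader
    crudini --get /home/pioreactor/.pioreactor/config.ini cluster.topology leader_address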

when I did this all my active worker Pioreactors disconnected and no longer show up in the inventory (not just that they are unassigned)

This sounds pretty concerning, but it may have been a connection issue preventing the UI from fetching the workers from the database. Let me ask: from your UI, can you still create new experiments, or does that fail?

  1. I was previously on 24.4.3, but had reflashed my cluster leader yesterday as I was in the process of switching from a local access point to using a WiFi network (I have 6 pioreactors and have been running into the limit of how many can connect on the local access point).
  2. Yes, they should have been; they were all connected to the cluster on the previous version and were functioning correctly.
  3. leader_address=leader.local
  4. Yes I am able to create new experiments

The strange part to me is that it seems to still recognize that the Pioreactors are on the local access point: occasionally in the logs I’ll get a message from the watchdog, “Update: < worker_i > is connected. All is well”, followed immediately by “< worker_j > seems to be lost. Trying to re-establish connection…” (where worker_i and worker_j are different).

Thanks for your help

Edit: Actually, on closer read:

I was previously on 24.4.3, but had reflashed my cluster leader yesterday as I was in the process of switching from a local access point to using a WiFi network

So did you update using our update process, or do you mean you’re starting from scratch, on 24.5.22, and unable to add workers?


Previous answer:

Ah okay. So you probably jumped too far ahead in upgrades (oddly enough, between 24.4.3 and 24.5.22 there are a few important upgrades), and this caused some issues. We should have been clearer on upgrade best practices.¹

So let’s start by fixing your leader, and then deal with the workers.

  1. We’ll start by removing all workers from the cluster. This isolates things to just your leader. On your leader’s command line, for each worker, run:

    pio workers remove worker1
    

    with worker1 replaced by the worker name (a loop for removing several workers at once is sketched just after this list).

  2. Next, we’re going to downgrade the leader to a previous version, and then bring it back up to the current version. This way it sees all the “deltas” between updates.

    1. Download the release archive for our 24.5.1 release, and use the UI to downgrade to 24.5.1.
    2. Download the release archive for our 24.5.13 release, and use the UI to update to 24.5.13.
    3. Finally, use the release archive for 24.5.22 to update back to 24.5.22.
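
Regarding step 1: if you have several workers, the removals can be scripted in one go from the leader’s command line; a minimal sketch, with placeholder worker names you should replace with your own:

    # remove each worker from the cluster inventory, one at a time
    for w in worker1 worker2 worker3 worker4; do
        pio workers remove "$w"
    done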

At this point your leader should be caught up and ready. From here, you can try adding workers. Try adding new workers first, i.e. the ones you just flashed. Does that succeed?
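
If you prefer the command line over the UI for this, here is a minimal sketch, assuming the pio workers add subcommand on 24.5.x (the worker name is a placeholder for one you just flashed):

    # run on the leader: register a freshly flashed worker with the cluster
    pio workers add newworker1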


¹We recently added some protection against making these large version jumps, but it was introduced in an in-between version.

I’ll follow the steps listed of course, but I did actually update to 24.5.1 and then 24.5.13 before 24.5.22

I tried removing all the workers, but each time I got the message

Worker worker1 not present to be removed. Check hostname. 

I updated using the UI from 24.4.11 to 24.4.3, 24.5.1, 24.5.13, and 24.5.22

I’ll follow the steps listed of course, but I did actually update to 24.5.1 and then 24.5.13 before 24.5.22

Oh interesting! Okay, maybe disregard my comments then! It could be a problem with the web server not retrieving the right data.

When you’re on the leader, what does curl http://localhost/api/workers return?
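
For reference, a healthy response should be a JSON list of the registered workers; the exact fields below are my assumption, so treat this as a rough sketch:

    # run on the leader
    curl http://localhost/api/workers
    # roughly expected shape: [{"pioreactor_unit": "worker1", "added_at": "...", "is_active": 1}, ...]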

I get an internal server error:

<!doctype html>
<html lang=en>
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete 
your request. Either the server is overloaded or there is an error in 
the application.</p>

Okay, now that looks like a problem.

Anything odd occurring in the UI logs?

tail /var/log/pioreactorui.log -n 20
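
If the last 20 lines don’t show anything useful, you can also follow the log live while reproducing the 500 in another window (standard tail usage, same log file):

    # stream the UI log; trigger the failing page while this runs
    tail -f /var/log/pioreactorui.log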

Let’s open up the database, and make sure it’s healthy:

pio db

opens up the DB shell. Try

SELECT * FROM workers;

you should see your workers. Let’s check integrity:

pragma integrity_check;

Any issues there?

.quit to exit the shell.
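
If you’d rather run the same checks non-interactively, sqlite3 can be pointed at the database file directly; the path below is my assumption about the default storage location, so adjust it if your [storage] setting differs:

    # run both checks against the leader's database in one shot
    sqlite3 /home/pioreactor/.pioreactor/storage/pioreactor.sqlite \
      "SELECT * FROM workers; PRAGMA integrity_check;"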

The UI log shows the following traceback from dispatch_request:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/flask/app.py", line 1455, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.11/dist-packages/flask/app.py", line 869, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.11/dist-packages/flask/app.py", line 867, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.11/dist-packages/flask/app.py", line 852, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/var/www/pioreactorui/views.py", line 1422, in get_list_of_workers
    all_workers = query_db(
  File "/var/www/pioreactorui/app.py", line 102, in query_db
    cur = _get_db_connection().execute(query, args)
sqlite3.OperationalError: no such table: workers

When checking the workers table I get

Parse error: no such table: workers

The integrity check returned “ok”

No workers table?! Didn’t we start with a Pioreactor with a working Inventory page (which relies on the workers table)?

I have no idea how the workers table isn’t there, but we can add it. Open up the db shell again with pio db, and enter:

CREATE TABLE IF NOT EXISTS experiment_worker_assignments (
    pioreactor_unit     TEXT NOT NULL,
    experiment          TEXT NOT NULL,
    assigned_at         TEXT NOT NULL,
    UNIQUE(pioreactor_unit), -- force a worker to only ever be assigned to a single experiment.
    FOREIGN KEY (pioreactor_unit) REFERENCES workers(pioreactor_unit)  ON DELETE CASCADE,
    FOREIGN KEY (experiment) REFERENCES experiments(experiment)  ON DELETE CASCADE
);

CREATE TABLE IF NOT EXISTS workers (
    pioreactor_unit     TEXT NOT NULL, -- id
    added_at            TEXT NOT NULL,
    is_active           INTEGER DEFAULT 1 NOT NULL,
    UNIQUE(pioreactor_unit)
);

And then I think everything should start working. You’ll need to re-add the workers to the leader.
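
As a quick sanity check before re-adding anything, you can hit the same API endpoint from earlier in this thread; it should now return an empty list instead of a 500. The pio workers add form is my assumption about the 24.5.x CLI, and the worker name is a placeholder:

    # should now return [] rather than an Internal Server Error
    curl http://localhost/api/workers
    # then re-add each worker, e.g.:
    pio workers add worker1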