Computer Systemen 2023: Labbook

Started Labbook 2024.

December 21, 2023

Tried again to repair WSL. Removed the program and reinstalled it. Still the same error (busy with update).
Yet, I didn't follow the steps of the manual install to the letter (first enable all features, restart, and then wsl --update).
Step 4 failed because a newer update of WSL is already installed.
Continued with step 6, and now Ubuntu 20.04 could be opened. Same for 18.04 and 22.04. Back in business!

December 18, 2023

Looked at the error code I get from both Ubuntu versions I have in WSL.
The forums suggested uninstalling WSL and installing WSL2 again.
Installed Ubuntu 20.04, which pointed to the WSL install manual.
Yet, step 4 fails because WSL is still busy. Should try step 5 again after a reboot.
Checked the machines in my lab.
- ws1 is running 20.04
- ws4 is still running 18.04
- ws5 is now also running 22.04, but is used by wytse and merrin.
- ws7 is running 22.04, and is used by pepijn
- ws8 is running 20.04, and is used by wytse
- ws9 is running 22.04, and is used by thomasorden
- ws10 is running 22.04, and is not recently used

December 15, 2023

Updated the Nvidia SDK manager on nb-ros. A new JetPack is available (v5.1.2). Didn't install it (yet).
Note that Jetson Nano modules can also be selected! The board is not detected; I should check how this connection should be made (WiFi / USB).
Such a download would require 19 GB (the download locations are specified in Step 2).
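Before starting a download of this size it may help to check the free space in the chosen download location. A minimal Python sketch, assuming that the 19Gb shown by the SDK manager means gigabytes (counted here as 19 GiB to err on the safe side); the helper name has_room and the default directory are hypothetical:

```python
import shutil

# Assumption: the SDK manager's "19Gb" is read as 19 GiB, which errs on the safe side.
GIB = 1024 ** 3
JETPACK_DOWNLOAD = 19 * GIB


def has_room(path: str, required_bytes: int) -> bool:
    """True if the filesystem that holds `path` has at least `required_bytes` free."""
    return shutil.disk_usage(path).free >= required_bytes


if __name__ == "__main__":
    # Hypothetical download location; use the directory chosen in Step 2 instead.
    target = "."
    free_gib = shutil.disk_usage(target).free / GIB
    print(f"{free_gib:.1f} GiB free in {target!r}; enough for JetPack: "
          f"{has_room(target, JETPACK_DOWNLOAD)}")
```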

December 4, 2023

On nb-dual I had a problem with the nvidia drivers, which couldn't be fixed with --fix-broken.
In the end I removed 127 packages, mainly those belonging to cuda-toolkit-12-1.
At least now the other 197 outdated packages could be upgraded.
For the moment I left cuda-toolkit-12-1 uninstalled.

September 5, 2023

On nb-dual my Ubuntu 18.04 no longer seems to work after the update.
Downloaded Ubuntu 22.04 LTS from the App Store.
Did a sudo apt upgrade, which worked.

May 24, 2023

Looked at Deep Learning GPU Benchmarks. For ResNet the 4090 clearly outperforms the 3090; there are no measurements for the 4080 yet (other than for tools like Blender).
The Aurora R15 with a 4080 can now be bought with a 25% discount.

April 14, 2023

Couldn't reach ws5 from home because the wired connection was loose and the system was using the WiFi connection. Once the wired connection was restored (no WiFi), I had to use Pulse to start a UvA VPN to reach the staff webserver.

April 5, 2023

Had to update my git credentials again. Installed the GitHub CLI gh on ws7.
Continued with gh auth login. Selected to authenticate GitHub CLI, which tries to open a browser; this fails because of a failed snap in gnome.terminal.
Yet, I tried this as admin; as a regular user the authentication works (forgot to check for how long).
Pulse is not working on ws7: it waits for user input, but no pop-up appears.
Installed Pulse on ws5, which works (as can be seen from this entry).

March 21, 2023

Installed the Pulse Secure VPN on ws7. It seems that Ubuntu 22.04 no longer accepts ssh-rsa by default, so I used the Remote Server Workaround as suggested here. That worked (otherwise I couldn't have made this post).

March 20, 2023

My XPS workstation hangs again at boot. Luckily, the recovery option works.
Tried the Firefox removal trick, as done previously.
Followed the steps from here, yet the purge is incomplete because x2dhunspell is still mounted.
Used the trick from this forum:
snap disable firefox
sudo systemctl stop var-snap-firefox-common-host\\x2dhunspell.mount
sudo systemctl disable var-snap-firefox-common-host\\x2dhunspell.mount
snap remove --purge firefox
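The \x2d in the unit name is systemd's escape for a dash: mount units are named after their mount point, so this unit decodes to the path /var/snap/firefox/common/host-hunspell. In an unquoted shell command the backslash itself has to be doubled, hence the \\x2d in the commands from the forum; systemd-escape --path --suffix=mount followed by the path prints the escaped name directly. The Python sketch below approximates the escaping rules from the systemd.unit man page (it skips the normalisation of . and .. path components); escape_path and mount_unit are illustrative names, not part of any library:

```python
import string

# Characters systemd keeps as-is in unit names (see systemd.unit(5)).
_SAFE = set(string.ascii_letters + string.digits + ":_.")


def escape_path(path: str) -> str:
    """Approximate `systemd-escape --path`: turn a filesystem path into a unit-name stem."""
    # Dropping empty components removes leading, trailing and duplicate slashes.
    parts = [p for p in path.split("/") if p]
    if not parts:
        return "-"  # the root directory "/" escapes to "-"
    out = []
    for i, byte in enumerate("/".join(parts).encode("utf-8")):
        ch = chr(byte)
        if ch == "/":
            out.append("-")  # path separators become dashes
        elif ch in _SAFE and not (i == 0 and ch == "."):
            out.append(ch)  # safe characters stay, except a leading "."
        else:
            out.append(f"\\x{byte:02x}")  # everything else, including "-", becomes \xNN
    return "".join(out)


def mount_unit(path: str) -> str:
    """Name of the .mount unit for a mount point."""
    return escape_path(path) + ".mount"


if __name__ == "__main__":
    print(mount_unit("/var/snap/firefox/common/host-hunspell"))
```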
Still, the boot hangs after 'Started File System Check Daemon'.
Did a sudo apt reinstall ubuntu-desktop and saw that 31 packages were held back.
Used the trick in this post and did sudo apt install --only-upgrade --dry-run for each of the packages. Twice a new boot record was made (the last one based on linux-nvidia-*-common). Now the XPS system boots Ubuntu 22.04 again!
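The per-package dry runs can be generated instead of typed. A sketch, assuming the held-back list is taken from the 'kept back' block that apt-get --simulate upgrade prints; the sample output below is made up for illustration and the helper names kept_back and dry_run_commands are hypothetical:

```python
import shlex

HEADER = "The following packages have been kept back:"


def kept_back(apt_output: str) -> list[str]:
    """Package names listed under apt's 'kept back' header."""
    packages: list[str] = []
    in_block = False
    for line in apt_output.splitlines():
        if line.startswith(HEADER):
            in_block = True
        elif in_block and line.startswith(" "):
            packages.extend(line.split())  # names are on indented continuation lines
        elif in_block:
            break  # the first unindented line closes the block
    return packages


def dry_run_commands(packages: list[str]) -> list[str]:
    """One simulated single-package upgrade per held-back package."""
    return [shlex.join(["sudo", "apt", "install", "--only-upgrade", "--dry-run", pkg])
            for pkg in packages]


# Made-up sample in the shape printed by `apt-get --simulate upgrade`.
SAMPLE = """Reading package lists... Done
Calculating upgrade... Done
The following packages have been kept back:
  linux-generic linux-headers-generic
  linux-image-generic
0 upgraded, 0 newly installed, 0 to remove and 3 not upgraded.
"""

if __name__ == "__main__":
    for cmd in dry_run_commands(kept_back(SAMPLE)):
        print(cmd)
```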

March 17, 2023

Did a quick scan of which operating system runs on which workstation:
- ws1: Ubuntu 20.04.5
- ws2: Ubuntu 18.04.4
- ws3: Ubuntu 22.04
- ws4: Ubuntu 18.04.6
- ws5: Ubuntu 18.04.3
- ws6: Ubuntu 18.04.6
- ws7: Ubuntu 22.04
- ws8: Ubuntu 20.04.5
- ws10: Ubuntu 22.04
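Such a scan can be scripted by reading /etc/os-release on each machine; its PRETTY_NAME field also shows the point release (e.g. 20.04.5). A sketch, assuming the workstations are reachable with key-based ssh under the names ws1 ... ws10; parse_os_release and remote_os are hypothetical helpers:

```python
import shlex
import subprocess
import sys


def parse_os_release(text: str) -> dict[str, str]:
    """Parse the KEY=value lines of /etc/os-release (values may be shell-quoted)."""
    info: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        try:
            words = shlex.split(value)  # strips the quotes os-release uses
        except ValueError:
            words = [value]  # unbalanced quotes: keep the raw value
        info[key] = " ".join(words)
    return info


def remote_os(host: str) -> str:
    """PRETTY_NAME of `host`, read over ssh (assumes key-based login works)."""
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, "cat", "/etc/os-release"],
        capture_output=True, text=True, check=True, timeout=15,
    )
    return parse_os_release(result.stdout).get("PRETTY_NAME", "unknown")


if __name__ == "__main__":
    # Usage: python os_scan.py ws1 ws2 ... (host names as known to your ssh config)
    for host in sys.argv[1:]:
        try:
            print(f"{host}: {remote_os(host)}")
        except (subprocess.SubprocessError, OSError) as err:
            print(f"{host}: unreachable ({err})")
```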

February 15, 2023

Looking into the GitHub documentation on Token expiration and revocation.
The next step seems to be creating a personal access token.
The most difficult part is that the token is fine-grained, so you have to select every permission explicitly (the default is no access).
I granted 7 repository permissions and 3 account permissions, inspired by the previous classic token, which expired in October 2022. Unfortunately, this token will already expire on March 12, 2023.
Yet, it seems that I need an SSH key.
Back to basics: Authenticating with the command line.
Tried gh auth login, selected GitHub.com, HTTPS, token, but that gives the error 'error validating token: HTTP 401: Bad credentials'. My hypothesis is that I didn't select CLI in the fine-grained selection.
Selecting login via the web browser in gh auth login seems to work. At least the git push in zedm_capture worked.
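A 401 Bad credentials can also be checked outside gh, by asking the GitHub REST API for the authenticated user with the token; a 401 from this endpoint means GitHub does not accept the token at all (for example when it is expired or mistyped). A sketch using only the Python standard library; reading the token from a GITHUB_TOKEN environment variable is an assumption, and token_request and check_token are hypothetical helpers:

```python
import json
import os
import urllib.error
import urllib.request

API_USER = "https://api.github.com/user"


def token_request(token: str) -> urllib.request.Request:
    """GET /user, authenticated with `token`."""
    return urllib.request.Request(
        API_USER,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


def check_token(token: str) -> str:
    """Describe whether GitHub accepts the token (HTTP 401 means 'Bad credentials')."""
    try:
        with urllib.request.urlopen(token_request(token), timeout=10) as resp:
            return "accepted for user " + json.load(resp)["login"]
    except urllib.error.HTTPError as err:
        return f"rejected: HTTP {err.code}"
    except OSError as err:  # DNS failures, timeouts, no network
        return f"unreachable: {err}"


if __name__ == "__main__":
    # Hypothetical: the token comes from an environment variable, never hard-coded.
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        print(check_token(token))
```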

February 7, 2023

Connected the Core V2 to my nb-dual. Following the instructions, checked with lspci | grep -i nvidia whether the eGPU is accessible. Looks good:
03:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
Checked dmesg | grep -i lockdown, which finds no matches.
Checked nvidia-smi, which also looks good:
Tue Feb  7 13:48:01 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:03:00.0 Off |                  N/A |
| 22%   26C    P8    14W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
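The table is meant for humans; for a script that checks whether the eGPU is visible, nvidia-smi can print selected fields as CSV with --query-gpu and --format=csv,noheader. A sketch; the field list is a choice made here, and parse_gpus and query_gpus are hypothetical helpers:

```python
import subprocess

FIELDS = ["name", "pci.bus_id", "memory.total"]


def parse_gpus(csv_text: str) -> list[dict[str, str]]:
    """Turn `--format=csv,noheader` output into one dict per GPU."""
    gpus = []
    for line in csv_text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        gpus.append(dict(zip(FIELDS, values)))
    return gpus


def query_gpus() -> list[dict[str, str]]:
    """Ask nvidia-smi for the fields listed in FIELDS."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS), "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpus(out)


if __name__ == "__main__":
    try:
        for gpu in query_gpus():
            print(gpu)
    except (OSError, subprocess.CalledProcessError) as err:
        print(f"nvidia-smi not usable: {err}")
```

For the eGPU in this entry one would expect a single line with the TITAN X name, bus 00000000:03:00.0 and 12288 MiB, matching the lspci and nvidia-smi output.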
Ran /usr/local/zed/tools/ZED_Diagnostic, which now shows green indicators. In the terminal the tool gives one warning: libpng warning: iCCP: known incorrect sRGB profile. Later I also get the warning QPixmap::scaled: Pixmap is a null pixmap. Even on USB-B the USB bandwidth is OK.
Trying to modify zed_wrapper_nodelet.cpp (see the rescue labbook).

February 3, 2023

Updated my password as requested. Webserver access still works!
Printers should be reconfigured by removing and reinstalling them. Unfortunately, this fails with a CUPS server error 'client-error-not-possible'.
Was redirected to the new Linux support community page.
There is a good remark there: the ppd uses MacFollow-Me as queue, while the instructions say LinuxFollow-Me. The link to the ppd on the community page is broken, so I had to edit the ppd with a text editor. Could print a test page.
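The same edit can be scripted. A sketch that swaps the queue name in a ppd file and keeps a backup next to it; it assumes MacFollow-Me only appears where the queue name is meant, and the file path is whatever ppd one has at hand:

```python
import sys
from pathlib import Path

OLD_QUEUE = "MacFollow-Me"
NEW_QUEUE = "LinuxFollow-Me"


def fix_queue(ppd_text: str) -> tuple[str, int]:
    """Replace every MacFollow-Me by LinuxFollow-Me; also return the number of hits."""
    return ppd_text.replace(OLD_QUEUE, NEW_QUEUE), ppd_text.count(OLD_QUEUE)


def fix_ppd_file(path: Path) -> int:
    """Rewrite `path` in place, keeping a .orig backup; returns the number of replacements."""
    # surrogateescape round-trips bytes that are not valid UTF-8 (ppd files may be Latin-1)
    text = path.read_text(encoding="utf-8", errors="surrogateescape")
    fixed, hits = fix_queue(text)
    if hits:
        backup = path.with_name(path.name + ".orig")
        backup.write_text(text, encoding="utf-8", errors="surrogateescape")
        path.write_text(fixed, encoding="utf-8", errors="surrogateescape")
    return hits


if __name__ == "__main__":
    for name in sys.argv[1:]:  # hypothetical: path(s) to the downloaded .ppd file
        print(f"{name}: {fix_ppd_file(Path(name))} replacement(s)")
```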

January 30, 2023

Installed the keys on the Ubuntu partition of nb-dual.
Used jumpbot to access ws1.
Tried to log in as a regular user to ws7 and ws8, but in the end used the admin account.
Both were used by students, so switched to ws10. There I could use my regular user account.
Didn't use port forwarding, so I didn't have the graphical interface of the Jupyter notebook. Instead, I copied the first cell with the imports.
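With port forwarding the notebook interface would have been available. A sketch that builds the ssh command, assuming jumpbot is the lab's jump host and Jupyter listens on its default port 8888 on the workstation; the host names and the helper jupyter_tunnel are illustrative:

```python
import shlex


def jupyter_tunnel(target: str, jump: str, port: int = 8888) -> list[str]:
    """ssh argv: reach `target` through `jump` and forward a local port to Jupyter there."""
    return [
        "ssh",
        "-J", jump,                        # ProxyJump through the jump host
        "-L", f"{port}:localhost:{port}",  # local port -> same port on the workstation
        target,
    ]


if __name__ == "__main__":
    # Hypothetical host names taken from this entry.
    print(shlex.join(jupyter_tunnel("ws10", "jumpbot")))
```

On the workstation the notebook would then be started with jupyter notebook --no-browser --port 8888, after which http://localhost:8888 opens it on the local machine.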
Importing tensorflow fails with ImportError: /lib/x86_64-linux-gnu/libnccl.so.2: undefined symbol: cudaGraphAddEventWaitNode, version libcudart.so.11.0.
I have a conda environment with tf1.14, but ~/anaconda3/bin/conda activate tf1.14 fails with CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'. Running conda init bash doesn't help.
In the bash configuration both the PATH and LD_LIBRARY_PATH definitions for CUDA 10.2 and 11.0 were active. Switched CUDA 11.0 off, but now import tensorflow failed. Reloaded the shell; now the undefined symbol for version libcudart.so.11.0 is back.
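Mixed CUDA versions in PATH and LD_LIBRARY_PATH can be filtered rather than switched off by hand. A sketch, assuming the toolkits live in versioned directories such as /usr/local/cuda-10.2; because a child process cannot change the environment of the shell that started it, the script prints export lines instead of setting them. The helper name keep_one_cuda is hypothetical:

```python
import os
import re

SEP = ":"  # PATH and LD_LIBRARY_PATH are colon-separated on Linux
CUDA_DIR = re.compile(r"cuda-(\d+\.\d+)")  # assumes installs like /usr/local/cuda-10.2


def keep_one_cuda(value: str, version: str) -> str:
    """Drop entries that point at a versioned CUDA install other than `version`."""
    kept: list[str] = []
    for entry in value.split(SEP):
        match = CUDA_DIR.search(entry)
        if match and match.group(1) != version:
            continue  # belongs to another CUDA version
        if entry and entry not in kept:  # also drops empty and duplicate entries
            kept.append(entry)
    return SEP.join(kept)


if __name__ == "__main__":
    # Print export lines to paste (or eval) in the current shell.
    for var in ("PATH", "LD_LIBRARY_PATH"):
        cleaned = keep_one_cuda(os.environ.get(var, ""), "10.2")
        print(f'export {var}="{cleaned}"')
```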
Yet, conda activate tf1.14 now works! Inside this conda environment python circle_detection_and_localization.py also works (first cell).
Created the simple model:
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         [(None, 128, 128, 3)]     0
_________________________________________________________________
conv2d (Conv2D)              (None, 128, 128, 16)      1216
_________________________________________________________________
batch_normalization (BatchNo (None, 128, 128, 16)      64
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 128, 128, 32)      12832
_________________________________________________________________
batch_normalization_1 (Batch (None, 128, 128, 32)      128
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 128, 128, 1)       801
=================================================================
Total params: 15,041
Trainable params: 14,945
Non-trainable params: 96
_________________________________________________________________
Starting the training. Got some (informational) log messages:
(1000, 128, 128, 3)
Epoch 1/100
2023-01-30 17:23:22.305349: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2023-01-30 17:23:22.440027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: NVIDIA GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.695
pciBusID: 0000:01:00.0
2023-01-30 17:23:22.442261: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2023-01-30 17:23:22.467410: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2023-01-30 17:23:22.482567: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2023-01-30 17:23:22.486446: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2023-01-30 17:23:22.513839: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2023-01-30 17:23:22.518528: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2023-01-30 17:23:22.568653: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2023-01-30 17:23:22.571769: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22647 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6)
Have not seen the output of epoch 2/100 yet.
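The parameter counts in the model summary can be checked by hand: a Conv2D layer has kernel x kernel x input channels x output channels weights plus one bias per output channel, and a BatchNormalization layer has four values per channel, of which gamma and beta are trainable and the moving mean and variance are not (hence the 96 non-trainable parameters). The 5x5 kernel size is an assumption inferred from the counts (5*5*3*16 + 16 = 1216), not something recorded in this entry:

```python
def conv2d_params(kernel: int, c_in: int, c_out: int) -> int:
    """Square-kernel weights plus one bias per output channel."""
    return kernel * kernel * c_in * c_out + c_out


def batchnorm_params(channels: int) -> tuple[int, int]:
    """(trainable, non-trainable): gamma/beta are trained, moving mean/variance are not."""
    return 2 * channels, 2 * channels


KERNEL = 5  # assumption: inferred from the counts, since 5 * 5 * 3 * 16 + 16 = 1216

# (layer, trainable, non-trainable), following the model summary of this entry
LAYERS = [
    ("conv2d", conv2d_params(KERNEL, 3, 16), 0),
    ("batch_normalization", *batchnorm_params(16)),
    ("conv2d_1", conv2d_params(KERNEL, 16, 32), 0),
    ("batch_normalization_1", *batchnorm_params(32)),
    ("conv2d_2", conv2d_params(KERNEL, 32, 1), 0),
]

trainable = sum(t for _, t, _ in LAYERS)
non_trainable = sum(n for _, _, n in LAYERS)

if __name__ == "__main__":
    for name, t, n in LAYERS:
        print(f"{name:<24}{t + n:>8}")
    print(f"Total params: {trainable + non_trainable:,}")
    print(f"Trainable params: {trainable:,}")
    print(f"Non-trainable params: {non_trainable:,}")
```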

January 27, 2023

Finally installed the Follow-Me printer on nb-dual, following these instructions.
Initial attempts failed because I used a space in the description.

January 17, 2023

ws4 has a PCI bus error, which seems related to its nvidia drivers.
Could still log in via the ssh server in the lab.
Did a sudo apt upgrade (199 packages). Note that ws4 uses repositories from lambdalabs.com.
The update contains a kernel update (to 5.4.0-136). It also contains an update of tensorboard to v1.15.0.
The upgrade solved the PCI-bus error.
Note that ws5 also has problems with its fans.

Previous Labbooks