Started Labbook 2024.
December 21, 2023
- Tried again to repair WSL. Removed the program and reinstalled it. Still the same error (busy with update).
- Yet, I hadn't followed the steps of the manual install to the letter (first enable all features, restart and then wsl --update).
- Step 4 failed because I already have a newer update of WSL installed.
- Continued with step 6, and now Ubuntu 20.04 could be opened. Same for 18.04 and 22.04. Back in business!
December 18, 2023
- Looked at the error code I receive from both Ubuntu versions I have under WSL.
- The forums suggested uninstalling WSL and installing WSL2 again.
- Installed Ubuntu 20.04, which pointed to the WSL manual-install instructions.
- Yet, step 4 fails because WSL is still busy. Should try step 5 again after a reboot.
- Checked the machines at my lab.
- ws1 is running 20.04
- ws4 is still running 18.04
- ws5 is now also running 22.04, but is used by wytse and merrin.
- ws7 is running 22.04, and is used by pepijn
- ws8 is running 20.04, and is used by wytse
- ws9 is running 22.04, and is used by thomasorden
- ws10 is running 22.04, and is not recently used
December 15, 2023
- Updated the Nvidia SDK manager on nb-ros. A new JetPack is available (v 5.1.2). Didn't install it (yet).
- Note that Jetson Nano modules can also be selected! The board is not detected; should check how this connection has to be made (WiFi / USB).
- Such a download would require 19 GB (download locations to be specified in Step 2).
December 4, 2023
- Had a problem with the nvidia-drivers on nb-dual, which couldn't be fixed with --fix-broken.
- In the end I removed 127 packages, mainly parts of cuda-toolkit-12-1.
- At least the other 197 outdated packages could now be installed.
- Left cuda-toolkit-12-1 uninstalled for the moment.
September 5, 2023
- On nb-dual my 18.04 no longer seems to work after the update.
- Downloaded Ubuntu 22.04 LTS from the App Store.
- Did a sudo apt upgrade, which worked.
May 24, 2023
- Looked at Deep Learning GPU Benchmarks. For ResNet the 4090 clearly outperforms the 3090; no measurements for the 4080 yet (other than with tools like Blender).
- The Aurora R15 with a 4080 can now be bought at a 25% discount.
April 14, 2023
- Couldn't reach ws5 from home because the wired connection was loose and the system fell back to the wifi connection. Once connected by wire (no wifi), I had to use Pulse to start a UvA VPN to reach the staff webserver.
April 5, 2023
- Had to update my git-credentials again. Installed on ws7 the github-cli gh.
- Continued with gh auth login. Selected to authenticate GitHub CLI, which tries to open a browser; that fails because of a failed snap on a gnome.terminal.
- Yet, I tried this as admin; as a regular user the authentication works (forgot to check for how long).
- Pulse is not working on ws7, waiting on user input but no pop-up appears.
- Installed Pulse on ws5, which works (as can be seen by this entry).
March 21, 2023
- Installed VPN Pulse Secure on ws7. It seems that Ubuntu 22.04 does not accept ssh-rsa by default, so I used the Remote Server Workaround as suggested here. That worked (otherwise I couldn't have made this post).
March 20, 2023
- My XPS workstation hangs again at boot. Luckily, the recovery option works.
- Tried the firefox removal trick, as done previously.
- Followed the steps from here, yet the purge is incomplete because x2dhunspell is still mounted.
- Used the trick from this forum:
snap disable firefox
sudo systemctl stop var-snap-firefox-common-host\\x2dhunspell.mount
sudo systemctl disable var-snap-firefox-common-host\\x2dhunspell.mount
snap remove --purge firefox
- Still, the boot is hanging after Started File System Check Daemon.
- Did a sudo apt reinstall ubuntu-desktop and saw that 31 packages are held back.
- Used the trick in this post and did sudo apt install --only-upgrade --dry-run for each of the packages. Twice a new boot-record was made (the last one based on linux-nvidia-*-common). Now the XPS system boots again for Ubuntu 22.04!
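The odd-looking unit name in the commands above (host\x2dhunspell) comes from systemd's path escaping: slashes become dashes, so a literal dash in the path has to be written as \x2d. A minimal sketch of that rule, assuming the mount point is /var/snap/firefox/common/host-hunspell (inferred from the unit name, not checked on the XPS). The helper systemd_escape_path is hypothetical and simplified; the real tool is systemd-escape --path, which also collapses duplicate slashes.

```python
def systemd_escape_path(path: str) -> str:
    """Simplified sketch of systemd path escaping (hypothetical helper)."""
    trimmed = path.strip("/")
    if not trimmed:
        return "-"  # the root directory maps to a single dash
    out = []
    for i, ch in enumerate(trimmed):
        if ch == "/":
            out.append("-")  # path separators become dashes
        elif (ch.isascii() and ch.isalnum()) or ch in ":_" or (ch == "." and i > 0):
            out.append(ch)  # characters that are kept as they are
        else:
            # everything else (including a literal '-') becomes \xNN per byte
            out.extend(f"\\x{b:02x}" for b in ch.encode("utf-8"))
    return "".join(out)


# Assumed mount point, inferred from the unit name used in the commands above.
unit = systemd_escape_path("/var/snap/firefox/common/host-hunspell") + ".mount"
print(unit)
```

In the shell the backslash itself has to be escaped or quoted, which is why the commands above contain a double backslash.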
March 17, 2023
- Did a fast scan on which operating system runs on which workstation:
- ws1: Ubuntu 20.04.5
- ws2: Ubuntu 18.04.4
- ws3: Ubuntu 22.04
- ws4: Ubuntu 18.04.6
- ws5: Ubuntu 18.04.3
- ws6: Ubuntu 18.04.6
- ws7: Ubuntu 22.04
- ws8: Ubuntu 20.04.5
- ws10: Ubuntu 22.04
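Such a scan can be scripted by reading /etc/os-release on each machine. A minimal sketch of the parsing step only; the function name parse_os_release and the sample text are made up for illustration, and how the file is fetched from each workstation (for instance over ssh) is left open.

```python
def parse_os_release(text: str) -> dict:
    """Parse the KEY=value lines of an os-release file into a dict."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments and malformed lines
        key, _, value = line.partition("=")
        info[key] = value.strip().strip('"')  # values are usually double-quoted
    return info


# Hypothetical sample content; on a real machine read /etc/os-release instead.
sample = 'NAME="Ubuntu"\nVERSION_ID="22.04"\nPRETTY_NAME="Ubuntu 22.04.1 LTS"\n'
info = parse_os_release(sample)
print(f"{info['NAME']} {info['VERSION_ID']}")
```

On a workstation itself, parse_os_release(open("/etc/os-release").read()) should give the same fields.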
February 15, 2023
- Looking into the github documentation on Token expiration and revocation.
- The next step seems to be creating a personal access token.
- The most difficult step is that the token is fine-grained, so you have to select all details on permissions (default is no-access).
- I have given 7 permissions on the repository and 3 account permissions, inspired by the previous classic token which expired in Oct 2022. Unfortunately, this token will already expire on March 12, 2023.
- Yet, it seems that I need an ssh-key.
- Back to basics: Authenticating with the command line.
- Tried gh auth login, selected GitHub.com, https, token, but that gives the error: error validating token: HTTP 401: Bad credentials. My hypothesis is that I didn't select CLI in the fine-grained selection.
- Selecting login via the webbrowser in gh auth login seems to work. At least the git push in zedm_capture worked.
February 7, 2023
- Connecting the Core V2 to my nb-dual. Following the instructions, checked with lspci | grep -i nvidia if the eGPU is accessible. Looks good:
03:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX TITAN X] (rev a1)
- Checked dmesg | grep -i lockdown, which finds no string.
- Checked nvidia-smi, which also looks good:
Tue Feb  7 13:48:01 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.86.01    Driver Version: 515.86.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:03:00.0 Off |                  N/A |
| 22%   26C    P8    14W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
- Ran /usr/local/zed/tools/ZED_Diagnostic, which now gives green signs. The tool gives one warning in the terminal: libpng warning: iCCP: known incorrect sRGB profile. Later I also get the warning QPixmap::scaled: Pixmap is a null pixmap. Even on USB-B the USB-Bandwidth is OK.
- Trying to modify zed_wrapper_nodelet.cpp in rescue labbook.
February 3, 2023
- Updated as requested my password. Webserver access still works!
- Printers should be reconfigured by removing and reinstalling them. Unfortunately, this fails with a CUPS server error 'client-error-not-possible'.
- Was redirected to the new Linux support community page.
- There is a good remark there: the ppd uses MacFollow-Me as queue, while the instructions say LinuxFollow-Me. The ppd from the community is a broken link, so I had to edit the ppd with a text editor. Could print a test page.
January 30, 2023
- Installed the keys on the Ubuntu partition of nb-dual.
- Used jumpbot to access ws1.
- Tried to log in as a regular user to ws7 and ws8, but in the end used the admin account.
- Both were used by students, so switched to ws10. There I could use my regular user account.
- Didn't use port forwarding, so didn't have the graphical interface of the Jupyter notebook. Instead copied the first cell with the imports.
- Importing tensorflow fails on ImportError: /lib/x86_64-linux-gnu/libnccl.so.2: undefined symbol: cudaGraphAddEventWaitNode, version libcudart.so.11.0.
- I have a conda environment with tf1.14, but ~/anaconda3/bin/conda activate tf1.14 fails with CommandNotFoundError: Your shell has not been properly configured to use 'conda activate'.. Running conda init bash doesn't help.
- In the bash configuration the PATH and LD_LIBRARY definitions for both CUDA 10.2 and 11.0 were active. Switched the CUDA 11.0 ones off, but then import tensorflow failed. Reloaded the shell; now the undefined symbol error for version libcudart.so.11.0 is back.
- Yet, conda activate tf1.14 now works! Inside this conda environment python circle_detection_and_localization.py also works (first cell).
- Created the simple model:
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         [(None, 128, 128, 3)]     0
_________________________________________________________________
conv2d (Conv2D)              (None, 128, 128, 16)      1216
_________________________________________________________________
batch_normalization (BatchNo (None, 128, 128, 16)      64
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 128, 128, 32)      12832
_________________________________________________________________
batch_normalization_1 (Batch (None, 128, 128, 32)      128
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 128, 128, 1)       801
=================================================================
Total params: 15,041
Trainable params: 14,945
Non-trainable params: 96
_________________________________________________________________
- Starting the training. Got some warnings:
(1000, 128, 128, 3)
Epoch 1/100
2023-01-30 17:23:22.305349: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2023-01-30 17:23:22.440027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties:
name: NVIDIA GeForce RTX 3090 major: 8 minor: 6 memoryClockRate(GHz): 1.695
pciBusID: 0000:01:00.0
2023-01-30 17:23:22.442261: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudart.so.10.1
2023-01-30 17:23:22.467410: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10
2023-01-30 17:23:22.482567: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcufft.so.10
2023-01-30 17:23:22.486446: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcurand.so.10
2023-01-30 17:23:22.513839: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusolver.so.10
2023-01-30 17:23:22.518528: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcusparse.so.10
2023-01-30 17:23:22.568653: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2023-01-30 17:23:22.571769: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1326] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22647 MB memory) -> physical GPU (device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6)
- Have not seen the output of epoch 2/100 yet.
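As a sanity check on the model summary of this entry, the parameter counts can be reproduced by hand. A minimal sketch, assuming 5x5 kernels with bias in every Conv2D layer; the kernel size is not stated in this labbook, it is inferred from the counts themselves (a 3x3 kernel would give 448 instead of 1216 for the first layer). The helper names are made up for this example.

```python
def conv2d_params(kernel: int, in_ch: int, out_ch: int) -> int:
    """Weights of a square-kernel convolution plus one bias per output channel."""
    return (kernel * kernel * in_ch + 1) * out_ch


def batchnorm_params(channels: int) -> int:
    """gamma, beta, moving mean and moving variance per channel."""
    return 4 * channels


KERNEL = 5  # assumption: inferred from the reported counts, not from the code

layers = [
    conv2d_params(KERNEL, 3, 16),   # conv2d
    batchnorm_params(16),           # batch_normalization
    conv2d_params(KERNEL, 16, 32),  # conv2d_1
    batchnorm_params(32),           # batch_normalization_1
    conv2d_params(KERNEL, 32, 1),   # conv2d_2
]
total = sum(layers)
# Only the moving mean and variance of the batch-norm layers are not trained.
non_trainable = 2 * 16 + 2 * 32
trainable = total - non_trainable
print(layers, total, trainable, non_trainable)
```

Under that assumption the numbers agree with the summary: 15,041 parameters in total, of which 14,945 trainable and 96 non-trainable.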
January 27, 2023
- Finally have installed the Follow-Me printer on nb-dual, following these instructions.
- Initial attempts failed because I used a space in the description.
January 17, 2023
- ws4 has a PCI-bus error, which seems related to its nvidia-drivers.
- Could still log in via the ssh-server in the lab.
- Did a sudo apt upgrade (199 packages). Note that ws4 uses repositories from lambdalab.com.
- The update contains a kernel update (to 5.4.0-136). It contains also an update of tensorboard to v1.15.0.
- The upgrade solved the PCI-bus error.
- Note that ws5 also has problems with its fans.
Previous Labbooks