nvidia-smi shows API mismatch
Sometimes, after updating the NVIDIA driver and CUDA on Rocky Linux systems, running nvidia-smi reports a kernel driver version mismatch. If you run
dmesg
it will show:
NVRM: API mismatch: the client has the version aaa.bbb, but
NVRM: this kernel module has the version ccc.ddd. Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.
where aaa.bbb is not the same as ccc.ddd.
This happens when the corresponding NVIDIA driver was not properly registered by dkms.
Other solutions suggest rebooting the server, reinstalling the drivers, recreating the initramfs, or unloading the corresponding nvidia modules with rmmod. These methods sometimes work. When none of them works, you can try
dkms install -m nvidia -v 570.144
where 570.144 is replaced with the version of your most recently installed NVIDIA driver. Then reboot the server. This should work.
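To see the two versions side by side, the NVRM lines can be pulled out of the dmesg output. The following is a minimal sketch, not part of the original fix: the function name nvrm_versions is made up for illustration, and it assumes the message wording quoted above. The client value is the version of the user-space driver components, which is likely the one to pass to dkms.

```shell
#!/bin/sh
# nvrm_versions (hypothetical helper name): read kernel log text on stdin and
# print the two versions named in the NVRM "API mismatch" message.
# Assumes the message wording quoted in the text above.
nvrm_versions() {
    sed -n \
        -e 's/.*NVRM: API mismatch: the client has the version \([0-9][0-9.]*[0-9]\).*/client=\1/p' \
        -e 's/.*NVRM: this kernel module has the version \([0-9][0-9.]*[0-9]\).*/module=\1/p'
}

# Typical use on the affected machine:
#   dmesg | nvrm_versions
#   dkms status     # lists the module versions dkms currently knows about
```

If dkms status does not list the client version as installed for the running kernel, that matches the registration problem described above.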
When a dnf kernel update fails to generate the initramfs
This sometimes happens when the automatic NVIDIA kernel module installation fails.
The workaround is:
- Boot into an older kernel;
- Run ls /boot/ and find the vmlinuz file that has no corresponding initramfs. For example, it could be vmlinuz-6.14.3-300.fc42.x86_64. Your kernel version is then 6.14.3-300.fc42.x86_64, in the form major.mid.minor-nnn.osver.arch. Let's call it $KVER.
- Check that /lib/modules/$KVER/ exists by running ls on it. If it exists, run "depmod -v $KVER". This creates modules.dep in the folder /lib/modules/$KVER/.
- Run dracut --force --kver $KVER. If that does not work, use an older version of gcc, like "CC=gcc-14 dracut --force --kver $KVER".
- Reboot into the newest kernel. Usually it will not contain the NVIDIA driver kernel module.
- Run the NVIDIA driver installer downloaded from the NVIDIA driver site, like NVIDIA-Linux-x86_64-570.144.run. This installs the NVIDIA driver kernel module into the kernel. If that does not work, use an older version of gcc, like "CC=gcc-14 ./NVIDIA-Linux-x86_64-570.144.run".
- Now you have a good kernel with the proper graphics driver kernel module.
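The search for a kernel without an initramfs can be scripted. The following is a minimal sketch: the function name kernels_without_initramfs is hypothetical, and it assumes the initramfs files are named initramfs-<version>.img, as in the example above.

```shell
#!/bin/sh
# kernels_without_initramfs (hypothetical helper name): print the version of
# every vmlinuz-* file in a boot directory that has no matching
# initramfs-<version>.img next to it. The directory defaults to /boot.
kernels_without_initramfs() {
    boot_dir=${1:-/boot}
    for vmlinuz in "$boot_dir"/vmlinuz-*; do
        [ -e "$vmlinuz" ] || continue      # the glob matched nothing
        kver=${vmlinuz##*/vmlinuz-}        # strip the path and the "vmlinuz-" prefix
        [ -e "$boot_dir/initramfs-$kver.img" ] || printf '%s\n' "$kver"
    done
}

# Typical use, as root, after booting into an older kernel:
#   for KVER in $(kernels_without_initramfs); do
#       depmod -v "$KVER" && dracut --force --kver "$KVER"
#   done
```

Each printed line is a candidate value for $KVER in the steps above.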
Setting up firewalld to allow nodes on the intranet to access the internet
firewall-cmd --zone=public --add-interface=<internet interface> --permanent
firewall-cmd --zone=internal --add-interface=<intranet interface as gateway> --permanent
firewall-cmd --set-default-zone=public
firewall-cmd --reload
firewall-cmd --get-default-zone
firewall-cmd --new-policy internal-public --permanent
firewall-cmd --reload
firewall-cmd --policy internal-public --add-ingress-zone=internal --permanent
firewall-cmd --policy internal-public --add-egress-zone=public --permanent
firewall-cmd --policy internal-public --set-target=ACCEPT --permanent
firewall-cmd --reload
firewall-cmd --info-policy internal-public
When upgrading the OS, dnf and rpm fail on SHA1 packages
First, use
rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}\t%{SUMMARY}\n'
to identify keys from obsolete repositories, then use
rpm -e gpg-pubkey-xxxxxxxx-yyyyyyyy
to remove the keys that were imported from the SHA1 era.
Then look up the offending packages, so that they can be removed, with
rpm -q --nosignature --querybynumber xxxx
where xxxx is the number reported in the stderr messages from
rpm -qa >/dev/null
Grub force install to boot sector
When upgrading the OS or replacing drives, the UEFI boot loader may not be installed automatically on the boot drive, so the boot menu cannot be updated to include the new kernels.
When you run grub2-install <your boot device>, it errors out with the message:
Installing for x86_64-efi platform.
grub2-install: error: This utility should not be used for EFI platforms because it does not support UEFI Secure Boot. If you really wish to proceed, invoke the --force option.
Make sure Secure Boot is disabled before proceeding.
Do not worry. Just force it,
grub2-install --force <your boot device>
then update the grub menu,
grub2-mkconfig -o /boot/grub2/grub.cfg
and it will work.
Linux ssh login super slow
The cause turned out to be a malfunctioning systemd-logind. Restarting it with
systemctl restart systemd-logind
resolved the problem.
tcsh script behavior change
Recently I noticed a tcsh behavior change.
On CentOS 7, Rocky 8, Ubuntu 20.04, Fedora 41, and Linux Mint, if you have a string variable in tcsh,
like
set a="mystring"
and attempt to get its directory part by treating it as a filename:
set b="$a:h"
it will return $a itself.
Note that the expected behavior when a="mypath/mystring" is that
set b="$a:h"
sets $b to mypath.
However, on Rocky 9, $b will be the empty string "" when $a does not contain any slash.
This behavior caused some of our tcsh scripts to malfunction.
The reason is still under investigation.
Windows 11 cannot install Asian keyboard
When I attempted to install a Chinese/Japanese/Korean input method on my Windows 11 box, installing “Basic typing” failed after trying to download for about 30 seconds, with the error
“Sorry, we’re having trouble installing this feature. You can try again later. Error code 0x0”.
All other components, such as Handwriting, Text-to-speech, and Speech recognition, also fail, and the installed Asian keyboard does not work, showing that the feature is not ready.
I tried the suggestions in [https://answers.microsoft.com/en-us/windows/forum/all/windows-11-unable-to-download-language-packs/b78b04da-2c75-45d8-a828-f553441b220f], but none of them worked.
The workaround I found is to download
26100.1.240331-1435.ge_release_amd64fre_CLIENT_LOF_PACKAGES_OEM.iso
from https://files.rg-adguard.net/file/025cfc5d-f5fa-7d00-246e-76c04a40e210
and extract the corresponding language pack .cab files like
Microsoft-Windows-Client-Language-Pack_x64_zh-cn.cab
Microsoft-Windows-LanguageFeatures-Basic-zh-cn-Package~31bf3856ad364e35~amd64~~.cab
Microsoft-Windows-LanguageFeatures-Handwriting-zh-cn-Package~31bf3856ad364e35~amd64~~.cab
Microsoft-Windows-LanguageFeatures-Speech-zh-cn-Package~31bf3856ad364e35~amd64~~.cab
Microsoft-Windows-LanguageFeatures-TextToSpeech-zh-cn-Package~31bf3856ad364e35~amd64~~.cab
and install them one by one in PowerShell with administrator privileges, like:
Add-WindowsPackage -Online -PackagePath ".\Microsoft-Windows-LanguageFeatures-Basic-zh-cn-Package~31bf3856ad364e35~amd64~~.cab"
After all of this, “Basic typing” is still not available, but the input method works.
When dnf/yum update is stuck on cleaning up…
Sometimes when you run dnf/yum update, the progress may stall for hours on the last step, cleaning up packages, if you have a very large data drive. This may be caused by an install script that mistakenly attempts to scan through millions of files on a data drive that is not mounted in a regular location. If this is the case, you can do the following:
Open another terminal and use "top" to find out which process is still working; something like texlua will show up at the top.
Then run "lsof | grep <process_name>" to find out which drive this process is scanning through.
When you find it, for example "/data/home", run "umount -l <volume_name>" (here "umount -l /data/home"), wait 10 seconds, then run "mount /data/home" to remount it. The process that was scanning the drive will then think there are no more files and quit.
This allows dnf/yum to finish without any error.
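The lsof output can be long when the process has many files open. A small filter can summarize which directory trees show up most often. This is only a sketch: the name open_dirs_summary is hypothetical, and it assumes the path is the last whitespace-separated field of each lsof line, which breaks for paths containing spaces.

```shell
#!/bin/sh
# open_dirs_summary (hypothetical helper name): read lsof output on stdin and
# count open paths by their first two directory levels, most frequent first.
# Assumes the path is the last whitespace-separated field on each line.
open_dirs_summary() {
    awk '
        $NF ~ /^\// {
            n = split($NF, part, "/")
            if (n >= 3)
                dir = "/" part[2] "/" part[3]
            else
                dir = $NF
            count[dir]++
        }
        END { for (dir in count) print count[dir], dir }
    ' | sort -rn
}

# Typical use, with texlua as the busy process found in top:
#   lsof | grep texlua | open_dirs_summary
```

The directory at the top of the list is a hint about which volume to unmount lazily and remount.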
Restarting docker fails on the docker0 network interface
Sometimes when you attempt to restart docker.service, it fails because it cannot restart the docker0 network interface.
In this case, you can simply run
ifdown docker0
Then you can start docker.service again.