Search Results: "jff"

31 May 2021

Russell Coker: Some Ideas About Storage Reliability

Hard Drive Brands

When people ask for advice about what storage to use they often get answers like "brand X works well for me" and "brand Y had a heap of returns a few years ago". I'm not convinced there is any difference between the small number of manufacturers that are still in business. One problem we face with reliability of computer systems is that the rate of change is significant, so every year there will be new technological developments to improve things and every company will take advantage of them. Storage devices are unique among computer parts for their requirement for long-term reliability. For most other parts in a computer system a fault that involves total failure is usually easy to fix, and even a fault that causes unreliable operation usually won't spread its damage too far before being noticed (except in corner cases like RAM corruption causing corrupted data on disk). Every year each manufacturer will bring out newer disks that are bigger, cheaper, faster, or all three. Those disks will be expected to remain in service for 3 years in most cases, and for consumer disks often 5 years or more. The manufacturers can't test the new storage technology for even 3 years before releasing it, so their ability to prove its reliability is limited. Maybe you could buy some 8TB disks now that were manufactured to the same design as used 3 years ago, but if you buy 12TB consumer grade disks, 20TB+ data center disks, or any other device that is pushing the limits of new technology, then you know that the manufacturer never tested it running for as long as you plan to run it. Generally the engineering is done well and they don't have many problems in the field. Sometimes a new range of disks has a significant number of defects, but that doesn't mean the next series of disks from the same manufacturer will have problems. The issues with SSDs are similar to the issues with hard drives, but a little different.
I'm not sure how much of the improvement in SSDs recently has been due to new technology and how much is due to new manufacturing processes. I had a bad experience with a nameless brand SSD a couple of years ago and now stick to the better known brands. So for SSDs I don't expect a great quality difference between devices that have the names of major computer companies on them, but stuff that comes from China with the name of the discount web store stamped on it is always a risk.

Hard Drive vs SSD

A few years ago some people were still avoiding SSDs due to the perceived risk of new technology. The first problem with this is that hard drives have lots of new technology in them too. The next issue is that hard drives often have some sort of flash storage built in; presumably an SSHD or "Hybrid Drive" gets all the potential failures of hard drives and SSDs. One theoretical issue with SSDs is that filesystems have been (in theory at least) designed to cope with hard drive failure modes, not SSD failure modes. The problem with that theory is that most filesystems don't cope with data corruption at all. If you want to avoid losing data when a disk returns bad data and claims it to be good then you need to use ZFS, BTRFS, the NetApp WAFL filesystem, Microsoft ReFS (with the optional file data checksum feature enabled), or Hammer2 (which wasn't production ready last time I tested it). Some people are concerned that their filesystem won't support wear levelling for SSD use. When a flash storage device is exposed to the OS via a block interface like SATA there isn't much possibility of wear levelling at that level. If flash storage exposes that level of hardware detail to the OS then you need a filesystem like JFFS2 to use it. I believe that most SSDs have something like JFFS2 inside the firmware and use it to expose what looks like a regular block device. Another common concern about SSDs is that they will wear out from too many writes.
Lots of people are using SSDs for the ZIL (ZFS Intent Log) on the ZFS filesystem; that means SSD devices become the write bottleneck for the system and in some cases are run that way 24*7. If there was a problem with SSDs wearing out I would expect ZFS users to be complaining about it. Back in 2014 I wrote a blog post about whether swap would break SSD [1] (conclusion: it won't). Apart from the nameless brand SSD I mentioned previously, all of my SSDs are still in service. I have recently had a single Samsung 500G SSD give me 25 read errors (which BTRFS recovered from the other Samsung SSD in the RAID-1); I have yet to determine whether this is an ongoing issue with the SSD in question or a transient thing. I also had a 256G SSD in a Hetzner DC give 23 read errors a few months after it gave a SMART alert about Wear_Leveling_Count (old age). Hard drives have moving parts and are therefore inherently more susceptible to vibration than SSDs; they are also more likely to cause vibration related problems in other disks. I will probably write a future blog post about disks that work in small arrays but not in big arrays. My personal experience is that SSDs are at least as reliable as hard drives, even when run in situations where vibration and heat aren't issues. Vibration or a warm environment can cause data loss from hard drives in situations where SSDs will work reliably.

NVMe

I think that NVMe isn't very different from other SSDs in terms of the actual storage. But the different interface gives some interesting possibilities for data loss. OS, filesystem, and motherboard bugs are all potential causes of data loss when using a newer technology.

Future Technology

The latest thing for high end servers is Optane Persistent Memory [2], also known as DCPMM. This is NVRAM that fits in a regular DDR4 DIMM socket, gives performance somewhere between NVMe and RAM, and has capacity similar to NVMe.
One of the ways of using this is "Memory Mode", where the DCPMM is seen by the OS as RAM and the actual RAM caches the DCPMM (essentially this is swap space at the hardware level); this could make multiple terabytes of "RAM" not ridiculously expensive. Another way of using it is "App Direct Mode", where the DCPMM can either be a simulated block device for regular filesystems or a byte addressable device for application use. The final option is "Mixed Memory Mode", which has some DCPMM in Memory Mode and some in App Direct Mode. This has much potential for backups, and to make things extra exciting App Direct Mode supports RAID-0 but no other form of RAID.

Conclusion

I think that the best things to do for storage reliability are to have ECC RAM to avoid corruption before the data gets written, use reasonable quality hardware (buy stuff with a brand that someone will want to protect), and avoid new technology. New hardware and the new software needed to talk to new hardware interfaces will have bugs, and sometimes those bugs will lose data. Filesystems like BTRFS and ZFS are needed to cope with storage devices returning bad data and claiming it to be good; this is a very common failure mode. Backups are a good thing.
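Checksumming filesystems catch the "bad data claimed to be good" failure mode by storing a checksum separately from the data and verifying it on every read. A toy Python illustration of the idea (greatly simplified; not how BTRFS or ZFS actually lay data out):

```python
import hashlib

class ChecksummedStore:
    """Toy block store that keeps a checksum alongside each block,
    so silent corruption is detected on read. This is the idea behind
    BTRFS/ZFS data checksums, reduced to its essentials."""

    def __init__(self):
        self.blocks = {}

    def write(self, key, data: bytes):
        self.blocks[key] = (data, hashlib.sha256(data).hexdigest())

    def read(self, key) -> bytes:
        data, csum = self.blocks[key]
        if hashlib.sha256(data).hexdigest() != csum:
            # a plain filesystem would happily return the bad data here
            raise IOError("checksum mismatch: device returned bad data")
        return data
```

With RAID-1 on top, a mismatch on one copy lets the filesystem fall back to the other copy, which is how BTRFS recovered from the read errors described above.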

27 June 2017

Reproducible builds folks: Reproducible Builds: week 113 in Stretch cycle

Here's what happened in the Reproducible Builds effort between Sunday June 18 and Saturday June 24 2017:

Upcoming and Past events

Our next IRC meeting is scheduled for the 6th of July at 17:00 UTC with this agenda currently:
  1. Introductions
  2. Reproducible Builds Summit update
  3. NMU campaign for buster
  4. Press release: Debian is doing Reproducible Builds for Buster
  5. Reproducible Builds Branding & Logo
  6. should we become an SPI member
  7. Next meeting
  8. Any other business
On June 19th, Chris Lamb presented at LinuxCon China 2017 on Reproducible Builds. On June 23rd, Vagrant Cascadian held a Reproducible Builds question and answer session at Open Source Bridge.

Reproducible work in other projects

LEDE: firmware-utils and mtd-utils/mkfs.jffs2 now honor SOURCE_DATE_EPOCH.

Toolchain development and fixes

There was discussion on #782654 about packaging bazel for Debian. Dan Kegel wrote a patch to use ar deterministically for Homebrew, a package manager for MacOS. Dan Kegel worked on using SOURCE_DATE_EPOCH and other reproducibility fixes in fpm, a multi-platform package builder. The Fedora Haskell team disabled parallel builds to achieve reproducible builds. Bernhard M. Wiedemann submitted many patches upstream:

Packages fixed and bugs filed

Patches submitted upstream: Other patches filed in Debian:

Reviews of unreproducible packages

573 package reviews have been added, 154 have been updated and 9 have been removed in this week, adding to our knowledge about identified issues. 1 issue type has been updated:

Weekly QA work

During our reproducibility testing, FTBFS bugs have been detected and reported by:

diffoscope development

Version 83 was uploaded to unstable by Chris Lamb. It moved the changes previously uploaded to experimental into unstable, and included contributions from previous weeks. You can read about these changes in our previous weeks' posts, or view the changelog directly (raw form). We plan to maintain a backport of this and future versions in stretch-backports. Ximin Luo also worked on better html-dir output for very large diffs such as those for GCC. So far, this includes unreleased work on a PartialString data structure which will form a core part of a new and more intelligent recursive display algorithm.

strip-nondeterminism development

Version 0.035-1 was uploaded to unstable from experimental by Chris Lamb.
It included contributions from: Later in the week Mattia Rizzolo uploaded 0.035-2 with some improvements to the autopkgtest and to the general packaging. We currently don't plan to maintain a backport in stretch-backports like we did for jessie-backports. Please speak up if you think otherwise.

reproducible-website development

tests.reproducible-builds.org

Misc.

This week's edition was written by Ximin Luo, Holger Levsen, Bernhard M. Wiedemann, Mattia Rizzolo, Chris Lamb & reviewed by a bunch of Reproducible Builds folks on IRC & the mailing lists.
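Honouring SOURCE_DATE_EPOCH, as firmware-utils and mkfs.jffs2 above now do, generally means clamping any timestamp embedded in build output to the value of that environment variable. A minimal Python sketch of the convention (the function name is mine, not from any of the tools mentioned):

```python
import os

def reproducible_timestamp(file_mtime: int) -> int:
    """Clamp an embedded timestamp per the SOURCE_DATE_EPOCH
    convention: never emit a time later than the given epoch,
    so rebuilding later still yields byte-identical output."""
    sde = os.environ.get("SOURCE_DATE_EPOCH")
    if sde is None:
        return file_mtime  # variable unset: keep the real mtime
    return min(file_mtime, int(sde))
```

The clamp (rather than a plain override) preserves genuinely old timestamps while removing the build-time variation.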

15 January 2016

Ian Wienand: Australia, ipv6 and dd-wrt

It seems that other than Internode, no Australian ISP has any details at all about native IPv6 deployment. Locally I am on Optus HFC, which I believe has been sold to the NBN, who I believe have since discovered that it is not quite what they thought it was. i.e. I think they have more problems than rolling out IPv6 and I won't hold my breath. So the only other option is to use a tunnel of some sort, and it seems there is really only one option with local presence via SixXS. There are other options, notably He.net, but they do not have Australian tunnel-servers. SixXS is the only one I could find with a tunnel in Sydney. So first sign up for an account there. The process was rather painless and my tunnel was provided quickly. After getting this, I got dd-wrt configured and working on my Netgear WNDR3700 V4. Here's my terse guide, cobbled together from other bits and pieces I found. I'm presuming you have a recent dd-wrt build that includes the aiccu tool to create the tunnel, and are pretty familiar with logging into it, etc. Firstly, on dd-wrt make sure you have JFFS2 turned on for somewhere to install scripts. Go to Administration, JFFS2 Support, Internal Flash Storage, Enabled. Next, add the aiccu config file to /jffs/etc/aiccu.conf
# AICCU Configuration
# Login information
username USERNAME
password PASSWORD
# Protocol and server listed on your tunnel
protocol tic
server tic.sixxs.net
# Interface names to use
ipv6_interface sixxs
# The tunnel_id to use
# (only required when there are multiple tunnels in the list)
#tunnel_id <your tunnel id>
# Be verbose?
verbose false
# Daemonize?
daemonize true
# Require TLS?
requiretls true
# Set default route?
defaultroute true
Now you can add a script to bring up the tunnel and interface to /jffs/config/sixxs.ipup (make sure you make it executable) where you replace your tunnel address in the ip commands.
# wait until time is synced
while [ $(date +%Y) -eq 1970 ]; do
sleep 5
done
# check if aiccu is already running
if [ -n "$(ps | grep etc/aiccu | grep -v grep)" ]; then
aiccu stop
sleep 1
killall aiccu
fi
# start aiccu
sleep 3
aiccu start /jffs/etc/aiccu.conf
sleep 3
ip -6 addr add 2001:....:....:....::/64 dev br0
ip -6 route add 2001:....:....:....::/64 dev br0
sleep 5
#### BEGIN FIREWALL RULES ####
WAN_IF=sixxs
LAN_IF=br0
#flush tables
ip6tables -F
#define policy
ip6tables -P INPUT DROP
ip6tables -P FORWARD DROP
ip6tables -P OUTPUT ACCEPT
# Input to the router
# Allow all loopback traffic
ip6tables -A INPUT -i lo -j ACCEPT
#Allow unrestricted access on internal network
ip6tables -A INPUT -i $LAN_IF -j ACCEPT
#Allow traffic related to outgoing connections
ip6tables -A INPUT -i $WAN_IF -m state --state RELATED,ESTABLISHED -j ACCEPT
# for multicast ping replies from link-local addresses (these don't have an
# associated connection and would otherwise be marked INVALID)
ip6tables -A INPUT -p icmpv6 --icmpv6-type echo-reply -s fe80::/10 -j ACCEPT
# Allow some useful ICMPv6 messages
ip6tables -A INPUT -p icmpv6 --icmpv6-type destination-unreachable -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type packet-too-big -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type time-exceeded -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type parameter-problem -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type echo-request -j ACCEPT
ip6tables -A INPUT -p icmpv6 --icmpv6-type echo-reply -j ACCEPT
# Forwarding through from the internal network
# Allow unrestricted access out from the internal network
ip6tables -A FORWARD -i $LAN_IF -j ACCEPT
# Allow some useful ICMPv6 messages
ip6tables -A FORWARD -p icmpv6 --icmpv6-type destination-unreachable -j ACCEPT
ip6tables -A FORWARD -p icmpv6 --icmpv6-type packet-too-big -j ACCEPT
ip6tables -A FORWARD -p icmpv6 --icmpv6-type time-exceeded -j ACCEPT
ip6tables -A FORWARD -p icmpv6 --icmpv6-type parameter-problem -j ACCEPT
ip6tables -A FORWARD -p icmpv6 --icmpv6-type echo-request -j ACCEPT
ip6tables -A FORWARD -p icmpv6 --icmpv6-type echo-reply -j ACCEPT
#Allow traffic related to outgoing connections
ip6tables -A FORWARD -i $WAN_IF -m state --state RELATED,ESTABLISHED -j ACCEPT
Now you can reboot, or run the script, and it should bring the tunnel up; you should be correctly firewalled such that packets get out, but nobody can get in. Back in the web-interface, you can now enable IPv6 with Setup, IPV6, Enable. You leave "IPv6 Type" as Native IPv6 from ISP. Then I enabled Radvd and added a custom config in the text-box to get DNS working with Google DNS on hosts:
interface br0
{
    AdvSendAdvert on;
    prefix 2001:....:....:....::/64
    {
    };
    RDNSS 2001:4860:4860::8888 2001:4860:4860::8844
    {
    };
};
(again, replace the prefix with your own) That is pretty much it; at this point, you should have an IPv6 network and it's most likely that all your network devices will "just work" with it. I got full scores on the IPv6 test sites on a range of devices. Unfortunately, even a geographically close tunnel still really kills latency; compare these two traceroutes:
$ mtr -r -c 1 google.com
Start: Fri Jan 15 14:51:18 2016
HOST: jj                          Loss%   Snt   Last   Avg  Best  Wrst StDev
1.  -- 2001:....:....:....::      0.0%     1    1.4   1.4   1.4   1.4   0.0
2.  -- gw-163.syd-01.au.sixxs.ne  0.0%     1   12.0  12.0  12.0  12.0   0.0
3.  -- ausyd01.sixxs.net          0.0%     1   13.5  13.5  13.5  13.5   0.0
4.  -- sixxs.sydn01.occaid.net    0.0%     1   13.7  13.7  13.7  13.7   0.0
5.  -- 15169.syd.equinix.com      0.0%     1   11.5  11.5  11.5  11.5   0.0
6.  -- 2001:4860::1:0:8613        0.0%     1   14.1  14.1  14.1  14.1   0.0
7.  -- 2001:4860::8:0:79a0        0.0%     1  115.1 115.1 115.1 115.1   0.0
8.  -- 2001:4860::8:0:8877        0.0%     1  183.6 183.6 183.6 183.6   0.0
9.  -- 2001:4860::1:0:66d6        0.0%     1  196.6 196.6 196.6 196.6   0.0
10. -- 2001:4860:0:1::72d         0.0%     1  189.7 189.7 189.7 189.7   0.0
11. -- kul01s07-in-x09.1e100.net  0.0%     1  194.9 194.9 194.9 194.9   0.0
$ mtr -4 -r -c 1 google.com
Start: Fri Jan 15 14:51:46 2016
HOST: jj                          Loss%   Snt   Last   Avg  Best  Wrst StDev
1. -- gateway                    0.0%     1    1.3   1.3   1.3   1.3   0.0
2. -- 10.50.0.1                  0.0%     1   11.0  11.0  11.0  11.0   0.0
3. -- ???                       100.0     1    0.0   0.0   0.0   0.0   0.0
4. -- ???                       100.0     1    0.0   0.0   0.0   0.0   0.0
5. -- ???                       100.0     1    0.0   0.0   0.0   0.0   0.0
6. -- riv4-ge4-1.gw.optusnet.co  0.0%     1   12.1  12.1  12.1  12.1   0.0
7. -- 198.142.187.20             0.0%     1   10.4  10.4  10.4  10.4   0.0
When you watch what is actually using IPv6 (the ipvfoo plugin for Chrome is pretty cool; it shows you what requests are going where), it's mostly all just traffic to really big sites (Google/Google Analytics, Facebook, Youtube, etc) who have figured out IPv6. Since these are exactly the type of places that have made efforts to get caching as close as possible to you (Google's mirror servers are within Optus' network, for example), you're really shooting yourself in the foot by going around that with an external tunnel. The other thing is that I'm often hitting IPv6 mirrors and downloading larger things for work stuff (distro updates, git clones, image downloads, etc), which is slower and wastes someone else's bandwidth for really no benefit. So while it's pretty cool to have an IPv6 address (and a fun experiment) I think I'm going to turn it off. One positive was that after running with it for about a month, nothing has broken -- which suggests that most consumer level gear in a typical house (phones, laptops, TVs, smart-watches, etc) is either ready or ignores it gracefully. Bring on native IPv6!

8 June 2015

Lunar: Reproducible builds: week 6 in Stretch cycle

What happened in the reproducible builds effort this week:

Presentations

On May 26th, Holger Levsen presented reproducible builds in Debian at CCC Berlin for the Datengarten 52. The presentation was in German and the slides in English. Audio and video recordings are available.

Toolchain fixes

Niels Thykier fixed the experimental support for the automatic creation of debug packages in debhelper that is being tested as part of the reproducible toolchain. Lunar added to the reproducible build version of dpkg the normalization of permissions for files in control.tar. The patch has also been submitted based on the main branch. Daniel Kahn Gillmor proposed a patch to add support for externally supplying the build date to help2man. This sparked a discussion about agreeing on a common name for an environment variable to hold the date that should be used. It seems opinions are converging on using SOURCE_DATE_UTC, which would hold an ISO-8601 formatted date in UTC (e.g. 2015-06-05T01:08:20Z). Kudos to Daniel, Brendan O'Dea, and Ximin Luo for pushing this forward. Lunar proposed a patch to Tar upstream adding a --clamp-mtime option as a generic solution for timestamp variations in tarballs, which might also be useful for dpkg. The option changes the behavior of --mtime to only use the time specified if the file mtime is newer than the given time. So far, upstream is not convinced that it would make a worthwhile addition to Tar, though. Daniel Kahn Gillmor reached out to the libburnia project to ask for help on how to make ISOs created with xorriso reproducible. We should reward Thomas Schmitt with a model upstream trophy as he went through a thorough analysis of possible sources of variations and ways to improve the situation. Most of what is missing with the current version in Debian is available in the latest upstream version, but libisoburn in Debian needs help. Daniel backported the missing option for version 1.3.2-1.1.
akira submitted a new issue to Doxygen upstream regarding the timestamps added to the generated manpages.

Packages fixed

The following 49 packages became reproducible due to changes in their build dependencies: activemq-protobuf, bnfc, bridge-method-injector, commons-exec, console-data, djinn, github-backup, haskell-authenticate-oauth, haskell-authenticate, haskell-blaze-builder, haskell-blaze-textual, haskell-bloomfilter, haskell-brainfuck, haskell-hspec-discover, haskell-pretty-show, haskell-unlambda, haskell-x509-util, haskelldb-hdbc-odbc, haskelldb-hdbc-postgresql, haskelldb-hdbc-sqlite3, hasktags, hedgewars, hscolour, https-everywhere, java-comment-preprocessor, jffi, jgit, jnr-ffi, jnr-netdb, jsoup, lhs2tex, libcolor-calc-perl, libfile-changenotify-perl, libpdl-io-hdf5-perl, libsvn-notify-mirror-perl, localizer, maven-enforcer, pyotherside, python-xlrd, python-xstatic-angular-bootstrap, rt-extension-calendar, ruby-builder, ruby-em-hiredis, ruby-redcloth, shellcheck, sisu-plexus, tomcat-maven-plugin, v4l2loopback, vim-latexsuite. The following packages became reproducible after getting fixed: Some uploads fixed some reproducibility issues but not all of them: Patches submitted which did not make their way to the archive yet: Daniel Kahn Gillmor also started discussions for emacs24 and the unsorted lists in generated .el files, the recording of a PID number in lush, and the reproducibility of ISO images in grub2.

reproducible.debian.net

Notifications are now sent when the build environment for a package has changed between two builds. This is a first step before automatically building the package once more. (Holger Levsen) jenkins.debian.net was upgraded to Debian Jessie. (Holger Levsen) A new variation is now being tested: $PATH. The second build will be done with a /i/capture/the/path added. (Holger Levsen) Holger Levsen, with the help of Alexander Couzens, wrote an extra job to test the reproducibility of coreboot.
Thanks to James McCoy for helping with certificate issues. Mattia Rizzolo made some more internal improvements.

strip-nondeterminism development

Andrew Ayer released strip-nondeterminism/0.008-1. This new version fixes the gzip handler so that it now skips adding a predetermined timestamp when there was none. Holger Levsen sponsored the upload.

Documentation update

The pages about timestamps in manpages generated by Doxygen, GHC .hi files, and Jar files have been updated to reflect their status in upstream. Markus Koschany documented an easy way to prevent Doxygen from writing timestamps in HTML output.

Package reviews

83 obsolete reviews have been removed, 71 added and 48 updated this week.

Meetings

A meeting was held on 2015-06-03. Minutes and full logs are available. It was agreed to hold such a meeting every two weeks for the time being. The time of the next meeting should be announced soon.

25 January 2015

Jonathan Dowland: Frontier: First Encounters

Cobra mk. 3
Four years ago, whilst looking for something unrelated, I stumbled across Tom Morton's port of "Frontier: Elite II" for the Atari to i386/OpenGL. This took me right back to playing Frontier on my Amiga in the mid-nineties. I spent a bit of time replaying Frontier and its sequel, First Encounters, for which there exists an interesting family of community-written game engines based on a reverse-engineering of the original DOS release. I made some scrappy notes about engines, patches etc. at the time, which are on my frontier page. With the recent release of Elite: Dangerous, I thought I'd pick up where I left off in 2010 and see if I could get the Thargoid ship. I'm nowhere near yet, but I've spent some time trying to maximize income during the game's initial Soholian Fever period. My record in a JJFFE-derived engine (and winning the Wiccan Ware race during the same period) is currently 727,800. Can you do better?

24 July 2014

Craig Small: PHP uniqid() not always a unique ID

For quite some time modern versions of JFFNMS have had a problem. In large installations hosts would randomly appear as down, with the reachability interface going red. All other interface types worked, just this one. Reachability interfaces are odd, because they call fping or fping6 to do the work. The reason is that to run a ping program you need root access to a socket, and doing that is far too difficult and scary in PHP, which is what JFFNMS is written in. To capture the output of fping, the program is executed and the output captured to a temporary file. For my tiny setup this worked fine; for a lot of small setups this was also fine. For larger setups, it was not fine at all. Random failed interfaces and, most bizarrely of all, even a file seemingly disappearing. The program checked for a file to exist and then ran stat in a loop to see if data was there. The file exist check worked but the stat said file not found. At first I thought it was some odd load related problem, perhaps the filesystem not being happy and having a file there but not really there. That was, until someone said "Are these numbers supposed to be the same?" The numbers he was referring to were the filename IDs of the temporary files. They were most DEFINITELY not supposed to be the same. They were supposed to be unique. Why were they always unique for me and not for large setups? The problem is with the uniqid() function. It is basically a hex representation of the time. Large setups often have large numbers of child processes for polling devices. As the number of poller children increases, the chance that two child processes start the reachability poll at the same time and get the same uniqid increases. That's why the problem happened, but not all the time.
The stat error was another symptom of this bug; what would happen was: Who finishes first is entirely dependent on how quickly the fping returns, and that is dependent on how quickly the remote host responds to pings, so it's kind of random. A minor patch to use tempnam() instead of uniqid(), adding the interface ID into the mix for good measure (no two children will poll the same interface; the parent's scheduler makes sure of that), fixed the problem. The initial responses are that it is looking good.
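The collision is easy to reproduce in any language: a name derived only from the clock is identical for two workers that sample the clock in the same tick, while a tempnam/mkstemp-style call asks the OS for a name it guarantees is fresh. A hypothetical Python sketch of the two approaches (PHP's actual uniqid() output format differs slightly):

```python
import os
import tempfile

def time_based_name(now_us: int) -> str:
    # Roughly what uniqid() does: a hex rendering of the current time.
    # Two pollers starting in the same microsecond get the same name.
    return format(now_us, "x")

def safe_temp_file(interface_id: int) -> str:
    # tempnam/mkstemp style: the OS picks a unique name atomically;
    # mixing in the interface ID (as the patch above does) makes a
    # clash between two pollers doubly impossible.
    fd, path = tempfile.mkstemp(prefix=f"fping-{interface_id}-")
    os.close(fd)
    return path
```

Two calls to time_based_name with the same timestamp collide; two calls to safe_temp_file never return the same path.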

3 February 2014

Craig Small: jffnms 0.9.4

JFFNMS version 0.9.4 was released today, this version fixes some bugs that have recently appeared in previous versions.
Alarmed Interfaces and Events

The triggers rules editor had a problem where some of the rules clicked off the triggers would not appear or could not be edited correctly. Most of the Admin screens have the ability to sort the rows. This, unfortunately, didn't sort, but the functionality has been restored. Most users are probably unaware of this, but the database schema is first created for MySQL and is then converted for PostgreSQL. The conversion process is far from ideal and hasn't worked until this release. More testing is required for PostgreSQL support, but it should be a lot better.

26 May 2012

Craig Small: JFFNMS 0.9.3

jffnms version 0.9.3 has been released today. This is a vast improvement over the earlier 0.9.x releases and anyone using that train is strongly recommended to upgrade. So what changed? What didn't change! A nice summary would be: fixing a lot of things that were broken or needed some tweaking. A really, really big thanks to Marek for all the testing and bug reports, and also the patient "just run this and tell me what it says" tests he did too. If something wasn't right before and works now, it is quite likely it is working because Marek told me how it broke. A brief overview of what has changed: You can download the file from SourceForge at https://sourceforge.net/projects/jffnms/files/JFFNMS%20Releases/

14 April 2012

Craig Small: PHP floats and locales

I recently had a bug report in jffnms that the SLA checks were failing with bizarre calculations, things like 300% disk drive utilization and the like. Briefly, JFFNMS is written in PHP and checks values that come out of rrdtool and makes various comparisons, like "have you used more than 80% of your disk" or "have there been too many errors". The logs showed strange input variables coming in; all were integers below 10. I don't know of many 1 or 3 kB sized disk drives. What was going on? I ran a rrdtool fetch command on the relevant file and got output of something like 1,780000e+07, which for an 18GB drive seemed ok. Notice the comma: in this locale that's a decimal point... hmm. In lib/api.rrdtool.inc.php there is this line around the rrdtool_fetch area:
$value[] = ($line1[$i]=="nan")?0:(float)$line1[$i];
A quick check and I was finding that my 1,780000e+07 was coming back as 1. We had a float conversion problem. Or more specifically, PHP has a float conversion problem. I built a small check script like the following:
setlocale(LC_NUMERIC,'pl_PL.UTF-8');
$linfo = localeconv();
$pi='3,14';
print "Decimal is \"$linfo[decimal_point]\". Pi is $pi and ".(float)($pi)."\n";
print "Half is ".(1/2)."\n";
Which gave the output of:
Decimal is ",". Pi is 3,14 and 3
Half is 0,5
So PHP is saying that the decimal point is a comma and it uses it, BUT if a string comes in with a comma, it's not treated as a decimal point. Really?? Are they serious here? I tried various combinations and could not make it parse correctly. The fix was made easier for me because I know rrdtool fetch only outputs values in scientific notation. That means if there is a comma in the string, then it must be a decimal point, as it could never be a thousands separator. By using str_replace to replace any comma with a period, the code worked again and didn't even need the locale to be set correctly, or the locale of the shell where rrdtool runs to match the locale in PHP.
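The same normalization is easy to express in any language. A sketch of the approach described above in Python (the function name is mine), relying on the fact that rrdtool's scientific notation can never contain a thousands separator:

```python
def parse_rrd_value(field: str) -> float:
    """Parse one value from `rrdtool fetch` output, tolerating a
    locale that prints the decimal point as a comma. Since rrdtool
    emits scientific notation, any comma must be a decimal point."""
    if field == "nan":
        return 0.0
    return float(field.replace(",", "."))
```

This sidesteps the locale entirely, so it works whether or not the shell running rrdtool and the PHP (or Python) process agree on LC_NUMERIC.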

1 April 2012

Craig Small: JFFNMS 0.9.3 1st release candidate

I have been putting a lot of testing into jffnms lately. I have been very lucky to have had someone with the time and patience to try out various sub-versions and give me access to their results. The end result of all this testing is a much, much less buggy JFFNMS. There had been a stack of problems with caching results, for example, where status would not be updated or, even worse, the status of one device impacted on another. The poller parent scheduler had a problem too, where it would almost always sit on the first child, starving the others of work, which slowed things down. The scheduler now is a lot fairer across the children, giving a speed-up. I've heard of speed-ups of 15x for this one change alone. I also had a curious bug where if a device was set to not gather state it still did, and created events but not alerts. This meant your event table was spammed with down interface events even on interfaces you know are down and have turned state checking off for. 0.9.3 now does it the right way. The first RC is now uploaded and can be found at https://sourceforge.net/projects/jffnms/files/jffnms%20RC/ to try out. I'm a little worried that the pollers now run too fast and could overwhelm the usually crummy control stack found in network devices for parsing SNMP. I'm interested to hear how people find it.
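The fairness fix can be pictured as switching from "hand all work to the first child" to round-robin dispatch. A toy Python illustration of the difference (not the actual JFFNMS scheduler code):

```python
from itertools import cycle

def first_child_assign(children, jobs):
    # Caricature of the pre-fix behaviour: the first child
    # absorbs everything while the others sit idle.
    out = {c: [] for c in children}
    for job in jobs:
        out[children[0]].append(job)
    return out

def round_robin_assign(children, jobs):
    # Fairer behaviour: each child gets work in turn, so all
    # pollers stay busy and the whole run finishes sooner.
    out = {c: [] for c in children}
    rr = cycle(children)
    for job in jobs:
        out[next(rr)].append(job)
    return out
```

With N equally fast children and enough jobs, round-robin dispatch approaches an N-fold speed-up over the single-child case, which is consistent with the large speed-ups reported above.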

4 March 2012

Craig Small: JFFNMS 0.9.2 Released

JFFNMS version 0.9.2 was released today, both as an upstream tar.gz file and a new Debian package. This version fixes some bugs, including making sure it works with PHP 5.4. The biggest change in PHP 5.4 is that you can no longer use call-time pass-by-reference. Previously you could call a function like myfunc(&$blah); which would pass a reference to $blah and not a copy of the value. Now the function definition needs to declare what it wants, rather than the caller changing it each time.

1 November 2011

Philipp Kern: Useful Firefox extensions

Many people around me switched to Chrome or Chromium. I also used it for a bit, but I was a bit disappointed about the extensions available. To show why, here's a list of the extensions I've currently installed:
If Firefox on Android were quicker to start and faster overall, I might even use it there. But as-is it's not very useful. Sadly this also means that I can't use Firefox Sync on my phone and as I don't use Chrome on my desktop I also can't use Chrome to Phone. So I usually go and build a QR code on my laptop and read that with Android's Barcode Scanner.

Of course I'm actually using Iceweasel and I'm very grateful for Mike Hommey's efforts to track the release channel on mozilla.debian.net.

5 June 2011

Craig Small: JFFNMS 0.9.1 Released

jffnms 0.9.1 has the database extracts and updates that were missing from 0.9.0. This is the most problematic part of a project release: ensuring that the database updates correctly. Version 0.9.1 is functionally the same as 0.9.0, just with fewer bugs!

2 June 2011

Craig Small: What happens without software testing

[Image: New TurboGears logo, via Wikipedia]

Well, jffnms 0.9.0 was, well, a good example of what can go wrong without adequate testing. Having it written in PHP makes it difficult to test, because you can have entire globs of code that are completely wrong but never activated (because an if statement evaluates false, for example) and you won't get an error. It also had some database migration problems. This is the most difficult and annoying part of releasing versions of JFFNMS; there are so many things to check. I've been looking at sqlalchemy, which is part of turbogears. It's a pretty impressive setup and lets you get on with writing your application instead of mucking around with the low-level stuff. It's a bit of a steep learning curve learning Python AND sqlalchemy AND turbogears, but I've got some rudimentary code running OK and it's certainly worth it (Python erroring on unassigned variables but not forcing you to declare them is a great compromise). The best thing is that you can develop on an sqlite database but deploy using mysql or postgresql with a minor change. Python and turbogears both emphasise automatic testing. Ideally the tests should cover all the code you write; the authors even suggest you write the test first, then implement the feature. After chasing down several bugs, some of which I introduced while fixing other bugs, automatic testing would make my life a lot easier and perhaps I wouldn't dread the release cycle so much.
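The test-first style works roughly like this in Python (a generic sketch with a hypothetical function; nothing here is JFFNMS or turbogears code):

```python
import unittest

def interface_is_down(status: str) -> bool:
    """Toy function under test (hypothetical example)."""
    return status.lower() == "down"

class TestInterfaceStatus(unittest.TestCase):
    # In the test-first approach these assertions are written before
    # interface_is_down() is implemented, and fail until it works.
    def test_down(self):
        self.assertTrue(interface_is_down("Down"))

    def test_up(self):
        self.assertFalse(interface_is_down("up"))

# Run the suite programmatically rather than via unittest.main(),
# which would call sys.exit().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestInterfaceStatus)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print("tests passed:", result.wasSuccessful())
```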

21 May 2011

Craig Small: JFFNMS 0.9.0 Released

After 3 release candidates, jffnms is now at version 0.9.0. Both the web frontend and the backend (engines) have had extensive rework done to clean up and tighten the code. There should be a lot fewer warnings and errors from PHP when you set a higher error level.

4 May 2011

Craig Small: JFFNMS 3rd RC lucky?

I've just uploaded to SourceForge the third and hopefully last RC for the jffnms network management system, version 0.9.0. The reason for the delay was Easter, as well as wanting to test the engines for a long while to make sure I was not getting any orphan children or items. Previous versions had processes that never died, or if they died the parent didn't realise and didn't handle the item, permanently marking the item as "out for work". PHP 5 has much better job and process handling, and the new version takes advantage of it. It's run well on my own test setup for a week or two. You can find the RC code, or just the older releases, at https://sourceforge.net/projects/jffnms/files/
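The orphan-child problem comes down to the parent noticing when a worker exits. A minimal sketch of non-blocking reaping in Python (assumes a POSIX system; this is not the actual JFFNMS engine code):

```python
import os
import time

def reap_children(children):
    """Reap any children that have exited, without blocking, so none
    linger as zombies and their work items can be rescheduled."""
    finished = []
    for pid in list(children):
        done_pid, _status = os.waitpid(pid, os.WNOHANG)
        if done_pid == pid:        # this child has exited
            children.remove(pid)
            finished.append(pid)
    return finished

# Hypothetical usage: fork one worker that exits straight away.
pid = os.fork()
if pid == 0:
    os._exit(0)                    # child: pretend the work is done
children = [pid]
while children:                    # parent: poll until everyone is reaped
    reap_children(children)
    time.sleep(0.01)
print("all children reaped")
```

Polling with WNOHANG lets the parent keep dispatching work while still noticing dead children promptly.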

19 April 2011

Craig Small: Passwords in PHP

[Image: Category:WikiProject Cryptography participants, via Wikipedia]

Generally speaking it is a really bad idea to hold passwords in cleartext. I am actually amazed people still do this! The standard way of holding passwords, which has been around for years, is to encrypt or hash the password and store the result, called a ciphertext. There have been many ways of hashing the password, starting off with plain old crypt with no salt, then crypt with salt (a random pair of characters), through to MD5 and SHA. The thing is, each of these hashing techniques produces a ciphertext of a different length. With most languages this doesn't matter, because you know what hash you are using; it's simply the name of the function or some flag you set. PHP is different, because all of these methods use the one function called crypt, which is a little confusing because it is more than plain old crypt. Around PHP version 5.3 the developers started putting in the more complex hash algorithms, which is good, but the ciphertext has been growing. A lot of applications store this hashed password in a database, and a decision needs to be made: how big should this field be? For a long while 50 characters was enough, and this is what programs like jffnms used. Unfortunately the SHA-512 algorithm needs 98 characters, so what you get stored is half a hash. When the user enters their password, the program compares the full hash of that password to the half hash in the database, and it always fails. I've adjusted JFFNMS to use 200-character fields, which is fine for now. The problem is, who knows what the future will bring?
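The growth is easy to see from digest lengths (a Python illustration using raw hashlib hex digests rather than PHP's crypt()-format strings, so the exact numbers differ, but the trend is the same):

```python
import hashlib

password = b"secret"
for name in ("md5", "sha1", "sha256", "sha512"):
    digest = hashlib.new(name, password).hexdigest()
    print(name, len(digest))   # hex digest length in characters
```

A SHA-512 hex digest alone is 128 characters, so a 50-character column stores well under half of it.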

7 March 2011

Craig Small: JFFNMS at RC2, ncurses at 5.8

After some reports back about jffnms 0.9.0rc1 I have now updated it to rc2. Thanks to all who gave me information about how it worked in YOUR setup. I cannot be sure, but I'd say the second RC will be the last before the release itself. Sven has also given me the nod and ncurses 5.8 has migrated into unstable. We've had one report that the new version of ncurses might not play well with stfl (see #616711) but generally speaking it should work OK. Finally, congratulations to the Debian project on winning two categories at the Linux New Media Awards. It was especially good to hear the presentation by Karsten Gerloff, who is president of the Free Software Foundation Europe.

ncurses bug update: It seems that the ncurses bug is more serious and is to do with the newwin() function in the library. If you get crashes when a program starts and it is linked to ncurses 5.8 (even if it is not a Debian system), you may have this problem. It doesn't happen to all ncurses programs; the stfl example code and mutt work OK.

2 March 2011

Craig Small: JFFNMS 0.9.0 release candidate 1 out

The next version of jffnms is nearing completion and is now at Release Candidate 1. Version 0.9.0 has had a major amount of work in cleaning up and securing the code. The majority of the work has been the complete re-write of the engines that do the polling, autodiscovery and consolidation. The parent/child communication has changed, as has the way the processes are forked. On the front-end, the requirement for register_globals has finally been removed, with the code explicitly specifying and sanitising the variables it requires. This will make it easier to debug problems and make the application's web servers more secure. Finally, there is better support for High Capacity interface counters and some support for IPv6, meaning you can see how slow ipv6.google.com is from your place. JFFNMS 0.9.0rc1 is available from SourceForge at https://sourceforge.net/projects/jffnms/files/jffnms%20RC/

5 February 2011

Craig Small: JFFNMS and IPv6

[Image: ipv6-google-rrt.png]

One of the many Free Software projects I work on is JFFNMS, which is a network management system written in PHP. Given that the last IPv4 address blocks have now been allocated to APNIC, it's probably timely to look at how to manage network devices in a new IPv6 world. First you need to get the basics sorted out, and for that it is best to use the net-snmp command-line utilities to check all is well. Then it's on to what to do in JFFNMS itself. (Now fixed with proper markup, I hope.)
Special Agent 6

All devices that are capable of being managed with SNMP have an agent; it's basically the SNMP server on the router, switch or server. The first step is to make sure it can accept and send traffic using IPv6. Depending on what the device is, this can be really simple. For example, on my little Juniper router it's a matter of setting the community and access control, just like the IPv4 settings:
csmall@juniper1# set snmp community public clients 2001:db8:62:d0::/64
And then run snmpget on the command line to check
csmall@elmo$ snmpget -v 1 -c public udp6:2001:44b8:62:d0::4 system.sysObjectID.0
SNMPv2-MIB::sysObjectID.0 = OID: SNMPv2-SMI::enterprises.2636.1.1.1.2.41
We have success! For Linux systems that use net-snmp you will need to make the snmp daemon listen on UDPv6 ports as well as adjust the access control. Access control is a matter of adding com2sec6 lines to /etc/snmp/snmpd.conf; they are the same format as com2sec lines and are reasonably straightforward. Next, get the daemon to listen for requests on the IPv6 ports as well. The snmpd(8) man page says:
By default, snmpd listens for incoming SNMP requests on UDP port 161 on all IPv4 interfaces. However, it is possible to modify this behaviour by specifying one or more listening addresses as arguments to snmpd.
It wouldn't have killed them to put "a comma-separated list" somewhere, now would it? My /etc/default/snmpd file now looks like:
SNMPDOPTS='-Lf /dev/null -u snmp -g snmp -I -smux -p /var/run/snmpd.pid udp:127.0.0.1,udp6:[::1]:161'
The listening addresses at the end (udp:127.0.0.1,udp6:[::1]:161) are the changed bits. Then it's a quick check:
csmall@elmo$ snmpget -v 1 -c public udp6:[::1] system.sysObjectID.0
SNMPv2-MIB::sysObjectID.0 = OID: NET-SNMP-MIB::netSnmpAgentOIDs.10
And we have working agents!

JFFNMS Changes

OK, we have a working IPv6 network and SNMP works across it, so we can query devices. JFFNMS uses PHP, so all should be well, shouldn't it?

Database Tables

This is an almost classic problem when converting from IPv4 to IPv6. You allow 15 bytes for an IPv4 address string like 111.222.333.444, but IPv6 addresses can be much longer than that. JFFNMS has things like
  CREATE TABLE `hosts` (
    [...]
    `ip` char(20) NOT NULL default '',
A simple fix of using char(39) or varchar(39) will do the trick.

Fixing the JFFNMS Code

One of the more troubling problems with the host entries is that the values entered by a user can be a hostname, an IP address, or an IP address and port separated by a colon. There are lots of bits of code that separate out the port or address by just finding the colon, or that use non-IPv6-aware functions like gethostbyname(), and they will need to be fixed. I've got a function to check for an IPv6 address, using the inet_pton() function:
  function is_ipv6($addr)
  {
    $net_addr = @inet_pton($addr);
    if ($net_addr === FALSE || strlen($net_addr) < 16)
      return FALSE;
    return TRUE;
  }
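To show why naive colon-splitting breaks once IPv6 literals are allowed, here is a sketch of host/port parsing that copes with bracketed IPv6 addresses (a hypothetical Python helper, not JFFNMS code; port 161 is assumed as the default):

```python
def split_host_port(value, default_port=161):
    """Split user input like 'router:8161' or '[2001:db8::1]:8161'.
    Naively splitting on the first ':' breaks IPv6 literals, which
    contain colons themselves; square brackets disambiguate."""
    if value.startswith("["):                 # bracketed IPv6 literal
        host, _, rest = value[1:].partition("]")
        port = int(rest[1:]) if rest.startswith(":") else default_port
        return host, port
    if value.count(":") == 1:                 # plain host:port
        host, port = value.split(":")
        return host, int(port)
    return value, default_port                # bare hostname or IPv6
```

For example, "2001:db8::1" is treated as a bare address, while "[2001:db8::1]:8161" yields the address plus port 8161.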
PHP Functions

PHP still has the old IPv4-only functions but, unlike libc, does not have the address-family-independent replacements. The most significant absence is a replacement for gethostbyname(). There is dns_get_record(), but it is quite low-level (it won't recurse lookups if you get alias or CNAME results). It's a little hit-and-miss in PHP-land as to which functions work with IPv6 addresses and which do not; fsockopen() and file() do work with IPv6.

PHP and SNMP

To query remote devices using SNMP, JFFNMS uses the PHP SNMP functions. Unfortunately, while there is some IPv6 support in PHP, it doesn't extend to the SNMP functions. The underlying library can do it, since the net-snmp command-line tools use the same library; it is just that the small shim between the script and net-snmp that is PHP gets in the way. Looking at the PHP extension code itself you find things like
    strlcpy(hostname, a1, sizeof(hostname));
    if ((pptr = strchr(hostname, ':'))) {
        remote_port = strtol(pptr + 1, NULL, 0);
    }
which definitely needs to be fixed for IPv6. There is also some simple address lookup going on somewhere which will also need to be fixed. In short, JFFNMS won't be doing SNMP-based queries over IPv6 until php5-snmp can do it.

Reachability

JFFNMS also has a different sort of query, called a reachability type. It's essentially a ping from the server running JFFNMS out to the device being managed. It uses fping to do this work, but there is also a program called fping6. It's a simple matter of checking the address type and then selecting fping or fping6 to do the reachability work. The JFFNMS code soon to be pushed into git has this change in it now.

Anything Else?

The next stage is to find anything else that won't work with IPv6. A likely candidate is the TCP and UDP port types, as they use nmap to discover the ports and the fsockopen() function call to poll them. fsockopen() does handle IPv6 addresses if you escape them in square brackets.
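The fping/fping6 selection described in the Reachability section can be sketched like this (a hypothetical Python helper, not the actual JFFNMS code):

```python
import socket

def pinger_for(addr: str) -> str:
    """Pick the ping helper by address family: fping6 for literal
    IPv6 addresses, plain fping for everything else."""
    try:
        socket.inet_pton(socket.AF_INET6, addr)
    except OSError:            # not a parseable IPv6 literal
        return "fping"
    return "fping6"
```

Hostnames fall through to fping here; a fuller version would resolve the name first and decide by the address family of the result.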
