Search Results: "cfm"

7 March 2024

Gunnar Wolf: Constructed truths: truth and knowledge in a post-truth world

This post is a review for Computing Reviews of Constructed truths: truth and knowledge in a post-truth world, a book published by Springer Link
Many of us grew up used to having some news sources we could implicitly trust, such as well-positioned newspapers and radio or TV news programs. We knew they would only hire responsible journalists rather than risk diluting public trust and losing their brand's value. However, with the advent of the Internet and social media, we are witnessing what has been termed the post-truth phenomenon. The undeniable freedom that horizontal communication has given us automatically brings with it the emergence of filter bubbles and echo chambers, and truth seems to become a group belief.

Contrary to my original expectations, the core topic of the book is not how current-day media brings about post-truth mindsets. Instead, it goes into a much deeper philosophical debate: What is truth? Does truth exist by itself, objectively, or is it a social construct? If activists with different political leanings debate a given subject, is it even possible for them to understand the same points for debate, or do they truly experience parallel realities? The author was clearly prompted to write this book by the unprecedented events that took place in 2020, as the COVID-19 crisis forced humanity into isolation and online communication. Donald Trump is explicitly and repeatedly presented throughout the book as an example of an actor that took advantage of the distortions caused by post-truth.

The first chapter frames the narrative from the perspective of information flow over the last several decades: how the emergence of horizontal, uncensored communication free of editorial oversight started empowering netizens and created a temporary information-flow utopia. But soon afterwards, algorithmic gatekeepers started appearing, creating a set of personalized distortions of reality; users started getting news aligned with what they had already shown interest in. This led to an increase in polarization and the growth of narrative-framing-specific communities that served as echo chambers for disjoint views on reality. That, in turn, led to the growth of conspiracy theories and, necessarily, to the science denial and pseudoscience that reached unimaginable peaks during the COVID-19 crisis. Finally, when readers decide based on completely subjective criteria whether a scientific theory such as global warming is true or propaganda, or question what most traditional news outlets present as facts, we face the phenomenon known as fake news. Fake news leads to post-truth, a state where it is impossible to distinguish between truth and falsehood, where news serves only a rhetorical function and rational discourse becomes impossible.

Toward the end of the first chapter, the writing turns away from describing developments in the spread of news and facts over the last decades and goes deep into philosophy, into the very thorny subject pursued by that discipline for millennia: How can truth be defined? Can different perspectives bring about different truth values for any given idea? Does truth depend on the observer, on their knowledge of facts, on their moral compass, or on their honest opinions? Zoglauer dives into epistemology, following various thinkers' ideas on what can be understood as truth: constructivism (whether knowledge and truth values can be learnt by an individual building from their personal experience), objectivity (whether experiences, and thus truth, are universal, or whether they are naturally individual), and whether we can proclaim something to be true when it corresponds to reality.
For the final chapter, he dives into the role information and knowledge play in assigning and understanding truth value, as well as the value of second-hand knowledge: Do we really own knowledge because we can look up facts online (even if we carefully check the sources)? Can I, without any medical training, diagnose a sickness and its treatment by honestly and carefully looking up its symptoms in medical databases? Wrapping up, while I very much enjoyed reading this book, I must confess it is completely different from what I expected. It digs much more into the abstract than into information flow in modern society, or into the impact on early-2020s politics that its editorial description suggests. At 160 pages, the book is not a heavy read, and Zoglauer's writing style is easy to follow, even across the potentially very deep topics it presents. Its main readership is not necessarily computing practitioners or academics. However, for people trying to better understand epistemology through its expressions in the modern world, it will be a very worthy read.

23 February 2024

Gunnar Wolf: 10 things software developers should learn about learning

This post is a review for Computing Reviews of 10 things software developers should learn about learning, an article published in Communications of the ACM
As software developers, we understand the detailed workings of the different components of our computer systems. And, probably because computers have been presented as digital brains since their appearance in the 1940s, we sometimes believe we can transpose that knowledge to how our biological brains work, be it as learners or as problem solvers. This article aims at making the reader understand several mechanisms related to how learning and problem solving actually work in our brains. It focuses on helping expert developers convey knowledge to new learners, as well as on learners who need to get up to speed and start coding. The article's narrative revolves around software developers, but much of what it presents can be applied to different problem domains.

The article takes on this mission through ten points, with roughly the same space given to each of them, starting with wrong assumptions many people have about the similarities between computers and our brains. The first section, Human Memory Is Not Made of Bits, explains the brain's process of remembering as a way of strengthening the force of a memory ("reconsolidation") and the role of activation in related network pathways. The second section, Human Memory Is Composed of One Limited and One Unlimited System, goes on to explain the organization of memories in the brain between long-term memory (functionally limitless, permanent storage) and working memory (storing small amounts of information used for solving a problem at hand). However, the focus soon shifts to how experience in knowledge leads to different ways of using the same concepts, the importance of going from abstract to concrete knowledge applications and back, and the role of skills repetition over time.

Toward the end of the article, the focus shifts from the mechanical act of learning to expertise. Section 6, The Internet Has Not Made Learning Obsolete, emphasizes that problem solving is not just putting together the pieces of a puzzle; searching online for solutions to a problem does not activate the neural pathways that would get fired up otherwise. The final sections tackle the differences that expertise brings into play when teaching or training a newcomer: the same tools that help the beginner's productivity as training wheels will often hamper the expert user's, as their knowledge has become automated.

The article is written with a very informal and easy-to-read tone and vocabulary, and brings forward several issues that might seem like common sense but do ring bells when it comes to my own experiences, both as a software developer and as a teacher. The article closes by suggesting several books that further expand on the issues it brings forward. While I could not identify a single focus or thesis with which to characterize this article, the several points it makes will likely help readers better understand (and bring forward to consciousness) mental processes often taken for granted, and consider often-overlooked aspects when transmitting knowledge to newcomers.

20 January 2024

Gunnar Wolf: A deep learning technique for intrusion detection system using a recurrent neural networks based framework

This post is a review for Computing Reviews of A deep learning technique for intrusion detection system using a recurrent neural networks based framework, an article published in Computer Communications
So let's assume you already know and understand that artificial intelligence's main building blocks are perceptrons, that is, mathematical models of neurons. And you know that, while a single perceptron is too limited to get interesting information from, very interesting structures (neural networks) can be built with them. You also understand that neural networks can be trained with large datasets, and that you can get them to become quite efficient and accurate classifiers for data comparable to your dataset. Finally, you are interested in applying this knowledge to defensive network security, particularly in choosing the right recurrent neural network (RNN) framework to create an intrusion detection system (IDS). Are you still with me? Good! This paper might be right for you! The paper builds on robust and well-written introduction and related work sections to arrive at explaining in detail what characterizes an RNN (the focus of this work) among the other configurations also known as neural networks, and why they are particularly suited for machine learning (ML) tasks. RNNs must be trained for each problem domain, and publicly available datasets are commonly used for such tasks. The authors present two labeled datasets representing normal and hostile network data, identified according to different criteria: NSL-KDD and UNSW-NB15. They proceed to show a framework to analyze and compare different RNNs and run them against said datasets, segmented into separate training and validation phases, compare results, and finally select the best available model for the task, measuring both training speed and classification accuracy. The paper is quite heavy reading due to both its domain-specific terminology (many acronyms are used throughout the text) and its use of mathematical notation, both to explain specific properties of each of the RNN types and to explain the preprocessing carried out for feature normalization and selection. This is partly what led me to start the first paragraph by assuming that we, as readers, already understand a large body of material if we are to fully follow the text. The paper does begin by explaining its core technologies, but quickly ramps up and might get too technical for nonexpert readers. It is undeniably an interesting and valuable read, showing the state of the art in IDS and ML-assisted technologies. It does not detail any specific technology applying its findings, but we will probably find the information conveyed here soon enough in industry publications.
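(For readers who want a concrete mental model of the workflow the paper describes — train an RNN classifier on labeled network flows, with separate training and validation splits, then compare accuracy and training time — here is a minimal sketch. It is not the paper's code: it assumes PyTorch, and the architecture, feature count and toy tensors are illustrative only.)

import torch
import torch.nn as nn

# Small GRU-based classifier over per-flow feature sequences.
class FlowRNN(nn.Module):
    def __init__(self, n_features, hidden=64, n_classes=2):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):              # x: (batch, time steps, features)
        _, h = self.rnn(x)             # h: (layers, batch, hidden)
        return self.head(h[-1])        # one logit vector per flow

# Toy tensors standing in for a preprocessed, normalized dataset split.
x_train = torch.randn(256, 10, 41)     # 256 flows, 10 steps, 41 features
y_train = torch.randint(0, 2, (256,))  # 0 = normal, 1 = hostile

model = FlowRNN(n_features=41)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(5):                 # training loop; validation omitted
    optimizer.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    optimizer.step()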

22 December 2023

Gunnar Wolf: Pushing some reviews this way

Over roughly the last year and a half I have been participating as a reviewer in ACM's Computing Reviews, and have even been honored as a Featured Reviewer. Given I have long enjoyed reading friends' reviews of their reading material (particularly, hats off to the very active Russ Allbery, who both beats all of my frequency expectations (I could never sustain the rhythm at which he reads!) and holds documented records for his >20 years as a book reader, with far more clarity and readability than I can aim for!), I decided to explicitly share my reviews via this blog, as the audience is somewhat congruent; I will also link here some reviews that were not approved for publication, clearly marking them so. I will probably work on wrangling my Jekyll site to display an (auto-)updated page and RSS feed for the reviews. In the meantime, the reviews I have published are:

21 March 2022

Gunnar Wolf: Long, long, long live Emacs after 39 years

Reading Planet Debian (see, Sam, we are still having a conversation over there?), I read Anarcat's 20+ years of Emacs. And... well, should I brag, er, contribute to the discussion? Of course, why not? Emacs is the first computer program I can name that I ever learnt to use to do something minimally useful. 39 years ago.
From the Space Cadet keyboard that (obviously) influenced Emacs' early design
The Emacs editor was born, according to Wikipedia, in 1976, the same year as myself. I am clearly not among its first users; it was already a well-established citizen when I first learnt it. I am fortunate to be the son of a Physics researcher at UNAM. My father used to take me to his institute after he noticed how I was attracted to computers; we would usually spend some hours there, between 7 and 11PM, on Friday nights. His institute had a computer room with very sweet gear: some ten Heathkit terminals quite similar to this one. The terminals were connected (via individual switches) to both a PDP-11 and a Foonly F2 computer. The room also had a beautiful thermal printer, a beautiful Tektronix vectorial graphics output terminal, and some other stuff.

The main use my father had for it was to typeset some books; he had recently (1979) published Integral Transforms in Science and Engineering (that must be my first mention in scientific literature), and I remember he was working on the proceedings of a conference he held in Oaxtepec (the account he used in the system was oax, not his usual kbw, which he lent me). He was also working on Manual de Lenguaje y Tipografía Científica en Castellano, where you can see some examples of TeX; due to a hardware crash, the book has the rare privilege of being a direct copy of the output of the thermal printer: it was not possible to produce a higher-resolution copy for several years. But it is fun and interesting to see what we were able to produce with in-house tools back in 1985! So, what could he teach me so I could use the computers while he worked? TeX, of course. No, not LaTeX (that was published in 1984). LaTeX is a set of macros developed initially by Leslie Lamport to make TeX easier to use; TeX was developed by Donald Knuth, and if I have this information correct, it was Knuth himself who installed and demonstrated TeX on the Foonly computer during a visit to UNAM.

Now, after 39 years hammering at Emacs buffers, have I grown extra fingers? Nope. I cannot even write decent elisp code, and can barely read it. I do use org-mode (a lot!) and love it; I have written basically five books, many articles, and lots of presentations and minor documents with it. But I don't read my mail or handle my git from Emacs. I could say I'm a relative newbie after almost four decades. Four decades... When we got a PC in 1986, my father got the people at the Institute to get him memacs (micro-emacs). There was probably a ten-year period when I barely used any Emacs, but I always recognized it. My fingers have memorized a dozen or so movement commands, and a similar number of file management commands. And yes, Emacs and TeX are still the main tools I use day to day.

13 April 2020

Giovanni Mascellani: DKIM for Debian Developers

What is DKIM? DKIM (DomainKeys Identified Mail), as Wikipedia puts it, "is an email authentication method designed to detect forged sender addresses in emails (email spoofing), a technique often used in phishing and email spam". More prosaically, one of the reasons email spam is so abundant is that, given a certain email message, there is no simple way to know for certain who sent it and how reputable they are. So even if people with @debian.org addresses are very nice and well-behaved, any random spammer can easily send emails from whatever@debian.org, and even if you trust people from @debian.org you cannot easily configure your antispam filter to just accept all emails from @debian.org, because spammers would get in too. For nearly ten years now, DKIM has been there to help. If you send an email from @debian.org with DKIM, it will have a header like this:
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=debian.org;
    s=vps.gio.user; t=1586779391;
    bh=B6tckJy2cynGjNRdm3lhFDrp0tD7fF8hS4x0FCfLADo=;
    h=From:Subject:To:Date:From;
    b=H4EDlATxVm7XNqPy2x7IqCchBUz1SxFtUSstB23BAsdyTKJIohM0O4RRWhrQX+pqE
     prPVhzcfNALMwlfExNE69940Q6pMCuYsoxNQjU7Jl/UX1q6PGqdVSO+mKv/aEI+N49
     vvYNgPJNLaAFnYqbWCPI8mNskLHLe2VFYjSjE4GJFOxl9o2Gpe9f5035FYPJ/hnqBF
     XPnZq7Osd9UtBrBq8agEooTCZHbNFSyiXdS0qp1ts7HAo/rfrBfbQSk39fOOQ5GbjV
     6FehkN4GAXFNoFnjfmjrVDJC6hvA8m0tJHbmZrNQS0ljG/SyffW4OTlzFzu4jOmDNi
     UHLnEgT07eucw==
The field d=debian.org is the domain this email claims to be from, and the fields bh= and b= carry a hash of the message body and a cryptographic signature, respectively, certifying this fact. How do I check that the email is actually from @debian.org? I use the selector s=vps.gio.user to fetch the public key via DNS, and then use the public key to verify the signature.
$ host -t TXT vps.gio.user._domainkey.debian.org
vps.gio.user._domainkey.debian.org descriptive text "v=DKIM1; k=rsa; s=email; h=sha256; p=" "MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAsM/W/kxtKWT58Eak0cfm/ntvurfbkkvugrG2jfvSMnHHkFyfJ34Xvn/HhQPLwX1QsjhuLV+tW+BQtxY7jxSABCee6nHQRBrpDej1t86ubw3CSrxcg1mzJI5BbL8un0cwYoBtUvhCYAZKarv1W2otCGs43L0s" "GtEqqtmYN/hIVVm4FcqeYS1cYrZxDsjPzCEocpYBhqHh1MTeUEddVmPHKZswzvllaWF0mgIXrfDNAE0LiX39aFKWtgvflrYFKiL4hCDnBcP2Mr71TVblfDY0wEdAEbGEJqHR1SxvWyn0UU1ZL4vTcylB/KJuV2gMhznOjbnQ6cjAhr2JYpweTYzz3wIDAQAB"
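(As an aside, the same lookup can be scripted. The following is only a rough sketch, assuming the third-party dnspython package; it is my own illustration, not something the official setup requires.)

import dns.resolver

def fetch_dkim_record(selector="vps.gio.user", domain="debian.org"):
    name = "%s._domainkey.%s" % (selector, domain)
    answer = dns.resolver.resolve(name, "TXT")
    # Long TXT records are returned as several character strings; join them.
    record = "".join(part.decode() for rdata in answer for part in rdata.strings)
    # Turn "v=DKIM1; k=rsa; ...; p=..." into a dictionary of tag/value pairs.
    return dict(field.strip().split("=", 1) for field in record.split(";") if "=" in field)

print(fetch_dkim_record()["p"][:40], "...")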
There it is! Debian declares in its DNS record that that key is authorized to sign outbound email from @debian.org. The spammer hopefully does not have access to Debian's DKIM keys, and so they cannot sign emails. Many large and small email services have deployed DKIM for years, while most @debian.org emails still do not use it. Why not? Because people send @debian.org emails from many different servers. Basically, every DD using their @debian.org address sends email from their own mail server, and those mail servers (fortunately) do not have access to Debian's DNS record to install their DKIM keys. Well, that was true until yesterday! :-) A few weeks ago I poked DSA asking to allow any Debian Developer to install their DKIM keys, so that DDs could use DKIM to sign their emails and hopefully reduce the amount of spam sent from @debian.org. They have done it (thank you very much, DSA, especially adsb), and now it is possible to use it! How do I configure it? I will not write a full DKIM tutorial here; there are many around. You have to use opendkim-genkey to generate a key and then configure your mail server to use opendkim to digitally sign outbound email. There are a few Debian-specific things you have to take care of, though. First, you have to choose a selector, which is a string used to distinguish many DKIM keys belonging to the same domain. Debian allows you to install a key whose selector is <something>.<uid>.user, where <uid> is your Debian uid (this is done both for namespacing reasons and for exposing who might be abusing the system). So check carefully that your selector has this form. Second, you cannot directly edit Debian's DNS record. But you can use the email-LDAP gateway on db.debian.org to install your key in a way similar to how entries in debian.net are handled (see the updated documentation). Specifically, suppose that opendkim-genkey generated the following thing for selector vps.gio.user and domain debian.org:
vps.gio.user._domainkey IN  TXT ( "v=DKIM1; h=sha256; k=rsa; "
      "p=MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAsM/W/kxtKWT58Eak0cfm/ntvurfbkkvugrG2jfvSMnHHkFyfJ34Xvn/HhQPLwX1QsjhuLV+tW+BQtxY7jxSABCee6nHQRBrpDej1t86ubw3CSrxcg1mzJI5BbL8un0cwYoBtUvhCYAZKarv1W2otCGs43L0sGtEqqtmYN/hIVVm4FcqeYS1cYrZxDsjPzCEocpYBhqHh1MTeUE"
      "ddVmPHKZswzvllaWF0mgIXrfDNAE0LiX39aFKWtgvflrYFKiL4hCDnBcP2Mr71TVblfDY0wEdAEbGEJqHR1SxvWyn0UU1ZL4vTcylB/KJuV2gMhznOjbnQ6cjAhr2JYpweTYzz3wIDAQAB" )  ; ----- DKIM key vps.gio.user for debian.org
Then you have to carefully copy the content of the p= field (without being fooled by it being split between different strings) and construct a request of the form:
dkimPubKey: vps.gio.user MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAsM/W/kxtKWT58Eak0cfm/ntvurfbkkvugrG2jfvSMnHHkFyfJ34Xvn/HhQPLwX1QsjhuLV+tW+BQtxY7jxSABCee6nHQRBrpDej1t86ubw3CSrxcg1mzJI5BbL8un0cwYoBtUvhCYAZKarv1W2otCGs43L0sGtEqqtmYN/hIVVm4FcqeYS1cYrZxDsjPzCEocpYBhqHh1MTeUEddVmPHKZswzvllaWF0mgIXrfDNAE0LiX39aFKWtgvflrYFKiL4hCDnBcP2Mr71TVblfDY0wEdAEbGEJqHR1SxvWyn0UU1ZL4vTcylB/KJuV2gMhznOjbnQ6cjAhr2JYpweTYzz3wIDAQAB
and then send it GPG-signed to changes@db.debian.org:
echo 'dkimPubKey: vps.gio.user blahblahblah' | gpg --clearsign | mail changes@db.debian.org
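(If you would rather not join the split p= strings by hand, a small script can do it. This is only a sketch under my own assumptions about the opendkim-genkey output file name; the selector is the one used throughout this post.)

import re

def dkim_pubkey_request(genkey_txt_path, selector="vps.gio.user"):
    # opendkim-genkey splits the TXT record into several quoted strings;
    # join them before looking for the p= field.
    with open(genkey_txt_path) as f:
        joined = "".join(re.findall(r'"([^"]*)"', f.read()))
    match = re.search(r'p=([A-Za-z0-9+/=]+)', joined)
    if not match:
        raise ValueError("no p= field found in %s" % genkey_txt_path)
    return "dkimPubKey: %s %s" % (selector, match.group(1))

print(dkim_pubkey_request("vps.gio.user.txt"))  # file name is an assumption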
Then use host -t TXT vps.gio.user._domainkey.debian.org to check that the key gets published (it will probably take some minutes or hours, I don't know). Once it is published, you can enable DKIM in your mail server and your email will be signed. Congratulations, you will not look like a spammer any more! You can send an email to check-auth@verifier.port25.com to check that your setup is correct. They will reply with a report, including the result of the DKIM test. Notice that currently Debian's setup only allows you to use RSA DKIM keys and doesn't allow you to set other DKIM fields (but you probably won't need to set them). EDIT: DSA made an official announcement about DKIM support, which you might want to check out as well, together with its links. EDIT 2: Now ed25519 keys are supported, the syntax for specifying keys on LDAP is a little bit more flexible and you can also insert CNAME records. See the official documentation for the updated details.

So have we solved our problems with spam? Ha, no! DKIM is only a small step. Useful, also because it enables other steps to be taken in the future, but small. In particular, DKIM enables you to say: "This particular email actually comes from @debian.org", but doesn't tell anybody what to do with emails that are not signed. A third-party mail server might wonder whether @debian.org emails are actually supposed to be signed or not. There is another standard for dealing with that, which is called DMARC, and I believe that Debian should eventually use it, but not now: the problem is that currently virtually no email from @debian.org is signed with DKIM, so if DMARC were enabled, other mail servers would start to nuke all @debian.org emails, except those which are already signed, a minority. If people and services sending emails from @debian.org start configuring DKIM on their servers, which is now possible, there will eventually come a time when DMARC can be enabled, and spammers will find themselves unable to send forged @debian.org emails. We are not there yet, but today we are a little step closer than yesterday.

Also, notice that having DKIM on @debian.org only counters spam pretending to be from @debian.org, but there is much more. The policy on what to accept is mostly independent of the policy on what you send. However, knowing that @debian.org emails have DKIM and DMARC would mean that we can set our spam filters to be more aggressive in general, but whitelist official Debian Developers and services. And the same can be done for other domains using DKIM and DMARC. Finally, notice that some incompatibilities between DKIM and mailing lists are known, and do not have a definitive answer yet. Basically, most mailing list engines modify either the body or the headers of forwarded emails, which means that DKIM does not validate any more. There are many proposed solutions, possibly none completely satisfying, but since spam is not very satisfying either, something will have to be worked out. I wrote a lot already, though, so I won't discuss this here.

6 March 2020

Reproducible Builds: Reproducible Builds in February 2020

Welcome to the February 2020 report from the Reproducible Builds project. One of the original promises of open source software is that distributed peer review and transparency of process results in enhanced end-user security. However, whilst anyone may inspect the source code of free and open source software for malicious flaws, almost all software today is distributed as pre-compiled binaries. This allows nefarious third parties to compromise systems by injecting malicious code into ostensibly secure software during the various compilation and distribution processes. The motivation behind the reproducible builds effort is to provide the ability to demonstrate these binaries originated from a particular, trusted, source release: if identical results are generated from a given source in all circumstances, reproducible builds provide the means for multiple third parties to reach a consensus on whether a build was compromised, via distributed checksum validation or some other scheme. In this month's report, we cover:
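(To make the consensus idea concrete, here is a toy illustration, not project code: independent builders publish a checksum of the artifact they produced, and anyone can compare them. The builder names and artifact contents below are made up.)

import hashlib

def checksum(artifact_bytes):
    # A byte-for-byte identical artifact always yields an identical checksum.
    return hashlib.sha256(artifact_bytes).hexdigest()

reports = {
    "builder-a": checksum(b"contents of hello_1.0-1_amd64.deb"),
    "builder-b": checksum(b"contents of hello_1.0-1_amd64.deb"),
}
consensus = len(set(reports.values())) == 1
print("build reproducible" if consensus else "build differs", reports)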

If you are interested in contributing to the project, please visit our Contribute page on our website.

Media coverage & upstream news

Omar Navarro Leija, a PhD student at the University Of Pennsylvania, published a paper entitled Reproducible Containers that describes in detail the workings of a new user-space container tool called DetTrace:
All computation that occurs inside a DetTrace container is a pure function of the initial filesystem state of the container. Reproducible containers can be used for a variety of purposes, including replication for fault-tolerance, reproducible software builds and reproducible data analytics. We use DetTrace to achieve, in an automatic fashion, reproducibility for 12,130 Debian package builds, containing over 800 million lines of code, as well as bioinformatics and machine learning workflows.
There was also considerable discussion on our mailing list regarding this research, and a presentation based on the paper will occur at the ASPLOS 2020 conference between March 16th and 20th in Lausanne, Switzerland. The many virtues of Reproducible Builds were touted as benefits for software compliance in a talk at FOSDEM 2020, debating whether the "Careful Inventory of Licensing Bill of Materials Have Impact of FOSS License Compliance", which pitted Jeff McAffer and Carol Smith against Bradley Kuhn and Max Sills (~47 minutes in). Nobuyoshi Nakada updated the canonical implementation of the Ruby programming language with a change such that filesystem globs (i.e. calls to list the contents of filesystem directories) will henceforth be sorted in ascending order. Without this change, the underlying nondeterministic ordering of the filesystem is exposed to the language, which often results in an unreproducible build; a tiny sketch of the issue appears after the Guix figures below. Vagrant Cascadian reported on our mailing list regarding a quick reproducible test for the GNU Guix distribution, which resulted in 81.9% of packages registering as reproducible in his installation:
$ guix challenge --verbose --diff=diffoscope ...
2,463 store items were analyzed:
  - 2,016 (81.9%) were identical
  - 37 (1.5%) differed
  - 410 (16.6%) were inconclusive
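(As promised above, the glob-ordering issue fixed in Ruby boils down to this: raw directory listings come back in whatever order the filesystem chooses, so reproducibility-minded tools sort them. A minimal Python analogue, purely illustrative:)

import os

entries = os.listdir(".")      # raw order depends on the filesystem and is not stable
for name in sorted(entries):   # sorting makes the traversal order deterministic
    print(name)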
Jeremiah Orians announced on our mailing list the release of a number of tools related to cross-compilation, such as M2-Planet and mescc-tools-seed. This project attempts a full bootstrap of a cross-platform compiler for the C programming language (written in C itself) from hex, the ultimate goal being to demonstrate a fully-bootstrapped compiler, from hex up to the GCC GNU Compiler Collection. This has many implications in and around Ken Thompson's Trusting Trust attack, outlined in his 1983 Turing Award lecture. Twitter user @TheYoctoJester posted an executive summary of reproducible builds in the Yocto Project. Finally, Reddit user tofflos posted to the /r/Java subreddit asking about how to achieve reproducible builds with Maven, and Chris Lamb noticed that the Linux kernel documentation about reproducible builds is available on the kernel.org homepages in an attractive HTML format.

Distribution work

Debian

Chris Lamb created a merge request for the core debian-installer package to allow all arguments and options from sources.list files (such as [check-valid-until=no], etc.) in order that we can test the reproducibility of the installer images on the Reproducible Builds project's own testing infrastructure (#13). Thorsten Glaser followed up on a bug filed against the dpkg-source component, originally filed in late 2015, which claims that the build tool does not respect permissions when unpacking tarballs if the umask is set to 0002. Matthew Garrett posted to the debian-devel mailing list on the topic of producing verifiable initramfs images, as part of a wider conversation on being able to trust the entire software stack on our computers. 59 reviews of Debian packages were added, 30 were updated and 42 were removed this month, adding to our knowledge about identified issues. Many issue types were noticed and categorised by Chris Lamb, including:

openSUSE

In openSUSE, Bernhard M. Wiedemann published his monthly Reproducible Builds status update as well as provided the following patches:

Software development

diffoscope

diffoscope is our in-depth and content-aware diff-like utility that can locate and diagnose reproducibility issues. It is run countless times a day on our testing infrastructure and is essential for identifying fixes and causes of nondeterministic behaviour. Chris Lamb made the following changes this month, including uploading version 137 to Debian:
  • The sng image utility appears to return with an exit code of 1 if there are even minor errors in the file. (#950806)
  • Also extract classes2.dex, classes3.dex from .apk files extracted by apktool. (#88)
  • No need to use str.format if we are just returning the string. [ ]
  • Add generalised support for ignoring returncodes [ ] and move special-casing of returncodes in zip to use Command.VALID_RETURNCODES. [ ]

Other tools

disorderfs is our FUSE-based filesystem that deliberately introduces non-determinism into directory system calls in order to flush out reproducibility issues. This month, Vagrant Cascadian updated the Vcs-Git to specify the debian packaging branch. [ ] reprotest is our end-user tool that builds the same source code twice in widely differing environments and then checks the binaries produced by each build for any differences. This month, versions 0.7.13 and 0.7.14 were uploaded to Debian unstable by Holger Levsen after Vagrant Cascadian added support for GNU Guix [ ].

Project documentation & website

There was more work performed on our documentation and website this month. Bernhard M. Wiedemann added a Java Gradle Build Tool snippet to the SOURCE_DATE_EPOCH documentation [ ] and normalised various terms to "unreproducible" [ ]. Chris Lamb added a Meson.build example [ ] and improved the CMake documentation [ ] on the SOURCE_DATE_EPOCH page, replaced "anyone can" with "anyone may" as, well, not everyone has the resources, skills, time or funding to actually do what it refers to [ ], and improved the pre-processing for our report generation [ ][ ][ ][ ] etc. In addition, Holger Levsen updated our news page to improve the list of reports [ ], added an explicit mention of the weekly news time span [ ] and reverted sorting of news entries to have the latest on top [ ]; Mattia Rizzolo added Codethink as a non-fiscal sponsor [ ]; and lastly Tianon Gravi added a Docker Images link underneath the Debian project on our Projects page [ ].
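(For readers unfamiliar with SOURCE_DATE_EPOCH, the convention the snippets above document is simply this: when the environment variable is set, a build tool uses it instead of the current time, so embedded timestamps become reproducible. A minimal illustration, not taken from the documentation itself:)

import os
import time

# Use the externally supplied build date when present, the wall clock otherwise.
build_time = int(os.environ.get("SOURCE_DATE_EPOCH", time.time()))
print(time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(build_time)))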

Upstream patches

The Reproducible Builds project detects, dissects and attempts to fix as many currently-unreproducible packages as possible. We endeavour to send all of our patches upstream where appropriate. This month, we wrote a large number of such patches, including: Vagrant Cascadian submitted patches via the Debian bug tracking system targeting the packages the Civil Infrastructure Platform has identified via the CIP and CIP build depends package sets:

Testing framework

We operate a fully-featured and comprehensive Jenkins-based testing framework that powers tests.reproducible-builds.org. This month, the following changes were made by Holger Levsen: In addition, Mattia Rizzolo added an Apache web server redirect for buildinfos.debian.net [ ] and reverted the reshuffling of arm64 architecture builders [ ]. The usual build node maintenance was performed by Holger Levsen, Mattia Rizzolo [ ][ ] and Vagrant Cascadian.

Getting in touch

If you are interested in contributing to the Reproducible Builds project, please visit our Contribute page on our website. You can also get in touch with us via:

This month s report was written by Bernhard M. Wiedemann, Chris Lamb and Holger Levsen. It was subsequently reviewed by a bunch of Reproducible Builds folks on IRC and the mailing list.

27 June 2017

Benjamin Mako Hill: Learning to Code in One's Own Language

I recently published a paper with Sayamindu Dasgupta that provides evidence in support of the idea that kids can learn to code more quickly when they are programming in their own language. Millions of young people from around the world are learning to code. Often, during their learning experiences, these youth are using visual block-based programming languages like Scratch, App Inventor, and Code.org Studio. In block-based programming languages, coders manipulate visual, snap-together blocks that represent code constructs instead of the textual symbols and commands found in more traditional programming languages. The textual symbols used in nearly all non-block-based programming languages are drawn from English; consider if statements and for loops as common examples. Keywords in block-based languages, on the other hand, are often translated into different human languages. For example, depending on the language preference of the user, an identical set of computing instructions in Scratch can be represented in many different human languages:
Examples of a short piece of Scratch code shown in four different human languages: English, Italian, Norwegian Bokmål, and German.
Although my research with Sayamindu Dasgupta focuses on learning, both Sayamindu and I worked on local language technologies before coming back to academia. As a result, we were both interested in how the increasing translation of programming languages might be making it easier for non-English speaking kids to learn to code. After all, a large body of education research has shown that early-stage education is more effective when instruction is in the language that the learner speaks at home. Based on this research, we hypothesized that children learning to code with block-based programming languages translated to their mother tongues will have better learning outcomes than children using the blocks in English. We sought to test this hypothesis in Scratch, an informal learning community built around a block-based programming language. We were helped by the fact that Scratch is translated into many languages and has a large number of learners from around the world. To measure learning, we built on some of our own previous work and looked at learners' cumulative block repertoires, similar to a code vocabulary. By observing a learner's cumulative block repertoire over time, we can measure how quickly their code vocabulary is growing. Using this data, we compared the rate of growth of cumulative block repertoire between learners from non-English speaking countries using Scratch in English to learners from the same countries using Scratch in their local language. To identify non-English speakers, we considered Scratch users who reported themselves as coming from five primarily non-English speaking countries: Portugal, Italy, Brazil, Germany, and Norway. We chose these five countries because they each have one very widely spoken language that is not English and because Scratch is almost fully translated into that language. Even after controlling for a number of factors like social engagement on the Scratch website, user productivity, and time spent on projects, we found that learners from these countries who use Scratch in their local language have a higher rate of cumulative block repertoire growth than their counterparts using Scratch in English. This faster growth was despite having a lower initial block repertoire. The graph below visualizes our results for two prototypical learners who start with the same initial block repertoire: one learner who uses the English interface, and a second learner who uses their native language.
Summary of the results of our model for two prototypical individuals.
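(For concreteness, a cumulative block repertoire is simply the growing set of distinct block types a learner has used up to a given point in time. A toy illustration, not our analysis code, with made-up project data:)

projects = [                        # each set is the block types used in one project
    {"when flag clicked", "move", "say"},
    {"move", "repeat", "if"},
    {"broadcast", "if", "say"},
]

seen = set()
for i, blocks in enumerate(projects, 1):
    seen |= blocks                  # the repertoire only ever grows
    print("after project %d: repertoire size %d" % (i, len(seen)))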
Our results are in line with what theories of education have to say about learning in one's own language. Our findings also represent good news for designers of block-based programming languages who have spent considerable amounts of effort in making their programming languages translatable. It's also good news for the volunteers who have spent many hours translating blocks and user interfaces. Although we find support for our hypothesis, we should stress that our findings are both limited and incomplete. For example, because we focus on estimating the differences between Scratch learners, our comparisons are between kids who all managed to successfully use Scratch. Before Scratch was translated, kids with little working knowledge of English or the Latin script might not have been able to use Scratch at all. Because of translation, many of these children are now able to learn to code.
This blog post and the work that it describes is a collaborative project with Sayamindu Dasgupta. Sayamindu also published a very similar version of the blog post in several places. Our paper is open access and you can read it here. The paper was published in the proceedings of the ACM Learning @ Scale Conference. We also recently gave a talk about this work at the International Communication Association's annual conference. We received support and feedback from members of the Scratch team at MIT (especially Mitch Resnick and Natalie Rusk), as well as from Nathan TeBlunthuis at the University of Washington. Financial support came from the US National Science Foundation.

18 June 2017

Benjamin Mako Hill: The Community Data Science Collective Dataverse

I'm pleased to announce the Community Data Science Collective Dataverse. Our dataverse is an archival repository for datasets created by the Community Data Science Collective. The dataverse won't replace work that collective members have been doing for years to document and distribute data from our research. What we hope it will do is get our data, like our published manuscripts, into the hands of folks in the forever business. Over the past few years, the Community Data Science Collective has published several papers where an important part of the contribution is a dataset. These include: Recently, we've also begun producing replication datasets to go alongside our empirical papers. So far, this includes: In the case of each of the first group of papers, where the dataset was a part of the contribution, we uploaded code and data to a website we've created. Of course, even if we do a wonderful job of keeping these websites maintained over time, eventually our research group will cease to exist. When that happens, the data will eventually disappear as well. The text of our papers will be maintained long after we're gone in the journal or conference proceedings publishers' archival storage and in our universities' institutional archives. But what about the data? Since the data is a core part, perhaps the core part, of the contribution of these papers, the data should be archived permanently as well. Toward that end, our group has created a dataverse. Our dataverse is a repository within the Harvard Dataverse where we have been uploading archival copies of datasets over the last six months. All five of the papers described above are uploaded already. The Scratch dataset, due to access control restrictions, isn't listed on the main page, but it's online on the site. Moving forward, we'll be populating this with new datasets we create as well as replication datasets for our future empirical papers. We're currently preparing several more. The primary point of the CDSC Dataverse is not to provide you with a way to get our data, although you're certainly welcome to use it that way and it might help make some of it more discoverable. The websites we've created (like the ones for redirects and for page protection) will continue to exist and be maintained. The Dataverse is insurance for if, and when, those websites go down, to ensure that our data will still be accessible.
This post was also published on the Community Data Science Collective blog.

19 May 2017

Benjamin Mako Hill: Children's Perspectives on Critical Data Literacies

Last week, we presented a new paper that describes how children are thinking through some of the implications of new forms of data collection and analysis. The presentation was given at the ACM CHI conference in Denver last week and the paper is open access and online. Over the last couple of years, we've worked on a large project to support children in doing, and not just learning about, data science. We built a system, Scratch Community Blocks, that allows the 18 million users of the Scratch online community to write their own computer programs, in Scratch of course, to analyze data about their own learning and social interactions. An example of one of those programs, to find how many of one's followers in Scratch are not from the United States, is shown below. Last year, we deployed Scratch Community Blocks to 2,500 active Scratch users who, over a period of several months, used the system to create more than 1,600 projects. As children used the system, Samantha Hautea, a student in UW's Communication Leadership program, led a group of us in an online ethnography. We visited the projects children were creating and sharing. We followed the forums where users discussed the blocks. We read comment threads left on projects. We combined Samantha's detailed field notes with the text of comments and forum posts, with ethnographic interviews of several users, and with notes from two in-person workshops. We used a technique called grounded theory to analyze these data. What we found surprised us. We expected children to reflect on being challenged by, and hopefully overcoming, the technical parts of doing data science. Although we certainly saw this happen, what emerged much more strongly from our analysis was detailed discussion among children about the social implications of data collection and analysis. In our analysis, we grouped children's comments into five major themes that represented what we called critical data literacies. These literacies reflect things that children felt were important implications of social media data collection and analysis. First, children reflected on the way that programmatic access to data, even data that was technically public, introduced privacy concerns. One user described the ability to analyze data as "creepy", but at the same time, "very cool". Children expressed concern that programmatic access to data could lead to stalking and suggested that the system should ask for permission. Second, children recognized that data analysis requires skepticism and interpretation. For example, Scratch Community Blocks introduced a bug where the block that returned data about followers included users with disabled accounts. One user, in an interview, described to us how he managed to figure out the inconsistency:

At one point the follower blocks, it said I have slightly more followers than I do. And, that was kind of confusing when I was trying to make the project. [ ] I pulled up a second [browser] tab and compared the [data from Scratch Community Blocks and the data in my profile]. Third, children discussed the hidden assumptions and decisions that drive the construction of metrics. For example, the number of views received for each project in Scratch is counted using an algorithm that tries to minimize the impact of gaming the system (similar to, for example, YouTube). As children started to build programs with data, they started to uncover and speculate about the decisions behind metrics. For example, they guessed that the view count might only include unique views and that view counts may include users who do not have accounts on the website. Fourth, children building projects with Scratch Community Blocks realized that an algorithm driven by social data may cause certain users to be excluded. For example, a 13-year-old expressed concern that the system could be used to exclude users with few social connections, saying:

I love these new Scratch Blocks! However I did notice that they could be used to exclude new Scratchers or Scratchers with not a lot of followers by using a code: like this:
when flag clicked
if then user's followers < 300
stop all.
I do not think this a big problem as it would be easy to remove this code but I did just want to bring this to your attention in case this not what you would want the blocks to be used for.
Fifth, children were concerned about the possibility that measurement might distort the Scratch community's values. While giving feedback on the new system, a user expressed concern that by making it easier to measure and compare followers, the system could elevate popularity over creativity, collaboration, and respect as a marker of success in Scratch.

I think this was a great idea! I am just a bit worried that people will make these projects and take it the wrong way, saying that followers are the most important thing in on Scratch. Kids' conversations around Scratch Community Blocks are good news for educators who are starting to think about how to engage young learners in thinking critically about the implications of data. Although no kid using Scratch Community Blocks discussed each of the five literacies described above, the themes reflect starting points for educators designing ways to engage kids in thinking critically about data. Our work shows that if children are given opportunities to actively engage and build with social and behavioral data, they might not only learn how to do data analysis, but also reflect on its implications.

This blog-post and the work that it describes is a collaborative project by Samantha Hautea, Sayamindu Dasgupta, and Benjamin Mako Hill. We have also received support and feedback from members of the Scratch team at MIT (especially Mitch Resnick and Natalie Rusk), as well as from Hal Abelson from MIT CSAIL. Financial support came from the US National Science Foundation.

3 February 2017

Benjamin Mako Hill: New Dataset: Five Years of Longitudinal Data from Scratch

Scratch is a block-based programming language created by the Lifelong Kindergarten Group (LLK) at the MIT Media Lab. Scratch gives kids the power to use programming to create their own interactive animations and computer games. Since 2007, the online community that allows Scratch programmers to share, remix, and socialize around their projects has drawn more than 16 million users who have shared nearly 20 million projects and more than 100 million comments. It is one of the most popular ways for kids to learn programming and among the larger online communities for kids in general.
Front page of the Scratch online community (https://scratch.mit.edu) during the period covered by the dataset.
Since 2010, I have published a series of papers using quantitative data collected from the database behind the Scratch online community. As the source of data for many of my first quantitative and data scientific papers, it's not a major exaggeration to say that I have built my academic career on the dataset. I was able to do this work because I happened to be doing my master's in a research group that shared a physical space ("The Cube") with LLK and because I was friends with Andrés Monroy-Hernández, who started in my master's cohort at the Media Lab. A year or so after we met, Andrés conceived of the Scratch online community and created the first version for his master's thesis project. Because I was at MIT and because I knew the right people, I was able to get added to the IRB protocols and jump through the hoops necessary to get access to the database. Over the years, Andrés and I have heard over and over, in conversation and in reviews of our papers, that we were privileged to have access to such a rich dataset. More than three years ago, Andrés and I began trying to figure out how we might broaden this access. Andrés had the idea of taking advantage of the launch of Scratch 2.0 in 2013 to focus on trying to release the first five years of Scratch 1.x online community data (March 2007 through March 2012), most of the period that the codebase he had written ran the site. After more work than I have put into any single research paper or project, Andrés and I have published a data descriptor in Nature's new journal Scientific Data. This means that the data is now accessible to other researchers. The data includes five years of detailed longitudinal data organized in 32 tables, with information drawn from more than 1 million Scratch users, nearly 2 million Scratch projects, more than 10 million comments, more than 30 million visits to Scratch projects, and much more. The dataset includes metadata on user behavior as well as the full source code for every project. Alongside the data is the source code for all of the software that ran the website and that users used to create the projects, as well as the code used to produce the dataset we've released. Releasing the dataset was a complicated process. First, we had to navigate important ethical concerns about the impact that a release of any data might have on Scratch's users. Toward that end, we worked closely with the Scratch team and the ethics board at MIT to design a protocol for the release that balanced these risks with the benefit of a release. The most important feature of our approach in this regard is that the dataset we're releasing is limited to only public data. Although the data is public, we understand that computational access to data is different in important ways from access via a browser or API. As a result, we're requiring anybody interested in the data to tell us who they are and agree to a detailed usage agreement. The Scratch team will vet these applicants. Although we're worried that this creates a barrier to access, we think this approach strikes a reasonable balance. Beyond the social and ethical issues, creating the dataset was an enormous task. Andrés and I spent Sunday afternoons over much of the last three years going column-by-column through the MySQL database that ran Scratch. We looked through the source code and the version control system to figure out how the data was created. We spent an enormous amount of time trying to figure out which columns and rows were public.
Most of our work went into creating detailed codebooks and documentation that we hope make the process of using this data much easier for others (the data descriptor is just a brief overview of what's available). Serializing some of the larger tables took days of computer time. In this process, we had a huge amount of help from many others, including an enormous amount of time and support from Mitch Resnick, Natalie Rusk, Sayamindu Dasgupta, and Benjamin Berg at MIT, as well as from many others on the Scratch Team. We also had an enormous amount of feedback from a group of a couple dozen researchers who tested the release, as well as others who helped us work through the technical, social, and ethical challenges. The National Science Foundation funded both my work on the project and the creation of Scratch itself. Because access to data has been limited, there has been less research on Scratch than the importance of the system warrants. We hope our work will change this. We can imagine studies using the dataset by scholars in communication, computer science, education, sociology, network science, and beyond. We're hoping that by opening up this dataset to others, scholars with different interests, different questions, and in different fields can benefit in the way that Andrés and I have. I suspect that there are other careers waiting to be made with this dataset, and I'm excited by the prospect of watching those careers develop. You can find out more about the dataset, and how to apply for access, by reading the data descriptor on Nature's website.

21 October 2016

Iain R. Learmonth: PATHspider 1.0.0 released!

In today's Internet we see an increasing deployment of middleboxes. While middleboxes provide in-network functionality that is necessary to keep networks manageable and economically viable, any packet mangling, whether essential for the needed functionality or accidental as an unwanted side effect, makes it more and more difficult to deploy new protocols or extensions of existing protocols. For the evolution of the protocol stack, it is important to know which network impairments exist and potentially need to be worked around. While classical network measurement tools are often focused on absolute performance values, PATHspider performs A/B testing between two different protocols or different protocol extensions to perform controlled experiments of protocol-dependent connectivity problems as well as differential treatment. PATHspider is a framework for performing and analyzing these measurements, while the actual A/B test can be easily customized. Late on the 21st of October, we released version 1.0.0 of PATHspider and it's ready for production use (whatever that means for Internet research software). Our first real release of PATHspider was version 0.9.0, just in time for the presentation of PATHspider at the 2016 Applied Networking Research Workshop co-located with IETF 96 in Berlin earlier this year. Since that release we have made a lot of changes, and I'll talk about some of the highlights here (in no particular order):

Switch from twisted.plugin to straight.plugin

While we anticipate that some plugins may wish to use some features of Twisted, we didn't want to have Twisted as a core dependency for PATHspider. We found that straight.plugin was not just a drop-in replacement; it also simplified the way in which 3rd-party plugins could be developed, and it was worth the effort for that alone.

Library functions for the Observer

PATHspider has an embedded flow-meter (think something like NetFlow but highly customisable). We found that, even with the small selection of plugins that we had, we were duplicating code across plugins for these customisations of the flow-meter. In this release we now provide library functions for common needs, such as identifying TCP 3-way handshake completions or identifying ICMP Unreachable messages for flows.

New plugin: DSCP

We've added a new plugin for this release to detect breakage when using DiffServ code points to achieve differentiated services within a network.

Plugins are now subcommands

Using the subparsers feature of argparse, all plugins, including 3rd-party plugins, will now appear as subcommands to the PATHspider command. This makes every plugin a first-class citizen and makes PATHspider truly generalised. We have an added benefit from this, in that plugins can also ask for extra arguments that are specific to the needs of the plugin; for example, the DSCP plugin allows the user to select which code point to send for the experimental test.
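(This is only an illustration of the argparse subparsers pattern described above, not PATHspider's actual code; the plugin name and option are made up to mirror the DSCP example.)

import argparse

parser = argparse.ArgumentParser(prog="pathspider")
subparsers = parser.add_subparsers(dest="plugin", required=True)

# Each plugin registers itself as a subcommand with its own options.
dscp = subparsers.add_parser("dscp", help="DiffServ codepoint testing")
dscp.add_argument("--codepoint", type=int, default=46,
                  help="DSCP value to set for the experimental flow")

args = parser.parse_args(["dscp", "--codepoint", "10"])
print(args.plugin, args.codepoint)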

Test Suite

PATHspider now has a test suite! As the size of the PATHspider code base grows, we need to be able to make changes and have confidence that we are not breaking code that another module relies on. We have so far only achieved 54% coverage of the codebase, but we hope to improve this for the next release. We have focussed on the critical portions of data collection to ensure that all the results collected by PATHspider during experiments are valid.

DNS Resolver Utility

Back when PATHspider was known as ECNSpider, it had a utility for resolving IP addresses from the Alexa top 1 million list. This utility has now been fully integrated into PATHspider and appears as a plugin, to allow for repeated experiments against the same IP addresses even if the DNS resolver would have returned a different address.

Documentation

Documentation is definitely not my favourite activity, but it has to be done. PATHspider 1.0.0 now ships with documentation covering commandline usage, input/output formats and development of new plugins.
If you'd like to check out PATHspider, you can find the website at https://pathspider.net/. Debian packages will be appearing shortly and will find their way into stable-backports within the next 2 weeks (hopefully). Current development of PATHspider is supported by the European Union's Horizon 2020 project MAMI. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 688421. The opinions expressed and arguments employed reflect only the authors' view. The European Commission is not responsible for any use that may be made of that information.

13 September 2016

John Goerzen: Two Boys, An Airplane, Plus Hundreds of Old Computers

"Was there anything you didn't like about our trip?" Jacob's answer: "That we had to leave so soon!" That's always a good sign. When I first heard about the Vintage Computer Festival Midwest, I almost immediately got the notion that I wanted to go. Besides the TRS-80 CoCo II up in my attic, I also have fond memories of an old IBM PC with CGA monitor, a 25MHz 486, an Alpha also in my attic, and a lot of other computers along the way. I didn't really think my boys would be interested. But I mentioned it to them, and they just lit up. They remembered the Youtube videos I'd shown them of old line printers and punch card readers, and thought it would be great fun. I thought it could be a great educational experience for them too, and it was. It also turned into a trip that combined being a proud dad with so many of my other interests. Quite a fun time. IMG_20160911_061456 (Jacob modeling his new t-shirt)

Captain Jacob
Chicago being not all that close to Kansas, I planned to fly us there. If you're flying yourself, solid flight planning is always important. I had already planned out my flight using electronic tools, but I always carry paper maps with me in the cockpit for backup. I got them out and the boys and I planned out the flight the old-fashioned way. Here's Oliver using a scale ruler (with markings for miles corresponding to the scale of the map) and Jacob doing the calculating for us. We measured the entire route and came to within one mile of the computer's calculation for each segment; those boys are precise! 20160904_175519 We figured out how much fuel we'd use, where we'd make fuel stops, etc. The day of our flight, we made it as far as Davenport, Iowa, when a chance of bad weather en route to Chicago convinced me to land there and drive the rest of the way. The boys saw that as part of the exciting adventure! Jacob is always interested in maps, and had kept wanting to use my map whenever we flew. So I dug an old Android tablet out of the attic, put Avare on it (which has aviation maps), and let him use that. He was always checking it while flying, sometimes saying this over his headset: "DING. Attention all passengers, this is Captain Jacob speaking. We are now 45 miles from St. Joseph. Our altitude is 6514 feet. Our speed is 115 knots. We will be on the ground shortly. Thank you. DING" Here he is at the Davenport airport, still busy looking at his maps: IMG_20160909_183813 Every little airport we stopped at featured adults smiling at the boys. People enjoyed watching a dad and his kids flying somewhere together. Oliver kept busy too. He loves to help me with my pre-flight inspections. He will report every little thing to me: a scratch, a fleck of paint missing on a wheel cover, etc. He takes it seriously. Both boys love to help get the plane ready or put it away.

The Computers
Jacob quickly gravitated towards a few interesting things. He sat for about half an hour watching this old Commodore plotter do its thing (click for video): VID_20160910_142044 His other favorite thing was the phones. Several people had brought complete analog PBXs with them. They used them to demonstrate various old phone-related hardware; one had several BBSs running with actual modems, another had old answering machines and home-security devices. Jacob learned a lot about phones, including how to operate a rotary-dial phone, which he'd never used before! IMG_20160910_151431 Oliver was drawn more to the old computers.
He was fascinated by the IBM PC XT, which I explained was just about like a model I used to get to use sometimes. They learned about floppy disks and how computers store information. IMG_20160910_195145 He hadn't used joysticks much, and found Pong ("this is a soccer game!") interesting. Somebody had also replaced the guts of a TRS-80 with a Raspberry Pi running a SNES emulator; this had thoroughly confused me for a little while, and excited Oliver. Jacob enjoyed an old TRS-80, which, through a modern Ethernet interface and a little computation help in AWS, provided an interface to Wikipedia. Jacob figured out the text-mode interface quickly. Here he is reading up on trains. IMG_20160910_140524 I had no idea that Commodore made a lot of adding machines and calculators before they got into the home computer business. There was a vast table with that older Commodore hardware, too much to get in a single photo. But some of the adding machines had their covers off, so the boys got to see all the little gears and wheels and learn how an adding machine can do its printing. IMG_20160910_145911 And then we get to my favorite: the big iron. Here is a VAX, a working VAX. When you have a computer that huge, it's easier for the kids to understand just what something is. IMG_20160910_125451 When we encountered the table from the Glenside Color Computer Club, featuring the good old CoCo IIs like what I used as a kid (and have up in my attic), I pointed out to the boys that we have a computer just like this that can do these things, and they responded "wow!" I think they are eager to try out floppy disks and disk BASIC now. Some of my favorites were the old Unix systems, which are a direct ancestor of what I've been working with for decades now. Here's AT&T System V release 3 running on its original hardware: IMG_20160910_144923 And there were a couple of Sun workstations there, making me nostalgic for my college days. If memory serves, this one is actually running on m68k in the pre-Sparc days: IMG_20160910_153418

Returning home
After all the excitement of the weekend, both boys zonked out for a while on the flight back home. Here's Jacob, sleeping with his maps still up. IMG_20160911_132952 As we were nearly home, we hit a pocket of turbulence, the kind that feels as if the plane is dropping a bit (it's perfectly normal and safe; you've probably felt that on commercial flights too). I was a bit concerned about Oliver; he is known to get motion sick in cars (and even planes sometimes). But what did I hear from Oliver? "Whee! That was fun! It felt like a roller coaster! Do it again, dad!"

4 July 2016

Benjamin Mako Hill: Studying the relationship between remixing & learning

With more than 10 million users, the Scratch online community is the largest online community where kids learn to program. Since it was created, a central goal of the community has been to promote remixing: the reworking and recombination of existing creative artifacts. As the video above shows, remixing programming projects in the current web-based version of Scratch is as easy as clicking on the "see inside" button on a project web page and then clicking on the "remix" button in the web-based code editor. Today, close to 30% of projects on Scratch are remixes. Remixing plays such a central role in Scratch because its designers believed that remixing can play an important role in learning. After all, Scratch was designed first and foremost as a learning community, with its roots in the Constructionist framework developed at MIT by Seymour Papert and his colleagues. The design of the Scratch online community was inspired by Papert's vision of a learning community similar to Brazilian Samba schools (Henry Jenkins writes about his experience of Samba schools in the context of Papert's vision here), and a comment Marvin Minsky made in 1984:
Adults worry a lot these days. Especially, they worry about how to make other people learn more about computers. They want to make us all computer-literate. Literacy means both reading and writing, but most books and courses about computers only tell you about writing programs. Worse, they only tell about commands and instructions and programming-language grammar rules. They hardly ever give examples. But real languages are more than words and grammar rules. There's also literature: what people use the language for. No one ever learns a language from being told its grammar rules. We always start with stories about things that interest us.
In a new paper titled "Remixing as a pathway to Computational Thinking", recently published at the ACM Conference on Computer Supported Collaborative Work and Social Computing (CSCW), we used a series of quantitative measures of online behavior to try to uncover evidence that might support the theory that remixing in Scratch is positively associated with learning. Of course, because Scratch is an informal environment with no set path for users, no lesson plan, and no quizzes, measuring learning is an open problem. In our study, we built on two different approaches to measure learning in Scratch. The first approach considers the number of distinct types of programming blocks available in Scratch that a user has used over her lifetime in Scratch (there are 120 in total), something that can be thought of as a block repertoire or vocabulary. This measure has been used to model informal learning in Scratch in an earlier study. Using this approach, we hypothesized that users who remix more will have a faster rate of growth for their code vocabulary. Controlling for a number of factors (e.g. age of user, the general level of activity), we found evidence of a small but positive relationship between the number of remixes a user has shared and her block vocabulary as measured by the unique blocks she used in her non-remix projects. Intriguingly, we also found a strong association between the number of downloads by a user and her vocabulary growth. One interpretation is that this learning might also be associated with less active forms of appropriation, like the process of reading source code described by Minsky.

The second approach we used considered specific concepts in programming, such as loops or event-handling. To measure this, we utilized a mapping of Scratch blocks to key programming concepts found in this paper by Karen Brennan and Mitchel Resnick. For example, in the image below are all the Scratch blocks mapped to the concept of "loop". We looked at six concepts in total (conditionals, data, events, loops, operators, and parallelism). In each case, we hypothesized that if someone had never used a given concept before, they would be more likely to use that concept after encountering it while remixing an existing project. Using this second approach, we found that users who had never used a concept were more likely to do so if they had been exposed to the concept through remixing. Although some concepts were more widely used than others, we found a positive relationship between concept use and exposure through remixing for each of the six concepts. We found that this relationship held even if we ignored obvious examples of cutting and pasting of blocks of code.

In all of these models, we found what we believe is evidence of learning through remixing. Of course, there are many limitations in this work. What we found are all positive correlations; we do not know if these relationships are causal. Moreover, our measures do not really tell us whether someone has understood the usage of a given block or programming concept. However, even with these limitations, we are excited by the results of our work, and we plan to build on what we have. Our next steps include developing and utilizing better measures of learning, as well as looking at other methods of appropriation like viewing the source code of a project.
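
The study's analysis code isn't part of this post; as a rough illustration of the first measure only, here is a small sketch of computing a user's cumulative block vocabulary over a (hypothetical) project history, counting distinct blocks used in non-remix projects:

# Hypothetical input: each project is a (timestamp, is_remix, blocks_used) record.
projects = [
    (1, False, {'when_flag_clicked', 'move', 'say'}),
    (2, True,  {'when_flag_clicked', 'repeat', 'move'}),
    (3, False, {'if', 'repeat', 'say', 'broadcast'}),
]

def vocabulary_growth(projects):
    """Return the cumulative count of distinct blocks used in non-remix projects over time."""
    seen = set()
    growth = []
    for timestamp, is_remix, blocks in sorted(projects, key=lambda p: p[0]):
        if not is_remix:  # the vocabulary measure described above uses non-remix projects
            seen |= blocks
        growth.append((timestamp, len(seen)))
    return growth

print(vocabulary_growth(projects))  # -> [(1, 3), (2, 3), (3, 6)]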

This blog post and the paper it describes are collaborative work with Sayamindu Dasgupta, Andrés Monroy-Hernández, and William Hale. The paper is released as open access so anyone can read the entire paper here. This blog post was also posted on Sayamindu Dasgupta's blog and on Medium by the MIT Media Lab.

9 June 2016

NOKUBI Takatsugu: Recurrent Convolutional Neural Networks for Text Classification

I made a simple implementation of text classification with Recurrent CNN: https://github.com/knok/rcnn-text-classification
It uses chainer, a Deep Learning framework. The model is based on the following paper:
Recurrent convolutional neural networks for text classification
Siwei Lai, Liheng Xu, Kang Liu, Jun Zhao, Chinese Academy of Sciences, China
AAAI. 2015.
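
The repository's own code is not reproduced here, but as a heavily simplified sketch of the recurrent-convolutional idea from the paper in Chainer (layer names, sizes, and the use of LSTMs for the left/right contexts are my own simplifications, not the paper's exact formulation or the linked implementation):

import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

class MiniRCNN(chainer.Chain):
    def __init__(self, n_vocab, n_embed, n_hidden, n_classes):
        super(MiniRCNN, self).__init__()
        with self.init_scope():
            self.embed = L.EmbedID(n_vocab, n_embed)
            self.fwd = L.LSTM(n_embed, n_hidden)   # left-to-right context
            self.bwd = L.LSTM(n_embed, n_hidden)   # right-to-left context
            self.proj = L.Linear(n_embed + 2 * n_hidden, n_hidden)
            self.out = L.Linear(n_hidden, n_classes)

    def __call__(self, xs):
        # xs: a plain list of word ids for one document
        self.fwd.reset_state()
        self.bwd.reset_state()
        embeds = [self.embed(np.array([w], dtype=np.int32)) for w in xs]
        left = [self.fwd(e) for e in embeds]
        right = list(reversed([self.bwd(e) for e in reversed(embeds)]))
        # Concatenate [left context, embedding, right context] per word,
        # apply a tanh projection, then max-pool over the sequence.
        feats = [F.tanh(self.proj(F.concat((l, e, r), axis=1)))
                 for l, e, r in zip(left, embeds, right)]
        pooled = F.max(F.concat([F.expand_dims(f, 0) for f in feats], axis=0), axis=0)
        return self.out(pooled)

# Usage (tiny, arbitrary sizes):
# model = MiniRCNN(n_vocab=100, n_embed=16, n_hidden=8, n_classes=4)
# logits = model([3, 14, 15, 9, 2])   # -> Variable of shape (1, 4)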

13 December 2015

Robert Edmonds: Works with Debian: Intel SSD 750, AMD FirePro W4100, Dell P2715Q

I recently installed new hardware in my primary computer running Debian unstable. The disk used for the / and /home filesystem was replaced with an Intel SSD 750 series NVM Express card. The graphics card was replaced by an AMD FirePro W4100 card, and two Dell P2715Q monitors were installed. Intel SSD 750 series NVM Express card This is an 800 GB SSD on a PCI-Express x4 card (model number SSDPEDMW800G4X1) using the relatively new NVM Express interface, which appears as a /dev/nvme* device. The stretch alpha 4 Debian installer was able to detect and install onto this device, but grub-installer 1.127 on the installer media was unable to install the boot loader. This was due to a bug recently fixed in 1.128:
grub-installer (1.128) unstable; urgency=high
  * Fix buggy /dev/nvme matching in the case statement to determine
    disc_offered_devfs (Closes: #799119). Thanks, Mario Limonciello!
 -- Cyril Brulebois <kibi@debian.org>  Thu, 03 Dec 2015 00:26:42 +0100
I was able to download and install the updated .udeb by hand in the installer environment and complete the installation. This card was installed on a Supermicro X10SAE motherboard, and the UEFI BIOS was able to boot Debian directly from the NVMe card, although I updated to the latest available BIOS firmware prior to the installation. It appears in lspci like this:
02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
(prog-if 02 [NVM Express])
    Subsystem: Intel Corporation SSD 750 Series [Add-in Card]
    Flags: bus master, fast devsel, latency 0
    Memory at f7d10000 (64-bit, non-prefetchable) [size=16K]
    Expansion ROM at f7d00000 [disabled] [size=64K]
    Capabilities: [40] Power Management version 3
    Capabilities: [50] MSI-X: Enable+ Count=32 Masked-
    Capabilities: [60] Express Endpoint, MSI 00
    Capabilities: [100] Advanced Error Reporting
    Capabilities: [150] Virtual Channel
    Capabilities: [180] Power Budgeting <?>
    Capabilities: [190] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [270] Device Serial Number 55-cd-2e-41-4c-90-a8-97
    Capabilities: [2a0] #19
    Kernel driver in use: nvme
The card itself appears very large in marketing photos, but this is a visual trick: the photographs are taken with the low-profile PCI bracket installed, rather than the standard height PCI bracket which it ships installed with. smartmontools fails to read SMART data from the drive, although it is still able to retrieve basic device information, including the temperature:
root@chase 0 :~# smartctl -d scsi -a /dev/nvme0n1
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.3.0-trunk-amd64] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor:               NVMe
Product:              INTEL SSDPEDMW80
Revision:             0135
Compliance:           SPC-4
User Capacity:        800,166,076,416 bytes [800 GB]
Logical block size:   512 bytes
Rotation Rate:        Solid State Device
Logical Unit id:      8086INTEL SSDPEDMW800G4                     1000CVCQ531500K2800EGN  
Serial number:        CVCQ531500K2800EGN
Device type:          disk
Local Time is:        Sun Dec 13 01:48:37 2015 EST
SMART support is:     Unavailable - device lacks SMART capability.
=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     31 C
Drive Trip Temperature:        85 C
Error Counter logging not supported
[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
Device does not support Self Test logging
root@chase 4 :~# 
Simple tests with cat /dev/nvme0n1 >/dev/null and iotop show that the card can read data at about 1 GB/sec, about twice as fast as the SATA-based SSD that it replaced. apt/dpkg now run about as fast on the NVMe SSD as they do on a tmpfs. Hopefully this device doesn't at some point require updated firmware, like some infamous SSDs have.

AMD FirePro W4100 graphics card
This is a graphics card capable of driving multiple DisplayPort displays at "4K" resolution and at a 60 Hz refresh rate. It has four Mini DisplayPort connectors, although I only use two of them. It was difficult to find a sensible graphics card. Most discrete graphics cards appear to be marketed towards video gamers who apparently must seek out bulky cards that occupy multiple PCI slots and have excessive cooling devices. (To take a random example, the ASUS STRIX R9 390X has three fans and brags about its "Mega Heatpipes".) AMD markets a separate line of "FirePro" graphics cards intended for professionals rather than gamers, although they appear to be based on the same GPUs as their "Radeon" video cards. The AMD FirePro W4100 is a normal half-height PCI-E card that fits into a single PCI slot and has a relatively small cooler with a single fan. It doesn't even require an auxiliary power connection and is about the same dimensions as older video cards that I've successfully used with Debian. It was difficult to determine whether the W4100 card was actually supported by an open source driver before buying it. The word "FirePro" appears nowhere on the webpage for the X.org Radeon driver, but I was able to find a "CAPE VERDE" listed as an engineering name, which appears to match the "Cape Verde" code name for the FirePro W4100 given on Wikipedia's List of AMD graphics processing units. This explains the "verde" string that appears in the firmware filenames requested by the kernel (available only in the non-free/firmware-amd-graphics package):
[drm] initializing kernel modesetting (VERDE 0x1002:0x682C 0x1002:0x2B1E).
[drm] Loading verde Microcode
radeon 0000:01:00.0: firmware: direct-loading firmware radeon/verde_pfp.bin
radeon 0000:01:00.0: firmware: direct-loading firmware radeon/verde_me.bin
radeon 0000:01:00.0: firmware: direct-loading firmware radeon/verde_ce.bin
radeon 0000:01:00.0: firmware: direct-loading firmware radeon/verde_rlc.bin
radeon 0000:01:00.0: firmware: direct-loading firmware radeon/verde_mc.bin
radeon 0000:01:00.0: firmware: direct-loading firmware radeon/verde_smc.bin
The card appears in lspci like this:
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde GL [FirePro W4100]
(prog-if 00 [VGA controller])
    Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 2b1e
    Flags: bus master, fast devsel, latency 0, IRQ 55
    Memory at e0000000 (64-bit, prefetchable) [size=256M]
    Memory at f7e00000 (64-bit, non-prefetchable) [size=256K]
    I/O ports at e000 [size=256]
    Expansion ROM at f7e40000 [disabled] [size=128K]
    Capabilities: [48] Vendor Specific Information: Len=08 <?>
    Capabilities: [50] Power Management version 3
    Capabilities: [58] Express Legacy Endpoint, MSI 00
    Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Capabilities: [150] Advanced Error Reporting
    Capabilities: [200] #15
    Capabilities: [270] #19
    Kernel driver in use: radeon
The W4100 appears to work just fine, except for a few bizarre error messages that are printed to the kernel log when the displays are woken from power saving mode:
[Sun Dec 13 00:24:41 2015] [drm:si_dpm_set_power_state [radeon]] *ERROR* si_enable_smc_cac failed
[Sun Dec 13 00:24:41 2015] [drm:si_dpm_set_power_state [radeon]] *ERROR* si_enable_smc_cac failed
[Sun Dec 13 00:24:41 2015] [drm:radeon_dp_link_train [radeon]] *ERROR* displayport link status failed
[Sun Dec 13 00:24:41 2015] [drm:radeon_dp_link_train [radeon]] *ERROR* clock recovery failed
[Sun Dec 13 00:24:41 2015] [drm:radeon_dp_link_train [radeon]] *ERROR* displayport link status failed
[Sun Dec 13 00:24:41 2015] [drm:radeon_dp_link_train [radeon]] *ERROR* clock recovery failed
[Sun Dec 13 00:24:41 2015] [drm:si_dpm_set_power_state [radeon]] *ERROR* si_enable_smc_cac failed
[Sun Dec 13 00:24:41 2015] [drm:radeon_dp_link_train [radeon]] *ERROR* displayport link status failed
[Sun Dec 13 00:24:41 2015] [drm:radeon_dp_link_train [radeon]] *ERROR* clock recovery failed
[Sun Dec 13 00:24:41 2015] [drm:radeon_dp_link_train [radeon]] *ERROR* displayport link status failed
[Sun Dec 13 00:24:41 2015] [drm:radeon_dp_link_train [radeon]] *ERROR* clock recovery failed
There don't appear to be any ill effects from these error messages, though. I have the following package versions installed:
||/ Name                          Version             Description
+++-=============================-===================-================================================
ii  firmware-amd-graphics         20151207-1          Binary firmware for AMD/ATI graphics chips
ii  linux-image-4.3.0-trunk-amd64 4.3-1~exp2          Linux 4.3 for 64-bit PCs
ii  xserver-xorg-video-radeon     1:7.6.1-1           X.Org X server -- AMD/ATI Radeon display driver
The Supermicro X10SAE motherboard has two PCI-E 3.0 slots, but they're listed as functioning in either "16/NA" or "8/8" mode, which apparently means that putting anything in the second slot (like the Intel 750 SSD, which uses an x4 link) causes the video card to run at a smaller x8 link width. This can be verified by looking at the widths reported in the "LnkCap" and "LnkSta" lines in the lspci -vv output:
root@chase 0 :~# lspci -vv -s 01:00.0 | egrep '(LnkCap|LnkSta):'
        LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
        LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
root@chase 0 :~# 
I did not notice any visible artifacts or performance degradation because of the smaller link width. The sensors utility from the lm-sensors package is capable of reporting the temperature of the GPU:
root@chase 0 :~# sensors radeon-pci-0100
radeon-pci-0100
Adapter: PCI adapter
temp1:        +55.0 C  (crit = +120.0 C, hyst = +90.0 C)
root@chase 0 :~# 
Dell P2715Q monitors
Two new 27" Dell monitors with a native resolution of 3840x2160 were attached to the new graphics card. They replaced two ten-year-old Dell 2001FP monitors with a native resolution of 1600x1200 that had experienced burn-in, providing 4.32 times as many pixels. (TV and monitor manufacturers now shamelessly refer to the 3840x2160 resolution as "4K" resolution even though neither dimension reaches 4000 pixels.) There was very little to set up beyond plugging the DisplayPort inputs on these monitors into the DisplayPort outputs on the graphics card. Most of the setup involved reconfiguring software to work with the very high resolution. X.org, for tl;dr CLOSED NOTABUG reasons, doesn't set the DPI correctly. These monitors have ~163 DPI resolution, so I added -dpi 168 to /etc/X11/xdm/Xservers. (168 is an even 1.75x multiple of 96.) Software like Google Chrome and xfce4-terminal rendered fonts and graphical elements at the right size, but other software like notion, pidgin, and virt-manager did not fully understand the high DPI. E.g., pidgin renders fonts at the correct size, but icons are too small. The default X cursor was also too small. To fix this, I installed the dmz-cursor-theme package, ran update-alternatives --config x-cursor-theme and selected /usr/share/icons/DMZ-Black/cursor.theme as the cursor theme. Overall, these displays are much brighter and more readable than the ones they replaced.
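
As a quick back-of-the-envelope check of the ~163 DPI figure above (assuming the diagonal is exactly 27 inches):

import math

# 3840x2160 pixels on a 27-inch diagonal panel
diag_px = math.hypot(3840, 2160)   # diagonal length in pixels
print(round(diag_px / 27, 1))      # ~163.2 DPI, matching the ~163 DPI mentioned above
print(96 * 1.75)                   # 168.0, the even multiple of 96 passed as -dpi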

1 January 2015

Dirk Eddelbuettel: New Year's Run 2015

Nice, crisp and pretty cold morning for the 2015 New Year's 5k. I had run this before with friends: 2003, 2005, 2006 and 2007. Needless to say, I am a little slower now: A hand-stopped 23:13 for a 7:27 min/mile pace is fine by me given the weather, lack of speedwork in the last few months, and generally reduced running---but also almost a minute slower than the last time I ran this. In case you're a fellow running geek and into this, Strava has my result nicely aggregated.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

6 September 2014

Joachim Breitner: ICFP 2014

Another on-the-journey-back blog post; this time from the Frankfurt Airport Train Station: my flight was delayed (if I had known that, I could have watched the remaining Lightning Talks), and so was my train, but despite 5 minutes of running through the airport, just not enough. And now that the free 30 Minutes of Railway Station Internet are used up, I have nothing else to do but blog... Last week I was attending ICFP 2014 in Gothenburg, followed by the Haskell Symposium and the Haskell Implementors Workshop. The justification to attend was the paper on Safe Coercions (joint work with Richard Eisenberg, Simon Peyton Jones and Stephanie Weirich), although Richard got to hold the talk, and did so quite well. So I got to leisurely attend the talks, while fighting the jet-lag that I brought from Portland. There were, as expected, quite a few interesting talks. Among them the first keynote, Kathleen Fisher on the need for formal methods in cars and toy-quadcopters and unmanned battle helicopters, which made me conclude that my Isabelle skills might eventually become relevant in practical applications. And did you know that if someone gains access to your car's electronics, they can make the seat belt pull you back hard? Stephanie Weirich's keynote (and the subsequent related talks by Jan Stolarek and Richard Eisenberg) on what a dependently typed Haskell would look like and what we could use it for was mouth-watering. I am a bit worried that Haskell will become a bit obscure for newcomers and people that simply don't want to think about types too much; on the other hand, it seems that Haskell as we know it will always stay there, just as a subset of the language. Similarly interesting were refinement types for Haskell (talks by Niki Vazou and by Eric Seidel), in the form of LiquidTypes, something that I have not paid attention to yet. It seems to be a good way to get more high assurance in Haskell code. Finally, the Haskell Implementors Workshop had a truckload of exciting developments in and around Haskell: more on GHCJS, partial type signatures, interactive type-driven development like we know it from Agda, the new Haskell module system and amazing user-defined error messages, the latter unfortunately only in Helium, at least for now. But it's not the case that I only sat and listened. During the Haskell Implementors Workshop I held a talk, "Contributing to GHC", with a live demo of me fixing a (tiny) bug in GHC, with the aim of getting more people to hack on GHC (slides, video). The main message here is that it is not that big of a deal. And despite me not actually saying much interesting in the talk, I got good feedback afterwards. So if it now actually motivates someone to contribute to GHC, I'm even happier. And then there is of course the Hallway Track. I discussed the issues with fusing a left fold (unfortunately, without a great solution). In order to tackle this problem more systematically, John Wiegley and I created the beginning of a "List Fusion Lab", i.e. a bunch of list benchmarks and the possibility to compare various implementations (e.g. with different RULES) and various compilers. With that we can hopefully better assess the effect of a change to the list functions. PS: The next train is now also delayed, so I'll likely miss my tram and arrive home even later... PPS: I really have to update my 10-year-old picture on my homepage (or redesign it completely). Quite a few people knew my name, but expected someone with shoulder-long hair...
PPPS: Haskell is really becoming mainstream: I just talked to a randomly chosen person (the boy sitting next to me in the train), and he is a Haskell enthusiast, building a structured editor for Haskell together with his brother. And all that as a 12th-grader...

10 November 2013

Luke Faraone: Why I use my bank's mobile site on my desktop

(or, cutting out bloat by using a platform where bloat won't fly)

Let me start off by saying I'm generally a huge fan of my bank, USAA. Their offerings are free of hidden fees, their phone support excellent, and the perks they provide are competitive. They don't have the best savings interest rates, but you can always find a better deal online to park money not actively in your checking account.

However, USAA's website is a behemoth. My account page took about 8 seconds to fully load, downloading 1.4MiB of content.

The "My Accounts" page you're redirected to after logging in.
It is frequently buggy; whenever I log in via Google Chrome on Ubuntu 12.04 I land on a page with a URL beginning with "https://www.usaa.com/inet/gas_bank/AccountBannerAjax" and a bunch of GET parameters like "currentaccountkey" and "accnumber" with values like "encrypted12a1f4dd1[ ]". The server returns a 200 OK, promises a Content-Length of 20, but then actually returns zero bytes. After navigating to the homepage and clicking a button, I end up getting logged in, but I wonder what percentage of their userbase are experiencing this problem?
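
To make that symptom concrete, here is a small sketch of checking whether a response body actually matches its advertised Content-Length; it uses the third-party requests library and a placeholder URL, since the real page needs an authenticated session cookie:

import requests

def check_content_length(url, cookies=None):
    """Report a mismatch between the advertised Content-Length and the actual body size."""
    r = requests.get(url, cookies=cookies, allow_redirects=False)
    advertised = int(r.headers.get('Content-Length', -1))
    actual = len(r.content)
    print(r.status_code, 'advertised:', advertised, 'actual:', actual)
    return advertised == actual

# Placeholder target; the buggy page described above requires a logged-in session.
check_content_length('https://example.com/')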

For some strange reason, I get a lot of checks. It appears that nobody else informed the banking system that it's 2013, and the easiest mechanism for people to send money without paying fees is still on paper. To its credit, USAA made remote deposit of checks available to all customers in 2006, when it was mostly an offering limited to businesses. However, it seems like they haven't updated their web workflow since then.


Using it on the web still requires using a signed Java applet (itself discouraged by CMU's CERT) that does the incredibly complex task of letting you select a file from your computer and upload it to their servers. At least, that's what I think it does, because any time I chose "Run", my browser complained a few minutes later that the tab had stopped responding. Regardless of functionality, you can accomplish almost anything their site could currently be doing with HTML5 and a third party service if they want to crop images locally.

Spinning after logging in on Android
USAA's mobile app for Android has another host of problems; I haven't been able to log into it for 2 weeks, and when I chatted with someone today I was told they were "doing some maintenance this weekend", so I should try again in a few hours once that's finished.

I googled around a bit for some way to perhaps make the applet work in Ubuntu (which admittedly is not a supported platform), and came upon a Facebook thread where a rep suggested using the mobile web site.

A breath of fresh air
I loaded it in my browser, and was amazed at how well it functioned. Obviously designed for higher-end devices (It didn't even load in one WAP emulator I tried), the mobile web interface was a refreshing breath of fresh air. It scaled well to a full-screen device (see below), loaded quickly, and gave me all the information I would have wanted out of the normal web interface.

Most notably: remember the whole "upload a check" workflow that required a buggy Java applet on the main website? We get bog-standard HTML form fields, no additional magic. There go any theories about the Java client doing some magic validation or prep of the image; here, all they're getting is the images and my session cookie.


I'm still shocked at whoever thought a My Yahoo!-style homepage was the best layout for a bank, but props to the web developers who managed to make a mobile interface that was both pretty and allowed me to work around broken functionality in their implementations on every other platform I had access to.

But why was the mobile web interface the least bloated? Easy. On the desktop, you generally have a nice pipe, or if not, the user knows it and won't be too upset if your site is just as slow as other sites similarly situated. On mobile, the user downloaded all the code already, so the only latency should be the API requests against the server, right?

On the mobile web users have come to expect relatively speedy mobile-optimised sites and there's less screen real estate to do fancy things that get in the way of content. For many sites, that's a huge improvement. Of course, it would be really nice if more banks supported open protocols for interactions (USAA has a read-only, limited-duration OFX feed), but I would settle for a better web interface.

So tl;dr: USAA, please make www.usaa.com redirect to m.usaa.com, kthxbai.

30 July 2013

Daniel Pocock: Switzerland's scariest railway (video)

Although promoted as a funicular railway, the Gelmerbahn funicular has the appearance of a giant roller coaster and all of the fun too. For all those still having nightmares about recent rail tragedies in the age of non-stop surveillance and on-line replays, the 108 degree ascent (or descent) may not be your thing. On the other hand, for those who experience the London Underground each morning, "mind the gap" takes on a whole new meaning as you see your feet hanging out the bottom of the train and the tracks descending vertically into the valley below.

Video: http://video.danielpocock.com/Gelmerbahn.webm (about 160MB, for those who prefer to download now and watch later)

At the top
At the top, there is a scenic two hour walk around Lake Gelmer; the path takes you across the top of a dam built for hydro-electric power generation. (Lake Gelmer - hydro-electric dam)

Getting there from DebConf13 and the Debian 20th birthday
For those coming to Switzerland for DebConf13 next month, the Gelmerbahn is probably a little bit too far for a day trip but makes an excellent place to stop during a 2 or 3 day tour around Switzerland. It is accessible using the all-day bus pass for the tour of 3 or 4 mountain passes, or by train to Innertkirchen and then a short ride on the bus. The busses are irregular; use sbb.ch to plan the journey. Walking down the mountain from the lake to the bus stop takes about 90 minutes, and for those who choose this option, the ticket is cheaper.

More Swiss travel blogs
Please click here to access some of my previous blogs about Swiss travel with videos and useful ideas for day trips.
