Search Results: "cfm"

21 March 2022

Gunnar Wolf: Long, long, long live Emacs after 39 years

Reading Planet Debian (see, Sam, we are still having a conversation over there?), I read Anarcat's 20+ years of Emacs. And... well, should I brag... contribute to the discussion? Of course, why not? Emacs is the first computer program I can name that I ever learnt to use to do something minimally useful. 39 years ago.
From the Space Cadet keyboard that (obviously) influenced Emacs' early design
The Emacs editor was born, according to Wikipedia, in 1976, the same year as myself. I am clearly not among its first users; it was already a well-established citizen when I first learnt it. I am fortunate to be the son of a Physics researcher at UNAM. My father used to take me to his institute after he noticed how I was attracted to computers; we would usually spend some hours there, between 7 and 11PM, on Friday nights. His institute had a computer room with very sweet gear: some 10 Heathkit terminals quite similar to this one. The terminals were connected (via individual switches) to both a PDP-11 and a Foonly F2 computer. The room also had a beautiful thermal printer, a beautiful Tektronix vectorial graphics output terminal, and some other stuff.

The main use for my father was to typeset some books; he had recently (1979) published Integral Transforms in Science and Engineering (that must be my first mention in scientific literature), and I remember he was working on the proceedings of a conference he held in Oaxtepec (the account he used in the system was oax, not his usual kbw, which he lent me). He was also working on Manual de Lenguaje y Tipografía Científica en Castellano, where you can see some examples of TeX; due to a hardware crash, the book has the rare privilege of being a direct copy of the output of the thermal printer: it was not possible to produce a higher-resolution copy for several years. But it is fun and interesting to see what we were able to produce with in-house tools back in 1985!

So, what could he teach me so I could use the computers while he worked? TeX, of course. No, not LaTeX (that was published in 1984). LaTeX is a set of macros developed initially by Leslie Lamport to make TeX easier to use; TeX was developed by Donald Knuth, and if I have this information correct, it was Knuth himself who installed and demonstrated TeX on the Foonly computer during a visit to UNAM.

Now, after 39 years hammering at Emacs buffers... Have I grown extra fingers? Nope. I cannot even write decent elisp code, and can barely read it. I do use org-mode (a lot!) and love it; I have written basically five books, many articles and lots of presentations and minor documents with it. But I don't read my mail or handle my git from Emacs. I could say I'm a relative newbie after almost four decades. Four decades... When we got a PC in 1986, my father got the people at the Institute to get him memacs (micro-emacs). There was probably a ten-year period when I barely used any Emacs, but I always recognized it. My fingers have memorized a dozen or so movement commands, and a similar number of file management commands. And yes, Emacs and TeX are still the main tools I use day to day.

13 April 2020

Giovanni Mascellani: DKIM for Debian Developers

What is DKIM? DKIM (DomainKeys Identified Mail), as Wikipedia puts it, "is an email authentication method designed to detect forged sender addresses in emails (email spoofing), a technique often used in phishing and email spam". More prosaically, one of the reasons email spam is so abundant is that, given a certain email message, there is no simple way to know for certain who sent it and how reputable they are. So even if people with @debian.org addresses are very nice and well-behaved, any random spammer can easily send emails from whatever@debian.org, and even if you trust people from @debian.org you cannot easily configure your antispam filter to just accept all emails from @debian.org, because spammers would get in too. For nearly ten years, DKIM has been there to help you. If you send an email from @debian.org with DKIM, it will have a header like this:
DKIM-Signature: v=1; a=rsa-sha256; c=simple/simple; d=debian.org;
    s=vps.gio.user; t=1586779391;
    bh=B6tckJy2cynGjNRdm3lhFDrp0tD7fF8hS4x0FCfLADo=;
    h=From:Subject:To:Date:From;
    b=H4EDlATxVm7XNqPy2x7IqCchBUz1SxFtUSstB23BAsdyTKJIohM0O4RRWhrQX+pqE
     prPVhzcfNALMwlfExNE69940Q6pMCuYsoxNQjU7Jl/UX1q6PGqdVSO+mKv/aEI+N49
     vvYNgPJNLaAFnYqbWCPI8mNskLHLe2VFYjSjE4GJFOxl9o2Gpe9f5035FYPJ/hnqBF
     XPnZq7Osd9UtBrBq8agEooTCZHbNFSyiXdS0qp1ts7HAo/rfrBfbQSk39fOOQ5GbjV
     6FehkN4GAXFNoFnjfmjrVDJC6hvA8m0tJHbmZrNQS0ljG/SyffW4OTlzFzu4jOmDNi
     UHLnEgT07eucw==
The field d=debian.org is the domain this email claims to be from and the fields bh= and b= are a cryptographic public key signature certifying this fact. How do I check that the email is actually from @debian.org? I use the selector s=vps.gio.user to fetch the public key via DNS, and then use the public key to verify the signature.
$ host -t TXT vps.gio.user._domainkey.debian.org
vps.gio.user._domainkey.debian.org descriptive text "v=DKIM1; k=rsa; s=email; h=sha256; p=" "MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAsM/W/kxtKWT58Eak0cfm/ntvurfbkkvugrG2jfvSMnHHkFyfJ34Xvn/HhQPLwX1QsjhuLV+tW+BQtxY7jxSABCee6nHQRBrpDej1t86ubw3CSrxcg1mzJI5BbL8un0cwYoBtUvhCYAZKarv1W2otCGs43L0s" "GtEqqtmYN/hIVVm4FcqeYS1cYrZxDsjPzCEocpYBhqHh1MTeUEddVmPHKZswzvllaWF0mgIXrfDNAE0LiX39aFKWtgvflrYFKiL4hCDnBcP2Mr71TVblfDY0wEdAEbGEJqHR1SxvWyn0UU1ZL4vTcylB/KJuV2gMhznOjbnQ6cjAhr2JYpweTYzz3wIDAQAB"
There it is! Debian declares in its DNS record that that key is authorized to sign outbound email from @debian.org. The spammer hopefully does not have access to Debian's DKIM keys, and so they cannot sign emails. Many large and small email services have already deployed DKIM for years, while most @debian.org emails still do not use it. Why not? Because people send @debian.org emails from many different servers. Basically, every DD using their @debian.org address sends email from their own mail server, and those mail servers (fortunately) do not have access to Debian's DNS record to install their DKIM keys.

Well, that was true until yesterday! :-) A few weeks ago I poked DSA asking to allow any Debian Developer to install their DKIM keys, so that DDs could use DKIM to sign their emails and hopefully reduce the amount of spam sent from @debian.org. They have done it (thank you DSA very much, especially adsb), and now it is possible to use it!

How do I configure it? I will not write a full DKIM tutorial here; there are many around. You have to use opendkim-genkey to generate a key and then configure your mail server to use opendkim to digitally sign outbound email. There are a few Debian-specific things you have to care about, though. First, you have to choose a selector, which is a string used to distinguish many DKIM keys belonging to the same domain. Debian allows you to install a key whose selector is <something>.<uid>.user, where <uid> is your Debian uid (this is done both for namespacing reasons and for exposing who might be abusing the system). So check carefully that your selector has this form. Then, you cannot directly edit Debian's DNS record, but you can use the email-LDAP gateway on db.debian.org to install your key in a way similar to how entries in debian.net are handled (see the updated documentation). Specifically, suppose that opendkim-genkey generated the following thing for selector vps.gio.user and domain debian.org:
vps.gio.user._domainkey IN  TXT ( "v=DKIM1; h=sha256; k=rsa; "
      "p=MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAsM/W/kxtKWT58Eak0cfm/ntvurfbkkvugrG2jfvSMnHHkFyfJ34Xvn/HhQPLwX1QsjhuLV+tW+BQtxY7jxSABCee6nHQRBrpDej1t86ubw3CSrxcg1mzJI5BbL8un0cwYoBtUvhCYAZKarv1W2otCGs43L0sGtEqqtmYN/hIVVm4FcqeYS1cYrZxDsjPzCEocpYBhqHh1MTeUE"
      "ddVmPHKZswzvllaWF0mgIXrfDNAE0LiX39aFKWtgvflrYFKiL4hCDnBcP2Mr71TVblfDY0wEdAEbGEJqHR1SxvWyn0UU1ZL4vTcylB/KJuV2gMhznOjbnQ6cjAhr2JYpweTYzz3wIDAQAB" )  ; ----- DKIM key vps.gio.user for debian.org
Then you have to carefully copy the content of the p= field (without being fooled by it being split between different strings) and construct a request of the form:
dkimPubKey: vps.gio.user MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAsM/W/kxtKWT58Eak0cfm/ntvurfbkkvugrG2jfvSMnHHkFyfJ34Xvn/HhQPLwX1QsjhuLV+tW+BQtxY7jxSABCee6nHQRBrpDej1t86ubw3CSrxcg1mzJI5BbL8un0cwYoBtUvhCYAZKarv1W2otCGs43L0sGtEqqtmYN/hIVVm4FcqeYS1cYrZxDsjPzCEocpYBhqHh1MTeUEddVmPHKZswzvllaWF0mgIXrfDNAE0LiX39aFKWtgvflrYFKiL4hCDnBcP2Mr71TVblfDY0wEdAEbGEJqHR1SxvWyn0UU1ZL4vTcylB/KJuV2gMhznOjbnQ6cjAhr2JYpweTYzz3wIDAQAB
and then send it GPG-signed to changes@db.debian.org:
echo 'dkimPubKey: vps.gio.user blahblahblah' | gpg --clearsign | mail changes@db.debian.org
Then use host -t TXT vps.gio.user._domainkey.debian.org to check that the key gets published (it will probably take some minutes/hours, I don't know). Once it is published, you can enable DKIM in your mail server and your email will be signed. Congratulations, you will not look like a spammer any more! You can send an email to check-auth@verifier.port25.com to check that your setup is correct. They will reply with a report, including the result of the DKIM test. Notice that currently Debian's setup only allows you to use RSA DKIM keys and doesn't allow you to set other DKIM fields (but you probably won't need to set them).

EDIT DSA made an official announcement about DKIM support, which you might want to check out as well, together with its links.

EDIT 2 Now ed25519 keys are supported, the syntax for specifying keys on LDAP is a little bit more flexible and you can also insert CNAME records. See the official documentation for the updated details.

So, have we solved our problems with spam? Ha, no! DKIM is only a small step. Useful, also because it enables other steps to be taken in the future, but small. In particular, DKIM enables you to say: "This particular email actually comes from @debian.org", but it doesn't tell anybody what to do with emails that are not signed. A third-party mail server might wonder whether @debian.org emails are actually supposed to be signed or not. There is another standard for dealing with that, called DMARC, and I believe that Debian should eventually use it, but not now: the problem is that currently virtually no email from @debian.org is signed with DKIM, so if DMARC were enabled, other mail servers would start to nuke all @debian.org emails, except those which are already signed, a minority. If people and services sending emails from @debian.org start configuring DKIM on their servers, which is now possible, a time will eventually come when DMARC can be enabled, and spammers will find themselves unable to send forged @debian.org emails. We are not there yet, but today we are a little step closer than yesterday.

Also, notice that having DKIM on @debian.org only counters spam pretending to be from @debian.org, but there is much more. The policy on what to accept is mostly independent of the policy on what you send. However, knowing that @debian.org emails have DKIM and DMARC would mean that we can set our spam filters to be more aggressive in general, but whitelist official Debian Developers and services. And the same can be done for other domains using DKIM and DMARC. Finally, notice that some incompatibilities between DKIM and mailing lists are known, and do not have a definitive answer yet. Basically, most mailing list engines modify either the body or the headers of forwarded emails, which means that the DKIM signature does not validate any more. There are many proposed solutions, possibly none completely satisfying, but since spam is not very satisfying either, something will have to be worked out. I wrote a lot already, though, so I won't discuss this here.
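For reference, here is a minimal sketch of the key-generation step using the opendkim-genkey tool mentioned above; the selector is the example value from this post and the output directory is illustrative, so adapt both to your own setup:

# Generate an RSA key pair for the example selector vps.gio.user
opendkim-genkey -D /etc/dkimkeys -d debian.org -s vps.gio.user
# This writes vps.gio.user.private (the key your mail server signs with) and
# vps.gio.user.txt (the TXT record whose p= value goes into the dkimPubKey request above).
# Once the LDAP gateway has published the record, confirm that DNS serves it:
host -t TXT vps.gio.user._domainkey.debian.org

The mail-server side (telling opendkim which key and selector to sign with) is covered by the general DKIM tutorials mentioned above.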

6 March 2020

Reproducible Builds: Reproducible Builds in February 2020

Welcome to the February 2020 report from the Reproducible Builds project. One of the original promises of open source software is that distributed peer review and transparency of process results in enhanced end-user security. However, whilst anyone may inspect the source code of free and open source software for malicious flaws, almost all software today is distributed as pre-compiled binaries. This allows nefarious third-parties to compromise systems by injecting malicious code into ostensibly secure software during the various compilation and distribution processes. The motivation behind the reproducible builds effort is to provide the ability to demonstrate that these binaries originated from a particular, trusted source release: if identical results are generated from a given source in all circumstances, reproducible builds provides the means for multiple third-parties to reach a consensus on whether a build was compromised, via distributed checksum validation or some other scheme. In this month's report, we cover the items summarised in the sections below.
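As a minimal illustration of the distributed checksum validation idea just mentioned (package and file names here are purely hypothetical), two parties can rebuild the same source independently and compare the results:

# Builder A and builder B each build the same source revision on their own machines,
# then publish the checksum of the resulting binary package:
sha256sum foo_1.0-1_amd64.deb
# If both parties publish the same hash, they can jointly vouch for that binary;
# if the hashes differ, the build is either unreproducible or has been tampered with.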

If you are interested in contributing to the project, please visit our Contribute page on our website.

Media coverage & upstream news

Omar Navarro Leija, a PhD student at the University of Pennsylvania, published a paper entitled Reproducible Containers that describes in detail the workings of a new user-space container tool called DetTrace:
All computation that occurs inside a DetTrace container is a pure function of the initial filesystem state of the container. Reproducible containers can be used for a variety of purposes, including replication for fault-tolerance, reproducible software builds and reproducible data analytics. We use DetTrace to achieve, in an automatic fashion, reproducibility for 12,130 Debian package builds, containing over 800 million lines of code, as well as bioinformatics and machine learning workflows.
There was also considerable discussion on our mailing list regarding this research, and a presentation based on the paper will occur at the ASPLOS 2020 conference between March 16th and 20th in Lausanne, Switzerland. The many virtues of Reproducible Builds were touted as benefits for software compliance in a talk at FOSDEM 2020 debating whether a Careful Inventory of Licensing Bill of Materials Has Impact on FOSS License Compliance, which pitted Jeff McAffer and Carol Smith against Bradley Kuhn and Max Sills (~47 minutes in).

Nobuyoshi Nakada updated the canonical implementation of the Ruby programming language with a change such that filesystem globs (i.e. calls to list the contents of filesystem directories) will henceforth be sorted in ascending order. Without this change, the underlying nondeterministic ordering of the filesystem is exposed to the language, which often results in an unreproducible build.

Vagrant Cascadian reported on our mailing list regarding a quick reproducible test for the GNU Guix distribution, which resulted in 81.9% of packages registering as reproducible in his installation:
$ guix challenge --verbose --diff=diffoscope ...
2,463 store items were analyzed:
  - 2,016 (81.9%) were identical
  - 37 (1.5%) differed
  - 410 (16.6%) were inconclusive
Jeremiah Orians announced on our mailing list the release of a number of tools related to cross-compilation, such as M2-Planet and mescc-tools-seed. This project attempts a full bootstrap of a cross-platform compiler for the C programming language (written in C itself) from hex, the ultimate goal being to demonstrate a fully-bootstrapped compiler from hex up to the GNU Compiler Collection (GCC). This has many implications in and around Ken Thompson's Trusting Trust attack outlined in Thompson's 1983 Turing Award Lecture. Twitter user @TheYoctoJester posted an executive summary of reproducible builds in the Yocto Project. Finally, Reddit user tofflos posted to the /r/Java subreddit asking about how to achieve reproducible builds with Maven, and Chris Lamb noticed that the Linux kernel documentation about reproducible builds is available on the kernel.org homepages in an attractive HTML format.

Distribution work

Debian

Chris Lamb created a merge request for the core debian-installer package to allow all arguments and options from sources.list files (such as [check-valid-until=no], etc.) in order that we can test the reproducibility of the installer images on the Reproducible Builds' own testing infrastructure. (#13) Thorsten Glaser followed up on a bug filed against the dpkg-source component that was originally filed in late 2015 and claims that the build tool does not respect permissions when unpacking tarballs if the umask is set to 0002. Matthew Garrett posted to the debian-devel mailing list on the topic of Producing verifiable initramfs images as part of a wider conversation on being able to trust the entire software stack on our computers. 59 reviews of Debian packages were added, 30 were updated and 42 were removed this month, adding to our knowledge about identified issues; many issue types were noticed and categorised by Chris Lamb.

openSUSE

In openSUSE, Bernhard M. Wiedemann published his monthly Reproducible Builds status update, as well as providing a number of patches.

Software development

diffoscope

diffoscope is our in-depth and content-aware diff-like utility that can locate and diagnose reproducibility issues. It is run countless times a day on our testing infrastructure and is essential for identifying fixes and causes of nondeterministic behaviour (a minimal example invocation is sketched after the list below). Chris Lamb made the following changes this month, including uploading version 137 to Debian:
  • The sng image utility appears to return with an exit code of 1 if there are even minor errors in the file. (#950806)
  • Also extract classes2.dex, classes3.dex from .apk files extracted by apktool. (#88)
  • No need to use str.format if we are just returning the string. [ ]
  • Add generalised support for ignoring returncodes [ ] and move special-casing of returncodes in zip to use Command.VALID_RETURNCODES. [ ]
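As promised above, a minimal diffoscope invocation: given two builds of what should be the same package (file names hypothetical), it produces a report describing every difference it can find.

# Compare two build artifacts and write a browsable HTML report of the differences
diffoscope --html diffoscope-report.html first-build/foo_1.0-1_amd64.deb second-build/foo_1.0-1_amd64.deb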

Other tools

disorderfs is our FUSE-based filesystem that deliberately introduces non-determinism into directory system calls in order to flush out reproducibility issues. This month, Vagrant Cascadian updated the Vcs-Git field to specify the Debian packaging branch. [ ] reprotest is our end-user tool to build the same source code twice in widely differing environments and then check the binaries produced by each build for any differences. This month, versions 0.7.13 and 0.7.14 were uploaded to Debian unstable by Holger Levsen after Vagrant Cascadian added support for GNU Guix [ ].

Project documentation & website

There was more work performed on our documentation and website this month. Bernhard M. Wiedemann added a Java Gradle Build Tool snippet to the SOURCE_DATE_EPOCH documentation [ ] and normalised various terms to "unreproducible" [ ]. Chris Lamb added a Meson.build example [ ] and improved the CMake documentation [ ] in the SOURCE_DATE_EPOCH documentation, replaced "anyone can" with "anyone may" as, well, not everyone has the resources, skills, time or funding to actually do what it refers to [ ], and improved the pre-processing for our report generation [ ][ ][ ][ ] etc. In addition, Holger Levsen updated our news page to improve the list of reports [ ], added an explicit mention of the weekly news time span [ ] and reverted sorting of news entries to have the latest on top [ ], Mattia Rizzolo added Codethink as a non-fiscal sponsor [ ] and lastly Tianon Gravi added a Docker Images link underneath the Debian project on our Projects page [ ].
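For context, SOURCE_DATE_EPOCH (the subject of several of the documentation changes above) is the convention by which builds take their embedded timestamps from an environment variable rather than from the current clock. A minimal sketch of how a build might set it, assuming a git checkout and a generic configure-and-make style build:

# Use the timestamp of the last commit instead of the wall clock,
# so that rebuilding the same revision embeds the same date.
export SOURCE_DATE_EPOCH="$(git log -1 --pretty=%ct)"
./configure && make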

Upstream patches

The Reproducible Builds project detects, dissects and attempts to fix as many currently-unreproducible packages as possible. We endeavour to send all of our patches upstream where appropriate. This month, we wrote a large number of such patches, including a set that Vagrant Cascadian submitted via the Debian bug tracking system, targeting the packages the Civil Infrastructure Platform has identified via the CIP and CIP build-depends package sets.

Testing framework

We operate a fully-featured and comprehensive Jenkins-based testing framework that powers tests.reproducible-builds.org. This month, a number of changes were made by Holger Levsen. In addition, Mattia Rizzolo added an Apache web server redirect for buildinfos.debian.net [ ] and reverted the reshuffling of arm64 architecture builders [ ]. The usual build node maintenance was performed by Holger Levsen, Mattia Rizzolo [ ][ ] and Vagrant Cascadian.

Getting in touch

If you are interested in contributing to the Reproducible Builds project, please visit our Contribute page on our website, where you can also find other ways to get in touch with us.

This month's report was written by Bernhard M. Wiedemann, Chris Lamb and Holger Levsen. It was subsequently reviewed by a bunch of Reproducible Builds folks on IRC and the mailing list.

27 June 2017

Benjamin Mako Hill: Learning to Code in One s Own Language

I recently published a paper with Sayamindu Dasgupta that provides evidence in support of the idea that kids can learn to code more quickly when they are programming in their own language. Millions of young people from around the world are learning to code. Often, during their learning experiences, these youth are using visual block-based programming languages like Scratch, App Inventor, and Code.org Studio. In block-based programming languages, coders manipulate visual, snap-together blocks that represent code constructs instead of the textual symbols and commands found in more traditional programming languages. The textual symbols used in nearly all non-block-based programming languages are drawn from English; consider if statements and for loops as common examples. Keywords in block-based languages, on the other hand, are often translated into different human languages. For example, depending on the language preference of the user, an identical set of computing instructions in Scratch can be represented in many different human languages:
Examples of a short piece of Scratch code shown in four different human languages: English, Italian, Norwegian Bokmål, and German.
Although my research with Sayamindu Dasgupta focuses on learning, both Sayamindu and I worked on local language technologies before coming back to academia. As a result, we were both interested in how the increasing translation of programming languages might be making it easier for non-English speaking kids to learn to code. After all, a large body of education research has shown that early-stage education is more effective when instruction is in the language that the learner speaks at home. Based on this research, we hypothesized that children learning to code with block-based programming languages translated to their mother-tongues will have better learning outcomes than children using the blocks in English. We sought to test this hypothesis in Scratch, an informal learning community built around a block-based programming language. We were helped by the fact that Scratch is translated into many languages and has a large number of learners from around the world.

To measure learning, we built on some of our own previous work and looked at learners' cumulative block repertoires, similar to a code vocabulary. By observing a learner's cumulative block repertoire over time, we can measure how quickly their code vocabulary is growing. Using this data, we compared the rate of growth of cumulative block repertoire between learners from non-English speaking countries using Scratch in English to learners from the same countries using Scratch in their local language. To identify non-English speakers, we considered Scratch users who reported themselves as coming from five primarily non-English speaking countries: Portugal, Italy, Brazil, Germany, and Norway. We chose these five countries because they each have one very widely spoken language that is not English and because Scratch is almost fully translated into that language. Even after controlling for a number of factors like social engagement on the Scratch website, user productivity, and time spent on projects, we found that learners from these countries who use Scratch in their local language have a higher rate of cumulative block repertoire growth than their counterparts using Scratch in English. This faster growth was despite having a lower initial block repertoire. The graph below visualizes our results for two prototypical learners who start with the same initial block repertoire: one learner who uses the English interface, and a second learner who uses their native language.
Summary of the results of our model for two prototypical individuals.
Our results are in line with what theories of education have to say about learning in one's own language. Our findings also represent good news for designers of block-based programming languages who have spent considerable amounts of effort in making their programming languages translatable. It's also good news for the volunteers who have spent many hours translating blocks and user interfaces. Although we find support for our hypothesis, we should stress that our findings are both limited and incomplete. For example, because we focus on estimating the differences between Scratch learners, our comparisons are between kids who all managed to successfully use Scratch. Before Scratch was translated, kids with little working knowledge of English or the Latin script might not have been able to use Scratch at all. Because of translation, many of these children are now able to learn to code.
This blog post and the work that it describes is a collaborative project with Sayamindu Dasgupta. Sayamindu also published a very similar version of the blog post in several places. Our paper is open access and you can read it here. The paper was published in the proceedings of the ACM Learning @ Scale Conference. We also recently gave a talk about this work at the International Communication Association's annual conference. We received support and feedback from members of the Scratch team at MIT (especially Mitch Resnick and Natalie Rusk), as well as from Nathan TeBlunthuis at the University of Washington. Financial support came from the US National Science Foundation.

18 June 2017

Benjamin Mako Hill: The Community Data Science Collective Dataverse

I'm pleased to announce the Community Data Science Collective Dataverse. Our dataverse is an archival repository for datasets created by the Community Data Science Collective. The dataverse won't replace the work that collective members have been doing for years to document and distribute data from our research. What we hope it will do is get our data, like our published manuscripts, into the hands of folks in the "forever" business. Over the past few years, the Community Data Science Collective has published several papers where an important part of the contribution is a dataset. Recently, we've also begun producing replication datasets to go alongside our empirical papers. In the case of each of the first group of papers, where the dataset was a part of the contribution, we uploaded code and data to a website we've created. Of course, even if we do a wonderful job of keeping these websites maintained over time, eventually our research group will cease to exist. When that happens, the data will eventually disappear as well. The text of our papers will be maintained long after we're gone in the journal or conference proceedings publisher's archival storage and in our universities' institutional archives. But what about the data? Since the data is a core part, perhaps the core part, of the contribution of these papers, the data should be archived permanently as well.

Toward that end, our group has created a dataverse. Our dataverse is a repository within the Harvard Dataverse where we have been uploading archival copies of datasets over the last six months. All five of the papers described above are uploaded already. The Scratch dataset, due to access control restrictions, isn't listed on the main page, but it's online on the site. Moving forward, we'll be populating this with new datasets we create as well as replication datasets for our future empirical papers. We're currently preparing several more. The primary point of the CDSC Dataverse is not to provide you with a way to get our data, although you're certainly welcome to use it that way and it might help make some of it more discoverable. The websites we've created (like the ones for redirects and for page protection) will continue to exist and be maintained. The Dataverse is insurance for if, and when, those websites go down, to ensure that our data will still be accessible.
This post was also published on the Community Data Science Collective blog.

19 May 2017

Benjamin Mako Hill: Children s Perspectives on Critical Data Literacies

Last week, we presented a new paper that describes how children are thinking through some of the implications of new forms of data collection and analysis. The presentation was given at the ACM CHI conference in Denver last week and the paper is open access and online. Over the last couple of years, we've worked on a large project to support children in doing, and not just learning about, data science. We built a system, Scratch Community Blocks, that allows the 18 million users of the Scratch online community to write their own computer programs, in Scratch of course, to analyze data about their own learning and social interactions. An example of one of those programs, to find how many of one's followers in Scratch are not from the United States, is shown below. Last year, we deployed Scratch Community Blocks to 2,500 active Scratch users who, over a period of several months, used the system to create more than 1,600 projects.

As children used the system, Samantha Hautea, a student in UW's Communication Leadership program, led a group of us in an online ethnography. We visited the projects children were creating and sharing. We followed the forums where users discussed the blocks. We read comment threads left on projects. We combined Samantha's detailed field notes with the text of comments and forum posts, with ethnographic interviews of several users, and with notes from two in-person workshops. We used a technique called grounded theory to analyze these data. What we found surprised us. We expected children to reflect on being challenged by, and hopefully overcoming, the technical parts of doing data science. Although we certainly saw this happen, what emerged much more strongly from our analysis was detailed discussion among children about the social implications of data collection and analysis. In our analysis, we grouped children's comments into five major themes that represented what we called critical data literacies. These literacies reflect things that children felt were important implications of social media data collection and analysis.

First, children reflected on the way that programmatic access to data, even data that was technically public, introduced privacy concerns. One user described the ability to analyze data as "creepy", but at the same time, "very cool". Children expressed concern that programmatic access to data could lead to stalking and suggested that the system should ask for permission. Second, children recognized that data analysis requires skepticism and interpretation. For example, Scratch Community Blocks introduced a bug where the block that returned data about followers included users with disabled accounts. One user, in an interview, described to us how he managed to figure out the inconsistency:

At one point the follower blocks, it said I have slightly more followers than I do. And, that was kind of confusing when I was trying to make the project. [ ] I pulled up a second [browser] tab and compared the [data from Scratch Community Blocks and the data in my profile].

Third, children discussed the hidden assumptions and decisions that drive the construction of metrics. For example, the number of views received for each project in Scratch is counted using an algorithm that tries to minimize the impact of gaming the system (similar to, for example, YouTube). As children started to build programs with data, they started to uncover and speculate about the decisions behind metrics. For example, they guessed that the view count might only include unique views and that view counts may include users who do not have accounts on the website. Fourth, children building projects with Scratch Community Blocks realized that an algorithm driven by social data may cause certain users to be excluded. For example, a 13-year-old expressed concern that the system could be used to exclude users with few social connections, saying:

I love these new Scratch Blocks! However I did notice that they could be used to exclude new Scratchers or Scratchers with not a lot of followers by using a code: like this:
when flag clicked
if then user's followers < 300
stop all.
I do not think this a big problem as it would be easy to remove this code but I did just want to bring this to your attention in case this not what you would want the blocks to be used for.
Fifth, children were concerned about the possibility that measurement might distort the Scratch community's values. While giving feedback on the new system, a user expressed concern that by making it easier to measure and compare followers, the system could elevate popularity over creativity, collaboration, and respect as a marker of success in Scratch.

I think this was a great idea! I am just a bit worried that people will make these projects and take it the wrong way, saying that followers are the most important thing in on Scratch.

Kids' conversations around Scratch Community Blocks are good news for educators who are starting to think about how to engage young learners in thinking critically about the implications of data. Although no kid using Scratch Community Blocks discussed each of the five literacies described above, the themes reflect starting points for educators designing ways to engage kids in thinking critically about data. Our work shows that if children are given opportunities to actively engage and build with social and behavioral data, they might not only learn how to do data analysis, but also reflect on its implications.

This blog-post and the work that it describes is a collaborative project by Samantha Hautea, Sayamindu Dasgupta, and Benjamin Mako Hill. We have also received support and feedback from members of the Scratch team at MIT (especially Mitch Resnick and Natalie Rusk), as well as from Hal Abelson from MIT CSAIL. Financial support came from the US National Science Foundation.

3 February 2017

Benjamin Mako Hill: New Dataset: Five Years of Longitudinal Data from Scratch

Scratch is a block-based programming language created by the Lifelong Kindergarten Group (LLK) at the MIT Media Lab. Scratch gives kids the power to use programming to create their own interactive animations and computer games. Since 2007, the online community that allows Scratch programmers to share, remix, and socialize around their projects has drawn more than 16 million users who have shared nearly 20 million projects and more than 100 million comments. It is one of the most popular ways for kids to learn programming and among the larger online communities for kids in general.
Front page of the Scratch online community (https://scratch.mit.edu) during the period covered by the dataset.
Since 2010, I have published a series of papers using quantitative data collected from the database behind the Scratch online community. As the source of data for many of my first quantitative and data scientific papers, it's not a major exaggeration to say that I have built my academic career on the dataset. I was able to do this work because I happened to be doing my masters in a research group that shared a physical space ("The Cube") with LLK and because I was friends with Andrés Monroy-Hernández, who started in my masters cohort at the Media Lab. A year or so after we met, Andrés conceived of the Scratch online community and created the first version for his masters thesis project. Because I was at MIT and because I knew the right people, I was able to get added to the IRB protocols and jump through the hoops necessary to get access to the database.

Over the years, Andrés and I have heard over and over, in conversation and in reviews of our papers, that we were privileged to have access to such a rich dataset. More than three years ago, Andrés and I began trying to figure out how we might broaden this access. Andrés had the idea of taking advantage of the launch of Scratch 2.0 in 2013 to focus on trying to release the first five years of Scratch 1.x online community data (March 2007 through March 2012), most of the period that the codebase he had written ran the site. After more work than I have put into any single research paper or project, Andrés and I have published a data descriptor in Nature's new journal Scientific Data. This means that the data is now accessible to other researchers. The data includes five years of detailed longitudinal data organized in 32 tables with information drawn from more than 1 million Scratch users, nearly 2 million Scratch projects, more than 10 million comments, more than 30 million visits to Scratch projects, and much more. The dataset includes metadata on user behavior as well as the full source code for every project. Alongside the data is the source code for all of the software that ran the website and that users used to create the projects, as well as the code used to produce the dataset we've released.

Releasing the dataset was a complicated process. First, we had to navigate important ethical concerns about the impact that a release of any data might have on Scratch's users. Toward that end, we worked closely with the Scratch team and the ethics board at MIT to design a protocol for the release that balanced these risks with the benefit of a release. The most important feature of our approach in this regard is that the dataset we're releasing is limited to public data only. Although the data is public, we understand that computational access to data is different in important ways from access via a browser or API. As a result, we're requiring anybody interested in the data to tell us who they are and agree to a detailed usage agreement. The Scratch team will vet these applicants. Although we're worried that this creates a barrier to access, we think this approach strikes a reasonable balance. Beyond the social and ethical issues, creating the dataset was an enormous task. Andrés and I spent Sunday afternoons over much of the last three years going column-by-column through the MySQL database that ran Scratch. We looked through the source code and the version control system to figure out how the data was created. We spent an enormous amount of time trying to figure out which columns and rows were public.
Most of our work went into creating detailed codebooks and documentation that we hope makes the process of using this data much easier for others (the data descriptor is just a brief overview of what's available). Serializing some of the larger tables took days of computer time. In this process, we had a huge amount of help from many others, including an enormous amount of time and support from Mitch Resnick, Natalie Rusk, Sayamindu Dasgupta, and Benjamin Berg at MIT, as well as from many others on the Scratch Team. We also had an enormous amount of feedback from a group of a couple dozen researchers who tested the release, as well as others who helped us work through the technical, social, and ethical challenges. The National Science Foundation funded both my work on the project and the creation of Scratch itself. Because access to data has been limited, there has been less research on Scratch than the importance of the system warrants. We hope our work will change this. We can imagine studies using the dataset by scholars in communication, computer science, education, sociology, network science, and beyond. We're hoping that by opening up this dataset to others, scholars with different interests, different questions, and in different fields can benefit in the way that Andrés and I have. I suspect that there are other careers waiting to be made with this dataset and I'm excited by the prospect of watching those careers develop. You can find out more about the dataset, and how to apply for access, by reading the data descriptor on Nature's website.

21 October 2016

Iain R. Learmonth: PATHspider 1.0.0 released!

In today's Internet we see an increasing deployment of middleboxes. While middleboxes provide in-network functionality that is necessary to keep networks manageable and economically viable, any packet mangling, whether essential for the needed functionality or accidental as an unwanted side effect, makes it more and more difficult to deploy new protocols or extensions of existing protocols. For the evolution of the protocol stack, it is important to know which network impairments exist and potentially need to be worked around. While classical network measurement tools are often focused on absolute performance values, PATHspider performs A/B testing between two different protocols or different protocol extensions to perform controlled experiments of protocol-dependent connectivity problems as well as differential treatment. PATHspider is a framework for performing and analyzing these measurements, while the actual A/B test can be easily customized. Late on the 21st October, we released version 1.0.0 of PATHspider and it's ready for production use (whatever that means for Internet research software). Our first real release of PATHspider was version 0.9.0, just in time for the presentation of PATHspider at the 2016 Applied Networking Research Workshop co-located with IETF 96 in Berlin earlier this year. Since this release we have made a lot of changes and I'll talk about some of the highlights here (in no particular order):

Switch from twisted.plugin to straight.plugin

While we anticipate that some plugins may wish to use some features of Twisted, we didn't want to have Twisted as a core dependency for PATHspider. We found that straight.plugin was not just a drop-in replacement but that it simplified the way in which 3rd-party plugins could be developed, and it was worth the effort for that alone.

Library functions for the Observer

PATHspider has an embedded flow-meter (think something like NetFlow but highly customisable). We found that, even with the small selection of plugins that we had, we were duplicating code across plugins for these customisations of the flow-meter. In this release we now provide library functions for common needs such as identifying TCP 3-way handshake completions or identifying ICMP Unreachable messages for flows.

New plugin: DSCP

We've added a new plugin for this release to detect breakage when using DiffServ code points to achieve differentiated services within a network.

Plugins are now subcommands

Using the subparsers feature of argparse, all plugins, including 3rd-party plugins, will now appear as subcommands to the PATHspider command. This makes every plugin a first-class citizen and makes PATHspider truly generalised. We have an added benefit from this: plugins can also ask for extra arguments that are specific to the needs of the plugin; for example, the DSCP plugin allows the user to select which code point to send for the experimental test.

Test Suite

PATHspider now has a test suite! As the size of the PATHspider code base grows, we need to be able to make changes and have confidence that we are not breaking code that another module relies on. We have so far only achieved 54% coverage of the codebase but we hope to improve this for the next release. We have focussed on the critical portions of data collection to ensure that all the results collected by PATHspider during experiments are valid.

DNS Resolver Utility

Back when PATHspider was known as ECNSpider, it had a utility for resolving IP addresses from the Alexa top 1 million list. This utility has now been fully integrated into PATHspider and appears as a plugin to allow for repeated experiments against the same IP addresses, even if the DNS resolver would have returned a different address.

Documentation

Documentation is definitely not my favourite activity, but it has to be done. PATHspider 1.0.0 now ships with documentation covering commandline usage, input/output formats and development of new plugins.
If you'd like to check out PATHspider, you can find the website at https://pathspider.net/. Debian packages will be appearing shortly and will find their way into stable-backports within the next 2 weeks (hopefully). Current development of PATHspider is supported by the European Union's Horizon 2020 project MAMI. This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 688421. The opinions expressed and arguments employed reflect only the authors' view. The European Commission is not responsible for any use that may be made of that information.

13 September 2016

John Goerzen: Two Boys, An Airplane, Plus Hundreds of Old Computers

"Was there anything you didn't like about our trip?" Jacob's answer: "That we had to leave so soon!" That's always a good sign. When I first heard about the Vintage Computer Festival Midwest, I almost immediately got the notion that I wanted to go. Besides the TRS-80 CoCo II up in my attic, I also have fond memories of an old IBM PC with CGA monitor, a 25MHz 486, an Alpha (also in my attic), and a lot of other computers along the way. I didn't really think my boys would be interested. But I mentioned it to them, and they just lit up. They remembered the Youtube videos I'd shown them of old line printers and punch card readers, and thought it would be great fun. I thought it could be a great educational experience for them too, and it was. It also turned into a trip that combined being a proud dad with so many of my other interests. Quite a fun time. (Jacob modeling his new t-shirt)

Captain Jacob

Chicago being not all that close to Kansas, I planned to fly us there. If you're flying yourself, solid flight planning is always important. I had already planned out my flight using electronic tools, but I always carry paper maps with me in the cockpit for backup. I got them out and the boys and I planned out the flight the old-fashioned way. Here's Oliver using a scale ruler (with markings for miles corresponding to the scale of the map) and Jacob doing the calculating for us. We measured the entire route and came to within one mile of the computer's calculation for each segment; those boys are precise! We figured out how much fuel we'd use, where we'd make fuel stops, etc. The day of our flight, we made it as far as Davenport, Iowa when a chance of bad weather en route to Chicago convinced me to land there and drive the rest of the way. The boys saw that as part of the exciting adventure! Jacob is always interested in maps, and had kept wanting to use my map whenever we flew. So I dug an old Android tablet out of the attic, put Avare on it (which has aviation maps), and let him use that. He was always checking it while flying, sometimes saying this over his headset: "DING. Attention all passengers, this is Captain Jacob speaking. We are now 45 miles from St. Joseph. Our altitude is 6514 feet. Our speed is 115 knots. We will be on the ground shortly. Thank you. DING" Here he is at the Davenport airport, still busy looking at his maps. Every little airport we stopped at featured adults smiling at the boys. People enjoyed watching a dad and his kids flying somewhere together. Oliver kept busy too. He loves to help me on my pre-flight inspections. He will report every little thing to me: a scratch, a fleck of paint missing on a wheel cover, etc. He takes it seriously. Both boys love to help get the plane ready or put it away.

The Computers

Jacob quickly gravitated towards a few interesting things. He sat for about half an hour watching this old Commodore plotter do its thing (click for video). His other favorite thing was the phones. Several people had brought complete analog PBXs with them. They used them to demonstrate various old phone-related hardware; one had several BBSs running with actual modems, another had old answering machines and home-security devices. Jacob learned a lot about phones, including how to operate a rotary-dial phone, which he'd never used before! Oliver was drawn more to the old computers.
He was fascinated by the IBM PC XT, which I explained was just about like a model I used to get to use sometimes. They learned about floppy disks and how computers store information. He hadn't used joysticks much, and found Pong ("this is a soccer game!") interesting. Somebody had also replaced the guts of a TRS-80 with a Raspberry Pi running a SNES emulator. This had thoroughly confused me for a little while, and excited Oliver. Jacob enjoyed an old TRS-80 which, through a modern Ethernet interface and a little computation help in AWS, provided an interface to Wikipedia. Jacob figured out the text-mode interface quickly. Here he is reading up on trains. I had no idea that Commodore made a lot of adding machines and calculators before they got into the home computer business. There was a vast table with that older Commodore hardware, too much to get on a single photo. But some of the adding machines had their covers off, so the boys got to see all the little gears and wheels and learn how an adding machine can do its printing. And then we get to my favorite: the big iron. Here is a VAX, a working VAX. When you have a computer that huge, it's easier for the kids to understand just what something is. When we encountered the table from the Glenside Color Computer Club, featuring the good old CoCo IIs like what I used as a kid (and have up in my attic), I pointed out to the boys that we have a computer just like this that can do these things, and they responded "wow!" I think they are eager to try out floppy disks and disk BASIC now. Some of my favorites were the old Unix systems, which are a direct ancestor to what I've been working with for decades now. Here's AT&T System V release 3 running on its original hardware. And there were a couple of Sun workstations there, making me nostalgic for my college days. If memory serves, this one is actually running on m68k in the pre-Sparc days.

Returning home

After all the excitement of the weekend, both boys zonked out for awhile on the flight back home. Here's Jacob, sleeping with his maps still up. As we were nearly home, we hit a pocket of turbulence, the kind that feels as if the plane is dropping a bit (it's perfectly normal and safe; you've probably felt that on commercial flights too). I was a bit concerned about Oliver; he is known to get motion sick in cars (and even planes sometimes). But what did I hear from Oliver? "Whee! That was fun! It felt like a roller coaster! Do it again, dad!"

4 July 2016

Benjamin Mako Hill: Studying the relationship between remixing & learning

With more than 10 million users, the Scratch online community is the largest online community where kids learn to program. Since it was created, a central goal of the community has been to promote remixing: the reworking and recombination of existing creative artifacts. As the video above shows, remixing programming projects in the current web-based version of Scratch is as easy as clicking on the "see inside" button in a project web-page, and then clicking on the "remix" button in the web-based code editor. Today, close to 30% of projects on Scratch are remixes. Remixing plays such a central role in Scratch because its designers believed that remixing can play an important role in learning. After all, Scratch was designed first and foremost as a learning community, with its roots in the Constructionist framework developed at MIT by Seymour Papert and his colleagues. The design of the Scratch online community was inspired by Papert's vision of a learning community similar to Brazilian Samba schools (Henry Jenkins writes about his experience of Samba schools in the context of Papert's vision here), and a comment Marvin Minsky made in 1984:
Adults worry a lot these days. Especially, they worry about how to make other people learn more about computers. They want to make us all computer-literate. Literacy means both reading and writing, but most books and courses about computers only tell you about writing programs. Worse, they only tell about commands and instructions and programming-language grammar rules. They hardly ever give examples. But real languages are more than words and grammar rules. There's also literature: what people use the language for. No one ever learns a language from being told its grammar rules. We always start with stories about things that interest us.
In a new paper titled Remixing as a pathway to Computational Thinking that was recently published at the ACM Conference on Computer Supported Collaborative Work and Social Computing (CSCW), we used a series of quantitative measures of online behavior to try to uncover evidence that might support the theory that remixing in Scratch is positively associated with learning. Of course, because Scratch is an informal environment with no set path for users, no lesson plan, and no quizzes, measuring learning is an open problem. In our study, we built on two different approaches to measure learning in Scratch. The first approach considers the number of distinct types of programming blocks available in Scratch that a user has used over her lifetime in Scratch (there are 120 in total), something that can be thought of as a block repertoire or vocabulary. This measure has been used to model informal learning in Scratch in an earlier study. Using this approach, we hypothesized that users who remix more will have a faster rate of growth for their code vocabulary. Controlling for a number of factors (e.g. age of user, the general level of activity) we found evidence of a small but positive relationship between the number of remixes a user has shared and her block vocabulary as measured by the unique blocks she used in her non-remix projects. Intriguingly, we also found a strong association between the number of downloads by a user and her vocabulary growth. One interpretation is that this learning might also be associated with less active forms of appropriation, like the process of reading source code described by Minsky.

The second approach we used considered specific concepts in programming, such as loops or event-handling. To measure this, we utilized a mapping of Scratch blocks to key programming concepts found in this paper by Karen Brennan and Mitchel Resnick. For example, in the image below are all the Scratch blocks mapped to the concept of "loop". We looked at six concepts in total (conditionals, data, events, loops, operators, and parallelism). In each case, we hypothesized that if someone had never used a given concept before, they would be more likely to use that concept after encountering it while remixing an existing project. Using this second approach, we found that users who had never used a concept were more likely to do so if they had been exposed to the concept through remixing. Although some concepts were more widely used than others, we found a positive relationship between concept use and exposure through remixing for each of the six concepts. We found that this relationship was true even if we ignored obvious examples of cutting and pasting of blocks of code. In all of these models, we found what we believe is evidence of learning through remixing. Of course, there are many limitations in this work. What we found are all positive correlations; we do not know if these relationships are causal. Moreover, our measures do not really tell us whether someone has understood the usage of a given block or programming concept. However, even with these limitations, we are excited by the results of our work, and we plan to build on what we have. Our next steps include developing and utilizing better measures of learning, as well as looking at other methods of appropriation, like viewing the source code of a project.

This blog post and the paper it describes are collaborative work with Sayamindu Dasgupta, Andrés Monroy-Hernández, and William Hale. The paper is released as open access so anyone can read the entire paper here. This blog post was also posted on Sayamindu Dasgupta's blog and on Medium by the MIT Media Lab.

9 June 2016

NOKUBI Takatsugu: Recurrent Convolutional Neural Networks for Text Classification

I made a simple implementation of text classification with Recurrent CNN. https://github.com/knok/rcnn-text-classification It uses chainer, a Deep Learning framework.
Recurrent convolutional neural networks for text classification
Siwei Lai, Liheng Xu, Kang Liu, Jun Zhao, Chinese Academy of Sciences, China
AAAI. 2015.
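For readers who want a feel for what such a model looks like in Chainer, here is a heavily simplified sketch of the recurrent-context-plus-max-pooling idea (this is not the repository's code; vocabulary size, dimensions and labels are made up):

import numpy as np
import chainer
import chainer.functions as F
import chainer.links as L

class SimpleRCNN(chainer.Chain):
    def __init__(self, n_vocab, n_embed, n_hidden, n_classes):
        super(SimpleRCNN, self).__init__()
        with self.init_scope():
            self.embed = L.EmbedID(n_vocab, n_embed)
            # a bidirectional LSTM provides left and right context for each word
            self.bilstm = L.NStepBiLSTM(1, n_embed, n_hidden, dropout=0.0)
            self.proj = L.Linear(n_embed + 2 * n_hidden, n_hidden)
            self.out = L.Linear(n_hidden, n_classes)

    def forward(self, xs):
        # xs: list of int32 arrays, one per sentence
        embs = [self.embed(x) for x in xs]
        _, _, contexts = self.bilstm(None, None, embs)
        pooled = []
        for e, c in zip(embs, contexts):
            # combine each word embedding with its recurrent context,
            # then max-pool over the sentence
            h = F.tanh(self.proj(F.concat([e, c], axis=1)))
            pooled.append(F.max(h, axis=0))
        return self.out(F.stack(pooled))

model = SimpleRCNN(n_vocab=10000, n_embed=100, n_hidden=128, n_classes=4)
xs = [np.array([1, 5, 42, 7], dtype=np.int32)]
loss = F.softmax_cross_entropy(model.forward(xs), np.array([2], dtype=np.int32))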

13 December 2015

Robert Edmonds: Works with Debian: Intel SSD 750, AMD FirePro W4100, Dell P2715Q

I recently installed new hardware in my primary computer running Debian unstable. The disk used for the / and /home filesystem was replaced with an Intel SSD 750 series NVM Express card. The graphics card was replaced by an AMD FirePro W4100 card, and two Dell P2715Q monitors were installed. Intel SSD 750 series NVM Express card This is an 800 GB SSD on a PCI-Express x4 card (model number SSDPEDMW800G4X1) using the relatively new NVM Express interface, which appears as a /dev/nvme* device. The stretch alpha 4 Debian installer was able to detect and install onto this device, but grub-installer 1.127 on the installer media was unable to install the boot loader. This was due to a bug recently fixed in 1.128:
grub-installer (1.128) unstable; urgency=high
  * Fix buggy /dev/nvme matching in the case statement to determine
    disc_offered_devfs (Closes: #799119). Thanks, Mario Limonciello!
 -- Cyril Brulebois <kibi@debian.org>  Thu, 03 Dec 2015 00:26:42 +0100
I was able to download and install the updated .udeb by hand in the installer environment and complete the installation. This card was installed on a Supermicro X10SAE motherboard, and the UEFI BIOS was able to boot Debian directly from the NVMe card, although I updated to the latest available BIOS firmware prior to the installation. It appears in lspci like this:
02:00.0 Non-Volatile memory controller: Intel Corporation PCIe Data Center SSD (rev 01)
(prog-if 02 [NVM Express])
    Subsystem: Intel Corporation SSD 750 Series [Add-in Card]
    Flags: bus master, fast devsel, latency 0
    Memory at f7d10000 (64-bit, non-prefetchable) [size=16K]
    Expansion ROM at f7d00000 [disabled] [size=64K]
    Capabilities: [40] Power Management version 3
    Capabilities: [50] MSI-X: Enable+ Count=32 Masked-
    Capabilities: [60] Express Endpoint, MSI 00
    Capabilities: [100] Advanced Error Reporting
    Capabilities: [150] Virtual Channel
    Capabilities: [180] Power Budgeting <?>
    Capabilities: [190] Alternative Routing-ID Interpretation (ARI)
    Capabilities: [270] Device Serial Number 55-cd-2e-41-4c-90-a8-97
    Capabilities: [2a0] #19
    Kernel driver in use: nvme
The card itself appears very large in marketing photos, but this is a visual trick: the photographs are taken with the low-profile PCI bracket installed, rather than the standard height PCI bracket which it ships installed with. smartmontools fails to read SMART data from the drive, although it is still able to retrieve basic device information, including the temperature:
root@chase 0 :~# smartctl -d scsi -a /dev/nvme0n1
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.3.0-trunk-amd64] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor:               NVMe
Product:              INTEL SSDPEDMW80
Revision:             0135
Compliance:           SPC-4
User Capacity:        800,166,076,416 bytes [800 GB]
Logical block size:   512 bytes
Rotation Rate:        Solid State Device
Logical Unit id:      8086INTEL SSDPEDMW800G4                     1000CVCQ531500K2800EGN  
Serial number:        CVCQ531500K2800EGN
Device type:          disk
Local Time is:        Sun Dec 13 01:48:37 2015 EST
SMART support is:     Unavailable - device lacks SMART capability.
=== START OF READ SMART DATA SECTION ===
Current Drive Temperature:     31 C
Drive Trip Temperature:        85 C
Error Counter logging not supported
[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
Device does not support Self Test logging
root@chase 4 :~# 
Simple tests with cat /dev/nvme0n1 >/dev/null and iotop show that the card can read data at about 1 GB/sec, about twice as fast as the SATA-based SSD that it replaced. apt/dpkg now run about as fast on the NVMe SSD as they do on a tmpfs. Hopefully this device doesn't at some point require updated firmware, like some infamous SSDs have.

AMD FirePro W4100 graphics card
This is a graphics card capable of driving multiple DisplayPort displays at "4K" resolution and at a 60 Hz refresh rate. It has four Mini DisplayPort connectors, although I only use two of them. It was difficult to find a sensible graphics card. Most discrete graphics cards appear to be marketed towards video gamers who apparently must seek out bulky cards that occupy multiple PCI slots and have excessive cooling devices. (To take a random example, the ASUS STRIX R9 390X has three fans and brags about its "Mega Heatpipes".) AMD markets a separate line of "FirePro" graphics cards intended for professionals rather than gamers, although they appear to be based on the same GPUs as their "Radeon" video cards. The AMD FirePro W4100 is a normal half-height PCI-E card that fits into a single PCI slot and has a relatively small cooler with a single fan. It doesn't even require an auxiliary power connection and is about the same dimensions as older video cards that I've successfully used with Debian. It was difficult to determine whether the W4100 card was actually supported by an open source driver before buying it. The word "FirePro" appears nowhere on the webpage for the X.org Radeon driver, but I was able to find a "CAPE VERDE" listed as an engineering name which appears to match the "Cape Verde" code name for the FirePro W4100 given on Wikipedia's List of AMD graphics processing units. This explains the "verde" string that appears in the firmware filenames requested by the kernel (available only in the non-free/firmware-amd-graphics package):
[drm] initializing kernel modesetting (VERDE 0x1002:0x682C 0x1002:0x2B1E).
[drm] Loading verde Microcode
radeon 0000:01:00.0: firmware: direct-loading firmware radeon/verde_pfp.bin
radeon 0000:01:00.0: firmware: direct-loading firmware radeon/verde_me.bin
radeon 0000:01:00.0: firmware: direct-loading firmware radeon/verde_ce.bin
radeon 0000:01:00.0: firmware: direct-loading firmware radeon/verde_rlc.bin
radeon 0000:01:00.0: firmware: direct-loading firmware radeon/verde_mc.bin
radeon 0000:01:00.0: firmware: direct-loading firmware radeon/verde_smc.bin
The card appears in lspci like this:
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde GL [FirePro W4100]
(prog-if 00 [VGA controller])
    Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Device 2b1e
    Flags: bus master, fast devsel, latency 0, IRQ 55
    Memory at e0000000 (64-bit, prefetchable) [size=256M]
    Memory at f7e00000 (64-bit, non-prefetchable) [size=256K]
    I/O ports at e000 [size=256]
    Expansion ROM at f7e40000 [disabled] [size=128K]
    Capabilities: [48] Vendor Specific Information: Len=08 <?>
    Capabilities: [50] Power Management version 3
    Capabilities: [58] Express Legacy Endpoint, MSI 00
    Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Capabilities: [150] Advanced Error Reporting
    Capabilities: [200] #15
    Capabilities: [270] #19
    Kernel driver in use: radeon
The W4100 appears to work just fine, except for a few bizarre error messages that are printed to the kernel log when the displays are woken from power saving mode:
[Sun Dec 13 00:24:41 2015] [drm:si_dpm_set_power_state [radeon]] *ERROR* si_enable_smc_cac failed
[Sun Dec 13 00:24:41 2015] [drm:si_dpm_set_power_state [radeon]] *ERROR* si_enable_smc_cac failed
[Sun Dec 13 00:24:41 2015] [drm:radeon_dp_link_train [radeon]] *ERROR* displayport link status failed
[Sun Dec 13 00:24:41 2015] [drm:radeon_dp_link_train [radeon]] *ERROR* clock recovery failed
[Sun Dec 13 00:24:41 2015] [drm:radeon_dp_link_train [radeon]] *ERROR* displayport link status failed
[Sun Dec 13 00:24:41 2015] [drm:radeon_dp_link_train [radeon]] *ERROR* clock recovery failed
[Sun Dec 13 00:24:41 2015] [drm:si_dpm_set_power_state [radeon]] *ERROR* si_enable_smc_cac failed
[Sun Dec 13 00:24:41 2015] [drm:radeon_dp_link_train [radeon]] *ERROR* displayport link status failed
[Sun Dec 13 00:24:41 2015] [drm:radeon_dp_link_train [radeon]] *ERROR* clock recovery failed
[Sun Dec 13 00:24:41 2015] [drm:radeon_dp_link_train [radeon]] *ERROR* displayport link status failed
[Sun Dec 13 00:24:41 2015] [drm:radeon_dp_link_train [radeon]] *ERROR* clock recovery failed
There don't appear to be any ill effects from these error messages, though. I have the following package versions installed:
||/ Name                          Version             Description
+++-=============================-===================-================================================
ii  firmware-amd-graphics         20151207-1          Binary firmware for AMD/ATI graphics chips
ii  linux-image-4.3.0-trunk-amd64 4.3-1~exp2          Linux 4.3 for 64-bit PCs
ii  xserver-xorg-video-radeon     1:7.6.1-1           X.Org X server -- AMD/ATI Radeon display driver
The Supermicro X10SAE motherboard has two PCI-E 3.0 slots, but they're listed as functioning in either "16/NA" or "8/8" mode, which apparently means that putting anything in the second slot (like the Intel 750 SSD, which uses an x4 link) causes the video card to run at a smaller x8 link width. This can be verified by looking at the widths reported in the "LnkCap" and "LnkSta" lines in the lspci -vv output:
root@chase 0 :~# lspci -vv -s 01:00.0 | egrep '(LnkCap|LnkSta):'
        LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
        LnkSta: Speed 8GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
root@chase 0 :~# 
I did not notice any visible artifacts or performance degradation because of the smaller link width. The sensors utility from the lm-sensors package is capable of reporting the temperature of the GPU:
root@chase 0 :~# sensors radeon-pci-0100
radeon-pci-0100
Adapter: PCI adapter
temp1:        +55.0 C  (crit = +120.0 C, hyst = +90.0 C)
root@chase 0 :~# 
Dell P2715Q monitors
Two new 27" Dell monitors with a native resolution of 3840x2160 were attached to the new graphics card. They replaced two ten year old Dell 2001FP monitors with a native resolution of 1600x1200 that had experienced burn-in, providing 4.32 times as many pixels. (TV and monitor manufacturers now shamelessly refer to the 3840x2160 resolution as "4K" resolution even though neither dimension reaches 4000 pixels.) There was very little to set up beyond plugging the DisplayPort inputs on these monitors into the DisplayPort outputs on the graphics card. Most of the setup involved reconfiguring software to work with the very high resolution. X.org, for tl;dr CLOSED NOTABUG reasons, doesn't set the DPI correctly. These monitors have ~163 DPI resolution, so I added -dpi 168 to /etc/X11/xdm/Xservers. (168 is an even 1.75x multiple of 96.) Software like Google Chrome and xfce4-terminal rendered fonts and graphical elements at the right size, but other software like notion, pidgin, and virt-manager did not fully understand the high DPI. E.g., pidgin renders fonts at the correct size, but icons are too small. The default X cursor was also too small. To fix this, I installed the dmz-cursor-theme package, ran update-alternatives --config x-cursor-theme and selected /usr/share/icons/DMZ-Black/cursor.theme as the cursor theme. Overall, these displays are much brighter and more readable than the ones they replaced.
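For reference, the ~163 DPI figure and the -dpi 168 choice can be reproduced with a few lines (treating the panel as exactly 27 inches diagonal, which is a nominal value; the actual viewable size differs slightly):

from math import hypot

w_px, h_px, diag_inches = 3840, 2160, 27.0
print("native DPI: %.1f" % (hypot(w_px, h_px) / diag_inches))   # ~163.2
print("configured DPI: %d" % (96 * 1.75))                       # 168, an even multiple of 96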

1 January 2015

Dirk Eddelbuettel: New Year's Run 2015

Nice, crisp and pretty cold morning for the 2015 New Year's 5k. I had run this before with friends: 2003, 2005, 2006 and 2007. Needless to say, I am a little slower now: A hand-stopped 23:13 for a 7:27 min/mile pace is fine by me given the weather, lack of speedwork in the last few months, and generally reduced running---but also almost a minute slower than the last time I ran this. In case you're a fellow running geek and into this, Strava has my result nicely aggregated.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

6 September 2014

Joachim Breitner: ICFP 2014

Another on-the-journey-back blog post; this time from the Frankfurt Airport train station. My flight was delayed (if I had known that, I could have watched the remaining Lightning Talks), and so was my train, but despite 5 minutes of running through the airport it was just not enough. And now that the free 30 minutes of railway station internet are used up, I have nothing else to do but blog... Last week I was attending ICFP 2014 in Gothenburg, followed by the Haskell Symposium and the Haskell Implementors' Workshop. The justification to attend was the paper on Safe Coercions (joint work with Richard Eisenberg, Simon Peyton Jones and Stephanie Weirich), although Richard got to hold the talk, and did so quite well. So I got to leisurely attend the talks, while fighting the jet-lag that I brought from Portland. There were, as expected, quite a few interesting talks. Among them the first keynote, Kathleen Fisher on the need for formal methods in cars and toy quadcopters and unmanned battle helicopters, which made me conclude that my Isabelle skills might eventually become relevant in practical applications. And did you know that if someone gains access to your car's electronics, they can make the seat belt pull you back hard? Stephanie Weirich's keynote (and the subsequent related talks by Jan Stolarek and Richard Eisenberg) on what a dependently typed Haskell would look like and what we could use it for was mouth-watering. I am a bit worried that Haskell will become a bit obscure for newcomers and people that simply don't want to think about types too much; on the other hand it seems that Haskell as we know it will always stay there, just as a subset of the language. Similarly interesting were refinement types for Haskell (talks by Niki Vazou and by Eric Seidel), in the form of LiquidTypes, something that I have not paid attention to yet. It seems to be a good way for more high assurance in Haskell code. Finally, the Haskell Implementors' Workshop had a truckload of exciting developments in and around Haskell: more on GHCJS, partial type signatures, interactive type-driven development like we know it from Agda, the new Haskell module system and amazing user-defined error messages; the latter unfortunately only in Helium, at least for now. But it's not the case that I only sat and listened. During the Haskell Implementors' Workshop I held a talk, Contributing to GHC, with a live demo of me fixing a (tiny) bug in GHC, with the aim of getting more people to hack on GHC (slides, video). The main message here is that it is not that big of a deal. And despite me not actually saying much interesting in the talk, I got good feedback afterwards. So if it now actually motivates someone to contribute to GHC, I'm even happier. And then there is of course the Hallway Track. I discussed the issues with fusing a left fold (unfortunately, without a great solution). In order to tackle this problem more systematically, John Wiegley and I created the beginning of a "List Fusion Lab", i.e. a bunch of list benchmarks and the possibility to compare various implementations (e.g. with different RULES) and various compilers. With that we can hopefully better assess the effect of a change to the list functions. PS: The next train is now also delayed, so I'll likely miss my tram and arrive home even later... PPS: I really have to update my 10 year old picture on my homepage (or redesign it completely). Quite a few people knew my name, but expected someone with shoulder-long hair...
PPPS: Haskell is really becoming mainstream: I just talked to a randomly chosen person (the boy sitting next to me in the train), and he is a Haskell enthusiast, building a structured editor for Haskell together with his brother. And all that as a 12th-grader...

10 November 2013

Luke Faraone: Why I use my bank's mobile site on my desktop

(or, cutting out bloat by using a platform where bloat won't fly)

Let me start off by saying I'm generally a huge fan of my bank, USAA. Their offerings are free of hidden fees, their phone support excellent, and the perks they provide are competitive. They don't have the best savings interest rates, but you can always find a better deal online to park money not actively in your checking account.

However, USAA's website is a behemoth. My account page took about 8 seconds to fully load, downloading 1.4MiB of content.

The "My Accounts" page you're redirected to after logging in.
It is frequently buggy; whenever I log in via Google Chrome on Ubuntu 12.04 I land on a page with a URL beginning with "https://www.usaa.com/inet/gas_bank/AccountBannerAjax" and a bunch of GET parameters like "currentaccountkey" and "accnumber" with values like "encrypted12a1f4dd1[ ]". The server returns a 200 OK, promises a Content-Length of 20, but then actually returns zero bytes. After navigating to the homepage and clicking a button, I end up getting logged in, but I wonder what percentage of their userbase are experiencing this problem?

For some strange reason, I get a lot of checks. It appears that nobody else informed the banking system that it's 2013, and the easiest mechanism for people to send money without paying fees is still on paper. To its credit, USAA made remote deposit of checks available to all customers in 2006, when it was mostly an offering limited to businesses. However, it seems like they haven't updated their web workflow since then.


Using it on the web still requires using a signed Java applet (itself discouraged by CMU's CERT) that does the incredibly complex task of letting you select a file from your computer and upload it to their servers. At least, that's what I think it does, because any time I chose "Run", my browser complained a few minutes later that the tab had stopped responding. Regardless of functionality, you can accomplish almost anything their site could currently be doing with HTML5 and a third party service if they want to crop images locally.

Spinning after logging in on Android
USAA's mobile app for Android has another host of problems; I haven't been able to log into it for 2 weeks, and when I chatted with someone today I was told they were "doing some maintenance this weekend", so I should try again in a few hours once that's finished.

I googled around a bit for some way to perhaps make the applet work in Ubuntu (which admittedly is not a supported platform), and came upon a Facebook thread where a rep suggested using the mobile web site.

A breath of fresh air
I loaded it in my browser, and was amazed at how well it functioned. Obviously designed for higher-end devices (It didn't even load in one WAP emulator I tried), the mobile web interface was a refreshing breath of fresh air. It scaled well to a full-screen device (see below), loaded quickly, and gave me all the information I would have wanted out of the normal web interface.

Most notably: remember the whole "upload a check" workflow that required a buggy Java applet on the main website? We get bog-standard HTML form fields, no additional magic. There go any theories about the Java client doing some magic validation or prep of the image; here, all they're getting is the images and my session cookie.
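As a rough illustration of how little is needed (this is emphatically not USAA's code; the route and field names are invented), a plain multipart form plus a tiny server-side handler is enough to receive check images:

from flask import Flask, request

app = Flask(__name__)

FORM = """
<form method="post" enctype="multipart/form-data">
  Amount: <input name="amount" type="text">
  Front: <input name="front" type="file" accept="image/*">
  Back: <input name="back" type="file" accept="image/*">
  <button type="submit">Deposit</button>
</form>
"""

@app.route("/deposit", methods=["GET", "POST"])
def deposit():
    if request.method == "POST":
        # all the server ever receives is the image bytes and the session cookie
        request.files["front"].save("front.jpg")
        request.files["back"].save("back.jpg")
        return "received"
    return FORM

if __name__ == "__main__":
    app.run()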


I'm still shocked at whoever thought a My Yahoo!-style homepage was the best layout for a bank, but props to the web developers who managed to make a mobile interface that was both pretty and allowed me to work around broken functionality in their implementations on every other platform I had access to.

But why was the mobile web interface the least bloated? Easy. On the desktop, you generally have a nice pipe, or if not, the user knows it and won't be too upset if your site is just as slow as other sites similarly situated. On mobile, the user downloaded all the code already, so the only latency should be the API requests against the server, right?

On the mobile web users have come to expect relatively speedy mobile-optimised sites and there's less screen real estate to do fancy things that get in the way of content. For many sites, that's a huge improvement. Of course, it would be really nice if more banks supported open protocols for interactions (USAA has a read-only, limited-duration OFX feed), but I would settle for a better web interface.

So tl;dr: USAA, please make www.usaa.com redirect to m.usaa.com, kthxbai.

30 July 2013

Daniel Pocock: Switzerland's scariest railway (video)

Although promoted as a funicular railway, the Gelmerbahn funicular has the appearance of a giant roller coaster and all of the fun too. For all those still having nightmares about recent rail tragedies in the age of non-stop surveillance and on-line replays, the 108 degree ascent (or descent) may not be your thing. On the other hand, for those who experience the London Underground each morning, "mind the gap" takes on a whole new meaning as you see your feet hanging out the bottom of the train and the tracks descending vertically into the valley below.

(Video on the original page; a download is available at http://video.danielpocock.com/Gelmerbahn.webm, about 160MB, for those who prefer to download now and watch later.)

At the top
At the top, there is a scenic two hour walk around Lake Gelmer; the path takes you across the top of a dam built for hydro-electric power generation.
Lake Gelmer - hydro-electric dam

Getting there from DebConf13 and the Debian 20th birthday
For those coming to Switzerland for DebConf13 next month, the Gelmerbahn is probably a little bit too far for a day trip but makes an excellent place to stop during a 2 or 3 day tour around Switzerland. It is accessible using the all-day bus pass for the tour of 3 or 4 mountain passes, or by train to Innertkirchen and then a short ride on the bus. The buses are irregular, so use sbb.ch to plan the journey. Walking down the mountain from the lake to the bus stop takes about 90 minutes, and for those who choose this option the ticket is cheaper.

More Swiss travel blogs
Please click here to access some of my previous blogs about Swiss travel with videos and useful ideas for day trips.

23 December 2012

Benjamin Mako Hill: The Cost of Collaboration for Code and Art

This post was written with Andrés Monroy-Hernández for the Follow the Crowd Research Blog. The post is a summary of a paper forthcoming in Computer-Supported Cooperative Work 2013. You can also read the full paper: The Cost of Collaboration for Code and Art: Evidence from Remixing. It is part of a series of papers I have written with Monroy-Hernández using data from Scratch. You can find the others on my academic website.
Does collaboration result in higher quality creative works than individuals working alone? Is working in groups better for functional works like code than for creative works like art? Although these questions lie at the heart of conversations about collaborative production on the Internet and peer production, it can be hard to find research settings where you can compare across both individual and group work and across both code and art. We set out to tackle these questions in the context of a very large remixing community.

Example of a remix in the Scratch online community, and the project it is based on. The orange arrows indicate pieces which were present in the original and reused in the remix.

Remixing platforms provide an ideal setting to answer these questions. Most support the sharing, and collaborative rating, of both individually and collaboratively authored creative works. They also frequently combine code with artistic media like sound and graphics. We know that increased collaboration often leads to higher quality products. For example, studies of Wikipedia have suggested that vandalism is detected and removed within minutes, and that high quality articles in Wikipedia, by several measures, tend to be produced by more collaboration. That said, we also know that collaborative work is not always better; for example, brainstorming produces fewer good ideas when done in groups. We attempt to answer this broad question, asked many times before, in the context of remixing: which is the better description, "the wisdom of crowds" or "too many cooks spoil the broth"? That, fundamentally, forms our paper's first research question: Are remixes, on average, higher quality than single-authored works? A number of critics of peer production, and some fans, have suggested that mass collaboration on the Internet might work much better for certain kinds of works. The argument is that free software and Wikipedia can be built by a crowd because they are functional. But more creative works like music, a novel, or a drawing might benefit less, or even be hurt by, participation by a crowd. Our second research question tries to get at this possibility: Are code-intensive remixes higher quality than media-intensive remixes? We try to answer these questions using a detailed dataset from Scratch, a large online remixing community where young people build, share, and collaborate on interactive animations and video games. The community was built to support users of the Scratch programming environment: a desktop application with functionality similar to Flash created by the Lifelong Kindergarten Group at the MIT Media Lab. Scratch is designed to allow users to build projects by integrating images, music, sound and other media with programming code. Scratch is used by more than a million, mostly young, users. Measuring quality is tricky and we acknowledge that there are many ways to do it. In the paper, we rely most heavily on a measure of peer ratings in Scratch called "loveits", very similar to "likes" on Facebook. We find similar results with several other metrics and we control for the number of views a project receives. In answering our first research question, we find that remixes are, on average, rated as being of lower quality than works of single authorship. This finding was surprising to us but holds up across a number of alternative tests and robustness checks. In answering our second question, we find rough support for the common wisdom that remixing tends to be more effective for functional works than for artistic media. The more code-intensive a project is, on average, the closer the gap is between a remix and a work of single authorship. But the more media-intensive a project is, the bigger the gap. You can see the relationships that our model predicts in the graph below.

Two plots of estimated values for prototypical projects showing the predicted number of loveits using our estimates. In the left panel, the x-axis varies number of blocks while holding media intensity at the sample median. The right panel varies the number of media elements while holding the number of blocks at the sample median. Ranges for each are from 0 to the 90th percentile.
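A rough sketch of the kind of count model behind predictions like these might look as follows; the column names are hypothetical and this is not the paper's exact specification:

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# one row per project: loveits (count), is_remix (0/1), n_blocks, n_media, views
df = pd.read_csv("scratch_projects.csv")   # placeholder file name

model = smf.glm(
    "loveits ~ is_remix * np.log1p(n_blocks) + is_remix * np.log1p(n_media)"
    " + np.log1p(views)",
    data=df,
    family=sm.families.NegativeBinomial(),
).fit()
print(model.summary())

The interaction terms are what let the remix gap shrink as code intensity grows and widen as media intensity grows, which is the pattern shown in the plots above.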

Both of us are supporters and advocates of remixing. As a result, we were initially a little troubled by our result in this paper. We think the finding suggests an important limit to the broadest claims of the benefit of collaboration in remixing and peer production. That said, we also reject the blind repetition of the mantra that collaboration is always better for every definition of better, and for every type of work. We think it's crucial to learn and understand the limitations and challenges associated with remixing and we're optimistic that this work can influence the design of social media and collaboration systems to help remixing and peer production thrive. For more, see our full paper, The Cost of Collaboration for Code and Art: Evidence from Remixing.

13 December 2012

Andrea Veri: The future is Cloudy

Have you ever heard someone talking extensively about Cloud Computing or generally about Clouds? And have you ever noticed that many people (even the ones who present themselves as experts) don't really understand what a Cloud is at all? That has happened to me multiple times, and one of the most common misunderstandings is that many see the Cloud as simply something being on the internet. Many companies add a little logo representing a cloud on their frontpage and, without a single change to their infrastructure (but surely with a price increase), they start presenting their products as being on the Cloud. Given the lack of knowledge about this specific topic, people tend to buy the product presented as being on the Cloud without understanding what they really bought. But what does Cloud Computing really mean? It took several years and more than fifteen drafts for the National Institute of Standards and Technology (NIST) to find a definition. The final accepted proposal:

Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

The above definition requires a few more clarifications, specifically when it comes to understanding where we should focus while evaluating a Cloud Computing solution. A few key points:
  1. On-demand self-service: every consumer will be able to unilaterally provision multiple computing capabilities like server time, storage, bandwidth, dedicated RAM or CPU without requiring any sort of human interaction from their respective Cloud providers.
  2. Rapid elasticity and scalability: all the computing capabilities outlined above can be elastically provisioned and released depending on how much demand my company will have in a specific period of time. Suppose the X company is launching a new product today and it expects a very large number of customers. The X company will add more resources to their Cloud for the very first days (when they expect the load to be very high) and then it'll scale the resources back to what they were before. Elasticity and scalability permit the X company to improve and enhance their infrastructure when they need it, with a huge saving in monetary terms.
  3. Broad network access: capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, tablets, laptops, and workstations).
  4. Measured service: Cloud systems allow maximum transparency between the provider and the consumer; the usage of all the resources is monitored, controlled, and reported. The consumer knows how much they will spend, when, and for how long.
  5. Resource pooling: each provider's computing resources are pooled to serve multiple consumers at the same time. The consumer has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter).
  6. Resources price: when buying a Cloud service, make sure the cost for two units of RAM, storage, CPU, bandwidth or server time is exactly double the price of one unit of the same capability. An example: if a provider offers you one hour of bandwidth for 1 Euro, the price of two hours will have to be 2 Euros.
Another common error I usually hear is people thinking of Cloud Computing just as a place to put their files online as a backup or for sharing them with co-workers and friends. That is just one of the available Cloud "features", specifically "Cloud Storage", where typical examples are companies like Dropbox, Spideroak, Google Drive, iCloud and so on. But let's make a little note about the other three "features":
  1. Infrastructure as a Service (IaaS): the capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. In this specific case the consumer still has no control or management over the underlying Cloud infrastructure but has control over operating systems, storage, and deployed applications. A customer will be able to add and destroy virtual machines (VMs), install an operating system on them based on custom kickstart files and eventually manage selected networking components like firewalls, hosted domains, and accounts (a minimal provisioning sketch follows this list).
  2. Platform as a Service (PaaS): the capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages, libraries, services, and tools (like MySQL + PHP + phpMyAdmin, or Ruby on Rails) supported by the provider. In this specific case the consumer still has no control or management over the Cloud infrastructure itself (servers, OSs, storage, bandwidth, etc.) but has control over the deployed applications and configuration settings for the application-hosting environment.
  3. Software as a Service (SaaS): the capability provided to the consumer is to use the provider's applications running on a Cloud infrastructure. The applications are accessible through various client devices, such as a browser, a mobile phone or a program interface. The consumer does not manage or control the Cloud infrastructure (servers, OSs, storage, bandwidth, etc.) that allows the applications to run. Even the provided applications aren't customizable by the consumer, who must rely on limited configuration settings.
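To make the IaaS-style self-service above concrete, here is a minimal provisioning sketch using Apache Libcloud; the provider, credentials and names are placeholders rather than a recommendation of any particular vendor:

from libcloud.compute.types import Provider
from libcloud.compute.providers import get_driver

Driver = get_driver(Provider.EC2)
conn = Driver("ACCESS_KEY", "SECRET_KEY", region="us-east-1")

sizes = conn.list_sizes()     # available CPU/RAM combinations
images = conn.list_images()   # available operating system images

# on-demand self-service: create a machine without talking to a human...
node = conn.create_node(name="demo-node", size=sizes[0], image=images[0])

# ...and rapid elasticity: release it again once the peak has passed
conn.destroy_node(node)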
The Cloud Computing technology is reasonably the future, but can we trust Cloud providers? Are we sure that no one will ever have access to our files except us? And what about governments interested in acquiring a specific customer's data hosted on the Cloud? I always suggest reading both the Privacy Policy and the Terms of Use of a service in depth before signing up, especially when it comes to choosing a Cloud storage provider. Many providers look the same: they seem to provide the same resources and the same amount of storage for the same price, but legally they may present different problems, and that is the case of Spideroak vs Dropbox. Quoting from Dropbox's Privacy Policy:
Compliance with Laws and Law Enforcement Requests; Protection of DropBox's Rights. We may disclose to parties outside Dropbox files stored in your Dropbox and information about you that we collect when we have a good faith belief that disclosure is reasonably necessary to (a) comply with a law, regulation or compulsory legal request; (b) protect the safety of any person from death or serious bodily injury; (c) prevent fraud or abuse of DropBox or its users; or (d) to protect Dropbox's property rights. If we provide your Dropbox files to a law enforcement agency as set forth above, we will remove Dropbox's encryption from the files before providing them to law enforcement. However, Dropbox will not be able to decrypt any files that you encrypted prior to storing them on Dropbox.
It's evident that Dropbox employees can access your data or be forced by legal process to turn over your data unencrypted. On the other side, Spideroak, in its latest update to its Privacy Policy, states that data stored on their Cloud is encrypted and inaccessible without the user's key, which is stored locally on the user's computer. And what about the latest research paper, titled Cloud Computing in Higher Education and Research Institutions and the USA Patriot Act, written by the legal experts of the University of Amsterdam's Institute for Information Law, stating that the anti-terror Patriot Act could theoretically be used by U.S. law enforcement to bypass strict European privacy laws to acquire citizen data within the European Union without their consent? The only requirement for the data acquisition is the provider being a U.S. company or a European company conducting systematic business in the U.S. For example, an Italian company storing their documents (protected by the European privacy laws and under the general Italian jurisdiction) on a provider based in Europe but conducting systematic business in the United States could be forced by U.S. law enforcement to transfer data to the U.S. territory for inspection by law enforcement agencies. Does anyone really care about the privacy of companies, consumers and users at all? Or, better, does privacy exist at all for the millions of people that connect to the internet every day?
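One pragmatic mitigation, hinted at in Dropbox's own policy, is to encrypt files locally before they ever reach a provider's sync folder. A minimal sketch with the third-party Python cryptography package (file names are placeholders):

from cryptography.fernet import Fernet

key = Fernet.generate_key()      # keep this key offline; never upload it
cipher = Fernet(key)

with open("contract.pdf", "rb") as f:
    ciphertext = cipher.encrypt(f.read())

# only the encrypted copy should ever land in the provider's folder
with open("contract.pdf.enc", "wb") as f:
    f.write(ciphertext)

# decryption later requires the locally held key
plaintext = Fernet(key).decrypt(ciphertext)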

12 October 2012

Russell Coker: Cheap Bulk Storage

The Problem
Some of my clients need systems that store reasonable amounts of data. This is enough data that we can expect some data corruption on disk such that traditional RAID doesn't work, that old fashioned filesystems like Ext3/4 will have unreasonable fsck times, and that the number of disks in a small server isn't enough. NetApp is a really good option for bulk reliable storage, but their products are very expensive. BTRFS has a lot of potential, but the currently released versions (as supported in distributions such as Debian/Wheezy) lack significant features. One significant lack in current BTRFS releases is something equivalent to the ZFS send/receive functionality for remote backups; this was a major factor when I analysed the options for hard drive based backup [1], and you should always think about backup before deploying a new system. Currently ZFS is the best choice for bulk storage which is reliable if you can't afford NetApp. Any storage system needs a minimum level of reliability if only to protect its own metadata, and a basic RAID array doesn't protect against media corruption with current data volumes. The combination of performance, lack of fsck (which is a performance feature), large storage support, backup, and significant real-world use makes ZFS a really good option. Now I need to get some servers for more than 8.1TiB of storage (the capacity of a RAID-Z array of 4*3TB disks). One of my clients needs significantly more, probably at least 10 disks in a RAID-Z array, so none of the cheaper servers will do. Basically the issue that some of my clients are dealing with (and which I have to solve) is how to provide a relatively cheap ZFS system for storing reasonable amounts of data. For some systems I need to start with about 10 disks and be able to scale to 24 disks or more without excessive expense. Also, to make things a little easier and cheaper, 24*7 operation is not required, so instead of paying for hot-swap disks we can just schedule down-time outside business hours.

The Problem with Dell
Dell is really good for small systems; the PowerEdge tower servers that support 2*3.5" or 4*3.5" disks and which have space for an SSD or two are really affordable and easy to order. But even in the mid-size Dell tower servers (which are small by server standards) you have problems with just getting a few disks operating outside a RAID array [2]. The Dell online store is really great for small servers: any time I'm buying a server for less than $2500 I check the Dell online store first, and usually their price is good enough that there is no need to get a quote from another company. Unfortunately all the servers with bigger storage involve disks that are unreasonably expensive (it seems that Dell makes their profit on the parts) and prices are not available online. I gave my email address and phone number to the Dell web site on Wednesday and they haven't cared to get back to me yet. This is the type of service that makes me avoid IBM and HP for any server deployment where the Dell online store sells something suitable!

BackBlaze
For some time BackBlaze have been getting interest by describing how they store lots of data in a small amount of space by tightly stacking SATA disks. They don't think that ZFS on Linux is ready for production, but their hardware ideas are useful. They have recently described their latest architecture [3]. They describe it as 135TB for $7,384.
Of course the 135TB number is based on the idea of getting the full 3TB capacity out of each disk, which they can do as they have redundancy over multiple storage pods. But anyone who wants a single fileserver needs some internal redundancy to cover disk failure. One option might be to have three RAID-Z2 arrays of 15 disks, which gives a usable capacity of 42*3TB==126TB==113TiB. Note that while the ZFS documentation recommends between 3 and 9 disks per zpool for performance, I don't expect performance problems; when you only have a gigabit Ethernet connection there shouldn't be a problem with three ZFS zpools making the network the bottleneck. For this option the way to go would be to start with an array of 15 disks and then buy a second set of 15 disks when the first storage pool becomes full. It seems likely that 4TB disks will become cheap before a 35TiB array is filled, so we can get more efficiency by delaying purchases. The BackBlaze pod isn't cheap: they are sold as a complete system without storage disks for $US5,395 by Protocase [4]. That gives a markup of $US3,411 over the BackBlaze cost, which isn't too bad given that BackBlaze are quoting the insane bulk discount hardware prices that I could never get. Protocase also offer the case on its own for anyone who wants to build a system around it. It seems like the better option is to buy the system from Protocase, but that would end up being over $6,000 when Australian import duty is added and probably close to $7,000 when shipping etc is included.

Norco
Norco offers a case that takes 24 hot-swap SATA/SAS disks and a regular PC motherboard for $US399 [5]. It's similar to the BackBlaze pod but smaller and cheaper, and there's no obvious option to buy a configured and tested system. 24 disks would allow two RAID-Z2 arrays of 12 disks; the first array could provide 27TiB and the second array could provide something bigger when new disks are released.

SuperMicro
SuperMicro has a range of storage servers that support from 12 to 36 disks [6]. They seem good, but I'd have to deal with a reseller to buy them, which would involve pain at best, and at worst they wouldn't bother getting me a quote because I only want one server at a time.

Conclusion
Does anyone know of any other options for affordable systems suitable for running ZFS on SATA disks? Preferably ones that don't involve dealing with resellers. At the moment it seems that the best option is to get a Norco case and build my own system, as I don't think that any of my clients needs the capacity of a BackBlaze pod at the moment. SuperMicro seems good but I'd have to deal with a reseller. In my experience the difference between the resellers of such computer systems and used car dealers is that used car dealers are happy to sell one car at a time and that every used car dealer at least knows how to drive. Also, if you are an Australian reader of my blog and you want to build such storage servers to sell to my clients in Melbourne, then I'd be interested to see an offer. But please make sure that any such offer includes a reference to your contributions to the Linux community if you think I won't recognise your name. If you don't contribute then I probably don't want to do business with you. As an aside, I was recently at a camera store helping a client test a new DSLR when one of the store employees started telling me how good ZFS is for storing RAW images. I totally agree that ZFS is the best filesystem for storing large RAW files and this is what I am working on right now.
But it's not the sort of advice I expect to receive at a camera store, not even one that caters to professional photographers. Related posts:
  1. Cheap SATA Disks in a Dell PowerEdge T410 A non-profit organisation I support has just bought a Dell...
  2. ZFS vs BTRFS on Cheap Dell Servers I previously wrote about my first experiences with BTRFS [1]....
  3. Flash Storage and Servers In the comments on my post about the Dell PowerEdge...

15 September 2012

Eddy Petrișor: Why a lack of skepticism is dangerous...

Some of my Romanian readers might know that for the last two years I've been involved in the skeptical movement to such a degree that I am a co-producer of a bi-weekly podcast on science and skepticism (in Romanian) called "Skeptics in Romania". Some might even be regular listeners of the show.

(There isn't much to see now visually on the site, but me and the other people behind the project have some ongoing plans to change that.)

In spite of our modest site, up until now we have had some successes, one of them being the publication of an article on us in a known Romanian printed publication and another being the invitation to a live show face to face with Oreste Teodorescu, a well known Romanian mysticist and woo promoter.

During that live show we managed to show a demonstration (video below, in Romanian) of how astrology gives the impression of working, without actually working, and, taking into account that we had no prior TV camera experience and that it was a live show, I think we managed an honourable presence.


(Embedded video: http://www.youtube.com/embed/y5OG1q8_3Ro)

We also have a series of interviews in English with some really interesting people: Dr. Eugenie Scott, Prof. Christopher French, Prof. Edzard Ernst, Samantha Stein and others. We did these interviews at Denkfest 2011, in Zurich, and we integrated the translated (voice-over) interviews into our podcast. The conclusion is that most of our activities revolve around the podcast, so let me tell you more about that.

The podcast has a somewhat fixed structure: it starts with a conversation between ourselves, then we have a segment on the history of science, technology, skepticism and woo, and then we have a segment called "The dangers of not being skeptical". In this segment we present cases of people who lost their lives, their health, their money or any combination of the former because they were duped into some scam, science-y sounding non-science, unfounded claim or some other woo.

Having recently lost my brother-in-law to a form of cancer known as Hodgkin lymphoma, I became especially sensitive about miracle-cure claims for cancer, and this section of the show has lately seen its fair share of such cases. Honestly, if there were a way to prosecute the irresponsible, ignorant and/or cynical people promoting all sorts of quack "therapies", especially for cancer*, I would really like to see it happen. But there isn't, and we're trying the best accessible approach: informing the public.

During the last two years of his life, my brother-in-law went through lots of chemotherapy and radiotherapy sessions, repeated periods of hospitalisation, and lots of drugs. This is the best of what we currently have for treating and curing most forms of cancer, and too many times this isn't enough. I can't even imagine how stressful and discouraging it must feel when the best of what we have doesn't help.

Here is where the desperation and hopes of patients and their families meet the purely irresponsible cynical or ignorant promoters of woo and quack therapies. Because it takes either an ignorant or a really cynical (I really feel this word isn't enough) person to prey on the suffering of other people to make easy money under the false pretence of offering a cure.

It almost happened to my brother-in-law and his family, because they almost went for some herbal concoction promoted as a cancer cure on some forum, blog or page of a seller of this fake therapy. It was really hard for me to make them understand why using such a product is not advisable, not even in parallel with the medical treatment, due to its possible adverse effects or interactions with the real medical treatment, without them getting the wrong idea that I wasn't trying to help. While trying to be brief and informative so as not to lose their attention, I told them how "natural" doesn't necessarily mean "good" (uranium, lead and the Irukandji's venom are all natural), and how plants are drugs because they all contain chemical substances (and no, "chemical" does not mean "human-made" or "artificial") which could interact with the medical treatment.


But most people don't even have the chance of having a person close by with a more science-leaning thought process and a skeptical mind. Those unfortunate people are the most vulnerable and constitute the biggest chunk of the victims of baseless pseudo-cures or pseudo-treatments.

On our last show, I presented the case of Yvonne Main, a cancer-suffering patient who mistook an invasive carcinoma for a cyst, and the iridologist Ruth Nelson for a real healthcare provider.

Yvonne Main died from an invasive carcinoma after seeking help from an iridologist and delaying real medical treatment for 18 months


Yvonne, after seeking medical advice from a person who essentially promotes the dead idea of guessing diseases by looking at the eyes**, used natural treatments for about 18 months and, after all this time, her carcinoma grew to a size of 10 to 11 cm, eating through her skull and causing damage which doctors later attempted to counter through a bone transplant from her ribs.

Ruth Nelson wasn't prosecuted in any way and continues her practice of quackery unharmed.

This is not the only case, nor even one case from a select few where woo and quackery led to grave consequences for patients. There are many, many more; they're so many that even after splitting them into categories they seem too many per category, especially when you realise these are only the findings of, essentially, a single man:


http://www.whatstheharm.net/


This is part of what I have been doing in the last few years, instead of working on Debian. Is it a good thing? Is it a bad thing? Maybe it's good. I want to know: what do you think?


* you will, most likely, never hear such a promoter of non-therapies say that there isn't just one cancer, and that, in fact, cancer is a name for a certain family of diseases which are all called cancer - that's a first sign that you might be dealing with a quack
** probably in the line of thought that the eyes are the gates to the soul so they must tell something significant about health
