Search Results: "benjamin"

16 August 2015

Benjamin Drung: DebConf 15

I am still alive and currently attending DebConf 15. Feel free to grab me for a talk. I am just shy, not antisocial.

2 August 2015

Benjamin Mako Hill: Understanding Hydroplane Races for the New Seattleite

It's Seafair weekend in Seattle. As always, the centerpiece is the H1 Unlimited hydroplane races on Lake Washington. In my social circle, I'm nearly the only person I know who grew up in the area. None of the newcomers I know had heard of hydroplane racing before moving to Seattle. Even after I explain it to them (i.e., boats with 3,000+ horsepower airplane engines that fly just above the water at more than 320 kph (200 mph), leaving 10 m+ (30 ft) wakes behind them!), most people seem more puzzled than interested.

I grew up near the shore of Lake Washington and could see (and hear!) the races from my house. I don't follow hydroplane racing throughout the year but I do enjoy watching the races at Seafair. Here's my attempt to explain and make the case for the races to new Seattleites.

Before Microsoft, Amazon, Starbucks, etc., there were basically three major Seattle industries: (1) logging and lumber-based industries like paper manufacturing; (2) maritime industries like fishing, shipbuilding, shipping, and the navy; (3) aerospace (i.e., Boeing). Vintage hydroplane racing represented the Seattle trifecta: wooden boats with airplane engines! The wooden U-60 Miss Thriftway pictured below, circa 1955 (Thriftway is a Washington-based supermarket that nobody outside the state has heard of), is old-Seattle awesomeness. Modern hydroplanes are now made of fiberglass, but two out of three isn't bad.

Although the boats are racing this year in events in Indiana, San Diego, and Detroit in addition to the two races in Washington, hydroplane racing retains deep ties to the region. Most of the drivers are from the Seattle area. Many or most of the teams and boats are based in Washington throughout the year. Many of the sponsors are unknown outside of the state. This parochialism itself cultivates a certain kind of appeal among locals.

In addition to the old-Seattle/new-Seattle cultural divide, there's a class divide that I think is also worth challenging. Although the demographics of hydro-racing fans are surprisingly broad, it can seem like Formula One or NASCAR on the water. It seems safe to suggest that many of the demographic groups moving to Seattle for jobs in the tech industry are not big into motorsports. Although I'm no follower of motorsports in general, I've written before about cultivated disinterest in professional sports, and it remains something that I believe is worth taking on.

It's not all great. In particular, the close relationship between Seafair and the military makes me very uneasy. That said, even with the military-heavy airshow, I enjoy the way that Seafair weekend provides a little pocket of old-Seattle that remains effectively unchanged from when I was a kid. I'd encourage others to enjoy it as well!

17 July 2015

Simon Kainz: DUCK challenge: week 2

Just a little update on the DUCK challenge: In the last week, the following packages were fixed and uploaded into unstable: Last week we had 10 packages uploaded & fixed, and the current week resulted in 15 fixed packages. So there are currently 25 packages fixed by 20 different uploaders. I really hope I can meet you all at DebConf15!! The list of the fixed and updated packages is available here. I will try to update this ~daily. If I missed one of your uploads, please drop me a line. A big "Thank You" to you. There is still lots of time till the end of DebConf15 and the end of the DUCK Challenge, so please get involved. And remember: debcheckout fails? FIX MORE URLS

7 July 2015

Petter Reinholdtsen: MPEG LA on "Internet Broadcast AVC Video" licensing and non-private use

After asking the Norwegian Broadcasting Company (NRK) why they can broadcast and stream H.264 video without an agreement with the MPEG LA, I was wiser, but still confused. So I asked MPEG LA if their understanding matched that of NRK. As far as I can tell, it does not. I started by asking for more information about the various licensing classes and what exactly is covered by the "Internet Broadcast AVC Video" class that NRK pointed me at to explain why NRK did not need a license for streaming H.264 video:
According to an MPEG LA press release dated 2010-02-02, there is no charge when using MPEG AVC/H.264 according to the terms of "Internet Broadcast AVC Video". I am trying to understand exactly what the terms of "Internet Broadcast AVC Video" are, and wondered if you could help me. What exactly is covered by these terms, and what is not? The only source of more information I have been able to find is a PDF named AVC Patent Portfolio License Briefing, which states this about the fees:
  • Where End User pays for AVC Video
    • Subscription (not limited by title) - 100,000 or fewer subscribers/yr = no royalty; >100,000 to 250,000 subscribers/yr = $25,000; >250,000 to 500,000 subscribers/yr = $50,000; >500,000 to 1M subscribers/yr = $75,000; >1M subscribers/yr = $100,000
    • Title-by-Title - 12 minutes or less = no royalty; >12 minutes in length = lower of (a) 2% or (b) $0.02 per title
  • Where remuneration is from other sources
    • Free Television - (a) one-time $2,500 per transmission encoder or (b) annual fee starting at $2,500 for > 100,000 HH rising to maximum $10,000 for >1,000,000 HH
    • Internet Broadcast AVC Video (not title-by-title, not subscription) no royalty for life of the AVC Patent Portfolio License
Am I correct in assuming that the four categories listed are the categories used when selecting licensing terms, and that "Internet Broadcast AVC Video" is the category for things that do not fall into one of the other three categories? Can you point me to a good source explaining what is meant by "title-by-title" and "Free Television" in the license terms for AVC/H.264? Will a web service providing H.264 encoded video content in a "video on demand" fashion similar to Youtube and Vimeo, where no subscription is required and no payment is required from end users to get access to the videos, fall under the terms of the "Internet Broadcast AVC Video", i.e., no royalty for life of the AVC Patent Portfolio license? Does it matter if some users are subscribed to get access to personalized services? Note, this request and all answers will be published on the Internet.
The answer came quickly from Benjamin J. Myers, Licensing Associate with the MPEG LA:
Thank you for your message and for your interest in MPEG LA. We appreciate hearing from you and I will be happy to assist you. As you are aware, MPEG LA offers our AVC Patent Portfolio License which provides coverage under patents that are essential for use of the AVC/H.264 Standard (MPEG-4 Part 10). Specifically, coverage is provided for end products and video content that make use of AVC/H.264 technology. Accordingly, the party offering such end products and video to End Users concludes the AVC License and is responsible for paying the applicable royalties. Regarding Internet Broadcast AVC Video, the AVC License generally defines such content to be video that is distributed to End Users over the Internet free-of-charge. Therefore, if a party offers a service which allows users to upload AVC/H.264 video to its website, and such AVC Video is delivered to End Users for free, then such video would receive coverage under the sublicense for Internet Broadcast AVC Video, which is not subject to any royalties for the life of the AVC License. This would also apply in the scenario where a user creates a free online account in order to receive a customized offering of free AVC Video content. In other words, as long as the End User is given access to or views AVC Video content at no cost to the End User, then no royalties would be payable under our AVC License. On the other hand, if End Users pay for access to AVC Video for a specific period of time (e.g., one month, one year, etc.), then such video would constitute Subscription AVC Video. In cases where AVC Video is delivered to End Users on a pay-per-view basis, then such content would constitute Title-by-Title AVC Video. If a party offers Subscription or Title-by-Title AVC Video to End Users, then they would be responsible for paying the applicable royalties you noted below. Finally, in the case where AVC Video is distributed for free through an "over-the-air, satellite and/or cable transmission", then such content would constitute Free Television AVC Video and would be subject to the applicable royalties. For your reference, I have attached a .pdf copy of the AVC License. You will find the relevant sublicense information regarding AVC Video in Sections 2.2 through 2.5, and the corresponding royalties in Section 3.1.2 through 3.1.4. You will also find the definitions of Title-by-Title AVC Video, Subscription AVC Video, Free Television AVC Video, and Internet Broadcast AVC Video in Section 1 of the License. Please note that the electronic copy is provided for informational purposes only and cannot be used for execution. I hope the above information is helpful. If you have additional questions or need further assistance with the AVC License, please feel free to contact me directly.
Having a fresh copy of the license text was useful, and knowing that the definition of Title-by-Title required payment per title made me aware that my earlier understanding of that phrase had been wrong. But I still had a few questions:
I have a small followup question. Would it be possible for me to get a license with MPEG LA even if there are no royalties to be paid? The reason I ask, is that some video related products have a copyright clause limiting their use without a license with MPEG LA. The clauses typically look similar to this:
This product is licensed under the AVC patent portfolio license for the personal and non-commercial use of a consumer to (a) encode video in compliance with the AVC standard ("AVC video") and/or (b) decode AVC video that was encoded by a consumer engaged in a personal and non-commercial activity and/or AVC video that was obtained from a video provider licensed to provide AVC video. No license is granted or shall be implied for any other use. additional information may be obtained from MPEG LA L.L.C.
It is unclear to me if this clause means that I need to enter into an agreement with MPEG LA to use the product in question, even if there are no royalties to be paid to MPEG LA. I suspect it will differ depending on the jurisdiction, and mine is Norway. What is MPEG LA's view on this?
According to the answer, MPEG LA believes that those using such tools for non-personal or commercial use need a license with them:
With regard to the Notice to Customers, I would like to begin by clarifying that the Notice from Section 7.1 of the AVC License reads: THIS PRODUCT IS LICENSED UNDER THE AVC PATENT PORTFOLIO LICENSE FOR THE PERSONAL USE OF A CONSUMER OR OTHER USES IN WHICH IT DOES NOT RECEIVE REMUNERATION TO (i) ENCODE VIDEO IN COMPLIANCE WITH THE AVC STANDARD ("AVC VIDEO") AND/OR (ii) DECODE AVC VIDEO THAT WAS ENCODED BY A CONSUMER ENGAGED IN A PERSONAL ACTIVITY AND/OR WAS OBTAINED FROM A VIDEO PROVIDER LICENSED TO PROVIDE AVC VIDEO. NO LICENSE IS GRANTED OR SHALL BE IMPLIED FOR ANY OTHER USE. ADDITIONAL INFORMATION MAY BE OBTAINED FROM MPEG LA, L.L.C. SEE HTTP://WWW.MPEGLA.COM The Notice to Customers is intended to inform End Users of the personal usage rights (for example, to watch video content) included with the product they purchased, and to encourage any party using the product for commercial purposes to contact MPEG LA in order to become licensed for such use (for example, when they use an AVC Product to deliver Title-by-Title, Subscription, Free Television or Internet Broadcast AVC Video to End Users, or to re-Sell a third party's AVC Product as their own branded AVC Product). Therefore, if a party is to be licensed for its use of an AVC Product to Sell AVC Video on a Title-by-Title, Subscription, Free Television or Internet Broadcast basis, that party would need to conclude the AVC License, even in the case where no royalties were payable under the License. On the other hand, if that party (either a Consumer or business customer) simply uses an AVC Product for their own internal purposes and not for the commercial purposes referenced above, then such use would be included in the royalty paid for the AVC Products by the licensed supplier. Finally, I note that our AVC License provides worldwide coverage in countries that have AVC Patent Portfolio Patents, including Norway. I hope this clarification is helpful. If I may be of any further assistance, just let me know.
The mention of Norwegian patents left me a bit confused, so I asked for more information:
But one minor question at the end. If I understand you correctly, you state in the quote above that there are patents in the AVC Patent Portfolio that are valid in Norway. This makes me believe I read the list available from <URL: http://www.mpegla.com/main/programs/AVC/Pages/PatentList.aspx > incorrectly, as I believed the "NO" prefix in front of patents indicated Norwegian patents, and the only one I could find under Mitsubishi Electric Corporation expired in 2012. Which patents are you referring to that are relevant for Norway?
Again, the quick answer explained how to read that list of patents:
Your understanding is correct that the last AVC Patent Portfolio Patent in Norway expired on 21 October 2012. Therefore, where AVC Video is both made and Sold in Norway after that date, then no royalties would be payable for such AVC Video under the AVC License. With that said, our AVC License provides historic coverage for AVC Products and AVC Video that may have been manufactured or Sold before the last Norwegian AVC patent expired. I would also like to clarify that coverage is provided for the country of manufacture and the country of Sale that has active AVC Patent Portfolio Patents. Therefore, if a party offers AVC Products or AVC Video for Sale in a country with active AVC Patent Portfolio Patents (for example, Sweden, Denmark, Finland, etc.), then that party would still need coverage under the AVC License even if such products or video are initially made in a country without active AVC Patent Portfolio Patents (for example, Norway). Similarly, a party would need to conclude the AVC License if they make AVC Products or AVC Video in a country with active AVC Patent Portfolio Patents, but eventually Sell such AVC Products or AVC Video in a country without active AVC Patent Portfolio Patents.
As far as I understand it, MPEG LA believes that anyone using Adobe Premiere and other video-related software with an H.264 distribution license needs a license agreement with MPEG LA to use such tools for anything non-private or commercial, while it is OK to set up a Youtube-like service as long as no one pays to get access to the content. I still have no clear idea how this applies to Norway, where none of the patents MPEG LA is licensing are valid. Will the copyright terms take precedence, or can those terms be ignored because the patents are not valid in Norway?

7 May 2015

Benjamin Mako Hill: Books Room

Mika trying to open the books room. And failing.

Is the locked books room at McMahon Hall at UW a metaphor for DRM in the academy? Could it be, like so many things in Seattle, sponsored by Amazon? Mika noticed the room several weeks ago but felt that today's International Day Against DRM was an opportune time to raise the questions in front of a wider audience.

Benjamin Mako Hill: DRM on Streaming Services

For the 2015 International Day Against DRM, I wrote a short essay on DRM for streaming services posted on the Defective by Design website. I'm republishing it here.

Between 2003 and 2009, most music purchased through Apple's iTunes store was locked using Apple's FairPlay digital restrictions management (DRM) software, which is designed to prevent users from copying music they purchased. Apple did not seem particularly concerned by the fact that FairPlay was never effective at stopping unauthorized distribution and was easily removed with publicly available tools. After all, FairPlay was effective at preventing most users from playing their purchased music on devices that were not made by Apple.

No user ever requested FairPlay. Apple did not build the system because music buyers complained that CDs purchased from Sony would play on Panasonic players or that discs could be played on an unlimited number of devices (FairPlay allowed five). Like all DRM systems, FairPlay was forced on users by a recording industry paranoid about file sharing and, perhaps more importantly, by technology companies like Apple, who were eager to control the digital infrastructure of music distribution and consumption. In 2007, Apple began charging users 30 percent extra for music files not processed with FairPlay. In 2009, after lawsuits were filed in Europe and the US, and after several years of protests, Apple capitulated to their customers' complaints and removed DRM from the vast majority of the iTunes music catalog.

Fundamentally, DRM for downloaded music failed because it is what I've called an antifeature. Like features, antifeatures are functionality created at enormous cost to technology developers. That said, unlike features, which users clamor to pay extra for, users pay to have antifeatures removed. You can think of antifeatures as a technological mob protection racket. Apple charges more for music without DRM, and independent music distributors often use "DRM-free" as a primary selling point for their products.

Unfortunately, after being defeated a half-decade ago, DRM for digital music is becoming the norm again through the growth of music streaming services like Pandora and Spotify, which nearly all use DRM. Impressed by the convenience of these services, many people have forgotten the lessons we learned in the fight against FairPlay. Once again, the justification for DRM is both familiar and similarly disingenuous. Although the stated goal is still to prevent unauthorized copying, tools for stripping DRM from these services continue to be widely available. Of course, the very need for DRM on these services is reduced because users don't normally store copies of music and because the same music is now available for download without DRM on services like iTunes.

We should remember that, like ten years ago, the real effect of DRM is to allow technology companies to capture value by creating dependence in their customers and by blocking innovation and competition. For example, DRM in streaming services blocks third-party apps from playing music from those services, just as FairPlay ensured that iTunes music would only play on Apple devices. DRM in streaming services means that listening to music requires one to use special proprietary clients. For example, even with a premium account, a subscriber cannot listen to music from their catalog using an alternative or modified music player. It means that their television, car, or mobile device manufacturer must cut deals with their service to allow each paying customer to play the catalog they have subscribed to. Although streaming services are able to capture and control value more effectively, this comes at the cost of reduced freedom, choice, and flexibility for users, and at higher prices paid by subscribers.

A decade ago, arguments against DRM for downloaded music focused on the claim that users should have control over the music they purchase. Although these arguments may not seem to apply to subscription services, it is worth remembering that DRM is fundamentally a problem because it means that we do not have control of the technology we use to play our music, and because the firms aiming to control us are using DRM to push antifeatures, raise prices, and block innovation. In all of these senses, DRM in streaming services is exactly as bad as FairPlay, and we should continue to demand better.

1 April 2015

Benjamin Mako Hill: RomancR: The Future of the Sharing-Your-Bed Economy

Today, Aaron Shaw and I are pleased to announce a new startup. The startup is based around an app we are building called RomancR that will bring the sharing economy directly into your bedrooms and romantic lives. When launched, RomancR will bring the kind of market-driven convenience and efficiency that Uber has brought to ride sharing, and that AirBnB has brought to room sharing, directly into the most frustrating and inefficient domain of our personal lives. RomancR is Uber for romance and sex. Here's how it will work:

Of course, there are many existing applications like Tinder and Grindr that help facilitate romance, dating, and hookups. Unfortunately, each of these still relies on old-fashioned intrinsic ways of motivating people to participate in romantic endeavors. The sharing economy has shown us that systems that rely on these non-monetary motivations are ineffective and limiting! For example, many altruistic and socially-driven ride-sharing systems existed on platforms like Craigslist or Ridejoy before Uber. Similarly, volunteer-based communities like Couchsurfing and Hospitality Club existed for many years before AirBnB. None of those older systems took off in the way that their sharing economy counterparts were able to!

The reason that Uber and AirBnB exploded where previous efforts stalled is that this new generation of sharing economy startups brings the power of markets to bear on the problems they are trying to solve. Money both encourages more people to participate in providing a service and also makes it socially easier for people to take that service up without feeling like they are socially in debt to the person providing the service for free. The result has been more reliable and effective systems for providing rides and rooms! The reason that the sharing economy works, fundamentally, is that it has nothing to do with sharing at all! Systems that rely on people's social desire to share without money (projects like Couchsurfing) are relics of the previous century.

RomancR, which we plan to launch later this year, will bring the power and efficiency of markets to our romantic lives. You will leave your pitiful dating life where it belongs: in the dustbin of history! Go beyond antiquated non-market systems for finding lovers. Why should we rely on people's fickle sense of taste and attractiveness, their complicated ideas of interpersonal compatibility, or their sense of altruism, when we can rely on the power of prices? With RomancR, we won't have to!

Note: Thanks to Yochai Benkler, whose example of how leaving a $100 bill on the bedside table of a person with whom you spent the night can change the nature of a romantic interaction inspired the idea for this startup.

Benjamin Mako Hill: More Community Data Science Workshops

Pictures from the CDSW sessions in Spring 2014
After two successful rounds in 2014, I'm helping put on another round of the Community Data Science Workshops. Last year, our 40+ volunteer mentors taught more than 150 absolute beginners the basics of programming in Python, data collection from web APIs, and tools for data analysis and visualization, and we're still in the process of improving our curriculum and scaling up. Once again, the workshops will be totally free of charge and open to anybody. Once again, they will be possible through the generous participation of a small army of volunteer mentors. We'll be meeting for four sessions over three weekends: If you're interested in attending, or interested in volunteering as a mentor, you can go to the information and registration page for the current round of workshops and sign up before April 3rd.

10 February 2015

Benjamin Mako Hill: Kuchisake-onna Decision Tree

Mika recently brought up the modern Japanese legend of Kuchisake-onna. For background, I turned to the English Wikipedia article on Kuchisake-onna, which had the following to say about the figure (the description matches Mika's memory):
According to the legend, children walking alone at night may encounter a woman wearing a surgical mask, which is not an unusual sight in Japan as people wear them to protect others from their colds or sickness. The woman will stop the child and ask, "Am I pretty?" If the child answers no, the child is killed with a pair of scissors which the woman carries. If the child answers yes, the woman pulls away the mask, revealing that her mouth is slit from ear to ear, and asks, "How about now?" If the child answers no, he/she will be cut in half. If the child answers yes, then she will slit his/her mouth like hers. It is impossible to run away from her, as she will simply reappear in front of the victim.
To help anyone who is not only frightened, but also confused, Mika and I made the following decision tree of possible conversations with Kuchisake-onna and their universally unfortunate outcomes.
Decision tree of conversations with Kuchisake-onna.
Of course, we uploaded the SVG source for the diagram to Wikimedia Commons and used the diagram to illustrate the Wikipedia article.
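The same decision tree can also be sketched in a few lines of Python. This is only a playful illustration of the legend as quoted above (it is not part of the original post or the diagram's source):

# Playful sketch of the decision tree described above. Every branch ends
# badly, as the legend promises; running away is not an option.
def kuchisake_onna(first_answer, second_answer):
    """Return the outcome of answering "Am I pretty?" and "How about now?"."""
    if first_answer == "no":
        return "killed with the scissors"
    # She pulls away the mask and asks, "How about now?"
    if second_answer == "no":
        return "cut in half"
    return "your mouth is slit like hers"

for first in ("yes", "no"):
    for second in ("yes", "no"):
        print(first, second, "->", kuchisake_onna(first, second))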

30 December 2014

Benjamin Mako Hill: Consider the Redirect

In wikis, redirects are special pages that silently take readers from the page they are visiting to another page. Although their presence is noted in tiny gray text (see the image below), most people use them all the time and never know they exist. Redirects exist to make linking between pages easier, they populate Wikipedia's search autocomplete list, and they are generally helpful in organizing information. In the English Wikipedia, redirects make up more than half of all article pages.

Over the years, I've spent some time contributing to Redirects for Discussion (RfD). I think of RfD as an ultra-low-stakes version of Articles for Deletion, where Wikipedians decide whether to delete or keep articles. If a redirect is deleted, viewers are taken to a search results page and almost nobody notices. That said, because redirects are almost never viewed directly, almost nobody notices if a redirect is kept either! I've told people that if they want to understand the soul of a Wikipedian, they should spend time participating in RfD. When you understand why arguing about, and working hard to come to consensus solutions for, how Wikipedia should handle individual redirects is an enjoyable way to spend your spare time (where any outcome is invisible), you understand what it means to be a Wikipedian.

That said, wiki researchers rarely take redirects into account. For years, I've suspected that accounting for redirects was important for Wikipedia research and that several classes of findings were noisy or misleading because most people haven't done so. As a result, I worked with my colleague Aaron Shaw at Northwestern earlier this year to build a longitudinal dataset of redirects that can capture the dynamic nature of redirects. Our work was published as a short paper at OpenSym several months ago.

It turns out that taking redirects into account correctly (especially if you are looking at activity over time) is tricky, because redirects are stored as normal pages by MediaWiki except that they happen to start with special redirect text. Like other pages, redirects can be updated and changed over time, and frequently are. As a result, taking redirects into account for any study that looks at activity over time requires looking at the text of every revision of every page.

Using our dataset, Aaron and I showed that the distribution of edits across pages in English Wikipedia (a relationship that is used in many research projects) looks pretty close to log-normal when we remove redirects and very different when we don't. After all, half of articles are really just redirects and, because they are just redirects, these articles are almost never edited.

Another puzzling finding that has been reported in a few places, and that I have repeated myself several times, is that edits and views are surprisingly uncorrelated. I'll write more about this later, but the short version is that we found that a big chunk of this can, in fact, be explained by considering redirects. We've published our code and data, and the article itself is online because we paid the ACM's open access fee to ransom the article.
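As a rough illustration of the detection step described above (a minimal sketch of the general idea, not the code released with the paper), a revision is a redirect exactly when its wikitext begins with MediaWiki's redirect keyword, so tracking redirect status over time amounts to checking the text of every revision:

import re

# Minimal sketch: classify each revision of a page as "article" or
# "redirect" by checking whether its wikitext starts with the redirect
# keyword. The example revision history below is hypothetical.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*\[\[([^\]|#]+)", re.IGNORECASE)

def redirect_target(revision_text):
    """Return the redirect target if this revision is a redirect, else None."""
    match = REDIRECT_RE.match(revision_text or "")
    return match.group(1).strip() if match else None

revisions = [
    "Seattle is a city in the state of Washington...",
    "#REDIRECT [[Seattle]]",
    "Seattle is the largest city in the Pacific Northwest...",
]
for number, text in enumerate(revisions):
    target = redirect_target(text)
    print(number, "redirect to " + target if target else "article")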

27 December 2014

Benjamin Mako Hill: My Government Portrait

A friend recently commented on my rather unusual portrait on my (out of date) page on the Berkman website. Here's the story. I joined Berkman as a fellow with a fantastic class of fellows that included, among many other incredibly accomplished people, Vivek Kundra: first Chief Information Officer of the United States. At Berkman, all the fellows are asked for photos and Vivek apparently sent in his official government portrait. You are probably familiar with the genre. In the US at least, official government portraits are mostly pictures of men in dark suits, light shirts, and red or blue ties with flags draped blurrily in the background. Not unaware of the fact that Vivek sat right below me on the alphabetically sorted Berkman fellows page, a small group that included Paul Tagliamonte (very familiar with the genre from his work with government photos in Open States) decided to create a government portrait of me using the only flag we had on hand late one night. The result, shown in the screenshot above and in the WayBack Machine, was almost entirely unnoticed (at least to my knowledge) but was hopefully appreciated by those who did see it.

24 December 2014

Benjamin Mako Hill: Images of Japan

Going through some photos, I was able to revisit some of the more memorable moments of my trip to Japan earlier this year. For example, the time I visited Genkai Quasi National Park, a beautiful spot in Fukuoka that had a strong resemblance to, but may not actually have been, a national park. There was the time that I saw a curry fault bread. And a shrine one could pray at in a barcalounger. There was also the fact that we had record snowfall while in Tokyo, which left the city's drainage system in a rather unhappy state.

19 October 2014

Benjamin Mako Hill: Another Round of Community Data Science Workshops in Seattle

Pictures from the CDSW sessions in Spring 2014
I am helping coordinate three and a half day-long workshops in November for anyone interested in learning how to use programming and data science tools to ask and answer questions about online communities like Wikipedia, free and open source software, Twitter, civic media, etc. This will be a new and improved version of the workshops run successfully earlier this year. The workshops are for people with no previous programming experience and will be free of charge and open to anyone. Our goal is that, after the three workshops, participants will be able to use data to produce numbers, hypothesis tests, tables, and graphical visualizations to answer questions like: If you are interested in participating, fill out our registration form here before October 30th. We were heavily oversubscribed last time, so registering may help. If you already know how to program in Python, it would be really awesome if you would volunteer as a mentor! Being a mentor will involve working with participants and talking them through the challenges they encounter in programming. No special preparation is required. If you're interested, send me an email.

28 September 2014

Benjamin Mako Hill: Community Data Science Workshops Post-Mortem

Earlier this year, I helped plan and run the Community Data Science Workshops: a series of three (and a half) day-long workshops designed to help people learn basic programming and data science tools in order to ask and answer questions about online communities like Wikipedia and Twitter. You can read our initial announcement for more about the vision. The workshops were organized by myself, Jonathan Morgan from the Wikimedia Foundation, long-time Software Carpentry teacher Tommy Guy, and a group of 15 volunteer mentors who taught project-based afternoon sessions and worked one-on-one with more than 50 participants. With overwhelming interest, we were ultimately constrained by the number of mentors who volunteered. Unfortunately, this meant that we had to turn away most of the people who applied. Although it was not emphasized in recruiting or used as a selection criterion, a majority of the participants were women. The workshops were all free of charge and sponsored by the UW Department of Communication, who provided space, and the eScience Institute, who provided food.

The curriculum for all four sessions is online: The workshops were designed for people with no previous programming experience. Although most of our participants were from the University of Washington, we had non-UW participants from as far away as Vancouver, BC. Feedback we collected suggests that the sessions were a huge success, that participants learned enormously, and that the workshops filled a real need in the Seattle community. Between workshops, participants organized meet-ups to practice their programming skills.

Most excitingly, just as we based our curriculum for the first session on the Boston Python Workshop's, others have been building off our curriculum. Elana Hashman, who was a mentor at the CDSW, is coordinating a set of Python Workshops for Beginners with a group at the University of Waterloo and with sponsorship from the Python Software Foundation, using curriculum based on ours. I also know of two university classes that are tentatively being planned around the curriculum.

Because a growing number of groups have been contacting us about running their own events based on the CDSW, and because we are currently making plans to run another round of workshops in Seattle late this fall, I coordinated with a number of other mentors to go over participant feedback and to put together a long write-up of our reflections in the form of a post-mortem. Although our emphasis is on things we might do differently, we provide a broad range of information that might be useful to people running a CDSW (e.g., our budget). Please let me know if you are planning to run an event so we can coordinate going forward.

24 August 2014

Lucas Nussbaum: on the Dark Ages of Free Software: a Free Service Definition?

Stefano Zacchiroli opened DebConf 14 with an insightful talk titled Debian in the Dark Ages of Free Software (slides available, video available soon). He makes the point (quoting slide 16) that the Free Software community is winning a war that is becoming increasingly pointless: yes, users have a 100% Free Software thin client at their fingertips [or are really a few steps from there]. But all their relevant computations happen elsewhere, on remote systems they do not control, in the Cloud. That giving up of control over computing is a huge and important problem, and probably the largest challenge for everybody caring about freedom, free speech, or privacy today. Stefano rightfully points out that we must do something about it. The big question is: how can we, as a community, address it?

Towards a Free Service Definition?

I believe that we all feel a bit lost with this issue because we are trying to attack it with our current tools & weapons. However, they are largely irrelevant here: the Free Software Definition is about software, and software is to be understood strictly there as software programs. Applying it to services, or to computing in general, doesn't lead anywhere. In order to increase the general awareness about this issue, we should define more precisely what levels of control can be provided, to understand what services are not providing to users, and to make an informed decision about waiving a particular level of control when choosing to use a particular service.

Benjamin Mako Hill pointed out yesterday during the post-talk chat that services are not black or white: there aren't impure and pure services. Instead, there's a gradation of possible levels of control for the computing we do. The Free Software Definition lists four freedoms; how many freedoms, or types of control, should there be in a Free Service Definition, or a Controlled-Computing Definition? Again, this is not only about software: the platform on which a particular piece of software is executed has a huge impact on the available level of control: running your own instance of WordPress, or using an instance on wordpress.com, provides very different control (even if, as Asheesh Laroia pointed out yesterday, WordPress does a pretty good job at providing export and import features to limit data lock-in).

The creation of such a definition is an iterative process. I actually just realized today that (according to Wikipedia) the very first occurrence of an attempt at a Free Software Definition was published in 1986 (GNU's Bulletin Vol. 1 No. 1, page 8); I thought it happened a couple of years earlier. Are there existing attempts at defining such freedoms or levels of control, and at benchmarking such criteria against existing services? Such criteria would not only include control over software modifications and (re)distribution, but would also likely include mentions of interoperability and open standards, both to enable the user to move to a compatible service, and to avoid forcing the user to use a particular implementation of a service. A better understanding of network effects is also needed: how much and what type of service lock-in is acceptable on social networks in exchange for functionality?

I think that we should draw inspiration from what was achieved during the last 30 years on Free Software. The tools that were produced are probably irrelevant to address this issue, but there's a lot to learn from the way they were designed. I really look forward to the day when we will have: Exciting times!

18 May 2014

Benjamin Mako Hill: Installing GNU/Linux on a 2014 Lenovo Thinkpad X1 Carbon

I recently bought a new Lenovo X1 Carbon. It is the new second-generation, type 20A7 laptop, based on Intel's Haswell microarchitecture, with the adaptive keyboard. It is the version released in 2014. I also ordered the ThinkPad OneLink Dock, which I have returned for the OneLink Pro Dock, which I have not yet received. The system is still very new, challenging, and different, but seems to support GNU/Linux reasonably well if you are willing to run a bleeding-edge version and/or patch your kernel, and if you are not afraid to spend an afternoon or two tweaking things. What follows are my installation notes for Debian testing (jessie) when I installed it in early May 2014. My general impressions about the laptop as a GNU/Linux system and overall are at the end of this write-up.
System Description

The X1 Carbon I ordered included the 512GB SSD, the 14.0 inch WQHD (2560×1440) 260 nit touchscreen, and the maximum 8GB of memory. I believe the rest is not particularly negotiable but includes a 720p HD Camera, a 45.2Wh battery, and an Intel Dual Band Wireless 7260AC with Bluetooth 4.0. For those who are curious, here is the output of lspci on the system:
00:00.0 Host bridge: Intel Corporation Haswell-ULT DRAM Controller (rev 0b)
00:02.0 VGA compatible controller: Intel Corporation Haswell-ULT Integrated Graphics Controller (rev 0b)
00:03.0 Audio device: Intel Corporation Haswell-ULT HD Audio Controller (rev 0b)
00:14.0 USB controller: Intel Corporation Lynx Point-LP USB xHCI HC (rev 04)
00:16.0 Communication controller: Intel Corporation Lynx Point-LP HECI #0 (rev 04)
00:16.3 Serial controller: Intel Corporation Lynx Point-LP HECI KT (rev 04)
00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I218-LM (rev 04)
00:1b.0 Audio device: Intel Corporation Lynx Point-LP HD Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation Lynx Point-LP PCI Express Root Port 6 (rev e4)
00:1c.1 PCI bridge: Intel Corporation Lynx Point-LP PCI Express Root Port 3 (rev e4)
00:1d.0 USB controller: Intel Corporation Lynx Point-LP USB EHCI #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation Lynx Point-LP LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation Lynx Point-LP SATA Controller 1 [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation Lynx Point-LP SMBus Controller (rev 04)
BIOS/Firmware

The BIOS firmware is non-free and proprietary, as is the case with all ThinkPads and nearly all laptops. According to this thread, there is a bug in the default BIOS that means that suspend to RAM is broken in GNU/Linux. You can get an updated BIOS at Lenovo's ThinkPad X1 Carbon (Type 20A7, 20A8) Drivers and software page by looking in the BIOS section. Honestly, the easiest approach is probably to download the Windows BIOS Update utility (documentation is here) which you can use to run the BIOS update from within Windows before you install GNU/Linux. If that's not an option (e.g., if you've already installed GNU/Linux), the best method is to download the bootable CD ISO from the same page. Of course, since the X1 Carbon has no optical media, you have to find another way to boot the CD image. I struggled to get the ISO to boot from USB using the usually reliable dd method. This message suggests that the issue had to do with the El Torito wrapper:
I had to dump the eltorito image from the ISO they provide, after that I was able to dd the resulting image to a flash drive and the bios update went well, no cdrom needed.
I updated to version 1.13 of the BIOS which fixes the suspend/resume bug. By the time you read this, there may be newer versions that fix other things so check the Lenovo website.
Installing Debian

I installed Debian testing using the March 19, 2014 Alpha 1 release of the Debian Installer for Jessie (currently testing). I installed in graphical mode. With the WQHD screen, everything was extremely tiny, but it worked flawlessly. I downloaded the amd64 net install image from the normal place and installed the rest of the system using the built-in Ethernet port, which required no firmware or extra drivers. I did the normal dd if=FILENAME.iso of=/dev/sdX method of getting the installer onto a USB stick to boot. I turned off restricted boot in BIOS first. In general, the latest version of the Debian installation guide is always a good source of guidance on installing Debian. I used the Debian installer wizard to partition and selected "Use entire disk and partition it for LVM and encrypted data", which kept the UEFI partitions around. The system installed with no errors or issues and booted up normally afterward. The grub menu is hilariously narrow on the WQHD screen. If you want to use the built-in wireless and/or Bluetooth, you will need to install the non-free iwlwifi firmware package. It is very lame that we still have to do this to use hardware we have purchased.
What Works and Doesn't

The following stuff worked the first time I booted into the GNOME 3 desktop and logged in:
  • The WQHD 2560×1440 screen
  • The touchscreen
  • Both the TrackPoint and the touchpad
  • Built-in e1000e Ethernet using the dongle
  • The keyboard plus the adaptive row of F1-F12 keys.
  • External monitor using the full HDMI or mini-DisplayPort connectors
  • Audio (both speakers and microphone)
  • The camera/webcam
The following stuff works if you install non-free firmware:
  • Internal Wireless
  • Bluetooth 4.0
The following stuff works with qualifications:
  • Suspend to RAM: works once you have updated the firmware.
  • The adaptive keyboard: the F1-F12 keys work, but the button that theoretically lets you switch to different sets of function buttons (e.g., volume, brightness) does nothing.
  • Disabling the touchpad: there is a BIOS option to disable the touchpad. It works in Windows and does nothing at all in GNU/Linux.
I have not tried:
  • The fingerprint reader
Disabling the touchpad

As a long-term ThinkPad user, I love the TrackPoint pointing stick. If you plan on using this, the built-in touchpad is incredibly aggravating because it is very easy to brush against it while using the TrackPoint. In BIOS, there is an option to disable the touchpad. Although this works in Windows, it does absolutely nothing in GNU/Linux. Part of the issue is that, unlike the older X1 Carbon and other ThinkPads, there are no TrackPoint buttons. Instead of buttons, there are regions at the top of the touchpad which are configured, in software, to act like buttons. If you want to be able to click, the touchpad can never be truly turned off. This is not a problem unique to the Haswell X1 Carbon, and a number of people have been struggling with this issue on other Lenovo laptops. Essentially, what you need to do is configure your touchpad so that the buttons are where you want them and so that it ignores any input for the purposes of cursor movement. There are a few ways of doing this, but this answer from an askubuntu.com question has the solution I ended up using:

Open the file /usr/share/X11/xorg.conf.d/50-synaptics.conf for editing.

Find the InputClass section that contains the line Identifier "Default clickpad buttons".

Edit the SoftButtonAreas option to the values "64% 0 1 42% 36% 64% 1 42%"; this sets the size of the right and middle buttons.

Enable the AreaBottomEdge option and change its value to 1; this will disable touchpad movement.

If everything is done right, your section should look like:

Section "InputClass"
     Identifier "Default clickpad buttons"
     MatchDriver "synaptics"
     Option "SoftButtonAreas" "64% 0 1 42% 36% 64% 1 42%"
     Option "AreaBottomEdge" "1"
EndSection
Essentially, the first Option line will create a middle button that is 32% of the width and 42% of the height, and a right button that is 32% of the width and 42% of the height. The synaptics manpage (man synaptics) will give you more detail on the general way this works. Of course, something does feel very wrong about editing a file in /usr/share.
Fixing the Adaptive Keyboard

The wildest feature of the laptop is the adaptive keyboard strip. The strip is a back-lit LCD that looks almost like an E Ink screen and acts as a touchscreen keyboard. The default mode gives you the F1-F12 keys. If you press the keys (since they aren't buttons, you just put your finger on top of them), they act like normal F-keys. You can Ctrl-Alt-F1, etc., to switch to virtual terminals out of the box. There are four modes: Function (i.e., normal F-keys), Home, Web, and Chat. The last three overlap quite a bit (e.g., they all have brightness and volume). You can play with an example on the Lenovo homepage. In Windows, switching programs will apparently change these keys so that an appropriate set of buttons is shown for the application you are using. You can also change these keys manually with a big Fn button at the far left of the adaptive keyboard strip. As I write this, released kernels do not support the adaptive keyboard Fn button, which means you cannot use anything other than the F-keys out of the box. I believe it also means that resuming from suspend to RAM breaks these keys. That said, Shuduo Sang from Canonical has released several versions of a patch to the thinkpad_acpi kernel module which adds support for the Home mode. The other modes (Web and Chat) do not seem to be supported. The latest version of the patch is on the Linux Kernel Mailing List and the relevant commits are:
330947b save and restore adaptive keyboard mode for suspend and resume
3a9d20b support Thinkpad X1 Carbon 2nd generation's adaptive keyboard
Although this is not supported in Debian testing at the time of writing, a bug was filed in Debian and quickly fixed by Ben Hutchings in Debian kernel version 3.14.2-1, which is currently in sid/unstable. As a result, if you install the latest kernel version from Debian unstable (3.14.2-1 or later), the adaptive keyboard just works. If you aren't using Debian and the kernel you are using does not have support, you might be patching your kernel.
General Impressions As I have described in my interview with The Setup, I have been a user of ThinkPad X-series laptops for many years. This is my sixth X-series ThinkPad. Overall, I quite like the hardware! Once things mature a little bit, I think that this will be a great laptop for running GNU/Linux. That said, I ordered the laptop without realizing that the X1 Carbon had gone through a major revision! The keyboard was quite a suprise. I think that changing a system so radically without changing the model name/number is a very bad move on Lenovo s part. There are two remaining issues with the system I m still struggling with: (1) the keyboard layout is freaky and weird, and (2) the super high resolution screen breaks many things. The quality of the keyboard itself is great and worthy of the ThinkPad name. That said, there are two ways in which it is strange. The first is the adaptive keyboard strip. Overall, it works surprisingly well and I think it is a clever idea. My sense is that the strip is more annoying in Windows because it changes out from under you all the time. In GNU/Linux, only manual changing of modes is supported. This, in my opinion, is a feature. I do miss the real feedback you get from pressing keys but for F-keys and volume-keys that I don t use often this isn t too important. On the downside, I have realized several times that I had been holding down a button for several seconds and not noticed. The more annoying issue with the keyboard is the way that the other keys have moved around. Getting rid of the CapsLock is wonderful! How has this taken so long? Replacing it with a split Home and End keys is nuts. I ve remapped the Home and End to put Control back where it should be. My right Control to now Home but I still don t have an End key. The split Backspace and Delete is not a problem for me. The tilde/apostrophe is in a very bad place. There is no Insert, Print Screen/SysRq, Scroll Lock, Pause/Break or NumLock. They are all just gone. Surprisingly, I haven t missed any of them. The second issue is the 2560 1440 resolution on the 14 inch screen. I use a 27 inch external monitor with the same native resolution laptop but, by my arithmetic, the pixel density on the laptop is 210 DPI instead 109 DPI on the external monitor. The result is the scaling problem and it s a huge pain that seems mostly unsolved on any operating system. Fonts and widgets that look good on the laptop look huge on my external monitor. Stuff that looks good on my external monitor looks minuscule on the laptop. I routinely move windows between my laptop screen and my large monitor. Until I find a display system that can handle this kind of scaling effectively, this requires changing font size and zooming all the time. At the moment, I m shrinking and expanding my font size using the built in hot keys in Emacs, Gnome Terminal, and Firefox/Iceweasel. I love the high resolution screen but the current situation is crazy-making. Finally, this setup will not get you into the Church of Emacs and it s not about to find its way onto the FSF s list of endorsed hardware. For one, I paid the Windows tax. Beyond that, there is the non-free BIOS and the need for non-free firmware to use the wireless and Bluetooth. This is standard for ThinkPads but it isn t getting any easier to swallow. There are alternatives in the form of Gluglug s X60 laptops running CoreBoot, Lemote Yeelong laptops, Bunnie Huang s Novena and others that are better in these regards. 
I am very excited about these projects but, for a number of reasons, they just weren't an option for the laptop I use for my research computing.

12 May 2014

Benjamin Mako Hill: Google Has Most of My Email Because It Has All of Yours

Republished by Slate. Translations available in French (Français), Spanish (Español), and Chinese (中文). For almost 15 years, I have run my own email server, which I use for all of my non-work correspondence. I do so to keep autonomy, control, and privacy over my email and so that no big company has copies of all of my personal email. A few years ago, I was surprised to find out that my friend Peter Eckersley, a very privacy-conscious person who is Technology Projects Director at the EFF, used Gmail. I asked him why he would willingly give Google copies of all his email. Peter pointed out that if all of your friends use Gmail, Google has your email anyway. Any time I email somebody who uses Gmail, and any time they email me, Google has that email. Since our conversation, I have often wondered just how much of my email Google really has. This weekend, I wrote a small program to go through all the email I have kept in my personal inbox since April 2004 (when Gmail was started) to find out. One challenge with answering the question is that many people, like Peter, use Gmail to read, compose, and send email but configure Gmail to send email from a non-gmail.com From address. To catch these, my program looks through each message's headers, which record the computers that handled the message on its way to my server, and picks out messages that have traveled through google.com, gmail.com, or googlemail.com. Although I usually filter them, my personal mailbox contains emails sent through a number of mailing lists. Since these mailing lists often hide the true provenance of a message, I exclude all messages that are marked as coming from lists using the (usually invisible) Precedence header. The following graph shows the number of emails in my personal inbox each week in red and the subset from Google in blue. Because the number of emails I receive week to week tends to vary quite a bit, I've included a LOESS smoother which shows a moving average over several weeks. (Graph: emails, total and from Gmail, over time.) From eyeballing the graph, the answer seems to be that, although it varies, about a third of the email in my inbox comes from Google! Keep in mind that this is all of my personal email and includes automatic and computer-generated mail from banks and retailers, etc. Although it is true that Google doesn't have these messages, it suggests that the proportion of my truly personal email that comes via Google is probably much higher. I would also like to know how much of the email I send goes to Google. I can do this by looking at emails in my inbox that I have replied to. This works if I am willing to assume that if I reply to an email sent from Google, my reply ends up back at Google. In some ways, doing this addresses the problem with the emails from retailers and banks, since I am very unlikely to reply to those emails. In this sense, it also reflects a measure of more truly personal email. I've broken down the proportion of emails I received that come from Google in the graph below for all email (top) and for emails I have replied to (bottom). In the graphs, the size of the dots represents the total number of emails counted to make that proportion. Once again, I've included the LOESS moving average. (Graph: proportion of emails from Gmail over time.) The answer is surprisingly large. Despite the fact that I spend hundreds of dollars a year and hours of work to host my own email server, Google has about half of my personal email! Last year, Google delivered 57% of the emails in my inbox that I replied to.
They have delivered more than a third of all the email I've replied to every year since 2006, and more than half since 2010. On the upside, there is some indication that the proportion is going down. So far this year, only 51% of the emails I've replied to arrived from Google. The numbers are higher than I imagined and reflect somewhat depressing news. They show how complicated it is to think about privacy and autonomy for communication between two parties. I'm not sure what to do except encourage others to consider, in the wake of the Snowden revelations and everything else, whether you really want Google to have all your email. And half of mine. If you want to run the analysis on your own, you're welcome to the Python and R code I used to produce the numbers and graphs.
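For readers curious about the mechanics, below is a minimal Python sketch of the kind of counting described above. It is not Mako's actual script (which he links to), and the mbox path is hypothetical; it skips mailing-list traffic via the Precedence header and flags any message whose Received headers mention google.com, gmail.com or googlemail.com.

#!/usr/bin/env python3
# Count how many inbox messages passed through Google's mail servers.
import mailbox
import os

GOOGLE_HOSTS = ("google.com", "gmail.com", "googlemail.com")

def via_google(msg):
    # The Received headers record each server that handled the message.
    received = msg.get_all("Received") or []
    return any(host in header.lower() for header in received for host in GOOGLE_HOSTS)

def is_list_mail(msg):
    # Mailing lists hide the true provenance of a message, so skip them.
    return msg.get("Precedence", "").lower() in ("list", "bulk", "junk")

inbox = mailbox.mbox(os.path.expanduser("~/mail/inbox"))  # hypothetical path
total = from_google = 0
for msg in inbox:
    if is_list_mail(msg):
        continue
    total += 1
    if via_google(msg):
        from_google += 1

if total:
    print(f"{from_google} of {total} messages ({100 * from_google / total:.1f}%) came via Google")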

9 April 2014

Petter Reinholdtsen: S3QL, a locally mounted cloud file system - nice free software

For a while now, I have been looking for a sensible offsite backup solution for use at home. My requirements are simple: it must be cheap and locally encrypted (in other words, I keep the encryption keys and the storage provider does not have access to my private files). One idea my friends and I had many years ago, before the cloud storage providers showed up, was to use Google mail as storage, writing a Linux block device that stored blocks as emails in the mail service provided by Google, and thus getting heaps of free space. On top of this one could add encryption, RAID and volume management to have lots of (fairly slow, I admit that) cheap and encrypted storage. But I never found time to implement such a system. Over the last few weeks, however, I have looked at a system called S3QL, a locally mounted, network-backed file system with the features I need. S3QL is a FUSE file system with a local cache and cloud storage, handling several different storage providers, anything with an Amazon S3, Google Drive or OpenStack API. There are heaps of such storage providers. S3QL can also use a local directory as storage, which combined with sshfs allows for file storage on any ssh server. S3QL includes support for encryption, compression, de-duplication, snapshots and immutable file systems, allowing me to mount the remote storage as a local mount point and use the files as if they were local, while the content is stored in the cloud as well. This allows me to have a backup that should survive a fire. The file system can not be shared between several machines at the same time, as only one can mount it at a time, but any machine with the encryption key and access to the storage service can mount it if it is unmounted. It is simple to use. I'm using it on Debian Wheezy, where the package is already included. So to get started, run apt-get install s3ql. Next, pick a storage provider. I ended up picking Greenqloud, after reading their nice recipe on how to use S3QL with their Amazon S3-compatible service, because I trust the laws in Iceland more than those in the USA when it comes to keeping my personal data safe and private, and thus would rather spend money on a company in Iceland. Another nice recipe is available in the article S3QL Filesystem for HPC Storage by Jeff Layton in the HPC section of Admin magazine. When the provider is picked, figure out how to get the API key needed to connect to the storage API. With Greenqloud, the key did not show up until I had added payment details to my account. Armed with the API access details, it is time to create the file system. First, create a new bucket in the cloud. This bucket is the file system storage area. I picked a bucket name reflecting the machine that was going to store data there, but any name will do. I'll refer to it as bucket-name below. In addition, one needs the API login and password, and a locally created passphrase. Store it all in ~root/.s3ql/authinfo2 like this:
[s3c]
storage-url: s3c://s.greenqloud.com:443/bucket-name
backend-login: API-login
backend-password: API-password
fs-passphrase: local-password
I create my local passphrase using pwget 50 or similar, but any sensible way to create a fairly random password will do (see the short Python sketch after the transcript below). Armed with these details, it is now time to run mkfs.s3ql, entering the API details and passphrase to create the file system:
# mkdir -m 700 /var/lib/s3ql-cache
# mkfs.s3ql --cachedir /var/lib/s3ql-cache --authfile /root/.s3ql/authinfo2 \
  --ssl s3c://s.greenqloud.com:443/bucket-name
Enter backend login: 
Enter backend password: 
Before using S3QL, make sure to read the user's guide, especially
the 'Important Rules to Avoid Loosing Data' section.
Enter encryption password: 
Confirm encryption password: 
Generating random encryption key...
Creating metadata tables...
Dumping metadata...
..objects..
..blocks..
..inodes..
..inode_blocks..
..symlink_targets..
..names..
..contents..
..ext_attributes..
Compressing and uploading metadata...
Wrote 0.00 MB of compressed metadata.
# 
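As mentioned above, the fs-passphrase only needs to be long and random. If you do not have pwget installed, a tiny Python snippet like the following (my own sketch, not part of S3QL) generates a comparable 50-character passphrase:

#!/usr/bin/env python3
# Generate a random 50-character passphrase for the fs-passphrase field.
import secrets
import string

alphabet = string.ascii_letters + string.digits
print("".join(secrets.choice(alphabet) for _ in range(50)))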
The next step is mounting the file system to make the storage available.
# mount.s3ql --cachedir /var/lib/s3ql-cache --authfile /root/.s3ql/authinfo2 \
  --ssl --allow-root s3c://s.greenqloud.com:443/bucket-name /s3ql
Using 4 upload threads.
Downloading and decompressing metadata...
Reading metadata...
..objects..
..blocks..
..inodes..
..inode_blocks..
..symlink_targets..
..names..
..contents..
..ext_attributes..
Mounting filesystem...
# df -h /s3ql
Filesystem                              Size  Used Avail Use% Mounted on
s3c://s.greenqloud.com:443/bucket-name  1.0T     0  1.0T   0% /s3ql
#
The file system is now ready for use. I use rsync to store my backups in it, and as the metadata used by rsync is downloaded at mount time, no network traffic (and storage cost) is triggered by running rsync. To unmount, one should not use the normal umount command, as it will not flush the cache to the cloud storage; instead, run the umount.s3ql command like this:
# umount.s3ql /s3ql
# 
There is a fsck.s3ql command available to check the file system and correct any problems detected. It can also be used if the local server crashes while the file system is mounted, to reset the "already mounted" flag. This is what it looks like when processing a working file system:
# fsck.s3ql --force --ssl s3c://s.greenqloud.com:443/bucket-name
Using cached metadata.
File system seems clean, checking anyway.
Checking DB integrity...
Creating temporary extra indices...
Checking lost+found...
Checking cached objects...
Checking names (refcounts)...
Checking contents (names)...
Checking contents (inodes)...
Checking contents (parent inodes)...
Checking objects (reference counts)...
Checking objects (backend)...
..processed 5000 objects so far..
..processed 10000 objects so far..
..processed 15000 objects so far..
Checking objects (sizes)...
Checking blocks (referenced objects)...
Checking blocks (refcounts)...
Checking inode-block mapping (blocks)...
Checking inode-block mapping (inodes)...
Checking inodes (refcounts)...
Checking inodes (sizes)...
Checking extended attributes (names)...
Checking extended attributes (inodes)...
Checking symlinks (inodes)...
Checking directory reachability...
Checking unix conventions...
Checking referential integrity...
Dropping temporary indices...
Backing up old metadata...
Dumping metadata...
..objects..
..blocks..
..inodes..
..inode_blocks..
..symlink_targets..
..names..
..contents..
..ext_attributes..
Compressing and uploading metadata...
Wrote 0.89 MB of compressed metadata.
# 
Thanks to the cache, working on files that fit in the cache is very quick, about the same speed as local file access. Uploading large amounts of data is, for me, limited by the bandwidth out of and into my house. Uploading 685 MiB with a 100 MiB cache gave me 305 kiB/s, which is very close to my upload speed, and downloading the same Debian installation ISO gave me 610 kiB/s, close to my download speed. Both were measured using dd. So for me, the bottleneck is my network, not the file system code. I do not know what a good cache size would be, but suspect that the cache should be larger than your working set. I mentioned that only one machine can mount the file system at a time. If another machine tries, it is told that the file system is busy:
# mount.s3ql --cachedir /var/lib/s3ql-cache --authfile /root/.s3ql/authinfo2 \
  --ssl --allow-root s3c://s.greenqloud.com:443/bucket-name /s3ql
Using 8 upload threads.
Backend reports that fs is still mounted elsewhere, aborting.
#
The file content is uploaded when the cache is full, while the metadata is uploaded once every 24 hours by default. To ensure the file system content is flushed to the cloud, one can either unmount the file system or ask S3QL to flush the cache and metadata using s3qlctrl:
# s3qlctrl upload-meta /s3ql
# s3qlctrl flushcache /s3ql
# 
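To tie the steps together, here is a small Python sketch of a complete backup run using the same commands, cache directory and mount point as above. It is my own illustration, not something shipped with S3QL, and the source directory is hypothetical; it mounts the file system, copies the data with rsync, flushes cache and metadata, and unmounts again:

#!/usr/bin/env python3
# Sketch of a backup run wrapping the S3QL commands shown in this post.
import subprocess

STORAGE_URL = "s3c://s.greenqloud.com:443/bucket-name"
CACHE_DIR = "/var/lib/s3ql-cache"
AUTH_FILE = "/root/.s3ql/authinfo2"
MOUNT_POINT = "/s3ql"
SOURCE_DIR = "/home/"  # hypothetical: whatever should be backed up

def run(*cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

run("mount.s3ql", "--cachedir", CACHE_DIR, "--authfile", AUTH_FILE,
    "--ssl", "--allow-root", STORAGE_URL, MOUNT_POINT)
try:
    run("rsync", "-a", "--delete", SOURCE_DIR, MOUNT_POINT + "/backup/")
    run("s3qlctrl", "flushcache", MOUNT_POINT)   # flush file data to the cloud
    run("s3qlctrl", "upload-meta", MOUNT_POINT)  # upload the metadata too
finally:
    run("umount.s3ql", MOUNT_POINT)  # plain umount would skip the cache flush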
If you are curious about how much space your data uses in the cloud, and how much compression and deduplication cut down on the storage usage, you can use s3qlstat on the mounted file system to get a report:
# s3qlstat /s3ql
Directory entries:    9141
Inodes:               9143
Data blocks:          8851
Total data size:      22049.38 MB
After de-duplication: 21955.46 MB (99.57% of total)
After compression:    21877.28 MB (99.22% of total, 99.64% of de-duplicated)
Database size:        2.39 MB (uncompressed)
(some values do not take into account not-yet-uploaded dirty blocks in cache)
#
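The percentages in the report are simply ratios of the three sizes; a quick check in Python reproduces them:

# Reproduce the percentages from the s3qlstat report above (sizes in MB).
total, deduped, compressed = 22049.38, 21955.46, 21877.28
print(f"after de-duplication: {100 * deduped / total:.2f}% of total")
print(f"after compression:    {100 * compressed / total:.2f}% of total, "
      f"{100 * compressed / deduped:.2f}% of de-duplicated")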
I mentioned earlier that there are several possible suppliers of storage. I did not try to locate them all, but am aware of at least Greenqloud, Google Drive, Amazon S3 web services, Rackspace and Crowncloud. The latter even accepts payment in Bitcoin. Pick one that suits your needs. Some of them provide several GiB of free storage, but the price models are quite different and you will have to figure out what suits you best. While researching this blog post, I had a look at research papers and posters discussing the S3QL file system. There are several, which told me that the file system is getting critical scrutiny from the scientific community and increased my confidence in using it. One nice poster is titled "An Innovative Parallel Cloud Storage System using OpenStack's Swift Object Store and Transformative Parallel I/O Approach" by Hsing-Bung Chen, Benjamin McClelland, David Sherrill, Alfred Torrez, Parks Fields and Pamela Smith. Please have a look. Given my problems with different file systems earlier, I decided to check out the mounted S3QL file system to see if it would be usable as a home directory (in other words, that it provides POSIX semantics when it comes to locking, umask handling and so on). Running my test code to check file system semantics, I was happy to discover that no errors were found, so the file system can be used for home directories if one chooses to do so (a rough sketch of that kind of check appears at the end of this post). If you do not want a locally mounted file system, and want something that works without the Linux FUSE file system, I would like to mention the Tarsnap service, which also provides locally encrypted backup using a command-line client. It has a nicer access control system, where one can split read and write access, allowing some systems to write to the backup and others to only read from it. As usual, if you use Bitcoin and want to show your support of my activities, please send Bitcoin donations to my address 15oWEoG9dUPovwmUL9KWAnYRtNJEkP1u1b.
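For reference, here is a minimal sketch of the kind of POSIX semantics check mentioned above. It is my own illustration rather than the author's test code, assumes the file system is mounted at /s3ql, and exercises umask handling and fcntl locking:

#!/usr/bin/env python3
# Minimal POSIX semantics check: umask handling and fcntl locking on /s3ql.
import fcntl
import os
import stat

path = "/s3ql/posix-test.tmp"  # the mount point used in this post

old_umask = os.umask(0o027)  # with umask 027, mode 0666 must become 0640
try:
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_RDWR, 0o666)
    try:
        mode = stat.S_IMODE(os.fstat(fd).st_mode)
        assert mode == 0o640, f"unexpected mode {oct(mode)}"
        fcntl.lockf(fd, fcntl.LOCK_EX)  # take and release an exclusive lock
        fcntl.lockf(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
        os.unlink(path)
finally:
    os.umask(old_umask)

print("umask and locking behaved as expected")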
