Search Results: "karthik"

10 September 2017

Sylvain Beucler: dot-zed archive file format

TL,DR: I reverse-engineered the .zed encrypted archive format.
Following a clean-room design, I'm providing a description that can be implemented by a third-party.
Interested? :) (reference version at: https://www.beuc.net/zed/) .zed archive file format Introduction Archives with the .zed extension are conceptually similar to an encrypted .zip file. In addition to a specific format, .zed files support multiple users: files are encrypted using the archive master key, which itself is encrypted for each user and/or authentication method (password, RSA key through certificate or PKCS#11 token). Metadata such as filenames is partially encrypted. .zed archives are used as stand-alone or attached to e-mails with the help of a MS Outlook plugin. A variant, which is not covered here, can encrypt/decrypt MS Windows folders on the fly like ecryptfs. In the spirit of academic and independent research this document provides a description of the file format and encryption algorithms for this encrypted file archive. See the conventions section for conventions and acronyms used in this document. Structure overview The .zed file format is composed of several layers. Or as a diagram:
+----------------------------------------------------------------------------------------------------+
  .zed archive (MS-CBF)                                                                               
                                                                                                      
   stream #1                         stream #2                       stream #3...                     
  +------------------------------+  +---------------------------+  +---------------------------+      
    metadata (MS-OLEPS)               encryption (AES)               encryption (AES)                 
                                      512-bytes chunks               512-bytes chunks                 
    +--------------------------+                                                                      
      obfuscation (static key)        +-----------------------+      +-----------------------+        
      +----------------------+       -  compression (zlib)     -    -  compression (zlib)     -       
       _ctlfile (TLV)                                                                            ...  
      +----------------------+          +---------------+              +---------------+               
    +--------------------------+          file contents                  file contents                
                                                                                                      
    +--------------------------+     -  +---------------+      -    -  +---------------+      -       
      _catalog (TLV)                                                                                  
    +--------------------------+      +-----------------------+      +-----------------------+        
  +------------------------------+  +---------------------------+  +---------------------------+      
+----------------------------------------------------------------------------------------------------+
Encryption schemes Several AES key sizes are supported, such as 128 and 256 bits. The Cipher Block Chaining (CBC) block cipher mode of operation is used to decrypt multiple AES 16-byte blocks, which means an initialisation vector (IV) is stored in clear along with the ciphertext. All filenames and file contents are encrypted using the same encryption mode, key and IV (e.g. if you remove and re-add a file in the archive, the resulting stream will be identical). No cleartext padding is used during encryption; instead, several end-of-stream handlers are available, so the ciphertext has exactly the size of the cleartext (e.g. the size of the compressed file). The following variants were identified in the 'encryption_mode' field. STREAM This is the end-of-stream handler for: This end-of-stream handler is apparently specific to the .zed format, and applied when the cleartext's does not end on a 16-byte boundary ; in this case special processing is performed on the last partial 16-byte block. The encryption and decryption phases are identical: let's assume the last partial block of cleartext (for encryption) or ciphertext (for decryption) was appended after all the complete 16-byte blocks of ciphertext: In either case, if the full ciphertext is less then one AES block (< 16 bytes), then the IV is used instead of the second-to-last block. CTS CTS or CipherText Stealing is the end-of-stream handler for: It matches the CBC-CS3 variant as described in Recommendation for Block Cipher Modes of Operation: Three Variants of Ciphertext Stealing for CBC Mode. Empty cleartext Since empty filenames or metadata are invalid, and since all files are compressed (resulting in a minimum 8-byte zlib cleartext), no empty cleartext was encrypted in the archive. metadata stream It is named 05356861616161716149656b7a6565636e576a33317a7868304e63 (hexadecimal), i.e. the character with code 5 followed by '5haaaaqaIekzeecnWj31zxh0Nc' (ASCII). The format used is OLE Property Set (MS-OLEPS). It introduces 2 property names "_ctlfile" (index 3) and "_catalog" (index 4), and 2 instances of said properties each containing an application-specific VT_BLOB (type 0x0041). _ctlfile: obfuscated global properties and access list This subpart is stored under index 3 ("_ctlfile") of the MS-OLEPS metadata. It consists of: The ciphertext is encrypted with AES-CBC "STREAM" mode using 128-bit static key 37F13CF81C780AF26B6A52654F794AEF (hexadecimal) and the prepended IV so as to obfuscate the access list. The ciphertext is continuous and not split in chunks (unlike files), even when it is larger than 512 bytes. The decrypted text contain properties in a TLV format as described in _ctlfile TLV: Archives may include "mandatory" users that cannot be removed. They are typically used to add an enterprise wide recovery RSA key to all archives. Extreme care must be taken to protect these key, as it can decrypt all past archives generated from within that company. _catalog: file list This subpart is stored under index 4 ("_catalog") of the MS-OLEPS metadata. It contains a series of 'fileprops' TLV structures, one for each file or directory. The file hierarchy can be reconstructed by checking the 'parent_id' field of each file entry. If 'parent_id' is 0 then the file is located at the top-level of the hierarchy, otherwise it's located under the directory with the matching 'file_id'. TLV format This format is a series of fields : Value semantics depend on its Type. It may contain an uint32be integer, a UTF-16LE string, a character sequence, or an inner TLV structure. Unless otherwise noted, TLV structures appear once. Some fields are optional and may not be present at all (e.g. 'archive_createdwith'). Some fields are unique within a structure (e.g. 'files_iv'), other may be repeated within a structure to form a list (e.g. 'fileprops' and 'passworduser'). The following top-level types that have been identified, and detailed in the next sections: Some additional unidentified types may be present. _ctlfile TLV _catalog TLV Decrypting the archive AES key rsauser The user accessing the archive will be authenticated by comparing his/her X509 certificate with the one stored in the 'certificate' field using DER format. The 'files_key_ciphertext' field is then decrypted using the PKCS#1 v1.5 encryption mechanism, with the private key that matches the user certificate. passworduser An intermediary user key, a user IV and an integrity checksum will be derived from the user password, using the deprecated PKCS#12 method as described at rfc7292 appendix B. Note: this is not PKCS#5 (nor PBKDF1/PBKDF2), this is an incompatible method from PKCS#12 that notably does not use HMAC. The 'pkcs12_hashfunc' field defines the underlying hash function. The following values have been identified: PBA - Password-based authentication The user accessing the archive will be authenticated by deriving an 8-byte sequence from his/her password. The parameters for the derivation function are: The derivation is checked against 'pba_checksum'. PBE - Password-based encryption Once the user is identified, 2 new values are derived from the password with different parameters to produce the IV and the key decryption key, with the same hash function: The parameters specific to user key are: The user key needs to be truncated to a length of 'encryption_strength', as specified in bytes in the archive properties. The parameters specific to user IV are: Once the key decryption key and the IV are derived, 'files_key_ciphertext' is decrypted using AES CBC, with PKCS#7 padding. Identifying file streams The name of the MS-CFB stream is derived by shuffling the bytes from the 'file_id' field and then encoding the result as hexadecimal. The reordering is:
Initial  offset: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Shuffled offset: 3 2 1 0 5 4 7 6 8 9 10 11 12 13 14 15
The 16th byte is usually a NUL byte, hence the stream identifier is a 30-character-long string. Decrypting files The compressed stream is split in chunks of 512 bytes, each of them encrypted separately using AES CBS and the global archive encryption scheme. Decryption uses the global AES key (retrieved using the user credentials), and the global IV (retrieved from the deobfuscated archive metadata). The IV for each chunk is computed by: Each chunk is an independent stream and the decryption process involves end-of-stream handling even if this is not the end of the actual file. This is particularly important for the CTS handler. Note: this is not to be confused with CTR block cipher mode of operation with operates differently and requires a nonce. Decompressing files Compressed streams are zlib stream with default compression options and can be decompressed following the zlib format. Test cases Excluded for brevity, cf. https://www.beuc.net/zed/#test-cases. Conventions and references Feedback Feel free to send comments at beuc@beuc.net. If you have .zed files that you think are not covered by this document, please send them as well (replace sensitive files with other ones). The author's GPG key can be found at 8FF1CB6E8D89059F. Copyright (C) 2017 Sylvain Beucler Copying and distribution of this file, with or without modification, are permitted in any medium without royalty provided the copyright notice and this notice are preserved. This file is offered as-is, without any warranty.

Sylvain Beucler: dot-zed archive file format

TL,DR: I reverse-engineered the .zed encrypted archive format.
Following a clean-room design, I'm providing a description that can be implemented by a third-party.
Interested? :) (reference version at: https://www.beuc.net/zed/) .zed archive file format Introduction Archives with the .zed extension are conceptually similar to an encrypted .zip file. In addition to a specific format, .zed files support multiple users: files are encrypted using the archive master key, which itself is encrypted for each user and/or authentication method (password, RSA key through certificate or PKCS#11 token). Metadata such as filenames is partially encrypted. .zed archives are used as stand-alone or attached to e-mails with the help of a MS Outlook plugin. A variant, which is not covered here, can encrypt/decrypt MS Windows folders on the fly like ecryptfs. In the spirit of academic and independent research this document provides a description of the file format and encryption algorithms for this encrypted file archive. See the conventions section for conventions and acronyms used in this document. Structure overview The .zed file format is composed of several layers. Or as a diagram:
+----------------------------------------------------------------------------------------------------+
  .zed archive (MS-CBF)                                                                               
                                                                                                      
   stream #1                         stream #2                       stream #3...                     
  +------------------------------+  +---------------------------+  +---------------------------+      
    metadata (MS-OLEPS)               encryption (AES)               encryption (AES)                 
                                      512-bytes chunks               512-bytes chunks                 
    +--------------------------+                                                                      
      obfuscation (static key)        +-----------------------+      +-----------------------+        
      +----------------------+       -  compression (zlib)     -    -  compression (zlib)     -       
       _ctlfile (TLV)                                                                            ...  
      +----------------------+          +---------------+              +---------------+               
    +--------------------------+          file contents                  file contents                
                                                                                                      
    +--------------------------+     -  +---------------+      -    -  +---------------+      -       
      _catalog (TLV)                                                                                  
    +--------------------------+      +-----------------------+      +-----------------------+        
  +------------------------------+  +---------------------------+  +---------------------------+      
+----------------------------------------------------------------------------------------------------+
Encryption schemes Several AES key sizes are supported, such as 128 and 256 bits. The Cipher Block Chaining (CBC) block cipher mode of operation is used to decrypt multiple AES 16-byte blocks, which means an initialisation vector (IV) is stored in clear along with the ciphertext. All filenames and file contents are encrypted using the same encryption mode, key and IV (e.g. if you remove and re-add a file in the archive, the resulting stream will be identical). No cleartext padding is used during encryption; instead, several end-of-stream handlers are available, so the ciphertext has exactly the size of the cleartext (e.g. the size of the compressed file). The following variants were identified in the 'encryption_mode' field. STREAM This is the end-of-stream handler for: This end-of-stream handler is apparently specific to the .zed format, and applied when the cleartext's does not end on a 16-byte boundary ; in this case special processing is performed on the last partial 16-byte block. The encryption and decryption phases are identical: let's assume the last partial block of cleartext (for encryption) or ciphertext (for decryption) was appended after all the complete 16-byte blocks of ciphertext: In either case, if the full ciphertext is less then one AES block (< 16 bytes), then the IV is used instead of the second-to-last block. CTS CTS or CipherText Stealing is the end-of-stream handler for: It matches the CBC-CS3 variant as described in Recommendation for Block Cipher Modes of Operation: Three Variants of Ciphertext Stealing for CBC Mode. Empty cleartext Since empty filenames or metadata are invalid, and since all files are compressed (resulting in a minimum 8-byte zlib cleartext), no empty cleartext was encrypted in the archive. metadata stream It is named 05356861616161716149656b7a6565636e576a33317a7868304e63 (hexadecimal), i.e. the character with code 5 followed by '5haaaaqaIekzeecnWj31zxh0Nc' (ASCII). The format used is OLE Property Set (MS-OLEPS). It introduces 2 property names "_ctlfile" (index 3) and "_catalog" (index 4), and 2 instances of said properties each containing an application-specific VT_BLOB (type 0x0041). _ctlfile: obfuscated global properties and access list This subpart is stored under index 3 ("_ctlfile") of the MS-OLEPS metadata. It consists of: The ciphertext is encrypted with AES-CBC "STREAM" mode using 128-bit static key 37F13CF81C780AF26B6A52654F794AEF (hexadecimal) and the prepended IV so as to obfuscate the access list. The ciphertext is continuous and not split in chunks (unlike files), even when it is larger than 512 bytes. The decrypted text contain properties in a TLV format as described in _ctlfile TLV: Archives may include "mandatory" users that cannot be removed. They are typically used to add an enterprise wide recovery RSA key to all archives. Extreme care must be taken to protect these key, as it can decrypt all past archives generated from within that company. _catalog: file list This subpart is stored under index 4 ("_catalog") of the MS-OLEPS metadata. It contains a series of 'fileprops' TLV structures, one for each file or directory. The file hierarchy can be reconstructed by checking the 'parent_id' field of each file entry. If 'parent_id' is 0 then the file is located at the top-level of the hierarchy, otherwise it's located under the directory with the matching 'file_id'. TLV format This format is a series of fields : Value semantics depend on its Type. It may contain an uint32be integer, a UTF-16LE string, a character sequence, or an inner TLV structure. Unless otherwise noted, TLV structures appear once. Some fields are optional and may not be present at all (e.g. 'archive_createdwith'). Some fields are unique within a structure (e.g. 'files_iv'), other may be repeated within a structure to form a list (e.g. 'fileprops' and 'passworduser'). The following top-level types that have been identified, and detailed in the next sections: Some additional unidentified types may be present. _ctlfile TLV _catalog TLV Decrypting the archive AES key rsauser The user accessing the archive will be authenticated by comparing his/her X509 certificate with the one stored in the 'certificate' field using DER format. The 'files_key_ciphertext' field is then decrypted using the PKCS#1 v1.5 encryption mechanism, with the private key that matches the user certificate. passworduser An intermediary user key, a user IV and an integrity checksum will be derived from the user password, using the deprecated PKCS#12 method as described at rfc7292 appendix B. Note: this is not PKCS#5 (nor PBKDF1/PBKDF2), this is an incompatible method from PKCS#12 that notably does not use HMAC. The 'pkcs12_hashfunc' field defines the underlying hash function. The following values have been identified: PBA - Password-based authentication The user accessing the archive will be authenticated by deriving an 8-byte sequence from his/her password. The parameters for the derivation function are: The derivation is checked against 'pba_checksum'. PBE - Password-based encryption Once the user is identified, 2 new values are derived from the password with different parameters to produce the IV and the key decryption key, with the same hash function: The parameters specific to user key are: The user key needs to be truncated to a length of 'encryption_strength', as specified in bytes in the archive properties. The parameters specific to user IV are: Once the key decryption key and the IV are derived, 'files_key_ciphertext' is decrypted using AES CBC, with PKCS#7 padding. Identifying file streams The name of the MS-CFB stream is derived by shuffling the bytes from the 'file_id' field and then encoding the result as hexadecimal. The reordering is:
Initial  offset: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Shuffled offset: 3 2 1 0 5 4 7 6 8 9 10 11 12 13 14 15
The 16th byte is usually a NUL byte, hence the stream identifier is a 30-character-long string. Decrypting files The compressed stream is split in chunks of 512 bytes, each of them encrypted separately using AES CBS and the global archive encryption scheme. Decryption uses the global AES key (retrieved using the user credentials), and the global IV (retrieved from the deobfuscated archive metadata). The IV for each chunk is computed by: Each chunk is an independent stream and the decryption process involves end-of-stream handling even if this is not the end of the actual file. This is particularly important for the CTS handler. Note: this is not to be confused with CTR block cipher mode of operation with operates differently and requires a nonce. Decompressing files Compressed streams are zlib stream with default compression options and can be decompressed following the zlib format. Test cases Excluded for brevity, cf. https://www.beuc.net/zed/#test-cases. Conventions and references Feedback Feel free to send comments at beuc@beuc.net. If you have .zed files that you think are not covered by this document, please send them as well (replace sensitive files with other ones). The author's GPG key can be found at 8FF1CB6E8D89059F. Copyright (C) 2017 Sylvain Beucler Copying and distribution of this file, with or without modification, are permitted in any medium without royalty provided the copyright notice and this notice are preserved. This file is offered as-is, without any warranty.

29 November 2014

Dirk Eddelbuettel: CRAN Task Views for Finance and HPC now (also) on GitHub

The CRAN Task View system is a fine project which Achim Zeileis initiated almost a decade ago. It is described in a short R Journal article in Volume 5, Number 1. I have been editor / maintainer of the Finance task view essentially since the very beginning of these CRAN Task Views, and added the High-Performance Computing one in the fall of 2008. Many, many people have helped by sending suggestions or even patches; email continues to be the main venue for the changes. The maintainers of the Web Technologies task view were, at least as far as I know, the first to make the jump to maintaining the task view on GitHub. Karthik and I briefly talked about this when he was in town a few weeks ago for our joint Software Carpentry workshop at Northwestern. So the topic had been on my mind, but it was only today that I realized that the near-limitless amount of awesome that is pandoc can probably help with maintenance. The task view code by Achim neatly converts the very regular, very XML, very boring original format into somewhat-CRAN-website-specific html. Pandoc, being as versatile as it is, can then make (GitHub-flavoured) markdown out of this, and with a minimal amount of sed magic, we get what we need. And hence we now have these two new repos: Contributions are now most welcome by pull request. You can run the included converter scripts, it differs between both repos only by one constant for the task view / file name. As an illustration, the one for Finance is below.
#!/usr/bin/r
## if you do not have /usr/bin/r from littler, just use Rscript
ctv <- "Finance"
ctvfile  <- paste0(ctv, ".ctv")
htmlfile <- paste0(ctv, ".html")
mdfile   <- "README.md"
## load packages
suppressMessages(library(XML))          # called by ctv
suppressMessages(library(ctv))
r <- getOption("repos")                 # set CRAN mirror
r["CRAN"] <- "http://cran.rstudio.com"
options(repos=r)
check_ctv_packages(ctvfile)             # run the check
## create html file from ctv file
ctv2html(read.ctv(ctvfile), htmlfile)
### these look atrocious, but are pretty straight forward. read them one by one
###  - start from the htmlfile
cmd <- paste0("cat ", htmlfile,
###  - in lines of the form  ^<a href="Word">Word.html</a>
###  - capture the 'Word' and insert it into a larger URL containing an absolute reference to task view 'Word'
  "   sed -e 's ^<a href=\"\\([a-zA-Z]*\\)\\.html <a href=\"http://cran.rstudio.com/web/views/\\1.html\" '   ",
###  - call pandoc, specifying html as input and github-flavoured markdown as output
              "pandoc -s -r html -w markdown_github   ",
###  - deal with the header by removing extra  , replacing  ** with ** and **  with **:              
              "sed -e's/ //g' -e's/ \\*\\*/\\*\\*/g' -e's/\\*\\* /\\*\\* /g' -e's/ $/  /g' ",
###  - make the implicit URL to packages explicit
              "-e's ../packages/ http://cran.rstudio.com/web/packages/ g' ",
###  - write out mdfile
              "> ", mdfile)
system(cmd)                             # run the conversion
unlink(htmlfile)                        # remove temporary html file
cat("Done.\n")
I am quite pleased with this setup---so a quick thanks towards the maintainers of the Web Technologies task view; of course to Achim for creating CRAN Task Views in the first place, and maintaining them all those years; as always to John MacFarlance for the magic that is pandoc; and last but not least of course to anybody who has contributed to the CRAN Task Views.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

6 November 2014

Dirk Eddelbuettel: A Software Carpentry workshop at Northwestern

On Friday October 31, 2014, and Saturday November 1, 2014, around thirty-five graduate students and faculty members attended a Software Carpentry workshop. Attendees came primarily from the Economics department and the Kellogg School of Management, which was also the host and sponsor providing an excellent venue with the Allen Center on the (main) Evanston campus of Northwestern University. The focus of the two-day workshop was an introduction, and practical initiation, to working effectively at the shell, getting introduced and familiar with the git revision control system, as well as a thorough introduction to working and programming in R---from the basics all the way to advanced graphing as well as creating reproducible research documents. The idea of the workshop had come out of discussion during our R/Finance 2014 conference. Bob McDonald of Northwestern, one of this year's keynote speakers, and I were discussing various topic related to open source and good programming practice --- as well as the lack of a thorough introduction to either for graduate students and researcher. And once I mentioned and explained Software Carpentry, Bob was rather intrigued. And just a few months later we were hosting a workshop (along with outstanding support from Jackie Milhans from Research Computing at Northwestern). We were extremely fortunate in that Karthik Ram and Ramnath Vaidyanathan were able to come to Chicago and act as lead instructors, giving me an opportunity to get my feet wet. The workshop began with a session on shell and automation, which was followed by three session focusing on R: a core introduction, a session focused on function, and to end the day, a session on the split-apply-combine approach to data transformation and analysis. The second day started with two concentrated session on git and the git workflow. In the afternoon, one session on visualization with R as well as a capstone-alike session on reproducible research rounded out the second day.

Things that worked The experience of the instructors showed, as the material was presented and an effective manner. The selection of topics, as well as the pace were seen by most students to be appropriate and just right. Karthik and Ramnath both did an outstanding job. No students experienced any real difficulties installing software, or using the supplied information. Participants were roughly split between Windows and OS X laptops, and had generally no problem with bash, git, or R via RStudio. The overall Software Carpentry setup, the lesson layout, the focus on hands-on exercises along with instruction, the use of the electronic noteboard provided by etherpad and, of course, the tried-and-tested material worked very well.

Things that could have worked better Even more breaks for exercises could have been added. Students had difficulty staying on pace in some of the exercise: once someone fell behind even for a small error (maybe a typo) it was sometimes hard to catch up. That is a general problem for hands-on classes. I feel I could have done better with the scope of my two session. Even more cohesion among the topics could have been achieved via a single continually used example dataset and analysis.

Acknowledgments Robert McDonald from Kellogg, and Jackie Milhans from Research Computing IT, were superb hosts and organizers. Their help in preparing for the workshop was tremendous, and the pick of venue was excellent, and allowed for a stress-free two days of classes. We could not have done this without Karthik and Ramnath, so a very big Thank You to both of them. Last but not least the Software Carpentry 'head office' was always ready to help Bob, Jackie and myself during the earlier planning stage, so another big Thank You! to Greg and Arliss.