by Kevin Schroeder | 12:00 am

Most PHP developers are used to dealing with files: files that are uploaded, downloaded, and so on. If we work with data files, they are usually in the form of XML, CSV, or something similar. But what if the files that users were uploading and downloading contained information that you wanted to get at? Say that you were hosting MP3 files on your website that people could upload. You might want to read the ID3 information that states who holds the copyright. Or perhaps people were uploading Word documents and you wanted the author information. Sometimes there is a PHP library available to read a given file format, but more often than not there isn't. The purpose of this chapter is to get you started reading and understanding binary files. Even if you aren't using them directly in your application, knowing how to read them is a good exercise, since there is a good chance that at some point you will need to work with them. Even if it's only for a one-off script that does some basic data transformation, knowing how to access binary files is worthwhile, and it is something a lot of PHP developers never do.

In this chapter we go through the basics of accessing structured files. We start with TAR files, move on to WAV files, and then write a read-only interface to an EXT2 file system. You'll never do that in a production environment, but by looking at it you might learn a bunch of things. Plus, a lot of PHP developers write their applications without any understanding of how they will affect storage; the EXT2 file system example will help with that. Then we wrap up by writing our own binary file format for storing linked lists durably.

   Chapter 1: Networking and Sockets
   Chapter 2: Binary Protocols
   Chapter 3: Character Encoding
   Chapter 4: Streams
   Chapter 5: SPL
   Chapter 6: Asynchronous Operations with Some Encryption Thrown In
   Chapter 7: Structured File Access
   Chapter 8: Daemons
   Chapter 9: Debugging, Profiling, and Good Development
   Chapter 10: Preparing for Success

Structured File Access

In the desktop world, structured files are relatively commonplace. But in the web world we tend not to deal with them very much. The reason for this is that we usually end up dealing with structured data via a database. Often it doesn’t make much sense for a web developer to store data structured according to a proprietary format. A database does for us what we generally need to do. Additionally, we tend to work with string data as opposed to binary data, which is something that structured data files tend to use more.

But there are times when knowing how to figure out the internal format of a file can be useful. Other times, being able to write to those files, or even write your own format, could be beneficial. And, as with networking, it is just good to have an understanding of how to do things that aren’t in your regular tool belt.

If you are not familiar with structured files, read this chapter slowly. There is a lot of detail and it is very easy to get lost, so take breaks and try writing out some of the code yourself. Don’t expect to understand it all at once; this chapter actually took me a very long time to write. In fact, it would probably be a good idea to read each section separately and intersperse other chapters between them. This chapter will probably be the most difficult one to get through, so take your time.

Tar Files

To start with, let’s look at some file formats with open standards. Often files will have a file header that contains metadata about the file. Depending on the file, this metadata could be a file version number, author, bitrate, or any number of other parameters.

The tar format is short for “tape archive”. It was initially used for the purpose of storing backup data on tape drives but, as any developer who touches a Linux system knows, it has expanded well beyond that use.

The tar format is a relatively simple format that allows individual files to be stored in one file for easy transport. The use of gzip has become virtually synonymous with tar, though we will not look at that in great depth simply because gzipping a tar file is just the simple act of taking the raw tar file and compressing it.
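Since a gzipped tar file is only a compressed copy of the raw tar bytes, PHP's zlib functions can hand those bytes back directly. The following is a minimal sketch, assuming the zlib extension is loaded. The helper readFirstBlock() and the throwaway demo file are hypothetical and exist only for illustration.

```php
<?php
// Hypothetical helper: return the first 512 uncompressed bytes of a
// gzipped file. Assumes the zlib extension is available.
function readFirstBlock($path)
{
    $gz = gzopen($path, 'rb');
    if ($gz === false) {
        return false;
    }
    // gzread() returns uncompressed data, i.e. the raw tar bytes.
    $block = gzread($gz, 512);
    gzclose($gz);
    return $block;
}

// Demo: 512 NUL-padded bytes stand in for a real tar header block.
$tmp  = tempnam(sys_get_temp_dir(), 'tgz');
$fake = str_pad('example.txt', 512, "\0");
file_put_contents($tmp, gzencode($fake));

$block = readFirstBlock($tmp);
echo strlen($block), "\n";                       // 512
echo rtrim(substr($block, 0, 100), "\0"), "\n";  // example.txt
unlink($tmp);
```

Everything that follows in this section works on the uncompressed bytes, so it does not matter whether they came from fread() or gzread().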

Let’s first look at an existing tar file containing the source code for PHP 5.2.11. The tar file headers are actually just simple text strings but they are stored in a structured format. In other words, they are just text strings, but they are fixed length text strings, similar to a CHAR text field in SQL.
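To make the CHAR comparison concrete, here is a small sketch of writing and reading a fixed-length text field. It assumes the field is padded with NULL bytes, which is what the header output later in this section suggests; the variable names are illustrative only.

```php
<?php
// Store a name in a fixed 100-byte field, padded with NUL bytes.
$field = str_pad('php-5.2.11/', 100, "\0");
echo strlen($field), "\n";        // 100

// Reading it back: strip the padding to recover the original text.
$name = rtrim($field, "\0");
echo $name, "\n";                 // php-5.2.11/
```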

Before going into the actual file itself, here is the structure of a file header record.

 

Offset  Size  Description
0       100   File Name
100     8     File Mode (permissions)
108     8     Numeric User ID
116     8     Numeric Group ID
124     12    File size in bytes
136     12    Last modified Unix timestamp
148     8     Header checksum
156     1     Record Type
157     100   Linked file name

Figure 7.1 Tar header record format
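The checksum field deserves a short illustration. In the standard tar format it holds the sum of all 512 header bytes, with the eight checksum bytes themselves counted as spaces, stored as an octal string. The sketch below builds a synthetic header to test against; headerChecksum() and checksumMatches() are hypothetical helper names, and the header builder is not a complete tar writer.

```php
<?php
// Sum every byte of a 512-byte header, counting the 8 checksum bytes
// (offset 148) as ASCII spaces.
function headerChecksum($block)
{
    $block = substr_replace($block, str_repeat(' ', 8), 148, 8);
    $sum = 0;
    for ($i = 0; $i < 512; $i++) {
        $sum += ord($block[$i]);
    }
    return $sum;
}

// Hypothetical helper: compare the stored octal checksum with the
// freshly computed one.
function checksumMatches($block)
{
    $stored = octdec(trim(substr($block, 148, 8)));
    return $stored === headerChecksum($block);
}

// Build a synthetic directory header for demonstration purposes only.
$header = str_pad('demo/', 100, "\0")   // name
        . "0000755\0"                   // mode
        . "0000000\0" . "0000000\0"     // owner, group
        . "00000000000\0"               // size
        . "00000000000\0"               // timestamp
        . str_repeat(' ', 8)            // checksum placeholder
        . '5';                          // type: directory
$header = str_pad($header, 512, "\0");

// Six octal digits followed by a NUL and a space fill the 8-byte field.
$cs     = sprintf('%06o', headerChecksum($header)) . "\0 ";
$header = substr_replace($header, $cs, 148, 8);

var_dump(checksumMatches($header));     // bool(true)
$header[0] = 'X';                       // corrupt one byte
var_dump(checksumMatches($header));     // bool(false)
```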

The Record Type can be one of 7 different values.

 

Value  Type
0      Regular File
1      Unix link
2      Unix symbolic link
3      Character Device (virtual terminal, modem, COM1)
4      Block Device (disk partition, CD-ROM drive)
5      Directory
6      FIFO or named pipe

Figure 7.2 Tar record types
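In code, the table above becomes a simple lookup on the one-byte type field. This is a sketch only. The function name describeRecordType() is hypothetical, and the NULL-byte case reflects older tar archives that mark regular files that way.

```php
<?php
// Translate the one-byte record type into the descriptions of Figure 7.2.
function describeRecordType($flag)
{
    $types = array(
        '0' => 'Regular File',
        '1' => 'Unix link',
        '2' => 'Unix symbolic link',
        '3' => 'Character Device',
        '4' => 'Block Device',
        '5' => 'Directory',
        '6' => 'FIFO or named pipe',
    );
    // Assumption: older archives may store a NUL byte for regular files.
    if ($flag === "\0") {
        $flag = '0';
    }
    return isset($types[$flag]) ? $types[$flag] : 'Unknown';
}

echo describeRecordType('5'), "\n";   // Directory
echo describeRecordType('0'), "\n";   // Regular File
echo describeRecordType('x'), "\n";   // Unknown
```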

Reading a single header record is quite easy, as we’ll show in the following code. The tar block size is 512 bytes and so even though we only use about 250 bytes, we read the entire 512 byte block. As you look at more structured files, this block based approach will be a very common occurrence.

$fh = fopen('php-5.2.11.tar', 'r');
$fields = readHeader($fh);
foreach ($fields as $name => $value) {
    // trim() strips the NULL padding as well as whitespace
    $value = trim($value);
    echo "{$name}: {$value}\n";
}

function readHeader($resource)
{
    // Tar works in 512-byte blocks, so read the whole block
    $data = fread($resource, 512);
    return strunpack(
        '100name/8mode/8owner/8group/'
        . '12size/12ts/8cs/1type/100link', $data);
}

function strunpack($format, $data)
{
    $return = array();
    $fieldLengths = explode('/', $format);
    foreach ($fieldLengths as $lens) {
        // Each entry is "<length><name>": strip the digits to get the name
        $name = preg_replace('/^\d+/', '', $lens);
        $lens = (int)$lens;
        if (ctype_alpha($name)) {
            $return[$name] = substr($data, 0, $lens);
        } else {
            $return[] = substr($data, 0, $lens);
        }
        // Drop the bytes just consumed
        $data = substr($data, $lens);
        if (strlen($data) === 0) {
            break;
        }
    }
    return $return;
}

Figure 7.3 Reading the Tar header

Most of this code is simply there to make it easier to read string-based data. Used with its per-byte format codes, unpack() returns an array of individual characters rather than full strings, which gets cumbersome when dealing with anything beyond simple string operations. That is the purpose of the strunpack() function: it returns each fixed-length field as a single named string in one record. You might think that you could use something like fscanf(), but %s does not like NULL characters, and a tar file contains many of them. So most of this code is here to handle reading the file information, but we will use it a fair amount later on.

The output for this code is

name: php-5.2.11/
mode: 0000755
owner: 0026631
group: 0024461
size: 00000000000
ts: 11261402465
cs: 012435
type: 5
link:
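The numeric fields in this output are octal numbers stored as text, which is why the mode reads 0000755. A short sketch of turning them into usable values follows, using the values printed above. The $skip calculation assumes the standard tar behaviour of padding file data to whole 512-byte blocks.

```php
<?php
// Field values as printed above (already trimmed).
$fields = array(
    'mode' => '0000755',
    'size' => '00000000000',
    'ts'   => '11261402465',
);

// The numbers are octal strings, so octdec() converts them to integers.
$mode = octdec($fields['mode']);
$size = octdec($fields['size']);
$ts   = octdec($fields['ts']);

printf("mode: %o\n", $mode);                              // mode: 755
printf("size: %d bytes\n", $size);                        // size: 0 bytes
echo 'modified: ', gmdate('Y-m-d H:i:s', $ts), " UTC\n";  // modified: 2009-10-02 13:50:45 UTC

// Bytes of (padded) file data between this header and the next one.
$skip = (int) (ceil($size / 512) * 512);
echo "bytes to skip: {$skip}\n";                          // bytes to skip: 0
```

A directory record has no data of its own, so its size is zero and the next header follows immediately.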
