Friday, 11 November 2016

PowerShell v3 Check File Headers

The following function is a quick way to validate file signatures. The files in this project were delivered to us with a single, incorrect extension. With about 3.8 million files to process, there was no way I wanted to check them manually. So I narrowed the expected headers down to the usual suspects for a project like this (.pdf and .tif) and wrote this function:
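function Check-Header
{
       param(
             $path
       )
      
       # Hexadecimal signatures for expected files, quoted so they are stored
       # as strings (a bare 49492A00 is not a valid numeric literal)
       $pdf = "25504446";
       $TIFF_1 = "492049";
       $TIFF_2 = "49492A00";
       $TIFF_3 = "4D4D002A";
       $TIFF_4 = "4D4D002B";
            
       # Get content of each file (up to 4 bytes) for analysis.
       # -Encoding Byte is Windows PowerShell syntax; PowerShell 6+ uses -AsByteStream.
       $HeaderAsHexString = ""
       ([Byte[]] $fileheader = Get-Content -Path $path -TotalCount 4 -Encoding Byte) |
       ForEach-Object {
             # Pad single-digit values so every byte becomes two hex characters
             if(("{0:X}" -f $_).length -eq 1)
             {
                   $HeaderAsHexString += "0{0:X}" -f $_
             }
             else
             {
                   $HeaderAsHexString += "{0:X}" -f $_
             }
       }
      
       # Validate file header. $TIFF_1 is only 3 bytes long, so it is matched
       # as a prefix; the 4-byte signatures must match exactly.
       (@($pdf, $tiff_2, $tiff_3, $tiff_4) -contains $HeaderAsHexString) -or
       ($HeaderAsHexString -like "$TIFF_1*")
}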
This function does a few things:

  1. Takes a file path argument
  2. Declares five known signatures (these are the headers we want files to have)
  3. Reads the first 4 bytes of the file into a [Byte[]] array
  4. Pipes this byte array through a simple if/else statement that converts each byte to a two-digit hexadecimal string
  5. Checks whether any of the known good signatures matches the converted file header
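As a side note, steps 3 and 4 can also be sketched with a .NET file stream and the "X2" format specifier, which pads each byte to two hex digits without an if/else. This is only an illustrative variant, not the function used in the project; the name Get-HeaderHex and its parameters are made up for the example.

```powershell
# Hypothetical helper: returns the first $Count bytes of a file as an
# uppercase hex string, e.g. "25504446" for a file starting with %PDF.
function Get-HeaderHex
{
    param(
        [string] $Path,
        [int] $Count = 4
    )

    # .NET resolves relative paths against the process directory, so hand it
    # a fully resolved path instead
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    $stream = [System.IO.File]::OpenRead($fullPath)
    try
    {
        $buffer = New-Object byte[] $Count
        $read = $stream.Read($buffer, 0, $Count)
    }
    finally
    {
        $stream.Dispose()
    }

    # Empty file: nothing to convert
    if ($read -eq 0)
    {
        return ""
    }

    # "X2" formats each byte as exactly two hex digits
    -join ($buffer[0..($read - 1)] | ForEach-Object { "{0:X2}" -f $_ })
}
```

The result can then be compared the same way as in the function above, for example `@("25504446", "49492A00") -contains (Get-HeaderHex -Path $file)`, where `$file` is whatever path is being checked.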
If the -contains operator finds that one of the known signatures matches our header, the function returns true. If it does not find a match, it returns false. On a directory of 1024 files this took just over 3.9 seconds on my test server. If I can get a straight run, I anticipate my 3.8 million file collection will take just a shade more than 4 hours. I will be doing some other manipulation as well, so the real run will be considerably slower, but cases like this go to show that there is no substitute for a good automated solution.
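For what it is worth, here is a rough sketch of how a check like this might be run over a whole folder, followed by the arithmetic behind the 4 hour estimate. The helper name Get-InvalidHeaderFile, its parameters, and the example path are assumptions made for illustration; by default it calls the Check-Header function from this post.

```powershell
# Hypothetical helper: lists the files under $Directory that fail the check.
function Get-InvalidHeaderFile
{
    param(
        [string] $Directory,
        # Defaults to the Check-Header function defined earlier in this post
        [scriptblock] $Validator = { param($file) Check-Header -path $file }
    )

    Get-ChildItem -LiteralPath $Directory -File -Recurse |
    Where-Object { -not (& $Validator $_.FullName) } |
    Select-Object -ExpandProperty FullName
}

# Example call (the path is a placeholder):
# Get-InvalidHeaderFile -Directory 'D:\Project\Incoming'

# Extrapolate the full run from the sample timing quoted above
$sampleFiles   = 1024
$sampleSeconds = 3.9
$totalFiles    = 3800000
$estimate = [TimeSpan]::FromSeconds($totalFiles / $sampleFiles * $sampleSeconds)
$estimate.TotalHours   # roughly 4.02
```

The extrapolation assumes the per-file cost stays constant, which ignores disk caching and any extra processing done per file.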

lamsim
