Steganographic watermarking for documents and files

Posted by Monica Mellinger on

By: Keith McIver

During our studies, we all worried about committing plagiarism by failing to give proper credit. Now that we are professionals, the boot is usually on the other foot, and we are more likely to have our work stolen. Internal-use-only documents can be protected with hardware and software that keep them from leaving the building, but even intelligence agencies have to publish material for the public to see.

Watermarking adds to or alters data or its container without damaging its usability. Watermarks are usually unmissable: colorful shiny elements on banknotes, the flying bird hologram on credit cards, or thermochromic ink on checks. Steganography uses existing information as a cover for secrets, for example by pricking holes in letters on a page to spell out a message. Combining them yields steganographic watermarking: using data and containers to hide little notes that show they came from a certain person or firm.

The “majors” in every field employing chemical engineers can buy and enforce the use of many commercial tools to do this. The supermajors may go beyond this and commission their own unique tools. In any workplace, we might subtly mark compositions or collections of data to protect against (and possibly expose) the villainous workplace plagiarist.

Since responsible workplaces enforce policies against unapproved software, methods for people like us in entry-level positions must work without tools or configuration changes that would need permission from information services staff.

To put it as a problem statement: our watermarking methods must allow us to insert free text into documents in a way that makes the text difficult to find. The hidden text must not interfere with any use of our documents and, if detected, must be clearly innocuous. The methods must be purely defensive, not offensive.

As files go, the most common are the Microsoft Office formats for Word (*.DOCX), Excel (*.XLSX), and PowerPoint (*.PPTX). These have been the defaults since Office 2007 and can be considered standard everywhere except in specific legacy environments. Unlike their binary predecessors (*.DOC, *.XLS, *.PPT), these files are ZIP archives of XML parts in disguise (hence the “X” in the file extension), and once unpacked, those parts may be modified with a plain text editor.
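To see the container structure for yourself, here is a minimal Python sketch. It assumes Python happens to be installed and approved on your machine already. It builds a tiny stand-in archive in memory and lists what is inside; the part names are illustrative of what a real workbook holds, and `list_parts` and `make_demo_package` are hypothetical helper names of my own.

```python
import io
import zipfile


def list_parts(package_bytes: bytes) -> list[str]:
    """Return the names of the parts stored inside a ZIP-based Office file."""
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        return package.namelist()


def make_demo_package() -> bytes:
    """Build a tiny stand-in archive (NOT a valid workbook) for demonstration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr("xl/workbook.xml", "<workbook/>")
        package.writestr(
            "xl/worksheets/sheet1.xml", "<worksheet><sheetData/></worksheet>"
        )
    return buffer.getvalue()


for part_name in list_parts(make_demo_package()):
    print(part_name)
```

For a real file, passing the bytes of the saved workbook to `list_parts` should show a similar, longer listing.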

Below are two examples of this usage. In both, I use an Excel spreadsheet. Word and PowerPoint files may be treated the same way, allowing for their different user interfaces and defaults. Here are some example directions for Windows:

  1. Save your file to a specific location, for example the Desktop, and close the instance of Excel accessing it.
  2. Change the Windows settings to show file extensions, if they are not already shown.
  3. Add .zip to the end of the file name, so results.xlsx becomes
  4. Open the file by double clicking it.
  5. Go to .\xl\worksheets and copy sheet1.xml to your Desktop.
  6. Right-click it and select Open with > Notepad.
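  7. Between two elements (not inside a tag, and not before the <?xml ... ?> declaration on the first line), add <!-- original author is [me] on [date] --> and make sure the text of the comment does not itself contain two hyphens in a row.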
  8. Save sheet1.xml and copy it back into .\xl\worksheets, overwriting the old one.
  9. Close the folder windows.
  10. Rename the file again, removing the .zip from the end.
  11. Open it again in Excel to confirm that it has not been corrupted. If you are not presented with any warnings or errors from Excel and can see your data as it was, it worked.
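If a scripting environment is already permitted where you work, the manual steps above can be sketched in Python. This is only a sketch under assumptions: the sheet part is taken to be UTF-8 XML at `xl/worksheets/sheet1.xml` (as in step 5), the comment is appended after the closing root tag (legal XML, though I have not confirmed how every Office version treats it), and `add_comment` is a hypothetical helper name.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

SHEET_PART = "xl/worksheets/sheet1.xml"  # assumed location, as in step 5


def add_comment(package_bytes: bytes, note: str, part: str = SHEET_PART) -> bytes:
    """Return a copy of the archive with an XML comment appended to one part."""
    if "--" in note:
        raise ValueError("XML comments must not contain '--'")
    comment = f"<!-- {note} -->".encode("utf-8")
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as src:
        if part not in src.namelist():
            raise KeyError(f"{part} not found in archive")
        with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
            for item in src.infolist():
                data = src.read(item.filename)
                if item.filename == part:
                    # Append the comment after the closing root tag.
                    data = data.rstrip() + b"\n" + comment
                    # Raises ParseError if the part is no longer well-formed.
                    ET.fromstring(data)
                # Every other part is copied across unchanged.
                dst.writestr(item, data)
    return out.getvalue()
```

Reading the bytes of results.xlsx, passing them through `add_comment`, and writing the result to a new file would stand in for steps 3 through 10. Opening the result in Excel, as in step 11, remains the real test.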

This comment stays with the file, but it does not travel with the data if the data is copied into a new spreadsheet. Similarly, if your comment sits in the middle of a cell’s contents, it will be overwritten when that cell is changed. An alternative way to secretly watermark a file:

  1. In the Office program, find the Insert tab. In the Text group, click Text Box.
  2. Click somewhere out of the way, but not beyond the last used column or below the last used row.
  3. Type your watermark text, but do not press Enter or click anywhere.
  4. Press Ctrl+A to select all of it.
  5. In the Ribbon, find the Shape Format tab. In WordArt Styles, select Text Fill > No Fill.
  6. Press Esc. Your watermark should no longer be visible.

If a dishonest person were to select all and copy it to “their” file and save it, this invisible text would be picked up and copied too. You can play with the font size and rotation of the object to make it even less likely to be found by accident. Do not place it over the top of anything that will be selected or edited, or it will get in the way and be found out.
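To check later whether a suspect file still carries your mark, a short Python sketch can search every XML part of the archive for the watermark text. The assumptions here are mine: that the mark is stored as plain UTF-8 text in one piece (text-box text may be split across runs or XML-escaped, in which case a plain search would miss it), and `find_marker` is a hypothetical helper name. The demo archive is a stand-in built in memory, not a real workbook.

```python
import io
import zipfile


def find_marker(package_bytes: bytes, marker: str) -> list[str]:
    """Return the names of XML parts whose raw bytes contain the marker text."""
    needle = marker.encode("utf-8")
    hits = []
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        for name in package.namelist():
            # Only look at XML-based parts; skip images and other binaries.
            if name.endswith((".xml", ".rels")) and needle in package.read(name):
                hits.append(name)
    return hits


# Stand-in archive: one sheet part and one drawing part holding text-box text.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as package:
    package.writestr("xl/worksheets/sheet1.xml", "<worksheet><sheetData/></worksheet>")
    package.writestr("xl/drawings/drawing1.xml", "<a:t>original author is [me]</a:t>")

print(find_marker(demo.getvalue(), "original author is [me]"))
```

The same search covers both methods above, since it does not care whether the mark is an XML comment or invisible text-box text.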

The next most common file type is probably Adobe’s PDF. Since PDFs are usually not working files but are meant for reference and distribution, protection is best applied earlier, in whatever application the content comes from. Adobe Acrobat (not Reader) has digital rights management (DRM) tools to “protect” PDFs, but software to evade them has been around since at least 2001.

A little story. At an internship many years ago, a back-burner task was to create a process status display for a pilot line using OSIsoft PI. After using some of the reasonable clip art in the PI built-in library to depict the unit operations, I was sorely tempted to include some of the stranger images (“Colin after drinking the experimental soda” and “long haired freaky people need not apply”, for example) but decided against it. I did include some brown-on-brown text lurking in unused places of the process flow diagram saying “copyright [company], summer 20[xx]”. The pilot project ended up getting called off before going to production, so it’s unlikely my little notes will ever turn up in a courtroom.

Ultimately, though, watermarking should be as preventative as possible. Once someone gets burned for stealing credit that is rightly yours or your organization’s, word should get around. Your misattribution traps will not prevent the theft, but they should make “careful” plagiarists so paranoid that the effort required to assure themselves there’s no evidence left exceeds what it would take to do the work honestly.