Detecting Software Theft
Software Source Code Review

A software source code review expert typically engages in three types of litigation: intellectual property cases, products liability cases, and contract disputes. In intellectual property cases, a software source code review can help determine whether a system or device practices the claimed elements of a patent. Also in intellectual property cases, the expert can determine whether software source code has been copied or stolen; in other words, whether software theft has occurred. In products liability cases, the expert can determine whether software source code is at the root of an incident with a machine that resulted in injury to a person or damage to property. In contract disputes, a software source code review can help determine whether the terms of a contract have been breached.

The process of detecting stolen software generally falls into two stages. In the first, the entity that believes its software has been misappropriated becomes suspicious for some reason and engages a lawyer, who may or may not file a lawsuit. The software source code expert is generally not involved until after this step, because a lawsuit, or the expressed possibility of one, is typically required before the entity accused of software theft will produce the subject software. In the second stage the expert reviews the produced code, and the bulk of the expert's effort is generally focused on detecting matching software elements. I have found that this step is best followed in a structured fashion.

1. Make sure you have the right code, and all of the code - Modern software can involve many different third-party framework products. These software packages can bring hundreds of thousands of lines of source into the code base. Unique software that was written specifically for the subject applications can also add hundreds of thousands of lines of source to the code base. Sometimes the core intellectual property at issue is encapsulated within a few thousand or even a few hundred lines of source code. It is relatively easy for the party accused of stealing software to "mistakenly" fail to produce these lines of source code. Detecting the absence of this code can be quite difficult for the expert evaluating the accused software. Requiring that the subject software be produced in a format that can be compiled and executed is one good way of raising confidence that all of the subject source code has been produced.

2. Overview the code base - Identify the language or languages used, the third-party software tools and frameworks, the general structure of the code base, and, at a high level, what the code does. This step is important for the software expert to form an understanding of the source code, but it can be of less value if the subject software has been restructured in order to obfuscate evidence of software plagiarism.

3. Folder structure, file names, total number of files - As with the general overview of the code base, this step is important for the software expert to form an understanding of the source code, but it can be of less value if the subject software has been restructured in order to hide evidence of software theft.

4. Text matches between code bases - Matching misspellings, matching jargon, grammatical errors, and unusual capitalization can all be very strong indicators of software theft. There are a number of publicly available software packages that will generate statistics on the degree of matching between two code bases, and I have also written software to generate matching statistics. There are also plenty of academic papers describing various techniques for software fingerprinting. This is really the only part of the stolen software detection process that can be easily automated, and in my experience as an expert it is generally of limited value, because people who steal software will inevitably go to some effort to hide the evidence of the theft. It is rather easy to change text, rename variables, use different third-party framework software, change the source language, etc. so that the statistics indicate very low levels of text matching. Rarely do these statistics provide any meaningful insight with regard to software plagiarism.

5. Words that do not belong - Text references to the plaintiff's company in the accused code can be a pretty strong indicator of software theft.

6. Database debris - It is not unusual for modern software to include a database, and it is not unusual for people who steal software to try to remove data that would indicate software theft. Deleting data from databases, however, can sometimes leave "orphans" that were not deleted. These orphans can be invisible when looking at the data in the native development software that was used to create the application, but they can be found with database management tools. Sometimes using these database management tools will illuminate evidence of software theft that is not evident in the native development tools.

7. Matching ordered lists and enumerations

8. Matching common computer algorithms and data structures

9. Matching domain-specific algorithms (predictive maintenance, well bore temperature profile predictions)
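
To illustrate step 4, here is a minimal Python sketch of the two kinds of text comparison described there: an overall matching statistic, and a list of unusual tokens (such as misspellings) shared by both code bases. It is not any particular commercial or academic tool. The function names, the identifier pattern, and the idea of filtering against a caller-supplied set of ordinary words are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: identifier-like words are a reasonable unit of comparison.
TOKEN_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokens(source: str) -> Counter:
    """Count identifier-like tokens in a source string (case preserved)."""
    return Counter(TOKEN_RE.findall(source))


def jaccard(plaintiff_src: str, accused_src: str) -> float:
    """Overall statistic: shared distinct tokens / all distinct tokens."""
    a, b = set(tokens(plaintiff_src)), set(tokens(accused_src))
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def shared_rare_tokens(plaintiff_src: str, accused_src: str,
                       common_words: set) -> list:
    """Tokens found in both code bases that are not ordinary words.

    common_words is a hypothetical, caller-supplied set of lowercase
    keywords and dictionary words; whatever survives the filter
    (misspellings, jargon) still has to be reviewed by a person.
    """
    shared = set(tokens(plaintiff_src)) & set(tokens(accused_src))
    return sorted(t for t in shared if t.lower() not in common_words)
```

Given a plaintiff file containing `recieve_buffer` and `temprature` and an accused file that reuses both misspellings, `shared_rare_tokens` would return exactly those two tokens once ordinary words are filtered out. The `jaccard` figure can be driven down simply by renaming variables, which is why the shared oddities tend to be the more telling output.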
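
The orphan search in step 6 can be sketched in Python with the standard `sqlite3` module. This is only an illustration under stated assumptions: the table and column names (`customers`, `orders`, `customer_id`) are hypothetical, the database is assumed to be SQLite, and a real matter would involve whatever database engine and schema the subject software uses.

```python
import sqlite3


def find_orphans(conn, child, fk, parent, pk):
    """Return child rows whose foreign key matches no row in the parent table.

    Table and column names are interpolated into the SQL, so they must
    come from the reviewer, never from untrusted input.
    """
    query = (
        f'SELECT c.* FROM "{child}" AS c '
        f'LEFT JOIN "{parent}" AS p ON c."{fk}" = p."{pk}" '
        f'WHERE c."{fk}" IS NOT NULL AND p."{pk}" IS NULL'
    )
    return conn.execute(query).fetchall()


def demo():
    """Build a toy database in which a parent row was deleted but its child was not."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY,
                             customer_id INTEGER, note TEXT);
        INSERT INTO customers VALUES (1, 'Accused Co'), (2, 'Plaintiff Co');
        INSERT INTO orders VALUES (10, 1, 'current order'),
                                  (11, 2, 'legacy order');
        -- Simulate a cleanup that removed the parent but missed the child.
        DELETE FROM customers WHERE id = 2;
    """)
    return find_orphans(conn, "orders", "customer_id", "customers", "id")
```

In this toy case `demo()` returns the single order row that still points at the deleted customer. An application that only displays orders by joining them to existing customers would never show that row, which is the sense in which orphans can be invisible in the native tools.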
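
For step 7, one possible way to compare two ordered lists or enumerations is to normalize the names and measure how many entries appear in the same relative order. The sketch below uses `difflib.SequenceMatcher` from the Python standard library. The normalization rule (lowercase, underscores removed) and the example state names in the usage note are assumptions for illustration only.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Fold case and drop underscores so CLAMP_PART and ClampPart compare equal."""
    return name.lower().replace("_", "")


def ordered_match(list_a, list_b):
    """Compare two ordered lists of names.

    Returns (in_order, ratio): the number of entries that line up in the
    same relative order, and SequenceMatcher's overall similarity ratio.
    """
    a = [normalize(x) for x in list_a]
    b = [normalize(x) for x in list_b]
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # get_matching_blocks() ends with a zero-length sentinel block,
    # which contributes nothing to the sum.
    in_order = sum(block.size for block in matcher.get_matching_blocks())
    return in_order, matcher.ratio()
```

For example, comparing a hypothetical plaintiff enumeration `IDLE, HOMING, CLAMP_PART, WELD, UNCLAMP, FAULT` with an accused enumeration `Idle, Homing, ClampPart, Weld, Eject, Unclamp, Fault` reports that all six plaintiff entries appear in the same order, despite the renaming style and the one inserted entry. A match like this is a lead to investigate rather than a conclusion, since some orderings are dictated by the machine or process itself.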

For code that controls machines and devices, we can generally divide the software into embedded software and automation software. As the name implies, embedded systems have computing elements completely embedded in the device, typically with no interface of the kind usually associated with a computer. A modern drilling tool is a good example of an embedded device where the software is downloaded into the tool. A digital camera is another example of a device with embedded software. The languages C, C++, VHDL and sometimes assembly language are often found in embedded systems.

The software found in automation systems further subdivides into higher-level software that choreographs general activities in the factory and lower-level software that controls the activities of individual machines. The higher-level software is often called factory automation software or Supervisory Control And Data Acquisition (SCADA) software, and it typically runs on a personal computer or server. The computing languages C, C++ and sometimes Visual Basic are often found in factory automation and SCADA systems. When browsers are employed as a user interface in these systems, we might find many other languages, including HTML, XML, JavaScript, ASP, and ASP.NET.

The lower-level source code that controls the activities of individual machines sometimes runs on a personal computer, but much more often it will run on a Programmable Logic Controller, or PLC. A PLC contains computing elements very much like the desktop computer you use every day. It is "hardened" for the factory environment but has a microprocessor, non-volatile memory and perhaps network connections. The parts of the PLC that are quite different from the typical desktop computer are the input and output modules. These modules allow the PLC to communicate with the machine. The PLC is programmed with software source code, but the languages used are typically specialized for use with PLCs. These are the five programming languages of the IEC 61131-3 standard: structured text, function block diagram, ladder diagram, instruction list and sequential function chart.

I have extensive experience with all of the software source code languages discussed above for programming machines and can support your litigation efforts in that regard as a software source code review expert. My qualifications include peer-reviewed publications and over thirty years of engineering experience with software, robotics, instrumentation, medical devices, computer-controlled machines and factory automation.

Software, Robotics and Computer Controlled Machines

Manufacturing Software	Robot & Machine Kinematics	Motion Control
Process Control	Machine Control Software	Medical Robotics
Instrumentation	Factory Automation	Electro-Mechanical Engineering
Robotics Software	Automatic Doors	Automatic Guided Vehicle
Programmable Logic Controller	Machine Safety	Automatic Test Equipment
Machine Control	Ladder Logic Software
