A software source code review expert typically engages in three types of litigation: intellectual property cases, products liability cases, and contract disputes. In intellectual property cases, the software source code
review can help determine whether a system or device practices the claimed elements of a patent.
Also in intellectual property cases, the software source code review expert can
determine whether software source code has been copied or stolen, in
other words, whether software theft has occurred. In products liability cases, the software source code review
expert can determine whether software source code is at the
root of an incident with a machine that resulted in an injury to a
person or damage to property. In contract disputes, the software source code review can help determine whether
the terms of a contract have been breached.
The process of detecting stolen
software generally falls into two phases. In the first, the entity
that believes its software has been misappropriated becomes
suspicious for some reason and engages a lawyer, who may or may not
file a lawsuit. The software source code expert is
not generally involved until after this step because a lawsuit, or
the expressed possibility of one, is typically required before the
entity accused of software theft will produce the subject software.
In the second phase, the bulk of the software source code expert's effort is generally
focused on detecting matching software elements. I have found that
this work is best carried out in a structured fashion.
1. Make sure you have the right code, and all of the code
- Modern software can involve many different third-party
framework products. These software packages can bring hundreds of
thousands of lines of source into the code base. Unique software
that was written specifically for the subject applications can also
add hundreds of thousands of lines of source into the code base.
Sometimes, the core intellectual property at issue is encapsulated
within a few thousand or even a few hundred lines of source code. It
is relatively easy for the party accused of stealing software to
"mistakenly" fail to produce these lines of source code.
Detecting the absence of this code can be quite difficult for the
expert evaluating the accused software. Requiring
that the subject software be produced in a format that can be compiled and
executed is one good way of raising confidence that all of the
subject source code has been produced.
2. Survey the code base - Which languages are used, which third-party software tools and frameworks are present, the general structure of the code base, and a high-level analysis of what the code does.
- This step is important for the software expert to form an understanding of
the source code, but it can be of less value if the subject software
has been restructured in order to obfuscate evidence of software
plagiarism.
3. Folder structure, file names, total number of files - As with the
general overview of the code base, this step is important for the
software expert to form an understanding of the source code, but it can be of
less value if the subject software has been restructured in order to
hide evidence of software theft.
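A first pass over steps 2 and 3 can be partly scripted. The following is a minimal Python sketch rather than a forensic tool. It assumes both code bases have been unpacked into local folders, and the function names `inventory` and `compare_trees` are illustrative, not taken from any particular product. File extensions serve only as a rough proxy for the languages in use.

```python
import os
from collections import Counter


def inventory(root):
    """Return the set of relative file paths (forward slashes) under root."""
    paths = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            paths.add(rel.replace(os.sep, "/"))
    return paths


def compare_trees(root_a, root_b):
    """Summarize file counts, extensions, and overlapping names for two trees."""
    files_a, files_b = inventory(root_a), inventory(root_b)
    # Compare bare file names case-insensitively, ignoring the folder they sit in.
    names_a = {p.rsplit("/", 1)[-1].lower() for p in files_a}
    names_b = {p.rsplit("/", 1)[-1].lower() for p in files_b}
    return {
        "file_count": (len(files_a), len(files_b)),
        "extensions_a": Counter(os.path.splitext(p)[1].lower() for p in files_a),
        "extensions_b": Counter(os.path.splitext(p)[1].lower() for p in files_b),
        "same_relative_path": sorted(files_a & files_b),
        "same_file_name": sorted(names_a & names_b),
    }
```

File names that survive a reorganization of the folder structure show up under `same_file_name` even when `same_relative_path` is empty, which is why the sketch reports both.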
4. Text matches between code bases - Matching misspellings, matching jargon, grammatical errors,
and unusual capitalization can all be very strong indicators of
software theft. There are a number of publicly available software
packages that will generate statistics as to the degree of matching
between the two code bases, and I've also written software to
generate matching statistics. There are also plenty of academic
papers describing various techniques for software fingerprinting.
This is really the only part of the stolen software detection process
that can be easily automated, and in my experience it is generally of
limited value. This is because people who steal software will
inevitably go to some effort to hide the evidence of the theft. It's
rather easy to change text, rename variables, use different third-party
framework software, change the source language, etc. to make the statistics indicate very low
levels of text matching. Rarely do these statistics provide any
meaningful insight with regard to software plagiarism.
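To make step 4 concrete, here is a minimal Python sketch of the kind of statistic these tools produce, plus a filter for shared oddities such as misspellings. It is an illustration under stated assumptions, not the software mentioned above: the function names are hypothetical, and `expected_vocabulary` is assumed to be a reviewer-supplied set of language keywords, library names, and correctly spelled words.

```python
import re

# Identifier-like tokens: a letter or underscore followed by letters, digits, underscores.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def tokens(source_text):
    """Return the set of identifier-like tokens, with case preserved."""
    return set(TOKEN.findall(source_text))


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 = disjoint, 1.0 = identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def shared_unusual_tokens(text_a, text_b, expected_vocabulary):
    """Tokens present in both texts but absent from the expected vocabulary.

    Matching is case-sensitive so that unusual capitalization counts as
    a distinguishing feature rather than being normalized away.
    """
    return sorted((tokens(text_a) & tokens(text_b)) - set(expected_vocabulary))
```

The Jaccard figure is exactly the sort of number that drops toward zero once variables are renamed, which is the limitation described above. The short list returned by `shared_unusual_tokens` is usually more informative, because a misspelling that appears in both code bases is hard to explain by coincidence.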
5. Words that do not belong - Text references to the
plaintiff's company in the accused code can be a strong indicator of software
theft.
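A search for step 5 is simple to script. The sketch below is a hypothetical helper, assuming the reviewer supplies the list of terms (company name, product names, employee initials, and so on). It accepts any iterable of lines, such as an open file.

```python
import re


def find_terms(lines, terms):
    """Yield (line_number, term, line) for each case-insensitive hit."""
    patterns = [(t, re.compile(re.escape(t), re.IGNORECASE)) for t in terms]
    for number, line in enumerate(lines, start=1):
        for term, pattern in patterns:
            if pattern.search(line):
                yield number, term, line.rstrip("\n")
```

Comments, string literals, and resource files deserve the same search as the code itself, since those are the places such references tend to be overlooked.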
6. Database debris - It is not unusual for modern software to
include a database, and it is not unusual for people who steal
software to try to remove data that would indicate software theft.
Deleting data from databases, however, can sometimes leave
"orphans" that were not deleted. These orphans can be
invisible when looking at the data in the native development
software that was used to create the application, but they can be
found with database management tools. Using these database
management tools will sometimes reveal evidence of software theft that
is not evident in the native development tools.
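One common form of orphan is a child row whose parent record has been deleted. The Python sketch below shows how such rows can be listed with a plain SQL query. It assumes, purely for illustration, that a copy of the database is available in SQLite; the helper name `find_orphans` and any table or column names passed to it are hypothetical, and other database engines would need their own connection code.

```python
import sqlite3


def find_orphans(conn, child, fk_column, parent, pk_column="id"):
    """Return child rows whose foreign key matches no row in the parent table.

    Table and column names are interpolated into the SQL text, so they
    must come from the reviewer and never from untrusted input.
    """
    query = (
        f'SELECT c.* FROM "{child}" AS c '
        f'LEFT JOIN "{parent}" AS p ON c."{fk_column}" = p."{pk_column}" '
        f'WHERE c."{fk_column}" IS NOT NULL AND p."{pk_column}" IS NULL'
    )
    return conn.execute(query).fetchall()
```

For example, if customer records were deleted but the rows that referred to them were left behind, `find_orphans(conn, "site", "customer_id", "customer")` would return the surviving `site` rows. The query should be run against a forensic copy, never the produced original.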
7. Matching ordered lists and enumerations
8. Matching common computer algorithms and data structures
9. Matching domain-specific algorithms (predictive maintenance, well bore temperature profile predictions)
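For step 7, the order of items in an enumeration or list is often arbitrary, so two code bases that list the same items in the same order merit a closer look. The following Python sketch uses the standard library's `difflib.SequenceMatcher` to find runs of entries that appear in the same order in both lists. The helper names and the normalization rule are assumptions made for illustration, and the sketch does not detect entries that were renamed outright.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase and drop non-alphanumerics so STATE_IDLE matches StateIdle."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def ordered_overlap(list_a, list_b):
    """Return (ratio, runs): a similarity ratio and the same-order runs from list_a."""
    a = [normalize(x) for x in list_a]
    b = [normalize(x) for x in list_b]
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    # Each matching block is a stretch of entries in identical order in both lists.
    runs = [
        list_a[m.a:m.a + m.size]
        for m in matcher.get_matching_blocks()
        if m.size
    ]
    return matcher.ratio(), runs
```

Long runs matter more than the overall ratio: a single inserted or removed entry splits a run but leaves the surrounding order intact, which is the pattern to look for when a list has been lightly edited.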
For code that controls machines and devices, we can generally divide the software into embedded
software and automation software. As the name implies, embedded systems have computing elements completely embedded in the device, typically with none of the interfaces usually associated with a computer. A modern drilling tool is a good example of an embedded device where the software is downloaded into the tool. A digital camera is another example of a device with embedded software. The languages C, C++, VHDL, and sometimes assembly language are often found in embedded systems.
The software found in automation systems further subdivides into higher-level software that choreographs general activities in the factory and lower-level software that controls the activities of individual machines. The higher-level software is often called factory automation software or Supervisory Control And Data Acquisition (SCADA) software, and it typically runs on a personal computer or server. The languages C, C++, and sometimes Visual Basic are often found in factory automation and SCADA systems. When browsers are employed as a user interface in these systems, we might find many other languages and technologies, including HTML, XML, JavaScript, ASP, and ASP.NET.
The lower-level source code that controls the activities of individual machines sometimes runs on a personal computer, but much more often it runs on a Programmable Logic Controller, or PLC. A PLC contains computing elements very much like the desktop computer you use every day. It is "hardened" for the factory environment but has a microprocessor, non-volatile memory, and perhaps network connections. The parts of the PLC that are quite different from the typical desktop computer are the input and output modules. These modules allow the PLC to communicate with the machine. The PLC is programmed with software source code, but the languages used are typically specialized for use with PLCs. These are the five programming languages standardized in IEC 61131-3: structured text, function block diagram, ladder diagram, instruction list, and sequential function chart.