Judy Hochberg, Automatic identification of classified documents

CERIAS Weekly Security Seminar - Purdue University

Summary: How can one automatically identify classified documents? This is a vital question for the Department of Energy (DOE), which is reviewing millions of classified documents for possible declassification, and for Los Alamos National Laboratory (LANL), which is checking its unclassified computing storage systems for the presence of classified documents.

The DOE, having already developed an expert rule system for automatic document classification, provided LANL with a small set of documents with which to explore a statistical classifier as an alternative. We represented documents as vectors of character trigram frequencies, used a chi-square statistic to select the optimal trigrams, and trained a linear classifier to distinguish classified from unclassified documents. Results ranged from 60% to 87% accuracy, depending on the training set size and other variables.

In contrast, the LANL effort started "from scratch" and needed to be moved rapidly into large-scale production. We implemented an expert system tailored to the classified documents of most concern to LANL. The talk will discuss the practical issues that arose in canvassing large numbers of files in a variety of formats, and the security issues involved in the sampling, analysis, and notification processes.

About the speaker: Judy Hochberg is a staff scientist at Los Alamos National Laboratory. She received a B.A. in linguistics from Harvard and a Ph.D. in linguistics from Stanford. Before joining the Laboratory in 1989, she was a post-doctoral researcher at the University of Chicago, then a visiting Assistant Professor at Northwestern University. She has published in journals including Computers and Security, IEEE Transactions on Pattern Analysis and Machine Intelligence, and Language. She has been an R&D 100 award winner and a national finalist in the Johns Hopkins National Search for Computing to Assist Persons with Disabilities. Judy is interested in all manifestations of human language, including document analysis -- text and images -- and speech.
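
Illustration: the statistical approach in the summary (character trigram frequencies, chi-square trigram selection, a linear classifier) might look roughly like the Python sketch below. This is not the LANL implementation. The abstract does not say which linear classifier was trained or how the chi-square statistic was computed, so the sketch assumes a chi-square test on trigram presence versus class label and substitutes a simple centroid-difference rule as the linear classifier. All function names and the six toy "documents" are invented for illustration.

```python
from collections import Counter

def trigram_counts(text):
    """Count overlapping character trigrams in lowercased text."""
    t = text.lower()
    return Counter(t[i:i + 3] for i in range(len(t) - 2))

def chi_square(docs, labels, gram):
    """Chi-square statistic of the 2x2 table: trigram present/absent vs. label."""
    a = b = c = d = 0
    for doc, y in zip(docs, labels):
        present = gram in doc
        if present and y == 1:
            a += 1          # present, flagged
        elif present:
            b += 1          # present, not flagged
        elif y == 1:
            c += 1          # absent, flagged
        else:
            d += 1          # absent, not flagged
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return len(docs) * (a * d - b * c) ** 2 / denom if denom else 0.0

def select_features(docs, labels, k):
    """Keep the k trigrams with the highest chi-square score (ties: alphabetical)."""
    vocab = set().union(*docs)
    return sorted(vocab, key=lambda g: (-chi_square(docs, labels, g), g))[:k]

def vectorize(doc, features):
    """Relative frequency of each selected trigram in one document."""
    total = sum(doc.values()) or 1
    return [doc[g] / total for g in features]

def train_centroid(X, y):
    """Assumed stand-in linear classifier: w = mean(flagged) - mean(unflagged),
    with the decision boundary through the midpoint of the two class means."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    pos = mean([x for x, lab in zip(X, y) if lab == 1])
    neg = mean([x for x, lab in zip(X, y) if lab == 0])
    w = [p - n for p, n in zip(pos, neg)]
    b = -sum(wi * (p + n) / 2 for wi, p, n in zip(w, pos, neg))
    return w, b

def predict(text, features, w, b):
    """Return 1 (flag for review) if the linear score is positive, else 0."""
    x = vectorize(trigram_counts(text), features)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Invented toy corpus: label 1 = "flag for review", 0 = "do not flag".
texts = [
    "project zeta device design notes",
    "zeta device test schedule",
    "design review for zeta device",
    "cafeteria lunch menu for friday",
    "parking lot closed friday",
    "lunch seminar on friday",
]
labels = [1, 1, 1, 0, 0, 0]

docs = [trigram_counts(t) for t in texts]
features = select_features(docs, labels, k=8)
X = [vectorize(d, features) for d in docs]
w, b = train_centroid(X, labels)

print(features)
print([predict(t, features, w, b) for t in texts])
```

On a corpus this small every selected trigram separates the two classes perfectly, so the sketch says nothing about the 60% to 87% accuracy range reported in the talk; it only shows how the three steps fit together.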