
Case 5a:

Data Integrity: The Problem of Big, Cloudy Data

Zheng is studying search engine optimization (SEO) techniques in scientific publishing for his PhD. Part of his dissertation will involve studying the effectiveness of previous SEO techniques to justify his new code and algorithms. Fortunately, his supervising professor has access to five years of data from a prestigious publisher of scientific and medical journals, along with the publisher's optimization code and a record of when it changed.

This treasure trove of data, however, is turning out to be a nightmare. The publisher employed multiple contractors (who in turn employed multiple subcontractors) who wrote code in two languages and stored data with two different cloud-based services. The publisher did not ultimately control these services, and they use different security and metadata management formats.

Zheng has spent two weeks attempting to sort the data alone into a manageable, chronological file with identical variable names. He has not even begun to sort through the changes to the code that produced this data, which has come in multiple files. In some cases, Zheng has had to deduce the logic himself, and he now feels that this data is a hopeless mess. His attempts to contact the original contractors have not been fruitful, as two of the companies have closed. When Zheng reported these difficulties to his supervisor, the response was:

"Zheng, you must figure it out; this is part of what you must do."

Zheng is tempted to scrap his efforts to assimilate the code and the data and just write a plausible-sounding story for the data acquisition section of his study.

As he opened his word processing program and began to write a new story of the data, Zheng thought:

"No one reads those sections, and no one is ever going to put as much effort into solving this mess as I already have. If the data show bad results, I will come back to this problem."

  Case Questions
  • What are the problems that Zheng faces?
  • What problems might arise for Zheng if he does not undertake the work to sort out this data mess?
  • Are there problems for Zheng's supervisor if Zheng writes a fictional account of his data acquisition?
     

Imagine you are a good friend and colleague of Zheng:

  • If he told you his plan, what advice would you give him?
  • What might be the risks and rewards of Zheng's current strategy?
  • What options does Zheng have that he may not have explored already?