Virginia Disc One Exploration
Posted by TRII in text-retrieval-and-search-engines
Introduction / Overview¶
Virginia Disc One was "the first large-scale distribution of test collections" used in Information Retrieval. The goal was to create a large test collection that hundreds of researchers could contribute to and utilize for work in the IR field. While many larger, more comprehensive collections have been created and distributed since VD1 first shipped in 1990, we thought it would be interesting (and fun!) to take a look at some of its contents and use them for future notebooks / articles.
In this article, we perform some basic web-scraping in order to obtain the contents of VD1. Then, we look at a small subset of the contents to extract a collection of example queries, documents, and relevancies. This extraction involves some simple file reading and parsing.
As a note, none of this notebook is intended to be particularly rigorous. The primary goal is to examine a new data set using some basic tools and see what we can get out of it. As such, much of the code and methods are fairly "raw", and mostly a first-pass / one-off attempt. Also, Jupyter / IPython notebooks do not have built-in spell-checking (which is annoying); while we try to catch most mistakes, some may remain given the limited attention intentionally devoted to this minor endeavor.
from urllib.request import urlopen
from urllib.error import URLError
from bs4 import BeautifulSoup as bs
import pickle
Fetch Full Directory¶
The copy of VD1 we found is hosted by Virginia Tech as a file directory structure. In addition to documents and queries, VD1 also contained programs for exploring the collections. Some of the collections are images as well.
To extract the full contents of VD1, we need to crawl the directory structure and copy files from the server to our local machine. There are quite a few files, so we need to make quite a few requests, resulting in a fairly lengthy retrieval time (about 30 minutes) for content weighing in at under 650MB. After retrieval, we pickle the disc for later use.
main_url = "http://fox.cs.vt.edu/VAD1/"

def getLinksFromPage(url):
    try:
        page = bs(urlopen(url), 'html.parser')
        links = [li.find('a')['href'] for li in page.find_all('li')
                 if li.text.strip() != 'Parent Directory']
        links = [li for li in links if li not in ['/', '/VAD1/']]
        links = [url + li for li in links]
        return(links)
    except URLError:
        print(url)
        return([])
def processLink(url):
    if url.endswith('/'):
        return(processDir(url))
    elif url.find('.') > -1:
        return(processFile(url))
    else:
        return({})
def processFile(url):
    try:
        file = urlopen(url).readlines()
        ftype = url.split('.')[-1]
        return({'url' : url,
                'contents' : file,
                'type' : ftype})
    except URLError:
        print(url)
        return({})
def processDir(url):
    links = getLinksFromPage(url)
    contents = [processLink(li) for li in links]
    return({'url' : url,
            'type' : 'DIR',
            'contents' : contents})
%%time
contents = processLink(main_url)
with open('VirginiaDiskOne.pkl', 'wb') as f:
    pickle.dump(contents, f)
Below we list all of the files and directories under the main "VAD" directory. The "BOOKLET.TXT" file is probably the most interesting file at the top level, as it contains descriptions of the disc's various contents. Among the directories, the "DOWN" folder is of particular interest, as it contains several Query - Document collections. This will be covered later.
It would likely be rather futile to attempt an install of the disc contents on any current OS.
for c in contents['contents']:
    if c['type'] != 'DIR':
        print({'url' : c['url'],
               'type' : c['type']})
for c in contents['contents']:
    if c['type'] == 'DIR':
        print({'url' : c['url'],
               'total_content' : len(c['contents'])})
Re-Build Directory Locally¶
The current dictionary structure for VD1 is a bit difficult to navigate and use. Thus, we reproduce the directory and file structure locally. The Python "os" module handles this task quite well. IPython Magics can also be used for creating and navigating directory structures, but we stick with Python here. Note that the code below is essentially the inverse of the code above, except that the "urllib" module has been swapped out for Python's standard file I/O methods.
import os
def writeFile(file):
    name = file['url'].split('/')[-1]
    contents = file['contents']
    with open(name, 'wb') as f:
        f.writelines(contents)

def writeDir(directory):
    name = directory['url'].strip('/').split('/')[-1]
    os.mkdir(name)
    # Dive down a layer...
    os.chdir(name)
    # Write ALL the things....
    contents = directory['contents']
    writeContents(contents)
    # Go back to where you came from...
    os.chdir(os.pardir)

def writeContents(contents):
    for content in contents:
        # Cannot rely on the 'type' attribute here, since some files also
        # end up typed as 'DIR'; the trailing slash is the reliable marker.
        if content['url'].endswith('/'):
            writeDir(content)
        else:
            writeFile(content)

writeContents(contents['contents'])
ls
Extract Names of Text Test Collections¶
In this section, we extract content first from the "BOOKLET.TXT" file and one of the sample collections. The "BOOKLET.TXT" document is essentially a Master Readme file, and points to various programs and collections contained on the disc.
One of these collections is the National Library of Medicine Test Collection, located in the "DOWN" sub-directory. This is the collection we extract content from below for later use.
Find the Names of all Collections from 'BOOKLET.TXT'¶
Due to how the file (and all extracted files) was read, the 'BOOKLET.TXT' document is a list of individual lines, each stored as bytes rather than as a string (though the lines are rendered as strings in the notebook). We will convert the lines to Unicode before moving forward, which can be done with the 'decode' method. While calling "str(line)" for each line would also technically convert the line to a string, several artifacts from the byte encoding would remain. This is shown below. Additionally, we will remove blank lines from the document.
with open('BOOKLET.TXT', 'rb') as f:
    booklet = f.readlines()
booklet[:10]
This is what the resulting object looks like when we call "str" with the line as an argument. Notice that Python simply encodes the entire bytes representation as a string -- b'' prefix and escape characters included -- rather than attempting to be fancy and strip that content away.
str(booklet[2])
booklet[2].decode('utf-8').strip()
lines = [line.decode('utf-8').strip() for line in booklet]
lines = [line for line in lines if line]
lines[:4]
Let's start with the "Downloadable Data Collections" as they are a bit more straight-forward.
The code below extracts the name of each collection, as well as its location in the VD1 directory. Descriptions are also contained within this file, but we do not extract them at this time. The interested reader can view the file at their leisure.
start_line = 'Public Domain Data (mostly in \\down):'
end_line = 'DVI (tm) Image and Video Presentation Data'
target_lines = []
start_flag = False
for line in lines:
    if line == start_line:
        start_flag = True
    elif line == end_line:
        break
    if start_flag:
        target_lines.append(line)

locations = {}
for i, line in enumerate(target_lines):
    if line[0] == '*':
        name = line.strip('*').strip().split('(')[0].strip()
        if line.find('(\\') > -1:
            loc = line.split('(\\')[1].strip().strip(')')
            loc = '/' + loc
        else:
            try:
                line = target_lines[i+1]
                loc = line.split('(\\')[1].strip().split(')')[0].strip()
                loc = '/' + loc
            except IndexError:
                print(line)
        loc = loc.replace('\\', '/').upper()
        locations[name] = loc
locations
Extract National Library of Medicine Test Collection¶
This collection contains three Query - Response datasets, each for a different general area -- "Science", "Health", and "Medicine".
Queries range from fairly simple one-liners to fairly lengthy descriptions. See two examples below.
- "Medicaid and Medicare vs. health insurance and their role in health care delivery."
- "HTLVIII- AIDS-related lymphomas and other malignancies, T helper, cell deficiency. Hodgkin's disease, molecular biology, gene rearrangements, biochemical genetics, cell of origin, etiology, Reed-Sternberg cell. T-cell acute lymphocytic leukemia (ALL) - in children. Immunophenotype, surface markers, t-cell receptors, gene rearrangements, chromosomal translocations of t-cell receptors. HTLVI- (ATL) adult T cell leukemia, IL-2 and IL-2 receptor as it relates, biology of infection of HTLVI. Role of retroviruses in human lymphomas. Burkitt's lymphoma."
Each query is contained in a file with a set of abstracts, each of which is either a positive or negative match for the query. Positive / negative match quality is indicated in a set of separate "Relevncy" files, one for each query.
As it stands currently, this collection is not very user-friendly in terms of being able to use the collection in order to evaluate an IR system. Our goal here is to extract the relevant information from the files and convert the data into a more usable form.
os.chdir('Dropbox/Analytics/Text Retrieval and Search Engines/VAD1')
nlm_loc = locations['National Library of Medicine test collection'][1:]
nlm_loc
os.chdir(nlm_loc)
main_dirs = [f for f in os.listdir(os.curdir) if os.path.isdir(f)]
main_dirs
os.listdir(main_dirs[0])
os.listdir(os.path.join(main_dirs[0], 'QUERIES1'))[:5]
Extract Query, Abstract Contents¶
First, we will take a look at the query files. Each of these files (supposedly) contains a single query along with a set of documents related to that query, either in a 'positive' or 'negative' manner. The 'positive' or 'negative' factor is contained in a separate file, which is covered in the next section. Below, an example of one such document is shown. After the file, we cover the list of attributes we wish to extract for each file, and each document within each file.
with open(os.path.join(main_dirs[0], 'QUERIES1', 'BSR03'), 'rb') as f:
    bsr = f.readlines()
bsr[:50]
General format:
- Header row with the query name
- The row containing the actual query begins with the query ID
- 'UI' denotes the doc IDs associated with each query
- A general index also appears to be associated, but it does not seem to be used elsewhere, so it will be ignored; instead, for each document we will extract
- UI --> Doc ID
- AU --> Authors
- TI --> Document Title
- AB --> Document abstract
- The document text in this case
- MH --> Appears to be a set of keywords associated with the document
- SO --> Publication / Journal in which the paper corresponding to the abstract appeared
In some cases, some documents do not contain all of these elements. In such cases, we remove / ignore those documents and do not extract anything from them. This will result in a different count for the number of documents extracted and the number of relevancy scores extracted.
Some notes about the code below. For each file, processing begins with the first few lines in a "query extraction" mode. Specific markers, each at the beginning of a line, are looked for while processing each document. Note that the files are carefully formatted so that non-section-break lines begin with whitespace. Thus there is no concern about accidentally moving through sections prematurely.
In transitioning from the query to the documents (and into "document extraction mode"), the code looks for the first subsequent line beginning with a numeric character. Once this milestone is reached, document breaks are flagged by this same feature.
Within a document, six (6) sections are present for a complete document. Each of these sections is flagged with a two-letter abbreviation indicating which section has been encountered. This makes parsing documents fairly straightforward when all sections are present. In some cases, not all sections are present; we account for errors arising from such incomplete documents below, treating them as errors and ignoring the documents.
After all of the lines in a file have been addressed, each document is revisited and the various components are cleaned up a bit. This is the point at which incomplete and empty documents are removed.
def firstCharIntCheck(line):
    # True when the line's first character is a digit
    try:
        int(line[0])
        return(True)
    except ValueError:
        return(False)
def processQueryDoc(doc, topic):
    prefix_lookup = {'HEALTH' : 'HSR',
                     'SCIENCE' : 'BSR',
                     'MEDICINE' : 'CMR'}
    doc_sections = ['UI', 'AU', 'TI', 'AB', 'MH', 'SO']
    query = {}
    abstracts = {}
    doc = [line.decode('utf-8') for line in doc]
    label = doc[0].strip().strip('=')
    nbr = str(int(doc[0].strip().strip('=').strip(prefix_lookup[topic])))
    doc = doc[1:]
    query['topic'] = topic
    query['id'] = nbr
    query_flag = False
    doc_flag = False
    doc_section = None
    cur_a = {}
    temp = []
    for line in doc:
        if not query_flag and not doc_flag:
            if line.startswith(nbr + '.'):
                query_flag = True
                temp = []
                line = line.strip(nbr + '.').strip()
                temp.append(line)
            else:
                raise AttributeError('Something\'s wrong here...')
        elif query_flag:
            if firstCharIntCheck(line):
                # End of the query, beginning of the first abstract.
                # Write out query content and move on.
                query['content'] = temp.copy()
                query_flag = False
                doc_flag = True
                cur_a = {}
                temp = []
            else:
                temp.append(line.strip())
        elif doc_flag:
            if firstCharIntCheck(line):
                # We've reached the beginning of a new abstract while an
                # active abstract is stored as 'cur_a'. First, write the
                # last section of the previous abstract to 'cur_a'; then
                # store the active abstract in the 'abstracts' dict.
                # The abstract id is stored as cur_a['UI'].
                try:
                    cur_a[doc_section] = temp.copy()
                    cur_a['UI'] = int(cur_a['UI'][0])
                    abstracts[cur_a['UI']] = cur_a
                except KeyError:
                    pass
                # Clear out everything
                doc_section = None
                cur_a = {}
                temp = []
            elif line[:2] in doc_sections:
                # Reached the beginning of a new section
                if doc_section:
                    # Currently in a section; store its contents and
                    # start a new list for the new section
                    cur_a[doc_section] = temp.copy()
                    temp = []
                # Add data to the new section
                doc_section = line[:2]
                line = line[2:]
                temp.append(line.strip().strip('-').strip())
            else:
                # In the core of a section; no flags to update
                temp.append(line.strip())
    # Clean up at the end of the file
    try:
        cur_a[doc_section] = temp.copy()
        cur_a['UI'] = int(cur_a['UI'][0])
        abstracts[cur_a['UI']] = cur_a
        doc_section = None
        cur_a = {}
        temp = []
    except KeyError:
        pass
    # Post Process
    if not doc_flag:
        # Only the query in the file, no docs (yes, this happens...)
        query['content'] = temp.copy()
        query_flag = False
        doc_flag = True
        cur_a = {}
        temp = []
    query['content'] = ' '.join(query['content'])
    # Clean up document formatting and remove empty / incomplete docs
    purge_docs = []
    for docid, abstract in abstracts.items():
        try:
            abstract['AB'] = ' '.join(abstract['AB'])
            abstract['AU'] = [au.strip() for au in ';'.join(abstract['AU']).split(';') if au]
            abstract['MH'] = [s.strip() for s in ';'.join(abstract['MH']).split(';') if s]
            abstract['SO'] = abstract['SO'][0]
            abstract['TI'] = abstract['TI'][0]
            abstract['QueryID'] = label
            abstract['Topic'] = topic
        except KeyError:
            purge_docs.append(docid)
    for pd_id in purge_docs:
        _ = abstracts.pop(pd_id)
    return(query, abstracts)
query, abstracts = processQueryDoc(bsr, 'SCIENCE')
query
abstracts[86252676]
Below we apply the extraction code to all query files in the NLM collection. The extracted Queries and Documents are then stored for later use.
primary_path = '..../Analytics/VAD1/DOWN/NLM/'
queries = []
abstracts = []
for md in main_dirs:
    for sub_dir in os.listdir(os.path.join(primary_path, md)):
        if sub_dir.find('QUERIES') > -1:
            for fp in os.listdir(os.path.join(primary_path, md, sub_dir)):
                with open(os.path.join(primary_path, md, sub_dir, fp), 'rb') as f:
                    r = f.readlines()
                try:
                    q, ab = processQueryDoc(r, md)
                    queries.append(q)
                    # 'ab' is a dict keyed by doc id; extend with its values
                    # so the abstract contents (not just the ids) are kept
                    abstracts.extend(ab.values())
                except (KeyError, ValueError, UnboundLocalError) as e:
                    print(e)
                    print(os.path.join(md, sub_dir, fp))
print(len(queries))
print(len(abstracts))
os.mkdir('vad1')
os.chdir('vad1/')
with open('queries.pkl', 'wb') as f:
    pickle.dump(queries, f)
with open('abstracts.pkl', 'wb') as f:
    pickle.dump(abstracts, f)
Extract Abstract "Relevncy"¶
os.listdir(os.path.join(main_dirs[0], 'RELEVNCY'))[:5]
'BSR03.UI' in set(os.listdir(os.path.join(main_dirs[0], 'RELEVNCY')))
with open(os.path.join(main_dirs[0], 'RELEVNCY', 'BSR03.UI'), 'rb') as f:
    bsrr = f.readlines()
bsrr[:6]
The newline indicators make the file look confusing. The last "\n" on each line is simply indicating a newline, while the character of interest is the "y" or "n" directly before that, which indicates whether the particular document included in the query file is relevant or not relevant.
These files are much easier to extract data from. Each line other than the header lists the Document ID and the "Relevant / Not Relevant" indicator.
For a handful of cases, the query has neither positive nor negative documents associated with it. For such files, which contain a "no retrieval" line, no content is extracted.
def processRelevncyDoc(doc):
    doc = [line.decode('utf-8') for line in doc]
    label = doc[0].strip().strip('=')
    doc = doc[1:]
    if doc[0].find('no retrieval') > -1:
        relevncys = []
    else:
        relevncys = [{'query' : label,
                      'abstract' : int(line.split()[0].strip()),
                      'relevant' : line.split()[1].strip()}
                     for line in doc if line.strip()]
    return(relevncys)
relev = processRelevncyDoc(bsrr)
relev[:5]
Again, at this point we apply the above code to all Relevncy files present in the collection. These scores are stored for later use.
primary_path = '..../Analytics/VAD1/DOWN/NLM/'
relevncys = []
for md in main_dirs:
    for sub_dir in os.listdir(os.path.join(primary_path, md)):
        if sub_dir.find('RELEVNCY') > -1:
            for fp in os.listdir(os.path.join(primary_path, md, sub_dir)):
                with open(os.path.join(primary_path, md, sub_dir, fp), 'rb') as f:
                    r = f.readlines()
                try:
                    rels = processRelevncyDoc(r)
                    relevncys.extend(rels)
                except (KeyError, ValueError, IndexError) as e:
                    print(e)
                    print(os.path.join(md, sub_dir, fp))
len(relevncys)
with open('relevncys.pkl', 'wb') as f:
    pickle.dump(relevncys, f)
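With the judgments pickled, a quick sanity check is to tally the "y" / "n" labels per query. Below is a minimal sketch of one way to do this; the sample records are made-up illustrations in the same shape as the output of processRelevncyDoc above, and in practice the list would be loaded from "relevncys.pkl".

```python
from collections import Counter

def relevance_counts(relevncys):
    # Tally the 'y' / 'n' judgments for each query label, using the
    # record shape produced by processRelevncyDoc above.
    counts = {}
    for rel in relevncys:
        counts.setdefault(rel['query'], Counter())[rel['relevant']] += 1
    return counts

# Made-up sample records (hypothetical doc ids); the real list would
# come from the 'relevncys.pkl' file written above.
sample = [{'query': 'BSR03', 'abstract': 86252676, 'relevant': 'y'},
          {'query': 'BSR03', 'abstract': 11111111, 'relevant': 'n'},
          {'query': 'BSR04', 'abstract': 22222222, 'relevant': 'y'}]
relevance_counts(sample)
# -> {'BSR03': Counter({'y': 1, 'n': 1}), 'BSR04': Counter({'y': 1})}
```

Per-query tallies like these are a convenient starting point for computing precision / recall against a retrieval run later on.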
Summary¶
This article involved copying a directory structure and its contents from the web, reproducing the content and structure locally, and extracting, for future use, a Query - Document collection (the National Library of Medicine Test Collection) used for Information Retrieval method development.
The content in question was none other than Virginia Disc One, the first text collection of its kind: it was mass-distributed to various players in the field of IR to kick-start the use of uniform test collections for developing and advancing IR methods.