immersinn-ds

Mon 24 October 2016

Virginia Disc One Exploration

Posted by TRII in text-retrieval-and-search-engines   

Introduction / Overview

Virginia Disc One (VD1) was "the first large-scale distribution of test collections" used in Information Retrieval. The goal was to create a large test collection that hundreds of researchers could contribute to and utilize for work in the IR field. While many larger, more comprehensive collections have been created and distributed since VD1 was first released in 1990, we thought it would be interesting (and fun!) to take a look at some of its contents and use them for future notebooks / articles.

In this article, we perform some basic web-scraping in order to obtain the contents of VD1. Then, we look at a small subset of the contents to extract a collection of example queries, documents, and relevancies. This extraction involves some simple file reading and parsing.

As a note, none of this notebook is intended to be particularly rigorous. The primary goal is to examine a new data set using some basic tools and see what we can get out of it. As such, much of the code and methods are fairly "raw", and mostly a first-pass / one-off attempt. Also, Jupyter / IPython notebooks do not have built-in spell-checking (which is annoying); while we try to catch most mistakes, some may still remain given the limited attention intentionally devoted to this minor endeavor.

In [78]:
from urllib.request import urlopen
from urllib.error import URLError
from bs4 import BeautifulSoup as bs
In [116]:
import pickle

Fetch Full Directory

The copy of VD1 we found is hosted by Virginia Tech as a file directory structure. In addition to documents and queries, VD1 also contains programs for exploring the collections, and some of the collections consist of images.

To extract the full contents of VD1, we need to crawl the directory structure and copy files from the server to our local machine. There are quite a few files, so we need to make quite a few requests, resulting in a fairly lengthy retrieval time (about 30 minutes) for content weighing in at under 650MB. After retrieval, we pickle the disc for later use.

In [10]:
main_url = "http://fox.cs.vt.edu/VAD1/"
In [134]:
def getLinksFromPage(url):
    try:
        page = bs(urlopen(url), 'html.parser')
        links = [li.find('a')['href'] for li in page.find_all('li') \
                 if li.text.strip() != 'Parent Directory']
        links = [li for li in links if li not in ['/', '/VAD1/']]
        links = [url + li for li in links]
        return(links)
    except URLError:
        print(url)
        return([])

def processLink(url):
    if url.endswith('/'):
        return(processDir(url))
    elif url.find('.') > -1:
        return(processFile(url))
    else:
        return({})
    
def processFile(url):
    try:
        file = urlopen(url).readlines()
        ftype = url.split('.')[-1]
        return({'url' : url,
                'contents' : file,
                'type' : ftype})
    except URLError:
        print(url)
        return({})

def processDir(url):
    links = getLinksFromPage(url)
    contents = [processLink(li) for li in links]
    return({'url' : url,
            'type' : 'DIR',
            'contents' : contents})
In [135]:
%%time
contents = processLink(main_url)
CPU times: user 46.1 s, sys: 11 s, total: 57.1 s
Wall time: 29min 53s
In [137]:
with open('VirginiaDiskOne.pkl', 'wb') as f:
    pickle.dump(contents, f)
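
Should we need the disc contents again in a later session, the pickled structure can simply be reloaded. A minimal sketch (assuming the pickle file written above is in the working directory):

import pickle

# Reload the crawled directory structure saved above
with open('VirginiaDiskOne.pkl', 'rb') as f:
    contents = pickle.load(f)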

Below we list all of the files and directories under the main "VAD" directory. The "BOOKLET.TXT" file is probably the most interesting file at the top level, as it contains descriptions of the various contents of the disc. Among the directories, the "DOWN" folder is of particular interest to us, as it contains several Query - Document collections; this will be covered later.

It would likely be rather futile to attempt an install of the disc contents on any current OS.

In [139]:
for c in contents['contents']:
    if c['type'] != 'DIR':
        print({'url' : c['url'],
               'type' : c['type']})
{'type': 'TXT', 'url': 'http://fox.cs.vt.edu/VAD1/ABSTRACT.TXT'}
{'type': 'TXT', 'url': 'http://fox.cs.vt.edu/VAD1/BOOKLET.TXT'}
{'type': 'TXT', 'url': 'http://fox.cs.vt.edu/VAD1/COPYRGHT.TXT'}
{'type': 'BAT', 'url': 'http://fox.cs.vt.edu/VAD1/DINSTALL.BAT'}
{'type': 'TXT', 'url': 'http://fox.cs.vt.edu/VAD1/DONATION.TXT'}
{'type': 'BAT', 'url': 'http://fox.cs.vt.edu/VAD1/INSTALL.BAT'}
{'type': 'BAT', 'url': 'http://fox.cs.vt.edu/VAD1/INSTWPL.BAT'}
{'type': 'TXT', 'url': 'http://fox.cs.vt.edu/VAD1/LICENCES.TXT'}
{'type': 'BAT', 'url': 'http://fox.cs.vt.edu/VAD1/VADISC.BAT'}
In [140]:
for c in contents['contents']:
    if c['type'] == 'DIR':
        print({'url' : c['url'],
               'total_content' : len(c['contents'])})
{'url': 'http://fox.cs.vt.edu/VAD1/ACMACAD/', 'total_content': 20}
{'url': 'http://fox.cs.vt.edu/VAD1/ACMCOMP/', 'total_content': 20}
{'url': 'http://fox.cs.vt.edu/VAD1/ACMPOL/', 'total_content': 20}
{'url': 'http://fox.cs.vt.edu/VAD1/ACMPUB/', 'total_content': 19}
{'url': 'http://fox.cs.vt.edu/VAD1/AILIST/', 'total_content': 23}
{'url': 'http://fox.cs.vt.edu/VAD1/AILISTZ/', 'total_content': 23}
{'url': 'http://fox.cs.vt.edu/VAD1/AITOPIC/', 'total_content': 30}
{'url': 'http://fox.cs.vt.edu/VAD1/DIALOG/', 'total_content': 19}
{'url': 'http://fox.cs.vt.edu/VAD1/DOWN/', 'total_content': 12}
{'url': 'http://fox.cs.vt.edu/VAD1/DVI/', 'total_content': 7}
{'url': 'http://fox.cs.vt.edu/VAD1/FACTS/', 'total_content': 19}
{'url': 'http://fox.cs.vt.edu/VAD1/FAIRS/', 'total_content': 13}
{'url': 'http://fox.cs.vt.edu/VAD1/GEOGPOP/', 'total_content': 3}
{'url': 'http://fox.cs.vt.edu/VAD1/GUIDE/', 'total_content': 178}
{'url': 'http://fox.cs.vt.edu/VAD1/GUIDE86/', 'total_content': 22}
{'url': 'http://fox.cs.vt.edu/VAD1/HASHING/', 'total_content': 3}
{'url': 'http://fox.cs.vt.edu/VAD1/IMAGES/', 'total_content': 238}
{'url': 'http://fox.cs.vt.edu/VAD1/IPIS/', 'total_content': 27}
{'url': 'http://fox.cs.vt.edu/VAD1/IRLIST/', 'total_content': 19}
{'url': 'http://fox.cs.vt.edu/VAD1/KJVC/', 'total_content': 19}
{'url': 'http://fox.cs.vt.edu/VAD1/KJVV/', 'total_content': 19}
{'url': 'http://fox.cs.vt.edu/VAD1/LOUIS/', 'total_content': 19}
{'url': 'http://fox.cs.vt.edu/VAD1/MELBIB/', 'total_content': 19}
{'url': 'http://fox.cs.vt.edu/VAD1/MGH/', 'total_content': 21}
{'url': 'http://fox.cs.vt.edu/VAD1/PCWRITE/', 'total_content': 35}
{'url': 'http://fox.cs.vt.edu/VAD1/PL/', 'total_content': 32}
{'url': 'http://fox.cs.vt.edu/VAD1/PUBLIST/', 'total_content': 19}
{'url': 'http://fox.cs.vt.edu/VAD1/RECIPE/', 'total_content': 13}
{'url': 'http://fox.cs.vt.edu/VAD1/SC/', 'total_content': 19}
{'url': 'http://fox.cs.vt.edu/VAD1/SLIDES/', 'total_content': 10}
{'url': 'http://fox.cs.vt.edu/VAD1/SPRAYBLT/', 'total_content': 19}
{'url': 'http://fox.cs.vt.edu/VAD1/SS/', 'total_content': 19}
{'url': 'http://fox.cs.vt.edu/VAD1/VADISC1A/', 'total_content': 97}
{'url': 'http://fox.cs.vt.edu/VAD1/VAGEN01/', 'total_content': 25}
{'url': 'http://fox.cs.vt.edu/VAD1/VEG/', 'total_content': 6}
{'url': 'http://fox.cs.vt.edu/VAD1/VEGWITHG/', 'total_content': 9}
{'url': 'http://fox.cs.vt.edu/VAD1/VET/', 'total_content': 79}
{'url': 'http://fox.cs.vt.edu/VAD1/VIRG/', 'total_content': 21}
{'url': 'http://fox.cs.vt.edu/VAD1/VTSSBROW/', 'total_content': 20}
{'url': 'http://fox.cs.vt.edu/VAD1/WPL/', 'total_content': 216}
{'url': 'http://fox.cs.vt.edu/VAD1/XPERT/', 'total_content': 19}

Re-Build Directory Locally

The current dictionary structure for VD1 is a bit difficult to navigate and use, so we reproduce the directory and file structure locally. The Python "os" module handles this task quite well. IPython magics can also be used for creating and navigating directory structures, but we stick with plain Python here. Note that the code below is essentially the inverse of the code above, with the "urllib" calls swapped out for Python's standard file I/O methods.

In [215]:
import os
In [246]:
def writeFile(file):
    name = file['url'].split('/')[-1]
    contents = file['contents']
    with open(name, 'wb') as f:
        f.writelines(contents)

def writeDir(directory):
    name = directory['url'].strip('/').split('/')[-1]
    os.mkdir(name)
    # Dive down a layer...
    os.chdir(name)
    
    # Write ALL the things....
    contents = directory['contents']
    writeContents(contents)
    
    # Go back to where you came from...
    os.chdir(os.pardir)

def writeContents(contents):
    for content in contents:
        # Cannot use 'type' attribute because some files are of type 'DIR'
        # and i'm an idiot and didn't realize it earlier
        if content['url'].endswith('/'):
            writeDir(content)
        else:
            writeFile(content)
In [247]:
writeContents(contents['contents'])
In [249]:
ls
ABSTRACT.TXT  COPYRGHT.TXT  GUIDE/       KJVV/         SC/         VET/
ACMACAD/      DIALOG/       GUIDE86/     LICENCES.TXT  SLIDES/     VIRG/
ACMCOMP/      DINSTALL.BAT  HASHING/     LOUIS/        SPRAYBLT/   VTSSBROW/
ACMPOL/       DONATION.TXT  IMAGES/      MELBIB/       SS/         WPL/
ACMPUB/       DOWN/         INSTALL.BAT  MGH/          VADISC1A/   XPERT/
AILIST/       DVI/          INSTWPL.BAT  PCWRITE/      VADISC.BAT
AILISTZ/      FACTS/        IPIS/        PL/           VAGEN01/
AITOPIC/      FAIRS/        IRLIST/      PUBLIST/      VEG/
BOOKLET.TXT   GEOGPOP/      KJVC/        RECIPE/       VEGWITHG/

Extract Names of Text Test Collections

In this section, we extract content first from the "BOOKLET.TXT" file and one of the sample collections. The "BOOKLET.TXT" document is essentially a Master Readme file, and points to various programs and collections contained on the disc.

One of these collections is the National Library of Medicine Test Collection, located in the "DOWN" sub-directory. This is the collection we extract content from below for later use.

Find the Names of all Collections from 'BOOKLET.TXT'

Due to how the file (and all extracted files) was read, the 'BOOKLET.TXT' document is a list of individual lines, where each list item is a bytes object rather than a string (though the lines are rendered as strings in the notebook). We will convert the lines to Unicode strings before moving forward, which can be done with the 'decode' method. While calling "str(line)" for each line would also technically convert the line to a string, several artifacts from the byte encoding would remain, as shown below. Additionally, we will remove blank lines from the document.

In [252]:
with open('BOOKLET.TXT', 'rb') as f:
    booklet = f.readlines()
In [253]:
booklet[:10]
Out[253]:
[b'\t\t\t  VIRGINIA DISC ONE\n',
 b'\n',
 b'\t\t    Copyright (c) 1988, 1989, 1990\n',
 b'\t Virginia Polytechnic Institute and State University\n',
 b'\t\t\t All Rights Reserved.\n',
 b'\n',
 b'                      Produced and Supported by\n',
 b'                         Nimbus Records Inc.\n',
 b'\n',
 b'Grants by\n']

This is what the resulting object looks like when we call "str" with the line as an argument. Notice that Python simply renders the entire bytes object as its literal string representation - the "b" prefix and escaped characters included - rather than decoding the content.

In [254]:
str(booklet[2])
Out[254]:
"b'\\t\\t    Copyright (c) 1988, 1989, 1990\\n'"
In [255]:
booklet[2].decode('utf-8').strip()
Out[255]:
'Copyright (c) 1988, 1989, 1990'
In [256]:
lines = [line.decode('utf-8').strip() for line in booklet]
lines = [line for line in lines if line]
In [257]:
lines[:4]
Out[257]:
['VIRGINIA DISC ONE',
 'Copyright (c) 1988, 1989, 1990',
 'Virginia Polytechnic Institute and State University',
 'All Rights Reserved.']

Let's start with the "Downloadable Data Collections" as they are a bit more straightforward.

The code below extracts the name of each collection, as well as its location in the VD1 directory. Descriptions are also contained within this file, but we do not extract them at this time. The interested reader can view the file at their leisure.

In [186]:
start_line = 'Public Domain Data (mostly in \down):'
end_line = 'DVI (tm) Image and Video Presentation Data'
In [190]:
target_lines = []
start_flag = False
for line in lines:
    if line == start_line:
        start_flag = True
    elif line == end_line:
        break
    if start_flag:
        target_lines.append(line)
In [272]:
locations = {}
for i,line in enumerate(target_lines):
    if line[0] == '*':
        name = line.strip('*').strip().split('(')[0].strip()
        if line.find('(\\') > -1:
            loc = line.split('(\\')[1].strip().strip(')')
            loc = '/' + loc
        else:
            try:
                line = target_lines[i+1]
                loc = line.split('(\\')[1].strip().split(')')[0].strip()
                loc = '/' + loc
            except IndexError:
                print(line)
        loc = loc.replace('\\', '/').upper()
        locations[name] = loc
In [273]:
locations
Out[273]:
{'Architecture Images': '/DOWN/ARCHDATA',
 'Artificial Intelligence List': '/DOWN/AILIST',
 'Dual Independent Map Format': '/DOWN/DIMECO',
 'Florida Extension Images': '/DOWN/FLIMAGES',
 'Information Retrieval List': '/DOWN/IRLIST',
 'Information Retrieval Test Collections': '/DOWN/IRCOLLS',
 'King James Bible': '/DOWN/KJV',
 'National Library of Medicine test collection': '/DOWN/NLM',
 'Survey Points in Virginia, Washington, D.C., and Baltimore, MD': '/DOWN/SURVPT',
 'U.S. Geographical  & Population Data': '/GEOGPOP',
 'U.S. Geological Data': '/DOWN/CARSTENS',
 'University of Maryland CVL Images': '/DOWN/GRAPHIC',
 'University of Melbourne Computer Science Bibliography': '/DOWN/BIB'}

Extract National Library of Medicine Test Collection

This collection contains three Query - Response datasets, each for a different general area -- "Science", "Health", and "Medicine".

Queries range from fairly simple one-liners to fairly lengthy descriptions. See two examples below.

  1. "Medicaid and Medicare vs. health insurance and their role in health care delivery."
  2. "HTLVIII- AIDS-related lymphomas and other malignancies, T helper, cell deficiency. Hodgkin's disease, molecular biology, gene rearrangements, biochemical genetics, cell of origin, etiology, Reed-Sternberg cell. T-cell acute lymphocytic leukemia (ALL) - in children. Immunophenotype, surface markers, t-cell receptors, gene rearrangements, chromosomal translocations of t-cell receptors. HTLVI- (ATL) adult T cell leukemia, IL-2 and IL-2 receptor as it relates, biology of infection of HTLVI. Role of retroviruses in human lymphomas. Burkitt's lymphoma."

Each query is contained in a file with a set of abstracts, each of which is either a positive or negative match for the query. Positive / negative match quality is indicated in a set of separate "Relevncy" files, one for each query.

As it stands, the collection is not in a form that can easily be used to evaluate an IR system. Our goal here is to extract the relevant information from the files and convert the data into a more usable form.

In [399]:
os.chdir('Dropbox/Analytics/Text Retrieval and Search Engines/VAD1')
In [277]:
nlm_loc = locations['National Library of Medicine test collection'][1:]
nlm_loc
Out[277]:
'DOWN/NLM'
In [400]:
os.chdir(nlm_loc)
In [401]:
main_dirs = [f for f in os.listdir(os.curdir) if os.path.isdir(f)]
main_dirs
Out[401]:
['SCIENCE', 'MEDICINE', 'HEALTH']
In [402]:
os.listdir(main_dirs[0])
Out[402]:
['QUERIES2', 'QUERIES1', 'RELEVNCY']
In [293]:
os.listdir(os.path.join(main_dirs[0], 'QUERIES1'))[:5]
Out[293]:
['BSR03', 'BSR09', 'BSR22', 'BSR26', 'BSR01']

Extract Query, Abstract Contents

First, we will take a look at the query files. Each of these files (supposedly) contains a single query along with a set of documents related to that query, either in a 'positive' or 'negative' manner. The 'positive' or 'negative' factor is contained in a separate file, which is covered in the next section. Below, an example of one such document is shown. After the file, we cover the list of attributes we wish to extract for each file, and for each document within each file.

In [295]:
with open(os.path.join(main_dirs[0], 'QUERIES1', 'BSR03'), 'rb') as f:
    bsr = f.readlines()
In [393]:
bsr[:50]
Out[393]:
[b'==================================BSR03==================================\n',
 b'3.   Tumor and normal tissue blood flow in animals.\n',
 b'     Effect of hyperthermia and drugs on tumor and normal tissue \n',
 b'     blood flow.\n',
 b'1\n',
 b'UI  - 87084960\n',
 b'AU  - Reinhold HS ; Endrich B\n',
 b'TI  - Tumour microcirculation as a target for hyperthermia.\n',
 b'AB  - A great number of investigators have, independently, shown that tumour\n',
 b'      blood flow is affected by a hyperthermic treatment to a larger extent\n',
 b'      than normal tissue blood flow. While the majority of the studies on\n',
 b'      experimental tumours show a decrease and even a lapse in blood flow\n',
 b'      within the microcirculation during or after hyperthermia, the data on\n',
 b'      human tumours are less conclusive. Some of the investigators do not find\n',
 b'      a decrease in circulation, while others do. Obviously, this is an\n',
 b'      important field of investigation in the clinical application of\n',
 b'      hyperthermia because a shut down of the circulation would not only\n',
 b'      facilitate tumour heating (by reducing venous outflow, this reducing the\n',
 b"      'heat clearance' from the tumour), but would also facilitate tumour cell\n",
 b'      destruction. The same holds for alterations that occur subsequently to\n',
 b'      the circulatory changes, like a heat-induced decrease of tissue pO2 and\n',
 b'      pH. If the frequently reported circulatory collapse of the tumour\n',
 b'      circulation could selectively be stimulated by, e.g. acidification or by\n',
 b'      vasoactive agents, hyperthermic treatment of patients would possibly be\n',
 b'      greatly facilitated and intensified. In hyperthermic tumour therapy a\n',
 b'      number of complex processes and interactions takes place, especially when\n',
 b'      the treatment is performed in combination with radiation therapy. One of\n',
 b'      them represents the group of processes related to the random probability\n',
 b'      of cell sterilization of individual tumour cells resulting in exponential\n',
 b'      survival curves which are typically evaluated with e.g. cell survival\n',
 b'      assays. This aspect has not been the issue of this paper. The other group\n',
 b'      of processes deals with the heat-induced changes in the micro-physiology\n',
 b'      of tumours and normal tissues which, as discussed before, may not only\n',
 b'      enhance the exponential cell kill, but which may also culminate in\n',
 b'      vascular collapse with the ensuing necrosis of the tumour tissue in the\n',
 b'      areas affected. If this takes place, a process of bulk killing of tumour\n',
 b'      cells results, rather than the random type of cell sterilization. At\n',
 b'      present it is not clear to what extent the various separate mechanisms\n',
 b'      contribute to the total effect of tumour control. With all these\n',
 b'      considerations in mind, one should be aware of the fact that effects,\n',
 b'      secondary to heat-induced vascular stasis alone will never be efficient\n',
 b'      enough to eliminate all tumour cells, even though a heat reservoir is\n',
 b'      created. This is so because some malignant cells will inevitably have\n',
 b'      already infiltrated normal, surrounding structures and will therefore not\n',
 b'      be affected by changes in the tumour vascular bed.(ABSTRACT TRUNCATED AT\n',
 b'      400 WORDS)\n',
 b'MH  - Animal ; Human ; *Hyperthermia, Induced ; Microcirculation ; Neoplasms,\n',
 b"      Experimental/*BLOOD SUPPLY/THERAPY ; Review ; Support, Non-U.S. Gov't\n",
 b'SO  - Int J Hyperthermia 1986 Apr-Jun;2(2):111-37\n',
 b'2\n']

General format:

  • Header row with query name
  • The query ID starts the row containing the actual query
  • 'UI' denotes doc ids associated with each query
  • A running index also appears to be associated with each document, but it does not seem to be used elsewhere, so it will be ignored; instead, for each document we will extract:
    • UI --> Doc ID
    • AU --> Authors
    • TI --> Document Title
    • AB --> Document abstract
      • The document text in this case
    • MH --> Appears to be a set of keywords associated with the document
    • SO --> Publication / Journal in which the paper corresponding to the abstract appeared

Some documents do not contain all of these elements. In such cases, we remove / ignore the documents and do not extract anything from them. This will result in a mismatch between the number of documents extracted and the number of relevancy scores extracted.

Some notes about the code below. For each file, the first few lines are handled in a "query extraction" mode. While processing each document, the code looks for specific markers, each at the beginning of a line. Note that the files are carefully formatted so that non-section-break lines begin with whitespace, so there is no concern about accidentally moving through sections prematurely.

In transitioning from the query to the documents (and into "document extraction mode"), the code looks for the first line beginning with a numeric character that is not trailed by a period. Once this milestone is reached, document breaks are flagged by this feature.

A complete document contains six (6) sections, each flagged with a two-letter abbreviation indicating which section has been encountered. This makes parsing fairly straightforward when all sections are present. In some cases, not all sections are present; we account for errors arising from such incomplete documents below, treating them as errors and ignoring those documents.

After all of the lines in a file have been addressed, each document is revisited and the various components cleaned up a bit. This is the point at which incomplete and empty documents are removed.

In [448]:
def firstCharIntCheck(line):
    try:
        if int(line[0]) in [0,1,2,3,4,5,6,7,8,9]:
            return(True)
        else:
            return(False)
    except ValueError:
        return(False)


def processQueryDoc(doc, topic):
    
    prefix_lookup = {'HEALTH' : 'HSR',
                     'SCIENCE' : 'BSR',
                     'MEDICINE' : 'CMR'}
    doc_sections = ['UI', 'AU', 'TI', 'AB', 'MH', 'SO']
    
    query = {}
    abstracts = {}
    
    doc = [line.decode('utf-8') for line in doc]
    label = doc[0].strip().strip('=')
    nbr = str(int(doc[0].strip().strip('=').strip(prefix_lookup[topic])))
    doc = doc[1:]
    
    query['topic'] = topic
    query['id'] = nbr
    
    query_flag = False
    doc_flag = False
    doc_section = None
    cur_a = {}; temp = [];
    
    for line in doc:
        if not query_flag and not doc_flag:
            if line.startswith(nbr + '.'):
                query_flag = True
                temp = []
                line = line.strip(nbr + '.').strip()
                temp.append(line)
            else:
                raise AttributeError('Something\'s wrong here...')
        elif query_flag:
            if firstCharIntCheck(line):
                # End of the query, beginning of the first abstract
                # Write out query content and move on.
                query['content'] = temp.copy()
                query_flag = False
                doc_flag = True
                cur_a = {}
                temp = []
            else:
                temp.append(line.strip())
        elif doc_flag:
            if firstCharIntCheck(line):
                # We've reached the beginning of a new abstract
                # and we currently have an active abstract stored
                # as 'cur_a'.  
                # First, we need to write the last section of the previous
                # abstract to "cur_a".  Then, we need to store the active
                # abstract in the "abstracts" dict.
                # Additionally, the abs id is stored as "cur_a['UI']"
                try:
                    cur_a[doc_section] = temp.copy()
                    cur_a['UI'] = int(cur_a['UI'][0])
                    abstracts[cur_a['UI']] = cur_a
                except KeyError:
                    pass
                
                # Clear out everything
                doc_section = None
                cur_a = {}
                temp = []
                
            elif line[:2] in doc_sections:
                # Reached the beginning of a new section
                if doc_section:
                    # Currently in a section; store stuff from section
                    # Start new list for the new section
                    cur_a[doc_section] = temp.copy()
                    temp = []
                # Add data to section
                doc_section = line[:2]
                line = line[2:]
                temp.append(line.strip().strip('-').strip())
            else:
                # In the core of a section; no flags to update
                temp.append(line.strip())
                
    # Clean up at the end of the file
    try:
        cur_a[doc_section] = temp.copy()
        cur_a['UI'] = int(cur_a['UI'][0])
        abstracts[cur_a['UI']] = cur_a
        doc_section = None
        cur_a = {}
        temp = []
    except KeyError:
        pass
        
    # Post Process
    if not doc_flag:
        # Only the query in the file, no docs
        # (yes this happens...)
        query['content'] = temp.copy()
        query_flag = False
        doc_flag = True
        cur_a = {}
        temp = []
    query['content'] = ' '.join(query['content'])
    
    # Clean up document formatting and remove empty / incomplete docs
    purge_docs = []
    for docid, abstract in abstracts.items():
        try:
            abstract['AB'] = ' '.join(abstract['AB'])
            abstract['AU'] = [au.strip() for au in ';'.join(abstract['AU']).split(';') if au]
            abstract['MH'] = [s.strip() for s in ';'.join(abstract['MH']).split(';') if s]
            abstract['SO'] = abstract['SO'][0]
            abstract['TI'] = abstract['TI'][0]
            abstract['QueryID'] = label
            abstract['Topic'] = topic
        except KeyError:
            purge_docs.append(docid)
    for pd_id in purge_docs:
        _ = abstracts.pop(pd_id)
        
            
    return(query, abstracts)
In [449]:
query, abstracts = processQueryDoc(bsr, 'SCIENCE')
In [450]:
query
Out[450]:
{'content': 'Tumor and normal tissue blood flow in animals. Effect of hyperthermia and drugs on tumor and normal tissue blood flow.',
 'id': '3',
 'topic': 'SCIENCE'}
In [451]:
abstracts[86252676]
Out[451]:
{'AB': 'A strategy for controlling the temperature profile in the tissue with a single applicator hyperthermia system is described. By manipulating the cooling water temperature as well as the heating power, the tissue temperatures in two selected locations can be controlled. By proper choice of these two locations and the corresponding temperature set-points, a temperature maximum can be obtained in a fairly superficial tumour. If the tissue composition and consequently the temperature distribution is fairly regular, a temperature profile above 43 degrees C in the tumour and below that in normal tissue can be obtained along an axis perpendicular to the surface. The controller is self-tuning and provides dynamic decoupling, bumpless transfer and anti-reset windup. Test of the controller by simulation and on a phantom indicates it is superior to the single point controller currently used.',
 'AU': ['Knudsen M', 'Heinzl L'],
 'MH': ['Adipose Tissue/BLOOD SUPPLY',
  '*Body Temperature',
  'Human',
  'Hyperthermia,',
  'Induced/INSTRUMENTATION/*METHODS',
  'Microcomputers',
  'Models, Biological',
  'Muscles/BLOOD SUPPLY',
  'Neoplasms/BLOOD SUPPLY',
  'Regional Blood Flow',
  'Skin/BLOOD SUPPLY'],
 'QueryID': 'BSR03',
 'SO': 'Int J Hyperthermia 1986 Jan-Mar;2(1):21-38',
 'TI': 'Two-point control of temperature profile in tissue.',
 'Topic': 'SCIENCE',
 'UI': 86252676}

Below we apply the extraction code to all query files in the NLM collection. The extracted Queries and Documents are then stored for later use.

In [452]:
primary_path = '..../Analytics/VAD1/DOWN/NLM/'
queries = []
abstracts = []
for md in main_dirs:
    for sub_dir in os.listdir(os.path.join(primary_path, md)):
        if sub_dir.find('QUERIES') > -1:
            for fp in os.listdir(os.path.join(primary_path, md, sub_dir)):
                with open(os.path.join(primary_path, md, sub_dir, fp), 'rb') as f:
                    r = f.readlines()
                try:
                    q, ab = processQueryDoc(r, md)
                    queries.append(q)
                    abstracts.extend(ab)
                except (KeyError, ValueError, UnboundLocalError) as e:
                    print(e)
                    print(os.path.join(md, sub_dir, fp))
In [462]:
print(len(queries))
print(len(abstracts))
155
3071
In [468]:
os.mkdir('vad1')
In [469]:
os.chdir('vad1/')
In [470]:
with open('queries.pkl', 'wb') as f:
    pickle.dump(queries, f)
with open('abstracts.pkl', 'wb') as f:
    pickle.dump(abstracts, f)

Extract Abstract "Relevncy"

In [403]:
os.listdir(os.path.join(main_dirs[0], 'RELEVNCY'))[:5]
Out[403]:
['BSR51.UI', 'BSR01.UI', 'BSR55.UI', 'BSR24.UI', 'BSR34.UI']
In [404]:
'BSR03.UI' in set(os.listdir(os.path.join(main_dirs[0], 'RELEVNCY')))
Out[404]:
True
In [405]:
with open(os.path.join(main_dirs[0], 'RELEVNCY', 'BSR03.UI'), 'rb') as f:
    bsrr = f.readlines()
In [415]:
bsrr[:6]
Out[415]:
[b'==================================BSR03.ui==================================\n',
 b'87084960  y\n',
 b'87083036  y\n',
 b'87059500  n\n',
 b'87051346  y\n',
 b'87002186  n\n']

The newline indicators make the file look confusing. The last "\n" on each line is simply indicating a newline, while the character of interest is the "y" or "n" directly before that, which indicates whether the particular document included in the query file is relevant or not relevant.

These files are much easier to extract data from. Each line, other than the header line, lists the Document ID and the "Relevant / Not Relevant" indicator.

For a handful of cases, the query has neither positive nor negative documents associated with it. For such files, which contain a "no retrieval" line, no content is extracted.

In [459]:
def processRelevncyDoc(doc):
    
    doc = [line.decode('utf-8') for line in doc]
    label = doc[0].strip().strip('=')
    doc = doc[1:]
    if doc[0].find('no retrieval') > -1:
        relevncys = []
    else:
        relevncys = [{'query' : label,
                      'abstract' : int(line.split()[0].strip()),
                      'relevant' : line.split()[1].strip()} \
                    for line in doc if line.strip()]
    return(relevncys)
In [455]:
relev = processRelevncyDoc(bsrr)
In [456]:
relev[:5]
Out[456]:
[{'abstract': 87084960, 'query': 'BSR03.ui', 'relevant': 'y'},
 {'abstract': 87083036, 'query': 'BSR03.ui', 'relevant': 'y'},
 {'abstract': 87059500, 'query': 'BSR03.ui', 'relevant': 'n'},
 {'abstract': 87051346, 'query': 'BSR03.ui', 'relevant': 'y'},
 {'abstract': 87002186, 'query': 'BSR03.ui', 'relevant': 'n'}]

Again, at this point we apply the above code to all Relevncy files present in the collection. These scores are stored for later use.

In [460]:
primary_path = '..../Analytics/VAD1/DOWN/NLM/'
relevncys = []
for md in main_dirs:
    for sub_dir in os.listdir(os.path.join(primary_path, md)):
        if sub_dir.find('RELEVNCY') > -1:
            for fp in os.listdir(os.path.join(primary_path, md, sub_dir)):
                with open(os.path.join(primary_path, md, sub_dir, fp), 'rb') as f:
                    r = f.readlines()
                try:
                    rels = processRelevncyDoc(r)
                    relevncys.extend(rels)
                except (KeyError, ValueError, IndexError) as e:
                    print(e)
                    print(os.path.join(md, sub_dir, fp))
In [461]:
len(relevncys)
Out[461]:
3078
In [471]:
with open('relevncys.pkl', 'wb') as f:
    pickle.dump(relevncys, f)
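
With the queries, abstracts, and relevancy judgments all pickled, the extracted data can be reloaded and sanity-checked in a later session. A minimal sketch (assuming the pickle files written above sit in the current working directory), tallying the 'y' / 'n' labels with collections.Counter:

import pickle
from collections import Counter

# Reload the saved queries and relevancy judgments
with open('queries.pkl', 'rb') as f:
    queries = pickle.load(f)
with open('relevncys.pkl', 'rb') as f:
    relevncys = pickle.load(f)

# Number of extracted queries and the breakdown of relevant ('y') vs not relevant ('n')
print(len(queries))
print(Counter(r['relevant'] for r in relevncys))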

Summary

In this article, we copied a directory structure and its contents from the web, reproduced the content and structure locally, and extracted a Query - Document collection used for Information Retrieval method development (the National Library of Medicine Test Collection) for future use.

The content in question was none other than Virginia Disc One, the first text collection of its kind: it was mass-distributed to researchers and practitioners in the field of IR, helping to establish the practice of uniform test collections for developing and advancing IR methods.