Data Mining the Internet Archive

I learned some of the basics of Python over five years ago in the context of using it in a Geographic Information System (GIS). At the time, I learned it on a Windows-based computer. I was told that Python was a powerful programming language, but since we only used it within the confines of a GIS, I had never truly discovered its usefulness outside of that world… until now, through the Programming Historian. Looking at these examples through the lens of an information professional, I can clearly see the advantages of being able to search through various collections in the Internet Archive with a programming language. With some of these Python modules and scripts, information professionals can extract various bits of information from collections held in the Internet Archive, quickly and easily (well, after the setup is complete, that is).

With Python, information professionals can download metadata from collections and then, using specific modules such as pymarc, extract specific MARC fields from items within a collection. This was mind-boggling! Once users have gone through the headache of properly setting up Python and its various modules, things tend to run rather smoothly. A big advantage of using Python scripts is the ease with which they can be re-used for other collections within the Internet Archive.
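
To give a feel for what "extracting a MARC field" means, here is a minimal sketch. It is not the Programming Historian's script and it does not use pymarc: it swaps in Python's built-in XML parser so that it runs with no extra setup. The sample record, the SAMPLE_MARCXML name and the get_subfields helper are all made up for illustration; a real record would come from the metadata downloaded for an Internet Archive item.

```python
import xml.etree.ElementTree as ET

# Namespace used by MARCXML records.
MARC_NS = "http://www.loc.gov/MARC21/slim"

# Hypothetical sample record, invented for this example.
SAMPLE_MARCXML = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example title /</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="a">Boston :</subfield>
    <subfield code="b">Example Press,</subfield>
    <subfield code="c">1850.</subfield>
  </datafield>
</record>"""


def get_subfields(marcxml, tag, code):
    """Return every value of subfield `code` in MARC field `tag`.

    Hypothetical helper: with pymarc installed, the library would
    handle this lookup instead.
    """
    root = ET.fromstring(marcxml)
    values = []
    # Walk every <datafield>, whether the root is a record or a collection.
    for field in root.iter("{%s}datafield" % MARC_NS):
        if field.get("tag") != tag:
            continue
        for sub in field.findall("{%s}subfield" % MARC_NS):
            if sub.get("code") == code and sub.text:
                values.append(sub.text.strip())
    return values


if __name__ == "__main__":
    # Field 260, subfield a holds the place of publication.
    print(get_subfields(SAMPLE_MARCXML, "260", "a"))
```

Because the field tag and subfield code are passed in as arguments, the same few lines can be pointed at a different field or a different collection's records, which is the kind of re-use that makes scripting attractive.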

When I originally learned Python in a Windows environment, I found the setup quite finicky. You needed to ensure that you had the proper version installed, along with the proper modules. I remember having to install third-party software (PythonWin) to write and execute my scripts, since the native Python interpreter in my GIS was rather poor (according to the prof, and he was right too!). When doing the setup on my Mac, I found myself looking for a Python interpreter similar to PythonWin. I felt extremely stupid (and was rather glad after the fact) once I discovered that the Mac Terminal easily transforms into the Python interpreter after typing the word “python” (magically, the terminal adds the >>> prompt, which indicates that you are now in the interpreter).

The Programming Historian does a decent job of describing the general setup and uses of these specific tools. Where it falls short is in giving specific instructions to get individuals started and in helping users troubleshoot various issues with the tools. Having used Python in the past, I could troubleshoot the issues fairly easily and move on to the next step.

Written on February 1, 2016