(Or why HoudahSpot is so awesome)
When I moved from Minnesota to Georgia, I realized that I couldn’t take all of my old paper journals with me – there simply wasn’t space in my new office, and 26 years of being an academic with at least 10 journal issues per year = a big old pile of journals. Being a rhetorician, however, means that some of those old journals still have practical use today. Thus, the beginnings of the project to get the PDFs of more than 3,900 journal articles (I’m at over 2,000 so far, with many more to go).
Now I knew most of what I had to get, thanks to the fact that I had diligently put the articles in EndNote over the years, although I learned that I either never received or never entered several issues of various journals. That’s probably because a large company whose name I won’t mention, but which rhymes with “Haylor and Bansis,” either sometimes forgets to send the journals or, when they do, sends me 3 issues at once. They’ve even sent me an entire year’s worth of Communication Studies in one bound journal, which, for the curious, checks in at 1159 pages.
Part of the problem with this project is that there are quite a few years of the various journals that were converted to PDF, but never were converted to searchable PDFs. I could make this post a rant about some of our “favorite” database companies that make lots of money from libraries, but didn’t bother to do the basic task of OCRing the journals (i.e., running optical character recognition on them) – and yes, that even includes some journals from 2019! I won’t mention any names, but let’s just say the journals can be found in the database of a company whose name rhymes with Websco Toast or Dough Quest.
I started the project assuming that anything from the early 1990s probably hadn’t been OCR’ed (i.e., no optical character recognition had been run on it). But, lo and behold, I was wrong. My fingers accidentally pressed Cmd-I, which in macOS is the “Get Info” command that works on any file. And so I looked at one of the files. Back in those days, the people who were scanning our favorite communication journals into databases were using ABBYY FineReader! That meant that the files were already properly OCR’ed, so I didn’t have to do the work in Adobe Acrobat!
Now we get to the tip, and where HoudahSpot comes in handy. It turns out that Get Info shows that the metadata of the file includes a “Content Creator” field. The “Content Creator” for the files from that period was ABBYY FineReader. Thus, here’s where HoudahSpot comes in – and where Finder simply didn’t work for me.
You can add “Content Creator” in HoudahSpot by going to Search – Add Criterion – Other, and then selecting Content Creator. Or, in the visual interface, select Other in the far left-hand menu, then add Content Creator to the options. You’ll see in this screenshot that all I really need is the first search term. You don’t need to search for any text; just hit Option-Cmd-F or press the play button next to the “Search Any Text” field. You’ll then get a list of files that match the desired term. In my case, I had over 80 files that were created by ABBYY FineReader. [1]
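For anyone who would rather script the same kind of search, here is a minimal sketch in Python. It is not the HoudahSpot workflow described above: it swaps in macOS's built-in `mdfind` command-line tool, which queries the same Spotlight index. It assumes that the “Content Creator” field in Get Info corresponds to the Spotlight attribute `kMDItemCreator`, and that the folder in question is indexed by Spotlight. The function names and the folder path are hypothetical, chosen only for illustration.

```python
import subprocess
from pathlib import Path


def build_creator_query(creator_fragment):
    """Build a Spotlight query for files whose creator contains the fragment."""
    # Escape backslashes and double quotes so the fragment cannot break the query.
    escaped = creator_fragment.replace("\\", "\\\\").replace('"', '\\"')
    # Assumption: Get Info's "Content Creator" is the kMDItemCreator attribute.
    # The asterisks are wildcards; the trailing "c" makes the match case-insensitive.
    return f'kMDItemCreator == "*{escaped}*"c'


def build_mdfind_command(folder, creator_fragment):
    """Return the argument list for an mdfind search limited to one folder."""
    return [
        "mdfind",
        "-onlyin",
        str(Path(folder).expanduser()),
        build_creator_query(creator_fragment),
    ]


def find_files_by_creator(folder, creator_fragment):
    """Run mdfind (macOS only) and return the matching file paths."""
    result = subprocess.run(
        build_mdfind_command(folder, creator_fragment),
        capture_output=True,
        text=True,
        check=True,
    )
    # mdfind prints one path per line; drop any empty lines.
    return [line for line in result.stdout.splitlines() if line]


# Example usage on a Mac (the folder below is a hypothetical location):
#   for path in find_files_by_creator("~/Documents/Journals", "ABBYY FineReader"):
#       print(path)
```

The query is built separately from the call to `mdfind` so that it can be inspected, or pasted into Terminal, before anything is run. Matching on a fragment such as “ABBYY” rather than an exact string is a deliberate choice, since the creator text may include a version number.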
HoudahSpot is normally $34, with a 30% student discount. Some may say, “Why spend $25-35 just for a search tool when you have Finder already?” But there are some things that Finder can’t do easily… and given that I would have otherwise OCR’ed about 80 files that I didn’t need to, I think I just paid for the program in the time that I saved.
As noted elsewhere on this site, I don’t receive any money for mentioning various programs. I mention them because they’ve helped in my workflow in one way or another, and I think they may possibly help others.
Notes
[1] I’ll try to put up a list of some of the “known good” content creator types used by the various companies over the years.