WORDij 3.0
Advanced Features
How to use the Utilities (Proper Nouns, TimeSegs), the WordLink Advanced Options (Select Files, Include Files, String Files), and the program OptiComm
The Proper Nouns program extracts nearly all proper nouns in the text. This is useful if one is planning on using advanced features of WordLink to identify networks among actors, programs, and perhaps some additional words.
The first “Output List File” is the list of proper nouns in alphabetical order.
The second output file “Output String Replace File” prepares a rough draft of a string concatenation file that could be used in the WordLink advanced features. Each string after an arrow -> is treated as a unigram, equivalent to a single word in WordLink.
For example, White House would be treated as White_House, a single word. Strings of more than two words are also created, such as United_States_of_America.
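The effect of a string replace file can be sketched in a few lines of Python. This is only an illustration, not WordLink's own code: it assumes one "phrase->Replacement_String" entry per line, and the function names parse_string_replace and apply_rules are hypothetical.

```python
import re

def parse_string_replace(lines):
    """Read 'phrase->Replacement_String' entries into a dict (assumed format)."""
    rules = {}
    for line in lines:
        line = line.strip()
        if "->" not in line:
            continue  # skip blank or malformed lines
        source, target = line.split("->", 1)
        rules[source.strip()] = target.strip()
    return rules

def apply_rules(text, rules):
    """Replace each phrase with its single-word string, longest phrase first."""
    # Longest first, so "United States of America" is handled before "United States".
    for source in sorted(rules, key=len, reverse=True):
        pattern = r"(?<!\w)" + re.escape(source) + r"(?!\w)"
        # A function replacement keeps the target text literal.
        text = re.sub(pattern, lambda m, t=rules[source]: t, text)
    return text

rules = parse_string_replace([
    "White House->White_House",
    "United States->United_States",
    "United States of America->United_States_of_America",
])
print(apply_rules("The White House spoke for the United States of America.", rules))
```

Applying the longest phrases first is a design choice of this sketch; it keeps a shorter entry from splitting a longer name.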
Here are screenshots of a sample Proper Nouns run and sample output.
Using the Proper Noun String Replace List File in WordLink Advanced Options
One can concatenate multiple words into single unigram strings in two ways. One way is using the Proper Noun program’s approach. The other way is by creating one’s own entries in an editor.
When only a string file and not an include file (to be described later) is used, the unigrams are network analyzed along with all of the other words not modified by the string replacements.
The next slides show the use of a Proper Noun String Replace File in analyzing the semantic network of the 2009 Twitter file.
CAUTION: When running WordLink multiple times, change the output file name for each run, because the program will otherwise overwrite the previous files without warning.
Notice the result of the string replacement
Using the String Replace List File and the Include List File in WordLink Advanced Options
This produces a network of only those strings in the include list.
Recall that the String Replace List File was demonstrated when we were in Utilities, Proper Nouns. There we created a String Replace List File for all proper nouns in the text. However, one can create any sort of string replacements in a text editor. For example, here we examine the cabinet of Pres. Clinton. A portion of the file shows various aliases for his name, all pointing to a single summary name string, such as “Pres. Clinton->Bill_Clinton.”
Here is a portion of the include file. These will be the only word strings (names) that appear in the network output. All other words are dropped from the analysis.
An important difference compared to the simple “bag of words” network approach is that WORDij 3.0 maintains the order of the appearance of the names, thus allowing for directionality of links among them.
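To illustrate what directionality means here, the following Python sketch counts ordered pairs among include-list names only. It is an assumption-laden illustration, not the WordLink algorithm: the function name directed_pairs and the span parameter (how many following names each name is paired with) are hypothetical, and the sketch assumes non-included words are dropped before pairing.

```python
from collections import Counter

def directed_pairs(tokens, include, span=1):
    """Count ordered (first, second) pairs among words on the include list."""
    # Keep only include-list words, preserving their order of appearance.
    kept = [t for t in tokens if t in include]
    pairs = Counter()
    for i, first in enumerate(kept):
        # Pair each name with the next `span` names that follow it.
        for second in kept[i + 1 : i + 1 + span]:
            if first != second:
                pairs[(first, second)] += 1
    return pairs

tokens = ["Bill_Clinton", "met", "Al_Gore", "and", "Janet_Reno"]
include = {"Bill_Clinton", "Al_Gore", "Janet_Reno"}
for (first, second), count in directed_pairs(tokens, include).items():
    print(first, "->", second, count)
```

Because the pair (Bill_Clinton, Al_Gore) is counted separately from (Al_Gore, Bill_Clinton), the resulting links have a direction, which a bag-of-words count would lose.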
Here we do a time segmentation of the first and second 100 days of the Clinton administration by adding the select file produced by the TimeSegs program.
The optimal message creator, OptiComm, traces all shortest paths between a seed word and a target word, which must be connected, at least indirectly, in the network. These strings can be used to produce messages for further communication in the language community, in order to move two words closer together, move them further apart, or reinforce aspects of the semantic network.
OptiComm defaults to producing five-word strings, and you can set a longer length. It also defaults to producing 16 messages of alternative shortest paths.
If you do not enter a target word it defaults to the most central word in the network.
Because the shortest paths are using directional word pairs, they are sensitive to embedded syntax of the language, which automatically determines typical word pair order. Optimal messages, therefore, can be created quite readily from the strings by adding function words (prepositions, conjunctions, etc.) that may have been dropped to produce a linguistically valid statement.
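The idea of tracing all shortest paths over directional word pairs can be sketched with a breadth-first search. This is a generic textbook approach written for illustration, not OptiComm's actual implementation; the function name all_shortest_paths and the small example network are hypothetical.

```python
from collections import deque

def all_shortest_paths(edges, seed, target):
    """Return every shortest directed path from seed to target as a word list."""
    graph = {}
    for first, second in edges:
        graph.setdefault(first, set()).add(second)

    # Breadth-first search, remembering every predecessor on a shortest route.
    dist = {seed: 0}
    preds = {seed: []}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for nxt in sorted(graph.get(node, ())):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                preds[nxt] = [node]
                queue.append(nxt)
            elif dist[nxt] == dist[node] + 1:
                preds[nxt].append(node)

    if target not in dist:
        return []  # seed and target are not connected in this direction

    # Walk back from the target to the seed along the recorded predecessors.
    paths = []
    def walk(node, tail):
        if node == seed:
            paths.append([seed] + tail)
            return
        for prev in preds[node]:
            walk(prev, [node] + tail)
    walk(target, [])
    return sorted(paths)

edges = [("twitter", "follow"), ("twitter", "tweet"),
         ("follow", "friends"), ("tweet", "friends"),
         ("friends", "myspace")]
for path in all_shortest_paths(edges, "twitter", "myspace"):
    print(" ".join(path))
```

Each printed string preserves word-pair order, so function words can be added afterward to turn it into a full sentence.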
If you want to move two words closer together and the concept is innovative, it may be best to select the shortest paths of low frequency, using the output labeled “Strings with Low Average Pair Frequency,” which is listed first in the output. Our lab experiments have shown this to be most effective. The theory is that while the words are central, they are attractive because their use is less frequent in the particular language community.
If you want to move two words closer together and reinforce an already strong connection, you may want to use the shortest strings of most frequent words, labeled “Strings with High Average Pair Frequency,” which is listed second in the output. These words are more frequently used in the language community.
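The two output sections can be thought of as the same set of shortest-path strings ordered by average word-pair frequency. The Python sketch below shows that ordering under stated assumptions: the pair frequencies are made-up illustrative numbers, and the function names are hypothetical rather than part of OptiComm.

```python
def average_pair_frequency(path, pair_freq):
    """Mean frequency of the consecutive word pairs along one string."""
    pairs = list(zip(path, path[1:]))  # requires a path of at least two words
    return sum(pair_freq[p] for p in pairs) / len(pairs)

def rank_paths(paths, pair_freq):
    """Return the paths ordered low-average-first and high-average-first."""
    low_first = sorted(paths, key=lambda p: average_pair_frequency(p, pair_freq))
    high_first = list(reversed(low_first))
    return low_first, high_first

# Illustrative frequencies only; real values would come from the WordLink output.
pair_freq = {("a", "b"): 2, ("b", "d"): 4, ("a", "c"): 10, ("c", "d"): 20}
paths = [["a", "c", "d"], ["a", "b", "d"]]
low_first, high_first = rank_paths(paths, pair_freq)
print("Low average pair frequency first:", low_first[0])
print("High average pair frequency first:", high_first[0])
```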
To move two words further apart, one would select a target that is on the periphery of the word network, trying different peripheral targets until finding a desirable string that connects the seed and the remote target. Pick from the strings with low average pair frequency, listed first in the output.
For example, if twitter and myspace are close together and one wants to move them further apart, one would use myspace as the seed and find the strings that would move it furthest toward the periphery, thus increasing the distance from twitter.
To use all of the major WordLink options for file selection, we will add the TimeSegs select file to the string and include file selections. In summary, we are using three types of Advanced Options files: 1) Select File, 2) String Replace List File, and 3) Include List File.
This will show the presidential cabinet network for the first 100 days of the Clinton administration and for the second 100 days.
Note that select files need not be based on time segments; they can support any kind of file comparison. See How To Use Select Specifications.
This concludes our demonstration of the three major Advanced Options. In addition, there are 11 other check box options you can select or remove on the Advanced Options page.
The Select File is like a batch file or macro that systematically selects marked sections of text and places them in separate files for analysis. The sections are marked with headers that begin with @@ followed by any alphanumeric content. The segments can be mixed in any order. The select options are flexible and therefore numerous, too numerous to demonstrate fully here. Nevertheless, here is an example of selecting the first 100 days and second 100 days of the Clinton administration.
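A rough Python sketch of the idea behind @@ headers follows. It is not the Select File syntax itself, which is not shown here: it assumes each header occupies its own line, and the labels T1 and T2 and the function name split_sections are hypothetical.

```python
def split_sections(lines):
    """Group text lines under the @@ header that precedes them."""
    sections = {}
    current = None
    for line in lines:
        if line.startswith("@@"):
            # A header line starts (or resumes) the section with this label.
            current = line[2:].strip()
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(line)
    return sections

lines = ["@@T1", "story a", "@@T2", "story b", "@@T1", "story c"]
for label, body in split_sections(lines).items():
    print(label, body)
```

Sections with the same label are collected together even when they appear out of order in the source, which is what allows mixed segments to be analyzed as separate files.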
TimeSegs takes as input any Lexis/Nexis text output and NewsBank output. Most colleges and universities have Lexis/Nexis Academic, which gives excellent and extensive coverage of the news in various forms of media. The complete textual database is generally available at law libraries.
Lexis/Nexis covers many world, national, and major wires, trade publications, electronic media transcripts, web documents, etc.
NewsBank is particularly useful because it adds to this a heavy concentration of small market newspapers.
The program looks for the standard Lexis/Nexis date header and the NewsBank date header. If you followed one of these conventions you could create your own time segments in other kinds of source files.
Note that the source files can be completely mixed in different times but that TimeSegs in effect sorts the files into contiguous time files. This is particularly important when combining multiple search output files.
TimeSegs automatically segments a file into a series of files so that a separate network analysis is performed on each, enabling observation of change over time. For example, in a preliminary analysis we took the two terms of Pres. Reagan and divided them into months, resulting in 96 time segments across 650 megabytes of news stories (not shown here).
The Pres. Clinton news stories file is divided here into two time segments, the first 100 days and the second 100 days, although the entire two terms comprise 675 megabytes of stories.
The Utilities, TimeSegs program uses one text input file and produces two output files. The first is the original text with time code headers inserted throughout. The second is the key file, called the “Select File.”
The user selects the start and end dates and then selects the period number for the time segmentation interval, whether it is to be daily, weekly, monthly, or yearly.
For example, in this tutorial we selected the beginning of the first Clinton term, with a start date of January 20, 1993, an end date of August 7, 1993, and a period width of 100 days. This yields the files needed to produce two time segments.
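The arithmetic behind this example can be checked with a short Python sketch. It is not the TimeSegs code; it simply assumes the start and end dates are both inclusive and that segments are numbered from 1, and the function name segment_index is hypothetical.

```python
from datetime import date

def segment_index(day, start, end, width_days):
    """Return the 1-based time segment for a date, or None if out of range."""
    if day < start or day > end:
        return None
    return (day - start).days // width_days + 1

start, end = date(1993, 1, 20), date(1993, 8, 7)
for day in (date(1993, 1, 20), date(1993, 4, 29), date(1993, 4, 30), date(1993, 8, 7)):
    print(day.isoformat(), "segment", segment_index(day, start, end, 100))
```

Under these assumptions, January 20 through April 29, 1993 falls in the first segment and April 30 through August 7, 1993 in the second, giving exactly two 100-day periods.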
Here is the Utilities, TimeSegs screen for the 100 day segmentation run.
This is the WordLink Advanced Options setup to run the Clinton 100 day analysis over two time periods.