WORDij 3.0
Basic Features
How to use WordLink, QAPNet, VISij,
Z-Utilities – a Twitter Example

Assumptions

This presentation assumes you have installed WORDij 3.0. If not, see the “How to Install WORDij 3.0” tutorial.

The files used in this tutorial (those with “twitter” in the name, plus the drop list) accompany the program download package. Here we use three files: twitter2008.txt, twitter2009.txt, and droplist.txt.

When we say “See Notes” for more information this means the slide has notes at the bottom which you cannot see unless you turn off the Slide Show mode and go to View, Normal Mode.

How to use WordLink
Start WORDij 3.0 by going to the stored location and clicking on the WORDij.jar file. This will open the program.
To do a basic WordLink procedure, click the “Browse…” Source Text File box to locate the sample file: twitter2009.txt
Click “Browse…” for the Drop List File to locate file: droplist.txt (See Notes below for more detail.)

The droplist file is what many information retrieval people call the “stop word” list. It contains words to exclude from the processing that follows. We have created a short droplist.txt file that contains mainly prepositions, pronouns, and other function words, as well as format words used in Lexis/Nexis, from which we obtained the twitter.txt file.

You should edit this droplist.txt file to fit your purposes once you move beyond playing with the program and want to do a serious analysis. For example, in marketing, pronoun use is indicative of how close the respondent feels to the product/service, so one would want to include rather than drop pronouns in this case.
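To make the role of the drop list concrete, here is a minimal Python sketch of stop-word filtering. It is not WORDij's own code. It assumes a drop list with one word per line and uses a simple, hypothetical tokenizer; the sample words and sentence are made up for illustration.

```python
import re

def load_droplist(lines):
    """Build a set of lowercase drop words (assumes one word per line)."""
    return {line.strip().lower() for line in lines if line.strip()}

def filter_tokens(text, droplist):
    """Lowercase the text, split it into word tokens, and remove drop words."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())  # hypothetical tokenizer
    return [t for t in tokens if t not in droplist]

# Hypothetical drop list and sentence, for illustration only.
drop = load_droplist(["the", "of", "a", "is"])
kept = filter_tokens("The future of Twitter is a mystery", drop)
print(kept)
```

Keeping pronouns, as in the marketing example, would simply mean leaving them out of the drop list.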

Your WordLink screen should look like this:

“Output in Pajek format” is checked as a default setting. This produces the standard .net file that can be input directly to Pajek or imported to UCINET for statistical analysis and graphic representation of the word networks. If you want to use MultiNet/Negopy you can click the Conversions tab and select the “Net To .CSV.” Thus, WORDij provides the means for interoperability between WordLink, Pajek, UCINET, and MultiNet/Negopy. You can think of WORDij as your bridge among these network programs.

“Drop Words Appearing Less Often Than” has a default of 3, as does “Drop Pairs Appearing Less Often Than.” It is standard practice in the natural language processing community to drop words or pairs that occur only once or twice in the text. For small datasets you may wish to set the value to 1, in which case no words or pairs are dropped.

“Use Constant Linkage Strength Method” is the default. This means that words appearing within the “Window Size for Extracting Word Pairs” will be unweighted, without regard to their distance from one another in the window. If “Linear” is selected, then a word that is next to the target word gets a score of 1, while a word two steps away gets 2, etc. “Exponential” borrows from the notion in physics that the attraction between two bodies depends on the square of the distance between them (attraction falls off as the inverse square of the distance). If you choose this option, each word's distance from the target word is squared.

“Window Size for Extracting Word Pairs” has a default of 3. Think of this as a window that slides through the text and counts the pairs of words appearing within it. With the default of 3, words that precede the target word by 3 words, 2 words, or 1 word are all counted as paired with it, as are the three words appearing after the target word. This is a functional window width of 7 when you count the target word. Testing of the window size has found 3 to be optimal. For advanced uses such as automatic social network analysis, we will discuss setting the width at other distances.
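The sliding window and the frequency thresholds described in these notes can be sketched in Python as follows. This is an illustration under stated assumptions, not WORDij's implementation: it counts ordered pairs by looking forward from each target word (pairs with preceding words are covered when those words take their turn as the target), it sums the distance scores exactly as the notes describe them, and the helper names `distance_score`, `extract_pairs`, and `drop_rare` are hypothetical. How WORDij turns linear or squared distance scores into a final link strength is not stated in the slides.

```python
from collections import Counter

def distance_score(distance, method="constant"):
    """Score for a word `distance` steps from the target, per the notes."""
    if method == "constant":
        return 1                 # unweighted, regardless of distance
    if method == "linear":
        return distance          # adjacent word scores 1, two steps away 2, ...
    if method == "exponential":
        return distance ** 2     # distance value is squared
    raise ValueError(f"unknown method: {method}")

def extract_pairs(tokens, window=3, method="constant"):
    """Slide a window over the tokens and accumulate a score per ordered pair."""
    pairs = Counter()
    for i, target in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                # Assumption: scores for repeated pairs are simply summed.
                pairs[(target, tokens[i + d])] += distance_score(d, method)
    return pairs

def drop_rare(counts, minimum=3):
    """Keep only entries whose count is at least `minimum`."""
    return Counter({key: n for key, n in counts.items() if n >= minimum})

# Hypothetical token stream, for illustration only.
tokens = "twitter users follow twitter users".split()
pairs = extract_pairs(tokens, window=3)
frequent = drop_rare(pairs, minimum=2)
linear_pairs = extract_pairs(tokens, window=3, method="linear")
print(frequent)
```

With the constant method the score is a plain co-occurrence count, which is why the pair threshold can be read as "appears at least this many times."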

“Advanced Options” will be discussed later.

Leave other settings at defaults. See Notes below for further detail.
Click the “Analyze Now” button to execute your WordLink analysis.
“Quit” terminates the program without executing further analyses.

You will see three boxes while the program is running:

“WordLink is working” and showing a progress bar. You can abort the program by pressing the “Stop Now” button.
“Program is finished” and you press “OK”
A log window that lists the program options and the output files; click “Close” to dismiss it.

There are 8 output files generated:
These files serve as input for other analyses. See Notes for description of each file.

1. twitter.net is in Pajek format. This format can also be imported into UCINET for statistical analysis of the network and used in NetDraw to generate a graphic of the semantic network. The first step is to import the file into UCINET, then go to NetDraw and open the UCINET system file, twitter.##h. If you skip this step and read the twitter.net file directly into NetDraw, it will produce a very cluttered graph in which each node has a different icon with no meaning associated with it. The twitter.net file can also be converted for use in MultiNet/Negopy. See Conversions for further information.
2. twitter.pr is a word pair listing in the form of From, To, and Frequency Count. The file contains three columns: Word1, Word2, and Frequency.
3. twitter.ptg is like the “.pr” file except that it contains IDs rather than words. The file contains three columns: the ID for Word1, the ID for Word2, and the Frequency.
4. twitter.stp.csv is a file showing the number of pairs, number of unique pairs, average pair frequency, and pair negative entropy. Then there are five columns listing the pair, frequency, proportion, negentropy term, and mutual information.
5. twitter.stw.csv is a file showing the number of words, number of unique words, average word frequency, and word negative entropy. Then there are four columns listing the word, frequency, proportion, and negentropy term.
6. twitter.log is a log file of the run settings.
7. twitter.wrd is an alphabetical listing of the words with a frequency count of their occurrence. The file contains two columns: Word and Frequency.
8. twitter.wtg is an alphabetical listing of the words, each with its unique ID number and a frequency count of its occurrence. The file contains three columns: Word, ID number, and Frequency.
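If you want to inspect the pair listing outside WORDij, a short Python sketch like the one below may help. It assumes the three columns of a .pr file (Word1, Word2, Frequency) are separated by whitespace, which the slides do not state, so check your own file first. The function name `read_pairs` and the sample rows are hypothetical.

```python
import io

def read_pairs(handle):
    """Parse 'word1 word2 frequency' lines into a dict keyed by (word1, word2)."""
    pairs = {}
    for line in handle:
        parts = line.split()          # assumption: whitespace-separated columns
        if len(parts) != 3:
            continue                  # skip blank or malformed lines
        word1, word2, freq = parts
        pairs[(word1, word2)] = int(freq)
    return pairs

# Made-up rows standing in for a real .pr file.
sample = io.StringIO("twitter users 12\nsocial media 9\n")
pairs = read_pairs(sample)

# Most frequent pairs first.
top = sorted(pairs.items(), key=lambda item: item[1], reverse=True)
print(top)
```

To read a real file, pass `open("twitter.pr", encoding="utf-8")` (path assumed) instead of the `StringIO` sample.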

There are options on how to draw networks. In this presentation we use VISij which is part of the WORDij 3.0 package.
For an alternative, see the tutorial “How to Draw a Semantic Network using UCINET/NetDraw.” That approach has more steps (17) and does not show motion over time as VISij does.
You can also graph in Pajek or in MultiNet/Negopy, or in other programs that accept the Pajek .net file format. The WORDij 3.0 Conversion tab contains utilities that will convert .net to MultiNet node and link files, .csv with required headers. WORDij 3.0 provides a high degree of network program interoperability.

The defaults are 30 nodes and a minimum link strength of 3, with no zoom, so the initial graph may look suboptimal and Zoom needs to be changed for most graphs. Treat the initial graph as a starting point that you tailor to your preferences by changing the nodes and links and zooming in and out.

As shown in the previous slide, to graph your network, click on the main tab at the top labeled “VISij.” It produces a spring-embedded optimal layout.
Click on ADD to include a .net file(s) for graphing.
The default settings are 30 nodes with Minimum link value of 3.
Click Min Link Value to increase or decrease.
Click on Zoom + or – to move in or out to set an optimal view within the screen. For other options see Notes below.

The links that are green are weak links, while black ones represent stronger links.
To save a picture of a graph, take a screen shot.
The Transforming option shows arrows tracing the movement of nodes from one file representation to another.
The Picking option allows you to pick a node to remain stable across the representations. If you hold the Shift key down, you can draw a rectangle around a group of nodes and change their color. To turn this off, put the cursor inside the box and click.
Show Disconnected Nodes allows you to see which nodes drop out from one view to the next.
Click and hold on any part of the graph to drag it into the position you like.
If you enter multiple files, you can click Play and Stop to start and stop the movie as it moves from one graph to the next, showing the changes in between.
Because the program holds the number of nodes and the minimum link strength constant, you get an accurate view of change over time if your files are time-based.

Screen shot of Twitter semantic network:

To show the network around the key word “twitter” for 2009, we ran the NodeTric program in the Conversions tab. Steps:
Browse for the 2009 .pr file.
Put in the Focal Word: twitter
Select the default of 5 Link Steps away from this node.
Browse for where to put the output file and what to call it.
Then to run click Start.
When it ends, click Close.
Repeat this process with the 2008 file.
Next we graph these 2008 and 2009 NodeTric .net files in VISij.

Note that you can run VISij on the whole-network .net file, or first run the NodeTric program if you want to center your attention on a key word’s “node-centric” network. If these were people rather than words, we would call this the “ego-centric” network.
When using news stories as text input, we generally use NodeTric. Many stories that mention the search term also contain irrelevant content, and putting relevant and irrelevant content in the same graph is confusing because the irrelevant concepts are unrelated to your reason for selecting the texts. For example, a news story may review a number of different technology products, only one of which is twitter. By including the network for the entire story you are mixing twitter apples with irrelevant oranges.
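The idea of a node-centric network can be sketched in Python as a breadth-first search outward from the focal word. This is only an illustration of the concept, not NodeTric's actual algorithm. It assumes links are followed in either direction and that a pair is kept when both of its words lie within the chosen number of link steps. The function name `node_centric` and the sample pairs are hypothetical.

```python
from collections import deque

def node_centric(pairs, focal, steps=5):
    """Keep the pairs whose words are all within `steps` links of `focal`."""
    # Build an undirected neighbor map (assumption: direction is ignored).
    neighbors = {}
    for first, second in pairs:
        neighbors.setdefault(first, set()).add(second)
        neighbors.setdefault(second, set()).add(first)

    # Breadth-first search records each word's link distance from the focal word.
    distance = {focal: 0}
    queue = deque([focal])
    while queue:
        word = queue.popleft()
        if distance[word] == steps:
            continue  # do not expand beyond the step limit
        for nxt in neighbors.get(word, ()):
            if nxt not in distance:
                distance[nxt] = distance[word] + 1
                queue.append(nxt)

    return {pair: freq for pair, freq in pairs.items()
            if pair[0] in distance and pair[1] in distance}

# Made-up pair frequencies, for illustration only.
pairs = {
    ("twitter", "users"): 5,
    ("users", "follow"): 3,
    ("follow", "friends"): 2,
    ("apple", "iphone"): 7,   # unrelated content, never reached from "twitter"
}
ego = node_centric(pairs, "twitter", steps=2)
print(ego)
```

The unrelated ("apple", "iphone") pair is excluded, which is the "apples and oranges" separation described above.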

Summary and Next Demonstration:
So far we have demonstrated how to take a text file, run WordLink to get the word and word-pair frequencies, and graph the semantic network with VISij. VISij can graph a single point in time or multiple points in time.
Next we will move beyond visualization to the comparison of text files from two different time periods. First, QAP will determine the overall degree of network similarity. Second, the Z-Utilities will show us which words and word pairs are new, which are dropping in relative frequency, and which are remaining the same in relative frequency (proportion).
We will compare Twitter news in early 2008 (n=58) and early 2009 (n=182).

If you wish to see how to do a graph of single points in time for networks, see the 17 step process in slides XX to XX.

QAP is an overall measure of the similarity of two networks using a correlation coefficient.
Enter the two .pr files you wish to compare; the order of the two files does not matter. Here we enter the 2008 and 2009 twitter files.
You may leave the Permutations value at the default of 100, which generates 100 random permutations against which a probability of significance is computed.
Click Analyze Now.
After the message that the program is finished, click OK and observe the contents of the log file for the correlation and probability.
Click Close. (If you click Quit it will terminate WORDij.)
The results for the two twitter files show a small correlation.
See Notes file below.

The correlation value is rarely large for most files. Correlations range from -1.00, a perfect negative correlation, to +1.00, a perfect positive correlation. The correlation of the two twitter files was r = -.142. Correlations are usually small because you are comparing a large number of cells in the virtual matrices. Nevertheless, a small correlation does not rule out differences at the word and word-pair level that are statistically significant, large, and perhaps substantively important to you. These are revealed by the Z-Utilities in the next section.
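For readers who want to see the general idea behind a QAP correlation, here is a minimal pure-Python sketch. It is not QAPNet's code, and the slides do not document the exact procedure. The sketch assumes the two pair lists are laid out as square word-by-word matrices over the union of their words, correlates the off-diagonal cells, and then relabels the rows and columns of one matrix at random to build a reference distribution. All function names and the sample data are hypothetical.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def to_matrix(pairs, words):
    """Lay pair frequencies out as a square word-by-word matrix."""
    index = {word: i for i, word in enumerate(words)}
    matrix = [[0] * len(words) for _ in words]
    for (first, second), freq in pairs.items():
        matrix[index[first]][index[second]] = freq
    return matrix

def off_diagonal(matrix, order):
    """Flatten the off-diagonal cells after relabeling rows and columns by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]]
            for i in range(n) for j in range(n) if i != j]

def qap(pairs1, pairs2, permutations=100, seed=0):
    """Return (observed correlation, share of permutations at least as extreme)."""
    words = sorted({w for pair in list(pairs1) + list(pairs2) for w in pair})
    m1, m2 = to_matrix(pairs1, words), to_matrix(pairs2, words)
    identity = list(range(len(words)))
    base = off_diagonal(m1, identity)
    observed = pearson(base, off_diagonal(m2, identity))
    rng = random.Random(seed)
    hits = 0
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # same relabeling applied to rows and columns
        if abs(pearson(base, off_diagonal(m2, order))) >= abs(observed):
            hits += 1
    return observed, hits / permutations

# Made-up pair frequencies; comparing a network with itself gives r = 1.
net = {("a", "b"): 5, ("b", "c"): 2, ("c", "a"): 1}
r, p = qap(net, net)
print(round(r, 3), p)
```

Permuting rows and columns together keeps each network's structure intact while breaking the correspondence between the two, which is what makes the comparison distribution meaningful.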

QAPNet is a single measure of overall similarity at the whole network level.

Even though the QAPNet correlation of the two twitter files is low, there may be a number of words or word pairs that have a higher relative frequency in one file than in the other, and others with no significant difference. This is revealed with the Z-Utilities.
There are two different types of Z comparisons: word proportions and word-pair proportions. There are actually two types of word-pair comparisons: one is for the main .pr file and the other is for the pairs in the NodeTric .net files, in which only the pairs in the node-centric network around a focal word are compared.

Z-Utilities allow you to compare two text files and determine what the significant differences are for either the words or the word pairs or the pairs from NodeTric .nets.

Although these are called Z-Utilities, there are actually two significance tests for comparisons of words, pairs, and the pair output of the NodeTric .net files. One is the Z-test for proportions (relative frequencies) and the other is the Chi-Square test of differences in counts. The Z-test cannot produce a value when one of the pairs has a frequency of zero, so we substitute a very small constant for zero. For a one-tailed test on the standard normal distribution, the critical z value for two proportions at p < .05 is +/- 1.645, at p < .01 is +/- 2.326, and at p < .001 is +/- 3.090.

The Chi-Square test may be preferred by some analysts because it works directly on counts rather than on proportions. Nevertheless, if the number of occurrences in one or both of the files is less than 5, then Chi-Square statistics should not be used because the estimates are unreliable. The value of Chi-Square that is statistically significant at p < .05 for 1 degree of freedom (number of cells - 1) is 3.841. Values higher than this are significant at higher levels. For example, for p < .01 the critical value is 6.635, and for p < .005 it is 7.879.

Enter or browse for the first file and then for the second file. Take note of the order, because it determines whether a difference shows up as a negative or a positive z-score. You must browse to create the output file name inside the Select window. Do not merely type the name in the WORDij output file slot, or no output file will be produced. You will get a warning message if you browse to a file name that is one of the input files.

The next slide is a screen shot of the z-word pair comparison.

The above screen shot shows the output file name we have chosen. We suggest you open it in Excel and specify Fixed Width format. This will place each column of the output file into a separate Excel column. By default the file is sorted by z-score in ascending order, so that the largest negative values appear first.
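The two tests named in these notes can be sketched in Python as follows. This is an illustration under stated assumptions, not the Z-Utilities' own formulas, which the slides do not give. The z-test uses the common pooled-proportion form, with file 1 minus file 2 so that a negative z means file 2 is higher. The chi-square compares the two observed counts against counts expected from the relative sizes of the two files (two cells, 1 degree of freedom) and returns None, standing in for “NA,” when either count is below 5. All counts and file sizes in the example are made up.

```python
from math import sqrt

def z_two_proportions(count1, total1, count2, total2):
    """Pooled z-test for two proportions; negative z means file 2 is higher."""
    p1, p2 = count1 / total1, count2 / total2
    pooled = (count1 + count2) / (total1 + total2)
    std_err = sqrt(pooled * (1 - pooled) * (1 / total1 + 1 / total2))
    if std_err == 0:
        return 0.0  # both counts zero (or both saturated): no difference to test
    return (p1 - p2) / std_err

def chi_square_counts(count1, total1, count2, total2, minimum=5):
    """Two-cell chi-square (1 df); None when a count is below `minimum`."""
    if count1 < minimum or count2 < minimum:
        return None  # corresponds to the "NA" entries in the output
    combined = count1 + count2
    # Assumption: expected counts are proportional to the size of each file.
    expected1 = combined * total1 / (total1 + total2)
    expected2 = combined * total2 / (total1 + total2)
    return ((count1 - expected1) ** 2 / expected1
            + (count2 - expected2) ** 2 / expected2)

# Hypothetical pair: 10 occurrences out of 1000 pairs in file 1,
# 40 occurrences out of 1000 pairs in file 2.
z = z_two_proportions(10, 1000, 40, 1000)
chi = chi_square_counts(10, 1000, 40, 1000)
rare = chi_square_counts(2, 1000, 40, 1000)
print(round(z, 2), chi, rare)
```

In this made-up case the z-score is strongly negative and the chi-square exceeds 3.841, so both tests agree that the pair is more frequent in the second file. Only the z-score carries the sign.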
Negative values mean there is a higher relative frequency for group 2 (time 2) than for group 1 (time 1). These are word pairs that are relatively new or have significantly increased in frequency. Alternatively, sort by Chi-Square in descending order. All the “NAs,” which appear when there are fewer than 5 counts in a group, will be listed first; after the NAs come the highest chi-square values. Chi-square does not distinguish the direction of a difference with negative or positive signs (all values are positive), so you look at the proportions and counts to see which one is lower or higher and so determine whether the difference is an increase or a decrease.

There are three kinds of z-tests and chi-squares that WORDij 3.0 computes: words, word pairs, and pairs from the selected “ego-centric,” node-centric networks from NodeTric. Because pairs are the most important elements for identifying networks, we illustrate reading them into Excel and interpreting them. The same process applies to words and to word pairs from NodeTric .net file comparisons.

Reading a z-test pair output file into Excel:
Open the file and click Finish.
Click the upper left corner to select the entire spreadsheet.
Then click Format Column, AutoFit Selection. This will adjust your column widths.

Here is how to format the Excel file for optimal viewing.
Select columns D, E, F, and G to format.
Select Format, Cells, Number, and increase the decimals to 4.
Click OK.

This is a snapshot of a portion of the Excel file, sorted by z-score in ascending order. Negative z-score values mean that group 2 (here time 2) had higher relative frequencies than group 1 (here time 1). These are word pairs that are significantly increasing over time. If the comparison files were not time-based, negative values would simply show the second file as having higher values.

Word pairs with positive z-scores are “dropping,” or lower in relative frequency in group 2 compared to group 1.

Here are word pairs that did not change or were not different in the two files. These are indicated by insignificant z-scores, those between -1.64 and +1.64:

Word pairs sorted by chi-square in descending order:

This shows the value of using both z-scores and chi-square tests, because they do not always produce the same results. For example, chi-square is unsigned. One must examine the difference in word pair counts and proportions to judge directionality and/or substantive significance. Chi-square is based on the actual counts, while z-tests are based on proportions.

Here is a sample of a screen from the z-tests and chi-square tests for comparing two .net files produced by NodeTric with “twitter” as the focal word. This illustrates word pairs that are higher in relative frequency in group 2 than in group 1. Notice that the Z-tests have significant values because we use a very small constant for zero. But one should not use chi-square with cell frequencies less than 5, so this shows NA for these tests even though the results appear possibly substantively significant.

Here are the NodeTric .net pairs with positive z-scores, which are significantly lower in relative frequency at time 2 (group 2) than at time 1 (group 1). Notice how “virtual worlds” drops out.

Remember that files need not be time-based. With this utility you can compare two files created on any criterion and see which has more or less of some word or word-pair attribute.

These sample screen shots show some of the NodeTric .net word pairs that remained the same in relative frequency from time 1 (group 1) to time 2 (group 2) according to z-scores, but not according to all chi-squares.

Here is a focus on chi-square results for the NodeTric .net word pairs.

How to Create Semantic Network Graphs in UCINET and NetDraw
Importing WORDij .net files
Converting .net files to system files in UCINET
Creating the visualization in NetDraw

Graphing the semantic network with NetDraw is a 17-step process.
Breaking it down step by step makes it easy to make a graph:

1. Start UCINET (http://www.analytictech.com/).
2. Click on the Data tab.
3. Move your cursor to the Import option.
4. Move your cursor to Pajek and click it.
5. In the dialog box, browse for your twitter.net file and accept the default file names for the remaining slots in the box.
6. Close the Output Log file.
7. Click the Visualize tab.
8. Select NetDraw.
9. Click File.
10. Click Open, select UCINET dataset, and click Network.
11. Click on the … box, browse to find your twitter.##h file, and click OK.
12. After the file loads and you see red circles, click the Rels tab in the upper right corner.
13. Go near the bottom of that box to the small boxes containing > and 0. Replace the 0 with 100, which drops word pairs whose frequency is less than 100.
14. Click the top-level tab ISO to remove isolate nodes after frequency pruning.

15. Click Layout and select Graph Theoretic, Spring Embedding and click OK in the dialog box.
16. Experiment with different frequency prunings until you get a graph you want to save, then click File, Save Diagram As, Bitmap, and give the file a name.
17. You can insert the twitter.bmp file into documents or slides.

Note: Spring layout produces a different perspective on each run but keeps the same distances between all nodes. So, your graph will look different each time you run Layout, Graph Theoretic Layout, Spring Embedding.