3/18/22, 8:02 PM CMPUT379 – Tokenizing Strings
CMPUT 379: Experiments with Tokenizing Strings 1. Introduction
A programming task that is often needed is to read a multi-line input
text file and pick each line apart to retrieve the stored data. To do
Copyright By PowCoder代写 加微信 powcoder
this, it is often best to read the file one line at a time, store the
line in a buffer, and process the buffer.
When the format of each line follows a simple fixed pattern, one may consider using function sscanf() to solve the problem. Often, however, each line carries a variable number of data items separated by a variable number of white space characters (blanks, tabs, new lines, and carriage returns). In such cases, a better approach is to split each line into tokens (using, e.g., function strtok()) and then use the obtained tokens to parse the line. This lab activity develops this direction further.
2. A Running Example
We use, as a running example, the task of reading a graph, composed
of nodes and edges, where each node v has an (x,y)-coordinate.
A compact encoding of the input may use two blocks of the following
Graph = [ (1,2) (2,3) (3,4) …. ]
XY = [ (1,10,20) (2,20,30) (3,30,40) … ]
each element in the “Graph= [ … ]” block specifies an edge (v,w), and
each element in the “XY= [ … ]” block has the form (v, x- coordinate, y-coordinate) specifying the (x,y)-coordinates of node v.
We’d like to give the user the flexibility of inserting white space
characters as the user wishes. So, each input line stores an unknown
number of elements, subject to having a bound on the length of each
In addition, we want to allow the user to input any number of comment
lines (a comment line starts with a #).
Suggested Activity: Store the following data in a file, called ‘graph.data’
# Each comment line starts with a #
Graph = [ ( 1 , 2 ) ( 2 , 3 ) ( 3 , 4 )
(4,5) (5,6)
XY= [ (1,10,10) (2,20,20) (3, 30, 30) (4, 40, 40)
(5,50,50) (6, 60,60) (7, 70,70) (8, 80,80) ]
webdocs.cs.ualberta.ca/~cmput379/W22/379only/lab-tokenizing.html
3/18/22, 8:02 PM CMPUT379 – Tokenizing Strings
# End of file
The main solution method discussed here relies on developing two carefully designed string processing functions, called ‘split’ and ‘gsub’, as described next.
3. Function split
Function split is declared as
int split(char inStr[], char token[][MAXWORD], char fs[])
The function splits the input string inStr into the output array token using the characters in fs as field separators. It returns the number of tokens stored in array token. Each token has at most MAXWORD characters.
Example: Suppose string line= “Graph = [ (1,2)” then
split (line, token, “=”) returns 2 tokens token[0]= “Graph “, and token[1]= ” [ (1,2)”
split (line, token, “=[“) returns 3 tokens token[0]= “Graph “, token[1]= ” “, and token[2]= ” (1,2)”
split (line, token, “=[(“) returns 4 tokens token[0]= “Graph “, token[1]= ” “, token[2]= ” “, and token[3]= “1,2)”
split (line, token, “=[( ),”) returns 3 tokens token[0]= “Graph”, token[1]= “1”, and token[2]= “2”
4. Implementing Function split
Function split can be implemented using the strtok function (a
standard C string function).
The C Programming Language book (B.W. Kernighan and D.M. Ritchie) describes function char *strtok(s,ct) concisely and accurately as follows:
“strtok searches s for tokens delimited by characters from ct. A sequence of calls of strtok(s,ct) splits s into tokens, each delimited by a character from ct.
The first call in a sequence has a non-NULL s.
It finds the first token in s consisting of characters not in ct; it terminates that by overwriting the next character of s with a ‘\0’ and returns a pointer to the token.
Each subsequent call, indicated by a NULL value of s, returns the next such token, searching from just past the end of the previous one.
strtok returns NULL when no further token is found.” The following is a possible implementation of split:
#include …. // stdio.h, string.h, stdlib.h, stdarg.h
#define MAXLINE 132
#define MAX_NTOKEN MAXLINE
#define MAXWORD 32
webdocs.cs.ualberta.ca/~cmput379/W22/379only/lab-tokenizing.html
3/18/22, 8:02 PM CMPUT379 – Tokenizing Strings
int split(char inStr[], char token[][MAXWORD], char fs[])
int i, count;
char *tokenp, inStrCopy[MAXLINE];
memset (inStrCopy, 0, sizeof(inStrCopy));
for (i=0; i < MAX_NTOKEN; i++) memset (token[i], 0, sizeof(token[i]));
strcpy (inStrCopy, inStr);
if ( (tokenp= strtok(inStr, fs)) == NULL) return(0);
strcpy(token[count],tokenp); count++;
while ( (tokenp= strtok(NULL, fs)) != NULL) {
strcpy(token[count],tokenp); count++;
strcpy (inStr, inStrCopy);
return(count);
5. Function gsub
Function gsub is declared as
int gsub (char *t, char *omitSet, char *repStr)
The function replaces each occurrence in the string t of a character that matches some character in the omitSet with the replacement string repStr. When the first character of the omitSet is '^', matching and replacement proceeds until we encounter the first character in t not in omitSet. The function returns the number of substitutions made.
Examples: Suppose string line= " ? Graph = [ (1,2)" then
gsub (line, "^ ?", ".") returns line= "...Graph = [ (1,2)" (3 substitutions)
gsub (line, " ?", ".") returns line= "...Graph.=.[.(1,2)" (6 substitutions)
gsub (line, " ?", "") returns line= "Graph=[(1,2)" (6 substitutions)
6. Implementing Function gsub
Below is an implementation of function gsub using basic C string
functions (e.g., strcat and strcpy).
Briefly, the function tests (omitSet[0] == '^'). If yes, it sets task= MBEGIN to match and replace at the beginning of the string only, i.e., until it encounters an ordinary character that is not in the omitSet (this sets the ordinary character seen variable ocSeen to one). Else, it sets task= MALL to globally match and replace characters in the input string t.
#include .... // stdio.h, string.h, stdlib.h, stdarg.h
#define MAXLINE 132
#define MAX_NTOKEN MAXLINE
#define MAXWORD 32
webdocs.cs.ualberta.ca/~cmput379/W22/379only/lab-tokenizing.html
3/18/22, 8:02 PM CMPUT379 - Tokenizing Strings
int gsub (char *t, char *omitSet, char *repStr)
int i,j, ocSeen, match, nmatch;
char outStr[MAXLINE];
enum {MBEGIN= 1, MALL= 2} task;
task= (omitSet[0] == '^')? MBEGIN : MALL;
memset (outStr, 0, sizeof(outStr));
nmatch= 0; ocSeen= 0; // Ordinary character not seen yet
for (i= 0; i < strlen(t); i++)
if ( (task == MBEGIN) && (ocSeen == 1)) {
outStr[strlen(outStr)]= t[i]; continue;
match= 0; // no match found yet for character t[i]
for (j= 0; j < strlen(omitSet); j++)
if (t[i] == omitSet[j])
{ match= 1; nmatch++; strcat(outStr, repStr); break; }
if (match == 0) { ocSeen= 1; outStr[strlen(outStr)]= t[i]; }
strcpy (t, outStr);
return(nmatch);
7. Suggested Activity
Write a function to test function split and its returned value. Write a function to test function gsub and its returned value.
Write a function to read input from file 'graph.data' and print the
set of edges defined in the Graph block. Below is a partial
implementation of such a function where
The function uses fgets to read the input file one line at a time, and store the line in buf.
If needed, it splits the string in buf into array x (nx stores the returned number of fields). If a further splitting of a field, say x[i], is needed, array y is used (ny stores the returned number of fields).
Initially, the function is parsing the main block (so, block= MAIN). When the Graph block is entered, it sets block= GRAPH. When the ']' in the graph block is detected, the program resets block= MAIN.
#include .... // stdio.h, string.h, stdlib.h, stdarg.h
#define MAXLINE 132
#define MAX_NTOKEN MAXLINE
#define MAXWORD 32
#define MAX_NEDGE 30
void readInput (char fileName[])
int i, edge, nEdges, g[MAX_NEDGE][2];
char buf[MAXLINE];
int nx, ny;
webdocs.cs.ualberta.ca/~cmput379/W22/379only/lab-tokenizing.html
3/18/22, 8:02 PM CMPUT379 - Tokenizing Strings
char x[MAX_NTOKEN][MAXWORD], y[MAX_NTOKEN][MAXWORD];
FILE *fpin;
enum {MAIN, GRAPH, XY} block;
fpin = fopen (fileName, "r");
if (fpin < 0) perror("readInput: file open failed.\n");
block= MAIN; nEdges= 0;
while (fgets (buf, MAXLINE, fpin) != NULL)
... ... ...
fclose (fpin);
printf ("Graph [nEdges= %d]= ", nEdges);
for (edge= 0; edge < nEdges; edge++)
printf("(%d,%d) ", g[edge][0], g[edge][1]);
printf ("\n");
8. Acknowledgment
If you use the above code in this course or some other course,
acknowledge the code shared in CMPUT 379 labs.
CMPUT 379: U. of Alberta, Author: E. Elmallah
webdocs.cs.ualberta.ca/~cmput379/W22/379only/lab-tokenizing.html
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com