Report
Project 1:
Approximate String Search for Geolocation of Tweets
Subject: COMP90049 Knowledge Technology
Ying Chen 633582
1. Task Description
First, preprocess the data files to extract the tweet texts and the location names. Then, for each location name, find the tweet ID (or IDs) of every tweet whose text contains that location name.
2. Method 1
2.1 Analysis
The configuration of my laptop is as follows: Intel Core i3-380 2.53 GHz, 2 GB RAM. Working with the full data set is too challenging for this machine, so I can only work with samples of both the query and the target data sets. The task thus divides into two parts: creating the sample data sets, and finding the tweet IDs in the target texts.
How big should the sample data set be? This is the first problem.
First, I ran an experiment to measure the running time on the small data set, which is about 0.1% of the full one: it took nearly 4 seconds to find the tweet IDs for every query. If there are m location names and n target texts, and each text has about p tokens, the time complexity of finding the tweet IDs in every text is O(m*n*p). That is, if the sample data set is 1% of the full one (10 times the small one, so both m and n grow 10-fold), the running time will be roughly 4*10*10 = 400 seconds, which is acceptable. So I decided to create 3 samples: 1%, 0.5% and 0.1% of the full data set.
2.2 Creating Samples
The method to create a sample is as follows: read the full data set, randomly select records from it according to the chosen proportion, then save them.
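A minimal sketch of this sampling step (the file names and the one-record-per-line layout are my assumptions, not the report's actual code; if a tweet's ID, text and timestamp span several lines, whole records should be sampled together):

```python
import random

def create_sample(full_path, sample_path, proportion):
    """Keep each line of the full data set with probability `proportion`.

    Assumption: one record per line; `full_path`/`sample_path` are
    hypothetical file names standing in for the real data files.
    """
    with open(full_path, encoding="utf-8") as src, \
         open(sample_path, "w", encoding="utf-8") as dst:
        for line in src:
            if random.random() < proportion:
                dst.write(line)
```

Note that with this scheme the sample differs on every run, which is why the effectiveness examples in section 2.5 use the fixed small data set instead.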
2.3 Finding Tweet ID
I extract each tweet ID and its corresponding tweet text from the sample target file and save them in a dictionary of the form {tweet ID: tweet text}. Then I extract the location names into a list; each element of the list is a query. The search strategy is as follows: for each query, search for it in every tweet text; if the location name is found in a tweet text, return that tweet's ID. As mentioned above, the time complexity of this part is O(m*n*p).
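The search step described above can be sketched as follows (the dictionary layout follows the description; the function name is mine):

```python
def find_tweet_ids(query, tweets):
    """Return the IDs of all tweets whose text contains `query`.

    `tweets` is the {tweet ID: tweet text} dictionary described above.
    For m queries, n tweets and texts of length about p, scanning
    every text for every query costs O(m*n*p) overall.
    """
    return {tid for tid, text in tweets.items() if query in text}
```

With the "Pittsburgh" example from section 2.5, this returns the two matching tweet IDs.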
2.4 Result
I ran this task on the 3 sample data sets. The results are shown in Table 1.
Table 1: Running time on 3 sample data sets with method 1

Sample data set   Time to select data (s)   Time to find tweet ID (s)   Total time (s)
0.1%              13.84                     3.86                        17.70
0.5%              17.91                     121.34                      139.25
1%                15.50                     511.49                      526.99
2.5 Effectiveness
Theoretical analysis suggests that the method is effective; here are two examples.
Because my own sample data set differs on every run of the program, I use the small data set here to demonstrate the method's effectiveness. The first location name is “Pittsburgh”, which appears twice in the target file:
“39912153
4642820112
New Cleveland Cavaliers star Shaquille O’Neal rooting for Pittsburgh Steelers http://getu.in/37G-
2009-10-05 19:51:04”
“51719882
5843125142
Acting US Attorney for Pittsburgh Named http://bit.ly/3Sv1v4
2009-11-18 18:47:23”
So the correct result is {4642820112, 5843125142}, which matches my program's output.
The second location name is “Franklin”, which appears 4 times in the target file:
“6704922
2225007969
Tonight (6/18): FRIEND REQUEST- my one person show about social networking. 8pm UCB Theatre LA (5919 Franklin). Please come. $5.
2009-06-18 12:11:52”
“22171352
6416961167
“Well done is better than well said.” Ben Franklin 2009-12-06 19:39:19”
“23249204
6405537804
He that lieth down with dogs shall, rise us as with fleas.-Benjamin Franklin
2009-12-06 12:10:10”
“300 N Franklin, Kirksville”
Note that the fourth record is pathological data (it lacks a tweet ID), so the correct result is {2225007969, 6416961167, 6405537804}, which again matches my output. This demonstrates that the method is effective.
2.6 Advantages and Disadvantages
Advantages: it is easy to devise, understand and implement, and its effectiveness is very good.
Disadvantages: it takes too long to find the location names in the target texts.
3. Method 2
3.1 Method Description
In method 1, I processed both the texts and the queries into separate tokens. In method 2, I still process the queries into separate tokens, but treat the full tweets file as a single text. I extract each tweet ID and its corresponding tweet text from the sample target file and concatenate them into one string of the form "ID1 text1 ID2 text2 ... IDm textm". I then search for each location name directly in this whole string; whenever the location name is found, the nearest ID before the match is the tweet ID to return.
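The idea can be sketched as follows. This is only a sketch, not the report's actual program: I assume tweet IDs are long runs of digits (6 or more), and the backward scan here re-scans the prefix on every match, which a real implementation would avoid.

```python
import re

def find_ids_in_blob(query, blob):
    """Search the single 'ID1 text1 ID2 text2 ...' string for `query`.

    For each occurrence of `query`, the nearest preceding token that
    looks like a tweet ID (assumed: a run of 6+ digits) identifies
    the tweet containing the match.
    """
    ids = []
    pos = blob.find(query)
    while pos != -1:
        preceding = re.findall(r"\b\d{6,}\b", blob[:pos])
        if preceding:
            ids.append(preceding[-1])
        pos = blob.find(query, pos + 1)
    return ids
```

A sketch like this also hints at why the method can miss IDs: a pathological record without its own ID (such as the fourth "Franklin" record) is silently attributed to the previous tweet's ID.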
3.2 Result
Method 2 took much less time than method 1 to find the tweet IDs. As Table 2 shows, method 2 greatly reduced the search time.
Table 2: Time to find tweet ID on 3 sample data sets with method 2

Sample data set   Time to find tweet ID (s)
0.1%              0.92
0.5%              4.17
1%                84.64
3.3 Effectiveness
The previous examples also illustrate the effectiveness of method 2. When searching for “Pittsburgh”, it returned both correct IDs; but when searching for “Franklin”, it returned only 2 of the 3 correct IDs. So the effectiveness of method 2 is worse than method 1's; taking the running time into account, however, it is still acceptable.
3.4 Advantages and Disadvantages
Advantages: it takes less time to find the location names in the target texts and return the correct tweet ID (or IDs).
Disadvantages: first, it treats the full data set as one string, so it needs more memory; second, its effectiveness is not as good as method 1's.
4. Scalability
As Table 1 shows, the times spent selecting the three sample data sets were 13.84 s, 17.91 s and 15.50 s, with little difference between them. The time to search for the location names, however, varied greatly with the size of the target data set. The time complexity is O(m*n*p), where m is the number of location names, n is the number of target texts and p is the length of each target text. A 10-fold increase in both m and n, with p unchanged, therefore means at least a 100-fold increase in running time. So a faster method is needed.
Method 2 is a much faster strategy than method 1; it can process a bigger data set with acceptable effectiveness.
5. Challenges
The biggest challenge for me was that the full data set is too big for my machine, so I had to work with samples of it. Regrettably, at most 1% of the full data could be used in my experiments.
Another challenge was choosing the method: one task can be solved by several methods, but we should look for the best one. This lesson applies to all study and research.