COMP5349: Cloud Computing
Week 1: Git Tutorial
School of Computer Science, Dr. Ying Zhou
Sem. 1/2020, 27.02.2020
A brief Intro to Git
Git is a version control system that allows people, mainly developers, to manage a project or a set of files by tracking their changes over time. A centralised hosting service is usually required to synchronise the changes made by different users or by the same user from different machines.
The most popular public web-based hosting services using Git are https://github.com/ and https://bitbucket.org/. Many organisations set up their own enterprise Git services to give members free access to the premium features. Our university has recently launched a hosting service based on GitHub Enterprise. You can find all the information and the login link on this page: https://informatics.sydney.edu.au/code-repository/
The most basic concept in Git is the repository. It is a data structure used to store all the information of a project. A repository lives in the root directory of a project. It is stored in a hidden subdirectory called .git.
In the first exercise, we will explore the repository structure and familiarise ourselves with a few basic git commands. In the second exercise, we will examine ways to synchronise the repository.
Please note that the tutorial only introduces some basic commands. They do not cover all possible scenarios you may encounter while using Git. When you encounter a unique scenario that you are unsure about, please consult the online resources. The main focus of this tutorial is to help you understand the internal structure of a Git repository. This is important as it helps you in understanding the Git commands and subsequently solving various Git problems.
You are encouraged to use a Linux-based system for this lab to be able to run all exercises. However, you can run basic Git commands under Windows.
Question 1: Understanding the Local Repository
The aim of this exercise is to understand a Git local repository.
a) Reboot to Linux (optional)
In this tutorial, we will be using Linux. All the exercises can be done on your personal Mac without any changes. Please ensure that Git is installed on it. If you do not have Git installed on your Mac, please follow the instructions at https://www.atlassian.com/git/tutorials/install-git#mac-os-x. If you prefer to do the lab exercises on Windows, make sure you use Windows PowerShell. At some point, you may need to use Windows-specific commands. These commands are not covered in this tutorial. Please consult the online resources.
If the lab machine is currently on Windows, reboot it into Linux. You will see the selection screen after restarting the machine. Once you are at the Linux login screen, log in with your unikey and password. The Linux version installed on the lab machines is Red Hat Enterprise Linux 7.
b) Initialise an empty repository
We will begin by initialising an empty local repository. Open a terminal window and change to your working directory using the cd command, such as cd ~/comp5349. Afterwards, run the following commands:
mkdir week1
cd week1
echo comp5349 > enrol.txt
git init
The mkdir and cd commands create a directory named week1 and change your current working directory to it. This directory will be used as your project's root directory. The echo command creates a text file enrol.txt with content "comp5349". The git init command initialises an empty local Git repository inside the week1 directory.
In this directory, you should be able to see a hidden .git directory by using the command ls -a. This .git directory contains all the metadata and the actual project data, stored in various sub-directories. In general, Git repositories have the same structure, i.e. the same set of sub-directories. You can see the content of the .git directory using the command ls .git. There should be entries with names like HEAD, objects, refs, etc. (HEAD is a plain file; objects and refs are directories.)
In the subsequent exercises, we will focus on the objects directory. Git stores every version of the project data in an “Object Database” residing in the objects directory. Currently, you will not see any files in it except for the two sub-directories. This is because our Git repository is still empty.
c) Adding files to your repository
Let's add some files into our empty Git repository. Go back to the project's root directory and issue the following commands:
git add enrol.txt
find .git/objects -type f
You should now see .git/objects/cd/bb2ce4ae5b3765b0d33a1acbff426c258b4bcd
The git add enrol.txt command adds the file enrol.txt into the repository. This creates a blob object in the .git/objects directory. The blob object is not given the same name as the file. Instead, it is named using the SHA1 hash of the file content. Git uses a clever two-level storage structure for all objects: it uses the first two characters of the SHA1 hash as the directory name, and the rest as the object's file name. In the above example, the SHA1 hash of the file content is cdbb2ce4ae5b3765b0d33a1acbff426c258b4bcd. The first two characters, cd, are used as the directory name.
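The naming scheme described above can be reproduced by hand. The following is a minimal sketch, assuming the sha1sum tool from GNU coreutils is available (on macOS, shasum is the usual equivalent). It relies on the fact that Git hashes a short header, "blob", the content size and a NUL byte, together with the file content. Compare the printed name with the one reported by find .git/objects.

```shell
# Recompute the object name for a file containing "comp5349" plus a newline,
# without touching any repository.
content_size=9   # "comp5349" is 8 bytes, plus the newline added by echo

# Git hashes the header "blob <size>" and a NUL byte, followed by the content.
hash=$(printf 'blob %d\000comp5349\n' "$content_size" | sha1sum | cut -d' ' -f1)

# Split the 40-character name the way the object database does.
dir=$(printf '%s' "$hash" | cut -c1-2)
file=$(printf '%s' "$hash" | cut -c3-)

echo "object name: $hash"
echo "stored as:   .git/objects/$dir/$file"
```

Inside a repository, git hash-object enrol.txt reports the same name without the manual steps.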
Note that the file is not stored as it is in the repository. You cannot view the content of the blob object with a command like cat. This is because the file is first compressed and then stored. Git provides its own facility for you to view the object content.
To view the content of object cdbb2ce4ae5b3765b0d33a1acbff426c258b4bcd, execute the following command. It works from the project's root directory, so there is no need to change into .git/objects (staying in the root directory also keeps the later commands working as written):
git cat-file -p cdbb2ce4ae5b3765b0d33a1acbff426c258b4bcd
You should see "comp5349", the content of the file.
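Besides -p, git cat-file can also report the type and size of an object. The sketch below is only an illustration: it works in a throwaway directory created with mktemp (an assumption of this example, so the week1 repository is left untouched) and looks the object name up with git hash-object instead of copying it by hand.

```shell
# Work in a throwaway directory so the week1 repository is left untouched.
tmp=$(mktemp -d)
cd "$tmp"
git init -q
echo comp5349 > enrol.txt
git add enrol.txt

# git hash-object prints the object name Git assigns to the file's content.
name=$(git hash-object enrol.txt)

obj_type=$(git cat-file -t "$name")   # type of the object
obj_size=$(git cat-file -s "$name")   # size of the uncompressed content in bytes
content=$(git cat-file -p "$name")    # the content itself

echo "$name is a $obj_type of $obj_size bytes containing: $content"
```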
If you have two files with exactly the same content, e.g. one created by copying the other, there will only be one copy in Git's object database. This is because they share the same SHA1 value. To test this out, create a backup of the enrol.txt file and add it to the Git repository:
cp enrol.txt enrol.bak
git add enrol.bak
You will notice that the object database remains unchanged: find .git/objects -type f still lists a single object.
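The deduplication can be checked by counting the objects before and after adding the copy. This is a small sketch that repeats the steps in a throwaway directory (the use of mktemp is an assumption of this example, not part of the exercise).

```shell
tmp=$(mktemp -d)
cd "$tmp"
git init -q
echo comp5349 > enrol.txt
cp enrol.txt enrol.bak

git add enrol.txt
before=$(find .git/objects -type f | wc -l | tr -d ' ')

git add enrol.bak
after=$(find .git/objects -type f | wc -l | tr -d ' ')

# Both files hash to the same object name, so no second blob is written.
if [ "$(git hash-object enrol.txt)" = "$(git hash-object enrol.bak)" ]; then
    same=yes
else
    same=no
fi

echo "objects before: $before, after: $after, same name: $same"
```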
d) Creating a commit
So far, the repository is only capturing the content of your project files i.e. no name is associated with any blob object. The name-content association only happens during commit. A commit is a snapshot of your project that can be retrieved when necessary.
Run the following command to create a commit:
git commit -m "First Commit"
If you have never run git commit before, the terminal will prompt you to set up the name and email address of the committer, that is, you. Set them up by running the following commands (replace the quoted placeholders with your own name and email address):
git config --global user.name "Your Name"
git config --global user.email "your.email@example.com"
Run find .git/objects -type f to find all the objects in the object database. You will notice there are 3 objects there. If you use the same file names as the ones in the previous exercises, two of your objects will be:
.git/objects/cd/bb2ce4ae5b3765b0d33a1acbff426c258b4bcd
.git/objects/8d/935bd4ec8fad4bc451e9d26fe206bd0ebdd5e7
If you compare the names of these files with those of your classmates, you will find that the two objects above (the blob and the tree object) have the same names, but the third object does not. The third object, the commit object, will always have a name that is unique to each repository, because its content includes the committer's details and a timestamp. Hence there is no way to predict it.
The commit command creates two objects: a tree object and a commit object. The tree object records the mapping between the blobs and the file names. As usual, the object name is the SHA1 hash of the content of the tree object. The commit object records a reference to the tree object, the author, the committer and the commit message, as well as a reference to its parent commit if there is one.
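The relationship between the commit, tree and blob objects can be inspected without copying hashes by hand. The sketch below is an illustration under some assumptions: it runs in a throwaway directory, and it sets a placeholder identity through the GIT_AUTHOR_* and GIT_COMMITTER_* environment variables so that it does not depend on your global configuration.

```shell
# Placeholder identity (an assumption of this sketch, not your real details).
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

tmp=$(mktemp -d)
cd "$tmp"
git init -q
echo comp5349 > enrol.txt
cp enrol.txt enrol.bak
git add enrol.txt enrol.bak
git commit -q -m "First Commit"

commit=$(git rev-parse HEAD)           # name of the commit object
tree=$(git rev-parse 'HEAD^{tree}')    # name of the tree object it refers to

commit_type=$(git cat-file -t "$commit")
tree_type=$(git cat-file -t "$tree")
entries=$(git cat-file -p "$tree" | wc -l | tr -d ' ')
total=$(find .git/objects -type f | wc -l | tr -d ' ')

git cat-file -p "$tree"      # one line per file: mode, type, blob name, file name
git cat-file -p "$commit"    # tree reference, author, committer, message
echo "$total objects in total: 1 blob, 1 $tree_type, 1 $commit_type"
```

Both tree entries point at the same blob, which is why only three objects exist after the first commit.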
Let us inspect the tree object starting with 8d – recall that git cat-file -p
no changes added to commit (use “git add” and/or “git commit -a”)
We can commit this change by executing:
git commit -a -m "add another course to enrol.txt"
You should see 6 objects in the Git object database: two commit objects, the two tree objects they refer to, and two blob objects. There are two blob objects because the two files (enrol.txt and enrol.bak) have different content now. Since every object is named by its SHA1 hash, you cannot tell which one is which by simply staring at the names. You can use git cat-file to inspect their content.
A repository usually contains many commits, and by default Git works from the latest one. It saves the name of the latest commit of a branch in a plain text file. We can find the latest commit by inspecting that file using the cat command:
cat .git/refs/heads/master
Copy the commit name stored in that file, and let's inspect it by passing it to git cat-file -p:
tree f7ad1f5c28a07089569c740babf15cd484b7e964
parent 7915e5ae9267f5e59197849509adf90347609c11
author ying.zhou
committer ying.zhou
add another course to enrol.txt
If you compare this commit with the first commit created in exercise d, you will notice that this one contains an extra piece of information: the link to the parent commit object. This is essentially how Git tracks the history of changes to the content. All commits are linked together through the parent pointer, like a linked list. The head of the list is pointed to by an external file. Our repository's commit history looks like: head -> c2 -> c1
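The linked-list view of the history can be followed step by step from the head. The sketch below rebuilds a two-commit history in a throwaway directory. The placeholder identity and the second line "comp5318" are assumptions made for this illustration.

```shell
# Placeholder identity (an assumption of this sketch).
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

tmp=$(mktemp -d)
cd "$tmp"
git init -q
echo comp5349 > enrol.txt
git add enrol.txt
git commit -q -m "First Commit"
echo comp5318 >> enrol.txt
git commit -q -a -m "add another course to enrol.txt"

c2=$(git rev-parse HEAD)       # head of the list
c1=$(git rev-parse HEAD~1)     # follow the parent pointer once

parent_of_c2=$(git log -1 --format=%P "$c2")   # parent recorded in c2
parent_of_c1=$(git log -1 --format=%P "$c1")   # empty: the first commit has no parent
length=$(git rev-list --count HEAD)            # number of commits reachable from head

echo "head -> $c2 -> $c1 (length $length)"
```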
f) Branching
At some point, you may want to add experimental code to the repository without risking breaking the main code or workflow. In this situation, you can use the branch feature. Branching allows you to create a line of development that is separate from the main one. The content of a branch can be based on any commit (snapshot) you have made in the past.
In Git, the first and default branch is conventionally called master (newer installations may be configured to use another name, such as main). So far, our repository contains only the master branch, whose history looks like a singly linked list: a head pointing to the latest commit, and each commit pointing to its parent.
To create a branch, you need to first identify which commit you want the branch to come out from. This involves finding out the name of the branch that contains the commit as well as the SHA1 file name of the commit object.
For instance, suppose you want to create a branch and start working on it from the first commit of the master branch. You can refer to the commit by the entire commit object’s SHA1 name or just its first few characters. Git provides a convenient command git log to find out the list of commits and their associated object names. Running it would give you an output similar to:
commit 5f9daf94afc5142217962ab061c2dc388f742e45
Author: yzho8449
Date: Tue Mar 13 10:40:22 2018 +1100
add another course to enrol.txt
commit 7915e5ae9267f5e59197849509adf90347609c11
Author: yzho8449
Date: Tue Mar 13 10:36:12 2018 +1100
First Commit
For example, if the commit's object name is 7915e5ae9267f5e59197849509adf90347609c11, the following command would create a branch called exp and set its head to that commit.
git branch exp 7915e5ae9267f5e59197849509adf90347609c11
Check your repository's object database using find .git/objects/ -type f and see if anything has changed.
Each branch has its own head and they are stored under .git/refs/heads. Since there are 2 branches in our repository now, you will find that there are two plain text files under .git/refs/heads: exp and master.
Git uses another file, .git/HEAD, to remember which branch we are currently working on. Running cat .git/HEAD will give you the output ref: refs/heads/master, which tells us that we are currently on the master branch. Alternatively, we can also use commands like git branch or git status to find out where we are. The output of both is quite intuitive.
To switch branches, we can use the git checkout command followed by the name of the branch, for example:
git checkout exp
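The claims in this section, that creating a branch adds no objects and that branch heads and HEAD are small text files, can be checked as follows. This is a sketch in a throwaway directory; the placeholder identity and the second line "comp5318" are assumptions, the branch name is read from the repository instead of being hard-coded as master, and the direct cat of files under .git assumes the default file-based reference storage.

```shell
# Placeholder identity (an assumption of this sketch).
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

tmp=$(mktemp -d)
cd "$tmp"
git init -q
echo comp5349 > enrol.txt
git add enrol.txt
git commit -q -m "First Commit"
echo comp5318 >> enrol.txt
git commit -q -a -m "add another course to enrol.txt"

first=$(git rev-list --max-parents=0 HEAD)   # the commit that has no parent

before=$(find .git/objects -type f | wc -l | tr -d ' ')
git branch exp "$first"                      # new branch starting at the first commit
after=$(find .git/objects -type f | wc -l | tr -d ' ')

exp_head=$(cat .git/refs/heads/exp)          # the branch head is just a commit name
git checkout -q exp
current=$(cat .git/HEAD)                     # HEAD now refers to the exp branch

echo "objects before: $before, after: $after"
echo "exp points at $exp_head; HEAD contains: $current"
```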
g) Merging
At some future time, we may be sure of all our changes in a branch, and would like to merge them into master. This process is called merging.
Let's try this. While still on the exp branch, add another line "comp5329" at the end of enrol.txt and commit the change. You can use echo comp5329 >> enrol.txt or a text editor to modify the enrol.txt file.
After this, you will have three commits (snapshots) in the repository. You may notice that they are no longer a single linked list (history line). Moreover, if you check the object database, you will find that there are nine objects in there now (you should be able to work out what they are). The repository’s history would look like Figure 1.
Figure 1: After exp branch commit. The first commit C1 is the common ancestor of master and exp. Both master and exp have each committed a different version after C1 (C2 on master, C3 on exp).
Now let’s try to merge the exp branch into the master branch. Execute the following commands:
git checkout master
git merge exp
You will get an output like the following:
Auto-merging enrol.txt
CONFLICT (content): Merge conflict in enrol.txt
Automatic merge failed; fix conflicts and then commit the result.
The message is quite self-explanatory. It indicates that the merge was not successful because there is a conflict in the enrol.txt file. A conflict (merge conflict) means there are multiple versions of enrol.txt, and Git was not able to decide how to merge them. Hence manual intervention is required.
As files get bigger and more complicated, manually resolving a merge conflict can be very daunting. Developers often rely on a GUI tool to resolve merge conflicts. However, since our file is still simple and tiny, we can simply use a text editor in the terminal (vim or nano) to resolve our merge conflict.
Open enrol.txt using your favourite text editor (nano, vim, etc), and you will see that the file content has been modified to:
comp5349
<<<<<<< HEAD
comp5318
=======
comp5329
>>>>>>> exp
<<<<<<< indicates the start of the merge conflict. It shows that the line "comp5318" is in HEAD, i.e. master, but not in the branch (exp). The ======= line separates the two versions, and the line underneath it shows the version of that line in the exp branch, i.e. "comp5329". The end of the conflict is marked by >>>>>>> followed by the branch name.
Edit the file until it is in a state you are happy with. For example, keep the versions from both the exp branch and master by changing the above to:
comp5349
comp5318
comp5329
and save it.
Afterwards, execute git commit -a -m "Resolved by incorporating both branches"
You will notice that this commit has two parents by inspecting the commit object’s content. The history of our repository will look like Figure 2.
Figure 2: After merging exp with master. The merge commit C4 on master has two parents, C2 and C3.
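The whole merge exercise, from the conflicting commits to the two-parent merge commit, can be replayed in one go. The following is a sketch under stated assumptions: it uses a throwaway directory and a placeholder identity, the commit message on exp is made up for the illustration, and the branch name is read from the repository instead of assuming master.

```shell
# Placeholder identity (an assumption of this sketch).
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

tmp=$(mktemp -d)
cd "$tmp"
git init -q
echo comp5349 > enrol.txt
git add enrol.txt
git commit -q -m "First Commit"                          # C1
main_branch=$(git symbolic-ref --short HEAD)
git branch exp                                           # exp starts at C1

echo comp5318 >> enrol.txt
git commit -q -a -m "add another course to enrol.txt"    # C2 on the main branch

git checkout -q exp
echo comp5329 >> enrol.txt
git commit -q -a -m "add comp5329 on exp"                # C3 (hypothetical message)

git checkout -q "$main_branch"
# git merge signals a conflict through a non-zero exit status.
if git merge exp > /dev/null 2>&1; then
    merge_status=clean
else
    merge_status=conflict
fi
markers=$(grep -c '^<<<<<<<' enrol.txt)                  # conflict blocks in the file

# Resolve by keeping both lines, then conclude the merge with a commit (C4).
printf 'comp5349\ncomp5318\ncomp5329\n' > enrol.txt
git commit -q -a -m "Resolved by incorporating both branches"

parent_count=$(git log -1 --format=%P | wc -w | tr -d ' ')
echo "merge: $merge_status, conflict blocks: $markers, parents of C4: $parent_count"
```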
There are many other ways to merge histories. For more details on how to do them, please visit online resources such as https://www.atlassian.com/git/tutorials/using-branches/git-merge.
Question 2: Collaboration Through Distributed Control
As mentioned in the introduction, Git is powerful in tracking changes to a set of files or a project, and hence is often used in organisations to allow several people to collaborate on a project. In order to assist collaboration, Git uses a distributed control model to allow repositories to be copied and synchronised in various ways. Unlike many other version control systems that rely on a central repository for collaboration, Git assumes no central repository, i.e. it ensures that every repository is self contained.
However, since local repositories only reside on a local host and are usually invisible to others, having a visible remote repository on a hosting platform is the de facto configuration for a collaborating team. Such a remote repository takes the role of the central one.
a) Copying Repositories
To be able to copy a repository, we need a way to specify its location. Git provides various protocols for locating repositories. These include SSH, HTTPS and the local file system. The actual way of expressing a location depends on the protocol used.
If all your team members work on the same computer, you can refer to each other's repositories using a local file path.
The following commands would create a copy of your own local repository and save it in a folder called week1-copy.
cd ..
git clone week1 week1-copy
The git clone command also creates a remote reference called origin pointing back to the original repository (week1 in this case). However, such a reference is merely a convenient way to refer to repositories that you may have some interest in. It does not enforce any relationship. A local repository can have many such references. Use the command git remote -v to list them.
cd week1-copy
git remote -v
You are likely to see an output similar to the following:
origin
origin
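The remote reference created by git clone is stored as ordinary configuration in the cloned repository. The sketch below clones a small repository by local path and reads the reference back. The throwaway directory and the placeholder identity are assumptions of this example.

```shell
# Placeholder identity (an assumption of this sketch).
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

tmp=$(mktemp -d)
cd "$tmp"
git init -q week1
(cd week1 && echo comp5349 > enrol.txt && git add enrol.txt && git commit -q -m "First Commit")

git clone -q week1 week1-copy
cd week1-copy

remote_names=$(git remote)                            # names of all remote references
origin_url=$(git config --get remote.origin.url)      # where origin points

git remote -v    # lists each remote with its fetch and push location
echo "remote '$remote_names' points to $origin_url"
```

A second reference could be added with git remote add, which only records another name and location and does not copy anything.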
A repository residing on a hosting platform is in general only accessible through SSH or HTTPS. The location of any particular repository can easily be found from the web UI. For instance, the repository hosting the Python resources of this course has the following location:
https://github.sydney.edu.au/COMP5349-Cloud-Computing-2020/python-resources.git
The same git clone command can be used to clone the remote repository. Go back to the parent directory of week1 e.g. ~/comp5349, and run the following command to clone a remote repository.
git clone \
https://github.sydney.edu.au/COMP5349-Cloud-Computing-2020/python-resources.git
You will now find a new folder called python-resources there. It contains the source files for the week 1 homework.
A reference to the remote repository is created as usual with the default name origin. It is a convention for developers to give the name origin to the remote repository with a central role. The remote repository from which you get the initial resources is usually called upstream. The remote repository currently attached to the newly cloned local repository therefore falls into the upstream category. Please note that you only have read-only access to the upstream remote repository. Hence you can still commit locally, but a git push to it will not work.
You can set up a remote repository under your own account to be synchronised with the existing local one(s). This way you can attempt the homework, save the changes, and keep your own version of the remote repository. This remote repository would be the "central" repository for aggregating changes, and should be called origin.
Firstly, let’s designate the current remote repository as upstream:
git remote rename origin upstream
Now create an empty repository on either the university's GitHub service, https://bitbucket.org/ or https://github.com/. It is yet another convention to make the remote repository's name the same as your local repository's root folder.
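Once you have created an empty remote repository, you can use it as the origin of your local repository (replace the placeholder with the location shown in the web UI of your new repository):
git remote add origin <URL of your new remote repository>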
Try to push the content of your local repository to the remote origin with the following command.
git push origin --all
This way you set up the typical relationship among the three repositories, as illustrated in Figure 3.
There are other ways to set up the repository relationship. For example, you can fork a repository on the server side, then clone your forked repository locally.
b) Synchronising Repositories
Git has several commands to synchronise repositories. They are: git push, git pull and git fetch. There are many online tutorials on those commands. For instance, https://www.atlassian.com/git/tutorials/syncing
We illustrate their basic usage with the two local repositories: week1 and week1-copy. week1-copy has an origin reference to week1.
Go back to the week1 directory, update the enrol.txt file by adding another line "info5990" to the end of it, and commit the change.
Open another terminal window and proceed to the week1-copy directory. Fetch the changes from week1 with the following command:
git fetch origin
The output would look like the following:
remote: Counting objects: 3, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (3/3), done.
From …./week1
362226e..46c352a master -> origin/master
The git fetch command retrieves the new commit objects from the remote (origin) repository and puts them under their respective remote branches, without merging them into any of your local branches.
Figure 3: Typical Repository Relationship (picture taken from https://stackoverflow.com/questions/9257533/what-is-the-difference-between-origin-and-upstream-on-github/9257901#9257901)
To differentiate a local branch from a remote branch, all remote branches use the remote reference prefix such as origin/master.
You can inspect the changes by checking out the particular branch:
git checkout origin/master
git log
cat enrol.txt
If you decide that you want to merge those changes with your own, first return to your local branch with git checkout master, then use the same git merge command as if all the branches were local.
git merge origin/master
Because the new commit is a direct child of the current master head, your merge will be successful with an output similar to the following:
Updating 362226e..46c352a
Fast-forward
enrol.txt | 1 +
1 file changed, 1 insertion(+)
Any merge conflict can be handled in the same way we described earlier on.
If you do not wish to check the details of changes in the remote repository and just want to merge everything from there, you can issue a single command, git pull, which performs a fetch followed by a merge.
The command git push updates the remote repository with all the changes that have been committed in the local repository. If there are changes in the remote repository that have not been reflected in the local repository, you will have to firstly perform either a git pull or a git fetch then a git merge before you can push your changes.
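The fetch-then-merge sequence described in this section can be replayed end to end with two local repositories. This is a sketch under stated assumptions: it runs in a throwaway directory with a placeholder identity, the commit message "add info5990" is made up for the illustration, and the branch name is read from the repository instead of assuming master.

```shell
# Placeholder identity (an assumption of this sketch).
export GIT_AUTHOR_NAME="Example Student" GIT_AUTHOR_EMAIL="student@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

tmp=$(mktemp -d)
cd "$tmp"
git init -q week1
(cd week1 && echo comp5349 > enrol.txt && git add enrol.txt && git commit -q -m "First Commit")
git clone -q week1 week1-copy

# A new commit appears in the original repository after the clone was made.
(cd week1 && echo info5990 >> enrol.txt && git commit -q -a -m "add info5990")

cd week1-copy
branch=$(git symbolic-ref --short HEAD)
local_before=$(git rev-parse HEAD)

git fetch -q origin                            # download the new objects only
remote_head=$(git rev-parse "origin/$branch")  # head of the remote branch
local_after_fetch=$(git rev-parse HEAD)        # the local branch has not moved

git merge -q "origin/$branch"                  # fast-forward the local branch
local_after_merge=$(git rev-parse HEAD)

echo "before: $local_before"
echo "after fetch: $local_after_fetch (remote head: $remote_head)"
echo "after merge: $local_after_merge"
```

Running git pull origin in week1-copy would combine the fetch and merge steps into one.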
c) Branching, Merging, and Code Reviews
When you are collaborating with others on a project whose code is hosted remotely, you are generally required to make all the changes in a new branch when working on a task. This is because it is very common for multiple team members to be implementing or fixing different parts of the same code.
Before a branch can be merged into master or another branch, it often has to pass a code review process. The code review process involves getting fellow developers to check your code to ensure that it is correct, accurate, consistent, sufficient, etc. Furthermore, it is also a good opportunity for developers to share their knowledge. For more information on the code review process, you can have a look at the following resource: https://www.atlassian.com/agile/software-development/code-reviews.
Prior to the start of a code review, you have to create a pull request to let your team know that you intend to "push" some changes to master or another branch. The pull request will show all your commits as well as allow you to pick which team members you want to review your changes.
Different public hosting services have different ways of creating a pull request.
If you are using GitHub, have a look here: https://help.github.com/articles/about-pull-requests/. Otherwise, if you are using Bitbucket, have a look here: https://www.atlassian.com/git/tutorials/making-a-pull-request.
Question 3: Collaborative Exercise
Form a group of 3, and elect a project manager. The project manager should create a repository on a hosting platform (either the university GitHub or Bitbucket or normal GitHub), and invite the other members as collaborators. He/she should also add some initial code to the repository, for instance, the week 1 homework skeleton code.
In different branches, all members are then required to add some implementation details to different parts of the repository. Once done, each member is to create a pull request for his/her changes, and nominate the project manager and the other team member as reviewers. The reviewers will then comment on the changes, and either approve or reject the pull request. If the pull request is rejected, the pull request creator will then have to go back and fix his/her code. Otherwise, he or she is to merge the branch into master and resolve any conflicts.
Your team should aim to produce a consistent “solution”.