October 5, 2016

The Kyu real-time operating system -- pruning the Git history

At one point I wanted to borrow some Linux kernel code. As anyone who has tried this knows, everything in the Linux kernel is connected to everything else. So my strategy was to import entire directories, which ended up pulling in most of lib and include just to get printf() and strcpy(). All of this ended up in my Git repository. I pruned down my borrowing (two days of pain and suffering) and placed the result into a new "lib" directory. Now I can delete the "linux" directory, but ....

All of that Linux source code remains in the repository! As I contemplate pushing this to GitHub, this seems sloppy and almost irresponsible, so we are investigating something that runs against the grain in Git, namely permanently deleting this "linux" subdirectory from the Git history.

Git is wisely designed to make this hard, but not impossible. A Google search like "git remove directory from history" turns up a lot of information.

I have no tags or branches, which ought to make things "easy". I also have no downstream consumers who would need to do a "git rebase" because of this. Note that the "-r" given to "git rm" in the command below is required for a directory, but not for a single file.

Before going any further I make a backup of my working directory (which of course contains my local git repository). I both burn a CD and create a handy tar-ball.
Then we roll up our sleeves, take a deep breath, and get ready for the next command.
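
As a rough sketch of the tar-ball half of that backup, something like the following captures a working directory together with its .git. The scratch directory, the stand-in files, and the name "kyu_backup.tar" are my own placeholders here, not the names actually used.

```shell
#!/bin/sh
# Sketch: archive a working directory (including .git) before rewriting history.
set -eu

work=$(mktemp -d)                    # scratch area standing in for the project parent
mkdir -p "$work/src/.git"            # hypothetical working directory with a stand-in .git
echo "int main(void) { return 0; }" > "$work/src/main.c"
echo "ref: refs/heads/master" > "$work/src/.git/HEAD"

# -C switches directory first, so the archive holds relative paths (src/...)
tar -cf "$work/kyu_backup.tar" -C "$work" src

# List the archive and count the .git entries, to confirm the repository came along
tar -tf "$work/kyu_backup.tar" | grep -c '\.git/'
```

Listing the archive afterwards is a cheap way to be sure the hidden .git directory was not skipped.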

git gc
git filter-branch --force --index-filter \
'git rm -r --cached --ignore-unmatch linux' \
--prune-empty --tag-name-filter cat -- --all
Any command this long goes into a file (a one-line shell script), in this case named "zprune":
[tom@trona src]$ ./zprune
Rewrite 76cab40deed6a636628f1a19dc68f5934d568f19 (44/106) (0 seconds passed, remaining 0 predicted)    rm 'linux/Makefile'
rm 'linux/arch/arm/include/asm/Kbuild'
rm 'linux/arch/arm/include/asm/arch_timer.h'
rm 'linux/arch/arm/include/asm/asm-offsets.h'
rm 'linux/arch/arm/include/asm/assembler.h'
rm 'linux/arch/arm/include/asm/atomic.h'

--- hundreds, perhaps thousands of lines like this .......

rm 'linux/mm/Makefile'
rm 'linux/mm/internal.h'
rm 'linux/mm/slab.c'
rm 'linux/mm/slab.h'
rm 'linux/mm/slab_common.c'
rm 'linux/mm/slob.c'
rm 'linux/mm/slub.c'

Ref 'refs/heads/master' was rewritten
Ref 'refs/remotes/origin/master' was rewritten
[tom@trona src]$ 
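
To convince yourself the rewrite did what was intended, it may help to try the same filter-branch invocation on a throwaway repository first. The sketch below is only that: the temporary repository, the file names, and the demo identity are all invented for illustration. It counts the commits that touch "linux" before and after, and shows that filter-branch keeps a backup ref under refs/original.

```shell
#!/bin/sh
# Sketch: rehearse the prune on a scratch repository and check the result.
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1   # newer Git warns and pauses without this
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir linux lib
echo "borrowed" > linux/Makefile
echo "kept" > lib/printf.c
git add .
git commit -q -m "import linux and lib"
git rm -q -r linux
git commit -q -m "delete linux from the working tree"

# Commits on the current branch that touch linux/ (the import and the delete)
before=$(git log --oneline -- linux | wc -l | tr -d ' ')

git filter-branch --force --index-filter \
  'git rm -r --cached --ignore-unmatch linux' \
  --prune-empty --tag-name-filter cat -- --all >/dev/null 2>&1

after=$(git log --oneline -- linux | wc -l | tr -d ' ')
kept=$(git log --oneline -- lib | wc -l | tr -d ' ')
backup=$(git for-each-ref refs/original | wc -l | tr -d ' ')

echo "linux commits: before=$before after=$after, lib commits=$kept, backup refs=$backup"
```

Note that the check uses the current branch rather than "--all": the backup under refs/original still reaches the old commits, so "git log --all" would keep finding them until that ref is deleted.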
Then I do this, which, as you will see further on, was unnecessary and perhaps ill-advised:
git push origin --force --all
The tutorial page at GitHub says you should wait "a while" to be sure nothing unhappy has happened and then clean things up via one of two measures:
git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin
git reflog expire --expire=now --all
git gc --prune=now
Either this, or create a fresh new remote repository, push to it and then clone from it.
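
Why the first measure is needed at all: after filter-branch the old commits are still in the object store, kept alive by refs/original and the reflog. The following sketch, again on an invented scratch repository with made-up file names, checks that an old commit survives the rewrite and disappears only after the three clean-up commands.

```shell
#!/bin/sh
# Sketch: show that the pre-rewrite commit survives until the clean-up is run.
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1   # newer Git warns and pauses without this
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir linux
echo "borrowed" > linux/Makefile
echo "mine" > main.c
git add .
git commit -q -m "initial import"
old=$(git rev-parse HEAD)                # the commit that still contains linux/

git filter-branch --force --index-filter \
  'git rm -r --cached --ignore-unmatch linux' \
  --prune-empty --tag-name-filter cat -- --all >/dev/null 2>&1

# Still present: refs/original and the reflog both point at it
if git cat-file -e "$old^{commit}" 2>/dev/null; then before=present; else before=gone; fi

# The three clean-up commands from the text
git for-each-ref --format='delete %(refname)' refs/original | git update-ref --stdin
git reflog expire --expire=now --all
git gc --quiet --prune=now

if git cat-file -e "$old^{commit}" 2>/dev/null; then after=present; else after=gone; fi
echo "old commit before clean-up: $before, after clean-up: $after"
```

Cloning from a freshly pushed repository reaches the same end by a different route, since a clone only transfers objects reachable from the refs it fetches.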

Since my repository is hosted on a local machine that I have access to, I take the latter approach. In fact, what I do is rename the pre-pruning repository to keep it as a backup, then create a fresh bare repository under the original name, as follows:

mv kyu.git kyu_pre_linux.git
mkdir kyu.git
cd kyu.git
git --bare init
Then on the client side (once the pruned history has been pushed to the new, empty repository), I do:
mv src src_with_linux
mkdir src
cd src
git clone xxxx.git
Someday I will go back and delete these pre-pruning backups.
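
The whole fresh-repository round trip can be rehearsed locally. This is a sketch under assumptions: every path below lives in a temporary directory, "src" and "kyu.git" merely echo the names used above, "src_fresh" and the demo identity are invented, and the remote is a plain local path rather than a real server.

```shell
#!/bin/sh
# Sketch: push a (pruned) repository to a fresh bare repository, then clone it.
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

top=$(mktemp -d)
cd "$top"

# Stand-in for the already pruned working repository
git init -q src
( cd src && echo "mine" > main.c && git add . && git commit -q -m "pruned history" )

# Fresh, empty bare repository playing the part of the server side
git init -q --bare kyu.git

# Point the working repository at it and push every branch
( cd src && git remote add origin "$top/kyu.git" && git push -q origin --all )

# A new clone receives only what is reachable from the pushed branches
git clone -q "$top/kyu.git" src_fresh
commits=$(git -C src_fresh rev-list --count HEAD)
echo "fresh clone has $commits commit(s)"
```

In a real setup the working repository would already have an "origin" remote, so the "git remote add" step would not be needed as long as the new bare repository sits at the old location.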

The overall effect is as dramatic as I hoped:

[tom@trona Kyu]$ ls -l *.tar
-rw-rw-r-- 1 tom tom 53800960 Oct  5 15:49 kyu_with_linux.tar
-rw-rw-r-- 1 tom tom  2007040 Oct  5 16:23 kyu.tar

The above are tar-balls of my working directory before and after the prune. The size has dropped from 54M to 2M.


Have any comments? Questions? Drop me a line!

Kyu / tom@mmto.org