Sunday, January 3, 2016

Merging multiple git repositories into one and purging sensitive data

Git is a very powerful distributed version control system. It is based on a simple concept: a directed acyclic graph of commits over a database that holds four types of objects (blobs, trees, commits and tags). I love git and its brilliant design. Therefore, when I saw how badly it was misused at a company I had joined, I had to fix it.

The state before

Due to multiple factors, there were around 4 repositories which had to be cloned into one directory. Each repository was using, or was being used by, another repository. In other words, projects in Visual Studio had dependencies on projects in other repositories or, worse, on compiled DLLs stored in other repositories. As a result, one change sometimes required 4 commits (including rebuilding projects and adding the compiled DLLs to the commit). Every month, around 2 man-days were lost on checking changes in and out of multiple repositories and on false (or true...) alarms that somebody had forgotten to check something in or out. What's more, this was for only one branch, master, because only one existed back then. In the future, to support multiple environments or development of fine-grained features/stories, those problems would be multiplied by at least the number of branches and the number of new developers. As always in IT, there wasn't much time, so setting up an internal company NuGet server wasn't the best option. It isn't that setting up a NuGet server takes long, but training all the developers requires a great amount of time. Instead, I decided to create one repository.
The state before was like this:
\Repo1
  \src
    \project1
      project1.csproj with dll reference to project 2
    \ExternalDlls
      project2.dll
    Solution1.sln
\Repo2
  \project2
    project2.csproj with project reference to project 3
  \project3
    project3.csproj with project reference to project 4
  Solution2.sln
\Repo3
  \project4
    project4.csproj
  Solution3.sln

One to rule them all

In those 4 repositories, passwords and MachineKeys for the production environment were stored in plain text. Therefore, I decided to create a new repository. Side note: remember, passwords pushed to a git repository, always there they will be, Yoda said. Therefore, the new repository will have an entirely rewritten history (with the passwords removed). Naturally, all branches (masters in this case) from all repositories, together with their history, must be included in the new repository. It will look like this:

new repo HEAD
|
M
|  \
M    \
| \    \
x' \     \
|   y'    z'
x'  |     |
|   y'    z'
x'  |     |
.   y'    z'
.   .     .
.   .     .
    .     .
 
Legend:
x' - commits from repository 1 with removed sensitive data
y' - commits from repository 2 with removed sensitive data
z' - commits from repository 3 with removed sensitive data
M  - merges in the new repository
new repo HEAD - the brand new, future repo HEAD (master)

Migration scripts

The migration must be done in an "atomic" way, or at least it must look atomic from the developers' perspective: they commit to the old repos and, from some point in time, they commit to the new repo (note: stashes will have to be discarded). Therefore, I decided to run the migration during the weekend, when the repositories are inactive. However, I don't like working on weekends, so I wrote a script or two to automate the majority of the work. The git filter-branch command which I will be using is painfully slow, so I additionally used a powerful Amazon EC2 instance to make things a little faster.

Step 1 - fetch all repos and form a nice repository structure

Note that in the state before, not all repos had their code in the src folder. To fix that, I'll use the git filter-branch command to rewrite the history entirely. Each commit in history, blame etc. will look as if it had been committed to the right src folder. Additionally, I noticed that someone had been committing the Packages folder to git (possibly due to a poor .gitignore file), so now is the chance to remove that bloat permanently. Here is the bash script. Save it as mergerepos.sh and run it from the git bash console like a normal Linux script (./mergerepos.sh):
#!/bin/bash 
FinalRepo="main" 
echo $FinalRepo

mkdir $FinalRepo
cd $FinalRepo
git init

touch tmp
git add -A
git commit -m 'merge all repositories'


declare -a reponames=("repo1" "repo2" "repo3")
declare -a repourls=("https://user@bitbucket.org/Company/repo1.git" "https://user@bitbucket.org/Company/repo2.git" "https://user@bitbucket.org/Company/repo3.git")
numberofrepos=${#reponames[@]}

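# Rewrites the history of <remote>/master on a new local branch "<remote>master":
# drops the packages folders and moves everything else under src/.
function rewriterepo {
 git checkout "$1/master"
 git checkout -b "$1master"
 # The filter is single-quoted, so only double quotes are used inside it.
 # Top-level entries named src/Src and names starting with .git stay in place.
 git filter-branch -f --tree-filter 'rm -rf packages
 mkdir -p "src"
 rm -rf src/packages
 ls -A | grep -v -x "[Ss]rc" | grep -v "^\.git" | while read -r filename
 do
 mv "$filename" "src/"
 done' HEAD
}

for (( i=0; i<numberofrepos; i++ ))
do
  echo "$i -> ${reponames[$i]} $(date) - ${repourls[$i]} STARTED"
  git remote add "${reponames[$i]}" "${repourls[$i]}"
  git fetch "${reponames[$i]}"
  rewriterepo "${reponames[$i]}"
  git remote rm "${reponames[$i]}"
  echo "$i -> ${reponames[$i]} $(date) - ${repourls[$i]} FINISHED"
done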
The script will:
  • set up a new repository
  • make a dummy commit
  • go through the list of given repositories and, for each one:
    • add it as a remote, fetch it and check it out to a repoXmaster branch
    • clean each commit as follows:
      • create the src folder, remove the packages and src/packages folders
      • move each file/directory from the root to the src folder, except for src itself and the git files
    • remove the added remote
So far so good.
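
Since filter-branch is slow, it can pay off to try the restructuring logic on a scratch directory first. The following is a minimal sketch of that idea, not part of mergerepos.sh: restructure_dir is a hypothetical helper that applies the same "move everything into src/" steps to one plain directory, with the grep patterns anchored so that only src/Src and names starting with .git are skipped.

```shell
#!/bin/bash
# Hypothetical helper: applies the tree-filter steps to a single directory,
# outside of git, so the result can be inspected before rewriting history.
restructure_dir() {
  local dir="$1"
  (
    cd "$dir" || exit 1
    rm -rf packages
    mkdir -p src
    rm -rf src/packages
    # Move every top-level entry except src/Src and .git* into src/
    ls -A | grep -v -x '[Ss]rc' | grep -v '^\.git' | while read -r filename
    do
      mv "$filename" src/
    done
  )
}

# Usage on a throwaway copy of a checkout (paths are examples):
#   cp -r repo2 /tmp/repo2-copy && restructure_dir /tmp/repo2-copy
```

If the resulting layout looks right for one checkout, the same commands should behave the same way inside --tree-filter, which runs them once per commit.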

Step 2 - merge all branches (repositories)

Now that I have all repositories in the right structure inside our one "chosen" repository, merging them is just a normal merge operation.
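
A minimal sketch of that merge, assuming the repoXmaster branch names created in Step 1 and a checkout of master in the new repository. The function name merge_rewritten_branches is made up for illustration. One caveat: Git 2.9 and newer refuse to merge branches without a common ancestor unless --allow-unrelated-histories is passed, while older versions do not know that flag, so it may have to be dropped there.

```shell
#!/bin/bash
# Sketch: merge every rewritten "<repo>master" branch into the current branch.
merge_rewritten_branches() {
  local repo
  for repo in "$@"
  do
    # The branches share no history, hence --allow-unrelated-histories (Git 2.9+).
    # --no-edit accepts the default merge message instead of opening an editor.
    git merge --no-edit --allow-unrelated-histories "${repo}master" || return 1
  done
}

# Usage, from inside the "main" repository on its master branch:
#   merge_rewritten_branches repo1 repo2 repo3
```

The function stops at the first failed merge, which is what happens if two repositories ended up with the same path under src/; such conflicts have to be resolved by hand before continuing.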

Step 3 - delete sensitive data (passwords etc)

This can be done with the painfully slow git filter-branch or... with the fast and easy to use BFG Repo-Cleaner. Check the project website, it's self-explanatory.
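
Whichever tool does the cleaning, it is worth checking afterwards that the secret is really gone from every commit. Below is a minimal sketch of such a check; commits_containing is a hypothetical helper and the search string is whatever password or key you are hunting for. It only looks at commits reachable from existing refs, so old objects kept alive by the reflog are not covered until the reflog is expired and git gc has been run.

```shell
#!/bin/bash
# Sketch: list commits whose tree still contains a given fixed string.
# Prints one commit hash per line; prints nothing when the history is clean.
commits_containing() {
  local needle="$1" commit
  git rev-list --all | while read -r commit
  do
    # git grep exits with 0 when the string occurs anywhere in that commit's tree
    if git grep -q -F -e "$needle" "$commit" -- 2>/dev/null
    then
      echo "$commit"
    fi
  done
}

# Usage, from inside the repository (the secret is an example):
#   commits_containing 'MySecretPassword'
```

Running it before the cleanup shows how many commits are affected; running it after should print nothing.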

Step 4 - add a nice, root .gitignore

All my work of removing the redundant Packages folder can be undone by a single commit. Therefore, I merged all existing .gitignore files and added their rules to the well-known Visual Studio .gitignore file from github/gitignore.
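
Merging the ignore files can be scripted as well. This is a rough sketch rather than the exact procedure used here: merge_gitignores is a hypothetical helper that concatenates the given files, strips Windows line endings, and drops blank lines, comments and repeated rules while keeping the first occurrence of each rule. Because rule order matters for negated patterns (lines starting with !), the result still deserves a manual review.

```shell
#!/bin/bash
# Sketch: merge several .gitignore files into one rule list on stdout.
merge_gitignores() {
  # sub() removes a trailing CR; NF skips blank lines; !/^#/ skips comments;
  # !seen[$0]++ keeps only the first occurrence of each rule.
  awk '{ sub(/\r$/, "") } NF && !/^#/ && !seen[$0]++' "$@"
}

# Usage (file names are examples):
#   merge_gitignores VisualStudio.gitignore repo1.gitignore repo2.gitignore > .gitignore
```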

Further steps

I now have one repository with the right structure and a good history. Further steps?
Taking the opportunity, I introduced one solution for all projects in the new VS 2015, migrated to automatic NuGet package restore (check all those scripts - one also fixes project hint paths), changed all DLL references to project references and upgraded the projects to the new VS version. This is how I did the csproj update:
$listOfBadStuff = @(
 "<Project DefaultTargets=""Build"" xmlns=""http://schemas.microsoft.com/developer/msbuild/2003"" ToolsVersion=""4.0"">",
 "<OldToolsVersion>[0-9]\.0</OldToolsVersion>",
 "<Project ToolsVersion=""12.0"""
)
$listOfGoodStuff = @(
 "<Project DefaultTargets=""Build"" xmlns=""http://schemas.microsoft.com/developer/msbuild/2003"" ToolsVersion=""14.0"">",
 "<OldToolsVersion>14.0</OldToolsVersion>",
 "<Project ToolsVersion=""14.0"""
)

# Regex-replace each "bad" pattern with its "good" counterpart in every project file
ls -Recurse -include *.csproj, *.sln, *.fsproj, *.vbproj, *.wixproj |
  foreach {
    $content = cat $_.FullName | Out-String
    $origContent = $content
    For ($i=0; $i -lt $listOfBadStuff.Length; $i++) {
      $content = $content -replace $listOfBadStuff[$i], $listOfGoodStuff[$i]
    }
    # Rewrite only the files that actually changed
    if ($origContent -ne $content)
    {
        $content | out-file -encoding "UTF8" $_.FullName
        write-host messed with $_.Name
    }
  }
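
After such a bulk update, a quick scan from git bash can confirm that no project file was missed. The sketch below assumes GNU grep (for --include) and only looks for the two old values the script above replaces in the Project tag, ToolsVersion 4.0 and 12.0; find_old_toolsversion is a made-up name.

```shell
#!/bin/bash
# Sketch: print project files that still declare ToolsVersion 4.0 or 12.0.
# Takes the directory to scan as an optional argument (default: current directory).
find_old_toolsversion() {
  grep -rlE \
    --include='*.csproj' --include='*.fsproj' \
    --include='*.vbproj' --include='*.wixproj' \
    'ToolsVersion="(4|12)\.0"' "${1:-.}"
}

# Usage, from the repository root:
#   find_old_toolsversion src
```

An empty result means every matching project file already carries the new version.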

Summary

It was relatively easy to get from a nightmare to a reasonable repository environment. Those one or two days of merging repositories will pay off very quickly. Not to mention the removal of sensitive data from the repository: that can be priceless.
