# Repo Diff Trees

repo_diff_trees.py compares two repo source trees and outputs reports on the
findings.

The output is in CSV format and is easily consumed in a spreadsheet.

In addition to importing the results into a spreadsheet, you can also create your own
Data Studio dashboard like [this one](https://datastudio.google.com/open/0Bz6OwjyDcWYDbDJoQWtmRl8telU).

If you wish to create your own dashboard, follow the instructions below:

1. Sync the two repo workspaces you wish to compare. Example:

```
mkdir android-8.0.0_r1
cd android-8.0.0_r1
repo init \
  --manifest-url=https://android.googlesource.com/platform/manifest \
  --manifest-branch=android-8.0.0_r1
# Adjust the number of parallel jobs to your needs
repo sync --current-branch --no-clone-bundle --no-tags --jobs=8
cd ..
mkdir android-8.0.0_r11
cd android-8.0.0_r11
repo init \
  --manifest-url=https://android.googlesource.com/platform/manifest \
  --manifest-branch=android-8.0.0_r11
# Adjust the number of parallel jobs to your needs
repo sync --current-branch --no-clone-bundle --no-tags --jobs=8
cd ..
```

2. Run repo_diff_trees.py. Example:

```
python repo_diff_trees.py --exclusions_file=android_exclusions.txt \
  android-8.0.0_r1 android-8.0.0_r11
```

3. Create a [new Google spreadsheet](https://docs.google.com/spreadsheets/create).
4. Import projects.csv to a new sheet.
5. Create a [new data source in Data Studio](https://datastudio.google.com/datasources/create).
6. Connect your new data source to the projects.csv sheet in the Google spreadsheet.
7. Add a "Count Diff Status" field by selecting the menu next to the "Diff
   Status" field and selecting "Count".
8. Copy the [Data Studio dashboard sample](https://datastudio.google.com/open/0Bz6OwjyDcWYDbDJoQWtmRl8telU).
    Make sure you are logged in to your Google account and have agreed to Data Studio's terms of service. Once
    this is done, you should get a link to "Make a copy of this report".
9. Select your own data source for your copy of the dashboard when prompted.
10. You may see a "Configuration Incomplete" message under
    the "Modified Projects" pie chart. To address this, select the pie chart,
    then replace the "Invalid Metric" field with "Count Diff Status".

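For a quick sanity check before building the dashboard, the "Count Diff Status" metric from step 7 can be approximated locally. The sketch below tallies the values of a `Diff Status` column in projects.csv. The column name is an assumption taken from the Data Studio field name above, `count_diff_status` is a hypothetical helper, and the sample rows are made up for illustration, so adjust them to match the actual CSV header.

```python
import csv
import io
from collections import Counter


def count_diff_status(csv_file, column="Diff Status"):
    """Tally how many rows carry each value of the given column.

    The default column name is an assumption based on the Data Studio
    field name; check the header row of your projects.csv.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


# Made-up stand-in data. Real usage would look like:
#   with open("projects.csv", newline="") as f:
#       counts = count_diff_status(f)
sample = io.StringIO(
    "Project,Diff Status\n"
    "example/a,modified\n"
    "example/b,unchanged\n"
    "example/c,modified\n"
)
counts = count_diff_status(sample)
```

If the tally here disagrees with the pie chart, the data source is probably connected to the wrong sheet or the metric is still set to "Invalid Metric".
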
## Analysis method

repo_diff_trees.py goes through several stages when comparing two repo
source trees:

1. Match projects in source tree A with projects in source tree B.
2. Diff projects that have a match.
3. Find commits in source tree B that are not in source tree A.

The first two steps are self-explanatory. The method
of finding commits only in B is explained below.

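As a rough illustration of the first stage, the sketch below pairs projects that share the same relative path in both trees. Matching by identical path is an assumption made for this example; repo_diff_trees.py may match projects differently. `match_projects` and the project paths are hypothetical.

```python
def match_projects(paths_a, paths_b):
    """Split project paths into matched, only-in-A, and only-in-B lists.

    Assumes two projects correspond when their relative paths are equal.
    """
    set_a, set_b = set(paths_a), set(paths_b)
    matched = sorted(set_a & set_b)
    only_in_a = sorted(set_a - set_b)
    only_in_b = sorted(set_b - set_a)
    return matched, only_in_a, only_in_b


# Made-up project paths for illustration.
matched, only_in_a, only_in_b = match_projects(
    ["build/make", "external/foo", "frameworks/base"],
    ["build/make", "frameworks/base", "vendor/bar"],
)
```

Only the projects in `matched` would move on to the diff and commit-comparison stages.
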
## Finding commits not upstream

After matching up projects in both source trees
and diffing them, the last stage is to iterate
through each matching pair of projects and find
the commits that exist in the downstream project (B) but not in the
upstream project (A).

'git cherry' is a useful tool that finds changes
which exist in one branch but not another. It does so
not only by finding which commits were merged
to both branches, but also by matching cherry-picked
commits.

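To make the 'git cherry' output concrete: each line starts with `+` for a commit with no equivalent upstream, or `-` for a commit that has one. The sketch below runs `git cherry <upstream> <head>` and keeps only the `+` entries. It assumes both histories are reachable from a single repository, which this document does not state; `git_cherry_unmatched` and `parse_git_cherry` are hypothetical helper names, and the commit ids are placeholders.

```python
import subprocess


def parse_git_cherry(output):
    """Return the commit ids marked '+' (no equivalent found upstream)."""
    unmatched = []
    for line in output.splitlines():
        parts = line.split(maxsplit=2)  # sign, commit id, optional subject
        if len(parts) >= 2 and parts[0] == "+":
            unmatched.append(parts[1])
    return unmatched


def git_cherry_unmatched(repo_dir, upstream, head):
    """Run `git cherry` in repo_dir and return the unmatched commit ids.

    Assumes `upstream` and `head` are refs known to this one repository.
    """
    result = subprocess.run(
        ["git", "cherry", upstream, head],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    )
    return parse_git_cherry(result.stdout)


# Placeholder output in the format `git cherry` prints.
sample = "\n".join([
    "+ " + "a" * 40,
    "- " + "b" * 40,
    "+ " + "c" * 40,
])
unmatched = parse_git_cherry(sample)
```

Keeping the parser separate from the `subprocess` call lets it be exercised on captured output without a repository at hand.
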
However, there are many instances where a change in one branch
can have an equivalent in another branch without being a merge
or a cherry pick. Some examples are:

* Commits that were squashed with other commits
* Commits that were reauthored

'git cherry' will not recognize these commits as having an equivalent,
yet they clearly do.

This is addressed in three steps:

1. First, list the 'git cherry' commits. This gives us the
   list of changes for which 'git cherry' could not find an equivalent.
2. Then 'git blame' the entire project's source tree and compile
   a list of changes that actually have lines of code in the tree.
3. Finally, find the intersection: 'git cherry' changes
   that have lines of code in the final source tree.

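The three steps above can be sketched as a set intersection. This example assumes the blame data comes from `git blame --line-porcelain`, where each blamed line is introduced by a header of the form `<commit id> <original line> <final line>`, and that commit ids are 40-character SHA-1 hashes. The helper names and sample data are hypothetical; this is not the script's actual implementation.

```python
import re

# Header that introduces each blamed line in porcelain output:
# "<40-hex commit id> <original line> <final line> [<group size>]"
_BLAME_HEADER = re.compile(r"^([0-9a-f]{40}) \d+ \d+(?: \d+)?$")


def blamed_commits(blame_output):
    """Collect the commit ids that own at least one line in a blame report."""
    commits = set()
    for line in blame_output.splitlines():
        match = _BLAME_HEADER.match(line)
        if match:
            commits.add(match.group(1))
    return commits


def commits_not_upstream(cherry_commits, blame_outputs):
    """Keep the 'git cherry' commits that still own lines in the tree."""
    surviving = set()
    for output in blame_outputs:  # one blame report per file
        surviving |= blamed_commits(output)
    return [c for c in cherry_commits if c in surviving]


# Placeholder commit ids and a minimal blame report for one file.
COMMIT_X = "a" * 40
COMMIT_Y = "b" * 40
sample_blame = "\n".join([
    f"{COMMIT_X} 1 1 1",
    "author Example Author",
    "filename foo.c",
    "\tint x = 0;",
])
result = commits_not_upstream([COMMIT_X, COMMIT_Y], [sample_blame])
```

Here `COMMIT_Y` is dropped because none of its lines survive in the blamed file, which is how squashed or reauthored equivalents get filtered out.
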
## Caveats

The method described above has proven effective on Android
source trees. It does have shortcomings:

* It does not find commits that only delete lines of code.
* It does not take into account merge conflict resolutions.
    108