GIT parallel clone of all repositories, i.e. total time to clone all repos close to the time taken by the largest repo: fatal: index-pack failed



OK. Mac OS.

alias gcurl
alias gcurl='curl -s -H "Authorization: token IcIcv21a5b20681e7eb8fe7a86ced5f9dbhahaLOL" '
echo $IG_API_URL 
https://someinstance-git.mycompany.com/api/v3

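I ran the following to see the list of all orgs a user has access to. Note for new users: passing just $IG_API_URL here will give you all the REST endpoints you can use.

gcurl ${IG_API_URL}/user/orgs

Running the above gave me nice JSON output, which I piped into jq to get the info, and now I finally have the corresponding git URLs that I can use to clone the repos.

I created a master repo file: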

git@someinstance-git.mycompany.com:someorg1:some-repo1.git
git@someinstance-git.mycompany.com:someorg1:some-repo2.git
git@someinstance-git.mycompany.com:someorg2:some-repo1.git
git@someinstance-git.mycompany.com:someorgN:some-repoM.git
...
....
some 1000+ such entries here in this file.

I created a small one-liner script (it reads the file line by line; I know that's sequential) that runs git clone for each entry, and it works fine.

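What I dislike, and am trying to find a better solution for, is:
1) It runs sequentially and is slow (i.e. one repo after another).

2) I want to clone all the repos in roughly the time it takes to clone the largest one. That is, if repo A takes 3 seconds, B takes 20 seconds, C takes 3 seconds, and all other repos take under 10 seconds, I'm wondering whether there is a way to clone all of them in 20-30 seconds (as opposed to 3 + 20 + 3 + ... seconds, which adds up to many minutes).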
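
To make the timing goal concrete, here is a minimal sketch that uses sleep as a stand-in for git clone (the durations of 1, 2 and 1 seconds are made up for illustration, and fake_clone is a hypothetical name). Started in the background and followed by wait, the three jobs should finish in roughly the time of the longest one rather than the sum.

```shell
#!/bin/sh
# Stand-in for "git clone": each job just sleeps for the given number of seconds.
fake_clone() {
    sleep "$1"
}

start=$(date +%s)
fake_clone 1 &   # "repo A"
fake_clone 2 &   # "repo B" (the slowest)
fake_clone 1 &   # "repo C"
wait             # block until every background job has finished
end=$(date +%s)

elapsed=$((end - start))
echo "elapsed: ${elapsed}s (run one after another, the jobs would need about 4s)"
```

Note that this only shows the ideal case; with real clones, network and server capacity are shared between the jobs.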

To do that, I tried the poor man's approach of running the git clone step in the background so that the loop could iterate over the lines faster.

git clone ${git_url_line} $$_${datetimestamp}_${git_repo_fetch_from_url} &

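Hey, the script finished very quickly, and running ps -eAf|egrep "ssh|git" showed that something interesting was going on. Coincidentally, one of the guys shouted :) that Icinga was showing cool metrics for something going very high. I thought it was because of me, but I figured I could run N git clones against my Git instance without causing a network outage or anything weird.

OK, things ran successfully for a while and I started seeing a bunch of git clone output on my screen. In a second session, I saw the folders getting populated nicely, until I finally saw what I had been expecting: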

Resolving deltas: 100% (3392/3392), done.
remote: Total 5050 (delta 0), reused 0 (delta 0), pack-reused 5050
Receiving objects: 100% (5050/5050), 108.50 MiB | 1.60 MiB/s, done.
Resolving deltas: 100% (1777/1777), done.
remote: Total 10691 (delta 0), reused 0 (delta 0), pack-reused 10691
Receiving objects: 100% (10691/10691), 180.86 MiB | 1.57 MiB/s, done.
Resolving deltas: 100% (5148/5148), done.
remote: Total 5994 (delta 6), reused 0 (delta 0), pack-reused 5968
Receiving objects: 100% (5994/5994), 637.66 MiB | 2.61 MiB/s, done.
Resolving deltas: 100% (3017/3017), done.
Checking out files: 100% (794/794), done.
packet_write_wait: Connection to 10.20.30.40 port 22: Broken pipe
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed

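I suspect that launching ~1000 processes at once is exhausting resources on either the local or the remote machine. You probably want to limit the number of processes started. One technique is to use xargs.

If you have access to GNU xargs, it might look like this: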

xargs --replace -P10 git clone {} < repos.txt
  • -P10 means "10 processes"
  • --replace replaces {} with the mapped argument

If you're stuck with the crippled BSD xargs, as on OS X (or you want more compatibility), you can use the more portable:

xargs -I{} -P10 git clone {} < repos.txt

This form also works with GNU xargs.
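
Before pointing this at a real server, a dry run can help check what would be executed. The sketch below assumes a hypothetical two-line repos.txt created in a temporary directory and prefixes the command with echo, so nothing is actually cloned; -P2 is an arbitrary value chosen for the example.

```shell
#!/bin/sh
# Dry run: print the git clone commands instead of running them.
workdir=$(mktemp -d)
repos_file="$workdir/repos.txt"   # hypothetical file, one clone URL per line

cat > "$repos_file" <<'EOF'
git@github.mycompany.com:coolrepo-org/somerepo1.git
git@github.mycompany.com:coolrepo-org/somerepo2.git
EOF

# -I{} substitutes each input line for {}; -P2 runs at most two commands at a time.
# Output order is not guaranteed when commands run in parallel.
dry_run_output=$(xargs -I{} -P2 echo git clone {} < "$repos_file")
echo "$dry_run_output"

rm -rf "$workdir"
```

Dropping the echo turns the dry run into the real command.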

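Thanks, Anthony.

To run the git clones in parallel (up to the -P limit for xargs), I tried various values (-P5, -P10, -P15, ..., -P100, ..., -P<Limit_number_as_per_ulimit>, -P<No.of.processes_a_user_can_have_at_a_given_time>). The conclusion is to stick with xargs -P5 to -P10, because higher values of -P<N> did not succeed every time (due to resource issues on the machine where I was running the command/script).

If you increase the -P value (N), you may see errors like the following: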

packet_write_wait: Connection to 10.20.30.40 port 22: Broken pipe
or
fatal: The remote end hung up unexpectedly
or
fatal: early EOF
or
fatal: index-pack failed
or
sign_and_send_pubkey: signing failed: agent refused operation
or
ssh: connect to host somegit-instance.mycompany.com port 22: Operation timed out
fatal: Could not read from remote repository.
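
Since these failures appeared only under higher load, one option is to retry a failed clone a few times with a pause in between. This is a minimal sketch and not part of the script below: the retry function, its argument order, and the one-second pause are all assumptions made for illustration.

```shell
#!/bin/sh
# retry MAX_ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS attempts have been made.
retry() {
    max_attempts=$1
    shift
    attempt=1
    while ! "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after ${attempt} attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep 1   # short pause before the next attempt
    done
    return 0
}

# Example usage (not executed here):
#   retry 3 git clone git@github.mycompany.com:coolrepo-org/somerepo1.git
```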

Final script:

#!/bin/bash
# Variables
pattern=""; # Create git pattern to fetch entries from master config based upon user's parameters, defaults to blank.
usage() {
echo -e "\nUsage:\n------\ngit-clone-repos.parallel.sh [usage | help | <pattern>]\n"
echo "git-clone-repos.parallel.sh \"github.mycompany.com\"             .................................... (This will re-clone every repository under every org in Git instance 'github.mycompany.com')"
echo "git-clone-repos.parallel.sh \"github.mycompany.com:tools-ansible-some-org\"  ................ (This will re-clone every repository under org: 'tools-ansible-some-org' in Git instance 'github.mycompany.com')"
echo "git-clone-repos.parallel.sh \"somegit-instance.mycompany.com:coolrepo-org/somerepo.git\"  .... (This will re-clone repo: 'somerepo' in org: 'coolrepo-org' in Git instance: 'somegit-instance.mycompany.com')"
echo -e "\n\n"
}
# If help/usage as first arg, show usage help
if [[ ("$1" == "usage" || "$1" == "help") || $# -eq 0 ]]; then usage; exit 0; fi
# Set pattern
pattern="$1"
mc_file=~/AKS/common/master-config.git-repos-ssh-urls.txt
echo "-- Master config file: $mc_file"; echo
echo "-- Pattern passed for fetching repos from master config file is: \"$pattern\""
# Create a workspace dir in PWD so that everything sits fresh in a new folder. Tweak it if you don't want it.
dir="$$_$(date +%s)"
mkdir "${dir}" && cd "${dir}" || exit 1
# First create a temp repo file filtered by pattern and for '@' lines only (i.e. ignoring commented out lines)
tmprepofile=$(mktemp)
grep "${pattern}" "${mc_file}" | grep '@' | cut -d':' -f3- > "${tmprepofile}"
# GIT clone in parallel mode (xargs -P5 is optimal, -P10 can be used).
# Clone each repo under a different directory name so that repos from any organization in any instance clone without conflict.
# The URL is passed to bash as $1 instead of being pasted into the command string; the .git match is escaped and anchored so only the suffix is stripped.
xargs -I{} -P10 bash -c 'git clone "$1" "$(echo "$1" | cut -d@ -f2 | sed "s#:#__#g;s#/#__#g;s#\.git\$##")"' _ {} < "${tmprepofile}"
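
The target directory name built in the last line of the script can be checked in isolation. The sketch below wraps the cut/sed pipeline in a helper, with the .git match escaped and anchored to the end of the name; repo_dir_name is a name made up here and is not part of the script. It is applied to a URL from the sample master config.

```shell
#!/bin/sh
# Derive a unique directory name from an SSH clone URL:
# drop everything up to '@', turn ':' and '/' into '__', strip a trailing '.git'.
repo_dir_name() {
    echo "$1" | cut -d@ -f2 | sed 's#:#__#g;s#/#__#g;s#\.git$##'
}

repo_dir_name "git@github.mycompany.com:coolrepo-org/somerepo1.git"
# prints: github.mycompany.com__coolrepo-org__somerepo1
```

Including the host and org in the name is what lets repos with the same name from different orgs or instances sit side by side in one workspace folder.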

The sample master config file used is:

#-- Sample Master Config file, which can be generated using GIT rest api - against a user's org to find all user org repositories (in my case) looks like:
## github coolrepo-org org/repogroup contains:
##-----------
github.mycompany.com:coolrepo-org:git@github.mycompany.com:coolrepo-org/somerepo1.git
github.mycompany.com:coolrepo-org:git@github.mycompany.com:coolrepo-org/somerepo2.git
## somegit-instance pipeline org/repogroup contains:
##-----------
somegit-instance.mycompany.com:pipeline:git@somegit-instance.mycompany.com:pipeline/shinynew-cool-pipeline.git
## !!!!! NO ORG ACCESS REPO ENTRIES BELOW !!!!! ##
## -----------------------------------------------
## somegit-instance Misc no access org but access at just repo level entries contains:
##----------- (appended to the master file at the end of master file generation script) ---------
somegit-instance.mycompany.com:someorg-org:git@somegit-instance.mycompany.com:someorg-org/somerepofooter.git
somegit-instance.mycompany.com:someorg-org:git@somegit-instance.mycompany.com:someorg-org/somereponav.git
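
To see how the script's grep/cut filter treats a file in this format, here is a small sketch. It writes three lines modeled on the sample above to a temporary file and extracts the clone URLs for one org; list_repo_urls is a hypothetical helper name, not something the script defines.

```shell
#!/bin/sh
# list_repo_urls PATTERN FILE
# Keep lines matching PATTERN, drop comment lines (they contain no '@'),
# and print field 3 onwards (the SSH URL, which itself contains a ':').
list_repo_urls() {
    grep "$1" "$2" | grep '@' | cut -d':' -f3-
}

sample=$(mktemp)
cat > "$sample" <<'EOF'
## github coolrepo-org org/repogroup contains:
github.mycompany.com:coolrepo-org:git@github.mycompany.com:coolrepo-org/somerepo1.git
somegit-instance.mycompany.com:pipeline:git@somegit-instance.mycompany.com:pipeline/shinynew-cool-pipeline.git
EOF

urls=$(list_repo_urls "github.mycompany.com:coolrepo-org" "$sample")
echo "$urls"
# prints: git@github.mycompany.com:coolrepo-org/somerepo1.git

rm -f "$sample"
```

Because the pattern is matched against the whole line, the leading instance:org prefix is what makes it possible to select an entire instance, a single org, or one repo.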
