"wait" waits for ENTER after the subshells end



Given a file with a list of tables, this script launches a sqoop import command with a list of options. The intent of the "scheduler" I borrowed from here is to launch another job in the "scheduler" as soon as one finishes, to keep the queue full. This goes on until the end of the list of tables to Sqoop.

The script and the scheduler work fine, except that the script ends before the subshells finish their jobs.

I tried inserting wait at the end of the script, but then it waits for me to press Enter.

I can't disclose the full script, sorry. I hope you will understand anyway.

Thanks for your help.

#!/bin/bash
# Script to offload RDB tables to Hive in parallel via Sqoop
confFile=$1
listOfTables=$2
# Source configuration values
. "$confFile"
# This file contains various configuration options, along with "parallels",
#  which is the number of concurrent jobs I want to launch
# Some nice functions.
usage () {
  ...
}
doSqoop() {
  # This function launches a Sqoop command compiled with information extracted
  # in the while loop. It also writes 2 log files and checks the Sqoop RC.
  ...
}
queue() {
    queue="$queue $1"
    num=$(($num+1))
}
regeneratequeue() {
    oldrequeue=$queue
    queue=""
    num=0
    for PID in $oldrequeue
    do
        if [ -d /proc/"$PID"  ] ; then
            queue="$queue $PID"
            num=$(($num+1))
        fi
    done
}
checkqueue() {
    oldchqueue=$queue
    for PID in $oldchqueue
    do
        if [ ! -d /proc/"$PID" ] ; then
            regeneratequeue # at least one PID has finished
            break
        fi
    done
}
# Check for mandatory values.
 ...
#### HeavyLifting ####
# Since I have a file containing the list of tables to Sqoop along with other
# useful arguments like sourceDB, sourceTable, hiveDB, HiveTable, number of parallels,
# etc, all in the same line, I use awk to grab them and then
# I pass them to the function doSqoop().
# So, here I:
# 1. create a temp folder
# 2. grab values from line with awk
# 3. launch doSqoop() as below:
# 4. Monitor spawned jobs 
awk '!/^($|#)/' < "$listOfTables" | { while read -r line; 
do
  # look for the folder or create it
  # .....
  # extract values from line with awk
  # ....
  # launch doSqoop() with this line:
  (doSqoop) &
  PID=$!
  queue $PID
  while [[ "$num" -ge "$parallels" ]]; do
    checkqueue
    sleep 0.5
  done
done; }
# Here I tried to put wait, without success.
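One likely reason the final wait misbehaves here: piping awk into `{ while ...; }` runs the loop in a subshell, so the background jobs are not children of the script's own shell. A minimal sketch (with a placeholder doSqoop and a throwaway table list, both assumptions of mine) that keeps the loop in the current shell via process substitution, so a final wait actually blocks on the jobs:

```shell
#!/bin/bash
# Placeholder for the real Sqoop import: sleeps briefly, then logs completion.
doSqoop() {
  sleep 0.2
  echo "done: $1" >> "$logFile"
}

parallels=2
logFile=$(mktemp)
tablesFile=$(mktemp)
printf '%s\n' tab1 tab2 tab3 tab4 > "$tablesFile"

# Process substitution keeps the while loop in the current shell, so the
# background jobs are children of this shell and a final wait can see them.
while read -r line; do
  doSqoop "$line" &
  # Throttle: never more than $parallels jobs in flight.
  while (( $(jobs -rp | wc -l) >= parallels )); do
    sleep 0.1
  done
done < <(awk '!/^($|#)/' "$tablesFile")

wait   # blocks until every backgrounded doSqoop has finished; no ENTER needed
echo "all $(wc -l < "$logFile") jobs finished"
```

With the pipe version, wait runs in the parent while the jobs belong to the pipeline's subshell, so it returns immediately (or appears to wait on the terminal); with process substitution the job table is the script's own.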

EDIT 2

OK, so I managed to implement what deebee suggested, and as far as I can tell it works correctly. I did not implement what duffy suggested, because I did not fully understand it and I don't have time ATM.

Now the problem is that I moved some code into the doSqoop function, and it cannot create the /tmp folder it needs for the logs.
I don't understand what is wrong. Here is the code, followed by the error. Please consider that the query argument is long and contains spaces.

Script

#!/bin/bash
# Script to download a lot of tables in parallel with Sqoop and write them to Hive
confFile=$1
listOfTables=$2
# Source configuration values
. "$confFile"
# TODO: delete sqoop tmp directory after jobs ends #
doSqoop() {
  local origSchema="$1"
  local origTable="$2"
  local hiveSchema="$3"
  local hiveTable="$4"
  local splitColumn="$5"
  local sqoopParallels="$6"
  local query="$7"
  databaseBaseDir="$baseDir"/"$origSchema"-"$hiveSchema"
  local logFileSummary="$databaseBaseDir"/"$hiveTable"-summary.log
  local logFileRaw="$databaseBaseDir"/"$hiveTable"-raw.log
  [ -d "$databaseBaseDir" ] || mkdir -p "$databaseBaseDir"
  if [[ $? -ne 0 ]]; then
    echo -e "Unable to complete the process.\n    Cannot create logs folder $databaseBaseDir"
    exit 1
  fi
  echo "#### [$(date +%Y-%m-%dT%T)] Creating Hive table $hiveSchema.$hiveTable from source table $origSchema.$origTable ####" | tee -a "$logFileSummary" "$logFileRaw"
  echo -e "\n\n"
  quote="'"
  sqoop import -Dmapred.job.queuename="$yarnQueue" -Dmapred.job.name="$jobName" \
  --connect "$origServer" \
  --username SQOOP --password-file file:///"$passwordFile" \
  --delete-target-dir \
  --target-dir "$targetTmpHdfsDir"/"$hiveTable" \
  --outdir "$dirJavaCode" \
  --hive-import \
  --hive-database "$hiveSchema" \
  --hive-table "$hiveTable" \
  --hive-partition-key "$hivePartitionName" --hive-partition-value "$hivePartitionValue" \
  --query "$quote $query where \$CONDITIONS $quote" \
  --null-string '' --null-non-string '' \
  --num-mappers 1 \
  --fetch-size 2000000 \
  --as-textfile \
  -z --compression-codec org.apache.hadoop.io.compress.SnappyCodec |& tee -a "$logFileRaw"
  sqoopRc=${PIPESTATUS[0]}  # RC of sqoop itself, not of the trailing tee
  if [[ $sqoopRc -ne 0 ]]; then 
    echo "[$(date +%Y-%m-%dT%T)] Error importing $hiveSchema.$hiveTable !" | tee -a "$logFileSummary" "$logFileRaw"
    echo "$hiveSchema.$hiveTable" >> "$databaseBaseDir"/failed_imports.txt 
  fi
  echo "Tail of : $logFileRaw" >> "$logFileSummary"
  tail -10 "$logFileRaw"  >> "$logFileSummary"
}
export -f doSqoop
# Check for mandatory values.
if [[ ! -f "$confFile" ]]; then
  echo -e "   $confFile does not appear to be a valid file.\n"
  usage
fi
if [[ ! -f "$listOfTables" ]]; then
  echo -e "   $listOfTables does not appear to be a valid file.\n"
  usage
fi
if [[ -z "${username+x}" ]]; then
  echo -e "   A valid username is required to access the Source.\n"
  usage
fi
if [[ ! -f "$passwordFile" ]]; then
  echo -e "   Password File $passwordFile does not appear to be a valid file.\n"
  usage
fi
if [[ -z "${origServer+x}" ]]; then
  echo -e "   Sqoop connection string is required.\n"
  usage
fi
#### HeavyLifting ####
awk -F"|" '!/^($|#)/ {print $1 $2 $3 $4 $5 $6 $7}' < "$listOfTables" | xargs -n7 -P$parallels bash -c "doSqoop {}"

Error

mkdir: cannot create directory `/{}-'mkdir: : Permission deniedcannot create directory `/{}-'
mkdir: : Permission denied
cannot create directory `/{}-': Permission denied
Unable to complete the process.
    Cannot create logs folder /{}-
mkdir: cannot create directory `/{}-': Permission denied
Unable to complete the process.
    Cannot create logs folder /{}-
Unable to complete the process.
    Cannot create logs folder /{}-
Unable to complete the process.
    Cannot create logs folder /{}-
mkdir: cannot create directory `/{}-': Permission denied
Unable to complete the process.
    Cannot create logs folder /{}-
mkdir: cannot create directory `/{}-': Permission denied
mkdir: cannot create directory `/{}-': Permission denied
Unable to complete the process.
    Cannot create logs folder /{}-
mkdir: mkdir: cannot create directory `/{}-'cannot create directory `/{}-': Permission denied: Permission denied
Unable to complete the process.
    Cannot create logs folder /{}-
Unable to complete the process.
    Cannot create logs folder /{}-
Unable to complete the process.
    Cannot create logs folder /{}-
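The /{}- in those errors is telling: with -n7 and no -I, xargs leaves the literal {} in `bash -c "doSqoop {}"` untouched and appends the seven fields after it, so doSqoop receives {} as its only argument; $origSchema expands to {}, $hiveSchema is empty, and databaseBaseDir becomes /{}-. A sketch (stub doSqoop and made-up space-free fields, both mine) of an invocation that hands the fields to the function as real positional parameters, with `_` filling $0:

```shell
#!/bin/bash
# Stub standing in for the real doSqoop: just echoes what it received.
doSqoop() {
  echo "schema=$1 table=$2 hiveDB=$3 hiveTable=$4 split=$5 par=$6 query=$7"
}
export -f doSqoop

list=$(mktemp)
printf 'SRC|T1|HDB|HT1|id|4|q1\n' > "$list"

# Commas in print emit space-separated fields; "$@" passes them to the
# function as seven separate arguments instead of one literal {}.
out=$(awk -F'|' '!/^($|#)/ {print $1, $2, $3, $4, $5, $6, $7}' "$list" |
  xargs -n7 -P2 bash -c 'doSqoop "$@"' _)
echo "$out"
```

This still breaks as soon as a field contains spaces (the real query does), since xargs splits on whitespace before counting to seven; that is where a line-per-job approach or GNU Parallel's --colsep is safer.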

Since you push doSqoop into the background with &, the only things limiting your script's execution time are sleep 0.5 and however long checkqueue takes to run.

Have you considered using xargs to run the function in parallel?

An example that I think approximates your use case:

$ cat sqoop.bash
#!/bin/bash
doSqoop(){
  local arg="${1}"
  sleep $(shuf -i 1-10 -n 1)  # random between 1 and 10 seconds
  echo -e "${arg}\t$(date +'%H:%M:%S')"
}
export -f doSqoop  # so xargs can use it
threads=$(nproc)  # number of cpu cores
awk '{print}' < tables.list | xargs -n1 -P${threads} -I {} bash -c "doSqoop {}"
$ seq 1 15 > tables.list

Result:

$ ./sqoop.bash
3   11:29:14
4   11:29:14
8   11:29:14
9   11:29:15
11  11:29:15
1   11:29:20
2   11:29:20
6   11:29:21
14  11:29:22
7   11:29:23
5   11:29:23
13  11:29:23
15  11:29:24
10  11:29:24
12  11:29:24

Sometimes it is nice to let xargs do the work for you.

Edit:

An example passing 3 args into the function, running up to 8 operations in parallel:

$ cat sqoop.bash
#!/bin/bash
doSqoop(){
  a="${1}"; b="${2}"; c="${3}"
  sleep $(shuf -i 1-10 -n 1)  # do some work
  echo -e "$(date +'%H:%M:%S') $a $b $c"
}
export -f doSqoop
awk '{print $1,$3,$5}' tables.list | xargs -n3 -P8 -I {} bash -c "doSqoop {}"
$ cat tables.list
1a 1b 1c 1d 1e
2a 2b 2c 2d 2e
3a 3b 3c 3d 3e
4a 4b 4c 4d 4e
5a 5b 5c 5d 5e
6a 6b 6c 6d 6e
7a 7b 7c 7d 7e
$ ./sqoop.bash
09:46:57 1a 1c 1e
09:46:57 7a 7c 7e
09:47:05 3a 3c 3e
09:47:06 4a 4c 4e
09:47:06 2a 2c 2e
09:47:09 5a 5c 5e
09:47:09 6a 6c 6e

With GNU Parallel you can do:

export -f doSqoop
grep -Ev '^#' "$listOfTables" |
  parallel -r --colsep '|' -P$parallels doSqoop {}

If you only want one process per CPU core:

  ... | parallel -r --colsep '|' doSqoop {}

After a while I finally have some time to answer my own question, for anyone else who gets stuck on this kind of problem.

I went through multiple issues, related both to errors in my code and to the use of xargs. In hindsight, based on my experience, I can definitely suggest not using xargs for this kind of thing. Bash is not the most suitable language for it, but if you are forced to use it, consider GNU Parallel. I will move my script to it soon.

Regarding the problems:

  • I had problems passing arguments to the function. In the first place because they contained special characters I had not noticed, and then because I was not using the -I option. I fixed this by cleaning the input lines of newlines and using the xargs options -L1 -I args. This way it treats each line as a single argument, passing it to the function (where I parse the fields with awk).
  • The scheduler I tried to implement did not work properly. I ended up using xargs to parallelize the execution of the function, plus custom code inside it that writes some control files, which helped me understand (at the end of the script) what went wrong and what worked.
  • xargs does not provide any facility to collect the output of the separate jobs. It just dumps it on stdout. I work with Hadoop, I have a lot of output, and it is just a mess.
  • Again, xargs is fine if you use it with other shell commands like find, cat, zip, etc. If you have a use case like mine, don't use it. Just don't, you will end up with white hair. Instead, take some time to learn GNU Parallel, or better, use a full-featured language (if you can).
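On the output-collection point, one workaround worth sketching (stub function and throwaway log directory are mine) is to have each job redirect its own stdout/stderr into a per-table file, so nothing gets interleaved on the shared stdout:

```shell
#!/bin/bash
logDir=$(mktemp -d)
export logDir

# Stub for the real import: everything this job prints lands in its own file.
doSqoop() {
  local table="$1"
  {
    echo "importing $table"
    echo "done $table"
  } > "$logDir/$table.log" 2>&1
}
export -f doSqoop

# -I args makes each whole input line a single argument (and implies one
# line per invocation); -P2 keeps two jobs running at a time.
printf '%s\n' t1 t2 t3 | xargs -I args -P2 bash -c 'doSqoop "args"'
```

Each table then gets its own $logDir/<table>.log to tail or grep afterwards, which is essentially what the control files mentioned above were doing by hand.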
