Comparing website migration results (running two sites in parallel)



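I am migrating a client's home-grown website to Drupal 7. The process is taking a while: design decisions, some new requirements, and so on. I'm sure you have all been there.

I started building a tool that (a) gets a list of URL paths from the old database, (b) fetches the content of each page from both the Drupal site and the old site, (c) runs an XPath query on each page, using xidel to extract the contents of div#maincontent and div#main, and (d) saves the data in new.txt and old.txt files, keeping a folder structure similar to the site's for reference.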

gather_data.sh

#!/bin/bash
# Get the URL paths from the old server (keep only lines that start with "/")
urls=$(ssh user@old_ser "~/data_urls.sh" | grep -E "^/" | sort -u)
# Clear out the current working folder
rm -rf ./working
# Loop through the paths (assumes the paths contain no whitespace)
for i in $urls
do
    # Print a status update, then create a storage folder that mirrors the URL path
    echo "$i"
    storage_area="./working/$i/"
    mkdir -p "$storage_area"

    # Strip the trailing slash, if any
    i=${i%/}
    # Fetch each page and run the XPath query
    xidel "http://old_server$i" -e '//div[@id="maincontent"]//p' > "$storage_area/old.txt"
    xidel "http://new_server$i" -e '//div[@id="content"]//p' > "$storage_area/new.txt"
    # Compare the two files and write the result to cmp.cmp
    cmp "$storage_area/old.txt" "$storage_area/new.txt" > "$storage_area/cmp.cmp"
done

A helper script then loops through the results of the cmp.cmp files.

run_diff.sh

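#!/bin/bash
echo "------------------------------------------------------- "
echo "The following may have differences in content based on wdiff analysis"
for i in $(find ./working/ -type d); do
  # Skip intermediate folders that hold no fetched page
  [ -f "$i/old.txt" ] || continue
  # Turn the folder path back into a URL path
  better_url_name=$(echo "$i" | sed -e 's#^\./working##')

  # Print the two URLs in bold white, then reset the colour
  echo -e "\e[1;37m"
  echo "-----------------------------------------------------------------------"
  echo "http://old_server$better_url_name"
  echo "http://new_server$better_url_name"
  echo "-----------------------------------------------------------------------"
  echo -e "\e[00m"
  # -3 hides the common words, -s prints per-file statistics
  wdiff -3s "$i/old.txt" "$i/new.txt" | colordiff
done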
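
If only the pages that actually differ are of interest, the loop could first be narrowed down by cmp's exit status rather than by the contents of cmp.cmp. The following is a minimal sketch, not part of the original tool: the function name list_changed_pages is hypothetical, and it assumes the old.txt/new.txt layout that gather_data.sh creates.

```shell
#!/bin/bash
# Hypothetical helper: print every folder under the tree given as $1
# whose old.txt and new.txt are not byte-for-byte identical.
list_changed_pages() {
    find "$1" -type f -name old.txt | while IFS= read -r old; do
        dir=$(dirname "$old")
        # cmp -s prints nothing; it exits non-zero when the files differ
        # (or when new.txt is missing)
        cmp -s "$old" "$dir/new.txt" || echo "$dir"
    done
}
```

Calling list_changed_pages ./working would then give the list of folders worth passing to wdiff. Using the exit status also catches the case where one file is merely a truncated copy of the other, which cmp reports on standard error rather than standard output.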

The above produces output like the following.

-----------------------------------------------------------------------
http://old_server/career_services/career_fair.php
http://new_server/career_services/career_fair.php
-----------------------------------------------------------------------

======================================================================
 [-9. 
School-] {+9.School+}
======================================================================
 [-Imagination
April-] {+ImaginationApril+}
======================================================================
 [-contract.
April-] {+contract.April+}
======================================================================
{+ +}
======================================================================
./working/epics/career_services/career_fair.php/old.txt: 1001 words  995 99% common  0 0% deleted  6 1% changed
./working/epics/career_services/career_fair.php/new.txt: 999 words  995 100% common  1 0% inserted  3 0% changed

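My questions:

  • How can I ignore these false positives?
  • How can I filter out whitespace and line-break characters?
  • Is this the right approach at all? Should I drop it in favour of something that gives better results?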

With the diff command, you can use the following options:

   -b  --ignore-space-change
         Ignore changes in the amount of white space.
   -w  --ignore-all-space
         Ignore all white space.
   -B  --ignore-blank-lines
         Ignore changes whose lines are all blank.
       --strip-trailing-cr
         Strip trailing carriage return on input.
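
As a rough illustration of how these options might be applied to the old.txt and new.txt files, here is a small sketch. The function names are hypothetical, and --strip-trailing-cr assumes GNU diff. Note that diff works line by line, so -w alone will still report a paragraph that is split across two lines on one site and joined on the other (as in the "9. School" example in the question); for that case the second function simply deletes all whitespace before comparing.

```shell
#!/bin/bash
# Hypothetical helper: line-based diff that ignores whitespace inside lines,
# blank lines and trailing carriage returns (assumes GNU diff).
diff_ignoring_whitespace() {
    diff -w -B --strip-trailing-cr "$1" "$2"
}

# Hypothetical helper: succeeds when the two files are identical once every
# space, tab, newline and carriage return has been removed.
same_ignoring_whitespace() {
    [ "$(tr -d '[:space:]' < "$1")" = "$(tr -d '[:space:]' < "$2")" ]
}
```

In the loop, something like same_ignoring_whitespace "$i/old.txt" "$i/new.txt" || wdiff -3s "$i/old.txt" "$i/new.txt" would then skip pages whose only differences are spacing and line breaks. The trade-off is that a genuinely missing space between two words is also treated as "no change".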
