Screen scraper script won't write to output file



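I can't get the Perl script below to write to the file output.html.

It doesn't need to be a CGI script yet, but that is the eventual intent.

Can anyone tell me why it isn't writing any text to output.html?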

#!/usr/bin/perl
#-----------------------------------------------------------------------
# This script should work as a CGI script, if I get it correctly.
# Most CGI scripts for Perl begin with the same line and must be
# stored in your server's cgi-bin directory. (I think this is set by
# your web server.)
#
# This script will scrape news sites for stories about topics input
# by the users.
#
# Lara Landis
# Sinister Porpoise Computing
# 1/4/2018
# Personal Perl Project
#-----------------------------------------------------------------------
@global_sites = ();
print( "Starting program.\n" );
if ( !( -e "sitedata.txt" ) ) {
    enter_site_info( @global_sites );
}
if ( !( -e "scrpdata.txt" ) ) {
    print( "scrpdata.txt does not exist. Creating file now.\n" );
    print( "Enter the search words you wish to search for below. Press Ctrl-D to finish.\n" );
    open( SCRAPEFILE, ">scrpdata.txt" );
    while ( $line = <STDIN> ) {
        chop( $line );
        print SCRAPEFILE ( "$line\n" );
    }
    close( SCRAPEFILE );
}
print( "Finished getting site data..." );
scrape_sites( @global_sites );
#----------------------------------------------------------------------
# This routine gets information from the user after the file has been
# created. It also has some basic checking to make sure that the lines
# fed to it are legitimate domains.  This is not an exhaustive list of
# all domains in existence.
#----------------------------------------------------------------------
sub enter_site_info {
    my ( @sisites ) = @_;
    $x = 1;
    open( DATAFILE, ">sitedata.txt" ) || die( "Could not open datafile.\n" );
    print( "Enter websites below. Press Ctrl-D to finish.\n" );
    while ( $x <= @sisites ) {
        $sisites[$x] = <STDIN>;
        print( "$sisites[$x] added.\n" );
        print DATAFILE ( "$sisites[$x]\n" );
        $x++;
    }
    close( DATAFILE );
    return @sisites;
}
#----------------------------------------------------------------------
# If the file exists, just get the information from it.  Read info in
# from the sites. Remember to create a global array for the sites
# data.
#-----------------------------------------------------------------------
#-----------------------------------------------------------------------
# Get the text to find in the sites that are being scraped. This requires
# nested loops. It starts by going through the loops for the text to be
# scraped, and then it goes through each of the websites listed in the
# sitedata.txt file.
#-----------------------------------------------------------------------
sub scrape_sites {
    my ( @ss_info ) = @_;
    @gsi_info = ();
    @toscrape = ();
    $y        = 1;
    #---------------------------
    # Working code to be altered
    #---------------------------
    print( "Getting site info..." );
    $x = 1;
    open( DATAFILE, "sitedata.txt" ) || die( "Can't open sitedata.txt.txt\n" );
    while ( $gsi_info[$x] = <DATAFILE> ) {
        chop( $gsi_info[$x] );
        print( "$gsi_info[$x]\n" );
        $x++;
    }
    close( DATAFILE );
    open( SCRAPEFILE, "scrpdata.txt" ) || die( "Can't open scrpdata.txt\n" );
    print( "Getting scrape data.\n" );
    $y = 1;
    while ( $toscrape[$y] = <SCRAPEFILE> ) {
        chop( $toscrape[$y] );
        $y++;
    }
    close( SCRAPEFILE );
    print( "Now opening the output file.\n" );
    $z = 1;
    open( OUTPUT, ">output.html" );
    print( "Now scraping sites.\n" );
    while ( $z <= @gsi_info ) {    #This loop contains SITES
        system( "rm -f index.html.*" );
        system( "wget $gsi_info[$z]" );
        $z1 = 1;
        print( "Searching site $gsi_info[$z] for $toscrape[$z1]\n" );
        open( TEMPFILE, "$gsi_info[$z]" );
        $comptext = <TEMPFILE>;
        while ( $comptext =~ /$toscrape[z1]/ig ) {    # This loop fetches data from the search terms
            print( "Now scraping $gsi_info[$z] for $toscrape[$z1]\n" );
            print OUTPUT ( "$toscrape[$z1]\n" );
            $z1++;
        }
        close( TEMPFILE );
        $z++;
    }
    close( OUTPUT );
    return ( @gsi_info );
}

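Your assumptions about the current working directory are often incorrect. You seem to assume that the current working directory is the directory where the script resides, but that is never guaranteed, and for CGI scripts it is often /.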

"sitedata.txt"

should be

use FindBin qw( $RealBin );
"$RealBin/sitedata.txt"
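A minimal sketch of how that might look in practice, assuming the data files sit in the same directory as the script. The helper name data_path is hypothetical and only there for illustration; FindBin and File::Spec are core modules.

```perl
use strict;
use warnings;
use FindBin qw( $RealBin );
use File::Spec;

# Hypothetical helper: build a path relative to the directory that
# contains the script, rather than the current working directory.
sub data_path {
    my ($name) = @_;
    return File::Spec->catfile( $RealBin, $name );
}

my $site_file = data_path("sitedata.txt");
print "Site list is expected at $site_file\n";
```

The same helper could be used for scrpdata.txt and output.html, so every file the script touches is resolved the same way.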

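There could also be a permission error. When open fails, you should include the reason for the error ($!) in the error message so you know what is causing the problem!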
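As a rough illustration of including $! in the message, here is a sketch that returns the message instead of dying, so it can run even when the file is missing. The helper try_open and the path are made up for the example; in the real script a plain "or die" with the same message would be the usual form.

```perl
use strict;
use warnings;

# Hypothetical helper: open a file for reading and, on failure, return
# a message that includes the operating system's reason from $!.
sub try_open {
    my ($path) = @_;
    open( my $fh, '<', $path )
        or return ( undef, "Can't open $path: $!" );
    return ( $fh, undef );
}

# The path below is an assumption chosen so that the open fails.
my ( $fh, $err ) = try_open("no-such-directory/sitedata.txt");
print "$err\n" if defined $err;
```

The text after the colon comes from the operating system, so it tells you whether the file is missing, unreadable, or something else.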

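While you check some of them, you don't check all of your open and system calls. If they fail, the program keeps running without an error message to tell you why.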

You could add checks to all of them, but that is easy to forget. Instead, use autodie to perform the checks for you.
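A small sketch of what autodie does, using a path that is assumed not to exist. The eval is only there so the example can print the message and carry on; in the script itself you would normally let the exception end the program.

```perl
use strict;
use warnings;
use autodie;    # open, close and similar built-ins now die on failure

# No "or die" is written here; autodie supplies the check.
my $ok = eval {
    open( my $fh, '<', 'no-such-directory/scrpdata.txt' );
    close($fh);
    1;
};
my $error = $@;
print "open failed: $error\n" unless $ok;
```

Note that plain "use autodie;" does not cover system. Checking system as well needs "use autodie qw(:all);", which depends on the IPC::System::Simple module from CPAN.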

You will also want to use strict to make sure you haven't made any variable-name typos, and use warnings to alert you to small mistakes. See this answer for more.
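To show the kind of typo strict catches, here is a sketch that compiles a snippet with a deliberately misspelled variable. The snippet is wrapped in a string eval only so the compile error can be captured and printed; the variable names are invented for the example.

```perl
use strict;
use warnings;

# The code in this string misspells $search_term on purpose.
my $snippet = q{
    use strict;
    my $search_term = "perl";
    print "Searching for $serach_term\n";
};

# Under strict the snippet fails to compile, so nothing in it runs.
eval $snippet;
my $strict_error = $@;
print "strict caught: $strict_error" if $strict_error;
```

Without strict, the misspelled name would silently be treated as a new, empty global variable and the script would search for nothing.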

Also, @global_sites is empty when it is passed to enter_site_info(), so the loop there will not do anything. And scrape_sites() does nothing with its argument, @ss_info.
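One possible way to restructure the input loop so that it no longer depends on the size of an empty array is to read until end of input and push each line. This is only a sketch: read_sites is a hypothetical name, and an in-memory filehandle stands in for STDIN so the example runs unattended.

```perl
use strict;
use warnings;

# Hypothetical replacement for the input loop: read one site per line
# from any filehandle until end of input (Ctrl-D on a terminal).
sub read_sites {
    my ($in) = @_;
    my @sites;
    while ( my $line = <$in> ) {
        chomp $line;
        next if $line eq '';    # skip blank lines
        push @sites, $line;
    }
    return @sites;
}

# Sample input; in the real script you would pass \*STDIN instead.
my $typed = "example.com\nexample.org\n";
open( my $fake_stdin, '<', \$typed )
    or die "Can't open in-memory handle: $!";
my @global_sites = read_sites($fake_stdin);
close($fake_stdin);
print scalar(@global_sites), " sites read\n";
```

Because the sub returns the list, the caller has to assign the result, as in the last lines above; passing an array in and ignoring the return value leaves the caller's array unchanged.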

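All of this was helpful. Thank you. I found the problem: I was opening the wrong file. Putting error checking on the file open is what let me find the mistake. It should have been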

open (TEMPFILE, "index.html") || die ("Cannot open index.html");

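I have taken as many of the suggestions as I could remember and included them in the code. I still need to implement the directory suggestion, but that should not be difficult.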
