Apache Falcon: setting up a data pipeline on a real cluster [Failed to load data. Error: 400 Bad Request]



I am trying to implement the Hortonworks data pipeline example on a real cluster. I have installed HDP 2.2 on the cluster, but in the UI I get the following error on the Processes and Datasets tabs:

Failed to load data. Error: 400 Bad Request

All of my services are running except HBase, Kafka, Knox, Ranger, Slider, and Spark.

I have read the Falcon entity specification, which describes the individual tags in the cluster, feed, and process definitions, and I have modified the XML configuration files for the feeds and processes as shown below.

Cluster definition

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cluster name="primaryCluster" description="Analytics1" colo="Bangalore" xmlns="uri:falcon:cluster:0.1">
    <interfaces>
        <interface type="readonly" endpoint="hftp://node3.com.analytics:50070" version="2.6.0"/>
        <interface type="write" endpoint="hdfs://node3.com.analytics:8020" version="2.6.0"/>
        <interface type="execute" endpoint="node1.com.analytics:8050" version="2.6.0"/>
        <interface type="workflow" endpoint="http://node1.com.analytics:11000/oozie/" version="4.1.0"/>
        <interface type="messaging" endpoint="tcp://node1.com.analytics:61616?daemon=true" version="5.1.6"/>
    </interfaces>
    <locations>
        <location name="staging" path="/user/falcon/primaryCluster/staging"/>
        <location name="working" path="/user/falcon/primaryCluster/working"/>
    </locations>
    <ACL owner="falcon" group="hadoop"/>
</cluster>
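
For context, a cluster entity like this one is normally registered with Falcon before any feed or process that references it. A minimal sketch of the CLI steps, assuming the XML above is saved locally as primaryCluster.xml:

# Register the cluster entity; feeds and processes reference it by name.
falcon entity -type cluster -submit -file primaryCluster.xml

# Confirm Falcon accepted it.
falcon entity -type cluster -list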

Feed definitions

rawEmailFeed

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="rawEmailFeed" description="Raw customer email feed" xmlns="uri:falcon:feed:0.1">
    <tags>externalSystem=USWestEmailServers,classification=secure</tags>
    <groups>churnAnalysisDataPipeline</groups>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <late-arrival cut-off="hours(4)"/>
    <clusters>
        <cluster name="primaryCluster" type="source">
            <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
            <retention limit="days(3)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/user/falcon/input/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
        <location type="stats" path="/none"/>
        <location type="meta" path="/none"/>
    </locations>
    <ACL owner="falcon" group="users" permission="0755"/>
    <schema location="/none" provider="none"/>
</feed>
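
A feed is submitted the same way, and scheduling it is what activates the retention policy declared above (a sketch, assuming the file is saved as rawEmailFeed.xml):

# Submit and schedule the feed; scheduling enables the days(3) retention cleanup.
falcon entity -type feed -submit -file rawEmailFeed.xml
falcon entity -type feed -schedule -name rawEmailFeed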

cleansedEmailFeed

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<feed name="cleansedEmailFeed" description="Cleansed customer emails" xmlns="uri:falcon:feed:0.1">
    <tags>owner=USMarketing,classification=Secure,externalSource=USProdEmailServers,externalTarget=BITools</tags>
    <groups>churnAnalysisDataPipeline</groups>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <clusters>
        <cluster name="primaryCluster" type="source">
            <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
            <retention limit="days(10)" action="delete"/>
        </cluster>
    </clusters>
    <locations>
        <location type="data" path="/user/falcon/processed/enron/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
    </locations>
    <ACL owner="falcon" group="users" permission="0755"/>
    <schema location="/none" provider="none"/>
</feed>
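
Falcon resolves the ${YEAR}/${MONTH}/${DAY}/${HOUR} variables in the data path against each instance's nominal time, so an instance of this feed for, say, 2016-01-05 07:00 UTC should land in a directory like the following (the time is chosen purely for illustration):

# Expected expansion of the cleansedEmailFeed data path for that instance:
hadoop fs -ls /user/falcon/processed/enron/2016-01-05-07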

Process definitions

rawEmailIngestProcess

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="rawEmailIngestProcess" xmlns="uri:falcon:process:0.1">
    <tags>pipeline=churnAnalysisDataPipeline,owner=ETLGroup,externalSystem=USWestEmailServers</tags>
    <clusters>
        <cluster name="primaryCluster">
            <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <outputs>
        <output name="output" feed="rawEmailFeed" instance="now(0,0)"/>
    </outputs>
    <workflow name="emailIngestWorkflow" version="2.0.0" engine="oozie" path="/user/falcon/apps/ingest/fs"/>
    <retry policy="periodic" delay="minutes(15)" attempts="3"/>
    <ACL owner="falcon" group="hadoop"/>
</process>
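
Assuming this entity is saved as rawEmailIngestProcess.xml, it would then be submitted and scheduled like this; the schedule step is what makes Falcon start materializing the hourly Oozie runs:

falcon entity -type process -submit -file rawEmailIngestProcess.xml
falcon entity -type process -schedule -name rawEmailIngestProcess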

cleanseEmailProcess

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<process name="cleanseEmailProcess" xmlns="uri:falcon:process:0.1">
    <tags>pipeline=churnAnalysisDataPipeline,owner=ETLGroup</tags>
    <clusters>
        <cluster name="primaryCluster">
            <validity start="2014-02-28T00:00Z" end="2016-03-31T00:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>FIFO</order>
    <frequency>hours(1)</frequency>
    <timezone>UTC</timezone>
    <inputs>
        <input name="input" feed="rawEmailFeed" start="now(0,0)" end="now(0,0)"/>
    </inputs>
    <outputs>
        <output name="output" feed="cleansedEmailFeed" instance="now(0,0)"/>
    </outputs>
    <workflow name="emailCleanseWorkflow" version="5.0" engine="pig" path="/user/falcon/apps/pig/id.pig"/>
    <retry policy="periodic" delay="minutes(15)" attempts="3"/>
    <ACL owner="falcon" group="hadoop"/>
</process>
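
Once both processes are scheduled, per-instance state can be queried through the Falcon CLI, which is a useful first stop when runs fail (the time window below is illustrative):

# Check the first few hourly instances of the cleanse step.
falcon instance -type process -name cleanseEmailProcess -status -start 2014-02-28T00:00Z -end 2014-02-28T04:00Z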

I have not made any changes to the ingest.sh, workflow.xml, and id.pig files. They are present in the HDFS locations /user/falcon/apps/ingest/fs (ingest.sh and workflow.xml) and /user/falcon/apps/pig (id.pig). Also, I was not sure whether the hidden .DS_Store files were needed, so I did not include them in those HDFS locations.

ingest.sh

#!/bin/bash
# $1 is the target HDFS directory for this feed instance, passed in by the Oozie workflow.
# curl -sS http://sandbox.hortonworks.com:15000/static/wiki-data.tar.gz | tar xz && hadoop fs -mkdir -p $1 && hadoop fs -put wiki-data/*.txt $1
curl -sS http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz | tar xz && hadoop fs -mkdir -p $1 && hadoop fs -put enron_with_categories/*/*.txt $1
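
The script can also be exercised by hand to rule out connectivity or permission problems independently of Falcon and Oozie, assuming the node has a configured hadoop client and outbound access to the download URL (test path chosen arbitrarily):

# Run the ingest manually against a throwaway HDFS path, then inspect the result.
bash ingest.sh /tmp/enron-ingest-test
hadoop fs -ls /tmp/enron-ingest-test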

workflow.xml

<!--
  Licensed to the Apache Software Foundation (ASF) under one
  or more contributor license agreements.  See the NOTICE file
  distributed with this work for additional information
  regarding copyright ownership.  The ASF licenses this file
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at
       http://www.apache.org/licenses/LICENSE-2.0
  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
-->
<workflow-app xmlns="uri:oozie:workflow:0.4" name="shell-wf">
    <start to="shell-node"/>
    <action name="shell-node">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>ingest.sh</exec>
            <argument>${feedInstancePaths}</argument>
            <file>${wf:appPath()}/ingest.sh#ingest.sh</file>
            <!-- <file>/tmp/ingest.sh#ingest.sh</file> -->
            <!-- <capture-output/> -->
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
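
Since Falcon only generates and submits Oozie jobs, the concrete value it passed for ${feedInstancePaths} can be read back from the materialized workflow's configuration via the Oozie CLI, for example:

# List recent workflow jobs, then dump one job's configuration to see feedInstancePaths.
oozie jobs -oozie http://node1.com.analytics:11000/oozie -jobtype wf -len 10
oozie job -oozie http://node1.com.analytics:11000/oozie -configcontent <workflow-job-id>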

id.pig

-- Identity transform: keep the first field of each record as id and write it back out.
A = load '$input' using PigStorage(',');
B = foreach A generate $0 as id;
store B into '$output' USING PigStorage();
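
The $input and $output parameters here line up with the input and output names declared in cleanseEmailProcess. Run outside Falcon, the equivalent invocation would look something like this (the instance paths are assumed, for illustration only):

pig -param input=/user/falcon/input/enron/2016-01-05-07 \
    -param output=/user/falcon/processed/enron/2016-01-05-07 \
    id.pig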

I am not quite sure how the process flow of the HDP example actually happens, and I would be very grateful if someone could clarify it.

Specifically, I do not understand where the argument $1 passed to ingest.sh comes from. I believe it is the HDFS location where the incoming data is stored. I noticed that workflow.xml has the tag <argument>${feedInstancePaths}</argument>.

Where does feedInstancePaths get its value? I suspect I am getting the error because the feed is not being stored in the proper location, but that may be a separate issue.
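
If that reading is right, then for an instance with nominal time 2016-01-05 07:00 UTC the shell action would effectively run the following (an assumed expansion, based on the rawEmailFeed data path above):

# Hypothetical expansion of ${feedInstancePaths} for one hourly instance:
sh ingest.sh /user/falcon/input/enron/2016-01-05-07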

The user falcon also has 755 permissions on all HDFS directories under /user/falcon.

You are running your own cluster, but the tutorial relies on a resource fetched in the shell script (ingest.sh):

curl -sS http://sandbox.hortonworks.com:15000/static/wiki-data.tar.gz

My guess is that your cluster cannot reach sandbox.hortonworks.com and that you do not have the required resource wiki-data.tar.gz. The tutorial only works as-is on the provided sandbox.
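
One quick way to verify this from a cluster node is to probe the URLs the script references, fetching headers only:

# Check reachability of the download sources used in ingest.sh.
curl -sS -I http://sandbox.hortonworks.com:15000/static/wiki-data.tar.gz
curl -sS -I http://bailando.sims.berkeley.edu/enron/enron_with_categories.tar.gz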