Hadoop 2.7.6 + Spark 2.4.4 + Scala 2.11.12 + Hudi 0.5.2: Single-Node Pseudo-Distributed Installation
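Notes
1. The base Hadoop environment used here comes from another article of mine; this document adds the installation and deployment of Spark and Hudi on top of it. The base environment deployment document is:
Hadoop2.7.6+Mysql5.7+Hive2.3.2+Hbase1.4.9+Kylin2.4 single-node pseudo-distributed installation document
2. The configuration in this article is fairly simple. I hit some pitfalls along the way that are deliberately not written up here, so that beginners like me do not follow a wrong path. The article is fairly detailed overall, and following it step by step should work without errors. If you need to build the base big-data environment first, see my Hadoop deployment document mentioned above, which is also fairly detailed.
3. Spark and Hudi are not introduced here; there is plenty of material online and in the official documentation. Every installation package and official document used in this article is given with a path for direct download or navigation, so that you can download, free of charge, the same versions that I installed.
4. After this installation, the versions of the Hadoop-family components in my test environment are as follows:
Software | Version |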
---|---|
Hadoop | 2.7.6 |
Mysql | 5.7 |
Hive | 2.3.2 |
Hbase | 1.4.9 |
Spark | 2.4.4 |
Hudi | 0.5.2 |
JDK | 1.8.0_151 |
Scala | 2.11.12 |
OGG for bigdata | 12.3 |
Kylin | 2.4 |
Kafka | 2.11-1.1.1 |
Zookeeper | 3.4.6 |
Oracle Linux | 6.8x64 |
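I. Installing Scala, which Spark depends on
Spark 2.4.2 is the only release built against Scala 2.12; the other releases are built against Scala 2.11. The Hudi project uses Spark 2.4.4, and Spark 2.4.4 reports "Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)", so we download Scala 2.11.12 here.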
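The version match described above can be checked mechanically. The following is a minimal sketch, assuming that Spark artifact names carry the Scala binary version as a suffix (as in spark-examples_2.11-2.4.4.jar, which appears later in this article); the function names scala_binary_version and artifact_scala_version are hypothetical helpers, not part of Spark or Scala.

```shell
# Derive the Scala binary version (major.minor) from a full version string.
scala_binary_version() {
  # "2.11.12" -> "2.11"
  printf '%s\n' "$1" | cut -d. -f1,2
}

# Extract the Scala suffix from a Spark artifact name,
# e.g. spark-examples_2.11-2.4.4.jar -> 2.11
artifact_scala_version() {
  printf '%s\n' "$1" | sed -n 's/^.*_\([0-9][0-9]*\.[0-9][0-9]*\)-.*$/\1/p'
}

# Usage: compare the installed Scala with the jar shipped in the Spark package.
#   [ "$(scala_binary_version 2.11.12)" = \
#     "$(artifact_scala_version spark-examples_2.11-2.4.4.jar)" ] && echo "versions match"
```

If the two values differ, the Scala installation and the Spark package were probably built for different Scala lines.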
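1.1 Download and extract Scala
Download URL:
Go to
Download the Linux version:
On the Linux server, create a directory named scala under /usr and upload the downloaded archive to it: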
[root@hadoop opt]# cd /usr/
[root@hadoop usr]# mkdir scala
[root@hadoop usr]# cd scala/
[root@hadoop scala]# pwd
/usr/scala
[root@hadoop scala]# ls
scala-2.11.12.tgz
[root@hadoop scala]# tar -zxvf scala-2.11.12.tgz
[root@hadoop scala]# ls
scala-2.11.12 scala-2.11.12.tgz
[root@hadoop scala]# rm -rf *tgz
[root@hadoop scala]# cd scala-2.11.12/
[root@hadoop scala-2.11.12]# pwd
/usr/scala/scala-2.11.12
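1.2 Configure environment variables
Edit /etc/profile and add the following setting:
export SCALA_HOME=/usr/scala/scala-2.11.12
Then append the following to the PATH variable in the same file:
${SCALA_HOME}/bin
After these additions, my /etc/profile looks like this: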
export JAVA_HOME=/usr/java/jdk1.8.0_151
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/hadoop/
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_COMMON_LIB_NATIVE_DIR"
export HIVE_HOME=/hadoop/hive
export HIVE_CONF_DIR=${HIVE_HOME}/conf
export HCAT_HOME=$HIVE_HOME/hcatalog
export HIVE_DEPENDENCY=/hadoop/hive/conf:/hadoop/hive/lib/*:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-pig-adapter-2.3.3.jar:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-core-2.3.3.jar:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-server-extensions-2.3.3.jar:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-streaming-2.3.3.jar:/hadoop/hive/lib/hive-exec-2.3.3.jar
export HBASE_HOME=/hadoop/hbase/
export ZOOKEEPER_HOME=/hadoop/zookeeper
export KAFKA_HOME=/hadoop/kafka
export KYLIN_HOME=/hadoop/kylin/
export GGHOME=/hadoop/ogg12
export SCALA_HOME=/usr/scala/scala-2.11.12
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$HCAT_HOME/bin:$HBASE_HOME/bin:$ZOOKEEPER_HOME:$KAFKA_HOME:$KYLIN_HOME/bin:${SCALA_HOME}/bin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:${HIVE_HOME}/lib:$HBASE_HOME/lib:$KYLIN_HOME/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/libjsig.so:$JAVA_HOME/jre/lib/amd64/server/libjvm.so:$JAVA_HOME/jre/lib/amd64/server:$JAVA_HOME/jre/lib/amd64:$GG_HOME:/lib
Save and exit, then source the file so the environment variables take effect:
[root@hadoop ~]# source /etc/profile
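Because the profile builds PATH as $PATH plus new entries, sourcing it again in the same shell appends the same directories a second time. This is harmless but makes PATH harder to read. A minimal sketch of an idempotent alternative follows; path_append is a hypothetical helper name, not something /etc/profile provides.

```shell
# Append a directory to PATH only if it is not already present.
path_append() {
  case ":${PATH}:" in
    *":$1:"*) ;;                  # already on PATH, nothing to do
    *) PATH="${PATH}:$1" ;;       # otherwise add it at the end
  esac
}

# Usage inside /etc/profile (SCALA_HOME as set in this article):
#   path_append "${SCALA_HOME}/bin"
#   export PATH
```

Wrapping the entry in colons on both sides makes the match exact, so /usr/bin does not count as a match for /usr/bin2.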
1.3 Verify Scala
[root@hadoop scala-2.11.12]# scala -version
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL
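For scripted checks, the version number can be pulled out of that banner. This is a small sketch, assuming the banner has the form shown above; parse_scala_version is a hypothetical helper name. The usage line merges stderr into stdout because the banner may be written to stderr.

```shell
# Read a "scala -version" banner on stdin and print only the version number.
parse_scala_version() {
  sed -n 's/^Scala code runner version \([0-9.]*\).*$/\1/p'
}

# Usage:
#   scala -version 2>&1 | parse_scala_version
```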
II. Download and extract Spark
2.1 Download Spark
2.2 Extract Spark
Create a spark directory under /hadoop to hold Spark.
[root@hadoop scala-2.11.12]# cd /hadoop/
[root@hadoop hadoop]# mkdir spark
[root@hadoop hadoop]# cd spark/
# Upload the installation package to the spark directory with Xftp
[root@hadoop spark]# tar -zxvf spark-2.4.4-bin-hadoop2.7.tgz
[root@hadoop spark]# ls
spark-2.4.4-bin-hadoop2.7 spark-2.4.4-bin-hadoop2.7.tgz
[root@hadoop spark]# rm -rf *tgz
[root@hadoop spark]# mv spark-2.4.4-bin-hadoop2.7/* .
[root@hadoop spark]# ls
bin conf data examples jars kubernetes LICENSE licenses NOTICE python R README.md RELEASE sbin spark-2.4.4-bin-hadoop2.7 yarn
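The extract-then-move sequence above leaves an empty spark-2.4.4-bin-hadoop2.7 directory behind, as the final ls shows. One possible alternative, sketched here under the assumption that the archive has a single top-level directory and that the installed tar supports --strip-components (GNU tar does), is to extract directly into the target. extract_flat is a hypothetical helper name.

```shell
# Extract a .tgz into a target directory, dropping the archive's single
# top-level directory so that bin/, conf/, etc. land directly in the target.
extract_flat() {
  archive="$1"
  target="$2"
  mkdir -p "$target" &&
    tar -zxf "$archive" -C "$target" --strip-components=1
}

# Usage (paths from this article):
#   extract_flat /hadoop/spark/spark-2.4.4-bin-hadoop2.7.tgz /hadoop/spark
```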
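III. Spark configuration
3.1 Configure environment variables
Edit /etc/profile and add:
export SPARK_HOME=/hadoop/spark
After adding the variable above, edit the PATH variable in the same file and append:
${SPARK_HOME}/bin
After these changes, my /etc/profile contains: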
export JAVA_HOME=/usr/java/jdk1.8.0_151
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/hadoop/
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib:$HADOOP_COMMON_LIB_NATIVE_DIR"
export HIVE_HOME=/hadoop/hive
export HIVE_CONF_DIR=${HIVE_HOME}/conf
export HCAT_HOME=$HIVE_HOME/hcatalog
export HIVE_DEPENDENCY=/hadoop/hive/conf:/hadoop/hive/lib/*:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-pig-adapter-2.3.3.jar:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-core-2.3.3.jar:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-server-extensions-2.3.3.jar:/hadoop/hive/hcatalog/share/hcatalog/hive-hcatalog-streaming-2.3.3.jar:/hadoop/hive/lib/hive-exec-2.3.3.jar
export HBASE_HOME=/hadoop/hbase/
export ZOOKEEPER_HOME=/hadoop/zookeeper
export KAFKA_HOME=/hadoop/kafka
export KYLIN_HOME=/hadoop/kylin/
export GGHOME=/hadoop/ogg12
export SCALA_HOME=/usr/scala/scala-2.11.12
export SPARK_HOME=/hadoop/spark
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$HIVE_HOME/bin:$HCAT_HOME/bin:$HBASE_HOME/bin:$ZOOKEEPER_HOME:$KAFKA_HOME:$KYLIN_HOME/bin:${SCALA_HOME}/bin:${SPARK_HOME}/bin:${SPARK_HOME}/sbin
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar:${HIVE_HOME}/lib:$HBASE_HOME/lib:$KYLIN_HOME/lib
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/libjsig.so:$JAVA_HOME/jre/lib/amd64/server/libjvm.so:$JAVA_HOME/jre/lib/amd64/server:$JAVA_HOME/jre/lib/amd64:$GG_HOME:/lib
After editing, run source /etc/profile so the environment variables take effect.
3.2 Configure the parameter file
Go to the conf directory:
[root@hadoop conf]# pwd
/hadoop/spark/conf
Make a copy of the template configuration file under a new name:
[root@hadoop conf]# cp spark-env.sh.template spark-env.sh
[root@hadoop conf]# ls
docker.properties.template fairscheduler.xml.template log4j.properties.template metrics.properties.template slaves.template spark-defaults.conf.template spark-env.sh spark-env.sh.template
Edit spark-env.sh and add the following settings (adjust the paths to your own environment):
export SCALA_HOME=/usr/scala/scala-2.11.12
export JAVA_HOME=/usr/java/jdk1.8.0_151
export HADOOP_HOME=/hadoop
export HADOOP_CONF_DIR=${HADOOP_HOME}/etc/hadoop
export SPARK_HOME=/hadoop/spark
export SPARK_MASTER_IP=192.168.1.66
export SPARK_EXECUTOR_MEMORY=1G
Run source /etc/profile so the settings take effect.
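The master startup log further down in this article contains the warning "SPARK_MASTER_IP is deprecated, please use SPARK_MASTER_HOST". A hedged variant of the relevant spark-env.sh lines, using the name that warning recommends and the same address and memory values as above, might look like this:

```shell
# spark-env.sh fragment: same values as in this article, but with the
# variable name recommended by Spark's own deprecation warning.
export SPARK_MASTER_HOST=192.168.1.66   # used in place of SPARK_MASTER_IP
export SPARK_EXECUTOR_MEMORY=1G
```

The logs reproduced in this article were produced with SPARK_MASTER_IP, so with this variant the deprecation warning should no longer appear.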
3.3 Create the slaves file
Create a slaves file from the template that Spark provides:
[root@hadoop conf]# pwd
/hadoop/spark/conf
[root@hadoop conf]# cp slaves.template slaves
IV. Start Spark
Spark relies on the distributed file system provided by Hadoop, so make sure Hadoop is running normally before starting Spark.
[root@hadoop hadoop]# jps
23408 RunJar
23249 JobHistoryServer
23297 RunJar
24049 Jps
22404 DataNode
22774 ResourceManager
23670 Kafka
22264 NameNode
22889 NodeManager
23642 QuorumPeerMain
22589 SecondaryNameNode
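Checking the jps listing by eye is easy to get wrong. The sketch below automates it; which daemons count as required is an assumption made here (the HDFS and YARN processes visible in the listing above), and missing_daemons is a hypothetical helper name.

```shell
# Read "jps" output on stdin and print each required daemon that is missing.
missing_daemons() {
  jps_output="$(cat)"
  for name in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    # Compare against the whole second column, so that SecondaryNameNode
    # is not mistaken for NameNode.
    if ! printf '%s\n' "$jps_output" |
         awk -v n="$name" '$2 == n { found = 1 } END { exit !found }'; then
      printf '%s\n' "$name"
    fi
  done
}

# Usage: an empty result means every listed daemon is running.
#   jps | missing_daemons
```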
With Hadoop running normally, run the following on hserver1 (the Hadoop NameNode and the Spark master node; in this environment the host is named hadoop):
[root@hadoop hadoop]# cd /hadoop/spark/sbin
[root@hadoop sbin]# ./start-all.sh
starting org.apache.spark.deploy.master.Master, logging to /hadoop/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-hadoop.out
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /hadoop/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-hadoop.out
[root@hadoop sbin]# cat /hadoop/spark/logs/spark-root-org.apache.spark.deploy.master.Master-1-hadoop.out
Spark Command: /usr/java/jdk1.8.0_151/bin/java -cp /hadoop/spark/conf/:/hadoop/spark/jars/*:/hadoop/etc/hadoop/ -Xmx1g org.apache.spark.deploy.master.Master --host hadoop --port 7077 --webui-port 8080
========================================
20/03/30 22:42:27 INFO master.Master: Started daemon with process name: 24079@hadoop
20/03/30 22:42:27 INFO util.SignalUtils: Registered signal handler for TERM
20/03/30 22:42:27 INFO util.SignalUtils: Registered signal handler for HUP
20/03/30 22:42:27 INFO util.SignalUtils: Registered signal handler for INT
20/03/30 22:42:27 WARN master.MasterArguments: SPARK_MASTER_IP is deprecated, please use SPARK_MASTER_HOST
20/03/30 22:42:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/03/30 22:42:27 INFO spark.SecurityManager: Changing view acls to: root
20/03/30 22:42:27 INFO spark.SecurityManager: Changing modify acls to: root
20/03/30 22:42:27 INFO spark.SecurityManager: Changing view acls groups to:
20/03/30 22:42:27 INFO spark.SecurityManager: Changing modify acls groups to:
20/03/30 22:42:27 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
20/03/30 22:42:27 INFO util.Utils: Successfully started service 'sparkMaster' on port 7077.
20/03/30 22:42:27 INFO master.Master: Starting Spark master at spark://hadoop:7077
20/03/30 22:42:27 INFO master.Master: Running Spark version 2.4.4
20/03/30 22:42:28 INFO util.log: Logging initialized @1497ms
20/03/30 22:42:28 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/03/30 22:42:28 INFO server.Server: Started @1560ms
20/03/30 22:42:28 INFO server.AbstractConnector: Started ServerConnector@a{HTTP/1.1,[http/1.1]}{0.0.0.0:8080}
20/03/30 22:42:28 INFO util.Utils: Successfully started service 'MasterUI' on port 8080.
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@f1f0276{/app,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@f1af444{/app/json,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@259b10d3{/,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6fc2f56f{/json,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@37a28407{/static,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@e99fa57{/app/kill,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@66be5bb8{/driver/kill,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO ui.MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://hadoop:8080
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6b2c0980{/metrics/master/json,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4ac1749f{/metrics/applications/json,null,AVAILABLE,@Spark}
20/03/30 22:42:28 INFO master.Master: I have been elected leader! New state: ALIVE
20/03/30 22:42:31 INFO master.Master: Registering worker 192.168.1.66:39384 with 8 cores, 4.6 GB RAM
[root@hadoop sbin]# cat /hadoop/spark/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-hadoop.out
Spark Command: /usr/java/jdk1.8.0_151/bin/java -cp /hadoop/spark/conf/:/hadoop/spark/jars/*:/hadoop/etc/hadoop/ -Xmx1g org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://hadoop:7077
========================================
20/03/30 22:42:29 INFO worker.Worker: Started daemon with process name: 24173@hadoop
20/03/30 22:42:29 INFO util.SignalUtils: Registered signal handler for TERM
20/03/30 22:42:29 INFO util.SignalUtils: Registered signal handler for HUP
20/03/30 22:42:29 INFO util.SignalUtils: Registered signal handler for INT
20/03/30 22:42:30 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/03/30 22:42:30 INFO spark.SecurityManager: Changing view acls to: root
20/03/30 22:42:30 INFO spark.SecurityManager: Changing modify acls to: root
20/03/30 22:42:30 INFO spark.SecurityManager: Changing view acls groups to:
20/03/30 22:42:30 INFO spark.SecurityManager: Changing modify acls groups to:
20/03/30 22:42:30 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
20/03/30 22:42:30 INFO util.Utils: Successfully started service 'sparkWorker' on port 39384.
20/03/30 22:42:30 INFO worker.Worker: Starting Spark worker 192.168.1.66:39384 with 8 cores, 4.6 GB RAM
20/03/30 22:42:30 INFO worker.Worker: Running Spark version 2.4.4
20/03/30 22:42:30 INFO worker.Worker: Spark home: /hadoop/spark
20/03/30 22:42:31 INFO util.log: Logging initialized @1682ms
20/03/30 22:42:31 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/03/30 22:42:31 INFO server.Server: Started @1758ms
20/03/30 22:42:31 INFO server.AbstractConnector: Started ServerConnector@3d598dff{HTTP/1.1,[http/1.1]}{0.0.0.0:8081}
20/03/30 22:42:31 INFO util.Utils: Successfully started service 'WorkerUI' on port 8081.
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5099c1b0{/logPage,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@{/logPage/json,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@46dcda1b{/,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1617f7cc{/json,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@56e77d31{/static,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@b6{/log,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO ui.WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://hadoop:8081
20/03/30 22:42:31 INFO worker.Worker: Connecting to master hadoop:7077...
20/03/30 22:42:31 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1cf30aaa{/metrics/json,null,AVAILABLE,@Spark}
20/03/30 22:42:31 INFO client.TransportClientFactory: Successfully created connection to hadoop/192.168.1.66:7077 after 36 ms (0 ms spent in bootstraps)
20/03/30 22:42:31 INFO worker.Worker: Successfully registered with master spark://hadoop:7077
Startup succeeded. Open the web UI at http://192.168.1.66:8080/
V. Run the Pi-estimation example that ships with Spark
Here we simply run the Pi-estimation demo in local mode. Follow the steps below.
[root@hadoop sbin]# cd /hadoop/spark/
[root@hadoop spark]# ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master local examples/jars/spark-examples_2.11-2.4.4.jar
20/03/30 22:45:59 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/03/30 22:45:59 INFO spark.SparkContext: Running Spark version 2.4.4
20/03/30 22:45:59 INFO spark.SparkContext: Submitted application: Spark Pi
20/03/30 22:45:59 INFO spark.SecurityManager: Changing view acls to: root
20/03/30 22:45:59 INFO spark.SecurityManager: Changing modify acls to: root
20/03/30 22:45:59 INFO spark.SecurityManager: Changing view acls groups to:
20/03/30 22:45:59 INFO spark.SecurityManager: Changing modify acls groups to:
20/03/30 22:45:59 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
20/03/30 22:45:59 INFO util.Utils: Successfully started service 'sparkDriver' on port 39352.
20/03/30 22:45:59 INFO spark.SparkEnv: Registering MapOutputTracker
20/03/30 22:45:59 INFO spark.SparkEnv: Registering BlockManagerMaster
20/03/30 22:45:59 INFO storage.BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
20/03/30 22:45:59 INFO storage.BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
20/03/30 22:45:59 INFO storage.DiskBlockManager: Created local directory at /tmp/blockmgr-63bf7c92-8908-4784-8e16-4c6ef0c93dc0
20/03/30 22:45:59 INFO memory.MemoryStore: MemoryStore started with capacity 366.3 MB
20/03/30 22:45:59 INFO spark.SparkEnv: Registering OutputCommitCoordinator
20/03/30 22:46:00 INFO util.log: Logging initialized @2066ms
20/03/30 22:46:00 INFO server.Server: jetty-9.3.z-SNAPSHOT, build timestamp: unknown, git hash: unknown
20/03/30 22:46:00 INFO server.Server: Started @2179ms
20/03/30 22:46:00 INFO server.AbstractConnector: Started ServerConnector@3abd581e{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/03/30 22:46:00 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@36dce7ed{/jobs,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6a1ebcff{/jobs/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@{/jobs/job,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@c20be82{/jobs/job/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@13c612bd{/stages,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3ef41c66{/stages/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6b{/stages/stage,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5f{/stages/stage/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@28fa700e{/stages/pool,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3d526ad9{/stages/pool/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@e041f0c{/storage,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6a{/storage/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@{/storage/rdd,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3f3c966c{/storage/rdd/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@11ee02f8{/environment,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4102b1b1{/environment/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@61a5b4ae{/executors,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@3a71c100{/executors/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@5b69fd74{/executors/threadDump,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@f{/executors/threadDump/json,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@437e951d{/static,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@467f77a5{/,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@1bb9aa43{/api,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@66b72664{/jobs/job/kill,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@7a34b7b8{/stages/stage/kill,null,AVAILABLE,@Spark}
20/03/30 22:46:00 INFO ui.SparkUI: Bound SparkUI to 0.0.0.0, and started at http://hadoop:4040
20/03/30 22:46:00 INFO spark.SparkContext: Added JAR file:/hadoop/spark/examples/jars/spark-examples_2.11-2.4.4.jar at spark://hadoop:39352/jars/spark-examples_2.11-2.4.4.jar with timestamp 87
20/03/30 22:46:00 INFO executor.Executor: Starting executor ID driver on host localhost
20/03/30 22:46:00 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 38875.
20/03/30 22:46:00 INFO netty.NettyBlockTransferService: Server created on hadoop:38875
20/03/30 22:46:00 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
20/03/30 22:46:00 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, hadoop, 38875, None)
20/03/30 22:46:00 INFO storage.BlockManagerMasterEndpoint: Registering block manager hadoop:38875 with 366.3 MB RAM, BlockManagerId(driver, hadoop, 38875, None)
20/03/30 22:46:00 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, hadoop, 38875, None)
20/03/30 22:46:00 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, hadoop, 38875, None)
20/03/30 22:46:00 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@6f8e0cee{/metrics/json,null,AVAILABLE,@Spark}
20/03/30 22:46:01 INFO spark.SparkContext: Starting job: reduce at SparkPi.scala:38
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 2 output partitions
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Parents of final stage: List()
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Missing parents: List()
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
20/03/30 22:46:01 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1936.0 B, free 366.3 MB)
20/03/30 22:46:01 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1256.0 B, free 366.3 MB)
20/03/30 22:46:01 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on hadoop:38875 (size: 1256.0 B, free: 366.3 MB)
20/03/30 22:46:01 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1161
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1))
20/03/30 22:46:01 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
20/03/30 22:46:01 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 7866 bytes)
20/03/30 22:46:01 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
20/03/30 22:46:01 INFO executor.Executor: Fetching spark://hadoop:39352/jars/spark-examples_2.11-2.4.4.jar with timestamp 87
20/03/30 22:46:01 INFO client.TransportClientFactory: Successfully created connection to hadoop/192.168.1.66:39352 after 45 ms (0 ms spent in bootstraps)
20/03/30 22:46:01 INFO util.Utils: Fetching spark://hadoop:39352/jars/spark-examples_2.11-2.4.4.jar to /tmp/spark-9e0481a2-756b-436f-bc74-dd42fb5ea839/userFiles--1e78-45f2-a9ed-8ac4360ab170/fetchFileTemp.tmp
20/03/30 22:46:01 INFO executor.Executor: Adding file:/tmp/spark-9e0481a2-756b-436f-bc74-dd42fb5ea839/userFiles--1e78-45f2-a9ed-8ac4360ab170/spark-examples_2.11-2.4.4.jar to class loader
20/03/30 22:46:01 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 824 bytes result sent to driver
20/03/30 22:46:01 INFO scheduler.TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 7866 bytes)
20/03/30 22:46:01 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
20/03/30 22:46:01 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 308 ms on localhost (executor driver) (1/2)
20/03/30 22:46:01 INFO executor.Executor: Finished task 1.0 in stage 0.0 (TID 1). 824 bytes result sent to driver
20/03/30 22:46:01 INFO scheduler.TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 31 ms on localhost (executor driver) (2/2)
20/03/30 22:46:01 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
20/03/30 22:46:01 INFO scheduler.DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 0.606 s
20/03/30 22:46:01 INFO scheduler.DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 0. s
Pi is roughly 3.84667
20/03/30 22:46:01 INFO server.AbstractConnector: Stopped Spark@3abd581e{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
20/03/30 22:46:01 INFO ui.SparkUI: Stopped Spark web UI at http://hadoop:4040
20/03/30 22:46:01 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/03/30 22:46:01 INFO memory.MemoryStore: MemoryStore cleared
20/03/30 22:46:01 INFO storage.BlockManager: BlockManager stopped
20/03/30 22:46:01 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
20/03/30 22:46:01 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/03/30 22:46:01 INFO spark.SparkContext: Successfully stopped SparkContext
20/03/30 22:46:01 INFO util.ShutdownHookManager: Shutdown hook called
20/03/30 22:46:01 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-e019897d-3160-4bb1-ab59-f391e32ec47a
20/03/30 22:46:01 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-9e0481a2-756b-436f-bc74-dd42fb5ea839
The output contains a line beginning with "Pi is roughly", so the estimate of pi has been printed.
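The result line is easy to miss among the INFO messages. A minimal sketch for isolating it follows; extract_pi_line is a hypothetical helper name, and the pattern assumes the result is printed in the "Pi is roughly" form seen in the output above.

```shell
# Read spark-submit output on stdin and keep only the result text.
extract_pi_line() {
  grep -o 'Pi is roughly [0-9.]*'
}

# Usage (same command as above, with stderr merged into stdout):
#   ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master local \
#     examples/jars/spark-examples_2.11-2.4.4.jar 2>&1 | extract_pi_line
```

With the default console logging configuration, Spark's log messages go to stderr while the example prints its result to stdout, so redirecting stderr with 2>/dev/null is another way to see only the result.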
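The run above only used single-machine local mode. To run the demo in cluster mode, read on.
VI. Run the computation in yarn-cluster mode
Go to the Spark installation directory and run the Pi-estimation demo in yarn-cluster mode: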
[root@hadoop spark]# ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster examples/jars/spark-examples_2.11-2.4.4.jar
Warning: Master yarn-cluster is deprecated since 2.0. Please use master "yarn" with specified deploy mode instead.
20/03/30 22:47:47 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/03/30 22:47:48 INFO client.RMProxy: Connecting to ResourceManager at /192.168.1.66:8032
20/03/30 22:47:48 INFO yarn.Client: Requesting a new application from cluster with 1 NodeManagers
20/03/30 22:47:48 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
20/03/30 22:47:48 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
20/03/30 22:47:48 INFO yarn.Client: Setting up container launch context for our AM
20/03/30 22:47:48 INFO yarn.Client: Setting up the launch environment for our AM container
20/03/30 22:47:48 INFO yarn.Client: Preparing resources for our AM container
20/03/30 22:47:48 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
20/03/30 22:47:51 INFO yarn.Client: Uploading resource file:/tmp/spark-d554f7cd-c7d4-4dfa-bc86-11adb6/__spark_libs__.zip -> hdfs://192.168.1.66:9000/user/root/.sparkStaging/application_ 54_0001/__spark_libs__.zip
20/03/30 22:47:59 INFO yarn.Client: Uploading resource file:/hadoop/spark/examples/jars/spark-examples_2.11-2.4.4.jar -> hdfs://192.168.1.66:9000/user/root/.sparkStaging/application_54_0001/spark-examples_2.11-2.4.4.jar
20/03/30 22:47:59 INFO yarn.Client: Uploading resource file:/tmp/spark-d554f7cd-c7d4-4dfa-bc86-11adb6/__spark_conf__.zip -> hdfs://192.168.1.66:9000/user/root/.sparkStaging/application_1 4_0001/__spark_conf__.zip
20/03/30 22:47:59 INFO spark.SecurityManager: Changing view acls to: root
20/03/30 22:47:59 INFO spark.SecurityManager: Changing modify acls to: root
20/03/30 22:47:59 INFO spark.SecurityManager: Changing view acls groups to:
20/03/30 22:47:59 INFO spark.SecurityManager: Changing modify acls groups to:
20/03/30 22:47:59 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
20/03/30 22:48:01 INFO yarn.Client: Submitting application application_54_0001 to ResourceManager
20/03/30 22:48:01 INFO impl.YarnClientImpl: Submitted application application_54_0001
20/03/30 22:48:02 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:02 INFO yarn.Client:
  client token: N/A
  diagnostics: N/A
  ApplicationMaster host: N/A
  ApplicationMaster RPC port: -1
  queue: default
  start time: 88
  final status: UNDEFINED
  tracking URL: http://hadoop:8088/proxy/application_54_0001/
  user: root
20/03/30 22:48:03 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:04 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:05 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:06 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:07 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:08 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:09 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:11 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:12 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:13 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:14 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:15 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:16 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:17 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:19 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:20 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:21 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:22 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:23 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:24 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:25 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:26 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:27 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:28 INFO yarn.Client: Application report for application_54_0001 (state: ACCEPTED)
20/03/30 22:48:29 INFO yarn.Client: Application report for application_54_0001 (state: RUNNING)
20/03/30 22:48:29 INFO yarn.Client:
  client token: N/A
  diagnostics: N/A
  ApplicationMaster host: hadoop
  ApplicationMaster RPC port: 37844
  queue: default
  start time: 88
  final status: UNDEFINED
  tracking URL: http://hadoop:8088/proxy/application_54_0001/
  user: root
20/03/30 22:48:30 INFO yarn.Client: Application report for application_54_0001 (state: RUNNING)
20/03/30 22:48:31 INFO yarn.Client: Application report for application_54_0001 (state: RUNNING)
20/03/30 22:48:32 INFO yarn.Client: Application report for application_54_0001 (state: RUNNING)
20/03/30 22:48:33 INFO yarn.Client: Application report for application_54_0001 (state: RUNNING)
20/03/30 22:48:34 INFO yarn.Client: Application report for application_54_0001 (state: RUNNING)
20/03/30 22:48:35 INFO yarn.Client: Application report for application_54_0001 (state: RUNNING)
20/03/30 22:48:36 INFO yarn.Client: Application report for application_54_0001 (state: RUNNING)
20/03/30 22:48:37 INFO yarn.Client: Application report for application_54_0001 (state: RUNNING)
20/03/30 22:48:38 INFO yarn.Client: Application report for application_54_0001 (state: RUNNING)
20/03/30 22:48:39 INFO yarn.Client: Application report for application_54_0001 (state: RUNNING)
20/03/30 22:48:40 INFO yarn.Client: Application report for application_54_0001 (state: RUNNING)
20/03/30 22:48:41 INFO yarn.Client: Application report for application_54_0001 (state: RUNNING)
20/03/30 22:48:42 INFO yarn.Client: Application report for application_54_0001 (state: RUNNING)
20/03/30 22:48:43 INFO yarn.Client: Application report for application_54_0001 (state: RUNNING)
20/03/30 22:48:44 INFO yarn.Client: Application report for application_54_0001 (state: RUNNING)
20/03/30 22:48:45 INFO yarn.Client: Application report for application_54_0001 (state: RUNNING)
20/03/30 22:48:46 INFO yarn.Client: Application report for application_54_0001 (state: FINISHED)
20/03/30 22:48:46 INFO yarn.Client:
  client token: N/A
  diagnostics: N/A
  ApplicationMaster host: hadoop
  ApplicationMaster RPC port: 37844
  queue: default
  start time: 88
  final status: SUCCEEDED
  tracking URL: http://hadoop:8088/proxy/application_54_0001/
  user: root
20/03/30 22:48:46 INFO util.ShutdownHookManager: Shutdown hook called
20/03/30 22:48:46 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-4c243c24-9489-4c8a-a1bc-a6ad6
20/03/30 22:48:46 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-d554f7cd-c7d4-4dfa-bc86-11adb6
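Note that in yarn-cluster mode the result is not printed to the console. The driver runs inside the cluster, so the result is written to the Hadoop cluster's logs instead. How do you view it? The output above contains this address: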
tracking URL: http://hadoop:8088/proxy/application_54_0001/
Open that URL, click through to logs, and view the contents of stdout.
The computed value of pi is printed there.
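The same logs can usually be fetched from the command line instead of the web UI. The sketch below is a minimal example and not part of the original walkthrough: it pulls the application ID out of the tracking URL line and prints the matching `yarn logs` command. The helper name `extract_app_id` is hypothetical. `yarn logs -applicationId` generally only returns output for a finished application when log aggregation (`yarn.log-aggregation-enable`) is turned on, which is an assumption about this cluster.

```shell
#!/bin/sh
# Hypothetical helper: print the first YARN application ID found in the text passed as $1.
extract_app_id() {
  printf '%s\n' "$1" | grep -o 'application_[0-9][0-9]*_[0-9][0-9]*' | head -n 1
}

# The tracking URL line copied from the spark-submit output above.
tracking_line='tracking URL: http://hadoop:8088/proxy/application_54_0001/'
app_id=$(extract_app_id "$tracking_line")

# Print the command rather than running it, so the sketch works without a cluster.
# Assumption: log aggregation is enabled; otherwise use the web UI as described.
echo "yarn logs -applicationId $app_id"
```

Running the printed command on the cluster should show the same stdout content that the web UI displays.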
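Here are a few more commonly used commands:
Start Spark: ./sbin/start-all.sh
Start Hadoop and Spark together: ./starths.sh
To stop, replace start with stop in the command (for Spark, ./sbin/stop-all.sh).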
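7. Configure Spark to read Hive tables
Hive runs table operations as MapReduce jobs, which is slow. This section describes how to read Hive tables into memory with Spark and compute on them there.
First, copy $HIVE_HOME/conf/hive-site.xml into $SPARK_HOME/conf so that Spark can pick up the Hive configuration. The commands also copy the MySQL JDBC driver into Spark's jars directory so that Spark can connect to the MySQL-backed Hive metastore.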
[root@hadoop spark]# pwd
/hadoop/spark
[root@hadoop spark]# cp $HIVE_HOME/conf/hive-site.xml conf/
[root@hadoop spark]# chmod 777 conf/hive-site.xml
[root@hadoop spark]# cp /hadoop/hive/lib/mysql-connector-java-5.1.47.jar jars/
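After copying, a quick check can confirm that both files landed where Spark expects them. This is a minimal sketch under stated assumptions: `check_spark_hive_setup` is a hypothetical helper, and the fallback path `/hadoop/spark` is taken from the session above.

```shell
#!/bin/sh
# Hypothetical helper: verify that a Spark home contains the Hive config
# and a MySQL JDBC driver. Returns 0 when both are present, 1 otherwise.
check_spark_hive_setup() {
  chk_home=$1
  chk_status=0
  if [ -f "$chk_home/conf/hive-site.xml" ]; then
    echo "ok: $chk_home/conf/hive-site.xml"
  else
    echo "missing: $chk_home/conf/hive-site.xml"
    chk_status=1
  fi
  # Match any connector version, since the jar name depends on what was installed with Hive.
  if ls "$chk_home"/jars/mysql-connector-java-*.jar >/dev/null 2>&1; then
    echo "ok: MySQL JDBC driver in $chk_home/jars"
  else
    echo "missing: mysql-connector-java-*.jar in $chk_home/jars"
    chk_status=1
  fi
  return "$chk_status"
}

# Falls back to the path used in this article when SPARK_HOME is unset.
check_spark_hive_setup "${SPARK_HOME:-/hadoop/spark}" || echo "Spark is not yet set up to read the Hive metastore"
```

The check only looks at file locations. It does not prove that the metastore connection works, which the spark-shell session that follows does.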
Enter the interactive shell with spark-shell:
[root@hadoop spark]# /hadoop/spark/bin/spark-shell
20/03/31 10:31:39 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
20/03/31 10:32:41 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
20/03/31 10:32:41 WARN util.Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
Spark context Web UI available at http://hadoop:4042
Spark context available as 'sc' (master = local[*], app id = local-60).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val hiveContext = new HiveContext(sc)
warning: there was one deprecation warning; re-run with -deprecation for details
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@62966c9f
scala> hiveContext.sql("show databases").show()
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.client.capability.check does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggregate.stats.false.positive.probability does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.broker.address.default does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.orc.time.counters does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.task.scale.memory.reserve-fraction.min does not exist
20/03/31
10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.splits.ms.footer.cache.ppd.enabled does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.event.message.factory does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.metrics.enabled does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.hs2.user.access does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.storage.storageDirectory does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.am.liveness.connection.timeout.ms does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.dynamic.semijoin.reduction.threshold does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.client.connect.retry.limit does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.xmx.headroom does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.dynamic.semijoin.reduction does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.allocator.direct does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.enforce.stats does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.client.consistent.splits does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.tez.session.lifetime does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.timedout.txn.reaper.start does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.cache.ttl does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.management.acl does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.delegation.token.lifetime does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.authentication.ldap.guidKey does not exist 
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.ats.hook.queue.capacity does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.strict.checks.large.query does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.bigtable.minsize.semijoin.reduction does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.allocator.alloc.min does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.client.user does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.alloc.size does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.wait.queue.comparator.class.name does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.output.service.port does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.cache.use.soft.references does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.enabled does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.task.scale.memory.reserve.fraction.max does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.communicator.listener.thread-count does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.container.max.java.heap.fraction does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.stats.column.autogather does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.am.liveness.heartbeat.interval.ms does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.decoding.metrics.percentiles.intervals does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.groupby.position.alias does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.txn.store.impl does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name 
hive.spark.use.groupby.shuffle does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.object.cache.enabled does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.parallel.ops.in.session does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.groupby.limit.extrastep does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.use.ssl does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.service.metrics.file.location does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.client.retry.delay.seconds does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.materializedview.fileformat does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.num.file.cleaner.threads does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.test.fail.compaction does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.blobstore.use.blobstore.as.scratchdir does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.service.metrics.class does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.allocator.mmap.path does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.download.permanent.fns does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.max.historic.queries does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.vectorized.execution.reducesink.new.enabled does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.compactor.max.num.delta does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.compactor.history.retention.attempted does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.port does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name 
hive.compactor.initiator.failed.compacts.threshold does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.service.metrics.reporter does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.output.service.max.pending.writes does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.execution.mode does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.enable.grace.join.in.llap does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.limittranspose does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.memory.mode does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.threadpool.size does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.select.threshold does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.scratchdir.lock does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.use.spnego does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.service.metrics.file.frequency does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.hs2.coordinator.enabled does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.scheduler.timeout.seconds does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.filter.stats.reduction does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.exec.orc.base.delta.ratio does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.fastpath does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.clear.dangling.scratchdir does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.test.fail.heartbeater does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.file.cleanup.delay.seconds does not 
exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.management.rpc.port does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.mapjoin.hybridgrace.bloomfilter does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.enforce.tree does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.stats.ndv.tuner does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.direct.sql.max.query.length does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.compactor.history.retention.failed does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.close.session.on.disconnect does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.ppd.windowing does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.initial.metadata.count.enabled does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.host does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.splits.ms.footer.cache.enabled does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.point.lookup.min does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.file.metadata.threads does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.service.refresh.interval.sec does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.max.output.size does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.driver.parallel.compilation does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.remote.token.requires.signing does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.bucket.pruning does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.cache.allow.synthetic.fileid does not 
exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.hash.table.inflation.factor does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggr.stats.hbase.ttl does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.enforce.vectorized does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.writeset.reaper.interval does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.vectorized.use.vector.serde.deserialize does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.order.columnalignment does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.output.service.send.buffer.size does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.exec.schema.evolution does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.direct.sql.max.elements.values.clause does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.llap.concurrent.queries does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.allow.uber does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.indexer.partition.size.max does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.auth does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.splits.include.fileid does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.communicator.num.threads does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orderby.position.alias does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.communicator.connection.sleep.between.retries.ms does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggregate.stats.max.partitions does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name 
hive.service.metrics.hadoop2.component does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.yarn.shuffle.port does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.direct.sql.max.elements.in.clause does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.passiveWaitTimeMs does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.load.dynamic.partitions.thread does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.indexer.segments.granularity does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.http.response.header.size does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.conf.internal.variable.list does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.limittranspose.reductionpercentage does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.repl.cm.enabled does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.client.retry.limit does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.resultset.serialize.in.tasks does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.query.timeout.seconds does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.service.metrics.hadoop2.frequency does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.splits.directory.batch.ms does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.cache.max.reader.wait does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.scheduler.node.reenable.max.timeout.ms does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.max.open.txns does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.auto.convert.sortmerge.join.reduce.side does not exist 
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.zookeeper.publish.configs does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.auto.convert.join.hashtable.max.entries does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.tez.sessions.init.threads does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.authorization.storage.check.externaltable.drop does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.execution.mode does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.cbo.cnf.maxnodes does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.vectorized.adaptor.usage.mode does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.materializedview.rewriting does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.authentication.ldap.groupMembershipKey does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.catalog.cache.size does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.cbo.show.warnings does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.fshandler.threads does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.max.bloom.filter.entries does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.metadata.fraction does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.materializedview.serde does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.task.scheduler.wait.queue.size does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggr.stats.cache.entries does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.txn.operational.properties does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name 
hive.metastore.hbase.aggr.stats.memory.ttl does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.rpc.port does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.nonvector.wrapper.enabled does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggregate.stats.cache.size does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.vectorized.use.vectorized.input.format does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.cte.materialize.threshold does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.cache.clean.until does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.semijoin.conversion does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.port does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.spark.dynamic.partition.pruning does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.metrics.enabled does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.repl.rootdir does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.limit.partition.request does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.async.log.enabled does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.logger does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.allow.udf.load.on.demand does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.cli.tez.session.async does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.bloom.filter.factor does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.am-reporter.max.threads does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.spark.use.file.size.for.mapjoin 
does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.strict.checks.bucketing does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.bucket.pruning.compat does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.spnego.principal does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.task.preemption.metrics.intervals does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.shuffle.dir.watcher.enabled does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.allocator.arena.count does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.use.SSL does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.communicator.connection.timeout.ms does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.transpose.aggr.join does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.maxTries does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.spark.dynamic.partition.pruning.max.data.size does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.metadata.base does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggr.stats.invalidator.frequency does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.use.lrfu does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.allocator.mmap does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.coordinator.address.default does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.resultset.max.fetch.size does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.conf.hidden.list does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.io.sarg.cache.max.weight.mb does 
not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.clear.dangling.scratchdir.interval does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.sleep.time does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.vectorized.use.row.serde.deserialize does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.compile.lock.timeout does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.timedout.txn.reaper.interval does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.aggregate.stats.max.variance does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.lrfu.lambda does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.metadata.db.type does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.output.stream.timeout does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.transactional.events.mem does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.resultset.default.fetch.size does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.repl.cm.retain does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.merge.cardinality.check does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.authentication.ldap.groupClassKey does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.point.lookup does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.allow.permanent.fns does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.web.ssl does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.txn.manager.dump.lock.state.on.acquire.timeout does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.compactor.history.retention.succeeded 
does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.use.fileid.path does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.slice.row.count does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.mapjoin.optimized.hashtable.probe.percent does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.select.distribute does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.am.use.fqdn does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.scheduler.node.reenable.min.timeout.ms does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.validate.acls does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.support.special.characters.tablename does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.mv.files.thread does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.skip.compile.udf.check does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.vector.serde.enabled does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.repl.cm.interval does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.sleep.interval.between.start.attempts does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.yarn.container.mb does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.http.read.timeout does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.blobstore.optimizations.enabled does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.orc.gap.cache does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.dynamic.partition.hashjoin does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.exec.copyfile.maxnumfiles does not exist 
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.formats does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.http.numConnection does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.task.scheduler.enable.preemption does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.num.executors does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.cache.max.full does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.connection.class does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.tez.sessions.custom.queue.allowed does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.slice.lrr does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.client.password does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.metastore.hbase.cache.max.writer.wait does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.thrift.http.request.header.size does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.webui.max.threads does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.limittranspose.reductiontuples does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.test.rollbacktxn does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.scheduler.num.schedulable.tasks.per.node does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.acl does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.memory.size does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.strict.checks.type.safety does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.async.exec.async.compile 
does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.auto.max.input.size does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.enable.memory.manager does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.msck.repair.batch.size does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.blobstore.supported.schemes does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.splits.allow.synthetic.fileid does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.stats.filter.in.factor does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.spark.use.op.stats does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.exec.input.listing.max.threads does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.tez.session.lifetime.jitter does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.web.port does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.strict.checks.cartesian.product does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.rpc.num.handlers does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.vcpus.per.instance does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.count.open.txns.interval does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.min.bloom.filter.entries does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.optimize.partition.columns.separate does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.orc.cache.stripe.details.mem.size does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.txn.heartbeat.threadpool.size does not exist 20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.scheduler.locality.delay does not exist 20/03/31 
10:33:53 WARN conf.HiveConf: HiveConf of name hive.repl.cmrootdir does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.task.scheduler.node.disable.backoff.factor does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.am.liveness.connection.sleep.between.retries.ms does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.spark.exec.inplace.progress does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.working.directory does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.daemon.memory.per.instance.mb does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.msck.path.validation does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.task.scale.memory.reserve.fraction does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.merge.nway.joins does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.compactor.history.reaper.interval does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.txn.strict.locking.mode does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.encode.vector.serde.async.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.tez.input.generate.consistent.splits does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.in.place.progress does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.druid.indexer.memory.rownum.max does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.server2.xsrf.filter.enabled does not exist
20/03/31 10:33:53 WARN conf.HiveConf: HiveConf of name hive.llap.io.allocator.alloc.max does not exist
+------------+
|databaseName|
+------------+
|     default|
|      hadoop|
+------------+
scala> hiveContext.sql("show tables").show()
+--------+--------------------+-----------+
|database|           tableName|isTemporary|
+--------+--------------------+-----------+
| default|                  aa|      false|
| default|                  bb|      false|
| default|                  dd|      false|
| default|       kylin_account|      false|
| default|        kylin_cal_dt|      false|
| default|kylin_category_gr...|      false|
| default|       kylin_country|      false|
| default|kylin_intermediat...|      false|
| default|kylin_intermediat...|      false|
| default|         kylin_sales|      false|
| default|                test|      false|
| default|           test_null|      false|
+--------+--------------------+-----------+
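As you can see, the query returned results. But why were so many WARN messages printed above?
For example:
WARN conf.HiveConf: HiveConf of name hive.llap.skip.compile.udf.check does not exist
Each warning names a property in hive-site.xml that the Hive client classes used by Spark do not recognize, most likely because the Hive client bundled with Spark 2.4.4 is older than Hive 2.3.2. Deleting the property from the hive-site configuration file removes the warning: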
<property>
  <name>hive.llap.skip.compile.udf.check</name>
  <value>false</value>
  <description>
    Whether to skip the compile-time check for non-built-in UDFs when deciding whether to execute tasks in LLAP.
    Skipping the check allows executing UDFs from pre-localized jars in LLAP; if the jars are not pre-localized, the UDFs will simply fail to load.
  </description>
</property>
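Start spark-shell again and rerun the query: the warning is gone. The other properties named in the warnings can be handled the same way.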
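To see every property that triggers this warning at once, the log can be filtered instead of read by eye. The following is a minimal sketch, assuming the spark-shell output has been saved to a file; the function name unknown_hive_props and the file name spark-shell.log are hypothetical, and the only thing relied on is the message format shown above.

```shell
# Print each distinct property name that HiveConf reported as unknown.
# Reads log text on stdin. grep -o emits one match per occurrence, so this
# also works when several messages ended up on the same line.
unknown_hive_props() {
  grep -o 'HiveConf of name [^ ]* does not exist' | awk '{ print $4 }' | sort -u
}

# Example (spark-shell.log is a hypothetical file holding the saved output):
#   unknown_hive_props < spark-shell.log
```

Each printed name can then be searched for in the hive-site.xml that Spark reads before deciding whether to delete the property.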
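8. Configuring Hudi
8.1 Key points from the official documentation
First, look at the Quick Start page of the official documentation:
My existing Hadoop environment is version 2.7. I installed Spark 2.4.4 earlier precisely because the current official examples use Hadoop 2.7 with Spark 2.4.4. Although Hudi and Spark now support both Scala 2.11.x and 2.12.x, the official site also uses 2.11 here. To stay consistent with the Hudi documentation and with Spark 2.4.4 (Using Scala version 2.11.12 (Java HotSpot™ 64-Bit Server VM, Java 1.8.0_151)), I installed Scala 2.11.12.
Hudi 0.5.2 has already been released, but the official examples still use 0.5.1, so first switch to the release notes for Hudi 0.5.1: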
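The release notes above say the following:
Version upgrades:
Spark upgraded from 2.1.0 to 2.4.4.
Avro upgraded from 1.7.7 to 1.8.2.
Parquet upgraded from 1.8.1 to 1.10.1.
Kafka upgraded from 0.8.2.1 to 2.0.0, as an indirect result of upgrading the spark-streaming-kafka artifact from 0.8_2.11 to 0.10_2.11/2.12.
Important: Hudi 0.5.1 requires Spark 2.4 or later.
Hudi now supports both Scala 2.11 and 2.12; see the Scala 2.12 build instructions to build Hudi with Scala 2.12. In addition, the packages hudi-spark, hudi-utilities, hudi-spark-bundle and hudi-utilities-bundle have been renamed to hudi-spark_{scala_version}, hudi-utilities_{scala_version}, hudi-spark-bundle_{scala_version} and hudi-utilities-bundle_{scala_version}, where scala_version is 2.11 or 2.12.
In 0.5.1, operations on timeline data no longer use renames. This feature is enabled by default when a new Hudi table is created and disabled by default for existing tables. Before enabling it on an existing table, read the upgrade section (https://hudi.apache.org/docs/deployment.html#upgrading). To enable the new timeline layout, that is, to avoid renames, set the write config hoodie.timeline.layout.version=1. Alternatively, use the CLI command repair overwrite-hoodie-props to add hoodie.timeline.layout.version=1 to the hoodie.properties file. Either way, upgrade the Hudi readers (query engines) to 0.5.1 before upgrading the writer.
The CLI supports repair overwrite-hoodie-props, which overwrites a table's hoodie.properties file with a specified file. Use it to update the table name or to switch to the new timeline layout. Note that some queries may fail temporarily while hoodie.properties is being written (a matter of milliseconds); simply rerun them.
The DeltaStreamer parameter that specifies the table type changed from --storage-type to --table-type; see the wiki for the latest terminology.
The values of the Kafka reset offset strategy changed: the enum value LARGEST became LATEST and SMALLEST became EARLIEST. The corresponding DeltaStreamer config is auto.offset.reset.
When exploring Hudi with spark-shell, you must additionally pass --packages org.apache.spark:spark-avro_2.11:2.4.4; see the quickstart for details.
Key generators moved to a separate package, org.apache.hudi.keygen. If you use an overridden key generator class (config: hoodie.datasource.write.keygenerator.class), make sure the fully qualified class name is updated accordingly.
The Hive sync tool registers the RO table for MOR with an _ro suffix, so queries must use the _ro suffix too. Use --skip-ro-suffix to keep the old table name, that is, to sync without appending _ro.
In 0.5.1, the hudi-hadoop-mr-bundle used by the Presto/Hive query engines shades the avro package in order to support real-time queries. Hudi supports pluggable record merge logic: users only need to provide a custom HoodieRecordPayload implementation. If you use this feature, relocate the avro dependency in your own code so that it behaves consistently with Hudi, using a relocation like this:
<relocation>
  <pattern>org.apache.avro.</pattern>
  <shadedPattern>org.apache.hudi.org.apache.avro.</shadedPattern>
</relocation>
DeltaStreamer has better support for deletes; see the blog for details.
DeltaStreamer supports AWS Database Migration Service (DMS); see the blog for details.
DynamicBloomFilter is supported. It is disabled by default and can be enabled with the index config hoodie.bloom.index.filter.type=DYNAMIC_V0.
HDFSParquetImporter supports bulkinsert, configured with --command bulkinsert.
WASB and WASBS cloud storage are supported (the notes say "AWS", but these are Azure Blob Storage schemes).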
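8.2 A failed installation attempt
With the release notes reviewed and the version combination settled, switch to the official documentation for the latest Hudi release, 0.5.2:
Because I had never used Spark or Hudi before, my first thought on seeing the Hudi website was to download an application package for Hudi 0.5.1, deploy it, and only then run the commands given on the site. Below is the mistaken attempt I made: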
Because the official examples currently use 0.5.1, I downloaded that version too:
https://downloads.apache.org/incubator/hudi/0.5.1-incubating/hudi-0.5.1-incubating.src.tgz
Upload the downloaded package to the /hadoop/spark directory and extract it:
[root@hadoop spark]# ls
bin conf data examples hudi-0.5.1-incubating.src.tgz jars kubernetes LICENSE licenses logs NOTICE python R README.md RELEASE sbin spark-2.4.4-bin-hadoop2.7 work yarn
[root@hadoop spark]# tar -zxvf hudi-0.5.1-incubating.src.tgz
[root@hadoop spark]# ls
bin conf data examples hudi-0.5.1-incubating hudi-0.5.1-incubating.src.tgz jars kubernetes LICENSE licenses logs NOTICE python R README.md RELEASE sbin spark-2.4.4-bin-hadoop2.7 work yarn
[root@hadoop spark]# rm -rf *tgz
[root@hadoop ~]# /hadoop/spark/bin/spark-shell \
> --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
> --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
:: loading settings :: url = jar:file:/hadoop/spark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
org.apache.hudi#hudi-spark-bundle_2.11 added as a dependency
org.apache.spark#spark-avro_2.11 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-5717aa3e-7bfb-42c4-aadd-2a884f3521d5;1.0
        confs: [default]
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
You probably access the destination server through a proxy server that is not well configured.
:: resolution report :: resolve 454ms :: artifacts dl 1ms
        :: modules in use:
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |      default     |   2   |   0   |   0   |   0   ||   0   |   0   |
        ---------------------------------------------------------------------

:: problems summary ::
:::: WARNINGS
        Host repo1.maven.org not found. url=https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark-bundle_2.11/0.5.1-incubating/hudi-spark-bundle_2.11-0.5.1-incubating.pom
        Host repo1.maven.org not found. url=https://repo1.maven.org/maven2/org/apache/hudi/hudi-spark-bundle_2.11/0.5.1-incubating/hudi-spark-bundle_2.11-0.5.1-incubating.jar
......
        ::::::::::::::::::::::::::::::::::::::::::::::
        ::          UNRESOLVED DEPENDENCIES         ::
        ::::::::::::::::::::::::::::::::::::::::::::::
        :: org.apache.hudi#hudi-spark-bundle_2.11;0.5.1-incubating: not found
        :: org.apache.spark#spark-avro_2.11;2.4.4: not found
        ::::::::::::::::::::::::::::::::::::::::::::::

:: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS
Exception in thread "main" java.lang.RuntimeException: [unresolved dependency: org.apache.hudi#hudi-spark-bundle_2.11;0.5.1-incubating: not found, unresolved dependency: org.apache.spark#spark-avro_2.11;2.4.4: not found]
        at org.apache.spark.deploy.SparkSubmitUtils$.resolveMavenCoordinates(SparkSubmit.scala:1302)
        at org.apache.spark.deploy.DependencyUtils$.resolveMavenDependencies(DependencyUtils.scala:54)
        at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:304)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:774)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
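8.3 The correct "installation"
What I had downloaded is actually a source package, not something that can be run directly.
Also, spark-shell --packages takes Maven coordinates, not file paths. Spark resolves them with its bundled Ivy, searching the local Maven repository first, then Maven Central and any repositories given by --repositories, and downloads the jars automatically. My virtual machine has no external network access and no local Maven repository containing these jars, so the resolution was bound to fail with "not found" errors.
The official command contains:
--packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.1-incubating,org.apache.spark:spark-avro_2.11:2.4.4
Put simply, this names the same dependencies you would declare in a Maven project's pom file. Looking through the official documentation, I found the central repository address that Hudi provides, and located there the two packages named in the official example:
Taken directly from the repository, they are these two: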
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-avro_2.11</artifactId>
  <version>2.4.4</version>
</dependency>
<dependency>
  <groupId>org.apache.hudi</groupId>
  <artifactId>hudi-spark-bundle_2.11</artifactId>
  <version>0.5.2-incubating</version>
</dependency>
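Maven Central stores each artifact under a path derived from its coordinates, which is the layout visible in the repo1.maven.org URLs of the failed attempt's log. As a sketch, the helper below turns a groupId:artifactId:version coordinate into the jar URL, so the file can be fetched on any machine that does have internet access. The name maven_jar_url is hypothetical (it is not part of Spark or Hudi), and the function assumes plain three-part coordinates without a classifier.

```shell
# Turn "groupId:artifactId:version" into the jar URL on Maven Central.
maven_jar_url() {
  coord=$1
  case $coord in
    *:*:*) ;;
    *) echo "expected groupId:artifactId:version, got: $coord" >&2; return 1 ;;
  esac
  group=${coord%%:*}        # text before the first colon
  rest=${coord#*:}          # artifactId:version
  artifact=${rest%%:*}
  version=${rest#*:}
  # Dots in the groupId become directory separators in the repository path.
  group_path=$(printf '%s' "$group" | tr '.' '/')
  printf 'https://repo1.maven.org/maven2/%s/%s/%s/%s-%s.jar\n' \
    "$group_path" "$artifact" "$version" "$artifact" "$version"
}

# Example, on a machine with network access:
#   curl -O "$(maven_jar_url org.apache.spark:spark-avro_2.11:2.4.4)"
```

The downloaded files can then be copied to the offline server by whatever transfer tool is available.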
So I downloaded these two packages directly from there, and then went back to the official documentation:
It says that I can also get started quickly by building Hudi myself and passing --jars to the spark-shell command. Seeing this hint, I checked the spark-shell help on Linux:
[root@hadoop external_jars]# /hadoop/spark/bin/spark-shell --help
Usage: ./bin/spark-shell [options]

Scala REPL options:
  -I <file>                   preload <file>, enforcing line-by-line interpretation

Options:
  --master MASTER_URL         spark://host:port, mesos://host:port, yarn,
                              k8s://https://host:port, or local (Default: local[*]).
  --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally ("client") or
                              on one of the worker machines inside the cluster ("cluster")
                              (Default: client).
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --name NAME                 A name of your application.
  --jars JARS                 Comma-separated list of jars to include on the driver
                              and executor classpaths.
  --packages                  Comma-separated list of maven coordinates of jars to include
                              on the driver and executor classpaths. Will search the local
                              maven repo, then maven central and any additional remote
                              repositories given by --repositories. The format for the
                              coordinates should be groupId:artifactId:version.
  --exclude-packages          Comma-separated list of groupId:artifactId, to exclude while
                              resolving the dependencies provided in --packages to avoid
                              dependency conflicts.
  --repositories              Comma-separated list of additional remote repositories to
                              search for the maven coordinates given with --packages.
  --py-files PY_FILES         Comma-separated list of .zip, .egg, or .py files to place
                              on the PYTHONPATH for Python apps.
  --files FILES               Comma-separated list of files to be placed in the working
                              directory of each executor. File paths of these files
                              in executors can be accessed via SparkFiles.get(fileName).

  --conf PROP=VALUE           Arbitrary Spark configuration property.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.

  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 1024M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.

  --executor-memory MEM       Memory per executor (e.g. 1000M, 2G) (Default: 1G).

  --proxy-user NAME           User to impersonate when submitting the application.
                              This argument does not work with --principal / --keytab.

  --help, -h                  Show this help message and exit.
  --verbose, -v               Print additional debug output.
  --version,                  Print the version of current Spark.

 Cluster deploy mode only:
  --driver-cores NUM          Number of cores used by the driver, only in cluster mode
                              (Default: 1).

 Spark standalone or Mesos with cluster deploy mode only:
  --supervise                 If given, restarts the driver on failure.
  --kill SUBMISSION_ID        If given, kills the driver specified.
  --status SUBMISSION_ID      If given, requests the status of the driver specified.

 Spark standalone and Mesos only:
  --total-executor-cores NUM  Total cores for all executors.

 Spark standalone and YARN only:
  --executor-cores NUM        Number of cores per executor. (Default: 1 in YARN mode,
                              or all available cores on the worker in standalone mode)

 YARN-only:
  --queue QUEUE_NAME          The YARN queue to submit to (Default: "default").
  --num-executors NUM         Number of executors to launch (Default: 2).
                              If dynamic allocation is enabled, the initial number of
                              executors will be at least NUM.
  --archives ARCHIVES         Comma separated list of archives to be extracted into the
                              working directory of each executor.
  --principal PRINCIPAL       Principal to be used to login to KDC, while running on
                              secure HDFS.
  --keytab KEYTAB             The full path to the file that contains the keytab for the
                              principal specified above. This keytab will be copied to
                              the node running the Application Master via the Secure
                              Distributed Cache, for renewing the login tickets and the
                              delegation tokens periodically.
So --jars points to jar files that already exist on the machine. Next, upload the two packages downloaded earlier to the server:
[root@hadoop spark]# mkdir external_jars
[root@hadoop spark]# cd external_jars/
[root@hadoop external_jars]# pwd
/hadoop/spark/external_jars
(upload the jars to this directory with xftp)
[root@hadoop external_jars]# ls
hudi-spark-bundle_2.11-0.5.2-incubating.jar scala-library-2.11.12.jar spark-avro_2.11-2.4.4.jar spark-tags_2.11-2.4.4.jar unused-1.0.0.jar
Then change the official example command:
spark-2.4.4-bin-hadoop2.7/bin/spark-shell \
  --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.2-incubating,org.apache.spark:spark-avro_2.11:2.4.4 \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
to:
[root@hadoop external_jars]# /hadoop/spark/bin/spark-shell --jars /hadoop/spark/external_jars/spark-avro_2.11-2.4.4.jar,/hadoop/spark/external_jars/hudi-spark-bundle_2.11-0.5.2-incubating.jar --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
20/03/31 15:19:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://hadoop:4040
Spark context available as 'sc' (master = local[*], app id = local-81).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.4
      /_/

Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
OK! No errors this time. Next, try the insert, update, delete and query operations.
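Typing the long comma-separated --jars value by hand is error-prone, since a single wrong path is enough to leave a class missing at runtime. As a sketch, the helper below builds that value from a directory and a list of file names and fails early when a file is absent. The name jars_arg is hypothetical; the directory and jar names in the example are the ones used above.

```shell
# Join the given jar file names under one directory into the comma-separated
# list that --jars expects. Returns non-zero if any file does not exist.
jars_arg() {
  dir=$1
  shift
  list=
  for name in "$@"; do
    if [ ! -f "$dir/$name" ]; then
      echo "missing jar: $dir/$name" >&2
      return 1
    fi
    list=${list:+$list,}$dir/$name   # prepend a comma except for the first entry
  done
  printf '%s\n' "$list"
}

# Example:
#   /hadoop/spark/bin/spark-shell \
#     --jars "$(jars_arg /hadoop/spark/external_jars \
#                 spark-avro_2.11-2.4.4.jar \
#                 hudi-spark-bundle_2.11-0.5.2-incubating.jar)" \
#     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer'
```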
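8.4 Hudi insert, update, delete and query
The following continues in the spark-shell session started in the previous step.
8.4.1 Setting the table name, base path and data generator to generate records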
scala> import org.apache.hudi.QuickstartUtils._
import org.apache.hudi.QuickstartUtils._

scala> import scala.collection.JavaConversions._
import scala.collection.JavaConversions._

scala> import org.apache.spark.sql.SaveMode._
import org.apache.spark.sql.SaveMode._

scala> import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceReadOptions._

scala> import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.DataSourceWriteOptions._

scala> import org.apache.hudi.config.HoodieWriteConfig._
import org.apache.hudi.config.HoodieWriteConfig._

scala> val tableName = "hudi_cow_table"
tableName: String = hudi_cow_table

scala> val basePath = "file:///tmp/hudi_cow_table"
basePath: String = file:///tmp/hudi_cow_table

scala> val dataGen = new DataGenerator
dataGen: org.apache.hudi.QuickstartUtils.DataGenerator = org.apache.hudi.QuickstartUtils$DataGenerator@4bf6bc2d
The data generator can produce sample inserts and updates based on the sample trip schema.
scala> val inserts = convertToStringList(dataGen.generateInserts(10))
inserts: java.util.List[String] = [{"ts": 0.0, "uuid": "81a9b76c-655b-4527-85fc-7696bdeab4fd", "rider": "rider-213", "driver": "driver-213", "begin_lat": 0.69653, "begin_lon": 0., "end_lat": 0.8858, "end_lon": 0.18241, "fare": 34.2845, "partitionpath": "americas/brazil/sao_paulo"},
{"ts": 0.0, "uuid": "0d612dd2-5f10-4296-a434-b34e6558e8f1", "rider": "rider-213", "driver": "driver-213", "begin_lat": 0.36587, "begin_lon": 0.27752, "end_lat": 0.29602, "end_lon": 0.93655, "fare": 43.14, "partitionpath": "americas/brazil/sao_paulo"},
{"ts": 0.0, "uuid": "0e170de4-7eda-4ab5-8c06-e351e8b23e3d", "rider": "rider-213", "driver": "driver-213", "begin_lat": 0.30634, "begin_...

scala> val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
warning: there was one deprecation warning; re-run with -deprecation for details
df: org.apache.spark.sql.DataFrame = [begin_lat: double, begin_lon: double ... 8 more fields]

scala> df.write.format("org.apache.hudi").
     |   options(getQuickstartWriteConfigs).
     |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
     |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     |   option(TABLE_NAME, tableName).
     |   mode(Overwrite).
     |   save(basePath);
20/03/31 15:28:11 WARN hudi.DefaultSource: Snapshot view not supported yet via data source, for MERGE_ON_READ tables. Please query the Hive table registered using Spark SQL.
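mode(Overwrite) overwrites and recreates the dataset if it already exists. You can check the data generated under /tmp/hudi_cow_table/<region>/<country>/<city>/. We provided a record key (uuid in the schema), a partition field (region/country/city) and precombine logic (ts in the schema) to ensure that trip records are unique within each partition. For more information, see "Modeling data stored in Hudi"; for ways to ingest data into Hudi, see "Writing Hudi datasets". Here we use the default write operation, upsert. If your workload has no updates, you can also use the faster insert or bulk insert operations. To learn more, see "Write operations".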
scala> val roViewDF = spark.
     |   read.
     |   format("org.apache.hudi").
     |   load(basePath + "/*/*/*/*")
20/03/31 15:30:03 WARN hudi.DefaultSource: Snapshot view not supported yet via data source, for MERGE_ON_READ tables. Please query the Hive table registered using Spark SQL.
roViewDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 13 more fields]

scala> roViewDF.registerTempTable("hudi_ro_table")
warning: there was one deprecation warning; re-run with -deprecation for details

scala> spark.sql("select fare, begin_lon, begin_lat, ts from hudi_ro_table where fare > 20.0").show()
+------------------+-------------------+-------------------+---+
| fare| begin_lon| begin_lat| ts|
+------------------+-------------------+-------------------+---+
| 93.618|0.|0.|0.0|
| 64.016| 0.12024| 0.30634|0.0|
| 27.596| 0.89661|0.088261|0.0|
| 33.643| 0.48392| 0.68272|0.0|
|34.2845|0.| 0.69653|0.0|
| 66.246|0.045928| 0.03035|0.0|
| 43.14| 0.27752| 0.36587|0.0|
| 41.068| 0.14224| 0.0742|0.0|
+------------------+-------------------+-------------------+---+

scala> spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, rider, driver, fare from hudi_ro_table").show()
+-------------------+--------------------+----------------------+---------+----------+------------------+
|_hoodie_commit_time| _hoodie_record_key|_hoodie_partition_path| rider| driver| fare|
+-------------------+--------------------+----------------------+---------+----------+------------------+
| 807|aa-dd3f-4a7...| americas/united_s...|rider-213|driver-213| 93.618|
| 807|0e170de4-7eda-4ab...| americas/united_s...|rider-213|driver-213| 64.016|
| 807|fb06d140-cd00-413...| americas/united_s...|rider-213|driver-213| 27.596|
| 807|eb1d495c-57b0-4b3...| americas/united_s...|rider-213|driver-213| 33.643|
| 807|2b3380b7-2216-4ca...| americas/united_s...|rider-213|driver-213|19.3607|
| 807|81a9b76c-655b-452...| americas/brazil/s...|rider-213|driver-213|34.2845|
| 807|d24e8cb8-69fd-4cc...| americas/brazil/s...|rider-213|driver-213| 66.246|
| 807|0d612dd2-5f10-429...| americas/brazil/s...|rider-213|driver-213| 43.14|
| 807|a6a7e7ed-3559-4ee...| asia/india/chennai|rider-213|driver-213|17.1155|
| 807|824ee8d5-6f1f-4d5...| asia/india/chennai|rider-213|driver-213| 41.068|
+-------------------+--------------------+----------------------+---------+----------+------------------+
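This query provides a read-optimized view of the ingested data. Because our partition path (region/country/city) is nested three levels below the base path, we used load(basePath + "/*/*/*/*"). For more on all supported storage types and views, see "Storage types and views".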
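The number of /* components follows directly from the partition depth: one per partition level, plus one more for the data files inside the leaf directory. The sketch below makes that arithmetic explicit for an arbitrary partition path; hudi_load_glob is a hypothetical helper name and is not part of Hudi.

```shell
# Build the glob suffix to append to basePath in load(): one "/*" for each
# partition level plus one "/*" for the files in the leaf partition directory.
hudi_load_glob() {
  partition_path=$1
  # Number of levels = number of "/" separators + 1.
  slashes=$(printf '%s' "$partition_path" | tr -cd '/' | wc -c | tr -d ' ')
  levels=$((slashes + 1))
  glob=
  i=0
  while [ "$i" -le "$levels" ]; do   # runs levels + 1 times
    glob="$glob/*"
    i=$((i + 1))
  done
  printf '%s\n' "$glob"
}

# Example: a three-level partition path such as americas/brazil/sao_paulo
#   hudi_load_glob americas/brazil/sao_paulo
```

A table partitioned by a single field would, by the same rule, need two components.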
scala> val updates = convertToStringList(dataGen.generateUpdates(10))
updates: java.util.List[String] = [{"ts": 0.0, "uuid": "0e170de4-7eda-4ab5-8c06-e351e8b23e3d", "rider": "rider-284", "driver": "driver-284", "begin_lat": 0.54792, "begin_lon": 0.33181, "end_lat": 0.62802, "end_lon": 0.41996, "fare": 49.2056, "partitionpath": "americas/united_states/san_francisco"},
{"ts": 0.0, "uuid": "81a9b76c-655b-4527-85fc-7696bdeab4fd", "rider": "rider-284", "driver": "driver-284", "begin_lat": 0.88556, "begin_lon": 0.0, "end_lat": 0.38475, "end_lon": 0.07014, "fare": 29.079, "partitionpath": "americas/brazil/sao_paulo"},
{"ts": 0.0, "uuid": "81a9b76c-655b-4527-85fc-7696bdeab4fd", "rider": "rider-284", "driver": "driver-284", "begin_lat": 0....

scala> val df = spark.read.json(spark.sparkContext.parallelize(updates, 2));
warning: there was one deprecation warning; re-run with -deprecation for details
df: org.apache.spark.sql.DataFrame = [begin_lat: double, begin_lon: double ... 8 more fields]

scala> df.write.format("org.apache.hudi").
     |   options(getQuickstartWriteConfigs).
     |   option(PRECOMBINE_FIELD_OPT_KEY, "ts").
     |   option(RECORDKEY_FIELD_OPT_KEY, "uuid").
     |   option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath").
     |   option(TABLE_NAME, tableName).
     |   mode(Append).
     |   save(basePath);
20/03/31 15:32:27 WARN hudi.DefaultSource: Snapshot view not supported yet via data source, for MERGE_ON_READ tables. Please query the Hive table registered using Spark SQL.
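Note that the save mode is now Append. In general, always use Append mode unless you are creating the dataset for the first time. Querying the data again now shows the updated trips. Each write operation generates a new commit, identified by a timestamp. Look for changes in the _hoodie_commit_time, rider and driver fields for the same _hoodie_record_key values as in the previous commit.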
scala> // reload data

scala> spark.
     |   read.
     |   format("org.apache.hudi").
     |   load(basePath + "/*/*/*/*").
     |   createOrReplaceTempView("hudi_ro_table")
20/03/31 15:33:55 WARN hudi.DefaultSource: Snapshot view not supported yet via data source, for MERGE_ON_READ tables. Please query the Hive table registered using Spark SQL.

scala> val commits = spark.sql("select distinct(_hoodie_commit_time) as commitTime from hudi_ro_table order by commitTime").map(k => k.getString(0)).take(50)
commits: Array[String] = Array(807, 224)

scala> val beginTime = commits(commits.length - 2) // commit time we are interested in
beginTime: String = 807

scala> // incrementally query data

scala> val incViewDF = spark.
     |   read.
     |   format("org.apache.hudi").
     |   option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL).
     |   option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
     |   load(basePath);
20/03/31 15:34:40 WARN hudi.DefaultSource: hoodie.datasource.view.type is deprecated and will be removed in a later release. Please use hoodie.datasource.query.type
incViewDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 13 more fields]

scala> incViewDF.registerTempTable("hudi_incr_table")
warning: there was one deprecation warning; re-run with -deprecation for details

scala> spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_incr_table where fare > 20.0").show()
+-------------------+------------------+--------------------+-------------------+---+
|_hoodie_commit_time| fare| begin_lon| begin_lat| ts|
+-------------------+------------------+--------------------+-------------------+---+
| 224|49.2056| 0.33181| 0.54792|0.0|
| 224| 98.87| 0.48327| 0.07303|0.0|
| 224| 90.54| 0.|0.016366|0.0|
| 224| 90.239| 0.89222|0.054165|0.0|
| 224| 29.079|0.0| 0.88556|0.0|
| 224| 63.929| 0.6927| 0.23376|0.0|
+-------------------+------------------+--------------------+-------------------+---+
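This returns all changes that happened after the beginTime commit, with the filter fare > 20.0 applied. What makes this feature distinctive is that it lets you author streaming pipelines on batch data. A query as of a specific point in time works the same way: point endTime at the commit of interest and set beginTime to "000", which is earlier than any commit.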
scala> val beginTime = "000" // Represents all commits > this time.
beginTime: String = 000

scala> val endTime = commits(commits.length - 2) // commit time we are interested in
endTime: String = 807

scala> // incrementally query data

scala> val incViewDF = spark.read.format("org.apache.hudi").
     |   option(VIEW_TYPE_OPT_KEY, VIEW_TYPE_INCREMENTAL_OPT_VAL).
     |   option(BEGIN_INSTANTTIME_OPT_KEY, beginTime).
     |   option(END_INSTANTTIME_OPT_KEY, endTime).
     |   load(basePath);
20/03/31 15:36:00 WARN hudi.DefaultSource: hoodie.datasource.view.type is deprecated and will be removed in a later release. Please use hoodie.datasource.query.type
incViewDF: org.apache.spark.sql.DataFrame = [_hoodie_commit_time: string, _hoodie_commit_seqno: string ... 13 more fields]

scala> incViewDF.registerTempTable("hudi_incr_table")
warning: there was one deprecation warning; re-run with -deprecation for details

scala> spark.sql("select `_hoodie_commit_time`, fare, begin_lon, begin_lat, ts from hudi_incr_table where fare > 20.0").show()
+-------------------+------------------+-------------------+-------------------+---+
|_hoodie_commit_time| fare| begin_lon| begin_lat| ts|
+-------------------+------------------+-------------------+-------------------+---+
| 807| 93.618|0.|0.|0.0|
| 807| 64.016| 0.12024| 0.30634|0.0|
| 807| 27.596| 0.89661|0.088261|0.0|
| 807| 33.643| 0.48392| 0.68272|0.0|
| 807|34.2845|0.| 0.69653|0.0|
| 807| 66.246|0.045928| 0.03035|0.0|
| 807| 43.14| 0.27752| 0.36587|0.0|
| 807| 41.068| 0.14224| 0.0742|0.0|
+-------------------+------------------+-------------------+-------------------+---+
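The commit times used as beginTime and endTime can also be read outside Spark, from the table's metadata directory. The sketch below assumes that a copy-on-write table on the local file system records each completed commit as a file named <instant>.commit under <basePath>/.hoodie; that layout is an assumption worth verifying on your own table, and hudi_commits is a hypothetical helper name.

```shell
# List the completed commit instants of a table, oldest first.
# Only files ending in ".commit" are considered, so in-flight or requested
# instants and hoodie.properties are ignored.
hudi_commits() {
  base=$1
  for f in "$base"/.hoodie/*.commit; do
    [ -e "$f" ] || continue        # no match: the pattern is left unexpanded
    name=${f##*/}                  # strip the directory part
    printf '%s\n' "${name%.commit}"
  done | sort
}

# Example: the second-to-last commit, the same choice as
# commits(commits.length - 2) in the session above:
#   hudi_commits /tmp/hudi_cow_table | tail -n 2 | head -n 1
```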