Hive02-Installation

NiuMT 2020-07-03 20:58:30

Hive

Installing Hive

Extract apache-hive-1.2.1-bin.tar.gz (the steps below assume the extracted directory is /opt/module/hive)
In /opt/module/hive/conf, rename hive-env.sh.template to hive-env.sh
Configure hive-env.sh

Set the HADOOP_HOME path:
export HADOOP_HOME=/opt/module/hadoop-2.7.2

Set the HIVE_CONF_DIR path:
export HIVE_CONF_DIR=/opt/module/hive/conf

Start HDFS and YARN: sbin/start-dfs.sh, sbin/start-yarn.sh
Create the /tmp and /user/hive/warehouse directories on HDFS and make them group-writable. (Optional; the system creates them automatically.)

[atguigu@hadoop102 hadoop-2.7.2]$ bin/hadoop fs -mkdir /tmp 
[atguigu@hadoop102 hadoop-2.7.2]$ bin/hadoop fs -mkdir -p /user/hive/warehouse 

[atguigu@hadoop102 hadoop-2.7.2]$ bin/hadoop fs -chmod g+w /tmp 
[atguigu@hadoop102 hadoop-2.7.2]$ bin/hadoop fs -chmod g+w /user/hive/warehouse
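The hive-env.sh steps above can also be scripted. The following is a minimal sketch, not part of the original procedure: the function name configure_hive_env is hypothetical, and it copies the template instead of renaming it so the original file stays available for reference.

```shell
# Hypothetical helper: create hive-env.sh from its template and append the
# two exports described above.
# Arguments: $1 = Hive conf directory, $2 = Hadoop home directory.
configure_hive_env() {
    # Copy (rather than rename) the template, then append the exports.
    cp "$1/hive-env.sh.template" "$1/hive-env.sh" && {
        echo "export HADOOP_HOME=$2"
        echo "export HIVE_CONF_DIR=$1"
    } >> "$1/hive-env.sh"
}

# Usage with the paths from this article:
# configure_hive_env /opt/module/hive/conf /opt/module/hadoop-2.7.2
```

Appending the exports at the end of the file leaves the commented examples in the template untouched.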

Basic Hive operations

# Start Hive
[atguigu@hadoop102 hive]$ bin/hive
# List the databases
hive> show databases;
# Switch to the default database
hive> use default;
# List the tables in the default database
hive> show tables;
# Create a table
hive> create table student(id int, name string);
# List the tables again to confirm the new table exists
hive> show tables;
# Show the table structure
hive> desc student;
# Insert a row into the table
hive> insert into student values(1000,"ss");
# Query the table
hive> select * from student;
# Exit Hive
hive> quit;

Loading a local file into Hive

Load the data in the local file /opt/module/data/student.txt into the Hive table student(id int, name string).

# student.txt: the fields must be separated by a tab
1001 zhangshan 
1002 lishi 
1003 zhaoliu 
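Because tabs and spaces look the same on screen, one way to be sure the file really is tab-separated is to generate it with printf. This is a minimal sketch; the function name write_student_file is hypothetical.

```shell
# Hypothetical helper: write the three sample rows with a real tab between
# id and name. printf reuses the format string for each pair of arguments.
# Argument: $1 = output file path.
write_student_file() {
    printf '%s\t%s\n' \
        1001 zhangshan \
        1002 lishi \
        1003 zhaoliu > "$1"
}

# Usage with the path from this article:
# write_student_file /opt/module/data/student.txt
# cat -A /opt/module/data/student.txt   # GNU cat shows each tab as ^I
```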

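# Start Hive
# Create the student table, declaring '\t' as the field delimiter
# (if the student table from the previous section still exists, drop it first)
hive> create table student(id int, name string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
# Load /opt/module/data/student.txt into the student table.
# This command is essentially an HDFS put: a new txt file in the same format, placed in the table's HDFS directory, can be queried in the same way
hive> load data local inpath '/opt/module/data/student.txt' into table student;
# Without LOCAL, the file is moved (mv) from its HDFS path into the table:
hive> load data inpath "/HDFS/path" into table stu;
# Query result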
hive> select * from student; 
OK 
1001 zhangshan 
1002 lishi 
1003 zhaoliu 
Time taken: 0.266 seconds, Fetched: 3 row(s)

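Note: by default the Metastore is stored in the bundled Derby database, which does not allow several Hive windows to be open at the same time. A Metastore stored in MySQL does.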

Installing MySQL

Install MySQL

Check whether MySQL is already installed; if it is, uninstall it

rpm -qa|grep mysql # query
rpm -e --nodeps mysql-libs-XXXXX.x86_64  # uninstall

Install the MySQL server: rpm -ivh MySQL-server-5.6.24-1.el6.x86_64.rpm
View the generated random password: cat /root/.mysql_secret
Check the MySQL status: service mysql status
Start the MySQL service: service mysql start
Install the MySQL client: rpm -ivh MySQL-client-5.6.24-1.el6.x86_64.rpm
Connect to MySQL with the random password: mysql -uroot -pOEXaQuS8IWkG19Xs
Change the password: mysql>SET PASSWORD=PASSWORD('000000');
Exit: mysql>exit;

Host configuration in the MySQL user table

Configure MySQL so that the root user, with its password, can log in to the database from any host.

[root@hadoop102 mysql-libs]# mysql -uroot -p000000
mysql>use mysql;
mysql>show tables; 
mysql>select User, Host, Password from user;

mysql>update user set host='%' where host='localhost'; 

mysql>delete from user where Host='hadoop102'; 
mysql>delete from user where Host='127.0.0.1'; 
mysql>delete from user where Host='::1'; 

mysql>flush privileges; 
mysql>quit;

Configuring the Metastore to use MySQL

Copy the driver: cp /opt/software/mysql-libs/mysql-connector-java-5.1.27/mysql-connector-java-5.1.27-bin.jar /opt/module/hive/lib/

Configure hive-site.xml

First create hive-site.xml

  <?xml version="1.0"?> 
  <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> 
  <configuration> 
      <property> 
          <name>javax.jdo.option.ConnectionURL</name> 
          <value>jdbc:mysql://hadoop102:3306/metastore?createDatabaseIfNotExist=true</value> 
          <description>JDBC connect string for a JDBC metastore</description> 
      </property> 

      <property> 
          <name>javax.jdo.option.ConnectionDriverName</name> 
          <value>com.mysql.jdbc.Driver</value> 
          <description>Driver class name for a JDBC metastore</description> 
      </property> 

      <property> 
          <name>javax.jdo.option.ConnectionUserName</name> 
          <value>root</value> 
          <description>username to use against metastore database</description> 
      </property> 

      <property> 
          <name>javax.jdo.option.ConnectionPassword</name> 
          <value>000000</value> 
          <description>password to use against metastore database</description> 
      </property> 

  </configuration>
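A misspelled property name in this file only surfaces when Hive starts, so a quick grep over the file can catch it earlier. The sketch below assumes the four property names shown above; check_metastore_props is a hypothetical name, and the check only confirms that the names are present, not that the XML is well-formed.

```shell
# Hypothetical helper: report which of the four JDBC metastore properties
# are missing from a hive-site.xml file.
# Argument: $1 = path to hive-site.xml. Returns 0 if all four are present.
check_metastore_props() {
    missing=0
    for name in ConnectionURL ConnectionDriverName ConnectionUserName ConnectionPassword; do
        if ! grep -q "<name>javax.jdo.option.$name</name>" "$1"; then
            echo "missing: javax.jdo.option.$name"
            missing=1
        fi
    done
    return "$missing"
}

# Usage with the path from this article:
# check_metastore_props /opt/module/hive/conf/hive-site.xml && echo "all four properties found"
```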

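After the configuration is done, if Hive fails to start, try rebooting the virtual machine. (After the reboot, remember to start the Hadoop cluster again.)
Open several windows and start Hive in each of them: bin/hive

After Hive has started, go back to the MySQL window and list the databases; a new metastore database has been added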

 mysql> show databases; 
 +--------------------+ 
 | Database           | 
 +--------------------+ 
 | information_schema | 
 | metastore          | 
 | mysql              | 
 | performance_schema | 
 | test               | 
 +--------------------+

Accessing Hive over JDBC

Start the hiveserver2 service: [atguigu@hadoop102 hive]$ bin/hiveserver2

Start beeline

 [atguigu@hadoop102 hive]$ bin/beeline 
 Beeline version 1.2.1 by Apache Hive 
 beeline>

Connect to hiveserver2

 beeline> !connect jdbc:hive2://hadoop102:10000 (press Enter) 
 Connecting to jdbc:hive2://hadoop102:10000 
 Enter username for jdbc:hive2://hadoop102:10000: atguigu (press Enter) 
 Enter password for jdbc:hive2://hadoop102:10000: (just press Enter) 
 Connected to: Apache Hive (version 1.2.1) 
 Driver: Hive JDBC (version 1.2.1) 
 Transaction isolation: TRANSACTION_REPEATABLE_READ 
 0: jdbc:hive2://hadoop102:10000> show databases; 
 +----------------+--+ 
 | database_name  | 
 +----------------+--+ 
 | default        |
 | hive db2       |
 +----------------+--+
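The interactive connection above can also be made in one command with beeline's -u (JDBC URL), -n (user name) and -e (statement) options. This is a minimal sketch that reuses the host, port and user from this article; the wrapper name beeline_query and the BEELINE override variable are hypothetical, and depending on the version beeline may still ask for a password.

```shell
# Hypothetical wrapper: run one statement through beeline without the
# interactive prompts. BEELINE defaults to bin/beeline, so run it from
# /opt/module/hive or point BEELINE at the binary.
# Argument: $1 = statement to execute.
beeline_query() {
    "${BEELINE:-bin/beeline}" -u "jdbc:hive2://hadoop102:10000" -n atguigu -e "$1"
}

# Usage:
# beeline_query "show databases;"
```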

Hive commands

[atguigu@hadoop102 hive]$ bin/hive -help 
usage: hive
-d,--define <key=value>    Variable subsitution to apply to hive commands. e.g. -d A=B or --define A=B
--database <databasename>    Specify the database to use
-e <quoted-query-string>    SQL from command line
-f <filename>    SQL from files
-H,--help    Print help information
--hiveconf <property=value>    Use value for given property
--hivevar <key=value>    Variable subsitution to apply to hive commands. e.g. --hivevar A=B
-i <filename>    Initialization SQL file
-S,--silent    Silent mode in interactive shell
-v,--verbose    Verbose mode (echo executed SQL to the console)

[atguigu@hadoop102 hive]$ bin/hive -e "select id from student;"

[atguigu@hadoop102 hive]$ bin/hive -f /opt/module/datas/hivef.sql

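Browsing the HDFS file system from the Hive CLI: hive> dfs -ls /;

Browsing the local file system from the Hive CLI: hive> ! ls /opt/module/datas;

Viewing all commands previously entered in Hive: go to the current user's home directory (/root or /home/atguigu) and read the .hivehistory file

Common Hive property settings

Hive data warehouse location

The default location of the data warehouse is /user/hive/warehouse on HDFS. No folder is created under the warehouse directory for the default database; a table that belongs to the default database gets its own folder directly under the warehouse directory.

Add the following to hive-site.xml: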

<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
</property>

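Displaying information with query results

Adding the following settings to hive-site.xml makes the CLI show the current database in the prompt and print the column headers of query results.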

<property>
    <name>hive.cli.print.header</name>
    <value>true</value>
</property>

<property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
</property>

Hive runtime log settings

By default Hive writes its log to /tmp/atguigu/hive.log (in a directory named after the current user)

To move the Hive log to /opt/module/hive/logs:

Rename /opt/module/hive/conf/hive-log4j.properties.template to hive-log4j.properties

In hive-log4j.properties, change the log location: hive.log.dir=/opt/module/hive/logs
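The two log steps above can be combined into one command. The sketch below is only an illustration: set_hive_log_dir is a hypothetical name, it assumes the template already contains a line starting with hive.log.dir=, and it writes a new hive-log4j.properties from the template instead of renaming it. The log path must not contain | or &, which are special in the sed expression.

```shell
# Hypothetical helper: create hive-log4j.properties from its template with
# hive.log.dir pointing at a new directory.
# Arguments: $1 = Hive conf directory, $2 = new log directory.
set_hive_log_dir() {
    mkdir -p "$2" &&
    sed "s|^hive.log.dir=.*|hive.log.dir=$2|" \
        "$1/hive-log4j.properties.template" \
        > "$1/hive-log4j.properties"
}

# Usage with the paths from this article:
# set_hive_log_dir /opt/module/hive/conf /opt/module/hive/logs
```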

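Ways to set parameters

View all current settings: hive>set;

There are three ways to set a parameter:

Configuration files

Default configuration file: hive-default.xml

User-defined configuration file: hive-site.xml

Note: user-defined settings override the defaults. Hive also reads the Hadoop configuration, because Hive starts as a Hadoop client, and Hive's settings override Hadoop's. Settings made in configuration files apply to every Hive process started on this machine.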
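
Command-line parameters (outside Hive)

When starting Hive, add -hiveconf param=value on the command line to set a parameter. Note: this applies only to that Hive session

For example: bin/hive -hiveconf mapred.reduce.tasks=10;

Check the setting: hive (default)> set mapred.reduce.tasks;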
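
SET statements (inside Hive)

A parameter can be set in HQL with the SET keyword. Note: this applies only to the current Hive session.

For example: hive (default)> set mapred.reduce.tasks=100;

Check the setting: hive (default)> set mapred.reduce.tasks;

The three methods above have increasing precedence: configuration file < command-line parameter < SET statement. Note that some system-level parameters, such as the log4j settings, must be set with one of the first two methods, because they are read before the session is established.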
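The precedence rule can be pictured with a toy model in which the last non-empty layer wins. This is only an illustration of the ordering, not how Hive resolves settings internally; the function name effective_value is hypothetical.

```shell
# Toy model of the precedence rule: the last non-empty layer wins.
# Arguments: $1 = value from hive-site.xml, $2 = value from -hiveconf,
#            $3 = value from a SET statement (pass "" if a layer is unset).
effective_value() {
    result=""
    for layer in "$1" "$2" "$3"; do
        [ -n "$layer" ] && result="$layer"
    done
    echo "$result"
}

effective_value 1 10 100   # prints 100 (the SET statement wins)
effective_value 1 10 ""    # prints 10  (the -hiveconf value wins)
effective_value 1 "" ""    # prints 1   (only hive-site.xml sets it)
```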

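Integrating the Tez engine

Tez is an execution engine for Hive that performs better than MapReduce (MR).

When Hive runs directly on MR, suppose a query produces four MR jobs that depend on one another. In the accompanying figure, green marks the Reduce Tasks and the cloud shapes mark write barriers, where intermediate results have to be persisted to HDFS.

Tez can turn several dependent jobs into a single job, so HDFS is written only once and there are fewer intermediate nodes, which greatly improves the performance of the job.

Installing Tez

Upload apache-tez-0.9.1-bin.tar.gz to the /tez directory on HDFS
Extract apache-tez-0.9.1-bin.tar.gz locally

Create a tez-site.xml file in Hive's /opt/module/hive/conf directory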

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>tez.lib.uris</name>
        <value>${fs.defaultFS}/tez/apache-tez-0.9.1-bin.tar.gz</value>
    </property>
    <property>
         <name>tez.use.cluster.hadoop-libs</name>
         <value>true</value>
    </property>
    <property>
         <name>tez.history.logging.service.class</name>        
         <value>org.apache.tez.dag.history.logging.ats.ATSHistoryLoggingService</value>
    </property>
</configuration>

Add the Tez environment variables and the dependency jar variables to hive-env.sh

# Set HADOOP_HOME to point to a specific hadoop install directory
export HADOOP_HOME=/opt/module/hadoop-2.7.2

# Hive Configuration Directory can be controlled by:
export HIVE_CONF_DIR=/opt/module/hive/conf

# Folder containing extra libraries required for hive compilation/execution can be controlled by:
export TEZ_HOME=/opt/module/tez-0.9.1    # your Tez extraction directory
export TEZ_JARS=""
for jar in `ls $TEZ_HOME |grep jar`; do
    export TEZ_JARS=$TEZ_JARS:$TEZ_HOME/$jar
done
for jar in `ls $TEZ_HOME/lib`; do
    export TEZ_JARS=$TEZ_JARS:$TEZ_HOME/lib/$jar
done

export HIVE_AUX_JARS_PATH=/opt/module/hadoop-2.7.2/share/hadoop/common/hadoop-lzo-0.4.20.jar$TEZ_JARS
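The two loops above parse the output of ls. An alternative sketch that uses shell globs is shown below; build_tez_jars is a hypothetical name, and unlike the loops above it only picks up files whose names end in .jar.

```shell
# Hypothetical helper: print a colon-prefixed list of the jars in a Tez
# directory and in its lib/ subdirectory, using globs instead of `ls`.
# Argument: $1 = Tez home directory.
build_tez_jars() {
    jars=""
    for jar in "$1"/*.jar "$1"/lib/*.jar; do
        [ -e "$jar" ] || continue   # skip a pattern that matched nothing
        jars="$jars:$jar"
    done
    echo "$jars"
}

# Usage in hive-env.sh:
# export TEZ_JARS=$(build_tez_jars "$TEZ_HOME")
```

The leading colon is kept on purpose, because HIVE_AUX_JARS_PATH above appends $TEZ_JARS directly after the hadoop-lzo jar path.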

Add the following to hive-site.xml to change Hive's execution engine

<property>
    <name>hive.execution.engine</name>
    <value>tez</value>
</property>

Test
- Start Hive: bin/hive
- Create a table
  
  hive (default)> create table student( id int, name string);
- Insert data into the table
  
  hive (default)> insert into student values(1,"zhangsan");
- If no error is reported, the setup works
  
  hive (default)> select * from student;
  
  1 zhangsan

Notes

When running Tez, the process may be killed by the NodeManager because it is detected as using too much memory

Caused by: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown. Application application_1546781144082_0005 failed 2 times due to AM Container for appattempt_1546781144082_0005_000002 exited with exitCode: -103

For more detailed output, check application tracking page:http://hadoop103:8088/cluster/app/application_1546781144082_0005 Then, click on links to logs of each attempt.

Diagnostics: Container [pid=11116,containerID=container_1546781144082_0005_02_000001] is running beyond virtual memory limits. Current usage: 216.3 MB of 1 GB physical memory used; 2.6 GB of 2.1 GB virtual memory used. Killing container.

This happens when a Container running on a worker node tries to use too much memory (here virtual memory: 2.6 GB used against a 2.1 GB limit) and is killed by the NodeManager.

[Excerpt] The NodeManager is killing your container. It sounds like you are trying to use hadoop streaming which is running as a child process of the map-reduce task. The NodeManager monitors the entire process tree of the task and if it eats up more memory than the maximum set in mapreduce.map.memory.mb or mapreduce.reduce.memory.mb respectively, we would expect the Nodemanager to kill the task, otherwise your task is stealing memory belonging to other containers, which you don’t want.

Solution:

Turn off the virtual memory check by setting the following in yarn-site.xml:
yarn.nodemanager.vmem-check-enabled false
After the change, be sure to distribute the file to the other nodes and restart the Hadoop cluster
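In yarn-site.xml the setting is written as a property element, like the other XML files in this article. The helper below simply prints that element so it can be pasted inside the configuration element; print_vmem_check_property is a hypothetical name.

```shell
# Hypothetical helper: print the property element that disables the
# NodeManager virtual memory check, for pasting into yarn-site.xml.
print_vmem_check_property() {
    cat <<'EOF'
<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>
EOF
}

# Usage:
# print_vmem_check_property
```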
