Hadoop 2.6 + Hive 1.2.1 + spark-1.4.1(3)

Published 2016-01-03 06:48:21 | Source: PHPERZ


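Hadoop distributed system

A distributed-system infrastructure developed by the Apache Foundation. Users can write distributed programs without knowing the low-level details of distribution, and make full use of a cluster's power for high-speed computation and storage.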

1. Creating tables

1) Creating a table structure

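create table user_table(
id int,
userid bigint,
name string,
`describe` string comment 'description of the user' -- backticks are needed because DESCRIBE is a reserved word in Hive 1.2
)
comment 'User information table'
partitioned by(country string, city string) -- create partitions; a partition is simply a directory
clustered by (id) sorted by (userid) into 32 buckets
-- rows are assigned to a bucket by hashing id, and within each bucket they are sorted by userid
-- bucketing makes it easier to load just the useful data into limited memory (a performance optimization that also benefits join, group by and distinct)
row format delimited -- parse the data using the delimiters below
fields terminated by '\001' -- delimiter between fields
collection items terminated by '\002' -- delimiter between the elements of an array field
map keys terminated by '\003' -- delimiter between a key and its value inside a map field
-- the delimiters are only used to parse the data: Hive does not transform the raw data that is loaded
-- storage format (rcfile / textfile / sequencefile); for raw data, textfile is sufficient
stored as textfile;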
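The statement above declares delimiters for array and map fields, although the table itself has no such columns. As a minimal sketch, assuming a hypothetical table user_tags (its name, its columns and the 'city' key are illustrative and not from the original), the same row format applied to collection columns could look like this:

```sql
-- Hypothetical table: user_tags and its columns are illustrative only.
create table user_tags(
  id int,
  tags array<string>,
  props map<string,string>
)
row format delimited
  fields terminated by '\001'
  collection items terminated by '\002'
  map keys terminated by '\003'
stored as textfile;

-- A raw line would be laid out as:
--   id ^A tag1 ^B tag2 ^A key1 ^C value1 ^B key2 ^C value2
-- where ^A, ^B and ^C stand for the bytes \001, \002 and \003.

-- Index into the array and look up a map key.
select id, tags[0], props['city'] from user_tags;
```

Note that the entries of a map are separated by the collection-items delimiter, while the map-keys delimiter only separates a key from its value.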

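Summary:

Compared with textfile and SequenceFile, rcfile stores data by column, so loading data costs more, but it achieves a better compression ratio and faster query response. A data warehouse is typically written once and read many times, so overall rcfile has a clear advantage over the other two formats.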
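As a sketch of how this trade-off is commonly handled, raw text can first be loaded into a textfile table and then rewritten into an rcfile table with a query. The table names raw_events and events_rc below are hypothetical:

```sql
-- Hypothetical staging table holding the raw text files.
create table raw_events(id int, name string)
row format delimited fields terminated by '\t'
stored as textfile;

-- Same columns, stored as rcfile.
create table events_rc(id int, name string)
stored as rcfile;

-- load data only moves files and does not convert them,
-- so the conversion to rcfile goes through an insert from a select.
insert overwrite table events_rc
select id, name from raw_events;
```

The extra cost at load time is paid once in the insert, while later queries read the more compact rcfile data.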

a) Table: managed (internal) table (keywords are case-insensitive)

Create:

create table t1(id string);

create table t2(id string, name string) row format delimited fields terminated by '\t';

Load:

load data local inpath '/root/Downloads/seq100w.txt' into table t1;

load data inpath '/seq100w.txt' into table t1; (the file is moved within HDFS into the /hive/t1 directory)

(For the same reason, moving a file in HDFS directly into the directory that belongs to the table also makes the data readable.)

load data local inpath '/root/Downloads/seq100w.txt' overwrite into table t1;
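To check where a managed table keeps its files, the following sketch can be run from the Hive CLI. The /hive/t1 path is taken from the note above and depends on how the warehouse directory is configured in your installation, so treat it as an assumption:

```sql
-- The Location field of the output shows the HDFS directory of the table.
describe formatted t1;

-- List the files in that directory from inside the Hive CLI
-- (assumed path, adjust it to the Location reported above).
dfs -ls /hive/t1;

-- A plain load appends files to the directory, a load with overwrite replaces them,
-- so the row count after an overwrite reflects only the last load.
select count(*) from t1;
```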

b) Partition: partitioned table

Create:

create table t3(id string) partitioned by (province string);

Load:

load data local inpath '/root/Downloads/seq100w.txt' into table t3 partition(province ='beijing');

List all partitions of a table:

hive> show partitions table_name;
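Each partition is stored as a subdirectory named column=value, and filtering on the partition column lets Hive read only that subdirectory. A minimal sketch using the t3 table from above (the 'shanghai' value and the /hive/t3 path are assumptions for illustration):

```sql
-- Add a second, empty partition next to province=beijing (illustrative value).
alter table t3 add partition (province = 'shanghai');

show partitions t3;

-- Only the province=beijing directory needs to be scanned.
select count(*) from t3 where province = 'beijing';

-- Assumed warehouse path, with one subdirectory per partition.
dfs -ls /hive/t3;
```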

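c) Bucket Table: bucketed table

Create: create table t4(id string) clustered by (id) into 4 buckets; -- bucket by id

create table t4(id string) clustered by (id) sorted by (id asc) into 4 buckets; -- also sorts the rows inside each bucket in ascending order, so joining two buckets becomes an efficient merge-sort, which can further speed up map-side joins

Enforce bucketing on insert: set hive.enforce.bucketing = true;

Load: insert into table t4 select id from t3 where province='beijing';

Overwrite: insert overwrite table bucket_table select name from stu;

Sampling queries: select * from bucket_table tablesample(bucket 1 out of 4 on id); -- returns the rows whose hash of id falls into bucket 1 of 4; for a table bucketed on id into 4 buckets, this is exactly one bucket

select * from bucket_table tablesample(bucket 1 out of 2 on id); -- returns the rows that fall into bucket 1 of 2, about half of the data (two of the four buckets if the table has 4 buckets)

select * from bucket_table tablesample(bucket 1 out of 4 on rand()); -- samples on a random value instead of the bucketing column: returns roughly a quarter of the rows chosen at random, and has to scan the whole table

When rows are inserted into a bucketed table, Hive hashes the bucketing column and takes the result modulo the number of buckets, then writes each row to the corresponding file. Each bucket therefore holds an effectively random subset of the users.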
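The hash-and-modulo rule can be approximated with Hive's built-in hash and pmod functions. This is only an illustrative sketch: it assumes the 4-bucket table t4 and the source table t3 from above, and the value it computes is a 0-based bucket index that is not guaranteed to match Hive's internal assignment in every version:

```sql
-- For each id, estimate which of the 4 buckets it would fall into.
select id, pmod(hash(id), 4) as bucket_idx
from t3
where province = 'beijing'
limit 10;

-- Count how evenly the ids spread over the buckets.
select pmod(hash(id), 4) as bucket_idx, count(*) as cnt
from t3
where province = 'beijing'
group by pmod(hash(id), 4);
```

If the counts are strongly skewed, a different bucketing column may be a better choice.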

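d) External Table: external table

(t5 does not have to live in the Hive warehouse; you can choose the storage location yourself, here the /wlan directory)

Create: create external table t5(id string) location '/wlan'; -- /wlan is a directory

The EXTERNAL keyword creates an external table. The data is managed outside Hive; Hive only manages the metadata (that is, the table structure). The data is therefore not moved into the Hive warehouse directory but kept at the external location. When you run drop table table_name, the metadata (table structure) is deleted, but the data stays at the external location and is not deleted by Hive.

hive> create external table t1(id string) row format delimited fields terminated by '\t' location '/wlan';

Specifying the delimiter ('\t' here is the separator used in the data) lets Hive parse the files correctly, so that queries do not return NULL. /wlan is a directory; it is best to name it after the table you create, which makes it easier to inspect and manage.

create external table hadoop_1(id int,name string) row format delimited fields terminated by '\t' location '/wenjianjia';

load data inpath '/wenjianjia/hello' into table hadoop_1 ;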
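The effect of dropping an external table can be checked directly. A minimal sketch, reusing the hadoop_1 table and the /wenjianjia directory from above:

```sql
-- Dropping the external table removes only the metadata.
drop table hadoop_1;

-- The files are still present in HDFS.
dfs -ls /wenjianjia;

-- Re-creating the table over the same directory makes the data queryable again.
create external table hadoop_1(id int, name string)
row format delimited fields terminated by '\t'
location '/wenjianjia';

select * from hadoop_1 limit 5;
```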

2) Copying an existing table structure

-- create new_table with the same structure as user_table

create table new_table like user_table;

3) Renaming a table

hive> alter table new_table rename to new_table_1;

4) Creating a partitioned table

Create:

create table t3(id string) partitioned by (province string);

Load:

load data local inpath '/root/Downloads/seq100w.txt' into table t3 partition(province ='beijing');

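2. Deleting tables

1) Clearing the data in a table

hadoop fs -rmr /… (simply delete the files that the table stores in HDFS; in Hadoop 2, -rmr is deprecated in favor of -rm -r)

(Be careful not to delete the table's own directory in HDFS along with the data.)

2) Dropping a table

drop table test1;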
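Hive also has statements for both cases that avoid touching HDFS by hand. As a sketch (test1 is the table name used above; truncate table is a different technique from the hadoop fs command shown earlier and applies to managed tables only):

```sql
-- Remove all rows but keep the table definition (managed tables only).
truncate table test1;

-- Remove the table itself, without an error if it does not exist.
drop table if exists test1;
```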

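3) Dropping a table partition (removes the partition and the data in it)

hive> alter table dm_newuser_active_month drop partition (batch_date="201404");

When dropping a partition, the batch_date value must be enclosed in quotation marks.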
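A slightly more defensive variant of the same operation, sketched with the table and partition value from the command above, checks the partition list before and after and tolerates a partition that is already gone:

```sql
-- List the partitions before the change.
show partitions dm_newuser_active_month;

-- if exists avoids an error when the partition has already been dropped.
alter table dm_newuser_active_month drop if exists partition (batch_date = '201404');

-- The dropped partition should no longer be listed.
show partitions dm_newuser_active_month;
```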

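3. Modifying table information

1) Adding a column to a table

hive> alter table test1 add columns(name string);

2) Changing one column of a table

Note: change replaces the column being modified; it alters the table schema, not the data.

alter table table_name change old_column_name new_column_name new_type comment 'remark';

3) Replacing all columns of a table

Note: replace replaces all existing columns of the table; it alters the table schema, not the data.

alter table table_name replace columns(age int comment 'only keep the first column');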
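The two templates above can be tried on a throwaway table. In this sketch the table demo_cols and its columns are hypothetical and exist only for illustration:

```sql
-- Hypothetical table used only for this illustration.
create table demo_cols(id int, name string, age int);

-- Rename name to user_name, keep the string type, attach a comment.
alter table demo_cols change name user_name string comment 'user name';

-- Replace the whole column list, leaving a single column.
alter table demo_cols replace columns (id int comment 'only keep the first column');

-- Shows the new schema. The data files on disk were not rewritten.
desc demo_cols;
```

Because only the schema changes, existing files are simply read against the new column list, so the new types should stay compatible with the stored data.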

4) Adding a table partition

hive> alter table ods_smail_mx_201404 add partition (day=20140401); -- add a single partition

A partition is also created when data is written into it:

create table user_table_2(
id int,
name string
)
comment 'User information table'
partitioned by(dt string)
stored as textfile;

insert overwrite table user_table_2
partition(dt='2015-11-01')
select id, col2 name
from table_4;

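4. Viewing tables

1) Viewing the create-table statement

show create table tmp_jzl_20150310_diff;

2) Viewing the table structure

desc tmp_jzl_20150310_diff;

3) Viewing table partitions

show partitions tmp_jzl_20150310_diff;

4) Viewing table names in a database

hive> use tmp;

List all tables in the tmp database:

hive> show tables;

List the tables in the tmp database whose names start with tmp_jzl_20150504:

hive> show tables 'tmp_jzl_20150504*';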

tmp_jzl_20150504_1

tmp_jzl_20150504_2

tmp_jzl_20150504_3

tmp_jzl_20150504_4
