Hive10-实战

NiuMT 2020-07-03 20:58:30
Hive

需求描述

统计硅谷影音视频网站的常规指标,各种TopN 指标:

  1. 统计视频观看数Top10
  2. 统计视频类别热度Top10
  3. 统计视频观看数Top20 所属类别以及类别包含的Top20 的视频个数
  4. 统计视频观看数Top50 所关联视频的所属类别Rank
  5. 统计每个类别中的视频热度Top10
  6. 统计每个类别中视频流量Top10
  7. 统计上传视频最多的用户Top10 以及他们上传的观看次数在前20 视频
  8. 统计每个类别视频观看数Top10

数据结构

视频表

字段 备注 详细描述
video id 视频唯一id 11 位字符串
uploader 视频上传者 上传视频的用户名String
age 视频年龄 视频在平台上的整数天
category 视频类别 上传视频指定的视频分类
length 视频长度 整形数字标识的视频长度
views 观看次数 视频被浏览的次数
rate 视频评分 满分5 分
ratings 流量 视频的流量,整型数字
conments 评论数 一个视频的整数评论数
related ids 相关视频id 相关视频的id,最多20 个

用户表

字段 备注 字段类型
uploader 上传者用户名 string
videos 上传视频数 int
friends 朋友数量 int

答案

  1. 统计视频观看数Top10
select videoId, views
from gulivideo_orc
order by views desc
limit 10;

dMH0bHeiRNg     42513417
0XxI-hvPRRA     20282464
1dmVU08zVpA     16087899
RB-wUgnyGv0     15712924
QjA5faZF1A8     15256922
-_CSo1gOd48     13199833
49IDp76kjPw     11970018
tYnn51C3X_w     11823701
pv5zWaTEVkI     11672017
D2kJZOfq7zk     11184051
  1. 统计视频类别热度Top10
select category_name, count(*) category_count
from
    (select videoId, category_name
    from gulivideo_orc lateral view explode(category) tmp_category as category_name) t1
group by category_name
order by category_count desc
limit 10;
  1. 统计视频观看数Top20 所属类别以及类别包含的Top20 的视频个数
select category_name, count(*) category_count
from
    (select videoId, category_name
    from
        (select videoId, views, category
        from gulivideo_orc
        order by views desc
        limit 20) t1 lateral view explode(category) tmp_category as category_name) t2
group by category_name
order by category_count desc;
  1. 统计视频观看数Top50 所关联视频的所属类别Rank
select category_name, count(*) category_count
from 
    (select explode(category) category_name
     from
         (select category
         from
             (select related_id
             from
                 (select relatedId, views
                 from gulivideo_orc 
                 order by views desc
                 limit 50) t1
            lateral view explode(relatedId) tmp_related as related_id
            group by related_id) t2
        join gulivideo_orc orc on t2.related_id=orc.videoId) t3 ) t4
   group by category_name
   order by category_count desc;
  1. 统计每个类别中的视频热度Top10
select video.videoId, video.views
from
    (select uploader, videos
from gulivideo_user_orc
order by videos desc
limit 10) t1
join gulivideo_orc video
on t1.uploader=video.uploader
order by views desc
limit 20;
  1. 统计每个类别中视频流量Top10

  2. 统计上传视频最多的用户Top10 以及他们上传的观看次数在前20 视频

  3. 统计每个类别视频观看数Top10

1.给每一种类别根据视频观看数添加rank值(倒序)
select categoryId, videoId, views, 
rank() over(partition by categoryId order by views desc) rk
from gulivideo_category;

2.过滤前十
select categoryId, videoId, views
from
    (select categoryId, videoId, views, 
             rank() over(partition by categoryId order by views desc) rk
     from  gulivideo_category) t1
where rk<=10;