【数仓从1到n】访问咨询主题看板-365速发国际app-bt365登录-365bet娱乐在线-365速发国际app

项目流程

step1: 需求分析(知道有哪些主题,哪些指标,哪些维度step2: 建模分析(是否要压缩,行/列存储,索引,分区,分桶step3: 建模操作,建库建表step4: 数据采集(MySQL业务数据库 -> Hive(ODS层step5: 数据清洗转换(空值过滤,空值转换,列值裁剪,切割step6: 数据分析操作,结合主题,维度指标完成各种需求step7: 数据导出操作,导出分析后的结果数据到mysql中,之后对接FineBI进行可视化展示

需求分析提出需求

这一步就是根据需求把涉及到的表, 维度, 指标, 以及具体的维度字段, 维度字段提取出来

需求1- 总访问客户量

统计指定时间段内，访问客户的总数量。能够下钻到小时数据。

123456789101112131415维度: 时间维度: 年季度月天小时指标: 总访问客户量涉及到表: web_chat_ems_年_月 (事实表)涉及到字段: 时间维度: create_time 特点: 一个字段中涵盖了多个字段的数据转换操作: 需要将 create_time 转换为 yearinfo, quarterInfo, monthInfo, dayinfo,hourinfo 指标字段: sid、session_id、ip 注意: 先去重在count统计

需求2- 地区独立访客热力图

统计指定时间段内，访问客户中各区域人数热力图。能够下钻到小时数据。

12345678910111213141516涉及到维度: 时间维度: 年季度月天小时区域维度涉及到指标: 访问客户量涉及到表: web_chat_ems_年_月 (事实表)涉及到字段: 时间维度字段: create_time 转换操作: 需要将 create_time 转换为 yearinfo, quarterInfo, monthInfo, dayinfo,hourinfo 区域维度: area, country, province,city 指标字段: sid、session_id、ip 先去重, 后统计

需求3- 访客咨询率趋势

统计指定时间段内，不同地区（省、市）访问的客户中发起咨询的人数占比；

咨询率=发起咨询的人数/访问客户量；客户与网咨有说一句话的称为有效咨询。

1234567891011121314151617181920212223242526272829303132涉及到维度: 时间维度: 年季度月天地区维度: 省、市、区域涉及到指标: 咨询客户量, 访问客户量细点: 当遇到需要计算比率的相关的指标的时候, 我们前期只需要计算其分子和分母即可, 最后比率在APP层计算即可涉及表: web_chat_ems_年_月 (事实表)涉及到字段: 时间维度: create_time 转换操作: 需要将 create_time 转换为 yearinfo, quarterInfo, monthInfo, dayinfo,hourinfo 地区维度: area,province,city 指标维度: 访问客户量: sid、session_id、ip 去重统计咨询客户量: sid、session_id、ip 去重统计判断依据: 必须保证 msg_count >= 1发现: 访问量在上述的SQL中, 统计的更加细化, 已经涵盖了当前这个需求内容, 所以此处的访问量无需在计算了此处只需要算咨询量: 在计算咨询量的时候, 既然访问量算的更加细化了, 建议咨询量也以此为标准, 后续可以应用更多需求

需求4- 客户访问量和访客咨询率双轴趋势

统计指定时间段内，每日客户访问量/咨询率双轴趋势图。能够下钻到小时数据。

12345678910111213141516涉及维度: 时间维度: 年季度月天小时涉及指标: 访问量: 不需要计算了, 直接复用需求一的内容咨询量: 涉及表: web_chat_ems_年_月 (事实表) 涉及字段: 时间维度: create_time 转换操作: 需要将 create_time 转换为 yearinfo, quarterInfo, monthInfo, dayinfo,hourinfo 咨询量指标: sid、session_id、ip 去重统计, 统计的时候, 需要进行判断: 必须保证 msg_count >= 1

需求5- 时间段访问客户量趋势 (不需要关心了, 前序涵盖)

统计指定时间段内，1-24h之间，每个时间段的访问客户量。

横轴：1-24h，间隔为一小时，纵轴：指定时间段内同一小时内的总访问客户量。

需求6- 来源渠道访问量占比

统计指定时间段内，不同来源渠道的访问客户量占比。能够下钻到小时数据。

占比: 总访问量与各来源渠道的占比情况

占比: 各来源渠道中,访问量的占比 (依次为标准)

12345678910111213141516涉及维度: 时间维度: 年季度月天小时来源渠道涉及指标: 访问客户量咨询客户量涉及表: web_chat_ems_年_月 (事实表)涉及字段: 时间维度: create_time 来源渠道: origin_channel 访问量和咨询量指标字段: sid、session_id、ip 访问量和咨询量区别: 咨询量多个一个判断: msg_count >= 1

需求7- 搜索来源访问量占比

统计指定时间段内，不同搜索来源的访问客户量占比。能够下钻到小时数据。

占比: 总访问量与各搜索来源的占比情况 (依次为标准)

占比: 各搜索来源中,访问量的占比

12345678910111213涉及维度: 时间维度: 年季度月天小时搜索来源涉及指标: 访问量涉及表: web_chat_ems_年_月 (事实表)涉及字段: 时间字段: create_time 搜索来源: seo_source 指标字段: sid、session_id、ip

需求8- 活跃页面排行榜

统计指定时间段内，产生访问客户量最多的页面排行榜TOPN。能够下钻到小时数据。

1234567891011121314151617181920212223242526涉及维度: 时间维度: 年季度月天小时受访页面涉及到指标: 访问客户量涉及表: web_chat_text_ems_年_月 (事实表) web_chat_ems(事实表)涉及字段: 时间维度: create_time 受访页面维度: from_url 指标字段: sid、session_id、ip表关联条件: id = id说明: 当发现有些字段无法确定的时候, 此时需要通过查看表的数据结构, 从当中找相关的字段, 如果找不到, 思考怎么获得这个字段以当前这个为例, 发现 web_chat_text_ems 和 web_chat_ems 是一种一对一的关系, 简单说, 本质就是一个表, 所以可以直接使用 web_chat_ems中相关的维度字段来处理维度退化操作: 降维度将两个表合并为一个表

需求汇总

将上述需求设计到的表和字段进行汇总.

1234567891011121314151617181920212223242526272829303132333435363738394041424344访问客户量: 共计有 25个需求涉及到维度: 常驻维度: 每一个需求都需要计算的维度时间维度: 年季度月天小时产品属性维度: 有这个维度, 但是并不是每个需求都有的地区维度来源渠道维度搜索来源维度受访页面总访问量咨询客户量: 共计有 15个需求涉及维度: 常驻维度: 时间维度: 年季度月天小时产品属性维度: 地区维度来源渠道总咨询量涉及表: web_chat_text_ems_年_月 (事实表) wcte web_chat_ems(事实表) wce涉及字段: 时间维度: wce.create_time 地区维度: wce.area , wce.country, wce.province,wce.city 来源渠道维度: wce.origin_channel 搜索来源维度: wce.seo_source 受访页面: wcte.from_url 指标字段: sid、ip、session_id 两个表关联条件: wcte.id = wce.id需要转换内容: 时间维度需要转换: 需要将create_time 转换为年季度月天小时 (拉宽操作)合并降维操作: 将两个表合并为一个表是否需要进行拉链呢? 数据只有新增操作, 所以不需要进行拉链操作同步方式: 仅新增同步

建模分析

建模分析主要的作用: 用于分析当前这个主题需要进行那些分层操作, 每个层次结构中需要有那些表, 每个表中需要有那些字段

ODS层: 数据源层(贴源层)12345678910作用: 对接数据源, 保留最完整的原始数据。用于将数据源中数据完整的拷贝到ODS层, 一般来说数据源中有哪些表, 那么会在ODS层构建有那些表, 与之一一对应字段和原始数据一致，需要加一个分区字段：抽取数据的日期建表: 需要构建两张表: web_chat_ems 和 web_chat_text_ems 每个表的字段有那些: 字段内容与业务库的字段一致 + 分区字段(dt) 是否需要分区操作: 仅新增同步需要构建的, 同步周期为天是否需要分桶操作: 不需要, 这样采集数据更加方便选择什么存储格式: ORC, SNAPPY 选择什么压缩格式: SNAPPY

DWD层: 明细层1234567891011121314151617181920212223242526作用: 提供高质量的数据 1- 清洗操作 2- 转换操作 3- 拉链实现说明: DWD层的表的数量与ODS层表的数量是一致的, 表的字段基本上都是涵盖(除非这些字段以后完全用不上) 表字段不受清洗操作的影响，因为清洗操作只过滤行数据，不影响字段列；受到转换操作的影响，转换出来的新字段，需要加到DWD表中。当前主题, 没有拉链操作, 有清洗和转换操作, 需要将其中一个表中create_time字段转换为年季度月天小时建表: 需要构建 web_chat_ems_dwd和 web_chat_text_ems表表字段: 将所有原有字段全部保留 + 扩展出来新字段 + 分区字段(dt) 是否需要分区表: 需要的, dt 天分区是否需要分桶表: 建分桶表(走SMB优化) 建表: 1- 两个表必须都是分桶表 2- 分桶字段必须是JOIN字段,同时还需要进行排序 3- 桶的数量必须一致选择什么存储格式: ORC 选择什么压缩格式: SNAPPY 思考: 还可以走什么优化手段

DWB层: 中间层123456789101112131415作用: 进行维度退化操作, 形成宽表操作(多表合并, 进行Join) 建表: 需要构建为一张表: web_chat_ems_dwb 此表需要将DWD层两个表合并在一起, 合并后, 所有字段都要全部保留是否需要分区表: 需要的, dt 天分区是否需要分桶表: 1- 是否需要采样(1)指标统计是否是相对指标 2) 数据量是否很庞大, 测试环境无法运行, 此时需要采样) 2- 是否需要提升查询效率(不需要了) 不需要构建分桶表选择什么存储格式: ORC 选择什么压缩格式: SNAPPY 思考: 还可以走什么优化手段: 目前在计算咨询量的时候, 需要 msg_count >=1 此时可以尝试使用索引优化(row group index 索引要求, bloom filter index) 满足了 row group index 索引要求

DWS层: 提前聚合(业务层)1234567891011121314151617181920212223242526272829作用: 提前聚合操作注意：当涉及到去重操作时，不要提前聚合。比如说: 按照年月日统计, 此时可以在DWS层先按照日进行统计, 后续在日统计结果基础上, 统计月和年当下主题的指标统计: 先去重, 然后才能统计一般不能进行提前聚合, 因为一旦聚合后, 导致上卷统计结果不准确注意: 当下主题, 无法进行提前聚合操作, 所以当前主题不需要有DWS层举个例子: 2022年 3月 1号访问量100条, 其中重复的有 30个 2022年 3月 2号访问量 100条, 其中重复的50个注意: 同一个用户有可能在 1号和 2号都访问了先按照天统计: 3月1号访问量为 70个 3月2号访问量为 50个统计3月份访问量: 70 + 50 = 120个实际上, 如果直接对原有数据统计, 共计只有100个, 其中有20 个 1号和2号都来了

DM层:1234567891011121314151617181920212223242526272829303132333435363738394041424344作用: 细化统计分析操作. 基于DWS层结果, 进行上卷统计处理指标字段 + 维度字段 + 两个经验字段访问量: 共计有 25个需求涉及到维度: 固有维度: 每一个需求都需要计算的维度时间维度: 年季度月天小时产品属性维度: 有这个维度, 但是并不是每个需求都有的地区维度来源渠道维度搜索来源维度受访页面总访问量咨询量: 共计有 15个需求涉及维度: 固有维度: 时间维度: 年季度月天小时产品属性维度: 地区维度: 来源渠道总咨询量visit_consult_dm表需要有那些字段呢? 指标字段 + 维度字段 + 两个经验字段visit_sid, visit_session_id, visit_ip, consult_sid, consult_session_id, consult_ip,yearinfo,quarterinfo,monthinfo,dayinfo,hourinfo,area,origin_channel,seo_source,from_url,time_type,group_type 表是否需要构建分区表: 需要的分区字段就是我们的时间字段可以将 yearinfo,quarterinfo,monthinfo,dayinfo 作为分区字段表是否为分桶表: 不需要了选择什么存储格式: ORC选择什么压缩格式: SNAPPY思考可以走什么优化? 可以考虑有可能出现group by的数据倾斜问题, 如何解决

RPT层(ADS,APP层) : 数据应用层1234作用: 对接外部应用, 应用方需要什么数据, 从DM层提取出来, 放置到DM层中对接最终应用: 此层就根据实际需求, 建表存储即可

建模操作

根据上面分析到的数据建库建表.

ODS层表:123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051-- 1- 创建库drop database if exists edu_ods;create database if not exists edu_ods;-- 2- 创建表:create table if not exists edu_ods.web_chat_ems( id INT comment '主键' , create_date_time STRING comment '数据创建时间' , session_id STRING comment '七陌SESsionId' , sid STRING comment '访客id' , create_time STRING comment '会话创建时间' , seo_source STRING comment '搜索来源' , seo_keywords STRING comment '关键字' , ip STRING comment 'IP地址' , AREA STRING comment '地域' , country STRING comment '所在国家' , province STRING comment '省' , city STRING comment '城市' , origin_channel STRING comment '投放渠道' , `user` STRING comment '所属坐席' , manual_time STRING comment '人工开始时间' , begin_time STRING comment '坐席领取时间' , end_time STRING comment '会话结束时间' , last_customer_msg_time_stamp STRING comment '客户最后一条消息的时间', last_agent_msg_time_stamp STRING comment '坐席最后一下回复的时间', reply_msg_count INT comment '客服回复消息数' , msg_count INT comment '客户发送消息数' , browser_name STRING comment '浏览器名称' , os_info STRING comment '系统名称' )partitioned by(dt string)row format delimited fields terminated by '\t'stored as orcTBLPROPERTIES('orc.compress'='SNAPPY' );create table if not exists edu_ods.web_chat_text_ems( id INT comment'主键', referrer string comment'上级来源页面', from_url string comment'会话来源页面', landing_page_url string comment'访客着陆页面', url_title string comment'咨询页面title', platform_description string comment'客户平台信息', other_params string comment'扩展字段中数据', history string comment'历史访问记录')partitioned by(dt string)row format delimited fields terminated by '\t'stored as orcTBLPROPERTIES('orc.compress'='SNAPPY' );

DWD层建表操作12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061-- 1- 创建库drop database if exists edu_dwd;create database if not exists edu_dwd;-- 2) 创建表drop table if exists edu_dwd.web_chat_ems_dwd;create table if not exists edu_dwd.web_chat_ems_dwd( id INT comment '主键' , create_date_time STRING comment '数据创建时间' , session_id STRING comment '七陌SESsionId' , sid STRING comment '访客id' , create_time STRING comment '会话创建时间' , seo_source STRING comment '搜索来源' , seo_keywords STRING comment '关键字' , ip STRING comment 'IP地址' , AREA STRING comment '地域' , country STRING comment '所在国家' , province STRING comment '省' , city STRING comment '城市' , origin_channel STRING comment '投放渠道' , `user` STRING comment '所属坐席' , manual_time STRING comment '人工开始时间' , begin_time STRING comment '坐席领取时间' , end_time STRING comment '会话结束时间' , last_customer_msg_time_stamp STRING comment '客户最后一条消息的时间', last_agent_msg_time_stamp STRING comment '坐席最后一下回复的时间', reply_msg_count INT comment '客服回复消息数' , msg_count INT comment '客户发送消息数' , browser_name STRING comment '浏览器名称' , os_info STRING comment '系统名称' , hourinfo string comment '小时')partitioned by(yearinfo string,quarterinfo string,monthinfo string,dayinfo string)clustered by(id) sorted by (id asc) into 10 bucketsrow format delimited fields terminated by '\t'stored as orcTBLPROPERTIES( 'orc.compress'='SNAPPY' , "orc.bloom.filter.columns"="id");drop table if exists edu_dwd.web_chat_text_ems_dwd;create table if not exists edu_dwd.web_chat_text_ems_dwd( id INT comment'主键', referrer string comment'上级来源页面', from_url string comment'会话来源页面', landing_page_url string comment'访客着陆页面', url_title string comment'咨询页面title', platform_description string comment'客户平台信息', other_params string comment'扩展字段中数据', history string comment'历史访问记录')partitioned by(dt string)clustered by(id) sorted by (id asc) into 10 bucketsrow format delimited fields terminated by '\t'stored as orcTBLPROPERTIES( 'orc.compress'='SNAPPY', "orc.bloom.filter.columns"="id");

DWB层:12345678910111213141516171819202122232425262728293031323334353637383940414243444546-- 创建库drop database if exists edu_dwb;create database if not exists edu_dwb;-- 建表: create table if not exists edu_dwb.web_chat_ems_dwb( id INT comment '主键' , create_date_time STRING comment '数据创建时间' , session_id STRING comment '七陌SESsionId' , sid STRING comment '访客id' , create_time STRING comment '会话创建时间' , seo_source STRING comment '搜索来源' , seo_keywords STRING comment '关键字' , ip STRING comment 'IP地址' , AREA STRING comment '地域' , country STRING comment '所在国家' , province STRING comment '省' , city STRING comment '城市' , origin_channel STRING comment '投放渠道' , `user` STRING comment '所属坐席' , manual_time STRING comment '人工开始时间' , begin_time STRING comment '坐席领取时间' , end_time STRING comment '会话结束时间' , last_customer_msg_time_stamp STRING comment '客户最后一条消息的时间', last_agent_msg_time_stamp STRING comment '坐席最后一下回复的时间', reply_msg_count INT comment '客服回复消息数' , msg_count INT comment '客户发送消息数' , browser_name STRING comment '浏览器名称' , os_info STRING comment '系统名称' , hourinfo string comment '小时', referrer string comment'上级来源页面', from_url string comment'会话来源页面', landing_page_url string comment'访客着陆页面', url_title string comment'咨询页面title', platform_description string comment'客户平台信息', other_params string comment'扩展字段中数据', history string comment'历史访问记录')partitioned by(yearinfo string,quarterinfo string,monthinfo string,dayinfo string)row format delimited fields terminated by '\t'stored as orcTBLPROPERTIES( 'orc.compress'='SNAPPY' , 'orc.create.index'='true');

DM层:12345678910111213141516171819202122232425-- 1- 创建库drop database if exists edu_dm;create database if not exists edu_dm;create table if not exists edu_dm.visit_consult_dm( visit_sid bigint comment 'sid访问量', visit_session_id bigint comment 'session_id访问量', visit_ip bigint comment 'ip访问量', consult_sid bigint comment 'sid咨询量', consult_session_id bigint comment 'session_id咨询量', consult_ip bigint comment 'ip咨询量', hourinfo string comment '小时', area string comment '地区维度', origin_channel string comment '来源渠道维度', seo_source string comment '搜索来源维度', from_url string comment '受访url', time_type string comment '时间标记: 1 小时 2 天 3月 4 季度 5 年', group_type string comment '产品属性维度标记: 1 地区 2 来源渠道 3 搜索来源 4 受访url 5总访问量')partitioned by(yearinfo string,quarterinfo string,monthinfo string,dayinfo string)row format delimited fields terminated by '\t'stored as orcTBLPROPERTIES( 'orc.compress'='SNAPPY');

RPT建表:1暂时不构建, 后续根据应用方需要, 动态建表即可

数据采集

Sqoop命令,个人感觉都写麻了.

数据清洗转换数据分析操作数据导出操作

【数仓从1到n】访问咨询主题看板

相关文章

24小时上门取件的快递服务有哪些？

手机换壳换屏怎么选？从技术到售后，这份攻略助你找到靠谱之选

2024昆明小人国主题乐园介绍+地址+门票价格(附免票政策)

友情链接