data lake vs. data warehouse
- data warehouse: well maintained, with indexes; written by ETL jobs (e.g. once a day), read-only for business analysts (BA)
- data lake: raw, loosely organized data with little curation
Schema:
- in data warehouse modelling: star schema (or its normalized variant, the snowflake schema), consisting of fact (transaction) tables and dimension (entity) tables
- fact table:
- contains: 1. foreign keys referencing the primary keys of the dimension tables; 2. the associated facts or measures (quantity sold, avg sale price)
- dimension table:
- dimension tables provide descriptive information for the measurements recorded in the fact table
- dimension tables are relatively small compared to the fact table
- commonly used dimensions: people, products, places, and time. ref:
- http://www.dataintegration.ninja/loading-star-schema-dimensions-facts-in-parallel/
- https://stackoverflow.com/questions/20036905/difference-between-fact-table-and-dimension-table
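A minimal star-schema sketch, using SQLite with hypothetical table and column names (not from any real warehouse): one fact table holding foreign keys plus measures, two small dimension tables, and a typical OLAP-style aggregation joining them.

```python
import sqlite3

# Tiny hypothetical star schema: fact_sales references two dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),  -- foreign key
    date_id    INTEGER REFERENCES dim_date(date_id),        -- foreign key
    quantity   INTEGER,  -- measure
    revenue    REAL      -- measure
);
INSERT INTO dim_product VALUES (1, 'widget', 'hardware'), (2, 'ebook', 'media');
INSERT INTO dim_date VALUES (10, '2024-01-01', '2024-01'), (11, '2024-01-02', '2024-01');
INSERT INTO fact_sales VALUES (1, 10, 3, 30.0), (2, 10, 1, 9.9), (1, 11, 2, 20.0);
""")

# Typical OLAP query: join the fact table to a dimension, aggregate a measure.
rows = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f JOIN dim_product p USING (product_id)
    GROUP BY p.category ORDER BY p.category
""").fetchall()
print(rows)  # [('hardware', 50.0), ('media', 9.9)]
```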
why column storage is a better fit
- a fact table often contains hundreds of columns, but a typical OLAP query only needs 4-5 of them
- OLTP engines store each row/document/entity as one contiguous block, which is very inefficient for the OLAP access pattern (touching only ~4-5% of the columns)
- wide-column DBs like Cassandra and HBase are not column-oriented storage
- in the Hadoop ecosystem, Hive (on HDFS) supports column-oriented storage via columnar file formats
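The row-vs-column layout difference can be sketched with plain Python structures (the column names here are illustrative, not from any real engine): a column store keeps each column contiguous, so a query over one column reads only that array instead of scanning every full row.

```python
# Row-oriented layout: each record stored as one unit.
rows = [
    {"id": 1, "qty": 3, "price": 10.0, "region": "EU"},
    {"id": 2, "qty": 1, "price": 9.9,  "region": "US"},
    {"id": 3, "qty": 2, "price": 10.0, "region": "EU"},
]

# Column-oriented layout: each column stored contiguously.
columns = {k: [r[k] for r in rows] for k in rows[0]}

# An analytic query touching one column reads just that one array;
# a row store would have to scan every full row to get the same answer.
total_qty = sum(columns["qty"])
print(total_qty)  # 6
```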
- compression method: see Parquet, a column-oriented file format with built-in encoding and compression
Column Compression
- bitmap encoding: one bitmap per distinct column value (a 1 wherever a row holds that value)
- run-length encoding (RLE): compress the sparse bitmaps further by storing run lengths instead of raw bits
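A minimal sketch of the two encodings above, on a toy column (values are illustrative): build one bitmap per distinct value, then run-length encode a bitmap as (bit, run length) pairs.

```python
column = ["EU", "US", "EU", "EU", "US", "EU"]

# One bitmap per distinct value: 1 where the row holds that value.
bitmaps = {v: [1 if x == v else 0 for x in column] for v in set(column)}

def rle(bits):
    """Run-length encode a bitmap as (bit, run_length) pairs."""
    runs, i = [], 0
    while i < len(bits):
        j = i
        while j < len(bits) and bits[j] == bits[i]:
            j += 1
        runs.append((bits[i], j - i))
        i = j
    return runs

print(bitmaps["EU"])       # [1, 0, 1, 1, 0, 1]
print(rle(bitmaps["EU"]))  # [(1, 1), (0, 1), (1, 2), (0, 1), (1, 1)]
```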
requirement:
- inserting an additional row requires writing to every column file (and rewriting sorted/compressed segments) -> expensive writes -> best suited to read-heavy, write-rarely workloads
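The write cost above can be sketched as follows (a toy in-memory model, not a real storage engine): appending one logical row to a columnar layout must touch every column, whereas a row store writes the new row as one contiguous block.

```python
# Columnar layout: each column is a separate array (in practice, a file/segment).
columns = {"id": [1, 2], "qty": [3, 1], "price": [10.0, 9.9]}

def append_row(cols, row):
    """Append one logical row; every column must be touched."""
    for name in cols:
        cols[name].append(row[name])
    return len(cols)  # number of separate column writes for one row

n_writes = append_row(columns, {"id": 3, "qty": 2, "price": 8.5})
print(n_writes)  # 3 column writes for a single logical row
```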
Summary
- OLTP: a. user-facing: huge number of requests b. each query touches only a small number of records, looked up by key c. the storage engine uses an index to find the requested data d. bottleneck: disk seek time
- OLAP (data warehouse): a. for business analysts, not end users b. far fewer queries, but each one is demanding, scanning millions of records in a short time c. compression via column-oriented storage d. bottleneck: disk bandwidth