Slowly changing dimensions - SCD1 and SCD2 implementation in Hive

I am looking for an SCD1 and SCD2 implementation in Hive (1.2.1). I am aware of the workaround for loading SCD1 and SCD2 tables that was used prior to Hive 0.14. Here is the link describing how to load SCD1 and SCD2 with the workaround approach: http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive/

Now that Hive supports ACID operations, I just want to know whether there is a better or more direct way to load them.

Author: Lijju Mathew | 2016-05-26


Since HDFS is immutable storage, it could be argued that versioning data and keeping history (SCD2) should be the default behaviour for loading dimensions. You can create a view in your Hadoop SQL query engine (Hive, Impala, Drill, etc.) that retrieves the current state, i.e. the latest value per key, using window functions. You can find out more about dimensional models on Hadoop in my blog post, for example how to deal with a large dimension and a fact table.
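To make the view idea concrete, here is a minimal sketch. It runs against an in-memory SQLite database (version 3.25 or later, for window function support) purely so that it can be executed anywhere; the `ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...)` pattern is the part that carries over to Hive, Impala or Drill. The table `dim_customer_history`, the view `dim_customer_current` and their columns are hypothetical names chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Append-only history: every load adds a new version of a row.
CREATE TABLE dim_customer_history (
    customer_id INTEGER,
    city        TEXT,
    loaded_on   TEXT   -- ISO date of the load that produced this version
);
INSERT INTO dim_customer_history VALUES
    (1, 'Berlin',  '2016-05-01'),
    (1, 'Hamburg', '2016-05-20'),
    (2, 'Munich',  '2016-05-01');

-- The view ranks the versions of each key, newest first,
-- and keeps only the top-ranked one.
CREATE VIEW dim_customer_current AS
SELECT customer_id, city, loaded_on
FROM (
    SELECT h.*,
           ROW_NUMBER() OVER (
               PARTITION BY customer_id
               ORDER BY loaded_on DESC
           ) AS rn
    FROM dim_customer_history h
) ranked
WHERE rn = 1;
""")

current_rows = conn.execute(
    "SELECT customer_id, city FROM dim_customer_current ORDER BY customer_id"
).fetchall()
print(current_rows)
```

With this layout the load job only ever appends rows; readers who need the current state query the view instead of the history table.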

Author: Uli Bethke

Well, I work around it by using two temporary tables:

drop table if exists administrator_tmp1;
drop table if exists administrator_tmp2;

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

--review_administrator
CREATE TABLE if not exists review_administrator(
    admin_id bigint ,
    admin_name string,
    create_time string,
    email string ,
    password string,
    status_description string,
    token string ,
    expire_time string ,
    granter_user_id bigint ,
    admin_time string ,
    effect_start_date string ,
    effect_end_date string 
)
partitioned by (current_row_indicator string comment 'current, expired')
stored as parquet;

--tmp1 is used for saving origin data
CREATE TABLE if not exists administrator_tmp1(
    admin_id bigint ,
    admin_name string,
    create_time string,
    email string ,
    password string ,
    status_description string ,
    token string ,
    expire_time string ,
    granter_user_id bigint ,
    admin_time string ,
    effect_start_date string ,
    effect_end_date string 
)
partitioned by (current_row_indicator string comment 'current, expired')
stored as parquet;

--tmp2 saving the scd data
CREATE TABLE if not exists administrator_tmp2(
    admin_id bigint ,
    admin_name string,
    create_time string,
    email string ,
    password string ,
    status_description string ,
    token string ,
    expire_time string ,
    granter_user_id bigint ,
    admin_time string ,
    effect_start_date string ,
    effect_end_date string 
)
partitioned by (current_row_indicator string comment 'current, expired')
stored as parquet;

--insert origin data into tmp1
INSERT OVERWRITE TABLE administrator_tmp1 PARTITION(current_row_indicator)
SELECT 
    user_id as admin_id,
    name as admin_name,
    time as create_time,
    email as email,
    password as password,
    status as status_description,
    token as token,
    expire_time as expire_time,
    admin_id as granter_user_id,
    admin_time as admin_time,
    '{{ ds }}' as effect_start_date,
    '9999-12-31' as effect_end_date,
    'current' as current_row_indicator
FROM 
    ks_db_origin.gifshow_administrator_origin
;

--insert scd data into tmp2
--for the data unchanged
INSERT INTO TABLE administrator_tmp2 PARTITION(current_row_indicator)
SELECT
    t2.admin_id,
    t2.admin_name,
    t2.create_time,
    t2.email,
    t2.password,
    t2.status_description,
    t2.token,
    t2.expire_time,
    t2.granter_user_id,
    t2.admin_time,
    t2.effect_start_date,
    t2.effect_end_date as effect_end_date,
    t2.current_row_indicator
FROM
    administrator_tmp1 t1
INNER JOIN 
    (
        SELECT * FROM review_administrator 
        WHERE current_row_indicator = 'current'
    ) t2
ON 
    t1.admin_id = t2.admin_id
AND t1.admin_name = t2.admin_name
AND t1.create_time = t2.create_time
AND t1.email = t2.email
AND t1.password = t2.password
AND t1.status_description = t2.status_description
AND t1.token = t2.token
AND t1.expire_time = t2.expire_time
AND t1.granter_user_id = t2.granter_user_id
AND t1.admin_time = t2.admin_time
;

--for the data changed , update the effect_end_date
INSERT INTO TABLE administrator_tmp2 PARTITION(current_row_indicator)
SELECT
    t2.admin_id,
    t2.admin_name,
    t2.create_time,
    t2.email,
    t2.password,
    t2.status_description,
    t2.token,
    t2.expire_time,
    t2.granter_user_id,
    t2.admin_time,
    t2.effect_start_date as effect_start_date,
    '{{ yesterday_ds }}' as effect_end_date,
    'expired' as current_row_indicator
FROM
    administrator_tmp1 t1
INNER JOIN 
    (
        SELECT * FROM review_administrator 
        WHERE current_row_indicator = 'current'
    ) t2
ON 
    t1.admin_id = t2.admin_id
WHERE NOT 
    (
        t1.admin_name = t2.admin_name
    AND t1.create_time = t2.create_time
    AND t1.email = t2.email
    AND t1.password = t2.password
    AND t1.status_description = t2.status_description
    AND t1.token = t2.token
    AND t1.expire_time = t2.expire_time
    AND t1.granter_user_id = t2.granter_user_id
    AND t1.admin_time = t2.admin_time
    )
;

--for the changed data and the new data
INSERT INTO TABLE administrator_tmp2 PARTITION(current_row_indicator)
SELECT
    t1.admin_id,
    t1.admin_name,
    t1.create_time,
    t1.email,
    t1.password,
    t1.status_description,
    t1.token,
    t1.expire_time,
    t1.granter_user_id,
    t1.admin_time,
    t1.effect_start_date,
    t1.effect_end_date,
    t1.current_row_indicator
FROM
    administrator_tmp1 t1
LEFT OUTER JOIN 
    (
        SELECT * FROM review_administrator 
        WHERE current_row_indicator = 'current'
    ) t2
ON 
    t1.admin_id = t2.admin_id
AND t1.admin_name = t2.admin_name
AND t1.create_time = t2.create_time
AND t1.email = t2.email
AND t1.password = t2.password
AND t1.status_description = t2.status_description
AND t1.token = t2.token
AND t1.expire_time = t2.expire_time
AND t1.granter_user_id = t2.granter_user_id
AND t1.admin_time = t2.admin_time
WHERE t2.admin_id IS NULL
;

--for the data already marked by 'expired'
INSERT INTO TABLE administrator_tmp2 PARTITION(current_row_indicator)
SELECT
    t1.admin_id,
    t1.admin_name,
    t1.create_time,
    t1.email,
    t1.password,
    t1.status_description,
    t1.token,
    t1.expire_time,
    t1.granter_user_id,
    t1.admin_time,
    t1.effect_start_date,
    t1.effect_end_date,
    t1.current_row_indicator
FROM
    review_administrator t1
WHERE t1.current_row_indicator = 'expired'
;

--populate the dim table
INSERT OVERWRITE TABLE review_administrator PARTITION(current_row_indicator)
SELECT
    t1.admin_id,
    t1.admin_name,
    t1.create_time,
    t1.email,
    t1.password,
    t1.status_description,
    t1.token,
    t1.expire_time,
    t1.granter_user_id,
    t1.admin_time,
    t1.effect_start_date,
    t1.effect_end_date,
    t1.current_row_indicator
FROM
    administrator_tmp2 t1
;

--drop the two temp tables
drop table administrator_tmp1;
drop table administrator_tmp2;


-- --example data
-- --2017-01-01
-- insert into table review_administrator PARTITION(current_row_indicator)
-- SELECT '1','a','2016-12-31','[email protected]','password','open','token1','2017-12-31',
-- 0,'2017-12-31','2017-01-01','9999-12-31','current' 
-- FROM default.sample_07 limit 1;

-- --2017-01-02
-- insert into table administrator_tmp1 PARTITION(current_row_indicator)
-- SELECT '1','a','2016-12-31','[email protected]','password','open','token1','2017-12-31',
-- 0,'2017-12-31','2017-01-02','9999-12-31','current' 
-- FROM default.sample_07 limit 1;

-- insert into table administrator_tmp1 PARTITION(current_row_indicator)
-- SELECT '2','b','2016-12-31','[email protected]','password','open','token1','2017-12-31',
-- 0,'2017-12-31','2017-01-02','9999-12-31','current' 
-- FROM default.sample_07 limit 1;

-- --2017-01-03
-- --id 1 is changed
-- insert into table administrator_tmp1 PARTITION(current_row_indicator)
-- SELECT '1','a','2016-12-31','[email protected]','password','open','token1','2017-12-31',
-- 0,'2017-12-31','2017-01-03','9999-12-31','current' 
-- FROM default.sample_07 limit 1;
-- --id 2 is not changed at all
-- insert into table administrator_tmp1 PARTITION(current_row_indicator)
-- SELECT '2','b','2016-12-31','[email protected]','password','open','token1','2017-12-31',
-- 0,'2017-12-31','2017-01-03','9999-12-31','current' 
-- FROM default.sample_07 limit 1;
-- --id 3 is a new record
-- insert into table administrator_tmp1 PARTITION(current_row_indicator)
-- SELECT '3','c','2016-12-31','[email protected]','password','open','token1','2017-12-31',
-- 0,'2017-12-31','2017-01-03','9999-12-31','current' 
-- FROM default.sample_07 limit 1;

-- --now dim table will show you the right SCD.

While this code snippet may solve the question, including an explanation really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion.
Definitely yes! When capturing slowly changing data, there are essentially four parts:
1. Comparing the brand-new OLTP data with the "current" data in our dimension table using an inner join on equality of all attributes gives you the rows that are exactly the same (unchanged).
2. Comparing the brand-new OLTP data with the "current" data in our dimension table using an inner join on the id column, with inequality on at least one of the other attributes, gives you the rows that have changed; step 2 is where you deal with these rows by closing their effect_end_date.
3. The left outer join gives you the rows that differ in the new OLTP data (both the updated and the new ones); step 3 is where you insert these rows as "current".
4. Of course, you do not need to touch the "expired" data in the dimension table.
The explanation should be in the answer, not in comments.
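The four parts described above can also be sketched without Hive. The following is a minimal in-memory illustration, not the script from the answer: the function name `apply_scd2`, the row layout (`id`, `attrs`, `start`, `end`, `indicator`) and the sample data are assumptions made for the example. Like the script, it treats the source as a full snapshot and closes a changed row on the day before the load.

```python
from datetime import date, timedelta

OPEN_END = "9999-12-31"  # same open-ended marker as in the script above


def apply_scd2(dimension, snapshot, load_date):
    """Return the dimension rows after applying a full source snapshot.

    dimension: list of dicts with keys id, attrs, start, end, indicator
    snapshot:  dict mapping id -> attrs (tuple of the tracked attributes)
    load_date: datetime.date of this load
    """
    yesterday = (load_date - timedelta(days=1)).isoformat()
    current = {r["id"]: r for r in dimension if r["indicator"] == "current"}

    # Part 4: rows that are already expired are carried over untouched.
    result = [r for r in dimension if r["indicator"] == "expired"]

    # Note: a current row whose id is missing from the snapshot is not
    # carried over; this sketch does not handle source deletions.
    for key, attrs in snapshot.items():
        old = current.get(key)
        if old is not None and old["attrs"] == attrs:
            # Part 1: identical to the current version, keep it as it is.
            result.append(old)
            continue
        if old is not None:
            # Part 2: the current version changed, so close it.
            result.append({**old, "end": yesterday, "indicator": "expired"})
        # Part 3: changed and brand-new rows become the open version.
        result.append({
            "id": key,
            "attrs": attrs,
            "start": load_date.isoformat(),
            "end": OPEN_END,
            "indicator": "current",
        })
    return result


dim = [
    {"id": 1, "attrs": ("a", "open"), "start": "2017-01-02",
     "end": OPEN_END, "indicator": "current"},
    {"id": 2, "attrs": ("b", "open"), "start": "2017-01-02",
     "end": OPEN_END, "indicator": "current"},
]
# id 1 changed, id 2 is unchanged, id 3 is new.
snapshot = {1: ("a", "closed"), 2: ("b", "open"), 3: ("c", "open")}
new_dim = apply_scd2(dim, snapshot, date(2017, 1, 3))
for row in new_dim:
    print(row)
```

Comparing whole attribute tuples in Python sidesteps one thing the SQL version has to watch: in SQL, `=` between two NULL values is not true, so nullable columns need NULL-safe comparisons to avoid being reported as changed on every load.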

Author: ccclyt


Here is a detailed implementation of slowly changing dimension type 2 in Hive using the exclusive join approach.

It assumes that the source sends a complete data file, i.e. old, updated and new records.
Steps:
1. Load the latest file data into the STG table.
2. Select all expired records from the HIST table:
  
  select * from HIST_TAB where exp_dt != '2099-12-31'
3. Select all records that have not changed from STG and HIST, using an inner join and a filter on HIST.column = STG.column, as below:
  
  select hist.* from HIST_TAB hist inner join STG_TAB stg on hist.key = stg.key where hist.column = stg.column
4. Select all new and updated records from STG_TAB using an exclusive left join with HIST_TAB, and set the effective and expiry dates, as below:
  
  select stg.*, '<yyyy-MM-dd>' as eff_dt, '2099-12-31' as exp_dt from STG_TAB stg left join (select * from HIST_TAB where exp_dt = '2099-12-31') hist on hist.key = stg.key where hist.key is null or hist.column != stg.column
5. Select all updated old records from the HIST table using an exclusive left join with the STG table, and set their expiry date, as shown below:
  
  select hist.<all columns except exp_dt>, '<yyyy-MM-dd>' as exp_dt from (select * from HIST_TAB where exp_dt = '2099-12-31') hist left join STG_TAB stg on hist.key = stg.key where stg.key is null or hist.column != stg.column
6. Union all the queries from steps 2-5 and insert overwrite the result into the HIST table.

A more detailed implementation of SCD type 2 can be found here:

https://github.com/sahilbhange/slowly-changing-dimension
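The steps above can be tried end to end with a small, runnable sketch. It uses an in-memory SQLite database as a stand-in for Hive, so the final "insert overwrite" is replaced by simply reading the unioned result. The tables `hist_tab` and `stg_tab` with a single tracked column `col`, the load date and the sample rows are hypothetical. Two details are assumptions about the intent of the steps rather than something the answer states: step 3 is restricted to open rows so that an expired row is not selected twice, and step 5 also closes rows whose key is missing from staging.

```python
import sqlite3

LOAD_DATE = "2017-01-03"  # hypothetical date of this load
OPEN_END = "2099-12-31"   # open-ended expiry marker used in the steps

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hist_tab (key INTEGER, col TEXT, eff_dt TEXT, exp_dt TEXT);
CREATE TABLE stg_tab  (key INTEGER, col TEXT);
INSERT INTO hist_tab VALUES
    (1, 'old', '2017-01-01', '2017-01-01'),
    (1, 'a',   '2017-01-02', '2099-12-31'),
    (2, 'b',   '2017-01-02', '2099-12-31');
-- Full snapshot: key 1 updated, key 2 unchanged, key 3 new.
INSERT INTO stg_tab VALUES (1, 'a2'), (2, 'b'), (3, 'c');
""")

new_hist = conn.execute("""
-- step 2: rows that were already expired
SELECT key, col, eff_dt, exp_dt FROM hist_tab WHERE exp_dt != :open_end
UNION ALL
-- step 3: open rows that did not change
SELECT h.key, h.col, h.eff_dt, h.exp_dt
FROM hist_tab h JOIN stg_tab s ON h.key = s.key
WHERE h.exp_dt = :open_end AND h.col = s.col
UNION ALL
-- step 4: new and updated staging rows become the open versions
SELECT s.key, s.col, :load_date, :open_end
FROM stg_tab s
LEFT JOIN (SELECT * FROM hist_tab WHERE exp_dt = :open_end) h
       ON h.key = s.key
WHERE h.key IS NULL OR h.col != s.col
UNION ALL
-- step 5: previous versions of updated (or missing) rows are closed
SELECT h.key, h.col, h.eff_dt, :load_date
FROM (SELECT * FROM hist_tab WHERE exp_dt = :open_end) h
LEFT JOIN stg_tab s ON h.key = s.key
WHERE s.key IS NULL OR h.col != s.col
ORDER BY 1, 3
""", {"load_date": LOAD_DATE, "open_end": OPEN_END}).fetchall()

for row in new_hist:
    print(row)
```

Because every existing row of the history table ends up in exactly one of the four branches, the unioned result can replace the table wholesale, which is why the approach works on Hive versions without row-level updates.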

Author: SAHIL BHANGE

drop table if exists harsha.emp;

drop table if exists harsha.emp_tmp1;

drop table if exists harsha.emp_tmp2;

drop table if exists harsha.init_load;

show databases;
use harsha;
show tables;

create table harsha.emp (eid int,ename string,sal int,loc string,dept int,start_date timestamp,end_date timestamp,current_status string)
comment "emp scd implementation"
row format delimited
fields terminated by ','
lines terminated by '\n'
;

create table harsha.emp_tmp1 (eid int,ename string,sal int,loc string,dept int,start_date timestamp,end_date timestamp,current_status string)
comment "emp scd implementation"
row format delimited
fields terminated by ','
lines terminated by '\n'
;

create table harsha.emp_tmp2 (eid int,ename string,sal int,loc string,dept int,start_date timestamp,end_date timestamp,current_status string)
comment "emp scd implementation"
row format delimited
fields terminated by ','
lines terminated by '\n'
;

create table harsha.init_load (eid int,ename string,sal int,loc string,dept int) 
row format delimited
fields terminated by ','
lines terminated by '\n'
;

show tables;

insert into table harsha.emp select 101 as eid,'aaaa' as ename,3400 as sal,'chicago' as loc,10 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;

insert into table harsha.emp select 102 as eid,'abaa' as ename,6400 as sal,'ny' as loc,10 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;

insert into table harsha.emp select 103 as eid,'abca' as ename,2300 as sal,'sfo' as loc,20 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;

insert into table harsha.emp select 104 as eid,'afga' as ename,3000 as sal,'seattle' as loc,10 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;

insert into table harsha.emp select 105 as eid,'ikaa' as ename,1400 as sal,'LA' as loc,30 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;

insert into table harsha.emp select 106 as eid,'cccc' as ename,3499 as sal,'spokane' as loc,20 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;

insert into table harsha.emp select 107 as eid,'toiz' as ename,4000 as sal,'WA.DC' as loc,40 as did,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from (select '123')x;

load data local inpath 'Documents/hadoop_scripts/t3.txt' into table harsha.emp;

load data local inpath 'Documents/hadoop_scripts/t4.txt' into table harsha.init_load;

insert into table harsha.emp_tmp1 select eid,ename,sal,loc,dept,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status 
from harsha.init_load;

insert into table harsha.emp_tmp2
select a.eid,a.ename,a.sal,a.loc,a.dept,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'updated' as current_status from emp_tmp1 a
left outer join emp b on
a.eid=b.eid and 
a.ename=b.ename and
a.sal=b.sal and 
a.loc = b.loc and 
a.dept = b.dept
where b.eid is null
union all
select a.eid,a.ename,a.sal,a.loc,a.dept,from_unixtime(unix_timestamp()) as start_date,from_unixtime(unix_timestamp('9999-12-31 23:59:59','yyyy-MM-dd HH:mm:ss')) as end_date,'current' as current_status from emp_tmp1 a
left outer join emp b on
a.eid = b.eid and
a.ename=b.ename and
a.sal=b.sal and 
a.loc=b.loc and 
a.dept=b.dept
where b.eid is not null
union all
select b.eid,b.ename,b.sal,b.loc,b.dept,b.start_date as start_date,from_unixtime(unix_timestamp()) as end_date,'expired' as current_status from emp b
inner join emp_tmp1 a on
a.eid=b.eid  
where
a.ename <> b.ename or
a.sal <> b.sal or 
a.loc <> b.loc or 
a.dept <> b.dept 
;

insert into table harsha.emp select eid,ename,sal,loc,dept,start_date,end_date,current_status from emp_tmp2;

records including expired:

select * from harsha.emp order by eid;

latest records:

select a.* from emp a inner join (select eid ,max(start_date) as start_date from emp where current_status <> 'expired' group by eid) b on a.eid=b.eid and a.start_date=b.start_date;

Author: sriharsha vemuri


I use a different approach when it comes to managing data with SCDs:
1. Never update data that already exists in your historical file or table.
2. Make sure that new rows are compared to the most recent generation. For example, the load logic adds control columns: loaded_on, checksum and, if necessary, a sequence column that is used when multiple loads occur on the same day. New data is then compared to the most recent generation using the control columns together with a key column that exists inside your data, such as a customer or product key.

Now, the magic happens by calculating a checksum over all the columns involved except the control columns, which creates a unique fingerprint for each row. The fingerprint (checksum) column is then used to determine whether any column has changed compared to the most recent generation (the most recent generation being the latest state of the data based on key, loaded_on and sequence).

At this point you know which of three cases a row from your daily update falls into: it is new, because there is no previous generation; it has changed and needs a new row (a new generation) in your historical file or table; or it has no changes at all, so there is no need to create a row, because nothing differs from the previous generation.

The kind of logic required can be built with Apache Spark: in a single statement you can ask Spark to concatenate any number of columns of any data type and then compute a hash value that is used as the fingerprint.

Putting it all together, you can develop a Spark-based utility that accepts any data source and outputs a well-organized, clean, slowly-changing-dimension-aware historical file or table. And lastly: never update, append only!
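The fingerprint comparison can be illustrated in a few lines. This is a plain-Python sketch rather than Spark code: `hashlib.sha256` stands in for whatever hash function the Spark job would use, and the helper names `fingerprint` and `append_changes`, the key column `customer_key` and the sample rows are all hypothetical.

```python
import hashlib

# Control columns are excluded from the fingerprint.
CONTROL_COLUMNS = {"loaded_on", "sequence", "checksum"}


def fingerprint(row):
    """Hash every non-control column, in a fixed column order."""
    payload = "\x1f".join(
        f"{name}={row[name]!r}"
        for name in sorted(row)
        if name not in CONTROL_COLUMNS
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_changes(history, daily_rows, key, loaded_on):
    """Append a new generation for each new or changed row; never update."""
    # Most recent generation per key, ordered by (loaded_on, sequence).
    latest = {}
    for row in history:
        prev = latest.get(row[key])
        if prev is None or (row["loaded_on"], row["sequence"]) >= (
            prev["loaded_on"], prev["sequence"]
        ):
            latest[row[key]] = row

    appended = []
    for row in daily_rows:
        checksum = fingerprint(row)
        prev = latest.get(row[key])
        if prev is not None and prev["checksum"] == checksum:
            continue  # identical to the latest generation: nothing to add
        # The sequence only increases for repeated loads on the same day.
        same_day = prev is not None and prev["loaded_on"] == loaded_on
        new_row = {
            **row,
            "loaded_on": loaded_on,
            "sequence": prev["sequence"] + 1 if same_day else 0,
            "checksum": checksum,
        }
        latest[row[key]] = new_row
        appended.append(new_row)
    history.extend(appended)
    return appended


history = []
append_changes(history, [{"customer_key": 1, "city": "Berlin"}],
               "customer_key", "2017-01-01")
added = append_changes(
    history,
    [{"customer_key": 1, "city": "Hamburg"},
     {"customer_key": 2, "city": "Munich"}],
    "customer_key", "2017-01-02",
)
unchanged = append_changes(history, [{"customer_key": 1, "city": "Hamburg"}],
                           "customer_key", "2017-01-03")
print(len(history), len(added), len(unchanged))
```

Comparing one checksum instead of every attribute keeps the change test independent of the number of columns; the cost is that the column order and value formatting fed into the hash must stay stable between loads.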

Author: user1918580
