 ETL Architecture and Control Flow

ScriptKiddieBot
Member

Posts : 45
Join date : 2011-03-09

Subject: ETL Architecture and Control Flow - Fri Apr 15, 2011 9:45 pm

ETL Architecture and Control Flow

Posted by monsieur_arrie on Tue Apr 21, 2009 8:32 am
Hi Folks,

I am new to these forums; they look like an interesting place to discuss BI issues and troubles.

So, I am not new to BI, but we have done a lot of re-architecting of the ETL systems, with servers being moved, WAN considerations, etc.
Currently, our source system extracts, compresses and FTPs data to our ETL area. File names are representative of the 'day' the data refers to. This allows data to be queued in case of FTP failure etc.

My ETL system is entirely file based. I use PowerShell to pull together the earliest available batch of files, check whether the batch is complete, then unpack the batch ready for loading. I use token files to lock and unlock the staging area and to prevent things getting overwritten in case of error.
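
In sketch form, the pick-up looks something like this (the paths, the yyyyMMdd filename convention, the expected file count and the token-file name are all assumptions, not our real setup):

[code]
# Refuse to touch the staging area if a previous run left it locked.
$staging = 'D:\etl\staging'
$inbox   = 'D:\etl\inbox'
$lock    = Join-Path $staging 'staging.lock'
$expectedFileCount = 4   # files that make up one daily batch (assumption)

if (Test-Path $lock) { throw 'Staging area is locked; a previous run may have failed.' }
New-Item -ItemType File -Path $lock | Out-Null

# Group incoming files by the day encoded in the name (e.g. sales_20090421.csv.gz)
# and take the earliest batch that has arrived in full.
$batch = Get-ChildItem $inbox -Filter '*.gz' |
    Group-Object { [regex]::Match($_.Name, '\d{8}').Value } |
    Sort-Object Name |
    Where-Object { $_.Count -eq $expectedFileCount } |
    Select-Object -First 1

if (-not $batch) { throw 'No complete batch is available yet.' }

foreach ($file in $batch.Group) {
    # Decompression tool is site-specific; 7-Zip shown for illustration only.
    & 7z e $file.FullName "-o$staging" | Out-Null
}

# Unlock only after a clean unpack; on any error above, the token stays put.
Remove-Item $lock
[/code]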

I then push my source files through a number of transforms, which roughly summarise into preparation (junking rows, initial column cleaning), dimension prep, dimension delivery, fact prep, then fact delivery. I do not have a surrogate key generator, and two critical dimensions are SCDs. For this reason I need to deliver dimensions before conforming fact data.
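
In other words, the stage order is fixed; something like the following (the stage names are just my shorthand here):

[code]
# Dimensions are delivered before facts are conformed, because the two SCD
# dimensions only get their surrogate keys at delivery time.
$stages = @(
    'prepare',       # junk rows, initial column cleaning
    'dim-prepare',
    'dim-deliver',   # SCD dimensions receive their surrogate keys here
    'fact-prepare',
    'fact-deliver'   # facts can now be conformed against the delivered dims
)
[/code]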

Each step writes to temporary files. If the step completes successfully, they are moved to the input of the next step, and the source files for the previous step are deleted. This method gives a 'pipeline' of transformations, each of which is restartable at any point.
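
A minimal sketch of one such step, using the stage list above (the directory layout and the one-script-per-stage convention are assumptions):

[code]
function Invoke-EtlStage {
    param([string]$Stage, [string]$NextStage)

    $in   = "D:\etl\$Stage\in"
    $tmp  = "D:\etl\$Stage\tmp"
    $next = "D:\etl\$NextStage\in"

    # The transform writes only to its temp directory and is expected to throw
    # on failure, so a crash mid-step leaves the inputs intact for a restart.
    Remove-Item "$tmp\*" -ErrorAction SilentlyContinue
    & "D:\etl\scripts\$Stage.ps1" -InputDir $in -OutputDir $tmp

    # Commit: promote temp output to the next step's input, then clear our own
    # inputs. Nothing below runs unless the transform succeeded.
    Move-Item "$tmp\*" -Destination $next
    Remove-Item "$in\*"
}

# Walk the pipeline in order; the final stage delivers to the database rather
# than to a next directory, so it is handled separately.
for ($i = 0; $i -lt $stages.Count - 1; $i++) {
    Invoke-EtlStage -Stage $stages[$i] -NextStage $stages[$i + 1]
}
[/code]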

This all sounds quite satisfactory, and works well at the moment, but complexity is going to increase. We have two other data sources to bring online, so I need to synchronise the loading of dimensions. As a note, all packages are audited, but I have no concept of a 'batch' with a particular batch/run ID from start to finish.

We are running MS SQL Server 64-bit Standard Edition, so we don't have the ability to partition tables or cubes. Also, to make things harder, our transactional system updates transactions, which makes implementing incremental processing rather difficult.

So, I am interested in knowing how people architect / manage...
. synchronising multiple source systems. What happens if one system is unable to supply data in a timely fashion - do you code overrides or hold out for the data?
. state management. Do people use file pipeline methods like me, or do you maintain state in the db and rely on that to control the flow of data through the ETL pipeline? (A sketch of what I mean follows this list.)
. restartability. My transforms take some time, so I am keen to allow jobs to be restarted at any time without danger of breaking the ETL. DB-based state management would allow me to do that, I think.
. db or file-based staging. I am a big fan of file-based work, so the db overhead seems unnecessary.
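
To make the db-state question concrete, this is roughly the shape I have in mind - a hypothetical control table, with server, database, table and column names invented purely for illustration (Invoke-Sqlcmd assumes the SQL Server PowerShell module is loaded):

[code]
# One row per (batch, stage); each stage updates its row as it commits.
# Restarting a run then means resuming from the first stage that never completed.
$conn    = @{ ServerInstance = 'ETLSQL01'; Database = 'EtlControl' }   # made-up names
$batchId = 20090421   # e.g. the day key assigned when the batch was picked up

Invoke-Sqlcmd @conn -Query @"
CREATE TABLE dbo.BatchStage (
    BatchId    int          NOT NULL,  -- one id per batch, start to finish
    StageSeq   tinyint      NOT NULL,  -- fixed position in the pipeline
    Stage      varchar(32)  NOT NULL,  -- 'prepare', 'dim-deliver', ...
    Status     varchar(16)  NOT NULL,  -- 'pending' / 'running' / 'complete' / 'failed'
    StartedAt  datetime     NULL,
    FinishedAt datetime     NULL,
    CONSTRAINT PK_BatchStage PRIMARY KEY (BatchId, StageSeq)
);
"@

# On restart, pick up at the earliest stage that has not committed.
$resume = Invoke-Sqlcmd @conn -Query @"
SELECT TOP 1 Stage
FROM dbo.BatchStage
WHERE BatchId = $batchId AND Status <> 'complete'
ORDER BY StageSeq;
"@
[/code]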

Thoughts, comments?

All welcome.

Paul