*
*
*
*
*
*
*
ETL Architecture and Control Flow
Post monsieur_arrie on Tue Apr 21, 2009 8:32 am
Hi Folks,
I am new to the forums, they look like an interesting place to discuss BI issues and troubles.
So, I am not new to BI, but we have done a lot of re-architecting of the etl systems with servers being moved, wan considerations etc.
Currently, our source system extracts, compresses and ftps data to our etl area. File names are representative of the 'day' the data refers to. This allows data o be queued in case of ftp failure etc.
My etl system is entirely file based. I use Powershell to pull together the earliest available batch of files. Check to see if the batch is complete, then unpackage the batch ready for loading. I use token files to lock / unlock the staging area and to prevent things getting overwritten in case of error.
I then push my source files through a number of transforms which roughly summise into preparation (junking rows, initial column cleaning), dimension prep, dimension delivery, fact prep then fact delivery. I do not have a surrogate key generator and two critical dimensions are SCD. For this reason I need to deliver dimensions before conforming fact data.
Each step writes to temporary files. If the step completes succesfully, they are moved to the input of the next step, and the source files for the previous step are deleted. This method gives a 'pipeline' of transformations, each of which are restartable at any point.
This all sounds quite satisfactory, and works well at the moment, but complexity is going increase. We have two other data sources to bring online, so I need to syncronise the loading of dimensions. As a note, all packages are audited, but I have no concept of a 'batch', with a particular batch / runid from start to finish.
We are running ms sql 64bit standard edition, so we don't have the ability to partition tables or cubes. Also, to make things harder, our transactional system updates transactions. This makes implementing incremental processing rather difficult.
So, I am interested in knowing how people architect / manage...
. syncronising multiple source systems. What happens if one system is unable to supply data in a timely fashion - do you code overrides or hold out for the data ?
. state management. Do people use file pipeline methods like me, or do you maintain a state in the db, and rely on that to control the flow of data through the etl pipeline.
. restartability. My transforms take some time, so I am keen to allow jobs to be restarted at an time without danger of breaking the etl. db based state management would allow me to do that I think.
. db or file based staging. I am a big fan of file based work, so the db overhead seems unneccessary
Thoughts, comments?
All welcome.
Paul