14

I've been trying to make a decision as to which package to implement. I've looked at Pentaho, Talend, SSIS, Informatica, etc... What do you use?

18

We use Informatica. But you didn't ask our opinion of what we use :-)

My own opinion is that some well written perl (or whatever your favorite so-called "scripting" language is) can beat a shiny GUI anyday.

And I'm not alone in my opinion (from this thread).

And my favorite definition of ETL:

ETL is that which you develop custom systems by hand to..

  • backup the truck and copy legacy systems data ( 20 different systems )
  • marry the data together from different legacy systems ( compute farm )
  • Compress the data by summarizing
  • Do some transformations
  • Load into Oracle using sql*loader Direct path you may get 1/4 billion rows loaded per hour
  • transform the data some more using pl/sql and perl or java
  • compress the data again
  • roll fact tables up by dropping columns
  • complete star schema
  • optimize your partitioned world and make fast

Be told by snobby and stupid large companies you cannot work in their ETL department because you lack direct experience with...

Ab Initio Informatica etc.

GUIs that attempt to automate what you can do manually and probably do better.

from BJE_DBA

Update: and I (++) Allan elsewhere in this thread.

5

We are building a BI platform from the ground up using Pentaho. Pentaho Data Integration 3 (AKA Kettle) provides a nice cross-platform GUI for creating, executing, and monitoring ETL jobs. It has a good number of built-in transformations/actions and has a (moderately) documented plugin API (in Java). It is definitely one of the more mature applications in the Pentaho BI Suite (for better or worse). Oh. And it's free.

Personally, I'm not huge on Java. So I'm looking into Ruby integration (via JRuby) ? which appears very feasible, but isn't happening out-of-the-box.

4

I use SSIS because of its integration with SSAS. Most of my experience is with 2005 which is a bit clunky in places. 2008 looks more polished. I'd recommend it if some of the out-of-the-box components fit with what you plan to use it for or if the ability to link against .NET dll libraries is useful.

4

There's a fantastic Ruby ETL tool, ActiveWarehouse ETL that is part of an open source Ruby based data warehousing package.

It hasn't been updated in a while but it is in widespread use on a daily basis by many.

It has a very intuitive DSL to interact with it.

4

I have just completed a thorough evaluation of Talend 4.0. Previously I used Alteryx and SSIS. Talend is a great open source product- as long as you don't deal in large data sets or need circular flows in data. Its main limitations to me are 1. You cannot join multiple flows of data back together again in same script without writing to text files and starting another text job. This is a structural defect based on the way it generates the Java or Perl code. 2. For large datasets its tools just do not cut it. In my case (Java) no matter how much memory you have allocated to Java the join, unique and aggregate tools all barf with Java memory heap errors. Fine for < million rows but don't try to do 10 million row joins. I have switched my Talend scripts around to bulk import into MySQL or MSSQL for any joins and I am good to go. I am thinking of trying Pentaho to see how it handles large data sets- any one using it for LOTS of data?

3

I've been fairly successful with SQL Server Integration Services 2005 (SSIS), and the only real pain-point was deployment and configuration. Spent a lot of time battling configuration files, getting the packages to pick up configuration values during installation etc. I don't know (but do hope) if it's improved in SQL 2008.

3

I cannot agree more with the previous post which says that a well written script will beat a shiny GUI anyday.

I used Pentaho PDI (kettle) for over 2 years and eventually dumped it for a scripting language. The strength of the GUI tool is that it will get you on speed really fast. The problem with GUI tools is that as soon as you are trying to do something for which a GUI tool does not have a specially designed module you are hitting a wall straight on. You may often use some scripting language within GUI tool but in my experience it is more like hacking around GUI tool limitations.

The best (and fastest) ETL is made from pure SQL, add to this some simple scripting language and you are at home.

I successfully implemented few large Pentaho deployments which integrated Hadoop/Hive with Pentaho using InfoBright and Infinidb databases without using any GUI tool. Instead I used Python as a scripting language and PygramEtl (see example) for loading fact and dimensions and for building aggregate tables.

2

We use SSIS or Cognos Data Manager, depending on requirements. Data Manager is very nice in handling Dimensions, Hierarchies and Star Schemas as first class concepts, but can be a little limited in its database specific functionality (though it is getting better). It's the perfect tool for Ralph Kimball acolytes, but that's a religious war I refuse to enter.

Of course, Cognos has now been bought by IBM, who appear to be happy to keep developing Data Manager separate to their own Data Stage, as they excel in different areas, but who can say what the future holds. I have no experience with Data Stage.

SSIS is a bit more difficult to use creatively, though is probably more versatile after a learning curve, and has less Star Schema specific organisation, but is cheaper (if you already own a SQL license) and has more SQL Server specific functionality.

Pentaho was OK when I used it (2 years ago, when it was a seperate application in the suite 'KETL'), and it does give you the Open Source warm and fuzzies, but was a little buggy and less sophisticated. I do still have a soft spot for it though.

From what I have seen, Informatica is very powerful, versatile, but also very expensive.

Avoid Computer Associates Advantage Data Transformer. It was good in its day, but like most things CA, completely ignored from a development point of view the second it was acquired by them. It looks and feels late 1990s.

From what I know, all the above tools store their metadata in a non proprietary formats, XML for Data Manager and CA-ADT, XML for the rest (KETL I can't remember off the top of my head, but I think it might be DB)

2

I have used perl + the native bulk load mechanism of the database with great success.

/Allan

2

We're using Informatica, it's very convenient and helpful in passing a project to other people. I've heard Oracle Warehouse Builder is very good. We also used SSIS for a project, but decided not to go to it yet. You'll be able to scale to multiple server with Informatica's Grid option, but Informatica is expensive.

For Data Warehousing, I think Wherescape RED is a nice product, a cheaper Kalido.

1

I've tryed Talend and i have to admit that this OpenSource solution got potential.

In the next years, it will be a company to watch

1

I've used the Cognos Data Manager ETL tool and it's pretty nice.

1

I have to agree with David, I am currently using talend for a proof of concept and am finding it does have legs.

While it is fairly new software on the OSS ETL landscape and does have its bugs they are developing into a very viable option. They seem to be chipping away at requested features/showstopping bugs as quickly as the community can bring them up. Definitely worth a look if your shop isn't willing to drop big bucks for a traditional "enterprise" solution.

1

I realize this is old thread but we have been testing a lot of these products including Pervasive and, for Oracle, nothing beats Talnd 3.1 (beta). Because of the fact they are a code generation tool you have complete control over every aspect of the tranformation. While we have unique data tranfer requirements, at this point, pervasive is close, but lacks the true control that Open Studio (or for 6500 Talend Intregration Suite) provides. It is by far the most flexible tool. Basically if you have unique requirements, but still want a visual start to your tranformations, you will not be disappointed with Talend. If you work in an industry with a standard you may have more options.

One side note is that we are a .Net house. So you may have issues retraining your team, but the benefits far outway the costs.

1

Check out CloverETL (www.cloveretl.com) - it is engine based (as are Informatica, AbInitio, DataStage) with performance matching Informatica for 1/10th of the price. The runtime is even open source, the GUI costs few hundred $. It is very scalable - from single cpu desktop capable of processing hundreds of MB to big SMP iron where TB is no volume. It also runs on cluster with MPP style of processing data.

1

We use Wherescape Red. IT is a tool that allows you to build data warehouse 10x faster.

1

I've been using Talend for ad hoc data take ons and with version 3.2 it's a great product, performance is superb. If you want to package jobs and standardise loading this is the perfect tool (version 4 is on it's way).

Previously I did a full data take on with Oracle SQLLoader and PL/SQL which was faster and cheaper to implement than what the Informatica consultants could deliver. I've seen some Perl scripts and I must say that I have have never used Perl but it looks really easy to use and maintain.

From the purchased options, we used Microsoft SSIS for data analysis / profiling with great results, why Oracle doesn't package something like this I don't know.

0

I have some limited experience with SSIS on SQL Server 2005.

0

It all depends on your budget, IT environment, and skills pool. MS SSIS is great and we're currently implementing it. However, as this is a distributed data environment, some data will be first extracted, transformed, and then exported using Pentaho because over there, they have less budget.

In short, my recommendations: - large DB in a MS environment: SSSIS (DB is MS SQL) - large db in open-source environment: Pentaho (DB is MySQL)

I also used in a past project a Progress ETL bundled in a QAD BI package to attack directly a Progress DB and load data into another Progress DB. Again, the choice was driven by the environment and the budget.

/Vey

0

We use Talend Open Studio. It's buggy but very powerful. It's GUI based no, or very little programming needed. Most of jobs can be done w/o understanding Perl or Java (We are using the Perl version). Whatever flows your boat. I think in a year or so it will be huge.

0

We use Informatica, though I personally do not and have no opinion of it. I look at ETL tools with a suspicious eye but I don't doubt that they have their place.

The main point I didn't see made ye t that you might want to consider: I believe Informatica is the markey leader in ETL tools.

0

I used Talend Studio recently and am quite impressed by its capabilites. Talend is a code generator and has GUI for all the routine tasks. I am in a e-discovery shop and deal with large load files (flat files, csv etc) manipulation.

0

I have used SSIS, Talend, Informatica and plain Perl and shell scripts to perform extraction and transformation. As some have already noted many a time you will find your own application or scripts performance better, and other times with your in-house code you can easily extend the functionality. For example, parsing emails according to RFC specs.

Having said that Talend has done a great job for what it will cost you. Though I haven't seen TCO comparing these vendors. Also, with ELT vs ETL it gets more interesting to compare.

-- Shiva

0

Clover ETL is very good! It is probably the best ETL I have used so far.

0

Your choice will depend on your budget but that doesn't mean the more expensive tools are necessarily better - what you want is a good team with a thorough understanding of the principles of good design and a pragmatic approach to problem-solving. In terms of "bang for your buck" you would be very hard pressed to beat Wherescape RED. If your budget is more meagre still and you have good understanding of DW building from the ground up then ETL tools like Talend are superb (or if you are an MS shop of course, SSIS is excellent).

Often the more expensive tools have little, if any, independent/peer support and have business models driven as much by the generation of consultancy as from licensing.

M

-2

A maximum of eight points awarded to the ETL tool for its ease-of-use. A maximum of two points was awarded for each of the four attributes, If the verdict was negative, then points were deducted. If for example, ease of use, WYSIWYG and task compatibility ETL/EAI earned a plus sign and screen design a minus sign, the tool was awarded two points (one points for each plus sign, minus one point for the minus sign). There is a big difference between an ETL tool offering various options and features, and making them easy enough for everyone to use. A number of ETL tools offer certain features that can only be implemented by programming, either from inside or outside the tool. In these cases it was decided not to include this functionality in the final score, because the advantage compared with 'normal programming' was too small. We assumed that for a feature to work efficiently, it had, to some extent, to be implemented in an easy-to-use graphical interface. If you only intend your ETL tool to be used by highly trained IT professionals this may less relevant.

Intraday Tips