Data analysis with Hadoop and Hive. New tools. Old game.

I’ve done a lot of data analysis in my life, because for more than half of my career I was part of an actual business rather than an ISV. I’ve worked for a commodity exchange broker, wholesale and retail companies, Coca-Cola, Bayer and other businesses, where I did a lot of analytical work on top of software development. I did my first data analysis project in the late 80s with Lotus 1-2-3, when circumstances forced me to pause software development and sell jeans (BTW, that was a great experience in the end – I learned how to make a sale). Since then I’ve learned the power of databases and spreadsheets, which allowed me to dig into the data and help businesses grow.
 
Recently I’ve done another data analysis project with new tools – Hadoop and Hive. I’ve been working with them for the last 2+ years, but more from a development perspective. This was the first time I did actual data digging with them. And just an hour into the analysis I understood that although these are new tools, the actual process is the same. Yes, they provide some structural (sets, arrays, maps) and scalability advantages, but one needs the same analytical state of mind to get the answers from the data. And the fact that you have this power doesn’t mean that you have to use it right away.
 

Getting answers from the data is an iterative process – you write a query, analyze the result, refine your queries and repeat the process. And that’s when you want to move fast – before you write The Query that will give you the ultimate answer. This speed in finding the path is especially important with large data sets and Hadoop latency – start small and move fast until you are really ready to “Release the Kraken”.

 
So, I think that Big Data analysis does not have a lot to do with Hadoop & Co, well … unless you are on the operations side. It’s still more about how to dig into the data and find a way to answer the question fast, before turning it into a repeatable process suitable for the tools.
 
And that’s why learning how to use spreadsheets to analyze data is so important – it teaches how to find the answers. 

Software Structural Quality, or Lessons from Open Source Projects

Software quality is important. Very important. Not a lot of people will argue with this. That’s why we have “tools” like QA and TDD – to ensure that we deliver high-quality software. But most of these “tools” address the functional quality of the software. What about the structural quality? Isn’t it important as well? How often will you see a bug like this: “Method xxxxx is difficult to read and it’s too complex”? Maybe structural quality is not that important and is primarily a topic of software craftsmanship?
 
“Software structural quality refers to how it meets non-functional requirements that support the delivery of the functional requirements, such as robustness or maintainability, the degree to which the software was produced correctly.” If the software has 0 bugs (what a noble goal), but is neither easily readable nor maintainable, is it good? “Hey, man, it works”. Yes, no doubt about that. But how does that help when new features have to be added or existing code has to be changed? Unless the code has high structural quality it cannot be easily changed, and it will take more and more time to deliver new features. Eventually it will also affect functional quality, because changing something, even something very small, in very bad code can lead to unpredictable behavior (read: bugs).
 
So, if a company would like to stay in business and release new versions of its software, then sooner or later it has to take care of “internal” quality. The sooner the better, before it’s too late. How many times have you heard something like this: “I’m not touching that code – nobody knows what will happen”?
 
Then why do so many companies produce software that works, but is ugly inside? One of my ex colleagues once said: “If our customers could see our code, they would never buy our application”.
 
“Hey, it works. Don’t you get it?”. Yeah, I get it.
I’ve been in many companies, but I have not seen high structural quality in code very often. That’s why it’s even more surprising when you look at the quality of popular open source projects. Most of them are very good – easy to read & understand, nicely structured and covered with unit tests. Shouldn’t it be the opposite – at my job I write high-quality code, but I can go easy on myself with these for-fun projects? Funny, but it doesn’t work this way, because it’s impossible to add crap to an open source project without it being seen and accepted by everybody. But it’s absolutely possible at work – as long as I have 0 bugs, I can put in whatever crap I want, nobody will see it anyway. “Don’t you have a code review process?”. Yeah, right – if I contribute crap, why would I reject your bad code?
 
So, what can open source projects teach us about the structural quality of software?
 
First of all, there should be somebody who cares. Who cares about good code. Who will be a pioneer of “let’s be proud of what we write”.
 
Second, create a culture of good code. Yes, it’s possible to set up a process, but unless the team believes in good code, there is always a way to cheat the process.
 
Have you ever tried to open a PC for an upgrade? Yeah. You know what I’m talking about. Have you tried opening a Mac? A Mac Pro is beautiful inside. That was the way Steve Jobs wanted it – the product should be beautiful even inside.
 
And BTW, writing good code is much more fun. So, have fun 🙂 and be proud of what you write.
 
Update: Chris Hume volunteered to edit this post and make it better. Thanks, Chris.
 

What I like about C#

Recently I’ve been asked what features I like most in C#. I’ve been working with it for almost 14 years now (since the end of 2000), and today I use it almost daily along with Java and JavaScript (in Node.js). It’s so much fun to witness the growth of the language – from birth through its teenage years to maturity.

 C# is a very nice language and I find the following features of it most useful:

– Generics bring the concept of type parameters. This makes code very clean, especially when dealing with collections.

– LINQ added a standard, easy-to-learn pattern for querying and updating data. It’s like SQL for C# collections. LINQ allows developers to express what should be done instead of how (forget about for and while loops). I especially like to use it in the form of lambda expressions. Although the .NET Framework has a functional language – F# – the ability to do functional programming in C# is great.

– Parallel Processing – 

– Tasks. If you’ve ever done threading in C# (or any other language for that matter), the simplicity that Task provides in dealing with threading is a life saver. There is a reason why Microsoft recommends that we use Tasks instead of Thread/ThreadPool directly – not only will it work better, but the code can be written much faster and cleaner.

– async / await

– dynamics. There is a reason why dynamically typed languages (Ruby, Python, JavaScript) are so popular. If you have never experienced them before, it feels awkward at first, but then you get it and have a blast. I have been working with dynamic languages since the early 90s, and having this functionality as part of C# makes life much easier in some cases.

With these features having been around for a while, it sometimes amuses me how often I still see that engineers don’t use them.
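To make the list above a bit more concrete, here is a minimal sketch that touches generics, LINQ with lambda expressions, Tasks with async/await, and dynamic. The class and method names (FeatureSketch, FirstOrFallback, LongNamesSorted, SumOfSquaresAsync, LengthOf) are made up for illustration; only the .NET types they call are real.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper class, for illustration only.
public static class FeatureSketch
{
    // Generics: one method works for any element type, with no casts.
    public static T FirstOrFallback<T>(IEnumerable<T> items, T fallback)
    {
        foreach (var item in items)
        {
            return item; // first element, if there is one
        }
        return fallback;
    }

    // LINQ with lambda expressions: say what you want, not how to loop.
    public static List<string> LongNamesSorted(IEnumerable<string> names, int minLength)
    {
        return names
            .Where(n => n.Length >= minLength)
            .OrderBy(n => n)
            .ToList();
    }

    // Tasks and async/await: run CPU-bound work on the thread pool
    // without creating or managing a Thread directly.
    public static async Task<int> SumOfSquaresAsync(int[] numbers)
    {
        return await Task.Run(() => numbers.Sum(n => n * n));
    }

    // dynamic: the Length member is looked up at run time,
    // so this accepts a string, an array, or anything else with Length.
    public static int LengthOf(dynamic value)
    {
        return value.Length;
    }
}
```

For example, FeatureSketch.LongNamesSorted(new[] { "Bob", "Catherine", "Alexander" }, 4) keeps only the two longer names, and await FeatureSketch.SumOfSquaresAsync(new[] { 1, 2, 3 }) computes the sum off the calling thread.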

So you want to be a software architect. Part 3

In the second post I shared my thoughts about the most important skill that an architect needs if he wants to see the design materialize. Today we will go over the ability to identify risk and the commitment to priorities.

Identify risk

As you design the application or subsystem, it’s important to see what the risky parts are. Before you hand the design over to the development team, you must be sure that the risks and unknowns have been investigated and resolved.

Let’s take a look at an example. In order to speed up the development of new versions of the product, you need to transition from a monolithic architecture to a Service-Oriented Architecture (SOA). You know how to split your business functionality into multiple independent services. You have addressed the issue of internal routing and load balancing. But what method should be used to communicate among the services? There are many approaches. SOAP would be an obvious choice. But this will require your team to learn the new skills of building SOAP-based services, and the operations team to deploy & maintain them. Maybe it’s not a big deal, of course. But you are considering building the internal SOA with AJAX/JSON services, because your team is already familiar with this approach from building Web applications and operations knows how to deploy and operate them. But what about the performance? It may be slower than the SOAP approach. Unless you measure and compare the performance of both approaches, and then agree with the team that the performance “penalties” are acceptable given the advantage of reusing the team’s existing skills, you will be risking the transition to the new architecture and its delivery on time and with high quality. In this case you should measure first and then build, not build first and then test and measure:

  1. Identify the risk
  2. Investigate it
  3. Find alternatives
  4. Agree with the team about the approach
  5. Do it
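The “measure first” step can be as simple as a small timing harness. The sketch below is only an illustration: MeasureFirst, AverageMilliseconds and IsAcceptable are hypothetical names, and the slowdown threshold is something you would agree on with the team, not a recommended value.

```csharp
using System;
using System.Diagnostics;

// Hypothetical timing harness, for illustration only.
public static class MeasureFirst
{
    // Runs the action a few times to warm up, then returns the average
    // elapsed milliseconds per call over the measured iterations.
    public static double AverageMilliseconds(Action action, int iterations)
    {
        if (action == null) throw new ArgumentNullException("action");
        if (iterations <= 0) throw new ArgumentOutOfRangeException("iterations");

        for (int i = 0; i < 3; i++)
        {
            action(); // warm-up calls, not measured
        }

        var watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        watch.Stop();

        return watch.Elapsed.TotalMilliseconds / iterations;
    }

    // The candidate is acceptable if it is at most maxSlowdown times
    // slower than the baseline (maxSlowdown is an agreed-upon assumption).
    public static bool IsAcceptable(double baselineMs, double candidateMs, double maxSlowdown)
    {
        return candidateMs <= baselineMs * maxSlowdown;
    }
}
```

You would pass in one delegate that calls a prototype of each approach (for example, one SOAP call and one JSON call against a test service), compare the two averages, and bring the numbers to the team before committing to either.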

Risk may present itself in many areas – team skills, complexity or stability of the 3rd party components that you plan to use, performance of the algorithms, scalability path and so on. By “realizing” the new architecture in your head, identifying the risk as part of the design and addressing it in the early stages, you will be able to deliver a design that will be successfully implemented by the team.

Commitment to priorities

Each new project or release has priorities. Usually they are:

  • time
  • quality
  • scope
  • resources

Priorities are defined by management and driven by business needs. And if they are not defined, then you should ask for them. Very often the resources are fixed – your team has the same number of people. Out of the other 3 you can balance only 2 and have to sacrifice one of them. For example: with limited time, it’s not possible to deliver all the features with high quality. And it’s easy to forget about priority #1, if let’s say it’s time, by creating a perfect design or designing an interesting feature (both take time). You have to deliver according to these priorities, even if you sometimes disagree with them.

(To be continued)

So you want to be a software architect. Part 2. Important.

Last time I started talking about the skills that I think are required for a software architect. The plan was to go over the other skills, leaving the one I consider the most important (as of today) for last. But I think that this one should go first.

Influence

A software architect must be able to influence other people if he wants to see his architecture properly implemented in the final product. That last condition (proper realization of the architecture in the product) is what decides whether an architect needs to influence people or not. You see – if an architect is only interested in a set of UML diagrams that she passes to the development team and doesn’t care about the final product, then she should not worry about influence, because these diagrams are her final product. If this is how you see your role, then you can skip this post.

Why is influence important? In most companies developers and QA do not report to the architect; they report to development/QA managers. And these managers do not report to the architect. So, if a manager decides to change the way the product should be built, then it may turn out very different from what you have envisioned. And there is not much that you can do about it, because even a junior developer will do what he is told by his manager and not by you. This is why the only way for you to see your design in the final product is to “convince” the manager how it should be built. Otherwise you can be very disappointed (if you really care about the final product):

[Images: your vision vs. the actual product]

Thus, the most important skill for a software architect has nothing to do with technologies, but rather with their “political” ability to achieve their goal. You have to be a “politician” in order to see the final product the way you’ve envisioned it. And sadly enough it’s even more important than your technical skills to design it. Of course, it depends on your company, but keep this in mind.

(To be continued)

Effective multitasking tip

I’ve been working on multiple projects recently. I’ve been managing multiple work projects effectively for a long time, so here I’m talking about working on multiple projects outside of my job. I am pretty busy at work. In addition I do an open source development project. I’ve started this blog. And I’m developing one more application that is very important to me.

My normal day goes like this: wake up early, do some development for one of my personal projects, work on a personal project during the commute to work, do my job, commute home and work on one of my projects again, and maybe do some coding after dinner. I love my personal projects a lot, and that gives me enough enthusiasm to keep this schedule.

Juggling a job and personal projects was challenging for me at first. And that’s when I developed this process that I would like to share:

Every time I’m going to stop working on one project and work on another I record what I was doing and (even more important) what I have to do when I resume working on the same project.

E.g.: fixed 80% of the bug #347. Finish bug fixing and update unit tests.

Simple? Yes. But it makes switching from one project to another less painful and increases productivity dramatically. It helps to get into the interrupted project much faster.

I’ve also realized that the same trick can be used in the office as well: if my development activity is going to be interrupted by a meeting I write down: was doing this, start with this after the meeting.

Try it and maybe it will help you too. Or maybe not. It depends.

oDesk

oDesk is a great time-saver. You know what oDesk is, right? It’s a site where you can hire somebody to do some job for you. Some job that can be done remotely, like code writing, web site creation, QA-ing your code, editing your blog post. Go to their website and learn more about what kind of help you can get.

“But why would I hire somebody when I can do it myself? I can edit my blog post. I can create a web site. I can test my application”. It’s a question that doesn’t have a single answer: what is it going to be – time or money? Sometimes it’s money, and in this case you test your application yourself before launching it to your customers. But sometimes it’s “cheaper” to pay somebody to test it so that you can concentrate on something more important. I’ve just recently read a comment about a person who was gladly hiring somebody to do a job that would take him only an hour to do, even if it would take the hired person 8 hours, because at the end of the day it was “cheaper” in the long run. So, sometimes you can “delegate” something to somebody else and this way free yourself for more important tasks (including spending more time with your family or exercising more, or just doing nothing and recovering after a tough day/week/month).

I first heard about oDesk in a book about Micro ISVs. So, when I needed to build a team for a startup (with limited resources), I turned to oDesk. And it was a very successful experience. Not without challenges, but successful. Not only did I have people from around the world helping me to build a new product, but I was paying a fraction of what it would have cost me to have a local team.

Ok, I hear you: “Another advertisement for a cheap offshore job market. Give me a break. You don’t know what you are talking about”. Well, as a matter of fact, I do know. Because I did it, with success. And I’ll follow up with what helped me to use an offshore team effectively. Note that offshore teams may not be an option for you – one of my friends, who is the CEO and founder of his company, told me that in some cases (when the software will be used in high-security areas) it’s not possible to hire people outside of the US. But I think it’s more an exception than a rule.

So, does it mean that the only way to use oDesk is for a business? Not at all. You can use it for some personal projects as well. Just remember that you pay your $$$ for your time, time that you can spend with your family, growing your skills or just relaxing. You can always make some more $$$, but you can never buy extra time. Unfortunately.

And here is how I’ve used oDesk for a personal need. I had this WordPress-based blog hosted with another provider. It was free with my domain hosting, but I was not happy with the performance. I know that it doesn’t really matter because nobody reads my blog :), but still I was unhappy. So, when Microsoft introduced WordPress hosting on Azure (I was doing a lot of development on Azure at that time), I decided to switch (BTW, the performance is much, much better on Azure). So, did I know how to transfer my WordPress blog from one provider to another? Of course not :). Could I do it myself? Absolutely. I can learn easily (there are tons of blogs about how this can be done). But before moving on, I decided to check what it would cost me if somebody else did it for me. After a quick search on oDesk I found a nice guy in India who could do it for me for under $50. So, what’s it going to be: $50 or a day of my life? It was a very easy choice for me (your mileage may vary :)). A few days later I had my blog transferred to Azure. Yes, I spent a little bit of my time helping the guy (just sharing some details, providing credentials for him and doing some security clean-up after the transfer), but it was much less time than if I had done it myself.

So, next time you need to get something done, maybe oDesk is your friend 🙂

 

So you want to be a software architect

There are many opinions about what a software architect does, what skills are required and whether we need them or not. I’ve been in this role for a long time (20+ years), even before I “acquired” the “official” title. I was lucky, because a year after graduation I built a whole application – gathered requirements, architected, designed, coded, tested and operated the application, as well as provided customer support. Since then, for most of my life I have been building complete applications rather than working on pieces here and there. And that’s how I learned it in the first place.

First, let me be clear – I think that an architect is a required and important part of any software team, or a “part” of you if you are a MicroISV. And it doesn’t matter whether this person is called architect or engineer or he-knows-how-to or something else. There should be someone who knows how to build the whole building, not just superbly paint a single room.

So, some of my thoughts about the skills that good architect needs as well as pros and cons of this role.

Skills

Know how to do it yourself

I think that if you cannot sit down and write the whole application that you are designing yourself, there is a very slim chance that it will be designed properly, because when you design/architect an application, you actually write it in your head.

E2E knowledge

An architect has to know it end to end. From understanding how the business works to the work of customer support and everything in between (requirements, design, development, QA, operations). Without this knowledge it’s difficult to build an application that will add value for its users, will be delivered on time, will have a reasonable cost and will make the team happy. And this leads to the next required skill.

Side note: why did I put team happiness in this list? Long story short: in the long run a happy team makes a great product, an unhappy team – crap.

Compromise

Meeting these requirements (value for the customers, time to market, cost and team happiness) during the whole lifecycle of the product is difficult.

  • Should the product quality be sacrificed to ship it on time? 
  • Should additional development effort be allocated to make a tool for the operations team that will make operations easier?
  • Should some components be built in-house or purchased?
  • …..

An architect faces these questions every day. And the quality of the final product depends on his ability to find an optimal solution – to find a win-win path.

Keep it all in your head

You should be able to store the whole application in your head. You may not remember all the details, but the overall layers, tiers, components, data and execution flow as well as the storage design are a must. An application is a composition of multiple pieces (layers), and without seeing the “whole picture” it is difficult to build a well-balanced application. E.g.: the design of particular database tables may be perfect from a DBA perspective, but will require a lot of unnecessary and poorly performing code in the data access and business layers.

Review code fast

High software quality not only guarantees the happiness of your customers (functional quality), but also determines how fast you can deliver new features (structural quality). If your application has 0 bugs, but the code is ugly and the architecture resembles the Winchester House, then there is no way you can add functionality fast. But how can you ensure high structural quality? Yes, you can have all these static code analysis tools in place, and they help. To some extent. But the only way to guarantee high structural quality of your application is to review all the code. “I don’t have time to review all the code that is checked in!” Yes, you do. It takes 30 minutes to review the regular check-ins of a 10-developer team, if you know how to do it right and if you teach your team to write good code (I’ll share my experience later).

And don’t fool yourself – sooner or later you will review “that” code, because when it breaks badly one day or prevents you from adding new functionality without a total rework, you will have to review & clean it up. So, better do it daily, when it’s easier to spot a small problem and fix it early.

BTW, if you know a better way to ensure high structural quality, please let me know.

(To be continued)

Beauty of dynamics

I’ve never liked the out keyword. I know, it’s personal, and you are welcome to disagree with me.

The TryParse pattern happens quite often. My preferred approach is wrapping both return values into an object. Or use KeyValuePair. But it’s not an elegant solution for my taste.
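For illustration, the wrapper-object approach might look like the sketch below. ParseResult and Parsing.ParseInt are hypothetical names invented for this example; the only library call is the built-in int.TryParse(string, out int).

```csharp
using System;

// Hypothetical result object that carries both the success flag and the value.
public sealed class ParseResult
{
    public bool Success { get; private set; }
    public int Value { get; private set; }

    public ParseResult(bool success, int value)
    {
        Success = success;
        Value = value;
    }
}

public static class Parsing
{
    // Wraps int.TryParse so that callers do not need
    // an out variable of their own.
    public static ParseResult ParseInt(string text)
    {
        int value;
        bool ok = int.TryParse(text, out value);
        return new ParseResult(ok, value);
    }
}
```

It works, but it costs one extra class per pair of return values, which is a big part of why it does not feel elegant.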

That is one of the reasons I liked seeing Tuples in C# 4.0. But still … not elegant enough. Here comes dynamic. I’ve worked with dynamic languages since … well, long enough. But it didn’t “click” with C#. But today I was writing some code where I wanted to return an instance of an anonymous type. Something similar to this:

 return customerList.Select(c => new {Id = c.Id, Name = c.Name}); 

And I realized that there is a nice way of handling this situation (as well as the TryParse pattern):

 private object GetNew()
 {
     return new { Id = 5, Name = "Joe Doe" };
 }

 [TestMethod]
 public void Get_anonymous_object_with_dynamic()
 {
     dynamic newObj = GetNew();

     Assert.AreEqual(5, newObj.Id);
     Assert.AreEqual("Joe Doe", newObj.Name);
 }
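The same idea applied to the TryParse pattern could look roughly like this. It is only a sketch: DynamicParsing, TryParseInt and ParseOrDefault are hypothetical names, and the Success/Value property names are my own choice.

```csharp
using System;

// Hypothetical helper class, for illustration only.
public static class DynamicParsing
{
    // Returns both values in an anonymous object, typed as object.
    public static object TryParseInt(string text)
    {
        int value;
        bool ok = int.TryParse(text, out value);
        return new { Success = ok, Value = value };
    }

    // Reads the anonymous object through dynamic:
    // no out parameter and no dedicated result class.
    public static int ParseOrDefault(string text, int fallback)
    {
        dynamic result = TryParseInt(text);
        if (result.Success)
        {
            return result.Value;
        }
        return fallback;
    }
}
```

One caveat: anonymous types are internal to the assembly that creates them, so reading their members through dynamic only works inside the same assembly. From another assembly the member access fails at run time with a RuntimeBinderException. You also give up compile-time checking of the property names.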

So, I’m quite happy with this solution … for now.