Bertrand Florat technical articlesSome stories about programming, devops, system, security...2023-12-27T00:00:00+00:00https://florat.net/Bertrand FloratBeyond Murphy's Law2023-12-27T00:00:00+00:00https://florat.net/beyond-murphyandapos-law/<img src="https://florat.net/assets/images/blog-tech/article-36.webp" alt="Upside-down house" width="500">
<p>This article has also been published at <a href="https://dzone.com/articles/beyond-murphy-law">DZone</a>.</p>
<p>Murphy's Law (<strong>"Anything that can go wrong will go wrong, and at the worst possible time."</strong>) is a well-known adage, especially in engineering circles. However, its implications are often misunderstood, especially by the general public. It's not just about the universe conspiring against our systems; <strong>it's about recognizing and preparing for potential failures</strong>.</p>
<p>Many view Murphy's Law as a blend of magic and reality. As Site Reliability Engineers (SREs), we often ponder its true nature. Is it merely a psychological bias, where we emphasize failures and overlook our unnoticed successes? Psychology has identified several related biases, including Confirmation and Selection biases. The human brain tends to focus more on improbable failures than successes. Moreover, our grasp of probabilities is often flawed – the Law of Truly Large Numbers suggests that coincidences are, ironically, quite common.</p>
<p>However, in any complex system, a multitude of possible states exist, many of which can lead to failure. While safety measures make a transition from a functioning state to a failure state less likely, over time, it's more probable for a system to fail than not.</p>
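<p>To make this concrete with a back-of-the-envelope calculation (the figures below are illustrative assumptions, not measured statistics): if a system has an independent probability <em>p</em> of failing on any given day, the probability of at least one failure over <em>n</em> days grows quickly:</p>
<pre>
P(at least one failure over n days) = 1 - (1 - p)^n
e.g., p = 1% per day, n = 365  =>  1 - 0.99^365 ≈ 0.97
</pre>
<p>Even a "99% reliable" system is thus almost certain to fail at least once within a year.</p>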
<p>The real lesson from Murphy's Law isn't just about the omnipresence of misfortune in engineering but also how we respond to it: through redundancies, high availability systems, quality processes, testing, retries, observability, and logging. Murphy's Law makes our job more challenging and interesting!</p>
<p>Today, however, I'd like to discuss complementary or reciprocal aspects of Murphy's Law that I've often observed while working on large systems:</p>
<h2>Complementary Observations to Murphy's Law</h2>
<h3>The Worst Possible Time Complement</h3>
<p>Often overlooked, this aspect highlights the 'magic' of Murphy's Law. Complex systems do fail, but not so frequently that we forget them. In our experience, a significant number of failures (about one-third) occur at the worst possible times, such as during important demos.</p>
<p>For instance, over the past two months, we had a couple of important demos. In the first demo, the web application failed due to a session expiration issue, which rarely occurs. In the second, a regression embedded in a merge request caused a crash right during the demo. These were the only significant demos we had in that period, and both encountered failures. This phenomenon is often referred to as the 'Demo Effect'</p>
<h3>The Impossibility Complement</h3>
<p>Murphy's law states that everything that <strong>can go wrong</strong> will indeed go wrong. While this is obviously a tautology, I would add that even what <strong>can't go wrong actually does</strong>.</p>
<p>Developers often note that even systems deemed infallible can fail, a sentiment captured by the classic "it works on my machine". The causes are numerous: insufficient testing datasets, unrealistic loads, lack of robustness tests, disregard for the dev-prod parity principle, or failure to test in a highly concurrent environment.</p>
<p>This can occur with both functional and technical issues:</p>
<ul>
<li>
<p>In a recent issue involving a current project, we had a serious technical crash when we received an invalid date ('29 February' in a non-leap year) from a partner (the date was typed as a string by end-users without any validation on their side). Our business analysts had explicitly advised against testing date validity, assuming such an error "couldn't happen".</p>
</li>
<li>
<p>Another incident involved a technical glitch (system out of memory), despite our belief that it was impossible after configuring our Java Virtual Machines to utilize all available memory at startup. In theory, Java limits memory usage. Yet, our Java application was terminated by the Linux kernel's 'oom-killer' to prevent a complete server freeze. This was possible because our program ran on a Virtual Machine managed by ESXi, which could perform 'ballooning,' a mechanism to force VMs to swap memory to disk. This process was largely unknown to developers, integrators, and most operators, proving challenging to understand.</p>
</li>
</ul>
<p>The lesson learned: the importance of adopting highly defensive programming and creating robust systems cannot be overstated.</p>
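<p>As a minimal sketch of such defensive programming (illustrative code, not the project's actual implementation), strict date parsing in Java rejects impossible dates like '29 February' in a non-leap year instead of silently adjusting them:</p>
<pre>
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;

public final class SafeDates {
    // 'uuuu' plus STRICT rejects impossible dates such as 2023-02-29,
    // which SMART (the default resolver) would silently turn into 2023-02-28.
    private static final DateTimeFormatter STRICT_ISO = DateTimeFormatter
            .ofPattern("uuuu-MM-dd")
            .withResolverStyle(ResolverStyle.STRICT);

    public static LocalDate parseOrNull(String raw) {
        if (raw == null) {
            return null;
        }
        try {
            return LocalDate.parse(raw, STRICT_ISO);
        } catch (DateTimeParseException e) {
            return null; // caller decides: reject, log, quarantine...
        }
    }
}
</pre>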
<h3>The Conjunction of Events Complement</h3>
<p>The combination of events leading to a breakdown can be truly astonishing.</p>
<p>For example, I once inadvertently caused a major breakdown in a large application responsible for sending electronic payrolls to 5 million people, coinciding with its production release day. The day before, I conducted additional benchmarks (using JMeter) on the email sending system within the development environment. Our development servers, like others in the organization, were configured to route emails through a production relay, which then sent them to the final server in the cloud. Several days prior, I had set the development server to use a mock server since my benchmark simulated email traffic peaks of several hundred thousand emails per hour. However, the day after my benchmarking, when I was off work, my boss called to inquire if I had made any special changes to email sending, as the entire system was jammed at the final mail server.</p>
<p>Here’s what had happened:</p>
<ul>
<li>An automated Infrastructure as Code (IAC) tool overwrote my development server configuration, causing it to send emails to the actual relay instead of the mock server;</li>
<li>The relay, recognized by the cloud provider, had its IP address changed a few days earlier;</li>
<li>The whitelist on the cloud side hadn't been updated, and a throttling system blocked the final server;</li>
<li>The operations team responsible for this configuration was unavailable to address the issue.</li>
</ul>
<h3>The Squadron Complement</h3>
<p>Problems often cluster, complicating resolution efforts. These range from simultaneous issues exacerbating a situation to misleading issues that divert us from the real problem.</p>
<p>I can categorize these issues into two types:</p>
<ul>
<li>
<p><strong>The Simple Additional Issue</strong>: This typically occurs at the worst possible moment, such as during another breakdown, adding more work or slowing down repairs. For instance, in a current project I'm involved with, due to legacy reasons, certain specific characters inputted into one application can cause another application to crash, necessitating data cleanup. This issue arises roughly once every 3 or 4 months, often triggered by user instructions. Notably, several instances of this issue have coincided with much more severe system breakdowns.</p>
</li>
<li>
<p><strong>The Deceitful Additional Issue</strong>: These issues, when combined with others, significantly complicate post-mortem analysis and can mislead the investigation. A recent example was an application bug in a Spring batch job that remained obscured due to a connection issue with the state-storing database, caused by intermittent firewall outages.</p>
</li>
</ul>
<h3>The Camouflage Complement</h3>
<p>We apply the ITIL framework's problem/incident dichotomy to classify issues: a problem can generate one or more incidents.</p>
<p>When an incident occurs, it's crucial to conduct a thorough analysis by carefully examining logs to figure out whether this is merely a new incident of a known problem or an entirely new problem. We often identify incidents that appear similar to others, possibly occurring on the same day and exhibiting comparable effects, but stemming from different causes. This is particularly true when incorrect error-catching practices are in place, such as using overly broad catch(Exception) statements in Java, which can either trap too many exceptions or, worse, obscure the root cause.</p>
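<p>A minimal self-contained sketch of that anti-pattern and its narrow-catch alternative (class and method names are invented for illustration):</p>
<pre>
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class PayloadImporter {
    // Anti-pattern: one broad catch makes unrelated failures look identical
    // and drops the root cause from the logs.
    String importBroad(Path path) {
        try {
            return Files.readString(path);
        } catch (Exception e) {
            return null; // root cause silently lost
        }
    }

    // Better: catch narrowly, keep the cause, and let unexpected
    // exceptions bubble up so they surface as what they really are.
    String importNarrow(Path path) {
        try {
            return Files.readString(path);
        } catch (IOException e) {
            throw new IllegalStateException("cannot read " + path, e);
        }
    }
}
</pre>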
<h3>The Over-Accident Complement</h3>
<p>Like chain reactions in traffic accidents, one incident in IT can lead to others, sometimes with more severe consequences.</p>
<p>I can recall at least three recent examples illustrating our challenges:</p>
<ul>
<li>
<p><strong>Maintenance Page Caching Issue</strong>: Following a system failure, we activated a maintenance page, redirecting all API and frontend calls to this page. Unfortunately, this page lacked proper cache configuration. Consequently, when a few users made XHR calls precisely at the time the maintenance page was set up, it was cached in their browsers for the entire session. Even after maintenance ended and the web application frontend resumed normal operation, the API calls continued to retrieve the HTML maintenance page instead of the expected JSON response due to this browser caching (see the sketch after this list).</p>
</li>
<li>
<p><strong>Debug Verbosity Issue</strong>: To debug data sent by external clients, we store payloads into a database. To maintain a reasonable database size, we limited the stored payload sizes. However, during an issue with a partner organization, we temporarily increased the payload size limit for analysis purposes. This change was inadvertently overlooked, leading to enormous database growth and nearly causing a complete application crash due to disk space saturation.</p>
</li>
<li>
<p><strong>API Gateway Timeout Handling</strong>: Our API gateway was configured to replay POST calls that ended in timeouts due to network or system issues. This setup inadvertently led to catastrophic duplicate transactions. The gateway reissued requests that timed out, not realizing these transactions were still processing and would eventually complete successfully. This resulted in a conflict between robustness and data integrity requirements.</p>
</li>
</ul>
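<p>For the first incident above, the missing piece was anti-caching headers on the maintenance response. A hedged sketch of what such a servlet filter might look like (illustrative code, not our actual configuration):</p>
<pre>
import java.io.IOException;
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletResponse;

// Active only while maintenance mode is on: no-store ensures browsers
// never keep the maintenance response for the rest of the session.
public class MaintenanceFilter implements Filter {
    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletResponse response = (HttpServletResponse) res;
        response.setHeader("Cache-Control", "no-store, max-age=0");
        response.setHeader("Retry-After", "3600"); // hint: come back in an hour
        response.setStatus(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
        response.setContentType("text/plain");
        response.getWriter().write("Maintenance in progress");
    }
}
</pre>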
<h3>The Heisenbug Complement</h3>
<p>A 'heisenbug' is a type of software bug that seems to alter or vanish when one attempts to study it. This term humorously references the Heisenberg Uncertainty Principle in quantum mechanics, which posits that the more precisely a particle's position is determined, the less precisely its momentum can be known, and vice versa.</p>
<p>Heisenbugs commonly arise from race conditions under high loads or other factors that render the bug's behavior unpredictable and difficult to replicate in different conditions or when using debugging tools. Their elusive nature makes them particularly challenging to fix, as the process of debugging or introducing diagnostic code can change the execution environment, causing the bug to disappear.</p>
<p>I've encountered such issues in various scenarios. For instance, while using a profiler, I observed it inadvertently slowing down threads to such an extent that it hid the race conditions.</p>
<p>On another occasion, I demonstrated to a perplexed developer how simple it was to reproduce a race condition on non-thread-safe resources with just two or three threads running simultaneously. However, he was unable to replicate it in a single-threaded environment.</p>
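<p>A minimal sketch of that demonstration (simplified from memory, not the project's code): a non-thread-safe counter that loses updates with two or three threads but behaves perfectly with one — or under a profiler that slows threads down enough:</p>
<pre>
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RaceDemo {
    private static int counter = 0; // not thread-safe: ++ is read-modify-write

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        for (int t = 0; t < 3; t++) {
            pool.submit(() -> {
                for (int i = 0; i < 100_000; i++) {
                    counter++; // two threads may read the same value: one increment is lost
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        // Almost always prints less than 300000 with 3 threads;
        // single-threaded, the "bug" is unreproducible.
        System.out.println(counter);
    }
}
</pre>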
<h3>The UFO Issue Complement</h3>
<p>A significant number of issues are neither fixed nor fully understood. I'm not referring to bugs that are understood but deemed too costly to fix in light of their severity or frequency. Rather, I'm talking about those perplexing issues whose occurrence is extremely rare, sometimes happening only once.</p>
<p>Occasionally, we half-jokingly attribute such cases to Single-Event Upsets caused by cosmic particles.</p>
<p>For example, in our current application that generates and sends PDFs to end-users through various components, we encountered a peculiar issue a few months ago. A user reported, with a screenshot as evidence, a PDF where most characters appeared as gibberish symbols instead of letters. Despite thorough investigations, we were stumped and ultimately had to abandon our efforts to resolve it due to a complete lack of clues.</p>
<h3>The Non-Existing Issue Complement</h3>
<p>One particularly challenging type of issue arises when it seems like something is wrong, but in reality, there is no actual bug. These non-existent bugs are the most difficult to resolve! The misconception of a problem can come from various factors including: looking in the wrong place (such as the incorrect environment or server), misinterpreting functional requirements, or receiving incorrect inputs from end-users or partner organizations.</p>
<p>For example, we recently had to address an issue where our system rejected an uploaded image. The partner organization assured us that the image should be accepted, claiming it was in PNG format. However, upon closer examination (that took us several staff-days), we discovered that our system's rejection was justified: the file was not actually a PNG.</p>
<h3>The False Hope Complement</h3>
<p>I often find Murphy's Law to be quite cruel. You spend many hours working on an issue, and everything seems to indicate that it is resolved, with the problem no longer reproducible. However, once the solution is deployed in production, the problem reoccurs. This is especially common with issues related to heavy loads or concurrency.</p>
<h3>The Anti-Murphy's Reciprocal</h3>
<p>In every organization I've worked for, I've noticed a peculiar phenomenon, which I'd call 'Anti-Murphy's Law'. Initially, during an application's build and early maintenance phases, Murphy's Law seems to apply. However, after several more years, a contrary phenomenon emerges: even subpar software appears not only immune to Murphy's Law but also more robust than expected. Many legacy applications run glitch-free for years, often with less observability and fewer robustness features, yet they still function effectively. The better the design of an application, the quicker it reaches this state, but even poorly designed ones eventually get there.</p>
<p>I have only some leads to explain this strange phenomenon:</p>
<ul>
<li>
<p>Over time, users become familiar with the software's weaknesses and learn to avoid them by not using certain features, waiting longer, or using the software during specific hours.</p>
</li>
<li>
<p>Legacy applications are often so difficult to update that they experience very few regressions.</p>
</li>
<li>
<p>Such applications rarely have their technical environment (like the OS or database) altered, to avoid complications.</p>
</li>
<li>
<p>Eventually, everything that could go wrong has already occurred and been either fixed or worked around: it's as if Murphy's Law has given up.</p>
</li>
</ul>
<p>However, don't misunderstand me: I'm not advocating for the retention of such applications. Despite appearing immune to issues, they are challenging to update and increasingly fail to meet end-user requirements over time. Concurrently, they become more vulnerable to security risks.</p>
<h2>Conclusion</h2>
<p>Rather than adopting a pessimistic view of Murphy's Law, we should be thankful for it. It drives engineers to enhance their craft, compelling them to devise a multitude of solutions to counteract potential issues. These solutions include robustness, high availability, fail-over systems, redundancy, replays, integrity checking systems, anti-fragility, backups and restores, observability, and comprehensive logging.</p>
<p>In conclusion, addressing a final query: can Murphy's Law turn against itself? A recent incident with a partner organization sheds light on this. They mistakenly sent us data and relied on a misconfiguration in their own API Gateway to prevent this erroneous transmission. However, by sheer coincidence, the API Gateway had been corrected in the meantime, thwarting their reliance on this error. Thus, the answer appears to be a resounding <strong>NO</strong>.</p>
Top Mistakes Made by Product Owners in Agile Projects2023-09-01T00:00:00+00:00https://florat.net/top-mistakes-made-by-product-owners-in-agile-projects/<img src="https://florat.net/assets/images/blog-tech/article-34.jpg" alt="Upside-down house" width="500">
<p>As a Product Owner (PO), your role is crucial in steering an agile project towards success. However, it's equally important to be aware of the pitfalls that can lead to failure. In this blog post, we'll explore the actions that should be avoided to ensure your agile project stays on track and delivers valuable outcomes. It's worth noting that the GIGO (Garbage In - Garbage Out) effect is a significant factor: no good product can come from bad design.</p>
<p>This article has also been published at <a href="https://dzone.com/articles/top-mistakes-made-by-product-owners-in-agile-proje">DZone</a>.</p>
<h2>On Agile and Business Design Skills</h2>
<h3>Lack of Design Methodology Awareness</h3>
<p>One of the initial steps towards failure is disregarding design
methodologies such as <a href="https://www.agilealliance.org/resources/books/user-story-mapping/">Story
Mapping</a>, <a href="https://www.eventstorming.com/">Event Storming</a>, Impact Mapping, or Behavior-Driven Development. Treating these methodologies as trivial or underestimating their complexity or power can hinder your project's progress. Instead, take the time to learn, practice, and seek coaching in these techniques to create well-defined business requirements.</p>
<p>For example, I once worked on a project where the PO practiced Story Mapping without even involving the end-users...</p>
<h3>Ignoring Domain Knowledge</h3>
<p>Neglecting to understand your business domain can be detrimental. Avoid skipping internal training sessions, Massive Open Online Courses (MOOCs), and field observation workshops. Read domain reference books and, more generally, embrace domain knowledge to make informed decisions that resonate with both end-users and stakeholders.</p>
<p>To continue with the previous example, the PO, who was new to the project's domain (despite having basic knowledge of it), missed an entire use case with serious architectural implications due to this lack of domain skills, requiring significant software changes after only a few months.</p>
<h3>Disregarding End-User Feedback</h3>
<p>Overestimating your understanding and undervaluing end-user feedback
can lead to the <a href="https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect">Dunning-Kruger effect</a>. Embrace humility and actively involve end-users in the decision-making process to create solutions that truly meet their needs. Failure to consider real-world user constraints and work processes can lead to impractical designs. Analyze actual and operational user experiences, collect feedback, and adjust your approach accordingly. Don't imagine their requirements and issues but ask actual users who deal with real-world complexity all the time.</p>
<p>For instance, a PO I worked with ignored or postponed many obvious GUI issues raised by end-users, rendering the application nearly unusable. These UX issues included the absence of basic filters on screens, making it impossible for users to find their ongoing tasks. Yet these issues were relatively simple to fix. Conversely, this PO pushed unrequested features and even features rejected by most end-users, such as complex GUI locking options. Furthermore, any attempt to set up tools to collect end-user feedback was dismissed.</p>
<h2>Team Dynamics</h2>
<h3>Centralized Decision-Making</h3>
<p>Isolating decision-making authority within your hands without consulting IT or other designers can stifle creativity and collaboration. Instead, foster open communication and involve team members in shaping the project's direction. The three pillars of empirical process control, as defined in Scrum, are <strong>Transparency</strong>, Inspection, and Adaptation. The essence of an agile team is continuous improvement, which becomes challenging when a lack of trust hinders the identification of real issues.</p>
<p>Some POs unfortunately adopt a "divide and rule" approach, which keeps knowledge and power in their sole hands. I have observed instances where POs withheld information or even released incorrect information to both end-users and developers, and actively prevented any exchange between them.</p>
<h3>Geographical Disconnection</h3>
<p>Geographically separating end-users, designers, testers, the PO, and developers can hinder communication. Leverage modern collaboration tools, but don't rely solely on them. Balance digital tools with face-to-face interactions to maintain strong team connections and enable osmotic communication, which has proven highly efficient in keeping everyone informed and involved.</p>
<p>The worst case I had to deal with was a project where developers were centralized in the same building as the end-users, while the PO and design team were located in another city. Most workshops were done remotely between the two cities. In the end, the design output was very poor. It improved drastically once some designers were finally co-located with the end-users (and developers) and were able to conduct <em>in situ</em> formal and informal workshops.</p>
<h2>Planning and Execution</h2>
<h3>Over-Optimism and Lack of Contingency Plans</h3>
<p>Hope should not be your strategy. Don't oversell features to end-users. Being overly optimistic and neglecting backup plans can lead to missed deadlines and unexpected challenges. Develop robust contingency plans (Plan B) to navigate uncertainties effectively. Avoid promising unsustainable plans to stakeholders: after two or three delays, they may lose trust in the project.</p>
<p>I worked on a project where, over a 1.5-year timeline, the PO announced the main release to stakeholders every two months without consulting the development team. As you can imagine, the effect on the project's image was devastating.</p>
<h3>Inadequate Stakeholder Engagement</h3>
<p>Excluding business stakeholders from demos and delaying critical communications can lead to misunderstandings and misaligned expectations. Regularly engage stakeholders to maintain transparency and gather valuable feedback.</p>
<p>As an illustration, in a previous project, we conducted regular sprint demos; however, we failed to invite end-users to most sessions. Consequently, significant ergonomic issues went unnoticed, resulting in a substantial loss of time. Additionally, within the same project, the Product Owner (PO) organized meetings with end-users mainly to present solutions via fully completed mockups, rather than facilitating discussions to precisely identify operational requirements, which inhibited their feedback.</p>
<h3>Embracing Waterfall Practices</h3>
<p>Thinking in terms of a waterfall approach, rather than embracing iterative development, can hinder progress, especially on a project meant to be managed with agile methodologies. Minimize misunderstandings by providing regular updates to stakeholders. Break features into increments, leverage Proof of Concepts (POC), and prioritize the creation of Minimal Viable Products (MVP) to validate assumptions and ensure steady progress.</p>
<p>As an example, I recently met with end-users for whom a one-year coding 'tunnel' had produced a first application version that was almost unusable and worse than the 20-year-old application we were supposed to rewrite. With re-established communication and end-user involvement, this was fixed in a few months.</p>
<h3>Producing Too Much Waste</h3>
<p>As a designer, avoid creating a large stock of User Stories (US) that will only be implemented months or years later. Doing so works against the Lean principle of fighting the overproduction <em>muda</em> (waste): you produce many specifications at the worst moment (when you know the least about the actual business requirements), and this work will most likely be thrown away.</p>
<p>I had an experience where a PO and their design team wrote US up to a year before they were actually coded and then left them almost unmaintained. As expected, most of them were thrown away or, even worse, caused various flaws and misunderstandings among the development team when finally planned for the next sprint. Most backlog refinements and explanations had to be redone. User stories should be refined to a detailed state only one or two sprints before being coded. However, it's a good practice to fill the backlog sandbox with generally outlined features. The rule of thumb is straightforward: user stories should be detailed as close to the coding stage as possible. When they are fully detailed, they are <em>ready</em> for coding. Otherwise, you are likely to waste time and resources.</p>
<h3>Volatile Objectives</h3>
<p>Try to set consistent objectives at each sprint. Avoid context switching among developers: it leads them to start many different features but never finish any.</p>
<p>To provide an example, in a project where the Product Owner (PO) interacted with multiple partners, priorities were altered every two or three sprints mainly due to political considerations. This was often done to appease the most frustrated partners who were awaiting certain features (often promised with unrealistic deadlines).</p>
<h3>Lack of Planning Flexibility</h3>
<p>Utilize the DevOps methodology toolkit, including tools such as feature flags, dark deployments, and canary testing, to facilitate more streamlined planning and deployment processes.</p>
<p>As an architect, I once had a tough time convincing a PO to use a canary-testing deployment strategy to learn fast and release early while greatly limiting risks. After a resounding failure when opening the application to the entire population, we finally used canary testing and discovered performance and other critical issues with a limited set of volunteer end-users. It is now a key part of the project management toolkit we use extensively.</p>
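<p>Among those tools, a feature flag can be as simple as a guarded branch. Here is a minimal sketch (all names and the bucketing rule are invented for illustration; real projects would typically read the rollout percentage from configuration or a dedicated flag service):</p>
<pre>
// Minimal feature-flag sketch: the rollout percentage changes the
// canary cohort size without requiring a new release.
class FeatureFlags {
    private final int rolloutPercent; // e.g. 5 => 5% canary cohort

    FeatureFlags(int rolloutPercent) {
        this.rolloutPercent = rolloutPercent;
    }

    boolean isEnabledFor(String flag, String userId) {
        // Deterministic bucketing: the same user always lands in the same cohort
        int bucket = Math.floorMod((flag + ":" + userId).hashCode(), 100);
        return bucket < rolloutPercent;
    }
}

class ReportService {
    private final FeatureFlags flags = new FeatureFlags(5); // 5% canary

    String generatePdf(String userId) {
        if (flags.isEnabledFor("new-pdf-engine", userId)) {
            return renderWithNewEngine(userId);  // canary cohort only
        }
        return renderWithLegacyEngine(userId);   // everyone else
    }

    private String renderWithNewEngine(String userId) { return "new PDF"; }
    private String renderWithLegacyEngine(String userId) { return "legacy PDF"; }
}
</pre>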
<h3>Extended Delays Between Deployments</h3>
<p>Even if a product is built incrementally within 2 or 3-week timeframes, many large projects (including all those I've been a part of) tend to wait for several iterations before deploying the software in production. This presents a challenge because each iteration should ideally deliver some form of value, even if it's relatively small, to end-users. This approach aligns with the mantra famously advocated by Linus Torvalds: 'Release early, release often.'</p>
<p>Some Product Owners (POs) are hesitant to push iterations into production, often for misguided reasons. These concerns can include fears of introducing bugs (indicating a lack of automated and acceptance testing), incomplete iterations (highlighting issues with user story estimation or development team velocity), a desire to provide end-users with a more extensive set of features in one go, thinking they'll appreciate it, or an attempt to simplify the user learning curve (revealing potential user experience (UX) shortcomings). In my experience, this hesitation tends to result in the accumulation of various issues, such as bugs or performance problems.</p>
<h2>Design Considerations</h2>
<h3>Solution-First Mentality</h3>
<p>Prioritizing solutions over understanding the business needs can lead to misguided decisions. Focus on the "Why" before diving into the "How" to create solutions that truly address user requirements.</p>
<p>As a bad practice, I've seen User Stories including technical content (like SQL queries) or presenting detailed technical operations or screens as business rules.</p>
<h3>Oversized User Stories</h3>
<p>Designing large, complex user stories instead of breaking them into manageable increments can lead to confusion and delays. Embrace smaller, more focused user stories to facilitate smoother development, predictability in planning, and testing. Inexperienced Product Owners (POs) often find it challenging to break down features into small, manageable User Stories (US). This is a sort of art, and there are <a href="https://www.agilealliance.org/glossary/split/">numerous ways</a> to accomplish it depending on the context. However, it's important to remember that each story should deliver value to end-users.</p>
<p>As an example, in a previous project, the Product Owner (PO) struggled to effectively divide stories or engaged in purely technical splitting, such as creating one User Story (US) for the frontend and another for the backend portion of a substantial feature. Consequently, 50% of the time, this resulted in incomplete User Stories that required rescheduling for the subsequent sprint.</p>
<h3>Neglecting Expertise</h3>
<p>Avoiding consultation with experts such as UX designers, accessibility specialists, and legal advisors can result in suboptimal solutions. Leverage their insights to create more effective and user-friendly designs.</p>
<p>As a case in point, I've observed multiple projects where the lack of a proper user experience (UX) led to inadequately designed graphical user interfaces (GUIs), incurring substantial costs for rectification at a later stage. In specific instances, certain projects demanded legal expertise, particularly in matters of data privacy. Moreover, I encountered a situation where a Product Owner (PO) failed to involve legal specialists, resulting in the final product omitting crucial legal notices or even necessitating significant architectural revisions.</p>
<h3>Ignoring Performance Considerations</h3>
<p>Neglecting performance constraints, such as displaying excessive data on screens without filters, can negatively impact user experience. Prioritize efficient design to ensure optimal system performance.</p>
<p>I once worked on a large project where the Product Owner (PO) requested the computation of a Gantt chart involving tens of thousands of tasks spanning over 5 years. Ironically, in 99.9% of cases, a single week was sufficient. This unnecessarily intricate requirement significantly complicated the design process and resulted in the product becoming nearly unusable due to its excessive slowness.</p>
<h3>Using the Wrong Words</h3>
<p>Failing to establish a shared business language and glossary can create
confusion between technical and business teams. Embrace the Ubiquitous
Language (UL) Domain-Driven Design principle to enhance communication
and clarity.</p>
<p>I once worked on a project where the PO and designers didn't set up any glossary of business terms, used custom vocabulary instead of the business one, and used fuzzy or interchangeable synonyms even for the terms they had coined themselves. This created many issues and much confusion within the team and among end-users, and even led to duplicated work.</p>
<h3>Postponing Legal and Regulatory Considerations</h3>
<p>Late discovery of legal, accessibility, or regulatory requirements can lead to costly revisions. Incorporate these considerations early to avoid setbacks during development.</p>
<p>I observed a significantly large project where the Social Security number had to be eliminated later on. This led to the need for additional transformation tools since this constraint was not taken into account from the beginning.</p>
<h3>Code Considerations Interferences</h3>
<p>Refine business requirements and don't interfere with code organization, which often has its own constraints. For instance, asking the development team to always enforce the reuse (DRY) principle through very generic interfaces comes from a good intention but may greatly overcomplicate the code (which violates the KISS principle).</p>
<p>In a recent project, a Product Owner (PO) who had a background in development frequently complicated the design by explicitly instructing developers to extend existing endpoints or SQL queries instead of creating entirely new ones, which would have been simpler. Many developers followed the instructions outlined in the User Stories (US) without fully grasping the potential drawbacks in the actual implementation. This occasionally resulted in convoluted code and wasted time rather than achieving efficiency gains.</p>
<h2>Acceptance Testing</h2>
<h3>Neglecting Alternate Paths</h3>
<p>Focusing solely on nominal cases (“happy paths”) and ignoring real-world scenarios can result in very incomplete testing. Ensure that all possible paths, including corner cases, are thoroughly tested to deliver a robust solution.</p>
<p>In a prior project, a multitude of bugs and crashes surfaced exclusively during the production phase due to testing being limited to nominal scenarios. This led to team disorganization as urgent hotfixes had to be written immediately, tarnishing the project's reputation and incurring substantial costs.</p>
<h3>Missing Acceptance Criteria</h3>
<p>Leverage the <a href="https://www.agilealliance.org/glossary/three-amigos/">Three Amigos principle</a> to involve cross-functional team members in creating comprehensive acceptance criteria. Incorporate examples in user stories to clarify expectations and ensure consistent understanding. Example mapping is a great workshop to achieve this. Writing down examples ensures several things: first, that you have at least one realistic case for the requirement, proving it is not imaginary; second, listing different cases is a powerful way to make alternate paths emerge and enumerate them more exhaustively (see the previous point); lastly, examples are among the best shared-understanding material you can give to developers.</p>
<p>By way of illustration, when designers began documenting real-life scenarios using Behavior-Driven Development (BDD) executable specifications, numerous alternate paths emerged naturally. This led to a reduction in production issues (as discussed in the previous section) and a gradual slowdown in their occurrence.</p>
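<p>For illustration, here is what examples turned into executable specifications can look like with plain JUnit 5 (a deliberately simplified sketch: the business rule and all values are invented, and a full BDD stack such as Cucumber or Spock would express the same examples in business language):</p>
<pre>
import static org.junit.jupiter.api.Assertions.assertEquals;

import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.CsvSource;

class ShippingFeeSpec {

    // Invented rule under test: 5 EUR domestic, 12 EUR cross-border,
    // free above 100 EUR or for empty orders.
    static double computeFee(double amount, String country) {
        if (amount == 0.0 || amount >= 100.0) {
            return 0.0;
        }
        return "FR".equals(country) ? 5.0 : 12.0;
    }

    // Each row is one concrete example agreed upon in a Three Amigos /
    // example mapping session: nominal cases first, then alternate paths.
    @ParameterizedTest(name = "an order of {0} EUR from {1} ships for {2} EUR")
    @CsvSource({
        "50.00,  FR, 5.00",   // nominal domestic order
        "120.00, FR, 0.00",   // free shipping above 100 EUR
        "50.00,  DE, 12.00",  // cross-border order
        "0.00,   FR, 0.00"    // alternate path: empty order
    })
    void shippingFeeExamples(double amount, String country, double expected) {
        assertEquals(expected, computeFee(amount, country), 0.001);
    }
}
</pre>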
<h3>Lack of Professional Testing Expertise</h3>
<p>Incorporating professional testers and testing tools enhances defect detection and overall quality. Invest in thorough testing to identify issues early, ensuring a smoother user experience. Not using tools also makes it more difficult for external stakeholders to figure out what has actually been tested. Conducting rigorous testing is indeed a genuine skill.</p>
<p>In a previous project, I witnessed testers utilizing basic spreadsheets to record and track testing scenarios. This approach made it difficult to accurately determine what had been tested and what hadn't. Consequently, the Product Owner (PO) had to validate releases without a clear understanding of the testing coverage. Tools like the Open Source <a href="https://www.squashtest.com/product-squash-tm?lang=en">SquashTM</a> are excellent for specifying test requirements and monitoring acceptance-test coverage. Furthermore, the testers were not testing professionals but rather designers, which frequently resulted in challenges when trying to obtain detailed bug reports. These reports lacked precision, omitting crucial information such as the exact time, logs, scenarios, and datasets necessary for effective issue reproduction.</p>
<h2>Take-Away Summary</h2>
<table>
<thead>
<tr>
<th>Symptom<br></th>
<th>Possible Causes and Solutions</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>A solution that is not aligned with end-users' needs.</b><br></td>
<td><b>Ineffective Workshops with End-Users:</b>
<br>
- If workshops are conducted remotely, consider organizing them onsite.<br>
- Ensure you are familiar with agile design methods like Story Mapping.<br><br><br>
<b>Insufficient Attention to End-Users' Needs:</b><br>
- Make sure to understand the genuine needs and concerns of end-users, and avoid relying solely on personal intuitions or managerial opinions.<br>
- Gather end-users' feedback early and frequently.<br>
- Utilize appropriate domain-specific terminology (Ubiquitous Language).<br><br>
</td>
</tr>
<tr>
<td><b>Limited Trust from End-Users and/or Development Team.</b><br></td>
<td><b>Centralized Decision-Making:</b><br>
- Foster open communication and involve team members in shaping the project's direction.<br>
- Enhance transparency through increased communication and information sharing.<br>
<br><br><b>Unrealistic Timelines:</b> <br>
- Remember that "Hope is not a strategy"; avoid excessive optimism.<br>
- Aim for consistent objectives in each sprint and establish a clear trajectory.<br>
- Employ tools that enhance schedule flexibility and ensure secure production releases, such as canary testing.<br>
</td>
</tr>
<tr>
<td><b>Design Overhead.</b><br></td>
<td><b>User story overproduction:</b><br>
- Minimize <i>muda</i> (waste) and refine user stories only when necessary, just before they are coded.
<br><br><b>Challenges in Designer-Development Team Communication:</b><br>
- Encourage regular physical presence of both design and development teams in the same location, ideally several days a week, to enhance direct and osmotic communication.
<br>
- Focus on describing the 'why' rather than the 'how'. Leave technical specifications to the development team. For instance, when designing a database model, you might create the Conceptual Data Model, but ensure the team knows it's not the Physical Data Model.<br>
</td>
</tr>
<tr>
<td><b>Discovery of Numerous Production Bugs.</b><br></td>
<td>
<b>Incomplete Acceptance Testing:</b><br>
- Develop acceptance tests simultaneously with the user stories and in collaboration with future testers.
<br>
- Conduct tests in a professional and traceable manner, involving trained testers who use appropriate tools.
<br>
- Test not only the 'happy paths' but also as many alternative paths as
possible.
<br><br>
<b>Lack of Automation:</b><br>
- Implement automated tests, especially unit tests, and equally important, executable specifications (Behavior-Driven Development) derived from the acceptance tests outlined in the user stories. Explore tools like <a href="https://spockframework.org/">Spock</a>.<br>
</td>
</tr>
</tbody>
</table>
<h2>Conclusion</h2>
<p>By avoiding these common pitfalls, you can significantly increase the chances of a successful agile project. Remember, effective collaboration, clear communication, and a user-centric mindset are key to delivering valuable outcomes. A Product Owner (PO) is a role, not merely a job. It necessitates training, support, and a readiness to continuously challenge our assumptions.</p>
<p>It's worth noting that a project can fail even with good design when blueprints and good coding practices are not followed, but this is an entirely different topic. However, due to the GIGO effect, no good product can ever be released from a bad design phase.</p>
Make Your Jobs More Robust with Automatic Safety Switches2023-08-28T00:00:00+00:00https://florat.net/make-your-jobs-more-robust-with-automatic-safety-switches/<img src="https://florat.net/assets/images/blog-tech/article-35.jpg" alt="Upside-down house" width="500">
<p>This article has also been published at <a href="https://dzone.com/articles/make-your-jobs-more-robust-with-automatic-safety-s">DZone</a>.</p>
<p>In this article, I'll refer to a 'job' as a batch processing program, as defined in <a href="https://jcp.org/en/jsr/detail?id=352">JSR 352</a>. A job can be written in any language but is scheduled periodically to automatically process bulk data, in contrast to interactive processing (CLI or GUI) for end-users. Error handling in jobs differs significantly from interactive processing. For instance, in the latter case, backend calls might not be retried as a human can respond to errors, while jobs need robust error recovery due to their automated nature. Moreover, jobs often possess higher privileges and can potentially damage extensive data.</p>
<p>Consider a scenario: What if a job fails due to a backend or dependency component issue? If a job is scheduled hourly and faces a major downtime just minutes before execution, what should be done?</p>
<p>Based on my experience with various large projects, implementing automatic safety switches for handling technical errors is a best practice.</p>
<h2>Enhancing Failure Handling with Automatic Safety Switches</h2>
<p>When a technical error occurs (e.g., timeout, storage shortage, database failure), the job should attempt several retries (as per best practices outlined below) and halt immediately at the current processing step. It's advisable to record the current step position, allowing for intelligent restarts once the system is operational again.</p>
<p>Only human intervention, after thorough analysis and resolution, should reset the switch. While in a disabled state, any attempt to schedule the job should log that it's inactive and cannot initiate. This is also the opportune moment to create a <a href="https://sre.google/sre-book/postmortem-culture/">post-mortem</a> report, valuable for future failure analysis and potential adjustments to code or configuration for improved robustness (e.g., adjusting timeouts, adding retries, or enhancing input controls).</p>
<p>The switch can then be removed, enabling the job to recommence or complete outstanding steps (if supported) during the next scheduled run. Alternatively, immediate execution can be forced to prevent prolonged downtime delays, especially if job frequency is low. Delaying a job's execution excessively can lead to end-user latency and potential accumulation of such delays, eventually overwhelming the job's capacity.</p>
<h2>Rationale for Automatic Safety Switches</h2>
<ul>
<li>
<p><strong>Prevention of Data Corruption</strong>: They can avert significant data corruption resulting from bugs by halting activity during unexpected states.</p>
</li>
<li>
<p><strong>Error Log Management</strong>: They help prevent system flooding with repetitive error logs (such as database access error stack traces). Uncontrolled log volumes might also exacerbate issues like filesystems filling.</p>
</li>
<li>
<p><strong>Facilitating System Repair</strong>: A system without an automatic safety switch significantly complicates the diagnostic and fixing process. Human operators cannot make decisions with clarity since the system remains enabled and could potentially jam again as soon as it's scheduled.</p>
</li>
<li>
<p><strong>Resource Exhaustion Mitigation</strong>: Continuing periodic jobs during technical errors caused by resource exhaustion (memory, CPU, storage, network bandwidth, etc.) worsens the situation. Automatic safety switches act as circuit breakers, stopping jobs and freeing up resources. After resolving the root problem, operators can restart jobs sequentially and securely.</p>
</li>
<li>
<p><strong>Security Enhancement</strong>: Many attacks, including brute force attacks, SQL injections, or Server Side Injection (SSI), involve injecting malicious data into a system. Such data might be processed later by jobs, potentially triggering technical errors. Stopping the job improves security by forcing human or team analysis of the data. Similarly, halting a job after a timeout can help foil a resource exhaustion-type attack, such as a ReDOS (Regular Expression Denial of Service).</p>
</li>
<li>
<p><strong>Promoting System Analysis</strong>: Organizations that overlook job robustness often allow failed jobs to run in subsequent schedules, adopting a risky approach. Automatic safety switches necessitate human intervention, detecting every failure. This encourages systematic analysis, post-mortem documentation, and long-term improvements.</p>
</li>
<li>
<p><strong>Preventing Excessive Costs</strong>: Implementing a throttling mechanism that pauses operations upon hitting predetermined thresholds, along with an automated safety feature that requires analysis, can protect organizations from incurring significant additional costs due to bugs or intentional attacks when interacting with external systems that incur charges.</p>
</li>
<li>
<p><strong>Code Reuse</strong>: Besides emergency handling, the code written for this purpose can be repurposed to disable a job without altering the scheduling. This is similar to the <code>suspend: true</code> attribute in Kubernetes CronJobs. In a recent project, we utilized this functionality to conveniently initiate job maintenance. By setting the stop flag, the maintenance script then awaits the completion of all jobs.</p>
</li>
</ul>
<h2>Implementing Effective Safety Switches</h2>
<ul>
<li>
<p><strong>Simple Implementation</strong>: The most straightforward approach involves each job, during scheduling, checking for a persistent <code>stop</code> flag. If present, the job exits with a log. The flag can be implemented, for example, through a file, a database record, or a REST API result. For robustness, a <code>stop</code> file per job is preferable, containing metadata like the reason for stopping and the date. This flag is set on technical errors and removed only at a human operator's initiative (using commands like <code>rm</code> or more advanced methods such as a shell script). See the sketch after this list.</p>
</li>
<li>
<p><strong>Coupling with Retrying Mechanism</strong>: Safety switches must work alongside a robust retry solution. Jobs shouldn't halt and require human intervention at the first sign of intermittent issues like database connection saturation or occasional timeouts due to backups slowing down the SAN. Effective systems, such as the Spring Retry library, incorporate exponential backoff with jitter. For instance, setting 10 tries, including the initial call, results in retries spaced exponentially apart (a 1-second interval, then 2 seconds, and so on). This entire process spans 10 to 15 minutes before failing if the root cause isn't resolved within that timeframe. Jitter introduces small random intervals to avoid retry storms where all jobs retry simultaneously (also illustrated in the sketch after this list).</p>
</li>
<li>
<p><strong>Ensure Exclusive Job Launches</strong>: Like any batch processing solution, guarantee that jobs are mutually exclusive—ensuring a new job isn't launched while a previous instance is still running.</p>
</li>
<li>
<p><strong>Business Error Handling</strong>: Business errors (e.g., poorly formatted data) shouldn't trigger safety switches, unless the code lacks defensive measures and unexpected errors arise. In such cases, it's a code bug and qualifies as a technical error, warranting the safety switch trigger and requiring hotfix deployment or data correction.</p>
</li>
<li>
<p><strong>Facilitate Smooth Restarts</strong>: When possible, allow seamless restarts using batch checkpoints, storing the current step, processing data context, or even the presently processed item.</p>
</li>
<li>
<p><strong>Monitoring and Alerting</strong>: Ensure that monitoring and alerting systems are aware of job stoppage triggered by automatic safety switches. For example, email alerts could be sent or jobs could be highlighted in red within a monitoring system.</p>
</li>
<li>
<p><strong>Semi-automatic Restarts</strong>: While we always advocate for thorough system analysis during production issues, there are moments when having jobs halted for human intervention isn't practical, especially during weekends. A middle-ground solution between routine automatic job restarts and a complete halt is to authorize an automatic restart after a predetermined period. In our scenario, we've set up a mechanism to remove the stop flag after 8 hours. This allows the job to try restarting if no human intervention has addressed the issue by then. This approach retains some benefits of an automatic safety switch, such as preventing data corruption or log overflow, but it comes with drawbacks: for instance, it might bypass the systematic analysis and the resulting continuous improvement. Hence, we believe this solution should be implemented judiciously.</p>
</li>
</ul>
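<p>A hedged sketch combining the first two bullets above (paths, retry counts, and class names are illustrative, not a reference implementation; in a Spring project, the hand-rolled retry loop would typically be replaced by Spring Retry):</p>
<pre>
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;

// The scheduler calls run(). A technical error engages the safety switch;
// only a human removes the stop file (e.g., with `rm`) after analysis.
public class NightlyExportJob {
    private static final Path STOP_FLAG = Path.of("/var/run/jobs/nightly-export.stop");

    public void run() {
        if (Files.exists(STOP_FLAG)) {
            System.err.println("Safety switch engaged, job disabled: " + STOP_FLAG);
            return; // natural hook for monitoring/alerting as well
        }
        try {
            processWithRetries();
        } catch (Exception technicalError) {
            engageSafetySwitch(technicalError);
        }
    }

    // Exponential backoff with jitter: 1s, 2s, 4s... plus a random slice
    // to avoid retry storms where all jobs retry at the same instant.
    private void processWithRetries() throws InterruptedException {
        long delayMs = 1_000;
        for (int attempt = 1; ; attempt++) {
            try {
                doOneStep();
                return;
            } catch (TransientBackendException e) {
                if (attempt >= 10) {
                    throw e; // retries exhausted: treat as a real outage
                }
                Thread.sleep(delayMs + (long) (Math.random() * 500));
                delayMs *= 2;
            }
        }
    }

    private void engageSafetySwitch(Exception cause) {
        try {
            Files.createDirectories(STOP_FLAG.getParent());
            Files.writeString(STOP_FLAG, "stopped at " + Instant.now() + ", cause: " + cause);
        } catch (IOException io) {
            throw new UncheckedIOException(io);
        }
    }

    private void doOneStep() { /* bulk processing step goes here */ }

    static class TransientBackendException extends RuntimeException {}
}
</pre>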
<h2>Conclusion</h2>
<p>Automatic safety switches prove invaluable in handling unexpected technical errors. They significantly reduce the risk of data corruption, empower operators to address issues thoughtfully, and foster a culture of post-mortems and robustness improvements. However, their effectiveness hinges on not being overly sensitive, as excessive interventions can burden operators. Thus, coupling these switches with well-designed retry mechanisms is crucial.</p>
Datasets staticity level2023-07-16T00:00:00+00:00https://florat.net/datasets-staticity-level/<h1>Datasets staticity level</h1>
<p>[Article also published <a href="https://dzone.com/articles/datasets-staticity-levels">on DZone</a>.]</p>
<p>A common challenge when designing applications is determining the most suitable implementation based on the frequency of data changes. Should a status be stored in a table to easily expand the workflow? Should a list of countries be embedded in the code or stored in a table? Should we be able to adjust the thread pool size based on the targeted platform?</p>
<p>In a current large project, we categorize datasets based on their staticity level, ranging from very static to more volatile:</p>
<h2>Level 1 : Very static datasets</h2>
<p>Changes to this type of data always involve business rules and impact the code. A typical example is the list of states in a workflow (STARTED, IN_PROGRESS, WAITING, DONE, etc.). The indicative size of this dataset is usually between 2 and 20 entries.</p>
<p>From a technical perspective, it is often implemented as an enumeration (a finite list of literal values like Enumerated Types in PostgreSQL, enums in Java, or TypeScript, for instance). Alternatively, it can be managed as constants or a list of constants.</p>
<p>You can use the following litmus test: "Does any item from this list need to be included in an 'if' statement in the code?".</p>
<p>Changing this type of data requires a new release and/or a Data Definition Language (DDL) change and is not easily administrable.</p>
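<p>A minimal sketch in Java, reusing the workflow states above (the transition rules themselves are invented for the example). The litmus test in action: the values appear in branching logic, so any change to the list requires a new release:</p>
<pre>
public enum WorkflowState {
    STARTED, IN_PROGRESS, WAITING, DONE;

    // The dataset leaks into 'if' statements: this is precisely what
    // makes it a Level 1 (very static) dataset.
    public boolean canTransitionTo(WorkflowState target) {
        if (this == DONE) {
            return false; // terminal state, no way out
        }
        if (this == WAITING && target == DONE) {
            return false; // must resume through IN_PROGRESS first
        }
        return true;
    }
}
</pre>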
<h2>Level 2: Rarely changing datasets</h2>
<p>Think of datasets like a list of countries/states or a list of currencies. These datasets rarely exceed a few tens of entries. We refer to them as "nomenclatures".</p>
<p>From a technical standpoint, they can be managed using a configuration file (JSON/YAML/CSV/properties, etc.) or within a database (a table if using a relational database like PostgreSQL, a document or a list of documents if using a NoSQL Document database like MongoDB, etc.).</p>
<p>It is often a good idea to provide an administration GUI that allows adding, changing, or removing entries of this kind if your budget permits.</p>
<p>These lists are often required to initiate the use of an application, even if the data may change later on. Therefore, it is advisable to package the application with a minimal dataset before its first use. For example, a Liquibase configuration can be released with the application to create a minimal set of countries in the database if it doesn't exist yet. However, be careful to use an idempotent "create if not exists" scheme to avoid conflicting with preexisting data.</p>
<p>Depending on the packaging and technologies used, a change in this type of data may or may not require a new release. If your application includes a mechanism for embedding a minimal dataset (such as a configuration file or a Liquibase or SQL script executed automatically), it will likely require a new release. While this may initially be seen as a constraint, it ensures that your application is self-contained and always operational from its deployment, which is often worthwhile.</p>
<p>When storing nomenclatures in a database, a common strategy is to create a table for each nomenclature (e.g., a table for currencies, a table for countries). If, like us, your application requires a more flexible approach, you can use a single NOMENCLATURE table for each microservice and differentiate the nomenclatures using a simple column (e.g., a NOMENCLATURE name). All nomenclatures are then consolidated in a single technical table, and it is straightforward to retrieve a specific nomenclature using a WHERE clause on the nomenclature name. If you want to maintain an ordering, you can further enhance this approach by assigning an ordinal value to each nomenclature entry.</p>
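<p>As a sketch of this single-table approach (a JPA entity with invented names; the real schema may differ):</p>
<pre>
import jakarta.persistence.Column;
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import jakarta.persistence.Table;

// One technical table holds every nomenclature of the microservice;
// a WHERE clause on nomenclatureName retrieves a single list, and
// ordinal preserves the display ordering within that list.
@Entity
@Table(name = "NOMENCLATURE")
public class NomenclatureEntry {
    @Id
    private Long id;

    @Column(name = "NOMENCLATURE_NAME") // e.g. "COUNTRY", "CURRENCY"
    private String nomenclatureName;

    private String code;    // e.g. "FR", "EUR"
    private String label;   // e.g. "France", "Euro"
    private int ordinal;    // display order within one nomenclature
}
</pre>
<p>A JPQL query such as <code>SELECT n FROM NomenclatureEntry n WHERE n.nomenclatureName = 'COUNTRY' ORDER BY n.ordinal</code> then retrieves one complete, ordered nomenclature.</p>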
<h2>Level 3: Volatile datasets</h2>
<p>Most applications persist large amounts of data, which we refer to as "volatile data". This type of data can involve an unlimited number of records managed by an application, such as user profiles, addresses, or chat discussions.</p>
<p>A change, addition, or removal of a record in this kind of dataset should never require a new release (although backups are still necessary). The code is generally designed to handle such changes in a generic manner rather than on a case-by-case basis.</p>
<p>This type of data is typically not administrable through code changes but is managed through regular front/back-office GUIs or batch programs.</p>
<h2>Summary</h2>
<p>Choosing the appropriate level of staticity is crucial to ensure the maintainability and modifiability of an application and can help avoid potential pitfalls. Using an incorrect solution to handle a particular staticity level can lead to unnecessary integration and release tasks or make the application less maintainable.</p>
<table>
<thead>
<tr>
<th>Level</th>
<th>Change frequency</th>
<th>Indicative size</th>
<th>Administrable?</th>
<th>Change requires a new release?</th>
<th>Technical solution examples</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>low</td>
<td>2-20</td>
<td>no</td>
<td>yes</td>
<td>List of constants, Java enum, Enumerated PostgreSQL type</td>
</tr>
<tr>
<td>2</td>
<td>medium</td>
<td>10-100</td>
<td>yes</td>
<td>Depends on chosen solution</td>
<td>Nomenclature table, configuration file</td>
</tr>
<tr>
<td>3</td>
<td>high</td>
<td>> 100</td>
<td>no</td>
<td>no</td>
<td>Regular database records</td>
</tr>
</tbody>
</table>
Make great architecture diagrams with C4 and Plantuml (2/2)2022-08-10T00:00:00+00:00https://florat.net/make-great-architecture-diagrams-with-c4-and-plantuml-(22)/<pre>
Best practices (see course material)
- numbering
- use characters
- minimize constraints
- use Lay_Distance(x, y, size) inside a zone
- avoid Lay_U...
- use sprites
skinparam linetype polyline
Infrastructure:
https://c4model.com/#DeploymentDiagram
https://sarafian.github.io/tips/2021/03/11/plantuml-tips-tricks-1.html
adding hidden lines: a -[hidden]- b
extending the length of a line: a --- b (more dashes, longer line)
specifying preferred direction of lines (a -left- b)
swapping association ends (a -- b → b -- a)
changing the order of definitions (the order does matter... sometimes)
adding empty nodes with background/border colors set to Transparent
https://crashedmind.github.io/PlantUMLHitchhikersGuide/index.html
https://crashedmind.github.io/plantuml.github.io/
a -[norank]-> b
replace several arrows with a single one at the VM level
the '-' in Lay_: customize it; from time to time, try without it
colored/solid/dashed lines
BackendClient -[hidden]- Logging
AddRelTag("backup", $textColor="orange", $lineColor="orange", $lineStyle = DashedLine())
separate static/dynamic diagrams + imports
1) play with the layout
LAYOUT_LEFT_RIGHT()
2) play with Rel_U...
' to the API and queues
Rel_U(rece_requetehubee_batch, api_hubee, "HTTPS")
one single change at a time
number the links
</pre>Architecture as Code with C4 and Plantuml2022-06-10T00:00:00+00:00https://florat.net/architecture-as-code-with-c4-and-plantuml/<h1>Architecture as Code with C4 and Plantuml</h1>
<img src="https://florat.net/assets/images/blog-tech/28-diag-4.svg" alt="Illustration">
<p>(This article has also been <a href="https://dzone.com/articles/architecture-as-code-with-c4-and-plantuml">published</a> at DZone)</p>
<h2>Introduction</h2>
<p>I'm lucky enough to currently work on a large microservices-based project as a solution architect. I'm responsible for designing different architecture views, each targeting a very different audience, hence different concerns:</p>
<ul>
<li>The <strong>application view</strong> dealing with modules and data streams between them (targeting product stakeholders and developers)</li>
<li>The <strong>software view</strong> (design patterns, database design rules, choice of programming languages, libraries...) that developers should rely upon;</li>
<li>The <strong>infrastructure view</strong> (middleware, databases, network connections, storage, operations...) providing useful information for integrators and DevOps engineers;</li>
<li>The <strong>sizing view</strong> dealing with performance;</li>
<li>The <strong>security view</strong>, which is mainly transversal.</li>
</ul>
<hr>
<p><strong>ℹ️ NOTE</strong></p>
<p>We use this <a href="https://github.com/bflorat/architecture-document-template">Open Source Template</a> to document our architecture.</p>
<hr>
<p>Our current project architecture is fairly complex because of the number of modules (tens of jobs, API and GUI modules), because of the large number of external partners and because of its integration with a large legacy information system.</p>
<p>At this time, we have to maintain more than one hundred architecture diagrams. Following a <a href="https://leanpub.com/livingdocumentation">living documentation</a> approach, we adapt and augment diagrams, text and tables several times a day. As we will see later, it's often a collaborative process taking advantage of several great tools.</p>
<h3>The Sample Application</h3>
<p>We illustrate this article with a fictional <em>AllMyData</em> microservices application. This is a .gov web application enabling any company to get all the information about it known to public administrations.</p>
<p>We can split our feature "Deliver Companies Data" into two main call chains:</p>
<ul>
<li>The first call chain is made of the GUI requests that create requests in the system.</li>
<li>The second one is made of a job launched periodically and consuming new requests. It gathers data about the company both from a local repository and from another administration IS (Information System), produces a PDF report and sends an e-mail to the company's original requester.</li>
</ul>
<h2>The C4 Model</h2>
<p>We use the <a href="https://c4model.com/">C4 model</a> to represent our architecture. It is beyond the scope of this tooling article to describe it in depth, but I invite you to have a look at this very pragmatic approach. I find it very natural for designing complex architectures. It leverages the UML2 standard and provides a great dichotomy between high-level concerns and code-level ones.</p>
<hr>
<p><strong>ℹ️ NOTE</strong></p>
<p><a href="https://www.opengroup.org/archimate-forum">Archimate</a> could be another good fit for us but is probably overkill in our context of very low modelization adoption and knowledge. Also, we like the C4 KISS/low-tech approach that takes many human psychological criteria into account. Note that some Archimate tools support C4 diagrams using some mapping between concepts. I am not sure it is a good idea to mix both, though.</p>
<hr>
<p>In our context, we currently use three main C4 diagram types (note that C4 and UML2 contain others not listed here):</p>
<ul>
<li><strong>System landscape diagrams</strong> provide a very high-level view of the system. We use them to describe the general application architecture.</li>
</ul>
<img src="https://florat.net/assets/images/blog-tech/28-diag-5.png" alt="System landscape sample" width="600">
<ul>
<li><strong>Container diagrams</strong> are used to describe the middleware, databases, and many other technical components as well as data streams between them. They are similar to UML2 deployment diagrams but more natural in my opinion. In the application view, we mainly display modules and databases and in the infrastructure view, we drill down into technical devices like reverse proxies, load balancers, cluster details, etc. We also use C4 dynamic diagrams, very similar to container diagrams but including call numbering.</li>
</ul>
<img src="https://florat.net/assets/images/blog-tech/28-diag-6.png" alt="Container diagram sample">
<ul>
<li><strong>Various UML2 diagrams</strong> (sequence, activity, classes). We use them sparingly and only to express a pattern or something especially important or complex, but certainly not for ordinary code.</li>
</ul>
<p><img src="https://github.com/bflorat/architecture-document-template/raw/master/diagrams/roles.svg" alt=""></p>
<hr>
<p><strong>ℹ️ NOTE</strong></p>
<p>I'm quite reluctant to use the C4 <em>container</em> term because of the risk of confusion with Docker/OCI containers (as pointed out by Simon Brown, the C4 creator). In our organization, we prefer to call them <em>deployable units</em>; the C4 model encourages terminology adaptation. A C4 container is basically a separate deployable process. The <a href="https://c4model.com/#ContainerDiagram">C4 documentation</a> states: "Essentially, a container is a separately runnable/deployable unit (e.g. a separate process space) that executes code or stores data".</p>
<p>In the C4 model, a <em>container</em> can contain one or more software <em>components</em>. This concept doesn't refer to infrastructure components but to large pieces of code (like a set of Java classes). We barely use C4 components in our architecture document because we don't really need to go into that level of detail (our hexagonal architecture makes things simple to design and understand just by reading the code, and our agile approach makes us prefer limiting the design documentation we have to maintain).</p>
<hr>
<h2>Plantuml</h2>
<p><a href="https://plantuml.com/en/">Plantuml</a> is an impressive tool that instantly generates diagrams from a very simple textual DSL (Domain-Specific Language).</p>
<p>For instance, this very short text:</p>
<pre><code>@startuml
[Browser] -> [API Foo]: HTTPS
@enduml
</code></pre>
<p>...is enough to produce this diagram:</p>
<img src="https://florat.net/assets/images/blog-tech/28-diag-7.png" alt="Plantuml diagram sample">
<p>Plantuml comes with hundreds of features and syntax goodies, sometimes undocumented and evolving very quickly. I suggest <a href="https://plantuml-documentation.readthedocs.io/en/latest/">this website</a> as a clear and exhaustive reference.</p>
<p>Check out some real-world examples <a href="https://real-world-plantuml.com/">here</a>.</p>
<h3>Plantuml Combined With C4</h3>
<p>Plantuml component diagrams can be customized as C4 diagrams using <a href="https://github.com/plantuml-stdlib/C4-PlantUML">this extension library</a>.</p>
<p>Just import it at the top of your Plantuml diagrams and use C4 macros:</p>
<pre><code>@startuml
!include https://raw.githubusercontent.com/plantuml-stdlib/C4-PlantUML/master/C4_Container.puml
!include <tupadr3/devicons2/chrome>
!include <tupadr3/devicons2/java>
!include <tupadr3/devicons2/postgresql>
LAYOUT_LEFT_RIGHT()
Container(browser, "Browser","Firefox or Chrome", $sprite="chrome")
Container(api_a, "API A","Spring Boot", $sprite="java")
ContainerDb(db_a, "Database A","Postgresql", $sprite="postgresql")
Rel(browser,api_a,"HTTPS")
Rel_R(api_a,db_a,"pg")
@enduml
</code></pre>
<p>is exported as:</p>
<img src="https://florat.net/assets/images/blog-tech/28-diag-4.svg" alt="Plantuml diagram sample">
<hr>
<p><strong>ℹ️ NOTE</strong></p>
<ul>
<li>
<p>Always export diagrams in SVG format to allow unlimited zooming. This is appreciable when dealing with large diagrams.</p>
</li>
<li>
<p>We use the latest online version here, but you may prefer to use a statically downloaded version in air-gapped mode.</p>
</li>
</ul>
<hr>
<h3>Diagrams Factorization</h3>
<p>A great thing about Plantuml is the factorization capabilities using the <code>!include</code> and <code>!includesub</code> <a href="https://plantuml.com/en/preprocessing">preprocessor</a> directives.</p>
<p>It is possible to include local or remote diagrams (i.e., starting with <code>@startuml</code> and ending with the <code>@enduml</code> directive). For instance, C4 macros are included using this instruction:</p>
<pre><code>!include https://raw.githubusercontent.com/plantuml-stdlib/C4-PlantUML/master/C4_Container.puml
</code></pre>
<p>More interestingly, it is also possible to import diagram fragments (i.e., starting with <code>!startsub</code> and ending with the <code>!endsub</code> directive):</p>
<p>File <code>fragments.iuml</code>:</p>
<pre><code>!startsub dmz
!include https://raw.githubusercontent.com/plantuml-stdlib/C4-PlantUML/master/C4_Container.puml
!include <tupadr3/devicons2/chrome>
!include <tupadr3/devicons2/java>
Container(browser, "Browser","Firefox or Chrome", $sprite="chrome")
Container(api_a, "API A","Spring Boot", $sprite="java")
!endsub
!startsub intranet
!include https://raw.githubusercontent.com/plantuml-stdlib/C4-PlantUML/master/C4_Container.puml
!include <tupadr3/devicons2/postgresql>
ContainerDb(db_a, "Database A","Postgresql", $sprite="postgresql")
!endsub
!startsub extranet
!include https://raw.githubusercontent.com/plantuml-stdlib/C4-PlantUML/master/C4_Container.puml
!include <tupadr3/devicons2/postgresql>
ContainerDb(db_b, "Database B","Postgresql", $sprite="postgresql")
!endsub
</code></pre>
<p>File <code>diags-1.puml</code>:</p>
<pre><code>@startuml use-case-1
' We only include context-related sub-diagrams
!includesub fragments.iuml!dmz
!includesub fragments.iuml!intranet
Rel(browser,api_a,"HTTPS")
Rel_R(api_a,db_a,"pg")
@enduml
</code></pre>
<h3>Filtering Unlinked Containers</h3>
<p>Since mid-2020, Plantuml supports a game-changing <a href="https://forum.plantuml.net/11052/remove-unlinked-components">feature</a> for software architects: the <code>remove @unlinked</code> directive. It <strong>keeps in a C4 diagram only the containers that call or are called, and drops all the others</strong>.</p>
<p>This feature (along with the diagram fragment capabilities) was a requirement to achieve the diagram patterns described below.</p>
<h3>Sprites</h3>
<p>Thousands of <a href="https://github.com/tupadr3/plantuml-icon-font-sprites/blob/master/devicons/index.md">sprites</a> are available to decorate the C4 containers. They are now embedded directly into the latest Plantuml releases. They include Devicons, Font-Awesome, Material, Office, Weather and many other icon libraries. Most software, hardware, network and business-oriented icons are ready to use out of the box!</p>
<p>From my experience, <strong>using sprites inside C4 containers makes the diagrams airier</strong> and thus more pleasant to read. Maybe it helps our brain identify the nature of each container faster?</p>
<p>Note that even if you can use different background colors to differentiate C4 containers based on a specific criterion (for instance, I use a light grey for external APIs), we recommend using sprites instead to represent their nature: it makes cleaner diagrams, and the default blue color is fine in most cases.</p>
<h3>Plantuml IDE Plugins</h3>
<p>Plantuml is a very versatile technology that can be used in many <a href="https://plantuml.com/running">different contexts</a> including:</p>
<ul>
<li>
<p>A simple base64 encoded URL like <code>https://www.plantuml.com/plantuml/uml/SoWkIImgAStDuL9GK8XsAielBqujYbNGjLE8TWpmL73BpuzLi5Bm20a92EPoICrB0Qe40000</code>;</p>
</li>
<li>
<p>Inside a Word processor like LibreOffice or Word;</p>
</li>
<li>
<p>From programming languages like Groovy, Java or Python;</p>
</li>
<li>
<p>In most IDEs like IntelliJ IDEA thanks to <a href="https://plugins.jetbrains.com/plugin/7017-plantuml-integration">this plugin</a>;</p>
</li>
<li>
<p>Or in Eclipse with <a href="https://github.com/hallvard/plantuml">this plugin</a>;</p>
</li>
<li>
<p>But my own favorite is the <a href="https://marketplace.visualstudio.com/items?itemName=jebbs.plantuml">VScode plugin</a>. Among other features, it supports generating multiple diagrams from a single <code>.puml</code> file as well as across multiple <code>.puml</code> files. It can be finely tuned.</p>
</li>
</ul>
<img src="https://florat.net/assets/images/blog-tech/vscode-plantuml.png" alt="VSCode Plantuml plugin" width="600">
<h2>Architecture as Code</h2>
<p>A very nice side effect of the IDE Plantuml integration is that you can not only create diagrams much faster, freed from the arrangement chore, but also write them as you code. Diagrams can be automatically generated and refreshed as you type.</p>
<h3>Mob Designing</h3>
<p>This kind of tooling enables what I would call <em>Mob design</em>. Especially at the beginning of our project, but still today, we brainstorm about the software architecture. Using Plantuml and a large shared screen, it is very convenient to create and compare several architecture scenarios.</p>
<p>"What if the API <code>A</code> is called directly by the client <code>B</code>?" Or "Should it be called asynchronously by the job <code>J</code>?" ...</p>
<p>In the same manner that end-users truly need to visualize screen mockups, developers and architects think better in front of diagrams. This also greatly limits misunderstandings induced by the limitations and numerous ambiguities of natural languages.</p>
<h3>Inventory and Dependencies Diagrams</h3>
<p>As a blueprint we use the <code>!include</code> and/or <code>!includesub</code> directives to separate:</p>
<ul>
<li>
<p><strong>Inventory diagrams</strong> show static elements of the architecture (classified into different network zones and represented by boundaries) but don't display <em>relations</em> between them. They are useful to answer questions like "What does zone <code>xyz</code> contain?" or "Which modules make up system <code>xyz</code>?". They are particularly useful in the application view to clearly display the system modules of complex microservices architectures, or in the infrastructure view to represent the nodes of each network zone and their deployable units. This kind of diagram uses <strong>C4 container diagrams</strong>.</p>
</li>
<li>
<p><strong>Dependencies diagrams</strong> leverage the static diagrams but augment them with calls between the containers. Inventory diagrams can be used alone, but dependencies diagrams have to import the inventory diagram. They should answer questions like "Which module/container is called by X?" or "Which modules/containers does X call?". They are also helpful for impact studies: "What's the impact if I change <code>API X</code>?".</p>
</li>
</ul>
<p>Example of an inventory diagram:</p>
<p>File <code>inventory.puml</code>:</p>
<pre><code>@startuml
header Inventory diagram
!include https://raw.githubusercontent.com/plantuml-stdlib/C4-PlantUML/master/C4_Container.puml
!include <tupadr3/devicons2/chrome>
!include <tupadr3/devicons2/java>
!include <tupadr3/devicons2/postgresql>
!include <tupadr3/devicons2/nginx_original>
!include <tupadr3/devicons2/react_original>
!include <tupadr3/devicons2/android>
!include <tupadr3/devicons2/groovy>
!include <tupadr3/material/queue>
!include <tupadr3/material/mail>
!include <tupadr3/devicons2/dot_net_wordmark>
!include <tupadr3/devicons2/oracle_original>
!include <office/Concepts/web_services>
skinparam linetype polyline
HIDE_STEREOTYPE()
SHOW_PERSON_PORTRAIT()
System(client, "Client") {
Container(spa, "SPA allmydata-gui", "Container: javascript, React.js", "Graphical interface for requesting information", $sprite="react_original")
Container(mobile, "AllMyData mobile application", "Container: Android", "Graphical interface allowing to request information", $sprite="android")
}
Enterprise_Boundary(organisation, "System organisation B") {
Container_Ext(saccounting, "Accounting system", "REST service", $sprite="web_services")
}
Enterprise_Boundary(si, "Information System") {
Container(static_resources, "allmydata-gui Web Application", "Container: nginx", "Delivers static resources (js, html, images ...)", $sprite="nginx_original")
Container(sm, "allmydata-api", "Container: Tomcat, Spring Boot", "REST service allowing to request information", $sprite="java")
Container(crep, "Companies repository", "Container", "SOAP webservice providing data about companies known by administration A", $sprite="dot_net_wordmark")
ContainerDb(crep_db, "companies-repository-db", "Container: SqlServer", "Stores companies data",$sprite="oracle_original")
Container(batch, "allmydata-batch", "Container: groovy", "Process requests, launched by cron every minute", $sprite="groovy")
ContainerQueue(queue, "requests-queue", "Container: RabbitMQ", "Stores requests", $sprite="queue")
ContainerDb(amd_db, "allmydata-db", "Container: PostgreSQL", "Stores requests history and status",$sprite="postgresql")
Container(sreporting, "service-reporting-pdf", "Container: Tomcat, JasperReport", "Reporting REST service", $sprite="java")
Container(smails, "mail server", "Container: Postfix", "Send emails", $sprite="mail")
}
@enduml
</code></pre>
<img src="https://florat.net/assets/images/blog-tech/28-diag-1.png" alt="Inventory diagram sample" width="600">
<p>Example of dependency diagram (importing its inventory counterpart and adding a person and a bunch of calls):</p>
<p>File <code>dependencies.puml</code>:</p>
<pre><code>@startuml dependencies
header Dependencies diagram
!include inventory.puml
Rel(client, static_resources, "HTTPS")
Rel(spa,sm,"REST call","HTTPS")
Rel(sm,queue,"AMQP")
Rel(sm,amd_db,"psql")
Rel(batch, queue, "AMQP")
Rel_R(batch, saccounting, "HTTPS")
Rel(batch, sreporting,"HTTP")
Rel(batch, smails, "SMTP")
remove @unlinked
@enduml
</code></pre>
<img src="https://florat.net/assets/images/blog-tech/28-diag-2.png" alt="Dependencies diagram sample" width="800">
<h3>Dynamic Diagrams to Describe Call Chains</h3>
<p>Once we have provided the system's big picture using both an inventory and a dependencies view, we describe the detailed architecture of each main feature using a third kind of C4 diagram: <strong><a href="https://c4model.com/#DeploymentDiagram">C4 dynamic diagrams</a></strong>. C4 container and dynamic diagrams are very similar, but the latter comes with automatic call numbering.</p>
<hr>
<p><strong>ℹ️ NOTE</strong></p>
<ul>
<li>
<p>Some may prefer good old UML2 sequence diagrams for complex interactions. In most cases, I find the C4 dynamic diagrams easier to read when dealing with container interactions.</p>
</li>
<li>
<p>When working on complex code design, we rather use UML2 sequence diagrams.</p>
</li>
</ul>
<hr>
<p>C4 dynamic diagrams target developers. They detail calls or data streams between C4 containers involved in the context of a given feature, hence providing a <strong>detailed view of each call chain</strong>.</p>
<p>The <em>feature</em> term should be understood in its agile sense (it fulfills a stakeholder need). It can be something like "Allow an enterprise to access its data online" or "Pay for an order".</p>
<p>This kind of diagram can still contain zones or boundaries (already available in the inventory or dependencies diagrams), thus setting up the call chain in a more global context.</p>
<p>The feature architecture leverages one or more call chains, and a call chain is made of a group of ordered calls or actions (like calling an API, writing a file to disk, etc.) <strong>all performed synchronously</strong>. Any further call is referenced in the next call chain.</p>
<img src="https://florat.net/assets/images/blog-tech/28-diag-8.png" alt="C4 dynamic diagram sample">
<hr>
<p><strong>ℹ️ NOTE</strong></p>
<ul>
<li>
<p>By 'synchronous', we mean a set of activities sharing the same logical "transaction". A technically asynchronous call (as when using <a href="https://en.wikipedia.org/wiki/Reactive_programming">reactive programming</a>) still counts as part of the same call chain. On the contrary, when a call chain produces a message as part of an Event-Driven Architecture, the consumption and processing of this event by another module are NOT counted in the same call chain, even if the production and the consumption of the event are technically almost instantaneous.</p>
</li>
<li>
<p>When considered helpful, we augment the diagrams with some textual context (using AsciiDoc) before or after the diagram, but this text should be concise, not redundant with the diagram itself. Call chain diagrams are, however, often sufficient in themselves.</p>
</li>
</ul>
<hr>
<p>We leverage inventory diagrams fragments and unlinked container filtering explained before to achieve an effective Architecture As Code pattern.</p>
<p>File call chain <code>deliver-1.puml</code> (note the <code>remove @unlinked</code> usage here):</p>
<pre><code>@startuml deliver-1.puml
!include inventory.puml
!include https://raw.githubusercontent.com/plantuml-stdlib/C4-PlantUML/master/C4_Dynamic.puml
' For call chains, we advise to put a header (displayed by default at the upper-right
' side of the diagram) to ease its identification.
header deliver-1
Person_Ext(company, "Company", "[person] \nWeb client (PC, tablet, mobile)")
Rel(client, static_resources, "Visit https://allmydata.gouv", "HTTPS (R)")
Rel(client, spa, "Retrieves information via")
Rel(spa,sm,"REST call","HTTPS (W)")
RelIndex(LastIndex()-1,sm,queue,"Produces a request message to the queue","AMQP (W)")
RelIndex(LastIndex()-2,sm,amd_db,"Stores the request data","JDBC (W)")
increment()
' Remove all C4 containers imported from inventory.puml file but not involved
' in this call chain to make the diagram much cleaner
remove @unlinked
@enduml
</code></pre>
<img src="https://florat.net/assets/images/blog-tech/28-diag-9.svg" alt="C4 dynamic diagram sample">
<hr>
<p><strong>ℹ️ NOTE</strong></p>
<p>It is <strong>paramount to standardize call chain naming</strong> (like <code>deliver-1</code>, <code>pay-3</code>, ...) because it becomes a strong vector of communication between developers and business analysts. It is then possible to talk using canonical names like <code>deliver-1 3-1</code>, for instance. This is a massive misunderstanding killer and time saver, and one of the main benefits of this methodology.</p>
<p>I suggest simply using the <code><feature>-<incrementing number></code> naming scheme.</p>
<hr>
<p>File call chain <code>deliver-2.puml</code> (note the 'remove @unlinked' usage here):</p>
<pre><code>@startuml deliver-2.puml
!include inventory.puml
header deliver-2
Rel(sm,amd_db,"JDBC CRUD calls","psql")
Rel(batch, queue, "Consume each request message", "AMQP (R)")
Rel(batch, amd_db, "Read various very interesting data about the requester company", "JDBC (R)")
Rel(batch, saccounting, "Get more interesting data from the Accounting system", "HTTPS (R)")
Rel(batch, sreporting, "Produces a great PDF including great pie charts", "HTTP (W)")
Rel(batch, smails, "Send an e-mail to original requester with the attached PDF", "SMTP (W)")
Rel(batch, amd_db, "Store the request data (date, final status...)", "JDBC (W)")
remove @unlinked
@enduml
</code></pre>
<img src="https://florat.net/assets/images/blog-tech/28-diag-10.svg" alt="C4 deliver-2 diagram">
<hr>
<p><strong>ℹ️ NOTE</strong></p>
<p>Each call should detail used network protocols along with a modifier flag (<code>R</code>: Read, <code>W</code>: Write, <code>E</code>:Execute). These flags are important to figure out the call intention. More than a single flag on the same call is possible.</p>
<hr>
<p>In our context, these call chain diagrams provide enough architectural details to code the application. They are the only design documentation we write before actually coding. Apart from them, the real (and best) documentation is the (clean) code itself.</p>
<h2>Conclusion</h2>
<p>I hope this introduction has aroused your curiosity about coding architectures using Plantuml and C4. A future article will provide our diagramming best practices and some Plantuml useful tips in an architectural context, keep in touch!</p>
<p>I will finish with a personal feeling that I can't formally demonstrate but have observed many times: <strong>the graphical "harmony" of an architectural diagram is directly proportional to its intrinsic quality</strong>. It is therefore possible to form a first opinion of a complex architecture with just a glimpse of the main diagram on the wall...</p>
<p>In the same vein, <strong>dependencies diagrams highlight the strategic modules and reflect the balance of power hidden behind the architecture</strong> (as predicted by <a href="https://en.wikipedia.org/wiki/Conway%27s_law">Conway's Law</a>).</p>
Designing Human-Targeted Random IDs2022-04-10T00:00:00+00:00https://florat.net/designing-human-targeted-random-ids/<h1>Designing Human-Targeted Random IDs</h1>
<p>Article also published <a href="https://dzone.com/articles/designing-human-targeted-random-ids">on DZone</a>.</p>
<hr>
<p><strong>ℹ️ NOTE</strong>
We don't deal here with the technical IDs used as primary keys in relational databases. See my previous article <a href="https://florat.net/how-to-do-uuid-as-primary-keys-the-right-way">here</a> if you seek a great way to generate them.</p>
<hr>
<h2>Context</h2>
<p>During one of my recent projects, I was asked to design a scheme of IDs highly usable by humans. The business requirement was mainly to create pseudo-random values that <strong>can't be inferred or guessed</strong>, to be used as a secret token printed on some official documents for future controls.</p>
<p>Later on, we had a similar requirement with lower security concerns: generating <strong>human-readable file numbers</strong> that can be printed on associated documents, verbalized on phone or typed when doing searches.</p>
<p>Another well-known example (in France at least) is the ID (aka "SNCF number") attached by the French railway company to each train booking, so one can easily open any travel details from a smartphone without being fully authenticated.</p>
<h2>Main Criteria</h2>
<p>After comparing existing solutions and analyzing the business stakeholders' requirements, these criteria emerged:</p>
<ul>
<li>
<p>These IDs have to be <strong>short</strong> to be easily typed, read, or verbalized on the phone by a human (no more than six to ten characters).</p>
</li>
<li>
<p>They have to include mechanisms that <strong>prevent and detect typos</strong>.</p>
</li>
<li>
<p>They <strong>don't have to be unique</strong> (and can't be, because of their small size and thus limited variability). However, the system has to prevent collisions, either by coupling these IDs with other values (like a person's last name) or by retrying when a shuffled value already exists (the solution we use). Keep in mind that closed items may share the same ID (when searching by ID, for instance, make sure to take the status into account).</p>
</li>
<li>
<p>When possible, avoid generating offending terms or acronyms (like F*** in English). We haven't actually searched for a solution so far, but maintaining a dictionary per targeted language seems the best bet (thanks to Rumen Dimov for his feedback).</p>
</li>
</ul>
<h2>How To Make These Values Truly Usable?</h2>
<ul>
<li>
<p><strong>Limit the number of possible characters</strong>: go beyond base-10 (decimal) digits by adding lowercase and uppercase letters, but avoid other characters (punctuation marks, diacritics, ...) that are more difficult to read. Hence, in theory, we can generate numbers made of up to 10 digits + 26 lowercase ASCII letters + 26 uppercase ASCII letters = base-62 numbers.</p>
</li>
<li>
<p>Ease typing and reading as much as possible: the number should be composed of <strong>no more than four or five characters</strong> easily memorized as a whole, like <code>aGty3</code>. If longer, split the ID using hyphens (avoid underscores, which can be difficult to read when used in a hyperlink).</p>
</li>
<li>
<p>Make sure that these values can be <strong>easily pasted</strong> using a single command into clearly separated text fields.</p>
</li>
</ul>
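<p>As a concrete illustration, here is a minimal sketch in Java (the class name and grouping parameters are hypothetical) generating such an ID from a CSPRNG, using the base-56 alphabet discussed in the next section:</p>
<pre><code>import java.security.SecureRandom;

public class HumanIdGenerator {

    // Base-62 minus the six confusing characters '0', '1', '2', 'l', 'O', 'Z' = 56 characters
    static final String ALPHABET =
        "3456789abcdefghijkmnopqrstuvwxyzABCDEFGHIJKLMNPQRSTUVWXY";
    private static final SecureRandom RANDOM = new SecureRandom();

    // generate(3, 4) returns an ID like "aTy5-fTkr-p9z3"
    public static String generate(int groups, int groupSize) {
        StringBuilder sb = new StringBuilder();
        for (int g = 0; g < groups; g++) {
            if (g > 0) {
                sb.append('-');
            }
            for (int i = 0; i < groupSize; i++) {
                sb.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
            }
        }
        return sb.toString();
    }
}
</code></pre>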
<h2>How To Prevent And Detect Typos?</h2>
<ul>
<li>
<p><strong>Exclude <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3541865/">confusing characters</a></strong>. Keep in mind that similarity also depends on the font used: an 'l' can easily be distinguished from a '1' in a plain old monospace font, but less so in a sans-serif one. We advise excluding the most problematic cases: 'O' and '0' (zero), 'Z' and '2', or 'l' and '1'. By dropping these characters, we now deal with base-56 numbers.</p>
</li>
<li>
<p>Reserve some bits as a <strong><a href="https://en.wikipedia.org/wiki/Cyclic_redundancy_check">CRC</a> or checksum</strong> in order to detect most typos early on the frontend. Such systems have been used by banks for decades on IBAN accounts, for instance (using the MOD97 algorithm). Users will thank you for notifying them early, and this GUI-side surface control avoids issuing useless server-side queries and ugly error logs on the backend.</p>
</li>
</ul>
<hr>
<p><strong>ℹ️ NOTE</strong>
A lightweight CRC solution can't detect all possible typos, but it catches most of them.</p>
<hr>
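<p>To make the checksum idea concrete, here is a deliberately simplistic sketch in Java (a hypothetical helper, reusing the base-56 alphabet from the generation sketch above): a position-weighted sum appended as a final check character. It detects all adjacent transpositions of distinct characters and most single-character typos, but a real system should prefer a proven scheme such as Luhn mod N or MOD97.</p>
<pre><code>public final class CheckChar {

    // Same base-56 alphabet as in the generation sketch above
    private static final String ALPHABET =
        "3456789abcdefghijkmnopqrstuvwxyzABCDEFGHIJKLMNPQRSTUVWXY";

    // Position-weighted checksum; simplistic on purpose
    public static char compute(String id) {
        int sum = 0;
        int position = 1;
        for (char c : id.toCharArray()) {
            int value = ALPHABET.indexOf(c);
            if (value >= 0) { // skip hyphens and other separators
                sum += position * value;
                position++;
            }
        }
        return ALPHABET.charAt(sum % ALPHABET.length());
    }

    // GUI-side surface control: recompute and compare with the last character
    public static boolean isValid(String idWithCheckChar) {
        int last = idWithCheckChar.length() - 1;
        return compute(idWithCheckChar.substring(0, last)) == idWithCheckChar.charAt(last);
    }
}
</code></pre>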
<h2>What About Security?</h2>
<ul>
<li>
<p>If these human-readable IDs are used in serious matters dealing with money, security, or official documents, make sure to use a <strong>cryptographically secure pseudorandom number generator</strong> (CSPRNG) to generate the numbers that you will then convert to your base-56 value. For instance, when using a Linux server, make sure to use <code>/dev/random</code> and not <code>/dev/urandom</code>. This will greatly reduce the risk of collisions (generating the same value twice in a short amount of time).</p>
</li>
<li>
<p>The ID <strong>length should be proportional to the required difficulty of guessing it</strong>.</p>
</li>
</ul>
<h2>Some Examples Please</h2>
<p>Imagine you only want to avoid '0'/'O' and '1'/'l' confusions and you want to generate IDs with a collision risk as low as 1/2.6×10¹⁷; you can generate numbers (using a CSPRNG) like:</p>
<p><code>aTy2-5fTk-rp9z</code></p>
<p>or</p>
<p><code>bUD5-64kP-hlA4</code></p>
<p>For less critical use cases, fewer characters may be enough:</p>
<p><code>aTy2-5fTk</code></p>
<p>or</p>
<p><code>64kP-hlA4</code></p>
<p>For short-lived and low-risk IDs, see what the SNCF does for travel files (only six capital letters):</p>
<p><code>XSDTGE</code></p>
<h2>Conclusion</h2>
<p>Generating readable random IDs for humans can easily be achieved, but a bunch of requirements must be taken into account. Their <strong>scheme has to vary according to the targeted usage</strong>, but keep in mind that <strong>changing an existing scheme is cumbersome</strong> and can require maintaining several ID schemes for a long time. I hope this article will help you think about the not-so-obvious criteria, making it easier to design them right on the first attempt. I would be glad to get feedback if I have forgotten important or obvious points.</p>
How to Do UUID as Primary Keys the Right Way2021-12-28T00:00:00+00:00https://florat.net/how-to-do-uuid-as-primary-keys-the-right-way/<p>(This article has been also <a href="https://dzone.com/articles/uuid-as-primary-keys-how-to-do-it-right">published</a> at DZone)</p>
<h1>How to Do UUID as Primary Keys the Right Way</h1>
<p><img src="https://florat.net/assets/images/blog-tech/uuid-1.jpg" alt="UUID"></p>
<p><strong>TL;DR: UUID V4 or its COMB variant are great to mitigate various security, stability, and architectural issues, but be aware of various pitfalls when using them.</strong></p>
<h2>Why Do We Need Technical IDs in the First Place?</h2>
<p>Any properly designed relational database table owns a <strong>Primary Key (PK)</strong> allowing you to <strong>uniquely and stably identify</strong> each record. Even if the primary key can be composite (built from several columns), it is a widespread good practice to dedicate a special column (often named <code>id</code> or <code>id_<table name></code>) to this end. This special column is used to technically identify records and can be used as a foreign key in relations.</p>
<hr>
<p><strong>ℹ️ NOTE</strong></p>
<p><strong>Do not confuse technical</strong> (also named "surrogate") keys <strong>with business keys</strong>. The most important tables (so-called entities in domain-driven design) may contain an alternate human-readable ID column (like the customer ID "<code>G2F6D</code>"). In this article, we will focus only on technical PKs. They should only be processed and readable by machines, not humans.</p>
<hr>
<h3>How to Choose Good IDs</h3>
<p>Good IDs are:</p>
<ul>
<li><strong>Unique</strong> (no collisions). This is enforced by the UNIQUE constraint automatically added by RDBMS on every PK.</li>
<li><strong>Not reusable</strong>. Reusing a deleted row's PK is technically possible in relational databases but is a very bad idea because it contributes to generating confusion (for example, an older log file can reference an ID reused in the meantime by a new entity, thus leading to false deductions).</li>
<li><strong>Meaningless</strong>. Developers should not parse IDs for the wrong reasons. Imagine an ID starting with the current year. Some developers may ignore that a <code>date_creation</code> column exists and will only rely on the PK's first four digits. If the ID format changes or is buggy (because of bad timezone handling, for instance), some subtle issues may arise. Even if this is largely debated, I would <strong>warn against using natural keys as PKs altogether</strong>. It <strong>may limit your options in the future</strong>. For example, if you use an e-mail address as a PK, you implicitly forbid its modification in future releases: never say "never." Another problem with natural keys is the <strong>difficulty of ensuring unicity</strong> due to functional issues, even when everything has been done to avoid them. I once worked for a French governmental agency and observed both issues in different projects: 1) Legacy code relied on the first digit of the NIR (social identity number) to get the person type, thus ignoring possible type reassignments (though the current type was available as a dedicated column); 2) We recently discovered that this unique ID was not so unique (for example, an ID shared temporarily by several members of an immigrant family, or collisions following city mergers). <strong>The real world is just too complex to make any assumption about unicity</strong>.</li>
</ul>
<h3>Which ID Format to Choose?</h3>
<p>The two formats matching these rules are AFAIK:</p>
<ol>
<li>Auto-incremented integers (starting at <code>1</code> or any larger value: <code>1</code>, <code>100</code>, <code>154555</code>). Most RDBMS like PostgreSQL or Oracle provide the <code>SEQUENCE</code> object, allowing a value to be auto-incremented while respecting the ACID principles. MySQL provides the <code>AUTO_INCREMENT</code> attribute.</li>
<li>Using a text-based random UUID V4 (universally unique identifier), also referred to as GUID (globally unique identifier) by Microsoft. Example: <code>9d17210c-2d5f-11ea-978f-2e728ce88125</code>.</li>
</ol>
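<p>A quick sketch of both options in Java (the <code>customer_id_seq</code> sequence name is illustrative; note that <code>java.util.UUID.randomUUID()</code> is backed by a cryptographically strong random generator):</p>
<pre><code>import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.UUID;

public class IdExamples {

    // Option 1: an auto-incremented ID produced by a PostgreSQL sequence
    static long nextFromSequence(Connection connection) throws SQLException {
        try (Statement st = connection.createStatement();
             ResultSet rs = st.executeQuery("SELECT nextval('customer_id_seq')")) {
            rs.next();
            return rs.getLong(1);
        }
    }

    // Option 2: a random UUID V4 generated by the application itself,
    // without any round trip to the database
    static UUID nextRandomUuid() {
        return UUID.randomUUID();
    }
}
</code></pre>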
<p>When working on existing projects, I often observe that PKs are designed as auto-incremented integers. Though this may be considered an obvious, no-brainer choice, <strong>it may be a bad idea in the long run...</strong></p>
<p>Let's consider both options. Each argument is provided along with an importance weight <em>(from 1=minor to 5=major)</em>.</p>
<h2>Why You Should Use Auto-Incremented Integers</h2>
<ul>
<li>
<p>[importance 3] <strong>Known, understood, and simple solution</strong>. Leverages sequences on modern RDBMS. Comes with very little possibility of performance issues due to bad design.</p>
</li>
<li>
<p>[importance 1] Pretty <strong>easy to read and verbalize</strong> when the number of digits remains reasonable (but, as stated before, technical PK should not be used by humans anyway — use a business key instead).</p>
</li>
</ul>
<h2>Why You Should Avoid Auto-Incremented Integers</h2>
<h3>Risks to Introduce Bugs</h3>
<ul>
<li>[importance 2] In some RDBMS like PostgreSQL, a <code>nextval</code> operation is not truly transactional: in rollback cases, the value is <a href="https://www.postgresql.org/message-id/CAM3SWZQMfR6Zfe3A0Nr4ddko8xZrijAuQQ%3DEcGjGeJSs2piAXA%40mail.gmail.com">still incremented</a>. It is hence possible to <strong>get "holes" (absent values) in PK sequences</strong>. This is not an issue in itself, as unicity is preserved, but it can induce subtle bugs if developers rely on the PK to count the number of items instead of using a proper <code>COUNT</code> query.</li>
<li>[importance 2] Likewise, developers may rely on PK values to sort items chronologically instead of using a dedicated <code>date</code> column. In case of sequence reset and ID reuse, this may lead to <strong>wrong sort orders</strong>.</li>
<li>[importance 1] The key can become huge, and if developers used an <code>int</code> variable to map the PK instead of a long one, you can encounter <strong>silent overflow errors</strong>. For instance, in Java, if you map a PK to an <code>Integer</code> or primitive <code>int</code> and the PK gets larger than <code>2,147,483,647</code>, the variable will silently wrap to the opposite (negative) value.</li>
</ul>
<h3>Security Risks When Using Auto-Incremented Integers as PK</h3>
<p>Using them clearly makes your application an easier target:</p>
<ul>
<li>[importance 3] Auto-incremented integers <strong>leak the number of items processed per unit of time</strong>. A competitor can easily deduce how many sales you made in a month. Or an attacker can get a good idea of how many requests your system is supposed to handle and finely tune a DDOS attack.</li>
<li>[importance 5] Auto-incremented integers are <strong>predictable</strong>. If used in URLs, they become a <strong><a href="https://owasp.org/www-community/attacks/Path_Traversal">traversal directory</a></strong> (also referred to as a "path traversal") <strong>exploit vector</strong>. For example, an HTTP GET URL can easily be forged from a regular URL path (<code>https://.../1234/...</code>) to another one (like <code>https://.../1235/...</code>). If the application implements proper access management, the attacker will get a <code>403</code> code (as expected), but if that is not the case or if some endpoints have been forgotten, he can get sensitive data. The <strong>defense-in-depth principle promotes several layers of controls and never relies on a single one.</strong></li>
<li>[importance 4] Similarly, auto-incremented integers make <strong>large bulk data downloads</strong> possible (in case of bad access controls). It is trivial to write scripts scraping over IDs (like a curl inside a for loop in bash).</li>
</ul>
<h3>Architectural Issues</h3>
<ul>
<li>[importance 4] Auto-incremented integers <strong>make the integration of two systems more difficult</strong>. Imagine that your company buys another one and you have to merge an existing customer database into yours using an ETL. If both systems use auto-incremented integers as PKs, you will have to avoid collisions by resetting sequences to new, not-already-used values. All foreign keys (FK) will have to be recomputed.</li>
<li>[importance 2] I think that from an architectural viewpoint, <strong>a database should only store data, not create it</strong>. With sequences, we leave to the RDBMS the creation logic.</li>
<li>[importance 1] With sequences, we mix inserts and data generation (<code>insert into ... values nextval('id_seq')</code>), and we have to keep the new value returned by the <code>INSERT</code> clause if we want to use it in the following queries. A creation function returning a value does <strong>not appear very logical</strong> to me. It is also possible to perform a <code>SELECT nextval('id_seq')</code> followed by an <code>INSERT</code> clause, but having to read something in order to create something doesn't seem more logical to me either...</li>
</ul>
<h3>Operations Risks</h3>
<ul>
<li>[importance 1] When using integers as keys for every entity, it is much <strong>easier to confuse an item with another</strong> (for instance confusing <code>request_id=10</code> with <code>article_id=10</code>).</li>
<li>[importance 1] When deleting an item, an operator can <strong>confuse a value with another</strong> (<code>delete ... where id=4</code> instead of <code>delete ... where id=40</code> for instance). This problem doesn't affect UUID as it is virtually impossible to type a matching UUID by chance.</li>
</ul>
<h2>The Other Way: Random UUID</h2>
<p>The alternative approach is to use UUID (RFC 4122, ISO/IEC 9834-8:2005) version 4 or variants.</p>
<h3>UUID Pitfalls</h3>
<ul>
<li><strong>Using UUID V1, V2</strong>: only the V4 (random value) version of UUID is acceptable. UUIDs based on timestamps (V1, V2) and the MAC address may lead to collisions at very high generation frequencies (within the same millisecond) but, worse, they leak important data (generation time and machine identification data). That could help attackers or give bad ideas to developers (see above why IDs should be meaningless).</li>
<li>Using the <strong>wrong database type:</strong> Most modern RDBMS come with a <code>UUID</code> type. In PostgreSQL, a UUID uses 128 bits of storage size, not 288 as we may infer naively from a UUID textual format.</li>
<li><strong>Changing your mind</strong>: if you went with integers, stick with them on existing projects.</li>
<li><strong>Not using a cryptographically-secure pseudorandom number generator</strong> (CSPRNG)<strong>:</strong> you <em>will</em> encounter collisions and create security flaws. When using a low-quality or buggy pseudorandom generator, the collision risk is very high; collisions may occur several times a day or even an hour. Under Linux, use <code>/dev/random</code> and not <code>/dev/urandom</code>.</li>
<li><strong>Using a CSPRNG but blocking your application when entropy is exhausted:</strong> If using <code>/dev/random</code> under Linux, a great solution is to use the <a href="https://www.digitalocean.com/community/tutorials/how-to-setup-additional-entropy-for-cloud-servers-using-haveged">haveged</a> daemon to feed the CSPRNG.</li>
</ul>
<h3>UUID Misconceptions</h3>
<ul>
<li>"Using UUIDs requires that you check for collisions." As explained on this <a href="https://en.wikipedia.org/wiki/Universally_unique_identifier#Collisions">Wikipedia page</a>, <strong>the risk of collision is so infinitesimal that it can be ignored</strong>. There is a 50% collision probability every 2.71E18 generations (if you generate 10 IDs per second without stopping, you can expect a 50% probability of collision every 8.5 billion years). The sole control I would advise is to correctly trap SQL errors, as a collision would throw a UNIQUE constraint violation error. Any good code would handle this type of technical error and retry anyway (see the sketch after this list). Real-world production databases already throw erratic SQL errors on a regular basis (like <code>ObjectOptimisticLockingFailureException</code> when using Hibernate, for instance), so the work is probably already done, or it should be.</li>
<li>"UUID is more <strong>difficult to read and verbalize.</strong>" As explained before, UUID is by no way meant for humans. Instead, <strong>use additional functional values</strong> for this. When well designed, they would be better than long integers. But UUID is often read by developers as well (when working on test doubles for instance). I observed in several projects that even then, UUID readiness is not an issue and no developers complained about it. We even figured out that transmitting UUID between developers (by instant messaging for instance) is safer than transmitting integers because nobody would type them and <strong>copy/paste prevents typos</strong>.</li>
</ul>
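<p>A minimal sketch of such a defensive retry in Java/JDBC (the <code>customer</code> table is hypothetical; SQLSTATE <code>23505</code> is the standard unique-violation code on PostgreSQL):</p>
<pre><code>import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.UUID;

public class CustomerDao {

    private static final String UNIQUE_VIOLATION = "23505"; // SQLSTATE: unique constraint violation

    // Retries once with a fresh UUID in the (astronomically unlikely) case
    // of a PK collision; any other SQL error is propagated as usual.
    public UUID insertCustomer(Connection connection, String name) throws SQLException {
        SQLException lastError = null;
        for (int attempt = 0; attempt < 2; attempt++) {
            UUID id = UUID.randomUUID();
            try (PreparedStatement ps = connection.prepareStatement(
                    "INSERT INTO customer(id, name) VALUES (?, ?)")) {
                ps.setObject(1, id);
                ps.setString(2, name);
                ps.executeUpdate();
                return id;
            } catch (SQLException e) {
                if (!UNIQUE_VIOLATION.equals(e.getSQLState())) {
                    throw e;
                }
                lastError = e;
            }
        }
        throw lastError;
    }
}
</code></pre>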
<hr>
<p><strong>ℹ️ NOTE</strong></p>
<p>NoSQL databases do not rely on integers as keys but on UUID (see MongoDB <code>_id</code> or CouchDB <code>id</code> attributes on documents for instance). This is due to their distributed nature, but I have never heard developers complain about it.</p>
<hr>
<h3>UUID V4 Advantages</h3>
<ul>
<li>[importance 5] Ensures a total <strong>non-significance</strong>. URLs containing PK are totally unpredictable. This <strong>prevents various exploits</strong> like path traversal or mass data downloads.</li>
<li>[importance 3] Greatly <strong>reduces the complexity of integration between databases</strong>.</li>
<li>[importance 2] <strong>Prevents all potential bugs and operations errors</strong> listed above.</li>
<li>[importance 1] No more sequence required: the business code can generate UUID by itself <strong>without using the database</strong>.</li>
<li>[importance 2] The <strong>code is easier to test</strong> because it is trivial to mock UUID without any RDBMS and their sequences features.</li>
<li>[importance 2] Most <strong>languages, frameworks, and tools support them</strong>.</li>
</ul>
<h3>UUID Real But Negligible Issues</h3>
<ul>
<li>UUID uses <strong>more space on disk and in memory</strong> (buffers). On most databases, a long uses 64 bits whereas a UUID takes 128 bits. On a large database, it only adds 8 MB for every million items.</li>
<li>There can be an impact on <strong>INSERT latency</strong>. Inserting one million rows into a PostgreSQL database takes about 25 seconds using UUID V4 versus about 6 seconds with integers. This is noticeable only for very write-oriented workloads.</li>
<li>SQL queries require <strong>more CPU cycles</strong> to be performed because of the key size (two cycles for 128 bits vs a single one for 64-bit integers). In practice, the overhead is negligible.</li>
<li>In some very rare cases (when containing only digits), UUIDs can be confused by a badly parameterized WAF (web application firewall) with <strong>credit card numbers</strong>. Think about it when using F5 ASM, for instance.</li>
</ul>
<h3>UUID V4 Real Issues and How to Fix Them</h3>
<ul>
<li>[importance 3] The UUID V4 looks fairly simple to implement but <strong>requires a minimal amount of skills and knowledge</strong>. If your team lacks tech leaders/software architects and has no idea of how to get a good source of randomness or of the difference between UUID versions, go for auto-incremented integers — it may save you from painful refactorings.</li>
<li>[importance 4] On most RDBMS, using genuine UUID V4 on large databases is not appreciated by DBAs because it <strong>fragments indexes</strong>, hence slowing them down when refreshing and during queries. If too fragmented, indexes have to be loaded entirely into memory, generating <a href="https://kccoder.com/mysql/uuid-vs-int-insert-performance/">important performance issues</a> if they don't fully fit into RAM and if disks have to be accessed.</li>
<li>[importance 2] Another <strong>performance issue deals with journals, caused by fragmentation</strong>. Defragmentation (<code>REINDEX</code> or <code>VACUUM</code>) can become much slower, and data replication (when enabled) can be impacted if it relies on journals. On PostgreSQL, this phenomenon is called "WAL write amplification" by DBAs. Note, however, that the storage hardware has a large impact on performance when dealing with this issue: SSD and NVMe disks, which handle random data access by design, greatly mitigate it.</li>
</ul>
<p>These last two performance issues can easily be <strong>fixed using a UUID V4 variant: the <a href="https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/">UUID short prefix COMB</a></strong>. COMB means "combined" because it mixes UUID V4 randomness with a hint of time. Its principle is to "sacrifice" two bytes of randomness and use them as a rolling sequence based on the current time (epoch value) with minute-wide precision. Every minute, the prefix is incremented (it will thus run through all values from <code>0000</code> to <code>FFFF</code> in about 45 days). A sample sequence of such UUIDs could be:</p>
<pre><code>2fe8-6aca-f113-4ef4-8b69-1b5de35d0832
2fe8-ec69-7acc-4cff-91c9-f658b331ee67
2fe9-8b94-993f-4176-9991-1f9e778a79a0 <- note the minute-wide increment
2fe9-b041-d0de-4552-b6b5-449a8ee32134
2fe9-da35-ce9d-4d4a-90e5-c2a4c89f18c7
2fe9-...
</code></pre>
<p>This way, UUID PKs induce far less fragmentation and index performances are similar to the ones observed with auto-incremented integers.</p>
<p>Several implementations exist. If you use Java, check this <a href="https://github.com/f4b6a3/uuid-creator#short-prefix-comb-non-standard">library</a> to generate PKs from your application code.</p>
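<p>A short usage sketch, assuming the <code>UuidCreator.getShortPrefixComb()</code> entry point documented in that library's README:</p>
<pre><code>import com.github.f4b6a3.uuid.UuidCreator;

import java.util.UUID;

public class CombIdGenerator {

    // Short Prefix COMB: the two leading bytes derive from the current time
    // (minute-wide precision), the remaining bytes are random as in a UUID V4
    public static UUID newId() {
        return UuidCreator.getShortPrefixComb();
    }

    public static void main(String[] args) {
        // IDs created within the same minute share the same two-byte prefix
        System.out.println(newId());
        System.out.println(newId());
    }
}
</code></pre>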
<p>If you prefer letting the RDBMS create the UUID itself, several implementations exist (like this PostgreSQL <a href="https://github.com/tvondra/sequential-uuids">extension</a>) but this adds a bit of complexity to install and configure the RDBMS.</p>
<p>We observed a few minor drawbacks with this method though:</p>
<ul>
<li>It is a bit <strong>more difficult for developers to distinguish UUIDs</strong>, as they start with the same bytes when created within a short amount of time. They have to check the last bytes.</li>
<li><strong>Loss of entropy slightly increases the chance of collisions</strong>, as only 12 bytes out of 14 are now truly shuffled. However, the two-byte rolling prefix still adds nonnegligible entropy based on the current time. If we estimate that we actually lose only a single byte of entropy, the collision risk is still negligible: you now have a 50% chance of getting a collision every 1.05E16 generated UUIDs. If you continuously generate 10 UUIDs per second, you have a 50% chance of getting a collision every 33.5 million years.</li>
<li>If PKs have to be generated by an ETL (typically during a migration process), replacing the built-in standard UUID V4 generator with a short prefix COMB <strong>may require a few lines of code and/or some integration work</strong>. For instance, for PENTAHO, we had to integrate a Java <a href="https://github.com/f4b6a3/uuid-creator#short-prefix-comb-non-standard">library</a> into the stream.</li>
</ul>
<h2>Other Interesting Resources on This Topic</h2>
<ul>
<li><a href="https://en.wikipedia.org/wiki/Universally_unique_identifier">https://en.wikipedia.org/wiki/Universally_unique_identifier</a></li>
<li><a href="https://medium.com/@Mareks_082/auto-increment-keys-vs-uuid-a74d81f7476a">https://medium.com/@Mareks_082/auto-increment-keys-vs-uuid-a74d81f7476a</a></li>
<li><a href="https://www.clever-cloud.com/blog/engineering/2015/05/20/why-auto-increment-is-a-terrible-idea/">https://www.clever-cloud.com/blog/engineering/2015/05/20/why-auto-increment-is-a-terrible-idea/</a></li>
<li><a href="https://www.informit.com/articles/printerfriendly/25862">https://www.informit.com/articles/printerfriendly/25862</a></li>
<li><a href="https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/">https://www.2ndquadrant.com/en/blog/sequential-uuid-generators/</a></li>
</ul>
A first glimpse of production constraints for developers2021-05-24T00:00:00+00:00https://florat.net/a-first-glimpse-of-production-constraints-for-developers/<p>(This article has been also <a href="https://dzone.com/articles/first-glimpse-production-constraints-for-developers">published</a> at DZone)</p>
<p><img src="https://florat.net/assets/images/blog-tech/glimpse-production.jpg" alt="Sahara and sandbox"></p>
<p>In most organizations, developers are <strong>not allowed to access the production environment</strong> for stability, security, or regulatory reasons. Restricting access to production is quite a good practice (enforced by many frameworks like COBIT or ITIL), but a major drawback is the <strong>mental distance created between developers and the real world</strong>. Likewise, monitoring is usually managed only by operators, and very little feedback is provided to developers except when they have to fix application bugs (ASAP, of course). As a matter of fact, most developers have <strong>very little idea of what a real production environment looks like</strong> and, more importantly, of the non-functional requirements allowing them to <strong>write production-proof code</strong>.</p>
<p>Involving developers in resolving production issues is a good thing for two main reasons:</p>
<ul>
<li>It is <strong>highly motivating</strong> to get tangible evidence of a real running system on a large infrastructure (data centers, clusters, SAN...) and to get insights about performance or business facts about their applications (number of transactions per second, number of concurrent users, and so on). It is also very common for developers to feel overwhelmed, as they are rarely called directly when an outage occurs.</li>
<li>It may <strong>substantially improve the quality of the delivered code</strong> by helping to properly design operational aspects like logs, monitoring, performance, and integration.</li>
</ul>
<h2>So, What Do Developers Misunderstand Most Often About Production?</h2>
<h3>Concurrency Is Omnipresent</h3>
<p>Production is <strong>highly concurrent</strong> while development is mostly single-threaded. Concurrency can happen among threads of a process (of an application server, for instance) but also among different processes running locally or on other machines (e.g., among <em>n</em> instances of an application server running across different nodes). This concurrency can generate various issues like starvation (slow-downs when waiting concurrently for a shared resource), deadlocks, or scope issues (data overriding).</p>
<p><strong>What can I do?</strong></p>
<ul>
<li>Perform <strong>minimal stress tests</strong> on the DEV environment or even on your own machine using injectors like JMeter or Gatling. When using frameworks like Spring, make sure to understand and correctly apply scoping best practices (for instance, don't use a Singleton with a state; see the sketch after this list).</li>
<li><strong>Simulate concurrency</strong> using breakpoints or sleeps in your code and check the context of each stalled thread.</li>
</ul>
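<p>As a sketch of the stateful-singleton pitfall mentioned in the list above (a hypothetical class): <code>java.text.SimpleDateFormat</code> is not thread-safe, so sharing one instance across requests silently corrupts results under concurrent load.</p>
<pre><code>import java.text.SimpleDateFormat;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.util.Date;

// A typical concurrency bug: the formatter below is mutable state
// shared by all threads using this (singleton) service.
public class ReportService {

    private final SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd"); // NOT thread-safe

    public String formatShared(Date date) {
        return format.format(date); // fails erratically under concurrent calls
    }

    // Fix: use a stateless, thread-safe alternative
    public String formatSafe(Date date) {
        return date.toInstant()
                   .atZone(ZoneId.systemDefault())
                   .format(DateTimeFormatter.ISO_LOCAL_DATE);
    }
}
</code></pre>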
<h3>Volume Is Huge: You Must Add Limits</h3>
<p>In production, <strong>everything is XXXL</strong> (number of log lines written, number of RPC calls, number of database queries...). This has major implications on performance but also on operability. For instance, writing an <code>Entering function x</code>/<code>Leaving function x</code> type of log can help in development but will flood the Ops team with GiBs of logs in production. Likewise, when dealing with monitoring, make sure to make alerts usable. If your application generates tens of alerts every day, nobody will notice them anymore after a few days.</p>
<p>Keep in mind this metaphor: <strong>If your DEV environment is a sandbox, production is the Sahara</strong></p>
<p>In production, real users or external systems will massively stress your application. If (for instance) you don't set a maximum size for attachment files, you will soon get network and storage issues (as well as CPU and memory issues as collateral damage). Many limits can be set at the infrastructure level (like circuit breakers in API Gateways), but most of them have to be coded into the application itself.</p>
<p><strong>What can I do?</strong></p>
<ul>
<li>
<p><strong>Make sure nothing is 'open bar'</strong>: always paginate results from databases (using <code>OFFSET</code> and <code>LIMIT</code> for instance, or <a href="https://use-the-index-luke.com/sql/partial-results/fetch-next-page">using the seek method</a>, as sketched after this list), restrict input data sizes, set timeouts on any remote call, ...</p>
</li>
<li>
<p><strong>Think carefully about logs</strong>. Perform operability acceptance tests with real operators and Site Reliability Engineers (SRE).</p>
</li>
</ul>
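<p>A minimal sketch of the seek method in Java/JDBC (the <code>request</code> table is hypothetical): the last key of the previous page acts as a cursor, so every page costs the same whatever its position, unlike ever-growing <code>OFFSET</code> scans.</p>
<pre><code>import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class RequestDao {

    // Seek pagination: filter on the last seen key instead of using OFFSET
    public List<String> nextPage(Connection connection, long lastSeenId, int pageSize)
            throws SQLException {
        String sql = "SELECT id, label FROM request WHERE id > ? ORDER BY id LIMIT ?";
        try (PreparedStatement ps = connection.prepareStatement(sql)) {
            ps.setLong(1, lastSeenId);
            ps.setInt(2, pageSize);
            try (ResultSet rs = ps.executeQuery()) {
                List<String> labels = new ArrayList<>();
                while (rs.next()) {
                    labels.add(rs.getString("label"));
                }
                return labels;
            }
        }
    }
}
</code></pre>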
<h3>Production Is Distributed and Redundant</h3>
<p>While in DEV most components (like an application server and a database) run inside the same node, they are usually <strong>distributed in production (i.e., some network link exists between them)</strong>. The network is very slow in comparison with local memory (at scale, if a local CPU instruction takes one second, a LAN network call takes a full year).</p>
<p>In DEV, the instantiation factor is 1: every component is instantiated only once. In any production environment having to deal with serious high availability, performance or fail-over requirements, every component is redundant. There are not only servers but <em>clusters</em>.</p>
<p><strong>What can I do?</strong></p>
<ul>
<li>Don't hardcode URLs or make <strong>assumptions about the co-location</strong> of components (I have already seen code where the <code>localhost</code> hostname was hardcoded)</li>
<li>If possible, <strong>reduce the dev/prod gap</strong> by using, from your own workstation, a locally distributed system like a local Kubernetes cluster (see K3S for instance).</li>
<li>Even if this kind of issue should be detected in the integration testing environment, try to <strong>keep in mind that your code will eventually run concurrently</strong> on several threads and even nodes. This has implications on tuning the number of datasource connections, among other considerations.</li>
<li>Always favor <strong>stateless architectures</strong>.</li>
</ul>
<h3>Anything Can Happen in Production</h3>
<p>One of the most common sentences I heard from developers dealing with a production issue is "This is impossible, this can't happen". But it does actually. <strong>Due to the very nature of the production</strong> (high concurrency, unexpected behaviors of users, attacks, hardware failures...), <strong>very strange things can and will happen</strong>.</p>
<p>Even after serious postmortem studies, the <strong>root cause of a significant proportion of production issues will never be diagnosed or solved</strong> (from my own experience, in about 10% of the cases). Some latent defects occur only on a combination of exceptional events. Some bugs can happen once in 10 years or even, by chance (or misfortune?), never occur during the entire application lifetime. Small story: I was recently faced with a bug in a Node.js job that occurred about once every 10K runs (when a randomly generated password contained an unescaped double-dollar character sequence).</p>
<p>Check out any production log and you will probably see erratic errors here and there (this is rather scary, trust me).</p>
<p><strong>Preventing expected issues is a good thing, but truly good code should also control and correctly handle the unexpected</strong></p>
<p>Hardware or network failures are very common: network micro-cuts can occur (see the <a href="https://en.wikipedia.org/wiki/Fallacies_of_distributed_computing">8 Fallacies of Distributed Computing</a>), servers can crash and filesystems can fill up.</p>
<p><strong>Don't trust data coming from other modules</strong>, even yours. For example, an integration error can make a module call a deprecated version of your API. You may also get corrupted files, with wrong encodings or wrong dates for instance. <strong>Don't even trust your own database</strong> (add as many constraints as possible, like <code>NOT NULL</code> or <code>CHECK</code>): corrupted data can appear due to bugs in previous module versions, migrations, administration script issues, stale transactions, integration errors on encoding or timezones... Let any application run for several years and perform some data sanity checks against your own database: you may be surprised.</p>
<p><strong>Users and external batch systems should be treated as <a href="https://en.wikipedia.org/wiki/Infinite_monkey_theorem">monkeys</a></strong> (with all due respect).</p>
<p>Don't rely on human processes but <strong>assume they can do <em>anything</em></strong>. For instance, here are two common <a href="https://www.urbandictionary.com/define.php?term=PEBCAK">PEBCAK</a> problems I observed recently on front-end parts:</p>
<ul>
<li>
<p>Double submit (some users double-clicking instead of single clicking). Some REST RPC calls are hence done twice and concurrency oddities occur in the backend;</p>
</li>
<li>
<p>Private navigation: for some reason, users switch to this mode and strange things happen (like local data being lost or browser extensions being disabled).</p>
</li>
</ul>
<p>Most of the time, users will never admit or figure out this kind of error. They can also use the wrong browser, use a personal machine instead of a professional one, open the webapp twice in several tabs, and do many other things you would never imagine.</p>
<p><strong>What can I do?</strong></p>
<ul>
<li>Make your <strong>code as robust as possible</strong>, write anti-corruption layers and <a href="https://florat.net/proper-strings-normalization-for-comparison-purpose/">normalize strings</a>. When parsing data, check time formats and encodings (if using <a href="https://en.wikipedia.org/wiki/Hexagonal_architecture_(software)">hexagonal architecture</a>, perform these controls as early as possible in the 'in' adapters).</li>
<li>Add <strong>as many constraint checks in your database</strong> as possible. Don't just rely on the domain layer code.</li>
<li>When possible, instead of writing your own controls, <strong>rely on a shared contract</strong> (like a JSON Schema or an XSD).</li>
<li>Think about <strong>retries, robust error handling, double-submission protection, replays</strong> from save points in batch jobs, ...</li>
<li>When writing your tests, think about as many <strong>border-line or apparently impossible cases</strong> as possible.</li>
<li>Use <strong>chaos-engineering</strong> tools (like Simian Army) that generate errors randomly to test your code resiliency.</li>
<li>Think about what to do with <strong>rejected data</strong>.</li>
<li>To <strong>deal with human errors</strong>, identify problematic users, book a meeting, and observe them using your application before asking any direct question, to avoid leading them.</li>
<li>Build a <strong>realistic testing dataset</strong> and maintain it. Add new data as soon as you become aware of a special case you didn't consider before. Manage these datasets like your code (versioning, cleanup, refactoring, documentation...).</li>
<li><strong>Don't ignore weak signals</strong>. When something strange happens in development, it will probably happen in production as well, and it will be far worse there.</li>
<li>When fixing an issue, make sure to <strong>identify all the places where it can occur</strong> and don't only fix it in the place where you localized it.</li>
<li><strong>Add clever logs</strong> in your code (see the sketch after this list). A clever log comes with:
<ul>
<li>A canonical identifier in the message (an error code like <code>ERR123</code> or an event ID like <code>NEW_CLIENT</code>). This greatly eases monitoring by enabling regexp matching;</li>
<li>All required debugging context (like a person UUID, the instant of the log...);</li>
<li>The right verbosity level;</li>
<li>Stack traces when dealing with errors, so developers can easily localize the problem in their code.</li>
</ul>
</li>
</ul>
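<p>As an illustration, here is a minimal sketch of such a clever log using the SLF4J API; the error code, event ID and context fields are hypothetical:</p>
<pre><code>import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ClientService {
    private static final Logger LOGGER = LoggerFactory.getLogger(ClientService.class);

    void register(String personUuid) {
        // canonical event ID plus debugging context, easy to match with a regexp
        LOGGER.info("NEW_CLIENT registration accepted personUuid={}", personUuid);
        try {
            // ... business code ...
        } catch (RuntimeException e) {
            // canonical error code; passing the exception last also logs the stack trace
            LOGGER.error("ERR123 registration failed personUuid={}", personUuid, e);
            throw e;
        }
    }
}
</code></pre>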
<h3>Issues Never Walk Alone</h3>
<p>In production, things never ever get better on their own: <strong>hope is not a strategy</strong>. Due to Murphy's Law, anything with the ability to fail will fail.</p>
<p>Worse: <strong>issues often occur simultaneously</strong>. An initial incident can induce another one, even if they look unrelated at first glance (for instance, an Out Of Memory error can put pressure on the JVM Garbage Collector, which in turn increases CPU usage, which induces queued-work latency and finally generates timeouts on the client side).</p>
<p>Sometimes, it is even worse: truly <strong>unrelated issues may occur simultaneously</strong> by misfortune, making the diagnosis much more difficult by leading the post-mortem down the wrong path.</p>
<p><strong>What can I do?</strong></p>
<ul>
<li><strong>Don't leave issues in logs unresolved, whether in production or in DEV</strong>. Most issues can be detected in development or acceptance environments. Too often, we observe problems and ignore them, thinking they are transient, or due to some integration issue or intermittent network glitch. Such a problem should instead be taken as a chance to reveal a real issue and should not be ignored.</li>
<li>When you observe something strange, stop immediately and take a few minutes to analyze the issue or to add new test cases. <strong>Consider that you may have found an abeyant defect that would take days to diagnose and resolve later in production</strong>.</li>
</ul>
<h3>In Production, Everything Is Complicated and Time-Consuming</h3>
<p>For some good but also <a href="https://en.wikipedia.org/wiki/Cargo_cult_programming">not so good</a> reasons, <strong>every change must be controlled and traced</strong> in a regulated IS. Performing even a single SQL statement requires it to be tested in several testing or pre-production environments and finally applied by a DBA.</p>
<p>Any simple Unix command has to be <strong>documented in a procedure</strong> and executed by the Ops team, which is the only team with access to the servers.
Most of these <strong>operations must be planned, documented in depth, motivated, and traced</strong> into one or several ticketing systems. Changing a simple file or a single row in a database can hardly take less than half a man-day when counting all the involved persons.</p>
<p><strong>The costs increase exponentially as we get closer to production</strong>. See [Capers Jones, 1996] or [Marco M. Morana, 2006]: a bug can cost as little as $25 to fix in DEV and as much as $16K in running production.</p>
<p>Even if modern software engineering promotes CD (Continuous Deployment) and the use of IaC (Infrastructure as Code) tools like Kubernetes, Terraform, or Ansible, <strong>deploying to production is still a significant event in most organizations</strong> and most DevOps concepts are still theoretical there. Deploying a release often can't be done every day but rather about once a week or even once a month. Any release usually has to be validated by the product owner's acceptance tests (a lot of manual and repetitive operations). Any blocking issue requires a hotfix, which comes with a lot of administrative and build work.</p>
<p><strong>What can I do?</strong></p>
<ul>
<li>Perform <strong>as much unit, integration and system testing as possible</strong> before reaching the production environment.</li>
<li>Add <strong>hot-reloading configuration</strong> capacities to your modules, like changing the log verbosity using a simple REST call against an administrative endpoint (see the sketch after this list).</li>
<li>Make sure that all <strong>processes involving operations</strong> (ticketing system, people to contact, ways to alert...) are <strong>documented and quickly accessible</strong>. If they are not, document them to save a lot of time the next time.</li>
</ul>
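<p>A minimal sketch of such a hot-reloading endpoint, using only the JDK's built-in HTTP server and logging; the port, path and logger name are hypothetical:</p>
<pre><code>import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.util.logging.Level;
import java.util.logging.Logger;

public class AdminEndpoint {
    public static void main(String[] args) throws Exception {
        Logger appLogger = Logger.getLogger("com.example.app"); // hypothetical logger name
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
        // e.g. curl -X PUT 'http://host:8081/admin/log-level?level=FINE'
        server.createContext("/admin/log-level", exchange -> {
            String query = exchange.getRequestURI().getQuery(); // "level=FINE" (no validation in this sketch)
            appLogger.setLevel(Level.parse(query.substring("level=".length())));
            exchange.sendResponseHeaders(204, -1); // verbosity changed without any restart
            exchange.close();
        });
        server.start();
    }
}
</code></pre>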
<h3>Production Is Very Stressful</h3>
<p>When an incident occurs in production, your stress level may depend on the kind of industry you're working for, but even if you work for a medium-sized e-commerce company and not a nuclear facility or a hospital, I can guarantee that <strong>any problem generates a lot of pressure</strong> coming from customers, management and other teams depending on you. Ops teams are used to it and most are impressively calm when dealing with this kind of event; it's part of their job after all. When the problem comes from your code, you may have to work with them and <strong>take on a part of the pressure yourself</strong>.</p>
<p><strong>What can I do?</strong></p>
<ul>
<li>Make sure to be prepared <strong>before the incident</strong> by writing or learning procedures (read for instance <a href="https://sre.google/sre-book/managing-incidents/">chapter 14</a> of Google's great SRE Book).</li>
<li><strong>Be confident in your logs and monitoring metrics</strong> to help you find the root cause (for instance, prepare insightful dashboards and centralized log queries in advance).</li>
<li>For any complex issue, <strong>begin the investigation by creating a post-mortem document</strong> centralizing every note, stack trace, log or graph illustrating your hypotheses.</li>
</ul>
<h3>In Production, You Don't Have a Single Version to Manage</h3>
<p>In practical scenarios, it's not usually feasible to compel all your internal or external clients to simultaneously upgrade to your latest API or data model. You must handle intricate paths that require supporting several versions concurrently. For example, in the extensive French Tax Information System, the most crucial central API (such as the Person API) provides three versions of each endpoint. Approximately every year, a new version is introduced, the second is deprecated and becomes the third, and the third is decommissioned. All these three versions must coexist with a shared data model.</p>
<p><strong>What can I do?</strong>
Always include a version in your API URLs (for instance, <code>/v1/foo/bar</code>). Incorporate model versions in your data model; for example, in NoSQL, add a <code>modelVersion</code> attribute that will enable your code or ETL tools to determine the version of each piece of data individually. Plan for managing data of varying versions, for instance by writing conditional code based on the data model version, as sketched below.</p>
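<p>A minimal sketch of such version-aware reading code; the <code>Client</code> fields and the version layouts are hypothetical:</p>
<pre><code>import java.util.Map;

record Client(String uuid, String displayName) {}

class ClientMapper {
    /** Maps a raw NoSQL document to the domain object according to its modelVersion. */
    Client fromDocument(Map<String, Object> doc) {
        int modelVersion = (int) doc.getOrDefault("modelVersion", 1);
        return switch (modelVersion) {
            // v1 stored a single 'name' field (hypothetical layout)
            case 1 -> new Client((String) doc.get("uuid"), (String) doc.get("name"));
            // v2 split it into firstName/lastName (hypothetical layout)
            case 2 -> new Client((String) doc.get("uuid"),
                    doc.get("firstName") + " " + doc.get("lastName"));
            default -> throw new IllegalStateException("Unsupported model version " + modelVersion);
        };
    }
}
</code></pre>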
<h3>In Production, You Usually Don't Start from Scratch</h3>
<p>In development, when your database structure (DDL) evolves, you simply drop and recreate it. In production, in most cases, <strong>data is already there and you have to perform migrations or adaptations</strong> (using ETL or other tools).
Likewise, if some clients already use your API, you can't simply change its signature out of the blue: you have to think about backward compatibility. If you have to break it, you can deprecate some code, but then you have to plan the end of service.</p>
<p><strong>What can I do?</strong></p>
<ul>
<li>In development, don't just drop and recreate the DDL but 'code' the changes using incremental change tools like Liquibase. The same tools should be used in production.</li>
<li>Check that your libraries or APIs are still backward compatible using integration tests (see the sketch after this list).</li>
<li>Use <a href="https://semver.org/">Semantic Versioning</a> conventions to alert for breaking changes.</li>
</ul>
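<p>A minimal sketch of such a backward-compatibility test with JUnit 5, assuming a hypothetical <code>PriceFormatter</code> whose old one-argument signature must keep behaving as before:</p>
<pre><code>import static org.junit.jupiter.api.Assertions.assertEquals;
import org.junit.jupiter.api.Test;

class PriceFormatter {
    String format(long cents, String currency) { // new, extended signature
        return String.format("%d.%02d %s", cents / 100, cents % 100, currency);
    }

    /** @deprecated kept for v1 clients; must behave exactly as before. */
    @Deprecated
    String format(long cents) {
        return format(cents, "EUR");
    }
}

class BackwardCompatibilityTest {
    @Test
    void deprecatedSignatureStillBehavesAsInV1() {
        assertEquals("10.00 EUR", new PriceFormatter().format(1000));
    }
}
</code></pre>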
<h3>Security Is More Prominent in Production</h3>
<p>In any seriously protected production environment, <strong>many security systems are set up</strong>. They are often absent from the other environments due to their added complexity and cost. For instance, you can find additional layer 3 and 4 firewalls, WAFs (Web Application Firewalls operating at layer 7 against HTTP(S) calls), API Gateways, IAM systems (SAML, OIDC...), HTTP(S) proxies or reverse proxies. Internet calls are usually forbidden from servers, which can only use replicated and cached data (like local package repositories). Hence, many security infrastructure differences can mask issues that will only be discovered in pre-production or even in production.</p>
<p><strong>What can I do?</strong></p>
<ul>
<li><strong>Don't use the same values for different credential parameters</strong>. Identical values can hide integration issues that will only show up in production, where parameters are more likely to differ and where a different password is used for each resource.</li>
<li>Make sure to <strong>understand the security infrastructure</strong> limitations before coding related user stories.</li>
<li>Test security infrastructure <strong>using containers</strong>.</li>
</ul>
<h3>Conclusion</h3>
<p>It's a good thing for developers to <strong>be curious and get information about production by themselves</strong>, by reading blogs and books or simply by asking colleagues. As a developer, do you know how many cores a mid-range server has (per socket)? How much RAM per blade? Did you ever ask yourself where the data centers running your code are located? How much power your modules consume in kWh every day? How data is stored in a SAN? Are you familiar with fail-over systems like load balancers, RAID, standby databases, virtual infrastructure management, SAN replication...? You don't have to be an expert but <strong>it's important and gratifying to know the basics</strong>.</p>
<p>I hope I have given developers a first glimpse of production constraints. <strong>Production is a world where everything is multiplied</strong>: the gravity of issues, the costs, the time to fix systems. Always keep in mind that <strong>your code will eventually run in production and working code is far from enough</strong>: your code must be production-proof to make the organization's IS run smoothly. Then, <strong>everything will be fine and everybody will be home early</strong> instead of pulling their hair out until late at night...</p>
Release of the first version of our Project architecture document template2021-04-16T00:00:00+00:00https://florat.net/release-of-the-first-version-of-our-project-architecture-document-template/<p>Four years after the first release of a template in French, we release a revisited English version.</p>
<p>This architecture template is applicable to most management IT projects, regardless of the general architecture chosen (monolithic, SOA, micro-service, n-tier, …). It has already been used on several important projects, including in large organizations. It is maintained on a regular basis.</p>
<p>Discover it at <a href="https://github.com/bflorat/architecture-document-template">GitHub</a>.</p>
Proper strings normalization for comparison purpose2020-12-22T00:00:00+00:00https://florat.net/proper-strings-normalization-for-comparison-purpose/<p>(This article has also been <a href="https://dzone.com/articles/proper-strings-normalization-for-comparison-purpos">published</a> at DZone)</p>
<p><img src="https://florat.net/assets/images/blog-tech/normalization.png" alt="Illuminated initials, sixteenth-century"></p>
<h3>TL;DR</h3>
<p>In Java, do:</p>
<pre><code>import java.text.Normalizer;

String normalizedString = Normalizer.normalize(originalString, Normalizer.Form.NFKD)
.replaceAll("[^\\p{ASCII}]", "").toLowerCase().replaceAll("\\s{2,}", " ").trim();
</code></pre>
<p>Nowadays, most strings are Unicode-encoded and we are able to work with many different native characters with diacritical signs/accents (like <code>ö</code>, <code>é</code>, <code>À</code>) or ligatures (like <code>æ</code> or <code>ʥ</code>). Characters can be stored in UTF-8 (for instance) and the associated glyphs can be displayed properly if the font supports them. This is good news for respecting cultural specificities.</p>
<p>However, we often observe recurring difficulties when comparing strings issued from different information systems and/or initially typed by humans.</p>
<p>The human brain is a <a href="https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1593217/">machine to fill gaps</a>. Hence it has absolutely no problem reading or typing <code>'e'</code> instead of <code>'ê'</code>.</p>
<p>But what if the word <code>'tête'</code> (<code>'head'</code> in French) is correctly stored in a UTF-8 encoded database but you have to compare it with end-user-typed text missing the accents?</p>
<p>We also often have to deal with legacy systems, or modern ones filled with legacy data, that don't support the Unicode standard.</p>
<p>Another simple illustration of this problem is the use of ligatures. Imagine a product database storing various items with an ID and a description. Some items contain <a href="https://en.wikipedia.org/wiki/Orthographic_ligature">ligatures</a> (a combination of several letters joined together to create a single character, like <code>’Œuf’</code> - egg in French). Like most French people, I have no idea how to produce such a character, even using a French keyboard. I would spontaneously search the item descriptions using <code>oeuf</code>. Obviously, our code has to take care of ligatures if we want to return a useful result containing <code>’Œuf’</code>.</p>
<p>How to fix that mess?</p>
<h2>Rule #1: Don't even compare human text if you can</h2>
<p>When you can, never compare strings from heterogeneous systems. It is surprisingly tricky to do properly (even if it is possible to handle most cases, as we will see below). Instead, compare sequences, <a href="https://en.wikipedia.org/wiki/Universally_unique_identifier">UUIDs</a> or any other ASCII-based strings without spaces or 'special' characters. Strings coming from different information systems are likely to store data differently (lower/upper case, with/without diacritics, etc.). On the contrary, good IDs are free from encoding issues, being plain ASCII strings.</p>
<p>Example:</p>
<p>System 1: <code>{"id":"8b286f72-b366-47a4-9537-59d39411979a","desc":"Œuf brouillé"}</code></p>
<p>System 2: <code>{"id":"8b286f72-b366-47a4-9537-59d39411979a","desc":"OEUF BROUILLE"}</code></p>
<p>If you compare IDs, everything is simple and you can go home early. If you compare descriptions, you'll have to normalize them as a prerequisite or you'll be in big trouble.</p>
<p><strong>Character normalization is the action of computing a canonical form of a string. The basic idea to avoid spurious mismatches when comparing strings coming from several information systems is to normalize both strings and to compare the results of their normalization.</strong></p>
<p>In the previous example, we would compare <code>normalize("Œuf brouillé")</code> with <code>normalize("OEUF BROUILLE")</code>. Using a proper normalization function, we would then compare <code>'oeuf brouille'</code> with <code>'oeuf brouille'</code>, but if the normalization function is buggy or partial, the strings would mismatch. For example, if the <code>normalize()</code> function doesn't handle ligatures properly, you would get a spurious mismatch by comparing <code>'œuf brouille'</code> with <code>'oeuf brouille'</code>.</p>
<h2>Rule #2: Normalize in memory</h2>
<p>It is better to compare strings at the last possible moment, in memory, rather than normalizing them at storage time, for at least two reasons:</p>
<ol>
<li>
<p>If you only store a normalized version of your string, you lose information. You may need proper diacritics later, for display purposes or other reasons. As an IT professional, one of your tasks is to never lose information humans provided to you.</p>
</li>
<li>
<p>What if some items were stored before the normalization routine was set up? What if the normalization function changed over time?</p>
</li>
</ol>
<p>To avoid these common pitfalls, simply compare in memory <code>normalize(<data system 1>)</code> with <code>normalize(<data system 2>)</code>. The CPU overhead should be negligible unless you compare thousands of items per second...</p>
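<p>Putting the rules of this article together, a minimal in-memory comparison helper (based on the TL;DR snippet above) could look like this:</p>
<pre><code>import java.text.Normalizer;

final class NormalizedComparison {
    /** Canonical form: NFKD decomposition, diacritics dropped, lower case, collapsed spaces. */
    static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKD)
                .replaceAll("[^\\p{ASCII}]", "")
                .toLowerCase()
                .replaceAll("\\s{2,}", " ")
                .trim();
    }

    /** Compare at the last possible moment, in memory, never at storage time. */
    static boolean sameText(String a, String b) {
        return normalize(a).equals(normalize(b));
    }
}
</code></pre>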
<h2>Rule #3: Always trim externally and internally</h2>
<p>Another common trap when dealing with strings typed by humans is the presence of spaces at the beginning, at the end or in the middle of a sequence of characters.</p>
<p>As an example, look at these strings: <code>' Wiliam'</code> (note the space at the beginning), <code>'Henry '</code> (note the space at the end), <code>'Gates  III'</code> (see the double space in the middle of this family name, did you notice it at first?).</p>
<p>Appropriate solution:</p>
<ol>
<li>Trim the text to remove spaces at the beginning and at the end of the text.</li>
<li>Remove extra spaces in the middle of the string.</li>
</ol>
<p>In Java, one way to achieve this is:</p>
<pre><code>s = s.replaceAll("\\s{2,}", " ").trim();
</code></pre>
<h2>Rule #4: Harmonize letters casing</h2>
<p>This is the most well-known and straightforward normalization method: simply convert every letter to lower or upper case. AFAIK, there is no reason to prefer one over the other. Most developers (me included) use lower case.</p>
<p>In Java, just use <code>toLowerCase()</code>:</p>
<pre><code>s = s.toLowerCase();
</code></pre>
<h2>Rule #5: Transform characters with diacritical signs to ASCII</h2>
<p>When typed, diacritical signs are often omitted in favor of their ASCII version. For example, one can type the German word <code>'schon'</code> instead of <code>'schön'</code>.</p>
<p>Unicode proposes four <a href="http://www.unicode.org/reports/tr15/#Canon_Compat_Equivalence">Normalization forms</a> that may be used for that purpose (NFC, NFD, NFKD and NFKC). Check-out <a href="https://www.unicode.org/reports/tr15/images/UAX15-NormFig6.jpg">this enlightening illustration</a>.</p>
<p>Detailing all these forms would go beyond the scope of this article, but basically, keep in mind that some Unicode characters can be encoded either as a single combined character or in a decomposed form. For instance, <code>'é'</code> can be encoded as the <code>\u00e9</code> code point or as the decomposed form <code>'\u0065'</code> (the <code>'e'</code> letter) followed by <code>'\u0301'</code> (the combining acute accent diacritic).</p>
<p>We will apply the NFD ("Canonical Decomposition") normalization form to the initial text to make sure that every character with an accent is converted to its decomposed form. Then, all we have to do is drop the diacritics and only keep the 'base' simple characters.</p>
<p>In Java, both operations can be done this way:</p>
<pre><code>s = Normalizer.normalize(s, Normalizer.Form.NFD)
.replaceAll("[^\\p{ASCII}]", "");
</code></pre>
<p>Note: even if this code covers the current issue, prefer the <code>NFKD</code> transformation to deal with ligatures as well (see below).</p>
<h2>Rule #6: Decompose ligatures to a set of ASCII characters</h2>
<p>The other thing to understand is that Unicode maintains a compatibility mapping between about 5000 'composite' characters (like ligatures or precomposed roman numerals) and lists of regular characters. Characters supporting this feature are documented (check the '<a href="https://www.compart.com/en/unicode/U+0133">decomposition</a>' attribute in the Unicode character documentation).</p>
<p>For instance, the roman numeral Ⅻ (U+216B) can be decomposed with NFKD normalization into an <code>'X'</code> and two <code>'I'</code>. Likewise, the <code>ij</code> (U+0133) character (like in <code>'fijn'</code> - 'nice' in Dutch) can be decomposed into an '<code>i</code>' and a '<code>j</code>'.</p>
<p>For these kinds of 'Siamese twins' characters, we have to apply the NFKD ("Compatibility Decomposition") normalization form, which both decomposes the characters (see 'Rule #5' above) and maps ligatures to several 'base' characters. You can then drop the remaining diacritics.</p>
<p>In Java, use:</p>
<pre><code>s = Normalizer.normalize(s, Normalizer.Form.NFKD)
.replaceAll("[^\\p{ASCII}]", "");
</code></pre>
<p>Now the bad news: for obscure reasons, Unicode doesn't provide a decomposition equivalence for some widely used ligatures, like the French '<code>œ</code>' and '<code>æ</code>' or the German eszett '<code>ß</code>'. If you need to handle them, you will have to write your own replacements <strong>before</strong> applying the NFKD normalization:</p>
<pre><code> s = s.replaceAll("œ", "oe");
s = s.replaceAll("æ", "ae");
s = Normalizer.normalize(s, Normalizer.Form.NFKD)
.replaceAll("[^\\p{ASCII}]", "");
</code></pre>
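<p>Note that the uppercase forms of these ligatures need the same treatment, otherwise the non-ASCII filter would silently drop them. A quick check, with the expected result as a comment:</p>
<pre><code>String s = "Œuf brouillé";
// also map the uppercase ligatures, otherwise 'Œ' would be stripped entirely
s = s.replaceAll("œ", "oe").replaceAll("Œ", "OE")
     .replaceAll("æ", "ae").replaceAll("Æ", "AE");
s = Normalizer.normalize(s, Normalizer.Form.NFKD)
     .replaceAll("[^\\p{ASCII}]", "")
     .toLowerCase();
// s is now "oeuf brouille"
</code></pre>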
<h2>Rule #7: Beware punctuation</h2>
<p>This is a more minor issue, but depending on your context you may want to normalize some special punctuation characters as well.</p>
<p>For example, in a literary context like a text-revision software, it may be a good idea to map the em/long dash (<code>'—'</code>) character to the regular ASCII hyphen (<code>'-'</code>).</p>
<p>AFAIK, Unicode doesn't provide a mapping for that, so just do it yourself the good old way:</p>
<pre><code>s = s.replaceAll("—", "-");
</code></pre>
<h2>Final word</h2>
<p>String normalization is very helpful to compare strings issued from different systems or to perform appropriate comparisons. Even fully English-localized projects can benefit from it, for instance to take care of casing or trailing spaces, or when dealing with foreign words with accents.</p>
<p>This article exposes some of the most important points to take into consideration, but it is far from exhaustive. For instance, we omitted Asian character manipulation and the cultural normalization of semantically equivalent items (like <code>'St'</code>, the abbreviation of <code>'Saint'</code>), but I hope it is a good start for most projects.</p>
<h2>References</h2>
<p>http://www.unicode.org/reports/tr15/</p>
<p>https://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html</p>
<p>https://en.wikipedia.org/wiki/Unicode_equivalence#Normalization</p>
<p>https://minaret.info/test/normalize.msp</p>
Why did I rewrite my blog using Eleventy ?2020-11-05T00:00:00+00:00https://florat.net/why-did-i-rewrite-my-blog-using-eleventy/<h2>Reasons to change</h2>
<p>This personal home page and blog were previously self-hosted using a great Open Source Wiki engine: Dokuwiki. It worked great for many years but a few months ago, I felt that it was time to change lanes and embrace the <a href="https://jamstack.org/">JAM Stack</a> (JavaScript / APIs / Markup).</p>
<h3>Issues with traditional wikis</h3>
<ul>
<li>Security: a lot of spam in comments, possible PHP vulnerabilities</li>
<li>Regular upgrades to be performed against the engine</li>
<li>Many plugins required to make something useful. Old ones, conflicting ones...</li>
<li>Not so easy to customize the rendered pages</li>
<li>Slower than a static website</li>
<li>Much larger electricity consumption to serve pages</li>
<li>Requires PHP modules to be installed and tuned along with the HTTP server</li>
<li>Most wiki engines require a database (even if it is not the case of Dokuwiki)</li>
<li>Not so easy reversibility. One way is to use Pandoc to translate wiki syntax to markdown.</li>
</ul>
<h3>Opportunities with the JAM Stack</h3>
<ul>
<li>Ability to write articles using Markdown, a much more widespread markup language than any of the numerous Wiki syntaxes around</li>
<li>No vulnerabilities possible (except from the Web server itself) as the produced website is only static HTML</li>
<li>Using Git (advanced version control) and associated ecosystem (Merge Requests...)</li>
<li>Possibility to use CI/CD tools to deploy new pages</li>
<li>Can be deployed on CDN (even if I continue to self-host it)</li>
<li>Possibility to use great IDE to write articles (like VSCode and all its extensions)</li>
<li>Faster preview of rendered pages: I can now see the result in my browser in less than a second</li>
<li>Containers-friendly (using a nginx docker image typically)</li>
<li>It's the new trend ! (OK, it's a kind of <a href="http://radar.oreilly.com/2014/10/resume-driven-development.html">RDD</a> but it may be useful in current professional context)</li>
</ul>
<h3>The not-so-good using the JAM Stack</h3>
<ul>
<li>You have to rely on <a href="https://myclientwants.com/#ugc">external services</a> to perform some basic features like adding comments (already disabled in my case, too many spam messages) or full-text searches</li>
</ul>
<h2>Eleventy</h2>
<p>Well, I finally decided to switch to the JAM Stack. But the field is <a href="https://jamstack.org/generators/">very crowded</a>.
I already use <a href="https://jamstack.org/generators/antora/">Antora</a> at work to generate great technical documentation using Asciidoc, but it was not suitable for a blog. I also used <a href="https://jamstack.org/generators/jekyll/">Jekyll</a> for a long time with GitHub Pages (see the <a href="http://jajuk.info/">Jajuk website</a>) but I found it complicated, aging and too restrictive.</p>
<p>After a quick look at the most popular platform (<a href="https://jamstack.org/generators/hugo/">Hugo</a>), I gave up. Basically, I felt that I had to learn a whole world before being able to make a website, and I didn't have that time.</p>
<p>Then, I heard about a new, simple platform: <a href="https://www.11ty.dev/">Eleventy</a>. I loved the Unix-like idea behind it: a very low-level tool leveraging existing templating engines like Liquid or Nunjucks and allowing HTML and markdown content to be mixed. It also leverages a convention-over-configuration principle enabling results in no time.</p>
<p>Last but not least: it is very fast (nearly as fast as Hugo). It is a JavaScript tool, great for most frontend developers, who can use npm, sass... Look at <a href="https://jamstack.org/generators/hugo/">this page</a> if you want to see sample code using Eleventy.</p>
<p>I finally rewrote my website in raw CSS, HTML, Markdown and Liquid templates thanks to Eleventy. It only took me a single day to grasp the basic Eleventy concepts and port the existing website. I finally have full control over my pages.</p>
<p>Note that another <a href="https://www.youtube.com/watch?v=h6ZxRudaYIQ">common strategy</a> is to use an existing theme (like a Bootstrap-based theme) and to make the HTML generic using templates. I gave up on this method because I wanted something simple, very light, and something I fully control and understand...</p>
How to write good ADRs (architecture decision records)?2020-05-06T00:00:00+00:00https://florat.net/comment-faire-de-bons-adr/<h2>Architecture decision log</h2>
<p>An architecture decision log is used to record the important
architecture decisions (the ADRs, <em>Architecture Decision Records</em>).</p>
<p>The goal is to make the choices known and understandable <em>a
posteriori</em> and to share the decisions. The architecture document,
for its part, does not restate these choices but only shows the
final decision.</p>
<p>There is only one ADR log per project.</p>
<h3>Format of an ADR</h3>
<p>Each ADR consists of a single file in asciidoc format with this
name: <code>[XYZ sequence starting at 001]-[decision].adoc</code>.</p>
<p>Format of the decision part: lowercase, without spaces, with hyphens as
separators. Example: <code>007-API-devant-bases-existantes-perennes.adoc</code>.</p>
<p>Each ADR ideally contains the following content (adaptable as
needed):</p>
<h4>1) History</h4>
<ul>
<li>Give the status and the history of state changes</li>
<li>The possible statuses are: <code>TODO</code> (to be written), <code>WIP</code> (Work In
Progress), <code>PROPOSE</code> (proposed), <code>REJETE</code> (rejected), <code>VALIDE</code> (validated), <code>DEPRECIE</code> (deprecated), <code>REMPLACE</code> (superseded).</li>
<li>If the status is <code>VALIDE</code>, detail the date and the decision-makers
who validated it.</li>
<li>If the status is <code>REMPLACE</code>, give the reference of the ADR to
consider instead.</li>
<li>Never delete an ADR (set it to the <code>DEPRECIE</code> status) and do not
reuse the ID of another ADR of the same module.</li>
<li>Mention the ADR superseding it, if any. Example:
<code>Superseded by ADR 002-...</code></li>
</ul>
<h4>2) Context</h4>
<p>Presents the possible options, the issues, and the forces at play
(technical, organizational, regulatory, financial, human
...). Give the strengths, weaknesses, opportunities and threats of each
solution (see the <a href="https://fr.wikipedia.org/wiki/SWOT_(m%C3%A9thode_d%27analyse)">SWOT
method</a>).</p>
<p>Note:</p>
<ul>
<li>If a point is a deal-breaker, say so.</li>
<li>Number the solutions so they can be referenced unambiguously</li>
<li>For the simplest cases, two advantages/drawbacks
paragraphs per solution may be enough.</li>
<li>In some cases, the ADR may contain a single solution, the
goal then being to document the reasons behind this architecture.</li>
</ul>
<h4>3) Decision</h4>
<p>Give the selected decision (be assertive and recall the number of the
selected solution). Example:
<code>We will sign the PDFs on the fly (solution 1)</code>.</p>
<h4>4) Consequences</h4>
<p>Give the possible consequences of the decision in terms of
implementation. Do not restate the strengths and weaknesses of the solutions but rather
the practical consequences of the decision. Give the actions that
mitigate the risks induced by the solution.</p>
<p>Examples:</p>
<p>* <code>Specific logs will have to be planned for this processing</code></p>
<p>*
<code>The unavailability risk will be covered by reinforced on-call duty</code></p>
<h3>Format of the log</h3>
<p>Ideally, an ADR log provides a visual rendering of all the ADRs
with their respective status and history, so as to give
a global view of the state of each decision. Statuses and
histories must never be duplicated, since duplication implies a
double maintenance that has very little chance of being done correctly.
In most cases, it is better to keep this information
only in each ADR, even if it means opening them one by one. An
alternative is to sort the ADRs into subdirectories by status,
but this makes browsing the ADRs more difficult.</p>
<p>If you use Asciidoc (which I highly recommend), a trick
exists: tag inclusion. The idea is to keep the status and
the history in each ADR but to include them in a table that
forms the log. Example:</p>
<p>In <code>001-dedoublonnage-requetes.adoc</code>:</p>
<pre><code>## Status
// tag::statut[]
`VALIDE`
// end::statut[]
## History
// tag::historique[]
Validated on Nov 26, 2019 with xyz
// end::historique[]
</code></pre>
<p>and in the log (<code>README.adoc</code>):</p>
<pre><code>.Table List and statuses of the RECE ADRs
[cols="2,1a,4a"]
|===
|ADR |Status |History
|link:001-dedoublonnage-requetes.adoc[001-dedoublonnage-requetes]
|include::001-dedoublonnage-requetes.adoc[tags=statut]
|include::001-dedoublonnage-requetes.adoc[tags=historique]
|link:002-appels-synchrones.adoc[002-appels-synchrones]
|include::002-appels-synchrones.adoc[tags=statut]
|include::002-appels-synchrones.adoc[tags=historique]
...
|===
</code></pre>
<h3>Complete example of an ADR</h3>
<pre><code> ## History
 Status: `VALIDE`
 * Validated by xyz on January 28
 * Proposed by z on 2020-01-02
 ## Context
 <General presentation of the problem>
 # Solution 1: <solution description>
 ## Strengths
 - Limits network usage
 ## Weaknesses
 - Less robust
 ## Opportunities
 ## Threats
 - [deal-breaker] Requires the signature to be performed synchronously or on the fly
 # Solution 2: <solution description>
 ## Strengths
 ## Weaknesses
 ## Opportunities
 ## Threats
 ## Decision
 Solution 2 is selected
 ## Consequences
 - Check the JVM configuration to use a random number generator
</code></pre>
<h3>Usage tips</h3>
<ul>
<li>Don't hesitate to add images/diagrams... Think of Mermaid and
Plantuml.</li>
<li>Don't timestamp the changes of the ADR itself; this is the role
of the version control tool (Git). Use explicit commit
messages.</li>
<li>A good ADR must be:
<ul>
<li>short;</li>
<li>clear;</li>
<li>relevant (explains the context, the possible choices and the
selected decision well);</li>
<li>accessible to everyone (Wiki, GitHub..., no office
documents);</li>
<li>traced (changelog, Git commits, ...);</li>
<li>transparent: if decision elements are missing, mention
them.</li>
</ul>
</li>
</ul>
<h2>Other resources</h2>
<p>Links: <a href="https://github.com/joelparkerhenderson/architecture_decision_record">list of common ADR templates</a></p>
V3 of the architecture document template2019-09-01T00:00:00+00:00https://florat.net/v3-modele-de-dossier-d'architecture/<p>See <a href="https://github.com/bflorat/modele-da">https://github.com/bflorat/modele-da</a></p>
<p>The template has been extended, simplified and fixed. Above all, it
takes the path of living documentation, having been rewritten in asciidoc
(it will thus now be possible to propose merge requests, for example).
The diagrams are still in Plantuml but most of them have been redone
as C4 diagrams.</p>
<p>Feedback and PRs appreciated</p>
Summary of Cal Newport's "Deep Work" book2018-05-31T00:00:00+00:00https://florat.net/summary-of-cal-newport's-%22deep-work%22-book/<p>I just finished <a href="https://www.amazon.com/Deep-Work-Focused-Success-Distracted/dp/1455586692">"Deep
work"</a>,
an interesting book. I only regret that it doesn't contain any reference
to the <a href="http://pomodorotechnique.com/">pomodoro technique</a>.</p>
<p>Here are a few of my raw notes:</p>
<pre><code>Deep work : “professional activities performed in a state of distraction-free concentration that push cognitive capabilities to their limit”. For high skills, difficult to replicate.
Shallow work : “non-cognitively demanding, logistic-style tasks, often performed while distracted.” Low value, easily replicable
Deep work hypothesis : the ability to perform a deep work is rare and valuable. Those who are capable will thrive.
The core abilities :
- quickly master hard things
- produce elite level with speed
Both depend on deep work
Myelin : by always triggering the same paths, better signal -> more focus = more intelligence
High quality work = time x intensity of focus
Metric black hole : we don't actually measure value of tasks we perform
Principle of least resistance : given that we don't actually measure the value of our work, we do first what is easier : shallow work.
Busyness as a proxy for productivity : in knowledge work, difficult to estimate our own value : a lot of shallow work creates a false feeling of produced value
Cult of the Internet : everything from the Internet (like facebook) is considered a priori as good in IT : huge error.
Neuroscience : what you are is the sum of what you focus on. Happier when we focus on flow activities. We need goals, challenges, feedback.
We all have a limited amount of will-power so we need to save it for deep work.
Profiles of deep workers:
- bimodal : monastic-like activities for few days, shallow work during the rest of the time
- rhythmic philosophy : moment reserved every day, use a chain method like a cross on the calendar : we want to avoid any hole in the chain.
- journalist philosophy : switches between shallow work and deep work all the day long (hard)
Ideas to help deep work:
- grand gesture : leave habits, work in an hotel for ie
- help serendipity by meeting people from others disciplines
- stop working in the evening to let the unconscious mind solve problems for you (less work = more CPU to solve problems in your mind background)
- also rest because we all have a limited amount of available attention
- perform a shutdown ritual every end of day (like saying 'work performed') -> brain conditioned to stop running thoughts. Otherwise, Zeigarnik effect (we remember interrupted tasks better because we want to solve them)
- search boredom to help the brain to rewire
- schedule the day by blocks, change blocks during the day if required
Deep work meditation to solve complex problems:
- Store variables of current state of the problem
- ask question to force the brain to go to the next problem and no looping
- fight distracted thoughts
Memorization technique (see the book for more details) : imagine large objects in 5 rooms of our house, map the objects with a set of celebrities and imagine scenes. Each person maps a value (like a number of a card value)
Avoid any-benefits tools like facebook, concentrate on craftsman approach : only consider tools that help significantly to reach the lead goals
To determinate if a tool that help :
- list the key activities you need to realize to reach the lead goals
- for each activities, ask yourself if the tool helps or not
4DX (Four disciplines of eXecution) :
- focus on widely import goals (measurable few goals)
- focus on lead goals, not long term goals
- use scoreboards
- perform periodic summaries
Law of the vital fews (Pareto principle) : 80% of a given effect is done by 20% of the possible causes
During leisure, avoid using Internet, do high-level activities like reading literature
Evaluate shallow work performed by week and confront it to your boss and ask him to validate.
To determine if a work is shallow : how many months would it take to teach an hypothetical post graduate to make it ?
Say "no" by default, provide vague explanation to avoid questions.
Process centric e-mails to close the loop and free the mind : state clearly the next steps on every subject (every action)
Avoid replying to e-mails on subject without interest, coming with too much work to reply etc..
</code></pre>
Benefits of Hardware-based Full Disk Encryption and sedutil2018-05-31T00:00:00+00:00https://florat.net/benefits-of-hardware-based-full-disk-encryption-and-sedutil/<p>We need to protect our personal or professional data, especially when located on
laptops that can easily be stolen. Even if it is not yet fully
widespread, many companies or personal users encrypt their disks to
prevent such issues.</p>
<p>There are three major technologies to encrypt the data (most of the time,
the same symmetric cipher is used: AES 128 or 256 bits):</p>
<ul>
<li>File-level encryption tools (7zip, GnuPG, openSSL...) where we
encrypt one or more files (but not a full file system)</li>
<li>Software FDE = Full Disk Encryption (dm-crypt, encfs, TrueCrypt
under Linux; BitLocker, SafeGuard under MS Windows, among many
others) where a full file system is encrypted. Most of these
tools map a real encrypted file system to an in-memory clear
filesystem. For instance, you open an encrypted /dev/sda2 filesystem
with dm-crypt/Luks this way:</li>
</ul>
<pre><code> sudo cryptsetup luksOpen /dev/sda2 aClearFileSystemName
<enter password>
mount /dev/mapper/aClearFileSystemName /mnt/myMountPoint
</code></pre>
<ul>
<li>Hardware-based Full Disk Encryption (also named SED =
Self-Encrypting Disk) where hard disks encrypt themselves in their
own built-in disk controller. We'll focus here on this technology.</li>
</ul>
<p>To make it work, you need:</p>
<ul>
<li>a SED-capable hard disk or SSD (I for one own a Samsung 840 PRO and
a 850 EVO that support it; most professional disks do).</li>
<li>a compatible BIOS that supports SED. You can then set a disk-level
user password in the BIOS (and optionally an administrator password
to unlock the user password). When the computer boots, the BIOS asks
interactively for a disk password [1]. Note that many BIOSes
(especially on desktops or on non-professional laptops) don't
support this feature because the manufacturer has not enabled it
(maybe to avoid customer complaints about password loss?).</li>
</ul>
<p>Once the correct BIOS disk password is entered, the disk becomes totally
'open' (we say 'unlocked'), exactly as if it had never been
encrypted. No software is involved afterward. It is important to
understand that a SED always encrypts the data. There is no way to
disable this behavior (however, it doesn't cause any significant effect
on the IO performance because the IO volume is unchanged and
because the disk controller comes with a built-in AES chipset). The real
encryption key (MEK = Media Encryption Key) is located inside the disk
itself (but cannot be accessed). The user password (named KEK = Key
Encryption Key) is used to encrypt / decrypt the MEK. Keeping the disk
password unset is like keeping a safe open: the data is still encrypted
but decrypted when accessing the disk, exactly as if no security system
ever existed. When you set the user password, you close the safe door
using your key. Note that there is no (known) way to recover a disk if
you lose your password: you not only lose your data but you also
lose your disk: it becomes a piece of junk from which no data can be
read or written.</p>
<p>I used dm-crypt (the default FDE software under Linux) on my own laptop
until I bought a SED-enabled Samsung SSD, but I never managed to use
its SED feature on my own computer because my AMI BIOS doesn't support
it. The only option then was to use software file system
encryption. This works but comes with several complications or drawbacks:</p>
<ul>
<li>you need a /boot partition in clear to bootstrap the process. An
attacker can easily alter this partition and add keyloggers, for
instance;</li>
<li>you have to change some kernel options and make sure to set the
right module loading order at startup or resume (and keep them when
updating the kernel);</li>
<li>the TRIM SSD feature [2] is now supported by dm-crypt but it comes
with <a href="http://asalor.blogspot.nl/2011/08/trim-dm-crypt-problems.html">security
concerns</a>
;</li>
<li>you need dm-crypt commands on liveCD distros when performing system
backups.</li>
</ul>
<p>The only benefit of using software FDE I can think of is the possibility
to review the cipher source code (when using an open source solution like
dm-crypt, of course). This is not the case with hardware encryption, even if
no severe issue has been reported so far AFAIK.</p>
<p>SED hardware-based disks are much simpler to use in comparison:</p>
<ul>
<li>you only have to set a BIOS password and you're done!</li>
<li>you save a <a href="http://www.anandtech.com/show/2901/5">significant amount of CPU
usage</a> ;</li>
<li>it is possible to definitively destroy a drive by changing its
password once and for all, when decommissioning a laptop for instance
(but this is also a drawback when the password is lost
unintentionally).</li>
</ul>
<p>But:</p>
<ul>
<li>once unlocked, the disk remains in this state while the computer is
powered (this includes while suspended to RAM). The login window doesn't
change anything: an attacker can read the drive by plugging it
directly into the SATA port (DMA attack) and, even worse, <strong>a warm
reboot (a restart) keeps the drive open!</strong> It means that one
can access the unlocked disk simply by inserting a Live CD/USB and
rebooting the computer. The Live CD/USB is booted and all the drive
data is <a href="https://www1.cs.fau.de/sed">available when mounted</a>!
This is why, when using SED, <strong>you should always hibernate</strong>
(suspend to disk) instead of suspending to RAM: when hibernating,
the drive actually loses power and is locked again. Of course,
you'll get the same effect when turning off your computer.</li>
<li>you need a SED-capable BIOS. Note that you can also use the hdparm
command to unlock a SED drive, but it requires booting a Live CD/USB,
launching something like the command below and then restarting your
computer. This is not really practical;</li>
</ul>
<pre><code> sudo hdparm --user-master u --security-set-pass 'pass' /dev/sdb
</code></pre>
<ul>
<li>if you lose the disk password, the disk is simply dead (but this may
be a benefit, as stated before);</li>
<li>you may depend on a specific BIOS manufacturer because it trims or
hashes the disk password (KEK). Another BIOS may use another
algorithm. It means that moving a drive from one computer to another
may leave you unable to unlock the drive, even with the same
password.</li>
<li>because the operating system and its settings are not yet loaded,
only the QWERTY keyboard layout is available; you have to keep this
in mind when choosing and typing the password;</li>
<li>you have to trust the hardware security chipsets.</li>
</ul>
<p>The <a href="https://en.wikipedia.org/wiki/Opal_Storage_Specification">OPAL
specification</a>
published by the Trusted Computing Group (AMD, IBM, Intel, HP...) fixes
some of these issues :</p>
<ul>
<li>you can always save the disk when losing the disk password (of
course, the data is still lost, fortunately for security) thanks to the PSID Revert
function (the PSID is a number printed on the disk proving that you
can physically access the drive);</li>
<li>the KEK hashing and trimming is now standard: the same drive can
be moved from one computer to another;</li>
<li>you can use SED even without BIOS support because OPAL comes with a
mechanism called 'shadow MBR'. Basically, you flash a mini-OS (the
PBA = Pre-Boot Authorization), up to 128MB, to a dedicated area of the
disk. This OS is provided to the BIOS when booting. A password
window is then displayed. If the password is correct, the real MBR
of the drive (the Master Boot Record = boot code) is then decrypted
and executed. No more need for BIOS SED support and, even better, a
new open source OPAL implementation (sedutil) is available and its
source code can be reviewed much more easily than a binary BIOS
firmware.</li>
</ul>
<p>The new <a href="https://github.com/Drive-Trust-Alliance/sedutil">sedutil
project</a> comes with :</p>
<ul>
<li>some PBA images ready to flash to the drive</li>
<li>the sedutil-cli command to administer the OPAL disk (setting up a
drive in OPAL configuration, changing the password, PSID revert...).
Note that these commands require setting the libata.allow_tpm=1
kernel flag if run from an installed Linux. You can also, like me,
use sedutil-cli from a rescue image booted from USB. See the <a href="https://github.com/Drive-Trust-Alliance/sedutil/wiki/Command-Syntax">list
of
commands</a>.
See also how to <a href="https://github.com/Drive-Trust-Alliance/sedutil/wiki/Encrypting-your-drive">set up a
drive</a>.</li>
</ul>
<p>This worked perfectly for me and I now use my Samsung 850 EVO drive in
SED OPAL mode. Note that sedutil doesn't support suspend to RAM (when
resuming, the drive behaves as if it were dead; you'll get IO errors all over
the place). Always use hibernation instead (as I already stated, it's
the only safe way to use SED drives anyway).</p>
<p>[1] Note that it has nothing to do with the main BIOS user password
that "protects" your machine (in that case, your disk data is still in clear and
can be read simply by moving the disk to another computer or by removing the
BIOS battery)</p>
<p>[2] TRIM is used on SSDs to free unused blocks ASAP and increase the
disk lifespan.</p>
One month with Ansible2018-02-03T00:00:00+00:00https://florat.net/one-month-with-ansible/<p><a href="https://github.com/ansible/ansible">Ansible</a> is an Open Source IT
automation tool written in python and sponsored by RedHat. Best known
alternatives are Puppet, Chef and Salt.</p>
<p>I used Ansible for the first time (2.4.3, the latest release in early 2018) in
an attempt to produce some quite sophisticated Docker Swarm
docker-compose files and other yaml configuration files that include a
significant amount of logic (port number increments, conditional
suffixes, variable number of sections according to lists of items, etc.)</p>
<p>I achieved my goals in about five or six days of effective work,
including the reading of most of the official manual. Being able to achieve
such a real task in six days is acceptable when you have to learn the tool
first, but I think I would have done it in a single day in bash (which I
already know). However, Ansible is much more powerful. My first contacts
and real work with Ansible were really enjoyable and I was very
surprised to make it work so easily. I also tried to apply all the
documented best practices, with success. Sadly, I spent the last three
days struggling with the last 5% of the remaining work, dealing with
limitations/bugs that I found hard to understand and quite irritating.</p>
<h3>What I liked</h3>
<ul>
<li>
<p>The concept of desired state is very powerful: Ansible playbooks
(lists of tasks to be performed against some servers) are idempotent:
only the final states have to be described (like "a /tmp/foo
directory with 600 rights"), not the actions required to reach them
(like in bash: mkdir, chown, chmod...). It's powerful partially
because you don't have to test the existence of the final state (in a
bash script in exit-on-error mode, you would have to check the existence of
each directory, for instance).</p>
</li>
<li>
<p>Ansible is agentless: nothing to install on the targeted servers. All
you need is an ssh key exchange to allow the headless ssh
connections. Ansible generates python scripts from the playbook,
copies them using scp or sftp and runs them remotely using ssh as well.</p>
</li>
<li>
<p>The role concept is a kind of packaged operational process (like "add
a mysql user" or "create and configure an Apache server"). It
enables a lot of reuse and is really great. A marketplace of shared
roles is available on <a href="https://galaxy.ansible.com/">Galaxy</a>.</p>
</li>
<li>
<p>The manual and reference documentation is good and extensive.</p>
</li>
</ul>
<h3>What I found irritating</h3>
<p>UPDATE November 2019: all of the issues described here have been
resolved in the meantime by the Ansible team, KUTGW!</p>
<ul>
<li>
<p>I don't like yaml for complex structures. I find it harder to read
than json, and syntax errors are very frequent and cause a great
waste of time. The data structures are described by (space)
indentation, which I find brittle. Worse: different indentation forms can
both be valid but mean different things (like a map of maps versus one
more key/value pair for the current map). Validators exist but AFAIK,
formatters don't. However, yaml comes with fine features like
comments or multi-documents.</p>
</li>
<li>
<p>Playbook execution is rather slow because of a new ssh connection
for each task, plus one for sending the generated python scripts to the
remote host. Note however that even if tasks are always executed
sequentially, each task is run in parallel against all the targeted
servers.</p>
</li>
<li>
<p>You need to create a playbook that just wraps a role to run it; you
cannot launch a role directly from the command line</p>
</li>
<li>
<p>There are <a href="http://docs.ansible.com/ansible/latest/playbooks_loops.html">16 kinds of loops in
Ansible</a>,
like with_fileglob or with_filetree. Is this really necessary?</p>
</li>
<li>
<p>I wasn't able to increment a variable inside a loop in a jinja2
template: <a href="https://github.com/pallets/jinja/issues/641">https://github.com/pallets/jinja/issues/641</a>. This is
a feature, not a bug. Incrementing things (like ports) is
nevertheless a very basic requirement IMO. Hopefully, there is a
workaround (using a list, append and pop).</p>
</li>
<li>
<p>It isn't possible to match a directory with with_fileglob:
<a href="https://github.com/ansible/ansible/issues/17136">https://github.com/ansible/ansible/issues/17136</a>. You have to use
with_filetree, which comes with other constraints.</p>
</li>
<li>
<p>It is difficult to debug the templating, especially when using
template fragments (with import). On any template module error, you
only get the playbook line and the full template content (very
difficult to read BTW).</p>
</li>
<li>
<p>I find the syntax sometimes twisted, like when we have to use
double quotes around variables and sometimes not. Also, why should
we add white space around the variable names?
I find this ugly and annoying. Apparently, we can drop the spaces in
playbooks but not in the jinja2 templates...</p>
</li>
<li>
<p>Ansible is not compatible with python 3.0 to 3.5. Sometimes (like
with the copy module), I didn't get any error message despite the
fact that the python package on the target server was unsupported.</p>
</li>
<li>
<p>It is not possible to copy recursively with src_remote
(<a href="https://github.com/ansible/ansible/issues/14131">https://github.com/ansible/ansible/issues/14131</a>). I had to use a
hack (run the template on the Ansible host using connection: local) and
then copy using src instead of src_remote.</p>
</li>
</ul>
<h3>Final thoughts</h3>
<p>As a conclusion, Ansible is a good product but can become cumbersome
when trying to make it run too much logic. It is mainly a declarative
system, not an imperative one. Next time, we'll have a look at Salt; it may be
a more suitable solution, or maybe not?</p>
Dashboard under XFCE real howto2016-10-30T00:00:00+00:00https://florat.net/dashboard-under-xfce-real-howto/<p>If, like me, you like both XFCE and the Gnome-Shell dashboard/window picker,
here's how I configured my desktop for the closest Gnome-like
experience:</p>
<p>1) Install xfdashboard (the dashboard itself). I used version 0. Note:
this release comes with a hot-corner plugin, so there is no more need for
xdotool or brightside.</p>
<p>2) Add or enable these commands to be run at X startup (in XFCE
Settings / Sessions and startup / application autostart):
<code>xfdashboard -d</code> (daemon mode, for a faster display)</p>
<p>3) Configure XFdashboard using <code>xfdashboard-settings</code> :</p>
<ul>
<li>In 'plugins', select the 'hotcorners' plugin</li>
<li>make sure to restart xfdashboard to enable this new plugin :
<code>xfdashboard -q</code>, then <code>xfdashboard -d &</code></li>
</ul>
<p>4) Add the preferred applications into the vertical side bar (no GUI,
xfce4-settings-editor cannot edit arrays), here's a sample command :</p>
<pre><code>xfconf-query -c xfdashboard -p /favourites -n -t string -s "exo-file-manager.desktop" -t string -s "exo-terminal-emulator.desktop" -t string -s "jetbrains-idea-ce.desktop" -t string -s "owncloud.desktop" -t string -s "simple-scan.desktop" -t string -s "gnome-calculator.desktop" -t string -s "firefox.desktop" -t string -s "thunderbird.desktop" -t string -s "zim.desktop" -t string -s "libreoffice-writer.desktop"
</code></pre>
<p>5) If you are in multi-monitor mode and you want to see all windows on
the primary display and not spread over several monitors, see <a href="https://github.com/gmc-holle/xfdashboard/issues/136">my
workaround</a>: in
<code>/usr/share/themes/xfdashboard/xfdashboard-1.0/xfdashboard.css</code> (or in
the other themes' <code>xfdashboard.css</code> files), change
<code>filter-monitor-windows: true;</code> to <code>filter-monitor-windows: false;</code></p>
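<p>To apply this change from the command line, here is a one-liner sketch (assuming the default theme path above):</p>
<pre><code># Flip the filter-monitor-windows flag in the default xfdashboard theme
# (adjust the path for the other themes).
sudo sed -i 's/filter-monitor-windows: true;/filter-monitor-windows: false;/' \
  /usr/share/themes/xfdashboard/xfdashboard-1.0/xfdashboard.css
</code></pre>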
The IT crowd, entropy killers2016-07-16T00:00:00+00:00https://florat.net/the-it-crowd-entropy-killers/<p><img src="https://florat.net/assets/images/blog-tech/it_crowd.png" alt="http://www.channel4.com/programmes/the-it-crowd"></p>
<p>I once asked myself: "how should we, computer scientists, define our job
in the most general sense of the term?".</p>
<p>Our fields are very diverse but, in my opinion, <strong>the greatest common
divisor is "entropy hunter"</strong>.</p>
<p>Everything we do is geared toward the same goal: decreasing the level of
complexity of a system by modeling it and transforming a bunch of
semi-subjective rules into a Turing machine program that can't execute
the indecisive.</p>
<p>Everything we do, including documentation, workshops with the
stakeholders and project management, and not only the programming
activities, should be about chasing doubt. Every word, every single line
of code should kill ambiguity.</p>
<p>Take design activities: most human thoughts are fuzzy. This is the
reason why waterfall (traditional) project management processes, where
all designs are done in one go, can't work: humans need to see
something to project themselves using it and go further in their
understanding.</p>
<p>Business designs are subjective in many ways, for instance:</p>
<ul>
<li>by omitting cases (or, less often, describing nonexistent cases)</li>
<li>by word ambiguity. Here's a small anecdote: last week, I worked
on a specification document written in French containing the word
"soit": "the file contains two kinds of data, soit data1 and
data2". This sentence could be understood in two opposite ways
because the French word "soit" means "either/or" but also
"i.e.". Hence, this sentence could mean at the same time "the
file contains data1 AND data2 kinds" or "the file contains data1
OR data2 kinds". I encounter this kind of uncertainty several times
a week.</li>
<li>by lacking examples. Examples are often much more demanding
and objectionable. They require a better understanding of the
system. Moreover, designing by example (like in BDD) tends to be
more complete because when you start to provide nominal examples,
you are tempted to provide the corner case ones. (read <a href="https://www.manning.com/books/bdd-in-action">BDD in
Action</a> by John
Ferguson Smart for more).</li>
</ul>
<p>A program, on the opposite, is deterministic. It is a more formal (and
modeled, thus reduced) version of a complex reality. The more cases and
rules a reality needs to be described entirely, the more complex the
program is, but it is still much simpler than the reality it describes.</p>
<p>The quality of all we do should IMO be measured in the light of the
amount of complexity we put into our programs. The less complexity we
use to model a system, the better the program is.</p>
Programming is craftsmanship and requires skills2016-06-12T00:00:00+00:00https://florat.net/programming-is-craftsmanship-and-requires-skills/<p>Many managers think that programming is easy; it's just a bunch of
<code>for</code>, <code>if</code>, <code>else</code> and <code>switch</code> clauses after all, isn't it?</p>
<p><strong>But coding is difficult because it is mainly about TAKING DECISIONS
ALL THE TIME</strong>.</p>
<p>Driving is easy because you don't have to take decisions about the way
to turn the steering wheel; walking is easy, you don't even have to
think about it; drilling a 10 mm hole into a wall is easy because the
goal is clear and because you don't have many options to achieve it...</p>
<p>Software is difficult and is craftsmanship because there are always many
ways to achieve the same task. Take the simplest example I can think
of: an addition function: we want to add <code>a</code> and <code>b</code> to get
<code>c=a+b</code>.</p>
<ul>
<li>Should I code this the object-oriented way ( <code>a.add(b)</code> ) or the
procedural way ( <code>add(a,b)</code> )?</li>
<li>How should I name this? <code>add()</code>? <code>sum()</code>? How should I name the
arguments?</li>
<li>How should I document the function? Are there project
conventions about it?</li>
<li>Should I return the sum or store it into the object itself?</li>
<li>Should I code this test first (TDD)? Write a UT afterwards, or write
no test at all?</li>
<li>Does my code scale well? Does it use a lot of memory?</li>
<li>Which visibility for this function? Private, public, package?</li>
<li>Should I handle exceptions here (a is null, for instance) or from the
caller?</li>
<li>Should the arguments be immutable?</li>
<li>Is it thread-safe?</li>
<li>Should this function be injected within a utility class?</li>
<li>If I'm coding in object-oriented style, is it
<a href="http://en.wikipedia.org/wiki/SOLID_%28object-oriented_design%29">SOLID</a>
compliant? What about inheritance? ...</li>
<li>... tens of other questions any good coder should ask himself</li>
</ul>
<p>If all of these decisions could be taken by a machine, coders would not
be required at all because we would just generate code (and we sometimes
do it using MDD technologies, mainly for code skeletons with low added
value).</p>
<p>We -coders- would then all be searching for a new job. But, AFAIK, this
is not the case, we are still needed, still relevant. All companies
still need costly software craftsmen !</p>
<p>Q.E.D. ;-)</p>
<p>I can't agree more with the <a href="http://manifesto.softwarecraftsmanship.org/">manifesto for software
craftsmanship</a>.</p>
Deployment scripts should always be refreshed from VCS prior execution2016-06-12T00:00:00+00:00https://florat.net/deployment-scripts-should-always-be-refreshed-from-vcs-prior-execution/<p>After a few months of writing continuous deployment scripts for a pretty
complex architecture (two JBoss instances, a Mule ESB instance, one
database to reset, a BPM server, each being restarted in the right order
and running from different servers), I figured out a good practice in
this field: scripts have to be auto-updated.</p>
<p>When dealing with highly distributed architectures, you need to install
this kind of deployment script (mostly Bash) on every involved node, and
it soon becomes very cumbersome and error-prone to maintain them on
every server.</p>
<p>We now commit them into a VCS (Subversion in our case), which is the
master location of the scripts. Then, we try:</p>
<ol>
<li>To check them out before running them, when possible. For instance, we used
a Jenkins job to launch our deployment process (written as a bash
script). The job is parameterized to check out the SVN repository for
the script before running it from the Jenkins workspace. This is
very convenient.</li>
<li>When this is not possible (for instance when the script should be
executed on another server than the CI server), we check out the
scripts from the Jenkins server and push them (using scp for
instance) to the targeted server before executing them (using ssh).</li>
<li>Sometimes, when the call must be asynchronous on another server, we
simply trigger a script by creating an empty file remotely. A very
simple croned bootstrap script (not refreshed itself) detects the
file change, updates the script (svn co) and runs it (see the sketch
after this list).</li>
</ol>
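<p>Here is a minimal sketch of such a croned bootstrap script; the trigger path, repository URL and script name are hypothetical:</p>
<pre><code>#!/bin/bash
# Hypothetical croned bootstrap: looks for a trigger file, refreshes
# the deployment script from SVN, then executes it.
TRIGGER=/var/tmp/deploy.trigger
WORKDIR=/opt/deploy

if [ -f "$TRIGGER" ]; then
  rm -f "$TRIGGER"   # consume the trigger so we run only once
  svn checkout --force http://svn.example.com/repo/deploy "$WORKDIR"
  bash "$WORKDIR/deploy.sh" >> /var/log/deploy.log 2>&1
fi
</code></pre>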
Eclipse DemoCamp 2015 Nantes recap2015-05-31T00:00:00+00:00https://florat.net/retours-eclipse-democamp-2015-nantes/<p>I had the pleasure of attending the Eclipse DemoCamp Nantes last
Thursday at the Hub Creatic (it is hard to find since it is not
signposted yet; it's the bright yellow building next to the Polytech
Nantes engineering school. It was the first time I went there and I must
say I was impressed, too bad it is not downtown).</p>
<p>We got an extremely eclectic yet fascinating panorama of the Eclipse
world in 2015, from the Internet of Things (IoT) to the software
factories of large companies, including computing for children. This
shows, if proof were needed, the traction of the Eclipse world as an IDE
of course, but above all as a platform.</p>
<h4>Gaël Blondelle from the Eclipse Foundation</h4>
<p>explained it very well: the strength of Eclipse is above all its
federating capacity: the Luna release was built by 400 developers coming
from 40 different companies.</p>
<p>The release train concept (simultaneous, annual delivery of all the
projects in June) ensures stability and quality integration between the
hundreds of plugins.</p>
<p>Another emerging notion is the Working Groups, which gather work by
theme, such as:</p>
<ul>
<li>
<p><a href="https://www.locationtech.org/">LocationTech</a> orienté SIG . Un des
projets les plus innovants de ce groupe est <a href="https://www.locationtech.org/proposals/mobile-map-tools">Mobile
Map</a>
générant des cartes directement calculées sur le smartphone.</p>
</li>
<li>
<p>IOT, federating the projects around the Internet of Things. Two
interesting projects: <a href="https://eclipse.org/smarthome/">Eclipse Smart
Home</a> for home automation and <a href="http://eclipse.org/eclipsescada/index.html">Eclipse
SCADA</a>, providing
SCADA (Supervisory Control and Data Acquisition) libraries and tools
used to monitor many kinds of hardware.</p>
</li>
<li>
<p><a href="https://science.eclipse.or/">Eclipse Science</a> pour des projets de
visualisation ou de traitements scientifiques.</p>
</li>
<li>
<p><a href="https://science.eclipse.or/">PolarSys</a> regroupe des projets pilotés
par Thales, le CEA, Airbus, Ericson... pour les projets de
modélisation autour de l'embarqué (Papyrus SysML, Capella...).</p>
</li>
</ul>
<h4>Laurent Broudoux and Yann Guillerm, architects at MMA</h4>
<p>then walked us through their deployment history and their multi-version
Eclipse strategy. Their IT department gathers 800 people, including 150
Eclipse users working on projects as varied as legacy (Cobol, Flex,
historical Java) and more innovative ones (mobile, Grails-based web
applications...).</p>
<p>In short, building a new version of the workbench (a single one until
2012) took up to 50 man-days, starting from a base Eclipse and
integrating/testing all the required plugins. The new strategy follows
two axes:</p>
<ol>
<li>Build a modular workbench in three layers: 1) a base
(seed): a pre-packaged Eclipse distribution; 2) community
plugins (Confluence, Mylyn, Subclipse...); 3) in-house
plugins, mainly around the Confluence groupware tools.</li>
<li>Differentiate the workbenches according to needs (6 variants, 4
families):
<ul>
<li>a legacy factory based on Galileo</li>
<li>a "High Tech" factory for "usage" projects (CMS, mobile) based
on the Grails GGTS distribution (soon integrating the
Android ADT technologies);</li>
<li>a "core business" factory based on Juno (I did not note
which Eclipse distribution is used as the seed) for JEE
applications (provides the technical bricks for SQL and NoSQL
persistence, UML modeling, the MDD tools (Acceleo,
ATD...), M2E for Maven integration...);</li>
<li>an architecture modeling factory to manage the application
portfolio, impact studies, project scenario planning...
This enterprise modeling is based on a metamodel derived from
TOGAF. This workbench is based on the SmartEA seed (from OBEO).</li>
</ul>
</li>
</ol>
<p>The factories' technology stack notably relies on Mylyn (task
management), Confluence (enterprise wiki), Maven, Chef for Configuration
Management, and SVN as the VCS.</p>
<h4>Stéphane Bégaudeau from OBEO</h4>
<p>then presented the development and integration tools of the NodeJS
ecosystem. Scaffolding by archetypes is done with Yeoman. The package
manager for JS libraries is npm. The angular.js, ember.js and
backbone.js libraries/frameworks are also available. Bower is a package
manager for JS libraries. Builds are done either with Grunt
(configuration-over-code model) or (preferred) with the more recent and
simpler Gulp (code over configuration). Min.js provides code
minification. For testing, there are Jasmine (BDD), Mocha and QUnit.
PhantomJS and CasperJS allow headless testing. Istanbul provides code
coverage analysis. JSHint performs style checks. Karma tests page
ubiquity (responsive design). Finally, Stéphane presented Eclipse Orion,
the Eclipse Web IDE based on NodeJS. Among other things, this IDE
provides code completion, syntax highlighting, comes with very good Git
support and can be extended with plugins.</p>
<h4>Hugo Brunelière from Atlanmod</h4>
<p>introduced us to the ARTIST research program, which offers model
engineering tools and methodologies to migrate a traditional application
into a cloud-friendly one. The €10M program is mainly developed by INRIA
and ATOS (Spain). The program offers:</p>
<ul>
<li>Methodology, through a handbook and a certification model.</li>
<li>Business and technical feasibility analysis tools, reverse
engineering, and optimization tools.</li>
</ul>
<p>Modeling is mostly done in stereotyped UML under Enterprise Architect.
XText-based textual DSLs are also used, as well as SIRIUS-based
graphical DSLs. M2T analysis is done with Modisco, and M2M model
transformation in ATL. Reporting is based on BIRT. The methodology is
tooled by EPF (Eclipse Process Framework). A cloud-friendly maturity
model has been developed: the
<a href="http://www.artist-project.eu/content/maturity-assessment-tool-mat">MAT</a>
(Maturity Assessment Tool) model.</p>
<h4>Stévan Le Meur from Codenvy</h4>
<p>gave us a demonstration of <a href="https://projects.eclipse.org/proposals/flare">Eclipse
CHE</a>, a SaaS platform
for developers based on Orion and Docker. A development workstation can
be provisioned very easily and then "deployed" as pure Web (this is the
"Codenvy Factory" concept). It is possible to select and then run Docker
containers hosting Tomcat, JBoss or other servers, locally or remotely.
A preview of a new GitHub integration feature (clone then pull request
in a few clicks without installing anything) finished blowing us
away.</p>
<h4>Maxime Porhel from OBEO</h4>
<p>presented a <a href="https://github.com/mbats/arduino">graphical programming
environment</a> for Arduino boards,
aimed at children. This very simple graphical DSL was of course
developed with SIRIUS (the Open Source version of OBEO Designer). A very
fun demonstration proved the concept on an Arduino AVR board included in
a <a href="http://www.dfrobot.com/index.php">DFRobots</a> kit. My
kids are going to be delighted :-)</p>
<p>Finally,</p>
<h4>Fred Rivard, founder of IS2T</h4>
<p>explained the economic and technological stakes of embedded Java. 100
billion microcontrollers costing $1 to $15 are currently deployed
worldwide. 25% of them run in "mainstream" environments: iOS, Android
and Linux. The rest is extremely fragmented across hundreds of
technologies that are still programmed in assembly. The entry ticket for
a project amounts to at least €1M, and the product must ship in less
than six months to be profitable against the competition. Big Data will
only be able to develop harmoniously if the "Little Data" (devices, the
IoT) feeding it becomes cheaper. IS2T aims to develop embedded JVMs that
are extremely fast and memory-light (worthwhile in terms of memory
starting from 100K of flash memory compared to classic code). All these
technologies are gathered around the MicroEJ platform. IS2T also
develops a "store" of embedded applications for this kind of hardware.
Fred entertainingly showed us many usage examples, such as this
connected watch that turns on in 48ms while it takes 500ms to raise your
arm to read the time: the watch can thus be switched off most of the
time, and its battery life is increased tenfold.</p>
<h4>A few asides during the sessions are worth noting</h4>
<p>about <a href="https://wiki.eclipse.org/Eclipse_Oomph_Installer">Oomph</a>, a new
installer for Eclipse plugins that also allows centralizing the
developers' settings.</p>
Agile Tour 2014 Nantes recap2014-10-15T00:00:00+00:00https://florat.net/retour-sur-l'agile-tour-2014-nantes/<p>I was lucky enough to attend the <a href="http://www.agilenantes.org/wp-content/uploads/2014/11/AGT2014_livret_sessions.pdf">Agile Tour
2014</a> day,
Nantes edition, at the École des Mines. Well organized, rich in
encounters and feedback, as every year...</p>
<p><img src="https://florat.net/assets/images/blog-tech/img_4308.jpg" alt=""></p>
<h4>The World Cafés</h4>
<p>An interesting innovation this year: the "World Cafés" held between the
conferences, during which a topic is discussed by an ephemeral group of
which a single member (the scribe) stays to consolidate the ideas, which
are then presented. A concept that fosters exchanges between the
participants. On this occasion, I notably talked with the manager of a
large mutual insurance company, who explained that she had trouble
finding agile application-maintenance services, while on our side we
still had trouble finding clients ready to go (truly) agile by putting
at the front of the project a PO (Product Owner) with decision-making
power, functional expertise and time to invest in their project.</p>
<h4>How to involve your clients in their projects?</h4>
<p>I simply loved this
<a href="http://www.slideshare.net/atnantes/agile-tour-nantes-2014-comment-impliquer-vos-clients-dans-leurs-projets?ref=http://www.slideshare.net/slideshow/embed_code/42646053">conference</a>,
both very concrete and deep. Benoit Charles-Lavauzelle (CEO of Theodo)
and Julien Laure (agile coach, scrum master) presented the history of
their company and how they (now) deliver successful Scrum projects. The
company, which used to develop fixed-price projects (B2B sites in
PHP/Symfony), was close to bankruptcy in 2011. Client dissatisfaction
was high because of the tunnel effect: once finished, the applications
did not match the need the client thought they had expressed. The
company then turned to Scrum, which it applied <em>by the book</em>. The
failure was huge, and the cause may seem obvious <em>a posteriori</em>: there
was no PO on the client side, hence no involvement. Without a PO, a
project sails blind. In 2013 the company decided to only do Scrum
projects with strong client involvement. Despite strong reluctance from
clients who did not want to be billed for time spent instead of a fixed
price, the company saw its revenue grow from €1.2M to €5M this year.
Clients came for the PHP/Symfony technical expertise and stayed for the
quality and the respect of deadlines (95% of the clients recommend the
company).</p>
<h5>How did Theodo manage to involve the client?</h5>
<ul>
<li>First, reassure the client: invite them to the sprint plannings,
estimate with them (planning poker) so they realize the technical
difficulties. Use short sprints (one week here).</li>
<li>Be transparent: Theodo precisely tracks each deviation from the
standard (see the
<a href="http://www.slideshare.net/atnantes/agile-tour-nantes-2014-comment-impliquer-vos-clients-dans-leurs-projets?ref=http://www.slideshare.net/slideshow/embed_code/42646053">slides</a>,
p. 28).</li>
<li>Burndown charts visible to the client, live, through Web tools.</li>
</ul>
<h5>What makes a good PO?</h5>
<ul>
<li>Choose the PO who (really) carries the project and has
decision-making power (beware of casting errors).</li>
<li>Permanent feedback with the PO is required: a weekly evaluation
system covering velocity and support.</li>
</ul>
<h5>How to get the PO to validate?</h5>
<ul>
<li>An electronic board with the tasks to validate:
<a href="https://trello.com/">Trello</a> (very simple for the client to
use).</li>
<li>A daily digest e-mail with all the pending questions and the
important URLs, with the n+1 in copy. Sent after the daily.</li>
<li>An agile self-assessment sheet (see the slides, p. 44) evaluates the
"technical" quality of the sprint and helps arbitrate between the
short and the long term.</li>
</ul>
<h5>Assessment</h5>
<ul>
<li>The PO works one to two days a week with the team, and that is
not too much!</li>
<li>A new problem emerges with large accounts: the distance to the PO
and the generalization of proxy-POs representing the PO on the
vendor side. A proxy-PO is better than nothing (but barely
better).</li>
</ul>
<h4>Collective intelligence serving innovation and industrialization</h4>
<p>Clément Duport (Alyotech) shared his vision of innovation. He explained
that in this domain, the Gordian knot of current IT policies lies in the
ambivalence between creativity, risk and freedom on the innovation side,
and harmonization, control and order on the industrialization side. This
leads to real schizophrenia (OK, at Capgemini we have the
Lab'Innovation, which partly solves this dilemma by offering this
innovation space to our clients). In fact, he explained that both are
needed to move forward: the right balance must be found between order
(to survive) and disorder (to advance). "To create is to remember what
has not yet happened" (Siri Hustvedt). Innovation can emerge from an
industrial approach, by recombining ideas.</p>
<h4>Designing as a team, without an architect</h4>
<p>Ly-Jia Goldstein shared her experience as a developer in a team
following the precepts of <a href="http://manifesto.softwarecraftsmanship.org/">software
craftsmanship</a> and XP.
She explained that a good XP development process relying on BDD, while
empowering the team members as much as possible (by making technical
decisions collegially), can do without a (software) architect. This has
many advantages, such as a better <a href="http://en.wikipedia.org/wiki/Bus_factor">bus factor</a>,
greater project reactivity and smoother refactoring. Good points were
raised. Nevertheless, from my point of view the conference only
addressed the software architect role. It seems to me that a general
architecture framework (urbanization, technical architecture, solution
catalog, industrialized framework, CI platform) is unavoidable in large
information systems, even if it is true that the teams, largely made up
of engineers, would benefit from being more proactive on the software
side and avoid situations such as this one:</p>
<p><img src="https://florat.net/assets/images/blog-tech/are-you-too-busy-to-improve2.png" alt=""></p>
My cloud, my way2014-08-25T00:00:00+00:00https://florat.net/my-cloud-my-way/<p>I just finished setting up my personal cloud storage. It has been a long
and difficult task, and I'd like to share with people having similar
requirements a bunch of useful information and pointers that would have
saved me a lot of time.</p>
<p><strong>Summary diagram</strong></p>
<p><img src="https://florat.net/assets/images/blog-tech/my_cloud.png" alt=""></p>
<p><em>Orange: HTTPS stream; Green: synchronization stream; Blue: Webdav
stream; Red: security system</em></p>
<p><strong>My requirements</strong></p>
<ul>
<li>Safe : strongly encrypted storage for data and backups, encrypted communications, easy to backup and restore. Client-side encryption is optional.</li>
<li>Ecological : reduced footprint, especially regarding energy consumption.</li>
<li>Cheap : free or very low price for large amount of storage space (200 GB to 1 TB).</li>
<li>Open : should run under the three main operating systems (Linux, Windows, OSX) ; HTTP proxy compliant; Available from anywhere using a simple web browser.</li>
<li>Fast : I mean less than 10 minutes to detect changes from my 110 GB / 90K files. Low CPU consumption on the client side and on the server side appreciated.</li>
</ul>
<p><strong>Kinds of files in the cloud storage</strong></p>
<p>The emerging file usage patterns I have identified for myself so far are :</p>
<ul>
<li>Exchange" : temporary storage to easily share files between computers. Synchronous writing. I use this typically when leaving the office to upload a document I want to work on at home from another computer and I want to make sure that the file is immediately uploaded into the cloud without having to wait for the next synchronization. Note that would be largely useless if I kept my computers online but I suspend them to save energy.</li>
<li>"Pure cloud" : primary source is the cloud. Can be read/written from any node but the preferred node in case of conflict is the cloud itself. I use it for few TODO notes that should be available from anywhere. The synchronization can be asynchronous.</li>
<li>"Archive" : same than "Pure cloud" but for archiving purpose only, few writes, few reads, files to kept. I use this to save some backups.</li>
<li>"Unidirectional copy" : asynchronous copy of a directory into another node for read-only when off-line. I use this to get a copy of some directories located on the cloud only but sometimes required when offline (for instance I want on my office laptop a read-only snapshot of my personal notes uploaded from my personal laptop).</li>
<li>"Unidirectional sync" : a directory is primary on a node (this node is preferred in case of conflict) and is asynchronously synchronized into the cloud and then possibly other nodes. The directory can be written only on the primary node. This is the main pattern I use for most of my data.</li>
<li>"Bidirectional sync" : Shared directory between several nodes. Any node can read or write. I don't use this mode because my experience showed that it comes at the cost of numerous conflicts : if you have to edit files from an offline computers (on the train for instance), you quickly get conflicts. It is often too late to properly reconsiliate them when you figured out the problem. I prefer to use the "Pure cloud" pattern for files that can be written by several nodes. In the "Pure cloud" pattern, however, you can only access these files read-only when offline because they will be overridden by the cloud version at the next synchronization.</li>
</ul>
<h2>The different streams of the infrastructure</h2>
<h3>HTTPS using a browser</h3>
<ul>
<li>Typical use case : I'm traveling and I want to watch/show a picture / an administrative asset etc.</li>
<li>Usage frequency : low</li>
<li>From where ? anywhere on the planet</li>
<li>Requirements : a browser and a login/password</li>
<li>Modalities : read-only, the files are browsed using the default Apache tree explorer.</li>
<li>My experience : the navigation is so fast (even on my CubieBoard and my pretty low upload bandwidth) that I find this useful to find a document even from home.</li>
</ul>
<h3>Remote filesystem mount point</h3>
<ul>
<li>Typical use case :
<ul>
<li>Copying some files to back up, when I want to be sure to upload a file into the cloud without waiting for the next scheduled sync (when leaving the office for instance)</li>
<li>Performing filesystem operations against the mount point (count files, check size recursively, remove directories...)</li>
<li>Editing a note file located on the cloud.</li>
</ul>
</li>
<li>Usage frequency : mounted at startup, pretty low effective usage (once or twice a day)</li>
<li>From where ? office, home</li>
<li>Requirements : a mounting software (I use davfs2; see the sketch after this list)</li>
<li>Modalities : works well even through an HTTP proxy. It works using a cache by design, so the local and the remote files may differ for a period of time; never use this for a synchronization (using rsync or unison for instance) because it doesn't preserve time, see below "Note about Webdav".</li>
<li>My experience : OK if you only use it for the occasional use cases described previously. Comes with a significant latency that increases the time of the 'df' command for instance. I plan to mount it only on demand and to stop mounting it automatically at startup.</li>
</ul>
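<p>For reference, a minimal sketch of such a mount with davfs2 (the URL and mount point are hypothetical):</p>
<pre><code># Mount the remote Webdav share over HTTPS; davfs2 prompts for the
# credentials unless they are listed in /etc/davfs2/secrets.
sudo mount -t davfs https://mydomain.com/dav /mnt/cloud

# Unmount when done, which also flushes the davfs2 local cache.
sudo umount /mnt/cloud
</code></pre>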
<h3>Local access to synchronized files</h3>
<ul>
<li>Typical use cases : doing real work (like development) at home or at the office that can't tolerate high latencies when saving files.</li>
<li>From where ? home, office.</li>
<li>Usage frequency : always on in background.</li>
<li>Modalities : sync every 1h30; the full sync of the entire collection takes from one to two minutes. Only the cloud contains all the data : on my office computer, I only store professional project files and I only synchronize them with the cloud; same for my home computer with the personal stuff.</li>
<li>My experience : works well, but the merge/conflict priorities must be clear and forged into the sync commands. Never use bidirectional sync (see "Patterns : Kinds of files in the cloud storage"), which can turn bad due to conflicts.</li>
</ul>
<h2>The solutions I tried during the last year</h2>
<ul>
<li><em>SparkleShare</em> : based on Git. As the website now states, it is good when only small storage is required (very good for that purpose), but Git is not designed for large binary storage, so SparkleShare rapidly becomes too slow to remain usable.</li>
<li><em>Wuala</em> : very good and clever, many features, client-side encryption but : 1) not open source, so we have to trust them that the client-side encryption code contains no backdoor (difficult to believe nowadays ;-) ) 2) expensive.</li>
<li><em>Owncloud</em> : pretty good, I now consider release 5 a serious solution; it meets all my criteria BUT is soooooo slow (on my CubieBoard, 1 GHz ARM, SATA3 adapter)... Even when using a finely tuned MySql database (asynchronous IO among other things) instead of the packaged SQLite, it becomes very slow after a few tens of thousands of files, mainly because of the high number of SQL queries it has to perform (not only when using the Web GUI but also when using the Webdav interface). The synchronization client 1.4 (for Seven and Ubuntu) is very slow (it takes more than one hour to detect changes, or fails in timeout most of the time) and takes a significant amount of CPU (10 or 20%) even on powerful computers (i7, 4 cores). After an extensive use of Owncloud during several months I had to try another thing, too bad... I may give it another try in several years.</li>
<li><em>Hand-crafted</em> solution : I finally decided to solve the problem the Unix way, i.e. with many small and powerful specialized tools chained one to the other, and it finally works even better than initially expected. See details below.</li>
</ul>
<h2>Not tested but not that far from my requirement</h2>
<ul>
<li>Client-side encryption with EncFS + Dropbox/Hubic/Google Drive or other free storage services. The main problems are 1) the cost of the storage, free plans provide only a few GB 2) the web GUIs are unusable because all directory and file names are encrypted. You'll find a lot of tutorials and blogs about this solution on the Web.</li>
<li>Seafile : not tested because it is not compatible with HTTP proxies; looks promising on paper.</li>
</ul>
<h2>Features I don't care about (but you may do)</h2>
<ul>
<li>Directory/file sharing and groupware features like concurrent editing : most modern tools like Owncloud support this.</li>
<li>Version control (Owncloud is bundled with a plugin for that purpose). I still use an SCM (Git) for some directories (like source code or text notes) on the original source directory (and sometimes on the replicated locations), but I ignore the .git directories (which contain the local repository) so that the source and the destination have their own local repositories that don't collide (a git local repository is not intended to be shared among several computers).</li>
</ul>
<h2>Note about Webdav</h2>
<ul>
<li>Webdav is an ancient technology re-emerging thanks to the cloud storage trend; most cloud providers come with Webdav connectivity.</li>
<li>The good :
<ul>
<li>It is based upon HTTP, so it is HTTP-proxy compliant out of the box.</li>
<li>A distant Webdav service can be mounted under Linux (using davfs2) or under the other OSes.</li>
</ul>
</li>
<li>The bad : my conclusion is that this technology is not really reliable enough to build a cloud meeting my requirements :
<ul>
<li>Times and rights are not preserved upon copy.</li>
<li>Mainly due to the previous restriction, synchronization (using rsync or unison for instance) is not reliable and is even dangerous.</li>
<li>I sometimes observed (using davfs2) that some files existing on the server side are not visible from the client (even with a regular name).</li>
<li>Webdav requires a cache on the client and comes with write latencies, often of several seconds or tens of seconds.</li>
<li>Installation is often cumbersome, especially under Windows XP/Vista/Seven, which come with various bugs, so we need to change the Windows registry (I never managed to make it work under Seven).</li>
<li>Webdav has a bad reputation when it comes to security, but "Secure" Webdav, i.e. Webdav + Basic/Digest authentication under HTTPS, looks enough (I'm not a security expert though).</li>
</ul>
</li>
</ul>
<h2>Note about the hardware, a CubieBoard 1</h2>
<ul>
<li>
<p>Excellent lightweight device : a bit more expensive than a Raspberry
but more powerful (1Ghz ARM CPU), more memory (512MB) and a SATA3
adapter to avoid using a slower USB connector.</p>
</li>
<li>
<p>My hdparm stats :</p>
<p>Timing cached reads: 796 MB in 2.00 seconds = 398.06 MB/sec
Timing buffered disk reads: 326 MB in 3.00 seconds = 108.52 MB/sec</p>
</li>
<li>
<p>Note that a CubieBoard 2 has recently been made available; the main
evolution is a dual-core ARM CPU. It looks good, but my CubieBoard 1
still looks sufficient for me alone.</p>
</li>
<li>
<p>The measured power consumption including the transformer goes from
3W (100% idle) to 6W (100% CPU + extensive IO usage)</p>
</li>
<li>
<p>The (excellent) tutorial I followed to install Debian is <a href="http://linux-sunxi.org/Cubieboard/Installing_on_NAND">here</a></p>
</li>
<li>
<p>The bad :</p>
<ul>
<li>I had a lot of IO failures due to the 2.5'' hard disk lacking power. I finally found a solution : in addition to the regular 5V/0.5A power jack cable, I had to plug another USB cable into the female mini-USB port : using this double power supply, the SATA connector works like a charm.</li>
<li>The CPU is enough for a single person's remote access (Apache, on-the-fly encryption, unison...) but not enough to compress tens of GB of data when doing backups. I have to back up using a tar method; even gzip is far too slow and would take days (~1MB/sec). It's still OK because I have a very large volume of free disk.</li>
<li>I regularly back up the system (about 1GB) using a microSD card stored in a safe place far from the server.</li>
</ul>
</li>
</ul>
<h2>Note about EncFS</h2>
<p>EncFS is a filesystem encryption program. It maps a "real" filesystem
with encrypted files to a userspace 'in memory' filesystem. It is very
simple to use, stores the files encrypted file by file, and even the
directory and file names are encrypted. The encryption is very strong
when using the paranoia mode ("Cipher: AES Key Size: 256 bits PBKDF2 with 3
second runtime, 160 bit salt" according to the man page).</p>
<ul>
<li>If an attacker or a burglar physically steals the server, he has to unplug it and thus shut it down. Without the password, the data remains safely encrypted on the hard disk and is lost to the attacker.</li>
<li>Note that EncFS doesn't actually use your password to encrypt the files; it actually uses a self-generated internal password, itself encrypted using your password. This is cool because this way, you can change the filesystem password (EncFS provides an admin command for that; see the sketch after this list) and no file actually has to be encrypted again.</li>
<li>Another cool thing with EncFS is the fact that even root can't access the filesystem; only the user that mounted the filesystem into its userspace (www-data when used in an Apache context) is able to.</li>
<li>A last cool thing is the fact that all the files are already encrypted for backup : one doesn't have to encrypt the files during the backup process (fortunately, given the size of the data and my server CPU, it would simply be impossible in my case). The backup files can be stored on a regular filesystem as the data is already encrypted. Moreover, the per-file EncFS encryption mechanism allows incremental backup (mandatory as well in my case).</li>
<li>I also use EncFS to store my local files on the laptop, so the data is never available in clear anywhere in the process (encrypted on my laptop, encrypted during the transfer using a strong SSL encryption and finally encrypted on the server side).</li>
<li>The CPU overhead is minor. The EncFS process reaches some 60-80% CPU usage on the (fanless) server CPU during a short period of time when accessing files, but I still get a lot of IO wait, so the disk access is actually a greater speed limiter.</li>
<li>The only (minor) drawback is the fact that one has to provide a password to mount the filesystem (done only once, when booting the server).</li>
</ul>
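<p>A minimal sketch of the corresponding EncFS commands (the directory paths are hypothetical):</p>
<pre><code># Create (on first run) or mount the encrypted tree; --paranoia selects
# the strongest preset at creation time, and a password is prompted for.
encfs --paranoia /data/.encrypted /data/clear

# Change the filesystem password: only the internal key is re-encrypted,
# the files themselves are untouched.
encfsctl passwd /data/.encrypted

# Unmount the clear view (the on-disk data stays encrypted).
fusermount -u /data/clear
</code></pre>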
<h2>About Unison</h2>
<p>Unison is an excellent tool to synchronize two locations. It is simpler
and more powerful than rsync for that specific purpose. I initially
tried to synchronize the local files on my laptop with the Webdav mount
point, but it was a disaster, for the reasons I explained before.</p>
<ul>
<li>Unison can also work over SSH but requires a unison executable on the server side as well. This way, I assume, Unison detects changes on the server and sends only a final digest over SSH; it is impressively fast.</li>
<li>I use cron or bash scripts with sleep loops for the synchronization scheduling.</li>
<li>I configure unison to ignore paths in order to synchronize partial parts of some directories located on the cloud into different nodes (see the sketch after this list). For instance, let's say that I work at home on project 'p1' and at work on project 'p2'; I want to get :
<ul>
<li>On the cloud, all the projects : /mydata/myprojects/p1, /mydata/myprojects/p2</li>
<li>On my personal laptop : /home/me/p1 (only 'p1' files, no 'p2' file)</li>
<li>On my office laptop : /home/me/p2 (only 'p2' files, no 'p1' file)</li>
</ul>
</li>
</ul>
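<p>A minimal sketch of the corresponding unison invocation for the personal laptop; the local root <code>/home/me/projects</code> and the server name <code>mycloud</code> are hypothetical:</p>
<pre><code># Sync the local projects root against the cloud master over SSH,
# ignoring the 'p2' subtree so that only 'p1' lands on this laptop;
# -batch avoids interactive prompts (suitable for cron).
unison /home/me/projects ssh://mycloud//mydata/myprojects \
  -ignore 'Path p2' -batch
</code></pre>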
<h2>The technical stack in use</h2>
<ul>
<li>
<p>Apache with SSL and Webdav modules</p>
<ul>
<li>The same Apache virtual host serves both Webdav and plain HTTPS; the plain HTTPS browsing is obviously read-only, while the Webdav one can be written to or mounted (see the sketch after this list).</li>
<li>I use an RSA 4096-bit certificate to make the communication safer.</li>
<li>The HTTPS virtual host is protected using a Digest Authentication password.</li>
<li>I use port 80 (for an HTTP tunnel) and port 443 (for Webdav and plain HTTPS) because HTTP proxies usually only allow them. Using an HTTP tunnel allows me to synchronize my directories even behind an HTTP proxy when required.</li>
</ul>
</li>
<li>
<p>Unison for file synchronization.</p>
</li>
<li>
<p>I use several well-known security systems, including an iptables firewall restricting every port but 80 and 443. Fail2ban is configured to ban attackers that failed to log into the SSH or Apache services.</p>
</li>
<li>
<p>http-tunnel is a very simple HTTP tunneling tool that works very well. It is available as a standard Debian package as well. I had a problem using it with unison behind an HTTP proxy though, due to packet length. The solution for me has been to set the -c option to a high value :</p>
<pre><code>htc **-c 100M** -F 1058 mydomain.com:80 .
</code></pre>
</li>
<li>
<p>The cloud and laptop local data is stored encrypted using EncFS.</p>
</li>
<li>
<p>The server files are backed up using the excellent tool 'backup-manager'. EncFS makes the backup security-free, as I explained in the EncFS section. Naturally, the backup files regularly have to be saved onto an external disk, physically protected and located far away from the server in case of disaster or theft.</p>
</li>
</ul>
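<p>On a Debian-based server, wiring up the Apache pieces above might look like the following sketch; the realm "cloud", the user "me" and the password file path are hypothetical:</p>
<pre><code># Enable the Apache modules used by the virtual host (SSL, Webdav,
# Digest authentication).
sudo a2enmod ssl dav dav_fs auth_digest

# Create the Digest Authentication password file, then reload Apache.
sudo htdigest -c /etc/apache2/digest-passwords cloud me
sudo service apache2 reload
</code></pre>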
<h2>Final thoughts</h2>
<p>I finally met all my requirements :</p>
<ul>
<li>Very cheap (disk price : 0.08€/GB as of today + 5.50€/year of electricity for an average consumption of 4W + 60€ for the CubieBoard =~ 22€/year for 1TB of storage over a 5-year amortization period)</li>
<li>Pretty safe solution. By security I mean mainly confidentiality, authentication and backup. All the data is stored at home, away from large Internet companies.</li>
<li>Large storage space (1 TB).</li>
<li>Very fast : synchronization usually lasts less than 2 min and has no significant effect on the client or server CPU. It performs several orders of magnitude better than every solution I tried before.</li>
</ul>
How I manage my passwords2014-08-16T00:00:00+00:00https://florat.net/how-i-manage-my-passwords/<p>With so many website and system credentials to remember,
settling on an acceptable password policy is challenging. After
years of trial and error, I'm approaching something I eventually find
convenient and safe enough.</p>
<hr>
<p><strong>UPDATE 2020</strong></p>
<p>For type-1 passwords according to my categorization, I now use <a href="https://www.ssi.gouv.fr/uploads/IMG/pdf/NP_MDP_NoteTech.pdf">this method</a> from ANSSI (French Security Agency) : create a memorable passphrase and use only the first letters. For instance : <code>When I go to work, I always stop at Bob's</code> becomes : <code>wig2w,IasaB's</code> .</p>
<hr>
<h3>What I learned</h3>
<ul>
<li>
<p>Don't use the same password for all credentials : if one is
cracked, attackers will go straight to the similar resources (like
Facebook after Twitter) and gain access to them. Using small
variants is not enough, especially if the pattern is obvious (like
<code>mypassword_facebook</code> and <code>mypassword_twitter</code>).</p>
</li>
<li>
<p>Change your passwords often (I still have some work to do
here).</p>
</li>
<li>
<p>Use passphrases or long passwords, because the most important thing for
a credential is its length, not its estimated complexity. Check out
this <a href="https://www.grc.com/haystack.htm">excellent website</a>; it
explains it way better than I would.</p>
</li>
<li>
<p>Humans are extremely predictable; never trust yourself when choosing
a password, only trust randomness and maths.</p>
</li>
<li>
<p>Never use a password generated on the Web: you never know if the website is safe or if the communication between you and this website is (even under HTTPS, the communication can be intercepted and stored for further analysis, by malicious governments for instance).</p>
</li>
<li>
<p>Don't trust password "strength" evaluators that are based upon
the kind of characters, their case, the special characters presence
and so on but doesn't deal with emerging patterns that would
dramatically reduce the entropy and makes the password trivial to
guess. For example, <code>aBcDeFgHiJ1234567</code> is evaluated as very strong
but would be broken down in minutes by any attacker.</p>
</li>
<li>
<p>Only rely on randomness from the real world (like using dice or
coins), not on pseudo-random number generators (like <code>/dev/urandom</code>
under Gnu/Linux). However, I feel free to use random number
generators when available ( <code>/dev/random</code> under Gnu/Linux). OK, I
know it is less safe than using physical stuff, but I feel it's an
acceptable trade-off between security and convenience.</p>
</li>
<li>
<p>Don't let your browser remember the most important passwords, and
perform regular cleanups of every password you already stored into
it. However, I for one make exceptions for low to moderate
importance passwords GIVEN THAT 1) I NEVER leave my computer
unlocked, even for a few minutes 2) all my personal data is stored
on FDE or LUKS/dm-crypt encrypted volumes.</p>
</li>
<li>
<p>There are two types of passwords :</p>
<ul>
<li>
<p>[Type 1] The passwords you need to remember, either because you often need them (like logins on your systems) or because you must remember them when you don't have your computer with you, when traveling for instance (Paypal, online bank, webmail passwords, etc.). You should create a strong yet memorable passphrase for each of them. The best method to achieve this is probably the <a href="http://world.std.com/~reinhold/diceware.html">Diceware</a> method. If you aren't already familiar with it, I can't advise you enough to read it and its <a href="http://world.std.com/~reinhold/dicewarefaq.html">FAQ</a>.</p>
</li>
<li>
<p>[Type 2] The passwords you don't need to remember because you don't use them often. In this case, free your mind and store them using a wallet program like <a href="http://keepass.info/index.html">keepass</a> or an encrypted raw text file. Don't use a proprietary program that could contain backdoors, but only Free/Open Source software.</p>
</li>
</ul>
</li>
<li>
<p>Not all passwords have to be equally safe. The safer a password is, the more difficult it is to remember and the longer it takes to type, hence altering the user experience. When dealing with 'stored' passwords, you should always use very long and complex passwords because there is no inconvenience in doing so in this case. You can use a (local) generator of very long random strings with many numbers, different letter cases and special characters, because you don't have to remember them anyway but only to copy/paste them from the wallet (BTW, most wallets come with a convenient feature that pushes the password temporarily into the clipboard, and can generate new passwords as well). The length and complexity of the passwords to remember, for their part, can be calibrated according to different levels. For example : 4 diceware words for a low/medium security level, and 6 words plus case/special character variations for the most sensitive credentials.</p>
</li>
<li>
<p>Use a personal <a href="http://en.wikipedia.org/wiki/Salt_%28cryptography%29">salt</a> (a salt is a string we add to a password to make sure that an attacker cannot use pre-computed rainbow tables and break your password in seconds). Most websites don't actually store your password but only an MD5/SHA-1 hash of your password along with a salt set on a per-user basis. This is the current state of the art, but it is not always applied, and you can't expect all the websites you use to enforce this basic rule. Using your own salt is an additional precaution for the case where a website stores password hashes without a salt. Of course, it is useless if the website stores the password in clear text.</p>
</li>
</ul>
<h3>The errors I made</h3>
<ul>
<li>
<p>I used online password generators. <a href="http://www.deadboltpasswordgenerator.com/">Some</a> are cool because
they map easy-to-remember passphrases to strong passwords. So,
what's the problem? 1) You have to come back to their website
every time you need the password; 2) as before, you can't
trust the website or the communication anyway; 3) What if the online
service shuts down? Answer: you lose all your passwords (you
don't even know the algorithm they use to map a passphrase to a
strong password, so you can't rewrite it by yourself to get back
your passwords from the passphrases you still remember).</p>
</li>
<li>
<p>I tried various methods to remember my passwords. Some are based upon a base password to which we apply a transformation (like a->@,
i->! and so on) and which we specialize according to the website
(like <code>MyP@wd-f@cebooK</code> and <code>MyP@wd-Tw!tteR</code>). What's wrong with
that? 1) The special character substitution is often hard-coded
into the attacker's dictionary and has nearly zero advantage in
comparison with the initial character; 2) imagine that in my case an
attacker cracks my Facebook password: do you think it will be
difficult for him to find the Twitter one once he knows my pattern?</p>
</li>
</ul>
<h3>The final solution I set up</h3>
<p><em>Disclaimer : while most of the tools or methods exposed here are
proven, my own adaptations may prove wrong; I don't claim to be
a security expert.</em></p>
<h4>For type 1 passwords</h4>
<p>I use the raw Diceware method or a small free software password
generator running locally on my desktop, without any external dependency
and made of only a few hundred lines of code (that I checked). I also
hacked the program to use /dev/random instead of /dev/urandom. The
program uses the diceware 8k dictionary. For a medium security level, I
use a three-word Diceware scheme + a salt. For high security passwords,
I use a five-word Diceware scheme (that I'll call the 'base') + a
salt + a random number/special character pattern. To increase the
passphrase entropy, I use the following personal method*. The basic
idea is to use the passphrase base itself to add entropy without adding
things to remember, like the positions of special characters :</p>
<ul>
<li>The salt is made of the concatenation of each first letter of the Diceware words and a '+' character.</li>
<li>The five Diceware words are expressed in lower case without separators (never use spaces between words because of the noise made by the space bar; you would give a significant hint to a spy).</li>
<li>A special character + three numbers (like '587') that I'll have to remember in addition to the base passphrase. The location of the pattern is given by this basic algorithm : the word number is given by the alphabetical order of the base password, then the location of the pattern inside the matching word is given by the alphabetical order of the word's own letters (I don't detail the boundary limit cases here).</li>
<li>Example of resulting password for the Diceware pass phrase 'dec scan labile deify shafer' : it becomes 'dslrs+decscanlabiledeif'587yshafer' ('d' of 'dec' = 4, so the pattern is included in the 4th word, deify, and 'd' in 'deify' gives the 4th position in the 'deify' word).</li>
</ul>
<p>(*) The Kerckhoffs security principle states that knowing the security
tools or methods in use doesn't provide any significant advantage to
the attacker; I hope this is still the case here.</p>
<h4>For type 2 passwords</h4>
<p>I don't much like wallet programs because I find them too 'formal'
and too cumbersome for adding new entries. I finally use a small
free software HTML/Javascript page that I run locally. My passwords are AES-256
encrypted in a file I open using any text editor. Then I paste the
encrypted text into this web page, type the master password, and the
clear text with the passwords is then displayed in a text area, ready for
copy/paste or CTRL-F searches. I read the Javascript code to check for
backdoors and hacked it slightly, adding a timer to clear the password
and the clear text area after a short delay, so the password information
is hidden automatically even if I forget to close the browser tab.</p>
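<p>For readers who prefer the command line, here is a minimal sketch of an equivalent workflow using plain openssl (this is an assumption of mine, not the page I use; the file names and search term are hypothetical, and the -pbkdf2 flag requires OpenSSL 1.1.1+):</p>
<pre><code># Encrypt the clear-text password list with AES-256; the master
# password is prompted for. -pbkdf2 strengthens the key derivation.
openssl enc -aes-256-cbc -pbkdf2 -salt -in passwords.txt -out passwords.enc

# Decrypt to stdout for a quick search, without writing the clear
# text to disk ("github" is just a hypothetical search term).
openssl enc -d -aes-256-cbc -pbkdf2 -in passwords.enc | grep -i github
</code></pre>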
Mystical bugs2014-08-16T00:00:00+00:00https://florat.net/les-bugs-mystiques/<p>The vast majority of the bugs we suffer from every day find a rational
explanation rather easily. Another category, fortunately extremely
rare, is that of the so-called "mystical" bugs. Take the logs of any
complex and heavily loaded system, such as an application server or a
transaction monitor: I predict that, over a long enough period, you will
always find strange, non-reproducible error messages there...</p>
<h1>Definition</h1>
<p>I call a "mystical bug" a non-reproducible bug, that is, a bug
occurring randomly and for unknown reasons.</p>
<h1>Etymology</h1>
<p>This term, coming from computing slang and identical in several
languages ("mystical bug" in English), perfectly conveys the esoteric
side of these oddities.</p>
<h1>The paradox</h1>
<p>Yet what could be more antinomic than, on one side, computing and
programming, born from mathematics (a program being a mathematical
formula), and on the other the opaque world of the Uncertain, of Chance,
of Fate? It is nevertheless difficult in computing to generate
randomness at will: the algorithms of pseudo-random generators are
complex and use a lot of data from the computer's environment, such as
the time or mouse movements, for an often mediocre result (identical
sequences frequently appear). On the opposite, mystical bugs seem to
appear randomly because that is their nature: they are non-reproducible
and can occur at any time, often from initial situations that look
identical a priori.</p>
<h1>Potentiality of mystical bugs</h1>
<p>A computer scientist once said that some bugs may statistically occur
once a century, that is, over a period at least 5 to 10 times longer
than the lifespan of the program itself. I think this is accurate. Some
mystical bugs may never show up and remain lurking deep inside obscure
tests or loops whose conditions are so improbable that they will never
actually occur, although they potentially exist.</p>
<h1>The causes of mystical bugs</h1>
<p>A mystical bug can occur, among other reasons:</p>
<ul>
<li>
<p>Because of the data being processed: for instance, a primitive
executed with extremely unusual arguments.</p>
</li>
<li>
<p>Because of a physical problem, such as several bits flipping
simultaneously in RAM, a read error on a physical medium, a power
micro-outage, a hardware bug in the CPU or in other electronic
components...</p>
</li>
<li>
<p>Because of the code itself: a bug in the compiler or in a virtual
machine, a bug in the language, inappropriate use of special
functions... I have seen comments in Pro*C sources like "Do not
remove this comment or the program crashes" that were not lying,
and special or hexadecimal-encoded characters producing unexpected
effects at compile time or at runtime...</p>
</li>
<li>
<p>Because of memory management: code overwriting data memory segments
or the reverse. This kind of bug is often the basis of the
"exploits" crackers use to break into secured systems.</p>
</li>
<li>
<p>Because of passive problems (producing no bugs on their own) in
several modules or APIs which, used together, combine to give rise
to a new, active bug.</p>
</li>
<li>
<p>Because of multithreading: in my opinion, the main source of
mystical bugs in contemporary languages such as Java. Despite the
locking primitives these languages provide, it is often hard to
completely rule out unexpected situations and concurrent access to
shared in-memory resources.</p>
</li>
<li>
<p>Because of transaction management: the concurrency-control (ACID)
tools must be used correctly to avoid unexpected events when
accessing a resource such as a database, a MOM, an external system,
etc. This kind of component can behave differently depending on the
vendor (a concurrent access in one database may raise an error
while another waits for the lock, for instance).</p>
</li>
<li>
<p>Because of inter-transaction synchronization problems:
imagine that user A fetches a customer record in its own
transaction, reads it for several minutes, then decides to modify
field X. In the meantime, user B has modified field Y in an update
transaction that completed. If the update transaction sends all the
data at once (bulk mode), user B's update is overwritten by user
A's; yet each transaction ran correctly and no concurrent access
was ever detected.</p>
</li>
<li>
<p>Because of "deadlocks": a deadlock is the permanent blocking of
two tasks occurring in the very particular case of concurrent
access to two distinct resources in a precise order (thread 1 takes
resource A then B while thread 2 takes resource B then A at the
same time; see the sketch after this list).</p>
</li>
<li>
<p>Because of many other reasons, such as rare side effects, etc.</p>
</li>
</ul>
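<p>To illustrate the deadlock case, here is a minimal Java sketch (my
own example) of the exact interleaving described above. Run it a few
times: it will usually hang forever, but not always, which is precisely
what makes such bugs mystical.</p>
<pre><code>public class DeadlockDemo {

    private static final Object resourceA = new Object();
    private static final Object resourceB = new Object();

    public static void main(String[] args) {
        Thread t1 = new Thread(new Runnable() {
            public void run() {
                synchronized (resourceA) {      // thread 1 takes A...
                    pause();                    // widen the race window
                    synchronized (resourceB) {  // ...then waits for B
                        System.out.println("t1 done");
                    }
                }
            }
        });
        Thread t2 = new Thread(new Runnable() {
            public void run() {
                synchronized (resourceB) {      // thread 2 takes B...
                    pause();
                    synchronized (resourceA) {  // ...then waits for A
                        System.out.println("t2 done");
                    }
                }
            }
        });
        t1.start();
        t2.start();
    }

    private static void pause() {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
</code></pre>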
<h1>Conclusion: the program's secret garden</h1>
<p>The mystical bug gives information systems a new dimension and seems
to let a kind of chaos emerge, a consciousness beyond the frame created
by humans. Mystical bugs hide in the program's secret garden, out of
reach of the developers' thought and understanding. They are
irritating, not only because of the bug itself but above all because of
the feeling that the program is hiding something, that it holds the
mysterious power to pull one of these bugs out of its hat at will.</p>
Undocumented Oracle PreparedStatement optimization2014-08-15T00:00:00+00:00https://florat.net/undocumented-oracle-preparedstatement-optimization/<p>We just got a 20% response-time gain on a 600+ line query under Oracle.
Our DBA noticed that queries were faster when launched from SQLDeveloper
than from our JEE application using the Oracle 11g JDBC driver. We
looked at the queries as they actually arrived at the Oracle engine and
they were of the form:
<code>SELECT... WHERE col1 LIKE :myvar1 OR col2 LIKE :myvar2 AND col3 IN (:item1,:myvar2,...)</code>
and not
<code>SELECT... WHERE col1 LIKE :1 OR col2 LIKE :2 AND col3 IN (:3,:4,...)</code>
as is usual when using PreparedStatement the regular way.</p>
<p>Indeed, every PreparedStatement documentation I'm aware of, beginning
with <a href="http://docs.oracle.com/javase/tutorial/jdbc/basics/prepared.html">the one from
Sun</a>,
states that we have to use <code>?</code> to represent bind variables in
queries. These <code>?</code> are replaced by <code>:1</code>, <code>:2</code>, <code>:3</code>... by
the JDBC driver. So the database has no way to know, in our case, that :2
and :4 carry the same value. This information is lost.</p>
<p>We discovered that we can use PreparedStatement by providing queries with
named bind variables instead of <code>?</code>. Of course, we still have to set
the right value using the <code>setXXX(int position, value)</code> setters for every
occurrence of a bind variable in the query. Queries then arrive at Oracle
just as they do from SQLDeveloper, with named bind variables.</p>
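<p>A minimal Java sketch of the trick as described above (the table and
variable names are invented for the example; per this post, it likely
only works with the Oracle JDBC driver):</p>
<pre><code>import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class NamedBindDemo {
    // ':name' appears twice in the query text, so Oracle can see that
    // both occurrences carry the same value (information lost with '?').
    static ResultSet findClients(Connection connection) throws SQLException {
        String sql = "SELECT col1, col2 FROM t_client"
                + " WHERE col1 LIKE :name OR col2 LIKE :name";
        PreparedStatement ps = connection.prepareStatement(sql);
        // Values are still set by position, once per occurrence:
        ps.setString(1, "DUPONT%"); // first occurrence of :name
        ps.setString(2, "DUPONT%"); // second occurrence, same value
        return ps.executeQuery();
    }
}
</code></pre>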
<h5>OK, but what's the deal with all this?</h5>
<p>I'm not sure, but I think this optimization may allow the Oracle
optimizer to be cleverer, especially on queries with redundant parts.
It is especially good for queries with duplicated sub-SELECTs whose IN
conditions all contain the same list of items. Maybe Oracle creates
on-the-fly WITH clauses or similar optimizations in this case?</p>
<p>Note that this optimization may only work with Oracle and is probably
only useful for very large or redundant queries, so I don't recommend it
in most cases. AFAIK, neither Hibernate nor Spring-JDBC implements this
optimization.</p>
How to get bind variables values from Oracle2014-05-11T00:00:00+00:00https://florat.net/how-to-get-bind-variables-values-from-oracle/<p>If you have already used JDBC prepared statements, you know what bind
variables are: the '?' in the query, as in:
<code>SELECT col1,col2 FROM t_table WHERE col1 IN (?,?,?) AND col2 = ?</code>. For
the record, all compiled queries with the same number of '?' are
cached by Oracle and are hence (most of the time) faster to execute. But
how do you debug the passed values? This is often valuable, like
yesterday when one of our services tried to insert a value too large
for a column (a 4-digit integer into a <code>NUMBER(5,2)</code>).</p>
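<p>For the record, a minimal Java sketch of such a prepared statement
(the table and values are hypothetical):</p>
<pre><code>import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PositionalBindDemo {
    static ResultSet query(Connection conn) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
                "SELECT col1, col2 FROM t_table WHERE col1 IN (?,?,?) AND col2 = ?");
        // Each '?' is bound by position; the values never appear in the
        // SQL text, which is exactly why they are hard to see when debugging.
        ps.setString(1, "A");
        ps.setString(2, "B");
        ps.setString(3, "C");
        ps.setBigDecimal(4, new BigDecimal("123.45")); // fits a NUMBER(5,2)
        return ps.executeQuery();
    }
}
</code></pre>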
<p>There are several ways to achieve this; one is using a 'wrapper' JDBC
driver (like log4jdbc) that audits and logs the values, but it's a bit
intrusive.</p>
<p>A very simple, non-intrusive way for a one-off need is to query the
<code>v$sql</code> view, the Oracle internal log. A sample query is given below
(<a href="https://stackoverflow.com/questions/14217461/how-to-find-parameters-in-oracle-query-received-from-vsql/14217618#14217618?newreg=d1017924cf7748119a11379f0b9e65ff">source: Stack
Overflow</a>):</p>
<pre><code>select s.sql_id,
bc.position,
bc.value_string,
s.last_load_time,
bc.last_captured
from v$sql s
left join v$sql_bind_capture bc
on bc.sql_id = s.sql_id
and bc.child_number = s.child_number
where s.sql_text like 'delete from tableA where fk%' -- or any other method to identify the SQL statement
order by s.sql_id, bc.position;
</code></pre>
<p>It works like a charm!</p>
Move to Github done smoothly2014-02-01T00:00:00+00:00https://florat.net/move-to-github-done-smoothly/<p>The Jajuk issue tracker and the Git repository have now moved to GitHub
(see the previous article for context).</p>
<h3>Repository move</h3>
<p>Obviously, and by nature, the Git repository move has been very simple. I
just had to drop my previous origin (pointing to the Gitorious project
URL), add the new GitHub origin and push all my branches. The push
of the master branch took around 30 minutes and the other branches
(develop, hotfix) almost no time at all. Note that the <code>-u</code> option used
in the push command recreates the upstream tracking references.</p>
<pre><code>git remote remove origin
git remote add origin git@github.com:jajukteam/jajuk.git
git push -u origin master
</code></pre>
<p>The only problem occurred when dropping our Gitorious repository (error
500 -> timeout?)</p>
<h3>Issue tracker move</h3>
<p>I tried several Trac-to-GitHub migration tools, most of which didn't
work, and finally settled on
<a href="https://github.com/trustmaster/trac2github">trac2github</a>. It is written
in PHP, reads the Trac database (MySQL, PostgreSQL and SQLite are
supported) and calls the GitHub REST API v3 to create the tickets. It
creates the milestones, labels, tickets and comments with good defaults.
It had some bugs when working with a PostgreSQL database and I had to
patch it (two of my pull requests have been merged). I also pushed a
patch to obfuscate emails in comments.</p>
<p>I also ran into another problem (not linked to the migration tool):
we used the DeleteTicket Trac plugin to drop spam tickets, but GitHub
issue ids have to be contiguous. Source and destination issue ids are
therefore now shifted, which is a problem when code comments reference
a ticket number, but AFAIK there is no solution to this.</p>
<p>Have a look at the brand new issue tracker:
<a href="https://github.com/jajuk-team/jajuk/issues">https://github.com/jajuk-team/jajuk/issues</a></p>
BitBucket vs Github issue tracker choice for Jajuk2014-01-20T00:00:00+00:00https://florat.net/bitbucket-vs-github-issue-tracker-choice-for-jajuk/<p>We are currently moving our Jajuk Trac <a href="http://integration.jajuk.info/">issue
tracker</a> to a better place, mainly for
spam reasons. A developer suggested BitBucket; others (me included)
GitHub, which I already use. I cloned our secondary project QDWizard into a
private BitBucket repository to form an opinion. I have to say BitBucket
is really good too.</p>
<p>In my view, both systems deliver the most important features:</p>
<ul>
<li>Simple to import from Trac.</li>
<li>Export facilities to make change possible in the future.</li>
<li>Clean and simple GUI.</li>
<li>Clean roadmap/version support.</li>
<li>Assignment facilities.</li>
</ul>
<p>But:</p>
<ul>
<li>GitHub has many more users (around 4M compared to 1M for BitBucket).
More developers already have accounts and are used to it.</li>
<li>The GitHub GUI is a bit faster.</li>
<li>GitHub is more "open source" minded; BitBucket feels more
enterprise-oriented (private repositories).</li>
<li>BitBucket is free only up to 5 developers.</li>
</ul>
<p>Specifically about issue management: the issue manager in BitBucket is
not actually Jira but a lightweight tracker. It doesn't come
(thankfully) with full workflow support. Like most trackers, each
ticket has a type (a "kind": bug, enhancement, proposal, task), a
priority (trivial, ..., blocker) and a status ("workflow": "on
hold", "resolved", "duplicate", "invalid", "wontfix" and
"closed"). Note that these states can be neither changed nor augmented
(many users asked for a "tested" state but it has never been added).
It's like Trac without the ability to define new types and new
statuses. Some Jajuk Trac types are not supported: "known issue",
"Limitation", "patch", "support request", "to_be_reproduced"
(and we map our "discussion" to BitBucket's "Proposal"). Some statuses
are missing too: "worksforme", "not_enough_information". I suppose
a migration would have forced us to map several statuses and several
types to the same BitBucket kind/workflow.</p>
<p>For its part, GitHub comes with (in my view) a very elegant
solution: there are no ticket priorities, types or states, only
"labels" such as "important", "bug", "wont fix",
<whatever>... OK, it may be more lax, but on the other hand:</p>
<ul>
<li>it allows adding any label to qualify a ticket along any aspect you may think of;</li>
<li>it doesn't force you to use potentially useless fields like priority.</li>
</ul>
<p>I suppose the migration scripts will be able to simply create any new
labels needed to reflect our existing types and statuses (yet to be
proven). We still have to run the migration script; I'll probably test
this this weekend.</p>
Keynux Epure S4 laptop review2012-10-28T00:00:00+00:00https://florat.net/keynux-epure-s4-laptop-review/<p><img src="https://florat.net/assets/images/blog-tech/keynux_1.jpg" alt="keynux_1.jpg"></p>
<h1>Main thoughts</h1>
<p>I bought a Keynux Epure S4 three months ago and it is time to review
it. At the risk of spoiling, I can already tell you that this laptop
rocks and is a good deal. Why did I buy a Keynux in the first place?
Simply because it was (AFAIK, in December 2011) the only French laptop
assembler meeting my three main criteria:</p>
<ul>
<li>Running as well as possible under Linux.</li>
<li>No <a href="http://non.aux.racketiciels.info/">Microsoft tax</a>.</li>
<li>Custom and fine-grained hardware choice.</li>
</ul>
<p>I use this laptop mainly for development and for running virtual
machines (along with "regular" browsing / office use, of course). I
(almost) never play games or have other high-GPU usages. My main
strategy was to select the least expensive Keynux laptop and then move
upmarket the components most important to me (like hard drive, CPU and
memory). It cost me around 1400 € (VAT and shipping included).</p>
<h1>Specifications</h1>
<ul>
<li>Model
<a href="http://keynux.com/default_zone/fr/html/Prod_Notebook_EpureS4_Details.php">website</a>
and
<a href="http://keynux.com/default_zone/fr/html/Prod_Notebook_EpureS4_Spec.php">specifications</a></li>
<li>My custom Epure is basically (see the complete specifications below
for more details):
<ul>
<li>a Clevo W251HSQ laptop chassis.</li>
<li>an i7 dual-core CPU with hyperthreading, so the OS sees 2x2 = 4
CPUs (note that most other i7s are quad-cores: the OS sees 8 CPUs,
but a dual-core is fine for development).</li>
<li>a built-in Intel HD Graphics 3000 GPU.</li>
<li>a 500 GB Seagate XT hybrid (SSD/HD) drive.</li>
<li>8 GB SO-DIMM DDR3 / 1333 MHz RAM (2 x 4 GB).</li>
</ul>
</li>
</ul>
<p>I hesitated between a pure SSD and a hybrid hard disk and finally
bought the hybrid to get more storage at the best price, and because my
usage implies a lot of writes while VMs are running. I'm very happy
with this choice and the boot takes about 20 seconds.</p>
<h1>Linux configuration</h1>
<p><img src="https://florat.net/assets/images/blog-tech/keynux_2.png" alt="keynux_2.png"></p>
<p>I use a <a href="http://xubuntu.org/">xubuntu</a> 11.10 desktop. Xfce is a lightweight
desktop manager that boots faster and saves memory, power and
CPU in use (a screenshot of my laptop is on your right).</p>
<h2>Kernel boot options</h2>
<p>(Grub configuration under /etc/default/grub under Ubuntu) :
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash acpi_osi= i915.modeset=1
add_efi_memmap i915.i915_enable_rc6=7 i915.i915_enable_fbc=1
i915.lvds_downclock=1"</p>
<ul>
<li>acpi_osi= makes the brightness Fn keys work (don't ask me why)...</li>
<li>i915.modeset=1 add_efi_memmap i915.i915_enable_rc6=7
i915.i915_enable_fbc=1 i915.lvds_downclock=1 enables the GPU eco
mode (thanks Jean-Baptiste): it saves me around 50% of power
consumption and 10 degrees (and hence makes the fan much less noisy
at the same time). Power measured on battery (using powertop)
dropped from 37 watts to 22 watts.</li>
<li>Don't use the pcie_aspm=force option to save more power (see
<a href="https://lkml.org/lkml/2011/11/10/467">here</a>): some components
(probably the Ethernet card) don't support ASPM and I got random
freezes when plugging in the Ethernet cable, for instance.</li>
</ul>
<h2>Xubuntu configuration</h2>
<ul>
<li>The sound was always muted at startup. To fix it, store the
current volume state using alsactl:</li>
</ul>
<pre><code>sudo alsactl store
</code></pre>
<ul>
<li>Out of the box, my LG LED projector didn't display any image,
neither in VGA nor in HDMI mode. After an i915 Xorg driver upgrade,
HDMI works. You can install the drivers using <a href="https://launchpad.net/~xorg-edgers/+archive/ppa">this PPA
repository</a>.</li>
</ul>
<h2>Issues</h2>
<ul>
<li>The VGA display doesn't work with my LG LED projector (see the
previous item) but it does with my HP external screen, so it must be
specific to the projector (it worked with my previous Lenovo, however).</li>
</ul>
<h1>The good</h1>
<ul>
<li>Very smart and clean chassis (I would suggest an M505 black mouse
to go with it for a perfect look).</li>
<li>Gorgeous 1600x900 screen. Very good color display.</li>
<li>Price: without the Microsoft tax you save about 150 €, and this
laptop should be about 200 € less expensive than a comparable
Lenovo.</li>
<li>Impressively fast for development usage.</li>
<li>Standard charger plug (I even managed to recycle an old
charger).</li>
<li>The keyboard typing feel is very pleasant.</li>
<li>Light packaging.</li>
<li>Reactive and professional support.</li>
</ul>
<p><img src="https://florat.net/assets/images/blog-tech/keynux_3.jpg" alt="keynux_3.jpg"></p>
<h1>The bad</h1>
<ul>
<li>No embedded light (to light the keyboard up in the dark).</li>
<li>No physical Wi-Fi ON/OFF or volume buttons.</li>
<li>Only three USB 2 ports.</li>
<li>The Ethernet plug is inverted (the pin points toward the ground)
and has no activity LED.</li>
</ul>
<h1>Small troubles</h1>
<ul>
<li>The screen opens only about 100 degrees from the keyboard. A wider
opening can be useful when using the laptop on some ergonomic
stands.</li>
<li>The power plug is not very well positioned (on the left; I would
prefer the right) and feels fragile.</li>
<li>By default, the French 220V plug bends at an angle, which makes
unplugging very difficult. I had to change to a straight-plug
cable.</li>
<li>The LED on the charger is annoying when used in a dark room.</li>
<li>The BIOS cannot be configured (besides time and a few other
things). On the other hand, it makes the laptop safer.</li>
<li>The "End" and "Home" keys are mixed in with the numeric keys, which
makes their use confusing; I would prefer dedicated keys.</li>
<li>No "pseudo-wheel" on the touchpad.</li>
</ul>
Conférence RMLL - Un retour des tranchées de l'Open Source (Jajuk)2009-06-10T00:00:00+00:00https://florat.net/conference-rmll-un-retour-des-tranchees-de-l'open-source-(jajuk)/<p>The talk video is available <a href="https://public.florat.net/rmll2009-florat-developpement-open-source-jajuk.ogv">here</a>.</p>
Conférence RMLL - Meilleurs projets en SSII avec l'Open Source2009-06-09T00:00:00+00:00https://florat.net/conference-rmll-meilleurs-projets-en-ssii-avec-l'open-source/<p>The video is available <a href="https://public.florat.net/rmll2009-le-gac-florat-open-source-meilleurs-projets.ogv">here</a>.</p>
Conférence Solutions Linux - L’approche orientée modèles DSM2008-01-29T00:00:00+00:00https://florat.net/conference-solutions-linux-l'approche-orientee-modeles-dsm/<p>The slides are available <a href="https://public.florat.net/SolutionLinux2008-DSM-BFlorat-1.2.pdf">here</a>.</p>
Linux on a VIA ME6000 and external hard disk real howto2005-01-26T00:00:00+00:00https://florat.net/linux-on-a-via-me6000-and-external-hard-disk-real-howto/<p>My goal was pretty simple: install a Linux distribution on a VIA ME6000 Mini-ITX PC without an internal disk, so that the case fan could be removed (the ME6000 board has no fan), in order to get a 0 dB Linux box. Actually, it took me nearly two months to achieve despite the little help from various howtos (the Knoppix-on-VIA howto for example) and forums. Some of the problems came from the CPU and others from the fact that I booted from an external USB disk.</p>
<h2>Distribution choice</h2>
<p>I've chosen Mandrake 10.1, which works perfectly on my box even though I would have preferred Suse. I tried:</p>
<ul>
<li>
<p>Suse 9.2: simply put, it <em>can't</em> work (since Suse 8.2, apparently) because it uses the i686 cmov instruction, which is not supported by the VIA Samuel 2 CPU. It freezes during install.</p>
</li>
<li>
<p>Debian Sarge: boots, but the installer (text mode) is unreadable; it must have something to do with the video card, I guess... I gave up.</p>
</li>
<li>
<p>Knoppix 3.7: works perfectly as a live CD. Awesome. Nevertheless, when I installed it on my disk (sda3), it booted the kernel (I never figured out how that was possible, read the next chapter) but when mounting devices I got a kernel panic due to a "devfs type not found" problem, with kernel 2.4 or 2.6. This problem apparently appeared with Knoppix 3.5 and we got no support from the Knoppix forum. A friend of mine told me that installing the devfs package solves this, but I had no time to try again; tell me if it works.</p>
</li>
<li>
<p>DSL (Damn Small Linux): I managed to install it on a USB pendrive with a lot of pain (read carefully the partitioning howto among the DSL howtos). The USB pendrive partition must be of FAT16 type, have the bootable flag, have 32 heads per track, and the number of cylinders must be less than or equal to 1024. However, I gave up on using it: it looks like a nice distribution but is too light for my daily needs.</p>
</li>
<li>
<p>Mandrake 10.1: works and provides all the functionality you can expect from this kind of distribution. I kept it.</p>
</li>
</ul>
<h2>How to boot from an external USB hard disk real howto</h2>
<p>First of all, current BIOSes make it very hard to boot from a USB hard drive. It is nearly impossible to make it work, especially if you don't want to re-partition your disk. I tried for about a month, reading tons of forum threads, howtos, distribution docs... I gave up booting directly from the USB hard disk, but I managed to create a boot CD instead. It is actually very simple under Mandrake (once you know the right commands):</p>
<ul>
<li>
<p>Install your Mandrake on the disk (/dev/sda1 or any other partition; I used /dev/sda3).</p>
</li>
<li>
<p>Insert your MDK disk 1 and press F1.</p>
</li>
<li>
<p>Enter rescue mode (type "rescue").</p>
</li>
<li>
<p>Select "Go to console".</p>
</li>
<li>
<p>Chroot to your disk: <code>mkdir /sda1; mount /dev/sda1 /sda1; chroot /sda1</code></p>
</li>
<li>
<p>Launch <code>mkrescue --iso</code> to create a proper boot image matching the current kernel, root partition, etc.</p>
</li>
<li>
<p>Burn this image (the rescue.iso file) with cdrecord under Linux (<code>cdrecord --scanbus; cdrecord -dev=&lt;your device, like 1,0,0&gt; -speed=1 rescue.iso</code>) or any burning utility under Windows.</p>
</li>
<li>
<p>Boot from the CD (change BIOS settings if needed) and choose the default option (Linux); it should boot your disk.</p>
</li>
</ul>
Asus L8400K overview2003-09-26T00:00:00+00:00https://florat.net/asus-l8400k-overview/
<h1>Description</h1>
<p><i>General</i></p>
<table border="1" cellpadding="4" cellspacing="0">
<thead>
<tr>
<th>CPU</th>
<th>Mem</th>
<th>HD</th>
<th>Screen</th>
<th>Video</th>
<th>Drives</th>
<th>Price</th>
<th>Battery life / weight</th>
<th>Sound</th>
<th>Network</th>
</tr>
</thead>
<tbody>
<tr>
<td>PIII 850</td>
<td>128 MB</td>
<td>20 GB</td>
<td>14.1" TFT</td>
<td>S3 Savage MX/MV</td>
<td>DVD 8X, floppy</td>
<td>About $2000</td>
<td>2-4 h / 2.9 kg</td>
<td>ESS Allegro 1988-1</td>
<td>Ethernet: Realtek 8139; Modem: ESS winmodem</td>
</tr>
</tbody>
</table>
<p><i>Connectors</i></p>
<ul>
<li>2 PCMCIA ports</li>
<li>1 PS/2 (you can use a double PS/2 plug to use keyboard and mouse at the same time)</li>
<li>1 infrared port</li>
<li>1 TV out (S-video)</li>
<li>Audio: 1 out, 1 in, 1 jack</li>
<li>2 USB</li>
<li>1 RJ45 for Ethernet and modem</li>
<li>1 serial port (small one)</li>
<li>1 parallel port</li>
<li>1 Kensington hole</li>
<li>1 VGA out</li>
<li>2 built-in loudspeakers</li>
<li>1 microphone</li>
</ul>
<p>Note: I didn't get any portbar connector in spite of the advert description.</p>
<p><a href="https://florat.net/asus-l8400k-overview/dmesg">dmesg</a></p>
<p><a href="https://florat.net/asus-l8400k-overview/screenshot1.jpg">screenshot 1</a></p>
<h1>General feelings</h1>
<p>An excellent product for use under Linux at home and at the office.
Mine was sold with Microsoft Windows Millennium. After resizing the
partition (with Partition Magic), I installed Suse 7.2 and everything
was OK.</p>
<p>Update: I had to change the motherboard (300 €, through Asus France
support) after 2 years of good service. It didn't boot any more.</p>
<p>Summary for Suse 7.2 and Mandrake 8 (kernel 2.4.4 / XFree 4.0.3 / KDE 2.1.1):</p>
<table border="1" cellpadding="4" cellspacing="0">
<tbody>
<tr><td>Video</td><td><b>yes</b></td></tr>
<tr><td>Sound</td><td><b>yes</b></td></tr>
<tr><td>DVD</td><td>CD-ROM reading: <b>yes</b><br>
Data DVD reading: <b>yes</b><br>
Video DVD reading: <b>yes</b> (see ogle:
<a href="http://www.dtek.chalmers.se/groups/dvd/downloads.html">http://www.dtek.chalmers.se/groups/dvd/downloads.html</a>)</td></tr>
<tr><td>Mouse</td><td><b>yes</b> (see XF86Config)</td></tr>
<tr><td>IR</td><td><b>?</b> (should work)</td></tr>
<tr><td>APM</td><td><b>yes</b>, I use the KDE module to check the battery level</td></tr>
<tr><td>Ethernet</td><td><b>yes</b></td></tr>
<tr><td>Modem</td><td><b>no</b> (winmodem)<br>
Check <a href="http://www.linmodems.org/">http://www.linmodems.org</a> but don't expect it to work any time soon.</td></tr>
</tbody>
</table>
<p>To sum up, everything was perfect except the modem. I will use an
old one on the serial port or buy a cheap PCMCIA one. Every hardware
part was detected without any additional configuration (except the
mouse under Suse) and I got a running, usable system in less than half
an hour. I advise you to avoid Mandrake 8.0 with this laptop because I
had some BIOS clock problems with that distribution.</p>
<h1>Useful information</h1>
<ul>
<li>Star Office 5.2 can freeze your notebook. If you use it, put this
line in your profile:</li>
</ul>
<pre><code>export SAL_DO_NOT_USE_INVERT50=true
</code></pre>
<ul>
<li>Change the BIOS settings: OS = others. Note that suspend-to-RAM
works perfectly.</li>
<li>To avoid big and ugly fonts in the text console, put vga=791 in
your /etc/lilo.conf and run 'lilo' as root:</li>
</ul>
<pre><code>image=/boot/vmlinuz
  label=linux
  vga=791
  root=/dev/hda6
  append=" quiet"
  read-only
</code></pre>
<ul>
<li>To add some RAM: there is one slot below the keyboard. I tried to
add a 256 MB module to reach 384 MB but in that case the system
detected only 256 MB, so add only a 128 MB module.</li>
</ul>
<p>Update: an L8400K owner reports that he uses Kingston RAM and that
it works perfectly (now 384 MB).</p>
<h1>X11 config</h1>
<p>Using Suse 7.2, I got a bad X11 configuration with sax: the touchpad
didn't work at install time (random jumps). The touchpad must be
declared as a PS/2 mouse to solve the problem. Now I use my laptop with
both the touchpad and a USB cordless/optical/wheel Logitech mouse, both
running perfectly.</p>
<p><a href="https://florat.net/asus-l8400k-overview/XF86Config">Here's my XF86Config.</a></p>
<h1>Kernel compilation</h1>
<p>I recompiled the 2.4.4 kernel without problem.</p>
<p><a href="https://florat.net/asus-l8400k-overview/kernel-201001.conf">Here's my compilation config</a>.</p>