Security Data Lake Concept

Estimated reading time: 5 minutes

New cybersecurity threats emerge every day, and hackers constantly develop new intrusion techniques to access sensitive data and breach IT systems. This is why it is necessary to collaborate with high-level experts who keep track of new developments in IT security. With the birth and continuous evolution of Big Data, the concepts of the Data Lake and the Security Data Lake have also established themselves.

For a company, it is expensive to hire a team that deals exclusively with the internal security of its systems, which is why many turn to professionals, using a Security Operations Center as a Service (SOCaaS). This service, offered by SOD, also includes a Security Data Lake (SDL). Let us now try to understand what it is and why it is important and convenient.

Security Data Lake: what it is


A Data Lake is a repository that holds large amounts of data, structured and unstructured, that have not yet been processed for a specific purpose. Data Lakes have a simple, flat architecture for storing data: each item is assigned a unique identifier and then tagged with a set of metadata.

When a business question arises, data scientists can query the Data Lake to discover data that could answer it. Since Data Lakes store sensitive company information, they must be protected with effective security measures. However, the external data ecosystem that feeds a Data Lake is very dynamic, and new problems that undermine its security can regularly arise.
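The store-then-query pattern described above can be sketched in a few lines. The class and field names below are purely illustrative, not a real product API: each raw item receives a unique identifier and a set of metadata tags, and analysts later query by tag.

```python
import uuid

# Minimal in-memory sketch of data-lake storage: raw payloads are kept
# unprocessed, each with a unique id and a dict of metadata tags.
class DataLake:
    def __init__(self):
        self.items = {}      # id -> raw, unprocessed payload
        self.tags = {}       # id -> metadata dict

    def put(self, payload, **metadata):
        item_id = str(uuid.uuid4())   # unique identifier for the item
        self.items[item_id] = payload
        self.tags[item_id] = metadata
        return item_id

    def query(self, **criteria):
        # Return ids whose metadata matches every requested tag.
        return [i for i, t in self.tags.items()
                if all(t.get(k) == v for k, v in criteria.items())]

lake = DataLake()
lake.put(b'<event/>', source="ids", fmt="xml")
lake.put(b'{"evt": 1}', source="firewall", fmt="json")
print(len(lake.query(source="firewall")))  # 1
```

Note that the payload itself is never parsed at ingest time; only the metadata is structured, which is what keeps the architecture of a Data Lake simple.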

Users authorized to access a Data Lake, for example, could explore and enrich its resources, consequently also increasing the risk of a breach. If a breach were to occur, the consequences could be catastrophic for a company: violation of employee privacy, exposure of regulated information, or compromise of business-critical data.

A Security Data Lake, on the other hand, is focused on security. It can ingest data from many security tools, analyze it to extract important information, and map the fields to a common schema.

The data contained in an SDL

Security data comes in countless varieties and formats: JSON, XML, PCAP and more. A Security Data Lake supports all these types of data, ensuring a more accurate and efficient analysis process. Many companies leverage Big Data to develop machine learning-based threat detection systems. One example is the UEBA system integrated with the SOCaaS offered by SOD.
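To illustrate how heterogeneous formats can still feed one analysis process, here is a hedged sketch that normalizes events arriving as JSON or XML into a single common record shape. The field names (`src`, `action`) are invented for the example.

```python
import json
import xml.etree.ElementTree as ET

# Normalize events from different formats into one common record, as an
# SDL would do before analysis. Field names here are illustrative only.
def normalize(raw, fmt):
    if fmt == "json":
        evt = json.loads(raw)
        return {"src_ip": evt.get("src"), "action": evt.get("action")}
    if fmt == "xml":
        root = ET.fromstring(raw)
        return {"src_ip": root.get("src"), "action": root.get("action")}
    raise ValueError(f"unsupported format: {fmt}")

print(normalize('{"src": "10.0.0.1", "action": "deny"}', "json"))
print(normalize('<event src="10.0.0.2" action="allow"/>', "xml"))
```

Once every source maps onto the same record shape, a single detection rule can run over firewall, IDS and proxy events alike.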

A Security Data Lake makes data easily accessible and available, and also offers the opportunity for real-time analysis.

Apache Hadoop

It is a set of open source programs that allows applications to store and process enormous amounts of data. The goal is to solve problems that involve large volumes of information and heavy computation.

Apache Hadoop includes HDFS, YARN, and MapReduce. When we talk about Hadoop, therefore, we are referring to all the tools capable of interfacing and integrating with this technology. The role of Hadoop is essential because with it, data can be stored and processed at a very low cost compared to other tools, and on a large scale. It is therefore an ideal solution for managing an SDL.


Hadoop Distributed File System (HDFS): one of the main components of Apache Hadoop, it provides access to application data without having to define schemas in advance.

Yet Another Resource Negotiator (YARN): used to manage computing resources in clusters and to schedule user applications on them. It is responsible for resource allocation throughout the Hadoop ecosystem.

MapReduce: a programming model that transfers processing logic to the nodes where the data resides, helping developers write applications that reduce large amounts of information to a single manageable dataset.
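The MapReduce pattern can be sketched in plain Python, without a cluster, to show the three phases involved. This toy example counts events per source IP; real Hadoop distributes the same steps across many nodes.

```python
from collections import defaultdict

# Plain-Python sketch of the MapReduce pattern: a map phase emits
# (key, value) pairs, a shuffle groups them by key, and a reduce phase
# aggregates each group into a single result.
logs = [
    "10.0.0.1 GET /index",
    "10.0.0.2 POST /login",
    "10.0.0.1 GET /admin",
]

# map: emit (source_ip, 1) for every log line
mapped = [(line.split()[0], 1) for line in logs]

# shuffle: group emitted values by key
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# reduce: aggregate each group (here, a sum)
counts = {ip: sum(vals) for ip, vals in groups.items()}
print(counts)  # {'10.0.0.1': 2, '10.0.0.2': 1}
```

Because each phase only needs local data plus a grouping step, the work parallelizes naturally, which is what makes the model suitable for very large log volumes.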

What advantages does Hadoop offer?

It is important to use Hadoop because with it you can leverage clusters of multiple computers to analyze large amounts of information, rather than relying on a single large machine. The advantage over relational databases and data warehouses lies in Hadoop's ability to manage big data quickly and flexibly.

Other advantages

Fault tolerance: Data is replicated across a cluster, so it can be easily recovered in the event of disk or node errors or malfunctions.

Costs: Hadoop is a much cheaper solution than other systems. It provides compute and storage on affordable hardware.

Strong community support: Hadoop is currently a project supported by an active community of developers who introduce updates, improvements and ideas, making it an attractive product for many companies.


In this article we learned the differences between a Data Lake and a Security Data Lake, clarifying the importance of using these tools to help guarantee the integrity of a company's IT systems.

Collecting infrastructure data is only the first step toward efficient analysis and the security offered by monitoring, which is essential for a SOCaaS. Ask us how these technologies can help you manage your company's cyber security.

For doubts and clarifications, we are always ready to answer all your questions.

Useful links:

open data

Estimated reading time: 5 minutes

With the advent of big data platforms, IT security companies can now make informed decisions on how to protect their assets. By recording network traffic and network flows, it is possible to get an idea of the channels through which company information flows. To facilitate data integration between the various applications and the development of new analytical functionality, the Apache Open Data Model comes into play.

The common Open Data Model for networks, endpoints and users has several advantages: for example, easier integration between various security applications, and easier sharing of analytics between companies when new threats are detected.

Hadoop offers adequate tools to manage a Security Data Lake (SDL) and big data analysis. It can also detect events that are usually difficult to identify, such as lateral movement, data leaks, internal problems or stealthy behavior in general. Thanks to the technologies behind the SDL, it is possible to collect SIEM data and exploit it through SOCaaS: since the Open Data Model is free and open, the logs are stored in a format that anyone can use.


What is Hadoop Open Data Model

Apache Hadoop is free and open source software that helps companies gain insight into their network environments. Analysis of the collected data leads to the identification of potential security threats or attacks that take place between resources in the cloud.

While traditional Cyber Threat Intelligence tools help identify threats and attacks in general, an Open Data Model provides a tool that allows companies to detect suspicious connections using flow and packet analysis.

The Hadoop Open Data Model combines all security-related data (events, users, networks, etc.) into a single view that can be used to identify threats effectively. It can also be used to create new analytical models. In fact, an Open Data Model allows the sharing and reuse of threat detection models.

An Open Data Model also provides a common taxonomy to describe the security telemetry data used to detect threats. Using data structures and schemas in the Hadoop platform it is possible to collect, store and analyze security-related data.

Open Data Model Hadoop, the advantages for companies

  • Archive a copy of security telemetry data
  • Leverage out-of-the-box analytics to detect threats targeting DNS, Flow and Proxy
  • Build custom analytics based on your needs
  • Allow third-party applications to interact with the Open Data Model
  • Share and reuse threat detection models, algorithms, visualizations and analyses from the Apache Spot community
  • Leverage security telemetry data to better detect threats
  • Use security logs
  • Obtain data on users, endpoints and network entities
  • Obtain threat intelligence data

Open Data Model: types of data collected

To provide a complete security picture and to effectively analyze cyber threat data, you need to collect and analyze all logs and alerts regarding security events, together with contextual data related to the entities referenced in those logs. The most common entities include the network, users and endpoints, but there are actually many more, such as files and certificates.

Due to the need to collect and analyze security alerts, logs and contextual data, the following types of data are included in the Open Data Model.

Security Event Alerts in Open Data Model

These are event logs from common data sources used to identify threats and better understand network flows: for example, operating system logs, IPS logs, firewall logs, proxy logs, web server logs and many more.

Network context data

These include network information that is accessible to anyone from the Whois directory, as well as resource databases and other similar data sources.

User context data

This type of data includes all information relating to the management of users and their identities. Active Directory, Centrify and other similar systems are also included.

Endpoint context data

It includes all information about endpoint systems (servers, routers, switches). This data can come from asset management systems, vulnerability scanners and detection systems.

Contextual threat data

This data contains contextual information on URLs, domains, websites, files and much more, always related to known threats.

Contextual data on vulnerabilities

This data includes information on vulnerabilities and vulnerability management systems.
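Taken together, the context types above let an analyst enrich a raw alert before triage. The sketch below joins one alert with user, endpoint and threat context; all records and field names are invented for illustration.

```python
# Illustrative context stores, keyed the way an enrichment step might
# look them up. In a real SDL these would be tables fed by the sources
# described above (identity systems, asset inventories, threat feeds).
user_context = {"jdoe": {"department": "finance", "admin": False}}
endpoint_context = {"10.0.0.5": {"role": "server", "os": "linux"}}
threat_context = {"evil.example": {"category": "phishing"}}

def enrich(alert):
    # Attach whatever context is available; missing context yields {}.
    enriched = dict(alert)
    enriched["user_info"] = user_context.get(alert.get("user"), {})
    enriched["endpoint_info"] = endpoint_context.get(alert.get("dst_ip"), {})
    enriched["threat_info"] = threat_context.get(alert.get("domain"), {})
    return enriched

alert = {"user": "jdoe", "dst_ip": "10.0.0.5", "domain": "evil.example"}
print(enrich(alert)["threat_info"]["category"])  # phishing
```

The value of a common model is exactly this join: every context source lands in a predictable place, so enrichment logic does not need per-vendor special cases.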

Data types on the roadmap

These include file context data, certificate context data and the naming convention.


Name of attributes

An Open Data Model requires a naming convention in order to represent attributes consistently across vendors' products and technologies. The naming convention consists of prefixes (net, http, src, dst, etc.) and common attribute names (ip4, username, etc.).

Multiple prefixes can also be used in combination with a single attribute.
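The prefix-plus-attribute convention can be sketched as a simple field-mapping step. The vendor field names and their mappings below are invented for the example; only the prefix/attribute naming pattern follows the convention described above.

```python
# Map invented vendor field names to (prefix, attribute) pairs in the
# style of the Open Data Model naming convention described above.
PREFIX_MAP = {
    "SourceAddress": ("src", "ip4"),
    "DestAddress": ("dst", "ip4"),
    "AccountName": ("src", "username"),
}

def to_odm(vendor_event):
    odm = {}
    for field, value in vendor_event.items():
        prefix, attr = PREFIX_MAP[field]
        odm[f"{prefix}_{attr}"] = value   # e.g. src_ip4
    return odm

event = {"SourceAddress": "10.0.0.1", "DestAddress": "8.8.8.8",
         "AccountName": "jdoe"}
print(to_odm(event))
```

After this step, every product's events expose the same attribute names, which is what makes detection models shareable and reusable across tools.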


We have seen what the Hadoop Open Data Model is and how it can be used, thanks to its ability to filter traffic and highlight potential cyber attacks by listing suspicious flows, threats to users, threats to endpoints and major network threats.

If you have any doubts or would like further clarification, do not hesitate to contact us by pressing the button below; we will be happy to answer any questions.

Useful links: