Data Cleaning and Visualization

Scenario
You have been provided with an export from the security information and event
management (SIEM) system used by DCE’s incident response team. The team extracted
alert data from the SIEM platform and has provided a CSV file (MLData2023.csv)
containing 500,000 event records, of which approximately 3,000 have been ‘tagged’ as
malicious.
The goal is to integrate machine learning into the SIEM platform so that suspicious
events can be investigated in real time.
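The record counts in the scenario imply a heavily imbalanced data set, which matters for any later machine learning step. The arithmetic can be sketched as follows, assuming the approximate figure of 3,000 malicious records is taken at face value; the variable names are illustrative only.

```python
# Figures quoted in the scenario; the malicious count is approximate.
total_events = 500_000
malicious_events = 3_000

# Share of records tagged as malicious.
malicious_rate = malicious_events / total_events

# Accuracy of a trivial model that labels every event as non-malicious.
baseline_accuracy = 1 - malicious_rate

print(f"Malicious share: {malicious_rate:.2%}")
print(f"Accuracy of always predicting non-malicious: {baseline_accuracy:.2%}")
```

Because a model that never flags anything would already score very highly on accuracy, accuracy alone is unlikely to be a useful measure for this data.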
Data description
Each event record is a snapshot triggered by an individual network ‘packet’. The exact
triggering conditions for the snapshot are unknown, but it is known that multiple packets
are exchanged in a ‘TCP conversation’ between the source and the target before an event is
triggered and a record is created. It is also known that each event record is anomalous in
some way (the SIEM logs many events that may be suspicious).
A very small proportion of the data is known to have been corrupted by the source systems,
and some records are incomplete or incorrectly tagged. The incident response team indicated
that this is likely to affect fewer than a few hundred records. A list of the relevant
features in the data is given below.
Assembled Payload Size (continuous): The total size of the inbound suspicious payload.
    Note: This would contain the data sent by the attacker in the “TCP conversation”
    up until the event was triggered.
DYNRiskA Score (continuous): An un-tested in-built risk score assigned by a new SIEM
    plug-in.
IPV6 Traffic (binary): A flag indicating whether the triggering packet was using IPV6
    or IPV4 protocols (True = IPV6).
Response Size (continuous): The total size of the reply data in the TCP conversation
    prior to the triggering packet.
Source Ping Time (ms) (continuous): The ‘ping’ time to the IP address which triggered
    the event record. This is affected by network structure, number of ‘hops’ and even
    physical distances.
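As a starting point for the cleaning step, the following is a minimal Python sketch that counts class labels and flags records whose continuous fields look corrupted. Everything specific in it is an assumption: the column names (including the label column `Class`), the rule that sizes and ping times cannot be negative, and the three sample rows, which are invented purely so the code runs and are not taken from MLData2023.csv.

```python
import csv
import io
import math
from collections import Counter

# Hypothetical column names; the real headers in MLData2023.csv may differ.
LABEL_COLUMN = "Class"

# Continuous columns mapped to an assumed lower bound (None = no bound checked).
CONTINUOUS_COLUMNS = {
    "AssembledPayloadSize": 0.0,
    "DYNRiskAScore": None,
    "ResponseSize": 0.0,
    "SourcePingTime": 0.0,
}


def parse_float(text):
    """Return a finite float, or None if the field is missing or not numeric."""
    try:
        value = float(text)
    except (TypeError, ValueError):
        return None
    return value if math.isfinite(value) else None


def summarize_events(rows):
    """Count labels and list (row number, column) for the first bad field per row."""
    label_counts = Counter()
    suspect_rows = []
    for row_number, row in enumerate(rows, start=1):
        label_counts[row.get(LABEL_COLUMN, "")] += 1
        for column, lower_bound in CONTINUOUS_COLUMNS.items():
            value = parse_float(row.get(column))
            below_bound = (
                value is not None and lower_bound is not None and value < lower_bound
            )
            if value is None or below_bound:
                suspect_rows.append((row_number, column))
                break  # one flag per record is enough for a first pass
    return label_counts, suspect_rows


# Made-up sample rows for illustration only.
SAMPLE_CSV = """AssembledPayloadSize,DYNRiskAScore,IPV6Traffic,ResponseSize,SourcePingTime,Class
1200,0.41,FALSE,300,25,0
-1,0.77,TRUE,512,40,1
980,,FALSE,128,31,0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
label_counts, suspect_rows = summarize_events(rows)
print(dict(label_counts))
print(suspect_rows)
```

To run this against the real export, the sample string could be replaced with `open("MLData2023.csv", newline="")` passed to `csv.DictReader`, once the hypothetical column names have been matched to the actual headers. Flagged records should be inspected before they are dropped, since the brief does not say how corruption shows up in the data.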
