Análisis de Datos con Pandas - Filtrando y Resumiendo Datos¶
Autor: Jose Rodriguez (@Cyb3rPandah)
Proyecto: Infosec Jupyter Book
Organización Pública: Open Threat Research
Licencia: Creative Commons Attribution-ShareAlike 4.0 International
Referencia: https://mordordatasets.com/notebooks/small/windows/02_execution/SDWIN-200806115603.html
Representando el Archivo JSON como Dataframe usando Pandas¶
La representación del archivo JSON como un Dataframe de Pandas puede involucrar el uso de comandos como wget, unzip. Esto fue explicado con más detalle en el notebook que lleva por título Representando el Archivo JSON como Dataframe usando Pandas, y además se encuentra en la misma carpeta del presente notebook. Para mantener la estructura del presente notebok en un formato simple, los JSON files requeridos para este workshop ya han sido desargados y descomprimidos. Estos archivos se encuentran en la carpeta sets_datos.
psremoting_json = 'sets_datos/covenant_psremoting_command_2020-08-06115603.json'
a) Importando la librería Pandas¶
import pandas as pd
b) Leyendo Archivo JSON¶
Usaremos el método pandas.read_json.
Referencia: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html
df = pd.read_json(path_or_buf = psremoting_json, lines = True)
df.head(5)
tags | @version | EventType | Version | ThreadID | EventTime | Task | AccountType | PipeName | Channel | ... | ContextInfo | Payload | AdditionalInfo | AdditionalInfo2 | OperationType | Properties | MiniportNameLen | MiniportName | param3 | param4 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | [mordorDataset] | 1 | ConnectPipe | 1.0 | 4208 | 2020-08-06 11:56:04 | 18 | User | \lsass | Microsoft-Windows-Sysmon/Operational | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | [mordorDataset] | 1 | INFO | 3.0 | 4208 | 2020-08-06 11:56:05 | 10 | User | NaN | Microsoft-Windows-Sysmon/Operational | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | [mordorDataset] | 1 | INFO | 3.0 | 4208 | 2020-08-06 11:56:05 | 10 | User | NaN | Microsoft-Windows-Sysmon/Operational | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | [mordorDataset] | 1 | AUDIT_SUCCESS | 1.0 | 1808 | 2020-08-06 11:56:06 | 12810 | NaN | NaN | Security | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | [mordorDataset] | 1 | AUDIT_SUCCESS | 1.0 | 1808 | 2020-08-06 11:56:06 | 12810 | NaN | NaN | Security | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 180 columns
c) Conociendo las columnas o atributos del Dataframe¶
Usaremos el método pandas.DataFrame.info.
Referencia: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html
df.info(verbose = True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4284 entries, 0 to 4283
Data columns (total 180 columns):
# Column Dtype
--- ------ -----
0 tags object
1 @version int64
2 EventType object
3 Version float64
4 ThreadID int64
5 EventTime object
6 Task int64
7 AccountType object
8 PipeName object
9 Channel object
10 EventTypeOrignal object
11 SourceName object
12 Hostname object
13 @timestamp object
14 OpcodeValue float64
15 Message object
16 SourceModuleName object
17 ProcessId object
18 EventReceivedTime object
19 port int64
20 AccountName object
21 UtcTime object
22 Domain object
23 ExecutionProcessID int64
24 host object
25 Severity object
26 SeverityValue int64
27 EventID int64
28 UserID object
29 ProviderGuid object
30 RecordNumber int64
31 Image object
32 Keywords int64
33 ProcessGuid object
34 SourceModuleType object
35 RuleName object
36 TargetProcessGUID object
37 SourceProcessGUID object
38 CallTrace object
39 SourceImage object
40 GrantedAccess object
41 SourceProcessId object
42 SourceThreadId float64
43 TargetProcessId object
44 TargetImage object
45 DestAddress object
46 SourcePort float64
47 Category object
48 Opcode object
49 Application object
50 Direction object
51 RemoteMachineID object
52 DestPort float64
53 LayerRTID float64
54 LayerName object
55 SourceAddress object
56 FilterRTID float64
57 Protocol object
58 RemoteUserID object
59 TargetObject object
60 SubjectLogonId object
61 PrivilegeList object
62 SubjectDomainName object
63 SubjectUserSid object
64 SubjectUserName object
65 TargetDomainName object
66 ElevatedToken object
67 TargetOutboundDomainName object
68 IpPort object
69 ImpersonationLevel object
70 ProcessName object
71 KeyLength float64
72 TargetOutboundUserName object
73 RestrictedAdminMode object
74 LogonProcessName object
75 LogonGuid object
76 WorkstationName object
77 TargetLinkedLogonId object
78 TargetLogonId object
79 TargetUserSid object
80 TransmittedServices object
81 IpAddress object
82 LmPackageName object
83 TargetUserName object
84 AuthenticationPackageName object
85 VirtualAccount object
86 LogonType float64
87 GroupMembership object
88 EventIdx float64
89 EventCountTotal float64
90 DestinationPortName object
91 DestinationIp object
92 SourceIp object
93 Initiated object
94 DestinationHostname object
95 SourceHostname object
96 DestinationPort float64
97 SourceIsIpv6 object
98 User object
99 DestinationIsIpv6 object
100 SourcePortName object
101 ActivityID object
102 ObjectType object
103 AccessMask object
104 AccessList object
105 ShareName object
106 Device object
107 EnabledPrivilegeList object
108 DisabledPrivilegeList object
109 ShareLocalPath object
110 AccessReason object
111 RelativeTargetName object
112 Details object
113 OriginalFileName object
114 ParentCommandLine object
115 ParentProcessId float64
116 FileVersion object
117 Company object
118 CurrentDirectory object
119 IntegrityLevel object
120 CommandLine object
121 ParentImage object
122 LogonId object
123 Hashes object
124 Description object
125 TerminalSessionId float64
126 ParentProcessGuid object
127 Product object
128 Signed object
129 SignatureStatus object
130 Signature object
131 ImageLoaded object
132 ObjectName object
133 ResourceAttributes object
134 TransactionId object
135 RestrictedSidCount float64
136 ObjectServer object
137 HandleId object
138 SourceHandleId object
139 TargetHandleId object
140 NewProcessName object
141 NewProcessId object
142 TokenElevationType object
143 MandatoryLabel object
144 ParentProcessName object
145 Service object
146 NewSd object
147 OldSd object
148 TargetFilename object
149 CreationUtcTime object
150 Path object
151 Priority float64
152 TaskName object
153 Status object
154 IsExecutable object
155 Archived object
156 QueryResults object
157 QueryName object
158 QueryStatus float64
159 TargetServerName object
160 TargetLogonGuid object
161 TargetInfo object
162 TicketEncryptionType object
163 ServiceName object
164 ServiceSid object
165 PreAuthType float64
166 TargetSid object
167 TicketOptions object
168 param1 object
169 param2 object
170 ContextInfo object
171 Payload object
172 AdditionalInfo object
173 AdditionalInfo2 object
174 OperationType object
175 Properties object
176 MiniportNameLen float64
177 MiniportName object
178 param3 object
179 param4 object
dtypes: float64(19), int64(9), object(152)
memory usage: 5.9+ MB
Filtrando Columnas o Atributos de nuestro Dataframe¶
Seleccionando las columnas ‘@timestamp’,’Hostname’,’Channel’,’EventID’ usando una lista con los nombres de las columnas.
df[['@timestamp','Hostname','Channel','ParentImage','Image','EventID']].head()
@timestamp | Hostname | Channel | ParentImage | Image | EventID | |
---|---|---|---|---|---|---|
0 | 2020-08-06T15:56:07.158Z | MORDORDC.theshire.local | Microsoft-Windows-Sysmon/Operational | NaN | C:\windows\system32\dns.exe | 18 |
1 | 2020-08-06T15:56:07.158Z | MORDORDC.theshire.local | Microsoft-Windows-Sysmon/Operational | NaN | NaN | 10 |
2 | 2020-08-06T15:56:07.158Z | MORDORDC.theshire.local | Microsoft-Windows-Sysmon/Operational | NaN | NaN | 10 |
3 | 2020-08-06T15:56:08.170Z | WORKSTATION5.theshire.local | Security | NaN | NaN | 5156 |
4 | 2020-08-06T15:56:08.170Z | WORKSTATION5.theshire.local | Security | NaN | NaN | 5156 |
Filtrando Filas o Registros de nuestro Dataframe¶
a) Una condición¶
Filtrando nombres de procesos que incluyan el string wsmprovhost.exe.
(
df[['@timestamp','Hostname','Channel','ParentImage','Image','EventID']]
[df['Image'].str.contains('wsmprovhost.exe',case = False, na = False, regex = False)]
.head(5)
)
@timestamp | Hostname | Channel | ParentImage | Image | EventID | |
---|---|---|---|---|---|---|
656 | 2020-08-06T15:56:24.416Z | WORKSTATION6.theshire.local | Microsoft-Windows-Sysmon/Operational | C:\Windows\System32\svchost.exe | C:\Windows\System32\wsmprovhost.exe | 1 |
666 | 2020-08-06T15:56:24.419Z | WORKSTATION6.theshire.local | Microsoft-Windows-Sysmon/Operational | NaN | C:\Windows\System32\wsmprovhost.exe | 7 |
669 | 2020-08-06T15:56:24.420Z | WORKSTATION6.theshire.local | Microsoft-Windows-Sysmon/Operational | NaN | C:\Windows\System32\wsmprovhost.exe | 7 |
673 | 2020-08-06T15:56:24.423Z | WORKSTATION6.theshire.local | Microsoft-Windows-Sysmon/Operational | NaN | C:\Windows\System32\wsmprovhost.exe | 7 |
675 | 2020-08-06T15:56:24.424Z | WORKSTATION6.theshire.local | Microsoft-Windows-Sysmon/Operational | NaN | C:\Windows\System32\wsmprovhost.exe | 7 |
b) Más de una condición: Operadores AND y OR¶
Podemos usar múltiples condiciones usando los operadores & (AND) y | (OR). El uso de parentesis es importante cuando trabajamos con múltiples condiciones.
(
df[['@timestamp','Hostname','Channel','ParentImage','Image','EventID']]
[(df['Image'].str.contains('wsmprovhost.exe',case = False, na = False, regex = False)) |
(df['ParentImage'].str.contains('wsmprovhost.exe',case = False, na = False, regex = False))]
.head(5)
)
@timestamp | Hostname | Channel | ParentImage | Image | EventID | |
---|---|---|---|---|---|---|
656 | 2020-08-06T15:56:24.416Z | WORKSTATION6.theshire.local | Microsoft-Windows-Sysmon/Operational | C:\Windows\System32\svchost.exe | C:\Windows\System32\wsmprovhost.exe | 1 |
666 | 2020-08-06T15:56:24.419Z | WORKSTATION6.theshire.local | Microsoft-Windows-Sysmon/Operational | NaN | C:\Windows\System32\wsmprovhost.exe | 7 |
669 | 2020-08-06T15:56:24.420Z | WORKSTATION6.theshire.local | Microsoft-Windows-Sysmon/Operational | NaN | C:\Windows\System32\wsmprovhost.exe | 7 |
673 | 2020-08-06T15:56:24.423Z | WORKSTATION6.theshire.local | Microsoft-Windows-Sysmon/Operational | NaN | C:\Windows\System32\wsmprovhost.exe | 7 |
675 | 2020-08-06T15:56:24.424Z | WORKSTATION6.theshire.local | Microsoft-Windows-Sysmon/Operational | NaN | C:\Windows\System32\wsmprovhost.exe | 7 |
Resumiendo Filas o Registros de nuestro Dataframe¶
a) Resumiendo los eventos de seguridad para el proveedor Sysmon¶
Ahora podemos realizar la agrupación del dataframe anterior usando el método groupby y la columna que representa el número de identificaión del evento de seguridad.
Referencia: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html
(
df[['@timestamp','Hostname','Channel','ParentImage','Image','EventID']]
[(df['Image'].str.contains('wsmprovhost.exe',case = False, na = False, regex = False)) |
(df['ParentImage'].str.contains('wsmprovhost.exe',case = False, na = False, regex = False))]
.groupby(['EventID']).size()
)
EventID
1 1
5 1
7 91
11 3
12 148
17 1
18 1
23 2
dtype: int64
El código ejecutado previamente nos devuelve una Serie. En caso quisieramos convertir este objeto a un dataframe, podemos usar el método to_frame.
Referencia: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.to_frame.html
(
df[['@timestamp','Hostname','Channel','ParentImage','Image','EventID']]
[(df['Image'].str.contains('wsmprovhost.exe',case = False, na = False, regex = False)) |
(df['ParentImage'].str.contains('wsmprovhost.exe',case = False, na = False, regex = False))]
.groupby(['EventID']).size().to_frame(name = 'Frequencia')
)
Frequencia | |
---|---|
EventID | |
1 | 1 |
5 | 1 |
7 | 91 |
11 | 3 |
12 | 148 |
17 | 1 |
18 | 1 |
23 | 2 |
b) Resumiendo y Ordenando los eventos de seguridad para el proveedor Sysmon¶
Similar al codigo anterior, pero ahora vamos a agregar la operacion de ordenamiento usando el método sort_values.
Referencia: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html
(
df[['@timestamp','Hostname','Channel','ParentImage','Image','EventID']]
[(df['Image'].str.contains('wsmprovhost.exe',case = False, na = False, regex = False)) |
(df['ParentImage'].str.contains('wsmprovhost.exe',case = False, na = False, regex = False))]
.groupby(['EventID']).size().to_frame(name = 'Frequencia').sort_values(by = 'Frequencia', ascending = False)
)
Frequencia | |
---|---|
EventID | |
12 | 148 |
7 | 91 |
11 | 3 |
23 | 2 |
1 | 1 |
5 | 1 |
17 | 1 |
18 | 1 |