Python regex digit


Python Regex basics in 5 minutes

When I started web scraping, I often had strings with strange characters such as \n\nnn and so on. I used the replace function to get rid of them. I built constructions like .str.replace('wall of text', '').str.replace('wall of text', ''). The result was 1) bad readability, 2) one individual solution per term and 3) a very inefficient process. So I started to use regular expressions for cleaning my data. In the following article I want to share my findings and some tricks with you. Have fun reading.

Scenario 1: Extract information from webscraped data

Python provides the re library, which is designed to handle regular expressions. Let's import it and create a string:

import re
line = "This is a phone number: (061) - 535555. And this is another: 01 - 4657289, 9846012345"

re has four different functions: findall, search, split and sub. We will start with findall to learn about the basics. Our string obviously contains three different telephone numbers, which we scraped for our marketing department (the string is a real-world example from Stack Overflow). Our colleagues don't need the whole string; instead they want a list with each number as a separate entry. In a first step we try to extract every digit from our string:

foundNum = re.findall(r'\d', line)
print("The numbers found are: ", foundNum)

Output:

The numbers found are: ['0', '6', '1', '5', '3', '5', '5', '5', '5', '0', '1', '4', '6', '5', '7', '2', '8', '9', '9', '8', '4', '6', '0', '1', '2', '3', '4', '5']

As you can see, findall takes two arguments. The first is the expression we are looking for, in this case a single digit. The second is the string which contains our pattern of interest. Well, our marketing department can't use this for calling any customer, so we have to look for a different method.

foundNum = re.findall(r'\(\d', line)
print("The numbers found are: ", foundNum)

Output:

The numbers found are: ['(0']

This time we looked for an opening parenthesis followed by a digit. There is one match, which is the beginning of the first phone number. If we come up with this result, the marketing department will not ask us for anything again… Okay, next try:

foundNum = re.findall(r"\(*\d[- \d()]*\d", line)
print("The phone numbers found are: ", foundNum)

Output:

The phone numbers found are: ['(061) - 535555', '01 - 4657289', '9846012345']

Now we have the perfect solution. We require the match to begin and end with a digit, and we put a character set between those two digits. The term [- \d()]* says that only the characters "-", " ", digits or parentheses may appear in between. Perfect! Our marketing department will be happy about this and can call our customers. Happy end!
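One way to read the pattern \(*\d[- \d()]*\d piece by piece is to write it with the re.VERBOSE flag, which lets each part carry a comment. This is only a sketch; the names phone_pattern and found_numbers are chosen here for illustration, and the sample string uses plain hyphens as in the output shown above.

```python
import re

line = "This is a phone number: (061) - 535555. And this is another: 01 - 4657289, 9846012345"

# With re.VERBOSE, whitespace outside character sets is ignored
# and everything after # is treated as a comment.
phone_pattern = re.compile(r"""
    \(*          # zero or more opening parentheses
    \d           # the first digit of the number
    [- \d()]*    # any run of hyphens, spaces, digits or parentheses
    \d           # the match has to end with a digit
""", re.VERBOSE)

found_numbers = phone_pattern.findall(line)
print("The phone numbers found are: ", found_numbers)
```

The space inside the square brackets is kept even in verbose mode, because whitespace inside a character set is never ignored.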

Scenario 2: Clean and anonymize customer data

We just had a day off after our great performance in the scraping project. However, when we checked our mails in the morning, we noticed another request. Our colleague from the sales department has some strings which contain expressions in parentheses. For privacy reasons, the terms within the parentheses must be removed. Okay, we have learned a lot already, so let's go:

string_df = '500€ (Mr. Data) Product Category 1'
re.findall(r"\([a-zA-Z .]*\)", string_df)

Output:

['(Mr. Data)']

Yeah, perfect. This was easy. We just checked for a string which starts and ends with parentheses and, in between, allowed a character set which contains letters, spaces and a dot. When we started to write a mail with our great results, we noticed that we don't need to search for this term. Instead we have to remove it. We go a step backwards and introduce a new function, sub. sub takes three arguments. The first argument is the expression we are looking for, the second argument is the term we want to use as the replacement and the last argument is the string we want to work on:

re.sub(r"\([a-zA-Z .]*\)", "-XY-", string_df)

We replaced the name with the expression “-XY-”. Now the sales department can send the data to our dashboard supplier. Another happy end.
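The call above has no output listed, so here is a small sketch that prints the result. The variable name anonymized is made up for this example.

```python
import re

string_df = '500€ (Mr. Data) Product Category 1'

# Replace every parenthesised run of letters, spaces and dots with a placeholder.
anonymized = re.sub(r"\([a-zA-Z .]*\)", "-XY-", string_df)
print(anonymized)  # 500€ -XY- Product Category 1
```

Note that re.sub returns a new string and leaves string_df unchanged.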

Scenario 3: Split strings in several parts

We start to feel comfortable with regular expressions. Calling ourselves experts would be overselling, but we found a solution for every problem, didn't we? After helping out our colleagues in the other departments, we focus on our own work. We received new data for an exploratory data analysis. As always, we look at the first row of our DataFrame. The second column looks strange:

strange_cell = '48 Mr.Data 57'

From our previous experience, we conclude that the first number stands for the number of units which were bought by Mr. Data. The second number tells us the customer id. We need this information in three separate columns, but how? Regex provides another function called split:

re.split(" ", strange_cell)

Output:

['48', 'Mr.Data', '57']

This is exactly what we were looking for. With the number of happy ends, you could think you are watching a Disney movie, right?
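As a sketch of how the three parts could be turned into separate fields, the list can be unpacked and the numbers converted to integers. The field names units, customer and customer_id are assumptions made for this example. Splitting on r"\s+" instead of a single space is a small variation that also tolerates doubled spaces or tabs.

```python
import re

strange_cell = '48 Mr.Data 57'

# r"\s+" splits on any run of whitespace, so repeated spaces
# do not produce empty strings in the result.
units, customer, customer_id = re.split(r"\s+", strange_cell)

record = {
    "units": int(units),
    "customer": customer,
    "customer_id": int(customer_id),
}
print(record)
```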

A short digression

After a long successful day we received another mail from the marketing department, in which they ask for a modification of our first task: they just need the first phone number in the string because this is the business phone number:

line = "This is a phone number: (061) - 535555. And this is another: 01 - 4657289, 9846012345"

With the findall function we get every match. An easy workaround would be to return only the first element of the list:

foundNum = re.findall(r"\(*\d[- \d()]*\d", line)[0]
print("The phone numbers found are: ", foundNum)

Output:

The phone numbers found are: (061) - 535555

Another, more elegant solution would be to use the fourth regex function, search, which returns the first match in a string:

foundNum = re.search(r"\(*\d[- \d()]*\d", line)
print("The phone numbers found are: ", foundNum)

Output:

The phone numbers found are: <re.Match object; span=(24, 38), match='(061) - 535555'>
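The printed value is a match object rather than the phone number itself. A minimal sketch of getting the matched text out of it, assuming the same pattern and a sample string with plain hyphens (the name first_number is chosen for this example):

```python
import re

line = "This is a phone number: (061) - 535555. And this is another: 01 - 4657289, 9846012345"

match = re.search(r"\(*\d[- \d()]*\d", line)

# re.search returns None when nothing matches, so check before calling group().
first_number = match.group() if match else None
print("The first phone number is:", first_number)
```

The match object also offers start(), end() and span() if the position in the string is needed.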

Conclusion

I hope you enjoyed reading and understand why regex is so important for data scientists. It's a powerful method to modify strings in the process of data cleaning and data transformation. If you work with data, you will need skills to handle and transform strings. Regex is the perfect solution.

Source: https://towardsdatascience.com/python-regex-basics-in-5-minutes-b28c0df8d51d

Python Regex – Get List of all Numbers from String

To get the list of all numbers in a String, use the regular expression ‘[0-9]+’ with re.findall() method. [0-9] represents a regular expression to match a single digit in the string. [0-9]+ represents continuous digit sequences of any length.

In the call re.findall('[0-9]+', str), str is the string in which we need to find the numbers. re.findall() returns a list of the strings that match the regular expression.

Example 1: Get the list of all numbers in a String

In the following example, we will take a string and find all the numbers present in it.
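A minimal sketch of such a program is shown here. The sample string and the variable names text and numbers are made up for illustration and are not taken from the original tutorial.

```python
import re

# Hypothetical sample text.
text = "We have 12 apples, 7 pears and 305 grapes."

# Each match is a run of one or more digits, returned as a string.
numbers = re.findall('[0-9]+', text)
print(numbers)  # ['12', '7', '305']
```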


Example 2: Get the list of all continuous digits in a String

In the following example, we will take a string and find all the continuous digit sequences present in it.
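To make the difference between [0-9] and [0-9]+ visible, the sketch below runs both on the same string and then converts the digit runs to integers. The sample string and all variable names are assumptions for this example.

```python
import re

# Hypothetical sample text.
text = "Order 66 shipped 1024 units in 3 boxes."

single_digits = re.findall('[0-9]', text)   # one list entry per digit
digit_runs = re.findall('[0-9]+', text)     # one list entry per continuous run

# findall returns strings; convert them if numeric values are needed.
as_integers = [int(run) for run in digit_runs]

print(single_digits)
print(digit_runs)
print(as_integers)
```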


Summary

In this tutorial of Python Examples, we learned how to get all the numbers from a string as a list, using Python regular expressions, with the help of example programs.

Source: https://pythonexamples.org/python-regex-extract-find-all-the-numbers-in-string/

Find all the numbers in a string using regular expression in Python

Given a string str containing numbers and letters, the task is to find all the numbers in str using a regular expression.

Examples:


Input: abcd11gdf15hnnn678hh4
Output: 11 15 678 4

Input: 1abcd133hhe0
Output: 1 133 0


Approach: The idea is to use the Python re library to extract the substrings of the given string which match the pattern [0-9]+. The character set [0-9] matches any single digit from 0 to 9, and the + sign means one or more consecutive occurrences, so each match is a complete run of digits.

Below is the implementation of the above approach:
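A minimal sketch of this approach, run on the two inputs listed above, could look as follows. The helper name get_numbers is chosen here for illustration.

```python
import re

def get_numbers(text):
    """Return all runs of consecutive digits in text as a list of strings."""
    return re.findall(r'[0-9]+', text)

for sample in ("abcd11gdf15hnnn678hh4", "1abcd133hhe0"):
    # Print the numbers separated by spaces, as in the expected output.
    print(*get_numbers(sample))
```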

 

 

Source: https://www.geeksforgeeks.org/find-all-the-numbers-in-a-string-using-regular-expression-in-python/
[5 Minute Tutorial] Regular Expressions (Regex) in Python

re — Regular expression operations

Regular Expression Examples¶

Checking for a Pair¶

In this example, we’ll use the following helper function to display match objects a little more gracefully:

def displaymatch(match):
    if match is None:
        return None
    return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player’s hand is represented as a 5-character string with each character representing a card, “a” for ace, “k” for king, “q” for queen, “j” for jack, “t” for 10, and “2” through “9” representing the card with that value.

To see if a given string is a valid hand, one could do the following:

>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
>>> displaymatch(valid.match("akt5q"))  # Valid.
"<Match: 'akt5q', groups=()>"
>>> displaymatch(valid.match("akt5e"))  # Invalid.
>>> displaymatch(valid.match("akt"))  # Invalid.
>>> displaymatch(valid.match("727ak"))  # Valid.
"<Match: '727ak', groups=()>"

That last hand, "727ak", contained a pair, or two of the same valued cards. To match this with a regular expression, one could use backreferences as such:

>>> pair = re.compile(r".*(.).*\1")
>>> displaymatch(pair.match("717ak"))  # Pair of 7s.
"<Match: '717', groups=('7',)>"
>>> displaymatch(pair.match("718ak"))  # No pairs.
>>> displaymatch(pair.match("354aa"))  # Pair of aces.
"<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the group() method of the match object in the following manner:

>>> pair = re.compile(r".*(.).*\1")
>>> pair.match("717ak").group(1)
'7'

# Error because re.match() returns None, which doesn't have a group() method:
>>> pair.match("718ak").group(1)
Traceback (most recent call last):
  File "<pyshell#23>", line 1, in <module>
    re.match(r".*(.).*\1", "718ak").group(1)
AttributeError: 'NoneType' object has no attribute 'group'

>>> pair.match("354aa").group(1)
'a'

Simulating scanf()¶

Python does not currently have an equivalent to scanf(). Regular expressions are generally more powerful, though also more verbose, than scanf() format strings. The table below offers some more-or-less equivalent mappings between scanf() format tokens and regular expressions.

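Token             Regular Expression
%c                .
%5c               .{5}
%d                [-+]?\d+
%e, %E, %f, %g    [-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?
%i                [-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)
%o                [-+]?[0-7]+
%s                \S+
%u                \d+
%x, %X            [-+]?(0[xX])?[\dA-Fa-f]+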

To extract the filename and numbers from a string like

/usr/sbin/sendmail-0errors,4warnings

you would use a format like

%s-%derrors,%dwarnings

The equivalent regular expression would be

(\S+)-(\d+)errors,(\d+)warnings
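As a sketch of how that regular expression could be applied, the snippet below runs it on the string exactly as written above and converts the two numeric groups to integers. The variable names filename, error_count and warning_count are assumptions for this example.

```python
import re

text = "/usr/sbin/sendmail-0errors,4warnings"

match = re.search(r"(\S+)-(\d+)errors,(\d+)warnings", text)
if match:
    # Groups are always strings; %d-style fields need an explicit int().
    filename = match.group(1)
    error_count = int(match.group(2))
    warning_count = int(match.group(3))
    print(filename, error_count, warning_count)
```

Unlike scanf(), the conversion to a number is a separate step, since a regular expression only extracts text.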

search() vs. match()¶

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).

For example:

>>> re.match("c", "abcdef")    # No match
>>> re.search("c", "abcdef")   # Match
<re.Match object; span=(2, 3), match='c'>

Regular expressions beginning with '^' can be used with search() to restrict the match at the beginning of the string:

>>> re.match("c", "abcdef")    # No match
>>> re.search("^c", "abcdef")  # No match
>>> re.search("^a", "abcdef")  # Match
<re.Match object; span=(0, 1), match='a'>

Note however that in MULTILINE mode match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with '^' will match at the beginning of each line.

>>> re.match('X', 'A\nB\nX', re.MULTILINE)  # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE)  # Match
<re.Match object; span=(4, 5), match='X'>

Making a Phonebook¶

split() splits a string into a list delimited by the passed pattern. The method is invaluable for converting textual data into data structures that can be easily read and modified by Python as demonstrated in the following example that creates a phonebook.

First, here is the input. Normally it may come from a file, here we are using triple-quoted string syntax

>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
...
... Ronald Heathmore: 892.345.3428 436 Finley Avenue
... Frank Burger: 925.541.7625 662 South Dogwood Way
...
...
... Heather Albrecht: 548.326.4584 919 Park Place"""

The entries are separated by one or more newlines. Now we convert the string into a list with each nonempty line having its own entry:

>>> entries = re.split("\n+", text)
>>> entries
['Ross McFluff: 834.345.1254 155 Elm Street',
'Ronald Heathmore: 892.345.3428 436 Finley Avenue',
'Frank Burger: 925.541.7625 662 South Dogwood Way',
'Heather Albrecht: 548.326.4584 919 Park Place']

Finally, split each entry into a list with first name, last name, telephone number, and address. We use the maxsplit parameter of split() because the address has spaces, our splitting pattern, in it:

>>> [re.split(":? ", entry, maxsplit=3) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way'],
['Heather', 'Albrecht', '548.326.4584', '919 Park Place']]

The :? pattern matches the colon after the last name, so that it does not occur in the result list. With a maxsplit of 4, we could separate the house number from the street name:

>>> [re.split(":? ", entry, maxsplit=4) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way'],
['Heather', 'Albrecht', '548.326.4584', '919', 'Park Place']]

Text Munging¶

sub() replaces every occurrence of a pattern with a string or the result of a function. This example demonstrates using sub() with a function to “munge” text, or randomize the order of all the characters in each word of a sentence except for the first and last characters:

>>> import random
>>> def repl(m):
...     inner_word = list(m.group(2))
...     random.shuffle(inner_word)
...     return m.group(1) + "".join(inner_word) + m.group(3)
...
>>> text = "Professor Abdolmalek, please report your absences promptly."
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Poefsrosr Aealmlobdk, pslaee reorpt your abnseces plmrptoy.'
>>> re.sub(r"(\w)(\w+)(\w)", repl, text)
'Pofsroser Aodlambelk, plasee reoprt yuor asnebces potlmrpy.'

Finding all Adverbs¶

findall() matches all occurrences of a pattern, not just the first one as search() does. For example, if a writer wanted to find all of the adverbs in some text, they might use findall() in the following manner:

>>> text = "He was carefully disguised but captured quickly by police."
>>> re.findall(r"\w+ly\b", text)
['carefully', 'quickly']

Finding all Adverbs and their Positions¶

If one wants more information about all matches of a pattern than the matched text, finditer() is useful as it provides match objects instead of strings. Continuing with the previous example, if a writer wanted to find all of the adverbs and their positions in some text, they would use finditer() in the following manner:

>>> text = "He was carefully disguised but captured quickly by police."
>>> for m in re.finditer(r"\w+ly\b", text):
...     print('%02d-%02d: %s' % (m.start(), m.end(), m.group(0)))
07-16: carefully
40-47: quickly

Raw String Notation¶

Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. For example, the two following lines of code are functionally identical:

>>> re.match(r"\W(.)\1\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>
>>> re.match("\\W(.)\\1\\W", " ff ")
<re.Match object; span=(0, 4), match=' ff '>

When one wants to match a literal backslash, it must be escaped in the regular expression. With raw string notation, this means r"\\". Without raw string notation, one must use "\\\\", making the following lines of code functionally identical:

>>> re.match(r"\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>
>>> re.match("\\\\", r"\\")
<re.Match object; span=(0, 1), match='\\'>

Writing a Tokenizer¶

A tokenizer or scanner analyzes a string to categorize groups of characters. This is a useful first step in writing a compiler or interpreter.

The text categories are specified with regular expressions. The technique is to combine those into a single master regular expression and to loop over successive matches:

from typing import NamedTuple
import re

class Token(NamedTuple):
    type: str
    value: str
    line: int
    column: int

def tokenize(code):
    keywords = {'IF', 'THEN', 'ENDIF', 'FOR', 'NEXT', 'GOSUB', 'RETURN'}
    token_specification = [
        ('NUMBER',   r'\d+(\.\d*)?'),  # Integer or decimal number
        ('ASSIGN',   r':='),           # Assignment operator
        ('END',      r';'),            # Statement terminator
        ('ID',       r'[A-Za-z]+'),    # Identifiers
        ('OP',       r'[+\-*/]'),      # Arithmetic operators
        ('NEWLINE',  r'\n'),           # Line endings
        ('SKIP',     r'[ \t]+'),       # Skip over spaces and tabs
        ('MISMATCH', r'.'),            # Any other character
    ]
    tok_regex = '|'.join('(?P<%s>%s)' % pair for pair in token_specification)
    line_num = 1
    line_start = 0
    for mo in re.finditer(tok_regex, code):
        kind = mo.lastgroup
        value = mo.group()
        column = mo.start() - line_start
        if kind == 'NUMBER':
            value = float(value) if '.' in value else int(value)
        elif kind == 'ID' and value in keywords:
            kind = value
        elif kind == 'NEWLINE':
            line_start = mo.end()
            line_num += 1
            continue
        elif kind == 'SKIP':
            continue
        elif kind == 'MISMATCH':
            raise RuntimeError(f'{value!r} unexpected on line {line_num}')
        yield Token(kind, value, line_num, column)

statements = '''
    IF quantity THEN
        total := total + price * quantity;
        tax := price * 0.05;
    ENDIF;
'''

for token in tokenize(statements):
    print(token)

The tokenizer produces the following output:

Token(type='IF', value='IF', line=2, column=4)
Token(type='ID', value='quantity', line=2, column=7)
Token(type='THEN', value='THEN', line=2, column=16)
Token(type='ID', value='total', line=3, column=8)
Token(type='ASSIGN', value=':=', line=3, column=14)
Token(type='ID', value='total', line=3, column=17)
Token(type='OP', value='+', line=3, column=23)
Token(type='ID', value='price', line=3, column=25)
Token(type='OP', value='*', line=3, column=31)
Token(type='ID', value='quantity', line=3, column=33)
Token(type='END', value=';', line=3, column=41)
Token(type='ID', value='tax', line=4, column=8)
Token(type='ASSIGN', value=':=', line=4, column=12)
Token(type='ID', value='price', line=4, column=15)
Token(type='OP', value='*', line=4, column=21)
Token(type='NUMBER', value=0.05, line=4, column=23)
Token(type='END', value=';', line=4, column=27)
Token(type='ENDIF', value='ENDIF', line=5, column=4)
Token(type='END', value=';', line=5, column=9)

Source: https://docs.python.org/3/library/re.html

Digit python regex

In this tutorial, you will learn about regular expressions, called RegExes (RegEx) for short, and use Python's re module to work with them. RegEx is incredibly useful, so it pays to get your head around it early. Regular expressions are a standard tool for data cleaning and wrangling in Python. Be it extraction of specific parts of text from web pages, making sense of Twitter data or preparing your data for text mining, regular expressions are your best bet for all these tasks.

What is a regular expression in Python?

You may be familiar with searching for text using the shortcut Ctrl + F and entering the text you are looking for. Regular expressions go one step further: they allow you to specify a pattern of text to search for. Essentially, a RegEx is a sequence of characters that defines a search pattern. Knowing regular expressions can mean the difference between solving a problem in 3 steps and solving it in 3,000 steps.

For example, you may need to find in some text a phone number that you don't know, but if you live in the USA or Canada, you know it will be three digits, followed by a hyphen, then another three digits followed by a hyphen and then four more digits. Humans are good at recognising patterns, so you will know that 415-555-3456 is a phone number, but 6789,78564,67708879 is not.

Regular expressions are supported by most of the programming languages like Python, Perl, R, Java and many others. In this post, you’ll explore regular expressions in Python only.

How do you use regular expressions in Python?

If you don't know how to use regexes and you want to find a phone number in a string, you will have to write a relatively complex function by hand, which takes more effort than a regular expression. I hope that by now I have managed to convince you to learn regex and save yourself a ton of time.

Regular expressions are descriptions for a pattern of text. For instance, a \d in a regex stands for a digit character, that is, any single numeral 0 to 9. The regex \d\d\d-\d\d\d-\d\d\d\d is used by Python to match a string of three numbers, a hyphen, three more numbers, another hyphen, and four numbers. Anything else would not match the regex.

{} - Braces

The regular expression for the same pattern can also be written as \d{3}-\d{3}-\d{4}. Adding a 3 in curly brackets after a pattern is like saying, "Match this pattern three times." So \d\d\d-\d\d\d-\d\d\d\d and \d{3}-\d{3}-\d{4} will find the same pattern, the phone number format.

Consider this code: {n,m}. This means at least n, and at most m repetitions of the pattern left to it. For example, the RegEx \d{2,4} matches at least two digits but not more than four digits.
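The sketch below checks these claims on the phone number from the earlier paragraph. The variable names and the test string for {2,4} are made up for this example.

```python
import re

short_form = re.compile(r"\d{3}-\d{3}-\d{4}")
long_form = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")

sentence = "Call 415-555-3456 today"
short_hit = short_form.search(sentence).group()
long_hit = long_form.search(sentence).group()

# A string of digits and commas in the wrong layout does not match.
no_hit = short_form.search("6789,78564,67708879")

# {2,4}: each match is at least two and at most four digits long.
two_to_four = re.findall(r"\d{2,4}", "1 22 333 4444 55555")

print(short_hit, long_hit, no_hit, two_to_four)
```

In the last call, the five-digit run contributes only its first four digits, because the quantifier stops at four and the single digit left over is too short to match.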

Character Classes

In the phone number regex example, you learned that \d could stand for any numeric digit. There are many such shorthand character classes, as shown below.

\d - Matches any decimal digit. Equivalent to any single numeral 0 to 9.

\D - Matches any character that is not a numeric digit from 0 to 9.

\s - Matches any whitespace character. Equivalent to any space, tab, or newline character. (Think of this as matching "space" characters.)

\S - Matches any character that is not a space, tab or newline.

\w - Matches any alphanumeric character (digits and letters), or the underscore character. Equivalent to [a-zA-Z0-9_].

\W - Matches any non-alphanumeric character. Any character that is not a letter, number, or the underscore character.

\Z - Matches if the specified characters are at the end of a string. The expression Python\Z will match the text "I love Python" but would not match "I like Python Programming".
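A short sketch that tries the shorthand classes \d, \s, \w, \W and the end-of-string anchor \Z on one sample string. The sample text and variable names are assumptions for this example.

```python
import re

sample = "Room_7 costs $42 per night"

digits = re.findall(r"\d", sample)       # every single digit
spaces = re.findall(r"\s", sample)       # every whitespace character
words = re.findall(r"\w+", sample)       # runs of letters, digits or underscore
non_words = re.findall(r"\W", sample)    # everything \w does not match

# \Z anchors the pattern to the very end of the string.
ends_with_python = re.search(r"Python\Z", "I love Python") is not None
not_at_end = re.search(r"Python\Z", "I like Python Programming") is None

print(digits, len(spaces), words, non_words, ends_with_python, not_at_end)
```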

Square brackets - make your own character classes

From time to time, you will want to match a set of characters, but you will find that the shorthand character classes (\d, \w, \s, and so on) are too broad. In such a case, you can define your own character class using square brackets. As an illustration, the character class [aeiou] will match any lowercase vowel.

[] - Square brackets specify a set of characters you wish to match.
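As a small illustration of a hand-made character class, the sketch below collects the lowercase vowels of a sample phrase. The phrase and the variable name vowels are chosen for this example.

```python
import re

# [aeiou] matches exactly one lowercase vowel per match.
vowels = re.findall(r"[aeiou]", "Regular expressions")
print(vowels)       # ['e', 'u', 'a', 'e', 'e', 'i', 'o']
print(len(vowels))  # 7
```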

MetaCharacters

To define regular expressions, metacharacters are used. Metacharacters are characters that are interpreted in a special way by a RegEx engine. Here's a list of metacharacters: [].^$*+?{}()\|

Period/Dot (.) - A period matches any single character (except the newline '\n').

Caret (^) - The caret symbol is used to check if a string starts with a certain character.

Dollar Symbol ($) - The dollar symbol is used to check if a string ends with a certain character.

Star (*) - The star symbol matches zero or more occurrences of the pattern left to it.

Plus (+) - The plus symbol matches one or more occurrences of the pattern left to it.

Question mark (?) - The question mark symbol matches zero or one occurrence of the pattern left to it.

Vertical bar (|) - The vertical bar is used for alternation (the "or" operator).

Parentheses (()) - Parentheses are used to group sub-patterns. For example, (a|b|c)xz matches any string that contains either a or b or c followed by xz.

Backslash (\) - The backslash is used to escape various characters including all metacharacters. For example, \$a matches if a string contains $ followed by a. Here, $ is not specially interpreted by a RegEx engine. If you are unsure whether a character has a special meaning or not, you can put \ in front of it. This makes sure the character is not treated specially.
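The behaviour of each metacharacter can be checked with one short call per symbol. This is a sketch on made-up sample strings; all variable names are chosen for illustration.

```python
import re

dot = re.findall(r"c.t", "cat cot c\nt")           # . does not match the newline
starts = re.search(r"^Hello", "Hello world") is not None
ends = re.search(r"world$", "Hello world") is not None
star = re.findall(r"ab*", "a ab abbb")             # zero or more b
plus = re.findall(r"ab+", "a ab abbb")             # one or more b
optional = re.findall(r"colou?r", "color colour")  # zero or one u
either = re.findall(r"cat|dog", "cat dog bird")    # alternation

# finditer + group() returns the whole match, not only the parenthesised part.
grouped = [m.group() for m in re.finditer(r"(a|b|c)xz", "axz bxz dxz")]

# \$ matches a literal dollar sign instead of "end of string".
escaped = re.findall(r"\$a", "$a and a$")

print(dot, starts, ends, star, plus, optional, either, grouped, escaped)
```

For escaping whole strings that come from user input, the re module also provides re.escape(), which puts a backslash in front of every character that could be special.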

How to escape MetaCharacters in Regex using Python

If you need to define a simple pattern like we did with the phone number example, then you don't need to worry about escaping metacharacters. Remember that the underscore character is considered an alphanumeric character (like digits and letters) by Regex.

However, if you need to define a slightly more complex pattern that includes one or more metacharacters, then you need to know how to escape such characters in Python. This can be done by using the backslash \. The string value '\n' represents a single newline character, not a backslash followed by a lowercase n. You need to enter the escape sequence \\ to get a single backslash. So '\\n' is the string that represents a backslash followed by a lowercase n.

Alternatively, you can put an r before the first quote of the string value to mark it as a raw string, which does not escape characters. Since regexes frequently use backslashes and other metacharacters, it is convenient to pass raw strings to the re.compile() function instead of typing extra backslashes. Entering r'\d\d\d-\d\d\d-\d\d\d\d' is easier than typing '\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d'.
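The difference between a normal string, an escaped string and a raw string can be seen by comparing their lengths. The sketch below also confirms that the raw and the doubled-backslash spelling of the phone number pattern are the same string; the variable names are chosen for this example.

```python
newline_length = len("\n")    # one character: a newline
escaped_length = len("\\n")   # two characters: a backslash and the letter n
raw_length = len(r"\n")       # two characters as well

raw_pattern = r"\d\d\d-\d\d\d-\d\d\d\d"
escaped_pattern = "\\d\\d\\d-\\d\\d\\d-\\d\\d\\d\\d"
same_pattern = raw_pattern == escaped_pattern

print(newline_length, escaped_length, raw_length, same_pattern)
```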

Python RegEx - re module

Python has a module named re to work with regular expressions. You can find all the regex functions in Python in the re module. To use it, we need to import the module with import re.

Passing a string value representing your Regex to re.compile() returns a Regex object.

The most common uses of regular expressions are:

  • Search a string (search and match)
  • Find all matches in a string (findall)
  • Break a string into substrings (split)
  • Replace part of a string (sub)

Let’s look at the methods that library “re” provides to perform these tasks.

  1. re.compile() - Compiles a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods, described below. The expression's behaviour can be modified by specifying a flags value. Values can be any of the flag constants of the module, combined using bitwise OR (the | operator).
  2. re.match() - If zero or more characters at the beginning of the string match the regular expression pattern, returns a corresponding match object. It returns None if the string does not match the pattern.
  3. re.search() - Scans the provided string looking for the first location where the pattern matches. If a match is found, re.search() returns a match object. Otherwise, it returns None.
  4. re.findall() - Returns a list of strings containing all matches.
  5. re.split() - Splits the string at the occurrences of the pattern and returns a list of the resulting strings. If capturing parentheses are used in the pattern, then the text of all groups in the pattern is also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.
  6. re.sub() - Returns a string where the matched occurrences are replaced with the given replacement.
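The six functions can be tried side by side on one string. This is a minimal sketch; the sample text and all variable names are assumptions for this example.

```python
import re

text = "Order 12 shipped, order 345 pending"

number = re.compile(r"\d+")                  # 1. compile once, reuse below

at_start = number.match(text)                # 2. None: text does not start with a digit
first = number.search(text)                  # 3. first match anywhere in the string
all_numbers = number.findall(text)           # 4. every match as a list of strings
parts = re.split(r",\s*", text)              # 5. split at a comma plus optional spaces
masked = number.sub("#", text)               # 6. replace every match

# Flags are combined with the | operator.
orders = re.compile(r"order", re.IGNORECASE | re.MULTILINE).findall(text)

print(at_start, first.group(), all_numbers, parts, masked, orders)
```

The module-level functions such as re.findall(pattern, text) and the methods of a compiled pattern object behave the same way; compiling mainly helps when one pattern is reused many times.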

In this article, I did not cover all of the functions, constants, and the exception that the re module provides, but I will provide detailed walkthrough tutorials later in the Regex series of tutorials. If you want to learn more about the re module, check out its documentation.

Source: https://re-thought.com/python-regular-expressions/